Large language models are powerful, but they have a critical weakness: their knowledge is frozen at training time. When users ask about recent events, private documents, or domain-specific knowledge, models often give outdated or incorrect answers. Worse, they can "hallucinate", confidently producing plausible-sounding but non-existent information.
Retrieval-Augmented Generation (RAG) technology solves this with a simple yet effective approach: before generating an answer, first retrieve relevant information from an external knowledge base, then input the retrieved documents together with the user query into the generative model. This way, the model generates answers based on real external knowledge rather than relying solely on training-time memories.
However, building an efficient RAG system is far from simple. Vector database selection determines retrieval speed and scalability; Embedding model quality directly affects retrieval precision; retrieval strategies (dense, sparse, hybrid) must be carefully designed based on data characteristics; reranking techniques further improve result quality; query rewriting and expansion significantly enhance retrieval effectiveness. This article dives deep into each component of RAG systems, from principles to implementation, from optimization to deployment, helping readers build production-grade RAG applications.
RAG Fundamentals and Architecture
Core Concept of RAG
RAG's core idea can be summarized in a simple formula: RAG = Retrieval + Augmented Generation. Specifically, the system first retrieves document chunks relevant to the query from a large-scale knowledge base, then feeds these documents as "context" into the generative model, enabling the model to generate answers grounded in real external knowledge.
Mathematical Representation:

RAG decomposes the generation process into two steps: retrieval and generation. For query $x$, the retriever $p_\eta$ first selects the top-$k$ documents $z$; the generator $p_\theta$ then produces the answer $y$ conditioned on the query and each retrieved document, marginalizing over the retrieved set:

$$P(y \mid x) = \sum_{z \in \text{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$
RAG's advantage lies in separating "memory" and "reasoning": knowledge is stored in vector databases and can be updated anytime; reasoning capability is provided by generative models. Combining both ensures knowledge timeliness while avoiding model retraining. More importantly, RAG provides interpretability: every answer can be traced back to specific source documents, which is crucial for production applications.
RAG Architecture Flow
A typical RAG system includes the following steps:
- Document Processing: Split, vectorize, and store original documents in a vector database
- Query Processing: Convert user queries to vector representations
- Retrieval: Retrieve relevant documents from the vector database
- Reranking: Fine-rank retrieval results
- Generation: Input retrieved documents and query into the generative model
```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# 1. Document processing: load, split, vectorize, and store
documents = TextLoader("knowledge_base.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 2-3. Query processing and retrieval (reranking and generation follow below)
relevant_docs = vectorstore.similarity_search("What is RAG?", k=5)
```
RAG vs Fine-tuning
| Method | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| RAG | Dynamic knowledge updates, low cost, high interpretability | Depends on retrieval quality, may have hallucinations | Frequently updated knowledge bases, multi-domain knowledge |
| Fine-tuning | Model fully adapts to task, potentially better performance | High cost, difficult to update, may forget | Specific tasks, relatively stable knowledge |
Vector Database Selection
Vector databases are core components of RAG systems, responsible for storing and retrieving document vectors. Different vector databases have different characteristics and use cases.
FAISS (Facebook AI Similarity Search)
FAISS is Facebook's open-source vector similarity search library, supporting both CPU and GPU acceleration.
Characteristics:
- High performance: Supports multiple index algorithms (IVF, HNSW, LSH)
- Memory efficient: Supports memory mapping and quantization
- Easy integration: Simple Python API

Use Cases:
- Small to medium-scale datasets (millions of vectors)
- Rapid prototyping
- Local deployment
```python
import numpy as np
import faiss

dim = 768
index = faiss.IndexFlatIP(dim)              # exact inner-product index
vectors = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(vectors)                 # normalized IP == cosine similarity
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 most similar vectors

# For larger collections, switch to an approximate index such as HNSW:
# index = faiss.IndexHNSWFlat(dim, 32)
```
Milvus
Milvus is a cloud-native vector database supporting distributed deployment and horizontal scaling.
Characteristics:
- Distributed architecture: Supports cluster deployment
- High availability: Supports data replication and fault recovery
- Rich features: Supports scalar filtering, time series, etc.

Use Cases:
- Large-scale datasets (tens of millions+)
- Production deployment
- Need for high availability and scalability
```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("documents", CollectionSchema(fields))
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "IP",
                  "params": {"M": 16, "efConstruction": 200}},
)
collection.load()  # load into memory before searching
```
Pinecone
Pinecone is a fully managed vector database service requiring no infrastructure management.
Characteristics:
- Fully managed: No need to manage servers
- Auto-scaling: Automatically adjusts based on load
- Simple to use: RESTful API

Use Cases:
- Rapid deployment
- Small to medium-scale applications
- Teams that don't want to manage infrastructure
```python
import pinecone

# Classic pinecone-client API; newer client versions use a different interface
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("documents", dimension=768, metric="cosine")

index = pinecone.Index("documents")
index.upsert([("doc-1", [0.1] * 768, {"source": "manual"})])
results = index.query(vector=[0.1] * 768, top_k=5, include_metadata=True)
```
Chroma
Chroma is a lightweight vector database focused on ease of use and developer experience.
Characteristics:
- Lightweight: Low resource usage
- Ease of use: Clean API design
- Flexibility: Supports multiple deployment methods

Use Cases:
- Development and testing
- Small-scale applications
- Rapid prototyping
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")
collection.add(
    documents=["RAG combines retrieval with generation."],
    ids=["doc-1"],
)
results = collection.query(query_texts=["What is RAG?"], n_results=1)
```
Vector Database Comparison
| Database | Scale | Deployment | Characteristics | Use Cases |
|---|---|---|---|---|
| FAISS | Millions | Local | High performance, easy to use | Development, small-medium scale |
| Milvus | Tens of millions+ | Distributed | Scalable, high availability | Production, large scale |
| Pinecone | Millions | Managed | Simple, no ops needed | Rapid deployment |
| Chroma | Hundreds of thousands | Local/Cloud | Lightweight, easy to use | Development, small scale |
Embedding Model Comparison
The quality of Embedding models directly affects retrieval performance. Different models have different characteristics and use cases.
General Embedding Models
OpenAI text-embedding-ada-002:
- Dimension: 1536
- Advantages: Excellent performance, multilingual support
- Disadvantages: Requires API calls, has cost

sentence-transformers:
- Open-source model collection
- Advantages: Free, can deploy locally, good performance
- Common models:
  - all-MiniLM-L6-v2: Fast, lightweight
  - all-mpnet-base-v2: Better performance
  - multi-qa-mpnet-base: Optimized for Q&A
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["RAG retrieves external knowledge.", "Fine-tuning updates model weights."]
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("How does RAG access knowledge?", normalize_embeddings=True)
similarities = util.cos_sim(query_embedding, doc_embeddings)  # cosine per document
```
Domain-Specific Embeddings
For specific domains, Embedding models can be fine-tuned using domain data.
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# (query, relevant passage) pairs collected from the target domain
train_examples = [
    InputExample(texts=["domain query A", "relevant domain passage A"]),
    InputExample(texts=["domain query B", "relevant domain passage B"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```
Embedding Model Selection Guide
| Model Type | Dimension | Speed | Accuracy | Use Cases |
|---|---|---|---|---|
| text-embedding-ada-002 | 1536 | Medium | High | Production, multilingual |
| all-MiniLM-L6-v2 | 384 | Fast | Medium | Rapid prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Medium | High | Balance performance and speed |
| multi-qa-mpnet-base | 768 | Medium | High | Q&A tasks |
Retrieval Strategy Optimization
Retrieval is the core component of RAG systems, and retrieval quality directly affects final answer accuracy. Different retrieval strategies have different advantages and use cases, and understanding their differences is crucial for building efficient RAG systems.
Dense Retrieval
Dense Retrieval converts both queries and documents into high-dimensional vectors through Embedding models, then computes vector similarity (typically cosine similarity) to retrieve relevant documents. This is currently the most mainstream retrieval method.
How It Works:
The core assumption of dense retrieval is: semantically similar texts should be close in vector space. Through Embedding models trained on large-scale text pairs (e.g., sentence-transformers), the system encodes semantic information into vector representations. During retrieval, it computes similarity between query vectors and all document vectors, selecting the top-k documents with highest similarity.
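The top-k similarity computation described above can be sketched in a few lines of NumPy; the 2-dimensional toy vectors below are purely illustrative:

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query and return top-k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(-sims)[:k], sims

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
top_idx, sims = top_k_cosine(np.array([1.0, 0.2]), docs, k=2)
```

In production the document vectors come from an Embedding model and live in a vector index; the ranking logic is the same.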
Advantages:
- Strong semantic understanding: Can understand synonyms, near-synonyms, and semantically similar concepts
- Cross-lingual capability: Multilingual Embedding models support cross-language retrieval
- Simple implementation: Only requires vector similarity computation, no complex feature engineering

Disadvantages:
- Less sensitive to exact keyword matching: If queries contain specific terms (e.g., product names, code identifiers), exact-match documents may be missed
- Computational cost: Requires computing Embeddings for all documents, costly at scale
- Domain adaptability: General Embedding models may underperform in specific domains
```python
def dense_retrieval(query_embedding, vectorstore, top_k=5):
    """Return the top-k documents closest to the query vector."""
    return vectorstore.similarity_search_by_vector(query_embedding, k=top_k)
```
Sparse Retrieval
Sparse Retrieval uses keyword matching (e.g., BM25) for retrieval.
Advantages:
- Precise keyword matching
- Sensitive to exact terms
- No Embedding model needed

Disadvantages:
- Cannot understand semantics
- Not sensitive to synonyms
```python
from rank_bm25 import BM25Okapi

corpus = ["RAG retrieves external knowledge", "fine-tuning updates model weights"]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "external knowledge".lower().split()
scores = bm25.get_scores(query_tokens)            # BM25 score for each document
top_docs = bm25.get_top_n(query_tokens, corpus, n=1)
```
Hybrid Retrieval
Dense and Sparse retrieval each have advantages, and hybrid retrieval combines both for optimal results. In practice, hybrid retrieval typically improves retrieval precision by 10-30% compared to single methods.
Why Hybrid Retrieval:
- Dense retrieval excels at semantic understanding but may miss exact matches
- Sparse retrieval excels at keyword matching but cannot understand semantics
- Hybrid retrieval combines both strengths, ensuring semantic relevance while guaranteeing exact matches aren't missed
Fusion Strategies:
RRF (Reciprocal Rank Fusion): The most common fusion method, which merges the rankings produced by the different retrievers. The RRF score of a document $d$ is:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

where $R$ is the set of retrieval methods, $\mathrm{rank}_r(d)$ is the rank of document $d$ in method $r$, and $k$ is a smoothing parameter (typically 60).

Weighted Fusion: Weighted sum of the similarity scores from both retrievers, with weights tuned to the data characteristics (e.g., dense:sparse = 0.7:0.3).
Reranking Fusion: First retrieve top-k candidates with both methods (e.g., 20 each), merge and deduplicate to get candidate set, then use Cross-Encoder reranking model to fine-rank all candidates, selecting final top-k.
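As a concrete illustration, RRF can be implemented in a few lines of plain Python; the document ids and ranked lists below are hypothetical:

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of document ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d1", "d2", "d3"]    # hypothetical dense-retrieval order
sparse_ranking = ["d3", "d1", "d4"]   # hypothetical BM25 order
fused = rrf_fuse([dense_ranking, sparse_ranking])
```

Documents ranked highly by both methods (here "d1" and "d3") float to the top, while documents found by only one method are kept but demoted.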
```python
from rank_bm25 import BM25Okapi

def hybrid_retrieval(query, query_embedding, vectorstore, documents, top_k=5, k=60):
    """Fuse dense and sparse rankings with Reciprocal Rank Fusion."""
    dense = [d.page_content for d in
             vectorstore.similarity_search_by_vector(query_embedding, k=top_k * 2)]
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    sparse = bm25.get_top_n(query.lower().split(), documents, n=top_k * 2)

    scores = {}
    for ranking in (dense, sparse):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```
Reranking Techniques
Reranking fine-ranks initial retrieval results to improve final result quality.
Cross-Encoder Reranking
Cross-Encoder inputs query and document together into the model to compute relevance scores.
Advantages: - High accuracy - Can understand query-document interactions
Disadvantages: - High computational cost (cannot pre-compute) - Slow
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is RAG?"
candidates = ["RAG augments generation with retrieval.", "FAISS is a vector index."]
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```
Multi-Stage Retrieval
Multi-stage retrieval combines fast retrieval with precise reranking:
- Stage 1: Use fast methods (Dense/Sparse) to retrieve many candidates (e.g., 100)
- Stage 2: Use reranking model to fine-rank candidates (e.g., top-5)
```python
from sentence_transformers import CrossEncoder

def multi_stage_retrieval(query, query_embedding, vectorstore, top_k=5):
    # Stage 1: fast dense retrieval of a large candidate set
    candidates = vectorstore.similarity_search_by_vector(query_embedding, k=100)
    # Stage 2: precise Cross-Encoder reranking of the candidates
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```
Query Rewriting and Expansion
Query optimization can improve retrieval performance, including query rewriting, query expansion, and query decomposition.
Query Rewriting
Query rewriting converts user queries into forms more suitable for retrieval.
Methods:
1. Synonym expansion: Add synonyms
2. Query completion: Complete incomplete queries
3. Query simplification: Remove redundant words
```python
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

def rewrite_query(query, llm):
    prompt = (
        "Rewrite the following query so it is clearer and better suited "
        "for document retrieval. Output only the rewritten query.\n"
        f"Query: {query}"
    )
    return llm(prompt).strip()
```
Query Expansion
Query expansion adds related terms and concepts.
```python
def expand_query(query, llm):
    """Append related terms and concepts to the original query."""
    prompt = (
        "List 3 terms or concepts closely related to the query below, "
        f"separated by commas.\nQuery: {query}"
    )
    expansions = llm(prompt).strip()
    return f"{query} {expansions}"
```
Query Decomposition
For complex queries, decompose into multiple sub-queries.
```python
def decompose_query(query, llm):
    """Split a complex query into simpler sub-queries, one per line."""
    prompt = (
        "Decompose the complex query below into simple sub-queries, "
        f"one per line.\nQuery: {query}"
    )
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]
```
Practical: Building Enterprise-Grade RAG Systems
Building RAG with LangChain
LangChain provides a complete RAG toolchain.
```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

documents = DirectoryLoader("./knowledge_base").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,  # trace answers back to their sources
)
result = qa_chain({"query": "What is RAG?"})
```
Building RAG with LlamaIndex
LlamaIndex focuses on the data layer for LLM applications.
```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=512)
documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is RAG?")
```
Advanced RAG Patterns
Parent-Child Retrieval:
- Storage: Split documents into small chunks (child chunks)
- Retrieval: Retrieve child chunks, but return parent chunks (containing more context)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Small child chunks are indexed; larger parent chunks are returned
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
Self-RAG:
- Use the LLM to judge whether retrieval is needed
- Critically evaluate the retrieval results
- Decide whether to use the retrieved information based on that evaluation
```python
def self_rag(query, llm, vectorstore):
    # 1. Judge whether retrieval is needed at all
    need = llm(f"Does answering this require external knowledge? Answer yes or no.\nQuery: {query}")
    if "yes" not in need.lower():
        return llm(query)
    # 2. Retrieve, then critically evaluate the results
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n".join(d.page_content for d in docs)
    relevant = llm(f"Is the context relevant to the query? Answer yes or no.\n"
                   f"Query: {query}\nContext: {context}")
    # 3. Use the retrieved information only if it was judged relevant
    if "yes" in relevant.lower():
        return llm(f"Answer based only on the context.\nContext: {context}\nQuery: {query}")
    return llm(query)
```
❓ Q&A: Common Questions on RAG
Q1: What's the difference between RAG and fine-tuning? When to use RAG?
A:
- RAG: Dynamically retrieves external knowledge; suitable for frequently updated knowledge bases, access to the latest information, and multi-domain knowledge
- Fine-tuning: Encodes knowledge into model parameters; suitable for specific tasks, relatively stable knowledge, and maximum task performance
- Choice: If knowledge needs frequent updates or involves private data, choose RAG; if the task is fixed and performance requirements are high, consider fine-tuning
Q2: How to choose a vector database?
A: The choice depends on:
- Data scale: Millions of vectors, use FAISS; tens of millions+, use Milvus
- Deployment: Local, use FAISS/Chroma; cloud, use Milvus/Pinecone
- Ops capability: If you don't want to manage infrastructure, use Pinecone; with an ops team, use Milvus
- Development stage: Rapid prototyping, use FAISS/Chroma; production, use Milvus
Q3: How to choose Embedding models?
A:
- General scenarios: all-mpnet-base-v2 or text-embedding-ada-002
- Resource-constrained: all-MiniLM-L6-v2 (fast, low dimension)
- Q&A tasks: multi-qa-mpnet-base
- Multilingual: paraphrase-multilingual-mpnet-base-v2
- Domain-specific: Fine-tune general models with domain data
Q4: How to choose between Dense Retrieval and Sparse Retrieval?
A:
- Dense Retrieval: Suitable for semantic understanding, synonym matching, concept retrieval
- Sparse Retrieval: Suitable for exact keyword matching, term retrieval
- Recommendation: Use Hybrid Retrieval, combining the strengths of both
Q5: How to improve retrieval accuracy?
A: Multiple approaches:
1. Optimize Embedding: Use better models or domain fine-tuning
2. Improve splitting strategy: Choose appropriate splitting methods based on document characteristics
3. Use Reranking: Cross-Encoder reranking
4. Query optimization: Query rewriting, expansion, decomposition
5. Multi-stage retrieval: Coarse ranking followed by fine ranking
Q6: What to do when RAG systems have hallucinations?
A:
1. Improve retrieval quality: Ensure retrieved documents are relevant to the query
2. Prompt design: Explicitly require the model to answer based on the retrieved content, and to say "I don't know" when the answer is absent
3. Result verification: Fact-check key information
4. Use Self-RAG: Let the model evaluate the relevance of retrieval results
5. Confidence scoring: Give generated results confidence scores and warn the user when confidence is low
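Point 2 (prompt design) can be made concrete with a grounding prompt template; the template text below is one illustrative phrasing, not a canonical prompt:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def build_grounded_prompt(context, question):
    """Fill the grounding template with retrieved context and the user question."""
    return GROUNDED_PROMPT.format(context=context, question=question)

prompt = build_grounded_prompt("RAG retrieves documents before generating.",
                               "What does RAG do?")
```

Constraining the model to the retrieved context, with an explicit "I don't know" escape hatch, noticeably reduces fabricated answers.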
Q7: How to handle long documents?
A:
1. Parent-Child Retrieval: Retrieve small chunks, return large chunks
2. Sliding window: Include adjacent chunks during retrieval
3. Document summarization: Generate summaries for long documents, then retrieve over the summaries
4. Hierarchical retrieval: Retrieve chapters first, then specific content
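The parent-child idea from point 1 can be sketched without any framework: index small child chunks for precise matching, but return the enclosing parent chunk for context. The chunk sizes, helper names, and keyword matching below are illustrative simplifications:

```python
def build_parent_child(document, parent_size=200, child_size=50):
    """Split a document into parent chunks, and child chunks that remember their parent."""
    parents = [document[i:i + parent_size] for i in range(0, len(document), parent_size)]
    children = []
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append({"text": parent[j:j + child_size], "parent_id": pid})
    return parents, children

def retrieve_parent(query_term, parents, children):
    """Match a child chunk (here by keyword), but return the larger parent chunk."""
    for child in children:
        if query_term in child["text"]:
            return parents[child["parent_id"]]
    return None

document = ("A" * 190) + "needleXYZ" + ("B" * 150)
parents, children = build_parent_child(document)
hit = retrieve_parent("needle", parents, children)
```

A real system would match children by vector similarity rather than keywords, but the child-to-parent mapping works the same way.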
Q8: How to optimize RAG system latency?
A:
1. Async retrieval: Retrieve for multiple queries in parallel
2. Caching: Cache the results of common queries
3. Index optimization: Use faster indexes (e.g., HNSW)
4. Batch processing: Process multiple queries in batches
5. Model optimization: Use faster Embedding and generative models
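Point 2 (caching) can be sketched as a thin wrapper around any retrieval function; `CachedRetriever` and `fake_retrieve` are hypothetical names used for illustration:

```python
import hashlib

class CachedRetriever:
    """Wrap a retrieval function with an exact-match query cache."""
    def __init__(self, retrieve_fn):
        self.retrieve_fn = retrieve_fn
        self.cache = {}
        self.hits = 0

    def retrieve(self, query):
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        result = self.retrieve_fn(query)
        self.cache[key] = result
        return result

calls = []
def fake_retrieve(query):          # stand-in for a real (slow) retriever
    calls.append(query)
    return [f"doc for {query}"]

retriever = CachedRetriever(fake_retrieve)
retriever.retrieve("what is RAG")
retriever.retrieve("what is RAG")  # served from cache; backend not called again
```

Exact-match caching only helps with literally repeated queries; semantic caching (matching by query embedding) extends the idea to paraphrases at the cost of extra complexity.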
Q9: How to evaluate RAG system performance?
A: Evaluation metrics:
- Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
- Generation metrics: BLEU, ROUGE, BERTScore, human evaluation
- End-to-end metrics: Answer accuracy, relevance, completeness
- System metrics: Latency, throughput, cost
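Two of the retrieval metrics above, Recall@K and MRR, are simple enough to compute directly; the document ids in the usage below are illustrative:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(rankings, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries."""
    total = 0.0
    for retrieved, relevant in zip(rankings, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Example: 1 of 2 relevant docs found in top-2; first relevant hits at ranks 2 and 1
r_at_2 = recall_at_k(["a", "b", "c"], ["b", "d"], 2)
mean_rr = mrr([["a", "b"], ["c", "a"]], [{"b"}, {"c"}])
```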
```python
def evaluate_rag_system(qa_chain, test_set):
    """test_set: list of {'question': ..., 'answer': ...} dicts."""
    correct = 0
    for example in test_set:
        result = qa_chain({"query": example["question"]})
        # Simple substring match; swap in ROUGE/BERTScore or human review as needed
        if example["answer"].lower() in result["result"].lower():
            correct += 1
    return {"accuracy": correct / len(test_set)}
```
Q10: How to build a multi-turn conversational RAG system?
A:
1. Context management: Maintain conversation history
2. Query rewriting: Combine the current query with the conversation history
3. Contextual retrieval: Consider the conversation context during retrieval
4. Memory mechanism: Distinguish short-term memory (the current conversation) from long-term memory (the knowledge base)
```python
class ConversationalRAG:
    def __init__(self, llm, vectorstore):
        self.llm = llm
        self.vectorstore = vectorstore
        self.history = []  # short-term memory: the current conversation

    def ask(self, query):
        # Rewrite the follow-up question into a standalone query using history
        if self.history:
            recent = "\n".join(self.history[-6:])
            query = self.llm(
                "Rewrite the follow-up question as a standalone question.\n"
                f"History:\n{recent}\nFollow-up: {query}"
            ).strip()
        # Retrieve from long-term memory (the knowledge base)
        docs = self.vectorstore.similarity_search(query, k=5)
        context = "\n".join(d.page_content for d in docs)
        answer = self.llm(
            f"Answer based on the context.\nContext: {context}\nQuestion: {query}"
        )
        self.history += [f"User: {query}", f"Assistant: {answer}"]
        return answer
```
RAG technology provides large language models with the ability to access external knowledge, a key technology for building knowledge-enhanced AI systems. An excellent RAG system requires careful design of various components, from vector database selection to retrieval strategy optimization, from query processing to result generation. In practice, it's necessary to select appropriate components and technologies based on specific needs, continuously optimize and iterate, to build efficient and accurate RAG systems.
- Post title: NLP (10): RAG and Knowledge Enhancement Systems
- Post author: Chen Kai
- Create time: 2024-03-28 15:00:00
- Post link: https://www.chenk.top/en/nlp-rag-knowledge-enhancement/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.