NLP (10): RAG and Knowledge Enhancement Systems
Chen Kai

Large language models are powerful, but they have a critical weakness: their knowledge is "frozen" in training data. When users ask about recent events, private documents, or domain-specific knowledge, models often provide outdated or incorrect answers. Worse, models can "hallucinate" plausible-sounding but non-existent information — this is the hallucination problem.

Retrieval-Augmented Generation (RAG) technology solves this with a simple yet effective approach: before generating an answer, first retrieve relevant information from an external knowledge base, then input the retrieved documents together with the user query into the generative model. This way, the model generates answers based on real external knowledge rather than relying solely on training-time memories.

However, building an efficient RAG system is far from simple. Vector database selection determines retrieval speed and scalability; Embedding model quality directly affects retrieval precision; retrieval strategies (dense, sparse, hybrid) must be carefully designed based on data characteristics; reranking techniques further improve result quality; query rewriting and expansion significantly enhance retrieval effectiveness. This article dives deep into each component of RAG systems, from principles to implementation, from optimization to deployment, helping readers build production-grade RAG applications.

RAG Fundamentals and Architecture

Core Concept of RAG

RAG's core idea can be summarized in a simple formula: generation = retrieval + augmented generation. Specifically, the system first retrieves document chunks relevant to the query from a large-scale knowledge base, then inputs these documents as "context" into the generative model, enabling the model to generate answers based on real external knowledge.

Mathematical Representation:

RAG decomposes the generation process into two steps: retrieval and generation. For a query x, the output of a RAG system is:

p(y | x) = Σ_{z ∈ Z} p(z | x) · p(y | x, z)

where:

  • Z is the set of retrieved documents (typically the top-k, e.g., top-5)
  • p(z | x) is the retrieval probability, representing the relevance of document z to query x, typically computed from vector similarity (e.g., cosine similarity)
  • p(y | x, z) is the generation probability, representing the probability of generating answer y given retrieved document z and query x

Why It Works:

RAG's advantage lies in separating "memory" and "reasoning": knowledge is stored in vector databases and can be updated anytime; reasoning capability is provided by generative models. Combining both ensures knowledge timeliness while avoiding model retraining. More importantly, RAG provides interpretability: every answer can be traced back to specific source documents, which is crucial for production applications.
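The retrieval-then-generation decomposition above can be illustrated with a toy numerical example (all probabilities below are invented purely for illustration):

```python
# Toy illustration of the RAG marginalization:
# p(y | x) = sum over z of p(z | x) * p(y | x, z)
# All numbers are made up for illustration only.

# Retrieval probabilities p(z | x) over three retrieved documents
p_retrieval = {"doc_a": 0.6, "doc_b": 0.3, "doc_c": 0.1}

# Generation probability p(y | x, z) of the correct answer given each document
p_generation = {"doc_a": 0.9, "doc_b": 0.5, "doc_c": 0.2}

# Marginalize over the retrieved documents
p_answer = sum(p_retrieval[z] * p_generation[z] for z in p_retrieval)
print(f"p(y | x) = {p_answer:.2f}")  # 0.6*0.9 + 0.3*0.5 + 0.1*0.2 = 0.71
```

Note how a document that is retrieved with high probability but supports the answer poorly drags the overall probability down — which is why retrieval quality matters so much.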

RAG Architecture Flow

A typical RAG system includes the following steps:

  1. Document Processing: Split, vectorize, and store original documents in a vector database
  2. Query Processing: Convert user queries to vector representations
  3. Retrieval: Retrieve relevant documents from the vector database
  4. Reranking: Fine-rank retrieval results
  5. Generation: Input retrieved documents and query into the generative model
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import OpenAI
from langchain.chains import RetrievalQA

# 1. Load documents
loader = TextLoader("documents.txt")
documents = loader.load()

# 2. Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = text_splitter.split_documents(documents)

# 3. Create vector database
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)
vectorstore = FAISS.from_documents(chunks, embeddings)

# 4. Create retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# 5. Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    return_source_documents=True
)

# 6. Query
query = "What is machine learning?"
result = qa_chain({"query": query})
print(result["result"])
print(result["source_documents"])

RAG vs Fine-tuning

| Method | Advantages | Disadvantages | Use Cases |
| --- | --- | --- | --- |
| RAG | Dynamic knowledge updates, low cost, high interpretability | Depends on retrieval quality, may still hallucinate | Frequently updated knowledge bases, multi-domain knowledge |
| Fine-tuning | Model fully adapts to the task, potentially better performance | High cost, difficult to update, may forget prior knowledge | Specific tasks, relatively stable knowledge |

Vector Database Selection

Vector databases are core components of RAG systems, responsible for storing and retrieving document vectors. Different vector databases have different characteristics and use cases.

FAISS

FAISS is Facebook's open-source vector similarity search library, supporting both CPU and GPU acceleration.

Characteristics:
  • High performance: supports multiple index algorithms (IVF, HNSW, LSH)
  • Memory efficient: supports memory mapping and quantization
  • Easy integration: simple Python API

Use Cases:
  • Small to medium-scale datasets (millions of vectors)
  • Rapid prototyping
  • Local deployment

import faiss
import numpy as np

# Create index
dimension = 768 # Vector dimension
index = faiss.IndexFlatL2(dimension) # L2 distance

# Add vectors
vectors = np.random.random((10000, dimension)).astype('float32')
index.add(vectors)

# Search
query_vector = np.random.random((1, dimension)).astype('float32')
k = 5 # Return top-k
distances, indices = index.search(query_vector, k)

print(f"Top {k} similar vectors:")
for i, idx in enumerate(indices[0]):
    print(f"Rank {i+1}: Index {idx}, Distance {distances[0][i]}")

# Use IVF index (faster, but requires training)
nlist = 100 # Number of cluster centers
quantizer = faiss.IndexFlatL2(dimension)
index_ivf = faiss.IndexIVFFlat(quantizer, dimension, nlist)
index_ivf.train(vectors) # Train index
index_ivf.add(vectors)
index_ivf.nprobe = 10 # Number of clusters to check during search

Milvus

Milvus is a cloud-native vector database supporting distributed deployment and horizontal scaling.

Characteristics:
  • Distributed architecture: supports cluster deployment
  • High availability: supports data replication and fault recovery
  • Rich features: supports scalar filtering, time series, etc.

Use Cases:
  • Large-scale datasets (tens of millions+ vectors)
  • Production deployment
  • High availability and scalability requirements

from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

# Connect to Milvus
connections.connect(
    alias="default",
    host="localhost",
    port="19530"
)

# Define schema
fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=1000),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768)
]

schema = CollectionSchema(fields, "RAG collection")
collection = Collection("rag_collection", schema)

# Create index
index_params = {
    "metric_type": "L2",
    "index_type": "IVF_FLAT",
    "params": {"nlist": 1024}
}
collection.create_index("embedding", index_params)

# Insert data
data = [
    ["Document 1", "Document 2", "Document 3"],
    [[0.1] * 768, [0.2] * 768, [0.3] * 768]  # Example vectors
]
collection.insert(data)
collection.load()

# Search
search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
results = collection.search(
    data=[[0.15] * 768],  # Query vector
    anns_field="embedding",
    param=search_params,
    limit=5
)

Pinecone

Pinecone is a fully managed vector database service requiring no infrastructure management.

Characteristics:
  • Fully managed: no servers to manage
  • Auto-scaling: adjusts automatically based on load
  • Simple to use: RESTful API

Use Cases:
  • Rapid deployment
  • Small to medium-scale applications
  • Teams that don't want to manage infrastructure

import pinecone

# Initialize
pinecone.init(api_key="your-api-key", environment="us-west1-gcp")

# Create index
index_name = "rag-index"
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=768,
        metric="cosine"
    )

# Connect to index
index = pinecone.Index(index_name)

# Insert vectors
vectors = [
    ("vec1", [0.1] * 768, {"text": "Document 1"}),
    ("vec2", [0.2] * 768, {"text": "Document 2"})
]
index.upsert(vectors=vectors)

# Search
query_vector = [0.15] * 768
results = index.query(
    vector=query_vector,
    top_k=5,
    include_metadata=True
)

Chroma

Chroma is a lightweight vector database focused on ease of use and developer experience.

Characteristics:
  • Lightweight: low resource usage
  • Ease of use: clean API design
  • Flexibility: supports multiple deployment methods

Use Cases:
  • Development and testing
  • Small-scale applications
  • Rapid prototyping

import chromadb
from chromadb.config import Settings

# Create client
client = chromadb.Client(Settings(
    chroma_db_impl="duckdb+parquet",
    persist_directory="./chroma_db"
))

# Create collection
collection = client.create_collection(
    name="rag_collection",
    metadata={"hnsw:space": "cosine"}
)

# Add documents
collection.add(
    documents=["Document 1", "Document 2", "Document 3"],
    embeddings=[[0.1] * 768, [0.2] * 768, [0.3] * 768],
    ids=["id1", "id2", "id3"],
    metadatas=[{"source": "doc1"}, {"source": "doc2"}, {"source": "doc3"}]
)

# Query
results = collection.query(
    query_embeddings=[[0.15] * 768],
    n_results=5
)

Vector Database Comparison

Database Scale Deployment Characteristics Use Cases
FAISS Millions Local High performance, easy to use Development, small-medium scale
Milvus Tens of millions+ Distributed Scalable, high availability Production, large scale
Pinecone Millions Managed Simple, no ops needed Rapid deployment
Chroma Hundreds of thousands Local/Cloud Lightweight, easy to use Development, small scale

Embedding Model Comparison

The quality of Embedding models directly affects retrieval performance. Different models have different characteristics and use cases.

General Embedding Models

OpenAI text-embedding-ada-002:
  • Dimension: 1536
  • Advantages: excellent performance, multilingual support
  • Disadvantages: requires API calls, incurs cost

sentence-transformers:
  • An open-source model collection
  • Advantages: free, can be deployed locally, good performance
  • Common models:
    • all-MiniLM-L6-v2: fast, lightweight
    • all-mpnet-base-v2: better performance
    • multi-qa-mpnet-base: optimized for Q&A

from sentence_transformers import SentenceTransformer
import numpy as np

# Load model
model = SentenceTransformer('all-mpnet-base-v2')

# Encode text
texts = [
    "Machine learning is a branch of artificial intelligence",
    "Deep learning uses neural networks",
    "Natural language processing handles text data"
]
embeddings = model.encode(texts)

print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding dimension: {embeddings.shape[1]}")

# Compute similarity
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(embeddings)
print("Similarity matrix:")
print(similarity_matrix)

Domain-Specific Embeddings

For specific domains, Embedding models can be fine-tuned using domain data.

from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Load base model
model = SentenceTransformer('all-mpnet-base-v2')

# Prepare training data
train_examples = [
    InputExample(texts=["Machine Learning", "ML"]),
    InputExample(texts=["Deep Learning", "DL"]),
    InputExample(texts=["Natural Language Processing", "NLP"])
]

# Define loss function: MultipleNegativesRankingLoss is the standard
# contrastive loss for unlabeled positive pairs like those above
# (CosineSimilarityLoss would require explicit similarity labels)
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune
model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=10,
    warmup_steps=100
)

# Save model
model.save('./domain-specific-embedding')

Embedding Model Selection Guide

| Model | Dimension | Speed | Accuracy | Use Cases |
| --- | --- | --- | --- | --- |
| text-embedding-ada-002 | 1536 | Medium | High | Production, multilingual |
| all-MiniLM-L6-v2 | 384 | Fast | Medium | Rapid prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Medium | High | Balance of performance and speed |
| multi-qa-mpnet-base | 768 | Medium | High | Q&A tasks |

Retrieval Strategy Optimization

Retrieval is the core component of RAG systems, and retrieval quality directly affects final answer accuracy. Different retrieval strategies have different advantages and use cases, and understanding their differences is crucial for building efficient RAG systems.

Dense Retrieval

Dense Retrieval converts both queries and documents into high-dimensional vectors through Embedding models, then computes vector similarity (typically cosine similarity) to retrieve relevant documents. This is currently the most mainstream retrieval method.

How It Works:

The core assumption of dense retrieval is: semantically similar texts should be close in vector space. Through Embedding models trained on large-scale text pairs (e.g., sentence-transformers), the system encodes semantic information into vector representations. During retrieval, it computes similarity between query vectors and all document vectors, selecting the top-k documents with highest similarity.
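The mechanics can be sketched from scratch with NumPy — a minimal illustration independent of any particular vector database, using random vectors in place of real embeddings:

```python
import numpy as np

def cosine_top_k(query_vec, doc_vecs, k=5):
    """Return indices and cosine scores of the k documents most similar to the query."""
    # Normalize so that the dot product equals cosine similarity
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                      # cosine similarity to every document
    top = np.argsort(scores)[::-1][:k]  # indices of the highest scores first
    return top, scores[top]

# Toy corpus: 100 random 64-dimensional "document embeddings"
rng = np.random.default_rng(0)
docs = rng.normal(size=(100, 64))
query = docs[42] + 0.01 * rng.normal(size=64)  # near-duplicate of document 42

indices, scores = cosine_top_k(query, docs, k=3)
print(indices, scores)  # document 42 should rank first with score near 1.0
```

Vector databases like FAISS do exactly this, but with approximate indexes (IVF, HNSW) so the search scales to millions of documents.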

Advantages:
  • Strong semantic understanding: handles synonyms, near-synonyms, and semantically similar concepts
  • Cross-lingual capability: multilingual Embedding models support cross-language retrieval
  • Simple implementation: only requires vector similarity computation, no complex feature engineering

Disadvantages:
  • Less sensitive to exact keyword matching: queries containing specific terms (e.g., product names, code identifiers) may miss exactly matching documents
  • Computational cost: Embeddings must be computed for all documents, which is costly at scale
  • Domain adaptability: general Embedding models may underperform in specialized domains

def dense_retrieval(query_embedding, vectorstore, top_k=5):
    """Dense retrieval"""
    # With a precomputed query embedding, use the *_by_vector variant;
    # similarity_search_with_score expects the raw query text instead
    results = vectorstore.similarity_search_with_score_by_vector(
        query_embedding,
        k=top_k
    )
    return results  # list of (document, score) pairs

Sparse Retrieval

Sparse Retrieval uses keyword matching (e.g., BM25) for retrieval.

Advantages:
  • Precise keyword matching
  • Sensitive to exact terms
  • No Embedding model needed

Disadvantages:
  • Cannot understand semantics
  • Not sensitive to synonyms

from rank_bm25 import BM25Okapi
import jieba
import numpy as np

def sparse_retrieval(query, documents, top_k=5):
    """Sparse retrieval (BM25)"""
    # Tokenize (jieba also handles Chinese text)
    tokenized_docs = [jieba.lcut(doc) for doc in documents]
    tokenized_query = jieba.lcut(query)

    # Create BM25 index
    bm25 = BM25Okapi(tokenized_docs)

    # Retrieve
    scores = bm25.get_scores(tokenized_query)
    top_indices = np.argsort(scores)[::-1][:top_k]

    return [(documents[i], scores[i]) for i in top_indices]

Hybrid Retrieval

Dense and Sparse retrieval each have advantages, and hybrid retrieval combines both for optimal results. In practice, hybrid retrieval typically improves retrieval precision by 10-30% compared to single methods.

Why Hybrid Retrieval:

  • Dense retrieval excels at semantic understanding but may miss exact matches
  • Sparse retrieval excels at keyword matching but cannot understand semantics
  • Hybrid retrieval combines both strengths, ensuring semantic relevance while guaranteeing exact matches aren't missed

Fusion Strategies:

  1. RRF (Reciprocal Rank Fusion): The most common fusion method, which merges the rankings produced by each retriever. The RRF score of a document d is RRF(d) = Σ_{m ∈ M} 1 / (k + rank_m(d)), where M is the set of retrieval methods, rank_m(d) is the rank of document d under method m, and k is a smoothing parameter (typically 60).

  2. Weighted Fusion: Weighted sum of similarity scores from both retrieval results, requiring weight adjustment based on data characteristics (e.g., dense:sparse = 0.7:0.3).

  3. Reranking Fusion: First retrieve top-k candidates with both methods (e.g., 20 each), merge and deduplicate to get candidate set, then use Cross-Encoder reranking model to fine-rank all candidates, selecting final top-k.
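The weighted-fusion variant (strategy 2) can be sketched as follows; the 0.7/0.3 split and the min-max normalization are illustrative choices rather than fixed rules:

```python
def weighted_fusion(dense_results, sparse_results, w_dense=0.7, w_sparse=0.3, top_k=5):
    """Fuse two result lists of (doc, score) pairs by a weighted score sum.

    Scores from different retrievers live on different scales, so each list
    is min-max normalized to [0, 1] before weighting.
    """
    def normalize(results):
        if not results:
            return {}
        scores = [s for _, s in results]
        lo, hi = min(scores), max(scores)
        span = (hi - lo) or 1.0  # avoid division by zero when all scores are equal
        return {doc: (s - lo) / span for doc, s in results}

    dense_norm = normalize(dense_results)
    sparse_norm = normalize(sparse_results)

    # Weighted sum over the union of candidates; missing docs contribute 0
    combined = {}
    for doc in set(dense_norm) | set(sparse_norm):
        combined[doc] = (w_dense * dense_norm.get(doc, 0.0)
                         + w_sparse * sparse_norm.get(doc, 0.0))

    return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:top_k]

# Toy example with string documents and made-up scores
dense = [("doc_a", 0.9), ("doc_b", 0.7), ("doc_c", 0.2)]
sparse = [("doc_b", 12.0), ("doc_d", 8.0), ("doc_a", 1.0)]
print(weighted_fusion(dense, sparse, top_k=3))
```

Unlike RRF, weighted fusion uses raw scores rather than ranks, so the weights usually need tuning per dataset.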

def hybrid_retrieval(query, query_embedding, vectorstore, documents, top_k=5):
    """Hybrid retrieval with RRF fusion"""
    # Dense retrieval
    dense_results = dense_retrieval(query_embedding, vectorstore, top_k=top_k*2)

    # Sparse retrieval
    sparse_results = sparse_retrieval(query, documents, top_k=top_k*2)

    # RRF score contribution of a single ranked position
    def rrf_score(rank, k=60):
        return 1 / (k + rank)

    # Accumulate RRF scores across both result lists
    combined_scores = {}
    for rank, (doc, _) in enumerate(dense_results, 1):
        combined_scores[doc] = combined_scores.get(doc, 0) + rrf_score(rank)

    for rank, (doc, _) in enumerate(sparse_results, 1):
        combined_scores[doc] = combined_scores.get(doc, 0) + rrf_score(rank)

    # Sort and return top-k
    sorted_docs = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)
    return sorted_docs[:top_k]

Reranking Techniques

Reranking fine-ranks initial retrieval results to improve final result quality.

Cross-Encoder Reranking

Cross-Encoder inputs query and document together into the model to compute relevance scores.

Advantages:
  • High accuracy
  • Understands query-document interactions

Disadvantages:
  • High computational cost (scores cannot be precomputed)
  • Slow

from sentence_transformers import CrossEncoder
import numpy as np

# Load Cross-Encoder model
reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank(query, documents, top_k=5):
    """Rerank documents by Cross-Encoder relevance score"""
    # Build query-document pairs
    pairs = [[query, doc] for doc in documents]

    # Compute relevance scores
    scores = reranker.predict(pairs)

    # Sort by descending score
    ranked_indices = np.argsort(scores)[::-1]
    ranked_docs = [documents[i] for i in ranked_indices[:top_k]]

    return ranked_docs

Multi-Stage Retrieval

Multi-stage retrieval combines fast retrieval with precise reranking:

  1. Stage 1: Use fast methods (Dense/Sparse) to retrieve many candidates (e.g., 100)
  2. Stage 2: Use reranking model to fine-rank candidates (e.g., top-5)
def multi_stage_retrieval(query, query_embedding, vectorstore, top_k=5):
    """Multi-stage retrieval"""
    # Stage 1: Fast retrieval of many candidates
    candidates = dense_retrieval(query_embedding, vectorstore, top_k=100)
    candidate_docs = [doc for doc, _ in candidates]

    # Stage 2: Reranking
    reranked_docs = rerank(query, candidate_docs, top_k=top_k)

    return reranked_docs

Query Rewriting and Expansion

Query optimization can improve retrieval performance, including query rewriting, query expansion, and query decomposition.

Query Rewriting

Query rewriting converts user queries into forms more suitable for retrieval.

Methods:
  1. Synonym expansion: add synonyms
  2. Query completion: complete incomplete queries
  3. Query simplification: remove redundant words

from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

def rewrite_query(original_query, llm):
    """Query rewriting"""
    prompt = PromptTemplate(
        input_variables=["query"],
        template="""
Rewrite the following query into a form more suitable for information retrieval, maintaining the original meaning but using more precise terms:

Original query: {query}

Rewritten query:
"""
    )

    rewritten = llm(prompt.format(query=original_query))
    return rewritten.strip()

Query Expansion

Query expansion adds related terms and concepts.

def expand_query(query, llm):
    """Query expansion"""
    prompt = f"""
Generate relevant keywords and synonyms for the following query for information retrieval:

Query: {query}

Relevant keywords (comma-separated):
"""

    expanded_terms = llm(prompt).strip().split(',')
    expanded_query = query + " " + " ".join(expanded_terms)

    return expanded_query

Query Decomposition

For complex queries, decompose into multiple sub-queries.

def decompose_query(query, llm):
    """Query decomposition"""
    prompt = f"""
Decompose the following complex query into multiple simple sub-queries:

Complex query: {query}

Sub-queries (one per line):
"""

    subqueries = llm(prompt).strip().split('\n')
    return [q.strip() for q in subqueries if q.strip()]

Practical: Building Enterprise-Grade RAG Systems

Building RAG with LangChain

LangChain provides a complete RAG toolchain.

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.prompts import PromptTemplate

# 1. Load documents
loader = DirectoryLoader("./documents", glob="*.txt")
documents = loader.load()

# 2. Split documents
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len
)
chunks = text_splitter.split_documents(documents)

# 3. Create Embedding
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-mpnet-base-v2"
)

# 4. Create vector database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# 5. Custom Prompt
prompt_template = """Use the following context to answer the question. If you don't know the answer, say you don't know; don't make up an answer.

Context: {context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# 6. Create RAG chain
llm = OpenAI(temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True
)

# 7. Query
query = "What is RAG?"
result = qa_chain({"query": query})
print(f"Answer: {result['result']}")
print(f"Source documents: {result['source_documents']}")

Building RAG with LlamaIndex

LlamaIndex focuses on the data layer for LLM applications.

from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAI

# 1. Load documents
documents = SimpleDirectoryReader("./documents").load_data()

# 2. Create service context
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
llm = OpenAI(temperature=0)

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model=embed_model,
    chunk_size=1000,
    chunk_overlap=200
)

# 3. Create index
index = VectorStoreIndex.from_documents(
    documents,
    service_context=service_context
)

# 4. Create query engine
query_engine = index.as_query_engine(
    similarity_top_k=5,
    response_mode="compact"
)

# 5. Query
response = query_engine.query("What is RAG?")
print(response)
print(response.source_nodes)

Advanced RAG Patterns

Parent-Child Retrieval:
  • Storage: split documents into small chunks (child chunks)
  • Retrieval: retrieve child chunks, but return the parent chunks (which contain more context)

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Create parent and child document splitters
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=500)

# Split documents into parent chunks, then each parent into child chunks
parent_docs = parent_splitter.split_documents(documents)
child_docs = []

for parent_id, parent in enumerate(parent_docs):
    # Give each parent an explicit id so its children can point back to it
    parent.metadata["id"] = parent_id
    children = child_splitter.split_documents([parent])
    for child in children:
        child.metadata["parent_id"] = parent_id
    child_docs.extend(children)

# Store child chunks (for retrieval)
vectorstore = Chroma.from_documents(child_docs, embeddings)

# Return parent chunks during retrieval
def retrieve_with_parent(query, vectorstore, parent_docs):
    # Retrieve child chunks
    child_results = vectorstore.similarity_search(query, k=5)

    # Map the matched children back to their (larger) parent chunks
    parent_ids = {r.metadata["parent_id"] for r in child_results}
    return [doc for doc in parent_docs if doc.metadata["id"] in parent_ids]

Self-RAG:
  • Use the LLM to judge whether retrieval is needed
  • Critically evaluate the retrieval results
  • Decide whether to use the retrieved information based on that evaluation

def self_rag(query, llm, vectorstore):
    """Self-RAG implementation"""
    # 1. Judge whether retrieval is needed
    need_retrieval_prompt = f"""
Determine whether the following query needs to retrieve from an external knowledge base:

Query: {query}

If retrieval is needed, answer "yes", otherwise answer "no".
"""

    need_retrieval = llm(need_retrieval_prompt).strip()

    if "yes" in need_retrieval.lower():
        # 2. Retrieve
        docs = vectorstore.similarity_search(query, k=5)

        # 3. Evaluate retrieval results
        evaluation_prompt = f"""
Evaluate the relevance of the following retrieval results to the query:

Query: {query}

Retrieval results:
{chr(10).join([f"{i+1}. {doc.page_content[:200]}" for i, doc in enumerate(docs)])}

Give a relevance score (1-5) for each result and explain why.
"""

        evaluation = llm(evaluation_prompt)

        # 4. Generate answer
        answer_prompt = f"""
Answer the question based on the following retrieval results:

Query: {query}

Retrieval results:
{chr(10).join([doc.page_content for doc in docs])}

Relevance evaluation:
{evaluation}

Answer:
"""

        answer = llm(answer_prompt)
        return answer, docs
    else:
        # Direct generation (no retrieval needed)
        answer = llm(f"Answer the following question: {query}")
        return answer, []

❓ Q&A: Common Questions on RAG

Q1: What's the difference between RAG and fine-tuning? When to use RAG?

A:
  • RAG: Dynamically retrieves external knowledge; suitable for frequently updated knowledge bases, access to the latest information, and multi-domain knowledge
  • Fine-tuning: Encodes knowledge into model parameters; suitable for specific tasks, relatively stable knowledge, and maximum performance requirements
  • Choice: If knowledge needs frequent updates or involves private data, choose RAG; if the task is fixed and performance requirements are high, consider fine-tuning

Q2: How to choose a vector database?

A: The choice depends on:
  • Data scale: millions of vectors suit FAISS; tens of millions+ call for Milvus
  • Deployment: FAISS/Chroma for local deployment, Milvus/Pinecone for the cloud
  • Ops capability: Pinecone if you don't want to manage infrastructure, Milvus if you have an ops team
  • Development stage: FAISS/Chroma for rapid prototyping, Milvus for production

Q3: How to choose Embedding models?

A:
  • General scenarios: all-mpnet-base-v2 or text-embedding-ada-002
  • Resource-constrained: all-MiniLM-L6-v2 (fast, low dimension)
  • Q&A tasks: multi-qa-mpnet-base
  • Multilingual: paraphrase-multilingual-mpnet-base-v2
  • Domain-specific: fine-tune a general model with domain data

Q4: How to choose between Dense Retrieval and Sparse Retrieval?

A:
  • Dense Retrieval: suitable for semantic understanding, synonym matching, and concept retrieval
  • Sparse Retrieval: suitable for exact keyword matching and term retrieval
  • Recommendation: use Hybrid Retrieval to combine the strengths of both

Q5: How to improve retrieval accuracy?

A: Multiple approaches:
  1. Optimize the Embedding: use better models or domain fine-tuning
  2. Improve the splitting strategy: choose splitting methods based on document characteristics
  3. Use Reranking: Cross-Encoder reranking
  4. Query optimization: query rewriting, expansion, decomposition
  5. Multi-stage retrieval: coarse ranking followed by fine ranking

Q6: What to do when RAG systems have hallucinations?

A:
  1. Improve retrieval quality: ensure retrieved documents are relevant to the query
  2. Prompt design: explicitly require the model to answer based on retrieved content and to say "I don't know" otherwise
  3. Result verification: fact-check key information
  4. Use Self-RAG: let the model evaluate the relevance of retrieval results
  5. Confidence scoring: attach confidence scores to generated results and warn the user when confidence is low

Q7: How to handle long documents?

A:
  1. Parent-Child Retrieval: retrieve small chunks, return large chunks
  2. Sliding window: include adjacent chunks during retrieval
  3. Document summarization: generate summaries of long documents and retrieve over the summaries
  4. Hierarchical retrieval: retrieve chapters first, then specific content
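The sliding-window idea (item 2) can be sketched as follows, assuming chunks are kept in document order so a retrieved index can be expanded to its neighbors (the index-based convention here is illustrative):

```python
def retrieve_with_window(matched_indices, chunks, window=1):
    """Expand each retrieved chunk with its neighbors in the original document.

    `chunks` is the list of chunk texts in document order;
    `matched_indices` are the chunk indices returned by the retriever.
    """
    selected = set()
    for idx in matched_indices:
        lo = max(0, idx - window)              # clamp at document start
        hi = min(len(chunks), idx + window + 1)  # clamp at document end
        selected.update(range(lo, hi))
    # Return the expanded chunks in document order
    return [chunks[i] for i in sorted(selected)]

# Toy example: the retriever matched chunks 2 and 7 of a 10-chunk document
chunks = [f"chunk {i}" for i in range(10)]
print(retrieve_with_window([2, 7], chunks, window=1))
# ['chunk 1', 'chunk 2', 'chunk 3', 'chunk 6', 'chunk 7', 'chunk 8']
```

The window size trades context against prompt length: larger windows give the generator more surrounding text but consume more of the context budget.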

Q8: How to optimize RAG system latency?

A:
  1. Async retrieval: run multiple retrievals in parallel
  2. Caching: cache results of common queries
  3. Index optimization: use faster indexes (e.g., HNSW)
  4. Batch processing: process multiple queries in batches
  5. Model optimization: use faster Embedding and generative models
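The caching idea (item 2) can be sketched with a small in-memory LRU cache keyed by the normalized query text; production systems would more likely use Redis or embedding-similarity ("semantic") caching, and the pipeline wiring below is illustrative:

```python
from collections import OrderedDict

class CachedRAG:
    """Wrap a RAG pipeline with an exact-match query cache (illustrative sketch)."""

    def __init__(self, answer_fn, max_size=1000):
        self.answer_fn = answer_fn  # e.g. lambda q: qa_chain({"query": q})["result"]
        self.max_size = max_size
        self.cache = OrderedDict()

    def query(self, text):
        key = " ".join(text.lower().split())  # normalize whitespace and case
        if key in self.cache:
            self.cache.move_to_end(key)       # LRU: mark as recently used
            return self.cache[key]
        answer = self.answer_fn(text)
        self.cache[key] = answer
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)    # evict the least recently used entry
        return answer

# Toy backend that counts how often the expensive pipeline actually runs
calls = {"n": 0}
def fake_pipeline(q):
    calls["n"] += 1
    return f"answer to: {q}"

rag = CachedRAG(fake_pipeline)
rag.query("What is RAG?")
rag.query("what is  rag?")   # cache hit after normalization
print(calls["n"])  # 1
```

Exact-match caching only helps with repeated queries; semantic caching (matching on embedding similarity) extends the hit rate to paraphrases at the cost of an extra embedding lookup.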

Q9: How to evaluate RAG system performance?

A: Evaluation metrics:
  • Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
  • Generation metrics: BLEU, ROUGE, BERTScore, human evaluation
  • End-to-end metrics: answer accuracy, relevance, completeness
  • System metrics: latency, throughput, cost

from rouge_score import rouge_scorer

def evaluate_rag_system(qa_chain, test_set):
    """Evaluate RAG system"""
    results = []
    scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)

    for query, ground_truth in test_set:
        # Generate answer
        result = qa_chain({"query": query})
        answer = result["result"]

        # Compute ROUGE metrics against the reference answer
        scores = scorer.score(ground_truth, answer)

        results.append({
            "query": query,
            "answer": answer,
            "ground_truth": ground_truth,
            "rouge1": scores["rouge1"].fmeasure,
            "rougeL": scores["rougeL"].fmeasure
        })

    # Compute average scores
    avg_rouge1 = sum(r["rouge1"] for r in results) / len(results)
    avg_rougeL = sum(r["rougeL"] for r in results) / len(results)

    return {
        "avg_rouge1": avg_rouge1,
        "avg_rougeL": avg_rougeL,
        "results": results
    }
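The generation-side evaluation above can be complemented with the retrieval-side metrics mentioned in the answer. A minimal sketch of Recall@K and MRR, assuming each test query comes with a labeled set of relevant document ids (the toy ids below are invented):

```python
def recall_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the relevant documents that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def mean_reciprocal_rank(all_retrieved, all_relevant):
    """MRR: average over queries of 1/rank of the first relevant document."""
    reciprocal_ranks = []
    for retrieved_ids, relevant_ids in zip(all_retrieved, all_relevant):
        rr = 0.0
        for rank, doc_id in enumerate(retrieved_ids, 1):
            if doc_id in relevant_ids:
                rr = 1 / rank  # only the first relevant hit counts
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

# Toy example: two queries with known relevant documents
retrieved = [["d3", "d1", "d7"], ["d2", "d5", "d9"]]
relevant = [{"d1", "d4"}, {"d9"}]

print(recall_at_k(retrieved[0], relevant[0], k=3))  # 1 of 2 relevant found: 0.5
print(mean_reciprocal_rank(retrieved, relevant))    # (1/2 + 1/3) / 2 ≈ 0.417
```

Measuring retrieval separately from generation helps localize failures: a low Recall@K points at the retriever, while a high Recall@K with poor answers points at the prompt or the generator.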

Q10: How to build a multi-turn conversational RAG system?

A:
  1. Context management: maintain the conversation history
  2. Query rewriting: combine the current query with the history
  3. Contextual retrieval: consider the conversation context during retrieval
  4. Memory mechanism: distinguish short-term memory (current conversation) from long-term memory (knowledge base)

class ConversationalRAG:
    def __init__(self, qa_chain):
        self.qa_chain = qa_chain
        self.conversation_history = []

    def chat(self, query):
        # 1. Build contextual query
        context_query = self._build_contextual_query(query)

        # 2. Retrieve and generate
        result = self.qa_chain({"query": context_query})

        # 3. Update history
        self.conversation_history.append({
            "user": query,
            "assistant": result["result"]
        })

        return result["result"]

    def _build_contextual_query(self, current_query):
        """Build contextual query"""
        if not self.conversation_history:
            return current_query

        # Add recent conversations
        recent_history = self.conversation_history[-3:]
        context = "\n".join([
            f"User: {h['user']}\nAssistant: {h['assistant']}"
            for h in recent_history
        ])

        return f"""
Conversation history:
{context}

Current question: {current_query}

Please answer the current question based on the conversation history.
"""

RAG technology provides large language models with the ability to access external knowledge, a key technology for building knowledge-enhanced AI systems. An excellent RAG system requires careful design of various components, from vector database selection to retrieval strategy optimization, from query processing to result generation. In practice, it's necessary to select appropriate components and technologies based on specific needs, continuously optimize and iterate, to build efficient and accurate RAG systems.

  • Post title: NLP (10): RAG and Knowledge Enhancement Systems
  • Post author: Chen Kai
  • Create time: 2024-03-28 15:00:00
  • Post link: https://www.chenk.top/en/nlp-rag-knowledge-enhancement/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.