Large language models are powerful, but they have a critical weakness: their knowledge is frozen at training time. When users ask about recent events, private documents, or domain-specific knowledge, models often give outdated or incorrect answers. Worse, they can "hallucinate", confidently producing plausible-sounding but non-existent information.
Retrieval-Augmented Generation (RAG) technology solves this with a simple yet effective approach: before generating an answer, first retrieve relevant information from an external knowledge base, then input the retrieved documents together with the user query into the generative model. This way, the model generates answers based on real external knowledge rather than relying solely on training-time memories.
However, building an efficient RAG system is far from simple. Vector database selection determines retrieval speed and scalability; Embedding model quality directly affects retrieval precision; retrieval strategies (dense, sparse, hybrid) must be carefully designed based on data characteristics; reranking techniques further improve result quality; query rewriting and expansion significantly enhance retrieval effectiveness. This article dives deep into each component of RAG systems, from principles to implementation, from optimization to deployment, helping readers build production-grade RAG applications.
RAG Fundamentals and Architecture
Core Concept of RAG
RAG's core idea can be summarized in a simple formula: RAG = Retrieval + Augmented Generation. Specifically, the system first retrieves document chunks relevant to the query from a large-scale knowledge base, then feeds these documents as "context" into the generative model, enabling the model to generate answers grounded in real external knowledge.
Mathematical Representation:

RAG decomposes the generation process into two steps: retrieval and generation. For query $x$, the retriever $p_\eta$ first selects the top-$k$ documents $z$; the generator $p_\theta$ then produces the answer $y$ conditioned on the query and each retrieved document, marginalizing over the retrieved set:

$$P(y \mid x) = \sum_{z \in \text{top-}k\left(p_\eta(\cdot \mid x)\right)} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$
RAG's advantage lies in separating "memory" and "reasoning": knowledge is stored in vector databases and can be updated anytime; reasoning capability is provided by generative models. Combining both ensures knowledge timeliness while avoiding model retraining. More importantly, RAG provides interpretability: every answer can be traced back to specific source documents, which is crucial for production applications.
RAG Architecture Flow
A typical RAG system includes the following steps:
- Document Processing: Split, vectorize, and store original documents in a vector database
- Query Processing: Convert user queries to vector representations
- Retrieval: Retrieve relevant documents from the vector database
- Reranking: Fine-rank retrieval results
- Generation: Input retrieved documents and query into the generative model
```python
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS

# 1. Document processing: load, split, vectorize, and store
documents = TextLoader("knowledge_base.txt").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

# 2-3. Query processing and retrieval (reranking and generation follow below)
relevant_docs = vectorstore.similarity_search("What is RAG?", k=5)
```
RAG vs Fine-tuning
| Method | Advantages | Disadvantages | Use Cases |
|---|---|---|---|
| RAG | Dynamic knowledge updates, low cost, high interpretability | Depends on retrieval quality, may have hallucinations | Frequently updated knowledge bases, multi-domain knowledge |
| Fine-tuning | Model fully adapts to task, potentially better performance | High cost, difficult to update, may forget | Specific tasks, relatively stable knowledge |
Vector Database Selection
Vector databases are core components of RAG systems, responsible for storing and retrieving document vectors. Different vector databases have different characteristics and use cases.
FAISS (Facebook AI Similarity Search)
FAISS is Facebook's open-source vector similarity search library, supporting both CPU and GPU acceleration.
Characteristics:
- High performance: Supports multiple index algorithms (IVF, HNSW, LSH)
- Memory efficient: Supports memory mapping and quantization
- Easy integration: Simple Python API

Use Cases:
- Small to medium-scale datasets (millions of vectors)
- Rapid prototyping
- Local deployment
```python
import numpy as np
import faiss

dim = 768
index = faiss.IndexFlatIP(dim)              # exact inner-product index
vectors = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(vectors)                 # normalized IP == cosine similarity
index.add(vectors)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)        # top-5 most similar vectors

# For larger collections, switch to an approximate index such as HNSW:
# index = faiss.IndexHNSWFlat(dim, 32)
```
Milvus
Milvus is a cloud-native vector database supporting distributed deployment and horizontal scaling.
Characteristics:
- Distributed architecture: Supports cluster deployment
- High availability: Supports data replication and fault recovery
- Rich features: Supports scalar filtering, time series, etc.

Use Cases:
- Large-scale datasets (tens of millions+)
- Production deployment
- Need for high availability and scalability
```python
from pymilvus import connections, Collection, FieldSchema, CollectionSchema, DataType

connections.connect(host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("documents", CollectionSchema(fields))
collection.create_index(
    field_name="embedding",
    index_params={"index_type": "HNSW", "metric_type": "IP",
                  "params": {"M": 16, "efConstruction": 200}},
)
collection.load()  # load into memory before searching
```
Pinecone
Pinecone is a fully managed vector database service requiring no infrastructure management.
Characteristics:
- Fully managed: No need to manage servers
- Auto-scaling: Automatically adjusts based on load
- Simple to use: RESTful API

Use Cases:
- Rapid deployment
- Small to medium-scale applications
- Teams that don't want to manage infrastructure
```python
import pinecone

# Classic pinecone-client API; newer client versions use a different interface
pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")
pinecone.create_index("documents", dimension=768, metric="cosine")

index = pinecone.Index("documents")
index.upsert([("doc-1", [0.1] * 768, {"source": "manual"})])
results = index.query(vector=[0.1] * 768, top_k=5, include_metadata=True)
```
Chroma
Chroma is a lightweight vector database focused on ease of use and developer experience.
Characteristics:
- Lightweight: Low resource usage
- Ease of use: Clean API design
- Flexibility: Supports multiple deployment methods

Use Cases:
- Development and testing
- Small-scale applications
- Rapid prototyping
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("documents")
collection.add(
    documents=["RAG combines retrieval with generation."],
    ids=["doc-1"],
)
results = collection.query(query_texts=["What is RAG?"], n_results=1)
```
Vector Database Comparison
| Database | Scale | Deployment | Characteristics | Use Cases |
|---|---|---|---|---|
| FAISS | Millions | Local | High performance, easy to use | Development, small-medium scale |
| Milvus | Tens of millions+ | Distributed | Scalable, high availability | Production, large scale |
| Pinecone | Millions | Managed | Simple, no ops needed | Rapid deployment |
| Chroma | Hundreds of thousands | Local/Cloud | Lightweight, easy to use | Development, small scale |
Embedding Model Comparison
The quality of Embedding models directly affects retrieval performance. Different models have different characteristics and use cases.
General Embedding Models
OpenAI text-embedding-ada-002:
- Dimension: 1536
- Advantages: Excellent performance, multilingual support
- Disadvantages: Requires API calls, has cost

sentence-transformers:
- Open-source model collection
- Advantages: Free, can deploy locally, good performance
- Common models:
  - all-MiniLM-L6-v2: Fast, lightweight
  - all-mpnet-base-v2: Better performance
  - multi-qa-mpnet-base: Optimized for Q&A
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
docs = ["RAG retrieves external knowledge.", "Fine-tuning updates model weights."]
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode("How does RAG access knowledge?", normalize_embeddings=True)
similarities = util.cos_sim(query_embedding, doc_embeddings)  # cosine per document
```
Domain-Specific Embeddings
For specific domains, Embedding models can be fine-tuned using domain data.
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

# (query, relevant passage) pairs collected from the target domain
train_examples = [
    InputExample(texts=["domain query A", "relevant domain passage A"]),
    InputExample(texts=["domain query B", "relevant domain passage B"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)
```
Embedding Model Selection Guide
| Model Type | Dimension | Speed | Accuracy | Use Cases |
|---|---|---|---|---|
| text-embedding-ada-002 | 1536 | Medium | High | Production, multilingual |
| all-MiniLM-L6-v2 | 384 | Fast | Medium | Rapid prototyping, resource-constrained |
| all-mpnet-base-v2 | 768 | Medium | High | Balance performance and speed |
| multi-qa-mpnet-base | 768 | Medium | High | Q&A tasks |
Retrieval Strategy Optimization
Retrieval is the core component of RAG systems, and retrieval quality directly affects final answer accuracy. Different retrieval strategies have different advantages and use cases, and understanding their differences is crucial for building efficient RAG systems.
Dense Retrieval
Dense Retrieval converts both queries and documents into high-dimensional vectors through Embedding models, then computes vector similarity (typically cosine similarity) to retrieve relevant documents. This is currently the most mainstream retrieval method.
How It Works:
The core assumption of dense retrieval is: semantically similar texts should be close in vector space. Through Embedding models trained on large-scale text pairs (e.g., sentence-transformers), the system encodes semantic information into vector representations. During retrieval, it computes similarity between query vectors and all document vectors, selecting the top-k documents with highest similarity.
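The top-k similarity computation described above can be sketched in a few lines of NumPy; the 2-dimensional toy vectors below are purely illustrative:

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=2):
    """Rank documents by cosine similarity to the query and return top-k indices."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q                      # cosine similarity per document
    return np.argsort(-sims)[:k], sims

docs = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
top_idx, sims = top_k_cosine(np.array([1.0, 0.2]), docs, k=2)
```

In production the document vectors come from an Embedding model and live in a vector index; the ranking logic is the same.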
Advantages:
- Strong semantic understanding: Can understand synonyms, near-synonyms, and semantically similar concepts
- Cross-lingual capability: Multilingual Embedding models support cross-language retrieval
- Simple implementation: Only requires vector similarity computation, no complex feature engineering

Disadvantages:
- Less sensitive to exact keyword matching: If queries contain specific terms (e.g., product names, code identifiers), exact-match documents may be missed
- Computational cost: Requires computing Embeddings for all documents, costly at scale
- Domain adaptability: General Embedding models may underperform in specific domains
```python
def dense_retrieval(query_embedding, vectorstore, top_k=5):
    """Return the top-k documents closest to the query vector."""
    return vectorstore.similarity_search_by_vector(query_embedding, k=top_k)
```
Sparse Retrieval
Sparse Retrieval uses keyword matching (e.g., BM25) for retrieval.
Advantages:
- Precise keyword matching
- Sensitive to exact terms
- No Embedding model needed

Disadvantages:
- Cannot understand semantics
- Not sensitive to synonyms
```python
from rank_bm25 import BM25Okapi

corpus = ["RAG retrieves external knowledge", "fine-tuning updates model weights"]
tokenized_corpus = [doc.lower().split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query_tokens = "external knowledge".lower().split()
scores = bm25.get_scores(query_tokens)            # BM25 score for each document
top_docs = bm25.get_top_n(query_tokens, corpus, n=1)
```
Hybrid Retrieval
Dense and Sparse retrieval each have advantages, and hybrid retrieval combines both for optimal results. In practice, hybrid retrieval typically improves retrieval precision by 10-30% compared to single methods.
Why Hybrid Retrieval:
- Dense retrieval excels at semantic understanding but may miss exact matches
- Sparse retrieval excels at keyword matching but cannot understand semantics
- Hybrid retrieval combines both strengths, ensuring semantic relevance while guaranteeing exact matches aren't missed
Fusion Strategies:
RRF (Reciprocal Rank Fusion): The most common fusion method, which merges the rankings produced by the different retrievers. The RRF score of a document $d$ is:

$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$

where $R$ is the set of retrieval methods, $\mathrm{rank}_r(d)$ is the rank of document $d$ in method $r$, and $k$ is a smoothing parameter (typically 60).

Weighted Fusion: Weighted sum of the similarity scores from both retrievers, with weights tuned to the data characteristics (e.g., dense:sparse = 0.7:0.3).
Reranking Fusion: First retrieve top-k candidates with both methods (e.g., 20 each), merge and deduplicate to get candidate set, then use Cross-Encoder reranking model to fine-rank all candidates, selecting final top-k.
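As a concrete illustration, RRF can be implemented in a few lines of plain Python; the document ids and ranked lists below are hypothetical:

```python
def rrf_fuse(rankings, k=60):
    """Merge several ranked lists of document ids via Reciprocal Rank Fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_ranking = ["d1", "d2", "d3"]    # hypothetical dense-retrieval order
sparse_ranking = ["d3", "d1", "d4"]   # hypothetical BM25 order
fused = rrf_fuse([dense_ranking, sparse_ranking])
```

Documents ranked highly by both methods (here "d1" and "d3") float to the top, while documents found by only one method are kept but demoted.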
```python
from rank_bm25 import BM25Okapi

def hybrid_retrieval(query, query_embedding, vectorstore, documents, top_k=5, k=60):
    """Fuse dense and sparse rankings with Reciprocal Rank Fusion."""
    dense = [d.page_content for d in
             vectorstore.similarity_search_by_vector(query_embedding, k=top_k * 2)]
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    sparse = bm25.get_top_n(query.lower().split(), documents, n=top_k * 2)

    scores = {}
    for ranking in (dense, sparse):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```
Reranking Techniques
Reranking fine-ranks initial retrieval results to improve final result quality.
Cross-Encoder Reranking
Cross-Encoder inputs query and document together into the model to compute relevance scores.
Advantages: - High accuracy - Can understand query-document interactions
Disadvantages: - High computational cost (cannot pre-compute) - Slow
```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "What is RAG?"
candidates = ["RAG augments generation with retrieval.", "FAISS is a vector index."]
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), reverse=True)]
```
Multi-Stage Retrieval
Multi-stage retrieval combines fast retrieval with precise reranking:
- Stage 1: Use fast methods (Dense/Sparse) to retrieve many candidates (e.g., 100)
- Stage 2: Use reranking model to fine-rank candidates (e.g., top-5)
```python
from sentence_transformers import CrossEncoder

def multi_stage_retrieval(query, query_embedding, vectorstore, top_k=5):
    # Stage 1: fast dense retrieval of a large candidate set
    candidates = vectorstore.similarity_search_by_vector(query_embedding, k=100)
    # Stage 2: precise Cross-Encoder reranking of the candidates
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = reranker.predict([(query, doc.page_content) for doc in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]
```
Query Rewriting and Expansion
Query optimization can improve retrieval performance, including query rewriting, query expansion, and query decomposition.
Query Rewriting
Query rewriting converts user queries into forms more suitable for retrieval.
Methods:
1. Synonym expansion: Add synonyms
2. Query completion: Complete incomplete queries
3. Query simplification: Remove redundant words
```python
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)

def rewrite_query(query, llm):
    prompt = (
        "Rewrite the following query so it is clearer and better suited "
        "for document retrieval. Output only the rewritten query.\n"
        f"Query: {query}"
    )
    return llm(prompt).strip()
```
Query Expansion
Query expansion adds related terms and concepts.
```python
def expand_query(query, llm):
    """Append related terms and concepts to the original query."""
    prompt = (
        "List 3 terms or concepts closely related to the query below, "
        f"separated by commas.\nQuery: {query}"
    )
    expansions = llm(prompt).strip()
    return f"{query} {expansions}"
```
Query Decomposition
For complex queries, decompose into multiple sub-queries.
```python
def decompose_query(query, llm):
    """Split a complex query into simpler sub-queries, one per line."""
    prompt = (
        "Decompose the complex query below into simple sub-queries, "
        f"one per line.\nQuery: {query}"
    )
    return [q.strip() for q in llm(prompt).splitlines() if q.strip()]
```
Practical: Building Enterprise-Grade RAG Systems
Building RAG with LangChain
LangChain provides a complete RAG toolchain.
```python
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI

documents = DirectoryLoader("./knowledge_base").load()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
vectorstore = FAISS.from_documents(chunks, OpenAIEmbeddings())

qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True,  # trace answers back to their sources
)
result = qa_chain({"query": "What is RAG?"})
```
Building RAG with LlamaIndex
LlamaIndex focuses on the data layer for LLM applications.
```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext

service_context = ServiceContext.from_defaults(chunk_size=512)
documents = SimpleDirectoryReader("./knowledge_base").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query("What is RAG?")
```
Advanced RAG Patterns
Parent-Child Retrieval:
- Storage: Split documents into small chunks (child chunks)
- Retrieval: Retrieve child chunks, but return parent chunks (containing more context)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings

# Small child chunks are indexed; larger parent chunks are returned
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)

retriever = ParentDocumentRetriever(
    vectorstore=Chroma(embedding_function=OpenAIEmbeddings()),
    docstore=InMemoryStore(),
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```
Self-RAG:
- Use the LLM to judge whether retrieval is needed
- Critically evaluate the retrieval results
- Decide whether to use the retrieved information based on that evaluation
```python
def self_rag(query, llm, vectorstore):
    # 1. Judge whether retrieval is needed at all
    need = llm(f"Does answering this require external knowledge? Answer yes or no.\nQuery: {query}")
    if "yes" not in need.lower():
        return llm(query)
    # 2. Retrieve, then critically evaluate the results
    docs = vectorstore.similarity_search(query, k=5)
    context = "\n".join(d.page_content for d in docs)
    relevant = llm(f"Is the context relevant to the query? Answer yes or no.\n"
                   f"Query: {query}\nContext: {context}")
    # 3. Use the retrieved information only if it was judged relevant
    if "yes" in relevant.lower():
        return llm(f"Answer based only on the context.\nContext: {context}\nQuery: {query}")
    return llm(query)
```
❓ Q&A: Common Questions on RAG
Q1: What's the difference between RAG and fine-tuning? When to use RAG?
A:
- RAG: Dynamically retrieves external knowledge; suitable for frequently updated knowledge bases, access to the latest information, and multi-domain knowledge
- Fine-tuning: Encodes knowledge into model parameters; suitable for specific tasks, relatively stable knowledge, and maximum task performance
- Choice: If knowledge needs frequent updates or involves private data, choose RAG; if the task is fixed and performance requirements are high, consider fine-tuning
Q2: How to choose a vector database?
A: The choice depends on:
- Data scale: Millions of vectors, use FAISS; tens of millions+, use Milvus
- Deployment: Local, use FAISS/Chroma; cloud, use Milvus/Pinecone
- Ops capability: If you don't want to manage infrastructure, use Pinecone; with an ops team, use Milvus
- Development stage: Rapid prototyping, use FAISS/Chroma; production, use Milvus
Q3: How to choose Embedding models?
A:
- General scenarios: all-mpnet-base-v2 or text-embedding-ada-002
- Resource-constrained: all-MiniLM-L6-v2 (fast, low dimension)
- Q&A tasks: multi-qa-mpnet-base
- Multilingual: paraphrase-multilingual-mpnet-base-v2
- Domain-specific: Fine-tune general models with domain data
Q4: How to choose between Dense Retrieval and Sparse Retrieval?
A:
- Dense Retrieval: Suitable for semantic understanding, synonym matching, concept retrieval
- Sparse Retrieval: Suitable for exact keyword matching, term retrieval
- Recommendation: Use Hybrid Retrieval, combining the strengths of both
Q5: How to improve retrieval accuracy?
A: Multiple approaches:
1. Optimize Embedding: Use better models or domain fine-tuning
2. Improve splitting strategy: Choose appropriate splitting methods based on document characteristics
3. Use Reranking: Cross-Encoder reranking
4. Query optimization: Query rewriting, expansion, decomposition
5. Multi-stage retrieval: Coarse ranking followed by fine ranking
Q6: What to do when RAG systems have hallucinations?
A:
1. Improve retrieval quality: Ensure retrieved documents are relevant to the query
2. Prompt design: Explicitly require the model to answer based on the retrieved content, and to say "I don't know" when the answer is absent
3. Result verification: Fact-check key information
4. Use Self-RAG: Let the model evaluate the relevance of retrieval results
5. Confidence scoring: Give generated results confidence scores and warn the user when confidence is low
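Point 2 (prompt design) can be made concrete with a grounding prompt template; the template text below is one illustrative phrasing, not a canonical prompt:

```python
GROUNDED_PROMPT = """Answer the question using ONLY the context below.
If the context does not contain the answer, reply "I don't know".

Context:
{context}

Question: {question}
Answer:"""

def build_grounded_prompt(context, question):
    """Fill the grounding template with retrieved context and the user question."""
    return GROUNDED_PROMPT.format(context=context, question=question)

prompt = build_grounded_prompt("RAG retrieves documents before generating.",
                               "What does RAG do?")
```

Constraining the model to the retrieved context, with an explicit "I don't know" escape hatch, noticeably reduces fabricated answers.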
Q7: How to handle long documents?
A:
1. Parent-Child Retrieval: Retrieve small chunks, return large chunks
2. Sliding window: Include adjacent chunks during retrieval
3. Document summarization: Generate summaries for long documents, then retrieve over the summaries
4. Hierarchical retrieval: Retrieve chapters first, then specific content
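The parent-child idea from point 1 can be sketched without any framework: index small child chunks for precise matching, but return the enclosing parent chunk for context. The chunk sizes, helper names, and keyword matching below are illustrative simplifications:

```python
def build_parent_child(document, parent_size=200, child_size=50):
    """Split a document into parent chunks, and child chunks that remember their parent."""
    parents = [document[i:i + parent_size] for i in range(0, len(document), parent_size)]
    children = []
    for pid, parent in enumerate(parents):
        for j in range(0, len(parent), child_size):
            children.append({"text": parent[j:j + child_size], "parent_id": pid})
    return parents, children

def retrieve_parent(query_term, parents, children):
    """Match a child chunk (here by keyword), but return the larger parent chunk."""
    for child in children:
        if query_term in child["text"]:
            return parents[child["parent_id"]]
    return None

document = ("A" * 190) + "needleXYZ" + ("B" * 150)
parents, children = build_parent_child(document)
hit = retrieve_parent("needle", parents, children)
```

A real system would match children by vector similarity rather than keywords, but the child-to-parent mapping works the same way.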
Q8: How to optimize RAG system latency?
A:
1. Async retrieval: Retrieve for multiple queries in parallel
2. Caching: Cache the results of common queries
3. Index optimization: Use faster indexes (e.g., HNSW)
4. Batch processing: Process multiple queries in batches
5. Model optimization: Use faster Embedding and generative models
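Point 2 (caching) can be sketched as a thin wrapper around any retrieval function; `CachedRetriever` and `fake_retrieve` are hypothetical names used for illustration:

```python
import hashlib

class CachedRetriever:
    """Wrap a retrieval function with an exact-match query cache."""
    def __init__(self, retrieve_fn):
        self.retrieve_fn = retrieve_fn
        self.cache = {}
        self.hits = 0

    def retrieve(self, query):
        key = hashlib.sha256(query.encode("utf-8")).hexdigest()
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        result = self.retrieve_fn(query)
        self.cache[key] = result
        return result

calls = []
def fake_retrieve(query):          # stand-in for a real (slow) retriever
    calls.append(query)
    return [f"doc for {query}"]

retriever = CachedRetriever(fake_retrieve)
retriever.retrieve("what is RAG")
retriever.retrieve("what is RAG")  # served from cache; backend not called again
```

Exact-match caching only helps with literally repeated queries; semantic caching (matching by query embedding) extends the idea to paraphrases at the cost of extra complexity.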
Q9: How to evaluate RAG system performance?
A: Evaluation metrics:
- Retrieval metrics: Recall@K, MRR (Mean Reciprocal Rank), NDCG
- Generation metrics: BLEU, ROUGE, BERTScore, human evaluation
- End-to-end metrics: Answer accuracy, relevance, completeness
- System metrics: Latency, throughput, cost
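Two of the retrieval metrics above, Recall@K and MRR, are simple enough to compute directly; the document ids in the usage below are illustrative:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top-k retrieved list."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(rankings, relevant_sets):
    """Mean Reciprocal Rank over a batch of queries."""
    total = 0.0
    for retrieved, relevant in zip(rankings, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

# Example: 1 of 2 relevant docs found in top-2; first relevant hits at ranks 2 and 1
r_at_2 = recall_at_k(["a", "b", "c"], ["b", "d"], 2)
mean_rr = mrr([["a", "b"], ["c", "a"]], [{"b"}, {"c"}])
```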
```python
def evaluate_rag_system(qa_chain, test_set):
    """test_set: list of {'question': ..., 'answer': ...} dicts."""
    correct = 0
    for example in test_set:
        result = qa_chain({"query": example["question"]})
        # Simple substring match; swap in ROUGE/BERTScore or human review as needed
        if example["answer"].lower() in result["result"].lower():
            correct += 1
    return {"accuracy": correct / len(test_set)}
```
Q10: How to build a multi-turn conversational RAG system?
A:
1. Context management: Maintain conversation history
2. Query rewriting: Combine the current query with the conversation history
3. Contextual retrieval: Consider the conversation context during retrieval
4. Memory mechanism: Distinguish short-term memory (the current conversation) from long-term memory (the knowledge base)
```python
class ConversationalRAG:
    def __init__(self, llm, vectorstore):
        self.llm = llm
        self.vectorstore = vectorstore
        self.history = []  # short-term memory: the current conversation

    def ask(self, query):
        # Rewrite the follow-up question into a standalone query using history
        if self.history:
            recent = "\n".join(self.history[-6:])
            query = self.llm(
                "Rewrite the follow-up question as a standalone question.\n"
                f"History:\n{recent}\nFollow-up: {query}"
            ).strip()
        # Retrieve from long-term memory (the knowledge base)
        docs = self.vectorstore.similarity_search(query, k=5)
        context = "\n".join(d.page_content for d in docs)
        answer = self.llm(
            f"Answer based on the context.\nContext: {context}\nQuestion: {query}"
        )
        self.history += [f"User: {query}", f"Assistant: {answer}"]
        return answer
```
RAG technology provides large language models with the ability to access external knowledge, a key technology for building knowledge-enhanced AI systems. An excellent RAG system requires careful design of various components, from vector database selection to retrieval strategy optimization, from query processing to result generation. In practice, it's necessary to select appropriate components and technologies based on specific needs, continuously optimize and iterate, to build efficient and accurate RAG systems.
- Post title: NLP (10): RAG and Knowledge Enhancement Systems
- Post author: Chen Kai
- Create time: 2024-03-28 15:00:00
- Post link: https://www.chenk.top/en/nlp-rag-knowledge-enhancement/
- Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.