NLP (12): Frontiers and Practical Applications
Chen Kai

The boundaries of large language model capabilities are rapidly expanding: from simple text generation to complex tool calling, from code completion to long document understanding, from single-turn dialogue to multi-turn reasoning. Behind these capabilities are breakthroughs in frontier research such as Agent architectures, code-specialized models, and long-context techniques.

However, capability improvements also bring new challenges. Models can "hallucinate" plausible-sounding but non-existent information, may generate harmful content, and must be aligned with human values. Equally important are the engineering questions: how do we deploy these technologies in production, design scalable architectures, and monitor and optimize performance?

This article dives deep into frontier NLP technologies: from the architectural design of Function Calling and ReAct agents to the code-generation principles of CodeLlama and StarCoder, and from the long-context techniques of LongLoRA and LongLLaMA to solutions for hallucination mitigation and safety alignment. Just as importantly, it provides a complete production-grade deployment solution: from FastAPI service design to Docker containerization, and from monitoring systems to performance optimization, with runnable code and best practices for each component.

Agents and Tool Use

Function Calling: Enabling LLMs to Call External Tools

Function Calling is a feature OpenAI introduced for its GPT-3.5 and GPT-4 models that allows a model to call external functions and APIs during text generation. This lets LLMs break free of pure-text limitations and interact with external systems (databases, APIs, tools), unlocking far more powerful capabilities.

Core Concept:

The Function Calling workflow consists of three steps:

  1. Function Definition: Developers define available functions and their parameters, including function names, descriptions, parameter types, and constraints. These definitions are provided to the model in JSON Schema format.

  2. Function Decision: Based on user queries and function definitions, the model decides whether function calls are needed. If needed, the model generates parameters conforming to function signatures (in JSON format).

  3. Function Execution and Result Integration: The system executes function calls, returns results to the model, and the model generates final answers based on function results.

Why It Works:

Function Calling's advantage lies in separating "understanding" and "execution": the model is responsible for understanding user intent and generating correct parameters, while external systems execute specific operations. This design ensures security (function execution in controlled environments) while providing flexibility (easy to add new functions).

Implementation Example:

import json
from openai import OpenAI

client = OpenAI()

# Define available functions
functions = [
    {
        "name": "get_weather",
        "description": "Get weather information for a specified city",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "City name, e.g., Beijing, Shanghai"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit"
                }
            },
            "required": ["location"]
        }
    },
    {
        "name": "send_email",
        "description": "Send an email",
        "parameters": {
            "type": "object",
            "properties": {
                "to": {"type": "string", "description": "Recipient email"},
                "subject": {"type": "string", "description": "Email subject"},
                "body": {"type": "string", "description": "Email body"}
            },
            "required": ["to", "subject", "body"]
        }
    }
]

def get_weather(location, unit="celsius"):
    """Simulate weather query function"""
    # In practice, call a real weather API
    return f"Weather in {location}: 25 degrees {unit}, sunny"

def send_email(to, subject, body):
    """Simulate email sending function"""
    # In practice, call an email service
    return f"Email sent to {to}"

def chat_with_functions(user_message):
    """Chat using Function Calling"""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_message}],
        functions=functions,
        function_call="auto"
    )

    message = response.choices[0].message

    # Check if a function call is needed
    if message.function_call:
        function_name = message.function_call.name
        function_args = json.loads(message.function_call.arguments)

        # Call the corresponding function
        if function_name == "get_weather":
            result = get_weather(**function_args)
        elif function_name == "send_email":
            result = send_email(**function_args)
        else:
            result = "Unknown function"

        # Return the function result to the model
        second_response = client.chat.completions.create(
            model="gpt-4",
            messages=[
                {"role": "user", "content": user_message},
                message,
                {
                    "role": "function",
                    "name": function_name,
                    "content": result
                }
            ],
            functions=functions
        )

        return second_response.choices[0].message.content
    else:
        return message.content

# Usage example
response = chat_with_functions("What's the weather in Beijing today?")
print(response)

ReAct: Combining Reasoning and Acting

ReAct (Reasoning + Acting) is a framework proposed by researchers at Princeton and Google that interleaves reasoning and acting, enabling models to complete complex tasks through iterative "thought-action-observation" loops. Unlike Function Calling's static tool calls, ReAct lets models dynamically plan steps and adjust strategies based on intermediate results.

ReAct Loop Explained:

ReAct's core is an iterative loop; each iteration runs three steps, repeated until termination:

  1. Thought: The model analyzes current state, task objectives, and available tools, deciding what operation to execute next. The thought process is explicit, output as text, making reasoning interpretable.

  2. Action: Based on thought results, the model selects tools to call and generates parameters. Action format is typically Action: [tool_name](parameters).

  3. Observation: The system executes actions and returns results, the model observes results and updates internal state. Observation results are added to context for next-round thinking.

  4. Iteration: Repeat the above process until the model outputs Final Answer or reaches maximum iterations.

Why It Works:

ReAct's advantage lies in making reasoning explicit, allowing models to "see" their own thought processes. This enables models to:

  • Adjust strategies based on intermediate results
  • Handle complex tasks requiring multi-step reasoning
  • Learn from errors (by observing failed action results)
  • Provide interpretable reasoning paths

Implementation Example:

from langchain.agents import initialize_agent, Tool
from langchain.llms import OpenAI
from langchain.utilities import SerpAPIWrapper

class ReActAgent:
    def __init__(self):
        self.llm = OpenAI(temperature=0)

        # Define tools
        search = SerpAPIWrapper()
        self.tools = [
            Tool(
                name="Search",
                func=search.run,
                description="Used to search for the latest information; input should be a search query"
            ),
            Tool(
                name="Calculator",
                func=self.calculator,
                description="Used to perform mathematical calculations; input should be a mathematical expression"
            )
        ]

        self.agent = initialize_agent(
            self.tools,
            self.llm,
            agent="zero-shot-react-description",
            verbose=True
        )

    def calculator(self, expression):
        """Calculator tool (note: eval on untrusted input is unsafe;
        use a proper expression parser in production)"""
        try:
            result = eval(expression)
            return str(result)
        except Exception:
            return "Calculation error"

    def run(self, query):
        """Execute query"""
        return self.agent.run(query)

# Usage example
agent = ReActAgent()
result = agent.run("Search for OpenAI's latest model, then calculate the result of 2024 minus 2015")
print(result)

Custom ReAct Implementation:

class CustomReActAgent:
    def __init__(self, llm, tools):
        self.llm = llm
        self.tools = {tool.name: tool for tool in tools}
        self.max_iterations = 10

    def run(self, query):
        """Execute the ReAct loop"""
        history = []

        for i in range(self.max_iterations):
            # Build prompt
            prompt = self.build_prompt(query, history)

            # Get model response
            response = self.llm(prompt)

            # Parse response
            thought, action, action_input = self.parse_response(response)

            history.append({
                "thought": thought,
                "action": action,
                "action_input": action_input
            })

            # Check if complete
            if action == "Final Answer":
                return action_input

            # Execute action
            if action in self.tools:
                observation = self.tools[action].run(action_input)
                history.append({"observation": observation})
            else:
                history.append({"observation": f"Unknown tool: {action}"})

        return "Reached maximum iterations, task not completed"

    def build_prompt(self, query, history):
        """Build prompt"""
        prompt = f"Question: {query}\n\n"

        for step in history:
            if "thought" in step:
                prompt += f"Thought: {step['thought']}\n"
            if "action" in step:
                prompt += f"Action: {step['action']}({step['action_input']})\n"
            if "observation" in step:
                prompt += f"Observation: {step['observation']}\n"

        prompt += "\nAvailable tools: " + ", ".join(self.tools.keys())
        prompt += "\nPlease continue in the format 'Thought -> Action -> Observation'."

        return prompt

    def parse_response(self, response):
        """Parse model response (simplified; should be more robust in practice)"""
        lines = response.strip().split("\n")
        thought = ""
        action = ""
        action_input = ""

        for line in lines:
            if line.startswith("Thought:"):
                thought = line.replace("Thought:", "").strip()
            elif line.startswith("Action:"):
                parts = line.replace("Action:", "").strip().split("(")
                action = parts[0].strip()
                if len(parts) > 1:
                    action_input = parts[1].rstrip(")").strip()

        return thought, action, action_input

Code Generation and Understanding

CodeLlama: Code-Specialized Large Model

CodeLlama is a code generation and understanding model developed by Meta based on LLaMA 2.

Features:

  • Multilingual Support: Python, C++, Java, PHP, TypeScript, C#, Bash, etc.
  • Multiple Variants: Base model, Python-specific, instruction-tuned versions
  • Long Context: Supports up to 100K tokens of context

Usage Example:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

class CodeLlamaGenerator:
    def __init__(self, model_name="codellama/CodeLlama-7b-Instruct-hf"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

    def generate_code(self, prompt, max_length=512, temperature=0.2):
        """Generate code"""
        # Build the instruction prompt format
        formatted_prompt = f"<s>[INST] {prompt} [/INST]"

        inputs = self.tokenizer(formatted_prompt, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_length,
                temperature=temperature,
                top_p=0.9,
                do_sample=True
            )

        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        # Extract the generated code portion
        code = generated_text.split("[/INST]")[-1].strip()
        return code

    def complete_code(self, code_context, language="python"):
        """Code completion"""
        prompt = f"Complete the following {language} code:\n\n{code_context}"
        return self.generate_code(prompt)

    def explain_code(self, code):
        """Code explanation"""
        prompt = f"Explain what the following code does:\n\n{code}"
        return self.generate_code(prompt, max_length=256)

# Usage example
generator = CodeLlamaGenerator()

# Generate code
code = generator.generate_code("Write a Python function to calculate fibonacci numbers")
print(code)

# Code completion
partial_code = """
def quicksort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
"""
completed = generator.complete_code(partial_code)
print(completed)

StarCoder: Trained on GitHub Code

StarCoder is a code model developed by the BigCode project (an open collaboration led by Hugging Face and ServiceNow), trained on permissively licensed GitHub code from The Stack dataset.

Features:

  • Large-Scale Training: Trained on code in 80+ programming languages
  • Long Context: Supports 8K tokens
  • Code Completion: Specifically optimized code completion capabilities

Usage Example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

class StarCoderGenerator:
    def __init__(self, model_name="bigcode/starcoder"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_name,
            device_map="auto",
            trust_remote_code=True
        )

    def complete(self, code, max_new_tokens=256):
        """Code completion"""
        inputs = self.tokenizer(code, return_tensors="pt").to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=0.2,
                top_p=0.95,
                do_sample=True
            )

        completed_code = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return completed_code

# Usage example
starcoder = StarCoderGenerator()
code = "def binary_search(arr, target):"
completed = starcoder.complete(code)
print(completed)

Code Understanding and Q&A

class CodeUnderstandingSystem:
    def __init__(self, model_name="codellama/CodeLlama-7b-Instruct-hf"):
        self.generator = CodeLlamaGenerator(model_name)

    def answer_question(self, code, question):
        """Answer questions about code"""
        prompt = f"""Given the following code:
{code}

Question: {question}

Please provide a detailed answer."""
        return self.generator.generate_code(prompt, max_length=256)

    def find_bugs(self, code):
        """Find bugs in code"""
        prompt = f"""Analyze the following code and identify any bugs or potential issues:

{code}

List all bugs found:"""
        return self.generator.generate_code(prompt)

    def refactor_code(self, code, instructions):
        """Refactor code"""
        prompt = f"""Refactor the following code according to these instructions: {instructions}

Original code:
{code}

Refactored code:"""
        return self.generator.generate_code(prompt)

# Usage example
code_understanding = CodeUnderstandingSystem()

code = """
def calculate_total(items):
    total = 0
    for item in items:
        total += item.price
    return total
"""

# Answer questions
answer = code_understanding.answer_question(code, "What does this function do?")
print(answer)

# Find bugs
bugs = code_understanding.find_bugs(code)
print(bugs)

Long-Context Modeling

Challenges and Solutions

Traditional Transformer attention has O(n²) computational complexity, where n is the sequence length, limiting the model's ability to handle long contexts.

Main Challenges:

  1. Computational Complexity: Attention matrix grows quadratically with sequence length
  2. Memory Usage: Need to store complete attention matrix
  3. Position Encoding: Need to handle positions beyond maximum training length
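
A quick back-of-the-envelope sketch makes the quadratic growth concrete (the head count and fp16 storage below are illustrative assumptions, not tied to any specific model):

```python
def attention_matrix_bytes(seq_len: int, num_heads: int = 32, dtype_bytes: int = 2) -> int:
    """Memory for one layer's attention score matrices (per batch element):
    num_heads matrices of shape (seq_len, seq_len), stored in fp16."""
    return num_heads * seq_len * seq_len * dtype_bytes

# Going from 2K to 32K context multiplies the score-matrix memory by 256x
for n in (2048, 32768):
    gib = attention_matrix_bytes(n) / 2**30
    print(f"{n:>6} tokens -> {gib:.2f} GiB per layer")
```

At 2K tokens the score matrices fit comfortably; at 32K tokens they alone would need tens of GiB per layer without sparse or chunked attention.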

LongLoRA: Efficient Long-Context Fine-Tuning

LongLoRA achieves efficient long-context fine-tuning through sparse attention mechanisms.

Core Idea:

  • Shifted Sparse Attention: Only compute local and global attention, reducing complexity
  • LoRA Fine-Tuning: Only fine-tune a small number of parameters, maintaining efficiency
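
The shift trick can be sketched as follows (an illustrative simplification in NumPy; the function names and shapes are ours, not LongLoRA's actual implementation): tokens are split into groups, attention is computed within each group, and half of the heads are shifted by half a group so information still flows across group boundaries.

```python
import numpy as np

def shift_half_heads(x: np.ndarray, group_size: int) -> np.ndarray:
    """x: (num_heads, seq_len, head_dim). Roll the second half of the heads
    by group_size // 2 so grouped attention windows overlap across heads."""
    shifted = x.copy()
    half = x.shape[0] // 2
    shifted[half:] = np.roll(x[half:], shift=-(group_size // 2), axis=1)
    return shifted

def grouped_attention(x: np.ndarray, group_size: int) -> np.ndarray:
    """Self-attention computed independently within each token group:
    O(seq_len * group_size) score entries instead of O(seq_len^2)."""
    num_heads, seq_len, head_dim = x.shape
    out = np.empty_like(x)
    for start in range(0, seq_len, group_size):
        g = x[:, start:start + group_size]                     # (H, G, D)
        scores = g @ g.transpose(0, 2, 1) / np.sqrt(head_dim)  # (H, G, G)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[:, start:start + group_size] = weights @ g
    return out

x = np.random.randn(4, 16, 8)
y = grouped_attention(shift_half_heads(x, group_size=4), group_size=4)
print(y.shape)  # (4, 16, 8)
```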

Implementation Example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

class LongLoRAModel:
    def __init__(self, base_model_name, max_length=8192):
        self.tokenizer = AutoTokenizer.from_pretrained(base_model_name)
        self.model = AutoModelForCausalLM.from_pretrained(
            base_model_name,
            torch_dtype=torch.float16,
            device_map="auto"
        )

        # Configure LoRA
        lora_config = LoraConfig(
            r=16,
            lora_alpha=32,
            target_modules=["q_proj", "v_proj"],
            lora_dropout=0.1,
            bias="none",
            task_type="CAUSAL_LM"
        )

        self.model = get_peft_model(self.model, lora_config)
        self.max_length = max_length

    def generate(self, prompt, max_new_tokens=256):
        """Generate long text"""
        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=self.max_length
        ).to(self.model.device)

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=True,
                temperature=0.7,
                top_p=0.9
            )

        generated_text = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        return generated_text

# Usage example
long_model = LongLoRAModel("meta-llama/Llama-2-7b-hf", max_length=8192)
long_prompt = "..."  # Very long prompt
result = long_model.generate(long_prompt)

LongLLaMA: Extending Context Window

LongLLaMA extends the context window through the FoT (Focused Transformer) mechanism.

FoT Mechanism:

  • Memory Layer: Stores long-term memory
  • Attention Mechanism: Establishes connections between memory layer and current context
import torch

class LongLLaMAModel:
    def __init__(self, base_model, memory_size=4096):
        self.base_model = base_model
        self.memory_size = memory_size
        self.memory = None

    def forward_with_memory(self, input_ids, attention_mask):
        """Forward pass with memory"""
        # Combine current input with stored memory
        if self.memory is not None:
            extended_input = torch.cat([self.memory, input_ids], dim=1)
            extended_mask = torch.cat([
                torch.ones_like(self.memory),
                attention_mask
            ], dim=1)
        else:
            extended_input = input_ids
            extended_mask = attention_mask

        # Forward pass
        outputs = self.base_model(
            input_ids=extended_input,
            attention_mask=extended_mask
        )

        # Update memory (keep the last memory_size tokens)
        if extended_input.size(1) > self.memory_size:
            self.memory = extended_input[:, -self.memory_size:]
        else:
            self.memory = extended_input

        return outputs

Hallucination and Mitigation

Definition and Types of Hallucination

Hallucination refers to model output that is fluent and plausible but unsupported by the input, the training data, or real-world facts.

Types of Hallucination:

  1. Factual Hallucination: Generated content contradicts real-world facts
  2. Logical Hallucination: The reasoning process is incorrect
  3. Consistency Hallucination: The output contradicts itself or the input
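
A cheap, model-agnostic hallucination signal, in the spirit of sampling-based checks such as SelfCheckGPT, is to sample several answers to the same question and measure their agreement; the overlap metric below is a simplified illustration of that idea:

```python
def consistency_score(samples: list[str]) -> float:
    """Average pairwise Jaccard overlap between word sets of sampled answers.
    Low agreement across samples is a cheap hallucination warning signal."""
    sets = [set(s.lower().split()) for s in samples]
    pairs, total = 0, 0.0
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            total += len(sets[i] & sets[j]) / len(sets[i] | sets[j])
            pairs += 1
    return total / pairs if pairs else 1.0

print(consistency_score([
    "Paris is the capital of France",
    "The capital of France is Paris",
    "Lyon is the capital of France",
]))
```

When the model is confident about a fact, resampled answers tend to agree; fabricated details vary from sample to sample and pull the score down.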

Mitigation Strategies

Retrieval-Augmented Generation (RAG):

from langchain.vectorstores import FAISS
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

class RAGSystem:
    def __init__(self, documents, llm_model="gpt-3.5-turbo"):
        # Text splitting
        text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=1000,
            chunk_overlap=200
        )
        texts = text_splitter.split_documents(documents)

        # Create vector store
        embeddings = OpenAIEmbeddings()
        self.vectorstore = FAISS.from_documents(texts, embeddings)

        # Create retrieval chain (gpt-3.5-turbo is a chat model, so use ChatOpenAI)
        self.qa_chain = RetrievalQA.from_chain_type(
            llm=ChatOpenAI(model_name=llm_model),
            chain_type="stuff",
            retriever=self.vectorstore.as_retriever(),
            return_source_documents=True
        )

    def query(self, question):
        """Query and return the answer and its sources"""
        result = self.qa_chain({"query": question})
        return {
            "answer": result["result"],
            "sources": result["source_documents"]
        }

# Usage example
documents = [...]  # Document list
rag = RAGSystem(documents)
result = rag.query("What is machine learning?")
print(f"Answer: {result['answer']}")
print(f"Sources: {result['sources']}")

Confidence Scoring:

class ConfidenceScorer:
    def __init__(self, model):
        self.model = model

    def score_response(self, question, answer, context=None):
        """Score answer confidence"""
        # Method 1: Probability-based confidence
        probs = self.get_token_probabilities(question, answer)
        avg_prob = probs.mean()

        # Method 2: Consistency-based confidence
        consistency = self.check_consistency(answer, context)

        # Method 3: Factuality-based confidence
        factuality = self.check_factuality(answer, context)

        confidence = (avg_prob + consistency + factuality) / 3
        return confidence

    def get_token_probabilities(self, question, answer):
        """Get token probabilities"""
        # Implement token probability calculation
        pass

    def check_consistency(self, answer, context):
        """Check consistency"""
        # Implement consistency check
        pass

    def check_factuality(self, answer, context):
        """Check factuality"""
        # Implement factuality check
        pass
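
The probability-based signal sketched in get_token_probabilities can be reduced to a single scalar, for example the geometric mean of per-token probabilities. The helper below is a hypothetical sketch, assuming the serving API returns per-token log-probabilities alongside the generated text:

```python
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability: exp(mean log p). Values near 1 mean
    the model assigned high probability to every generated token."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

# e.g. log-probabilities returned by an API for each generated token
print(round(sequence_confidence([-0.05, -0.2, -0.1]), 3))
```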

Safety and Alignment

Safety Challenges

Main Risks:

  1. Harmful Content Generation: Generate violent, discriminatory content
  2. Privacy Leakage: Leak sensitive information from training data
  3. Misuse: Used for malicious purposes

Alignment Techniques

RLHF (Reinforcement Learning from Human Feedback):

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

class RLHFTrainer:
    def __init__(self, model, reward_model):
        self.model = model
        self.reward_model = reward_model

    def train_step(self, prompts, responses):
        """RLHF training step"""
        # 1. Generate responses
        logits = self.model(prompts)

        # 2. Compute rewards
        rewards = self.reward_model(prompts, responses)

        # 3. Compute policy gradient
        loss = self.compute_policy_gradient(logits, rewards)

        # 4. Update model
        loss.backward()
        return loss.item()

    def compute_policy_gradient(self, logits, rewards):
        """Compute policy gradient (PPO or another RL algorithm)"""
        pass
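
The compute_policy_gradient stub above is typically filled in with PPO's clipped surrogate objective. The scalar sketch below is our own simplification of the batched tensor version; it shows the clipping logic only, not the full PPO update:

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate objective from PPO (negated, since we minimize).
    The clip keeps the policy ratio within [1 - eps, 1 + eps], preventing
    the updated policy from drifting too far from the sampling policy."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(min(ratio, 1 + eps), 1 - eps)
    return -min(ratio * advantage, clipped * advantage)

# A large positive update on a good action is clipped at ratio 1 + eps = 1.2
print(ppo_clip_loss(logp_new=-0.1, logp_old=-1.0, advantage=2.0))
```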

Safety Filters:

import re

class SafetyFilter:
    def __init__(self):
        self.harmful_patterns = [
            r"violence",
            r"discrimination",
            # ... more patterns
        ]

    def filter(self, text):
        """Filter harmful content"""
        for pattern in self.harmful_patterns:
            if re.search(pattern, text, re.IGNORECASE):
                return None, "Harmful content detected"
        return text, None

Model Evaluation System

Evaluation Dimensions

1. Capability Evaluation:

  • Language Understanding: GLUE, SuperGLUE
  • Language Generation: BLEU, ROUGE, METEOR
  • Reasoning Ability: GSM8K, HellaSwag

2. Safety Evaluation:

  • Harmful Content Detection
  • Bias Detection
  • Privacy Risk Assessment

3. Efficiency Evaluation:

  • Inference Speed
  • Memory Usage
  • Energy Consumption
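
As a concrete example of the generation metrics above, BLEU is built from clipped n-gram precisions. The function below is only the BLEU-1 (unigram) building block; real BLEU combines n = 1..4 precisions with a brevity penalty:

```python
from collections import Counter

def unigram_precision(reference: str, candidate: str) -> float:
    """Clipped unigram precision: each candidate word counts at most as
    many times as it appears in the reference (the BLEU-1 building block)."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

# 5 of the 6 candidate words are matched (clipped) in the reference
print(unigram_precision("the cat is on the mat", "the cat sat on the mat"))
```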

Evaluation Framework

import numpy as np

class ModelEvaluator:
    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer

    def evaluate_glue(self, dataset):
        """Evaluate GLUE tasks"""
        results = {}
        for task_name, task_data in dataset.items():
            accuracy = self.evaluate_task(task_name, task_data)
            results[task_name] = accuracy
        return results

    def evaluate_generation(self, test_set):
        """Evaluate generation quality"""
        bleu_scores = []
        rouge_scores = []

        for example in test_set:
            generated = self.generate(example["input"])
            bleu = self.compute_bleu(example["reference"], generated)
            rouge = self.compute_rouge(example["reference"], generated)
            bleu_scores.append(bleu)
            rouge_scores.append(rouge)

        return {
            "bleu": np.mean(bleu_scores),
            "rouge": np.mean(rouge_scores)
        }

    def evaluate_safety(self, test_prompts):
        """Evaluate safety"""
        harmful_count = 0
        for prompt in test_prompts:
            response = self.generate(prompt)
            if self.is_harmful(response):
                harmful_count += 1

        return {
            "harmful_rate": harmful_count / len(test_prompts),
            "safe_rate": 1 - harmful_count / len(test_prompts)
        }

Practical: Complete NLP Project Deployment

Project Structure

nlp-service/
├── app/
│   ├── __init__.py
│   ├── main.py          # FastAPI application
│   ├── models.py        # Model loading
│   ├── routers/         # API routes
│   │   ├── chat.py
│   │   ├── embedding.py
│   │   └── generation.py
│   └── utils/
│       ├── logging.py
│       └── monitoring.py
├── tests/
├── docker/
│   └── Dockerfile
├── requirements.txt
├── docker-compose.yml
└── README.md

FastAPI Application

# app/main.py
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
from app.models import ModelManager
from app.routers import chat, embedding, generation
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

app = FastAPI(title="NLP Service API", version="1.0.0")

# CORS configuration
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Global model manager
model_manager = ModelManager()

@app.on_event("startup")
async def startup_event():
    """Load models on startup"""
    logger.info("Loading models...")
    await model_manager.load_models()
    logger.info("Models loaded successfully")

@app.on_event("shutdown")
async def shutdown_event():
    """Clean up resources on shutdown"""
    logger.info("Shutting down...")
    await model_manager.cleanup()

# Register routes
app.include_router(chat.router, prefix="/api/v1", tags=["chat"])
app.include_router(embedding.router, prefix="/api/v1", tags=["embedding"])
app.include_router(generation.router, prefix="/api/v1", tags=["generation"])

@app.get("/health")
async def health_check():
    """Health check"""
    return {"status": "healthy", "models_loaded": model_manager.models_loaded}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Model Management

# app/models.py
import asyncio
from functools import partial

import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

class ModelManager:
    def __init__(self):
        self.models = {}
        self.tokenizers = {}
        self.models_loaded = False

    async def load_models(self):
        """Asynchronously load models"""
        # Chat model
        await self._load_model_async(
            "chat",
            "meta-llama/Llama-2-7b-chat-hf",
            AutoModelForCausalLM
        )

        # Embedding model (an encoder, so load with AutoModel)
        await self._load_model_async(
            "embedding",
            "sentence-transformers/all-MiniLM-L6-v2",
            AutoModel
        )

        self.models_loaded = True

    async def _load_model_async(self, name, model_path, model_cls):
        """Asynchronously load a single model"""
        loop = asyncio.get_event_loop()

        # Run blocking loads in the default thread pool;
        # functools.partial lets us pass keyword arguments through
        tokenizer = await loop.run_in_executor(
            None,
            AutoTokenizer.from_pretrained,
            model_path
        )

        model = await loop.run_in_executor(
            None,
            partial(
                model_cls.from_pretrained,
                model_path,
                torch_dtype=torch.float16,
                device_map="auto"
            )
        )

        self.tokenizers[name] = tokenizer
        self.models[name] = model

    async def cleanup(self):
        """Clean up resources"""
        self.models.clear()
        torch.cuda.empty_cache()

    def get_model(self, name: str):
        """Get model"""
        if name not in self.models:
            raise ValueError(f"Model {name} not loaded")
        return self.models[name]

    def get_tokenizer(self, name: str):
        """Get tokenizer"""
        if name not in self.tokenizers:
            raise ValueError(f"Tokenizer {name} not loaded")
        return self.tokenizers[name]

API Routes

# app/routers/chat.py
import torch
from fastapi import APIRouter, Depends, HTTPException
from pydantic import BaseModel

from app.models import ModelManager

router = APIRouter()

class ChatRequest(BaseModel):
    message: str
    max_tokens: int = 256
    temperature: float = 0.7

class ChatResponse(BaseModel):
    response: str
    tokens_used: int

def get_model_manager() -> ModelManager:
    """Dependency providing the shared ModelManager instance"""
    from app.main import model_manager
    return model_manager

@router.post("/chat", response_model=ChatResponse)
async def chat(
    request: ChatRequest,
    model_manager: ModelManager = Depends(get_model_manager)
):
    """Chat endpoint"""
    try:
        model = model_manager.get_model("chat")
        tokenizer = model_manager.get_tokenizer("chat")

        # Encode input
        inputs = tokenizer(request.message, return_tensors="pt").to(model.device)

        # Generate
        with torch.no_grad():
            outputs = model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=True
            )

        # Decode
        response_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        tokens_used = outputs.shape[1] - inputs.input_ids.shape[1]

        return ChatResponse(
            response=response_text,
            tokens_used=tokens_used
        )
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

Docker Configuration

# docker/Dockerfile
FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

WORKDIR /app

# Install Python (Ubuntu 22.04 ships Python 3.10)
RUN apt-get update && apt-get install -y \
    python3 \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install dependencies
COPY requirements.txt .
RUN pip3 install --no-cache-dir -r requirements.txt

# Copy application code
COPY app/ ./app/

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV CUDA_VISIBLE_DEVICES=0

# Expose port
EXPOSE 8000

# Start command
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

Docker Compose

# docker-compose.yml
version: '3.8'

services:
  nlp-service:
    build:
      context: .
      dockerfile: docker/Dockerfile
    ports:
      - "8000:8000"
    environment:
      - CUDA_VISIBLE_DEVICES=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/app/models
    restart: unless-stopped

  monitoring:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml

Monitoring System

# app/utils/monitoring.py
import time

import torch
from prometheus_client import Counter, Gauge, Histogram

# Metric definitions
request_count = Counter(
    'nlp_requests_total',
    'Total number of requests',
    ['endpoint', 'status']
)

request_duration = Histogram(
    'nlp_request_duration_seconds',
    'Request duration in seconds',
    ['endpoint']
)

model_memory = Gauge(
    'nlp_model_memory_bytes',
    'Model memory usage in bytes'
)

def track_request(endpoint, status):
    """Track a request"""
    request_count.labels(endpoint=endpoint, status=status).inc()

def track_duration(endpoint, duration):
    """Track request duration"""
    request_duration.labels(endpoint=endpoint).observe(duration)

def update_memory_usage():
    """Update GPU memory usage"""
    if torch.cuda.is_available():
        memory = torch.cuda.memory_allocated()
        model_memory.set(memory)
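
Wiring these helpers into request handlers is usually done with a decorator or middleware. The sketch below is dependency-free to keep it self-contained; in the service, the recorded duration would go to track_duration above instead of a local dict:

```python
import time
from functools import wraps

durations: dict[str, list[float]] = {}

def timed(endpoint: str):
    """Decorator recording wall-clock duration per endpoint; in the service
    this would call track_duration() from app/utils/monitoring.py instead."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                durations.setdefault(endpoint, []).append(time.perf_counter() - start)
        return wrapper
    return decorator

@timed("chat")
def handle_chat(message: str) -> str:
    return message.upper()

handle_chat("hello")
print(len(durations["chat"]))  # one recorded observation
```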

Deployment Script

#!/bin/bash
# deploy.sh

echo "Building Docker image..."
docker build -t nlp-service:latest -f docker/Dockerfile .

echo "Stopping existing containers..."
docker-compose down

echo "Starting services..."
docker-compose up -d

echo "Waiting for services to be ready..."
sleep 10

echo "Checking health..."
curl http://localhost:8000/health

echo "Deployment complete!"

❓ Q&A: Common Questions About Frontiers and Practical Applications

Q1: What's the difference between Function Calling and ReAct?

A: Function Calling is a static tool calling mechanism where the model decides whether to call based on function definitions. ReAct is a dynamic reasoning-action loop where the model can autonomously plan steps and execute iteratively. Function Calling is better suited for structured tool calls, while ReAct is better for complex task planning.

Q2: How to choose between CodeLlama and StarCoder?

A: CodeLlama is based on LLaMA 2 with stronger instruction-following capabilities, suitable for code generation and Q&A. StarCoder is trained on larger-scale code with stronger code completion capabilities. The choice depends on specific needs: choose CodeLlama for conversational code generation, choose StarCoder for code completion.

Q3: What are the practical application scenarios for long-context models?

A: Main scenarios: 1) Long document Q&A and summarization; 2) Codebase understanding and generation; 3) Multi-turn dialogue history maintenance; 4) Long text analysis. Note that long contexts increase computational costs and require trade-offs.

Q4: How to effectively mitigate model hallucination?

A: Comprehensive strategies: 1) Use RAG to provide external knowledge; 2) Implement confidence scoring and uncertainty quantification; 3) Add fact-checking steps; 4) Use more reliable models; 5) Human review of critical outputs.

Q5: How much human annotation is needed for RLHF training?

A: Typically requires thousands to tens of thousands of human feedback data points. Can use semi-automatic methods: first generate initial feedback using rules or models, then human review and correction to improve efficiency.

Q6: Key considerations for deploying NLP models in production?

A: Key factors: 1) Model size and inference speed; 2) GPU memory and cost; 3) Concurrent processing capability; 4) Error handling and fallback strategies; 5) Monitoring and logging; 6) Security and access control.

Q7: How to optimize NLP service performance?

A: Optimization strategies: 1) Model quantization (INT8/INT4); 2) Batch request processing; 3) Use KV caching; 4) Model distillation; 5) Use smaller model variants; 6) Asynchronous processing; 7) CDN caching for static content.
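
The INT8 quantization mentioned in the answer stores each weight in a single byte plus a per-tensor scale. A minimal symmetric-quantization sketch (illustrative only, not a drop-in for libraries like bitsandbytes):

```python
def quantize_int8(values):
    """Symmetric int8 quantization: scale by max |v|, round into [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0
    q = [round(v / scale) for v in values]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return [x * scale for x in q]

weights = [0.5, -1.27, 0.03]
q, scale = quantize_int8(weights)
print(q, [round(w, 2) for w in dequantize(q, scale)])
```

The memory saving is 4x versus fp32 (2x versus fp16) at the cost of a small rounding error, which is why INT8 is usually the first optimization to try.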

Q8: How to handle large model files in Docker deployment?

A: Solutions: 1) Use Docker volumes to mount model directories; 2) Use model caching (e.g., the Hugging Face cache directory); 3) Pre-download models during the image build; 4) Use a dedicated model server (e.g., NVIDIA Triton Inference Server).

Q9: How to monitor NLP service health?

A: Monitoring metrics: 1) Request volume and response time; 2) Error rate and anomalies; 3) GPU usage and memory; 4) Model output quality (sampling evaluation); 5) User feedback. Use Prometheus + Grafana for visualization.

Q10: How to manage resources for multi-model services?

A: Strategies: 1) Use model queues and priority scheduling; 2) Dynamic model loading/unloading; 3) Use model servers (e.g., TorchServe, Triton); 4) Implement resource quotas and rate limiting; 5) Use Kubernetes for resource management and auto-scaling.

  • Post title: NLP (12): Frontiers and Practical Applications
  • Post author: Chen Kai
  • Create time: 2024-04-11 14:45:00
  • Post link: https://www.chenk.top/en/nlp-frontiers-applications/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.