Large language models have fundamentally changed how we interact with
AI systems. Yet most users still struggle to extract their full
potential. The difference between a mediocre response and an exceptional
one often comes down to prompt engineering — a practice that blends
empirical experimentation with systematic methodology.
This guide walks you through the entire spectrum of prompt
engineering, from foundational techniques that require no special
knowledge to cutting-edge optimization frameworks used in production
systems. You'll learn not just what works, but why it works, backed by
research findings and practical code examples. Whether you're building
AI applications or simply want better ChatGPT responses, the principles
here apply universally across modern LLMs.
Why Prompt Engineering
Matters
When OpenAI released GPT-3 in 2020, researchers quickly discovered
something surprising: the same model could produce vastly different
results depending on how you phrased your request. A poorly worded
prompt might generate nonsense, while a carefully crafted one could
solve complex reasoning tasks. This wasn't a bug — it was a fundamental
property of how these models learn language patterns.
Traditional programming operates on exact instructions: write a
function, specify inputs and outputs, and the computer executes
deterministically. Language models work differently. They predict the
most likely continuation of text based on patterns learned from
trillions of tokens. Your prompt doesn't command the model; it sets up a
context that nudges probability distributions toward useful outputs.
Think of it like this: asking "What's the capital of France?" is
simple lookup. But asking "Explain why the French Revolution began"
requires the model to synthesize historical context, identify causal
relationships, and structure a coherent narrative. The quality of that
synthesis depends heavily on how you frame the problem.
Early prompt engineering was mostly trial and error. Researchers
would tweak phrasing, add examples, or modify formatting until results
improved. But as models grew more capable, systematic approaches
emerged. Today's advanced techniques — chain-of-thought reasoning, tree
search algorithms, automatic prompt optimization — represent a mature
field backed by rigorous evaluation.
The stakes are high. A well-engineered prompt can reduce API costs by
10x through more efficient context usage. It can boost task accuracy
from 40% to 90% on complex reasoning benchmarks. For production systems
handling millions of requests, these improvements translate to real
business value.
Fundamental Techniques
Before diving into advanced methods, master these core approaches.
They form the building blocks of all prompt engineering strategies.
Zero-Shot Prompting
Zero-shot prompting means asking the model to perform a task without
providing any examples. You rely entirely on the model's pre-training to
understand and execute your request.
Example:
1 2 3 4 5 6
prompt = """ Classify the sentiment of this review: "The movie was boring and predictable. I wouldn't recommend it." Sentiment: """
The model sees no examples of sentiment classification. It must infer
from its training that "boring" and "wouldn't recommend" indicate
negative sentiment.
When zero-shot works well:
Simple, well-defined tasks the model has seen during training
(translation, summarization, basic Q&A)
Tasks with clear conventions (like sentiment being "positive" or
"negative")
When you want fast prototyping without gathering examples
When zero-shot fails:
Domain-specific jargon or specialized tasks
Ambiguous instructions
Tasks requiring specific output formatting
Research from the GPT-3 paper (Brown et al., 2020) showed zero-shot
accuracy of 59% on natural language inference tasks. For comparison,
few-shot prompting improved this to 70%, demonstrating the value of
examples.
Optimization tips:
Be explicit about the task. Instead of "Tell me
about this review," say "Classify the sentiment as positive, negative,
or neutral."
Specify output format. Add "Return only one
word: positive, negative, or neutral" to prevent verbose
responses.
Add constraints. "Ignore sarcasm and focus on
literal sentiment" helps avoid common pitfalls.
Here's a production-ready template:
1 2 3 4 5 6 7 8 9
defzero_shot_classify(text, labels): prompt = f""" Task: Classify the following text into one of these categories: {', '.join(labels)} Text: {text} Category (respond with only the category name): """ return prompt
Few-Shot Prompting
Few-shot prompting provides 2-10 examples before asking the model to
perform your task. This dramatically improves accuracy by establishing
patterns the model can follow.
Example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
prompt = """ Classify the sentiment of these reviews: Review: "Absolutely loved this film! Best movie I've seen all year." Sentiment: positive Review: "It was okay. Nothing special but not terrible either." Sentiment: neutral Review: "Waste of time and money. Horrible acting." Sentiment: negative Review: "The cinematography was stunning and the story kept me engaged." Sentiment: """
The model now has concrete examples showing the mapping from text to
sentiment labels. This helps in several ways:
Demonstrates output format (single word,
lowercase)
Shows edge cases (mixed reviews map to
"neutral")
Primes the model's context with relevant semantic
patterns
Key insight: Few-shot examples act as soft
conditioning. The model's next-token prediction mechanism looks for
patterns in the examples and applies them to the new input. You're
essentially programming through demonstration.
Choosing examples:
Research by Liu et al. (2021) on "What Makes Good In-Context
Examples?" found that:
Diversity matters more than volume. 5 diverse
examples beat 20 similar ones.
Hard examples help. Include edge cases the model
might struggle with.
Order affects results. Place most relevant examples
nearest to the query.
defselect_examples(query, example_pool, k=5): """ Select k diverse, relevant examples for few-shot prompting. Uses embedding similarity to find relevant examples, then filters for diversity. """ from sklearn.metrics.pairwise import cosine_similarity # Compute embeddings (using any embedding model) query_emb = get_embedding(query) example_embs = [get_embedding(ex['text']) for ex in example_pool] # Get top-k most similar similarities = cosine_similarity([query_emb], example_embs)[0] top_k_indices = similarities.argsort()[-k*2:][::-1] # Filter for diversity (select dissimilar examples from top candidates) selected = [top_k_indices[0]] for idx in top_k_indices[1:]: iflen(selected) >= k: break # Add if not too similar to already selected ifall(cosine_similarity( [example_embs[idx]], [example_embs[s]] )[0][0] < 0.9for s in selected): selected.append(idx) return [example_pool[i] for i in selected]
Performance benchmarks:
On the SuperGLUE benchmark (Wang et al., 2019), GPT-3 achieved: -
Zero-shot: 69.5% average accuracy - One-shot: 71.8% - Few-shot (32
examples): 75.2%
Diminishing returns kick in around 10-15 examples for most tasks.
Many-Shot Prompting
Recent research (Anthropic, 2024) shows that extremely long contexts
(100K+ tokens) enable "many-shot" prompting with hundreds of examples.
This bridges the gap between few-shot prompting and traditional
fine-tuning.
Example scenario: You're building a specialized code
reviewer that catches company-specific anti-patterns. Instead of
fine-tuning (which requires infrastructure and expertise), you provide
200 examples of code reviews in the prompt.
# Abbreviated example - real prompt would have 100+ examples prompt = """ Review the following code for anti-patterns: [Example 1] Code: def getData(): return db.query('SELECT * FROM users') Issue: Raw SQL injection risk. Use parameterized queries. Severity: High [Example 2] Code: result = api.call(); result.field Issue: No error handling. Add try-catch for network failures. Severity: Medium [... 198 more examples ...] [Example 200] Code: if x == True: return True Issue: Redundant comparison. Use 'if x: return True' Severity: Low Now review this code: {user_code} Issue: """
Why many-shot works:
Statistical learning. With hundreds of examples,
the model effectively learns a task-specific distribution, similar to
fine-tuning but without weight updates.
Reduced ambiguity. More examples cover more edge
cases, leaving less room for misinterpretation.
Format consistency. The model sees the output
pattern so many times it rarely deviates.
Anthropic's findings (2024):
On specialized tasks, 500-shot prompting approaches fine-tuned model
performance
Gains plateau around 200-300 examples for most tasks
Works best with Claude's 200K context window
Trade-offs:
Cost: Processing 100K+ token prompts is expensive.
At GPT-4 pricing ($0.01/1K tokens), a 200K context costs$2 per
request.
Latency: Longer prompts mean slower generation
(though batching can amortize costs).
Caching: Use prompt caching (available in Claude,
GPT-4) to reuse long contexts across requests.
When to use many-shot:
Specialized domains where fine-tuning is impractical
Rapid iteration (adding examples is easier than retraining)
Tasks requiring nuanced judgment that benefits from diverse
examples
Task Decomposition
Complex tasks often fail because you're asking the model to do too
much at once. Decomposition breaks a hard problem into simpler
sub-problems.
Example: Instead of "Analyze this legal contract and
extract all obligations," decompose:
# Step 2: For each section, extract obligations obligations = [] for section in sections: prompt = f""" Legal text: {section} List all obligations mentioned in this text. Format each as: [Party] must [Action] [Conditions] """ obligations.extend(llm_call(prompt))
# Step 3: Classify obligations by type classified = classify_obligations(obligations)
# Map-Reduce pattern defmap_reduce_pattern(items, map_prompt, reduce_prompt): mapped = [llm_call(map_prompt.format(item=item)) for item in items] reduced = llm_call(reduce_prompt.format(items=mapped)) return reduced
# Validate-Retry pattern defvalidate_retry_pattern(prompt, validator, max_retries=3): for attempt inrange(max_retries): result = llm_call(prompt) if validator(result): return result prompt += f"\n\nPrevious attempt was invalid: {result}\nTry again:" raise ValueError("Max retries exceeded")
Advanced Reasoning
Techniques
Moving beyond basic prompting, these techniques explicitly guide the
model's reasoning process, dramatically improving performance on complex
tasks.
Chain-of-Thought (CoT)
Prompting
Chain-of-thought prompting asks the model to show its work — to
generate intermediate reasoning steps before producing a final answer.
This simple change yields massive improvements on math, logic, and
multi-step reasoning tasks.
The discovery: Wei et al. (2022) found that adding
"think step by step" to prompts improved accuracy on grade-school math
problems from 17% to 78% with GPT-3.
# Without CoT prompt = """ Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: """ # Model often outputs: "10" (incorrect)
# With CoT prompt = """ Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now? A: think step by step. """ # Model outputs: # "Roger starts with 5 balls. # He buys 2 cans, each with 3 balls. # 2 cans × 3 balls = 6 balls. # Total: 5 + 6 = 11 balls." # Answer: 11 (correct!)
Why CoT works:
The prevailing theory (from interpretability research) is that
language models perform implicit computation through their forward pass.
Each layer refines representations, but there's a limit to how much
computation a single forward pass can do. By generating intermediate
steps, the model gets to "think longer" through multiple forward passes
(one per generated token).
Think of it like working memory: humans solve complex problems by
writing down intermediate results. CoT lets models do the same.
Variants:
1. Zero-shot CoT: Just add "think step by step"
without examples. Surprisingly effective across diverse tasks.
1 2
defzero_shot_cot(question): returnf"{question}\n\nthink step by step."
2. Few-shot CoT: Provide examples that include
reasoning chains.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
prompt = """ Q: If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets? A: think step by step. - 5 machines make 5 widgets in 5 minutes - That means each machine makes 1 widget in 5 minutes - So 100 machines make 100 widgets in 5 minutes Answer: 5 minutes Q: A farmer has 15 sheep and all but 8 die. How many are left? A: think step by step. - "All but 8 die" means 8 survive - The number who die is 15 - 8 = 7 - The number left is 8 Answer: 8 sheep Q: {new_question} A: think step by step. """
3. Structured CoT: Enforce specific reasoning
formats.
1 2 3 4 5 6 7 8 9 10 11 12 13
prompt = f""" Problem: {problem} Solve this using the following structure: 1. What is known: [List given information] 2. What is unknown: [What we need to find] 3. Relevant formulas/principles: [What applies here] 4. Step-by-step solution: a) [First step with explanation] b) [Second step with explanation] ... 5. Final answer: [Concise answer] """
Benchmark results (Wei et al., 2022):
Benchmark
Baseline
CoT
Improvement
GSM8K (math)
17.1%
78.2%
+357%
SVAMP (math)
63.7%
79.0%
+24%
CommonsenseQA
72.5%
78.1%
+7.7%
StrategyQA
54.3%
66.1%
+21.7%
When CoT helps most:
Multi-step reasoning (math, logic puzzles)
Problems requiring intermediate calculations
Tasks where the reasoning path matters (explainability)
Complex decision-making with trade-offs
When CoT doesn't help:
Simple lookup tasks ("What's the capital of France?")
When the model lacks requisite knowledge (CoT can't fix missing
facts)
Very short answers (generating reasoning is wasteful)
defself_consistency(question, n=5, temperature=0.7): """ Generate n reasoning chains and return majority answer. Args: question: Problem to solve n: Number of chains to generate temperature: Sampling temperature (higher = more diversity) Returns: Most common answer and confidence score """ from collections import Counter prompt = f"{question}\n\nthink step by step." answers = [] for _ inrange(n): response = llm_call( prompt, temperature=temperature, max_tokens=512 ) answer = extract_answer(response) answers.append(answer) # Majority vote vote_counts = Counter(answers) majority_answer, count = vote_counts.most_common(1)[0] confidence = count / n return majority_answer, confidence
Example:
1 2 3 4 5 6 7 8 9
Question: "If you overtake the person in 2nd place, what place are you in?"
Chain 1: "You overtake 2nd, so you're now 2nd. Answer: 2nd place" ✓ Chain 2: "You were behind 2nd, now you're ahead, so 1st. Answer: 1st place" ✗ Chain 3: "Overtaking 2nd means you take their position. Answer: 2nd place" ✓ Chain 4: "You pass the person in 2nd. You're now 2nd. Answer: 2nd place" ✓ Chain 5: "You overtake 2nd, making you 1st. Answer: 1st place" ✗
Majority: "2nd place" (3/5 = 60% confidence)
Performance gains (Wang et al., 2022):
On GSM8K math problems: - Standard CoT: 74.4% - Self-consistency
(n=40): 83.7% (+12.5%)
On CommonsenseQA: - Standard CoT: 78.1% - Self-consistency (n=40):
81.5% (+4.4%)
Cost-performance trade-offs:
Self-consistency requires n × cost of single inference. Choose n
based on task importance:
Task criticality
n
Cost multiplier
Exploratory
3
3×
Production
5
5×
High-stakes
10-20
10-20×
Optimization: Use lower n initially, then increase
for low-confidence cases:
1 2 3 4 5 6 7 8 9 10 11 12
defadaptive_self_consistency(question, confidence_threshold=0.7): """ Start with n=3, increase if confidence is low. """ answer, confidence = self_consistency(question, n=3) if confidence >= confidence_threshold: return answer, confidence # Low confidence, generate more chains answer, confidence = self_consistency(question, n=10) return answer, confidence
Advanced variant - weighted voting:
Not all reasoning chains are equal. Weight votes by chain
quality:
defweighted_self_consistency(question, n=5): """ Weight each answer by reasoning chain quality. """ prompt = f"{question}\n\nthink step by step." answers_with_scores = [] for _ inrange(n): response = llm_call(prompt, temperature=0.7) answer = extract_answer(response) # Score reasoning quality quality_prompt = f""" Rate the logical quality of this reasoning (0-10): {response} Score: """ quality = int(llm_call(quality_prompt, temperature=0)) answers_with_scores.append((answer, quality)) # Weighted vote from collections import defaultdict vote_weights = defaultdict(float) for answer, quality in answers_with_scores: vote_weights[answer] += quality best_answer = max(vote_weights, key=vote_weights.get) confidence = vote_weights[best_answer] / sum(vote_weights.values()) return best_answer, confidence
Tree of Thoughts (ToT)
Tree of Thoughts (Yao et al., 2023) takes CoT further by exploring
multiple reasoning paths simultaneously through tree search. Instead of
a linear chain, the model explores a tree of possibilities, backtracking
when paths seem unpromising.
Key idea: Model reasoning as search through a state
space. Each "thought" is a partial solution. Use heuristics to
prioritize promising branches.
deftree_of_thoughts(problem, depth=3, breadth=3): """ Explore solution tree using depth-first search. Args: problem: Initial problem statement depth: Maximum depth to explore breadth: Number of thoughts to generate at each node Returns: Best solution found """ defexplore(state, current_depth): if current_depth >= depth: # Terminal state, evaluate quality return evaluate_solution(state) # Generate possible next thoughts thoughts = generate_thoughts(state, k=breadth) # Evaluate each thought's promise thought_values = [] for thought in thoughts: # Combine current state with thought new_state = state + "\n" + thought # Recursively explore this branch value = explore(new_state, current_depth + 1) thought_values.append((thought, value)) # Return best branch best_thought, best_value = max(thought_values, key=lambda x: x[1]) return best_value # Start exploration from initial problem explore(problem, 0)
defgenerate_thoughts(state, k=3): """Generate k possible next reasoning steps.""" prompt = f""" Current reasoning: {state} Generate {k} different possible next steps: 1. """ response = llm_call(prompt, temperature=0.8, max_tokens=300) # Parse into list of thoughts thoughts = [t.strip() for t in response.split('\n') if t.strip()] return thoughts[:k]
defevaluate_solution(state): """Score how promising this reasoning path is.""" prompt = f""" Rate this reasoning on a scale of 1-10: {state} Consider: logical coherence, progress toward solution, likelihood of correctness. Score (integer 1-10): """ score = int(llm_call(prompt, temperature=0, max_tokens=5)) return score
Concrete example (Game of 24):
Task: Use four numbers (e.g., 4, 9, 10, 13) with +, -, ×, ÷ to make
24.
ToT exploration:
1 2 3 4 5 6 7 8 9 10 11 12
Root: "4, 9, 10, 13" ├─ Thought 1: "13 - 9 = 4" → State: "4, 4, 10" │ ├─ Thought 1.1: "10 - 4 = 6" → State: "4, 6" │ │ └─ Thought 1.1.1: "6 × 4 = 24" ✓ SOLUTION FOUND │ ├─ Thought 1.2: "4 × 4 = 16" → State: "10, 16" │ │ └─ Dead end (can't make 24) │ └─ Thought 1.3: "10 + 4 = 14" → State: "4, 14" │ └─ Dead end ├─ Thought 2: "10 - 4 = 6" → State: "6, 9, 13" │ └─ ... (explore further) └─ Thought 3: "9 + 10 = 19" → State: "4, 13, 19" └─ ... (explore further)
The tree search backtracks when a path seems unpromising (evaluated
by the model itself), exploring alternative branches.
Benchmark results (Yao et al., 2023):
Task
CoT
ToT
Improvement
Game of 24
7.3%
74%
+914%
Creative writing
7.3
7.9
+8.2%
Crosswords
15.6%
78%
+400%
Why ToT works:
Exploration: Tries multiple approaches rather than
committing to first thought
Self-evaluation: Model judges its own reasoning
quality
Backtracking: Abandons dead ends early
Implementation challenges:
Cost: Exploring a tree with breadth=3 and depth=4
requires 3^4 = 81 LLM calls
Evaluation accuracy: Model must reliably judge
which thoughts are promising
Search strategy: DFS, BFS, or best-first? Each has
trade-offs.
classTreeOfThoughts: def__init__(self, model, max_calls=50): self.model = model self.max_calls = max_calls self.call_count = 0 defsolve(self, problem): """Solve problem using BFS-based ToT.""" from queue import PriorityQueue # Priority queue of (value, state) queue = PriorityQueue() queue.put((0, problem, [])) # (priority, state, path) best_solution = None best_score = -float('inf') whilenot queue.empty() and self.call_count < self.max_calls: priority, state, path = queue.get() # Check if this is a terminal state if self._is_terminal(state): score = self._evaluate(state) if score > best_score: best_score = score best_solution = state continue # Generate next thoughts thoughts = self._generate_thoughts(state, k=3) for thought in thoughts: new_state = state + "\n" + thought new_path = path + [thought] # Evaluate promise of this thought value = self._evaluate_thought(new_state) # Add to queue (negative because PriorityQueue is min-heap) queue.put((-value, new_state, new_path)) return best_solution, best_score def_generate_thoughts(self, state, k): self.call_count += 1 prompt = f""" Current state: {state} Generate {k} possible next steps: """ response = self.model.generate(prompt, temperature=0.8) return self._parse_thoughts(response, k) def_evaluate_thought(self, state): self.call_count += 1 prompt = f""" Rate this reasoning (1-10): {state} Score: """ returnint(self.model.generate(prompt, temperature=0)) def_is_terminal(self, state): # Task-specific logic return"Answer:"in state orlen(state.split('\n')) > 10
Graph of Thoughts (GoT)
Graph of Thoughts (Besta et al., 2023) generalizes ToT by allowing
arbitrary graph structures. Thoughts can merge (combine multiple
reasoning paths) or branch, enabling more complex reasoning
patterns.
Key insight: Not all reasoning follows trees.
Sometimes you want to: - Merge: Combine insights from
multiple branches - Loop: Refine a solution iteratively
- Aggregate: Synthesize information from many
sources
Example - Document analysis:
1 2 3 4 5
Documents → Split into chunks → Process each chunk ↓ Extract key points ↓ Merge all key points → Synthesize summary
This is a graph: multiple parallel paths (one per chunk) that merge
into a synthesis step.
defgraph_of_thoughts_summarization(documents): """ Summarize multiple documents using GoT. Graph structure: 1. Generate summary for each document (parallel) 2. Extract key themes from each summary (parallel) 3. Merge all themes 4. Generate final synthesis """ graph = ThoughtGraph() # Layer 1: Individual summaries summaries = [] for i, doc inenumerate(documents): node_id = f"summary_{i}" summary = llm_call(f"Summarize this document:\n{doc}") graph.add_node(node_id, summary) summaries.append((node_id, summary)) # Layer 2: Key themes from each summary themes = [] for node_id, summary in summaries: theme_node_id = f"theme_{node_id}" theme = llm_call(f"Extract key themes from:\n{summary}") graph.add_node(theme_node_id, theme) graph.add_edge(node_id, theme_node_id, 'continue') themes.append((theme_node_id, theme)) # Layer 3: Merge themes merge_node_id = "merged_themes" all_themes = "\n\n".join([t[1] for t in themes]) merged = llm_call(f"Merge these themes:\n{all_themes}") graph.add_node(merge_node_id, merged) for theme_id, _ in themes: graph.add_edge(theme_id, merge_node_id, 'merge') # Layer 4: Final synthesis final_node_id = "final_summary" synthesis = llm_call(f"Synthesize a final summary from:\n{merged}") graph.add_node(final_node_id, synthesis) graph.add_edge(merge_node_id, final_node_id, 'continue') return synthesis, graph
Advantages over ToT:
Parallelization: Independent branches can run
concurrently
Information aggregation: Merge multiple
perspectives
Iterative refinement: Loops for incremental
improvement
Performance (Besta et al., 2023):
On sorting tasks (sort 32 numbers), GoT achieved: - 89% accuracy -
62% cost reduction vs. ToT (through parallelization)
ReAct (Reasoning + Acting)
ReAct (Yao et al., 2022) interleaves reasoning with actions in the
world. Instead of pure thought, the model alternates between thinking
(reasoning) and acting (calling tools, APIs, executing code).
Pattern:
1 2 3 4 5 6 7 8
Thought: I need to find the population of Paris Action: search("Paris population") Observation: 2.16 million (2019) Thought: Now I need Tokyo's population Action: search("Tokyo population") Observation: 37.4 million (2021) Thought: Tokyo is larger than Paris Action: finish("Tokyo has a larger population than Paris")
Why ReAct matters:
Language models struggle with: - Current information (training data
is stale) - Precise calculations - Accessing private data
ReAct bridges this by letting models use external tools.
defsearch(query): """Search the web for information.""" # Integration with search API import requests response = requests.get( "https://api.search.com/search", params={"q": query} ) return response.json()["snippet"]
defcalculate(expression): """Evaluate a mathematical expression.""" try: # Safe eval (in production, use a proper math parser) result = eval(expression, {"__builtins__": { }}, {}) returnstr(result) except Exception as e: returnf"Error: {e}"
defwikipedia(query): """Get Wikipedia summary.""" import wikipediaapi wiki = wikipediaapi.Wikipedia('en') page = wiki.page(query) return page.summary[:500]
tools = { "search": search, "calculate": calculate, "wikipedia": wikipedia }
result = agent.run( "What is the total GDP of the three largest EU countries?" )
# Execution trace: # Thought: I need to identify the three largest EU countries by GDP # Action: search # Action Input: largest EU countries by GDP # Observation: Germany, France, Italy are top 3 # # Thought: Now I need each country's GDP # Action: search # Action Input: Germany GDP 2023 # Observation: $4.3 trillion # # Thought: Get France GDP # Action: search # Action Input: France GDP 2023 # Observation:$2.9 trillion # # Thought: Get Italy GDP # Action: search # Action Input: Italy GDP 2023 # Observation: $2.0 trillion # # Thought: Sum the three GDPs # Action: calculate # Action Input: 4.3 + 2.9 + 2.0 # Observation: 9.2 # # Thought: I have the total # Action: finish # Action Input: The total GDP of Germany, France, and Italy is$9.2 trillion
Performance benchmarks (Yao et al., 2022):
On HotpotQA (multi-hop question answering): - Standard prompting:
28.7% - CoT: 32.9% - ReAct: 37.4%
On AlfWorld (interactive environment): - Standard prompting: 12% -
ReAct: 34%
Best practices:
Tool documentation: Clear docstrings help the model
choose appropriate tools
Error handling: Tools should return descriptive
error messages
Rate limiting: Prevent infinite loops or excessive
API calls
Observation truncation: Limit observation length to
avoid context overflow
Optimization & Automation
Manual prompt engineering doesn't scale. These techniques automate
the optimization process.
Automatic Prompt Engineering
(APE)
APE (Zhou et al., 2022) automatically generates and selects optimal
prompts. Instead of manually iterating, you provide examples and let an
LLM discover effective prompts.
Algorithm:
Generate candidate prompts: Use an LLM to create
diverse prompt variations
Evaluate each prompt: Test on a validation set
Select the best: Choose the highest-performing
prompt
defautomatic_prompt_engineering( task_description, train_examples, val_examples, num_candidates=20 ): """ Automatically discover optimal prompt for a task. Args: task_description: Natural language description of task train_examples: List of (input, output) pairs for prompt generation val_examples: List of (input, output) pairs for evaluation num_candidates: Number of prompt variations to try Returns: Best prompt and its accuracy """ # Step 1: Generate candidate prompts meta_prompt = f""" Task: {task_description} Here are some examples: {format_examples(train_examples[:5])} Generate {num_candidates} different prompts that could solve this task. Each prompt should use a different approach or phrasing. Prompts: """ candidates_text = llm_call(meta_prompt, temperature=1.0, max_tokens=2000) candidates = parse_prompts(candidates_text) # Step 2: Evaluate each candidate results = [] for prompt in candidates: correct = 0 for input_text, expected_output in val_examples: full_prompt = f"{prompt}\n\nInput: {input_text}\nOutput:" output = llm_call(full_prompt, temperature=0) if normalize(output) == normalize(expected_output): correct += 1 accuracy = correct / len(val_examples) results.append((prompt, accuracy)) # Step 3: Return best prompt best_prompt, best_accuracy = max(results, key=lambda x: x[1]) return best_prompt, best_accuracy
defformat_examples(examples): return"\n".join([ f"Input: {inp}\nOutput: {out}" for inp, out in examples ])
defparse_prompts(text): """Extract individual prompts from generated text.""" # Simple parsing - in production, use more robust extraction prompts = [] for line in text.split('\n'): line = line.strip() if line andlen(line) > 20: # Filter very short lines prompts.append(line) return prompts
defnormalize(text): """Normalize text for comparison.""" return text.strip().lower()
# Output might be: # "Determine if the customer review expresses satisfaction (positive) # or dissatisfaction (negative) with the product." # Accuracy: 87.3%
Discovered prompts often outperform human-written
ones. Zhou et al. found APE-generated prompts achieved 3-8%
higher accuracy than human baselines on various tasks.
Why APE works:
Exploration: Tries diverse phrasings humans might
not consider
Data-driven: Optimizes directly on your task, not
general heuristics
Scales: Can test hundreds of candidates
automatically
defiterative_ape(task, train_examples, val_examples, iterations=3): """ Iteratively refine prompts. """ best_prompt = task # Start with task description best_accuracy = 0 for i inrange(iterations): print(f"Iteration {i+1}/{iterations}") # Generate variations of current best refinement_prompt = f""" Current prompt: {best_prompt} Current accuracy: {best_accuracy:.1%} Generate 10 improved variations of this prompt. Consider: - More specific instructions - Better examples - Clearer output format - Edge case handling Variations: """ candidates_text = llm_call(refinement_prompt, temperature=0.9) candidates = parse_prompts(candidates_text) # Evaluate for prompt in candidates: accuracy = evaluate_prompt(prompt, val_examples) if accuracy > best_accuracy: best_accuracy = accuracy best_prompt = prompt print(f"New best: {accuracy:.1%}") return best_prompt, best_accuracy
DSPy: Declarative
Self-improving Prompts
DSPy (Khattab et al., 2023) is a framework that treats prompting as a
programming problem. Instead of writing prompts, you write programs that
generate prompts.
Core concepts:
Signatures: Type-annotated function specs
(input/output)
Overhead: Optimization requires compute and
training data
LLMLingua: Prompt Compression
LLMLingua (Jiang et al., 2023) compresses prompts to reduce costs
while preserving performance. It selectively removes tokens that
contribute least to the model's understanding.
Motivation: Long prompts are expensive. A
10,000-token prompt costs 10× more than 1,000 tokens. Can we compress
without hurting accuracy?
Method:
LLMLingua uses a smaller LLM to score each token's importance, then
removes low-scoring tokens.
original_prompt = """ You are a helpful assistant. Please answer the following question accurately. Context: The French Revolution was a period of radical political and societal change in France that began with the Estates-General of 1789 and ended with the formation of the French Consulate in November 1799. Many of its ideas are considered fundamental principles of liberal democracy. Question: When did the French Revolution begin? Please provide a concise answer based on the context above. """
defcompute_token_importance(prompt, model): """ Score each token by how much it affects model's predictions. Uses conditional perplexity: remove token i, measure how much perplexity increases. """ tokens = tokenize(prompt) importances = [] for i inrange(len(tokens)): # Compute perplexity with token i ppl_with = model.perplexity(tokens) # Compute perplexity without token i tokens_without = tokens[:i] + tokens[i+1:] ppl_without = model.perplexity(tokens_without) # Importance = how much perplexity increases when removed importance = ppl_without - ppl_with importances.append(importance) return importances
Performance (Jiang et al., 2023):
On question-answering tasks with 2× compression:
- Accuracy drop: only 2-3%
- Cost savings: 50%
- Latency improvement: 1.4× faster

On retrieval-augmented generation with 4× compression:
- Accuracy drop: 5-7%
- Cost savings: 75%
```python
def generate_structured_output(data, schema):
    """
    Generate JSON conforming to a schema.
    """
    prompt = f"""
Generate a JSON object that follows this schema:
{json.dumps(schema, indent=2)}

Rules:
- All required fields must be present
- Use correct data types (string, number, boolean, array, object)
- Enum fields must use one of the specified values
- No additional fields beyond the schema

Input data: {data}

Output valid JSON only (no explanation):
"""
    response = llm_call(prompt, temperature=0)
    # Validate against the schema
    try:
        parsed = json.loads(response)
        validate_against_schema(parsed, schema)
        return parsed
    except Exception as e:
        # Retry with error feedback
        return retry_with_feedback(prompt, response, str(e))
```
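The helper `validate_against_schema` is referenced but not defined here. As a sketch of what it might check — required fields, primitive types, and enums, assuming a small JSON Schema subset — something like the following would do; a production system would use a full validator such as the `jsonschema` package:

```python
def validate_against_schema(obj, schema):
    """Minimal checker for a small JSON Schema subset: required fields,
    basic types, and enums. Hypothetical stand-in for a real validator."""
    types = {"string": str, "number": (int, float), "boolean": bool,
             "array": list, "object": dict}
    for field in schema.get("required", []):
        if field not in obj:
            raise ValueError(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field not in obj:
            continue
        expected = types.get(spec.get("type"))
        if expected and not isinstance(obj[field], expected):
            raise ValueError(f"{field}: expected {spec['type']}")
        if "enum" in spec and obj[field] not in spec["enum"]:
            raise ValueError(f"{field}: not in enum {spec['enum']}")
    return True
```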
```python
def few_shot_structured_generation():
    prompt = """
Extract information in this JSON format:

Example 1:
Text: "Alice is 25 and works as a designer."
Output: {"name": "Alice", "age": 25, "occupation": "designer"}

Example 2:
Text: "Bob, a 40-year-old teacher, lives in Boston."
Output: {"name": "Bob", "age": 40, "occupation": "teacher", "location": "Boston"}

Example 3:
Text: "Carol enjoys programming. She's 35."
Output: {"name": "Carol", "age": 35, "interests": ["programming"]}

Now extract from this text:
Text: "David is a 28-year-old chef who loves traveling."
Output:
"""
    return llm_call(prompt, temperature=0)
```
Strategy 3: Use function calling (OpenAI-specific):
````python
def generate_code(task, language="python", test_cases=None):
    """
    Generate and validate code for a task.
    """
    prompt = f"""
Write {language} code to solve this task:

Task: {task}

Requirements:
- Include clear comments explaining the logic
- Handle edge cases and errors
- Follow best practices for {language}
- Include a main function demonstrating usage

{format_test_cases(test_cases) if test_cases else ''}

Provide complete, runnable code:
```{language}
"""
    code = llm_call(prompt, temperature=0.3)
    # Extract code from markdown
    code = extract_code_block(code)
    # Validate
    if test_cases:
        results = run_tests(code, test_cases, language)
        if not all(r.passed for r in results):
            # Debug and retry
            code = debug_and_fix(code, results, prompt)
    return code
````
```python
def format_test_cases(test_cases):
    """Format test cases for the prompt."""
    formatted = "Test cases:\n"
    for i, (input_data, expected) in enumerate(test_cases, 1):
        formatted += f"{i}. Input: {input_data} → Expected: {expected}\n"
    return formatted
```
````python
def debug_and_fix(code, test_results, original_prompt):
    """Iteratively fix failing code."""
    failures = [r for r in test_results if not r.passed]
    debug_prompt = f"""
{original_prompt}

Your previous code failed these tests:
{format_failures(failures)}

Fix the code to pass all tests. Provide corrected code:
```python
"""
    fixed_code = llm_call(debug_prompt, temperature=0.2)
    return extract_code_block(fixed_code)
````
```python
class DataExtractor:
    """
    Extract structured data using progressive refinement.
    """
    def __init__(self, model):
        self.model = model

    def extract(self, text, schema, confidence_threshold=0.8):
        """
        Extract data matching the schema, with confidence scores.
        """
        # Step 1: Initial extraction
        extraction_prompt = f"""
Extract all instances matching this schema:
{format_schema(schema)}

Text: {text}

For each instance, provide:
1. Extracted data in JSON
2. Confidence (0-1)
3. Supporting quote from text

Output:
"""
        raw_extraction = self.model.generate(extraction_prompt)
        candidates = self.parse_candidates(raw_extraction)

        # Step 2: Validate and filter
        validated = []
        for candidate in candidates:
            if candidate['confidence'] >= confidence_threshold:
                # Verify the extraction is supported by the text
                if self.verify_extraction(text, candidate):
                    validated.append(candidate)

        # Step 3: Resolve conflicts (overlapping extractions)
        resolved = self.resolve_conflicts(validated)
        return resolved

    def verify_extraction(self, text, candidate):
        """
        Verify that an extraction is actually supported by the source text.
        """
        verification_prompt = f"""
Source text: {text}

Claimed extraction: {json.dumps(candidate['data'])}
Supporting quote: {candidate['quote']}

Is this extraction accurate and supported by the text?
Answer: yes or no
"""
        answer = self.model.generate(verification_prompt, temperature=0)
        return 'yes' in answer.lower()
```
```python
class PromptDebugger:
    """
    Diagnose why a prompt fails.
    """
    def debug(self, prompt, expected, actual):
        """
        Run diagnostic checks and suggest fixes.
        """
        issues = []

        # Check 1: Is the prompt clear?
        if self.check_ambiguity(prompt):
            issues.append({
                "issue": "Ambiguous instructions",
                "evidence": self.find_ambiguous_phrases(prompt),
                "fix": "Add specific constraints and examples"
            })

        # Check 2: Are there contradictions?
        if self.check_contradictions(prompt):
            issues.append({
                "issue": "Contradictory requirements",
                "evidence": self.find_contradictions(prompt),
                "fix": "Resolve conflicting instructions"
            })

        # Check 3: Is context sufficient?
        if self.check_missing_context(prompt, expected):
            issues.append({
                "issue": "Insufficient context",
                "evidence": "Expected output requires information not in prompt",
                "fix": "Provide additional background or examples"
            })

        # Check 4: Format issues?
        if self.check_format_mismatch(expected, actual):
            issues.append({
                "issue": "Output format mismatch",
                "evidence": f"Expected format: {self.infer_format(expected)}, "
                            f"Got: {self.infer_format(actual)}",
                "fix": "Explicitly specify output format with examples"
            })

        # Check 5: Is it too complex?
        if self.check_complexity(prompt):
            issues.append({
                "issue": "Task too complex for single prompt",
                "evidence": f"Prompt asks for {self.count_subtasks(prompt)} subtasks",
                "fix": "Decompose into multiple steps"
            })

        return self.generate_report(issues)

    def check_ambiguity(self, prompt):
        """Detect vague language."""
        ambiguous_phrases = [
            "relevant", "appropriate", "good", "bad",
            "some", "few", "many", "stuff", "things"
        ]
        return any(phrase in prompt.lower() for phrase in ambiguous_phrases)

    def check_contradictions(self, prompt):
        """Check for conflicting instructions."""
        # Simple heuristic: look for "but", "however", "although"
        contradiction_markers = ["but", "however", "although", "except"]
        return sum(marker in prompt.lower()
                   for marker in contradiction_markers) >= 2

    def check_format_mismatch(self, expected, actual):
        """Check if the output format matches the expectation."""
        expected_fmt = self.infer_format(expected)
        actual_fmt = self.infer_format(actual)
        return expected_fmt != actual_fmt

    def infer_format(self, text):
        """Infer output format (json, list, prose, etc.)."""
        text = text.strip()
        if text.startswith('{') and text.endswith('}'):
            return "json_object"
        if text.startswith('[') and text.endswith(']'):
            return "json_array"
        if '\n-' in text or '\n*' in text or '\n1.' in text:
            return "list"
        return "prose"

    def count_subtasks(self, prompt):
        """Estimate the number of subtasks in a prompt."""
        task_markers = [
            "first", "then", "next", "finally", "also",
            "additionally", "1.", "2.", "3.", "step 1", "step 2"
        ]
        return sum(marker in prompt.lower() for marker in task_markers)

    def generate_report(self, issues):
        """Format the diagnosis report."""
        if not issues:
            return ("No obvious issues detected. Try:\n"
                    "- Adding more examples\n"
                    "- Adjusting temperature\n"
                    "- Using a different model")
        report = "Diagnosis:\n\n"
        for i, issue in enumerate(issues, 1):
            report += f"{i}. {issue['issue']}\n"
            report += f"   Evidence: {issue['evidence']}\n"
            report += f"   Fix: {issue['fix']}\n\n"
        return report
```
```python
# Usage
debugger = PromptDebugger()

prompt = "Classify the sentiment of this review."
expected = "positive"
actual = ("The review expresses a generally favorable opinion "
          "with some minor criticisms...")
```
```python
def analyze_errors(predictions, ground_truths, inputs):
    """
    Categorize and analyze prediction errors.
    """
    errors = []
    for pred, truth, inp in zip(predictions, ground_truths, inputs):
        if normalize(pred) != normalize(truth):
            errors.append({
                "input": inp,
                "prediction": pred,
                "ground_truth": truth,
                "error_type": categorize_error(pred, truth),
                "severity": compute_severity(pred, truth)
            })

    # Group by error type
    from collections import defaultdict
    by_type = defaultdict(list)
    for error in errors:
        by_type[error["error_type"]].append(error)

    # Generate report
    report = (f"Total errors: {len(errors)}/{len(predictions)} "
              f"({100*len(errors)/len(predictions):.1f}%)\n\n")
    for error_type, instances in sorted(by_type.items(), key=lambda x: -len(x[1])):
        report += f"{error_type}: {len(instances)} instances\n"
        report += "Examples:\n"
        for instance in instances[:3]:  # Show up to 3 examples
            report += f"  Input: {instance['input'][:50]}...\n"
            report += f"  Predicted: {instance['prediction']}\n"
            report += f"  Expected: {instance['ground_truth']}\n\n"
    return report
```
```python
def categorize_error(prediction, ground_truth):
    """Classify the type of error."""
    pred_norm = normalize(prediction)
    truth_norm = normalize(ground_truth)
    if len(pred_norm) == 0:
        return "empty_output"
    if pred_norm in truth_norm or truth_norm in pred_norm:
        return "partial_match"
    pred_words = set(pred_norm.split())
    truth_words = set(truth_norm.split())
    overlap = len(pred_words & truth_words) / max(len(pred_words), len(truth_words))
    if overlap > 0.5:
        return "semantic_error"
    if overlap > 0:
        return "partial_hallucination"
    return "complete_hallucination"
```
```python
def compute_severity(prediction, ground_truth):
    """Score error severity (0 = minor, 1 = critical)."""
    # Use semantic similarity as a proxy
    similarity = semantic_similarity(prediction, ground_truth)
    return 1 - similarity
```
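These helpers assume a `normalize` function that is not defined in the text. A minimal version — my assumption, not spelled out in the original — lowercases, strips punctuation, and collapses whitespace:

```python
import re

def normalize(text):
    """Minimal normalization for error comparison: lowercase,
    replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())
```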
Common Pitfalls & Solutions
Learn from common mistakes:
Pitfall 1: Vague Instructions
Problem:

```python
prompt = "Make this better: {text}"
```
Why it fails: "Better" is subjective. The model
doesn't know what dimensions to optimize.
Solution:

```python
prompt = f"""
Improve this text for clarity and conciseness:
- Remove redundant words
- Use active voice
- Break up long sentences
- Preserve all key information

Original: {text}

Improved version:
"""
```
Pitfall 2: Assuming Knowledge
Problem:

```python
prompt = "What's the bug in this code?"  # No code provided
```
Why it fails: Model needs context you forgot to
include.
Solution:

```python
prompt = f"""
Find bugs in this Python code:

{code}

Check for:
- Syntax errors
- Logic errors
- Edge cases not handled
- Potential runtime errors

Bugs found:
"""
```
Pitfall 3: Overly Long
Prompts
Problem: 15,000-word prompt with every possible
instruction.
Why it fails: Models struggle to attend to all
information (the "lost in the middle" problem). Also expensive.
Solution: Decompose or use hierarchical prompting:

```python
# Step 1: Extract key info
key_info = extract_key_information(long_document)

# Step 2: Focused task with only relevant info
result = perform_task(key_info)
```
Pitfall 4: Ignoring Output
Format
Problem:

```python
prompt = "Extract the dates."
# Gets: "The dates are March 3 and April 15."
# Want: ["2024-03-03", "2024-04-15"]
```
Solution:

```python
prompt = """
Extract all dates in ISO format (YYYY-MM-DD).
Output as a JSON array of strings.

Text: {text}

Output (JSON only):
"""
```
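Once the model returns the JSON array, it is worth validating the format before trusting it downstream. A small checker (a hypothetical helper, not part of the original pitfall) might look like:

```python
import json
import re

ISO_DATE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def validate_dates_output(raw):
    """Check that the model's raw output is a JSON array of ISO dates.
    Catches format drift before it reaches downstream code."""
    try:
        dates = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(dates, list) and all(
        isinstance(d, str) and ISO_DATE.match(d) for d in dates
    )
```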
```python
def validated_generation(prompt, validator, max_retries=3):
    """Generate with a validation loop."""
    for attempt in range(max_retries):
        output = llm_call(prompt)
        is_valid, error = validator(output)
        if is_valid:
            return output
        # Retry with feedback
        prompt += f"\n\nError: {error}\nTry again:"
    raise ValueError("Failed validation after max retries")
```

```python
# Example validator
def validate_json(output):
    try:
        json.loads(output)
        return True, None
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"
```
Frequently Asked Questions
Q1: Should I use
higher or lower temperature?
Answer: It depends on your task:
Temperature = 0 (or very low): Use for tasks
requiring consistency and correctness (classification, extraction, math,
code generation). The model always picks the most likely token, giving
deterministic outputs.
Temperature = 0.7-0.8: Good for creative tasks
where you want diversity but still coherent outputs (writing,
brainstorming, marketing copy).
Temperature = 1.0+: Maximum randomness. Rarely
useful except for exploration or artistic generation.
Rule of thumb: Start with 0 for structured tasks,
0.7 for creative ones.
Q2:
How many examples should I include in few-shot prompts?
Answer: Research shows diminishing returns:
2-3 examples: Often sufficient for simple
tasks
5-7 examples: Sweet spot for most tasks
10+ examples: Only helps if examples are diverse
and cover edge cases
Stick with prompting when:
1. The task changes frequently (prompts are easier to iterate)
2. You lack training data or a labeling budget
3. You need interpretability (prompts are transparent)
4. You want to leverage the latest models without retraining
Q4: How do I prevent
hallucinations?
Strategies:
Ground with context: Provide relevant facts in
the prompt
```python
prompt = f"Based on this context: {facts}\n\nAnswer: {question}"
```
Explicit instructions: Tell the model to say "I
don't know" when uncertain
```python
"If the answer isn't in the context, respond: 'Information not available'"
```
Request citations: Ask the model to quote
sources
```python
"Quote the exact sentence from the text that supports your answer"
```
Lower temperature: Reduces random guessing
(temperature=0)
Validation loop: Check outputs against ground
truth
Model-specific:
- Exact phrasing sensitivity varies
- Some models need explicit format instructions (XML, JSON markers)
- Function calling is API-specific (OpenAI vs. others)
Best practice: Test on your target model. Don't
assume prompts transfer perfectly.
Q7: Can I automate prompt
optimization?
Yes, several approaches:
APE (Automatic Prompt Engineering): Generate and
test candidate prompts
DSPy: Framework for programmatic prompt
optimization
Genetic algorithms: Evolve prompts through mutation
and selection
Reinforcement learning: Optimize prompts using
reward signals
Practical recommendation: Start with manual
engineering to understand the task, then automate optimization for
production.
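As a concrete illustration of the genetic-algorithm approach, here is a toy evolutionary loop over prompt strings. The `mutate` function (in practice, an LLM paraphrase call) and `score` function (an evaluation metric) are user-supplied assumptions:

```python
import random

def evolve_prompts(seed, mutate, score, generations=5, pop=8, keep=2):
    """Toy genetic search over prompt strings: each generation keeps the
    highest-scoring prompts and refills the population with mutations
    of the survivors."""
    population = [seed] + [mutate(seed) for _ in range(pop - 1)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        survivors = population[:keep]
        population = survivors + [
            mutate(random.choice(survivors)) for _ in range(pop - keep)
        ]
    return max(population, key=score)
```

With a real LLM, `mutate` would ask for a rephrasing and `score` would run the candidate prompt over a validation set, exactly as in the APE loop earlier.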
Q8:
How do I evaluate prompt quality without ground truth?
Methods:
LLM-as-Judge: Use a stronger model to evaluate
outputs
```python
score = judge_llm(f"Rate this output (0-10): {output}")
```
Maintenance: Good prompts need less frequent
updates
Example calculation:
- Baseline: 70% accuracy requires 1.4× API calls for retries
- Optimized: 90% accuracy reduces this to 1.1× API calls
- Savings: 22% fewer API calls
At scale (1M requests/month), this saves thousands of dollars.
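The arithmetic behind that calculation: if a request is retried until it succeeds, the expected number of API calls per request is 1/p for success rate p (a geometric-distribution argument), which yields the figures above:

```python
def expected_calls(success_rate):
    """Expected API calls per request when retrying until success:
    the mean of a geometric distribution, 1 / p."""
    return 1 / success_rate

baseline = expected_calls(0.70)       # ~1.43x calls at 70% accuracy
optimized = expected_calls(0.90)      # ~1.11x calls at 90% accuracy
savings = 1 - optimized / baseline    # ~22% fewer calls
```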
Q10:
How do I debug a prompt that sometimes works and sometimes fails?
Diagnosis steps:
Check for non-determinism: Set temperature=0 to
eliminate randomness
```python
output = llm_call(prompt, temperature=0)
```
Identify failure patterns: Analyze errors for
commonalities
```python
failures = [x for x in test_set if not is_correct(x)]
patterns = cluster_similar_inputs(failures)
```
Add explicit edge case handling: Update prompt
with failure examples
```python
prompt += f"\n\nEdge case examples:\n{edge_cases}"
```
Use self-consistency: Vote across multiple
generations
```python
# Sample with temperature > 0 so the generations actually differ
outputs = [llm_call(prompt, temperature=0.7) for _ in range(5)]
final = majority_vote(outputs)
```
Incremental refinement: Test each change on
known failure cases
Q11:
Should I use XML, JSON, or plain text for structured prompts?
Multimodal prompting: Text + images + audio in one
prompt
Automated optimization: Tools that auto-tune
prompts (DSPy, APE)
Prompt compression: Fit more context into token
limits
Meta-prompting: Prompts that generate prompts
Embodied agents: Prompts controlling robots and
virtual agents
Fine-tuning obsolescence? As models improve, less
need for complex prompts
Prediction: Prompt engineering will remain valuable
but become more automated. Focus will shift from manual crafting to: -
Designing prompt optimization objectives -
Building evaluation frameworks - Integrating
prompts into larger systems
The skill won't disappear — it will evolve toward higher-level
orchestration.
Conclusion
Prompt engineering transforms how we interact with AI systems. What
began as trial-and-error has matured into a discipline backed by
rigorous research and practical frameworks.
The fundamentals — clear instructions, well-chosen examples,
structured outputs — apply universally. Advanced techniques like
chain-of-thought reasoning and tree search unlock capabilities that
seemed impossible with naive prompting. Optimization methods like
automatic prompt engineering and DSPy scale these practices to
production systems.
But techniques alone aren't enough. Effective prompt engineering
requires:
Empiricism: Test everything. What works for one
model or task may fail for another.
Iteration: Your first prompt will rarely be your
best. Expect to refine based on failures.
Evaluation: Measure rigorously. Without metrics,
you're flying blind.
Contextual thinking: Understand your model's
strengths, your task's requirements, and the trade-offs between cost,
latency, and quality.
As models grow more powerful, the nature of prompt engineering
evolves. Simple tasks that once required careful prompting now work
zero-shot. But complex reasoning, specialized domains, and production
constraints ensure that prompt engineering remains essential.
The future belongs to those who can orchestrate AI systems
effectively — not just by writing clever prompts, but by building
frameworks that optimize, validate, and scale prompt-based solutions.
Whether you're a researcher pushing the boundaries of what LLMs can do
or a practitioner building real-world applications, the principles in
this guide provide a foundation for success.
Start simple. Measure constantly. Iterate relentlessly. And remember:
the best prompt is the one that reliably solves your problem, not the
cleverest one.
Post title: Prompt Engineering Complete Guide: From Zero to Advanced Optimization
Post author: Chen Kai
Create time: 2025-04-01 00:00:00
Post link: https://www.chenk.top/en/prompt-engineering-complete-guide/
Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.