Prompt Engineering Complete Guide: From Zero to Advanced Optimization
Chen Kai

Large language models have fundamentally changed how we interact with AI systems. Yet most users still struggle to extract their full potential. The difference between a mediocre response and an exceptional one often comes down to prompt engineering — a practice that blends empirical experimentation with systematic methodology.

This guide walks you through the entire spectrum of prompt engineering, from foundational techniques that require no special knowledge to cutting-edge optimization frameworks used in production systems. You'll learn not just what works, but why it works, backed by research findings and practical code examples. Whether you're building AI applications or simply want better ChatGPT responses, the principles here apply universally across modern LLMs.

Why Prompt Engineering Matters

When OpenAI released GPT-3 in 2020, researchers quickly discovered something surprising: the same model could produce vastly different results depending on how you phrased your request. A poorly worded prompt might generate nonsense, while a carefully crafted one could solve complex reasoning tasks. This wasn't a bug — it was a fundamental property of how these models learn language patterns.

Traditional programming operates on exact instructions: write a function, specify inputs and outputs, and the computer executes deterministically. Language models work differently. They predict the most likely continuation of text based on patterns learned from trillions of tokens. Your prompt doesn't command the model; it sets up a context that nudges probability distributions toward useful outputs.

Think of it like this: asking "What's the capital of France?" is simple lookup. But asking "Explain why the French Revolution began" requires the model to synthesize historical context, identify causal relationships, and structure a coherent narrative. The quality of that synthesis depends heavily on how you frame the problem.

Early prompt engineering was mostly trial and error. Researchers would tweak phrasing, add examples, or modify formatting until results improved. But as models grew more capable, systematic approaches emerged. Today's advanced techniques — chain-of-thought reasoning, tree search algorithms, automatic prompt optimization — represent a mature field backed by rigorous evaluation.

The stakes are high. A well-engineered prompt can reduce API costs by 10x through more efficient context usage. It can boost task accuracy from 40% to 90% on complex reasoning benchmarks. For production systems handling millions of requests, these improvements translate to real business value.

Fundamental Techniques

Before diving into advanced methods, master these core approaches. They form the building blocks of all prompt engineering strategies.

Zero-Shot Prompting

Zero-shot prompting means asking the model to perform a task without providing any examples. You rely entirely on the model's pre-training to understand and execute your request.

Example:

prompt = """
Classify the sentiment of this review:
"The movie was boring and predictable. I wouldn't recommend it."

Sentiment:
"""

The model sees no examples of sentiment classification. It must infer from its training that "boring" and "wouldn't recommend" indicate negative sentiment.

When zero-shot works well:

  • Simple, well-defined tasks the model has seen during training (translation, summarization, basic Q&A)
  • Tasks with clear conventions (like sentiment being "positive" or "negative")
  • When you want fast prototyping without gathering examples

When zero-shot fails:

  • Domain-specific jargon or specialized tasks
  • Ambiguous instructions
  • Tasks requiring specific output formatting

Research from the GPT-3 paper (Brown et al., 2020) showed zero-shot accuracy of 59% on natural language inference tasks. For comparison, few-shot prompting improved this to 70%, demonstrating the value of examples.

Optimization tips:

  1. Be explicit about the task. Instead of "Tell me about this review," say "Classify the sentiment as positive, negative, or neutral."

  2. Specify output format. Add "Return only one word: positive, negative, or neutral" to prevent verbose responses.

  3. Add constraints. "Ignore sarcasm and focus on literal sentiment" helps avoid common pitfalls.

Here's a production-ready template:

def zero_shot_classify(text, labels):
    prompt = f"""
Task: Classify the following text into one of these categories: {', '.join(labels)}

Text: {text}

Category (respond with only the category name):
"""
    return prompt

Few-Shot Prompting

Few-shot prompting provides 2-10 examples before asking the model to perform your task. This dramatically improves accuracy by establishing patterns the model can follow.

Example:

prompt = """
Classify the sentiment of these reviews:

Review: "Absolutely loved this film! Best movie I've seen all year."
Sentiment: positive

Review: "It was okay. Nothing special but not terrible either."
Sentiment: neutral

Review: "Waste of time and money. Horrible acting."
Sentiment: negative

Review: "The cinematography was stunning and the story kept me engaged."
Sentiment:
"""

The model now has concrete examples showing the mapping from text to sentiment labels. This helps in several ways:

  1. Demonstrates output format (single word, lowercase)
  2. Shows edge cases (mixed reviews map to "neutral")
  3. Primes the model's context with relevant semantic patterns

Key insight: Few-shot examples act as soft conditioning. The model's next-token prediction mechanism looks for patterns in the examples and applies them to the new input. You're essentially programming through demonstration.
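A minimal sketch of this programming-through-demonstration idea, mirroring the sentiment example above (the example format and label names are illustrative):

```python
def build_few_shot_prompt(examples, query,
                          input_label="Review", output_label="Sentiment"):
    """Assemble a few-shot classification prompt from labeled examples."""
    blocks = [
        f'{input_label}: "{ex["text"]}"\n{output_label}: {ex["label"]}'
        for ex in examples
    ]
    # The unanswered query goes last so the model completes the pattern
    blocks.append(f'{input_label}: "{query}"\n{output_label}:')
    return "Classify the sentiment of these reviews:\n\n" + "\n\n".join(blocks)
```

The trailing `Sentiment:` is what makes the pattern-completion mechanism kick in: the model's most likely continuation is a label in the same format as the demonstrations.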

Choosing examples:

Research by Liu et al. (2021) on "What Makes Good In-Context Examples?" found that:

  • Diversity matters more than volume. 5 diverse examples beat 20 similar ones.
  • Hard examples help. Include edge cases the model might struggle with.
  • Order affects results. Place most relevant examples nearest to the query.

Example selection algorithm:

def select_examples(query, example_pool, k=5):
    """
    Select k diverse, relevant examples for few-shot prompting.

    Uses embedding similarity to find relevant examples,
    then filters for diversity.
    """
    from sklearn.metrics.pairwise import cosine_similarity

    # Compute embeddings (using any embedding model)
    query_emb = get_embedding(query)
    example_embs = [get_embedding(ex['text']) for ex in example_pool]

    # Get the 2k most similar candidates
    similarities = cosine_similarity([query_emb], example_embs)[0]
    top_k_indices = similarities.argsort()[-k*2:][::-1]

    # Filter for diversity (select dissimilar examples from top candidates)
    selected = [top_k_indices[0]]
    for idx in top_k_indices[1:]:
        if len(selected) >= k:
            break
        # Add if not too similar to already selected
        if all(cosine_similarity(
            [example_embs[idx]],
            [example_embs[s]]
        )[0][0] < 0.9 for s in selected):
            selected.append(idx)

    return [example_pool[i] for i in selected]

Performance benchmarks:

On the SuperGLUE benchmark (Wang et al., 2019), GPT-3 achieved:

  • Zero-shot: 69.5% average accuracy
  • One-shot: 71.8%
  • Few-shot (32 examples): 75.2%

Diminishing returns kick in around 10-15 examples for most tasks.
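That plateau is easy to verify for your own task by sweeping the shot count against a labeled dev set; a minimal harness (with `llm_call` and `build_prompt` as hypothetical helpers):

```python
def accuracy_vs_shots(dev_set, example_pool, build_prompt, llm_call,
                      ks=(0, 1, 5, 10, 20)):
    """Measure dev-set accuracy as a function of the number of in-context examples."""
    results = {}
    for k in ks:
        correct = 0
        for item in dev_set:
            # Use the first k pool examples as in-context demonstrations
            prompt = build_prompt(example_pool[:k], item["text"])
            prediction = llm_call(prompt).strip().lower()
            correct += prediction == item["label"]
        results[k] = correct / len(dev_set)
    return results
```

Plotting the returned accuracies typically shows the gains flattening somewhere past 10 shots.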

Many-Shot Prompting

Recent research (Anthropic, 2024) shows that extremely long contexts (100K+ tokens) enable "many-shot" prompting with hundreds of examples. This bridges the gap between few-shot prompting and traditional fine-tuning.

Example scenario: You're building a specialized code reviewer that catches company-specific anti-patterns. Instead of fine-tuning (which requires infrastructure and expertise), you provide 200 examples of code reviews in the prompt.

# Abbreviated example - real prompt would have 100+ examples
prompt = """
Review the following code for anti-patterns:

[Example 1]
Code: def getData(): return db.query('SELECT * FROM users')
Issue: Raw SQL injection risk. Use parameterized queries.
Severity: High

[Example 2]
Code: result = api.call(); result.field
Issue: No error handling. Add try-catch for network failures.
Severity: Medium

[... 198 more examples ...]

[Example 200]
Code: if x == True: return True
Issue: Redundant comparison. Use 'if x: return True'
Severity: Low

Now review this code:
{user_code}

Issue:
"""

Why many-shot works:

  1. Statistical learning. With hundreds of examples, the model effectively learns a task-specific distribution, similar to fine-tuning but without weight updates.

  2. Reduced ambiguity. More examples cover more edge cases, leaving less room for misinterpretation.

  3. Format consistency. The model sees the output pattern so many times it rarely deviates.

Anthropic's findings (2024):

  • On specialized tasks, 500-shot prompting approaches fine-tuned model performance
  • Gains plateau around 200-300 examples for most tasks
  • Works best with Claude's 200K context window

Trade-offs:

  • Cost: Processing 100K+ token prompts is expensive. At GPT-4 pricing ($0.01/1K tokens), a 200K-token context costs $2 per request.
  • Latency: Longer prompts mean slower generation (though batching can amortize costs).
  • Caching: Use prompt caching (available in Claude, GPT-4) to reuse long contexts across requests.

When to use many-shot:

  • Specialized domains where fine-tuning is impractical
  • Rapid iteration (adding examples is easier than retraining)
  • Tasks requiring nuanced judgment that benefits from diverse examples
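In practice a many-shot prompt like the code-review one above is assembled programmatically under a token budget. A sketch using a rough 4-characters-per-token heuristic (a real implementation would count tokens with the model's own tokenizer):

```python
def build_many_shot_prompt(examples, task_header, query, max_tokens=100_000):
    """Pack as many formatted examples as fit under a rough token budget."""
    def approx_tokens(s):
        return len(s) // 4  # crude heuristic; swap in a real tokenizer

    parts = [task_header]
    # Reserve room for the header, the query, and some slack
    budget = max_tokens - approx_tokens(task_header) - approx_tokens(query) - 50
    for i, ex in enumerate(examples, 1):
        block = (f"[Example {i}]\nCode: {ex['code']}\n"
                 f"Issue: {ex['issue']}\nSeverity: {ex['severity']}")
        cost = approx_tokens(block)
        if cost > budget:
            break
        parts.append(block)
        budget -= cost
    parts.append(f"Now review this code:\n{query}\n\nIssue:")
    return "\n\n".join(parts)
```

Keeping the example block stable across requests also makes it cacheable, which matters at these context lengths.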

Task Decomposition

Complex tasks often fail because you're asking the model to do too much at once. Decomposition breaks a hard problem into simpler sub-problems.

Example: Instead of "Analyze this legal contract and extract all obligations," decompose:

# Step 1: Identify sections
sections = extract_sections(contract)

# Step 2: For each section, extract obligations
obligations = []
for section in sections:
    prompt = f"""
Legal text:
{section}

List all obligations mentioned in this text.
Format each as: [Party] must [Action] [Conditions]
"""
    obligations.extend(llm_call(prompt))

# Step 3: Classify obligations by type
classified = classify_obligations(obligations)

# Step 4: Generate summary
summary = summarize_obligations(classified)

Why decomposition helps:

  1. Reduces cognitive load. Each sub-task is simpler and has clearer success criteria.
  2. Enables validation. You can check intermediate outputs before proceeding.
  3. Improves debugging. When something fails, you know exactly which step broke.

Real-world example: GitHub Copilot Workspace decomposes "Implement a feature" into:

  1. Understand the codebase (semantic search)
  2. Identify affected files
  3. Generate individual file edits
  4. Synthesize a complete solution

Each step uses specialized prompts optimized for that sub-task.

Pattern library:

# Extract-Transform-Load pattern
def etl_pattern(data, extraction_prompt, transformation_prompt):
    extracted = llm_call(extraction_prompt.format(data=data))
    transformed = llm_call(transformation_prompt.format(data=extracted))
    return transformed

# Map-Reduce pattern
def map_reduce_pattern(items, map_prompt, reduce_prompt):
    mapped = [llm_call(map_prompt.format(item=item)) for item in items]
    reduced = llm_call(reduce_prompt.format(items=mapped))
    return reduced

# Validate-Retry pattern
def validate_retry_pattern(prompt, validator, max_retries=3):
    for attempt in range(max_retries):
        result = llm_call(prompt)
        if validator(result):
            return result
        prompt += f"\n\nPrevious attempt was invalid: {result}\nTry again:"
    raise ValueError("Max retries exceeded")

Advanced Reasoning Techniques

Moving beyond basic prompting, these techniques explicitly guide the model's reasoning process, dramatically improving performance on complex tasks.

Chain-of-Thought (CoT) Prompting

Chain-of-thought prompting asks the model to show its work — to generate intermediate reasoning steps before producing a final answer. This simple change yields massive improvements on math, logic, and multi-step reasoning tasks.

The discovery: Wei et al. (2022) found that prompting the model to generate intermediate reasoning steps improved accuracy on grade-school math problems from 17% to 78%. Kojima et al. (2022) later showed that even the bare instruction "Let's think step by step," with no examples at all, elicits the same behavior.

Example:

# Without CoT
prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A:
"""
# Model often outputs: "10" (incorrect)

# With CoT
prompt = """
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Let's think step by step.
"""
# Model outputs:
# "Roger starts with 5 balls.
# He buys 2 cans, each with 3 balls.
# 2 cans × 3 balls = 6 balls.
# Total: 5 + 6 = 11 balls."
# Answer: 11 (correct!)

Why CoT works:

The prevailing theory (from interpretability research) is that language models perform implicit computation through their forward pass. Each layer refines representations, but there's a limit to how much computation a single forward pass can do. By generating intermediate steps, the model gets to "think longer" through multiple forward passes (one per generated token).

Think of it like working memory: humans solve complex problems by writing down intermediate results. CoT lets models do the same.

Variants:

1. Zero-shot CoT: Just append "Let's think step by step" without examples. Surprisingly effective across diverse tasks.

def zero_shot_cot(question):
    return f"{question}\n\nLet's think step by step."

2. Few-shot CoT: Provide examples that include reasoning chains.

prompt = """
Q: If it takes 5 machines 5 minutes to make 5 widgets, how long would it take 100 machines to make 100 widgets?
A: Let's think step by step.
- 5 machines make 5 widgets in 5 minutes
- That means each machine makes 1 widget in 5 minutes
- So 100 machines make 100 widgets in 5 minutes
Answer: 5 minutes

Q: A farmer has 15 sheep and all but 8 die. How many are left?
A: Let's think step by step.
- "All but 8 die" means 8 survive
- The number that die is 15 - 8 = 7
- The number left is 8
Answer: 8 sheep

Q: {new_question}
A: Let's think step by step.
"""

3. Structured CoT: Enforce specific reasoning formats.

prompt = f"""
Problem: {problem}

Solve this using the following structure:
1. What is known: [List given information]
2. What is unknown: [What we need to find]
3. Relevant formulas/principles: [What applies here]
4. Step-by-step solution:
a) [First step with explanation]
b) [Second step with explanation]
...
5. Final answer: [Concise answer]
"""

Benchmark results (Wei et al., 2022):

Benchmark        Baseline   CoT      Improvement
GSM8K (math)     17.1%      78.2%    +357%
SVAMP (math)     63.7%      79.0%    +24%
CommonsenseQA    72.5%      78.1%    +7.7%
StrategyQA       54.3%      66.1%    +21.7%

When CoT helps most:

  • Multi-step reasoning (math, logic puzzles)
  • Problems requiring intermediate calculations
  • Tasks where the reasoning path matters (explainability)
  • Complex decision-making with trade-offs

When CoT doesn't help:

  • Simple lookup tasks ("What's the capital of France?")
  • When the model lacks requisite knowledge (CoT can't fix missing facts)
  • Very short answers (generating reasoning is wasteful)
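One practical consequence: route queries, applying CoT only when a cheap heuristic (or a small classifier) flags the question as multi-step. A keyword-based sketch, with `llm_call` as an assumed helper:

```python
import re

def needs_cot(question):
    """Crude heuristic: flag questions that likely need multi-step reasoning."""
    multi_step_markers = [
        r"\bhow many\b", r"\bcalculate\b", r"\bif\b.*\bthen\b",
        r"\d+\D+\d+",      # multiple numbers usually mean arithmetic
        r"\bwhy\b", r"\bexplain\b",
    ]
    return any(re.search(m, question.lower()) for m in multi_step_markers)

def route(question, llm_call):
    """Spend reasoning tokens only where they pay off."""
    if needs_cot(question):
        return llm_call(f"{question}\n\nLet's think step by step.")
    return llm_call(question)
```

A production router would replace the regex list with a cheap classifier, but the structure stays the same.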

Implementation best practices:

class ChainOfThoughtEngine:
    def __init__(self, model, temperature=0.7):
        self.model = model
        self.temperature = temperature

    def solve(self, question, mode='zero-shot'):
        if mode == 'zero-shot':
            prompt = f"{question}\n\nLet's think step by step."
        elif mode == 'few-shot':
            prompt = self._build_few_shot_prompt(question)
        else:
            raise ValueError(f"Unknown mode: {mode}")

        # Generate reasoning
        response = self.model.generate(
            prompt,
            temperature=self.temperature,
            max_tokens=512  # Allow room for reasoning
        )

        # Extract answer (often after "Answer:" or "Therefore,")
        answer = self._extract_answer(response)
        return answer, response

    def _extract_answer(self, response):
        """Extract final answer from reasoning chain."""
        # Look for common answer markers
        markers = ["Answer:", "Therefore,", "The answer is"]
        for marker in markers:
            if marker in response:
                # Get text after marker
                after_marker = response.split(marker)[-1].strip()
                # Take first sentence/line
                answer = after_marker.split('.')[0].split('\n')[0]
                return answer.strip()
        # Fallback: return last line
        return response.strip().split('\n')[-1]

Self-Consistency

Self-consistency (Wang et al., 2022) improves CoT by generating multiple reasoning paths and selecting the most common answer through majority vote.

Intuition: A single reasoning chain might make a mistake. But if you generate 10 chains and 7 reach the same answer, that's likely correct.

Algorithm:

def self_consistency(question, n=5, temperature=0.7):
    """
    Generate n reasoning chains and return majority answer.

    Args:
        question: Problem to solve
        n: Number of chains to generate
        temperature: Sampling temperature (higher = more diversity)

    Returns:
        Most common answer and confidence score
    """
    from collections import Counter

    prompt = f"{question}\n\nLet's think step by step."

    answers = []
    for _ in range(n):
        response = llm_call(
            prompt,
            temperature=temperature,
            max_tokens=512
        )
        answer = extract_answer(response)
        answers.append(answer)

    # Majority vote
    vote_counts = Counter(answers)
    majority_answer, count = vote_counts.most_common(1)[0]
    confidence = count / n

    return majority_answer, confidence

Example:

Question: "If you overtake the person in 2nd place, what place are you in?"

Chain 1: "You overtake 2nd, so you're now 2nd. Answer: 2nd place" ✓
Chain 2: "You were behind 2nd, now you're ahead, so 1st. Answer: 1st place" ✗
Chain 3: "Overtaking 2nd means you take their position. Answer: 2nd place" ✓
Chain 4: "You pass the person in 2nd. You're now 2nd. Answer: 2nd place" ✓
Chain 5: "You overtake 2nd, making you 1st. Answer: 1st place" ✗

Majority: "2nd place" (3/5 = 60% confidence)
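One subtlety the example glosses over: majority voting only works if equivalent phrasings ("2nd place", "Second place.") collapse to the same key, so normalize answers before counting. A sketch:

```python
from collections import Counter

def normalize_answer(answer):
    """Canonicalize an answer string so equivalent phrasings vote together."""
    a = answer.strip().lower().rstrip(".!")
    for word, ordinal in {"first": "1st", "second": "2nd", "third": "3rd"}.items():
        a = a.replace(word, ordinal)
    return a

def majority_vote(answers):
    """Return the most common normalized answer and its vote share."""
    counts = Counter(normalize_answer(a) for a in answers)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(answers)
```

For numeric tasks, normalization usually means parsing the number out of the answer string instead.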

Performance gains (Wang et al., 2022):

On GSM8K math problems:

  • Standard CoT: 74.4%
  • Self-consistency (n=40): 83.7% (+12.5%)

On CommonsenseQA:

  • Standard CoT: 78.1%
  • Self-consistency (n=40): 81.5% (+4.4%)

Cost-performance trade-offs:

Self-consistency requires n × cost of single inference. Choose n based on task importance:

Task criticality   n       Cost multiplier
Exploratory        3       3×
Production         5       5×
High-stakes        10-20   10-20×

Optimization: Use lower n initially, then increase for low-confidence cases:

def adaptive_self_consistency(question, confidence_threshold=0.7):
    """
    Start with n=3, increase if confidence is low.
    """
    answer, confidence = self_consistency(question, n=3)

    if confidence >= confidence_threshold:
        return answer, confidence

    # Low confidence, generate more chains
    answer, confidence = self_consistency(question, n=10)
    return answer, confidence

Advanced variant - weighted voting:

Not all reasoning chains are equal. Weight votes by chain quality:

def weighted_self_consistency(question, n=5):
    """
    Weight each answer by reasoning chain quality.
    """
    prompt = f"{question}\n\nLet's think step by step."

    answers_with_scores = []
    for _ in range(n):
        response = llm_call(prompt, temperature=0.7)
        answer = extract_answer(response)

        # Score reasoning quality
        quality_prompt = f"""
Rate the logical quality of this reasoning (0-10):
{response}

Score:
"""
        quality = int(llm_call(quality_prompt, temperature=0))

        answers_with_scores.append((answer, quality))

    # Weighted vote
    from collections import defaultdict
    vote_weights = defaultdict(float)
    for answer, quality in answers_with_scores:
        vote_weights[answer] += quality

    best_answer = max(vote_weights, key=vote_weights.get)
    confidence = vote_weights[best_answer] / sum(vote_weights.values())

    return best_answer, confidence

Tree of Thoughts (ToT)

Tree of Thoughts (Yao et al., 2023) takes CoT further by exploring multiple reasoning paths simultaneously through tree search. Instead of a linear chain, the model explores a tree of possibilities, backtracking when paths seem unpromising.

Key idea: Model reasoning as search through a state space. Each "thought" is a partial solution. Use heuristics to prioritize promising branches.

Algorithm (simplified DFS variant):

def tree_of_thoughts(problem, depth=3, breadth=3):
    """
    Explore solution tree using depth-first search.

    Args:
        problem: Initial problem statement
        depth: Maximum depth to explore
        breadth: Number of thoughts to generate at each node

    Returns:
        Value of the best reasoning path found
    """
    def explore(state, current_depth):
        if current_depth >= depth:
            # Terminal state, evaluate quality
            return evaluate_solution(state)

        # Generate possible next thoughts
        thoughts = generate_thoughts(state, k=breadth)

        # Evaluate each thought's promise
        thought_values = []
        for thought in thoughts:
            # Combine current state with thought
            new_state = state + "\n" + thought

            # Recursively explore this branch
            value = explore(new_state, current_depth + 1)
            thought_values.append((thought, value))

        # Return best branch
        best_thought, best_value = max(thought_values, key=lambda x: x[1])
        return best_value

    # Start exploration from initial problem
    return explore(problem, 0)

def generate_thoughts(state, k=3):
    """Generate k possible next reasoning steps."""
    prompt = f"""
Current reasoning:
{state}

Generate {k} different possible next steps:
1.
"""
    response = llm_call(prompt, temperature=0.8, max_tokens=300)
    # Parse into list of thoughts
    thoughts = [t.strip() for t in response.split('\n') if t.strip()]
    return thoughts[:k]

def evaluate_solution(state):
    """Score how promising this reasoning path is."""
    prompt = f"""
Rate this reasoning on a scale of 1-10:
{state}

Consider: logical coherence, progress toward solution, likelihood of correctness.

Score (integer 1-10):
"""
    score = int(llm_call(prompt, temperature=0, max_tokens=5))
    return score

Concrete example (Game of 24):

Task: Use four numbers (e.g., 4, 9, 10, 13) with +, -, ×, ÷ to make 24.

ToT exploration:

Root: "4, 9, 10, 13"
├─ Thought 1: "13 - 9 = 4" → State: "4, 4, 10"
│  ├─ Thought 1.1: "10 - 4 = 6" → State: "4, 6"
│  │  └─ Thought 1.1.1: "6 × 4 = 24" ✓ SOLUTION FOUND
│  ├─ Thought 1.2: "4 × 4 = 16" → State: "10, 16"
│  │  └─ Dead end (can't make 24)
│  └─ Thought 1.3: "10 + 4 = 14" → State: "4, 14"
│     └─ Dead end
├─ Thought 2: "10 - 4 = 6" → State: "6, 9, 13"
│  └─ ... (explore further)
└─ Thought 3: "9 + 10 = 19" → State: "4, 13, 19"
   └─ ... (explore further)

The tree search backtracks when a path seems unpromising (evaluated by the model itself), exploring alternative branches.
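For Game of 24 specifically, the state expansion is simple enough to implement deterministically, which is useful for testing search scaffolding before plugging in an LLM. A brute-force depth-first sketch:

```python
from itertools import permutations

def solve_24(numbers, target=24, eps=1e-6):
    """DFS over pairwise combinations; returns a list of steps or None."""
    if len(numbers) == 1:
        return [] if abs(numbers[0] - target) < eps else None
    for i, j in permutations(range(len(numbers)), 2):
        a, b = numbers[i], numbers[j]
        rest = [numbers[k] for k in range(len(numbers)) if k not in (i, j)]
        candidates = [(a + b, f"{a}+{b}"), (a - b, f"{a}-{b}"), (a * b, f"{a}*{b}")]
        if abs(b) > eps:
            candidates.append((a / b, f"{a}/{b}"))
        for value, step in candidates:
            # Recurse on the reduced state; backtrack on failure
            result = solve_24(rest + [value], target, eps)
            if result is not None:
                return [step] + result
    return None
```

For 4, 9, 10, 13 this finds a three-step solution (one instance being 13-9=4, 10-4=6, 6×4=24). The LLM version replaces this exhaustive expansion with sampled thoughts and learned evaluation.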

Benchmark results (Yao et al., 2023):

Task              CoT     ToT    Improvement
Game of 24        7.3%    74%    +914%
Creative writing  7.3     7.9    +8.2%
Crosswords        15.6%   78%    +400%

Why ToT works:

  1. Exploration: Tries multiple approaches rather than committing to first thought
  2. Self-evaluation: Model judges its own reasoning quality
  3. Backtracking: Abandons dead ends early

Implementation challenges:

  • Cost: Exploring a tree with breadth=3 and depth=4 requires 3^4 = 81 LLM calls
  • Evaluation accuracy: Model must reliably judge which thoughts are promising
  • Search strategy: DFS, BFS, or best-first? Each has trade-offs.

Production-ready implementation:

class TreeOfThoughts:
    def __init__(self, model, max_calls=50):
        self.model = model
        self.max_calls = max_calls
        self.call_count = 0

    def solve(self, problem):
        """Solve problem using best-first ToT search."""
        from queue import PriorityQueue

        # Priority queue of (priority, state, path)
        queue = PriorityQueue()
        queue.put((0, problem, []))

        best_solution = None
        best_score = -float('inf')

        while not queue.empty() and self.call_count < self.max_calls:
            priority, state, path = queue.get()

            # Check if this is a terminal state
            if self._is_terminal(state):
                score = self._evaluate_thought(state)
                if score > best_score:
                    best_score = score
                    best_solution = state
                continue

            # Generate next thoughts
            thoughts = self._generate_thoughts(state, k=3)

            for thought in thoughts:
                new_state = state + "\n" + thought
                new_path = path + [thought]

                # Evaluate promise of this thought
                value = self._evaluate_thought(new_state)

                # Add to queue (negative because PriorityQueue is a min-heap)
                queue.put((-value, new_state, new_path))

        return best_solution, best_score

    def _generate_thoughts(self, state, k):
        self.call_count += 1
        prompt = f"""
Current state:
{state}

Generate {k} possible next steps:
"""
        response = self.model.generate(prompt, temperature=0.8)
        return self._parse_thoughts(response, k)

    def _evaluate_thought(self, state):
        self.call_count += 1
        prompt = f"""
Rate this reasoning (1-10):
{state}

Score:
"""
        return int(self.model.generate(prompt, temperature=0))

    def _is_terminal(self, state):
        # Task-specific logic
        return "Answer:" in state or len(state.split('\n')) > 10

Graph of Thoughts (GoT)

Graph of Thoughts (Besta et al., 2023) generalizes ToT by allowing arbitrary graph structures. Thoughts can merge (combine multiple reasoning paths) or branch, enabling more complex reasoning patterns.

Key insight: Not all reasoning follows trees. Sometimes you want to:

  • Merge: Combine insights from multiple branches
  • Loop: Refine a solution iteratively
  • Aggregate: Synthesize information from many sources

Example - Document analysis:

Documents → Split into chunks → Process each chunk
                  ↓
        Extract key points (per chunk)
                  ↓
Merge all key points → Synthesize summary

This is a graph: multiple parallel paths (one per chunk) that merge into a synthesis step.

Graph structure representation:

class ThoughtGraph:
    def __init__(self):
        self.nodes = {}  # id -> thought content
        self.edges = []  # (from_id, to_id, operation)

    def add_node(self, node_id, content):
        self.nodes[node_id] = content

    def add_edge(self, from_id, to_id, operation='continue'):
        """
        operation can be:
        - 'continue': Standard progression
        - 'merge': Combine multiple thoughts
        - 'refine': Iteratively improve
        """
        self.edges.append((from_id, to_id, operation))

Example implementation - multi-document summarization:

def graph_of_thoughts_summarization(documents):
    """
    Summarize multiple documents using GoT.

    Graph structure:
    1. Generate summary for each document (parallel)
    2. Extract key themes from each summary (parallel)
    3. Merge all themes
    4. Generate final synthesis
    """
    graph = ThoughtGraph()

    # Layer 1: Individual summaries
    summaries = []
    for i, doc in enumerate(documents):
        node_id = f"summary_{i}"
        summary = llm_call(f"Summarize this document:\n{doc}")
        graph.add_node(node_id, summary)
        summaries.append((node_id, summary))

    # Layer 2: Key themes from each summary
    themes = []
    for node_id, summary in summaries:
        theme_node_id = f"theme_{node_id}"
        theme = llm_call(f"Extract key themes from:\n{summary}")
        graph.add_node(theme_node_id, theme)
        graph.add_edge(node_id, theme_node_id, 'continue')
        themes.append((theme_node_id, theme))

    # Layer 3: Merge themes
    merge_node_id = "merged_themes"
    all_themes = "\n\n".join([t[1] for t in themes])
    merged = llm_call(f"Merge these themes:\n{all_themes}")
    graph.add_node(merge_node_id, merged)
    for theme_id, _ in themes:
        graph.add_edge(theme_id, merge_node_id, 'merge')

    # Layer 4: Final synthesis
    final_node_id = "final_summary"
    synthesis = llm_call(f"Synthesize a final summary from:\n{merged}")
    graph.add_node(final_node_id, synthesis)
    graph.add_edge(merge_node_id, final_node_id, 'continue')

    return synthesis, graph

Advantages over ToT:

  1. Parallelization: Independent branches can run concurrently
  2. Information aggregation: Merge multiple perspectives
  3. Iterative refinement: Loops for incremental improvement
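Because the per-chunk branches are independent, advantage 1 falls out of a plain thread pool (LLM calls are I/O-bound); a sketch with `map_fn` standing in for a per-chunk LLM call:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map_then_merge(chunks, map_fn, merge_fn, max_workers=8):
    """Run independent GoT branches concurrently, then merge their outputs."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        branch_outputs = list(pool.map(map_fn, chunks))  # one branch per chunk
    return merge_fn(branch_outputs)
```

`pool.map` preserves input order, so the merge step sees branch outputs in the same order as the chunks.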

Performance (Besta et al., 2023):

On sorting tasks (sorting 32 numbers), GoT achieved:

  • 89% accuracy
  • 62% cost reduction vs. ToT (through parallelization)

ReAct (Reasoning + Acting)

ReAct (Yao et al., 2022) interleaves reasoning with actions in the world. Instead of pure thought, the model alternates between thinking (reasoning) and acting (calling tools, APIs, executing code).

Pattern:

Thought: I need to find the population of Paris
Action: search("Paris population")
Observation: 2.16 million (2019)
Thought: Now I need Tokyo's population
Action: search("Tokyo population")
Observation: 37.4 million (2021)
Thought: Tokyo is larger than Paris
Action: finish("Tokyo has a larger population than Paris")

Why ReAct matters:

Language models struggle with:

  • Current information (training data is stale)
  • Precise calculations
  • Accessing private data

ReAct bridges this by letting models use external tools.

Architecture:

class ReActAgent:
    def __init__(self, model, tools, max_steps=10):
        self.model = model
        self.tools = tools  # Dict of tool_name -> function
        self.max_steps = max_steps

    def run(self, task):
        """Execute task using ReAct loop."""
        trajectory = [f"Task: {task}"]

        for step in range(self.max_steps):
            # Generate thought + action
            prompt = self._build_prompt(trajectory)
            response = self.model.generate(prompt, temperature=0)

            # Parse response
            thought, action, action_input = self._parse_response(response)

            trajectory.append(f"Thought: {thought}")
            trajectory.append(f"Action: {action}[{action_input}]")

            # Execute action
            if action == "finish":
                return action_input

            if action not in self.tools:
                observation = f"Error: Unknown action '{action}'"
            else:
                observation = self.tools[action](action_input)

            trajectory.append(f"Observation: {observation}")

        return "Maximum steps reached without finishing"

    def _build_prompt(self, trajectory):
        tools_desc = "\n".join([
            f"- {name}: {func.__doc__}"
            for name, func in self.tools.items()
        ])

        return f"""
You can use these tools:
{tools_desc}

Always respond in this format:
Thought: [your reasoning]
Action: [tool_name]
Action Input: [input for tool]

When done:
Thought: [final reasoning]
Action: finish
Action Input: [final answer]

{chr(10).join(trajectory)}

Thought:
"""

    def _parse_response(self, response):
        """Extract thought, action, and input from model response."""
        lines = response.strip().split('\n')

        thought = ""
        action = ""
        action_input = ""

        for line in lines:
            if line.startswith("Thought:"):
                thought = line.split("Thought:")[-1].strip()
            elif line.startswith("Action:"):
                action = line.split("Action:")[-1].strip()
            elif line.startswith("Action Input:"):
                action_input = line.split("Action Input:")[-1].strip()

        return thought, action, action_input

Example tools:

def search(query):
    """Search the web for information."""
    # Integration with a search API (illustrative endpoint)
    import requests
    response = requests.get(
        "https://api.search.com/search",
        params={"q": query}
    )
    return response.json()["snippet"]

def calculate(expression):
    """Evaluate a mathematical expression."""
    try:
        # Safe eval (in production, use a proper math parser)
        result = eval(expression, {"__builtins__": {}}, {})
        return str(result)
    except Exception as e:
        return f"Error: {e}"

def wikipedia(query):
    """Get Wikipedia summary."""
    import wikipediaapi
    wiki = wikipediaapi.Wikipedia('en')
    page = wiki.page(query)
    return page.summary[:500]

tools = {
    "search": search,
    "calculate": calculate,
    "wikipedia": wikipedia
}

Real-world usage:

agent = ReActAgent(model, tools, max_steps=10)

result = agent.run(
    "What is the total GDP of the three largest EU countries?"
)

# Execution trace:
# Thought: I need to identify the three largest EU countries by GDP
# Action: search
# Action Input: largest EU countries by GDP
# Observation: Germany, France, Italy are top 3
#
# Thought: Now I need each country's GDP
# Action: search
# Action Input: Germany GDP 2023
# Observation: $4.3 trillion
#
# Thought: Get France GDP
# Action: search
# Action Input: France GDP 2023
# Observation: $2.9 trillion
#
# Thought: Get Italy GDP
# Action: search
# Action Input: Italy GDP 2023
# Observation: $2.0 trillion
#
# Thought: Sum the three GDPs
# Action: calculate
# Action Input: 4.3 + 2.9 + 2.0
# Observation: 9.2
#
# Thought: I have the total
# Action: finish
# Action Input: The total GDP of Germany, France, and Italy is $9.2 trillion

Performance benchmarks (Yao et al., 2022):

On HotpotQA (multi-hop question answering):

  • Standard prompting: 28.7%
  • CoT: 32.9%
  • ReAct: 37.4%

On AlfWorld (interactive environment):

  • Standard prompting: 12%
  • ReAct: 34%

Best practices:

  1. Tool documentation: Clear docstrings help the model choose appropriate tools
  2. Error handling: Tools should return descriptive error messages
  3. Rate limiting: Prevent infinite loops or excessive API calls
  4. Observation truncation: Limit observation length to avoid context overflow
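The truncation and rate-limit practices can be folded into a small wrapper around tool dispatch. A minimal sketch — the helper names, limits, and error-message formats here are illustrative assumptions, not part of the agent above:

```python
def truncate_observation(observation, max_chars=1000):
    """Cap observation length so long tool outputs don't overflow the context."""
    text = str(observation)
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + " ...[truncated]"

def run_tool(tools, action, action_input, call_counts, max_calls_per_tool=5):
    """Execute a tool with a per-tool call budget and descriptive errors."""
    if action not in tools:
        return f"Error: Unknown action '{action}'. Available: {', '.join(tools)}"
    if call_counts.get(action, 0) >= max_calls_per_tool:
        return f"Error: '{action}' call budget exhausted"
    call_counts[action] = call_counts.get(action, 0) + 1
    try:
        return truncate_observation(tools[action](action_input))
    except Exception as e:
        return f"Error: {action} failed with {e}"
```

Feeding the budget error back as an observation lets the agent recover (e.g. by finishing with what it has) instead of looping on a dead tool.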

Optimization & Automation

Manual prompt engineering doesn't scale. These techniques automate the optimization process.

Automatic Prompt Engineering (APE)

APE (Zhou et al., 2022) automatically generates and selects optimal prompts. Instead of manually iterating, you provide examples and let an LLM discover effective prompts.

Algorithm:

  1. Generate candidate prompts: Use an LLM to create diverse prompt variations
  2. Evaluate each prompt: Test on a validation set
  3. Select the best: Choose the highest-performing prompt

Implementation:

def automatic_prompt_engineering(
    task_description,
    train_examples,
    val_examples,
    num_candidates=20
):
    """
    Automatically discover optimal prompt for a task.

    Args:
        task_description: Natural language description of task
        train_examples: List of (input, output) pairs for prompt generation
        val_examples: List of (input, output) pairs for evaluation
        num_candidates: Number of prompt variations to try

    Returns:
        Best prompt and its accuracy
    """
    # Step 1: Generate candidate prompts
    meta_prompt = f"""
Task: {task_description}

Here are some examples:
{format_examples(train_examples[:5])}

Generate {num_candidates} different prompts that could solve this task.
Each prompt should use a different approach or phrasing.

Prompts:
"""

    candidates_text = llm_call(meta_prompt, temperature=1.0, max_tokens=2000)
    candidates = parse_prompts(candidates_text)

    # Step 2: Evaluate each candidate
    results = []
    for prompt in candidates:
        correct = 0
        for input_text, expected_output in val_examples:
            full_prompt = f"{prompt}\n\nInput: {input_text}\nOutput:"
            output = llm_call(full_prompt, temperature=0)
            if normalize(output) == normalize(expected_output):
                correct += 1

        accuracy = correct / len(val_examples)
        results.append((prompt, accuracy))

    # Step 3: Return best prompt
    best_prompt, best_accuracy = max(results, key=lambda x: x[1])
    return best_prompt, best_accuracy

def format_examples(examples):
    return "\n".join([
        f"Input: {inp}\nOutput: {out}"
        for inp, out in examples
    ])

def parse_prompts(text):
    """Extract individual prompts from generated text."""
    # Simple parsing - in production, use more robust extraction
    prompts = []
    for line in text.split('\n'):
        line = line.strip()
        if line and len(line) > 20:  # Filter very short lines
            prompts.append(line)
    return prompts

def normalize(text):
    """Normalize text for comparison."""
    return text.strip().lower()

Example usage:

task = "Classify whether a product review is positive or negative"

train_examples = [
    ("Great product, works perfectly!", "positive"),
    ("Terrible quality, broke after 1 day", "negative"),
    ("Decent but overpriced", "negative"),
    ("Exceeded my expectations", "positive"),
]

val_examples = [
    ("Love it!", "positive"),
    ("Disappointed", "negative"),
    # ... more examples
]

best_prompt, accuracy = automatic_prompt_engineering(
    task, train_examples, val_examples
)

print(f"Best prompt (accuracy: {accuracy:.1%}):")
print(best_prompt)

# Output might be:
# "Determine if the customer review expresses satisfaction (positive)
# or dissatisfaction (negative) with the product."
# Accuracy: 87.3%

Discovered prompts often outperform human-written ones. Zhou et al. found APE-generated prompts achieved 3-8% higher accuracy than human baselines on various tasks.

Why APE works:

  1. Exploration: Tries diverse phrasings humans might not consider
  2. Data-driven: Optimizes directly on your task, not general heuristics
  3. Scales: Can test hundreds of candidates automatically

Advanced: Iterative APE:

def iterative_ape(task, train_examples, val_examples, iterations=3):
    """
    Iteratively refine prompts.
    """
    best_prompt = task  # Start with task description
    best_accuracy = 0

    for i in range(iterations):
        print(f"Iteration {i+1}/{iterations}")

        # Generate variations of current best
        refinement_prompt = f"""
Current prompt: {best_prompt}
Current accuracy: {best_accuracy:.1%}

Generate 10 improved variations of this prompt.
Consider:
- More specific instructions
- Better examples
- Clearer output format
- Edge case handling

Variations:
"""

        candidates_text = llm_call(refinement_prompt, temperature=0.9)
        candidates = parse_prompts(candidates_text)

        # Evaluate
        for prompt in candidates:
            accuracy = evaluate_prompt(prompt, val_examples)
            if accuracy > best_accuracy:
                best_accuracy = accuracy
                best_prompt = prompt
                print(f"New best: {accuracy:.1%}")

    return best_prompt, best_accuracy

DSPy: Declarative Self-improving Prompts

DSPy (Khattab et al., 2023) is a framework that treats prompting as a programming problem. Instead of writing prompts, you write programs that generate prompts.

Core concepts:

  1. Signatures: Type-annotated function specs (input/output)
  2. Modules: Composable prompt templates
  3. Optimizers: Automatic prompt tuning

Example - sentiment classification:

import dspy

# Define signature (input -> output types)
class SentimentSignature(dspy.Signature):
    """Classify sentiment of text."""
    text = dspy.InputField()
    sentiment = dspy.OutputField(desc="positive, negative, or neutral")

# Create module
class SentimentClassifier(dspy.Module):
    def __init__(self):
        super().__init__()
        self.predictor = dspy.Predict(SentimentSignature)

    def forward(self, text):
        return self.predictor(text=text)

# Configure LLM
lm = dspy.OpenAI(model="gpt-3.5-turbo")
dspy.settings.configure(lm=lm)

# Use classifier
classifier = SentimentClassifier()
result = classifier("This movie was amazing!")
print(result.sentiment)  # "positive"

Automatic optimization:

DSPy can automatically tune prompts using a training set:

from dspy.teleprompt import BootstrapFewShot

# Training data (mark which field is the input)
train_data = [
    dspy.Example(text="Great product!", sentiment="positive").with_inputs("text"),
    dspy.Example(text="Terrible service", sentiment="negative").with_inputs("text"),
    # ... more examples
]

# Optimize
optimizer = BootstrapFewShot(metric=exact_match)
optimized_classifier = optimizer.compile(
    SentimentClassifier(),
    trainset=train_data
)

# The optimized classifier now has better prompts
result = optimized_classifier("I love this")

What DSPy does under the hood:

  1. Bootstrapping: Generates examples by running the model on training data
  2. Filtering: Keeps only high-quality examples
  3. Compilation: Inserts these examples into few-shot prompts
  4. Iteration: Refines through multiple passes
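The bootstrap-filter-compile loop can be sketched outside DSPy in a few lines. This is a simplification, not DSPy's actual internals: the `generate` callable and the `(prediction, gold)` metric signature are assumptions made for the sketch.

```python
def bootstrap_few_shot(generate, trainset, metric, max_demos=4):
    """Run the model on training inputs; keep only traces the metric accepts."""
    demos = []
    for example in trainset:
        prediction = generate(example["input"])
        if metric(prediction, example["output"]):  # filter: keep quality traces
            demos.append((example["input"], prediction))
        if len(demos) >= max_demos:
            break
    # compile: splice surviving demos into a few-shot prompt prefix
    prefix = "\n\n".join(f"Input: {i}\nOutput: {o}" for i, o in demos)
    return lambda new_input: f"{prefix}\n\nInput: {new_input}\nOutput:"
```

The key idea survives the simplification: demonstrations are not hand-written but harvested from the model's own successful runs.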

Advanced: Multi-stage programs:

class MultiHopQA(dspy.Module):
    """Answer questions requiring multiple reasoning steps."""

    def __init__(self):
        super().__init__()
        self.retrieve = dspy.Retrieve(k=3)  # Get relevant docs
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Generate search query
        search_query = self.generate_query(question=question).search_query

        # Retrieve context
        context = self.retrieve(search_query).passages

        # Generate answer
        answer = self.answer(context=context, question=question).answer
        return answer

# DSPy will automatically optimize all three sub-prompts

Benefits:

  • Modularity: Compose complex pipelines from simple components
  • Automatic optimization: No manual prompt tuning
  • Type safety: Catch errors before runtime

Trade-offs:

  • Learning curve: Requires understanding DSPy abstractions
  • Black box: Less control than manual prompting
  • Overhead: Optimization requires compute and training data

LLMLingua: Prompt Compression

LLMLingua (Jiang et al., 2023) compresses prompts to reduce costs while preserving performance. It selectively removes tokens that contribute least to the model's understanding.

Motivation: Long prompts are expensive. A 10,000-token prompt costs 10× more than 1,000 tokens. Can we compress without hurting accuracy?

Method:

LLMLingua uses a smaller LLM to score each token's importance, then removes low-scoring tokens.

from llmlingua import PromptCompressor

compressor = PromptCompressor()

original_prompt = """
You are a helpful assistant. Please answer the following question accurately.

Context: The French Revolution was a period of radical political and societal
change in France that began with the Estates-General of 1789 and ended with
the formation of the French Consulate in November 1799. Many of its ideas
are considered fundamental principles of liberal democracy.

Question: When did the French Revolution begin?

Please provide a concise answer based on the context above.
"""

compressed_prompt = compressor.compress(
    original_prompt,
    rate=0.5,          # Target 50% compression
    target_token=200   # Or specify exact token budget
)

print(f"Original: {len(original_prompt)} chars")
print(f"Compressed: {len(compressed_prompt)} chars")
print(f"Compression: {100*(1-len(compressed_prompt)/len(original_prompt)):.1f}%")

# Output:
# Original: 487 chars
# Compressed: 201 chars
# Compression: 58.7%
#
# Compressed prompt:
# "helpful assistant answer accurately
# Context: French Revolution radical political societal change France began
# Estates-General 1789 ended French Consulate November 1799
# Question: When French Revolution begin?
# concise answer based context"

Key technique: Token-level importance scoring:

def compute_token_importance(prompt, model):
    """
    Score each token by how much it affects model's predictions.

    Uses conditional perplexity: remove token i, measure
    how much perplexity increases.
    """
    tokens = tokenize(prompt)
    importances = []

    for i in range(len(tokens)):
        # Compute perplexity with token i
        ppl_with = model.perplexity(tokens)

        # Compute perplexity without token i
        tokens_without = tokens[:i] + tokens[i+1:]
        ppl_without = model.perplexity(tokens_without)

        # Importance = how much perplexity increases when removed
        importance = ppl_without - ppl_with
        importances.append(importance)

    return importances
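Given those scores, the removal step itself is a thresholded filter. A sketch assuming word-level tokens — LLMLingua itself operates on subword tokens and adds per-section budget controls:

```python
def compress_by_importance(tokens, importances, rate=0.5):
    """Keep the highest-importance fraction of tokens, preserving original order."""
    keep_count = max(1, int(len(tokens) * rate))
    # Rank token indices by score, take the top keep_count
    ranked = sorted(range(len(tokens)), key=lambda i: importances[i], reverse=True)
    keep = set(ranked[:keep_count])
    return " ".join(tok for i, tok in enumerate(tokens) if i in keep)

tokens = ["the", "French", "Revolution", "began", "in", "1789"]
scores = [0.1, 0.9, 0.95, 0.6, 0.05, 0.99]
print(compress_by_importance(tokens, scores, rate=0.5))
# → French Revolution 1789
```

Note that order is preserved: filtering by index, not by rank, keeps the compressed prompt readable.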

Performance (Jiang et al., 2023):

On question-answering tasks with 2× compression:

  • Accuracy drop: Only 2-3%
  • Cost savings: 50%
  • Latency improvement: 1.4× faster

On retrieval-augmented generation with 4× compression:

  • Accuracy drop: 5-7%
  • Cost savings: 75%

When to use LLMLingua:

  • Long context scenarios (RAG, document analysis)
  • Cost-sensitive applications
  • When slight accuracy trade-off is acceptable

Implementation tips:

class AdaptiveCompressor:
    """
    Compress prompts adaptively based on budget and importance.
    """
    def __init__(self, base_compressor):
        self.compressor = base_compressor

    def compress(self, prompt, budget_tokens):
        """
        Compress prompt to fit within token budget.

        Preserves:
        1. Instructions (high priority)
        2. Examples (medium priority)
        3. Context (compress most aggressively)
        """
        sections = self._split_sections(prompt)

        # Allocate budget
        instruction_budget = int(budget_tokens * 0.3)
        example_budget = int(budget_tokens * 0.4)
        context_budget = budget_tokens - instruction_budget - example_budget

        compressed_sections = {
            'instruction': self.compressor.compress(
                sections['instruction'],
                target_token=instruction_budget
            ),
            'examples': self.compressor.compress(
                sections['examples'],
                target_token=example_budget
            ),
            'context': self.compressor.compress(
                sections['context'],
                target_token=context_budget
            )
        }

        return self._merge_sections(compressed_sections)

Practical Templates & Patterns

Theory is useful, but practitioners need ready-to-use templates. Here are battle-tested patterns for common scenarios.

Structured Output Generation

Getting LLMs to output valid JSON, XML, or other structured formats is notoriously tricky. Use these strategies:

Strategy 1: Schema-first prompting:

import json

def generate_structured_output(data, schema):
    """
    Generate JSON conforming to schema.
    """
    prompt = f"""
Generate a JSON object that follows this schema:

{json.dumps(schema, indent=2)}

Rules:
- All required fields must be present
- Use correct data types (string, number, boolean, array, object)
- Enum fields must use one of the specified values
- No additional fields beyond the schema

Input data:
{data}

Output valid JSON only (no explanation):
"""

    response = llm_call(prompt, temperature=0)

    # Validate against schema
    try:
        parsed = json.loads(response)
        validate_against_schema(parsed, schema)
        return parsed
    except Exception as e:
        # Retry with error feedback
        return retry_with_feedback(prompt, response, str(e))

# Example usage
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "number"},
        "hobbies": {
            "type": "array",
            "items": {"type": "string"}
        },
        "status": {
            "type": "string",
            "enum": ["active", "inactive"]
        }
    },
    "required": ["name", "age"]
}

data = "John is a 30-year-old software engineer who likes hiking and reading. He's currently active."

result = generate_structured_output(data, schema)
# {
#     "name": "John",
#     "age": 30,
#     "hobbies": ["hiking", "reading"],
#     "status": "active"
# }
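The `retry_with_feedback` helper used above is left undefined. One possible sketch — the model callable is passed in explicitly here, and the retry policy and message wording are assumptions:

```python
import json

def retry_with_feedback(prompt, bad_response, error, llm_call, max_retries=2):
    """Re-prompt with the validation error appended until the JSON parses."""
    for _ in range(max_retries):
        retry_prompt = (
            f"{prompt}\n\n"
            f"Your previous output was:\n{bad_response}\n\n"
            f"It failed validation with: {error}\n"
            f"Output corrected, valid JSON only:"
        )
        bad_response = llm_call(retry_prompt, temperature=0)
        try:
            return json.loads(bad_response)
        except ValueError as e:
            error = str(e)  # feed the new parse error into the next attempt
    raise ValueError(f"Could not produce valid JSON after {max_retries} retries")
```

Showing the model its own failed output plus the validator's message is usually enough; JSON errors rarely survive a second pass.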

Strategy 2: Few-shot with valid examples:

def few_shot_structured_generation():
    prompt = """
Extract information in this JSON format:

Example 1:
Text: "Alice is 25 and works as a designer."
Output: {"name": "Alice", "age": 25, "occupation": "designer"}

Example 2:
Text: "Bob, a 40-year-old teacher, lives in Boston."
Output: {"name": "Bob", "age": 40, "occupation": "teacher", "location": "Boston"}

Example 3:
Text: "Carol enjoys programming. She's 35."
Output: {"name": "Carol", "age": 35, "interests": ["programming"]}

Now extract from this text:
Text: "David is a 28-year-old chef who loves traveling."
Output:
"""
    return llm_call(prompt, temperature=0)

Strategy 3: Use function calling (OpenAI-specific):

import json
import openai

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": "Extract person info from: David is a 28-year-old chef"}
    ],
    functions=[
        {
            "name": "extract_person_info",
            "description": "Extract structured information about a person",
            "parameters": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"},
                    "occupation": {"type": "string"}
                },
                "required": ["name"]
            }
        }
    ],
    function_call={"name": "extract_person_info"}
)

# Guaranteed valid JSON matching schema
args = json.loads(response["choices"][0]["message"]["function_call"]["arguments"])

Code Generation & Debugging

Generating reliable code requires careful prompting:

def generate_code(task, language="python", test_cases=None):
    """
    Generate and validate code for a task.
    """
    prompt = f"""
Write {language} code to solve this task:

Task: {task}

Requirements:
- Include clear comments explaining the logic
- Handle edge cases and errors
- Follow best practices for {language}
- Include a main function demonstrating usage

{format_test_cases(test_cases) if test_cases else ''}

Provide complete, runnable code:

```{language}
"""

    code = llm_call(prompt, temperature=0.3)

    # Extract code from markdown
    code = extract_code_block(code)

    # Validate
    if test_cases:
        results = run_tests(code, test_cases, language)
        if not all(r.passed for r in results):
            # Debug and retry
            code = debug_and_fix(code, results, prompt)

    return code

def format_test_cases(test_cases):
    """Format test cases for prompt."""
    formatted = "Test cases:\n"
    for i, (input_data, expected) in enumerate(test_cases, 1):
        formatted += f"{i}. Input: {input_data} → Expected: {expected}\n"
    return formatted

def debug_and_fix(code, test_results, original_prompt):
    """Iteratively fix failing code."""
    failures = [r for r in test_results if not r.passed]

    debug_prompt = f"""
{original_prompt}

Your previous code failed these tests:

{format_failures(failures)}

Fix the code to pass all tests. Provide corrected code:

```python
"""

    fixed_code = llm_call(debug_prompt, temperature=0.2)
    return extract_code_block(fixed_code)
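The `extract_code_block` helper is assumed above. A regex sketch that pulls the first fenced block out of a markdown response — the fall-back-to-raw-text behavior is an assumption:

```python
import re

def extract_code_block(text):
    """Return the contents of the first fenced code block, else the raw text."""
    match = re.search(r"```[a-zA-Z]*\n(.*?)```", text, re.DOTALL)
    return match.group(1).strip() if match else text.strip()

raw = "Here you go:\n```python\nprint('hi')\n```\nDone."
print(extract_code_block(raw))  # → print('hi')
```

The fallback matters here: because `generate_code`'s prompt already opens a fence, the model's reply may contain no fences at all.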

Data Extraction & Classification

Extract structured data from unstructured text:

class DataExtractor:
    """
    Extract structured data using progressive refinement.
    """

    def __init__(self, model):
        self.model = model

    def extract(self, text, schema, confidence_threshold=0.8):
        """
        Extract data matching schema, with confidence scores.
        """
        # Step 1: Initial extraction
        extraction_prompt = f"""
Extract all instances matching this schema:

{format_schema(schema)}

Text:
{text}

For each instance, provide:
1. Extracted data in JSON
2. Confidence (0-1)
3. Supporting quote from text

Output:
"""

        raw_extraction = self.model.generate(extraction_prompt)
        candidates = self.parse_candidates(raw_extraction)

        # Step 2: Validate and filter
        validated = []
        for candidate in candidates:
            if candidate['confidence'] >= confidence_threshold:
                # Verify extraction is supported by text
                if self.verify_extraction(text, candidate):
                    validated.append(candidate)

        # Step 3: Resolve conflicts (overlapping extractions)
        resolved = self.resolve_conflicts(validated)

        return resolved

    def verify_extraction(self, text, candidate):
        """
        Verify that extraction is actually supported by source text.
        """
        verification_prompt = f"""
Source text:
{text}

Claimed extraction:
{json.dumps(candidate['data'])}

Supporting quote:
{candidate['quote']}

Is this extraction accurate and supported by the text?
Answer: yes or no
"""

        answer = self.model.generate(verification_prompt, temperature=0)
        return 'yes' in answer.lower()

Creative Writing & Content Generation

Generate high-quality creative content:

def generate_creative_content(topic, style, length, audience):
    """
    Generate creative content with stylistic control.
    """
    prompt = f"""
Write a {length}-word piece about: {topic}

Style: {style}
Audience: {audience}

Guidelines:
- Start with a compelling hook
- Use vivid, concrete details
- Vary sentence structure for rhythm
- End with a memorable conclusion
- Avoid clichés and generic phrases
- Show, don't just tell

Content:
"""

    # Use higher temperature for creativity
    content = llm_call(prompt, temperature=0.8, max_tokens=length*2)

    # Post-process: check quality metrics
    quality_score = evaluate_writing_quality(content)

    if quality_score < 0.7:
        # Refine using critique feedback
        content = refine_content(content, prompt)

    return content

def evaluate_writing_quality(text):
    """
    Score writing quality across multiple dimensions.
    """
    eval_prompt = f"""
Rate this writing on a scale of 0-1 for each criterion:

Text:
{text}

Criteria:
1. Originality: Unique perspective and fresh language
2. Clarity: Easy to understand, well-structured
3. Engagement: Captures and holds attention
4. Concrete details: Specific, vivid examples
5. Flow: Smooth transitions, varied rhythm

Provide scores:
Originality:
Clarity:
Engagement:
Concrete details:
Flow:
"""

    response = llm_call(eval_prompt, temperature=0)
    scores = parse_scores(response)
    return sum(scores.values()) / len(scores)

Multi-turn Conversation Management

Build context-aware conversational agents:

class ConversationManager:
    """
    Manage multi-turn conversations with context tracking.
    """

    def __init__(self, model, system_prompt, max_context_tokens=4000):
        self.model = model
        self.system_prompt = system_prompt
        self.max_context_tokens = max_context_tokens
        self.history = []

    def chat(self, user_message):
        """
        Process user message and generate response.
        """
        # Add user message to history
        self.history.append({"role": "user", "content": user_message})

        # Compress history if needed
        context = self.get_context()

        # Generate response
        full_prompt = self.build_prompt(context)
        response = self.model.generate(full_prompt)

        # Add response to history
        self.history.append({"role": "assistant", "content": response})

        return response

    def get_context(self):
        """
        Get recent context within token budget.
        """
        # Always include system prompt
        context = [{"role": "system", "content": self.system_prompt}]

        token_count = count_tokens(self.system_prompt)

        # Add messages from most recent backward
        for message in reversed(self.history):
            msg_tokens = count_tokens(message["content"])
            if token_count + msg_tokens > self.max_context_tokens:
                break
            context.insert(1, message)  # Insert after system prompt
            token_count += msg_tokens

        return context

    def summarize_history(self):
        """
        Summarize old conversation turns to save tokens.
        """
        if len(self.history) < 10:
            return

        # Summarize oldest messages
        old_messages = self.history[:6]
        summary_prompt = f"""
Summarize this conversation concisely:

{format_messages(old_messages)}

Summary (2-3 sentences):
"""

        summary = self.model.generate(summary_prompt, temperature=0)

        # Replace old messages with summary
        self.history = [
            {"role": "system", "content": f"Previous conversation: {summary}"}
        ] + self.history[6:]
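`count_tokens` and `format_messages` are assumed helpers. Rough stand-ins — a real implementation would count with the model's own tokenizer (e.g. tiktoken) rather than a character heuristic:

```python
def count_tokens(text):
    """Crude token estimate: roughly 4 characters per token for English text."""
    return max(1, len(text) // 4)

def format_messages(messages):
    """Render chat messages as 'role: content' lines for a summarization prompt."""
    return "\n".join(f"{m['role']}: {m['content']}" for m in messages)
```

An approximate count is fine for budgeting as long as you leave headroom; an exact tokenizer matters only when you run close to the context limit.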

Evaluation & Debugging

Prompt engineering is empirical. You need robust evaluation to know what works.

Evaluation Metrics

1. Exact Match:

def exact_match(prediction, ground_truth):
    """Strict equality after normalization."""
    return normalize(prediction) == normalize(ground_truth)

def normalize(text):
    """Lowercase, strip whitespace, remove punctuation."""
    import string
    text = text.lower().strip()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return ' '.join(text.split())

# Usage
predictions = ["The answer is 42.", "42", "forty-two"]
ground_truth = "42"

for pred in predictions:
    print(f"{pred} → {exact_match(pred, ground_truth)}")
# Output:
# The answer is 42. → False
# 42 → True
# forty-two → False

2. F1 Score (for extraction tasks):

def compute_f1(prediction, ground_truth):
    """
    Token-level F1 score.
    """
    pred_tokens = set(normalize(prediction).split())
    true_tokens = set(normalize(ground_truth).split())

    if len(pred_tokens) == 0 and len(true_tokens) == 0:
        return 1.0
    if len(pred_tokens) == 0 or len(true_tokens) == 0:
        return 0.0

    common = pred_tokens & true_tokens
    precision = len(common) / len(pred_tokens)
    recall = len(common) / len(true_tokens)

    if precision + recall == 0:
        return 0.0

    f1 = 2 * precision * recall / (precision + recall)
    return f1
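To see why token-level F1 is more forgiving than exact match, trace one prediction by hand:

```python
# Worked example: prediction "Paris France" against ground truth "Paris".
# Exact match fails, but F1 gives partial credit.
pred_tokens = {"paris", "france"}
true_tokens = {"paris"}
common = pred_tokens & true_tokens          # {"paris"}
precision = len(common) / len(pred_tokens)  # 0.5: half the prediction is right
recall = len(common) / len(true_tokens)     # 1.0: all of the truth was found
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # → 0.667
```

Exact match scores this 0; F1's 0.667 reflects that the answer is correct but over-specified.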

3. Semantic Similarity:

def semantic_similarity(prediction, ground_truth):
    """
    Measure semantic similarity using embeddings.
    """
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer('all-MiniLM-L6-v2')

    emb1 = model.encode(prediction, convert_to_tensor=True)
    emb2 = model.encode(ground_truth, convert_to_tensor=True)

    cosine_sim = util.cos_sim(emb1, emb2).item()
    return cosine_sim

4. LLM-as-Judge:

def llm_as_judge(prediction, ground_truth, criteria):
    """
    Use an LLM to evaluate quality.
    """
    judge_prompt = f"""
Evaluate this model output:

Task: {criteria['task']}

Expected: {ground_truth}
Actual: {prediction}

Criteria:
{format_criteria(criteria['rubric'])}

Provide:
1. Score (0-10)
2. Brief justification

Evaluation:
"""

    evaluation = llm_call(judge_prompt, temperature=0)
    score = extract_score(evaluation)
    justification = extract_justification(evaluation)

    return score, justification

# Example
criteria = {
    "task": "Summarize the article",
    "rubric": {
        "Accuracy": "Factually correct, no hallucinations",
        "Completeness": "Covers key points",
        "Conciseness": "No unnecessary details",
        "Fluency": "Clear, grammatical prose"
    }
}

score, reasoning = llm_as_judge(
    prediction="The article discusses climate change impacts...",
    ground_truth="Summary should cover: causes, impacts, solutions",
    criteria=criteria
)
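`extract_score` and `extract_justification` are assumed parsers. A regex sketch matched to the "Score (0-10)" format the judge prompt requests — the `Justification:` label and the fallback behaviors are assumptions:

```python
import re

def extract_score(evaluation):
    """Pull the first 'Score: N' (0-10, decimals allowed) out of judge output."""
    match = re.search(r"score[:\s]*([0-9]+(?:\.[0-9]+)?)", evaluation, re.IGNORECASE)
    return float(match.group(1)) if match else None

def extract_justification(evaluation):
    """Take everything after a 'Justification:' label, else the full text."""
    match = re.search(r"justification[:\s]*(.+)", evaluation,
                      re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else evaluation.strip()
```

Returning `None` on a missing score (rather than 0) keeps unparseable judge outputs distinguishable from genuine zero scores.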

Systematic A/B Testing

Compare prompt variants rigorously:

class PromptExperiment:
    """
    Run A/B tests on prompt variants.
    """

    def __init__(self, test_set, metrics):
        self.test_set = test_set  # List of (input, ground_truth)
        self.metrics = metrics    # List of metric functions

    def evaluate_prompt(self, prompt_fn):
        """
        Evaluate a prompt on test set.

        Args:
            prompt_fn: Function that takes input and returns prompt

        Returns:
            Dict of metric name -> score
        """
        results = {}
        for metric in self.metrics:
            scores = []
            for input_data, ground_truth in self.test_set:
                prompt = prompt_fn(input_data)
                prediction = llm_call(prompt, temperature=0)
                score = metric(prediction, ground_truth)
                scores.append(score)
            results[metric.__name__] = sum(scores) / len(scores)
        return results

    def compare(self, prompts_dict):
        """
        Compare multiple prompt variants.

        Args:
            prompts_dict: {name: prompt_fn}

        Returns:
            DataFrame with results
        """
        import pandas as pd

        results = []
        for name, prompt_fn in prompts_dict.items():
            scores = self.evaluate_prompt(prompt_fn)
            scores['prompt_name'] = name
            results.append(scores)

        df = pd.DataFrame(results)
        return df.sort_values(by=df.columns[0], ascending=False)

# Usage example
test_set = [
    ("Classify: This movie was great!", "positive"),
    ("Classify: Terrible experience", "negative"),
    # ... more examples
]

metrics = [exact_match, semantic_similarity]

experiment = PromptExperiment(test_set, metrics)

# Define variants
prompts = {
    "baseline": lambda x: f"Classify sentiment: {x}\nSentiment:",
    "with_instructions": lambda x: f"Classify as positive/negative/neutral.\nText: {x}\nSentiment:",
    "few_shot": lambda x: few_shot_template(x),
    "cot": lambda x: f"Classify sentiment step by step: {x}\nThinking:"
}

results = experiment.compare(prompts)
print(results)

#    exact_match  semantic_similarity        prompt_name
# 2         0.87                 0.94           few_shot
# 3         0.84                 0.91                cot
# 1         0.79                 0.89  with_instructions
# 0         0.73                 0.85           baseline

Debugging Failed Prompts

When prompts fail, diagnose systematically:

class PromptDebugger:
"""
Diagnose why a prompt fails.
"""

def debug(self, prompt, expected, actual):
"""
Run diagnostic checks and suggest fixes.
"""
issues = []

# Check 1: Is the prompt clear?
if self.check_ambiguity(prompt):
issues.append({
"issue": "Ambiguous instructions",
"evidence": self.find_ambiguous_phrases(prompt),
"fix": "Add specific constraints and examples"
})

# Check 2: Are there contradictions?
if self.check_contradictions(prompt):
            issues.append({
                "issue": "Contradictory requirements",
                "evidence": self.check_contradictions(prompt),
                "fix": "Resolve conflicting instructions"
            })

        # Check 3: Is context sufficient?
        if self.check_missing_context(prompt, expected):
            issues.append({
                "issue": "Insufficient context",
                "evidence": "Expected output requires information not in prompt",
                "fix": "Provide additional background or examples"
            })

        # Check 4: Format issues?
        if self.check_format_mismatch(expected, actual):
            issues.append({
                "issue": "Output format mismatch",
                "evidence": f"Expected format: {self.infer_format(expected)}, Got: {self.infer_format(actual)}",
                "fix": "Explicitly specify output format with examples"
            })

        # Check 5: Is it too complex?
        if self.check_complexity(prompt):
            issues.append({
                "issue": "Task too complex for single prompt",
                "evidence": f"Prompt asks for {self.count_subtasks(prompt)} subtasks",
                "fix": "Decompose into multiple steps"
            })

        return self.generate_report(issues)

    def check_ambiguity(self, prompt):
        """Detect vague language; returns the ambiguous phrases found."""
        ambiguous_phrases = [
            "relevant", "appropriate", "good", "bad",
            "some", "few", "many", "stuff", "things"
        ]
        return [phrase for phrase in ambiguous_phrases if phrase in prompt.lower()]

    def check_contradictions(self, prompt):
        """Check for conflicting instructions."""
        # Simple heuristic: look for "but", "however", "although"
        contradiction_markers = ["but", "however", "although", "except"]
        return sum(marker in prompt.lower() for marker in contradiction_markers) >= 2

    def check_missing_context(self, prompt, expected):
        """Simple heuristic: expected output cites numbers or capitalized terms absent from the prompt."""
        facts = [tok for tok in expected.split() if tok[:1].isupper() or tok[:1].isdigit()]
        return any(fact.lower() not in prompt.lower() for fact in facts)

    def check_complexity(self, prompt):
        """Simple heuristic: three or more subtask markers suggests decomposition."""
        return self.count_subtasks(prompt) >= 3

    def check_format_mismatch(self, expected, actual):
        """Check if output format matches expectation."""
        expected_fmt = self.infer_format(expected)
        actual_fmt = self.infer_format(actual)
        return expected_fmt != actual_fmt

    def infer_format(self, text):
        """Infer output format (json, list, prose, etc)."""
        text = text.strip()
        if text.startswith('{') and text.endswith('}'):
            return "json_object"
        if text.startswith('[') and text.endswith(']'):
            return "json_array"
        if '\n-' in text or '\n*' in text or '\n1.' in text:
            return "list"
        return "prose"

    def count_subtasks(self, prompt):
        """Estimate number of subtasks in prompt."""
        task_markers = [
            "first", "then", "next", "finally", "also", "additionally",
            "1.", "2.", "3.", "step 1", "step 2"
        ]
        return sum(marker in prompt.lower() for marker in task_markers)

    def generate_report(self, issues):
        """Format diagnosis report."""
        if not issues:
            return "No obvious issues detected. Try:\n- Adding more examples\n- Adjusting temperature\n- Using a different model"

        report = "Diagnosis:\n\n"
        for i, issue in enumerate(issues, 1):
            report += f"{i}. {issue['issue']}\n"
            report += f"   Evidence: {issue['evidence']}\n"
            report += f"   Fix: {issue['fix']}\n\n"

        return report

# Usage
debugger = PromptDebugger()

prompt = "Classify the sentiment. Pick the most relevant label and give a good explanation."
expected = '{"sentiment": "positive"}'
actual = "The review expresses a generally favorable opinion with some minor criticisms..."

diagnosis = debugger.debug(prompt, expected, actual)
print(diagnosis)

# Output:
# Diagnosis:
#
# 1. Ambiguous instructions
#    Evidence: ['relevant', 'good']
#    Fix: Add specific constraints and examples
#
# 2. Output format mismatch
#    Evidence: Expected format: json_object, Got: prose
#    Fix: Explicitly specify output format with examples

Error Analysis

Systematically analyze failure modes:

def normalize(text):
    """Normalize for comparison: lowercase, strip whitespace."""
    return text.strip().lower()

def analyze_errors(predictions, ground_truths, inputs):
    """
    Categorize and analyze prediction errors.
    """
    errors = []

    for pred, truth, inp in zip(predictions, ground_truths, inputs):
        if normalize(pred) != normalize(truth):
            error = {
                "input": inp,
                "prediction": pred,
                "ground_truth": truth,
                "error_type": categorize_error(pred, truth),
                "severity": compute_severity(pred, truth)
            }
            errors.append(error)

    # Group by error type
    from collections import defaultdict
    by_type = defaultdict(list)
    for error in errors:
        by_type[error["error_type"]].append(error)

    # Generate report
    report = f"Total errors: {len(errors)}/{len(predictions)} ({100*len(errors)/len(predictions):.1f}%)\n\n"

    for error_type, instances in sorted(by_type.items(), key=lambda x: -len(x[1])):
        report += f"{error_type}: {len(instances)} instances\n"
        report += "Examples:\n"
        for instance in instances[:3]:  # Show up to 3 examples
            report += f"  Input: {instance['input'][:50]}...\n"
            report += f"  Predicted: {instance['prediction']}\n"
            report += f"  Expected: {instance['ground_truth']}\n\n"

    return report

def categorize_error(prediction, ground_truth):
    """Classify type of error."""
    pred_norm = normalize(prediction)
    truth_norm = normalize(ground_truth)

    if len(pred_norm) == 0:
        return "empty_output"

    if pred_norm in truth_norm or truth_norm in pred_norm:
        return "partial_match"

    pred_words = set(pred_norm.split())
    truth_words = set(truth_norm.split())
    overlap = len(pred_words & truth_words) / max(len(pred_words), len(truth_words))

    if overlap > 0.5:
        return "semantic_error"

    if overlap > 0:
        return "partial_hallucination"

    return "complete_hallucination"

def compute_severity(prediction, ground_truth):
    """Score error severity (0=minor, 1=critical)."""
    # Uses semantic similarity as a proxy; assumes a semantic_similarity
    # helper (e.g., embedding cosine similarity) is defined elsewhere.
    similarity = semantic_similarity(prediction, ground_truth)
    return 1 - similarity
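
The `semantic_similarity` helper used by `compute_severity` is left abstract above. In practice you would use cosine similarity over embeddings; as a dependency-free stand-in for quick experimentation, a character-level ratio is a rough sketch:

```python
from difflib import SequenceMatcher

def semantic_similarity(a, b):
    # Rough stand-in for embedding-based similarity: returns a 0..1 ratio.
    # Swap in embedding cosine similarity for real evaluations.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()
```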

Common Pitfalls & Solutions

Learn from common mistakes:

Pitfall 1: Vague Instructions

Problem:

prompt = "Make this better: {text}"

Why it fails: "Better" is subjective. The model doesn't know what dimensions to optimize.

Solution:

prompt = f"""
Improve this text for clarity and conciseness:
- Remove redundant words
- Use active voice
- Break up long sentences
- Preserve all key information

Original: {text}

Improved version:
"""

Pitfall 2: Assuming Knowledge

Problem:

prompt = "What's the bug in this code?"  # No code provided

Why it fails: Model needs context you forgot to include.

Solution:

prompt = f"""
Find bugs in this Python code:

{code}

Check for:
- Syntax errors
- Logic errors
- Edge cases not handled
- Potential runtime errors

Bugs found:
"""

Pitfall 3: Overly Long Prompts

Problem: 15,000-word prompt with every possible instruction.

Why it fails: Models struggle to attend to all information (the "lost in the middle" problem). Also expensive.

Solution: Decompose or use hierarchical prompting:

# Step 1: Extract key info
key_info = extract_key_information(long_document)

# Step 2: Focused task with only relevant info
result = perform_task(key_info)

Pitfall 4: Ignoring Output Format

Problem:

prompt = "Extract the dates."
# Gets: "The dates are March 3 and April 15."
# Want: ["2024-03-03", "2024-04-15"]

Solution:

prompt = f"""
Extract all dates in ISO format (YYYY-MM-DD).
Output as JSON array of strings.

Text: {text}

Output (JSON only):
"""

Pitfall 5: No Validation

Problem: Accepting model output without checking.

Solution:

def validated_generation(prompt, validator, max_retries=3):
    """Generate with validation loop."""
    for attempt in range(max_retries):
        output = llm_call(prompt)

        is_valid, error = validator(output)
        if is_valid:
            return output

        # Retry with feedback
        prompt += f"\n\nError: {error}\nTry again:"

    raise ValueError("Failed validation after max retries")

# Example validator
import json

def validate_json(output):
    try:
        json.loads(output)
        return True, None
    except json.JSONDecodeError as e:
        return False, f"Invalid JSON: {e}"

Frequently Asked Questions

Q1: Should I use higher or lower temperature?

Answer: It depends on your task:

  • Temperature = 0 (or very low): Use for tasks requiring consistency and correctness (classification, extraction, math, code generation). The model greedily picks the most likely token, giving near-deterministic outputs (most APIs still aren't perfectly reproducible).

  • Temperature = 0.7-0.8: Good for creative tasks where you want diversity but still coherent outputs (writing, brainstorming, marketing copy).

  • Temperature = 1.0+: Maximum randomness. Rarely useful except for exploration or artistic generation.

Rule of thumb: Start with 0 for structured tasks, 0.7 for creative ones.

Q2: How many examples should I include in few-shot prompts?

Answer: Research shows diminishing returns:

  • 2-3 examples: Often sufficient for simple tasks
  • 5-7 examples: Sweet spot for most tasks
  • 10+ examples: Only helps if examples are diverse and cover edge cases
  • 50+ examples: Approaches the in-context learning ceiling; consider fine-tuning instead

Quality matters more than quantity. Choose diverse, representative examples.
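
As a concrete sketch (the function name and prompt layout are illustrative, not from any particular library), assembling a few-shot prompt from example pairs:

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs."""
    lines = [instruction, ""]
    for x, y in examples:
        lines += [f"Input: {x}", f"Output: {y}", ""]
    lines += [f"Input: {query}", "Output:"]
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify sentiment as positive or negative.",
    [("Great product", "positive"), ("Broke after a day", "negative")],
    "Works exactly as advertised",
)
```

Ending on "Output:" nudges the model to complete the pattern rather than add commentary.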

Q3: When should I fine-tune instead of prompt engineering?

Answer: Fine-tune when:

  1. You have >1,000 high-quality examples (preferably 10,000+)
  2. Task is specialized with domain jargon or style
  3. Latency/cost matters and you make millions of requests
  4. Prompt engineering plateaus below acceptable accuracy

Stick with prompting when:

  1. Task changes frequently (prompts are easier to iterate)
  2. You lack training data or labeling budget
  3. You need interpretability (prompts are transparent)
  4. You want to leverage the latest models without retraining

Q4: How do I prevent hallucinations?

Strategies:

  1. Ground with context: Provide relevant facts in the prompt

    prompt = f"Based on this context: {facts}\n\nQuestion: {question}\nAnswer:"

  2. Explicit instructions: Tell the model to say "I don't know" when uncertain

    "If the answer isn't in the context, respond: 'Information not available'"

  3. Request citations: Ask the model to quote sources

    "Quote the exact sentence from the text that supports your answer"

  4. Lower temperature: Reduces random guessing (temperature=0)

  5. Validation loop: Check outputs against ground truth

    if not can_verify(output, source_text):
        output = "Cannot verify answer"

Q5: What's the best way to handle long documents?

Approaches:

  1. Chunking + Map-Reduce:

    # Process each chunk
    summaries = [summarize(chunk) for chunk in chunks]
    # Combine results
    final = synthesize(summaries)

  2. Retrieval-Augmented Generation (RAG):

    # Find relevant chunks
    relevant = semantic_search(query, document_chunks, k=5)
    # Use only relevant context
    answer = llm_call(f"Context: {relevant}\n\nQuestion: {query}")

  3. Hierarchical summarization:

    # Layer 1: Summarize paragraphs
    # Layer 2: Summarize summaries
    # Layer 3: Final synthesis

  4. Use models with large context windows:

    • Claude 3: 200K tokens
    • GPT-4 Turbo: 128K tokens
    • Gemini 1.5: 1M tokens
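
A minimal chunking helper for the map-reduce approach above (word-window splitting with overlap; the sizes are arbitrary defaults, and real pipelines usually split on token counts instead):

```python
def chunk_text(text, max_words=500, overlap=50):
    """Split text into overlapping word-window chunks."""
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks
```

Each chunk then goes through `summarize`, and the partial summaries are combined in a final synthesis call.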

Q6: How do I make prompts work across different models?

Answer: Some techniques transfer well, others don't:

Universal techniques:

  • Clear instructions
  • Few-shot examples
  • Output format specification
  • Chain-of-thought reasoning

Model-specific:

  • Exact phrasing sensitivity varies
  • Some models need explicit format instructions (XML, JSON markers)
  • Function calling is API-specific (OpenAI vs. others)

Best practice: Test on your target model. Don't assume prompts transfer perfectly.

Q7: Can I automate prompt optimization?

Yes, several approaches:

  1. APE (Automatic Prompt Engineering): Generate and test candidate prompts
  2. DSPy: Framework for programmatic prompt optimization
  3. Genetic algorithms: Evolve prompts through mutation and selection
  4. Reinforcement learning: Optimize prompts using reward signals

Practical recommendation: Start with manual engineering to understand the task, then automate optimization for production.

Q8: How do I evaluate prompt quality without ground truth?

Methods:

  1. LLM-as-Judge: Use a stronger model to evaluate outputs

    score = judge_llm(f"Rate this output (0-10): {output}")

  2. Consistency checking: Generate multiple outputs, check agreement

    outputs = [llm_call(prompt) for _ in range(5)]
    consistency = compute_agreement(outputs)

  3. Human eval on samples: Evaluate 100-200 examples manually

    sample = random.sample(outputs, 200)
    human_scores = get_human_ratings(sample)

  4. Proxy metrics: Measure related properties

    • Length (for summaries)
    • Readability scores
    • Semantic similarity to reference texts
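
The `compute_agreement` call in the consistency-checking idea can be sketched as exact-match agreement over normalized outputs (a real version might cluster semantically equivalent answers instead):

```python
from collections import Counter

def compute_agreement(outputs):
    """Fraction of outputs that match the most common (normalized) answer."""
    normalized = [o.strip().lower() for o in outputs]
    top_count = Counter(normalized).most_common(1)[0][1]
    return top_count / len(normalized)
```

An agreement near 1.0 suggests the prompt pins the model down; low agreement signals ambiguity worth fixing.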

Q9: What's the ROI of investing in prompt engineering?

Quantifiable benefits:

  • Accuracy improvements: 20-50% gains on complex tasks (see CoT benchmarks)
  • Cost reduction: Better prompts can reduce API calls through higher success rates
  • Latency: Shorter, optimized prompts generate faster
  • Maintenance: Good prompts need less frequent updates

Example calculation:

  • Baseline: 70% accuracy requires ~1.4× API calls for retries
  • Optimized: 90% accuracy reduces this to ~1.1× API calls
  • Savings: 22% fewer API calls

At scale (1M requests/month), this saves thousands of dollars.
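
The arithmetic behind that example, assuming each failed request is simply retried (so expected attempts follow a geometric distribution, 1/p):

```python
def expected_calls(success_rate):
    # Geometric retry model: expected attempts until the first success.
    return 1 / success_rate

baseline = expected_calls(0.70)   # ~1.43x
optimized = expected_calls(0.90)  # ~1.11x
savings = 1 - optimized / baseline
print(f"{savings:.0%} fewer API calls")  # prints "22% fewer API calls"
```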

Q10: How do I debug a prompt that sometimes works and sometimes fails?

Diagnosis steps:

  1. Check for non-determinism: Set temperature=0 to eliminate randomness

    output = llm_call(prompt, temperature=0)

  2. Identify failure patterns: Analyze errors for commonalities

    failures = [x for x in test_set if not is_correct(x)]
    patterns = cluster_similar_inputs(failures)

  3. Add explicit edge case handling: Update prompt with failure examples

    prompt += "\n\nEdge case examples:\n{edge_cases}"

  4. Use self-consistency: Vote across multiple generations

    outputs = [llm_call(prompt) for _ in range(5)]
    final = majority_vote(outputs)

  5. Incremental refinement: Test each change on known failure cases
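
The `majority_vote` helper from step 4 can be sketched as a frequency count over normalized outputs (a hypothetical helper, not a library function):

```python
from collections import Counter

def majority_vote(outputs):
    """Return the most frequent answer after normalization."""
    normalized = [o.strip().lower() for o in outputs]
    return Counter(normalized).most_common(1)[0][0]
```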

Q11: Should I use XML, JSON, or plain text for structured prompts?

Comparison:

XML:

<instruction>Classify sentiment</instruction>
<example>
  <text>Great product</text>
  <label>positive</label>
</example>
  • Pro: Clear structure; Claude prefers XML
  • Con: Verbose

JSON:

{
  "instruction": "Classify sentiment",
  "example": {"text": "Great product", "label": "positive"}
}
  • Pro: Compact, easy to parse programmatically
  • Con: Models sometimes generate invalid JSON

Plain text:

Classify sentiment:
Example: "Great product" → positive
  • Pro: Natural, easy to read/write
  • Con: Less structure for complex data

Recommendation: Use plain text for simple prompts, JSON for structured I/O, XML for complex multi-part prompts (especially with Claude).

Q12: How do I balance cost vs. quality?

Strategies:

  1. Tiered approach: Use cheaper models for simple tasks, expensive ones for hard tasks

    if is_simple(task):
        return gpt_3_5(task)  # Cheaper
    else:
        return gpt_4(task)  # More expensive but accurate

  2. Caching: Reuse responses for repeated inputs

    from functools import lru_cache

    @lru_cache(maxsize=10000)
    def cached_llm_call(prompt):
        return llm_call(prompt)

  3. Prompt compression: Use LLMLingua to reduce token count

    compressed = compress_prompt(long_prompt, target_ratio=0.5)

  4. Cascade: Try simple prompts first, escalate if they fail

    result = zero_shot(task)
    if confidence(result) < 0.7:
        result = few_shot(task)  # More expensive

  5. Batch processing: Process multiple requests together to amortize costs

Q13: What's the difference between CoT, ToT, and GoT?

Quick comparison:

Technique | Structure       | When to use                           | Cost multiplier
----------|-----------------|---------------------------------------|---------------------------------
CoT       | Linear chain    | Multi-step reasoning, math, logic     | 1-2× (longer outputs)
ToT       | Tree search     | Problems with multiple solution paths | 5-50× (explores branches)
GoT       | Arbitrary graph | Parallel processing, merging insights | Varies (more efficient than ToT)

Example: "Plan a trip to Europe"

  • CoT: Day 1 → Day 2 → Day 3... (sequential)
  • ToT: Explore multiple itineraries, backtrack if hotel unavailable
  • GoT: Research destinations in parallel, merge into cohesive plan

Q14: How do I handle rate limits and API errors?

Robust implementation:

import time
from functools import wraps

def retry_with_exponential_backoff(
    max_retries=5,
    initial_delay=1,
    backoff_factor=2
):
    """Decorator to retry with exponential backoff."""
    # RateLimitError / APIError stand in for your provider SDK's exceptions
    # (e.g., openai.RateLimitError).
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = initial_delay
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_retries - 1:
                        raise
                    time.sleep(delay)
                    delay *= backoff_factor
                except APIError as e:
                    if "server_error" in str(e):
                        # Retry on server errors, also backing off
                        time.sleep(delay)
                        delay *= backoff_factor
                    else:
                        # Don't retry on client errors
                        raise
            raise RuntimeError(f"Failed after {max_retries} attempts")
        return wrapper
    return decorator

@retry_with_exponential_backoff()
def llm_call_robust(prompt, **kwargs):
    return llm_call(prompt, **kwargs)

Q15: What's the future of prompt engineering?

Emerging trends:

  1. Multimodal prompting: Text + images + audio in one prompt
  2. Automated optimization: Tools that auto-tune prompts (DSPy, APE)
  3. Prompt compression: Fit more context into token limits
  4. Meta-prompting: Prompts that generate prompts
  5. Embodied agents: Prompts controlling robots and virtual agents
  6. Less manual effort? As models improve, elaborate prompts (and some fine-tuning) become less necessary

Prediction: Prompt engineering will remain valuable but become more automated. Focus will shift from manual crafting to:

  • Designing prompt optimization objectives
  • Building evaluation frameworks
  • Integrating prompts into larger systems

The skill won't disappear — it will evolve toward higher-level orchestration.

Conclusion

Prompt engineering transforms how we interact with AI systems. What began as trial-and-error has matured into a discipline backed by rigorous research and practical frameworks.

The fundamentals — clear instructions, well-chosen examples, structured outputs — apply universally. Advanced techniques like chain-of-thought reasoning and tree search unlock capabilities that seemed impossible with naive prompting. Optimization methods like automatic prompt engineering and DSPy scale these practices to production systems.

But techniques alone aren't enough. Effective prompt engineering requires:

  1. Empiricism: Test everything. What works for one model or task may fail for another.
  2. Iteration: Your first prompt will rarely be your best. Expect to refine based on failures.
  3. Evaluation: Measure rigorously. Without metrics, you're flying blind.
  4. Contextual thinking: Understand your model's strengths, your task's requirements, and the trade-offs between cost, latency, and quality.

As models grow more powerful, the nature of prompt engineering evolves. Simple tasks that once required careful prompting now work zero-shot. But complex reasoning, specialized domains, and production constraints ensure that prompt engineering remains essential.

The future belongs to those who can orchestrate AI systems effectively — not just by writing clever prompts, but by building frameworks that optimize, validate, and scale prompt-based solutions. Whether you're a researcher pushing the boundaries of what LLMs can do or a practitioner building real-world applications, the principles in this guide provide a foundation for success.

Start simple. Measure constantly. Iterate relentlessly. And remember: the best prompt is the one that reliably solves your problem, not the cleverest one.

  • Post title: Prompt Engineering Complete Guide: From Zero to Advanced Optimization
  • Post author: Chen Kai
  • Create time: 2025-04-01 00:00:00
  • Post link: https://www.chenk.top/en/prompt-engineering-complete-guide/
  • Copyright Notice: All articles in this blog are licensed under BY-NC-SA unless stated otherwise.