CoT doesn't make LLMs smarter—it turns implicit reasoning into explicit steps. This way you can see where it went wrong, and it's easier to pull it back on track.
12.1 Why Do We Need Step-by-Step Reasoning?
First, a real case:
In early 2024, I was helping an edtech company build a math tutoring Agent. Simple functionality: students input math problems, Agent provides solutions and steps.
Everything worked fine during testing. Simple arithmetic, single-variable equations—high accuracy.
First week after launch, complaints came in. A middle schooler asked: "Xiao Ming has 15 apples. He splits them among 3 friends, how many does each get? Then he buys 8 more, how many does he have now?"
Agent answered: "11."
Wrong. The correct answer is 8 (15 / 3 = 5, each friend gets 5, Xiao Ming keeps 0, 0 + 8 = 8). But the Agent saw "15" and "3", computed 15 - 3 - 1 = 11, completely misunderstanding the question.
This exposed a fundamental LLM problem—it "skips steps," giving answers directly rather than actually reasoning. For multi-step problems, this step-skipping easily leads to errors.
A week later, we added one line to the prompt: "Please solve step by step." Same question, Agent started answering like this:
Step 1: Xiao Ming has 15 apples, splits among 3 friends
Step 2: Each gets 15 ÷ 3 = 5 apples
Step 3: After splitting, Xiao Ming has 15 - 15 = 0 apples left
Step 4: Buys 8 more, now has 0 + 8 = 8 apples
Answer: 8 apples
On our test set, accuracy improved by about 40%. (Note: this is the result from a specific scenario; actual effects vary by task type and model. We recommend testing on your own evaluation set.)
This is the value of Chain-of-Thought—making LLMs externalize their reasoning process, computing step by step rather than guessing answers by "intuition."
LLM's Default Behavior
LLMs' default behavior is to "say it all at once": you ask a question, and the model generates an entire answer in a single pass. It doesn't stop to check whether what it just said is correct, and it doesn't calculate while generating.
This causes a problem: complex reasoning easily goes wrong.
| Task Type | LLM Default Behavior | Problem |
|---|---|---|
| Multi-step math | Gives answer directly | Skips steps, calculates wrong |
| Logical reasoning | Guesses by intuition | Logic chain breaks |
| Causal analysis | Surface associations | Cause-effect reversed |
| Code debugging | Lists common causes | Doesn't actually analyze |
CoT's Solution
Chain-of-Thought's core idea is simple: Make the LLM think step by step, writing out intermediate steps.
Here is CoT applied to another multi-step problem: Xiao Ming starts with 5 apples, gives 2 to Xiao Hong, receives 3 from Xiao Hua, then eats 1. How many does he have now? With the step-by-step instruction, the model answers like this:
Let me calculate step by step:
→ Step 1: Xiao Ming initially has 5 apples
→ Step 2: After giving 2 to Xiao Hong, 5 - 2 = 3 remaining
→ Step 3: After receiving 3 from Xiao Hua, total 3 + 3 = 6
→ Step 4: After eating 1, 6 - 1 = 5 remaining
Therefore, Xiao Ming now has 5 apples.
The answer is correct. Key difference: the explicit reasoning process forces the model to actually compute, rather than guess by intuition.
CoT's Value
| Dimension | Without CoT | With CoT |
|---|---|---|
| Accuracy | Pattern matching, complex reasoning error-prone | Step-by-step verification, reduces jumping errors |
| Explainability | Black box output, can't audit | Transparent process, every step traceable |
| Debuggability | Wrong answers, with no clue where they went wrong | Can pinpoint which step went wrong |
But I should note: CoT isn't a cure-all. It can improve accuracy but can't guarantee correctness. Each step in step-by-step reasoning can still be wrong.
12.2 CoT's Academic Background
CoT comes from a 2022 paper: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al.).
Core finding:
For tasks requiring multi-step reasoning, adding "let's think step by step" to the prompt or providing reasoning examples can significantly improve LLM accuracy.
Later, an even simpler method was discovered: Zero-shot CoT (Kojima et al., 2022). Just add "Let's think step by step" after the question to trigger step-by-step reasoning in LLMs.
This discovery is interesting: LLMs actually have step-by-step reasoning capability, they just need to be "reminded" to use it.
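To make this concrete, here is a trivial sketch of the Zero-shot CoT trigger written as a prompt helper (the function name withZeroShotCoT is mine, not Shannon's):

// withZeroShotCoT appends the Zero-shot CoT trigger phrase to a question.
// A minimal sketch for illustration; not part of Shannon's codebase.
func withZeroShotCoT(question string) string {
    return question + "\n\nLet's think step by step."
}

That single appended sentence is the entire technique.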
12.3 CoT Prompt Design
Basic Template
The simplest CoT prompt:
Question: If today is Wednesday, what day of the week will it be 10 days from now?
Please think step by step, marking each reasoning step with →.
Give your conclusion at the end, starting with "Therefore:"
Shannon's Default Template
Shannon's CoT implementation is in patterns/chain_of_thought.go, with this default template:
func buildChainOfThoughtPrompt(query string, config ChainOfThoughtConfig) string {
    if config.PromptTemplate != "" {
        return strings.ReplaceAll(config.PromptTemplate, "{query}", query)
    }

    // Default CoT template
    return fmt.Sprintf(`Please solve this step-by-step:
Question: %s
Think through this systematically:
1. First, identify what is being asked
2. Break down the problem into steps
3. Work through each step with clear reasoning
4. Show your work and explain your thinking
5. Arrive at the final answer
Use "→" to mark each reasoning step.
End with "Therefore:" followed by your final answer.`, query)
}
Implementation reference (Shannon): patterns/chain_of_thought.go - buildChainOfThoughtPrompt function
Domain-Specific Templates
Different scenarios need different CoT templates:
Math-specific:
Math problem: {query}
Please solve using this format:
【Analysis】First understand what the question asks
【Formulas】List the formulas needed
【Calculation】
→ Step 1: ...
→ Step 2: ...
【Verification】Check if the result is reasonable
【Answer】The final result is...
Code debugging-specific:
Debug problem: {query}
Please analyze systematically:
1. 【Symptom Description】The observed error phenomenon
2. 【Hypothesis List】Possible causes (ranked by likelihood)
→ Hypothesis A: ...
→ Hypothesis B: ...
3. 【Verification Process】Verify each hypothesis
4. 【Root Cause Analysis】Determine the real cause
5. 【Fix Solution】Provide the solution
Therefore: The root cause is... The fix method is...
Logic reasoning-specific:
Reasoning problem: {query}
Please reason through the logic chain:
【Known Conditions】
- Condition 1: ...
- Condition 2: ...
【Reasoning Process】
→ From condition 1, we can derive...
→ Combined with condition 2, we can further derive...
→ Therefore...
【Conclusion】...
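A sketch of how such templates might be wired into the config described in the next section (the map, the domain names, and cotConfigFor are mine; only the {query} placeholder convention comes from Shannon's buildChainOfThoughtPrompt):

// domainTemplates maps a task domain to a CoT template containing a {query}
// placeholder. The full templates above are abbreviated here with "...".
var domainTemplates = map[string]string{
    "math":  "Math problem: {query}\nPlease solve using this format: ...",
    "debug": "Debug problem: {query}\nPlease analyze systematically: ...",
    "logic": "Reasoning problem: {query}\nPlease reason through the logic chain: ...",
}

// cotConfigFor plugs a domain template into the config. An unknown domain
// yields an empty PromptTemplate, so buildChainOfThoughtPrompt falls back
// to its built-in default.
func cotConfigFor(domain string) ChainOfThoughtConfig {
    return ChainOfThoughtConfig{PromptTemplate: domainTemplates[domain]}
}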
12.4 Shannon's CoT Implementation
Configuration Structure
type ChainOfThoughtConfig struct {
    MaxSteps              int    // Maximum reasoning steps
    RequireExplanation    bool   // Whether to require explanation
    ShowIntermediateSteps bool   // Whether output includes intermediate steps
    PromptTemplate        string // Custom template
    StepDelimiter         string // Step delimiter, default "\n→ "
    ModelTier             string // Model tier
}
Result Structure
type ChainOfThoughtResult struct {
    FinalAnswer    string          // Final answer
    ReasoningSteps []string        // Reasoning step list
    TotalTokens    int             // Token consumption
    Confidence     float64         // Reasoning confidence (0-1)
    StepDurations  []time.Duration // Time per step
}
Core Flow
func ChainOfThought(
    ctx workflow.Context,
    query string,
    context map[string]interface{},
    sessionID string,
    history []string,
    config ChainOfThoughtConfig,
    opts Options,
) (*ChainOfThoughtResult, error) {
    // 1. Set defaults
    if config.MaxSteps == 0 {
        config.MaxSteps = 5
    }
    if config.StepDelimiter == "" {
        config.StepDelimiter = "\n→ "
    }

    // 2. Build CoT prompt
    cotPrompt := buildChainOfThoughtPrompt(query, config)

    // 3. Call LLM
    cotResult := executeAgent(ctx, cotPrompt, ...)

    // 4. Parse reasoning steps
    steps := parseReasoningSteps(cotResult.Response, config.StepDelimiter)

    // 5. Extract final answer
    answer := extractFinalAnswer(cotResult.Response, steps)

    // 6. Calculate confidence
    confidence := calculateReasoningConfidence(steps, cotResult.Response)

    // 7. Request clarification if confidence is low (optional)
    if config.RequireExplanation && confidence < 0.7 {
        // Use half the budget to regenerate a clearer explanation
        clarificationResult := requestClarification(ctx, query, steps)
        // Update result...
    }

    return &ChainOfThoughtResult{
        FinalAnswer:    answer,
        ReasoningSteps: steps,
        Confidence:     confidence,
        TotalTokens:    cotResult.TokensUsed,
    }, nil
}
12.5 Step Parsing
The reasoning process generated by the LLM needs to be parsed into a structured list of steps. Shannon's implementation:
func parseReasoningSteps(response, delimiter string) []string {
    lines := strings.Split(response, "\n")
    steps := []string{}

    for _, line := range lines {
        line = strings.TrimSpace(line)
        // Recognize step markers
        if strings.HasPrefix(line, "→") ||
            strings.HasPrefix(line, "Step") ||
            strings.HasPrefix(line, "1.") ||
            strings.HasPrefix(line, "2.") ||
            strings.HasPrefix(line, "3.") ||
            strings.HasPrefix(line, "•") {
            steps = append(steps, line)
        }
    }

    // Fallback strategy: when there are no explicit markers, split by sentences
    if len(steps) == 0 {
        segments := strings.Split(response, ". ")
        for _, seg := range segments {
            if len(strings.TrimSpace(seg)) > 20 {
                steps = append(steps, seg)
                if len(steps) >= 5 {
                    break
                }
            }
        }
    }

    return steps
}
Parsing priority:
- Explicit markers (→, Step, number.)
- Symbol markers (•)
- Fallback: split by sentences
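For example, a response in the format requested by the default template would parse like this (a sketch; assumes parseReasoningSteps above is in scope):

response := `Let me calculate step by step:
→ Step 1: Xiao Ming has 15 apples, splits among 3 friends
→ Step 2: Each gets 15 ÷ 3 = 5 apples
→ Step 3: After splitting, Xiao Ming has 15 - 15 = 0 apples left
→ Step 4: Buys 8 more, now has 0 + 8 = 8 apples
Therefore: 8 apples`

steps := parseReasoningSteps(response, "\n→ ")
// len(steps) == 4; the "Therefore:" line has no step marker, so it is left
// for extractFinalAnswer (shown in the next section) to pick up.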
Extracting Final Answer
func extractFinalAnswer(response string, steps []string) string {
    // Look for conclusion markers (Chinese markers: 因此 = "therefore", 结论 = "conclusion")
    markers := []string{
        "Therefore:",
        "Final Answer:",
        "The answer is:",
        "因此:",
        "结论:",
    }

    lower := strings.ToLower(response)
    for _, marker := range markers {
        if idx := strings.Index(lower, strings.ToLower(marker)); idx != -1 {
            answer := response[idx+len(marker):]
            // Take until the next blank line
            if endIdx := strings.Index(answer, "\n\n"); endIdx > 0 {
                answer = answer[:endIdx]
            }
            return strings.TrimSpace(answer)
        }
    }

    // Fallback: use the last step
    if len(steps) > 0 {
        return steps[len(steps)-1]
    }

    // Further fallback: the last paragraph
    paragraphs := strings.Split(response, "\n\n")
    if len(paragraphs) > 0 {
        return paragraphs[len(paragraphs)-1]
    }

    return response
}
Multiple fallback strategies here ensure that even if the LLM doesn't output in the expected format, a meaningful answer can still be extracted.
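Continuing the sketch from 12.5, extracting the answer from that same response would hit the "Therefore:" marker on the first pass:

answer := extractFinalAnswer(response, steps)
fmt.Println(answer) // "8 apples", taken from the text after "Therefore:"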
12.6 Confidence Evaluation
Reasoning quality can be quantitatively evaluated. Shannon's implementation:
func calculateReasoningConfidence(steps []string, response string) float64 {
    confidence := 0.5 // Base score

    // Step sufficiency: >= 3 steps adds points
    if len(steps) >= 3 {
        confidence += 0.2
    }

    // Logical connectors
    logicalTerms := []string{
        "therefore", "because", "since", "thus",
        "consequently", "hence", "so", "implies",
    }
    lower := strings.ToLower(response)
    count := 0
    for _, term := range logicalTerms {
        count += strings.Count(lower, term)
    }
    if count >= 3 {
        confidence += 0.15
    }

    // Structured markers
    if strings.Contains(response, "Step") || strings.Contains(response, "→") {
        confidence += 0.1
    }

    // Clear conclusion
    if strings.Contains(lower, "therefore") ||
        strings.Contains(lower, "final answer") {
        confidence += 0.05
    }

    if confidence > 1.0 {
        confidence = 1.0
    }
    return confidence
}
Confidence formula (this is a heuristic I designed for discussion purposes, not an academic standard):
Confidence = 0.5 (base)
+ 0.2 (steps >= 3)
+ 0.15 (logical words >= 3)
+ 0.1 (structured markers)
+ 0.05 (clear conclusion)
────────────────
Max 1.0
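As a worked example: the five-apple CoT answer from 12.1 has four parsed steps (+0.2), contains "Step" and "→" markers (+0.1), and ends with "Therefore" (+0.05), but uses only one logical connector (so no +0.15). Its confidence would be 0.5 + 0.2 + 0.1 + 0.05 = 0.85.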
12.7 Low Confidence Handling
When RequireExplanation=true and confidence is below 0.7, Shannon requests clarification:
if config.RequireExplanation && confidence < 0.7 {
    clarificationPrompt := fmt.Sprintf(
        "The previous reasoning for '%s' had unclear steps. "+
            "Please provide a clearer step-by-step explanation:\n%s",
        query,
        strings.Join(steps, config.StepDelimiter),
    )

    // Use half the budget to regenerate
    clarifyResult := executeAgentWithBudget(ctx, clarificationPrompt, opts.BudgetAgentMax/2)

    // Update the result
    if clarifyResult.Success {
        clarifiedSteps := parseReasoningSteps(clarifyResult.Response, delimiter)
        if len(clarifiedSteps) > 0 {
            result.ReasoningSteps = clarifiedSteps
            result.FinalAnswer = extractFinalAnswer(clarifyResult.Response, clarifiedSteps)
            result.Confidence = calculateReasoningConfidence(clarifiedSteps, clarifyResult.Response)
        }
        result.TotalTokens += clarifyResult.TokensUsed
    }
}
Clarification strategy:
- Use original steps as reference
- Use half the budget (control costs)
- Request clearer explanation
12.8 CoT vs Tree-of-Thoughts
CoT is linear: one step follows another, no going back.
Tree-of-Thoughts (ToT) is tree-shaped: each step can have multiple branches, with backtracking.
| Feature | Chain-of-Thought | Tree-of-Thoughts |
|---|---|---|
| Structure | Linear chain | Branching tree |
| Exploration | Single path | Multiple paths in parallel |
| Backtracking | Not supported | Supported |
| Token consumption | Lower | Higher (3-10x) |
| Use case | Deterministic reasoning | Exploratory problems |
When to Use ToT?
Does the problem have multiple possible solution paths?
├─ No → Use CoT (a single path is enough)
└─ Yes → Need to compare different approaches?
    ├─ No → Use CoT (just pick one path and follow it)
    └─ Yes → Use ToT (systematic exploration)
ToT is covered in detail in Chapter 17. For now, just know: CoT is sufficient for most scenarios.
12.9 Common Pitfalls
Pitfall 1: Over-decomposition
Symptom: Simple problems are forced into too many steps, verbose output.
// Asked "What is 2+3?"
→ Step 1: Identify problem type—this is an addition problem
→ Step 2: Determine operands—2 and 3
→ Step 3: Review definition of addition
→ Step 4: Execute calculation 2 + 3 = 5
→ Step 5: Verify result
Therefore: 5
Solution: Dynamically adjust MaxSteps based on complexity:
func adaptiveMaxSteps(query string) int {
    complexity := estimateComplexity(query)
    if complexity < 0.3 {
        return 2 // Simple problem
    } else if complexity < 0.7 {
        return 5 // Medium
    }
    return 8 // Complex
}
Pitfall 2: Confusing Reasoning with Facts
Symptom: CoT generates steps that "look reasonable" but are based on wrong facts.
→ Step 1: Tesla became the world's most valuable automaker in 2020 (wrong fact)
→ Step 2: Therefore its sales should also be highest (wrong reasoning)
The problem: logic is correct, but premises are wrong, so conclusion is wrong.
Solution: Use CoT with tools to verify key facts:
When reasoning, follow these principles:
1. When specific data is involved, mark [needs verification]
2. Distinguish "reasoning" from "factual statements"
3. If uncertain about a fact, explicitly state it
Pitfall 3: Inflated Confidence
Symptom: Model uses many logical connectors, but actual reasoning quality is poor.
For example, circular reasoning:
→ Step 1: A is true because B is true
→ Step 2: B is true because A is true
Therefore: Both A and B are true
Using "because" adds confidence points, but this is invalid reasoning.
Solution: Add semantic detection:
func enhancedConfidence(steps []string, response string) float64 {
    base := calculateReasoningConfidence(steps, response)

    // Check for circular reasoning (hasCircularReasoning is left as a sketch;
    // see Exercise 3)
    if hasCircularReasoning(steps) {
        base -= 0.3
    }

    // Check logical coherence between steps
    if !hasLogicalCoherence(steps) {
        base -= 0.2
    }

    return max(0, min(1.0, base))
}
Pitfall 4: Inconsistent Format
Symptom: LLM sometimes uses "→", sometimes "Step", sometimes numbers—parsing fails.
Solution: Clearly specify format in prompt, and support multiple formats in parsing (Shannon already does this).
12.10 When to Use CoT?
Not all tasks need CoT.
| Task Type | Use CoT? | Reason |
|---|---|---|
| Simple calculation | No | Direct computation is faster |
| Fact lookup | No | Direct lookup is more accurate |
| Multi-step math | Yes | Reduce calculation errors |
| Logical reasoning | Yes | Externalize reasoning chain |
| Causal analysis | Yes | Trace cause-effect relationships |
| Code debugging | Yes | Systematic investigation |
| Creative writing | No | Would limit creativity |
| Real-time conversation | Depends | Latency vs accuracy trade-off |
Rule of thumb:
- Needs "derivation" → use
- Needs "auditable process" → use
- Simple and direct → don't use
- Creative → don't use
- Latency sensitive → use carefully
12.11 How Do Other Frameworks Do It?
CoT is a universal pattern; everyone has implementations:
| Framework/Paper | Implementation | Characteristics |
|---|---|---|
| Zero-shot CoT | "Let's think step by step" | Simplest, one sentence triggers |
| Few-shot CoT | Provide reasoning examples | More controllable, but needs manual design |
| Self-Consistency | Multiple CoT + voting | More accurate, but expensive |
| LangChain | CoT Prompt templates | Easy to integrate |
| OpenAI o1/o3 | Built-in multi-step reasoning (black box) | Internal mechanism opaque, no manual triggering needed |
Core logic is the same: make LLM write out reasoning process.
Differences are in:
- Triggering method (zero-shot vs few-shot)
- Format constraints (free vs structured)
- Quality assurance (single vs multi-vote)
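To illustrate the Self-Consistency row in the table above, here is a sketch of majority voting over multiple CoT samples (the selfConsistency helper and the generateCoT callback are hypothetical, not any framework's API; assumes strings and fmt are imported):

// selfConsistency runs the same CoT prompt n times and returns the most
// common final answer. generateCoT is any function that performs one CoT
// pass and returns the extracted final answer. A sketch, not production code.
func selfConsistency(n int, generateCoT func() (string, error)) (string, error) {
    votes := map[string]int{}
    for i := 0; i < n; i++ {
        answer, err := generateCoT()
        if err != nil {
            continue // skip failed samples in this sketch
        }
        votes[strings.TrimSpace(answer)]++
    }

    best, bestCount := "", 0
    for answer, count := range votes {
        if count > bestCount {
            best, bestCount = answer, count
        }
    }
    if best == "" {
        return "", fmt.Errorf("all %d CoT samples failed", n)
    }
    return best, nil
}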
12.12 Relationship with ReAct
You might ask: what's the difference between CoT and ReAct?
| Dimension | CoT | ReAct |
|---|---|---|
| Core goal | Externalize reasoning process | Reasoning + action loop |
| Uses tools | No (pure reasoning) | Yes (think while doing) |
| Output | Reasoning steps + answer | Multi-round thought/action/observation |
| Use case | Problems requiring computation/reasoning | Tasks requiring external information |
Simply put:
- CoT: Think clearly then answer (doesn't need external information)
- ReAct: Think while searching and doing (needs external information)
They can be combined: use CoT during ReAct's "thinking" phase.
Key Takeaways
- CoT Essence: Externalize thinking, force model to reason step by step
- Prompt Design: Clear instruction "step by step" + format convention (→, Step)
- Step Parsing: Recognize markers + multi-layer fallback strategy
- Confidence Evaluation: Step count + logical words + structured markers (heuristic, not academic standard)
- Use Cases: Multi-step reasoning, needs audit trail; not for simple tasks, creative work
Shannon Lab (10-Minute Quick Start)
This section helps you map this chapter's concepts to Shannon source code in 10 minutes.
Required Reading (1 file)
patterns/chain_of_thought.go: Find the ChainOfThought function; see how it uses buildChainOfThoughtPrompt to build prompts, parseReasoningSteps to parse reasoning steps, and calculateReasoningConfidence to evaluate confidence
Optional Deep Dive (choose based on interest)
- patterns/tree_of_thoughts.go: Compare the implementation differences between ToT and CoT
- Try it yourself in ChatGPT/Claude: ask the same math problem with and without "Let's think step by step". What's different about the answers?
Exercises
Exercise 1: Design CoT Templates
Design specialized CoT prompt templates for these scenarios:
- Legal reasoning: Determine whether an action is illegal
- Medical diagnosis: Infer possible diseases from symptoms
- Financial analysis: Evaluate investment value of a stock
Each template should include:
- Problem description placeholder
- Format requirements for reasoning steps
- Format requirements for conclusion
Exercise 2: Source Code Reading
Read the parseReasoningSteps function in patterns/chain_of_thought.go:
- What step marker formats does it support?
- How does it fall back when LLM doesn't use any markers?
- Why does fallback limit to maximum 5 steps?
Exercise 3 (Advanced): Design Circular Reasoning Detection
Design a hasCircularReasoning function:
- Input: List of reasoning steps
- Output: Whether circular reasoning exists
Think about:
- What patterns count as "circular reasoning"?
- What method to use for detection? (keyword matching? semantic similarity?)
- Is there false positive risk?
Want to Go Deeper?
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al., 2022): the original CoT paper
- Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022): Zero-shot CoT, the "Let's think step by step" trigger
- Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2022): multiple CoT samples + voting improves accuracy
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models (Yao et al., 2023): tree-shaped extension of CoT
Next Chapter Preview
That concludes Part 4 (Single Agent Patterns). We learned three core patterns:
- Planning: Decompose complex tasks into subtasks
- Reflection: Evaluate output quality, retry if not meeting standards
- Chain-of-Thought: Externalize reasoning process, reduce jumping errors
But what a single Agent can do is limited. When tasks are complex enough, you need multiple Agents to collaborate.
That's what Part 5 is about—Multi-Agent Orchestration.
Next chapter we'll start with orchestration basics: when a single Agent isn't enough, how to have multiple Agents divide work? Who decides who does what? What happens when something fails?