Benchmark: Chain-of-Thought Synthesis Across Conflicting
Claude Sonnet 4.6 best reconciles conflicting chain-of-thought findings by pinpointing instruction tuning as the critical variable, balancing completeness with roughly 84% lower cost than Opus.
This benchmark task evaluates how well three Claude models reconcile three conflicting research passages into a coherent one-paragraph answer about chain-of-thought prompting for sub-10B models. Claude Sonnet 4.6 emerges as the strongest performer, delivering the most precise synthesis with the clearest causal reasoning, at a cost for the run about 6.4 times lower than that of Claude Opus 4.7.
Task
Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: “Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?”
Passage A: “In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting.”
Passage B: “For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6-9B models.”
Passage C: “Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%.”
Your answer must cite each passage as [A], [B], or [C] at least once.
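The citation and one-paragraph requirements above can be checked mechanically. The following is a minimal sketch, not the harness used for this benchmark: the function names `missing_citations` and `is_single_paragraph` are hypothetical, and the check assumes citations are written exactly as `[A]`, `[B]`, or `[C]` (a combined form such as `[A, C]` would not be recognized).

```python
import re

# Labels the task requires; taken from the task statement.
REQUIRED_LABELS = ("A", "B", "C")


def missing_citations(answer: str, required=REQUIRED_LABELS) -> list[str]:
    """Return the required labels that never appear as [X] in the answer."""
    cited = set(re.findall(r"\[([A-Z])\]", answer))
    return [label for label in required if label not in cited]


def is_single_paragraph(answer: str) -> bool:
    """True if the answer contains no blank-line paragraph break."""
    return re.search(r"\n\s*\n", answer.strip()) is None


# Illustrative answer text (made up for this example, not a model output).
sample = "CoT helps instruction-tuned models [A][C] but hurts base models [B]."
print(missing_citations(sample))    # []
print(is_single_paragraph(sample))  # True
```

A check like this only confirms the format constraints. It says nothing about whether the reconciliation itself is sound, which is what the analysis below evaluates.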
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 2,653 | 220 | 240 | 0.00142 | Cited all passages, reconciled via instruction tuning, less assertive framing |
| Claude Sonnet 4.6 | 7,196 | 220 | 280 | 0.00486 | Cited all passages, strongest causal argument, clear conditional conclusion |
| Claude Opus 4.7 | 6,193 | 307 | 352 | 0.031 | Cited all passages, correct synthesis, marginal value over Sonnet for 6.4x cost |
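The cost and latency comparisons used in the analysis can be rechecked directly from the table. The sketch below copies the table values into a dictionary; the helper names (`cost_ratio`, `percent_cheaper`, `fastest_first`) are illustrative and not part of any benchmark tooling.

```python
# Values copied from the results table above.
RESULTS = {
    "Claude Haiku 4.5": {"latency_ms": 2653, "cost_usd": 0.00142},
    "Claude Sonnet 4.6": {"latency_ms": 7196, "cost_usd": 0.00486},
    "Claude Opus 4.7": {"latency_ms": 6193, "cost_usd": 0.031},
}


def cost_ratio(numerator: str, denominator: str, results=RESULTS) -> float:
    """How many times more the first model's run cost than the second's."""
    return results[numerator]["cost_usd"] / results[denominator]["cost_usd"]


def percent_cheaper(cheaper: str, pricier: str, results=RESULTS) -> float:
    """Percentage saved by choosing the cheaper run over the pricier one."""
    ratio = results[cheaper]["cost_usd"] / results[pricier]["cost_usd"]
    return 100.0 * (1.0 - ratio)


def fastest_first(results=RESULTS) -> list[str]:
    """Model names ordered from lowest to highest latency."""
    return sorted(results, key=lambda name: results[name]["latency_ms"])


print(round(cost_ratio("Claude Opus 4.7", "Claude Sonnet 4.6"), 1))    # 6.4
print(round(cost_ratio("Claude Sonnet 4.6", "Claude Haiku 4.5"), 1))   # 3.4
print(round(percent_cheaper("Claude Sonnet 4.6", "Claude Opus 4.7")))  # 84
print(fastest_first())
```

Note that these are single-run figures, so the ratios describe this one task rather than general pricing.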
Analysis
Claude Haiku 4.5 successfully identified the reconciling variable (instruction tuning) and cited all three passages as required. The response correctly notes that Passage A’s positive result likely involved instruction-tuned models, while Passage B’s negative result likely came from base models. However, the framing is somewhat hedged: Haiku concludes that CoT “does improve MMLU performance for sub-10B models…but only when they have been instruction-tuned,” which is accurate but presented as a positive outcome qualified by conditions rather than as a conditional statement. The response avoids overstatement and adheres to the citation requirement, but the logical chain feels less decisive than in the other two responses. At 0.00142 USD, Haiku delivers the lowest cost, and its latency (2.7 seconds) is also well below that of both competitors.
Claude Sonnet 4.6 provides the sharpest synthesis. It opens with an explicit thesis: CoT “does not uniformly improve MMLU performance” and frames the outcome as “critically dependent” on training method. The response then methodically attributes each passage to its underlying condition (instruction tuning status), concluding with a precise two-part answer: CoT is “not inherently beneficial” for sub-10B models and can “actively hurt” base models, but becomes advantageous post-instruction-tuning. This structure mirrors the task’s demand for reconciliation by isolating the causal variable and explaining why contradictory findings emerge. Sonnet’s use of bold text for emphasis (“not uniformly improve,” “not inherently beneficial”) enhances readability without violating any stated constraints. The response costs 0.00486 USD and completes in 7.2 seconds. That makes it the slowest of the three, so its appeal rests on quality for the price rather than on speed.
Claude Opus 4.7 also correctly identifies instruction tuning as the reconciling variable and cites all passages. The response frames the effect as “conditional rather than uniform” and explicitly reconstructs the logic: Passage C’s data explains why A and B diverge. However, Opus does not add analytical depth beyond Sonnet’s framing. The concluding statement (“CoT does not categorically help or hurt sub-10B models…it tends to hurt base models and help instruction-tuned ones”) is accurate but echoes Sonnet’s insight without refinement. The response consumed 307 input tokens (against 220 for the others), which suggests some additional prompt-side overhead for the same task text. At 0.031 USD, Opus costs 6.4 times more than Sonnet while delivering negligible additional insight; its one advantage over Sonnet is latency (6.2 seconds against 7.2).
Winner and why
Claude Sonnet 4.6 is the clear winner for this retrieval-synthesis task. It delivers the most decisive causal argument: it explicitly negates the premise of the question (“does not uniformly improve”), identifies the moderating variable (instruction tuning), and reconstructs why the passages are not contradictory but rather observe different conditions. Sonnet’s phrasing is more assertive and reader-friendly than Haiku’s hedged conclusion, yet it avoids the unnecessary token consumption of Opus.
From a cost-per-correct-answer perspective, Sonnet achieves the task’s goal at 3.4 times the cost of Haiku, yet delivers materially clearer reasoning. Against Opus, Sonnet is equally correct while costing less than one-sixth as much. For a benchmark evaluating synthesis quality, Sonnet offers the best balance of clarity and cost. Haiku performs adequately but with a less memorable framing; Opus spends more for modest improvements, if any.
Takeaways
- Instruction tuning is the pivot point: All three models correctly identified this variable, confirming that reconciling conflicting empirical results often requires recognizing conditional relationships. In this case, the same 7B models yielded opposite results based solely on post-training regime, making instruction tuning a stronger predictor than model scale.
- Sonnet optimizes the cost-to-clarity ratio: Haiku is cheaper but less decisive; Opus is more expensive without proportional insight. Sonnet’s 280 output tokens delivered the most assertive and logically complete synthesis at a cost 6.4 times lower than Opus, illustrating how a mid-tier model can come out ahead once cost is weighed against quality.
- Framing matters as much as content: Haiku and Opus reached correct conclusions, yet Sonnet’s negation (“does not uniformly improve”) and explicit conditional structure (“becomes advantageous once”) made the synthesis more actionable. For retrieval-style benchmarks, how a model states its answer affects downstream citation and use.
- Citation and reconciliation are distinct demands: Meeting the citation requirement (cite A, B, C) is table stakes. The harder task is explaining why the passages conflict and when each is correct. Sonnet best addressed this by attributing findings to model training status rather than merely listing them.
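The reconciliation logic described in these takeaways can be made concrete with a small sketch. It encodes the four reported numbers and shows that only Passage C states a tuning status; the status of the models in Passages A and B is inferred by matching the direction of their effect to C. The `Finding` class and helper names are hypothetical, and the inference is an assumption the passages themselves do not confirm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    passage: str
    tuning: Optional[str]  # None when the passage does not report tuning status
    delta_pct: float       # reported MMLU change with chain-of-thought


# Numbers as quoted in the passages; A and B do not state tuning status.
FINDINGS = [
    Finding("A", None, +12.3),
    Finding("B", None, -4.1),
    Finding("C", "base", -3.0),
    Finding("C", "instruction-tuned", +8.0),
]


def direction(delta_pct: float) -> str:
    return "gain" if delta_pct > 0 else "loss"


def direction_by_tuning(findings) -> dict:
    """Direction of the CoT effect for each explicitly reported tuning status."""
    return {f.tuning: direction(f.delta_pct) for f in findings if f.tuning}


def inferred_tuning(finding: Finding, findings) -> Optional[str]:
    """Tuning status whose reported direction matches an unlabeled finding."""
    wanted = direction(finding.delta_pct)
    matches = [t for t, d in direction_by_tuning(findings).items() if d == wanted]
    return matches[0] if len(matches) == 1 else None


for f in FINDINGS:
    if f.tuning is None:
        print(f.passage, "->", inferred_tuning(f, FINDINGS))
```

Keeping `tuning` as `None` for A and B makes the gap in the evidence explicit: citing the passages is a lookup, while reconciling them requires the extra inference step.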
Frequently asked
Why do the three passages report opposite results on chain-of-thought prompting?
Passage A most plausibly tested instruction-tuned 7B models (12.3% gain), and Passage B most plausibly tested base or weakly tuned 6-9B models (4.1% loss), although neither passage states this. Passage C shows the split directly: base 7B models lost 3%, but the same models after instruction tuning gained 8%. Instruction tuning is the reconciling variable.
Does chain-of-thought help or hurt small models on MMLU?
On the evidence in these passages, it depends on instruction tuning. Base sub-10B models degrade with chain-of-thought; instruction-tuned ones improve. Size alone does not determine the effect.
Which model synthesized the passages most accurately?
Claude Sonnet 4.6 provided the clearest reconciliation: it explicitly identified instruction tuning as the deciding variable, correctly attributed each finding to model training status, and concluded with a precise conditional statement.
How does cost per correct answer compare across the models?
Haiku cost $0.00142 but was less decisive; Sonnet 4.6 cost $0.00486 but delivered the sharpest analysis; Opus cost $0.031 but added marginal value over Sonnet for 6.4x the cost.