AIgentic

Agentic Systems & LLM Tooling Daily

Benchmark: Chain-of-Thought Synthesis Across Conflicting

Claude Sonnet 4.6 best reconciles conflicting chain-of-thought findings by pinpointing instruction tuning as the critical variable, balancing completeness with a cost about 84% lower than Opus.


This benchmark task evaluates how well three Claude models reconcile three conflicting research passages into a coherent one-paragraph answer about chain-of-thought prompting for sub-10B models. Claude Sonnet 4.6 emerges as the strongest performer, delivering the most precise synthesis with the clearest causal reasoning, at a per-response cost about 6.4 times lower than Claude Opus 4.7.

Task

Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: “Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?”

Passage A: “In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting.”

Passage B: “For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6-9B models.”

Passage C: “Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%.”

Your answer must cite each passage as [A], [B], or [C] at least once.
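The citation requirement is the one part of the task that can be checked mechanically. A minimal sketch of such a check follows, assuming citations appear as a single bracketed capital letter such as [A]; the function name `missing_citations` and the sample answer are hypothetical and not part of the original benchmark harness.

```python
import re

# Labels the task requires the answer to cite at least once.
REQUIRED_LABELS = ("A", "B", "C")


def missing_citations(answer: str, required=REQUIRED_LABELS) -> list[str]:
    """Return the required passage labels that never appear as [X] in the answer."""
    # Assumption: each citation is one capital letter in square brackets.
    cited = set(re.findall(r"\[([A-Z])\]", answer))
    return [label for label in required if label not in cited]


# Hypothetical answer text, used only to exercise the check.
sample = "CoT helps instruction-tuned models [A][C] but hurts base models [B]."
print(missing_citations(sample))           # []
print(missing_citations("Only [A] here"))  # ['B', 'C']
```

A combined form such as [A, B] would not be recognized by this pattern, so a real grader would need to decide which citation styles count.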

Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 2,653 | 220 | 240 | 0.00142 | Cited all passages, reconciled via instruction tuning, less assertive framing |
| Claude Sonnet 4.6 | 7,196 | 220 | 280 | 0.00486 | Cited all passages, strongest causal argument, clear conditional conclusion |
| Claude Opus 4.7 | 6,193 | 307 | 352 | 0.031 | Cited all passages, correct synthesis, marginal value over Sonnet for 6.4x cost |
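The cost and latency comparisons used in the analysis can be recomputed directly from the table. The sketch below copies the latency and cost figures from the results; the helper names `cost_ratio` and `percent_saving` are illustrative, not part of any benchmark tooling.

```python
# Latency and cost figures copied from the results table.
results = {
    "Claude Haiku 4.5": {"latency_ms": 2653, "cost_usd": 0.00142},
    "Claude Sonnet 4.6": {"latency_ms": 7196, "cost_usd": 0.00486},
    "Claude Opus 4.7": {"latency_ms": 6193, "cost_usd": 0.031},
}


def cost_ratio(model: str, baseline: str) -> float:
    """How many times more the model's response cost than the baseline's."""
    return results[model]["cost_usd"] / results[baseline]["cost_usd"]


def percent_saving(cheaper: str, pricier: str) -> float:
    """Percentage by which the cheaper response undercuts the pricier one."""
    return 100 * (1 - results[cheaper]["cost_usd"] / results[pricier]["cost_usd"])


opus_vs_sonnet = cost_ratio("Claude Opus 4.7", "Claude Sonnet 4.6")
sonnet_vs_haiku = cost_ratio("Claude Sonnet 4.6", "Claude Haiku 4.5")
sonnet_saving = percent_saving("Claude Sonnet 4.6", "Claude Opus 4.7")
by_latency = sorted(results, key=lambda m: results[m]["latency_ms"])

print(f"Opus / Sonnet cost: {opus_vs_sonnet:.1f}x")       # 6.4x
print(f"Sonnet / Haiku cost: {sonnet_vs_haiku:.1f}x")     # 3.4x
print(f"Sonnet saving vs Opus: {sonnet_saving:.0f}%")     # 84%
print("Fastest to slowest:", by_latency)
```

On these figures Haiku is both the cheapest and the fastest, Opus is second on latency, and Sonnet is the slowest of the three.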

Analysis

Claude Haiku 4.5 successfully identified the reconciling variable (instruction tuning) and cited all three passages as required. The response correctly notes that Passage A’s positive result likely involved instruction-tuned models, while Passage B’s negative result likely came from base models. However, the framing is somewhat hedged: Haiku concludes that CoT “does improve MMLU performance for sub-10B models…but only when they have been instruction-tuned,” which is accurate but presented as a positive outcome qualified by conditions rather than as a conditional statement. The response avoids overstatement and adheres to the citation requirement, but the logical chain feels less decisive than in the other two responses. At 0.00142 USD, Haiku delivers the lowest cost, and its latency (about 2.7 seconds) is also the lowest of the three.

Claude Sonnet 4.6 provides the sharpest synthesis. It opens with an explicit thesis: CoT “does not uniformly improve MMLU performance” and frames the outcome as “critically dependent” on training method. The response then methodically attributes each passage to its underlying condition (instruction tuning status), concluding with a precise two-part answer: CoT is “not inherently beneficial” for sub-10B models and can “actively hurt” base models, but becomes advantageous post-instruction-tuning. This structure mirrors the task’s demand for reconciliation by isolating the causal variable and explaining why contradictory findings emerge. Sonnet’s use of bold text for emphasis (“not uniformly improve,” “not inherently beneficial”) enhances readability without violating any stated constraints. The response costs 0.00486 USD and takes 7.2 seconds, the slowest of the three, so its advantage lies in clarity and cost relative to Opus rather than in latency.

Claude Opus 4.7 also correctly identifies instruction tuning as the reconciling variable and cites all passages. The response frames the effect as “conditional rather than uniform” and explicitly reconstructs the logic: Passage C’s data explains why A and B diverge. However, Opus does not add analytical depth beyond Sonnet’s framing. The concluding statement (“CoT does not categorically help or hurt sub-10B models…it tends to hurt base models and help instruction-tuned ones”) is accurate but echoes Sonnet’s insight without refinement. The response consumed 307 input tokens (against 220 for the others) for the same task, which may reflect additional setup or prompt overhead. At 0.031 USD, Opus costs 6.4 times more than Sonnet while delivering negligible additional insight; its latency (6.2 seconds) is about a second lower than Sonnet’s, which does not offset the cost gap.

Winner and why

Claude Sonnet 4.6 is the clear winner for this retrieval-synthesis task. It delivers the most decisive causal argument: it explicitly rejects the premise implied by the question (“does not uniformly improve”), identifies the moderating variable (instruction tuning), and explains that the passages do not contradict one another but observe different conditions. Sonnet’s phrasing is more assertive and reader-friendly than Haiku’s hedged conclusion, and it uses fewer tokens than Opus.

From a cost-per-correct-answer perspective, Sonnet achieves the task’s goal at 3.4 times the cost of Haiku, yet delivers materially clearer reasoning. Versus Opus, Sonnet is functionally identical in correctness while costing roughly one-sixth as much. For a benchmark evaluating synthesis quality, Sonnet offers the best balance of clarity and cost among the three. Haiku performs adequately but with a less memorable framing; Opus spends more for modest improvements, if any.

Takeaways

  1. Instruction tuning is the pivot point: All three models correctly identified this variable, confirming that reconciling conflicting empirical results often requires recognizing conditional relationships. In this case, the same 7B model architecture yielded opposite results based solely on post-training regime, making instruction tuning a stronger predictor than model scale.

  2. Sonnet optimizes the cost-to-clarity ratio: Haiku is cheaper but less decisive; Opus is more expensive without proportional insight. Sonnet’s 280 output tokens delivered the most assertive and logically complete synthesis at a cost 6.4 times lower than Opus. This is a single-task result, but it is consistent with the view that mid-tier models can be the practical choice for synthesis work.

  3. Framing matters as much as content: Haiku and Opus reached correct conclusions, yet Sonnet’s negation (“does not uniformly improve”) and explicit conditional structure (“becomes advantageous once”) made the synthesis more actionable. For retrieval-style benchmarks, how a model states its answer affects downstream citation and use.

  4. Citation and reconciliation are distinct demands: Meeting the citation requirement (cite A, B, C) is table stakes. The harder task is explaining why they conflict and when each is correct. Sonnet best addressed this by attributing findings to model training status rather than merely listing them.
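The reconciliation the models performed can be sketched as a small inference over the three passages. Only Passage C states instruction-tuning status; the sketch below assigns a condition to A and B by matching the sign of their reported change against Passage C, which is an assumption mirroring the models’ reasoning and not something the passages confirm. The `Finding` class and `infer_tuning` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    passage: str
    delta_points: float                # reported MMLU change with CoT
    instruction_tuned: Optional[bool]  # None means the passage does not say


findings = [
    Finding("A", +12.3, None),
    Finding("B", -4.1, None),
    Finding("C", -3.0, False),  # base 7B models
    Finding("C", +8.0, True),   # same models after instruction tuning
]

# Passage C is the only source that states the condition explicitly.
reference = [f for f in findings if f.instruction_tuned is not None]


def infer_tuning(finding: Finding, reference: list[Finding]) -> Optional[bool]:
    """Guess tuning status by matching the sign of the reported change.

    This is an inference, not a fact stated in passages A or B.
    """
    if finding.instruction_tuned is not None:
        return finding.instruction_tuned
    for ref in reference:
        if (ref.delta_points > 0) == (finding.delta_points > 0):
            return ref.instruction_tuned
    return None


for f in findings:
    label = infer_tuning(f, reference)
    source = "stated" if f.instruction_tuned is not None else "inferred"
    print(f"[{f.passage}] {f.delta_points:+.1f} -> instruction_tuned={label} ({source})")
```

Keeping the stated and inferred labels separate makes it visible that the attribution of A and B is a plausible reading of the evidence, not a reported result.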

Frequently asked

Why do the three passages report opposite results on chain-of-thought prompting?

Passage A likely tested instruction-tuned 7B models (12.3% gain); Passage B likely tested base or weakly tuned 6-9B models (4.1% loss). Neither passage states its models’ tuning status, so both attributions are inferences. Passage C directly shows the split: base 7B models lost 3%, but the same models after instruction tuning gained 8%. Instruction tuning is the reconciling variable.

Does chain-of-thought help or hurt small models on MMLU?

On the evidence in these three passages, it depends on instruction tuning. Base sub-10B models degrade with chain-of-thought; instruction-tuned ones improve. Size alone does not determine the effect.

Which model synthesized the passages most accurately?

Claude Sonnet 4.6 provided the clearest reconciliation: it explicitly identified instruction tuning as the reconciling variable, attributed each finding to model training status, and concluded with a precise conditional statement.

How does cost per correct answer compare across the models?

Haiku cost $0.00142 but was less decisive; Sonnet 4.6 cost $0.00486 but delivered the sharpest analysis; Opus cost $0.031 but added marginal value over Sonnet for 6.4x the cost.