AIgentic

Agentic Systems & LLM Tooling Daily

Benchmark: Reconciling conflicting research

All three Claude models correctly identified that chain-of-thought effectiveness for sub-10B models depends on instruction tuning, but Sonnet 4.6 produced the most nuanced synthesis with explicit caveats about when CoT helps versus hurts.


This benchmark task required synthesizing three conflicting research findings into a single coherent answer while citing each source. All three Claude models succeeded in reconciling the apparent contradictions, but with meaningfully different levels of analytical depth and cost efficiency.

Claude Sonnet 4.6 produced the most rigorous synthesis and emerges as the strongest performer for this retrieval-synthesis task. It cost about 3.6 times as much as Haiku but delivered substantially clearer reasoning and more defensible conclusions.

Task

Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: “Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?”

Passage A: “In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting.”

Passage B: “For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6-9B models.”

Passage C: “Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%.”

Your answer must cite each passage as [A], [B], or [C] at least once.
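The citation requirement is mechanical enough to check automatically. The snippet below is a minimal sketch of such a check, not the harness used for this benchmark; the function name `missing_citations` and the sample answers are hypothetical and exist only for illustration.

```python
import re

# Passage labels taken from the task statement.
REQUIRED_TAGS = ("A", "B", "C")

def missing_citations(answer: str, required=REQUIRED_TAGS) -> list[str]:
    """Return the required passage tags that never appear as [X] in the answer."""
    cited = set(re.findall(r"\[([A-Z])\]", answer))
    return [tag for tag in required if tag not in cited]

# Hypothetical answers, written only to exercise the check.
complete = "CoT helps instruction-tuned 7B models [A][C] but can hurt base models [B]."
partial = "CoT helps 7B models [A]."

print(missing_citations(complete))  # prints []
print(missing_citations(partial))   # prints ['B', 'C']
```

A check like this only confirms that each tag appears; it says nothing about whether the passage is cited accurately, which still needs a human or model grader.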

Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 3432 | 220 | 268 | 0.00156 | Complete, accurate, clear but less nuanced |
| Claude Sonnet 4.6 | 8337 | 220 | 327 | 0.00556 | Complete, most rigorous analysis, strongest reasoning |
| Claude Opus 4.7 | 5855 | 307 | 302 | 0.02726 | Complete and accurate but less precise than Sonnet |

Analysis

Claude Haiku 4.5 delivered a functional synthesis in 268 output tokens at the lowest cost. The response correctly identified the reconciling mechanism (instruction tuning as the differentiator) and cited all three passages as required. However, the analysis is somewhat mechanical. Haiku frames the conclusion as “the answer is conditional” and states that CoT “improves MMLU performance for instruction-tuned models but degrades performance for untuned base models.” This is accurate but does not fully grapple with the causal mechanism. The response infers that [A] “likely involved instruction-tuned models” and [B] “involved base models” without explicitly noting that this inference is necessary because the papers don’t state whether their models were instruction-tuned. For a synthesis task requiring reconciliation of genuinely contradictory evidence, Haiku’s approach is adequate but leaves room for deeper integration.

Claude Sonnet 4.6 produced the most comprehensive and analytically rigorous output at 327 tokens. It first lays out the surface-level contradiction, then systematically uses [C] as the reconciling explanation. Crucially, Sonnet articulates that CoT prompting does “not reliably improve MMLU performance for all sub-10B models” rather than saying it conditionally improves performance. This is a subtle but important distinction: the negative cases are not exceptions but are central to the answer. Sonnet adds the observation that “model scale alone is insufficient to predict CoT benefit,” which is a higher-order insight not directly required by the prompt. The response uses italics to emphasize the directional findings (“lost” vs “gained”) and treats the inference about unstated model states ([B] likely using base models) with appropriate epistemic care. Latency is higher than Haiku’s (8.3 seconds versus 3.4), and the cost increase (about 3.6 times) far outpaces the 22% growth in output tokens, so it reflects per-token pricing rather than verbosity. The substantially clearer reasoning still justifies it.

Claude Opus 4.7 achieved near-parity with Haiku in conciseness (302 output tokens) while maintaining accuracy, but at significantly higher cost (0.02726 USD, over 17 times Haiku’s cost). The response is competent: it correctly identifies instruction tuning as the reconciling factor, cites all three passages, and arrives at the conditional conclusion. However, Opus produces less analytical depth than Sonnet despite consuming more input tokens (307 vs 220), although it did respond faster (about 5.9 seconds versus 8.3). The response does not explicitly note that the three papers appear to test different model configurations; it infers this but does not make the inference visible to the reader. For a synthesis task, Opus performs adequately but does not justify its cost premium over Sonnet.

Winner and why

Claude Sonnet 4.6 is the clear winner for this task. It balances three competing objectives effectively. First, correctness: Sonnet correctly identifies that instruction tuning is the confounding variable and reconciles all three passages. Second, analytical rigor: it explicitly states a more defensible conclusion (“CoT does not reliably improve performance for all sub-10B models”) rather than a merely conditional one, and it surfaces the inferential process (e.g., explaining why [B]’s results likely came from base models). Third, cost-efficiency: at 0.00556 USD, Sonnet costs about 3.6 times as much as Haiku but delivers substantially more rigorous analysis per token, and costs about 4.9 times less than Opus while outperforming it in clarity.

For retrieval-synthesis tasks like this, where the goal is not just to answer correctly but to demonstrate integration across conflicting sources, the middle-tier model’s superior reasoning and appropriate epistemic care justify the cost premium over Haiku. Opus’s higher cost does not translate to proportional gains in output quality. Sonnet represents the optimal point on the accuracy-cost frontier for this class of work.

Takeaways

  1. Instruction tuning acts as a hidden variable in seemingly contradictory results. All three models correctly identified that [A], [B], and [C] could coexist if the papers tested models in different training states (base versus instruction-tuned). This is a common pattern in benchmarking work where experimental setup varies across labs. The best synthesis (Sonnet’s) made this inference explicit rather than burying it.

  2. Sonnet’s cost-to-depth ratio is superior for synthesis tasks. Haiku is genuinely cost-effective for straightforward tasks, but synthesis requires articulating inferential reasoning and epistemic boundaries. Sonnet added only 59 output tokens (22% more) and 0.004 USD (about 256% more) compared to Haiku, but delivered substantially clearer integration of evidence. Opus, by contrast, cost about 4.9 times as much as Sonnet for lower analytical output.

  3. Causal framing matters more than conditional framing in reconciliation. Sonnet’s observation that CoT “does not reliably improve performance for all sub-10B models” is a stronger claim than Haiku’s “the answer is conditional.” In synthesis, identifying what consistently makes a phenomenon occur or fail is more useful than merely noting that outcomes vary. All three models reached sound conclusions, but Sonnet’s formulation better captures the underlying causal mechanism.

  4. Model latency varies significantly independent of output quality. Sonnet required 2.4 times Haiku’s latency but delivered better analysis; Opus was faster than Sonnet but cost far more and produced less rigorous work. For synthesis tasks, latency is a poor proxy for quality.
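The percentage deltas in takeaway 2 follow from a plain relative-increase calculation. Here is a minimal sketch using the Haiku and Sonnet figures from the results; the helper name `pct_increase` is an illustrative choice.

```python
def pct_increase(new: float, old: float) -> float:
    """Relative increase of `new` over `old`, expressed as a percentage."""
    return (new - old) / old * 100

# Output tokens and cost for Haiku and Sonnet, copied from the results.
haiku = {"output_tokens": 268, "cost_usd": 0.00156}
sonnet = {"output_tokens": 327, "cost_usd": 0.00556}

extra_tokens = sonnet["output_tokens"] - haiku["output_tokens"]
token_pct = pct_increase(sonnet["output_tokens"], haiku["output_tokens"])
cost_pct = pct_increase(sonnet["cost_usd"], haiku["cost_usd"])

print(f"extra output tokens: {extra_tokens} ({token_pct:.0f}% more)")
print(f"extra cost: {cost_pct:.0f}% more")
```

The gap between the two percentages (about 22% more tokens against about 256% more cost) shows that the price difference comes from per-token pricing, not from a longer answer.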

Frequently asked

Why did the three passages contradict each other on chain-of-thought effectiveness?

They tested different model states: Passage A likely used instruction-tuned 7B models (12.3% gain), Passage B used base 6-9B models (4.1% loss), and Passage C explicitly showed the difference: base 7B models lost 3% while instruction-tuned versions gained 8% with chain-of-thought prompting.

What determines whether chain-of-thought helps or hurts sub-10B models?

Instruction tuning is the critical factor. Base models lack the training to reliably execute multi-step reasoning that chain-of-thought prompting demands, so they perform worse. Instruction-tuned models have learned to follow reasoning scaffolds and benefit substantially.

Which model output best balanced accuracy and cost?

Claude Sonnet 4.6 provided the most complete analysis (explicit caveats about when CoT helps versus hurts) at about 3.6 times the cost of Haiku but roughly 4.9 times cheaper than Opus, giving it the best cost-to-quality ratio for synthesis tasks.
