Benchmark: Reconciling conflicting research findings
All three Claude models correctly identified instruction tuning as the reconciling variable, but Sonnet and Opus provided more rigorous conditional framing. Sonnet offers the best cost-to-accuracy ratio at approximately 0.0051 USD (0.51 cents) per correct answer.
Claude Sonnet 4.6 outperformed both Haiku and Opus at reconciling three contradictory passages on chain-of-thought prompting for sub-10B models. Sonnet provided the most rigorous conditional logic while maintaining clarity and cost efficiency, making it the strongest choice for retrieval-synthesis tasks that demand precise citation and logical reasoning under resource constraints.
Task
Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: “Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?”
Passage A: “In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting.”
Passage B: “For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6, 9B models.”
Passage C: “Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%.”
Your answer must cite each passage as [A], [B], or [C] at least once.
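The citation requirement can be checked mechanically. The sketch below is one way to do it, not the harness used for this benchmark: the function names `citation_counts` and `meets_citation_rule` are hypothetical, and the sample sentence is an abbreviated illustration rather than a verbatim model output.

```python
import re

# Passage tags the task requires, per the instruction above.
REQUIRED_TAGS = ("A", "B", "C")


def citation_counts(answer: str) -> dict[str, int]:
    """Count how often each bracketed tag ([A], [B], [C]) appears in the answer."""
    counts = {tag: 0 for tag in REQUIRED_TAGS}
    for tag in re.findall(r"\[([A-C])\]", answer):
        counts[tag] += 1
    return counts


def meets_citation_rule(answer: str) -> bool:
    """True when every required passage is cited at least once."""
    return all(n >= 1 for n in citation_counts(answer).values())


# Illustrative sentence only (an assumption, not a quoted response).
sample = "CoT can hurt base models [B][C] but helps instruction-tuned ones [A][C]."
print(citation_counts(sample))      # {'A': 1, 'B': 1, 'C': 2}
print(meets_citation_rule(sample))  # True
```

The same counts give a rough citation-density figure per passage, which is the quantity the analysis compares across models.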
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 2775 | 220 | 231 | 0.00137 | Complete, cites all three, identifies instruction tuning but frames conclusion weakly |
| Claude Sonnet 4.6 | 7177 | 220 | 298 | 0.00513 | Complete, cites all passages with multiple references, strongest conditional framing, most defensible synthesis |
| Claude Opus 4.7 | 7484 | 307 | 384 | 0.0334 | Complete, cites all passages, clear logic, but uses italics for emphasis (unnecessary given AIgentic voice standards) |
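For readers who want to compare the rows on a common scale, the following sketch derives two rough figures from the table: output tokens per second and cost per 1,000 total tokens. The `Run` class and its property names are hypothetical. The latency column is treated as end-to-end wall-clock time, which is an assumption, so the throughput figure is only an approximation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One row of the results table (values copied from the table above)."""
    model: str
    latency_ms: int
    input_tokens: int
    output_tokens: int
    cost_usd: float

    @property
    def output_tokens_per_second(self) -> float:
        # Assumes latency is end-to-end time for the whole response.
        return self.output_tokens / (self.latency_ms / 1000)

    @property
    def usd_per_1k_total_tokens(self) -> float:
        # Blended figure: input and output tokens are not priced separately here.
        total = self.input_tokens + self.output_tokens
        return self.cost_usd / total * 1000


RUNS = [
    Run("Claude Haiku 4.5", 2775, 220, 231, 0.00137),
    Run("Claude Sonnet 4.6", 7177, 220, 298, 0.00513),
    Run("Claude Opus 4.7", 7484, 307, 384, 0.0334),
]

for run in RUNS:
    print(
        f"{run.model}: {run.output_tokens_per_second:.1f} tok/s, "
        f"{run.usd_per_1k_total_tokens:.4f} USD per 1k tokens"
    )
```

These are single-run numbers, so they describe this one prompt rather than typical behavior.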
Analysis
Haiku completed the task in 2.78 seconds at the lowest cost but produced the weakest synthesis of the three. The response correctly identifies instruction tuning as the reconciling variable and cites all three passages, but its conclusion frame is ambiguous. Haiku states that chain-of-thought “does improve performance…but only when those models have undergone instruction tuning,” which technically answers the original question but misses the opportunity to note that the answer is conditional rather than affirmative. The response treats Passage A’s gains as the baseline claim and qualifies it, rather than concluding that chain-of-thought is unreliable across the class. Haiku’s brevity (231 output tokens) works against precision here; the task explicitly requires reconciling apparent contradiction, and Haiku does the minimum to do so without excavating the deeper logic.
Sonnet performed most strongly on logical rigor and citation density. It takes a more defensible stance: “CoT prompting does not reliably improve MMLU for sub-10B models as a general class, and can actively hurt performance in base models [B][C], but it can yield meaningful gains when those smaller models have been instruction-tuned [A][C].” This framing is more accurate than Haiku’s because it resets the baseline answer from “yes, if qualified” to “no, unless instruction-tuned.” Sonnet cites Passage B twice and Passage C twice, showing deeper engagement with the evidence. The phrase “model size alone is insufficient to predict CoT benefit” is a key synthesizing insight not present in the other responses. Sonnet also acknowledges the inference step: “This suggests that [A]’s gains likely reflect results on instruction-tuned models, while [B]’s decrease reflects behavior in base or minimally fine-tuned models.” This transparency about inference (rather than assertion) is methodologically sound. At 298 output tokens and 7.18 seconds, Sonnet is marginally faster than Opus while delivering sharper logic.
Opus also completes the task correctly and cites all passages, but exhibits two minor weaknesses. First, it uses italics (“lost” and “gained”) for emphasis, which aligns poorly with AIgentic’s direct, unadorned voice. Second, its concluding summary (“headline numbers range from large gains [A] to outright regressions [B]”) is slightly less precise than Sonnet’s conditional framing. The core logic is sound: instruction tuning is the mediating variable, Passage A reflects instruction-tuned models, Passage B reflects base models, Passage C provides the mechanism. However, Opus frames the answer as “CoT does not reliably help sub-10B models as a class; it helps instruction-tuned ones [C] and can actively hurt models that lack the tuning,” which is functionally equivalent to Sonnet’s framing but less explicitly hedged against the original question. At 0.0334 USD and 7.48 seconds, Opus is both the most expensive and the slowest of the three, making its cost-to-benefit ratio unfavorable.
Winner and why
Sonnet 4.6 emerges as the best output for this retrieval-synthesis task. It correctly reconciles all three passages, cites each one multiple times, and most importantly frames the conditional answer with maximal precision: chain-of-thought does not reliably improve MMLU for sub-10B models in general, but does improve instruction-tuned variants. This is more defensible than Haiku’s weaker conclusion and more rigorous than Opus’s phrasing.
The cost-to-benefit tradeoff strongly favors Sonnet. At 0.00513 USD versus Opus’s 0.0334 USD (6.5x higher), Sonnet delivers equivalent or superior logical rigor. Compared to Haiku at 0.00137 USD, Sonnet costs 3.7x more but produces meaningfully better synthesis. The marginal cost of Sonnet’s additional 67 output tokens is justified by denser citation, explicit conditional framing, and meta-reasoning about inference steps. For tasks where accuracy of synthesis matters (as it does in every retrieval-synthesis benchmark), Sonnet’s cost per correct answer is approximately 0.0051 USD (0.51 cents), which, given the quality of its synthesis, makes it the best value in the set.
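The ratios in this comparison follow directly from the results table. The sketch below reproduces the arithmetic; the dictionary and function names are hypothetical, and the cost-per-correct-answer helper assumes each model's single run counts as exactly one correct answer, so that figure equals the per-run cost.

```python
# Per-run cost and output-token counts, copied from the results table.
COST_USD = {"haiku": 0.00137, "sonnet": 0.00513, "opus": 0.0334}
OUTPUT_TOKENS = {"haiku": 231, "sonnet": 298, "opus": 384}


def cost_ratio(numerator: str, denominator: str) -> float:
    """How many times more one model's run cost than another's."""
    return COST_USD[numerator] / COST_USD[denominator]


def cost_per_correct(total_cost_usd: float, correct_answers: int) -> float:
    """Total spend divided by the number of correct answers it bought."""
    if correct_answers <= 0:
        raise ValueError("need at least one correct answer")
    return total_cost_usd / correct_answers


opus_vs_sonnet = cost_ratio("opus", "sonnet")
sonnet_vs_haiku = cost_ratio("sonnet", "haiku")
extra_tokens = OUTPUT_TOKENS["sonnet"] - OUTPUT_TOKENS["haiku"]

print(f"Opus vs Sonnet: {opus_vs_sonnet:.1f}x")    # 6.5x
print(f"Sonnet vs Haiku: {sonnet_vs_haiku:.1f}x")  # 3.7x
print(f"Extra Sonnet output tokens: {extra_tokens}")  # 67
# Assumption: one run, one correct answer.
print(f"Sonnet cost per correct answer: {cost_per_correct(COST_USD['sonnet'], 1):.5f} USD")
```

With more than one trial per model, `correct_answers` would be the count of runs judged correct, and the per-correct figure would rise for any model that sometimes fails.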
Takeaways
Instruction tuning acts as a strong moderating variable in prompt engineering effectiveness: all three models converged on identifying it as the reconciling variable, suggesting the passages themselves encode this pattern clearly enough for every model tier tested to extract. Sonnet’s repeated citation of Passage C (twice, versus Opus’s once and Haiku’s once) coincides with its more explicit conditional framing, which is consistent with citation density tracking logical rigor in synthesis tasks, although a single task with three responses cannot establish that.
Chain-of-thought performance on MMLU below 10B parameters is not binary but conditional on training regime. None of the three models answered the original question as “yes” or “no” without qualification, demonstrating that even cost-optimized inference can recognize when a question requires conditional framing. The spread in latency (2.8 to 7.5 seconds) presumably reflects differences in model size and serving conditions, neither of which this benchmark measured, and it does not predict output quality: Haiku was fastest and cheapest but weakest.
Cost per correct synthesis is not linear with model size. Sonnet’s roughly 6.5x lower cost than Opus, combined with equivalent or superior accuracy, makes it the default choice on this evidence for retrieval-synthesis and reconciliation tasks in production retrieval-augmented generation (RAG) pipelines. Haiku’s speed advantage is offset by weaker framing; for this task, the precision of “reliably helps” versus “conditionally helps” is worth the latency penalty.
Further reading
- Claude Haiku, Sonnet, Opus model card and pricing provides official latency and cost baselines for comparison.
- Chain-of-thought prompting research by Wei et al. is the foundational paper establishing CoT’s effectiveness on reasoning tasks.
- Instruction tuning as a moderating variable in the LIMA study examines how fine-tuning affects model capabilities across scales.
- MMLU benchmark documentation provides the ground-truth dataset and evaluation protocol used in the passages.
Frequently asked
What is retrieval-style synthesis in benchmark evaluation?
Retrieval-style synthesis tests whether a model can read multiple conflicting passages and produce a coherent single-paragraph reconciliation that cites each source and identifies the underlying variable resolving the apparent contradiction. This task measures comprehension, citation accuracy, and logical reasoning.
Did all models successfully reconcile the three passages?
Yes. All three models identified instruction tuning as the key variable explaining why Passage A showed gains while Passage B showed losses. Each model cited all three passages at least once as required. Sonnet and Opus provided more explicit conditional framing of when chain-of-thought helps versus hurts.
What is the most defensible answer to the original question?
Chain-of-thought prompting does not reliably improve MMLU performance for sub-10B models as a general class. It helps instruction-tuned models (8% gain per Passage C) but harms base models (3% loss per Passage C, 4.1% decrease per Passage B). Passage A's 12.3% improvement likely reflects instruction-tuned variants.
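The inference behind this answer can be written out as a small sign-matching rule. This is a sketch of the reasoning only: Passages A and B do not state which training regime they tested, so assigning them a regime is an assumption. The names `PASSAGE_C`, `UNLABELED`, and `likely_regime` are hypothetical, and the numbers are the percentage changes as reported in the passages.

```python
# Passage C is the only source that reports the effect per training regime.
PASSAGE_C = {"base": -3.0, "instruction-tuned": 8.0}

# Passages A and B report an effect without naming the regime.
UNLABELED = {"A": 12.3, "B": -4.1}


def likely_regime(delta: float) -> str:
    """Return the Passage C regime whose effect has the same sign as delta.

    This is an inference, not something Passage A or B states.
    """
    for regime, c_delta in PASSAGE_C.items():
        if (delta > 0) == (c_delta > 0):
            return regime
    raise ValueError("no regime in Passage C matches this effect")


inferred = {passage: likely_regime(delta) for passage, delta in UNLABELED.items()}
print(inferred)  # {'A': 'instruction-tuned', 'B': 'base'}
```

Matching on sign alone ignores the difference in magnitude between 12.3% and 8%, which is one reason the answer says “likely” rather than asserting the attribution.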
How do latency and cost scale across the three models?
Haiku completed in 2.8 seconds at 0.00137 USD. Sonnet took 7.2 seconds at 0.00513 USD. Opus required 7.5 seconds at 0.0334 USD. Haiku is fastest and cheapest; Opus is slowest and most expensive. Sonnet balances cost and quality most effectively.