---
title: "Benchmark: Reconciling Conflicting Research"
description: "Claude models reconcile contradictory findings on chain-of-thought effectiveness for sub-10B models. All three correctly identify instruction tuning as the key variable."
tldr: "All three Claude models correctly identified instruction tuning as the key variable reconciling conflicting chain-of-thought results, but Sonnet and Opus provided clearer causal framing and practical guidance. Sonnet achieved the best balance of accuracy, clarity, and cost."
url: "https://aigentic.blog/benchmark-retrieval-synthesis-2026-06-05"
publishedAt: "2026-06-05T13:00:11.453Z"
updatedAt: "2026-06-05T13:00:11.453Z"
category: "benchmark"
tags: ["benchmark","claude","retrieval-synthesis","reasoning","prompt-engineering"]
---

# Benchmark: Reconciling Conflicting Research

> All three Claude models correctly identified instruction tuning as the key variable reconciling conflicting chain-of-thought results, but Sonnet and Opus provided clearer causal framing and practical guidance. Sonnet achieved the best balance of accuracy, clarity, and cost.

The task required synthesizing three conflicting research passages into a single reconciling paragraph, with mandatory citations to each source. Claude Opus 4.7 produced the most direct and precise answer, while Claude Sonnet 4.6 provided the most actionable framing and pedagogical clarity. All three models correctly identified the core insight: instruction tuning moderates chain-of-thought effectiveness for sub-10B models.

## Task

> Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: "Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?"
> 
> Passage A: "In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting."
> Passage B: "For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6, 9B models."
> Passage C: "Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%."
> 
> Your answer must cite each passage as [A], [B], or [C] at least once.
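
The citation requirement is mechanical enough to check automatically. The following is a minimal sketch, not the harness used for this benchmark: the function name `missing_citations` and the sample answer are hypothetical, and it assumes citations appear as a single capital letter in square brackets.

```python
import re

REQUIRED_LABELS = ("A", "B", "C")  # passage labels named in the task prompt


def missing_citations(answer: str, required=REQUIRED_LABELS) -> list:
    """Return the required passage labels that never appear as [X] in the answer."""
    cited = set(re.findall(r"\[([A-Z])\]", answer))
    return [label for label in required if label not in cited]


# Hypothetical answer text, written for illustration only (not a model output).
sample = "CoT helps instruction-tuned 7B models [A][C] but hurts base models [B][C]."

print(missing_citations(sample))       # []
print(missing_citations("Only [A]."))  # ['B', 'C']
```

An empty list means every passage was cited at least once. The check says nothing about whether each citation is attached to the right claim.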

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|-----------|---------|
| Claude Haiku 4.5 | 4,127 | 220 | 233 | 0.00138 | Complete, accurate, clear but verbose preamble |
| Claude Sonnet 4.6 | 7,479 | 220 | 302 | 0.00519 | Complete, accurate, clearest causal framing and practical guidance |
| Claude Opus 4.7 | 5,657 | 307 | 289 | 0.02628 | Complete, accurate, most concise and direct |
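
The cost and length comparisons used in the rest of this post can be recomputed from the table above. A minimal sketch follows; the dictionary layout and the helper names `cost_ratio` and `token_gap` are illustrative assumptions, and the figures are copied directly from the table.

```python
# Figures copied from the results table: (output tokens, cost in USD).
RESULTS = {
    "Haiku 4.5": (233, 0.00138),
    "Sonnet 4.6": (302, 0.00519),
    "Opus 4.7": (289, 0.02628),
}


def cost_ratio(pricier: str, cheaper: str) -> float:
    """How many times more the first model's run cost than the second's."""
    return RESULTS[pricier][1] / RESULTS[cheaper][1]


def token_gap(longer: str, shorter: str) -> int:
    """Difference in output tokens between two runs."""
    return RESULTS[longer][0] - RESULTS[shorter][0]


print(round(cost_ratio("Sonnet 4.6", "Haiku 4.5"), 2))  # 3.76
print(round(cost_ratio("Opus 4.7", "Sonnet 4.6"), 2))   # 5.06
print(round(cost_ratio("Opus 4.7", "Haiku 4.5"), 2))    # 19.04
print(token_gap("Sonnet 4.6", "Haiku 4.5"))             # 69
print(token_gap("Sonnet 4.6", "Opus 4.7"))              # 13
```

These are single-run figures, so the ratios describe this benchmark only and are not a general pricing comparison.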

## Analysis

**Claude Haiku 4.5** completed the task correctly and cited all three passages. The output opens with a section heading ("Reconciling Chain-of-Thought Effects on Sub-10B Models"), which a one-paragraph answer does not need, although the task did not explicitly forbid it. The heading adds length without substantive contribution. The reconciliation that follows is solid: instruction tuning is the critical variable, and the three passages map onto base versus instruction-tuned variants. The logic is sound, the synthesis is clear, and all three passages receive explicit bracketed citations. Latency was the lowest at 4.1 seconds, and token use was reasonable at 233 output tokens for a complete answer.

**Claude Sonnet 4.6** produced the longest response (302 output tokens) and had the highest latency (7.5 seconds). The added length is justified by superior analytical depth. The model explicitly labels the reconciliation as "genuine mixed results" rather than a simple oversight, and provides clearer causal reasoning: it articulates that base models "lack the instruction-following scaffolding needed to use intermediate reasoning steps effectively." This explanation of the mechanism (error compounding in untuned models) is absent from Haiku's output. Sonnet also adds a practical conclusion ("Practitioners working with sub-10B models should treat instruction tuning as a near-essential prerequisite") that goes beyond synthesis into actionable guidance. All three passages are cited with explicit bracketing. The run cost more than Haiku's (0.00519 USD versus 0.00138 USD), but the informational yield justifies it for users who need reasoning clarity.

**Claude Opus 4.7** delivered the most direct correct answer (289 output tokens, shorter than Sonnet's 302 but longer than Haiku's 233) at moderate latency (5.7 seconds). The reconciliation is tight and precise: it identifies model conditioning as the variable, maps each passage to a hypothesized model variant (base versus tuned), and concludes in a single sentence. The language is direct ("CoT does not reliably improve MMLU for models under 10B in general, but it does yield substantial gains for instruction-tuned models"). All three passages receive citation. The trade-off is that Opus sacrifices some pedagogical detail; the mechanism explanation ("they struggle to generate coherent reasoning chains and instead compound errors") is present but brief. The cost is substantially higher due to Opus's pricing tier, making it the most expensive run at 0.026 USD even though its output is shorter than Sonnet's.

## Winner and why

**Claude Sonnet 4.6** is the optimal choice for this task. It balances correctness, clarity, and cost most effectively.

All three models are factually correct and complete the citation requirement. The substantive difference lies in analytical depth and practical value. Sonnet provides the clearest causal explanation of why base models fail with chain-of-thought prompting (error compounding due to insufficient instruction-following capability) and explicitly connects this to the observed performance gaps across passages. More importantly, it translates the reconciliation into actionable guidance for practitioners.

Opus achieves comparable correctness in fewer tokens, but its brevity comes at the expense of the mechanistic explanation that helps readers understand not just that instruction tuning matters, but why. Haiku is correct but adds structural overhead without compensating depth.

Cost-wise, Sonnet is not the cheapest (Haiku at 0.00138 USD wins that category), but it costs roughly one-fifth as much as Opus (0.00519 USD versus 0.02628 USD, about 5.1x lower) while delivering materially better analysis than Haiku. For retrieval-and-synthesis tasks that prioritize both accuracy and interpretability, Sonnet's cost-per-insight ratio is superior.

## Takeaways

1. **Instruction tuning emerged as the critical moderating variable across all three models.** None of the models were misled by the conflicting numerical results; all correctly recognized that model conditioning, not size alone, explains the apparent contradiction. This suggests the underlying logic is robust enough that different model tiers converge on the same explanation despite varying response styles.

2. **Sonnet achieved the best cost-to-clarity tradeoff for synthesis tasks.** Haiku was faster and cheaper per output token, but added structural framing that consumed tokens without deepening understanding. Opus was slower and 19x more expensive than Haiku while producing no materially better reconciliation. Sonnet's middle cost (3.75x Haiku, 1/5 Opus) purchased the clearest mechanistic explanation and most actionable conclusion.

3. **Mechanistic reasoning justifies longer outputs in synthesis tasks.** The extra 69 output tokens Sonnet used versus Haiku (and 13 versus Opus) were spent on causal explanation (why base models compound errors, why instruction tuning provides scaffolding) rather than filler. For tasks where users need to understand not just the answer but the reasoning, this length premium is justified. Shorter does not always mean better.

4. **All three models correctly applied the citation requirement.** Each passage was cited at least once with explicit bracket notation ([A], [B], [C]), and the logical mapping of passages to underlying model variants was sound across all outputs. For constraint-based synthesis tasks, this consistency suggests the instruction was clear and unambiguous.

## Further reading

- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) on arXiv: the seminal Wei et al. paper introducing chain-of-thought prompting methodology.
- [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard): Hugging Face's benchmark suite tracking MMLU performance across open models including sub-10B variants.
- [Instruction Tuning for Large Language Models: A Survey](https://arxiv.org/abs/2308.10792) on arXiv: comprehensive review of instruction-tuning techniques and their impact on reasoning and task performance.
- [MMLU Benchmark Homepage](https://github.com/hendrycks/test): official repository containing the Massive Multitask Language Understanding dataset and evaluation code.

## Frequently asked

### What is the core insight that reconciles the three conflicting passages?

Instruction tuning is the critical moderating variable. Base models under 10B parameters lose accuracy with chain-of-thought prompting, while instruction-tuned models gain accuracy. Passage A likely tested instruction-tuned models, Passage B likely tested base models, and Passage C explicitly demonstrated the split.

### Does chain-of-thought prompting help or hurt smaller models?

It depends on model conditioning. For base pretrained sub-10B models, chain-of-thought prompting typically decreases MMLU performance (a 3% loss in Passage C). For instruction-tuned models in the same size range, it substantially improves performance: Passage C reports an 8% gain for the same 7B models after instruction tuning.

### Why did the three papers report opposite results?

The papers tested different model variants. The 12.3% improvement reported likely came from instruction-tuned models, while the 4.1% decrease came from base models. Without controlling for instruction tuning as a variable, the results appeared contradictory.

### What should practitioners do with sub-10B models and chain-of-thought prompting?

Instruction tuning should be treated as a prerequisite before applying chain-of-thought prompting to models under 10B parameters. Base models may perform worse with chain-of-thought reasoning due to compounding errors in intermediate steps.