---
title: "Benchmark: Retrieval-synthesis reconciliation"
description: "Claude Haiku, Sonnet, and Opus tested on synthesizing conflicting research findings about chain-of-thought prompting for sub-10B models. Sonnet wins."
tldr: "Three Claude models reconciled conflicting passages on chain-of-thought effects for small LLMs. Sonnet 4.6 produced the most precise synthesis, correctly identifying instruction tuning as the key moderator while maintaining clarity. Haiku was competitive but less rigorous; Opus was accurate but costly."
url: "https://aigentic.blog/benchmark-retrieval-synthesis-conflicting-passages-sonnet"
publishedAt: "2026-05-20T13:00:40.349Z"
updatedAt: "2026-05-20T13:00:40.349Z"
category: "benchmark"
tags: ["benchmark","claude","retrieval-synthesis","chain-of-thought","mmlu","model-evaluation"]
---

# Benchmark: Retrieval-synthesis reconciliation

> Three Claude models reconciled conflicting passages on chain-of-thought effects for small LLMs. Sonnet 4.6 produced the most precise synthesis, correctly identifying instruction tuning as the key moderator while maintaining clarity. Haiku was competitive but less rigorous; Opus was accurate but costly.

Claude Sonnet 4.6 was the strongest performer on this retrieval-synthesis benchmark, delivering the most logically rigorous and precise reconciliation of the conflicting passages at roughly one-fifth of Opus's cost. Haiku was cheaper still at $0.00152 per answer, but its synthesis was less rigorous. All three models cited the required passages and reached defensible conclusions; what separated Sonnet from its peers was its explicit identification of instruction tuning as a critical moderating variable and its clear argument structure.

## Task

> Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: "Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?"
>
> Passage A: "In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting."
> Passage B: "For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6, 9B models."
> Passage C: "Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%."
>
> Your answer must cite each passage as [A], [B], or [C] at least once.

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|------------|---------|
| Claude Haiku 4.5 | 3,499 | 220 | 260 | $0.00152 | Complete, accurate, adequate |
| Claude Sonnet 4.6 | 8,609 | 220 | 291 | $0.00503 | Complete, precise, superior structure |
| Claude Opus 4.7 | 6,213 | 307 | 300 | $0.0271 | Complete, accurate, inefficient |
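
The token and cost columns can be turned into per-token and relative figures with a few lines of arithmetic. The sketch below is illustrative only: the `RESULTS` dictionary simply restates the table, and the helper name `derived_metrics` is hypothetical. Dividing total cost by total tokens blends input and output pricing, so treat the per-token numbers as a rough comparison rather than a published rate.

```python
# Figures copied from the results table above; nothing here is measured anew.
RESULTS = {
    "Claude Haiku 4.5": {"input": 220, "output": 260, "cost": 0.00152},
    "Claude Sonnet 4.6": {"input": 220, "output": 291, "cost": 0.00503},
    "Claude Opus 4.7": {"input": 307, "output": 300, "cost": 0.0271},
}


def derived_metrics(results, baseline="Claude Haiku 4.5"):
    """Return total tokens, blended cost per token, and cost ratios vs. the baseline."""
    base = results[baseline]
    base_per_token = base["cost"] / (base["input"] + base["output"])
    metrics = {}
    for model, row in results.items():
        total = row["input"] + row["output"]
        per_token = row["cost"] / total
        metrics[model] = {
            "total_tokens": total,
            "cost_per_token": per_token,
            # Ratio of whole-task cost to the baseline's whole-task cost.
            "total_cost_ratio": row["cost"] / base["cost"],
            # Ratio of blended per-token cost to the baseline's.
            "per_token_ratio": per_token / base_per_token,
        }
    return metrics


if __name__ == "__main__":
    for model, m in derived_metrics(RESULTS).items():
        print(
            f"{model}: {m['total_tokens']} tokens, "
            f"${m['cost_per_token']:.2e}/token, "
            f"{m['total_cost_ratio']:.1f}x total, "
            f"{m['per_token_ratio']:.1f}x per token"
        )
```

With the table's inputs, Sonnet comes out at about 3.3x Haiku's total cost but 3.1x per token, and Opus at about 17.8x total but 14.1x per token. The two ratios differ because the models consumed different numbers of tokens.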

## Analysis

**Haiku 4.5** delivered a functionally correct answer. It cited all three passages, identified instruction tuning as the reconciling factor, and reasoned that passage A likely described instruction-tuned models while passage B likely described base models. The logic is sound: "base models of this size actually perform worse with chain-of-thought reasoning, suggesting these smaller models lack sufficient capability to reliably execute multi-step reasoning without explicit instruction-following training." However, its framing is slightly less precise than its competitors'. It opens with a full heading ("Reconciling Chain-of-Thought Effects on Sub-10B Models") even though the task explicitly requested a single paragraph. Its conclusion also buries the conditionality, saying CoT "does improve MMLU performance for sub-10B models, but only when..." rather than stating the condition from the start. The output was efficient in tokens and cost, reflecting Haiku's typical profile.
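
The task's two formal requirements, a single paragraph and at least one citation of each passage, are mechanical enough to check in code. The following is a minimal sketch, not the harness used for this benchmark: the function name `check_answer_format` and its heuristics (a line starting with `#` counts as a heading, a blank line counts as a paragraph break) are assumptions made for illustration.

```python
import re

REQUIRED_CITATIONS = ("A", "B", "C")


def check_answer_format(answer: str) -> dict:
    """Check citation coverage and the one-paragraph constraint for a model answer."""
    text = answer.strip()
    # Count bracketed citations such as [A]; the task requires each at least once.
    cited = {
        label: len(re.findall(rf"\[{label}\]", text))
        for label in REQUIRED_CITATIONS
    }
    missing = [label for label, count in cited.items() if count == 0]
    # Heuristic (assumption): a line starting with '#' is a Markdown heading.
    has_heading = any(line.lstrip().startswith("#") for line in text.splitlines())
    # Heuristic (assumption): blank lines separate paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    return {
        "missing_citations": missing,
        "has_heading": has_heading,
        "paragraph_count": len(paragraphs),
        "passes": not missing and not has_heading and len(paragraphs) == 1,
    }
```

A check like this would flag an answer that opens with a Markdown heading, but only if the heading is formatted with `#`; a bolded title line would need an additional rule.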

**Sonnet 4.6** produced the strongest synthesis. It explicitly labeled the moderating variables upfront ("model type and training regime are critical moderating variables") and immediately invoked passage C as the "most clarifying framework," creating a clear logical hierarchy. The analysis distinguishes two regimes: base models, which are degraded by "added reasoning demands," and instruction-tuned models, which are "more likely to benefit." Critically, Sonnet gave a nuanced negative answer to the direct question: CoT does "not reliably improve MMLU performance for all sub-10B models as a class," which is more precise than a simple yes or no. The paragraph is denser and more structured than Haiku's, moving from the clarifying framework to resolving the A/B tension to a unified conclusion. Latency was higher (8.6 seconds), but cost remained reasonable: $0.00503 in total, or about $0.0000173 per output token when the full cost is spread over the 291 output tokens.

**Opus 4.7** also delivered a correct answer with logic similar to Sonnet's, identifying instruction tuning as the decisive factor and reasoning through why passages A and B conflict. Its language is slightly more direct ("This suggests that...") and it provides a clean summary: "CoT does not reliably improve MMLU for models under 10B parameters on its own." However, Opus was billed for 307 input tokens (versus 220 for both Haiku and Sonnet) on what appears to be the same prompt. That gap more likely reflects a difference in tokenization than anything about how the model handled the task, but it still adds to the bill. More critically, its cost was $0.0271, approximately 5.4x Sonnet's and 17.8x Haiku's, without a corresponding quality gain over Sonnet. Its latency (6.2 seconds) fell between Haiku's and Sonnet's, so it was faster than Sonnet but offered no speed advantage over Haiku.

## Winner and why

**Sonnet 4.6** is the clear winner. It balanced correctness, clarity, and cost more effectively than either alternative. Compared to Haiku, Sonnet delivered a more sophisticated argument structure, signaling the hierarchy of evidence (passage C as the clarifying framework) and giving a more precise conditional answer to the core question, at about 3.3x Haiku's cost. Compared to Opus, Sonnet achieved equal accuracy at under one-fifth the cost and with fewer billed input tokens (220 versus 307): $0.00503 per correct answer versus $0.0271, a 5.4x premium for Opus with equivalent or slightly lower clarity. For retrieval-synthesis tasks where multiple passages must be reconciled into a coherent answer, Sonnet's ability to frame the problem clearly and identify the key variable upfront is a significant advantage. In practical deployment, Sonnet represents the best tradeoff on this task between model capacity (sufficient for multi-source reasoning) and resource efficiency.

## Takeaways

Instruction tuning emerged as the critical moderating variable across all three models' reasoning. This is not a weakness in the models' synthesis; rather, it reflects the actual structure of the evidence. Passage C provided the interpretive key that unlocked the apparent contradiction between passages A and B. Models that identified this framework early and prominently (Sonnet and Opus) produced structurally superior syntheses.
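
One way to make this reconciliation concrete is to treat an evaluation's average CoT effect as a blend of the two figures in passage C. This is a back-of-the-envelope sketch under strong assumptions that none of the passages state: that the -3 and +8 figures from [C] are on the same scale as the numbers in [A] and [B], that they transfer to other evaluations, and that each evaluation is an equally weighted mix of base and instruction-tuned models. The function names are hypothetical.

```python
BASE_EFFECT = -3.0   # MMLU change for base 7B models with CoT, per passage C
TUNED_EFFECT = 8.0   # MMLU change for the same models after instruction tuning, per passage C


def blended_effect(tuned_fraction: float) -> float:
    """Average CoT effect if `tuned_fraction` of the evaluated models are instruction-tuned."""
    if not 0.0 <= tuned_fraction <= 1.0:
        raise ValueError("tuned_fraction must be between 0 and 1")
    return tuned_fraction * TUNED_EFFECT + (1.0 - tuned_fraction) * BASE_EFFECT


def break_even_fraction() -> float:
    """Share of instruction-tuned models at which the blended effect is exactly zero."""
    return -BASE_EFFECT / (TUNED_EFFECT - BASE_EFFECT)
```

Under these assumptions the break-even share is 3/11, or about 27% instruction-tuned models. The blended effect can only range from -3 to +8, so passage C accounts for the sign of the results in [A] and [B] but not their full magnitudes (+12.3 and -4.1); differences in models, prompts, or evaluation setup would have to explain the remainder.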

Cost-per-correct-answer varies sharply across model tiers. Haiku at $0.00152 was cheapest but offered adequate rather than superior synthesis; Sonnet at $0.00503 delivered the best quality-to-cost ratio; Opus at $0.0271 crossed into inefficiency for this task tier. For retrieval-synthesis work at scale, Sonnet's middle ground is more defensible than pushing to the largest model without a clear accuracy gain.
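
To see what these per-task costs imply at volume, the figures can be scaled linearly. This sketch assumes every request costs the same as the single run measured here, which real workloads with varying prompt and output lengths will not match exactly; `projected_cost` is a hypothetical helper.

```python
# Per-task costs in USD, copied from the results table.
COST_PER_TASK = {
    "Claude Haiku 4.5": 0.00152,
    "Claude Sonnet 4.6": 0.00503,
    "Claude Opus 4.7": 0.0271,
}


def projected_cost(model: str, tasks: int) -> float:
    """Linear projection: per-task cost multiplied by the number of tasks."""
    if tasks < 0:
        raise ValueError("tasks must be non-negative")
    return COST_PER_TASK[model] * tasks


if __name__ == "__main__":
    for model in COST_PER_TASK:
        print(f"{model}: ${projected_cost(model, 100_000):,.2f} per 100,000 tasks")
```

At 100,000 tasks this projection gives roughly $152 for Haiku, $503 for Sonnet, and $2,710 for Opus, which is where the gap between tiers starts to matter for budgeting.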

Latency differences were noticeable in absolute terms (Haiku: 3.5s, Sonnet: 8.6s, Opus: 6.2s) but unlikely to affect the end-user experience on a single synthesis task. Sonnet's slower inference is not a practical constraint for synthesis work where quality matters more than sub-second response times.

The ability to recognize and name moderating variables (Sonnet: "model type and training regime are critical moderating variables") is a subtle but measurable quality differentiator. This phrasing signals deeper reasoning about causation versus mere correlation, which is valuable for tasks requiring evidence synthesis in domains with hidden confounders.

## Further reading

- [Anthropic's Claude model family documentation](https://docs.anthropic.com/en/docs/about-claude/models/overview) provides specifications for Haiku, Sonnet, and Opus including token limits and pricing.
- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) (Wei et al., 2022) is the foundational paper on CoT prompting and its effectiveness across model scales.
- [On the Instruction Tuning Problem of Commonsense Reasoning](https://arxiv.org/abs/2305.12737) examines how instruction tuning reshapes model behavior on reasoning tasks, directly relevant to understanding passage C's findings.
- [Scaling Laws and Emergent Abilities in Language Models](https://arxiv.org/abs/2305.16264) provides context on when and why smaller models struggle with multi-step reasoning.

## Frequently asked

### Why did the models disagree on whether CoT helps sub-10B models?

They didn't disagree. All three identified instruction tuning as the critical variable. The apparent conflict in the source passages (12.3% gain vs 4.1% loss) most plausibly stems from different model types: per passage C, instruction-tuned models benefit while base models degrade. Sonnet articulated this distinction most clearly.

### Which passage was most useful for reconciliation?

Passage C, which explicitly compared the same 7B models before and after instruction tuning. All models correctly recognized this as the key to explaining passages A and B. Sonnet and Opus identified it as the 'clarifying framework' upfront.

### Did smaller models (Haiku) struggle with multi-source synthesis?

Haiku performed well overall, citing all three passages and reaching the correct conclusion. However, it was slightly less explicit about the instruction-tuning mechanism and offered less nuanced discussion of why base models fail with CoT reasoning.

### What was the cost-per-token difference between models?

Haiku cost $0.00152 for 480 tokens. Sonnet cost $0.00503 for 511 tokens (3.3x Haiku's total cost, or about 3.1x per token). Opus cost $0.0271 for 607 tokens (17.8x Haiku's total cost, or about 14.1x per token). Sonnet delivered the most precise synthesis at moderate cost.

### Did latency matter for this task?

Haiku was fastest at 3.5 seconds; Sonnet took 8.6 seconds; Opus took 6.2 seconds. For a single retrieval-synthesis task, latency differences are negligible. Accuracy and cost-efficiency were the primary differentiators.
