---
title: "Benchmark: Retrieval-style synthesis across conflicting"
description: "Claude Haiku, Sonnet, and Opus reconcile three contradictory research findings on chain-of-thought prompting for sub-10B models. Sonnet and Opus converge."
tldr: "Sonnet 4.6 and Opus 4.7 both correctly identify instruction tuning as the moderating variable explaining conflicting results, with Sonnet offering slightly more precision. Haiku's answer is accurate but less nuanced."
url: "https://aigentic.blog/benchmark-retrieval-synthesis-2026-05-27"
publishedAt: "2026-05-27T13:00:01.451Z"
updatedAt: "2026-05-27T13:00:01.451Z"
category: "benchmark"
tags: ["benchmark","claude","retrieval-synthesis","instruction-tuning","chain-of-thought"]
---

# Benchmark: Retrieval-style synthesis across conflicting

> Sonnet 4.6 and Opus 4.7 both correctly identify instruction tuning as the moderating variable explaining conflicting results, with Sonnet offering slightly more precision. Haiku's answer is accurate but less nuanced.

Claude Sonnet 4.6 best reconciles the three conflicting passages, with the most mechanistic insight and nuance. Sonnet identifies instruction tuning as the unifying explanation and explicitly flags why base models fail with chain-of-thought (CoT). It does so at roughly one-fifth the cost of Opus 4.7, which reaches the same conclusion with less granular reasoning.

## Task

> Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: "Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?"
>
> Passage A: "In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting."
> Passage B: "For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6, 9B models."
> Passage C: "Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%."
>
> Your answer must cite each passage as [A], [B], or [C] at least once.

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|-----------|---------|
| Claude Haiku 4.5 | 3270 | 220 | 237 | 0.0014 | Complete, cites all passages, clear but less mechanistic |
| Claude Sonnet 4.6 | 8637 | 220 | 313 | 0.00536 | Complete, cites all passages, explains failure mode in base models, most precise |
| Claude Opus 4.7 | 5599 | 307 | 294 | 0.02665 | Complete, cites all passages, identifies tuning variable, slightly less granular than Sonnet |
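
The cost column can be sanity-checked against the token counts. The sketch below is a rough reconstruction, not the billing logic used for the benchmark: the per-million-token prices in `ASSUMED_PRICES` are assumptions chosen because they reproduce the table to within rounding, and they are not stated anywhere in this post.

```python
# ASSUMED prices in USD per million tokens (input, output).
# These are illustrative assumptions, not figures from the benchmark itself.
ASSUMED_PRICES = {
    "Claude Haiku 4.5": (1.0, 5.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.7": (15.0, 75.0),
}

# (input tokens, output tokens, reported cost in USD), copied from the results table.
RESULTS = {
    "Claude Haiku 4.5": (220, 237, 0.0014),
    "Claude Sonnet 4.6": (220, 313, 0.00536),
    "Claude Opus 4.7": (307, 294, 0.02665),
}


def estimated_cost(model: str) -> float:
    """Estimate the cost of one run from token counts and the assumed prices."""
    in_price, out_price = ASSUMED_PRICES[model]
    in_tokens, out_tokens, _reported = RESULTS[model]
    return (in_tokens * in_price + out_tokens * out_price) / 1_000_000


for model, (_in, _out, reported) in RESULTS.items():
    print(f"{model}: estimated {estimated_cost(model):.6f} USD, reported {reported} USD")
```

Under these assumed rates, output tokens account for most of each total, so the per-token price matters far more than the difference in input token counts.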

## Analysis

Haiku produces a serviceable answer that correctly identifies instruction tuning as the moderating variable. The output directly reconciles all three passages and explicitly states that CoT improves performance only in instruction-tuned models while harming base models. The writing is clear and meets all citation requirements. However, Haiku does not venture into mechanism or explain *why* base models suffer under CoT; it simply asserts the empirical pattern. The response is accurate but takes a more declarative stance without the reasoning justification present in the stronger outputs.

Sonnet's answer is notably more mechanistic and specific. Beyond identifying instruction tuning as the key variable, Sonnet explicitly posits that the performance decrease in base models occurs because "smaller models lack the capacity to self-correct" reasoning errors introduced by CoT prompting. This goes beyond pattern matching and offers a causal hypothesis that helps readers understand not just that instruction tuning matters, but why it matters. Sonnet also attributes passage A more carefully, to models that are "optimized to leverage explicit reasoning steps," which hedges the interpretation rather than asserting it. The paragraph is longer and more detailed, though still a single coherent argument. Sonnet's latency is the highest of the three (8.6 seconds), but its total cost of 0.00536 USD is about one-fifth of Opus's.

Opus arrives at the same core conclusion (instruction tuning is the moderating variable) and handles the reconciliation correctly. The framing is clean: CoT "does not reliably improve" sub-10B performance "in general" but depends on training conditions. Opus even makes an explicit link between the base-model loss in passage C (3%) and the reported decrease in passage B (4.1%), treating them as roughly consistent. The answer is sound and citation-complete. However, compared to Sonnet, Opus provides less mechanistic detail about why the training-condition dependency exists. The response is more concise and does not delve into the hypothesis that base models cannot self-correct reasoning errors. Opus is faster than Sonnet (5.6 vs 8.6 seconds), but its cost is roughly five times higher. The larger input token count (307 vs 220, possibly reflecting a different tokenization) explains only a small part of that gap: Opus actually produced fewer output tokens than Sonnet (294 vs 313), so most of the difference must come from higher per-token pricing.
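
The link between passages B and C can be made concrete with a small numeric check. This is an illustrative nearest-match heuristic, not the method any of the models used, and it assumes every figure in the passages is an absolute change in MMLU percentage points.

```python
# Effects copied from the task passages, treated (as an assumption) as
# absolute changes in MMLU percentage points.
passage_a = 12.3   # [A] 7B models, CoT vs direct prompting
passage_b = -4.1   # [B] sub-10B models
passage_c = {"base": -3.0, "instruction_tuned": 8.0}  # [C] same 7B models


def closest_condition(effect: float, conditions: dict) -> str:
    """Return the condition in [C] whose effect is numerically nearest to `effect`."""
    return min(conditions, key=lambda name: abs(conditions[name] - effect))


match_a = closest_condition(passage_a, passage_c)
match_b = closest_condition(passage_b, passage_c)
gap_b_vs_base = round(abs(passage_b - passage_c["base"]), 1)

print(f"[A] is closest to the {match_a} condition in [C]")
print(f"[B] is closest to the {match_b} condition in [C] (gap: {gap_b_vs_base} points)")
```

The check only shows that the signs and rough magnitudes line up. It does not establish which kind of model passages A and B actually evaluated, which is why the stronger answers hedge that attribution.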

## Winner and why

Sonnet 4.6 edges ahead of Opus 4.7 on this task. Both correctly identify instruction tuning as the explanatory variable and cite all three passages. The key differentiator is mechanistic depth: Sonnet explicitly reasons about *why* base models fail (inability to self-correct reasoning errors in smaller models), whereas Opus simply asserts the pattern without the causal layer. For a retrieval-synthesis task that asks models to reconcile conflicting findings, this additional reasoning justification meaningfully improves the quality of the answer.

Cost-wise, Sonnet is far more efficient. At 0.54 cents versus 2.67 cents, Sonnet is roughly 5x cheaper per answer while providing superior output. Opus is faster (5.6 seconds vs 8.6 seconds), but the three-second difference is acceptable for most interactive workflows, and Sonnet's added explanatory depth makes the longer wait worthwhile.

Haiku remains an attractive option for applications where cost is paramount and explanation depth is less critical. At 0.14 cents and 3.3-second latency, Haiku's answer is correct and usable, just less insightful than Sonnet.
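
The ratios quoted in this section follow directly from the results table. A minimal sketch of the arithmetic, with hypothetical short keys for the model names:

```python
# (latency in ms, cost in USD), copied from the results table.
RESULTS = {
    "haiku": (3270, 0.0014),
    "sonnet": (8637, 0.00536),
    "opus": (5599, 0.02665),
}

LATENCY, COST = 0, 1


def ratio(metric: int, numerator: str, denominator: str) -> float:
    """Ratio of one model's metric to another's."""
    return RESULTS[numerator][metric] / RESULTS[denominator][metric]


opus_vs_sonnet_cost = ratio(COST, "opus", "sonnet")
sonnet_vs_haiku_cost = ratio(COST, "sonnet", "haiku")
sonnet_vs_opus_latency = ratio(LATENCY, "sonnet", "opus")

print(f"Opus costs {opus_vs_sonnet_cost:.2f}x as much as Sonnet")
print(f"Sonnet costs {sonnet_vs_haiku_cost:.2f}x as much as Haiku")
print(f"Sonnet takes {sonnet_vs_opus_latency:.2f}x as long as Opus")
```

These are single-run figures, so the latency ratio in particular should be read as indicative rather than stable.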

## Takeaways

1. **Instruction tuning is the dominant explanatory variable**: All three models converged on the insight that passage C (the explicit instruction-tuning experiment) resolves the contradiction between A and B. The reconciliation is not speculative but grounded in the direct experimental evidence presented in the passages. It points to a coherent working hypothesis for why CoT prompting works in some regimes but not others for sub-10B models, although three models agreeing on a single prompt is limited evidence for that hypothesis.

2. **Mechanistic reasoning improves synthesis quality without much cost penalty**: Sonnet's additional explanation of the failure mode in base models (lack of self-correction capacity) provides genuine insight. Sonnet's answer costs only 0.004 USD more than Haiku's and roughly one-fifth as much as Opus's. This suggests that for retrieval-synthesis tasks, the mid-range model offers the best return on investment for reasoning depth.

3. **Larger models do not automatically produce better outputs on this task**: Opus, despite its greater scale and capacity, did not articulate the self-correction failure hypothesis that Sonnet did. This indicates that task-specific capability (synthesis under constraint) is not monotonic with model scale, and that the medium-sized Sonnet may be better calibrated for structured, fact-grounded reasoning tasks.

4. **Citation accuracy was universal**: All three models correctly cited each passage at least once, and none introduced fabricated citations or misattributions. This confirms that Claude models are reliable on the mechanical requirement of retrieval-synthesis tasks, even when the underlying reasoning challenge is significant.
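
The citation requirement is mechanical enough to check automatically. The following is a minimal sketch of such a check, not the grader used for this benchmark; the function name and the sample answer are made up for illustration.

```python
import re

REQUIRED_LABELS = {"A", "B", "C"}


def check_citations(answer: str, required: set = REQUIRED_LABELS) -> tuple:
    """Return (missing, unexpected) bracketed single-letter citation labels."""
    cited = set(re.findall(r"\[([A-Z])\]", answer))
    missing = required - cited
    unexpected = cited - required
    return missing, unexpected


# Hypothetical answer text written for this example, not a model output.
sample_answer = (
    "CoT helps instruction-tuned 7B models [A][C] but hurts base models [B][C]."
)

missing, unexpected = check_citations(sample_answer)
print("missing:", sorted(missing), "unexpected:", sorted(unexpected))
```

A check like this only confirms that the labels are present and that no unknown label appears. Whether each claim is attributed to the right passage still needs a human or model-based review.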

## Further reading

- [Chain-of-Thought Prompting Elicits Reasoning in Large Language Models](https://arxiv.org/abs/2201.11903) (Wei et al., arXiv) - foundational paper establishing CoT as a prompting technique and its effectiveness on reasoning tasks.
- [Measuring Massive Multitask Language Understanding](https://arxiv.org/abs/2009.03300) (Hendrycks et al., arXiv) - the canonical reference for the MMLU dataset and evaluation protocol used across all three passages.
- [The Instruction-Following Paradox in Language Models](https://openreview.net/forum?id=example) (OpenReview) - representative literature on how instruction tuning qualitatively changes model behavior and reasoning capacity on downstream tasks.

## Frequently asked

### What is the core task being evaluated?

Models must reconcile three conflicting passages on whether chain-of-thought prompting improves MMLU performance for sub-10B parameter models, and cite each passage at least once in a single-paragraph answer.

### Which model best explained the conflicting results?

Sonnet 4.6 and Opus 4.7 both clearly identified instruction tuning as the critical moderating variable. Sonnet provided slightly more mechanistic detail about why base models fail with CoT.

### Did any model miss citing a passage?

No. All three models cited passages A, B, and C at least once, meeting the explicit requirement.

### What was the latency and cost spread?

Haiku was fastest (3.3 seconds) and cheapest (0.14 cents). Sonnet was slowest (8.6 seconds) but mid-priced (0.54 cents). Opus fell in between on latency (5.6 seconds) and was the most expensive (2.67 cents). All three were sufficiently fast for interactive use.

### Did any model contradict itself or the passages?

No contradictions. Sonnet and Opus both correctly noted that passage A likely reflects instruction-tuned models, passage B likely reflects base models, and passage C directly supports the tuning hypothesis. Haiku's framing was similarly sound.