Why did the models disagree on whether CoT helps sub-10B models?

They didn't disagree. All three correctly identified that instruction tuning is the critical variable. The apparent conflict in the source passages (12.3% gain vs 4.1% loss) stems from different model types: instruction-tuned models benefit, base models degrade. Sonnet most clearly articulated this distinction.

Which passage was most useful for reconciliation?

Passage C, which explicitly compared base and instruction-tuned versions of the same 7B model. All models correctly recognized this as the key to explaining passages A and B. Sonnet and Opus identified it as the 'clarifying framework' upfront.

Did smaller models (Haiku) struggle with multi-source synthesis?

Haiku performed well overall, citing all three passages and reaching the correct conclusion. However, it was slightly less explicit about the instruction-tuning mechanism and offered less nuanced discussion of why base models fail with CoT reasoning.

What was the cost-per-token difference between models?

Haiku cost $0.00152 for 480 tokens. Sonnet cost $0.00503 for 511 tokens, about 3.3x Haiku's total cost (roughly 3.1x per token). Opus cost $0.0271 for 607 tokens, about 17.8x Haiku's total cost (roughly 14.1x per token). Sonnet delivered the most precise answer at moderate cost.
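The ratios can be checked directly from the totals quoted above. The sketch below is a minimal check, assuming "per token" means a blended rate (total cost divided by input plus output tokens); the variable and function names are illustrative only.

```python
# Figures from the answer above: (total cost in USD, input + output tokens).
runs = {
    "Haiku": (0.00152, 480),
    "Sonnet": (0.00503, 511),
    "Opus": (0.0271, 607),
}


def per_token(cost_usd, tokens):
    """Blended cost per token: total cost over all tokens (an assumption)."""
    return cost_usd / tokens


base_cost, base_tokens = runs["Haiku"]
ratios = {}
for name, (cost, tokens) in runs.items():
    ratios[name] = {
        # Ratio of total run cost to Haiku's total run cost.
        "total": cost / base_cost,
        # Ratio of blended per-token cost to Haiku's blended per-token cost.
        "per_token": per_token(cost, tokens) / per_token(base_cost, base_tokens),
    }
    print(f"{name}: {ratios[name]['total']:.1f}x total, "
          f"{ratios[name]['per_token']:.1f}x per token")
```

The two ratios differ because the runs used different token counts, so a total-cost multiple should not be read as a per-token multiple.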

Did latency matter for this task?

Haiku was fastest at 3.5 seconds; Sonnet took 8.6 seconds; Opus took 6.2 seconds. For a single retrieval-synthesis task, latency differences are negligible. Accuracy and cost-efficiency were the primary differentiators.

Benchmark: Retrieval-synthesis reconciliation

Claude Sonnet 4.6 was the strongest performer on this retrieval-synthesis benchmark, delivering the most logically rigorous and precise reconciliation of conflicting passages while offering the best quality-to-cost ratio (Haiku was cheaper in absolute terms). All three models cited the required passages and reached defensible conclusions, but Sonnet’s explicit identification of instruction tuning as a critical moderating variable and its clear structural argument separated it from its peers.

Task

Three research papers (excerpts below) report different results on the same phenomenon. Reconcile their findings into a single one-paragraph answer to the question: “Does chain-of-thought prompting improve performance on MMLU for models smaller than 10B parameters?”

Passage A: “In our evaluation, 7B-parameter models showed a 12.3% absolute improvement on MMLU with chain-of-thought prompting versus direct prompting.”

Passage B: “For models below 10B parameters, chain-of-thought often introduces errors: we measured a 4.1% performance DECREASE on MMLU in 6, 9B models.”

Passage C: “Chain-of-thought effectiveness depends sharply on instruction tuning. Base 7B models lost 3% on MMLU; the same models after instruction-tuning gained 8%.”

Your answer must cite each passage as [A], [B], or [C] at least once.
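The format requirements of the task (one paragraph, each passage cited at least once) lend themselves to an automated check. The following is a minimal sketch, not the grader used for this benchmark; the function names and the sample answer are hypothetical, and it only recognizes single-letter citations written exactly as `[A]`, `[B]`, or `[C]`.

```python
import re

REQUIRED = ("A", "B", "C")


def missing_citations(answer, required=REQUIRED):
    """Return the required passage labels that never appear as [X]."""
    found = set(re.findall(r"\[([A-Z])\]", answer))
    return [label for label in required if label not in found]


def is_single_paragraph(answer):
    """True if the answer has exactly one block of text (no blank-line breaks)."""
    blocks = [b for b in re.split(r"\n\s*\n", answer.strip()) if b.strip()]
    return len(blocks) == 1


# Hypothetical answer, used only to exercise the checks.
sample = (
    "Chain-of-thought helps instruction-tuned 7B models [A][C] "
    "but hurts base models of similar size [B][C]."
)
print(missing_citations(sample), is_single_paragraph(sample))
```

A check like this would flag an answer that opens with a separate heading block, since the heading and the body count as two blocks.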

Results

Model	Latency (ms)	Input tokens	Output tokens	Cost (USD)	Verdict
Claude Haiku 4.5	3,499	220	260	$0.00152	Complete, accurate, adequate
Claude Sonnet 4.6	8,609	220	291	$0.00503	Complete, precise, superior structure
Claude Opus 4.7	6,213	307	300	$0.0271	Complete, accurate, inefficient
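The derived figures used in the discussion (total tokens, latency in seconds) follow mechanically from the table. The sketch below parses the rows as tab-separated text and adds a rough output-tokens-per-second figure; that throughput number is an assumption of this example, since the reported latency presumably includes time before the first token. The helper name `load_rows` is illustrative.

```python
import csv
import io

# Rows copied from the results table (tab-separated, verdict column omitted).
TABLE = (
    "Model\tLatency (ms)\tInput tokens\tOutput tokens\tCost (USD)\n"
    "Claude Haiku 4.5\t3,499\t220\t260\t$0.00152\n"
    "Claude Sonnet 4.6\t8,609\t220\t291\t$0.00503\n"
    "Claude Opus 4.7\t6,213\t307\t300\t$0.0271\n"
)


def load_rows(text):
    """Parse the table and attach derived metrics to each row."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        latency_s = int(rec["Latency (ms)"].replace(",", "")) / 1000
        tokens_in = int(rec["Input tokens"])
        tokens_out = int(rec["Output tokens"])
        rows.append({
            "model": rec["Model"],
            "latency_s": latency_s,
            "total_tokens": tokens_in + tokens_out,
            # Rough throughput: output tokens over end-to-end latency.
            "out_tokens_per_s": tokens_out / latency_s,
            "cost_usd": float(rec["Cost (USD)"].lstrip("$")),
        })
    return rows


rows = load_rows(TABLE)
for r in rows:
    print(f"{r['model']}: {r['latency_s']:.1f}s, {r['total_tokens']} tokens, "
          f"{r['out_tokens_per_s']:.0f} output tokens/s, ${r['cost_usd']}")
```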

Analysis

Haiku 4.5 delivered a functionally correct answer. It cited all three passages, correctly identified instruction tuning as the reconciling factor, and reasoned that passage A likely described instruction-tuned models while passage B likely described base models. The logic is sound: “base models of this size actually perform worse with chain-of-thought reasoning, suggesting these smaller models lack sufficient capability to reliably execute multi-step reasoning without explicit instruction-following training.” However, the framing is slightly less precise than competitors. It opens with a full heading (“Reconciling Chain-of-Thought Effects on Sub-10B Models”) when the task explicitly requested a single paragraph. The conclusion also hedges the answer somewhat by saying CoT “does improve MMLU performance for sub-10B models, but only when…” rather than more clearly stating the conditionality from the start. The output was efficient in tokens and cost, reflecting Haiku’s typical profile.

Sonnet 4.6 produced the strongest synthesis. It explicitly labeled the moderating variables upfront (“model type and training regime are critical moderating variables”) and immediately invoked passage C as the “most clarifying framework,” creating a clear logical hierarchy. The analysis distinguishes two cases: base models, which are degraded by “added reasoning demands,” and instruction-tuned models, which are “more likely to benefit.” Critically, Sonnet provided a nuanced negative answer to the direct question: CoT does “not reliably improve MMLU performance for all sub-10B models as a class,” which is more precise than a simple yes or no. The paragraph is denser and more structured than Haiku’s, moving from the clarifying framework to resolving the A/B tension to a unified conclusion. Latency was the highest of the three (8.6 seconds), but cost remained reasonable: $0.00503 in total, or roughly $0.0000173 per output token.

Opus 4.7 also delivered a correct answer with logic similar to Sonnet’s, identifying instruction tuning as the decisive factor and reasoning through why passages A and B conflict. It offers slightly more concise language (“This suggests that…”) and provides a clean summary: “CoT does not reliably improve MMLU for models under 10B parameters on its own.” However, the Opus run was billed for 307 input tokens (versus 220 for both Haiku and Sonnet), which adds to its cost; input token count depends on the prompt and how it is tokenized, so it says nothing about answer quality. More critically, its total cost was $0.0271, approximately 5.4x Sonnet’s and 17.8x Haiku’s, without a corresponding quality gain over Sonnet. At 6.2 seconds it was faster than Sonnet but slower than Haiku, so it offers no speed advantage over the cheapest model.

Winner and why

Sonnet 4.6 is the clear winner. It balanced correctness, clarity, and cost-efficiency more effectively than both alternatives. Compared to Haiku, Sonnet delivered a more sophisticated argument structure, correctly signaling the hierarchy of evidence (passage C as clarifying framework) and providing a more precise conditional answer to the core question. Compared to Opus, Sonnet achieved equal accuracy while costing one-fifth as much and consuming fewer input tokens. For retrieval-synthesis tasks where multiple passages must be reconciled into a coherent answer, Sonnet’s ability to frame the problem clearly and identify the key variable upfront is a significant advantage. The cost difference is especially pronounced: Sonnet’s estimated cost per correct answer was $0.00503, versus Opus at $0.0271, a 5.4x premium for equivalent or slightly lower clarity. In practical deployment, Sonnet represents the optimal tradeoff point between model capacity (sufficient for multi-source reasoning) and resource efficiency.

Takeaways

Instruction tuning emerged as the critical moderating variable across all three models’ reasoning. This is not a weakness in the models’ synthesis; rather, it reflects the actual structure of the evidence. Passage C provided the interpretive key that unlocked the apparent contradiction between passages A and B. Models that identified this framework early and prominently (Sonnet and Opus) produced structurally superior syntheses.

Cost-per-correct-answer varies sharply across model tiers. Haiku at $0.00152 was cheapest but offered adequate rather than superior synthesis; Sonnet at $0.00503 delivered the best quality-to-cost ratio; Opus at $0.0271 crossed into inefficiency for this task tier. For retrieval-synthesis work at scale, Sonnet’s middle ground is more defensible than pushing to the largest model without a clear accuracy gain.
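To make the "at scale" point concrete, cost per correct answer can be written as total spend divided by the number of correct answers. The sketch below assumes every run costs the same as the single observed run and that each model's one answer counted as correct; the query volume is a hypothetical figure chosen for illustration, not a measurement.

```python
def cost_per_correct(cost_per_run_usd, runs, correct):
    """Total spend divided by the number of correct answers."""
    if correct == 0:
        raise ValueError("no correct answers, cost per correct answer is undefined")
    return cost_per_run_usd * runs / correct


# Observed single-run costs from the results table.
observed = {"Haiku 4.5": 0.00152, "Sonnet 4.6": 0.00503, "Opus 4.7": 0.0271}

# One run per model, one correct answer each, so this equals the run cost.
per_correct = {m: cost_per_correct(c, runs=1, correct=1) for m, c in observed.items()}

# Hypothetical volume, assuming constant cost per run.
HYPOTHETICAL_QUERIES = 100_000
projected = {m: round(c * HYPOTHETICAL_QUERIES, 2) for m, c in observed.items()}

for model, total in projected.items():
    print(f"{model}: ${per_correct[model]} per correct answer, "
          f"${total:,.2f} for {HYPOTHETICAL_QUERIES:,} queries")
```

With a single run per model this metric cannot separate the tiers on accuracy; it would only become informative with repeated runs and a nonzero error rate.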

Latency differences were large in relative terms (Haiku: 3.5s, Sonnet: 8.6s, Opus: 6.2s, so Sonnet was about 2.5x slower than Haiku) but negligible for end-user experience on a single task. Sonnet’s slower inference is not a practical constraint for synthesis work where quality matters more than saving a few seconds.

The ability to recognize and name moderating variables (Sonnet: “model type and training regime are critical moderating variables”) is a subtle but measurable quality differentiator. This phrasing signals deeper reasoning about causation versus mere correlation, which is valuable for tasks requiring evidence synthesis in domains with hidden confounders.
