AIgentic

Agentic Systems & LLM Tooling Daily


Benchmark: Structured extraction from messy prose

Claude Sonnet 4.6 and Opus 4.7 both achieve perfect accuracy on schema extraction with contextual inference; Haiku 4.5 misses one inference. Sonnet 4.6 costs 73% less than Opus 4.7 while matching its correctness.


Claude Sonnet 4.6 and Opus 4.7 both achieved perfect accuracy on schema extraction with contextual inference, while Haiku 4.5 extracted all four products but failed to infer one release year from context. Sonnet 4.6 is the best value: it matches Opus’s correctness at roughly a quarter of the cost (73% less) and explains its reasoning.

Task

Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: “laptop”, “phone”, “tablet”, “wearable”, “other”). Do not include anything that isn’t a specific product.

Text: “Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed.”
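To make the schema concrete, here is a minimal Python sketch of a validator for one entry of the requested JSON array. It is not part of the benchmark harness; the names `validate_entry`, `ALLOWED_CATEGORIES`, and `REQUIRED_KEYS` are hypothetical, and the checks only encode the field rules stated in the task.

```python
from typing import Any

# Field rules taken from the task statement; the names here are illustrative.
ALLOWED_CATEGORIES = {"laptop", "phone", "tablet", "wearable", "other"}
REQUIRED_KEYS = {"name", "price_usd", "release_year", "category"}


def validate_entry(entry: dict[str, Any]) -> list[str]:
    """Return a list of schema violations for one product entry (empty if valid)."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    errors = []
    if not isinstance(entry["name"], str):
        errors.append("name must be a string")

    price = entry["price_usd"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if price is not None and (isinstance(price, bool) or not isinstance(price, (int, float))):
        errors.append("price_usd must be a number or null")

    year = entry["release_year"]
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        errors.append("release_year must be an int or null")

    category = entry["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        errors.append(f"category must be one of {sorted(ALLOWED_CATEGORIES)}")

    return errors
```

A check like this only catches malformed output. It cannot tell whether a null release_year should have been inferred, which is the distinction the benchmark turns on.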

Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 1,485 | 206 | 196 | $0.00119 | Complete extraction, missed one inference |
| Claude Sonnet 4.6 | 5,505 | 206 | 330 | $0.00557 | Perfect accuracy, explicit reasoning |
| Claude Opus 4.7 | 4,224 | 285 | 222 | $0.02093 | Perfect accuracy, minimal explanation |
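The relative costs can be recomputed directly from the Cost (USD) column. The short Python sketch below does that arithmetic; the helper names `percent_cheaper` and `cost_multiple` are hypothetical, and the only inputs are the three cost figures reported above.

```python
# Per-run costs in USD, copied from the results table.
COSTS = {"haiku": 0.00119, "sonnet": 0.00557, "opus": 0.02093}


def percent_cheaper(cheaper: float, pricier: float) -> float:
    """Percentage by which `cheaper` undercuts `pricier`."""
    return (1 - cheaper / pricier) * 100


def cost_multiple(pricier: float, cheaper: float) -> float:
    """How many times `pricier` costs relative to `cheaper`."""
    return pricier / cheaper


print(f"Sonnet vs Opus:  {percent_cheaper(COSTS['sonnet'], COSTS['opus']):.1f}% cheaper")   # 73.4% cheaper
print(f"Haiku vs Sonnet: {percent_cheaper(COSTS['haiku'], COSTS['sonnet']):.1f}% cheaper")  # 78.6% cheaper
print(f"Opus / Sonnet:   {cost_multiple(COSTS['opus'], COSTS['sonnet']):.2f}x")             # 3.76x
```

These are single-run figures, so the ratios describe this one prompt rather than the models' pricing in general.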

Analysis

Claude Haiku 4.5 identified all four confirmed products with the right names, prices, and categories: Framework 16 (laptop, $2,099), iPad Mini (tablet, $500, 2022), Apple Watch Ultra 2 (wearable, $799), and Pixel 9 Pro (phone, $999). However, it left the Pixel 9 Pro’s release_year field as null. The text says the phone “came out last year,” which is enough to infer a 2024 release, provided the text’s present is taken to be 2025. Haiku did correctly exclude the foldable Surface device, recognizing it as an unconfirmed rumor rather than a specific product. The missed inference is the model’s only substantive weakness here: it trades some inference capability for speed and cost.

Claude Sonnet 4.6 returned the same four products with the same names, prices, and categories, and it also inferred a release_year of 2024 for the Pixel 9 Pro. Beyond correctness, Sonnet 4.6 added explicit notes explaining its decisions: why the Surface rumor was excluded, how the year was derived from context, and why Framework 16’s release year remained null (no reliable evidence in the text). This transparency is valuable for debugging and auditing. The response was more verbose (330 output tokens versus Haiku’s 196), but the additional tokens were explanatory text rather than redundant output.

Claude Opus 4.7 matched Sonnet 4.6’s extraction accuracy, including the correct 2024 inference for the Pixel 9 Pro. It also excluded the rumor and handled null fields appropriately. However, Opus returned only the JSON array, with no explanation or reasoning. It also reported 285 input tokens for the same task, against 206 for Haiku and Sonnet; the benchmark does not record why, though a difference in how the model tokenizes the prompt is one plausible explanation. Its total cost was far higher: $0.02093 versus $0.00557 for Sonnet. On this narrow task, Opus’s additional capability brought no marginal benefit: all three models avoided hallucination and false positives, and both Sonnet and Opus reached the correct answer.
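The comparison described in this analysis can be sketched as a field-by-field diff against a reference answer. The Python below is an illustration, not the benchmark's actual grader. `REFERENCE` is reconstructed from the analysis above, and the null release year for the Apple Watch Ultra 2 is an assumption, since the article does not state that value explicitly. The function name `diff_against_reference` is hypothetical, and matching on exact product names is a simplification that would need normalizing in practice.

```python
FIELDS = ("price_usd", "release_year", "category")

# Reference answer reconstructed from the analysis (assumption: the Apple Watch
# Ultra 2 has no release year in the text, so it is scored as null here).
REFERENCE = [
    {"name": "Framework 16", "price_usd": 2099, "release_year": None, "category": "laptop"},
    {"name": "iPad Mini", "price_usd": 500, "release_year": 2022, "category": "tablet"},
    {"name": "Apple Watch Ultra 2", "price_usd": 799, "release_year": None, "category": "wearable"},
    {"name": "Pixel 9 Pro", "price_usd": 999, "release_year": 2024, "category": "phone"},
]


def diff_against_reference(predicted, reference=REFERENCE):
    """Return a sorted list of differences; an empty list means a perfect match."""
    pred = {p["name"]: p for p in predicted}
    ref = {r["name"]: r for r in reference}
    issues = []
    for name in ref.keys() - pred.keys():
        issues.append(f"missing product: {name}")
    for name in pred.keys() - ref.keys():
        issues.append(f"unexpected product: {name}")
    for name in ref.keys() & pred.keys():
        for field in FIELDS:
            expected, got = ref[name][field], pred[name].get(field)
            if expected != got:
                issues.append(f"{name}.{field}: expected {expected!r}, got {got!r}")
    return sorted(issues)


# An output shaped like the one attributed to Haiku: correct except for one null year.
haiku_like = [dict(entry) for entry in REFERENCE]
haiku_like[3]["release_year"] = None
print(diff_against_reference(haiku_like))
# ['Pixel 9 Pro.release_year: expected 2024, got None']
```

Including the rumored Surface device in a prediction would surface as an "unexpected product" entry, which is how a false positive would show up under this scheme.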

Winner and why

Claude Sonnet 4.6 is the clear winner for this task. It achieved perfect accuracy (matching Opus), correctly inferred the Pixel 9 Pro’s 2024 release year from context, and excluded the unconfirmed rumor with no errors. Sonnet’s explicit reasoning notes provided transparency that neither Haiku nor Opus offered, making the output auditable. At $0.00557, Sonnet costs 73% less than Opus ($0.02093) while producing identical correctness. Haiku’s failure to infer the release year from “came out last year” disqualifies it, even though it costs about 79% less than Sonnet ($0.00119 versus $0.00557); a single missed inference is material in a schema-extraction task where precision is the primary criterion. Sonnet strikes the best balance between cost, accuracy, and interpretability.

Takeaways

Contextual inference matters in schema extraction. Haiku’s null for Pixel 9 Pro’s release year reveals a gap in implicit reasoning. The phrase “came out last year” is unambiguous enough to infer 2024 given a 2025 context, yet Haiku declined to make this leap. For real-world messy data, this kind of inference is routine, making it a meaningful differentiator.
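The inference itself is simple arithmetic once a reference year is fixed. As a minimal Python sketch, assuming the text's present is 2025 as the analysis does, a relative phrase can be resolved like this. The function `resolve_relative_year` and its small phrase table are hypothetical and cover only a few fixed expressions, not general date parsing.

```python
from typing import Optional

# Offsets for a handful of relative expressions; illustrative, not exhaustive.
RELATIVE_YEAR_OFFSETS = {"last year": -1, "this year": 0, "next year": 1}


def resolve_relative_year(phrase: str, reference_year: int) -> Optional[int]:
    """Map a phrase containing a known relative expression to a calendar year.

    Returns None when no known expression is present, mirroring the schema's
    null for a release year that cannot be determined.
    """
    lowered = phrase.lower()
    for expression, offset in RELATIVE_YEAR_OFFSETS.items():
        if expression in lowered:
            return reference_year + offset
    return None


# Assumption: 2025 is the text's implied present, as in the analysis.
print(resolve_relative_year("which came out last year", reference_year=2025))  # 2024
```

The hard part for a model is not the subtraction but deciding that the reference year is safe to assume, which is where Haiku held back.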

Transparency in reasoning is underrated. Sonnet 4.6’s decision notes (explaining why the rumor was excluded, how the year was derived, and why Framework 16’s year stayed null) took only 134 more output tokens than Haiku’s response (330 versus 196) but provided a clear audit trail. In production systems, this explainability can prevent downstream errors and reduce review friction.

Cost scales nonlinearly with capability; value does not. Opus costs about 3.76 times as much as Sonnet but delivers identical correctness. For schema extraction, Haiku’s lower cost doesn’t offset its inference miss. The sweet spot is Sonnet: capable enough to handle context, transparent enough to debug, and cheap enough to run at scale.

Unconfirmed products are correctly filtered. All three models excluded the rumored foldable Surface device, showing that each understood the task boundary on this input. This is a positive signal for their ability to distinguish signal from noise in messy prose.


Frequently asked

Which model correctly inferred the Pixel 9 Pro's release year?

Claude Sonnet 4.6 and Opus 4.7 both correctly inferred 2024 as the release year based on the phrase 'came out last year' relative to the text's implied present of 2025. Claude Haiku 4.5 left this field null, missing the contextual inference required.

Did any model include the unconfirmed foldable Surface device?

No. All three models correctly excluded the foldable Surface device mentioned as a 2026 rumor. The task says not to include anything that isn’t a specific product, and an unconfirmed rumor does not qualify.

How does cost scale with accuracy in this task?

Sonnet 4.6 achieves perfect accuracy at $0.00557, while Opus 4.7 also achieves perfect accuracy at $0.02093 (about 3.76 times the cost). Haiku 4.5 costs $0.00119 but misses one contextual inference, making it the least accurate despite being the cheapest.

What is the quality difference in reasoning between models?

Sonnet 4.6 provided explicit notes explaining its decision-making (why the rumor was excluded, how the year was inferred). Haiku and Opus delivered JSON without explanation, making it harder to verify confidence or understand edge-case reasoning.
