Benchmark: Structured extraction from messy prose
Sonnet 4.6 and Opus 4.7 both correctly inferred the Pixel 9 Pro's 2024 release year from context clues; Haiku 4.5 did not. Sonnet 4.6 matched Opus 4.7's accuracy at roughly 3.7x lower cost, though it responded about 3.5x more slowly in this run.
The task was to extract product metadata from conversational prose into structured JSON, handling implicit information and filtering out unconfirmed rumors. Sonnet 4.6 and Opus 4.7 both performed correctly; Haiku 4.5 missed one critical inference. That makes Sonnet 4.6 the best choice: it reaches accuracy parity with Opus at roughly a quarter of the cost.
Task
Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: “laptop”, “phone”, “tablet”, “wearable”, “other”). Do not include anything that isn’t a specific product.
Text: “Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed.”
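The field rules in the task prompt can be checked mechanically before comparing model outputs. The following is a minimal sketch, not the harness used for this benchmark; the function names `validate_entry` and `validate_output` are hypothetical, and it assumes each model response is a bare JSON array with no surrounding notes.

```python
import json

# Category values copied from the task prompt.
ALLOWED_CATEGORIES = {"laptop", "phone", "tablet", "wearable", "other"}


def validate_entry(entry):
    """Return a list of problems found in one extracted product entry."""
    problems = []

    if not isinstance(entry.get("name"), str):
        problems.append("name must be a string")

    price = entry.get("price_usd")
    # bool is a subclass of int in Python, so reject it explicitly.
    if price is not None and (isinstance(price, bool) or not isinstance(price, (int, float))):
        problems.append("price_usd must be a number or null")

    year = entry.get("release_year")
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        problems.append("release_year must be an int or null")

    category = entry.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append("category must be one of the allowed values")

    return problems


def validate_output(raw_json):
    """Parse a response and map entry index to problems (-1 for the top level)."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        return {-1: ["top-level value must be a JSON array"]}
    report = {}
    for index, entry in enumerate(data):
        if not isinstance(entry, dict):
            report[index] = ["entry must be a JSON object"]
            continue
        issues = validate_entry(entry)
        if issues:
            report[index] = issues
    return report
```

An empty report means the response is structurally valid. It says nothing about whether a null release_year should have been inferred, which is the distinction this benchmark turns on.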
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 1517 | 206 | 196 | 0.00119 | Extracted 4 products; missed Pixel release year inference |
| Claude Sonnet 4.6 | 11006 | 206 | 333 | 0.00561 | Complete and accurate; inferred Pixel 2024 from context; included reasoning notes |
| Claude Opus 4.7 | 3134 | 285 | 222 | 0.02093 | Complete and accurate; inferred Pixel 2024; silent extraction |
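The cost and latency ratios between models can be rechecked directly from the table. This short sketch copies the latency and cost columns verbatim; the `results` dictionary and the `ratio` helper are illustrative names, not part of any benchmark tooling.

```python
# Latency and cost figures copied from the results table.
results = {
    "Claude Haiku 4.5": {"latency_ms": 1517, "cost_usd": 0.00119},
    "Claude Sonnet 4.6": {"latency_ms": 11006, "cost_usd": 0.00561},
    "Claude Opus 4.7": {"latency_ms": 3134, "cost_usd": 0.02093},
}


def ratio(metric, numerator, denominator):
    """Return how many times larger `numerator` is than `denominator` on a metric."""
    return results[numerator][metric] / results[denominator][metric]


opus_vs_sonnet_cost = ratio("cost_usd", "Claude Opus 4.7", "Claude Sonnet 4.6")
sonnet_vs_opus_latency = ratio("latency_ms", "Claude Sonnet 4.6", "Claude Opus 4.7")

print(f"Opus cost is {opus_vs_sonnet_cost:.1f}x Sonnet's")        # 3.7x
print(f"Sonnet latency is {sonnet_vs_opus_latency:.1f}x Opus's")  # 3.5x
```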
Analysis
Claude Haiku 4.5 correctly identified and extracted all four confirmed products: Framework 16 (laptop, $2,099), iPad Mini (tablet, $500, 2022), Apple Watch Ultra 2 (wearable, $799), and Pixel 9 Pro (phone, $999). The categories, explicit prices, and the iPad’s stated release year were all captured accurately. However, Haiku set Pixel 9 Pro’s release_year to null, failing to infer from the phrase “which came out last year” that the device was released in 2024. The model took a conservative stance, including only facts explicitly stated rather than reasoning about temporal context. For extraction tasks involving real-world data, this is a material gap.
Claude Sonnet 4.6 produced identical output to Haiku except for one critical difference: it inferred Pixel 9 Pro’s release_year as 2024 by reasoning that if the conversation takes place in 2025 (implied by “last year”), then “last year” equals 2024. Sonnet also provided a second output block with explicit reasoning notes explaining why the foldable Surface was excluded (unconfirmed rumor), why the Pixel year was inferred (temporal logic), and why other fields were set to null (no data provided). This transparency is useful for debugging and validation, though it slightly inflates token count and latency. Sonnet’s core extraction was complete and correct.
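The temporal step described above is simple enough to write down as a deterministic rule. The sketch below is only an illustration of that reasoning, not how any of the models work internally; `resolve_relative_year` is a hypothetical helper, and the reference year of 2025 is the assumption stated in the analysis rather than something the source text gives explicitly.

```python
import re


def resolve_relative_year(phrase, reference_year):
    """Map a relative year phrase to a calendar year, or None if unrecognized."""
    text = phrase.lower()
    if "last year" in text:
        return reference_year - 1
    if "this year" in text:
        return reference_year
    if "next year" in text:
        return reference_year + 1
    match = re.search(r"(\d+) years ago", text)
    if match:
        return reference_year - int(match.group(1))
    return None


# Assumption: the conversation takes place in 2025.
print(resolve_relative_year("which came out last year", 2025))  # 2024
```

The rule only works once a reference year is fixed, which is why a model that declines to assume one returns null.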
Claude Opus 4.7 matched Sonnet’s extraction exactly, including the inferred Pixel 2024 release year. It generated output with identical structure and content but without the supplementary reasoning notes. Opus responded faster than Sonnet (3,134 ms vs. 11,006 ms, though the comparison is noisy due to network variance) but slower than Haiku (1,517 ms), and it was the most expensive at $0.02093. Its input token count was higher for the same prompt (285 vs. 206), likely due to model-specific tokenization or context formatting.
Winner and why
Claude Sonnet 4.6 is the optimal choice for this task. It matches Opus in correctness: it inferred the Pixel 9 Pro’s 2024 release year from temporal context clues, excluded the unconfirmed foldable Surface, and extracted all four products with accurate categories and prices. Sonnet achieves this for $0.00561, compared to $0.02093 for Opus (about 3.7x cheaper). The latency difference (11 seconds vs. 3 seconds) is not material for asynchronous batch extraction tasks. Sonnet’s reasoning notes add transparency without obscuring the core JSON output, making it traceable and debuggable. Haiku’s failure to infer the Pixel release year disqualifies it despite its speed and cost advantages; a downstream schema validation or analytics system expecting populated release_year fields would find incomplete data. For extraction from conversational or semi-structured text, the ability to perform lightweight inference is essential, and Sonnet delivers this at a fraction of Opus’s price.
Takeaways
- Temporal reasoning separates complete from incomplete extractions. Haiku’s conservative null for Pixel 9 Pro’s release year represents a gap in capability, not just style. Sonnet and Opus both inferred 2024 from “last year,” a simple but essential inference for handling real-world data. This suggests that smaller models may need explicit prompting to perform date logic or context-aware field population.
- Reasoning transparency improves debugging without compromising accuracy. Sonnet’s optional notes explaining why the foldable Surface was excluded and why the Pixel year was inferred cost roughly 110 extra output tokens relative to Opus (333 vs. 222) but would save significant time in a QA workflow. For production extraction pipelines, this trade-off is usually worth the small cost delta.
- Cost per correct answer is not cost per inference. Opus at $0.02093 is the most expensive model but delivered the same accuracy as Sonnet at $0.00561. For extraction workloads run at scale (thousands of documents), Sonnet’s cost advantage compounds rapidly: 1,000 extractions with Sonnet cost $5.61; the same job with Opus costs $20.93. Haiku’s lower cost is negated by its incomplete output.
- Latency variance and network conditions matter. Sonnet showed 11 seconds of latency while Opus showed 3 seconds, a gap that Sonnet’s longer output (333 vs. 222 tokens) is unlikely to explain on its own; internal retries in Sonnet’s path or network jitter are possible causes. For real-time systems, Opus might appear faster; for batch processing, both are acceptable and Sonnet’s superior cost-to-accuracy ratio dominates the decision.
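The scale arithmetic in the cost takeaway can be reproduced with a few lines. This is a rough sketch that assumes every document costs the same as the single benchmark prompt, which real workloads with varying document lengths will not satisfy; `COST_PER_CALL_USD` and `batch_cost` are illustrative names. Decimal is used so the totals match the quoted dollar figures exactly.

```python
from decimal import Decimal

# Per-call costs copied from the results table.
COST_PER_CALL_USD = {
    "Claude Haiku 4.5": Decimal("0.00119"),
    "Claude Sonnet 4.6": Decimal("0.00561"),
    "Claude Opus 4.7": Decimal("0.02093"),
}


def batch_cost(model, n_documents):
    """Projected cost in USD, assuming each document costs the same as the benchmark prompt."""
    return COST_PER_CALL_USD[model] * n_documents


for model in COST_PER_CALL_USD:
    print(f"{model}: ${batch_cost(model, 1000):.2f} per 1,000 documents")
```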
Further reading
- Anthropic’s Claude Sonnet 4.6 announcement and model card for specifications and context window details.
- Anthropic’s official Claude documentation for API usage, token counting, and structured output best practices.
- JSON Schema support in Claude for guidelines on reliably extracting and validating structured data.
- Benchmarking LLMs on extraction tasks for peer-reviewed methodology on measuring extraction accuracy and latency trade-offs.
Frequently asked
Why did Haiku leave Pixel 9 Pro's release year as null while Sonnet and Opus inferred 2024?
The prompt text states 'which came out last year', implying the conversation occurs in 2025. Sonnet and Opus performed temporal reasoning to backfill 2024; Haiku took a more conservative stance and did not infer from context. Both approaches are defensible, but inference is more useful.
Did any model incorrectly include the rumored foldable Surface device?
No. All three models correctly excluded it because the text explicitly labeled it as an unconfirmed rumor ('the rumor about a foldable Surface device...nothing was confirmed'). The task requirement to extract only specific products was honored across the board.
How did the models handle approximate prices like 'around $500'?
All three models interpreted 'around $500' as price_usd: 500, treating the fuzzy reference as a valid numeric value. This is pragmatic for real-world data where approximate prices are common.
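A pipeline that wants to cross-check this behavior could normalize price mentions with a small rule of its own. The sketch below is one hypothetical approach, not something the benchmark used: `parse_price_usd` takes the first dollar amount in a snippet and ignores hedging words such as "around".

```python
import re

# A dollar sign, optional whitespace, then digits with optional thousands
# separators and an optional decimal part.
PRICE_PATTERN = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)")


def parse_price_usd(text):
    """Return the first dollar amount in `text` as a number, or None if absent."""
    match = PRICE_PATTERN.search(text)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    # Return whole-dollar amounts as int so 500.0 becomes 500.
    return int(value) if value.is_integer() else value


print(parse_price_usd("his 2022 iPad Mini at around $500"))  # 500
print(parse_price_usd("she paid $2,099 for the AMD build"))   # 2099
```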
What is the cost-per-correct-answer for each model on this task?
Haiku: $0.00119 for an incomplete extraction (no Pixel release year). Sonnet: $0.00561 for a complete extraction. Opus: $0.02093 for the same complete extraction. Sonnet wins on cost-to-accuracy ratio.