Did any model incorrectly include the rumored Surface device?

No. All three models correctly excluded the unconfirmed foldable Surface device. Each recognized that unconfirmed rumors do not qualify as specific products.

How did models handle temporal inference for release dates?

Haiku 4.5 left the Pixel 9 Pro's release_year as null, extracting only years that are stated explicitly. Sonnet 4.6 and Opus 4.7 inferred 2024 from the phrase 'came out last year,' which assumes the text was written in roughly 2025.

What trade-off exists between cheaper and more capable models here?

Haiku costs 0.00119 USD but misses the release-year inference. Sonnet costs 0.00564 USD and infers correctly, making it roughly 3.7x cheaper per correct answer than Opus at 0.02093 USD.

Did approximate prices cause extraction errors?

No. All models parsed 'around $500' as 500 and carried the hedged 'I think she said it was $799' over as 799, alongside the exact '$2,099' and '$999'. Approximate or uncertain phrasing did not cause any price errors.

Benchmark: Structured extraction from messy prose

Claude Sonnet 4.6 and Opus 4.7 both achieved perfect extraction with correct release-year inference for the Pixel 9 Pro, while Claude Haiku 4.5 left that field null. Sonnet delivers the best outcome for this task, combining correctness with a cost advantage of roughly 3.7x over Opus.

Task

Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: “laptop”, “phone”, “tablet”, “wearable”, “other”). Do not include anything that isn’t a specific product.

Text: “Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed.”
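The schema in the task statement can be checked mechanically before any output is scored. The following is a minimal sketch, assuming the model returns a bare JSON array with no surrounding commentary. The names `validate_entry` and `validate_output` are hypothetical, and `SAMPLE` is a reconstruction of the literal (no-inference) result described in the Analysis section, not a verbatim model response.

```python
import json

ALLOWED_CATEGORIES = {"laptop", "phone", "tablet", "wearable", "other"}
REQUIRED_KEYS = {"name", "price_usd", "release_year", "category"}


def validate_entry(entry):
    """Return a list of schema violations for one product (empty if valid)."""
    if not isinstance(entry, dict):
        return ["entry must be a JSON object"]
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        return [f"missing keys: {sorted(missing)}"]

    errors = []
    if not isinstance(entry["name"], str):
        errors.append("name must be a string")

    price = entry["price_usd"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if price is not None and (isinstance(price, bool) or not isinstance(price, (int, float))):
        errors.append("price_usd must be a number or null")

    year = entry["release_year"]
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        errors.append("release_year must be an int or null")

    category = entry["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        errors.append("category must be one of the five allowed values")
    return errors


def validate_output(raw_json):
    """Parse a model response and collect violations across all entries."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for index, entry in enumerate(data):
        problems.extend(f"entry {index}: {msg}" for msg in validate_entry(entry))
    return problems


# Reconstructed sample (assumption): literal extraction, no inferred years.
SAMPLE = json.dumps([
    {"name": "Framework 16", "price_usd": 2099, "release_year": None, "category": "laptop"},
    {"name": "iPad Mini", "price_usd": 500, "release_year": 2022, "category": "tablet"},
    {"name": "Apple Watch Ultra 2", "price_usd": 799, "release_year": None, "category": "wearable"},
    {"name": "Pixel 9 Pro", "price_usd": 999, "release_year": None, "category": "phone"},
])

print(validate_output(SAMPLE))
```

A check like this only confirms shape and types. Whether the Surface rumor was excluded, or whether a year was inferred, still has to be judged against the source text.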

Results

Model	Latency (ms)	Input tokens	Output tokens	Cost (USD)	Verdict
Claude Haiku 4.5	1426	206	196	0.00119	Extracted 4 products correctly; release_year null for Pixel 9 Pro
Claude Sonnet 4.6	7356	206	335	0.00564	Extracted 4 products correctly; inferred Pixel 9 Pro release as 2024
Claude Opus 4.7	4120	285	222	0.02093	Extracted 4 products correctly; inferred Pixel 9 Pro release as 2024

Analysis

Claude Haiku 4.5 produced a clean, valid JSON output with all four confirmed products correctly identified and categorized. Framework 16 mapped to laptop, iPad Mini to tablet, Apple Watch Ultra 2 to wearable, and Pixel 9 Pro to phone. Price extraction was accurate across all entries, including the informal “around $500” notation. The model correctly excluded the rumored foldable Surface device, recognizing that unconfirmed rumors are not specific products. The only gap lies in the release_year field for Pixel 9 Pro, which Haiku left as null despite the text stating it “came out last year.” This represents a literal interpretation: the model extracted only explicitly stated years (2022 for iPad Mini) and avoided temporal inference.

Claude Sonnet 4.6 delivered identical product identification and categorization but added temporal reasoning. It inferred that “came out last year” (relative to an apparent present of ~2025) means the Pixel 9 Pro was released in 2024, populating that field with an integer rather than null. The model also provided explicit commentary on its decisions, explaining why it excluded the Surface device and noting its reasoning for the Pixel 9 Pro inference. This transparency is valuable in structured extraction tasks where understanding the model’s decision boundary matters. Output token count (335) was significantly higher than Haiku’s (196), reflecting the additional explanation text.
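The difference between the literal and the inferring behavior can be illustrated with a small rule-based sketch. This is not how any of the models work internally. The function `resolve_release_year` and its `reference_year` parameter are hypothetical, and the value 2025 is the assumed "present" that the inference relies on.

```python
import re


def resolve_release_year(phrase, reference_year=None):
    """Resolve a release-date phrase to a year.

    An explicit four-digit year always wins. Relative phrases are resolved
    only when a reference_year is supplied; with reference_year=None the
    function behaves literally and returns None for them.
    """
    text = phrase.lower()
    explicit = re.search(r"\b(?:19|20)\d{2}\b", text)
    if explicit:
        return int(explicit.group(0))
    if reference_year is None:
        return None  # literal mode: no temporal inference
    if "last year" in text:
        return reference_year - 1
    if "this year" in text:
        return reference_year
    return None


# Literal reading (the behavior attributed to Haiku 4.5 in this benchmark).
print(resolve_release_year("which came out last year"))
# Inferring reading, assuming the text was written in 2025.
print(resolve_release_year("which came out last year", reference_year=2025))
# Explicit years need no reference point.
print(resolve_release_year("his 2022 iPad Mini"))
```

The sketch makes the hidden dependency visible: the 2024 answer is only as good as the assumed reference year, which the source text never states.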

Claude Opus 4.7 matched Sonnet 4.6’s extraction accuracy and also inferred 2024 for the Pixel 9 Pro’s release year. It provided no explanatory notes, delivering only the JSON output. Its input token count (285) exceeded the 206 reported for both other models, possibly due to a longer system prompt or different tokenization behavior. Latency (4120 ms) was lower than Sonnet 4.6’s (7356 ms), but cost (0.02093 USD) was roughly 3.7x higher than Sonnet’s.

Winner and why

Claude Sonnet 4.6 is the optimal choice for this extraction task. It achieved perfect accuracy matching Opus 4.7 on all four products and correctly inferred the Pixel 9 Pro’s 2024 release year from contextual clues. The critical difference is cost-efficiency: Sonnet costs 0.00564 USD per correct extraction, compared to Opus’s 0.02093 USD, making it 3.7x cheaper for an identical correct output. Compared to Haiku’s 0.00119 USD cost, Sonnet’s marginal expense of 0.00445 USD buys complete accuracy on the release-year inference task, which Haiku missed. In production systems processing high volumes of messy product data, the inference capability is often worth the cost premium. Sonnet’s additional explanatory output also provides signal about confidence and decision logic, useful for debugging and validation workflows.
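The cost figures in this section follow directly from the Cost column of the Results table. A short check of the arithmetic, with the dictionary and helper names chosen here purely for illustration:

```python
# Per-request cost in USD, copied from the Results table.
COST_USD = {
    "Claude Haiku 4.5": 0.00119,
    "Claude Sonnet 4.6": 0.00564,
    "Claude Opus 4.7": 0.02093,
}


def cost_ratio(expensive, cheap):
    """How many times more the first model costs than the second."""
    return COST_USD[expensive] / COST_USD[cheap]


opus_vs_sonnet = cost_ratio("Claude Opus 4.7", "Claude Sonnet 4.6")
sonnet_vs_haiku = cost_ratio("Claude Sonnet 4.6", "Claude Haiku 4.5")
sonnet_premium = COST_USD["Claude Sonnet 4.6"] - COST_USD["Claude Haiku 4.5"]

print(f"Opus / Sonnet:  {opus_vs_sonnet:.1f}x")
print(f"Sonnet / Haiku: {sonnet_vs_haiku:.1f}x")
print(f"Sonnet premium over Haiku: {sonnet_premium:.5f} USD")
```

These are single-run figures for one prompt, so the ratios describe this benchmark only and would shift with different prompt lengths or output verbosity.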

Takeaways

Temporal inference separates the models in this benchmark. Haiku’s literal approach misses implicit release dates; Sonnet 4.6 and Opus 4.7 both reason from context to fill fields that would otherwise stay null. For tasks requiring only exact matches, Haiku’s speed and cost are defensible; for real-world product intelligence pipelines, inference capability justifies moving upmarket.

Cost-per-correct-answer favors Sonnet 4.6 decisively. The model’s 3.7x cost advantage over Opus, combined with identical correctness on the core extraction task, makes it the default choice for structured extraction at scale. Haiku’s roughly 4.7x cost advantage over Sonnet matters little when its output leaves an inferable field empty.

All three models correctly rejected the unconfirmed rumor, applying the “specific product” filter the same way. The foldable Surface device mentioned for 2026 was excluded by all, which suggests, at least on this sample, that these Claude models distinguish between confirmed and rumored hardware.

Latency (1426 ms to 7356 ms) does not track accuracy in this benchmark. Haiku was the fastest but skipped the release-year inference. Sonnet was the slowest and also produced the most output tokens (335), yet Opus reached the same extraction result in less time. Opus’s lower latency came at roughly 3.7x Sonnet’s cost, making it the least cost-efficient option for this task.

