Why did Haiku leave Pixel 9 Pro's release year as null while Sonnet and Opus inferred 2024?

The prompt text states 'which came out last year', implying the conversation occurs in 2025. Sonnet and Opus performed temporal reasoning to backfill 2024; Haiku took a more conservative stance and did not infer from context. Both approaches are defensible, but inference is more useful.

Did any model incorrectly include the rumored foldable Surface device?

No. All three models correctly excluded it because the text explicitly labeled it as an unconfirmed rumor ('the rumor about a foldable Surface device...nothing was confirmed'). The task requirement to extract only specific products was honored across the board.

How did the models handle approximate prices like 'around $500'?

All three models interpreted 'around $500' as price_usd: 500, treating the fuzzy reference as a valid numeric value. This is pragmatic for real-world data where approximate prices are common.

What is the cost-per-correct-answer for each model on this task?

Haiku: $0.00119 for an incomplete extraction (no Pixel release year). Sonnet: $0.00561 for a complete extraction. Opus: $0.02093 for the same complete extraction. Sonnet wins on cost-to-accuracy ratio.

Benchmark: Structured extraction from messy prose

The task was to extract product metadata from conversational prose into structured JSON, handling implicit information and filtering out unconfirmed rumors. Sonnet 4.6 and Opus 4.7 both performed correctly; Haiku 4.5 missed one inference, the Pixel 9 Pro’s release year. That makes Sonnet 4.6 the best choice: it matches Opus on accuracy at roughly a quarter of the cost ($0.00561 vs. $0.02093).

Task

Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: “laptop”, “phone”, “tablet”, “wearable”, “other”). Do not include anything that isn’t a specific product.

Text: “Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed.”
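The schema in the task statement can be checked mechanically before any accuracy comparison. The following is a minimal sketch, not part of the benchmark harness: the function name `validate_entry` and the error messages are illustrative assumptions, while the field names and allowed categories come straight from the task.

```python
from typing import Any

# Field names and categories are taken from the task statement.
REQUIRED_KEYS = ("name", "price_usd", "release_year", "category")
ALLOWED_CATEGORIES = {"laptop", "phone", "tablet", "wearable", "other"}


def validate_entry(entry: dict[str, Any]) -> list[str]:
    """Return schema violations for one extracted product (empty list if valid).

    Hypothetical helper for illustration; not the benchmark's own checker.
    """
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        return ["missing keys: " + ", ".join(missing)]

    errors = []
    name = entry["name"]
    if not isinstance(name, str) or not name.strip():
        errors.append("name must be a non-empty string")

    price = entry["price_usd"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if price is not None and (isinstance(price, bool) or not isinstance(price, (int, float))):
        errors.append("price_usd must be a number or null")

    year = entry["release_year"]
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        errors.append("release_year must be an int or null")

    category = entry["category"]
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        errors.append("category must be one of: " + ", ".join(sorted(ALLOWED_CATEGORIES)))
    return errors
```

A null `release_year` passes this check, so schema validity alone does not distinguish a conservative extraction from a complete one; that needs a comparison against expected values.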

Results

Model	Latency (ms)	Input tokens	Output tokens	Cost (USD)	Verdict
Claude Haiku 4.5	1517	206	196	0.00119	Extracted 4 products; missed Pixel release year inference
Claude Sonnet 4.6	11006	206	333	0.00561	Complete and accurate; inferred Pixel 2024 from context; included reasoning notes
Claude Opus 4.7	3134	285	222	0.02093	Complete and accurate; inferred Pixel 2024; silent extraction

Analysis

Claude Haiku 4.5 correctly identified and extracted all four confirmed products: Framework 16 (laptop, $2,099), iPad Mini (tablet, $500, 2022), Apple Watch Ultra 2 (wearable, $799), and Pixel 9 Pro (phone, $999). The categories, explicit prices, and the iPad’s stated release year were all captured accurately. However, Haiku set Pixel 9 Pro’s release_year to null, failing to infer from the phrase “which came out last year” that the device was released in 2024. The model took a conservative stance, including only facts explicitly stated rather than reasoning about temporal context. For extraction tasks involving real-world data, this is a material gap.

Claude Sonnet 4.6 produced identical output to Haiku except for one critical difference: it inferred Pixel 9 Pro’s release_year as 2024 by reasoning that if the conversation takes place in 2025 (implied by “last year”), then “last year” equals 2024. Sonnet also provided a second output block with explicit reasoning notes explaining why the foldable Surface was excluded (unconfirmed rumor), why the Pixel year was inferred (temporal logic), and why other fields were set to null (no data provided). This transparency is useful for debugging and validation, though it slightly inflates token count and latency. Sonnet’s core extraction was complete and correct.
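The temporal step described here is small enough to sketch directly. This is an illustration of the inference, not how any of the models implement it: `resolve_relative_year` and its phrase table are hypothetical, and the reference year of 2025 is the assumption attributed to the models above, since the prompt text itself does not state a date.

```python
from typing import Optional

# Hypothetical phrase table; only "last year" appears in the benchmark text.
_YEAR_OFFSETS = {"last year": -1, "this year": 0, "next year": 1}


def resolve_relative_year(phrase: str, reference_year: int) -> Optional[int]:
    """Map a relative phrase such as 'last year' to a calendar year.

    Returns None when the phrase carries no recognized relative-year cue,
    which corresponds to leaving release_year as null.
    """
    text = phrase.lower()
    for cue, offset in _YEAR_OFFSETS.items():
        if cue in text:
            return reference_year + offset
    return None


# Assumes the conversation takes place in 2025.
print(resolve_relative_year("which came out last year", 2025))  # 2024
```

The result depends entirely on the assumed reference year, which is why a conservative null is defensible when the date of the conversation is unknown.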

Claude Opus 4.7 matched Sonnet’s extraction exactly, including the inferred Pixel 2024 release year. It generated output with identical structure and content but without the supplementary reasoning notes. Opus responded much faster than Sonnet (3,134 ms vs. 11,006 ms, though the comparison is noisy due to network variance), while Haiku was fastest overall at 1,517 ms. Opus was also the most expensive at $0.02093. Its input token count was higher (285 vs. 206 for the other two models), likely due to internal processing overhead or model-specific context formatting.
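A field-level diff against a reference extraction makes the difference between the outputs explicit. The sketch below is hedged in two ways: the reference values are reconstructed from the analysis above (the null release years for the Framework 16 and Apple Watch Ultra 2 are assumed because the text gives no year for them), and the exact product names the models emitted are not shown in this write-up, so the names here are assumptions. `diff_extraction` is a hypothetical helper.

```python
# Reference reconstructed from the analysis; not the raw model output.
REFERENCE = [
    {"name": "Framework 16", "price_usd": 2099, "release_year": None, "category": "laptop"},
    {"name": "iPad Mini", "price_usd": 500, "release_year": 2022, "category": "tablet"},
    {"name": "Apple Watch Ultra 2", "price_usd": 799, "release_year": None, "category": "wearable"},
    {"name": "Pixel 9 Pro", "price_usd": 999, "release_year": 2024, "category": "phone"},
]


def diff_extraction(candidate, reference):
    """Return (name, field, expected, actual) tuples for every mismatch."""
    by_name = {item["name"]: item for item in candidate}
    mismatches = []
    for expected in reference:
        actual = by_name.get(expected["name"])
        if actual is None:
            mismatches.append((expected["name"], "<missing>", expected, None))
            continue
        for field, want in expected.items():
            got = actual.get(field)
            if got != want:
                mismatches.append((expected["name"], field, want, got))
    # Flag anything extracted that the reference does not contain,
    # such as the rumored foldable Surface.
    reference_names = {item["name"] for item in reference}
    for name, item in by_name.items():
        if name not in reference_names:
            mismatches.append((name, "<unexpected>", None, item))
    return mismatches


# A Haiku-like result: same as the reference except for the Pixel year.
haiku_like = [dict(item) for item in REFERENCE]
haiku_like[3]["release_year"] = None
print(diff_extraction(haiku_like, REFERENCE))
# [('Pixel 9 Pro', 'release_year', 2024, None)]
```

Matching on exact names is a simplification; a real scorer would likely need to normalize names before comparing.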

Winner and why

Claude Sonnet 4.6 is the optimal choice for this task. It matches Opus in correctness by inferring the Pixel 9 Pro’s 2024 release year from temporal context clues, correctly excluding the unconfirmed foldable Surface, and extracting all four products with accurate categorization and price data. Sonnet achieves this at $0.00561 per call against $0.02093 for Opus, which makes it about 3.7x cheaper. The latency difference (11 seconds vs. 3 seconds) is not material for asynchronous batch extraction tasks. Sonnet’s inclusion of reasoning notes adds transparency without obscuring the core JSON output, making it traceable and debuggable. Haiku’s failure to infer the Pixel release year disqualifies it despite its speed and cost advantages; a downstream schema validation or analytics system expecting populated release_year fields would find incomplete data. For extraction from conversational or semi-structured text, the ability to perform lightweight inference is essential, and Sonnet delivers this at a fraction of Opus’s price.

Takeaways

Temporal reasoning separates complete from incomplete extractions. Haiku’s conservative null for Pixel 9 Pro’s release year represents a gap in capability, not just style. Sonnet and Opus both inferred 2024 from “last year,” a simple but essential inference for handling real-world data. This suggests that smaller models may need explicit prompting to perform date logic or context-aware field population.
Reasoning transparency improves debugging without compromising accuracy. Sonnet’s optional notes explaining why the foldable Surface was excluded and why the Pixel year was inferred cost roughly 110 to 140 extra output tokens (333 vs. 222 for Opus and 196 for Haiku) but would save significant time in a QA workflow. For production extraction pipelines, this trade-off is usually worth the small cost delta.
Cost per correct answer is not cost per inference. Opus at $0.02093 is the most expensive model but delivered the same accuracy as Sonnet at $0.00561. For extraction workloads run at scale (thousands of documents), Sonnet’s cost advantage compounds rapidly: 1,000 extracts with Sonnet costs $5.61; the same job with Opus costs $20.93. Haiku’s lower cost is negated by its incomplete output.
Latency variance and network conditions matter. Sonnet showed 11 seconds latency while Opus showed 3 seconds, suggesting either internal retries in Sonnet’s path or network jitter. For real-time systems, Opus might appear faster; for batch processing, both are acceptable and Sonnet’s superior cost-to-accuracy ratio dominates the decision.
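The cost arithmetic in these takeaways can be reproduced from the results table. This is a small sketch that assumes cost scales linearly with document count and that every document costs the same as this single benchmark call; real documents vary in length, so treat the output as a rough projection. The function name `projected_cost` is illustrative.

```python
# Per-call costs in USD, copied from the results table.
COST_PER_CALL_USD = {
    "Claude Haiku 4.5": 0.00119,
    "Claude Sonnet 4.6": 0.00561,
    "Claude Opus 4.7": 0.02093,
}


def projected_cost(model: str, documents: int) -> float:
    """Linear projection: per-call cost times the number of documents."""
    return COST_PER_CALL_USD[model] * documents


for model in COST_PER_CALL_USD:
    print(f"{model}: ${projected_cost(model, 1000):.2f} per 1,000 documents")

# How many times more one Opus call costs than one Sonnet call.
opus_to_sonnet = COST_PER_CALL_USD["Claude Opus 4.7"] / COST_PER_CALL_USD["Claude Sonnet 4.6"]
print(f"Opus / Sonnet cost ratio: {opus_to_sonnet:.2f}")
```

With the table's figures this gives $1.19, $5.61, and $20.93 per 1,000 documents, and an Opus-to-Sonnet ratio of roughly 3.7.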
