AIgentic

Agentic Systems & LLM Tooling Daily

Benchmark: Structured extraction from messy prose

Claude Sonnet 4.6 achieved the most complete extraction by including the rumored Surface foldable device, while Haiku and Opus stopped at four confirmed products. Sonnet's explicit reasoning about edge cases outweighed its higher cost.


The task was to extract product mentions from conversational text into JSON with structured fields (name, price, release year, category). It required models to parse ambiguous pricing hints, handle missing data, and make judgment calls about what qualifies as a product. Claude Sonnet 4.6 produced the most complete and defensible extraction by including the rumored Surface foldable device, while Haiku and Opus each omitted it. That makes Sonnet the winner here, despite costing roughly five times as much as Haiku per request.

Task

Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: “laptop”, “phone”, “tablet”, “wearable”, “other”). Do not include anything that isn’t a specific product.

Text: “Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed.”
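The schema in the task can be checked mechanically before comparing model outputs. The following is a minimal sketch of such a check, not part of the benchmark harness; the function name `validate_entry` and the error messages are hypothetical.

```python
import json

ALLOWED_CATEGORIES = {"laptop", "phone", "tablet", "wearable", "other"}
REQUIRED_KEYS = ("name", "price_usd", "release_year", "category")


def validate_entry(entry):
    """Return a list of schema problems for one extracted entry (empty if valid)."""
    missing = [key for key in REQUIRED_KEYS if key not in entry]
    if missing:
        return ["missing keys: " + ", ".join(missing)]

    problems = []
    name = entry["name"]
    if not isinstance(name, str) or not name.strip():
        problems.append("name must be a non-empty string")

    price = entry["price_usd"]
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if price is not None and (isinstance(price, bool) or not isinstance(price, (int, float))):
        problems.append("price_usd must be a number or null")

    year = entry["release_year"]
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        problems.append("release_year must be an int or null")

    if entry["category"] not in ALLOWED_CATEGORIES:
        problems.append("category must be one of the allowed values")
    return problems


# One entry built from the task text: Framework 16, $2,099, no year stated.
sample = json.loads(
    '[{"name": "Framework 16", "price_usd": 2099, "release_year": null, "category": "laptop"}]'
)
for item in sample:
    print(item["name"], validate_entry(item))
```

A check like this only confirms the shape of each entry. It says nothing about whether the rumored Surface device belongs in the array, which is the judgment call the models disagreed on.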

Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 1335 | 206 | 196 | 0.00119 | Missed rumored Surface device |
| Claude Sonnet 4.6 | 5935 | 206 | 367 | 0.00612 | Complete with reasoning |
| Claude Opus 4.7 | 2737 | 285 | 222 | 0.02093 | Missed rumored Surface device |
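For readers who want the relative figures, here is a small sketch that computes cost and latency ratios from the results above. The dictionary layout and the helper name `ratio` are illustrative choices; the numbers are copied from the results.

```python
# Figures copied from the results; the structure itself is an illustrative choice.
RESULTS = {
    "Claude Haiku 4.5": {"latency_ms": 1335, "output_tokens": 196, "cost_usd": 0.00119},
    "Claude Sonnet 4.6": {"latency_ms": 5935, "output_tokens": 367, "cost_usd": 0.00612},
    "Claude Opus 4.7": {"latency_ms": 2737, "output_tokens": 222, "cost_usd": 0.02093},
}


def ratio(metric, model_a, model_b):
    """How many times larger model_a's metric is than model_b's."""
    return RESULTS[model_a][metric] / RESULTS[model_b][metric]


print(f"Sonnet vs Haiku cost:   {ratio('cost_usd', 'Claude Sonnet 4.6', 'Claude Haiku 4.5'):.1f}x")
print(f"Opus vs Sonnet cost:    {ratio('cost_usd', 'Claude Opus 4.7', 'Claude Sonnet 4.6'):.1f}x")
print(f"Sonnet vs Opus latency: {ratio('latency_ms', 'Claude Sonnet 4.6', 'Claude Opus 4.7'):.1f}x")
```

On these figures Sonnet costs about 5.1 times as much as Haiku, Opus about 3.4 times as much as Sonnet, and Sonnet takes about 2.2 times as long as Opus.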

Analysis

Claude Haiku 4.5 correctly extracted four products (Framework 16, iPad Mini, Apple Watch Ultra 2, Pixel 9 Pro) with accurate JSON encoding, proper category assignment, and reasonable handling of ambiguous prices. However, it excluded the rumored Surface foldable device entirely. The omission reflects a conservative interpretation of “do not include anything that isn’t a specific product”; without explicit guidance, Haiku treated the unconfirmed rumor as falling outside scope. The model also declined to infer Pixel 9 Pro’s release year despite the contextual clue (“came out last year”), setting it to null rather than guessing. Haiku was the fastest and cheapest option, at 1.3 seconds and $0.00119.

Claude Sonnet 4.6 extracted the same four confirmed products plus the Surface foldable device, treating the latter as a legitimate product mention because it possessed a specific name (“Surface foldable device”), form factor, and release year (2026). Sonnet appended explicit notes justifying each judgment call, notably inferring that “came out last year” implied a 2024 release for Pixel 9 Pro (assuming the text’s present context is 2025). This reasoning transparency is valuable for audit and trust, but it inflated output tokens to 367 and latency to nearly 6 seconds. The decision to include the rumored device hinged on whether the task’s constraint (“do not include anything that isn’t a specific product”) bars unconfirmed items. The task did not explicitly forbid rumors; Sonnet’s inclusiveness is defensible and arguably more complete.
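The year inference described above depends entirely on the assumed present date. A minimal sketch of that logic follows; the function name `resolve_relative_year` and the small phrase table are hypothetical, and nothing here reflects how any model actually reasons internally.

```python
# Hypothetical phrase table; real text would need far more coverage.
RELATIVE_OFFSETS = {"last year": -1, "this year": 0, "next year": 1}


def resolve_relative_year(phrase, reference_year):
    """Map a relative date phrase to a year, or None when it cannot be resolved.

    reference_year is the ASSUMED present year of the text. Passing None
    models the conservative choice of not guessing at all.
    """
    if reference_year is None:
        return None
    text = phrase.lower()
    for marker, offset in RELATIVE_OFFSETS.items():
        if marker in text:
            return reference_year + offset
    return None


print(resolve_relative_year("came out last year", 2025))  # 2024
print(resolve_relative_year("came out last year", None))  # None
```

With a reference year of 2025 the phrase resolves to 2024; with no reference year the result is null, which corresponds to leaving release_year unset.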

Claude Opus 4.7 matched Haiku’s four-product extraction exactly, omitting the Surface device and declining to infer Pixel 9 Pro’s year. Opus took 2.7 seconds, under half of Sonnet’s 5.9, and produced only 222 output tokens, implying a leaner response. It offered no reasoning notes, so its decision to exclude the rumored device went unexplained. Opus cost $0.02093, the highest of the three, without delivering any additional correctness or insight over Haiku’s $0.00119 output.

Winner and why

Claude Sonnet 4.6 is the clear winner. It cost about 5 times as much as Haiku for this request, though only about a third as much as Opus, and it delivered the only complete extraction by reasoning that a named, dated future product (“foldable Surface device coming in 2026”) counts as a “specific product” even if unconfirmed. Sonnet’s inclusion is consistent with the task’s instruction “do not include anything that isn’t a specific product”, whereas Haiku and Opus made an implicit assumption that rumors do not count. Sonnet’s explicit reasoning notes also provide traceability, which is crucial in data extraction pipelines where stakeholders need to audit edge-case decisions.

The cost-to-correctness ratio favors Sonnet here because completeness is the primary success metric for extraction tasks. A missing entry is a downstream data loss; an extra entry is easier to filter out. For production workflows where missing products could cause inventory, pricing, or analysis gaps, Sonnet’s premium over Haiku is justified. Haiku remains valuable for high-volume extraction where rumor inclusion is explicitly unwanted, but as stated, the task did not disqualify rumors, which makes Sonnet’s more inclusive reading of the prompt the better one.

Takeaways

  1. Edge-case reasoning, not capability tier, separated the models. Haiku and Opus sit at opposite ends of the lineup yet produced identical four-product outputs, while Sonnet’s decision to include the unconfirmed Surface device reflects a more literal reading of the task prompt. The prompt never said “exclude rumors”; the other two models made implicit assumptions rather than explicit judgments. In production extraction, this kind of interpretive difference is often the source of data quality gaps.

  2. Output verbosity reflects response style, not just model capacity or task complexity. Sonnet’s 367 tokens versus Haiku’s 196 reflects not just an extra JSON entry but also Sonnet’s decision to append reasoning notes, while Opus, the largest model, produced only 222 tokens. For budget-conscious deployments, this overhead is real; for auditable systems, it is necessary. Opus cost more per token yet delivered neither Sonnet’s reasoning nor its completeness.

  3. Inferring missing structured fields remains a judgment call across tiers. All models correctly nulled release_year fields where no explicit year was given (Framework 16, Apple Watch Ultra 2). Only Sonnet inferred 2024 for Pixel 9 Pro from contextual clues, a more aggressive but defensible choice that depends on the confidence in the assumed present date (2025). This variance suggests prompts specifying whether inference is desired would reduce model divergence.

  4. For extraction tasks, completeness often outweighs cost up to a threshold. Sonnet’s 5x cost premium over Haiku translates to fractions of a cent per query. For tasks where missing data has downstream cost (compliance, analytics, inventory), Sonnet’s completeness is cheap insurance. Opus’s higher price without offsetting benefit suggests smaller models can be a sweet spot for cost-controlled extraction when prompts are well-specified.
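Takeaway 3 suggests stating the rumor and inference policies in the prompt itself. One way that could look is sketched below; the function `build_prompt`, its parameters, and the wording of the added rules are all hypothetical and were not part of the benchmark.

```python
BASE_TASK = (
    "Extract every product mentioned in the text below into a JSON array. "
    "Each entry must have: name (string), price_usd (number or null if absent), "
    "release_year (int or null), category (one of: \"laptop\", \"phone\", "
    "\"tablet\", \"wearable\", \"other\"). "
    "Do not include anything that isn't a specific product."
)


def build_prompt(text, include_rumors, infer_relative_years, reference_year=None):
    """Assemble the task prompt with explicit (hypothetical) policy rules."""
    rules = []
    if include_rumors:
        rules.append("Include rumored or unconfirmed products if they are named specifically.")
    else:
        rules.append("Exclude rumored or unconfirmed products.")

    if infer_relative_years:
        if reference_year is None:
            raise ValueError("reference_year is required when inferring relative years")
        rules.append(
            f"Resolve relative dates such as 'last year' assuming the current year is {reference_year}."
        )
    else:
        rules.append("Set release_year to null unless the text states an explicit year.")

    rule_lines = "\n".join(f"- {rule}" for rule in rules)
    return f"{BASE_TASK}\n\nRules:\n{rule_lines}\n\nText: {text}"


print(build_prompt("<source text here>", include_rumors=False, infer_relative_years=True, reference_year=2025))
```

Whether explicit rules like these would have brought the three models into agreement was not tested here; the sketch only shows how the ambiguity could be removed from the prompt.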

Frequently asked

Should extraction include rumored or unconfirmed products?

Claude Sonnet 4.6 included the rumored Surface foldable device, reasoning that a specific product name, form factor, and release year qualified it as a 'product mention' even if unconfirmed. Haiku and Opus excluded it, treating rumor as insufficient. The task prompt did not explicitly forbid unconfirmed products, making this a reasonable interpretation question.

How did the models handle missing release_year values?

All three models set release_year to null for the two products with no year information in the text (Framework 16, Apple Watch Ultra 2). Haiku and Opus also left Pixel 9 Pro as null; only Sonnet inferred 2024 from the 'came out last year' context, assuming the text's present is 2025.

Why did Sonnet cost 5 times more than Haiku for this task?

Sonnet produced 367 output tokens versus Haiku's 196, largely due to explicit reasoning notes justifying its extraction decisions, though the core JSON was only one additional entry. Token count alone accounts for less than a 2x difference on identical input; the rest of the 5x gap comes from Sonnet's higher per-token price.
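The split between token volume and per-token price can be made concrete with a short sketch. The per-million-token rates below are assumptions, not figures stated in this post; they were chosen because they reproduce the reported costs to within rounding.

```python
# ASSUMED rates in USD per million tokens: (input, output). Not from the post.
ASSUMED_RATES = {
    "Claude Haiku 4.5": (1.0, 5.0),
    "Claude Sonnet 4.6": (3.0, 15.0),
    "Claude Opus 4.7": (15.0, 75.0),
}

# (input tokens, output tokens, reported cost in USD), copied from the results.
RUNS = {
    "Claude Haiku 4.5": (206, 196, 0.00119),
    "Claude Sonnet 4.6": (206, 367, 0.00612),
    "Claude Opus 4.7": (285, 222, 0.02093),
}


def estimate_cost(model):
    """Estimate request cost in USD from token counts and the assumed rates."""
    input_rate, output_rate = ASSUMED_RATES[model]
    input_tokens, output_tokens, _reported = RUNS[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


for model, (_in, _out, reported) in RUNS.items():
    print(f"{model}: estimated {estimate_cost(model):.6f}, reported {reported:.5f}")
```

Under these assumed rates each estimate lands within 0.00001 of the reported cost, and output tokens make up roughly 90% of Sonnet's total, which is why the reasoning notes matter for budget.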

Was Opus's omission of the Surface device a correctness issue?

Not clearly. The task said 'do not include anything that isn't a specific product.' Opus appears to have treated the rumor as disqualifying, while Sonnet read 'specific product' as including named future devices. Without explicit guidance, Opus's conservative approach is defensible but less complete.
