---
title: "Benchmark: Structured extraction from messy prose"
description: "Claude Haiku 4.5, Sonnet 4.6, and Opus 4.7 extract products from conversational text into JSON. Sonnet and Opus infer release year; Haiku stays literal. All."
tldr: "All three Claude models correctly extracted four confirmed products and excluded an unconfirmed rumor. Sonnet 4.6 and Opus 4.7 inferred the Pixel 9 Pro's 2024 release year; Haiku left it null. Sonnet delivered the best balance of accuracy and cost at 0.596 cents."
url: "https://aigentic.blog/benchmark-schema-extraction-2026-05-25"
publishedAt: "2026-05-25T13:00:28.464Z"
updatedAt: "2026-05-25T13:00:28.464Z"
category: "benchmark"
tags: ["benchmark","claude","extraction","json-parsing","structured-output"]
---

# Benchmark: Structured extraction from messy prose

> All three Claude models correctly extracted four confirmed products and excluded an unconfirmed rumor. Sonnet 4.6 and Opus 4.7 inferred the Pixel 9 Pro's 2024 release year; Haiku left it null. Sonnet delivered the best balance of accuracy and cost at 0.596 cents.

The task requires extracting specific products mentioned in conversational prose and formatting them as JSON with strict schema compliance. All three Claude models correctly identified the four confirmed products and excluded the unconfirmed rumor; the key difference lies in handling ambiguous temporal references and cost tradeoffs.

## Task

> Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: "laptop", "phone", "tablet", "wearable", "other"). Do not include anything that isn't a specific product.
>
> Text:
> "Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed."

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|-----------|---------|
| Claude Haiku 4.5 | 1848 | 206 | 196 | 0.00119 | Complete, literal interpretation |
| Claude Sonnet 4.6 | 6670 | 206 | 356 | 0.00596 | Complete, includes reasoning notes |
| Claude Opus 4.7 | 3481 | 285 | 222 | 0.02093 | Complete, inferred year, higher cost |

## Analysis

Claude Haiku 4.5 produced a clean, correct extraction in 1.8 seconds. It identified Framework 16, iPad Mini, Apple Watch Ultra 2, and Pixel 9 Pro with accurate prices and categories. The critical decision: Haiku left `release_year` as null for the Pixel 9 Pro, despite the text stating "came out last year." This is strict literal extraction; no explicit year is provided in the source material, so Haiku does not infer. It also correctly omitted the foldable Surface, recognizing the text explicitly flags it as unconfirmed. The output is minimal, clean JSON with no accompanying prose.

Claude Sonnet 4.6 produced identical extraction results but added 160 extra output tokens of reasoning. It made an explicit inference: since the text says the Pixel 9 Pro "came out last year" and the context implies the present is 2025, Sonnet filled `release_year` as 2024. Sonnet also included a "Notes on decisions made" section explaining why the foldable Surface was excluded, why Framework 16's year is null, and why the Pixel 9 Pro year was inferred. The reasoning demonstrates transparency and allows a human reviewer to validate or override the inference. Latency was 6.6 seconds, 3.6 times higher than Haiku, though the output is more defensible in a production extraction pipeline where explainability matters.

Claude Opus 4.7 took a middle path: it matched Sonnet's inference (Pixel 9 Pro as 2024) but provided only the JSON output, without reasoning notes. Opus consumed 285 input tokens and 222 output tokens, higher than both competitors, and cost $0.02093, making it 17.6 times more expensive than Haiku per request. Latency was 3.5 seconds, faster than Sonnet but slower than Haiku. Opus offers no clear advantage here; it infers like Sonnet but lacks Sonnet's transparency, while costing nearly 3.5 times more.

## Winner and why

Claude Sonnet 4.6 is the best choice for this extraction task, despite its 5x cost multiplier over Haiku. The reasoning is two-fold. First, the Pixel 9 Pro inference is defensible. The text provides a relative temporal cue ("came out last year"); inferring 2024 from that context is reasonable and likely correct. Haiku's null is not wrong, but it leaves a correctness gap. Sonnet's explicit reasoning notes allow validation and override, which is critical in production systems where extractions feed downstream pipelines. Second, the cost-per-correct-answer calculation favors Sonnet. At $0.00596 per request with full reasoning, Sonnet is appropriate for moderate-scale workloads (100 to 1000 extractions per day). Haiku shines only if extractions are purely literal and bulk volume is the priority (10,000+ per day).

In a real-world scenario, the choice depends on the extraction specification. If the contract is "extract only what is explicitly stated," Haiku's $0.00119 and literal interpretation win. If the contract is "extract and infer reasonable temporal references with audit trails," Sonnet at $0.00596 is the correct answer.

## Takeaways

- **Inference vs. literalism**: Sonnet and Opus inferred the Pixel 9 Pro's 2024 release year from "came out last year"; Haiku did not. Neither is objectively wrong, but Sonnet's explicit reasoning notes provide visibility for downstream validation. This is a key difference between models on extraction tasks with ambiguous inputs.

- **Cost-quality inflection point**: Haiku is 17.6 times cheaper than Opus per request but produces less detailed output. Sonnet at 5x Haiku's cost bridges the gap, offering inference plus reasoning for workloads that require both correctness and explainability. Opus is overpriced for extraction unless the text is adversarially complex.

- **All three rejected the unconfirmed product**: The foldable Surface device was explicitly flagged as a rumor in the text. All three models correctly recognized it was not a confirmed product and excluded it. This indicates robust instruction-following across the lineup on the "do not include unspecific items" constraint.

- **Output token inflation with reasoning**: Sonnet's 356 output tokens versus Haiku's 196 reflects the inclusion of reasoning notes. For high-volume batch extraction (thousands of products daily), this overhead compounds; Haiku may be necessary. For lower-volume workloads with audit requirements, Sonnet's transparency justifies the cost.

## Further reading

- [Claude Haiku 4.5 documentation](https://docs.anthropic.com/en/docs/about-claude/models/latest) - Anthropic's official model specifications and API reference.
- [JSON Schema validation best practices](https://json-schema.org/) - Standard schema definition for validating structured output extraction.
- [Temporal reference resolution in NLP](https://arxiv.org/abs/1905.07854) - Research on inferring absolute dates from relative temporal expressions like "last year."
- [Extraction and semantic parsing survey](https://arxiv.org/abs/2301.12031) - Comprehensive overview of structured information extraction from unstructured text.

## Frequently asked

### Why did Sonnet and Opus infer the Pixel 9 Pro release year as 2024 while Haiku left it null?

The text states the Pixel 9 Pro "came out last year," a relative time reference. Sonnet and Opus inferred the present context as ~2025 and filled in 2024. Haiku adhered to strict literal extraction, leaving null when no explicit year appeared. Neither approach is objectively wrong; it depends on the extraction specification.

### All three models excluded the foldable Surface device. Why is that correct?

The prompt required extraction of specific products only, not rumors or unconfirmed items. The text explicitly states the foldable Surface is a rumor with "nothing confirmed." All models correctly recognized this and omitted it from the JSON.

### How much slower was Sonnet compared to Haiku?

Sonnet took 6670ms versus Haiku's 1848ms, a 3.6x latency difference. However, Sonnet produced 356 output tokens compared to Haiku's 196, including detailed reasoning notes. The added latency reflects more thorough explanation, not inefficiency.

### Which model is most cost-effective for extraction tasks at scale?

Haiku 4.5 at $0.00119 per request is the cheapest; at 1000 daily extractions it costs under $1.20. Sonnet at $0.00596 is 5x more expensive per request but provides explicit reasoning. Opus at $0.02093 is best reserved for complex reasoning over simple extraction.
