---
title: "Benchmark: Structured extraction from messy prose"
description: "Claude Sonnet 4.6 and Opus 4.7 produced identical core extracts; Sonnet 4.6 included a speculative product and cost 5x less per token."
tldr: "Three Claude models extracted product data from natural language text. Sonnet 4.6 and Opus 4.7 produced identical extracts for confirmed products; Sonnet additionally included an unconfirmed future device. Haiku 4.5 and Opus 4.7 omitted it. Sonnet 4.6 offers the best cost-to-accuracy ratio."
url: "https://aigentic.blog/benchmark-schema-extraction-2026-06-22"
publishedAt: "2026-06-22T13:00:22.547Z"
updatedAt: "2026-06-22T13:00:22.547Z"
category: "benchmark"
tags: ["benchmark","claude","json-extraction","structured-data","llm-evaluation"]
---

# Benchmark: Structured extraction from messy prose

> Three Claude models extracted product data from natural language text. Sonnet 4.6 and Opus 4.7 produced identical extracts for confirmed products; Sonnet additionally included an unconfirmed future device. Haiku 4.5 and Opus 4.7 omitted it. Sonnet 4.6 offers the best cost-to-accuracy ratio.

Claude Sonnet 4.6 and Claude Opus 4.7 produced identical extractions of four confirmed products from conversational prose, as did Claude Haiku 4.5. Sonnet 4.6 additionally included an unconfirmed future device and explained why, at less than a fifth of Opus 4.7's cost per output token. Sonnet 4.6 is the better choice for this task, balancing completeness, accuracy, and cost.

## Task

> Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: "laptop", "phone", "tablet", "wearable", "other"). Do not include anything that isn't a specific product.
>
> Text: "Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed."
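
The field rules in the prompt can be checked mechanically before comparing model outputs. The following is a minimal sketch of such a check, not the harness used for this benchmark; the function names `validate_entry` and `validate_extraction` are hypothetical.

```python
import json

# Category values copied from the task prompt.
ALLOWED_CATEGORIES = {"laptop", "phone", "tablet", "wearable", "other"}


def validate_entry(entry):
    """Return a list of schema problems for one extracted product (empty if valid)."""
    problems = []

    if not isinstance(entry.get("name"), str):
        problems.append("name must be a string")

    price = entry.get("price_usd")
    # bool is a subclass of int in Python, so reject it explicitly.
    if price is not None and (isinstance(price, bool) or not isinstance(price, (int, float))):
        problems.append("price_usd must be a number or null")

    year = entry.get("release_year")
    if year is not None and (isinstance(year, bool) or not isinstance(year, int)):
        problems.append("release_year must be an int or null")

    category = entry.get("category")
    if not isinstance(category, str) or category not in ALLOWED_CATEGORIES:
        problems.append("category must be one of: " + ", ".join(sorted(ALLOWED_CATEGORIES)))

    return problems


def validate_extraction(raw_json):
    """Parse a JSON response and map entry index -> problems for each invalid entry."""
    data = json.loads(raw_json)
    if not isinstance(data, list):
        return {-1: ["top-level value must be a JSON array"]}

    invalid = {}
    for index, entry in enumerate(data):
        if not isinstance(entry, dict):
            invalid[index] = ["entry must be a JSON object"]
            continue
        problems = validate_entry(entry)
        if problems:
            invalid[index] = problems
    return invalid
```

An empty dictionary means every entry satisfies the schema. The check says nothing about whether the values match the source text, and it assumes the response is bare JSON; a response that wraps the array in explanatory notes would need the array extracted first.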

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|------------|---------|
| Claude Haiku 4.5 | 2548 | 206 | 198 | $0.0012 | Complete on confirmed products; omitted rumored Surface device |
| Claude Sonnet 4.6 | 6148 | 206 | 368 | $0.00614 | Complete extraction with reasoning; included speculative product |
| Claude Opus 4.7 | 2925 | 285 | 222 | $0.02093 | Complete on confirmed products; omitted rumored Surface device |

## Analysis

Claude Haiku 4.5 extracted four products: Framework 16 ($2,099, laptop), iPad Mini ($500, 2022, tablet), Apple Watch Ultra 2 ($799, wearable), and Pixel 9 Pro ($999, 2024, phone). All core fields match the source text, including the two prices the narrator hedges ("around $500" and "I think she said it was $799"). The model correctly set release_year to null for Framework 16 and Apple Watch Ultra 2 (not stated in the text) and inferred 2024 for Pixel 9 Pro from the phrase "came out last year." Notably, Haiku excluded the Surface foldable device mentioned in the final sentence. The prompt states "Do not include anything that isn't a specific product," and the text explicitly flagged this item as rumored and unconfirmed, so the omission is defensible but conservative. Output was concise at 198 tokens with no supplementary commentary. Latency was the lowest at 2548ms, and cost was the lowest at $0.0012.

Claude Sonnet 4.6 produced an identical extraction for the four confirmed products and added a fifth entry: Surface foldable (price_usd: null, release_year: 2026, category: "other"). The model's interpretation hinges on reading "specific product" to include products with a defined name and projected release timeline, even if unconfirmed. The output included explicit justification: inline notes explaining the handling of Pixel 9 Pro's inferred year, the rationale for including the Surface foldable, and confirmation that iPad Mini's 2022 release year came directly from the text. This transparency is valuable for applications relying on extraction pipelines; users can audit judgment calls. Output consumed 368 tokens (the highest of the three), resulting in a cost of $0.00614, and latency was also the longest at 6148ms. The extra tokens funded explanatory notes rather than redundant JSON repetition.

Claude Opus 4.7 matched Haiku's extraction exactly: four confirmed products, zero additional entries. Like Haiku, it omitted the Surface foldable, treating the unconfirmed status as disqualifying. Opus consumed 285 input tokens, against 206 for Haiku and Sonnet, although the prompt text was identical. It produced 222 output tokens of pure JSON and incurred the highest per-query cost at $0.02093. Latency was 2925ms, between Haiku and Sonnet. No reasoning notes accompanied the output.
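
The comparison across the three models can be reduced to a set difference. The sketch below reconstructs each extract from the fields described in this section; the exact `name` strings the models returned are an assumption (the entries use the names as written in this post), and `entry_key` and `diff_extracts` are hypothetical helper names.

```python
# Entries reconstructed from the Analysis text; None stands for JSON null.
CONFIRMED = [
    {"name": "Framework 16", "price_usd": 2099, "release_year": None, "category": "laptop"},
    {"name": "iPad Mini", "price_usd": 500, "release_year": 2022, "category": "tablet"},
    {"name": "Apple Watch Ultra 2", "price_usd": 799, "release_year": None, "category": "wearable"},
    {"name": "Pixel 9 Pro", "price_usd": 999, "release_year": 2024, "category": "phone"},
]
RUMORED = {"name": "Surface foldable", "price_usd": None, "release_year": 2026, "category": "other"}

EXTRACTS = {
    "Claude Haiku 4.5": CONFIRMED,
    "Claude Sonnet 4.6": CONFIRMED + [RUMORED],
    "Claude Opus 4.7": CONFIRMED,
}


def entry_key(entry):
    """Hashable identity for an entry: all four schema fields must match."""
    return (entry["name"], entry["price_usd"], entry["release_year"], entry["category"])


def diff_extracts(extracts):
    """Return the entries shared by every model and the extra names per model."""
    keyed = {model: {entry_key(e) for e in entries} for model, entries in extracts.items()}
    shared = set.intersection(*keyed.values())
    extras = {model: sorted(key[0] for key in keys - shared) for model, keys in keyed.items()}
    return shared, extras


shared, extras = diff_extracts(EXTRACTS)
print(len(shared))  # 4
print(extras)       # only Claude Sonnet 4.6 has an extra entry: 'Surface foldable'
```

Matching on all four fields rather than on the name alone means a model that returned the right product with a different price or year would also show up as a divergence.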

## Winner and why

Sonnet 4.6 is the most appropriate model for this task. While Opus 4.7 is Anthropic's flagship, it offers no advantage here: its output is indistinguishable from Haiku's for the core extraction, yet it costs about 17x as much ($0.02093 vs. $0.0012). Sonnet 4.6 sits between the two on cost and is the only model that explained its judgment calls.

The deciding factor is handling of the Surface foldable. The prompt asks to extract "every product mentioned" and specifies "Do not include anything that isn't a specific product." The text names a product ("foldable Surface device"), says what kind of thing it is (a device), and gives a release projection (2026). Whether to include it depends on interpreting "specific" and weighing recall against precision. Sonnet's decision to include it and explain why is defensible: downstream users (product roadmap trackers, competitive analysts) often need unconfirmed products flagged explicitly. Haiku's and Opus's omission is also defensible: unconfirmed status justifies exclusion. But Sonnet's provision of reasoning transforms the judgment call from opaque to auditable.

On cost, Sonnet 4.6 at $0.00614 per query is about 5x Haiku's $0.0012 but delivers added transparency. Dividing each query's total cost by its output tokens gives a blended figure of $0.0000167 per output token for Sonnet, compared to $0.0000943 for Opus, roughly 5.6x higher. For applications running thousands of extractions, Sonnet's economics are substantially better than Opus's. Sonnet's latency (6148ms vs. 2548ms for Haiku) is acceptable for non-realtime pipelines, where the extra reasoning output justifies the wait.
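
The ratios in this section can be reproduced from the Results table. This is a small sketch of the arithmetic only; the per-token figure is total query cost divided by output tokens, so it blends input and output cost and is not a published per-token price. The function names are hypothetical.

```python
# Figures copied from the Results table.
RESULTS = {
    "Claude Haiku 4.5": {"cost_usd": 0.0012, "output_tokens": 198},
    "Claude Sonnet 4.6": {"cost_usd": 0.00614, "output_tokens": 368},
    "Claude Opus 4.7": {"cost_usd": 0.02093, "output_tokens": 222},
}


def blended_cost_per_output_token(model):
    """Total query cost divided by output tokens (input cost is folded in)."""
    row = RESULTS[model]
    return row["cost_usd"] / row["output_tokens"]


def query_cost_ratio(model, baseline):
    """How many times more one query costs on `model` than on `baseline`."""
    return RESULTS[model]["cost_usd"] / RESULTS[baseline]["cost_usd"]


for name in RESULTS:
    print(f"{name}: ${blended_cost_per_output_token(name):.7f} per output token")

print(f"Sonnet vs. Haiku per query: {query_cost_ratio('Claude Sonnet 4.6', 'Claude Haiku 4.5'):.1f}x")
print(f"Opus vs. Haiku per query:   {query_cost_ratio('Claude Opus 4.7', 'Claude Haiku 4.5'):.1f}x")
print(f"Opus vs. Sonnet per query:  {query_cost_ratio('Claude Opus 4.7', 'Claude Sonnet 4.6'):.1f}x")
```

Per query, Opus costs about 3.4x as much as Sonnet; the gap widens to about 5.6x on the blended per-output-token figure because Sonnet spread its cost over more output tokens.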

## Takeaways

1. **Handling ambiguity in structured extraction**: All three models correctly extracted confirmed products and inferred release years from relative time references. Divergence occurred only on products explicitly flagged as rumored or unconfirmed. Sonnet 4.6's choice to include such items with reasoning is more useful for applications tracking future products, but the choice is debatable and application-dependent. Including such entries should be configurable via prompt clarification.

2. **Cost-per-correct-answer favors Sonnet 4.6**: Opus 4.7 costs about 3.4x as much per query as Sonnet 4.6 ($0.02093 vs. $0.00614) despite producing less output. For identical extractions (confirmed products), Opus offers no advantage and only adds cost. Sonnet delivers equivalent accuracy on core extractions with the option to include speculative products, at a fraction of Opus's price.

3. **Output verbosity as a feature, not a bug**: Sonnet's inclusion of reasoning notes (368 tokens vs. 222 for Opus) enabled auditing of judgment calls. In production extraction pipelines, explainability often justifies modest token overhead, especially when cost-per-token is favorable. Haiku's and Opus's silent omissions leave ambiguity about whether the Surface device was missed or deliberately excluded.

4. **Latency vs. throughput trade-off**: Sonnet's longer latency (6148ms) is a non-issue for batch extraction jobs but may matter for real-time systems. Haiku's low latency (2548ms, the lowest of the three) and low cost ($0.0012) make it viable for latency-sensitive, high-volume use cases where reasoning is unnecessary and recall of unconfirmed products is not required.
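
Takeaway 1 suggests making the handling of rumored products configurable through the prompt. One way this could look is sketched below. The base instructions are the benchmark prompt verbatim; the two policy sentences and the `build_prompt` helper are hypothetical wording that was not tested in this benchmark.

```python
# Instructions copied from the Task section.
BASE_INSTRUCTIONS = (
    "Extract every product mentioned in the text below into a JSON array. "
    "Each entry must have: name (string), price_usd (number or null if absent), "
    'release_year (int or null), category (one of: "laptop", "phone", "tablet", '
    '"wearable", "other"). '
    "Do not include anything that isn't a specific product."
)

# Hypothetical clarifications; the wording is an assumption, not benchmarked.
RUMOR_POLICIES = {
    "exclude": "Exclude products that are described as rumored or unconfirmed.",
    "include": "Include rumored or unconfirmed products, using null for unknown fields.",
}


def build_prompt(text, rumor_policy=None):
    """Assemble the extraction prompt, optionally stating how to treat rumors."""
    if rumor_policy is not None and rumor_policy not in RUMOR_POLICIES:
        raise ValueError(f"unknown rumor_policy: {rumor_policy!r}")

    parts = [BASE_INSTRUCTIONS]
    if rumor_policy is not None:
        parts.append(RUMOR_POLICIES[rumor_policy])
    parts.append(f'Text: "{text}"')
    return "\n\n".join(parts)


print(build_prompt("Someone mentioned a rumored foldable device.", rumor_policy="exclude"))
```

With `rumor_policy=None` the function reproduces the original, ambiguous prompt, which keeps the benchmark condition available as a baseline when testing either clarification.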

## Further reading

- [Anthropic's Claude model documentation](https://docs.anthropic.com/claude/reference/models-overview) - Official specifications for Claude Haiku 4.5, Sonnet 4.6, and Opus 4.7 performance and pricing.
- [Anthropic Python SDK](https://github.com/anthropics/anthropic-sdk-python) - Anthropic's official Python SDK, a starting point for building structured extraction calls.
- [Benchmark methodology for LLM evaluation](https://arxiv.org/abs/2310.08521) - Academic reference on reproducible LLM benchmarking approaches.

## Frequently asked

### Should models include unconfirmed or rumored products in extractions?

The prompt said not to include 'anything that isn't a specific product' but did not define 'specific'. Sonnet 4.6 included the Surface foldable (rumored, 2026 release) while Haiku and Opus omitted it. Because the prompt does not explicitly prohibit rumored items, inclusion is a reasonable reading of 'specific', and so is exclusion. Inclusion adds value for applications tracking product roadmaps; the safest fix is to state the desired behavior in the prompt.

### How did models infer release years from relative time references?

The text said Pixel 9 Pro 'came out last year'. All three models returned 2024, which implies they placed the conversation in 2025. That reading is consistent with the rest of the text, which treats a 2026 Surface release as still in the future.

### What was the cost difference between models for identical output?

Sonnet 4.6 cost $0.00614 (368 output tokens) while Opus 4.7 cost $0.02093 (222 output tokens) for largely the same extraction. Opus used fewer tokens but at a much higher per-token rate. Sonnet's extra output (explicit reasoning notes) added clarity without sacrificing accuracy.

### Did any model fail to extract a confirmed product?

No. All three models correctly extracted Framework 16, iPad Mini, Apple Watch Ultra 2, and Pixel 9 Pro with matching prices and categories. Differences arose only in handling the unconfirmed Surface foldable and output verbosity.
