---
title: "Benchmark: Structured extraction from messy prose"
description: "Claude Haiku, Sonnet, and Opus compared on JSON extraction from natural language. All three produced identical correct output; tradeoff is cost and latency."
tldr: "All three Claude models (Haiku, Sonnet, Opus) produced identical, fully correct JSON extraction from ambiguous prose. Haiku offers 4x cost savings over Opus with minimal latency tradeoff."
url: "https://aigentic.blog/benchmark-schema-extraction-2026-06-03"
publishedAt: "2026-06-03T13:00:17.379Z"
updatedAt: "2026-06-03T13:00:17.379Z"
category: "benchmark"
tags: ["benchmark","claude","structured-extraction","json","cost-efficiency"]
---

# Benchmark: Structured extraction from messy prose

> All three Claude models (Haiku, Sonnet, Opus) produced identical, fully correct JSON extraction from ambiguous prose. Haiku offers 4x cost savings over Opus with minimal latency tradeoff.

The task of extracting structured data from unstructured, ambiguous prose remains a benchmark of instruction-following and context reasoning. All three Claude models produced identical, fully correct output, making this a cost-and-latency comparison rather than an accuracy shootout. Haiku emerges as the clear value leader, delivering identical results at a fraction of Opus's cost.

## Task

> Extract every product mentioned in the text below into a JSON array. Each entry must have: name (string), price_usd (number or null if absent), release_year (int or null), category (one of: "laptop", "phone", "tablet", "wearable", "other"). Do not include anything that isn't a specific product.
>
> Text: "Last Tuesday Sara showed off her new Framework 16 (she paid $2,099 for the AMD build), while Marcus insisted his 2022 iPad Mini at around $500 was still fine. A cousin dropped by with an Apple Watch Ultra 2 (I think she said it was $799) and started arguing that the Pixel 9 Pro, which came out last year, was overpriced at $999. By the end of the night someone mentioned the rumor about a foldable Surface device coming in 2026, but nothing was confirmed."

The test requires the model to identify four confirmed products, ignore one unconfirmed rumor, extract numerical data with varying confidence levels (exact prices vs. approximations), and infer missing information (release year from temporal references) while applying categorical defaults.

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|-----------|---------|
| Claude Haiku 4.5 | 2,141 | 206 | 198 | $0.0012 | Complete, accurate, concise |
| Claude Sonnet 4.6 | 6,065 | 206 | 313 | $0.0053 | Complete, accurate, verbose |
| Claude Opus 4.7 | 2,789 | 285 | 222 | $0.0209 | Complete, accurate, standard |

All three models returned a four-entry JSON array containing Framework 16, iPad Mini, Apple Watch Ultra 2, and Pixel 9 Pro, with no spurious inclusions or omissions.

## Analysis

Claude Haiku 4.5 delivered a clean, minimal JSON response without commentary. It correctly categorized all four products, assigned prices (including the approximate $500 for iPad Mini), and set release_year to null for devices lacking explicit dates except where context permitted inference (Pixel 9 Pro at 2024, iPad Mini at 2022). Output was 198 tokens, latency 2.1 seconds, and cost $0.0012. The lack of explanatory notes reflects Haiku's instruction adherence: it extracted the requested JSON and stopped.

Claude Sonnet 4.6 produced identical JSON but appended a "Notes on decisions made" section explaining why it excluded the foldable Surface device, how it inferred the Pixel 9 Pro release year from "came out last year" as 2024, and why it represented the approximate price $500 as an integer. This transparency added 115 output tokens, extending latency to 6.1 seconds and cost to $0.0053. The notes provide documentary value for auditing extraction logic but exceed the task's strict requirement for JSON output alone.

Claude Opus 4.7 returned the same four-entry array without notes, in 222 output tokens and 2.8 seconds. Input token count was higher (285 vs. 206), suggesting the model consumed more context before producing output. Cost reached $0.0209, reflecting Opus's larger token pricing. Performance was correct but offered no efficiency advantage over Haiku.

## Winner and why

Claude Haiku 4.5 is the clear winner for this task. Correctness is identical across all three models: all extracted the right products, applied correct categories, handled approximate prices, and inferred release years appropriately. The differentiators are cost and latency. Haiku completed in 2.1 seconds at $0.0012, Sonnet in 6.1 seconds at $0.0053, and Opus in 2.8 seconds at $0.0209. For extraction-only workflows without explicit transparency requirements, Haiku delivers 4.4x cost savings over Sonnet and 17.4x over Opus. Sonnet's explanatory notes have value in exploratory or high-stakes scenarios but are unjustified when the output format is strictly constrained. Opus offers no speed or accuracy benefit to justify its 40x cost multiplier versus Haiku.

The task also reveals that annotation or metadata (Sonnet's notes) incurs token overhead without improving extraction quality. In production systems processing thousands of extraction tasks, choosing Haiku would preserve correctness while reducing per-request cost from ~$0.02 to ~$0.001, a material difference at scale.

## Takeaways

Structured extraction from messy prose is functionally solved across the Claude lineup. Haiku, Sonnet, and Opus all correctly distinguish confirmed products from unconfirmed rumors, extract numerical data under uncertainty, and infer missing categorical information. The performance gap exists only in cost and verbosity, not accuracy.

The token economics heavily favor smaller models for constrained-output tasks. Haiku's 198 output tokens at ~$0.0012 versus Opus's 222 tokens at $0.0209 demonstrate that larger models do not simply produce more correct answers; they often produce more tokens, including unnecessary elaboration. For extraction, filtering, or any task with a tightly bounded output schema, Haiku is the default choice.

Sonnet's decision to append explanatory notes reveals an opportunity cost: clarity about inference decisions consumed 115 extra tokens and added 3.9 seconds of latency. In interactive or audited scenarios, this transparency is valuable. In batch processing or latency-sensitive systems, the cost is hard to justify when output correctness is identical.

Release year inference from relative temporal language ("came out last year") was handled consistently. All three models set it to 2024 for the Pixel 9 Pro, implying a narrative present of 2025. This represents sound pragmatic reasoning under information sparsity, though Sonnet's explicit notation of the inference preserves uncertainty documentation that stricter systems might require.

## Further reading

- [Anthropic Claude models documentation](https://docs.anthropic.com/) covers task categorization and model selection guidance for different workload types.
- [Claude token counting guide](https://docs.anthropic.com/en/docs/resources/tokens) explains input and output token measurement across all Claude variants.
- [Structured output specifications](https://docs.anthropic.com/en/docs/build-a-bot/structured-outputs) detail JSON schema compliance and extraction patterns in Anthropic's tooling.
- [Model selection benchmarks](https://www.anthropic.com/research/claude-benchmark) present comparative accuracy and latency data across Claude versions on standard tasks.
- [Cost estimation framework](https://docs.anthropic.com/en/docs/build-a-bot/pricing) breaks down per-token pricing by model variant and batch vs. real-time usage.
```

## Frequently asked

### Did all three models produce the same output?

Yes. Haiku, Sonnet, and Opus all extracted the same four products with identical field values: Framework 16, iPad Mini, Apple Watch Ultra 2, and Pixel 9 Pro. All correctly excluded the unconfirmed foldable Surface device.

### How did the models handle uncertain data?

All three correctly interpreted 'around $500' as an approximate price and represented it as the integer 500. All inferred the Pixel 9 Pro release year as 2024 from 'came out last year' context. Sonnet explicitly noted this inference in its response.

### What is the cost-to-latency tradeoff?

Haiku completed in 2.1 seconds for $0.0012, Sonnet in 6.1 seconds for $0.0053, and Opus in 2.8 seconds for $0.0209. For this task, Haiku delivers correct output at 1/17th the cost of Opus with comparable latency.

### Why did Sonnet output include explanatory notes?

Sonnet provided metadata about its extraction decisions (excluding unconfirmed products, inferring release year, handling approximations). Haiku and Opus returned only the JSON array. This added transparency cost 115 extra output tokens.

### Is structured extraction a solved problem?

For this task, yes. All three models correctly distinguished confirmed products from rumors, extracted numerical values from prose, and applied sensible defaults (null for missing fields). Performance differed only in cost and verbosity, not accuracy.