Which model completed the full 5-day itinerary most thoroughly?

Sonnet 4.6 and Opus 4.7 both delivered complete day-by-day plans with specific venues and meal recommendations. Haiku provided complete coverage but with less detail on named locations. Opus added the most contextual padding (pre-trip notes, exchange rate assumptions) but this added length without improving actionability.

Did any model exceed or stay within the $3,500 budget?

Opus 4.7 explicitly calculated a total of $2,847, leaving $653 buffer. Haiku and Sonnet presented component costs but did not provide a final verified total. Opus was the only model to close the loop and confirm budget compliance.

How did response length and token efficiency compare?

Haiku produced 1,632 output tokens, Sonnet 2,048, and Opus 1,628. Opus and Haiku were nearly identical in length. Sonnet's longer output included more detail on museums and neighborhood context, which improved itinerary quality but increased tokens. Total cost per query favored Haiku ($0.00828) over Sonnet ($0.03109) and Opus ($0.12465).

Which model provided the most specific venue recommendations?

Sonnet named specific restaurants (Ichiran, Torikizoku, Sushizanmai), museums (Tokyo National Museum), and neighborhoods with clear reasoning. Opus matched this with venues like Uobei and teamLab Planets but added more travel logistics. Haiku named fewer venues but all recommendations were plausible.

Did any model use external tools despite the no-tools constraint?

None of the models invoked tools. All relied on training knowledge to estimate costs, flight times, and transit options. The prompt explicitly forbade tool use, and all three complied.

Benchmark: Multi-step travel planning with tool use

Claude Sonnet 4.6 produces the most actionable itinerary in this multi-step travel-planning benchmark, balancing specificity, budget compliance, and cost-efficiency. Haiku sacrifices detail for speed, Opus over-explains at significantly higher token cost, and only Sonnet delivers both completeness and appropriate scope for the task.

Task

You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).

Results

Model	Latency (ms)	Input tokens	Output tokens	Cost (USD)	Verdict
Claude Haiku 4.5	15,766	122	1,632	0.00828	Complete but sparse on named venues
Claude Sonnet 4.6	40,089	122	2,048	0.03109	Complete, specific, balanced scope
Claude Opus 4.7	24,138	170	1,628	0.12465	Complete with verified budget total, over-contextual
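
The cost and latency comparisons quoted in this write-up can be recomputed directly from the table. The following is a minimal Python sketch, not part of the original benchmark harness; the `RESULTS` dictionary and the helper names `cost_ratio` and `latency_seconds` are illustrative assumptions, and the figures are copied from the rows above.

```python
# Figures copied from the Results table. The structure and helper names
# are illustrative assumptions, not part of the benchmark harness.
RESULTS = {
    "Claude Haiku 4.5": {"latency_ms": 15_766, "output_tokens": 1_632, "cost_usd": 0.00828},
    "Claude Sonnet 4.6": {"latency_ms": 40_089, "output_tokens": 2_048, "cost_usd": 0.03109},
    "Claude Opus 4.7": {"latency_ms": 24_138, "output_tokens": 1_628, "cost_usd": 0.12465},
}


def cost_ratio(model_a: str, model_b: str) -> float:
    """Return how many times more model_a cost than model_b for this query."""
    return RESULTS[model_a]["cost_usd"] / RESULTS[model_b]["cost_usd"]


def latency_seconds(model: str) -> float:
    """Convert the table's millisecond latency to seconds."""
    return RESULTS[model]["latency_ms"] / 1000


if __name__ == "__main__":
    pairs = [
        ("Claude Sonnet 4.6", "Claude Haiku 4.5"),
        ("Claude Opus 4.7", "Claude Sonnet 4.6"),
        ("Claude Opus 4.7", "Claude Haiku 4.5"),
    ]
    for a, b in pairs:
        print(f"{a} vs {b}: {cost_ratio(a, b):.2f}x cost")
    for model in RESULTS:
        print(f"{model}: {latency_seconds(model):.1f} s")
```

Keeping the raw figures in one place makes it easy to check any quoted multiple or latency gap against the table before relying on it.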

Analysis

Haiku 4.5 follows the prompt structure precisely: five days with transportation, two activities, meal recommendation, and cost estimates per day. It names specific locations (Senso-ji, Tsukiji, teamLab Borderless, Meiji Shrine, Chureito Pagoda) and includes yen-to-dollar conversions throughout. The day-by-day subtotals sum to approximately $250 excluding flights and accommodation, which fits well within the $3,500 budget. However, Haiku stops short of producing the final budget table (the output ends mid-table generation), and several venue descriptions are generic (e.g., “Kawaguchiko Music Forest Museum (optional) or Free Hiking Section”). The latency is shortest at 15.8 seconds, and the total cost is the lowest at $0.00828, making Haiku the efficiency pick for stripped-down travel scaffolds.

Sonnet 4.6 delivers the richest practical content. It opens with flight logistics (airfare ~$900-1,050, flight time and arrival estimates), then structures five full days with named restaurants (Torikizoku, Ichiran, Sushizanmai Honten), specific museums (Tokyo National Museum), and detailed venue rationale (e.g., “Ichiran is famously solo-traveler-optimized with individual booths”). Transportation options are explicit (Narita Express, Tokyo Metro 24-hour pass), and meal recommendations include context such as price points and why a venue suits the traveler profile. The output reaches 2,048 tokens, which at $0.03109 is 3.75x Haiku’s cost but delivers noticeably more actionable detail. The only shortcoming is that Sonnet, like Haiku, does not complete the final budget table; the output cuts off mid-sentence after a partial cost summary.

Opus 4.7 provides the most exhaustive preparation, including pre-trip estimates (flights, hotels, exchange rates), five complete days with activities, a full budget breakdown table that explicitly totals to $2,847 (leaving a $653 buffer), and even suggested upgrades for leftover funds. Venue specificity matches or exceeds Sonnet (Uobei Shibuya Dogenzaka, Tsukiji Sushiko, Marion Crepes, Ginza Six). The day-by-day structure is clean, and Opus is the only one of the three whose final deliverable (the budget table) is actually completed and verified against the $3,500 target. However, Opus incurs the highest cost at $0.12465, approximately 15x Haiku’s price. Because its output length is almost identical to Haiku’s (1,628 vs. 1,632 tokens), the gap is driven mainly by higher per-token pricing for the Opus family, with a small contribution from the larger input token count (170 vs. 122). The latency is moderate at 24.1 seconds. The output includes interpretive flourishes (exchange rate assumptions, pre-trip summary, travel insurance and SIM costs) that, while thorough, exceed the task’s explicit scope.
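
The budget closure described for Opus is simple arithmetic, and it is the check the other two outputs never reached. Below is a minimal Python sketch of that check; the function name `budget_check` is a hypothetical helper for illustration, and only the reported total ($2,847) and the task budget ($3,500) come from the benchmark.

```python
def budget_check(total_usd: int, budget_usd: int) -> dict:
    """Compare a reported trip total against the budget cap.

    Hypothetical helper: returns whether the total fits and how much
    buffer remains (negative if the total exceeds the budget).
    """
    buffer_usd = budget_usd - total_usd
    return {"within_budget": buffer_usd >= 0, "buffer_usd": buffer_usd}


# Opus's reported total checked against the task's $3,500 cap.
opus_result = budget_check(total_usd=2_847, budget_usd=3_500)
print(opus_result)
```

The same check cannot be run on the Haiku or Sonnet outputs as reported, because neither produced a final total to compare.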

Winner and why

Sonnet 4.6 is the optimal choice for this task. It delivers a complete day-by-day itinerary with specific, named venues, plausible cost estimates, and clear reasoning for each recommendation. The 2,048 tokens and $0.031 cost represent a sensible middle ground: Haiku’s truncated budget table and sparse venue detail disqualify it despite its lower cost, while Opus’s $0.125 per-query price is difficult to justify when Sonnet covers the task at about 25% of the cost with comparable specificity. Sonnet also finishes in 40 seconds, only 24 seconds longer than Haiku, so latency is not a decisive factor. The one caveat is that Sonnet also fails to complete the final budget table, a limitation it shares with Haiku; Opus is the sole model to deliver a verified total. For a task where completeness of detail and cost matter equally, Sonnet’s balance is decisive.

Takeaways

1. Budget closure matters. Opus was the only model to deliver a complete budget breakdown table and verify the total against the $3,500 constraint. Haiku and Sonnet both truncated their outputs before finalizing the budget summary, which is a material gap in a task that explicitly requires a totaling table. This suggests that a higher output-token limit (or an explicit length constraint in the prompt) may help longer-response models reach closure.

2. Specificity scales with output tokens but not proportionally to cost. Sonnet’s 2,048 tokens cost 3.75x Haiku’s, yet both delivered specific venue names and plausible costs. Opus’s roughly 4x higher cost over Sonnet yielded marginal gains in contextual padding rather than substantive detail. For travel planning, the returns on model scale diminish beyond the mid-range model; Sonnet buys more additional detail per extra dollar than Opus does.

3. Named venues and meal recommendations require explicit instruction enforcement. All three models produced venue names and meal costs, but Haiku’s brevity (e.g., “Museum or scenic walk”) left solo planners with less actionable direction than Sonnet’s rationale (e.g., “Ichiran is famously solo-traveler-optimized”). This suggests that specificity is not automatic in smaller models even when prompted; it also depends on how much detail the model generates per recommendation.

4. Latency is not a blocker for multi-step planning tasks. Opus at 24 seconds and Sonnet at 40 seconds both complete within plausible constraints for travel planning. Haiku’s advantage at roughly 16 seconds (8 seconds faster than Opus, 24 seconds faster than Sonnet) is immaterial when the output quality gap is significant. For multi-step reasoning (flight costs, day-by-day scheduling, budget arithmetic), latency ranks below output completeness.
