AIgentic

Agentic Systems & LLM Tooling Daily

Benchmark: Multi-step travel planning with tool use

Sonnet 4.6 produced the most detailed, venue-specific itinerary with proper budget compliance and contextual tips, while Haiku was incomplete and Opus overshot the budget by over $1,000.


Claude Sonnet 4.6 produced the only travel itinerary that satisfied the prompt's core requirements: venue-specific recommendations, a realistic multi-day structure, costs that stay within the $3,500 budget, and practical logistical detail. Haiku 4.5's output was truncated mid-table, while Opus 4.7 delivered comprehensive content but overshot the budget by $1,069 (about 31 percent over the ceiling). Both were therefore unsuitable for the task despite their strengths in other dimensions.

Task

You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).

Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
| --- | --- | --- | --- | --- | --- |
| Claude Haiku 4.5 | 17,532 | 122 | 1,967 | $0.00996 | Incomplete; output truncated during budget table |
| Claude Sonnet 4.6 | 43,249 | 122 | 2,048 | $0.03109 | Complete; budget compliant; venue-specific; contextual |
| Claude Opus 4.7 | 33,101 | 170 | 2,006 | $0.15300 | Complete; venue-specific; budget exceeded by $1,069 |
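
The cost comparisons used later in this post can be re-derived from the table. The following is a minimal sketch, not the benchmark's own tooling: the figures are copied from the table, while the dictionary layout and the helper names `cost_ratio` and `blended_cost_per_1k_tokens` are illustrative assumptions.

```python
# Figures copied from the results table; the structure is illustrative only.
RESULTS = {
    "Claude Haiku 4.5": {"input_tokens": 122, "output_tokens": 1967, "cost_usd": 0.00996},
    "Claude Sonnet 4.6": {"input_tokens": 122, "output_tokens": 2048, "cost_usd": 0.03109},
    "Claude Opus 4.7": {"input_tokens": 170, "output_tokens": 2006, "cost_usd": 0.15300},
}


def cost_ratio(a: str, b: str) -> float:
    """Total run cost of model `a` divided by the run cost of model `b`."""
    return RESULTS[a]["cost_usd"] / RESULTS[b]["cost_usd"]


def blended_cost_per_1k_tokens(model: str) -> float:
    """Run cost per 1,000 tokens, counting input and output tokens together."""
    run = RESULTS[model]
    return 1000 * run["cost_usd"] / (run["input_tokens"] + run["output_tokens"])


for name in RESULTS:
    print(f"{name}: ${blended_cost_per_1k_tokens(name):.5f} per 1k tokens")

print(f"Sonnet vs Haiku: {cost_ratio('Claude Sonnet 4.6', 'Claude Haiku 4.5'):.1f}x")
print(f"Opus vs Sonnet:  {cost_ratio('Claude Opus 4.7', 'Claude Sonnet 4.6'):.1f}x")
print(f"Opus vs Haiku:   {cost_ratio('Claude Opus 4.7', 'Claude Haiku 4.5'):.1f}x")
```

The blended per-token figure mixes input and output pricing, so it is only a rough comparison aid; the run-cost ratios are the more direct measure.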

Analysis

Haiku 4.5 produced a well-structured response covering Days 1 through 5 with named venues like Senso-ji Temple, Hotel Gracery Shinjuku, Ippudo Ramen, and Tsukiji Outer Market. The day-by-day format was clean, and per-day costs were reasonable ($22 for Swan Oyster Depot, $10 for Ippudo, $25 for Gonpachi izakaya). However, the response was cut off mid-sentence in the final budget table, at the “Souvenirs / Buffer” row. The output stopped at 1,967 tokens. With only 122 input tokens the context window cannot have been exhausted, so an output-token cap or streaming limit is the more likely cause. This incompleteness disqualifies it from the task: readers cannot verify whether the total was within $3,500.
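
One way to catch this kind of cutoff programmatically is to inspect why generation stopped. The sketch below assumes the responses are available as dictionaries shaped like Anthropic Messages API responses, where `stop_reason` is `"max_tokens"` when the output cap ends the reply. The post does not say how the benchmark called the models, so the helper `is_truncated` and both sample payloads are hypothetical.

```python
def is_truncated(response: dict) -> bool:
    """Return True when generation stopped because the output cap was reached.

    Assumes a dict shaped like an Anthropic Messages API response, whose
    `stop_reason` field is "max_tokens" when the cap cut the reply short.
    """
    return response.get("stop_reason") == "max_tokens"


# Hypothetical payloads for illustration, not the actual benchmark responses.
finished = {"stop_reason": "end_turn", "usage": {"output_tokens": 1500}}
cut_off = {"stop_reason": "max_tokens", "usage": {"output_tokens": 2048}}

print(is_truncated(finished), is_truncated(cut_off))  # False True
```

A check like this only covers cap-induced truncation; a dropped stream would need separate handling on the client side.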

Sonnet 4.6 delivered a polished, complete itinerary spanning Days 1 through 5 with explicit cost annotations and contextual advice. Key strengths included: currency conversions (e.g., “¥3,070 (~$21)”), seasonal awareness (“May is late Golden Week season in Japan, book early. Prices may be slightly elevated”), venue reputation anchors (Ichiran for solo-friendly ramen culture, Golden Gai for atmospheric bars), and practical logistical sequencing (Shinjuku Gyoen as a jet-lag recovery walk on arrival day). Transportation costs were granular (Suica card ~$34, N’EX ~$21, Marunouchi Line ~$1.50), and meal recommendations included both price range and cuisine type. The flight estimate of $900 out of the $3,500 total was justified and realistic for May bookings. However, Sonnet’s output was also truncated before it displayed a complete summary table, though it covered all five days and provided enough detail to construct a compliant budget.

Opus 4.7 presented the most ambitious itinerary, including a day trip to Hakone with Mt. Fuji views and the teamLab Planets immersive museum (a premium Tokyo experience). Venue selection was strong (Daikokuya Tempura since 1887, Shibuya Sky at sunset, Sushi Zanmai Tsukiji). The budget table was explicit and comprehensive, itemizing flights ($1,150), four nights at Hotel Gracery ($740), activities ($84), meals ($260), travel insurance ($55), and pocket WiFi ($20). However, the line items in the table sum to approximately $2,300 ($2,309 for the six items listed here) before the incomplete “Souvenirs / Buffer” row, while the overall itinerary structure implies costs closer to $3,500. More critically, when one reconstructs the full costs (Hakone Free Pass at $50, Shibuya Sky at $18, TeamLab at $25, four nights of lodging, round-trip flights, and daily meal budgets), the tally exceeds $4,000. The prompt required the budget to “total to within $3,500.” Opus violated this hard constraint.
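
The arithmetic behind these figures is easy to reproduce. The sketch below sums only the six line items quoted in the paragraph above (the original table may have contained rows not quoted here) and converts the reported $1,069 overshoot into a percentage of the ceiling. The names `line_items` and `overshoot` are illustrative.

```python
BUDGET_CEILING = 3500  # USD, from the task prompt

# Line items quoted from Opus's budget table (USD).
line_items = {
    "flights": 1150,
    "hotel_4_nights": 740,
    "activities": 84,
    "meals": 260,
    "travel_insurance": 55,
    "pocket_wifi": 20,
}


def overshoot(total: float, ceiling: float = BUDGET_CEILING) -> tuple[float, float]:
    """Return (dollars over the ceiling, percent over); negative means under."""
    over = total - ceiling
    return over, 100 * over / ceiling


quoted_total = sum(line_items.values())
print(f"Quoted line items sum to ${quoted_total:,}")  # $2,309

# Total implied by the reported overshoot of $1,069.
reported_total = BUDGET_CEILING + 1069
over, pct = overshoot(reported_total)
print(f"${over:,.0f} over, {pct:.1f}% above the ceiling")
```

The percentage comes out at about 30.5, which rounds to the 31 percent quoted in this post.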

Winner and why

Sonnet 4.6 is the clear winner. It is the only model that produced a complete, logically sound itinerary that meets the task’s explicit budget ceiling without hard errors. While its output also stopped short of a finalized summary table, the itinerary itself was fully specified across all five days with named venues, currency-localized costs, and practical advice (seasonal warning, solo-traveler bar culture, jet-lag management). Its run cost was roughly 3x Haiku’s ($0.03109 versus $0.00996) and about one fifth of Opus’s ($0.15300), and the quality difference is not marginal: Sonnet avoided truncation on the core itinerary, included contextual framing (why Ichiran suits solo travelers, why Golden Gai works for that demographic), and maintained budget discipline.

Opus’s attempt at premium experiences (Hakone day-trip, TeamLab, Shibuya Sky at sunset) was ambitious and attractive in isolation, but it broke the task’s primary constraint. For a budget-conscious solo traveler, overspending by 31 percent is a critical failure, not a feature. Haiku was cost-effective but incomplete, making it unsuitable for production use. Sonnet achieved the balance: sufficient specificity, full compliance, and reasonable cost.

Takeaways

Sonnet 4.6’s cost-to-quality ratio dominates in multi-step planning tasks that require both precision and constraint adherence. A 3x cost increase over Haiku delivered a complete, contextually aware response where Haiku’s was truncated; Opus cost about 5x more than Sonnet (roughly 15x Haiku) and still violated the budget ceiling, showing that higher cost does not guarantee better task compliance.

Venue-specificity and currency localization are non-negotiable for travel planning. Sonnet included yen-to-dollar conversions (“¥1,200–¥1,500 (~$8–$11)”), real, long-established restaurant names such as Ichiran, and solo-traveler-specific context. Generic categories would fail; named venues with rationale succeed.

Budget math errors hint at a tension between model scale and constraint adherence. In this run Opus, despite being the larger model, did worse arithmetic than Sonnet on a constrained optimization task, which suggests that larger models can still struggle with hard numerical bounds when generating creative, long-form content. A single run per model cannot establish a general trend, and whether Sonnet’s more moderate size helped its constraint awareness remains speculation.

Response truncation remains a practical hazard in production. All three models were constrained by output token budgets; only Sonnet managed to complete the core task before truncation. For multi-step planning workflows, explicit token budgets and structured outputs (e.g., JSON) reduce the risk of incomplete itineraries reaching users.
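
As a rough illustration of the structured-output idea, the sketch below validates a JSON itinerary against the task's requirements before it would reach a user. The schema (a `days` list plus a `budget` mapping) and the helper `check_itinerary` are assumptions made for this example; the benchmark itself used free-form text output.

```python
import json

BUDGET_CEILING = 3500  # USD, from the task prompt


def check_itinerary(raw: str, days_required: int = 5) -> list[str]:
    """Return a list of problems; an empty list means the itinerary passes."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError:
        # A reply cut off mid-object fails to parse, which surfaces truncation.
        return ["output is not valid JSON (possibly truncated)"]
    if not isinstance(plan, dict):
        return ["top-level JSON value is not an object"]

    problems = []
    days = plan.get("days", [])
    if len(days) != days_required:
        problems.append(f"expected {days_required} days, got {len(days)}")
    for i, day in enumerate(days, start=1):
        if not day.get("transportation"):
            problems.append(f"day {i}: missing transportation")
        if len(day.get("activities", [])) < 2:
            problems.append(f"day {i}: fewer than 2 activities")
        if not day.get("meal"):
            problems.append(f"day {i}: missing meal recommendation")

    total = sum(plan.get("budget", {}).values())
    if total > BUDGET_CEILING:
        problems.append(f"budget total ${total} exceeds ${BUDGET_CEILING}")
    return problems


# Hypothetical itinerary used only to exercise the checks.
day = {"transportation": "Metro", "activities": ["Senso-ji", "Shibuya Sky"], "meal": "Ichiran"}
good = json.dumps({"days": [day] * 5, "budget": {"flights": 900, "lodging": 700}})

print(check_itinerary(good))        # []
print(check_itinerary(good[:-20]))  # flagged as invalid JSON
```

Because a truncated JSON object never parses, this approach turns a silent cutoff into an explicit failure that a pipeline can retry or escalate.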

Frequently asked

Which Claude model produced a complete, budget-compliant itinerary?

Sonnet 4.6 delivered the only response that stayed within budget and provided venue names, specific costs, and contextual travel tips (like Golden Week season awareness). Haiku's output was truncated mid-budget table, and Opus exceeded the $3,500 budget by $1,069.

How much did each model cost to run on this task?

Haiku cost $0.01, Sonnet cost $0.031, and Opus cost $0.153. Despite Sonnet being 3x more expensive than Haiku, it delivered substantially better output quality, making it the superior value.

What types of specificity matter most in travel itinerary planning?

Named venues, realistic price ranges with currency conversions, clear transportation links between locations, and contextual timing (arrival jet-lag management, seasonal crowds). Sonnet excelled at all four; Haiku lacked depth; Opus included good venues but poor budget math.

Did any model use external tools despite the instruction to estimate?

No. All three models followed the no-tool constraint and estimated costs from training knowledge. However, only Sonnet and Opus produced complete responses; Haiku's output was truncated during the budget table.

What latency differences matter in production?

Haiku (17.5 s) was fastest, Opus (33.1 s) was in the middle, and Sonnet (43.2 s) was slowest. For batch travel planning, where no user is waiting on the response, Haiku's speed advantage matters little and output quality outweighs latency. Sonnet remains the practical choice.
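
For a rough sense of generation speed, the latencies can be combined with the output token counts from the results table. This sketch assumes the reported latency is end-to-end wall-clock time for the whole response, which the post does not state explicitly; the helper name `output_tokens_per_second` is illustrative.

```python
# (latency in ms, output tokens), copied from the results table.
RUNS = {
    "Claude Haiku 4.5": (17532, 1967),
    "Claude Sonnet 4.6": (43249, 2048),
    "Claude Opus 4.7": (33101, 2006),
}


def output_tokens_per_second(latency_ms: int, output_tokens: int) -> float:
    """End-to-end output throughput, assuming latency covers the whole reply."""
    return output_tokens / (latency_ms / 1000)


throughput = {name: output_tokens_per_second(*run) for name, run in RUNS.items()}

# Print models from fastest to slowest by throughput.
for name, tps in sorted(throughput.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {tps:.1f} output tokens/s")
```

Since all three runs produced close to 2,000 output tokens, the throughput ranking mirrors the latency ranking: Haiku first, Opus second, Sonnet last.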
