Benchmark: Multi-step travel planning with Claude models
Claude Sonnet 4.6 delivers the most usable travel itinerary for this task, with specific venue names and realistic pricing at about 3x Haiku’s cost. Haiku is faster and cheaper, but its output truncates partway through Day 5. Opus produces an exhaustive, auditable budget table at 14 times Haiku’s cost, with only marginal gains in usability.
Task
You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 20,388 | 122 | 2,005 | $0.0102 | Complete structure, truncated Day 5 |
| Claude Sonnet 4.6 | 45,899 | 122 | 2,047 | $0.0311 | Complete, specific venues, realistic costs |
| Claude Opus 4.7 | 30,713 | 170 | 1,887 | $0.1441 | Complete with detailed budget table, over-engineered |
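The cost and latency multiples quoted in this write-up follow directly from the table. A minimal sketch that recomputes them relative to Haiku, using only the figures above; the dictionary layout and variable names are illustrative, not part of the benchmark harness:

```python
# Figures copied from the results table: (latency_ms, cost_usd).
RESULTS = {
    "Claude Haiku 4.5": (20_388, 0.0102),
    "Claude Sonnet 4.6": (45_899, 0.0311),
    "Claude Opus 4.7": (30_713, 0.1441),
}

base_latency, base_cost = RESULTS["Claude Haiku 4.5"]

# Ratio of each model's latency and cost to Haiku's.
ratios = {
    model: (latency / base_latency, cost / base_cost)
    for model, (latency, cost) in RESULTS.items()
}

for model, (latency_x, cost_x) in ratios.items():
    print(f"{model}: {latency_x:.2f}x latency, {cost_x:.1f}x cost")
```

On these numbers Sonnet comes out at roughly 3x Haiku's cost and 2.3x its latency, and Opus at roughly 14x the cost and 1.5x the latency.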
Analysis
Claude Haiku 4.5 follows the required structure cleanly: five days with named venues, transportation costs, two activities per day with prices, and meal recommendations. It includes specific names like Ichiran Ramen, Senso-ji Temple, Meiji Shrine, and Hakone Open-Air Museum. The accommodation choice (Nine Hours capsule hotel at $60/night) is realistic for a budget traveler. However, the response stops mid-sentence on Day 5, while describing the return trip to Haneda Airport, so the remaining Day 5 activities and the final budget total are left unspecified. The per-day spending that is shown (roughly $163 across Days 1-4) is consistent with staying under the $3,500 cap, but without the budget table the total cannot be confirmed.
Claude Sonnet 4.6 delivers a complete, well-reasoned itinerary with extensive practical context. It names venues consistently (Ichiran Ramen Shinjuku, Senso-ji Temple, Shinjuku Gyoen, Godzilla Road, Don Quijote, Nakamise-dori, Tsukiji, Imperial Palace East Gardens, Sushi Zanmai Ginza, Hakone Open-Air Museum, Hatsuhana Soba, teamLab Planets, Akihabara). The response includes sensible contextual notes, such as arrival timing (the flight technically lands on May 11, but the itinerary budgets it as Day 1), weather conditions in May (18-24°C, cherry blossoms mostly gone), and practical advice on the Suica card. Accommodation at a “stylish boutique hotel” (Shinjuku Granbell at $100/night) costs more than Haiku’s pick but is reasonable for a solo traveler seeking comfort. The flight estimate of $900-1,100, rounded to about $1,000, is plausible. The response does not include the consolidated final budget table the prompt asked for, though it provides daily estimates that appear to stay within range.
Claude Opus 4.7 provides the most exhaustive response, including a complete formatted budget table totaling exactly $3,500. It specifies five nights at Hotel Mystays Asakusa-Bashi at $110/night ($550 total), ANA or United flights at $1,150, and itemizes every meal (Daikokuya Tempura, Ichiran Ramen, Sushi Zanmai, Hatsuhana Soba, Kanda Matsuya) and activity (Shibuya Sky $18, Hakone Open-Air Museum $13, teamLab Planets $25). The table format is precise and easy to audit. However, at 1,887 output tokens the response is shorter than Haiku’s or Sonnet’s, yet it took 31 seconds (about 1.5x Haiku’s latency, though less than Sonnet’s 46 seconds) and cost $0.1441. Latency in this run is therefore not proportional to output length. The budget structure is impeccable, but the day-by-day narrative is compressed relative to Sonnet’s explanatory detail.
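One way to see how latency relates to output length is to convert the table into end-to-end throughput. The sketch below assumes the latency column is wall-clock time for the whole response (so it includes time to first token); the function and variable names are illustrative:

```python
# Figures copied from the results table: (output_tokens, latency_ms).
RUNS = {
    "Claude Haiku 4.5": (2_005, 20_388),
    "Claude Sonnet 4.6": (2_047, 45_899),
    "Claude Opus 4.7": (1_887, 30_713),
}


def tokens_per_second(output_tokens: int, latency_ms: int) -> float:
    """End-to-end output throughput, assuming latency covers the full response."""
    return output_tokens / (latency_ms / 1000)


throughput = {model: tokens_per_second(*run) for model, run in RUNS.items()}

for model, tps in throughput.items():
    print(f"{model}: {tps:.0f} output tokens/s")
```

Under that assumption Haiku runs at roughly 98 tokens per second, Opus at roughly 61, and Sonnet at roughly 45, so the models differ in generation speed and not only in how much they wrote.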
Winner and why
Sonnet 4.6 is the best output for this task. It is complete (no truncation), provides venue specificity across all five days, includes practical contextual guidance (weather, transportation booking tips, district geography), and costs about 3x as much as Haiku while delivering a response that is slightly longer and more actionable. Its one gap is the consolidated budget table the prompt asked for; it provides daily estimates instead. Haiku’s truncation is disqualifying for a planning task where completeness matters. Opus provides a more formally accurate budget table, but at 14x Haiku’s cost and more than 4x Sonnet’s; the extra precision in the table does not justify that premium, and the day-by-day narrative is more compressed. Latency does not favor Sonnet: at about 46 seconds it was the slowest of the three (Opus took about 31 seconds, roughly 1.5x Haiku’s 20), so the case for it rests on completeness, specificity, and cost. On those criteria Sonnet strikes the intended balance: specific named venues, realistic pricing, practical context, and full coverage without over-engineering.
Takeaways
Truncation remains a critical failure mode for planning tasks. Haiku’s output cuts off mid-response, leaving Day 5 activities and the final budget total unspecified. For multi-step itineraries, even the fastest model must be reliably complete; truncation invalidates the entire deliverable.
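The write-up does not say why Haiku's response stopped early. One common cause is reaching the output-token cap, which the Anthropic Messages API reports as a `stop_reason` of `"max_tokens"`. A minimal sketch of a completeness check a harness could run before scoring, assuming the response is available as a plain dict; `is_truncated` and `ends_mid_sentence` are hypothetical helper names, and the punctuation heuristic is only a rough fallback:

```python
def is_truncated(response: dict) -> bool:
    """Return True if generation stopped because it hit the max_tokens limit."""
    return response.get("stop_reason") == "max_tokens"


def ends_mid_sentence(text: str) -> bool:
    """Rough heuristic: the text does not end with sentence or table punctuation."""
    stripped = text.rstrip()
    return bool(stripped) and stripped[-1] not in ".!?|)"


# Hypothetical response fragments, for illustration only.
cut_off = {"stop_reason": "max_tokens", "text": "Take the Keikyu Line to"}
finished = {"stop_reason": "end_turn", "text": "| Total | $3,500 |"}

for name, resp in (("cut_off", cut_off), ("finished", finished)):
    print(name, is_truncated(resp), ends_mid_sentence(resp["text"]))
```

A run flagged by either check could be retried with a higher output cap before it is compared against the other models.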
Venue specificity is non-negotiable for travel planning. All three models named specific restaurants and attractions rather than generic categories. Sonnet and Opus both exceeded this requirement consistently; Haiku met it before truncating. This suggests the models have strong training data on Tokyo landmarks and can reliably avoid placeholder language.
Cost per completed task, not cost per token, determines practical value. Opus costs about 14x as much as Haiku ($0.144 vs $0.010) yet generates fewer tokens and takes about 1.5x as long. Sonnet costs about 3x as much as Haiku and produces a response that is slightly longer and more useful. For planning tasks, the cheapest model that delivers completeness and specificity wins. Haiku fails this test due to truncation; Opus overpays for marginal gains in table formatting.
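The selection rule in this takeaway, "cheapest model that delivers completeness," can be sketched as a filter followed by a minimum. The `complete` flags below are one reading of the verdicts in the results table, not a measured field, and the `Run` type is hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    model: str
    cost_usd: float
    complete: bool  # interpretation of the Verdict column, not a measured value


RUNS = [
    Run("Claude Haiku 4.5", 0.0102, complete=False),  # truncated on Day 5
    Run("Claude Sonnet 4.6", 0.0311, complete=True),
    Run("Claude Opus 4.7", 0.1441, complete=True),
]


def cheapest_complete(runs: list[Run]) -> Optional[Run]:
    """Return the lowest-cost run that finished, or None if none finished."""
    candidates = [r for r in runs if r.complete]
    return min(candidates, key=lambda r: r.cost_usd) if candidates else None


winner = cheapest_complete(RUNS)
print(winner.model if winner else "no complete run")
```

Specificity is left out of the filter because all three models met that requirement in this run; a stricter harness would add it as a second predicate.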
Realistic exchange-rate assumptions matter for budgeting. All three models assumed a rate of 140-150 JPY per USD, a plausible range for a 2026 trip. Sonnet and Opus both stayed within the $3,500 constraint; Haiku’s incomplete output makes this unverifiable. For travel planning, an itinerary that exceeds budget is worse than one that is slightly compressed.
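To gauge how much the exchange-rate assumption matters, the sketch below converts a yen price at both ends of the 140-150 range. The ¥10,000 amount is a hypothetical example, not a figure from any model's output, and `yen_to_usd` is an illustrative helper:

```python
LOW_RATE = 140.0   # JPY per USD, lower end of the range the models assumed
HIGH_RATE = 150.0  # JPY per USD, upper end


def yen_to_usd(amount_jpy: float, jpy_per_usd: float) -> float:
    """Convert a yen amount to US dollars at the given rate."""
    return amount_jpy / jpy_per_usd


# Relative gap between USD estimates made at the two ends of the range.
spread = HIGH_RATE / LOW_RATE - 1

example_jpy = 10_000  # hypothetical price, for illustration only
usd_at_140 = yen_to_usd(example_jpy, LOW_RATE)   # each dollar buys fewer yen
usd_at_150 = yen_to_usd(example_jpy, HIGH_RATE)  # each dollar buys more yen

print(f"spread: {spread:.1%}")
print(f"{example_jpy} JPY = ${usd_at_140:.2f} at 140, ${usd_at_150:.2f} at 150")
```

The two ends of the range differ by about 7%, which is the size of the error band to keep in mind when checking an itinerary's yen-denominated costs against a dollar budget.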
Frequently asked
Which Claude model handled the budget constraint most accurately?
Opus 4.7 was the most accurate: it showed a detailed budget table totaling exactly $3,500. Sonnet 4.6 also appeared to stay within $3,500, but it gave per-day estimates rather than a full table, so its total is harder to audit. Haiku's output was truncated before the budget table, making its budget accuracy unverifiable.
Did any model use external tools despite the instruction to estimate costs?
No. The prompt explicitly stated 'Do not use any external tools; estimate costs from your training knowledge.' All three models produced estimates from training knowledge only, as required.
How important was specificity of venue names for this task?
Critical. All three models named specific venues rather than generic categories, including restaurants (Ichiran, Daikokuya, Sushi Zanmai, Kanda Matsuya) and attractions (Senso-ji, Meiji Shrine, teamLab Planets). Haiku's response, however, cut off mid-sentence on Day 5, leaving that day's activities incomplete.
What's the token efficiency comparison for planning tasks?
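Haiku generated 2,005 tokens in 20 seconds for $0.010, Sonnet 2,047 tokens in 46 seconds for $0.031, and Opus 1,887 tokens in 31 seconds for $0.144. That works out to roughly 1,970 output tokens per cent for Haiku, 660 for Sonnet, and 130 for Opus. Sonnet is about five times as token-efficient as Opus; Haiku is the most efficient per token, but its truncated output undercuts that advantage.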
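The tokens-per-cent figures can be recomputed from the results table. A minimal sketch; note that the cost column covers input and output tokens together, so this measures output tokens per total cent spent, and the names used are illustrative:

```python
# Figures copied from the results table: (output_tokens, cost_usd).
RUNS = {
    "Claude Haiku 4.5": (2_005, 0.0102),
    "Claude Sonnet 4.6": (2_047, 0.0311),
    "Claude Opus 4.7": (1_887, 0.1441),
}


def tokens_per_cent(output_tokens: int, cost_usd: float) -> float:
    """Output tokens produced per US cent of total request cost."""
    return output_tokens / (cost_usd * 100)


efficiency = {model: tokens_per_cent(*run) for model, run in RUNS.items()}

for model, value in efficiency.items():
    print(f"{model}: {value:.0f} output tokens per cent")
```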