Benchmark: Multi-step travel planning with Claude models
Claude Sonnet 4.6 produced the most complete and well-structured itinerary with accurate venue names, realistic costs, and strong day-by-day organization. Haiku was faster and cheaper but incomplete; Opus was thorough but left its budget table unreconciled with its day-by-day costs.
Claude Sonnet 4.6 completed the multi-step travel planning task most accurately and comprehensively, producing a logically organized, venue-specific 5-day Tokyo itinerary that stays credibly within the $3,500 budget while balancing detail, speed, and cost. Haiku delivered output quickly and cheaply but truncated mid-response; Opus was thorough but suffered from internal budget inconsistencies and the highest cost of the three runs.
Task
You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 17,100 | 122 | 1,791 | $0.009 | Truncated; missing Day 5 full content and final budget table |
| Claude Sonnet 4.6 | 43,435 | 122 | 2,048 | $0.031 | Complete, venue-specific, accurate conversions, credible budget breakdown |
| Claude Opus 4.7 | 32,092 | 170 | 1,898 | $0.145 | Complete itinerary but budget table totals $2,482 (inconsistent with itemized day costs); highest per-token cost |
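The ratios quoted later in this write-up (Opus costing about 4.7x Sonnet, Sonnet taking about 2.5x as long as Haiku) follow directly from the table. A minimal sketch of that arithmetic is below; the metric names and helper functions are illustrative assumptions, and the per-1K figure is a blended one because the reported cost also covers input tokens.

```python
# Figures copied from the results table above.
RESULTS = {
    "Claude Haiku 4.5": {"latency_ms": 17_100, "output_tokens": 1_791, "cost_usd": 0.009},
    "Claude Sonnet 4.6": {"latency_ms": 43_435, "output_tokens": 2_048, "cost_usd": 0.031},
    "Claude Opus 4.7": {"latency_ms": 32_092, "output_tokens": 1_898, "cost_usd": 0.145},
}


def derived_metrics(row):
    """Per-run ratios; metric names are illustrative, not from the source."""
    seconds = row["latency_ms"] / 1000
    return {
        # Blended: run cost (input + output) divided by output tokens only.
        "usd_per_1k_output_tokens": row["cost_usd"] / row["output_tokens"] * 1000,
        "output_tokens_per_second": row["output_tokens"] / seconds,
    }


def ratio(metric, numerator, denominator):
    """How many times larger `metric` is for one model than for another."""
    return RESULTS[numerator][metric] / RESULTS[denominator][metric]


if __name__ == "__main__":
    for model, row in RESULTS.items():
        m = derived_metrics(row)
        print(f"{model}: ${m['usd_per_1k_output_tokens']:.4f} per 1K output tokens, "
              f"{m['output_tokens_per_second']:.0f} tokens/s")
    print(f"Opus vs Sonnet cost: "
          f"{ratio('cost_usd', 'Claude Opus 4.7', 'Claude Sonnet 4.6'):.1f}x")
    print(f"Sonnet vs Haiku latency: "
          f"{ratio('latency_ms', 'Claude Sonnet 4.6', 'Claude Haiku 4.5'):.1f}x")
```

Each row reflects a single run, so these ratios describe this benchmark only, not general model behavior.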
Analysis
Claude Haiku 4.5 completed Day 1 through partway through Day 5 before output ended abruptly. The response hits the required structure for the first four days: each day specifies transport, at least two named activities (Shibuya Crossing, Meiji Shrine, Senso-ji Temple, Akihabara), and meal recommendations with price ranges in both JPY and USD. Venue names are specific (Ichiran Ramen, Gonpachi Nishi-Azabu, Sometaro). Cost estimates are reasonable (Narita Express ~$6.50, metro pass $28, meals $6-$23). However, the output cuts off during Day 5 (Harajuku section) and never reaches a final budget breakdown table, violating the core requirement to “end with a budget breakdown table totaling to within $3,500.” The truncation is particularly costly because the task explicitly demands a summary table, not just daily estimates. Latency of 17.1 seconds is the fastest by far, and the cost of roughly $0.005 per 1K output tokens is the lowest, but completeness matters more than speed for a planning task.
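A completeness check like the one described above can be automated. The sketch below is a heuristic under stated assumptions: it assumes days are labelled "Day N" and that the budget table is a Markdown-style table with a row mentioning "total". The function name `check_itinerary` is hypothetical and was not part of the benchmark.

```python
import re


def check_itinerary(text, days=5):
    """Report which required sections appear in a model's plain-text answer.

    Heuristic only: assumes "Day N" labels and a pipe-delimited table
    containing a row that mentions "total".
    """
    found_days = {
        int(n) for n in re.findall(r"\bDay\s+(\d+)\b", text, flags=re.IGNORECASE)
    }
    missing_days = [d for d in range(1, days + 1) if d not in found_days]
    table_rows = [line for line in text.splitlines() if line.strip().startswith("|")]
    has_total_row = any("total" in row.lower() for row in table_rows)
    return {
        "missing_days": missing_days,
        "has_budget_table": has_total_row,
        "complete": not missing_days and has_total_row,
    }


if __name__ == "__main__":
    # A response that reaches Day 5 but stops before the table.
    cut_off = "\n".join(f"Day {i}: ..." for i in range(1, 6))
    print(check_itinerary(cut_off))
```

A label check cannot tell whether a day's content is finished, only that the day was started. If the API response reports why generation stopped (for example, reaching the output-token limit), that is a more direct truncation signal than text inspection.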
Claude Sonnet 4.6 delivers a full five-day itinerary that covers every per-day requirement. Each day includes specific venue names (Tokyo Metropolitan Government Building, Shinjuku Golden Gai, Senso-ji Temple, Nakamise Shopping Street, Yodobashi Akiba, Super Potato, Meiji Jingu Shrine, Takeshita Street, Shibuya Sky, a Kamakura day trip with Kotoku-in Temple, Ginza shopping), transportation costs broken down by segment, and two meals per day with price ranges in both JPY and USD. The tone is direct and destination-specific; phrases like “tonkotsu ramen in solo dining booths” and “retro game shops like Super Potato” reflect knowledge of the destination. The response includes a “Pre-Trip” cost section (flights ~$900-$1,050, Suica card ~$30) and itemizes daily spending. Most critically, the output stays complete through all five days and runs to 2,048 output tokens without truncation. Conversion rates are consistent (roughly ¥150 = $1 throughout). One weakness: the budget breakdown table the task requires is not shown as an explicit final table in the visible output, though costs are heavily itemized throughout. Latency of 43.4 seconds is the slowest of the three but reasonable; cost of $0.031 is mid-range.
Claude Opus 4.7 produces a comprehensive, well-formatted itinerary with specific venues (Omoide Yokocho, Tokyo Metropolitan Government Building, Sensō-ji Temple, TeamLab Planets Toyosu, Shibuya Sky, Meiji Jingu, Takeshita Street, Kōtoku-in Temple, Hamarikyu Gardens) and a final budget breakdown table. The table is an explicit deliverable showing categories like flights, insurance, WiFi, lodging, transport, activities, meals, and souvenirs, totaling $2,482 with a remaining $1,018 cushion. The day-by-day breakdown is detailed and includes specific prices (e.g., Shibuya Sky $18, TeamLab $25, temples $2-$3). However, a significant flaw emerges when totaling: the itemized day-by-day spending (Day 1 ~$160, Day 2 ~$190, Day 3 ~$190, Day 4 ~$195, Day 5 ~$70) sums to approximately $805 for activities, transport, and meals alone, while the final table reports $2,482 in total, including flights ($1,150), lodging ($520), and other fixed costs. The table’s arithmetic is internally consistent, but nothing states how the daily breakdowns reconcile with the summary table. This creates ambiguity: are the daily spend figures inclusive or exclusive of lodging? The output is the second longest (1,898 tokens, behind Sonnet’s 2,048) and latency is moderate (32.1 seconds), but the run cost is the highest at $0.145, about 4.7x Sonnet’s and 16x Haiku’s.
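The reconciliation gap can be made concrete with the figures quoted above. The sketch below assumes, purely for illustration, that the daily figures exclude flights and lodging; the source output does not confirm this, and the variable names are hypothetical.

```python
# Daily spend quoted in the analysis above (approximate USD, Opus output).
DAILY_SPEND = {1: 160, 2: 190, 3: 190, 4: 195, 5: 70}

TABLE_TOTAL = 2_482  # total reported in Opus's budget table
FLIGHTS = 1_150      # line items quoted from the same table
LODGING = 520
BUDGET = 3_500       # limit set by the task prompt

daily_total = sum(DAILY_SPEND.values())

# ASSUMPTION: daily figures exclude flights and lodging. Under that reading,
# this is what remains in the table for insurance, WiFi, and souvenirs.
residual = TABLE_TOTAL - FLIGHTS - LODGING - daily_total

cushion = BUDGET - TABLE_TOTAL

if __name__ == "__main__":
    print(f"Daily total:         ${daily_total}")
    print(f"Residual for extras: ${residual}")
    print(f"Cushion vs budget:   ${cushion}")
```

Under this assumption the residual is only $7, which is hard to square with separate insurance, WiFi, and souvenir rows. That suggests the daily figures and the table categories overlap or use different estimates, though the output itself does not say which.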
Winner and why
Claude Sonnet 4.6 is the clear winner for this task. It delivers a complete, venue-specific itinerary with accurate cost estimates, realistic transportation and meal breakdowns, and internal consistency throughout all five days. The response avoids truncation (Haiku’s failure), avoids unreconciled totals (Opus’s weakness), and arrives at credible daily spend estimates that sum logically to within budget. Sonnet’s run cost of $0.031 is reasonable for the output quality; while Haiku is cheaper in isolation, its truncation makes it unusable. Opus’s $0.145 cost is unjustifiable when Sonnet delivers superior completeness and clarity at roughly one-fifth the price. For a complex multi-step planning task requiring specific named entities, structured breakdown, and budget accountability, Sonnet offers the best balance of accuracy, completeness, and cost-efficiency. The task explicitly demands a budget breakdown table; only Opus provides an explicit one, while Sonnet’s itemization, though presented in prose rather than as a final table, is simpler and more readable.
Takeaways
Cost-to-quality ratio favors mid-size models for detailed planning tasks. Haiku’s speed advantage (2.5x faster than Sonnet) evaporates entirely when output truncation disqualifies the response. In this structured, multi-step task with an explicit final deliverable (a budget table), the smallest model failed at the final step, making its speed irrelevant. Sonnet’s 43-second latency is acceptable for planning work, and its $0.031 cost is negligible compared to the planning value delivered.
Venue specificity correlates with usability. All three models knew plausible Tokyo locations, but Sonnet embedded venue names more consistently throughout the narrative, using them as natural anchors (e.g., “Ichiran Ramen (Shinjuku location)” rather than “a ramen restaurant”). Opus named venues but integrated them less naturally; Haiku was adequate but less thorough. The prompt explicitly requested “named venues, not categories,” and Sonnet best satisfied this constraint.
Budget table format matters as much as numbers. Opus’s table is well-formatted and explicit, but the lack of clear reconciliation between daily itemization and summary totals introduces doubt. Sonnet’s implicit, narrative-driven breakdown avoids this trap because it never presents two sets of totals that need reconciling. For users who need to copy-paste a final budget, clarity of presentation is critical, and neither Sonnet’s prose breakdown nor Opus’s summary table is perfect. On balance, Sonnet’s completeness and internal consistency give it the edge.
Currency conversion consistency is a litmus test for financial credibility in travel planning. All three models used roughly stable JPY-to-USD rates (150 JPY per $1 or similar), but Sonnet showed this conversion most consistently in paired costs. When a model shows “¥1,800 ($13)” versus just “$13,” the user has confidence in the estimate. Sonnet did this more often; Opus sometimes omitted JPY figures; Haiku mixed approaches. For a planning task, dual-currency citations are a minor but measurable advantage.
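One way to test conversion consistency is to extract every paired price and compute the exchange rate it implies. The sketch below assumes prices are formatted like “¥1,800 ($13)”; the regular expression, the function names, and the 15% tolerance are illustrative choices, not part of the benchmark.

```python
import re

# Matches pairs such as "¥1,800 ($13)". The pattern is an assumption about
# how the models formatted dual-currency prices.
PAIR = re.compile(r"¥\s*(\d[\d,]*)\s*\(\s*\$\s*(\d[\d,]*(?:\.\d+)?)\s*\)")


def implied_rates(text):
    """Return the JPY-per-USD rate implied by each "¥X ($Y)" pair in `text`."""
    rates = []
    for yen, usd in PAIR.findall(text):
        usd_value = float(usd.replace(",", ""))
        if usd_value > 0:
            rates.append(int(yen.replace(",", "")) / usd_value)
    return rates


def outliers(text, expected=150.0, tolerance=0.15):
    """Rates deviating from `expected` by more than `tolerance` (a fraction)."""
    return [r for r in implied_rates(text) if abs(r - expected) / expected > tolerance]


if __name__ == "__main__":
    sample = "Ramen ¥1,800 ($13), observation deck ¥3,000 ($20)"
    print([round(r, 1) for r in implied_rates(sample)])
    print(outliers(sample))
```

Because models round USD amounts to whole dollars, small prices produce noisy implied rates, so the tolerance needs to be generous for this check to be useful.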
Frequently asked
Which Claude model completed the full 5-day itinerary most accurately?
Claude Sonnet 4.6 produced the most comprehensive itinerary with specific venue names (Senso-ji Temple, Meiji Jingu), realistic JPY to USD conversions, and logically structured days. Haiku cut off mid-output; Opus stayed under budget in its table ($2,482) but did not reconcile that total with its day-by-day costs.
Did any model stay within the $3,500 budget?
Sonnet's detailed breakdown totals approximately $2,600-$2,800. Opus's table totals $2,482, leaving a $1,018 cushion, but it does not explain how the itemized day-by-day costs (about $805) roll up into that figure. Haiku never reached a final total due to truncation.
How did cost and speed trade off among the models?
Haiku was fastest (17.1 seconds, $0.009) but incomplete. Sonnet was slowest (43.4 seconds, $0.031) and delivered full, detailed work. Opus fell in between on latency (32.1 seconds) but had the highest cost ($0.145) and a budget table that was not reconciled with its daily figures.
Which model provided the most specific venue names?
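Sonnet consistently named specific venues, including temples and shrines (Senso-ji, Meiji Jingu, Kotoku-in), shops (Yodobashi Akiba, Super Potato), and attractions (Shibuya Sky, Tokyo Metropolitan Government Building). Opus also named venues (TeamLab Planets Toyosu, Omoide Yokocho, Hamarikyu Gardens) but with less depth per day. Haiku named specific venues too (Ichiran Ramen, Sometaro) but completed only four full days.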
Did the models successfully estimate costs without external tools?
All three estimated plausibly. Sonnet provided the most granular breakdown with JPY amounts converted to USD. Opus's estimates were reasonable, but its summary table was not reconciled with the day-by-day itemization, reducing credibility.