AIgentic

Agentic Systems & LLM Tooling Daily

Benchmark: Multi-step Travel Planning with Tool Use

Claude Haiku 4.5 won this travel planning benchmark, delivering a fully specified 5-day Tokyo itinerary with accurate venue names, costs, and a closed budget table. Sonnet 4.6 was truncated mid-output; Opus overscoped to 6 days and included an off-itinerary excursion.


Claude Haiku 4.5 delivered the most complete and constraint-compliant travel itinerary in this multi-step planning benchmark, with a fully specified 5-day Tokyo itinerary, named venues, precise cost estimates, and a budget table totaling exactly $3,500. Claude Sonnet 4.6 produced higher-quality prose and richer detail but was truncated mid-output; Claude Opus 4.6 completed the task but added an unauthorized 6th day and overspent on flights.

Task

You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).
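Most of the hard constraints in this prompt can be checked mechanically once an itinerary is parsed into structured form. The following is a minimal sketch of such a check, not the harness used for this benchmark: the `Day` structure, the `check_itinerary` function, and the placeholder values are all hypothetical, and the "named venues, not categories" requirement still needs human or model review.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Day:
    transportation: str
    activities: list        # list of (venue name, approx cost in USD) tuples
    meal: Optional[tuple]   # (venue name, low price, high price), or None if missing


def check_itinerary(days, budget_rows, required_days=5, budget_cap=3500.0):
    """Return a list of constraint violations; an empty list means all checks passed."""
    problems = []
    if len(days) != required_days:
        problems.append(f"expected {required_days} days, got {len(days)}")
    for i, day in enumerate(days, start=1):
        if not day.transportation:
            problems.append(f"day {i}: missing transportation")
        if len(day.activities) < 2:
            problems.append(f"day {i}: fewer than 2 activities")
        if day.meal is None:
            problems.append(f"day {i}: missing meal recommendation")
    if not budget_rows:
        problems.append("missing budget table")
    else:
        total = sum(budget_rows.values())
        if total > budget_cap:
            problems.append(f"budget total ${total:,.2f} exceeds ${budget_cap:,.2f}")
    return problems


# Placeholder data for illustration only; not taken from any model's output.
placeholder_day = Day("Suica card", [("Venue A", 0.0), ("Venue B", 20.0)],
                      ("Restaurant C", 8.0, 12.0))
budget = {"flights": 1000.0, "everything else": 2500.0}

print(check_itinerary([placeholder_day] * 5, budget))  # []
print(check_itinerary([placeholder_day] * 6, budget))  # ['expected 5 days, got 6']
```

Returning a list of violations rather than a single pass/fail flag makes it easier to distinguish a truncated response (missing days and no budget table) from an overscoped one (too many days).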

Results

Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict
--- | --- | --- | --- | --- | ---
Claude Haiku 4.5 | 17,066 | 122 | 1,884 | $0.00954 | Complete, 5-day spec met, budget accurate
Claude Sonnet 4.6 | 41,454 | 122 | 2,048 | $0.03109 | Truncated mid-Day 2, no budget table
Claude Opus 4.6 | 29,533 | 170 | 1,856 | $0.14175 | Complete, 6-day output (violates 5-day constraint)

Analysis

Claude Haiku 4.5 delivered a structurally sound, complete itinerary spanning exactly May 10-14 (5 days as specified). Each day included transportation costs (N’EX at ¥3,100, Suica passes, monorail fares), at least two activities with named venues (Meiji Shrine, Senso-ji, Tsukiji Outer Market, teamLab Borderless, Shibuya Crossing, Harajuku), and meal recommendations with cost ranges (Ichiran ramen at ¥850-1,000; Sushi Dai at ¥2,500-3,500; Gonpachi at ¥2,500-3,500). The budget breakdown table itemizes flights ($900-1,100), accommodation ($24 per night in capsule hostels), activities, meals, and contingency, totaling exactly $3,500. Venue specificity is consistent (Nui Hostel Kuramae, specific Ichiran branches, Omoide Yokocho alley). The cost estimates align with publicly known Tokyo pricing circa 2024-2025. Latency was the lowest at 17 seconds, and token efficiency was strong relative to output quality.

Claude Sonnet 4.6 began with the strongest prose, providing context (Haneda vs. Narita, date-line crossing mechanics, booking strategy) and historical detail (Senso-ji founded 628 AD, teamLab reservation advice, Daikokuya operating since 1887). The response was tailored more explicitly to the solo traveler (Ichiran’s privacy booths for jet-lagged arrivals) and included richer narrative around venue selection. However, it was truncated during the Day 2 lunch recommendation, cutting off mid-sentence at “Expect a short queue of 15, ”. The output token count is exactly 2,048, which is consistent with the response hitting an output-token cap rather than the model stopping on its own, although the configured limit is not recorded here. Without the budget table and complete itinerary, the output fails the core requirement of a 5-day plan totaling within $3,500. Latency was highest at about 41 seconds, and the per-response cost was roughly 3.26x Haiku’s.
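A truncation like this can usually be flagged automatically instead of being found by reading the output. The sketch below assumes the harness records a stop reason and the configured output limit for each run. The `"max_tokens"` value follows the Anthropic Messages API's `stop_reason` field, while the `hit_output_cap` helper and the 2,048 limit are assumptions for illustration, since the benchmark's actual configuration is not stated.

```python
def hit_output_cap(stop_reason: str, output_tokens: int, max_tokens: int) -> bool:
    """Flag a response as cut off if the API reported a token-limit stop,
    or if the response consumed the entire output budget."""
    return stop_reason == "max_tokens" or output_tokens >= max_tokens


# Assumed limit of 2,048; the stop reasons below are illustrative, not recorded values.
print(hit_output_cap("max_tokens", 2048, 2048))  # True
print(hit_output_cap("end_turn", 1884, 2048))    # False
```

A harness that sees a positive result can retry with a higher limit or request a continuation before scoring the response.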

Claude Opus 4.6 completed the full output but violated the 5-day constraint. The itinerary spans 6 calendar days: May 10-14 (5 days of activities) plus May 15 (departure day), even though the prompt specified a “5-day trip” arriving May 10. The budget table totals $3,500 but allocates $1,150 to round-trip flights (above the $900-1,100 typical range for May economy bookings) and $550 to hotel (5 nights at $110/night). Venue specificity is strong (Hotel Mystays Asakusa, Tokyo Skytree, Shibuya Sky, Akihabara Radio Kaikan, Don Quijote). However, the day trip to Hakone on Day 4 adds significant logistical complexity and cost (Hakone Free Pass, $50) and compresses the remaining Tokyo itinerary. The budget table format is clean, but the scope creep is a material error for a constraint-satisfaction task. Latency was moderate at 29.5 seconds, and the per-response cost was roughly 14.9x Haiku’s.
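The day count behind this verdict is simple date arithmetic. As a small sketch under this article's reading, where the departure day counts toward trip length, the inclusive span from arrival to the last itinerary date can be computed as follows (`inclusive_days` is a hypothetical helper name):

```python
from datetime import date


def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days covered from start through end, counting both."""
    return (end - start).days + 1


arrival = date(2026, 5, 10)
print(inclusive_days(arrival, date(2026, 5, 14)))  # 5
print(inclusive_days(arrival, date(2026, 5, 15)))  # 6

# Hotel line from the Opus budget table: 5 nights at $110 per night.
print(5 * 110)  # 550
```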

Winner and Why

Claude Haiku 4.5 is the clear winner. It satisfied all hard constraints (5-day itinerary, specified venues, cost estimates, budget table totaling $3,500) without scope creep or truncation. While Sonnet 4.6 demonstrated superior prose quality and narrative depth where it completed, the truncation failure disqualifies it from being usable as-is. Opus 4.6 completed the task but violated the 5-day specification by adding a 6th day, which represents a fundamental misreading of the constraint. Haiku’s output is production-ready; Sonnet’s requires manual recovery of the missing section; Opus’s requires scope correction.

On cost-efficiency, Haiku was dominant: $0.00954 per response versus $0.03109 (Sonnet) and $0.14175 (Opus). Haiku and Sonnet received identical input (122 tokens) and produced similar output token counts (1,884 vs. 2,048), so the roughly 3.26x gap between them mostly reflects Sonnet’s higher per-token price rather than a larger response. Opus’s roughly 14.9x cost premium yielded a scope violation rather than higher quality. For multi-step planning tasks with strict constraints, Haiku’s lean, precise output at a fraction of the cost makes it the best choice, despite lacking the narrative flourishes of larger models.
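The cost multiples can be reproduced directly from the per-response figures in the results table. This is a short sketch for checking the arithmetic; the `cost_ratios` helper is a name chosen for illustration.

```python
# Per-response cost in USD, copied from the results table.
costs_usd = {
    "Claude Haiku 4.5": 0.00954,
    "Claude Sonnet 4.6": 0.03109,
    "Claude Opus 4.6": 0.14175,
}


def cost_ratios(costs: dict, baseline: str) -> dict:
    """Express each model's per-response cost as a multiple of the baseline's."""
    base = costs[baseline]
    return {model: cost / base for model, cost in costs.items()}


for model, ratio in cost_ratios(costs_usd, "Claude Haiku 4.5").items():
    print(f"{model}: {ratio:.2f}x")
# Claude Haiku 4.5: 1.00x
# Claude Sonnet 4.6: 3.26x
# Claude Opus 4.6: 14.86x
```

These are ratios of total response cost, which includes both input and output tokens, so they should not be read as per-token price ratios.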

Takeaways

Haiku 4.5 is viable for constrained planning tasks that prioritize correctness and efficiency over prose quality. Its low cost per response and on-spec delivery make it a strong default for cost-conscious agentic workflows requiring day-by-day scheduling and budget closure.

Sonnet 4.6’s truncation is a robustness issue, not a quality issue. The reported output count of exactly 2,048 tokens is consistent with an output-token cap, although the configured limit is not recorded in the usage stats. Either way, a response that stops before the budget table is a production reliability concern for streaming endpoints, and it makes Sonnet unsuitable for multi-stage generation as run here.

Opus 4.6’s scope creep (6 days instead of 5) suggests that larger models may be more prone to “helpful” but constraint-violating elaboration. This is especially risky in itinerary planning, where overscoping strains the budget and adds logistics.

For tool-use and multi-step reasoning benchmarks, constraint satisfaction should outweigh output verbosity.

Frequently asked

Which Claude model handled the $3,500 budget constraint most accurately?

Haiku 4.5 produced a coherent budget breakdown totaling exactly $3,500 without overshooting. Opus also closed its budget but allocated $1,150 to flights rather than the $900-1,100 typical range, and added a 6th day outside the 5-day requirement. Sonnet's output was truncated before the budget table.

Did the models use external tools despite the no-tools instruction?

None of the models invoked external tools. All estimated costs from training knowledge as instructed. The benchmark was testing multi-step reasoning and constraint satisfaction, not actual tool integration.

What was the most significant output difference across the three models?

Sonnet 4.6's response was truncated mid-Day 2, ending abruptly during a lunch recommendation. Haiku and Opus both completed full itineraries, but Opus added a 6th day (May 15 departure) that violated the stated 5-day trip constraint.

Which model provided the most specific venue names and details?

Haiku 4.5 and Opus 4.6 both named specific venues accurately (Senso-ji, teamLab Borderless, Ichiran Ramen, etc.). Sonnet 4.6 did the same where it completed, with detailed context like founding years and opening times before truncation.