---
title: "Benchmark: Multi-step Travel Planning with Tool Use"
description: "Claude Haiku 4.5 produced the most accurate 5-day Tokyo itinerary within a $3,500 budget, with a complete day-by-day breakdown and itemized cost estimates."
tldr: "Claude Haiku 4.5 won this travel planning benchmark, delivering a fully specified 5-day Tokyo itinerary with accurate venue names, costs, and a closed budget table. Sonnet 4.6 was truncated mid-output; Opus overscoped to 6 days and included an off-itinerary excursion."
url: "https://aigentic.blog/benchmark-tool-planning-travel-haiku-sonnet-opus"
publishedAt: "2026-04-27T23:54:42.952Z"
updatedAt: "2026-04-27T23:54:42.952Z"
category: "benchmark"
tags: ["benchmark","claude","travel-planning","tool-use","multi-step-reasoning"]
---

# Benchmark: Multi-step Travel Planning with Tool Use

> Claude Haiku 4.5 won this travel planning benchmark, delivering a fully specified 5-day Tokyo itinerary with accurate venue names, costs, and a closed budget table. Sonnet 4.6 was truncated mid-output; Opus overscoped to 6 days and included an off-itinerary excursion.

Claude Haiku 4.5 delivered the most complete and constraint-compliant travel itinerary in this multi-step planning benchmark, with a fully specified 5-day Tokyo itinerary, named venues, precise cost estimates, and a budget table totaling exactly $3,500. Claude Sonnet 4.6 produced higher-quality prose and richer detail but was truncated mid-output; Claude Opus 4.6 completed the task but added an unauthorized 6th day and overspent on flights.

## Task

> You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 17,066 | 122 | 1,884 | $0.00954 | Complete, 5-day spec met, budget accurate |
| Claude Sonnet 4.6 | 41,454 | 122 | 2,048 | $0.03109 | Truncated mid-Day 2, no budget table |
| Claude Opus 4.6 | 29,533 | 170 | 1,856 | $0.14175 | Complete, 6-day output (violates 5-day constraint) |
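
The Cost column can be reproduced from the token counts alone. The sketch below assumes per-million-token prices of $1/$5 (Haiku), $3/$15 (Sonnet), and $15/$75 (Opus) for input/output. These prices are not stated in this benchmark; they are an assumption, chosen because they match the table to five decimal places.

```python
# Token usage copied from the results table: (input_tokens, output_tokens).
USAGE = {
    "Claude Haiku 4.5": (122, 1884),
    "Claude Sonnet 4.6": (122, 2048),
    "Claude Opus 4.6": (170, 1856),
}

# ASSUMPTION: USD per million tokens as (input, output). These prices are not
# stated in the benchmark; they are the values that reproduce its Cost column.
ASSUMED_PRICE_PER_MTOK = {
    "Claude Haiku 4.5": (1, 5),
    "Claude Sonnet 4.6": (3, 15),
    "Claude Opus 4.6": (15, 75),
}


def response_cost_usd(model: str) -> float:
    """Cost of one response: tokens times price, summed over input and output."""
    input_tokens, output_tokens = USAGE[model]
    input_price, output_price = ASSUMED_PRICE_PER_MTOK[model]
    # Integer arithmetic first, then a single division, to limit float error.
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    for model in USAGE:
        print(f"{model}: ${response_cost_usd(model):.5f}")
```

Under these assumed prices, output tokens account for nearly all of each response's cost, since the prompt is short.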

## Analysis

**Claude Haiku 4.5** delivered a structurally sound, complete itinerary spanning exactly May 10-14 (5 days as specified). Each day included transportation costs (N'EX at ¥3,100, Suica passes, monorail fares), at least two activities with named venues (Meiji Shrine, Senso-ji, Tsukiji Outer Market, teamLab Borderless, Shibuya Crossing, Harajuku), and meal recommendations with cost ranges (Ichiran ramen at ¥850-1,000; Sushi Dai at ¥2,500-3,500; Gonpachi at ¥2,500-3,500). The budget breakdown table itemizes flights ($900-1,100), accommodation ($24 per night in capsule hostels), activities, meals, and contingency, totaling exactly $3,500. Venue specificity is consistent (Nui Hostel Kuramae, specific Ichiran branches, Omoide Yokocho alley). The cost estimates align with publicly known Tokyo pricing circa 2024-2025. Latency was the lowest at 17 seconds, and token efficiency was strong relative to output quality.
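
"Budget closure" here means that the table's line items sum to the stated total and that the total stays within the cap. A minimal sketch of that check follows. The line items are hypothetical placeholders, not Haiku's actual figures, and `LineItem` and `check_budget` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineItem:
    label: str
    usd: int  # whole US dollars


def check_budget(items: list[LineItem], stated_total: int, cap: int) -> dict:
    """Recompute the total and compare it with the stated total and the cap."""
    computed = sum(item.usd for item in items)
    return {
        "computed_total": computed,
        "matches_stated_total": computed == stated_total,
        "within_cap": computed <= cap,
    }


# HYPOTHETICAL placeholder amounts for illustration only; they are NOT the
# line items from any model's output in this benchmark.
HYPOTHETICAL_ITEMS = [
    LineItem("Flights", 1000),
    LineItem("Accommodation", 120),
    LineItem("Local transport", 180),
    LineItem("Activities", 400),
    LineItem("Meals", 600),
    LineItem("Shopping and extras", 500),
    LineItem("Contingency", 700),
]

if __name__ == "__main__":
    print(check_budget(HYPOTHETICAL_ITEMS, stated_total=3500, cap=3500))
```

Recomputing the sum rather than trusting the stated total matters because a table can display $3,500 while its rows add up to something else.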

**Claude Sonnet 4.6** began with superior prose quality, providing context (Haneda vs. Narita, date-line crossing mechanics, booking strategy) and historical detail (Senso-ji founded 628 AD, teamLab reservation advice, Daikokuya operating since 1887). The response was tailored more explicitly to the solo traveler (Ichiran's privacy booths for jet-lagged arrivals) and included richer narrative around venue selection. However, it was truncated during the Day 2 lunch recommendation, cutting off mid-sentence at "Expect a short queue of 15, ". The output count is exactly 2,048 tokens, which is consistent with the response hitting a 2,048-token output cap rather than stopping on its own. This benchmark does not report the configured limit, so that is the most likely explanation rather than a confirmed one. Without the budget table and the remaining days, the output fails the core requirement of a 5-day plan totaling within $3,500. Latency was the highest at 41 seconds, and the response cost about 3.3x as much as Haiku's ($0.03109 vs. $0.00954).
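
A cut-off like this can be flagged automatically before a response is scored. The sketch below combines three signals: a `stop_reason` of `"max_tokens"` (a value the Anthropic Messages API reports when the output cap is reached), an output token count at or above the cap, and text that does not end on terminal punctuation. The function name and the 2,048-token cap in the usage example are assumptions; the benchmark does not state which limit was configured.

```python
from typing import Optional

# Characters a finished itinerary is expected to end on ("|" closes a
# markdown table row). This is a heuristic, not a guarantee.
TERMINAL_CHARS = (".", "!", "?", "|", ")")


def looks_truncated(
    text: str,
    stop_reason: Optional[str],
    output_tokens: int,
    max_tokens: int,
) -> bool:
    """Return True if the response shows any sign of being cut off."""
    # Strongest signal: the API says the output cap was reached.
    if stop_reason == "max_tokens":
        return True
    # Fallback when stop_reason was not logged: usage equals the cap.
    if output_tokens >= max_tokens:
        return True
    # Weakest signal: the text stops without terminal punctuation.
    return not text.rstrip().endswith(TERMINAL_CHARS)


if __name__ == "__main__":
    # ASSUMPTION: a 2,048-token cap, inferred from the reported output count.
    print(looks_truncated("Expect a short queue of 15, ", None, 2048, 2048))
    print(looks_truncated("Total: $3,500.", "end_turn", 1884, 4096))
```

The first call reports a truncated response and the second does not. In a pipeline, a positive result would typically trigger a retry with a larger output budget instead of scoring the partial text.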

**Claude Opus 4.6** completed the full output but violated the 5-day constraint. The itinerary spans 6 days: May 10-14 (5 days of activities) plus May 15 (departure day), even though the prompt specified a "5-day trip" with arrival on May 10. The budget table totals $3,500 but allocates $1,150 to round-trip flights (above the $900-1,100 typical range for May economy bookings) and $550 to hotel (5 nights at $110/night). Venue specificity is strong (Hotel Mystays Asakusa, Tokyo Skytree, Shibuya Sky, Akihabara Radio Kaikan, Don Quijote). However, the day trip to Hakone on Day 4 adds significant logistical complexity and cost (Hakone Free Pass, $50) and compresses the remaining Tokyo itinerary. The budget table format is clean, but the scope creep is a material error for a constraint-satisfaction task. Latency was moderate at 29.5 seconds, but the response cost about 14.9x as much as Haiku's ($0.14175 vs. $0.00954).
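
The day-count violation is mechanical enough to check in code. The sketch below assumes the itinerary labels each day with a heading such as `## Day 3` or `**Day 3`. That format is an assumption about the output, and `day_numbers` and `meets_day_constraint` are illustrative names.

```python
import re

# Matches a line that starts with an optional markdown heading or bold marker
# followed by "Day N". ASSUMPTION: the model labels days this way.
DAY_HEADING = re.compile(
    r"^[ \t]*(?:#+[ \t]*)?(?:\*\*)?Day[ \t]+(\d+)\b",
    re.IGNORECASE | re.MULTILINE,
)


def day_numbers(itinerary: str) -> list[int]:
    """Return the distinct day numbers found in the itinerary, sorted."""
    return sorted({int(n) for n in DAY_HEADING.findall(itinerary)})


def meets_day_constraint(itinerary: str, required_days: int) -> bool:
    """True only if the days are exactly 1..required_days, no more, no fewer."""
    return day_numbers(itinerary) == list(range(1, required_days + 1))


def sample_itinerary(days: int) -> str:
    """Build a toy itinerary starting May 10, for demonstration only."""
    return "\n".join(
        f"## Day {n} (May {9 + n})\n- placeholder activity"
        for n in range(1, days + 1)
    )


if __name__ == "__main__":
    print(meets_day_constraint(sample_itinerary(5), 5))
    print(meets_day_constraint(sample_itinerary(6), 5))
```

The exact-match comparison rejects both an extra day, as in the Opus output, and missing days, as in a truncated response.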

## Winner and Why

**Claude Haiku 4.5** is the clear winner. It satisfied all hard constraints (5-day itinerary, specified venues, cost estimates, budget table totaling $3,500) without scope creep or truncation. While Sonnet 4.6 demonstrated superior prose quality and narrative depth where it completed, the truncation failure disqualifies it from being usable as-is. Opus 4.6 completed the task but violated the 5-day specification by adding a 6th day, which represents a fundamental misreading of the constraint. Haiku's output is production-ready; Sonnet's requires manual recovery of the missing section; Opus's requires scope correction.

On cost-efficiency, Haiku was dominant: $0.00954 per response versus $0.03109 (Sonnet) and $0.14175 (Opus). Sonnet's roughly 3.3x higher cost comes from per-token pricing rather than volume: both models received the same 122-token prompt and produced similar output token counts (2,048 vs. 1,884). Opus's roughly 14.9x cost premium yielded a scope violation rather than higher quality. For multi-step planning tasks with strict constraints, Haiku's lean, precise output at a fraction of the cost makes it the best choice, despite lacking the narrative flourishes of larger models.
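
The cost comparison can be recomputed directly from the results table. The sketch below separates two quantities that are easy to conflate: the ratio of per-response costs and the ratio of effective cost per output token. The helper names are illustrative.

```python
# Values copied from the results table: (cost in USD, output tokens).
RESULTS = {
    "Claude Haiku 4.5": (0.00954, 1884),
    "Claude Sonnet 4.6": (0.03109, 2048),
    "Claude Opus 4.6": (0.14175, 1856),
}

BASELINE = "Claude Haiku 4.5"


def response_cost_ratio(model: str, baseline: str = BASELINE) -> float:
    """How many times more one response cost than the baseline's response."""
    return RESULTS[model][0] / RESULTS[baseline][0]


def cost_per_output_token(model: str) -> float:
    """Effective USD per output token (input cost is folded in)."""
    cost, output_tokens = RESULTS[model]
    return cost / output_tokens


def per_token_ratio(model: str, baseline: str = BASELINE) -> float:
    """Ratio of effective per-output-token cost against the baseline."""
    return cost_per_output_token(model) / cost_per_output_token(baseline)


if __name__ == "__main__":
    for model in RESULTS:
        print(
            f"{model}: {response_cost_ratio(model):.2f}x per response, "
            f"{per_token_ratio(model):.2f}x per output token"
        )
```

The per-response ratios round to 3.26 for Sonnet and 14.86 for Opus. Sonnet's per-output-token ratio is slightly lower than its per-response ratio because it emitted more tokens than Haiku.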

## Takeaways

Haiku 4.5 is viable for constrained planning tasks that prioritize correctness and efficiency over prose quality. Its low per-response cost and on-spec delivery make it a strong default for cost-conscious agentic workflows requiring day-by-day scheduling and budget closure. Sonnet 4.6's truncation coincides with an output count of exactly 2,048 tokens, which points to an output-token cap being reached rather than a flaw in the content itself; this is a production reliability concern wherever a verbose model runs under a fixed output budget. The Sonnet failure is therefore a robustness issue rather than a quality issue, and it makes the model risky for long multi-stage generation unless the output budget is raised or truncation is detected and handled. Opus 4.6's scope creep (6 days instead of 5) suggests that larger models may be more prone to "helpful" but constraint-violating elaboration; this is especially risky in itinerary planning, where overscoping strains the budget table and adds logistics. For tool-use and multi-step reasoning benchmarks, constraint satisfaction should outweigh output verbosity.

## Further reading

- [Anthropic Claude Haiku documentation](https://docs.anthropic.com/claude/reference/overview) - official API reference and model specifications
- [Claude models performance benchmarks](https://www.anthropic.com/news/claude-3-family) - Anthropic's official model family announcement
- [AIgentic Benchmark Methodology](https://aigentic.blog) - methodology for evaluating LLMs on structured tasks

## Frequently asked

### Which Claude model handled the $3,500 budget constraint most accurately?

Haiku 4.5 produced a coherent budget breakdown totaling exactly $3,500 without overshooting. Opus also closed its budget but allocated $1,150 to flights rather than the $900-1,100 typical range, and added a 6th day outside the 5-day requirement. Sonnet's output was truncated before the budget table.

### Did the models use external tools despite the no-tools instruction?

None of the models invoked external tools. All estimated costs from training knowledge as instructed. The benchmark was testing multi-step reasoning and constraint satisfaction, not actual tool integration.

### What was the most significant output difference across the three models?

Sonnet 4.6's response was truncated mid-Day 2, ending abruptly during a lunch recommendation. Haiku and Opus both completed full itineraries, but Opus added a 6th day (May 15 departure) that violated the stated 5-day trip constraint.

### Which model provided the most specific venue names and details?

Haiku 4.5 and Opus 4.6 both named specific venues accurately (Senso-ji, teamLab Borderless, Ichiran Ramen, etc.). Sonnet 4.6 did the same where it completed, with detailed context like founding years and opening times before truncation.
