---
title: "Benchmark: Multi-step travel planning with tool use"
description: "Claude Haiku 4.5, Sonnet 4.6, and Opus 4.7 compared on a 5-day Tokyo itinerary task requiring cost estimation and structured planning without external tools."
tldr: "Opus 4.7 produced the most complete itinerary with realistic daily breakdowns and a closed budget table. Sonnet 4.6 started strong but truncated mid-response. Haiku 4.5 delivered a functional plan at about a third of Sonnet's cost but with less detail and accuracy."
url: "https://aigentic.blog/benchmark-multi-step-travel-planning-tool-use"
publishedAt: "2026-05-01T13:00:43.320Z"
updatedAt: "2026-05-01T13:00:43.320Z"
category: "benchmark"
tags: ["benchmark","claude","travel-planning","cost-estimation","multi-step-reasoning"]
---

# Benchmark: Multi-step travel planning with tool use

> Opus 4.7 produced the most complete itinerary with realistic daily breakdowns and a closed budget table. Sonnet 4.6 started strong but truncated mid-response. Haiku 4.5 delivered a functional plan at about a third of Sonnet's cost but with less detail and accuracy.

Claude Haiku 4.5 delivered a usable but incomplete itinerary; Sonnet 4.6 truncated mid-response; and Opus 4.7 produced the only fully formed, budget-coherent plan. Opus 4.7 is the clear winner for this task despite a 15x cost multiplier over Haiku.

## Task

> You are planning a 5-day trip from San Francisco to Tokyo for a solo traveler arriving 2026-05-10 with a $3,500 total budget including flights. Produce a day-by-day itinerary. For each day include: transportation, at least 2 activities with approximate cost, and one meal recommendation with price range. End with a budget breakdown table totaling to within $3,500. Do not use any external tools; estimate costs from your training knowledge. Be specific (named venues, not categories).

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 20,212 | 122 | 1,914 | $0.0097 | Incomplete; response cuts off mid-Day 5; no final budget table |
| Claude Sonnet 4.6 | 43,378 | 122 | 2,048 | $0.0311 | Truncated; ends mid-Day 2 table; cannot validate completeness |
| Claude Opus 4.7 | 30,427 | 170 | 1,871 | $0.1429 | Complete; full 5-day itinerary with closed budget table summing to $3,500 |
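
The cost multipliers quoted throughout this post follow directly from the Cost column. As a minimal sketch, using only the figures in the table above (the `runs` dictionary and `cost_ratio` helper are illustrative names, not part of the benchmark harness):

```python
# Per-run figures copied from the results table above.
runs = {
    "Claude Haiku 4.5": {"cost_usd": 0.0097, "output_tokens": 1914},
    "Claude Sonnet 4.6": {"cost_usd": 0.0311, "output_tokens": 2048},
    "Claude Opus 4.7": {"cost_usd": 0.1429, "output_tokens": 1871},
}


def cost_ratio(numerator: str, denominator: str) -> float:
    """Return how many times more one run cost than another."""
    return runs[numerator]["cost_usd"] / runs[denominator]["cost_usd"]


for name in runs:
    ratio = cost_ratio("Claude Opus 4.7", name)
    print(f"Opus 4.7 vs {name}: {ratio:.1f}x")
```

On these numbers the Opus run cost about 14.7 times the Haiku run and about 4.6 times the Sonnet run, while the Sonnet run cost about 3.2 times the Haiku run.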

## Analysis

**Claude Haiku 4.5** constructed a reasonable outline with named venues (Ichiran Ramen Shinjuku, Senso-ji Temple, Harajuku Takeshita Street) and realistic cost ranges. The itinerary covered the arrival day plus four activity days, each with two activities and one meal. However, the response cut off mid-sentence at Day 5, during the shopping and departure section. The daily totals it did show ($744 on Day 1, then $55, $77, and $53 for the following days) add up to only $929, and no final budget breakdown table appears, so the plan cannot be checked against the $3,500 limit. The model respected the no-external-tools constraint; all estimates came from training knowledge. It had the lowest latency of the three (about 20 seconds) and was the cheapest run at under one cent.

**Claude Sonnet 4.6** began with more elaborate formatting and higher specificity, including exact train costs (N'EX at ¥3,070 or ~$21), hotel recommendations (Shinjuku Granbell at ~$110/night for 5 nights = $550), and even a warning to confirm TeamLab Borderless reopening status at a provided URL. The response ran to 2,048 output tokens, took 43 seconds, and cost $0.0311. The itinerary was well-structured and included useful context such as arrival times and crowd-avoidance strategies. However, the output stops mid-sentence in the Day 2 spending table, so the remaining days and the final budget table cannot be evaluated. This incompleteness is a critical failure for a task requiring a "budget breakdown table totaling to within $3,500."

**Claude Opus 4.7** delivered the only complete response. It laid out a 6-day structure (arrival May 10 through departure May 15) with all five activity days fully developed, named venues for each activity (Senso-ji Temple, TeamLab Planets Toyosu, Hakone Open-Air Museum, Meiji Jingu Shrine, Tsukiji Outer Market), and meal recommendations with price ranges ($15-20 for Daikokuya Tempura, $25 for Tsukiji breakfast, $40-55 for Bird Land Ginza). Critically, Opus closed with a 12-line budget breakdown table (flight $950, hotel $650, transit $55, daily activities and meals $319 in total, souvenirs $200, travel insurance $60, contingency $1,236) with a stated total of $3,500. Note that the grouped figures quoted here add up to $3,470, so $30 of the table is not accounted for in this summary. Its cost estimates were also more realistic: flight prices for May 2026 bookings, accommodation in mid-range hotels, and activity fees such as Hamarikyu Gardens at ¥500 or Shibuya Sky at ¥2,500 that match known real-world pricing. The response was 1,871 tokens, slightly shorter than Sonnet's truncated attempt, yet it reached the end of the task. The latency (30 seconds) fell between Haiku and Sonnet. The cost was $0.1429, or approximately 15 times Haiku and 4.6 times Sonnet.
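
The budget figures quoted for Opus can be checked with a few lines of arithmetic. This is a sketch over the grouped amounts as they appear in the paragraph above, not over the full 12-line table, which is not reproduced in this post; the variable names are illustrative.

```python
# Grouped figures as quoted in the analysis paragraph (USD).
quoted_items = {
    "flight": 950,
    "hotel": 650,
    "transit": 55,
    "daily activities and meals": 319,
    "souvenirs": 200,
    "travel insurance": 60,
    "contingency": 1236,
}
BUDGET = 3500

quoted_total = sum(quoted_items.values())
gap = BUDGET - quoted_total

print(f"quoted total: ${quoted_total:,}")   # quoted total: $3,470
print(f"gap to budget: ${gap:,}")           # gap to budget: $30
```

The $30 gap may simply reflect a line item that is grouped or omitted in the summary, but it is worth confirming against the raw model output before treating the table as exactly closed.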

## Winner and why

Opus 4.7 is the clear winner. The task required two non-negotiable deliverables: a day-by-day itinerary with named venues, transportation, at least 2 activities per day with costs, and one meal per day with price range; and a budget breakdown table summing to within $3,500. Only Opus completed both. Haiku truncated before reaching the budget table. Sonnet truncated mid-Day 2, leaving the task unfinished. In domains requiring structural closure and numerical accuracy, truncation is disqualifying. Opus's cost multiplier is substantial, but the output quality is not proportionally better; rather, the other two models simply failed to finish. For a travel planner relying on this output to validate feasibility and allocate resources, Haiku and Sonnet are unusable without manual completion. Opus is the only submission that can be handed to a traveler as-is. When amortized over real-world use (a solo traveler making a single trip), the $0.13 difference is negligible compared to the value of a closed, validated plan.
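
The two deliverables can be expressed as a mechanical completeness check. The sketch below assumes the model output has already been parsed into a simple structure; `DayPlan`, `check_plan`, and the field names are hypothetical and were not part of this benchmark, which was graded by reading the responses.

```python
from dataclasses import dataclass


@dataclass
class DayPlan:
    transportation: str
    activities: list[tuple[str, float]]  # (named venue, approximate cost in USD)
    meal: str  # named restaurant with a price range, empty if missing


def check_plan(days: list[DayPlan], budget_table: dict[str, float],
               limit: float = 3500.0, required_days: int = 5) -> list[str]:
    """Return unmet requirements; an empty list means the plan is complete."""
    problems = []
    if len(days) < required_days:
        problems.append(f"only {len(days)} of {required_days} days present")
    for number, day in enumerate(days, start=1):
        if not day.transportation:
            problems.append(f"day {number}: no transportation")
        if len(day.activities) < 2:
            problems.append(f"day {number}: fewer than 2 activities")
        if not day.meal:
            problems.append(f"day {number}: no meal recommendation")
    if not budget_table:
        problems.append("no budget breakdown table")
    elif sum(budget_table.values()) > limit:
        problems.append("budget table exceeds the limit")
    return problems


# Placeholder data shaped like a response that stops partway through Day 2.
partial = [
    DayPlan("train", [("Venue A", 10.0), ("Venue B", 20.0)], "Restaurant A, $10-15"),
    DayPlan("metro", [("Venue C", 15.0)], ""),
]
for problem in check_plan(partial, budget_table={}):
    print(problem)
```

A check like this makes the pass/fail criterion explicit: a truncated response fails on the missing days and the missing table regardless of how good its early sections are.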

## Takeaways

1. **Output truncation is a silent failure in structured tasks.** Sonnet and Haiku each produced more output tokens than Opus (2,048 and 1,914 versus 1,871), yet neither finished. This points to either transmission/storage limits or token-budget exhaustion mid-generation; Sonnet's count of exactly 2,048 is consistent with an output-token cap. For planning tasks with numerical requirements, truncation invalidates the entire output. Opus avoided this, likely through denser prose that fit the full plan into a similar token count.

2. **Budget closure requires end-to-end modeling.** Opus 4.7 appears to have treated the final budget table as a primary constraint, working backward from the $3,500 limit to allocate daily spending and contingency. Haiku and Sonnet (insofar as we can assess from truncated output) treated daily costs as independent and did not synthesize them into a closed table. This suggests Opus used a more explicit constraint-satisfaction approach.

3. **Cost-per-correct-answer, not cost-per-token, is the relevant metric for this task class.** Haiku's run cost about 15 times less than Opus's, but it is of no use if the output is incomplete. Sonnet's run cost about 4.6 times less than Opus's and also failed. In tasks requiring closure and validation, the lowest-cost model is the one that finishes, regardless of absolute price.

4. **Specificity in venue and pricing is asymmetrically valued.** Opus named specific attractions (Hamarikyu Gardens with ¥500 entry), specific restaurants (Bird Land Ginza, Michelin-listed yakitori), and specific hotels (APA Hotel Shinjuku Kabukicho Tower). Sonnet began this pattern (Shinjuku Granbell, Fuunji) but truncated before building on it. Haiku named some venues but also fell back on broader categories (izakaya, teahouse) with wider price ranges. For travel planning, narrow specificity is essential because travelers will search for these exact names and book accordingly. Vague categories (yakitori, $20-25) require manual research to validate.
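
The cost-per-correct-answer metric from the third takeaway can be made concrete. This is a minimal sketch over the single run per model reported here, so it illustrates the metric rather than estimating a completion rate; `cost_per_complete_answer` is an illustrative helper name.

```python
import math

# (model, cost of the single run in USD, whether the response was complete),
# taken from the results table and verdicts in this post.
runs = [
    ("Claude Haiku 4.5", 0.0097, False),
    ("Claude Sonnet 4.6", 0.0311, False),
    ("Claude Opus 4.7", 0.1429, True),
]


def cost_per_complete_answer(costs_usd, completed):
    """Total spend divided by the number of complete answers (inf if none)."""
    n_complete = sum(completed)
    if n_complete == 0:
        return math.inf
    return sum(costs_usd) / n_complete


for name, cost, complete in runs:
    value = cost_per_complete_answer([cost], [complete])
    print(f"{name}: {value}")
```

With one run each, the two truncated models come out at infinite cost per complete answer and Opus at $0.1429. Repeated runs per model would be needed to turn this into a meaningful average.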

## Further reading

- [TeamLab official website](https://www.teamlab.art/) - Covers the immersive art installations referenced in the Sonnet (Borderless) and Opus (Planets Toyosu) responses; reopening status and ticket information.
- [Haneda Airport transport options](https://www.haneda.tokyo/) - Official Haneda Airport site with real-time transport fares and Airport Limousine Bus schedules used in cost estimation.
- [Japan Rail Pass and Suica pricing](https://www.jreast.co.jp/e/index.html) - JR East official site with current IC card and day-pass rates for Tokyo transit referenced by all three models.
- [Tokyo National Museum](https://www.tnm.jp/modules/r_common_festival_list/) - Official collection and admission pricing for the Ueno museum mentioned across responses.
- [Michelin Guide Tokyo restaurants](https://guide.michelin.com/jp/en/tokyo-restaurants) - Reference for high-end dining venue validation, including Bird Land Ginza cited in Opus output.

## Frequently asked

### Why did Sonnet's response get cut off

Sonnet 4.6 stopped at exactly 2,048 output tokens, which is consistent with hitting an output-token cap, though truncation during transmission or storage cannot be ruled out. The response ends mid-sentence in the Day 2 spending table, preventing evaluation of the full itinerary and final budget breakdown.
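
A harness can flag this kind of cutoff automatically. The sketch below assumes the harness stores the stop reason returned with each response; the Anthropic Messages API reports `stop_reason` as `"max_tokens"` when the output cap is reached. How this benchmark actually called the models is not documented here, and the trailing-character heuristic is an illustrative fallback, not a reliable detector.

```python
from typing import Optional

# Characters that plausibly end a finished Markdown response: sentence
# punctuation, a closing table pipe, a bracket, or an emphasis/code marker.
FINAL_CHARS = ".!?|)*`"


def looks_truncated(text: str, stop_reason: Optional[str] = None) -> bool:
    """Flag a response that was probably cut off before completion."""
    # Authoritative signal: the API says the output-token cap was hit.
    if stop_reason == "max_tokens":
        return True
    # Fallback heuristic for stored text with no stop reason attached.
    stripped = text.rstrip()
    if not stripped:
        return True
    return stripped[-1] not in FINAL_CHARS


print(looks_truncated("| Day 2 | Lunch at"))                # True
print(looks_truncated("| Total | $3,500 |", "end_turn"))     # False
```

Logging the stop reason alongside token counts would distinguish a cap-induced cutoff from loss during transmission or storage.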

### Which model stayed closest to the $3,500 budget

Opus 4.7 was the only model to produce a final budget table, with clear line items for flight, lodging, transit, activities, and contingency and a stated total of exactly $3,500. Haiku's response was incomplete and Sonnet's was truncated, making full budget validation impossible for those two.

### How accurate were the cost estimates

Opus 4.7 provided the most realistic estimates for Tokyo in May 2026, including specific venue pricing (Hamarikyu Gardens ¥500, Shibuya Sky ¥2,500) and noting that prices may differ by booking lead time. Haiku and Sonnet used reasonable ranges but less specificity.

### Did any model use external tools

No. The prompt explicitly stated 'do not use any external tools.' All three models relied on training data for cost and venue estimates, as required. Sonnet flagged a single note about verifying TeamLab Borderless reopening status before travel.

### What was the quality-to-cost ratio

Haiku 4.5 cost $0.0097 and delivered a usable outline. Sonnet 4.6 cost $0.0311 but truncated. Opus 4.7 cost $0.1429 (15x Haiku) but provided completeness, accuracy, and a proper budget closure that aligned with the task requirements.
