---
title: "Benchmark: Python race-condition diagnosis across Claude"
description: "Claude Haiku, Sonnet, and Opus diagnose a cache stampede bug in Python async code. All three identify the core issue; Sonnet and Opus supply correct fixes, while Haiku's fix has a subtle flaw."
tldr: "All three Claude models correctly identified the race condition in a concurrent cache function, but Sonnet and Opus provided more rigorous explanations and production-ready fixes. Haiku's solution omits synchronization between in-flight dictionary updates, risking duplicate fetches."
url: "https://aigentic.blog/benchmark-bug-diagnosis-python-2026-05-22"
publishedAt: "2026-05-22T13:54:07.328Z"
updatedAt: "2026-05-22T13:54:07.328Z"
category: "benchmark"
tags: ["benchmark","claude","python","async","concurrency","debugging"]
---

# Benchmark: Python race-condition diagnosis across Claude

> All three Claude models correctly identified the race condition in a concurrent cache function, but Sonnet and Opus provided more rigorous explanations and production-ready fixes. Haiku's solution omits synchronization between in-flight dictionary updates, risking duplicate fetches.

All three Claude models correctly identified the core bug: a cache stampede race condition in the async function where multiple concurrent calls with the same missing key each launch their own `fetcher()` call instead of sharing a single fetch. However, the quality of explanation and the correctness of the proposed fixes differ meaningfully, as do the latency and cost profiles.

## Task

> The following Python async function intermittently returns stale data. Identify the bug, explain why it happens, and provide a minimal corrected version.
>
> ```python
> import asyncio
>
> cache = {}
>
> async def get_or_fetch(key, fetcher):
>     if key in cache:
>         return cache[key]
>     value = await fetcher(key)
>     cache[key] = value
>     return value
> ```
>
> Answer format: three sections ("Bug", "Why it happens", "Fix" with code).
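
The stampede in the prompt can be reproduced with a small harness. The sketch below is illustrative and was not part of the benchmark: `slow_fetcher`, `reproduce`, and the key `"user:1"` are hypothetical names, and the fetcher simply counts its own invocations.

```python
import asyncio

cache = {}

async def get_or_fetch(key, fetcher):
    # Function under test, copied from the task prompt.
    if key in cache:
        return cache[key]
    value = await fetcher(key)
    cache[key] = value
    return value

async def reproduce(n_callers=5):
    calls = 0

    async def slow_fetcher(key):
        # Hypothetical fetcher: counts invocations, then yields to the loop.
        nonlocal calls
        calls += 1
        call_number = calls
        await asyncio.sleep(0.01)
        return f"{key}-v{call_number}"

    cache.clear()
    results = await asyncio.gather(
        *(get_or_fetch("user:1", slow_fetcher) for _ in range(n_callers))
    )
    return calls, results

if __name__ == "__main__":
    calls, results = asyncio.run(reproduce())
    print(calls, results)
```

With five concurrent callers, every caller passes the `if key in cache` check before any write happens, so the fetcher runs five times and the callers receive five different values for the same key.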

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|-------|--------------|--------------|---------------|-----------|---------|
| Claude Haiku 4.5 | 4,422 | 121 | 451 | 0.00238 | Race condition identified; fix has synchronization gap |
| Claude Sonnet 4.6 | 15,234 | 121 | 759 | 0.01175 | Complete, accurate, thoroughly explained |
| Claude Opus 4.7 | 12,814 | 159 | 689 | 0.05406 | Complete, correct, concise; less pedagogical |
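
The relative latency and cost figures can be derived directly from the table. The snippet below is a small arithmetic check; the dictionary keys and the `ratio` helper are hypothetical names, while the numbers are copied from the table rows.

```python
# Latency and cost figures copied from the results table.
results = {
    "haiku": {"latency_ms": 4422, "cost_usd": 0.00238},
    "sonnet": {"latency_ms": 15234, "cost_usd": 0.01175},
    "opus": {"latency_ms": 12814, "cost_usd": 0.05406},
}

def ratio(metric, a, b):
    """Return model a's value for the metric divided by model b's."""
    return results[a][metric] / results[b][metric]

print(round(ratio("latency_ms", "sonnet", "haiku"), 1))  # 3.4
print(round(ratio("cost_usd", "sonnet", "haiku"), 1))    # 4.9
print(round(ratio("latency_ms", "sonnet", "opus"), 1))   # 1.2
print(round(ratio("cost_usd", "opus", "sonnet"), 1))     # 4.6
```

On these numbers Sonnet has the highest latency of the three, Haiku the lowest cost, and Opus the highest cost.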

## Analysis

**Claude Haiku 4.5** correctly identifies the race condition and its mechanism. The explanation is clear: between the cache check and the write, another coroutine can enter and launch its own fetch. The proposed fix introduces an `in_flight` dictionary to track pending tasks and has concurrent callers await the same task. However, there is a critical synchronization gap between the check `if key in in_flight` and the subsequent assignment `in_flight[key] = task`. `asyncio.create_task()` is itself a synchronous call that does not yield, and in asyncio's single-threaded, cooperative model control changes hands only at an `await`, so the window is real only where a suspension point falls between the check and the insert. In that window, a second coroutine can enter the function after the first has checked `if key in in_flight` but before it registers its task, and both then create separate tasks. Under concurrent load this produces duplicate fetches. The `finally` cleanup is correct, but it cannot mask the core flaw.
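
To make the condition for that window concrete, here is a minimal sketch. It is not Haiku's output: `fetch_count`, `register_before_await`, and the inner helpers are hypothetical names, and `asyncio.sleep(0)` stands in for any suspension point that might sit between the check and the insert.

```python
import asyncio

async def fetch_count(register_before_await, n_callers=5):
    """Return how many fetches run for n_callers concurrent requests."""
    in_flight = {}
    calls = 0

    async def fetcher(key):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return key.upper()

    async def get(key):
        if key in in_flight:
            return await in_flight[key]
        if not register_before_await:
            # A suspension point between the check and the insert:
            # other coroutines run here and also see an empty slot.
            await asyncio.sleep(0)
        task = asyncio.create_task(fetcher(key))
        in_flight[key] = task
        try:
            return await task
        finally:
            in_flight.pop(key, None)

    await asyncio.gather(*(get("k") for _ in range(n_callers)))
    return calls

if __name__ == "__main__":
    print(asyncio.run(fetch_count(register_before_await=True)))
    print(asyncio.run(fetch_count(register_before_await=False)))
```

When the task is registered before any `await`, the five callers share one fetch. When a suspension point precedes the registration, each caller sees an empty `in_flight` and starts its own fetch.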

**Claude Sonnet 4.6** provides the most rigorous explanation. It walks through the exact sequence of events using a visual timeline, showing how multiple coroutines can pass the `if key in cache` check before any of them stores a result. The term "cache stampede" is named explicitly. The fix uses `asyncio.ensure_future()` instead of `create_task()` (a minor difference, both are valid), but critically, it stores the task in the `_in_flight` dictionary immediately after creation, before any await point. The key insight is stated clearly: "Both lines are synchronous (no await between them), so no other coroutine can interleave here." The code includes explicit type hints (`dict[str, asyncio.Task]`), documents why errors are not cached, and provides a table explaining each step of the solution. The explanation is production-ready and educational.

**Claude Opus 4.7** identifies the race condition with precision, introducing the term "thundering herd" as a synonym and clarifying that the staleness is not temporal (no TTL) but rather inconsistency between concurrent fetches. The fix is the most elegant: instead of a separate in_flight tracking dictionary, it stores a `Future` directly in the main cache dictionary, so the check and the insert happen with no suspension point between them. The key line is: "Because the future is inserted into the cache **before** any await, any concurrent caller arriving during the fetch will find it and await the same result." This is correct and minimal. The code includes error handling that removes the future on exception, preventing stale in-flight entries. However, the explanation is less detailed than Sonnet's: there is no step-by-step walkthrough or table, and some developers may not immediately recognize "thundering herd" as another name for a cache stampede.
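
The future-in-cache idea can be sketched as follows. This is a reconstruction from the description, not Opus's verbatim code; the `asyncio.shield` call for waiters and the separate cancellation branch are assumptions added so that waiters cannot cancel or be stranded on the shared future.

```python
import asyncio

cache = {}

async def get_or_fetch(key, fetcher):
    if key in cache:
        # The entry is a Future: pending while a fetch is running,
        # completed afterwards. Awaiting it works in both cases.
        return await asyncio.shield(cache[key])

    future = asyncio.get_running_loop().create_future()
    cache[key] = future  # inserted before the first await

    try:
        value = await fetcher(key)
    except asyncio.CancelledError:
        cache.pop(key, None)
        future.cancel()  # release any waiters instead of stranding them
        raise
    except Exception as exc:
        cache.pop(key, None)       # do not keep a failed entry
        future.set_exception(exc)  # waiters see the same error
        raise
    future.set_result(value)
    return value
```

One design consequence is that the cache now holds futures rather than plain values, so every read goes through `await`, including hits on entries that completed long ago.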

## Winner and why

**Claude Sonnet 4.6** is the best choice for this diagnostic task, balancing correctness, clarity, and cost-effectiveness. While Opus's solution is slightly more elegant, Sonnet's fix is correct and its explanation is comprehensive. Haiku's fix contains a synchronization bug that would manifest under realistic load, and because it saves only about 0.94 cents per call relative to Sonnet (0.24 cents versus 1.18 cents), the risk is not worth the savings. Opus's solution is correct but costs 4.6 times as much as Sonnet (5.4 cents versus 1.18 cents) for a marginal improvement in elegance and no gain in explainability. Sonnet produces a production-grade fix with step-by-step documentation, a summary table, and explicit reasoning about where coroutines cannot interleave. For a diagnostic/educational benchmark, the combination of correctness, teachability, and reasonable cost makes Sonnet the optimal choice.

## Takeaways

1. **Race conditions in async code are invisible in the code itself.** The original function uses no threads or locks; the bug comes from assuming that the "check, fetch, write" sequence is atomic, when the `await` in the middle lets other coroutines run. None of the models were fooled, but Haiku's fix demonstrates that merely tracking in-flight requests is not enough; they must be inserted into the tracking structure before the first await, or the window remains open.

2. **Cost scaling versus explanation quality matters in this domain.** Haiku is 3.4 times faster than Sonnet and, at 0.24 cents, about 80% cheaper, but its fix has a subtle bug. Opus is about 16% faster than Sonnet (12.8s versus 15.2s) but 4.6 times as expensive, yet provides only marginal gains in code elegance and no better explanation. Sonnet achieves the best cost-to-correctness ratio by providing a correct, thoroughly explained, production-ready fix at the lowest safe price point.

3. **The fix pattern generalizes: store futures, not just track them.** Opus's insight to store the Future directly in the cache is elegant and worth learning; Sonnet's approach of a separate tracking dictionary is more conventional but equally correct if implemented carefully. Both avoid the pitfall Haiku fell into by ensuring the synchronous insert precedes all await points.

4. **Pedagogical value correlates with real-world usefulness.** Sonnet's table and timeline explanation would help a developer not just fix this bug but understand and avoid similar patterns. For a benchmark task, the model that teaches is more valuable than the model that merely solves, especially when cost is comparable.

## Further reading

- [asyncio.Task documentation](https://docs.python.org/3/library/asyncio-task.html) explaining the difference between `create_task` and `ensure_future`, and when suspension points occur.
- [PEP 492: Coroutines with async and await syntax](https://www.python.org/dev/peps/pep-0492/) defining the async/await model and suspension semantics.
- [Cache stampede on Wikipedia](https://en.wikipedia.org/wiki/Cache_stampede) covering the general phenomenon and mitigation strategies.
- [asyncio event loop documentation](https://docs.python.org/3/library/asyncio-eventloop.html) detailing how the event loop schedules coroutines and where interleaving can occur.

## Frequently asked

### What is a cache stampede?

A cache stampede occurs when multiple concurrent requests detect a cache miss simultaneously, each launching its own expensive fetch operation instead of waiting for a single shared fetch. The last result wins, overwriting earlier ones and wasting resources. In async Python, this happens because await suspends a coroutine, allowing others to run before the cache write completes.

### Why does Haiku's fix have a problem?

Haiku creates a task and assigns it to `in_flight` inside the if-block, but nothing guarantees that the check and the assignment run as one uninterrupted step. `asyncio.create_task()` does not yield on its own, so the gap opens only where an `await` falls between checking `if key in in_flight` and assigning `in_flight[key] = task`. At that point another coroutine can run, see the same empty slot, and create its own task, so both callers fetch independently. In cooperative multitasking, the registration has to happen before the first suspension point to close the window.

### How does Opus prevent duplicate fetches?

Opus stores a Future (not a Task) directly in the cache dictionary immediately after creation, before any await. Because the check-and-insert is performed in sequence without a suspension point, concurrent callers are guaranteed to find the same Future and await it together, eliminating duplicate fetches.

### What's the latency and cost tradeoff between these models?

Haiku is fastest (4.4s) and cheapest (0.24 cents), producing a working but subtly flawed solution. Sonnet is slowest (15.2s) and costs 1.18 cents, delivering the most thorough explanation and a correct fix with clear documentation. Opus falls in between on latency (12.8s) but is the most expensive (5.4 cents), providing an elegant minimal fix but less pedagogical detail than Sonnet.