---
title: "Benchmark: Python async race-condition diagnosis"
description: "Claude Haiku, Sonnet, and Opus diagnose a cache-stampede bug in async Python. Sonnet and Opus matched on accuracy; Haiku was complete but less detailed."
tldr: "All three Claude models correctly identified the race condition and provided working fixes. Sonnet 4.6 and Opus 4.7 produced nearly equivalent analyses; Haiku 4.5 was sound but briefer. Sonnet achieved the best correctness-to-cost ratio."
url: "https://aigentic.blog/benchmark-python-async-race-condition-haiku-sonnet-opus"
publishedAt: "2026-05-06T13:00:43.599Z"
updatedAt: "2026-05-06T13:00:43.599Z"
category: "benchmark"
tags: ["benchmark","claude","async-python","race-condition","cache","concurrency"]
---

# Benchmark: Python async race-condition diagnosis

> All three Claude models correctly identified the race condition and provided working fixes. Sonnet 4.6 and Opus 4.7 produced nearly equivalent analyses; Haiku 4.5 was sound but briefer. Sonnet achieved the best correctness-to-cost ratio.

All three Claude models correctly diagnosed the race condition and provided working fixes, but they differed in depth and cost. Claude Sonnet 4.6 struck the best balance of correctness, clarity, and cost per response; Opus 4.7 was marginally more thorough but cost about five times as much as Sonnet for a response of nearly identical length; Haiku 4.5 was accurate but less detailed.

## Task

> The following Python async function intermittently returns stale data. Identify the bug, explain why it happens, and provide a minimal corrected version.
>
> ```python
> import asyncio
>
> cache = {}
>
> async def get_or_fetch(key, fetcher):
>     if key in cache:
>         return cache[key]
>     value = await fetcher(key)
>     cache[key] = value
>     return value
> ```
>
> Answer format: three sections ("Bug", "Why it happens", "Fix" with code).

## Results

| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 5494 | 121 | 555 | 0.0029 | Complete, accurate, two fix approaches |
| Claude Sonnet 4.6 | 13347 | 121 | 739 | 0.0114 | Complete, highly detailed, visual flow diagram, in-flight pattern explained |
| Claude Opus 4.7 | 14795 | 159 | 738 | 0.0577 | Complete, precise semantics, inline Future handling, exception-safe |

## Analysis

Claude Haiku 4.5 delivered a correct diagnosis in 555 output tokens. It identified the race condition, explained the TOCTOU bug clearly, and provided two working solutions: one using in-flight task tracking and one using `asyncio.Lock`. The explanations were direct and the code was sound. However, the analysis lacked the step-by-step execution trace that would help readers visualize the race, and the treatment of exception handling in the in-flight solution was implicit rather than explicit. The second, lock-based solution is correct but can be less performant than in-flight tracking for read-heavy patterns. Haiku had the lowest latency, at 5.5 seconds, and the lowest cost.

Claude Sonnet 4.6 produced a 739-token response with exceptional clarity. It included a detailed frame-by-frame execution trace showing where each coroutine suspends and resumes, making the mechanism of the bug concrete. The fix centered on in-flight task tracking, with explicit synchronous registration of the task before any `await`, preventing a secondary race between task creation and registration. A strengths table highlighted key properties: no duplicate fetches, no stale writes, exception safety, and synchronous registration. The explanation of why `await` is a suspension point was precise and pedagogically strong. Sonnet's cost was roughly 4 times Haiku's (0.0114 vs. 0.0029 USD), but for a technical benchmark where clarity is the primary goal, this is a reasonable trade-off.

Claude Opus 4.7 produced a 738-token response with the deepest semantic precision. It correctly identified the "cache stampede" / "missing single-flight" terminology, and its explanation of the race condition was tightly worded. The fix used explicit `asyncio.Future` creation and `set_result` / `set_exception` idioms rather than relying on task return values, giving the reader a more granular understanding of how asyncio manages asynchronous state. The exception-safety narrative was clear: the `finally` block ensures the in-flight entry is cleaned up even on failure, allowing retries. However, the response was essentially the same length as Sonnet's without adding proportional clarity, and the cost was nearly 20 times Haiku's. The input token count was also higher (159 vs. 121), suggesting the same prompt was tokenized differently for Opus.

All three models correctly identified the core issue: an `await` suspension point between the cache check and the write, allowing concurrent tasks to observe a stale state and race to update the cache. All three provided working fixes based on in-flight deduplication. None produced incorrect code or missed the required format.
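
The race is easy to reproduce. The sketch below runs the original function from the task with two concurrent callers; `versioned_fetcher` and its sleep durations are hypothetical, chosen so that the first fetch to start is the last to finish.

```python
import asyncio

cache = {}


async def get_or_fetch(key, fetcher):
    # Original (buggy) version from the task prompt.
    if key in cache:
        return cache[key]
    value = await fetcher(key)  # suspension point: other coroutines run here
    cache[key] = value
    return value


fetch_count = 0


async def versioned_fetcher(key):
    # Hypothetical backend: every call returns a newer "version" of the data.
    # The first call is deliberately slower, so it finishes last.
    global fetch_count
    fetch_count += 1
    version = fetch_count
    await asyncio.sleep(0.05 if version == 1 else 0.01)
    return f"{key}-v{version}"


async def demo():
    # Both callers start before either fetch completes.
    return await asyncio.gather(
        get_or_fetch("user:1", versioned_fetcher),
        get_or_fetch("user:1", versioned_fetcher),
    )


results = asyncio.run(demo())
print(fetch_count)      # two fetches ran for a single key
print(results)          # the two callers received different values
print(cache["user:1"])  # the slower, older fetch wrote last
```

Both callers pass the `key in cache` check before either write happens, so the fetcher runs twice. The slower first fetch then overwrites the newer value, which is how the cache ends up serving stale data.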

## Winner and why

Claude Sonnet 4.6 is the clear winner for this benchmark. It matched Opus 4.7 in technical correctness and surpassed it in pedagogical clarity, with the execution trace and properties table making the bug and fix both understandable and memorable. Compared to Haiku 4.5, Sonnet provided substantially more detail without unnecessary verbosity, and the cost difference (0.0114 vs. 0.0029 USD) is modest enough that the value gain justifies it for a diagnosis task where explanation quality matters. Sonnet's approach of synchronously registering the task before any `await` also makes the correctness argument explicit, reducing the risk of misunderstanding. For a developer encountering this bug in production, Sonnet's response would be the most useful first reference.

## Takeaways

1. **Correctness was table stakes; all three models solved the problem.** Haiku, Sonnet, and Opus each identified the race condition, explained why it occurs at `await` suspension points, and provided working in-flight or lock-based fixes. No model offered incorrect advice or missed the three required sections.

2. **Extra spend bought clarity only up to a point.** Sonnet's execution trace and properties table added substantial clarity over Haiku for 184 extra output tokens and 3.9x the cost, while Opus's deeper Future semantics added only marginal value over Sonnet at nearly the same length (738 vs. 739 output tokens) but about 5x the cost. For technical troubleshooting on this task, Sonnet offered the best clarity-to-cost ratio.

3. **In-flight deduplication is the canonical fix; lock-based approaches are a secondary option.** Haiku's dual solution is valuable for completeness, but both Sonnet and Opus correctly prioritized the in-flight pattern, which avoids lock contention on cache hits. A sophisticated reader would recognize this as the production-grade solution.

4. **Exception safety and synchronous registration matter in production async code.** Opus's explicit treatment of `set_exception` and Sonnet's emphasis on registering the task before any `await` both highlight a critical detail: if either step is missed, a secondary race can occur. Neither Haiku nor Sonnet was negligent here, but Opus was the most explicit, and readers writing critical infrastructure would likely prefer that explicitness.
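
The cost and token comparisons quoted in the second takeaway follow directly from the results table. A short calculation, using only the figures reported there (the dictionary layout and the `cost_ratio` helper are just for illustration):

```python
# Figures copied from the Results table.
results = {
    "Claude Haiku 4.5": {"output_tokens": 555, "cost_usd": 0.0029},
    "Claude Sonnet 4.6": {"output_tokens": 739, "cost_usd": 0.0114},
    "Claude Opus 4.7": {"output_tokens": 738, "cost_usd": 0.0577},
}


def cost_ratio(model, baseline):
    """Return how many times more the model's response cost than the baseline's."""
    return results[model]["cost_usd"] / results[baseline]["cost_usd"]


sonnet_vs_haiku = cost_ratio("Claude Sonnet 4.6", "Claude Haiku 4.5")
opus_vs_sonnet = cost_ratio("Claude Opus 4.7", "Claude Sonnet 4.6")
opus_vs_haiku = cost_ratio("Claude Opus 4.7", "Claude Haiku 4.5")
extra_tokens = (
    results["Claude Sonnet 4.6"]["output_tokens"]
    - results["Claude Haiku 4.5"]["output_tokens"]
)

print(f"Sonnet vs. Haiku: {sonnet_vs_haiku:.1f}x cost, {extra_tokens} extra output tokens")
print(f"Opus vs. Sonnet:  {opus_vs_sonnet:.1f}x cost")
print(f"Opus vs. Haiku:   {opus_vs_haiku:.1f}x cost")
```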

## Further reading

- [asyncio.Task documentation](https://docs.python.org/3/library/asyncio-task.html) - Official Python asyncio task and coroutine reference.
- [asyncio.Lock documentation](https://docs.python.org/3/library/asyncio-sync.html#lock) - Synchronization primitive for managing concurrent access.
- [PEP 492: Coroutines with async and await syntax](https://www.python.org/dev/peps/pep-0492/) - Python Enhancement Proposal defining async/await semantics and suspension points.
- [Cache stampede pattern](https://en.wikipedia.org/wiki/Cache_stampede) - Wikipedia entry on the cache stampede problem and mitigation strategies.
- [asyncio.Event and asyncio.Condition documentation](https://docs.python.org/3/library/asyncio-sync.html) - Alternative synchronization mechanisms for coordinating concurrent tasks.

## Frequently asked

### What is a TOCTOU bug in async Python?

Time-of-check to time-of-use (TOCTOU) occurs when an `await` suspension point falls between a conditional check and an update, allowing other coroutines to act on state that is about to change. In this benchmark, two concurrent callers (call them Task A and Task B) both see the cache miss before either writes, triggering duplicate fetches.

### Why does the last writer win in this race condition?

Both fetches run concurrently and write to the same cache key. Whichever coroutine finishes last executes `cache[key] = value` after the earlier one, overwriting the earlier value. This can break consistency: callers may have already received one value while the cache holds another.

### What is the in-flight pattern and why does it solve this?

The in-flight pattern stores a single shared Future or Task for each key before starting the actual fetch. All concurrent callers detect the in-flight entry and await the same Future, ensuring only one fetch runs per key and all callers receive the identical result.

### Is asyncio.Lock a better solution than in-flight tracking?

Both work, but in-flight tracking is typically more efficient for read-heavy workloads: cache hits return without any synchronization, while a lock acquired before the cache check adds overhead even on repeated hits. The choice depends on contention patterns.