Benchmark: Python async race-condition diagnosis
All three Claude models correctly diagnosed the Python cache race condition and provided working fixes, but they differ in depth, efficiency, and pedagogical clarity. Sonnet 4.6 strikes the best balance between correctness and cost, delivering a complete analysis with explicit reasoning about atomicity and concurrency semantics.
Task
The following Python async function intermittently returns stale data. Identify the bug, explain why it happens, and provide a minimal corrected version.
```python
import asyncio

cache = {}

async def get_or_fetch(key, fetcher):
    if key in cache:
        return cache[key]
    value = await fetcher(key)
    cache[key] = value
    return value
```

Answer format: three sections (“Bug”, “Why it happens”, “Fix” with code).
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 5,646 | 121 | 550 | 0.00287 | Correct on core bug and fix; good scenario walkthrough; presented both lock and in-flight approaches |
| Claude Sonnet 4.6 | 19,756 | 121 | 697 | 0.01082 | Complete analysis; double-checked locking; explicit atomicity reasoning; efficiency table; best explanation of why fast-path read is safe |
| Claude Opus 4.7 | 10,551 | 159 | 660 | 0.05189 | Deepest single solution using Future for result coalescing; exception handling; explicit atomicity; most efficient approach but presented only one fix |
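As a quick check on the cost column, the pairwise cost ratios can be recomputed from the table. The snippet below is only an arithmetic sketch: the dictionary `costs_usd` and the helper `cost_ratio` are illustrative names, and the figures are copied from the table above.

```python
# Cost in USD per response, copied from the Results table.
costs_usd = {
    "Claude Haiku 4.5": 0.00287,
    "Claude Sonnet 4.6": 0.01082,
    "Claude Opus 4.7": 0.05189,
}


def cost_ratio(expensive: str, cheap: str) -> float:
    """Return how many times more `expensive` costs than `cheap`."""
    return costs_usd[expensive] / costs_usd[cheap]


opus_vs_sonnet = cost_ratio("Claude Opus 4.7", "Claude Sonnet 4.6")
opus_vs_haiku = cost_ratio("Claude Opus 4.7", "Claude Haiku 4.5")
sonnet_vs_haiku = cost_ratio("Claude Sonnet 4.6", "Claude Haiku 4.5")

print(f"Opus / Sonnet:  {opus_vs_sonnet:.1f}x")   # 4.8x
print(f"Opus / Haiku:   {opus_vs_haiku:.1f}x")    # 18.1x
print(f"Sonnet / Haiku: {sonnet_vs_haiku:.1f}x")  # 3.8x
```

In other words, Opus costs roughly 4.8 times as much as Sonnet and roughly 18 times as much as Haiku, while Sonnet costs roughly 3.8 times as much as Haiku.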
Analysis
Claude Haiku 4.5 delivered a correct and concise diagnosis. It clearly identified the race condition as a check-then-act pattern flaw, provided a concrete scenario showing two concurrent calls bypassing the cache guard, and explained the atomicity gap across await boundaries. The fix section presented two approaches: an in-flight tracking mechanism using asyncio.create_task() and a per-key asyncio.Lock(). Both are correct, though Haiku’s in-flight solution has a subtle issue: it tracks tasks in a global dict but doesn’t prevent the dict itself from becoming a contention point under very high concurrency. The lock-based alternative is more familiar to most Python developers. Haiku’s brevity is a strength for a benchmark, but it does not explain why the in-flight approach is “more efficient” as claimed; it also misses the double-checked locking pattern that prevents redundant fetches after the first coroutine wins the cache miss race.
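For readers who want to see what in-flight task tracking looks like, here is a minimal sketch of the general technique. It is not Haiku's actual output; the `in_flight` dictionary and the done-callback cleanup are assumptions made for illustration, and `fetcher` is assumed to be a coroutine function.

```python
import asyncio

cache = {}
in_flight = {}  # hypothetical name: key -> asyncio.Task for a fetch in progress


async def get_or_fetch(key, fetcher):
    if key in cache:
        return cache[key]
    task = in_flight.get(key)
    if task is None:
        # There is no await between the lookup above and the assignment
        # below, so only one coroutine can create the task for a given key.
        task = asyncio.create_task(fetcher(key))
        in_flight[key] = task
        # Drop the bookkeeping entry once the fetch finishes or fails.
        task.add_done_callback(lambda _t, k=key: in_flight.pop(k, None))
    # Every concurrent caller awaits the same task, so the fetch runs once.
    value = await task
    cache[key] = value
    return value
```

If the fetch raises, every waiter sees the exception, nothing is cached, and the next call starts a fresh fetch. One caveat of this simple version: cancelling a caller that is awaiting the shared task also cancels the task for the other waiters.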
Claude Sonnet 4.6 provided the most thorough and pedagogically complete response. It opened with a clear bug statement, walked through the exact event-loop interleaving that causes the problem, and then presented a lock-based fix with explicit double-checked locking (re-checking the cache inside the lock to avoid redundant fetches by waiters). Sonnet also included a structured comparison table explaining why the lock creation is safe (synchronous, no suspension), why the double-check matters, why per-key locking is efficient, and why a fast-path read before acquiring the lock avoids hot-path lock overhead. This design reasoning is rarely articulated in quick fixes but is critical for production code. Sonnet’s output reads like documentation for a real cache library, with attention to performance optimization and correctness subtleties. The primary weakness is that it does not present an alternative approach (such as Haiku’s in-flight task tracking or Opus’s Future-based coalescing), limiting the breadth of patterns shown.
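The double-checked locking pattern described above can be sketched as follows. This is an illustration of the technique rather than Sonnet's actual output; the `locks` dictionary is a hypothetical name, and the sketch assumes a single event loop and a coroutine-function `fetcher`.

```python
import asyncio

cache = {}
locks = {}  # hypothetical name: key -> asyncio.Lock guarding that key's fetch


async def get_or_fetch(key, fetcher):
    # Fast path: a cache hit is a plain dict read and never touches a lock.
    if key in cache:
        return cache[key]
    # setdefault is synchronous (no suspension point), so every coroutine
    # racing on the same key ends up sharing one Lock object.
    lock = locks.setdefault(key, asyncio.Lock())
    async with lock:
        # Double-check: a coroutine that waited for the lock finds the value
        # already cached by the winner and skips the redundant fetch.
        if key in cache:
            return cache[key]
        value = await fetcher(key)
        cache[key] = value
        return value
```

Because the lock is per key, fetches for different keys still run concurrently; only callers for the same missing key are serialized. In this sketch the `locks` dictionary is never pruned, which a long-lived cache would likely want to address.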
Claude Opus 4.7 took the most sophisticated single approach: storing an in-flight Future in the cache itself, populated synchronously before any suspension point. This design coalesces results without any lock: concurrent callers await the same Future, there is no serialization bottleneck, and no double-check is needed because the Future is already in the cache by the time other coroutines run. Opus also explicitly handles exception cases, evicting the cache entry on failure to allow retries and prevent cache poisoning. The explanation of why synchronous population prevents races (“before any await”) is crisp and correct. However, Opus presented only one solution and did not compare it to simpler alternatives (like Haiku’s lock approach or Sonnet’s double-checked pattern). For a developer unfamiliar with asyncio.Future semantics, the in-flight Future pattern is less intuitive than a lock. Opus’s 10.5-second latency and its cost, the highest of the three, are notable, though the output quality justifies them when a detailed answer is the goal.
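A minimal sketch of the Future-in-cache pattern is shown below. It illustrates the technique and is not Opus's actual output. Two details are assumptions added here: waiters use `asyncio.shield` so that cancelling one waiter does not cancel the shared Future, and the owner calls `fut.exception()` to mark a failure as retrieved when nobody else was waiting.

```python
import asyncio

cache = {}  # key -> asyncio.Future that holds (or will hold) the value


async def get_or_fetch(key, fetcher):
    fut = cache.get(key)
    if fut is not None:
        # Hit or fetch in progress: wait on the shared Future. shield keeps a
        # cancelled waiter from cancelling the Future for everyone else.
        return await asyncio.shield(fut)

    # Miss: publish the Future before the first await, so any coroutine that
    # runs while we are suspended finds it and waits instead of refetching.
    fut = asyncio.get_running_loop().create_future()
    cache[key] = fut
    try:
        value = await fetcher(key)
    except asyncio.CancelledError:
        cache.pop(key, None)  # evict so a later call can retry
        fut.cancel()
        raise
    except Exception as exc:
        fut.set_exception(exc)  # wake any waiters with the same error
        fut.exception()         # mark as retrieved in case nobody was waiting
        cache.pop(key, None)    # evict so a failure is never cached
        raise
    fut.set_result(value)
    return value
```

Note that in this variant the cache stores Futures rather than raw values, so every read goes through `await`, and the cached Futures belong to the event loop that created them.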
Winner and why
Sonnet 4.6 is the winner for this benchmark task. It delivers the correct diagnosis and provides a production-ready lock-based fix with double-checked locking, explains the atomicity gap across await boundaries with precision, and includes a structured reasoning table that makes the design transparent. Its cost-per-quality ratio is strong: at 0.01082 USD, it is about 4.8 times cheaper than Opus while providing nearly as thorough an explanation, and it covers both the intuitive lock-based approach and the reasoning behind its efficiency choices. Haiku is also accurate and the cheapest of the three (0.00287 USD), but it omits the double-check logic and provides less guidance on when to use each approach. Opus is the deepest technical answer, showing sophisticated understanding of Future-based coalescing and exception handling, but it costs about 4.8 times as much as Sonnet (roughly 18 times as much as Haiku) and presents only one solution. For a diagnostic benchmark task where clarity, correctness, and practical guidance matter, Sonnet strikes the best balance. It answers the prompt completely without overengineering, and its structured table makes the key insight (why double-checking and fast-path reads matter) immediately clear to readers.
Takeaways
- Concurrency diagnostics require explicit atomicity reasoning. All three models correctly identified the race condition, but Sonnet and Opus explicitly explained why the check and assignment are not atomic across `await` boundaries. This depth of explanation is critical for developers internalizing async safety patterns, not just borrowing fixes.
- Lock-based and Future-based fixes have different trade-offs. Haiku presented both, Opus showed Future-based coalescing as the most efficient, and Sonnet showed lock-based double-checking as the most intuitive. Neither is universally superior: locks serialize writes and are easier to reason about; Futures coalesce concurrent requests with no lock overhead but require familiarity with asyncio primitives. A complete diagnostic should name both.
- Cost-per-quality favors the mid-range model for diagnostic tasks. Sonnet 4.6 is about 4.8 times cheaper than Opus while delivering most of the insight, and it is more complete than Haiku at roughly 3.8 times Haiku’s cost. For bug diagnosis, Sonnet’s balance of depth and efficiency is hard to beat; Haiku is adequate for quick checks, and Opus adds marginal value that does not justify its cost in this task.
- Structured reasoning and alternatives strengthen diagnostic outputs. Sonnet’s comparison table and Opus’s exception-handling logic made their answers more actionable than Haiku’s, even though all three were correct. For benchmarks, outputs that explain why a choice was made and what trade-offs exist outperform those that merely present a single solution.
Further reading
- asyncio documentation on locks and primitives - Official Python docs covering asyncio.Lock, Condition, and synchronization patterns.
- asyncio.Future API reference - Python standard library documentation for Future objects and result coalescing.
- PEP 492: Coroutines with async and await - Foundational specification for async/await semantics and suspension points.
- Nystrom’s “Building an event loop” article - Practical explanation of how coroutine suspension interleaves with the event loop.
Frequently asked
Why is the original cache function unsafe across concurrent calls?
The check `if key in cache` and the assignment `cache[key] = value` are separated by an `await` suspension point, so together they are not atomic. While one coroutine is suspended in `await fetcher(key)`, other coroutines can run, see the same cache miss, and start their own fetches. Each one writes back when its fetch completes, and the last writer wins, so an older result can overwrite a newer one.
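The interleaving is easy to reproduce. The sketch below reuses the original, unfixed function from the task; `slow_fetcher` and `main` are hypothetical helpers added for the demonstration, and the key `"user:1"` is arbitrary.

```python
import asyncio

cache = {}


async def get_or_fetch(key, fetcher):
    # Original, unfixed version from the task.
    if key in cache:
        return cache[key]
    value = await fetcher(key)
    cache[key] = value
    return value


async def main():
    cache.clear()
    calls = 0

    async def slow_fetcher(key):
        nonlocal calls
        calls += 1
        version = calls          # remember which call this was
        await asyncio.sleep(0)   # suspension point: other coroutines run here
        return f"{key}-v{version}"

    # Three concurrent lookups for the same missing key.
    results = await asyncio.gather(
        *(get_or_fetch("user:1", slow_fetcher) for _ in range(3))
    )
    return calls, results


if __name__ == "__main__":
    calls, results = asyncio.run(main())
    print(f"fetcher ran {calls} times")  # fetcher ran 3 times
    print(results)
```

All three callers pass the membership check before any of them writes to the cache, so the fetcher runs three times for a single key and each caller receives a different value.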
What's the difference between the lock-based and Future-based fixes?
The lock-based fix (Sonnet’s approach, and one of the two Haiku presented) uses `asyncio.Lock` to serialize access: concurrent callers for the same key wait until the first fetch completes, then reuse the cached result. The Future-based fix (Opus’s solution) stores an in-flight Future in the cache synchronously, allowing concurrent callers to await the same result without lock overhead.
Why does storing a Future in the cache work better than using a lock?
Storing a Future synchronously (before any `await`) ensures all concurrent callers see the same in-flight request and await it together. This avoids lock contention entirely. Subsequent callers for the same key immediately hit the cached Future and wait for its result, with no duplicate fetches.
What happens if the fetcher raises an exception in the Future-based approach?
The exception is caught, set on the Future with `set_exception()`, the cache entry is evicted with `cache.pop(key, None)`, and the exception is re-raised. This prevents caching failures and allows the next caller to retry, avoiding permanent cache poisoning.