Benchmark: Python race-condition diagnosis across Claude models
Claude Sonnet 4.6 and Opus 4.7 both correctly identify the race condition and provide production-ready fixes using in-flight deduplication and locking respectively. Haiku 4.5 is accurate but less comprehensive.
Claude Sonnet 4.6 and Opus 4.7 both produce correct, production-ready fixes for the async cache race condition, with Opus offering slightly more pedagogical depth and a second elegant alternative. Haiku 4.5 correctly identifies the bug and provides a sound lock-based solution but is less thorough in explaining the failure modes.
Task
The following Python async function intermittently returns stale data. Identify the bug, explain why it happens, and provide a minimal corrected version.

    import asyncio

    cache = {}

    async def get_or_fetch(key, fetcher):
        if key in cache:
            return cache[key]
        value = await fetcher(key)
        cache[key] = value
        return value

Answer format: three sections (“Bug”, “Why it happens”, “Fix” with code).
Results
| Model | Latency (ms) | Input tokens | Output tokens | Cost (USD) | Verdict |
|---|---|---|---|---|---|
| Claude Haiku 4.5 | 5555 | 121 | 576 | 0.003 | Correct and concise; covers lock and in-flight alternatives. |
| Claude Sonnet 4.6 | 18645 | 121 | 859 | 0.01325 | Complete; in-flight Future implementation with table and exception handling. |
| Claude Opus 4.7 | 19669 | 159 | 870 | 0.06764 | Correct on all counts; explains stale data scenario; two complete fixes. |
Analysis
Claude Haiku 4.5 delivers a crisp, correct answer. It identifies the race condition in the check-then-act pattern, explains that await creates a suspension point, and offers two working solutions: a lock-based approach using asyncio.Lock and an in-flight task deduplication strategy. The lock implementation is straightforward; the in-flight alternative is elegant. Haiku’s explanation of why the bug happens is accurate but briefer than the others, omitting specific discussion of stale data caused by out-of-order completion. The response is fast and cheap, meeting the minimal requirements.
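The check-then-act race described here can be reproduced with a short script. This is an illustrative sketch and is not taken from any model's response; `slow_fetcher` and the `calls` counter are hypothetical names introduced only to make the duplicate fetch visible.

```python
import asyncio

cache = {}
calls = 0  # hypothetical counter: how many times the fetcher actually runs


async def get_or_fetch(key, fetcher):
    if key in cache:                # check
        return cache[key]
    value = await fetcher(key)      # suspension point: other coroutines run here
    cache[key] = value              # act
    return value


async def slow_fetcher(key):
    """Stand-in for a network call (hypothetical)."""
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return f"value-for-{key}"


async def main():
    # Two concurrent callers ask for the same key.
    return await asyncio.gather(
        get_or_fetch("a", slow_fetcher),
        get_or_fetch("a", slow_fetcher),
    )


results = asyncio.run(main())
print(calls)  # 2: both callers saw a cache miss before either wrote a value
```

Both callers pass the membership check before either reaches the write, so the fetcher runs twice for a single key.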
Claude Sonnet 4.6 goes deeper. It provides a detailed, step-by-step race sequence showing exactly how two coroutines can both see a cache miss, suspend at await, and duplicate the fetch. Critically, Sonnet explicitly mentions the “stale or inconsistent data” problem if fetcher return values differ between invocations, which directly addresses the prompt’s phrase “intermittently returns stale data.” The fix uses in-flight tracking with an asyncio.Future, wraps it in asyncio.shield(), and includes exception handling (try/except/finally) to propagate errors to all waiters and clean up the pending dict. A comparison table shows the behavior under four scenarios. The implementation is more robust than Haiku’s; it handles errors and uses explicit futures rather than relying on task semantics.
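To make the in-flight `Future` pattern concrete, here is a minimal sketch of the approach as described in this analysis. It is a reconstruction under assumptions, not Sonnet's actual output: the `pending` dict name, the `slow_fetcher` helper, and the exact handling of cancellation are all illustrative choices.

```python
import asyncio

cache = {}
pending = {}  # assumed name: key -> Future for a fetch that is still running


async def get_or_fetch(key, fetcher):
    if key in cache:
        return cache[key]
    if key in pending:
        # Another caller is already fetching this key. shield() keeps the
        # cancellation of this waiter from cancelling the shared future.
        return await asyncio.shield(pending[key])

    fut = asyncio.get_running_loop().create_future()
    pending[key] = fut              # registered before the first await
    try:
        value = await fetcher(key)
    except asyncio.CancelledError:
        fut.cancel()                # waiters observe the cancellation too
        raise
    except Exception as exc:
        fut.set_exception(exc)      # propagate the failure to every waiter
        fut.exception()             # mark as retrieved in case nobody is waiting
        raise
    else:
        cache[key] = value
        fut.set_result(value)
        return value
    finally:
        pending.pop(key, None)      # clean up on success and on failure


calls = 0


async def slow_fetcher(key):
    """Stand-in for a network call (hypothetical)."""
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return f"value-for-{key}"


async def main():
    return await asyncio.gather(*(get_or_fetch("a", slow_fetcher) for _ in range(3)))


results = asyncio.run(main())
print(calls)  # 1: the two later callers awaited the first caller's future
```

Because the future is registered before the first `await`, no other coroutine can observe a cache miss without also seeing the pending entry.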
Claude Opus 4.7 similarly diagnoses the bug correctly and emphasizes the stale-data failure mode: if a slower fetch started earlier returns data captured from an older replica, and overwrites a fresher value already cached by a faster fetch, callers receive stale data. The explanation of why it happens is concise and accurate. Opus provides two complete, working fixes: first, a per-key asyncio.Lock with double-checked locking (re-check after acquiring the lock), and second, an in-flight task approach that stores asyncio.create_task() directly in the cache dict. Both are correct. The second approach is presented as “often cleaner” and more idiomatic. Opus’s lock cleanup (_locks.pop(key, None)) and the note on optional lock management show careful implementation thinking.
Winner and why
Claude Sonnet 4.6 is the winner. While Opus is marginally more comprehensive and provides two alternatives, Sonnet’s solution is production-ready, includes explicit exception handling and cleanup, and adds a comparison table that makes the fix’s correctness immediately clear. Sonnet’s in-flight Future approach, wrapped with asyncio.shield() and try/except/finally, is more defensible in edge cases (e.g., exception propagation, cancellation) than Opus’s simpler task-in-cache pattern. The cost difference (0.01325 vs. 0.06764 USD) is meaningful for a benchmark at scale: Sonnet costs less than 20% of Opus while delivering equivalent correctness and superior clarity. Haiku is excellent value (0.003 USD) and entirely correct, but its briefer treatment of stale data and lack of exception handling in the code make it marginally less suitable for production guidance. Sonnet threads the needle: thorough enough to earn confidence, lean enough to be efficient.
Takeaways
- Stale-data framing matters. All three models correctly identify the race condition, but Sonnet and Opus explicitly call out that late-completing fetches can overwrite fresher cache entries, which is the root cause of the “intermittently stale data” symptom in the prompt. Haiku mentions deduplication but not the out-of-order completion problem.
- Exception handling separates rough from production code. Sonnet’s in-flight Future approach includes try/except/finally to propagate fetcher exceptions to all waiters and clean up the pending dict even on failure. Opus’s lock solution with a simple re-check is correct but doesn’t explicitly address error propagation. Haiku’s alternatives are solid but silent on errors.
- In-flight task deduplication is more idiomatic than explicit locking for async Python. Both Sonnet and Opus prefer the pattern of storing a pending Future or Task in a dict and awaiting it from concurrent callers, treating the cache as both a results store and an in-flight task tracker. This avoids a separate lock dict and aligns with Python’s cooperative concurrency model. Haiku’s lock approach is correct but feels more like translating a threaded pattern.
- Cost scales with depth, not correctness. Haiku (0.003 USD, 5.6 s latency) and Sonnet (0.01325 USD, 18.6 s latency) are both entirely correct; Opus (0.06764 USD, 19.7 s latency) adds little beyond Sonnet but costs roughly 5x as much. For standard diagnosis tasks, Sonnet offers the best signal per token.
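The cost comparisons quoted in this write-up can be checked directly from the Results table. The snippet below is a small arithmetic sketch; the dictionary layout and the `cost_ratio` helper are illustrative, while the figures are copied from the table.

```python
# Figures copied from the Results table.
results = {
    "Claude Haiku 4.5": {"latency_ms": 5555, "output_tokens": 576, "cost_usd": 0.003},
    "Claude Sonnet 4.6": {"latency_ms": 18645, "output_tokens": 859, "cost_usd": 0.01325},
    "Claude Opus 4.7": {"latency_ms": 19669, "output_tokens": 870, "cost_usd": 0.06764},
}


def cost_ratio(a, b):
    """Return the cost of model a divided by the cost of model b."""
    return results[a]["cost_usd"] / results[b]["cost_usd"]


opus_vs_sonnet = cost_ratio("Claude Opus 4.7", "Claude Sonnet 4.6")
sonnet_share_of_opus = cost_ratio("Claude Sonnet 4.6", "Claude Opus 4.7")

print(f"Opus / Sonnet cost: {opus_vs_sonnet:.1f}x")            # 5.1x
print(f"Sonnet as share of Opus: {sonnet_share_of_opus:.1%}")  # 19.6%
```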
Further reading
- Python asyncio documentation on Locks and synchronization primitives serves as the canonical reference for async synchronization patterns in CPython.
- The check-then-act race condition pattern explained on Wikipedia covers the broader category of time-of-check-time-of-use vulnerabilities that apply here.
- “Cooperative multitasking in Python asyncio” documents suspension points and how `await` enables context switching between coroutines.
- asyncio.shield() documentation explains wrapping futures to protect against cancellation, relevant to the Sonnet solution.
- A practical guide to async caching patterns (Starlette source code) demonstrates idiomatic in-flight deduplication in production web frameworks.
Frequently asked
Why does the original cache function have a race condition?
Multiple concurrent coroutines can all pass the `if key in cache` check before any of them writes the result back. Since `await` is a suspension point, the event loop switches between coroutines in the gap between the check and the cache write, causing duplicate fetches and potential stale-data overwrites.
What are the two main approaches to fix this bug?
Per-key locking (using `asyncio.Lock`) ensures only one coroutine fetches per key while others wait. In-flight task tracking stores a pending `Future` or `Task` so concurrent callers await the same fetch result, eliminating duplicates without explicit locks.
Which fix is simpler to understand and maintain?
The in-flight task approach is often simpler: store the `asyncio.create_task(fetcher(key))` in the cache dict itself, and all callers await it. This avoids a separate lock dict and is more idiomatic for async Python.
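A minimal sketch of that task-in-cache idea follows. It is an illustration under assumptions rather than any model's verbatim answer: the eviction-on-failure step and the `slow_fetcher` helper are additions made here so that a failed fetch is not cached forever.

```python
import asyncio

cache = {}  # key -> asyncio.Task (pending or finished)


async def get_or_fetch(key, fetcher):
    task = cache.get(key)
    if task is None:
        # No await between the lookup and the store, so no other
        # coroutine can run in between and start a second fetch.
        task = asyncio.create_task(fetcher(key))
        cache[key] = task
    try:
        return await task
    except BaseException:
        # Assumed policy: evict failed or cancelled fetches so a later call retries.
        if cache.get(key) is task:
            del cache[key]
        raise


calls = 0


async def slow_fetcher(key):
    """Stand-in for a network call (hypothetical)."""
    global calls
    calls += 1
    await asyncio.sleep(0.01)
    return f"value-for-{key}"


async def main():
    return await asyncio.gather(*(get_or_fetch("a", slow_fetcher) for _ in range(3)))


results = asyncio.run(main())
print(calls)  # 1: every caller awaited the same task
```

One trade-off of this simpler form: a caller that is cancelled while awaiting the shared task cancels the task for everyone, which is the edge case `asyncio.shield()` is meant to cover.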
Can the coroutine that finishes last cause stale data?
Yes, in the buggy original code. If multiple fetches run concurrently and a slower fetch (or one querying an older data source) completes last, it overwrites the cache with stale data, even though a faster fetch already stored fresher data.
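The overwrite can be demonstrated with two fetchers of different speeds against the original, unfixed function. The fetchers `slow_stale` and `fast_fresh` and their return values are hypothetical stand-ins for an older and a newer data source.

```python
import asyncio

cache = {}


async def get_or_fetch(key, fetcher):
    # The original, buggy version from the task.
    if key in cache:
        return cache[key]
    value = await fetcher(key)
    cache[key] = value
    return value


async def slow_stale(key):
    """Hypothetical slow source that returns older data."""
    await asyncio.sleep(0.05)
    return "v1 (stale)"


async def fast_fresh(key):
    """Hypothetical fast source that returns newer data."""
    await asyncio.sleep(0.01)
    return "v2 (fresh)"


async def main():
    await asyncio.gather(
        get_or_fetch("k", slow_stale),   # starts first, finishes last
        get_or_fetch("k", fast_fresh),   # finishes first, writes the fresh value
    )
    return cache["k"]


final = asyncio.run(main())
print(final)  # v1 (stale): the slower fetch overwrote the fresher value
```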