---
title: "Arxiv digest: agents, reasoning, and test-time compute"
description: "This week's arxiv highlights advances in agentic search, test-time reasoning scaling, and mechanistic interpretability for LLM systems."
tldr: "Agentic systems research focuses on retrieval strategy trade-offs, population-based reasoning via pairwise comparison, and grounded evaluation of agent adaptability in dynamic environments."
url: "https://aigentic.blog/arxiv-digest-agents-reasoning-compute"
publishedAt: "2026-05-16T13:00:23.340Z"
updatedAt: "2026-05-16T13:00:23.340Z"
category: "arxiv-digest"
tags: ["arxiv","research","agentic-systems","llm-reasoning","test-time-compute"]
---

# Arxiv digest: agents, reasoning, and test-time compute

> Agentic systems research focuses on retrieval strategy trade-offs, population-based reasoning via pairwise comparison, and grounded evaluation of agent adaptability in dynamic environments.

This week's arxiv highlights point to a maturing agentic research agenda: researchers are moving beyond single-model benchmarks toward realistic tool-calling trade-offs, grounded simulation for measuring adaptive capability, and test-time compute strategies that scale reasoning without architectural changes.

## Top 5

| Rank | Title | Authors | Score | Why it matters |
|------|-------|---------|-------|----------------|
| 1 | [Is Grep All You Need? How Agent Harnesses Reshape Agentic Search](http://arxiv.org/abs/2605.15184v1) | Sahil Sen, Akhil Kasturi, et al. | 9/10 | First systematic comparison of retrieval + agent architecture trade-offs; directly informs production RAG pipelines. |
| 2 | [OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation](http://arxiv.org/abs/2605.15177v1) | Shang Zhou, Wenhao Chai, et al. | 8/10 | Population-based test-time compute eliminates noisy pointwise grading; +405 Elo on competitive coding. |
| 3 | [FutureSim: Replaying World Events to Evaluate Adaptive Agents](http://arxiv.org/abs/2605.15188v1) | Shashwat Goel, Nikhil Chandak, et al. | 8/10 | Chronological replay of real events measures agent adaptability; reveals poor calibration in frontier models. |
| 4 | [ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both](http://arxiv.org/abs/2605.15198v1) | Ziyu Guo, Rain Liu, et al. | 7/10 | Single functional token bridges agentic and latent reasoning; reduces context-switching overhead in visual tasks. |
| 5 | [When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability](http://arxiv.org/abs/2605.15183v1) | ML Nissen Gonzalez, Melwina Albuquerque, et al. | 7/10 | Weight-space invariant metric for mechanistic equivalence; enables verification of agent component fidelity. |

## Flagship: Is Grep All You Need? How Agent Harnesses Reshape Agentic Search

[This paper](http://arxiv.org/abs/2605.15184v1) directly addresses a blind spot in agentic systems literature: while retrieval-augmented generation (RAG) is standard in production agent loops, there has been no systematic empirical comparison of how retrieval strategy choice interacts with agent architecture and tool-calling paradigm. The authors report findings from two carefully designed experiments using a custom agent harness called Chronos and provider-native CLI implementations.

The core problem is practical and urgent. Recent advances in LLM agents enable complex agentic workflows where models autonomously retrieve information, call tools, and reason over large corpora. Yet existing literature largely treats retrieval and agency as orthogonal concerns. This gap has real consequences: practitioners lack guidance on whether to deploy grep-style exact matching, vector semantic search, or hybrid approaches, and how tool output presentation affects reasoning quality.

Experiment 1 evaluates grep and vector retrieval on a 116-question sample from LongMemEval. The study is careful to isolate variables: both retrieval strategies serve the same agent, and the researchers measure not just accuracy but also how agent behavior degrades as irrelevant context increases around the target information. Experiment 2 extends this by comparing agent harness choice (Chronos versus provider-native tools) and how each tool-calling paradigm handles retrieved content presentation.

Key findings underscore the non-obvious interaction between retrieval and agent design. Vector retrieval does not uniformly outperform grep despite its higher cost and latency. Performance depends on how much irrelevant text surrounds the target, how the agent's tool-calling interface frames the retrieved snippet, and whether the harness allows the model to reformulate queries. When search results must be presented verbatim without filtering, dense retrieval's ranking advantage can be negated by the extra retrieved text, which can confuse models with limited context windows. Grep, conversely, excels when exact string matching is feasible and context is filtered, but fails on paraphrased or fuzzy queries.
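The contrast between exact matching and similarity ranking can be illustrated with a toy sketch. This is not the paper's setup: the corpus, `grep_search`, and `vector_search` are hypothetical, and character-trigram counts stand in for a learned embedding. Trigram overlap only survives surface changes such as word reordering, so a true paraphrase would need a real embedding model.

```python
import math
import re
from collections import Counter

# Hypothetical three-document corpus, for illustration only.
CORPUS = [
    "Booked an appointment with the dentist for Friday morning.",
    "The quarterly budget review moved to next Tuesday.",
    "Remember to renew the car insurance before June.",
]

def grep_search(query, docs):
    """Return docs containing the query as a case-insensitive literal substring."""
    pattern = re.compile(re.escape(query), re.IGNORECASE)
    return [d for d in docs if pattern.search(d)]

def trigram_vector(text):
    """Count character trigrams: a crude stand-in for a learned embedding."""
    normalized = re.sub(r"[^a-z0-9 ]", "", text.lower())
    return Counter(normalized[i:i + 3] for i in range(len(normalized) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def vector_search(query, docs, top_k=1):
    """Rank docs by trigram cosine similarity to the query and keep the top_k."""
    q = trigram_vector(query)
    ranked = sorted(docs, key=lambda d: cosine(q, trigram_vector(d)), reverse=True)
    return ranked[:top_k]

# Exact phrase present in the corpus: grep finds it directly.
exact_hits = grep_search("car insurance", CORPUS)
# Same words in a different order: grep misses, similarity ranking still finds it.
reordered_grep_hits = grep_search("dentist appointment", CORPUS)
reordered_vector_hits = vector_search("dentist appointment", CORPUS)
```

Note that `vector_search` always returns something, even for an unrelated query, which is one way irrelevant context can reach the model when results are passed through unfiltered.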

The paper does not claim to resolve the grep-versus-vector debate in one direction. Instead, it maps the design space: tool output presentation (raw versus processed), context window handling (filtered versus full), and query reformulation capabilities all interact with retrieval strategy. Practitioners deploying agents must make trade-offs depending on their corpus structure, model context window, and tolerance for latency.

The study has limitations. The evaluation uses only 116 questions, limiting generalization to diverse tasks. It draws on a single benchmark (LongMemEval) and does not evaluate structured queries or multi-step retrieval cascades. The authors acknowledge that frontier models with longer context windows may shift the cost-benefit curves significantly. Finally, the study does not measure end-to-end system latency or cost, which are critical for production deployment.

Despite these boundaries, the paper makes a methodological contribution: it demonstrates that agent research must include the harness and tool interface as first-class variables, not afterthoughts. Subsequent work on agentic RAG should adopt similar systematic breakdowns of retrieval and interface choices.

## Also noteworthy

- [OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation](http://arxiv.org/abs/2605.15177v1) proposes a population-based test-time compute framework that scales breadth by sampling multiple reasoning traces in parallel, then aggregates pairwise LLM comparisons via Bradley-Terry ranking instead of relying on noisy pointwise grading. It reports a +405 Codeforces Elo improvement on Gemini 3.1 Pro.

- [FutureSim: Replaying World Events to Evaluate Adaptive Agents](http://arxiv.org/abs/2605.15188v1) introduces a grounded simulation that replays real-world events in chronological order (news articles, question resolutions) to measure frontier agent adaptability over three months. It reveals that the best models achieve only 25% accuracy on forecasting tasks and that many perform worse than random.

- [ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both](http://arxiv.org/abs/2605.15198v1) proposes a single "functional token" that serves as both an agentic operation (code/tool call) and a latent reasoning unit, reducing context-switching overhead and enabling parallelizable training while maintaining the task generalization of latent methods.

## Takeaways

Agentic systems research is maturing beyond isolated model benchmarks toward realistic tool-calling trade-offs and evaluation frameworks grounded in dynamic environments. The week's papers reveal that retrieval strategy alone does not determine agent performance; harness design, tool output presentation, and context window handling are equally critical. Production systems will need empirical characterization of their specific corpus and model combination rather than generic "best practice" guidance.

Test-time compute scaling via breadth (population-based sampling and ranking) is gaining traction as a more practical alternative to depth-only reasoning chains, especially for competitive domains like coding. Bradley-Terry aggregation sidesteps the pointwise grading bottleneck and aligns with how humans evaluate diverse candidate solutions. Expect increased interest in multi-candidate sampling and pairwise comparison methods across inference frameworks.
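To make the aggregation step concrete, here is a minimal sketch of fitting Bradley-Terry strengths from pairwise outcomes using the standard minorization-maximization update. It is not OpenDeepThink's implementation: the function name, the iteration count, and the sample comparisons are assumptions, and the sketch assumes every candidate wins at least once and no candidate is compared with itself.

```python
from collections import defaultdict

def bradley_terry(comparisons, n_items, iterations=200):
    """Fit Bradley-Terry strengths from (winner, loser) index pairs.

    Uses the classic MM update: p_i <- wins_i / sum_j n_ij / (p_i + p_j),
    followed by normalization so the strengths sum to 1.
    """
    wins = [0] * n_items
    pair_counts = defaultdict(int)  # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1
        pair_counts[frozenset((winner, loser))] += 1

    strengths = [1.0 / n_items] * n_items
    for _ in range(iterations):
        updated = []
        for i in range(n_items):
            denom = 0.0
            for pair, count in pair_counts.items():
                if i in pair:
                    (j,) = pair - {i}  # the opponent in this pair
                    denom += count / (strengths[i] + strengths[j])
            # Items that were never compared keep their current strength.
            updated.append(wins[i] / denom if denom else strengths[i])
        total = sum(updated)
        strengths = [s / total for s in updated]
    return strengths

# Hypothetical judge outcomes over three candidate solutions (winner, loser).
comparisons = [(0, 1), (0, 1), (1, 0), (0, 2), (0, 2), (1, 2), (1, 2), (2, 1)]
strengths = bradley_terry(comparisons, n_items=3)
ranking = sorted(range(3), key=lambda i: strengths[i], reverse=True)
```

Because each strength is estimated from all comparisons jointly, a single inconsistent judgment (such as candidate 2 beating candidate 1 once above) shifts the estimates without overturning the ranking.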

Agent adaptability remains poorly understood. FutureSim's chronological replay of real events reveals that even frontier models lack robust calibration on forecasting tasks. This suggests that future agentic evaluation should prioritize dynamic, time-ordered environments over static benchmarks, and that agent research should invest in understanding when and why models fail under distributional shift.
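A time-ordered evaluation loop of this kind might look like the sketch below. It is an illustration under stated assumptions, not FutureSim's harness: the `Event` fields, the `replay` function, both toy forecasters, and the choice of the Brier score as the calibration measure are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    day: int               # days since the start of the replay
    kind: str              # "article" or "resolution"
    payload: str           # article text, or the keyword of the question resolved
    outcome: bool = False  # only meaningful for resolutions

def replay(events, forecaster):
    """Feed events in time order and return the mean Brier score.

    The forecaster is asked for a probability at resolution time and only
    sees articles published before that point.
    """
    seen_articles = []
    squared_errors = []
    for event in sorted(events, key=lambda e: e.day):
        if event.kind == "article":
            seen_articles.append(event.payload)
        else:
            prob = forecaster(event.payload, list(seen_articles))
            squared_errors.append((prob - float(event.outcome)) ** 2)
    if not squared_errors:
        raise ValueError("no resolution events to score")
    return sum(squared_errors) / len(squared_errors)

def coin_flip(question, articles):
    """Baseline that ignores all evidence."""
    return 0.5

def keyword_forecaster(question, articles):
    """Toy heuristic: confident 'yes' if any seen article mentions the question keyword."""
    return 0.9 if any(question in a.lower() for a in articles) else 0.1

EVENTS = [
    Event(day=7, kind="article", payload="Launch delayed again"),
    Event(day=1, kind="article", payload="Regulator signals the merger will be approved"),
    Event(day=5, kind="resolution", payload="merger", outcome=True),
    Event(day=6, kind="resolution", payload="launch", outcome=False),
]

baseline_brier = replay(EVENTS, coin_flip)
keyword_brier = replay(EVENTS, keyword_forecaster)
```

The day-7 article is never visible when the "launch" question resolves on day 6, which is the property a static benchmark cannot enforce: the agent is scored only on what it could have known at the time.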

## Further reading

- [arxiv.org/abs/2605.15184v1](http://arxiv.org/abs/2605.15184v1): Full paper on agent harness and retrieval strategy trade-offs in agentic search systems.
- [arxiv.org/abs/2605.15177v1](http://arxiv.org/abs/2605.15177v1): OpenDeepThink population-based reasoning via Bradley-Terry aggregation preprint.
- [arxiv.org/abs/2605.15188v1](http://arxiv.org/abs/2605.15188v1): FutureSim framework for evaluating agent adaptability in chronological event replay.
- [arxiv.org/abs/2605.15183v1](http://arxiv.org/abs/2605.15183v1): Tensor similarity metric for mechanistic interpretability of neural networks.

## Frequently asked

### What retrieval strategy works best for agentic search systems?

Recent empirical work compares grep and vector retrieval on agent tasks, revealing that tool output presentation and irrelevant context handling significantly impact performance. No single strategy dominates; the choice depends on agent architecture and harness design.

### How can LLMs improve reasoning performance at test time?

Population-based methods scale breadth by sampling multiple candidates in parallel, then use pairwise LLM comparisons, aggregated with Bradley-Terry ranking, to rank and mutate high-performing branches. This avoids noisy pointwise grading.

### How should we evaluate agent capability in dynamic environments?

Grounded simulations that replay real-world events in chronological order, such as news articles and question resolutions, reveal agent weaknesses better than static benchmarks. Current frontier models show poor calibration on forecasting tasks.

### Why is mechanistic interpretability relevant to agent systems?

Understanding whether two model components compute the same function requires weight-space invariant metrics. Tensor similarity provides this via recursive algorithms, enabling verification of agent subcomponents without relying on empirical behavior alone.
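A small example shows why raw weight comparison is not enough. The sketch below is not the paper's tensor similarity metric; it only demonstrates one well-known symmetry, hidden-unit permutation, using a hypothetical one-hidden-layer ReLU network with made-up weights.

```python
def relu_mlp(x, w1, b1, w2):
    """One hidden ReLU layer followed by a linear scalar output."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))

def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units consistently across all three parameter sets."""
    return [w1[i] for i in perm], [b1[i] for i in perm], [w2[i] for i in perm]

# Made-up parameters: 2 inputs, 3 hidden units, 1 output.
w1 = [[1.0, -2.0], [0.5, 0.5], [-1.0, 3.0]]
b1 = [0.0, 1.0, -0.5]
w2 = [2.0, -1.0, 0.5]
w1_p, b1_p, w2_p = permute_hidden(w1, b1, w2, perm=[2, 0, 1])

inputs = [[1.0, 2.0], [-0.5, 0.25], [3.0, -1.0]]
weights_differ = w1 != w1_p
max_output_gap = max(
    abs(relu_mlp(x, w1, b1, w2) - relu_mlp(x, w1_p, b1_p, w2_p)) for x in inputs
)
```

The two parameter sets differ element by element yet compute the same function on every input, so a naive distance between weight matrices would report a difference where none exists functionally.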
