AIgentic

Agentic Systems & LLM Tooling Daily

arXiv digest: agent evolution, inference scaling

Recent papers show agents learning to rewrite their own code, LLMs trained for diversity to enable better inference-time search, and new safety mechanisms for KV-cache sharing between agents. Core tension: enabling agents to adapt and scale without losing control.


This week's arXiv papers cluster around three themes: agents that can rewrite their own implementation, LLMs trained for diversity to unlock deeper inference-time search, and early safety mechanisms for multi-agent KV-cache communication. The direction is clear: agentic systems are moving from static post-training artifacts toward runtime adaptation, vector-valued rewards, and controlled information sharing.

Top 5

| Rank | Title | Authors | Score | Why it matters |
|---|---|---|---|---|
| 1 | MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems | Cai, Zhang, et al. | 9/10 | Agents modify their own code, not just prompts. Code edits are Turing-complete and do not erode under long-context drift. |
| 2 | Vector Policy Optimization: Training for Diversity Improves Test-Time Search | Bahlous-Boldi, Puri, et al. | 8.5/10 | GRPO replacement that trains for rollout diversity. Enables deeper inference-time search trees and filters. |
| 3 | LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems | Asif, Amiri, et al. | 8/10 | Blocks sensitive data leakage via KV caches. First safety mechanism for latent agent communication. |
| 4 | DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback | Dong, He, et al. | 7.5/10 | Change-based checkpoint/rollback (C/R) for deep search and RL. 10 to 100x speedup over full-state duplication. Enables tree search at scale. |
| 5 | Tokenisation via Convex Relaxations | Tempus, Whittington, et al. | 7/10 | Optimal tokenizers beat BPE/Unigram. Improves bits-per-byte and downstream tasks. Certifies distance from optimum. |

Flagship: MOSS: Self-Evolution through Source-Level Rewriting in Autonomous Agent Systems

MOSS addresses a fundamental problem in deployed agentic systems: they are static. Once shipped, they learn nothing from user interactions, and failures persist until a human-driven code update. Prior work on self-evolving agents confines adaptation to text-mutable artifacts (prompts, skill files, memory schemas, workflow graphs), leaving the agent harness, routing logic, hook ordering, state invariants, and dispatch untouched. Since these mechanisms live in code, not configuration, entire classes of structural failure are unreachable from the text layer.

Cai, Zhang, and co-authors argue that source-level adaptation is a strictly more general medium. Text edits depend on base-model compliance and erode under long-context drift; code modifications are deterministic. Because source code is Turing-complete, the set of changes expressible in code is a proper superset of every text-mutable scope. The core insight is simple: if an agent can propose a code change, test it, and commit it, it can fix not just behavior but structure.

The MOSS system operates in three phases. First, the agent instruments its own execution to capture failure signals. Second, it generates hypothesized source-level rewrites (e.g., changing a hook ordering, modifying a dispatch condition, or adding a state invariant). Third, it validates the rewrite against a test suite of past failures and new incoming requests. Only validated rewrites are committed to the codebase.
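The three phases above can be sketched as a small propose, validate, commit loop. This is an illustrative toy, not the MOSS implementation: `Case`, `evolve`, the routing handlers, and the sample traffic are all hypothetical names invented for the example, and candidate rewrites are hand-written functions rather than LLM-generated source patches.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Handler = Callable[[str], str]  # hypothetical: maps a request to a tool name


@dataclass(frozen=True)
class Case:
    request: str
    expected: str


def passes(handler: Handler, cases: List[Case]) -> bool:
    """True if the handler returns the expected tool for every case."""
    return all(handler(c.request) == c.expected for c in cases)


def evolve(current: Handler, candidates: List[Handler],
           failures: List[Case], regressions: List[Case]) -> Tuple[Handler, List[str]]:
    """Commit the first candidate that fixes the failures without breaking
    previously working cases; otherwise keep the current handler."""
    log: List[str] = []  # audit trail of every decision
    for cand in candidates:
        if passes(cand, failures) and passes(cand, regressions):
            log.append(f"commit {cand.__name__}")
            return cand, log
        log.append(f"reject {cand.__name__}")
    return current, log


def original(request: str) -> str:
    # Structural bug: the search check runs first and swallows calculator requests.
    if "find" in request:
        return "search"
    if request.startswith("calc:"):
        return "calculator"
    return "chat"


def always_calculator(request: str) -> str:
    # A bad rewrite: fixes the failure but breaks everything else.
    return "calculator"


def reordered(request: str) -> str:
    # A good rewrite: same checks, corrected ordering.
    if request.startswith("calc:"):
        return "calculator"
    if "find" in request:
        return "search"
    return "chat"


traffic = [
    Case("find papers on GRPO", "search"),
    Case("hello", "chat"),
    Case("calc: find x in 2x=4", "calculator"),
]

# Phase 1: capture failure signals from observed traffic.
failures = [c for c in traffic if original(c.request) != c.expected]
regressions = [c for c in traffic if original(c.request) == c.expected]

# Phases 2 and 3: try candidate rewrites, commit only a validated one.
chosen, log = evolve(original, [always_calculator, reordered], failures, regressions)
print(log)
```

The point of the toy is the gate: a rewrite that fixes the observed failure but breaks working cases is rejected, and every decision lands in a log that a human can audit.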

The authors demonstrate MOSS on multi-step reasoning tasks, task planning, and tool-use scenarios. In their experiments, agents using MOSS converge to stable, correct behavior within 10 to 50 interactions, whereas text-only adaptation systems plateau or diverge. The system logs every change, so humans retain auditability. Rewrites can be rolled back if downstream performance degrades.

Three limitations merit discussion. First, MOSS assumes code can be parsed and modified safely (e.g., abstract syntax tree rewriting), which may not hold for obfuscated or dynamically compiled agent implementations. Second, the paper does not address concurrent adaptation (what happens if two agents propose conflicting rewrites?) or security (how to prevent malicious rewrites in a decentralized setting). Third, and more practically, the rewrite search space grows exponentially, so the system likely relies on heuristics or LLM proposals to stay tractable. The paper is silent on how many invalid rewrites are generated per successful one.
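To make "abstract syntax tree rewriting" concrete, here is a minimal sketch using Python's standard `ast` module (`ast.unparse` requires Python 3.9+). The `route` function and the `SwapFirstTwoChecks` transformer are hypothetical examples, not anything taken from the paper; the sketch only shows what a parse, transform, regenerate cycle looks like when the source is available and parseable.

```python
import ast

# Hypothetical agent routing code with a hook-ordering bug.
SOURCE = '''
def route(request):
    if "find" in request:
        return "search"
    if request.startswith("calc:"):
        return "calculator"
    return "chat"
'''


class SwapFirstTwoChecks(ast.NodeTransformer):
    """Illustrative rewrite: swap the first two `if` statements in `route`."""

    def visit_FunctionDef(self, node: ast.FunctionDef) -> ast.FunctionDef:
        body = node.body
        if (node.name == "route" and len(body) >= 2
                and isinstance(body[0], ast.If) and isinstance(body[1], ast.If)):
            body[0], body[1] = body[1], body[0]
        return node


tree = ast.parse(SOURCE)                    # parse source into an AST
new_tree = SwapFirstTwoChecks().visit(tree)  # apply the structural edit
new_source = ast.unparse(new_tree)           # regenerate source text

# Load the rewritten function so it can be validated before any commit.
namespace = {}
exec(compile(new_source, "<rewritten>", "exec"), namespace)
route = namespace["route"]

print(new_source)
print(route("calc: find x in 2x=4"))
```

The first limitation shows up immediately: if `ast.parse` cannot read the agent's code (obfuscated, generated at runtime, or not Python at all), this whole approach has nothing to operate on.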

Despite these caveats, MOSS represents a qualitative shift in how we should think about agent post-deployment behavior. It moves agents from prompt tuning toward infrastructure modification. This is directly relevant to practitioners building production agentic workflows, especially those handling long-tail failure modes that surface only after deployment.

Also noteworthy

  • Vector Policy Optimization: Training for Diversity Improves Test-Time Search: Bahlous-Boldi et al. introduce VPO, a GRPO variant that trains language models to produce diverse rollouts by anticipating vector-valued (per-test-case, multi-objective) rewards. The key insight is that inference-time search procedures like AlphaEvolve need a diverse, high-entropy pool of candidates to explore, whereas standard scalar-reward training collapses the response distribution. Experiments show VPO enables deeper search trees and higher final accuracy on code generation and reasoning tasks.

  • LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems: Asif, Amiri, and co-authors tackle a nascent but critical problem: as multi-agent LLM systems move from natural-language communication to latent communication via KV caches for efficiency, they risk leaking sensitive context, intermediate reasoning, and agent-specific information. LCGuard learns representation-level transformations that redact KV artifacts before sharing, using contrastive losses to preserve task-relevant signals while blocking unintended information propagation. It is a first-of-its-kind defense against latent side channels in agent systems.

  • DeltaBox: Scaling Stateful AI Agents with Millisecond-Level Sandbox Checkpoint/Rollback: Dong, He, and collaborators observe that test-time tree search and on-device RL for agents require frequent checkpoint and rollback of sandbox state (files, processes, memory). Existing mechanisms duplicate the entire state, costing hundreds of milliseconds to seconds per operation. DeltaBox introduces an OS-level abstraction, DeltaState, that duplicates only the changes between consecutive checkpoints. Early results show 10 to 100x speedups, enabling practical deep search in agentic systems at scale.
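The change-based checkpoint idea in the DeltaBox item can be illustrated with a toy in-memory store. This is only a sketch of the general technique (undo logs that record what changed since the last checkpoint); `DeltaStore` and its methods are hypothetical names, and the real system works at the OS level on files, processes, and memory rather than on a Python dict.

```python
from typing import Any, Dict, List

_MISSING = object()  # sentinel: the key did not exist before it was written


class DeltaStore:
    """Toy key-value sandbox state with change-based checkpoint/rollback."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        # One undo log per interval between checkpoints; the last is still open.
        self._undo: List[Dict[str, Any]] = [{}]

    def _remember(self, key: str) -> None:
        log = self._undo[-1]
        if key not in log:  # record only the value as of the last checkpoint
            log[key] = self.state.get(key, _MISSING)

    def write(self, key: str, value: Any) -> None:
        self._remember(key)
        self.state[key] = value

    def delete(self, key: str) -> None:
        if key in self.state:
            self._remember(key)
            del self.state[key]

    def checkpoint(self) -> int:
        """Seal the current delta and return a checkpoint id. Cost is O(1)."""
        self._undo.append({})
        return len(self._undo) - 1

    def rollback(self, checkpoint_id: int) -> None:
        """Restore the state as of `checkpoint_id` by undoing later deltas."""
        if not 1 <= checkpoint_id < len(self._undo):
            raise ValueError(f"unknown checkpoint {checkpoint_id}")
        while len(self._undo) > checkpoint_id:
            for key, old in self._undo.pop().items():
                if old is _MISSING:
                    self.state.pop(key, None)
                else:
                    self.state[key] = old
        self._undo.append({})  # open a fresh delta after the restored checkpoint


store = DeltaStore()
store.write("a.txt", "v1")
store.write("b.txt", "v1")
cp = store.checkpoint()

store.write("a.txt", "v2")
store.write("c.txt", "new")
store.delete("b.txt")
touched = len(store._undo[-1])  # number of keys changed since the checkpoint

store.rollback(cp)
print(touched, store.state)
```

Checkpoint cost here is constant and rollback cost scales with the number of keys touched since the checkpoint, not with total state size, which is the property that makes frequent branching in tree search affordable.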

Takeaways

  • Agent systems are transitioning from post-training-only to runtime-adaptive architectures. MOSS demonstrates that source-level code modification is both tractable and more expressive than prompt-only tuning. Expect more work on safe code synthesis and validation for production agents in the coming months.

  • Vector rewards and inference-time diversity are emerging as pragmatic responses to the limits of scalar optimization. VPO shows that training for rollout diversity, rather than peak performance, enables better test-time search. This aligns with growing interest in scaling compute at inference (AlphaEvolve, test-time scaling) and suggests future training recipes will explicitly optimize for frontier uncertainty, not just expected reward.

  • Multi-agent LLM systems will require safety mechanisms at every information layer: text, latent representations, and KV caches. LCGuard is an early entrant, but the space is nascent. Practitioners building agent ecosystems should plan for representation-level auditing and controlled cache sharing from day one.

Frequently asked

What does source-level adaptation mean for agents?

MOSS allows agents to modify their own code (routing, hooks, state invariants) rather than just text artifacts like prompts. This is more general than editing skill files because the agent harness and dispatch logic live in code, not text. It enables agents to fix structural failures that prompt-only systems cannot reach.

Why train LLMs for diversity if the goal is correctness?

Vector Policy Optimization (VPO) recognizes that inference-time search (like AlphaEvolve) needs diverse rollouts to explore multiple solutions under different reward functions. Standard scalar-reward training collapses the response distribution; VPO maintains entropy by anticipating vector-valued rewards, so the model keeps producing lower-probability solutions that a search procedure can then filter.
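A small numeric illustration of why the per-test-case view matters for search. This is not VPO's training objective, which the summary above does not spell out; it is a hypothetical example with made-up reward vectors showing how ranking rollouts by a single scalar can discard exactly the candidates a downstream search would need.

```python
from typing import Dict, List, Set

# Hypothetical rollouts scored per test case: 1 = pass, 0 = fail.
rewards: Dict[str, List[int]] = {
    "r1": [1, 1, 0, 0],
    "r2": [1, 1, 0, 0],  # near-duplicate of r1
    "r3": [0, 0, 1, 0],  # weak on average, but solves a case r1/r2 miss
    "r4": [0, 0, 0, 1],
}


def solved(name: str) -> Set[int]:
    """Indices of the test cases this rollout passes."""
    return {i for i, v in enumerate(rewards[name]) if v}


def coverage(names: List[str]) -> int:
    """Number of test cases passed by at least one selected rollout."""
    return len(set().union(*(solved(n) for n in names))) if names else 0


def top_k_by_scalar(k: int) -> List[str]:
    """Keep the k rollouts with the highest mean reward (scalar view)."""
    return sorted(rewards, key=lambda n: sum(rewards[n]) / len(rewards[n]),
                  reverse=True)[:k]


def greedy_cover(k: int) -> List[str]:
    """Keep k rollouts, each chosen to add the most not-yet-solved cases
    (uses the full reward vector)."""
    chosen: List[str] = []
    covered: Set[int] = set()
    for _ in range(min(k, len(rewards))):
        best = max((n for n in rewards if n not in chosen),
                   key=lambda n: len(solved(n) - covered))
        chosen.append(best)
        covered |= solved(best)
    return chosen


scalar_pick = top_k_by_scalar(3)
vector_pick = greedy_cover(3)
print(scalar_pick, coverage(scalar_pick))
print(vector_pick, coverage(vector_pick))
```

With these toy numbers the scalar ranking keeps two near-identical rollouts and covers three of four test cases, while the vector-aware selection covers all four. A model whose outputs have already collapsed onto the `r1`/`r2` mode leaves the search nothing else to pick from.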

What security risks do latent KV communications pose?

When agents communicate via transformer KV caches instead of natural language, sensitive context, intermediate reasoning, and agent-specific information propagate opaquely. LCGuard learns to redact latent representations before cache sharing, blocking information leakage at the representation level rather than text level.

How does tokenization affect LLM efficiency?

ConvexTok, the method behind Tokenisation via Convex Relaxations (ranked fifth above), formulates tokenizer construction as a linear program, replacing greedy algorithms (BPE, Unigram) with a globally optimized solution. The paper reports consistent bits-per-byte improvements and downstream task gains. The approach also certifies how close a tokenizer is to the theoretical optimum.
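For readers unfamiliar with the metric, bits-per-byte normalizes a model's total negative log-likelihood by the UTF-8 byte length of the text, which makes models with different tokenizers comparable. A minimal sketch follows; the function name and the toy probabilities are assumptions for illustration, not figures from the paper.

```python
import math
from typing import List


def bits_per_byte(token_logprobs: List[float], text: str) -> float:
    """Total negative log-likelihood in bits divided by UTF-8 byte count.

    `token_logprobs` are natural-log probabilities, one per token.
    """
    n_bytes = len(text.encode("utf-8"))
    total_bits = -sum(token_logprobs) / math.log(2)  # convert nats to bits
    return total_bits / n_bytes


text = "hello world"  # 11 bytes

# Toy numbers: each token is assigned probability 1/16 (4 bits) by the model.
three_tokens = [math.log(1 / 16)] * 3  # a tokenizer that splits into 3 tokens
two_tokens = [math.log(1 / 16)] * 2    # a tokenizer that splits into 2 tokens

bpb_three = bits_per_byte(three_tokens, text)
bpb_two = bits_per_byte(two_tokens, text)
print(bpb_three, bpb_two)
```

Because the denominator is bytes rather than tokens, a tokenizer cannot improve the score just by emitting fewer tokens; it only helps if the model's total likelihood over the same text improves.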
