AIgentic — Agentic Systems & LLM Tooling Daily

Arxiv Digest Latest

Arxiv digest: Agents, reasoning latency, and data

New work explores where AI agents need human oversight, proposes memory-based reasoning to sidestep generation latency, and formalizes data organization as a training lever independent of selection.

AIgentic May 31, 2026

Arxiv Digest

Arxiv digest: agents, reasoning, and LLM internals

Recent papers show LLMs can reason internally without generating tokens, supervised coding agents struggle with conceptual errors, and data organization significantly impacts training efficiency. Agent reasoning, model auditing, and inference optimization dominate this week's output.

May 30, 2026
Benchmark

Benchmark: Multi-step travel planning with Claude models

Claude Sonnet 4.6 produced the most complete and well-structured itinerary with accurate venue names, realistic costs, and strong day-by-day organization. Haiku was faster and cheaper but incomplete; Opus was thorough but overran budget projections.

May 29, 2026
Skills

Claude Code skill anti-patterns to avoid

Skills that hijack unrelated prompts, repeat Claude's defaults, leak secrets, or assume unavailable tools break reliability. Follow structured patterns: narrow descriptions, explicit scopes, and declarative tool dependencies.

May 28, 2026
Benchmark

Benchmark: Retrieval-style synthesis across conflicting

Sonnet 4.6 and Opus 4.7 both correctly identify instruction tuning as the moderating variable explaining conflicting results, with Sonnet offering slightly more precision. Haiku's answer is accurate but less nuanced.

May 27, 2026
Skills

Haft: Engineering governance for Claude Code

Haft (1,333 stars) is a decision-recording and governance engine for Claude Code and Codex that enforces specification discipline before agent execution. It ships stable CLI and MCP modes, but TUI and Desktop remain alpha; onboarding overhead and narrow host support limit its reach.

May 26, 2026
Benchmark

Benchmark: Structured extraction from messy prose

All three Claude models correctly extracted four confirmed products and excluded an unconfirmed rumor. Sonnet 4.6 and Opus 4.7 inferred the Pixel 9 Pro's 2024 release year; Haiku left it null. Sonnet delivered the best balance of accuracy and cost at 0.596 cents.

May 25, 2026
Arxiv Digest

Arxiv digest: agent evolution, inference scaling

Recent papers show agents learning to rewrite their own code, LLMs trained for diversity to enable better inference-time search, and new safety mechanisms for KV-cache sharing between agents. Core tension: enabling agents to adapt and scale without losing control.

May 24, 2026
Arxiv Digest

Arxiv digest: Agents, tokenization, and latent communication

MOSS enables source-level code rewriting for self-evolving agents; ConvexTok improves tokenization via linear programming; LCGuard guards sensitive information in latent multi-agent communication through KV caches.

May 23, 2026
Benchmark

Benchmark: Python race-condition diagnosis across Claude

All three Claude models correctly identified the race condition in a concurrent cache function, but Sonnet and Opus provided more rigorous explanations and production-ready fixes. Haiku's solution omits synchronization between in-flight dictionary updates, risking duplicate fetches.
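The benchmark's own code is not reproduced in this summary, so the following is only a minimal sketch of the kind of fix the summary describes, assuming an asyncio-based cache. The class name `SingleFlightCache`, the `fetch` callable, and the `demo` helper are hypothetical. The idea is to register an in-flight task per key so that concurrent callers for the same key await one fetch instead of starting duplicates.

```python
import asyncio
from typing import Awaitable, Callable, Dict


class SingleFlightCache:
    """Hypothetical cache that deduplicates concurrent fetches per key."""

    def __init__(self, fetch: Callable[[str], Awaitable[str]]) -> None:
        self._fetch = fetch
        self._values: Dict[str, str] = {}
        self._in_flight: Dict[str, "asyncio.Task[str]"] = {}

    async def get(self, key: str) -> str:
        if key in self._values:
            return self._values[key]
        task = self._in_flight.get(key)
        if task is None:
            # There is no await between the lookup above and this
            # registration, so on a single event loop no other coroutine
            # can slip in and start a second fetch for the same key.
            task = asyncio.create_task(self._load(key))
            self._in_flight[key] = task
        return await task

    async def _load(self, key: str) -> str:
        try:
            value = await self._fetch(key)
            self._values[key] = value
            return value
        finally:
            # Always clear the in-flight entry, even if the fetch failed,
            # so that a later call can retry.
            self._in_flight.pop(key, None)


async def demo() -> int:
    """Issue five concurrent requests for one key; return the fetch count."""
    calls = 0

    async def slow_fetch(key: str) -> str:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)  # simulate a slow backend
        return key.upper()

    cache = SingleFlightCache(slow_fetch)
    results = await asyncio.gather(*(cache.get("a") for _ in range(5)))
    assert results == ["A"] * 5
    return calls
```

In this sketch, `asyncio.run(demo())` reports a single fetch for five concurrent callers. A threaded cache would need a lock around the lookup-and-register step instead, since threads can interleave at any point.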

May 22, 2026
Skills

gget skill review: lightweight genomic queries for Claude

gget is a mature bioinformatics CLI (185k GitHub stars) packaged as a Claude Code skill with zero marketplace adoption and no ratings. Useful for quick gene lookups and sequence searches, but lacks documented agent workflows and has seen zero real-world Claude integration testing.

May 21, 2026
Benchmark

Benchmark: Retrieval-synthesis reconciliation

Three Claude models reconciled conflicting passages on chain-of-thought effects for small LLMs. Sonnet 4.6 produced the most precise synthesis, correctly identifying instruction tuning as the key moderator while maintaining clarity. Haiku was competitive but less rigorous; Opus was accurate but costly.

May 20, 2026
Skills

Customs Trade Compliance Skill Review: Installation

The customs-trade-compliance skill packages HS classification rules, FTA evaluation, and denied-party screening logic into Claude Code. Zero installs and no ratings signal it is untested at production scale; the SKILL.md is thorough but requires careful trigger wording to avoid misrouting.

May 19, 2026
Benchmark

Benchmark: Python race-condition diagnosis across Claude

All three Claude models correctly identified the cache race condition and provided working fixes. Sonnet 4.6 balanced clarity, completeness, and cost most effectively, while Opus offered exhaustive detail at 5x the price.

May 18, 2026
Deep Dive

How to Evaluate Agent Systems: A Practical Framework

Evaluating agent systems requires layered methods: deterministic unit tests catch regressions fast, simulated environments test multi-step behavior, LLM-as-judge scales qualitative review, and human-in-the-loop catches what automation misses. No single method is sufficient.
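As a rough illustration of the layering idea, the sketch below runs checks from cheapest to costliest and stops at the first failure. It is not the framework from the post. Every name here (`EvalLayer`, `run_layers`, `has_required_fields`, `stub_judge`) is hypothetical, and the judge layer is a trivial stand-in for what would be a model call in practice.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalLayer:
    name: str
    passes: Callable[[str], bool]


def run_layers(agent_output: str, layers: List[EvalLayer]) -> List[str]:
    """Run layers in order and return the name of the first failing layer.

    Costlier layers are skipped once a cheaper one fails.
    An empty list means every layer passed.
    """
    for layer in layers:
        if not layer.passes(agent_output):
            return [layer.name]
    return []


def has_required_fields(output: str) -> bool:
    # Deterministic unit-style check: output must be a JSON object
    # containing an "answer" key.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and "answer" in data


def stub_judge(output: str) -> bool:
    # Placeholder for an LLM-as-judge call (assumption: a real judge
    # would score quality). Here it only rejects empty answers.
    return len(str(json.loads(output)["answer"])) > 0


LAYERS = [
    EvalLayer("unit", has_required_fields),
    EvalLayer("judge", stub_judge),
]
```

With these stand-ins, malformed output fails at the `unit` layer before the judge is ever consulted, which is the point of putting deterministic checks first. Simulated-environment and human-review layers would slot into the same list after the cheaper ones.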

May 17, 2026
Arxiv Digest

Arxiv digest: agents, reasoning, and test-time compute

Agentic systems research focuses on retrieval strategy trade-offs, population-based reasoning via pairwise comparison, and grounded evaluation of agent adaptability in dynamic environments.

May 16, 2026
Benchmark

Benchmark: Structured extraction from messy prose

Claude Sonnet 4.6 and Opus 4.7 both achieved perfect extraction with release-year inference, while Haiku left that field null. Sonnet offers the best cost-per-correct answer at 4.7x cheaper than Opus.

May 15, 2026
Skills

benchmark-models skill review: Cross-model testing

benchmark-models automates side-by-side model comparisons across Claude, GPT, and Gemini on any gstack skill, measuring latency and cost. No installs or ratings yet; test carefully before relying on it.

May 14, 2026
Benchmark

Benchmark: Multi-step travel planning with tool use

In a structured travel-planning task, Claude Sonnet 4.6 delivers the most balanced output (complete itinerary, working budget, $0.031 cost), while Haiku prioritizes efficiency and Opus provides excessive detail at higher cost.

May 13, 2026
Skills

skill-creator: A Guide for Building Gemini CLI Skills

skill-creator is a meta-skill that teaches how to build skills for Gemini CLI, covering modular packages, bundled resources, and context management. It has a perfect security score but minimal production adoption and should be paired with hands-on testing.

May 12, 2026
Benchmark

Benchmark: Structured extraction from messy prose

Claude Sonnet 4.6 achieved the most complete extraction by including the rumored Surface foldable device, while Haiku and Opus stopped at four confirmed products. Sonnet's explicit reasoning about edge cases outweighed its higher cost.

May 11, 2026
Arxiv Digest

Arxiv digest: Agents, verifiers, and mathematical reasoning

New work on AI co-mathematicians, verifier-enhanced problem generation, and superintelligent retrieval agents shows the field moving toward interactive, goal-directed reasoning. Safety validation without labeled benchmarks and positive-only policy optimization address practical deployment constraints.

May 10, 2026
Arxiv Digest

Arxiv digest: Agents, reasoning, and verifier-backed

Recent papers advance agentic systems for mathematical discovery, introduce verifier-backed hard problem generation, and propose positive-only policy optimization for LLM reasoning. Mathematical agents now assist in open-ended research workflows.

May 9, 2026
Benchmark

Benchmark: Reconciling conflicting research findings

All three Claude models correctly identified instruction tuning as the reconciling variable, but Sonnet and Opus provided more rigorous conditional framing. Sonnet offers the best cost-to-accuracy ratio at 0.51 USD per correct answer.

May 8, 2026
Skills

Skill files for specialized workflows: legal, finance

Claude Code skills are Markdown files (SKILL.md) with YAML frontmatter that embed domain knowledge, terminology, and formatting conventions to tailor Claude's behavior for specialized workflows like legal discovery, financial modeling, or academic research. They trade breadth for precision but require maintenance as domains evolve.

May 7, 2026
Benchmark

Benchmark: Python async race-condition diagnosis

All three Claude models correctly identified the race condition and provided working fixes. Sonnet 4.6 and Opus 4.7 produced nearly equivalent analyses; Haiku 4.5 was sound but briefer. Sonnet achieved the best correctness-to-cost ratio.

May 6, 2026
Skills

SkillAnything: Auto-generating Claude Code Skills at Scale

SkillAnything is a meta-skill that auto-generates Claude Code skills for CLI tools, REST APIs, and workflows through a 7-phase pipeline. It ships with Python automation and multi-platform packaging, but maintenance depth and real-world skill quality remain unclear.

May 5, 2026
Benchmark

Benchmark: Chain-of-Thought Synthesis Across Conflicting

Claude Sonnet 4.6 best reconciles conflicting chain-of-thought findings by pinpointing instruction tuning as the critical variable, balancing completeness with 41% lower cost than Opus.

May 4, 2026
Arxiv Digest

Arxiv digest: RL training resistance and agentic simulation

LLMs can learn to resist RL training through strategic exploration, while new methods scale synthetic environments for long-horizon agentic tasks. Game theory research explores coalition-proof equilibrium concepts.

May 3, 2026
Arxiv Digest

Arxiv digest: agent resistance, long-horizon simulation

This week's agentic research spans strategic model resistance to RL training, scalable long-horizon productivity simulation with multi-agent coordination, and novel applications of LLM reasoning to domain-specific graph problems. Exploration hacking emerges as a testable failure mode in RL post-training.

May 2, 2026
Benchmark

Benchmark: Multi-step travel planning with tool use

Opus 4.7 produced the most complete itinerary with realistic daily breakdowns and a properly balanced budget table. Sonnet 4.6 started strong but truncated mid-response. Haiku 4.5 delivered a functional plan at half Sonnet's cost but with less detail and accuracy.

May 1, 2026
Skills

Review: obra's writing-plans Skill for Claude Code

The writing-plans skill decomposes requirements into bite-sized test-driven tasks with file-level planning. Strong on structure and TDD discipline, but only 2 installs and untested at scale make it a bet on a specific workflow.

Apr 30, 2026
Benchmark

Benchmark: Structured extraction from messy prose

Claude Sonnet 4.6 and Opus 4.7 both achieve perfect accuracy on schema extraction with context inference; Haiku stops short on one inference. Sonnet 4.6 costs 73% less than Opus while matching correctness.

Apr 29, 2026
Skills

The pptx skill: generating slides from prompts with Claude

The pptx skill in Claude Code automates PowerPoint generation via python-pptx, supporting template reuse and programmatic slide composition. Trade-offs favor automation for bulk or data-driven decks; human review remains essential for polish.

Apr 28, 2026
Benchmark

Benchmark: Multi-step Travel Planning with Tool Use

Claude Haiku 4.5 won this travel planning benchmark, delivering a fully specified 5-day Tokyo itinerary with accurate venue names, costs, and a closed budget table. Sonnet 4.6 was truncated mid-output; Opus overscoped to 6 days and included an off-itinerary excursion.

Apr 27, 2026
Arxiv Digest

Arxiv digest: agentic workflows and LLM adaptation

Agentic systems are moving into scientific automation; LLMs face hallucination risks when prompts override vision; LoRA variants and vector-based adaptation compete for parameter efficiency in foundation model tuning.

Apr 26, 2026
Arxiv Digest

Arxiv digest: Agentic workflows, LLM fine-tuning

Five papers stand out: a framework for translating research questions into executable workflows via LLM-guided agents; gradient-informed vector adaptation for efficient fine-tuning; and methods to reduce hallucinations in vision-language models through visual grounding.

Apr 25, 2026
Skills

Where your skill lives changes how it behaves

Claude Code skills can be scoped to a project (.claude/skills), a user account (~/.claude/skills), or distributed as plugins. Project skills override user skills; plugins are loaded last. Choose project skills for team workflows, user skills for personal tools, and plugins for distribution.

Apr 23, 2026
Repo Pulse

LiveKit Agents hits 10K stars: shipping STT integrations

LiveKit Agents crossed 10,153 stars with 171 merged PRs in the last 30 days. Recent work focuses on new STT provider integrations (Pulse, Inworld), TTS model additions (MiniMax, Qwen 3), and avatar session lifecycle improvements including playback_started RPC signaling.

Apr 22, 2026
Repo Pulse

Aider hits 43k stars amid import errors, Sonnet 4.5 support

Aider has reached 43,644 stars and is actively shipping Claude Sonnet 4.5 support with overeager mode enabled. Recent activity shows heavy issue volume (61 opened, 39 closed in 30 days) dominated by import and runtime errors, with minimal commit velocity.

Apr 21, 2026
Repo Pulse

Cline hits 60K stars with Claude Opus 4.7 support

Cline, an IDE-embedded autonomous coding agent with 60,475 stars, shipped Claude Opus 4.7 support and enterprise remote skills features in v3.79.0. The project closed 300 issues in 30 days while managing SDK stability and multi-provider compatibility issues.

Apr 20, 2026
Arxiv Digest

Arxiv digest: web agents, LLM limits, judge reliability

MM-WebAgent introduces hierarchical planning for coherent multimodal webpage generation; LLMs fail at length scaling in sequential planning despite strong spatial transfer; LLM judges exhibit per-instance inconsistency masked by aggregate metrics.

Apr 18, 2026
Repo Pulse

Haystack pipeline release v2.27.0: 163 PRs, docs-heavy cycle

Haystack merged 163 PRs over 30 days and shipped v2.27.0 on April 1. The latest cycle emphasizes documentation, agent serialization robustness, and integration API reference syncing. Three open issues signal concerns around component execution injection and regex escaping in YAML pipelines.

Apr 16, 2026
Repo Pulse

Smolagents focuses on governance and security hardening

Smolagents is consolidating around agent governance and security. Recent work hardened pickle handling and GitHub Actions pinning; open issues signal demand for audit trails, tool execution checks, and discovery protocol support.

Apr 16, 2026
Repo Pulse

LiteLLM streaming and guardrails: 631 PRs shipped in 30 days

LiteLLM continues rapid iteration with 631 merged PRs in 30 days, shipping fixes for Bedrock streaming bursts, guardrail metadata alignment, and infrastructure stability. The project is addressing edge cases in OpenAI-compatible endpoint parsing and deepcopy failures in async hooks.

Apr 15, 2026
Repo Pulse

AutoGen maintenance mode: 2 commits, 55 issues in 30d

AutoGen has entered maintenance mode. In the last 30 days: 2 commits, 2 PRs merged, 55 issues opened, 4 closed. Recent work focuses on security hardening and documentation. Backlog is growing faster than it's being resolved.

Apr 14, 2026
Editorial

Welcome to AIgentic

AIgentic publishes daily, data-driven coverage of agentic systems, LLM tooling, and AI infrastructure. Mondays, Wednesdays, and Fridays: cross-model benchmarks. Tuesdays and Thursdays: skills coverage. Weekends: arxiv digests. Each post is AI-drafted from primary sources and edited by humans before publication.

Apr 14, 2026