AIgentic

Agentic Systems & LLM Tooling Daily

Latest

Arxiv Digest

Arxiv digest: Agents, reasoning latency, and data

New work explores where AI agents need human oversight, proposes memory-based reasoning to sidestep generation latency, and formalizes data organization as a training lever independent of selection.

  1. Arxiv Digest

    Arxiv digest: agents, reasoning, and LLM internals

    Recent papers show LLMs can reason internally without generating tokens, supervised coding agents struggle with conceptual errors, and data organization significantly impacts training efficiency. Agent reasoning, model auditing, and inference optimization dominate this week's output.

  2. Benchmark

    Benchmark: Multi-step travel planning with Claude models

    Claude Sonnet 4.6 produced the most complete and well-structured itinerary with accurate venue names, realistic costs, and strong day-by-day organization. Haiku was faster and cheaper but incomplete; Opus was thorough but overran budget projections.

  3. Skills

    Claude Code skill anti-patterns to avoid

    Skills that hijack unrelated prompts, repeat Claude's defaults, leak secrets, or assume unavailable tools break reliability. Follow structured patterns: narrow descriptions, explicit scopes, and declarative tool dependencies.

  4. Benchmark

    Benchmark: Retrieval-style synthesis across conflicting

    Sonnet 4.6 and Opus 4.7 both correctly identify instruction tuning as the moderating variable explaining conflicting results, with Sonnet offering slightly more precision. Haiku's answer is accurate but less nuanced.

  5. Skills

    Haft: Engineering governance for Claude Code

    Haft (1,333 stars) is a decision-recording and governance engine for Claude Code and Codex that enforces specification discipline before agent execution. It ships stable CLI and MCP modes, but TUI and Desktop remain alpha; onboarding overhead and narrow host support limit its reach.

  6. Benchmark

    Benchmark: Structured extraction from messy prose

    All three Claude models correctly extracted four confirmed products and excluded an unconfirmed rumor. Sonnet 4.6 and Opus 4.7 inferred the Pixel 9 Pro's 2024 release year; Haiku left it null. Sonnet delivered the best balance of accuracy and cost at 0.596 cents.

  7. Arxiv Digest

    Arxiv digest: agent evolution, inference scaling

    Recent papers show agents learning to rewrite their own code, LLMs trained for diversity to enable better inference-time search, and new safety mechanisms for KV-cache sharing between agents. Core tension: enabling agents to adapt and scale without losing control.

  8. Arxiv Digest

    Arxiv digest: Agents, tokenization, and latent communication

    MOSS enables source-level code rewriting for self-evolving agents; ConvexTok improves tokenization via linear programming; LCGuard guards sensitive information in latent multi-agent communication through KV caches.

  9. Benchmark

    Benchmark: Python race-condition diagnosis across Claude

    All three Claude models correctly identified the race condition in a concurrent cache function, but Sonnet and Opus provided more rigorous explanations and production-ready fixes. Haiku's solution omits synchronization between in-flight dictionary updates, risking duplicate fetches.
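The benchmarked function and the models' fixes are not reproduced in this feed. As a rough illustration of the failure mode named above (duplicate fetches when in-flight work is not tracked), here is a minimal asyncio sketch. The names `SingleFlightCache`, `fetch`, and `demo` are hypothetical, and this is one common way to share an in-flight fetch per key, not the benchmark's reference solution.

```python
import asyncio


class SingleFlightCache:
    """Cache that shares one in-flight fetch per key among concurrent callers."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical async loader: key -> value
        self._values = {}        # completed results
        self._in_flight = {}     # key -> Task for fetches still running

    async def get(self, key):
        if key in self._values:
            return self._values[key]
        task = self._in_flight.get(key)
        if task is None:
            # There is no await between the lookup and the insert, so on a
            # single event loop no other coroutine can start a second fetch.
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
        try:
            value = await task
        finally:
            # Remove the entry only if it still points at the task we awaited.
            if self._in_flight.get(key) is task:
                del self._in_flight[key]
        self._values[key] = value
        return value


async def demo():
    calls = []

    async def slow_fetch(key):
        calls.append(key)            # record every real fetch
        await asyncio.sleep(0.01)    # simulate I/O latency
        return key.upper()

    cache = SingleFlightCache(slow_fetch)
    # Five concurrent requests for the same key.
    results = await asyncio.gather(*(cache.get("a") for _ in range(5)))
    return results, calls


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

A failed fetch is not cached in this sketch: the exception propagates to every waiter and the next call retries. Cancellation handling is deliberately left out.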

  10. Skills

    gget skill review: lightweight genomic queries for Claude

    gget is a mature bioinformatics CLI (185k GitHub stars) packaged as a Claude Code skill with zero marketplace adoption and no ratings. Useful for quick gene lookups and sequence searches, but lacks documented agent workflows and has seen zero real-world Claude integration testing.

  11. Benchmark

    Benchmark: Retrieval-synthesis reconciliation

    Three Claude models reconciled conflicting passages on chain-of-thought effects for small LLMs. Sonnet 4.6 produced the most precise synthesis, correctly identifying instruction tuning as the key moderator while maintaining clarity. Haiku was competitive but less rigorous; Opus was accurate but costly.

  12. Skills

    Customs Trade Compliance Skill Review: Installation

    The customs-trade-compliance skill packages HS classification rules, FTA evaluation, and denied-party screening logic into Claude Code. Zero installs and no ratings signal it is untested at production scale; the SKILL.md is thorough but requires careful trigger wording to avoid misrouting.

  13. Benchmark

    Benchmark: Python race-condition diagnosis across Claude

    All three Claude models correctly identified the cache race condition and provided working fixes. Sonnet 4.6 balanced clarity, completeness, and cost most effectively, while Opus offered exhaustive detail at 5x the price.

  14. Deep Dive

    How to Evaluate Agent Systems: A Practical Framework

    Evaluating agent systems requires layered methods: deterministic unit tests catch regressions fast, simulated environments test multi-step behavior, LLM-as-judge scales qualitative review, and human-in-the-loop catches what automation misses. No single method is sufficient.
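One way to picture the layering is a pipeline that runs the cheapest checks first and only hands a run to a human once automation has nothing left to say. The sketch below is an assumption-laden illustration, not the framework from the post: `Layer`, `run_layers`, the transcript fields, and the 0.7 judge threshold are all hypothetical, and the judge check reads a precomputed score instead of calling a model.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Layer:
    name: str
    # Returns a failure reason, or None when the transcript passes.
    check: Callable[[dict], Optional[str]]


def unit_check(t: dict) -> Optional[str]:
    # Deterministic and cheap: did the agent record any tool calls at all?
    return None if t.get("tool_calls") else "no tool calls recorded"


def simulation_check(t: dict) -> Optional[str]:
    # Multi-step behavior: did the simulated task run to completion?
    done = t.get("steps_completed", 0) >= t.get("steps_required", 0)
    return None if done else "task not finished in simulation"


def judge_check(t: dict) -> Optional[str]:
    # Stand-in for a model-graded score (hypothetical 0.7 threshold).
    ok = t.get("judge_score", 0.0) >= 0.7
    return None if ok else "judge score below threshold"


LAYERS: List[Layer] = [
    Layer("unit_tests", unit_check),
    Layer("simulation", simulation_check),
    Layer("llm_judge", judge_check),
]


def run_layers(transcript: dict, layers: List[Layer]) -> Tuple[str, Optional[str]]:
    """Return (layer that failed, reason), or ("human_review", None) if all pass."""
    for layer in layers:
        reason = layer.check(transcript)
        if reason is not None:
            return layer.name, reason
    # Automation found nothing, so the run is queued for a person.
    return "human_review", None
```

Stopping at the first failure keeps regressions cheap to catch. A real harness might instead collect failures from every layer.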

  15. Arxiv Digest

    Arxiv digest: agents, reasoning, and test-time compute

    Agentic systems research focuses on retrieval strategy trade-offs, population-based reasoning via pairwise comparison, and grounded evaluation of agent adaptability in dynamic environments.

  16. Benchmark

    Benchmark: Structured extraction from messy prose

    Claude Sonnet 4.6 and Opus 4.7 both achieved perfect extraction with release-year inference, while Haiku left that field null. Sonnet offers the best cost-per-correct answer at 4.7x cheaper than Opus.
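Cost gaps in this feed are quoted in two forms: as a multiple ("4.7x cheaper") and, in another extraction teaser, as a percentage ("73% less"). The small helper below converts between the two so figures from different runs can be read on one scale. It assumes "N times cheaper" means the cost is 1/N of the comparison cost, and the function names are illustrative.

```python
def percent_less(ratio: float) -> float:
    """Convert 'ratio times cheaper' into 'percent less', assuming cost = other / ratio."""
    return (1.0 - 1.0 / ratio) * 100.0


def times_cheaper(pct: float) -> float:
    """Convert 'pct percent less' into a 'times cheaper' multiple."""
    return 1.0 / (1.0 - pct / 100.0)


if __name__ == "__main__":
    print(round(percent_less(4.7), 1))   # the 4.7x figure as a percentage
    print(round(times_cheaper(73), 2))   # the 73% figure as a multiple
```

Under that assumption, 4.7x cheaper works out to about 78.7% less, and 73% less to about 3.7x cheaper.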

  17. Skills

    benchmark-models skill review: Cross-model testing

    benchmark-models automates side-by-side model comparisons across Claude, GPT, and Gemini on any gstack skill, measuring latency and cost. No installs or ratings yet; test carefully before relying on it.

  18. Benchmark

    Benchmark: Multi-step travel planning with tool use

    In a structured travel-planning task, Claude Sonnet 4.6 delivers the most balanced output (complete itinerary, working budget, $0.031 cost), while Haiku prioritizes efficiency and Opus provides excessive detail at higher cost.

  19. Skills

    skill-creator: A Guide for Building Gemini CLI Skills

    skill-creator is a meta-skill that teaches how to build skills for Gemini CLI, covering modular packages, bundled resources, and context management. It has a perfect security score but minimal production adoption and should be paired with hands-on testing.

  20. Benchmark

    Benchmark: Structured extraction from messy prose

    Claude Sonnet 4.6 achieved the most complete extraction by including the rumored Surface foldable device, while Haiku and Opus stopped at four confirmed products. Sonnet's explicit reasoning about edge cases outweighed its higher cost.

  21. Arxiv Digest

    Arxiv digest: Agents, verifiers, and mathematical reasoning

    New work on AI co-mathematicians, verifier-enhanced problem generation, and superintelligent retrieval agents shows the field moving toward interactive, goal-directed reasoning. Safety validation without labeled benchmarks and positive-only policy optimization address practical deployment constraints.

  22. Arxiv Digest

    Arxiv digest: Agents, reasoning, and verifier-backed

    Recent papers advance agentic systems for mathematical discovery, introduce verifier-backed hard problem generation, and propose positive-only policy optimization for LLM reasoning. Mathematical agents now assist in open-ended research workflows.

  23. Benchmark

    Benchmark: Reconciling conflicting research findings

    All three Claude models correctly identified instruction tuning as the reconciling variable, but Sonnet and Opus provided more rigorous conditional framing. Sonnet offers the best cost-to-accuracy ratio at 0.51 USD per correct answer.

  24. Skills

    Skill files for specialized workflows: legal, finance

    Claude Code skills are Markdown files (SKILL.md) with YAML frontmatter that embed domain knowledge, terminology, and formatting conventions to tailor Claude's behavior for specialized workflows like legal discovery, financial modeling, or academic research. They trade breadth for precision but require maintenance as domains evolve.

  25. Benchmark

    Benchmark: Python async race-condition diagnosis

    All three Claude models correctly identified the race condition and provided working fixes. Sonnet 4.6 and Opus 4.7 produced nearly equivalent analyses; Haiku 4.5 was sound but briefer. Sonnet achieved the best correctness-to-cost ratio.

  26. Skills

    SkillAnything: Auto-generating Claude Code Skills at Scale

    SkillAnything is a meta-skill that auto-generates Claude Code skills for CLI tools, REST APIs, and workflows through a 7-phase pipeline. It ships with Python automation and multi-platform packaging, but maintenance depth and real-world skill quality remain unclear.

  27. Arxiv Digest

    Arxiv digest: RL training resistance and agentic simulation

    LLMs can learn to resist RL training through strategic exploration, while new methods scale synthetic environments for long-horizon agentic tasks. Game theory research explores coalition-proof equilibrium concepts.

  28. Arxiv Digest

    Arxiv digest: agent resistance, long-horizon simulation

    This week's agentic research spans strategic model resistance to RL training, scalable long-horizon productivity simulation with multi-agent coordination, and novel applications of LLM reasoning to domain-specific graph problems. Exploration hacking emerges as a testable failure mode in RL post-training.

  29. Benchmark

    Benchmark: Multi-step travel planning with tool use

    Opus 4.7 produced the most complete itinerary with realistic daily breakdowns and a properly balanced budget table. Sonnet 4.6 started strong but truncated mid-response. Haiku 4.5 delivered a functional plan at half Sonnet's cost but with less detail and accuracy.

  30. Skills

    Review: obra's writing-plans Skill for Claude Code

    The writing-plans skill decomposes requirements into bite-sized test-driven tasks with file-level planning. Strong on structure and TDD discipline, but only 2 installs and untested at scale make it a bet on a specific workflow.

  31. Benchmark

    Benchmark: Structured extraction from messy prose

    Claude Sonnet 4.6 and Opus 4.7 both achieve perfect accuracy on schema extraction with context inference; Haiku stops short on one inference. Sonnet 4.6 costs 73% less than Opus while matching correctness.

  32. Skills

    The pptx skill: generating slides from prompts with Claude

    The pptx skill in Claude Code automates PowerPoint generation via python-pptx, supporting template reuse and programmatic slide composition. Trade-offs favor automation for bulk or data-driven decks; human review remains essential for polish.

  33. Benchmark

    Benchmark: Multi-step Travel Planning with Tool Use

    Claude Haiku 4.5 won this travel planning benchmark, delivering a fully specified 5-day Tokyo itinerary with accurate venue names, costs, and a closed budget table. Sonnet 4.6 was truncated mid-output; Opus overscoped to 6 days and included an off-itinerary excursion.

  34. Arxiv Digest

    Arxiv digest: agentic workflows and LLM adaptation

    Agentic systems are moving into scientific automation; LLMs face hallucination risks when prompts override vision; LoRA variants and vector-based adaptation compete for parameter efficiency in foundation model tuning.

  35. Arxiv Digest

    Arxiv digest: Agentic workflows, LLM fine-tuning

    Five papers stand out, among them a framework for translating research questions into executable workflows via LLM-guided agents, gradient-informed vector adaptation for efficient fine-tuning, and methods to reduce hallucinations in vision-language models through visual grounding.

  36. Skills

    Where your skill lives changes how it behaves

    Claude Code skills can be scoped to a project (.claude/skills), a user account (~/.claude/skills), or distributed as plugins. Project skills override user skills; plugins are loaded last. Choose project skills for team workflows, user skills for personal tools, and plugins for distribution.

  37. Repo Pulse

    LiveKit Agents hits 10K stars: shipping STT integrations

    LiveKit Agents crossed 10,153 stars with 171 merged PRs in the last 30 days. Recent work focuses on new STT provider integrations (Pulse, Inworld), TTS model additions (MiniMax, Qwen 3), and avatar session lifecycle improvements including playback_started RPC signaling.

  38. Repo Pulse

    Aider hits 43k stars amid import errors, Sonnet 4.5 support

    Aider has reached 43,644 stars and is actively shipping Claude Sonnet 4.5 support with overeager mode enabled. Recent activity shows heavy issue volume (61 opened, 39 closed in 30 days) dominated by import and runtime errors, with minimal commit velocity.

  39. Repo Pulse

    Cline hits 60K stars with Claude Opus 4.7 support

    Cline, an IDE-embedded autonomous coding agent with 60,475 stars, shipped Claude Opus 4.7 support and enterprise remote skills features in v3.79.0. The project closed 300 issues in 30 days while managing SDK stability and multi-provider compatibility issues.

  40. Arxiv Digest

    Arxiv digest: web agents, LLM limits, judge reliability

    MM-WebAgent introduces hierarchical planning for coherent multimodal webpage generation; LLMs fail at length scaling in sequential planning despite strong spatial transfer; LLM judges exhibit per-instance inconsistency masked by aggregate metrics.

  41. Repo Pulse

    Haystack pipeline release v2.27.0: 163 PRs, docs-heavy cycle

    Haystack merged 163 PRs over 30 days and shipped v2.27.0 on April 1. The latest cycle emphasizes documentation, agent serialization robustness, and integration API reference syncing. Three open issues signal concerns around component execution injection and regex escaping in YAML pipelines.

  42. Repo Pulse

    Smolagents focuses on governance and security hardening

    Smolagents is consolidating around agent governance and security. Recent work hardened pickle handling and GitHub Actions pinning; open issues signal demand for audit trails, tool execution checks, and discovery protocol support.

  43. Repo Pulse

    LiteLLM streaming and guardrails: 631 PRs shipped in 30 days

    LiteLLM continues rapid iteration with 631 merged PRs in 30 days, shipping fixes for Bedrock streaming bursts, guardrail metadata alignment, and infrastructure stability. The project is addressing edge cases in OpenAI-compatible endpoint parsing and deepcopy failures in async hooks.

  44. Repo Pulse

    AutoGen maintenance mode: 2 commits, 55 issues in 30d

    AutoGen has entered maintenance mode. In the last 30 days: 2 commits, 2 PRs merged, 55 issues opened, 4 closed. Recent work focuses on security hardening and documentation. Backlog is growing faster than it's being resolved.

  45. Editorial

    Welcome to AIgentic

    AIgentic publishes daily, data-driven coverage of agentic systems, LLM tooling, and AI infrastructure. Mondays, Wednesdays, and Fridays: cross-model benchmarks. Tuesdays and Thursdays: skills coverage. Weekends: arxiv digests. Each post is AI-drafted from primary sources and edited by humans before publication.
