---
title: "benchmark-models skill review: Cross-model testing"
description: "garrytan's benchmark-models skill runs prompts through Claude, GPT, and Gemini side-by-side to compare latency, tokens, and cost. Zero installs, untested."
tldr: "benchmark-models automates side-by-side model comparisons across Claude, GPT, and Gemini on any gstack skill, measuring latency and cost. No installs or ratings yet; test carefully before relying on it."
url: "https://aigentic.blog/review-garrytan-benchmark-models"
publishedAt: "2026-05-14T13:00:27.397Z"
updatedAt: "2026-05-14T13:00:27.397Z"
category: "skills"
tags: ["claude-code","skills","data-analytics","garrytan","model-evaluation","benchmarking"]
---

# benchmark-models skill review: Cross-model testing

> benchmark-models automates side-by-side model comparisons across Claude, GPT, and Gemini on any gstack skill, measuring latency and cost. No installs or ratings yet; test carefully before relying on it.

benchmark-models is a Claude Code skill that automates cross-provider model comparison, running the same prompt through Anthropic's Claude, OpenAI's GPT, and Google Gemini to compare latency, token usage, and cost. It is intended to answer the question "which model is actually best for this skill?" with measured data rather than assumption.

| Metric | Value |
|--------|-------|
| Marketplace Rating | 0 (no ratings) |
| Installs | 0 |
| Security Score | 0/100 |
| Content Quality | 67/100 |
| Upstream GitHub Stars | 92,607 |
| Last Updated | 2026-05-14 |

This absence of installs and ratings is the first signal worth examining. Zero adoption means zero real-world feedback, zero bug reports, zero proof of correctness at scale. The 67/100 quality score (likely auto-generated from text analysis) suggests incomplete documentation, narrow capability examples, or unclear setup instructions. The high GitHub star count on an upstream project is not a proxy for this skill's reliability; it may reflect the owner's other work rather than the benchmark-models component itself.

## What the Skill Claims to Do

The SKILL.md preamble states it "runs the same prompt through Claude, GPT (via Codex CLI), and Gemini side-by-side" and "compares latency, tokens, cost, and optionally quality via LLM judge." Triggers include "cross model benchmark," "compare claude gpt gemini," and "which model should I use." The skill declares allowed tools as Bash, Read, and AskUserQuestion, meaning it can execute shell commands, read files, and prompt the user for input.

The declared use cases are narrow: model selection within gstack workflows. The preamble lists questions such as "which model is best for X" and "cross-model comparison," so if you are deciding whether to route a particular task to Claude Sonnet, GPT-4, or Gemini based on measured behavior, this is the job the skill positions itself for.

## Installation and Platform Compatibility

Skills in Claude Code live in `.claude/skills/` and are invoked by reference. This skill would install at `.claude/skills/garrytan/benchmark-models/SKILL.md`. The marketplace listing states it targets the "openclaw" platform and requires Bash, Read, and AskUserQuestion as available tools. If your Claude Code environment lacks Bash execution or the AskUserQuestion tool, the skill will not run.

The SKILL.md file shown includes a lengthy preamble block that checks for gstack configuration, session tracking, telemetry, and learnings file state before running the actual benchmark logic. This initialization overhead suggests the skill is tightly coupled to the gstack ecosystem (the parent framework). If you are using Claude Code in isolation or with a minimal gstack setup, some of that preamble may error silently or leave stale state in `~/.gstack/analytics/`.

## Mechanics and Dependencies

To function, the skill requires three API credentials:
- Anthropic API key for Claude (likely already set in Claude Code)
- OpenAI API key and Codex CLI installed locally
- Google Gemini API key

The SKILL.md does not document credential setup explicitly. It references "GPT (via Codex CLI)", meaning OpenAI's command-line tool; if you do not have Codex CLI installed and configured, that branch of the benchmark will fail. Similarly, Gemini access requires its own API key from Google. The skill does not validate any of these prerequisites before starting.
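Because the skill does not check its own prerequisites, a short preflight script can do it before you invoke the benchmark. The sketch below is an assumption-heavy illustration, not part of the skill: the environment variable names follow common provider conventions and the `codex` binary name is assumed, since the SKILL.md documents neither.

```python
import os
import shutil

# Assumed names: the skill does not document which variables or binaries it reads.
REQUIRED_ENV = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")
REQUIRED_BINARIES = ("codex",)


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready to run."""
    env = os.environ if env is None else env
    problems = [
        f"env var {name} is not set" for name in REQUIRED_ENV if not env.get(name)
    ]
    problems += [
        f"{binary} not found on PATH"
        for binary in REQUIRED_BINARIES
        if which(binary) is None
    ]
    return problems


if __name__ == "__main__":
    for problem in missing_prerequisites():
        print("MISSING:", problem)
```

Passing `env` and `which` as parameters keeps the check testable without touching your real environment. Adjust the names if your setup authenticates differently, for example if your CLI uses a login session rather than an API key.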

Once credentials are in place, the skill collects:
- **Latency**: wall-clock time per model
- **Tokens**: input and output token counts per model
- **Cost**: API pricing multiplied by token counts

Optionally, it can run an LLM judge (presumably Claude) to score output quality if the prompt produces answers you can evaluate. This is useful for tasks like code generation or summarization, but less relevant for classification or structured data extraction where correctness is binary.
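The cost metric is simple arithmetic, which makes it easy to verify by hand. The following is a minimal sketch of "pricing multiplied by token counts", assuming per-million-token rates that differ for input and output. The model names and prices are placeholders invented for illustration; they are not real rates and not the skill's own table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """USD per one million tokens. Values used below are placeholders."""
    input_per_mtok: float
    output_per_mtok: float


# Hypothetical price table; replace with each provider's current published pricing.
PRICES = {
    "model-a": Price(input_per_mtok=3.00, output_per_mtok=15.00),
    "model-b": Price(input_per_mtok=1.00, output_per_mtok=4.00),
}


def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call: tokens times the per-million rate, input and output summed."""
    price = PRICES[model]
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / 1_000_000
```

With the placeholder rates, 1,000 input tokens and 500 output tokens on `model-a` cost (1,000 × 3.00 + 500 × 15.00) / 1,000,000 = 0.0105 USD. A stale price table silently skews every comparison, so it is worth checking when the skill's rates were last updated.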

## Comparison to Alternatives

| Approach | Model Selection | Manual Setup | Automated Tracking | Output Quality Judge | Cost Tracking |
|----------|-----------------|--------------|------------------|----------------------|---------------|
| benchmark-models | Yes (3 providers) | Medium (3 APIs) | Yes | Optional (LLM judge) | Yes |
| Claude Code manual switching | Yes (1 at a time) | None | No | Manual | No |
| LangSmith or similar observability platform | Limited to instrumented calls | High | Yes | Via integrations | Via usage API |
| Simple shell loop + curl | Yes (3 providers) | High (write yourself) | No | No | Manual math |

benchmark-models sits between manual switching and a full observability platform. It is less flexible than LangSmith (which tracks any LLM call), but requires far less setup than writing your own benchmark harness. However, it assumes all three API credentials are available and functional.
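For comparison, the "write your own harness" option does not have to be large. Below is a minimal sketch of that approach, assuming each provider is wrapped in a plain Python callable that takes a prompt and returns text. The `fake_fast` and `fake_broken` functions are stand-ins invented for illustration; real ones would wrap an SDK call or a CLI subprocess.

```python
import time
from typing import Callable, Dict


def benchmark(prompt: str, providers: Dict[str, Callable[[str], str]]) -> Dict[str, dict]:
    """Run one prompt through each provider callable and record wall-clock latency."""
    results = {}
    for name, call in providers.items():
        start = time.perf_counter()
        try:
            output = call(prompt)
            error = None
        except Exception as exc:
            # Record the failure and keep going so one broken provider
            # does not hide the results from the others.
            output, error = None, repr(exc)
        results[name] = {
            "latency_s": time.perf_counter() - start,
            "output": output,
            "error": error,
        }
    return results


# Stand-in providers for demonstration only.
def fake_fast(prompt: str) -> str:
    return prompt.upper()


def fake_broken(prompt: str) -> str:
    raise RuntimeError("missing credentials")


if __name__ == "__main__":
    for name, row in benchmark("hello", {"fast": fake_fast, "broken": fake_broken}).items():
        print(name, row)
```

Recording errors per provider, rather than aborting, addresses the main weakness noted above: a missing credential shows up as an explicit error row instead of a failed or misleading run.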

## Known Issues and Gaps

**Zero installs and ratings**: This is not a minor red flag. It means the skill has never been run by another user in the marketplace. You would be the first to discover any environment-specific issues, credential bugs, or output format surprises. The absence of feedback loops means the owner has not had to fix issues or clarify ambiguous documentation.

**Tight coupling to gstack**: The preamble performs heavy framework initialization (session tracking, telemetry, config reading, learnings file loading). If gstack's directory structure or binaries change, or if you are running Claude Code in a minimal environment without full gstack setup, the preamble may hang or error. The preamble does use `|| true` to suppress errors, but that masks failures silently.
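To see why `|| true` is a concern, the sketch below runs a failing command with and without the suffix and compares exit codes. It assumes a POSIX shell is available and uses `false` as a stand-in for a failing preamble step; it does not reproduce the skill's actual commands.

```python
import subprocess


def exit_code(command: str) -> int:
    """Run a command through the system shell and return its exit status."""
    return subprocess.run(command, shell=True).returncode


if __name__ == "__main__":
    # `false` always fails, standing in for a broken initialization step.
    print("without suffix:", exit_code("false"))
    print("with || true:  ", exit_code("false || true"))
```

The first call reports a non-zero status and the second reports 0, so any caller that only inspects the exit code cannot tell that the step failed.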

**Content quality score (67/100)**: This likely reflects incomplete documentation of:
- Step-by-step examples showing actual benchmark output
- Troubleshooting guide for missing API keys or Codex CLI
- Explanation of cost calculation (which provider's pricing, when last updated)
- Guidance on interpreting results (is a 10 ms latency difference significant, or just noise?)
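On the last point, a single run per model cannot separate a real latency gap from run-to-run variation. One rough way to check, sketched here as an assumption rather than anything the skill documents, is to repeat each call several times and compare the gap in means against the spread. The threshold `k` is an arbitrary heuristic, not a formal significance test.

```python
import statistics


def summarize(samples):
    """Mean and sample standard deviation of latency samples, in seconds."""
    return statistics.mean(samples), statistics.stdev(samples)


def difference_exceeds_noise(a, b, k=2.0):
    """Crude heuristic: is the gap in means larger than k times the bigger spread?"""
    mean_a, sd_a = summarize(a)
    mean_b, sd_b = summarize(b)
    return abs(mean_a - mean_b) > k * max(sd_a, sd_b)
```

For example, two sets of five samples clustered around 1.00 s with a few hundredths of a second of jitter would not pass this check, while a set clustered around 1.50 s compared against one around 1.00 s would.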

**No visibility into actual benchmark code**: The SKILL.md shown is the preamble and framework scaffolding, not the core benchmark logic. The actual model invocation and metric collection code is absent from the provided file. You cannot audit whether token counts are computed correctly, whether cost data is current, or whether the LLM judge is reliable.

**Codex CLI dependency**: The GPT leg runs through OpenAI's Codex CLI rather than a direct call to the OpenAI API. That adds a separately installed and authenticated tool to the chain, and it makes the skill harder to debug if the CLI's behavior or output format changes.

## When to Use This Skill

Use it if:
- You are making a one-time model selection decision for a gstack skill and want empirical data
- You have all three API credentials (Anthropic, OpenAI, Google) configured and tested
- Your environment has Bash, Read, and AskUserQuestion tools available
- You are patient with the possibility of hitting undocumented setup issues

Skip it if:
- You need continuous model monitoring (use an observability platform instead)
- You have only one API credential (Claude) and want to compare only across Claude models
- You are on a time-critical path and cannot afford debugging missing dependencies
- You need the skill to be production-tested and community-validated before using it

## Takeaways

The benchmark-models skill addresses a real need: data-driven model selection for agentic workflows. However, zero installs, zero ratings, and a 67/100 content quality score indicate it has not yet accumulated the community validation needed to be reliable at scale. The tight coupling to gstack and the reliance on three separate API credentials (plus a locally installed Codex CLI) introduce multiple failure points that the sparse documentation does not cover. If you have the time to troubleshoot, it may yield useful comparative metrics; otherwise, invest in an observability platform or write a simple benchmark loop that you can test and maintain directly.

## Further Reading

- [benchmark-models on agentskill.sh marketplace](https://agentskill.sh/garrytan/benchmark-models) - official skill listing and installation source
- [OpenAI platform documentation](https://platform.openai.com/docs/guides/gpt/overview) - context on the GPT side of the integration
- [Anthropic API documentation](https://docs.anthropic.com/) - Claude API reference and authentication setup
- [Google Gemini API docs](https://ai.google.dev/docs) - Gemini API setup and pricing
- [LangSmith observability platform](https://www.langchain.com/langsmith) - alternative for continuous model and cost tracking

## Frequently asked

### What does the benchmark-models skill actually do?

It runs the same prompt through Anthropic's Claude, OpenAI's GPT (via Codex CLI), and Google Gemini in sequence, collecting latency, token counts, and cost for each model. Optionally it can run an LLM-based quality judge to compare outputs. You invoke it when deciding which model fits a particular gstack skill best.

### Why should I care about this if it has zero installs and zero ratings?

If you need real data instead of intuition to choose between models for a workload, the concept is sound. But zero installs and zero ratings mean no community validation yet. Use it in a controlled setting first, and verify the output format and accuracy before rolling it into production routing logic.

### What platforms does this skill work on?

According to the listing, it targets the 'openclaw' platform and requires the Bash and Read tools, plus AskUserQuestion for interactive use. Check that your Claude Code environment has those capabilities before invoking it.

### How does this differ from Claude's built-in model comparison?

Claude Code's native environment can switch models, but this skill automates a structured benchmark across all three major providers in a single run, captures metrics, and optionally judges output quality using a separate LLM. It removes manual switching and spreadsheet work.

### What are the failure modes I should watch for?

The skill depends on three separate API credentials (Claude, GPT via Codex CLI, Gemini), cost tracking accuracy, and correct token counting per provider. Any misconfiguration silently skews results. The low quality score (67/100) and lack of usage data suggest the testing surface is narrow.
