Open agent evaluation leaderboard with multi-judge scoring, CLEAR-aligned metrics, and full transparency.
View Leaderboard · Methodology · Contribute
Most agent benchmarks measure one thing — accuracy — and publish results once. COT Bench measures what actually matters for production agents, stays fresh with automated weekly runs, and publishes every score from every judge so you can verify our work.
From the Chain of Thought podcast. Open-source judge models served on Modular MAX.
| Problem with existing benchmarks | How COT Bench solves it |
|---|---|
| Only measure accuracy | 4 CLEAR dimensions: Efficacy, Cost, Reliability, Latency |
| Single judge = single point of bias | 3 independent judges (2 open-source + 1 frontier), all scores published |
| Scoring criteria are a black box | Rubrics are code — read them in eval/scoring/rubrics.py |
| Results go stale within weeks | Automated weekly runs via GitHub Actions |
| OpenAI judges OpenAI models | Vendor-neutral: open-source judges on Modular MAX |
| No way to reproduce results | OpenInference traces for every run, compatible with Arize Phoenix |
Results will appear here after the first evaluation run. See the live leaderboard for interactive results.
| Rank | Model | CLEAR Score | Efficacy | $/Task | Reliability | Latency |
|---|---|---|---|---|---|---|
| — | Evaluation pending | — | — | — | — | — |
COT Bench aligns with the CLEAR framework, which showed that accuracy-only evaluation correlates with production success at just 0.41, while multi-dimensional evaluation reaches 0.83.
| Dimension | Weight | What it measures | How |
|---|---|---|---|
| Efficacy | 35% | Task completion + tool selection accuracy | Multi-judge LLM evaluation (consensus of 3 judges) |
| Reliability | 25% | Consistency across repeated runs | Pass@3 — same scenario run 3×, measuring score variance |
| Cost | 20% | Dollars per task | Token count × published per-token pricing |
| Latency | 20% | Speed of task completion | Wall-clock time across all agent turns |
CLEAR Score = weighted composite of all four dimensions, normalized across evaluated models. See scripts/aggregate_results.py for the exact calculation.
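The weighting scheme above can be sketched in a few lines. This is an illustrative sketch only: the min-max normalization and the inversion of lower-is-better metrics are assumptions, and `scripts/aggregate_results.py` remains the source of truth.

```python
# Illustrative CLEAR composite (assumed normalization scheme;
# see scripts/aggregate_results.py for the real implementation).
WEIGHTS = {"efficacy": 0.35, "reliability": 0.25, "cost": 0.20, "latency": 0.20}

def normalize(values, invert=False):
    """Min-max normalize across models; invert for lower-is-better metrics."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [1.0] * len(values)
    scaled = [(v - lo) / (hi - lo) for v in values]
    return [1.0 - s for s in scaled] if invert else scaled

def clear_scores(models):
    """models: list of dicts with raw efficacy, reliability, cost, latency."""
    eff = normalize([m["efficacy"] for m in models])
    rel = normalize([m["reliability"] for m in models])
    cost = normalize([m["cost"] for m in models], invert=True)    # cheaper is better
    lat = normalize([m["latency"] for m in models], invert=True)  # faster is better
    return [
        WEIGHTS["efficacy"] * e + WEIGHTS["reliability"] * r
        + WEIGHTS["cost"] * c + WEIGHTS["latency"] * l
        for e, r, c, l in zip(eff, rel, cost, lat)
    ]
```

Normalizing within the evaluated cohort means a CLEAR Score is always relative to the other models on the board, which is why scores can shift when new models are added.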
Every scenario is scored independently by three judges. We publish all individual scores plus consensus and agreement rates.
| Judge | Type | Served via | Purpose |
|---|---|---|---|
| Qwen3-235B | Open-source | Modular MAX | Primary open judge |
| DeepSeek-V3 | Open-source | Modular MAX | Second open judge (different training paradigm) |
| Claude Opus 4.6 | Frontier | Anthropic API | Reference judge for calibration |
Using two open-source judges from different training paradigms (Qwen from Alibaba, DeepSeek from DeepSeek AI) means disagreements surface genuine ambiguity rather than shared biases.
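A minimal sketch of consensus-plus-agreement scoring, assuming a mean-based consensus and a tolerance-based agreement definition; the actual orchestration lives in `eval/scoring/judge.py` and may define agreement differently.

```python
from statistics import mean

def consensus(scores, agreement_threshold=0.1):
    """Combine per-judge scores into a consensus score and an agreement rate.

    scores: dict of judge name -> score in [0, 1]. Judges whose score lies
    within agreement_threshold of the consensus count as agreeing.
    (Illustrative only; the repo's judge.py is authoritative.)
    """
    values = list(scores.values())
    consensus_score = mean(values)
    agreeing = sum(1 for v in values if abs(v - consensus_score) <= agreement_threshold)
    return consensus_score, agreeing / len(values)
```

A low agreement rate on a scenario is itself a useful signal: it flags rubric ambiguity rather than silently averaging it away.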
V1 targets 10 models across 2 domains:
| Model | Provider | Category |
|---|---|---|
| GPT-4.1 | OpenAI | Frontier |
| GPT-4.1-mini | OpenAI | Efficient |
| Claude Sonnet 4.6 | Anthropic | Frontier |
| Claude Haiku 4.5 | Anthropic | Efficient |
| Gemini 2.5 Pro | Google | Frontier |
| Gemini 2.5 Flash | Google | Efficient |
| DeepSeek-V3 | DeepSeek | Open-source |
| Qwen3-235B | Alibaba | Open-source |
| Llama 4 Maverick | Meta (via Together) | Open-source |
| Mistral Large | Mistral | Open-source |
| Domain | Description | Why it's interesting |
|---|---|---|
| Banking | Account management, transactions, fraud detection, compliance | Structured, tool-heavy, strict correctness requirements |
| Customer Success | CRM, ticketing, health scoring, escalation, onboarding | Ambiguous goals, empathy matters, multi-system coordination |
Each domain has 5 scenario categories: adaptive tool use, scope management, empathetic resolution, extreme scenario recovery, and adversarial input mitigation.
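As a rough illustration, a generated scenario might look like the record below. Every field name here is hypothetical, not the repo's actual schema; see `data/scenarios/` for the real generated JSON.

```python
# Hypothetical scenario record (illustrative fields, not the repo's schema).
scenario = {
    "domain": "banking",
    "category": "adaptive_tool_use",
    "persona": "frustrated small-business owner",
    "difficulty": "hard",
    "user_goals": [
        "verify identity",
        "dispute a duplicate charge",
        "request a replacement card",
        "confirm the dispute timeline",
        "update contact email",
    ],
    "max_turns": 10,
}
```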
```bash
git clone https://github.com/conorbronsdon/cot-bench.git
cd cot-bench
pip install -e ".[dev]"
cp .env.example .env
# Edit .env with your API keys
source .env  # or use direnv/dotenv
```

```bash
# Generate tools, personas, and scenarios for a domain
python -m scripts.generate_data --domain banking --scenarios-per-category 20
python -m scripts.generate_data --domain customer_success --scenarios-per-category 20
```

```bash
# Evaluate specific models on one domain (frontier judge only — no GPU needed)
python -m scripts.run_eval \
    --domains banking \
    --models "GPT-4.1" "Claude Sonnet 4.6" \
    --judges opus

# Full evaluation with all judges (requires MAX + GPU for open-source judges)
python -m scripts.run_eval

# Evaluate models in parallel (2 at a time)
python -m scripts.run_eval --parallel-models 2
```

```bash
python -m scripts.aggregate_results
# Outputs: data/results/leaderboard.json + data/results/latest.csv
```

```bash
# Requires GPU — see docs/max-setup.md for hardware requirements
pip install modular
max serve --model Qwen/Qwen3-235B --port 8010 &
max serve --model deepseek-ai/DeepSeek-V3-0324 --port 8011 &
```

```
┌─────────────────────────────────────────────────────────────────┐
│ Simulation Loop (per scenario, up to 10 turns)                  │
│                                                                 │
│ User Simulator ──→ Agent Under Test ──→ Tool Simulator          │
│ (GPT-4.1-mini)     (model being         (GPT-4.1-mini)          │
│                     evaluated)                                  │
│ ← ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ →      │
│ Repeats until all user goals met or max turns reached           │
└─────────────────────────┬───────────────────────────────────────┘
                          │ transcript
┌─────────────────────────▼───────────────────────────────────────┐
│ Scoring (concurrent, per scenario)                              │
│                                                                 │
│ ┌──────────────┐   ┌──────────────┐   ┌──────────────┐          │
│ │  Qwen3-235B  │   │  DeepSeek-V3 │   │  Claude Opus │          │
│ │    (MAX)     │   │    (MAX)     │   │    (API)     │          │
│ └──────┬───────┘   └──────┬───────┘   └──────┬───────┘          │
│        └──────────────────┼──────────────────┘                  │
│                   consensus score                               │
│                  + agreement rate                               │
└─────────────────────────────────────────────────────────────────┘
```
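The simulation loop in the diagram above can be sketched as a plain driver function. The callable names here are placeholders for illustration, not the `eval/simulation/runner.py` API.

```python
def run_scenario(next_user_message, agent_respond, goals_met, max_turns=10):
    """Drive one scenario: user simulator -> agent under test -> goal check.

    Placeholder interfaces (not the repo's actual API):
      next_user_message(transcript) -> str
      agent_respond(user_msg, transcript) -> str
      goals_met(transcript) -> bool
    """
    transcript = []
    for _ in range(max_turns):
        user_msg = next_user_message(transcript)        # User Simulator turn
        agent_reply = agent_respond(user_msg, transcript)  # Agent Under Test
        transcript.append({"user": user_msg, "agent": agent_reply})
        if goals_met(transcript):                       # stop once goals are satisfied
            break
    return transcript
```

The transcript this loop produces is what flows into the scoring stage, where each judge evaluates it independently.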
```
cot-bench/
├── eval/                     # Core evaluation library
│   ├── config.py             # Domains, models, judges, token costs
│   ├── tracing.py            # OpenInference/OTel trace emission
│   ├── scoring/
│   │   ├── rubrics.py        # Published evaluation rubrics (the IP)
│   │   └── judge.py          # Concurrent multi-judge orchestration
│   ├── simulation/
│   │   └── runner.py         # Multi-turn conversation engine
│   └── providers/
│       └── registry.py       # Config-driven model provider registry
├── data/
│   ├── domains/              # Domain tool + persona definitions
│   ├── scenarios/            # Test scenarios (generated JSON)
│   └── results/              # Evaluation outputs (parquet + csv + json)
├── scripts/
│   ├── run_eval.py           # Main evaluation CLI
│   ├── generate_data.py      # Synthetic scenario generation
│   └── aggregate_results.py  # Leaderboard computation
├── infra/
│   └── max_serve.py          # MAX judge server lifecycle management
├── frontend/
│   └── index.html            # Leaderboard UI (GitHub Pages)
├── tests/                    # 30 tests covering scoring, parsing, config
├── docs/                     # Detailed documentation
└── .github/workflows/        # CI + weekly eval + GitHub Pages deploy
```
- Methodology — detailed explanation of evaluation approach, scoring rubrics, and statistical methods
- Contributing — how to add models, domains, or improve the evaluation
- MAX Setup — hardware requirements and setup for running open-source judge models
- Roadmap — planned improvements and feature priorities
- Scenario generation: LLM-powered synthetic data creates realistic multi-turn conversations with 5-8 interconnected user goals per scenario, across diverse personas and difficulty levels.
- Simulation: Each model runs through every scenario in a multi-turn loop. A user simulator drives the conversation, the model under test responds and makes tool calls, and a tool simulator returns realistic responses.
- Scoring: Three independent judges evaluate the transcript against published rubrics for task completion and tool selection quality. Scores are averaged for consensus, with agreement rates published for transparency.
- Reliability: Each scenario runs 3 times to measure consistency. A model that scores 0.9 once but 0.3 the next time is less useful than one that consistently scores 0.7.
- Aggregation: All four CLEAR dimensions are normalized and weighted into a composite score, broken down by domain, category, and individual judge.
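The reliability step can be expressed as a spread-penalized mean over the repeated runs. This exact formula is an assumption for illustration; the repo's aggregation may differ.

```python
from statistics import mean, pstdev

def reliability(run_scores):
    """Illustrative Pass@3-style reliability: mean score minus the
    population standard deviation across repeated runs, floored at 0.
    (Assumed formula, not necessarily the repo's.)"""
    return max(0.0, mean(run_scores) - pstdev(run_scores))
```

Under this formula a model that scores 0.7 on every run beats one that averages 0.6 with wide swings, matching the intuition in the Reliability step above.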
- Evaluation methodology inspired by Galileo's agent-leaderboard (Apache 2.0)
- Metrics framework aligned with the CLEAR paper (Simmering et al., 2025)
- Open-source judge inference powered by Modular MAX
- Trace format follows OpenInference semantic conventions
Built by Conor Bronsdon · Chain of Thought