Automatically track trending benchmarks across LLMs, VLMs, and audio-language models
View Latest Benchmark Report →
Key Findings (2026-04-11):
- 263 unique benchmarks discovered from 11,281 mentions
- 167 models analyzed from 18 major labs
- Vision + Text extraction - both sources used for complete coverage
- Top benchmarks: MMLU Pro (57 models), MMLU (57), GPQA Diamond (48)
- 14 categories: Vision (61 benchmarks), Coding (37), Knowledge (25), Math (24)
- Unicode normalization: τ²-Bench = τ2-Bench merged successfully
This AI agent automatically:
- Discovers trending models from major labs (Qwen, Meta, Mistral, Google, Microsoft, etc.)
- Extracts benchmarks from model cards and arXiv papers using AI (text + vision extraction)
- Analyzes charts & figures using Claude vision AI to extract benchmarks from PDFs
- Consolidates variations (GSM8K = gsm8k = GSM-8K, τ²-Bench = τ2-Bench) using Unicode normalization and AI validation
- Classifies benchmarks into categories using Claude AI
- Tracks trends over time with SQLite caching and snapshots
- Generates reports showing evolution and emerging patterns
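The consolidation step above leans on Unicode normalization. The agent's actual implementation is not shown in this README; a minimal sketch of how NFKC normalization plus casefolding and separator stripping can merge such name variants:

```python
import re
import unicodedata

def canonicalize(name: str) -> str:
    """Collapse benchmark-name variants into one canonical key.

    NFKC folds compatibility characters (superscript '2' -> digit '2'),
    casefolding merges GSM8K / gsm8k, and stripping separators
    merges GSM8K / GSM-8K.
    """
    folded = unicodedata.normalize("NFKC", name).casefold()
    return re.sub(r"[\s_-]+", "", folded)

# All of these collapse to the same key:
assert canonicalize("GSM8K") == canonicalize("gsm8k") == canonicalize("GSM-8K")
# Superscript two normalizes to the digit, merging the two τ-Bench spellings:
assert canonicalize("τ²-Bench") == canonicalize("τ2-Bench")
```

A pure string-normalization pass like this is deliberately aggressive, which is why the agent follows it with AI validation before actually merging names.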
Run it monthly to stay current with the AI evaluation landscape.
The agent supports two execution paths, each with six individual stages or a full pipeline:
# Full pipeline (all 6 stages)
python -m agents.benchmark_intelligence.main generate
# Individual stages (for debugging/development)
python -m agents.benchmark_intelligence.main filter_models
python -m agents.benchmark_intelligence.main find_docs
python -m agents.benchmark_intelligence.main parse_docs --concurrency 30
python -m agents.benchmark_intelligence.main consolidate_benchmarks
python -m agents.benchmark_intelligence.main categorize_benchmarks
python -m agents.benchmark_intelligence.main report

Or via Ambient workflow:

# Full pipeline
/benchmark_intelligence.generate
# Individual stages
/benchmark_intelligence.filter_models
/benchmark_intelligence.find_docs
/benchmark_intelligence.parse_docs --concurrency 30
/benchmark_intelligence.consolidate_benchmarks --from-db
/benchmark_intelligence.categorize_benchmarks
/benchmark_intelligence.report

On Ambient:

# 1. Set HuggingFace token in Workspace Settings → Environment Variables
# HF_TOKEN = "hf_..."
# 2. Run via Ambient workflow
/benchmark_intelligence.generate
# Or via Python
cd /workspace/repos/trending_benchmarks
python -m agents.benchmark_intelligence.main generate

Local setup:

# 1. Install dependencies
pip install -r requirements.txt
# 2. Set API keys
export HF_TOKEN="your_huggingface_token"
export ANTHROPIC_API_KEY="your_claude_key" # Not needed on Ambient
# 3. Run
python -m agents.benchmark_intelligence.main generate

Expected runtime: ~50-60 minutes for 65 models (with AI extraction, default concurrency: 20)
| File | Purpose | Location |
|---|---|---|
| benchmark_taxonomy.md | Complete reference of 30+ benchmarks | Root |
| categories.yaml | 13 benchmark categories & definitions | Root |
| config.yaml | Target labs/organizations to track | Config |
| Resource | Description |
|---|---|
| Latest Report | Most recent benchmark intelligence |
| All Reports | Historical snapshots |
| SQLite Database | Queryable cache (see below) |
The agent uses SQLite for intelligent caching with change detection:
benchmark_cache.db
├── models            # Model metadata (name, lab, release_date, downloads)
├── benchmarks        # Unique benchmarks with categories
├── model_benchmarks  # Benchmark scores/results per model
├── documents         # Cached model cards & docs (content-hash tracking)
└── snapshots         # Temporal snapshots for trend analysis
- Content-hash tracking: Models are only reprocessed if their model card changes
- Incremental updates: Subsequent runs only process new/changed models
- Historical snapshots: Trend analysis without re-fetching old data
- Queryable: Use SQL for custom analysis
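The content-hash tracking described above can be sketched as follows. This is illustrative, not the agent's actual implementation; it assumes a SHA-256 digest of the model card text is what gets cached:

```python
import hashlib
from typing import Optional

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def needs_reprocessing(doc_text: str, cached_hash: Optional[str]) -> bool:
    """Reprocess only when the model card is new or its content changed."""
    return cached_hash is None or content_hash(doc_text) != cached_hash

card = "## Benchmarks\nMMLU: 78.2"
h = content_hash(card)
assert not needs_reprocessing(card, h)          # unchanged -> skip
assert needs_reprocessing(card + " (rev)", h)   # edited -> reprocess
assert needs_reprocessing(card, None)           # never seen -> process
```

Because the hash is recomputed from content rather than timestamps, a re-uploaded but unchanged model card still hits the cache.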
sqlite3 benchmark_cache.db
-- Show all discovered models
SELECT id, lab, release_date, downloads, likes
FROM models
ORDER BY downloads DESC LIMIT 20;
-- Top benchmarks by usage
SELECT b.canonical_name, COUNT(DISTINCT mb.model_id) as model_count, b.categories
FROM benchmarks b
JOIN model_benchmarks mb ON b.id = mb.benchmark_id
GROUP BY b.canonical_name
ORDER BY model_count DESC
LIMIT 15;
-- Models released in last 12 months
SELECT id, lab, release_date, downloads
FROM models
WHERE release_date >= date('now', '-12 months')
ORDER BY release_date DESC;
-- Benchmark trend over time
SELECT s.timestamp, s.benchmark_count, s.model_count
FROM snapshots s
ORDER BY s.timestamp;

- File: benchmark_cache.db (in project root)
- Size: ~240KB (current)
- Backed up: Yes (snapshots table tracks history)
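The SQL queries above can also be run from Python with the standard sqlite3 module. The snippet below uses a throwaway in-memory database with a simplified version of the documented schema so it is self-contained; against the real cache, point sqlite3.connect at benchmark_cache.db instead:

```python
import sqlite3

# Simplified stand-in for the documented tables (illustrative data only).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE benchmarks (id INTEGER PRIMARY KEY, canonical_name TEXT, categories TEXT);
CREATE TABLE model_benchmarks (model_id TEXT, benchmark_id INTEGER);
""")
con.executemany("INSERT INTO benchmarks VALUES (?, ?, ?)",
                [(1, "MMLU", "Knowledge"), (2, "GSM8K", "Math")])
con.executemany("INSERT INTO model_benchmarks VALUES (?, ?)",
                [("m1", 1), ("m2", 1), ("m1", 2)])

# Same "top benchmarks by usage" query as in the sqlite3 shell example.
rows = con.execute("""
    SELECT b.canonical_name, COUNT(DISTINCT mb.model_id) AS model_count
    FROM benchmarks b JOIN model_benchmarks mb ON b.id = mb.benchmark_id
    GROUP BY b.canonical_name ORDER BY model_count DESC
""").fetchall()
print(rows)  # [('MMLU', 2), ('GSM8K', 1)]
```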
7 automated sections:
- Executive Summary: Models & benchmarks tracked
- Trending Models: Sorted by release date & significance
- Most Common Benchmarks: All-time + monthly trends
- Emerging Benchmarks: Recently introduced (<90 days)
- Category Distribution: Breakdown by type (charts)
- Lab Insights: Per-lab statistics & preferences
- Temporal Trends: Evolution over time
Timestamped reports in agents/benchmark_intelligence/reports/:
reports/
├── trending_benchmarks_20260410_155422.md   # Latest
└── ...
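The timestamp suffix in these filenames follows strftime's %Y%m%d_%H%M%S pattern; a small sketch (report_filename is a hypothetical helper, not the agent's actual function):

```python
from datetime import datetime

def report_filename(now=None):
    """Build a timestamped name like trending_benchmarks_20260410_155422.md."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return f"trending_benchmarks_{stamp}.md"

# The example filename shown above round-trips:
assert report_filename(datetime(2026, 4, 10, 15, 54, 22)) == \
    "trending_benchmarks_20260410_155422.md"
```

Sorting these names lexicographically also sorts them chronologically, which is why the latest report is simply the last file in the directory.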
Discover Models (HuggingFace API)
  ↓
Check Cache (content-hash comparison)
  ↓
Parse Documents (model cards, arXiv papers - if changed)
  ↓
Extract Benchmarks (Claude AI: text + vision for charts/figures)
  ↓
Consolidate Names (Unicode normalization + fuzzy matching + AI validation)
  ↓
Classify Benchmarks (multi-label AI categorization)
  ↓
Store in SQLite Cache
  ↓
Create Temporal Snapshot
  ↓
Generate Markdown Report
Key Components:
- HuggingFace Client: Official huggingface_hub library
- Universal Claude Client: Auto-detects Ambient/Vertex AI/Anthropic API
- AI Extraction: Claude-powered parsing of model cards and arXiv papers (text + vision)
- Vision AI: Extracts benchmarks from charts/figures in PDFs using Claude vision API
- PDF Processing: pdfplumber for embedded image extraction from research papers
- Cache Manager: SQLite with content-hash change detection
- Smart Consolidation: Unicode normalization + AI validation ("MMLU" vs "MMLU-Pro" kept distinct, τ²-Bench = τ2-Bench merged)
Knowledge • Reasoning • Math • Code • Vision • Audio • Multilingual • Safety • Long-Context • Instruction-Following • Tool-Use • Agent • Domain-Specific
Knowledge: MMLU, MMLU-Pro, C-Eval, CMMLU, TriviaQA, GPQA
Math: GSM8K, MATH, AIME, Gaokao
Code: HumanEval, MBPP, LiveCodeBench, CFBench
Vision: MMMU, CMMMU, VQAv2, DocVQA, AI2D
Reasoning: ARC, BBH, HellaSwag, PIQA, WinoGrande, BoolQ
Safety: TruthfulQA, RewardBench
Multimodal: Open LLM Leaderboard, Arena-Hard
See benchmark_taxonomy.md for complete reference with definitions.
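A hedged sketch of how the multi-label categorization might be prompted. The agent's real prompt and Claude call are not shown in this README; build_classification_prompt is a hypothetical helper, and the category list is taken from the section above:

```python
CATEGORIES = [
    "Knowledge", "Reasoning", "Math", "Code", "Vision", "Audio",
    "Multilingual", "Safety", "Long-Context", "Instruction-Following",
    "Tool-Use", "Agent", "Domain-Specific",
]

def build_classification_prompt(benchmark_name: str) -> str:
    """Build a multi-label categorization prompt to send to Claude."""
    return (
        f"Classify the benchmark '{benchmark_name}' into one or more of "
        f"these categories: {', '.join(CATEGORIES)}. "
        "Reply with a JSON list of category names only."
    )

prompt = build_classification_prompt("LiveCodeBench")
```

The resulting string would then be sent through the Universal Claude Client; asking for a JSON list keeps the multi-label response machine-parseable.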
- Qwen • 01-ai (Yi)
- meta-llama • mistralai • google
- microsoft • anthropic
- alibaba-pai • tencent • deepseek-ai
- OpenGVLab • THUDM (ChatGLM)
- baichuan-inc • internlm
- MinimaxAI
Configure in config.yaml
Edit config.yaml:
discovery:
  models_per_lab: 15        # Models to fetch per lab
  sort_by: "downloads"      # downloads | trending | lastModified
  filter_tags: []           # Task filters (empty = all)
  min_downloads: 1000       # Minimum popularity threshold
  date_filter_months: 12    # Only models from last N months
  exclude_tags:             # Skip these model types
    - "time-series-forecasting"
    - "fill-mask"

# Concurrency settings
parallelization:
  max_concurrent_document_fetches: 5
  enabled: true
  timeout_per_document_seconds: 60

# Rate limiting (prevents API 429 errors)
rate_limiting:
  huggingface:
    requests_per_minute: 60
    max_retries: 5
    initial_backoff_seconds: 2.0
  anthropic:
    requests_per_minute: 50
    max_retries: 5
  arxiv:
    requests_per_minute: 30
    max_retries: 3

Default: 20 concurrent workers for document parsing.
Adjust based on your needs:
# Low concurrency (safer, slower)
python -m agents.benchmark_intelligence.main parse_docs --concurrency 10
# High concurrency (faster, may hit rate limits)
python -m agents.benchmark_intelligence.main parse_docs --concurrency 50
# Ambient workflow
/benchmark_intelligence.parse_docs --concurrency 30

All outputs are saved to agents/benchmark_intelligence/outputs/:
| Stage | Output File | Schema |
|---|---|---|
| filter_models | filtered_models/models_YYYYMMDD_HHMMSS.json | [{id, author, downloads, likes, tags, created_at}] |
| find_docs | docs/docs_YYYYMMDD_HHMMSS.json | [{model_id, documents: [{type, url, found}]}] |
| parse_docs | parsed/parsed_YYYYMMDD_HHMMSS.json | [{model_id, benchmarks: [{name, score, metric}]}] |
| consolidate | consolidated/benchmarks_YYYYMMDD_HHMMSS.json | [{canonical_name, occurrences, models: [...]}] |
| categorize | categorized/categorized_YYYYMMDD_HHMMSS.json | [{benchmark_name, category, subcategory, confidence}] |
| report | reports/report_YYYYMMDD_HHMMSS.md | Markdown report |
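Each stage's JSON output can be loaded for custom analysis with the standard json module. The record below is a made-up example matching the documented filter_models schema (field values are illustrative, not real output):

```python
import json
import tempfile

# Illustrative record in the documented filter_models shape.
models = [{
    "id": "example-lab/example-model",   # hypothetical model id
    "author": "example-lab",
    "downloads": 12000,
    "likes": 340,
    "tags": ["text-generation"],
    "created_at": "2026-03-01T00:00:00Z",
}]

# Stage outputs are plain JSON files, so a round-trip is all it takes:
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(models, f)
    path = f.name

with open(path) as f:
    loaded = json.load(f)

assert loaded == models
```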
- Categories: Edit categories.yaml at root
- Taxonomy: Update benchmark_taxonomy.md at root
Monthly runs (recommended):
# Via cron (automatically configured)
0 9 1 * * cd /workspace/repos/trending_benchmarks && /benchmark_intelligence.generate
# Or manual
python -m agents.benchmark_intelligence.main generate

# Dry run with verbose logging
python -m agents.benchmark_intelligence.main --dry-run --verbose

# Limit to specific labs
python -m agents.benchmark_intelligence.main \
  --labs Qwen,meta-llama,mistralai

# Clear cache and start fresh
rm benchmark_cache.db
python -m agents.benchmark_intelligence.main

# Edit config.yaml to widen the date window:
discovery:
  date_filter_months: 24  # Last 2 years

Language: Python 3.9+
APIs: HuggingFace Hub, Anthropic Claude (or Vertex AI)
Storage: SQLite
AI: Claude Sonnet 4 for intelligent extraction & classification
Format: Markdown, YAML, JSON
Dependencies (7):
- huggingface_hub - Model discovery
- anthropic - AI-powered parsing (or Vertex AI on Ambient)
- pdfplumber - PDF parsing and image extraction
- pyyaml - Configuration
- requests - HTTP
- beautifulsoup4 - HTML parsing
- python-dateutil - Date handling
| Document | Purpose |
|---|---|
| AMBIENT_QUICKSTART.md | Get started on Ambient platform |
| agents/.../README.md | Full technical documentation |
| config.yaml | Configuration reference |
| specs/001-.../spec.md | Complete feature specification |
export HF_TOKEN="hf_your_token"

Get token: https://huggingface.co/settings/tokens
Only needed outside Ambient. Get key: https://console.anthropic.com
On Ambient: Uses native Vertex AI Claude support (no key needed)
Edit config.yaml → remove labs that produce noise (e.g., "huggingface" org gets time-series models)
Edit config.yaml β date_filter_months: 12 (or higher)
rm benchmark_cache.db
# Re-run will rebuild from scratch

Symptom: "Too many requests" errors from APIs
Solutions:
- Reduce concurrency:
  python -m agents.benchmark_intelligence.main parse_docs --concurrency 10
- Adjust rate limits in config.yaml:
  rate_limiting:
    huggingface:
      requests_per_minute: 30  # Lower = safer
- The rate limiter automatically retries with exponential backoff
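The retry behavior can be sketched as exponential backoff with jitter, driven by the documented max_retries and initial_backoff_seconds settings. This is an illustrative sketch, not the agent's exact code:

```python
import random
import time

def call_with_retries(fn, max_retries=5, initial_backoff_seconds=2.0,
                      sleep=time.sleep):
    """Retry fn on failure, backing off exponentially between attempts.

    max_retries / initial_backoff_seconds mirror config.yaml's
    rate_limiting section (sketch only).
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # retries exhausted: surface the error
            # Base delay doubles each attempt (2 s, 4 s, 8 s, ...),
            # scaled by a jitter factor in [0.5, 1.0) to avoid
            # synchronized retry storms across workers.
            jitter = 0.5 + random.random() / 2
            sleep(initial_backoff_seconds * (2 ** attempt) * jitter)
```

Injecting the sleep function keeps the helper easy to test and lets an async caller substitute a non-blocking delay.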
Symptom: "Connection timeout" or "Read timeout"
Solutions:
- Increase timeout in config.yaml:
  parallelization:
    timeout_per_document_seconds: 120  # Default: 60
- Reduce concurrent fetches:
  parallelization:
    max_concurrent_document_fetches: 3  # Default: 5
Symptom: Process killed or "Out of memory"
Solutions:
- Lower concurrency (fewer workers = less memory):
  python -m agents.benchmark_intelligence.main parse_docs --concurrency 5
- Process in batches (run individual stages separately)
Symptom: "No available connections" or hanging requests
Solutions:
- Connection pool auto-manages resources
- Check logs for specific errors
- Restart with lower concurrency
The pipeline is resumable! Hash cache prevents re-processing:
# If interrupted, just re-run the same command
python -m agents.benchmark_intelligence.main generate
# Hash cache will skip already-processed documents
# Only new/changed documents will be processed

Requirements:

- Python 3.9 or higher
- HuggingFace account (for API token)
- Anthropic API key (for Claude) OR Ambient Code Platform
- Internet connection
- ~500MB disk space (for cache)
Apache 2.0 - See LICENSE file
- Latest Report →
- Feature Specification - Complete requirements
- Benchmark Taxonomy - Complete reference
- Categories - Category definitions
- HuggingFace Hub
- Anthropic Claude
Version: 1.1.0
Status: ✅ Production Ready
Last Run: 2026-04-11
Models: 167 | Benchmarks: 263 | Categories: 14
Features: Vision AI extraction • Unicode normalization • AI validation • Multi-source parsing
Built with ❤️ using AI • Powered by Claude & HuggingFace