🏆 ClawBench Leaderboard

Paper · Dataset · Traces V1 · Traces V2 · Code · Scoring

Two-stage scoring for AI agents on real, everyday online tasks. V1 = 153 tasks · V2 = 130 tasks. Harnesses: Hermes or openclaw.

Intercepted = fraction of runs whose final HTTP request matched the per-task URL/method schema (Stage 1).
Reward = fraction that ALSO passed the LLM judge on the intercepted payload (Stage 2 — the headline metric).

Rows are ranked by Reward, then by Intercepted as tiebreaker. — means we don't yet have Stage-2 judge data for that row.

How to submit

Run clawbench-eval on your model and open a PR to leaderboard/results.csv with one row per (model × harness × corpus). Required columns: model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours. Leave reward_rate blank if you only ran Stage-1 (interception); reproducibility instructions are in eval/scoring.md.