🏆 ClawBench Leaderboard
Paper · Dataset · Traces V1 · Traces V2 · Code · Scoring
Two-stage scoring for AI agents on real, everyday online tasks. V1 = 153 tasks · V2 = 130 tasks. Harnesses: Hermes or openclaw.
- Intercepted = fraction of runs whose final HTTP request matched the per-task URL/method schema (Stage 1).
- Reward = fraction of runs that also passed the LLM judge on the intercepted payload (Stage 2). This is the headline metric.
Rows are ranked by Reward, with Intercepted as the tiebreaker. A "—" entry means we do not yet have Stage-2 judge data for that row.
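The two rates and the ranking rule can be sketched in a few lines of Python. This is an illustration only, not the leaderboard's actual code: the `Row` class, its field names, and the `rank` helper are hypothetical, both rates are assumed to share the same denominator (all runs), and rows without Stage-2 data are assumed to sort after rows that have it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Row:
    """One leaderboard row (hypothetical structure, for illustration)."""
    model: str
    total: int                # number of runs
    intercepted: int          # runs that passed Stage 1
    rewarded: Optional[int]   # runs that passed Stage 1 and Stage 2; None if no judge data

    @property
    def intercepted_rate(self) -> float:
        return self.intercepted / self.total

    @property
    def reward_rate(self) -> Optional[float]:
        # Assumption: Reward is measured over all runs, not only intercepted ones.
        return None if self.rewarded is None else self.rewarded / self.total


def rank(rows: List[Row]) -> List[Row]:
    """Sort by Reward (descending), then Intercepted (descending).

    Assumption: rows with no Stage-2 data are placed last.
    """
    return sorted(
        rows,
        key=lambda r: (
            r.reward_rate is None,       # False sorts before True
            -(r.reward_rate or 0.0),
            -r.intercepted_rate,
        ),
    )


# Placeholder rows, not real results.
rows = [
    Row("model-a", total=153, intercepted=100, rewarded=50),
    Row("model-b", total=153, intercepted=120, rewarded=50),
    Row("model-c", total=153, intercepted=130, rewarded=None),
]
ranked_names = [r.model for r in rank(rows)]
```

Here `model-a` and `model-b` tie on Reward, so the higher Intercepted rate puts `model-b` first; `model-c` has no judge data and lands last under the stated assumption.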
How to submit
Run clawbench-eval on your model and open a PR that adds one row per (model × harness × corpus) to leaderboard/results.csv. Required columns: model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours.
Leave reward_rate blank if you only ran Stage 1 (interception). Reproducibility instructions are in eval/scoring.md.
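Before opening a PR, it may help to sanity-check the rows you are adding. The sketch below is a hypothetical pre-submission check, not part of clawbench-eval: the function name `check_results_csv` is invented, it assumes the `dataset` column carries the corpus (for example V1 or V2), and the sample values are placeholders rather than real results. It only verifies the rules stated above: all required columns exist, only reward_rate may be blank, and each (model, harness, dataset) combination appears once.

```python
import csv
import io
from typing import List

REQUIRED = [
    "model", "harness", "dataset", "passed",
    "total", "pass_rate", "reward_rate", "wall_hours",
]


def check_results_csv(text: str) -> List[str]:
    """Return a list of problems found in CSV text; an empty list means OK."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    # Header is line 1, so data rows start at line 2.
    for lineno, row in enumerate(reader, start=2):
        key = (row["model"], row["harness"], row["dataset"])
        if key in seen:
            problems.append(f"line {lineno}: duplicate row for {key}")
        seen.add(key)
        for col in REQUIRED:
            # reward_rate may be blank for Stage-1-only submissions.
            if col != "reward_rate" and not (row[col] or "").strip():
                problems.append(f"line {lineno}: {col} is blank")
    return problems


# Placeholder data: a Stage-1-only row with reward_rate left blank.
sample = (
    "model,harness,dataset,passed,total,pass_rate,reward_rate,wall_hours\n"
    "example-model,hermes,V1,10,153,0.065,,1.5\n"
)
sample_problems = check_results_csv(sample)
```

A blank reward_rate passes the check, while a blank value in any other required column is reported with its line number.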