Topic hub · 21 claims
Evaluation, benchmarks, and the harness problem
The benchmarks that define "capable model", and the methodology caveats that make cross-paper comparisons unreliable. Every benchmark cited here is backed by hand-verified primary sources.
Why benchmarks matter — and why they mislead
Benchmarks are how the field measures progress. MMLU, HumanEval, GLUE, SuperGLUE, and Chatbot Arena each try to capture a different dimension of capability (knowledge breadth, code generation, language understanding, conversational quality). But the same benchmark name can produce different scores depending on the evaluation harness, the prompt format, and the decoding strategy, which is why VERITAS does not ship performance-comparison claims (see /blog/why-no-performance-claims/).
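As a minimal sketch of how the harness alone can move a score, the example below grades the same four completions with two answer-extraction rules. The completions, the gold answers, and both extraction functions are invented for illustration; they do not reproduce the behavior of any particular harness.

```python
import re

# Hypothetical raw completions for four multiple-choice questions.
# These strings are invented for illustration; they are not real model outputs.
completions = [
    "B",
    "The answer is C.",
    "(A) because the premise holds",
    "I think it's D, but B is also plausible.",
]
gold = ["B", "C", "A", "D"]


def strict_extract(text):
    """Assumed rule 1: accept only a bare answer letter and nothing else."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None


def lenient_extract(text):
    """Assumed rule 2: take the first standalone A-D letter anywhere in the text."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def accuracy(extract):
    """Score the fixed completions under a given extraction rule."""
    predictions = [extract(c) for c in completions]
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


strict_score = accuracy(strict_extract)
lenient_score = accuracy(lenient_extract)
print(f"strict: {strict_score:.2f}, lenient: {lenient_score:.2f}")
# prints "strict: 0.25, lenient: 1.00"
```

The model outputs never change here; only the grading rule does, and the reported accuracy moves from 0.25 to 1.00.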
The classics
GLUE (Wang et al. 2018) and SuperGLUE (Wang et al. 2019) were among the first widely adopted standardized natural-language-understanding benchmarks. ImageNet (Deng et al., CVPR 2009) preceded them in vision. BLEU (Papineni et al., ACL 2002) and ROUGE (Lin, ACL 2004) are automatic metrics rather than benchmarks: they score machine translation and summarization output, respectively. Together, these benchmarks and metrics shaped a decade of progress.
The LLM-era benchmarks
MMLU (Hendrycks et al., arXiv 2020, ICLR 2021) tests knowledge breadth across 57 subjects. HumanEval (Chen et al., OpenAI 2021) tests code generation. AlpacaEval (Tatsu Lab 2023) uses LLM-as-judge. Chatbot Arena (LMSYS 2023) uses pairwise human preferences. Each raises methodological questions: which split, which prompt, few-shot or zero-shot, with or without chain-of-thought? The right reading is to track benchmarks as trend signals, not absolute rankings.
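One way to make those questions explicit is to store the evaluation settings next to every score and refuse to compare results whose settings differ. The sketch below is illustrative only: the class names, field names, model names, harness name, and scores are all assumptions, not values from any paper.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class EvalConfig:
    """The settings a reported score depends on (field names are assumptions)."""
    benchmark: str
    split: str
    num_fewshot: int
    chain_of_thought: bool
    harness: str


@dataclass(frozen=True)
class Result:
    model: str
    config: EvalConfig
    score: float


def comparable(a: Result, b: Result) -> bool:
    """Two scores are directly comparable only if every setting matches."""
    return a.config == b.config


def differing_settings(a: Result, b: Result) -> list:
    """Names of the settings on which two results disagree, in field order."""
    return [
        f.name
        for f in fields(a.config)
        if getattr(a.config, f.name) != getattr(b.config, f.name)
    ]


# Placeholder results; the scores are made up for the example.
five_shot = EvalConfig("MMLU", "test", 5, False, "harness-x")
zero_shot_cot = EvalConfig("MMLU", "test", 0, True, "harness-x")
result_a = Result("model-a", five_shot, 0.70)
result_b = Result("model-b", zero_shot_cot, 0.72)

print(comparable(result_a, result_b))
print(differing_settings(result_a, result_b))
```

Both results carry the MMLU name, yet they differ in few-shot count and chain-of-thought use, so the two-point gap between them says little about which model is stronger.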
Defined terms (3)
- Benchmark: A standardized dataset and evaluation protocol designed to measure a specific capability across multiple models.
- Evaluation harness: Software that runs an LLM through a benchmark in a reproducible way. Different harnesses (for example EleutherAI's LM Evaluation Harness, also distributed as lm-eval, and Stanford's HELM) can produce different scores for the same nominal benchmark.
- LLM-as-judge: An evaluation approach in which one LLM scores the outputs of another. Used by AlpacaEval and MT-Bench. Cheaper than human evaluation, but biased toward the judge model's preferences.
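To make the LLM-as-judge definition concrete, here is a minimal sketch of a pairwise win-rate computation. The judge is a toy stand-in, not a real model call: it simply prefers the longer answer, which is an assumed bias chosen to show how a judge's preferences can leak into the score. The function names, prompts, and answers are all hypothetical.

```python
from typing import Callable, Sequence

# A judge takes (prompt, answer_a, answer_b) and returns "a" or "b".
Judge = Callable[[str, str, str], str]


def length_biased_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    """Toy stand-in for a judge LLM: it prefers whichever answer is longer."""
    return "a" if len(answer_a) >= len(answer_b) else "b"


def win_rate(
    judge: Judge,
    prompts: Sequence[str],
    candidate_answers: Sequence[str],
    baseline_answers: Sequence[str],
) -> float:
    """Fraction of prompts on which the judge prefers the candidate answer."""
    wins = 0
    for prompt, cand, base in zip(prompts, candidate_answers, baseline_answers):
        # The candidate is always shown in position "a" in this simple sketch.
        if judge(prompt, cand, base) == "a":
            wins += 1
    return wins / len(prompts)


prompts = ["What is 2 + 2?", "Name the capital of France."]
verbose = [
    "2 + 2 equals 4, which follows from basic arithmetic.",
    "The capital of France is the city of Paris.",
]
terse = ["4", "Paris"]

verbose_vs_terse = win_rate(length_biased_judge, prompts, verbose, terse)
terse_vs_verbose = win_rate(length_biased_judge, prompts, terse, verbose)
print(verbose_vs_terse, terse_vs_verbose)
```

Both answer sets are equally correct, yet the verbose set wins every comparison under this judge. A real judge model's preferences are subtler, but they enter the win rate in the same way.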
All claims in this topic (21)
- AlpacaEval · introduced in Li et al. 2023: LLM-as-judge evaluation benchmark (1.00 · 2 sources)
- ARC-AGI benchmark · introduced in Chollet 2019: abstraction and reasoning corpus (1.00 · 2 sources)
- Chatbot Arena · introduced in Zheng et al. 2023: LMSYS open platform for evaluating LLMs by human preference (1.00 · 2 sources)
- GLUE benchmark · introduced in paper GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018) (1.00 · 2 sources)
- HELM · introduced in paper Holistic Evaluation of Language Models (Liang et al., Stanford CRFM 2022-11-16) (1.00 · 2 sources)
- HumanEval benchmark · introduced in paper Evaluating Large Language Models Trained on Code (Chen et al., 2021) (1.00 · 2 sources)
- LangSmith · publicly released on 2023-07-18 by LangChain: LLM observability and evaluation platform (1.00 · 2 sources)
- LMArena (Chatbot Arena) · founded in 2023: LMSYS Chatbot Arena → LMArena.ai 2024 (1.00 · 2 sources)
- LongBench · introduced in paper LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (Bai et al., THU + Zhipu AI 2023-08-28) (1.00 · 2 sources)
- MMLU benchmark · introduced in paper Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020) (1.00 · 2 sources)
- MTEB benchmark · introduced in Muennighoff et al. 2022: Massive Text Embedding Benchmark (1.00 · 2 sources)
- SuperGLUE benchmark · introduced in paper SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019) (1.00 · 2 sources)
- SWE-bench · introduced in Jimenez et al. 2024: software engineering benchmark from GitHub issues (1.00 · 2 sources)
- BIG-bench · introduced in paper Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Srivastava et al., 2022) (0.92 · 3 sources)
- GPQA benchmark · introduced in paper GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023) (0.92 · 3 sources)
- GSM8K · introduced in paper Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021) (0.92 · 3 sources)
- HellaSwag benchmark · introduced in paper HellaSwag: Can a Machine Really Finish Your Sentence? (Zellers et al., 2019) (0.92 · 3 sources)
- LiveCodeBench · introduced in paper LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (Jain et al., 2024) (0.92 · 3 sources)
- MATH dataset · introduced in paper Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021) (0.92 · 3 sources)
- MMLU-Pro benchmark · introduced in paper MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (Wang et al., 2024) (0.92 · 3 sources)
- TruthfulQA benchmark · introduced in paper TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021) (0.92 · 3 sources)