Tag: benchmark
18 verified claims carrying this tag. Each has 2+ primary sources and an HMAC-SHA256 signature.
MMLU benchmark introduced in paper: Measuring Massive Multitask Language Understanding (Hendrycks et al., 2020).
428d754e7c651be6 · 2 sources · 100% confidence
HumanEval benchmark introduced in paper: Evaluating Large Language Models Trained on Code (Chen et al., 2021).
71ec42731d2c9e0c · 2 sources · 100% confidence
SuperGLUE benchmark introduced in paper: SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems (Wang et al., 2019).
1a1e87145608c91a · 2 sources · 100% confidence
GLUE benchmark introduced in paper: GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding (Wang et al., 2018).
aa113b5e61d5c214 · 2 sources · 100% confidence
MTEB benchmark introduced in paper: MTEB: Massive Text Embedding Benchmark (Muennighoff et al., 2022).
cccd161dd058a31e · 2 sources · 100% confidence
ARC-AGI benchmark (originally the Abstraction and Reasoning Corpus, ARC) introduced in paper: On the Measure of Intelligence (Chollet, 2019).
cc5df3c14d35fa49 · 2 sources · 100% confidence
SWE-bench introduced in paper: SWE-bench: Can Language Models Resolve Real-World GitHub Issues? (Jimenez et al., 2024).
b16b5f5297e5f621 · 2 sources · 100% confidence
LMArena (Chatbot Arena) launched in 2023 as LMSYS Chatbot Arena and became LMArena.ai in 2024.
88ff5918737d7b6b · 2 sources · 100% confidence
LongBench introduced in paper: LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding (Bai et al., THU + Zhipu AI 2023-08-28).
a41ff9e64baa566f · 2 sources · 100% confidence
HELM introduced in paper: Holistic Evaluation of Language Models (Liang et al., Stanford CRFM 2022-11-16).
494f2bf84f0e5dd2 · 2 sources · 100% confidence
GPQA benchmark introduced in paper: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (Rein et al., 2023).
26f75f130f7b395a · 3 sources · 92% confidence
GSM8K introduced in paper: Training Verifiers to Solve Math Word Problems (Cobbe et al., 2021).
dc1ccb567aff584d · 3 sources · 92% confidence
MATH dataset introduced in paper: Measuring Mathematical Problem Solving With the MATH Dataset (Hendrycks et al., 2021).
8c1f847ae98570da · 3 sources · 92% confidence
HellaSwag benchmark introduced in paper: HellaSwag: Can a Machine Really Finish Your Sentence? (Zellers et al., 2019).
b3f34e83dd0c53b9 · 3 sources · 92% confidence
TruthfulQA benchmark introduced in paper: TruthfulQA: Measuring How Models Mimic Human Falsehoods (Lin et al., 2021).
824f830889daf33e · 3 sources · 92% confidence
BIG-bench introduced in paper: Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models (Srivastava et al., 2022).
bde28f6f7e14e0e9 · 3 sources · 92% confidence
MMLU-Pro benchmark introduced in paper: MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark (Wang et al., 2024).
2df92e0b0e4c891b · 3 sources · 92% confidence
LiveCodeBench introduced in paper: LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code (Jain et al., 2024).
b474cbe11ab65d51 · 3 sources · 92% confidence
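
The listing says each claim carries an HMAC-SHA256 signature but does not describe how the signature is computed or checked. The sketch below shows one plausible way to sign and verify a claim record with Python's standard library. Everything about the record layout is an assumption made for illustration: the field names (`text`, `sources`), the canonical JSON serialization, the demo key, and the helper names `canonical_bytes`, `sign_claim`, and `verify_claim` are hypothetical. It also makes no claim about how the 16-character identifiers shown under each entry relate to the signature.

```python
import hashlib
import hmac
import json


def canonical_bytes(claim: dict) -> bytes:
    """Serialize a claim deterministically (assumed scheme: sorted keys, compact JSON)."""
    return json.dumps(
        claim, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")


def sign_claim(claim: dict, key: bytes) -> str:
    """Return the HMAC-SHA256 of the canonical claim bytes as a hex string."""
    return hmac.new(key, canonical_bytes(claim), hashlib.sha256).hexdigest()


def verify_claim(claim: dict, signature_hex: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_claim(claim, key)
    return hmac.compare_digest(expected, signature_hex)


# Hypothetical record and key, for illustration only.
demo_key = b"demo-key-not-a-real-secret"
demo_claim = {
    "text": "GLUE benchmark introduced in paper: GLUE: A Multi-Task Benchmark "
            "and Analysis Platform for Natural Language Understanding "
            "(Wang et al., 2018).",
    "sources": 2,
}

signature = sign_claim(demo_claim, demo_key)
print(verify_claim(demo_claim, signature, demo_key))
print(verify_claim({**demo_claim, "sources": 3}, signature, demo_key))
```

Because HMAC is a keyed construction, only a holder of the secret key can verify a signature this way. The deterministic serialization matters just as much: if signer and verifier encode the same record differently, an otherwise valid claim will fail verification.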