Standardized tests for language models.
A benchmark pairs a fixed dataset with known answers, a scoring protocol, and a metric. The appeal is reproducibility: same test, same scoring, comparable results across models.
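The three parts can be sketched in a few lines. This is a minimal illustration, not any real benchmark's harness; the `model` callable and the toy dataset are made up, and the scoring protocol here is simple exact match:

```python
def evaluate(model, dataset):
    """Score a model on (prompt, answer) pairs.

    Scoring protocol: exact string match after stripping whitespace.
    Metric: accuracy, the fraction of correct answers.
    """
    correct = sum(
        1 for prompt, answer in dataset
        if model(prompt).strip() == answer
    )
    return correct / len(dataset)

# Fixed dataset with known answers (toy example).
dataset = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
]

# A stand-in "model" for illustration: a lookup table with one wrong answer.
answers = {"2 + 2 =": "4", "Capital of France?": "Rome"}
model = lambda prompt: answers[prompt]

print(evaluate(model, dataset))  # 0.5
```

Because the dataset and scoring rule are frozen, anyone running this gets the same number for the same model, which is the whole point.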
MMLU tests broad knowledge across 57 subjects. GSM8K tests grade-school math reasoning. HumanEval tests code generation against unit tests. SWE-bench tests whether a model can fix real GitHub issues.
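HumanEval-style scoring is worth spelling out because it differs from answer matching: a completion passes if the generated code runs and satisfies the task's unit tests. The sketch below is illustrative only; the function names and the use of bare `exec` are simplifications, not the actual HumanEval harness (which sandboxes execution):

```python
def passes(candidate_src: str, test_src: str) -> bool:
    """Execute a candidate solution, then its unit tests.

    The candidate passes iff nothing raises, including assertion
    failures in the tests. (A real harness would sandbox this.)
    """
    env = {}
    try:
        exec(candidate_src, env)
        exec(test_src, env)
        return True
    except Exception:
        return False

candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes(candidate, tests))  # True

buggy = "def add(a, b):\n    return a - b\n"
print(passes(buggy, tests))  # False
```

The headline metric is then pass@k: the probability that at least one of k sampled completions passes the tests.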
The risk is Goodhart’s law. Once a benchmark becomes a target, models get optimized for it, whether through deliberate tuning or test-set contamination in training data, and scores become less predictive of real-world quality.