Benchmarks

Standardized tests for language models.

A benchmark is a fixed dataset with known answers, a scoring protocol, and a metric. The appeal is reproducibility: same test, same scoring, comparable results across models.
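The three ingredients can be sketched in a few lines. This is a minimal illustration, not any particular benchmark's harness: `model` stands in for any text-in/text-out function, and the toy dataset and exact-match protocol are assumptions for the example.

```python
def exact_match_accuracy(model, dataset):
    """Score a model on (question, answer) pairs.

    Scoring protocol: exact string match after stripping whitespace.
    Metric: fraction of items answered correctly (accuracy).
    """
    correct = sum(
        model(question).strip() == answer.strip()
        for question, answer in dataset
    )
    return correct / len(dataset)

# Toy fixed dataset with known answers, and a trivial "model".
dataset = [("2 + 2 =", "4"), ("Capital of France?", "Paris")]
model = lambda q: {"2 + 2 =": "4", "Capital of France?": "Rome"}.get(q, "")

print(exact_match_accuracy(model, dataset))  # 0.5: one of two correct
```

Because the dataset, protocol, and metric are all fixed, any two models scored this way are directly comparable.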

MMLU tests broad knowledge across 57 subjects. GSM8K tests grade-school math reasoning. HumanEval tests code generation against unit tests. SWE-bench tests whether a model can fix real GitHub issues.
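Code benchmarks like HumanEval differ from knowledge benchmarks in their scoring protocol: instead of matching an answer string, they execute the model's output against unit tests. A simplified sketch of that idea, assuming a trusted sandbox (real harnesses isolate execution, which this example omits):

```python
def passes_tests(candidate_code: str, test_code: str) -> bool:
    """Execute a candidate solution, then its unit tests.

    The sample passes iff both run without raising. Real harnesses
    sandbox this step; plain exec() is used here only for illustration.
    """
    namespace = {}
    try:
        exec(candidate_code, namespace)
        exec(test_code, namespace)
        return True
    except Exception:
        return False

# A hypothetical model completion and the benchmark's hidden tests.
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

print(passes_tests(candidate, tests))  # True
```

Functional scoring like this tolerates any correct implementation, whereas string matching would reject stylistic variation.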

The risk is Goodhart’s law. Once a benchmark becomes a target, models get optimized for it, and scores become less predictive of real-world quality.
