LLM Benchmarks Explained: What MMLU, HumanEval, and HELM Actually Measure

When a model release announces "state-of-the-art on MMLU" or "beats GPT-4 on HumanEval," how much should you care? Understanding what these benchmarks actually measure — and what they don't — is essential for making informed model selection decisions and designing your own evaluation strategy.

This guide breaks down the major LLM benchmarks, their methodologies, their blind spots, and how to use them practically.

Why Benchmarks Matter (and Where They Fall Short)

Benchmarks exist because evaluating a general-purpose language model is enormously difficult. You cannot exhaustively test all possible inputs, so the field relies on curated task collections that proxy for broader capability.

The core tension: benchmarks measure what they measure, not what you care about. A model can score 90% on MMLU (broad factual knowledge) while being terrible at following complex instructions in your specific domain. Conversely, a model that scores lower on academic benchmarks may outperform on practical tasks like code review or customer support.

Use benchmarks as one signal among many — never as the only signal.

Academic Knowledge Benchmarks

MMLU (Massive Multitask Language Understanding)

What it is: A four-option multiple-choice test covering 57 subjects across STEM, humanities, social sciences, and professional domains. The test split contains about 14,000 questions at varying difficulty levels.

What it measures: Breadth of factual knowledge and reasoning across academic subjects. A model that scores 85%+ on MMLU has internalized a lot of human knowledge.

Format:

Question: What is the approximate half-life of carbon-14?
A) 570 years
B) 5,730 years  ← correct
C) 57,300 years
D) 573,000 years
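
To make the format concrete, here is a minimal sketch of how a question in this layout might be rendered as a prompt and scored. The `MCQuestion` class and both helper functions are illustrative names chosen for this example, not part of any official MMLU harness, and the sketch assumes the model's reply begins with the answer letter.

```python
from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class MCQuestion:
    question: str
    choices: list[str]  # four answer options
    answer: int         # index of the correct option


def format_prompt(q: MCQuestion) -> str:
    """Render a question in the lettered layout shown above."""
    lines = [f"Question: {q.question}"]
    lines += [f"{LETTERS[i]}) {choice}" for i, choice in enumerate(q.choices)]
    lines.append("Answer:")
    return "\n".join(lines)


def accuracy(questions: list[MCQuestion], predictions: list[str]) -> float:
    """Fraction of questions whose predicted letter matches the answer key."""
    correct = sum(
        pred.strip().upper()[:1] == LETTERS[q.answer]
        for q, pred in zip(questions, predictions)
    )
    return correct / len(questions)


q = MCQuestion(
    "What is the approximate half-life of carbon-14?",
    ["570 years", "5,730 years", "57,300 years", "573,000 years"],
    answer=1,
)
print(format_prompt(q))
print(accuracy([q, q], ["B", "c"]))  # one correct, one wrong
```

With four options per question, a model that guesses uniformly at random lands near 25% under this scoring, which is why that figure serves as the floor.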

Limitations:

  • Multiple-choice format is gameable — models can score well by eliminating wrong answers
  • Doesn't test ability to apply knowledge in open-ended contexts
  • Heavy test set contamination suspected (many questions leaked into training data)
  • High scores don't predict practical usefulness

Typical scores (2026):

  • GPT-4: ~86%
  • Claude Opus: ~88%
  • Llama 3 70B: ~82%
  • Random baseline: 25%

GPQA (Graduate-Level Google-Proof Q&A)

What it is: 448 questions written by PhD researchers, designed to require genuine expertise rather than internet search or rote recall.

What it measures: Deep domain expertise in biology, chemistry, and physics. Named "Google-proof" because the answers require reasoning that can't be found by simple lookup.

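Why it matters: It has lower contamination risk than MMLU and better separates genuine understanding from pattern matching. Skilled non-experts (people with expertise in a different field) score around 34% even with unrestricted web access, while domain experts score around 65%.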

ARC-Challenge

What it is: 1,172 multiple-choice science questions (the Challenge test set) drawn from grade 3-9 exams, keeping only questions that both a retrieval-based baseline and a word co-occurrence baseline answered incorrectly.

What it measures: Science reasoning that requires multi-step inference rather than keyword matching.

Coding Benchmarks

HumanEval

What it is: 164 Python programming problems. Each problem includes a docstring, function signature, and unit tests. Models generate the function body; correctness is measured by passing all unit tests.

What it measures: Ability to implement simple-to-medium Python functions from specifications.

Format:

def fibonacci(n: int) -> int:
    """Return the nth Fibonacci number.
    
    >>> fibonacci(0)
    0
    >>> fibonacci(1)
    1
    >>> fibonacci(10)
    55
    """
    # Model generates this body

Metric: pass@k, the probability that at least one of k sampled solutions passes all unit tests, averaged over problems. pass@1 is the single-attempt success rate.
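
In practice pass@k is usually estimated by sampling n ≥ k solutions per problem, counting the c that pass, and applying the unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below implements that formula for a single problem; the function name and argument checks are this example's own choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed all tests, k: attempt budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples has no passing one.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail


# 10 samples, 3 of which pass:
print(round(pass_at_k(n=10, c=3, k=1), 4))
print(round(pass_at_k(n=10, c=3, k=5), 4))
```

The benchmark-level number is the mean of this value over all problems. Note how much higher pass@5 is than pass@1 for the same model, which is why the k should always be stated alongside a score.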

Limitations:

  • Only 164 problems — small sample, high variance
  • Only Python
  • Problems are relatively simple (rarely require understanding of external APIs or complex algorithms)
  • Heavy contamination in training data
  • Doesn't test code review, debugging, or working with existing codebases
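
The "small sample, high variance" point can be quantified with a rough confidence interval. The sketch below uses the normal-approximation (Wald) interval for a binomial proportion, which assumes problems are independent and is only approximate near 0% or 100%; the function name is hypothetical.

```python
from math import sqrt


def binomial_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for a pass rate p measured on n problems."""
    se = sqrt(p * (1 - p) / n)  # standard error of a proportion
    return (max(0.0, p - z * se), min(1.0, p + z * se))


low, high = binomial_ci(p=0.90, n=164)
print(f"{low:.3f} to {high:.3f}")
```

For a 90% score on 164 problems the interval spans roughly nine percentage points, so a two-point gap between two models on HumanEval is well within the noise.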

Typical scores:

  • GPT-4: ~90% (pass@1)
  • Claude Opus: ~88%
  • Llama 3 70B: ~81%

MBPP (Mostly Basic Programming Problems)

974 crowd-sourced Python problems designed to be solvable by entry-level programmers. Broader coverage than HumanEval but similar limitations.

SWE-bench

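What it is: 2,294 real GitHub issues from 12 popular Python repositories. Two subsets are widely reported: SWE-bench Lite (300 issues) and SWE-bench Verified (500 human-validated issues). Models must generate a patch that resolves the issue and passes the test suite.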

Why it's significant: This is much harder than HumanEval — it requires understanding existing codebases, interpreting bug reports, and making targeted changes. The correlation with real-world software engineering is much higher.

Typical scores: Even frontier models solve only about 20-50% of problems on the human-validated SWE-bench Verified subset. This is far more humbling than HumanEval scores suggest.

Reasoning Benchmarks

GSM8K (Grade School Math)

8,500 grade school math word problems requiring multi-step arithmetic reasoning. Problems require 2–8 reasoning steps.

What it measures: Chain-of-thought reasoning for arithmetic and algebra.

Limitation: Models have largely saturated this benchmark. Most frontier models score 90%+. It's no longer a meaningful differentiator.
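
Scoring GSM8K comes down to comparing final numeric answers. The dataset's reference solutions end with a `#### <number>` line; taking the last number in the model's output, as the sketch below does, is a common heuristic rather than an official rule. The helper names and the sample problem are made up for illustration.

```python
import re
from typing import Optional

NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution: str) -> str:
    """Take the text after the final '####' marker and drop thousands separators."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(output: str) -> Optional[str]:
    """Heuristic: treat the last number in the model output as its final answer."""
    matches = NUMBER.findall(output)
    return matches[-1].replace(",", "") if matches else None


def is_correct(output: str, solution: str) -> bool:
    pred = predicted_answer(output)
    return pred is not None and float(pred) == float(reference_answer(solution))


# Made-up example in the dataset's answer format.
solution = "Each box holds 12 eggs, so 3 boxes hold 3 * 12 = 36 eggs.\n#### 36"
print(is_correct("3 boxes times 12 eggs gives 36 eggs.", solution))
print(is_correct("The answer is 30.", solution))
```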

MATH

12,500 problems from math competitions (AMC 10, AMC 12, AIME, etc.), each labeled with a difficulty level from 1 to 5. Level 5 problems are the hardest.

What it measures: Mathematical reasoning across algebra, geometry, number theory, and combinatorics at high difficulty.

More discriminative than GSM8K — even frontier models struggle with Level 4-5 problems.

BIG-Bench Hard (BBH)

23 challenging tasks from the BIG-Bench benchmark on which earlier language models failed to outperform the average human rater. Tasks include:

  • Causal judgment
  • Formal fallacies
  • Movie recommendation
  • Object counting
  • Word sorting

Instruction Following

IFEval (Instruction Following Evaluation)

About 541 prompts built from 25 types of verifiable instructions, testing whether models follow explicit formatting and content constraints. Examples:

  • "Write a response using exactly 3 paragraphs"
  • "Respond without using the word 'the'"
  • "Include at least 300 words"

What it measures: Instruction adherence — a critical capability for production use cases.

Why it matters: A model that scores 90% on MMLU but only 60% on IFEval will frustrate users who rely on specific output formats.
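
What makes these instructions "verifiable" is that each can be checked by a small deterministic function instead of a human or model judge. The checkers below are simplified sketches written for this article, one per example instruction above; they are not IFEval's own implementations.

```python
import re


def has_exact_paragraphs(text: str, n: int) -> bool:
    """True if the text has exactly n paragraphs separated by blank lines."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text.strip()) if p.strip()]
    return len(paragraphs) == n


def avoids_word(text: str, word: str) -> bool:
    """True if the word never appears as a whole word (case-insensitive)."""
    pattern = rf"\b{re.escape(word)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is None


def has_min_words(text: str, n: int) -> bool:
    """True if the text contains at least n whitespace-separated words."""
    return len(text.split()) >= n


response = "First point.\n\nSecond point.\n\nThird point."
print(has_exact_paragraphs(response, 3))
print(avoids_word(response, "the"))
print(has_min_words(response, 300))
```

The same pattern works for your own format requirements (valid JSON, required section headings, maximum length), which makes this style of check cheap to add to a custom evaluation suite.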

Holistic Evaluation Frameworks

HELM (Holistic Evaluation of Language Models)

Stanford's HELM evaluates models across 42 scenarios spanning:

  • Question answering
  • Summarization
  • Toxicity and bias
  • Robustness
  • Efficiency (inference time and training cost)

What makes HELM different: It evaluates multiple dimensions simultaneously rather than a single accuracy number. A model might score poorly on toxicity even with high accuracy.

Limitation: The framework is computationally expensive and updates slowly compared to newer benchmarks.

LMSYS Chatbot Arena

What it is: Human preference evaluation via head-to-head battles. Users chat with two anonymous models simultaneously and vote for the response they prefer. Elo ratings are computed from millions of votes.

Why it's the most practically relevant benchmark: It measures real user preference on real conversations, not curated academic tasks. There's no contamination risk because the prompts are live user queries.

Limitation: Popularity bias — users may prefer longer, more confident-sounding responses even when they're less accurate.

Current Elo leaders (approximate, changes frequently):

  • Gemini Ultra / GPT-4o / Claude Opus cluster: ~1250-1300 Elo
  • Strong open-source models (Llama 3 70B, Mixtral): ~1100-1150 Elo
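
To read rating gaps like these, it helps to see how an Elo rating maps to an expected win rate. The sketch below is the classic Elo formula and online update; the K-factor of 32 is an assumption for illustration, and the Arena's actual rating computation may differ in its details.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A's response is preferred over B's under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (A, B) ratings after one battle.

    score_a is 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(update(1200, 1200, score_a=1.0))
print(round(expected_score(1250, 1100), 3))
```

A 150-point gap corresponds to the higher-rated model being preferred in roughly 70% of head-to-head votes, so the lower-rated model still wins a meaningful share of battles.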

Benchmark Contamination: The Hidden Problem

Many benchmark questions appear verbatim or near-verbatim in model training data. This is benchmark contamination, and it significantly inflates reported scores.
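
If you have access to a sample of candidate training text, verbatim overlap can be checked directly. The sketch below measures what fraction of a benchmark item's word n-grams also appear in a document; whitespace tokenization and the choice of n are simplifying assumptions, and real decontamination pipelines are more elaborate.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All word-level n-grams of the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(benchmark_item: str, document: str, n: int = 8) -> float:
    """Share of the item's n-grams that also occur in the document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens
    return len(item_grams & ngrams(document, n)) / len(item_grams)


item = "What is the approximate half-life of carbon-14?"
leaked = "quiz dump: what is the approximate half-life of carbon-14? answer b"
clean = "Radiocarbon dating relies on the steady decay of an isotope."

print(overlap_ratio(item, leaked, n=5))
print(overlap_ratio(item, clean, n=5))
```

A ratio near 1.0 flags an item as likely leaked; paraphrased leaks will slip past a check like this, so it gives a lower bound on contamination.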

Signs of contamination:

  • Model scores near-perfectly on published benchmark but poorly on held-out variants
  • Model can recite the "correct answer" before reasoning through the problem
  • Novel benchmark versions show 10-20% lower performance than published numbers

Mitigation strategies:

  • Use held-out or private test sets for high-stakes evaluation
  • Generate new problem variants dynamically (e.g., change numbers in math problems)
  • Use benchmarks with less public exposure (GPQA, newer datasets)
  • Weight LMSYS Arena results heavily — live user queries can't be contaminated
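
The dynamic-variant strategy is straightforward to prototype: keep the problem structure fixed and randomize the numbers, so the exact question text cannot have been memorized. The template, value ranges, and function names below are made up for illustration.

```python
import random

TEMPLATE = (
    "A shop sells pencils at {price} cents each. "
    "How much do {count} pencils cost, in cents?"
)


def make_variant(rng: random.Random) -> dict:
    """Generate one fresh problem together with its ground-truth answer."""
    price = rng.randint(5, 95)
    count = rng.randint(2, 20)
    return {
        "question": TEMPLATE.format(price=price, count=count),
        "answer": price * count,
    }


def score_gap(published_accuracy: float, variant_accuracy: float) -> float:
    """Accuracy drop on fresh variants; a large gap hints at contamination."""
    return published_accuracy - variant_accuracy


rng = random.Random(0)  # fixed seed so an evaluation run is reproducible
for _ in range(3):
    print(make_variant(rng))
```

Seeding the generator keeps runs comparable across models while still producing questions that never appeared in any public dataset.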

Choosing the Right Benchmarks

For coding assistants:

  • Primary: SWE-bench (real-world code changes)
  • Secondary: HumanEval, MBPP
  • Custom: Test on problems from your specific tech stack

For knowledge-intensive Q&A:

  • Primary: MMLU, GPQA
  • Custom: Domain-specific question sets from your field

For customer service / conversational AI:

  • Primary: LMSYS Arena Elo
  • Secondary: IFEval (instruction following)
  • Custom: Sample of real user queries from your support queue

For reasoning and analysis:

  • Primary: MATH, BBH
  • Secondary: GSM8K (increasingly saturated)

Building Your Own Benchmark

The best benchmark for your use case is one you build yourself. Academic benchmarks measure general capabilities; you need task-specific performance.

Step 1: Collect representative examples. Sample 200+ real inputs from your use case. If you're building a legal document analyzer, use real legal documents (anonymized).

Step 2: Generate ground truth labels. Have domain experts label the correct outputs. This is expensive but necessary.

Step 3: Define evaluation criteria. What makes a response good? Accuracy? Completeness? Brevity? Define rubrics for each dimension.

Step 4: Automate where possible. Use LLM-as-judge for qualitative dimensions, exact match or regex for factual assertions.

Step 5: Version your benchmark. As your product evolves, your benchmark should too. Track which benchmark version each evaluation run used.
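
For Step 4, the exact-match and regex checks can live in one small evaluator. The sketch below is one possible shape: the `Score` and `PatternEvaluator` names and the `re:` prefix convention for marking regex expectations are assumptions made for this example, not an established API.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Score:
    overall: float  # 1.0 for a match, 0.0 otherwise
    passed: bool


def _normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


class PatternEvaluator:
    """Exact match by default; regex search when `expected` starts with 're:'."""

    def evaluate(self, question: str, expected: str, actual: str,
                 context: Optional[str] = None) -> Score:
        if expected.startswith("re:"):
            hit = re.search(expected[3:], actual) is not None
        else:
            hit = _normalize(expected) == _normalize(actual)
        return Score(overall=1.0 if hit else 0.0, passed=hit)


evaluator = PatternEvaluator()
print(evaluator.evaluate("Capital of France?", "Paris", "  paris "))
print(evaluator.evaluate("C-14 half-life?", r"re:\b5,?730\b", "About 5,730 years"))
```

Qualitative dimensions such as tone or completeness would need a different evaluator (for example an LLM-as-judge) that returns the same `Score` shape.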

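# Custom benchmark runner
class CustomBenchmark:
    """Runs every test case through a model and aggregates the evaluator's scores."""

    def __init__(self, test_cases: list[dict], evaluator):
        if not test_cases:
            raise ValueError("test_cases must not be empty")
        # Each case needs 'id', 'input', and 'expected'; 'context' is optional.
        self.test_cases = test_cases
        # evaluator.evaluate(...) must return an object with .overall and .passed
        self.evaluator = evaluator

    def run(self, model) -> dict:
        results = []
        for case in self.test_cases:
            # The model only needs a generate(prompt) method.
            response = model.generate(case['input'])
            score = self.evaluator.evaluate(
                question=case['input'],
                expected=case['expected'],
                actual=response,
                context=case.get('context')
            )
            results.append({
                'case_id': case['id'],
                'score': score.overall,
                'passed': score.passed,
                'details': score
            })

        total = len(results)
        return {
            'mean_score': sum(r['score'] for r in results) / total,
            'pass_rate': sum(r['passed'] for r in results) / total,
            # Keep failures for inspection; they are usually more useful than the mean.
            'failed_cases': [r for r in results if not r['passed']]
        }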
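
Step 5 (versioning) can be made concrete by fingerprinting the test cases and storing the fingerprint with every evaluation run. This is a minimal sketch assuming each case is a JSON-serializable dict with an `id` key; the function name and the 12-character truncation are arbitrary choices.

```python
import hashlib
import json


def benchmark_fingerprint(test_cases: list[dict]) -> str:
    """Short, order-independent hash identifying the exact contents of a benchmark."""
    canonical = json.dumps(
        sorted(test_cases, key=lambda case: str(case["id"])),
        sort_keys=True,
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


cases = [
    {"id": 1, "input": "Capital of France?", "expected": "Paris"},
    {"id": 2, "input": "2 + 2?", "expected": "4"},
]
print(benchmark_fingerprint(cases))
```

Two runs are only comparable if their fingerprints match; any added, removed, or edited case produces a different value.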

Summary

Benchmark        | Best For                    | Contamination Risk | Practical Relevance
-----------------|-----------------------------|--------------------|--------------------
MMLU             | General knowledge screening | High               | Low-Medium
HumanEval        | Quick coding check          | High               | Medium
SWE-bench        | Real coding capability      | Low                | High
GPQA             | Expert knowledge            | Low                | Medium
IFEval           | Instruction following       | Low                | High
LMSYS Arena      | Overall user preference     | None               | Very High
Custom benchmark | Your specific use case      | None               | Highest

No single benchmark tells the whole story. Use a combination: academic benchmarks for broad capability screening, LMSYS Arena for overall quality signal, and your custom benchmark for deployment decisions.

The model that wins your custom benchmark is the model you should deploy — regardless of what the leaderboard says.
