Evaluating Reasoning Models: o3, o4-mini, Claude Thinking, Step Validation, and Rollup Frameworks

HelpMeTest

19 May 2026

Reasoning models such as OpenAI's o3 and o4-mini, Anthropic's Claude with extended thinking, and DeepSeek R1 require different evaluation strategies than standard language models. They trade token budget for reasoning quality, have their own failure modes, and (where the provider exposes the thinking text) produce intermediate steps you can check. This guide covers building eval frameworks specifically for reasoning models in 2026.

Why Reasoning Models Need Different Evals

Standard LLM evals measure output quality: accuracy, relevance, coherence. Reasoning models need additional eval dimensions:

Step validity — are the intermediate reasoning steps logically sound?
Budget efficiency — does the model use tokens proportional to problem difficulty?
Thinking vs. answer consistency — does the answer follow from the reasoning?
Failure mode detection — does the model admit uncertainty instead of fabricating?
Rollup metrics — aggregate quality across an eval suite

Setting Up Your Eval Infrastructure

// lib/eval-framework/types.ts
export interface EvalCase {
  id: string;
  category: string;
  prompt: string;
  expectedAnswer?: string;
  rubric?: EvalRubric;
  maxBudgetTokens?: number;
}

export interface EvalRubric {
  mustContain?: string[];
  mustNotContain?: string[];
  answerFormat?: 'number' | 'boolean' | 'list' | 'text';
  expectedValue?: string | number | boolean;
  tolerance?: number; // For numeric answers
}

export interface EvalResult {
  caseId: string;
  passed: boolean;
  score: number; // 0-100
  thinking: string | null;
  answer: string;
  budgetTokensUsed: number;
  thinkingTokens: number;
  errors: string[];
  metadata: Record<string, unknown>;
}

export interface EvalSuiteResult {
  suiteName: string;
  totalCases: number;
  passed: number;
  failed: number;
  averageScore: number;
  averageThinkingTokens: number;
  byCategory: Record<string, { passed: number; total: number; avgScore: number }>;
  results: EvalResult[];
  runAt: Date;
}
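
Because the rollup later looks cases up by `id`, a duplicated ID or a mismatched rubric can skew results without any error. A minimal sketch of a pre-run sanity check follows; `validateEvalCases` is a hypothetical helper, not part of the framework above, and the types are trimmed copies so the snippet runs on its own.

```typescript
// Trimmed copies of the types above so this snippet is self-contained.
interface EvalRubric {
  mustContain?: string[];
  mustNotContain?: string[];
  expectedValue?: string | number | boolean;
  tolerance?: number;
}

interface EvalCase {
  id: string;
  category: string;
  prompt: string;
  rubric?: EvalRubric;
  maxBudgetTokens?: number;
}

// Hypothetical helper: returns a list of human-readable problems (empty if none).
function validateEvalCases(cases: EvalCase[]): string[] {
  const problems: string[] = [];
  const seenIds = new Set<string>();

  for (const c of cases) {
    if (seenIds.has(c.id)) problems.push(`Duplicate case id: ${c.id}`);
    seenIds.add(c.id);

    if (c.prompt.trim() === '') problems.push(`Empty prompt: ${c.id}`);

    // A tolerance only makes sense next to a numeric expected value.
    const rubric = c.rubric;
    if (rubric && rubric.tolerance !== undefined && typeof rubric.expectedValue !== 'number') {
      problems.push(`Tolerance without numeric expectedValue: ${c.id}`);
    }

    if (c.maxBudgetTokens !== undefined && c.maxBudgetTokens <= 0) {
      problems.push(`Non-positive maxBudgetTokens: ${c.id}`);
    }
  }

  return problems;
}
```

Calling this once before a suite run, and failing fast on a non-empty list, keeps definition mistakes from showing up as model failures.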

Building a Reasoning Model Adapter

Create a unified adapter that works across o3, o4-mini, and Claude:

// lib/eval-framework/adapters.ts
import Anthropic from '@anthropic-ai/sdk';
import OpenAI from 'openai';

export interface ReasoningModelAdapter {
  generate(prompt: string, options?: { budgetTokens?: number }): Promise<{
    thinking: string | null;
    answer: string;
    inputTokens: number;
    outputTokens: number;
    thinkingTokens: number;
  }>;
  name: string;
}

export class ClaudeThinkingAdapter implements ReasoningModelAdapter {
  name = 'claude-thinking';
  private client: Anthropic;

  constructor(apiKey: string) {
    this.client = new Anthropic({ apiKey });
  }

  async generate(prompt: string, options: { budgetTokens?: number } = {}) {
    const response = await this.client.messages.create({
      model: 'claude-opus-4-6',
      max_tokens: 16000,
      thinking: {
        type: 'enabled',
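        // Extended thinking needs at least 1,024 budget tokens and fewer than max_tokens
        budget_tokens: Math.min(Math.max(options.budgetTokens ?? 5000, 1024), 15000),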
      },
      messages: [{ role: 'user', content: prompt }],
    } as any);

    let thinking: string | null = null;
    let answer = '';

    for (const block of response.content) {
      // Concatenate in case the response carries more than one thinking block
      if (block.type === 'thinking') thinking = (thinking ?? '') + (block as any).thinking;
      if (block.type === 'text') answer += block.text;
    }

    return {
      thinking,
      answer: answer.trim(),
      inputTokens: response.usage.input_tokens,
      outputTokens: response.usage.output_tokens,
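      // Assumes a `thinking_tokens` usage field; where the API counts thinking inside output_tokens, this stays 0
      thinkingTokens: (response.usage as any).thinking_tokens ?? 0,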
    };
  }
}

export class OpenAIReasoningAdapter implements ReasoningModelAdapter {
  name: string;
  private client: OpenAI;

  constructor(apiKey: string, model: 'o3' | 'o4-mini' = 'o4-mini') {
    this.client = new OpenAI({ apiKey });
    this.name = model;
  }

  async generate(prompt: string, options: { budgetTokens?: number } = {}) {
    const response = await this.client.chat.completions.create({
      model: this.name,
      messages: [{ role: 'user', content: prompt }],
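      // max_completion_tokens caps reasoning and visible output together, so leave headroom for the answer
      max_completion_tokens: (options.budgetTokens || 5000) + 2000,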
    } as any);

    return {
      thinking: null, // Chat Completions does not return raw reasoning text for o3/o4-mini
      answer: response.choices[0].message.content || '',
      inputTokens: response.usage?.prompt_tokens || 0,
      outputTokens: response.usage?.completion_tokens || 0,
      // Reasoning tokens are reported under usage.completion_tokens_details
      thinkingTokens: (response.usage as any)?.completion_tokens_details?.reasoning_tokens || 0,
    };
  }
}
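
For offline tests it can help to have an adapter that never touches the network. The sketch below is one way to do that without a mocking library; `ScriptedAdapter` is a hypothetical name, and the interface is a minimal copy of the contract above so the snippet stands alone.

```typescript
// Minimal copy of the adapter contract from above.
interface GenerateOutput {
  thinking: string | null;
  answer: string;
  inputTokens: number;
  outputTokens: number;
  thinkingTokens: number;
}

interface ReasoningModelAdapter {
  name: string;
  generate(prompt: string, options?: { budgetTokens?: number }): Promise<GenerateOutput>;
}

// Hypothetical test double: replays canned responses keyed by the exact prompt.
class ScriptedAdapter implements ReasoningModelAdapter {
  name = 'scripted';
  calls: string[] = [];
  private script: Record<string, Partial<GenerateOutput>>;

  constructor(script: Record<string, Partial<GenerateOutput>>) {
    this.script = script;
  }

  async generate(prompt: string): Promise<GenerateOutput> {
    this.calls.push(prompt);
    const canned = this.script[prompt];
    if (!canned) throw new Error(`No scripted response for prompt: ${prompt}`);
    // Fill in neutral defaults for any field the script leaves out.
    return {
      thinking: null,
      answer: '',
      inputTokens: 0,
      outputTokens: 0,
      thinkingTokens: 0,
      ...canned,
    };
  }
}
```

Keying by prompt rather than by call order means the test does not break if the suite is reordered, and an unscripted prompt fails loudly instead of returning `undefined`.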

Step Validation Evals

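For math problems, you can validate intermediate steps mechanically. The validator below extracts simple binary arithmetic statements of the form a op b = c from the thinking text and recomputes each one. It does not check logical inference, negative numbers, or multi-operand expressions: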

// lib/eval-framework/step-validator.ts
export interface Step {
  statement: string;
  isValid: boolean;
  reason?: string;
}

export function validateMathSteps(thinking: string): Step[] {
  const steps: Step[] = [];
  
  // Extract arithmetic expressions from thinking
  const arithmeticPattern = /(\d+(?:\.\d+)?)\s*([+\-×÷*/])\s*(\d+(?:\.\d+)?)\s*=\s*(\d+(?:\.\d+)?)/g;
  let match;
  
  while ((match = arithmeticPattern.exec(thinking)) !== null) {
    const [full, a, op, b, result] = match;
    const aNum = parseFloat(a);
    const bNum = parseFloat(b);
    const resultNum = parseFloat(result);
    
    let expected: number;
    switch (op) {
      case '+': expected = aNum + bNum; break;
      case '-': expected = aNum - bNum; break;
      case '*': case '×': expected = aNum * bNum; break;
      case '/': case '÷': expected = aNum / bNum; break;
      default: continue;
    }
    
    const isValid = Math.abs(expected - resultNum) < 0.001;
    steps.push({
      statement: full,
      isValid,
      reason: isValid ? undefined : `Expected ${expected}, got ${resultNum}`,
    });
  }
  
  return steps;
}

// Tests for step validator
describe('Step validator', () => {
  test('validates correct arithmetic steps', () => {
    const thinking = `
      First: 45 + 37 = 82
      Then: 82 × 3 = 246
      Final check: 246 / 2 = 123
    `;
    
    const steps = validateMathSteps(thinking);
    
    expect(steps.every(s => s.isValid)).toBe(true);
    expect(steps).toHaveLength(3);
  });

  test('detects arithmetic errors in reasoning', () => {
    const thinking = `
      Calculate: 15 × 7 = 100
    `;
    
    const steps = validateMathSteps(thinking);
    
    expect(steps[0].isValid).toBe(false);
    expect(steps[0].reason).toContain('Expected 105');
  });
});
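
To feed step validity into a score, the per-step results need to be reduced to a single number. A minimal sketch, assuming the `Step` shape above (renamed `StepCheck` here so the snippet is self-contained); `summarizeSteps` is a hypothetical helper, not part of the original validator.

```typescript
// Same shape as the Step interface above.
interface StepCheck {
  statement: string;
  isValid: boolean;
  reason?: string;
}

interface StepSummary {
  total: number;
  invalid: number;
  validityRate: number | null; // null when no checkable steps were found
  firstError: StepCheck | null;
}

// Hypothetical helper: collapses per-step results into one summary.
function summarizeSteps(steps: StepCheck[]): StepSummary {
  const invalidSteps = steps.filter(s => !s.isValid);
  return {
    total: steps.length,
    invalid: invalidSteps.length,
    // Report null rather than 1.0 when nothing was checked, so "no arithmetic
    // found" is not mistaken for "all arithmetic correct".
    validityRate: steps.length === 0 ? null : (steps.length - invalidSteps.length) / steps.length,
    firstError: invalidSteps[0] ?? null,
  };
}
```

The first invalid step is usually the most useful one to surface, since later steps often inherit its error.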

Budget Token Assertions

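A well-calibrated reasoning model should spend thinking tokens roughly in proportion to problem difficulty. Treat the ranges below as starting points to calibrate per model. Note that despite its name, `assertBudgetEfficiency` only logs warnings and never fails a test: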

export const DIFFICULTY_TOKEN_RANGES = {
  trivial: { min: 0, max: 100 },
  simple: { min: 50, max: 500 },
  moderate: { min: 200, max: 2000 },
  complex: { min: 500, max: 8000 },
  expert: { min: 1000, max: 16000 },
};

export function assertBudgetEfficiency(
  result: { thinkingTokens: number; passed: boolean },
  expectedDifficulty: keyof typeof DIFFICULTY_TOKEN_RANGES
): void {
  const range = DIFFICULTY_TOKEN_RANGES[expectedDifficulty];
  
  if (result.thinkingTokens < range.min) {
    console.warn(
      `Possible under-thinking: ${result.thinkingTokens} tokens for ${expectedDifficulty} problem (expected >=${range.min})`
    );
  }
  
  if (result.thinkingTokens > range.max && !result.passed) {
    console.warn(
      `Possible thrashing: ${result.thinkingTokens} tokens but still failed for ${expectedDifficulty} problem`
    );
  }
}

describe('Budget token assertions', () => {
  test('trivial questions use minimal thinking tokens', async () => {
    const mockAdapter = {
      name: 'mock',
      generate: jest.fn().mockResolvedValue({
        thinking: 'Simple.',
        answer: '4',
        inputTokens: 20,
        outputTokens: 5,
        thinkingTokens: 5,
      }),
    };

    const result = await mockAdapter.generate('What is 2 + 2?', { budgetTokens: 100 });
    
    expect(result.thinkingTokens).toBeLessThan(DIFFICULTY_TOKEN_RANGES.trivial.max);
    expect(result.answer).toBe('4');
  });

  test('complex problems use proportionally more tokens', async () => {
    const mockAdapter = {
      name: 'mock',
      generate: jest.fn().mockResolvedValue({
        thinking: 'This requires multi-step analysis...' + 'analysis '.repeat(100),
        answer: 'The complex answer is X because of reasons A, B, and C.',
        inputTokens: 200,
        outputTokens: 100,
        thinkingTokens: 1500,
      }),
    };

    const result = await mockAdapter.generate('Solve this multi-step optimization problem...');
    
    expect(result.thinkingTokens).toBeGreaterThan(DIFFICULTY_TOKEN_RANGES.simple.max);
    expect(result.thinkingTokens).toBeLessThan(DIFFICULTY_TOKEN_RANGES.complex.max);
  });
});
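
Warnings written to the console are hard to assert on. One alternative is to return a classification that a test or rollup can inspect. The sketch below reuses the same illustrative ranges; `classifyBudget` and the `over-budget` label (a passing run that exceeded the range) are additions for illustration, not part of the code above.

```typescript
type Difficulty = 'trivial' | 'simple' | 'moderate' | 'complex' | 'expert';

// Same illustrative ranges as DIFFICULTY_TOKEN_RANGES above.
const TOKEN_RANGES: Record<Difficulty, { min: number; max: number }> = {
  trivial: { min: 0, max: 100 },
  simple: { min: 50, max: 500 },
  moderate: { min: 200, max: 2000 },
  complex: { min: 500, max: 8000 },
  expert: { min: 1000, max: 16000 },
};

type BudgetFinding = 'ok' | 'under-thinking' | 'over-budget' | 'thrashing';

// Hypothetical helper: returns a label instead of logging a warning.
function classifyBudget(
  thinkingTokens: number,
  passed: boolean,
  difficulty: Difficulty
): BudgetFinding {
  const range = TOKEN_RANGES[difficulty];
  if (thinkingTokens < range.min) return 'under-thinking';
  // Over the range: a failing run is thrashing, a passing run is merely expensive.
  if (thinkingTokens > range.max) return passed ? 'over-budget' : 'thrashing';
  return 'ok';
}
```

A test can then assert on the label directly, for example by expecting `'ok'` for a trivial question answered with a handful of thinking tokens.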

Building a Rollup Eval Framework

Aggregate results across an eval suite for trend analysis:

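// lib/eval-framework/runner.ts
import type { EvalCase, EvalResult, EvalSuiteResult } from './types';
import type { ReasoningModelAdapter } from './adapters';
// scoreEvalCase is not shown in this guide; './scoring' is an assumed module path
import { scoreEvalCase } from './scoring';
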
export async function runEvalSuite(
  suiteName: string,
  cases: EvalCase[],
  adapter: ReasoningModelAdapter
): Promise<EvalSuiteResult> {
  const results: EvalResult[] = [];

  for (const evalCase of cases) {
    const startTime = Date.now();
    
    try {
      const response = await adapter.generate(evalCase.prompt, {
        budgetTokens: evalCase.maxBudgetTokens,
      });
      
      const evalResult = scoreEvalCase(evalCase, response);
      results.push({
        ...evalResult,
        metadata: { durationMs: Date.now() - startTime },
      });
    } catch (error) {
      results.push({
        caseId: evalCase.id,
        passed: false,
        score: 0,
        thinking: null,
        answer: '',
        budgetTokensUsed: 0,
        thinkingTokens: 0,
        errors: [(error as Error).message],
        metadata: { durationMs: Date.now() - startTime, error: true },
      });
    }
  }

  return computeRollup(suiteName, cases, results);
}

function computeRollup(
  suiteName: string,
  cases: EvalCase[],
  results: EvalResult[]
): EvalSuiteResult {
  const byCategory: Record<string, { passed: number; total: number; avgScore: number }> = {};

  for (const result of results) {
    const evalCase = cases.find(c => c.id === result.caseId)!;
    const category = evalCase.category;
    
    if (!byCategory[category]) {
      byCategory[category] = { passed: 0, total: 0, avgScore: 0 };
    }
    
    byCategory[category].total++;
    if (result.passed) byCategory[category].passed++;
    byCategory[category].avgScore = 
      (byCategory[category].avgScore * (byCategory[category].total - 1) + result.score) /
      byCategory[category].total;
  }

  return {
    suiteName,
    totalCases: results.length,
    passed: results.filter(r => r.passed).length,
    failed: results.filter(r => !r.passed).length,
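    // Guard against an empty suite, which would otherwise produce NaN
    averageScore: results.length ? results.reduce((sum, r) => sum + r.score, 0) / results.length : 0,
    averageThinkingTokens: results.length ? results.reduce((sum, r) => sum + r.thinkingTokens, 0) / results.length : 0,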
    byCategory,
    results,
    runAt: new Date(),
  };
}
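
The runner calls `scoreEvalCase`, which this guide does not define. The following is one possible sketch, not the original implementation. It assumes case-insensitive substring matching for `mustContain` and `mustNotContain`, a default numeric tolerance of 0.001, and that `budgetTokensUsed` mirrors the reported thinking tokens. Types are trimmed copies so the snippet runs on its own.

```typescript
interface EvalRubric {
  mustContain?: string[];
  mustNotContain?: string[];
  expectedValue?: string | number | boolean;
  tolerance?: number;
}
interface EvalCase { id: string; category: string; prompt: string; expectedAnswer?: string; rubric?: EvalRubric }
interface ModelOutput { thinking: string | null; answer: string; thinkingTokens: number }
interface EvalResult {
  caseId: string; passed: boolean; score: number; thinking: string | null; answer: string;
  budgetTokensUsed: number; thinkingTokens: number; errors: string[]; metadata: Record<string, unknown>;
}

// Sketch of a scorer: every rubric entry is one check; score is the share of checks passed.
function scoreEvalCase(evalCase: EvalCase, response: ModelOutput): EvalResult {
  const rubric: EvalRubric = evalCase.rubric ?? {};
  const answer = response.answer.trim();
  const lower = answer.toLowerCase();
  const errors: string[] = [];
  let checks = 0;

  const check = (ok: boolean, message: string) => {
    checks++;
    if (!ok) errors.push(message);
  };

  for (const term of rubric.mustContain ?? []) {
    check(lower.includes(term.toLowerCase()), `Missing required text: "${term}"`);
  }
  for (const term of rubric.mustNotContain ?? []) {
    check(!lower.includes(term.toLowerCase()), `Contains forbidden text: "${term}"`);
  }

  const expected = rubric.expectedValue;
  if (typeof expected === 'number') {
    const tolerance = rubric.tolerance ?? 0.001; // assumed default
    // Drop thousands separators, then accept any number in the answer within tolerance.
    const numbers = (answer.replace(/(\d),(?=\d{3}\b)/g, '$1').match(/-?\d+(?:\.\d+)?/g) ?? []).map(Number);
    check(numbers.some(n => Math.abs(n - expected) <= tolerance), `Expected ${expected} within ${tolerance}`);
  } else if (expected !== undefined) {
    check(lower.includes(String(expected).toLowerCase()), `Expected "${expected}"`);
  }

  if (evalCase.expectedAnswer !== undefined) {
    check(lower === evalCase.expectedAnswer.trim().toLowerCase(), 'Answer does not match expectedAnswer');
  }

  return {
    caseId: evalCase.id,
    passed: errors.length === 0,
    // A case with no checks passes by default.
    score: checks === 0 ? 100 : Math.round((100 * (checks - errors.length)) / checks),
    thinking: response.thinking,
    answer,
    budgetTokensUsed: response.thinkingTokens, // assumption: budget spend equals reported thinking tokens
    thinkingTokens: response.thinkingTokens,
    errors,
    metadata: {},
  };
}
```

Substring matching is deliberately loose: a rubric term like `no` also matches inside `cannot`, so stricter suites may want word-boundary matching instead.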

// Tests for the rollup framework
describe('Eval suite rollup', () => {
  test('computes correct pass rate', async () => {
    const mockAdapter: ReasoningModelAdapter = {
      name: 'mock',
      generate: jest.fn()
        .mockResolvedValueOnce({ thinking: 'Steps...', answer: '4', inputTokens: 10, outputTokens: 5, thinkingTokens: 50 })
        .mockResolvedValueOnce({ thinking: 'Steps...', answer: 'wrong', inputTokens: 10, outputTokens: 5, thinkingTokens: 100 })
        .mockResolvedValueOnce({ thinking: null, answer: 'Paris', inputTokens: 10, outputTokens: 5, thinkingTokens: 0 }),
    };

    const cases: EvalCase[] = [
      { id: 'math-1', category: 'math', prompt: '2+2?', rubric: { expectedValue: 4 } },
      { id: 'math-2', category: 'math', prompt: '15×7?', rubric: { expectedValue: 105 } },
      { id: 'geo-1', category: 'geography', prompt: 'Capital of France?', rubric: { mustContain: ['Paris'] } },
    ];

    const result = await runEvalSuite('Test Suite', cases, mockAdapter);

    expect(result.totalCases).toBe(3);
    expect(result.passed).toBe(2);
    expect(result.failed).toBe(1);
    expect(result.byCategory.math.total).toBe(2);
    expect(result.byCategory.geography.total).toBe(1);
    expect(result.byCategory.geography.passed).toBe(1);
  });

  test('tracks average thinking tokens across suite', async () => {
    const mockAdapter: ReasoningModelAdapter = {
      name: 'mock',
      generate: jest.fn()
        .mockResolvedValueOnce({ thinking: 'T', answer: '4', inputTokens: 10, outputTokens: 5, thinkingTokens: 100 })
        .mockResolvedValueOnce({ thinking: 'TT', answer: '6', inputTokens: 10, outputTokens: 5, thinkingTokens: 300 }),
    };

    const cases: EvalCase[] = [
      { id: 'c1', category: 'test', prompt: 'Q1' },
      { id: 'c2', category: 'test', prompt: 'Q2' },
    ];

    const result = await runEvalSuite('Token Suite', cases, mockAdapter);
    
    expect(result.averageThinkingTokens).toBe(200); // (100 + 300) / 2
  });
});
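
For trend analysis, a rollup is most useful when compared against a previous run. A minimal sketch of a per-category regression check follows; `findRegressions`, the `SuiteSnapshot` shape (a subset of `EvalSuiteResult`), and the 5-point default threshold are assumptions for illustration.

```typescript
// Subset of EvalSuiteResult needed for the comparison.
interface SuiteSnapshot {
  byCategory: Record<string, { passed: number; total: number; avgScore: number }>;
}

// Hypothetical helper: lists categories whose pass rate dropped by more than
// maxPassRateDrop (a fraction, so 0.05 means 5 percentage points).
function findRegressions(
  baseline: SuiteSnapshot,
  current: SuiteSnapshot,
  maxPassRateDrop = 0.05
): string[] {
  const regressions: string[] = [];

  for (const [category, base] of Object.entries(baseline.byCategory)) {
    const now = current.byCategory[category];
    if (!now) {
      regressions.push(`${category}: missing from current run`);
      continue;
    }
    if (base.total === 0 || now.total === 0) continue; // no rate to compare

    const drop = base.passed / base.total - now.passed / now.total;
    if (drop > maxPassRateDrop) {
      regressions.push(`${category}: pass rate fell by ${(drop * 100).toFixed(1)} points`);
    }
  }

  return regressions;
}
```

Storing each run's rollup as JSON and comparing the newest against the previous one in CI is one straightforward way to use this.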

Thinking-Answer Consistency Evaluation

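Verify that the answer follows from the reasoning. This check is a heuristic: it flags any number in the answer that never appears as the result of an `=` in the thinking text. That catches answers that drift from the model's own working, but it says nothing about whether the working itself is correct: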

export function checkThinkingAnswerConsistency(
  thinking: string,
  answer: string
): { consistent: boolean; issues: string[] } {
  const issues: string[] = [];

  // Extract numeric conclusions from thinking
  const thinkingNumbers = (thinking.match(/=\s*(\d+(?:\.\d+)?)/g) || [])
    .map(m => parseFloat(m.replace('=', '').trim()));
  
  // Extract numbers from answer
  const answerNumbers = (answer.match(/\d+(?:\.\d+)?/g) || [])
    .map(parseFloat);

  // Check if answer numbers appear in thinking conclusions
  for (const answerNum of answerNumbers) {
    if (!thinkingNumbers.some(t => Math.abs(t - answerNum) < 0.001)) {
      issues.push(`Answer contains ${answerNum} which doesn't appear as a conclusion in thinking`);
    }
  }

  return { consistent: issues.length === 0, issues };
}

describe('Thinking-answer consistency', () => {
  test('detects when answer contradicts thinking', () => {
    const thinking = 'Calculate: 6 × 8 = 48. So the area is 48 square meters.';
    const answer = 'The area is 56 square meters.'; // Wrong answer despite correct reasoning

    const { consistent, issues } = checkThinkingAnswerConsistency(thinking, answer);

    expect(consistent).toBe(false);
    expect(issues[0]).toContain('56');
  });

  test('passes when answer matches thinking conclusions', () => {
    const thinking = 'Step 1: 12 × 5 = 60. Step 2: 60 + 15 = 75. Answer = 75.';
    const answer = 'The result is 75.';

    const { consistent } = checkThinkingAnswerConsistency(thinking, answer);

    expect(consistent).toBe(true);
  });
});
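
The check above flags every number in the answer, including incidental ones such as a step count, and it requires thinking text that some models never expose. A narrower variant, sketched here under those assumptions, compares only the last `=` conclusion in the thinking with the numbers in the answer and reports `unverifiable` when there is nothing to compare. `checkFinalAnswerConsistency` is a hypothetical name.

```typescript
type ConsistencyStatus = 'consistent' | 'inconsistent' | 'unverifiable';

// Hypothetical variant: only the final "= N" in the thinking has to show up in the answer.
function checkFinalAnswerConsistency(
  thinking: string | null,
  answer: string
): { status: ConsistencyStatus; detail: string } {
  if (thinking === null) {
    return { status: 'unverifiable', detail: 'Model did not expose thinking text' };
  }

  // Numeric results that follow an equals sign, in order of appearance.
  const conclusions = (thinking.match(/=\s*-?\d+(?:\.\d+)?/g) ?? [])
    .map(m => parseFloat(m.replace('=', '')));
  const answerNumbers = (answer.match(/-?\d+(?:\.\d+)?/g) ?? []).map(Number);

  const finalConclusion = conclusions.pop();
  if (finalConclusion === undefined || answerNumbers.length === 0) {
    return { status: 'unverifiable', detail: 'No numeric conclusion to compare' };
  }

  const found = answerNumbers.some(n => Math.abs(n - finalConclusion) < 0.001);
  return found
    ? { status: 'consistent', detail: `Answer contains final conclusion ${finalConclusion}` }
    : { status: 'inconsistent', detail: `Thinking concluded ${finalConclusion}, answer did not include it` };
}
```

Treating `unverifiable` as its own outcome keeps o3 and o4-mini results, where `thinking` is `null`, from being counted as either passes or failures.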

Comparing Models with the Same Eval Suite

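Run the same suite through each adapter and compare the rollups. The test below uses mocked responses, so it runs offline and exercises the harness rather than the models. Swap in the real adapters to benchmark actual model behavior: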

const MODEL_COMPARISON_CASES: EvalCase[] = [
  { id: 'math-hard', category: 'math', prompt: 'A store sells items at 30% markup. If the cost is $85, and sales tax is 8.5%, what is the final price?', rubric: { expectedValue: 119.89, tolerance: 0.5 } }, // 85 × 1.3 × 1.085 = 119.8925
  { id: 'logic-hard', category: 'logic', prompt: 'If all A are B, all B are C, and no C are D, can any A be D?', rubric: { mustContain: ['no', 'cannot'] } },
  { id: 'code', category: 'code', prompt: 'Write a function to detect if a string is a palindrome, handling spaces and capitalization.', rubric: { mustContain: ['function', 'return', 'toLowerCase'] } },
];

test('compare Claude thinking vs o4-mini on benchmark', async () => {
  const claudeAdapter = {
    name: 'claude-thinking',
    generate: jest.fn()
      .mockResolvedValueOnce({ thinking: 'Cost + markup = 85 × 1.3 = 110.5. With tax: 110.5 × 1.085 = 119.89', answer: '$119.89', inputTokens: 100, outputTokens: 20, thinkingTokens: 800 })
      .mockResolvedValueOnce({ thinking: 'All A are B. All B are C. No C are D. So every A is a C, and no C is a D. Therefore no A is D.', answer: 'No, A cannot be D.', inputTokens: 100, outputTokens: 30, thinkingTokens: 600 })
      .mockResolvedValueOnce({ thinking: null, answer: 'function isPalindrome(s) { const clean = s.toLowerCase().replace(/[^a-z]/g, ""); return clean === clean.split("").reverse().join(""); }', inputTokens: 100, outputTokens: 80, thinkingTokens: 0 }),
  };

  const o4MiniAdapter = {
    name: 'o4-mini',
    generate: jest.fn()
      .mockResolvedValueOnce({ thinking: null, answer: '$119.89', inputTokens: 80, outputTokens: 15, thinkingTokens: 300 })
      .mockResolvedValueOnce({ thinking: null, answer: 'No, A cannot be D.', inputTokens: 80, outputTokens: 20, thinkingTokens: 400 })
      .mockResolvedValueOnce({ thinking: null, answer: 'function isPalindrome(str) { return str.toLowerCase().replace(/\\s+/g, "") === str.toLowerCase().replace(/\\s+/g, "").split("").reverse().join(""); }', inputTokens: 80, outputTokens: 60, thinkingTokens: 0 }),
  };

  const claudeResults = await runEvalSuite('Claude Thinking', MODEL_COMPARISON_CASES, claudeAdapter);
  const o4MiniResults = await runEvalSuite('o4-mini', MODEL_COMPARISON_CASES, o4MiniAdapter);

  // Log comparison for CI artifacts
  console.log('Model Comparison:', {
    claude: { passRate: claudeResults.passed / claudeResults.totalCases, avgTokens: claudeResults.averageThinkingTokens },
    o4Mini: { passRate: o4MiniResults.passed / o4MiniResults.totalCases, avgTokens: o4MiniResults.averageThinkingTokens },
  });

  // Assert minimum quality thresholds for both models
  expect(claudeResults.passed / claudeResults.totalCases).toBeGreaterThan(0.7);
  expect(o4MiniResults.passed / o4MiniResults.totalCases).toBeGreaterThan(0.7);
});
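
Pass rate and average tokens are easier to weigh against each other as a single cost-per-success figure. A minimal sketch, assuming a subset of `EvalSuiteResult`; `toComparisonRow` and `tokensPerPass` are hypothetical names. Token counts from different providers come from different tokenizers, so treat cross-vendor numbers as rough indicators.

```typescript
// Subset of EvalSuiteResult needed for a comparison row.
interface SuiteRollup {
  suiteName: string;
  totalCases: number;
  passed: number;
  results: { thinkingTokens: number }[];
}

interface ComparisonRow {
  model: string;
  passRate: number | null; // null when the suite is empty
  totalThinkingTokens: number;
  tokensPerPass: number | null; // null when nothing passed
}

// Hypothetical helper: one row per model, ready for a table or CI artifact.
function toComparisonRow(rollup: SuiteRollup): ComparisonRow {
  const totalThinkingTokens = rollup.results.reduce((sum, r) => sum + r.thinkingTokens, 0);
  return {
    model: rollup.suiteName,
    passRate: rollup.totalCases > 0 ? rollup.passed / rollup.totalCases : null,
    totalThinkingTokens,
    // Thinking tokens spent across the whole suite per passing case.
    tokensPerPass: rollup.passed > 0 ? totalThinkingTokens / rollup.passed : null,
  };
}
```

Mapping each suite result through this function and logging the rows gives a compact side-by-side view of quality against reasoning cost.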

E2E Monitoring with HelpMeTest

Production AI reasoning features need continuous monitoring:

Navigate to https://your-app.com/ai-tutor
Enter the math problem: "A train travels 180 miles in 3 hours. What is its average speed?"
Click Solve
Verify response appears within 20 seconds
Verify response contains "60" (miles per hour)
Verify step-by-step reasoning is displayed
Click "Show thinking" toggle (if available)
Verify reasoning steps are visible and contain arithmetic

HelpMeTest monitors these tests 24/7, alerting you when reasoning model responses degrade in quality, time out, or return unexpected formats.

Summary

Building eval frameworks for reasoning models requires:

Unified adapters — abstract o3, o4-mini, and Claude thinking into a common interface
Step validation — verify arithmetic and logical steps in thinking chains
Budget token assertions — ensure token usage is proportional to problem difficulty
Rollup frameworks — aggregate pass rates, scores, and token usage across eval suites
Thinking-answer consistency — detect when answers contradict the model's own reasoning
Model comparison — run the same eval suite against multiple models for benchmarking
E2E monitoring — HelpMeTest for production reasoning feature health