Testing DeepSeek R1: Reasoning Chain Verification and Chain-of-Thought Evals

HelpMeTest

19 May 2026 — 6 min read

DeepSeek R1 is an open-weight reasoning model that emits its chain-of-thought inside a <think> block before the final answer. That visibility makes it unusually testable: you can evaluate not just whether the answer is correct, but also whether the reasoning that led to it holds up. This guide covers testing strategies for DeepSeek R1 in both local and API deployments.

Understanding DeepSeek R1's Output Format

DeepSeek R1 outputs reasoning in a specific format:

<think>
Let me analyze this step by step.
First, I'll consider...
Then I'll evaluate...
Therefore, the answer must be...
</think>

The final answer is X because Y.

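Your integration code needs to parse this format, and your tests need to validate both the thinking process and the final answer. The inline tags appear in the raw model output. The hosted DeepSeek API instead returns the reasoning in a separate reasoning_content field on the message, so API responses need to be normalized before they reach the same parser.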

Parsing DeepSeek R1 Responses

// lib/deepseek-parser.ts
export interface DeepSeekResponse {
  thinking: string | null;
  answer: string;
  raw: string;
}

export function parseDeepSeekResponse(raw: string): DeepSeekResponse {
  const thinkMatch = raw.match(/<think>([\s\S]*?)<\/think>/);
  const thinking = thinkMatch ? thinkMatch[1].trim() : null;
  
  const answer = raw
    .replace(/<think>[\s\S]*?<\/think>/g, '')
    .trim();

  return { thinking, answer, raw };
}

Test the parser thoroughly:

import { parseDeepSeekResponse } from '../lib/deepseek-parser';

describe('DeepSeek R1 response parser', () => {
  test('extracts thinking block correctly', () => {
    const raw = `<think>
I need to calculate the sum.
First: 2 + 3 = 5
Then: 5 + 7 = 12
</think>

The answer is 12.`;

    const result = parseDeepSeekResponse(raw);
    
    expect(result.thinking).toContain('I need to calculate the sum.');
    expect(result.thinking).toContain('5 + 7 = 12');
    expect(result.answer).toBe('The answer is 12.');
  });

  test('handles responses without thinking block', () => {
    const raw = 'The capital of France is Paris.';
    
    const result = parseDeepSeekResponse(raw);
    
    expect(result.thinking).toBeNull();
    expect(result.answer).toBe('The capital of France is Paris.');
  });

  test('handles multi-paragraph answers after thinking', () => {
    const raw = `<think>
Analysis complete.
</think>

First paragraph of the answer.

Second paragraph with more details.`;

    const result = parseDeepSeekResponse(raw);
    
    expect(result.answer).toContain('First paragraph');
    expect(result.answer).toContain('Second paragraph');
    expect(result.thinking).toBe('Analysis complete.');
  });
});
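The tests above only cover well-formed output. Truncated generations can leave a <think> block unclosed, and some serving setups emit only the closing tag. The sketch below shows one way to tolerate both cases. The name parseLenient and the malformed flag are hypothetical additions, not part of the parser above, and the fallback rules are assumptions you may want to tighten. It also handles only the first think block.

```typescript
interface LenientParse {
  thinking: string | null;
  answer: string;
  malformed: boolean; // true when the tags were not a clean <think>...</think> pair
}

// Hypothetical companion to parseDeepSeekResponse that tolerates broken tags.
function parseLenient(raw: string): LenientParse {
  const OPEN = '<think>';
  const CLOSE = '</think>';
  const open = raw.indexOf(OPEN);
  const close = raw.indexOf(CLOSE);

  // Well-formed pair: reasoning sits between the tags, the answer is everything else.
  if (open !== -1 && close > open) {
    return {
      thinking: raw.slice(open + OPEN.length, close).trim(),
      answer: (raw.slice(0, open) + raw.slice(close + CLOSE.length)).trim(),
      malformed: false,
    };
  }

  // Closing tag only: treat everything before it as reasoning.
  if (open === -1 && close !== -1) {
    return {
      thinking: raw.slice(0, close).trim(),
      answer: raw.slice(close + CLOSE.length).trim(),
      malformed: true,
    };
  }

  // Opening tag without a usable closing tag: likely cut off mid-reasoning, so no answer.
  if (open !== -1) {
    return { thinking: raw.slice(open + OPEN.length).trim(), answer: '', malformed: true };
  }

  // No tags at all: the whole string is the answer.
  return { thinking: null, answer: raw.trim(), malformed: false };
}
```

Asserting on malformed in a test gives you an explicit signal when output was truncated, for example by a low token limit, instead of silently treating partial reasoning as the answer.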

Reasoning Chain Verification

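Because the reasoning is visible, you can verify the process as well as the output. The checks here are keyword heuristics: they show that the reasoning has structure and touches the expected concepts, not that each step is logically valid: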

export interface ReasoningEval {
  containsSteps: boolean;
  mentionsKeyConcepts: string[];
  hasLogicalFlow: boolean;
  wordCount: number;
  avoidsFallacies: boolean;
}

export function evaluateReasoning(thinking: string, expectedConcepts: string[]): ReasoningEval {
  const mentionsKeyConcepts = expectedConcepts.filter(concept =>
    thinking.toLowerCase().includes(concept.toLowerCase())
  );

  // Check for logical connectors that indicate structured reasoning
  const logicalConnectors = ['therefore', 'thus', 'because', 'since', 'first', 'then', 'finally'];
  const hasLogicalFlow = logicalConnectors.some(c => thinking.toLowerCase().includes(c));

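  // Check for step markers: numbered list items ("1. ..."), "step N", or ordinal words.
  // The numbered pattern is anchored to line starts so "the answer is 7." does not count.
  const containsSteps = /^\s*\d+[.)]\s|\bstep \d|\b(first|second|third)\b/im.test(thinking);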

  // Flag absolute or hand-waving wording. This is a crude keyword proxy, not real fallacy detection.
  const fallacies = ['always', 'never', 'everyone knows', 'obviously'];
  const avoidsFallacies = !fallacies.some(f => new RegExp(`\\b${f}\\b`, 'i').test(thinking));

  return {
    containsSteps,
    mentionsKeyConcepts,
    hasLogicalFlow,
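    // Trim first so leading or trailing whitespace is not counted as an empty word
    wordCount: thinking.trim().split(/\s+/).filter(Boolean).length,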
    avoidsFallacies,
  };
}

describe('Reasoning chain evaluation', () => {
  test('evaluates math problem reasoning correctly', () => {
    const thinking = `
      First, I need to identify the equation: x + 5 = 12.
      Then, I'll solve for x by subtracting 5 from both sides.
      x = 12 - 5 = 7.
      Therefore, x = 7.
    `;

    // "eval" cannot be used as a variable name in strict-mode modules, so use a longer name
    const evaluation = evaluateReasoning(thinking, ['equation', 'solve', 'subtract']);
    
    expect(evaluation.containsSteps).toBe(true);
    expect(evaluation.hasLogicalFlow).toBe(true);
    expect(evaluation.mentionsKeyConcepts).toContain('solve');
    expect(evaluation.mentionsKeyConcepts).toContain('subtract');
    expect(evaluation.wordCount).toBeGreaterThan(20);
  });

  test('flags reasoning that skips steps', () => {
    const thinking = `Obviously the answer is 7`;
    
    const evaluation = evaluateReasoning(thinking, ['equation', 'solve']);
    
    expect(evaluation.containsSteps).toBe(false);
    expect(evaluation.avoidsFallacies).toBe(false); // "obviously" is a hand-waving marker
    expect(evaluation.mentionsKeyConcepts).toHaveLength(0);
  });
});
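Keyword checks say nothing about whether the individual steps are right. For math-style reasoning, one inexpensive extra check is to recompute any simple "a op b = c" statements found in the thinking block. The sketch below is an illustration under stated assumptions: verifyArithmeticSteps and ArithmeticCheck are hypothetical names, and the pattern deliberately skips anything it cannot parse safely (percentages, thousands separators, chained equalities) rather than guessing.

```typescript
interface ArithmeticCheck {
  expression: string; // matched text, e.g. "5 + 7 = 12"
  expected: number;   // value recomputed from the operands
  claimed: number;    // value the model wrote after "="
  valid: boolean;
}

function compute(a: number, op: string, b: number): number {
  switch (op) {
    case '+': return a + b;
    case '-': return a - b;
    case '*':
    case '×': return a * b;
    default: return a / b; // "/" or "÷"
  }
}

// Hypothetical helper: finds simple binary arithmetic statements and recomputes them.
function verifyArithmeticSteps(thinking: string, tolerance = 1e-9): ArithmeticCheck[] {
  // number, operator, number, "=", optional "$", number.
  // The lookarounds reject partial numbers such as the "000" in "1,000" or the "12" in "12.5%".
  const pattern =
    /(?<![\d.,])(-?\d+(?:\.\d+)?)\s*([+\-*\/×÷])\s*(-?\d+(?:\.\d+)?)\s*=\s*\$?(-?\d+(?:\.\d+)?)(?!,\d|\.\d|\d|%)/g;
  const checks: ArithmeticCheck[] = [];
  let match: RegExpExecArray | null;

  while ((match = pattern.exec(thinking)) !== null) {
    const expected = compute(Number(match[1]), match[2], Number(match[3]));
    const claimed = Number(match[4]);
    checks.push({
      expression: match[0],
      expected,
      claimed,
      valid: Math.abs(expected - claimed) <= tolerance,
    });
  }
  return checks;
}
```

In a test you might assert that every returned check is valid. The default tolerance treats rounded results such as 10 / 3 = 3.33 as invalid, so pass a looser tolerance if your prompts involve rounding.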

Chain-of-Thought Quality Evals

Build a systematic eval framework for measuring CoT quality:

export interface CoTEval {
  questionId: string;
  question: string;
  expectedAnswer: string;
  modelAnswer: string;
  thinking: string | null;
  answerCorrect: boolean;
  reasoningScore: number; // 0-100
  thinkingTokens: number;
}

export async function runCoTEval(
  testCases: { id: string; question: string; expectedAnswer: string; concepts: string[] }[],
  modelFn: (question: string) => Promise<{ thinking: string | null; answer: string }>,
  options = { scoringFn: defaultReasoningScorer }
): Promise<CoTEval[]> {
  return Promise.all(testCases.map(async (testCase) => {
    const result = await modelFn(testCase.question);
    
    const answerCorrect = result.answer
      .toLowerCase()
      .includes(testCase.expectedAnswer.toLowerCase());
    
    const reasoningScore = result.thinking
      ? options.scoringFn(result.thinking, testCase.concepts)
      : 0;

    return {
      questionId: testCase.id,
      question: testCase.question,
      expectedAnswer: testCase.expectedAnswer,
      modelAnswer: result.answer,
      thinking: result.thinking,
      answerCorrect,
      reasoningScore,
      // Whitespace-split word count, used as a rough proxy for the real token count
      thinkingTokens: result.thinking?.split(/\s+/).length || 0,
    };
  }));
}

function defaultReasoningScorer(thinking: string, concepts: string[]): number {
  let score = 0;
  const lowerThinking = thinking.toLowerCase();
  
  // Score for mentioning key concepts (skipped when none are supplied, to avoid 0 / 0 = NaN)
  const conceptsFound = concepts.filter(c => lowerThinking.includes(c.toLowerCase())).length;
  if (concepts.length > 0) {
    score += (conceptsFound / concepts.length) * 40;
  }
  
  // Score for logical flow
  const logicalConnectors = ['therefore', 'thus', 'because', 'first', 'then'];
  const connectorsFound = logicalConnectors.filter(c => lowerThinking.includes(c)).length;
  score += Math.min(connectorsFound * 10, 30);
  
  // Score for length (up to a point)
  const wordCount = thinking.split(/\s+/).length;
  score += Math.min(wordCount / 10, 30);
  
  return Math.min(score, 100);
}

// Usage in tests
const MATH_TEST_CASES = [
  {
    id: 'math-001',
    question: 'If a car travels at 60mph for 2.5 hours, how far does it travel?',
    expectedAnswer: '150',
    concepts: ['distance', 'speed', 'time', 'multiply', '60', '2.5'],
  },
  {
    id: 'math-002',
    question: 'A shirt costs $40. After a 25% discount, what is the new price?',
    expectedAnswer: '30',
    concepts: ['discount', 'percent', 'subtract', '40', '25'],
  },
];

test('DeepSeek R1 achieves >60 average reasoning score on math problems', async () => {
  const mockModel = jest.fn().mockImplementation(async (question: string) => {
    // Canned responses keyed on the question text. Each one names the expected concepts
    // and uses connectors, so defaultReasoningScorer gives it roughly 73 out of 100.
    if (question.includes('60mph')) {
      return {
        thinking: 'First, I recall distance = speed × time. Then I multiply: speed = 60mph, time = 2.5h. Therefore distance = 60 × 2.5 = 150 miles.',
        answer: '150 miles',
      };
    }
    return {
      thinking: 'First, the discount is 25 percent of 40: 40 × 0.25 = $10. Then I subtract it from the price: 40 - 10 = $30. Therefore the new price is $30.',
      answer: '$30',
    };
  });

  const evals = await runCoTEval(MATH_TEST_CASES, mockModel);

  const avgReasoningScore = evals.reduce((sum, e) => sum + e.reasoningScore, 0) / evals.length;
  const correctAnswers = evals.filter(e => e.answerCorrect).length;

  // The length component needs about 300 words to max out, so short traces top out near 73
  expect(avgReasoningScore).toBeGreaterThan(60);
  expect(correctAnswers).toBe(MATH_TEST_CASES.length);
});
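One weakness of the substring comparison in runCoTEval is that an expected answer of '30' also matches a model answer of '$130'. For numeric questions, comparing extracted numbers is stricter. The following is a minimal sketch, assuming answers are plain decimal numbers; extractNumbers and numericAnswerMatches are hypothetical names, not part of the framework above.

```typescript
// Pulls every decimal number out of a string, dropping thousands separators first.
function extractNumbers(text: string): number[] {
  const withoutSeparators = text.replace(/(\d),(?=\d{3})/g, '$1');
  const matches = withoutSeparators.match(/-?\d+(?:\.\d+)?/g) ?? [];
  return matches.map(Number);
}

// Hypothetical replacement for the substring check when the expected answer is numeric.
function numericAnswerMatches(
  modelAnswer: string,
  expectedAnswer: string,
  tolerance = 1e-6
): boolean {
  const expected = Number(expectedAnswer);

  // Non-numeric (or empty) expectations fall back to the original substring comparison.
  if (expectedAnswer.trim() === '' || Number.isNaN(expected)) {
    return modelAnswer.toLowerCase().includes(expectedAnswer.toLowerCase());
  }

  // Match when any number in the answer equals the expected value within the tolerance.
  return extractNumbers(modelAnswer).some(n => Math.abs(n - expected) <= tolerance);
}
```

Any number in the answer is allowed to match, so an answer that restates the inputs ("40 minus 10 is 30") still passes. If you need the final number only, compare against the last element of extractNumbers instead.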

Testing Local vs API Deployment

DeepSeek R1 is open-weight and can run locally with Ollama. Test both deployment modes:

// lib/deepseek-client.ts
export interface DeepSeekClient {
  generate(prompt: string, options?: { temperature?: number; maxTokens?: number }): Promise<string>;
}

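export class OllamaDeepSeekClient implements DeepSeekClient {
  constructor(private baseUrl: string = 'http://localhost:11434') {}

  async generate(
    prompt: string,
    options: { temperature?: number; maxTokens?: number } = {}
  ): Promise<string> {
    const response = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'deepseek-r1:8b',
        prompt,
        stream: false,
        // Ollama calls the output-token limit num_predict; JSON.stringify drops it when undefined
        options: { temperature: options.temperature ?? 0, num_predict: options.maxTokens },
      }),
    });

    // Surface HTTP errors instead of returning undefined from a failed response body
    if (!response.ok) {
      throw new Error(`Ollama request failed with status ${response.status}`);
    }

    const data = await response.json();
    return data.response;
  }
}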

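export class DeepSeekAPIClient implements DeepSeekClient {
  constructor(private apiKey: string) {}

  async generate(
    prompt: string,
    options: { temperature?: number; maxTokens?: number } = {}
  ): Promise<string> {
    const response = await fetch('https://api.deepseek.com/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: 'deepseek-reasoner',
        messages: [{ role: 'user', content: prompt }],
        // The API expects snake_case names; JSON.stringify drops undefined fields
        temperature: options.temperature,
        max_tokens: options.maxTokens,
      }),
    });

    if (!response.ok) {
      throw new Error(`DeepSeek API request failed with status ${response.status}`);
    }

    const data = await response.json();
    const { content, reasoning_content: reasoning } = data.choices[0].message;

    // deepseek-reasoner returns the chain-of-thought in reasoning_content, separate from
    // content. Re-wrap it in <think> tags so parseDeepSeekResponse handles both clients.
    if (reasoning && !content.includes('<think>')) {
      return `<think>\n${reasoning}\n</think>\n\n${content}`;
    }
    return content;
  }
}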
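Reasoning models can spend a long time inside the think block, and neither client above sets a deadline. A small wrapper, sketched here as an assumption rather than part of the original clients, makes slow responses fail loudly in tests. withTimeout is a hypothetical helper that works with any promise, including the one returned by generate.

```typescript
// Rejects if the wrapped promise has not settled within `ms` milliseconds.
// Note: this does not cancel the underlying request; it only stops waiting for it.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`No response within ${ms} ms`)), ms);
  });

  // Whichever settles first wins; always clear the timer so the process can exit.
  return Promise.race([work, deadline]).finally(() => clearTimeout(timer));
}
```

Usage would look like `await withTimeout(client.generate(prompt), 15000)`. To actually abort the HTTP request, you would pass an AbortSignal to fetch inside the client instead.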

Test client behavior independently with MSW:

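import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';
import { OllamaDeepSeekClient, DeepSeekAPIClient } from '../lib/deepseek-client';
import { parseDeepSeekResponse } from '../lib/deepseek-parser';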

const server = setupServer(
  http.post('http://localhost:11434/api/generate', () => {
    return HttpResponse.json({
      response: '<think>\nCalculating...\n</think>\n\nThe answer is 42.',
      done: true,
    });
  }),

  http.post('https://api.deepseek.com/chat/completions', () => {
    return HttpResponse.json({
      choices: [{
        message: {
          role: 'assistant',
          content: '<think>\nReasoning through...\n</think>\n\n42.',
          reasoning_content: 'Reasoning through...',
        },
      }],
      usage: { prompt_tokens: 50, completion_tokens: 100, total_tokens: 150 },
    });
  })
);

beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

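test('Ollama client sends correct request format', async () => {
  let body: any;
  // Override the default handler for this test so the request body can be inspected
  server.use(
    http.post('http://localhost:11434/api/generate', async ({ request }) => {
      body = await request.json();
      return HttpResponse.json({ response: 'The answer is 42.', done: true });
    })
  );

  const client = new OllamaDeepSeekClient();
  const result = await client.generate('What is 6 × 7?');

  expect(body).toMatchObject({ model: 'deepseek-r1:8b', prompt: 'What is 6 × 7?', stream: false });
  expect(result).toContain('42');
});

test('API client sends correct request format', async () => {
  let body: any;
  let authHeader: string | null = null;
  server.use(
    http.post('https://api.deepseek.com/chat/completions', async ({ request }) => {
      body = await request.json();
      authHeader = request.headers.get('authorization');
      return HttpResponse.json({ choices: [{ message: { role: 'assistant', content: '42.' } }] });
    })
  );

  const client = new DeepSeekAPIClient('test-api-key');
  const result = await client.generate('What is 6 × 7?');

  expect(authHeader).toBe('Bearer test-api-key');
  expect(body).toMatchObject({
    model: 'deepseek-reasoner',
    messages: [{ role: 'user', content: 'What is 6 × 7?' }],
  });
  expect(result).toContain('42');
});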

test('both clients produce parseable CoT output', async () => {
  const ollamaClient = new OllamaDeepSeekClient();
  const apiClient = new DeepSeekAPIClient('test-key');

  const ollamaResult = parseDeepSeekResponse(await ollamaClient.generate('Test'));
  const apiResult = parseDeepSeekResponse(await apiClient.generate('Test'));

  expect(ollamaResult.thinking).toBeTruthy();
  expect(apiResult.thinking).toBeTruthy();
});

Building a Regression Test Suite

Prevent answer quality regressions when switching model versions:

// eval/deepseek-regression.eval.ts
import { parseDeepSeekResponse } from '../lib/deepseek-parser';

const REGRESSION_TEST_CASES = [
  {
    id: 'logic-001',
    prompt: 'All roses are flowers. Some flowers fade quickly. Can we conclude all roses fade quickly?',
    expectedInAnswer: ['no', 'cannot'],
    expectedInReasoning: ['some', 'all', 'logic'],
  },
  {
    id: 'code-001',
    prompt: 'Write a TypeScript function that returns the nth Fibonacci number.',
    expectedInAnswer: ['function', 'fibonacci', 'return'],
    expectedInReasoning: ['base case', 'recursive'],
  },
];

test.each(REGRESSION_TEST_CASES)(
  'regression: $id maintains expected behavior',
  async ({ id, prompt, expectedInAnswer, expectedInReasoning }) => {
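    // createDeepSeekMockForRegressionTest is assumed to exist in your test helpers: it should
    // return a function that replays the recorded raw output for this case id
    const mockModel = createDeepSeekMockForRegressionTest(id);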
    const result = parseDeepSeekResponse(await mockModel(prompt));
    
    const answerLower = result.answer.toLowerCase();
    for (const expected of expectedInAnswer) {
      expect(answerLower).toContain(expected);
    }

    if (result.thinking && expectedInReasoning.length > 0) {
      const thinkingLower = result.thinking.toLowerCase();
      const found = expectedInReasoning.filter(e => thinkingLower.includes(e));
      expect(found.length).toBeGreaterThanOrEqual(1);
    }
  }
);
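The test above calls createDeepSeekMockForRegressionTest without defining it. One plausible shape, sketched under the assumption that you store a recorded raw response per case id, is shown below. The fixture text is illustrative, not real model output; in practice you would capture it from a model version you trust and keep it in fixture files.

```typescript
// Hypothetical recorded outputs, keyed by regression case id.
const RECORDED_RESPONSES: Record<string, string> = {
  'logic-001':
    '<think>\nThe premise says some flowers fade quickly, not all flowers. ' +
    'Roses may or may not be among them.\n</think>\n\n' +
    'No, we cannot conclude that all roses fade quickly.',
  'code-001':
    '<think>\nI need a base case for n < 2 and a recursive step.\n</think>\n\n' +
    'function fibonacci(n: number): number {\n' +
    '  return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);\n' +
    '}',
};

// Returns a model function that ignores the prompt and replays the recorded output.
// Throws early for unknown ids so a missing fixture is not mistaken for a model regression.
function createDeepSeekMockForRegressionTest(id: string): (prompt: string) => Promise<string> {
  const recorded = RECORDED_RESPONSES[id];
  if (recorded === undefined) {
    throw new Error(`No recorded response for regression case "${id}"`);
  }
  return () => Promise.resolve(recorded);
}
```

Replaying fixtures only checks that your parsing and assertions still work. To detect a regression in a new model version, regenerate the recorded responses with that version (or swap the mock for a real client) and run the same assertions.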

E2E Testing with HelpMeTest

AI reasoning features need monitoring in production to catch quality regressions:

Navigate to https://your-app.com/ai-assistant
Type a math problem: "If 5 apples cost $2.50, how much do 12 apples cost?"
Click Generate
Verify response appears within 15 seconds
Verify response contains the answer "$6.00" or "6"
Verify reasoning panel (if visible) contains calculation steps
Submit a logic puzzle
Verify reasoning chain is displayed before the answer

HelpMeTest continuously monitors your AI feature for response timeouts, empty responses, and quality regressions that unit tests can't catch.

Summary

Testing DeepSeek R1 effectively requires:

Response parser tests — thinking block extraction, edge cases, malformed output
Reasoning chain verification — logical structure, key concept coverage, step detection
CoT quality evals — scoring frameworks for reasoning quality measurement
Local vs API client tests — MSW for both Ollama and DeepSeek API endpoints
Regression test suite — prevent quality degradation across model version updates
E2E monitoring — HelpMeTest for production reasoning feature reliability
