Testing DeepSeek R1: Reasoning Chain Verification and Chain-of-Thought Evals
DeepSeek R1 is an open-weight reasoning model that exposes its chain-of-thought reasoning as a <think> block before the final answer. This transparency makes it unusually testable: you can evaluate not just whether the answer is correct, but whether the reasoning process is sound. This guide covers testing strategies for DeepSeek R1 in both local and API deployments.
Understanding DeepSeek R1's Output Format
DeepSeek R1 outputs reasoning in a specific format:
<think>
Let me analyze this step by step.
First, I'll consider...
Then I'll evaluate...
Therefore, the answer must be...
</think>
The final answer is X because Y.
Your integration code needs to parse this format, and your tests need to validate both the thinking process and the final answer.
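The format above is the well-formed case. As a hedged sketch, the helper below classifies the shapes a raw response might take before parsing, so tests can assert on truncated or repeated blocks explicitly. The name `classifyThinkFormat` and its category labels are illustrative assumptions, not part of any DeepSeek tooling, and the unclosed case assumes generation can stop (for example at a token limit) before `</think>` is emitted.

```typescript
// Hypothetical helper: the name and categories are assumptions for illustration.
export type ThinkFormat = 'well_formed' | 'no_think' | 'unclosed' | 'stray_close' | 'multiple';

export function classifyThinkFormat(raw: string): ThinkFormat {
  const opens = (raw.match(/<think>/g) ?? []).length;
  const closes = (raw.match(/<\/think>/g) ?? []).length;

  if (opens === 0 && closes === 0) return 'no_think';
  if (opens > closes) return 'unclosed';     // e.g. output cut off mid-reasoning
  if (closes > opens) return 'stray_close';  // closing tag without a matching opener
  if (opens > 1) return 'multiple';          // more than one complete block
  // Exactly one opener and one closer: well formed only if the opener comes first
  return raw.indexOf('<think>') < raw.indexOf('</think>') ? 'well_formed' : 'stray_close';
}
```

A parser test can then branch on the category, for instance treating `unclosed` as a failed generation rather than as an answer with no reasoning.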
Parsing DeepSeek R1 Responses
// lib/deepseek-parser.ts
export interface DeepSeekResponse {
  thinking: string | null;
  answer: string;
  raw: string;
}

export function parseDeepSeekResponse(raw: string): DeepSeekResponse {
  const thinkMatch = raw.match(/<think>([\s\S]*?)<\/think>/);
  const thinking = thinkMatch ? thinkMatch[1].trim() : null;
  const answer = raw
    .replace(/<think>[\s\S]*?<\/think>/g, '')
    .trim();
  return { thinking, answer, raw };
}
Test the parser thoroughly:
import { parseDeepSeekResponse } from '../lib/deepseek-parser';
describe('DeepSeek R1 response parser', () => {
test('extracts thinking block correctly', () => {
const raw = `<think>
I need to calculate the sum.
First: 2 + 3 = 5
Then: 5 + 7 = 12
</think>
The answer is 12.`;
const result = parseDeepSeekResponse(raw);
expect(result.thinking).toContain('I need to calculate the sum.');
expect(result.thinking).toContain('5 + 7 = 12');
expect(result.answer).toBe('The answer is 12.');
});
test('handles responses without thinking block', () => {
const raw = 'The capital of France is Paris.';
const result = parseDeepSeekResponse(raw);
expect(result.thinking).toBeNull();
expect(result.answer).toBe('The capital of France is Paris.');
});
test('handles multi-paragraph answers after thinking', () => {
const raw = `<think>
Analysis complete.
</think>
First paragraph of the answer.
Second paragraph with more details.`;
const result = parseDeepSeekResponse(raw);
expect(result.answer).toContain('First paragraph');
expect(result.answer).toContain('Second paragraph');
expect(result.thinking).toBe('Analysis complete.');
});
});
Reasoning Chain Verification
The unique value of DeepSeek R1 is that you can verify the reasoning process, not just the output:
export interface ReasoningEval {
  containsSteps: boolean;
  mentionsKeyConcepts: string[];
  hasLogicalFlow: boolean;
  wordCount: number;
  avoidsFallacies: boolean;
}

export function evaluateReasoning(thinking: string, expectedConcepts: string[]): ReasoningEval {
  const lowerThinking = thinking.toLowerCase();
  const mentionsKeyConcepts = expectedConcepts.filter(concept =>
    lowerThinking.includes(concept.toLowerCase())
  );
  // Check for logical connectors that indicate structured reasoning
  const logicalConnectors = ['therefore', 'thus', 'because', 'since', 'first', 'then', 'finally'];
  const hasLogicalFlow = logicalConnectors.some(c => lowerThinking.includes(c));
  // Check for step markers: numbered items at the start of a line, "step N", or ordinals.
  // Anchoring the list pattern avoids treating a sentence-ending number such as "7." as a step.
  const containsSteps = /^\s*\d+[.)]\s|\bstep \d|\b(?:first|second|third)\b/im.test(thinking);
  // Flag absolute or overconfident wording (a crude proxy, not real fallacy detection)
  const overconfidentMarkers = ['always', 'never', 'everyone knows', 'obviously'];
  const avoidsFallacies = !overconfidentMarkers.some(m => lowerThinking.includes(m));
  // Count words, ignoring leading and trailing whitespace
  const wordCount = thinking.trim().split(/\s+/).filter(Boolean).length;
  return {
    containsSteps,
    mentionsKeyConcepts,
    hasLogicalFlow,
    wordCount,
    avoidsFallacies,
  };
}
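One caveat with the substring checks above: `includes('then')` also matches inside "strengthen", and `includes('since')` matches inside "sincerely". A minimal sketch of whole-word matching follows. `findWholeWords` and `escapeRegExp` are hypothetical helper names, not part of the evaluator above.

```typescript
// Hypothetical helpers: a sketch of whole-word matching, not the guide's evaluator.
function escapeRegExp(text: string): string {
  // Escape characters that have special meaning inside a regular expression
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

export function findWholeWords(text: string, words: string[]): string[] {
  return words.filter(word => {
    // \b only anchors next to word characters, which suits plain words like "then"
    const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, 'i');
    return pattern.test(text);
  });
}

const connectors = ['therefore', 'thus', 'because', 'since', 'first', 'then', 'finally'];
const found = findWholeWords('Exercise will strengthen the heart, sincerely.', connectors);
// found is empty here, whereas the substring check would report "then" and "since"
```

Whether the stricter match is worth it depends on how noisy the reasoning text is; for short math chains the difference is usually small.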
describe('Reasoning chain evaluation', () => {
  test('evaluates math problem reasoning correctly', () => {
    const thinking = `
First, I need to identify the equation: x + 5 = 12.
Then, I'll solve for x by subtracting 5 from both sides.
x = 12 - 5 = 7.
Therefore, x = 7.
`;
    // "eval" cannot be used as a variable name in strict mode (all modules), so use "result"
    const result = evaluateReasoning(thinking, ['equation', 'solve', 'subtract']);
    expect(result.containsSteps).toBe(true);
    expect(result.hasLogicalFlow).toBe(true);
    expect(result.mentionsKeyConcepts).toContain('solve');
    expect(result.mentionsKeyConcepts).toContain('subtract');
    expect(result.wordCount).toBeGreaterThan(20);
  });

  test('flags reasoning that skips steps', () => {
    const thinking = `Obviously the answer is 7.`;
    const result = evaluateReasoning(thinking, ['equation', 'solve']);
    expect(result.containsSteps).toBe(false);
    expect(result.avoidsFallacies).toBe(false); // "obviously" is treated as an overconfidence marker
    expect(result.mentionsKeyConcepts).toHaveLength(0);
  });
});
Chain-of-Thought Quality Evals
Build a systematic eval framework for measuring CoT quality:
export interface CoTEval {
questionId: string;
question: string;
expectedAnswer: string;
modelAnswer: string;
thinking: string | null;
answerCorrect: boolean;
reasoningScore: number; // 0-100
thinkingTokens: number; // approximated by whitespace-separated word count, not real tokenizer output
}
export async function runCoTEval(
testCases: { id: string; question: string; expectedAnswer: string; concepts: string[] }[],
modelFn: (question: string) => Promise<{ thinking: string | null; answer: string }>,
options = { scoringFn: defaultReasoningScorer }
): Promise<CoTEval[]> {
return Promise.all(testCases.map(async (testCase) => {
const result = await modelFn(testCase.question);
const answerCorrect = result.answer
.toLowerCase()
.includes(testCase.expectedAnswer.toLowerCase());
const reasoningScore = result.thinking
? options.scoringFn(result.thinking, testCase.concepts)
: 0;
return {
questionId: testCase.id,
question: testCase.question,
expectedAnswer: testCase.expectedAnswer,
modelAnswer: result.answer,
thinking: result.thinking,
answerCorrect,
reasoningScore,
thinkingTokens: result.thinking?.trim().split(/\s+/).filter(Boolean).length ?? 0, // word count as a rough proxy
};
}));
}
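The substring comparison in `runCoTEval` above is lenient: an expected answer of `30` is also found inside `$130` or `300`. As one possible tightening for numeric answers, the sketch below extracts numbers from the model's answer and compares them by value. `extractNumbers` and `answerMatchesNumber` are hypothetical helpers, and the tolerance is an assumed default.

```typescript
// Hypothetical helpers: compare numeric answers by value instead of by substring.
export function extractNumbers(text: string): number[] {
  // Drop thousands separators (1,200 -> 1200), then pick out integers and decimals
  const cleaned = text.replace(/(\d),(?=\d{3}\b)/g, '$1');
  const matches = cleaned.match(/-?\d+(?:\.\d+)?/g) ?? [];
  return matches.map(Number);
}

export function answerMatchesNumber(answer: string, expected: string, tolerance = 1e-9): boolean {
  if (expected.trim() === '') return false;  // nothing to compare against
  const target = Number(expected);
  if (Number.isNaN(target)) return false;    // expected answer is not numeric
  return extractNumbers(answer).some(n => Math.abs(n - target) <= tolerance);
}
```

A function like this could be passed in through the options object alongside `scoringFn`, falling back to the substring check for non-numeric test cases.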
function defaultReasoningScorer(thinking: string, concepts: string[]): number {
  let score = 0;
  const lowerThinking = thinking.toLowerCase();
  // Score for mentioning key concepts (skipped when no concepts are given, to avoid 0 / 0 = NaN)
  if (concepts.length > 0) {
    const conceptsFound = concepts.filter(c => lowerThinking.includes(c.toLowerCase())).length;
    score += (conceptsFound / concepts.length) * 40;
  }
  // Score for logical flow
  const logicalConnectors = ['therefore', 'thus', 'because', 'first', 'then'];
  const connectorsFound = logicalConnectors.filter(c => lowerThinking.includes(c)).length;
  score += Math.min(connectorsFound * 10, 30);
  // Score for length (up to a point)
  const wordCount = thinking.trim().split(/\s+/).filter(Boolean).length;
  score += Math.min(wordCount / 10, 30);
  return Math.min(score, 100);
}
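To see how the three weighted terms combine, here is a worked example that recomputes the same formula with the components kept separate. `scoreBreakdown` is a hypothetical helper that mirrors the weights in `defaultReasoningScorer` above (40 for concepts, 30 for connectors, 30 for length). It is a sketch for inspecting scores, not a replacement scorer.

```typescript
// Hypothetical helper mirroring defaultReasoningScorer's weights, with the parts exposed.
export function scoreBreakdown(thinking: string, concepts: string[]) {
  const lower = thinking.toLowerCase();
  const conceptsFound = concepts.filter(c => lower.includes(c.toLowerCase())).length;
  const conceptScore = concepts.length > 0 ? (conceptsFound / concepts.length) * 40 : 0;

  const connectors = ['therefore', 'thus', 'because', 'first', 'then'];
  const connectorScore = Math.min(connectors.filter(c => lower.includes(c)).length * 10, 30);

  const wordCount = thinking.trim().split(/\s+/).filter(Boolean).length;
  const lengthScore = Math.min(wordCount / 10, 30);

  const total = Math.min(conceptScore + connectorScore + lengthScore, 100);
  return { conceptScore, connectorScore, lengthScore, total };
}

const parts = scoreBreakdown(
  'First, I recall distance = speed × time. Speed = 60mph, time = 2.5h. Therefore distance = 60 × 2.5 = 150 miles.',
  ['distance', 'speed', 'time', 'multiply', '60', '2.5'],
);
// conceptScore ≈ 33.3 (5 of 6 concepts), connectorScore = 20, lengthScore = 2.3, total ≈ 55.6
```

The length term needs 300 words to reach its 30-point cap, so a short but correct chain tops out well below 100 under these weights. Thresholds in tests should be set with that in mind.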
// Usage in tests
const MATH_TEST_CASES = [
{
id: 'math-001',
question: 'If a car travels at 60mph for 2.5 hours, how far does it travel?',
expectedAnswer: '150',
concepts: ['distance', 'speed', 'time', 'multiply', '60', '2.5'],
},
{
id: 'math-002',
question: 'A shirt costs $40. After a 25% discount, what is the new price?',
expectedAnswer: '30',
concepts: ['discount', 'percent', 'subtract', '40', '25'],
},
];
test('mocked DeepSeek R1 responses clear the baseline reasoning score on math problems', async () => {
const mockModel = jest.fn().mockImplementation(async (question: string) => {
// Simulate varying quality responses
if (question.includes('60mph')) {
return {
thinking: 'First, I recall distance = speed × time. Speed = 60mph, time = 2.5h. Therefore distance = 60 × 2.5 = 150 miles.',
answer: '150 miles',
};
}
return {
thinking: 'Discount amount = 40 × 25% = $10. New price = 40 - 10 = $30.',
answer: '$30',
};
});
const evals = await runCoTEval(MATH_TEST_CASES, mockModel);
const avgReasoningScore = evals.reduce((sum, e) => sum + e.reasoningScore, 0) / evals.length;
const correctAnswers = evals.filter(e => e.answerCorrect).length;
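// With defaultReasoningScorer the two mocked chains score about 55.6 and 25.6 (mean about 40.6),
// so a threshold of 60 would fail; 35 is a baseline for these short mocked chains
expect(avgReasoningScore).toBeGreaterThan(35);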
expect(correctAnswers).toBe(MATH_TEST_CASES.length);
});
Testing Local vs API Deployment
DeepSeek R1 is open-weight and can run locally with Ollama. Test both deployment modes:
// lib/deepseek-client.ts
export interface DeepSeekClient {
  generate(prompt: string, options?: { temperature?: number; maxTokens?: number }): Promise<string>;
}

export class OllamaDeepSeekClient implements DeepSeekClient {
  constructor(private baseUrl: string = 'http://localhost:11434') {}

  async generate(prompt: string, options: { temperature?: number; maxTokens?: number } = {}): Promise<string> {
    const response = await fetch(`${this.baseUrl}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({
        model: 'deepseek-r1:8b',
        prompt,
        stream: false,
        // Ollama names the output-length limit num_predict rather than maxTokens
        options: {
          temperature: options.temperature ?? 0,
          ...(options.maxTokens !== undefined ? { num_predict: options.maxTokens } : {}),
        },
      }),
    });
    if (!response.ok) {
      throw new Error(`Ollama request failed with status ${response.status}`);
    }
    const data = await response.json();
    return data.response;
  }
}
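export class DeepSeekAPIClient implements DeepSeekClient {
  constructor(private apiKey: string) {}

  async generate(prompt: string, options: { temperature?: number; maxTokens?: number } = {}): Promise<string> {
    const response = await fetch('https://api.deepseek.com/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: 'deepseek-reasoner',
        messages: [{ role: 'user', content: prompt }],
        // The API uses snake_case parameter names, so map maxTokens to max_tokens
        ...(options.temperature !== undefined ? { temperature: options.temperature } : {}),
        ...(options.maxTokens !== undefined ? { max_tokens: options.maxTokens } : {}),
      }),
    });
    if (!response.ok) {
      throw new Error(`DeepSeek API request failed with status ${response.status}`);
    }
    const data = await response.json();
    const message = data.choices[0].message;
    // The hosted API returns the chain of thought in reasoning_content, separate from content.
    // Re-wrap it in <think> tags so parseDeepSeekResponse handles both deployments the same way.
    if (message.reasoning_content && !message.content.includes('<think>')) {
      return `<think>\n${message.reasoning_content}\n</think>\n\n${message.content}`;
    }
    return message.content;
  }
}
Test client behavior independently with MSW:
import { setupServer } from 'msw/node';
import { http, HttpResponse } from 'msw';
import { OllamaDeepSeekClient, DeepSeekAPIClient } from '../lib/deepseek-client';
import { parseDeepSeekResponse } from '../lib/deepseek-parser';

const server = setupServer(
  http.post('http://localhost:11434/api/generate', () => {
    return HttpResponse.json({
      response: '<think>\nCalculating...\n</think>\n\nThe answer is 42.',
      done: true,
    });
  }),
  http.post('https://api.deepseek.com/chat/completions', () => {
    return HttpResponse.json({
      choices: [{
        message: {
          role: 'assistant',
          // content carries only the final answer; the reasoning arrives in reasoning_content
          content: '42.',
          reasoning_content: 'Reasoning through...',
        },
      }],
      usage: { prompt_tokens: 50, completion_tokens: 100, total_tokens: 150 },
    });
  })
);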
beforeAll(() => server.listen());
afterEach(() => server.resetHandlers());
afterAll(() => server.close());

// These tests check the text each client returns; they do not inspect the outgoing request body
test('Ollama client returns the generated text', async () => {
  const client = new OllamaDeepSeekClient();
  const result = await client.generate('What is 6 × 7?');
  expect(result).toContain('42');
});

test('API client returns the generated text', async () => {
  const client = new DeepSeekAPIClient('test-api-key');
  const result = await client.generate('What is 6 × 7?');
  expect(result).toContain('42');
});

test('both clients produce parseable CoT output', async () => {
  const ollamaClient = new OllamaDeepSeekClient();
  const apiClient = new DeepSeekAPIClient('test-key');
  const ollamaResult = parseDeepSeekResponse(await ollamaClient.generate('Test'));
  const apiResult = parseDeepSeekResponse(await apiClient.generate('Test'));
  expect(ollamaResult.thinking).toBeTruthy();
  expect(apiResult.thinking).toBeTruthy();
});
Building a Regression Test Suite
Prevent answer quality regressions when switching model versions:
// eval/deepseek-regression.eval.ts
const REGRESSION_TEST_CASES = [
{
id: 'logic-001',
prompt: 'All roses are flowers. Some flowers fade quickly. Can we conclude all roses fade quickly?',
expectedInAnswer: ['no', 'cannot'],
expectedInReasoning: ['some', 'all', 'logic'],
},
{
id: 'code-001',
prompt: 'Write a TypeScript function that returns the nth Fibonacci number.',
expectedInAnswer: ['function', 'fibonacci', 'return'],
expectedInReasoning: ['base case', 'recursive'],
},
];
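The test that follows calls `createDeepSeekMockForRegressionTest`, which this guide does not define. One hypothetical shape for it is a lookup of recorded raw responses keyed by test id. The fixture strings below are invented placeholders for illustration and are not real DeepSeek R1 output.

```typescript
// Hypothetical helper: returns a mock model function backed by recorded fixtures.
// The fixture text is made up for illustration; replace it with captured responses.
const REGRESSION_FIXTURES: Record<string, string> = {
  'logic-001':
    '<think>\nSome flowers fade quickly, but that says nothing about all roses.\n</think>\n\n' +
    'No, we cannot conclude that.',
  'code-001':
    '<think>\nI need a base case for n < 2, then a recursive step.\n</think>\n\n' +
    'function fibonacci(n: number): number {\n' +
    '  return n < 2 ? n : fibonacci(n - 1) + fibonacci(n - 2);\n' +
    '}',
};

export function createDeepSeekMockForRegressionTest(id: string): (prompt: string) => Promise<string> {
  const fixture = REGRESSION_FIXTURES[id];
  if (fixture === undefined) {
    throw new Error(`No recorded fixture for regression case "${id}"`);
  }
  // The prompt is ignored: the mock replays the recorded response for this id
  return async (_prompt: string) => fixture;
}
```

Replaying fixtures keeps the suite deterministic. Re-record them against the new model version when you want the suite to detect a change in behavior rather than a change in the parser.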
test.each(REGRESSION_TEST_CASES)(
  'regression: $id maintains expected behavior',
  async ({ id, prompt, expectedInAnswer, expectedInReasoning }) => {
    // createDeepSeekMockForRegressionTest is a helper you supply; it should return a function
    // that resolves to the recorded raw model output for the given case id
    const mockModel = createDeepSeekMockForRegressionTest(id);
    const result = parseDeepSeekResponse(await mockModel(prompt));
    const answerLower = result.answer.toLowerCase();
    for (const expected of expectedInAnswer) {
      expect(answerLower).toContain(expected);
    }
    if (expectedInReasoning.length > 0) {
      // Fail loudly when the thinking block is missing instead of silently skipping the check
      expect(result.thinking).not.toBeNull();
      const thinkingLower = (result.thinking ?? '').toLowerCase();
      const found = expectedInReasoning.filter(e => thinkingLower.includes(e));
      expect(found.length).toBeGreaterThanOrEqual(1);
    }
  }
);
E2E Testing with HelpMeTest
AI reasoning features need monitoring in production to catch quality regressions:
Navigate to https://your-app.com/ai-assistant
Type a math problem: "If 5 apples cost $2.50, how much do 12 apples cost?"
Click Generate
Verify response appears within 15 seconds
Verify response contains the answer "$6.00" or "6"
Verify reasoning panel (if visible) contains calculation steps
Submit a logic puzzle
Verify reasoning chain is displayed before the answer
HelpMeTest continuously monitors your AI feature for response timeouts, empty responses, and quality regressions that unit tests can't catch.
Summary
Testing DeepSeek R1 effectively requires:
- Response parser tests — thinking block extraction, edge cases, malformed output
- Reasoning chain verification — logical structure, key concept coverage, step detection
- CoT quality evals — scoring frameworks for reasoning quality measurement
- Local vs API client tests — MSW for both Ollama and DeepSeek API endpoints
- Regression test suite — prevent quality degradation across model version updates
- E2E monitoring — HelpMeTest for production reasoning feature reliability