Testing with Gemini 2.5 Pro: Multimodal Evals, Thinking Mode, and Grounding
Gemini 2.5 Pro is one of Google's most capable models, with native multimodal input (images, audio, video, documents), a 1-million-token context window, and a "thinking" mode with a configurable token budget for complex reasoning. Testing Gemini 2.5 integrations requires eval patterns that account for these capabilities. This guide covers how to test them reliably.
What Makes Gemini 2.5 Testing Different
Gemini 2.5 Pro introduces capabilities that require new testing approaches:
- Thinking mode — the model can reason step-by-step before answering, and you can access the thinking tokens separately
- Multimodal inputs — images, documents, audio, and video all need test coverage
- Grounding with Google Search — responses can include search-backed citations
- Cost assertions — thinking tokens are counted separately in the usage metadata, add to the bill, and can be expensive
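As a rough illustration of the last point, a cost figure can be derived from the token counts a response reports. The sketch below assumes the `usageMetadata` field names used in this guide's mocks. `estimateCostUsd`, `TokenUsage`, and `TokenRates` are hypothetical names, and the rates in the example are placeholders rather than real prices.

```typescript
// Token counts as they appear in the usageMetadata objects used in this guide.
interface TokenUsage {
  promptTokenCount: number;
  candidatesTokenCount: number;
  thoughtsTokenCount?: number; // absent when no thinking tokens were reported
}

// Per-token prices in USD. Caller-supplied placeholders, not real prices.
interface TokenRates {
  input: number;
  output: number;
  thinking: number;
}

// Hypothetical helper: estimate the cost of one call from its usage metadata.
function estimateCostUsd(usage: TokenUsage, rates: TokenRates): number {
  const thoughts = usage.thoughtsTokenCount ?? 0;
  return (
    usage.promptTokenCount * rates.input +
    usage.candidatesTokenCount * rates.output +
    thoughts * rates.thinking
  );
}

// Example with made-up rates.
const exampleRates: TokenRates = { input: 0.000001, output: 0.000002, thinking: 0.000002 };
const exampleCost = estimateCostUsd(
  { promptTokenCount: 50, candidatesTokenCount: 100, thoughtsTokenCount: 300 },
  exampleRates,
);
console.log(exampleCost.toFixed(6));
```

Passing the rates in as data keeps the assertion independent of any particular price list; check current pricing before putting real numbers here.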
Mocking the Gemini SDK
Use Google's @google/generative-ai SDK with Jest mocks:
import { GoogleGenerativeAI } from '@google/generative-ai';
jest.mock('@google/generative-ai');
const mockGenerateContent = jest.fn();
(GoogleGenerativeAI as jest.Mock).mockImplementation(() => ({
getGenerativeModel: jest.fn(() => ({
generateContent: mockGenerateContent,
startChat: jest.fn(() => ({
sendMessage: jest.fn(),
})),
})),
}));
// Helper to create a mock Gemini response
function createMockGeminiResponse(text: string, overrides = {}) {
return {
response: {
text: () => text,
candidates: [{
content: {
parts: [{ text }],
role: 'model',
},
finishReason: 'STOP',
safetyRatings: [],
...overrides,
}],
usageMetadata: {
promptTokenCount: 100,
candidatesTokenCount: 200,
totalTokenCount: 300,
thoughtsTokenCount: 0,
},
},
};
}
Testing Basic Gemini Integration
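The tests in this section import a `GeminiService` that is not shown. A minimal sketch of a service that would satisfy them follows. It is an assumption about the shape of that class, not its actual source, and the `ModelLike` and `ClientLike` interfaces are local stand-ins for the SDK types so the example stays self-contained.

```typescript
// Minimal structural types covering only what this sketch touches.
// They mirror the mock shapes in this guide, not the full SDK typings.
interface TextPart { text: string }
interface UserContent { parts: TextPart[]; role: 'user' }
interface GenerateResult { response: { text(): string } }
interface ModelLike {
  generateContent(request: { contents: UserContent[] }): Promise<GenerateResult>;
}
interface ClientLike {
  getGenerativeModel(params: { model: string }): ModelLike;
}

// Hypothetical service under test; the real one may differ.
class GeminiService {
  private readonly model: ModelLike;

  constructor(client: ClientLike, modelName = 'gemini-2.5-pro') {
    // Resolve the model once so every call reuses the same handle.
    this.model = client.getGenerativeModel({ model: modelName });
  }

  // Wrap the prompt in a single user turn and return only the text.
  async generateResponse(prompt: string): Promise<{ text: string }> {
    const result = await this.model.generateContent({
      contents: [{ parts: [{ text: prompt }], role: 'user' }],
    });
    return { text: result.response.text() };
  }
}
```

Injecting the client through the constructor is what lets the tests swap in the Jest mock without touching the network.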
import { GeminiService } from '../services/gemini-service';
describe('GeminiService', () => {
beforeEach(() => {
mockGenerateContent.mockResolvedValue(
createMockGeminiResponse('This is a test response.')
);
});
test('sends prompt to Gemini 2.5 Pro model', async () => {
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
await service.generateResponse('Explain quantum computing');
expect(mockGenerateContent).toHaveBeenCalledWith(
expect.objectContaining({
contents: [{ parts: [{ text: 'Explain quantum computing' }], role: 'user' }],
})
);
});
test('returns parsed text from response', async () => {
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
const result = await service.generateResponse('What is 2+2?');
expect(result.text).toBe('This is a test response.');
});
test('uses gemini-2.5-pro model by default', () => {
const genAI = new GoogleGenerativeAI('test-key');
new GeminiService(genAI);
expect(genAI.getGenerativeModel).toHaveBeenCalledWith(
expect.objectContaining({ model: 'gemini-2.5-pro' })
);
});
});
Testing Thinking Mode
Gemini 2.5's thinking mode generates reasoning tokens before the final response:
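One way a service might separate reasoning from the final answer is sketched here, under the assumption that reasoning arrives as parts flagged `thought: true` (the shape the mock in this section uses). `splitThinking` and `ResponsePart` are hypothetical names.

```typescript
// A response part as it appears in the mocks in this section.
interface ResponsePart {
  text?: string;
  thought?: boolean; // true for reasoning parts
}

// Hypothetical helper: separate reasoning text from the final answer text.
function splitThinking(parts: ResponsePart[]): { thinking: string; answer: string } {
  const thinking: string[] = [];
  const answer: string[] = [];
  for (const part of parts) {
    if (typeof part.text !== 'string') continue; // skip non-text parts
    (part.thought === true ? thinking : answer).push(part.text);
  }
  return { thinking: thinking.join('\n'), answer: answer.join('') };
}

const split = splitThinking([
  { thought: true, text: 'First, I should consider X.' },
  { text: 'The conclusion is Z.' },
]);
console.log(split.answer);
```

The mock helper and tests that follow exercise this separation.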
function createMockThinkingResponse(thinking: string, answer: string) {
return {
response: {
text: () => answer,
candidates: [{
content: {
parts: [
{ thought: true, text: thinking },
{ text: answer },
],
role: 'model',
},
finishReason: 'STOP',
safetyRatings: [],
}],
usageMetadata: {
promptTokenCount: 50,
candidatesTokenCount: 100,
totalTokenCount: 450,
thoughtsTokenCount: 300, // Thinking tokens
},
},
};
}
describe('Thinking mode', () => {
test('enables thinking mode with budget tokens', async () => {
mockGenerateContent.mockResolvedValue(
createMockThinkingResponse(
'Let me think about this step by step...',
'The answer is 42.'
)
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
await service.generateWithThinking('Solve this complex problem', {
thinkingBudget: 1024,
});
expect(mockGenerateContent).toHaveBeenCalledWith(
expect.objectContaining({
generationConfig: expect.objectContaining({
thinkingConfig: { thinkingBudget: 1024 },
}),
})
);
});
test('extracts thinking tokens separately from answer', async () => {
mockGenerateContent.mockResolvedValue(
createMockThinkingResponse(
'First, I should consider X. Then Y. Therefore...',
'The conclusion is Z.'
)
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
const result = await service.generateWithThinking('Complex question');
expect(result.thinking).toContain('First, I should consider X');
expect(result.answer).toBe('The conclusion is Z.');
});
test('tracks thinking token cost separately', async () => {
mockGenerateContent.mockResolvedValue(
createMockThinkingResponse('Lengthy reasoning...', 'Answer.')
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
const result = await service.generateWithThinking('Complex problem', {
thinkingBudget: 2048,
});
expect(result.usage.thoughtsTokenCount).toBe(300);
expect(result.usage.totalTokenCount).toBe(450);
// Estimate cost with placeholder per-token rates (not real prices).
// With mocked usage this only guards the cost arithmetic, not live pricing.
const estimatedCost = result.usage.thoughtsTokenCount * 0.000003 +
result.usage.candidatesTokenCount * 0.000001;
expect(estimatedCost).toBeLessThan(0.01); // Sanity check
});
test('falls back to standard mode when thinking budget is 0', async () => {
mockGenerateContent.mockResolvedValue(
createMockGeminiResponse('Direct answer without thinking.')
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
await service.generateWithThinking('Simple question', { thinkingBudget: 0 });
expect(mockGenerateContent).toHaveBeenCalledWith(
expect.not.objectContaining({
generationConfig: expect.objectContaining({
thinkingConfig: { thinkingBudget: expect.any(Number) },
}),
})
);
});
});
Testing Multimodal Inputs
Gemini 2.5 Pro accepts images, documents, audio, and video. Test that your service constructs the correct multipart requests:
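For reference, a request builder that would produce the shape these tests assert might look like the following sketch. `buildMultimodalRequest` and `BinaryInput` are hypothetical names. The `buffer` field is typed structurally so the example does not depend on Node typings, and a Node `Buffer` satisfies it.

```typescript
// Request part shapes matching the assertions in this section.
type InlineDataPart = { inlineData: { data: string; mimeType: string } };
type PromptPart = { text: string };
type RequestPart = InlineDataPart | PromptPart;

interface BinaryInput {
  // Anything that can render itself as base64, such as a Node Buffer.
  buffer: { toString(encoding: 'base64'): string };
  mimeType: string;
}

// Hypothetical helper: base64-encode each file in order, then append the prompt last.
function buildMultimodalRequest(files: BinaryInput[], prompt: string) {
  const parts: RequestPart[] = files.map((file) => ({
    inlineData: {
      data: file.buffer.toString('base64'),
      mimeType: file.mimeType,
    },
  }));
  parts.push({ text: prompt });
  return { contents: [{ parts, role: 'user' as const }] };
}
```

Keeping the files ahead of the text part matches the ordering the assertions in this section check.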
import { readFileSync } from 'fs';
import { join } from 'path';
describe('Multimodal inputs', () => {
test('sends image analysis request with correct MIME type', async () => {
mockGenerateContent.mockResolvedValue(
createMockGeminiResponse('I see a red car in the image.')
);
const imageBuffer = Buffer.from('fake-image-data');
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
await service.analyzeImage(imageBuffer, 'What is in this image?', 'image/jpeg');
expect(mockGenerateContent).toHaveBeenCalledWith({
contents: [{
parts: [
{
inlineData: {
data: imageBuffer.toString('base64'),
mimeType: 'image/jpeg',
},
},
{ text: 'What is in this image?' },
],
role: 'user',
}],
});
});
test('analyzes PDF document with correct content parts', async () => {
mockGenerateContent.mockResolvedValue(
createMockGeminiResponse('The document discusses quarterly revenue...')
);
const pdfBuffer = Buffer.from('fake-pdf-data');
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
const result = await service.analyzeDocument(
pdfBuffer,
'Summarize the key financial metrics',
'application/pdf'
);
expect(mockGenerateContent).toHaveBeenCalledWith({
contents: [{
parts: [
{
inlineData: {
data: pdfBuffer.toString('base64'),
mimeType: 'application/pdf',
},
},
{ text: 'Summarize the key financial metrics' },
],
role: 'user',
}],
});
expect(result.text).toContain('quarterly revenue');
});
test('handles multiple images in single request', async () => {
mockGenerateContent.mockResolvedValue(
createMockGeminiResponse('Image 1 shows X, Image 2 shows Y.')
);
const images = [
{ buffer: Buffer.from('img1'), mimeType: 'image/jpeg' as const },
{ buffer: Buffer.from('img2'), mimeType: 'image/png' as const },
];
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
await service.compareImages(images, 'What differences do you see?');
const call = mockGenerateContent.mock.calls[0][0];
const parts = call.contents[0].parts;
expect(parts).toHaveLength(3); // 2 images + 1 text
expect(parts[0].inlineData.mimeType).toBe('image/jpeg');
expect(parts[1].inlineData.mimeType).toBe('image/png');
expect(parts[2].text).toBe('What differences do you see?');
});
});
Testing Grounding with Google Search
Grounded responses include search citations. Test that your service uses grounding correctly:
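Before the tests, here is a sketch of how citations could be pulled out of the grounding metadata. It assumes the metadata shape used by the mock in this section, and that each entry in `confidenceScores` lines up with the entry at the same position in `groundingChunkIndices`. `extractCitations`, `requireCitations`, and `Citation` are hypothetical names.

```typescript
// Subset of groundingMetadata, matching the mock in this section.
interface GroundingChunk { web?: { uri: string; title: string } }
interface GroundingSupport {
  groundingChunkIndices: number[];
  confidenceScores: number[];
}
interface GroundingMetadata {
  groundingChunks?: GroundingChunk[];
  groundingSupports?: GroundingSupport[];
}
interface Citation { uri: string; title: string; confidenceScore?: number }

// Hypothetical helper: one citation per web chunk, scored by its best support.
function extractCitations(metadata?: GroundingMetadata): Citation[] {
  const chunks = metadata?.groundingChunks ?? [];
  const supports = metadata?.groundingSupports ?? [];
  const best = new Map<number, number>(); // chunk index -> highest score seen
  for (const support of supports) {
    support.groundingChunkIndices.forEach((chunkIndex, i) => {
      const score = support.confidenceScores[i];
      if (score === undefined) return;
      best.set(chunkIndex, Math.max(best.get(chunkIndex) ?? 0, score));
    });
  }
  const citations: Citation[] = [];
  chunks.forEach((chunk, index) => {
    if (!chunk.web) return; // ignore chunks without a web source
    citations.push({ uri: chunk.web.uri, title: chunk.web.title, confidenceScore: best.get(index) });
  });
  return citations;
}

// Hypothetical guard for callers that insist on a grounded answer.
function requireCitations(citations: Citation[]): Citation[] {
  if (citations.length === 0) {
    throw new Error('Grounded response required but not returned');
  }
  return citations;
}
```

Returning an empty array for missing metadata, and throwing only in the separate guard, keeps ungrounded responses usable when grounding is optional.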
function createMockGroundedResponse(text: string) {
return {
response: {
text: () => text,
candidates: [{
content: { parts: [{ text }], role: 'model' },
finishReason: 'STOP',
groundingMetadata: {
searchEntryPoint: {
renderedContent: '<html>...</html>',
},
groundingChunks: [
{
web: {
uri: 'https://example.com/article',
title: 'Relevant Article',
},
},
],
groundingSupports: [
{
groundingChunkIndices: [0],
confidenceScores: [0.95],
segment: { startIndex: 0, endIndex: 50, text: text.slice(0, 50) },
},
],
},
safetyRatings: [],
}],
usageMetadata: { promptTokenCount: 100, candidatesTokenCount: 150, totalTokenCount: 250 },
},
};
}
describe('Grounding with Google Search', () => {
test('enables grounding tool in the request', async () => {
mockGenerateContent.mockResolvedValue(
createMockGroundedResponse('According to recent reports, the market grew by 15%.')
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
await service.generateWithGrounding('What is the current state of the AI market?');
expect(mockGenerateContent).toHaveBeenCalledWith(
expect.objectContaining({
tools: [{ googleSearch: {} }],
})
);
});
test('extracts citation sources from grounded response', async () => {
mockGenerateContent.mockResolvedValue(
createMockGroundedResponse('The population of Earth is approximately 8 billion.')
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
const result = await service.generateWithGrounding('What is the world population?');
expect(result.citations).toHaveLength(1);
expect(result.citations[0].uri).toBe('https://example.com/article');
expect(result.citations[0].title).toBe('Relevant Article');
expect(result.citations[0].confidenceScore).toBe(0.95);
});
test('raises error when response lacks grounding for factual queries', async () => {
mockGenerateContent.mockResolvedValue(
createMockGeminiResponse('The answer might be...') // No grounding metadata
);
const service = new GeminiService(new GoogleGenerativeAI('test-key'));
// Your service should enforce grounding for factual queries
await expect(
service.generateWithGrounding('What is the current stock price of GOOG?', {
requireGrounding: true,
})
).rejects.toThrow('Grounded response required but not returned');
});
});
Integration Evals with Real API
For critical AI features, run integration evals against the real Gemini API:
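// eval/gemini-integration.eval.ts
// Run these separately from unit tests: they cost tokens
import { readFileSync } from 'fs';
import { join } from 'path';
import { GoogleGenerativeAI, GenerativeModel } from '@google/generative-ai';
const SHOULD_RUN_EVALS = process.env.RUN_EVALS === 'true';
// Jest has no describe.if, so choose describe or describe.skip up front
const describeEvals = SHOULD_RUN_EVALS ? describe : describe.skip;
describeEvals('Gemini 2.5 Pro integration evals', () => {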
let model: GenerativeModel;
beforeAll(() => {
if (!process.env.GOOGLE_AI_API_KEY) {
throw new Error('GOOGLE_AI_API_KEY required for evals');
}
const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY);
model = genAI.getGenerativeModel({ model: 'gemini-2.5-pro' });
});
test('correctly identifies objects in a standard test image', async () => {
const testImage = readFileSync(join(__dirname, 'fixtures/test-image.jpg'));
const result = await model.generateContent({
contents: [{
parts: [
{ inlineData: { data: testImage.toString('base64'), mimeType: 'image/jpeg' } },
{ text: 'What objects do you see in this image? List them.' },
],
role: 'user',
}],
});
const text = result.response.text();
// Verify known objects in the test image are identified
expect(text.toLowerCase()).toContain('car'); // Test image contains a car
expect(text.toLowerCase()).toContain('tree'); // Test image contains trees
}, 30_000);
test('standard and thinking requests both solve a rate problem', async () => {
const question = 'If a train travels at 60mph and another at 40mph start 200 miles apart and travel toward each other, when do they meet?';
// Standard request (no explicit thinking budget)
const standardResult = await model.generateContent(question);
// Request with an explicit thinking budget
const thinkingResult = await model.generateContent({
contents: [{ parts: [{ text: question }], role: 'user' }],
generationConfig: { thinkingConfig: { thinkingBudget: 512 } } as any,
});
// Both should get 2 hours (200 / (60+40) = 2).
// Match "2 hours" rather than a bare "2", which also appears in "200".
const twoHours = /\b(2|two)(\.0+)?\s*(hours?|hrs?|h)\b/i;
expect(standardResult.response.text()).toMatch(twoHours);
expect(thinkingResult.response.text()).toMatch(twoHours);
expect(thinkingResult.response.usageMetadata?.thoughtsTokenCount).toBeGreaterThan(0);
}, 30_000);
});E2E Testing with HelpMeTest
AI-powered features in your UI need end-to-end testing to validate the full stack — from UI input to Gemini API to rendered response:
Navigate to https://your-app.com/ai-assistant
Upload a test image (product photo)
Type "Describe this product in 3 sentences"
Click Generate
Verify a response appears within 10 seconds
Verify the response contains at least 3 sentences
Verify the response describes the image content
Enable grounding toggle
Type "What are the latest developments in AI?"
Click Generate
Verify citations appear below the response
HelpMeTest monitors these flows continuously, catching Gemini API deprecations, rate limit changes, and response quality regressions.
Summary
Testing Gemini 2.5 Pro integrations requires:
- SDK mocking — @google/generative-ai mock with realistic response shapes
- Thinking mode tests — token separation, budget assertions, cost tracking
- Multimodal tests — image, PDF, and multi-image request construction
- Grounding tests — citation extraction and grounding enforcement
- Integration evals — real API calls for quality validation (separate from unit tests)
- Cost assertions — verify thinking token usage stays within expected ranges
- E2E monitoring — HelpMeTest for AI feature reliability in production