Testing with Gemini 2.5 Pro: Multimodal Evals, Thinking Mode, and Grounding

HelpMeTest

19 May 2026 — 6 min read

Gemini 2.5 Pro is one of Google's most capable models, with native multimodal input (images, audio, video, documents), a 1-million-token context window, and a configurable "thinking" mode for complex reasoning. Testing Gemini 2.5 integrations requires eval patterns that account for these capabilities. This guide covers how to test them reliably.

What Makes Gemini 2.5 Testing Different

Gemini 2.5 Pro introduces capabilities that require new testing approaches:

Thinking mode — the model can reason step-by-step before answering, and you can access the thinking tokens separately
Multimodal inputs — images, documents, audio, and video all need test coverage
Grounding with Google Search — responses can include search-backed citations
Cost assertions — thinking tokens are billed as output tokens and can dominate the cost of a request
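
The cost point above can be made concrete with a small estimator. This is a minimal sketch rather than official pricing logic: the `estimateCostUsd` name and the `Rates` shape are hypothetical, the rates are placeholder parameters to be filled in from Google's current pricing page, and it assumes thinking tokens are charged at the output rate and reported separately from `candidatesTokenCount`.

```typescript
// Hypothetical helper: names and rates below are illustrative assumptions.
interface UsageMetadata {
  promptTokenCount: number;
  candidatesTokenCount: number;
  thoughtsTokenCount?: number; // absent when no thinking tokens were reported
}

interface Rates {
  inputPerMillionUsd: number;
  outputPerMillionUsd: number; // assumed to also apply to thinking tokens
}

function estimateCostUsd(usage: UsageMetadata, rates: Rates): number {
  const thoughts = usage.thoughtsTokenCount ?? 0;
  const inputCost = (usage.promptTokenCount / 1_000_000) * rates.inputPerMillionUsd;
  const outputCost =
    ((usage.candidatesTokenCount + thoughts) / 1_000_000) * rates.outputPerMillionUsd;
  return inputCost + outputCost;
}

// Placeholder rates, not real prices.
const exampleRates: Rates = { inputPerMillionUsd: 1, outputPerMillionUsd: 10 };
const exampleCost = estimateCostUsd(
  { promptTokenCount: 50, candidatesTokenCount: 100, thoughtsTokenCount: 300 },
  exampleRates,
);
```

Keeping the rates as parameters means a unit test can assert on the arithmetic while the real prices live in configuration.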

Mocking the Gemini SDK

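Mock Google's @google/generative-ai SDK with Jest. (Google has since published a newer @google/genai SDK with a different client shape; the mocks below target the older package.)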

import { GoogleGenerativeAI } from '@google/generative-ai';

jest.mock('@google/generative-ai');

const mockGenerateContent = jest.fn();

(GoogleGenerativeAI as jest.Mock).mockImplementation(() => ({
  getGenerativeModel: jest.fn(() => ({
    generateContent: mockGenerateContent,
    startChat: jest.fn(() => ({
      sendMessage: jest.fn(),
    })),
  })),
}));

// Helper to create a mock Gemini response; overrides are merged into the single candidate
function createMockGeminiResponse(text: string, overrides = {}) {
  return {
    response: {
      text: () => text,
      candidates: [{
        content: {
          parts: [{ text }],
          role: 'model',
        },
        finishReason: 'STOP',
        safetyRatings: [],
        ...overrides,
      }],
      usageMetadata: {
        promptTokenCount: 100,
        candidatesTokenCount: 200,
        totalTokenCount: 300,
        thoughtsTokenCount: 0,
      },
    },
  };
}

Testing Basic Gemini Integration

import { GeminiService } from '../services/gemini-service';

describe('GeminiService', () => {
  beforeEach(() => {
    mockGenerateContent.mockResolvedValue(
      createMockGeminiResponse('This is a test response.')
    );
  });

  test('sends prompt to Gemini 2.5 Pro model', async () => {
    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    
    await service.generateResponse('Explain quantum computing');

    expect(mockGenerateContent).toHaveBeenCalledWith(
      expect.objectContaining({
        contents: [{ parts: [{ text: 'Explain quantum computing' }], role: 'user' }],
      })
    );
  });

  test('returns parsed text from response', async () => {
    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    
    const result = await service.generateResponse('What is 2+2?');
    
    expect(result.text).toBe('This is a test response.');
  });

  test('uses gemini-2.5-pro model by default', () => {
    const genAI = new GoogleGenerativeAI('test-key');
    new GeminiService(genAI);
    
    expect(genAI.getGenerativeModel).toHaveBeenCalledWith(
      expect.objectContaining({ model: 'gemini-2.5-pro' })
    );
  });
});
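
The tests above exercise a `GeminiService` class that is never shown. A minimal sketch of what such a wrapper might look like follows. Everything here is an assumption made for illustration: the class body, the `ClientLike` and `ModelLike` interfaces (structural stand-ins for the SDK's types so the example stays self-contained), and the default model name taken from the test.

```typescript
// Hypothetical service under test; structural types stand in for the SDK's.
interface GenerateResult {
  response: { text(): string };
}

interface ModelLike {
  generateContent(request: unknown): Promise<GenerateResult>;
}

interface ClientLike {
  getGenerativeModel(params: { model: string }): ModelLike;
}

class GeminiService {
  private readonly model: ModelLike;

  constructor(client: ClientLike, modelName = 'gemini-2.5-pro') {
    // Resolved once so tests can assert on the model name passed in.
    this.model = client.getGenerativeModel({ model: modelName });
  }

  async generateResponse(prompt: string): Promise<{ text: string }> {
    const result = await this.model.generateContent({
      contents: [{ parts: [{ text: prompt }], role: 'user' }],
    });
    return { text: result.response.text() };
  }
}
```

Injecting the client through the constructor is what lets the Jest mock replace it without patching module internals.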

Testing Thinking Mode

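Gemini 2.5's thinking mode generates reasoning tokens before the final response. The mock below models them as parts flagged with thought: true and reports their count in thoughtsTokenCount: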

function createMockThinkingResponse(thinking: string, answer: string) {
  return {
    response: {
      text: () => answer,
      candidates: [{
        content: {
          parts: [
            { thought: true, text: thinking },
            { text: answer },
          ],
          role: 'model',
        },
        finishReason: 'STOP',
        safetyRatings: [],
      }],
      usageMetadata: {
        promptTokenCount: 50,
        candidatesTokenCount: 100,
        totalTokenCount: 450,
        thoughtsTokenCount: 300, // Thinking tokens
      },
    },
  };
}

describe('Thinking mode', () => {
  test('enables thinking mode with budget tokens', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockThinkingResponse(
        'Let me think about this step by step...',
        'The answer is 42.'
      )
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    
    await service.generateWithThinking('Solve this complex problem', {
      thinkingBudget: 1024,
    });

    expect(mockGenerateContent).toHaveBeenCalledWith(
      expect.objectContaining({
        generationConfig: expect.objectContaining({
          thinkingConfig: { thinkingBudget: 1024 },
        }),
      })
    );
  });

  test('extracts thinking tokens separately from answer', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockThinkingResponse(
        'First, I should consider X. Then Y. Therefore...',
        'The conclusion is Z.'
      )
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    const result = await service.generateWithThinking('Complex question');

    expect(result.thinking).toContain('First, I should consider X');
    expect(result.answer).toBe('The conclusion is Z.');
  });

  test('tracks thinking token cost separately', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockThinkingResponse('Lengthy reasoning...', 'Answer.')
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    const result = await service.generateWithThinking('Complex problem', {
      thinkingBudget: 2048,
    });

    expect(result.usage.thoughtsTokenCount).toBe(300);
    expect(result.usage.totalTokenCount).toBe(450);
    
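    // Rough cost estimate. Thinking tokens are billed as output tokens, so they
    // share one rate with the answer tokens. The rate is an assumed placeholder;
    // check current pricing. With fixed mock counts this only checks the arithmetic.
    const OUTPUT_RATE_PER_TOKEN = 0.00001;
    const estimatedCost =
      (result.usage.thoughtsTokenCount + result.usage.candidatesTokenCount) *
      OUTPUT_RATE_PER_TOKEN;
    expect(estimatedCost).toBeLessThan(0.01); // Sanity check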
  });

  test('falls back to standard mode when thinking budget is 0', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGeminiResponse('Direct answer without thinking.')
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    
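    // A budget of 0 is a service-level convention here: the service omits thinkingConfig.
    // Gemini 2.5 Pro itself may not allow thinking to be fully disabled.
    await service.generateWithThinking('Simple question', { thinkingBudget: 0 });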

    expect(mockGenerateContent).toHaveBeenCalledWith(
      expect.not.objectContaining({
        generationConfig: expect.objectContaining({
          thinkingConfig: { thinkingBudget: expect.any(Number) },
        }),
      })
    );
  });
});
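
The thinking-mode tests above imply two small pieces of logic inside the service: building a request that carries `thinkingConfig` only when a budget is set, and separating thought parts from the answer. A minimal sketch of both, assuming the part shape used in the mocks; the helper names `splitThinking` and `buildThinkingRequest` are hypothetical.

```typescript
// Hypothetical helpers; `Part` mirrors only the fields used in the mocks above.
interface Part {
  text?: string;
  thought?: boolean;
}

interface ThinkingRequest {
  contents: { parts: { text: string }[]; role: 'user' }[];
  generationConfig?: { thinkingConfig: { thinkingBudget: number } };
}

// Separate reasoning parts (thought === true) from the final answer text.
function splitThinking(parts: Part[]): { thinking: string; answer: string } {
  const joinText = (selected: Part[]) => selected.map((p) => p.text ?? '').join('');
  return {
    thinking: joinText(parts.filter((p) => p.thought === true)),
    answer: joinText(parts.filter((p) => p.thought !== true)),
  };
}

// Attach thinkingConfig only for a positive budget, matching the fallback test.
function buildThinkingRequest(prompt: string, thinkingBudget = 0): ThinkingRequest {
  const request: ThinkingRequest = {
    contents: [{ parts: [{ text: prompt }], role: 'user' }],
  };
  if (thinkingBudget > 0) {
    request.generationConfig = { thinkingConfig: { thinkingBudget } };
  }
  return request;
}
```

Keeping these as pure functions lets them be unit-tested without any SDK mock at all.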

Testing Multimodal Inputs

Gemini 2.5 Pro accepts images, documents, audio, and video. Test that your service constructs the correct multipart requests:

import { readFileSync } from 'fs';
import { join } from 'path';

describe('Multimodal inputs', () => {
  test('sends image analysis request with correct MIME type', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGeminiResponse('I see a red car in the image.')
    );

    const imageBuffer = Buffer.from('fake-image-data');
    const service = new GeminiService(new GoogleGenerativeAI('test-key'));

    await service.analyzeImage(imageBuffer, 'What is in this image?', 'image/jpeg');

    expect(mockGenerateContent).toHaveBeenCalledWith({
      contents: [{
        parts: [
          {
            inlineData: {
              data: imageBuffer.toString('base64'),
              mimeType: 'image/jpeg',
            },
          },
          { text: 'What is in this image?' },
        ],
        role: 'user',
      }],
    });
  });

  test('analyzes PDF document with correct content parts', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGeminiResponse('The document discusses quarterly revenue...')
    );

    const pdfBuffer = Buffer.from('fake-pdf-data');
    const service = new GeminiService(new GoogleGenerativeAI('test-key'));

    const result = await service.analyzeDocument(
      pdfBuffer,
      'Summarize the key financial metrics',
      'application/pdf'
    );

    expect(mockGenerateContent).toHaveBeenCalledWith({
      contents: [{
        parts: [
          {
            inlineData: {
              data: pdfBuffer.toString('base64'),
              mimeType: 'application/pdf',
            },
          },
          { text: 'Summarize the key financial metrics' },
        ],
        role: 'user',
      }],
    });

    expect(result.text).toContain('quarterly revenue');
  });

  test('handles multiple images in single request', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGeminiResponse('Image 1 shows X, Image 2 shows Y.')
    );

    const images = [
      { buffer: Buffer.from('img1'), mimeType: 'image/jpeg' as const },
      { buffer: Buffer.from('img2'), mimeType: 'image/png' as const },
    ];

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    await service.compareImages(images, 'What differences do you see?');

    const call = mockGenerateContent.mock.calls[0][0];
    const parts = call.contents[0].parts;
    
    expect(parts).toHaveLength(3); // 2 images + 1 text
    expect(parts[0].inlineData.mimeType).toBe('image/jpeg');
    expect(parts[1].inlineData.mimeType).toBe('image/png');
    expect(parts[2].text).toBe('What differences do you see?');
  });
});
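
The request shape asserted in the multimodal tests can be produced by one small builder. The sketch below is an assumption about how a service might do it, not the SDK's own API: `buildMultimodalParts`, `Base64Source`, and the MIME allowlist are all hypothetical. `Base64Source` is a structural type so that Node's `Buffer` fits without importing Node typings.

```typescript
// Hypothetical builder; `Base64Source` is structural so Node's Buffer fits.
interface Base64Source {
  toString(encoding: 'base64'): string;
}

interface InlineFile {
  data: Base64Source;
  mimeType: string;
}

type RequestPart =
  | { inlineData: { data: string; mimeType: string } }
  | { text: string };

// MIME types this sketch accepts; extend to match what your app supports.
const ALLOWED_MIME_TYPES = new Set(['image/jpeg', 'image/png', 'application/pdf']);

function buildMultimodalParts(files: InlineFile[], prompt: string): RequestPart[] {
  const fileParts = files.map((file): RequestPart => {
    if (!ALLOWED_MIME_TYPES.has(file.mimeType)) {
      throw new Error(`Unsupported MIME type: ${file.mimeType}`);
    }
    return {
      inlineData: { data: file.data.toString('base64'), mimeType: file.mimeType },
    };
  });
  // Files first, prompt last: the ordering the tests above assert on.
  return [...fileParts, { text: prompt }];
}
```

Rejecting unknown MIME types before the request is sent gives a fast, testable failure instead of an API error at runtime.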

Testing Grounding with Google Search

Grounded responses include search citations. Test that your service uses grounding correctly:

function createMockGroundedResponse(text: string) {
  return {
    response: {
      text: () => text,
      candidates: [{
        content: { parts: [{ text }], role: 'model' },
        finishReason: 'STOP',
        groundingMetadata: {
          searchEntryPoint: {
            renderedContent: '<html>...</html>',
          },
          groundingChunks: [
            {
              web: {
                uri: 'https://example.com/article',
                title: 'Relevant Article',
              },
            },
          ],
          groundingSupports: [
            {
              groundingChunkIndices: [0],
              confidenceScores: [0.95],
              segment: { startIndex: 0, endIndex: 50, text: text.slice(0, 50) },
            },
          ],
        },
        safetyRatings: [],
      }],
      usageMetadata: { promptTokenCount: 100, candidatesTokenCount: 150, totalTokenCount: 250 },
    },
  };
}

describe('Grounding with Google Search', () => {
  test('enables grounding tool in the request', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGroundedResponse('According to recent reports, the market grew by 15%.')
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    
    await service.generateWithGrounding('What is the current state of the AI market?');

    expect(mockGenerateContent).toHaveBeenCalledWith(
      expect.objectContaining({
        tools: [{ googleSearch: {} }],
      })
    );
  });

  test('extracts citation sources from grounded response', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGroundedResponse('The population of Earth is approximately 8 billion.')
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));
    const result = await service.generateWithGrounding('What is the world population?');

    expect(result.citations).toHaveLength(1);
    expect(result.citations[0].uri).toBe('https://example.com/article');
    expect(result.citations[0].title).toBe('Relevant Article');
    expect(result.citations[0].confidenceScore).toBe(0.95);
  });

  test('raises error when response lacks grounding for factual queries', async () => {
    mockGenerateContent.mockResolvedValue(
      createMockGeminiResponse('The answer might be...') // No grounding metadata
    );

    const service = new GeminiService(new GoogleGenerativeAI('test-key'));

    // Your service should enforce grounding for factual queries
    await expect(
      service.generateWithGrounding('What is the current stock price of GOOG?', {
        requireGrounding: true,
      })
    ).rejects.toThrow('Grounded response required but not returned');
  });
});
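
The citation tests above expect the service to flatten `groundingMetadata` into a list of sources with confidence scores. One way that mapping might look is sketched below. The `extractCitations` name and the `Citation` shape are hypothetical, the metadata type covers only the fields used in the mock, and the code assumes `confidenceScores` runs parallel to `groundingChunkIndices`.

```typescript
// Hypothetical extractor; types mirror only the mocked groundingMetadata fields.
interface GroundingMetadata {
  groundingChunks?: { web?: { uri: string; title: string } }[];
  groundingSupports?: {
    groundingChunkIndices: number[];
    confidenceScores?: number[];
  }[];
}

interface Citation {
  uri: string;
  title: string;
  confidenceScore?: number;
}

function extractCitations(
  metadata: GroundingMetadata | undefined,
  requireGrounding = false,
): Citation[] {
  const chunks = metadata?.groundingChunks ?? [];
  const citations: Citation[] = [];
  for (const support of metadata?.groundingSupports ?? []) {
    support.groundingChunkIndices.forEach((chunkIndex, position) => {
      const web = chunks[chunkIndex]?.web;
      if (web) {
        citations.push({
          uri: web.uri,
          title: web.title,
          // Assumed parallel to groundingChunkIndices.
          confidenceScore: support.confidenceScores?.[position],
        });
      }
    });
  }
  if (requireGrounding && citations.length === 0) {
    throw new Error('Grounded response required but not returned');
  }
  return citations;
}
```

This sketch does not deduplicate: a chunk referenced by several supports appears once per reference, which may or may not be what a UI wants.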

Integration Evals with Real API

For critical AI features, run integration evals against the real Gemini API:

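// eval/gemini-integration.eval.ts
// Run these separately from unit tests: they call the real API and cost tokens
import { readFileSync } from 'fs';
import { join } from 'path';
import { GoogleGenerativeAI, GenerativeModel } from '@google/generative-ai';

const SHOULD_RUN_EVALS = process.env.RUN_EVALS === 'true';

// Jest has no describe.if; pick describe or describe.skip based on the flag
const describeEvals = SHOULD_RUN_EVALS ? describe : describe.skip;

describeEvals('Gemini 2.5 Pro integration evals', () => {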
  let model: GenerativeModel;

  beforeAll(() => {
    if (!process.env.GOOGLE_AI_API_KEY) {
      throw new Error('GOOGLE_AI_API_KEY required for evals');
    }
    const genAI = new GoogleGenerativeAI(process.env.GOOGLE_AI_API_KEY);
    model = genAI.getGenerativeModel({ model: 'gemini-2.5-pro' });
  });

  test('correctly identifies objects in a standard test image', async () => {
    const testImage = readFileSync(join(__dirname, 'fixtures/test-image.jpg'));
    
    const result = await model.generateContent({
      contents: [{
        parts: [
          { inlineData: { data: testImage.toString('base64'), mimeType: 'image/jpeg' } },
          { text: 'What objects do you see in this image? List them.' },
        ],
        role: 'user',
      }],
    });

    const text = result.response.text();
    
    // Verify known objects in the test image are identified
    expect(text.toLowerCase()).toContain('car'); // Test image contains a car
    expect(text.toLowerCase()).toContain('tree'); // Test image contains trees
  }, 30_000);

  test('default and explicit-budget requests both solve a multi-step word problem', async () => {
    const question = 'If a train travels at 60mph and another at 40mph start 200 miles apart and travel toward each other, when do they meet?';

    // Default request (no explicit thinking budget)
    const standardResult = await model.generateContent(question);

    // Explicit thinking budget; the cast covers SDK typings that lack thinkingConfig
    const thinkingResult = await model.generateContent({
      contents: [{ parts: [{ text: question }], role: 'user' }],
      generationConfig: { thinkingConfig: { thinkingBudget: 512 } } as any,
    });

    // Both should get 2 hours (200 / (60+40) = 2).
    // Match "2 hours" rather than any "2", which would also match "200".
    const twoHours = /\b(2|two)(\.0)?\s*hours?\b/i;
    expect(standardResult.response.text()).toMatch(twoHours);
    expect(thinkingResult.response.text()).toMatch(twoHours);
    expect((thinkingResult.response.usageMetadata as any)?.thoughtsTokenCount).toBeGreaterThan(0);
  }, 60_000);
});
});
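
Real-API output varies between runs, so exact-match assertions tend to be brittle in evals like these. One option is a tolerant keyword scorer that reports what was missing. This is a minimal sketch; `scoreKeywords` and `KeywordScore` are hypothetical names, and plain substring matching is a deliberate simplification.

```typescript
// Hypothetical eval scorer: tolerant keyword matching for nondeterministic output.
interface KeywordScore {
  matched: string[];
  missing: string[];
  score: number; // fraction of expected keywords found, from 0 to 1
}

function scoreKeywords(responseText: string, expected: string[]): KeywordScore {
  const haystack = responseText.toLowerCase();
  const matched = expected.filter((k) => haystack.includes(k.toLowerCase()));
  const missing = expected.filter((k) => !haystack.includes(k.toLowerCase()));
  // An empty expectation list is treated as trivially satisfied.
  const score = expected.length === 0 ? 1 : matched.length / expected.length;
  return { matched, missing, score };
}
```

In an eval, `scoreKeywords(text, ['car', 'tree'])` could replace the two `toContain` checks, with the test asserting that `score` meets a chosen threshold and logging `missing` on failure.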

E2E Testing with HelpMeTest

AI-powered features in your UI need end-to-end testing to validate the full stack — from UI input to Gemini API to rendered response:

Navigate to https://your-app.com/ai-assistant
Upload a test image (product photo)
Type "Describe this product in 3 sentences"
Click Generate
Verify a response appears within 10 seconds
Verify the response contains at least 3 sentences
Verify the response describes the image content
Enable grounding toggle
Type "What are the latest developments in AI?"
Click Generate
Verify citations appear below the response

HelpMeTest monitors these flows continuously, catching Gemini API deprecations, rate limit changes, and response quality regressions.

Summary

Testing Gemini 2.5 Pro integrations requires:

SDK mocking — @google/generative-ai mock with realistic response shapes
Thinking mode tests — token separation, budget assertions, cost tracking
Multimodal tests — image, PDF, and multi-image request construction
Grounding tests — citation extraction and grounding enforcement
Integration evals — real API calls for quality validation (separate from unit tests)
Cost assertions — verify thinking token usage stays within expected ranges
E2E monitoring — HelpMeTest for AI feature reliability in production
