Multimodal AI Testing: Vision-Language Models, GPT-4V, and Gemini Vision

Vision-language models changed the product surface for AI applications. GPT-4V, Gemini Vision, Claude Vision, and LLaVA can describe images, read documents, analyze charts, and answer questions about visual content. They're appearing in production features — receipt extraction, UI accessibility checking, content moderation, medical image description — and they need tests.

Testing multimodal AI is harder than testing text-only LLMs because a failure can originate on either side of the pipeline. The model can misread the visual input, or it can read the image correctly and still produce the wrong textual output, and the two fail independently. This guide covers how to build a test suite for vision-language models that probes both.

The Testing Challenge

With text-only LLMs, input variants are plain strings, so a representative set is cheap to write down and review. With vision models, the same content can arrive under different lighting, resolution, compression, and layout, and each of those can change the result. You cannot cover every combination, so the testing strategy must be deliberate about which conditions it covers.

Key failure modes for vision models:

  • Hallucination — the model confidently describes things that aren't in the image
  • OCR errors — text in images is misread, especially handwriting or decorative fonts
  • Layout confusion — tables, multi-column layouts, and infographics are misinterpreted
  • Resolution sensitivity — small text or fine detail is missed below certain resolutions
  • Prompt sensitivity — the same image produces different results with slightly different prompts
  • Refusal false positives — benign medical or artistic images trigger content filters

Building Image Fixtures

Fixture quality determines test quality. For each use case your application handles, create a fixture set:

tests/
  fixtures/
    images/
      documents/
        invoice_simple.jpg         # Clean scan, machine-printed
        invoice_handwritten.jpg    # Handwritten amounts
        invoice_low_quality.jpg    # Phone photo, skewed, noise
        receipt_crumpled.jpg       # Real-world degradation
      charts/
        bar_chart_simple.png       # Single series, clear labels
        line_chart_dense.png       # Multiple series, overlapping
        pie_chart_small_slices.png # Many small segments
      ui_screenshots/
        login_form.png             # Standard form layout
        dashboard_complex.png      # Dense data visualization
        mobile_viewport.png        # Small screen layout
      edge_cases/
        empty_image.png            # Blank white/black
        very_small_16x16.png       # Minimal resolution
        very_large_8k.png          # High resolution
        text_heavy.png             # Primarily text content
        no_text.png                # Pure visual, no text
    ground_truth/
      documents/
        invoice_simple.json        # Expected extraction result
        invoice_handwritten.json
      # ...
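
An image without a matching ground-truth file tends to drop out of parametrized tests unnoticed, so it can help to check the pairing itself. A minimal sketch, assuming the directory layout above; the helper name `find_unpaired_fixtures` is hypothetical.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_unpaired_fixtures(images_root: Path, truth_root: Path) -> list[str]:
    """List image fixtures (as relative paths) that have no ground-truth JSON."""
    unpaired = []
    for image_path in sorted(images_root.rglob("*")):
        if image_path.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        relative = image_path.relative_to(images_root)
        # Ground truth mirrors the image tree, with a .json suffix.
        expected_json = truth_root / relative.with_suffix(".json")
        if not expected_json.exists():
            unpaired.append(relative.as_posix())
    return unpaired
```

Some fixtures, such as those under edge_cases/, may legitimately have no ground truth, so a real check would probably exclude those directories.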

Create fixtures programmatically for synthetic test cases:

import json
from pathlib import Path

from PIL import Image, ImageDraw

def create_invoice_fixture(total_amount: float, items: list[dict]) -> tuple[Image.Image, dict]:
    """Create a synthetic invoice image with known ground truth."""
    img = Image.new("RGB", (800, 1000), color="white")
    draw = ImageDraw.Draw(img)

    # Text is drawn in Pillow's small default font; pass font= to draw.text()
    # if the fixture should use larger or more realistic type.
    y = 50
    draw.text((50, y), "INVOICE", fill="black")
    y += 60

    # One line per item, so every rendered value is also in the ground truth
    for item in items:
        line = f"{item['description']}: ${item['amount']:.2f}"
        draw.text((50, y), line, fill="black")
        y += 40

    draw.text((50, y + 20), f"TOTAL: ${total_amount:.2f}", fill="black")

    ground_truth = {
        "total": total_amount,
        "items": items,
        "currency": "USD"
    }

    return img, ground_truth

# Generate the fixture; create the target directories first so save() cannot fail
image_dir = Path("tests/fixtures/images/documents")
truth_dir = Path("tests/fixtures/ground_truth/documents")
image_dir.mkdir(parents=True, exist_ok=True)
truth_dir.mkdir(parents=True, exist_ok=True)

img, truth = create_invoice_fixture(
    total_amount=247.50,
    items=[
        {"description": "Widget A", "amount": 125.00},
        {"description": "Widget B", "amount": 122.50}
    ]
)
img.save(image_dir / "invoice_synthetic.jpg")
with open(truth_dir / "invoice_synthetic.json", "w") as f:
    json.dump(truth, f, indent=2)

Unit Testing Vision Model Integration Code

Unit tests mock the vision API and test your wrapper logic:

# tests/unit/test_vision_client.py
import pytest
import base64
from unittest.mock import patch, MagicMock
from pathlib import Path

def encode_image_b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

class TestVisionClient:
    
    def test_sends_image_as_base64(self):
        """Verify image is encoded correctly before sending to API."""
        from myapp.vision import VisionClient
        
        with patch("openai.resources.chat.completions.Completions.create") as mock_create:
            mock_create.return_value = MagicMock(
                choices=[MagicMock(message=MagicMock(content="An invoice for $247.50"))]
            )
            
            client = VisionClient(api_key="test-key")
            result = client.describe_image(
                "tests/fixtures/images/documents/invoice_simple.jpg",
                prompt="Extract the total amount from this invoice."
            )
            
            # Verify the API call structure
            call_messages = mock_create.call_args.kwargs["messages"]
            image_content = call_messages[0]["content"][1]
            assert image_content["type"] == "image_url"
            assert image_content["image_url"]["url"].startswith("data:image/")
    
    def test_handles_content_policy_refusal(self):
        """Should handle refusal gracefully and return structured error."""
        from myapp.vision import VisionClient, ContentPolicyError
        
        with patch("openai.resources.chat.completions.Completions.create") as mock_create:
            mock_create.return_value = MagicMock(
                choices=[MagicMock(
                    message=MagicMock(
                        content="I'm sorry, I can't help with that.",
                        refusal="Content policy violation"
                    ),
                    finish_reason="content_filter"
                )]
            )
            
            client = VisionClient(api_key="test-key")
            
            with pytest.raises(ContentPolicyError):
                client.describe_image("image.jpg", prompt="Describe this image.")
    
    def test_url_image_not_re_encoded(self):
        """Images provided as URLs should be sent as URLs, not downloaded and re-encoded."""
        from myapp.vision import VisionClient
        
        with patch("openai.resources.chat.completions.Completions.create") as mock_create:
            mock_create.return_value = MagicMock(
                choices=[MagicMock(message=MagicMock(content="A product photo"))]
            )
            
            client = VisionClient(api_key="test-key")
            result = client.describe_image(
                "https://example.com/product.jpg",
                prompt="Describe this product."
            )
            
            call_messages = mock_create.call_args.kwargs["messages"]
            image_content = call_messages[0]["content"][1]
            assert image_content["image_url"]["url"] == "https://example.com/product.jpg"
    
    def test_max_tokens_controls_output_length(self):
        """max_tokens should be configurable and passed through."""
        from myapp.vision import VisionClient
        
        with patch("openai.resources.chat.completions.Completions.create") as mock_create:
            mock_create.return_value = MagicMock(
                choices=[MagicMock(message=MagicMock(content="Short desc"))]
            )
            
            client = VisionClient(api_key="test-key")
            client.describe_image("image.jpg", prompt="Describe.", max_tokens=50)
            
            assert mock_create.call_args.kwargs["max_tokens"] == 50
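
These tests pin down the request shape without showing the wrapper itself. One way the payload-building part of such a client could look is sketched below. `build_image_part` is a hypothetical helper (the real `myapp.vision` module is not shown in this guide), and the dictionary layout simply mirrors the `image_url` content part that the tests assert on.

```python
import base64
import mimetypes
from pathlib import Path


def build_image_part(source: str) -> dict:
    """Build an `image_url` content part from a URL or a local file path."""
    if source.startswith(("http://", "https://")):
        # Remote images are passed through untouched, never downloaded.
        url = source
    else:
        mime_type, _ = mimetypes.guess_type(source)
        if mime_type is None or not mime_type.startswith("image/"):
            raise ValueError(f"Not a recognized image file: {source}")
        encoded = base64.b64encode(Path(source).read_bytes()).decode("ascii")
        url = f"data:{mime_type};base64,{encoded}"
    return {"type": "image_url", "image_url": {"url": url}}
```

Keeping this logic in a pure function means the encoding rules can be tested without patching the API client at all.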

Evaluating Structured Output Extraction

For structured tasks (invoice extraction, form parsing), compare extracted JSON against ground truth:

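# tests/integration/test_invoice_extraction.py
import json
import os
from unittest.mock import MagicMock, patch

import pytest
from deepdiff import DeepDiff  # available for full-structure diffs; the checks below compare fields directly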

class TestInvoiceExtraction:
    
    EXTRACTION_CASES = [
        "invoice_simple",
        "invoice_handwritten",
        "invoice_low_quality",
    ]
    
    @pytest.mark.integration
    @pytest.mark.parametrize("fixture_name", EXTRACTION_CASES)
    def test_invoice_extraction_accuracy(self, fixture_name):
        from myapp.vision import InvoiceExtractor
        
        extractor = InvoiceExtractor(api_key=os.environ["OPENAI_API_KEY"])
        
        with open(f"tests/fixtures/ground_truth/documents/{fixture_name}.json") as f:
            expected = json.load(f)
        
        result = extractor.extract(
            f"tests/fixtures/images/documents/{fixture_name}.jpg"
        )
        
        # Total amount must be exactly correct (or within rounding)
        assert abs(result["total"] - expected["total"]) < 0.01, (
            f"Total mismatch: expected {expected['total']}, got {result['total']}"
        )
        
        # All line items must be present
        expected_descriptions = {item["description"] for item in expected["items"]}
        result_descriptions = {item["description"] for item in result.get("items", [])}
        
        missing = expected_descriptions - result_descriptions
        assert not missing, f"Missing line items: {missing}"
    
    def test_extraction_returns_valid_json(self):
        """Model output must be parseable JSON, not prose."""
        from myapp.vision import InvoiceExtractor
        
        with patch("openai.resources.chat.completions.Completions.create") as mock_create:
            # Simulate model returning prose instead of JSON
            mock_create.return_value = MagicMock(
                choices=[MagicMock(message=MagicMock(
                    content="The invoice total is $247.50 for two items."
                ))]
            )
            
            extractor = InvoiceExtractor(api_key="test-key")
            
            with pytest.raises(Exception):  # Should fail to parse
                extractor.extract("invoice.jpg")
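
A pass/fail assertion hides how close an extraction was. A field-level score makes degradation across fixtures easier to track, and it also surfaces items the model reported that are not in the ground truth. The sketch below assumes the ground-truth shape used earlier (`total`, `items` with `description`); `score_extraction` and `normalize` are hypothetical names, and matching on normalized descriptions is one possible choice among several.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so cosmetic differences are ignored."""
    return " ".join(text.lower().split())


def score_extraction(result: dict, expected: dict, tolerance: float = 0.01) -> dict:
    """Compare an extracted invoice against ground truth, field by field."""
    expected_items = {normalize(i["description"]) for i in expected.get("items", [])}
    result_items = {normalize(i["description"]) for i in result.get("items", [])}
    matched = expected_items & result_items

    total = result.get("total")
    total_correct = (
        isinstance(total, (int, float))
        and abs(total - expected["total"]) < tolerance
    )

    return {
        "total_correct": total_correct,
        "missing_items": sorted(expected_items - result_items),
        # Reported by the model but absent from ground truth (possible hallucination).
        "spurious_items": sorted(result_items - expected_items),
        # By convention, an empty set scores 1.0 rather than dividing by zero.
        "item_recall": len(matched) / len(expected_items) if expected_items else 1.0,
        "item_precision": len(matched) / len(result_items) if result_items else 1.0,
    }
```

Logging these scores per fixture, rather than only asserting on them, gives a trend line to compare when the model or prompt changes.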

Regression Testing Across Model Versions

Model upgrades can change output even when your prompts don't change. Capture baseline outputs and compare:

# tests/regression/capture_baselines.py
"""Run this script to capture golden outputs for regression testing."""
import json
import os
from datetime import date

from myapp.vision import VisionClient

# Record the model the client is configured to call; update this when you upgrade
MODEL_NAME = "gpt-4o"

REGRESSION_FIXTURES = [
    ("tests/fixtures/images/charts/bar_chart_simple.png",
     "Describe the data shown in this bar chart."),
    ("tests/fixtures/images/documents/invoice_simple.jpg",
     "Extract all text from this invoice."),
]

def capture_baseline(output_dir: str = "tests/regression/baselines"):
    os.makedirs(output_dir, exist_ok=True)
    client = VisionClient(api_key=os.environ["OPENAI_API_KEY"])

    for image_path, prompt in REGRESSION_FIXTURES:
        # bar_chart_simple.png -> bar_chart_simple_png
        name = os.path.basename(image_path).replace(".", "_")
        result = client.describe_image(image_path, prompt=prompt)

        baseline = {
            "image": image_path,
            "prompt": prompt,
            "output": result,
            "model": MODEL_NAME,
            # Stamp the actual capture date instead of a hardcoded one
            "captured_at": date.today().isoformat()
        }

        with open(f"{output_dir}/{name}.json", "w") as f:
            json.dump(baseline, f, indent=2)

        print(f"Captured baseline for {name}")

if __name__ == "__main__":
    capture_baseline()

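# tests/regression/test_regression.py
import json
import os

import pytest

# extract_key_facts is a project-specific helper that must be defined or imported here;
# it should reduce a description to a set of comparable facts (numbers, labels).

@pytest.mark.regression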
def test_chart_description_regression():
    """Chart description output should not change significantly between runs."""
    from myapp.vision import VisionClient
    
    with open("tests/regression/baselines/bar_chart_simple_png.json") as f:
        baseline = json.load(f)
    
    client = VisionClient(api_key=os.environ["OPENAI_API_KEY"])
    current_output = client.describe_image(
        baseline["image"],
        prompt=baseline["prompt"]
    )
    
    # Check key facts are still present (model wording may vary)
    baseline_facts = extract_key_facts(baseline["output"])
    current_facts = extract_key_facts(current_output)
    
    missing_facts = baseline_facts - current_facts
    assert not missing_facts, f"Regression: these facts disappeared: {missing_facts}"
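
The regression test relies on an `extract_key_facts` helper that this guide does not define. One simple way to approximate it is to treat numbers, plus any domain keywords you care about, as the facts. This is a sketch under that assumption; the `keywords` parameter and the regular expression are illustrative, and a production version might use an LLM or an entity extractor instead.

```python
import re

# Integers, decimals, thousands separators, and an optional percent sign.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?%?")


def extract_key_facts(text: str, keywords: frozenset[str] = frozenset()) -> set[str]:
    """Reduce a free-text description to a set of comparable facts."""
    facts = set()

    # Numbers are compared without thousands separators: "1,200" == "1200".
    for match in NUMBER_RE.findall(text):
        facts.add(match.replace(",", ""))

    # Keywords count as facts when they appear anywhere, case-insensitively.
    lowered = text.lower()
    for keyword in keywords:
        if keyword.lower() in lowered:
            facts.add(keyword.lower())

    return facts
```

Because the result is a set, the subtraction in the regression test reports exactly which numbers or labels disappeared, regardless of how the model rephrased the rest.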

Gemini Vision: Key Differences

Gemini's Python SDK (google-generativeai) accepts PIL Image objects directly in the content list, so you do not need to build base64 data URLs yourself:

import os

import google.generativeai as genai
from PIL import Image

def test_gemini_vision_invoice():
    """Gemini Vision accepts PIL Image objects directly."""
    genai.configure(api_key=os.environ["GEMINI_API_KEY"])
    model = genai.GenerativeModel("gemini-1.5-flash")
    
    image = Image.open("tests/fixtures/images/documents/invoice_simple.jpg")
    
    response = model.generate_content([
        "Extract the total amount from this invoice. Return only the number.",
        image
    ])
    
    # Parse the number from response
    total_str = response.text.strip().replace("$", "").replace(",", "")
    total = float(total_str)
    
    assert abs(total - 247.50) < 0.01, f"Expected 247.50, got {total}"
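
Even with "Return only the number" in the prompt, a model may wrap the amount in a sentence, which makes a bare `float()` call raise. A more tolerant parser is sketched below; `parse_amount` is a hypothetical helper, and it assumes positive amounts in US-style formatting (comma as thousands separator, period as decimal point).

```python
import re

# Either a comma-grouped number (1,247.50) or a plain one (1247.50).
AMOUNT_RE = re.compile(r"\d{1,3}(?:,\d{3})+(?:\.\d+)?|\d+(?:\.\d+)?")


def parse_amount(text: str) -> float:
    """Return the first numeric amount found in a model response."""
    match = AMOUNT_RE.search(text)
    if match is None:
        raise ValueError(f"No numeric amount found in: {text!r}")
    return float(match.group().replace(",", ""))
```

Raising a `ValueError` with the raw response in the message keeps test failures readable when the model returns prose or a refusal instead of a number.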

CI Pipeline for Vision Tests

# .github/workflows/vision-quality.yml
name: Vision AI Quality

on:
  pull_request:
    paths:
      - "myapp/vision/**"
      - "tests/fixtures/images/**"
  schedule:
    - cron: "0 7 * * 1"  # Weekly on Monday

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest pillow openai
      - run: pytest tests/unit/ -v

  integration-tests:
    runs-on: ubuntu-latest
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'needs-integration')
    steps:
      - uses: actions/checkout@v4
      - run: pip install pytest pillow openai google-generativeai deepdiff
      - name: Run extraction quality tests
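        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          # Needed only if the Gemini tests live under tests/integration/ (secret name assumed)
          GEMINI_API_KEY: ${{ secrets.GEMINI_API_KEY }}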
        run: pytest tests/integration/ -v -m integration
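
The workflow selects tests with `-m integration`, and the test files also use a `regression` marker. pytest warns about markers it does not know, so it is worth registering both. A minimal sketch of a root-level conftest.py follows; the marker descriptions are illustrative wording.

```python
# conftest.py (at the repository root)

CUSTOM_MARKERS = {
    "integration": "calls a real vision API; needs credentials and network access",
    "regression": "compares current model output against captured baselines",
}


def pytest_configure(config):
    """Register custom markers so `-m integration` selects them without warnings."""
    for name, description in CUSTOM_MARKERS.items():
        config.addinivalue_line("markers", f"{name}: {description}")
```

The same registration can be done declaratively under `markers` in pytest.ini or pyproject.toml if you prefer configuration over code.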

Production Monitoring

Vision model quality can drift when your upstream provider updates their model. HelpMeTest lets you run scheduled tests that submit reference images and assert on output content — no infrastructure required. Set up a daily test that verifies your invoice extraction still returns the correct total for a known fixture, and get alerted when it breaks.

Conclusion

Testing vision-language models requires richer fixtures than text-only AI — you need diverse images that cover your production conditions. Layer unit tests (for integration code) with integration tests (for real model quality) and regression tests (to detect upstream model changes). For structured extraction tasks, ground truth JSON gives you precise quality metrics. For descriptive tasks, key fact extraction from baseline outputs provides a practical comparison method even when model wording varies.
