Multimodal AI Testing: Vision-Language Models, GPT-4V, and Gemini Vision
Vision-language models changed the product surface for AI applications. GPT-4V, Gemini Vision, Claude Vision, and LLaVA can describe images, read documents, analyze charts, and answer questions about visual content. They're appearing in production features — receipt extraction, UI accessibility checking, content moderation, medical image description — and they need tests.
Testing multimodal AI is harder than testing text-only LLMs because the input space is two-dimensional. You're testing the model's ability to process a visual input and produce a textual output, and both dimensions can fail independently. This guide covers how to build a test suite for vision-language models.
The Testing Challenge
With text-only LLMs, you can exhaustively enumerate input variants. With vision models, the space of possible inputs includes image content, lighting, resolution, compression artifacts, layout, and more. Testing strategy must be deliberate about which conditions you cover.
Key failure modes for vision models:
- Hallucination — the model confidently describes things that aren't in the image
- OCR errors — text in images is misread, especially handwriting or decorative fonts
- Layout confusion — tables, multi-column layouts, and infographics are misinterpreted
- Resolution sensitivity — small text or fine detail is missed below certain resolutions
- Prompt sensitivity — the same image produces different results with slightly different prompts
- Refusal false positives — benign medical or artistic images trigger content filters
Building Image Fixtures
Fixture quality determines test quality. For each use case your application handles, create a fixture set:
tests/
fixtures/
images/
documents/
invoice_simple.jpg # Clean scan, machine-printed
invoice_handwritten.jpg # Handwritten amounts
invoice_low_quality.jpg # Phone photo, skewed, noise
receipt_crumpled.jpg # Real-world degradation
charts/
bar_chart_simple.png # Single series, clear labels
line_chart_dense.png # Multiple series, overlapping
pie_chart_small_slices.png # Many small segments
ui_screenshots/
login_form.png # Standard form layout
dashboard_complex.png # Dense data visualization
mobile_viewport.png # Small screen layout
edge_cases/
empty_image.png # Blank white/black
very_small_16x16.png # Minimal resolution
very_large_8k.png # High resolution
text_heavy.png # Primarily text content
no_text.png # Pure visual, no text
ground_truth/
documents/
invoice_simple.json # Expected extraction result
invoice_handwritten.json
# ...Create fixtures programmatically for synthetic test cases:
from PIL import Image, ImageDraw, ImageFont
import json
def create_invoice_fixture(total_amount: float, items: list[dict]) -> tuple[Image.Image, dict]:
"""Create a synthetic invoice image with known ground truth."""
img = Image.new("RGB", (800, 1000), color="white")
draw = ImageDraw.Draw(img)
y = 50
draw.text((50, y), "INVOICE", fill="black")
y += 60
for item in items:
line = f"{item['description']}: ${item['amount']:.2f}"
draw.text((50, y), line, fill="black")
y += 40
draw.text((50, y + 20), f"TOTAL: ${total_amount:.2f}", fill="black")
ground_truth = {
"total": total_amount,
"items": items,
"currency": "USD"
}
return img, ground_truth
# Generate fixture
img, truth = create_invoice_fixture(
total_amount=247.50,
items=[
{"description": "Widget A", "amount": 125.00},
{"description": "Widget B", "amount": 122.50}
]
)
img.save("tests/fixtures/images/documents/invoice_synthetic.jpg")
with open("tests/fixtures/ground_truth/documents/invoice_synthetic.json", "w") as f:
json.dump(truth, f)Unit Testing Vision Model Integration Code
Unit tests mock the vision API and test your wrapper logic:
# tests/unit/test_vision_client.py
import pytest
import base64
from unittest.mock import patch, MagicMock
from pathlib import Path
def encode_image_b64(path: str) -> str:
with open(path, "rb") as f:
return base64.b64encode(f.read()).decode("utf-8")
class TestVisionClient:
def test_sends_image_as_base64(self):
"""Verify image is encoded correctly before sending to API."""
from myapp.vision import VisionClient
with patch("openai.resources.chat.completions.Completions.create") as mock_create:
mock_create.return_value = MagicMock(
choices=[MagicMock(message=MagicMock(content="An invoice for $247.50"))]
)
client = VisionClient(api_key="test-key")
result = client.describe_image(
"tests/fixtures/images/documents/invoice_simple.jpg",
prompt="Extract the total amount from this invoice."
)
# Verify the API call structure
call_messages = mock_create.call_args.kwargs["messages"]
image_content = call_messages[0]["content"][1]
assert image_content["type"] == "image_url"
assert image_content["image_url"]["url"].startswith("data:image/")
def test_handles_content_policy_refusal(self):
"""Should handle refusal gracefully and return structured error."""
from myapp.vision import VisionClient, ContentPolicyError
with patch("openai.resources.chat.completions.Completions.create") as mock_create:
mock_create.return_value = MagicMock(
choices=[MagicMock(
message=MagicMock(
content="I'm sorry, I can't help with that.",
refusal="Content policy violation"
),
finish_reason="content_filter"
)]
)
client = VisionClient(api_key="test-key")
with pytest.raises(ContentPolicyError):
client.describe_image("image.jpg", prompt="Describe this image.")
def test_url_image_not_re_encoded(self):
"""Images provided as URLs should be sent as URLs, not downloaded and re-encoded."""
from myapp.vision import VisionClient
with patch("openai.resources.chat.completions.Completions.create") as mock_create:
mock_create.return_value = MagicMock(
choices=[MagicMock(message=MagicMock(content="A product photo"))]
)
client = VisionClient(api_key="test-key")
result = client.describe_image(
"https://example.com/product.jpg",
prompt="Describe this product."
)
call_messages = mock_create.call_args.kwargs["messages"]
image_content = call_messages[0]["content"][1]
assert image_content["image_url"]["url"] == "https://example.com/product.jpg"
def test_max_tokens_controls_output_length(self):
"""max_tokens should be configurable and passed through."""
from myapp.vision import VisionClient
with patch("openai.resources.chat.completions.Completions.create") as mock_create:
mock_create.return_value = MagicMock(
choices=[MagicMock(message=MagicMock(content="Short desc"))]
)
client = VisionClient(api_key="test-key")
client.describe_image("image.jpg", prompt="Describe.", max_tokens=50)
assert mock_create.call_args.kwargs["max_tokens"] == 50Evaluating Structured Output Extraction
For structured tasks (invoice extraction, form parsing), compare extracted JSON against ground truth:
# tests/integration/test_invoice_extraction.py
import pytest
import json
from deepdiff import DeepDiff
class TestInvoiceExtraction:
EXTRACTION_CASES = [
"invoice_simple",
"invoice_handwritten",
"invoice_low_quality",
]
@pytest.mark.integration
@pytest.mark.parametrize("fixture_name", EXTRACTION_CASES)
def test_invoice_extraction_accuracy(self, fixture_name):
from myapp.vision import InvoiceExtractor
extractor = InvoiceExtractor(api_key=os.environ["OPENAI_API_KEY"])
with open(f"tests/fixtures/ground_truth/documents/{fixture_name}.json") as f:
expected = json.load(f)
result = extractor.extract(
f"tests/fixtures/images/documents/{fixture_name}.jpg"
)
# Total amount must be exactly correct (or within rounding)
assert abs(result["total"] - expected["total"]) < 0.01, (
f"Total mismatch: expected {expected['total']}, got {result['total']}"
)
# All line items must be present
expected_descriptions = {item["description"] for item in expected["items"]}
result_descriptions = {item["description"] for item in result.get("items", [])}
missing = expected_descriptions - result_descriptions
assert not missing, f"Missing line items: {missing}"
def test_extraction_returns_valid_json(self):
"""Model output must be parseable JSON, not prose."""
from myapp.vision import InvoiceExtractor
with patch("openai.resources.chat.completions.Completions.create") as mock_create:
# Simulate model returning prose instead of JSON
mock_create.return_value = MagicMock(
choices=[MagicMock(message=MagicMock(
content="The invoice total is $247.50 for two items."
))]
)
extractor = InvoiceExtractor(api_key="test-key")
with pytest.raises(Exception): # Should fail to parse
extractor.extract("invoice.jpg")Regression Testing Across Model Versions
Model upgrades can change output even when your prompts don't change. Capture baseline outputs and compare:
# tests/regression/capture_baselines.py
"""Run this script to capture golden outputs for regression testing."""
import json
import os
from myapp.vision import VisionClient
REGRESSION_FIXTURES = [
("tests/fixtures/images/charts/bar_chart_simple.png",
"Describe the data shown in this bar chart."),
("tests/fixtures/images/documents/invoice_simple.jpg",
"Extract all text from this invoice."),
]
def capture_baseline(output_dir: str = "tests/regression/baselines"):
os.makedirs(output_dir, exist_ok=True)
client = VisionClient(api_key=os.environ["OPENAI_API_KEY"])
for image_path, prompt in REGRESSION_FIXTURES:
name = os.path.basename(image_path).replace(".", "_")
result = client.describe_image(image_path, prompt=prompt)
baseline = {
"image": image_path,
"prompt": prompt,
"output": result,
"model": "gpt-4o",
"captured_at": "2026-05-19"
}
with open(f"{output_dir}/{name}.json", "w") as f:
json.dump(baseline, f, indent=2)
print(f"Captured baseline for {name}")
# tests/regression/test_regression.py
@pytest.mark.regression
def test_chart_description_regression():
"""Chart description output should not change significantly between runs."""
from myapp.vision import VisionClient
with open("tests/regression/baselines/bar_chart_simple_png.json") as f:
baseline = json.load(f)
client = VisionClient(api_key=os.environ["OPENAI_API_KEY"])
current_output = client.describe_image(
baseline["image"],
prompt=baseline["prompt"]
)
# Check key facts are still present (model wording may vary)
baseline_facts = extract_key_facts(baseline["output"])
current_facts = extract_key_facts(current_output)
missing_facts = baseline_facts - current_facts
assert not missing_facts, f"Regression: these facts disappeared: {missing_facts}"Gemini Vision: Key Differences
Gemini's multimodal API handles images natively rather than via base64 URLs:
import google.generativeai as genai
from PIL import Image
def test_gemini_vision_invoice():
"""Gemini Vision accepts PIL Image objects directly."""
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")
image = Image.open("tests/fixtures/images/documents/invoice_simple.jpg")
response = model.generate_content([
"Extract the total amount from this invoice. Return only the number.",
image
])
# Parse the number from response
total_str = response.text.strip().replace("$", "").replace(",", "")
total = float(total_str)
assert abs(total - 247.50) < 0.01, f"Expected 247.50, got {total}"CI Pipeline for Vision Tests
# .github/workflows/vision-quality.yml
name: Vision AI Quality
on:
pull_request:
paths:
- "myapp/vision/**"
- "tests/fixtures/images/**"
schedule:
- cron: "0 7 * * 1" # Weekly on Monday
jobs:
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install pytest pillow openai
- run: pytest tests/unit/ -v
integration-tests:
runs-on: ubuntu-latest
if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'needs-integration')
steps:
- uses: actions/checkout@v4
- run: pip install pytest pillow openai google-generativeai deepdiff
- name: Run extraction quality tests
env:
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
run: pytest tests/integration/ -v -m integrationProduction Monitoring
Vision model quality can drift when your upstream provider updates their model. HelpMeTest lets you run scheduled tests that submit reference images and assert on output content — no infrastructure required. Set up a daily test that verifies your invoice extraction still returns the correct total for a known fixture, and get alerted when it breaks.
Conclusion
Testing vision-language models requires richer fixtures than text-only AI — you need diverse images that cover your production conditions. Layer unit tests (for integration code) with integration tests (for real model quality) and regression tests (to detect upstream model changes). For structured extraction tasks, ground truth JSON gives you precise quality metrics. For descriptive tasks, key fact extraction from baseline outputs provides a practical comparison method even when model wording varies.