Flutter Golden File Tests: Screenshot Comparison Testing

HelpMeTest

15 May 2026 — 4 min read

Flutter golden tests capture a screenshot of a widget and compare it pixel-by-pixel against a reference image (the "golden file"). If the screenshot changes, the test fails. This catches visual regressions — layout shifts, color changes, font differences — that behavioral tests miss entirely. This guide covers writing golden tests, generating reference images, and handling the CI challenges that come with pixel-perfect comparison.

Key Takeaways

expectLater(find.byType(MyWidget), matchesGoldenFile('my_widget.png')) is the core assertion. The path is relative to the test file. Running with --update-goldens creates or overwrites the file. A normal run compares against it and fails if the file is missing.

Golden files must be committed to your repository. They're the reference images. Keep them in a goldens/ subfolder next to your test files.

Platform rendering differs. A golden generated on macOS can fail on Linux CI because fonts and anti-aliasing render differently. Generate goldens on the same OS as your CI, for example in a Docker container or in the CI job itself.

golden_toolkit makes multi-device and multi-theme testing easy. Test the same widget across its built-in Device.phone, Device.iphone11, and Device.tabletLandscape presets, or your own Device definitions, in one call.

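Increase tolerance for CI flakiness. flutter_test has no built-in tolerance setting. If pixel-perfect comparison is too strict for your workflow, assign goldenFileComparator a custom LocalFileComparator subclass that accepts a small mismatch (for example 1%).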

What Golden Tests Catch

Golden tests catch visual regressions that no behavioral test can:

A padding or alignment change shifted all text 2px left
A refactor changed a container's background color
An icon was replaced with a similar but different one
Responsive layout broke at a specific width
Dark mode colors are slightly off

Behavioral tests only verify that "the button exists" or "clicking it calls the right function." Golden tests verify "the button looks like this."

Basic Golden Test

// test/widgets/user_card_golden_test.dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';
import 'package:my_app/widgets/user_card.dart';
import 'package:my_app/models/user.dart';

void main() {
  testWidgets('UserCard matches golden file', (tester) async {
    await tester.pumpWidget(
      MaterialApp(
        theme: ThemeData.light(),
        home: Scaffold(
          body: UserCard(
            user: User(
              id: '1',
              name: 'Alice Johnson',
              email: 'alice@example.com',
              role: UserRole.admin,
            ),
          ),
        ),
      ),
    );

    await expectLater(
      find.byType(UserCard),
      matchesGoldenFile('goldens/user_card_light.png'),
    );
  });
}

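Flutter does not create the golden file on a normal run. Run the test once with --update-goldens to generate it. On subsequent runs, Flutter compares the rendered widget against that file. If the widget looks different, the test fails and writes diff images.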

Generating Golden Files

# Generate (or update) golden files
flutter test --update-goldens test/widgets/user_card_golden_test.dart

# Run golden tests (compare against existing goldens)
flutter test test/widgets/user_card_golden_test.dart

Commit the generated .png files to your repository. They are the reference images for future comparison.

Testing Multiple Variants

testWidgets('UserCard dark theme golden', (tester) async {
  await tester.pumpWidget(
    MaterialApp(
      theme: ThemeData.dark(),
      home: Scaffold(
        body: UserCard(user: aliceUser),
      ),
    ),
  );

  await expectLater(
    find.byType(UserCard),
    matchesGoldenFile('goldens/user_card_dark.png'),
  );
});

testWidgets('UserCard admin badge golden', (tester) async {
  await tester.pumpWidget(
    MaterialApp(
      home: Scaffold(
        body: UserCard(user: adminUser),  // has admin badge
      ),
    ),
  );

  await expectLater(
    find.byType(UserCard),
    matchesGoldenFile('goldens/user_card_admin.png'),
  );
});
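The variants above repeat the same pump-and-compare steps, so they can also be generated from a table. This is a minimal sketch, assuming the same UserCard widget and User model as the basic example. The goldenPath helper and the themes map are hypothetical names introduced here for illustration.

```dart
import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';
import 'package:my_app/widgets/user_card.dart';
import 'package:my_app/models/user.dart';

// Hypothetical helper: builds the golden path for a widget/variant pair.
String goldenPath(String widget, String variant) =>
    'goldens/${widget}_$variant.png';

void main() {
  // Assumed fixture, mirroring the user from the basic example.
  final aliceUser = User(
    id: '1',
    name: 'Alice Johnson',
    email: 'alice@example.com',
    role: UserRole.admin,
  );

  // One entry per visual variant to capture.
  final themes = <String, ThemeData>{
    'light': ThemeData.light(),
    'dark': ThemeData.dark(),
  };

  for (final entry in themes.entries) {
    testWidgets('UserCard ${entry.key} theme golden', (tester) async {
      await tester.pumpWidget(
        MaterialApp(
          theme: entry.value,
          home: Scaffold(body: UserCard(user: aliceUser)),
        ),
      );

      await expectLater(
        find.byType(UserCard),
        matchesGoldenFile(goldenPath('user_card', entry.key)),
      );
    });
  }
}
```

Adding a variant then means adding one map entry and regenerating goldens, which keeps file names consistent across variants.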

golden_toolkit for Multi-Device Testing

golden_toolkit makes it easy to test across multiple screen sizes:

# pubspec.yaml
dev_dependencies:
  golden_toolkit: ^0.15.0

import 'package:flutter_test/flutter_test.dart';
import 'package:golden_toolkit/golden_toolkit.dart';

void main() {
  testGoldens('UserCard on multiple devices', (tester) async {
    await multiScreenGolden(
      tester,
      'user_card_devices',
      widget: MaterialApp(
        home: Scaffold(
          body: UserCard(user: aliceUser),
        ),
      ),
      devices: [
        Device.phone,
        Device.iphone11,
        Device.tabletLandscape,
      ],
    );
  });
}

This creates one golden file per device size, named user_card_devices.phone.png, user_card_devices.iphone11.png, etc.
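The built-in presets cover only a few sizes. If you need others, golden_toolkit's Device class can be instantiated directly. The sketch below assumes the Device constructor's name, size, and brightness parameters. The device names and logical sizes are hypothetical and do not describe any real handset.

```dart
import 'package:flutter/material.dart';
import 'package:golden_toolkit/golden_toolkit.dart';

// Hypothetical device definitions; names and sizes are illustrative only.
const smallPhone = Device(name: 'small_phone', size: Size(320, 568));

const largeTabletDark = Device(
  name: 'large_tablet_dark',
  size: Size(1366, 1024),
  brightness: Brightness.dark, // renders this device with dark platform brightness
);

// Pass this list as the `devices:` argument of multiScreenGolden.
const customDevices = <Device>[Device.phone, smallPhone, largeTabletDark];
```

Each device's name becomes part of the golden file name, so smallPhone would produce user_card_devices.small_phone.png in the example above.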

Pump Options

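By default, widget tests render text with a placeholder test font, so goldens show blocks instead of readable glyphs. golden_toolkit's loadAppFonts() loads the fonts declared in your app so text renders as it does in production, and pumpWidgetBuilder wraps the widget in an app shell with a fixed surface size: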

void main() {
  // Load fonts for golden tests
  setUpAll(() async {
    await loadAppFonts();
  });

  testGoldens('UserCard with custom fonts', (tester) async {
    await tester.pumpWidgetBuilder(
      UserCard(user: aliceUser),
      surfaceSize: const Size(400, 200),
    );

    await screenMatchesGolden(tester, 'user_card_with_fonts');
  });
}

Handling Platform Differences

Golden files are platform-specific. Fonts render differently on macOS vs Linux vs Windows. A golden generated on macOS will fail on Ubuntu CI.

Solution 1: Generate goldens on CI

Add a manually triggered step that regenerates goldens on the CI platform (Linux):

# .github/workflows/golden-tests.yml
- name: Update golden files (Linux only)
  if: github.event_name == 'workflow_dispatch'
  run: flutter test --update-goldens test/

- name: Run golden tests
  run: flutter test test/

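Files regenerated on CI only become the new reference once they are committed back to the repository. For local work, generate goldens only on Linux, or use Docker to match your CI environment.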
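If developers on macOS or Windows also run the suite, one option is to skip golden assertions everywhere except the platform that produced the reference images. This is a minimal sketch, assuming goldens are generated on Linux. The widget under test and the file name hello_linux.png are placeholders, and runGoldens is a hypothetical name.

```dart
import 'dart:io' show Platform;

import 'package:flutter/material.dart';
import 'package:flutter_test/flutter_test.dart';

// Assumption: reference images were generated on Linux CI.
final bool runGoldens = Platform.isLinux;

void main() {
  testWidgets(
    'placeholder golden (Linux only)',
    (tester) async {
      await tester.pumpWidget(
        const MaterialApp(home: Scaffold(body: Text('Hello'))),
      );

      await expectLater(
        find.byType(Scaffold),
        matchesGoldenFile('goldens/hello_linux.png'),
      );
    },
    // Skipped tests are reported as skipped rather than failed.
    skip: !runGoldens,
  );
}
```

The trade-off is that visual regressions are only caught on CI, so this works best when CI runs on every pull request.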

Solution 2: Use a custom comparator with tolerance

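// test/flutter_test_config.dart (applies to all tests in the directory)
import 'dart:async';
import 'package:flutter_test/flutter_test.dart';

// TolerantGoldenFileComparator is not part of flutter_test. It is a custom
// LocalFileComparator subclass that you define and import yourself.

Future<void> testExecutable(FutureOr<void> Function() testMain) async {
  final defaultComparator = goldenFileComparator;
  if (defaultComparator is LocalFileComparator) {
    // Allow up to 0.5% pixel difference (reduces platform-specific failures)
    goldenFileComparator = TolerantGoldenFileComparator(
      // Reuse the running test's directory so relative golden paths still resolve.
      defaultComparator.basedir.resolve('placeholder_test.dart'),
      failurePercent: 0.005,
    );
  }
  await testMain();
}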

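TolerantGoldenFileComparator does not ship with flutter_test; it is a class you write yourself. A comparator that accepts small pixel differences absorbs minor anti-aliasing and font rendering variations between platforms, at the cost of also hiding real regressions that fall below the threshold.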
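One way to write such a comparator is to extend LocalFileComparator and override compare. The sketch below follows the shape of the default implementation and adds one threshold check. The class name and the failurePercent parameter are this article's own names, not Flutter API.

```dart
import 'dart:typed_data';

import 'package:flutter/foundation.dart';
import 'package:flutter_test/flutter_test.dart';

/// Custom comparator (not part of flutter_test) that passes when the
/// fraction of differing pixels is at or below [failurePercent].
class TolerantGoldenFileComparator extends LocalFileComparator {
  TolerantGoldenFileComparator(
    super.testFile, {
    required this.failurePercent,
  });

  /// Allowed mismatch as a fraction: 0.005 means 0.5% of pixels.
  final double failurePercent;

  @override
  Future<bool> compare(Uint8List imageBytes, Uri golden) async {
    final result = await GoldenFileComparator.compareLists(
      imageBytes,
      await getGoldenBytes(golden),
    );

    if (result.passed || result.diffPercent <= failurePercent) {
      return true;
    }

    // Write the diff images, then fail with the generated message.
    final error = await generateFailureOutput(result, golden, basedir);
    throw FlutterError(error);
  }
}
```

Start with a threshold as low as your CI tolerates and raise it only when you see failures caused by rendering noise rather than by real changes.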

Golden Diffs on Failure

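When a golden test fails, Flutter reports how many pixels differ and writes diff images to a failures/ directory next to the test file. For the basic example above, the result looks roughly like this (an illustrative summary, not verbatim tool output):

Golden test failure: goldens/user_card_light.png
  Pixel count: 12,345
  Mismatch: 234 pixels (1.9%)

  Output files:
  - test/widgets/failures/user_card_light_masterImage.png
  - test/widgets/failures/user_card_light_testImage.png
  - test/widgets/failures/user_card_light_isolatedDiff.png
  - test/widgets/failures/user_card_light_maskedDiff.png

The masterImage.png is the committed golden and the testImage.png is the current render. The isolatedDiff.png shows only changed pixels. The maskedDiff.png overlays changes on the original. Use these to understand what changed.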

CI Setup

# .github/workflows/flutter-tests.yml
- name: Run flutter tests with goldens
  run: flutter test test/

- name: Upload golden diffs on failure
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: golden-failures
    # Flutter writes diffs to a failures/ directory next to each test file
    path: test/**/failures/

Upload the failure artifacts so you can inspect what changed without pulling the branch locally.

When NOT to Use Golden Tests

Golden tests are high-maintenance:

Every intentional UI change requires updating goldens
CI platform differences cause false failures
They're slower than behavioral tests

Use golden tests for:

Components with complex visual states (charts, custom painters)
Brand-critical UI (login screens, onboarding)
Components where small visual changes matter (color palettes, icon sets)

Avoid golden tests for:

Every single widget
Components that change frequently
Functional behavior that behavioral tests can verify

Combining Test Types

A good Flutter test strategy:

Unit tests — business logic (fast, no flakiness)
Widget tests — behavioral widget testing (fast)
Golden tests — critical visual components (medium, needs maintenance)
Integration tests — user flows on device (slow, run in CI only)
Production monitoring — HelpMeTest for live API/backend testing

Golden tests fill the visual gap between widget tests and manual review. Use them selectively for the components where appearance matters most.