Testing

Testing PDF Generation with Puppeteer: Visual Diffs, Text Extraction, and Layout Assertions

HelpMeTest

17 May 2026 — 6 min read

Puppeteer's page.pdf() generates PDFs from HTML pages — but verifying the output requires more than file existence checks. This guide covers testing PDF content via text extraction, visual regression diffs, layout assertions, and metadata checks using Node.js, Puppeteer, and the pdf-parse library.

Key Takeaways

Use pdf-parse to extract text for content assertions. Puppeteer produces binary PDF files — pdf-parse extracts the text layer so you can assert on content without visual comparison.

Visual snapshot tests catch layout regressions. Render the PDF as a PNG using pdf2pic or pdftoppm, then compare against a stored baseline. These catch font changes, spacing regressions, and element overflow.

Assert on PDF metadata separately from content. Page count, author, title, and creator are metadata — verify them independently of content assertions.

Use a headless Chromium in CI with a fixed version. Puppeteer's bundled Chromium version affects PDF rendering. Pin the Puppeteer version and use a fixed Chromium executable path in CI to get deterministic output.

Test the HTML source, not just the PDF. Most PDF rendering bugs originate in CSS or HTML. Print-specific CSS (@media print, page-break-*) is the most common source of failures — test it directly before asserting on PDF output.

Why Puppeteer PDF Tests Are Hard

Puppeteer generates PDFs by printing a webpage with headless Chromium. The output is a binary blob. Asserting that "the invoice shows the correct total" requires either:

Extracting text from the PDF and matching against expected strings
Rendering the PDF as images and comparing pixels against a baseline
Parsing PDF structure using a library like pdf-lib or pdfjs-dist

Each approach has different trade-offs. This guide covers all three.

Setup

npm install puppeteer pdf-parse
npm install --save-dev jest pdf2pic

A minimal PDF generation function:

// src/pdf/generator.js
import puppeteer from 'puppeteer';

export async function generateInvoicePdf(invoiceData) {
  const browser = await puppeteer.launch({
    headless: 'new',
    args: ['--no-sandbox', '--disable-setuid-sandbox'],
  });

  const page = await browser.newPage();

  const html = renderInvoiceHtml(invoiceData);
  await page.setContent(html, { waitUntil: 'networkidle0' });

  const pdfBuffer = await page.pdf({
    format: 'A4',
    printBackground: true,
    margin: { top: '20mm', bottom: '20mm', left: '15mm', right: '15mm' },
  });

  await browser.close();
  return pdfBuffer;
}

Text Extraction Tests

Use pdf-parse to extract the text layer and assert on content:

// src/__tests__/invoice-pdf.test.js
import pdfParse from 'pdf-parse';
import { generateInvoicePdf } from '../pdf/generator.js';

const sampleInvoice = {
  invoiceNumber: 'INV-2026-001',
  date: '2026-05-17',
  client: { name: 'Acme Corp', address: '123 Main St, New York, NY 10001' },
  lineItems: [
    { description: 'Web Development Services', quantity: 40, rate: 150, total: 6000 },
    { description: 'Design Consultation', quantity: 5, rate: 200, total: 1000 },
  ],
  subtotal: 7000,
  tax: 700,
  total: 7700,
};

describe('Invoice PDF generation', () => {
  let pdfBuffer;
  let pdfData;

  beforeAll(async () => {
    pdfBuffer = await generateInvoicePdf(sampleInvoice);
    pdfData = await pdfParse(pdfBuffer);
  }, 30000); // PDF generation is slow

  it('generates a valid PDF buffer', () => {
    expect(pdfBuffer).toBeInstanceOf(Buffer);
    expect(pdfBuffer.length).toBeGreaterThan(1000);
    // PDF files start with '%PDF'
    expect(pdfBuffer.toString('ascii', 0, 4)).toBe('%PDF');
  });

  it('contains invoice number', () => {
    expect(pdfData.text).toContain('INV-2026-001');
  });

  it('contains client name and address', () => {
    expect(pdfData.text).toContain('Acme Corp');
    expect(pdfData.text).toContain('123 Main St');
  });

  it('contains all line items', () => {
    expect(pdfData.text).toContain('Web Development Services');
    expect(pdfData.text).toContain('Design Consultation');
  });

  it('shows correct totals', () => {
    expect(pdfData.text).toContain('7,000');
    expect(pdfData.text).toContain('700');
    expect(pdfData.text).toContain('7,700');
  });

  it('has correct page count', () => {
    expect(pdfData.numpages).toBe(1);
  });
});

Page Count Assertions

For multi-page documents, page count is a critical assertion:

describe('Multi-page reports', () => {
  it('long item list generates multiple pages', async () => {
    const report = {
      ...sampleInvoice,
      lineItems: Array.from({ length: 50 }, (_, i) => ({
        description: `Line item ${i + 1}`,
        quantity: 1,
        rate: 100,
        total: 100,
      })),
    };

    const pdf = await generateInvoicePdf(report);
    const data = await pdfParse(pdf);

    expect(data.numpages).toBeGreaterThan(1);
  });

  it('single-item invoice fits on one page', async () => {
    const minimal = {
      ...sampleInvoice,
      lineItems: [{ description: 'Single Service', quantity: 1, rate: 500, total: 500 }],
    };

    const pdf = await generateInvoicePdf(minimal);
    const data = await pdfParse(pdf);

    expect(data.numpages).toBe(1);
  });
});

PDF Metadata Tests

Puppeteer's page.pdf() accepts metadata options. Test that metadata is set correctly:

export async function generateInvoicePdf(invoiceData) {
  // ... browser/page setup ...

  const pdfBuffer = await page.pdf({
    format: 'A4',
    printBackground: true,
    displayHeaderFooter: false,
    // Chromium doesn't expose PDF metadata via page.pdf() directly,
    // but pdf-lib can be used to set it post-generation
  });

  // Post-process with pdf-lib to set metadata
  const { PDFDocument } = await import('pdf-lib');
  const doc = await PDFDocument.load(pdfBuffer);
  doc.setTitle(`Invoice ${invoiceData.invoiceNumber}`);
  doc.setAuthor('HelpMeTest Billing System');
  doc.setCreator('HelpMeTest');
  doc.setCreationDate(new Date(invoiceData.date));

  return Buffer.from(await doc.save());
}

Testing metadata with pdf-parse:

it('has correct PDF metadata', async () => {
  const data = await pdfParse(pdfBuffer);

  expect(data.info.Title).toBe('Invoice INV-2026-001');
  expect(data.info.Author).toBe('HelpMeTest Billing System');
  expect(data.info.Creator).toBe('HelpMeTest');
});

Visual Snapshot Tests

Text extraction doesn't catch layout problems — elements overlapping, wrong fonts, or broken table borders. Visual snapshots compare pixel-by-pixel:

// src/__tests__/invoice-pdf-visual.test.js
import { fromBuffer } from 'pdf2pic';
import { existsSync, mkdirSync, readFileSync, writeFileSync } from 'fs';
import path from 'path';

const SNAPSHOTS_DIR = path.join(process.cwd(), 'tests/snapshots/pdf');

async function renderPdfPage(pdfBuffer, pageNumber = 1) {
  const convert = fromBuffer(pdfBuffer, {
    density: 150,         // 150 DPI for reasonable file size
    format: 'png',
    width: 794,           // A4 at 96 DPI
    height: 1123,
    preserveAspectRatio: true,
  });

  const result = await convert(pageNumber, { responseType: 'buffer' });
  return result.buffer;
}

describe('Invoice PDF visual regression', () => {
  const SNAPSHOT_PATH = path.join(SNAPSHOTS_DIR, 'invoice-page1.png');

  beforeAll(() => {
    mkdirSync(SNAPSHOTS_DIR, { recursive: true });
  });

  it('matches stored visual snapshot', async () => {
    const pdf = await generateInvoicePdf(sampleInvoice);
    const rendered = await renderPdfPage(pdf, 1);

    if (!existsSync(SNAPSHOT_PATH)) {
      // First run: save baseline
      writeFileSync(SNAPSHOT_PATH, rendered);
      console.log('Baseline snapshot created:', SNAPSHOT_PATH);
      return;
    }

    const baseline = readFileSync(SNAPSHOT_PATH);

    // Simple pixel comparison — in practice use pixelmatch or jest-image-snapshot
    const { default: pixelmatch } = await import('pixelmatch');
    const { PNG } = await import('pngjs');

    const img1 = PNG.sync.read(baseline);
    const img2 = PNG.sync.read(rendered);

    expect(img1.width).toBe(img2.width);
    expect(img1.height).toBe(img2.height);

    const numDiffPixels = pixelmatch(
      img1.data, img2.data,
      null,
      img1.width, img1.height,
      { threshold: 0.1 }
    );

    // Allow up to 100 pixels of difference (font rendering variation)
    expect(numDiffPixels).toBeLessThan(100);
  }, 30000);
});

Using jest-image-snapshot simplifies this considerably:

import { toMatchImageSnapshot } from 'jest-image-snapshot';
expect.extend({ toMatchImageSnapshot });

it('matches stored visual snapshot', async () => {
  const pdf = await generateInvoicePdf(sampleInvoice);
  const rendered = await renderPdfPage(pdf, 1);

  expect(rendered).toMatchImageSnapshot({
    failureThreshold: 0.01,
    failureThresholdType: 'percent',
    customSnapshotsDir: 'tests/snapshots/pdf',
  });
}, 30000);

Testing Print CSS Before PDF Generation

Most PDF layout bugs originate in print CSS. Test the HTML rendering in print mode directly in Puppeteer before asserting on PDF output:

describe('Invoice print layout', () => {
  let browser;
  let page;

  beforeAll(async () => {
    browser = await puppeteer.launch({ headless: 'new' });
    page = await browser.newPage();
    await page.setContent(renderInvoiceHtml(sampleInvoice), { waitUntil: 'networkidle0' });
    await page.emulateMediaType('print');
  });

  afterAll(() => browser.close());

  it('line items table is visible in print media', async () => {
    const tableVisible = await page.$eval('table.line-items', el => {
      const style = window.getComputedStyle(el);
      return style.display !== 'none';
    });
    expect(tableVisible).toBe(true);
  });

  it('total section does not overflow page', async () => {
    const totalEl = await page.$('.invoice-total');
    const box = await totalEl.boundingBox();

    // A4 width at 96 DPI is ~794px
    expect(box.x + box.width).toBeLessThanOrEqual(794);
  });

  it('page breaks don\'t split table rows', async () => {
    // Check that no table row has page-break-inside: avoid violated
    const badRows = await page.$$eval('table.line-items tr', rows =>
      rows.filter(row => {
        const style = window.getComputedStyle(row);
        return style.pageBreakInside === 'auto';
      }).length
    );
    expect(badRows).toBe(0);
  });
});

Error Cases

Test what happens when PDF generation fails:

describe('PDF generation error handling', () => {
  it('throws on invalid HTML template', async () => {
    const invalidData = { ...sampleInvoice, client: null };

    await expect(generateInvoicePdf(invalidData))
      .rejects.toThrow();
  });

  it('handles very long client names without overflow', async () => {
    const longNameData = {
      ...sampleInvoice,
      client: {
        ...sampleInvoice.client,
        name: 'A'.repeat(200),
      },
    };

    // Should not throw
    const pdf = await generateInvoicePdf(longNameData);
    expect(pdf).toBeInstanceOf(Buffer);

    const data = await pdfParse(pdf);
    expect(data.text).toContain('A'.repeat(100));
  });
});

CI Configuration

PDF generation requires Chromium. In CI, either use Puppeteer's bundled Chromium or install it explicitly:

name: PDF Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with: { node-version: '20' }

      # Install Chromium dependencies
      - name: Install Chromium dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y \
            libgbm-dev libnss3-dev libatk-bridge2.0-dev \
            libdrm-dev libxkbcommon-dev libxcomposite-dev \
            libxdamage-dev libxrandr-dev libgtk-3-dev libasound-dev

      - run: npm ci

      # Install PDF rendering dependencies for visual tests
      - run: sudo apt-get install -y poppler-utils

      - run: npm test
        env:
          PUPPETEER_SKIP_CHROMIUM_DOWNLOAD: false

For deterministic rendering, pin Puppeteer to a specific version and set PUPPETEER_EXECUTABLE_PATH if using a system Chromium.

What to Test

Test type	What it catches
File header (`%PDF`)	Generator produces a valid PDF binary
Text extraction	Required content is present in the text layer
Page count	Layout doesn't overflow/underflow pages unexpectedly
Visual snapshot	Layout regressions — spacing, font, table borders
Print CSS assertions	CSS bugs before they reach PDF output
Metadata	Title, author, creator fields set correctly
Error cases	Generator doesn't crash on edge-case data

For end-to-end testing of PDF download flows — clicking a "Download Invoice" button and verifying the file downloads correctly — HelpMeTest covers the browser interaction layer that Puppeteer unit tests can't easily replicate.