Apache PDFBox Testing Tutorial: Parsing, Comparing, and Asserting PDF Content

Apache PDFBox is an open-source Java library for reading and writing PDFs. Unlike iText, which is dual-licensed under the AGPL and a commercial license, PDFBox is released under the permissive Apache License 2.0. That makes it a good fit for testing PDFs regardless of how they were generated: iText, wkhtmltopdf, Puppeteer, or any other tool. This guide shows how to parse, compare, and assert on PDF content using PDFBox in JUnit 5 tests.

Key Takeaways

PDFBox works as a test reader for any PDF generator. You can use PDFBox to test PDFs generated by iText, WeasyPrint, Puppeteer, or any other tool — it's the reader, not the writer.

PDFTextStripper extracts text, with optional region control. PDFTextStripper extracts text from the whole document; PDFTextStripperByArea extracts from specific rectangular regions.

Test form fields explicitly. PDFBox provides direct access to AcroForm fields — check field names, types, and current values in test assertions.

Image and annotation testing requires traversing the page structure. Images are found by walking the page content stream for XObject draw operations, while annotations such as links are read from PDPage.getAnnotations(). Iterate both to verify visual content beyond text.

PDFBox 3.x changed parts of the API. Most visibly, documents are now loaded with Loader.loadPDF() instead of PDDocument.load(), so check your version before copying examples.

Why PDFBox for Testing

Apache PDFBox is commonly used as the reader in a test setup, even when the PDF was generated by a different library. Its advantages:

  • Apache 2.0 license — no commercial restrictions in tests
  • Rich introspection API — text, fonts, images, annotations, form fields, bookmarks
  • Per-region text extraction — extract text from specific page areas
  • AcroForm support — read and write PDF form field values
  • Rendering — render pages to BufferedImage for visual tests

Maven Setup

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.2</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.junit.jupiter</groupId>
    <artifactId>junit-jupiter</artifactId>
    <version>5.10.0</version>
    <scope>test</scope>
</dependency>

Loading PDFs in Tests

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.junit.jupiter.api.*;
import static org.junit.jupiter.api.Assertions.*;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

class PdfContentTest {

    private PDDocument document;

    @BeforeEach
    void setUp() throws IOException {
        // Option 1: load a fixture from disk
        document = Loader.loadPDF(Path.of("src/test/resources/sample-invoice.pdf").toFile());

        // Option 2: load bytes produced by your own generator instead.
        // Use one option or the other; assigning both would leak the first document.
        // byte[] pdfBytes = invoiceGenerator.generate(sampleInvoice);
        // document = Loader.loadPDF(pdfBytes);
    }

    @AfterEach
    void tearDown() throws IOException {
        if (document != null) {
            document.close();
        }
    }
}

Note: PDFBox 3.x loads documents with Loader.loadPDF(). The PDDocument.load() methods from 2.x were removed in 3.0, not merely deprecated.

Text Extraction

Full Document Text

import org.apache.pdfbox.text.PDFTextStripper;

@Test
void documentContainsInvoiceHeader() throws IOException {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);

    assertTrue(text.contains("Invoice #INV-2026-001"));
    assertTrue(text.contains("Acme Corp"));
}

@Test
void totalAmountIsPresentInDocument() throws IOException {
    PDFTextStripper stripper = new PDFTextStripper();
    String text = stripper.getText(document);

    assertTrue(text.contains("7,700.00") || text.contains("$7700"),
        "Document should show the invoice total");
}
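
The contains("7,700.00") || contains("$7700") check above has to anticipate every formatting variant by hand. One alternative, sketched here with only the JDK, is to pull every currency-like token out of the extracted text and compare numerically. AmountFinder and its methods are hypothetical helpers, not part of PDFBox, and the pattern assumes US-style thousands separators and a two-digit decimal part.

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: finds money-like numbers in text extracted from a PDF.
final class AmountFinder {

    // Optional "$", then digits with or without thousands separators,
    // then an optional two-digit decimal part (e.g. 7,700.00 or $7700).
    private static final Pattern AMOUNT =
        Pattern.compile("\\$?(\\d{1,3}(?:,\\d{3})+|\\d+)(\\.\\d{2})?");

    // Returns every number-like token in the text as a BigDecimal.
    static List<BigDecimal> findAmounts(String text) {
        List<BigDecimal> amounts = new ArrayList<>();
        Matcher matcher = AMOUNT.matcher(text);
        while (matcher.find()) {
            String digits = matcher.group(1).replace(",", "");
            String decimals = matcher.group(2) == null ? "" : matcher.group(2);
            amounts.add(new BigDecimal(digits + decimals));
        }
        return amounts;
    }

    // True if any token equals the expected value numerically (7700 == 7700.00).
    static boolean containsAmount(String text, String expected) {
        BigDecimal target = new BigDecimal(expected);
        return findAmounts(text).stream().anyMatch(a -> a.compareTo(target) == 0);
    }
}
```

In a test this could read assertTrue(AmountFinder.containsAmount(stripper.getText(document), "7700.00")). Note that it also picks up other digit runs, such as the year in an invoice number, so it suits "is this value present" checks rather than exact listings.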

Per-Page Text Extraction

@Test
void firstPageContainsHeaderInformation() throws IOException {
    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(1);
    stripper.setEndPage(1);

    String firstPageText = stripper.getText(document);

    assertTrue(firstPageText.contains("INV-2026-001"));
    assertTrue(firstPageText.contains("Acme Corp"));
}

@Test
void lastPageContainsTotalsSection() throws IOException {
    int lastPage = document.getNumberOfPages();

    PDFTextStripper stripper = new PDFTextStripper();
    stripper.setStartPage(lastPage);
    stripper.setEndPage(lastPage);

    String lastPageText = stripper.getText(document);

    assertTrue(lastPageText.contains("Total"));
    assertTrue(lastPageText.contains("7,700"));
}

Region-Based Text Extraction

Extract text from a specific rectangle on a page:

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.awt.Rectangle;

@Test
void headerRegionContainsCompanyName() throws IOException {
    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.setSortByPosition(true);

    // Header region in points, measured from the top-left corner: x=0, y=0, width=595 (A4 width), height=100
    Rectangle headerRegion = new Rectangle(0, 0, 595, 100);
    stripper.addRegion("header", headerRegion);

    PDPage firstPage = document.getPage(0);
    stripper.extractRegions(firstPage);

    String headerText = stripper.getTextForRegion("header");
    assertTrue(headerText.contains("HelpMeTest Invoicing"));
}

@Test
void totalColumnShowsCorrectAmounts() throws IOException {
    // A4 is 595×842 points; right column starts at ~450 points
    Rectangle totalColumn = new Rectangle(450, 200, 145, 400);

    PDFTextStripperByArea stripper = new PDFTextStripperByArea();
    stripper.addRegion("totals", totalColumn);
    stripper.extractRegions(document.getPage(0));

    String columnText = stripper.getTextForRegion("totals");
    assertTrue(columnText.contains("6,000"));
    assertTrue(columnText.contains("1,000"));
    assertTrue(columnText.contains("7,000"));
}
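
The header example places y=0 at the top of the page, whereas PDF user space (the system the media box is expressed in) starts at the bottom-left corner. When region coordinates come from a layout spec written in PDF space, a conversion along these lines may help. This is a minimal sketch that assumes an unrotated page and top-left-origin region rectangles; RegionCoordinates is a hypothetical helper, not a PDFBox class.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper: converts a rectangle given in PDF user space
// (origin bottom-left, y grows upward) into top-left-origin coordinates.
final class RegionCoordinates {

    static Rectangle2D.Double fromPdfSpace(double x, double yBottom,
                                           double width, double height,
                                           double pageHeight) {
        // The top edge of the box sits (yBottom + height) above the page bottom,
        // so its distance from the top of the page is the remainder.
        double yTop = pageHeight - (yBottom + height);
        return new Rectangle2D.Double(x, yTop, width, height);
    }
}
```

For an A4 page of height 842 points, a box whose lower edge is at y=242 in PDF space with height 400 maps to y=200 from the top, which is the totals column used above.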

Page Count and Structure

import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;

@Test
void singleInvoiceFitsOnOnePage() {
    assertEquals(1, document.getNumberOfPages(),
        "Standard invoice should fit on one page");
}

@Test
void longInvoiceSpansExpectedPages() throws IOException {
    byte[] longPdf = generator.generate(largeInvoice(150));
    try (PDDocument doc = Loader.loadPDF(longPdf)) {
        assertTrue(doc.getNumberOfPages() >= 3,
            "150-item invoice should require at least 3 pages");
    }
}

@Test
void documentHasCorrectPageSize() {
    PDPage page = document.getPage(0);
    PDRectangle mediaBox = page.getMediaBox();

    // A4 dimensions in PDF points (1 point = 1/72 inch)
    assertEquals(595, Math.round(mediaBox.getWidth()), "Page width should be A4");
    assertEquals(842, Math.round(mediaBox.getHeight()), "Page height should be A4");
}
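
The 595 and 842 values follow from the unit definition in the comment above: A4 is 210 × 297 mm, and one point is 1/72 inch. The arithmetic can be kept in a small helper so that other paper sizes do not need hand-computed constants. PageUnits is a hypothetical name used for illustration only.

```java
// Hypothetical helper: converts millimetres to PDF points (1 pt = 1/72 inch).
final class PageUnits {

    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    static double mmToPoints(double mm) {
        return mm / MM_PER_INCH * POINTS_PER_INCH;
    }

    // True if a measured dimension in points is within the tolerance
    // of the expected dimension given in millimetres.
    static boolean matches(double actualPoints, double expectedMm, double tolerancePoints) {
        return Math.abs(actualPoints - mmToPoints(expectedMm)) <= tolerancePoints;
    }
}
```

With this, the width check could be written as assertTrue(PageUnits.matches(mediaBox.getWidth(), 210, 0.5)). The same conversion gives 612 × 792 points for US Letter (215.9 × 279.4 mm).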

Document Metadata

import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import java.util.Calendar;

@Test
void documentMetadataIsSetCorrectly() {
    PDDocumentInformation info = document.getDocumentInformation();

    assertEquals("Invoice INV-2026-001", info.getTitle());
    assertEquals("HelpMeTest", info.getCreator());
    assertEquals("HelpMeTest Billing", info.getAuthor());
}

@Test
void creationDateIsRecent() {
    PDDocumentInformation info = document.getDocumentInformation();
    Calendar creationDate = info.getCreationDate();

    assertNotNull(creationDate);

    // Creation date should be within the last minute for a freshly generated PDF
    long ageMillis = System.currentTimeMillis() - creationDate.getTimeInMillis();
    assertTrue(ageMillis < 60_000, "PDF creation date should be recent");
}
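
One subtlety in the recency check above: ageMillis < 60_000 also passes when the creation date lies in the future, because the age is then negative. A stricter variant, sketched with java.time, rejects future timestamps and takes the current time as a parameter so the check itself can be tested deterministically. Recency is a hypothetical helper name.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Calendar;

// Hypothetical helper: checks that a timestamp is in the past
// and no older than the allowed maximum age.
final class Recency {

    static boolean isRecent(Calendar created, Instant now, Duration maxAge) {
        Duration age = Duration.between(created.toInstant(), now);
        // A negative age means the timestamp is in the future.
        return !age.isNegative() && age.compareTo(maxAge) <= 0;
    }
}
```

In the test above this would be called as Recency.isRecent(info.getCreationDate(), Instant.now(), Duration.ofMinutes(1)).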

PDF Form Field Testing

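PDFBox can read and write AcroForm fields, which makes fillable PDFs straightforward to test:

import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.*;
import java.io.ByteArrayOutputStream;
import java.util.List;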

@Test
void formFieldsArePresent() {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDAcroForm acroForm = catalog.getAcroForm();

    assertNotNull(acroForm, "Document should have an AcroForm");

    // Check specific fields exist
    assertNotNull(acroForm.getField("client_name"), "client_name field must exist");
    assertNotNull(acroForm.getField("invoice_date"), "invoice_date field must exist");
    assertNotNull(acroForm.getField("total_amount"), "total_amount field must exist");
}

@Test
void formFieldValuesAreFilledCorrectly() throws IOException {
    // Fill the form
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    PDAcroForm acroForm = catalog.getAcroForm();

    acroForm.getField("client_name").setValue("Test Corp");
    acroForm.getField("invoice_date").setValue("2026-05-17");
    acroForm.getField("total_amount").setValue("$5,000.00");

    // Write to bytes and re-read to verify
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    document.save(baos);

    try (PDDocument filled = Loader.loadPDF(baos.toByteArray())) {
        PDAcroForm filledForm = filled.getDocumentCatalog().getAcroForm();
        assertEquals("Test Corp", filledForm.getField("client_name").getValueAsString());
        assertEquals("$5,000.00", filledForm.getField("total_amount").getValueAsString());
    }
}

@Test
void requiredFieldsAreNotEmpty() {
    PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
    List<String> requiredFields = List.of("client_name", "invoice_number", "total_amount");

    for (String fieldName : requiredFields) {
        PDField field = acroForm.getField(fieldName);
        assertNotNull(field, "Required field missing: " + fieldName);
        assertFalse(field.getValueAsString().isBlank(),
            "Required field is empty: " + fieldName);
    }
}
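
The loop in requiredFieldsAreNotEmpty stops at the first failing field, so a form with three problems needs three test runs to reveal them all. One option is to collect every problem first and assert once. The sketch below works on a plain map of field name to value, which is assumed to have been built from the AcroForm beforehand; RequiredFields is a hypothetical helper, not a PDFBox class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: reports all missing or blank required fields at once.
final class RequiredFields {

    static List<String> findProblems(Map<String, String> fieldValues, List<String> required) {
        List<String> problems = new ArrayList<>();
        for (String name : required) {
            if (!fieldValues.containsKey(name)) {
                problems.add("missing: " + name);
            } else {
                String value = fieldValues.get(name);
                if (value == null || value.isBlank()) {
                    problems.add("empty: " + name);
                }
            }
        }
        return problems;
    }
}
```

A test can then assert that the returned list is empty and print the list in the failure message, which shows every broken field in a single run.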

Image and Annotation Testing

Checking for Image Presence

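import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import java.util.ArrayList;
import java.util.List;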

@Test
void firstPageContainsLogo() throws IOException {
    List<PDImage> images = extractImages(document.getPage(0));
    assertFalse(images.isEmpty(), "Invoice page should contain at least one image (logo)");
}

private List<PDImage> extractImages(PDPage page) throws IOException {
    List<PDImage> images = new ArrayList<>();

    PDFStreamEngine engine = new PDFStreamEngine() {
        @Override
        protected void processOperator(Operator operator, List<COSBase> operands)
                throws IOException {
            if ("Do".equals(operator.getName())) {
                // A "Do" operator invokes an XObject — check if it's an image
                COSName xObjectName = (COSName) operands.get(0);
                PDXObject xObject = page.getResources().getXObject(xObjectName);
                if (xObject instanceof PDImage) {
                    images.add((PDImage) xObject);
                }
            }
            super.processOperator(operator, operands);
        }
    };

    engine.processPage(page);
    return images;
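}

Checking Link Annotations

import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;

@Test
void firstPageHasHyperlinkToWebsite() throws IOException {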
    PDPage page = document.getPage(0);
    List<PDAnnotation> annotations = page.getAnnotations();

    boolean hasLink = annotations.stream()
        .anyMatch(ann -> ann instanceof PDAnnotationLink);

    assertTrue(hasLink, "Invoice should have at least one clickable link");
}

@Test
void linkAnnotationPointsToCorrectUrl() throws IOException {
    PDPage page = document.getPage(0);
    List<PDAnnotation> annotations = page.getAnnotations();

    PDAnnotationLink link = annotations.stream()
        .filter(a -> a instanceof PDAnnotationLink)
        .map(a -> (PDAnnotationLink) a)
        .findFirst()
        .orElseThrow(() -> new AssertionError("No link annotation found"));

    PDAction action = link.getAction();
    assertTrue(action instanceof PDActionURI, "Link action should be a URI action");
    assertEquals("https://helpmetest.com", ((PDActionURI) action).getURI());
}

Rendering Pages for Visual Tests

import org.apache.pdfbox.rendering.PDFRenderer;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;

@Test
void pageRenderingDoesNotThrow() throws IOException {
    PDFRenderer renderer = new PDFRenderer(document);

    // Render at 150 DPI
    BufferedImage image = renderer.renderImageWithDPI(0, 150);

    assertNotNull(image);
    assertTrue(image.getWidth() > 0);
    assertTrue(image.getHeight() > 0);
}

@Test
void renderedPageMatchesBaseline() throws IOException {
    PDFRenderer renderer = new PDFRenderer(document);
    BufferedImage rendered = renderer.renderImageWithDPI(0, 150);

    Path baselinePath = Path.of("src/test/resources/baselines/invoice-page1.png");

    if (!Files.exists(baselinePath)) {
        // Write baseline
        Files.createDirectories(baselinePath.getParent());
        ImageIO.write(rendered, "PNG", baselinePath.toFile());
        return;
    }

    BufferedImage baseline = ImageIO.read(baselinePath.toFile());

    // Compare dimensions
    assertEquals(baseline.getWidth(), rendered.getWidth(), "Page width mismatch");
    assertEquals(baseline.getHeight(), rendered.getHeight(), "Page height mismatch");

    // Compare pixel content (simplified; real suites usually add a tolerance or use a dedicated image-diff library)
    long differentPixels = 0;
    for (int y = 0; y < baseline.getHeight(); y++) {
        for (int x = 0; x < baseline.getWidth(); x++) {
            if (baseline.getRGB(x, y) != rendered.getRGB(x, y)) {
                differentPixels++;
            }
        }
    }

    // This is a ratio between 0 and 1, so 0.01 corresponds to 1% of pixels
    double diffRatio = (double) differentPixels / ((long) baseline.getWidth() * baseline.getHeight());
    assertTrue(diffRatio < 0.01, "Visual diff exceeds 1%: " + (diffRatio * 100) + "%");
}
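
Exact RGB equality, as in the loop above, tends to flag anti-aliasing and font-rendering differences between machines. A per-channel tolerance is one common way to soften that. The following is a minimal JDK-only sketch; PixelDiff and the channelTolerance parameter are hypothetical names, and the right tolerance depends on your renderer and fonts.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper: fraction of pixels that differ by more than
// a given amount in any of the red, green, or blue channels.
final class PixelDiff {

    static double diffRatio(BufferedImage expected, BufferedImage actual, int channelTolerance) {
        if (expected.getWidth() != actual.getWidth()
                || expected.getHeight() != actual.getHeight()) {
            throw new IllegalArgumentException("Images must have the same dimensions");
        }
        long different = 0;
        for (int y = 0; y < expected.getHeight(); y++) {
            for (int x = 0; x < expected.getWidth(); x++) {
                if (!closeEnough(expected.getRGB(x, y), actual.getRGB(x, y), channelTolerance)) {
                    different++;
                }
            }
        }
        long total = (long) expected.getWidth() * expected.getHeight();
        return (double) different / total;
    }

    // Compares blue, green, and red (bit shifts 0, 8, 16); alpha is ignored.
    private static boolean closeEnough(int rgbA, int rgbB, int tolerance) {
        for (int shift = 0; shift <= 16; shift += 8) {
            int a = (rgbA >> shift) & 0xFF;
            int b = (rgbB >> shift) & 0xFF;
            if (Math.abs(a - b) > tolerance) {
                return false;
            }
        }
        return true;
    }
}
```

The baseline test could then assert PixelDiff.diffRatio(baseline, rendered, 2) < 0.01, keeping the same 1% threshold while ignoring near-identical pixels.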

PDF/UA Accessibility Testing

@Test
void documentHasLanguageSet() {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    assertNotNull(catalog.getLanguage(), "PDF should have language set for accessibility");
    assertFalse(catalog.getLanguage().isBlank());
}

@Test
void documentHasStructureTree() {
    PDDocumentCatalog catalog = document.getDocumentCatalog();
    assertNotNull(catalog.getMarkInfo(), "Accessible PDF must have mark info");
    assertTrue(catalog.getMarkInfo().isMarked(), "PDF must be marked as tagged");
    assertNotNull(catalog.getStructureTreeRoot(), "Tagged PDF must have a structure tree root");
}
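
The language test above only confirms that some non-blank string is set. If the value should also be a syntactically valid language tag such as en-US, the JDK's Locale.Builder can check that. This is a sketch under the assumption that your PDFs use BCP 47 style tags; LanguageTags is a hypothetical helper name.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

// Hypothetical helper: checks that a string parses as a language tag
// with a non-empty primary language subtag.
final class LanguageTags {

    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isBlank()) {
            return false;
        }
        try {
            // setLanguageTag throws IllformedLocaleException for ill-formed input.
            Locale locale = new Locale.Builder().setLanguageTag(tag).build();
            return !locale.getLanguage().isEmpty();
        } catch (IllformedLocaleException e) {
            return false;
        }
    }
}
```

A test could add assertTrue(LanguageTags.isWellFormed(catalog.getLanguage())). This checks syntax only; it does not verify that the tag matches the actual language of the text.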

CI Setup

name: PDFBox Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2
          key: ${{ runner.os }}-m2-${{ hashFiles('**/pom.xml') }}
      - run: mvn test -pl pdf-module
      - name: Upload visual diff artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: pdf-visual-diffs
          path: target/test-output/pdfs/

Summary

PDFBox gives Java developers a complete toolkit for PDF test assertions:

Test concern         PDFBox API
All text             PDFTextStripper.getText()
Per-page text        setStartPage() / setEndPage()
Region text          PDFTextStripperByArea
Page count           getNumberOfPages()
Page size            PDPage.getMediaBox()
Metadata             PDDocumentInformation
Form fields          PDAcroForm
Images               Content stream Do operators
Annotations/links    PDPage.getAnnotations()
Visual comparison    PDFRenderer.renderImageWithDPI()

For testing the full user flow — a user clicks "Export PDF," the file downloads, and opens correctly — HelpMeTest adds browser-level E2E coverage that complements your PDFBox unit tests.
