Apache PDFBox Testing Tutorial: Parsing, Comparing, and Asserting PDF Content
Apache PDFBox is an open-source Java library for reading and writing PDFs. Unlike iText, which is dual-licensed under the AGPL and a commercial license, PDFBox is released under the Apache 2.0 license. That makes it a good fit for testing PDFs regardless of how they were generated: iText, wkhtmltopdf, Puppeteer, or any other tool. This guide shows how to parse, compare, and assert on PDF content using PDFBox in JUnit 5 tests.
Key Takeaways
PDFBox works as a test reader for any PDF generator. You can use PDFBox to test PDFs generated by iText, WeasyPrint, Puppeteer, or any other tool — it's the reader, not the writer.
PDFTextStripper extracts text with position control. PDFTextStripper extracts text from the whole document; PDFTextStripperByArea extracts from specific rectangular regions.
Test form fields explicitly. PDFBox provides direct access to AcroForm fields — check field names, types, and current values in test assertions.
Image and annotation testing requires traversing the page structure. PDFBox exposes images through the page content stream and its resources, and annotations through the page's annotation list. Iterate both to verify visual content beyond text.
PDFBox 3.x changed the loading API. PDDocument.load() was replaced by Loader.loadPDF() in PDFBox 3.0, so check your version before copying examples.
Why PDFBox for Testing
Apache PDFBox is commonly used as the reader in a test setup, even when the PDF was generated by a different library. Its advantages:
- Apache 2.0 license — no commercial restrictions in tests
- Rich introspection API — text, fonts, images, annotations, form fields, bookmarks
- Per-region text extraction — extract text from specific page areas
- AcroForm support — read and write PDF form field values
- Rendering — render pages to BufferedImage for visual tests
Maven Setup
<dependency>
<groupId>org.apache.pdfbox</groupId>
<artifactId>pdfbox</artifactId>
<version>3.0.2</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.junit.jupiter</groupId>
<artifactId>junit-jupiter</artifactId>
<version>5.10.0</version>
<scope>test</scope>
</dependency>
Loading PDFs in Tests
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.junit.jupiter.api.*;
import static org.junit.jupiter.api.Assertions.*;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
class PdfContentTest {
private PDDocument document;
@BeforeEach
void setUp() throws IOException {
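// Option 1: Load from file
document = Loader.loadPDF(Path.of("src/test/resources/sample-invoice.pdf").toFile());
// Option 2 (use instead of Option 1): load from a byte array produced by your generator.
// Loading both would leak the first document, because only the last one is closed in tearDown().
// byte[] pdfBytes = invoiceGenerator.generate(sampleInvoice);
// document = Loader.loadPDF(pdfBytes);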
}
@AfterEach
void tearDown() throws IOException {
if (document != null) {
document.close();
}
}
}
Note: In PDFBox 3.x, use Loader.loadPDF(). The PDDocument.load() methods from 2.x were removed in 3.0.
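When the PDF arrives as a byte array from your own generator, a cheap pre-check can make failures easier to read: if the generator returned an HTML error page or an empty array, the loader fails with a parse error that hides the real cause. The sketch below uses only the JDK and looks for the `%PDF-` signature that PDF files start with. The names `PdfBytes` and `looksLikePdf` are hypothetical, and accepting the signature anywhere in the first 1024 bytes is an assumption of this sketch, not a rule taken from PDFBox.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: sanity-check generator output before handing it to a PDF loader. */
final class PdfBytes {
    private static final byte[] SIGNATURE = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // Assumption of this sketch: tolerate some leading junk by searching the first 1024 bytes.
    private static final int SEARCH_WINDOW = 1024;

    private PdfBytes() {}

    /** Returns true if the "%PDF-" signature starts within the search window. */
    static boolean looksLikePdf(byte[] data) {
        if (data == null) {
            return false;
        }
        int lastStart = Math.min(data.length, SEARCH_WINDOW) - SIGNATURE.length;
        for (int start = 0; start <= lastStart; start++) {
            if (matchesAt(data, start)) {
                return true;
            }
        }
        return false;
    }

    private static boolean matchesAt(byte[] data, int start) {
        for (int i = 0; i < SIGNATURE.length; i++) {
            if (data[start + i] != SIGNATURE[i]) {
                return false;
            }
        }
        return true;
    }
}
```

In a test, this could run just before loading: `assertTrue(PdfBytes.looksLikePdf(pdfBytes), "Generator did not return a PDF")`.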
Text Extraction
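Extracted text does not always match the spacing you see on the page: line breaks, repeated spaces, and non-breaking spaces depend on how the generator positioned the glyphs. Normalizing whitespace first can make `contains` assertions less brittle. The sketch below is plain JDK code. `TextNormalizer` and `normalize` are hypothetical names, and collapsing every whitespace run into one space is an assumption that suits phrase matching but discards layout.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: collapse whitespace in extracted PDF text before asserting on it. */
final class TextNormalizer {
    // \s does not match the non-breaking space (U+00A0) by default, so it is listed explicitly.
    private static final Pattern WHITESPACE = Pattern.compile("[\\s\\u00A0]+");

    private TextNormalizer() {}

    /** Replaces each run of whitespace with a single space and trims both ends. */
    static String normalize(String extracted) {
        return WHITESPACE.matcher(extracted).replaceAll(" ").trim();
    }
}
```

A test would then assert on `TextNormalizer.normalize(stripper.getText(document))` instead of the raw string.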
Full Document Text
import org.apache.pdfbox.text.PDFTextStripper;
@Test
void documentContainsInvoiceHeader() throws IOException {
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
assertTrue(text.contains("Invoice #INV-2026-001"));
assertTrue(text.contains("Acme Corp"));
}
@Test
void totalAmountIsPresentInDocument() throws IOException {
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(document);
assertTrue(text.contains("7,700.00") || text.contains("$7700"),
"Document should show the invoice total");
}
Per-Page Text Extraction
@Test
void firstPageContainsHeaderInformation() throws IOException {
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(1);
stripper.setEndPage(1);
String firstPageText = stripper.getText(document);
assertTrue(firstPageText.contains("INV-2026-001"));
assertTrue(firstPageText.contains("Acme Corp"));
}
@Test
void lastPageContainsTotalsSection() throws IOException {
int lastPage = document.getNumberOfPages();
PDFTextStripper stripper = new PDFTextStripper();
stripper.setStartPage(lastPage);
stripper.setEndPage(lastPage);
String lastPageText = stripper.getText(document);
assertTrue(lastPageText.contains("Total"));
assertTrue(lastPageText.contains("7,700"));
}
Region-Based Text Extraction
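Region rectangles are expressed in PDF points (1/72 inch), while layout specs are often written in millimetres. The sketch below converts a millimetre box into a `java.awt.geom.Rectangle2D` using only the JDK. `Regions`, `mmToPoints`, and `fromMillimetres` are hypothetical names, and the sketch assumes the region API accepts any `Rectangle2D` (as `java.awt.Rectangle` is one); if yours needs an integer rectangle, call `getBounds()` on the result.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: build region rectangles in PDF points from millimetre measurements. */
final class Regions {
    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    private Regions() {}

    /** Converts millimetres to PDF points (1 point = 1/72 inch). */
    static double mmToPoints(double mm) {
        return mm * POINTS_PER_INCH / MM_PER_INCH;
    }

    /** Builds a rectangle in points from a box given in millimetres. */
    static Rectangle2D fromMillimetres(double xMm, double yMm, double widthMm, double heightMm) {
        return new Rectangle2D.Double(
                mmToPoints(xMm), mmToPoints(yMm), mmToPoints(widthMm), mmToPoints(heightMm));
    }
}
```

For example, a header band spanning the full A4 width (210 mm) and the top 35 mm would be `Regions.fromMillimetres(0, 0, 210, 35)`.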
Extract text from a specific rectangle on a page:
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.text.PDFTextStripperByArea;
import java.awt.Rectangle;
@Test
void headerRegionContainsCompanyName() throws IOException {
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.setSortByPosition(true);
// Define header region: x=0, y=0 (measured from the top-left corner), width=595 (A4 width), height=100 points
Rectangle headerRegion = new Rectangle(0, 0, 595, 100);
stripper.addRegion("header", headerRegion);
PDPage firstPage = document.getPage(0);
stripper.extractRegions(firstPage);
String headerText = stripper.getTextForRegion("header");
assertTrue(headerText.contains("HelpMeTest Invoicing"));
}
@Test
void totalColumnShowsCorrectAmounts() throws IOException {
// A4 is 595×842 points; right column starts at ~450 points
Rectangle totalColumn = new Rectangle(450, 200, 145, 400);
PDFTextStripperByArea stripper = new PDFTextStripperByArea();
stripper.addRegion("totals", totalColumn);
stripper.extractRegions(document.getPage(0));
String columnText = stripper.getTextForRegion("totals");
assertTrue(columnText.contains("6,000"));
assertTrue(columnText.contains("1,000"));
assertTrue(columnText.contains("7,000"));
}
Page Count and Structure
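Page dimensions are floating-point values, and A4 is not exactly 595 by 842 points, so a tolerance-based comparison states the intent more directly than rounding. The sketch below is JDK-only. The class `PageSizes`, its methods, and the tolerance parameter are hypothetical; the A4 constants are derived from 210 mm by 297 mm.

```java
/** Hypothetical helper for page-size assertions; all sizes are in PDF points (1/72 inch). */
final class PageSizes {
    // A4 is 210 mm x 297 mm, which is about 595.28 x 841.89 points.
    static final double A4_WIDTH = 595.28;
    static final double A4_HEIGHT = 841.89;

    private PageSizes() {}

    /** True if both dimensions are within the given tolerance of the expected size. */
    static boolean matches(double width, double height,
                           double expectedWidth, double expectedHeight, double tolerance) {
        return Math.abs(width - expectedWidth) <= tolerance
                && Math.abs(height - expectedHeight) <= tolerance;
    }

    static boolean isA4Portrait(double width, double height, double tolerance) {
        return matches(width, height, A4_WIDTH, A4_HEIGHT, tolerance);
    }

    static boolean isA4Landscape(double width, double height, double tolerance) {
        return matches(width, height, A4_HEIGHT, A4_WIDTH, tolerance);
    }
}
```

With the media box from the page-size test in this section, the assertion becomes `assertTrue(PageSizes.isA4Portrait(mediaBox.getWidth(), mediaBox.getHeight(), 1.0))`.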
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDRectangle;
@Test
void singleInvoiceFitsOnOnePage() {
assertEquals(1, document.getNumberOfPages(),
"Standard invoice should fit on one page");
}
@Test
void longInvoiceSpansExpectedPages() throws IOException {
byte[] longPdf = generator.generate(largeInvoice(150));
try (PDDocument doc = Loader.loadPDF(longPdf)) {
assertTrue(doc.getNumberOfPages() >= 3,
"150-item invoice should require at least 3 pages");
}
}
@Test
void documentHasCorrectPageSize() {
PDPage page = document.getPage(0);
PDRectangle mediaBox = page.getMediaBox();
// A4 dimensions in PDF points (1 point = 1/72 inch)
assertEquals(595, Math.round(mediaBox.getWidth()), "Page width should be A4");
assertEquals(842, Math.round(mediaBox.getHeight()), "Page height should be A4");
}
Document Metadata
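A "created recently" assertion is easier to keep stable when the clock is passed in and a small negative age is tolerated, since the generator and the test may not agree on the time to the millisecond. The sketch below is JDK-only. `Recency`, `isRecent`, and the `allowedSkew` parameter are hypothetical names chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Calendar;

/** Hypothetical helper: decide whether a creation date is plausibly "just now". */
final class Recency {
    private Recency() {}

    /**
     * True if created is at most maxAge before now, and at most allowedSkew after now.
     * A null creation date is treated as not recent.
     */
    static boolean isRecent(Calendar created, Instant now, Duration maxAge, Duration allowedSkew) {
        if (created == null) {
            return false;
        }
        Duration age = Duration.between(created.toInstant(), now);
        boolean notTooOld = age.compareTo(maxAge) <= 0;
        boolean notTooFarInFuture = age.compareTo(allowedSkew.negated()) >= 0;
        return notTooOld && notTooFarInFuture;
    }
}
```

With the creation date read from the document information, a call might look like `Recency.isRecent(info.getCreationDate(), Instant.now(), Duration.ofMinutes(1), Duration.ofSeconds(5))`.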
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import java.util.Calendar;
@Test
void documentMetadataIsSetCorrectly() {
PDDocumentInformation info = document.getDocumentInformation();
assertEquals("Invoice INV-2026-001", info.getTitle());
assertEquals("HelpMeTest", info.getCreator());
assertEquals("HelpMeTest Billing", info.getAuthor());
}
@Test
void creationDateIsRecent() {
PDDocumentInformation info = document.getDocumentInformation();
Calendar creationDate = info.getCreationDate();
assertNotNull(creationDate);
// Creation date should be within the last minute for a freshly generated PDF
long ageMillis = System.currentTimeMillis() - creationDate.getTimeInMillis();
assertTrue(ageMillis < 60_000, "PDF creation date should be recent");
}
PDF Form Field Testing
PDFBox reads and writes AcroForm fields, so you can test fillable form PDFs directly:
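import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.interactive.form.*;
import java.io.ByteArrayOutputStream;
import java.util.List;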
@Test
void formFieldsArePresent() {
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDAcroForm acroForm = catalog.getAcroForm();
assertNotNull(acroForm, "Document should have an AcroForm");
// Check specific fields exist
assertNotNull(acroForm.getField("client_name"), "client_name field must exist");
assertNotNull(acroForm.getField("invoice_date"), "invoice_date field must exist");
assertNotNull(acroForm.getField("total_amount"), "total_amount field must exist");
}
@Test
void formFieldValuesAreFilledCorrectly() throws IOException {
// Fill the form
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDAcroForm acroForm = catalog.getAcroForm();
acroForm.getField("client_name").setValue("Test Corp");
acroForm.getField("invoice_date").setValue("2026-05-17");
acroForm.getField("total_amount").setValue("$5,000.00");
// Write to bytes and re-read to verify
ByteArrayOutputStream baos = new ByteArrayOutputStream();
document.save(baos);
try (PDDocument filled = Loader.loadPDF(baos.toByteArray())) {
PDAcroForm filledForm = filled.getDocumentCatalog().getAcroForm();
assertEquals("Test Corp", filledForm.getField("client_name").getValueAsString());
assertEquals("$5,000.00", filledForm.getField("total_amount").getValueAsString());
}
}
@Test
void requiredFieldsAreNotEmpty() {
PDAcroForm acroForm = document.getDocumentCatalog().getAcroForm();
List<String> requiredFields = List.of("client_name", "invoice_number", "total_amount");
for (String fieldName : requiredFields) {
PDField field = acroForm.getField(fieldName);
assertNotNull(field, "Required field missing: " + fieldName);
assertFalse(field.getValueAsString().isBlank(),
"Required field is empty: " + fieldName);
}
}
Image and Annotation Testing
Checking for Image Presence
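import org.apache.pdfbox.contentstream.PDFStreamEngine;
import org.apache.pdfbox.contentstream.operator.Operator;
import org.apache.pdfbox.cos.COSBase;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.graphics.PDXObject;
import org.apache.pdfbox.pdmodel.graphics.image.PDImage;
import java.util.ArrayList;
import java.util.List;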
@Test
void firstPageContainsLogo() throws IOException {
List<PDImage> images = extractImages(document.getPage(0));
assertFalse(images.isEmpty(), "Invoice page should contain at least one image (logo)");
}
private List<PDImage> extractImages(PDPage page) throws IOException {
List<PDImage> images = new ArrayList<>();
PDFStreamEngine engine = new PDFStreamEngine() {
@Override
protected void processOperator(Operator operator, List<COSBase> operands)
throws IOException {
if ("Do".equals(operator.getName())) {
// A "Do" operator invokes an XObject — check if it's an image
COSName xObjectName = (COSName) operands.get(0);
PDXObject xObject = page.getResources().getXObject(xObjectName);
if (xObject instanceof PDImage) {
images.add((PDImage) xObject);
}
}
super.processOperator(operator, operands);
}
};
engine.processPage(page);
return images;
}
Testing Annotations (Links, Comments)
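Comparing a link target with exact string equality breaks as soon as the generator adds a trailing slash, a path, or query parameters. When the intent is "the link goes to our site over HTTPS", comparing the parsed scheme and host can be more robust. The sketch below uses only `java.net.URI`; `LinkChecks` and `isHttpsLinkTo` are hypothetical names, and ignoring path and query is a design assumption you may not want for every link.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical helper: check a link target by scheme and host instead of exact string equality. */
final class LinkChecks {
    private LinkChecks() {}

    /** True if uri parses, uses https, and its host equals expectedHost (case-insensitive). */
    static boolean isHttpsLinkTo(String uri, String expectedHost) {
        if (uri == null) {
            return false;
        }
        try {
            URI parsed = new URI(uri.trim());
            return "https".equalsIgnoreCase(parsed.getScheme())
                    && parsed.getHost() != null
                    && parsed.getHost().equalsIgnoreCase(expectedHost);
        } catch (URISyntaxException e) {
            // Malformed targets are treated as a failed check rather than an error.
            return false;
        }
    }
}
```

Applied to the URI string read from a link's action, the assertion becomes `assertTrue(LinkChecks.isHttpsLinkTo(uri, "helpmetest.com"))`.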
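import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDAction;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import java.util.List;
@Test
void firstPageHasHyperlinkToWebsite() throws IOException {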
PDPage page = document.getPage(0);
List<PDAnnotation> annotations = page.getAnnotations();
boolean hasLink = annotations.stream()
.anyMatch(ann -> ann instanceof PDAnnotationLink);
assertTrue(hasLink, "Invoice should have at least one clickable link");
}
@Test
void linkAnnotationPointsToCorrectUrl() throws IOException {
PDPage page = document.getPage(0);
List<PDAnnotation> annotations = page.getAnnotations();
PDAnnotationLink link = annotations.stream()
.filter(a -> a instanceof PDAnnotationLink)
.map(a -> (PDAnnotationLink) a)
.findFirst()
.orElseThrow(() -> new AssertionError("No link annotation found"));
PDAction action = link.getAction();
assertTrue(action instanceof PDActionURI, "Link action should be a URI action");
assertEquals("https://helpmetest.com", ((PDActionURI) action).getURI());
}
Rendering Pages for Visual Tests
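Exact pixel equality, as used in the baseline comparison in this section, can flag anti-aliasing and font-rendering noise that no human would notice. One common mitigation is a per-channel tolerance. The sketch below is JDK-only and works on any two `BufferedImage` instances. `ImageDiff`, `diffRatio`, and the `channelTolerance` parameter are hypothetical, and this is a simplified comparison, not a substitute for a perceptual diff.

```java
import java.awt.image.BufferedImage;

/** Hypothetical helper: tolerant pixel comparison for rendered pages. */
final class ImageDiff {
    private ImageDiff() {}

    /**
     * Returns the fraction (0.0 to 1.0) of pixels where red, green, or blue
     * differs by more than channelTolerance. Alpha is ignored.
     */
    static double diffRatio(BufferedImage a, BufferedImage b, int channelTolerance) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("Images must have identical dimensions");
        }
        long different = 0;
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                if (channelsDiffer(a.getRGB(x, y), b.getRGB(x, y), channelTolerance)) {
                    different++;
                }
            }
        }
        long total = (long) a.getWidth() * a.getHeight();
        return (double) different / total;
    }

    private static boolean channelsDiffer(int rgb1, int rgb2, int tolerance) {
        // Shifts 0, 8, 16 select the blue, green, and red bytes of a packed ARGB value.
        for (int shift = 0; shift <= 16; shift += 8) {
            int c1 = (rgb1 >> shift) & 0xFF;
            int c2 = (rgb2 >> shift) & 0xFF;
            if (Math.abs(c1 - c2) > tolerance) {
                return true;
            }
        }
        return false;
    }
}
```

A baseline test could then assert `ImageDiff.diffRatio(baseline, rendered, 8) < 0.01`; the tolerance of 8 is an arbitrary starting point to tune against your own renders.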
import org.apache.pdfbox.rendering.PDFRenderer;
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
@Test
void pageRenderingDoesNotThrow() throws IOException {
PDFRenderer renderer = new PDFRenderer(document);
// Render at 150 DPI
BufferedImage image = renderer.renderImageWithDPI(0, 150);
assertNotNull(image);
assertTrue(image.getWidth() > 0);
assertTrue(image.getHeight() > 0);
}
@Test
void renderedPageMatchesBaseline() throws IOException {
PDFRenderer renderer = new PDFRenderer(document);
BufferedImage rendered = renderer.renderImageWithDPI(0, 150);
Path baselinePath = Path.of("src/test/resources/baselines/invoice-page1.png");
if (!Files.exists(baselinePath)) {
// Write baseline
Files.createDirectories(baselinePath.getParent());
ImageIO.write(rendered, "PNG", baselinePath.toFile());
return;
}
BufferedImage baseline = ImageIO.read(baselinePath.toFile());
// Compare dimensions
assertEquals(baseline.getWidth(), rendered.getWidth(), "Page width mismatch");
assertEquals(baseline.getHeight(), rendered.getHeight(), "Page height mismatch");
// Compare pixel content (simplified; use a dedicated image-diff library for real tests)
long differentPixels = 0;
for (int y = 0; y < baseline.getHeight(); y++) {
for (int x = 0; x < baseline.getWidth(); x++) {
if (baseline.getRGB(x, y) != rendered.getRGB(x, y)) {
differentPixels++;
}
}
}
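// This is a ratio between 0 and 1, so convert it to a percentage for the failure message
double diffRatio = (double) differentPixels / ((long) baseline.getWidth() * baseline.getHeight());
assertTrue(diffRatio < 0.01, "Visual diff exceeds 1%: " + (diffRatio * 100) + "%");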
}
PDF/UA Accessibility Testing
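A non-blank language entry is a low bar: a value such as `en_US` is present but is not a well-formed language tag. The sketch below checks tag syntax with the JDK's `Locale.Builder`. `LanguageTags` and `isWellFormed` are hypothetical names, and the check covers syntax only; it does not confirm that the language is registered or matches the document's actual content.

```java
import java.util.IllformedLocaleException;
import java.util.Locale;

/** Hypothetical helper: syntax check for a document language tag such as "en-US". */
final class LanguageTags {
    private LanguageTags() {}

    /** True if tag is non-blank and parses as a well-formed language tag. */
    static boolean isWellFormed(String tag) {
        if (tag == null || tag.isBlank()) {
            return false;
        }
        try {
            // Locale.Builder rejects ill-formed tags instead of silently dropping them.
            new Locale.Builder().setLanguageTag(tag);
            return true;
        } catch (IllformedLocaleException e) {
            return false;
        }
    }
}
```

In the language test in this section, `assertTrue(LanguageTags.isWellFormed(catalog.getLanguage()))` would replace the blank check.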
@Test
void documentHasLanguageSet() {
PDDocumentCatalog catalog = document.getDocumentCatalog();
assertNotNull(catalog.getLanguage(), "PDF should have language set for accessibility");
assertFalse(catalog.getLanguage().isBlank());
}
@Test
void documentHasStructureTree() {
PDDocumentCatalog catalog = document.getDocumentCatalog();
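assertNotNull(catalog.getMarkInfo(), "Accessible PDF must have mark info");
assertTrue(catalog.getMarkInfo().isMarked(), "PDF must be marked for accessibility");
// The test name promises a structure tree, so check for it explicitly
assertNotNull(catalog.getStructureTreeRoot(), "Accessible PDF must have a structure tree");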
}
CI Setup
name: PDFBox Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: 'temurin'
          java-version: '21'
      - name: Cache Maven
        uses: actions/cache@v4
        with:
          path: ~/.m2
          key: ${{ runner.os }}-m2-${{ hashFiles('**/pom.xml') }}
      - run: mvn test -pl pdf-module
      - name: Upload visual diff artifacts
        if: failure()
        uses: actions/upload-artifact@v4
        with:
          name: pdf-visual-diffs
          path: target/test-output/pdfs/
Summary
PDFBox gives Java developers a complete toolkit for PDF test assertions:
| Test concern | PDFBox API |
|---|---|
| All text | PDFTextStripper.getText() |
| Per-page text | setStartPage() / setEndPage() |
| Region text | PDFTextStripperByArea |
| Page count | getNumberOfPages() |
| Page size | PDPage.getMediaBox() |
| Metadata | PDDocumentInformation |
| Form fields | PDAcroForm |
| Images | Content stream Do operators |
| Annotations/links | PDPage.getAnnotations() |
| Visual comparison | PDFRenderer.renderImageWithDPI() |
For testing the full user flow — a user clicks "Export PDF," the file downloads, and opens correctly — HelpMeTest adds browser-level E2E coverage that complements your PDFBox unit tests.