Mohak Garg
Personal Project
2025

PDF Remediation Tool: Automated PDF/UA-1 Accessibility

Python pipeline that automatically transforms PDFs into PDF/UA-1 compliant, accessible documents that pass PAC and veraPDF validation out of the box on standard single-column layouts.

Engineer — Architecture, Pipeline Design, PDF Structure Implementation, Validation

Python
PDF/UA-1
Streamlit
veraPDF
PDF Structure Trees
Accessibility
Compliance standard: PDF/UA-1
Pipeline stages: 4
Matterhorn checkpoints: 8+
Interface: CLI + Web UI
The Problem

PDFs distributed by organizations are rarely accessible to screen readers or assistive technologies. Manual remediation is expensive, slow, and requires specialist knowledge of PDF structure standards — creating a compliance bottleneck for legal, educational, and government documents.

The Solution

Built a 4-stage Python pipeline that automatically injects structure tags, embeds fonts, writes XMP metadata, and wires link annotations to produce PDF/UA-1 compliant documents. Includes both a CLI and a Streamlit web UI. Validated against PAC and veraPDF — addresses 8 Matterhorn Protocol checkpoints.

Product Artifacts

GitHub
Source code

Case Study

Context
  • PDF/UA-1 (ISO 14289-1) accessibility compliance is a legal requirement in many jurisdictions for government, education, and enterprise documents, yet most PDFs in circulation are untagged or otherwise fail validation.
  • Manual remediation requires a PDF specialist: structure tree editing, MCID marking, XMP metadata, font embedding — hours of work per document.
  • Existing tools (Adobe Acrobat's accessibility checker, Foxit) flag problems but leave most of the fixes to manual work. The gap between detection and remediation is where the real cost lives.
  • Target use case: an organization with hundreds of legacy PDFs needing bulk compliance remediation without manual intervention.
Decision
  • Chose a 4-stage pipeline architecture (extract → tag → postprocess → validate) so each concern is isolated and independently testable.
  • Used font-size ratio heuristics for heading detection rather than ML — simpler, deterministic, and tunable via config without retraining.
  • Built both CLI and Streamlit UI from the same core — the CLI handles bulk processing, the web UI handles one-off reviews.
  • Targeted PDF/UA-1 specifically (not WCAG or PDF/A) because it is the standard that validators like PAC and veraPDF check against.
  • Marked watermarks and headers/footers as /Artifact so screen readers skip them, a detail that manual remediation often overlooks.
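The heading and artifact decisions above can be sketched as a small classifier. This is a minimal illustration, not the project's actual code: the `TextBlock` fields, the ratio thresholds in `HEADING_RATIOS`, the 36 pt margin, and the 5 degree rotation tolerance are all assumed values chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class TextBlock:
    text: str
    font_size: float        # points
    y_center: float         # distance from the bottom of the page, points
    rotation: float = 0.0   # degrees

# Hypothetical config: minimum (block size / body size) ratio per heading level,
# ordered from largest to smallest so the first match wins.
HEADING_RATIOS = [(1.8, "H1"), (1.4, "H2"), (1.15, "H3")]

def body_font_size(blocks: List[TextBlock]) -> float:
    """Estimate the body size as the font size that covers the most characters."""
    weight = Counter()
    for b in blocks:
        weight[round(b.font_size, 1)] += len(b.text)
    return weight.most_common(1)[0][0]

def is_artifact(block: TextBlock, page_height: float,
                margin: float = 36.0, tolerance: float = 5.0) -> bool:
    """Flag blocks in the top/bottom margin band or rotated off the page axes."""
    in_margin = block.y_center < margin or block.y_center > page_height - margin
    r = block.rotation % 90
    off_axis = min(r, 90 - r) > tolerance   # e.g. a diagonal watermark
    return in_margin or off_axis

def classify(block: TextBlock, body_size: float, page_height: float) -> str:
    """Return 'Artifact', a heading tag name, or 'P'."""
    if is_artifact(block, page_height):
        return "Artifact"
    ratio = block.font_size / body_size
    for min_ratio, tag in HEADING_RATIOS:
        if ratio >= min_ratio:
            return tag
    return "P"

sample = [
    TextBlock("Annual Report", 24.0, 700.0),
    TextBlock("Lorem ipsum dolor sit amet, consectetur adipiscing.", 12.0, 600.0),
    TextBlock("Section", 17.0, 500.0),
    TextBlock("Page 1", 9.0, 20.0),
    TextBlock("DRAFT", 60.0, 400.0, rotation=45.0),
]
body = body_font_size(sample)
print([classify(b, body, 792.0) for b in sample])
# ['H1', 'P', 'H2', 'Artifact', 'Artifact']
```

Because the thresholds live in plain data, tuning the heuristic for a different document family means editing a list rather than retraining a model, which is the trade-off the decision describes.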
Execution
  • Stage 1 (Extractor): parses text blocks, font metrics, and bounding boxes; detects headings by font-size ratio; and identifies artifacts by position and rotation.
  • Stage 2 (Tagger): writes the PDF structure tree directly into the file (/Document, /P, /H1 through /H6, /Figure, /Table, /L, /LI) and marks the corresponding content in the page content streams with MCIDs.
  • Stage 3 (Postprocessor): injects XMP metadata (dc:title, dc:language, pdfuaid:part=1), embeds unembedded fonts, sets MarkInfo and TabOrder.
  • Stage 4 (Validator): runs veraPDF against the ua1 profile and surfaces structured pass/fail results with checkpoint IDs.
  • Addressed 8 Matterhorn Protocol checkpoints including tagged PDF flag, link MCR+OBJR structure, /Figure Alt text, and artifact marking.
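As an illustration of the Stage 3 metadata step, the sketch below builds an XMP packet carrying dc:title, dc:language, and the PDF/UA identifier pdfuaid:part=1. It uses only the standard library; the function name `build_xmp_packet` and the default language are assumptions for the example, and attaching the packet to the document catalog as its /Metadata stream is left to whichever PDF library the pipeline uses.

```python
from xml.sax.saxutils import escape

# XMP packet template. The pdfuaid namespace is the PDF/UA identification schema;
# dc:title is a language alternative and dc:language is an unordered bag.
XMP_TEMPLATE = """<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>
<x:xmpmeta xmlns:x="adobe:ns:meta/">
 <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:pdfuaid="http://www.aiim.org/pdfua/ns/id/">
   <dc:title><rdf:Alt><rdf:li xml:lang="x-default">{title}</rdf:li></rdf:Alt></dc:title>
   <dc:language><rdf:Bag><rdf:li>{lang}</rdf:li></rdf:Bag></dc:language>
   <pdfuaid:part>1</pdfuaid:part>
  </rdf:Description>
 </rdf:RDF>
</x:xmpmeta>
<?xpacket end="w"?>"""

def build_xmp_packet(title: str, lang: str = "en-US") -> str:
    """Return an XMP packet string; title and language are XML-escaped."""
    return XMP_TEMPLATE.format(title=escape(title), lang=escape(lang))

packet = build_xmp_packet("Q3 Report & Notes")
print("<pdfuaid:part>1</pdfuaid:part>" in packet)  # True
```

Escaping the title matters in practice: document titles pulled from legacy files often contain `&` or `<`, and an unescaped character would make the packet malformed XML.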
Outcome
  • Produces PDF/UA-1 compliant output passing PAC and veraPDF validation out of the box on standard single-column documents.
  • Reduces remediation time from hours of manual work to seconds of automated pipeline execution.
  • Streamlit web UI enables non-technical users to run remediation without CLI access.
  • Has successfully remediated real client documents, including contracts, observer forms, and teaching notes.
Learnings
  • PDF internals are surprisingly underspecified: the spec leaves ambiguities that real validators interpret differently. In practice, compliance required reading the veraPDF source, not just the ISO spec.
  • Heuristic heading detection works well for typical business documents but breaks on creative layouts — knowing the tool's limits is as important as what it can do.
  • Separating extraction from tagging was the right call: it made the heading detection logic independently testable and swappable without touching the structure tree code.
  • Building for a strict external validator (veraPDF) is a forcing function for correctness — there's no 'good enough', it either passes or it doesn't.
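The separation between extraction and tagging mentioned above can be pictured as a narrow interface between the two sides. The sketch below is hypothetical (the names `Block`, `HeadingDetector`, `ratio_detector`, and `build_structure` are invented for illustration, and using the smallest font size as the body size is a deliberate simplification): the tagging side only consumes tag names, so a detector can be tested alone or replaced without touching it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Block:
    text: str
    font_size: float

# A heading detector maps extracted blocks to structure-tag names, one per block.
HeadingDetector = Callable[[List[Block]], List[str]]

def ratio_detector(blocks: List[Block]) -> List[str]:
    """Toy detector: treat the smallest font size as body text (a simplification)."""
    body = min(b.font_size for b in blocks)
    return ["H1" if b.font_size / body >= 1.5 else "P" for b in blocks]

def build_structure(blocks: List[Block],
                    detect: HeadingDetector) -> List[Tuple[str, str]]:
    """Tagging side: pairs each block with its tag and never inspects font metrics."""
    return [(tag, b.text) for tag, b in zip(detect(blocks), blocks)]

blocks = [Block("Title", 24.0), Block("Body text", 12.0)]
print(build_structure(blocks, ratio_detector))
# [('H1', 'Title'), ('P', 'Body text')]
```

Swapping in a different detector, for example a stub that returns "P" for every block, requires no change to `build_structure`, which is the property that made the heading logic independently testable.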