PDF Processing and Analysis With Open-Source Tools
PDF multi-tools
Before diving into any specific tasks, let’s start with
some general-purpose PDF tools and toolkits. Each
of these is capable of a wide range of tasks
(including some I won’t explicitly address here), and
they can be seen as the “Swiss Army knives” of PDF
processing. Whenever I need to get some PDF
processing or analysis done and I’m not sure which
tool to use, these are usually my starting points. In
the majority of cases, at least one of them turns out
to have the functionality I’m looking for, so it’s a
good idea to check them out if you’re not familiar
with them already.
Xpdf/Poppler
Xpdf (https://www.xpdfreader.com/) and Poppler
(https://poppler.freedesktop.org/) are both PDF
viewers that include a collection of tools for
processing and manipulating PDF files. Poppler is a
fork of Xpdf, and it adds a number of unique
tools that are not part of the original Xpdf package.
The tools included with Poppler are:
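Which of them are present depends on the installed Poppler version. As a rough sketch, the loop below checks the `PATH` for the utility names commonly shipped with Poppler. Both the list of names and the `check_tools` helper are assumptions added for illustration, so treat the output as a starting point rather than an authoritative inventory:

```shell
# Utility names as commonly shipped with Poppler (assumed list; the exact
# set varies between Poppler versions and distribution packages).
poppler_tools="pdfattach pdfdetach pdffonts pdfimages pdfinfo pdfseparate pdfsig pdftocairo pdftohtml pdftoppm pdftops pdftotext pdfunite"

# Hypothetical helper: print each given command name followed by
# "found" or "missing", depending on whether it is on the PATH.
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

# Word splitting of the unquoted variable is intentional here.
check_tools $poppler_tools
```

Most of these tools print a short usage summary when run with `-h`, which is a quick way to see what each one does.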
Apache PDFBox
Apache PDFBox (https://pdfbox.apache.org/) is an
open source Java library for working with PDF
documents. It includes a set of command-line tools
(https://pdfbox.apache.org/2.0/commandline.html)
for various PDF processing tasks. Binary
distributions (as JAR (https://en.wikipedia.org/wiki/JAR_(file_format))
packages) are available here
(https://pdfbox.apache.org/download.html) (you’ll
need the “standalone” JARs).
QPDF
QPDF (http://qpdf.sourceforge.net/) is “a command-line
program that does structural, content-preserving
transformations on PDF files”.
MuPDF
MuPDF (https://www.mupdf.com/) is “a lightweight
PDF, XPS, and E-book viewer”. It includes the mutool
(https://www.mupdf.com/docs/index.html) utility,
which can do a number of PDF processing tasks.
PDFtk
PDFtk (https://www.pdflabs.com/tools/pdftk-server/)
(server edition) is a “command-line tool for
working with PDFs” that is “commonly used for
client-side scripting or server-side processing of
PDFs”. More information can be found in the
documentation (https://www.pdflabs.com/docs/pdftk-man-page/),
and the command-line examples
page (https://www.pdflabs.com/docs/pdftk-cli-examples/).
For Ubuntu/Linux Mint users, the most
straightforward installation option is the “pdftk-java”
Debian package. This is a Java fork of PDFtk [1].
Ghostscript
Ghostscript (https://www.ghostscript.com/) is “an
interpreter for the PostScript language and PDF
files”. It provides rendering to a variety of raster and
vector formats.
option:
exiftool corrupted.pdf
Result:
verapdf -l
Result:
pdfinfo whatever.pdf
Result:
Content-Length: 24728
Content-Type: application/pdf
X-TIKA:Parsed-By: org.apache.tika.parser.DefaultParser
X-TIKA:Parsed-By: org.apache.tika.parser.pdf.PDFParser
access_permission:assemble_document: true
access_permission:can_modify: true
access_permission:can_print: true
access_permission:can_print_degraded: true
access_permission:extract_content: true
access_permission:extract_for_accessibility: true
access_permission:fill_in_form: true
access_permission:modify_annotations: true
dc:format: application/pdf; version=1.6
dcterms:created: 2021-09-02T05:52:56Z
dcterms:modified: 2021-09-02T05:53:20Z
pdf:PDFVersion: 1.6
pdf:charsPerPage: 0
pdf:docinfo:created: 2021-09-02T05:52:56Z
pdf:docinfo:creator_tool: PdfCompressor 3.1.32
pdf:docinfo:modified: 2021-09-02T05:53:20Z
pdf:docinfo:producer: CVISION Technologies
pdf:encrypted: false
pdf:hasMarkedContent: false
pdf:hasXFA: false
pdf:hasXMP: true
pdf:producer: CVISION Technologies
pdf:unmappedUnicodeCharsPerPage: 0
resourceName: whatever.pdf
xmp:CreateDate: 2021-09-02T07:52:56Z
xmp:CreatorTool: PdfCompressor 3.1.32
xmp:MetadataDate: 2021-09-02T07:53:20Z
xmp:ModifyDate: 2021-09-02T07:53:20Z
xmpMM:DocumentID: uuid:2ec84d65-f99d-49fe-9aac-bd0c1fff5e66
xmpTPg:NPages: 1
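Metadata listings in this `name: value` layout (the `X-TIKA` fields suggest this one comes from Apache Tika) are easy to filter in a pipeline. As a minimal sketch, the hypothetical helper `meta_field` below prints the value of a single field read from standard input. The helper name and the approach are illustrative assumptions, not part of any of the tools discussed here:

```shell
# Hypothetical helper: read "name: value" lines on standard input and
# print the value of the field whose name is given as the first argument.
meta_field() {
    awk -v key="$1" '
        # Match only lines that start with "<key>: ", then print the rest.
        index($0, key ": ") == 1 { print substr($0, length(key) + 3) }
    '
}
```

Piping the listing above through `meta_field pdf:encrypted`, for instance, would print `false`. Because the match is anchored at the start of the line, asking for `pdf:producer` does not accidentally pick up `pdf:docinfo:producer`.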
exiftool whatever.pdf
Result:
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron" queryBinding="xslt">
    <sch:pattern name="Disallow encrypt in trailer dictionary">
        <sch:rule context="/report/jobs/job/featuresReport/documentSecurity">
            <sch:assert test="not(encryptMetadata = 'true')">Encrypt in trailer dictionary</sch:assert>
        </sch:rule>
    </sch:pattern>
</sch:schema>
Text extraction
Text extraction from PDF documents is notoriously
hard. This post (https://filingdb.com/b/pdf-text-extraction)
gives a good overview of the main
pitfalls. Tim Allison’s excellent Brief Overview of the
Portable Document Format (PDF) and Some
Challenges for Text Extraction
(https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf)
provides a more in-depth discussion, and this really is a
must-read for anyone seriously interested in this subject.
With that said, quite a few tools are available, and
below I list a few that are useful starting points.
Link extraction
When extracting (hyper)links, it’s important to make
a distinction between the following two cases:
1. Links that are encoded as a “link annotation”,
which is a data structure in PDF that results in a
clickable link
2. Non-clickable links/URLs that are just part of the
body text.
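The two cases call for different approaches. The sketch below is only a rough illustration under stated assumptions: the function names are made up, the first function scans the raw file bytes for `/URI (...)` entries (which only works when the annotation dictionaries are stored uncompressed), and the second expects plain text that has already been extracted from the PDF:

```shell
# Case 1: link annotations keep their target in a "/URI (...)" entry.
# Scanning raw bytes only finds entries that are stored uncompressed, and
# URIs containing escaped parentheses are cut short by this simple pattern.
list_annotation_uris() {
    grep -a -o '/URI *([^)]*)' "$1" | sed -e 's|^/URI *(||' -e 's|)$||'
}

# Case 2: URLs that are merely part of the body text. This filter expects
# already-extracted plain text on standard input.
list_text_urls() {
    grep -E -o 'https?://[^[:space:]<>")]+'
}
```

Running `list_annotation_uris whatever.pdf` covers the first case. For the second, extracted text can be piped in, for example `pdftotext whatever.pdf - | list_text_urls`. If the annotations sit inside compressed object streams, rewriting the file first with `qpdf --qdf --object-streams=disable in.pdf out.pdf` should make them visible to `grep`.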
gs -sDEVICE=pdfwrite \
-dCompatibilityLevel=1.4 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=whatever_small.pdf whatever_large.pdf
Result:
trailer
<<
/DecodeParms <<
/Columns 3
/Predictor 12
>>
/Filter /FlateDecode
/ID [ <500AB94E8F45C149808B2EEE98528B78> <431017E495216040A953126BB73D0CD4> ]
/Index [ 11 10 ]
/Info 10 0 R
/Length 47
/Prev 24426
/Root 12 0 R
/Size 21
/Type /XRef
/W [ 1 2 0 ]
>>
xref
0 21
00000: 0000000000 00000 f
00001: 0000019994 00000 n
00002: 0000020399 00000 n
00003: 0000020534 00000 n
::
etc
Result:
12 0 obj
<<
/Metadata 4 0 R
/Pages 9 0 R
/Type /Catalog
>>
endobj
Result:
Final remarks
I intend to make this post a “living” document, and
will add more PDF “recipes” over time. Feel free to
leave a comment in case you spot any errors or
omissions!
Further resources
• Moritz Mähr, “Working with batches of PDF files”,
The Programming Historian 9 (2020)
(https://doi.org/10.46430/phen0088)
• PDF tools in Community Owned Digital
Preservation Tool Registry (COPTR)
(https://coptr.digipres.org/index.php/PDF)
• Policy-based assessment with VeraPDF - a first
impression
(/2017/06/01/policy-based-assessment-with-verapdf-a-first-impression)
• What’s so hard about PDF text extraction?
(https://filingdb.com/b/pdf-text-extraction)
• Tim Allison, “Brief Overview of the Portable
Document Format (PDF) and Some Challenges
for Text Extraction”
(https://irsg.bcs.org/informer/wp-content/uploads/OverviewOfTextExtractionFromPDFs.pdf)
• Yvonne Tunnat, “PDF Validation with ExifTool –
quick and not so dirty”
(https://openpreservation.org/blogs/pdf-validation-with-exiftool-quick-and-not-so-dirty/)
• Micky Lindlar, “Trouble-shooting PDF validation
errors – a case of PDF-HUL-38”
(https://openpreservation.org/blogs/trouble-shooting-pdf-validation-errors-a-case-of-pdf-hul-38/)
• Hacker News topic on this post
(https://news.ycombinator.com/item?id=33145498)
Revision history
• 7 September 2021: added sections on metadata
extraction and Tika batch processing, following
suggestions by Tim Allison.
• 8 September 2021: added section on inspecting
low-level PDF structure with iText RUPS, as
suggested by Mark Stephens; added sections on
PDFtk as suggested by Tyler Thorsted; corrected
errors in pdftocairo and gs examples.
• 9 September 2021: added section on image to
PDF conversion.
• 27 January 2022: added reference to Tim
Allison’s article on PDF text extraction.
• 7 February 2022: added sections on Exiftool, and
added reference to Yvonne Tunnat’s blog post on
PDF validation with ExifTool.
• 10 October 2022: added update on and link to
Hacker News topic on this post.
• 28 November 2022: added reference to Micky
Lindlar’s blog post on trouble-shooting PDF
validation errors.
• 16 February 2023: added section on reducing
PDF file size with ImageMagick’s convert tool.
1. The Debian package of the “original” PDFtk
software was removed from the Ubuntu
repositories
(https://www.joho.se/2020/10/01/pdftk-and-php-pdftk-on-ubuntu-18-04-without-using-snap/)
around 2018 due to “dependency issues”. ↩
Comments
atul2023-at (https://github.com/atul2023-at) wrote:
nice…
2023-04-21T10:14:53Z (https://github.com/bitsgalore/bitsgalore.github.io/issues/76#issuecomment-1517606743)
mercury64 (https://github.com/mercury64) wrote:
“FWIW, I've found simply
grep -a 'http' file.pdf useful for finding
outside links. Haven't tested exhaustively, so
may not work on all pdfs, depending on the app
that created them.”
2024-02-29T16:18:10Z (https://github.com/bitsgalore/bitsgalore.github.io/issues/76#issuecomment-1971485646)
© 2024 Johan van der Knijff. All content on this blog is licensed
under a Creative Commons Attribution 4.0 International
License (http://creativecommons.org/licenses/by/4.0/), unless
indicated otherwise. Created using Jekyll Bootstrap
(http://jekyllbootstrap.com) and Twitter Bootstrap
(https://getbootstrap.com/).