Dissecting PDF Documents: Mark S. Rasmussen - Ipaper Mark@Improve - DK

This document discusses tools and techniques for analyzing and extracting information from PDF documents. It provides an overview of the PDF format structure and object model. It then describes several popular open-source and commercial libraries for parsing PDFs, including ABCpdf, Acrobat, Xpdf, SWFTools, and iTextSharp. It discusses how these tools can be used to extract text, bookmarks, links, and other data from PDF documents.


Dissecting PDF Documents

Mark S. Rasmussen – iPaper


[email protected]
What Is This Session NOT About?
• Creating PDFs
• How to use Acrobat
• Transparency flattening options in InDesign

• So what is it about?
– PDF documents
– Tooling
– Extracting data
The PDF Format
• 1.0 released in 1993
• Open standard (ISO 32000-1) as of July 1st 2008
• Reference publicly available
– http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

[Figure: bar chart, vertical axis 0 to 1500, with bars for PDF 1.3, PDF 1.4, PDF 1.5, PDF 1.6, PDF 1.7 and OOXML 1.0]
PDF Structure
• Header
– %PDF-1.4
– %âãÏÓ (optional but common)

• Body
– Objects

• Xref table
– Index table containing pointers to objects

• Trailer
– Pointers to Xref table, key objects
– %%EOF
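
The four parts listed above can be located with plain byte searches, working backwards from the end of the file. The following is a minimal sketch in Python, assuming a classic (non-stream) xref table; the function name `locate_structure` and the tiny hand-built sample file are illustrative and not taken from the talk.

```python
import re

def locate_structure(data: bytes) -> dict:
    """Find the header version, the xref offset and the trailer of a PDF."""
    # Header: the first line, e.g. %PDF-1.4
    header = re.match(rb"%PDF-(\d\.\d)", data)
    if header is None:
        raise ValueError("not a PDF: missing %PDF- header")
    # Readers start at the end of the file: the last %%EOF / startxref pair wins.
    eof = data.rfind(b"%%EOF")
    if eof < 0:
        raise ValueError("missing %%EOF marker")
    startxref = data.rfind(b"startxref", 0, eof)
    if startxref < 0:
        raise ValueError("missing startxref keyword")
    xref_offset = int(data[startxref + len(b"startxref"):eof].split()[0])
    trailer = data.rfind(b"trailer", 0, startxref)
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": xref_offset,
        "has_xref_table": data[xref_offset:xref_offset + 4] == b"xref",
        "trailer": (data[trailer + len(b"trailer"):startxref].strip()
                    if trailer >= 0 else None),
    }

def build_sample() -> bytes:
    """Assemble a tiny PDF-like file: header, one object, xref table, trailer."""
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_offset = len(body)
    xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    trailer = (b"trailer\n<< /Size 2 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
               % xref_offset)
    return body + xref + trailer

SAMPLE = build_sample()
info = locate_structure(SAMPLE)
print(info["version"], info["xref_offset"], info["trailer"])
```

Files that store the cross-reference data in a stream object (possible from PDF 1.5 onwards) have no `trailer` keyword; the sketch reports `has_xref_table` as false for them and does not attempt to decode the stream.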
PDF Objects
“A PDF file should be thought of as a flattened representation of a data structure consisting of a collection of objects that can refer to each other in any arbitrary way.”

• Boolean, Number, String, Name, Array, Dictionary, Stream, Null
• Indirect & direct objects
• Random access
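
Random access works because the xref table records the byte offset of every indirect object, so a reader can jump straight to the object it needs. A minimal sketch, assuming a single classic xref section and a hand-built sample; `parse_xref_table`, `read_object` and the sample bytes are hypothetical names introduced for illustration.

```python
import re

def parse_xref_table(data: bytes, xref_offset: int) -> dict:
    """Map object number -> byte offset for the in-use entries of one xref section."""
    lines = data[xref_offset:].splitlines()
    if not lines or lines[0].strip() != b"xref":
        raise ValueError("no classic xref table at this offset")
    offsets, i = {}, 1
    # Each subsection starts with "<first object number> <entry count>".
    while i < len(lines) and re.fullmatch(rb"\d+ \d+", lines[i].strip()):
        first, count = map(int, lines[i].split())
        for n in range(count):
            offset, _generation, kind = lines[i + 1 + n].split()
            if kind == b"n":  # 'n' = in use, 'f' = free
                offsets[first + n] = int(offset)
        i += 1 + count
    return offsets

OBJECT_RE = re.compile(rb"(\d+) (\d+) obj\s*(.*?)\s*endobj", re.S)

def read_object(data: bytes, offset: int) -> bytes:
    """Return the raw bytes between 'N G obj' and 'endobj' at a known offset."""
    match = OBJECT_RE.match(data, offset)
    if match is None:
        raise ValueError("no indirect object at offset %d" % offset)
    return match.group(3)

# Hand-built sample: two indirect objects, the first referring to the second.
SAMPLE = b"%PDF-1.4\n"
OFFSET_1 = len(SAMPLE)
SAMPLE += b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
OFFSET_2 = len(SAMPLE)
SAMPLE += b"2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n"
XREF_OFFSET = len(SAMPLE)
SAMPLE += (b"xref\n0 3\n0000000000 65535 f \n%010d 00000 n \n%010d 00000 n \n"
           % (OFFSET_1, OFFSET_2))
SAMPLE += b"trailer\n<< /Size 3 /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % XREF_OFFSET

offsets = parse_xref_table(SAMPLE, XREF_OFFSET)
print(read_object(SAMPLE, offsets[2]))
```

Resolving the indirect reference `2 0 R` in the catalog is then a dictionary lookup followed by one seek, with no need to scan the rest of the body.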
Reading A PDF – The Ninja Way!
Incremental Changes
• Fast saves, but not for free
• Undo & history
• Save vs Save As
• Single-pass writing
• Linearization
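
An incremental save appends new objects, a new xref section and a new trailer after the existing `%%EOF`, which is why earlier revisions remain recoverable. A rough sketch of that idea, assuming that every `%%EOF` marker ends one revision; this is a heuristic (linearized files carry an extra early `%%EOF`, and the marker could in principle occur inside stream data), and `split_revisions` and the schematic byte strings are illustrative only.

```python
def split_revisions(data: bytes) -> list:
    """Return the cumulative file contents as of each %%EOF marker."""
    marker = b"%%EOF"
    revisions, pos = [], 0
    while True:
        idx = data.find(marker, pos)
        if idx < 0:
            return revisions
        end = idx + len(marker)
        # Keep the end-of-line that normally follows the marker.
        if data[end:end + 2] == b"\r\n":
            end += 2
        elif data[end:end + 1] in (b"\n", b"\r"):
            end += 1
        revisions.append(data[:end])
        pos = end

# Schematic byte strings, not valid PDFs: the xref entries are omitted.
ORIGINAL = (b"%PDF-1.4\n1 0 obj\n(first draft)\nendobj\n"
            b"xref\ntrailer\n<< /Size 2 >>\nstartxref\n38\n%%EOF\n")
UPDATE = (b"1 0 obj\n(second draft)\nendobj\n"
          b"xref\ntrailer\n<< /Size 2 /Prev 38 >>\nstartxref\n120\n%%EOF\n")

revisions = split_revisions(ORIGINAL + UPDATE)
print(len(revisions), "revisions; the first is", len(revisions[0]), "bytes long")
```

Truncating the file at an earlier marker yields the document as it looked before the later saves, which is the "undo & history" side effect noted above. "Save As" typically rewrites the file in one piece, so the chain collapses to a single revision.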
Linearization & Xref Chaining
PDF Objects: Image
• Stream object with dictionary header
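
To illustrate "stream object with dictionary header": the dictionary describes the image (size, colour space, filter) and the stream holds the pixel data. A minimal sketch, assuming a direct `/Length` value, a single `/FlateDecode` filter and no nested dictionaries in the header; `read_stream`, `int_entry` and the 2×2 sample image are hypothetical and built by hand.

```python
import re
import zlib

def read_stream(obj: bytes):
    """Split a stream object into its dictionary header and decoded data."""
    match = re.search(rb"<<(.*?)>>\s*stream\r?\n", obj, re.S)
    if match is None:
        raise ValueError("not a stream object")
    header = match.group(1)
    # /Length gives the number of encoded bytes that follow the 'stream' keyword.
    length = int_entry(header, b"Length")
    raw = obj[match.end():match.end() + length]
    if b"/FlateDecode" in header:
        raw = zlib.decompress(raw)  # Flate is zlib/deflate compression
    return header, raw

def int_entry(header: bytes, name: bytes) -> int:
    """Read an integer dictionary entry such as /Width 2."""
    match = re.search(rb"/" + re.escape(name) + rb"\s+(\d+)", header)
    if match is None:
        raise KeyError(name.decode("ascii"))
    return int(match.group(1))

# Hand-built sample: a 2x2 grayscale image, one byte per pixel.
PIXELS = bytes([0, 255, 255, 0])
COMPRESSED = zlib.compress(PIXELS)
IMAGE_OBJ = (
    b"5 0 obj\n<< /Type /XObject /Subtype /Image /Width 2 /Height 2 "
    b"/ColorSpace /DeviceGray /BitsPerComponent 8 "
    b"/Filter /FlateDecode /Length %d >>\nstream\n" % len(COMPRESSED)
    + COMPRESSED + b"\nendstream\nendobj\n"
)

header, pixels = read_stream(IMAGE_OBJ)
print(int_entry(header, b"Width"), int_entry(header, b"Height"), list(pixels))
```

Images stored with other filters (for example `/DCTDecode` for JPEG data) are not decoded by this sketch; their stream bytes are returned as they are.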
ABCpdf
• Commercial
• Excellent .NET API
• ObjectSoup is a valuable friend
• Good image rendering
• Useless SWF rendering
• Unstable rendering
• Decent support
• http://www.websupergoo.com/secret.htm
Acrobat
• Commercial (tricky license)
• No COM libraries after 7.x
• Surprisingly stable and fast
• Ugly API
Rendering Using Acrobat
Xpdf
• Open source (GPL)
• pdffonts, pdfimages, pdfinfo, pdftops, pdftotext
• Basis for many other libraries & tools
• Commercial license & COM library available at www.glyphandcog.com
• http://www.foolabs.com/xpdf/
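
The Xpdf command-line tools are easy to drive from a script. A minimal sketch that wraps `pdftotext`, assuming the binary is installed and on the PATH; the wrapper functions `pdftotext_command` and `extract_text` are illustrative names, while `-layout`, `-f`, `-l`, `-enc` and the `-` output target are existing pdftotext options.

```python
import shutil
import subprocess

def pdftotext_command(pdf_path, physical_layout=False, first_page=None, last_page=None):
    """Build the argument list for a pdftotext call that writes to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if physical_layout:
        cmd.append("-layout")  # keep the physical layout of the page
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    cmd += [pdf_path, "-"]  # "-" sends the text to stdout instead of a file
    return cmd

def extract_text(pdf_path, **options):
    """Run pdftotext and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(pdftotext_command(pdf_path, **options),
                            capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")

print(" ".join(pdftotext_command("sample.pdf", physical_layout=True, first_page=1)))
```

Without `-layout` the tool returns text in its own reading order; with it, the output mirrors the physical placement on the page, which is the difference between the two Xpdf text slides in this deck.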
PDF Font Management
• Client must have fonts used in PDF document
• However…
– Complete font can be embedded
– Or a subset
– 14 standard fonts (Courier, Helvetica, Times + Symbol & ITC Zapf Dingbats)
– Font replacement
Text In PDF
• No concept of text, just characters
• Flow order not guaranteed
• Requires guesstimation to extract text
• Extraction may require embedded fonts
• Lots of tools, some better than others
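
The "guesstimation" can be as simple as grouping positioned characters into lines and inferring spaces from horizontal gaps. A minimal sketch under stated assumptions: glyph positions and widths are already known, the text is horizontal, and the tolerances `line_tol` and `space_gap` are arbitrary illustrative values. It is not the algorithm used by any of the tools in this talk.

```python
from typing import List, NamedTuple

class Glyph(NamedTuple):
    x: float      # left edge, in PDF units
    y: float      # baseline; the PDF y axis points upwards
    width: float
    char: str

def guess_text(glyphs: List[Glyph], line_tol: float = 2.0, space_gap: float = 1.5) -> str:
    """Rebuild lines of text from glyphs drawn in arbitrary order."""
    lines = []  # list of (baseline of first glyph, glyphs on that line)
    for glyph in sorted(glyphs, key=lambda g: -g.y):  # top of the page first
        if lines and abs(lines[-1][0] - glyph.y) <= line_tol:
            lines[-1][1].append(glyph)
        else:
            lines.append((glyph.y, [glyph]))
    output = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right
        text, previous = "", None
        for glyph in row:
            # A gap wider than space_gap is assumed to be a word break.
            if previous is not None and glyph.x - (previous.x + previous.width) > space_gap:
                text += " "
            text += glyph.char
            previous = glyph
        output.append(text)
    return "\n".join(output)

# Glyphs in drawing order, which need not match reading order.
SCRAMBLED = [
    Glyph(34, 700, 6, "D"), Glyph(16, 688, 6, "k"), Glyph(10, 700, 6, "H"),
    Glyph(40, 700, 6, "F"), Glyph(10, 688.5, 6, "o"), Glyph(28, 700, 6, "P"),
    Glyph(16, 700, 6, "i"),
]
print(guess_text(SCRAMBLED))
```

Multi-column pages, rotated text and fonts without a usable character mapping all defeat a heuristic this simple, which helps explain why different tools return different results for the same page.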
Text According To ABCpdf
[Figure: numbered text blocks on a sample page, illustrating the reading order ABCpdf produces]
Text According To Xpdf
[Figure: numbered text blocks on a sample page, illustrating the reading order Xpdf produces]
Physical Text According To Xpdf
[Figure: numbered text blocks on a sample page, illustrating the order Xpdf produces in physical layout mode]
SWFTools
• Open source (GPL)
• PDF2SWF converts PDF files to SWF format
– Based on Xpdf
– Active mailing list
– Author actively working on project
– Use dev snapshots / git repo
– Stable, but some kinks
• http://www.swftools.org
iTextSharp
• Open source (5.0 – AGPL(!), 4.1 - LGPL)
• Commercial license available
• .NET port of iText
• Very stable
• Excellent for creating & modifying PDFs
• No rendering capabilities
• http://itextsharp.sourceforge.net/
• http://itextpdf.com/
Extracting Bookmarks
Extracting Links
Thank you!
For attending this session

Email [email protected]
Twitter @improvedk
Blog improve.dk
