Dissecting PDF Documents: Mark S. Rasmussen - iPaper - mark@improve.dk
• So what is it about?
– PDF documents
– Tooling
– Extracting data
The PDF Format
• 1.0 released in 1993
• Open standard as of July 1st 2008
• Reference publicly available
– http://www.adobe.com/devnet/pdf/pdf_reference_archive.html
[Chart: PDF 1.3, PDF 1.4, PDF 1.5, PDF 1.6, PDF 1.7, OOXML 1.0; vertical axis 0 to 1500]
PDF Structure
• Header
– %PDF-1.4
– %âãÏÓ (optional but common)
• Body
– Objects
• Xref table
– Index table containing pointers to objects
• Trailer
– Pointers to Xref table, key objects
– %%EOF
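The layout above also describes how a reader finds its way around a file: check the header, jump to the end, and follow `startxref` back to the xref table. A minimal Python sketch of that walk, assuming a classic (non-stream) xref table with a single subsection and no incremental updates. `build_minimal_pdf` and `read_structure` are hypothetical helper names made up for illustration, not part of any library.

```python
import re


def build_minimal_pdf() -> bytes:
    """Assemble a tiny, uncompressed PDF so every byte offset is known."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [] /Count 0 >>",
    ]
    # Header line plus the common binary-marker comment line.
    out = bytearray(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset where "N 0 obj" starts
        out += b"%d 0 obj\n%s\nendobj\n" % (num, body)
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry 0 is always the free-list head
    for off in offsets:
        out += b"%010d 00000 n \n" % off
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


def read_structure(data: bytes) -> dict:
    """Read header version, xref position, and in-use object offsets."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    if header is None:
        raise ValueError("not a PDF: missing %PDF- header")
    # The trailer ends with: startxref, the table's byte offset, %%EOF.
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if tail is None:
        raise ValueError("missing startxref / %%EOF")
    xref_pos = int(tail.group(1))
    lines = data[xref_pos:].splitlines()
    if lines[0].strip() != b"xref":
        raise ValueError("startxref does not point at an xref table")
    first, count = map(int, lines[1].split())
    offsets = {}
    for i, entry in enumerate(lines[2:2 + count]):
        offset, _generation, kind = entry.split()
        if kind == b"n":  # "n" = in use, "f" = free
            offsets[first + i] = int(offset)
    return {
        "version": header.group(1).decode(),
        "xref_pos": xref_pos,
        "offsets": offsets,
    }


pdf = build_minimal_pdf()
info = read_structure(pdf)
for num, off in info["offsets"].items():
    # Each in-use offset should land exactly on the matching "N 0 obj".
    assert pdf[off:].startswith(b"%d 0 obj" % num)
print(info["version"], sorted(info["offsets"]))
```

Reading from the end of the file is the point of the design: the xref table lets a reader seek straight to any object without scanning the body.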
PDF Objects
“A PDF file should be thought of as a flattened
representation of a data structure consisting of a
collection of objects that can refer to each other in
any arbitrary way.”
[Diagram: objects numbered 1 to 6]
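To make the quoted description concrete: in the file, each object is written as `N G obj ... endobj`, and one object points at another with an indirect reference of the form `N G R`. A rough Python sketch that recovers the reference graph, assuming uncompressed objects (no object streams). The sample bytes are made up, the regex scan is an illustration rather than a real parser, and `object_graph` is a hypothetical helper name.

```python
import re

# Made-up sample: four objects, with 2 and 3 referring to each other.
SAMPLE = b"""\
1 0 obj
<< /Type /Catalog /Pages 2 0 R /Outlines 4 0 R >>
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R >>
endobj
4 0 obj
<< /Type /Outlines /Count 0 >>
endobj
"""

# "N G obj ... endobj" delimits one object; "N G R" is an indirect reference.
OBJ = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.S)
REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def object_graph(data: bytes) -> dict:
    """Map each object number to the sorted object numbers it refers to."""
    graph = {}
    for match in OBJ.finditer(data):
        number = int(match.group(1))
        body = match.group(3)
        graph[number] = sorted({int(r.group(1)) for r in REF.finditer(body)})
    return graph


for number, targets in object_graph(SAMPLE).items():
    print(number, "->", targets)
```

Note that the page (3) points back at its parent (2), so the graph has cycles; any traversal needs to track visited objects.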
Text According To Xpdf
[Figure: two page diagrams with numbered text blocks]
Physical Text According To Xpdf
[Figure: two page diagrams with numbered text blocks]
SWFTools
• Open source (GPL)
• PDF2SWF converts PDF files to SWF format
– Based on Xpdf
– Active mailing list
– Author actively working on project
– Use dev snapshots / git repo
– Stable, but some kinks
• http://www.swftools.org
iTextSharp
• Open source (5.0: AGPL(!), 4.1: LGPL)
• Commercial license available
• .NET port of iText
• Very stable
• Excellent for creating & modifying PDFs
• No rendering capabilities
• http://itextsharp.sourceforge.net/
• http://itextpdf.com/
Extracting Bookmarks
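Bookmarks live in the document outline: a root `/Outlines` dictionary whose `/First` entry points at the first item, with each item carrying a `/Title` and linking to siblings via `/Next` and to children via `/First`. A minimal Python sketch of walking that tree, assuming uncompressed objects and plain literal-string titles without escapes (real files often use UTF-16 or hex strings). This is not the iTextSharp route; the sample bytes are made up and `extract_bookmarks`, `walk`, and `ref` are hypothetical helper names.

```python
import re

# Made-up sample outline: two top-level items, the first with one child.
SAMPLE = b"""\
1 0 obj
<< /Type /Outlines /First 2 0 R /Last 3 0 R /Count 2 >>
endobj
2 0 obj
<< /Title (Introduction) /Parent 1 0 R /Next 3 0 R /First 4 0 R /Last 4 0 R /Count 1 >>
endobj
3 0 obj
<< /Title (Tooling) /Parent 1 0 R /Prev 2 0 R >>
endobj
4 0 obj
<< /Title (The PDF Format) /Parent 2 0 R >>
endobj
"""

OBJ = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj", re.S)
TITLE = re.compile(rb"/Title\s*\(([^()\\]*)\)")


def ref(body: bytes, key: bytes):
    """Return the object number that /key refers to, or None if absent."""
    m = re.search(rb"/" + key + rb"\s+(\d+)\s+\d+\s+R\b", body)
    return int(m.group(1)) if m else None


def walk(objects: dict, num, depth: int = 0):
    """Yield (depth, title) pairs, children before the next sibling."""
    # Assumes a well-formed outline; a cyclic /Next chain would loop forever.
    while num is not None:
        body = objects[num]
        m = TITLE.search(body)
        text = m.group(1).decode("latin-1") if m else ""
        yield depth, text
        child = ref(body, b"First")
        if child is not None:
            yield from walk(objects, child, depth + 1)
        num = ref(body, b"Next")


def extract_bookmarks(data: bytes) -> list:
    objects = {int(m.group(1)): m.group(2) for m in OBJ.finditer(data)}
    root = next(
        (b for b in objects.values() if re.search(rb"/Type\s*/Outlines\b", b)),
        None,
    )
    if root is None:
        return []  # document has no outline
    return list(walk(objects, ref(root, b"First")))


for depth, title in extract_bookmarks(SAMPLE):
    print("  " * depth + title)
```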
Extracting Links
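External links are stored as link annotations whose action is a URI action, written as `/A << /S /URI /URI (target) >>`. One low-tech way to pull them out is to scan for those `/URI (...)` entries. A minimal Python sketch, assuming the annotations are stored uncompressed and the targets are literal strings without escaped or nested parentheses. This is not the iTextSharp or Xpdf route; the sample bytes are made up and `extract_uris` is a hypothetical helper name.

```python
import re

# Made-up sample: two link annotations with URI actions.
SAMPLE = b"""\
5 0 obj
<< /Type /Annot /Subtype /Link /Rect [72 700 200 720]
   /A << /S /URI /URI (http://improve.dk) >> >>
endobj
6 0 obj
<< /Type /Annot /Subtype /Link /Rect [72 650 200 670]
   /A << /S /URI /URI (http://www.swftools.org) >> >>
endobj
"""

# Matches the /URI key followed by a literal string, not the "/S /URI" name.
URI = re.compile(rb"/URI\s*\(([^()\\]*)\)")


def extract_uris(data: bytes) -> list:
    """Return link targets in file order."""
    return [m.group(1).decode("latin-1") for m in URI.finditer(data)]


for uri in extract_uris(SAMPLE):
    print(uri)
```

The `/Rect` entry next to each action gives the clickable area in page coordinates, which is what a viewer needs to overlay the link on the rendered page.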
Thank you!
For attending this session
Email mark@improve.dk
Twitter @improvedk
Blog improve.dk