Dissecting PDF Documents: Mark S. Rasmussen - Ipaper Mark@Improve - DK

This document discusses tools and techniques for analyzing and extracting information from PDF documents. It provides an overview of the PDF format structure and object model. It then describes several popular open-source and commercial libraries for parsing PDFs, including ABCpdf, Acrobat, Xpdf, SWFTools, and iTextSharp. It discusses how these tools can be used to extract text, bookmarks, links, and other data from PDF documents.

Dissecting PDF Documents

Mark S. Rasmussen – iPaper

[email protected]
What Is This Session NOT About?
• Creating PDFs
• How to use Acrobat
• Transparency flattening options in InDesign

• So what is it about?
– PDF documents
– Tooling
– Extracting data
The PDF Format
• 1.0 released in 1993
• Open standard as of July 1st 2008
• Reference publicly available
– http://www.adobe.com/devnet/pdf/pdf_reference_archive.html

[Figure: bar chart comparing PDF 1.3, PDF 1.4, PDF 1.5, PDF 1.6, PDF 1.7 and OOXML 1.0; vertical axis from 0 to 1500]
PDF Structure
• Header
– %PDF-1.4
– %âãÏÓ (optional but common)

• Body
– Objects

• Xref table
– Index table containing pointers to objects

• Trailer
– Pointers to Xref table, key objects
– %%EOF
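The four parts listed above can be located with a few lines of code. This is a minimal sketch, not a real parser: `describe_layout` and `build_sample` are hypothetical names, and the sample file is a hand-built, one-object document used only for illustration.

```python
import re

def describe_layout(data: bytes) -> dict:
    """Find the header version and the xref offset recorded in the trailer."""
    header = re.match(rb"%PDF-(\d\.\d)", data)
    if header is None:
        raise ValueError("not a PDF: missing %PDF- header")
    # Readers work backwards from the end of the file: the trailer ends with
    # 'startxref', the byte offset of the xref table, and then '%%EOF'.
    tail = re.search(rb"startxref\s+(\d+)\s+%%EOF\s*$", data)
    if tail is None:
        raise ValueError("no startxref/%%EOF pair at end of file")
    xref_offset = int(tail.group(1))
    return {
        "version": header.group(1).decode("ascii"),
        "xref_offset": xref_offset,
        "xref_found": data[xref_offset:xref_offset + 4] == b"xref",
    }

def build_sample() -> bytes:
    """Assemble a tiny illustrative file: header, one object, xref, trailer."""
    body = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    xref_at = len(body)  # the xref table starts right after the body
    xref = b"xref\n0 2\n0000000000 65535 f \n0000000009 00000 n \n"
    trailer = (b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
               b"startxref\n%d\n%%%%EOF\n" % xref_at)
    return body + xref + trailer

info = describe_layout(build_sample())
print(info)
```

Real files may use cross-reference streams instead of a plain `xref` table, in which case the `xref_found` check in this sketch would report False.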
PDF Objects
“A PDF file should be thought of as a flattened
representation of a data structure consisting of a
collection of objects that can refer to each other in
any arbitrary way.”

• Boolean, Number, String, Name, Array, Dictionary, Stream, Null
• Indirect & direct objects
• Random access
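To make the idea of objects referring to each other concrete, here is a rough sketch that lists which indirect objects each object points to. The name `reference_graph` and the three-object sample are assumptions for illustration. The regular expressions are deliberately naive: they ignore object streams and would be confused by binary stream data that happens to contain `endobj`.

```python
import re

# "N G obj ... endobj" defines indirect object N, generation G.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)
# "N G R" is an indirect reference to object N, generation G.
REF_RE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")

def reference_graph(data: bytes) -> dict:
    """Map each object number to the sorted object numbers it references."""
    graph = {}
    for match in OBJ_RE.finditer(data):
        number = int(match.group(1))
        refs = {int(r.group(1)) for r in REF_RE.finditer(match.group(3))}
        graph[number] = sorted(refs)
    return graph

# Hypothetical document fragment: catalog -> page tree -> page -> page tree.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [3 0 R] /Count 1 >>\nendobj\n"
    b"3 0 obj\n<< /Type /Page /Parent 2 0 R >>\nendobj\n"
)
graph = reference_graph(sample)
print(graph)
```

Note the cycle between objects 2 and 3: the page tree points at the page, and the page points back at its parent.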
Reading A PDF – The Ninja Way!
Incremental Changes
• Fast saves, but not for free
• Undo & history
• Save vs Save As
• Single-pass writing
• Linearization
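An incremental save appends new objects, a new xref section and a new trailer instead of rewriting the file, so each revision leaves its own `startxref` and `%%EOF` pair behind. The sketch below lists those pairs. The function name `saved_revisions` and the byte offsets in the sample are made up for illustration; the `/Prev` trailer entry is what chains a newer xref section to the previous one.

```python
import re

def saved_revisions(data: bytes) -> list:
    """Return the xref offset recorded by each trailer, oldest first."""
    pattern = rb"startxref\s+(\d+)\s+%%EOF"
    return [int(m.group(1)) for m in re.finditer(pattern, data)]

# Hypothetical file contents; the offsets 120 and 480 are illustrative only.
original = (b"%PDF-1.4\n...objects...\nxref\n...\n"
            b"trailer\n<< /Size 4 >>\nstartxref\n120\n%%EOF\n")
update = (b"4 0 obj\n(changed)\nendobj\nxref\n...\n"
          b"trailer\n<< /Size 5 /Prev 120 >>\nstartxref\n480\n%%EOF\n")

offsets = saved_revisions(original + update)
print(offsets)
```

Treat the count as a hint rather than proof of editing history: a linearized file also carries two `%%EOF` markers even when it has only been saved once.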
Linearization & Xref Chaining
PDF Objects: Image
• Stream object with dictionary header
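As a rough illustration of a stream object, the sketch below splits one into its dictionary header and its payload, and inflates the payload when the header names the `FlateDecode` filter. `read_stream` is a hypothetical helper and the 2x2 grayscale image is fabricated sample data. The sketch assumes `/Length` is a direct number and that the dictionary contains no nested dictionaries; neither holds for every real file.

```python
import re
import zlib

def read_stream(obj: bytes):
    """Split a stream object into (dictionary text, decoded payload)."""
    match = re.search(rb"<<(.*?)>>\s*stream\r?\n", obj, re.DOTALL)
    if match is None:
        raise ValueError("no stream dictionary found")
    header = match.group(1)
    length = int(re.search(rb"/Length\s+(\d+)", header).group(1))
    payload = obj[match.end():match.end() + length]
    if b"/FlateDecode" in header:
        payload = zlib.decompress(payload)  # Flate is zlib/deflate compression
    return header.strip(), payload

# Fabricated sample: a 2x2, 8-bit grayscale image, Flate-compressed.
pixels = bytes([0, 255, 255, 0])
packed = zlib.compress(pixels)
obj = (
    b"5 0 obj\n<< /Type /XObject /Subtype /Image /Width 2 /Height 2 "
    b"/ColorSpace /DeviceGray /BitsPerComponent 8 /Filter /FlateDecode "
    b"/Length %d >>\nstream\n" % len(packed)
) + packed + b"\nendstream\nendobj\n"

header, payload = read_stream(obj)
print(header)
print(list(payload))
```

Images stored with other filters, such as `DCTDecode` for JPEG data, come back undecoded from this sketch.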
ABCpdf
• Commercial
• Excellent .NET API
• ObjectSoup is a valuable friend
• Good image rendering
• Useless SWF rendering
• Unstable rendering
• Decent support
• http://www.websupergoo.com/secret.htm
Acrobat
• Commercial (tricky license)
• No COM libraries after 7.x
• Surprisingly stable and fast
• Ugly API
Rendering Using Acrobat
Xpdf
• Open source (GPL)
• pdffonts, pdfimages, pdfinfo, pdftops, pdftotext
• Basis for many other libraries & tools
• Commercial license & COM library available at www.glyphandcog.com
• http://www.foolabs.com/xpdf/
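The Xpdf command-line tools are easy to drive from a script. The sketch below builds a `pdftotext` command line; it assumes the tool is installed and on the PATH. The helper names and the file name `sample.pdf` are hypothetical. The `-layout` and `-raw` options and the use of `-` to write to standard output come from the pdftotext manual.

```python
import shutil
import subprocess

# -layout keeps the physical layout; -raw keeps content stream order.
MODES = {"default": [], "layout": ["-layout"], "raw": ["-raw"]}

def pdftotext_command(pdf_path: str, mode: str = "default") -> list:
    """Build the argument list; '-' sends the text to standard output."""
    return ["pdftotext", *MODES[mode], pdf_path, "-"]

def extract_text(pdf_path: str, mode: str = "default") -> str:
    """Run pdftotext and return its output as text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    result = subprocess.run(
        pdftotext_command(pdf_path, mode),
        capture_output=True, check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")

command = pdftotext_command("sample.pdf", "layout")
print(command)
```

Running the same file through each mode is a quick way to see how much the extracted order depends on the chosen strategy.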
PDF Font Management
• Client must have fonts used in PDF document
• However…
– Complete font can be embedded
– Or a subset
– 14 standard fonts (Courier, Helvetica and Times in four styles each, plus Symbol and ITC Zapf Dingbats)
– Font replacement
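A font's `BaseFont` name already hints at which of these cases applies: a subset font carries a tag of six uppercase letters and a plus sign in front of the real name, and the 14 standard fonts have fixed names. The sketch below classifies a name on that basis. `classify_font` is a hypothetical helper, and the tag `ABCDEF` in the usage line is invented. A name alone cannot prove that a font is fully embedded; that needs a look at the font descriptor.

```python
import re

SUBSET_TAG = re.compile(r"^[A-Z]{6}\+(.+)$")

STANDARD_14 = {
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique", "Helvetica-BoldOblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
}

def classify_font(base_font: str) -> tuple:
    """Return (kind, font name without any subset tag)."""
    match = SUBSET_TAG.match(base_font)
    if match:
        return ("subset", match.group(1))
    if base_font in STANDARD_14:
        return ("standard-14", base_font)
    # Either embedded in full or left to the viewer's font replacement.
    return ("other", base_font)

for name in ["ABCDEF+Garamond-Bold", "Helvetica", "Garamond-Bold"]:
    print(name, "->", classify_font(name))
```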
Text In PDF
• No concept of text, just characters
• Flow order not guaranteed
• Requires guesstimation to extract text
• Extraction may require embedded fonts
• Lots of tools, some better than others
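One common form of guesstimation is to rebuild lines from character positions. The sketch below assumes the text fragments and their coordinates have already been pulled out of the page by some other tool; the `Fragment` layout, the function name and the tolerance value are all assumptions for illustration. It groups fragments whose baselines are close together into a line, then reads each line left to right.

```python
from typing import List, Tuple

# x, y in PDF user space (y grows upward), followed by the text.
Fragment = Tuple[float, float, str]

def reading_order(fragments: List[Fragment],
                  line_tolerance: float = 2.0) -> List[str]:
    """Group fragments into lines by baseline, top to bottom, left to right."""
    lines: List[List[Fragment]] = []
    for frag in sorted(fragments, key=lambda f: -f[1]):  # highest y first
        if lines and abs(lines[-1][0][1] - frag[1]) <= line_tolerance:
            lines[-1].append(frag)  # close enough to the current baseline
        else:
            lines.append([frag])    # start a new line
    # Within a line, sorting the tuples orders fragments by x.
    return [" ".join(text for _, _, text in sorted(line)) for line in lines]

# Fragments listed in the order a content stream might emit them.
fragments = [
    (120.0, 700.0, "world"),
    (72.0, 700.5, "Hello"),
    (110.0, 680.0, "line"),
    (72.0, 680.0, "Second"),
]
ordered = reading_order(fragments)
print(ordered)
```

A heuristic like this falls apart on multi-column pages, where fragments from different columns share a baseline, which is one reason different tools return the same page in different orders.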
Text According To ABCpdf
[Figure: sample page with numbered text blocks, shown next to the order in which ABCpdf extracts them]
Text According To Xpdf
[Figure: the same numbered text blocks, shown next to the order in which Xpdf extracts them]
Physical Text According To Xpdf
[Figure: the same numbered text blocks, shown next to the order Xpdf produces when preserving physical layout]
SWFTools
• Open source (GPL)
• PDF2SWF converts PDF files to SWF format
– Based on Xpdf
– Active mailing list
– Author actively working on project
– Use dev snapshots / git repo
– Stable, but some kinks
• http://www.swftools.org
iTextSharp
• Open source (5.0 – AGPL(!), 4.1 - LGPL)
• Commercial license available
• .NET port of iText
• Very stable
• Excellent for creating & modifying PDFs
• No rendering capabilities
• http://itextsharp.sourceforge.net/
• http://itextpdf.com/
Extracting Bookmarks
Extracting Links
Thank you!
For attending this session

Email [email protected]
Twitter @improvedk
Blog improve.dk
