vertopal.com_2-Working-with-PDFs

The document provides an overview of working with PDF files in Python, specifically using the PyPDF2 library. It covers installation, reading text from PDFs, and limitations regarding image extraction and writing to PDFs. The document also includes examples of how to read and append pages from PDF files using PyPDF2.

Uploaded by

MuHaMMad SHouKaT

Working with PDF Files

Welcome back, Agent. You will often have to deal with PDF files. There are many libraries in
Python for working with PDFs, each with its pros and cons, the most common one being
PyPDF2. You can install it with the command below (note the capitalization: pip does not care
about case in package names, but the import statement later on does):

pip install PyPDF2

Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a
special encoding, are encrypted, or were created with a program that doesn't work well with
PyPDF2 may not be readable. If you find yourself in this situation, try using the
libraries linked above, but keep in mind that these may not work either. The reason is that a PDF
has many different parameters and its settings can be non-standard. For example, text
can be stored as an image instead of as encoded characters, in which case there is no text for
the library to extract.
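
One way to tell these cases apart before extracting anything is sketched below. This is a minimal sketch, not part of the original notebook: `diagnose_reader` is a hypothetical helper name, and it assumes the PyPDF2 3.x reader interface (`is_encrypted`, `pages`, and `extract_text()`). A result of 'no extractable text' is only a hint that the pages may be stored as images.

```python
def diagnose_reader(reader):
    """Return 'encrypted', 'no extractable text', or 'ok' for a PdfReader-like object.

    Hypothetical helper: assumes the PyPDF2 3.x attributes
    is_encrypted, pages, and page.extract_text().
    """
    # Encrypted files must be decrypted before their pages can be read.
    if reader.is_encrypted:
        return "encrypted"

    # Scanned documents often store each page as an image, so no text comes back.
    texts = [(page.extract_text() or "").strip() for page in reader.pages]
    if not any(texts):
        return "no extractable text"

    return "ok"


# Example use (assumes PyPDF2 is installed and the file exists):
# import PyPDF2
# with open('Working_Business_Proposal.pdf', 'rb') as f:
#     print(diagnose_reader(PyPDF2.PdfReader(f)))
```

If the result is anything other than 'ok', that is the point at which trying a different library becomes worthwhile.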

As far as PyPDF2 is concerned, it can only read the text from a PDF document; it won't be able to
grab images or other media files from a PDF.

Working with PyPDF2


Let's begin by showing the basics of the PyPDF2 library.

!pip install PyPDF2

Collecting PyPDF2
Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
------------------------------------ 232.6/232.6 kB 490.8 kB/s
eta 0:00:00
Requirement already satisfied: typing_extensions>=3.10.0.0 in c:\
users\jmpor\anaconda3\lib\site-packages (from PyPDF2) (4.3.0)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1

# Note the capitalization
import PyPDF2

Reading PDFs
Similar to the csv library, we open a PDF, then create a reader object for it. Notice that we open
the file in binary read mode, 'rb', instead of just 'r'.

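# Notice we open it in binary mode with 'rb'
f = open('Working_Business_Proposal.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)

# Number of pages in the document
len(pdf_reader.pages)

# Pages are zero-indexed, so index 0 is the first page
page_number = 0
page_one = pdf_reader.pages[page_number]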

We can then extract the text:

page_one_text = page_one.extract_text()

page_one_text

'Business Proposal The Revolution is Coming Leverage agile frameworks
to provide a robust synopsis for high level overviews. Iterative
approaches to corporate strategy foster collaborative thinking to
further the overall value proposition. Organically grow the holistic
world view of disruptive innovation via workplace diversity and
empowerment. Bring to the table win-win survival strategies to ensure
proactive domination. At the end of the day, going forward, a new
normal that has evolved from generation X is on the runway heading
towards a streamlined cloud solution. User generated content in real-
time will have multiple touchpoints for offshoring. Capitalize on low
hanging fruit to identify a ballpark value added activity to beta
test. Override the digital divide with additional clickthroughs from
DevOps. Nanotechnology immersion along the information highway will
close the loop on focusing solely on the bottom line. Podcasting
operational change management inside of workflows to establish a
framework. Taking seamless key performance indicators offline to
maximise the long tail. Keeping your eye on the ball while performing
a deep dive on the start-up mentality to derive convergence on cross-
platform integration. Collaboratively administrate empowered markets
via plug-and-play networks. Dynamically procrastinate B2C users after
installed base benefits. Dramatically visualize customer directed
convergence without revolutionary ROI. Efficiently unleash cross-media
information without cross-media value. Quickly maximize timely
deliverables for real-time schemas. Dramatically maintain clicks-and-
mortar solutions without functional solutions. BUSINESS PROPOSAL!1'

f.close()
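
The open, read, and close steps above can also be wrapped in a with block, which closes the file even if extraction raises an error. The following is a sketch under assumptions rather than the notebook's own code: `read_page_text` and its `reader_factory` parameter are hypothetical names introduced here, and by default the function uses `PyPDF2.PdfReader` exactly as in the cells above.

```python
def read_page_text(path, page_index=0, reader_factory=None):
    """Return the text of one page; the file is closed when the with block exits.

    Hypothetical helper. reader_factory is an illustrative parameter that
    defaults to PyPDF2.PdfReader (assumes PyPDF2 3.x is installed).
    """
    if reader_factory is None:
        import PyPDF2
        reader_factory = PyPDF2.PdfReader

    # 'rb' because a PDF is a binary file, not plain text.
    with open(path, "rb") as f:
        reader = reader_factory(f)
        # Extract inside the block: the reader still needs the open file here.
        return reader.pages[page_index].extract_text()


# Example use (assumes the file exists):
# page_one_text = read_page_text('Working_Business_Proposal.pdf', 0)
```

Extracting the text before the block ends matters because the reader keeps reading from the open file when a page is accessed.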

Adding to PDFs
We cannot write arbitrary text into a PDF with PyPDF2 because of the differences between the
single string type of Python and the variety of fonts, placements, and other parameters that a
PDF could have.

What we can do is copy pages and append pages to the end.


f = open('Working_Business_Proposal.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)

page_number = 0
page_one = pdf_reader.pages[page_number]

pdf_writer = PyPDF2.PdfWriter()

pdf_writer.add_page(page_one);

pdf_output = open("Some_New_Doc.pdf","wb")

pdf_writer.write(pdf_output)

(False, <_io.BufferedWriter name='Some_New_Doc.pdf'>)

# Close the output file so the written data is flushed to disk
pdf_output.close()
f.close()

Now we have copied a page and added it to another new document!
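
The same copy-and-append pattern can be written as two small functions. This is a rough sketch, assuming the PyPDF2 3.x writer interface used above (`add_page` and `write`); the names `append_pages` and `save_writer` are hypothetical and do not come from the library.

```python
def append_pages(writer, reader, indices=None):
    """Append pages from reader to the end of writer.

    Hypothetical helper: indices is a list of zero-based page indices,
    or None to append every page of reader.
    """
    pages = reader.pages
    if indices is None:
        indices = range(len(pages))
    for i in indices:
        writer.add_page(pages[i])
    return writer


def save_writer(writer, path):
    """Write the collected pages to path; the with block closes the file."""
    with open(path, "wb") as out:
        writer.write(out)


# Example use (assumes PyPDF2 is installed and the input file exists):
# import PyPDF2
# with open('Working_Business_Proposal.pdf', 'rb') as f:
#     writer = append_pages(PyPDF2.PdfWriter(), PyPDF2.PdfReader(f), [0])
#     save_writer(writer, 'Some_New_Doc.pdf')
```

Calling `append_pages` again with a second reader before saving adds that document's pages after the first one's.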

Simple Example
Let's try to grab all the text from this PDF file:

f = open('Working_Business_Proposal.pdf','rb')

# List of every page's text.
# Index i holds the text of page i + 1, since indices are zero-based.
pdf_text = []

pdf_reader = PyPDF2.PdfReader(f)

for p in range(len(pdf_reader.pages)):
    # Index with p so that each page is read in turn
    page = pdf_reader.pages[p]
    pdf_text.append(page.extract_text())

f.close()

pdf_text

['Business Proposal The Revolution is Coming Leverage agile frameworks
to provide a robust synopsis for high level overviews. Iterative
approaches to corporate strategy foster collaborative thinking to
further the overall value proposition. Organically grow the holistic
world view of disruptive innovation via workplace diversity and
empowerment. Bring to the table win-win survival strategies to ensure
proactive domination. At the end of the day, going forward, a new
normal that has evolved from generation X is on the runway heading
towards a streamlined cloud solution. User generated content in real-
time will have multiple touchpoints for offshoring. Capitalize on low
hanging fruit to identify a ballpark value added activity to beta
test. Override the digital divide with additional clickthroughs from
DevOps. Nanotechnology immersion along the information highway will
close the loop on focusing solely on the bottom line. Podcasting
operational change management inside of workflows to establish a
framework. Taking seamless key performance indicators offline to
maximise the long tail. Keeping your eye on the ball while performing
a deep dive on the start-up mentality to derive convergence on cross-
platform integration. Collaboratively administrate empowered markets
via plug-and-play networks. Dynamically procrastinate B2C users after
installed base benefits. Dramatically visualize customer directed
convergence without revolutionary ROI. Efficiently unleash cross-media
information without cross-media value. Quickly maximize timely
deliverables for real-time schemas. Dramatically maintain clicks-and-
mortar solutions without functional solutions. BUSINESS PROPOSAL!1',
 ...]

print(pdf_text[0])

Business Proposal The Revolution is Coming Leverage agile frameworks
to provide a robust synopsis for high level overviews. Iterative
approaches to corporate strategy foster collaborative thinking to
further the overall value proposition. Organically grow the holistic
world view of disruptive innovation via workplace diversity and
empowerment. Bring to the table win-win survival strategies to ensure
proactive domination. At the end of the day, going forward, a new
normal that has evolved from generation X is on the runway heading
towards a streamlined cloud solution. User generated content in real-
time will have multiple touchpoints for offshoring. Capitalize on low
hanging fruit to identify a ballpark value added activity to beta
test. Override the digital divide with additional clickthroughs from
DevOps. Nanotechnology immersion along the information highway will
close the loop on focusing solely on the bottom line. Podcasting
operational change management inside of workflows to establish a
framework. Taking seamless key performance indicators offline to
maximise the long tail. Keeping your eye on the ball while performing
a deep dive on the start-up mentality to derive convergence on cross-
platform integration. Collaboratively administrate empowered markets
via plug-and-play networks. Dynamically procrastinate B2C users after
installed base benefits. Dramatically visualize customer directed
convergence without revolutionary ROI. Efficiently unleash cross-media
information without cross-media value. Quickly maximize timely
deliverables for real-time schemas. Dramatically maintain clicks-and-
mortar solutions without functional solutions. BUSINESS PROPOSAL!1
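
Because list indices start at zero while page numbers start at one, it can help to key the text by page number instead. A minimal sketch, assuming a reader object like the `pdf_reader` above; `text_by_page` is a hypothetical helper name:

```python
def text_by_page(reader):
    """Map 1-based page numbers to the text extracted from each page.

    Hypothetical helper: assumes reader.pages is iterable and each page
    has extract_text(), as with PyPDF2.PdfReader in PyPDF2 3.x.
    """
    return {
        number: page.extract_text()
        for number, page in enumerate(reader.pages, start=1)
    }


# Example use (assumes PyPDF2 is installed and the file exists):
# import PyPDF2
# with open('Working_Business_Proposal.pdf', 'rb') as f:
#     pages = text_by_page(PyPDF2.PdfReader(f))
# print(pages[1])  # text of the first page
```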

Excellent work! That is all for PyPDF2 for now. Remember that this won't work with every PDF
file and that it is limited in scope to the text of PDFs.
