Working with PDFs
Welcome back, Agent. Often you will have to deal with PDF files. There are many libraries in
Python for working with PDFs, each with its pros and cons; the most common one is
PyPDF2. You can install it with the command below (note the capitalization: the name you
import later, PyPDF2, is case-sensitive, so make sure yours matches):
pip install PyPDF2
Keep in mind that not every PDF file can be read with this library. PDFs that are too blurry, use a
special encoding, are encrypted, or were created with a program that doesn't work
well with PyPDF2 may not be readable. If you find yourself in this situation, try another PDF
library, but keep in mind that it may not work either. The reason is that PDFs have many
different parameters and their settings can be quite non-standard; for example, text
may be stored as an image instead of as encoded characters.
In this lesson we use PyPDF2 only to read the text from a PDF document; we won't
grab images or other media files from a PDF.
Running the install prints output similar to the following:
Collecting PyPDF2
  Downloading pypdf2-3.0.1-py3-none-any.whl (232 kB)
     ------------------------------------ 232.6/232.6 kB 490.8 kB/s eta 0:00:00
Requirement already satisfied: typing_extensions>=3.10.0.0 in c:\users\jmpor\anaconda3\lib\site-packages (from PyPDF2) (4.3.0)
Installing collected packages: PyPDF2
Successfully installed PyPDF2-3.0.1
Reading PDFs
Similar to the csv library, we open a PDF and then create a reader object for it. Notice that we use the
binary read mode, 'rb', instead of just 'r'.
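import PyPDF2
# Open the PDF (the same file used in the Simple Example below) in binary mode
f = open('Working_Business_Proposal.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)
len(pdf_reader.pages)
# Grab the first page (index 0) and extract its text
page_number = 0
page_one = pdf_reader.pages[page_number]
page_one_text = page_one.extract_text()
page_one_text
f.close()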
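Forgetting the final f.close() leaves the file handle open. Here is a minimal sketch of the same read using a with block, which closes the file automatically. The helper names page_text_from_reader and read_page_text are made up for this illustration and are not part of PyPDF2.

```python
def page_text_from_reader(reader, page_number=0):
    """Return the extracted text of one page.

    `reader` can be any object with a `.pages` sequence whose items
    offer `.extract_text()`, such as a PyPDF2.PdfReader.
    """
    if not 0 <= page_number < len(reader.pages):
        raise IndexError(f"page {page_number} is out of range")
    return reader.pages[page_number].extract_text()


def read_page_text(path, page_number=0):
    """Open `path` in binary mode, extract one page, then close the file."""
    import PyPDF2  # imported lazily so the helper above has no dependency

    # The text is extracted inside the with block, while the file is still open
    with open(path, "rb") as f:
        return page_text_from_reader(PyPDF2.PdfReader(f), page_number)
```

Calling read_page_text('Working_Business_Proposal.pdf') should give the same text as page_one_text above, assuming that file is in the working directory.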
Adding to PDFs
With PyPDF2 we cannot simply write new text into a PDF, because of the differences between Python's single string
type and the variety of fonts, placements, and other parameters that a PDF can
have. What we can do is copy pages from an existing PDF and add them to a new one.
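import PyPDF2
# Re-open the source PDF (it was closed above) and grab its first page
f = open('Working_Business_Proposal.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)
page_number = 0
page_one = pdf_reader.pages[page_number]
# Add the page to a writer object and save it as a new PDF
pdf_writer = PyPDF2.PdfWriter()
pdf_writer.add_page(page_one)
pdf_output = open("Some_New_Doc.pdf","wb")
pdf_writer.write(pdf_output)
pdf_output.close()
f.close()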
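The same pattern extends to copying several pages at once. This is a hedged sketch: copy_pages and save_pages are hypothetical helper names, and it relies only on the pages, add_page, and write calls already used in this section.

```python
def copy_pages(reader, writer, page_numbers):
    """Add the chosen pages of `reader` to `writer`, in the given order."""
    for n in page_numbers:
        writer.add_page(reader.pages[n])
    return writer


def save_pages(src_path, dst_path, page_numbers):
    """Copy selected pages from one PDF file into a new PDF file."""
    import PyPDF2  # imported lazily so copy_pages has no dependency

    # Both files are closed automatically when the with block ends
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        writer = copy_pages(PyPDF2.PdfReader(src), PyPDF2.PdfWriter(), page_numbers)
        writer.write(dst)
```

For example, save_pages('Working_Business_Proposal.pdf', 'First_Two_Pages.pdf', [0, 1]) would write the first two pages to a new file, assuming the source has at least two pages. The output file name here is only an example.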
Simple Example
Let's try to grab all the text from this PDF file:
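import PyPDF2
f = open('Working_Business_Proposal.pdf','rb')
pdf_reader = PyPDF2.PdfReader(f)
# Collect the text of every page; pdf_text[i] holds the text of page i + 1
pdf_text = []
for p in range(len(pdf_reader.pages)):
    page = pdf_reader.pages[p]
    pdf_text.append(page.extract_text())
f.close()
pdf_text
# Text of the fourth page (index 3); assumes the PDF has at least four pages
print(pdf_text[3])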
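As mentioned at the start, encrypted PDFs and PDFs that store text as images are common failure cases. Here is a rough way to spot them, sketched under the assumption that a page with no extractable text comes back as an empty or whitespace-only string. check_reader is a hypothetical helper, while is_encrypted is an attribute of the PyPDF2 reader object.

```python
def check_reader(reader):
    """Report whether a PDF is encrypted and which pages yield no text."""
    if reader.is_encrypted:
        # Pages of an encrypted PDF cannot be read until it is decrypted
        return {"encrypted": True, "empty_pages": []}
    empty_pages = [
        i for i, page in enumerate(reader.pages)
        if not (page.extract_text() or "").strip()
    ]
    return {"encrypted": False, "empty_pages": empty_pages}
```

Pages listed under empty_pages may be scanned images; reading those would need OCR tooling, which PyPDF2 does not provide.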
Excellent work! That is all for PyPDF2 for now. Remember that this won't work with every PDF
file, and that we have only used it here to read the text of PDFs.