Working with PDF in depth using Python

What is a PDF Document?

PDF stands for Portable Document Format, and PDF files use the .pdf file extension. A PDF can hold many kinds of information, such as images, text, tabular data, and fillable forms. PDF handling can also be automated in code, which has made the format very popular. Let's take a look at how we can perform various operations on a PDF using Python.

Prerequisites:

The examples below need one library to manipulate a PDF document:

  • PyPDF2: the main library used here to read and manipulate PDF documents. It can be installed with pip install PyPDF2.

The following code extracts text from a PDF document and prints it to the console:

# Import the library
import PyPDF2

# Open the PDF in binary mode; the with block closes the file when done
with open('myExample.pdf', 'rb') as oFileObj:
    # Create the reader (PdfReader replaces PdfFileReader in PyPDF2 2.0 and later)
    objReader = PyPDF2.PdfReader(oFileObj)

    # Print the number of pages in the PDF document
    print("Total Pages in File : ", len(objReader.pages))

    # Get page 2 from the reader; page indexes start at 0, so page 2 is index 1
    objPage = objReader.pages[1]

    # Print the text found on page 2
    print(objPage.extract_text())
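
Because page indexes start at 0, it is easy to ask for the wrong page. The sketch below wraps the lookup in a small helper that takes a 1-based page number and checks the range first. It assumes PyPDF2 2.0 or later, where the reader class is PdfReader, pages are reached through reader.pages, and text comes from extract_text(). The helper names get_page_text and read_page_from_file are made up for this illustration.

```python
def get_page_text(reader, page_number):
    """Return the text of a page given its 1-based page number.

    `reader` is any object with a `pages` sequence whose items have an
    `extract_text()` method, such as a PyPDF2 PdfReader (assumed 2.0+).
    """
    total = len(reader.pages)
    if not 1 <= page_number <= total:
        raise ValueError(
            f"page_number must be between 1 and {total}, got {page_number}"
        )
    # Convert the human-friendly page number to a 0-based index.
    text = reader.pages[page_number - 1].extract_text()
    # Fall back to an empty string if nothing was returned.
    return text or ""


def read_page_from_file(path, page_number):
    """Open a PDF file and return the text of one page (hypothetical helper)."""
    import PyPDF2  # imported here so get_page_text works without the library

    with open(path, "rb") as pdf_file:
        reader = PyPDF2.PdfReader(pdf_file)
        # Extract while the file is still open.
        return get_page_text(reader, page_number)
```

With this in place, print(read_page_from_file('myExample.pdf', 2)) would print the text of the second page, and an out-of-range page number raises a ValueError instead of silently reading the wrong page.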

Adding a Watermark: the following example adds a watermark to the first page of a PDF document, using the first page of a second PDF as the watermark:

# Import the library
import PyPDF2

# Open the source PDF and the watermark PDF; both stay open until the result is written
with open('SamplePDFFile.pdf', 'rb') as oFile, open('watermarkPdf.pdf', 'rb') as oWatermarkFile:
    pdfReader = PyPDF2.PdfReader(oFile)
    pdfWatermarkReader = PyPDF2.PdfReader(oWatermarkFile)

    # Merge the watermark page onto the first page of the source document
    oFirstPage = pdfReader.pages[0]
    oFirstPage.merge_page(pdfWatermarkReader.pages[0])

    pdfWriter = PyPDF2.PdfWriter()
    pdfWriter.add_page(oFirstPage)

    # Copy the remaining pages unchanged
    for pageNum in range(1, len(pdfReader.pages)):
        pdfWriter.add_page(pdfReader.pages[pageNum])

    # Write the watermarked document to a new file
    with open('watermarkedFile.pdf', 'wb') as resultPdfFile:
        pdfWriter.write(resultPdfFile)
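
The example above stamps only the first page. A minimal sketch of stamping every page follows, again assuming PyPDF2 2.0 or later (PdfReader, PdfWriter, merge_page, add_page). The function names stamp_pages and watermark_all_pages are illustrative and not part of the library.

```python
def stamp_pages(pages, watermark_page, writer):
    """Merge watermark_page onto each page and add the result to writer.

    `pages` is an iterable of page objects with a `merge_page()` method and
    `writer` is an object with an `add_page()` method (PyPDF2 2.0+ assumed).
    Returns the number of pages written.
    """
    count = 0
    for page in pages:
        # merge_page changes the page in place; the watermark page is reused.
        page.merge_page(watermark_page)
        writer.add_page(page)
        count += 1
    return count


def watermark_all_pages(source_path, watermark_path, output_path):
    """Watermark every page of source_path and save to output_path."""
    import PyPDF2  # imported here so stamp_pages works without the library

    with open(source_path, "rb") as source_file, \
            open(watermark_path, "rb") as mark_file:
        reader = PyPDF2.PdfReader(source_file)
        # Use the first page of the watermark PDF as the stamp.
        watermark_page = PyPDF2.PdfReader(mark_file).pages[0]
        writer = PyPDF2.PdfWriter()
        count = stamp_pages(reader.pages, watermark_page, writer)
        # Write while both input files are still open.
        with open(output_path, "wb") as output_file:
            writer.write(output_file)
    return count
```

Calling watermark_all_pages('SamplePDFFile.pdf', 'watermarkPdf.pdf', 'watermarkedFile.pdf') would reuse the file names from the example above. Keeping the loop in stamp_pages, separate from the file handling, makes it easy to skip or filter pages later.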
