What is a PDF Document?
PDF stands for Portable Document Format, and PDF files use the .pdf file extension. A PDF can hold many kinds of content, such as images, text, tabular data, and fillable forms, and it can be processed programmatically, which has made the format very widely used. Let's look at how to perform various operations on a PDF using Python.
Prerequisites:
The examples below need one library to manipulate a PDF document:
- PyPDF2: the main library, which lets a developer write code that manipulates a PDF document
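Before running the examples, it can help to check which PyPDF2 release is installed, because the camelCase names used in this article (PdfFileReader, getPage, numPages) belong to releases before 3.0. The sketch below is one way to do that with the standard library; the helper names installed_version and uses_legacy_api are hypothetical and not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


def uses_legacy_api(version_string):
    """Return True when the major version is below 3 (assumed cut-off for the camelCase names)."""
    major = int(version_string.split(".")[0])
    return major < 3


# Report what is installed in the current environment
version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed")
else:
    print("PyPDF2", version, "legacy API:", uses_legacy_api(version))
```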
The following code extracts information from a PDF document and prints it to the console:
```python
# Import the library (PdfFileReader, numPages, getPage and extractText are the PyPDF2 names from before version 3.0)
import PyPDF2

# Open the PDF in binary read mode; the with block closes the file afterwards
with open('myExample.pdf', 'rb') as oFileObj:
    # Create the reader object
    objReader = PyPDF2.PdfFileReader(oFileObj)

    # Print the number of pages in the PDF document
    print("Total Pages in File : ", objReader.numPages)

    # Get page number 2 from the reader (page indexes start at 0, so page 2 is index 1)
    objPage = objReader.getPage(1)

    # Print the text available on page 2
    print(objPage.extractText())
```
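Because getPage() takes a 0-based index, asking for "page 2" the way a reader counts pages means passing index 1, and an index past the last page fails. A small helper can make that conversion explicit. This is only a sketch; page_index is a hypothetical name, not a PyPDF2 function.

```python
def page_index(page_number, total_pages):
    """Convert a 1-based page number into the 0-based index that getPage() expects."""
    if not 1 <= page_number <= total_pages:
        raise ValueError(f"page {page_number} is outside 1..{total_pages}")
    return page_number - 1
```

With a reader object in hand, it could be used as objReader.getPage(page_index(2, objReader.numPages)).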
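Adding a watermark: the following example adds a watermark to the first page of a PDF document:
```python
# Import the library (this example uses the PyPDF2 API from before version 3.0)
import PyPDF2

# Open the source PDF and the PDF that holds the watermark
with open('SamplePDFFile.pdf', 'rb') as oFile, open('watermarkPdf.pdf', 'rb') as oWatermarkFile:
    pdfReader = PyPDF2.PdfFileReader(oFile)
    pdfWatermarkReader = PyPDF2.PdfFileReader(oWatermarkFile)

    # Merge the first page of the watermark PDF onto the first page of the source
    oFirstPage = pdfReader.getPage(0)
    oFirstPage.mergePage(pdfWatermarkReader.getPage(0))

    pdfWriter = PyPDF2.PdfFileWriter()
    pdfWriter.addPage(oFirstPage)

    # Copy the remaining pages unchanged
    for pageNum in range(1, pdfReader.numPages):
        pageObj = pdfReader.getPage(pageNum)
        pdfWriter.addPage(pageObj)

    # Write the result while the source files are still open
    with open('watermarkedFile.pdf', 'wb') as resultPdfFile:
        pdfWriter.write(resultPdfFile)
```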
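The example above stamps only the first page. If every page should carry the watermark, the same merge step can be applied inside the loop. The function below is a minimal sketch of that idea: watermark_all_pages is a hypothetical helper name, and it assumes reader, watermark_page and writer behave like the pre-3.0 PyPDF2 objects used above (numPages, getPage, mergePage, addPage).

```python
def watermark_all_pages(reader, watermark_page, writer):
    """Merge watermark_page onto every page of reader and add each result to writer."""
    for page_num in range(reader.numPages):
        page = reader.getPage(page_num)
        # mergePage draws the watermark content onto the page in place
        page.mergePage(watermark_page)
        writer.addPage(page)
    return writer
```

It would be called with pdfReader, pdfWatermarkReader.getPage(0) and a new PyPDF2.PdfFileWriter(), and the returned writer would then be written to the output file while the source files are still open.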