Read/Write PDF using Python

PDF

In today’s digital world, paper documents are converted into digital formats such as PDF, Word, and Google Docs. These formats are safe, easy to access, and easy to transfer. Although PDF is a very good format for presenting text, extracting information from it can sometimes be a challenge.

Requirement

I will be using Python 3.6.3; you can use any version you like, as long as it supports the libraries below.

  • PyPDF2 (To convert simple, text-based PDF files into text readable by Python)
  • textract (To convert non-trivial, scanned PDF files into text readable by Python)
  • nltk (To clean and convert phrases into keywords)

Install

pip install PyPDF2
pip install textract
pip install nltk
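
After installing, it may help to confirm that Python can actually find the three packages. The sketch below uses only the standard library; `missing_packages` and `REQUIRED_PACKAGES` are hypothetical names chosen for this illustration, not part of any of the libraries above.

```python
import importlib.util

# Import names of the three libraries installed above.
REQUIRED_PACKAGES = ["PyPDF2", "textract", "nltk"]

def missing_packages(names):
    """Return the names that Python cannot locate as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(REQUIRED_PACKAGES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are installed.")
    # word_tokenize() and stopwords need NLTK data downloaded once, e.g.:
    #   import nltk
    #   nltk.download('punkt')      # recent NLTK releases may ask for 'punkt_tab'
    #   nltk.download('stopwords')
```

Note that the OCR route through textract also depends on the external Tesseract program being installed on the machine; this check only covers the Python packages.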

Code example

#Import required libraries
import PyPDF2 
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#name of the PDF file to read (wrap the steps below in a for-loop to process many files)
filename = 'enter the name of the file here' 
#open to read the file
pdfFileObj = open(filename,'rb')

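#The pdfReader variable is a readable object that will be parsed.
#PyPDF2 3.x removed PdfFileReader, numPages, getPage and extractText;
#PdfReader, pages and extract_text() are their replacements.
pdfReader = PyPDF2.PdfReader(pdfFileObj)

#discerning the number of pages will allow us to parse through all the pages
num_pages = len(pdfReader.pages)
text = ""

#The for loop will read each page
for page_number in range(num_pages):
    pageObj = pdfReader.pages[page_number]
    #image-only pages may yield no text, so fall back to an empty string
    text += pageObj.extract_text() or ""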

#This if statement exists to check if the above library returned 
#words. It's done because PyPDF2 cannot read scanned files.
#If no text came back, we run the OCR library textract to 
#convert scanned/image based PDF files into text. textract returns
#bytes, so we decode the result into a string.
if text.strip() == "":
   text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
# Now we have a text variable which contains all the text derived 
#from our PDF file. Type print(text) to see what it contains. It 
#likely contains a lot of spaces, possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.

#The word_tokenize() function will break our text phrases into individual words
tokens = word_tokenize(text)

#we'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']

#We initialize the stopwords variable which is a list of words like 
#"The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')

#We create a list comprehension which only returns a list of words 
#that are NOT IN stop_words and NOT IN punctuations. NLTK's stop words
#are lowercase, so each word is lowercased before the comparison.
keywords = [word for word in tokens if word.lower() not in stop_words and word not in punctuations]
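
The cleaning step can also be tried on its own, without a PDF file or the NLTK data. The sketch below is a minimal illustration only: `extract_keywords`, `SAMPLE_STOP_WORDS`, and the regex-based tokenizer are hypothetical stand-ins for `word_tokenize()` and `stopwords.words('english')`, not part of NLTK.

```python
import re
import string

# Hypothetical, tiny stop-word set used only for this illustration;
# the code above uses stopwords.words('english') from NLTK instead.
SAMPLE_STOP_WORDS = {"the", "i", "and", "is", "a", "of", "in", "to"}

def extract_keywords(text, stop_words=SAMPLE_STOP_WORDS):
    """Return the words in text that are neither stop words nor punctuation."""
    # Simple stand-in tokenizer: runs of word characters, or single
    # non-space symbols. word_tokenize() is more thorough than this.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    punctuations = set(string.punctuation)
    return [
        token
        for token in tokens
        # Lowercase before the stop-word check so "The" and "I" are caught too.
        if token.lower() not in stop_words and token not in punctuations
    ]

sample = "The report (draft) is stored in the archive, and I reviewed it."
print(extract_keywords(sample))
```

Lowercasing each token before the stop-word comparison is a design choice worth keeping with the real NLTK list as well, since its English stop words are all lowercase.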

Please leave your comments or queries in the comments section, and please subscribe to our blog to keep yourself up to date.

Next: Working with PDF in depth using Python
