In today’s digital world, paper documents are converted into digital form through various readable formats like PDF, Word, and Google Docs. These media are safe, easy to access, and easy to transfer. But while PDF is a very good format for presenting text, extracting information back out of it can sometimes be a challenge.
Requirements
I will be using Python 3.6.3; you can use any version you like (as long as it supports the libraries below).
- PyPDF2 (To convert simple, text-based PDF files into text readable by Python)
- textract (To convert non-trivial, scanned PDF files into text readable by Python)
- nltk (To clean and convert phrases into keywords)
Installation
pip install PyPDF2
pip install textract
pip install nltk
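In addition, nltk needs its tokenizer models and stopword list downloaded once before first use (and textract's tesseract method requires the Tesseract OCR engine to be installed on your system). A minimal one-time setup looks like this:

import nltk

#One-time downloads: the 'punkt' tokenizer models used by
#word_tokenize() and the 'stopwords' corpus used below
nltk.download('punkt')
nltk.download('stopwords')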
Code example
#Import required libraries
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#Pick the PDF file to read (a for-loop could process many files)
filename = 'enter the name of the file here'

#Open the file for reading in binary mode
pdfFileObj = open(filename, 'rb')

#The pdfReader variable is a readable object that will be parsed
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

#Discerning the number of pages allows us to parse through all
#the pages
num_pages = pdfReader.numPages
count = 0
text = ""

#The while loop reads each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

#PyPDF2 cannot read scanned files, so if it returned no words we
#run the OCR library textract to convert scanned/image-based PDF
#files into text. textract returns bytes, so we decode them into
#a string.
if text == "":
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')

#Now we have a text variable which contains all the text derived
#from our PDF file. Type print(text) to see what it contains. It
#likely contains a lot of whitespace, possibly junk such as '\n', etc.

#Next, we clean our text variable and return it as a list of keywords.

#The word_tokenize() function breaks our text phrases into
#individual words
tokens = word_tokenize(text)

#We create a list containing the punctuation we wish to clean
punctuations = ['(', ')', ';', ':', '[', ']', ',']

#We initialize the stop_words variable, a list of words like
#"the", "i", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')

#We build a list comprehension which keeps only the words that are
#NOT in stop_words and NOT in punctuations
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
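To see what the cleaning step does in isolation, here is a small sanity check on a made-up sample sentence (the string is purely illustrative). One subtlety: stopwords.words('english') is all lowercase, so capitalized words like "The" survive the filter unless you lowercase each token first, a small tweak shown below.

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#Illustrative sample text standing in for the extracted PDF text
sample = "The quick brown fox; it jumps over the lazy dog."
tokens = word_tokenize(sample)

punctuations = ['(', ')', ';', ':', '[', ']', ',']
stop_words = stopwords.words('english')

#Lowercase each token before filtering so "The" matches the
#stopword "the"
keywords = [word for word in tokens
            if word.lower() not in stop_words and word not in punctuations]
print(keywords)  # ['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']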
Please leave your comments or queries in the comment section, and please subscribe to our blog to keep yourself up to date.
Next: Working with PDF in depth using Python