November 20, 2019 General
Retrieve the page numbers of words in a .pdf with PDFMiner(.six)
PDFMiner
PDFMiner is a text extraction tool for PDF documents. Note that starting from version 20191010, PDFMiner supports Python 3 only; for Python 2 support, use pdfminer.six.
Python 3: pdfminer, https://github.com/euske/pdfminer
Python 2/3: pdfminer.six, https://github.com/pdfminer/pdfminer.six
PDFMiner Features:
Pure Python (3.6 or above).
Supports PDF-1.7. (well, almost)
Obtains the exact location of text as well as other layout information (fonts, etc.).
Performs automatic layout analysis.
Can convert PDF into other formats (HTML/XML).
Can extract an outline (TOC).
Can extract tagged contents.
Supports basic encryption (RC4 and AES).
Supports various font types (Type1, TrueType, Type3, and CID).
Supports CJK languages and vertical writing scripts.
Has an extensible PDF parser that can be used for other purposes.
Steps
- Open the .pdf and parse its text into a .txt file.
- While parsing, write a page tag into the .txt file at the start of each page.
- Search the .txt file for each word and read the page numbers off the tags.
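The tag-and-split idea behind these steps can be tried without a PDF at all. The following is a minimal sketch, not the script used in this post: `PAGE_TAG`, `tag_pages` and `find_pages` are hypothetical names chosen for illustration, and the list of page texts stands in for whatever PDFMiner extracts.

```python
# Hypothetical marker for this sketch; any string that cannot occur in the
# page text would do.
PAGE_TAG = "<PAGE>"


def tag_pages(page_texts):
    """Join per-page texts into one string, prefixing each page with the tag."""
    return "".join(PAGE_TAG + text for text in page_texts)


def find_pages(tagged_text, word):
    """Return the 1-based numbers of the pages whose text contains `word`."""
    chunks = tagged_text.split(PAGE_TAG)
    # chunks[0] is the (empty) text before the first tag, so page i is chunks[i].
    return [i for i in range(1, len(chunks)) if word in chunks[i]]


if __name__ == "__main__":
    pages = ["alpha beta", "gamma", "beta delta"]  # stand-in for extracted text
    tagged = tag_pages(pages)
    print(find_pages(tagged, "beta"))   # [1, 3]
    print(find_pages(tagged, "omega"))  # []
```

Because the split leaves an empty first element, the index of each remaining chunk is already the page number, which is why the search loop starts at 1.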
Code in Python
The code is adapted from Ccircus's article (reference 2).
# GetPageNumber.py
# -*- coding: utf-8 -*-
# Import libraries
import os
import re

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import PDFPageAggregator
from pdfminer.layout import LTTextBoxHorizontal, LAParams
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed

# Directory that holds the PDF, list.txt, and the output files
base_dir = 'D:/Files/IN/'


# Parse the PDF and write its text, one tagged chunk per page, to pdfoutput.txt
def parsePDFtoTXT(pdf_path):
    out_path = os.path.join(base_dir, 'pdfoutput.txt')
    # Open the output once in 'w' mode so a rerun does not append duplicate pages
    with open(pdf_path, 'rb') as fp, open(out_path, 'w', encoding='utf-8') as f:
        parser = PDFParser(fp)
        document = PDFDocument(parser)
        if not document.is_extractable:
            raise PDFTextExtractionNotAllowed
        rsrcmgr = PDFResourceManager()
        laparams = LAParams()
        device = PDFPageAggregator(rsrcmgr, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(document):
            interpreter.process_page(page)
            layout = device.get_result()
            print(layout)  # progress output
            # str(layout) starts with '<LTPage', which serves as the page tag
            output = str(layout)
            for x in layout:
                if isinstance(x, LTTextBoxHorizontal):
                    output += x.get_text()
            # Remove all whitespace so words broken across lines still match
            output = re.sub(r'\s', '', output)
            f.write(output)


# Look up each word and append the pages it appears on to result.txt
def get_word_page(word_list):
    with open(os.path.join(base_dir, 'pdfoutput.txt'), encoding='utf-8') as f:
        # Element 0 is the text before the first tag; element i is page i
        text_list = f.read().split('<LTPage')
    n = len(text_list)
    with open(os.path.join(base_dir, 'result.txt'), 'a', encoding='utf-8') as f:
        for w in word_list:
            page_list = []
            for i in range(1, n):
                if w in text_list[i]:
                    page_list.append(i)
            f.write(w + str(page_list) + '\n')


if __name__ == '__main__':
    parsePDFtoTXT(os.path.join(base_dir, 'Fuzhi.pdf'))
    # utf-8-sig drops a leading BOM if present; blank lines are skipped
    # because an empty string would match every page
    with open(os.path.join(base_dir, 'list.txt'), encoding='utf-8-sig') as fl:
        word_list = [x.strip() for x in fl if x.strip()]
    get_word_page(word_list)
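The page number can also be read from the tag itself rather than inferred from the position of the chunk. The following is a minimal sketch under two assumptions: that each tag, after whitespace removal, looks like `<LTPage(3)0.000,0.000,595.000,842.000rotate=0>`, and that the number in parentheses is the page number. `split_by_page_tag` and `pages_containing` are hypothetical helper names, not part of PDFMiner.

```python
import re

# Assumed tag shape after whitespace removal, e.g.
# '<LTPage(3)0.000,0.000,595.000,842.000rotate=0>'
PAGE_TAG_RE = re.compile(r"<LTPage\((\d+)\)[^>]*>")


def split_by_page_tag(text):
    """Map each page number found in a tag to the text that follows the tag."""
    pages = {}
    matches = list(PAGE_TAG_RE.finditer(text))
    for k, m in enumerate(matches):
        # A page's text runs from the end of its tag to the start of the next tag.
        end = matches[k + 1].start() if k + 1 < len(matches) else len(text)
        pages[int(m.group(1))] = text[m.end():end]
    return pages


def pages_containing(text, word):
    """Return the sorted page numbers whose text contains `word`."""
    return sorted(n for n, body in split_by_page_tag(text).items() if word in body)


if __name__ == "__main__":
    sample = (
        "<LTPage(1)0.000,0.000,595.000,842.000rotate=0>foobar"
        "<LTPage(2)0.000,0.000,595.000,842.000rotate=0>baz"
        "<LTPage(3)0.000,0.000,595.000,842.000rotate=0>barqux"
    )
    print(pages_containing(sample, "bar"))  # [1, 3]
```

This variant also strips the bounding-box text of the tag from each chunk, so a search term made of digits cannot match the page coordinates by accident.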
That's all.
Nov 20, 2019
Dec 1, 2019 Revised
References:
- https://github.com/euske/pdfminer
- https://www.cnblogs.com/zm-pop-pk/p/11255436.html
- https://euske.github.io/pdfminer/programming.html