I am trying to extract the text contained in this PDF file using Python.
I am using the PyPDF2 package (version 1.27.2) and have the following script:
import PyPDF2
with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)
When I run the code, I get the following output, which differs from the text contained in the PDF document:
! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%
How can I extract the text as it appears in the PDF document?
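As of 2021, I would recommend pdfreader, because PyPDF2/3 seems troublesome now, and tika is actually written in Java and needs a JRE installed in the background. pdfreader is pure Python, is currently well maintained, and has extensive documentation.
Installation as usual: pip install pdfreader
A short usage example: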
from pdfreader import PDFDocument, SimplePDFViewer
# get raw document (file_name is the path to your PDF)
fd = open(file_name, "rb")
doc = PDFDocument(fd)
# there is an iterator for pages
page_one = next(doc.pages())
all_pages = [p for p in doc.pages()]
# and even a viewer
fd = open(file_name, "rb")
viewer = SimplePDFViewer(fd)
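I have tried many Python PDF converters, and I would like to update this review. Tika is one of the best. PyMuPDF, suggested by user @ehsaneha, is good news as well.
I wrote some code to compare them: https://github.com/erfelipe/PDFtextExtraction I hope it helps.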
Tika-Python is a Python binding to the Apache Tika™ REST services,
allowing Tika to be called natively in the Python community.
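from tika import parser

# from_file returns a dict; the extracted text is under the "content" key
parsed = parser.from_file("///Users/Documents/Textos/Texto1.pdf")
text = parsed["content"] or ""  # content is None when no text was found

# drop anything that cannot be encoded as UTF-8, then strip newlines and backslashes
safe_text = text.encode('utf-8', errors='ignore').decode('utf-8')
safe_text = safe_text.replace("\n", "").replace("\\", "")
print('--- safe text ---')
print(safe_text)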
Look at this code for PyPDF2<=1.26.0:
import PyPDF2
pdf_file = open('sample.pdf', 'rb')
read_pdf = PyPDF2.PdfFileReader(pdf_file)
page = read_pdf.getPage(0)
page_content = page.extractText()
print(page_content)
The output is:
!"#$%#$%&%$&'()*%+,-%./01'*23%4
5'%1$#26%3/%7/))/8%&)/26%8#3"%3"*%313/9#&)
%
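If you run this over many files, it can help to flag results like the one above automatically. Below is a minimal sketch of such a check, using only the standard library. The function name looks_garbled and the 0.5 threshold are my own assumptions, not part of PyPDF2, and the heuristic will misfire on text that is legitimately heavy on punctuation.

```python
def looks_garbled(text, min_alnum_ratio=0.5):
    """Return True when too few visible characters are letters or digits.

    Hypothetical helper: the 0.5 threshold is an arbitrary starting point.
    """
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # Nothing extracted at all is treated as a failed extraction.
        return True
    alnum_count = sum(ch.isalnum() for ch in visible)
    return alnum_count / len(visible) < min_alnum_ratio


# First line of the output shown above.
garbled_sample = "!\"#$%#$%&%$&'()*%+,-%./01'*23%4"
readable_sample = "Plain readable English text."

print(looks_garbled(garbled_sample))
print(looks_garbled(readable_sample))
```

A page that trips this check is a candidate for a different extractor or for OCR.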
Using the same code to read the PDF from 201308FCR.pdf, the output is normal.
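So the problem seems to lie in this particular PDF rather than in the script. Looking at the garbled output, every new symbol is simply the next ASCII character after the previous new one, starting at "!". That pattern is what you would get if the generator embedded a subset font and numbered its glyphs in order of first use, without giving the extractor a usable mapping back to Unicode. This is my reading of the output, not something the PyPDF2 docs state. The sketch below reproduces the effect; the function encode_by_first_use and the guessed sentence are assumptions I made to illustrate it.

```python
def encode_by_first_use(text, first_code=0x21):
    """Replace each distinct character with the next unused code, starting at '!'.

    Toy model of a subset font with a custom encoding (an assumption about
    how this PDF was generated, not PyPDF2 behavior).
    """
    codes = {}
    encoded = []
    for ch in text:
        if ch not in codes:
            # The n-th distinct character gets code first_code + n.
            codes[ch] = chr(first_code + len(codes))
        encoded.append(codes[ch])
    return "".join(encoded)


# Hypothetical original text, guessed from the repetition pattern of the output.
guessed_text = "This is a sample PDF document I"
garbled = encode_by_first_use(guessed_text)
print(garbled)
```

With the guessed sentence, the result is identical to the first line of the output above, which suggests the extractor is returning raw glyph codes instead of the characters they stand for.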
The PyPDF2 docstring for extractText explains why:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """