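How to extract text from a PDF file?

I'm trying to extract the text contained in this PDF file using Python.

I'm using the PyPDF2 package (version 1.27.2) and have the following script: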

import PyPDF2

with open("sample.pdf", "rb") as pdf_file:
    read_pdf = PyPDF2.PdfFileReader(pdf_file)
    number_of_pages = read_pdf.getNumPages()
    page = read_pdf.pages[0]
    page_content = page.extractText()
print(page_content)

When I run the code, I get the following output, which differs from the text contained in the PDF document:

 ! " # $ % # $ % &% $ &' ( ) * % + , - % . / 0 1 ' * 2 3% 4
5
 ' % 1 $ # 2 6 % 3/ % 7 / ) ) / 8 % &) / 2 6 % 8 # 3" % 3" * % 31 3/ 9 # &)
%

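How can I extract the text as it appears in the PDF document?

Current answer

You may want to use the time-proven xPDF and its derived tools to extract the text, as pyPDF2 still seems to have various issues with text extraction.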

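The long answer is that there are many variations in how text is encoded inside a PDF. Extraction may require decoding the PDF string itself, then mapping it through a CMap, then analyzing the distance between words and letters, and so on.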
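
To make the CMap step concrete, here is a toy sketch (not a real PDF parser) of how output like the one in the question can arise. It assumes a hypothetical font whose glyph codes were renumbered when the PDF was created; the table `TO_UNICODE` and the sample codes are invented for illustration. Reading the codes as plain characters gives punctuation soup, while looking them up in the font's mapping recovers the text.

```python
# Hypothetical ToUnicode-style table: glyph code -> real character.
# The values are invented for this illustration.
TO_UNICODE = {0x21: "T", 0x22: "h", 0x23: "i", 0x24: "s", 0x25: " "}


def decode_naive(codes: bytes) -> str:
    """Interpret glyph codes as if they were plain Latin-1 characters."""
    return codes.decode("latin-1")


def decode_with_cmap(codes: bytes, cmap: dict) -> str:
    """Translate each glyph code through the font's mapping table."""
    # Unknown codes become U+FFFD so gaps in the mapping stay visible.
    return "".join(cmap.get(code, "\ufffd") for code in codes)


glyph_codes = bytes([0x21, 0x22, 0x23, 0x24, 0x25, 0x23, 0x24])
print(repr(decode_naive(glyph_codes)))
print(repr(decode_with_cmap(glyph_codes, TO_UNICODE)))
```

A real extractor also has to decide where spaces belong from glyph positions, which this sketch leaves out.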

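If the PDF is damaged (i.e. it displays the correct text but gives garbage when you copy it) and you really need to extract the text, you may want to consider converting the PDF into an image (using ImageMagick) and then using Tesseract to get the text from the image via OCR.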
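
The OCR fallback can be sketched as two external commands driven from Python. This is a rough sketch under several assumptions: ImageMagick's `convert` (named `magick` in ImageMagick 7) and the `tesseract` binary are installed and on the PATH, ImageMagick is able to rasterize PDFs on your system, and the 300 dpi density and the file names are placeholders. `build_ocr_commands` and `ocr_first_page` are hypothetical helper names.

```python
import subprocess


def build_ocr_commands(pdf_path, png_path, density=300):
    """Return the two command lines: rasterize the first page, then OCR it."""
    # "[0]" selects the first page of the PDF in ImageMagick's syntax.
    rasterize = ["convert", "-density", str(density), f"{pdf_path}[0]", png_path]
    # "stdout" as the output base tells tesseract to print the text.
    recognize = ["tesseract", png_path, "stdout"]
    return rasterize, recognize


def ocr_first_page(pdf_path, png_path="page.png"):
    """Run both commands and return the recognized text."""
    rasterize, recognize = build_ocr_commands(pdf_path, png_path)
    subprocess.run(rasterize, check=True)
    result = subprocess.run(recognize, check=True, stdout=subprocess.PIPE)
    return result.stdout.decode("utf-8")
```

Calling `ocr_first_page("sample.pdf")` would then return the text of the first page, with the usual caveat that OCR quality depends on the resolution and the fonts.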

2016-01-18 08:42:47

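Other answers

Camelot seems to be a fairly powerful solution for extracting tables from PDFs in Python.

At first sight it seems to achieve almost as accurate an extraction as the tabula-py package suggested by CreekGeek, which is already well ahead of any other solution posted here in terms of reliability, but it is supposedly much more configurable. It also has its own accuracy indicator (the parsing_report of each extracted table) and good debugging features.

Both Camelot and Tabula return the results as Pandas DataFrames, so it is easy to adjust the tables afterwards.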

pip install camelot-py

(Not to be confused with the camelot package.)

import camelot

df_list = []
results = camelot.read_pdf("file.pdf", ...)
for table in results:
    print(table.parsing_report)
    df_list.append(table.df)

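It can also output the results as CSV, JSON, HTML or Excel.

Camelot comes at the expense of a number of dependencies.

NB: Since my input is quite complex, with many different tables, I ended up using both Camelot and Tabula, depending on the table, to achieve the best results.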
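
One way to act on the parsing report is to keep only the tables above an accuracy threshold and set the rest aside for another tool. The sketch below assumes each table exposes `parsing_report` as a dict with an `accuracy` entry and a `df` attribute, as Camelot's tables do; the helper name `split_by_accuracy` and the threshold of 90 are arbitrary choices for illustration. Stand-in objects are used so the example runs without a PDF.

```python
from types import SimpleNamespace


def split_by_accuracy(tables, min_accuracy=90.0):
    """Split tables into (good DataFrames, tables needing another look)."""
    good, doubtful = [], []
    for table in tables:
        if table.parsing_report["accuracy"] >= min_accuracy:
            good.append(table.df)
        else:
            doubtful.append(table)
    return good, doubtful


# Stand-ins for extracted tables; with Camelot you would pass the result
# of camelot.read_pdf(...) instead.
fake_tables = [
    SimpleNamespace(parsing_report={"accuracy": 99.0, "page": 1}, df="table A"),
    SimpleNamespace(parsing_report={"accuracy": 42.5, "page": 2}, df="table B"),
]
good, doubtful = split_by_accuracy(fake_tables)
print(good)
print([t.parsing_report["page"] for t in doubtful])
```

The pages listed in the second line are the ones worth retrying with different settings or with Tabula.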

2021-02-01 16:56:54

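After trying textract (which seemed to have too many dependencies), pypdf2 (which could not extract text from the PDFs I tested with) and tika (which was too slow), I ended up using pdftotext from xpdf (as already suggested in another answer) and calling the binary directly from Python (you may need to adapt the path to pdftotext):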

import os, subprocess
SCRIPT_DIR = os.path.dirname(os.path.abspath(__file__))
args = ["/usr/local/bin/pdftotext",
        '-enc',
        'UTF-8',
        "{}/my-pdf.pdf".format(SCRIPT_DIR),
        '-']
res = subprocess.run(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
output = res.stdout.decode('utf-8')

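There is also a package called pdftotext that does basically the same thing, but it assumes pdftotext is in /usr/local/bin, whereas I am using this in AWS Lambda and wanted to use the binary from the current directory.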
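
A small lookup helper can avoid hard-coding the path. The sketch below first checks a directory you name (for example the script's own directory, as in the Lambda case) and then falls back to the PATH. `find_pdftotext` is a hypothetical helper, not part of any library.

```python
import os
import shutil


def find_pdftotext(local_dir, name="pdftotext"):
    """Return a path to the binary, preferring local_dir over the PATH."""
    candidate = os.path.join(local_dir, name)
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    # shutil.which returns None when the binary is not on the PATH.
    return shutil.which(name)
```

In the script above, the first element of `args` could then be `find_pdftotext(SCRIPT_DIR)`, with a check for `None` before calling `subprocess.run`.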

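By the way: to use this on Lambda you need to put the binary and its libstdc++.so dependency into your Lambda function. I personally needed to compile xpdf. Since the instructions for this would make this answer too long, I put them on my personal blog.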

2018-03-13 20:30:57

Use pdfminer.six. Here is the documentation: https://pdfminersix.readthedocs.io/en/latest/index.html

To convert a PDF to text:

    def pdf_to_text():
        from pdfminer.high_level import extract_text

        text = extract_text('test.pdf')
        print(text)

2021-01-03 19:31:48

Try borb, a pure Python PDF library

import typing  
from borb.pdf.document import Document  
from borb.pdf.pdf import PDF  
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction  


def main():

    # variable to hold Document instance
    doc: typing.Optional[Document] = None  

    # this implementation of EventListener handles text-rendering instructions
    l: SimpleTextExtraction = SimpleTextExtraction()  

    # open the document, passing along the array of listeners
    with open("input.pdf", "rb") as in_file_handle:  
        doc = PDF.loads(in_file_handle, [l])  
  
    # were we able to read the document?
    assert doc is not None  

    # print the text on page 0
    print(l.get_text(0))  

if __name__ == "__main__":
    main()

2021-08-04 07:19:04

I found a solution here: PDFLayoutTextStripper

It is good because it can keep the layout of the original PDF.

It is written in Java, but I have added a gateway to support Python.

Sample code:

from py4j.java_gateway import JavaGateway

gw = JavaGateway()
result = gw.entry_point.strip('samples/bus.pdf')

# result is a dict of {
#   'success': 'true' or 'false',
#   'payload': pdf file content if 'success' is 'true'
#   'error': error message if 'success' is 'false'
# }

print(result['payload'])

Sample output from PDFLayoutTextStripper:

You can see more details here: Stripper with Python

2019-05-07 01:54:26
