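Extracting text from an HTML file using Python

I'd like to extract the text from an HTML file using Python. Essentially, I want the same output I would get if I copied the text from a browser and pasted it into Notepad.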

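I'd like something more robust than regular expressions, which may fail on poorly formed HTML. I've seen many people recommend Beautiful Soup, but I've had a few problems using it. For one, it picked up unwanted text, such as JavaScript source. It also did not interpret HTML entities. For example, I would expect &#39; in the HTML source to be converted to an apostrophe in the text, just as if I had pasted the browser content into Notepad.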
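For the entity part of the question on its own, Python 3's standard library provides `html.unescape`. The snippet below is a minimal sketch; the sample string is made up for illustration and is not from the question.

```python
import html

# Hypothetical input containing numeric and named character references.
source = "It&#39;s a &quot;test&quot; &amp; more"

# html.unescape decodes both numeric (&#39;) and named (&quot;, &amp;) references.
decoded = html.unescape(source)
print(decoded)
```

This only decodes entities; it does not remove tags, so it would normally be combined with one of the tag-stripping approaches in the answers.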

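Update: html2text looks promising. It handles HTML entities correctly and ignores JavaScript. However, it does not exactly produce plain text; it produces Markdown, which would then have to be converted to plain text. It comes with no examples or documentation, but the code looks clean.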

Related questions:

Filter out HTML tags and resolve entities in Python
Convert XML/HTML entities into a Unicode string in Python

Current answer

You can extract the text from HTML with BeautifulSoup:

from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "https://www.geeksforgeeks.org/extracting-email-addresses-using-regular-expressions-python/"
con = urlopen(url).read()
soup = BeautifulSoup(con, 'html.parser')
texts = soup.get_text()
print(texts)
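Note that `get_text()` also returns the contents of `<script>` and `<style>` elements, which is the problem the question describes. One possible way around it, sketched here under the assumption that dropping those two tags is enough for your pages, is to remove them from the tree first. The `sample_html` string is a hypothetical stand-in for a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document; any HTML string would do.
sample_html = """
<html><head>
<script>var x = 1;</script>
<style>p { color: red; }</style></head>
<body><p>It&#39;s visible.</p></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Remove elements whose contents a browser does not render as page text.
for tag in soup(["script", "style"]):
    tag.decompose()

# stripped_strings yields each remaining text node with surrounding whitespace removed.
visible_text = " ".join(soup.stripped_strings)
print(visible_text)
```

Beautiful Soup decodes the `&#39;` reference while parsing, so the result contains a plain apostrophe.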

2018-04-13 11:03:57

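Other answers

I found myself facing the same problem today. I wrote a very simple HTML parser that strips all markup from the incoming content and returns the remaining text with only minimal formatting.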

from html.parser import HTMLParser  # Python 3 location (was the HTMLParser module in Python 2)
from re import sub
from sys import stderr
from traceback import print_exc

class _DeHTMLParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.__text = []

    def handle_data(self, data):
        text = data.strip()
        if len(text) > 0:
            text = sub('[ \t\r\n]+', ' ', text)
            self.__text.append(text + ' ')

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.__text.append('\n\n')
        elif tag == 'br':
            self.__text.append('\n')

    def handle_startendtag(self, tag, attrs):
        if tag == 'br':
            self.__text.append('\n\n')

    def text(self):
        return ''.join(self.__text).strip()


def dehtml(text):
    try:
        parser = _DeHTMLParser()
        parser.feed(text)
        parser.close()
        return parser.text()
    except Exception:
        print_exc(file=stderr)
        return text


def main():
    text = r'''
        <html>
            <body>
                <b>Project:</b> DeHTML<br>
                <b>Description</b>:<br>
                This small script is intended to allow conversion from HTML markup to 
                plain text.
            </body>
        </html>
    '''
    print(dehtml(text))


if __name__ == '__main__':
    main()
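The parser above keeps whatever is inside `<script>` and `<style>` elements, because `handle_data` is called for their contents too. The following is a minimal standard-library sketch of one way to skip them; the class name `VisibleTextParser`, the helper `visible_text`, and the sample string are all hypothetical and not part of the answer's code.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True (the default since Python 3.5) decodes
        # references such as &#39; before handle_data sees them.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text found outside skipped elements,
        # collapsing internal runs of whitespace to single spaces.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(" ".join(data.split()))

    def text(self):
        return " ".join(self._chunks)


def visible_text(markup):
    parser = VisibleTextParser()
    parser.feed(markup)
    parser.close()
    return parser.text()


sample = "<p>It&#39;s <b>bold</b></p><script>var x = '<p>';</script><style>p{}</style>"
print(visible_text(sample))
```

This sketch joins everything with single spaces and does not reproduce the paragraph and line-break handling of the parser above; the two could be combined if both behaviours are needed.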

2010-10-21 13:14:38

I know there are a lot of answers already, but the most elegant and Pythonic solution I have found is described, in part, here.

from bs4 import BeautifulSoup

text = ' '.join(BeautifulSoup(some_html_string, "html.parser").find_all(string=True))

Update

Following Fraser's comment, here is a more elegant solution:

from bs4 import BeautifulSoup

clean_text = ' '.join(BeautifulSoup(some_html_string, "html.parser").stripped_strings)
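Joining with a single space puts all the text on one line. If the output should look closer to what you get when pasting into Notepad, one option is `get_text` with a newline separator, sketched below. The value assigned to `some_html_string` is a hypothetical example input.

```python
from bs4 import BeautifulSoup

# Hypothetical input; in practice this would be the page source.
some_html_string = "<h1>Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"

soup = BeautifulSoup(some_html_string, "html.parser")

# strip=True trims each text node; separator is placed between text nodes.
lines_text = soup.get_text(separator="\n", strip=True)
print(lines_text)
```

The separator is inserted between every pair of text nodes, so inline elements such as `<b>` also cause a line break. Treat this as an approximation of the browser's layout rather than a faithful copy.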

2016-10-06 15:08:54

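Instead of the HTMLParser module, check out htmllib. It has a similar interface but does more of the work for you. (It is quite old, so it is not much help in getting rid of JavaScript and CSS. You could create a derived class and add methods with names like start_script and end_style (see the Python docs for details), but it is hard to do this reliably for malformed HTML.) Note that htmllib exists only in Python 2; it was removed in Python 3. Anyway, here is something simple that prints the plain text to the console: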

from htmllib import HTMLParser, HTMLParseError
from formatter import AbstractFormatter, DumbWriter

p = HTMLParser(AbstractFormatter(DumbWriter()))
try:
    p.feed('hello<br>there')
    p.close()  # calling close is not usually needed, but let's play it safe
except HTMLParseError:
    print(':(')  # the html is badly malformed (or you found a bug)

2012-02-20 06:39:50

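Note: NLTK no longer supports the clean_html function.

The original answer is below; alternatives are in the comments section.

Use NLTK

I wasted 4-5 hours fixing problems with html2text. Luckily, I came across NLTK. It works magically.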

import nltk   
from urllib import urlopen

url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"    
html = urlopen(url).read()    
raw = nltk.clean_html(html)  
print(raw)

2011-11-20 12:34:09

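I know there are a lot of answers here already, but I think newspaper3k also deserves a mention. I recently needed to complete a similar task, extracting the text from articles on the web, and this library has done an excellent job so far in my tests. It ignores the text found in menu items and sidebars, as well as any JavaScript that appears on the page, as the OP requests.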

from newspaper import Article

article = Article(url)
article.download()
article.parse()
article.text

If you have already downloaded the HTML file, you can do something like this:

article = Article('')
article.set_html(html)
article.parse()
article.text

It even has a few NLP features for summarizing the topics of articles:

article.nlp()
article.summary

2018-02-18 13:36:16
