Strip HTML from strings in Python

from mechanize import Browser
br = Browser()
br.open('http://somewebpage')
html = br.response().readlines()
for line in html:
  print(line)

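When printing a line from an HTML file, I'm trying to find a way to show only the contents of each HTML element and not the formatting itself. If it finds '<a href="whatever.example.com">some text</a>', it should print only 'some text'; '<b>hello</b>' should print 'hello', and so on. How would one go about doing this?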

Current answer

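I have used Eloff's answer successfully with Python 3.1 [many thanks!].

I upgraded to Python 3.2.3 and ran into errors.

The solution, provided here thanks to the responder Thomas K, is to insert super().__init__() into the following code: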

def __init__(self):
    self.reset()
    self.fed = []

...so that it looks like this:

def __init__(self):
    super().__init__()
    self.reset()
    self.fed = []

...and it will work with Python 3.2.3.

Thanks again to Thomas K for the fix and to Eloff for the original code!
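For context, here is a minimal sketch of how the fixed constructor might sit inside a complete Python 3 class. The handle_data and get_data methods follow the MLStripper variant quoted in another answer on this page; the strip_tags helper name and the sample input are assumptions made for illustration.

```python
from html.parser import HTMLParser  # Python 3 location of HTMLParser


class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # the fix: initialize the base class first
        self.reset()
        self.fed = []

    def handle_data(self, d):
        # Called for text between tags; collect it.
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    # Hypothetical helper name, used here only for illustration.
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered by the parser
    return s.get_data()


print(strip_tags('<a href="#">some text</a> <b>hello</b>'))  # some text hello
```

Calling close() after feed() makes sure trailing text that the parser is still holding gets passed to handle_data.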

2012-06-18 15:29:15

Other answers

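If you need to strip HTML tags for text processing, a simple regular expression will do. Do not use this if you want to sanitize user-generated HTML to prevent XSS attacks. It is not a secure way to remove all <script> tags or tracking <img>s. The following regular expression will fairly reliably strip most HTML tags: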

import re

re.sub('<[^<]+?>', '', text)

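For those who don't understand regex, this searches for a string <...>, where the inner content consists of one or more (+) characters that are not <. The ? means that it will match the smallest string it can find. For example, given <p>Hello</p>, it will match <p> and </p> separately with the ?. Without it (and without the [^<] restriction, as in <.+>), it would match the entire string <..Hello..>.

If a non-tag < appears in the HTML (e.g. 2 < 3), it should be written as an escape sequence &... anyway, so the ^< may be unnecessary.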
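A small sketch of the behaviour described above, using the same pattern. The strip_tags name and the sample strings are illustrative assumptions, not part of the original answer.

```python
import re

# Non-greedy pattern from the answer: "<", one or more non-"<" characters, ">".
TAG_RE = re.compile(r'<[^<]+?>')


def strip_tags(text):
    """Remove anything that looks like a tag; not safe for sanitizing input."""
    return TAG_RE.sub('', text)


print(strip_tags('<p>Hello</p>'))                  # Hello
print(strip_tags('<a href="#">some text</a>'))     # some text

# A greedy pattern with no "<" restriction swallows the whole string.
print(repr(re.sub(r'<.+>', '', '<p>Hello</p>')))   # ''

# An unescaped "<" in running text is mistaken for the start of a tag.
print(repr(strip_tags('2 < 3 and 4 > 1')))         # '2  1'
```

The last call shows why the input is expected to have non-tag < characters escaped: everything between the stray < and the next > is removed.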

2011-02-02 01:09:16

If you need to preserve HTML entities (e.g. &amp;), I added a "handle_entityref" method to Eloff's answer.

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def handle_entityref(self, name):
        self.fed.append('&%s;' % name)
    def get_data(self):
        return ''.join(self.fed)

def html_to_text(html):
    s = MLStripper()
    s.feed(html)
    return s.get_data()
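The class above is written for Python 2 (it imports from the HTMLParser module). A rough Python 3 sketch of the same idea follows. It assumes Python 3.4 or later, where the parser must be created with convert_charrefs=False for handle_entityref to be called at all. The class name EntityKeepingStripper and the extra handle_charref method for numeric references are additions for illustration, not part of the original answer.

```python
from html.parser import HTMLParser


class EntityKeepingStripper(HTMLParser):
    def __init__(self):
        # With convert_charrefs=True (the default since Python 3.5) entities
        # are decoded into handle_data and handle_entityref is never called.
        super().__init__(convert_charrefs=False)
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def handle_entityref(self, name):
        # Re-emit named entities such as &amp; unchanged.
        self.fed.append('&%s;' % name)

    def handle_charref(self, name):
        # Assumption: numeric references such as &#169; are kept as well.
        self.fed.append('&#%s;' % name)

    def get_data(self):
        return ''.join(self.fed)


def html_to_text(html):
    s = EntityKeepingStripper()
    s.feed(html)
    s.close()
    return s.get_data()


print(html_to_text('<p>Fish &amp; chips</p>'))  # Fish &amp; chips
```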

2012-12-04 13:25:42

# This is a regex solution.
import re
def removeHtml(html):
  if not html: return html
  # Remove comments first
  innerText = re.compile(r'<!--[\s\S]*?-->').sub('',html)
  while innerText.find('>')>=0: # Loop through nested Tags
    text = re.compile('<[^<>]+?>').sub('',innerText)
    if text == innerText:
      break
    innerText = text

  return innerText.strip()

2019-12-08 10:35:11

You can write your own function:

def StripTags(text):
     finished = 0
     while not finished:
         finished = 1
         start = text.find("<")
         if start >= 0:
             stop = text[start:].find(">")
             if stop >= 0:
                 text = text[:start] + text[start+stop+1:]
                 finished = 0
     return text

2010-10-04 15:26:49

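With BeautifulSoup, html2text, or the code from @Eloff, some HTML elements and JavaScript code usually still remain in the output...

So you can combine these libraries and also remove the Markdown formatting (Python 3):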

import re
import html2text
from bs4 import BeautifulSoup
def html2Text(html):
    def removeMarkdown(text):
        for current in [r"^[ #*]{2,30}", r"^[ ]{0,30}\d\\\.", r"^[ ]{0,30}\d\."]:
            markdown = re.compile(current, flags=re.MULTILINE)
            text = markdown.sub(" ", text)
        return text
    def removeAngular(text):
        angular = re.compile(r"[{][|].{2,40}[|][}]|[{][*].{2,40}[*][}]|[{][{].{2,40}[}][}]|\[\[.{2,40}\]\]")
        text = angular.sub(" ", text)
        return text
    h = html2text.HTML2Text()
    h.images_to_alt = True
    h.ignore_links = True
    h.ignore_emphasis = False
    h.skip_internal_links = True
    text = h.handle(html)
    soup = BeautifulSoup(text, "html.parser")
    text = soup.text
    text = removeAngular(text)
    text = removeMarkdown(text)
    return text

This works well for me, but of course it can be improved...
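If leftover JavaScript is the main concern, one possible standard-library-only alternative is to skip the contents of script and style elements while collecting text. This is a rough sketch under that assumption, not the approach used in the function above; the names VisibleTextExtractor and visible_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    # Elements whose text content should never reach the output.
    SKIPPED = {'script', 'style'}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style.
        if self.skip_depth == 0:
            self.parts.append(data)

    def get_text(self):
        return ''.join(self.parts)


def visible_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()


sample = '<p>Hi</p><script>alert(1)</script><style>p {}</style> there'
print(visible_text(sample))  # Hi there
```

Like the regex approach, this is meant for text extraction only and should not be treated as a way to sanitize untrusted HTML.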

2017-12-27 14:41:49
