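I do a lot of HTML parsing in my line of work. Until now, I have been using the HtmlUnit headless browser for both parsing and browser automation.
Now I want to separate the two tasks.
I want to use a lightweight HTML parser, because with HtmlUnit it takes a lot of time to first load a page, then get the source, and then parse it.
I want to know which HTML parser can parse HTML efficiently. I need:
Speed
Ease of locating any HtmlElement by its "id", "name", or "tag type".
It is fine with me if it does not clean up dirty HTML code. I don't need to clean any HTML source. I just need the easiest way to move across HtmlElements and harvest data from them.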
The best I have seen so far is HtmlCleaner:
HtmlCleaner is open-source HTML parser written in Java. HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
With HtmlCleaner you can locate any element using XPath.
For other HTML parsers, see this SO question.
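To make the XPath idea concrete, here is a minimal sketch of the three lookups the question asks for (by id, by name, by tag). It does not use HtmlCleaner's own API. It assumes the markup is already well-formed XML, which is what a cleaner would hand you, and runs the JDK's built-in javax.xml.xpath against it. The class name, the sample markup, and the helper methods are all made up for illustration.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathLookupSketch {
    // Hypothetical sample: well-formed markup standing in for a cleaner's output.
    static final String XHTML =
        "<html><body>"
        + "<div id=\"main\"><p>First</p><p>Second</p></div>"
        + "<input name=\"q\" value=\"parser\"/>"
        + "</body></html>";

    // Returns the string value of the first node matched by the expression.
    static String evalString(String markup, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expr, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    // Returns how many nodes the expression matches.
    static int count(String markup, String expr) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                expr, new InputSource(new StringReader(markup)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed: " + expr, e);
        }
    }

    public static void main(String[] args) {
        // By id: text of the first <p> inside the element with id="main".
        System.out.println(evalString(XHTML, "//*[@id='main']/p[1]"));
        // By name: the value attribute of the element with name="q".
        System.out.println(evalString(XHTML, "//*[@name='q']/@value"));
        // By tag type: number of <p> elements in the document.
        System.out.println(count(XHTML, "//p"));
    }
}
```

Raw HTML from the web would usually fail to parse this way, which is exactly why a cleaning step comes first. Each call here re-parses the markup, so for real work you would parse once and evaluate many expressions against the resulting document.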
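Self plug: I have just released a new Java HTML parser: jsoup. I mention it here because I think it will do what you are after.
Its party trick is a CSS selector syntax to find elements, e.g.: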
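import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);            // parse the string into a document tree
Elements links = doc.select("a");            // every <a> element (none in this sample)
Element head = doc.select("head").first();   // first element matching "head"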
See the Selector javadoc for more info.
This is a new project, so any ideas for improvement are very welcome!