我试图开发一个简单的网页刮板。我想提取没有HTML代码的文本。它适用于普通HTML,但不适用于JavaScript代码添加文本的某些页面。

例如,如果一些JavaScript代码添加了一些文本,我不能看到它,因为当我调用:

response = urllib2.urlopen(request)

我得到了原始文本而没有添加的文本(因为JavaScript是在客户端执行的)。

所以,我正在寻找一些解决这个问题的想法。


当前回答

把BeautifulSoup和Selenium混合在一起对我来说效果很好。

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
    try:
        element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))) #waits 10 seconds until element is located. Can have other wait conditions  such as visibility_of_element_located or text_to_be_present_in_element

        html = driver.page_source
        soup = bs(html, "lxml")
        dynamic_text = soup.find_all("p", {"class":"class_name"}) #or other attributes, optional
    else:
        print("Couldnt locate element")

附注:你可以在这里找到更多的等待条件

其他回答

如果你以前曾经使用过python的Requests模块,我最近发现开发人员创建了一个名为Requests- html的新模块,现在它也有呈现JavaScript的能力。

你也可以访问https://html.python-requests.org/来了解更多关于这个模块的信息,或者如果你只对呈现JavaScript感兴趣,那么你可以访问https://html.python-requests.org/?#javascript-support来直接学习如何使用该模块使用Python来呈现JavaScript。

从本质上讲,一旦你正确安装了Requests-HTML模块,下面的例子,在上面的链接中显示,展示了你如何使用这个模块来抓取一个网站,并呈现网站中包含的JavaScript:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get('http://python-requests.org/')

r.html.render()

r.html.search('Python 2 will retire in only {months} months!')['months']

'<time>25</time>' #This is the result.

我最近从YouTube上的一个视频中了解到这一点。点击这里!观看YouTube上演示该模块如何工作的视频。

这似乎是一个很好的解决方案,从一个伟大的博客文章

import sys  
from PyQt4.QtGui import *  
from PyQt4.QtCore import *  
from PyQt4.QtWebKit import *  
from lxml import html 

#Take this class for granted.Just use result of rendering.
class Render(QWebPage):  
  def __init__(self, url):  
    self.app = QApplication(sys.argv)  
    QWebPage.__init__(self)  
    self.loadFinished.connect(self._loadFinished)  
    self.mainFrame().load(QUrl(url))  
    self.app.exec_()  

  def _loadFinished(self, result):  
    self.frame = self.mainFrame()  
    self.app.quit()  

url = 'http://pycoders.com/archive/'  
r = Render(url)  
result = r.frame.toHtml()
# This step is important.Converting QString to Ascii for lxml to process

# The following returns an lxml element tree
archive_links = html.fromstring(str(result.toAscii()))
print archive_links

# The following returns an array containing the URLs
raw_links = archive_links.xpath('//div[@class="campaign"]/a/@href')
print raw_links

如前所述,Selenium是呈现JavaScript结果的好选择:

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)

url = "https://www.example.com"
browser.get(url)

gazpacho是一个非常容易解析渲染html的库:

from gazpacho import Soup

soup = Soup(browser.page_source)
soup.find("a").attrs['href']

听起来好像你真正要找的数据可以通过主页面上的一些javascript调用的辅助URL访问。

虽然您可以尝试在服务器上运行javascript来处理这个问题,但一种更简单的方法可能是使用Firefox加载页面,并使用Charles或Firebug之类的工具来准确识别辅助URL。然后,您可以直接查询该URL以获得您感兴趣的数据。

You'll want to use urllib, requests, beautifulSoup and selenium web driver in your script for different parts of the page, (to name a few). Sometimes you'll get what you need with just one of these modules. Sometimes you'll need two, three, or all of these modules. Sometimes you'll need to switch off the js on your browser. Sometimes you'll need header info in your script. No websites can be scraped the same way and no website can be scraped in the same way forever without having to modify your crawler, usually after a few months. But they can all be scraped! Where there's a will there's a way for sure. If you need scraped data continuously into the future just scrape everything you need and store it in .dat files with pickle. Just keep searching how to try what with these modules and copying and pasting your errors into the Google.