我试图开发一个简单的网页刮板。我想提取没有HTML代码的文本。它适用于普通HTML,但不适用于JavaScript代码添加文本的某些页面。

例如,如果一些JavaScript代码添加了一些文本,我不能看到它,因为当我调用:

response = urllib2.urlopen(request)

我得到了原始文本而没有添加的文本(因为JavaScript是在客户端执行的)。

所以,我正在寻找一些解决这个问题的想法。


当前回答

EDIT 2021年9月:phantomjs也不再维护

EDIT 30/Dec/2017:这个答案出现在谷歌搜索的顶部结果中,所以我决定更新它。老答案仍然在最后。

dryscape不再维护,开发人员推荐的库dryscape仅适用于Python 2。我发现使用Selenium的python库和Phantom JS作为web驱动程序足够快,也很容易完成工作。

一旦你安装了Phantom JS,确保phantomjs二进制文件在当前路径下可用:

phantomjs --version
# result:
2.1.1

#例子 为了给出一个例子,我用下面的HTML代码创建了一个示例页面。(链接):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

没有javascript,它说:不支持javascript和javascript:耶!支持javascript

#抓取没有JS支持:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text)
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>

#抓取与JS支持:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'

你也可以使用Python库dryscraping来抓取javascript驱动的网站。

#抓取与JS支持:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>

其他回答

我最近使用requests_html库来解决这个问题。

他们的扩展文档在readthedocs。IO非常好(跳过pypi.org上的带注释的版本)。如果您的用例是基本的,那么您可能会取得一些成功。

from requests_html import HTMLSession
session = HTMLSession()
response = session.request(method="get",url="www.google.com/")
response.html.render()

如果你在使用response.html.render()呈现你需要的数据时遇到麻烦,你可以将一些javascript传递给呈现函数来呈现你需要的特定js对象。这是从他们的文档中复制的,但这可能正是你需要的:

如果指定了script,它将在 运行时。例子:

script = """
    () => {
        return {
            width: document.documentElement.clientWidth,
            height: document.documentElement.clientHeight,
            deviceScaleFactor: window.devicePixelRatio,
        }
    } 
"""

返回执行脚本的返回值,如果有的话:

>>> response.html.render(script=script)
{'width': 800, 'height': 600, 'deviceScaleFactor': 1}

In my case, the data I wanted were the arrays that populated a javascript plot but the data wasn't getting rendered as text anywhere in the html. Sometimes its not clear at all what the object names are of the data you want if the data is populated dynamically. If you can't track down the js objects directly from view source or inspect, you can type in "window" followed by ENTER in the debugger console in the browser (Chrome) to pull up a full list of objects rendered by the browser. If you make a few educated guesses about where the data is stored, you might have some luck finding it there. My graph data was under window.view.data in the console, so in the "script" variable passed to the .render() method quoted above, I used:

return {
    data: window.view.data
}

EDIT 2021年9月:phantomjs也不再维护

EDIT 30/Dec/2017:这个答案出现在谷歌搜索的顶部结果中,所以我决定更新它。老答案仍然在最后。

dryscape不再维护,开发人员推荐的库dryscape仅适用于Python 2。我发现使用Selenium的python库和Phantom JS作为web驱动程序足够快,也很容易完成工作。

一旦你安装了Phantom JS,确保phantomjs二进制文件在当前路径下可用:

phantomjs --version
# result:
2.1.1

#例子 为了给出一个例子,我用下面的HTML代码创建了一个示例页面。(链接):

<!DOCTYPE html>
<html>
<head>
  <meta charset="utf-8">
  <title>Javascript scraping test</title>
</head>
<body>
  <p id='intro-text'>No javascript support</p>
  <script>
     document.getElementById('intro-text').innerHTML = 'Yay! Supports javascript';
  </script> 
</body>
</html>

没有javascript,它说:不支持javascript和javascript:耶!支持javascript

#抓取没有JS支持:

import requests
from bs4 import BeautifulSoup
response = requests.get(my_url)
soup = BeautifulSoup(response.text)
soup.find(id="intro-text")
# Result:
<p id="intro-text">No javascript support</p>

#抓取与JS支持:

from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(my_url)
p_element = driver.find_element_by_id(id_='intro-text')
print(p_element.text)
# result:
'Yay! Supports javascript'

你也可以使用Python库dryscraping来抓取javascript驱动的网站。

#抓取与JS支持:

import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
session.visit(my_url)
response = session.body()
soup = BeautifulSoup(response)
soup.find(id="intro-text")
# Result:
<p id="intro-text">Yay! Supports javascript</p>

把BeautifulSoup和Selenium混合在一起对我来说效果很好。

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup as bs

driver = webdriver.Firefox()
driver.get("http://somedomain/url_that_delays_loading")
    try:
        element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))) #waits 10 seconds until element is located. Can have other wait conditions  such as visibility_of_element_located or text_to_be_present_in_element

        html = driver.page_source
        soup = bs(html, "lxml")
        dynamic_text = soup.find_all("p", {"class":"class_name"}) #or other attributes, optional
    else:
        print("Couldnt locate element")

附注:你可以在这里找到更多的等待条件

Playwright-Python

还有一种选择是剧作家- Python,它是微软剧作家(本身是受木偶大师影响的浏览器自动化库)到Python的移植。

下面是选择一个元素并抓取它的文本的最小示例:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://whatsmyuseragent.org/")
    ua = page.query_selector(".user-agent");
    print(ua.text_content())
    browser.close()

如前所述,Selenium是呈现JavaScript结果的好选择:

from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.headless = True
browser = Firefox(executable_path="/usr/local/bin/geckodriver", options=options)

url = "https://www.example.com"
browser.get(url)

gazpacho是一个非常容易解析渲染html的库:

from gazpacho import Soup

soup = Soup(browser.page_source)
soup.find("a").attrs['href']