Web scraping JavaScript pages with Python

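I'm trying to develop a simple web scraper. I want to extract the text without the HTML code. It works on plain HTML, but not on some pages where JavaScript code adds text.

For example, if some JavaScript code adds some text, I can't see it, because when I call: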

response = urllib2.urlopen(request)

I get the original text without the added text (because JavaScript is executed on the client).

So, I'm looking for some ideas to solve this problem.

Current answer

Playwright-Python

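Yet another option is playwright-python, a port of Microsoft's Playwright (itself a Puppeteer-influenced browser automation library) to Python.

Here's a minimal example of selecting an element and grabbing its text: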

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("http://whatsmyuseragent.org/")
    ua = page.query_selector(".user-agent")
    print(ua.text_content())
    browser.close()
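
When the text is added by JavaScript some time after the page loads, it can help to wait for the element before reading it. The following is a minimal sketch under that assumption, not part of the original answer: the function name `rendered_text` and its parameters are made up for illustration, and it relies on Playwright's `wait_for_selector`.

```python
def rendered_text(url, selector, timeout_ms=10_000):
    """Return the text of the first element matching selector, once it exists."""
    # Imported here so this module still loads where Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        try:
            page = browser.new_page()
            page.goto(url)
            # Block until JavaScript has inserted the element (raises on timeout).
            element = page.wait_for_selector(selector, timeout=timeout_ms)
            return element.text_content()
        finally:
            # Close the browser even if navigation or the wait fails.
            browser.close()
```

A call might look like rendered_text("http://whatsmyuseragent.org/", ".user-agent").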

2022-03-20 20:21:29

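Other answers

Simple and quick solution:

I ran into the same problem. I wanted to scrape some data that is built with JavaScript. If I scraped only the text from this website with BeautifulSoup, I ended up with script tags in the text. I wanted to render these tags and grab the information from them. Also, I didn't want to use heavy frameworks like Scrapy and selenium.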

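I found that the get method of the requests module, given the URL and a browser-like User-Agent header, returns the page's HTML with the contents of the script tags included. Note that requests does not execute JavaScript, so this only helps when the data is already present in the HTML the server sends.

Example: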

import requests
custom_User_agent = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"
url = "https://www.abc.xyz/your/url"
response = requests.get(url, headers={"User-Agent": custom_User_agent})
html_text = response.text

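This returns the site's HTML as the server sends it, script tags included.

Hope this helps as a quick and easy solution for sites that ship their data inside script tags.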
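
If the data you want is already present in the fetched HTML inside a script tag, it can often be parsed without running any JavaScript. Here is a minimal sketch, assuming a hypothetical page that embeds its data as JSON in a script tag with the id `page-data`. The HTML, the id, and the `embedded_json` helper are all invented for illustration.

```python
import json
import re

# Hypothetical static HTML as a server might return it: the data lives in a script tag.
HTML = """
<html><body>
<div id="app"></div>
<script id="page-data" type="application/json">
{"title": "Example product", "price": 19.99}
</script>
</body></html>
"""


def embedded_json(html, script_id):
    """Return the parsed JSON body of <script id=script_id>, or None if absent."""
    pattern = re.compile(
        r'<script[^>]*\bid="' + re.escape(script_id) + r'"[^>]*>(.*?)</script>',
        re.DOTALL | re.IGNORECASE,
    )
    match = pattern.search(html)
    if match is None:
        return None
    return json.loads(match.group(1))


data = embedded_json(HTML, "page-data")
print(data["title"], data["price"])
```

In real use, HTML would be the html_text from the requests call above. A regular expression is enough for a sketch like this, but a proper HTML parser is more robust on real pages.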

2021-06-16 07:34:26

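We are not getting the correct results because any JavaScript-generated content needs to be rendered on the DOM. When we fetch an HTML page, we fetch the initial DOM, unmodified by JavaScript.

Therefore, we need to render the JavaScript content before we scrape the page.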
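
To make this concrete, here is a minimal sketch using only the standard library. The HTML string and the `TextExtractor` class are made up for illustration. The parser never runs the script, so the text the script would add in a browser does not show up in the extracted text.

```python
from html.parser import HTMLParser

# Hypothetical page: one static paragraph plus a script that would add text in a browser.
STATIC_HTML = """
<html><body>
<p>static paragraph</p>
<script>
document.body.append("added by JavaScript");
</script>
</body></html>
"""


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


text = extract_text(STATIC_HTML)
print(text)
```

Only the static paragraph comes back. Getting the added text requires something that actually executes the script, which is what the solutions below provide.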

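Since selenium has already been mentioned many times in this thread (sometimes along with how slow it can be), I will list two other possible solutions.

Solution 1: This is a very nice tutorial on how to use Scrapy to crawl JavaScript-generated content, and we are going to follow just that.

What we will need: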

1. Docker installed on our machine. This is a plus over the other solutions so far, since it gives us an OS-independent platform.

2. Install Splash following the instructions listed for our OS. Quoting from the Splash documentation:

   Splash is a javascript rendering service. It’s a lightweight web browser with an HTTP API, implemented in Python 3 using Twisted and QT5.

   Essentially, we are going to use Splash to render JavaScript-generated content.

3. Run the Splash server:

   sudo docker run -p 8050:8050 scrapinghub/splash

4. Install the scrapy-splash plugin:

   pip install scrapy-splash

5. Assuming that we already have a Scrapy project created (if not, let's make one), we will follow the guide and update settings.py:

   Then go to your scrapy project’s settings.py and set these middlewares:

   DOWNLOADER_MIDDLEWARES = {
       'scrapy_splash.SplashCookiesMiddleware': 723,
       'scrapy_splash.SplashMiddleware': 725,
       'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
   }

   The URL of the Splash server (if you’re using Win or OSX this should be the URL of the docker machine: How to get a Docker container's IP address from the host?):

   SPLASH_URL = 'http://localhost:8050'

   And finally you need to set these values too:

   DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
   HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'

6. Finally, we can use a SplashRequest:

   In a normal spider you have Request objects which you can use to open URLs. If the page you want to open contains JS generated data you have to use SplashRequest (or SplashFormRequest) to render the page.

   Here’s a simple example:

   import scrapy
   from scrapy_splash import SplashRequest

   # QuoteItem is assumed to be the project's own scrapy.Item
   # with "author" and "quote" fields (defined in items.py).

   class MySpider(scrapy.Spider):
       name = "jsscraper"
       start_urls = ["http://quotes.toscrape.com/js/"]

       def start_requests(self):
           # Route every start URL through Splash so the JavaScript gets rendered.
           for url in self.start_urls:
               yield SplashRequest(
                   url=url, callback=self.parse, endpoint='render.html'
               )

       def parse(self, response):
           for q in response.css("div.quote"):
               quote = QuoteItem()
               quote["author"] = q.css(".author::text").extract_first()
               quote["quote"] = q.css(".text::text").extract_first()
               yield quote

   SplashRequest renders the URL as html and returns the response which you can use in the callback (parse) method.
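
Because Splash exposes an HTTP API, the same rendering can be tried without Scrapy, which is handy for checking that the container works. The following is a minimal sketch, not taken from the tutorial: the helper names `render_endpoint` and `fetch_rendered_html` are made up, and it assumes the Splash server from the steps above is listening on localhost:8050. It uses Splash's render.html endpoint with its `url` and `wait` arguments.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SPLASH_URL = "http://localhost:8050"  # assumed: the Splash server started with docker


def render_endpoint(page_url, wait=0.5, splash_url=SPLASH_URL):
    """Build the render.html URL that asks Splash to load and render page_url."""
    query = urlencode({"url": page_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(page_url, wait=0.5):
    """Fetch the rendered HTML; this needs the Splash container to be running."""
    with urlopen(render_endpoint(page_url, wait), timeout=30) as response:
        return response.read().decode("utf-8")


# Only builds the URL, so this line runs even without a Splash server.
print(render_endpoint("http://quotes.toscrape.com/js/"))
```

With the container running, fetch_rendered_html("http://quotes.toscrape.com/js/") should return the page's HTML after its scripts have run.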

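Solution 2: Let's call this one experimental for now (May 2018)... This solution works only on Python 3.6 (at the moment).

Do you know the requests module (well, who doesn't)? It now has a web-crawling little sibling, requests-HTML:

This library intends to make parsing HTML (e.g. scraping the web) as simple and intuitive as possible.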

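Install requests-html:

pipenv install requests-html

Make a request to the page's URL:

from requests_html import HTMLSession

session = HTMLSession()
r = session.get(a_page_url)

Render the response to get the JavaScript-generated bits:

r.html.render()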

Finally, the module seems to offer scraping capabilities of its own. Alternatively, we can use BeautifulSoup on the r.html object we just rendered.

2018-05-30 19:52:45

Selenium is the best tool for scraping JS and Ajax content.

Check out this article on extracting data from the web using Python.

$ pip install selenium

Then download the Chrome webdriver.

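from selenium import webdriver
from selenium.webdriver.common.by import By

browser = webdriver.Chrome()

browser.get("https://www.python.org/")

# find_element_by_id was removed in Selenium 4; use find_element with a By locator
nav = browser.find_element(By.ID, "mainnav")

print(nav.text)

browser.quit()

Easy, right?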

2018-01-18 18:45:27

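It sounds like the data you're really looking for can be accessed through a secondary URL that some JavaScript on the main page calls.

While you could try running JavaScript on the server to handle this, a simpler approach might be to load the page in Firefox and use a tool like Charles or Firebug to identify exactly what that secondary URL is. Then you can query that URL directly for the data you're interested in.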
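
Once the secondary URL is known, querying it is usually a plain HTTP request. Here is a minimal sketch, assuming the endpoint returns JSON. The URL, the payload shape, and the helper names `fetch_json` and `item_names` are hypothetical placeholders for whatever the network inspector actually reveals.

```python
import json
from urllib.request import Request, urlopen

# Hypothetical endpoint, standing in for the URL found with the network tools.
DATA_URL = "https://www.example.com/api/items?page=1"


def fetch_json(url):
    """Request the secondary URL directly and decode its JSON body."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def item_names(payload):
    """Pull the names out of a payload shaped like {"items": [{"name": ...}, ...]}."""
    return [item["name"] for item in payload.get("items", [])]


# Offline demonstration with a made-up payload.
# In real use this would be: item_names(fetch_json(DATA_URL))
sample = json.loads('{"items": [{"name": "first"}, {"name": "second"}]}')
print(item_names(sample))
```

This skips the browser entirely, so it tends to be faster than rendering the page, as long as the endpoint does not require cookies or tokens set by the page itself.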

2011-11-08 11:23:50

Pyppeteer

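You might consider Pyppeteer, a Python port of the Chrome/Chromium driver front-end Puppeteer.

Here's a simple example showing how you can use pyppeteer to access data that was injected into the page dynamically: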

import asyncio
from pyppeteer import launch

async def main():
    browser = await launch({"headless": True})
    [page] = await browser.pages()

    # normally, you go to a live site...
    #await page.goto("http://www.example.com")
    # but for this example, just set the HTML directly:
    await page.setContent("""
    <body>
    <script>
    // inject content dynamically with JS, not part of the static HTML!
    document.body.innerHTML = `<p>hello world</p>`; 
    </script>
    </body>
    """)
    print(await page.content()) # shows that the `<p>` was inserted

    # evaluate a JS expression in browser context and scrape the data
    expr = "document.querySelector('p').textContent"
    print(await page.evaluate(expr, force_expr=True)) # => hello world

    await browser.close()

asyncio.run(main())

See pyppeteer's reference docs.

2021-03-30 21:32:38
