我想刮取无限滚动实现的页面的所有数据。下面的python代码可以工作。

for i in range(100):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(5)

这意味着每当我向下滚动到底部时,我都需要等待5秒,这通常足以让页面完成加载新生成的内容。但是,这可能并不省时。页面可能在5秒内完成新内容的加载。如何在每次向下滚动时检测页面是否完成了新内容的加载?如果我能检测到这一点,一旦我知道页面完成加载,我就可以再次向下滚动以查看更多内容。这样更节省时间。


当前回答

另外,您可以检查DOM是否没有更多的修改,而不是向下滚动100次(在页面底部是AJAX惰性加载的情况下)

def scrollDown(driver, value):
    driver.execute_script("window.scrollBy(0,"+str(value)+")")

# Scroll down the page
def scrollDownAllTheWay(driver):
    old_page = driver.page_source
    while True:
        logging.debug("Scrolling loop")
        for i in range(2):
            scrollDown(driver, 500)
            time.sleep(2)
        new_page = driver.page_source
        if new_page != old_page:
            old_page = new_page
        else:
            break
    return True

其他回答

如果您试图滚动并找到页面上的所有项目。您可以考虑使用以下方法。这是其他人在这里提到的一些方法的组合。它帮我完成了任务:

while True:
    try:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem1 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_1 = len(elem1)
        print(f"A list Length {len_elem_1}")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        driver.implicitly_wait(30)
        time.sleep(4)
        elem2 = WebDriverWait(driver, 30).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "element-name")))
        len_elem_2 = len(elem2)
        print(f"B list Length {len_elem_2}")
        if len_elem_1 == len_elem_2:
            print(f"final length = {len_elem_1}")
            break
    except TimeoutException:
            print("Loading took too much time!")

Selenium无法检测页面是否完全加载,但javascript可以。我建议你试试这个。

from selenium.webdriver.support.ui import WebDriverWait
WebDriverWait(driver, 100).until(lambda driver: driver.execute_script('return document.readyState') == 'complete')

这将执行javascript代码,而不是使用python,因为javascript可以检测页面何时完全加载,它将显示“完成”。这个代码的意思是在100秒内,继续尝试这个文档。readyState直到complete显示。

你可以通过这个函数简单地做到这一点:

def page_is_loading(driver):
    while True:
        x = driver.execute_script("return document.readyState")
        if x == "complete":
            return True
        else:
            yield False

当你想在页面加载完成后做一些事情时,你可以使用:

Driver = webdriver.Firefox(options=Options, executable_path='geckodriver.exe')
Driver.get("https://www.google.com/")

while not page_is_loading(Driver):
    continue

Driver.execute_script("alert('page is loaded')")

从硒/ webdriver /支持/ wait.py

driver = ...
from selenium.webdriver.support.wait import WebDriverWait
element = WebDriverWait(driver, 10).until(
    lambda x: x.find_element_by_id("someId"))

webdriver将在默认情况下通过.get()方法等待页面加载。

正如你可能正在寻找一些特定的元素@user227215所说的,你应该使用WebDriverWait来等待位于你页面中的元素:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

browser = webdriver.Firefox()
browser.get("url")
delay = 3 # seconds
try:
    myElem = WebDriverWait(browser, delay).until(EC.presence_of_element_located((By.ID, 'IdOfMyElement')))
    print "Page is ready!"
except TimeoutException:
    print "Loading took too much time!"

我用它来检查提醒。您可以使用任何其他类型方法来查找定位器。

编辑1:

I should mention that the webdriver will wait for a page to load by default. It does not wait for loading inside frames or for ajax requests. It means when you use .get('url'), your browser will wait until the page is completely loaded and then go to the next command in the code. But when you are posting an ajax request, webdriver does not wait and it's your responsibility to wait an appropriate amount of time for the page or a part of page to load; so there is a module named expected_conditions.