Recursion Errors and How to Fix Them

When scraping web pages with Python and Selenium, a recursive helper function can trigger a recursion error. For example, the following script hit one while scraping:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

def isReady(browser):
    return browser.execute_script("return document.readyState") == "complete"

def waitUntilReady(browser):
    if not isReady(browser):
        waitUntilReady(browser)

browser = webdriver.Firefox()
browser.get('http://www.usprwire.com/cgi-bin/news/search.cgi')

# make a search
query = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.NAME, "query")))
query.send_keys('"test"')
submit = browser.find_element_by_xpath("//input[@value='Search']")
submit.click()
numarticles = 0

# grab article urls
npages = 1
article_urls = []
for page in range(1, npages + 1):
    article_urls += [elm.get_attribute("href") for elm in browser.find_elements_by_class_name('category_links')]
    if page <= 121: #click to the next page
        browser.find_element_by_link_text('[>>]').click()
    if page == 122: # last page in search results, so no '[>>]' to click on. Move on to next steps.
        continue



# iterate over urls and save the HTML source
for url in article_urls:
    browser.get(url)
    waitUntilReady(browser)
    numarticles = numarticles+1
    title = browser.current_url.split("/")[-1]
    with open('/Users/My/Dropbox/File/Place/'+str(numarticles)+str(title), 'wb') as fw:  # binary mode: the data written is UTF-8 bytes
        fw.write(browser.page_source.encode('utf-8'))

2. Solution

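The error comes from waitUntilReady being recursive. Whenever the page is not yet ready, the function immediately calls itself again, with no pause between checks. Each call adds a stack frame, so a page that takes even a moment to load pushes the call depth past Python's recursion limit (1000 by default), and the interpreter raises a "maximum recursion depth exceeded" error. The fix is to replace the recursion with a loop, for example: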

import time

def waitUntilReady(browser):
    # poll once per second until the document has finished loading
    while not isReady(browser):
        time.sleep(1)

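With this change, waitUntilReady no longer recurses. It checks the page once per second until loading has finished, and the call stack stays at a constant depth no matter how long that takes.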
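One caveat with a plain polling loop is that it never gives up if the page never reaches the "complete" state. The sketch below is one possible variation that adds a deadline. The name wait_until_ready and the timeout and poll_interval parameters are assumptions made for illustration and are not part of the original script; the only thing it expects from browser is an execute_script method.

```python
import time

def wait_until_ready(browser, timeout=30.0, poll_interval=0.5):
    """Poll document.readyState until it is "complete".

    Returns True once the page is ready, or False if `timeout`
    seconds pass first. Both parameters are illustrative defaults.
    """
    deadline = time.monotonic() + timeout
    while True:
        if browser.execute_script("return document.readyState") == "complete":
            return True
        if time.monotonic() >= deadline:
            return False  # give up instead of looping forever
        time.sleep(poll_interval)
```

A caller could then skip or log a URL when the function returns False rather than hanging on it.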

Code example

The full script with the fix applied:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
import time

def isReady(browser):
    return browser.execute_script("return document.readyState") == "complete"

def waitUntilReady(browser):
    while not isReady(browser):
        time.sleep(1)

browser = webdriver.Firefox()
browser.get('http://www.usprwire.com/cgi-bin/news/search.cgi')

# make a search
query = WebDriverWait(browser, 60).until(EC.presence_of_element_located((By.NAME, "query")))
query.send_keys('"test"')
submit = browser.find_element_by_xpath("//input[@value='Search']")
submit.click()
numarticles = 0

# grab article urls
npages = 1
article_urls = []
for page in range(1, npages + 1):
    article_urls += [elm.get_attribute("href") for elm in browser.find_elements_by_class_name('category_links')]
    if page <= 121: #click to the next page
        browser.find_element_by_link_text('[>>]').click()
    if page == 122: # last page in search results, so no '[>>]' to click on. Move on to next steps.
        continue



# iterate over urls and save the HTML source
for url in article_urls:
    browser.get(url)
    waitUntilReady(browser)
    numarticles = numarticles+1
    title = browser.current_url.split("/")[-1]
    with open('/Users/My/Dropbox/File/Place/'+str(numarticles)+str(title), 'wb') as fw:  # binary mode: the data written is UTF-8 bytes
        fw.write(browser.page_source.encode('utf-8'))

With this change, the script no longer raises the recursion error.
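The difference between the two versions can be reproduced without launching a browser. The following is a minimal sketch, assuming Python 3.5 or later (where the error is raised as RecursionError). FakeBrowser is a hypothetical stand-in, not a Selenium class: it reports "loading" for a fixed number of polls and "complete" after that. The sleep is left out of the loop version so the demo finishes immediately.

```python
import sys

class FakeBrowser:
    """Hypothetical stub: reports "loading" for a fixed number of polls."""
    def __init__(self, polls_until_ready):
        self.remaining = polls_until_ready

    def execute_script(self, script):
        if self.remaining > 0:
            self.remaining -= 1
            return "loading"
        return "complete"

def is_ready(browser):
    return browser.execute_script("return document.readyState") == "complete"

def wait_recursive(browser):
    # Same shape as the original waitUntilReady: one new stack frame per poll
    if not is_ready(browser):
        wait_recursive(browser)

def wait_iterative(browser):
    # Loop version: the stack depth stays constant however many polls it takes
    polls = 0
    while not is_ready(browser):
        polls += 1
    return polls

# Require more polls than the interpreter allows nested calls
polls_needed = sys.getrecursionlimit() + 100

try:
    wait_recursive(FakeBrowser(polls_needed))
    recursive_failed = False
except RecursionError:
    recursive_failed = True

iterative_polls = wait_iterative(FakeBrowser(polls_needed))
print("recursive version failed:", recursive_failed)
print("iterative version polls:", iterative_polls)
```

Raising the limit with sys.setrecursionlimit would only postpone the failure, since a slow page can always need more polls than the limit allows.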