Scrapy Request Callback Not Being Called


1. Problem

While using the Scrapy crawling framework, I ran into a problem: a scrapy Request was never issued. The main parse callback tries to trigger a second parse callback, but that function is never executed. The relevant code:

# Legacy (Python 2-era) Scrapy imports, matching the SgmlLinkExtractor
# and HtmlXPathSelector APIs used below
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# Problem is the project's Item subclass; import it from your items module

class CodechefSpider(CrawlSpider):
    name = "codechef_crawler"
    allowed_domains = ["codechef.com"]
    start_urls = [
        "http://www.codechef.com/problems/easy/",
        "http://www.codechef.com/problems/medium/",
        "http://www.codechef.com/problems/hard/",
        "http://www.codechef.com/problems/challenge/",
    ]

    rules = (Rule(SgmlLinkExtractor(allow=('/problems/[A-Z0-9-]+',)), callback='parse_item'),)

    def parse_solution(self, response):
        hxs = HtmlXPathSelector(response)
        # extract() returns a list of strings, so join before encoding
        x = hxs.select("//tr[@class='kol']//td[8]").extract()
        f = open('test/' + response.url.split('/')[-1] + '.txt', 'wb')
        f.write(''.join(x).encode("utf-8"))
        f.close()

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = Problem()
        item['title'] = hxs.select("//table[@class='pagetitle-prob']/tr/td/h1/text()").extract()
        item['content'] = hxs.select("//div[@class='node clear-block']//div[@class='content']").extract()
        filename = str(item['title'][0])
        solutions_url = ('http://www.codechef.com/status/'
                         + response.url.split('/')[-1]
                         + '?language=All&status=15&handle=&sort_by=Time&sorting_order=asc')
        # Bug: the Request is constructed but never handed back to Scrapy
        Request(solutions_url, callback=self.parse_solution)
        f = open('problems/' + filename + '.html', 'wb')
        f.write("<div style='width:800px;margin:50px'>")
        for i in item['content']:
            f.write(i.encode("utf-8"))
        f.write("</div>")
        f.close()

The parse_solution method is never called, yet the spider runs without raising any errors.
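The root cause is plain Python semantics, independent of Scrapy. A minimal sketch of the difference (FakeRequest and both parse functions are illustrative stand-ins, not Scrapy APIs):

```python
class FakeRequest:
    """Stand-in for scrapy.Request; only stores a URL and a callback."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def parse_without_yield(response_url):
    # Mirrors the original parse_item: the request object is built
    # and immediately discarded; the function implicitly returns None.
    FakeRequest(response_url + "/status")

def parse_with_yield(response_url):
    # With yield, calling the function produces a generator; each
    # yielded request is handed to whoever iterates over the callback.
    yield FakeRequest(response_url + "/status")

print(parse_without_yield("http://example.com/PROB"))        # None
print(len(list(parse_with_yield("http://example.com/PROB"))))  # 1
```

Scrapy iterates over whatever a callback returns, so a callback that returns None gives the engine nothing to schedule.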

2. Solution

According to the accepted answer, the fix is to change Request(solutions_url, callback=self.parse_solution) to yield Request(solutions_url, callback=self.parse_solution). Without yield, the Request object is created and immediately discarded; Scrapy only schedules requests that a callback returns or yields. The corrected code:

# Legacy (Python 2-era) Scrapy imports, matching the SgmlLinkExtractor
# and HtmlXPathSelector APIs used below
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.http import Request
# Problem is the project's Item subclass; import it from your items module

class CodechefSpider(CrawlSpider):
    name = "codechef_crawler"
    allowed_domains = ["codechef.com"]
    start_urls = [
        "http://www.codechef.com/problems/easy/",
        "http://www.codechef.com/problems/medium/",
        "http://www.codechef.com/problems/hard/",
        "http://www.codechef.com/problems/challenge/",
    ]

    rules = (Rule(SgmlLinkExtractor(allow=('/problems/[A-Z0-9-]+',)), callback='parse_item'),)

    def parse_solution(self, response):
        hxs = HtmlXPathSelector(response)
        # extract() returns a list of strings, so join before encoding
        x = hxs.select("//tr[@class='kol']//td[8]").extract()
        f = open('test/' + response.url.split('/')[-1] + '.txt', 'wb')
        f.write(''.join(x).encode("utf-8"))
        f.close()

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        item = Problem()
        item['title'] = hxs.select("//table[@class='pagetitle-prob']/tr/td/h1/text()").extract()
        item['content'] = hxs.select("//div[@class='node clear-block']//div[@class='content']").extract()
        filename = str(item['title'][0])
        solutions_url = ('http://www.codechef.com/status/'
                         + response.url.split('/')[-1]
                         + '?language=All&status=15&handle=&sort_by=Time&sorting_order=asc')
        # yield hands the Request back to the engine, which schedules it
        # and later invokes parse_solution with the response
        yield Request(solutions_url, callback=self.parse_solution)
        f = open('problems/' + filename + '.html', 'wb')
        f.write("<div style='width:800px;margin:50px'>")
        for i in item['content']:
            f.write(i.encode("utf-8"))
        f.write("</div>")
        f.close()


By adding the yield keyword, the Request is actually handed back to Scrapy's engine for scheduling, and the parse_solution method is now executed.
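To see why yield matters from the framework's side, here is a toy version of the engine loop (a deliberately simplified model, not Scrapy's actual code): the engine iterates over whatever the callback produced and schedules every Request it finds. A callback that merely constructs a Request returns None, so nothing is scheduled.

```python
class Request:
    """Toy request: just a URL and an optional callback."""
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback

def run_callback(callback, response):
    """Schedule every Request the callback returns or yields,
    the way a simplified engine loop would."""
    scheduled = []
    result = callback(response)
    if result is not None:  # generators and other iterables of requests
        for obj in result:
            if isinstance(obj, Request):
                scheduled.append(obj)
    return scheduled

def parse_item_broken(response):
    # Request is built but discarded -- the engine never sees it
    Request("http://www.codechef.com/status/TEST")

def parse_item_fixed(response):
    # Yielding turns the callback into a generator the engine can drain
    yield Request("http://www.codechef.com/status/TEST")

print(len(run_callback(parse_item_broken, None)))  # 0 -- nothing scheduled
print(len(run_callback(parse_item_fixed, None)))   # 1 -- request scheduled
```

The same reasoning applies to items: anything a Scrapy callback wants processed, whether a Request or an Item, must be returned or yielded, never just constructed.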