How To Create A Python Scrapy Project


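To create a project in Scrapy, first make sure you have had a good introduction to the framework, which also ensures that Scrapy is installed and ready to go. Once you are ready, we will look at how to create a new Python Scrapy project and what to do once it has been created. The process is similar for every Scrapy project, and it is a good exercise for practicing web scraping with Scrapy.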

Start A Project

To start the project, we run the scrapy startproject command along with the name we want to give the project. The target website is located at books.toscrape.com.

scrapy $scrapy startproject bookstoscrape
New Scrapy project 'bookstoscrape', using template directory 
'\python\python39\lib\site-packages\scrapy\templates\project', created in:
    C:\python\scrapy\bookstoscrape

You can start your first spider with:
    cd bookstoscrape
    scrapy genspider example example.com

We can open the project in PyCharm, and the project's folder structure should look familiar to you.
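If you would rather confirm the folder structure from Python than from the IDE, a small check like the one below can help. It is only a sketch: the file list is an assumption based on what the default project template of recent Scrapy versions generates, and missing_project_files is a hypothetical helper name, not part of Scrapy.

```python
from pathlib import Path

# Assumed layout of a freshly generated project named "bookstoscrape".
# Other Scrapy versions or custom templates may produce a different set.
EXPECTED = [
    "scrapy.cfg",
    "bookstoscrape/__init__.py",
    "bookstoscrape/items.py",
    "bookstoscrape/middlewares.py",
    "bookstoscrape/pipelines.py",
    "bookstoscrape/settings.py",
    "bookstoscrape/spiders/__init__.py",
]


def missing_project_files(root):
    """Return the expected files that do not exist under the project root."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).is_file()]


if __name__ == "__main__":
    # Run this from the directory that contains scrapy.cfg.
    print(missing_project_files("."))
```

An empty list means every file in the assumed layout was found.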

genspider

Once the project is created, you will want to generate one or more Spiders for it. This is done with the scrapy genspider command.

bookstoscrape $scrapy genspider books books.toscrape.com
Created spider 'books' using template 'basic' in module:
  bookstoscrape.spiders.books


books.py

Here is the default boilerplate code for a newly generated spider in Scrapy. It is nice to have the structure of the code set up for us.

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass

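Testing XPath and CSS Selectors

To get ready to add code to the Spider that was created for us, you first need to figure out which selectors will get you the data you want. This is done by inspecting the source markup of the target page, testing selectors in the browser console, and trying them in the Scrapy shell.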

bookstoscrape $scrapy shell 'https://books.toscrape.com/'
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000001F2C93E31F0>
[s]   item       {}
[s]   request    <GET https://books.toscrape.com/>
[s]   response   <200 https://books.toscrape.com/>
[s]   settings   <scrapy.settings.Settings object at 0x000001F2C93E3430>
[s]   spider     <BooksSpider 'books' at 0x1f2c98485b0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser

Inspecting The HTML Source

Right-click anywhere on the page and you can inspect any element you like.

We are interested in each book and its associated data, all of which is contained in an article element.

Testing XPath and CSS Selectors In The Browser Console

Both Firefox and Chrome provide XPath and CSS selector helpers that you can use in the console.

$x('the xpath')

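Based on what we found by inspecting the source above, we know that every book item on the page lives inside an **<article> tag with a class of product_pod**. If we use XPath, the expression **$x('//article')** gives us all 20 book items on this first page.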

$$('the css selector')

If you prefer the CSS selector version, which produces the same result, then **$$('.product_pod')** does the trick.

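Testing Selectors In Scrapy Shell

Once we have an idea of how the XPath or CSS selectors behave in the browser console, we can test them in the Scrapy shell, which is a great tool. In the Scrapy shell, enter response.xpath('//article') or response.css('.product_pod') and you will see 20 selector objects returned in both cases. That makes sense, because there are 20 book items on the page being scraped.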

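From Shell To Spider

It makes sense to try these XPath and CSS selectors in both the browser console and the Scrapy shell. Doing so gives you a good idea of which code will work once you start adding your own custom code to the Spider template that the Scrapy framework provides.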

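Building The parse() Method

The purpose of the parse() method is to look at the response that comes back and parse its output. There are many ways to build this part of the Spider, from the very basic to the more advanced once you start adding Items and Item Loaders. Initially, the only goal is to return or yield a Python dictionary from this function. We will look at an example that uses yield here, adding our custom code to the parse() method of the template.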

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get()
            }
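Why yield rather than return inside the loop? The plain-Python sketch below leaves Scrapy out entirely and uses a hard-coded list of three titles as a stand-in for the selector results. The function names are hypothetical and exist only for this comparison.

```python
def parse_with_return(titles):
    """return leaves the function on the first pass through the loop."""
    for title in titles:
        return {'booktitle': title}


def parse_with_yield(titles):
    """yield hands back one dictionary per pass and then keeps looping."""
    for title in titles:
        yield {'booktitle': title}


# Stand-in data; in the spider these values come from the XPath query.
titles = ['A Light in the ...', 'Tipping the Velvet', 'Soumission']

first_only = parse_with_return(titles)       # a single dictionary
all_items = list(parse_with_yield(titles))   # one dictionary per title

print(first_only)
print(len(all_items))
```

With return, only the first book would ever reach the output; with yield, every book on the page does.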

scrapy crawl {your spider}

We can now run the Spider using the scrapy crawl command.

bookstoscrape $scrapy crawl books

There will be a lot of output in the console, but you should be able to find all of the book titles.

{'booktitle': 'A Light in the ...'}
{'booktitle': 'Tipping the Velvet'}
{'booktitle': 'Soumission'}
{'booktitle': 'Sharp Objects'}
{'booktitle': 'Sapiens: A Brief History ...'}
{'booktitle': 'The Requiem Red'}
{'booktitle': 'The Dirty Little Secrets ...'}
{'booktitle': 'The Coming Woman: A ...'}
{'booktitle': 'The Boys in the ...'}
{'booktitle': 'The Black Maria'}
{'booktitle': 'Starving Hearts (Triangular Trade ...'}
{'booktitle': "Shakespeare's Sonnets"}
{'booktitle': 'Set Me Free'}
{'booktitle': "Scott Pilgrim's Precious Little ..."}
{'booktitle': 'Rip it Up and ...'}
{'booktitle': 'Our Band Could Be ...'}
{'booktitle': 'Olio'}
{'booktitle': 'Mesaerion: The Best Science ...'}
{'booktitle': 'Libertarianism for Beginners'}
{'booktitle': "It's Only the Himalayas"}

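My yield statement isn't iterating!

Important! The example above uses a yield statement rather than a return statement. Also note that we are using XPath sub queries inside the yield statement. When you run an XPath sub query inside a loop, you must put a leading period in the XPath selector. If you omit the leading period, the query is evaluated against the whole document, so you get the first result repeated once for every iteration of the loop.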


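Big To Small

When you work with XPath and CSS selectors, it is tempting to look at the target page and write a brand new query for every individual piece of information you want to scrape. A better approach is to go from big to small. For example, our initial query selected 20 article elements, and from there we can narrow down to the individual pieces. You do not want to look at the page, decide that you want the title, rating, price, and availability of every book, and then write 80 different selectors to get them. You want to grab the 20 books at the top level, then get 4 pieces of data from each book. The code below shows how to build these sub queries on top of the original XPath query.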

import scrapy


class BooksSpider(scrapy.Spider):
    name = 'books'
    allowed_domains = ['books.toscrape.com']
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for book in response.xpath('//article'):
            yield {
                'booktitle': book.xpath('.//a/text()').get(),
                'bookrating': book.xpath('.//p').attrib['class'],
                'bookprice': book.xpath('.//div[2]/p/text()').get(),
                'bookavailability': book.xpath('.//div[2]/p[2]/i/following-sibling::text()').get().strip()
            }

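The bookavailability selector is a little tricky. We are trying to get the text that comes after the <i> tag, but that text sits in a bit of a no man's land: it is a bare text node inside the <p> tag, not wrapped in an element of its own. For this we can use the following-sibling::text() selector. We also added the strip() function to remove the surrounding whitespace, but we will soon learn how to handle this better with Item Loaders.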

<p class="instock availability">
    <i class="icon-ok"></i>
    
        In stock
    
</p>

Scrapy Output

To actually output the data we captured, we can add the **-o** flag when using the scrapy crawl command to write the results to a CSV or JSON file.

bookstoscrape $scrapy crawl books -o books.json

Once you run this command, a new file appears in the Scrapy project containing all of the data you just collected.

books.json result
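The final result is a JSON file containing 20 objects, each with 4 properties for the title, rating, price, and availability. You can now practice your data science skills on the various datasets you collect.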

[
  {"booktitle": "A Light in the ...", "bookrating": "star-rating Three", "bookprice": "£51.77", "bookavailability": "In stock"},
  {"booktitle": "Tipping the Velvet", "bookrating": "star-rating One", "bookprice": "£53.74", "bookavailability": "In stock"},
  {"booktitle": "Soumission", "bookrating": "star-rating One", "bookprice": "£50.10", "bookavailability": "In stock"},
  {"booktitle": "Sharp Objects", "bookrating": "star-rating Four", "bookprice": "£47.82", "bookavailability": "In stock"},
  {"booktitle": "Sapiens: A Brief History ...", "bookrating": "star-rating Five", "bookprice": "£54.23", "bookavailability": "In stock"},
  {"booktitle": "The Requiem Red", "bookrating": "star-rating One", "bookprice": "£22.65", "bookavailability": "In stock"},
  {"booktitle": "The Dirty Little Secrets ...", "bookrating": "star-rating Four", "bookprice": "£33.34", "bookavailability": "In stock"},
  {"booktitle": "The Coming Woman: A ...", "bookrating": "star-rating Three", "bookprice": "£17.93", "bookavailability": "In stock"},
  {"booktitle": "The Boys in the ...", "bookrating": "star-rating Four", "bookprice": "£22.60", "bookavailability": "In stock"},
  {"booktitle": "The Black Maria", "bookrating": "star-rating One", "bookprice": "£52.15", "bookavailability": "In stock"},
  {"booktitle": "Starving Hearts (Triangular Trade ...", "bookrating": "star-rating Two", "bookprice": "£13.99", "bookavailability": "In stock"},
  {"booktitle": "Shakespeare's Sonnets", "bookrating": "star-rating Four", "bookprice": "£20.66", "bookavailability": "In stock"},
  {"booktitle": "Set Me Free", "bookrating": "star-rating Five", "bookprice": "£17.46", "bookavailability": "In stock"},
  {"booktitle": "Scott Pilgrim's Precious Little ...", "bookrating": "star-rating Five", "bookprice": "£52.29", "bookavailability": "In stock"},
  {"booktitle": "Rip it Up and ...", "bookrating": "star-rating Five", "bookprice": "£35.02", "bookavailability": "In stock"},
  {"booktitle": "Our Band Could Be ...", "bookrating": "star-rating Three", "bookprice": "£57.25", "bookavailability": "In stock"},
  {"booktitle": "Olio", "bookrating": "star-rating One", "bookprice": "£23.88", "bookavailability": "In stock"},
  {"booktitle": "Mesaerion: The Best Science ...", "bookrating": "star-rating One", "bookprice": "£37.59", "bookavailability": "In stock"},
  {"booktitle": "Libertarianism for Beginners", "bookrating": "star-rating Two", "bookprice": "£51.33", "bookavailability": "In stock"},
  {"booktitle": "It's Only the Himalayas", "bookrating": "star-rating Two", "bookprice": "£45.17", "bookavailability": "In stock"}
]
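As a first step toward working with the scraped data, the sketch below summarizes a few records using only the standard library. The three records are copied from the output above and embedded as a string so the example runs on its own. The summarize function is a hypothetical helper, and it assumes every price starts with a £ sign and every rating ends with the star count as a word.

```python
import json
from collections import Counter

# Three records copied from the scraped output, embedded for a standalone run.
SAMPLE = '''[
  {"booktitle": "A Light in the ...", "bookrating": "star-rating Three", "bookprice": "£51.77", "bookavailability": "In stock"},
  {"booktitle": "Tipping the Velvet", "bookrating": "star-rating One", "bookprice": "£53.74", "bookavailability": "In stock"},
  {"booktitle": "Soumission", "bookrating": "star-rating One", "bookprice": "£50.10", "bookavailability": "In stock"}
]'''


def summarize(books):
    """Count the books, average their prices, and tally the star ratings."""
    # Assumes prices look like "£51.77" and ratings like "star-rating Three".
    prices = [float(book["bookprice"].lstrip("£")) for book in books]
    ratings = Counter(book["bookrating"].split()[-1] for book in books)
    return {
        "count": len(books),
        "mean_price": round(sum(prices) / len(prices), 2),
        "ratings": dict(ratings),
    }


books = json.loads(SAMPLE)
summary = summarize(books)
print(summary)
```

To run the same summary on the real file, load it with json.load() from books.json instead of parsing the embedded sample.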