
Module libraries

requests

Requests is built on urllib3, one of the most downloaded Python packages. With Requests, sending HTTP requests becomes extremely simple.

import requests
r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
r.status_code  # 200
r.headers['content-type']
# 'application/json; charset=utf8'
r.encoding
# 'utf-8'
r.text
# '{"type":"User"...'
r.json()
# {'disk_usage': 368627, 'private_gists': 484, ...}
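
As a small follow-up sketch, the snippet below shows how Requests encodes query parameters, without sending anything over the network. The search URL and the q and per_page parameters are illustrative assumptions and are not part of the example above.

```python
import requests

# Hypothetical endpoint and parameters, used only for illustration.
request = requests.Request(
    "GET",
    "https://api.github.com/search/repositories",
    params={"q": "scrapy", "per_page": 5},
)

# prepare() builds the exact request that would be sent, without sending it.
prepared = request.prepare()

print(prepared.method)  # GET
print(prepared.url)     # https://api.github.com/search/repositories?q=scrapy&per_page=5
```

Passing the same params dictionary to requests.get would send the request to that URL.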

Scrapy

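Scrapy is an application framework, written in Python, for crawling websites and extracting structured data.

Scrapy is commonly used in programs for data mining, information processing, and archiving historical data.

With Scrapy it is usually easy to write a spider that fetches the content or images of a given website.

Building a Scrapy spider takes four steps:

Create a project (scrapy startproject xxx): create a new spider project
Define the target (edit items.py): specify the data you want to scrape
Write the spider (spiders/xxspider.py): write the spider that crawls the pages
Store the content (pipelines.py): design pipelines that store the scraped data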

Installation on Windows

Install the Scrapy framework with pip:

pip install Scrapy

Creating a project

Before crawling, you must create a new Scrapy project. Change into the directory where you want to keep the project and run:

scrapy startproject mySpider

Here mySpider is the project name. The command creates a mySpider folder with roughly the following directory structure:

mySpider/
    scrapy.cfg
    mySpider/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

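The main files serve the following purposes:

scrapy.cfg: the project's configuration file.
mySpider/: the project's Python module; your code is imported from here.
mySpider/items.py: the project's item definitions.
mySpider/pipelines.py: the project's pipelines.
mySpider/settings.py: the project's settings.
mySpider/spiders/: the directory that holds the spider code.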

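Spider example

Defining the target

The goal is to scrape the name, title, and profile of every teacher listed at http://www.itcast.cn/channel/teacher.shtml.

Open items.py in the mySpider directory.
An Item defines structured data fields for holding scraped data. It works much like a Python dict but adds some protection against mistakes such as misspelled field names.
To define an Item, create a class that subclasses scrapy.Item and declare class attributes of type scrapy.Field (similar in spirit to an ORM mapping).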

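Writing the spider

Fetching the data

Run the following command in the mySpider/mySpider directory. In it, itcast is the spider name (and the name of the generated file), and "itcast.cn" is the domain to crawl.

scrapy genspider itcast "itcast.cn"

Open itcast.py in the mySpider/spiders directory. Starting from the generated skeleton, the spider looks like this once start_urls points at the teacher page and parse saves the response: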

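    import scrapy


    class ItcastSpider(scrapy.Spider):
        # Name of the spider
        name = 'itcast'
        # Domains the spider is allowed to crawl
        allowed_domains = ['itcast.cn']
        # URLs where crawling starts
        start_urls = ['https://www.itcast.cn/channel/teacher.shtml']

        # Extraction method; receives the response passed on by the downloader middleware
        def parse(self, response, *args, **kwargs):
            filename = "teacher.html"
            # Save the raw page body; the with block closes the file afterwards
            with open(filename, 'wb') as f:
                f.write(response.body)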

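name = "": the identifying name of the spider. It must be unique; different spiders must have different names.

allowed_domains = []: the domains the spider may crawl. It restricts the spider to pages under these domains; URLs outside them are ignored.

start_urls = (): a tuple or list of URLs to crawl. The spider starts here, so the first pages downloaded come from these URLs; further URLs are derived from the pages fetched from them.

parse(self, response): the parsing method, called after each start URL has been downloaded. The Response object returned for that URL is passed in as its only argument. Its main job is to parse the downloaded page and extract data from it.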
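
The allowed_domains restriction described above can be illustrated with a few lines of standard-library Python. This is a rough sketch of the idea (a host is accepted if it equals an allowed domain or is a subdomain of one), not Scrapy's actual implementation. The helper name is_allowed is hypothetical.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["itcast.cn"]
print(is_allowed("https://www.itcast.cn/channel/teacher.shtml", allowed))  # True
print(is_allowed("https://example.com/page", allowed))                     # False
print(is_allowed("https://notitcast.cn/", allowed))                        # False
```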

Running

Run the following command from the parent directory of the spiders directory:

scrapy crawl itcast

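When the log prints [scrapy] INFO: Spider closed (finished), the run is complete. A teacher.html file then appears in the current folder, containing the full source of the page we set out to crawl.

Extracting the data

Earlier, an ItcastItem class was defined in mySpider/items.py. Import it here:

from mySpider.items import ItcastItem

In xx.items, xx is the package (directory) name and items is the module (file) name; the name after import is the class to import.

The complete code of itcast.py is as follows: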

import scrapy
from mySpider.items import ItcastItem


class ItcastSpider(scrapy.Spider):
    # Name of the spider
    name = 'itcast'
    # Domains the spider is allowed to crawl
    allowed_domains = ['itcast.cn']
    # URLs where crawling starts
    start_urls = ['https://www.itcast.cn/channel/teacher.shtml']

    # Extraction method; receives the response passed on by the downloader middleware
    def parse(self, response, *args, **kwargs):
        for each in response.xpath("//div[@class='li_txt']"):
            # Wrap the extracted data in an ItcastItem object
            item = ItcastItem()
            # extract() returns a list of strings
            name = each.xpath("h3/text()").extract()
            title = each.xpath("h4/text()").extract()
            info = each.xpath("p/text()").extract()

            # Each XPath result here is a one-element list, so take the first entry
            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            # Hand each item to the Scrapy engine as soon as it is built
            yield item

        # Alternative: collect the items in a list and return the list at the end

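The simplest way to save the scraped data is the -o option, which writes the output to a file in the given format. Four formats are commonly used:

JSON format (non-ASCII characters are written as Unicode escapes unless FEED_EXPORT_ENCODING is set)

scrapy crawl itcast -o teachers.json

JSON Lines format (non-ASCII characters are written as Unicode escapes unless FEED_EXPORT_ENCODING is set)

scrapy crawl itcast -o teachers.jsonl

CSV (comma-separated values), which can be opened in Excel

scrapy crawl itcast -o teachers.csv

XML format

scrapy crawl itcast -o teachers.xml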
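
The fourth step listed at the start (pipelines.py) is the alternative to -o. The following is a minimal sketch of a pipeline that writes one JSON object per line. The class name JsonLinesPipeline and the output file name are assumptions made for illustration, and the method signatures follow the classic item pipeline interface.

```python
import json


class JsonLinesPipeline:
    """Write each scraped item as one JSON object per line."""

    def __init__(self, path="teachers_pipeline.jsonl"):  # hypothetical file name
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline
```

To enable such a pipeline, it would be registered in settings.py, for example ITEM_PIPELINES = {"mySpider.pipelines.JsonLinesPipeline": 300}, assuming the class lives in mySpider/pipelines.py.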
