Building a Search Engine with a Distributed Python Crawler (Complete Edition): A Full Search Engine Website Built on Scrapy, Redis, Elasticsearch, and Django


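Inspecting the page shows that part of its HTML is generated by JavaScript through AJAX calls. Such elements appear in the structure shown by the F12 element inspector, yet an XPath copied from there may match nothing, because XPath is evaluated against the structure of the raw HTML source that Scrapy downloads.

XPath expressions can be written in many ways: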

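re_selector = response.xpath("/html/body/div[1]/div[3]/div[1]/div[1]/h1/text()")
re2_selector = response.xpath('//*[@id="post-110287"]/div[1]/h1/text()')
re3_selector = response.xpath('//div[@class="entry-header"]/h1/text()')

The id-based form is precise, because an id is unique within a page.

The class-based form is the recommended one, because it generalizes across pages when we later crawl articles in a loop.

With this much we can already extract the page title, and the same XPath approach extends to any other field: right-click the element in the browser's inspector and copy its XPath.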

Open an interactive shell for debugging:

scrapy shell http://blog.jobbole.com/110287/

Complete code for extracting the Jobbole (伯乐在线) article fields with XPath:

# -*- coding: utf-8 -*-
import scrapy
import re

class JobboleSpider(scrapy.Spider):
    name = "jobbole"
    allowed_domains = ["blog.jobbole.com"]
    start_urls = ['http://blog.jobbole.com/110287/']

    def parse(self, response):
        # Extract the individual fields of the article
        title = response.xpath('//div[@class="entry-header"]/h1/text()').extract_first("")
        create_date = response.xpath("//p[@class='entry-meta-hide-on-mobile']/text()").extract()[0].strip().replace("·","").strip()
        praise_nums = response.xpath("//span[contains(@class, 'vote-post-up')]/h10/text()").extract()[0]
        fav_nums = response.xpath("//span[contains(@class, 'bookmark-btn')]/text()").extract()[0]
        # Pull the first run of digits out of the text
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = match_re.group(1)

        comment_nums = response.xpath("//a[@href='#article-comment']/span/text()").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = match_re.group(1)

        content = response.xpath("//div[@class='entry']").extract()[0]

        tag_list = response.xpath("//p[@class='entry-meta-hide-on-mobile']/a/text()").extract()
        # Drop the comment-count link (its text ends with "评论"), which sits in the same <p> as the tags
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)
        pass

Using CSS selectors:

        # Extract the same fields with CSS selectors
        # front_image_url = response.meta.get("front_image_url", "")  # article cover image
        title = response.css(".entry-header h1::text").extract_first()
        create_date = response.css("p.entry-meta-hide-on-mobile::text").extract()[0].strip().replace("·","").strip()
        praise_nums = response.css(".vote-post-up h10::text").extract()[0]
        fav_nums = response.css(".bookmark-btn::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", fav_nums)
        if match_re:
            fav_nums = int(match_re.group(1))
        else:
            fav_nums = 0

        comment_nums = response.css("a[href='#article-comment'] span::text").extract()[0]
        match_re = re.match(r".*?(\d+).*", comment_nums)
        if match_re:
            comment_nums = int(match_re.group(1))
        else:
            comment_nums = 0

        content = response.css("div.entry").extract()[0]

        tag_list = response.css("p.entry-meta-hide-on-mobile a::text").extract()
        tag_list = [element for element in tag_list if not element.strip().endswith("评论")]
        tags = ",".join(tag_list)
        pass

3. Crawling all articles

The yield keyword

# Download the detail page with a Request; once it is downloaded, the parse_detail() callback extracts the article fields
yield Request(url=parse.urljoin(response.url,post_url),callback=self.parse_detail)

Request, imported from scrapy.http, downloads a page:

from scrapy.http import Request
Request(url=parse.urljoin(response.url,post_url),callback=self.parse_detail)

parse.urljoin completes URLs, since an href may hold only a partial (relative) URL:

from urllib import parse
url=parse.urljoin(response.url,post_url)
parse.urljoin("http://blog.jobbole.com/all-posts/","http://blog.jobbole.com/111535/")
# Result: http://blog.jobbole.com/111535/
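
To make the behavior of urljoin concrete, here is a small standalone sketch. The relative paths `/111535/` and `page/2/` are made-up examples chosen only to illustrate the three cases; they are not taken from the site.

```python
from urllib.parse import urljoin

base = "http://blog.jobbole.com/all-posts/"

# Absolute href: the base is ignored and the href is returned unchanged.
absolute = urljoin(base, "http://blog.jobbole.com/111535/")
# Root-relative href: scheme and host come from the base.
root_relative = urljoin(base, "/111535/")
# Plain relative href: resolved against the directory of the base URL.
relative = urljoin(base, "page/2/")

print(absolute)
print(root_relative)
print(relative)
```

Because all three cases go through the same call, the spider does not need to check whether an href is complete before building a Request.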

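Class hierarchy in CSS selectors

next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
# ".next .page-numbers" (with a space) selects a .page-numbers element nested inside a .next element;
# ".next.page-numbers" (no space) selects a single element that carries both classes

Twisted's asynchronous model

Scrapy is built on Twisted, an event-driven framework well suited to asynchronous code. Never write blocking code in a spider or pipeline. Blocking code includes:

  • Accessing files, databases, or the web
  • Spawning a new process and consuming its output, for example running a shell command
  • Code that performs system-level operations, such as waiting on a system queue

Code that downloads the fields of every article: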

    def parse(self, response):
        """
        1. Extract the article URLs from the list page and hand them to Scrapy to download and parse
        2. Extract the next-page URL and hand it to Scrapy to download; the response comes back to parse
        """
        # Extract every article URL on the list page and hand it to Scrapy for download and parsing
        post_urls = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()
        for post_url in post_urls:
            # urljoin covers hrefs that lack a domain; the naive response.url + post_url
            # would break whenever the href is already absolute.
            # Once the download finishes, parse_detail parses the article detail page.
            yield Request(url=parse.urljoin(response.url,post_url),callback=self.parse_detail)
        # Extract the next page and hand it to Scrapy for download
        next_url = response.css(".next.page-numbers::attr(href)").extract_first("")
        if next_url:
            yield Request(url=parse.urljoin(response.url, next_url), callback=self.parse)

Figure: flow chart of the logic for crawling all articles.

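4. Consolidating fields with Scrapy items

The task of a crawler is to extract structured data from unstructured sources.
Items let us declare our own fields (similar to a dict, but with more features).

On the list page we need more than one URL per article.

Original approach: calling extract() immediately yields a plain list of strings, which cannot be queried again:

post_urls = response.css("#archive .floated-thumb .post-thumb a::attr(href)").extract()

Improved approach: keep the selector nodes and query each one:

post_nodes = response.css("#archive .floated-thumb .post-thumb a")
for post_node in post_nodes:
    # URL of the cover image
    image_url = post_node.css("img::attr(src)").extract_first("")
    post_url = post_node.css("::attr(href)").extract_first("")

When scheduling the detail-page download, pass the cover-image URL along in the Request's meta; the parse_detail callback reads it back from response.meta.

yield Request(url=parse.urljoin(response.url,post_url),meta={"front_image_url":image_url},callback=self.parse_detail)

front_image_url = response.meta.get("front_image_url", "")

Why urljoin helps
If the href has no domain, urljoin takes it from response.url; if the href already has one, urljoin leaves it unchanged.

Write a custom item and populate it in jobbole.py.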

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field()
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field()
    comment_nums = scrapy.Field()
    fav_nums = scrapy.Field()
    content = scrapy.Field()
    tags = scrapy.Field()

Import the item, instantiate it, then populate it:

from ArticleSpider.items import JobBoleArticleItem

article_item = JobBoleArticleItem()
article_item["title"] = title
article_item["url"] = response.url
article_item["create_date"] = create_date
article_item["front_image_url"] = [front_image_url]
article_item["praise_nums"] = praise_nums
article_item["comment_nums"] = comment_nums
article_item["fav_nums"] = fav_nums
article_item["tags"] = tags
article_item["content"] = content

yield article_item hands the item to the pipelines, which receive every item the spider yields.
Uncomment the pipeline configuration in settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
   'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
}

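Once an item reaches a pipeline, we can store it in a database or process it further.

Enabling the image download pipeline in settings.py

ITEM_PIPELINES={
'scrapy.pipelines.images.ImagesPipeline': 1,
}

H:\CodePath\pyEnvs\articlespider3\Lib\site-packages\scrapy\pipelines
This directory holds the three pipelines Scrapy ships by default: files, images, and media.

ITEM_PIPELINES is the registry of item pipelines. The number attached to each entry is its priority: the smaller the number, the earlier an item enters that pipeline.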
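
The ordering rule can be illustrated without Scrapy. The sketch below is not Scrapy's implementation; `pipeline_order` is a hypothetical helper that only sorts a dict shaped like ITEM_PIPELINES by ascending priority, using pipeline names that appear in this tutorial.

```python
# Same shape as the ITEM_PIPELINES setting: dotted path -> priority.
ITEM_PIPELINES = {
    "ArticleSpider.pipelines.ArticlespiderPipeline": 300,
    "ArticleSpider.pipelines.JsonWithEncodingPipeline": 2,
    "ArticleSpider.pipelines.ArticleImagePipeline": 1,
}


def pipeline_order(pipelines):
    """Return pipeline paths sorted by ascending priority (hypothetical helper)."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]


order = pipeline_order(ITEM_PIPELINES)
for position, path in enumerate(order, start=1):
    print(position, path)
```

With these numbers the image pipeline sees each item first, so any pipeline that runs later can rely on the fields it filled in.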

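Configuring where and how images are downloaded in settings.py

# IMAGES_MIN_HEIGHT = 100
# IMAGES_MIN_WIDTH = 100

These set the minimum height and width of the images to download.

Create an images folder in the same directory as settings.py:

import os

IMAGES_URLS_FIELD = "front_image_url"
project_dir = os.path.abspath(os.path.dirname(__file__))
IMAGES_STORE = os.path.join(project_dir, 'images')

Install PIL (through the Pillow package):
pip install pillow

Customize the pipeline so that it records each image's local path after the download:
get_media_requests() iterates over the URL field (which is why front_image_url is stored as a list) and issues the download requests
item_completed() receives the download results, including the path where each image was saved

Figure: debugger output for the custom image pipeline.

Subclass ImagesPipeline and override item_completed():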

from scrapy.pipelines.images import ImagesPipeline

class ArticleImagePipeline(ImagesPipeline):
    # Override item_completed to read the actual saved path of the image from results
    def item_completed(self, results, item, info):
        # results is a list of (success, value) tuples; value holds "path" on success
        for ok, value in results:
            if ok:
                item["front_image_path"] = value["path"]

        return item

In settings.py, register our custom pipeline in place of the built-in one:

ITEM_PIPELINES = {
   'ArticleSpider.pipelines.ArticlespiderPipeline': 300,
   # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArticleSpider.pipelines.ArticleImagePipeline':1,
}

Figure: the local path saved on the item.

Hashing the URL with MD5
Create a new package named utils:

import hashlib

def get_md5(url):
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()

if __name__ == "__main__":
    print(get_md5("http://jobbole.com".encode("utf-8")))

The caller may pass either str or bytes, so encode only when needed:

def get_md5(url):
    # In Python 3, str is unicode and must be encoded to bytes before hashing
    if isinstance(url, str):
        url = url.encode("utf-8")
    m = hashlib.md5()
    m.update(url)
    return m.hexdigest()

Store the MD5 of the URL in jobbole.py:

from ArticleSpider.utils.common import get_md5
article_item["url_object_id"] = get_md5(response.url)

5. Saving data to a local file and to MySQL

Saving to a local JSON file

Open the file with codecs to avoid encoding problems, and define JsonWithEncodingPipeline to write JSON locally:

import codecs
import json

class JsonWithEncodingPipeline(object):
    # Custom export to a JSON file
    def __init__(self):
        self.file = codecs.open('article.json', 'w', encoding="utf-8")
    def process_item(self, item, spider):
        # Convert the item to a dict and serialize it; ensure_ascii=False keeps Chinese text readable
        lines = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.file.write(lines)
        return item
    # Scrapy calls close_spider on a pipeline when the spider closes
    def close_spider(self, spider):
        self.file.close()

Register the pipeline in settings.py:

ITEM_PIPELINES = {
   'ArticleSpider.pipelines.JsonWithEncodingPipeline': 2,
   # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'ArticleSpider.pipelines.ArticleImagePipeline':1,
}

Exporting with Scrapy's JsonItemExporter

Exporters that ship with Scrapy include:

       - 'CsvItemExporter',
       - 'XmlItemExporter',
       - 'JsonItemExporter'

from scrapy.exporters import JsonItemExporter

class JsonExporterPipleline(object):
    # Export a JSON file with the JsonItemExporter that Scrapy provides
    def __init__(self):
        self.file = open('articleexport.json', 'wb')
        self.exporter = JsonItemExporter(self.file, encoding="utf-8", ensure_ascii=False)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

Register this pipeline in settings.py:

'ArticleSpider.pipelines.JsonExporterPipleline': 2

Saving to a database (MySQL)

Design the table so that its columns match the item's fields. The relationship between the table and the item resembles the one between a model and a form in Django.
Date conversion: turn the string into a date object.

import datetime

try:
    create_date = datetime.datetime.strptime(create_date, "%Y/%m/%d").date()
except Exception as e:
    create_date = datetime.datetime.now().date()

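Table design

Figure: design of the jobbole table.

  • The three num columns are NOT NULL with a default of 0.
  • content is LONGTEXT.
  • The primary key is url_object_id.

Installing the database driver
pip install mysqlclient

Fixes for installation errors on Linux:
Ubuntu:
sudo apt-get install libmysqlclient-dev
CentOS:
sudo yum install python-devel mysql-devel

Writing a (synchronous) pipeline that saves to the database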

import MySQLdb
class MysqlPipeline(object):
    # Write to MySQL synchronously
    def __init__(self):
        self.conn = MySQLdb.connect('127.0.0.1', 'root', 'password', 'article_spider', charset="utf8", use_unicode=True)
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        self.cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))
        self.conn.commit()
        # Return the item so that later pipelines still receive it
        return item

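Writing an asynchronous (Twisted) pipeline that saves to the database
Crawling can outpace synchronous database writes, so the inserts are made asynchronous.
Make the connection parameters configurable.
In settings.py:

MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "article_spider"
MYSQL_USER = "root"
MYSQL_PASSWORD = "123456"

Read the configured parameters in code.
Asynchronous inserts with Twisted: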

import MySQLdb.cursors
from twisted.enterprise import adbapi

# Connection pool: adbapi.ConnectionPool
# def __init__(self, dbapiName, *connargs, **connkw):
class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        dbparms = dict(
            host = settings["MYSQL_HOST"],
            db = settings["MYSQL_DBNAME"],
            user = settings["MYSQL_USER"],
            passwd = settings["MYSQL_PASSWORD"],
            charset='utf8',
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        # **dbparms expands to ("MySQLdb", host=settings['MYSQL_HOST'], ...)
        dbpool = adbapi.ConnectionPool("MySQLdb", **dbparms)

        return cls(dbpool)

    def process_item(self, item, spider):
        # Use Twisted to run the MySQL insert asynchronously
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)  # handle exceptions
        # Return the item so that later pipelines still receive it
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)

    def do_insert(self, cursor, item):
        # Perform the actual insert:
        # build the SQL statement that suits each item type and insert it into MySQL
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)

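Optional: Django items

github.com/scrapy-plug…

This lets the items we save map directly onto Django models.

Maintaining extraction code with Scrapy's ItemLoader

ItemLoader provides a container in which we configure the rule each field should use.
It offers three methods: add_css, add_value, and add_xpath.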

from scrapy.loader import ItemLoader
        # Load the item through an item loader
        front_image_url = response.meta.get("front_image_url", "")  # article cover image
        item_loader = ItemLoader(item=JobBoleArticleItem(), response=response)
        item_loader.add_css("title", ".entry-header h1::text")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("create_date", "p.entry-meta-hide-on-mobile::text")
        item_loader.add_value("front_image_url", [front_image_url])
        item_loader.add_css("praise_nums", ".vote-post-up h10::text")
        item_loader.add_css("comment_nums", "a[href='#article-comment'] span::text")
        item_loader.add_css("fav_nums", ".bookmark-btn::text")
        item_loader.add_css("tags", "p.entry-meta-hide-on-mobile a::text")
        item_loader.add_css("content", "div.entry")
        # load_item() applies the rules above and builds the item object
        article_item = item_loader.load_item()

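Problems with using ItemLoader directly

  1. Every value becomes a list.
  2. The values still need processing functions applied to them.

In items.py, attach processing functions to each field.
MapCompose takes one or more functions and applies them in order to every value of the field:

from scrapy.loader.processors import MapCompose
def add_mtianyan(value):
    return value+"-mtianyan"

title = scrapy.Field(
    input_processor=MapCompose(lambda x:x+"mtianyan",add_mtianyan),
)

Note: a custom function must be defined before the Field that references it.

    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert),
        output_processor=TakeFirst()
    )

TakeFirst keeps only the first value in the list.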
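
To show what these two processors do to a field's values, here is a simplified pure-Python stand-in. It is not Scrapy's code: `map_compose` and `take_first` are hypothetical names, and the real MapCompose does more (for example, it flattens iterables returned by the functions).

```python
def map_compose(*functions):
    """Simplified stand-in for MapCompose: run each value through every function in order."""
    def processor(values):
        for func in functions:
            next_values = []
            for value in values:
                result = func(value)
                # A function that returns None drops the value.
                if result is not None:
                    next_values.append(result)
            values = next_values
        return values
    return processor


def take_first(values):
    """Simplified stand-in for TakeFirst: return the first non-empty value."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


def add_mtianyan(value):
    return value + "-mtianyan"


title_in = map_compose(lambda x: x + "mtianyan", add_mtianyan)
processed = title_in(["Hello", "World"])
print(processed)
print(take_first(processed))
```

The input processor runs on every extracted value, while the output processor runs once on the collected list, which is why TakeFirst belongs on the output side.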

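A custom ItemLoader that takes the first value by default

class ArticleItemLoader(ItemLoader):
    # Custom ItemLoader whose default output processor takes the first value
    default_output_processor = TakeFirst()

Keeping a list value as-is

The image pipeline expects a list of URLs, so front_image_url overrides the TakeFirst default with a processor that returns the values unchanged:

def return_value(value):
    return value

front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )

Adding a check to the image pipeline to make it more general

class ArticleImagePipeline(ImagesPipeline):
    # Override item_completed to read the actual saved path of the image from results
    def item_completed(self, results, item, info):
        if "front_image_url" in item:
            for ok, value in results:
                if ok:
                    item["front_image_path"] = value["path"]

        return item

Complete item code with processing functions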

class JobBoleArticleItem(scrapy.Item):
    title = scrapy.Field()
    create_date = scrapy.Field(
        input_processor=MapCompose(date_convert),
    )
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    front_image_url = scrapy.Field(
        output_processor=MapCompose(return_value)
    )
    front_image_path = scrapy.Field()
    praise_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    comment_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    fav_nums = scrapy.Field(
        input_processor=MapCompose(get_nums)
    )
    # tags is itself a list, so override the default output processor
    tags = scrapy.Field(
        input_processor=MapCompose(remove_comment_tags),
        output_processor=Join(",")
    )
    content = scrapy.Field()
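
The item above references date_convert, get_nums, and remove_comment_tags without showing them. The following is a minimal sketch of what they might look like, assuming they mirror the inline extraction logic used earlier in parse. The function bodies are assumptions for illustration, not the author's code.

```python
import datetime
import re


def date_convert(value):
    """Parse 'YYYY/MM/DD' (ignoring '·' and whitespace); fall back to today's date."""
    cleaned = value.strip().replace("·", "").strip()
    try:
        return datetime.datetime.strptime(cleaned, "%Y/%m/%d").date()
    except ValueError:
        return datetime.datetime.now().date()


def get_nums(value):
    """Return the first integer found in the text, or 0 when there is none."""
    match_re = re.match(r".*?(\d+).*", value)
    return int(match_re.group(1)) if match_re else 0


def remove_comment_tags(value):
    """Drop the comment-count link; returning None makes MapCompose skip the value."""
    if value.strip().endswith("评论"):
        return None
    return value


print(date_convert(" 2017/05/23 · "))
print(get_nums(" 2 收藏"))
print(remove_comment_tags("3 评论"))
```

Each helper takes a single value rather than a list, because MapCompose calls it once per extracted value.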

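Part 3. Crawling Zhihu questions and answers

1. Background

Sessions and cookies

Cookies:
A storage mechanism supported by the browser
Key-value pairs

HTTP is stateless: two requests have no inherent connection to each other.

How a session works (the function names below are PHP's)

(1) When a session is first started, a unique identifier is stored in a cookie on the client.

(2) session_start() then loads the previously stored session variables from the session store.

(3) session_register() registers session variables.

(4) When the script finishes, session variables that have not been destroyed are saved automatically to the session store at a configured local path.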
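
The steps above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any particular framework implements sessions: `SESSION_STORE`, `handle_request`, and the `session_id` cookie name are all hypothetical.

```python
import secrets

SESSION_STORE = {}  # server side: session id -> session variables


def handle_request(cookies, username=None):
    """Return (response_cookies, session) for one request."""
    session_id = cookies.get("session_id")
    if session_id not in SESSION_STORE:
        # First visit: create an id and send it back as a cookie.
        session_id = secrets.token_hex(16)
        SESSION_STORE[session_id] = {}
    session = SESSION_STORE[session_id]
    if username is not None:
        session["user"] = username
    return {"session_id": session_id}, session


# First request: no cookie yet; the "login" stores the user in the session.
cookies, _ = handle_request({}, username="mtianyan")
# Second request: only the cookie is sent, yet the server recognizes the user.
_, session = handle_request(cookies)
print(session["user"])
```

The cookie carries nothing but the identifier; the state itself stays on the server, which is how a stateless protocol ends up remembering a login.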

Simulating a Zhihu login with requests

HTTP status codes

Getting the xsrf token

def get_xsrf():
    # Fetch the xsrf code from the home page
    response = requests.get("https://www.zhihu.com", headers=header)
    # Sample of the tag being matched:
    # text = '<input type="hidden" name="_xsrf" value="ca70366e5de5d133c3ae09fb16d9b0fa"/>'
    # re.DOTALL lets .* run across line breaks in the page source
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response.text, re.DOTALL)
    if match_obj:
        return match_obj.group(1)
    else:
        return ""
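
One detail of this pattern is easy to miss: re.match anchors at the start of the string, and `.` does not match a newline by default, so on multi-line HTML the leading `.*` needs re.DOTALL to reach the tag. A small sketch with a made-up page (the token value is shortened from the sample above):

```python
import re

html = '<html>\n<body>\n<input type="hidden" name="_xsrf" value="ca70366e"/>\n</body>\n</html>'
pattern = '.*name="_xsrf" value="(.*?)"'

# Without the flag, .* stops at the first newline and the match fails.
without_flag = re.match(pattern, html)
# With re.DOTALL, .* can cross line breaks.
with_flag = re.match(pattern, html, re.DOTALL)
# re.search scans the whole string, so it needs no leading .* at all.
searched = re.search('name="_xsrf" value="(.*?)"', html)

print(without_flag)
print(with_flag.group(1))
print(searched.group(1))
```

Either re.DOTALL or re.search works; the former keeps the tutorial's pattern intact, the latter is the shorter alternative.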

Python code that simulates the Zhihu login:

# _*_ coding: utf-8 _*_

import re

import requests
try:
    import cookielib  # Python 2
except ImportError:
    import http.cookiejar as cookielib  # Python 3

__author__ = 'mtianyan'
__date__ = '2017/5/23 16:42'

session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except Exception:
    print("Failed to load cookies")

agent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36"
header = {
    "HOST":"www.zhihu.com",
    "Referer": "https://www.zhihu.com",
    'User-Agent': agent
}

def is_login():
    # Use the status code returned by a login-only page to tell whether we are logged in
    inbox_url = "https://www.zhihu.com/question/56250357/answer/148534773"
    response = session.get(inbox_url, headers=header, allow_redirects=False)
    return response.status_code == 200

def get_xsrf():
    # Fetch the xsrf code
    response = session.get("https://www.zhihu.com", headers=header)
    response_text = response.text
    # re.DOTALL makes . match newlines, so the pattern can span the whole page
    match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
    xsrf = ''
    if match_obj:
        xsrf = match_obj.group(1)
    return xsrf


def get_index():
    response = session.get("https://www.zhihu.com", headers=header)
    with open("index_page.html", "wb") as f:
        f.write(response.text.encode("utf-8"))
    print("ok")

def get_captcha():
    import time
    t = str(int(time.time()*1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
    captcha_response = session.get(captcha_url, headers=header)
    with open("captcha.jpg","wb") as f:
        f.write(captcha_response.content)

    from PIL import Image
    try:
        im = Image.open('captcha.jpg')
        im.show()
        im.close()
    except Exception:
        pass

    captcha = input("Enter the captcha\n>")
    return captcha

def zhihu_login(account, password):
    # Log in to Zhihu
    if re.match(r"^1\d{10}", account):
        print("Logging in with a phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha": get_captcha()
        }
    elif "@" in account:
        # The account is an email address
        print("Logging in with an email address")
        post_url = "https://www.zhihu.com/login/email"
        post_data = {
            "_xsrf": get_xsrf(),
            "email": account,
            "password": password
        }
    else:
        print("The account must be a phone number or an email address")
        return

    response_text = session.post(post_url, data=post_data, headers=header)
    session.cookies.save()

# get_index()
# is_login()
# get_captcha()
zhihu_login("phone", "password")

2. Creating the Zhihu spider and logging in with Scrapy

scrapy genspider zhihu www.zhihu.com

Zhihu requires a login first, so we override the spider's start_requests:

    def start_requests(self):
        return [scrapy.Request('https://www.zhihu.com/#signin', headers=self.headers, callback=self.login)]

  1. Download the home page, then call back into login.
  2. login requests the captcha and sets login_after_captcha as the callback. post_data travels in meta so that the later callback can use it.
    def login(self, response):
        response_text = response.text
        # Get the xsrf token
        match_obj = re.match('.*name="_xsrf" value="(.*?)"', response_text, re.DOTALL)
        xsrf = ''
        if match_obj:
            xsrf = (match_obj.group(1))

        if xsrf:
            post_url = "https://www.zhihu.com/login/phone_num"
            post_data = {
                "_xsrf": xsrf,
                "phone_num": "phone",
                "password": "password",
                "captcha": ""
            }

            import time
            t = str(int(time.time() * 1000))
            captcha_url = "https://www.zhihu.com/captcha.gif?r={0}&type=login".format(t)
            # Request the captcha; login_after_captcha is the callback
            yield scrapy.Request(captcha_url, headers=self.headers,
                meta={"post_data":post_data}, callback=self.login_after_captcha)
  3. login_after_captcha saves the captcha image locally and opens it with PIL; after reading it by eye, type the captcha into the console.
    It then picks up the meta data from the previous step and submits everything to the login endpoint, with check_login as the callback that verifies the result.
    def login_after_captcha(self, response):
        with open("captcha.jpg", "wb") as f:
            f.write(response.body)

        from PIL import Image
        try:
            im = Image.open('captcha.jpg')
            im.show()
            im.close()
        except Exception:
            pass

        captcha = input("Enter the captcha\n>")

        post_data = response.meta.get("post_data", {})
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data["captcha"] = captcha
        return [scrapy.FormRequest(
            url=post_url,
            formdata=post_data,
            headers=self.headers,
            callback=self.check_login
        )]
  4. check_login inspects the server's response to decide whether the login succeeded.
    Scrapy deduplicates request URLs (RFPDupeFilter); passing dont_filter=True exempts a request from deduplication.
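
The idea behind request deduplication and the dont_filter escape hatch can be sketched as follows. `SimpleDupeFilter` is a hypothetical toy class, not Scrapy's RFPDupeFilter, which fingerprints more than the bare URL.

```python
import hashlib


class SimpleDupeFilter:
    """Toy duplicate filter: remembers a SHA-1 fingerprint per URL (illustrative only)."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        if dont_filter:
            # The caller asked to bypass deduplication entirely.
            return True
        fingerprint = hashlib.sha1(url.encode("utf-8")).hexdigest()
        if fingerprint in self.seen:
            return False
        self.seen.add(fingerprint)
        return True


dupefilter = SimpleDupeFilter()
first = dupefilter.should_schedule("https://www.zhihu.com/")
second = dupefilter.should_schedule("https://www.zhihu.com/")
forced = dupefilter.should_schedule("https://www.zhihu.com/", dont_filter=True)
print(first, second, forced)
```

This is why the start URLs are re-requested with dont_filter=True after login: the home page was already fetched once during the login flow, so an ordinary request for it would be discarded as a duplicate.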

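start_requests as it appears in the Scrapy source:

    def start_requests(self):
        for url in self.start_urls:
            yield self.make_requests_from_url(url)

We move the work of the original start_requests into check_login, the last callback in our overridden chain:

    def check_login(self, response):
        # Inspect the server's response to decide whether the login succeeded
        text_json = json.loads(response.text)
        # "登录成功" ("login succeeded") is the message text returned by the server
        if "msg" in text_json and text_json["msg"] == "登录成功":
            for url in self.start_urls:
                yield scrapy.Request(url, dont_filter=True, headers=self.headers)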

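Figure: flow of the login code.

3. Designing the Zhihu tables

Figure: Zhihu answer page, version 1.

Figure: Zhihu answer page, version 2.

Table columns

| Question column | Answer column |
| --- | --- |
| zhihu_id | zhihu_id |
| topics | url |
| url | question_id |
| title | author_id |
| content | content |
| answer_num | parise_num |
| comments_num | comments_num |
| watch_user_num | create_time |
| click_num | update_time |
| crawl_time | crawl_time |

Figure: the Zhihu question table.

Figure: the Zhihu answer table.

Analyzing Zhihu URLs

On a question page, click "View more" to load further answers.
This reveals the API endpoint:

www.zhihu.com/api/v4/ques…

Key parameters:
offset=43
is_end = true
next
These come back in the response to the "View more" request.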
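
The paging fields suggest a simple loop: read a page, emit its data, and follow next until is_end is true. The sketch below uses made-up in-memory pages instead of real HTTP calls; `FAKE_PAGES`, `fetch`, and `iter_answers` are hypothetical names, and only the `paging`/`data` shape is taken from the fields discussed here.

```python
# Fake API pages keyed by URL, shaped like the response described above (made-up data).
FAKE_PAGES = {
    "answers?offset=0": {
        "paging": {"is_end": False, "next": "answers?offset=2"},
        "data": [{"id": 1}, {"id": 2}],
    },
    "answers?offset=2": {
        "paging": {"is_end": True, "next": "answers?offset=4"},
        "data": [{"id": 3}],
    },
}


def fetch(url):
    """Stand-in for an HTTP GET that returns the decoded JSON body."""
    return FAKE_PAGES[url]


def iter_answers(start_url):
    """Yield every answer, following paging.next until paging.is_end is true."""
    url = start_url
    while True:
        page = fetch(url)
        yield from page["data"]
        if page["paging"]["is_end"]:
            break
        url = page["paging"]["next"]


answer_ids = [answer["id"] for answer in iter_answers("answers?offset=0")]
print(answer_ids)
```

In the spider the same loop is expressed with callbacks: parse_answer yields the items of one page and then yields a Request for next with itself as the callback.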

href="/question/25460323"

all_urls = [parse.urljoin(response.url, url) for url in all_urls]
  1. Collect every <a> tag from the home page. If an extracted URL has the form /question/xxx, download it and pass it straight to parse_question.
    Otherwise, keep following the URL.
def parse(self, response):
    """
    Extract every URL on the page and follow each one for further crawling.
    If an extracted URL has the form /question/xxx, download it and pass it to the parsing function.
    """
    all_urls = response.css("a::attr(href)").extract()
    all_urls = [parse.urljoin(response.url, url) for url in all_urls]
    # Keep only the URLs that start with https
    all_urls = filter(lambda x: x.startswith("https"), all_urls)
    for url in all_urls:
        match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
        if match_obj:
            # A question page: download it and hand it to the extraction function
            request_url = match_obj.group(1)
            yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
        else:
            # Not a question page: keep following it
            yield scrapy.Request(url, headers=self.headers, callback=self.parse)
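
The regular expression that separates question pages from everything else can be tried on its own. In this sketch `classify` is a hypothetical helper, and the `/people/someone` URL is a made-up non-question example.

```python
import re

QUESTION_URL = re.compile(r"(.*zhihu.com/question/(\d+))(/|$).*")


def classify(url):
    """Return (request_url, question_id) for question pages, else None."""
    match_obj = QUESTION_URL.match(url)
    if match_obj:
        return match_obj.group(1), int(match_obj.group(2))
    return None


print(classify("https://www.zhihu.com/question/25460323/answer/148534773"))
print(classify("https://www.zhihu.com/question/25460323"))
print(classify("https://www.zhihu.com/people/someone"))
```

Group 1 trims an answer URL back to its question URL, so every answer link found on a page leads to a single request for the question itself.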
  2. Process the page in parse_question.
    Create our items.

A helper that the items rely on, in ArticleSpider\utils\common.py:

def extract_num(text):
    # Extract the first number from a string
    match_re = re.match(r".*?(\d+).*", text)
    if match_re:
        nums = int(match_re.group(1))
    else:
        nums = 0

    return nums

In settings.py:
SQL_DATETIME_FORMAT = "%Y-%m-%d %H:%M:%S"
SQL_DATE_FORMAT = "%Y-%m-%d"
Usage:

from ArticleSpider.settings import SQL_DATETIME_FORMAT

The Zhihu question item

class ZhihuQuestionItem(scrapy.Item):
    # Item for a Zhihu question
    zhihu_id = scrapy.Field()
    topics = scrapy.Field()
    url = scrapy.Field()
    title = scrapy.Field()
    content = scrapy.Field()
    answer_num = scrapy.Field()
    comments_num = scrapy.Field()
    watch_user_num = scrapy.Field()
    click_num = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # SQL statement for inserting into the zhihu_question table
        insert_sql = """
            insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
              watch_user_num, click_num, crawl_time
              )
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), answer_num=VALUES(answer_num), comments_num=VALUES(comments_num),
              watch_user_num=VALUES(watch_user_num), click_num=VALUES(click_num)
        """
        # The item loader stores every field as a list, so unpack or join each one
        zhihu_id = self["zhihu_id"][0]
        topics = ",".join(self["topics"])
        url = self["url"][0]
        title = "".join(self["title"])
        content = "".join(self["content"])
        answer_num = extract_num("".join(self["answer_num"]))
        comments_num = extract_num("".join(self["comments_num"]))

        # The same selector yields the follower count and, when present, the view count
        if len(self["watch_user_num"]) == 2:
            watch_user_num = int(self["watch_user_num"][0])
            click_num = int(self["watch_user_num"][1])
        else:
            watch_user_num = int(self["watch_user_num"][0])
            click_num = 0

        crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)

        params = (zhihu_id, topics, url, title, content, answer_num, comments_num,
                  watch_user_num, click_num, crawl_time)

        return insert_sql, params

The Zhihu answer item

class ZhihuAnswerItem(scrapy.Item):
    # Item for an answer to a Zhihu question
    zhihu_id = scrapy.Field()
    url = scrapy.Field()
    question_id = scrapy.Field()
    author_id = scrapy.Field()
    content = scrapy.Field()
    parise_num = scrapy.Field()
    comments_num = scrapy.Field()
    create_time = scrapy.Field()
    update_time = scrapy.Field()
    crawl_time = scrapy.Field()

    def get_insert_sql(self):
        # SQL statement for inserting into the zhihu_answer table
        insert_sql = """
            insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
              create_time, update_time, crawl_time
              ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
              ON DUPLICATE KEY UPDATE content=VALUES(content), comments_num=VALUES(comments_num), parise_num=VALUES(parise_num),
              update_time=VALUES(update_time)
        """

        # The API returns Unix timestamps; convert them to SQL datetime strings
        create_time = datetime.datetime.fromtimestamp(self["create_time"]).strftime(SQL_DATETIME_FORMAT)
        update_time = datetime.datetime.fromtimestamp(self["update_time"]).strftime(SQL_DATETIME_FORMAT)
        params = (
            self["zhihu_id"], self["url"], self["question_id"],
            self["author_id"], self["content"], self["parise_num"],
            self["comments_num"], create_time, update_time,
            self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params

With both items defined, we continue with the spider logic:

    def parse_question(self, response):
        # Process a question page and extract the question item from it
        if "QuestionHeader-title" in response.text:
            # New page layout
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
            if match_obj:
                question_id = int(match_obj.group(2))

            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            item_loader.add_css("title", "h1.QuestionHeader-title::text")
            item_loader.add_css("content", ".QuestionHeader-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", ".List-headerText span::text")
            item_loader.add_css("comments_num", ".QuestionHeader-actions button::text")
            item_loader.add_css("watch_user_num", ".NumberBoard-value::text")
            item_loader.add_css("topics", ".QuestionHeader-topics .Popover div::text")

            question_item = item_loader.load_item()
        else:
            # Old page layout
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
            if match_obj:
                question_id = int(match_obj.group(2))

            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            # item_loader.add_css("title", ".zh-question-title h2 a::text")
            item_loader.add_xpath("title", "//*[@id='zh-question-title']/h2/a/text()|//*[@id='zh-question-title']/h2/span/text()")
            item_loader.add_css("content", "#zh-question-detail")
            item_loader.add_value("url", response.url)
            item_loader.add_value("zhihu_id", question_id)
            item_loader.add_css("answer_num", "#zh-question-answer-num::text")
            item_loader.add_css("comments_num", "#zh-question-meta-wrap a[name='addcomment']::text")
            # item_loader.add_css("watch_user_num", "#zh-question-side-header-wrap::text")
            item_loader.add_xpath("watch_user_num", "//*[@id='zh-question-side-header-wrap']/text()|//*[@class='zh-question-followers-sidebar']/div/a/strong/text()")
            item_loader.add_css("topics", ".zm-tag-editor-labels a::text")

            question_item = item_loader.load_item()

        # Request the first page of answers through the API, then emit the question item
        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
        yield question_item

Processing the answers and extracting the fields we need

    def parse_answer(self, response):
        # Process the answers to a question
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]

        # Extract the fields of each answer
        for answer in ans_json["data"]:
            answer_item = ZhihuAnswerItem()
            answer_item["zhihu_id"] = answer["id"]
            answer_item["url"] = answer["url"]
            answer_item["question_id"] = answer["question"]["id"]
            answer_item["author_id"] = answer["author"]["id"] if "id" in answer["author"] else None
            answer_item["content"] = answer["content"] if "content" in answer else None
            answer_item["parise_num"] = answer["voteup_count"]
            answer_item["comments_num"] = answer["comment_count"]
            answer_item["create_time"] = answer["created_time"]
            answer_item["update_time"] = answer["updated_time"]
            answer_item["crawl_time"] = datetime.datetime.now()

            yield answer_item

        # Follow the next page until the API reports the end
        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

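Figure: flow chart of extracting Zhihu questions and answers.

Depth-first:

  1. Extract every URL on the page and filter out the ones we do not need.
  2. If a URL is a question URL, enter the question parser.
  3. Once that question has been fully crawled, return to the initial parser.

Writing items to the database

Error handling in pipelines.py
Errors raised during an insert can be observed through this method:

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)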

Making the pipeline more general
The original pipeline, with hard-coded SQL:

    def do_insert(self, cursor, item):
        # Perform the actual insert
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))

Rewritten:

    def do_insert(self, cursor, item):
        # Build the SQL statement that suits each item type and insert it into MySQL
        insert_sql, params = item.get_insert_sql()
        cursor.execute(insert_sql, params)
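
The design can be exercised without a database. In the sketch below, `RecordingCursor`, `ArticleItem`, and `AnswerItem` are hypothetical stand-ins (plain dict subclasses rather than Scrapy items, and a cursor that only records calls); the point is that do_insert never inspects the item's type.

```python
class RecordingCursor:
    """Stand-in for a DB cursor: records what would be executed."""

    def __init__(self):
        self.executed = []

    def execute(self, sql, params):
        self.executed.append((sql, params))


class ArticleItem(dict):
    def get_insert_sql(self):
        sql = "insert into jobbole_article(title, url) VALUES (%s, %s)"
        return sql, (self["title"], self["url"])


class AnswerItem(dict):
    def get_insert_sql(self):
        sql = "insert into zhihu_answer(zhihu_id, content) VALUES (%s, %s)"
        return sql, (self["zhihu_id"], self["content"])


def do_insert(cursor, item):
    # The pipeline never checks the item's class; each item supplies its own SQL.
    insert_sql, params = item.get_insert_sql()
    cursor.execute(insert_sql, params)


cursor = RecordingCursor()
do_insert(cursor, ArticleItem(title="t", url="http://example.com/1"))
do_insert(cursor, AnswerItem(zhihu_id=42, content="c"))
for sql, params in cursor.executed:
    print(sql, params)
```

Adding a new site then means writing a new item with its own get_insert_sql, with no change to the pipeline.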

Option 1: branch on the item's class name

    if item.__class__.__name__ == "JobBoleArticleItem":
        # Perform the actual insert
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s)
        """
        cursor.execute(insert_sql, (item["title"], item["url"], item["create_date"], item["fav_nums"]))

Recommended approach:
Move the SQL statement into the item itself.
A method inside the JobBoleArticleItem class:

    def get_insert_sql(self):
        insert_sql = """
            insert into jobbole_article(title, url, create_date, fav_nums)
            VALUES (%s, %s, %s, %s) ON DUPLICATE KEY UPDATE fav_nums=VALUES(fav_nums)
        """
        params = (self["title"], self["url"], self["create_date"], self["fav_nums"])

        return insert_sql, params

Zhihu question:

    def get_insert_sql(self):
        # SQL statement for inserting into the zhihu_question table
        insert_sql = """
            insert into zhihu_question(zhihu_id, topics, url, title, content, answer_num, comments_num,
              watch_user_num, click_num, crawl_time
              )
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), answer_num=VALUES(answer_num), comments_num=VALUES(comments_num),
              watch_user_num=VALUES(watch_user_num), click_num=VALUES(click_num)
        """
        zhihu_id = self["zhihu_id"][0]
        topics = ",".join(self["topics"])
        url = self["url"][0]
        title = "".join(self["title"])
        content = "".join(self["content"])
        answer_num = extract_num("".join(self["answer_num"]))
        comments_num = extract_num("".join(self["comments_num"]))

        if len(self["watch_user_num"]) == 2:
            watch_user_num = int(self["watch_user_num"][0])
            click_num = int(self["watch_user_num"][1])
        else:
            watch_user_num = int(self["watch_user_num"][0])
            click_num = 0

        crawl_time = datetime.datetime.now().strftime(SQL_DATETIME_FORMAT)

        params = (zhihu_id, topics, url, title, content, answer_num, comments_num,
                  watch_user_num, click_num, crawl_time)

        return insert_sql, params

Zhihu answer:

    def get_insert_sql(self):
        # SQL statement for inserting into the zhihu_answer table
        insert_sql = """
            insert into zhihu_answer(zhihu_id, url, question_id, author_id, content, parise_num, comments_num,
              create_time, update_time, crawl_time
              ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE content=VALUES(content), comments_num=VALUES(comments_num), parise_num=VALUES(parise_num),
              update_time=VALUES(update_time)
        """

        # The API returns Unix timestamps; convert them to SQL datetime strings
        create_time = datetime.datetime.fromtimestamp(self["create_time"]).strftime(SQL_DATETIME_FORMAT)
        update_time = datetime.datetime.fromtimestamp(self["update_time"]).strftime(SQL_DATETIME_FORMAT)
        params = (
            self["zhihu_id"], self["url"], self["question_id"],
            self["author_id"], self["content"], self["parise_num"],
            self["comments_num"], create_time, update_time,
            self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params

When the same record is crawled a second time, update the existing row instead of inserting a duplicate:

ON DUPLICATE KEY UPDATE content=VALUES(content), answer_num=VALUES(answer_num), comments_num=VALUES(comments_num),
 watch_user_num=VALUES(watch_user_num), click_num=VALUES(click_num)
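
To see the update-on-recrawl behaviour without a MySQL server, here is a minimal sketch that swaps in SQLite from the standard library. Note the substitution: SQLite spells the same idea as ON CONFLICT ... DO UPDATE (available in SQLite 3.24 and later) rather than MySQL's ON DUPLICATE KEY UPDATE, and the three-column table is a simplified assumption, not the project's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE zhihu_question (zhihu_id INTEGER PRIMARY KEY, title TEXT, answer_num INTEGER)"
)

# SQLite's counterpart of MySQL's ON DUPLICATE KEY UPDATE:
# on a primary-key clash, only answer_num is refreshed.
UPSERT_SQL = """
INSERT INTO zhihu_question(zhihu_id, title, answer_num)
VALUES (?, ?, ?)
ON CONFLICT(zhihu_id) DO UPDATE SET answer_num = excluded.answer_num
"""


def save_question(zhihu_id, title, answer_num):
    conn.execute(UPSERT_SQL, (zhihu_id, title, answer_num))
    conn.commit()


save_question(1, "first crawl", 10)   # inserts a new row
save_question(1, "second crawl", 25)  # same key: updates answer_num only
rows = conn.execute("SELECT zhihu_id, title, answer_num FROM zhihu_question").fetchall()
```

Only the columns named in the update clause change, which is why the title from the first crawl survives.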

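Debugging tips

            if match_obj:
                # If this is a question page, download it and hand it to the extraction callback
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
                # For debugging: stop after the first question
                break
            else:
                # For debugging: do nothing here
                pass
                # If this is not a question page, keep following links
                # (commented out while debugging)
                # yield scrapy.Request(url, headers=self.headers, callback=self.parse)
    # For debugging: do not emit the item yet
        # yield question_item

Troubleshooting
[key error] title
Debug inside the pipeline to find out which item triggers the error.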

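IV. Crawling the entire Lagou job site with CrawlSpider

Recommended tool: cmder
cmder.net/
Download the full version, which makes some Linux commands available on Windows.
Add it to the PATH environment variable.

1. Designing the Lagou table schema

[Figure: Lagou database table design]

2. Initializing the Lagou spider and reading the crawl source code

scrapy genspider --list
Lists the spider templates that can be used:
Available templates:

  • basic
  • crawl
  • csvfeed
  • xmlfeed
scrapy genspider -t crawl lagou www.lagou.com

Unlike PyCharm, where the directory can be marked as a sources root, the command line does not know the project root,
so the directory has to be set in settings.py.

The crawl template: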

class LagouSpider(CrawlSpider):
    name = 'lagou'
    allowed_domains = ['www.lagou.com']
    start_urls = ['http://www.lagou.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

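Reading the source code
doc.scrapy.org/en/1.3/topi…

CrawlSpider provides a set of rules for following links, so a site can be crawled iteratively with little code.

rules:

The rules that CrawlSpider reads and executes.

parse_start_url(response):

A method that can be overridden to handle the responses for start_urls.

example:

rules is an iterable of Rule instances; each Rule wraps a LinkExtractor that decides which links to extract.
allow=('category\.php', ), callback='parse_item',
allow holds the URL patterns to accept; callback is the name of the method to call back.
The callback is given as a string because rules is defined at class level, where there is no self to reference the bound method.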

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
        item = scrapy.Item()
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

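Analyzing the Lagou template code

  1. Change http to https.
  2. Rename parse_item to our own parse_job.
  3. Click CrawlSpider in class LagouSpider(CrawlSpider): to jump into the crawl source.
  4. class CrawlSpider(Spider): shows that it inherits from Spider.
  5. Entry point: def start_requests(self):
  6. Alt + Left/Right arrow jumps back and forth between code locations.
  7. After step 5 the default callback is parse. CrawlSpider defines its own parse method, so unlike before we must not override it.

The core function in crawl.py is parse.

parse calls _parse_response: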

 def parse(self, response):
        return self._parse_response(response, self.parse_start_url, cb_kwargs={}, follow=True)

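_parse_response

  1. Checks whether a callback was passed; when called from parse, the callback is self.parse_start_url.
  2. We can override parse_start_url to add our own handling.
  3. Passes the arguments to the callback, then calls process_results on its return value.

The _parse_response function:

    def _parse_response(self, response, callback, cb_kwargs, follow=True):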
        if callback:
            cb_res = callback(response, **cb_kwargs) or ()
            cb_res = self.process_results(response, cb_res)
            for requests_or_item in iterate_spider_output(cb_res):
                yield requests_or_item

        if follow and self._follow_links:
            for request_or_item in self._requests_to_follow(response):
                yield request_or_item

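The return value of parse_start_url is passed to process_results.
If neither is overridden, parse_start_url returns an empty result and process_results passes it through unchanged, so nothing happens.

    def process_results(self, response, results):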
        return results

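Jumping to the definition of _follow_links:

    def set_crawler(self, crawler):
        super(CrawlSpider, self).set_crawler(crawler)
        self._follow_links = crawler.settings.getbool('CRAWLSPIDER_FOLLOW_LINKS', True)

If CRAWLSPIDER_FOLLOW_LINKS is true in the settings (the default), execution continues into link following.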

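_requests_to_follow

  1. Checks whether the argument is an HtmlResponse; if not, returns immediately.
  2. Creates an empty set for the current response, used for deduplication.
  3. Iterates over self._rules with enumerate.
  4. Jump to the Rule details.
  5. Uses link_extractor.extract_links to extract the concrete links.
  6. Runs our process_links, if one is defined.
  7. Builds a Request for each link, with _response_downloaded as the callback.
  8. _response_downloaded then calls _parse_response.

    def _requests_to_follow(self, response):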
        if not isinstance(response, HtmlResponse):
            return
        seen = set()
        for n, rule in enumerate(self._rules):
            links = [lnk for lnk in rule.link_extractor.extract_links(response)
                     if lnk not in seen]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)

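_compile_rules

  1. _compile_rules is called when the spider is initialized.
  2. [copy.copy(r) for r in self.rules] makes a copy of each of our rules.
  3. get_method resolves the callback: a callable is returned as-is, a string is looked up on the spider.
  4. Resolves the process_links defined in our rules.
  5. Resolves the process_request defined in our rules.

    def _compile_rules(self):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, six.string_types):
                return getattr(self, method, None)

        self._rules = [copy.copy(r) for r in self.rules]
        for rule in self._rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)

    # From Rule.__init__: the hooks passed to Rule(...) are stored on the rule
    self.process_links = process_links
    self.process_request = process_request

By passing our own handler functions into the rules, we can customize how URLs are processed,
for example to balance load or to send requests through IPs in different locations.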
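
The name-to-method resolution can be sketched in isolation. MiniSpider below is an assumed, simplified stand-in (rules are plain dicts, and there is no Scrapy import); it only illustrates why a callback can be written as a string at class level and still end up as a bound method.

```python
class MiniSpider:
    # Class-level rules: no `self` exists here, so methods are named by string.
    rules = [
        {"callback": "parse_job", "process_links": None},
        {"callback": None, "process_links": "drop_logout_links"},
    ]

    def __init__(self):
        # Copy the rules so the class-level definitions stay untouched.
        self._rules = [dict(rule) for rule in self.rules]
        for rule in self._rules:
            rule["callback"] = self._get_method(rule["callback"])
            rule["process_links"] = self._get_method(rule["process_links"])

    def _get_method(self, method):
        if callable(method):
            return method
        if isinstance(method, str):
            # Look the name up on the instance; unknown names resolve to None.
            return getattr(self, method, None)
        return None

    def parse_job(self, url):
        return "parsed " + url

    def drop_logout_links(self, links):
        return [link for link in links if "logout" not in link]


spider = MiniSpider()
first_rule, second_rule = spider._rules
```

After construction, first_rule["callback"] is a bound method of the instance, while the class-level rules still hold the original strings.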

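_response_downloaded
Looks up the concrete rule through the rule index stored in response.meta,
then calls our own callback.

    def _response_downloaded(self, response):
        rule = self._rules[response.meta['rule']]
        return self._parse_response(response, rule.callback, rule.cb_kwargs, rule.follow)

LinkExtractor parameters:

  • allow: URLs matching these patterns are crawled.
  • deny: URLs matching these patterns are skipped.
  • allow_domains: only links in these domains are processed.
  • deny_domains: links in these domains are not processed.
  • restrict_xpaths: restricts extraction to the regions selected by these XPaths.
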
self, allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(),
                 tags=('a', 'area'), attrs=('href',), canonicalize=True,
                 unique=True, process_value=None, deny_extensions=None, restrict_css=()

extract_links
If restrict_xpaths is set, extraction is limited to the sub-documents those XPaths select.

    def extract_links(self, response):
        base_url = get_base_url(response)
        if self.restrict_xpaths:
            docs = [subdoc
                    for x in self.restrict_xpaths
                    for subdoc in response.xpath(x)]
        else:
            docs = [response.selector]
        all_links = []
        for doc in docs:
            links = self._extract_links(doc, response.url, response.encoding, base_url)
            all_links.extend(self._process_links(links))
        return unique_list(all_links)

get_base_url:

urllib.parse.urljoin builds the absolute URL for us.

def get_base_url(text, baseurl='', encoding='utf-8'):
    """Return the base url if declared in the given HTML `text`,
    relative to the given base url.

    If no base url is found, the given `baseurl` is returned.

    """

    text = to_unicode(text, encoding)
    m = _baseurl_re.search(text)
    if m:
        return moves.urllib.parse.urljoin(
            safe_url_string(baseurl),
            safe_url_string(m.group(1), encoding=encoding)
        )
    else:
        return safe_url_string(baseurl)

Writing the rules

    rules = (
        Rule(LinkExtractor(allow=(r"zhaopin/.*",)), follow=True),
        Rule(LinkExtractor(allow=(r"gongsi/j\d+\.html",)), follow=True),
        Rule(LinkExtractor(allow=r'jobs/\d+\.html'), callback='parse_job', follow=True),
    )
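
Before running a full crawl, it can help to check the patterns against a few URLs by hand. This is a minimal sketch using only the re module; it assumes the patterns are applied with search semantics (a match anywhere in the URL), and the sample URLs are made up for illustration.

```python
import re

# The three rule patterns, with the dot before "html" escaped.
RULE_PATTERNS = {
    "listing": r"zhaopin/.*",
    "company": r"gongsi/j\d+\.html",
    "job": r"jobs/\d+\.html",
}


def matching_rules(url):
    """Return the names of all patterns found somewhere in the URL."""
    return [name for name, pattern in RULE_PATTERNS.items() if re.search(pattern, url)]


# An unescaped dot matches any character, so it accepts more than intended.
loose = re.compile(r"jobs/\d+.html")
strict = re.compile(r"jobs/\d+\.html")
odd_url = "https://www.lagou.com/jobs/123xhtml"

listing_hit = matching_rules("https://www.lagou.com/zhaopin/Java/")
company_hit = matching_rules("https://www.lagou.com/gongsi/j123.html")
job_hit = matching_rules("https://www.lagou.com/jobs/2785439.html")
no_hit = matching_rules("https://www.lagou.com/about.html")
```

Only URLs matched by the job pattern reach parse_job; the other two patterns exist to keep the crawl discovering new pages.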

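3. Designing the Lagou items

Helper functions we need:

from w3lib.html import remove_tags

def remove_splash(value):
    # Strip the slashes from fields such as the job city
    return value.replace("/", "")

def handle_jobaddr(value):
    # Drop the "查看地图" ("view map") link text and join the remaining lines
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != "查看地图"]
    return "".join(addr_list)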
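
A quick way to check the two helpers is to call them on sample strings. The functions are repeated here so the sketch runs on its own; the input values are invented examples of what the page text might look like, not real scraped data.

```python
def remove_splash(value):
    # Strip the slashes from fields such as the job city
    return value.replace("/", "")


def handle_jobaddr(value):
    # Drop the "查看地图" ("view map") link text and join the remaining lines
    addr_list = value.split("\n")
    addr_list = [item.strip() for item in addr_list if item.strip() != "查看地图"]
    return "".join(addr_list)


city = remove_splash("/Beijing /")
address = handle_jobaddr("Beijing -\n Haidian District \n查看地图\n")
```

Here city keeps its trailing space ("Beijing "), because remove_splash only removes slashes; add a strip step if the whitespace matters.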

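The finished item:

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst, Join  # Scrapy 1.x import path

class LagouJobItem(scrapy.Item):
    # Lagou job posting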
    title = scrapy.Field()
    url = scrapy.Field()
    url_object_id = scrapy.Field()
    salary = scrapy.Field()
    job_city = scrapy.Field(
        input_processor=MapCompose(remove_splash),
    )
    work_years = scrapy.Field(
        input_processor = MapCompose(remove_splash),
    )
    degree_need = scrapy.Field(
        input_processor = MapCompose(remove_splash),
    )
    job_type = scrapy.Field()
    publish_time = scrapy.Field()
    job_advantage = scrapy.Field()
    job_desc = scrapy.Field()
    job_addr = scrapy.Field(
        input_processor=MapCompose(remove_tags, handle_jobaddr),
    )
    company_name = scrapy.Field()
    company_url = scrapy.Field()
    tags = scrapy.Field(
        input_processor = Join(",")
    )
    crawl_time = scrapy.Field()


The custom ItemLoader
By default, keep only the first extracted value:

class LagouJobItemLoader(ItemLoader):
    # Custom item loader
    default_output_processor = TakeFirst()

4. Extracting the field values and saving them to the database

    def parse_job(self, response):

        # Parse a Lagou job posting
        item_loader = LagouJobItemLoader(item=LagouJobItem(), response=response)
        item_loader.add_css("title", ".job-name::attr(title)")
        item_loader.add_value("url", response.url)
        item_loader.add_value("url_object_id", get_md5(response.url))
        item_loader.add_css("salary", ".job_request .salary::text")
        item_loader.add_xpath("job_city", "//*[@class='job_request']/p/span[2]/text()")
        item_loader.add_xpath("work_years", "//*[@class='job_request']/p/span[3]/text()")
        item_loader.add_xpath("degree_need", "//*[@class='job_request']/p/span[4]/text()")
        item_loader.add_xpath("job_type", "//*[@class='job_request']/p/span[5]/text()")

        item_loader.add_css("tags", '.position-label li::text')
        item_loader.add_css("publish_time", ".publish_time::text")
        item_loader.add_css("job_advantage", ".job-advantage p::text")
        item_loader.add_css("job_desc", ".job_bt div")
        item_loader.add_css("job_addr", ".work_addr")
        item_loader.add_css("company_name", "#job_company dt a img::attr(alt)")
        item_loader.add_css("company_url", "#job_company dt a::attr(href)")
        item_loader.add_value("crawl_time", datetime.now())

        job_item = item_loader.load_item()

        return job_item

The resulting Lagou item data:
[Figure: Lagou item data]

5. Adding get_insert_sql to the item to save it to the database

    def get_insert_sql(self):
        insert_sql = """
            insert into lagou_job(title, url, url_object_id, salary, job_city, work_years, degree_need,
            job_type, publish_time, job_advantage, job_desc, job_addr, company_name, company_url,
            tags, crawl_time) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON DUPLICATE KEY UPDATE salary=VALUES(salary), job_desc=VALUES(job_desc)
        """
        # The parameter order must match the column order above
        params = (
            self["title"], self["url"], self["url_object_id"], self["salary"], self["job_city"],
            self["work_years"], self["degree_need"], self["job_type"],
            self["publish_time"], self["job_advantage"], self["job_desc"],
            self["job_addr"], self["company_name"], self["company_url"],
            self["tags"], self["crawl_time"].strftime(SQL_DATETIME_FORMAT),
        )

        return insert_sql, params

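V. Crawlers and anti-crawling

1. Basics

How do we keep our crawler from being banned?

Crawler:

A program that fetches data automatically; the key point is that it fetches in bulk.

Anti-crawling:

Technical measures used to block crawler programs.

False positives:

An anti-crawling technique that classifies ordinary users as crawlers cannot be used, no matter how effective it is.

Schools and internet cafes often share a single public egress IP, so banning by IP is not an option there.

IPs are also assigned dynamically: crawler A can get an address banned that is later handed to user B.

Cost:

The staff and machine cost of anti-crawling.

Interception:

The higher the interception rate, the higher the false-positive rate.

The goals of anti-crawling:

[Figure: goals of anti-crawling]

The contest between crawlers and anti-crawling:

[Figure: the crawler versus anti-crawling contest]

The browser's Inspect tool shows the price, but the price field is missing from View Page Source.
What Scrapy downloads is the raw page source.
Dynamic data filled in by JavaScript (Ajax) therefore cannot be obtained from the downloaded page alone.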

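2. Scrapy architecture and source code overview

[Figure: Scrapy component diagram]

[Figure: official Scrapy architecture diagram]

  1. The spider we write yields a Request, which is sent to the engine.
  2. The engine does nothing with it and passes it on to the scheduler.
  3. The scheduler hands the next Request back to the engine.
  4. The engine sends it through the downloader middlewares to the downloader.
  5. The downloader sends the Response back to the engine.
  6. The engine passes the Response to the spider.
  7. The spider processes it and yields items and further Requests.
  8. Items go to the item pipeline; Requests go back to step 2.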
path:articlespider3\Lib\site-packages\scrapy\core

  • engine.py:
  • scheduler.py
  • downloader
  • item
  • pipeline
  • spider

engine.py: the key function is schedule

  1. enqueue_request: puts the request into the scheduler.
  2. _next_request_from_scheduler: takes the next request from the scheduler.

    def schedule(self, request, spider):
        self.signals.send_catch_log(signal=signals.request_scheduled,
                request=request, spider=spider)
        if not self.slot.scheduler.enqueue_request(request):
            self.signals.send_catch_log(signal=signals.request_dropped,
                                        request=request, spider=spider)

articlespider3\Lib\site-packages\scrapy\core\downloader\handlers

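Supports file, FTP, and HTTP (including HTTPS) downloads.

Middleware we will customize later:

  • spider middleware
  • downloader middleware

Django and Scrapy are structured similarly.

3. Two important Scrapy classes: Request and Response

Request is similar to Django's HttpRequest.

yield Request(url=parse.urljoin(response.url, post_url))

Request parameters: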

class Request(object_ref):

    def __init__(self, url, callback=None, method='GET', headers=None, body=None,
                 cookies=None, meta=None, encoding='utf-8', priority=0,
                 dont_filter=False, errback=None):

cookies:
Lib\site-packages\scrapy\downloadermiddlewares\cookies.py

cookiejarkey = request.meta.get("cookiejar")

  • priority: the priority, which affects scheduling order.
  • dont_filter: when True, the request is exempt from duplicate filtering, so an identical request is not dropped.
  • errback: the callback invoked when the request fails.

doc.scrapy.org/en/1.2/topi…

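errback example:

import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError

class ErrbackSpider(scrapy.Spider):
    name = "errback_example"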
    start_urls = [
        "http://www.httpbin.org/",              # HTTP 200 expected
        "http://www.httpbin.org/status/404",    # Not found error
        "http://www.httpbin.org/status/500",    # server issue
        "http://www.httpbin.org:12345/",        # non-responding host, timeout expected
        "http://www.httphttpbinbin.org/",       # DNS error expected
    ]

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u, callback=self.parse_httpbin,
                                    errback=self.errback_httpbin,
                                    dont_filter=True)

    def parse_httpbin(self, response):
        self.logger.info('Got successful response from {}'.format(response.url))
        # do something useful here...

    def errback_httpbin(self, failure):
        # log all failures
        self.logger.error(repr(failure))

        # in case you want to do something special for some errors,
        # you may need the failure's type:

        if failure.check(HttpError):
            # these exceptions come from HttpError spider middleware
            # you can get the non-200 response
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)

        elif failure.check(DNSLookupError):
            # this is the original request
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)

        elif failure.check(TimeoutError, TCPTimedOutError):
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

The Response class

    def __init__(self, url, status=200, headers=None, body=b'', flags=None, request=None):
        self.headers = Headers(headers or {})

Response parameters:
request: the Request that was yielded; it is attached to the response so we know where the response came from.
4. Rotating the User-Agent by hand

  1. Define the list in settings:
user_agent_list = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.104 Safari/537.36',
]

Then use it in the code.

    from settings import user_agent_list
    import random
    # randint includes both endpoints, so the upper bound must be len - 1
    random_index = random.randint(0, len(user_agent_list) - 1)
    random_agent = user_agent_list[random_index]

    'User-Agent': random_agent
                import random
                random_index = random.randint(0, len(user_agent_list) - 1)
                random_agent = user_agent_list[random_index]
                self.headers["User-Agent"] = random_agent
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)

The problem: this has to be repeated before every request.
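
The selection step can be written a little more safely with random.choice, which avoids index arithmetic altogether. This is a minimal sketch; pick_user_agent and build_headers are hypothetical helper names, not part of the project.

```python
import random

user_agent_list = [
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:51.0) Gecko/20100101 Firefox/51.0",
    "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/53.0.2785.104 Safari/537.36",
]


def pick_user_agent():
    # random.choice never goes out of range, unlike a hand-computed index.
    return random.choice(user_agent_list)


def build_headers(base_headers):
    # Copy the headers so the shared dict is not mutated between requests.
    headers = dict(base_headers)
    headers["User-Agent"] = pick_user_agent()
    return headers


headers = build_headers({"Accept": "text/html"})
```

It still has to be called for every request, which is the limitation a downloader middleware removes.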

5. Configuring middleware and building a fake User-Agent pool

Uncomment DOWNLOADER_MIDDLEWARES:

DOWNLOADER_MIDDLEWARES = {
   'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
}

articlespider3\Lib\site-packages\scrapy\downloadermiddlewares\useragent.py

class UserAgentMiddleware(object):
    """This middleware allows spiders to override the user_agent"""

    def __init__(self, user_agent='Scrapy'):
        self.user_agent = user_agent

    @classmethod
    def from_crawler(cls, crawler):
        o = cls(crawler.settings['USER_AGENT'])
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        if self.user_agent:
            request.headers.setdefault(b'User-Agent', self.user_agent)

The key method is process_request.

Disable the built-in UserAgentMiddleware by setting it to None:

DOWNLOADER_MIDDLEWARES = {
   'ArticleSpider.middlewares.MyCustomDownloaderMiddleware': 543,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None
}

Using fake-useragent
pip install fake-useragent

In settings.py, set the random mode: RANDOM_UA_TYPE = "random"

from fake_useragent import UserAgent

class RandomUserAgentMiddlware(object):
    # Switch to a random User-Agent on every request
    def __init__(self, crawler):
        super(RandomUserAgentMiddlware, self).__init__()
        self.ua = UserAgent()
        self.ua_type = crawler.settings.get("RANDOM_UA_TYPE", "random")

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def process_request(self, request, spider):
        def get_ua():
            # e.g. self.ua.random when RANDOM_UA_TYPE is "random"
            return getattr(self.ua, self.ua_type)

        request.headers.setdefault('User-Agent', get_ua())

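6. Building an IP proxy pool from Xici proxies and saving it to a database

IPs can change dynamically, for example by restarting the router.

How an IP proxy works:

Instead of sending requests from our real IP, we go through an intermediary (a proxy server), so the target server never sees our IP and cannot ban it.
The middleware, which is enabled in settings.py:

class RandomProxyMiddleware(object):
    # Set the IP proxy dynamically
    def process_request(self, request, spider):
        request.meta["proxy"] = "http://111.198.219.151:8118"

Building the proxy pool from Xici and saving it to a database: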

# _*_ coding: utf-8 _*_
__author__ = 'mtianyan'
__date__ = '2017/5/24 16:27'
import requests
from scrapy.selector import Selector
import MySQLdb

conn = MySQLdb.connect(host="127.0.0.1", user="root", passwd="password", db="article_spider", charset="utf8")
cursor = conn.cursor()


def crawl_ips():
    # Crawl the free IP proxies listed on Xici
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; WOW64; rv:52.0) Gecko/20100101 Firefox/52.0"}
    for i in range(1568):
        # Named resp rather than re, so the standard re module is not shadowed
        resp = requests.get("http://www.xicidaili.com/nn/{0}".format(i), headers=headers)

        selector = Selector(text=resp.text)
        all_trs = selector.css("#ip_list tr")

        ip_list = []
        for tr in all_trs[1:]:
            # Default to 0.0 so speed is always defined, even if the title is missing
            speed = 0.0
            speed_str = tr.css(".bar::attr(title)").extract_first("")
            if speed_str:
                # The title looks like "0.5秒" ("0.5 seconds")
                speed = float(speed_str.split("秒")[0])
            all_texts = tr.css("td::text").extract()

            ip = all_texts[0]
            port = all_texts[1]
            proxy_type = all_texts[5]

            ip_list.append((ip, port, proxy_type, speed))

        for ip_info in ip_list:
            # Parameterized query: the driver escapes the values for us
            cursor.execute(
                "insert proxy_ip(ip, port, speed, proxy_type) VALUES(%s, %s, %s, 'HTTP')",
                (ip_info[0], ip_info[1], ip_info[3]),
            )

            conn.commit()


class GetIP(object):
    def delete_ip(self, ip):
        # Delete an invalid IP from the database
        delete_sql = "delete from proxy_ip where ip=%s"
        cursor.execute(delete_sql, (ip,))
        conn.commit()
        return True

    def judge_ip(self, ip, port):
        # Check whether the IP is usable
        http_url = "http://www.baidu.com"
        proxy_url = "http://{0}:{1}".format(ip, port)
        try:
            proxy_dict = {
                "http":proxy_url,
            }
            response = requests.get(http_url, proxies=proxy_dict)
        except Exception as e:
            print ("invalid ip and port")
            self.delete_ip(ip)
            return False
        else:
            code = response.status_code
            if code >= 200 and code < 300:
                print ("effective ip")
                return True
            else:
                print  ("invalid ip and port")
                self.delete_ip(ip)
                return False


    def get_random_ip(self):
        # Pick a random usable IP from the database
        random_sql = """
            SELECT ip, port FROM proxy_ip
            ORDER BY RAND()
            LIMIT 1
        """
        result = cursor.execute(random_sql)
        for ip_info in cursor.fetchall():
            ip = ip_info[0]
            port = ip_info[1]

            judge_re = self.judge_ip(ip, port)
            if judge_re:
                return "http://{0}:{1}".format(ip, port)
            else:
                return self.get_random_ip()



# print(crawl_ips())
if __name__ == "__main__":
    get_ip = GetIP()
    get_ip.get_random_ip()

Building an IP proxy pool with scrapy_proxies

pip install scrapy_proxies

Paid, but simple:
github.com/scrapy-plug…

Hiding behind Tor, or using a VPN
www.theonionrouter.com/

7. Recognizing captchas with Yundama

www.yundama.com/

# _*_ coding: utf-8 _*_
__author__ = 'mtianyan'
__date__ = '2017/6/24 16:48'

import json
import requests

class YDMHttp(object):
    apiurl = 'http://api.yundama.com/api.php'
    username = ''
    password = ''
    appid = ''
    appkey = ''

    def __init__(self, username, password, appid, appkey):
        self.username = username
        self.password = password
        self.appid = str(appid)
        self.appkey = appkey

    def balance(self):
        data = {'method': 'balance', 'username': self.username, 'password': self.password, 'appid': self.appid, 'appkey': self.appkey}
        response_data = requests.post(self.apiurl, data=data)
        ret_data = json.loads(response_data.text)
        if ret_data["ret"] == 0:
            print("Remaining balance:", ret_data["balance"])
            return ret_data["balance"]
        else:
            return None

    def login(self):
        data = {'method': 'login', 'username': self.username, 'password': self.password, 'appid': self.appid, 'appkey': self.appkey}
        response_data = requests.post(self.apiurl, data=data)
        ret_data = json.loads(response_data.text)
        if ret_data["ret"] == 0:
            print("Login succeeded:", ret_data["uid"])
            return ret_data["uid"]
        else:
            return None

    def decode(self, filename, codetype, timeout):
        data = {'method': 'upload', 'username': self.username, 'password': self.password, 'appid': self.appid, 'appkey': self.appkey, 'codetype': str(codetype), 'timeout': str(timeout)}
        files = {'file': open(filename, 'rb')}
        response_data = requests.post(self.apiurl, files=files, data=data)
        ret_data = json.loads(response_data.text)
        if ret_data["ret"] == 0:
            print("Recognition succeeded:", ret_data["text"])
            return ret_data["text"]
        else:
            return None

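def ydm(file_path):
    # Username
    username = ''
    # Password
    password = ''
    # Software ID, required for the developer revenue share. Get it from "My Software" in the developer console.
    appid = 0  # placeholder: replace with your own software ID
    # Software key, required for the developer revenue share. Get it from "My Software" in the developer console.
    appkey = ''
    # Captcha type. Example: 1004 means 4 alphanumeric characters. Pricing differs by type; fill it in accurately or the recognition rate suffers. All types are listed at http://www.yundama.com/price.html
    codetype = 5000
    # Timeout in seconds
    timeout = 60

    yundama = YDMHttp(username, password, appid, appkey)
    # Refuse to run until the credentials above have been filled in
    if not username:
        print('Please set the parameters before testing')
    else:
        # Start recognition: image path, captcha type ID, timeout (seconds); returns the recognized text
        return yundama.decode(file_path, codetype, timeout)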

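if __name__ == "__main__":
    # Username
    username = ''
    # Password
    password = ''
    # Software ID, required for the developer revenue share. Get it from "My Software" in the developer console.
    appid = 0  # placeholder: replace with your own software ID
    # Software key, required for the developer revenue share. Get it from "My Software" in the developer console.
    appkey = ''
    # Image file
    filename = 'image/captcha.jpg'
    # Captcha type. Example: 1004 means 4 alphanumeric characters. Pricing differs by type; fill it in accurately or the recognition rate suffers. All types are listed at http://www.yundama.com/price.html
    codetype = 5000
    # Timeout in seconds
    timeout = 60
    # Refuse to run until the credentials above have been filled in
    if not username:
        print('Please set the parameters before testing')
    else:
        # Initialize
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Query the balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Start recognition: image path, captcha type ID, timeout (seconds); returns the recognized text
        text = yundama.decode(filename, codetype, timeout)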


8. Disabling cookies and throttling the download speed

scrapy-chs.readthedocs.io/zh_CN/lates…

settings.py:

# Disable cookies (enabled by default)
COOKIES_ENABLED = False

Throttling the download speed:

# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5

Giving an individual spider its own settings:

    custom_settings = {
        "COOKIES_ENABLED": True
    }

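VI. Advanced Scrapy development

1. Scraping dynamic pages with Selenium

Selenium (browser automation testing framework)
Selenium is a tool for testing web applications. Selenium tests run directly in the browser, just as a real user would operate it. Supported browsers include IE (7, 8, 9, 10, 11), Mozilla Firefox, Safari, Google Chrome, and Opera. Its main uses are browser compatibility testing, which checks whether your application works well across different browsers and operating systems, and functional testing, which builds regression tests to verify software functionality and user requirements. It supports automatic recording of actions and automatic generation of test scripts in languages such as .NET, Java, and Perl.

[Figure: Selenium architecture]
Installation
pip install selenium

Documentation:
selenium-python.readthedocs.io/api.html
Install the WebDriver executable (for example chromedriver.exe).

Fetching a Tmall price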

from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Chrome(executable_path="C:/chromedriver.exe")

# Fetch the Tmall price
browser.get("https://detail.tmall.com/item.htm?spm=a230r.1.14.3.yYBVG6&id=538286972599&cm_id=140105335569ed55e27b&abbucket=15&sku_properties=10004:709990523;5919063:6536025")
t_selector = Selector(text=browser.page_source)
print(t_selector.css(".tm-price::text").extract())
# print(browser.page_source)
browser.quit()

Simulating a Zhihu login

from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Chrome(executable_path="C:/chromedriver.exe")
# Simulate a Zhihu login
browser.get("https://www.zhihu.com/#signin")

browser.find_element_by_css_selector(".view-signin input[name='account']").send_keys("phone")
browser.find_element_by_css_selector(".view-signin input[name='password']").send_keys("password")

browser.find_element_by_css_selector(".view-signin button.sign-button").click()

Simulating a Weibo login

Weibo open platform API

from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Chrome(executable_path="C:/chromedriver.exe")
# Log in to Weibo with Selenium
browser.get("http://weibo.com/")
import time
time.sleep(5)
browser.find_element_by_css_selector("#loginname").send_keys("1147727180@qq.com")
browser.find_element_by_css_selector(".info_list.password input[node-type='password'] ").send_keys("password")
browser.find_element_by_xpath('//*[@id="pl_login_form"]/div/div[3]/div[6]/a').click()

Scrolling the page down with JavaScript

from selenium import webdriver
from scrapy.selector import Selector

browser = webdriver.Chrome(executable_path="C:/chromedriver.exe")
# OSChina blogs
browser.get("https://www.oschina.net/blog")
import time
time.sleep(5)
for i in range(3):
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight); var lenOfPage=document.body.scrollHeight; return lenOfPage;")
    time.sleep(3)

Loading pages without images

from selenium import webdriver
from scrapy.selector import Selector

# Configure chromedriver not to load images
chrome_opt = webdriver.ChromeOptions()
prefs = {"profile.managed_default_content_settings.images": 2}
chrome_opt.add_experimental_option("prefs", prefs)

browser = webdriver.Chrome(executable_path="C:/chromedriver.exe",chrome_options=chrome_opt)
browser.get("https://www.oschina.net/blog")

Fetching the Tmall price with PhantomJS, a headless browser

# PhantomJS is a headless browser; its performance degrades severely when run in multiple processes

browser = webdriver.PhantomJS(executable_path="C:/phantomjs-2.1.1-windows/bin/phantomjs.exe")
browser.get("https://detail.tmall.com/item.htm?spm=a230r.1.14.3.yYBVG6&id=538286972599&cm_id=140105335569ed55e27b&abbucket=15&sku_properties=10004:709990523;5919063:6536025")
t_selector = Selector(text=browser.page_source)
print (t_selector.css(".tm-price::text").extract())
print (browser.page_source)
# browser.quit()

2. Integrating Selenium into Scrapy

How to integrate it

Create a middleware:

import time

from scrapy.http import HtmlResponse

class JSPageMiddleware(object):

    # Request dynamic pages through Chrome
    def process_request(self, request, spider):
        if spider.name == "jobbole":
            # Reuse the browser created in the spider's __init__ instead of launching a new Chrome per request
            spider.browser.get(request.url)
            time.sleep(3)
            print("Visiting: {0}".format(request.url))

            # Returning a response here skips the downloader for this request
            return HtmlResponse(url=spider.browser.current_url, body=spider.browser.page_source, encoding="utf-8", request=request)

Wiring Selenium into a specific spider

Signals:

dispatcher.connect maps a signal to a handler, that is, what to do when the spider closes.

from scrapy.xlib.pydispatch import dispatcher
from scrapy import signals
    # Using Selenium

    def __init__(self):
        self.browser = webdriver.Chrome(executable_path="D:/Temp/chromedriver.exe")
        super(JobboleSpider, self).__init__()
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider):
        # Quit Chrome when the spider exits
        print("spider closed")
        self.browser.quit()

Headless browsing in Python

pip install pyvirtualdisplay

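Usage on Linux:

from selenium import webdriver
from pyvirtualdisplay import Display
display = Display(visible=0, size=(800, 600))
display.start()

browser = webdriver.Chrome()
browser.get("https://www.oschina.net/blog")  # any page URL works here

If this fails with an error such as cmd=['xvfb', 'help'] and an OS error, install Xvfb:

sudo apt-get install xvfb

pip install xvfbwrapper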

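scrapy-splash:
Supports distributed use; less stable than Chrome.

github.com/scrapy-plug…

selenium grid
Supports distributed use.

splinter
github.com/cobrateam/s…

Pausing and resuming Scrapy

scrapy crawl lagou -s JOBDIR=job_info/001

Stopping the process from PyCharm kills it outright, like kill -9, so the crawl state cannot be saved.

A single Ctrl+C sends a signal that Scrapy can handle, so it shuts down gracefully.

Lib\site-packages\scrapy\dupefilters.py

First the URL is hashed into a fixed-length string,
then a set is used for deduplication.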
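
The hash-then-set idea can be sketched in a few lines. SeenFilter is a hypothetical class written for illustration; it hashes only the URL, whereas Scrapy's own request fingerprint covers more than the URL.

```python
import hashlib


class SeenFilter:
    """Deduplicate URLs by storing a fixed-length fingerprint of each one."""

    def __init__(self):
        self.fingerprints = set()

    @staticmethod
    def fingerprint(url):
        # SHA-1 always yields 40 hex characters, however long the URL is.
        return hashlib.sha1(url.encode("utf-8")).hexdigest()

    def request_seen(self, url):
        fp = self.fingerprint(url)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False


dupefilter = SeenFilter()
first_visit = dupefilter.request_seen("http://blog.jobbole.com/110287/")
second_visit = dupefilter.request_seen("http://blog.jobbole.com/110287/")
```

Storing fingerprints instead of full URLs keeps the memory cost per entry constant, which matters once the set holds millions of entries.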

telnet
Remote login

telnet localhost 6023 connects to the running spider.
The est() command shows the spider's current status.

spider.settings["COOKIES_ENABLED"]

Lib\site-packages\scrapy\extensions\telnet.py

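Data collection and stats collection
Scrapy provides a convenient mechanism for collecting stats. Stats are stored as key/value pairs, and the values are mostly counters. The mechanism is called the Stats Collector and is available through the stats attribute of the Crawler API. The documentation's section on common Stats Collector uses gives examples.

The Stats Collector is always available, whether stats collection is enabled or disabled, so you can always import it into your own module and use its API (to increment values or set new stat keys). The aim is to keep stats collection simple: collecting a stat in your spider, a Scrapy extension, or any other code that uses the Stats Collector should take no more than one line.

scrapy-chs.readthedocs.io/zh_CN/lates…

Stats collection with the Stats Collector

    # Collect every Jobbole URL that returns 404, along with the number of 404 pages
    handle_httpstatus_list = [404]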
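
How the 404 counting might fit together is sketched below without Scrapy. MiniStats is an assumed stand-in that mimics only inc_value and get_value; in a real spider the collector is reached through self.crawler.stats, and handle_httpstatus_list = [404] is what lets 404 responses reach the callback at all. Jobbole404Tracker, its parse signature and the "failed_url" key are hypothetical.

```python
from collections import defaultdict


class MiniStats:
    """Assumed stand-in for a stats collector; only inc_value/get_value are mimicked."""

    def __init__(self):
        self._stats = defaultdict(int)

    def inc_value(self, key, count=1):
        self._stats[key] += count

    def get_value(self, key, default=None):
        return self._stats.get(key, default)


class Jobbole404Tracker:
    # Let 404 responses through to the callback instead of dropping them.
    handle_httpstatus_list = [404]

    def __init__(self, stats):
        self.stats = stats
        self.fail_urls = []

    def parse(self, url, status):
        if status == 404:
            # One line per stat: remember the URL and bump the counter.
            self.fail_urls.append(url)
            self.stats.inc_value("failed_url")


stats = MiniStats()
tracker = Jobbole404Tracker(stats)
for url, status in [("/1", 200), ("/2", 404), ("/3", 404)]:
    tracker.parse(url, status)
```

At the end of the crawl, the list of failed URLs can be written into the stats as well, so it appears in the final stats dump.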

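VII. Distributed crawling with scrapy-redis

1. Distributed crawler design and an introduction to Redis

How do we schedule multiple crawlers? With a centralized state manager.

Advantages:

  • Uses the bandwidth of multiple machines
  • Uses multiple IPs to speed up crawling

Two problems to solve:

  1. Centralized management of the request queue
  2. Centralized deduplication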
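
Both problems come down to keeping the queue and the seen-set in one shared place. The sketch below uses an in-memory object as a stand-in for that central store (Redis in a real deployment); SharedState and worker are hypothetical names, and the two "workers" simply take turns in one process.

```python
from collections import deque


class SharedState:
    """In-memory stand-in for the central store shared by all crawler nodes."""

    def __init__(self):
        self.queue = deque()  # centralized request queue
        self.seen = set()     # centralized deduplication set

    def push(self, url):
        # A URL is queued at most once, no matter which node discovers it.
        if url in self.seen:
            return False
        self.seen.add(url)
        self.queue.append(url)
        return True

    def pop(self):
        return self.queue.popleft() if self.queue else None


def worker(name, state, log):
    # Each node takes its next request from the shared queue.
    url = state.pop()
    if url is not None:
        log.append((name, url))
    return url


state = SharedState()
for url in ["/a", "/b", "/a", "/c"]:
    state.push(url)

log = []
while state.queue:
    worker("worker-1", state, log)
    worker("worker-2", state, log)
```

Because every node pushes to and pops from the same store, no URL is crawled twice and no node needs to know about the others.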
