
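Experiment 2

2.1 Task

Requirements: become fluent with serializing and outputting data through Scrapy's Item and Pipeline classes, and crawl foreign-exchange data using the Scrapy + XPath + MySQL stack. Candidate site: China Merchants Bank, fx.cmbchina.com/hq/

2.2 Approach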

2.2.1 setting.py

Enable the default request headers
Add the database connection details
Set ROBOTSTXT_OBEY to False
Enable the item pipeline (ITEM_PIPELINES)
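The four changes above could look roughly like the excerpt below. This is only a sketch: the project module name cmbspider, the pipeline class name CmbspiderPipeline, the header values, and all MYSQL_* keys and values are assumptions for illustration (the MYSQL_* names are custom keys, not built-in Scrapy settings).

```python
# settings.py (excerpt). All concrete values here are placeholders.

# Send browser-like headers with every request.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}

# Do not consult robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Enable the pipeline; the dotted path assumes a project named "cmbspider".
ITEM_PIPELINES = {
    "cmbspider.pipelines.CmbspiderPipeline": 300,
}

# Custom keys (hypothetical names) that the pipeline can read
# to open its MySQL connection.
MYSQL_HOST = "localhost"
MYSQL_PORT = 3306
MYSQL_USER = "root"
MYSQL_PASSWORD = "change-me"
MYSQL_DATABASE = "fx"
```

The number 300 is the pipeline's priority: Scrapy runs enabled pipelines in ascending order of this value, conventionally chosen between 0 and 1000.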


2.2.2 item.py

Define the item fields in item.py:

import scrapy


class CmbspiderItem(scrapy.Item):
    currency = scrapy.Field()
    tsp = scrapy.Field()
    csp = scrapy.Field()
    tbp = scrapy.Field()
    cbp = scrapy.Field()
    time = scrapy.Field()

2.2.3 db_Spider.py

Parsing the data

        lis = response.xpath('//*[@id="realRateInfo"]/table')
        currencys = lis.xpath(".//tr/td[1]/text()").extract()
        tsps = lis.xpath(".//tr/td[4]/text()").extract()
        csps = lis.xpath(".//tr/td[5]/text()").extract()
        tbps = lis.xpath(".//tr/td[6]/text()").extract()
        cbps = lis.xpath(".//tr/td[7]/text()").extract()
        times = lis.xpath(".//tr/td[8]/text()").extract()

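Note: there is a pitfall here. In the browser's developer tools, a tbody element appears directly under this table.

If we include tbody in the XPath, though, nothing gets scraped. Browsers insert tbody when they build the DOM, but the raw HTML that Scrapy downloads most likely does not contain it. So drop tbody from the path and change the steps below the table from / to // (hence the .//tr in the code above).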
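The effect can be reproduced without Scrapy. The sketch below swaps in the standard library's xml.etree.ElementTree for Scrapy's selectors and runs it on a made-up, simplified snippet (not the real page), assuming the server's raw HTML has no tbody:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the raw HTML: <tr> sits directly under <table>,
# with no <tbody> in between. The cell values are placeholders.
RAW_HTML = """
<html><body>
<div id="realRateInfo">
  <table>
    <tr><td>Currency</td><td>Rate</td></tr>
    <tr><td>HKD</td><td>0.00</td></tr>
  </table>
</div>
</body></html>
"""

root = ET.fromstring(RAW_HTML)
table = root.find(".//*[@id='realRateInfo']/table")

# A path copied from the browser includes tbody and matches nothing here.
with_tbody = table.findall("./tbody/tr/td[1]")

# Dropping tbody and searching all descendants finds the first-column cells.
without_tbody = table.findall(".//tr/td[1]")

print(len(with_tbody))                     # 0
print([td.text for td in without_tbody])   # ['Currency', 'HKD']
```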

Cleaning the data

Remove the surrounding spaces and the stray '\r\n' sequences from each value:

        rows = zip(currencys, tsps, csps, tbps, cbps, times)
        for count, row in enumerate(rows, start=1):
            # Remove spaces and '\r\n' from every field of the row
            currency, tsp, csp, tbp, cbp, time = (
                value.replace(' ', '').replace('\r\n', '') for value in row
            )
            # Skip the first extracted row
            if count == 1:
                continue
            item = CmbspiderItem(
                currency=currency, tsp=tsp, csp=csp, tbp=tbp, cbp=cbp, time=time
            )
            yield item
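The cleaning step can be tried outside the spider with plain lists. In this sketch the helper names clean and build_rows and the sample values are invented for illustration; only the two replace calls and the skipping of the first row come from the loop above.

```python
def clean(value):
    # Remove spaces and CRLF sequences, as the spider loop does.
    return value.replace(" ", "").replace("\r\n", "")


def build_rows(*columns):
    # Zip parallel column lists into cleaned tuples, skipping the first row.
    rows = []
    for count, row in enumerate(zip(*columns), start=1):
        if count == 1:
            continue
        rows.append(tuple(clean(value) for value in row))
    return rows


# Made-up sample values imitating what text() extraction may return.
currencys = ["\r\n  Currency\r\n ", "\r\n  HKD \r\n"]
tsps = [" Rate ", "\r\n 0.00 \r\n"]

print(build_rows(currencys, tsps))  # [('HKD', '0.00')]
```

Because zip stops at the shortest list, a column with a missing text node would silently shift or drop rows, so it is worth checking that all extracted lists have the same length.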

2.2.4 pipelines.py

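The original pipeline code was only shown as a screenshot, so what follows is not the author's code but a minimal sketch of how a MySQL pipeline for these items might look. It assumes the PyMySQL driver, a pre-existing table named fx_rates whose columns match the item fields, and custom MYSQL_* keys in the project settings; all of those names are hypothetical.

```python
class CmbspiderPipeline:
    """Sketch: write each scraped item into a MySQL table."""

    FIELDS = ("currency", "tsp", "csp", "tbp", "cbp", "time")

    # Table and column names are assumptions; values are passed as
    # parameters so the driver handles quoting.
    INSERT_SQL = (
        "INSERT INTO fx_rates (currency, tsp, csp, tbp, cbp, `time`) "
        "VALUES (%s, %s, %s, %s, %s, %s)"
    )

    def open_spider(self, spider):
        # Imported here so the module loads even without the driver installed.
        import pymysql

        settings = spider.settings
        self.conn = pymysql.connect(
            host=settings.get("MYSQL_HOST"),
            port=settings.getint("MYSQL_PORT"),
            user=settings.get("MYSQL_USER"),
            password=settings.get("MYSQL_PASSWORD"),
            database=settings.get("MYSQL_DATABASE"),
            charset="utf8mb4",
        )
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        # Insert one row per item, then pass the item on unchanged.
        values = tuple(item[field] for field in self.FIELDS)
        self.cursor.execute(self.INSERT_SQL, values)
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

Scrapy calls open_spider once when the crawl starts, process_item for every yielded item, and close_spider at the end, so the connection is opened and closed only once. Committing after every insert keeps the sketch simple; batching commits would be a reasonable refinement for larger crawls.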
