99. Scraping JD product info with Selenium; Scrapy introduction, installation, and architecture; parsing data with Scrapy; settings configuration


Scraping JD product info with Selenium

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys  # keyboard key actions
import time
import pymysql


conn = pymysql.connect(
        user='root',  # the first four arguments follow the DB-API 2.0 recommendation
        password="xxxxxxx",
        host='127.0.0.1',
        database="jingdong",
        autocommit=True,
        port=3306)
cursor = conn.cursor()

def get_goods(bro):
    bro.execute_script('scrollTo(0,5000)')  # scroll down so the lazy-loaded items render
    li_list = bro.find_elements(By.CLASS_NAME, 'gl-item')
    # print(len(li_list))  # 30
    for watch in li_list:
        try:
            # search inside the current <li> (watch) rather than the whole page,
            # otherwise every iteration would read the same hard-coded item
            good_url = watch.find_element(By.CSS_SELECTOR, 'div.p-img > a').get_attribute('href')

            img = watch.find_element(By.CSS_SELECTOR, 'div.p-img > a > img')
            img_url = img.get_attribute('src')
            price = watch.find_element(By.CSS_SELECTOR, 'div.p-price > strong').text
            desc = watch.find_element(By.CSS_SELECTOR, 'div.p-name.p-name-type-2 > a > em').text
            # the comment-count element has a per-SKU id such as #J_comment_100021614535,
            # so match it by its id prefix instead of hard-coding one SKU
            common_num = watch.find_element(By.CSS_SELECTOR, '[id^="J_comment_"]').text
            shop_name = watch.find_element(By.CSS_SELECTOR, 'div.p-shop > span > a').text
            # shop-type tag, also a per-SKU id such as #J_pro_100058523249
            shop_nature = watch.find_element(By.CSS_SELECTOR, '[id^="J_pro_"] > i').text
            print(f"""
            Product URL: {good_url}
            Image URL:   {img_url}
            Price:       {price}
            Description: {desc}
            Comments:    {common_num}
            Shop:        {shop_name}
            Shop type:   {shop_nature}
            """)
            cursor.execute('insert into test (good_url,img_url,price,`desc`,common_num,shop_name,shop_nature) values (%s,%s,%s,%s,%s,%s,%s)',
                           args=[good_url, img_url, price, desc, common_num, shop_name, shop_nature])
        except Exception as e:
            print(e)
            continue

    next_page = bro.find_element(By.PARTIAL_LINK_TEXT, '下一页')  # the "next page" link
    next_page.click()
    get_goods(bro)  # recurse onto the next results page

bro = webdriver.Chrome()
bro.get('https://www.jd.com/')
bro.implicitly_wait(10)
bro.maximize_window()
try:
    search_input = bro.find_element(By.ID, 'key')
    search_input.send_keys('华为手表')  # search keyword: "Huawei watch"
    search_input.send_keys(Keys.ENTER)
    get_goods(bro)
except Exception as e:
    print(e)

finally:
    bro.quit()   # quit() ends the whole driver session; close() only closes the current window
    conn.close()
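The insert statement above assumes a table named test already exists in the jingdong database. Below is a minimal sketch of a matching schema; the column names come from the insert statement, while the column types and lengths are assumptions:

# create_table.py -- run once before the crawler; types and lengths are only a guess
import pymysql

conn = pymysql.connect(user='root', password="xxxxxxx", host='127.0.0.1',
                       database="jingdong", autocommit=True, port=3306)
cursor = conn.cursor()
cursor.execute("""
    create table if not exists test (
        id int primary key auto_increment,
        good_url varchar(512),
        img_url varchar(512),
        price varchar(64),
        `desc` varchar(512),
        common_num varchar(64),
        shop_name varchar(128),
        shop_nature varchar(64)
    )
""")
conn.close()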

Scrapy introduction, installation, and architecture

Scrapy is a crawling framework: the low-level plumbing is already written, and you only fill in your own code in the designated places. It is used in a wide range of fields and is roughly the Django of the scraping world.

Scrapy architecture

[Figure: Scrapy architecture diagram]

1. Engine (ENGINE): controls the data flow between all components and triggers events when certain actions happen. It is the "general manager" that drives the data flow of the whole crawl.

2. Scheduler (SCHEDULER): accepts requests from the engine, pushes them onto a queue, and hands them back when the engine asks for the next one. Think of it as a priority queue of URLs: it decides which URL is fetched next and also de-duplicates URLs.

3. Downloader (DOWNLOADER): downloads page content and returns it to the ENGINE. It is built on Twisted, an efficient asynchronous model, so it can issue a very large number of requests concurrently.

4. Spiders (SPIDERS): classes written by the developer; they parse responses, extract items, and send new requests.

5. Item pipelines (ITEM PIPELINES): process items after they have been extracted; mainly cleaning, validation, and persistence (e.g. saving to a database). A minimal pipeline sketch follows this list.

6. Downloader middlewares (Downloader Middlewares): sit between the Scrapy engine and the downloader; they handle requests passing from the ENGINE to the DOWNLOADER and responses passing from the DOWNLOADER back to the ENGINE, so you can hook into both directions.

7. Spider middlewares (Spider Middlewares): sit between the ENGINE and the SPIDERS; they process the spiders' input (responses) and output (requests).
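For point 5, a minimal item-pipeline sketch (the file and class names are illustrative; enable it via ITEM_PIPELINES in settings.py):

# pipelines.py  (illustrative names)
class ArticlePipeline:
    def open_spider(self, spider):
        # runs once when the spider starts, e.g. open a file or a DB connection
        self.f = open('articles.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # called for every item the spider yields: clean / validate / persist it
        self.f.write(str(dict(item)) + '\n')
        return item  # return the item so later pipelines keep receiving it

    def close_spider(self, spider):
        # runs once when the spider finishes
        self.f.close()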

Installation

1. Windows
    1. pip3 install wheel  # enables installing packages from wheel files; wheels: https://www.lfd.uci.edu/~gohlke/pythonlibs
    2. pip3 install lxml
    3. pip3 install pyopenssl
    4. Download and install pywin32: https://sourceforge.net/projects/pywin32/files/pywin32/
    5. Download the Twisted wheel: http://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted
    6. pip3 install <download dir>\Twisted-17.9.0-cp36-cp36m-win_amd64.whl
    7. pip3 install scrapy

2. Linux / macOS
    1. pip3 install scrapy

Creating a Scrapy project (commands)

1. Create a project
    scrapy startproject <project name>

2. Create a spider: scrapy genspider <name> <domain>
    scrapy genspider cnblogs www.cnblogs.com

3. Run the spider
    1. Option 1: from the command line
       scrapy crawl cnblogs
    2. Option 2: from a script (create a launcher file)
       # run.py
       from scrapy.cmdline import execute
       execute(['scrapy', 'crawl', 'cnblogs', '--nolog'])

Scrapy project directory structure

myfirstscrapy  # project folder
  -myfirstscrapy  # package
    --spiders  # package holding your spiders, similar to Django apps
      ---__init__.py
      ---baidu.py    # a spider you created
      ---cnblogs.py  # a spider you created
    --__init__.py
    --items.py        # like Django's models: one class per item "model"
    --middlewares.py  # middlewares
    --pipelines.py    # pipelines ---> persistence goes here
    --run.py          # created by you, launches the spider
    --settings.py     # project settings file
  -scrapy.cfg  # deployment config

  PS: later on we mostly only touch the spiders in spiders/ and the pipelines (persistence).

Parsing data with Scrapy

1. The response object has css() and xpath() methods (Scrapy has a built-in selector library, so no third-party parser is needed; both XPath and CSS are supported)
    - css() takes a CSS selector
    - xpath() takes an XPath expression
2. Key point 1:
    1. XPath, text content
       './/a[contains(@class,"link-title")]/text()'
    2. XPath, attribute
       './/a[contains(@class,"link-title")]/@href'
    3. CSS, text content
       'a.link-title::text'
    4. CSS, attribute
       'img.image-scale::attr(src)'
3. Key point 2:
    .extract_first()  take the first match
    .extract()        take all matches

4. When the spider starts, the start_urls addresses are wrapped into Request objects --->
handed to the engine ---> scheduler ---> queued ---> engine ---> downloader middlewares ---> downloader ---> downloaded ---> engine ---> spider ----> and we land back in parse (a sketch of yielding a follow-up request is shown below)
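To make step 4 concrete, here is a hedged sketch of a parse method that yields a follow-up Request, which flows back through the engine and scheduler and eventually calls the callback again (the spider name and the pager selector are assumptions, not copied from the real page):

import scrapy


class PagingSpider(scrapy.Spider):
    name = "paging_demo"  # illustrative name
    start_urls = ["https://www.cnblogs.com"]

    def parse(self, response):
        # ... extract data from the current page here ...
        # then hand a new Request back to the engine; it goes through the
        # scheduler and downloader and comes back to the callback (parse again)
        next_url = response.css('div.pager a:last-child::attr(href)').extract_first()
        if next_url:
            yield scrapy.Request(url=response.urljoin(next_url), callback=self.parse)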

Getting all articles with CSS selectors

import scrapy


class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com"]

    def parse(self, response):
        # print(response.text)  # the HTTP response is wrapped into this response object
        # goal: parse out all articles (CSS selectors)
        article_list = response.css('article.post-item')  # a list of selector objects
        # print(len(article_list))
        for article in article_list:
            title = article.css('a.post-item-title::text').extract_first()
            url = article.css('a.post-item-title::attr(href)').extract_first()
            desc = article.css('p.post-item-summary::text').extract()
            # the summary may start with an empty text node, so fall back to the second one
            real_desc = desc[0].replace('\n', '').replace(' ', '')
            if real_desc:
                desc = real_desc
            else:
                desc = desc[1].replace('\n', '').replace(' ', '')
            img = article.css('img.avatar::attr(src)').extract_first()
            author = article.css('a.post-item-author>span::text').extract_first()
            create_time = article.css('span.post-meta-item>span::text').extract_first()
            print(f"""
            Title:   {title}
            URL:     {url}
            Summary: {desc}
            Avatar:  {img}
            Author:  {author}
            Created: {create_time}
            """)

Getting all articles with XPath

import scrapy


class CnblogsSpider(scrapy.Spider):
    name = "cnblogs"
    allowed_domains = ["www.cnblogs.com"]
    start_urls = ["https://www.cnblogs.com"]

    def parse(self, response):
        # print(response.text)  # the HTTP response is wrapped into this response object
        # goal: parse out all articles (XPath selectors)
        # css equivalent: article.post-item
        article_list = response.xpath('//article[contains(@class,"post-item")]')  # a list of selector objects

        print(len(article_list))
        for article in article_list:
            # css equivalent: a.post-item-title::text
            title = article.xpath('.//a/text()').extract_first()
            # css equivalent: a.post-item-title::attr(href)
            url = article.xpath('.//a[contains(@class,"post-item-title")]/@href').extract_first()
            # css equivalent: p.post-item-summary::text
            desc = article.xpath('.//p[contains(@class,"post-item-summary")]/text()').extract()
            # the summary may start with an empty text node, so fall back to the second one
            real_desc = desc[0].replace('\n', '').replace(' ', '')
            if real_desc:
                desc = real_desc
            else:
                desc = desc[1].replace('\n', '').replace(' ', '')
            # css equivalent: img.avatar::attr(src)
            img = article.xpath('.//p//img/@src').extract_first()
            # css equivalent: a.post-item-author>span::text
            author = article.xpath('.//section/footer/a[1]/span/text()').extract_first()
            # css equivalent: span.post-meta-item>span::text
            create_time = article.xpath('.//section/footer/span[1]/span/text()').extract_first()
            print(f"""
            Title:   {title}
            URL:     {url}
            Summary: {desc}
            Avatar:  {img}
            Author:  {author}
            Created: {create_time}
            """)

Settings configuration

Basic settings

1. BOT_NAME = "firstscrapy"  # project name, the name of the whole crawler (just good to know)

2. Where spiders live (just good to know)
   SPIDER_MODULES = ["firstscrapy.spiders"]
   NEWSPIDER_MODULE = "firstscrapy.spiders"

3. Whether to obey robots.txt; usually set to False
   ROBOTSTXT_OBEY = False

4. USER_AGENT = "firstscrapy (+http://www.yourdomain.com)"

5. Log level
   LOG_LEVEL = 'ERROR'

6. DEFAULT_REQUEST_HEADERS  default request headers
   DEFAULT_REQUEST_HEADERS = {
      'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
      'Accept-Language': 'en',
   }

7. SPIDER_MIDDLEWARES  spider middlewares
   SPIDER_MIDDLEWARES = {
       'cnblogs.middlewares.CnblogsSpiderMiddleware': 543,
   }

8. DOWNLOADER_MIDDLEWARES  downloader middlewares
   DOWNLOADER_MIDDLEWARES = {
       'cnblogs.middlewares.CnblogsDownloaderMiddleware': 543,
   }

9. ITEM_PIPELINES  persistence (pipelines)
   ITEM_PIPELINES = {
       'cnblogs.pipelines.CnblogsPipeline': 300,
   }

Advanced settings (improving crawl efficiency in the Scrapy framework)

1. Increase concurrency (default is 16):
   By default Scrapy runs 16 concurrent requests; this can be raised in the settings file:
   CONCURRENT_REQUESTS = 100

2. Lower the log level:
   Running Scrapy produces a lot of log output; to reduce CPU usage, set the log level to INFO or ERROR in the settings file:
   LOG_LEVEL = 'INFO'

3. Disable cookies:
   If you do not actually need cookies, disable them while crawling to reduce CPU usage and speed things up:
   COOKIES_ENABLED = False

4. Disable retries:
   Re-requesting failed HTTP requests (retries) slows the crawl down, so retries can be disabled:
   RETRY_ENABLED = False

5. Reduce the download timeout:
   When crawling very slow links, a smaller download timeout lets stuck requests be abandoned quickly, which improves efficiency:
   DOWNLOAD_TIMEOUT = 10  # timeout of 10 s
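Collected into one place, the tuning options above would look like this in settings.py (the values are just the examples from the list):

# settings.py -- performance-related options from the tips above
CONCURRENT_REQUESTS = 100   # raise concurrency (default is 16)
LOG_LEVEL = 'ERROR'         # less log output, less CPU
COOKIES_ENABLED = False     # skip cookies if the site does not need them
RETRY_ENABLED = False       # do not retry failed requests
DOWNLOAD_TIMEOUT = 10       # give up on slow responses after 10 seconds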

Projects

1. Vue front end, mini programs, uniapp, Android
    -https://gitee.com/hjp1011/uniapp-oa?_from=gitee_search
		-https://gitee.com/microapp/linjiashop-uniapp?_from=gitee_search
    -https://gitee.com/open-source-byte/house?_from=gitee_search
    -https://gitee.com/CZJpython/time-mercenaries
    
    -WeChat mini programs
    -https://gitee.com/kesixin/QuestionWechatApp?_from=gitee_search
    -https://gitee.com/yaozy717/hbhzdtn?_from=gitee_search
    -https://gitee.com/voice-of-xiaozhuang/sportmini?_from=gitee_search
    
    -vue
    -https://gitee.com/vilson/vue-projectManage?_from=gitee_search
    -https://gitee.com/nmgwap/vueproject?_from=gitee_search
    -https://gitee.com/GaryZhouz/exam_system?_from=gitee_search
     
    -Python projects
    -https://gitee.com/wushuiyong/walle-web
    -https://gitee.com/fanhuibin1/zhaoxinpms?_from=gitee_search#%E4%BA%A7%E5%93%81%E8%93%9D%E5%9B%BE
    -https://gitee.com/ht-jenny/Parking?_from=gitee_search