Scrapy Summary (Part 3): Installing and Using Splash


1 Installation (Linux)

First, install Docker:

curl -sSL https://get.daocloud.io/docker | sh

su root                   # switch to the root user first, then run the following commands
systemctl enable docker   # start Docker automatically at boot

systemctl start docker    # start Docker
systemctl restart docker  # restart Docker

2 Pull the image

sudo docker pull scrapinghub/splash

3 Start the container:

sudo docker run -p 8050:8050 -p 5023:5023 --restart=always scrapinghub/splash

Splash is now listening on 0.0.0.0, bound to ports 8050 (HTTP) and 5023 (telnet).

With that, Splash is up and running. If you want to access it remotely and the host is an Alibaba Cloud server, open the ports in both the inbound and outbound rules of the security group.
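Once the container is up, a quick way to confirm that Splash is reachable is to request its render.html endpoint. The sketch below is a minimal check and assumes Splash is listening on localhost:8050; substitute your server's address if you access it remotely.

# Minimal reachability check for a running Splash instance (assumes localhost:8050)
import requests

resp = requests.get(
    "http://localhost:8050/render.html",
    params={"url": "https://example.com", "wait": 0.5},
    timeout=30,
)
print(resp.status_code)   # 200 means Splash rendered the page
print(resp.text[:200])    # start of the rendered HTML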

For detailed Splash usage, see

splash-cn-doc.readthedocs.io/zh_CN/lates…

Starting Docker automatically at boot
Command: systemctl enable docker.service

Restarting containers automatically

When launching a container with docker run, this is set with the --restart parameter.

Example: # docker run -d --name mysql -p 3306:3306 --restart=always -v /var/lib/mysql:/var/lib/mysql -v /etc/localtime:/etc/localtime 39.106.193.240:9100/joss/mysql:5.7

always - restart the container no matter what its exit status is

If --restart=always was not specified when the container was created, it can be set afterwards with the update command:

Command: docker update --restart=always <container ID>
Example: docker update --restart=always 9bb3df5a70bf
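If you want to confirm that the policy actually took effect, docker inspect exposes it. A small sketch using Python's subprocess module; the container ID below is just the placeholder from the example above.

# Check a container's restart policy via `docker inspect`
# (the container ID is a placeholder - replace it with your own)
import subprocess

container_id = "9bb3df5a70bf"
result = subprocess.run(
    ["docker", "inspect", "-f", "{{.HostConfig.RestartPolicy.Name}}", container_id],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())   # expected output: always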

Extras

# List the running Docker containers

docker ps

# Kill a running container
docker kill 338****0d   (that is the container ID)

If you ever find that multiple requests, or requests issued in a loop, only run once, check whether the dont_filter parameter of scrapy.Request has been left unset.
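For example, when the same URL is requested repeatedly, Scrapy's duplicate filter silently drops every request after the first one unless you opt out per request. A minimal sketch (the spider name and URL are made up for illustration):

# Re-requesting the same URL in a loop: without dont_filter=True the
# duplicate filter drops everything after the first request.
import scrapy


class PollSpider(scrapy.Spider):
    name = 'poll'   # hypothetical spider

    def start_requests(self):
        for _ in range(3):
            yield scrapy.Request(
                'https://example.com/status',   # hypothetical URL
                callback=self.parse,
                dont_filter=True,               # skip the duplicate filter
            )

    def parse(self, response):
        self.logger.info('got %s', response.url)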

requests with Splash: the render.html URL

# Plain requests combined with Splash's render.html endpoint
import requests
from fake_useragent import UserAgent

splash_url = "http://192.168.59.103:8050/render.html?url={}&wait=1"
url = 'https://www.guazi.com/sh/buy/'
headers = {"User-Agent": UserAgent().random}
response = requests.get(splash_url.format(url), headers=headers)
response.encoding = 'utf-8'
print(response.text)
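Formatting the target URL straight into the query string works here, but it breaks once the URL itself contains characters such as & or ?. A safer variant of the same call, letting requests handle the URL-encoding:

# Same render.html request, with requests encoding the query parameters
import requests
from fake_useragent import UserAgent

response = requests.get(
    "http://192.168.59.103:8050/render.html",
    params={"url": "https://www.guazi.com/sh/buy/", "wait": 1},
    headers={"User-Agent": UserAgent().random},
)
response.encoding = 'utf-8'
print(response.text[:500])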

requests with Splash: Lua scripts

# Execute Lua code via Splash's execute endpoint
import requests
from fake_useragent import UserAgent
from urllib.parse import quote

url = "https://www.guazi.com/sh/buy/"
lua_script = '''
function main(splash, args)
    splash:go('{}')
    splash:wait(2)
    return splash:html()
end
'''.format(url)
splash_url = "http://192.168.59.103:8050/execute?lua_source={}".format(quote(lua_script))

headers = {"User-Agent": UserAgent().random}
print(splash_url)
response = requests.get(splash_url, headers=headers)
response.encoding = 'utf-8'
print(response.text)
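Instead of formatting the URL into the Lua source, the execute endpoint also accepts additional query parameters, which the script can read through args. The same request could be written like this:

# Same Lua script, but the target URL is passed as a separate parameter
# and read inside the script via args.url
import requests
from fake_useragent import UserAgent

lua_script = '''
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)
    return splash:html()
end
'''

response = requests.get(
    "http://192.168.59.103:8050/execute",
    params={"lua_source": lua_script, "url": "https://www.guazi.com/sh/buy/"},
    headers={"User-Agent": UserAgent().random},
)
response.encoding = 'utf-8'
print(response.text[:500])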

Scrapy with Splash

1. First, configure the address of the Splash service in the corresponding Scrapy project's settings, e.g.:

SPLASH_URL = 'http://192.168.59.103:8050'
2. Add the Splash middlewares to DOWNLOADER_MIDDLEWARES in settings and set the priority of the HttpCompressionMiddleware object:

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
3. Install Splash's SplashDeduplicateArgsMiddleware in SPIDER_MIDDLEWARES:

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
4. You can also set the corresponding duplicate filter, DUPEFILTER_CLASS:

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5. If you use Scrapy's HTTP cache, set HTTPCACHE_STORAGE to scrapy-splash's subclass of scrapy.contrib.httpcache.FilesystemCacheStorage:

HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'


# Then write the spider like this (approach 1)
import scrapy
from scrapy_splash import SplashRequest


class BaiduSpider(scrapy.Spider):
    name = 'guazi'
    allowed_domains = ['guazi.com']
    start_urls = ['https://www.guazi.com/sh/buy/']

    def start_requests(self):
        yield SplashRequest(self.start_urls[0], dont_filter=True, args={'wait': 1})

    def parse(self, response):
        print(response.text)

        
# Approach 2

import scrapy
from scrapy_splash import SplashRequest


class BaiduSpider(scrapy.Spider):
    name = 'guazi2'
    allowed_domains = ['guazi.com']
    start_urls = ['https://www.guazi.com/sh/buy/']

    def start_requests(self):
        lua_script = '''
            function main(splash, args)
                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                return {
                    html = splash:html()
                }
            end
        '''
        yield SplashRequest(url=self.start_urls[0], endpoint='execute', args={'lua_source': lua_script})

    def parse(self, response):
        print(response.text)
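Because the Lua script returns a table, scrapy-splash delivers the reply as a JSON response and exposes the decoded object as response.data. A sketch of a spider that reads the rendered HTML that way; the spider name guazi3 is made up, everything else mirrors approach 2.

# Reading the JSON result of an execute request via response.data
import scrapy
from scrapy_splash import SplashRequest


class GuaziJsonSpider(scrapy.Spider):
    name = 'guazi3'   # hypothetical name
    allowed_domains = ['guazi.com']
    start_urls = ['https://www.guazi.com/sh/buy/']

    def start_requests(self):
        lua_script = '''
            function main(splash, args)
                assert(splash:go(args.url))
                assert(splash:wait(0.5))
                return {html = splash:html()}
            end
        '''
        yield SplashRequest(url=self.start_urls[0], endpoint='execute', args={'lua_source': lua_script})

    def parse(self, response):
        # The returned Lua table arrives as JSON; response.data holds the decoded dict
        print(response.data['html'][:500])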

For more details, see the documentation:

splash-cn-doc.readthedocs.io/zh_CN/lates…