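Scenario

Splash is best known as the JavaScript rendering service that is usually paired with Scrapy. In some scenarios, though, it is more convenient and efficient to use Splash on its own, outside the Scrapy framework.

Configuration

Proxy: a tunnel proxy works best

On the host, pick a location and create the file /root/splash/proxy-files/cip.ini. Note: unlike what the official documentation shows, the extension should be lowercase (.ini).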

[proxy]

; required
host=<your proxy host>
port=<your proxy port>

; optional, default is no auth
username=<your proxy username>
password=<your proxy password>

; optional, default is HTTP. Allowed values are HTTP and SOCKS5
type=HTTP

[rules]
; optional, default ".*"
whitelist=
    .*cip\.cc.*

; optional, default is no blacklist
blacklist=
    .*\.js.*
    .*\.css.*
    .*\.png
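
To get a feel for which requests a profile like this would send through the proxy, the rule matching can be approximated in a few lines of Python. This is only a sketch, not Splash's actual code: it assumes the patterns are applied with re.match against the full request URL, that a blacklist hit takes precedence over the whitelist, and that an empty whitelist behaves like ".*". The function name should_use_proxy is hypothetical, and the patterns are modeled on the [rules] section with literal dots escaped.

```python
import re

# Patterns modeled on the [rules] section of cip.ini (literal dots escaped)
WHITELIST = [r".*cip\.cc.*"]
BLACKLIST = [r".*\.js.*", r".*\.css.*", r".*\.png"]


def should_use_proxy(url, whitelist=WHITELIST, blacklist=BLACKLIST):
    """Approximate the whitelist/blacklist decision for one request URL."""
    # Assumption: a blacklist hit always wins
    if any(re.match(pattern, url) for pattern in blacklist):
        return False
    # Assumption: an empty whitelist behaves like ".*"
    if not whitelist:
        return True
    return any(re.match(pattern, url) for pattern in whitelist)
```

Under these assumptions, the page http://cip.cc/ would go through the proxy, while its static assets such as http://static.cip.cc/static/js.min.js?v=6 and any URL outside cip.cc would be fetched directly.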

Start Splash with Docker

[root@host proxy-files]# docker run -p 8050:8050 -v /root/splash/proxy-files:/etc/splash/proxy-profiles scrapinghub/splash

# Run in the background (docker options such as --restart must come before the image name)
docker run -d --restart=always -p 8050:8050 -v /root/splash/proxy-files:/etc/splash/proxy-profiles scrapinghub/splash
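
When the container is started in the background, it can help to wait until Splash actually answers before sending render requests. The sketch below polls Splash's /_ping endpoint, which to my knowledge returns a JSON body containing "status": "ok" when the instance is alive; treat that response shape as an assumption and check it against your Splash version. The names fetch_ping and wait_for_splash are hypothetical helpers.

```python
import json
import time
import urllib.request


def fetch_ping(base_url, timeout=3.0):
    """GET <base_url>/_ping and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/_ping", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_splash(base_url, attempts=10, delay=1.0, ping=fetch_ping):
    """Return True once the ping reports status "ok", False if every attempt fails."""
    for attempt in range(attempts):
        try:
            if ping(base_url).get("status") == "ok":
                return True
        except (OSError, ValueError):
            # Connection refused, timeout, or a non-JSON body: not ready yet
            pass
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

For the setup in this post, the call would look like wait_for_splash("http://10.0.19.90:8050"), with the address replaced by your own Splash server.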

Log output after startup:

2022-10-13 04:22:24+0000 [-] Log opened.
2022-10-13 04:22:24.608906 [-] Xvfb is started: ['Xvfb', ':322410884', '-screen', '0', '1024x768x24', '-nolisten', 'tcp']
QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-splash'
2022-10-13 04:22:24.718768 [-] Splash version: 3.5
2022-10-13 04:22:24.774528 [-] Qt 5.14.1, PyQt 5.14.2, WebKit 602.1, Chromium 77.0.3865.129, sip 4.19.22, Twisted 19.7.0, Lua 5.2
2022-10-13 04:22:24.774753 [-] Python 3.6.9 (default, Jul 17 2020, 12:50:27) [GCC 8.4.0]
2022-10-13 04:22:24.774878 [-] Open files limit: 1048576
2022-10-13 04:22:24.774946 [-] Can't bump open files limit
2022-10-13 04:22:24.806401 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles
2022-10-13 04:22:24.806578 [-] memory cache: enabled, private mode: enabled, js cross-domain access: disabled
2022-10-13 04:22:24.969678 [-] verbosity=1, slots=20, argument_cache_max_entries=500, max-timeout=90.0
2022-10-13 04:22:24.970001 [-] Web UI: enabled, Lua: enabled (sandbox: enabled), Webkit: enabled, Chromium: enabled
2022-10-13 04:22:24.970452 [-] Site starting on 8050
2022-10-13 04:22:24.970553 [-] Starting factory <twisted.web.server.Site object at 0x7f729c0f4550>
2022-10-13 04:22:24.970848 [-] Server listening on http://0.0.0.0:8050

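The line "2022-10-13 04:22:24.806401 [-] proxy profiles support is enabled, proxy profiles path: /etc/splash/proxy-profiles" shows that proxy profile support is active and that the mounted directory was picked up.

Usage:

Request cip.cc through Splash to check the outbound IP

curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc&proxy=cip'

Response (the first request was rejected with "Too Many Request"; the second one succeeded):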


<html><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">Too Many Request
</pre></body></html>%
➜  ~ curl 'http://10.0.19.90:8050/render.html?url=http://cip.cc&proxy=cip'
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head>
		<title>IP查询 - 查IP(www.cip.cc)</title>
		<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
		<meta name="description" content="查IP(www.cip.cc)网站, 提供免费的IP查询服务,命令行查询IP, 并且支持'PC网站, 手机网站, 命令行(Windows/UNIX/Linux)' 三大平台, 是个多平台的IP查询网站, 更新即使, 数据准确是我们的目标">
		<meta name="keywords" content="IP, 查IP, IP查询">
		<meta name="viewport" content="width=device-width, initial-scale=1.0">
		<meta content="width=device-width,initial-scale=1" name="viewport">
		<link rel="icon" href="data:;base64,=">
		 <link href="//static.cip.cc/static/styles.min.css?v=15" rel="stylesheet">
		 <script src="https://hm.baidu.com/hm.js?6c34da399cbcfbb71d86c72215942759"></script><script type="text/javascript" src="//static.cip.cc/static/js.min.js?v=6"></script>
	</head>
	<body>

<div class="wrapper">
	<div class="page">
		<div class="logo">
			<h1>
				<strong>多平台的命令行IP查询</strong>
				<a href="//www.cip.cc/" title="手机, 命令行IP查询"><img src="//static.cip.cc/static/img/logo.png?v=2" alt="手机, 命令行IP查询"></a>
			</h1>
		</div>
		<div class="search">
			<form action="/" onsubmit="return query();">
				<table>
					<tbody>
					<tr>
						<td style=" width: 75%; ">
					<input id="data-input" placeholder="请输入要查询的 IP 地址" size="26" type="text">
						</td>
						<td>
					<input id="data-submit" type="submit" class="kq-button" value="查询">
						</td>
					</tr>
					</tbody>
				</table>
			</form>
		</div>
		<div class="data kq-well">
				<pre>IP	: 182.87.15.14
地址	: 中国  江西  鹰潭
运营商	: 电信

数据二	: 江西省鹰潭市 | 电信

数据三	: 中国江西省鹰潭市 | 电信

URL	: http://www.cip.cc/182.87.15.14
</pre>

As you can see, the IP has changed.
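
Rather than reading the IP out of the HTML by eye, it can be pulled out with a small regular expression. This is a minimal sketch that assumes the page keeps the "IP : x.x.x.x" line format shown in the response above; extract_ip is a hypothetical helper name and the sample string is a shortened copy of that response.

```python
import re

# Matches the "IP<tab>: 182.87.15.14" line inside the <pre> block
IP_LINE = re.compile(r"IP\s*:\s*(\d{1,3}(?:\.\d{1,3}){3})")


def extract_ip(html):
    """Return the first IPv4 address from an "IP : x.x.x.x" line, or None."""
    match = IP_LINE.search(html)
    return match.group(1) if match else None


sample = "<pre>IP\t: 182.87.15.14\n地址\t: 中国  江西  鹰潭\n</pre>"
print(extract_ip(sample))
```

A rate-limited response such as the "Too Many Request" page contains no such line, so the helper returns None and the caller can decide to retry.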

Python requests demo

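import requests

splash_host = "10.0.19.90"  # replace with your Splash server address
target_url = "http://cip.cc"

# Pass the query as params so that requests URL-encodes the target URL
response = requests.get(
    f"http://{splash_host}:8050/render.html",
    params={"url": target_url, "proxy": "cip"},
)
print(response.text)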

A real-world example:

import time
import requests
from concurrent.futures import ThreadPoolExecutor
from retrying import retry


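@retry(stop_max_attempt_number=3)
def req_taobao():
    splash_url = 'http://10.0.19.90:8050/render.html'
    target_url = "https://shop551707528.taobao.com/search.htm?search=y&orderType=hotsell_desc&&pageNo=1"

    # 'proxy': 'taobao' selects the profile taobao.ini in the proxy-profiles directory
    response = requests.get(splash_url, timeout=10, params={'url': target_url, 'timeout': 10, 'proxy': 'taobao'})
    response.raise_for_status()  # raise on HTTP errors so that @retry retries them too
    print(response.text)


# Start crawling: 10 requests spread over 5 worker threads
s_time = time.time()
with ThreadPoolExecutor(5) as executor:
    for i in range(10):
        executor.submit(req_taobao)
# Leaving the with-block waits for all submitted tasks to finish
print(time.time() - s_time)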
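
One thing to keep in mind with executor.submit is that return values and exceptions stay inside the returned Future objects; if nobody reads them, a request that still fails after the retries disappears silently. The sketch below shows one way to collect both. run_batch and fake_fetch are hypothetical names, and fake_fetch only stands in for a real Splash request so that the example runs without a server.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batch(fetch, jobs, max_workers=5):
    """Run fetch(job) for every job; return (results, errors) keyed by job."""
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(fetch, job): job for job in jobs}
        for future in as_completed(futures):
            job = futures[future]
            try:
                results[job] = future.result()
            except Exception as exc:  # keep the failure instead of losing it
                errors[job] = exc
    return results, errors


def fake_fetch(page):
    # Stand-in for a real Splash request; page 3 simulates a failure
    if page == 3:
        raise RuntimeError("simulated timeout")
    return f"<html>page {page}</html>"


results, errors = run_batch(fake_fetch, range(1, 6))
print(sorted(results), sorted(errors))
```

In a real crawl, fetch would be a function like req_taobao that takes a page number and returns response.text instead of printing it.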

A more recommended approach

import requests

# Address of the Splash service
splash_url = 'http://10.0.19.90:8050/execute'

# Lua script
lua_script = """
function main(splash, args)
    splash:on_request(function(request)
        request:set_proxy{
            host = "your-tunnel-proxy-host",  -- replace with your tunnel proxy host
            port = 8080,
        }
    end)
    assert(splash:go(args.url))
    assert(splash:wait(1))
    return {
        html = splash:html()
    }
end
"""

# Send the request to Splash
response = requests.post(splash_url, json={
    'lua_source': lua_script,
    'url': 'http://cip.cc'  # target URL
})
# Print the returned data (the table returned by main() arrives as JSON)
print(response.json())
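
Hard-coding the proxy in the Lua source means editing the script whenever the proxy changes. A possible variation, sketched below, sends the proxy settings in the JSON body and reads them from args inside the script. It relies on two things I believe hold for Splash but that you should verify: fields of the POST body are exposed to the script as args.<name>, and request:set_proxy accepts optional username and password fields. The field names proxy_host, proxy_port, proxy_username and proxy_password, as well as the helper build_payload, are hypothetical.

```python
LUA_SCRIPT = """
function main(splash, args)
    splash:on_request(function(request)
        request:set_proxy{
            host = args.proxy_host,
            port = args.proxy_port,
            username = args.proxy_username,
            password = args.proxy_password,
        }
    end)
    assert(splash:go(args.url))
    assert(splash:wait(1))
    return {html = splash:html()}
end
"""


def build_payload(url, proxy_host, proxy_port, username=None, password=None):
    """Build the JSON body for POST /execute; proxy settings travel as args."""
    payload = {
        "lua_source": LUA_SCRIPT,
        "url": url,
        "proxy_host": proxy_host,
        "proxy_port": proxy_port,
    }
    # Only send credentials when both are given
    if username is not None and password is not None:
        payload["proxy_username"] = username
        payload["proxy_password"] = password
    return payload
```

The payload would then be posted the same way as above, for example requests.post(splash_url, json=build_payload("http://cip.cc", "your-tunnel-proxy-host", 8080)), so the same script can be reused with different proxies per request.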
