PrSpiders: A Thread-Pool Crawler Framework
GitHub: https://github.com/peng0928/PrSpiders
Installing PrSpiders
pip install PrSpiders
Getting Started
1. Demo
from PrSpider import PrSpiders

class Spider(PrSpiders):
    start_urls = 'https://www.runoob.com'

    def parse(self, response):
        # print(response.text)
        print(response, response.code, response.url)
        # <Response Code=200 Len=323273> 200 https://www.runoob.com/

if __name__ == '__main__':
    Spider()
2. Overriding the entry point: start_requests
start_requests is the framework's entry point, and PrSpiders.Requests is the method used to send requests; its parameters are listed below.
from PrSpider import PrSpiders

class Spider(PrSpiders):
    def start_requests(self, **kwargs):
        start_urls = 'https://www.runoob.com'
        PrSpiders.Requests(url=start_urls, callback=self.parse)

    def parse(self, response):
        # print(response.text)
        print(response, response.code, response.url)

if __name__ == '__main__':
    Spider()
3. PrSpiders basic configuration
The framework is built on ThreadPoolExecutor under the hood.
workers: thread pool size
retry: whether to retry failed requests; enabled by default
download_delay: delay between request cycles; defaults to 0 s
download_num: number of requests dispatched per batch; defaults to 5 requests per second
logger: save logs to a local file; defaults to False, enable with True or a str filename, e.g. logger='test' (see the logging sketch after the example below)
log_level: log level; defaults to info, one of (debug, info, warn, error)
log_stdout: whether to redirect log output; disabled by default
Usage:
from PrSpider import PrSpiders

class Spider(PrSpiders):
    retry = False
    download_delay = 3
    download_num = 10

    def start_requests(self, **kwargs):
        start_urls = ['https://www.runoob.com' for i in range(100)]
        PrSpiders.Requests(url=start_urls, callback=self.parse)

    def parse(self, response):
        # print(response.text)
        print(response, response.code, response.url)

if __name__ == '__main__':
    Spider()
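The logging-related options listed above (workers, logger, log_level, log_stdout) are not exercised by that example. Here is a minimal sketch on the same pattern, assuming they are set as class attributes like the other settings; the class name and attribute values are illustrative, not prescribed by the docs:

from PrSpider import PrSpiders

class LoggedSpider(PrSpiders):
    workers = 8            # thread pool size
    logger = 'test'        # save logs locally to a file named 'test'
    log_level = 'debug'    # one of: debug, info, warn, error
    log_stdout = True      # redirect log output

    def start_requests(self, **kwargs):
        PrSpiders.Requests(url='https://www.runoob.com', callback=self.parse)

    def parse(self, response):
        print(response.code, response.url)

if __name__ == '__main__':
    LoggedSpider()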
4. PrSpiders.Requests basic configuration
Basic parameters:
url: request URL
callback: callback function
headers: request headers
retry_time: number of retries after a failed request
method: HTTP method (GET by default)
meta: data passed through to the callback
encoding: response encoding (utf-8 by default)
retry_interval: interval between retries
timeout: request timeout (10 s by default)
data or params: request payload
**kwargs: any parameters inherited from requests, e.g. data, params, proxies
PrSpiders.Requests(url=start_urls, headers={}, method='post', encoding='gbk', callback=self.parse, retry_time=10, retry_interval=0.5, meta={'hhh': 'ggg'})
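The call above passes meta={'hhh': 'ggg'}, but the docs do not show how the callback reads it back. A minimal sketch, assuming the framework exposes the dict on the response object as response.meta (Scrapy-style; that attribute name is an assumption, not confirmed above):

from PrSpider import PrSpiders

class MetaSpider(PrSpiders):
    def start_requests(self, **kwargs):
        # pass a value through to the callback via meta
        PrSpiders.Requests(url='https://www.runoob.com',
                           callback=self.parse,
                           meta={'page': 1})

    def parse(self, response):
        # ASSUMPTION: meta is readable as response.meta, Scrapy-style;
        # the docs only state that meta carries data to the callback
        print(response.url, response.meta)

if __name__ == '__main__':
    MetaSpider()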
API
- response.code: HTTP status code
- response.text: response text
- response.content: response body as bytes
- response.url: response URL
- response.history: redirect history
- response.headers: response headers
- response.len: length of the response text
- response.xpath: lxml XPath selector
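As a quick reference, a minimal parse callback touching the attributes above; the spider skeleton follows the earlier demos, and the class name is illustrative:

from PrSpider import PrSpiders

class ApiSpider(PrSpiders):
    def start_requests(self, **kwargs):
        PrSpiders.Requests(url='https://www.runoob.com', callback=self.parse)

    def parse(self, response):
        # each attribute below is taken from the API list above
        print(response.code)      # HTTP status code
        print(response.url)       # response URL
        print(response.len)       # length of the response text
        print(response.headers)   # response headers
        print(response.history)   # redirect history
        print(response.text[:80]) # first 80 characters of the body text

if __name__ == '__main__':
    ApiSpider()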
XPath API
- text(): convert the XPath result to text
- date(): convert the XPath result to a date
- get(): extract the first XPath result
- getall(): extract all XPath results; each result also supports the text() and date() methods
Example:
from PrSpider import PrSpiders

class Spider(PrSpiders):
    log_level = 'info'

    def start_requests(self, **kwargs):
        start_urls = "https://blog.csdn.net/nav/python"
        PrSpiders.Requests(url=start_urls, callback=self.parse)

    def parse(self, response):
        # first approach: grab each list item, then query within it
        lisqueryall = response.xpath("//div[@class='content']").getall()
        for query in lisqueryall:
            title = query.xpath(".//span[@class='blog-text']").text(lists=True)
            lishref = query.xpath(".//a[@class='blog']/@href").get()
            print({
                'approach': 'first',
                'list title': title,
                'list link': lishref
            })
        # second approach: query the full document directly
        title = response.xpath("//span[@class='blog-text']").text()
        lisquery = response.xpath("//div[@class='content']/a[@class='blog']/@href").get()
        print({
            'approach': 'second',
            'list title': title,
            'list link': lisquery
        })
        PrSpiders.Requests(url=lisquery, callback=self.cparse)

    def cparse(self, response):
        title = response.xpath("//h1[@id='articleContentId']").text()
        pudate = response.xpath("//span[@class='time']").date()
        content = response.xpath("//div[@id='content_views']").text()
        print({
            'title': title,
            'date': str(pudate),
            'href': response.url,
        })

if __name__ == "__main__":
    Spider()
Please contact me if you find any bugs.
email ->