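"Soul under the pen, reading in the pavilion." Today we'll scrape a novel from Biquge (笔趣阁). Quite a few people seem to have written crawler tutorials for this site, and my guess is that its backend is written in Python. By the way, what novels are you all reading?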
1. Without further ado, straight to the point: the site address (the code below uses http://www.ibiqu.org/)
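"Rising from ruin, reviving from extinction. The seas turn to dust and the lightning runs dry; that wisp of dark mist approaches the earth once more, and the shackles of the world are unlocked." Let's scrape this one, 圣墟 (Sheng Xu). Click the book to open its chapter list page.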
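On that page you can see the book's basic introduction, the main text section (正文), and the latest updates. Since we are scraping every chapter, the latest-updates section is of no use to us,
so open a chapter under the main text section to reach its detail page.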
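Both the list page and the detail page carry their data directly in the HTML, so XPath is enough to parse them. The page source shows up as garbled text, so remember to fix the encoding before parsing. I found no other obstacles, and the chapters are plain text, which is much nicer than the novel site from my previous post. With that, the logic is clear.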
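To illustrate why the source looks garbled, here is a minimal sketch using only the standard library. It assumes the page bytes are GBK-encoded, which is common for Chinese novel sites but not confirmed for this one; the sample string is made up. When a server sends `text/html` without a charset, requests falls back to ISO-8859-1, which produces exactly this kind of mojibake.

```python
# Hypothetical sample text; GBK is an assumed page encoding, not confirmed for this site.
original = "圣墟 第一章"
raw_bytes = original.encode("gbk")  # stands in for the bytes the server would send

# Decoding with the wrong charset (the ISO-8859-1 fallback) gives garbled text.
garbled = raw_bytes.decode("iso-8859-1")

# Decoding with the right charset recovers the original text.
fixed = raw_bytes.decode("gbk")

print(repr(garbled))
print(fixed)
```

With requests, assigning `response.encoding = response.apparent_encoding` before reading `response.text` has the same effect, because the encoding is then detected from the body instead of taken from the fallback.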
2. First, request the chapter list page and parse out all the chapters
import requests
from lxml import etree
from urllib.parse import urljoin

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Referer": "http://www.ibiqu.org/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}
list_url = "http://www.ibiqu.org/52_52542/"
response = requests.get(list_url, headers=headers, verify=False)
# Use the detected encoding so the Chinese text is not garbled.
response.encoding = response.apparent_encoding
res = etree.HTML(response.text)
# Every <dd> that comes after the <dt> whose text contains '正文' (main text).
data_list = res.xpath("//div[@id='list']/dl/dt[contains(text(),'正文')]/following::dd")
for data in data_list:
    title = data.xpath("./a/text()")[0]
    # urljoin resolves relative links and leaves absolute ones unchanged.
    url = urljoin(list_url, data.xpath("./a/@href")[0])
    print(title, url)
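That works. Note that the XPath here uses two of the more advanced features: the contains() function and the following axis. If you are not familiar with them, see my earlier post on advanced XPath usage.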
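As a rough offline illustration of that XPath, the sketch below runs the same expression against a small hand-written snippet. The markup and the chapter hrefs are hypothetical, modelled on the list page described above rather than copied from the site, and it needs lxml installed.

```python
from urllib.parse import urljoin

from lxml import etree

# Hypothetical, simplified markup; the real page will differ in detail.
SAMPLE_HTML = """
<div id="list"><dl>
  <dt>最近更新</dt>
  <dd><a href="/52_52542/3.html">第三章</a></dd>
  <dt>圣墟 正文</dt>
  <dd><a href="/52_52542/1.html">第一章</a></dd>
  <dd><a href="/52_52542/2.html">第二章</a></dd>
  <dd><a href="/52_52542/3.html">第三章</a></dd>
</dl></div>
"""
BASE_URL = "http://www.ibiqu.org/52_52542/"

tree = etree.HTML(SAMPLE_HTML)

# contains() picks the <dt> whose text includes '正文';
# following::dd then selects every <dd> after it in document order,
# which skips the "latest updates" entries listed before it.
dd_nodes = tree.xpath(
    "//div[@id='list']/dl/dt[contains(text(),'正文')]/following::dd"
)

chapters = [
    (dd.xpath("./a/text()")[0], urljoin(BASE_URL, dd.xpath("./a/@href")[0]))
    for dd in dd_nodes
]
for title, link in chapters:
    print(title, link)
```

One thing to keep in mind: `following::` matches every later `<dd>` anywhere in the document, not just siblings inside the same `<dl>`. If the page had other `<dd>` elements further down, `following-sibling::dd` would be the stricter choice.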
Next, request the detail page and parse out the chapter text
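import requests
from lxml import etree
from urllib.parse import urljoin

headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Cache-Control": "no-cache",
    "Connection": "keep-alive",
    "Pragma": "no-cache",
    "Referer": "http://www.ibiqu.org/52_52542/",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36"
}
# `url` is a chapter link taken from the list-page loop in step 2.
chapter_url = urljoin("http://www.ibiqu.org/52_52542/", url)
response = requests.get(chapter_url, headers=headers, verify=False)
# Use the detected encoding so the Chinese text is not garbled.
response.encoding = response.apparent_encoding
res = etree.HTML(response.text)
# Every text node inside the <p> tags of the content div.
content_list = res.xpath("//div[@id='content']/p//text()")
content = ''
for cc in content_list:
    # Drop the full-width spaces (\u3000) used to indent paragraphs.
    content += str(cc).replace('\u3000', '') + '\n'
print(content)
input()  # pause until Enter is pressed, to inspect the output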
OK, no problems.
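The cleanup step can also be pulled into a small helper, sketched here on made-up fragments. The function name `clean_paragraphs` is hypothetical, and stripping surrounding whitespace and dropping empty fragments goes slightly beyond the `\u3000` replacement used above.

```python
def clean_paragraphs(fragments):
    """Remove full-width spaces, trim whitespace, and drop empty fragments."""
    cleaned = []
    for fragment in fragments:
        text = str(fragment).replace("\u3000", "").strip()
        if text:  # skip fragments that were only whitespace
            cleaned.append(text)
    # One paragraph per line.
    return "\n".join(cleaned)


# Made-up text nodes, shaped like what the //text() XPath might return.
sample = ["\u3000\u3000第一段文字。", "\n", "\u3000\u3000第二段文字。"]
chapter_text = clean_paragraphs(sample)
print(chapter_text)
```

Returning a single string rather than printing inside the loop makes the result easier to reuse later, for example when saving a chapter.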
3. Next time I'll post an article on writing files in Python.