A Detailed Guide to Scraping a Web Novel


Day three of posting in a row. With nothing better to do, I'm scraping a novel and slacking off at my own pace. It reminds me of the first novel I ever read, back in high school, when a classmate brought in two books: 九转星辰 volume one and volume two. Five or six of us took turns reading them, and by the time I finished the first volume, some unlucky kid had gotten the second one confiscated by the dean. Without further ado, the site address: https://www.ihuaben.com

1. As usual, let's start with the customary brick-catching image:

(screenshot)

2. First, take a look at the site's structure. Open the site and find a novel you like; let's go with a cultivation one.

(screenshot: the novel's page)

2.1 The structure is clear: the novel's basic introduction and the chapter list are both on this one page. There's pagination at the bottom of the list, and each chapter entry links straight to its article page. Let's flip a page and see.

(screenshot: the chapter list after flipping a page)

2.2 Notice that when we flip pages, the page parameter in the link changes with us, and there's no JSON API. That settles it: we'll page through the list pages directly, pull out the detail-page links, then make a second request to fetch each article page's data. OK, plan's clear, let's get to work.

3. First, send a request to the list page:
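For reference, the list-page URLs we'll be iterating follow this pattern (a sketch of my own; the upper page bound is assumed here and should really be read from the pager):

base = 'https://www.ihuaben.com/book/9422347.html'
for page in range(1, 9):  # page count assumed, not read from the site
    list_url = f'{base}?page={page}'

In the actual request below, requests builds this query string for us via the params argument.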

import requests
from lxml import etree

# request headers copied from a real browser session
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    # 'Cookie': 'guid=07b1fb01-f629-470e-483b-18612ae39f15; isMember=false; hmsr=www.baidu.com; webp=1; JSESSIONID=3DC9B0975611644395EECB97B17795FC',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}

# which page of the chapter list to fetch
params = {
    'page': '8',
}

# request one page of the chapter list for book 9422347
response = requests.get('https://www.ihuaben.com/book/9422347.html', params=params, headers=headers)
res = etree.HTML(response.text)  # parse the returned HTML with lxml
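One small addition of my own before parsing: fail fast on HTTP errors so you don't end up running XPath over an error page:

response.raise_for_status()  # raises on 4xx/5xx responses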

3.1 Now we need to parse out all the chapter titles and links. I'm using XPath for the parsing here:

res = etree.HTML(response.text)

# each chapter link: <div class="chapterlist"> -> <p> -> <span class="chapterTitle"> -> <a>
data_list = res.xpath("//div[@class='chapterlist']/p/span[@class='chapterTitle']/a")
for data in data_list:
    title = data.xpath("./@title")[0]  # chapter title from the title attribute
    url = 'https://www.ihuaben.com' + data.xpath("./@href")[0]  # href is relative, so prepend the domain
    print(title, url)
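To feed the next step, it helps to collect the pairs rather than just print them; the chapters list below is my own addition, not from the original post:

chapters = []
for data in data_list:
    chapters.append((data.xpath("./@title")[0],
                     'https://www.ihuaben.com' + data.xpath("./@href")[0]))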

3.2 Then request each article's detail page and extract the body text with XPath:

# same browser headers as before, now with the Cookie and a Referer for the detail-page request
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Cookie': 'guid=07b1fb01-f629-470e-483b-18612ae39f15; isMember=false; hmsr=www.baidu.com; webp=1; JSESSIONID=A1C04CC582E6F22983C2650238E74AD0',
    'Pragma': 'no-cache',
    'Referer': 'https://www.ihuaben.com/book/9422347.html?page=8',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
}
response = requests.get(url, headers=headers)  # url is a detail-page link from step 3.1
res = etree.HTML(response.text)

# the body text lives in <p> tags under a div whose id contains "contentsource"
content = ''
content_list = res.xpath("//div[contains(@id,'contentsource')]//p//text()")
for text in content_list:
    content += text
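As a side note, that concatenation loop can be written as a single join, which also keeps the paragraph breaks:

content = '\n'.join(res.xpath("//div[contains(@id,'contentsource')]//p//text()"))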

3.3 Finally, store it: we use the chapter title as the file name and write the article text as the contents:

# open the file once per chapter instead of once per paragraph
content_list = res.xpath("//div[contains(@id,'contentsource')]//p//text()")
with open(f'{title}.txt', 'a', encoding='utf-8') as f:
    for text in content_list:
        f.write(text)
        f.write('\n')
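Putting the pieces together, here is a sketch of the whole crawl. The safe_name helper and the one-second delay are my additions (chapter titles may contain characters that are illegal in file names, and it's polite not to hammer the site), the page range is assumed rather than read from the pager, and headers is the dict defined above:

import re
import time

def safe_name(title):
    # replace characters Windows forbids in file names (an assumption about the titles)
    return re.sub(r'[\\/:*?"<>|]', '_', title)

for page in range(1, 9):  # page count assumed
    resp = requests.get('https://www.ihuaben.com/book/9422347.html',
                        params={'page': page}, headers=headers)
    listing = etree.HTML(resp.text)
    for a in listing.xpath("//div[@class='chapterlist']/p/span[@class='chapterTitle']/a"):
        title = a.xpath("./@title")[0]
        url = 'https://www.ihuaben.com' + a.xpath("./@href")[0]
        detail = etree.HTML(requests.get(url, headers=headers).text)
        paragraphs = detail.xpath("//div[contains(@id,'contentsource')]//p//text()")
        with open(f'{safe_name(title)}.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(paragraphs))
        time.sleep(1)  # be gentle with the server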

Run it:

(screenshot: the crawler's output)