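Have you noticed that the editions of Three Hundred Tang Poems sold online differ from publisher to publisher, in the annotations and even in the punctuation, and that there is nowhere to check which edition is actually correct? Is the textbook version correct? Really correct? And who gets to decide what "correct" means? Today we will scrape a classical Chinese poetry site: if nobody can give a definite answer about whether the published books are accurate, we might as well collect some data and look for ourselves.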
1. Without further ado, let's start by analyzing the site structure (the requests below go to www.gushiwen.cn)
We will skip the Recommended section, since this is not trending-news data, and go straight to the Poems section.
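This is the list page for the data. The first page alone does not tell us much, because on many sites the first page and the later pages are not served by the same kind of endpoint. For example, the first page's list data may be embedded in the HTML, while the paginated data comes from an AJAX endpoint (what I usually call JSON data). So let's capture the request that is sent when turning the page and take a look.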
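One quick way to tell which case a captured response falls into is to look at its Content-Type header and, failing that, to try parsing the body as JSON. The sketch below is only an illustration: `looks_like_json` is a hypothetical helper name, not something the site or `requests` provides, and it works on plain strings so it can be tried without any network access.

```python
import json


def looks_like_json(content_type: str, body: str) -> bool:
    """Guess whether a response carries JSON (AJAX) data rather than an HTML page.

    content_type: value of the Content-Type response header (may be empty).
    body: the response text.
    """
    # A declared JSON media type is the strongest hint.
    if "json" in content_type.lower():
        return True
    # Otherwise fall back to actually trying to parse the body.
    stripped = body.lstrip()
    if not stripped.startswith(("{", "[")):
        return False
    try:
        json.loads(stripped)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return True


if __name__ == "__main__":
    print(looks_like_json("text/html; charset=utf-8", "<html><body></body></html>"))
    print(looks_like_json("application/json", '{"items": []}'))
```

With `requests`, it could be called as `looks_like_json(response.headers.get('Content-Type', ''), response.text)`.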
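Hmm... the data is still in the HTML, and the URL of each page is simply built by concatenating the page number into the link. There are no other parameters, so nothing new here. Next, let's look at the detail-page data.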
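As a small sketch of that URL concatenation: the only page URL that appears in this post is `https://www.gushiwen.cn/default_2.aspx`, so the pattern `default_{page}.aspx` for other page numbers is an assumption extrapolated from it, and `build_page_url` is a hypothetical helper name.

```python
BASE_URL = "https://www.gushiwen.cn"


def build_page_url(page: int) -> str:
    """Build the list-page URL for a given page number.

    Assumes every page follows the default_<n>.aspx pattern seen for page 2.
    """
    if page < 1:
        raise ValueError("page must be >= 1")
    return f"{BASE_URL}/default_{page}.aspx"


if __name__ == "__main__":
    # Print the URLs for the first few pages.
    for page in range(2, 5):
        print(build_page_url(page))
```

Whether page 1 is also reachable under this pattern would need to be checked in the browser before relying on it.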
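Nothing special here either: the detail-page link is just the poem's id concatenated into a URL. That settles it, so we can start writing code.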
2. First request a paginated list page and parse the list data out of it
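import requests
from lxml import etree

# Cookies copied from the browser session (values kept exactly as captured).
cookies = {
    'login': 'flase',
    'wxopenid': 'defoaltid',
    'ticketStr': '206304604%7cgQHQ7zwAAAAAAAAAAS5odHRwOi8vd2VpeGluLnFxLmNvbS9xLzAydFlCc1FKbGVkN2kxa1lZTXhBMVcAAgQ870hkAwQAjScA',
}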
headers = {
    'authority': 'www.gushiwen.cn',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    # 'cookie': 'login=flase; wxopenid=defoaltid; ticketStr=206304604%7cgQHQ7zwAAAAAAAAAAS5odHRwOi8vd2VpeGluLnFxLmNvbS9xLzAydFlCc1FKbGVkN2kxa1lZTXhBMVcAAgQ870hkAwQAjScA',
    'pragma': 'no-cache',
    'referer': 'https://www.gushiwen.cn/',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'same-origin',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}
# Request page 2 of the list and parse the returned HTML.
response = requests.get('https://www.gushiwen.cn/default_2.aspx', cookies=cookies, headers=headers)
res = etree.HTML(response.text)
# Select the title links in the left column (the ones with an inline font-size style).
data_list = res.xpath("//div[@class='left']/div[@class='sons']/div[@class='cont']/p/a[contains(@style,'font-size')]")
for data in data_list:
    title = data.xpath("./b/text()")[0]
    url = data.xpath("./@href")[0]
    print(title, url)
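The `href` values extracted above are later passed straight to `requests.get`. Whether the site always emits absolute links has not been verified here, so a cautious sketch is to normalize each `href` against the list-page URL first. `absolute_url` is a hypothetical helper, and the `example.aspx` paths are made-up placeholders, not real pages on the site.

```python
from urllib.parse import urljoin

# The list page the links were extracted from.
LIST_PAGE_URL = "https://www.gushiwen.cn/default_2.aspx"


def absolute_url(href: str, page_url: str = LIST_PAGE_URL) -> str:
    """Resolve an href against the page it came from.

    Absolute hrefs are returned unchanged; relative and
    protocol-relative ones are completed from page_url.
    """
    return urljoin(page_url, href.strip())


if __name__ == "__main__":
    # Placeholder hrefs, used only to show the three cases.
    for href in ("https://so.gushiwen.cn/example.aspx",
                 "/example.aspx",
                 "//so.gushiwen.cn/example.aspx"):
        print(absolute_url(href))
```

In the loop above this would amount to wrapping the extracted value, for instance `url = absolute_url(data.xpath("./@href")[0])`.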
That works fine. Next, request the detail page and parse it.
cookies = {
    'ASP.NET_SessionId': 'dalwfvsnxl3ykz3egsjz32h1',
    'wxopenid': 'defoaltid',
    'ticketStr': '206304604%7cgQHQ7zwAAAAAAAAAAS5odHRwOi8vd2VpeGluLnFxLmNvbS9xLzAydFlCc1FKbGVkN2kxa1lZTXhBMVcAAgQ870hkAwQAjScA',
    'login': 'flase',
    'codeyzgswso': 'cc905ee008a0c6e8',
}
headers = {
    'authority': 'so.gushiwen.cn',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'accept-language': 'zh-CN,zh;q=0.9',
    'cache-control': 'no-cache',
    # 'cookie': 'ASP.NET_SessionId=dalwfvsnxl3ykz3egsjz32h1; wxopenid=defoaltid; ticketStr=206304604%7cgQHQ7zwAAAAAAAAAAS5odHRwOi8vd2VpeGluLnFxLmNvbS9xLzAydFlCc1FKbGVkN2kxa1lZTXhBMVcAAgQ870hkAwQAjScA; login=flase; codeyzgswso=cc905ee008a0c6e8',
    'pragma': 'no-cache',
    'sec-ch-ua': '" Not A;Brand";v="99", "Chromium";v="100", "Google Chrome";v="100"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'none',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36',
}
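# `url` is a detail-page link taken from the list page. As written, it is the
# last one the loop assigned; to fetch every poem, run this inside that loop.
response = requests.get(url, cookies=cookies, headers=headers)
res = etree.HTML(response.text)
# Poem text and annotations, each returned as a list of text nodes.
content = res.xpath("//div[@class='contson']//text()")
zhushi = res.xpath("//div[@class='contyishang']//text()")
print(content)
print(zhushi)
# Pause so the output can be inspected before continuing.
input()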
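An XPath query ending in `//text()` returns a list of raw text nodes, which usually includes whitespace-only entries from the page's indentation. A minimal sketch for turning such a list into readable text follows; `clean_text` is a hypothetical helper, and the sample list is hand-written for illustration rather than captured from the site.

```python
def clean_text(nodes):
    """Join text nodes into one string, one non-empty node per line.

    Leading and trailing whitespace is stripped from each node and
    nodes that are empty after stripping are dropped.
    """
    parts = (node.strip() for node in nodes)
    return "\n".join(part for part in parts if part)


if __name__ == "__main__":
    # Hand-written sample imitating the shape of a //text() result.
    sample = ["\n", " 床前明月光,", "\n  ", "疑是地上霜。\n"]
    print(clean_text(sample))
```

The two lists printed above could each be passed through the same helper before being stored or compared.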
OK, only two fields are parsed here as an example. If you need other data, you can extract it yourself in the same way.