There are two possible approaches. The first is to scrape the smallest parent tag that contains all of the information listed above, and then, for each such parent tag, extract the title, rating, quote, and link inside it. The second is to scrape all the titles, all the ratings, all the quotes, and all the links separately, and then match them up one by one in order. This time we take the first approach; for contrast, a short sketch of the second approach is given below, before the full program.
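The sketch below only illustrates the second approach and is not the method used in this post; the selectors mirror the page structure used by the full program, and the zip-based pairing is the part that differs.

# Rough sketch of approach two: collect each field into its own list, then pair them by position.
import requests
from bs4 import BeautifulSoup

resp = requests.get('https://book.douban.com/top250',
                    headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(resp.text, 'html.parser')

titles = [a.get_text(strip=True).replace(' ', '') for a in soup.select('div.pl2 > a')]
links = [a['href'] for a in soup.select('div.pl2 > a')]
ratings = [span.string for span in soup.select('span.rating_nums')]

# Pairing by position only works if every list has exactly one entry per book,
# which is why approach one (walking each <tr class="item">) is more robust.
for title, link, rating in zip(titles, links, ratings):
    print(title, rating, link)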
import re
import requests
import time
import random
from openpyxl import workbook, load_workbook
from bs4 import BeautifulSoup

def getBook(page):
    # Each page of the Top 250 list shows 25 books; page 0 has no "start" parameter
    if page == 0:
        url = 'https://book.douban.com/top250'
    else:
        url = 'https://book.douban.com/top250' + '?start=' + str(page * 25)
    try:
        # Send a browser-like User-Agent and a logged-in Cookie so the request is not rejected
        kv = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.3 Safari/605.1.15',
              'Cookie': '__utma=81379588.1410774223.1624331967.1624430146.1624499388.5; __utmb=81379588.2.10.1624499388; __utmc=81379588; __utmz=81379588.1624499388.5.5.utmcsr=baidu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=30149280.1458731315.1624331967.1624430146.1624499388.5; __utmb=30149280.2.10.1624499388; __utmc=30149280; __utmz=30149280.1624499388.5.5.utmcsr=baidu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; gr_cs1_84825faf-1548-4089-8031-acd6fdaa3ce1=user_id%3A0; gr_user_id=5f5fb227-eb54-47cd-80bd-fa7cbbaeb2b3; _pk_id.100001.3ac3=449fa3ee36cea64b.1624331967.5.1624499726.1624430146.; _pk_ses.100001.3ac3=*; ap_v=0,6.0; __utmt=1; _vwo_uuid_v2=DD4F4DF42FA305FDB3940B128E6DE508D|87adadc0f8fbc5da7ed45a64ca113bad; __utmt_douban=1; gr_session_id_22c937bbd8ebd703f2d8e9445f7dfd03_84825faf-1548-4089-8031-acd6fdaa3ce1=true; _pk_ref.100001.3ac3=%5B%22%22%2C%22%22%2C1624499389%2C%22https%3A%2F%2Fwww.baidu.com%22%5D; gr_session_id_22c937bbd8ebd703f2d8e9445f7dfd03=84825faf-1548-4089-8031-acd6fdaa3ce1; ct=y; viewed="1052990_1007914_4913064_35378776_4465858_1683129"; __gads=ID=668d257fc5284aeb-22ede7a3a7c9008c:T=1624331967:RT=1624331967:S=ALNI_MaPwpYsc5fdhZ0jN4lIkO-CgZWF0w; ll="108288"; bid=A3beH6OH7gQ',
              }
        r = requests.get(url, headers=kv, verify=False)
        time.sleep(random.randint(3, 5))  # pause 3-5 seconds between pages to avoid hammering the site
        r.raise_for_status()
        r.encoding = r.apparent_encoding
    except Exception as e:
        print('爬取错误', e)
        raise  # re-raise so the code below never runs with an undefined response
    html = r.text
    bs = BeautifulSoup(html, "html.parser")
    return bs

def getMessage(soup):
    Book_list = []
    # Each book sits in its own <tr class="item">, the minimal parent tag that holds all fields
    list_books = soup.find_all('tr', class_="item")
    for book in list_books:
        list_item = []
        # Title and link both live in the <a> inside <div class="pl2">
        tag_list = book.find('div', class_="pl2").find('a')
        title = tag_list.get_text().replace('\n', '').strip(' ').replace(' ', '')
        list_item.append(title)
        link = tag_list['href']
        list_item.append(link)
        # The author is the first "/"-separated field of <p class="pl">
        tag_author = book.find('p', class_="pl")
        author = tag_author.string.split('/')[0]
        list_item.append(author)
        tag_rate = book.find('span', class_='rating_nums')
        rating_nums = tag_rate.string
        list_item.append(rating_nums)
        # The review count looks like "(123456人评价)"; keep only the number
        tag_judge = book.find('span', class_='pl')
        judge = tag_judge.string.replace("\n", "").replace('(', '').replace(')', '').strip(' ').split('人')[0]
        list_item.append(judge)
        # Some books have no recommendation quote, so .string would fail on None
        try:
            tag_quote = book.find('span', class_='inq').string
        except AttributeError:
            tag_quote = str(None)
        list_item.append(tag_quote)
        Book_list.append(list_item)
    return Book_list

if __name__ == '__main__':
    wb = workbook.Workbook()
    ws = wb.active
    ws.append(['书名', '网址', '作者', '评分', '评论数', '推荐语'])
    for n in range(0, 10):
        print("爬取第%d页的数据" % (n + 1))
        bs = getBook(n)
        # Parse each page once and reuse the parsed rows instead of re-parsing for every cell
        for item in getMessage(bs):
            ws.append(item)
    wb.save("bookTop250改进版.xlsx")
    print("爬取完毕")
The scraping results are as follows.
Problems encountered while scraping:
1. When scraping the recommendation quote, some books do not have one, which kept raising 'NoneType' object has no attribute 'string'. An if/else check was tried to skip those books, but it did not work well; the problem was finally solved with try/except. A small sketch of both styles follows.
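The fragment below is made up purely to trigger the problem (a row with no <span class="inq"> quote); both fixes end up storing the string 'None'.

# Minimal sketch: handling a book that has no recommendation quote.
from bs4 import BeautifulSoup

book = BeautifulSoup('<tr class="item"><div class="pl2"><a href="#">某书</a></div></tr>',
                     'html.parser')  # made-up fragment with no <span class="inq">

# Check-first style: test the tag before touching .string
tag_quote = book.find('span', class_='inq')
quote = tag_quote.string if tag_quote is not None else str(None)

# try/except style, as used in the final script
try:
    quote = book.find('span', class_='inq').string
except AttributeError:
    quote = str(None)

print(quote)  # 'None' for books without a quote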
2. The book link lives inside the <a> tag shown in the figure, so the title and the URL have to be extracted in different ways. Reference articles: BeautifulSoup 提取某个tag标签里面的内容 and 跳坑系列-Python爬虫中p标签NavigableString获取问题. A small sketch of the difference follows.
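The fragment below is a made-up example following the page structure described above: the title comes from the tag's text, while the URL comes from the href attribute.

# Minimal sketch: text content vs. attribute access on the same <a> tag.
from bs4 import BeautifulSoup

fragment = '<div class="pl2"><a href="https://book.douban.com/subject/1007305/">红楼梦</a></div>'
a = BeautifulSoup(fragment, 'html.parser').find('div', class_='pl2').find('a')

title = a.get_text().replace('\n', '').strip()   # text inside the tag
link = a['href']                                  # attribute access for the URL
print(title, link)

# Note: for <p class="pl">作者 / 出版社 / ...</p>, .string returns a NavigableString,
# or None when the tag contains child tags, which is the pitfall the referenced article covers.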
3. After obtaining the title and the other fields, the text needs to be cleaned up and normalized, specifically stripping \n, spaces, and so on; one possible helper is sketched below.
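This helper is not part of the original script; it just shows how the clean-up could be centralized using the re module that the script already imports but never uses.

# Minimal sketch: collapse newlines and whitespace in scraped text with one helper.
import re

def clean(text):
    # remove all whitespace characters (\n, spaces, tabs) from a scraped string
    return re.sub(r'\s+', '', text)

print(clean('  红楼梦\n  [清] 曹雪芹 著  '))   # -> 红楼梦[清]曹雪芹著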
4. During execution the following warning appears, but it does not affect the results; one common way to silence it is sketched after this item.
Reference: requests库提示警告:InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate ver
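The warning comes from calling requests.get(..., verify=False). One common way to suppress it is shown below (a sketch; the cleaner fix is simply to drop verify=False and let certificate verification happen).

# Minimal sketch: silence InsecureRequestWarning triggered by verify=False requests.
import urllib3
import requests

# On very old requests versions the vendored module requests.packages.urllib3 may be needed instead.
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

r = requests.get('https://book.douban.com/top250',
                 headers={'User-Agent': 'Mozilla/5.0'},
                 verify=False)  # no warning is printed now
print(r.status_code)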
5. Other ways to collect information from a web page: regular expressions, XPath, and so on. To be expanded on later; a tiny regex illustration follows.
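As a tiny illustration of the regex route (a sketch on a made-up HTML fragment; note again that the script imports re without using it):

# Minimal sketch: pulling a rating out of raw HTML with a regular expression.
import re

fragment = '<span class="rating_nums">9.6</span>'
match = re.search(r'<span class="rating_nums">([\d.]+)</span>', fragment)
if match:
    print(match.group(1))   # -> 9.6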