Another Way to Scrape the Douban Books Top 250


Option one: scrape the smallest parent tag that wraps all of the information above, then, for each parent tag in turn, extract the book title, rating, quote, and link from inside it.

Option two: scrape all the titles, all the ratings, all the quotes, and all the links separately, then match them up by position. This time we take option one; for contrast, a minimal sketch of option two follows.
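The sketch assumes soup is a parsed page such as the one getBook() below returns; it is not part of the final script.

# Approach two (not used here): build parallel lists, then zip them into rows.
# Fragile in practice: if one book lacks a field (e.g. the quote, see problem 1
# at the end of this post), the lists fall out of step and pair up wrong books.
titles = [a.get_text(strip=True) for a in soup.select('div.pl2 a')]
links = [a['href'] for a in soup.select('div.pl2 a')]
ratings = [s.string for s in soup.find_all('span', class_='rating_nums')]
rows = list(zip(titles, links, ratings))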


import requests
import time
import random
from openpyxl import Workbook
from bs4 import BeautifulSoup
    
def getBook(page):
    # Each page lists 25 books; '?start=0' and the bare URL return the same page
    url = 'https://book.douban.com/top250?start=' + str(page * 25)
    kv = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.3 Safari/605.1.15',
      'Cookie': '__utma=81379588.1410774223.1624331967.1624430146.1624499388.5; __utmb=81379588.2.10.1624499388; __utmc=81379588; __utmz=81379588.1624499388.5.5.utmcsr=baidu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; __utma=30149280.1458731315.1624331967.1624430146.1624499388.5; __utmb=30149280.2.10.1624499388; __utmc=30149280; __utmz=30149280.1624499388.5.5.utmcsr=baidu.com|utmccn=(referral)|utmcmd=referral|utmcct=/; gr_cs1_84825faf-1548-4089-8031-acd6fdaa3ce1=user_id%3A0; gr_user_id=5f5fb227-eb54-47cd-80bd-fa7cbbaeb2b3; _pk_id.100001.3ac3=449fa3ee36cea64b.1624331967.5.1624499726.1624430146.; _pk_ses.100001.3ac3=*; ap_v=0,6.0; __utmt=1; _vwo_uuid_v2=DD4F4DF42FA305FDB3940B128E6DE508D|87adadc0f8fbc5da7ed45a64ca113bad; __utmt_douban=1; gr_session_id_22c937bbd8ebd703f2d8e9445f7dfd03_84825faf-1548-4089-8031-acd6fdaa3ce1=true; _pk_ref.100001.3ac3=%5B%22%22%2C%22%22%2C1624499389%2C%22https%3A%2F%2Fwww.baidu.com%22%5D; gr_session_id_22c937bbd8ebd703f2d8e9445f7dfd03=84825faf-1548-4089-8031-acd6fdaa3ce1; ct=y; viewed="1052990_1007914_4913064_35378776_4465858_1683129"; __gads=ID=668d257fc5284aeb-22ede7a3a7c9008c:T=1624331967:RT=1624331967:S=ALNI_MaPwpYsc5fdhZ0jN4lIkO-CgZWF0w; ll="108288"; bid=A3beH6OH7gQ',
      }
    try:
        # verify=False skips TLS verification and triggers the
        # InsecureRequestWarning discussed at the end of this post
        r = requests.get(url, headers=kv, verify=False)
        time.sleep(random.randint(3, 5))  # polite pause between requests
        r.raise_for_status()
        r.encoding = r.apparent_encoding
    except Exception:
        print('Fetch error')
        raise  # re-raise so a failed request cannot fall through silently
    return BeautifulSoup(r.text, "html.parser")

def getMessage(soup):
    Book_list = []
    # Each book lives in its own <tr class="item"> -- the smallest parent tag
    list_books = soup.find_all('tr', class_="item")
    for book in list_books:
        list_item = []
        # Title and link share the same <a> tag inside <div class="pl2">
        tag_list = book.find('div', class_="pl2").find('a')
        title = tag_list.get_text().replace('\n', '').strip(' ').replace(' ', '')
        list_item.append(title)
        link = tag_list['href']
        list_item.append(link)
        # <p class="pl"> reads "author / translator / press / year / price"
        tag_author = book.find('p', class_="pl")
        author = tag_author.string.split('/')[0]
        list_item.append(author)
        tag_rate = book.find('span', class_='rating_nums')
        rating_nums = tag_rate.string
        list_item.append(rating_nums)
        # Review count looks like "(123456人评价)"; keep only the number
        tag_judge = book.find('span', class_='pl')
        judge = tag_judge.string.replace("\n", "").replace('(', '').replace(')', '').strip(' ').split('人')[0]
        list_item.append(judge)
        # Some books have no quote, so find() returns None and .string raises
        try:
            tag_quote = book.find('span', class_='inq').string
        except AttributeError:
            tag_quote = str(None)
        list_item.append(tag_quote)
        Book_list.append(list_item)
    return Book_list

if __name__ == '__main__':
    wb = Workbook()
    ws = wb.active
    ws.append(['Title', 'URL', 'Author', 'Rating', 'Reviews', 'Quote'])
    for n in range(10):
        print('Scraping page %d' % (n + 1))
        bs = getBook(n)
        # Parse each page once and append the 25 rows it yields
        for row in getMessage(bs):
            ws.append(row)
    wb.save("bookTop250改进版.xlsx")
    print('Done scraping')

The scraped results look like this:

[screenshot: the resulting Excel sheet]

Problems encountered along the way:

1. When scraping the quote, some books have none, so the code kept throwing 'NoneType' object has no attribute 'string'. An if/else guard did not work well at first; try/except solved it.
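For comparison, the guard-based version does work once the None check is applied to the tag itself, before .string is chained onto it; a minimal sketch:

tag_quote = book.find('span', class_='inq')
# find() returns None when the book has no quote; test the tag, not tag.string
quote = tag_quote.string if tag_quote is not None else str(None)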


2. The book's title and its URL sit inside the same <a> tag (the screenshot of the markup is omitted here), so they are extracted with two different accessors, as the snippet below shows.
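In isolation, with book being one tr.item as in getMessage above:

tag_a = book.find('div', class_='pl2').find('a')
title = tag_a.get_text()  # the visible text inside the <a>
link = tag_a['href']      # the value of its href attribute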

See: 'Extracting the content inside a tag with BeautifulSoup', 'Basic crawler libraries', and 'Pitfall series: getting a NavigableString from a <p> tag in a Python crawler'.
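The NavigableString pitfall referenced above boils down to .string returning None whenever a tag has more than one child. A small self-contained illustration with made-up HTML:

from bs4 import BeautifulSoup

demo = BeautifulSoup('<p class="a">one child</p><p class="b">text<b>bold</b></p>', 'html.parser')
print(demo.find('p', class_='a').string)      # 'one child' -- single child, works
print(demo.find('p', class_='b').string)      # None -- the tag has two children
print(demo.find('p', class_='b').get_text())  # 'textbold' -- concatenates all text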

3. Once the title and the other fields are extracted, the raw text needs to be cleaned and normalized, specifically by stripping \n, spaces, and the like (hence the replace/strip chains in getMessage).

4. During runs an InsecureRequestWarning is printed, but it does not affect the results.


See: 'requests warning: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate ver…'
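If the warning is just noise, it can be silenced explicitly (or avoided altogether by dropping verify=False); a minimal sketch:

import urllib3

# Silence the InsecureRequestWarning that requests.get(..., verify=False) triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)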

5. Other ways to harvest information from a page: regular expressions, XPath, and so on.

See: 'Python crawler selectors (1): XPath'.
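As a taste of the XPath route, a sketch that pulls the same book links with lxml, assuming the same tr.item / div.pl2 structure as above:

from lxml import etree

def get_links_xpath(html):
    # Same link extraction as getMessage, but via XPath instead of BeautifulSoup
    tree = etree.HTML(html)
    return tree.xpath('//tr[@class="item"]//div[@class="pl2"]/a/@href')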

To be expanded in detail later.