Scraping the Douban Movie Top 250

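  • Requirements: movie title, release date, director, lead actors, rating, number of ratings, plot, and description
  • Save the scraped data to Excel with pandas
  • Analyze the data afterwards with pyecharts
  • Generate a word cloud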

Target URL: movie.douban.com/top250

How the page data is produced:

The browser's developer tools show that the page is served as static HTML, so the data can be extracted with either XPath or bs4. Here we use XPath.

Tools:

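To work out the XPath expressions we use a small browser add-on. Here that is an XPath plugin for Firefox, which helps locate the fields we need.
Inspecting the page's HTML shows that each movie is an li element inside an ol element, with 25 li elements per page. So we first select all the li elements under the ol with XPath and then query within each one, which makes it easier to handle the string values of a single entry.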
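The two-step idea above (select every li first, then query relative to each one) can be sketched on a tiny hand-written fragment. This is only an illustration under stated assumptions: the markup is a simplified stand-in for Douban's real page, the function name `extract_titles_and_ratings` is made up, and it uses the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) rather than `lxml`, so that it runs without extra packages. With `lxml` the same pattern applies through `li.xpath(...)`.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for the list markup (not the real Douban HTML)
SAMPLE = """
<div id="content">
  <ol class="grid_view">
    <li><div class="item"><span class="title">Movie A</span><span class="rating_num">9.7</span></div></li>
    <li><div class="item"><span class="title">Movie B</span><span class="rating_num">9.6</span></div></li>
  </ol>
</div>
"""


def extract_titles_and_ratings(markup):
    root = ET.fromstring(markup.strip())
    # Step 1: grab every <li> under the list container
    items = root.findall(".//ol[@class='grid_view']/li")
    rows = []
    for li in items:
        # Step 2: query relative to this <li>, so each value belongs to one movie
        title = li.find("div/span[@class='title']").text
        rating = li.find("div/span[@class='rating_num']").text
        rows.append((title, rating))
    return rows


rows = extract_titles_and_ratings(SAMPLE)
```

Because every lookup in step 2 starts from a single li, a missing field affects only that movie's row instead of shifting the values of all later movies.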

Example: extracting each movie's title, along with the other fields

import requests
from lxml import etree
import pandas as pd

headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Cookie': 'll="118282"; bid=KAa-jMmP9zQ; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1675995213%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DzqSh_IzoXJzHOZECIieU2OSovfe4sqL0ywMKUL2QN8y36xC5wKArJJ2gssY_e8gb%26wd%3D%26eqid%3Dae10bab90007f79d0000000663e5a845%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.358556425.1675995213.1675995213.1675995213.1; __utmc=30149280; __utmz=30149280.1675995213.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=223695111.984307299.1675995213.1675995213.1675995213.1; __utmb=223695111.0.10.1675995213; __utmc=223695111; __utmz=223695111.1675995213.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __yadk_uid=8QehsmrZSRdtjpx1frCHwmPQVfnys9LX; _vwo_uuid_v2=D7C8FF59380ED91F0D9025205640F6A2F|55b65e63e75dc498f0dc12b3c01d3ec7; __utmt=1; __utmb=30149280.3.10.1675995213; __gads=ID=b58e41ab272c9fb9-22d7eecda1d90012:T=1675995366:RT=1675995366:S=ALNI_MZAbBEwz8lXOhsMLZITvvrtUwCiEQ; __gpi=UID=00000bbecffe4bd6:T=1675995366:RT=1675995366:S=ALNI_ManR4BFcBZG-Z_TTmxMYT-a8J2IYg; _pk_id.100001.4cf6=025c2a9b6b803b08.1675995213.1.1675995375.1675995213.',
    'Pragma': 'no-cache',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"macOS"',
}
data = []


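def get_data(params):
    """Fetch one page of the Top 250 list and append its rows to `data`."""
    response = requests.get('https://movie.douban.com/top250', headers=headers, params=params, timeout=10)
    # Fail early on a blocked or failed request instead of parsing an error page
    response.raise_for_status()
    e = etree.HTML(response.text)
    # Select every movie entry first; the fields are then read relative to each <li>
    lis = e.xpath("//div[@id='content']//ol[@class='grid_view']/li")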
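    for li in lis:
        name = li.xpath("div/div[2]/div[1]/a/span[1]/text()")
        # The director/cast text needs special handling; working on one movie at a time keeps it simple
        info = li.xpath("div/div[2]/div[2]/p[1]/text()")
        d = info[0].strip()
        index1 = d.find('主演')
        if index1 == -1:
            # No '主演' (starring) marker found: keep the whole string as the director part
            director_1, actor = d, ''
        else:
            director_1 = d[:index1].strip()
            actor = d[index1:]
        # The second text node is expected to look like "year / region / genres"
        d_list = info[1].strip().split('/')
        date = d_list[0].strip()
        # Index from the end so an entry with extra release years does not shift region and genre
        address = d_list[-2].strip()
        genre = d_list[-1].strip()  # renamed from `type`, which shadows the built-in
        rating = li.xpath("div/div[2]/div[2]/div/span[2]/text()")
        comment = li.xpath("div/div[2]/div[2]/div/span[4]/text()")
        quotes = li.xpath("div/div[2]/div[2]//span[@class='inq']/text()")
        # Some movies have no one-line quote, so fall back to an empty string
        quote = quotes[0] if quotes else ''
        try:
            data.append((name[0], director_1, actor, date, address, genre, rating[0], comment[0], quote))
        except Exception as err:
            # Print the raw lists: indexing them here could raise again and hide the original error
            print(name, director_1, rating, comment, quote)
            print(err)
    return data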


if __name__ == '__main__':
    params = {
        'start': '0',
        'filter': '',
    }
    size = 25
    for page in range(0, 10):
        start = page * size
        params['start'] = str(start)
        get_data(params)

    df = pd.DataFrame(data,
                      columns=['name', 'director', 'actor', 'date', 'address', 'type', 'rating', 'comment', 'quotes'])
    df.to_excel('douban.xlsx')
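
The paging loop in the script steps the `start` query parameter by the page size (25 entries per page, 10 pages). As a small sketch of the same arithmetic, the snippet below builds the ten page URLs without sending any request. The helper name `page_urls` is hypothetical; the base URL and the `start` and `filter` parameters are taken from the script.

```python
from urllib.parse import urlencode

BASE_URL = "https://movie.douban.com/top250"


def page_urls(total=250, size=25):
    """Yield one URL per result page, stepping `start` by the page size."""
    for start in range(0, total, size):
        yield f"{BASE_URL}?{urlencode({'start': start, 'filter': ''})}"


urls = list(page_urls())
```

Printing `urls` before running the scraper is a cheap way to confirm that the loop covers offsets 0 through 225 and nothing beyond.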

Data storage:

We use pandas here, which makes it easy to produce Excel and CSV files.
The result as displayed in Excel
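
Since the scraped values are all strings, a small cleaning step before saving can make the later analysis easier. The sketch below is an illustration only: the two rows are made-up sample data in the script's column order, and the "N人评价" shape of the `comment` text is an assumption about how the rating count appears on the page. It writes CSV into an in-memory buffer so that it runs without an Excel engine.

```python
import io

import pandas as pd

# Made-up sample rows in the same column order as the script (not real Douban data)
rows = [
    ("Movie A", "导演: X", "主演: Y", "1994", "美国", "剧情", "9.7", "1000人评价", "quote a"),
    ("Movie B", "导演: Z", "", "2001", "日本", "动画", "9.4", "2500人评价", ""),
]
columns = ['name', 'director', 'actor', 'date', 'address', 'type', 'rating', 'comment', 'quotes']
df = pd.DataFrame(rows, columns=columns)

# Convert the numeric columns from strings before any analysis
df['rating'] = df['rating'].astype(float)
# Assumes the count text starts with digits, e.g. "1000人评价"
df['comment'] = df['comment'].str.extract(r'(\d+)', expand=False).astype(int)

# index=False keeps the DataFrame's row index out of the output
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

The same `index=False` argument works with `df.to_excel('douban.xlsx', index=False)`; writing `.xlsx` files additionally requires an Excel engine such as `openpyxl` to be installed.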

Notes:

  • The director and year information needs special handling

  • The quote field is empty for some movies, so check for that before using it
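
The two notes above can be sketched as small pure-Python helpers. The names `split_credits`, `split_meta`, and `first_or_default` are hypothetical, and the sample strings only imitate the shape of text the script expects ("导演: ... 主演: ..." and "year / region / genres"); the real page may contain variants this sketch does not cover.

```python
def split_credits(text, marker="主演"):
    """Split '导演: X   主演: Y' into (director_part, cast_part)."""
    text = text.strip()
    idx = text.find(marker)
    if idx == -1:
        # No cast marker: treat the whole string as the director part
        return text, ""
    return text[:idx].strip(), text[idx:].strip()


def split_meta(text):
    """Split 'year / region / genres' into three stripped fields."""
    parts = [p.strip() for p in text.split("/")]
    if len(parts) < 3:
        # Pad short input so the indexing below never fails
        parts += [""] * (3 - len(parts))
    # Year comes from the front, region and genres from the back,
    # in case extra release years appear in between
    return parts[0], parts[-2], parts[-1]


def first_or_default(values, default=""):
    """Return the first XPath result, or a default when the list is empty."""
    return values[0] if values else default


director, cast = split_credits("导演: 张三 San Zhang   主演: 李四 Si Li / ...")
year, region, genres = split_meta("1994 / 美国 / 犯罪 剧情")
quote = first_or_default([])
```

Keeping these steps in separate functions makes it possible to test them against a few sample strings without sending any request to the site.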