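- Requirements: movie title, release date, director, lead actors, rating, number of ratings, plot, and description
- Save the scraped data to Excel with pandas
- Analyze the data afterwards with pyecharts
- Generate a word cloud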
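Target URL: movie.douban.com/top250
How the page data is generated:
The browser's developer tools show that the page is static: the movie data is already present in the returned HTML. We can therefore extract it with XPath or bs4 (BeautifulSoup); here we use XPath.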
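Tools:
A small browser extension helps when working out the XPath expressions. Here we use an XPath extension in Firefox to locate the fields we need.
Inspecting the page's HTML shows that each movie is an li element inside an ol element, with 25 li elements per page. So we first use XPath to select all li elements under the ol, and then extract the fields relative to each li (the advantage is that the string values of a single movie are easier to process this way).
Example: getting each movie's title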
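The "select every li first, then query relative to each li" idea can be sketched without any network access. This is only an illustration under stated assumptions: the `SAMPLE` markup is a made-up, simplified stand-in rather than Douban's real HTML, `extract_titles` is a hypothetical helper, and the sketch uses the standard library's `xml.etree.ElementTree` (which supports a limited XPath subset) in place of lxml so that it runs with no extra packages.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for one results page (not Douban's real markup).
SAMPLE = """
<div id="content">
  <ol class="grid_view">
    <li><div class="info"><span class="title">Movie A</span><span class="inq">Quote A</span></div></li>
    <li><div class="info"><span class="title">Movie B</span></div></li>
  </ol>
</div>
"""


def extract_titles(markup):
    """Return a (title, quote) tuple for every <li> under the <ol>."""
    root = ET.fromstring(markup.strip())
    # Step 1: select every <li> under the <ol> in one query.
    items = root.findall(".//ol[@class='grid_view']/li")
    rows = []
    for li in items:
        # Step 2: query relative to this <li>, so fields of different
        # movies cannot get mixed up, and a missing field stays local.
        title = li.findtext(".//span[@class='title']", default="")
        quote = li.findtext(".//span[@class='inq']", default="")
        rows.append((title, quote))
    return rows


rows = extract_titles(SAMPLE)
print(rows)
```

Because each query is scoped to one li, the second movie simply gets an empty quote instead of shifting the quotes of later movies, which is what would happen if titles and quotes were collected in two separate page-wide lists.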
import requests
from lxml import etree
import pandas as pd
headers = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
'Accept-Language': 'zh-CN,zh;q=0.9',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'Cookie': 'll="118282"; bid=KAa-jMmP9zQ; _pk_ref.100001.4cf6=%5B%22%22%2C%22%22%2C1675995213%2C%22https%3A%2F%2Fwww.baidu.com%2Flink%3Furl%3DzqSh_IzoXJzHOZECIieU2OSovfe4sqL0ywMKUL2QN8y36xC5wKArJJ2gssY_e8gb%26wd%3D%26eqid%3Dae10bab90007f79d0000000663e5a845%22%5D; _pk_ses.100001.4cf6=*; ap_v=0,6.0; __utma=30149280.358556425.1675995213.1675995213.1675995213.1; __utmc=30149280; __utmz=30149280.1675995213.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __utma=223695111.984307299.1675995213.1675995213.1675995213.1; __utmb=223695111.0.10.1675995213; __utmc=223695111; __utmz=223695111.1675995213.1.1.utmcsr=baidu|utmccn=(organic)|utmcmd=organic; __yadk_uid=8QehsmrZSRdtjpx1frCHwmPQVfnys9LX; _vwo_uuid_v2=D7C8FF59380ED91F0D9025205640F6A2F|55b65e63e75dc498f0dc12b3c01d3ec7; __utmt=1; __utmb=30149280.3.10.1675995213; __gads=ID=b58e41ab272c9fb9-22d7eecda1d90012:T=1675995366:RT=1675995366:S=ALNI_MZAbBEwz8lXOhsMLZITvvrtUwCiEQ; __gpi=UID=00000bbecffe4bd6:T=1675995366:RT=1675995366:S=ALNI_ManR4BFcBZG-Z_TTmxMYT-a8J2IYg; _pk_id.100001.4cf6=025c2a9b6b803b08.1675995213.1.1675995375.1675995213.',
'Pragma': 'no-cache',
'Sec-Fetch-Dest': 'document',
'Sec-Fetch-Mode': 'navigate',
'Sec-Fetch-Site': 'none',
'Sec-Fetch-User': '?1',
'Upgrade-Insecure-Requests': '1',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
'sec-ch-ua': '"Not_A Brand";v="99", "Google Chrome";v="109", "Chromium";v="109"',
'sec-ch-ua-mobile': '?0',
'sec-ch-ua-platform': '"macOS"',
}
data = []
def get_data(params):
    response = requests.get('https://movie.douban.com/top250', headers=headers, params=params, timeout=10)
    tree = etree.HTML(response.text)
    lis = tree.xpath("//div[@id='content']//ol[@class='grid_view']/li")
    for li in lis:
        name = li.xpath("div/div[2]/div[1]/a/span[1]/text()")
        # The director/cast line needs special handling; working on one movie at a time keeps it simple
        director = li.xpath("div/div[2]/div[2]/p[1]/text()")
        d = director[0].strip()
        index1 = d.find('主演')
        if index1 == -1:
            # No '主演' (lead actor) marker in this entry: treat the whole line as the director part
            index1 = len(d)
        director_1 = d[:index1].strip()
        actor = d[index1:]
        # Second text node is "year / country / genre". Split from the right so that
        # an extra '/' in the year part does not shift the country and genre.
        d = director[1].strip()
        d_list = d.rsplit('/', 2)
        date = d_list[0].strip()
        address = d_list[1].strip()
        movie_type = d_list[2].strip()
        rating = li.xpath("div/div[2]/div[2]/div/span[2]/text()")
        comment = li.xpath("div/div[2]/div[2]/div/span[4]/text()")
        quotes = li.xpath("div/div[2]/div[2]//span[@class='inq']/text()")
        # Some movies have no one-line quote, so fall back to an empty string
        quote = quotes[0] if quotes else ''
        try:
            data.append((name[0], director_1, actor, date, address, movie_type, rating[0], comment[0], quote))
        except Exception as exc:
            # Print whatever was extracted to help locate the failing entry
            print(name, director_1, rating, comment, quote)
            print(exc)
    return data
if __name__ == '__main__':
    params = {
        'start': '0',
        'filter': '',
    }
    size = 25  # movies per page
    for page in range(0, 10):
        start = page * size
        params['start'] = str(start)
        get_data(params)
    df = pd.DataFrame(data,
                      columns=['name', 'director', 'actor', 'date', 'address', 'type', 'rating', 'comment', 'quotes'])
    # Writing .xlsx requires an Excel engine such as openpyxl to be installed
    df.to_excel('douban.xlsx')
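The paging logic in the main block can also be isolated into a small generator, which makes the offsets easy to check without sending any requests. This is a minimal sketch; `page_params` is a hypothetical helper name, and the defaults (10 pages of 25 movies) are taken from the loop above.

```python
def page_params(pages=10, size=25):
    """Yield the query parameters for each results page (hypothetical helper)."""
    for page in range(pages):
        # 'start' is the offset of the first movie on the page: 0, 25, 50, ...
        yield {'start': str(page * size), 'filter': ''}


all_params = list(page_params())
print(all_params[0]['start'], all_params[-1]['start'])
```

Each yielded dictionary could then be passed to `get_data` in place of mutating one shared `params` dictionary.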
Data storage:
We use the pandas library here, which makes it convenient to produce Excel and CSV files.
The result as displayed in Excel
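Since pandas can write the same DataFrame as CSV, a minimal sketch of that variant follows. The two rows are made-up placeholder values, not scraped data, and the output goes to an in-memory buffer so that the sketch creates no files; passing a file name to `to_csv` would write to disk instead.

```python
import io

import pandas as pd

# Made-up placeholder rows, only to show the shape of the output.
rows = [
    ('Movie A', '9.7', 'Quote A'),
    ('Movie B', '9.6', ''),
]
df = pd.DataFrame(rows, columns=['name', 'rating', 'quotes'])

# index=False leaves out the automatic row-number column.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

Passing `index=False` to `to_excel` has the same effect there, if the extra index column in the spreadsheet is not wanted.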
Notes:
- The director and year information needs special handling.
- The quote field is empty for some movies, so check for that before using it.
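The two notes above can be captured in small pure functions that are testable without any requests. This is a sketch under assumptions: the helper names `split_director_actor`, `split_info`, and `first_or_default` are hypothetical, and the sample strings are made up to mimic the "导演: ... 主演: ..." and "year / country / genre" layout handled by the scraper.

```python
def split_director_actor(line):
    """Split '导演: X   主演: Y' into (director, actor); actor is '' if missing."""
    line = line.strip()
    index = line.find('主演')
    if index == -1:
        # No lead-actor marker: keep the whole line as the director part.
        return line, ''
    return line[:index].strip(), line[index:].strip()


def split_info(line):
    """Split 'year / country / genre' from the right into three fields."""
    # rsplit keeps any extra '/' (e.g. several years) inside the first field.
    parts = [part.strip() for part in line.strip().rsplit('/', 2)]
    # Pad with empty strings if the line has fewer than three fields.
    parts += [''] * (3 - len(parts))
    return tuple(parts)


def first_or_default(values, default=''):
    """Return the first XPath result, or a default when the list is empty."""
    return values[0] if values else default


print(split_director_actor('导演: A   主演: B'))
print(split_info('1961 / 1964 / 中国大陆 / 剧情 动画'))
print(repr(first_or_default([])))
```

Splitting from the right is a design choice based on the assumption that the country and genre fields never contain a '/', while the year part sometimes does.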