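Analysis
First, let's look at the Maoyan movie Top 100: maoyan.com/board/4?off…
Looking at the page, pagination is controlled by the offset parameter at the end of the URL, so changing offset gives us the link for each page, like this: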
https://maoyan.com/board/4?offset=0
https://maoyan.com/board/4?offset=10
https://maoyan.com/board/4?offset=20
https://maoyan.com/board/4?offset=30
...
https://maoyan.com/board/4?offset=90
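The pagination rule above can be sketched in a few lines. This is only an illustration, assuming the board lists 10 films per page so that offset moves in steps of 10 from 0 to 90; the names `BASE_URL`, `PAGE_SIZE`, `TOTAL`, and `build_page_urls` are made up for this sketch and do not come from the original code.

```python
# Hypothetical helper: builds the list of page URLs for the Top 100 board.
# Assumption: 10 films per page, so offset takes the values 0, 10, ..., 90.
BASE_URL = "https://maoyan.com/board/4?offset={}"
PAGE_SIZE = 10   # assumed number of films per page
TOTAL = 100      # the board is a Top 100


def build_page_urls(total=TOTAL, page_size=PAGE_SIZE):
    """Return one URL per page, in page order."""
    return [BASE_URL.format(offset) for offset in range(0, total, page_size)]


if __name__ == "__main__":
    for url in build_page_urls():
        print(url)
```

Keeping the page size and total as parameters makes it easy to adjust the sketch if the assumed page size turns out to be different.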
Code that generates the links
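import random
import time

# Assumed values; the original does not show these constants.
_START, _END, _STEP = 0, 100, 10


def mao_yan():
    """Step 1: fetch the Maoyan Top 100 movie board."""
    for i in range(_START, _END, _STEP):
        print("mao yan start:{}".format(i))
        _get_mao_yan(i)
        # Pause 2 to 7 seconds between pages to keep the request rate low.
        time.sleep(random.choice([2, 3, 4, 5, 6, 7]))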
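The loop above mixes fetching with a random pause, which makes it hard to try out without hitting the site. Below is a minimal sketch of the same pattern with the fetch and sleep functions passed in as arguments. The names `crawl_pages`, `fetch`, `sleep`, `min_delay`, and `max_delay` are hypothetical, and unlike the original loop this version does not pause after the last page.

```python
import random
import time


def crawl_pages(offsets, fetch, sleep=time.sleep, min_delay=2.0, max_delay=7.0):
    """Call fetch(offset) for each offset, pausing a random time between calls.

    fetch and sleep are injected so the loop can be exercised offline.
    No pause is added after the last page.
    """
    offsets = list(offsets)
    results = []
    for index, offset in enumerate(offsets):
        results.append(fetch(offset))
        if index < len(offsets) - 1:
            sleep(random.uniform(min_delay, max_delay))
    return results


if __name__ == "__main__":
    # Dry run: a fake fetch, and a "sleep" that only records the requested delay.
    delays = []
    pages = crawl_pages(range(0, 30, 10),
                        fetch=lambda o: "page at offset {}".format(o),
                        sleep=delays.append)
    print(pages)
    print(len(delays))
```

In real use, `fetch` would be a function such as `_get_mao_yan` and `sleep` would be left at its default.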
Other related code
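import csv

import requests
from lxml import etree

# Assumed values; the original does not show these constants.
CSV_FILE = 'maoyan_top100.csv'
CSV_HEADERS = ['name', 'score', 'star', 'release_time', 'info_url']
_csv_first = 1  # stays 1 until the header row has been written


def _get_mao_yan(of):
    """Step 2: fetch one page of the board.

    :param of: offset value for the page
    :return: None; the HTML is passed on to _parse_mao_yan
    """
    p = 'https://maoyan.com/board/4?offset={}'.format(of)
    sess = requests.Session()
    # _init_user_agent() is a helper that is not shown in this post.
    sess.headers['User-Agent'] = _init_user_agent()
    r = sess.get(p)

    print("Response status code: {}".format(r.status_code))

    # Cache the HTML temporarily to avoid tripping anti-scraping measures
    # _tmp_save_html(r.text, of)

    # Parse the data
    _parse_mao_yan(r.text)


def _parse_mao_yan(c):
    """Step 3: parse the HTML.

    :param c: HTML content
    :return: None; the parsed rows are passed on to _save_to_csv
    """
    html = etree.HTML(c)
    info_list = html.xpath('//dl[@class="board-wrapper"]/dd')
    ac = []
    for info in info_list:
        name = info.xpath('div/div/div[1]/p[1]/a/text()')[0]
        info_url = 'https://maoyan.com' + info.xpath('div/div/div[1]/p[1]/a/@href')[0]
        star = info.xpath('div/div/div[1]/p[2]/text()')[0].strip()
        release_time = info.xpath('div/div/div[1]/p[3]/text()')[0].strip()
        # The score is split across two <i> tags, so join the two parts.
        score_1 = info.xpath('div/div/div[2]/p/i[1]/text()')[0]
        score_2 = info.xpath('div/div/div[2]/p/i[2]/text()')[0]
        score = score_1 + score_2
        cc = [name, score, star, release_time, info_url]
        ac.append(cc)

    # Save to CSV
    _save_to_csv(ac)


def _save_to_csv(c):
    """Step 5: append the parsed rows to the CSV file.

    :param c: parsed rows
    :return: None
    """
    global _csv_first
    with open(CSV_FILE, 'a+', newline='', encoding='utf-8') as f:
        f_csv = csv.writer(f)

        # Write the header row only on the first call.
        if _csv_first == 1:
            f_csv.writerow(CSV_HEADERS)
            _csv_first = 2
        f_csv.writerows(c)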
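The parsing and saving steps above can be tried offline on a small hand-written fragment. The sketch below is a stand-in, not the code from this post: it uses the standard library's `xml.etree.ElementTree` instead of lxml (this only works because the sample is well-formed XML; real pages need an HTML parser), the sample markup and its values are placeholders, and `parse_rows`, `append_rows`, `SAMPLE`, and `HEADERS` are hypothetical names. It also decides whether to write the header by checking if the file is empty, rather than using a global flag.

```python
import csv
import os
import tempfile
import xml.etree.ElementTree as ET

# Hand-written fragment; the film name, URL path, and other values are placeholders.
SAMPLE = """
<dl class="board-wrapper">
  <dd><div><div>
    <div>
      <p><a href="/films/0">Example Film</a></p>
      <p> Starring: Actor A, Actor B </p>
      <p> Release date: 2000-01-01 </p>
    </div>
    <div><p><i>9.</i><i>5</i></p></div>
  </div></div></dd>
</dl>
"""

HEADERS = ["name", "score", "star", "release_time", "info_url"]


def parse_rows(markup):
    """Extract one row per <dd>, in the column order listed in HEADERS."""
    root = ET.fromstring(markup.strip())
    rows = []
    for dd in root.findall("dd"):
        info = dd.find("div/div/div[1]")
        link = info.find("p[1]/a")
        # The score is split across two <i> elements; join their text.
        first_part, second_part = dd.findall("div/div/div[2]/p/i")
        rows.append([
            link.text,
            first_part.text + second_part.text,
            info.find("p[2]").text.strip(),
            info.find("p[3]").text.strip(),
            "https://maoyan.com" + link.get("href"),
        ])
    return rows


def append_rows(path, rows, headers=HEADERS):
    """Append rows, writing the header only when the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(headers)
        writer.writerows(rows)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        out = os.path.join(tmp, "sample.csv")
        append_rows(out, parse_rows(SAMPLE))
        append_rows(out, parse_rows(SAMPLE))  # second call adds no header
        with open(out, encoding="utf-8") as f:
            print(f.read())
```

One reason to check the file size instead of a module-level flag: a flag resets every time the script starts, so running the scraper twice against the same file would append a second header row.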
The scraped data looks like this
Next steps
Data visualization, data cleaning, data analysis
...
Stay tuned for the follow-up.