Python Crawler, Part 1: Scraping the Maoyan Top 100 Movies


This is day 1 of my participation in the 2022 First Post Challenge. For event details, see: 2022 First Post Challenge.

Analysis

First, let's take a look at the Maoyan movie top 100: maoyan.com/board/4?off…


Looking at the page, we can see that pagination is driven by the offset parameter at the end of the URL. Each page shows 10 movies, so stepping offset by 10 yields the link for every page:

https://maoyan.com/board/4?offset=0
https://maoyan.com/board/4?offset=10
https://maoyan.com/board/4?offset=20
...
...
https://maoyan.com/board/4?offset=90
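The offset rule can be checked with a quick throwaway snippet (`BASE` and `page_urls` are hypothetical names, assuming 10 movies per page and 100 in total):

```python
# Minimal sketch of the offset rule: one URL per page of 10 movies.
BASE = 'https://maoyan.com/board/4?offset={}'

def page_urls(total=100, per_page=10):
    """Build the board URL for every page."""
    return [BASE.format(off) for off in range(0, total, per_page)]

urls = page_urls()
print(urls[0])    # https://maoyan.com/board/4?offset=0
print(urls[-1])   # https://maoyan.com/board/4?offset=90
print(len(urls))  # 10
```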

Code to generate the links

```python
import random
import time

def mao_yan():
    """Fetch the Maoyan movie top-100 board, page by page."""
    for i in range(_START, _END, _STEP):
        print("mao yan start: {}".format(i))
        _get_mao_yan(i)
        # Sleep a random 2-7 seconds between pages to avoid hammering the site.
        time.sleep(random.choice([2, 3, 4, 5, 6, 7]))
```
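The constants `_START`, `_END`, and `_STEP` are not shown in this excerpt. A plausible sketch of them and of the throttled loop is below; the values and the `crawl` helper are assumptions, not the author's code, and the pause is injectable so the loop can be exercised without actually sleeping:

```python
import random
import time

# Plausible values for the constants referenced above (not shown in the
# excerpt): offsets 0, 10, ..., 90 cover all 100 movies.
_START, _END, _STEP = 0, 100, 10

def crawl(fetch_page, pause=None):
    """Visit every offset, pausing between requests.

    `fetch_page` is called with each offset; `pause` defaults to the same
    random 2-7 second sleep used above.
    """
    pause = pause or (lambda: time.sleep(random.choice([2, 3, 4, 5, 6, 7])))
    visited = []
    for i in range(_START, _END, _STEP):
        fetch_page(i)
        visited.append(i)
        pause()
    return visited
```

In real use this would be `crawl(_get_mao_yan)`; in a test, a no-op pause keeps it fast.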

Other related code

```python
import requests

def _get_mao_yan(of):
    """Fetch one page of the board.

    :param of: page offset
    """
    p = 'https://maoyan.com/board/4?offset={}'.format(of)
    sess = requests.session()
    sess.headers['User-Agent'] = _init_user_agent()
    r = sess.get(p)

    print("Page status code: {}".format(r.status_code))

    # Optionally cache the raw HTML locally, so debugging the parser does
    # not trigger the anti-crawler checks with repeated requests.
    # _tmp_save_html(r.text, of)

    # Parse the data
    _parse_mao_yan(r.text)
```


```python
from lxml import etree

def _parse_mao_yan(c):
    """Parse the board HTML and collect one record per movie.

    :param c: HTML content
    """
    html = etree.HTML(c)
    info_list = html.xpath('//dl[@class="board-wrapper"]/dd')
    ac = []
    for info in info_list:
        name = info.xpath('div/div/div[1]/p[1]/a/text()')[0]
        info_url = 'http://maoyan.com' + info.xpath('div/div/div[1]/p[1]/a/@href')[0]
        star = info.xpath('div/div/div[1]/p[2]/text()')[0].strip()
        release_time = info.xpath('div/div/div[1]/p[3]/text()')[0].strip()
        # The score is rendered in two parts: the integer and the fraction.
        score_1 = info.xpath('div/div/div[2]/p/i[1]/text()')[0]
        score_2 = info.xpath('div/div/div[2]/p/i[2]/text()')[0]
        score = score_1 + score_2
        ac.append([name, score, star, release_time, info_url])

    # Save to CSV
    _save_to_csv(ac)
```
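To see how those XPath expressions map onto the page structure, here is a self-contained run against a stripped-down fragment that mimics the board markup (the fragment is illustrative, not the site's exact HTML):

```python
from lxml import etree

# Illustrative fragment mimicking one <dd> entry of the board.
SAMPLE = '''
<dl class="board-wrapper">
  <dd>
    <div><div>
      <div class="movie-item-info">
        <p class="name"><a href="/films/1200486">我不是药神</a></p>
        <p class="star">主演:徐峥,周一围,王传君</p>
        <p class="releasetime">上映时间:2018-07-05</p>
      </div>
      <div class="movie-item-number score-num">
        <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
      </div>
    </div></div>
  </dd>
</dl>
'''

html = etree.HTML(SAMPLE)
dd = html.xpath('//dl[@class="board-wrapper"]/dd')[0]
# The same relative paths used in _parse_mao_yan:
name = dd.xpath('div/div/div[1]/p[1]/a/text()')[0]
star = dd.xpath('div/div/div[1]/p[2]/text()')[0].strip()
score = dd.xpath('div/div/div[2]/p/i[1]/text()')[0] \
    + dd.xpath('div/div/div[2]/p/i[2]/text()')[0]
print(name, score)  # 我不是药神 9.6
```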


```python
import csv

def _save_to_csv(c):
    """Append the parsed rows to the CSV file.

    :param c: list of parsed rows
    """
    global _csv_first
    with open(CSV_FILE, 'a+', newline='', encoding='utf-8') as f:
        f_csv = csv.writer(f)
        # Write the header row on the first call only.
        if _csv_first == 1:
            f_csv.writerow(CSV_HEADERS)
            _csv_first = 2
        f_csv.writerows(c)
```
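One caveat with the `_csv_first` global: it resets on every run, so a restarted crawl appends a second header row into the same file. An alternative sketch keys the header on the file being new or empty instead (`save_rows`, `CSV_FILE`, and `CSV_HEADERS` here are hypothetical stand-ins):

```python
import csv
import os

CSV_FILE = 'maoyan_top100.csv'  # hypothetical filename
CSV_HEADERS = ['name', 'score', 'star', 'release_time', 'info_url']

def save_rows(rows):
    """Append rows; write the header only when the file is new or empty."""
    need_header = (not os.path.exists(CSV_FILE)
                   or os.path.getsize(CSV_FILE) == 0)
    with open(CSV_FILE, 'a+', newline='', encoding='utf-8') as f:
        w = csv.writer(f)
        if need_header:
            w.writerow(CSV_HEADERS)
        w.writerows(rows)
```

This way the decision survives process restarts without any module-level state.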

The scraped data ends up in the CSV file, one row per movie: name, score, stars, release time, and detail-page URL.

Next steps

Data cleaning, data analysis, and data visualization.


Stay tuned.