Crawling the names and synopses of the top 100 movies on Rotten Tomatoes (source code included)

Create together, grow together! This is day 9 of my participation in the Juejin "Daily New Plan · August Writing Challenge".

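Preface

With the arrival of the big-data era, demand for information from the web has grown widely. Many companies collect external data from the internet for a variety of reasons: analyzing competitors, summarizing news coverage, tracking trends in a particular market, or collecting daily stock prices to build predictive models. As a result, web crawlers are becoming increasingly important. A web crawler automatically browses the internet, or scrapes information from it, according to specified rules.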

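Types of web crawlers

Based on the technology and architecture used, web crawlers can be divided into general-purpose web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers.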

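Basic workflow of a web crawler

The basic workflow of a general-purpose web crawler is as follows:

Get the initial URL. The initial URL is the crawler's entry point; it points to the web page to be crawled;
When crawling a page, fetch its HTML content and parse it to obtain the URLs of all pages that this page links to;
Put these URLs into a queue;
Loop through the queue, reading URLs one at a time; for each URL, crawl the corresponding page and repeat the crawling process above;
Check whether the stop condition is met. If no stop condition is set, the crawler keeps crawling until no new URLs can be obtained.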
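
The steps above can be sketched as a small breadth-first loop. This is only a minimal illustration, not the crawler used later in this article: `LinkExtractor`, `crawl`, `max_pages`, and the in-memory `FAKE_SITE` are hypothetical names introduced here, and the fake site stands in for real HTTP requests so the sketch runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_pages=10):
    """Breadth-first crawl starting at seed_url.

    fetch is any callable that maps a URL to its HTML text
    (or None when the page cannot be retrieved).
    """
    queue = deque([seed_url])  # URLs waiting to be crawled
    seen = {seed_url}          # URLs already queued, to avoid repeats
    visited = []               # URLs crawled so far, in order

    # Stop condition: the queue is empty or the page limit is reached.
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        visited.append(url)

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# A tiny in-memory "site" (hypothetical) replaces real network access.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "http://example.test/b": '<a href="/">home</a>',
    "http://example.test/c": "no links here",
}

if __name__ == "__main__":
    print(crawl("http://example.test/", FAKE_SITE.get))
```

The `seen` set is what keeps the loop from revisiting pages that link back to each other; without a limit such as `max_pages`, the loop ends only when the queue runs dry.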

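Preparing the environment for web crawling

Make sure Chrome, IE, or another browser is installed.
Download and install Python.
Download a suitable IDE.
This article uses Visual Studio Code.
Install the required Python packages.
pip is a package management tool for Python. It can download, install, and uninstall Python packages. It is included when you download and install Python, so we can use "pip install" directly to install the libraries we need.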


pip install beautifulsoup4
pip install requests
pip install lxml

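• BeautifulSoup is a library that makes it easy to parse HTML and XML data.
• lxml is a library that speeds up the parsing of HTML and XML files.
• requests is a library for sending HTTP requests (such as GET and POST). We will mainly use it to retrieve the source code of a given web page.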

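The following example uses a crawler to scrape the names and synopses of the top 100 movies on Rotten Tomatoes.

Top 100 Movies of All Time – Rotten Tomatoes

We need to extract each movie's name and rank from this page, then follow each movie's link to get its synopsis.

1. First, import the libraries you need.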


import requests
import lxml
from bs4 import BeautifulSoup

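2. Create and access the URL

Define the URL to crawl, then send a network request and wait for the response. (Header information is added further below.)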


url = "https://www.rottentomatoes.com/top/bestofrt/"
f = requests.get(url)

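When requesting a page, you will sometimes get a 403 error. This means the server refused your request; it is an anti-crawler measure that sites use to block malicious data collection. In that case, you can get access by imitating a browser's header information.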

url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
f = requests.get(url, headers = headers)
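
The retry-on-403 idea can be sketched as a small helper. This is an illustration under assumptions: `get_with_browser_fallback`, `FakeResponse`, and `fake_get` are hypothetical names introduced here, and the fake server replaces a real site so the sketch runs without network access. The only thing it relies on from requests is that a response exposes `status_code`.

```python
BROWSER_HEADERS = {
    # Same User-Agent string as in the example above.
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE"
}


def get_with_browser_fallback(get, url):
    """Request url; if the answer is 403, retry once with browser headers.

    get is a callable shaped like requests.get: get(url, headers=...)
    returning an object that has a status_code attribute.
    """
    response = get(url, headers=None)
    if response.status_code == 403:
        response = get(url, headers=BROWSER_HEADERS)
    return response


class FakeResponse:
    """Minimal stand-in for a response object (demo only)."""

    def __init__(self, status_code):
        self.status_code = status_code


def fake_get(url, headers=None):
    """Pretend server that rejects requests without a User-Agent header."""
    if headers and "User-Agent" in headers:
        return FakeResponse(200)
    return FakeResponse(403)


if __name__ == "__main__":
    response = get_with_browser_fallback(fake_get, "https://example.test/")
    print(response.status_code)
```

With the real library, passing `requests.get` as `get` should follow the same path, although a site may use other checks besides the User-Agent header.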

3. Parse the web page

Create a BeautifulSoup object and specify lxml as the parser.
soup = BeautifulSoup(f.content,'lxml')

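4. Extract the information

The BeautifulSoup library offers three methods for finding elements:
find_all(): finds all matching nodes
find(): finds a single node
select(): finds nodes with a CSS selector

We need the names and links of the top 100 movies. As the call below shows, the movie names we need are in the links (a elements) inside the table whose class is "table". After loading the page content into BeautifulSoup, we can use the find method to extract the relevant information.
movies = soup.find('table',{'class':'table'}).find_all('a')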
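
To see how the three lookup methods differ, here is a minimal sketch on a hand-written HTML fragment. The markup, movie titles, and paths are hypothetical and only loosely modelled on a ranking table; the sketch uses Python's built-in `html.parser` so that it runs without lxml, and `'lxml'` should behave the same way on this input.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a ranking table plus one unrelated link.
HTML = """
<table class="table">
  <tr><td>1.</td><td><a href="/m/movie_one"> Movie One (2001)</a></td></tr>
  <tr><td>2.</td><td><a href="/m/movie_two"> Movie Two (2002)</a></td></tr>
</table>
<a href="/about">About</a>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find(): returns only the first matching node.
table = soup.find("table", {"class": "table"})

# find_all(): returns every matching node below the table,
# so the unrelated "About" link is not included.
anchors = table.find_all("a")
titles = [a.string.strip() for a in anchors]
links = [a["href"] for a in anchors]

# select(): reaches the same nodes through a CSS selector.
selected = soup.select("table.table a")

print(titles)
print(links)
print(len(selected))
```

Chaining `find` and `find_all` narrows the search to one table first, which is why links elsewhere on the page do not end up in the result.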

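Getting each movie's synopsis

After extracting this information, we still need each movie's synopsis. The synopsis is on each movie's own page, so the crawler has to follow every movie link to retrieve it.

The code is: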

import requests
import lxml
from bs4 import BeautifulSoup

url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
f = requests.get(url, headers=headers)
movies_lst = []
soup = BeautifulSoup(f.content, 'lxml')
# All movie links sit inside the ranking table.
movies = soup.find('table', {'class': 'table'}).find_all('a')
num = 0
for anchor in movies:
    urls = 'https://www.rottentomatoes.com' + anchor['href']
    movies_lst.append(urls)
    num += 1
    # Follow the movie link and parse its detail page.
    movie_url = urls
    movie_f = requests.get(movie_url, headers=headers)
    movie_soup = BeautifulSoup(movie_f.content, 'lxml')
    movie_content = movie_soup.find('div', {
        'class': 'movie_synopsis clamp clamp-6 js-clamp'
    })
    print(num, urls, '\n', 'Movie:' + anchor.string.strip())
    print('Movie info:' + movie_content.string.strip())

The output is:

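Writing the crawled data to Excel

To make data analysis easier, the crawled data can be written to Excel. We use xlwt to write the data to an Excel file.

Import the xlwt library.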

from xlwt import *
Create an empty workbook and add a sheet.

workbook = Workbook(encoding = 'utf-8')
table = workbook.add_sheet('data')
Create the header of each column in the first row.
table.write(0, 0, 'Number')
table.write(0, 1, 'movie_url')
table.write(0, 2, 'movie_name')
table.write(0, 3, 'movie_introduction')
Write the crawled data into Excel row by row, starting from the second row (line starts at 1 and these calls run inside the crawl loop).
table.write(line, 0, num)
table.write(line, 1, urls)
table.write(line, 2, anchor.string.strip())
table.write(line, 3, movie_content.string.strip())
line += 1

Finally, save the Excel file.
workbook.save('movies_top100.xls')
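
The header-then-rows pattern above can be sketched as a helper that works with any object offering a `write(row, col, value)` method, which is the shape of the call used on the xlwt sheet in this article. `write_rows`, `FakeSheet`, and the sample rows are hypothetical and exist only so the sketch runs without xlwt installed.

```python
HEADER = ("Number", "movie_url", "movie_name", "movie_introduction")


def write_rows(table, rows):
    """Write HEADER to row 0 and each tuple in rows to the rows below.

    table is any object with a write(row, col, value) method.
    Returns the index of the next free row.
    """
    for col, title in enumerate(HEADER):
        table.write(0, col, title)

    line = 1  # data starts on the second row
    for row in rows:
        for col, value in enumerate(row):
            table.write(line, col, value)
        line += 1
    return line


class FakeSheet:
    """In-memory stand-in for a spreadsheet sheet (demo only)."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


if __name__ == "__main__":
    sheet = FakeSheet()
    sample_rows = [
        (1, "https://example.test/m/one", "Movie One", "First synopsis."),
        (2, "https://example.test/m/two", "Movie Two", "Second synopsis."),
    ]
    print(write_rows(sheet, sample_rows), sheet.cells[(2, 2)])
```

Collecting the rows first and writing them in one place keeps the `line` counter out of the crawl loop, which is one way to avoid an off-by-one between the header and the data.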

The final code is:

import requests
import lxml
from bs4 import BeautifulSoup
from xlwt import *

workbook = Workbook(encoding='utf-8')
table = workbook.add_sheet('data')
# Header row.
table.write(0, 0, 'Number')
table.write(0, 1, 'movie_url')
table.write(0, 2, 'movie_name')
table.write(0, 3, 'movie_introduction')
line = 1  # data starts on the second row
url = "https://www.rottentomatoes.com/top/bestofrt/"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36 QIHU 360SE'
}
f = requests.get(url, headers=headers)
movies_lst = []
soup = BeautifulSoup(f.content, 'lxml')
# All movie links sit inside the ranking table.
movies = soup.find('table', {'class': 'table'}).find_all('a')
num = 0
for anchor in movies:
    urls = 'https://www.rottentomatoes.com' + anchor['href']
    movies_lst.append(urls)
    num += 1
    # Follow the movie link and parse its detail page.
    movie_url = urls
    movie_f = requests.get(movie_url, headers=headers)
    movie_soup = BeautifulSoup(movie_f.content, 'lxml')
    movie_content = movie_soup.find('div', {
        'class': 'movie_synopsis clamp clamp-6 js-clamp'
    })
    print(num, urls, '\n', 'Movie:' + anchor.string.strip())
    print('Movie info:' + movie_content.string.strip())
    # One spreadsheet row per movie.
    table.write(line, 0, num)
    table.write(line, 1, urls)
    table.write(line, 2, anchor.string.strip())
    table.write(line, 3, movie_content.string.strip())
    line += 1
# Save once, after all rows have been written.
workbook.save('movies_top100.xls')

The result is:
