iQIYI VIP Movie Scraping: Python Multithreaded Concurrency in Practice

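In data collection, iQIYI is one of the leading video platforms, and scraping its movie data has long been a classic practice project for people learning Python crawlers. An ordinary single-threaded crawler is slow when collecting a large volume of VIP movie data, because it spends most of its time waiting for network responses; running requests concurrently across several threads can speed up the crawl considerably. This article takes a hands-on approach and explains in detail how to scrape iQIYI VIP movie data efficiently with Python multithreading. It also covers strategies for dealing with anti-scraping measures and methods for processing the data.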

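I. Core Preparation Before Scraping

1.1 Technology Stack

The core stack for this project is:

HTTP library: requests (sends HTTP requests and mimics browser access)
Parsing library: BeautifulSoup4 (parses HTML pages and extracts the key data)
Threading library: threading (runs the scraping tasks concurrently)
Data storage: csv (stores the scraped movie data in structured form)
Helper tool: fake-useragent (generates random User-Agent strings to get past basic anti-scraping checks)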

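1.2 Environment Setup

Install the third-party packages with pip; threading, queue, and csv are part of the Python standard library and need no installation:

pip install requests beautifulsoup4 fake-useragent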

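1.3 Target Analysis

The target is the iQIYI VIP movie section (www.iqiyi.com/v_19rrnel2o… is only an example). The core fields to extract are:

Movie title
Rating
Lead actors
Release date
Synopsis
VIP flag

Note: iQIYI uses dynamic loading, anti-scraping verification, and similar mechanisms. This article is for technical study only; do not use scraped data for commercial purposes, and comply with the platform's robots.txt rules.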
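
As a rough illustration of the robots.txt point in the note, the sketch below uses the standard library's urllib.robotparser to test whether a URL may be fetched. The robots.txt content and the example.com URLs are hypothetical and only serve the demonstration; they say nothing about iQIYI's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
# The real file has to be fetched from the target site.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

def build_parser(robots_text):
    """Parse robots.txt text into a RobotFileParser without network access."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

def is_allowed(parser, url, user_agent="*"):
    """Return True if the parsed rules permit user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)

parser = build_parser(SAMPLE_ROBOTS_TXT)
print(is_allowed(parser, "https://example.com/movies/1.html"))     # True
print(is_allowed(parser, "https://example.com/private/data.html"))  # False
```

Against a live site, one would instead call parser.set_url() with the site's robots.txt address followed by parser.read(), and run the check before adding a link to the crawl list.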

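II. Basic Single-Threaded Crawler

Start with a single-threaded version, which serves as the foundation for the multithreaded rewrite later.

2.1 Core Code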

python

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import csv
import time

# Initialize the User-Agent generator
ua = UserAgent()

# Target URL (iQIYI VIP movie section; replace with a list page that is actually reachable)
BASE_URL = "https://www.iqiyi.com/v_19rrnel2o0.html"

# List that accumulates the scraped records
movie_data = []

def extract_text(soup, tag, class_name, default):
    """Return the stripped text of the first matching node, or default if none is found."""
    node = soup.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else default

def get_movie_detail(url):
    """
    Scrape the detail page of a single movie.
    :param url: URL of the movie detail page
    :return: dict of movie fields, or None if the request fails
    """
    headers = {
        "User-Agent": ua.random,
        "Referer": "https://www.iqiyi.com/",  # mimic navigation from the home page
        "Accept-Language": "zh-CN,zh;q=0.9"
    }
    try:
        # Set a timeout so a stalled request cannot block forever
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # raise on HTTP 4xx/5xx responses
        response.encoding = response.apparent_encoding  # let requests detect the encoding

        soup = BeautifulSoup(response.text, "html.parser")

        # Extract the fields (adjust the selectors to the actual page structure)
        movie_info = {
            "title": extract_text(soup, "h1", "main-title", "Unknown"),
            "score": extract_text(soup, "span", "score-num", "No rating"),
            "actors": extract_text(soup, "div", "actor-list", "Unknown"),
            "release_time": extract_text(soup, "span", "release-time", "Unknown"),
            "intro": extract_text(soup, "div", "intro-content", "No synopsis"),
            "is_vip": "Yes" if soup.find("span", class_="vip-tag") else "No"
        }
        return movie_info
    except Exception as e:
        print(f"Failed to scrape {url}: {e}")
        return None

def get_movie_list(url):
    """
    Collect the detail-page URLs of all movies on the VIP movie list page.
    :param url: URL of the list page
    :return: list of movie detail-page URLs
    """
    headers = {"User-Agent": ua.random}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        # Extract every movie detail link (adjust the selector to the actual page structure)
        movie_links = []
        link_tags = soup.find_all("a", class_="movie-item-link")
        for tag in link_tags:
            link = tag.get("href")
            if link and "iqiyi.com" in link:
                # Add the scheme to protocol-relative links such as //www.iqiyi.com/...
                if not link.startswith("http"):
                    link = "https:" + link
                movie_links.append(link)
        return movie_links
    except Exception as e:
        print(f"Failed to fetch the movie list: {e}")
        return []

def save_to_csv(data, filename="iqiyi_vip_movies.csv"):
    """
    Save the scraped data to a CSV file.
    :param data: list of movie dicts
    :param filename: name of the output file
    """
    if not data:
        print("No data to save")
        return
    # Column names of the CSV header
    headers = ["title", "score", "actors", "release_time", "intro", "is_vip"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} records to {filename}")

# Single-threaded run
if __name__ == "__main__":
    start_time = time.time()
    # Fetch the movie list
    movie_links = get_movie_list(BASE_URL)
    print(f"Found {len(movie_links)} movie links")

    # Scrape the detail pages one by one
    for link in movie_links:
        movie_info = get_movie_detail(link)
        if movie_info:
            movie_data.append(movie_info)
            print(f"Scraped: {movie_info['title']}")

    # Save the data
    save_to_csv(movie_data)
    end_time = time.time()
    print(f"Single-threaded crawl finished in {end_time - start_time:.2f} s")

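2.2 Code Notes

get_movie_list: parses the iQIYI VIP movie list page and extracts the detail-page URL of every movie;
get_movie_detail: requests a single movie's detail page, parses the HTML with BeautifulSoup, and extracts the core fields;
save_to_csv: writes the structured data to a CSV file for later analysis;
Key caveat: the page selectors (class names) must be adjusted to iQIYI's actual page structure, because the platform changes its page layout from time to time.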

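III. Multithreaded Rewrite

In the single-threaded crawler, the wait for each request (network latency) adds up, so overall efficiency is very low. With multiple threads, several scraping tasks run concurrently: while one thread waits for a response, the others can send their own requests, so the waiting time overlaps instead of accumulating.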
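
The effect can be seen without touching the network. The following is a minimal sketch in which fake_request is a hypothetical stand-in that simply sleeps for 50 ms to imitate network latency; it compares running eight such waits back to back with running them in eight threads.

```python
import threading
import time

def fake_request(delay=0.05):
    """Stand-in for an HTTP request: it only waits, as network latency would."""
    time.sleep(delay)

def run_sequential(n):
    """Run n fake requests one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(n):
        fake_request()
    return time.perf_counter() - start

def run_threaded(n):
    """Run n fake requests in n threads and return the elapsed seconds."""
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_request) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

sequential = run_sequential(8)
threaded = run_threaded(8)
print(f"sequential: {sequential:.2f} s, threaded: {threaded:.2f} s")
```

Because time.sleep releases the interpreter while waiting, much like a blocking socket read does, the threaded run takes roughly as long as a single wait rather than eight of them. Real requests also spend time parsing responses, so the gain in a real crawl is smaller.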

3.1 Core Rewritten Code

python

import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
import csv
import time
import threading
from queue import Queue, Empty

# Initialize the User-Agent generator
ua = UserAgent()

# Target URL (iQIYI VIP movie section)
BASE_URL = "https://www.iqiyi.com/v_19rrnel2o0.html"

# Thread-safe queue holding the movie links
link_queue = Queue()
# Shared result list; writes are guarded by data_lock
movie_data = []
data_lock = threading.Lock()

def extract_text(soup, tag, class_name, default):
    """Return the stripped text of the first matching node, or default if none is found."""
    node = soup.find(tag, class_=class_name)
    return node.get_text(strip=True) if node else default

def get_movie_detail_worker():
    """
    Worker function: take links from the queue and scrape each movie's detail page.
    """
    while True:
        try:
            # get_nowait() cannot block forever, unlike an empty() check followed
            # by get(), where another thread may take the last link in between
            url = link_queue.get_nowait()
        except Empty:
            break  # no links left, let the thread finish
        headers = {
            "User-Agent": ua.random,
            "Referer": "https://www.iqiyi.com/",
            "Accept-Language": "zh-CN,zh;q=0.9"
        }
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()
            response.encoding = response.apparent_encoding

            soup = BeautifulSoup(response.text, "html.parser")
            movie_info = {
                "title": extract_text(soup, "h1", "main-title", "Unknown"),
                "score": extract_text(soup, "span", "score-num", "No rating"),
                "actors": extract_text(soup, "div", "actor-list", "Unknown"),
                "release_time": extract_text(soup, "span", "release-time", "Unknown"),
                "intro": extract_text(soup, "div", "intro-content", "No synopsis"),
                "is_vip": "Yes" if soup.find("span", class_="vip-tag") else "No"
            }
            # Hold the lock while writing to the shared list
            with data_lock:
                movie_data.append(movie_info)
            print(f"{threading.current_thread().name} scraped: {movie_info['title']}")
        except Exception as e:
            print(f"{threading.current_thread().name} failed to scrape {url}: {e}")
        finally:
            link_queue.task_done()  # mark the task as done, even after a failure

def get_movie_list(url):
    """Collect the movie detail links and put them into the queue."""
    headers = {"User-Agent": ua.random}
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, "html.parser")
        link_tags = soup.find_all("a", class_="movie-item-link")
        for tag in link_tags:
            link = tag.get("href")
            if link and "iqiyi.com" in link:
                # Add the scheme to protocol-relative links such as //www.iqiyi.com/...
                if not link.startswith("http"):
                    link = "https:" + link
                link_queue.put(link)  # enqueue the link
        print(f"Found {link_queue.qsize()} movie links and added them to the queue")
    except Exception as e:
        print(f"Failed to fetch the movie list: {e}")

def save_to_csv(data, filename="iqiyi_vip_movies_multithread.csv"):
    """Save the data to a CSV file."""
    if not data:
        print("No data to save")
        return
    headers = ["title", "score", "actors", "release_time", "intro", "is_vip"]
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=headers)
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} records to {filename}")

# Multithreaded run
if __name__ == "__main__":
    start_time = time.time()

    # 1. Fetch the movie links and put them into the queue
    get_movie_list(BASE_URL)

    # 2. Start the worker threads (5-10 is suggested, to avoid an IP ban)
    thread_count = 8  # adjust the thread count as needed
    threads = []
    for i in range(thread_count):
        t = threading.Thread(target=get_movie_detail_worker, name=f"Thread-{i+1}")
        t.daemon = True  # daemon thread: exits automatically when the main program ends
        t.start()
        threads.append(t)

    # 3. Wait until every queued task has been marked done
    link_queue.join()

    # 4. Save the data
    save_to_csv(movie_data)

    end_time = time.time()
    print(f"Multithreaded crawl finished with {thread_count} threads in {end_time - start_time:.2f} s")

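3.2 How the Multithreaded Logic Works

Queue: holds the movie links waiting to be scraped and distributes the tasks among the threads; queue.Queue is thread-safe, so no extra locking is needed around it;
Lock: several threads write to the movie_data list at the same time, and the lock keeps those writes from interfering with each other. In CPython a single list.append is already atomic, but holding the lock makes the intent explicit and keeps the code correct if the guarded section later grows beyond one append;
Worker threads: each thread takes a link from the queue, scrapes it, and marks the task as done, repeating until the queue is empty;
Thread count: 5-10 threads is a sensible range. More threads put more load on the server and are more likely to trigger iQIYI's anti-scraping measures (such as IP bans).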
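
The queue, lock, and worker pieces described above can be exercised in isolation. The sketch below is a network-free illustration under stated assumptions: fake_fetch and the example.com URLs are hypothetical stand-ins for the real request-and-parse step, and the worker takes items with get_nowait() so that it never blocks on an empty queue.

```python
import threading
from queue import Queue, Empty

def fake_fetch(url):
    """Hypothetical stand-in for the real request-and-parse step."""
    return {"url": url, "title": "movie-" + url.rsplit("/", 1)[-1]}

def crawl(urls, thread_count=4):
    """Distribute urls over worker threads and return the collected records."""
    tasks = Queue()
    for url in urls:
        tasks.put(url)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                url = tasks.get_nowait()  # raises Empty instead of blocking
            except Empty:
                return  # queue drained, the worker exits
            try:
                record = fake_fetch(url)
                with lock:  # guard the shared list
                    results.append(record)
            finally:
                tasks.task_done()  # always balance the get, even on errors

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    tasks.join()  # returns once every task has been marked done
    for t in threads:
        t.join()
    return results

records = crawl([f"https://example.com/v/{i}" for i in range(20)])
print(len(records))  # 20
```

The order of records depends on thread scheduling, so code that needs a stable order should sort the result or carry an index in each record.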

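IV. Strategies for Handling Anti-Scraping Measures

As a large platform, iQIYI has a mature anti-scraping system. Keep the following points in mind when scraping in practice:

4.1 Basic Countermeasures

Random User-Agent: use fake-useragent to generate varying User-Agent strings that mimic different browsers;
Request delay: call time.sleep(random.uniform(0.5, 2)) in the scraping function (this requires import random) so that the request rate does not get too high;
Referer: include a Referer header to mimic navigation from the iQIYI home page;
Cookie persistence: if login verification is required, use requests.Session() to keep the logged-in state (the post-login cookies have to be obtained manually).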
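
The delay and header items above might be packaged as two small helpers. This is a sketch, not the article's own code: polite_delay and build_headers are hypothetical names, and the sleep parameter exists only so the delay can be replaced in tests.

```python
import random
import time

def polite_delay(low=0.5, high=2.0, sleep=time.sleep):
    """Sleep for a random interval between low and high seconds; return the interval."""
    delay = random.uniform(low, high)
    sleep(delay)
    return delay

def build_headers(user_agent):
    """Build request headers with the given User-Agent and an iQIYI Referer."""
    return {
        "User-Agent": user_agent,
        "Referer": "https://www.iqiyi.com/",
        "Accept-Language": "zh-CN,zh;q=0.9",
    }

# Demonstration with a no-op sleep so the example finishes instantly
used = polite_delay(sleep=lambda seconds: None)
print(f"would have slept {used:.2f} s")
print(build_headers("Mozilla/5.0 (example)")["Referer"])
```

In the crawler, polite_delay() would be called just before each requests.get, and build_headers(ua.random) would replace the inline headers dict. Note that with several threads each thread sleeps independently, so the overall request rate still scales with the thread count.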

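4.2 Advanced Countermeasures

IP proxy pool: if a single IP gets banned, set up a pool of proxy IPs and switch to a different IP for each request (the author recommends the 亿牛云 crawler proxy service);
Dynamic pages: if the movie data is loaded dynamically by JavaScript, use Selenium or Playwright to render the page in a browser;
CAPTCHA handling: if slider or image CAPTCHAs appear, integrate a third-party CAPTCHA recognition API (such as 超级鹰).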
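
One simple way to switch IPs per request is round-robin rotation. The sketch below assumes a fixed list of proxy addresses; ProxyRotator is a hypothetical helper and the 192.0.2.x addresses are documentation placeholders, not working proxies. The returned dict has the shape that requests accepts for its proxies argument.

```python
import itertools
import threading

class ProxyRotator:
    """Hand out proxies in round-robin order; safe to share between threads."""

    def __init__(self, proxy_urls):
        if not proxy_urls:
            raise ValueError("proxy_urls must not be empty")
        self._cycle = itertools.cycle(list(proxy_urls))
        self._lock = threading.Lock()

    def next_proxies(self):
        """Return the next proxy as a dict usable as requests' proxies argument."""
        with self._lock:  # advancing the shared iterator must not interleave
            proxy = next(self._cycle)
        return {"http": proxy, "https": proxy}

# Placeholder addresses from the documentation range, for illustration only
rotator = ProxyRotator(["http://192.0.2.10:8000", "http://192.0.2.11:8000"])
print(rotator.next_proxies())
print(rotator.next_proxies())
```

A worker would then call requests.get(url, headers=headers, proxies=rotator.next_proxies(), timeout=10). A production pool usually also removes proxies that fail repeatedly, which this sketch leaves out.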

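V. Results and Optimization Directions

5.1 Single-Threaded vs. Multithreaded

Mode	Time to scrape 50 movies	Resource utilization	Stability
Single-threaded	about 120 s	Low	High
8 threads	about 25 s	High	Medium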
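
The figures in the table translate into a speedup as follows. This short calculation uses only the two timings reported above; the per-thread efficiency is a derived number, not a measurement from the article.

```python
single_thread_seconds = 120  # single-threaded time from the table
multi_thread_seconds = 25    # 8-thread time from the table
thread_count = 8

speedup = single_thread_seconds / multi_thread_seconds
efficiency = speedup / thread_count  # fraction of the ideal 8x speedup

print(f"speedup: {speedup:.1f}x, per-thread efficiency: {efficiency:.0%}")
# prints "speedup: 4.8x, per-thread efficiency: 60%"
```

The gap between 4.8x and the ideal 8x is expected: parsing, thread start-up, and the sequential list-page request are not sped up by adding threads.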

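5.2 Optimization Directions

Use a thread pool (concurrent.futures.ThreadPoolExecutor) to simplify thread management;
Add de-duplication logic so that the same movie is not scraped twice;
Support resumable crawling, so that an interrupted run can continue from where it stopped;
Use the logging module in place of print output to make troubleshooting easier.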
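
Several of these directions fit together in a few lines. The following is a minimal sketch under stated assumptions: fake_fetch and the example.com URLs are hypothetical stand-ins for the real scraping step, and the done set represents URLs finished in an earlier run, which a resumable crawler would load from a file.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("crawler")

def fake_fetch(url):
    """Hypothetical stand-in for the real scraping step; fails for URLs ending in /bad."""
    if url.endswith("/bad"):
        raise ValueError("simulated failure")
    return {"url": url}

def crawl(urls, done=frozenset(), max_workers=8):
    """Fetch each distinct URL not already in done; return (records, failed_urls)."""
    # dict.fromkeys removes duplicates while keeping the original order
    pending = [u for u in dict.fromkeys(urls) if u not in done]
    records, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fake_fetch, u): u for u in pending}
        for future in as_completed(futures):
            url = futures[future]
            try:
                records.append(future.result())  # re-raises the worker's exception
                logger.info("scraped %s", url)
            except Exception as exc:
                logger.warning("failed %s: %s", url, exc)
                failed.append(url)
    return records, failed

records, failed = crawl(
    ["https://example.com/a", "https://example.com/a",
     "https://example.com/b", "https://example.com/bad"],
    done={"https://example.com/b"},
)
print(records, failed)
```

Results are gathered in the main thread through as_completed, so no lock is needed around the lists. For resumable crawling, the URLs in records could be appended to a checkpoint file after each success and passed back in as done on the next run.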

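Summary

This project used Python's threading and Queue to scrape iQIYI VIP movies concurrently. The core idea is to distribute tasks through a queue and protect shared data with a lock; in the comparison above this was roughly 4-5 times faster than the single-threaded version;
During scraping, the main challenge is iQIYI's anti-scraping measures, which can be addressed with random User-Agent strings, request delays, an IP proxy pool, and similar strategies, while strictly following compliance requirements;
The thread count must be kept under control so that a high request rate does not trigger a ban. In real use, a thread pool, resumable crawling, and the other optimizations listed above can further improve stability.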