Scheduled Scraping: Monitoring Changes in the Baidu Hot Search Board with Python

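In an age of information overload, the Baidu hot search board (百度热搜榜) is a barometer of trending topics across the Chinese internet. For public-opinion monitoring, market analysis, content creation, or business decisions, tracking how the board changes in real time is highly valuable. The board refreshes automatically every 10 minutes, so checking it by hand is inefficient and leaves no historical record or way to trace rank movements. An automated monitoring system written in Python that combines scheduled scraping, data storage, and change comparison is an efficient way to collect this data.

This article builds a complete Baidu hot search monitoring system from scratch and addresses four core problems: reliable execution of scheduled tasks, persistent data storage, analysis of board changes, and IP bans caused by anti-scraping measures. The system is lightweight, easy to deploy, and designed to run around the clock (7×24). It is aimed at hobbyists, operations engineers, and data analysts.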

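1. Overall System Design and Technology Choices

1.1 Core Functions

The system uses a three-layer design that automates the whole workflow:

Crawl layer: requests the Baidu hot search page on a schedule and parses the core fields (rank, title, hot score);
Storage layer: stores historical data in SQLite, a lightweight database that needs no separately deployed service;
Analysis layer: compares two consecutive crawls and identifies entries that rose, fell, appeared, or dropped off the board.

1.2 Technology Stack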

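Module	Technology	Why it was chosen
HTTP requests	requests	Simple API; supports proxies and custom request headers
HTML parsing	BeautifulSoup4	Built for HTML parsing; CSS selectors locate elements precisely
Scheduling	schedule	Lightweight with no extra dependencies; suited to small and medium scheduling workloads
Data storage	SQLite	Single-file database; zero configuration and easy to migrate
Anti-scraping protection	亿牛云 (16yun) tunnel proxy	Switches exit IPs automatically to avoid bans from frequent requests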

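2. Environment Setup and Dependency Installation

Install the three third-party packages with pip (pip install requests beautifulsoup4 schedule). SQLite support comes from the sqlite3 module in the Python standard library. What each package does:

requests: sends HTTP requests and fetches the source of the hot search page;
beautifulsoup4: parses the HTML page and extracts structured data;
schedule: runs the job on a timer with a configurable crawl interval.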
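
Before wiring the modules together, it can help to confirm that the three packages are importable. The helper below is a small sketch for that purpose; the name find_missing_modules is hypothetical and not part of the original article. Note that beautifulsoup4 is imported under the module name bs4.

```python
from importlib.util import find_spec

# Import names for the three dependencies; beautifulsoup4 is imported as "bs4".
REQUIRED_MODULES = ["requests", "bs4", "schedule"]


def find_missing_modules(module_names):
    """Return the subset of module_names that cannot be imported."""
    return [name for name in module_names if find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All dependencies are available.")
```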

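3. Core Implementation

3.1 Module 1: Crawling Hot Search Data (with Anti-Scraping Proxy Configuration)

Target page: top.baidu.com/board?tab=r… To avoid IP bans, we integrate a tunnel proxy that switches the exit IP on every request, and we send standard browser-like request headers.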

python

import requests
from bs4 import BeautifulSoup
import random
import time
from datetime import datetime

class BaiduHotSearchSpider:
    """Baidu hot search spider: fetches and parses the board (optional 亿牛云 tunnel proxy)."""
    def __init__(self, proxy_config=None):
        self.url = "https://top.baidu.com/board?tab=realtime"

        # Proxy settings: a dict with "host", "port", "user", and "pass" keys,
        # matching the proxy_config passed in by HotSearchMonitor.
        # Pass None to send requests directly without a proxy.
        self.proxy_config = proxy_config

        # Browser-like request headers
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
            "Referer": "https://www.baidu.com/",
            "Accept-Language": "zh-CN,zh;q=0.9"
        }

    def get_proxies(self):
        """Build the proxies argument for requests, or None if no proxy is configured."""
        if not self.proxy_config:
            return None
        cfg = self.proxy_config
        proxy_str = f"http://{cfg['user']}:{cfg['pass']}@{cfg['host']}:{cfg['port']}"

        # Random tunnel number to force an IP switch
        # (described by the author as the standard 亿牛云 usage)
        self.headers["Proxy-Tunnel"] = str(random.randint(1, 10000))
        return {"http": proxy_str, "https": proxy_str}

    def crawl(self):
        """Fetch the board and return structured hot search data, or None on failure."""
        proxies = self.get_proxies()
        try:
            response = requests.get(
                self.url, headers=self.headers, proxies=proxies, timeout=15
            )
            if response.status_code == 200:
                return self.parse_html(response.text)
            elif response.status_code == 429:
                print("Too many requests: rate limit triggered (HTTP 429)")
            elif response.status_code == 403:
                print("IP blocked (HTTP 403); switch proxies before retrying")
            else:
                print(f"Unexpected status code: {response.status_code}")
            return None
        except Exception as e:
            # Catch broadly so one failed crawl cannot stop the long-running scheduler
            print(f"Crawl failed: {e}")
            return None

    def parse_html(self, html):
        """Parse the HTML and extract rank, title, and hot score for each entry."""
        soup = BeautifulSoup(html, "html.parser")
        hot_list = []
        items = soup.select("div.category-wrap_iQLoo.horizontal_1eKyQ")
        for item in items:
            try:
                rank = item.select_one("div.c-single-text-addr").text.strip()
                title = item.select_one("div.c-single-text-ellipsis").text.strip()
                hot_score = item.select_one("div.hot-index_1WnTg")
                score = hot_score.text.strip() if hot_score else "0"
                hot_list.append({
                    "rank": int(rank),
                    "title": title,
                    "hot_score": score
                })
            except (AttributeError, ValueError):
                # Skip entries with a missing element or a non-numeric rank
                continue
        return hot_list


# ===================== Test run (execute this file directly) =====================
if __name__ == '__main__':
    spider = BaiduHotSearchSpider()
    data = spider.crawl()
    if data:
        print("[Crawl succeeded]")
        for item in data[:10]:
            print(f"#{item['rank']}: {item['title']} | hot score: {item['hot_score']}")
    else:
        print("Crawl failed")
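
The proxy URL above embeds the account name and password directly. If either contains characters such as "@" or ":", the URL can be parsed incorrectly, so percent-encoding the credentials is a reasonable precaution. The following is a minimal sketch under that assumption; build_proxies is a hypothetical helper, the host and credentials are placeholders, and the dictionary keys follow the host/port/user/pass layout used for proxy_config in section 3.4.

```python
from urllib.parse import quote


def build_proxies(config):
    """Build a requests-style proxies dict from a host/port/user/pass mapping."""
    # Percent-encode the credentials so characters such as "@" or ":"
    # cannot be mistaken for URL delimiters.
    user = quote(config["user"], safe="")
    password = quote(config["pass"], safe="")
    proxy_url = f"http://{user}:{password}@{config['host']}:{config['port']}"
    return {"http": proxy_url, "https": proxy_url}


# Placeholder values only; substitute your own account details.
example = build_proxies({
    "host": "proxy.example.com",
    "port": "8080",
    "user": "demo_user",
    "pass": "p@ss:word",
})
print(example["http"])
# http://demo_user:p%40ss%3Aword@proxy.example.com:8080
```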

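3.2 Module 2: Data Storage and Change Comparison

Two SQLite tables are used: hot_records stores historical hot search data and hot_changes stores board change records, so the data is retained permanently and consecutive crawls can be compared.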

python

import sqlite3
from datetime import datetime

class DataManager:
    """Data manager: handles storage, queries, and comparison."""
    def __init__(self, db_name="baidu_hot.db"):
        self.db_name = db_name
        self.init_database()

    def init_database(self):
        """Create the database and its tables if they do not exist."""
        conn = sqlite3.connect(self.db_name)
        cursor = conn.cursor()
        # Historical hot search records
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS hot_records (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                rank INTEGER,
                title TEXT,
                hot_score TEXT,
                crawl_time DATETIME,
                UNIQUE(rank, crawl_time)
            )
        ''')
        # Hot search change records
        cursor.execute('''
            CREATE TABLE IF NOT EXISTS hot_changes (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                title TEXT,
                old_rank INTEGER,
                new_rank INTEGER,
                change_type TEXT,
                crawl_time DATETIME
            )
        ''')
        conn.commit()
        conn.close()

    def save_data(self, hot_list):
        """Save the hot search data from the current crawl."""
        conn = sqlite3.connect(self.db_name)
        cursor = conn.cursor()
        # Timestamps are stored in local time
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        for item in hot_list:
            cursor.execute('''
                INSERT OR IGNORE INTO hot_records (rank, title, hot_score, crawl_time)
                VALUES (?, ?, ?, ?)
            ''', (item["rank"], item["title"], item["hot_score"], now))
        conn.commit()
        conn.close()

    def get_last_data(self):
        """Return the most recent crawl as a {title: rank} dict."""
        conn = sqlite3.connect(self.db_name)
        cursor = conn.cursor()
        cursor.execute('''
            SELECT rank, title FROM hot_records
            WHERE crawl_time = (SELECT MAX(crawl_time) FROM hot_records)
        ''')
        data = {row[1]: row[0] for row in cursor.fetchall()}
        conn.close()
        return data

    def compare_and_save(self, old_data, new_data):
        """Compare old and new data and save the change records.

        change_type values are stored in Chinese:
        "新增" (new), "上升" (up), "下降" (down), "消失" (dropped).
        """
        conn = sqlite3.connect(self.db_name)
        cursor = conn.cursor()
        now = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        new_dict = {item["title"]: item["rank"] for item in new_data}

        # Detect rank changes and new entries
        for title, new_rank in new_dict.items():
            old_rank = old_data.get(title)
            if old_rank is None:
                # New entry on the board
                cursor.execute(
                    "INSERT INTO hot_changes VALUES (?,?,?,?,?,?)",
                    (None, title, None, new_rank, "新增", now)
                )
            elif old_rank != new_rank:
                # Rank changed; a smaller rank number means a higher position
                change_type = "上升" if new_rank < old_rank else "下降"
                cursor.execute(
                    "INSERT INTO hot_changes VALUES (?,?,?,?,?,?)",
                    (None, title, old_rank, new_rank, change_type, now)
                )

        # Detect entries that dropped off the board
        for title in old_data.keys() - new_dict.keys():
            cursor.execute(
                "INSERT INTO hot_changes VALUES (?,?,?,?,?,?)",
                (None, title, old_data[title], None, "消失", now)
            )

        conn.commit()
        conn.close()
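
The comparison rules in compare_and_save can be tried without a database. The sketch below restates them as a pure function over two {title: rank} dicts; diff_rankings is a hypothetical name, and the English labels ("new", "up", "down", "dropped") are illustrative stand-ins for the Chinese values the article stores in hot_changes.

```python
def diff_rankings(old_ranks, new_ranks):
    """Compare two {title: rank} mappings and list the changes.

    Returns (title, old_rank, new_rank, change_type) tuples, where
    change_type is "new", "up", "down", or "dropped".
    """
    changes = []
    for title, new_rank in new_ranks.items():
        old_rank = old_ranks.get(title)
        if old_rank is None:
            changes.append((title, None, new_rank, "new"))
        elif new_rank < old_rank:
            # A smaller rank number means a higher position on the board.
            changes.append((title, old_rank, new_rank, "up"))
        elif new_rank > old_rank:
            changes.append((title, old_rank, new_rank, "down"))
    # Titles present before but absent now have dropped off the board.
    for title in sorted(old_ranks.keys() - new_ranks.keys()):
        changes.append((title, old_ranks[title], None, "dropped"))
    return changes


old = {"topic A": 1, "topic B": 2, "topic C": 3}
new = {"topic B": 1, "topic A": 2, "topic D": 3}
for change in diff_rankings(old, new):
    print(change)
```

Keeping the comparison free of database calls makes it straightforward to unit test; the storage step can then write whatever tuples it returns.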

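3.3 Module 3: Scheduling and System Startup

This module combines the spider and the data manager and uses schedule to run the job on a timer, with a configurable crawl interval (10 minutes by default).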

python

import sqlite3
import time
from datetime import datetime

import schedule

class HotSearchMonitor:
    """Top-level monitor that ties all the pieces together."""
    def __init__(self, proxy_config=None, interval=10):
        self.spider = BaiduHotSearchSpider(proxy_config)
        self.manager = DataManager()
        self.interval = interval

    def job(self):
        """The core task run on each scheduled tick."""
        print(f"\n===== {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} Crawl started =====")
        # 1. Load the previous crawl
        last_data = self.manager.get_last_data()
        # 2. Crawl the latest data
        new_data = self.spider.crawl()
        if not new_data:
            print("Crawl failed; waiting for the next run")
            return
        # 3. Save the latest data
        self.manager.save_data(new_data)
        # 4. Compare against the previous crawl
        if last_data:
            self.manager.compare_and_save(last_data, new_data)
            print("Comparison complete; changes recorded")
        print(f"Crawled {len(new_data)} hot search entries")

    def show_recent_changes(self, hours=1):
        """Print the changes recorded in the last N hours."""
        conn = sqlite3.connect(self.manager.db_name)
        cursor = conn.cursor()
        # crawl_time is stored in local time, while SQLite's datetime('now')
        # is UTC, so convert with 'localtime' before subtracting the window.
        cursor.execute('''
            SELECT title, old_rank, new_rank, change_type, crawl_time
            FROM hot_changes
            WHERE crawl_time > datetime('now', 'localtime', ?)
            ORDER BY crawl_time DESC
        ''', (f"-{int(hours)} hours",))
        changes = cursor.fetchall()
        conn.close()

        print(f"\n===== Hot search changes in the last {hours} hour(s) =====")
        for change in changes[:10]:
            title, old_r, new_r, type_, time_ = change
            # change_type values are the Chinese labels written by compare_and_save
            if type_ == "上升":
                print(f"[Up] {title}: {old_r} → {new_r}")
            elif type_ == "下降":
                print(f"[Down] {title}: {old_r} → {new_r}")
            elif type_ == "新增":
                print(f"[New] {title}, rank: {new_r}")
            elif type_ == "消失":
                print(f"[Dropped] {title}, previous rank: {old_r}")

    def start(self):
        """Start the monitoring loop."""
        schedule.every(self.interval).minutes.do(self.job)
        print(f"Monitoring started; crawling every {self.interval} minutes. Press Ctrl+C to stop")
        # Run once immediately
        self.job()
        while True:
            schedule.run_pending()
            time.sleep(1)

3.4 Complete Startup Code

Configure the proxy parameters and start the monitor:

python

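if __name__ == '__main__':
    # 亿牛云 tunnel proxy settings (replace with your own account credentials)
    proxy_config = {
        "host": "t.16yun.cn",
        "port": "31111",
        "user": "your_username",
        "pass": "your_password"
    }

    # Create a monitor that crawls every 10 minutes
    monitor = HotSearchMonitor(proxy_config=proxy_config, interval=10)
    # Start monitoring (blocks until interrupted with Ctrl+C)
    monitor.start()
    # start() does not return, so to view the last hour of changes,
    # call this separately, for example from another session:
    # monitor.show_recent_changes(hours=1)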

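4. Running the System and Sample Output

First run: the program crawls the board immediately and saves the results to the baidu_hot.db database;
Scheduled runs: every 10 minutes it crawls again, compares the result with the previous crawl, and records every change;
Change display: calling show_recent_changes prints the most recent new, rising, falling, and dropped entries;
Data retention: all historical data and change records are kept in the database for later analysis.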
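
As one example of later analysis, the rank history of a single title can be read straight from hot_records. The sketch below is illustrative only: rank_history is a hypothetical helper, the rows are made-up sample data, and an in-memory database with the same columns as hot_records stands in for baidu_hot.db.

```python
import sqlite3


def rank_history(conn, title):
    """Return [(crawl_time, rank), ...] for one title, oldest first."""
    cursor = conn.execute(
        "SELECT crawl_time, rank FROM hot_records "
        "WHERE title = ? ORDER BY crawl_time",
        (title,),
    )
    return cursor.fetchall()


# Demo with an in-memory database that mirrors the hot_records columns.
demo_conn = sqlite3.connect(":memory:")
demo_conn.execute(
    "CREATE TABLE hot_records (id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "rank INTEGER, title TEXT, hot_score TEXT, crawl_time DATETIME)"
)
demo_conn.executemany(
    "INSERT INTO hot_records (rank, title, hot_score, crawl_time) "
    "VALUES (?, ?, ?, ?)",
    [
        (5, "sample topic", "1000", "2025-12-29 10:00:00"),
        (3, "sample topic", "1500", "2025-12-29 10:10:00"),
        (1, "other topic", "9000", "2025-12-29 10:10:00"),
    ],
)
history = rank_history(demo_conn, "sample topic")
print(history)
demo_conn.close()
```

Against the real database, the same function can be called with sqlite3.connect("baidu_hot.db").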

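Sample output:

plaintext

===== 2025-12-29 10:00:00 Crawl started =====
Crawled 50 hot search entries

===== Hot search changes in the last 1 hour(s) =====
[Up] 2025 year-end review: 15 → 5
[New] Python technology summit, rank: 8
[Down] Spring Festival ticket-buying guide: 3 → 12
[Dropped] Trending scenic spot check-ins, previous rank: 20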

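5. Common Problems and Optimizations

5.1 Anti-Scraping and IP Bans

Problem: frequent crawling triggers 403/429 errors;
Solution: use a tunnel proxy to switch IPs automatically, lengthen the crawl interval, and refine the request headers.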
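
One way to space out requests after a failure is to retry with an increasing delay instead of hitting the site again immediately. The sketch below assumes a crawl callable that returns a list on success and None on failure, as BaiduHotSearchSpider.crawl does; crawl_with_retry and its default values are hypothetical and not part of the original system.

```python
import time


def crawl_with_retry(crawl, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call crawl() until it returns data or the attempts run out.

    crawl is any zero-argument callable that returns a non-empty list
    on success and None (or an empty list) on failure.
    """
    for attempt in range(attempts):
        data = crawl()
        if data:
            return data
        if attempt < attempts - 1:
            # Wait base_delay, then 2x, then 4x, ... before the next attempt.
            sleep(base_delay * (2 ** attempt))
    return None


# Simulated crawler that fails twice before succeeding; delays are
# recorded in a list instead of actually sleeping.
results = iter([None, None, [{"rank": 1, "title": "demo", "hot_score": "0"}]])
delays = []
data = crawl_with_retry(lambda: next(results), sleep=delays.append)
print(data)
print(delays)
```

In the monitor, this could wrap the call as crawl_with_retry(self.spider.crawl). Keep the total wait well below the crawl interval so retries do not overlap the next scheduled run.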

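5.2 Page Structure Changes

Problem: updates to the Baidu hot search page cause parsing to fail;
Solution: update the CSS selectors to match the latest page structure.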
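
Stale selectors tend to fail silently: the request succeeds but the parsed list comes back empty or nearly empty. A sanity check on the parsed result can surface this early. The sketch below is a hypothetical helper; the threshold of 10 items is an assumption, and the item keys (rank, title, hot_score) follow the dicts produced by parse_html.

```python
def find_parse_problems(hot_list, min_items=10):
    """Return a list of human-readable problems found in a parsed result."""
    problems = []
    if len(hot_list) < min_items:
        problems.append(
            f"only {len(hot_list)} items parsed (expected at least {min_items})"
        )
    ranks = [item["rank"] for item in hot_list]
    if len(set(ranks)) != len(ranks):
        problems.append("duplicate ranks found")
    if any(not item["title"] for item in hot_list):
        problems.append("empty title found")
    return problems


# An empty result usually means the selectors no longer match the page.
print(find_parse_problems([]))
```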

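5.3 Directions for Optimization

Data visualization: integrate Matplotlib or Flask to build a dashboard for the hot search data;
Notifications: add WeChat or email alerts so changes are reported as they happen;
Distributed deployment: crawl from multiple nodes to improve stability;
Data export: support exporting the data to Excel or CSV files.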
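
For the export item, the standard library already covers CSV. The sketch below writes hot_records to any file-like object; export_records_csv is a hypothetical helper, and the demo uses an in-memory database and made-up rows in place of baidu_hot.db.

```python
import csv
import io
import sqlite3


def export_records_csv(conn, out_file):
    """Write every row of hot_records to out_file as CSV; return the row count."""
    cursor = conn.execute(
        "SELECT crawl_time, rank, title, hot_score FROM hot_records "
        "ORDER BY crawl_time, rank"
    )
    writer = csv.writer(out_file)
    writer.writerow(["crawl_time", "rank", "title", "hot_score"])
    count = 0
    for row in cursor:
        writer.writerow(row)
        count += 1
    return count


# Demo: export from an in-memory table into a string buffer.
export_conn = sqlite3.connect(":memory:")
export_conn.execute(
    "CREATE TABLE hot_records (rank INTEGER, title TEXT, "
    "hot_score TEXT, crawl_time DATETIME)"
)
export_conn.execute(
    "INSERT INTO hot_records VALUES (1, 'demo', '100', '2025-12-29 10:00:00')"
)
buffer = io.StringIO()
exported = export_records_csv(export_conn, buffer)
export_conn.close()
print(exported)
print(buffer.getvalue())
```

To write a real file, pass open("hot_records.csv", "w", newline="", encoding="utf-8-sig") as out_file; the utf-8-sig encoding adds a byte order mark, which helps Excel display Chinese titles correctly.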

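6. Summary

This Python-based monitoring system for the Baidu hot search board addresses the main pain points of manual checking: low efficiency, no data retention, and difficulty getting past anti-scraping measures. Its modular design keeps the code concise and maintainable, and it covers the full path from crawling through storage to change analysis.

Whether used by individuals to follow trending topics or by companies for public-opinion monitoring and business analysis, the system is practical. With a proxy service integrated, it can run stably over long periods and capture changes in trending topics as they happen, supporting faster and better-informed data-driven decisions.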