Advanced Python Learning

Web Crawler - Simple Weather Data Scraping and Saving

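Core features: scrape the current weather and the 3-day forecast (temperature, humidity, conditions, wind force) for specified cities, save the data to an Excel file or a local TXT file, and support batch scraping of multiple cities.
Tech stack:
- Network requests: requests (sends GET requests to fetch page source or API data);
- Data parsing: BeautifulSoup4 (parses HTML pages), or parse a JSON API response directly (simpler);
- Data storage: openpyxl (saves to Excel) or basic file operations (saves to TXT).
Implementation approach:
- Step 1: Find a public weather API (or a simple weather page, such as a city sub-page on weather.com.cn);
- Step 2: Send the request with requests.get(), adding a User-Agent request header so the request looks like it comes from a browser and is less likely to be rejected by basic anti-scraping checks;
- Step 3: For an HTML page, use BeautifulSoup to locate the tags (such as div and span) that hold the weather data; for a JSON API, read the data directly from response.json();
- Step 4: Organize the scraped city weather data into dictionaries and lists;
- Step 5: Use openpyxl to create an Excel file, write the data into cells, and save it.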

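I. Project Preparation

1. Install the required third-party libraries

The project depends on three third-party libraries (beautifulsoup4 is needed only for the HTML approach). Open a terminal or command prompt and run the following install commands: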

bash

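# Send network requests
pip install requests
# Parse HTML pages (can be skipped if you only use the JSON API; both approaches are shown here)
pip install beautifulsoup4
# Read and write Excel files (the main storage library)
pip install openpyxl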

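2. Technology choices

Network requests: requests (lightweight and easy to use; supports custom request headers, which gets past basic anti-scraping checks)
Data parsing: two options (HTML parsing with BeautifulSoup4; JSON API parsing handled directly, which is simpler and faster)
Data storage: openpyxl (reads and writes Excel files in .xlsx format; Excel itself does not need to be installed)
Supporting pieces: basic Python data structures (lists and dictionaries) to hold the scraped weather data temporarily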

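II. Complete Implementation (Two Approaches)

Approach 1: JSON API parsing (recommended; simpler and less error-prone)

The weather endpoint used here is a public test API that needs no key and returns structured JSON. Its availability is not guaranteed; if requests fail, substitute another JSON weather API and adjust the field names accordingly.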

python

import requests
from openpyxl import Workbook
import time

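def get_weather_by_city(city_name):
    """
    Fetch weather data for a city by name.
    :param city_name: city name as the API expects it (e.g. 北京, 上海, 广州)
    :return: dict with the current weather and the next 3 days' forecast, or None on failure
    """
    # Public weather API endpoint (no key required; for testing only)
    url = "http://wthrcdn.etouch.cn/weather_mini"
    # Query parameters (the city name)
    params = {
        "city": city_name
    }
    # Request headers (look like a browser to get past basic anti-scraping checks)
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    def strip_cdata(text):
        # Some responses may wrap values in a CDATA marker; this is a no-op if absent
        return text.replace("<![CDATA[", "").replace("]]>", "")

    try:
        # Send the GET request with a 10-second timeout
        response = requests.get(url, params=params, headers=headers, timeout=10)
        # Raise HTTPError if the server answered with a 4xx or 5xx status code
        response.raise_for_status()
        # Decode the JSON body into a Python dict
        weather_data = response.json()

        # Check that the API returned valid data
        if weather_data.get("status") != 1000:
            print(f"Failed to get weather data for {city_name}: {weather_data.get('desc', 'unknown error')}")
            return None

        # Extract the core fields
        data = weather_data["data"]
        today = data["forecast"][0]  # today's forecast entry
        real_time_weather = data["wendu"]  # current temperature
        real_time_type = today["type"]  # today's weather type (sunny, rain, ...)
        real_time_wind = strip_cdata(today["fengli"])  # today's wind force
        # Humidity may be missing from the response, so fall back instead of raising KeyError
        real_time_humidity = data.get("shidu", "unknown")

        # Extract the next 3 days' forecast (skip today, take the following 3 days)
        future_3d_weather = []
        for forecast in data["forecast"][1:4]:  # indexes 1-3 are the next 3 days
            day_weather = {
                "date": forecast["date"],  # date
                "type": forecast["type"],  # weather type
                "high_temp": forecast["high"].replace("高温 ", ""),  # daily high
                "low_temp": forecast["low"].replace("低温 ", ""),  # daily low
                "wind": forecast["fengxiang"] + " " + strip_cdata(forecast["fengli"])  # wind direction + force
            }
            future_3d_weather.append(day_weather)

        # Assemble the return value
        result = {
            "city": city_name,
            "real_time_wendu": real_time_weather,
            "real_time_type": real_time_type,
            "real_time_wind": real_time_wind,
            "real_time_humidity": real_time_humidity,
            "future_3d": future_3d_weather
        }

        print(f"Fetched weather data for {city_name}")
        return result

    except requests.exceptions.RequestException as e:
        print(f"Request error while fetching weather data for {city_name}: {str(e)}")
        return None
    except (KeyError, IndexError, ValueError) as e:
        # The body was not JSON, or it lacked a field this function expects
        print(f"Unexpected response format for {city_name}: {e!r}")
        return None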

def save_weather_to_excel(weather_list, save_path="天气数据汇总.xlsx"):
    """
    Save the scraped weather data for multiple cities to an Excel file.
    :param weather_list: list of weather dicts, one per city
    :param save_path: path of the Excel file (defaults to the current directory)
    """
    # Create a new Excel workbook
    wb = Workbook()
    # Get the default (first) worksheet
    ws = wb.active
    # Set the worksheet name
    ws.title = "City Weather"

    # Write the header (first row)
    header = [
        "City", "Current Temp (℃)", "Current Weather", "Current Wind", "Current Humidity",
        "Day 1 Date", "Day 1 Weather", "Day 1 High", "Day 1 Low", "Day 1 Wind",
        "Day 2 Date", "Day 2 Weather", "Day 2 High", "Day 2 Low", "Day 2 Wind",
        "Day 3 Date", "Day 3 Weather", "Day 3 High", "Day 3 Low", "Day 3 Wind"
    ]
    ws.append(header)

    # Walk the weather list and write one row per city
    for weather in weather_list:
        if not weather:  # skip invalid entries
            continue
        # Build the row (assumes future_3d holds at least 3 entries)
        row_data = [
            weather["city"],
            weather["real_time_wendu"],
            weather["real_time_type"],
            weather["real_time_wind"],
            weather["real_time_humidity"],
            # Day 1
            weather["future_3d"][0]["date"],
            weather["future_3d"][0]["type"],
            weather["future_3d"][0]["high_temp"],
            weather["future_3d"][0]["low_temp"],
            weather["future_3d"][0]["wind"],
            # Day 2
            weather["future_3d"][1]["date"],
            weather["future_3d"][1]["type"],
            weather["future_3d"][1]["high_temp"],
            weather["future_3d"][1]["low_temp"],
            weather["future_3d"][1]["wind"],
            # Day 3
            weather["future_3d"][2]["date"],
            weather["future_3d"][2]["type"],
            weather["future_3d"][2]["high_temp"],
            weather["future_3d"][2]["low_temp"],
            weather["future_3d"][2]["wind"]
        ]
        ws.append(row_data)

    # Save the Excel file
    wb.save(save_path)
    print(f"All weather data saved to: {save_path}")

if __name__ == "__main__":
    # Cities to scrape (add or change as needed)
    city_list = ["北京", "上海", "广州", "深圳", "杭州", "成都", "重庆"]
    # Holds the weather data for all cities
    all_weather_data = []

    # Scrape each city in turn
    for city in city_list:
        weather = get_weather_by_city(city)
        if weather:
            all_weather_data.append(weather)
        # Wait 1 second so frequent requests do not trigger rate limiting
        time.sleep(1)

    # Save everything to Excel
    if all_weather_data:
        save_weather_to_excel(all_weather_data)
    else:
        print("No valid weather data retrieved; nothing to save")

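Approach 2: HTML page parsing (with BeautifulSoup4)

If you want to practice HTML parsing, this approach scrapes the simple city pages on weather.com.cn and extracts the weather data from them. The tag and class names in the code depend on the page's markup; if the site's layout differs or changes, find() returns None and the selectors need to be updated.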

python

import requests
from bs4 import BeautifulSoup
from openpyxl import Workbook
import time

def get_weather_by_html(city_url):
    """
    Parse weather data from a weather page's HTML.
    :param city_url: URL of the city's weather page (e.g. the Beijing page)
    :return: weather data dict, or None on failure
    """
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    }

    try:
        response = requests.get(city_url, headers=headers, timeout=10)
        response.raise_for_status()
        # Set the encoding explicitly to avoid garbled Chinese text
        response.encoding = "utf-8"
        # Build the BeautifulSoup object to parse the HTML
        soup = BeautifulSoup(response.text, "html.parser")

        # NOTE: the tag and class names below depend on the page's markup.
        # If find() returns None, the resulting AttributeError is caught below;
        # inspect the page in the browser's developer tools and update the selectors.

        # City name
        city_name = soup.find("h1").text.strip().replace("天气", "")
        # Current temperature
        real_temp = soup.find("div", class_="wea_weather clearfix").find("em").text.strip()
        # Current weather type
        real_type = soup.find("div", class_="wea_weather clearfix").find("b").text.strip()
        # Current wind force and humidity
        real_info = soup.find("div", class_="wea_about clearfix").text.strip().split()
        real_wind = real_info[0] if len(real_info) > 0 else "unknown"
        real_humidity = real_info[1] if len(real_info) > 1 else "unknown"

        # Next 3 days' forecast
        future_3d = []
        forecast_list = soup.find("ul", class_="wea_forecast clearfix").find_all("li")[1:4]  # next 3 days
        for li in forecast_list:
            date = li.find("h4").text.strip()
            weather_type = li.find("p", class_="wea").text.strip()
            temp_range = li.find("p", class_="tem").text.strip().split("/")
            high_temp = temp_range[0].strip() if len(temp_range) > 0 else "unknown"
            low_temp = temp_range[1].strip() if len(temp_range) > 1 else "unknown"
            wind = li.find("p", class_="win").text.strip()

            future_3d.append({
                "date": date,
                "type": weather_type,
                "high_temp": high_temp,
                "low_temp": low_temp,
                "wind": wind
            })

        # Assemble the result
        result = {
            "city": city_name,
            "real_time_wendu": real_temp,
            "real_time_type": real_type,
            "real_time_wind": real_wind,
            "real_time_humidity": real_humidity,
            "future_3d": future_3d
        }

        print(f"Fetched weather data for {city_name} (HTML parsing)")
        return result

    except Exception as e:
        print(f"Error while parsing weather data from HTML: {str(e)}")
        return None

def save_weather_to_excel(weather_list, save_path="天气数据_HTML解析.xlsx"):
    """Same logic as save_weather_to_excel in Approach 1; repeated here so this script runs on its own."""
    wb = Workbook()
    ws = wb.active
    ws.title = "City Weather"
    header = [
        "City", "Current Temp (℃)", "Current Weather", "Current Wind", "Current Humidity",
        "Day 1 Date", "Day 1 Weather", "Day 1 High", "Day 1 Low", "Day 1 Wind",
        "Day 2 Date", "Day 2 Weather", "Day 2 High", "Day 2 Low", "Day 2 Wind",
        "Day 3 Date", "Day 3 Weather", "Day 3 High", "Day 3 Low", "Day 3 Wind"
    ]
    ws.append(header)
    for weather in weather_list:
        if not weather:
            continue
        row_data = [
            weather["city"],
            weather["real_time_wendu"],
            weather["real_time_type"],
            weather["real_time_wind"],
            weather["real_time_humidity"],
            weather["future_3d"][0]["date"],
            weather["future_3d"][0]["type"],
            weather["future_3d"][0]["high_temp"],
            weather["future_3d"][0]["low_temp"],
            weather["future_3d"][0]["wind"],
            weather["future_3d"][1]["date"],
            weather["future_3d"][1]["type"],
            weather["future_3d"][1]["high_temp"],
            weather["future_3d"][1]["low_temp"],
            weather["future_3d"][1]["wind"],
            weather["future_3d"][2]["date"],
            weather["future_3d"][2]["type"],
            weather["future_3d"][2]["high_temp"],
            weather["future_3d"][2]["low_temp"],
            weather["future_3d"][2]["wind"]
        ]
        ws.append(row_data)
    wb.save(save_path)
    print(f"Weather data saved to: {save_path}")

if __name__ == "__main__":
    # City weather page URLs (examples; replace with other cities as needed)
    city_url_list = [
        "http://www.weather.com.cn/weather/101010100.shtml",  # Beijing
        "http://www.weather.com.cn/weather/101020100.shtml",  # Shanghai
        "http://www.weather.com.cn/weather/101280101.shtml"   # Guangzhou
    ]
    all_weather = []

    for url in city_url_list:
        weather = get_weather_by_html(url)
        if weather:
            all_weather.append(weather)
        time.sleep(1)

    if all_weather:
        save_weather_to_excel(all_weather)
    else:
        print("No valid weather data retrieved")
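
The URLs in city_url_list differ only in a numeric city code, so the list can also be generated from a lookup table. The following is a minimal sketch, assuming the URL pattern shown in the example holds for other cities. CITY_CODES, URL_TEMPLATE, and build_city_urls are illustrative names that are not part of the original script, and only the three codes from the example are included.

```python
# City codes taken from the example URLs; add more as needed.
CITY_CODES = {
    "北京": "101010100",
    "上海": "101020100",
    "广州": "101280101",
}
# Assumed URL pattern, copied from the example URLs.
URL_TEMPLATE = "http://www.weather.com.cn/weather/{code}.shtml"


def build_city_urls(cities, codes=CITY_CODES):
    """Return (urls, unknown): page URLs for known cities, and names with no code."""
    urls, unknown = [], []
    for city in cities:
        code = codes.get(city)
        if code is None:
            unknown.append(city)
        else:
            urls.append(URL_TEMPLATE.format(code=code))
    return urls, unknown


urls, unknown = build_city_urls(["北京", "广州", "拉萨"])
print(urls)     # URLs for the cities that have a code
print(unknown)  # cities that still need a code added to CITY_CODES
```

Returning the unknown names separately, rather than silently skipping them, makes it easier to see which cities still need a code.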

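III. Key Features Explained

1. Avoiding anti-scraping measures

Set the User-Agent request header: makes the request look like it comes from a browser, which avoids the most basic server-side blocking of scripts.
Add time.sleep(1): waits 1 second after each request so the scraper does not put load on the API or get rate limited.
Exception handling: try-except catches network timeouts, failed requests, and similar errors, which makes the program more robust.
Check the response status: response.raise_for_status() inspects the status code and raises an HTTPError for 4xx and 5xx responses.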
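
Building on the exception handling above, a failed request can be retried with a growing delay instead of being dropped after one attempt. This is a minimal sketch rather than part of the original scripts: fetch_with_retry is a hypothetical helper, and fetch stands for any function that, like get_weather_by_city, returns None on failure.

```python
import time


def fetch_with_retry(fetch, city_name, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(city_name) until it returns a non-None result or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        result = fetch(city_name)
        if result is not None:
            return result
        if attempt < max_attempts:
            # Wait longer after each failure: base_delay, then 2x, then 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return None
```

In the main loop this would be called as fetch_with_retry(get_weather_by_city, city). The sleep parameter exists only so the delay can be replaced in tests.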

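2. Data parsing

JSON parsing: response.json() converts the response body into a Python dict, so values can be read by key without walking an HTML structure, which is faster and simpler.
HTML parsing: BeautifulSoup's find() and find_all() methods locate elements by tag name plus class attribute and extract their text; the response encoding must be set for Chinese text (response.encoding = "utf-8").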
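
Separating the parsing step from the network request makes it possible to try the key-based extraction on a saved response, without calling the API. The sketch below assumes a response shaped like the fields the JSON approach reads. SAMPLE_RESPONSE is made-up sample data, parse_weather_payload and strip_cdata are hypothetical helpers, and the CDATA wrapper on the wind value is an assumption about how some responses may look.

```python
import json

# Hypothetical sample response (not real data), shaped like the fields used above.
SAMPLE_RESPONSE = """
{"status": 1000, "desc": "OK", "data": {
  "city": "北京", "wendu": "25",
  "forecast": [
    {"date": "1日星期一", "type": "晴", "high": "高温 30℃", "low": "低温 20℃",
     "fengxiang": "南风", "fengli": "<![CDATA[3级]]>"},
    {"date": "2日星期二", "type": "多云", "high": "高温 28℃", "low": "低温 19℃",
     "fengxiang": "北风", "fengli": "2级"}
  ]}}
"""


def strip_cdata(text):
    """Drop a CDATA wrapper if one is present; plain text passes through unchanged."""
    return text.replace("<![CDATA[", "").replace("]]>", "")


def parse_weather_payload(payload):
    """Turn a decoded response dict into a flat result, or None if it is unusable."""
    if payload.get("status") != 1000:
        return None
    data = payload.get("data") or {}
    forecast = data.get("forecast") or []
    if not forecast:
        return None
    days = [
        {
            "date": day.get("date", ""),
            "type": day.get("type", ""),
            "high_temp": day.get("high", "").replace("高温 ", ""),
            "low_temp": day.get("low", "").replace("低温 ", ""),
            "wind": f'{day.get("fengxiang", "")} {strip_cdata(day.get("fengli", ""))}'.strip(),
        }
        for day in forecast
    ]
    return {
        "city": data.get("city", ""),
        "real_time_wendu": data.get("wendu", ""),
        # Missing keys fall back to a default instead of raising KeyError
        "real_time_humidity": data.get("shidu", "unknown"),
        "today": days[0],
        "future": days[1:4],
    }


result = parse_weather_payload(json.loads(SAMPLE_RESPONSE))
print(result["today"])
print(result["future"])
```

Using dict.get() with defaults means a missing field produces a placeholder value rather than aborting the whole city.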

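3. Saving to Excel

Workbook(): creates a new Excel workbook.
wb.active: returns the default worksheet; its name can be changed through ws.title.
ws.append(): writes a whole row (header or data) at once, with no need to address individual cells.
wb.save(): saves the file to the given path, so both the file name and the directory can be customized.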
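
Because ws.append() takes a plain list, the row can be built by a small function instead of indexing each forecast day by hand. The following sketch assumes the weather dict layout used in this article. weather_to_row and FORECAST_KEYS are hypothetical names, and padding missing days with an empty string is a design choice made here, not something the original code does.

```python
FORECAST_KEYS = ("date", "type", "high_temp", "low_temp", "wind")


def weather_to_row(weather, days=3, missing=""):
    """Flatten one city's weather dict into a list suitable for ws.append()."""
    row = [
        weather.get("city", missing),
        weather.get("real_time_wendu", missing),
        weather.get("real_time_type", missing),
        weather.get("real_time_wind", missing),
        weather.get("real_time_humidity", missing),
    ]
    forecasts = weather.get("future_3d", [])
    for i in range(days):
        # Pad with empty cells when fewer forecast days are available
        day = forecasts[i] if i < len(forecasts) else {}
        row.extend(day.get(key, missing) for key in FORECAST_KEYS)
    return row


sample = {
    "city": "北京",
    "real_time_wendu": "25",
    "real_time_type": "晴",
    "real_time_wind": "3级",
    "real_time_humidity": "40%",
    "future_3d": [
        {"date": "2日星期二", "type": "多云", "high_temp": "28℃", "low_temp": "19℃", "wind": "北风 2级"},
    ],
}
row = weather_to_row(sample)
print(len(row))
print(row[:10])
```

Inside the save function the row would then be written with ws.append(weather_to_row(weather)), and every row has the same number of columns as the header even when a forecast day is missing.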

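IV. Running the Program and Expected Results

1. Steps

Copy the code above into an editor such as VS Code or PyCharm and save it as a .py file (for example weather_spider.py).
Make sure the required third-party libraries are installed (see the Project Preparation section).
Run the Python file and wait for it to finish.
The generated Excel file (for example 天气数据汇总.xlsx) appears in the directory the program was run from.

2. Results

Console: prints the scraping status (success or failure) for each city, then the save path.
Excel file: contains the header plus each city's current weather and 3-day forecast in a regular layout, ready to open and edit in Excel.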

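V. Optimization and Extension Ideas

More cities: edit city_list or city_url_list to add the city names or URLs you need.
More storage options: besides Excel, save the data to TXT, CSV, or a SQLite database.
Graphical interface: build a simple GUI with tkinter that lets the user enter a city name and displays the weather data.
Scheduled scraping: use the schedule library to run the job on a timer (for example, fetch the day's weather automatically at 8 a.m. each day).
Retry on failure: retry cities that failed (for example, up to 3 times) to raise the success rate.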
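
As one example of the storage options listed above, the current-weather columns can be written to CSV with the standard library alone. This is a minimal sketch, assuming the weather dict layout used in this article. write_weather_csv is a hypothetical helper, the sample record is made up, and the forecast columns are left out to keep the example short.

```python
import csv
import io


def write_weather_csv(weather_list, stream):
    """Write one CSV row per city to an open text stream; return the number of rows written."""
    writer = csv.writer(stream)
    writer.writerow(["city", "temperature", "type", "wind", "humidity"])
    count = 0
    for weather in weather_list:
        if not weather:  # skip failed cities
            continue
        writer.writerow([
            weather.get("city", ""),
            weather.get("real_time_wendu", ""),
            weather.get("real_time_type", ""),
            weather.get("real_time_wind", ""),
            weather.get("real_time_humidity", ""),
        ])
        count += 1
    return count


# Made-up sample data, written to an in-memory buffer for demonstration.
buffer = io.StringIO()
records = [
    {"city": "北京", "real_time_wendu": "25", "real_time_type": "晴",
     "real_time_wind": "3级", "real_time_humidity": "40%"},
    None,
]
written = write_weather_csv(records, buffer)
print(written)
print(buffer.getvalue())
```

To write a real file, pass a stream opened with open(path, "w", newline="", encoding="utf-8-sig"). The newline="" argument is what the csv module expects, and the utf-8-sig encoding helps Excel detect the encoding of the Chinese text.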

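Summary

The project's core stack is requests (network requests), BeautifulSoup4 (optional, for HTML parsing), and openpyxl (Excel storage), chosen for ease of use and practicality.
It provides two implementations: JSON API parsing (recommended; simple and efficient) and HTML parsing (suited to learning page-parsing techniques). Both are complete scripts, though each depends on its data source still being available and unchanged.
Key points: setting request headers, adding delays, handling exceptions, and writing rows to Excel in bulk. These techniques carry over to other scraping projects.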
