Optimizing Python Crawler Performance: Asynchronous Crawling of Sina Finance Data

I. Bottlenecks of Synchronous Crawlers

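A traditional synchronous crawler (for example requests+BeautifulSoup) must wait for the server's response to each request before it can send the next one. With large volumes of data, this blocking I/O causes the following problems:

Slow: requests run strictly one after another, so the available network bandwidth is never fully used.
Prone to bans: high-frequency requests can trigger IP restrictions or CAPTCHAs.
Wasted resources: the CPU sits idle while waiting on I/O.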

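Solution: Asynchronous Crawling
Python's asyncio and aiohttp libraries provide non-blocking I/O, which allows many requests to be in flight at the same time and substantially improves crawl throughput.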
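To make the difference concrete, here is a minimal sketch that uses only the standard library. It is an illustration rather than a real crawler: `fake_request` is a hypothetical stand-in that sleeps instead of calling the network, and the delay value is an arbitrary assumption.

```python
import asyncio
import time


async def fake_request(i: int, delay: float = 0.1) -> int:
    """Stand-in for an HTTP request: sleeps instead of touching the network."""
    await asyncio.sleep(delay)
    return i


async def run_sequential(n: int) -> list[int]:
    # Await each "request" before starting the next one (synchronous style).
    return [await fake_request(i) for i in range(n)]


async def run_concurrent(n: int) -> list[int]:
    # Start all "requests" at once and wait for them together.
    return await asyncio.gather(*(fake_request(i) for i in range(n)))


def timed(coro_fn, n: int) -> tuple[list[int], float]:
    """Run coro_fn(n) in a fresh event loop and return (result, seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro_fn(n))
    return result, time.perf_counter() - start


if __name__ == "__main__":
    seq, seq_time = timed(run_sequential, 5)
    con, con_time = timed(run_concurrent, 5)
    print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both variants return the same results; the concurrent one finishes sooner because the waits overlap instead of adding up.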

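II. Choosing the Asynchronous Crawling Stack

Technology	Use case	Advantage
`aiohttp`	HTTP requests	Asynchronous HTTP client with support for high concurrency
`asyncio`	Event loop	Python's built-in asynchronous I/O framework
`aiofiles`	Asynchronous file storage	Keeps file writes from blocking the event loop
`uvloop`	Faster event loop	Drop-in replacement for the default `asyncio` loop, reported to be 2-4x faster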

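III. Hands-On: Asynchronously Crawling Sina Finance Stock Data

Goals

Crawl real-time quotes for A-share stocks from Sina Finance (code, name, price, percentage change, and so on).
Use aiohttp to issue requests with high concurrency.
Save the results to a CSV file to avoid losing data.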

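Step 1: Analyze the data interface

Sina Finance usually serves stock data through an API. The requests can be captured and inspected with the browser developer tools (F12):

Example endpoint: https://finance.sina.com.cn/realstock/company/sh600000/nc.shtml
Data format: some data is rendered directly in the HTML, and some is loaded via Ajax (for example, intraday data).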

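Step 2: Install the dependencies

The code below needs aiohttp, aiofiles and beautifulsoup4, plus uvloop as an optional extra on Unix systems. All of them can be installed with pip.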

Step 3: Write the asynchronous crawler

import asyncio
import csv
import io
import time

import aiofiles
import aiohttp
from bs4 import BeautifulSoup

# Replace with the Sina Finance stock-list API (example)
STOCK_LIST_API = "https://finance.sina.com.cn/stock/sl/stock_list.html"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
}

async def fetch(session, url):
    """Fetch a page asynchronously and return its body as text."""
    async with session.get(url, headers=HEADERS) as response:
        response.raise_for_status()  # treat HTTP error statuses as failures
        return await response.text()

def parse_stock_data(html):
    """Parse stock data (example: name and price only).

    Parsing does no I/O, so this is a plain function rather than a coroutine.
    """
    soup = BeautifulSoup(html, "html.parser")
    name_tag = soup.select_one(".stock-name")
    price_tag = soup.select_one(".price")
    return {
        "name": name_tag.text.strip() if name_tag else "N/A",
        "price": price_tag.text.strip() if price_tag else "N/A",
    }

async def save_to_csv(data, filename="stocks.csv"):
    """Append one row to the CSV file asynchronously."""
    # csv.writer expects a synchronous file object, so format the row in
    # memory first and hand the finished text to aiofiles.
    buffer = io.StringIO()
    csv.writer(buffer).writerow([data["name"], data["price"]])
    async with aiofiles.open(filename, mode="a", encoding="utf-8", newline="") as f:
        await f.write(buffer.getvalue())

async def crawl_stock(stock_code, session):
    """Crawl the data for a single stock."""
    url = f"https://finance.sina.com.cn/realstock/company/{stock_code}/nc.shtml"
    try:
        html = await fetch(session, url)
        data = parse_stock_data(html)
        await save_to_csv(data)
        print(f"Crawl succeeded: {stock_code} - {data['name']}")
    except Exception as e:
        print(f"Crawl failed: {stock_code} - {e}")

async def main():
    """Main coroutine: crawl several stocks concurrently."""
    stock_codes = ["sh600000", "sh601318", "sz000001"]  # example stock codes (extend as needed)

    # Create one aiohttp session and share it across all requests
    async with aiohttp.ClientSession() as session:
        tasks = [crawl_stock(code, session) for code in stock_codes]
        await asyncio.gather(*tasks)

if __name__ == "__main__":
    # Use uvloop to speed up the event loop (Unix only). The policy must be
    # set before asyncio.run() creates the loop, otherwise it has no effect.
    try:
        import uvloop
        asyncio.set_event_loop_policy(uvloop.EventLoopPolicy())
    except ImportError:
        pass

    start_time = time.time()
    asyncio.run(main())
    print(f"Crawl finished in {time.time() - start_time:.2f} seconds")

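IV. Performance Optimization Strategies

1. Limit concurrency

Sina Finance may throttle or block high-frequency requests, so cap the number of requests in flight at any one time, for example with asyncio.Semaphore.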
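As a rough sketch of how such a cap behaves, the example below limits simulated requests with `asyncio.Semaphore` and records the peak number running at once. The `limited_fetch` and `crawl_all` names, the `stats` dictionary and the sleep that stands in for the HTTP call are all assumptions made for illustration.

```python
import asyncio


async def limited_fetch(i, semaphore, stats):
    """Run one simulated request while holding a semaphore slot."""
    async with semaphore:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # stand-in for the real HTTP request
        stats["active"] -= 1
        return i


async def crawl_all(n, max_concurrency):
    """Schedule n simulated requests, at most max_concurrency at a time."""
    # Create the semaphore inside the running event loop.
    semaphore = asyncio.Semaphore(max_concurrency)
    stats = {"active": 0, "peak": 0}
    results = await asyncio.gather(
        *(limited_fetch(i, semaphore, stats) for i in range(n))
    )
    return results, stats["peak"]


if __name__ == "__main__":
    results, peak = asyncio.run(crawl_all(50, 10))
    print(f"{len(results)} tasks finished, peak concurrency {peak}")
```

All tasks are scheduled up front, but only `max_concurrency` of them hold a slot at any moment; the rest wait at the `async with` line.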

2. Use proxy IPs

Routing requests through a proxy helps avoid having your own IP banned.
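One detail that is easy to miss is that proxy credentials containing characters such as `@` or `:` must be percent-encoded before they go into the proxy URL. The helper below is a small sketch of that step; `build_proxy_url` is a hypothetical name, and the host and credentials are placeholder values.

```python
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None):
    """Return an http:// proxy URL, percent-encoding any credentials."""
    if user is None:
        return f"http://{host}:{port}"
    credentials = quote(user, safe="")
    if password is not None:
        credentials += ":" + quote(password, safe="")
    return f"http://{credentials}@{host}:{port}"


if __name__ == "__main__":
    # Placeholder values; substitute those from your own proxy provider.
    proxy = build_proxy_url("proxy.example.com", 8080, "user", "p@ss:word")
    print(proxy)
    # With aiohttp, the URL is then passed per request, for example:
    #     async with session.get(url, proxy=proxy) as response: ...
```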

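3. Randomize the User-Agent

Sending a different User-Agent with each request reduces the chance of being identified as a crawler.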
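A dependency-free way to sketch this is to pick from a small fixed pool with `random.choice`. The pool below is an illustrative assumption, not a curated list, and `random_headers` is a hypothetical helper; a library such as fake_useragent can supply a larger pool.

```python
import random

# Illustrative pool only; a real crawler would maintain a larger, current list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.1.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0",
]


def random_headers(rng=random):
    """Return request headers with a User-Agent drawn at random from the pool."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


if __name__ == "__main__":
    for _ in range(3):
        print(random_headers()["User-Agent"])
```

Calling `random_headers()` once per request and passing the result as `headers=` gives each request its own User-Agent.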

4. Optimize data storage

Asynchronous database writes: use a driver such as aiomysql or asyncpg.
Batch writes: reduce the number of I/O operations.
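The batching idea can be sketched independently of any particular database. In the example below, `BatchWriter` is a hypothetical helper that buffers rows and passes them to an async sink once the batch is full; the `fake_sink` function stands in for a real call such as `executemany` on an async driver.

```python
import asyncio


class BatchWriter:
    """Buffer rows and hand them to an async sink in batches."""

    def __init__(self, sink, batch_size=100):
        self._sink = sink  # async callable that accepts a list of rows
        self._batch_size = batch_size
        self._buffer = []

    async def add(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self._batch_size:
            await self.flush()

    async def flush(self):
        if self._buffer:
            # Swap the buffer out before awaiting so that rows added while
            # the sink is running go into the next batch.
            batch, self._buffer = self._buffer, []
            await self._sink(batch)


async def demo(n_rows, batch_size):
    """Write n_rows through a BatchWriter and return the batches the sink saw."""
    batches = []

    async def fake_sink(rows):
        await asyncio.sleep(0)  # stand-in for a real database round trip
        batches.append(rows)

    writer = BatchWriter(fake_sink, batch_size)
    for i in range(n_rows):
        await writer.add((f"code{i}", i))
    await writer.flush()  # write the final, partially filled batch
    return batches


if __name__ == "__main__":
    sizes = [len(b) for b in asyncio.run(demo(250, 100))]
    print(sizes)
```

The final `flush()` matters: without it, any rows left in a partially filled buffer would never reach the sink.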

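# Combined example: concurrency limit, proxy, random User-Agent,
# and batched asynchronous database writes
import asyncio

import aiohttp
import aiomysql
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

# Maximum number of requests in flight at once
MAX_CONCURRENCY = 10

# Proxy settings (replace with your own provider's host and credentials)
proxyHost = "www.16yun.cn"
proxyPort = "5445"
proxyUser = "16QMSOML"
proxyPass = "280651"

# Build the proxy URL
PROXY = f"http://{proxyUser}:{proxyPass}@{proxyHost}:{proxyPort}"

# Random User-Agent generator
ua = UserAgent()

# Database configuration
DB_CONFIG = {
    'host': 'localhost',
    'port': 3306,
    'user': 'your_username',
    'password': 'your_password',
    'db': 'your_database',
    'charset': 'utf8mb4'
}

# Storage optimization: asynchronous batch write to the database
async def save_to_db(data):
    if not data:
        return  # nothing to write
    conn = await aiomysql.connect(**DB_CONFIG)
    try:
        async with conn.cursor() as cur:
            # Each row must supply three values to match the placeholders
            await cur.executemany(
                "INSERT INTO finance_data (column1, column2, column3) VALUES (%s, %s, %s)",
                data,
            )
        await conn.commit()
    finally:
        conn.close()  # close the connection even if the insert fails

# Parse the page content
def parse(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Assumes the data sits in a specific table; adjust the selector to the real page
    table = soup.find('table', {'class': 'example'})
    if table is None:
        return []
    data = []
    for row in table.find_all('tr'):
        cells = [td.text.strip() for td in row.find_all('td')]
        cells = [cell for cell in cells if cell]
        if cells:  # skip header rows and empty rows
            data.append(cells)
    return data

# Crawl a single stock
async def crawl_stock(stock_code, session, semaphore):
    async with semaphore:  # the semaphore caps the number of concurrent requests
        url = f"https://finance.sina.com.cn/stock/{stock_code}.html"
        headers = {"User-Agent": ua.random}
        try:
            async with session.get(url, headers=headers, proxy=PROXY) as response:
                html = await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError) as e:
            print(f"Crawl failed: {stock_code} - {e}")
            return []
    return parse(html)

# Main coroutine
async def main(stock_codes):
    # Create the semaphore inside the running event loop
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [crawl_stock(stock_code, session, semaphore) for stock_code in stock_codes]
        all_data = await asyncio.gather(*tasks)
    # Flatten the per-stock row lists into one list
    flat_data = [row for rows in all_data for row in rows]
    # Write everything to the database in one asynchronous batch
    await save_to_db(flat_data)

# Example list of stock codes
stock_codes = [
    '000001',
    '000002',
    # more stock codes
]

# Run the crawler
if __name__ == "__main__":
    asyncio.run(main(stock_codes))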

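V. Comparing Synchronous and Asynchronous Crawler Performance

Metric	Synchronous crawler (requests)	Asynchronous crawler (aiohttp)
Time for 100 requests	~20 s	~3 s
CPU usage	Low (most of the time is spent waiting)	Higher (requests are processed concurrently)
Anti-crawling risk	High (bans are easily triggered)	Lower, provided concurrency is capped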