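Maxed out concurrency and the crawl got slower? I fell into this trap

This post describes the problems I ran into while scraping data from a news site: low data volume, high latency, and banned proxy IPs.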

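1. How it started

Last week I picked up a last-minute request: scrape a batch of trending stories from a news site (www.toutiao.com). Time was tight, so I went straight for high concurrency plus a proxy pool, feeling pretty pleased with myself and sure it would fly.

The result: an hour after going live, the data volume was only 40% of what I expected, and latency was oddly high.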

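2. Investigation (timeline)

Day 1, evening
My first suspect was the proxy IPs. I checked the logs and, hmm, the success rate was actually decent, around 80%. So proxy quality wasn't the problem.

Day 2, morning
Digging further through the logs, I found that some proxy nodes were being hammered: a single IP was handling ten to twenty requests at once and got rate-limited, or even banned, by the target site.
Meanwhile, other proxies were barely used at all, which was awkward.

Day 3, noon
Confirmed. The culprit: no global limit on concurrency, plus no cap on per-IP concurrency.
On top of that, blind retries after failures were eating up bandwidth and CPU.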
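The lopsided proxy usage found on Day 2 can be measured instead of eyeballed. Below is a minimal sketch that computes the peak number of overlapping requests per proxy. The `(proxy, start, end)` record shape, the `peak_concurrency` helper, and the sample data are all assumptions made for illustration; real crawler logs would need to be parsed into this form first.

```python
from collections import defaultdict

def peak_concurrency(records):
    """Return {proxy: max number of overlapping requests}.

    `records` is an iterable of (proxy, start, end) tuples. This log
    shape is hypothetical, chosen only for the example.
    """
    events = defaultdict(list)
    for proxy, start, end in records:
        events[proxy].append((start, 1))   # request begins
        events[proxy].append((end, -1))    # request finishes

    peaks = {}
    for proxy, evs in events.items():
        current = peak = 0
        # Sort by time. At equal timestamps the -1 sorts before the +1,
        # so a request ending at t does not overlap one starting at t.
        for _, delta in sorted(evs):
            current += delta
            peak = max(peak, current)
        peaks[proxy] = peak
    return peaks

# Made-up sample: proxy-a carries three overlapping requests, proxy-b one.
sample = [
    ("proxy-a", 0.0, 2.0),
    ("proxy-a", 0.5, 2.5),
    ("proxy-a", 1.0, 3.0),
    ("proxy-b", 0.0, 1.0),
]
print(peak_concurrency(sample))  # {'proxy-a': 3, 'proxy-b': 1}
```

A proxy whose peak sits far above the others is a likely candidate for the rate limiting and bans described above.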

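3. Breaking down the problem

Concurrency was too aggressive: the overall request rate was high enough to trigger the target site's rate limiting.
Individual proxy IPs were overloaded: once one got banned, it dragged down a whole batch of tasks.
No real-time monitoring: when something broke, the only option was digging through logs for clues.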

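4. The fix

Global limit: use a semaphore to cap the total number of concurrent tasks.
Per-node limit: give each proxy IP its own concurrency threshold; anything over it waits in a queue.
Failure backoff: don't retry once a second; use exponential backoff and take it slow.
Some monitoring: a tqdm progress bar shows task progress at a glance, and a node health report is printed once the crawl finishes.
Automatic trend analysis: run keyword counts directly on the scraped titles instead of tallying by hand.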
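To see the two limits working together without a browser or real proxies, here is a small self-contained sketch. `fake_fetch`, `limited`, `run_demo`, and `N_PROXIES` are hypothetical names for this demo only; the sleep stands in for a page load, and the code just records how many calls overlap.

```python
import asyncio

MAX_TOTAL, MAX_PER_PROXY = 6, 3
N_PROXIES = 3  # hypothetical pool size for the demo

async def fake_fetch(proxy_idx, state):
    # Stand-in for a real page load; tracks how many calls overlap.
    state["total"] += 1
    state["per_proxy"][proxy_idx] += 1
    state["peak_total"] = max(state["peak_total"], state["total"])
    state["peak_proxy"] = max(state["peak_proxy"], state["per_proxy"][proxy_idx])
    await asyncio.sleep(0.01)
    state["total"] -= 1
    state["per_proxy"][proxy_idx] -= 1

async def limited(proxy_idx, total_sem, proxy_sems, state):
    # Per-proxy slot first, then a global slot.
    async with proxy_sems[proxy_idx], total_sem:
        await fake_fetch(proxy_idx, state)

async def run_demo(n_tasks=30):
    # Semaphores are created inside the running loop, which avoids
    # loop-binding problems on Python versions before 3.10.
    total_sem = asyncio.Semaphore(MAX_TOTAL)
    proxy_sems = [asyncio.Semaphore(MAX_PER_PROXY) for _ in range(N_PROXIES)]
    state = {"total": 0, "per_proxy": [0] * N_PROXIES,
             "peak_total": 0, "peak_proxy": 0}
    await asyncio.gather(*(
        limited(i % N_PROXIES, total_sem, proxy_sems, state)
        for i in range(n_tasks)
    ))
    return state["peak_total"], state["peak_proxy"]

peak_total, peak_proxy = asyncio.run(run_demo())
print(f"peak total: {peak_total}, peak per proxy: {peak_proxy}")
```

Neither peak should ever exceed its limit. One design note on ordering: taking the per-proxy semaphore first means a task queued behind a busy proxy does not occupy a global slot, at the cost that a task may hold a proxy slot while it waits for global capacity.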

5. Optimized code (condensed)

import asyncio, random, re
from collections import Counter, defaultdict
from playwright.async_api import async_playwright
from tqdm.asyncio import tqdm_asyncio
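# Crawler proxy settings (16yun / 亿牛云 example, www.16yun.cn)
PROXIES = [{"host":"proxy.16yun.cn","port":10000,"user":"16YUN","pass":"16IP"}]
# Search terms: "artificial intelligence", "chips", "Tesla", "Tokyo Olympics"
KEYWORDS = ["人工智能", "芯片", "特斯拉", "东京奥运"]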

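# Global cap on concurrent tasks, and cap per proxy IP.
MAX_TOTAL, MAX_PER_PROXY = 6, 3
# Note: on Python < 3.10, create these semaphores inside main() instead,
# so that they bind to the event loop started by asyncio.run().
total_sem = asyncio.Semaphore(MAX_TOTAL)
proxy_sem = {i: asyncio.Semaphore(MAX_PER_PROXY) for i in range(len(PROXIES))}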
proxy_stats = defaultdict(lambda: {"success": 0, "fail": 0})

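def extract_words(titles):
    # Grab every run of two or more Chinese characters. This is a crude
    # stand-in for word segmentation: a run is not split into words.
    return re.findall(r"[\u4e00-\u9fa5]{2,}", " ".join(titles))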

async def fetch_page(page, keyword):
    await page.goto(f"https://m.toutiao.com/search/?keyword={keyword}", timeout=15000)
    return [await el.inner_text() for el in await page.query_selector_all("div.result-item .title")]

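async def worker(browser, proxy_idx, kw):
    proxy = PROXIES[proxy_idx]
    # Per-proxy slot first, then a global slot, so a task waiting on a
    # busy proxy does not occupy global capacity.
    async with proxy_sem[proxy_idx], total_sem:
        ctx = await browser.new_context(proxy={
            "server": f"http://{proxy['host']}:{proxy['port']}",
            "username": proxy['user'],
            "password": proxy['pass']
        })
        try:
            page = await ctx.new_page()
            # Small random jitter so requests do not fire in lockstep.
            await asyncio.sleep(random.uniform(0.3, 1.0))
            for attempt in range(4):
                try:
                    data = await fetch_page(page, kw)
                    proxy_stats[proxy_idx]["success"] += 1
                    return data
                except Exception:
                    # A bare "except:" would also swallow task cancellation.
                    proxy_stats[proxy_idx]["fail"] += 1
                    # Exponential backoff: 2**attempt seconds plus up to 1s jitter.
                    await asyncio.sleep((2 ** attempt) + random.random())
            return []
        finally:
            # Always release the browser context, even on unexpected errors.
            await ctx.close()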

async def main():
    async with async_playwright() as pw:
        browser = await pw.chromium.launch(headless=True)
        tasks = [worker(browser, i % len(PROXIES), kw) for i, kw in enumerate(KEYWORDS)]
        results = await tqdm_asyncio.gather(*tasks, desc="Crawl progress", total=len(tasks))
        await browser.close()

        print("\nNode health report:")
        for idx, stat in proxy_stats.items():
            total_req = stat["success"] + stat["fail"]
            rate = (stat["success"] / total_req * 100) if total_req else 0
            print(f"Node {idx}: {stat['success']} succeeded, {stat['fail']} failed, success rate {rate:.2f}%")

        print("\nTrending keywords:")
        for word, count in Counter(extract_words([t for res in results for t in res])).most_common(10):
            print(f"{word} - {count} times")

if __name__ == "__main__":
    asyncio.run(main())
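The retry loop in `worker` waits `(2 ** attempt) + random.random()` seconds after each failure. The sketch below pulls that schedule into a standalone helper so the delays can be inspected on their own. `backoff_delay` and its `base`, `cap`, and `rng` parameters are illustrative additions, not part of the code above, which has no upper cap.

```python
import random

def backoff_delay(attempt, base=1.0, cap=30.0, rng=random.random):
    """Seconds to wait before retrying after failure number `attempt` (0-based).

    The delay doubles with each attempt, gets up to one second of random
    jitter, and is clamped to `cap`. `base`, `cap`, and `rng` are
    hypothetical knobs added for this example.
    """
    return min(cap, base * (2 ** attempt) + rng())

# With the defaults, four attempts fall in [1, 2), [2, 3), [4, 5), [8, 9).
for attempt in range(4):
    print(attempt, round(backoff_delay(attempt), 2))
```

The jitter keeps tasks that failed at the same moment from all retrying at the same moment, and a cap keeps a long failure streak from producing unbounded waits.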

6. Results

Success rate went from 60% to 85%+
No single node got overwhelmed again
Trending-topic data comes out ready for analysis

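7. Closing thoughts

This one left me with a line to remember: higher concurrency isn't always better; keeping a steady pace is what wins. Especially on sites with rate limits, charging in at full speed just digs your own hole.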