GitHub Daily

Scrapling: an adaptive scraping framework, one library with zero compromises

+1,184 stars today | GitHub Trending

📅 2026-05-07

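Anyone who has written a scraper knows the pain: **a scraper that ran fine yesterday breaks completely today because the site was redesigned.** Selectors stop matching, anti-bot defenses get upgraded, content moves behind JS rendering, and each of these can cost you half a day. Among the traditional options, requests + BeautifulSoup is fast but cannot handle dynamic sites, Selenium/Playwright is powerful but heavy, and the Scrapy framework is not flexible enough. Today's open-source pick, Scrapling, tackles all three problems in one framework. It comes from security researcher D4Vinci (Karim Shoair), has reached GitHub Trending with 46,200+ stars, and gained 1,184 stars today.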

📋 Project at a glance

Project name: Scrapling

Author: D4Vinci (Karim Shoair)

GitHub Stars: ⭐ 46,200+ (+1,184 today)

Language: Python 3.10+

License: BSD-3-Clause

Test coverage: 92%

GitHub: github.com/D4Vinci/Scrapling

Official docs: scrapling.readthedocs.io

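💔 What problems does it solve?

Pain point 1: one site redesign and the scraper is dead

Traditional scrapers rely on hard-coded CSS/XPath selectors. Once a site is redesigned (class names change, the DOM structure is rearranged), every selector stops working. Maintenance is very costly, especially when you scrape a large number of sites.

Pain point 2: anti-bot systems keep getting tougher

Cloudflare Turnstile, PerimeterX, DataDome: anti-bot technology changes constantly. Plain requests calls get a 403 straight away, and even Selenium can be detected through fingerprinting.

Pain point 3: fragmented tooling, no unified approach

Static sites use requests, dynamic sites use Playwright, and protected sites need a stealth plugin on top
Different tools return different response objects with completely different APIs
Large-scale crawling means bringing in Scrapy, which is yet another system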

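🛠 How Scrapling approaches it

Scrapling unifies everything under a **"Progressive Enhancement"** philosophy: one Response interface, with three tiers of Fetcher that switch seamlessly from lightweight HTTP to browser automation to stealth anti-bot mode. More importantly, it ships with a built-in **adaptive parser**: when the site structure changes, it uses a similarity algorithm to relocate elements automatically, making your scraper "redesign-proof".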

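✨ Core highlights in depth

1. A three-tier progressive Fetcher system

Scrapling's core design idea is "upgrade only when needed". Not every request needs a browser, and not every request needs anti-bot disguise. The three-tier Fetcher system lets you choose according to the target site:

Tier 1

Fetcher / AsyncFetcher: lightweight HTTP requests that impersonate browser TLS fingerprints and support HTTP/3; the fastest tier

Tier 2

DynamicFetcher / AsyncDynamicFetcher: Playwright-based browser automation that supports Chromium and Chrome and handles JS rendering

Tier 3

StealthyFetcher / AsyncStealthyFetcher: advanced stealth mode with fingerprint spoofing that automatically bypasses Cloudflare Turnstile/Interstitial

The key point: whichever tier of Fetcher you use, you get back the same Response object (which inherits from Selector). Your parsing code, `.css()` / `.xpath()`, works unchanged across Fetchers with no rewriting.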
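The "upgrade only when needed" idea can be sketched as a fallback loop that tries the cheapest tier first and escalates only when a tier fails. This is an illustrative sketch, not Scrapling's API: `fetch_with_escalation`, the `(status, body)` result tuple, and the three stub fetchers are all hypothetical stand-ins, and no network access happens.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical stand-in for a fetch result: (status_code, body).
FetchResult = Tuple[int, str]
FetchFn = Callable[[str], FetchResult]


def fetch_with_escalation(
    url: str, tiers: Sequence[Tuple[str, FetchFn]]
) -> Tuple[str, FetchResult]:
    """Try each tier in order; return the first that gives HTTP 200 and a non-empty body.

    If every tier fails, the result of the last tier is returned.
    """
    last: Optional[Tuple[str, FetchResult]] = None
    for name, fetch in tiers:
        result = fetch(url)
        last = (name, result)
        status, body = result
        if status == 200 and body.strip():
            return last
    if last is None:
        raise ValueError("no tiers configured")
    return last


# Stub fetchers that only simulate the three tiers.
def plain_http(url: str) -> FetchResult:
    return (403, "")  # simulated: blocked by an anti-bot layer


def browser(url: str) -> FetchResult:
    return (200, "   ")  # simulated: page loads but the content is empty


def stealth(url: str) -> FetchResult:
    return (200, "<div class='product'>Widget</div>")


tiers = [("http", plain_http), ("browser", browser), ("stealth", stealth)]
used, (status, body) = fetch_with_escalation("https://example.com/products", tiers)
print(used, status)
```

Because every tier returns the same shape of result, the code that consumes `body` does not care which tier produced it, which is the property the unified Response object is meant to give you.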

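2. Adaptive parser: no fear of site redesigns

This is Scrapling's most distinctive capability. It has a built-in similarity algorithm that "remembers" the characteristics of the elements you scrape. After the site's DOM structure changes, it relocates the correct element automatically: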

Python

from scrapling.fetchers import StealthyFetcher

StealthyFetcher.adaptive = True

# First scrape: save the element signature
p = StealthyFetcher.fetch('https://example.com/products', headless=True)
products = p.css('.product', auto_save=True)

# After the site is redesigned: relocate the element automatically
products = p.css('.product', adaptive=True)  # found even if the class name has changed

Compared with AutoScraper's similarity matching among traditional tools (12.45 ms), Scrapling needs only 2.39 ms, about 5.2 times faster, with higher accuracy as well.
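To make the idea of similarity-based relocation concrete, here is a minimal sketch using only the standard library. It is not Scrapling's algorithm, which the article does not describe in detail. It assumes a simplified element model (a dict with tag, class, text and parent tag) and uses `difflib.SequenceMatcher` as a stand-in similarity measure; `signature`, `relocate` and the 0.5 threshold are hypothetical choices for illustration.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional

Element = Dict[str, str]  # simplified element: tag, class, text, parent tag


def signature(el: Element) -> str:
    """Flatten an element's features into one comparable string."""
    return "|".join(el.get(key, "") for key in ("tag", "class", "text", "parent"))


def relocate(
    saved: Element, candidates: List[Element], threshold: float = 0.5
) -> Optional[Element]:
    """Return the candidate most similar to the saved element, or None if none is close enough."""
    best, best_score = None, 0.0
    for el in candidates:
        score = SequenceMatcher(None, signature(saved), signature(el)).ratio()
        if score > best_score:
            best, best_score = el, score
    return best if best_score >= threshold else None


# Signature saved on the first run, when the selector '.product' still matched.
saved = {"tag": "div", "class": "product", "text": "Blue Widget $19", "parent": "section"}

# Elements on the redesigned page: the class name changed to 'item-card'.
new_page = [
    {"tag": "nav", "class": "menu", "text": "Home About", "parent": "header"},
    {"tag": "div", "class": "item-card", "text": "Blue Widget $21", "parent": "section"},
    {"tag": "footer", "class": "site-footer", "text": "Contact us", "parent": "body"},
]

match = relocate(saved, new_page)
print(match["class"] if match else None)
```

The renamed element still wins because its tag, parent and most of its text are unchanged, so the selector's class name is only one of several signals.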

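3. Anti-bot bypass out of the box

StealthyFetcher comes with CDP (Chrome DevTools Protocol) leak patching, fingerprint spoofing, and automatic Cloudflare Turnstile solving built in. There is no need to install undetected-chromedriver or playwright-stealth separately: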

Python

from scrapling.fetchers import StealthySession

with StealthySession(headless=True, solve_cloudflare=True) as session:
    page = session.fetch('https://nopecha.com/demo/cloudflare')
    data = page.css('#padded_content a').getall()

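4. A complete Spider framework

More than a request tool, Scrapling also includes a complete Scrapy-like spider framework with enterprise-grade features such as concurrent crawling, pause and resume, multi-session routing, and automatic proxy rotation: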

Python

from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession

class MultiSessionSpider(Spider):
    name = "multi"
    start_urls = ["https://example.com/"]
    concurrent_requests = 10

    def configure_sessions(self, manager):
        # Fast HTTP session for ordinary pages
        manager.add("fast", FetcherSession(impersonate="chrome"))
        # Stealth session for protected pages (lazily started)
        manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)

    async def parse(self, response: Response):
        for link in response.css('a::attr(href)').getall():
            if "protected" in link:
                yield Request(link, sid="stealth")  # route to the stealth session
            else:
                yield Request(link, sid="fast", callback=self.parse)

result = MultiSessionSpider(crawldir="./crawl_data").start()
result.items.to_json("output.json")

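Press Ctrl+C to pause gracefully; the next start resumes automatically from where the crawl was interrupted. This is very useful when crawling tens of thousands of pages.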
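Pause and resume generally comes down to persisting the crawl state (what is still pending, what has been seen) and reloading it on the next start. The sketch below shows that general pattern with the standard library only. It is not how Scrapling stores its `crawldir` data: `CrawlCheckpoint`, the `checkpoint.json` file name and the `crawl` helper are hypothetical, and the "interruption" is simulated with a page limit rather than a real Ctrl+C.

```python
import json
import tempfile
from collections import deque
from pathlib import Path


class CrawlCheckpoint:
    """Hypothetical helper that persists the pending queue and the seen set as JSON."""

    def __init__(self, crawldir):
        self.path = Path(crawldir) / "checkpoint.json"
        self.pending = deque()
        self.seen = set()

    def load(self, start_urls):
        """Restore saved state if present, otherwise start from the given URLs."""
        if self.path.exists():
            state = json.loads(self.path.read_text())
            self.pending = deque(state["pending"])
            self.seen = set(state["seen"])
        else:
            self.pending = deque(start_urls)

    def save(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        state = {"pending": list(self.pending), "seen": sorted(self.seen)}
        self.path.write_text(json.dumps(state))


def crawl(checkpoint, max_pages):
    """Process up to max_pages URLs, then save state (simulating an interruption)."""
    done = []
    while checkpoint.pending and len(done) < max_pages:
        url = checkpoint.pending.popleft()
        if url in checkpoint.seen:
            continue
        checkpoint.seen.add(url)
        done.append(url)  # a real crawler would fetch and parse here
    checkpoint.save()
    return done


crawldir = tempfile.mkdtemp()
urls = [f"https://example.com/page/{i}" for i in range(5)]

first = CrawlCheckpoint(crawldir)
first.load(urls)
first_run = crawl(first, max_pages=2)  # "interrupted" after two pages

second = CrawlCheckpoint(crawldir)
second.load(urls)  # picks up the saved state instead of starting over
second_run = crawl(second, max_pages=10)
print(first_run)
print(second_run)
```

The second run continues with the three remaining pages and does not revisit the first two.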

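5. AI-native: built-in MCP server

Scrapling is one of the few scraping frameworks with a built-in MCP (Model Context Protocol) server. After installing the AI extra, AI tools such as Claude and Cursor can call Scrapling directly for structured data extraction: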

Bash

pip install "scrapling[ai]"

The MCP server extracts the content first and only then passes it to the AI, which reduces token consumption and improves speed and accuracy.
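Why extracting first saves tokens can be shown with a small standard-library sketch: pull only the text of the element you care about and compare its size with the full page. This says nothing about how Scrapling's MCP server is implemented. `ClassTextExtractor` is a hypothetical, simplified extractor (it does not handle void tags such as `<br>` inside the target), and character count is used only as a crude proxy for token count.

```python
from html.parser import HTMLParser


class ClassTextExtractor(HTMLParser):
    """Collect text inside elements that carry a target class (simplified sketch)."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0  # greater than 0 while inside a matching element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.target_class in classes:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


html = (
    "<html><head><style>body{margin:0}</style></head><body>"
    "<nav><a href='/'>Home</a><a href='/about'>About</a></nav>"
    "<div class='article'><h1>Title</h1><p>First paragraph.</p></div>"
    "<footer>Copyright notice and many links</footer>"
    "</body></html>"
)

parser = ClassTextExtractor("article")
parser.feed(html)
extracted = " ".join(parser.chunks)

print(extracted)
print(len(html), len(extracted))  # full page size vs. what would be sent to the model
```

Navigation, styles and footer never reach the model; only the article text does.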

6. Benchmark-leading performance

📊 Text extraction speed benchmark (5,000 nested elements)

Rank	Library	Time (ms)	Relative speed
1	Scrapling (fastest)	2.02	1.0x
2	Parsel/Scrapy	2.04	1.01x
3	Raw Lxml	2.54	1.26x
4	PyQuery	24.17	~12x
5	Selectolax	82.63	~41x
6	BS4 + Lxml	1,584	~784x
7	BS4 + html5lib	3,392	~1,679x

* All benchmarks are averages over 100+ runs
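The "relative speed" column is each library's time divided by Scrapling's time. A short check of that arithmetic, using only the timings reported in the table:

```python
# Timings in milliseconds, copied from the benchmark table.
timings_ms = {
    "Scrapling": 2.02,
    "Parsel/Scrapy": 2.04,
    "Raw Lxml": 2.54,
    "PyQuery": 24.17,
    "Selectolax": 82.63,
    "BS4 + Lxml": 1584.0,
    "BS4 + html5lib": 3392.0,
}

baseline = timings_ms["Scrapling"]
relative = {name: t / baseline for name, t in timings_ms.items()}

for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}x")
```

Rounded, the ratios agree with the table (for example 1584 / 2.02 is roughly 784).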

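🗺 Feature overview

🔄 Smart element tracking

A similarity algorithm relocates elements after DOM changes, so a site redesign does not break your selectors.

🛡️ Cloudflare bypass

Automatically solves Cloudflare Turnstile/Interstitial without third-party anti-detection tools.

🔌 Unified Response interface

All three Fetcher tiers return the same Response object (inheriting from Selector), so .css() and .xpath() work everywhere.

🕷️ Spider framework

Scrapy-like API with concurrency, pause and resume, multi-session routing, and automatic proxy rotation.

🔐 DNS leak protection

Built-in DNS-over-HTTPS (Cloudflare DoH), so real DNS requests are not leaked when using a proxy.

🤖 MCP AI integration

Built-in MCP server; AI tools such as Claude and Cursor can call Scrapling directly for structured data extraction.

🎯 Ad/domain blocking

Built-in blocking of 3,500+ ad and tracking domains, reducing irrelevant requests and improving crawl efficiency.

⚡ HTTP/3 + TLS impersonation

Fetcher impersonates the TLS fingerprints and request headers of Chrome/Firefox and supports HTTP/3.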
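Domain blocking of the kind listed above usually means checking each outgoing request's host against a blocklist, including subdomains. A minimal sketch with the standard library follows. The blocklist entries and the `is_blocked` helper are hypothetical; Scrapling's actual list and matching rules are not shown in this article.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist entries, for illustration only.
BLOCKLIST = {"ads.example.net", "tracker.example.org"}


def is_blocked(url, blocklist=BLOCKLIST):
    """Return True if the URL's host is a blocked domain or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    parts = host.split(".")
    # Check the host and every parent domain: a.b.c, then b.c, then c.
    return any(".".join(parts[i:]) in blocklist for i in range(len(parts)))


requests_seen = [
    "https://shop.example.com/item/1",
    "https://ads.example.net/banner.js",
    "https://cdn.tracker.example.org/pixel.gif",
]

allowed = [url for url in requests_seen if not is_blocked(url)]
print(allowed)
```

Only the shop page survives the filter, so the browser never spends time on the banner script or the tracking pixel.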

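🎮 Real-world scenarios

🏢 Scenario 1: e-commerce data collection

Collect product prices, reviews, and stock information from several e-commerce platforms. Some platforms have strict anti-bot measures (Cloudflare, PerimeterX), while others are purely static pages.

Approach: use the Spider framework with multi-session routing. Ordinary pages go through FetcherSession (fast), and protected pages go through StealthySession. The adaptive parser lets the page templates of different platforms be matched automatically.

📰 Scenario 2: news aggregation platform

Article titles, summaries, and publication times need to be scraped continuously from 100+ news sites. The sites are redesigned frequently, which makes maintenance a nightmare.

Approach: on the first run, use auto_save=True to save element signatures. After a site is redesigned, pass adaptive=True to relocate the elements automatically, with no manual selector fixes.

🔬 Scenario 3: academic data collection

Collect paper metadata from academic platforms such as PubMed and arXiv. Some pages need JavaScript rendering before the full content is available.

Approach: use DynamicFetcher for JS-rendered pages and Fetcher for static API endpoints. The Spider's streaming mode outputs results in real time, which makes it easy to feed downstream pipelines.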
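Streaming output means each record is handed downstream as soon as it is produced, rather than after the whole crawl finishes. The sketch below illustrates that pattern with a plain Python generator and JSON Lines output. It does not use Scrapling's Spider API: `crawl_stream`, `write_jsonl` and the sample records are hypothetical, and an in-memory buffer stands in for the downstream pipeline.

```python
import io
import json


def crawl_stream(pages):
    """Yield one record per page as soon as it is 'parsed' (no batching)."""
    for url, title in pages:
        yield {"url": url, "title": title}


def write_jsonl(items, sink):
    """Write each item as one JSON line the moment it arrives; return the count."""
    count = 0
    for item in items:
        sink.write(json.dumps(item) + "\n")
        count += 1
    return count


# Hypothetical paper metadata standing in for real crawl results.
pages = [
    ("https://example.org/abs/1", "Paper one"),
    ("https://example.org/abs/2", "Paper two"),
]

sink = io.StringIO()  # stands in for a file, queue, or socket
written = write_jsonl(crawl_stream(pages), sink)
lines = sink.getvalue().splitlines()
print(written, lines[0])
```

Because the consumer pulls from a generator, memory use stays flat no matter how many pages the crawl eventually covers.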

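🤖 Scenario 4: data pipeline for AI agents

Provide real-time web data to AI agents such as Claude and GPT. This calls for structured extraction, low latency, and reduced token consumption.

Approach: enable the Scrapling MCP server so the AI agent calls Scrapling directly for targeted content extraction, passing only the needed structured data to the model and substantially lowering token usage.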

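🚀 Getting started

Installation

Bash

# Basic install (parser engine only)
pip install scrapling

# Full install (fetchers plus browser dependencies)
pip install "scrapling[fetchers]"
scrapling install

# Install everything (including AI and shell)
pip install "scrapling[all]"

# Or use Docker (ready to use, browsers preinstalled)
docker pull pyd4vinci/scrapling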

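Quick start

30-second quick scrape: send an HTTP request with Fetcher and extract data with familiar CSS selectors.
Hit a JS-rendered page? Swap Fetcher for DynamicFetcher; the rest of the code stays the same.
Hit Cloudflare protection? Switch to StealthyFetcher and bypass it automatically with one line of code.
Site redesigned? Add adaptive=True and the adaptive parser repairs the selectors automatically.
Need large-scale crawling? Use the Spider framework, with concurrency, pause and resume, and proxy rotation.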

Python: three lines of code from getting started to real-world use

from scrapling.fetchers import Fetcher

# Static page: plain GET plus CSS selectors
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()
# ['"The world as we have created it..."', '"It is our choices..."', ...]

# Dynamic page: switch to DynamicFetcher
from scrapling.fetchers import DynamicFetcher

page = DynamicFetcher.fetch('https://example.com/dynamic', headless=True)
data = page.xpath('//div[@class="content"]').getall()

# Protected page: switch to StealthyFetcher
from scrapling.fetchers import StealthyFetcher

page = StealthyFetcher.fetch('https://protected-site.com')
data = page.css('.data-row').getall()

No-code scraping (CLI mode)

Scrapling provides a command-line tool, so you can scrape without writing any Python code:

Bash

# Extract page content and save it as Markdown
scrapling extract get 'https://example.com' content.md

# Targeted extraction with a CSS selector
scrapling extract get 'https://example.com' products.txt \
  --css-selector '.product-item' --impersonate chrome

# Extraction from a protected page
scrapling extract stealthy-fetch 'https://protected.com' data.html \
  --css-selector '#main-content' --solve-cloudflare

Fetchers vs. Sessions

Purpose	One-off request	Persistent session	Async session
HTTP requests	`Fetcher`	`FetcherSession`	`FetcherSession` (context-aware)
Anti-bot bypass	`StealthyFetcher`	`StealthySession`	`AsyncStealthySession`
Browser automation	`DynamicFetcher`	`DynamicSession`	`AsyncDynamicSession`

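⚖️ Comparison with mainstream options

Capability	Scrapling	Scrapy	BeautifulSoup	Selenium/Playwright
Adaptive element location	Built in	None	None	None
Anti-bot bypass	Built in	Needs plugins	None	Partial
JS rendering	Built in	Needs integration	None	Core capability
Unified Response interface	Yes	HTTP only	Parsing only	Browser only
Pause/resume	Built in	Needs extension	None	None
AI/MCP integration	Built in	None	None	None
Text extraction performance	2.02ms	2.04ms	1,584ms	N/A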