Python Dynamic Web Scraping: Technical Documentation
Table of Contents
- Technical Background
- Core Solutions
- Advanced Techniques in Practice
- Anti-Anti-Scraping Strategies
- Performance Optimization Guide
- Caveats
- Appendix
Technical Background
Types of Dynamic Loading
| Type | Characteristics | Common Scenarios |
|---|---|---|
| AJAX | Partial data updates | Paginated loading / search suggestions |
| WebSocket | Bidirectional real-time communication | Instant messaging / stock tickers |
| Client-side rendering | DOM built entirely by JS | React/Vue single-page applications |
| Lazy loading | Loading triggered by scrolling | Image waterfalls / infinite scroll |
Detection Method
import requests
from bs4 import BeautifulSoup

url = "https://example.com"  # placeholder: target page to probe
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
# Heuristic: the expected content node is absent and the raw HTML is small,
# which suggests the page is populated by JavaScript after the initial load
if not soup.find('div', class_='content') and len(res.text) < 5000:
    print("Page appears to be dynamically loaded")
Core Solutions
Solution 1: Direct API Analysis
Implementation Workflow
- Open Chrome DevTools (F12)
- Switch to the Network panel and filter for XHR/Fetch requests
- Locate the request that returns the target data and copy it as cURL
- Convert it to Python code with curlconverter
Example Code
import hashlib
import time

import requests

API_ENDPOINT = "https://api.ecommerce.com/products"
SIGNATURE_KEY = "X-Signature"

def generate_signature(params):
    # Signature algorithm recovered through reverse engineering
    return hashlib.sha256(params.encode()).hexdigest()

params = {
    "category": "electronics",
    "page": 2,
    "timestamp": int(time.time())
}
params[SIGNATURE_KEY] = generate_signature(str(params))
response = requests.get(
    API_ENDPOINT,
    headers={"Referer": "https://www.ecommerce.com"},
    params=params
)
Solution 2: Browser Automation (Selenium)
Architecture Diagram
[Python code] -> [WebDriver] -> [Browser instance] -> [Target site]
Enhanced Configuration
from selenium import webdriver

def create_stealth_driver():
    options = webdriver.ChromeOptions()
    options.add_argument("--disable-blink-features=AutomationControlled")
    options.add_experimental_option("excludeSwitches", ["enable-automation"])
    options.add_experimental_option("useAutomationExtension", False)
    driver = webdriver.Chrome(options=options)
    # Override the navigator.webdriver property so page scripts cannot detect automation
    driver.execute_script(
        "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
    )
    return driver
Advanced Techniques in Practice
Smart Waiting Strategies
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

def wait_complex(driver):
    # Wait until the document is fully loaded, the loading indicator is visible,
    # and at least 50 product items have been rendered
    WebDriverWait(driver, 15).until(
        lambda d: d.execute_script("return document.readyState") == "complete"
        and d.find_element(By.CSS_SELECTOR, ".loaded-indicator").is_displayed()
        and len(d.find_elements(By.CLASS_NAME, "product-item")) >= 50
    )
Packet Capture (Playwright)
from playwright.sync_api import sync_playwright

def capture_api_requests():
    with sync_playwright() as p:
        browser = p.chromium.launch()
        context = browser.new_context()
        page = context.new_page()
        api_responses = []

        def log_response(response):
            # Record only GraphQL API responses
            if "/graphql" in response.url:
                api_responses.append({
                    "url": response.url,
                    "status": response.status,
                    "body": response.json()
                })

        page.on("response", log_response)
        page.goto("https://social-media-site.com")
        context.close()
        browser.close()
        return api_responses
Anti-Anti-Scraping Strategies
Fingerprint Spoofing Matrix
| Dimension | Technique | Suggested Tools |
|---|---|---|
| Canvas fingerprint | Randomize rendering | fingerprintjs |
| WebGL fingerprint | Alter reported GPU characteristics | headless-gl |
| Font list | Randomize the order of system fonts | font-list-manipulator |
| Audio context | Generate randomized audio waveforms | web-audio-fingerprint |
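The tools differ per dimension, but the common pattern behind the table is to randomize a session's fingerprint once, before launching a browser context, and keep it stable for that session. A minimal, hypothetical sketch in plain Python (the field names and value pools are illustrative assumptions, not tied to any real detection logic):

```python
import random

# Illustrative value pools (assumptions, not exhaustive)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]
FONTS = ["Arial", "Verdana", "Georgia", "Times New Roman", "Courier New"]
VIEWPORTS = [(1366, 768), (1440, 900), (1920, 1080)]

def random_fingerprint_profile():
    fonts = FONTS[:]
    random.shuffle(fonts)  # randomized font order, as the table suggests
    return {
        "user_agent": random.choice(USER_AGENTS),
        "viewport": random.choice(VIEWPORTS),
        "fonts": fonts,
        # a random value that could seed canvas/audio noise injection
        "noise_seed": random.random(),
    }

profile = random_fingerprint_profile()
```

A profile like this could then be applied per session, e.g. via Playwright's `browser.new_context(user_agent=..., viewport=...)`, before any navigation happens.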
Proxy Rotation System
from itertools import cycle

import requests

PROXY_POOL = [
    "http://user:pass@proxy1:port",
    "socks5://user:pass@proxy2:port",
    "http://user:pass@proxy3:port"
]
proxy_cycle = cycle(PROXY_POOL)

def get_with_rotation(url, retries=3):
    # Try up to `retries` proxies from the pool before giving up;
    # note: socks5 proxies require the requests[socks] extra
    for _ in range(retries):
        proxy = next(proxy_cycle)
        try:
            return requests.get(
                url,
                proxies={"http": proxy, "https": proxy},
                timeout=10
            )
        except requests.RequestException:
            continue
    raise RuntimeError(f"All proxy attempts failed for {url}")
Performance Optimization Guide
Optimized Selenium Configuration
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
# Performance-tuning flags
OPTIMIZATION_PARAMS = [
    "--disable-gpu",
    "--disable-software-rasterizer",
    "--disable-dev-shm-usage",
    "--no-sandbox",
    "--disable-features=IsolateOrigins,site-per-process",
    "--blink-settings=imagesEnabled=false"
]
for param in OPTIMIZATION_PARAMS:
    chrome_options.add_argument(param)
# Experimental preferences: a value of 2 means "block" (skip images and stylesheets)
chrome_options.add_experimental_option(
    "prefs", {
        "profile.managed_default_content_settings.images": 2,
        "permissions.default.stylesheet": 2
    }
)
Playwright Context Reuse Strategy
from playwright.sync_api import sync_playwright

def create_persistent_context(user_id):
    # start() keeps Playwright running after this function returns;
    # a `with` block would shut everything down as soon as it exited
    p = sync_playwright().start()
    context = p.chromium.launch_persistent_context(
        user_data_dir=f"./profiles/{user_id}",
        headless=True,
        args=["--disable-blink-features=AutomationControlled"]
    )
    context.add_cookies(load_cookies(user_id))  # load_cookies: project-specific helper
    return context
Caveats
Legal Compliance Checklist
- Check the site's robots.txt file
- Review the terms of service for clauses covering data collection
- Throttle request frequency (an interval of at least 5 seconds per request is suggested)
- Set a reasonable User-Agent header
- Anonymize any sensitive data collected
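Two of the checklist items above can be automated directly. A minimal sketch using the standard library's `urllib.robotparser` plus a simple request throttle; the robots.txt rules and the `MyCrawler/1.0` agent name are illustrative assumptions (in practice, fetch the site's real robots.txt first):

```python
import time
from urllib import robotparser

# Hypothetical rules for illustration only
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

def fetch_allowed(url, user_agent="MyCrawler/1.0"):
    # Checklist item 1: honor robots.txt before requesting
    return rp.can_fetch(user_agent, url)

class Throttle:
    """Checklist item 3: enforce a minimum interval between requests."""
    def __init__(self, min_interval=5.0):
        self.min_interval = min_interval
        self._last = 0.0

    def wait(self):
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()

print(fetch_allowed("https://example.com/private/data"))  # False
print(fetch_allowed("https://example.com/products"))      # True
```

Calling `throttle.wait()` immediately before each `requests.get` keeps the crawler under the suggested one-request-per-5-seconds ceiling regardless of how fast the surrounding loop runs.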
Exception Handling Template
import traceback

import requests
from requests.exceptions import ConnectionError, Timeout

def fetch(url, retry_count=0):
    # log_error, retry_fetch, CrawlerException, handle_rate_limit and
    # schedule_retry are project-specific helpers
    try:
        response = requests.get(url, timeout=10)
    except (ConnectionError, Timeout) as e:
        log_error(f"Network error: {e}")
        if retry_count < 3:
            return retry_fetch(url, retry_count + 1)
    except Exception:
        log_error(f"Unknown error: {traceback.format_exc()}")
        raise CrawlerException("Fatal Error")
    else:
        if response.status_code == 429:
            handle_rate_limit()   # server signalled rate limiting
        elif 500 <= response.status_code < 600:
            schedule_retry()      # transient server error, retry later
        return response
Appendix
Recommended Libraries
- requests + BeautifulSoup: static fetching and dynamic-load detection
- Selenium: browser automation
- Playwright: browser automation and network capture
- curlconverter: turn copied cURL commands into Python code
- mitmproxy: man-in-the-middle traffic analysis
Debugging Tips
- Use mitmproxy as a man-in-the-middle proxy to analyze encrypted traffic:
  mitmproxy -s decrypt_script.py
- Call browser APIs directly through the Chrome DevTools Protocol
- Replay requests captured in the browser (Copy as Node.js fetch)
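The `decrypt_script.py` addon referenced in the mitmproxy command above might look like this sketch. mitmproxy calls a module-level `response(flow)` hook for every completed response; since mitmproxy performs the TLS interception, `flow.response.text` is already the decrypted body. The `/api/` URL filter is an assumption for illustration:

```python
# decrypt_script.py -- sketch of an addon for `mitmproxy -s decrypt_script.py`
import json

captured = []  # accumulates decoded API responses across the session

def response(flow):
    """mitmproxy event hook: called once per completed HTTP response.

    `flow` is a mitmproxy http.HTTPFlow; TLS has already been stripped,
    so flow.response.text is the plaintext body.
    """
    if "/api/" in flow.request.pretty_url:  # URL filter is an assumption
        try:
            body = json.loads(flow.response.text)
        except json.JSONDecodeError:
            return  # skip non-JSON responses
        captured.append({
            "url": flow.request.pretty_url,
            "status": flow.response.status_code,
            "body": body,
        })
```

Run `mitmproxy -s decrypt_script.py`, point the scraper (or browser) at the proxy, and inspect `captured` interactively or dump it to disk from another hook.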
Document version: 1.2.0 | Last updated: 2023-12-15