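Multilingual Page Scraping Strategies for E-commerce

Abstract: This article analyzes the character-set, page-layout, and localization problems that come up when crawling global e-commerce sites, and offers an improved craw…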

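Preface

If you have ever scraped a global e-commerce site like Amazon, you have probably had this maddening experience:
the same product link shows English on the US site, full-width characters on the Japanese site, and on the German site the € sign comes out garbled.
Worst of all, your crawler "confidently" prints a pile of garbled text without raising a single error.

Problems like these usually do not come from a coding mistake; they come from overlooking character sets, page-layout differences, and localization strategy.
Today we skip the "perfect solution" and go the other way: look at a flawed example, work out why it falls into these traps, and then fix it into a "sturdier version".

1. The Flawed Example

You have probably seen code like this on StackOverflow. It runs, but it brings the pitfalls along with it.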

import requests
from bs4 import BeautifulSoup

url = "https://www.amazon.com/dp/B08N5WRWNW"
headers = {"User-Agent": "my-bot/1.0"}
resp = requests.get(url, headers=headers, timeout=10)

# Problem: uses resp.text directly, whatever the actual encoding is
soup = BeautifulSoup(resp.text, "html.parser")
title = soup.select_one("#productTitle").get_text(strip=True)
print("Title:", title)

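2. What Went Wrong?

On the surface the logic is fine: request → parse → extract the title.
In practice, though, the following "disasters" can occur:

Mojibake hell
Different regional Amazon sites do not necessarily return the same encoding. You assume UTF-8, but the response may be declared as ISO-8859-1 or arrive with no charset at all, in which case requests falls back to ISO-8859-1 for text responses. resp.text then decodes with whatever encoding it settled on, and the text comes out looking like gibberish.
Broken selectors
On other language versions or on the mobile layout, #productTitle may simply not exist. select_one then returns None, and the .get_text() call raises AttributeError, so the script dies as soon as it runs.
Instant blocking
With no proxy, no delay, and no cookies, Amazon quickly recognizes that this is not a normal user request. At best you get a blank page; at worst you are redirected to a CAPTCHA or your IP is restricted.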
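The "garbled but no error" symptom can be reproduced offline. The sketch below is only an illustration: the sample strings and the helper names mis_decode and repair are made up for this example, and it assumes the wrong guess is ISO-8859-1 applied to UTF-8 bytes.

```python
# Simulate a UTF-8 response body that is decoded with the wrong charset.
samples = ["€ 19,99", "ワイヤレスイヤホン"]  # made-up sample strings


def mis_decode(text: str) -> str:
    """Encode as UTF-8, then decode as ISO-8859-1, as a wrong charset guess would."""
    return text.encode("utf-8").decode("iso-8859-1")


def repair(garbled: str) -> str:
    """Undo the wrong guess: recover the raw bytes, then decode them as UTF-8."""
    return garbled.encode("iso-8859-1").decode("utf-8")


for original in samples:
    garbled = mis_decode(original)
    # No exception is raised: ISO-8859-1 maps every byte to some character,
    # which is why the crawler prints nonsense instead of failing.
    print(repr(original), "->", repr(garbled))
    # The original bytes are still intact, so the text is recoverable.
    assert repair(garbled) == original
```

Because ISO-8859-1 never fails to decode, a wrong guess stays silent; the damage only shows up when a human reads the output.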

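3. The Fix

Now let's fix it step by step.

Goals:

Detect the page encoding automatically;
Support a proxy (a crawler proxy service is used as the example);
Tolerate layout differences across languages;
Keep the request rate reasonable.

Environment setup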

pip install requests bs4 chardet

Improved code example

# robust_amazon_scraper.py
import time
import random
import requests
import chardet
from bs4 import BeautifulSoup

# === Proxy configuration (16yun / 亿牛云 used as the example, www.16yun.cn) ===
proxy_host = "proxy.16yun.cn"
proxy_port = 3100
proxy_user = "16YUN"
proxy_pass = "16IP"

proxy_url = f"http://{proxy_user}:{proxy_pass}@{proxy_host}:{proxy_port}"
proxies = {"http": proxy_url, "https": proxy_url}

# === Request headers ===
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

# === Helper functions ===
def check_robots():
    """Fetch robots.txt and print its first lines for a quick manual check."""
    try:
        r = requests.get("https://www.amazon.com/robots.txt", timeout=8)
        print("robots.txt status:", r.status_code)
        print("\n".join(r.text.splitlines()[:10]))
    except requests.RequestException as e:
        print("Failed to read robots.txt:", e)

def detect_encoding(resp):
    """Trust a declared charset; if requests fell back to ISO-8859-1, ask chardet."""
    if resp.encoding and resp.encoding.lower() != "iso-8859-1":
        return resp.encoding
    detected = chardet.detect(resp.content)
    return detected.get("encoding") or "utf-8"

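def safe_get(url, max_retries=3):
    """Request with retries and a random wait between attempts."""
    for attempt in range(max_retries):
        try:
            r = requests.get(url, headers=HEADERS, proxies=proxies, timeout=15)
            # Treat very short bodies as failures (likely a blank or block page)
            if r.status_code == 200 and len(r.content) > 200:
                return r
            print("Unexpected response, status:", r.status_code)
        except requests.RequestException as e:
            print("Request failed:", e)
        # Skip the pointless wait after the final attempt
        if attempt < max_retries - 1:
            time.sleep(random.uniform(1, 3))
    raise RuntimeError("Request failed after multiple retries")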

def parse_amazon_page(html):
    """Try several selectors so different languages/layouts still parse."""
    soup = BeautifulSoup(html, "html.parser")
    title = None
    for sel in ["#productTitle", "h1 span#productTitle", ".a-size-large.a-color-base.a-text-normal"]:
        node = soup.select_one(sel)
        if node:
            title = node.get_text(strip=True)
            break

    price = None
    for sel in ["#priceblock_ourprice", ".a-price .a-offscreen", "#corePriceDisplay_desktop_feature_div .a-offscreen"]:
        node = soup.select_one(sel)
        if node:
            price = node.get_text(strip=True)
            break

    return title, price

def scrape_product(url):
    check_robots()
    resp = safe_get(url)
    encoding = detect_encoding(resp)
    html = resp.content.decode(encoding, errors="replace")
    title, price = parse_amazon_page(html)
    return {"title": title, "price": price, "encoding": encoding}

if __name__ == "__main__":
    result = scrape_product("https://www.amazon.com/dp/B08N5WRWNW")
    print(result)

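Here is what this code does:

Check the encoding
When requests has fallen back to ISO-8859-1, use chardet to guess the real encoding instead of blindly trusting resp.encoding.
Random waits
safe_get waits a random 1 to 3 seconds between attempts, which keeps the request rhythm closer to a real user's.
Selectors
Amazon's page structure varies a lot between regions, so several fallback selectors are far more reliable than one hard-coded selector.
Proxy
A proxy pool service such as a crawler proxy avoids the problem of a single IP being restricted.
(Of course, the proxy is not there to "dodge anti-scraping"; it is basic infrastructure for distributed jobs.)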
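A statistical guess from chardet can be unreliable on short pages, so one cheap complement to the encoding check is to try a strict UTF-8 decode before anything else. This is a different technique from the chardet-based detect_encoding above, not a description of it. The function name decode_body and its return shape are assumptions made for this sketch.

```python
from typing import Optional, Tuple


def decode_body(raw: bytes, declared: Optional[str] = None) -> Tuple[str, str]:
    """Return (text, encoding_used) for a response body.

    Order: strict UTF-8, then the declared charset, then a lossy UTF-8 decode.
    """
    candidates = ["utf-8"]
    if declared and declared.lower() not in ("utf-8", "utf8"):
        candidates.append(declared)
    for enc in candidates:
        try:
            # Strict decoding raises on invalid bytes instead of guessing.
            return raw.decode(enc), enc
        except (UnicodeDecodeError, LookupError):
            # LookupError covers a charset name Python does not know.
            continue
    # Last resort: keep going, but mark the result as lossy.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


if __name__ == "__main__":
    # UTF-8 bytes mislabelled as ISO-8859-1 still decode correctly.
    print(decode_body("€ 19,99".encode("utf-8"), "iso-8859-1"))
    # Genuine Shift_JIS bytes fail strict UTF-8 and use the declared charset.
    print(decode_body("ワイヤレス".encode("shift_jis"), "shift_jis"))
```

With requests, a call would look like decode_body(resp.content, resp.encoding). Strict UTF-8 goes first because non-UTF-8 bytes almost never form valid UTF-8 by accident, whereas ISO-8859-1 accepts any byte sequence and so can never signal a wrong guess.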

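4. Practical Advice

Multilingual ≠ multiple pages
Regional Amazon pages can differ not just in language but also in layout and field naming.
The best strategy is to maintain a separate parsing template per region.
Make encoding checks a habit
When Chinese text, the euro sign, emoji, or similar content goes wrong, it is most likely an encoding problem.
Get into the habit of looking at chardet.detect() first.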
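The per-region template idea could look roughly like the sketch below. Everything in REGION_TEMPLATES is hypothetical: the selectors are the ones already used in this article, but which selector suits which storefront, and the Accept-Language values, are assumptions that would have to be verified against real pages.

```python
from urllib.parse import urlparse

# Hypothetical per-region templates, keyed by hostname.
REGION_TEMPLATES = {
    "www.amazon.com": {
        "accept_language": "en-US,en;q=0.9",
        "title": ["#productTitle"],
        "price": [".a-price .a-offscreen", "#priceblock_ourprice"],
    },
    "www.amazon.co.jp": {
        "accept_language": "ja-JP,ja;q=0.9",
        "title": ["#productTitle", "h1 span#productTitle"],
        "price": [".a-price .a-offscreen"],
    },
    "www.amazon.de": {
        "accept_language": "de-DE,de;q=0.9",
        "title": ["#productTitle"],
        "price": ["#corePriceDisplay_desktop_feature_div .a-offscreen"],
    },
}
DEFAULT_REGION = "www.amazon.com"


def template_for(url: str) -> dict:
    """Pick the parsing template by hostname, falling back to the default."""
    host = (urlparse(url).hostname or "").lower()
    return REGION_TEMPLATES.get(host, REGION_TEMPLATES[DEFAULT_REGION])


if __name__ == "__main__":
    tpl = template_for("https://www.amazon.co.jp/dp/B08N5WRWNW")
    print(tpl["accept_language"], tpl["title"])
```

Keeping the selectors in data rather than inside the parsing function means a layout change on one storefront only touches that region's entry.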