How to Turn Messy Web Content into a Searchable Database


A naive idea of "scraping a page = saving its text" that eventually grew into a full-text-searchable history archive.

Preface: Why Build "Searchable Web Snapshots"?

If you have ever done news monitoring, financial sentiment modeling, or market intelligence collection, you have probably noticed:

Web pages are not permanent content; they change.

Financial news is especially prone to this. After an article goes out, it may happen that:

  • the headline quietly swaps a word;
  • a sentence suddenly disappears;
  • a "forecast" turns into a "report";
  • the comment section gets closed, or the body is rewritten...

If you only scrape in real time, you can only see what is happening right now.
But if you can do versioned scraping plus full-text search, the value jumps to a different level:

  • you can compare an article's revision history
  • you can run NLP / sentiment analysis on each version of the text
  • you can build event timelines and spot sentiment reversals
  • you can even do hedge-fund-grade analysis of how news changes

So today, from a practical angle, we will build a searchable web snapshot system, write adapters for 10 real financial news sites, and sync the latest content every 10 minutes.

Environment Setup: What Do We Need?

We keep it simple but not sloppy. The core stack:

  • HTTP requests: httpx (async, faster)
  • HTML parsing: BeautifulSoup4
  • Date handling: dateutil.parser
  • Storage: SQLite + an FTS5 full-text index
  • Proxy pool: a rotating crawler proxy (dynamic IPs)
  • Execution: run.sh on a schedule

Create requirements.txt (a copy is also included as a comment at the end of the script):

httpx
beautifulsoup4
lxml
python-dateutil
sqlite-utils

Core Idea: Understand the System in Three Layers

To avoid starting with one giant "black-hole script", let's first be clear about what the system has to do:

① Fetching (Fetcher Layer)

  • crawl the list pages of 10 financial news sites
  • collect every article URL
  • use rotating crawler-proxy IPs to avoid bans
  • rate limiting, retries on failure, concurrent requests (see the sketch below)
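
Before the full script, here is a minimal sketch of the retry and rate-limiting idea behind this layer. fetch_with_retry, MAX_RETRIES, and the backoff policy are illustrative names and choices, not part of the full script below, which keeps fetching simpler:

import asyncio
import httpx

MAX_RETRIES = 3
sem = asyncio.Semaphore(6)  # cap concurrent requests

async def fetch_with_retry(client: httpx.AsyncClient, url: str) -> str:
    async with sem:  # simple rate limiting via a semaphore
        for attempt in range(1, MAX_RETRIES + 1):
            try:
                r = await client.get(url, timeout=20.0)
                r.raise_for_status()
                return r.text
            except httpx.HTTPError as e:
                print(f"[retry {attempt}] {url} -> {e}")
                await asyncio.sleep(2 * attempt)  # linear backoff between attempts
    return ""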

② Parsing (Adapter Layer)

Every site is structured differently, for example:

  • Caixin: body in <div class="article">, publish time from <meta name=pubdate>
  • Eastmoney: body in <div id="articleContent">, publish time from <span class=time>
  • Sina Finance: possibly JS-rendered with several templates, publish time from <meta name=pubdate>

So each site gets its own parsing rules (the Adapter pattern), which keeps changes and maintenance manageable. A minimal sketch of the interface follows.
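
This sketch mirrors the ADAPTERS registry used in the full script below; the type aliases and the get_adapter helper are only for illustration:

from typing import Callable, Dict, List, Optional, Tuple

ListParser = Callable[[str, str], List[str]]   # (html, base_url) -> article URLs
ArticleParser = Callable[[str, str], Dict]     # (html, url) -> {'title', 'pubdate', 'content'}

# domain -> (list-page parser, article parser); one entry per monitored site
ADAPTERS: Dict[str, Tuple[ListParser, ArticleParser]] = {}

def get_adapter(domain: str) -> Optional[Tuple[ListParser, ArticleParser]]:
    # Callers fall back to a generic parser when a site has no dedicated rules.
    return ADAPTERS.get(domain)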

③ Storage in a Searchable Database (Storage Layer)

We don't just write rows to a database; we also need:

  • version management (when an article changes, create a new version instead of overwriting)
  • local gzip snapshots of the raw HTML
  • an FTS5 full-text search index

This step is the easiest one to skip, yet it is exactly what separates a "web snapshot system" from plain scraping. A query sketch follows.
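
To make the payoff concrete, here is a minimal sketch of searching the finished database. It assumes the articles / article_snapshots / article_fts tables created by init_db() in the full script below; note that FTS5's default tokenizer does not segment Chinese well, so for serious Chinese search you may want the trigram tokenizer or an external index:

import sqlite3

conn = sqlite3.connect("snapshots_data/snapshots.db")
cur = conn.cursor()
cur.execute(
    """
    SELECT s.pubdate, s.title, a.url
    FROM article_fts
    JOIN article_snapshots s ON s.id = article_fts.rowid
    JOIN articles a ON a.id = s.article_id
    WHERE article_fts MATCH ?
    ORDER BY s.created_at DESC
    LIMIT 10
    """,
    ("利率",),  # any FTS5 query string
)
for pubdate, title, url in cur.fetchall():
    print(pubdate, title, url)
conn.close()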

Full Code (Runnable As-Is)

Create the file: snapshot_crawler_adapters.py

# snapshot_crawler_adapters.py
"""
可搜索网页快照 - 适配器版
说明:
- 该脚本为逐步教程型的示例实现:针对 10 个真实财经站点提供「列表页抓取」与「详情页解析」的适配器(规则化实现,便于教学)。
- 使用异步 httpx + BeautifulSoup 抓取与解析;使用 SQLite + FTS5 存储全文和快照元数据;支持爬虫代理配置示例。
- 请务必替换 PROXY_* 常量为你自己的代理信息;并根据实际站点调整解析规则以提升准确度。

文件内也包含 requirements.txt 和 run.sh 的内容(见文件尾部注释),你可以把它们另存为单独文件。

注意:此示例适合教学与小规模监控,生产环境建议引入更健壮的错误处理、日志、队列与分布式架构。
"""

import asyncio
import hashlib
import gzip
import os
import re
import sqlite3
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List, Optional

import httpx
from bs4 import BeautifulSoup
from dateutil import parser as dateparser

# ================= Config (replace with your real settings) =============
# Rotating-proxy example (Yiniuyun, www.16yun.cn)
PROXY_HOST = "proxy.16yun.cn"
PROXY_PORT = "12345"
PROXY_USER = "your_username"
PROXY_PASS = "your_password"

# 10 example financial news sites (real domains); replace them with the sites you want to monitor
NEWS_SITES = [
    "https://www.caixin.com",       # 财新
    "https://www.yicai.com",        # 第一财经/一财
    "https://www.eastmoney.com",    # 东方财富网
    "http://finance.sina.com.cn",   # 新浪财经
    "https://www.hexun.com",        # 和讯
    "https://finance.ifeng.com",    # 凤凰财经
    "https://www.jrj.com.cn",       # 金融界
    "https://www.163.com",          # 网易(含财经频道)
    "https://www.cnstock.com",      # 上海证券报 / 证劵时报相关站点
    "https://www.xinhuanet.com/finance",  # 新华财经
]

DATA_DIR = Path("./snapshots_data")
HTML_DIR = DATA_DIR / "html"
DB_PATH = DATA_DIR / "snapshots.db"
SYNC_INTERVAL = 10 * 60  # 10 minutes

# ================ Initialization ================
DATA_DIR.mkdir(exist_ok=True)
HTML_DIR.mkdir(exist_ok=True)

# ================ Database setup ================
def init_db():
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    cur.execute("""
    CREATE TABLE IF NOT EXISTS articles (
        id INTEGER PRIMARY KEY,
        url TEXT UNIQUE,
        first_seen TEXT,
        last_seen TEXT,
        sector TEXT
    )
    """)
    cur.execute("""
    CREATE TABLE IF NOT EXISTS article_snapshots (
        id INTEGER PRIMARY KEY,
        article_id INTEGER,
        version INTEGER,
        title TEXT,
        authors TEXT,
        pubdate TEXT,
        content TEXT,
        raw_html_path TEXT,
        content_hash TEXT,
        created_at TEXT,
        FOREIGN KEY(article_id) REFERENCES articles(id)
    )
    """)
    cur.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS article_fts USING fts5(title, content, content='article_snapshots', content_rowid='id')
    """)
    conn.commit()
    conn.close()

# ================ Utility functions ================

def now_iso() -> str:
    return datetime.now(timezone.utc).isoformat()


def compute_hash(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()


def gzip_save(path: Path, data: bytes):
    with gzip.open(path, 'wb') as f:
        f.write(data)

# Simple sector keyword map (extend it, or swap in a trained classifier)
SECTOR_KEYWORDS = {
    '宏观': ['央行', '利率', '通胀', 'GDP', '宏观', '货币政策'],
    '股票': ['上市', 'IPO', '股价', 'A股', '港股', '涨停', '跌停'],
    '债券': ['债券', '国债', '票息', '利差'],
    '外汇': ['美元', '人民币', '汇率', '外汇'],
    '基金': ['基金', '公募', '私募', '基金经理'],
    '公司': ['并购', '重组', '业绩', '财报', '公司', '公告'],
}


def classify_sector(title: str, text: str) -> str:
    combined = (title or '') + ' ' + (text or '')
    for sector, kws in SECTOR_KEYWORDS.items():
        for kw in kws:
            if kw in combined:
                return sector
    return '其他'

# ================ Site adapters: one set of parsing rules per site ================
# Design: each site implements two functions:
# - fetch_list_links(html, base_url) -> List[str]
# - parse_article(html, url) -> Dict(title, pubdate, content)
# This keeps per-site maintenance and tuning easy. The rules below are a starting point;
# adjust them to each site's actual HTML structure in real use.

# 1. Caixin caixin.com
def caixin_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.caixin.com' + href
        if 'caixin.com' in href and ('news' in href or '/202' in href or '/article/' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:50]


def caixin_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = (soup.find('meta', property='og:title') or soup.find('title'))
    title_text = title.get('content') if title and title.has_attr('content') else (title.string.strip() if title else '')
    # Caixin usually exposes meta article:published_time
    pub = soup.find('meta', {'name': 'pubdate'}) or soup.find('meta', {'property': 'article:published_time'})
    pub_text = pub.get('content') if pub and pub.has_attr('content') else None
    article_tag = soup.find('div', {'class': re.compile('article.*')}) or soup.find('article')
    text = article_tag.get_text(separator='\n').strip() if article_tag else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 2. Yicai yicai.com
def yicai_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.yicai.com' + href
        if 'yicai.com' in href and ('news' in href or '/content/' in href or '/roll' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:50]


def yicai_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('meta', property='og:title') or soup.find('title')
    title_text = title.get('content') if title and title.has_attr('content') else (title.string.strip() if title else '')
    pub = soup.find('meta', {'name': 'publish_time'}) or soup.find('time')
    pub_text = pub.get('content') if pub and pub.has_attr('content') else (pub.string if pub else None)
    article = soup.find('div', {'class': re.compile('article-main|article-content|article')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 3. Eastmoney eastmoney.com
def eastmoney_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.eastmoney.com' + href
        if 'eastmoney.com' in href and ('finance' in href or 'stock' in href or '/a' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:60]


def eastmoney_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title_node = soup.find('meta', property='og:title') or soup.find('title')
    title_text = title_node.get('content') if title_node and title_node.has_attr('content') else (title_node.string.strip() if title_node else '')
    pub = soup.find('meta', {'name': 'pubdate'}) or soup.find('span', {'class': re.compile('time|pub')})
    pub_text = None
    if pub:
        pub_text = pub.get('content') if pub and pub.has_attr('content') else (pub.string if pub else None)
    article = soup.find('div', {'class': re.compile('article|content|text')}) or soup.find('div', {'id': 'articleContent'})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 4. Sina Finance finance.sina.com.cn
def sina_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'http://finance.sina.com.cn' + href
        if 'finance.sina.com.cn' in href and ("news" in href or "/xw" in href or '/chn/' in href or '/stock/' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:60]


def sina_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title_node = soup.find('meta', property='og:title') or soup.find('h1')
    title_text = title_node.get('content') if title_node and title_node.has_attr('content') else (title_node.string.strip() if title_node else '')
    # Sina pages usually carry a time/date node
    time_node = soup.find('span', {'class': re.compile('time|date')}) or soup.find('meta', {'name': 'publishdate'})
    pub_text = time_node.get('content') if time_node and time_node.has_attr('content') else (time_node.string if time_node else None)
    article = soup.find('div', {'class': re.compile('article|article-content|blk_container')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 5. Hexun hexun.com
def hexun_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.hexun.com' + href
        if 'hexun.com' in href and ('news' in href or 'stock' in href or '/201' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:50]


def hexun_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('meta', property='og:title') or soup.find('title')
    title_text = title.get('content') if title and title.has_attr('content') else (title.string.strip() if title else '')
    pub = soup.find('span', {'class': re.compile('time|date')}) or soup.find('meta', {'name': 'pubdate'})
    pub_text = pub.get('content') if pub and pub.has_attr('content') else (pub.string if pub else None)
    article = soup.find('div', {'class': re.compile('article|content|artContext')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 6. ifeng Finance finance.ifeng.com
def ifeng_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://finance.ifeng.com' + href
        if 'ifeng.com' in href and ('finance' in href or '/a/' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:60]


def ifeng_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('meta', property='og:title') or soup.find('title')
    title_text = title.get('content') if title and title.has_attr('content') else (title.string.strip() if title else '')
    pub = soup.find('span', {'class': re.compile('pubtime|time')}) or soup.find('meta', {'property': 'article:published_time'})
    pub_text = pub.get('content') if pub and pub.has_attr('content') else (pub.string if pub else None)
    article = soup.find('div', {'class': re.compile('article|main')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 7. JRJ jrj.com.cn
def jrj_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.jrj.com.cn' + href
        if 'jrj.com.cn' in href and ('news' in href or '/201' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:50]


def jrj_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('meta', property='og:title') or soup.find('title')
    title_text = title.get('content') if title and title.has_attr('content') else (title.string.strip() if title else '')
    pub = soup.find('span', {'class': re.compile('time|date')})
    pub_text = pub.string if pub else None
    article = soup.find('div', {'class': re.compile('article|content')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 8. NetEase 163.com (example: collect finance-channel links)
def netease_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.163.com' + href
        if ('163.com' in href) and ('/news/' in href or '/finance/' in href or '/stock/' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:60]


def netease_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('meta', property='og:title') or soup.find('h1')
    title_text = title.get('content') if title and title.has_attr('content') else (title.string.strip() if title else '')
    pub = soup.find('meta', {'name': 'pubdate'}) or soup.find('span', {'class': re.compile('time|date')})
    pub_text = pub.get('content') if pub and pub.has_attr('content') else (pub.string if pub else None)
    article = soup.find('div', {'class': re.compile('post_body|article|content')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 9. China Securities Network / cnstock.com
def cnstock_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'http://www.cnstock.com' + href
        if 'cnstock.com' in href and ('news' in href or '/article/' in href or '/a/' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:50]


def cnstock_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title = soup.find('h1') or soup.find('meta', property='og:title')
    title_text = title.string.strip() if title and getattr(title, 'string', None) else (title.get('content') if title and title.has_attr('content') else '')
    pub = soup.find('span', {'class': re.compile('time|pub')})
    pub_text = pub.string if pub else None
    article = soup.find('div', {'class': re.compile('article|content|main')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# 10. Xinhua Finance xinhuanet.com/finance
def xinhuanet_fetch_list(html: str, base_url: str) -> List[str]:
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = 'https://www.xinhuanet.com' + href
        if 'xinhuanet.com' in href and ('finance' in href or '/fortune/' in href or '/stock/' in href):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:60]


def xinhuanet_parse_article(html: str, url: str) -> Dict:
    soup = BeautifulSoup(html, 'lxml')
    title_node = soup.find('meta', property='og:title') or soup.find('h1')
    title_text = title_node.get('content') if title_node and title_node.has_attr('content') else (title_node.string.strip() if title_node else '')
    pub = soup.find('meta', {'name': 'pubdate'}) or soup.find('div', {'class': re.compile('pubtime')})
    pub_text = pub.get('content') if pub and pub.has_attr('content') else (pub.string if pub else None)
    article = soup.find('div', {'class': re.compile('text|article|box')})
    text = article.get_text(separator='\n').strip() if article else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# ================ Adapter registry ================
ADAPTERS = {
    'www.caixin.com': (caixin_fetch_list, caixin_parse_article),
    'www.yicai.com': (yicai_fetch_list, yicai_parse_article),
    'www.eastmoney.com': (eastmoney_fetch_list, eastmoney_parse_article),
    'finance.sina.com.cn': (sina_fetch_list, sina_parse_article),
    'www.hexun.com': (hexun_fetch_list, hexun_parse_article),
    'finance.ifeng.com': (ifeng_fetch_list, ifeng_parse_article),
    'www.jrj.com.cn': (jrj_fetch_list, jrj_parse_article),
    'www.163.com': (netease_fetch_list, netease_parse_article),
    'www.cnstock.com': (cnstock_fetch_list, cnstock_parse_article),
    'www.xinhuanet.com': (xinhuanet_fetch_list, xinhuanet_parse_article),
}

# ================ Fetch & store core logic ================
async def fetch_html(client: httpx.AsyncClient, url: str) -> str:
    try:
        r = await client.get(url, timeout=20.0)
        r.raise_for_status()
        return r.text
    except Exception as e:
        print(f"[fetch error] {url} -> {e}")
        return ''


def store_snapshot(parsed: Dict, raw_html: str):
    conn = sqlite3.connect(DB_PATH)
    cur = conn.cursor()
    url = parsed['url']
    cur.execute('SELECT id FROM articles WHERE url = ?', (url,))
    row = cur.fetchone()
    now = now_iso()
    if row:
        article_id = row[0]
        cur.execute('UPDATE articles SET last_seen = ? WHERE id = ?', (now, article_id))
    else:
        cur.execute('INSERT INTO articles (url, first_seen, last_seen) VALUES (?, ?, ?)', (url, now, now))
        article_id = cur.lastrowid
    content_norm = (parsed.get('title') or '') + '\n' + (parsed.get('content') or '')
    content_hash = compute_hash(content_norm)
    cur.execute('SELECT content_hash FROM article_snapshots WHERE article_id = ? ORDER BY id DESC LIMIT 1', (article_id,))
    last = cur.fetchone()
    if last and last[0] == content_hash:
        conn.commit()
        conn.close()
        print(f"[skip] no change for {url}")
        return
    fname = f"{article_id}_{int(datetime.now().timestamp())}.html.gz"
    fpath = HTML_DIR / fname
    gzip_save(fpath, raw_html.encode('utf-8'))
    # Version numbers increment per article, so edits create new versions instead of overwriting
    cur.execute('SELECT COALESCE(MAX(version), 0) FROM article_snapshots WHERE article_id = ?', (article_id,))
    next_version = cur.fetchone()[0] + 1
    cur.execute('INSERT INTO article_snapshots (article_id, version, title, authors, pubdate, content, raw_html_path, content_hash, created_at) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)',
                (article_id, next_version, parsed.get('title'), '', parsed.get('pubdate'), parsed.get('content'), str(fpath), content_hash, now))
    snap_id = cur.lastrowid
    cur.execute('INSERT INTO article_fts(rowid, title, content) VALUES (?, ?, ?)', (snap_id, parsed.get('title'), parsed.get('content')))
    sector = classify_sector(parsed.get('title') or '', parsed.get('content') or '')
    cur.execute('UPDATE articles SET sector = ? WHERE id = ?', (sector, article_id))
    conn.commit()
    conn.close()
    print(f"[saved] {url} -> sector={sector}")

# ================ List extraction: pick an adapter by base domain ================
def extract_links_from_home(html: str, base_url: str) -> List[str]:
    # Infer the domain from base_url and call the matching adapter
    domain = base_url.replace('https://', '').replace('http://', '').split('/')[0]
    fetcher = ADAPTERS.get(domain, None)
    if fetcher:
        return fetcher[0](html, base_url)
    # Generic fallback: collect all a[href] links that contain common news keywords
    soup = BeautifulSoup(html, 'lxml')
    links = []
    for a in soup.select('a[href]'):
        href = a['href']
        if href.startswith('/'):
            href = base_url.rstrip('/') + href
        if base_url.split('//')[1].split('/')[0] in href and any(k in href for k in ('news', 'article', 'finance', 'stock')):
            links.append(href.split('#')[0])
    return list(dict.fromkeys(links))[:60]

# ================ Article parsing: pick a parser by URL domain ================
def parse_article_by_domain(html: str, url: str) -> Dict:
    domain = url.replace('https://', '').replace('http://', '').split('/')[0]
    adapter = ADAPTERS.get(domain, None)
    if adapter:
        return adapter[1](html, url)
    # Generic fallback parsing
    soup = BeautifulSoup(html, 'lxml')
    title_node = soup.find('meta', property='og:title') or soup.find('title') or soup.find('h1')
    title_text = title_node.get('content') if title_node and title_node.has_attr('content') else (title_node.string.strip() if title_node else '')
    # Try common publish-time nodes
    pub = soup.find('meta', {'property': 'article:published_time'}) or soup.find('time') or soup.find('span', {'class': re.compile('time|date')})
    pub_text = pub.get('content') if pub and getattr(pub, 'get', None) and pub.has_attr('content') else (pub.string if pub else None)
    article_tag = soup.find('article') or soup.find('div', {'class': re.compile('article|content|main')})
    text = article_tag.get_text(separator='\n').strip() if article_tag else (soup.body.get_text(separator='\n').strip() if soup.body else '')
    pub_iso = None
    if pub_text:
        try:
            pub_iso = dateparser.parse(pub_text).astimezone(timezone.utc).isoformat()
        except Exception:
            pub_iso = None
    return {'url': url, 'title': title_text, 'pubdate': pub_iso, 'content': text}

# ================ Main crawl logic ================
async def crawl_site(client: httpx.AsyncClient, site_url: str):
    print(f"[crawl start] {site_url}")
    list_html = await fetch_html(client, site_url)
    if not list_html:
        return
    links = extract_links_from_home(list_html, site_url)
    # Limit detail-page concurrency
    sem = asyncio.Semaphore(6)

    async def fetch_and_handle(url: str):
        async with sem:
            html = await fetch_html(client, url)
            if not html:
                return
            parsed = parse_article_by_domain(html, url)
            store_snapshot(parsed, html)

    tasks = [asyncio.create_task(fetch_and_handle(u)) for u in links]
    await asyncio.gather(*tasks)
    print(f"[crawl done] {site_url}")

# ================ Scheduler ================
async def scheduler():
    proxy_url = f"http://{PROXY_USER}:{PROXY_PASS}@{PROXY_HOST}:{PROXY_PORT}"
    limits = httpx.Limits(max_connections=100, max_keepalive_connections=20)
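    # Note: the proxies= argument works with httpx versions before 0.28; newer httpx replaces it with proxy= / mounts=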
    async with httpx.AsyncClient(proxies={'all://': proxy_url}, timeout=30.0, limits=limits) as client:
        while True:
            print(f"[sync tick] {datetime.now().isoformat()}")
            tasks = [asyncio.create_task(crawl_site(client, site)) for site in NEWS_SITES]
            # Wait for all sites to finish
            await asyncio.gather(*tasks)
            print(f"[sync complete] sleeping for {SYNC_INTERVAL} seconds")
            await asyncio.sleep(SYNC_INTERVAL)

# ================ Entry point ================
def main():
    init_db()
    try:
        asyncio.run(scheduler())
    except KeyboardInterrupt:
        print('stopped by user')

if __name__ == '__main__':
    main()

# ================== requirements.txt (copy into a separate file) ==================
#
# httpx[http2]
# beautifulsoup4
# lxml
# python-dateutil
# aiofiles
# sqlite-utils
#
# Save as requirements.txt

# ================== run.sh (copy into a separate file and make it executable) ==================
#
# #!/usr/bin/env bash
# python -m venv .venv
# source .venv/bin/activate
# pip install -r requirements.txt
# python snapshot_crawler_adapters.py
#
# Save as run.sh, then run:
# chmod +x run.sh
# ./run.sh

How to Run

Create run.sh:

#!/bin/bash
# The Python script already loops every 10 minutes on its own (see scheduler());
# this wrapper simply restarts it if it ever exits or crashes.
while true; do
    echo "[RUN] $(date) syncing financial news sites..."
    python3 snapshot_crawler_adapters.py
    echo "[WAIT] sleeping 10 minutes before restarting..."
    sleep 600
done

On Linux/macOS, make it executable:

chmod +x run.sh

Test Run

For the first run, launch the script manually once:

python3 snapshot_crawler_adapters.py

If you see output along these lines (the messages come straight from the script's print calls):

[crawl start] http://finance.sina.com.cn
[saved] http://finance.sina.com.cn/... -> sector=股票
[skip] no change for https://www.eastmoney.com/...
[sync complete] sleeping for 600 seconds

then your snapshot system is up and "working".
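
Once a couple of sync cycles have run, you can also check that versioning works by diffing two snapshots of the same article. A minimal sketch using the standard library's difflib (the article_id value of 1 is just an example):

import difflib
import sqlite3

conn = sqlite3.connect("snapshots_data/snapshots.db")
cur = conn.cursor()
# The two most recent snapshots of one article
cur.execute(
    "SELECT version, content FROM article_snapshots WHERE article_id = ? ORDER BY id DESC LIMIT 2",
    (1,),
)
rows = cur.fetchall()
if len(rows) == 2:
    (v_new, new_text), (v_old, old_text) = rows
    diff = difflib.unified_diff(
        old_text.splitlines(), new_text.splitlines(),
        fromfile=f"v{v_old}", tofile=f"v{v_new}", lineterm="",
    )
    print("\n".join(diff))
conn.close()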

Common Pitfalls & Fixes

  • The content hash changes too often. Cause: ads and embedded timestamps pollute the text. Fix: clean the text with regexes and extract from a stable DOM node first (see the sketch below).
  • Some links cannot be parsed. Cause: sites use many different templates. Fix: layer several parsing strategies with a generic fallback.
  • The SQLite file keeps growing. Cause: full text and snapshots accumulate. Fix: keep raw HTML as gzip files on disk (as the script already does) and run VACUUM weekly.
  • Your IP gets banned. Cause: many sites crawled concurrently with no throttling. Fix: use a rotating proxy, add random delays, and send browser-like headers.
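
For the first pitfall, here is a minimal sketch of normalizing the extracted text before hashing; the regexes are illustrative and should be tuned to your sites:

import hashlib
import re

def normalize_for_hash(text: str) -> str:
    """Strip volatile fragments so cosmetic changes don't create new versions."""
    # drop date/time stamps such as 2024-01-02 13:45:00
    text = re.sub(r"\d{4}[-/]\d{1,2}[-/]\d{1,2}([ T]\d{1,2}:\d{2}(:\d{2})?)?", "", text)
    # drop common editor/ad boilerplate lines (extend this list per site)
    text = re.sub(r"(责任编辑|广告|点击查看).*", "", text)
    # collapse whitespace
    return re.sub(r"\s+", " ", text).strip()

def stable_hash(title: str, content: str) -> str:
    return hashlib.sha256(normalize_for_hash(f"{title}\n{content}").encode("utf-8")).hexdigest()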

Going Further

If you want to grow this project into a production-grade sentiment monitoring system, a news knowledge base, or an AI research data source, here are some upgrade paths:

  • Distributed scheduling: Celery / Airflow / Redis Queue
  • NLP processing: spaCy / Transformers / sentiment models
  • Analytics dashboards: Elasticsearch + Kibana / OpenSearch
  • UI: Streamlit / Gradio / Django Admin

Closing: From Scraping to a "Time Machine"

Most people scrape the web to "get a copy of the data".

But financial news has one special property:

What matters is not only what an article says, but when it changed and why it changed.

Once you start capturing version changes, you are no longer just writing scrapers; you are building an information timeline system.

That is the real value of a "searchable web snapshot system".