I Built a Daily Tech Digest Bot with the Python Standard Library: Zero Dependencies, Zero Maintenance Cost

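Every morning at 7:00 it runs on its own, and when I open the page I see what happened overnight across the global tech scene. No RSS aggregator, no paid subscription, no AI token spend: just the standard library plus two public APIs.

This post covers its architecture, the core code, and the pitfalls I hit along the way.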

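Background: why build this

My morning "information intake ritual": open Hacker News → skim for 10 minutes → open GitHub Trending → skim for another 5 → open a few interesting links → half an hour gone.

Those 30 minutes aren't reading time, they're filtering time. Only a handful of items are actually worth it.

The solution is simple: let a machine do the filtering, and I only read the result.

Goals:

Automatically collect the HN Top 15 plus 12 GitHub Trending repositories every day
Generate a Markdown digest plus a trend analysis
Push it to my GitHub Pages site
Run without me watching it at all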

Architecture at a glance

07:00 cron trigger
    │
    ▼
generator.py          ← HN Firebase API + GitHub Search API
    │   output: posts/2026-03-25.md
    │           data/hn-stories-*.json
    │           data/github-trending-*.json
    ▼
analyzer.py           ← reads today's data/*.json
    │   output: data/analysis-2026-03-25.json
    │           (hot categories, keyword frequency, trend signals)
    ▼
publish_to_github_pages.py
    │   git add + commit + push → GitHub Pages updates
    ▼
Done, go back to sleep

No database, no Redis, no message queue. Just three Python scripts chained together.
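
The chaining itself can be as small as a loop over the three scripts. A minimal sketch, assuming the scripts sit in the working directory and are launched with the current interpreter; `run_pipeline` and its `runner` parameter are hypothetical names for illustration, not taken from the project:

```python
import subprocess
import sys

# Script names come from the diagram above; the order matters.
PIPELINE = ["generator.py", "analyzer.py", "publish_to_github_pages.py"]

def run_pipeline(scripts=PIPELINE, runner=subprocess.run):
    """Run each script in order and stop at the first failure.

    `runner` is injectable (an assumption for illustration) so the
    chaining logic can be exercised without launching real processes.
    Returns the list of scripts that completed successfully.
    """
    completed = []
    for script in scripts:
        result = runner([sys.executable, script])
        if result.returncode != 0:
            # Don't analyze or publish on top of a failed earlier step.
            break
        completed.append(script)
    return completed
```

Stopping at the first non-zero exit code means a failed fetch never leads to an empty digest being committed and pushed.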

Core code: fetching the data

HN data (official Firebase API, free, no rate limit)

import urllib.request
import json

def fetch_top_stories(limit=15):
    """Fetch the current HN top stories."""
    # Step 1: get the list of story IDs
    url = "https://hacker-news.firebaseio.com/v0/topstories.json"
    with urllib.request.urlopen(url, timeout=10) as r:
        story_ids = json.loads(r.read())[:limit * 2]  # take extra in case some are deleted or filtered out
    
    stories = []
    for sid in story_ids:
        try:
            item_url = f"https://hacker-news.firebaseio.com/v0/item/{sid}.json"
            with urllib.request.urlopen(item_url, timeout=5) as r:
                item = json.loads(r.read())
            # Keep only real stories that have a url (text-only posts such as Ask HN are skipped)
            if item and item.get('type') == 'story' and item.get('url'):
                stories.append({
                    'title': item['title'],
                    'url': item['url'],
                    'score': item.get('score', 0),
                    'comments': item.get('descendants', 0),
                    'by': item.get('by', ''),
                })
                if len(stories) >= limit:
                    break
        except Exception:
            continue
    
    return stories

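Key detail: the Firebase API has no rate limit (as long as you don't hammer it with concurrent requests), but every item needs its own request. 15 stories means at least 16 HTTP requests: one for the ID list plus one per item, and more whenever an item is skipped. With that restraint on concurrency, a run takes roughly 10-15 seconds.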
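
If the per-item requests ever become the bottleneck, one option is a small, capped thread pool rather than unbounded parallelism. This is a sketch under assumptions: `fetch_items`, `fetch_one`, and the `max_workers=4` cap are illustrative choices, not the project's code and not a limit documented by the API.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ITEM_URL = "https://hacker-news.firebaseio.com/v0/item/{}.json"

def fetch_one(sid, timeout=5):
    """Fetch a single item; return None on any network or parse error."""
    try:
        with urllib.request.urlopen(ITEM_URL.format(sid), timeout=timeout) as r:
            return json.loads(r.read())
    except Exception:
        return None

def fetch_items(story_ids, fetch=fetch_one, max_workers=4):
    """Fetch items with at most `max_workers` requests in flight.

    pool.map preserves input order, so the ranking from topstories.json
    is kept. Failed or deleted items (None) are dropped.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return [item for item in pool.map(fetch, story_ids) if item]
```

The `fetch` parameter exists only so the function can be tried with a fake fetcher; filtering by `type` and `url` would still happen afterwards, as in `fetch_top_stories`.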

GitHub Trending (approximated with the Search API)

import urllib.parse  # needed for urllib.parse.quote below
from datetime import datetime, timedelta

def fetch_github_trending(limit=12):
    """Approximate GitHub Trending via the GitHub Search API."""
    # GitHub has no official Trending API.
    # Approximate it with created:>(7 days ago) + sort=stars.
    since = (datetime.now() - timedelta(days=7)).strftime('%Y-%m-%d')
    query = f"created:>{since}"
    
    url = (
        f"https://api.github.com/search/repositories"
        f"?q={urllib.parse.quote(query)}"
        f"&sort=stars&order=desc&per_page={limit}"
    )
    
    req = urllib.request.Request(url, headers={
        'Accept': 'application/vnd.github.v3+json',
        'User-Agent': 'content-producer/1.0'
    })
    
    with urllib.request.urlopen(req, timeout=15) as r:
        data = json.loads(r.read())
    
    repos = []
    for item in data.get('items', []):
        repos.append({
            'name': item['full_name'],
            'description': item.get('description') or '',  # the API sends null, not a missing key
            'stars': item['stargazers_count'],
            'language': item.get('language') or 'Unknown',
            'url': item['html_url'],
        })
    
    return repos

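Note: unauthenticated calls to the GitHub REST API are generally limited to 60 per hour, and a token raises that to 5,000 per hour. The Search endpoint used here has its own, separate limit (10 requests per minute unauthenticated, 30 per minute with a token). The digest makes a single call per day, so either way it is more than enough.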
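
To use a token when one is available, the request can be built conditionally. A sketch, assuming the token lives in an environment variable named `GITHUB_TOKEN` (my choice for the example, not something the original script is known to use); `build_github_request` is a hypothetical helper:

```python
import os
import urllib.request

def build_github_request(url, token=None):
    """Build a GitHub API request, adding auth only if a token exists."""
    if token is None:
        # Assumed variable name; adjust to wherever the token is stored.
        token = os.environ.get("GITHUB_TOKEN")
    headers = {
        "Accept": "application/vnd.github.v3+json",
        "User-Agent": "content-producer/1.0",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(url, headers=headers)
```

The returned object can be passed to `urllib.request.urlopen` exactly like `req` in `fetch_github_trending`. The response's `X-RateLimit-Remaining` header reports how many calls are left in the current window, which is worth logging if more endpoints are added later.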

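Trend analysis: classification without an LLM

This is the part of the design I'm fairly proud of. Many people reach for GPT the moment they see a "classification" problem, but rules plus keywords are enough: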

CATEGORY_KEYWORDS = {
    'AI/LLM': ['llm', 'gpt', 'claude', 'gemini', 'openai', 'anthropic', 
               'neural', 'model', 'inference', 'embedding', 'rag', 'agent'],
    'Security': ['vulnerability', 'cve', 'exploit', 'breach', 'hack', 
                 'malware', 'zero-day', 'patch', 'backdoor'],
    'Infrastructure': ['kubernetes', 'docker', 'terraform', 'aws', 'gcp',
                       'deployment', 'microservice', 'container', 'k8s'],
    'Web/Frontend': ['react', 'vue', 'svelte', 'typescript', 'css', 
                     'browser', 'wasm', 'webgl'],
    'Open Source': ['open source', 'github', 'release', 'fork', 'contribute'],
}

def categorize_stories(stories):
    """Count how many stories hit each category (a story can hit several)."""
    results = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        count = 0
        for story in stories:
            text = (story['title'] + ' ' + story.get('url', '')).lower()
            if any(kw in text for kw in keywords):
                count += 1
        if count > 0:
            results[category] = count
    
    # Sort by hit count, highest first
    return sorted(results.items(), key=lambda x: x[1], reverse=True)

Sample output:

{
  "hot_categories": [["AI/LLM", 6], ["Security", 4], ["Infrastructure", 3]],
  "top_keywords": [["rust", 5], ["llm", 4], ["docker", 3]],
  "signal": "AI/LLM dominates this cycle — 40% of top stories"
}

Zero token spend, negligible latency, and accuracy that is perfectly adequate for a daily digest.
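
One caveat with plain substring checks: a short keyword such as `rag` also matches inside longer words like `storage`, and including the URL means any `github.com` link counts toward Open Source. If that skews the counts, a stricter variant is to match whole words in the title only. A sketch; `compile_keywords` and `count_categories` are hypothetical names, and the tiny keyword table is just for the example:

```python
import re

def compile_keywords(category_keywords):
    """Compile one whole-word, case-insensitive regex per category."""
    compiled = {}
    for category, keywords in category_keywords.items():
        alternatives = "|".join(re.escape(kw) for kw in keywords)
        # (?<!\w) and (?!\w) act as word boundaries that also work for
        # keywords containing hyphens or spaces, e.g. "zero-day".
        compiled[category] = re.compile(
            rf"(?<!\w)(?:{alternatives})(?!\w)", re.IGNORECASE
        )
    return compiled

def count_categories(stories, compiled):
    """Return (category, hits) pairs sorted by hits, highest first."""
    counts = {}
    for category, pattern in compiled.items():
        hits = sum(1 for s in stories if pattern.search(s["title"]))
        if hits:
            counts[category] = hits
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)

patterns = compile_keywords({"AI/LLM": ["rag", "llm"], "Security": ["zero-day", "cve"]})
stories = [
    {"title": "Cheap object storage for backups"},  # "rag" appears only inside "storage"
    {"title": "A RAG pipeline in 100 lines"},
    {"title": "New zero-day in a popular router"},
]
print(count_categories(stories, patterns))
```

The first title is no longer counted as AI/LLM, while the other two are each counted once. The trade-off is that plural or compound forms (`agents`, `llms`) need their own keyword entries.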

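Pitfalls I hit

Pitfall 1: GitHub Actions time zones

schedule:
  - cron: '0 23 * * *'   # 23:00 UTC = 07:00 the next day in Beijing

It took me 10 minutes to work this out. GitHub Actions cron runs in UTC, so to have the job run at 7:00 every morning Beijing time (UTC+8) I have to subtract 8 hours, which lands on 23:00 UTC of the previous day.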
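
The conversion can be double-checked with the standard library instead of mental arithmetic. A sketch using a fixed UTC+8 offset for Beijing (China does not observe daylight saving time, so a fixed offset suffices); `beijing_to_utc_cron` is a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # fixed offset, no DST

def beijing_to_utc_cron(hour, minute=0):
    """Return the 'minute hour * * *' cron fields for a daily Beijing time."""
    # The date is arbitrary; only the time of day matters for a daily job.
    local = datetime(2026, 3, 25, hour, minute, tzinfo=BEIJING)
    utc = local.astimezone(timezone.utc)
    return f"{utc.minute} {utc.hour} * * *"

print(beijing_to_utc_cron(7))  # 0 23 * * *
```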

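Pitfall 2: detecting the GitHub Pages repo from its remote

I clone locally over SSH (git@github.com:...), but the script originally checked the remote against an https://github.com style path:

# This does not work
if 'github.com/citriac/citriac.github.io' in remote_url:
    ...

# This does
if 'citriac.github.io' in remote_url:
    ...

The SSH form is git@github.com:citriac/citriac.github.io.git, with a colon where the HTTPS form has a slash, so the string match has to accommodate both formats.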
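
A sturdier alternative to substring checks is to normalize both URL styles to an `owner/repo` slug and compare that. A sketch; `repo_slug` is a hypothetical helper, and it only handles the two GitHub forms discussed here:

```python
import re

# Matches git@github.com:owner/repo(.git) and https://github.com/owner/repo(.git)
_REMOTE = re.compile(r"github\.com[:/](?P<owner>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$")

def repo_slug(remote_url):
    """Return 'owner/repo' for a GitHub remote URL, or None if it doesn't match."""
    match = _REMOTE.search(remote_url.strip())
    if not match:
        return None
    return f"{match['owner']}/{match['repo']}"
```

With this, `repo_slug(remote_url) == 'citriac/citriac.github.io'` holds for both the SSH and the HTTPS remote, and an unrelated repository whose name merely contains the same string no longer passes the check.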

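Pitfall 3: the HN API occasionally returns null

The Firebase API occasionally returns null for an item (deleted or flagged):

with urllib.request.urlopen(item_url, timeout=5) as r:
    item = json.loads(r.read())

# Always check for None before touching the item
if item and item.get('type') == 'story':
    ...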

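Results

It has been running for over a month and delivers a report on time every day:

HN Top 15: genuinely high-scoring stories, with almost no noise
GitHub Trending 12: up-and-coming projects; I often discover good libraries I wasn't following
Trend categories: one glance tells me whether today leans AI, Security, or open-source tooling

👉 Live demo: citriac.github.io/daily.html (… updated automatically at 23:00 UTC)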

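Full source code

Open-sourced on GitHub: github.com/citriac/con…

If you'd rather not build it yourself: I've packaged a ready-to-use version with the full configuration, the GitHub Actions workflow, and setup docs, so you can have your own digest site running within 30 minutes: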

🚀 Daily Tech Digest Automation Kit — $15
