System Design in Practice 175: Designing a Movie Rating Aggregation System

13 min read

Abstract: This article dissects the system's core architecture, key algorithms, and engineering practices, and provides a complete design along with interview talking points.

Have you ever wondered how complex the technical challenges behind a movie rating aggregation system really are?

System Overview

Design a movie rating aggregation platform that scrapes ratings from multiple sources, computes a combined score with an aggregation algorithm, and provides real-time updates, a caching strategy, and an API service, giving users an accurate reference for movie quality.

Core Functional Requirements

Basic features

  • Multi-source data scraping
  • Rating aggregation algorithms
  • Real-time data updates
  • Cache strategy optimization
  • RESTful API service

Advanced features

  • Smart rating weighting
  • Anomalous-data detection
  • Trend analysis and forecasting
  • Personalized recommendations
  • Data visualization

System Architecture

Overall architecture

┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│   Client Layer   │    │   API Gateway    │    │  Scraping Layer  │
│                  │    │                  │    │                  │
│ • Web app        │◄──►│ • Routing        │◄──►│ • IMDb scraper   │
│ • Mobile app     │    │ • AuthN/AuthZ    │    │ • Douban scraper │
│ • 3rd-party APIs │    │ • Rate limiting  │    │ • Metacritic     │
└──────────────────┘    └──────────────────┘    │ • Rotten Tomatoes│
                                                │ • Other sources  │
                                                └──────────────────┘
                                │
                                ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Service Layer   │    │ Processing Layer │    │    Scheduler     │
│                  │    │                  │    │                  │
│ • Movie service  │◄──►│ • Data cleaning  │◄──►│ • Task scheduling│
│ • Rating service │    │ • Aggregation    │    │ • Cron jobs      │
│ • Search service │    │ • Anomaly checks │    │ • Monitoring     │
│ • Recommender    │    │ • Validation     │    │                  │
└──────────────────┘    └──────────────────┘    └──────────────────┘
                                │
                                ▼
┌──────────────────┐    ┌──────────────────┐    ┌──────────────────┐
│  Storage Layer   │    │   Cache Layer    │    │  Message Queue   │
│                  │    │                  │    │                  │
│ • PostgreSQL     │    │ • Redis cluster  │    │ • Kafka          │
│ • MongoDB        │    │ • Memcached      │    │ • RabbitMQ       │
│ • Elasticsearch  │    │ • CDN cache      │    │                  │
└──────────────────┘    └──────────────────┘    └──────────────────┘

Database Design

Movies table (movies)

CREATE TABLE movies (
    movie_id BIGINT PRIMARY KEY,
    imdb_id VARCHAR(20) UNIQUE,
    tmdb_id INT,
    title VARCHAR(500) NOT NULL,
    original_title VARCHAR(500),
    release_date DATE,
    runtime_minutes INT,
    genres JSON,
    director VARCHAR(200),
    cast_members JSON,
    plot_summary TEXT,
    poster_url VARCHAR(1000),
    budget BIGINT,
    box_office BIGINT,
    production_companies JSON,
    countries JSON,
    languages JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP
);

Rating sources table (rating_sources)

CREATE TABLE rating_sources (
    source_id INT PRIMARY KEY,
    source_name VARCHAR(50) NOT NULL,
    source_url VARCHAR(200),
    rating_scale VARCHAR(20), -- e.g. "1-10", "1-5", "0-100%"
    weight DECIMAL(3,2) DEFAULT 1.00,
    is_active BOOLEAN DEFAULT TRUE,
    reliability_score DECIMAL(3,2) DEFAULT 1.00,
    update_frequency_hours INT DEFAULT 24,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Raw ratings table (raw_ratings)

CREATE TABLE raw_ratings (
    rating_id BIGINT PRIMARY KEY,
    movie_id BIGINT NOT NULL,
    source_id INT NOT NULL,
    rating_value DECIMAL(4,2) NOT NULL,
    rating_count INT,
    rating_distribution JSON, -- rating distribution per star level
    critic_score DECIMAL(4,2),
    audience_score DECIMAL(4,2),
    scraped_at DATETIME NOT NULL,
    data_quality_score DECIMAL(3,2) DEFAULT 1.00,
    is_valid BOOLEAN DEFAULT TRUE,
    FOREIGN KEY (movie_id) REFERENCES movies(movie_id),
    FOREIGN KEY (source_id) REFERENCES rating_sources(source_id),
    INDEX idx_movie_source (movie_id, source_id),
    INDEX idx_scraped_at (scraped_at)
);

Aggregated ratings table (aggregated_ratings)

CREATE TABLE aggregated_ratings (
    aggregation_id BIGINT PRIMARY KEY,
    movie_id BIGINT NOT NULL,
    overall_score DECIMAL(4,2) NOT NULL,
    weighted_score DECIMAL(4,2) NOT NULL,
    critic_consensus_score DECIMAL(4,2),
    audience_consensus_score DECIMAL(4,2),
    total_rating_count INT,
    source_count INT,
    confidence_level DECIMAL(3,2),
    last_updated DATETIME NOT NULL,
    calculation_method VARCHAR(50),
    FOREIGN KEY (movie_id) REFERENCES movies(movie_id),
    UNIQUE KEY unique_movie (movie_id)
);

Rating history table (rating_history)

CREATE TABLE rating_history (
    history_id BIGINT PRIMARY KEY,
    movie_id BIGINT NOT NULL,
    source_id INT NOT NULL,
    rating_value DECIMAL(4,2) NOT NULL,
    rating_count INT,
    recorded_date DATE NOT NULL,
    change_from_previous DECIMAL(4,2),
    FOREIGN KEY (movie_id) REFERENCES movies(movie_id),
    FOREIGN KEY (source_id) REFERENCES rating_sources(source_id),
    INDEX idx_movie_date (movie_id, recorded_date)
);

Core Service Design

1. Data Scraping Service

import asyncio
import json

import aiohttp
from bs4 import BeautifulSoup
from datetime import datetime

class DataScrapingService:
    def __init__(self):
        self.db = DatabaseConnection()
        self.cache = RedisCache()
        self.scrapers = {
            'imdb': IMDbScraper(),
            'douban': DoubanScraper(),
            'metacritic': MetacriticScraper(),
            'rotten_tomatoes': RottenTomatoesScraper()
        }
    
    async def scrape_all_sources(self, movie_id):
        """并发抓取所有数据源"""
        movie = self.db.get_movie(movie_id)
        if not movie:
            return None
        
        tasks = []
        for source_name, scraper in self.scrapers.items():
            if scraper.is_active():
                task = asyncio.create_task(
                    self.scrape_source_with_retry(scraper, movie, source_name)
                )
                tasks.append(task)
        
        results = await asyncio.gather(*tasks, return_exceptions=True)
        
        # Collect successful results, skipping failed sources
        scraped_data = []
        for result in results:
            if not isinstance(result, Exception) and result:
                scraped_data.append(result)
        
        return scraped_data
    
    async def scrape_source_with_retry(self, scraper, movie, source_name, max_retries=3):
        """带重试的数据抓取"""
        for attempt in range(max_retries):
            try:
                # Check the cache first
                cache_key = f"scrape:{source_name}:{movie['movie_id']}"
                cached_result = self.cache.get(cache_key)
                
                if cached_result:
                    return json.loads(cached_result)
                
                # Perform the scrape
                rating_data = await scraper.scrape_rating(movie)
                
                if rating_data:
                    # Score data quality
                    quality_score = self.validate_data_quality(rating_data, source_name)
                    rating_data['data_quality_score'] = quality_score
                    
                    # Cache the result for an hour
                    self.cache.set(cache_key, json.dumps(rating_data), ex=3600)
                    
                    return rating_data
                
            except Exception as e:
                if attempt == max_retries - 1:
                    self.log_scraping_error(source_name, movie['movie_id'], str(e))
                else:
                    await asyncio.sleep(2 ** attempt)  # exponential backoff
        
        return None
    
    def validate_data_quality(self, rating_data, source_name):
        """验证数据质量"""
        quality_score = 1.0
        
        # Check the rating falls within the source's scale
        source_info = self.db.get_rating_source(source_name)
        if source_info:
            min_rating, max_rating = self.parse_rating_scale(source_info['rating_scale'])
            
            if not (min_rating <= rating_data['rating_value'] <= max_rating):
                quality_score -= 0.3
        
        # Penalize very small sample sizes
        if rating_data.get('rating_count', 0) < 10:
            quality_score -= 0.2
        
        # Check that required fields are present
        required_fields = ['rating_value', 'scraped_at']
        missing_fields = [field for field in required_fields if not rating_data.get(field)]
        
        if missing_fields:
            quality_score -= 0.2 * len(missing_fields)
        
        return max(0.0, quality_score)

class IMDbScraper:
    def __init__(self):
        self.base_url = "https://www.imdb.com"
        # NOTE: in production, create the session lazily inside a running event loop
        self.session = aiohttp.ClientSession()
    
    async def scrape_rating(self, movie):
        """抓取IMDb评分"""
        if not movie.get('imdb_id'):
            return None
        
        url = f"{self.base_url}/title/{movie['imdb_id']}/"
        
        async with self.session.get(url) as response:
            if response.status != 200:
                return None
            
            html = await response.text()
            soup = BeautifulSoup(html, 'html.parser')
            
            # Extract the rating value (generated CSS class names like this are
            # brittle and change whenever IMDb redeploys its front end)
            rating_element = soup.find('span', {'class': 'sc-bde20123-1'})
            if not rating_element:
                return None
            
            rating_value = float(rating_element.text.strip())
            
            # Extract the rating count
            rating_count_element = soup.find('div', {'class': 'sc-bde20123-3'})
            rating_count = 0
            if rating_count_element:
                count_text = rating_count_element.text.strip()
                rating_count = self.parse_rating_count(count_text)
            
            # Extract the rating distribution
            distribution = self.extract_rating_distribution(soup)
            
            return {
                'movie_id': movie['movie_id'],
                'source_id': self.get_source_id('imdb'),
                'rating_value': rating_value,
                'rating_count': rating_count,
                'rating_distribution': distribution,
                'scraped_at': datetime.now()
            }
    
    def extract_rating_distribution(self, soup):
        """提取评分分布"""
        distribution = {}
        
        # Find the rating distribution table
        rating_table = soup.find('table', {'class': 'ratings-table'})
        if rating_table:
            rows = rating_table.find_all('tr')
            for row in rows:
                cells = row.find_all('td')
                if len(cells) >= 3:
                    star_rating = cells[0].text.strip()
                    percentage = cells[2].text.strip().replace('%', '')
                    distribution[star_rating] = float(percentage)
        
        return distribution

2. Rating Aggregation Service

import math
import statistics
from datetime import datetime

class RatingAggregationService:
    def __init__(self):
        self.db = DatabaseConnection()
        self.cache = RedisCache()
        self.anomaly_detector = AnomalyDetector()
    
    def calculate_aggregated_rating(self, movie_id):
        """计算聚合评分"""
        # 获取所有有效的原始评分
        raw_ratings = self.db.query("""
            SELECT rr.*, rs.weight, rs.reliability_score, rs.rating_scale
            FROM raw_ratings rr
            JOIN rating_sources rs ON rr.source_id = rs.source_id
            WHERE rr.movie_id = %s 
            AND rr.is_valid = TRUE
            AND rs.is_active = TRUE
            AND rr.scraped_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
        """, [movie_id])
        
        if not raw_ratings:
            return None
        
        # Normalize each rating to a 0-10 scale
        normalized_ratings = []
        for rating in raw_ratings:
            normalized_value = self.normalize_rating(
                rating['rating_value'], 
                rating['rating_scale']
            )
            
            # Anomaly check
            is_anomaly = self.anomaly_detector.detect_anomaly(
                movie_id, rating['source_id'], normalized_value
            )
            
            if not is_anomaly:
                normalized_ratings.append({
                    'value': normalized_value,
                    'weight': rating['weight'],
                    'reliability': rating['reliability_score'],
                    'count': rating['rating_count'] or 1,
                    'source_id': rating['source_id']
                })
        
        if not normalized_ratings:
            return None
        
        # Weighted average score
        weighted_score = self.calculate_weighted_average(normalized_ratings)
        
        # Confidence level
        confidence_level = self.calculate_confidence_level(normalized_ratings)
        
        # Critic and audience consensus scores
        critic_score = self.calculate_critic_consensus(normalized_ratings)
        audience_score = self.calculate_audience_consensus(normalized_ratings)
        
        # Overall score, combining the signals above
        overall_score = self.calculate_overall_score(
            weighted_score, critic_score, audience_score, confidence_level
        )
        
        aggregated_data = {
            'movie_id': movie_id,
            'overall_score': overall_score,
            'weighted_score': weighted_score,
            'critic_consensus_score': critic_score,
            'audience_consensus_score': audience_score,
            'total_rating_count': sum(r['count'] for r in normalized_ratings),
            'source_count': len(normalized_ratings),
            'confidence_level': confidence_level,
            'last_updated': datetime.now(),
            'calculation_method': 'weighted_bayesian_average'
        }
        
        # Persist the aggregate
        self.save_aggregated_rating(aggregated_data)
        
        return aggregated_data
    
    def normalize_rating(self, rating_value, rating_scale):
        """标准化评分到0-10范围"""
        if rating_scale == "1-10":
            return rating_value
        elif rating_scale == "1-5":
            return rating_value * 2
        elif rating_scale == "0-100%":
            return rating_value / 10
        elif rating_scale == "1-4":
            return (rating_value - 1) * 10 / 3
        else:
            # Assume a 0-10 scale by default
            return rating_value
    
    def calculate_weighted_average(self, ratings):
        """计算加权平均分"""
        total_weighted_sum = 0
        total_weight = 0
        
        for rating in ratings:
            # combined weight = source weight x reliability x log(vote count)
            combined_weight = (
                rating['weight'] * 
                rating['reliability'] * 
                math.log(rating['count'] + 1)
            )
            
            total_weighted_sum += rating['value'] * combined_weight
            total_weight += combined_weight
        
        return total_weighted_sum / total_weight if total_weight > 0 else 0
    
    def calculate_confidence_level(self, ratings):
        """计算置信度"""
        if len(ratings) < 2:
            return 0.5
        
        # Based on source diversity and rating consistency
        source_diversity = len(ratings) / 10  # assume at most 10 sources
        
        # Rating variance across sources
        values = [r['value'] for r in ratings]
        variance = statistics.variance(values) if len(values) > 1 else 0
        consistency = max(0, 1 - variance / 10)  # lower variance = higher consistency
        
        # Factor in the total vote count
        total_count = sum(r['count'] for r in ratings)
        sample_size_factor = min(1.0, math.log(total_count + 1) / 10)
        
        confidence = (source_diversity + consistency + sample_size_factor) / 3
        return min(1.0, confidence)
    
    def calculate_bayesian_average(self, ratings, global_average=6.5, min_votes=100):
        """贝叶斯平均算法"""
        total_weighted_sum = 0
        total_weight = 0
        
        for rating in ratings:
            # Bayesian shrinkage toward the global average
            adjusted_count = rating['count'] + min_votes
            adjusted_sum = rating['value'] * rating['count'] + global_average * min_votes
            adjusted_average = adjusted_sum / adjusted_count
            
            weight = rating['weight'] * rating['reliability']
            total_weighted_sum += adjusted_average * weight
            total_weight += weight
        
        return total_weighted_sum / total_weight if total_weight > 0 else global_average
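To make the shrinkage concrete, here is a standalone worked example of the per-source Bayesian adjustment used above (the numbers are illustrative, not from any real source):

```python
# Per-source Bayesian adjustment, as in calculate_bayesian_average above:
# blend the observed average with a global prior, weighted by vote count.
def bayesian_adjust(value, count, global_average=6.5, min_votes=100):
    return (value * count + global_average * min_votes) / (count + min_votes)

# A 9.2 average backed by only 20 votes is pulled hard toward the prior:
# (9.2 * 20 + 6.5 * 100) / 120 = 6.95
print(round(bayesian_adjust(9.2, 20), 2))       # 6.95
# The same average backed by 100,000 votes barely moves.
print(round(bayesian_adjust(9.2, 100_000), 2))  # 9.2
```

This is why min_votes acts as a spam guard: a handful of extreme ratings cannot drag the aggregate far from the global mean.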

3. Anomaly Detection Service

import statistics

class AnomalyDetector:
    def __init__(self):
        self.db = DatabaseConnection()
        self.ml_model = AnomalyDetectionModel()
    
    def detect_anomaly(self, movie_id, source_id, rating_value):
        """检测异常评分"""
        # 获取历史数据
        historical_ratings = self.db.query("""
            SELECT rating_value, scraped_at
            FROM raw_ratings
            WHERE movie_id = %s AND source_id = %s
            AND scraped_at >= DATE_SUB(NOW(), INTERVAL 30 DAY)
            ORDER BY scraped_at DESC
            LIMIT 50
        """, [movie_id, source_id])
        
        if len(historical_ratings) < 5:
            return False  # too little history to judge
        
        # Statistical check
        values = [r['rating_value'] for r in historical_ratings]
        mean_rating = statistics.mean(values)
        std_rating = statistics.stdev(values) if len(values) > 1 else 0
        
        # Z-score outlier test
        if std_rating > 0:
            z_score = abs(rating_value - mean_rating) / std_rating
            if z_score > 3:  # 3-sigma rule
                return True
        
        # ML model check
        features = self.extract_anomaly_features(movie_id, source_id, rating_value)
        ml_anomaly_score = self.ml_model.predict_anomaly(features)
        
        return ml_anomaly_score > 0.8  # anomaly threshold
    
    def extract_anomaly_features(self, movie_id, source_id, rating_value):
        """提取异常检测特征"""
        # 电影特征
        movie = self.db.get_movie(movie_id)
        
        # Days since release
        release_days = (datetime.now().date() - movie['release_date']).days
        
        # Recent rating trend
        recent_ratings = self.db.query("""
            SELECT rating_value, scraped_at
            FROM raw_ratings
            WHERE movie_id = %s
            AND scraped_at >= DATE_SUB(NOW(), INTERVAL 7 DAY)
            ORDER BY scraped_at DESC
        """, [movie_id])
        
        trend_slope = self.calculate_trend_slope(recent_ratings)
        
        return {
            'rating_value': rating_value,
            'release_days': release_days,
            'genre_count': len(movie.get('genres', [])),
            'trend_slope': trend_slope,
            'source_reliability': self.get_source_reliability(source_id),
            'rating_volatility': self.calculate_rating_volatility(movie_id)
        }

4. API Service

import json

from flask import Flask, jsonify, request
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

class MovieRatingAPI:
    def __init__(self):
        self.app = Flask(__name__)
        self.limiter = Limiter(
            app=self.app,
            key_func=get_remote_address,
            default_limits=["1000 per hour"]
        )
        self.db = DatabaseConnection()
        self.cache = RedisCache()
        self.search_service = SearchService()  # assumed Elasticsearch-backed helper
        self.setup_routes()
    
    def setup_routes(self):
        @self.app.route('/api/movies/<int:movie_id>/rating')
        @self.limiter.limit("100 per minute")
        def get_movie_rating(movie_id):
            """获取电影聚合评分"""
            # 检查缓存
            cache_key = f"movie_rating:{movie_id}"
            cached_rating = self.cache.get(cache_key)
            
            if cached_rating:
                return jsonify(json.loads(cached_rating))
            
            # Fall back to the database
            rating_data = self.db.query_one("""
                SELECT ar.*, m.title, m.release_date, m.poster_url
                FROM aggregated_ratings ar
                JOIN movies m ON ar.movie_id = m.movie_id
                WHERE ar.movie_id = %s
            """, [movie_id])
            
            if not rating_data:
                return jsonify({'error': 'Movie not found'}), 404
            
            # Per-source ratings
            source_ratings = self.db.query("""
                SELECT rr.rating_value, rr.rating_count, rs.source_name, rs.rating_scale
                FROM raw_ratings rr
                JOIN rating_sources rs ON rr.source_id = rs.source_id
                WHERE rr.movie_id = %s
                AND rr.is_valid = TRUE
                ORDER BY rr.scraped_at DESC
            """, [movie_id])
            
            response_data = {
                'movie_id': movie_id,
                'title': rating_data['title'],
                'release_date': rating_data['release_date'].isoformat() if rating_data['release_date'] else None,
                'poster_url': rating_data['poster_url'],
                'overall_score': float(rating_data['overall_score']),
                'weighted_score': float(rating_data['weighted_score']),
                'critic_consensus_score': float(rating_data['critic_consensus_score']) if rating_data['critic_consensus_score'] else None,
                'audience_consensus_score': float(rating_data['audience_consensus_score']) if rating_data['audience_consensus_score'] else None,
                'confidence_level': float(rating_data['confidence_level']),
                'total_rating_count': rating_data['total_rating_count'],
                'source_count': rating_data['source_count'],
                'last_updated': rating_data['last_updated'].isoformat(),
                'source_ratings': [
                    {
                        'source': rating['source_name'],
                        'rating': float(rating['rating_value']),
                        'count': rating['rating_count'],
                        'scale': rating['rating_scale']
                    }
                    for rating in source_ratings
                ]
            }
            
            # Cache for 30 minutes
            self.cache.set(cache_key, json.dumps(response_data), ex=1800)
            
            return jsonify(response_data)
        
        @self.app.route('/api/movies/search')
        @self.limiter.limit("200 per minute")
        def search_movies():
            """搜索电影"""
            query = request.args.get('q', '').strip()
            limit = min(int(request.args.get('limit', 20)), 100)
            
            if not query:
                return jsonify({'error': 'Query parameter required'}), 400
            
            # Full-text search via Elasticsearch
            search_results = self.search_service.search_movies(query, limit)
            
            return jsonify({
                'query': query,
                'results': search_results,
                'total': len(search_results)
            })
        
        @self.app.route('/api/movies/trending')
        @self.limiter.limit("50 per minute")
        def get_trending_movies():
            """获取热门电影"""
            cache_key = "trending_movies"
            cached_data = self.cache.get(cache_key)
            
            if cached_data:
                return jsonify(json.loads(cached_data))
            
            trending_movies = self.db.query("""
                SELECT m.movie_id, m.title, m.release_date, m.poster_url,
                       ar.overall_score, ar.total_rating_count
                FROM movies m
                JOIN aggregated_ratings ar ON m.movie_id = ar.movie_id
                WHERE m.release_date >= DATE_SUB(NOW(), INTERVAL 1 YEAR)
                AND ar.total_rating_count >= 1000
                ORDER BY ar.overall_score DESC, ar.total_rating_count DESC
                LIMIT 50
            """)
            
            response_data = {
                'trending_movies': [
                    {
                        'movie_id': movie['movie_id'],
                        'title': movie['title'],
                        'release_date': movie['release_date'].isoformat() if movie['release_date'] else None,
                        'poster_url': movie['poster_url'],
                        'overall_score': float(movie['overall_score']),
                        'rating_count': movie['total_rating_count']
                    }
                    for movie in trending_movies
                ]
            }
            
            # Cache for 1 hour
            self.cache.set(cache_key, json.dumps(response_data), ex=3600)
            
            return jsonify(response_data)

Front-end Implementation

Movie rating display component (React)

import React, { useState, useEffect } from 'react';
import { LineChart, Line, XAxis, YAxis, CartesianGrid, Tooltip, ResponsiveContainer } from 'recharts';

const MovieRatingDisplay = ({ movieId }) => {
    const [ratingData, setRatingData] = useState(null);
    const [ratingHistory, setRatingHistory] = useState([]);
    const [loading, setLoading] = useState(true);
    
    useEffect(() => {
        loadMovieRating();
        loadRatingHistory();
    }, [movieId]);
    
    const loadMovieRating = async () => {
        try {
            const response = await fetch(`/api/movies/${movieId}/rating`);
            const data = await response.json();
            setRatingData(data);
        } catch (error) {
            console.error('Failed to load rating data:', error);
        } finally {
            setLoading(false);
        }
    };
    
    const loadRatingHistory = async () => {
        try {
            const response = await fetch(`/api/movies/${movieId}/rating-history`);
            const data = await response.json();
            setRatingHistory(data);
        } catch (error) {
            console.error('Failed to load rating history:', error);
        }
    };
    
    const getScoreColor = (score) => {
        if (score >= 8) return '#4CAF50';
        if (score >= 6) return '#FF9800';
        return '#F44336';
    };
    
    if (loading) {
        return <div className="loading">加载中...</div>;
    }
    
    if (!ratingData) {
        return <div className="error">暂无评分数据</div>;
    }
    
    return (
        <div className="movie-rating-display">
            <div className="rating-header">
                <img src={ratingData.poster_url} alt={ratingData.title} className="movie-poster" />
                <div className="rating-info">
                    <h2>{ratingData.title}</h2>
                    <div className="overall-score">
                        <span 
                            className="score-value"
                            style={{ color: getScoreColor(ratingData.overall_score) }}
                        >
                            {ratingData.overall_score.toFixed(1)}
                        </span>
                        <span className="score-label">/10</span>
                    </div>
                    <div className="confidence-info">
                        <span>Confidence: {(ratingData.confidence_level * 100).toFixed(0)}%</span>
                        <span>Based on {ratingData.source_count} sources</span>
                        <span>{ratingData.total_rating_count.toLocaleString()} ratings</span>
                    </div>
                </div>
            </div>
            
            <div className="source-ratings">
                <h3>Per-platform ratings</h3>
                <div className="source-grid">
                    {ratingData.source_ratings.map(source => (
                        <div key={source.source} className="source-item">
                            <div className="source-name">{source.source}</div>
                            <div className="source-score">
                                {source.rating.toFixed(1)}
                                <span className="source-scale">/{source.scale}</span>
                            </div>
                            <div className="source-count">
                                {source.count?.toLocaleString()} ratings
                            </div>
                        </div>
                    ))}
                </div>
            </div>
            
            {ratingHistory.length > 0 && (
                <div className="rating-trend">
                    <h3>Rating trend</h3>
                    <ResponsiveContainer width="100%" height={300}>
                        <LineChart data={ratingHistory}>
                            <CartesianGrid strokeDasharray="3 3" />
                            <XAxis dataKey="date" />
                            <YAxis domain={[0, 10]} />
                            <Tooltip />
                            <Line 
                                type="monotone" 
                                dataKey="overall_score" 
                                stroke="#2196F3" 
                                strokeWidth={2}
                                name="综合评分"
                            />
                            <Line 
                                type="monotone" 
                                dataKey="critic_score" 
                                stroke="#4CAF50" 
                                strokeWidth={2}
                                name="专业评分"
                            />
                            <Line 
                                type="monotone" 
                                dataKey="audience_score" 
                                stroke="#FF9800" 
                                strokeWidth={2}
                                name="观众评分"
                            />
                        </LineChart>
                    </ResponsiveContainer>
                </div>
            )}
        </div>
    );
};

This movie rating aggregation design covers multi-source data scraping, a weighted aggregation pipeline, anomaly detection, and an API layer; together these mechanisms keep the published scores accurate and reliable.


🎯 Scenario

You pull out your phone and open an app to check a movie's score. Behind this seemingly simple action, the system faces three core challenges:

  • Challenge 1, high concurrency: how do you keep latency low at millions of QPS?
  • Challenge 2, high availability: how do you keep serving when nodes fail?
  • Challenge 3, data consistency: how do you keep data correct in a distributed environment?

📈 Capacity Estimation

Assume 10 million DAU and 50 requests per user per day.

| Metric | Value |
| --- | --- |
| Total data volume | 10 TB+ |
| Daily write volume | ~100 GB |
| Write TPS | ~50K/s |
| Read QPS | ~200K/s |
| P99 read latency | < 10 ms |
| Node count | 10-50 |
| Replication factor | 3 |
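As a sanity check, the traffic side of the table can be derived from the stated assumptions. The 3x peak multiplier, 1 KB record size, and 10:1 read/write split below are additional assumptions for illustration, not figures from the table:

```python
# Back-of-envelope traffic estimate from the assumptions above.
DAU = 10_000_000                # daily active users
REQ_PER_USER = 50               # requests per user per day
SECONDS_PER_DAY = 86_400

daily_requests = DAU * REQ_PER_USER            # 500M requests/day
avg_qps = daily_requests / SECONDS_PER_DAY     # average load
peak_qps = avg_qps * 3                         # rough 3x peak-to-average heuristic

# Assumed: 1 KB per written record, 10 reads per write.
writes_per_day = daily_requests / 11
daily_write_gb = writes_per_day * 1024 / 1e9

print(f"avg QPS ~ {avg_qps:,.0f}, peak QPS ~ {peak_qps:,.0f}")
print(f"daily write volume ~ {daily_write_gb:.0f} GB")
```

Under these assumptions the average works out to roughly 6K QPS; headline figures like 200K read QPS imply a much larger peak multiplier or heavy internal fan-out, which is worth stating explicitly in an interview.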

❓ Common Interview Questions

Q1: What are the core design principles of a movie rating aggregation system?

See the architecture section above. The core principles are: high availability (automatic failure recovery), high performance (low latency, high throughput), scalability (horizontal scaling), and consistency (guaranteed data correctness). In an interview, expand on each with concrete scenarios.

Q2: What are the main challenges for this system at large scale?

  1. Performance bottlenecks: as data and request volumes grow, a single node can no longer keep up.
  2. Consistency: guaranteeing data consistency in a distributed environment.
  3. Failure recovery: automatic failover and data recovery when nodes go down.
  4. Operational complexity: cluster management, monitoring, and upgrades.

Q3: How do you keep the system highly available?

  1. Redundant replicas (at least 3).
  2. Automatic failure detection and failover (heartbeats plus leader election).
  3. Data persistence and backups.
  4. Rate limiting and graceful degradation (to prevent cascading failures).
  5. Multi-datacenter / active-active deployment.
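The heartbeat side of point 2 can be sketched in a few lines; the node names and the 5-second timeout here are hypothetical choices for illustration:

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds of silence before a node is marked dead (assumed)

class FailureDetector:
    def __init__(self):
        self.last_seen = {}  # node_id -> timestamp of last heartbeat

    def heartbeat(self, node_id, now=None):
        # Record the latest heartbeat for a node.
        self.last_seen[node_id] = now if now is not None else time.time()

    def dead_nodes(self, now=None):
        # Any node silent for longer than the timeout is considered failed.
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items() if now - t > HEARTBEAT_TIMEOUT]

detector = FailureDetector()
detector.heartbeat("replica-1", now=100.0)
detector.heartbeat("replica-2", now=103.0)
print(detector.dead_nodes(now=106.5))  # replica-1 has been silent for 6.5s
```

A real cluster would feed `dead_nodes` into a failover routine (promote a replica, update routing); this sketch only covers detection.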

Q4: What are the key performance optimizations?

  1. Caching (avoid repeated computation and IO).
  2. Asynchronous processing (move non-critical work off the hot path).
  3. Batching (fewer network round trips).
  4. Data sharding (parallel processing).
  5. Connection pooling and reuse.

Q5: How does this system compare to alternative designs?

See the comparison table below. When choosing, weigh: the team's technology stack, data volume, latency requirements, consistency needs, and operational cost. There is no silver bullet; the right trade-off depends on the business scenario.


| Option | Complexity | Cost | Best for |
| --- | --- | --- | --- |
| Option 1 | Simple implementation | Low | Small scale |
| Option 2 | Moderate complexity | Medium | Medium scale |
| Option 3 | High complexity ⭐ recommended | High | Large-scale production |

✅ Architecture Design Checklist

  • Caching strategy
  • Distributed architecture
  • Data consistency
  • Monitoring and alerting
  • Security design
  • Performance optimization

🚀 Architecture Evolution Path

Stage 1: Single-node MVP (< 100K users)

  • Monolith plus a single database; validate the core features quickly
  • Best for: early product stage, rapid iteration

Stage 2: Basic distributed (100K to 1M users)

  • Horizontally scaled app tier, primary/replica database, Redis cache
  • Introduce a message queue to decouple asynchronous tasks

Stage 3: Production-grade high availability (> 1M users)

  • Microservice split, database sharding, multi-datacenter deployment
  • Full-chain monitoring, automated operations, geo-redundant disaster recovery

⚖️ Key Trade-off Analysis

🔴 Trade-off 1: Consistency vs. availability

  • Strong consistency (CP): for scenarios that cannot tolerate errors, such as financial transactions
  • High availability (AP): for scenarios that tolerate brief inconsistency, such as social feeds
  • This system: strong consistency on the core path, eventual consistency elsewhere

🔴 Trade-off 2: Synchronous vs. asynchronous

  • Synchronous processing: low latency but limited throughput; fits the core interactive path
  • Asynchronous processing: high throughput but added latency; fits background computation
  • This system: synchronous on the core path, asynchronous elsewhere
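The sync/async split can be sketched with asyncio: the core path returns the score immediately while a non-critical write (here, a hypothetical history record) completes in the background:

```python
import asyncio

history_log = []

async def record_history(movie_id, score):
    await asyncio.sleep(0.01)        # simulate a slow, non-critical write
    history_log.append((movie_id, score))

async def get_score(movie_id):
    score = 8.3                      # core path: computed/fetched synchronously
    asyncio.create_task(record_history(movie_id, score))  # fire-and-forget
    return score                     # caller is not blocked on the history write

async def main():
    score = await get_score(1)
    assert history_log == []         # history not yet written when we return
    await asyncio.sleep(0.05)        # give the background task time to finish
    return score

print(asyncio.run(main()))  # 8.3
```

In production the fire-and-forget step would publish to Kafka or RabbitMQ rather than spawn an in-process task, so the write survives a crash of the API node.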