System Design in Practice 168: Content Moderation System


Abstract: This article dissects the core architecture, key algorithms, and engineering practices behind a content moderation system, and lays out a complete design along with interview talking points.

Have you ever wondered how complex the technical challenges behind a content moderation system really are?

📋 Requirements Analysis

Functional Requirements

  • Text classification: detect spam, hate speech, and violent content
  • Image recognition: identify pornographic, violent, and illegal content
  • Video moderation: keyframe extraction, audio analysis, scene recognition
  • Human review: manual review of contested content; reviewer workflows
  • Moderation rule engine: configurable rules, dynamic policies, A/B testing

Non-Functional Requirements

  • Performance: handle tens of millions of items with second-level response times
  • Accuracy: >95% for automated moderation, with a false-positive rate <2%
  • Scalability: multiple languages, platforms, and media types
  • Compliance: conform to local laws and regulations; protect data privacy
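The throughput target above can be sanity-checked with a quick back-of-envelope estimate. The daily volume and peak factor below are illustrative assumptions, not figures from the design:

```python
# Back-of-envelope capacity estimate for the moderation pipeline.
# Assumed numbers: 50M submissions/day, peak traffic at 3x the average.
ITEMS_PER_DAY = 50_000_000      # assumed daily submission volume
SECONDS_PER_DAY = 24 * 3600
PEAK_FACTOR = 3                 # assumed peak-to-average ratio

avg_qps = ITEMS_PER_DAY / SECONDS_PER_DAY
peak_qps = avg_qps * PEAK_FACTOR

print(f"average ≈ {avg_qps:.0f} QPS, peak ≈ {peak_qps:.0f} QPS")
```

At roughly 1,700 QPS at peak, the ingestion path has to be queue-backed rather than synchronous, which is why the architecture below decouples ingress from classification.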

🏗️ System Architecture

Overall Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Content APIs   │    │  Admin Portal   │    │ Reviewer Portal │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                       │                       │
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   API Gateway   │
                    │  (Rate Limiting)│
                    └─────────────────┘
                                 │
    ┌────────────────────────────┼────────────────────────────┐
    │                            │                            │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Content Ingress │    │ ML Classifier   │    │ Rule Engine     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
    │                            │                            │
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│ Human Review    │    │ Decision Engine │    │ Action Executor │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                 │
                    ┌─────────────────┐
                    │  Data Storage   │
                    │ (Content + ML)  │
                    └─────────────────┘
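The diagram above reads as a pipeline: content enters through the gateway, fans out to the ML classifier and the rule engine, and the decision engine keeps the most restrictive verdict. A minimal sketch of that fan-out-and-merge flow (the stage functions and trigger words here are illustrative stand-ins, not the real services):

```python
# Each pipeline stage proposes an action; the strictest one wins.
ACTION_SEVERITY = {"allow": 0, "flag": 1, "human_review": 2, "block": 3}

def run_pipeline(content: str, stages) -> str:
    """Run content through each stage and keep the strictest action."""
    final = "allow"
    for stage in stages:
        action = stage(content)
        if ACTION_SEVERITY[action] > ACTION_SEVERITY[final]:
            final = action
    return final

# Stand-in stages: a keyword rule and a fake ML score threshold.
def keyword_stage(text):
    return "block" if "badword" in text else "allow"

def ml_stage(text):
    score = 0.8 if "suspicious" in text else 0.1   # fake classifier score
    return "human_review" if score > 0.5 else "allow"

print(run_pipeline("a suspicious post", [keyword_stage, ml_stage]))  # human_review
```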

Core Component Design

1. Content Ingress Service
class ContentIngressService:
    def __init__(self, queue_manager, storage_service, db_client):
        self.queue = queue_manager
        self.storage = storage_service
        self.db = db_client  # used below to persist content records
        self.supported_types = ['text', 'image', 'video', 'audio']
        self.max_file_size = 100 * 1024 * 1024  # 100MB
    
    def submit_content(self, content_data, metadata):
        """Submit content for moderation."""
        try:
            # Validate the content
            self.validate_content(content_data, metadata)
            
            # Create a content record
            content = ModerationContent(
                content_id=str(uuid.uuid4()),
                content_type=metadata['type'],
                source_platform=metadata.get('platform'),
                user_id=metadata.get('user_id'),
                priority=metadata.get('priority', 'normal'),
                language=metadata.get('language', 'auto'),
                status='pending',
                submitted_at=datetime.utcnow()
            )
            
            # Store the content
            if content.content_type in ['image', 'video', 'audio']:
                # Store the media file
                file_path = self.storage.store_media_file(
                    content_data, content.content_id, content.content_type
                )
                content.file_path = file_path
                
                # Extract media metadata
                media_metadata = self.extract_media_metadata(
                    content_data, content.content_type
                )
                content.metadata = media_metadata
            else:
                # Text content is stored directly
                content.text_content = content_data
            
            # Persist to the database
            content_id = self.db.save_content(content)
            
            # Send to the moderation queue
            self.queue.send_to_moderation_queue(
                content_id, content.priority
            )
            
            return SubmissionResult(
                content_id=content_id,
                status='submitted',
                estimated_processing_time=self.estimate_processing_time(
                    content.content_type, content.priority
                )
            )
            
        except Exception as e:
            return SubmissionResult(
                status='failed',
                error=str(e)
            )
    
    def validate_content(self, content_data, metadata):
        """Validate the submitted content."""
        content_type = metadata.get('type')
        
        if content_type not in self.supported_types:
            raise UnsupportedContentTypeError(content_type)
        
        # Check file size
        if content_type != 'text':
            if len(content_data) > self.max_file_size:
                raise ContentTooLargeError()
        
        # Check text length
        if content_type == 'text':
            if len(content_data) > 10000:  # 10k-character limit
                raise TextTooLongError()
        
        # Check required fields
        required_fields = ['type', 'platform']
        for field in required_fields:
            if field not in metadata:
                raise MissingMetadataError(field)
    
    def extract_media_metadata(self, media_data, content_type):
        """Extract metadata from a media file."""
        metadata = {}
        
        if content_type == 'image':
            # Use PIL to read basic image properties
            from PIL import Image
            import io
            
            image = Image.open(io.BytesIO(media_data))
            metadata.update({
                'width': image.width,
                'height': image.height,
                'format': image.format,
                'mode': image.mode,
                'size_bytes': len(media_data)
            })
            
            # Extract EXIF data
            if hasattr(image, '_getexif'):
                exif = image._getexif()
                if exif:
                    metadata['exif'] = {
                        k: v for k, v in exif.items() 
                        if isinstance(v, (str, int, float))
                    }
        
        elif content_type == 'video':
            # Use FFmpeg to extract video information
            metadata.update(
                self.extract_video_metadata(media_data)
            )
        
        elif content_type == 'audio':
            # Extract audio information
            metadata.update(
                self.extract_audio_metadata(media_data)
            )
        
        return metadata
    
    def estimate_processing_time(self, content_type, priority):
        """Estimate processing time in seconds."""
        base_times = {
            'text': 1,      # 1 second
            'image': 5,     # 5 seconds
            'audio': 30,    # 30 seconds
            'video': 120    # 2 minutes
        }
        
        priority_multipliers = {
            'high': 0.5,
            'normal': 1.0,
            'low': 2.0
        }
        
        base_time = base_times.get(content_type, 60)
        multiplier = priority_multipliers.get(priority, 1.0)
        
        return int(base_time * multiplier)
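The time estimate above is simply a per-type base time scaled by a priority multiplier; extracted as a standalone sketch:

```python
# Standalone version of the estimate_processing_time logic above.
BASE_TIMES = {"text": 1, "image": 5, "audio": 30, "video": 120}   # seconds
PRIORITY_MULTIPLIERS = {"high": 0.5, "normal": 1.0, "low": 2.0}

def estimate_processing_time(content_type: str, priority: str) -> int:
    base = BASE_TIMES.get(content_type, 60)        # default for unknown types
    mult = PRIORITY_MULTIPLIERS.get(priority, 1.0)
    return int(base * mult)

print(estimate_processing_time("video", "high"))   # 120 * 0.5 = 60 seconds
```

High-priority items jump the queue by halving the budget, while low-priority items double it; this keeps a single formula instead of per-priority tables.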
2. ML Classifier Service
class MLClassifierService:
    def __init__(self, model_manager, feature_extractor, storage_service):
        self.model_manager = model_manager
        self.feature_extractor = feature_extractor
        self.storage = storage_service  # used to load media files in classify_image
        self.classification_categories = [
            'spam', 'hate_speech', 'violence', 'adult_content',
            'harassment', 'misinformation', 'copyright_violation'
        ]
    
    def classify_content(self, content):
        """Dispatch classification by content type."""
        try:
            if content.content_type == 'text':
                return self.classify_text(content)
            elif content.content_type == 'image':
                return self.classify_image(content)
            elif content.content_type == 'video':
                return self.classify_video(content)
            elif content.content_type == 'audio':
                return self.classify_audio(content)
            else:
                raise UnsupportedContentTypeError(content.content_type)
                
        except Exception as e:
            return ClassificationResult(
                content_id=content.content_id,
                success=False,
                error=str(e)
            )
    
    def classify_text(self, content):
        """Classify text content."""
        text = content.text_content
        language = content.language or self.detect_language(text)
        
        # Fetch the text classification model for this language
        text_model = self.model_manager.get_text_model(language)
        
        # Preprocess the text
        processed_text = self.preprocess_text(text, language)
        
        # Extract features
        features = self.feature_extractor.extract_text_features(
            processed_text, language
        )
        
        # Run model prediction
        predictions = text_model.predict(features)
        
        # Parse the results
        classification_results = []
        for category, score in predictions.items():
            if score > 0.1:  # keep only meaningful scores
                classification_results.append(
                    CategoryResult(
                        category=category,
                        confidence=score,
                        threshold=self.get_category_threshold(category),
                        action=self.determine_action(category, score)
                    )
                )
        
        # Detect specific text patterns
        pattern_results = self.detect_text_patterns(text, language)
        classification_results.extend(pattern_results)
        
        return ClassificationResult(
            content_id=content.content_id,
            content_type='text',
            language=language,
            results=classification_results,
            overall_risk_score=self.calculate_overall_risk(classification_results),
            processing_time_ms=self.get_processing_time(),
            success=True
        )
    
    def classify_image(self, content):
        """Classify image content."""
        # Load the image
        image_data = self.storage.load_media_file(content.file_path)
        image = self.load_image(image_data)
        
        # Fetch the image classification model
        image_model = self.model_manager.get_image_model()
        
        # Preprocess the image
        processed_image = self.preprocess_image(image)
        
        # Extract features
        features = self.feature_extractor.extract_image_features(processed_image)
        
        # Run multiple models
        classification_results = []
        
        # 1. General content classification
        general_predictions = image_model.predict_general(features)
        for category, score in general_predictions.items():
            if score > 0.1:
                classification_results.append(
                    CategoryResult(
                        category=category,
                        confidence=score,
                        threshold=self.get_category_threshold(category),
                        action=self.determine_action(category, score)
                    )
                )
        
        # 2. Adult-content (NSFW) detection
        nsfw_score = image_model.predict_nsfw(features)
        if nsfw_score > 0.1:
            classification_results.append(
                CategoryResult(
                    category='adult_content',
                    confidence=nsfw_score,
                    threshold=0.7,
                    action=self.determine_action('adult_content', nsfw_score)
                )
            )
        
        # 3. Violence detection
        violence_score = image_model.predict_violence(features)
        if violence_score > 0.1:
            classification_results.append(
                CategoryResult(
                    category='violence',
                    confidence=violence_score,
                    threshold=0.8,
                    action=self.determine_action('violence', violence_score)
                )
            )
        
        # 4. Face detection and analysis
        face_results = self.analyze_faces(image)
        classification_results.extend(face_results)
        
        # 5. OCR text extraction and analysis
        ocr_text = self.extract_text_from_image(image)
        if ocr_text:
            text_results = self.classify_extracted_text(ocr_text)
            classification_results.extend(text_results)
        
        return ClassificationResult(
            content_id=content.content_id,
            content_type='image',
            results=classification_results,
            overall_risk_score=self.calculate_overall_risk(classification_results),
            processing_time_ms=self.get_processing_time(),
            success=True
        )
    
    def classify_video(self, content):
        """Classify video content."""
        # Extract keyframes
        keyframes = self.extract_keyframes(content.file_path)
        
        # Extract the audio track
        audio_track = self.extract_audio_from_video(content.file_path)
        
        classification_results = []
        
        # 1. Keyframe analysis
        for i, frame in enumerate(keyframes):
            frame_result = self.classify_video_frame(frame, i)
            classification_results.extend(frame_result.results)
        
        # 2. Audio analysis
        if audio_track:
            audio_result = self.classify_audio_content(audio_track)
            classification_results.extend(audio_result.results)
        
        # 3. Scene-continuity analysis
        scene_results = self.analyze_video_scenes(keyframes)
        classification_results.extend(scene_results)
        
        # 4. Motion analysis
        motion_results = self.analyze_video_motion(content.file_path)
        classification_results.extend(motion_results)
        
        return ClassificationResult(
            content_id=content.content_id,
            content_type='video',
            results=classification_results,
            overall_risk_score=self.calculate_overall_risk(classification_results),
            processing_time_ms=self.get_processing_time(),
            success=True
        )
    
    def detect_text_patterns(self, text, language):
        """Detect rule-based text patterns."""
        import re
        
        pattern_results = []
        
        # 1. Personal-information (PII) detection
        pii_patterns = self.get_pii_patterns(language)
        for pattern_name, pattern in pii_patterns.items():
            matches = re.findall(pattern, text, re.IGNORECASE)
            if matches:
                pattern_results.append(
                    CategoryResult(
                        category='personal_info',
                        confidence=0.9,
                        threshold=0.5,
                        action='flag',
                        details={'pattern': pattern_name, 'matches': len(matches)}
                    )
                )
        
        # 2. Spam indicators
        spam_indicators = self.detect_spam_indicators(text)
        if spam_indicators['score'] > 0.5:
            pattern_results.append(
                CategoryResult(
                    category='spam',
                    confidence=spam_indicators['score'],
                    threshold=0.7,
                    action=self.determine_action('spam', spam_indicators['score']),
                    details=spam_indicators['indicators']
                )
            )
        
        # 3. Hate-speech keywords
        hate_keywords = self.detect_hate_keywords(text, language)
        if hate_keywords:
            pattern_results.append(
                CategoryResult(
                    category='hate_speech',
                    confidence=0.8,
                    threshold=0.6,
                    action='block',
                    details={'keywords': hate_keywords}
                )
            )
        
        return pattern_results
    
    def determine_action(self, category, confidence):
        """Determine the moderation action for a score."""
        thresholds = self.get_category_thresholds()
        category_threshold = thresholds.get(category, {})
        
        if confidence >= category_threshold.get('block', 0.9):
            return 'block'
        elif confidence >= category_threshold.get('flag', 0.7):
            return 'flag'
        elif confidence >= category_threshold.get('review', 0.5):
            return 'human_review'
        else:
            return 'allow'
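determine_action maps a confidence score onto escalating thresholds; the same logic can be exercised in isolation (the default thresholds below are the ones from the code above):

```python
def determine_action(confidence: float,
                     block: float = 0.9, flag: float = 0.7,
                     review: float = 0.5) -> str:
    """Map a classifier confidence to the strictest action it clears."""
    if confidence >= block:
        return "block"
    if confidence >= flag:
        return "flag"
    if confidence >= review:
        return "human_review"
    return "allow"

for c in (0.95, 0.75, 0.55, 0.2):
    print(c, "->", determine_action(c))
```

Keeping the thresholds as parameters (rather than hard-coding them) is what lets the rule engine tune them per category without touching classifier code.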
3. Rule Engine
class ModerationRuleEngine:
    def __init__(self, rule_storage):
        self.rule_storage = rule_storage
        self.rule_cache = {}
        self.rule_evaluators = {
            'keyword_match': self.evaluate_keyword_rule,
            'regex_match': self.evaluate_regex_rule,
            'score_threshold': self.evaluate_score_rule,
            'user_reputation': self.evaluate_reputation_rule,
            'content_frequency': self.evaluate_frequency_rule,
            'time_based': self.evaluate_time_rule
        }
    
    def evaluate_rules(self, content, classification_result):
        """Evaluate all applicable rules against the content."""
        try:
            # Fetch the applicable rules
            applicable_rules = self.get_applicable_rules(
                content.content_type, content.source_platform
            )
            
            rule_results = []
            
            for rule in applicable_rules:
                try:
                    # Evaluate a single rule
                    result = self.evaluate_single_rule(
                        rule, content, classification_result
                    )
                    
                    if result.matched:
                        rule_results.append(result)
                        
                except Exception as e:
                    logger.error(f"Rule evaluation failed for {rule.id}: {e}")
                    continue
            
            # Merge the rule results
            final_action = self.merge_rule_results(rule_results)
            
            return RuleEvaluationResult(
                content_id=content.content_id,
                matched_rules=rule_results,
                final_action=final_action,
                confidence=self.calculate_rule_confidence(rule_results)
            )
            
        except Exception as e:
            return RuleEvaluationResult(
                content_id=content.content_id,
                error=str(e),
                final_action='allow'  # fail open: default to allow
            )
    
    def evaluate_single_rule(self, rule, content, classification_result):
        """Evaluate a single rule."""
        evaluator = self.rule_evaluators.get(rule.type)
        if not evaluator:
            raise UnsupportedRuleTypeError(rule.type)
        
        # Check the rule's preconditions
        conditions_met = self.check_rule_conditions(
            rule.conditions, content, classification_result
        )
        
        if not conditions_met:
            return RuleResult(matched=False, rule_id=rule.id)
        
        # Run the rule evaluator
        result = evaluator(rule, content, classification_result)
        
        return RuleResult(
            matched=result.matched,
            rule_id=rule.id,
            rule_name=rule.name,
            action=rule.action,
            confidence=result.confidence,
            details=result.details
        )
    
    def evaluate_keyword_rule(self, rule, content, classification_result):
        """Evaluate a keyword rule."""
        if content.content_type != 'text':
            # RuleMatch is a lightweight per-evaluator result
            # (matched / confidence / details), distinct from the
            # top-level RuleEvaluationResult returned by evaluate_rules
            return RuleMatch(matched=False)
        
        text = content.text_content.lower()
        keywords = rule.parameters.get('keywords', [])
        match_type = rule.parameters.get('match_type', 'any')
        
        matched_keywords = []
        for keyword in keywords:
            if keyword.lower() in text:
                matched_keywords.append(keyword)
        
        if match_type == 'any' and matched_keywords:
            matched = True
        elif match_type == 'all' and len(matched_keywords) == len(keywords):
            matched = True
        else:
            matched = False
        
        confidence = len(matched_keywords) / len(keywords) if keywords else 0
        
        return RuleMatch(
            matched=matched,
            confidence=confidence,
            details={'matched_keywords': matched_keywords}
        )
    
    def evaluate_score_rule(self, rule, content, classification_result):
        """Evaluate a score-threshold rule."""
        category = rule.parameters.get('category')
        threshold = rule.parameters.get('threshold', 0.5)
        operator = rule.parameters.get('operator', 'greater_than')
        
        # Find the matching classification result
        category_result = None
        for result in classification_result.results:
            if result.category == category:
                category_result = result
                break
        
        if not category_result:
            return RuleMatch(matched=False)
        
        score = category_result.confidence
        
        if operator == 'greater_than':
            matched = score > threshold
        elif operator == 'greater_equal':
            matched = score >= threshold
        elif operator == 'less_than':
            matched = score < threshold
        elif operator == 'less_equal':
            matched = score <= threshold
        else:
            matched = False
        
        return RuleMatch(
            matched=matched,
            confidence=abs(score - threshold),
            details={'score': score, 'threshold': threshold}
        )
    
    def merge_rule_results(self, rule_results):
        """Merge rule results into a final action."""
        if not rule_results:
            return 'allow'
        
        # Sort by action priority
        sorted_results = sorted(
            rule_results, 
            key=lambda x: self.get_action_priority(x.action),
            reverse=True
        )
        
        # Return the highest-priority action
        return sorted_results[0].action
    
    def get_action_priority(self, action):
        """Get the priority of an action."""
        priorities = {
            'block': 100,
            'quarantine': 80,
            'human_review': 60,
            'flag': 40,
            'warn': 20,
            'allow': 0
        }
        return priorities.get(action, 0)
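Because every action carries a numeric priority, merging rule hits reduces to taking a maximum; a minimal standalone version of the merge step:

```python
# Same priority table as the rule engine above.
ACTION_PRIORITY = {"block": 100, "quarantine": 80, "human_review": 60,
                   "flag": 40, "warn": 20, "allow": 0}

def merge_rule_actions(actions):
    """Return the highest-priority action, or 'allow' when no rules matched."""
    if not actions:
        return "allow"
    return max(actions, key=lambda a: ACTION_PRIORITY.get(a, 0))

print(merge_rule_actions(["flag", "human_review", "warn"]))  # human_review
```

The full sort in the service is only needed because the caller also wants the ranked list of matched rules; for the final action alone, a single `max` suffices.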
4. Human Review Service
class HumanReviewService:
    def __init__(self, reviewer_manager, workflow_engine, db_client):
        self.reviewer_manager = reviewer_manager
        self.workflow = workflow_engine
        self.db = db_client  # used below to persist tasks and decisions
        self.review_queues = {}
        self.sla_targets = {
            'high': 3600,    # 1 hour
            'normal': 14400, # 4 hours
            'low': 86400     # 24 hours
        }
    
    def assign_for_review(self, content_id, priority='normal', category=None):
        """Assign content for human review."""
        try:
            # Fetch the content record
            content = self.get_content(content_id)
            
            # Create a review task
            review_task = ReviewTask(
                content_id=content_id,
                priority=priority,
                category=category or 'general',
                language=content.language,
                status='pending',
                created_at=datetime.utcnow(),
                sla_deadline=datetime.utcnow() + timedelta(
                    seconds=self.sla_targets[priority]
                )
            )
            
            # Select a suitable reviewer
            reviewer = self.select_reviewer(review_task)
            
            if reviewer:
                review_task.assigned_reviewer_id = reviewer.id
                review_task.assigned_at = datetime.utcnow()
                review_task.status = 'assigned'
            
            # Persist the task
            task_id = self.db.save_review_task(review_task)
            
            # Add it to the review queue
            self.add_to_review_queue(review_task)
            
            # Notify the assigned reviewer
            if reviewer:
                self.notify_reviewer(reviewer, review_task)
            
            return ReviewAssignmentResult(
                task_id=task_id,
                assigned_reviewer=reviewer.id if reviewer else None,
                estimated_completion=review_task.sla_deadline
            )
            
        except Exception as e:
            return ReviewAssignmentResult(
                error=str(e),
                success=False
            )
    
    def select_reviewer(self, review_task):
        """Select the best available reviewer."""
        # Fetch available reviewers
        available_reviewers = self.reviewer_manager.get_available_reviewers(
            category=review_task.category,
            language=review_task.language
        )
        
        if not available_reviewers:
            return None
        
        # Score each reviewer
        reviewer_scores = []
        for reviewer in available_reviewers:
            score = self.calculate_reviewer_score(reviewer, review_task)
            reviewer_scores.append((reviewer, score))
        
        # Pick the highest-scoring reviewer
        reviewer_scores.sort(key=lambda x: x[1], reverse=True)
        
        return reviewer_scores[0][0]
    
    def calculate_reviewer_score(self, reviewer, task):
        """Score a reviewer for a given task."""
        score = 0
        
        # 1. Specialty match
        if task.category in reviewer.specialties:
            score += 30
        
        # 2. Language ability
        if task.language in reviewer.languages:
            score += 20
        
        # 3. Current workload
        current_load = self.get_reviewer_workload(reviewer.id)
        max_load = reviewer.max_concurrent_tasks
        load_ratio = current_load / max_load if max_load > 0 else 1
        score += (1 - load_ratio) * 20
        
        # 4. Historical accuracy
        accuracy = reviewer.accuracy_rate
        score += accuracy * 15
        
        # 5. Average processing time
        avg_time = reviewer.avg_processing_time
        if avg_time < self.sla_targets[task.priority]:
            score += 10
        
        # 6. Online status
        if reviewer.is_online:
            score += 5
        
        return score
    
    def submit_review_decision(self, task_id, reviewer_id, decision_data):
        """Submit a review decision."""
        try:
            # Verify the reviewer is assigned to this task
            task = self.get_review_task(task_id)
            if task.assigned_reviewer_id != reviewer_id:
                raise UnauthorizedReviewerError()
            
            # Create the review decision
            decision = ReviewDecision(
                task_id=task_id,
                reviewer_id=reviewer_id,
                action=decision_data['action'],
                reason=decision_data.get('reason'),
                confidence=decision_data.get('confidence', 1.0),
                notes=decision_data.get('notes'),
                tags=decision_data.get('tags', []),
                processing_time=datetime.utcnow() - task.assigned_at,
                created_at=datetime.utcnow()
            )
            
            # Persist the decision
            decision_id = self.db.save_review_decision(decision)
            
            # Update the task status
            task.status = 'completed'
            task.completed_at = datetime.utcnow()
            task.final_decision = decision.action
            self.db.update_review_task(task)
            
            # Execute the decided action
            self.execute_review_action(task.content_id, decision)
            
            # Update reviewer statistics
            self.update_reviewer_stats(reviewer_id, decision)
            
            # Quality check
            if self.should_perform_quality_check(decision):
                self.schedule_quality_check(decision_id)
            
            return ReviewSubmissionResult(
                decision_id=decision_id,
                success=True
            )
            
        except Exception as e:
            return ReviewSubmissionResult(
                error=str(e),
                success=False
            )
    
    def execute_review_action(self, content_id, decision):
        """Execute the action decided by the reviewer."""
        content = self.get_content(content_id)
        
        if decision.action == 'approve':
            self.approve_content(content)
        elif decision.action == 'reject':
            self.reject_content(content, decision.reason)
        elif decision.action == 'require_edit':
            self.require_content_edit(content, decision.notes)
        elif decision.action == 'escalate':
            self.escalate_to_senior_reviewer(content, decision.reason)
        
        # Record the moderation history
        self.record_moderation_history(content_id, decision)
    
    def monitor_review_sla(self):
        """Monitor review SLAs."""
        overdue_tasks = self.db.get_overdue_review_tasks()
        
        for task in overdue_tasks:
            # Send an SLA warning
            self.send_sla_warning(task)
            
            # Automatically reassign badly overdue tasks
            if task.overdue_hours > 2:
                self.reassign_review_task(task.id)
    
    def generate_reviewer_performance_report(self, reviewer_id, period_days=30):
        """Generate a reviewer performance report."""
        end_date = datetime.utcnow()
        start_date = end_date - timedelta(days=period_days)
        
        # Fetch the review data
        reviews = self.db.get_reviewer_decisions(
            reviewer_id, start_date, end_date
        )
        
        if not reviews:
            return None
        
        # Compute metrics
        total_reviews = len(reviews)
        avg_processing_time = sum(r.processing_time.total_seconds() 
                                for r in reviews) / total_reviews
        
        # Accuracy, based on quality-check results
        quality_checks = self.db.get_quality_check_results(
            reviewer_id, start_date, end_date
        )
        
        correct_decisions = sum(1 for qc in quality_checks if qc.is_correct)
        accuracy_rate = correct_decisions / len(quality_checks) if quality_checks else 0
        
        # Action distribution
        action_distribution = {}
        for review in reviews:
            action = review.action
            action_distribution[action] = action_distribution.get(action, 0) + 1
        
        return ReviewerPerformanceReport(
            reviewer_id=reviewer_id,
            period_start=start_date,
            period_end=end_date,
            total_reviews=total_reviews,
            avg_processing_time_seconds=avg_processing_time,
            accuracy_rate=accuracy_rate,
            action_distribution=action_distribution,
            sla_compliance_rate=self.calculate_sla_compliance(reviews)
        )
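The reviewer-selection heuristic above is a weighted sum over specialty, language, load, accuracy, SLA history, and availability. A trimmed-down standalone version with the same weights (the dict-based reviewer and task records here are illustrative stand-ins for the ORM objects):

```python
def reviewer_score(reviewer: dict, task: dict, sla_seconds: int = 14400) -> float:
    """Weighted reviewer score mirroring calculate_reviewer_score above."""
    score = 0.0
    if task["category"] in reviewer["specialties"]:
        score += 30                                 # specialty match
    if task["language"] in reviewer["languages"]:
        score += 20                                 # language ability
    load_ratio = reviewer["current_load"] / max(reviewer["max_tasks"], 1)
    score += (1 - load_ratio) * 20                  # prefer lightly loaded reviewers
    score += reviewer["accuracy"] * 15              # historical accuracy in [0, 1]
    if reviewer["avg_time"] < sla_seconds:
        score += 10                                 # historically meets the SLA
    if reviewer["is_online"]:
        score += 5
    return score

alice = {"specialties": ["hate_speech"], "languages": ["en"],
         "current_load": 2, "max_tasks": 10, "accuracy": 0.96,
         "avg_time": 1800, "is_online": True}
task = {"category": "hate_speech", "language": "en"}
print(reviewer_score(alice, task))
```

Additive weights make the heuristic easy to tune and to explain, at the cost of treating the factors as independent; a natural refinement is to learn the weights from historical review-quality data.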

💾 Data Storage Design

Database Design

-- Content table
CREATE TABLE moderation_content (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    content_id VARCHAR(36) NOT NULL UNIQUE,
    content_type ENUM('text', 'image', 'video', 'audio') NOT NULL,
    source_platform VARCHAR(50),
    user_id BIGINT,
    text_content TEXT,
    file_path VARCHAR(500),
    metadata JSON,
    language VARCHAR(10),
    priority ENUM('high', 'normal', 'low') DEFAULT 'normal',
    status ENUM('pending', 'processing', 'completed', 'failed') DEFAULT 'pending',
    submitted_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    processed_at TIMESTAMP,
    INDEX idx_status_priority (status, priority),
    INDEX idx_user_platform (user_id, source_platform),
    INDEX idx_submitted_at (submitted_at)
);

-- Classification results table
CREATE TABLE classification_results (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    content_id VARCHAR(36) NOT NULL,
    category VARCHAR(50) NOT NULL,
    confidence DECIMAL(5,4) NOT NULL,
    threshold_value DECIMAL(5,4),
    action VARCHAR(20),
    model_version VARCHAR(50),
    processing_time_ms INT,
    details JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_id) REFERENCES moderation_content(content_id) ON DELETE CASCADE,
    INDEX idx_content_category (content_id, category),
    INDEX idx_confidence (confidence),
    INDEX idx_action (action)
);

-- Moderation rules table
CREATE TABLE moderation_rules (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    name VARCHAR(100) NOT NULL,
    description TEXT,
    rule_type VARCHAR(50) NOT NULL,
    content_types JSON NOT NULL,
    platforms JSON,
    conditions JSON NOT NULL,
    parameters JSON NOT NULL,
    action VARCHAR(20) NOT NULL,
    priority INT DEFAULT 0,
    is_active BOOLEAN DEFAULT TRUE,
    created_by BIGINT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_type_active (rule_type, is_active),
    INDEX idx_priority (priority)
);

-- Human review tasks table
CREATE TABLE review_tasks (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    content_id VARCHAR(36) NOT NULL,
    category VARCHAR(50),
    priority ENUM('high', 'normal', 'low') DEFAULT 'normal',
    language VARCHAR(10),
    assigned_reviewer_id BIGINT,
    status ENUM('pending', 'assigned', 'in_progress', 'completed', 'escalated') DEFAULT 'pending',
    sla_deadline TIMESTAMP NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    assigned_at TIMESTAMP,
    completed_at TIMESTAMP,
    final_decision VARCHAR(20),
    FOREIGN KEY (content_id) REFERENCES moderation_content(content_id) ON DELETE CASCADE,
    INDEX idx_status_priority (status, priority),
    INDEX idx_reviewer_status (assigned_reviewer_id, status),
    INDEX idx_sla_deadline (sla_deadline)
);

-- Review decisions table
CREATE TABLE review_decisions (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    task_id BIGINT NOT NULL,
    reviewer_id BIGINT NOT NULL,
    action VARCHAR(20) NOT NULL,
    reason TEXT,
    confidence DECIMAL(3,2),
    notes TEXT,
    tags JSON,
    processing_time_seconds INT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (task_id) REFERENCES review_tasks(id) ON DELETE CASCADE,
    INDEX idx_reviewer_action (reviewer_id, action),
    INDEX idx_created_at (created_at)
);

-- Reviewer table
CREATE TABLE reviewers (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    user_id BIGINT NOT NULL UNIQUE,
    name VARCHAR(100) NOT NULL,
    email VARCHAR(255) NOT NULL,
    specialties JSON NOT NULL,
    languages JSON NOT NULL,
    max_concurrent_tasks INT DEFAULT 10,
    accuracy_rate DECIMAL(5,4) DEFAULT 0,
    avg_processing_time_seconds INT DEFAULT 0,
    total_reviews INT DEFAULT 0,
    is_active BOOLEAN DEFAULT TRUE,
    is_online BOOLEAN DEFAULT FALSE,
    last_active_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_active_online (is_active, is_online),
    INDEX idx_specialties ((CAST(specialties AS CHAR(255) ARRAY)))
);

-- Moderation history table
CREATE TABLE moderation_history (
    id BIGINT PRIMARY KEY AUTO_INCREMENT,
    content_id VARCHAR(36) NOT NULL,
    action VARCHAR(20) NOT NULL,
    reason TEXT,
    performed_by_type ENUM('system', 'reviewer', 'admin') NOT NULL,
    performed_by_id BIGINT,
    details JSON,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    FOREIGN KEY (content_id) REFERENCES moderation_content(content_id) ON DELETE CASCADE,
    INDEX idx_content_action (content_id, action),
    INDEX idx_performed_by (performed_by_type, performed_by_id),
    INDEX idx_created_at (created_at)
);

Caching Strategy

import json

class ModerationCacheManager:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.cache_ttl = {
            'classification_result': 3600,    # 1 hour
            'rule_evaluation': 1800,          # 30 minutes
            'reviewer_workload': 300,         # 5 minutes
            'model_metadata': 7200            # 2 hours
        }
    
    def cache_classification_result(self, content_id, result):
        """Cache a classification result."""
        key = f"classification:{content_id}"
        self.redis.setex(key, self.cache_ttl['classification_result'],
                        json.dumps(result.to_dict()))
    
    def get_cached_classification(self, content_id):
        """Fetch a cached classification result, or None on a miss."""
        key = f"classification:{content_id}"
        cached_data = self.redis.get(key)
        return json.loads(cached_data) if cached_data else None
    
    def cache_reviewer_workload(self, reviewer_id, workload):
        """Cache a reviewer's current workload."""
        key = f"reviewer_workload:{reviewer_id}"
        self.redis.setex(key, self.cache_ttl['reviewer_workload'], workload)
    
    def update_rule_cache(self, rule_id, rule_data):
        """Refresh the cached copy of a rule."""
        key = f"rule:{rule_id}"
        self.redis.setex(key, self.cache_ttl['rule_evaluation'],
                        json.dumps(rule_data))
    
    def invalidate_content_cache(self, content_id):
        """Evict all cache entries related to a piece of content."""
        patterns = [
            f"classification:{content_id}",
            f"rule_eval:{content_id}:*"
        ]
        
        for pattern in patterns:
            if '*' in pattern:
                # Use SCAN instead of KEYS to avoid blocking Redis on large keyspaces
                keys = list(self.redis.scan_iter(match=pattern))
                if keys:
                    self.redis.delete(*keys)
            else:
                self.redis.delete(pattern)
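The cache-aside read path above can be exercised end to end with a tiny in-memory stand-in for the Redis client; `FakeRedis` and `classify` below are illustrative test doubles, not part of redis-py or of this system:

```python
import json

class FakeRedis:
    """Minimal in-memory stand-in for redis-py (TTL expiry is ignored)."""
    def __init__(self):
        self.store = {}
    def setex(self, key, ttl, value):
        self.store[key] = value
    def get(self, key):
        return self.store.get(key)

def get_classification(cache, content_id, classify_fn):
    """Cache-aside: return the cached result, otherwise compute and cache it."""
    key = f"classification:{content_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    result = classify_fn(content_id)
    cache.setex(key, 3600, json.dumps(result))
    return result

cache = FakeRedis()
calls = []

def classify(content_id):
    # Stand-in for an expensive ML call; records how often it runs
    calls.append(content_id)
    return {"category": "spam", "confidence": 0.97}

first = get_classification(cache, "c-1", classify)
second = get_classification(cache, "c-1", classify)  # served from cache
print(first == second, len(calls))  # True 1
```

The second lookup never reaches the classifier, which is exactly the repeated-computation saving the 1-hour `classification_result` TTL is meant to buy.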

🚀 Performance Optimization

Batch Processing Optimization

import logging
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

logger = logging.getLogger(__name__)

class BatchProcessor:
    def __init__(self, ml_classifier, rule_engine):
        self.ml_classifier = ml_classifier
        self.rule_engine = rule_engine
        self.batch_size = 100
        self.max_wait_time = 30  # seconds
    
    def group_by_type(self, content_batch):
        """Group incoming items by content type (text, image, ...)."""
        groups = defaultdict(list)
        for content in content_batch:
            groups[content.content_type].append(content)
        return groups
    
    def process_content_batch(self, content_batch):
        """Process a batch of content items."""
        try:
            # Group by type so each model sees a homogeneous batch
            content_groups = self.group_by_type(content_batch)
            
            results = []
            
            # Process the different types in parallel
            with ThreadPoolExecutor(max_workers=4) as executor:
                futures = []
                
                for content_type, contents in content_groups.items():
                    future = executor.submit(
                        self.process_type_batch, content_type, contents
                    )
                    futures.append(future)
                
                # Collect results
                for future in futures:
                    batch_results = future.result()
                    results.extend(batch_results)
            
            return results
            
        except Exception as e:
            logger.error(f"Batch processing failed: {e}")
            return []
    
    def process_type_batch(self, content_type, contents):
        """Process a batch of items that share one content type."""
        if content_type == 'text':
            return self.process_text_batch(contents)
        elif content_type == 'image':
            return self.process_image_batch(contents)
        else:
            # Fall back to per-item processing for other types
            return [self.process_single_content(c) for c in contents]
    
    def process_text_batch(self, text_contents):
        """Classify a batch of text items in a single model call."""
        # Extract the raw text
        texts = [c.text_content for c in text_contents]
        
        # Batch classification
        batch_classifications = self.ml_classifier.classify_text_batch(texts)
        
        results = []
        for content, classification in zip(text_contents, batch_classifications):
            # Apply the rule engine on top of the model output
            rule_result = self.rule_engine.evaluate_rules(
                content, classification
            )
            
            results.append(
                ProcessingResult(
                    content_id=content.content_id,
                    classification=classification,
                    rule_result=rule_result,
                    final_action=rule_result.final_action
                )
            )
        
        return results

Model Inference Optimization

from queue import Queue

import torch

class ModelInferenceOptimizer:
    def __init__(self):
        self.model_pool = {}
        self.inference_queue = Queue()
        self.batch_timeout = 0.1  # 100 ms
    
    def optimize_inference(self, model_name, inputs):
        """Run inference with dynamic batching and a cached optimized model."""
        # Dynamic batching: wait up to batch_timeout to fill a batch
        batch = self.collect_batch(inputs, self.batch_timeout)
        
        # Fetch (or build) the optimized model
        model = self.get_optimized_model(model_name)
        
        # Batched inference without gradient tracking
        with torch.no_grad():
            batch_results = model(batch)
        
        return batch_results
    
    def get_optimized_model(self, model_name):
        """Return a cached optimized model, building it on first use."""
        if model_name not in self.model_pool:
            # Load and optimize the model
            base_model = self.load_base_model(model_name)
            
            # Apply inference-time optimizations
            optimized_model = self.apply_optimizations(base_model)
            
            self.model_pool[model_name] = optimized_model
        
        return self.model_pool[model_name]
    
    def apply_optimizations(self, model):
        """Apply inference-time model optimizations."""
        # 1. Dynamic quantization of linear layers to int8
        quantized_model = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8
        )
        
        # 2. JIT compilation
        jit_model = torch.jit.script(quantized_model)
        
        # 3. Switch to evaluation mode
        jit_model.eval()
        
        return jit_model

📊 Monitoring & Analytics

Performance Monitoring

from datetime import datetime

class ModerationMetrics:
    def __init__(self, metrics_client):
        self.metrics = metrics_client
    
    def track_content_processing(self, content_type, processing_time, action):
        """Record content-processing metrics."""
        self.metrics.histogram('content.processing_time',
                             processing_time,
                             tags={'type': content_type, 'action': action})
        
        self.metrics.increment('content.processed',
                             tags={'type': content_type, 'action': action})
    
    def track_model_performance(self, model_name, accuracy, latency):
        """Record model performance metrics."""
        self.metrics.gauge('model.accuracy',
                          accuracy,
                          tags={'model': model_name})
        
        self.metrics.histogram('model.inference_latency',
                             latency,
                             tags={'model': model_name})
    
    def track_reviewer_metrics(self, reviewer_id, decision_time, accuracy):
        """Record reviewer performance metrics."""
        self.metrics.histogram('reviewer.decision_time',
                             decision_time,
                             tags={'reviewer': reviewer_id})
        
        self.metrics.gauge('reviewer.accuracy',
                          accuracy,
                          tags={'reviewer': reviewer_id})
    
    def generate_daily_report(self):
        """Build the daily moderation report."""
        today = datetime.utcnow().date()
        
        # Content processing stats
        content_stats = self.get_content_processing_stats(today)
        
        # Model performance stats
        model_stats = self.get_model_performance_stats(today)
        
        # Reviewer performance stats
        reviewer_stats = self.get_reviewer_performance_stats(today)
        
        return DailyModerationReport(
            date=today,
            content_stats=content_stats,
            model_stats=model_stats,
            reviewer_stats=reviewer_stats,
            sla_compliance=self.calculate_sla_compliance(today)
        )
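The metric names and tag scheme can be sanity-checked by driving one of the `track_*` methods with an in-memory sink; `StubMetricsClient` below is a hypothetical test double, and the standalone `track_content_processing` function mirrors the method of the same name:

```python
class StubMetricsClient:
    """In-memory metrics sink that records every emitted data point."""
    def __init__(self):
        self.points = []
    def histogram(self, name, value, tags=None):
        self.points.append(('histogram', name, value, tags or {}))
    def increment(self, name, tags=None):
        self.points.append(('counter', name, 1, tags or {}))
    def gauge(self, name, value, tags=None):
        self.points.append(('gauge', name, value, tags or {}))

def track_content_processing(metrics, content_type, processing_time, action):
    """Same metric names and tags as ModerationMetrics.track_content_processing."""
    metrics.histogram('content.processing_time', processing_time,
                      tags={'type': content_type, 'action': action})
    metrics.increment('content.processed',
                      tags={'type': content_type, 'action': action})

client = StubMetricsClient()
track_content_processing(client, 'text', 42.0, 'approve')
names = [p[1] for p in client.points]
print(names)  # ['content.processing_time', 'content.processed']
```

Keeping type and action as tags (rather than baking them into the metric name) lets the monitoring backend slice processing time per content type and per decision without a combinatorial explosion of metric names.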

This completes the content moderation system design, covering multimedia content classification, the rule engine, human review, and performance optimization, so that content stays safe and compliant.


🎯 Scenario Walkthrough

You open an app whose every post, image, and comment passes through a moderation pipeline. Behind this seemingly simple experience, the system faces three core challenges:

  • Challenge 1: high concurrency. How do you keep latency low at tens of thousands of QPS at peak?
  • Challenge 2: high availability. How do you keep the service running when nodes fail?
  • Challenge 3: data consistency. How do you keep data correct in a distributed environment?

📈 Capacity Estimation

Assume 10 million DAU with 50 requests per user per day (500 million requests/day):

| Metric | Estimate |
| --- | --- |
| Daily active users | 10 million |
| Peak QPS | ~50,000/s |
| Data storage | ~5 TB |
| P99 latency | < 100 ms |
| Availability | 99.99% |
| Daily data growth | ~50 GB |
| Service nodes | 20-50 |
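The QPS figures follow from a quick back-of-envelope calculation; the peak-to-average ratio of 8 is an assumption chosen here for illustration, not a number from the original design:

```python
# Back-of-envelope check of the capacity numbers above.
DAU = 10_000_000          # daily active users
REQS_PER_USER = 50        # requests per user per day
SECONDS_PER_DAY = 86_400
PEAK_FACTOR = 8           # assumed peak-to-average ratio

daily_requests = DAU * REQS_PER_USER            # 500,000,000
avg_qps = daily_requests / SECONDS_PER_DAY      # ~5,787
peak_qps = avg_qps * PEAK_FACTOR                # ~46,300, i.e. roughly 50k

print(f"avg ≈ {avg_qps:,.0f} QPS, peak ≈ {peak_qps:,.0f} QPS")
```

An average of roughly 6k QPS with an 8x peak factor lands near the ~50,000/s peak in the table; sizing for peak rather than average is what drives the 20-50 node estimate.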

❓ Frequently Asked Interview Questions

Q1: What are the core design principles of a content moderation system?

See the architecture section above. The core principles are high availability (automatic failure recovery), high performance (low latency, high throughput), scalability (horizontal scaling), and consistency (data correctness guarantees). In an interview, ground each principle in a concrete scenario.

Q2: What are the main challenges for a content moderation system at scale?

  1) Performance bottlenecks: a single node cannot keep up as data and request volume grow; 2) consistency: guaranteeing data consistency in a distributed environment; 3) failure recovery: automatic failover and data recovery when nodes fail; 4) operational complexity: cluster management, monitoring, and upgrades.

Q3: How do you keep a content moderation system highly available?

  1) Multi-replica redundancy (at least 3 replicas); 2) automatic failure detection and failover (heartbeats plus leader election); 3) data persistence and backups; 4) rate limiting and graceful degradation (to prevent cascading failures); 5) multi-datacenter / active-active deployment.
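Of these, rate limiting is the easiest to make concrete. A minimal token-bucket limiter might look like the sketch below; the rate and capacity values are illustrative, not tuned for this system:

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refill at `rate` tokens/sec, cap at `capacity`."""
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self):
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=1, capacity=5)   # 1 req/s sustained, bursts of 5
burst = [bucket.allow() for _ in range(7)]
print(burst)  # first 5 requests pass, the rest are rejected until tokens refill
```

In practice this sits at the API gateway (the rate-limiting layer in the architecture diagram); rejected requests get a 429 rather than overwhelming the classifiers behind it.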

Q4: What are the key levers for performance optimization?

  1) Caching (avoid repeated computation and IO); 2) asynchronous processing (move non-critical work off the request path); 3) batching (fewer network round trips); 4) data sharding (parallel processing); 5) connection pooling and reuse.

Q5: How does this design compare with alternative approaches?

See the option-comparison table below. Selection criteria include team expertise, data scale, latency requirements, consistency needs, and operational cost. There is no silver bullet; weigh the trade-offs against your business scenario.


| Option | Complexity | Cost | Best fit |
| --- | --- | --- | --- |
| Option 1 | Simple implementation | Low | Small scale |
| Option 2 | Moderate complexity | Medium | Medium scale |
| Option 3 ⭐ recommended | High complexity | High | Large-scale production |

✅ Architecture Design Checklist

  • Caching strategy
  • Monitoring and alerting
  • Security design
  • Performance optimization
  • Horizontal scaling

🚀 Architecture Evolution Path

Stage 1: Single-node MVP (< 100K users)

  • Monolith plus a single database; validate the core features quickly
  • Best for: early product stage, fast iteration

Stage 2: Basic distributed (100K to 1M users)

  • Horizontally scaled application tier, primary/replica database split, Redis cache
  • Introduce a message queue to decouple asynchronous tasks

Stage 3: Production-grade high availability (> 1M users)

  • Microservice split, database sharding, multi-datacenter deployment
  • Full-chain observability, automated operations, cross-region disaster recovery

⚖️ Key Trade-off Analysis

🔴 Trade-off 1: Consistency vs. Availability

  • Strong consistency (CP): for scenarios that cannot tolerate errors, such as financial transactions
  • High availability (AP): for scenarios that tolerate brief inconsistency, such as social feeds
  • This system: strong consistency on the core path, eventual consistency elsewhere

🔴 Trade-off 2: Synchronous vs. Asynchronous

  • Synchronous processing: low latency but limited throughput; suited to core interactive paths
  • Asynchronous processing: high throughput but added latency; suited to background computation
  • This system: synchronous on the core path, asynchronous elsewhere
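The sync/async split can be sketched with a thread-backed queue: a cheap check answers the request immediately, while the expensive scan is deferred. Names like `deep_scan_worker` and the "pass" verdict are illustrative placeholders, not this system's actual API:

```python
import queue
import threading

task_queue = queue.Queue()
deep_scan_results = []

def deep_scan_worker():
    """Background worker draining the queue (the asynchronous path)."""
    while True:
        content_id = task_queue.get()
        if content_id is None:          # sentinel: shut down
            break
        # Stand-in for expensive work (large model, frame extraction, ...)
        deep_scan_results.append(f"deep-scanned:{content_id}")
        task_queue.task_done()

def moderate(content_id):
    """Synchronous path: cheap check now, expensive scan handed to the queue."""
    quick_verdict = "pass"              # placeholder for a fast model call
    task_queue.put(content_id)          # defer the slow work
    return quick_verdict

worker = threading.Thread(target=deep_scan_worker, daemon=True)
worker.start()

verdicts = [moderate(cid) for cid in ("c-1", "c-2")]
task_queue.join()                       # wait for async work (demo only)
task_queue.put(None)
worker.join()
print(verdicts, deep_scan_results)
```

The caller gets an answer before the deep scan finishes, which is the throughput-for-latency trade described above; in production the in-process queue would be a durable message broker so deferred work survives restarts.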