🚀 System Design in Practice #170: Designing a Personalized Search System
Abstract: This article dissects the core architecture, key algorithms, and engineering practices of a personalized search system, providing a complete design plan and interview talking points.
Have you ever wondered how complex the technical challenges behind a personalized search system really are?
1. Requirements Analysis
Functional requirements
- Query understanding: parse the user's search intent and semantics
- Personalized ranking: order results based on the user's profile
- Real-time suggestions: query suggestions while the user is searching
- Multimodal search: support text, image, and voice queries
- Search history: record and analyze each user's search history
- A/B testing: an experimentation framework for search algorithms
Non-functional requirements
- Performance: search response time < 200 ms; sustain 100,000 QPS
- Accuracy: result relevance > 90%; personalization lift > 15%
- Availability: 99.9% service availability
- Scalability: hundreds of millions of users and a 100-billion-scale document corpus
- Freshness: user behavior feeds back into search results in real time
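Before diving into the architecture, it is worth sanity-checking these targets with a back-of-envelope estimate. The sketch below is illustrative only: the 2,000-QPS-per-node and 2 KiB-per-document figures are assumptions, not measurements.

```python
def estimate_capacity(peak_qps=100_000, per_node_qps=2_000,
                      docs=100_000_000_000, doc_bytes=2_048):
    """Back-of-envelope sizing from the stated non-functional targets."""
    search_nodes = peak_qps / per_node_qps   # query-serving nodes needed at peak
    index_tib = docs * doc_bytes / 1024**4   # raw index size in TiB
    return {"search_nodes": search_nodes, "index_tib": round(index_tib, 1)}
```

With the assumed figures this lands around 50 query-serving nodes and roughly 190 TiB of raw index, before replication — which is why the architecture below shards the index and caches aggressively.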
2. System Architecture
Overall architecture
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Search Apps │ │ Web Portal │ │ Admin Panel │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ API Gateway │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Search Service │ │Query Understanding│ │Personalization │
│ │ │ Service │ │ Service │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│ │ │
└───────────────────────┼───────────────────────┘
│
┌─────────────────────────────────────────────────────────────────┐
│ Search Engine Core │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Indexing │ │ Retrieval │ │ Ranking │ │
│ │ Service │ │ Service │ │ Service │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
│
┌───────────────────────┼───────────────────────┐
│ │ │
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Search Index │ │ User Profile │ │ Analytics │
│ (Elasticsearch)│ │ Service │ │ Service │
└─────────────────┘ └─────────────────┘ └─────────────────┘
3. Core Component Design
3.1 Query Understanding Service
class QueryUnderstandingService:
def __init__(self):
self.query_parser = QueryParser()
self.intent_classifier = IntentClassifier()
self.entity_extractor = EntityExtractor()
self.query_expander = QueryExpander()
self.spell_checker = SpellChecker()
async def understand_query(self, query: str, user_context: UserContext) -> QueryUnderstanding:
# Spell check and correction
corrected_query = await self.spell_checker.correct(query)
# Query parsing
parsed_query = self.query_parser.parse(corrected_query)
# Intent classification
intent = await self.intent_classifier.classify(corrected_query, user_context)
# Entity extraction
entities = await self.entity_extractor.extract(corrected_query)
# Query expansion
expanded_terms = await self.query_expander.expand(
corrected_query, intent, entities, user_context
)
return QueryUnderstanding(
original_query=query,
corrected_query=corrected_query,
parsed_query=parsed_query,
intent=intent,
entities=entities,
expanded_terms=expanded_terms,
confidence=self._calculate_confidence(intent, entities)
)
class IntentClassifier:
def __init__(self):
self.model = self._load_intent_model()
self.intent_categories = [
'product_search', 'information_seeking', 'navigation',
'comparison', 'local_search', 'image_search'
]
async def classify(self, query: str, user_context: UserContext) -> Intent:
# Feature extraction
features = self._extract_features(query, user_context)
# Model prediction
intent_probs = self.model.predict_proba([features])[0]
# Pick the most likely intent
max_prob_idx = np.argmax(intent_probs)
predicted_intent = self.intent_categories[max_prob_idx]
confidence = intent_probs[max_prob_idx]
return Intent(
category=predicted_intent,
confidence=confidence,
subcategory=self._get_subcategory(predicted_intent, query)
)
def _extract_features(self, query: str, user_context: UserContext) -> np.ndarray:
features = []
# Query features
features.extend([
len(query.split()),  # token count
len(query),  # character count
query.count('?'),  # question-mark count
query.count('vs'),  # comparison terms
query.count('near'),  # location terms
])
# User-context features
features.extend([
user_context.search_history_count,
user_context.avg_session_length,
user_context.preferred_categories_count
])
# TF-IDF features
tfidf_features = self._get_tfidf_features(query)
features.extend(tfidf_features)
return np.array(features)
class EntityExtractor:
def __init__(self):
self.ner_model = self._load_ner_model()
self.entity_types = [
'PRODUCT', 'BRAND', 'CATEGORY', 'PRICE', 'LOCATION',
'DATE', 'PERSON', 'ORGANIZATION'
]
async def extract(self, query: str) -> List[Entity]:
# NER model extraction
ner_results = self.ner_model(query)
entities = []
for entity_info in ner_results:
entity = Entity(
text=entity_info['word'],
type=entity_info['entity'],
confidence=entity_info['confidence'],
start_pos=entity_info['start'],
end_pos=entity_info['end']
)
entities.append(entity)
# Rule-based augmentation
rule_entities = await self._extract_rule_based_entities(query)
entities.extend(rule_entities)
# Deduplicate and merge
entities = self._merge_overlapping_entities(entities)
return entities
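The `_merge_overlapping_entities` helper is referenced but never shown. One minimal way to deduplicate overlapping spans (a sketch, not the article's actual implementation) is to keep the highest-confidence entity for each overlapping region:

```python
from dataclasses import dataclass

@dataclass
class Entity:
    text: str
    type: str
    confidence: float
    start_pos: int
    end_pos: int

def merge_overlapping_entities(entities):
    """Keep the highest-confidence entity for each overlapping span."""
    merged = []
    # Higher-confidence entities claim their spans first
    for ent in sorted(entities, key=lambda e: e.confidence, reverse=True):
        overlaps = any(
            ent.start_pos < kept.end_pos and kept.start_pos < ent.end_pos
            for kept in merged
        )
        if not overlaps:
            merged.append(ent)
    # Restore document order for downstream consumers
    return sorted(merged, key=lambda e: e.start_pos)
```

For example, with both "iphone 15" (PRODUCT, 0.9) and "iphone" (BRAND, 0.6) extracted over the same characters, only the former survives.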
3.2 Personalization Service
class PersonalizationService:
def __init__(self):
self.user_profile_service = UserProfileService()
self.behavior_analyzer = BehaviorAnalyzer()
self.preference_learner = PreferenceLearner()
self.context_analyzer = ContextAnalyzer()
async def personalize_results(self, query_understanding: QueryUnderstanding,
initial_results: List[SearchResult],
user_id: str) -> List[SearchResult]:
# Fetch the user profile
user_profile = await self.user_profile_service.get_profile(user_id)
# Analyze the current context
current_context = await self.context_analyzer.analyze_context(
user_id, query_understanding
)
# Compute personalization scores
personalized_results = []
for result in initial_results:
personalization_score = await self._calculate_personalization_score(
result, user_profile, current_context, query_understanding
)
# Update result scores
result.personalization_score = personalization_score
result.final_score = self._combine_scores(
result.relevance_score, personalization_score
)
personalized_results.append(result)
# Re-rank
personalized_results.sort(key=lambda x: x.final_score, reverse=True)
return personalized_results
async def _calculate_personalization_score(self, result: SearchResult,
user_profile: UserProfile,
context: SearchContext,
query_understanding: QueryUnderstanding) -> float:
score_components = {}
# User-interest match
interest_score = self._calculate_interest_match(result, user_profile.interests)
score_components['interest'] = interest_score
# Similarity to behavior history
behavior_score = await self._calculate_behavior_similarity(
result, user_profile.behavior_history
)
score_components['behavior'] = behavior_score
# Collaborative-filtering score
collaborative_score = await self._calculate_collaborative_score(
result, user_profile.user_id
)
score_components['collaborative'] = collaborative_score
# Context relevance
context_score = self._calculate_context_relevance(result, context)
score_components['context'] = context_score
# Time-decay factor
time_decay = self._calculate_time_decay(result.publish_time)
score_components['time_decay'] = time_decay
# Weighted combination
weights = self._get_personalization_weights(query_understanding.intent)
final_score = sum(
score_components[component] * weights[component]
for component in score_components
)
return final_score
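`_combine_scores` and `_calculate_time_decay` are used above but never defined. One plausible sketch (the 0.3 blend weight and 30-day half-life are assumptions, not values from the article) is a linear blend plus exponential freshness decay:

```python
from datetime import datetime, timezone
from typing import Optional

def combine_scores(relevance: float, personalization: float,
                   personalization_weight: float = 0.3) -> float:
    """Linear blend of base relevance and the personalization signal."""
    return (1 - personalization_weight) * relevance + personalization_weight * personalization

def time_decay(publish_time: datetime, half_life_days: float = 30.0,
               now: Optional[datetime] = None) -> float:
    """Exponential freshness decay: 1.0 when brand new, 0.5 after one half-life."""
    now = now or datetime.now(timezone.utc)
    age_days = max((now - publish_time).total_seconds() / 86400.0, 0.0)
    return 0.5 ** (age_days / half_life_days)
```

Keeping the blend linear makes the personalization weight directly tunable per intent, which is exactly what `_get_personalization_weights` above implies.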
class UserProfileService:
def __init__(self):
self.profile_cache = UserProfileCache()
self.behavior_tracker = BehaviorTracker()
self.interest_extractor = InterestExtractor()
async def get_profile(self, user_id: str) -> UserProfile:
# Check the cache
cached_profile = await self.profile_cache.get(user_id)
if cached_profile and not self._is_profile_stale(cached_profile):
return cached_profile
# Build the profile
profile = await self._build_user_profile(user_id)
# Refresh the cache
await self.profile_cache.set(user_id, profile)
return profile
async def _build_user_profile(self, user_id: str) -> UserProfile:
# Basic user info
basic_info = await self._get_user_basic_info(user_id)
# Analyze search history
search_history = await self._get_search_history(user_id, days=30)
search_patterns = self._analyze_search_patterns(search_history)
# Analyze click behavior
click_history = await self._get_click_history(user_id, days=30)
click_preferences = self._analyze_click_preferences(click_history)
# Extract interest tags
interests = await self.interest_extractor.extract_interests(
search_history, click_history
)
# Compute the user embedding
feature_vector = self._compute_user_embedding(
search_patterns, click_preferences, interests
)
return UserProfile(
user_id=user_id,
basic_info=basic_info,
interests=interests,
search_patterns=search_patterns,
click_preferences=click_preferences,
feature_vector=feature_vector,
last_updated=datetime.utcnow()
)
3.3 Search Engine Core
class SearchEngineCore:
def __init__(self):
self.elasticsearch_client = ElasticsearchClient()
self.retrieval_service = RetrievalService()
self.ranking_service = RankingService()
self.index_manager = IndexManager()
async def search(self, query_understanding: QueryUnderstanding,
user_context: UserContext,
pagination: Pagination) -> SearchResults:
# Retrieval stage
candidate_results = await self.retrieval_service.retrieve(
query_understanding, user_context, limit=1000
)
# Ranking stage
ranked_results = await self.ranking_service.rank(
candidate_results, query_understanding, user_context
)
# Pagination
paginated_results = self._paginate_results(ranked_results, pagination)
# Assemble the response
return SearchResults(
results=paginated_results,
total_count=len(candidate_results),
query_understanding=query_understanding,
search_time_ms=self._get_search_time(),
personalization_applied=user_context.user_id is not None
)
class RetrievalService:
def __init__(self):
self.elasticsearch_client = ElasticsearchClient()
self.query_builder = ElasticsearchQueryBuilder()
self.multi_stage_retrieval = MultiStageRetrieval()
async def retrieve(self, query_understanding: QueryUnderstanding,
user_context: UserContext,
limit: int = 1000) -> List[SearchResult]:
# Build the Elasticsearch query
es_query = self.query_builder.build_query(query_understanding, user_context)
# Execute the search
es_response = await self.elasticsearch_client.search(
index='search_documents',
body=es_query,
size=limit
)
# Parse hits
results = []
for hit in es_response['hits']['hits']:
result = SearchResult(
document_id=hit['_id'],
title=hit['_source']['title'],
content=hit['_source']['content'],
url=hit['_source']['url'],
category=hit['_source']['category'],
relevance_score=hit['_score'],
publish_time=hit['_source']['publish_time'],
metadata=hit['_source'].get('metadata', {})
)
results.append(result)
return results
class ElasticsearchQueryBuilder:
def build_query(self, query_understanding: QueryUnderstanding,
user_context: UserContext) -> Dict:
query = {
"query": {
"bool": {
"must": [],
"should": [],
"filter": [],
"must_not": []
}
},
"highlight": {
"fields": {
"title": {},
"content": {}
}
},
"_source": ["title", "content", "url", "category", "publish_time", "metadata"]
}
# Main query
main_query = self._build_main_query(query_understanding)
query["query"]["bool"]["must"].append(main_query)
# Intent-driven boosts
intent_boost = self._build_intent_boost(query_understanding.intent)
if intent_boost:
query["query"]["bool"]["should"].extend(intent_boost)
# Entity filters
entity_filters = self._build_entity_filters(query_understanding.entities)
query["query"]["bool"]["filter"].extend(entity_filters)
# User-context boosts
if user_context.user_id:
context_boost = self._build_context_boost(user_context)
query["query"]["bool"]["should"].extend(context_boost)
# Time decay
time_decay = self._build_time_decay()
query["query"]["bool"]["should"].append(time_decay)
return query
def _build_main_query(self, query_understanding: QueryUnderstanding) -> Dict:
"""Build the main query."""
return {
"multi_match": {
"query": query_understanding.corrected_query,
"fields": [
"title^3",  # highest weight on titles
"content^1",  # base weight on content
"tags^2",  # higher weight on tags
"category^1.5"  # moderate weight on category
],
"type": "best_fields",
"fuzziness": "AUTO",
"operator": "and"
}
}
def _build_intent_boost(self, intent: Intent) -> List[Dict]:
"""Intent-based query boosts."""
boosts = []
if intent.category == 'product_search':
boosts.append({
"term": {
"document_type": {
"value": "product",
"boost": 2.0
}
}
})
elif intent.category == 'information_seeking':
boosts.append({
"term": {
"document_type": {
"value": "article",
"boost": 2.0
}
}
})
return boosts
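`_build_time_decay`, appended to the `should` clause in `build_query` above, is not shown. In Elasticsearch the standard tool for this is a `function_score` query with a `gauss` decay on the date field; a sketch (the 30-day scale and 0.5 decay are arbitrary choices, and the field name comes from the mapping in section 4.1):

```python
def build_time_decay(field: str = "publish_time",
                     scale: str = "30d", decay: float = 0.5) -> dict:
    """Boost recent documents: the score falls to `decay` at `scale` from now."""
    return {
        "function_score": {
            "query": {"match_all": {}},
            "functions": [{
                "gauss": {field: {"origin": "now", "scale": scale, "decay": decay}}
            }],
            "boost_mode": "multiply",
        }
    }
```

Because it sits in `should`, freshness only boosts matching documents rather than filtering anything out.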
class RankingService:
def __init__(self):
self.learning_to_rank = LearningToRankModel()
self.feature_extractor = RankingFeatureExtractor()
self.diversity_optimizer = DiversityOptimizer()
async def rank(self, results: List[SearchResult],
query_understanding: QueryUnderstanding,
user_context: UserContext) -> List[SearchResult]:
if not results:
return results
# Extract ranking features
ranking_features = await self.feature_extractor.extract_features(
results, query_understanding, user_context
)
# Learning-to-rank prediction
ltr_scores = self.learning_to_rank.predict(ranking_features)
# Update result scores
for i, result in enumerate(results):
result.ltr_score = ltr_scores[i]
result.final_score = self._combine_scores(
result.relevance_score, result.ltr_score
)
# Sort
results.sort(key=lambda x: x.final_score, reverse=True)
# Diversity optimization
diversified_results = await self.diversity_optimizer.optimize(
results, query_understanding
)
return diversified_results
class RankingFeatureExtractor:
def __init__(self):
self.text_analyzer = TextAnalyzer()
self.popularity_calculator = PopularityCalculator()
self.freshness_calculator = FreshnessCalculator()
async def extract_features(self, results: List[SearchResult],
query_understanding: QueryUnderstanding,
user_context: UserContext) -> np.ndarray:
features_matrix = []
for result in results:
features = []
# Text-relevance features
text_features = self._extract_text_features(result, query_understanding)
features.extend(text_features)
# Popularity features
popularity_features = await self._extract_popularity_features(result)
features.extend(popularity_features)
# Freshness features
freshness_features = self._extract_freshness_features(result)
features.extend(freshness_features)
# User features
if user_context.user_id:
user_features = await self._extract_user_features(result, user_context)
features.extend(user_features)
else:
features.extend([0.0] * 10)  # pad with defaults
# Query-document match features
match_features = self._extract_match_features(result, query_understanding)
features.extend(match_features)
features_matrix.append(features)
return np.array(features_matrix)
def _extract_text_features(self, result: SearchResult,
query_understanding: QueryUnderstanding) -> List[float]:
"""Extract text-relevance features."""
features = []
query_terms = query_understanding.corrected_query.lower().split()
title_terms = result.title.lower().split()
content_terms = result.content.lower().split()
# TF-IDF similarity
title_tfidf = self.text_analyzer.calculate_tfidf_similarity(
query_terms, title_terms
)
content_tfidf = self.text_analyzer.calculate_tfidf_similarity(
query_terms, content_terms
)
features.extend([title_tfidf, content_tfidf])
# Term overlap (guard against an empty query)
title_overlap = len(set(query_terms) & set(title_terms)) / max(len(query_terms), 1)
content_overlap = len(set(query_terms) & set(content_terms)) / max(len(query_terms), 1)
features.extend([title_overlap, content_overlap])
# Exact match
exact_match_title = 1.0 if query_understanding.corrected_query.lower() in result.title.lower() else 0.0
exact_match_content = 1.0 if query_understanding.corrected_query.lower() in result.content.lower() else 0.0
features.extend([exact_match_title, exact_match_content])
return features
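The `calculate_tfidf_similarity` helper is assumed rather than shown. A self-contained stand-in is cosine similarity over term-frequency vectors; note that a production implementation would also weight terms by corpus IDF, which this sketch omits to stay self-contained:

```python
import math
from collections import Counter

def tf_cosine_similarity(terms_a, terms_b):
    """Cosine similarity of raw term-frequency vectors (IDF weighting omitted)."""
    tf_a, tf_b = Counter(terms_a), Counter(terms_b)
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

The empty-vector guard mirrors the division guard needed in the overlap features above.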
3.4 Real-Time Learning System
class RealTimeLearningSystem:
def __init__(self):
self.click_tracker = ClickTracker()
self.feedback_processor = FeedbackProcessor()
self.model_updater = OnlineModelUpdater()
self.feature_updater = FeatureUpdater()
async def process_user_feedback(self, feedback: UserFeedback):
"""Process user feedback."""
# Record the interaction
await self.click_tracker.track_interaction(feedback)
# Update the user profile
await self._update_user_profile(feedback)
# Update search models
await self._update_search_models(feedback)
# Update document features
await self._update_document_features(feedback)
async def _update_user_profile(self, feedback: UserFeedback):
"""Update the user profile."""
user_id = feedback.user_id
# Update interest tags
if feedback.action == 'click':
await self._update_user_interests(user_id, feedback.document_id)
# Update search patterns
await self._update_search_patterns(user_id, feedback.query, feedback.action)
# Update preference weights
await self._update_preference_weights(user_id, feedback)
async def _update_search_models(self, feedback: UserFeedback):
"""Update the search models."""
# Prepare a training sample
training_sample = await self._prepare_training_sample(feedback)
# Online-learning update
await self.model_updater.update_ranking_model(training_sample)
# Update the query-understanding model
if feedback.action in ['click', 'dwell_time_long']:
await self.model_updater.update_query_understanding_model(
feedback.query, feedback.document_category
)
class OnlineModelUpdater:
def __init__(self):
self.ranking_model = OnlineLearningRanker()
self.query_model = OnlineQueryModel()
self.update_buffer = UpdateBuffer(max_size=1000)
async def update_ranking_model(self, training_sample: TrainingSample):
"""Update the ranking model."""
# Add to the update buffer
self.update_buffer.add(training_sample)
# Batch update
if self.update_buffer.is_full():
batch_samples = self.update_buffer.get_all()
# Incremental learning
self.ranking_model.partial_fit(
[sample.features for sample in batch_samples],
[sample.label for sample in batch_samples]
)
# Clear the buffer
self.update_buffer.clear()
async def update_query_understanding_model(self, query: str, category: str):
"""Update the query-understanding model."""
# Build a training sample
features = self._extract_query_features(query)
# Online update
self.query_model.partial_fit([features], [category])
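`partial_fit` here follows scikit-learn's incremental-learning convention. As a self-contained illustration of what `OnlineLearningRanker` might do internally (the feature count and learning rate are arbitrary assumptions), here is a tiny logistic ranker updated by SGD:

```python
import math

class OnlineLogisticRanker:
    """Minimal online ranker: logistic regression trained by SGD.

    A stand-in for `OnlineLearningRanker`; not the article's actual model.
    """
    def __init__(self, n_features: int, lr: float = 0.1):
        self.w = [0.0] * n_features
        self.b = 0.0
        self.lr = lr

    def predict_proba(self, x):
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        return 1.0 / (1.0 + math.exp(-z))

    def partial_fit(self, X, y):
        # One SGD pass over the mini-batch flushed from the update buffer
        for x, label in zip(X, y):
            grad = self.predict_proba(x) - label  # dLoss/dz for log loss
            self.b -= self.lr * grad
            for i, xi in enumerate(x):
                self.w[i] -= self.lr * grad * xi
```

Because each `partial_fit` call only touches the batched samples, the buffer-and-flush pattern above keeps per-feedback latency low while still learning continuously.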
class ClickTracker:
def __init__(self):
self.kafka_producer = KafkaProducer(topic='user_interactions')
self.redis_client = RedisClient()
async def track_interaction(self, feedback: UserFeedback):
"""Track a user interaction."""
# Publish to Kafka for real-time processing
interaction_event = {
'user_id': feedback.user_id,
'query': feedback.query,
'document_id': feedback.document_id,
'action': feedback.action,
'timestamp': feedback.timestamp.isoformat(),
'position': feedback.position,
'session_id': feedback.session_id
}
await self.kafka_producer.send('user_interactions', interaction_event)
# Update real-time stats
await self._update_realtime_stats(feedback)
async def _update_realtime_stats(self, feedback: UserFeedback):
"""Update real-time statistics."""
# Update document CTR counters
doc_key = f"doc_ctr:{feedback.document_id}"
await self.redis_client.hincrby(doc_key, 'clicks', 1)
await self.redis_client.hincrby(doc_key, 'impressions', 1)
# Update query stats
query_key = f"query_stats:{hash(feedback.query)}"
await self.redis_client.hincrby(query_key, 'total_queries', 1)
if feedback.action == 'click':
await self.redis_client.hincrby(query_key, 'clicked_queries', 1)
4. Data Storage Design
4.1 Search Index Design
Elasticsearch index mapping:
{
"mappings": {
"properties": {
"title": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart",
"fields": {
"keyword": {
"type": "keyword"
}
}
},
"content": {
"type": "text",
"analyzer": "ik_max_word",
"search_analyzer": "ik_smart"
},
"category": {
"type": "keyword"
},
"tags": {
"type": "keyword"
},
"url": {
"type": "keyword"
},
"publish_time": {
"type": "date"
},
"popularity_score": {
"type": "float"
},
"quality_score": {
"type": "float"
},
"click_count": {
"type": "integer"
},
"view_count": {
"type": "integer"
},
"embedding_vector": {
"type": "dense_vector",
"dims": 768
},
"metadata": {
"type": "object",
"properties": {
"author": {"type": "keyword"},
"source": {"type": "keyword"},
"language": {"type": "keyword"}
}
}
}
},
"settings": {
"number_of_shards": 10,
"number_of_replicas": 1,
"analysis": {
"analyzer": {
"ik_max_word": {
"type": "ik_max_word"
},
"ik_smart": {
"type": "ik_smart"
}
}
}
}
}
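The mapping reserves a 768-dimensional `dense_vector` for embeddings. One way to use it is Elasticsearch's `script_score` query with the built-in `cosineSimilarity` function (field names taken from the mapping above; the surrounding retrieval code would supply the query embedding):

```python
def build_semantic_query(query_text: str, query_vector: list) -> dict:
    """BM25 keyword match rescored by cosine similarity on the embedding."""
    return {
        "query": {
            "script_score": {
                "query": {"match": {"content": query_text}},
                "script": {
                    # "+ 1.0" keeps the score non-negative, as script_score requires
                    "source": "cosineSimilarity(params.qv, 'embedding_vector') + 1.0",
                    "params": {"qv": query_vector},
                },
            }
        }
    }
```

Restricting the scripted rescoring to documents that already match on `content` keeps the expensive vector computation off the bulk of the index.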
4.2 User Data Storage
-- User profile table (PostgreSQL; indexes created separately)
CREATE TABLE user_profiles (
user_id UUID PRIMARY KEY,
interests JSONB DEFAULT '[]',
search_patterns JSONB DEFAULT '{}',
click_preferences JSONB DEFAULT '{}',
demographic_info JSONB DEFAULT '{}',
feature_vector FLOAT[] DEFAULT '{}',
last_updated TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_user_profiles_last_updated ON user_profiles (last_updated);
-- Search history table
CREATE TABLE search_history (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
query TEXT NOT NULL,
query_understanding JSONB,
results_count INTEGER,
clicked_results JSONB DEFAULT '[]',
session_id VARCHAR(64),
timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_search_history_user_ts ON search_history (user_id, timestamp);
CREATE INDEX idx_search_history_query ON search_history (query);
CREATE INDEX idx_search_history_session ON search_history (session_id);
-- Click behavior table
CREATE TABLE click_behavior (
id UUID PRIMARY KEY,
user_id UUID NOT NULL,
query TEXT NOT NULL,
document_id VARCHAR(128) NOT NULL,
position INTEGER NOT NULL,
action VARCHAR(50) NOT NULL, -- click, dwell, skip
dwell_time INTEGER, -- dwell time in seconds
session_id VARCHAR(64),
timestamp TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE INDEX idx_click_behavior_user_ts ON click_behavior (user_id, timestamp);
CREATE INDEX idx_click_behavior_document ON click_behavior (document_id);
CREATE INDEX idx_click_behavior_query_action ON click_behavior (query, action);
5. Cold-Start Handling
5.1 New-User Cold Start
class ColdStartHandler:
def __init__(self):
self.demographic_predictor = DemographicPredictor()
self.popular_content_service = PopularContentService()
self.similarity_calculator = UserSimilarityCalculator()
self.onboarding_service = OnboardingService()
async def handle_new_user_search(self, query: str, user_info: Dict) -> List[SearchResult]:
"""Handle a search from a brand-new user."""
# Predict interests from demographic information
predicted_interests = await self.demographic_predictor.predict_interests(user_info)
# Find similar users
similar_users = await self.similarity_calculator.find_similar_users(
user_info, predicted_interests
)
# Recommend from similar users' search behavior
collaborative_results = await self._get_collaborative_results(
query, similar_users
)
# Popular-content recommendations
popular_results = await self.popular_content_service.get_popular_for_query(
query, predicted_interests
)
# Merge and rank the results
combined_results = self._combine_cold_start_results(
collaborative_results, popular_results, query
)
return combined_results
async def _get_collaborative_results(self, query: str,
similar_users: List[str]) -> List[SearchResult]:
"""Collaborative-filtering results based on similar users."""
# Fetch similar users' search and click history
user_behaviors = await self._get_users_behavior(similar_users, days=30)
# Find historical queries related to the current one
related_queries = self._find_related_queries(query, user_behaviors)
# Fetch high-CTR results for those queries
collaborative_results = []
for related_query in related_queries:
high_ctr_results = await self._get_high_ctr_results(related_query)
collaborative_results.extend(high_ctr_results)
# Deduplicate and rank
return self._deduplicate_and_rank(collaborative_results)
class OnboardingService:
def __init__(self):
self.interest_detector = InterestDetector()
self.preference_learner = PreferenceLearner()
async def collect_initial_preferences(self, user_id: str,
onboarding_data: Dict) -> UserProfile:
"""Collect initial user preferences."""
initial_profile = UserProfile(user_id=user_id)
# Interests from the onboarding survey
if 'survey_responses' in onboarding_data:
survey_interests = self.interest_detector.extract_from_survey(
onboarding_data['survey_responses']
)
initial_profile.interests.extend(survey_interests)
# Interests from social media data (if the user consented)
if 'social_media_data' in onboarding_data:
social_interests = await self.interest_detector.extract_from_social_media(
onboarding_data['social_media_data']
)
initial_profile.interests.extend(social_interests)
# Interests from imported browsing history
if 'browsing_history' in onboarding_data:
browsing_interests = await self.interest_detector.extract_from_browsing(
onboarding_data['browsing_history']
)
initial_profile.interests.extend(browsing_interests)
return initial_profile
class PopularContentService:
def __init__(self):
self.trending_calculator = TrendingCalculator()
self.category_analyzer = CategoryAnalyzer()
async def get_popular_for_query(self, query: str,
interests: List[str]) -> List[SearchResult]:
"""Popular content related to the query."""
# Identify the query's main categories
query_categories = await self.category_analyzer.analyze_query_categories(query)
# Fetch popular content in those categories
popular_results = []
for category in query_categories:
category_popular = await self._get_category_popular_content(
category, interests, limit=20
)
popular_results.extend(category_popular)
# Filter by query relevance
filtered_results = await self._filter_by_query_relevance(
popular_results, query, threshold=0.3
)
return filtered_results
async def _get_category_popular_content(self, category: str,
interests: List[str],
limit: int) -> List[SearchResult]:
"""Popular content within one category."""
# Popularity score (CTR + time decay + quality)
popular_docs = await self.trending_calculator.get_trending_documents(
category=category,
time_window='7d',
limit=limit * 2  # over-fetch candidates
)
# Filter and rank by user interests
interest_filtered = []
for doc in popular_docs:
interest_score = self._calculate_interest_match(doc, interests)
if interest_score > 0.1:  # minimum interest-match threshold
doc.interest_score = interest_score
interest_filtered.append(doc)
# Sort by interest match
interest_filtered.sort(key=lambda x: x.interest_score, reverse=True)
return interest_filtered[:limit]
5.2 New-Content Cold Start
class ContentColdStartHandler:
def __init__(self):
self.content_analyzer = ContentAnalyzer()
self.similarity_calculator = ContentSimilarityCalculator()
self.exploration_strategy = ExplorationStrategy()
async def handle_new_content(self, new_document: Document) -> ContentProfile:
"""Cold-start handling for a new document."""
# Extract content features
content_features = await self.content_analyzer.extract_features(new_document)
# Find similar existing documents
similar_documents = await self.similarity_calculator.find_similar_documents(
new_document, limit=50
)
# Predict performance from similar content
predicted_performance = await self._predict_content_performance(
new_document, similar_documents
)
# Choose an exploration strategy
exploration_config = self.exploration_strategy.get_exploration_config(
new_document, predicted_performance
)
return ContentProfile(
document_id=new_document.id,
features=content_features,
similar_documents=similar_documents,
predicted_performance=predicted_performance,
exploration_config=exploration_config
)
async def _predict_content_performance(self, new_document: Document,
similar_documents: List[Document]) -> Dict:
"""Predict the new document's performance."""
if not similar_documents:
return {'predicted_ctr': 0.05, 'confidence': 0.1}
# Predict from similar documents' historical performance
similar_ctrs = [doc.click_through_rate for doc in similar_documents]
similar_qualities = [doc.quality_score for doc in similar_documents]
# Weighted average (similarity as weight)
weights = [doc.similarity_score for doc in similar_documents]
predicted_ctr = np.average(similar_ctrs, weights=weights)
predicted_quality = np.average(similar_qualities, weights=weights)
# Confidence from the count and similarity of neighbors
confidence = min(0.9, len(similar_documents) / 50 * np.mean(weights))
return {
'predicted_ctr': predicted_ctr,
'predicted_quality': predicted_quality,
'confidence': confidence
}
class ExplorationStrategy:
def __init__(self):
self.epsilon_greedy = EpsilonGreedyStrategy()
self.thompson_sampling = ThompsonSamplingStrategy()
self.ucb = UCBStrategy()
def get_exploration_config(self, document: Document,
predicted_performance: Dict) -> ExplorationConfig:
"""Pick the exploration configuration."""
confidence = predicted_performance['confidence']
if confidence < 0.3:
# Low confidence: explore more with Thompson sampling
strategy = 'thompson_sampling'
exploration_rate = 0.3
elif confidence < 0.7:
# Medium confidence: UCB balances explore/exploit
strategy = 'ucb'
exploration_rate = 0.15
else:
# High confidence: light epsilon-greedy exploration
strategy = 'epsilon_greedy'
exploration_rate = 0.05
return ExplorationConfig(
strategy=strategy,
exploration_rate=exploration_rate,
min_impressions=100,  # minimum impressions before evaluation
evaluation_period=24  # evaluation window (hours)
)
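The Thompson-sampling branch can be made concrete with a Beta posterior over each document's CTR; a sketch, assuming a Beta(1, 1) prior (the strategy classes above are placeholders, so this is illustrative):

```python
import random

def thompson_select(arms, rng=random):
    """Pick the arm whose sampled CTR from its Beta posterior is highest.

    `arms` maps arm id -> (clicks, impressions), with a Beta(1, 1) prior.
    """
    best_arm, best_sample = None, -1.0
    for arm, (clicks, impressions) in arms.items():
        # Posterior: Beta(1 + clicks, 1 + impressions - clicks)
        sample = rng.betavariate(1 + clicks, 1 + impressions - clicks)
        if sample > best_sample:
            best_arm, best_sample = arm, sample
    return best_arm
```

A new document with few impressions has a wide posterior, so it occasionally wins the sample and gets shown; as evidence accumulates, the posterior narrows and exploration naturally tapers off.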
6. A/B Testing Framework
6.1 Experiment Management
class ABTestingFramework:
def __init__(self):
self.experiment_manager = ExperimentManager()
self.traffic_splitter = TrafficSplitter()
self.metrics_collector = MetricsCollector()
self.statistical_analyzer = StatisticalAnalyzer()
async def create_search_experiment(self, experiment_config: ExperimentConfig) -> str:
"""Create a search experiment."""
experiment = SearchExperiment(
id=str(uuid.uuid4()),
name=experiment_config.name,
description=experiment_config.description,
variants=experiment_config.variants,
traffic_allocation=experiment_config.traffic_allocation,
target_metrics=experiment_config.target_metrics,
start_time=experiment_config.start_time,
end_time=experiment_config.end_time,
status='active'
)
# Persist the experiment config
await self.experiment_manager.save_experiment(experiment)
# Start traffic splitting
await self.traffic_splitter.setup_traffic_split(experiment)
return experiment.id
async def assign_user_to_variant(self, user_id: str,
experiment_id: str) -> str:
"""Assign the user to an experiment variant."""
experiment = await self.experiment_manager.get_experiment(experiment_id)
if not experiment or experiment.status != 'active':
return 'control'
# Consistent-hash assignment keyed on user ID
user_hash = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
hash_value = int(user_hash[:8], 16) / (2**32)
# Pick the variant by cumulative allocation
cumulative_allocation = 0
for variant, allocation in experiment.traffic_allocation.items():
cumulative_allocation += allocation
if hash_value <= cumulative_allocation:
return variant
return 'control'  # fall back to the control group
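Two properties matter for the hash-based split above: the same user always lands in the same variant, and traffic divides roughly per the allocation. A standalone version makes both easy to check:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str, allocation: dict) -> str:
    """Same consistent-hash split as above, as a free function."""
    digest = hashlib.md5(f"{user_id}:{experiment_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 2**32  # roughly uniform in [0, 1)
    cumulative = 0.0
    for variant, share in allocation.items():
        cumulative += share
        if bucket <= cumulative:
            return variant
    return "control"
```

Hashing `user_id:experiment_id` (rather than the user ID alone) decorrelates assignments across concurrent experiments, so one experiment's split does not bias another's.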
class SearchExperimentVariants:
"""Search-experiment variant definitions."""
@staticmethod
def ranking_algorithm_test():
"""Ranking-algorithm test."""
return {
'control': {
'ranking_model': 'current_model',
'personalization_weight': 0.3
},
'treatment_a': {
'ranking_model': 'new_ltr_model',
'personalization_weight': 0.3
},
'treatment_b': {
'ranking_model': 'current_model',
'personalization_weight': 0.5
}
}
@staticmethod
def query_understanding_test():
"""Query-understanding test."""
return {
'control': {
'query_expansion': False,
'intent_classification': 'rule_based'
},
'treatment': {
'query_expansion': True,
'intent_classification': 'ml_based'
}
}
@staticmethod
def personalization_test():
"""Personalization test."""
return {
'control': {
'personalization_enabled': False
},
'treatment_light': {
'personalization_enabled': True,
'personalization_strength': 0.3
},
'treatment_strong': {
'personalization_enabled': True,
'personalization_strength': 0.7
}
}
class ExperimentMetricsCollector:
def __init__(self):
self.kafka_producer = KafkaProducer(topic='experiment_events')
self.metrics_aggregator = MetricsAggregator()
async def track_search_event(self, event: SearchEvent,
experiment_id: str, variant: str):
"""Track a search event."""
experiment_event = {
'experiment_id': experiment_id,
'variant': variant,
'user_id': event.user_id,
'query': event.query,
'results_count': event.results_count,
'click_positions': event.click_positions,
'dwell_times': event.dwell_times,
'timestamp': event.timestamp.isoformat()
}
# Publish to Kafka for real-time processing
await self.kafka_producer.send('experiment_events', experiment_event)
# Update real-time metrics
await self.metrics_aggregator.update_experiment_metrics(
experiment_id, variant, event
)
async def calculate_experiment_metrics(self, experiment_id: str,
time_range: TimeRange) -> Dict:
"""Compute experiment metrics."""
metrics = {}
experiment = await self.experiment_manager.get_experiment(experiment_id)
for variant in experiment.variants:
variant_events = await self._get_variant_events(
experiment_id, variant, time_range
)
variant_metrics = {
'total_searches': len(variant_events),
'avg_results_per_search': np.mean([e.results_count for e in variant_events]),
'click_through_rate': self._calculate_ctr(variant_events),
'avg_clicks_per_search': self._calculate_avg_clicks(variant_events),
'avg_dwell_time': self._calculate_avg_dwell_time(variant_events),
'zero_result_rate': self._calculate_zero_result_rate(variant_events)
}
metrics[variant] = variant_metrics
return metrics
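Once per-variant metrics are in hand, the `StatisticalAnalyzer` mentioned earlier decides whether the difference is real. For CTR-style proportions, the usual first pass is a two-proportion z-test (|z| > 1.96 roughly corresponds to p < 0.05, two-sided); a sketch:

```python
import math

def two_proportion_z(clicks_a: int, n_a: int, clicks_b: int, n_b: int) -> float:
    """z statistic for the CTR difference between control (a) and treatment (b)."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled proportion under the null hypothesis of equal CTR
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

In practice this gate should be combined with a pre-registered sample size, since peeking at the z statistic mid-experiment inflates false positives.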
7. Performance Optimization
7.1 Caching Strategy
class SearchCacheManager:
def __init__(self):
# Multi-level cache hierarchy
self.query_cache = QueryResultCache(ttl=3600)  # query-result cache
self.user_cache = UserProfileCache(ttl=1800)  # user-profile cache
self.model_cache = ModelCache(ttl=7200)  # model cache
self.feature_cache = FeatureCache(ttl=900)  # feature cache
async def get_cached_search_results(self, cache_key: str) -> Optional[SearchResults]:
"""Fetch cached search results."""
return await self.query_cache.get(cache_key)
async def cache_search_results(self, cache_key: str, results: SearchResults):
"""Cache search results."""
# Cache only high-quality results
if self._should_cache_results(results):
await self.query_cache.set(cache_key, results)
def _should_cache_results(self, results: SearchResults) -> bool:
"""Decide whether results are cacheable."""
return (
len(results.results) > 0 and
results.query_understanding.confidence > 0.8 and
not results.personalization_applied  # personalized results are not cached
)
def generate_cache_key(self, query: str, filters: Dict,
user_context: Optional[UserContext] = None) -> str:
"""Build a cache key."""
key_components = [query, str(sorted(filters.items()))]
# Non-personalized queries can share a cache entry
if not user_context or not user_context.user_id:
cache_key = hashlib.md5('|'.join(key_components).encode()).hexdigest()
else:
# Personalized keys include the user ID
key_components.append(user_context.user_id)
cache_key = hashlib.md5('|'.join(key_components).encode()).hexdigest()
return f"search:{cache_key}"
class QueryOptimizer:
def __init__(self):
self.query_analyzer = QueryAnalyzer()
self.index_optimizer = IndexOptimizer()
async def optimize_elasticsearch_query(self, es_query: Dict,
query_stats: QueryStats) -> Dict:
"""Optimize an Elasticsearch query."""
optimized_query = es_query.copy()
# Optimize from query statistics
if query_stats.avg_result_count < 10:
# Too few results: relax matching
optimized_query = self._relax_matching_conditions(optimized_query)
elif query_stats.avg_result_count > 10000:
# Too many results: add filters
optimized_query = self._add_filtering_conditions(optimized_query)
# Dynamically adjust field weights
optimized_query = self._adjust_field_weights(optimized_query, query_stats)
# Optimize aggregations
if 'aggs' in optimized_query:
optimized_query['aggs'] = self._optimize_aggregations(
optimized_query['aggs']
)
return optimized_query
def _relax_matching_conditions(self, query: Dict) -> Dict:
"""放宽匹配条件"""
if 'query' in query and 'bool' in query['query']:
bool_query = query['query']['bool']
# 将must条件改为should
if 'must' in bool_query and len(bool_query['must']) > 1:
should_conditions = bool_query.get('should', [])
should_conditions.extend(bool_query['must'][1:])
bool_query['should'] = should_conditions
bool_query['must'] = bool_query['must'][:1]
bool_query['minimum_should_match'] = 1
return query
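Applied to a concrete bool query, the relaxation step keeps only the first `must` clause mandatory and turns the rest into optional `should` boosts. A standalone, runnable sketch of that transformation (the query shape and field names are made up for illustration):

```python
import copy

def relax_matching_conditions(query: dict) -> dict:
    """Demote all but the first must clause to should, trading precision for recall."""
    query = copy.deepcopy(query)  # leave the caller's query untouched
    bool_query = query.get("query", {}).get("bool")
    if bool_query and len(bool_query.get("must", [])) > 1:
        should = bool_query.get("should", [])
        should.extend(bool_query["must"][1:])
        bool_query["should"] = should
        bool_query["must"] = bool_query["must"][:1]
        bool_query["minimum_should_match"] = 1  # at least one should clause must hit
    return query

strict = {"query": {"bool": {"must": [
    {"match": {"title": "wireless headphones"}},
    {"match": {"brand": "acme"}},
    {"range": {"price": {"lte": 100}}},
]}}}

relaxed = relax_matching_conditions(strict)
# Only the title match stays mandatory; brand and price become ranking boosts,
# so documents matching just the title can now be returned.
print(relaxed["query"]["bool"]["must"])
print(relaxed["query"]["bool"]["should"])
```

Relaxed-in documents still score lower than full matches, since `should` clauses contribute to the score without gating retrieval.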
import asyncio
from typing import Dict

class PerformanceMonitor:
    def __init__(self):
        self.metrics_collector = MetricsCollector()
        self.alert_manager = AlertManager()

    async def monitor_search_performance(self):
        """Monitor search performance in a loop."""
        while True:
            # Collect performance metrics
            metrics = await self.metrics_collector.collect_search_metrics()
            # Check against thresholds
            await self._check_performance_thresholds(metrics)
            # Generate and apply automatic optimization suggestions
            optimization_suggestions = await self._generate_optimization_suggestions(metrics)
            if optimization_suggestions:
                await self._apply_auto_optimizations(optimization_suggestions)
            await asyncio.sleep(300)  # check every 5 minutes

    async def _check_performance_thresholds(self, metrics: Dict):
        """Alert when metrics cross their thresholds."""
        if metrics['avg_search_latency'] > 500:  # 500 ms
            await self.alert_manager.send_alert(
                'HIGH_SEARCH_LATENCY',
                f"Average search latency: {metrics['avg_search_latency']}ms"
            )
        if metrics['cache_hit_rate'] < 0.6:  # 60%
            await self.alert_manager.send_alert(
                'LOW_CACHE_HIT_RATE',
                f"Cache hit rate: {metrics['cache_hit_rate']:.2%}"
            )
        if metrics['error_rate'] > 0.01:  # 1%
            await self.alert_manager.send_alert(
                'HIGH_ERROR_RATE',
                f"Search error rate: {metrics['error_rate']:.2%}"
            )
7.2 Distributed Architecture Optimization
import asyncio
import logging

logger = logging.getLogger(__name__)

class DistributedSearchCoordinator:
    def __init__(self):
        self.shard_manager = ShardManager()
        self.load_balancer = SearchLoadBalancer()
        self.result_merger = ResultMerger()

    async def distributed_search(self, query: SearchQuery) -> SearchResults:
        """Coordinate a search across shards."""
        # Determine which shards need to be searched
        target_shards = await self.shard_manager.get_target_shards(query)
        # Search all shards in parallel
        shard_tasks = [self._search_shard(shard, query) for shard in target_shards]
        shard_results = await asyncio.gather(*shard_tasks, return_exceptions=True)
        # Drop failed shards, keep valid results
        valid_results = []
        for shard, result in zip(target_shards, shard_results):
            if isinstance(result, Exception):
                logger.error(f"Shard {shard} search failed: {result}")
            else:
                valid_results.append(result)
        # Merge shard results into the final ranked list
        merged_results = await self.result_merger.merge_shard_results(
            valid_results, query
        )
        return merged_results

    async def _search_shard(self, shard: ShardInfo, query: SearchQuery) -> ShardSearchResult:
        """Search a single shard, failing over to a backup replica if needed."""
        # Pick the best replica for this shard
        replica = await self.load_balancer.select_replica(shard)
        try:
            result = await replica.search(query)
            return ShardSearchResult(
                shard_id=shard.id,
                results=result.results,
                total_hits=result.total_hits,
                search_time=result.search_time
            )
        except Exception:
            # Fail over to another replica
            backup_replica = await self.load_balancer.get_backup_replica(shard, replica)
            if backup_replica is None:
                raise
            result = await backup_replica.search(query)
            return ShardSearchResult(
                shard_id=shard.id,
                results=result.results,
                total_hits=result.total_hits,
                search_time=result.search_time,
                used_backup=True
            )
from typing import List

class ResultMerger:
    def __init__(self):
        self.score_normalizer = ScoreNormalizer()
        self.diversity_optimizer = DiversityOptimizer()

    async def merge_shard_results(self, shard_results: List[ShardSearchResult],
                                  query: SearchQuery) -> SearchResults:
        """Merge per-shard results into one globally ranked list."""
        all_results = []
        total_hits = 0
        for shard_result in shard_results:
            # Normalize scores so they are comparable across shards
            normalized_results = self.score_normalizer.normalize_scores(
                shard_result.results, shard_result.shard_id
            )
            all_results.extend(normalized_results)
            total_hits += shard_result.total_hits
        # Global sort on the normalized score
        all_results.sort(key=lambda x: x.final_score, reverse=True)
        # Deduplicate by URL
        deduplicated_results = self._deduplicate_results(all_results)
        # Optional diversity optimization
        if query.enable_diversity:
            final_results = await self.diversity_optimizer.optimize(
                deduplicated_results, query
            )
        else:
            final_results = deduplicated_results
        # Pagination
        paginated_results = final_results[query.offset:query.offset + query.limit]
        return SearchResults(
            results=paginated_results,
            total_count=total_hits,
            # Latency is bounded by the slowest shard
            search_time_ms=max(sr.search_time for sr in shard_results)
        )

    def _deduplicate_results(self, results: List[SearchResult]) -> List[SearchResult]:
        """Keep only the first (highest-scoring) occurrence of each URL."""
        seen_urls = set()
        deduplicated = []
        for result in results:
            if result.url not in seen_urls:
                seen_urls.add(result.url)
                deduplicated.append(result)
        return deduplicated
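The merge path can be illustrated end to end with plain tuples: per-shard min-max normalization, a global sort, then URL dedup. The `(url, score)` pairs and the min-max scheme below are illustrative assumptions; the real `ScoreNormalizer` may use a different normalization.

```python
def normalize(results):
    """Min-max normalize raw scores within one shard to [0, 1]."""
    scores = [s for _, s in results]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # avoid division by zero for uniform scores
    return [(url, (s - lo) / span) for url, s in results]

shard_a = [("url1", 12.0), ("url2", 8.0)]   # shard-local score scale
shard_b = [("url3", 0.9), ("url1", 0.3)]    # different scale, overlapping doc

merged = normalize(shard_a) + normalize(shard_b)
merged.sort(key=lambda x: x[1], reverse=True)  # global sort on normalized score

seen, deduped = set(), []
for url, score in merged:                      # keep first (best) copy per URL
    if url not in seen:
        seen.add(url)
        deduped.append((url, score))

print(deduped)
```

Without the per-shard normalization step, shard_a's raw BM25-style scores (around 10) would always dominate shard_b's (below 1), regardless of actual relevance.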
8. Monitoring and Analytics
8.1 Search Quality Monitoring
import asyncio
from typing import Dict

class SearchQualityMonitor:
    def __init__(self):
        self.metrics_collector = SearchMetricsCollector()
        self.quality_analyzer = QualityAnalyzer()
        self.alert_manager = AlertManager()

    async def monitor_search_quality(self):
        """Monitor search quality in a loop."""
        while True:
            # Collect search-quality metrics
            quality_metrics = await self._collect_quality_metrics()
            # Analyze quality trends
            quality_trends = await self.quality_analyzer.analyze_trends(quality_metrics)
            # Check against quality thresholds
            await self._check_quality_thresholds(quality_metrics, quality_trends)
            # Produce a quality report
            await self._generate_quality_report(quality_metrics, quality_trends)
            await asyncio.sleep(3600)  # check hourly

    async def _collect_quality_metrics(self) -> Dict:
        """Collect search-quality metrics."""
        metrics = {}
        # Click-through metrics
        metrics['overall_ctr'] = await self.metrics_collector.get_overall_ctr()
        metrics['ctr_by_position'] = await self.metrics_collector.get_ctr_by_position()
        # Zero-result rate
        metrics['zero_result_rate'] = await self.metrics_collector.get_zero_result_rate()
        # User-satisfaction metrics
        metrics['avg_dwell_time'] = await self.metrics_collector.get_avg_dwell_time()
        metrics['bounce_rate'] = await self.metrics_collector.get_bounce_rate()
        # Query-understanding accuracy
        metrics['query_understanding_accuracy'] = await self._measure_query_understanding_accuracy()
        # Personalization effectiveness
        metrics['personalization_lift'] = await self._measure_personalization_lift()
        return metrics

    async def _measure_personalization_lift(self) -> float:
        """Measure the relative CTR lift from personalization."""
        # Compare CTR of personalized vs non-personalized results
        personalized_ctr = await self.metrics_collector.get_ctr_for_personalized_users()
        non_personalized_ctr = await self.metrics_collector.get_ctr_for_non_personalized_users()
        if non_personalized_ctr > 0:
            return (personalized_ctr - non_personalized_ctr) / non_personalized_ctr
        return 0.0
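The lift metric is simply the relative CTR improvement. A tiny sketch with assumed CTR values (the 4.6% and 4.0% figures are illustrative, not measurements):

```python
def personalization_lift(personalized_ctr: float, baseline_ctr: float) -> float:
    """Relative CTR improvement of personalized over non-personalized results."""
    if baseline_ctr <= 0:
        return 0.0  # no baseline traffic yet; avoid division by zero
    return (personalized_ctr - baseline_ctr) / baseline_ctr

# Example: 4.6% CTR with personalization vs a 4.0% baseline gives a 15% lift,
# right at the >15% personalization target from the requirements section.
print(f"{personalization_lift(0.046, 0.040):.1%}")
```

In practice the two CTRs should come from a randomized holdout (the A/B framework from the requirements), not from self-selected user groups, or the comparison is confounded.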
class SearchAnalytics:
    def __init__(self):
        self.query_analyzer = QueryAnalyzer()
        self.user_behavior_analyzer = UserBehaviorAnalyzer()
        self.content_analyzer = ContentAnalyzer()

    async def generate_search_insights(self, time_range: TimeRange) -> SearchInsights:
        """Generate a search-insights report."""
        insights = SearchInsights()
        # Popular-query analysis
        insights.top_queries = await self.query_analyzer.get_top_queries(time_range)
        insights.trending_queries = await self.query_analyzer.get_trending_queries(time_range)
        insights.failed_queries = await self.query_analyzer.get_failed_queries(time_range)
        # User-behavior analysis
        insights.user_behavior = await self.user_behavior_analyzer.analyze_behavior(time_range)
        insights.search_patterns = await self.user_behavior_analyzer.identify_patterns(time_range)
        # Content analysis
        insights.popular_content = await self.content_analyzer.get_popular_content(time_range)
        insights.content_gaps = await self.content_analyzer.identify_content_gaps(time_range)
        # Performance analysis
        insights.performance_metrics = await self._analyze_performance(time_range)
        return insights

    async def _analyze_performance(self, time_range: TimeRange) -> Dict:
        """Analyze search performance over the time range."""
        return {
            'avg_response_time': await self._get_avg_response_time(time_range),
            'p95_response_time': await self._get_p95_response_time(time_range),
            'error_rate': await self._get_error_rate(time_range),
            'cache_hit_rate': await self._get_cache_hit_rate(time_range),
            'throughput': await self._get_throughput(time_range)
        }
9. Summary
Designing a personalized search system hinges on the following:
- Query understanding: accurately parse user intent and semantics, supporting multiple query types
- Personalization algorithms: rank results using user profiles and behavioral history
- Real-time learning: update models and user profiles from user feedback as it arrives
- Cold start: handle new users and new content gracefully
- Performance optimization: meet latency targets through caching and a distributed architecture
- Quality monitoring: a complete pipeline for monitoring and analyzing search quality
Together, these deliver a highly personalized, accurate, and fast search experience that keeps improving from feedback.
🎯 Scenario
You pick up your phone and fire off a search. Behind this seemingly simple action, the system faces three core challenges:
- Challenge 1: concurrency — how to keep latency low at 100K+ QPS?
- Challenge 2: availability — how to keep serving through node failures?
- Challenge 3: consistency — how to keep data correct in a distributed environment?
📈 Capacity Estimation
Assume 10 million DAU with 50 requests per user per day.
| Metric | Value |
|---|---|
| Model size | ~10 GB |
| Inference latency | < 50 ms |
| Inference QPS | ~5,000/s |
| Training data volume | ~1 TB |
| GPU cluster | 8-64 cards |
| Feature dimensions | 1,000+ |
| Model refresh cadence | daily/hourly |
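The assumption above (10M DAU × 50 requests/day) can be turned into a quick back-of-the-envelope check; the 2.5× peak-to-average ratio is an assumption, not a figure from the text:

```python
dau = 10_000_000           # daily active users
req_per_user = 50          # requests per user per day
seconds_per_day = 86_400

avg_qps = dau * req_per_user / seconds_per_day
peak_qps = avg_qps * 2.5   # assumed peak-to-average ratio

print(f"average QPS ~ {avg_qps:,.0f}")   # ~5,787 — consistent with the ~5,000/s row above
print(f"peak QPS    ~ {peak_qps:,.0f}")
```

The peak figure, not the average, is what the serving tier must be provisioned for.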
❓ Frequently Asked Interview Questions
Q1: What are the core design principles of a personalized search system?
See the architecture section in the main text. The core principles are high availability (automatic failure recovery), high performance (low latency, high throughput), scalability (horizontal scaling), and consistency (data correctness guarantees). In an interview, ground each principle in a concrete scenario.
Q2: What are the main challenges at large scale?
1) Performance bottlenecks: a single node cannot absorb growing data volume and traffic; 2) Consistency: guaranteeing data consistency in a distributed environment; 3) Failure recovery: automatic failover and data recovery when nodes die; 4) Operational complexity: cluster management, monitoring, and upgrades.
Q3: How do you keep the system highly available?
1) Replication (at least 3 replicas); 2) Automatic failure detection and failover (heartbeats + leader election); 3) Data persistence and backups; 4) Rate limiting and graceful degradation (to prevent cascading failures); 5) Multi-datacenter / active-active deployment.
Q4: What are the key performance-optimization levers?
1) Caching (avoid repeated computation and IO); 2) Async processing (move non-critical work off the hot path); 3) Batching (fewer network round trips); 4) Sharding (parallel processing); 5) Connection pooling.
Q5: How does this design compare with alternatives?
See the comparison table. Selection criteria include team expertise, data scale, latency budget, consistency needs, and operational cost. There is no silver bullet; weigh the trade-offs against the business scenario.
| Option | Complexity | Cost | Best fit |
|---|---|---|---|
| Option 1 | Simple implementation | Low | Small scale |
| Option 2 | Medium complexity | Medium | Medium scale |
| Option 3 | High complexity ⭐ recommended | High | Large-scale production |
✅ Architecture Design Checklist
| Item | Status |
|---|---|
| Caching strategy | ✅ |
| Distributed architecture | ✅ |
| Data consistency | ✅ |
| Monitoring & alerting | ✅ |
| High-availability design | ✅ |
| Performance optimization | ✅ |
| Horizontal scaling | ✅ |
🚀 Architecture Evolution Path
Stage 1: single-node MVP (< 100K users)
- Monolith + single database; validate core features quickly
- Fits: early-stage product, fast iteration
Stage 2: basic distributed (100K → 1M users)
- Horizontally scaled app tier + primary/replica database + Redis cache
- Introduce a message queue to decouple async tasks
Stage 3: production-grade high availability (> 1M users)
- Microservice split + database sharding + multi-datacenter deployment
- Full-trace monitoring + automated operations + geo-redundant disaster recovery