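This is my 8th note for the note-writing campaign of the "3rd Youth Training Camp - Backend Track".
TODO: relevance debugging, related-search recommendations, and notes on the web and frontend parts
Project description_20220611
Storage
Data can be stored across multiple shards. The diagram below uses shard=2 as an example.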
flowchart
subgraph Shard2
workerchan2 --doc--> id2[addDocument]
id2 --> db5[(DocStorage)] & db4[(PositiveIndex)] & db6[(InvertedIndex)]
end
subgraph Shard1
workerchan1 --doc--> id1[addDocument]
id1 --> db1[(DocStorage)] & db2[(PositiveIndex)] & db3[(InvertedIndex)]
end
engine -.- workerchan1 & workerchan2
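The sharded write path in the diagram can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the `Doc`, `Shard`, and `Engine` types, the `id % shardCount` routing rule, and the in-memory map standing in for DocStorage are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// Doc is a hypothetical document record; the field names are assumptions.
type Doc struct {
	ID   uint32
	Text string
	URL  string
}

// Shard owns its own storage and receives documents over its own channel.
type Shard struct {
	docs map[uint32]Doc // stands in for DocStorage
	ch   chan Doc       // stands in for workerchan1 / workerchan2
}

// Engine fans documents out to the shard workers.
type Engine struct {
	shards []*Shard
	wg     sync.WaitGroup
}

// NewEngine starts one worker goroutine per shard.
func NewEngine(n int) *Engine {
	e := &Engine{}
	for i := 0; i < n; i++ {
		s := &Shard{docs: make(map[uint32]Doc), ch: make(chan Doc, 16)}
		e.shards = append(e.shards, s)
		e.wg.Add(1)
		go func() {
			defer e.wg.Done()
			for d := range s.ch {
				// addDocument: only this goroutine writes to s.docs.
				s.docs[d.ID] = d
			}
		}()
	}
	return e
}

// Add routes a document to a shard. Routing by id modulo the shard count
// is an assumption made for this sketch.
func (e *Engine) Add(d Doc) {
	e.shards[int(d.ID)%len(e.shards)].ch <- d
}

// Close stops the workers and waits until all queued documents are stored.
func (e *Engine) Close() {
	for _, s := range e.shards {
		close(s.ch)
	}
	e.wg.Wait()
}

func main() {
	e := NewEngine(2)
	for id := uint32(1); id <= 4; id++ {
		e.Add(Doc{ID: id, Text: "hello", URL: "https://example.com"})
	}
	e.Close()
	for i, s := range e.shards {
		fmt.Printf("shard %d holds %d docs\n", i, len(s.docs))
	}
}
```

Giving each shard a single writer goroutine means the per-shard storage needs no lock in this sketch; the channel serializes the writes.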
Indexing
DocStorage stores the original document data: key=id, val={text,url,···}
PositiveIndex stores the forward index: key=id, val={word1,word2,···}
InvertedIndex stores the inverted index: key=word, val=map<id,frequency>
graph TD
doc -.- text & id & url & etc.
text --Tokenizer--> map["map<word,frequency>"]
doc --> db1[(DocStorage)]
id & map--> db2[(PositiveIndex)]
map & id --> db3[(InvertedIndex)]
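To make the three stores concrete, here is a small in-memory sketch of the indexing step shown in the diagram. The `Index` type, the `AddDocument` method, and the whitespace `tokenize` function are assumptions for illustration only; the tokenizer the project actually uses is not described here.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Doc is a hypothetical document record.
type Doc struct {
	ID   uint32
	Text string
	URL  string
}

// Index holds in-memory stand-ins for the three stores.
type Index struct {
	DocStorage    map[uint32]Doc            // key=id, val=document
	PositiveIndex map[uint32][]string       // key=id, val=words
	InvertedIndex map[string]map[uint32]int // key=word, val=map<id,frequency>
}

func NewIndex() *Index {
	return &Index{
		DocStorage:    make(map[uint32]Doc),
		PositiveIndex: make(map[uint32][]string),
		InvertedIndex: make(map[string]map[uint32]int),
	}
}

// tokenize is a placeholder tokenizer: it lowercases the text and splits
// on whitespace, returning map<word,frequency>.
func tokenize(text string) map[string]int {
	freq := make(map[string]int)
	for _, w := range strings.Fields(strings.ToLower(text)) {
		freq[w]++
	}
	return freq
}

// AddDocument writes one document into all three stores.
func (ix *Index) AddDocument(d Doc) {
	freq := tokenize(d.Text)
	ix.DocStorage[d.ID] = d

	words := make([]string, 0, len(freq))
	for w, n := range freq {
		words = append(words, w)
		if ix.InvertedIndex[w] == nil {
			ix.InvertedIndex[w] = make(map[uint32]int)
		}
		ix.InvertedIndex[w][d.ID] = n
	}
	sort.Strings(words) // stable order for the forward index
	ix.PositiveIndex[d.ID] = words
}

func main() {
	ix := NewIndex()
	ix.AddDocument(Doc{ID: 1, Text: "go search engine go", URL: "https://example.com/1"})
	ix.AddDocument(Doc{ID: 2, Text: "search index", URL: "https://example.com/2"})

	fmt.Println(ix.PositiveIndex[1])        // [engine go search]
	fmt.Println(ix.InvertedIndex["go"])     // map[1:2]
	fmt.Println(ix.InvertedIndex["search"]) // map[1:1 2:1]
}
```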
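Search
Multithreaded search
Each word is searched in its own thread. The results are gathered in the sort module and deduplicated to produce the search results.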
flowchart
searchRequest --> parser
parser --> context1["query"] & context2["block"]
context1 & context2 --> start
subgraph multiSearch
start-->Tokenizer--> word1 & word2["···"] & wordn --> db3[(InvertedIndex)] --> map1["map<id,frequency>"] & map2["map<id,frequency>"] & mapn["map<id,frequency>"]
map1 --> score1["score(id1,word1)"] & score2["score(id2,word1)"]
map2 --> score3["score(id1,···)"] & score4["score(id2,···)"]
mapn --> score5["score(id1,wordn)"] & score6["score(id3,wordn)"]
score1 & score2 & score3 & score4 & score5 & score6 --> sort --merge--> endsearch[end]
end
endsearch--> blockResult & queryResult --difference--> result
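The search flow above could be sketched along these lines. Several parts are assumptions rather than the project's real behavior: the score is just the raw word frequency (the actual scoring function is not given), the tokenizer is a whitespace split, the final ranking is applied after the difference step for brevity, and the names `multiSearch`, `Search`, and `Hit` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// InvertedIndex maps word -> (doc id -> frequency).
type InvertedIndex map[string]map[uint32]int

// Hit is one ranked search result.
type Hit struct {
	ID    uint32
	Score float64
}

// multiSearch looks up every distinct word in its own goroutine and merges
// the per-word scores into one score per document id (the dedup step).
func multiSearch(ix InvertedIndex, text string) map[uint32]float64 {
	var (
		mu sync.Mutex
		wg sync.WaitGroup
	)
	merged := make(map[uint32]float64)
	seen := make(map[string]bool)
	for _, w := range strings.Fields(strings.ToLower(text)) { // placeholder tokenizer
		if seen[w] {
			continue
		}
		seen[w] = true
		wg.Add(1)
		go func(w string) {
			defer wg.Done()
			for id, freq := range ix[w] { // the index is only read here
				mu.Lock()
				merged[id] += float64(freq) // placeholder score: raw frequency
				mu.Unlock()
			}
		}(w)
	}
	wg.Wait()
	return merged
}

// Search returns the documents that match query but not block,
// ordered by descending score, then by ascending id.
func Search(ix InvertedIndex, query, block string) []Hit {
	queryResult := multiSearch(ix, query)
	blockResult := multiSearch(ix, block)

	hits := make([]Hit, 0, len(queryResult))
	for id, score := range queryResult {
		if _, blocked := blockResult[id]; !blocked {
			hits = append(hits, Hit{ID: id, Score: score})
		}
	}
	sort.Slice(hits, func(i, j int) bool {
		if hits[i].Score != hits[j].Score {
			return hits[i].Score > hits[j].Score
		}
		return hits[i].ID < hits[j].ID
	})
	return hits
}

func main() {
	ix := InvertedIndex{
		"go":     {1: 2, 3: 1},
		"search": {1: 1, 2: 1},
		"spam":   {2: 5},
	}
	fmt.Println(Search(ix, "go search", "spam")) // [{1 3} {3 1}]
}
```

In this example document 2 matches "search" but is dropped because it also matches the blocked word "spam", which is how the difference between queryResult and blockResult is interpreted in this sketch.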