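ArangoSearch is ArangoDB's full-text search engine. It provides information retrieval features that are natively integrated into the ArangoDB query language (AQL) and work with all of its data models.
ArangoSearch introduces the concept of Views, which can be thought of as virtual collections. Each View represents an inverted index that provides fast full-text search over one or more linked collections, and it stores the search configuration, such as which attributes to index.
A View can cover several or even all attributes of the documents in the linked collections. Search results can be sorted by their relevance ranking, using popular scoring algorithms, so that the best matches are returned first.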
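To make the idea of an inverted index concrete, here is a minimal pure-Python sketch. It is only an illustration of the data structure, not how ArangoSearch is implemented; the function name build_inverted_index and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to the set of document keys that contain it."""
    index = defaultdict(set)
    for key, text in docs.items():
        for word in text.lower().split():
            index[word].add(key)
    return index


# Hypothetical sample documents (not taken from the IMDB dataset).
docs = {
    "m1": "a space adventure in a distant galaxy",
    "m2": "a courtroom drama",
    "m3": "a galaxy of stars",
}

index = build_inverted_index(docs)
print(sorted(index["galaxy"]))  # ['m1', 'm3']
print(sorted(index["drama"]))   # ['m2']
```

Looking up a word is then a single dictionary access instead of a scan over every document, which is the property a View relies on.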
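Configurable Analyzers handle the text processing, for example tokenization, language-specific stemming, case conversion, and removing diacritics (accents) from characters. Analyzers can be used on their own or combined with Views for complex searches.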
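The following sketch imitates a few of these processing steps in plain Python: case conversion, diacritic removal, and tokenization. It is a toy stand-in that assumes simple Latin-script text; it does no stemming and is not the algorithm used by ArangoDB's built-in Analyzers. The name toy_analyze is hypothetical.

```python
import re
import unicodedata


def toy_analyze(text):
    """Lowercase, strip diacritics, and split on non-alphanumeric characters."""
    lowered = text.lower()
    # NFKD splits an accented letter into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", lowered)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return [tok for tok in re.split(r"[^a-z0-9]+", no_accents) if tok]


print(toy_analyze("À San Remo"))          # ['a', 'san', 'remo']
print(toy_analyze("Frühlings Erwachen"))  # ['fruhlings', 'erwachen']
```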
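Data preparation
Before looking at ArangoSearch in detail, we first create the database and load the data we will use.
The IMDB dataset
This experiment uses the IMDB dataset, which contains information about actors, directors, movies, and genres. We want to build a graph from it.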
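# Step 1: git clone the official tutorial repository; --single-branch and -b clone only the imdb_complete branch, into the imdb_complete directory under the current directory
git clone -b imdb_complete --single-branch https://github.com/arangodb/interactive_tutorials.git imdb_complete
# Sync the cloned dump data from imdb_complete/data/imdb_dump into the imdb_dump directory
rsync -av imdb_complete/data/imdb_dump/ ./imdb_dump/
In the end, the imdb_dump directory contains 8 files, the dump data used in this article (see the figure below). They come in pairs: for each collection, one file holds the collection structure and the other holds the documents.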
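A brief introduction to arangorestore
To restore data from dump files previously created with arangodump, ArangoDB provides the arangorestore tool.
The official documentation describes the arangorestore parameters in detail, as shown in the figure:
Since version 2.6, arangorestore has offered the --create-database parameter. When it is set to true, the target database is created if it does not exist yet.
In this example we use only a few of the parameters. By default, arangorestore connects to the _system database on the local endpoint (tcp://127.0.0.1:8529); --server.database <string> selects a different database.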
arangorestore -c none --create-database true --server.database imdb_newdb --input-directory "imdb_dump"
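After running the command, as the figure below shows, the imdb_newdb database we specified is created, then four collections are created in it and their data is restored.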
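If the restore step has to be scripted, the same command line can be assembled in Python. This is a small sketch that uses only the flags shown above; the helper name restore_command is hypothetical, and the call that would actually launch the tool is left as a comment so the snippet runs without ArangoDB installed.

```python
import shlex


def restore_command(database, input_dir, create_database=True):
    """Return the arangorestore argument list used in this walkthrough."""
    return [
        "arangorestore",
        "-c", "none",
        "--create-database", "true" if create_database else "false",
        "--server.database", database,
        "--input-directory", input_dir,
    ]


cmd = restore_command("imdb_newdb", "imdb_dump")
print(shlex.join(cmd))
# To actually run it (requires arangorestore on the PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Keeping the arguments in a list rather than one string avoids shell quoting problems when a directory name contains spaces.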
The collections are also visible in the web interface, where the icons show that imdb_edges and Ratings are the two edge collections.
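Creating a View
An ArangoSearch View holds references to documents stored in different collections. This makes complex federated searches possible, even over a graph that includes vertex and edge collections.
Earlier articles in this series used the pyArango Python driver. This article uses a different Python driver, python-arango. First, install it: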
pip3 install "python-arango>=5.0"
# A quick tour of python-arango
from arango import ArangoClient
# Initialize the ArangoDB client.
client = ArangoClient()
# Connect to "_system" database as root user.
# This returns an API wrapper for "_system" database.
sys_db = client.db('_system', username='root', password='')
# List all databases.
print(sys_db.databases())
# Output, which includes the newly created imdb_newdb
['_system', 'imdb_newdb', 'school']
# Connect to the 'imdb_newdb' database as user "root".
db = client.db('imdb_newdb', username='root', password='')
print(db)
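# Create the ArangoSearch View
db.create_arangosearch_view(name='v_imdb')
# Inspect the View we just created
# (db['v_imdb'] only builds a collection-style wrapper by name, as the output shows; db.view('v_imdb') would return the View's properties)
print(db['v_imdb'])
# Output: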
<StandardDatabase imdb_newdb>
<StandardCollection v_imdb>
At this point the View is still empty. We need to link it to a collection:
link = {
    "includeAllFields": True,
    "fields": {"description": {"analyzers": ["text_en"]}}
}
# Link the View to the imdb_vertices collection
db.update_arangosearch_view(
    name='v_imdb',
    properties={'links': {'imdb_vertices': link}}
)
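When several attributes need their own analyzers, a link definition of the same shape can be generated instead of written by hand. The sketch below is one possible helper; build_link is a hypothetical name, and the dictionary layout simply mirrors the link used in this article.

```python
def build_link(field_analyzers, include_all_fields=True):
    """Build a link definition that maps each attribute to its analyzers."""
    return {
        "includeAllFields": include_all_fields,
        "fields": {
            field: {"analyzers": list(analyzers)}
            for field, analyzers in field_analyzers.items()
        },
    }


link = build_link({"description": ["text_en"]})
properties = {"links": {"imdb_vertices": link}}
print(properties)
```

The resulting properties dictionary has the same structure as the one passed to update_arangosearch_view in this article.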
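Once update_arangosearch_view has run, the new View is visible in the web interface:
Let's look at what this code does behind the scenes. To populate the empty View, the specified analyzer (here "analyzers" : [ "text_en" ]) parses each input value and converts it into a series of smaller sub-values. Once populated, the View is ready to serve user queries, as the figure below shows:
A concrete example shows how the analyzer tokenizes and stems a piece of text (readers familiar with natural language processing will recognize these steps):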
cursor = db.aql.execute(
    'RETURN TOKENS("I like ArangoDB because it rocks!", "text_en")'
)
# Iterate over the result cursor
for doc in cursor:
    print(doc)
# Output: the analyzer lowercased every word, split the text into tokens, stemmed them, and dropped the punctuation
['i', 'like', 'arangodb', 'becaus', 'it', 'rock']
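With the View in place, we can run our first query and look for short drama movies:
# The imdb_vertices collection stores documents of type Person as well as documents of type Movie
# Here we search only for movies whose genre is Drama and whose runtime is between 10 and 50
cursor = db.aql.execute(
    """
    FOR d IN v_imdb
    SEARCH d.type == "Movie" AND d.genres == "['Drama']"
    AND d.runtime IN 10..50
    RETURN d.title
    """
)
for doc in cursor:
    print(doc)
# Output: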
Sunday in August
Antoine et Colette
Carne
Alias
À San Remo
Glastage
Zwischen Flieder wandern und singen
Breaking Glass
Pulsar
Dr. Jekyll and Mr. Hyde
Dr. Jekyll and Mr. Hyde
Edison's Frankenstein
Lücken im Gedankenstrom
Frühlings Erwachen - Eine Kindertragödie
Primavera
Wiatr
Rosemarie Nitribitt - Tod einer Edelhure
Wellcome
Land gewinnen
Room 10
Dreamcatcher
Rounds
Hotel Chevalier
Melissa
Space Riders
True
Kurz:Ivan
The Kolaborator
Silvester Home Run
Bis zur Unendlichkeit
The Wiggles: Wiggle Bay
Crin blanc: Le cheval sauvage
VeggieTales: An Easter Carol
Good Night
Another Lady Innocent
At this point you may ask: couldn't an AQL FILTER query return the same results?
cursor = db.aql.execute(
    """FOR d IN v_imdb
    FILTER d.type == "Movie" AND d.genres == "['Drama']" AND d.runtime IN 10..50
    RETURN d.title"""
)
# Iterate through the result cursor
for doc in cursor:
    print(doc)
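One way to compare the SEARCH and FILTER versions of this query is to time each of them over several runs. The helper below is a minimal sketch of such a measurement; average_ms is a hypothetical name, and a stand-in workload replaces the real query so that the snippet runs without a server.

```python
import time


def average_ms(run_query, repeats=20):
    """Run a zero-argument callable `repeats` times; return the mean wall time in ms."""
    start = time.perf_counter()
    for _ in range(repeats):
        run_query()
    return (time.perf_counter() - start) * 1000 / repeats


# Stand-in workload so the sketch runs without a database connection.
def fake_query():
    return sum(range(1000))


print(f"{average_ms(fake_query):.3f} ms per call")
```

With a live connection, run_query could be something like lambda: list(db.aql.execute(query)), which also consumes the cursor so that result transfer is included in the measurement.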
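Indeed, this returns the same results. However, a query with the SEARCH keyword is answered from the View's index, whereas FILTER has to post-process the entire result set, so the performance differs greatly. On my machine the SEARCH query averages 0.7 ms while the FILTER query takes 200 ms, a difference of more than two orders of magnitude. SEARCH queries can also do cooler things. In the following example, we retrieve the movies whose description field mentions "Star wars":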
cursor = db.aql.execute(
"""
FOR d IN v_imdb
SEARCH PHRASE(d.description, "Star wars", "text_en")
RETURN {title: d.title, description: d.description}
"""
)
for doc in cursor:
    print(doc)
# Output:
{'title': 'Star Wars: The Clone Wars', 'description': "Set between Episode II and III the Clone Wars is the first computer animated Star Wars film. Anakin and Obi Wan must find out who kidnapped Jabba the Hutts son and return him safely. The Seperatists will try anything to stop them and ruin any chance of a diplomatic agreement between the Hutt's and the Republic."}
{'title': 'Fanboys', 'description': '"Star Wars" fans travel to Skywalker Ranch to steal an early copy of "Episode I: The Phantom Menace".'}
{'title': 'Family Guy: Blue Harvest', 'description': 'With the Griffins stuck at home during a blackout, Peter begins to tell a story, which leads to a Star Wars flashback. Acting out scenes from Star Wars Episode IV: A New Hope.'}
{'title': 'Battle Beyond the Stars', 'description': "Roger Corman's post Star Wars take on The Seven Samurai."}
{'title': 'Gymkata', 'description': 'Johnathan Cabot is a champion gymnast. In the tiny, yet savage, country of Parmistan, there is a perfect spot for a "star wars" site. For the US to get this site, they must compete in the brutal "Game". The government calls on Cabot, the son of a former operative, to win the game. Cabot must combine his gymnastics skills of the west with fighting secrets of the east and form GYMKATA!'}
Proximity Search
A proximity search looks for two or more words that may have other words between them. In the following example we search for the words "in" and "galaxy" with exactly one other word between them:
# Execute the query
cursor = db.aql.execute(
    """FOR d IN v_imdb
    SEARCH PHRASE(d.description, "in", 1, "galaxy", "text_en")
    RETURN {title:d.title, description: d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
    print(doc)
# Output: 3 documents whose description contains "in" and "galaxy" with one other word in between
{'title': 'The Ice Pirates', 'description': 'The time is the distant future, where by far the most precious commodity in the galaxy is water. The last surviving water planet was somehow removed to the unreachable centre of the galaxy at the end of the galactic trade wars. The galaxy is ruled by an evil emperor (John Carradine) presiding over a trade oligarchy that controls all mining and sale of ice from asteroids and comets.'}
{'title': 'Alien Tracker', 'description': "In a galaxy far away, alien criminals organize a spectacular prison break. Pursued by their home planet's authorities, their leader Zin decides there's only one place to go: through a wormhole that allows instantaneous travel to Earth. Cole is the Alien Tracker who's in hot pursuit of the escaped convicts."}
{'title': 'Lost in Space', 'description': 'The prospects for continuing life on Earth in the year 2058 are grim. So the Robinsons are launched into space to colonize Alpha Prime, the only other inhabitable planet in the galaxy. But when a stowaway sabotages the mission, the Robinsons find themselves hurtling through uncharted space.'}
# With the gap changed to 2, the following document is returned, because it contains the phrase "in a distant galaxy"
{'title': 'Stargate: The Ark of Truth', 'description': "SG-1 searches for an ancient weapon which could help them defeat the Ori, and discover it may be in the Ori's own home galaxy. As the Ori prepare to send ships through to the Milky Way to attack Earth, SG-1 travels to the Ori galaxy aboard the Odyssey. The International Oversight committee have their own plans and SG-1 finds themselves in a distant galaxy fighting two powerful enemies."}
# With the gap changed to 3, this document is returned, because it contains "In travels that span galaxies"
{'title': 'Barbarella', 'description': '"Barbarella" tells the story of a female mercenary who roams across the universe in a distant future, undertaking missions that require her physical fearlessness, ingenuity and sensuality. In travels that span galaxies known and unknown, "Barbarella" will challenge tradition, startle the senses and take audiences on an epic adventure of discovery and wonder. '}
# With the gap changed to 5, no matching document is found
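The gap semantics seen in these results can be mimicked in a few lines of Python. This is only a sketch of the matching rule as it appears from the outputs above (the second word must come exactly gap tokens after the first); it uses a naive lowercase tokenizer without stemming, and phrase_with_gap is a hypothetical helper name.

```python
import re


def phrase_with_gap(text, first, gap, second):
    """True if `second` occurs exactly `gap` tokens after `first` in `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    offset = gap + 1  # distance between the positions of the two words
    return any(
        tokens[i] == first and tokens[i + offset] == second
        for i in range(len(tokens) - offset)
    )


text = "the most precious commodity in the galaxy is water"
print(phrase_with_gap(text, "in", 1, "galaxy"))                   # True
print(phrase_with_gap(text, "in", 2, "galaxy"))                   # False
print(phrase_with_gap("in a distant galaxy", "in", 2, "galaxy"))  # True
```

Because this toy version does not stem, it would not match "galaxies" against "galaxy" the way the text_en analyzer does in the Barbarella result.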
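Ranking and document relevance
Good, we can now find documents that contain a specific phrase. For a large document collection, though, we often want the results ranked by relevance. ArangoSearch supports two scoring schemes for this: Okapi BM25, via the BM25() function, and TF-IDF, via the TFIDF() function.
Let's look at an example that searches for movies matching the following keywords: "amazing, action, world, alien, sci-fi, science, documental, galaxy"
cursor = db.aql.execute(
    """FOR d IN v_imdb
    SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental galaxy', 'text_en'), 'text_en')
    SORT BM25(d) DESC
    LIMIT 10
    RETURN {title:d.title, description: d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
    print(doc)
# Output: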
{'title': 'Moon 44', 'description': 'In 2038, at a remote outpost on Moon 44, Galactic Mining Corp. agent Felix Stone (Michael Paré) embarks on a dangerous mission to stop the hijacking of precious natural resources needed on Earth. To do so, he must battle a familiar foe and an alien enemy. Malcolm McDowell, Lisa Eichhorn and Dean Devlin star in this sci-fi thriller from action director Roland Emmerich (Independence Day).'}
{'title': 'AVPR: Aliens vs. Predator - Requiem', 'description': "Prepare for more mayhem as warring aliens and predators return for Round 2 of their no-holds-barred slugfest. This time, the intergalactic creatures do battle in a small American town, throwing local residents into harm's way. To save the planet, the humans must oust both types of unwelcome guests. This sci-fi sequel features tons of spectacular action sequences full of nifty new gadgets and gooey monster gore."}
{'title': 'Push', 'description': 'The action packed sci-fi thriller involves a group of young American ex-pats with telekinetic and clairvoyant abilities, hiding from a clandestine U.S. government agency. They must utilize their different talents and band together for a final job enabling them to escape the agency forever.'}
{'title': 'Casshern', 'description': 'Live-action sci-fi movie based on a 1973 Japanese animé of the same name.'}
{'title': 'Starship Troopers 2: Hero of the Federation', 'description': "In the sequel to Paul Verhoeven's loved/reviled sci-fi film, a group of troopers taking refuge in an abandoned outpost after fighting alien bugs, failing to realize that more danger lays in wait."}
{'title': 'Dark Star', 'description': 'A low-budget, sci-fi satire that focuses on a group of scientists whose mission is to destroy unstable planets. 20 years into their mission, they have battle their alien mascot, that resembles a beach ball, as well as a "sensitive" and intelligent bombing device that starts to question the meaning of its existence.'}
{'title': 'Cesta do pravěku', 'description': 'Most classical sci-fi from K. Zeman. Four young boys visit a dinosaur exhibit at the New York city Museum of Natural History. They then row out onto Central Park Lake where they find a secret cave and paddle into, and go back-in-time into a wondrous prehistoric world filled with the very dinosaurs they had just seen.'}
{'title': 'Puzzlehead', 'description': 'In a post apocalyptic world where technology is outlawed, Walter, a reclusive scientist, secretly creates a self-aware android, "Puzzlehead". Jealously erupts when Puzzlehead wins the affection of Julia, the beautiful shopgirl that Walter has longed for. The resulting Sci-Fi love triangle is a Frankensteinian fable that traps all three in a web of deception and the ultimate betrayal.'}
{'title': "Logan's Run", 'description': 'An idyllic sci-fi future has one major drawback: life must end at 30.'}
{'title': 'Interstella 5555: The 5tory of the 5ecret 5tar 5ystem', 'description': 'A sci-fi japanimation House-musical movie collaboration between Daft Punk--and their music--and designer Leiji Matsumoto. During the recording of their DISCOVERY album and using the themes of sci-fi celebrity, decadence and space travel, Daft Punk--with help from Cedric Hervet--wrote the story and inspired seasoned Japanese animators to symbiotically create this stunning space musical.'}
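To give a feel for what a BM25 score measures, here is a small textbook-style implementation of Okapi BM25 over pre-tokenized documents. It is a sketch under stated assumptions: the sample documents are invented, k=1.2 and b=0.75 are commonly used parameter values, and the exact formula inside ArangoSearch may differ in its details.

```python
import math
from collections import Counter


def bm25_scores(docs, query_terms, k=1.2, b=0.75):
    """Score each tokenized document against the query terms with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each query term occurs.
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[t] + k * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[t] * (k + 1) / norm
        scores.append(score)
    return scores


# Hypothetical, already tokenized descriptions.
docs = [
    "a sci fi thriller about an alien".split(),
    "a drama about a family".split(),
    "alien action in a distant galaxy".split(),
]
scores = bm25_scores(docs, ["alien", "galaxy"])
for doc, score in zip(docs, scores):
    print(round(score, 3), " ".join(doc))
```

The third document matches both query terms, and the rarer term "galaxy" carries more weight, so it ranks first; the second document matches nothing and scores 0.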
Another highlight of ArangoSearch is the ability to fine-tune, at query time, the document scores computed by the relevance model. This is done with the BOOST function. In the following example, we give "galaxy" priority over the other keywords:
cursor = db.aql.execute(
    """FOR d IN v_imdb
    SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
    BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en')
    SORT BM25(d) DESC
    LIMIT 3
    RETURN {title:d.title, description: d.description}"""
)
# Iterate through the result cursor
for doc in cursor:
    print(doc)
# Output:
{'title': "The Hitchhiker's Guide to the Galaxy", 'description': 'Mere seconds before the Earth is to be demolished by an alien construction crew, Arthur Dent is swept off the planet by his friend Ford Prefect, a researcher penning a new edition of "The Hitchhiker\'s Guide to the Galaxy."'}
{'title': 'Alien Tracker', 'description': "In a galaxy far away, alien criminals organize a spectacular prison break. Pursued by their home planet's authorities, their leader Zin decides there's only one place to go: through a wormhole that allows instantaneous travel to Earth. Cole is the Alien Tracker who's in hot pursuit of the escaped convicts."}
{'title': 'The Ice Pirates', 'description': 'The time is the distant future, where by far the most precious commodity in the galaxy is water. The last surviving water planet was somehow removed to the unreachable centre of the galaxy at the end of the galactic trade wars. The galaxy is ruled by an evil emperor (John Carradine) presiding over a trade oligarchy that controls all mining and sale of ice from asteroids and comets.'}
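The effect of boosting on the ranking can be illustrated with a deliberately simplified model in which every matched keyword contributes 1 to the score and a boosted keyword contributes its boost factor instead. This is not how BM25 weights terms; the function name and the two sample documents are made up purely to show how a boost can reorder results.

```python
def toy_score(tokens, terms, boosts=None):
    """Sum one point per matched query term, multiplied by its boost (default 1)."""
    boosts = boosts or {}
    present = set(tokens)
    return sum(boosts.get(t, 1.0) for t in terms if t in present)


# Hypothetical tokenized descriptions.
docs = {
    "A": "alien action sci fi".split(),       # matches "alien" and "action"
    "B": "lost in a distant galaxy".split(),  # matches only "galaxy"
}
terms = ["alien", "action", "galaxy"]

plain = {key: toy_score(tokens, terms) for key, tokens in docs.items()}
boosted = {key: toy_score(tokens, terms, {"galaxy": 5}) for key, tokens in docs.items()}
print(plain)    # {'A': 2.0, 'B': 1.0}
print(boosted)  # {'A': 2.0, 'B': 5.0}
```

Without the boost, document A wins because it matches more keywords; with "galaxy" boosted by 5, document B moves to the top, which is the same kind of shift seen in the query results above.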
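When ArangoSearch meets graphs
As a multi-model database, one of ArangoDB's distinguishing features is the ability to combine data models within a single query; for example, ArangoSearch can be combined with a graph traversal. The IMDB dataset we restored is in fact a graph. In the following example, we look up the directors of the science-fiction movies found above:
# The movie search is followed by a graph traversal
cursor = db.aql.execute(
    """FOR d IN v_imdb
    SEARCH ANALYZER(d.description IN TOKENS('amazing action world alien sci-fi science documental', 'text_en') ||
    BOOST(d.description IN TOKENS('galaxy', 'text_en'), 5), 'text_en')
    SORT BM25(d) DESC
    LIMIT 10
    FOR vertex, edge, path IN 1..1 INBOUND d imdb_edges
    FILTER edge.$label == "DIRECTED"
    RETURN DISTINCT {director: vertex.name, movie: d.title}"""
)
# Iterate through the result cursor
for doc in cursor:
    print(doc)
# Output: although LIMIT 10 was specified, only 7 of the movies have a director in our data; for the other 3 only actors can be found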
{'director': 'Garth Jennings', 'movie': "The Hitchhiker's Guide to the Galaxy"}
{'director': 'Stewart Raffill', 'movie': 'The Ice Pirates'}
{'director': 'Robert C. Cooper', 'movie': 'Stargate: The Ark of Truth'}
{'director': 'George Lucas', 'movie': 'Star Wars: Episode III: Revenge of the Sith'}
{'director': 'Jesse V. Johnson', 'movie': 'Alien Agent'}
{'director': 'Stephen Hopkins', 'movie': 'Lost in Space'}
{'director': 'Peter Yates', 'movie': 'Krull'}
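The traversal step of this query (one hop INBOUND, filtered on the edge's $label) can be pictured with a plain list of edge dictionaries. The sketch below is illustrative only: the identifiers and the ACTS_IN label are hypothetical sample values, and inbound_neighbors is not part of any driver.

```python
def inbound_neighbors(edges, target, label):
    """Return the _from ends of edges that point at `target` and carry `label`."""
    return [
        e["_from"]
        for e in edges
        if e["_to"] == target and e["$label"] == label
    ]


# Hypothetical edge documents; the identifiers are made up for illustration.
edges = [
    {"_from": "person/d1", "_to": "movie/m1", "$label": "DIRECTED"},
    {"_from": "person/a1", "_to": "movie/m1", "$label": "ACTS_IN"},
    {"_from": "person/d2", "_to": "movie/m2", "$label": "DIRECTED"},
    {"_from": "person/a2", "_to": "movie/m3", "$label": "ACTS_IN"},
]

print(inbound_neighbors(edges, "movie/m1", "DIRECTED"))  # ['person/d1']
print(inbound_neighbors(edges, "movie/m3", "DIRECTED"))  # []
```

The empty result for movie/m3 corresponds to the situation noted above: a movie that only has actor edges drops out of the final output, which is why fewer rows than the LIMIT can come back.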