ArangoDB index notes

ArangoDB supports quite a few index types. Based on hands-on experience, this is a short summary for future reference.

Index types

primary index

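The primary index is the index on _key. The key can be specified by the user or generated automatically, and the index is unique. As far as I understand, it is a persistent index that is also kept in memory. Because it is built on the persistent index, it can be used for sorting and range queries.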

persistent index

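The persistent index is built on RocksDB. It is stored on disk rather than being an in-memory index, and each entry stores the document's primary key (similar to a non-clustered index in MySQL). As I understand it, the index is loaded into memory at startup, and if loading fails it is rebuilt.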

As the persistent index is not an in-memory index, it does not store pointers into the primary index as all the in-memory indexes do, but instead it stores a document’s primary key. To retrieve a document via a persistent index via an index value lookup, there will therefore be an additional O(1) lookup into the primary index to fetch the actual document.
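The two-step lookup described in the quote can be sketched roughly as follows. This is a toy model, not how ArangoDB or RocksDB is implemented: plain JavaScript Map objects stand in for both indexes, and the names primaryIndex, nameIndex and lookupByName are made up for illustration.

```javascript
// Toy model only: plain Maps stand in for the real index structures.
// Primary index: _key -> document.
const primaryIndex = new Map([
  ["123", { _key: "123", name: "apple" }],
  ["456", { _key: "456", name: "orange" }],
]);

// Secondary (persistent-style) index on `name`: value -> list of primary keys.
// It holds keys, not direct references to the documents.
const nameIndex = new Map();
for (const doc of primaryIndex.values()) {
  if (!nameIndex.has(doc.name)) nameIndex.set(doc.name, []);
  nameIndex.get(doc.name).push(doc._key);
}

// Step 1 finds the primary keys for the value; step 2 resolves each key
// through the primary index (the additional lookup the quote mentions).
function lookupByName(name) {
  const keys = nameIndex.get(name) || [];
  return keys.map((key) => primaryIndex.get(key));
}

console.log(lookupByName("orange"));
```

The point of the sketch is only that the secondary index never returns documents directly; every hit costs one more lookup by key.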

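The passage quoted above from the official documentation is confusing: the primary index is a persistent index (which, as inferred further down, would be a balanced tree) and is sorted, yet it can still be hit with an O(1) lookup. One possible explanation: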

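As with MySQL's id, keys that are auto-generated are naturally already in sorted order, so the index can be sorted without being a balanced-tree structure.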

As the persistent index is sorted, it can be used for point lookups, range queries and sorting operations, but only if either all index attributes are provided in a query, or if a leftmost prefix of the index attributes is specified.
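The leftmost-prefix rule in the quote can be illustrated with a small helper. This is a simplified sketch that only looks at which attributes a query filters on by equality; usablePrefix is a hypothetical function, not part of any ArangoDB API, and the real query optimizer considers far more than this.

```javascript
// Returns the leading index fields that a sorted multi-attribute index
// can serve for a query filtering on `filteredAttrs`.
// An empty result means the index cannot be used for that filter.
function usablePrefix(indexFields, filteredAttrs) {
  const filtered = new Set(filteredAttrs);
  const prefix = [];
  for (const field of indexFields) {
    if (!filtered.has(field)) break; // the prefix ends at the first gap
    prefix.push(field);
  }
  return prefix;
}

const fields = ["a", "b", "c"];
console.log(usablePrefix(fields, ["a", "b"])); // leading fields a and b
console.log(usablePrefix(fields, ["a", "c"])); // only a; c needs a post-filter
console.log(usablePrefix(fields, ["b", "c"])); // empty: a is missing
```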

~~The passage quoted above is why I inferred that the RocksDB-based persistent index is implemented as a balanced tree~~

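However, after reading the RocksDB website, I found that RocksDB is not a balanced tree. It is a key-value store built on a log-structured merge-tree (LSM tree) that keeps keys in sorted order, which I first pictured as something like a sorted hash map.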

Geo Index

The geo index stores two-dimensional coordinates. It can be created on either two separate document attributes (latitude and longitude) or a single array attribute that contains both latitude and longitude. Latitude and longitude must be numeric values.

This is mainly used for searching over two-dimensional coordinates.

In hindsight, this would be a good fit for features such as looking up a color by its HSB values.

Fulltext Index

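Similar to MySQL's full-text index. It is sparse and only supports left-to-right (prefix) matching, so for tokenized search a search view (ArangoSearch) should be used instead.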

sparse index

A sparse index does not contain documents for which at least one of the index attribute is not set or contains a value of null.

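This means that if you index attribute1 of a collection, but many documents in that collection have attribute1 set to null, those documents are not added to the index. This speeds up indexing. When a large share of documents have attribute1: null, it also improves the selectivity of the index and reduces the memory the index occupies.

Note, however, that with a sparse index a query whose condition is attribute1 == null will not use the index and falls back to a full collection scan.

In ArangoDB, sparse defaults to false.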
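The behavior of a sparse index can be sketched with a toy model. Everything here is illustrative: buildSparseIndex and findKeys are hypothetical helpers, a Map stands in for the index, and the fallback scan mimics the full collection scan for a null condition.

```javascript
// Toy sparse index: value -> [_key]. Documents whose attribute is null
// or missing are skipped, so they take up no space in the index.
function buildSparseIndex(docs, field) {
  const index = new Map();
  for (const doc of docs) {
    const value = doc[field];
    if (value === null || value === undefined) continue;
    if (!index.has(value)) index.set(value, []);
    index.get(value).push(doc._key);
  }
  return index;
}

// Equality lookup: non-null values go through the index; a null condition
// cannot, because those documents were never indexed, so every document
// has to be scanned.
function findKeys(docs, index, field, value) {
  if (value === null) {
    const keys = docs
      .filter((d) => d[field] === null || d[field] === undefined)
      .map((d) => d._key);
    return { usedIndex: false, keys };
  }
  return { usedIndex: true, keys: index.get(value) || [] };
}

const docs = [
  { _key: "1", attribute1: "x" },
  { _key: "2", attribute1: null },
  { _key: "3" }, // attribute1 not set at all
];
const sparseIndex = buildSparseIndex(docs, "attribute1");

console.log(sparseIndex.size); // one entry: only document "1" is indexed
console.log(findKeys(docs, sparseIndex, "attribute1", "x"));
console.log(findKeys(docs, sparseIndex, "attribute1", null));
```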

Some index usage scenarios

sub-attributes

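This one matters a lot: ArangoDB is a multi-model database that stores JSON documents, so nested structures of all kinds come up and indexes often need to target sub-attributes.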

Creating an index on an array

{
    "_id": "posts/123",
    "_key": "123",
    "attr": ["apple", "orange"]
}

For data like this, if we want to index the contents of attr, we cannot simply write

db.posts.ensureIndex({ type: "persistent", fields: [ "attr"]});

Otherwise the index is built on the attr array as a whole rather than on its individual elements.

To index apple and orange individually, we need

db.posts.ensureIndex({ type: "persistent", fields: [ "attr[*]"]});

Only then can the index be used in a query like the following

FOR doc IN posts
  FILTER 'apple' IN doc.attr
  RETURN doc

Array index + sub-attributes

This case is also very common, for example with the following data structure

{
    "_id": "posts/123",
    "_key": "123",
    "attr": [ {"name": "apple", "weight": "1lbs"}, {"name": "orange", "color": "orange-yellow"} ]
}

If we only want to index name inside attr, that is, to optimize queries through attr[*].name, we can use

db.posts.ensureIndex({ type: "persistent", fields: [ "attr[*].name"]});

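When an array index is created, deduplicate is enabled by default, so a value that appears several times in one document's array is inserted into the index only once. Deduplication can also be handled in the application layer instead.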
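What an array index on attr[*].name stores, and what deduplicate changes, can be sketched with a toy model. buildArrayIndex is a hypothetical helper and a Map stands in for the index. What happens with deduplicate switched off (repeated entries are simply stored) is an assumption of this model, not a statement about ArangoDB's behavior.

```javascript
// Toy model of an array index: each array element (or its sub-attribute)
// becomes its own index entry that points back at the document key.
function buildArrayIndex(docs, arrayField, subField, deduplicate = true) {
  const index = new Map();
  for (const doc of docs) {
    const elements = Array.isArray(doc[arrayField]) ? doc[arrayField] : [];
    let values = elements.map((el) => {
      const v = subField ? (el == null ? undefined : el[subField]) : el;
      return v === undefined ? null : v; // a missing sub-attribute counts as null
    });
    // Per-document deduplication: repeated values are indexed only once.
    if (deduplicate) values = [...new Set(values)];
    for (const value of values) {
      if (!index.has(value)) index.set(value, []);
      index.get(value).push(doc._key);
    }
  }
  return index;
}

const posts = [
  { _key: "123", attr: [{ name: "apple" }, { name: "orange" }, { name: "apple" }] },
  { _key: "124", attr: [{ name: "orange" }] },
];

const byName = buildArrayIndex(posts, "attr", "name");
console.log(byName.get("apple"));  // document "123" appears once despite two apples
console.log(byName.get("orange")); // both documents
```

Passing null as subField models a plain attr[*] index over scalar elements such as ["apple", "orange"].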

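Also, declaring an array index as sparse has no effect, because null values are still indexed.

For array indexes, only the IN filter in AQL can make use of the index, which is what one would expect.

ArangoDB limits array indexes to a single level, so if attr[*].name is itself an array, it cannot be indexed any further. When designing the data model, keep this limit in mind for any attribute you expect to query on.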
