Elasticsearch


Basic Usage

You add data to Elasticsearch as JSON objects called documents. Elasticsearch stores these documents in searchable indices.

Add a single document

POST books/_doc
{
  "name": "Snow Crash",
  "author": "Neal Stephenson",
  "release_date": "1992-06-01",
  "page_count": 470
}

Response:

{
  "_index": "books",
  "_id": "O0lG2IsBaSa7VYx_rEia",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 0,
  "_primary_term": 1
}

Search data

Search all documents

Run the following command to search the books index for all documents:

GET books/_search

Match query

Run the following command to search the books index for documents containing brave in the name field:

GET books/_search
{
  "query": {
    "match": {
      "name": "brave"
    }
  }
}

Core Design

  • FST: the inverted index used for full-text search
  • BKD tree: stores numeric and geo data
  • Columnar storage: used for analytics

Inverted Indexes and Index Terms

www.elastic.co/blog/found-…

The inverted index maps terms to documents (and possibly positions in the documents) containing the term. Since the terms in the dictionary are sorted, we can quickly find a term, and subsequently its occurrences in the postings-structure.

Looking up terms by their prefix is O(log n), while finding terms by an arbitrary substring is O(n).

Tip

  • To find everything ending with "tastic", we can index the reverse (e.g. "fantastic" → "citsatnaf") and search for everything starting with "citsat".
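The reversed-term trick can be sketched directly. A minimal illustration: the reversed terms would live in the same sorted term dictionary, so a suffix query becomes a prefix lookup (a linear scan stands in here for the real binary search over the sorted dictionary):

```python
def build_reverse_index(terms):
    # Index each term reversed, so "ends with X" becomes "starts with reversed X".
    return sorted(term[::-1] for term in terms)

def find_by_suffix(reverse_index, suffix):
    # Prefix scan over the reversed terms; a real term dictionary would
    # binary-search the O(log n) prefix range instead of scanning linearly.
    prefix = suffix[::-1]
    return sorted(t[::-1] for t in reverse_index if t.startswith(prefix))
```

For example, `find_by_suffix(build_reverse_index(["fantastic", "funtastic", "fancy"]), "tastic")` returns both "fantastic" and "funtastic" without ever examining a suffix directly.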

Building Indexes

When building an inverted index, there are a few things we need to prioritize: search speed, index compactness, indexing speed, and the time it takes for new changes to become visible.

Search speed and index compactness are related: when searching over a smaller index, less data needs to be processed, and more of it fits in memory. As we shall see, both, and particularly compactness, come at the cost of indexing speed.

To minimize index size, various compression techniques are used. For example, when storing postings (which can get quite large), Lucene does tricks like delta encoding (e.g. storing [42, 100, 666] as [42, 58, 566]) and using a variable number of bytes, so that small numbers can be saved with a single byte.
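The two techniques mentioned can be sketched as follows. This only illustrates the idea: Lucene's actual postings formats are more elaborate, and the "high bit marks the last byte" convention below is just one common variable-byte scheme, not Lucene's exact on-disk layout:

```python
def delta_encode(doc_ids):
    # Store each doc id as the gap from the previous one; sorted postings
    # make the gaps small, hence cheap to encode.
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def vbyte_encode(n):
    # 7 payload bits per byte; the high bit marks the final byte.
    out = bytearray()
    while n >= 128:
        out.append(n & 0x7F)
        n >>= 7
    out.append(n | 0x80)
    return bytes(out)
```

With delta encoding, [42, 100, 666] becomes [42, 58, 566], and the small gaps 42 and 58 each fit in a single variable-length byte.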

Keeping the data structures small and compact means sacrificing the possibility of efficiently updating them. In fact, Lucene does not update them at all: the index files Lucene writes are immutable, i.e. they are never updated.

Deletes are an exception. When you delete a document from an index, the document is marked as deleted in a special deletion file, which is really just a bitmap and therefore cheap to update. The index structures themselves are not updated.

Consequently, updating a previously indexed document means deleting it and then re-inserting it. Note that this makes updating a document even more expensive than adding it in the first place. Storing things like rapidly changing counters in a Lucene index is therefore usually a bad idea: there is no in-place updating of values.

When new documents are added (perhaps via an update), the index changes are first buffered in memory. Eventually, the index files are flushed to disk. Note that this is the Lucene-meaning of "flush". Elasticsearch's flush operation involves a Lucene commit and more, covered in the transaction log-section.

When to flush can depend on various factors: how quickly changes must be visible, the memory available for buffering, I/O saturation, etc. Generally, for indexing speed, larger buffers are better, as long as they are small enough that your I/O can keep up.

The written files make up an index segment.

Index Segments

When you do a search, Lucene does the search on every segment, filters out any deletions, and merges the results from all the segments.

To keep the number of segments manageable, Lucene occasionally merges segments according to some merge policy as new segments are added.

When segments are merged, documents marked as deleted are finally discarded. This is why adding more documents can actually result in a smaller index size: it can trigger a merge.

Before segments are flushed to disk, changes are buffered in memory.

The most common cause for flushes with Elasticsearch is probably the continuous index refreshing, which by default happens once every second. As new segments are flushed, they become available for searching, enabling (near) real-time search. While a flush is not as expensive as a commit (as it does not need to wait for a confirmed write), it does cause a new segment to be created, invalidating some caches, and possibly triggering a merge.

Elasticsearch Indexes

An Elasticsearch index is made up of one or more shards, which can have zero or more replicas. These are all individual Lucene indexes. That is, an Elasticsearch index is made up of many Lucene indexes, which in turn is made up of index segments. When you search an Elasticsearch index, the search is executed on all the shards - and in turn, all the segments - and merged.

A "shard" is the basic scaling unit for Elasticsearch. As documents are added to the index, each one is routed to a shard. By default, routing is based on the hash of the document's id, which spreads documents evenly across the shards.

The number of shards is specified at index creation time and cannot be changed later.
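The fixed shard count follows directly from the routing scheme. A simplified sketch (Elasticsearch actually hashes the routing value with murmur3; md5 is a stand-in here): the shard is the hash modulo the number of primary shards, so changing the count would silently re-route existing documents:

```python
import hashlib

def route_to_shard(doc_id, num_primary_shards):
    # Simplified illustration of hash-based routing; Elasticsearch uses
    # murmur3 on the routing value (the document id by default).
    h = int.from_bytes(hashlib.md5(doc_id.encode()).digest()[:4], "big")
    return h % num_primary_shards
```

The same id always lands on the same shard for a given shard count, which is what makes get-by-id and update-by-id work; with a different `num_primary_shards`, the same id may land elsewhere.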

Data Structures

finite state transducer (FST)

www.shenyanchao.cn/blog/2018/1…

The FST is the term-index structure Lucene currently uses.

Based on the paper "Direct Construction of Minimal Acyclic Subsequential Transducers".

(Figure: the FST for tuesday → 3 and thursday → 5)

An FST shares not only prefixes but also suffixes. It can not only test whether a key exists, but also return the output associated with that key. It is optimized heavily in both time and space, which allows Lucene to load the entire term dictionary into memory and quickly locate a term and its posting list.

Skip lists for posting lists

Format

The posting list is the sorted skip list of DocIds that contains the related term. It’s used to return the documents for the searched term.

Let’s have a deep look at a complete Posting List for the term game in the field title :

0 : 1 : [2] : [6-10],
1 : 2 : [1, 4] : [0-4, 18-22],
2 : 1 : [1] : [0-4]

Each element of this posting list is:
Document Ordinal : Term Frequency : [array of Term Positions] : [array of Term Offsets]

In Doc 0, the first occurrence of the term game in the field title occupies the 2nd position of that field:
( "title": "video(1) game(2) history(3)" )
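Building such a posting list can be sketched as follows. This is a toy indexer (tokenizing on `\w+` runs, which is an assumption, not Lucene's analyzer); positions are 1-based and offsets are character ranges, matching the format above:

```python
import re
from collections import defaultdict

def build_postings(docs):
    """term -> list of (doc_ordinal, freq, [positions], [(start, end) offsets])."""
    postings = defaultdict(list)
    for ordinal, text in enumerate(docs):
        entries = defaultdict(lambda: ([], []))  # term -> (positions, offsets)
        for pos, m in enumerate(re.finditer(r"\w+", text.lower()), start=1):
            positions, offsets = entries[m.group()]
            positions.append(pos)                 # 1-based position in the field
            offsets.append((m.start(), m.end()))  # character offsets
        for term, (positions, offsets) in entries.items():
            postings[term].append((ordinal, len(positions), positions, offsets))
    return postings
```

Indexing `"video game history"` as document 0 yields exactly the first entry above for game: ordinal 0, frequency 1, position [2], offset [6-10].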

skip list

issues.apache.org/jira/browse…

To accelerate posting list skips (TermDocs.skipTo(int)), Lucene uses skip lists. The default skip interval is 16. If we want to skip e.g. 100 documents, it is not necessary to read 100 entries from the posting list, but only 100/16 = 6 skip list entries plus 100 % 16 = 4 entries from the posting list. This speeds up conjunction (AND) and phrase queries significantly.

Lucene uses multi-level skip lists to guarantee a logarithmic number of skips to any target in the posting list.
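A single-level version of this skipTo can be sketched as follows (real Lucene skip lists are multi-level and interleaved with the postings on disk; this only demonstrates the interval arithmetic):

```python
def skip_to(postings, skips, interval, target):
    """Return the index of the first doc id >= target.

    `skips[i]` holds the doc id at position (i + 1) * interval in `postings`,
    so each skip entry consulted lets us jump over a whole block.
    """
    pos = 0
    for i, skip_doc in enumerate(skips):
        if skip_doc < target:
            pos = (i + 1) * interval   # the whole block before this point is skipped
        else:
            break
    while pos < len(postings) and postings[pos] < target:
        pos += 1                       # linear scan inside the final block
    return pos
```

With 200 even doc ids and interval 16, advancing to doc id 200 (position 100) touches only the skip entries plus 100 % 16 = 4 postings in the last block, instead of reading 100 postings.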

b-k-d tree

cloud.tencent.com/developer/a…

k-d tree

www.baeldung.com/cs/k-d-tree…

A k-d tree is a binary tree in which each node represents a k-dimensional point.

There are different strategies for choosing an axis when dividing, but the most common one is to cycle through each of the k dimensions repeatedly and select a midpoint along it to divide the space. For instance, in the case of 2-dimensional points with x and y axes, we first split along the x-axis, then the y-axis, and then the x-axis again, continuing in this manner until all points are accounted for:

(Figure: alternating x- and y-axis splits of a 2-D point set)
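The cycling-axis construction described above can be sketched in a few lines, using the median along the current axis as the split point (one common variant of the "midpoint" choice):

```python
def build_kdtree(points, depth=0, k=2):
    """Recursively split points, cycling through the k axes (x, y, x, ...)."""
    if not points:
        return None
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {
        "point": points[mid],                             # median along this axis
        "left": build_kdtree(points[:mid], depth + 1, k),
        "right": build_kdtree(points[mid + 1:], depth + 1, k),
    }
```

Each level alternates the splitting axis, so every node partitions the remaining points by a different coordinate, which is what later makes range and nearest-neighbor queries prune half the space at each step.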

k-d-b tree

k-dimensional B-tree. The K-D-B tree aims to provide the search efficiency of a balanced k-d tree while offering the block-oriented storage of a B-tree, optimizing external-memory access.

b-k-d tree

The BKD tree differs from the K-D-B tree mainly in two respects: bulk construction and dynamic updates.

Block tree

Lucene's block tree is the on-disk terms dictionary format: terms sharing a prefix are grouped into blocks, and each term entry points to its postings (the doc ids and offsets associated with the term) in the inverted index.

Scoring

Lucene combines the Boolean model (BM) of information retrieval with the vector space model (VSM) of information retrieval to determine how relevant a given document is to a user's query.

In VSM, documents and queries are represented as weighted vectors in a multi-dimensional space, where each distinct index term is a dimension, and weights are tf-idf values.

It uses the boolean model to first narrow down the documents that need to be scored based on the use of boolean logic in the query specification.

Tf-idf

  • Term frequency (tf):
    • the number of occurrences of the term in a single document
  • Inverse document frequency (idf):
    • based on the number of documents containing the term; the fewer documents contain it, the higher the term's weight
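The two factors combine multiplicatively. A minimal sketch of the textbook formula (tf × log(N/df)); note that Lucene's actual similarities, classic TFIDFSimilarity and the BM25 default in modern versions, add further normalizations on top of this:

```python
import math

def tf_idf(term, doc_terms, corpus):
    """Raw term frequency times log-scaled inverse document frequency."""
    tf = doc_terms.count(term)                      # occurrences in this document
    df = sum(1 for d in corpus if term in d)        # documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf
```

A term that appears in fewer documents gets the higher weight: in a corpus where "crash" occurs in one document and "snow" in two, "crash" scores higher for the same term frequency.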

fuzzy match (Approximate String Matching)

The similarity measurement is based on the Damerau-Levenshtein (optimal string alignment) algorithm.

Terms will be collected and scored according to their edit distance. Only the top terms are used for building the BooleanQuery.

At most, this query will match terms up to 2 edits.
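The optimal-string-alignment variant named above counts insertions, deletions, substitutions, and transpositions of adjacent characters (with each substring transposed at most once). A straightforward dynamic-programming sketch:

```python
def osa_distance(a, b):
    """Damerau-Levenshtein distance, optimal string alignment variant."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]
```

Under this measure a single transposition costs 1 edit, so the typo "barve" still matches "brave" within the 2-edit limit a fuzzy query allows.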