Elasticsearch: The search-as-you-type field type

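The search_as_you_type field type is a text-like field that is optimized to provide out-of-the-box support for queries that serve an as-you-type completion use case. It creates a series of subfields that are analyzed to index terms that can be efficiently matched by a query that partially matches the entire indexed text value. Both prefix completion (matching terms that start at the beginning of the input) and infix completion (matching terms at any position within the input) are supported.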

Suppose we add a field of this type to a mapping:

PUT my-index-000001
{
  "mappings": {
    "properties": {
      "my_field": {
        "type": "search_as_you_type"
      }
    }
  }
}

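This creates the following fields:

  • my_field: analyzed as configured in the mapping. If no analyzer is configured, the index's default analyzer is used.
  • my_field._2gram: wraps the analyzer of my_field with a shingle token filter of shingle size 2.
  • my_field._3gram: wraps the analyzer of my_field with a shingle token filter of shingle size 3.
  • my_field._index_prefix: wraps the analyzer of my_field._3gram with an edge ngram token filter.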

If edge ngrams and shingles are unfamiliar to you, see my earlier article “Elasticsearch: Ngrams, edge ngrams, and shingles”.

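The size of the shingles in the subfields can be configured with the max_shingle_size mapping parameter. The default is 3, and valid values are the integers from 2 to 4 inclusive. A shingle subfield is created for each shingle size from 2 up to and including max_shingle_size. When constructing its own analyzer, the my_field._index_prefix subfield always uses the analyzer from the shingle subfield whose size is max_shingle_size.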

Increasing max_shingle_size improves matching for queries with more consecutive terms, at the cost of a larger index. The default max_shingle_size is usually sufficient.
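To make the relationship between max_shingle_size and the generated subfields concrete, here is a minimal Python sketch. The helper names search_as_you_type_mapping and subfield_names are hypothetical and exist only for illustration; the code just builds the request body and lists the field names described above, and it does not talk to a cluster.

```python
import json


def _check_size(max_shingle_size):
    # The article states that valid values are the integers 2 to 4 inclusive.
    if not 2 <= max_shingle_size <= 4:
        raise ValueError("max_shingle_size must be an integer from 2 to 4")


def search_as_you_type_mapping(field, max_shingle_size=3):
    """Build a create-index body for one search_as_you_type field (hypothetical helper)."""
    _check_size(max_shingle_size)
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "search_as_you_type",
                    "max_shingle_size": max_shingle_size,
                }
            }
        }
    }


def subfield_names(field, max_shingle_size=3):
    """List the root field plus the subfields created for it (hypothetical helper)."""
    _check_size(max_shingle_size)
    shingle_fields = [f"{field}._{n}gram" for n in range(2, max_shingle_size + 1)]
    return [field, *shingle_fields, f"{field}._index_prefix"]


if __name__ == "__main__":
    print(json.dumps(search_as_you_type_mapping("my_field", 4), indent=2))
    print(subfield_names("my_field", 4))
```

With max_shingle_size set to 4, the list gains a my_field._4gram entry, which is one more analyzed copy of every value and therefore a larger index.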

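When a document is indexed with a value for the root field my_field, the same input text is automatically indexed into each of these fields, each with its own analysis chain.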

The description above is a bit of a mouthful and not easy to digest, so let's walk through a simple example:

PUT jobs
{
  "mappings": {
    "properties": {
      "title": {
        "type": "search_as_you_type"
      }
    }
  }
}

Above, we created an index called jobs. In this index, we defined a field called title of type search_as_you_type.

Let's use the _analyze API to see how the field is analyzed:

POST jobs/_analyze
{
  "field": "title",
  "text": [
    "Senior Software Developer"
  ]
}

The result is:

{
  "tokens" : [
    {
      "token" : "senior",
      "start_offset" : 0,
      "end_offset" : 6,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "software",
      "start_offset" : 7,
      "end_offset" : 15,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "developer",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}
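As a rough mental model of this output, the Python sketch below lowercases the text and splits it on word characters, recording offsets and positions. It is only an approximation for plain ASCII text: the real default analyzer follows Unicode segmentation rules and also reports a token type, which this sketch omits. The function name simple_analyze is made up for this example.

```python
import re


def simple_analyze(text):
    """Approximate the tokens above: lowercase words with offsets and positions."""
    tokens = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    for token in simple_analyze("Senior Software Developer"):
        print(token)
```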

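The root field produces the same tokens we would get from an ordinary text field. But as mentioned above, the type also generates additional subfields: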

  • title._2gram
  • title._3gram
  • title._index_prefix

We can test them as follows:

POST jobs/_analyze
{
  "field": "title._2gram",
  "text": [
    "Senior Software Developer"
  ]
}

The result is:

{
  "tokens" : [
    {
      "token" : "senior software",
      "start_offset" : 0,
      "end_offset" : 15,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "software developer",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    }
  ]
}

In other words, a query for senior software or software developer will find this document.

Similarly, let's analyze title._3gram:

POST jobs/_analyze
{
  "field": "title._3gram",
  "text": [
    "Senior Software Developer"
  ]
}

The analysis result is:

{
  "tokens" : [
    {
      "token" : "senior software developer",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    }
  ]
}

In other words, when we type the complete phrase senior software developer, this document will be found.
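The _2gram and _3gram outputs can be reproduced with a few lines of Python. This is a simplified sketch of shingling over an already tokenized, lowercased input; the function name shingles is hypothetical, and the real token filter has further options that are not modelled here.

```python
def shingles(tokens, size):
    """Join every run of `size` consecutive tokens into one shingle."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


if __name__ == "__main__":
    tokens = ["senior", "software", "developer"]
    print(shingles(tokens, 2))  # the two 2-word shingles
    print(shingles(tokens, 3))  # the single 3-word shingle
    print(shingles(["developer", "advocate"], 3))  # too short: no shingle at all
```

A title with fewer words than the shingle size produces no shingles, so a two-word title contributes nothing to the _3gram subfield in this model.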

Next, let's analyze title._index_prefix:

POST jobs/_analyze
{
  "field": "title._index_prefix",
  "text": [
    "Senior Software Developer"
  ]
}

The response is very long:

{
  "tokens" : [
    {
      "token" : "s",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "se",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "sen",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "seni",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senio",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior ",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior s",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior so",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior sof",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior soft",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior softw",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior softwa",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior softwar",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software ",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software d",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software de",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software dev",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software deve",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "senior software developer",
      "start_offset" : 0,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 0
    },
    {
      "token" : "s",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "so",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "sof",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "soft",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "softw",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "softwa",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "softwar",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software ",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software d",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software de",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software dev",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software deve",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software devel",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software develo",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software develop",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software develope",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software developer",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "software developer ",
      "start_offset" : 7,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 1
    },
    {
      "token" : "d",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "de",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "dev",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "deve",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "devel",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "develo",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "develop",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "develope",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "developer",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "developer ",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    },
    {
      "token" : "developer  ",
      "start_offset" : 16,
      "end_offset" : 25,
      "type" : "shingle",
      "position" : 2
    }
  ]
}

First, note that the edge ngram token filter is applied on top of title._3gram (senior software developer). From the result above, we can see that when we type any of the following:

s, se, sen, seni, senio, senior, senior , senior s, senior so, senior sof, senior soft, senior softw, senior softwa, senior softwar, senior software, senior software , senior software d, senior software de, senior software dev, senior software deve, senior software devel, senior software develo, senior software develop, senior software develope, senior software developer

s, so, sof, soft, softw, softwa, softwar, software, software , software d, software de, software dev, software deve, software devel, software develo, software develop, software develope, software developer 

d, de, dev, deve, devel, develo, develop, develope, developer

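our document will be matched. The benefit is that we match the documents we want as broadly as possible; the drawback is that it greatly increases the storage we need.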
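To get a feel for why the prefix subfield costs so much storage, here is a small Python sketch of edge n-gram generation. The function name edge_ngrams and its min_gram and max_gram parameters are illustrative assumptions for this sketch, not the settings the subfield actually uses.

```python
def edge_ngrams(text, min_gram=1, max_gram=None):
    """Return every prefix of `text` whose length is between min_gram and max_gram."""
    upper = len(text) if max_gram is None else min(max_gram, len(text))
    return [text[:n] for n in range(min_gram, upper + 1)]


if __name__ == "__main__":
    print(edge_ngrams("dev"))
    # One shingle of n characters turns into up to n indexed prefixes.
    shingle = "senior software developer"
    print(len(shingle), len(edge_ngrams(shingle)))
```

Every shingle expands into roughly as many tokens as it has characters, which is where the extra index size comes from.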

Next, let's work through a concrete example:

PUT jobs/_bulk?refresh
{ "index": {} }
{ "title": "Software Developer" }
{ "index": {} }
{ "title": "Senior Software Developer" }
{ "index": {} }
{ "title": "Principal Software Developer" }
{ "index": {} }
{ "title": "Developer Advocate" }
{ "index": {} }
{ "title": "Developer 🇨🇳" }

Above, we indexed 5 documents. Next, we'll query them with match_phrase_prefix.

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "developer"
    }
  }
}

All 5 documents are returned, because the tokens of each one contain developer. We can even search like this:

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "dev"
    }
  }
}

This also returns all 5 documents.

Now let's run the following search:

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "software dev"
    }
  }
}

This time only 3 documents are returned:

    "hits" : [
      {
        "_index" : "jobs",
        "_type" : "_doc",
        "_id" : "MXYGyXgBI6xucLpoKQ9y",
        "_score" : 0.67181337,
        "_source" : {
          "title" : "Software Developer"
        }
      },
      {
        "_index" : "jobs",
        "_type" : "_doc",
        "_id" : "MnYGyXgBI6xucLpoKQ9y",
        "_score" : 0.5679247,
        "_source" : {
          "title" : "Senior Software Developer"
        }
      },
      {
        "_index" : "jobs",
        "_type" : "_doc",
        "_id" : "M3YGyXgBI6xucLpoKQ9y",
        "_score" : 0.5679247,
        "_source" : {
          "title" : "Principal Software Developer"
        }
      }
    ]

If we run the following search:

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": "senior dev"
    }
  }
}

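we get no results. This is because the slop of match_phrase_prefix defaults to 0, which means senior must be immediately followed by a word starting with dev for a document to match. We can, of course, change the query as follows: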

GET jobs/_search
{
  "query": {
    "match_phrase_prefix": {
      "title": {
        "query": "senior dev",
        "slop": 1
      }
    }
  }
}

Above, we set slop to 1, which allows one word to sit between senior and dev, so the document still matches. The search result is:

    "hits" : [
      {
        "_index" : "jobs",
        "_type" : "_doc",
        "_id" : "wHYQyXgBI6xucLpoWA-v",
        "_score" : 0.8418889,
        "_source" : {
          "title" : "Senior Software Developer"
        }
      }
    ]
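The effect of slop on these queries can be imitated in plain Python. The sketch below is a deliberate simplification: it requires the query terms to appear in order and counts only the words skipped between them, whereas the real slop handling can also tolerate reordered terms. The names phrase_prefix_match and search, and the whitespace tokenization, are assumptions made for this example.

```python
TITLES = [
    "Software Developer",
    "Senior Software Developer",
    "Principal Software Developer",
    "Developer Advocate",
    "Developer 🇨🇳",
]


def phrase_prefix_match(doc_tokens, query_tokens, slop=0):
    """In-order phrase match where the last query term is treated as a prefix."""
    if not query_tokens:
        return False
    last = len(query_tokens) - 1

    def term_matches(i, token):
        if i == last:
            return token.startswith(query_tokens[i])
        return token == query_tokens[i]

    def search_from(i, prev_pos, budget):
        if i > last:
            return True
        start = 0 if prev_pos is None else prev_pos + 1
        for pos in range(start, len(doc_tokens)):
            # Words skipped since the previous matched term use up the slop budget.
            gap = 0 if prev_pos is None else pos - prev_pos - 1
            if gap > budget:
                break
            if term_matches(i, doc_tokens[pos]) and search_from(i + 1, pos, budget - gap):
                return True
        return False

    return search_from(0, None, slop)


def search(query, slop=0):
    """Return the titles that match, using naive lowercase whitespace tokens."""
    query_tokens = query.lower().split()
    return [t for t in TITLES if phrase_prefix_match(t.lower().split(), query_tokens, slop)]


if __name__ == "__main__":
    print(search("software dev"))
    print(search("senior dev"))
    print(search("senior dev", slop=1))
```

In this model, senior dev finds nothing with a slop of 0 and finds Senior Software Developer once one skipped word is allowed, mirroring the two queries above.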

If we want to find all documents that contain senior or a word starting with dev, we can search like this:

GET jobs/_search
{
  "query": {
    "multi_match": {
      "query": "senior dev",
      "type": "bool_prefix",
      "operator": "or", 
      "fields": [
        "title",
        "title._2gram",
        "title._3gram"
      ]
    }
  }
}

This search returns all 5 documents.
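The bool_prefix behaviour can be sketched in Python as well: every query term except the last is matched as a whole term, the last one is matched as a prefix, and the operator decides whether any or all clauses must hold. This sketch models only the root title field with naive whitespace tokens and ignores the _2gram and _3gram subfields and scoring; the function names are hypothetical.

```python
TITLES = [
    "Software Developer",
    "Senior Software Developer",
    "Principal Software Developer",
    "Developer Advocate",
    "Developer 🇨🇳",
]


def bool_prefix_match(doc_tokens, query_tokens, operator="or"):
    """Term clauses for all but the last query token, a prefix clause for the last."""
    if not query_tokens:
        return False
    *terms, prefix = query_tokens
    clauses = [term in doc_tokens for term in terms]
    clauses.append(any(token.startswith(prefix) for token in doc_tokens))
    return all(clauses) if operator == "and" else any(clauses)


def search(query, operator="or"):
    """Return the titles that match, in indexing order (no relevance scoring)."""
    query_tokens = query.lower().split()
    return [t for t in TITLES if bool_prefix_match(t.lower().split(), query_tokens, operator)]


if __name__ == "__main__":
    print(search("senior dev", operator="or"))
    print(search("senior dev", operator="and"))
```

Unlike the phrase queries, word order and distance play no role here, which is why senior dev with the or operator reaches every title containing a word that starts with dev.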

We can also search for the emoji 🇨🇳:

GET jobs/_search
{
  "query": {
    "multi_match": {
      "query": "🇨🇳",
      "type": "bool_prefix",
      "operator": "and", 
      "fields": [
        "title",
        "title._2gram",
        "title._3gram"
      ]
    }
  }
}

The result is:

    "hits" : [
      {
        "_index" : "jobs",
        "_type" : "_doc",
        "_id" : "w3YQyXgBI6xucLpoWA-v",
        "_score" : 1.0,
        "_source" : {
          "title" : """Developer 🇨🇳"""
        }
      }
    ]

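Besides search_as_you_type, there is a similar data type: completion. The completion suggester is optimized for speed. It uses data structures that enable fast lookups, but they are costly to build and are stored in memory, and it cannot perform infix queries.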
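The difference between prefix-only and infix completion can be illustrated with a toy Python example. A sorted list with binary search stands in here for a prefix-lookup structure; this is a swapped-in technique for illustration, not the data structure the completion suggester actually uses, and both function names are made up.

```python
import bisect

# Lowercased inputs, kept sorted so that entries sharing a prefix sit next to each other.
INPUTS = sorted([
    "software developer",
    "senior software developer",
    "principal software developer",
    "developer advocate",
])


def prefix_only_suggest(typed):
    """Return the inputs that start with `typed`, found by binary search."""
    typed = typed.lower()
    start = bisect.bisect_left(INPUTS, typed)
    matches = []
    for candidate in INPUTS[start:]:
        if not candidate.startswith(typed):
            break
        matches.append(candidate)
    return matches


def infix_suggest(typed):
    """Return the inputs in which any word starts with `typed` (slower full scan)."""
    typed = typed.lower()
    return [s for s in INPUTS if any(word.startswith(typed) for word in s.split())]


if __name__ == "__main__":
    print(prefix_only_suggest("soft"))
    print(infix_suggest("soft"))
```

The prefix-only lookup misses senior software developer for the input soft because the match would have to begin in the middle of the indexed string, which is the infix case that search_as_you_type covers.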