A Detailed Look at the ES source Field (with Source Code Analysis)

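In Elasticsearch (ES for short), the source field stores the original content of a document and is enabled by default. Because most use cases need this field, it is easy to overlook.

This article tries to cover what is worth knowing about source in as much detail as possible.

For example, suppose we index two documents: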

PUT student/_doc/1
{
  "name":"Jack",
  "age": 15,
  "like": "hiking,basketball"
}

PUT student/_doc/2
{
  "name":"Tom",
  "age": 16,
  "like": "football,swimming,basketball"
}

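ES first builds an inverted index, which is what makes searching fast.

At the same time, by default the source field stores the original content of each document. This storage structure is commonly called a forward index.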
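To make the two structures concrete, here is a minimal Python sketch. It is only an illustration under simplifying assumptions: the tokenizer just lowercases and splits on commas, and the function name `build_inverted_index` is made up for this example. It is not how ES or Lucene is implemented.

```python
from collections import defaultdict

# The two sample documents, keyed by document id. This dict plays the
# role of the forward structure: id -> original content.
docs = {
    "1": {"name": "Jack", "age": 15, "like": "hiking,basketball"},
    "2": {"name": "Tom", "age": 16, "like": "football,swimming,basketball"},
}


def build_inverted_index(docs, field):
    """Map each term of `field` to the sorted ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, source in docs.items():
        # Simplified tokenizer (assumption): lowercase, then split on commas.
        for term in source[field].lower().split(","):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


inverted = build_inverted_index(docs, "like")

# Searching goes term -> ids through the inverted index ...
matching_ids = inverted.get("hiking", [])
# ... and returning the original content goes id -> document.
originals = [docs[doc_id] for doc_id in matching_ids]
```

The search step only needs the inverted index. The second lookup, from id back to the original document, is the part that the source field provides.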

When we search, each matching document returns its original content through the _source field:

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.6931471,
    "hits" : [
      {
        "_index" : "student",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.6931471,
        "_source" : {
          "name" : "Jack",
          "age" : 15,
          "like" : "hiking,basketball"
        }
      }
    ]
  }
}

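In real projects we sometimes do not need the source field at all and only want the search hits. A common pattern is to use the database primary key as the document id, and then use that id to fetch the full record from another data store. Because source is enabled by default, not knowing about this mechanism means ES storage space is wasted for nothing.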
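The "search in ES, fetch by id elsewhere" pattern can be sketched roughly as follows. The `database` dict and the `fetch_rows` helper are hypothetical stand-ins for whatever primary store a project uses, and the response is a hand-trimmed example rather than real output.

```python
# Hypothetical stand-in for the primary data store (for example a table
# keyed by primary key). It is not part of Elasticsearch.
database = {
    "1": {"name": "Jack", "age": 15, "like": "hiking,basketball"},
    "2": {"name": "Tom", "age": 16, "like": "football,swimming,basketball"},
}

# Hand-trimmed search response: each hit carries metadata only, no _source.
response = {
    "hits": {
        "hits": [
            {"_index": "student", "_id": "1", "_score": 0.6931471},
        ]
    }
}


def fetch_rows(response, database):
    """Collect the hit ids, then load the full records from the other store."""
    ids = [hit["_id"] for hit in response["hits"]["hits"]]
    # Skip ids that are missing from the store instead of failing.
    return [database[doc_id] for doc_id in ids if doc_id in database]


rows = fetch_rows(response, database)
```

In this setup ES only has to answer "which ids match", so the original content does not need to live in the index as well.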

ES also lets you control at query time which parts of source are returned. For example, to return only the name field of each document:

GET student/_search?_source=name
{
  "query": {
    "match": {
      "like": "hiking"
    }
  }
}
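The same filter can also be placed in the request body through the `_source` key. The sketch below builds such a body as a Python dict and then mimics the effect locally. The `filter_source` helper is a simplified model written for this example (top-level field names only, no wildcards or nested paths), not the ES implementation.

```python
import json

# Request body equivalent to the query above, with the filter moved
# from the URL parameter into the body.
body = {
    "_source": ["name"],
    "query": {"match": {"like": "hiking"}},
}
body_json = json.dumps(body)


def filter_source(source, includes):
    """Simplified model: keep only the listed top-level fields."""
    return {key: value for key, value in source.items() if key in includes}


full_source = {"name": "Jack", "age": 15, "like": "hiking,basketball"}
returned = filter_source(full_source, body["_source"])
```

Note that this filtering only trims what is returned. The full source is still stored in the index, so it does not save any disk space.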

Now let's look at an example where the source field is not stored.

Create the mapping with the source field disabled:

PUT student_new
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "age": {
        "type": "long"
      },
      "like": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Index the same data as above. A sample query result from the new index looks like this:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.7549127,
    "hits" : [
      {
        "_index" : "student_new",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.7549127
      }
    ]
  }
}

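As you can see, the source field is gone, but we can still get the document id. If each document saves 10KB, then across several hundred million documents the storage savings are considerable.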
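A quick back-of-the-envelope calculation shows the scale. The document count of 300 million is an assumption picked only to illustrate "several hundred million", and the result is the raw size before compression and replicas, so real savings will differ.

```python
KIB = 1024
TIB = 1024 ** 4


def saved_bytes(doc_count, bytes_per_doc):
    """Raw bytes saved if every document drops `bytes_per_doc` of source."""
    return doc_count * bytes_per_doc


doc_count = 300_000_000      # assumption: one value for "several hundred million"
per_doc = 10 * KIB           # 10KB per document, as in the text

saved = saved_bytes(doc_count, per_doc)
saved_tib = saved / TIB      # roughly 2.8 TiB of raw data

print(f"{saved} bytes, about {saved_tib:.2f} TiB")
```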

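Not storing the source field does save a lot of storage space, but in some cases the field must be kept. Here are a few:

Documents need to be updated with update or update_by_query
reindex will be used
Highlighting will be used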
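Why an update depends on source can be seen from a simplified model: a partial update has to read the original document, merge the change into it, and index the merged result again. The `partial_update` function below is a hypothetical sketch of that idea with a shallow merge only. It is not the ES implementation.

```python
def partial_update(stored_source, partial_doc):
    """Merge `partial_doc` into the stored original and return the new document.

    `stored_source` is None when the index does not store source, in which
    case there is nothing to merge into.
    """
    if stored_source is None:
        raise ValueError("cannot update: the original document is not stored")
    merged = dict(stored_source)   # copy so the stored version is not mutated
    merged.update(partial_doc)     # shallow merge (simplification)
    return merged


stored = {"name": "Jack", "age": 15, "like": "hiking,basketball"}
updated = partial_update(stored, {"age": 16})
```

reindex has the same dependency for the same reason: it copies documents by reading their original content, which is only available when source is stored.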

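Next, let's see how the source field is handled at the source code level. The source code version is 7.10.

The entry point for writing a document in ES is the executeBulkItemRequest method of the TransportShardBulkAction class. The operation is ultimately carried out by the applyIndexOperationOnPrimary method called inside it.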

(Screenshot: TransportShardBulkAction class)

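applyIndexOperationOnPrimary is a method of the IndexShard class. It calls the internal prepareIndex, which in turn reaches the parseDocument method of the DocumentParser class. This method is responsible for parsing the document. During parsing, if enabled is false for source, the source field is not added to the resulting document (the one that is finally written to Lucene).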

(Screenshot: DocumentParser class)

(Screenshot: SourceFieldMapper class)
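The effect of that check can be modeled in a few lines of Python. This is a deliberately simplified, hypothetical model: the function name `parse_document`, its parameters, and the list-of-pairs output are all invented for this sketch. It does not reproduce the actual ES Java code.

```python
import json


def parse_document(doc_id, raw_json, source_enabled=True):
    """Return the (field_name, value) pairs that would go into the final document."""
    parsed = json.loads(raw_json)
    fields = [("_id", doc_id)]
    # The regular fields are always produced, so the document stays searchable.
    for name, value in parsed.items():
        fields.append((name, value))
    # The original JSON is kept as an extra field only when source is enabled.
    if source_enabled:
        fields.append(("_source", raw_json))
    return fields


raw = '{"name": "Jack", "age": 15, "like": "hiking,basketball"}'
with_source = [name for name, _ in parse_document("1", raw, source_enabled=True)]
without_source = [name for name, _ in parse_document("1", raw, source_enabled=False)]
```

In both cases the regular fields and the id are still present, which matches what we saw earlier: with source disabled the document can still be found and its id returned, but the original content is no longer available.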