Introduction to an elasticsearch data migration tool

ela is an open-source tool for migrating data between different elasticsearch versions.

In enterprises, older elasticsearch versions are no longer maintained, so there is a need to migrate to a newer elasticsearch version.

Existing solutions

The following free options currently exist for elasticsearch data migration:

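Use logstash, which can perform a full data migration. I needed to migrate data from es5 to es8; because of compatibility constraints between es versions, a specific logstash version has to be chosen, and after extensive testing I found that version 7.6.2 meets this need. One problem remains: index settings are not migrated. The indices in the target es are generated by inferring types from the data, so the field types in the mappings on the two sides differ, which causes problems.
Use the elasticsearch dump tool. You can mount a cloud disk on the source es and export its data to the disk, then mount the disk on the target es and import the data there. The two es versions must be compatible for this to work, which makes cross-version upgrades challenging.
Use the open-source esm tool, which provides full migration, incremental migration, and data comparison. However, it works on only one index at a time and has no batch support. Its comparison is also problematic: it sorts all the data of both the source and the target index by doc id. Because doc id is a string in es and is not an indexed field, the sort makes es memory usage spike, which triggers es overload protection and causes it to reject requests.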
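
To make the comparison problem above concrete, here is a minimal sketch of one way to compare two indices without asking es to sort by doc id: each side is read in whatever order it arrives (for example from a scroll), and only one digest per document is kept on the client. This is an illustration under assumptions, not how esm or ela is implemented; `compare_docs`, `digest`, and the `(doc_id, source)` pair format are hypothetical names chosen for this example.

```python
import hashlib
import json
from typing import Dict, Iterable, List, Tuple

Doc = Tuple[str, dict]  # (doc id, document body); hypothetical format


def digest(body: dict) -> str:
    """Stable digest of a document body; keys are sorted so field order does not matter."""
    payload = json.dumps(body, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def compare_docs(source_docs: Iterable[Doc], target_docs: Iterable[Doc]) -> Dict[str, List[str]]:
    """Classify doc ids as missing from, different in, or extra in the target."""
    # Keep only id -> digest for the source side, so no server-side sort is needed.
    source_digests = {doc_id: digest(body) for doc_id, body in source_docs}

    different: List[str] = []
    extra: List[str] = []
    seen = set()
    for doc_id, body in target_docs:
        seen.add(doc_id)
        if doc_id not in source_digests:
            extra.append(doc_id)
        elif source_digests[doc_id] != digest(body):
            different.append(doc_id)

    missing = [doc_id for doc_id in source_digests if doc_id not in seen]
    return {
        "missing": sorted(missing),
        "different": sorted(different),
        "extra": sorted(extra),
    }
```

The trade-off in this sketch is that memory is spent on the machine running the comparison (one digest per source document) instead of on the es cluster.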

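Introducing ela

Because the existing free open-source tools have the problems described above, I drew on the experience of others before me and developed ela, a data migration tool for elasticsearch. It supports the following features: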

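Batch full migration of index data.
Batch incremental migration of index data.
Batch comparison of index data.
Data migration for both es upgrades and downgrades.
Dual writes are planned for the future, to allow migration without stopping the service.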

How to use ela

./ela ./config.yml

The configuration file looks like this:

elastics:             # elasticsearch clusters
  es5:                # cluster name, must be unique
    addresses:        # elasticsearch addresses; multiple masters may be listed
      - "http://127.0.0.1:15200"
    user: ""          # basic auth username
    password: ""      # basic auth password

  es6:
    addresses:
      - "http://127.0.0.1:16200"
    user: ""
    password: ""

  es7:
    addresses:
      - "http://127.0.0.1:17200"
    user: ""
    password: ""

  es8:
    addresses:
      - "http://127.0.0.1:18200"
    user: ""
    password: ""

tasks:  # tasks, executed in order
  - name: task1 # task name
    source_es: es5  # source elasticsearch cluster name, as defined under elastics
    target_es: es8  # target elasticsearch cluster name, as defined under elastics
    index_pairs:    # explicit index pairs; data is synced from source_index to target_index
      -
        source_index: "sample_hello"
        target_index: "sample_hello"
    index_pattern: "test_.*" # regex selecting indices to sync; the target index has the same name as the source index
    action: sync # one of 'sync', 'compare', 'sync_diff': sync performs a full sync, compare compares the source index with the target index, sync_diff performs an incremental sync
    force: true       # overwrite the target index's data and settings with those of the source index
    scroll_size: 1000 # number of documents fetched per scroll from the source index
    parallelism: 12   # number of index pairs processed in parallel

  - name: task2
    source_es: es5
    target_es: es8
    index_pairs:
      - source_index: "sample_hello"
        target_index: "sample_hello"
    action: compare

  - name: task3
    source_es: es5
    target_es: es6
    index_pairs:
      -
        source_index: "sample_hello"
        target_index: "sample_hello"
    action: sync
    force: true

  - name: task4
    source_es: es5
    target_es: es6
    index_pairs:
      -
        source_index: "sample_hello"
        target_index: "sample_hello"
    action: sync_diff
    force: true
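
Since tasks refer to clusters by name, a typo in `source_es`, `target_es`, or `action` is easy to make. The following is a minimal sketch of a pre-flight check over the parsed configuration. It is not part of ela, and ela may well perform its own validation; `validate_config` is a hypothetical helper, and the rule that a task needs `index_pairs` or `index_pattern` is an assumption based on the sample above.

```python
from typing import List

# Action names taken from the sample configuration.
VALID_ACTIONS = {"sync", "compare", "sync_diff"}


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means none were found."""
    errors: List[str] = []
    clusters = config.get("elastics") or {}

    for position, task in enumerate(config.get("tasks") or []):
        name = task.get("name", f"#{position}")

        # Both cluster references must point at a name defined under elastics.
        for key in ("source_es", "target_es"):
            if task.get(key) not in clusters:
                errors.append(
                    f"task {name}: {key} {task.get(key)!r} is not defined under elastics"
                )

        if task.get("action") not in VALID_ACTIONS:
            errors.append(f"task {name}: unknown action {task.get('action')!r}")

        # Assumption: a task without any index selection has nothing to do.
        if not task.get("index_pairs") and not task.get("index_pattern"):
            errors.append(f"task {name}: neither index_pairs nor index_pattern is set")

    return errors
```

The function takes a plain dict, so it works with any YAML parser that returns one (for example PyYAML's `yaml.safe_load`, which is an assumption about tooling rather than something ela requires).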

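Users can configure multiple es cluster instances. Each instance name must be unique and can be any valid string, so that the task configuration further down can refer to it. A cluster's addresses can list multiple masters, which allows load balancing while scrolling through the data.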

elastics:             # elasticsearch clusters
  es5:                # cluster name, must be unique
    addresses:        # elasticsearch addresses; multiple masters may be listed
      - "http://127.0.0.1:15200"
    user: ""          # basic auth username
    password: ""      # basic auth password

  es6:
    addresses:
      - "http://127.0.0.1:16200"
    user: ""
    password: ""
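
As a rough illustration of how several addresses can spread scroll requests across nodes, the sketch below rotates through the configured list in round-robin order. This is an assumed strategy for illustration only; the text does not say which balancing scheme ela uses, and `round_robin` and the second address are hypothetical.

```python
import itertools
from typing import Iterator, List


def round_robin(addresses: List[str]) -> Iterator[str]:
    """Yield the configured addresses in order, starting over after the last one."""
    if not addresses:
        raise ValueError("at least one address is required")
    return itertools.cycle(addresses)


# Each scroll request would take the next address from the iterator.
picker = round_robin(["http://127.0.0.1:15200", "http://127.0.0.1:15201"])
first_three = [next(picker) for _ in range(3)]
```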

Users can configure multiple kinds of tasks, which are executed in order. Pay particular attention to the following parameters:

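Parameter	Description
index_pairs	One or more pairs of source and target es indices, whose names may differ
index_pattern	A regular expression; every source index that matches it is paired with a target es index of the same name
parallelism	A task can cover multiple index pairs, which can be processed in parallel; this sets how many run at once
force	Force-delete the target index and overwrite it with the settings and data of the source es index
scroll_size	Number of documents fetched per scroll; set it according to the memory of the host or container running the tool
action	One of three types: sync, sync_diff, compare. sync is a full sync; sync_diff is an incremental sync; compare is a comparison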
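
The way `index_pairs`, `index_pattern`, and `parallelism` work together can be sketched as follows. This is a minimal illustration under assumptions, not ela's actual code: `resolve_index_pairs`, `run_task`, and `handle_pair` are hypothetical names, the pattern is assumed to match the whole index name, and a missing `parallelism` is assumed to mean 1.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (source index, target index)


def resolve_index_pairs(task: dict, source_indices: List[str]) -> List[Pair]:
    """Combine explicit index_pairs with same-name pairs selected by index_pattern."""
    pairs: List[Pair] = [
        (p["source_index"], p["target_index"]) for p in task.get("index_pairs") or []
    ]
    pattern = task.get("index_pattern")
    if pattern:
        regex = re.compile(pattern)
        for name in source_indices:
            pair = (name, name)
            # Assumption: the regex must match the full index name.
            if regex.fullmatch(name) and pair not in pairs:
                pairs.append(pair)
    return pairs


def run_task(
    task: dict,
    source_indices: List[str],
    handle_pair: Callable[[str, str], object],
) -> Dict[Pair, object]:
    """Run handle_pair for every resolved pair, at most `parallelism` at a time."""
    pairs = resolve_index_pairs(task, source_indices)
    workers = max(1, int(task.get("parallelism", 1)))  # assumed default of 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda pair: handle_pair(*pair), pairs))
    return dict(zip(pairs, results))
```

Here `handle_pair` stands in for whatever the chosen `action` does to one index pair, and `source_indices` is the list of index names that exist on the source cluster.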