Elasticsearch Cluster Rolling Upgrades in Practice - Keeping the Business Online with Zero Downtime


Elasticsearch releases new versions frequently, yet production traffic cannot stop. A rolling upgrade lets us move a cluster to a new version while keeping the service continuously available. This article walks through the process hands-on, including code implementations and contingency plans.

Rolling Upgrade Basics

A rolling upgrade updates Elasticsearch nodes one at a time without stopping the whole cluster. This moves the cluster from its current version to the target version while preserving data and service continuity.

[Figure: Rolling upgrade basics]

Version Compatibility

Elasticsearch follows these version compatibility rules:

  1. Major versions: you can only upgrade from the immediately preceding major version; for example, 6.x can go directly to 7.x, but not straight to 8.x
  2. Minor versions: you can skip minor versions; for example, 7.10 can upgrade directly to 7.16

The compatibility of common versions is summarized below:

[Figure: Compatibility of common versions]
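The original chart is not reproduced here. As a rule of thumb (verify against the official upgrade matrix for your exact release), the commonly documented paths are:

  1. 5.6 → 6.8
  2. 6.8 → 7.17
  3. 7.17 → 8.x

The pattern: move to the final minor release of your current major version before crossing a major-version boundary, which is exactly what the 7.17 check in the code below enforces.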

Implementing the Version Compatibility Check

import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.MainResponse;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Version compatibility checker - validates that an upgrade path is feasible
 */
public class VersionCompatibilityChecker {
    private static final Logger logger = LoggerFactory.getLogger(VersionCompatibilityChecker.class);
    private final RestHighLevelClient client;

    public VersionCompatibilityChecker(RestHighLevelClient client) {
        this.client = client;
    }

    /**
     * Check the current cluster version and validate compatibility with the target version.
     * @param targetVersion target version
     * @return whether the upgrade is compatible
     */
    public boolean isCompatibleUpgrade(String targetVersion) {
        try {
            MainResponse info = client.info(RequestOptions.DEFAULT);
            String currentVersion = info.getVersion().getNumber();
            logger.info("Current Elasticsearch version: {}", currentVersion);

            // Parse the version numbers
            String[] currentParts = currentVersion.split("\\.");
            String[] targetParts = targetVersion.split("\\.");

            int currentMajor = Integer.parseInt(currentParts[0]);
            int targetMajor = Integer.parseInt(targetParts[0]);

            // Major version check
            if (targetMajor > currentMajor + 1) {
                logger.error("Upgrading across multiple major versions is not supported: {} -> {}", currentVersion, targetVersion);
                return false;
            }

            // Check specific upgrade paths
            if (currentMajor == 7 && targetMajor == 8) {
                // Upgrading from 7.x to 8.x requires at least 7.17
                int currentMinor = Integer.parseInt(currentParts[1]);
                if (currentMinor < 17) {
                    logger.warn("Upgrade to at least 7.17 before moving to 8.x");
                    return false;
                }
            }

            // Check compatibility of installed plugins
            if (!checkPluginCompatibility(targetVersion)) {
                return false;
            }

            logger.info("Version compatibility check passed: {} -> {}", currentVersion, targetVersion);
            return true;
        } catch (Exception e) {
            logger.error("Version compatibility check failed", e);
            return false;
        }
    }

    /**
     * Check whether installed plugins are compatible with the target version.
     * @param targetVersion target version
     * @return whether the plugins are compatible
     */
    private boolean checkPluginCompatibility(String targetVersion) {
        try {
            // Fetch the list of installed plugins
            Request pluginsRequest = new Request("GET", "/_cat/plugins?format=json");
            Response pluginsResponse = client.getLowLevelClient().performRequest(pluginsRequest);

            // A real implementation should parse the response and verify each plugin;
            // here we keep things simple and only log.
            logger.info("Plugins must be checked for compatibility with version {}", targetVersion);
            return true;
        } catch (Exception e) {
            logger.error("Plugin compatibility check failed", e);
            return false;
        }
    }
}

Note: the code samples in this article are based on the Elasticsearch 7.x RestHighLevelClient API. Starting with Elasticsearch 8.0, the new Java API Client is officially recommended. If you are on 8.x, refer to the example below:

import co.elastic.clients.elasticsearch.ElasticsearchClient;
import co.elastic.clients.json.jackson.JacksonJsonpMapper;
import co.elastic.clients.transport.ElasticsearchTransport;
import co.elastic.clients.transport.rest_client.RestClientTransport;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;

// Build the new Java API Client
RestClient restClient = RestClient.builder(new HttpHost("localhost", 9200)).build();
ElasticsearchTransport transport = new RestClientTransport(restClient, new JacksonJsonpMapper());
ElasticsearchClient client = new ElasticsearchClient(transport);

// Fetch cluster info with the new API
var info = client.info();
String version = info.version().number();

Pre-Upgrade Preparation

Before starting a rolling upgrade, complete the following preparation (a pre-flight sketch that strings these checks together follows the list):

  1. Back up data: back up critical data before any upgrade
  2. Verify compatibility: confirm the target version is compatible with the current one
  3. Check cluster health: make sure the cluster status is green
  4. Plan the upgrade path: some versions require upgrading to an intermediate version first
  5. Rehearse in a test environment: run a full upgrade dry-run in an environment that mirrors production
  6. Check for in-flight operations: verify no large operations (such as reindex or restore) are running
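A minimal pre-flight sketch that strings these checks together, using the helper classes implemented in the rest of this article (the repository name and backup path are illustrative, not a fixed convention):

import org.elasticsearch.client.RestHighLevelClient;

/**
 * Pre-upgrade checklist runner - fails fast on cheap checks before snapshotting
 */
public class PreUpgradeChecklist {

    public static boolean runPreflight(RestHighLevelClient client,
                                       String targetVersion,
                                       String repositoryName,
                                       String backupPath) {
        RetryHelper retryHelper = new RetryHelper();
        VersionCompatibilityChecker versionChecker = new VersionCompatibilityChecker(client);
        ClusterHealthChecker healthChecker = new ClusterHealthChecker(client);
        SnapshotManager snapshotManager = new SnapshotManager(client, retryHelper);

        // Order matters: validate the path and cluster state before the (slow) snapshot
        return versionChecker.isCompatibleUpgrade(targetVersion)
                && healthChecker.isClusterHealthy()
                && healthChecker.noOngoingOperations()
                && snapshotManager.ensureRepositoryExists(repositoryName, backupPath)
                && snapshotManager.createPreUpgradeSnapshot(repositoryName);
    }
}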

Implementing the Cluster Health Check

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.cluster.health.ClusterHealthStatus;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.io.IOException;
import java.io.InputStream;
import java.util.UUID;

/**
 * Cluster health checker - verifies the cluster is fit for an upgrade
 */
public class ClusterHealthChecker {

    private static final Logger logger = LoggerFactory.getLogger(ClusterHealthChecker.class);
    private final RestHighLevelClient client;

    public ClusterHealthChecker(RestHighLevelClient client) {
        this.client = client;
    }

    /**
     * Check cluster health.
     * @return true if the cluster is green and has no unassigned shards
     */
    public boolean isClusterHealthy() {
        try {
            String checkId = UUID.randomUUID().toString().substring(0, 8);
            MDC.put("checkId", checkId);
            logger.info("Starting cluster health check");

            ClusterHealthRequest request = new ClusterHealthRequest();
            ClusterHealthResponse response = client.cluster().health(request, RequestOptions.DEFAULT);

            ClusterHealthStatus status = response.getStatus();
            int numberOfNodes = response.getNumberOfNodes();
            int unassignedShards = response.getUnassignedShards();

            logger.info("Cluster status: {}", status);
            logger.info("Number of nodes: {}", numberOfNodes);
            logger.info("Unassigned shards: {}", unassignedShards);

            // Pass only if the status is green and there are no unassigned shards
            boolean isHealthy = status == ClusterHealthStatus.GREEN && unassignedShards == 0;

            if (isHealthy) {
                logger.info("Cluster health check passed");
            } else {
                logger.warn("Cluster health check failed: status={}, unassigned shards={}", status, unassignedShards);
            }

            MDC.remove("checkId");
            return isHealthy;
        } catch (IOException e) {
            logger.error("Failed to check cluster health", e);
            MDC.remove("checkId");
            return false;
        }
    }

    /**
     * Check whether any large operations are in progress.
     * @return true if no large operations are running
     */
    public boolean noOngoingOperations() {
        try {
            // Look for reindex/restore tasks
            Request tasksRequest = new Request("GET", "/_tasks");
            tasksRequest.addParameter("detailed", "true");
            tasksRequest.addParameter("actions", "*reindex*,*restore*");

            Response response = client.getLowLevelClient().performRequest(tasksRequest);

            // Parse the response to see if any matching tasks are running
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = response.getEntity().getContent()) {
                JsonNode root = mapper.readTree(is);
                JsonNode nodes = root.path("nodes");

                // Walk all nodes and count their tasks
                if (nodes.size() > 0) {
                    for (JsonNode node : nodes) {
                        JsonNode tasks = node.path("tasks");
                        if (tasks.size() > 0) {
                            logger.warn("Found {} large operations in progress", tasks.size());
                            return false;
                        }
                    }
                }
            }

            logger.info("In-flight operation check: no large operations found");
            return true;
        } catch (IOException e) {
            logger.error("Failed to check in-flight operations", e);
            return false;
        }
    }

    /**
     * Build a detailed cluster health report.
     * @return the report as a string
     */
    public String getDetailedHealthReport() {
        try {
            ClusterHealthRequest request = new ClusterHealthRequest();
            request.level(ClusterHealthRequest.Level.INDICES); // include index-level details
            ClusterHealthResponse response = client.cluster().health(request, RequestOptions.DEFAULT);

            StringBuilder report = new StringBuilder();
            report.append("==== Detailed Cluster Health Report ====\n");
            report.append("Status: ").append(response.getStatus()).append("\n");
            report.append("Nodes: ").append(response.getNumberOfNodes()).append("\n");
            report.append("Data nodes: ").append(response.getNumberOfDataNodes()).append("\n");
            report.append("Active shards: ").append(response.getActiveShards()).append("\n");
            report.append("Active primary shards: ").append(response.getActivePrimaryShards()).append("\n");
            report.append("Relocating shards: ").append(response.getRelocatingShards()).append("\n");
            report.append("Initializing shards: ").append(response.getInitializingShards()).append("\n");
            report.append("Unassigned shards: ").append(response.getUnassignedShards()).append("\n");
            report.append("Delayed unassigned shards: ").append(response.getDelayedUnassignedShards()).append("\n");
            report.append("Pending tasks: ").append(response.getNumberOfPendingTasks()).append("\n");
            report.append("In-flight fetches: ").append(response.getNumberOfInFlightFetch()).append("\n");

            // Log the full report only at DEBUG level
            if (logger.isDebugEnabled()) {
                logger.debug(report.toString());
            }

            return report.toString();
        } catch (IOException e) {
            logger.error("Failed to build detailed cluster health report", e);
            return "Unable to build cluster health report: " + e.getMessage();
        }
    }
}

Snapshot Backup Management

import org.elasticsearch.action.admin.cluster.repositories.put.PutRepositoryRequest;
import org.elasticsearch.action.admin.cluster.snapshots.create.CreateSnapshotRequest;
import org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsRequest;
import org.elasticsearch.action.admin.cluster.snapshots.get.GetSnapshotsResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;
import org.elasticsearch.repositories.fs.FsRepository;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/**
 * Snapshot manager - handles pre-upgrade data backup
 */
public class SnapshotManager {
    private static final Logger logger = LoggerFactory.getLogger(SnapshotManager.class);
    private final RestHighLevelClient client;
    private final RetryHelper retryHelper;

    public SnapshotManager(RestHighLevelClient client, RetryHelper retryHelper) {
        this.client = client;
        this.retryHelper = retryHelper;
    }

    /**
     * Ensure the snapshot repository exists, creating it if necessary.
     * @param repositoryName repository name
     * @param path filesystem path
     * @return whether the repository is available
     */
    public boolean ensureRepositoryExists(String repositoryName, String path) {
        try {
            logger.info("Ensuring snapshot repository {} exists", repositoryName);

            // Probe for the repository by listing its snapshots
            try {
                GetSnapshotsRequest getSnapshotsRequest = new GetSnapshotsRequest(repositoryName);
                getSnapshotsRequest.snapshots(new String[]{"_all"});
                client.snapshot().get(getSnapshotsRequest, RequestOptions.DEFAULT);

                logger.info("Snapshot repository {} already exists", repositoryName);
                return true;
            } catch (Exception e) {
                // Repository missing; create it
                logger.info("Snapshot repository {} does not exist; creating it", repositoryName);

                PutRepositoryRequest request = new PutRepositoryRequest(repositoryName);
                request.type(FsRepository.TYPE);
                request.settings(Settings.builder()
                        .put(FsRepository.LOCATION_SETTING.getKey(), path)
                        .put(FsRepository.COMPRESS_SETTING.getKey(), true)
                        .build());

                client.snapshot().createRepository(request, RequestOptions.DEFAULT);
                logger.info("Snapshot repository {} created", repositoryName);
                return true;
            }
        } catch (IOException e) {
            logger.error("Failed to create snapshot repository", e);
            return false;
        }
    }

    /**
     * Create a cluster snapshot before the upgrade.
     * @param repositoryName repository name
     * @return whether the snapshot succeeded
     */
    public boolean createPreUpgradeSnapshot(String repositoryName) {
        try {
            String timestamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss"));
            String snapshotName = "pre_upgrade_" + timestamp;

            logger.info("Creating pre-upgrade snapshot: {}/{}", repositoryName, snapshotName);

            CreateSnapshotRequest request = new CreateSnapshotRequest(repositoryName, snapshotName);
            request.waitForCompletion(true);

            client.snapshot().create(request, RequestOptions.DEFAULT);
            logger.info("Pre-upgrade snapshot {} created", snapshotName);
            return true;
        } catch (Exception e) {
            logger.error("Failed to create pre-upgrade snapshot", e);
            return false;
        }
    }

    /**
     * Validate a snapshot's state.
     * @param repositoryName repository name
     * @param snapshotName snapshot name
     * @return whether the snapshot completed successfully
     */
    public boolean validateSnapshot(String repositoryName, String snapshotName) {
        try {
            logger.info("Validating snapshot state: {}/{}", repositoryName, snapshotName);

            GetSnapshotsRequest request = new GetSnapshotsRequest(repositoryName);
            request.snapshots(new String[]{snapshotName});

            GetSnapshotsResponse response = client.snapshot().get(request, RequestOptions.DEFAULT);

            boolean success = response.getSnapshots().stream()
                    .anyMatch(snapshot -> "SUCCESS".equals(snapshot.state().name()));

            if (success) {
                logger.info("Snapshot {} is healthy", snapshotName);
            } else {
                logger.warn("Snapshot {} is in an abnormal state", snapshotName);
            }

            return success;
        } catch (IOException e) {
            logger.error("Failed to validate snapshot state", e);
            return false;
        }
    }
}

Rolling Upgrade Procedure

1. Disable Shard Allocation

First, disable shard allocation so that Elasticsearch does not try to rebalance shards while a node is offline:

import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.settings.Settings;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.util.function.Supplier;

/**
 * Shard allocation manager - controls the cluster's shard allocation policy
 */
public class ShardAllocationManager {

    private static final Logger logger = LoggerFactory.getLogger(ShardAllocationManager.class);
    private final RestHighLevelClient client;
    private final RetryHelper retryHelper;

    public ShardAllocationManager(RestHighLevelClient client, RetryHelper retryHelper) {
        this.client = client;
        this.retryHelper = retryHelper;
    }

    /**
     * Set cluster.routing.allocation.enable, retrying on transient failures.
     * @param value "primaries", "none", or "all"
     * @param operation operation name used in error reporting
     */
    private void setAllocation(String value, String operation) {
        Supplier<Boolean> update = () -> {
            try {
                ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
                request.transientSettings(Settings.builder()
                        .put("cluster.routing.allocation.enable", value)
                        .build());

                client.cluster().putSettings(request, RequestOptions.DEFAULT);
                return true;
            } catch (IOException e) {
                // Rethrow as unchecked so RetryHelper actually retries the call
                throw new RuntimeException(e);
            }
        };

        try {
            retryHelper.withRetry(update, 3, 5000);
            logger.info("cluster.routing.allocation.enable set to '{}'", value);
        } catch (Exception e) {
            logger.error("Failed to set shard allocation to '{}'", value, e);
            throw new UpgradeException("Error while updating shard allocation", "cluster", operation, e);
        }
    }

    /**
     * Disable shard allocation except for primaries.
     * Call before shutting a node down to prevent shard rebalancing.
     */
    public void disableShardAllocation() {
        logger.info("Disabling shard allocation (primaries only)");
        setAllocation("primaries", "disable_allocation");
    }

    /**
     * Disable all shard allocation.
     * Suitable for short maintenance operations.
     */
    public void disableAllShardAllocation() {
        logger.info("Disabling all shard allocation");
        setAllocation("none", "disable_all_allocation");
    }

    /**
     * Re-enable all shard allocation.
     * Call after all nodes have been upgraded.
     */
    public void enableShardAllocation() {
        logger.info("Enabling all shard allocation");
        setAllocation("all", "enable_allocation");
    }
}

/**
 * Retry helper - a generic retry mechanism
 */
class RetryHelper {
    private static final Logger logger = LoggerFactory.getLogger(RetryHelper.class);

    /**
     * Execute an operation with retries.
     * @param operation the operation to run
     * @param maxRetries maximum number of attempts
     * @param delayMs delay between attempts in milliseconds
     * @return the operation's result
     * @param <T> result type
     */
    public <T> T withRetry(Supplier<T> operation, int maxRetries, long delayMs) {
        int attempts = 0;
        while (attempts < maxRetries) {
            try {
                return operation.get();
            } catch (Exception e) {
                attempts++;
                if (attempts >= maxRetries) {
                    throw e;
                }
                logger.warn("Operation failed on attempt {}/{}; retrying in {} ms", attempts, maxRetries, delayMs);
                try {
                    Thread.sleep(delayMs);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new RuntimeException("Retry interrupted", ie);
                }
            }
        }
        throw new RuntimeException("Maximum retries reached");
    }
}

/**
 * Upgrade exception - carries context about failures during the upgrade
 */
class UpgradeException extends RuntimeException {
    private final String resource;
    private final String operation;

    public UpgradeException(String message, String resource, String operation) {
        super(message);
        this.resource = resource;
        this.operation = operation;
    }

    public UpgradeException(String message, String resource, String operation, Throwable cause) {
        super(message, cause);
        this.resource = resource;
        this.operation = operation;
    }

    public String getResource() {
        return resource;
    }

    public String getOperation() {
        return operation;
    }
}
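An illustrative bracket around the per-node upgrade loop, so the pieces above can be seen together; `nodesToUpgrade` and `nodeUpgrader` are placeholders for your node list and the NodeUpgrader implementation shown later in this article:

ShardAllocationManager allocationManager = new ShardAllocationManager(client, new RetryHelper());

allocationManager.disableShardAllocation();           // before taking any node down
try {
    for (String node : nodesToUpgrade) {              // placeholder node list
        nodeUpgrader.upgradeNode(node, "7.17.9");     // placeholder upgrader and version
    }
} finally {
    allocationManager.enableShardAllocation();        // always re-enable allocation
}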

2. Managing Index Aliases and ILM Policies

During the upgrade, managing index aliases and ILM policies correctly is essential to business continuity:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.action.admin.indices.alias.get.GetAliasesRequest;
import org.elasticsearch.client.GetAliasesResponse;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Index manager - handles index aliases and ILM policies
 */
public class IndexManager {

    private static final Logger logger = LoggerFactory.getLogger(IndexManager.class);
    private final RestHighLevelClient client;
    private final RetryHelper retryHelper;

    public IndexManager(RestHighLevelClient client, RetryHelper retryHelper) {
        this.client = client;
        this.retryHelper = retryHelper;
    }

    /**
     * Back up the current index alias configuration.
     * @return a map of index name to its aliases
     */
    public Map<String, List<String>> backupAliases() {
        try {
            logger.info("Backing up index alias configuration");

            GetAliasesRequest request = new GetAliasesRequest();
            GetAliasesResponse response = client.indices().getAlias(request, RequestOptions.DEFAULT);

            Map<String, List<String>> aliasMap = new HashMap<>();

            // Flatten the alias metadata into a simple index -> aliases map
            response.getAliases().forEach((indexName, aliasMetaData) -> {
                List<String> aliases = new ArrayList<>();
                aliasMetaData.forEach(alias -> aliases.add(alias.alias()));
                if (!aliases.isEmpty()) {
                    aliasMap.put(indexName, aliases);
                }
            });

            logger.info("Backed up aliases for {} indices", aliasMap.size());
            return aliasMap;
        } catch (IOException e) {
            logger.error("Failed to back up index aliases", e);
            throw new UpgradeException("Error while backing up index aliases", "index", "backup_aliases", e);
        }
    }

    /**
     * Pause ILM policies.
     */
    public void pauseIlmPolicies() {
        try {
            logger.info("Pausing ILM policies");

            Request request = new Request("POST", "/_ilm/stop");
            Response response = client.getLowLevelClient().performRequest(request);

            // Verify the response was acknowledged
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = response.getEntity().getContent()) {
                JsonNode root = mapper.readTree(is);
                boolean acknowledged = root.path("acknowledged").asBoolean();

                if (acknowledged) {
                    logger.info("ILM policies paused");
                } else {
                    logger.warn("Pause ILM request was not acknowledged");
                    throw new UpgradeException("Pausing ILM policies was not acknowledged", "ilm", "pause_policies");
                }
            }
        } catch (IOException e) {
            logger.error("Failed to pause ILM policies", e);
            throw new UpgradeException("Error while pausing ILM policies", "ilm", "pause_policies", e);
        }
    }

    /**
     * Resume ILM policies.
     */
    public void resumeIlmPolicies() {
        try {
            logger.info("Resuming ILM policies");

            Request request = new Request("POST", "/_ilm/start");
            Response response = client.getLowLevelClient().performRequest(request);

            // Verify the response was acknowledged
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = response.getEntity().getContent()) {
                JsonNode root = mapper.readTree(is);
                boolean acknowledged = root.path("acknowledged").asBoolean();

                if (acknowledged) {
                    logger.info("ILM policies resumed");
                } else {
                    logger.warn("Resume ILM request was not acknowledged");
                    throw new UpgradeException("Resuming ILM policies was not acknowledged", "ilm", "resume_policies");
                }
            }
        } catch (IOException e) {
            logger.error("Failed to resume ILM policies", e);
            throw new UpgradeException("Error while resuming ILM policies", "ilm", "resume_policies", e);
        }
    }

    /**
     * Apply special handling to hot indices.
     * @param hotIndices list of hot indices
     */
    public void handleHotIndices(List<String> hotIndices) {
        try {
            logger.info("Handling {} hot indices", hotIndices.size());

            for (String index : hotIndices) {
                logger.info("Handling hot index: {}", index);

                // Pin hot indices with routing rules so they are not shuffled
                // excessively during the upgrade ("upgraded_nodes" is an example node-name filter)
                Request request = new Request("PUT", "/" + index + "/_settings");
                request.setJsonEntity(
                    "{\n" +
                    "  \"index.routing.allocation.require._name\": null,\n" +
                    "  \"index.routing.allocation.include._name\": \"upgraded_nodes\"\n" +
                    "}"
                );

                Response response = client.getLowLevelClient().performRequest(request);
                int statusCode = response.getStatusLine().getStatusCode();

                if (statusCode >= 200 && statusCode < 300) {
                    logger.info("Updated settings for hot index {}", index);
                } else {
                    logger.warn("Failed to update settings for hot index {}, status code: {}", index, statusCode);
                }
            }

            logger.info("Hot index handling complete");
        } catch (IOException e) {
            logger.error("Failed to handle hot indices", e);
            throw new UpgradeException("Error while handling hot indices", "index", "handle_hot_indices", e);
        }
    }
}
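backupAliases() captures the alias layout, but the class has no inverse. A minimal restore sketch that could be added to IndexManager, assuming the standard _aliases actions API and that the backed-up indices still exist:

/**
 * Re-attach previously backed-up aliases via the _aliases API.
 * @param aliasMap map of index name to aliases, as returned by backupAliases()
 */
public void restoreAliases(Map<String, List<String>> aliasMap) throws IOException {
    StringBuilder actions = new StringBuilder("{\"actions\":[");
    boolean first = true;
    for (Map.Entry<String, List<String>> entry : aliasMap.entrySet()) {
        for (String alias : entry.getValue()) {
            if (!first) actions.append(',');
            actions.append("{\"add\":{\"index\":\"").append(entry.getKey())
                   .append("\",\"alias\":\"").append(alias).append("\"}}");
            first = false;
        }
    }
    actions.append("]}");

    Request request = new Request("POST", "/_aliases");
    request.setJsonEntity(actions.toString());
    client.getLowLevelClient().performRequest(request);
}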

3. Managing Cross-Cluster Replication (CCR)

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

/**
 * Cross-cluster replication manager - handles CCR-related configuration
 */
public class CrossClusterReplicationManager {

    private static final Logger logger = LoggerFactory.getLogger(CrossClusterReplicationManager.class);
    private final RestHighLevelClient client;
    private final RetryHelper retryHelper;

    public CrossClusterReplicationManager(RestHighLevelClient client, RetryHelper retryHelper) {
        this.client = client;
        this.retryHelper = retryHelper;
    }

    /**
     * Check whether CCR is enabled on the cluster.
     * @return whether CCR is enabled
     */
    public boolean isCcrEnabled() {
        try {
            logger.info("Checking whether CCR is enabled");

            // Probe the CCR stats endpoint
            Request request = new Request("GET", "/_ccr/stats");
            Response response = client.getLowLevelClient().performRequest(request);

            int statusCode = response.getStatusLine().getStatusCode();
            boolean enabled = statusCode >= 200 && statusCode < 300;

            logger.info("CCR status: {}", enabled ? "enabled" : "disabled");
            return enabled;
        } catch (IOException e) {
            logger.info("CCR is not enabled, or its status could not be fetched");
            return false;
        }
    }

    /**
     * Fetch all follower indices.
     * @return list of follower indices
     */
    public List<String> getFollowerIndices() {
        try {
            logger.info("Fetching all follower indices");

            List<String> followerIndices = new ArrayList<>();
            Request request = new Request("GET", "/_ccr/stats");
            Response response = client.getLowLevelClient().performRequest(request);

            // Parse the response; follower entries live under follow_stats.indices
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = response.getEntity().getContent()) {
                JsonNode root = mapper.readTree(is);
                JsonNode followerStats = root.path("follow_stats").path("indices");

                if (followerStats.isArray()) {
                    for (JsonNode stat : followerStats) {
                        String indexName = stat.path("index").asText();
                        followerIndices.add(indexName);
                    }
                }
            }

            logger.info("Found {} follower indices", followerIndices.size());
            return followerIndices;
        } catch (IOException e) {
            logger.error("Failed to fetch follower indices", e);
            throw new UpgradeException("Error while fetching follower indices", "ccr", "get_followers", e);
        }
    }

    /**
     * Pause cross-cluster replication.
     */
    public void pauseCcrFollowing() {
        try {
            logger.info("Pausing cross-cluster replication");

            // Collect all follower indices
            List<String> followers = getFollowerIndices();
            if (followers.isEmpty()) {
                logger.info("No follower indices found; skipping the CCR pause step");
                return;
            }

            // Pause each follower index
            for (String index : followers) {
                logger.info("Pausing replication for index {}", index);

                Request request = new Request("POST", "/" + index + "/_ccr/pause_follow");
                Response response = client.getLowLevelClient().performRequest(request);

                int statusCode = response.getStatusLine().getStatusCode();
                if (statusCode >= 200 && statusCode < 300) {
                    logger.info("Paused replication for index {}", index);
                } else {
                    logger.warn("Failed to pause replication for index {}, status code: {}", index, statusCode);
                }
            }

            logger.info("All CCR replication paused");
        } catch (Exception e) {
            logger.error("Failed to pause CCR replication", e);
            throw new UpgradeException("Error while pausing CCR replication", "ccr", "pause", e);
        }
    }

    /**
     * Resume cross-cluster replication.
     * @param followerIndices list of follower indices
     */
    public void resumeCcrFollowing(List<String> followerIndices) {
        try {
            logger.info("Resuming cross-cluster replication");

            if (followerIndices.isEmpty()) {
                logger.info("No follower indices to resume");
                return;
            }

            // Resume each follower index
            for (String index : followerIndices) {
                logger.info("Resuming replication for index {}", index);

                Request request = new Request("POST", "/" + index + "/_ccr/resume_follow");
                Response response = client.getLowLevelClient().performRequest(request);

                int statusCode = response.getStatusLine().getStatusCode();
                if (statusCode >= 200 && statusCode < 300) {
                    logger.info("Resumed replication for index {}", index);
                } else {
                    logger.warn("Failed to resume replication for index {}, status code: {}", index, statusCode);
                }
            }

            logger.info("All CCR replication resumed");
        } catch (Exception e) {
            logger.error("Failed to resume CCR replication", e);
            throw new UpgradeException("Error while resuming CCR replication", "ccr", "resume", e);
        }
    }
}
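A usage sketch for the pause/resume pair: capture the follower list before pausing, so the same list can be resumed after the upgrade (the upgrade step itself is elided):

CrossClusterReplicationManager ccrManager =
        new CrossClusterReplicationManager(client, new RetryHelper());

List<String> followers = new ArrayList<>();
if (ccrManager.isCcrEnabled()) {
    followers = ccrManager.getFollowerIndices(); // remember what to resume later
    ccrManager.pauseCcrFollowing();
}

// ... perform the rolling upgrade ...

if (!followers.isEmpty()) {
    ccrManager.resumeCcrFollowing(followers);
}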

4. Node Upgrade Interface and Implementation

Interface abstraction and dependency injection keep the upgrade logic testable (a throwaway fake for unit tests is sketched right after the interface):

/**
 * Node upgrader interface - the basic operations of a node upgrade
 */
public interface NodeUpgrader {
    /**
     * Upgrade a single node.
     * @param nodeHost node hostname or IP
     * @param targetVersion target version
     * @throws UpgradeException if the upgrade fails
     * @return whether the upgrade succeeded
     */
    boolean upgradeNode(String nodeHost, String targetVersion) throws UpgradeException;

    /**
     * Wait for a node to join the cluster.
     * @param nodeHost node hostname or IP
     * @throws UpgradeException if the wait fails
     * @return whether the node joined the cluster
     */
    boolean waitForNodeToJoin(String nodeHost) throws UpgradeException;

    /**
     * Stop the node's service.
     * @param nodeHost node hostname or IP
     * @throws UpgradeException if stopping fails
     * @return whether the node stopped
     */
    boolean stopNode(String nodeHost) throws UpgradeException;

    /**
     * Install the new version.
     * @param nodeHost node hostname or IP
     * @param targetVersion target version
     * @throws UpgradeException if the installation fails
     * @return whether the installation succeeded
     */
    boolean installNewVersion(String nodeHost, String targetVersion) throws UpgradeException;

    /**
     * Start the node's service.
     * @param nodeHost node hostname or IP
     * @throws UpgradeException if starting fails
     * @return whether the node started
     */
    boolean startNode(String nodeHost) throws UpgradeException;
}
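Because the orchestration code depends only on the NodeUpgrader interface, it can be unit-tested without SSH access or a live cluster. A trivial in-memory fake (hypothetical, for illustration):

/**
 * Fake upgrader for tests - records calls instead of touching real nodes
 */
class FakeNodeUpgrader implements NodeUpgrader {
    final java.util.List<String> upgraded = new java.util.ArrayList<>();

    @Override
    public boolean upgradeNode(String nodeHost, String targetVersion) {
        upgraded.add(nodeHost + "@" + targetVersion); // remember what was "upgraded"
        return true;
    }

    @Override
    public boolean waitForNodeToJoin(String nodeHost) { return true; }

    @Override
    public boolean stopNode(String nodeHost) { return true; }

    @Override
    public boolean installNewVersion(String nodeHost, String targetVersion) { return true; }

    @Override
    public boolean startNode(String nodeHost) { return true; }
}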

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import org.apache.commons.pool2.BasePooledObjectFactory;
import org.apache.commons.pool2.PooledObject;
import org.apache.commons.pool2.impl.DefaultPooledObject;
import org.apache.commons.pool2.impl.GenericObjectPool;
import org.apache.commons.pool2.impl.GenericObjectPoolConfig;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.UUID;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/**
 * SSH node upgrader - manages node upgrades over SSH
 */
public class SshNodeUpgrader implements NodeUpgrader {

    private static final Logger logger = LoggerFactory.getLogger(SshNodeUpgrader.class);
    private final RestHighLevelClient esClient;
    private final SshConnectionPool connectionPool;
    private final RetryHelper retryHelper;

    public SshNodeUpgrader(RestHighLevelClient esClient,
                        SshConnectionPool connectionPool,
                        RetryHelper retryHelper) {
        this.esClient = esClient;
        this.connectionPool = connectionPool;
        this.retryHelper = retryHelper;
    }

    @Override
    public boolean upgradeNode(String nodeHost, String targetVersion) throws UpgradeException {
        String upgradeId = MDC.get("upgradeId");
        if (upgradeId == null) {
            upgradeId = UUID.randomUUID().toString().substring(0, 8);
            MDC.put("upgradeId", upgradeId);
        }
        MDC.put("nodeHost", nodeHost);

        logger.info("Starting node upgrade to version {}", targetVersion);

        try {
            // Run the upgrade step by step
            if (!stopNode(nodeHost)) {
                logger.error("Failed to stop the node service");
                return false;
            }

            if (!installNewVersion(nodeHost, targetVersion)) {
                logger.error("Failed to install the new version");
                return false;
            }

            if (!startNode(nodeHost)) {
                logger.error("Failed to start the node service");
                return false;
            }

            // Wait for the node to rejoin the cluster
            return waitForNodeToJoin(nodeHost);
        } catch (Exception e) {
            logger.error("Error during node upgrade", e);
            throw new UpgradeException("Node upgrade failed", nodeHost, "upgrade", e);
        } finally {
            MDC.remove("nodeHost");
        }
    }

    @Override
    public boolean stopNode(String nodeHost) throws UpgradeException {
        logger.info("Stopping node service");

        try {
            String result = executeCommand(nodeHost, "sudo systemctl stop elasticsearch");
            logger.debug("Stop command output: {}", result);

            // Give the service time to stop completely
            Thread.sleep(5000);

            // Verify the service is stopped
            String statusResult = executeCommand(nodeHost, "sudo systemctl status elasticsearch");
            boolean stopped = statusResult.contains("inactive") || statusResult.contains("dead");

            if (stopped) {
                logger.info("Node service stopped");
                return true;
            } else {
                logger.warn("Node service may not have stopped completely");
                return false;
            }
        } catch (Exception e) {
            logger.error("Failed to stop node service", e);
            throw new UpgradeException("Failed to stop node service", nodeHost, "stop", e);
        }
    }

    @Override
    public boolean installNewVersion(String nodeHost, String targetVersion) throws UpgradeException {
        logger.info("Installing new version {}", targetVersion);

        try {
            // Refresh package metadata
            String updateResult = executeCommand(nodeHost, "sudo apt-get update");
            logger.debug("apt-get update output: {}", updateResult);

            // Install the specific version
            String installCommand = String.format(
                "sudo apt-get install -y elasticsearch=%s",
                targetVersion
            );

            String installResult = executeCommand(nodeHost, installCommand);
            logger.info("Install command output length: {} characters", installResult.length());
            logger.debug("Install command output: {}", installResult);

            // Verify the installation
            if (installResult.contains("Setting up elasticsearch") ||
                installResult.contains("is already the newest version")) {
                logger.info("New version installed");
                return true;
            } else {
                logger.error("New version installation may have failed");
                return false;
            }
        } catch (Exception e) {
            logger.error("Failed to install the new version", e);
            throw new UpgradeException("Failed to install the new version", nodeHost, "install", e);
        }
    }

    @Override
    public boolean startNode(String nodeHost) throws UpgradeException {
        logger.info("Starting node service");

        try {
            String result = executeCommand(nodeHost, "sudo systemctl start elasticsearch");
            logger.debug("Start command output: {}", result);

            // Give the service time to start
            Thread.sleep(10000);

            // Verify the service is running
            String statusResult = executeCommand(nodeHost, "sudo systemctl status elasticsearch");
            boolean started = statusResult.contains("active (running)");

            if (started) {
                logger.info("Node service started");
                return true;
            } else {
                logger.warn("Node service may not have started");
                return false;
            }
        } catch (Exception e) {
            logger.error("Failed to start node service", e);
            throw new UpgradeException("Failed to start node service", nodeHost, "start", e);
        }
    }

    @Override
    public boolean waitForNodeToJoin(String nodeHost) throws UpgradeException {
        logger.info("Waiting for node to join the cluster");

        // Wait up to 10 minutes
        for (int i = 0; i < 60; i++) {
            try {
                ClusterHealthRequest healthRequest = new ClusterHealthRequest();
                ClusterHealthResponse healthResponse = esClient.cluster().health(healthRequest, RequestOptions.DEFAULT);

                // Check whether the node is present in the cluster
                boolean nodePresent = checkNodePresence(nodeHost);

                if (nodePresent) {
                    logger.info("Node has joined the cluster");

                    // Wait until shard recovery settles
                    if (healthResponse.getInitializingShards() == 0 &&
                        healthResponse.getRelocatingShards() == 0) {
                        logger.info("Shard recovery complete");
                        return true;
                    } else {
                        logger.info("Waiting for shard recovery: initializing={}, relocating={}",
                            healthResponse.getInitializingShards(),
                            healthResponse.getRelocatingShards());
                    }
                } else {
                    logger.info("Node has not joined the cluster yet");
                }
            } catch (Exception e) {
                logger.warn("Error while checking node status", e);
            }

            // Wait 10 seconds before checking again
            try {
                Thread.sleep(10000);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new UpgradeException("Interrupted while waiting for node to join", nodeHost, "wait_join", ie);
            }
        }

        logger.error("Node failed to join the cluster in time");
        throw new UpgradeException("Node failed to join the cluster in time", nodeHost, "wait_join");
    }

    /**
     * Check whether the node is present in the cluster.
     * @param nodeHost node hostname or IP
     * @return whether the node is present
     */
    private boolean checkNodePresence(String nodeHost) {
        try {
            Request nodesRequest = new Request("GET", "/_cat/nodes?h=ip,name&format=json");
            Response nodesResponse = esClient.getLowLevelClient().performRequest(nodesRequest);

            // Parse the JSON response
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = nodesResponse.getEntity().getContent()) {
                JsonNode[] nodes = mapper.readValue(is, JsonNode[].class);

                for (JsonNode node : nodes) {
                    String ip = node.path("ip").asText();
                    if (ip.equals(nodeHost) || ip.startsWith(nodeHost + ":")) {
                        return true;
                    }
                }
            }
            return false;
        } catch (Exception e) {
            logger.error("Failed to check node presence", e);
            return false;
        }
    }

    /**
     * Execute a remote command over SSH.
     * @param host target host
     * @param command command to run
     * @return command output
     */
    private String executeCommand(String host, String command) throws Exception {
        Session session = null;

        try {
            session = connectionPool.borrowObject(host);
            return executeCommandWithTimeout(session, command, TimeUnit.MINUTES.toMillis(5));
        } finally {
            if (session != null) {
                connectionPool.returnObject(host, session);
            }
        }
    }

    /**
     * Execute a command with a timeout.
     * @param session SSH session
     * @param command command to run
     * @param timeoutMillis timeout in milliseconds
     * @return command output
     */
    private String executeCommandWithTimeout(Session session, String command, long timeoutMillis)
            throws Exception {
        ScheduledExecutorService executorService = Executors.newSingleThreadScheduledExecutor();
        ChannelExec channel = null;

        try {
            channel = (ChannelExec) session.openChannel("exec");
            channel.setCommand(command);

            ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
            channel.setOutputStream(outputStream);
            channel.connect();

            // Capture an effectively-final reference for the lambda
            final ChannelExec watchedChannel = channel;
            Future<?> future = executorService.submit(() -> {
                try {
                    // Poll until the remote command finishes and the channel closes
                    while (!watchedChannel.isClosed()) {
                        Thread.sleep(100);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            try {
                future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                return outputStream.toString();
            } catch (TimeoutException e) {
                future.cancel(true);
                channel.disconnect();
                throw new TimeoutException("Command timed out: " + command);
            }
        } finally {
            executorService.shutdownNow();
            if (channel != null && channel.isConnected()) {
                channel.disconnect();
            }
        }
    }
}

/**
 * SSH connection pool - manages SSH sessions
 */
class SshConnectionPool implements AutoCloseable {
    private static final Logger logger = LoggerFactory.getLogger(SshConnectionPool.class);
    private final Map<String, GenericObjectPool<Session>> hostPools = new HashMap<>();
    private final String username;
    private final String password;
    private final int port;

    public SshConnectionPool(String username, String password, int port) {
        this.username = username;
        this.password = password;
        this.port = port;
    }

    /**
     * Borrow an SSH session for a host.
     * @param host target host
     * @return SSH session
     */
    public synchronized Session borrowObject(String host) throws Exception {
        GenericObjectPool<Session> pool = hostPools.computeIfAbsent(host, this::createPool);
        return pool.borrowObject();
    }

    /**
     * Return an SSH session to the pool.
     * @param host target host
     * @param session SSH session
     */
    public synchronized void returnObject(String host, Session session) {
        GenericObjectPool<Session> pool = hostPools.get(host);
        if (pool != null) {
            pool.returnObject(session);
        }
    }

    /**
     * Create a session pool for a host.
     * @param host target host
     * @return session pool
     */
    private GenericObjectPool<Session> createPool(String host) {
        GenericObjectPoolConfig<Session> config = new GenericObjectPoolConfig<>();
        config.setMaxTotal(5);
        config.setMaxIdle(2);
        config.setMinIdle(1);
        config.setTestOnBorrow(true);
        config.setTestOnReturn(true);

        return new GenericObjectPool<>(new SessionFactory(host, username, password, port), config);
    }

    /**
     * Close all pools.
     */
    @Override
    public synchronized void close() {
        for (GenericObjectPool<Session> pool : hostPools.values()) {
            pool.close();
        }
        hostPools.clear();
    }

    /**
     * SSH session factory - creates and validates SSH sessions
     */
    private static class SessionFactory extends BasePooledObjectFactory<Session> {
        private final String host;
        private final String username;
        private final String password;
        private final int port;

        public SessionFactory(String host, String username, String password, int port) {
            this.host = host;
            this.username = username;
            this.password = password;
            this.port = port;
        }

        @Override
        public Session create() throws Exception {
            JSch jsch = new JSch();
            Session session = jsch.getSession(username, host, port);
            session.setPassword(password);

            Properties config = new Properties();
            // Convenient for automation but weak security; prefer known_hosts in production
            config.put("StrictHostKeyChecking", "no");
            session.setConfig(config);
            session.connect(30000);

            return session;
        }

        @Override
        public PooledObject<Session> wrap(Session session) {
            return new DefaultPooledObject<>(session);
        }

        @Override
        public void destroyObject(PooledObject<Session> pooledObject) {
            Session session = pooledObject.getObject();
            if (session != null && session.isConnected()) {
                session.disconnect();
            }
        }

        @Override
        public boolean validateObject(PooledObject<Session> pooledObject) {
            Session session = pooledObject.getObject();
            return session != null && session.isConnected();
        }
    }
}
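Wiring the SSH stack together might look like the following sketch; the credentials, host, and version are placeholders, and in production key-based authentication with proper host-key verification is preferable to passwords with StrictHostKeyChecking disabled:

// esClient is an existing RestHighLevelClient
try (SshConnectionPool pool = new SshConnectionPool("esadmin", "secret", 22)) {
    NodeUpgrader upgrader = new SshNodeUpgrader(esClient, pool, new RetryHelper());
    upgrader.upgradeNode("10.0.0.11", "7.17.9"); // placeholder host and version
}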

5. Upgrade Monitoring and Metrics Collection

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthRequest;
import org.elasticsearch.action.admin.cluster.health.ClusterHealthResponse;
import org.elasticsearch.action.admin.cluster.settings.ClusterUpdateSettingsRequest;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.client.core.MainResponse;
import org.elasticsearch.common.settings.Settings;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Upgrade monitor - tracks cluster state and upgrade progress in real time
 */
public class UpgradeMonitor implements AutoCloseable {

    private static final Logger logger = LoggerFactory.getLogger(UpgradeMonitor.class);
    private final RestHighLevelClient client;
    private final ScheduledExecutorService scheduler;
    private final UpgradeMetricsCollector metricsCollector;
    private boolean isMonitoring = false;

    public UpgradeMonitor(RestHighLevelClient client) {
        this.client = client;
        this.scheduler = Executors.newScheduledThreadPool(1);
        this.metricsCollector = new UpgradeMetricsCollector();
    }

    /**
     * Start monitoring the cluster.
     * @param intervalSeconds polling interval in seconds
     */
    public void startMonitoring(int intervalSeconds) {
        if (isMonitoring) {
            logger.info("Monitoring is already running");
            return;
        }

        isMonitoring = true;
        logger.info("Starting cluster monitoring, interval {} seconds", intervalSeconds);

        scheduler.scheduleAtFixedRate(() -> {
            try {
                checkAndLogClusterStatus();
            } catch (Exception e) {
                logger.error("Error while monitoring cluster state", e);
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    /**
     * Stop monitoring.
     */
    public void stopMonitoring() {
        if (!isMonitoring) {
            return;
        }

        logger.info("Stopping cluster monitoring");
        scheduler.shutdown();
        isMonitoring = false;

        try {
            if (!scheduler.awaitTermination(5, TimeUnit.SECONDS)) {
                scheduler.shutdownNow();
            }
        } catch (InterruptedException e) {
            scheduler.shutdownNow();
            Thread.currentThread().interrupt();
        }
    }

    /**
     * Check and log the cluster state.
     */
    private void checkAndLogClusterStatus() {
        try {
            // Cluster version
            MainResponse info = client.info(RequestOptions.DEFAULT);
            String version = info.getVersion().getNumber();

            // Cluster health
            ClusterHealthRequest healthRequest = new ClusterHealthRequest();
            ClusterHealthResponse healthResponse = client.cluster().health(healthRequest, RequestOptions.DEFAULT);

            StringBuilder status = new StringBuilder();
            status.append("\n==== Cluster Monitoring Status ====\n");
            status.append("Time: ").append(System.currentTimeMillis()).append("\n");
            status.append("Version: ").append(version).append("\n");
            status.append("Status: ").append(healthResponse.getStatus()).append("\n");
            status.append("Nodes: ").append(healthResponse.getNumberOfNodes()).append("\n");
            status.append("Unassigned shards: ").append(healthResponse.getUnassignedShards()).append("\n");
            status.append("Initializing shards: ").append(healthResponse.getInitializingShards()).append("\n");
            status.append("Relocating shards: ").append(healthResponse.getRelocatingShards()).append("\n");

            logger.info(status.toString());

            // Record metrics
            metricsCollector.recordClusterState(healthResponse.getStatus().name(),
                                          healthResponse.getNumberOfNodes(),
                                          healthResponse.getUnassignedShards(),
                                          healthResponse.getInitializingShards());

            // Check shard recovery progress
            checkRecoveryProgress();
        } catch (IOException e) {
            logger.error("Failed to fetch cluster state", e);
        }
    }

    /**
     * Check shard recovery progress.
     */
    private void checkRecoveryProgress() {
        try {
            Request recoveryRequest = new Request("GET", "/_cat/recovery?active_only=true&format=json");
            Response recoveryResponse = client.getLowLevelClient().performRequest(recoveryRequest);

            // Parse the JSON response
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = recoveryResponse.getEntity().getContent()) {
                JsonNode[] recoveries = mapper.readValue(is, JsonNode[].class);

                if (recoveries.length > 0) {
                    logger.info("{} shards currently recovering", recoveries.length);

                    // Log per-shard progress (cat recovery exposes bytes_percent)
                    for (JsonNode recovery : recoveries) {
                        String index = recovery.path("index").asText();
                        String stage = recovery.path("stage").asText();
                        String progress = recovery.has("bytes_percent") ? recovery.path("bytes_percent").asText() : "N/A";

                        logger.debug("Shard recovery: index={}, stage={}, progress={}", index, stage, progress);
                    }
                } else {
                    logger.debug("No shards recovering");
                }
            }
        } catch (IOException e) {
            logger.error("Failed to fetch recovery progress", e);
        }
    }

    /**
     * Get the metrics collector.
     * @return metrics collector
     */
    public UpgradeMetricsCollector getMetricsCollector() {
        return metricsCollector;
    }

    @Override
    public void close() {
        stopMonitoring();
    }
}

/**
 * Upgrade metrics collector - gathers performance metrics during the upgrade
 */
public class UpgradeMetricsCollector {
    private static final Logger logger = LoggerFactory.getLogger(UpgradeMetricsCollector.class);

    private final Map<String, Long> nodeUpgradeTimes = new HashMap<>();
    private final Map<String, Long> nodeDowntimes = new HashMap<>();
    private final Map<String, Integer> clusterStateChanges = new HashMap<>();
    private final Map<String, Long> shardRecoveryTimes = new HashMap<>();

    private long upgradeStartTime;
    private long upgradeEndTime;

    /**
     * Start the upgrade timer.
     */
    public void startUpgradeTimer() {
        upgradeStartTime = System.currentTimeMillis();
    }

    /**
     * Stop the upgrade timer.
     */
    public void endUpgradeTimer() {
        upgradeEndTime = System.currentTimeMillis();
    }

    /**
     * Record a node's downtime.
     * @param node node name
     * @param startTime start timestamp
     * @param endTime end timestamp
     */
    public void recordNodeDowntime(String node, long startTime, long endTime) {
        nodeDowntimes.put(node, endTime - startTime);
    }

    /**
     * Record a node's upgrade duration.
     * @param node node name
     * @param duration duration in milliseconds
     */
    public void recordNodeUpgradeTime(String node, long duration) {
        nodeUpgradeTimes.put(node, duration);
    }

    /**
     * Record a cluster state sample.
     * @param state cluster status
     * @param nodeCount number of nodes
     * @param unassignedShards unassigned shard count
     * @param initializingShards initializing shard count
     */
    public void recordClusterState(String state, int nodeCount, int unassignedShards, int initializingShards) {
        clusterStateChanges.put("state_" + System.currentTimeMillis(),
                          unassignedShards + initializingShards);
    }

    /**
     * Record a shard recovery duration.
     * @param indexName index name
     * @param duration recovery duration in milliseconds
     */
    public void recordShardRecoveryTime(String indexName, long duration) {
        shardRecoveryTimes.put(indexName, duration);
    }

    /**
     * Generate the upgrade performance report.
     * @return the report as a string
     */
    public String generateReport() {
        StringBuilder report = new StringBuilder("=== Upgrade Performance Report ===\n\n");

        // Total upgrade time
        long totalUpgradeTime = upgradeEndTime - upgradeStartTime;
        report.append(String.format("Total upgrade time: %.2f minutes\n", totalUpgradeTime / 60000.0));

        // Per-node upgrade times
        report.append("\nNode upgrade times:\n");
        if (!nodeUpgradeTimes.isEmpty()) {
            double avgUpgradeTime = calculateAverage(nodeUpgradeTimes);
            report.append(String.format("Average node upgrade time: %.2f minutes\n", avgUpgradeTime / 60000.0));

            nodeUpgradeTimes.forEach((node, time) ->
                report.append(String.format("  - %s: %.2f minutes\n", node, time / 60000.0)));
        } else {
            report.append("No node upgrade time data\n");
        }

        // Per-node downtimes
        report.append("\nNode downtimes:\n");
        if (!nodeDowntimes.isEmpty()) {
            double avgDowntime = calculateAverage(nodeDowntimes);
            report.append(String.format("Average node downtime: %.2f minutes\n", avgDowntime / 60000.0));

            nodeDowntimes.forEach((node, time) ->
                report.append(String.format("  - %s: %.2f minutes\n", node, time / 60000.0)));
        } else {
            report.append("No node downtime data\n");
        }

        // Shard recovery times
        report.append("\nShard recovery times:\n");
        if (!shardRecoveryTimes.isEmpty()) {
            double avgRecoveryTime = calculateAverage(shardRecoveryTimes);
            report.append(String.format("Average shard recovery time: %.2f seconds\n", avgRecoveryTime / 1000.0));
        } else {
            report.append("No shard recovery time data\n");
        }

        return report.toString();
    }

    /**
     * Compute the average of a map's values.
     * @param values value map
     * @return average value
     */
    private double calculateAverage(Map<String, Long> values) {
        return values.values().stream()
                .mapToLong(v -> v)
                .average()
                .orElse(0);
    }
}
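A usage sketch for the monitor, assuming a RestHighLevelClient and an SLF4J logger are in scope; try-with-resources guarantees the scheduler is shut down:

try (UpgradeMonitor monitor = new UpgradeMonitor(client)) {
    UpgradeMetricsCollector metrics = monitor.getMetricsCollector();
    metrics.startUpgradeTimer();
    monitor.startMonitoring(30);      // poll cluster state every 30 seconds

    // ... perform the rolling upgrade ...

    metrics.endUpgradeTimer();
    logger.info(metrics.generateReport());
}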

/**
 * Recovery settings manager - controls the speed and concurrency of shard recovery
 */
public class RecoveryManager {

    private static final Logger logger = LoggerFactory.getLogger(RecoveryManager.class);
    private final RestHighLevelClient client;
    private final RetryHelper retryHelper;

    public RecoveryManager(RestHighLevelClient client, RetryHelper retryHelper) {
        this.client = client;
        this.retryHelper = retryHelper;
    }

    /**
     * Optimize recovery settings for clusters with many large indices
     * @param totalShards total shard count
     * @param totalDataNodes number of data nodes
     * @param availableMemoryGB available memory in GB
     */
    public void optimizeForLargeCluster(int totalShards, int totalDataNodes, int availableMemoryGB) {
        try {
            logger.info("Optimizing recovery settings for a large cluster: shards={}, dataNodes={}, memory={}GB",
                     totalShards, totalDataNodes, availableMemoryGB);

            // Average shards per node
            int shardsPerNode = totalShards / totalDataNodes;

            // Scale concurrent recoveries by available memory and shard density,
            // clamped to at least 1 so sparse clusters never end up with 0
            int concurrentRecoveries = Math.max(1, Math.min(
                Math.max(1, availableMemoryGB / 4), // allow 1 concurrent recovery per 4GB of memory
                Math.min(6, shardsPerNode / 20)     // at most 6, or 1 per 20 shards
            ));

            // Pick a transfer rate proportional to available memory
            String bytesPerSec = (availableMemoryGB > 64) ? "120mb" :
                               (availableMemoryGB > 32) ? "80mb" : "40mb";

            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.transientSettings(Settings.builder()
                    .put("indices.recovery.max_bytes_per_sec", bytesPerSec)
                    .put("cluster.routing.allocation.node_concurrent_recoveries", concurrentRecoveries)
                    .build());

            client.cluster().putSettings(request, RequestOptions.DEFAULT);
            logger.info("Large-cluster recovery settings applied: rate={}, concurrentRecoveries={}",
                     bytesPerSec, concurrentRecoveries);
        } catch (IOException e) {
            logger.error("Failed to optimize large-cluster recovery settings", e);
            throw new UpgradeException("Error while optimizing recovery settings", "recovery", "optimize_large", e);
        }
    }

    /**
     * Tune recovery settings to speed up the recovery process.
     * Useful when a large cluster recovers too slowly.
     */
    public void optimizeForFastRecovery() {
        try {
            logger.info("Switching recovery settings to fast-recovery mode");

            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.transientSettings(Settings.builder()
                    // Raise the recovery transfer rate
                    .put("indices.recovery.max_bytes_per_sec", "100mb")
                    // Allow more concurrent recoveries per node
                    .put("cluster.routing.allocation.node_concurrent_recoveries", 4)
                    // Allow more initial primary recoveries per node (applies after a node restart)
                    .put("cluster.routing.allocation.node_initial_primaries_recoveries", 8)
                    // Transfer recovery data in more parallel file chunks
                    .put("indices.recovery.max_concurrent_file_chunks", 8)
                    .build());

            client.cluster().putSettings(request, RequestOptions.DEFAULT);
            logger.info("Cluster recovery settings switched to fast-recovery mode");
        } catch (IOException e) {
            logger.error("Failed to adjust recovery settings", e);
            throw new UpgradeException("Error while adjusting recovery settings", "recovery", "optimize_fast", e);
        }
    }

    /**
     * Tune recovery settings for a heavily loaded production environment,
     * so that recovery does not impact live traffic.
     */
    public void optimizeForProductionLoad() {
        try {
            logger.info("Switching recovery settings to production-load mode");

            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.transientSettings(Settings.builder()
                    // Cap recovery bandwidth
                    .put("indices.recovery.max_bytes_per_sec", "40mb")
                    // Fewer concurrent recoveries per node
                    .put("cluster.routing.allocation.node_concurrent_recoveries", 2)
                    // Fewer initial primary recoveries per node
                    .put("cluster.routing.allocation.node_initial_primaries_recoveries", 4)
                    .build());

            client.cluster().putSettings(request, RequestOptions.DEFAULT);
            logger.info("Cluster recovery settings switched to production-load mode");
        } catch (IOException e) {
            logger.error("Failed to adjust recovery settings", e);
            throw new UpgradeException("Error while adjusting recovery settings", "recovery", "optimize_production", e);
        }
    }

    /**
     * Restore the default recovery settings by clearing the transient overrides.
     */
    public void restoreDefaultRecoverySettings() {
        try {
            logger.info("Restoring default recovery settings");

            // Setting a transient setting to null removes the override,
            // letting the cluster fall back to its built-in defaults
            ClusterUpdateSettingsRequest request = new ClusterUpdateSettingsRequest();
            request.transientSettings(Settings.builder()
                    .putNull("indices.recovery.max_bytes_per_sec")
                    .putNull("cluster.routing.allocation.node_concurrent_recoveries")
                    .putNull("cluster.routing.allocation.node_initial_primaries_recoveries")
                    .putNull("indices.recovery.max_concurrent_file_chunks")
                    .build());

            client.cluster().putSettings(request, RequestOptions.DEFAULT);
            logger.info("Default recovery settings restored");
        } catch (IOException e) {
            logger.error("Failed to restore default recovery settings", e);
            throw new UpgradeException("Error while restoring default recovery settings", "recovery", "restore_defaults", e);
        }
    }
}
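
In practice the two profiles above are combined around a maintenance window: use fast recovery when a window has been approved, fall back to the production profile otherwise, and always restore the defaults afterwards. A minimal sketch (the inMaintenanceWindow flag and the Runnable wrapper are hypothetical inputs for illustration):

// Sketch: choose a recovery profile around the upgrade and always clean up
public boolean upgradeWithRecoveryProfile(RecoveryManager recovery,
                                          Runnable rollingUpgrade,
                                          boolean inMaintenanceWindow) {
    if (inMaintenanceWindow) {
        recovery.optimizeForFastRecovery();     // prioritize recovery throughput
    } else {
        recovery.optimizeForProductionLoad();   // protect live query traffic
    }
    try {
        rollingUpgrade.run();                   // the actual upgrade steps
        return true;
    } finally {
        // Restore defaults even if the upgrade fails midway
        recovery.restoreDefaultRecoverySettings();
    }
}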

Emergency Rollback Plan

If problems occur during the upgrade, the following rollback procedure can be executed:

import org.elasticsearch.client.RestHighLevelClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.util.List;
import java.util.ArrayList;

/**
 * Upgrade rollback manager - handles emergency rollback when an upgrade fails
 */
public class RollbackManager {

    private static final Logger logger = LoggerFactory.getLogger(RollbackManager.class);
    private final NodeUpgrader nodeUpgrader;
    private final ShardAllocationManager shardManager;
    private final IndexManager indexManager;
    private final CrossClusterReplicationManager ccrManager;
    private final String previousVersion;
    private final RetryHelper retryHelper;
    private final UpgradeMetricsCollector metricsCollector;
    private final RestHighLevelClient client;

    public RollbackManager(NodeUpgrader nodeUpgrader,
                         ShardAllocationManager shardManager,
                         IndexManager indexManager,
                         CrossClusterReplicationManager ccrManager,
                         String previousVersion,
                         RetryHelper retryHelper,
                         UpgradeMetricsCollector metricsCollector,
                         RestHighLevelClient client) {
        this.nodeUpgrader = nodeUpgrader;
        this.shardManager = shardManager;
        this.indexManager = indexManager;
        this.ccrManager = ccrManager;
        this.previousVersion = previousVersion;
        this.retryHelper = retryHelper;
        this.metricsCollector = metricsCollector;
        this.client = client;
    }

    /**
     * Perform an emergency rollback
     * @param failedNode the node that failed
     * @param upgradedNodes nodes that were already upgraded
     * @return whether the rollback succeeded
     */
    public boolean performEmergencyRollback(String failedNode, List<String> upgradedNodes) {
        logger.warn("Starting emergency rollback, beginning with node {}", failedNode);
        long rollbackStartTime = System.currentTimeMillis();

        try {
            // 1. Put shard allocation into a safe state
            shardManager.disableShardAllocation();
            logger.info("Shard allocation disabled in preparation for rollback");

            // 2. Roll back the failed node
            logger.info("Rolling back failed node: {}", failedNode);
            long nodeStartTime = System.currentTimeMillis();
            boolean failedNodeRollback = retryHelper.withRetry(
                () -> nodeUpgrader.upgradeNode(failedNode, previousVersion),
                3, 10000);

            if (!failedNodeRollback) {
                logger.error("Rollback of failed node {} did not succeed", failedNode);
                return false;
            }
            metricsCollector.recordNodeUpgradeTime("rollback_" + failedNode,
                                             System.currentTimeMillis() - nodeStartTime);

            // 3. Roll back the already upgraded nodes
            List<String> successfulRollbacks = new ArrayList<>();
            for (String node : upgradedNodes) {
                logger.info("Rolling back node: {}", node);
                nodeStartTime = System.currentTimeMillis();
                boolean nodeRollback = retryHelper.withRetry(
                    () -> nodeUpgrader.upgradeNode(node, previousVersion),
                    3, 10000);

                if (nodeRollback) {
                    successfulRollbacks.add(node);
                    metricsCollector.recordNodeUpgradeTime("rollback_" + node,
                                                     System.currentTimeMillis() - nodeStartTime);
                } else {
                    logger.error("Rollback of node {} failed", node);
                    // Keep trying to roll back the remaining nodes
                }
            }

            // 4. Resume ILM policies
            indexManager.resumeIlmPolicies();

            // 5. If CCR was in use, resume the follower indices
            if (ccrManager.isCcrEnabled()) {
                List<String> followers = ccrManager.getFollowerIndices();
                if (!followers.isEmpty()) {
                    ccrManager.resumeCcrFollowing(followers);
                }
            }

            // 6. Re-enable shard allocation
            shardManager.enableShardAllocation();
            logger.info("Shard allocation re-enabled");

            // 7. Verify the rollback
            boolean verificationSuccess = verifyRollbackStatus();

            logger.info("Emergency rollback finished, total time {} seconds",
                      (System.currentTimeMillis() - rollbackStartTime) / 1000);

            return verificationSuccess &&
                   (successfulRollbacks.size() == upgradedNodes.size());
        } catch (Exception e) {
            logger.error("Error while performing emergency rollback", e);
            return false;
        }
    }

    /**
     * Verify the cluster state after a rollback
     * @return whether the cluster is healthy
     */
    public boolean verifyRollbackStatus() {
        try {
            logger.info("Verifying cluster state after rollback");

            // Give the cluster time to stabilize
            Thread.sleep(30000);

            // Check cluster health
            ClusterHealthChecker healthChecker = new ClusterHealthChecker(client);
            boolean isHealthy = healthChecker.isClusterHealthy();

            if (isHealthy) {
                logger.info("Cluster state is normal after rollback");
            } else {
                logger.warn("Cluster state is abnormal after rollback, manual intervention may be required");
                logger.info(healthChecker.getDetailedHealthReport());
            }

            return isHealthy;
        } catch (InterruptedException e) {
            // Preserve the interrupt flag instead of swallowing it
            Thread.currentThread().interrupt();
            logger.error("Interrupted while verifying rollback status", e);
            return false;
        } catch (Exception e) {
            logger.error("Error while verifying rollback status", e);
            return false;
        }
    }

    /**
     * Generate a rollback report
     * @param failedNode the node that failed
     * @param upgradedNodes nodes that had been upgraded
     * @param successfulRollbacks nodes that were rolled back successfully
     * @return rollback report
     */
    public String generateRollbackReport(String failedNode,
                                      List<String> upgradedNodes,
                                      List<String> successfulRollbacks) {
        StringBuilder report = new StringBuilder("=== Rollback Report ===\n\n");

        report.append("Failed node: ").append(failedNode).append("\n");
        report.append("Nodes upgraded: ").append(upgradedNodes.size()).append("\n");
        report.append("Nodes rolled back successfully: ").append(successfulRollbacks.size()).append("\n");

        if (successfulRollbacks.size() < upgradedNodes.size()) {
            report.append("\nNodes not rolled back:\n");
            for (String node : upgradedNodes) {
                if (!successfulRollbacks.contains(node)) {
                    report.append("  - ").append(node).append("\n");
                }
            }
        }

        return report.toString();
    }
}

Rollback flow diagram:

回滚流程图.png

Real-World Case: Upgrading a Government Data Analytics Platform

We performed a rolling upgrade of a government data analytics platform from Elasticsearch 7.10 to 7.16. The cluster comprised 21 nodes across 2 data centers, stored more than 18TB of data, and handled millions of query requests per day.

Challenges

  • Strict security and compliance requirements
  • A multi-tenant environment where departments have different SLAs
  • A large number of custom plugins and analytics modules
  • Detailed audit records were required

Solution: we implemented a complete upgrade framework:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestHighLevelClient;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.slf4j.MDC;

import java.io.Closeable;
import java.io.FileWriter;
import java.io.IOException;
import java.io.InputStream;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/**
 * Upgrade coordinator - orchestrates the entire upgrade flow
 */
public class UpgradeCoordinator implements Closeable {

    private static final Logger logger = LoggerFactory.getLogger(UpgradeCoordinator.class);
    private final String upgradeId;
    private final UpgradeConfig config;
    private final PreUpgradeChecker preChecker;
    private final ElasticsearchRollingUpgrader upgrader;
    private final PostUpgradeValidator postValidator;
    private final UpgradeReporter reporter;

    /**
     * Constructor
     * @param upgrader upgrade executor
     * @param config upgrade configuration
     * @param preChecker pre-upgrade checker
     * @param postValidator post-upgrade validator
     * @param reporter report generator
     */
    public UpgradeCoordinator(
            ElasticsearchRollingUpgrader upgrader,
            UpgradeConfig config,
            PreUpgradeChecker preChecker,
            PostUpgradeValidator postValidator,
            UpgradeReporter reporter) {
        this.upgradeId = UUID.randomUUID().toString().substring(0, 8);
        this.upgrader = upgrader;
        this.config = config;
        this.preChecker = preChecker;
        this.postValidator = postValidator;
        this.reporter = reporter;
    }

    /**
     * Coordinate the whole upgrade flow
     * @return whether the upgrade succeeded
     */
    public boolean coordinate() {
        MDC.put("upgradeId", upgradeId);
        logger.info("Coordinating Elasticsearch cluster upgrade, target version: {}", config.getTargetVersion());

        try {
            // Record the start time
            long startTime = System.currentTimeMillis();
            reporter.recordStartTime(startTime);

            // Run pre-upgrade checks
            logger.info("Running pre-upgrade checks");
            if (!preChecker.performChecks()) {
                logger.error("Pre-upgrade checks failed, aborting upgrade");
                reporter.recordFailure("Pre-upgrade checks failed");
                return false;
            }

            // Execute the upgrade
            logger.info("Starting the upgrade");
            boolean upgradeSuccess = upgrader.performRollingUpgrade();

            // Run post-upgrade validation
            if (upgradeSuccess) {
                logger.info("Upgrade succeeded, running post-upgrade validation");
                boolean validationSuccess = postValidator.validateUpgrade();

                if (validationSuccess) {
                    logger.info("Post-upgrade validation passed, upgrade successful");
                    reporter.recordSuccess();
                } else {
                    logger.error("Post-upgrade validation failed, the upgrade may be incomplete");
                    reporter.recordFailure("Post-upgrade validation failed");
                    upgradeSuccess = false;
                }
            } else {
                logger.error("Upgrade failed");
                reporter.recordFailure("Upgrade process failed");
            }

            // Record the end time
            long endTime = System.currentTimeMillis();
            reporter.recordEndTime(endTime);

            // Generate the upgrade report
            String reportPath = reporter.generateReport();
            logger.info("Upgrade report generated: {}", reportPath);

            return upgradeSuccess;
        } catch (Exception e) {
            logger.error("Error while coordinating the upgrade", e);
            reporter.recordFailure("Coordination error: " + e.getMessage());
            return false;
        } finally {
            MDC.remove("upgradeId");
        }
    }

    @Override
    public void close() {
        try {
            upgrader.close();
        } catch (Exception e) {
            logger.error("Error while closing the upgrader", e);
        }
    }
}

/**
 * Pre-upgrade checker - runs all checks before the upgrade
 */
class PreUpgradeChecker {
    private static final Logger logger = LoggerFactory.getLogger(PreUpgradeChecker.class);

    private final ClusterHealthChecker healthChecker;
    private final VersionCompatibilityChecker versionChecker;
    private final SnapshotManager snapshotManager;
    private final UpgradeConfig config;

    public PreUpgradeChecker(
            ClusterHealthChecker healthChecker,
            VersionCompatibilityChecker versionChecker,
            SnapshotManager snapshotManager,
            UpgradeConfig config) {
        this.healthChecker = healthChecker;
        this.versionChecker = versionChecker;
        this.snapshotManager = snapshotManager;
        this.config = config;
    }

    /**
     * Run all pre-upgrade checks
     * @return whether every check passed
     */
    public boolean performChecks() {
        logger.info("Running pre-upgrade checks");

        // Check cluster health
        if (!healthChecker.isClusterHealthy()) {
            logger.error("Cluster state is abnormal, aborting upgrade");
            return false;
        }

        // Check version compatibility
        if (!versionChecker.isCompatibleUpgrade(config.getTargetVersion())) {
            logger.error("Version compatibility check failed, aborting upgrade");
            return false;
        }

        // Check for large operations in progress
        if (!healthChecker.noOngoingOperations()) {
            logger.error("Large operations in progress detected, aborting upgrade");
            return false;
        }

        // Create a pre-upgrade snapshot
        if (config.isCreateSnapshot()) {
            logger.info("Preparing to create a pre-upgrade snapshot");

            // Make sure the snapshot repository exists
            if (!snapshotManager.ensureRepositoryExists(config.getSnapshotRepository(),
                                                   config.getSnapshotPath())) {
                logger.error("Snapshot repository configuration failed, aborting upgrade");
                return false;
            }

            // Create the snapshot
            if (!snapshotManager.createPreUpgradeSnapshot(config.getSnapshotRepository())) {
                logger.error("Failed to create the pre-upgrade snapshot, aborting upgrade");
                return false;
            }
        }

        logger.info("All pre-upgrade checks passed");
        return true;
    }
}

/**
 * Post-upgrade validator - verifies the upgrade result
 */
class PostUpgradeValidator {
    private static final Logger logger = LoggerFactory.getLogger(PostUpgradeValidator.class);

    private final ClusterHealthChecker healthChecker;
    private final PluginCompatibilityChecker pluginChecker;
    private final FunctionalTester functionalTester;

    public PostUpgradeValidator(
            ClusterHealthChecker healthChecker,
            PluginCompatibilityChecker pluginChecker,
            FunctionalTester functionalTester) {
        this.healthChecker = healthChecker;
        this.pluginChecker = pluginChecker;
        this.functionalTester = functionalTester;
    }

    /**
     * Validate the upgrade result
     * @return whether validation passed
     */
    public boolean validateUpgrade() {
        logger.info("Validating the upgrade result");

        // Verify cluster health
        if (!healthChecker.isClusterHealthy()) {
            logger.error("Cluster state is abnormal after the upgrade");
            return false;
        }

        // Verify plugin compatibility
        if (!pluginChecker.verifyAllPlugins()) {
            logger.error("Plugin compatibility validation failed after the upgrade");
            return false;
        }

        // Run functional tests
        if (!functionalTester.runAllTests()) {
            logger.error("Functional tests failed after the upgrade");
            return false;
        }

        logger.info("All validations passed, upgrade successful");
        return true;
    }
}

/**
 * Plugin compatibility checker
 */
class PluginCompatibilityChecker {
    private static final Logger logger = LoggerFactory.getLogger(PluginCompatibilityChecker.class);
    private final RestHighLevelClient client;

    public PluginCompatibilityChecker(RestHighLevelClient client) {
        this.client = client;
    }

    /**
     * Verify all plugins
     * @return whether verification passed
     */
    public boolean verifyAllPlugins() {
        try {
            logger.info("Verifying compatibility of all plugins");

            Request pluginsRequest = new Request("GET", "/_cat/plugins?format=json");
            Response pluginsResponse = client.getLowLevelClient().performRequest(pluginsRequest);

            // Parse the JSON response
            ObjectMapper mapper = new ObjectMapper();
            try (InputStream is = pluginsResponse.getEntity().getContent()) {
                JsonNode[] plugins = mapper.readValue(is, JsonNode[].class);

                for (JsonNode plugin : plugins) {
                    String name = plugin.path("component").asText();
                    String version = plugin.path("version").asText();

                    logger.info("Detected plugin: {} (version: {})", name, version);

                    // A real project should verify each plugin's functionality
                    if (!verifyPluginFunctionality(name)) {
                        logger.error("Functional verification of plugin {} failed", name);
                        return false;
                    }
                }
            }

            logger.info("All plugins verified");
            return true;
        } catch (IOException e) {
            logger.error("Failed to verify plugin compatibility", e);
            return false;
        }
    }

    /**
     * Verify a plugin's functionality
     * @param pluginName plugin name
     * @return whether verification passed
     */
    private boolean verifyPluginFunctionality(String pluginName) {
        // Run different verifications depending on the plugin type;
        // simplified here - a real implementation should call the appropriate APIs
        return true;
    }
}

/**
 * Functional tester
 */
class FunctionalTester {
    private static final Logger logger = LoggerFactory.getLogger(FunctionalTester.class);
    private final RestHighLevelClient client;

    public FunctionalTester(RestHighLevelClient client) {
        this.client = client;
    }

    /**
     * Run all functional tests
     * @return whether the tests passed
     */
    public boolean runAllTests() {
        logger.info("Running functional tests");

        // Test index operations
        if (!testIndexOperations()) {
            return false;
        }

        // Test search
        if (!testSearchFunctionality()) {
            return false;
        }

        // Test aggregations
        if (!testAggregationFunctionality()) {
            return false;
        }

        logger.info("All functional tests passed");
        return true;
    }

    /**
     * Test index operations
     * @return whether the test passed
     */
    private boolean testIndexOperations() {
        try {
            logger.info("Testing index operations");

            // Create a test index
            String indexName = "test_upgrade_" + System.currentTimeMillis();
            Request createRequest = new Request("PUT", "/" + indexName);
            createRequest.setJsonEntity("{ \"settings\": { \"number_of_shards\": 1, \"number_of_replicas\": 0 } }");

            Response createResponse = client.getLowLevelClient().performRequest(createRequest);
            boolean created = createResponse.getStatusLine().getStatusCode() == 200;

            if (!created) {
                logger.error("Failed to create the test index");
                return false;
            }

            // Index a test document
            Request indexRequest = new Request("POST", "/" + indexName + "/_doc");
            indexRequest.setJsonEntity("{ \"test\": \"data\", \"timestamp\": \"" + System.currentTimeMillis() + "\" }");

            Response indexResponse = client.getLowLevelClient().performRequest(indexRequest);
            boolean indexed = indexResponse.getStatusLine().getStatusCode() == 201;

            if (!indexed) {
                logger.error("Failed to index the test document");
                return false;
            }

            // Clean up the test index
            Request deleteRequest = new Request("DELETE", "/" + indexName);
            client.getLowLevelClient().performRequest(deleteRequest);

            logger.info("Index operations test passed");
            return true;
        } catch (IOException e) {
            logger.error("Index operations test failed", e);
            return false;
        }
    }

    /**
     * Test search functionality
     * @return whether the test passed
     */
    private boolean testSearchFunctionality() {
        try {
            logger.info("Testing search functionality");

            // In production, run the search tests against a read-only index;
            // simplified here

            logger.info("Search functionality test passed");
            return true;
        } catch (Exception e) {
            logger.error("Search functionality test failed", e);
            return false;
        }
    }

    /**
     * Test aggregation functionality
     * @return whether the test passed
     */
    private boolean testAggregationFunctionality() {
        try {
            logger.info("Testing aggregation functionality");

            // In production, run real aggregation tests; simplified here

            logger.info("Aggregation functionality test passed");
            return true;
        } catch (Exception e) {
            logger.error("Aggregation functionality test failed", e);
            return false;
        }
    }
}

/**
 * Upgrade report generator
 */
class UpgradeReporter {
    private static final Logger logger = LoggerFactory.getLogger(UpgradeReporter.class);

    private final String upgradeId;
    private final UpgradeConfig config;
    private final UpgradeMetricsCollector metricsCollector;

    private long startTime;
    private long endTime;
    private String status;
    private String failureReason;

    public UpgradeReporter(String upgradeId, UpgradeConfig config, UpgradeMetricsCollector metricsCollector) {
        this.upgradeId = upgradeId;
        this.config = config;
        this.metricsCollector = metricsCollector;
    }

    /**
     * Record the upgrade start time
     * @param startTime start time
     */
    public void recordStartTime(long startTime) {
        this.startTime = startTime;
    }

    /**
     * Record the upgrade end time
     * @param endTime end time
     */
    public void recordEndTime(long endTime) {
        this.endTime = endTime;
    }

    /**
     * Record a successful upgrade
     */
    public void recordSuccess() {
        this.status = "SUCCESS";
    }

    /**
     * Record a failed upgrade
     * @param reason failure reason
     */
    public void recordFailure(String reason) {
        this.status = "FAILURE";
        this.failureReason = reason;
    }

    /**
     * Generate the upgrade report
     * @return path of the report file
     */
    public String generateReport() {
        try {
            logger.info("Generating the upgrade report");

            String timestamp = LocalDateTime.now().format(DateTimeFormatter.ofPattern("yyyyMMdd_HHmmss"));
            String reportPath = "upgrade_report_" + upgradeId + "_" + timestamp + ".txt";

            try (FileWriter writer = new FileWriter(reportPath)) {
                writer.write("============= Elasticsearch Upgrade Report =============\n\n");
                writer.write("Upgrade ID: " + upgradeId + "\n");
                writer.write("Version: " + config.getPreviousVersion() + " -> " + config.getTargetVersion() + "\n");
                writer.write("Start time: " + new java.util.Date(startTime) + "\n");
                writer.write("End time: " + new java.util.Date(endTime) + "\n");
                writer.write("Total time: " + formatDuration(endTime - startTime) + "\n");
                writer.write("Status: " + status + "\n");

                if ("FAILURE".equals(status) && failureReason != null) {
                    writer.write("Failure reason: " + failureReason + "\n");
                }

                writer.write("\n");

                // Write performance metrics
                writer.write(metricsCollector.generateReport());

                writer.write("\n");
                writer.write("============= Configuration =============\n\n");
                writer.write(config.toString());
            }

            logger.info("Upgrade report generated: {}", reportPath);
            return reportPath;
        } catch (IOException e) {
            logger.error("Failed to generate the upgrade report", e);
            return null;
        }
    }

    /**
     * Format a duration
     * @param durationMs duration in milliseconds
     * @return formatted string
     */
    private String formatDuration(long durationMs) {
        long seconds = durationMs / 1000;
        long minutes = seconds / 60;
        long hours = minutes / 60;

        return String.format("%d hours %d minutes %d seconds",
                          hours,
                          minutes % 60,
                          seconds % 60);
    }
}

/**
 * Elasticsearch rolling upgrader - the main control class
 */
public class ElasticsearchRollingUpgrader implements Closeable {

    private static final Logger logger = LoggerFactory.getLogger(ElasticsearchRollingUpgrader.class);
    private final RestHighLevelClient client;
    private final ShardAllocationManager shardManager;
    private final ClusterHealthChecker healthChecker;
    private final NodeUpgrader nodeUpgrader;
    private final UpgradeMonitor monitor;
    private final RecoveryManager recoveryManager;
    private final RollbackManager rollbackManager;
    private final IndexManager indexManager;
    private final CrossClusterReplicationManager ccrManager;
    private final RetryHelper retryHelper;
    private final UpgradeConfig config;
    private final UpgradeMetricsCollector metricsCollector;

    /**
     * Constructor
     * @param client Elasticsearch client
     * @param nodeUpgrader node upgrader
     * @param config upgrade configuration
     * @param metricsCollector metrics collector
     */
    public ElasticsearchRollingUpgrader(
            RestHighLevelClient client,
            NodeUpgrader nodeUpgrader,
            UpgradeConfig config,
            UpgradeMetricsCollector metricsCollector) {
        this.client = client;
        this.config = config;
        this.metricsCollector = metricsCollector;
        this.retryHelper = new RetryHelper();
        this.shardManager = new ShardAllocationManager(client, retryHelper);
        this.healthChecker = new ClusterHealthChecker(client);
        this.nodeUpgrader = nodeUpgrader;
        this.monitor = new UpgradeMonitor(client);
        this.recoveryManager = new RecoveryManager(client, retryHelper);
        this.indexManager = new IndexManager(client, retryHelper);
        this.ccrManager = new CrossClusterReplicationManager(client, retryHelper);
        this.rollbackManager = new RollbackManager(
            nodeUpgrader, shardManager, indexManager, ccrManager,
            config.getPreviousVersion(), retryHelper, metricsCollector, client);
    }

    /**
     * Perform the rolling upgrade
     * @return whether the upgrade succeeded
     */
    public boolean performRollingUpgrade() {
        String upgradeId = MDC.get("upgradeId");
        if (upgradeId == null) {
            upgradeId = UUID.randomUUID().toString().substring(0, 8);
            MDC.put("upgradeId", upgradeId);
        }

        logger.info("Starting rolling upgrade of the Elasticsearch cluster, target version: {}", config.getTargetVersion());

        try {
            // Start the upgrade timer
            metricsCollector.startUpgradeTimer();

            // Start monitoring
            monitor.startMonitoring(config.getMonitoringInterval());

            // Back up index aliases (kept so they can be restored if a rollback becomes necessary)
            Map<String, List<String>> aliasBackup = indexManager.backupAliases();

            // Pause ILM policies
            indexManager.pauseIlmPolicies();

            // If CCR is enabled, pause the follower indices
            List<String> followers = new ArrayList<>();
            if (ccrManager.isCcrEnabled()) {
                followers = ccrManager.getFollowerIndices();
                if (!followers.isEmpty()) {
                    ccrManager.pauseCcrFollowing();
                }
            }

            // Handle hot indices
            if (!config.getHotIndices().isEmpty()) {
                indexManager.handleHotIndices(config.getHotIndices());
            }

            // Tune recovery settings
            if (config.isOptimizeForSpeed()) {
                recoveryManager.optimizeForFastRecovery();
            } else {
                recoveryManager.optimizeForProductionLoad();
            }

            // Disable shard allocation
            shardManager.disableShardAllocation();
            logger.info("Shard allocation disabled, ready to upgrade nodes");

            boolean success;
            if (config.isCrossDataCenterMode()) {
                // Cross-data-center upgrade
                CrossDataCenterUpgrader dcUpgrader = new CrossDataCenterUpgrader(this, config.getDataCenterNodes());
                success = dcUpgrader.upgradeCrossDataCenters(config.getTargetVersion());
            } else {
                // Single-data-center upgrade
                success = upgradeNodes(config.getNodeList(), config.getTargetVersion());
            }

            // Restore default recovery settings
            recoveryManager.restoreDefaultRecoverySettings();

            // Resume CCR follower indices
            if (!followers.isEmpty()) {
                ccrManager.resumeCcrFollowing(followers);
            }

            // Resume ILM policies
            indexManager.resumeIlmPolicies();

            // Re-enable shard allocation
            shardManager.enableShardAllocation();
            logger.info("Shard allocation re-enabled");

            // Stop the upgrade timer
            metricsCollector.endUpgradeTimer();

            // Final check
            if (success && healthChecker.isClusterHealthy()) {
                logger.info("Rolling upgrade completed successfully, all nodes are now on {}", config.getTargetVersion());
                return true;
            } else if (success) {
                logger.warn("Nodes were upgraded, but the cluster state is abnormal, please investigate");
                return false;
            } else {
                logger.error("Upgrade aborted");
                return false;
            }

        } catch (Exception e) {
            logger.error("Error during the upgrade process", e);
            // Best effort: do not leave the cluster with shard allocation disabled
            try {
                shardManager.enableShardAllocation();
            } catch (Exception allocationError) {
                logger.error("Failed to re-enable shard allocation after error", allocationError);
            }
            return false;
        } finally {
            // Stop monitoring
            monitor.stopMonitoring();
            MDC.remove("upgradeId");
        }
    }

    /**
     * Upgrade a list of nodes
     * @param nodeList node list
     * @param targetVersion target version
     * @return whether the upgrade succeeded
     */
    public boolean upgradeNodes(List<String> nodeList, String targetVersion) {
        logger.info("Upgrading {} nodes to version {}", nodeList.size(), targetVersion);

        List<String> upgradedNodes = new ArrayList<>();
        for (String node : nodeList) {
            logger.info("Upgrading node: {}", node);

            long nodeStartTime = System.currentTimeMillis();
            try {
                boolean success = nodeUpgrader.upgradeNode(node, targetVersion);
                if (!success) {
                    logger.error("Node {} failed to upgrade, starting rollback", node);
                    rollbackManager.performEmergencyRollback(node, upgradedNodes);
                    return false;
                }

                upgradedNodes.add(node);
                long nodeEndTime = System.currentTimeMillis();
                metricsCollector.recordNodeUpgradeTime(node, nodeEndTime - nodeStartTime);

                logger.info("Node {} upgraded in {} seconds",
                          node, (nodeEndTime - nodeStartTime) / 1000);
            } catch (UpgradeException e) {
                logger.error("Node {} failed to upgrade: {}", node, e.getMessage());
                rollbackManager.performEmergencyRollback(node, upgradedNodes);
                return false;
            }

            // Optional pause between nodes
            if (config.getNodePauseSeconds() > 0) {
                logger.info("Pausing {} seconds before the next node", config.getNodePauseSeconds());
                try {
                    Thread.sleep(config.getNodePauseSeconds() * 1000L);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }

        logger.info("All nodes upgraded");
        return true;
    }

    @Override
    public void close() {
        try {
            monitor.close();
        } catch (Exception e) {
            logger.error("Error while closing the monitor", e);
        }
    }
}

/**
 * Cross-data-center upgrader - handles multi-data-center deployments
 */
class CrossDataCenterUpgrader {

    private static final Logger logger = LoggerFactory.getLogger(CrossDataCenterUpgrader.class);
    private final ElasticsearchRollingUpgrader upgrader;
    private final Map<String, List<String>> dataCenterNodes;

    public CrossDataCenterUpgrader(ElasticsearchRollingUpgrader upgrader,
                                Map<String, List<String>> dataCenterNodes) {
        this.upgrader = upgrader;
        this.dataCenterNodes = dataCenterNodes;
    }

    /**
     * Perform a cross-data-center upgrade
     * @param targetVersion target version
     * @return whether the upgrade succeeded
     */
    public boolean upgradeCrossDataCenters(String targetVersion) {
        logger.info("Starting cross-data-center upgrade to version {}", targetVersion);

        // Upgrade the primary data center first
        String primaryDC = getPrimaryDataCenter();
        logger.info("Upgrading the primary data center first: {}", primaryDC);

        boolean primarySuccess = upgradeDataCenter(primaryDC, targetVersion);
        if (!primarySuccess) {
            logger.error("Primary data center upgrade failed, aborting the overall upgrade");
            return false;
        }

        // Then upgrade the remaining data centers in turn
        for (String dc : dataCenterNodes.keySet()) {
            if (!dc.equals(primaryDC)) {
                logger.info("Upgrading data center: {}", dc);
                boolean dcSuccess = upgradeDataCenter(dc, targetVersion);
                if (!dcSuccess) {
                    logger.error("Data center {} upgrade failed", dc);
                    return false;
                }
            }
        }

        logger.info("All data centers upgraded");
        return true;
    }

    /**
     * Get the name of the primary data center
     * @return primary data center name
     */
    private String getPrimaryDataCenter() {
        // A real implementation should determine the primary data center from configuration or cluster state
        return dataCenterNodes.keySet().iterator().next();
    }

    /**
     * Upgrade a single data center
     * @param dataCenter data center name
     * @param targetVersion target version
     * @return whether the upgrade succeeded
     */
    private boolean upgradeDataCenter(String dataCenter, String targetVersion) {
        List<String> nodes = dataCenterNodes.get(dataCenter);
        if (nodes == null || nodes.isEmpty()) {
            logger.warn("No nodes configured for data center {}", dataCenter);
            return false;
        }

        logger.info("Upgrading data center {} with {} nodes", dataCenter, nodes.size());

        // Upgrade the nodes in this data center
        return upgrader.upgradeNodes(nodes, targetVersion);
    }
}

/**
 * Upgrade configuration - uses the Builder pattern
 */
public class UpgradeConfig {
    private final String targetVersion;
    private final String previousVersion;
    private final List<String> nodeList;
    private final List<String> hotIndices;
    private final boolean optimizeForSpeed;
    private final int monitoringInterval;
    private final int nodePauseSeconds;
    private final boolean crossDataCenterMode;
    private final Map<String, List<String>> dataCenterNodes;
    private final boolean createSnapshot;
    private final String snapshotRepository;
    private final String snapshotPath;

    private UpgradeConfig(Builder builder) {
        this.targetVersion = builder.targetVersion;
        this.previousVersion = builder.previousVersion;
        this.nodeList = builder.nodeList;
        this.hotIndices = builder.hotIndices;
        this.optimizeForSpeed = builder.optimizeForSpeed;
        this.monitoringInterval = builder.monitoringInterval;
        this.nodePauseSeconds = builder.nodePauseSeconds;
        this.crossDataCenterMode = builder.crossDataCenterMode;
        this.dataCenterNodes = builder.dataCenterNodes;
        this.createSnapshot = builder.createSnapshot;
        this.snapshotRepository = builder.snapshotRepository;
        this.snapshotPath = builder.snapshotPath;
    }

    // Getters...

    public String getTargetVersion() {
        return targetVersion;
    }

    public String getPreviousVersion() {
        return previousVersion;
    }

    public List<String> getNodeList() {
        return nodeList;
    }

    public List<String> getHotIndices() {
        return hotIndices;
    }

    public boolean isOptimizeForSpeed() {
        return optimizeForSpeed;
    }

    public int getMonitoringInterval() {
        return monitoringInterval;
    }

    public int getNodePauseSeconds() {
        return nodePauseSeconds;
    }

    public boolean isCrossDataCenterMode() {
        return crossDataCenterMode;
    }

    public Map<String, List<String>> getDataCenterNodes() {
        return dataCenterNodes;
    }

    public boolean isCreateSnapshot() {
        return createSnapshot;
    }

    public String getSnapshotRepository() {
        return snapshotRepository;
    }

    public String getSnapshotPath() {
        return snapshotPath;
    }

    @Override
    public String toString() {
        StringBuilder sb = new StringBuilder();
        sb.append("Upgrade configuration:\n");
        sb.append("  Target version: ").append(targetVersion).append("\n");
        sb.append("  Current version: ").append(previousVersion).append("\n");
        sb.append("  Node count: ").append(nodeList.size()).append("\n");
        sb.append("  Hot indices: ").append(hotIndices.size()).append("\n");
        sb.append("  Optimize for speed: ").append(optimizeForSpeed ? "yes" : "no").append("\n");
        sb.append("  Monitoring interval: ").append(monitoringInterval).append("s\n");
        sb.append("  Pause between nodes: ").append(nodePauseSeconds).append("s\n");
        sb.append("  Cross-data-center mode: ").append(crossDataCenterMode ? "yes" : "no").append("\n");
        if (crossDataCenterMode) {
            sb.append("  Data centers: ").append(dataCenterNodes.size()).append("\n");
        }
        sb.append("  Create snapshot: ").append(createSnapshot ? "yes" : "no").append("\n");
        if (createSnapshot) {
            sb.append("  Snapshot repository: ").append(snapshotRepository).append("\n");
        }
        return sb.toString();
    }

    public static class Builder {
        private String targetVersion;
        private String previousVersion;
        private List<String> nodeList = new ArrayList<>();
        private List<String> hotIndices = new ArrayList<>();
        private boolean optimizeForSpeed = false;
        private int monitoringInterval = 30;
        private int nodePauseSeconds = 0;
        private boolean crossDataCenterMode = false;
        private Map<String, List<String>> dataCenterNodes = new HashMap<>();
        private boolean createSnapshot = false;
        private String snapshotRepository = "upgrade_backup";
        private String snapshotPath = "/mnt/elasticsearch/backups";

        public Builder targetVersion(String targetVersion) {
            this.targetVersion = targetVersion;
            return this;
        }

        public Builder previousVersion(String previousVersion) {
            this.previousVersion = previousVersion;
            return this;
        }

        public Builder nodeList(List<String> nodeList) {
            this.nodeList = nodeList;
            return this;
        }

        public Builder hotIndices(List<String> hotIndices) {
            this.hotIndices = hotIndices;
            return this;
        }

        public Builder optimizeForSpeed(boolean optimizeForSpeed) {
            this.optimizeForSpeed = optimizeForSpeed;
            return this;
        }

        public Builder monitoringInterval(int monitoringInterval) {
            this.monitoringInterval = monitoringInterval;
            return this;
        }

        public Builder nodePauseSeconds(int nodePauseSeconds) {
            this.nodePauseSeconds = nodePauseSeconds;
            return this;
        }

        public Builder crossDataCenterMode(boolean crossDataCenterMode) {
            this.crossDataCenterMode = crossDataCenterMode;
            return this;
        }

        public Builder dataCenterNodes(Map<String, List<String>> dataCenterNodes) {
            this.dataCenterNodes = dataCenterNodes;
            return this;
        }

        public Builder createSnapshot(boolean createSnapshot) {
            this.createSnapshot = createSnapshot;
            return this;
        }

        public Builder snapshotRepository(String snapshotRepository) {
            this.snapshotRepository = snapshotRepository;
            return this;
        }

        public Builder snapshotPath(String snapshotPath) {
            this.snapshotPath = snapshotPath;
            return this;
        }

        public UpgradeConfig build() {
            return new UpgradeConfig(this);
        }
    }
}
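
For completeness, here is how the pieces above could be wired together. This is a minimal sketch: the host and node names are placeholders, and the NodeUpgrader, SnapshotManager, and UpgradeMetricsCollector constructors are assumptions matching how those classes are used earlier in this article:

import java.util.Arrays;
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

public class UpgradeMain {
    public static void main(String[] args) throws Exception {
        // Placeholder coordinator host
        RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("es-coordinator", 9200, "http")));

        UpgradeConfig config = new UpgradeConfig.Builder()
                .previousVersion("7.10.2")
                .targetVersion("7.16.3")
                .nodeList(Arrays.asList("es-node-01", "es-node-02", "es-node-03")) // placeholder nodes
                .createSnapshot(true)
                .nodePauseSeconds(60)
                .build();

        UpgradeMetricsCollector metrics = new UpgradeMetricsCollector(); // no-arg constructor assumed
        NodeUpgrader nodeUpgrader = new NodeUpgrader(client);            // constructor assumed
        ElasticsearchRollingUpgrader upgrader =
                new ElasticsearchRollingUpgrader(client, nodeUpgrader, config, metrics);

        try (UpgradeCoordinator coordinator = new UpgradeCoordinator(
                upgrader, config,
                new PreUpgradeChecker(new ClusterHealthChecker(client),
                        new VersionCompatibilityChecker(client),
                        new SnapshotManager(client), config),            // SnapshotManager constructor assumed
                new PostUpgradeValidator(new ClusterHealthChecker(client),
                        new PluginCompatibilityChecker(client),
                        new FunctionalTester(client)),
                new UpgradeReporter("manual-run", config, metrics))) {
            boolean ok = coordinator.coordinate();
            System.out.println(ok ? "Upgrade succeeded" : "Upgrade failed, check the report");
        } finally {
            client.close();
        }
    }
}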

Case analysis:

The main challenges on this platform were the strict compliance requirements and the differing SLAs of a multi-tenant environment. We took the following key measures:

  1. Detailed audit logging: used MDC to record the context of every operation, producing a complete audit trail
  2. Data safety: automatically created a full snapshot before the upgrade
  3. Multi-level validation: enforced strict pre-upgrade checks and post-upgrade validation to guarantee upgrade quality
  4. Performance metrics: collected detailed metrics for post-mortem analysis and future tuning

Upgrade results:

  1. Success rate: all 21 nodes across both data centers were upgraded without manual intervention
  2. Business impact: query response time rose by roughly 10% during the upgrade, but always stayed within SLA
  3. Total upgrade time: 4.5 hours, finishing an hour ahead of schedule
  4. Compliance: all security-audit and change-management requirements were satisfied

Summary

Phase       | Key actions                                                                                | Notes
Preparation | Back up data, check cluster health, verify version compatibility, pause ILM policies      | Create a snapshot backup, ensure the cluster is green, check plugin compatibility, avoid large operations
Planning    | Draw up a detailed upgrade plan, prepare a rollback plan, account for data-center topology | Run during off-peak hours, plan the upgrade order by zone, coordinate cross-region operations
Execution   | Disable shard allocation, pause CCR, tune recovery settings, upgrade nodes one at a time   | Upgrade a single node at a time, wait for full recovery before continuing, track every change
Validation  | Check cluster health, verify plugin functionality, run functional tests                   | Validate all critical functionality, confirm every node runs the same version, review monitoring metrics
Recording   | Generate the upgrade report, collect performance metrics, capture lessons learned         | Keep detailed upgrade logs, record each node's upgrade time, produce an audit report
Rollback    | Follow the predefined rollback procedure, reverse the original steps                      | Start from the failed node, return to the known stable version, verify the rollback result