Previous article: Elasticsearch 8.5.3 Source Code Analysis (4): The Write Coordination Process - 掘金 (juejin.cn)

Overview

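TransportService is responsible for sending and receiving messages between nodes. When a node needs to talk to another node, it opens a connection through TransportService and sends the message to the target node.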

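Whether a message originates on the local node or is forwarded by a remote coordinating node, it is ultimately received by the TransportService.localNodeConnection object and dispatched to the matching Action handler.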

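TransportService invokes TransportReplicationAction#handlePrimaryRequest, which first checks that the write request to the primary shard fits within the memory limit and then processes the primary request.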

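AsyncPrimaryAction#doRun verifies that the primary shard targeted by the request matches the shard held on this node, then requests an operation permit. Once the permit is granted, runWithPrimaryShardReference is invoked as the callback.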

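IndexShardOperationPermits manages the permits that allow operations to run concurrently on an index shard.

Every index shard owns one IndexShardOperationPermits instance, which tracks the permits currently available on that shard and the operations waiting for one. Callers request a permit through acquirePrimaryOperationPermit or acquireReplicaOperationPermit. Acquisition is callback-based: the caller passes a listener, and if a permit is available the listener receives it immediately as a Releasable. If operations on the shard are currently blocked, the request is queued and its listener is notified once the block is lifted, or notified of the failure if the permit cannot be obtained.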
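A minimal sketch of the permit idea, assuming a callback-style acquire like the one AsyncPrimaryAction#doRun uses later in this article. This is not the Elasticsearch implementation: PermitSketch and its methods acquire, block, unblock and availablePermits are hypothetical names, and a plain java.util.concurrent.Semaphore stands in for the real bookkeeping.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.Semaphore;
import java.util.function.Consumer;

// Hypothetical, simplified model of per-shard operation permits; not the Elasticsearch class.
final class PermitSketch {
    private final Semaphore semaphore;
    private final Queue<Consumer<Runnable>> delayed = new ArrayDeque<>();
    private boolean blocked;

    PermitSketch(int totalPermits) {
        this.semaphore = new Semaphore(totalPermits);
    }

    // Callback-style acquire: onAcquired receives a Runnable that returns the permit when run.
    void acquire(Consumer<Runnable> onAcquired) {
        synchronized (this) {
            if (blocked) {
                delayed.add(onAcquired); // queue the operation instead of parking the caller
                return;
            }
        }
        if (semaphore.tryAcquire() == false) {
            throw new IllegalStateException("no operation permit available");
        }
        onAcquired.accept(semaphore::release);
    }

    // Stop handing out permits, for example while the shard is being handed over.
    synchronized void block() {
        blocked = true;
    }

    // Lift the block and run every operation that was queued in the meantime.
    void unblock() {
        final Queue<Consumer<Runnable>> toRun;
        synchronized (this) {
            blocked = false;
            toRun = new ArrayDeque<>(delayed);
            delayed.clear();
        }
        toRun.forEach(this::acquire);
    }

    int availablePermits() {
        return semaphore.availablePermits();
    }
}
```

The design point the sketch tries to show is that a blocked shard delays the operation rather than the thread: the caller returns at once and the queued callback runs when unblock is called.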

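ReplicationOperation#execute checks that enough shard copies are active before the request is processed. When it calls PrimaryShardReference#perform to execute the operation, it passes ReplicationOperation#handlePrimaryResult as the callback; after the primary operation succeeds, that callback calls ReplicationOperation#performOnReplicas to replicate the operation to the replica shards.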

TransportShardBulkAction extends TransportWriteAction, which in turn extends TransportReplicationAction (the original post shows this hierarchy in a class diagram).

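TransportReplicationAction is the base class for replicated operations in Elasticsearch, that is, operations that run on the primary shard first and are then applied to every replica copy. It routes the request to the node that holds the primary, executes it there, forwards it to the replicas, and handles the errors and exceptions raised along the way so that the replicated operation either completes or fails cleanly.

TransportReplicationAction also provides the hooks for back-pressure on write traffic, such as the checkPrimaryLimits call shown later in this article, which TransportWriteAction implements on top of indexing pressure accounting. When the bytes held by in-flight indexing requests exceed the configured limit, new operations are rejected, which keeps a node from exhausting its resources and protects the stability of the cluster.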

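TransportShardBulkAction receives the shard-level bulk request (BulkShardRequest). The request contains multiple items, each of which is an index, update, or delete request, and every item is executed against the corresponding IndexShard. While processing, TransportShardBulkAction collects the result of each item and combines them into one bulk response (BulkShardResponse) that is eventually returned to the client.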

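IndexShard manages and executes shard-level operations and maintains shard-level data and metadata. It is the entry point for shard-level operations such as index, delete, search, and update, and it calls a different Engine API for each kind of operation.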

Source Code Analysis of the Core Methods

TransportReplicationAction#handlePrimaryRequest

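Handles a request addressed to the primary shard. This method runs the primary-side part of a write, for example indexing, updating, or deleting documents.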

protected void handlePrimaryRequest(final ConcreteShardRequest<Request> request, final TransportChannel channel, final Task task) {
    // Indexing pressure accounting
    // These figures can be inspected with the API GET /_nodes/stats/indexing_pressure
    Releasable releasable = checkPrimaryLimits(
        request.getRequest(),
        request.sentFromLocalReroute(),
        request.localRerouteInitiatedByNodeClient()
    );
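    // Release the indexing pressure accounting once the request completes, before the response is returned
    // ChannelActionListener sends the response back over the channel when the request completes, whether it succeeded or failed.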
    ActionListener<Response> listener = ActionListener.runBefore(
        new ChannelActionListener<>(channel, transportPrimaryAction, request),
        releasable::close
    );

    try {
        new AsyncPrimaryAction(request, listener, (ReplicationTask) task).run();
    } catch (RuntimeException e) {
        listener.onFailure(e);
    }
}
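The listener wrapping used in handlePrimaryRequest can be illustrated in isolation. The following is a minimal sketch of the "run something before notifying the listener" pattern, not Elasticsearch's ActionListener: the Listener interface and the runBefore helper below are simplified, hypothetical stand-ins, and error handling inside the hook is left out.

```java
// Hypothetical, simplified stand-in for a response listener and a runBefore wrapper.
final class RunBeforeSketch {
    interface Listener<T> {
        void onResponse(T response);

        void onFailure(Exception e);
    }

    // Wraps a listener so that `before` runs first on both the success and the failure path.
    static <T> Listener<T> runBefore(Listener<T> delegate, Runnable before) {
        return new Listener<T>() {
            @Override
            public void onResponse(T response) {
                before.run(); // e.g. release the indexing pressure accounting
                delegate.onResponse(response);
            }

            @Override
            public void onFailure(Exception e) {
                before.run(); // the release must also happen when the request fails
                delegate.onFailure(e);
            }
        };
    }
}
```

Wrapping the listener once means the release cannot be forgotten on any exit path, which is why the releasable is attached to the listener instead of being closed manually after the call.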

TransportWriteAction#checkPrimaryLimits

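Checks whether a write request to the primary shard is within the indexing pressure limit.

// indexing_pressure.memory.limit configures how many bytes outstanding indexing requests may consume.
// When this limit is reached or exceeded, the node rejects new coordinating and primary operations.
// When replica operations consume 1.5 times this limit, the node rejects new replica operations. The default is 10% of the heap.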
protected Releasable checkPrimaryLimits(Request request, boolean rerouteWasLocal, boolean localRerouteInitiatedByNodeClient) {
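    // rerouteWasLocal tells whether the reroute happened on this node, i.e. whether this node acted as the coordinating node
    if (rerouteWasLocal) {
        // localRerouteInitiatedByNodeClient tells whether the local reroute was triggered by the NodeClient
        if (localRerouteInitiatedByNodeClient) {
            // The local reroute was triggered by the NodeClient, i.e. the primary shard lives on the local node
            // Mark a primary operation that is local to the coordinating node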
            return indexingPressure.markPrimaryOperationLocalToCoordinatingNodeStarted(
                primaryOperationCount(request),
                primaryOperationSize(request)
            );
        } else {
            // The local reroute was not triggered by the NodeClient
            // The request reached this node as a transport request, so nothing further is recorded here
            return () -> {};
        }
    } else {
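        // If this primary request was received directly from the network, it must be marked as a new primary operation (and checked against the limit unless force(request) is true).
        // This happens when the write action skips the reroute step (e.g. rsync) or during primary delegation, after the primary relocation hand-off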
        return indexingPressure.markPrimaryOperationStarted(
            primaryOperationCount(request),
            primaryOperationSize(request),
            force(request)
        );
    }
}
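The accounting behind checkPrimaryLimits can be pictured as a byte counter with a ceiling. The sketch below is a simplified assumption of how such a counter might behave, not the IndexingPressure class itself: IndexingPressureSketch, markPrimaryStarted and inFlight are hypothetical names, the exception type is a plain IllegalStateException, and the exact boundary handling (strictly greater than the limit) is an assumption.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of indexing pressure accounting; names and boundary handling are assumptions.
final class IndexingPressureSketch {
    private final long limitBytes;
    private final AtomicLong inFlightBytes = new AtomicLong();

    IndexingPressureSketch(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    // Reserves `bytes` for an operation and returns a handle that gives them back when run.
    // Rejects the operation when the limit would be exceeded, unless `force` is set.
    Runnable markPrimaryStarted(long bytes, boolean force) {
        long total = inFlightBytes.addAndGet(bytes);
        if (force == false && total > limitBytes) {
            inFlightBytes.addAndGet(-bytes); // undo the reservation before rejecting
            throw new IllegalStateException("rejected: " + total + " bytes in flight, limit " + limitBytes);
        }
        return () -> inFlightBytes.addAndGet(-bytes);
    }

    long inFlight() {
        return inFlightBytes.get();
    }
}
```

Returning a release handle from the reservation call mirrors the Releasable returned by checkPrimaryLimits: whoever completes the request runs the handle, and the bytes become available to later requests.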

AsyncPrimaryAction#doRun

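The core job of this method is to request an operation permit from the primary shard.

If the permit is granted, the callback runWithPrimaryShardReference runs. It receives a PrimaryShardReference object, which represents a reference to the primary shard the operation runs on.

If acquiring the permit fails, onFailure is called.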

protected void doRun() throws Exception {
    final ShardId shardId = primaryRequest.getRequest().shardId();
    final IndexShard indexShard = getIndexShard(shardId);
    final ShardRouting shardRouting = indexShard.routingEntry();
    // Verify that this shard is still a primary; the operation must be executed on the primary shard
    if (shardRouting.primary() == false) {
        throw new ReplicationOperation.RetryOnPrimaryException(shardId, "actual shard is not a primary " + shardRouting);
    }
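    // Verify that the primary shard's allocation ID matches the allocation ID in the request.
    // shardRouting.allocationId().getId() returns the allocation ID of the current primary shard
    // primaryRequest.getTargetAllocationID() returns the allocation ID the request was addressed to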
    final String actualAllocationId = shardRouting.allocationId().getId();
    if (actualAllocationId.equals(primaryRequest.getTargetAllocationID()) == false) {
        throw new ShardNotFoundException(
            shardId,
            "expected allocation id [{}] but found [{}]",
            primaryRequest.getTargetAllocationID(),
            actualAllocationId
        );
    }
    // Verify that the shard's primary term matches the primary term in the request.
    final long actualTerm = indexShard.getPendingPrimaryTerm();
    if (actualTerm != primaryRequest.getPrimaryTerm()) {
        throw new ShardNotFoundException(
            shardId,
            "expected allocation id [{}] with term [{}] but found [{}]",
            primaryRequest.getTargetAllocationID(),
            primaryRequest.getPrimaryTerm(),
            actualTerm
        );
    }
    // Request an operation permit on the primary shard; on success runWithPrimaryShardReference is called back
    acquirePrimaryOperationPermit(
        indexShard,
        primaryRequest.getRequest(),
        ActionListener.wrap(releasable -> runWithPrimaryShardReference(new PrimaryShardReference(indexShard, releasable)), e -> {
            if (e instanceof ShardNotInPrimaryModeException) {
                onFailure(new ReplicationOperation.RetryOnPrimaryException(shardId, "shard is not in primary mode", e));
            } else {
                onFailure(e);
            }
        })
    );
}
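The three checks at the top of doRun can be condensed into one decision function. This is only an illustrative sketch: PrimaryCheckSketch, Outcome and validate are hypothetical names, and the enum values merely label which exception the method above would throw.

```java
// Hypothetical condensation of the validation order in AsyncPrimaryAction#doRun.
final class PrimaryCheckSketch {
    enum Outcome { PROCEED, RETRY_ON_PRIMARY, SHARD_NOT_FOUND }

    static Outcome validate(
        boolean isPrimary,
        String actualAllocationId,
        long actualTerm,
        String targetAllocationId,
        long requestTerm
    ) {
        if (isPrimary == false) {
            return Outcome.RETRY_ON_PRIMARY; // the shard copy on this node is no longer the primary
        }
        if (actualAllocationId.equals(targetAllocationId) == false) {
            return Outcome.SHARD_NOT_FOUND; // the request was addressed to a different shard copy
        }
        if (actualTerm != requestTerm) {
            return Outcome.SHARD_NOT_FOUND; // same copy, but the primary term has changed
        }
        return Outcome.PROCEED; // safe to request the operation permit
    }
}
```

The order matters for the caller: a shard that is not a primary yields a retryable outcome, while a mismatched allocation ID or term means the request was routed using stale information.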

ReplicationOperation#execute

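ReplicationOperation implements shard replication: it executes the requested operation on the primary shard and then replicates it to the replica shards, which keeps the data in the index consistent across copies.

Before processing the request, ReplicationOperation checks that enough shard copies are active, so that the request can be handled correctly.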

public void execute() throws Exception {
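    // Check that enough shard copies are active
    // Controlled by the request parameter wait_for_active_shards; the default is 1, i.e. only the primary shard needs to be active.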
    final String activeShardCountFailure = checkActiveShardCount();
    final ShardRouting primaryRouting = primary.routingEntry();
    final ShardId primaryId = primaryRouting.shardId();
    if (activeShardCountFailure != null) {
        finishAsFailed(
            new UnavailableShardsException(
                primaryId,
                "{} Timeout: [{}], request: [{}]",
                activeShardCountFailure,
                request.timeout(),
                request
            )
        );
        return;
    }

    totalShards.incrementAndGet();
    pendingActions.incrementAndGet();
    // Execute the request on the primary shard. On success, handlePrimaryResult is called back and replicates to the replica shards.
    // On failure, finishAsFailed is called back
    primary.perform(request, ActionListener.wrap(this::handlePrimaryResult, this::finishAsFailed));
}
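The control flow of execute (check active copies, run on the primary, then fan out to the replicas, and finish when nothing is pending) can be sketched synchronously. The class below is a hypothetical simplification: ReplicationFlowSketch and its log messages are invented for illustration, and the real operation is asynchronous.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical, synchronous model of the replication flow; not the Elasticsearch class.
final class ReplicationFlowSketch {
    final List<String> log = new ArrayList<>();
    private final AtomicInteger pendingActions = new AtomicInteger();

    // Returns false, without touching any shard, when too few copies are active.
    boolean execute(int activeCopies, int waitForActiveShards, List<String> replicas) {
        if (activeCopies < waitForActiveShards) {
            log.add("failed: not enough active shard copies");
            return false;
        }
        pendingActions.incrementAndGet(); // accounts for the primary operation
        log.add("primary done");
        for (String replica : replicas) { // replicas are contacted only after the primary succeeded
            pendingActions.incrementAndGet();
            log.add("replicated to " + replica);
            decPendingAndFinishIfNeeded();
        }
        decPendingAndFinishIfNeeded(); // the primary part is complete
        return true;
    }

    // The operation finishes exactly once, when the last pending action completes.
    private void decPendingAndFinishIfNeeded() {
        if (pendingActions.decrementAndGet() == 0) {
            log.add("finished");
        }
    }
}
```

Counting the primary as one pending action before any replica is contacted is what prevents the operation from finishing early: the counter cannot reach zero until the primary's own decrement has run.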

TransportShardBulkAction#performOnPrimary

public static void performOnPrimary(
        BulkShardRequest request,
        IndexShard primary,
        UpdateHelper updateHelper,
        LongSupplier nowInMillisSupplier,
        MappingUpdatePerformer mappingUpdater,
        Consumer<ActionListener<Void>> waitForMappingUpdate,
        ActionListener<PrimaryResult<BulkShardRequest, BulkShardResponse>> listener,
        ThreadPool threadPool,
        String executorName
    ) {
        new ActionRunnable<>(listener) {

            private final Executor executor = threadPool.executor(executorName);
            // Tracks the execution state of the bulk write operation
            private final BulkPrimaryExecutionContext context = new BulkPrimaryExecutionContext(request, primary);

            final long startBulkTime = System.nanoTime();

            @Override
            protected void doRun() throws Exception {
                // Iterate over every item in the bulk request and execute it with executeBulkItemRequest.
                while (context.hasMoreOperationsToExecute()) {
                    if (executeBulkItemRequest(
                        context,
                        updateHelper,
                        nowInMillisSupplier,
                        mappingUpdater,
                        waitForMappingUpdate,
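                        // When an item requires a mapping update to be submitted, executeBulkItemRequest returns false whether or not the update later succeeds.
                        // Returning false ends the current run; the listener then re-submits this runnable to the executor so that doRun executes again.
                        // Items that have already been executed keep their state in context.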
                        ActionListener.wrap(v -> executor.execute(this), this::onRejection)
                    ) == false) {
                        return;
                    }
                    assert context.isInitial(); 
                }
                primary.getBulkOperationListener().afterBulk(request.totalSizeInBytes(), System.nanoTime() - startBulkTime);
                // All write items have been executed; respond with a WritePrimaryResult
                finishRequest();
            }

            @Override
            public void onRejection(Exception e) {
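                // The outstanding request must still be finished: the remaining items are completed with the rejection exception.
                // Finishing can involve refresh and fsync operations, so execution is forced on the executor.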
                executor.execute(new ActionRunnable<>(listener) {
                    @Override
                    protected void doRun() {
                        while (context.hasMoreOperationsToExecute()) {
                            context.setRequestToExecute(context.getCurrent());
                            final DocWriteRequest<?> docWriteRequest = context.getRequestToExecute();
                            onComplete(
                                exceptionToResult(
                                    e,
                                    primary,
                                    docWriteRequest.opType() == DocWriteRequest.OpType.DELETE,
                                    docWriteRequest.version(),
                                    docWriteRequest.id()
                                ),
                                context,
                                null
                            );
                        }
                        finishRequest();
                    }

                    @Override
                    public boolean isForceExecution() {
                        return true;
                    }
                });
            }

            private void finishRequest() {
                ActionListener.completeWith(
                    listener,
                    () -> new WritePrimaryResult<>(
                        context.getBulkShardRequest(),
                        context.buildShardResponse(),
                        context.getLocationToSync(),
                        null,
                        context.getPrimary(),
                        logger
                    )
                );
            }
        }.run();
    }
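The resumable loop in performOnPrimary is easier to see when stripped of Elasticsearch types. In the sketch below, which is an illustration under stated assumptions and not the real code, ResumableLoopSketch is a hypothetical class, items are plain strings, an item whose name starts with "needs-mapping" simulates a required mapping update, and the update is assumed to succeed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

// Hypothetical model of a loop that stops when an item goes asynchronous and is later resumed.
final class ResumableLoopSketch implements Runnable {
    private final List<String> items;
    private final Executor executor;
    final List<String> done = new ArrayList<>();
    private int current;            // plays the role of the execution context
    private boolean mappingUpdated; // hypothetical flag: has the asynchronous step completed?

    ResumableLoopSketch(List<String> items, Executor executor) {
        this.items = items;
        this.executor = executor;
    }

    @Override
    public void run() {
        while (current < items.size()) {
            if (executeItem(items.get(current), () -> executor.execute(this)) == false) {
                return; // the item went asynchronous; its callback resumes the loop
            }
        }
        done.add("finished");
    }

    // Returns true when the item completed synchronously, false when it first needs an async step.
    private boolean executeItem(String item, Runnable itemDoneListener) {
        if (item.startsWith("needs-mapping") && mappingUpdated == false) {
            mappingUpdated = true;  // pretend the mapping update succeeded
            itemDoneListener.run(); // re-submit the loop; the same item is retried
            return false;
        }
        done.add(item);
        current++;
        return true;
    }
}
```

Because the position lives in a field rather than a local variable, the re-submitted run picks up at the item that triggered the mapping update instead of starting over.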

TransportShardBulkAction#executeBulkItemRequest

static boolean executeBulkItemRequest(
    BulkPrimaryExecutionContext context,
    UpdateHelper updateHelper,
    LongSupplier nowInMillisSupplier,
    MappingUpdatePerformer mappingUpdater,
    Consumer<ActionListener<Void>> waitForMappingUpdate,
    ActionListener<Void> itemDoneListener
) throws Exception {
    // Read the operation type (opType) of the current request (DocWriteRequest) from the context (BulkPrimaryExecutionContext).
    final DocWriteRequest.OpType opType = context.getCurrent().opType();

    // If the operation type is UPDATE, translate it into an index or delete request that can be executed directly
    final UpdateHelper.Result updateResult;
    if (opType == DocWriteRequest.OpType.UPDATE) {
        final UpdateRequest updateRequest = (UpdateRequest) context.getCurrent();
        try {
            // Translate the update request into an index operation, a delete operation, or an update response (no-op)
            updateResult = updateHelper.prepare(updateRequest, context.getPrimary(), nowInMillisSupplier);
        } catch (Exception failure) {
            final Engine.Result result = new Engine.IndexResult(failure, updateRequest.version(), updateRequest.id());
            context.setRequestToExecute(updateRequest);
            context.markOperationAsExecuted(result);
            context.markAsCompleted(context.getExecutionResult());
            return true;
        }
        // If the update turns out to be a NOOP (nothing to do), mark the operation as a no-op and as completed
        if (updateResult.getResponseResult() == DocWriteResponse.Result.NOOP) {
            context.markOperationAsNoOp(updateResult.action());
            context.markAsCompleted(context.getExecutionResult());
            return true;
        }
        // Store the translated update request as the request to execute
        context.setRequestToExecute(updateResult.action());
    } else {
        // Not an update request: store it as the request to execute as-is
        context.setRequestToExecute(context.getCurrent());
        updateResult = null;
    }

    assert context.getRequestToExecute() != null; // also checks that we're in TRANSLATED state

    final IndexShard primary = context.getPrimary();
    final long version = context.getRequestToExecute().version();
    final boolean isDelete = context.getRequestToExecute().opType() == DocWriteRequest.OpType.DELETE;
    final Engine.Result result;
    if (isDelete) {
        // Execute the delete operation
        final DeleteRequest request = context.getRequestToExecute();
        result = primary.applyDeleteOperationOnPrimary(
            version,
            request.id(),
            request.versionType(),
            request.ifSeqNo(),
            request.ifPrimaryTerm()
        );
    } else {
        // Execute the index operation (create or overwrite)
        final IndexRequest request = context.getRequestToExecute();
        final SourceToParse sourceToParse = new SourceToParse(
            request.id(),
            request.source(),
            request.getContentType(),
            request.routing(),
            request.getDynamicTemplates()
        );
        result = primary.applyIndexOperationOnPrimary(
            version,
            request.versionType(),
            sourceToParse,
            request.ifSeqNo(),
            request.ifPrimaryTerm(),
            request.getAutoGeneratedTimestamp(),
            request.isRetry()
        );
    }
    // A mapping update is required
    if (result.getResultType() == Engine.Result.Type.MAPPING_UPDATE_REQUIRED) {
        try {
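            // Merge the required mapping update on the primary first (MAPPING_UPDATE_PREFLIGHT), so that an update the primary cannot accept is rejected here.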
            primary.mapperService()
                .merge(
                    MapperService.SINGLE_MAPPING_NAME,
                    new CompressedXContent(result.getRequiredMappingUpdate()),
                    MapperService.MergeReason.MAPPING_UPDATE_PREFLIGHT
                );
        } catch (Exception e) {
            logger.info(() -> format("%s mapping update rejected by primary", primary.shardId()), e);
            assert result.getId() != null;
            onComplete(exceptionToResult(e, primary, isDelete, version, result.getId()), context, updateResult);
            return true;
        }

        mappingUpdater.updateMappings(result.getRequiredMappingUpdate(), primary.shardId(), new ActionListener<>() {
            @Override
            public void onResponse(Void v) {
                context.markAsRequiringMappingUpdate();
                waitForMappingUpdate.accept(ActionListener.runAfter(new ActionListener<>() {
                    @Override
                    public void onResponse(Void v) {
                        assert context.requiresWaitingForMappingUpdate();
                        context.resetForExecutionForRetry();
                    }

                    @Override
                    public void onFailure(Exception e) {
                        context.failOnMappingUpdate(e);
                    }
                    // Runs once the wait for the mapping update is over: notify itemDoneListener so that the unfinished bulk items are executed asynchronously
                }, () -> itemDoneListener.onResponse(null)));
            }

            @Override
            public void onFailure(Exception e) {
                onComplete(exceptionToResult(e, primary, isDelete, version, result.getId()), context, updateResult);
                assert context.isInitial();
                // The mapping update failed: notify itemDoneListener so that the unfinished bulk items are executed asynchronously
                itemDoneListener.onResponse(null);
            }
        });
        return false;
    } else {
        onComplete(result, context, updateResult);
    }
    return true;
}

IndexShard#applyIndexOperation

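Converts the given source (sourceToParse) into a document and detects whether the document mapping needs to be updated. To do so it calls prepareIndex, which creates a new Engine.Index object carrying the indexing information for the document.

If the document introduces dynamic mappings, applyIndexOperation returns an Engine.IndexResult that contains the dynamic mapping update instead of the result of an actual index operation.

Otherwise the method calls index to apply the index operation to the given engine and returns an Engine.IndexResult that carries the outcome, such as whether the operation succeeded and the document version.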

private Engine.IndexResult applyIndexOperation(
    Engine engine,
    long seqNo,
    long opPrimaryTerm,
    long version,
    @Nullable VersionType versionType,
    long ifSeqNo,
    long ifPrimaryTerm,
    long autoGeneratedTimeStamp,
    boolean isRetry,
    Engine.Operation.Origin origin,
    SourceToParse sourceToParse
) throws IOException {
    assert opPrimaryTerm <= getOperationPrimaryTerm()
        : "op term [ " + opPrimaryTerm + " ] > shard term [" + getOperationPrimaryTerm() + "]";
    ensureWriteAllowed(origin);
    Engine.Index operation;
    try {
        // Convert the given source (sourceToParse) into a document, detect dynamic mappings, and wrap the result in an Engine.Index object
        operation = prepareIndex(
            mapperService,
            sourceToParse,
            seqNo,
            opPrimaryTerm,
            version,
            versionType,
            origin,
            autoGeneratedTimeStamp,
            isRetry,
            ifSeqNo,
            ifPrimaryTerm
        );
        // If the document introduces dynamic mappings, return an Engine.IndexResult that carries the dynamic mapping update
        Mapping update = operation.parsedDoc().dynamicMappingsUpdate();
        if (update != null) {
            return new Engine.IndexResult(update, operation.parsedDoc().id());
        }
    } catch (Exception e) {
        verifyNotClosed(e);
        return new Engine.IndexResult(e, version, opPrimaryTerm, seqNo, sourceToParse.id());
    }
    // Execute the index operation on the given engine
    return index(engine, operation);
}
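The early return for dynamic mappings can be illustrated with a toy model. Everything in the sketch below is hypothetical: DynamicMappingSketch, its Result type and the applyIndex signature are invented for illustration, documents are plain maps, and the type inference ("long" for numbers, "text" otherwise) is a deliberately crude assumption.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical toy model: indexing only proceeds when the document adds no unmapped fields.
final class DynamicMappingSketch {
    enum ResultType { SUCCESS, MAPPING_UPDATE_REQUIRED }

    static final class Result {
        final ResultType type;
        final Map<String, String> requiredMappingUpdate;

        Result(ResultType type, Map<String, String> requiredMappingUpdate) {
            this.type = type;
            this.requiredMappingUpdate = requiredMappingUpdate;
        }
    }

    static Result applyIndex(Set<String> mappedFields, Map<String, Object> source) {
        Map<String, String> update = new TreeMap<>();
        for (Map.Entry<String, Object> field : source.entrySet()) {
            if (mappedFields.contains(field.getKey()) == false) {
                // Toy type inference, purely illustrative.
                update.put(field.getKey(), field.getValue() instanceof Number ? "long" : "text");
            }
        }
        if (update.isEmpty() == false) {
            // Nothing is indexed yet: the caller must apply the mapping update and retry.
            return new Result(ResultType.MAPPING_UPDATE_REQUIRED, update);
        }
        return new Result(ResultType.SUCCESS, Map.of());
    }
}
```

For example, indexing a document with the fields title and views into a mapping that only knows title would produce a MAPPING_UPDATE_REQUIRED result whose update contains views, which corresponds to the branch that executeBulkItemRequest handles by updating the mapping and retrying the item.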

IndexShard#index

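Calls the engine to execute the index operation, and invokes the preIndex and postIndex methods of the registered listeners before and after the operation respectively, so that they can do their own processing. This includes slow logging, memory management, and various statistics and alerting hooks.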

private Engine.IndexResult index(Engine engine, Engine.Index index) throws IOException {
    // Set the active flag to true, marking this shard as currently executing indexing operations
    active.set(true);
    final Engine.IndexResult result;
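    // Call preIndex on every indexing operation listener; listeners can do preparatory work before the document is added to the index
    // See the runtime screenshots and the listener notes below.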
    final Engine.Index preIndex = indexingOperationListeners.preIndex(shardId, index);
    try {
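        // Ask the engine to add the document to the in-memory inverted index
        // The returned IndexResult carries the outcome of the operation, such as the document version and whether the document was created.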
        result = engine.index(preIndex);
    } catch (Exception e) {
        indexingOperationListeners.postIndex(shardId, preIndex, e);
        throw e;
    }
    // Call postIndex on every indexing operation listener
    indexingOperationListeners.postIndex(shardId, preIndex, result);
    return result;
}
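The pre/post hook pattern used by IndexShard#index can be sketched with plain Java types. The following is an assumption-laden illustration, not the Elasticsearch listener API: IndexingListenerSketch and its nested Listener interface are hypothetical, documents and results are strings, and the engine is modelled as a UnaryOperator.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical model of listeners that are notified before and after an index operation.
final class IndexingListenerSketch {
    interface Listener {
        default String preIndex(String doc) {
            return doc;
        }

        default void postIndex(String doc, String result) {}

        default void postIndex(String doc, Exception failure) {}
    }

    private final List<Listener> listeners;

    IndexingListenerSketch(List<Listener> listeners) {
        this.listeners = listeners;
    }

    String index(String doc, UnaryOperator<String> engine) {
        String prepared = doc;
        for (Listener listener : listeners) {
            prepared = listener.preIndex(prepared); // hook before the engine is called
        }
        final String result;
        try {
            result = engine.apply(prepared);
        } catch (RuntimeException e) {
            for (Listener listener : listeners) {
                listener.postIndex(prepared, e); // listeners also see failures
            }
            throw e;
        }
        for (Listener listener : listeners) {
            listener.postIndex(prepared, result); // hook after a completed operation
        }
        return result;
    }
}
```

Keeping two postIndex overloads, one for a result and one for an exception, lets a statistics listener count failures separately while a slow-log style listener only looks at completed operations.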

[Runtime screenshot: preIndex]

[Runtime screenshot: postIndex]

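Listener notes:

IndexingSlowLog: records slow log entries. It logs write operations (index and update) whose duration exceeds a threshold, which is configured through the index-level index.indexing.slowlog.threshold.index.* settings. By analysing why operations end up in the slow log, developers can optimise write performance and raise system throughput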
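IndexingMemoryController: tracks the memory used by in-progress indexing operations. When usage exceeds the configured threshold, it relieves the pressure, which can include holding back some indexing operations until memory usage drops again, so that excessive memory use does not bring the system down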
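WatcherIndexingListener: part of the Elasticsearch Watcher module, in which users create rules that trigger alerts and configure the alert actions and recipients. This listener watches index operations on behalf of Watcher, so that information about the relevant events is recorded and watches are handled when they are written; when a rule's condition is met, Watcher notifies according to the configured actions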
InternalIndexingStats: records statistics about indexing operations

InternalIndexingStats records the following metrics:

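indexCount: total number of index requests
deleteCount: total number of delete requests
isAutoRefreshScheduled: whether an automatic refresh task has been scheduled
isPrimary: whether the shard is a primary shard
isReplica: whether the shard is a replica shard
operationCount: total number of index and delete operations
noops: number of no-op operations
throttled: number of operations restricted by throttling
throttleTime: total time spent waiting because of throttling
indexTime: total time of all index operations
indexCurrent: number of index operations currently executing
deleteTime: total time of all delete operations
deleteCurrent: number of delete operations currently executing
typesCount: number of index and delete operations grouped by type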

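InternalIndexingStats provides this information by tracking each kind of operation on every node and updating the metrics as operations run. The statistics can be used to monitor and diagnose indexing performance problems on a node.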

Elasticsearch 8.5.3 Source Code Analysis (5): Writing to the Primary Shard