Flink heartbeat timeout and Akka timeout

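Problem:

Several business jobs on site reported errors. The messages differ from job to job but fall into two broad categories, both of them timeouts: a heartbeat timeout and an Akka timeout.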

Error message (heartbeat timeout):

Heartbeat of TaskManager with id container_e02_1654737428229_0374_01_000002(emr-worker-2.cluster-310060:40533) timed out

Error message (Akka timeout):

Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: 'SourceEventWrapper[com.ververica.cdc.connectors.mysql.source.events.FinishedSnapshotSplitsRequestEvent@13985d56]', targetTask: Source: TableSourceScan(table=[[default_catalog, default_database, cdc_fm_kyc_material]], fields=[id, clientId, certType, usages, moduleId, orderIndex, material, materialType, value, warnMsg, version, submitIndex, compareChange, created, modified, flag]) -> Calc(select=[id, clientId, material, value, version], where=[(LIKE(moduleId, _UTF-16LE'%control%') AND (materialType = _UTF-16LE'text':VARCHAR(2147483647) CHARACTER SET "UTF-16LE") AND (CAST(flag) = 1))]) (1/1) - execution #3
... 33 more
Caused by: java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(null.sendOperatorEventToTask(ExecutionAttemptID, OperatorID, SerializedValue))] at recipient [akka.tcp://flink@emr-worker-2.cluster-310060:46510/user/rpc/taskmanager_0] timed out.
at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.sendOperatorEventToTask(RpcTaskManagerGateway.java:119)
at org.apache.flink.runtime.executiongraph.Execution.sendOperatorEvent(Execution.java:889)
at org.apache.flink.runtime.operators.coordination.ExecutionSubtaskAccess.lambda$createEventSendAction$1(ExecutionSubtaskAccess.java:67)
at org.apache.flink.runtime.operators.coordination.OperatorEventValve.callSendAction(OperatorEventValve.java:180)
at org.apache.flink.runtime.operators.coordination.OperatorEventValve.sendEvent(OperatorEventValve.java:94)
at org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$2(SubtaskGatewayImpl.java:98)
... 26 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@emr-worker-2.cluster-310060:46510/user/rpc/taskmanager_0#315327235]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply

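Investigation:

The first step was to look at what the jobs actually do. All of them are Flink SQL jobs, and what they have in common is an aggregation. They use regular (non-windowed) aggregation rather than window aggregation, and no state expiration time (TTL) is configured, so state keeps growing with time and data volume and memory was bound to become a problem. To confirm this, the tolerable number of checkpoint failures was set to 3 and the checkpoint timeout to 5 min, the jobs were restarted, and checkpoint status and memory usage were monitored. After roughly 30 min, heap usage exceeded 96% and checkpoints started to fail with timeouts. Drilling down showed that the timeouts occurred in the subtasks of the aggregation operator, which matched the hypothesis.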
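The checkpoint settings used in this test can be written down as Flink configuration keys. The following is a minimal Java sketch, assuming the jobs are configured through `flink-conf.yaml`-style options. The class and method names are hypothetical, and the `table.exec.state.ttl` value of 24 h is purely illustrative, since the jobs described here had no TTL configured at all.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: collects the settings discussed above as Flink configuration keys.
// Class and method names are hypothetical and not part of Flink.
class CheckpointSettingsSketch {

    // The two checkpoint settings applied during the investigation.
    static Map<String, String> checkpointSettings() {
        Map<String, String> conf = new LinkedHashMap<>();
        // Tolerate up to 3 failed checkpoints before the job fails over.
        conf.put("execution.checkpointing.tolerable-failed-checkpoints", "3");
        // Abort any checkpoint that takes longer than 5 minutes.
        conf.put("execution.checkpointing.timeout", "5 min");
        return conf;
    }

    // Assumption: an idle-state TTL for non-windowed aggregation.
    // The original jobs did NOT set this; the value is chosen by the caller.
    static Map<String, String> withStateTtl(Map<String, String> base, String ttl) {
        Map<String, String> conf = new LinkedHashMap<>(base);
        conf.put("table.exec.state.ttl", ttl);
        return conf;
    }

    // Renders the map in flink-conf.yaml style, one "key: value" per line.
    static String render(Map<String, String> conf) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : conf.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "24 h" is a hypothetical TTL used only for illustration.
        System.out.print(render(withStateTtl(checkpointSettings(), "24 h")));
    }
}
```

Whether a TTL is acceptable depends on the business logic: expired keys restart their aggregate from scratch, so results can differ from those of an unbounded aggregation.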

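Resolution:

Because the jobs do not have strict real-time requirements, the state backend was switched to RocksDB and the TaskManager managed memory was increased. After restarting, the jobs ran for two hours without errors.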
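As a rough illustration of this change, the sketch below assembles the two relevant configuration keys in plain Java. The class and method names are hypothetical, and the managed memory fraction of 0.6 is an assumption: the original does not say which value was used, only that managed memory was raised (Flink's default fraction is 0.4).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: the state backend and memory settings described above.
// Class and method names are hypothetical and not part of Flink.
class RocksDbSettingsSketch {

    // Builds the settings for a given managed memory fraction.
    // This sketch only accepts a fraction strictly between 0 and 1.
    static Map<String, String> rocksDbSettings(double managedFraction) {
        if (managedFraction <= 0.0 || managedFraction >= 1.0) {
            throw new IllegalArgumentException("fraction must be in (0, 1): " + managedFraction);
        }
        Map<String, String> conf = new LinkedHashMap<>();
        // Keep keyed state in RocksDB (off-heap and on disk) instead of on the JVM heap.
        conf.put("state.backend", "rocksdb");
        // RocksDB draws its memory from the TaskManager's managed memory.
        conf.put("taskmanager.memory.managed.fraction", String.valueOf(managedFraction));
        return conf;
    }

    public static void main(String[] args) {
        // 0.6 is an assumed value, not taken from the original incident.
        rocksDbSettings(0.6).forEach((k, v) -> System.out.println(k + ": " + v));
    }
}
```

The trade-off is latency: RocksDB serializes every state access and may read from disk, which is why this fix suits jobs without strict real-time requirements.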