Flink heartbeat timeouts and Akka timeouts


Problem:

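Several business jobs in production failed with different error messages. The errors fall into roughly two categories, and both are timeouts: heartbeat timeouts and Akka ask timeouts.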

Error message (heartbeat timeout):

Heartbeat of TaskManager with id container_e02_1654737428229_0374_01_000002(emr-worker-2.cluster-310060:40533) timed out

Error message (Akka timeout):

Caused by: org.apache.flink.util.FlinkException: An OperatorEvent from an OperatorCoordinator to a task was lost. Triggering task failover to ensure consistency. Event: 'SourceEventWrapper[com.ververica.cdc.connectors.mysql.source.events.FinishedSnapshotSplitsRequestEvent@13985d56]', targetTask: Source: TableSourceScan(table=[[default_catalog, default_database, cdc_fm_kyc_material]], fields=[id, clientId, certType, usages, moduleId, orderIndex, material, materialType, value, warnMsg, version, submitIndex, compareChange, created, modified, flag]) -> Calc(select=[id, clientId, material, value, version], where=[(LIKE(moduleId, _UTF-16LE'%control%') AND (materialType = _UTF-16LE'text':VARCHAR(2147483647) CHARACTER SET "UTF-16LE") AND (CAST(flag) = 1))]) (1/1) - execution #3
... 33 more
Caused by: java.util.concurrent.TimeoutException: Invocation of [RemoteRpcInvocation(null.sendOperatorEventToTask(ExecutionAttemptID, OperatorID, SerializedValue))] at recipient [akka.tcp://flink@emr-worker-2.cluster-310060:46510/user/rpc/taskmanager_0] timed out.
at org.apache.flink.runtime.jobmaster.RpcTaskManagerGateway.sendOperatorEventToTask(RpcTaskManagerGateway.java:119)
at org.apache.flink.runtime.executiongraph.Execution.sendOperatorEvent(Execution.java:889)
at org.apache.flink.runtime.operators.coordination.ExecutionSubtaskAccess.lambda$createEventSendAction$1(ExecutionSubtaskAccess.java:67)
at org.apache.flink.runtime.operators.coordination.OperatorEventValve.callSendAction(OperatorEventValve.java:180)
at org.apache.flink.runtime.operators.coordination.OperatorEventValve.sendEvent(OperatorEventValve.java:94)
at org.apache.flink.runtime.operators.coordination.SubtaskGatewayImpl.lambda$sendEvent$2(SubtaskGatewayImpl.java:98)
... 26 more
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@emr-worker-2.cluster-310060:46510/user/rpc/taskmanager_0#315327235]] after [10000 ms]. Message of type [org.apache.flink.runtime.rpc.messages.RemoteRpcInvocation]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply

Investigation:

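I started by looking at what the failing jobs do. All of them are Flink SQL jobs, and what they have in common is an aggregation: a regular group aggregation rather than a window aggregation, with no state expiration time (TTL) configured. State therefore keeps growing as time passes and data accumulates, so memory was the obvious suspect.

To confirm this, I set the tolerable number of checkpoint failures to 3 and the checkpoint timeout to 5 min, restarted the jobs, and watched checkpoint status and memory usage. After about 30 minutes, heap usage rose above 96% and checkpoints started failing with timeouts. Digging further, the timeouts occurred in the subtasks of the aggregation operator, which matched the hypothesis. A heap this full most likely also causes long GC pauses, during which the TaskManager cannot answer heartbeats or RPC calls in time, which would explain both timeout errors.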
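The growth pattern behind this diagnosis can be illustrated with a small toy model. This is only a sketch in plain Python, not Flink's implementation: `KeyedCountState`, `simulate`, and the `client-N` keys are hypothetical names made up for the illustration. It compares per-key aggregate state that is never expired with state that drops keys idle for longer than a TTL (in Flink SQL, idle state retention is typically configured with `table.exec.state.ttl`, assuming a version that supports that option).

```python
from collections import OrderedDict


class KeyedCountState:
    """Toy model of per-key COUNT state; not Flink's implementation."""

    def __init__(self, ttl_seconds=None):
        self.ttl_seconds = ttl_seconds  # None means entries are never expired
        # key -> (count, last_update_time), ordered from least to most recently updated
        self._state = OrderedDict()

    def update(self, key, now):
        """Increment the count for key at time `now` and return the new count."""
        count, _ = self._state.pop(key, (0, now))
        self._state[key] = (count + 1, now)  # re-insert at the "most recent" end
        self._expire(now)
        return count + 1

    def _expire(self, now):
        """Drop entries that have been idle for longer than the TTL."""
        if self.ttl_seconds is None:
            return
        while self._state:
            oldest_key, (_, last_seen) = next(iter(self._state.items()))
            if now - last_seen <= self.ttl_seconds:
                break
            del self._state[oldest_key]

    def size(self):
        return len(self._state)


def simulate(ttl_seconds, events=10_000):
    """Feed one previously unseen key per second and return the final state size."""
    state = KeyedCountState(ttl_seconds)
    for t in range(events):
        state.update(key=f"client-{t}", now=t)
    return state.size()


if __name__ == "__main__":
    print("entries without TTL:", simulate(None))
    print("entries with 60 s TTL:", simulate(60))
```

Without a TTL the number of entries equals the number of distinct keys ever seen, while with a TTL it stays bounded by the number of keys active within the retention window. A TTL trades memory for correctness: results for a key that reappears after its state has expired start again from scratch.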

Solution:

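Because these jobs do not have strict latency requirements, I switched the state backend to RocksDB, which keeps state off the JVM heap at the cost of slower state access, and increased the TaskManager managed memory. After restarting, the jobs ran for two hours without errors.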
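For reference, the settings mentioned in this post map onto standard Flink configuration options. The sketch below collects them in a Python dict and renders them as `key: value` lines suitable for `flink-conf.yaml`. The helper `render_flink_conf` is a hypothetical name used only here, and the managed memory value is a placeholder assumption, since the post does not say how much managed memory was actually configured.

```python
def render_flink_conf(settings):
    """Render a dict of Flink options as 'key: value' lines for flink-conf.yaml."""
    return "\n".join(f"{key}: {value}" for key, value in settings.items())


settings = {
    # Checkpoint settings used during the investigation
    "execution.checkpointing.timeout": "5 min",
    "execution.checkpointing.tolerable-failed-checkpoints": 3,
    # Fix: keep state in RocksDB instead of on the JVM heap
    "state.backend": "rocksdb",
    # Placeholder value (assumption): the post does not give the actual figure;
    # Flink's default fraction is 0.4
    "taskmanager.memory.managed.fraction": 0.6,
}

if __name__ == "__main__":
    print(render_flink_conf(settings))
```

With the RocksDB state backend, managed memory is what RocksDB uses for its caches and write buffers, so raising it gives the aggregation state more room outside the heap. The right value depends on the container size and should be tuned per job.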