Understanding Flink's Memory Calculation from a Cluster Startup Failure Log


Note: the Flink version discussed here is 1.10.

While starting a Flink cluster, I used the following memory configuration:

taskmanager.memory.process.size: 1728m
taskmanager.memory.managed.size: 0m
taskmanager.memory.task.heap.size: 1024m

Unfortunately, running the start-cluster.sh script failed with an error.

[Screenshot of the startup error (omitted)]

Prompted by this error, I took a closer look at how Flink distributes its memory.
If any of the analysis below is inaccurate or one-sided, please point it out in the comments so we can learn and improve together.

Basics

[Figure: detailed Flink TaskManager memory layout (omitted)]
The figure above shows the detailed memory layout of a Flink TaskManager.
Total Process Memory [taskmanager.memory.process.size]: declares how much memory is given to the Flink JVM process in total. It is mainly meant for containerized deployments (e.g. Kubernetes, YARN), where it corresponds to the size of the requested container.
Total Flink Memory [taskmanager.memory.flink.size]: expresses how much memory is allocated to Flink itself. This option is mainly meant for standalone deployments.
Framework Heap [taskmanager.memory.framework.heap.size]: the framework heap memory of the TaskExecutor process; the default is 128 mb. This is an advanced option that should not normally be changed.
Task Heap [taskmanager.memory.task.heap.size]: the heap memory available to Flink tasks.
Managed Memory [taskmanager.memory.managed.size]: the managed memory of the TaskExecutor. It is mainly used for sorting, hash tables and caching of intermediate results in batch jobs, and for the RocksDB state backend in streaming jobs.
Framework Off-Heap [taskmanager.memory.framework.off-heap.size]: the off-heap memory reserved for the TaskExecutor framework; the default is 128 mb. This is an advanced option that should not normally be changed.
Task Off-Heap [taskmanager.memory.task.off-heap.size]: the off-heap memory available to Flink tasks; the default is 0 bytes.
Network: memory used for shuffle data, e.g. network buffers. The related options are taskmanager.memory.network.fraction (default 0.1), taskmanager.memory.network.max (default 1 gb) and taskmanager.memory.network.min (default 64 mb).
JVM Metaspace [taskmanager.memory.jvm-metaspace.size]: the JVM metaspace size of the TaskExecutor process.
JVM Overhead: memory reserved for other JVM overhead such as thread stacks, the code cache, etc. The related options are taskmanager.memory.jvm-overhead.fraction (default 0.1), taskmanager.memory.jvm-overhead.max (default 1 gb) and taskmanager.memory.jvm-overhead.min (default 192 mb). (A minimal sketch of this fraction/min/max rule follows the list.)
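
The Network and JVM Overhead components share the same derivation rule: take a fraction of some base value and clamp the result into the configured [min, max] range. Below is a minimal sketch of that rule; the class and method names are mine (not Flink's API), and sizes are in megabytes for simplicity.

class MemoryFractionSketch {
    // Hypothetical helper illustrating "fraction of a base value, clamped to [min, max]",
    // the rule used for Network memory and JVM Overhead.
    static long deriveWithFraction(long baseMb, double fraction, long minMb, long maxMb) {
        long relative = (long) (baseMb * fraction);          // e.g. base * 0.1
        return Math.max(minMb, Math.min(maxMb, relative));   // clamp into [min, max]
    }

    public static void main(String[] args) {
        // Network memory for a 1000 mb base with the defaults listed above:
        // 1000 * 0.1 = 100 mb, already inside [64 mb, 1024 mb], so 100 mb is used.
        System.out.println(deriveWithFraction(1000, 0.1, 64, 1024)); // prints 100
    }
}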

Finally, when the TaskExecutor process is started, Flink sets the memory-related JVM parameters based on the configured or derived sizes of these memory components:

JVM Arguments                Value
-Xmx and -Xms                Framework Heap + Task Heap Memory
-XX:MaxDirectMemorySize      Framework Off-Heap + Task Off-Heap + Network Memory
-XX:MaxMetaspaceSize         JVM Metaspace
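
For example, with the configuration at the beginning of this article, framework heap stays at its 128 mb default and task heap is set to 1024 mb, so the TaskExecutor JVM heap would be sized at -Xmx = -Xms = 128 mb + 1024 mb = 1152 mb.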

The concrete values of these JVM parameters are printed in the log while the TaskExecutor starts, as shown below:

[Figure: TaskExecutor startup log showing the memory-related JVM parameters (omitted)]

Calculation Formulas and Source Code Analysis

Execution flow of the memory calculation

# step1 start-cluster.sh
# Start TaskManager instance(s)
TMSlaves start # TMSlaves is a function defined in config.sh

# step2 config.sh
# starts or stops TMs on all slaves
TMSlaves() {
    ...
        if [[ $? -ne 0 ]]; then
            for slave in ${SLAVES[@]}; do
                ssh -n $FLINK_SSH_OPTS $slave -- "nohup /bin/bash -l \"${FLINK_BIN_DIR}/taskmanager.sh\" \"${CMD}\" &"
            done
    ...
}

# step3 taskmanager.sh
  jvm_params_output=$(runBashJavaUtilsCmd GET_TM_RESOURCE_JVM_PARAMS "${FLINK_CONF_DIR}" "$FLINK_BIN_DIR/bash-java-utils.jar:$(findFlinkDistJar)" "${ARGS[@]}")

  dynamic_configs_output=$(runBashJavaUtilsCmd GET_TM_RESOURCE_DYNAMIC_CONFIGS ${FLINK_CONF_DIR} $FLINK_BIN_DIR/bash-java-utils.jar:$(findFlinkDistJar) "${ARGS[@]}")
  
# step4 config.sh
runBashJavaUtilsCmd() {
    ...
    local output=`${JAVA_RUN} -classpath "${class_path}" org.apache.flink.runtime.util.BashJavaUtils ${cmd} --configDir "${conf_dir}" $dynamic_args 2>&1 | tail -n 1000`
    ...
    echo "$output"
}

From the shell call stack above, we can see that what ultimately gets executed is the org.apache.flink.runtime.util.BashJavaUtils class.

public class BashJavaUtils {

	private static final String EXECUTION_PREFIX = "BASH_JAVA_UTILS_EXEC_RESULT:";

	public static void main(String[] args) throws Exception {
		checkArgument(args.length > 0, "Command not specified.");

		switch (Command.valueOf(args[0])) {
			case GET_TM_RESOURCE_DYNAMIC_CONFIGS:
				getTmResourceDynamicConfigs(args);
				break;
			case GET_TM_RESOURCE_JVM_PARAMS:
				getTmResourceJvmParams(args);
				break;
			default:
				// unexpected, Command#valueOf should fail if a unknown command is passed in
				throw new RuntimeException("Unexpected, something is wrong.");
		}
	}
        private static void getTmResourceDynamicConfigs(String[] args) throws Exception {
		Configuration configuration = getConfigurationForStandaloneTaskManagers(args);
		TaskExecutorProcessSpec taskExecutorProcessSpec = TaskExecutorProcessUtils.processSpecFromConfig(configuration);
		System.out.println(EXECUTION_PREFIX + TaskExecutorProcessUtils.generateDynamicConfigsStr(taskExecutorProcessSpec));
	}

	private static void getTmResourceJvmParams(String[] args) throws Exception {
		Configuration configuration = getConfigurationForStandaloneTaskManagers(args);
		TaskExecutorProcessSpec taskExecutorProcessSpec = TaskExecutorProcessUtils.processSpecFromConfig(configuration);
		System.out.println(EXECUTION_PREFIX + TaskExecutorProcessUtils.generateJvmParametersStr(taskExecutorProcessSpec));
	}
        ...
}

In BashJavaUtils, a different method is called depending on whether the JVM parameters (GET_TM_RESOURCE_JVM_PARAMS) or the Flink TaskExecutor memory components (GET_TM_RESOURCE_DYNAMIC_CONFIGS) are requested. In practice both branches run the same derivation code with the same logic; presumably the two commands exist so that the memory-related JVM options and the sizes of the TaskExecutor memory components can be printed separately.
Take getTmResourceDynamicConfigs() as an example:

private static void getTmResourceDynamicConfigs(String[] args) throws Exception {
	// Handles backwards compatibility with the Flink 1.9 memory options (not covered here)
	Configuration configuration = getConfigurationForStandaloneTaskManagers(args);

	// Parse flink-conf.yaml and derive the memory allocation
	TaskExecutorProcessSpec taskExecutorProcessSpec = TaskExecutorProcessUtils.processSpecFromConfig(configuration);

	System.out.println(EXECUTION_PREFIX + TaskExecutorProcessUtils.generateDynamicConfigsStr(taskExecutorProcessSpec));
}

TaskExecutorProcessUtils.processSpecFromConfig(configuration) parses flink-conf.yaml and derives the sizes of the Flink TaskExecutor memory components.

public static TaskExecutorProcessSpec processSpecFromConfig(final Configuration config) {
	if (isTaskHeapMemorySizeExplicitlyConfigured(config) && isManagedMemorySizeExplicitlyConfigured(config)) {
		// both task heap memory and managed memory are configured, use these to derive total flink memory
		return deriveProcessSpecWithExplicitTaskAndManagedMemory(config);
	} else if (isTotalFlinkMemorySizeExplicitlyConfigured(config)) {
		// either of task heap memory and managed memory is not configured, total flink memory is configured,
		// derive from total flink memory
		return deriveProcessSpecWithTotalFlinkMemory(config);
	} else if (isTotalProcessMemorySizeExplicitlyConfigured(config)) {
		// total flink memory is not configured, total process memory is configured,
		// derive from total process memory
		return deriveProcessSpecWithTotalProcessMemory(config);
	} else {
		throw new IllegalConfigurationException(String.format("Either Task Heap Memory size (%s) and Managed Memory size (%s), or Total Flink"
			+ " Memory size (%s), or Total Process Memory size (%s) need to be configured explicitly.",
			TaskManagerOptions.TASK_HEAP_MEMORY.key(),
			TaskManagerOptions.MANAGED_MEMORY_SIZE.key(),
			TaskManagerOptions.TOTAL_FLINK_MEMORY.key(),
			TaskManagerOptions.TOTAL_PROCESS_MEMORY.key()));
	}
}

From the code above it is clear that how the Flink memory components are calculated depends on what the user has configured. There are three cases, in priority order from top to bottom (with the configuration at the beginning of this article, where both task heap memory and managed memory are set explicitly, the first branch is taken):

  • task heap memory and managed memory are set: derive network memory and total Flink memory
  • total Flink memory is set: derive managed memory and network memory, plus task heap memory if it is not set explicitly
  • total process memory is set: derive JVM overhead, managed memory, network memory and task heap memory

Memory configuration and calculation

Case 1: task heap memory and managed memory are set; derive network memory and total Flink memory

  • step 1: read task heap memory, managed memory, framework heap, framework off-heap and task off-heap from the configuration and sum them up; the sum is recorded as totalFlinkExcludeNetworkMemorySize (a worked example follows the code below)
final MemorySize taskHeapMemorySize = getTaskHeapMemorySize(config);
final MemorySize managedMemorySize = getManagedMemorySize(config);

final MemorySize frameworkHeapMemorySize = getFrameworkHeapMemorySize(config);  // 128m
final MemorySize frameworkOffHeapMemorySize = getFrameworkOffHeapMemorySize(config); // 128m
final MemorySize taskOffHeapMemorySize = getTaskOffHeapMemorySize(config); // 0m

final MemorySize networkMemorySize;
final MemorySize totalFlinkExcludeNetworkMemorySize =
		frameworkHeapMemorySize.add(frameworkOffHeapMemorySize).add(taskHeapMemorySize).add(taskOffHeapMemorySize).add(managedMemorySize);
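Plugging in the configuration from the beginning of this article (task heap 1024 mb, managed 0 mb) together with the defaults noted earlier (framework heap 128 mb, framework off-heap 128 mb, task off-heap 0 mb):

totalFlinkExcludeNetworkMemorySize = 128 + 128 + 1024 + 0 + 0 = 1280 mb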
  • step 2: calculate the network memory size
  • step 2.1: if, on top of task heap memory and managed memory, total flink memory is also configured, network memory can be derived directly from it: network memory = total flink memory - totalFlinkExcludeNetworkMemorySize
// derive network memory from total flink memory, and check against network min/max
final MemorySize totalFlinkMemorySize = getTotalFlinkMemorySize(config);
networkMemorySize = totalFlinkMemorySize.subtract(totalFlinkExcludeNetworkMemorySize);
  • step 2.2: if task heap memory and managed memory are configured but total flink memory is not, network memory is computed differently: totalFlinkExcludeNetworkMemorySize * (network.fraction / (1 - network.fraction)). The inverse fraction works because network memory should make up network.fraction of total flink memory, while totalFlinkExcludeNetworkMemorySize accounts for the remaining (1 - network.fraction) of it. Finally, the derived network memory is checked against network.min and network.max and clamped to the nearest bound if it falls outside that range (see the worked example after the code below);
final MemorySize relative = base.multiply(rangeFraction.fraction / (1 - rangeFraction.fraction));
capToMinMax(memoryDescription, relative, rangeFraction);
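Continuing the example (total flink memory not configured): network memory = 1280 mb * (0.1 / 0.9) ≈ 142.2 mb, which lies inside [64 mb, 1 gb], so total flink memory = 1280 + 142.2 ≈ 1422.2 mb.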
  • step 3: calculate the JVM overhead
  • step 3.1: if, on top of the above, total process memory is also configured, jvm-overhead can be derived from the configured jvm metaspace and total process memory: jvm-overhead = total process memory - (total flink memory + jvm metaspace)
final MemorySize jvmMetaspaceSize = getJvmMetaspaceSize(config); 
final MemorySize totalFlinkAndJvmMetaspaceSize = totalFlinkMemorySize.add(jvmMetaspaceSize);
final MemorySize jvmOverheadSize = totalProcessMemorySize.subtract(totalFlinkAndJvmMetaspaceSize);
  • step 3.2: of course, total process memory may not be configured. In that case jvm metaspace is read from the configuration and jvm-overhead is obtained from the formula (total flink memory + jvm metaspace) * (jvm-overhead.fraction / (1 - jvm-overhead.fraction)). Finally, the result is checked against jvm-overhead.min and jvm-overhead.max and clamped if it falls outside that range (see the note after the code below).
final MemorySize jvmMetaspaceSize = getJvmMetaspaceSize(config); 
final MemorySize totalFlinkAndJvmMetaspaceSize = totalFlinkMemorySize.add(jvmMetaspaceSize);
final MemorySize jvmOverheadSize = deriveJvmOverheadWithInverseFraction(config, totalFlinkAndJvmMetaspaceSize);
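
To finish the example: with taskmanager.memory.process.size set to 1728m, as in the configuration at the top of this article, step 3.1 applies and jvm-overhead = 1728 mb - (1422.2 mb + jvm metaspace). Whether that result lands inside [jvm-overhead.min, jvm-overhead.max] = [192 mb, 1 gb] depends on the jvm metaspace value of the exact 1.10 release in use; if a derived component falls outside its allowed range, processSpecFromConfig rejects the configuration with an IllegalConfigurationException and start-cluster.sh fails. This is the kind of check that can reject a combination like the one at the top of this article.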

Case 2: total Flink memory is set; derive managed memory and network memory

  • step 1: read total flink memory from the configuration
final MemorySize totalFlinkMemorySize = getTotalFlinkMemorySize(config);
  • step 2: calculate managed memory and network memory
  • step 2.1: if task heap memory is also specified explicitly, read framework heap, framework off-heap, task off-heap and task heap from the configuration, then compute managed memory and network memory (see the comments below; a worked example follows the code);
final MemorySize frameworkHeapMemorySize = getFrameworkHeapMemorySize(config);
final MemorySize frameworkOffHeapMemorySize = getFrameworkOffHeapMemorySize(config);
final MemorySize taskOffHeapMemorySize = getTaskOffHeapMemorySize(config);
final MemorySize taskHeapMemorySize = getTaskHeapMemorySize(config);

// If managed memory size is configured, take it from the configuration directly; otherwise derive it from the managed fraction
final MemorySize managedMemorySize = deriveManagedMemoryAbsoluteOrWithFraction(config, totalFlinkMemorySize);

final MemorySize totalFlinkExcludeNetworkMemorySize =
				frameworkHeapMemorySize.add(frameworkOffHeapMemorySize).add(taskHeapMemorySize).add(taskOffHeapMemorySize).add(managedMemorySize);

// network memory = total flink memory - totalFlinkExcludeNetworkMemorySize
final MemorySize networkMemorySize = totalFlinkMemorySize.subtract(totalFlinkExcludeNetworkMemorySize);
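A worked example for step 2.1, assuming the default managed fraction of 0.4: with total flink memory = 1600 mb and task heap = 512 mb, managed memory = 1600 * 0.4 = 640 mb, and network memory = 1600 - (128 + 128 + 512 + 0 + 640) = 192 mb, which is inside [64 mb, 1 gb].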
  • step 2.2: if task heap memory is not specified explicitly, managed memory is derived via managed.fraction and network memory via network.fraction [managed memory = total flink memory * managed.fraction; network memory = total flink memory * network.fraction]; task heap memory is then whatever remains (see the worked example after the code below).
managedMemorySize = deriveManagedMemoryAbsoluteOrWithFraction(config, totalFlinkMemorySize); 
networkMemorySize = deriveNetworkMemoryWithFraction(config, totalFlinkMemorySize);
final MemorySize totalFlinkExcludeTaskHeapMemorySize =
		frameworkHeapMemorySize.add(frameworkOffHeapMemorySize).add(taskOffHeapMemorySize).add(managedMemorySize).add(networkMemorySize);
taskHeapMemorySize = totalFlinkMemorySize.subtract(totalFlinkExcludeTaskHeapMemorySize);
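A worked example for step 2.2 with only total flink memory = 1600 mb configured (again assuming the default fractions of 0.4 for managed and 0.1 for network): managed memory = 1600 * 0.4 = 640 mb, network memory = 1600 * 0.1 = 160 mb (inside [64 mb, 1 gb]), and task heap memory = 1600 - (128 + 128 + 0 + 640 + 160) = 544 mb.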
  • step 3: calculate the JVM overhead
  • step 3.1: if total process memory is configured, then jvm-overhead = total process memory - (total flink memory + jvm metaspace)
final MemorySize jvmMetaspaceSize = getJvmMetaspaceSize(config);
final MemorySize totalFlinkAndJvmMetaspaceSize = totalFlinkMemorySize.add(jvmMetaspaceSize);
final MemorySize totalProcessMemorySize = getTotalProcessMemorySize(config);

final MemorySize jvmOverheadSize = totalProcessMemorySize.subtract(totalFlinkAndJvmMetaspaceSize);
  • step 3.2: if total process memory is not specified, jvm-overhead is obtained from the formula (total flink memory + jvm metaspace) * (jvm-overhead.fraction / (1 - jvm-overhead.fraction)), again bounded by jvm-overhead.min and jvm-overhead.max (a worked continuation of the example follows the code below)
final MemorySize jvmMetaspaceSize = getJvmMetaspaceSize(config);
final MemorySize totalFlinkAndJvmMetaspaceSize = totalFlinkMemorySize.add(jvmMetaspaceSize);

final MemorySize jvmOverheadSize = deriveJvmOverheadWithInverseFraction(config, totalFlinkAndJvmMetaspaceSize);
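
Continuing that example for step 3.2, and assuming purely for illustration a jvm metaspace of 96 mb: jvm-overhead = (1600 + 96) * (0.1 / 0.9) ≈ 188.4 mb, which is below the 192 mb minimum and is therefore raised to 192 mb, giving a total process memory of 1600 + 96 + 192 = 1888 mb.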

Case 3: total process memory is set; derive JVM overhead, managed memory, network memory and task heap memory

  • step 1: read total process memory and jvm metaspace from the configuration, and derive jvm-overhead from jvm-overhead.fraction [jvm-overhead = total process memory * jvm-overhead.fraction, bounded by jvm-overhead.min and jvm-overhead.max]
final MemorySize totalProcessMemorySize = getTotalProcessMemorySize(config);
final MemorySize jvmMetaspaceSize = getJvmMetaspaceSize(config);
final MemorySize jvmOverheadSize = deriveJvmOverheadWithFraction(config, totalProcessMemorySize);
  • step 2: derive total flink memory [total flink memory = total process memory - jvm metaspace - jvm-overhead]
final MemorySize totalFlinkMemorySize = totalProcessMemorySize.subtract(jvmMetaspaceAndOverhead.getTotalJvmMetaspaceAndOverheadSize());
  • step 3: read framework heap, framework off-heap and task off-heap from the configuration, then derive the sizes of the remaining Flink memory components.
final MemorySize frameworkHeapMemorySize = getFrameworkHeapMemorySize(config);
final MemorySize frameworkOffHeapMemorySize = getFrameworkOffHeapMemorySize(config);
final MemorySize taskOffHeapMemorySize = getTaskOffHeapMemorySize(config);
  • step 3.1: if task heap memory is specified explicitly, read it from the configuration, then derive managed memory, and finally compute network memory (see the comments below)
taskHeapMemorySize = getTaskHeapMemorySize(config); 

// If managed memory size is configured, take it directly; otherwise derive it as total flink memory * managed.fraction
managedMemorySize = deriveManagedMemoryAbsoluteOrWithFraction(config, totalFlinkMemorySize);
final MemorySize totalFlinkExcludeNetworkMemorySize =
frameworkHeapMemorySize.add(frameworkOffHeapMemorySize).add(taskHeapMemorySize).add(taskOffHeapMemorySize).add(managedMemorySize);

// network memory = total flink memory - totalFlinkExcludeNetworkMemorySize
networkMemorySize = totalFlinkMemorySize.subtract(totalFlinkExcludeNetworkMemorySize);
  • step 3.2: if task heap memory is not configured, first derive managed memory and network memory; task heap memory is whatever remains (see the comments below; a full worked example follows at the end).
// If managed memory size is configured, take it directly; otherwise derive it as total flink memory * managed.fraction
managedMemorySize = deriveManagedMemoryAbsoluteOrWithFraction(config, totalFlinkMemorySize);

// Network memory is derived from total flink memory and network.fraction, bounded by network.min/max
networkMemorySize = deriveNetworkMemoryWithFraction(config, totalFlinkMemorySize);

final MemorySize totalFlinkExcludeTaskHeapMemorySize =
		frameworkHeapMemorySize.add(frameworkOffHeapMemorySize).add(taskOffHeapMemorySize).add(managedMemorySize).add(networkMemorySize);

// task heap memory = total flink memory - totalFlinkExcludeTaskHeapMemorySize
taskHeapMemorySize = totalFlinkMemorySize.subtract(totalFlinkExcludeTaskHeapMemorySize);
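
Putting case 3 together with a concrete example: suppose only taskmanager.memory.process.size: 1728m is set, and assume for illustration the defaults jvm metaspace = 96 mb, managed fraction = 0.4 and network fraction = 0.1. Then jvm-overhead = 1728 * 0.1 = 172.8 mb, raised to the 192 mb minimum; total flink memory = 1728 - 96 - 192 = 1440 mb; managed memory = 1440 * 0.4 = 576 mb; network memory = 1440 * 0.1 = 144 mb (inside [64 mb, 1 gb]); and task heap memory = 1440 - (128 + 128 + 0 + 576 + 144) = 464 mb. The exact numbers will differ if your release ships different defaults, but the derivation order is the same.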