Conclusion: when the Azkaban executor-selector filters include MinimumFreeMemory, only executors with more than 6 GB of free memory are eligible to run a flow. If no executor has more than 6 GB free, executor assignment fails and the flow does not run; it stays stuck until free memory climbs back above 6 GB and an executor can be assigned again.
Analysis: the problem first appeared on August 4th. Looking at the logs at the time, I found the following:
[ExecutorManager] [Azkaban] Reached handleNoExecutorSelectedCase stage for exec 4672839 with error count 0
Searching the logs for "Reached handleNoExecutorSelectedCase stage" turned up many hits on each of the previous three days, at no fixed times. There was no network jitter during that period, and iowait, CPU usage, and memory on the hosts were all fairly low.
A web search turned up a report from someone along these lines:
Azkaban records the next scheduled run time of each flow, so if an executor host's clock runs behind the web server's, this problem can occur. Combined with what I knew so far this seemed quite plausible. Checking the NTP configuration on one of the executor machines, I found it was, surprisingly, pointed at a non-internal NTP server, 0.asia.pool.ntp.org; pinging that address showed 180+ ms latency, which made me even more convinced that a failed time sync was the cause. I promptly switched it to the internal NTP service and restarted the executor after the change.
Two weeks went by, and on the morning of the 20th a colleague reported that the delay problem was back and still ongoing. I first checked the system time, which was correct, and the log was again full of Reached handleNoExecutorSelectedCase stage errors. Restarting the web server did not help; restarting the executor did, and things went back to normal. At that point it dawned on me that the two problem-free weeks were probably thanks to restarting the executor, not the NTP change. Facepalm...
Since I had already dug into this problem once without finding a good answer, this time I decided to go through the source code to find the root cause. I really don't know Java, and with all the dependencies involved it was heavy going, but fortunately Azkaban has a large user base and plenty of people have analyzed its source; I found two good write-ups online.
I focused on the dispatch path, which roughly goes: the web server writes the execution to the DB ---> the web server notifies an executor that there is a flow to run (in multi-executor mode it first picks the target host according to the configured filters and comparator weights, and this is the step where our problem occurred) ---> the executor receives the request, reads the execution parameters from the DB, and launches the job. The relevant configuration is sketched below.
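For context, dispatch in multi-executor mode is driven by a handful of settings in the web server's azkaban.properties. The snippet below is an illustrative sketch based on the commonly documented multi-executor setup, not a copy of our production file; the comparator names and weights in particular are examples:

```properties
# Run in multi-executor mode so the web server dispatches flows to a pool of executors
azkaban.use.multiple.executors=true

# Filters applied to candidate executors (this is the setting changed at the end of this post)
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus

# Comparator weights used to rank the executors that survive the filters
azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
azkaban.executorselector.comparator.Memory=1
azkaban.executorselector.comparator.LastDispatched=1
azkaban.executorselector.comparator.CpuUsage=1
```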
According to the analysis articles, the code responsible for the error in the logs is:
```java
private void selectExecutorAndDispatchFlow(ExecutionReference reference,
    ExecutableFlow exflow, Set<Executor> availableExecutors)
    throws ExecutorManagerException {
  synchronized (exflow) {
    Executor selectedExecutor = selectExecutor(exflow, availableExecutors);
    if (selectedExecutor != null) {
      try {
        dispatch(reference, exflow, selectedExecutor);
      } catch (ExecutorManagerException e) {
        logger.warn(String.format(
            "Executor %s responded with exception for exec: %d",
            selectedExecutor, exflow.getExecutionId()), e);
        handleDispatchExceptionCase(reference, exflow, selectedExecutor,
            availableExecutors);
      }
    } else {
      // No executor passed the selection step -- this is the branch that produced our log entries
      handleNoExecutorSelectedCase(reference, exflow);
    }
  }
}
```
When the if (selectedExecutor != null) check fails, handleNoExecutorSelectedCase is what ends up in the log, so let's follow selectedExecutor:
Executor selectedExecutor = selectExecutor(exflow, availableExecutors);
Stepping into selectExecutor:
```java
private Executor selectExecutor(ExecutableFlow exflow,
    Set<Executor> availableExecutors) {
  Executor choosenExecutor =
      getUserSpecifiedExecutor(exflow.getExecutionOptions(),
          exflow.getExecutionId());

  // If no executor was specified by admin
  if (choosenExecutor == null) {
    logger.info("Using dispatcher for execution id :"
        + exflow.getExecutionId());
    ExecutorSelector selector = new ExecutorSelector(filterList, comparatorWeightsMap);
    choosenExecutor = selector.getBest(availableExecutors, exflow);
  }
  return choosenExecutor;
}
```
We do not pin flows to a specific executor, so execution falls into the if (choosenExecutor == null) branch; the "Using dispatcher for execution id xxx" entries in the log confirm this.
Next come these two lines:
ExecutorSelector selector = new ExecutorSelector(filterList, comparatorWeightsMap);
choosenExecutor = selector.getBest(availableExecutors, exflow);
They appear to pick a suitable executor by combining the filters and comparator weights from the configuration. My first guess at this point was that the active column of the executors table in the database had somehow been updated (that table is where availableExecutors comes from). But the log still contained the entries below, which argued against that:
```
2017/08/20 01:26:09.990 +0800 INFO [ExecutorManager] [Azkaban] Successfully refreshed executor: 10.10.221.204:12321 (id: 1) with executor info : {
  "remainingMemoryPercent" : 76.44902066392035,
  "remainingMemoryInMB" : 6104,
  "remainingFlowCapacity" : 29,
  "numberOfAssignedFlows" : 1,
  "lastDispatchedTime" : 1503153050310,
  "cpuUsage" : 0.0
}
2017/08/20 01:26:09.992 +0800 INFO [ExecutorManager] [Azkaban] Successfully refreshed executor: 10.10.4.50:12321 (id: 7) with executor info : {
  "remainingMemoryPercent" : 76.91752649941591,
  "remainingMemoryInMB" : 6141,
  "remainingFlowCapacity" : 22,
  "numberOfAssignedFlows" : 8,
  "lastDispatchedTime" : 1503163310196,
  "cpuUsage" : 0.08
}
```
To be sure, I later checked the binlog as well; it confirmed there had been no UPDATE operations on the executors table.
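For reference, the same thing can be checked directly against the web server's metadata database; this is a hypothetical quick query (column names follow the standard Azkaban MySQL schema, database name omitted):

```sql
-- Executors the web server currently considers dispatchable (active = 1)
SELECT id, host, port, active
FROM executors
ORDER BY id;
```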
So it was back to tracing the code. Skipping over the long slog, I eventually arrived at this:
```java
/**
 * <pre>
 * function to register the static Minimum Reserved Memory filter.
 * NOTE : this is a static filter which means the filter will be filtering based on the system
 * standard which is not
 * Coming for the passed flow.
 * This filter will filter out any executors that has the remaining memory below 6G
 * </pre>
 */
private static FactorFilter<Executor, ExecutableFlow> getMinimumReservedMemoryFilter() {
  return FactorFilter
      .create(MINIMUMFREEMEMORY_FILTER_NAME, new FactorFilter.Filter<Executor, ExecutableFlow>() {
        private static final int MINIMUM_FREE_MEMORY = 6 * 1024;

        @Override
        public boolean filterTarget(final Executor filteringTarget,
            final ExecutableFlow referencingObject) {
          if (null == filteringTarget) {
            logger.debug(String.format("%s : filtering out the target as it is null.",
                MINIMUMFREEMEMORY_FILTER_NAME));
            return false;
          }

          final ExecutorInfo stats = filteringTarget.getExecutorInfo();
          if (null == stats) {
            logger.debug(String.format("%s : filtering out %s as it's stats is unavailable.",
                MINIMUMFREEMEMORY_FILTER_NAME,
                filteringTarget.toString()));
            return false;
          }
          return stats.getRemainingMemoryInMB() > MINIMUM_FREE_MEMORY;
        }
      });
}
```
The key sentence is in the Javadoc: "This filter will filter out any executors that has the remaining memory below 6G". An executor qualifies only when its free memory is above 6 GB, i.e. 6 x 1024 = 6144 MB. Going back over the executor info recorded when dispatch failed (I went through essentially all of the failure-time logs and a sample of the success-time logs), both machines were indeed below 6 GB of free memory, and whenever remainingMemoryInMB was above 6144 an executor was assigned successfully (a quick numeric check follows the configuration change below). With the problem finally pinned down, the fix is to change the filter setting from
azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
to
azkaban.executorselector.filters=StaticRemainingFlowSize,CpuStatus
and restart the web server so the new configuration takes effect.
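As a sanity check on the threshold, here is a minimal standalone sketch (not Azkaban code) that applies the same comparison the filter performs to the remainingMemoryInMB values from the failure-time log entries above:

```java
public class MinimumFreeMemoryCheck {

    // Same threshold as the MinimumFreeMemory filter: 6 * 1024 MB
    private static final int MINIMUM_FREE_MEMORY_MB = 6 * 1024;

    // Mirrors the filter's predicate: free memory must be strictly greater than 6144 MB
    static boolean passesFilter(long remainingMemoryInMB) {
        return remainingMemoryInMB > MINIMUM_FREE_MEMORY_MB;
    }

    public static void main(String[] args) {
        // Values taken from the "Successfully refreshed executor" log entries above
        long[] observed = {6104, 6141};
        for (long mb : observed) {
            System.out.printf("remainingMemoryInMB=%d -> passes filter: %b%n", mb, passesFilter(mb));
        }
        // Both print false, which is exactly why no executor could be selected
    }
}
```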
PS: a word on the other two filters, since if the handleNoExecutorSelectedCase error shows up again it will almost certainly be because of one of them. StaticRemainingFlowSize limits concurrent flows: each executor can run 30 flows at the same time, and if every executor is full, assignment fails. CpuStatus fails an executor when its cpuUsage is greater than or equal to 95.
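Based on the behavior just described, the checks done by those two filters boil down to simple predicates over the same ExecutorInfo statistics shown earlier. The sketch below is illustrative only, not the actual Azkaban implementation:

```java
public class OtherFilterChecks {

    // StaticRemainingFlowSize: the executor must still have free flow slots
    // (remainingFlowCapacity was 29 and 22 in the log entries above)
    static boolean passesStaticRemainingFlowSize(int remainingFlowCapacity) {
        return remainingFlowCapacity > 0;
    }

    // CpuStatus: dispatch fails once cpuUsage reaches 95 or more
    static boolean passesCpuStatus(double cpuUsage) {
        return cpuUsage < 95;
    }

    public static void main(String[] args) {
        System.out.println(passesStaticRemainingFlowSize(0));   // false: all flow slots in use
        System.out.println(passesStaticRemainingFlowSize(22));  // true
        System.out.println(passesCpuStatus(96.0));              // false: CPU too busy
        System.out.println(passesCpuStatus(0.08));              // true
    }
}
```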
Follow-up: set up monitoring of Azkaban executions. The simplest approach for now is a flow that runs every minute with an alert on failure, but that would not have caught this incident, since the flow never entered the execution pipeline at all; monitoring from other angles is needed as well.
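One cheap extra signal is to watch the web server log itself for the error string from this incident. The sketch below is a hypothetical example (the log path is a placeholder and the "alert" is just a non-zero exit code) that could be run from cron and wired into an alerting system; a real version would also need to handle log rotation or track a file offset instead of rescanning the whole file:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class DispatchErrorMonitor {

    // Hypothetical log location -- adjust for your deployment
    private static final Path WEB_SERVER_LOG =
            Paths.get("/opt/azkaban/web-server/logs/azkaban-webserver.log");

    private static final String ERROR_MARKER = "Reached handleNoExecutorSelectedCase stage";

    public static void main(String[] args) throws IOException {
        long hits;
        try (Stream<String> lines = Files.lines(WEB_SERVER_LOG, StandardCharsets.UTF_8)) {
            hits = lines.filter(line -> line.contains(ERROR_MARKER)).count();
        }
        if (hits > 0) {
            // Replace with a real alert: mail, IM webhook, monitoring agent, ...
            System.err.println("ALERT: " + hits + " dispatch failures found in " + WEB_SERVER_LOG);
            System.exit(1);
        }
        System.out.println("OK: no dispatch failures found in " + WEB_SERVER_LOG);
    }
}
```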