Elastic-Job Notes


1. Registry center data structure (from the official documentation)

Under the configured namespace, the registry center creates a node named after the job, which is used to distinguish different jobs. Once a job is created its name cannot be changed; changing the name is treated as creating a new job. The job-name node contains five data child nodes: config, instances, sharding, servers, and leader.

config node

The job configuration, stored in JSON format.

instances node

Information about running job instances. Each child node is the primary key of a currently running job instance, composed of the IP address of the server running the job and its PID. These instance nodes are ephemeral: they are registered when a job instance comes online and cleaned up automatically when it goes offline. The registry center watches these nodes to coordinate sharding and high availability of the distributed job. Writing TRIGGER to a job instance node makes that instance execute once immediately.
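
For example, an immediate run can be triggered by writing TRIGGER to an instance node with a ZooKeeper client such as Curator. This is only a sketch: the connect string, namespace, and instance id below are placeholders, and the "ip@-@pid" instance id format should be checked against the actual children of the instances node.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class TriggerJobInstance {
    public static void main(String[] args) throws Exception {
        // Connect to the same ZooKeeper cluster and namespace that the job registry uses.
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("zk_host:2181")
                .namespace("elastic-job-demo")
                .retryPolicy(new ExponentialBackoffRetry(1000, 3))
                .build();
        client.start();
        // Write TRIGGER to one running instance's node; the instance id here is a placeholder.
        client.setData().forPath("/demoSimpleJob/instances/192.168.0.1@-@1234", "TRIGGER".getBytes());
        client.close();
    }
}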

sharding node

Job sharding information. Each child node is a sharding item index, starting at zero and going up to the total shard count minus one. Each sharding item index in turn has child nodes that store detailed information, used to control and record the running state of that shard. Details of these nodes:

They can be viewed directly in the ShardingNode class.

servers node

Job server information. Each child node is the IP address of a job server. Writing DISABLED to an IP address node disables that server. Under the new cloud-native architecture the servers node has been greatly weakened and now only carries the ability to disable a server. To keep the job core pure, the servers feature may be removed in the future, with the ability to disable a server pushed down to the automated deployment system.
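
Disabling a server works the same way; with the Curator client from the previous sketch (the IP address is again a placeholder):

// Write DISABLED to the server's IP node to disable that job server.
client.setData().forPath("/demoSimpleJob/servers/192.168.0.1", "DISABLED".getBytes());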

leader node

Information about the job's leader server, with three child nodes: election, sharding, and failover, used for leader election, sharding, and failover handling respectively. The leader node is used internally by the framework; if you are not interested in how the framework works, you can ignore it.

It can be viewed directly in the LeaderNode class, which needs to be read together with ShardingNode: the processing and necessary child nodes are placed in ShardingNode in the code:
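
As a rough sketch of how these leader-related paths are laid out (constant names and values are recalled from the elastic-job-lite 2.x source and may differ slightly from the real LeaderNode and ShardingNode classes):

// Illustrative sketch only, not the actual framework classes.
public final class LeaderNodeSketch {
    
    static final String ROOT = "leader";
    static final String ELECTION_ROOT = ROOT + "/election";                  // leader election
    static final String INSTANCE = ELECTION_ROOT + "/instance";              // job instance id of the current leader
    static final String LATCH = ELECTION_ROOT + "/latch";                    // distributed lock used during election
    static final String SHARDING_NECESSARY = ROOT + "/sharding/necessary";   // resharding-required flag (kept in ShardingNode)
    static final String SHARDING_PROCESSING = ROOT + "/sharding/processing"; // resharding-in-progress flag (kept in ShardingNode)
}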

2. Job startup source code analysis

We'll use an example from the official documentation to walk through the Elastic-Job source code:

public class JobDemo {
    public static void main(String[] args) {
        new JobScheduler(createRegistryCenter(), createJobConfiguration()).init(); // Initialize the job scheduler.
    }

    private static CoordinatorRegistryCenter createRegistryCenter() {
        CoordinatorRegistryCenter regCenter = new ZookeeperRegistryCenter(new ZookeeperConfiguration("zk_host:2181", "elastic-job-demo"));
        regCenter.init(); // Initialize the ZooKeeper registry center.
        return regCenter;
    }

    private static LiteJobConfiguration createJobConfiguration() {
        // Define the core job configuration
        JobCoreConfiguration simpleCoreConfig = JobCoreConfiguration.newBuilder("demoSimpleJob", "0/15 * * * * ?", 10).build();
        // Define the SIMPLE job type configuration
        SimpleJobConfiguration simpleJobConfig = new SimpleJobConfiguration(simpleCoreConfig, "com.test.SimpleDemoJob");
        return LiteJobConfiguration.newBuilder(simpleJobConfig).build();
    }
}
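
The demo references com.test.SimpleDemoJob but does not show it; a minimal SimpleJob implementation might look like the following sketch (package names as in elastic-job 2.x):

import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;

public class SimpleDemoJob implements SimpleJob {
    
    @Override
    public void execute(ShardingContext shardingContext) {
        // Each running instance only receives the sharding items assigned to it.
        System.out.println("Executing sharding item: " + shardingContext.getShardingItem());
    }
}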

Let's focus on JobScheduler's init method.

/**
     * Initialize the job.
     */
    public void init() {
        LiteJobConfiguration liteJobConfigFromRegCenter = schedulerFacade.updateJobConfiguration(liteJobConfig); // Update the job configuration in the registry center.
        JobRegistry.getInstance().setCurrentShardingTotalCount(liteJobConfigFromRegCenter.getJobName(), liteJobConfigFromRegCenter.getTypeConfig().getCoreConfig().getShardingTotalCount());
        JobScheduleController jobScheduleController = new JobScheduleController(
                createScheduler(), createJobDetail(liteJobConfigFromRegCenter.getTypeConfig().getJobClass()), liteJobConfigFromRegCenter.getJobName());
        JobRegistry.getInstance().registerJob(liteJobConfigFromRegCenter.getJobName(), jobScheduleController, regCenter);
        schedulerFacade.registerStartUpInfo(!liteJobConfigFromRegCenter.isDisabled()); // 2.1
        jobScheduleController.scheduleJob(liteJobConfigFromRegCenter.getTypeConfig().getCoreConfig().getCron()); // 2.2
    }

The JobScheduleController above is a wrapper around the Quartz API: createScheduler() uses Quartz to create a QuartzScheduler object, and createJobDetail() wraps Elastic-Job's LiteJob into a JobDetail object that Quartz can schedule.
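
Roughly speaking, createJobDetail builds a Quartz JobDetail bound to LiteJob and carries the ElasticJob instance and the JobFacade in the JobDataMap. The following is a simplified sketch of that idea, not the exact elastic-job source; the data-map key names are illustrative.

import org.quartz.JobBuilder;
import org.quartz.JobDetail;

import com.dangdang.ddframe.job.lite.internal.schedule.LiteJob;

public class JobDetailSketch {
    
    public static JobDetail buildJobDetail(final Object elasticJob, final Object jobFacade, final String jobName) {
        // LiteJob is the Quartz Job class; the real job instance and the facade travel in the JobDataMap.
        JobDetail result = JobBuilder.newJob(LiteJob.class).withIdentity(jobName).build();
        result.getJobDataMap().put("elasticJob", elasticJob);
        result.getJobDataMap().put("jobFacade", jobFacade);
        return result;
    }
}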

2.1

registerStartUpInfo is the key method here; its implementation is shown below.

/**
     * Register job startup information.
     * 
     * @param enabled whether the job is enabled
     */
    public void registerStartUpInfo(final boolean enabled) {
        listenerManager.startAllListeners();
        leaderService.electLeader(); // 1. Elect the leader node.
        serverService.persistOnline(enabled); // Persist the server node.
        instanceService.persistOnline(); // Persist the instance node.
        shardingService.setReshardingFlag(); // Set the resharding-required flag, used later in the code.
        monitorService.listen(); // Initialize the job monitoring service.
        if (!reconcileService.isRunning()) {
            reconcileService.startAsync();
        }
    }

1. Electing the leader node. Below is the election code; it is just one line.

    /**
     * Elect the leader node.
     */
    public void electLeader() {
        log.debug("Elect a new leader now.");
        jobNodeStorage.executeInLeader(LeaderNode.LATCH, new LeaderElectionExecutionCallback());
        log.debug("Leader election completed.");
    }

Stepping into executeInLeader, we can see that it essentially acquires the leader latch and then calls the callback's execute() method.

/**
     * Execute an operation on the leader node.
     * 
     * @param latchNode name of the job node used as the distributed lock
     * @param callback callback that performs the operation
     */
    public void executeInLeader(final String latchNode, final LeaderExecutionCallback callback) {
        try (LeaderLatch latch = new LeaderLatch(getClient(), jobNodePath.getFullPath(latchNode))) {
            latch.start();
            latch.await();
            callback.execute();
        //CHECKSTYLE:OFF
        } catch (final Exception ex) {
        //CHECKSTYLE:ON
            handleException(ex);
        }
    }

The callback here is LeaderElectionExecutionCallback, shown below.

@RequiredArgsConstructor
class LeaderElectionExecutionCallback implements LeaderExecutionCallback {
    
    @Override
    public void execute() {
        if (!hasLeader()) {
            jobNodeStorage.fillEphemeralJobNode(LeaderNode.INSTANCE, JobRegistry.getInstance().getJobInstance(jobName).getJobInstanceId());
        }
    }
}

Putting the code above together: each client tries to acquire the distributed lock (the leader latch), and whoever first writes its own instance id into the leader node becomes the leader (master).
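
The underlying primitive is Curator's LeaderLatch. Stripped of the elastic-job specifics, the pattern looks roughly like this (connect string and latch path are placeholders):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchDemo {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk_host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        // Every participant opens a latch on the same path; only one holds leadership at a time.
        try (LeaderLatch latch = new LeaderLatch(client, "/demo/leader/election/latch")) {
            latch.start();
            latch.await(); // blocks until this client becomes the leader
            // Leader-only work goes here, e.g. writing this instance's id into leader/election/instance.
        }
        client.close();
    }
}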

2.2

/**
 * Schedule the job.
 * 
 * @param cron CRON expression
 */
public void scheduleJob(final String cron) {
    try {
        // If the job has not been registered with the scheduler yet, register it with its trigger.
        if (!scheduler.checkExists(jobDetail.getKey())) {
            scheduler.scheduleJob(jobDetail, createTrigger(cron));
        }
        
        // Start the scheduler (called regardless of whether the job was just registered).
        scheduler.start();
    } catch (final SchedulerException ex) {
        throw new JobSystemException(ex);
    }
}

Scheduling here is delegated to the scheduler, which ultimately delegates to QuartzScheduler. Below is the QuartzScheduler constructor.

/**
 * <p>
 * Create a <code>QuartzScheduler</code> with the given configuration
 * properties.
 * </p>
 * 
 * @see QuartzSchedulerResources
 */
public QuartzScheduler(QuartzSchedulerResources resources, long idleWaitTime, @Deprecated long dbRetryInterval)
    throws SchedulerException {
    this.resources = resources;
    if (resources.getJobStore() instanceof JobListener) {
        addInternalJobListener((JobListener)resources.getJobStore());
    }
    this.schedThread = new QuartzSchedulerThread(this, resources);
    ThreadExecutor schedThreadExecutor = resources.getThreadExecutor(); //  2.3
    schedThreadExecutor.execute(this.schedThread);
    if (idleWaitTime > 0) {
        this.schedThread.setIdleWaitTime(idleWaitTime);
    }
    jobMgr = new ExecutingJobsManager();
    addInternalJobListener(jobMgr);
    errLogger = new ErrorLogger();
    addInternalSchedulerListener(errLogger);
    signaler = new SchedulerSignalerImpl(this, this.schedThread);
    
    if(shouldRunUpdateCheck()) 
        updateTimer = scheduleUpdateCheck();
    else
        updateTimer = null;
    
    getLog().info("Quartz Scheduler v." + getVersion() + " created.");
}

We will not go into how Quartz implements the scheduleJob and start methods here.

2.3

First, look at the QuartzSchedulerThread class, which is a thread. We will not cover the details of its run method; only the following code matters here.

JobRunShell shell = null;
try {
    shell = qsRsrcs.getJobRunShellFactory().createJobRunShell(bndle);
    shell.initialize(qs);
} catch (SchedulerException se) {
    qsRsrcs.getJobStore().triggeredJobComplete(triggers.get(i), bndle.getJobDetail(), CompletedExecutionInstruction.SET_ALL_JOB_TRIGGERS_ERROR);
    continue;
}

if (qsRsrcs.getThreadPool().runInThread(shell) == false) {
    // this case should never happen, as it is indicative of the
    // scheduler being shutdown or a bug in the thread pool or
    // a thread pool being used concurrently - which the docs
    // say not to do...
    getLog().error("ThreadPool.runInThread() return false!");
    qsRsrcs.getJobStore().triggeredJobComplete(triggers.get(i), bndle.getJobDetail(), CompletedExecutionInstruction.SET_ALL_JOB_TRIGGERS_ERROR);            
}

JobRunShell is also a Runnable; in the code that follows it is submitted to the thread pool for execution, and inside this class the job defined by elastic-job is eventually invoked.

public void run() {
        qs.addInternalSchedulerListener(this);

        try {
            ...

                // execute the job
                try {
                    log.debug("Calling execute on job " + jobDetail.getKey());
                    job.execute(jec); // Execute the job.
                    endTime = System.currentTimeMillis();
                } catch (JobExecutionException jee) {
                    endTime = System.currentTimeMillis();
                    jobExEx = jee;
                    getLog().info("Job " + jobDetail.getKey() +
                            " threw a JobExecutionException: ", jobExEx);
                } catch (Throwable e) {
                    endTime = System.currentTimeMillis();
                    getLog().error("Job " + jobDetail.getKey() +
                            " threw an unhandled Exception: ", e);
                    SchedulerException se = new SchedulerException(
                            "Job threw an unhandled exception.", e);
                    qs.notifySchedulerListenersError("Job ("
                            + jec.getJobDetail().getKey()
                            + " threw an exception.", se);
                    jobExEx = new JobExecutionException(se, false);
                }

                ...

        } finally {
            qs.removeInternalSchedulerListener(this);
        }
    }

The job invoked in the code above is what eventually runs the SimpleDemoJob we defined earlier. At this point, how Elastic-Job starts up and how it hooks into Quartz should be clear.

To be precise, the execute method called here is com.dangdang.ddframe.job.lite.internal.schedule.LiteJob's execute method.

3. Job execution flow

The starting point of job execution is the com.dangdang.ddframe.job.lite.internal.schedule.LiteJob#execute method mentioned above.

/**
 * Lite scheduled job.
 *
 * @author zhangliang
 */
public final class LiteJob implements Job {
    
    @Setter
    private ElasticJob elasticJob;
    
    @Setter
    private JobFacade jobFacade;
    
    @Override
    public void execute(final JobExecutionContext context) throws JobExecutionException {
        JobExecutorFactory.getJobExecutor(elasticJob, jobFacade).execute();
    }
}

ElasticJob comes in three types: DataflowJob, ScriptJob, and SimpleJob; SimpleJob is the one most commonly used. Each type has a corresponding job executor, as shown below.


    /**
     * Get the job executor.
     *
     * @param elasticJob the distributed elastic job
     * @param jobFacade the internal job facade service
     * @return the job executor
     */
    @SuppressWarnings("unchecked")
    public static AbstractElasticJobExecutor getJobExecutor(final ElasticJob elasticJob, final JobFacade jobFacade) {
        if (null == elasticJob) {
            return new ScriptJobExecutor(jobFacade); // Distributed script job executor.
        }
        if (elasticJob instanceof SimpleJob) {
            return new SimpleJobExecutor((SimpleJob) elasticJob, jobFacade); // Distributed simple job executor.
        }
        if (elasticJob instanceof DataflowJob) {
            return new DataflowJobExecutor((DataflowJob) elasticJob, jobFacade); // Distributed dataflow job executor.
        }
        throw new JobConfigurationException("Cannot support job type '%s'", elasticJob.getClass().getCanonicalName());
    }

All three executors extend AbstractElasticJobExecutor, so what LiteJob's execute method actually invokes is AbstractElasticJobExecutor's execute method.

/**
 * Execute the job.
 */
public final void execute() {
    try {
        jobFacade.checkJobExecutionEnvironment(); // 3.1
    } catch (final JobExecutionEnvironmentException cause) {
        jobExceptionHandler.handleException(jobName, cause);
    }
    ShardingContexts shardingContexts = jobFacade.getShardingContexts(); // 3.2
    if (shardingContexts.isAllowSendJobEvent()) {
        jobFacade.postJobStatusTraceEvent(shardingContexts.getTaskId(), State.TASK_STAGING, String.format("Job '%s' execute begin.", jobName));
    }
    if (jobFacade.misfireIfRunning(shardingContexts.getShardingItemParameters().keySet())) {
        if (shardingContexts.isAllowSendJobEvent()) {
            jobFacade.postJobStatusTraceEvent(shardingContexts.getTaskId(), State.TASK_FINISHED, String.format(
                    "Previous job '%s' - shardingItems '%s' is still running, misfired job will start after previous job completed.", jobName, 
                    shardingContexts.getShardingItemParameters().keySet()));
        }
        return;
    }
    try {
        jobFacade.beforeJobExecuted(shardingContexts);
        //CHECKSTYLE:OFF
    } catch (final Throwable cause) {
        //CHECKSTYLE:ON
        jobExceptionHandler.handleException(jobName, cause);
    }
    execute(shardingContexts, JobExecutionEvent.ExecutionSource.NORMAL_TRIGGER);
    while (jobFacade.isExecuteMisfired(shardingContexts.getShardingItemParameters().keySet())) {
        jobFacade.clearMisfire(shardingContexts.getShardingItemParameters().keySet());
        execute(shardingContexts, JobExecutionEvent.ExecutionSource.MISFIRE);
    }
    jobFacade.failoverIfNecessary();
    try {
        jobFacade.afterJobExecuted(shardingContexts);
        //CHECKSTYLE:OFF
    } catch (final Throwable cause) {
        //CHECKSTYLE:ON
        jobExceptionHandler.handleException(jobName, cause);
    }
}