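XXL-JOB Source Code Analysis
1. Executor Registration & Heartbeat Reporting
The executor's service configuration specifies the address of the xxl-job scheduling center (admin) together with the executor's own basic information, such as its app name and port. Registration means that when the executor service starts, it pushes its own information to the scheduling center, which stores it. The executor then keeps reporting heartbeats so that the scheduling center continues to treat it as alive.
Let's look at the concrete implementation.
1.1 Executor (client) logic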
// ---------------------- registry ----------------------
public void startRegistry(final String appname, final String address) {
// start registry
ExecutorRegistryThread.getInstance().start(appname, address);
}
A singleton class, ExecutorRegistryThread, is responsible for bringing the executor online and taking it offline.
public class ExecutorRegistryThread {
private static ExecutorRegistryThread instance = new ExecutorRegistryThread();
public static ExecutorRegistryThread getInstance(){
return instance;
}
// registry thread
private Thread registryThread;
private volatile boolean toStop = false;
public void start(final String appname, final String address){
// ...
registryThread = new Thread(new Runnable() {
@Override
public void run() {
// registry
while (!toStop) {
try {
RegistryParam registryParam = new RegistryParam(RegistryConfig.RegistType.EXECUTOR.name(), appname, address);
for (AdminBiz adminBiz: XxlJobExecutor.getAdminBizList()) {
ReturnT<String> registryResult = adminBiz.registry(registryParam);
// ...
}
} catch (Exception e) {
// ...
}
try {
if (!toStop) {
// 30 seconds
TimeUnit.SECONDS.sleep(RegistryConfig.BEAT_TIMEOUT);
}
} catch (InterruptedException e) {
// ...
}
}
// registry remove
try {
RegistryParam registryParam = new RegistryParam(RegistryConfig.RegistType.EXECUTOR.name(), appname, address);
for (AdminBiz adminBiz: XxlJobExecutor.getAdminBizList()) {
try {
ReturnT<String> registryResult = adminBiz.registryRemove(registryParam);
// ...
} catch (Exception e) {
// ...
}
}
} catch (Exception e) {
if (!toStop) {
logger.error(e.getMessage(), e);
}
}
}
});
registryThread.setDaemon(true);
registryThread.setName("xxl-job, executor ExecutorRegistryThread");
registryThread.start();
}
// ...
}
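The ExecutorRegistryThread singleton starts a registryThread that sends a registry request to the scheduling center every 30 seconds. The same request doubles as the heartbeat: it reports this executor's information to the server, which updates the executor's heartbeat time. When the thread is stopped, it sends a registryRemove request to take the executor offline.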
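The register-as-heartbeat loop described above can be sketched in isolation. This is a minimal illustration, not XXL-JOB code: the class name HeartbeatLoop, the two Runnable callbacks, and the configurable interval are assumptions made for the example (the real thread calls adminBiz.registry and adminBiz.registryRemove with a 30-second interval).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of XXL-JOB: a loop that re-registers on every beat.
class HeartbeatLoop {
    private final Runnable register;    // invoked on every beat
    private final Runnable deregister;  // invoked once when the loop exits
    private final long intervalMillis;
    private volatile boolean toStop = false;
    private Thread thread;

    HeartbeatLoop(Runnable register, Runnable deregister, long intervalMillis) {
        this.register = register;
        this.deregister = deregister;
        this.intervalMillis = intervalMillis;
    }

    void start() {
        thread = new Thread(() -> {
            while (!toStop) {
                try {
                    register.run();
                } catch (RuntimeException e) {
                    // a failed beat is not fatal; the next round tries again
                }
                try {
                    if (!toStop) {
                        TimeUnit.MILLISECONDS.sleep(intervalMillis);
                    }
                } catch (InterruptedException e) {
                    // stop() interrupts the sleep; the loop condition decides whether to exit
                }
            }
            // leaving the loop means shutdown: remove the registration
            deregister.run();
        }, "heartbeat-loop");
        thread.setDaemon(true);
        thread.start();
    }

    void stop() {
        toStop = true;
        thread.interrupt();
        try {
            thread.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The volatile flag plus interrupt is what lets stop() cut a long sleep short without waiting for the next beat.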
1.2 Scheduling center (server) logic
The JobRegistryHelper singleton maintains the executor instance information held by the scheduling center.
public class JobRegistryHelper {
private static JobRegistryHelper instance = new JobRegistryHelper();
public static JobRegistryHelper getInstance(){
return instance;
}
private Thread registryMonitorThread;
private volatile boolean toStop = false;
public void start(){
// for monitor
registryMonitorThread = new Thread(new Runnable() {
@Override
public void run() {
while (!toStop) {
try {
// auto registry group
List<XxlJobGroup> groupList = XxlJobAdminConfig.getAdminConfig().getXxlJobGroupDao().findByAddressType(0);
if (groupList!=null && !groupList.isEmpty()) {
// remove dead address (admin/executor)
List<Integer> ids = XxlJobAdminConfig.getAdminConfig().getXxlJobRegistryDao().findDead(RegistryConfig.DEAD_TIMEOUT, new Date());
if (ids!=null && ids.size()>0) {
XxlJobAdminConfig.getAdminConfig().getXxlJobRegistryDao().removeDead(ids);
}
// fresh online address (admin/executor)
HashMap<String, List<String>> appAddressMap = new HashMap<String, List<String>>();
List<XxlJobRegistry> list = XxlJobAdminConfig.getAdminConfig().getXxlJobRegistryDao().findAll(RegistryConfig.DEAD_TIMEOUT, new Date());
if (list != null) {
for (XxlJobRegistry item: list) {
if (RegistryConfig.RegistType.EXECUTOR.name().equals(item.getRegistryGroup())) {
String appname = item.getRegistryKey();
List<String> registryList = appAddressMap.get(appname);
if (registryList == null) {
registryList = new ArrayList<String>();
}
if (!registryList.contains(item.getRegistryValue())) {
registryList.add(item.getRegistryValue());
}
appAddressMap.put(appname, registryList);
}
}
}
// fresh group address
for (XxlJobGroup group: groupList) {
List<String> registryList = appAddressMap.get(group.getAppname());
String addressListStr = null;
if (registryList!=null && !registryList.isEmpty()) {
Collections.sort(registryList);
StringBuilder addressListSB = new StringBuilder();
for (String item:registryList) {
addressListSB.append(item).append(",");
}
addressListStr = addressListSB.toString();
addressListStr = addressListStr.substring(0, addressListStr.length()-1);
}
group.setAddressList(addressListStr);
group.setUpdateTime(new Date());
XxlJobAdminConfig.getAdminConfig().getXxlJobGroupDao().update(group);
}
}
} catch (Exception e) {
if (!toStop) {
logger.error(">>>>>>>>>>> xxl-job, job registry monitor thread error:{}", e);
}
}
try {
TimeUnit.SECONDS.sleep(RegistryConfig.BEAT_TIMEOUT);
} catch (InterruptedException e) {
if (!toStop) {
logger.error(">>>>>>>>>>> xxl-job, job registry monitor thread error:{}", e);
}
}
}
}
});
registryMonitorThread.setDaemon(true);
registryMonitorThread.start();
}
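A registryMonitorThread maintains the executor instance information. It runs a check every 30 seconds: an executor whose last heartbeat is more than 90 seconds old is considered offline, its registration record is removed, and the live address list of each auto-registered executor group is refreshed.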
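The liveness rule can be illustrated with an in-memory table. This is only a sketch under stated assumptions: RegistryTable and its methods are hypothetical names, and a HashMap stands in for the registry table that the real code reads and updates through XxlJobRegistryDao. The constants mirror the 30-second beat and 90-second dead timeout mentioned in the text.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for the registry table.
class RegistryTable {
    static final long BEAT_TIMEOUT_SECONDS = 30;
    static final long DEAD_TIMEOUT_SECONDS = BEAT_TIMEOUT_SECONDS * 3; // 90 seconds

    private final Map<String, Long> lastBeatMillis = new HashMap<>();

    // A registry request refreshes the heartbeat time of the address.
    void beat(String address, long nowMillis) {
        lastBeatMillis.put(address, nowMillis);
    }

    // Removes every address whose last beat is older than the dead timeout.
    List<String> removeDead(long nowMillis) {
        List<String> removed = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastBeatMillis.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() > DEAD_TIMEOUT_SECONDS * 1000) {
                removed.add(entry.getKey());
                it.remove();
            }
        }
        Collections.sort(removed);
        return removed;
    }

    // Sorted, comma-separated list of remaining addresses, or null when none is left.
    String aliveAddressList() {
        List<String> alive = new ArrayList<>(lastBeatMillis.keySet());
        Collections.sort(alive);
        return alive.isEmpty() ? null : String.join(",", alive);
    }
}
```

Because the dead timeout is three beat intervals, an executor has to miss several consecutive beats before it is dropped.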
2. Executor Startup Process
// ---------------------- start + stop ----------------------
public void start() throws Exception {
// init logpath
XxlJobFileAppender.initLogPath(logPath);
// init invoker, admin-client
initAdminBizList(adminAddresses, accessToken);
// init JobLogFileCleanThread
JobLogFileCleanThread.getInstance().start(logRetentionDays);
// init TriggerCallbackThread
TriggerCallbackThread.getInstance().start();
// init executor-server
initEmbedServer(address, ip, port, appname, accessToken);
}
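Executor startup consists of five steps:
- Initialize the log directory
- Initialize the HTTP clients used to talk to the scheduling center
- Start the periodic cleanup of log files
- Start the callback thread that reports results after a task finishes
- Start the embedded HTTP server that listens for and handles requests from the scheduling center
2.1 Initializing the log directory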
public static void initLogPath(String logPath){
// init
if (logPath!=null && logPath.trim().length()>0) {
logBasePath = logPath;
}
// mk base dir
File logPathDir = new File(logBasePath);
if (!logPathDir.exists()) {
logPathDir.mkdirs();
}
logBasePath = logPathDir.getPath();
// mk glue dir
File glueBaseDir = new File(logPathDir, "gluesource");
if (!glueBaseDir.exists()) {
glueBaseDir.mkdirs();
}
glueSrcPath = glueBaseDir.getPath();
}
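Based on the configuration, this initializes the on-disk location for log files and the location where script source files are stored in glue mode.
2.2 Initializing the HTTP clients for the scheduling center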
private static List<AdminBiz> adminBizList;
private void initAdminBizList(String adminAddresses, String accessToken) throws Exception {
if (adminAddresses!=null && adminAddresses.trim().length()>0) {
for (String address: adminAddresses.trim().split(",")) {
if (address!=null && address.trim().length()>0) {
AdminBiz adminBiz = new AdminBizClient(address.trim(), accessToken);
if (adminBizList == null) {
adminBizList = new ArrayList<AdminBiz>();
}
adminBizList.add(adminBiz);
}
}
}
}
public interface AdminBiz {
/**
* callback
*
* @param callbackParamList
* @return
*/
public ReturnT<String> callback(List<HandleCallbackParam> callbackParamList);
/**
* registry
*
* @param registryParam
* @return
*/
public ReturnT<String> registry(RegistryParam registryParam);
/**
* registry remove
*
* @param registryParam
* @return
*/
public ReturnT<String> registryRemove(RegistryParam registryParam);
}
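This initializes the clients the executor uses to communicate with the scheduling center. If the scheduling center is deployed on several machines for high availability, one client is created per address.
Multiple scheduling center addresses are supported, separated by commas.
The client uses HTTP by default and handles three actions: executor registration (online), registration removal (offline), and execution-result callback.
2.3 Periodic cleanup of log files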
public class JobLogFileCleanThread {
private static JobLogFileCleanThread instance = new JobLogFileCleanThread();
public static JobLogFileCleanThread getInstance(){
return instance;
}
private Thread localThread;
private volatile boolean toStop = false;
public void start(final long logRetentionDays){
if (logRetentionDays < 3 ) {
return;
}
localThread = new Thread(new Runnable() {
@Override
public void run() {
while (!toStop) {
try {
// clean log dir, over logRetentionDays
File[] childDirs = new File(XxlJobFileAppender.getLogPath()).listFiles();
if (childDirs!=null && childDirs.length>0) {
// today
Calendar todayCal = Calendar.getInstance();
todayCal.set(Calendar.HOUR_OF_DAY,0);
todayCal.set(Calendar.MINUTE,0);
todayCal.set(Calendar.SECOND,0);
todayCal.set(Calendar.MILLISECOND,0);
Date todayDate = todayCal.getTime();
for (File childFile: childDirs) {
// valid
if (!childFile.isDirectory()) {
continue;
}
if (childFile.getName().indexOf("-") == -1) {
continue;
}
// file create date
Date logFileCreateDate = null;
try {
SimpleDateFormat simpleDateFormat = new SimpleDateFormat("yyyy-MM-dd");
logFileCreateDate = simpleDateFormat.parse(childFile.getName());
} catch (ParseException e) {
logger.error(e.getMessage(), e);
}
if (logFileCreateDate == null) {
continue;
}
if ((todayDate.getTime()-logFileCreateDate.getTime()) >= logRetentionDays * (24 * 60 * 60 * 1000) ) {
FileUtil.deleteRecursively(childFile);
}
}
}
} catch (Exception e) {
if (!toStop) {
logger.error(e.getMessage(), e);
}
}
try {
// clean up once a day
TimeUnit.DAYS.sleep(1);
} catch (InterruptedException e) {
if (!toStop) {
logger.error(e.getMessage(), e);
}
}
}
}
});
localThread.setDaemon(true);
localThread.setName("xxl-job, executor JobLogFileCleanThread");
localThread.start();
}
}
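Once a day, log directories older than the configured retention period are deleted. Note that cleanup is skipped entirely when logRetentionDays is less than 3.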
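The expiry check on date-named log directories can be sketched as follows. This is an illustration, not the original code: LogRetention and isExpired are hypothetical names, and java.time is swapped in for the Calendar and SimpleDateFormat calls used in the listing above (LocalDate parsing is stricter than SimpleDateFormat).

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoUnit;

// Hypothetical helper: decides whether a "yyyy-MM-dd" log directory is past retention.
class LogRetention {
    static boolean isExpired(String dirName, LocalDate today, int retentionDays) {
        LocalDate created;
        try {
            created = LocalDate.parse(dirName); // ISO-8601 date, i.e. yyyy-MM-dd
        } catch (DateTimeParseException e) {
            return false; // not a date-named directory (e.g. "gluesource"): never delete it
        }
        return ChronoUnit.DAYS.between(created, today) >= retentionDays;
    }
}
```

For example, with a 30-day retention, a directory named 2024-01-01 would be considered expired on 2024-01-31, while 2024-01-02 would be kept for one more day.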
2.4 Initializing the callback threads
public class TriggerCallbackThread {
private static TriggerCallbackThread instance = new TriggerCallbackThread();
public static TriggerCallbackThread getInstance(){
return instance;
}
private LinkedBlockingQueue<HandleCallbackParam> callBackQueue = new LinkedBlockingQueue<HandleCallbackParam>();
// put the callback result into the in-memory blocking queue
public static void pushCallBack(HandleCallbackParam callback){
getInstance().callBackQueue.add(callback);
logger.debug(">>>>>>>>>>> xxl-job, push callback request, logId:{}", callback.getLogId());
}
// callback thread
private Thread triggerCallbackThread;
// callback retry thread
private Thread triggerRetryCallbackThread;
public void start() {
// ...
triggerCallbackThread = new Thread(new Runnable() {
@Override
public void run() {
while(!toStop){
try {
HandleCallbackParam callback = getInstance().callBackQueue.take();
if (callback != null) {
// callback list param
List<HandleCallbackParam> callbackParamList = new ArrayList<HandleCallbackParam>();
int drainToNum = getInstance().callBackQueue.drainTo(callbackParamList);
callbackParamList.add(callback);
// callback, will retry if error
if (callbackParamList!=null && callbackParamList.size()>0) {
doCallback(callbackParamList);
}
}
} catch (Exception e) {
if (!toStop) {
logger.error(e.getMessage(), e);
}
}
}
//...
}
});
triggerCallbackThread.setDaemon(true);
triggerCallbackThread.setName("xxl-job, executor TriggerCallbackThread");
triggerCallbackThread.start();
// callback retry
triggerRetryCallbackThread = new Thread(new Runnable() {
@Override
public void run() {
while(!toStop){
try {
retryFailCallbackFile();
} catch (Exception e) {
if (!toStop) {
logger.error(e.getMessage(), e);
}
}
try {
TimeUnit.SECONDS.sleep(RegistryConfig.BEAT_TIMEOUT);
} catch (InterruptedException e) {
if (!toStop) {
logger.error(e.getMessage(), e);
}
}
}
logger.info(">>>>>>>>>>> xxl-job, executor retry callback thread destroy.");
}
});
triggerRetryCallbackThread.setDaemon(true);
triggerRetryCallbackThread.start();
}
// ...
}
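Starting TriggerCallbackThread mainly involves two threads, triggerCallbackThread and triggerRetryCallbackThread. triggerCallbackThread consumes the buffered callback data and sends it to the scheduling center in batches; triggerRetryCallbackThread periodically retries callbacks that failed, so that results are eventually delivered.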
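The batching idiom used by triggerCallbackThread (block on take(), then drainTo() whatever else is queued) can be shown on its own. A minimal sketch, assuming a hypothetical CallbackBatcher helper; unlike the listing above, it keeps the first element at the front of the batch so FIFO order is preserved.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical helper illustrating the take() + drainTo() batching idiom.
class CallbackBatcher {
    // Blocks until one item is available, then returns it plus everything else currently queued.
    // Returns an empty list if the waiting thread is interrupted.
    static <T> List<T> nextBatch(BlockingQueue<T> queue) {
        List<T> batch = new ArrayList<>();
        try {
            batch.add(queue.take());   // wait for at least one callback
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return batch;
        }
        queue.drainTo(batch);          // grab the rest without blocking
        return batch;
    }
}
```

Blocking on the first element avoids busy-waiting, while draining the rest turns many queued callbacks into a single request.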
2.5 Listening for and handling scheduling center requests
private void initEmbedServer(String address, String ip, int port, String appname, String accessToken) throws Exception {
// ...
embedServer = new EmbedServer();
embedServer.start(address, port, appname, accessToken);
}
public class EmbedServer {
private ExecutorBiz executorBiz;
private Thread thread;
public void start(final String address, final int port, final String appname, final String accessToken) {
executorBiz = new ExecutorBizImpl();
thread = new Thread(new Runnable() {
@Override
public void run() {
// start the Netty-based server
EventLoopGroup bossGroup = new NioEventLoopGroup();
EventLoopGroup workerGroup = new NioEventLoopGroup();
ThreadPoolExecutor bizThreadPool = new ThreadPoolExecutor(
// ...
});
try {
// start server
ServerBootstrap bootstrap = new ServerBootstrap();
bootstrap.group(bossGroup, workerGroup)
.channel(NioServerSocketChannel.class)
.childHandler(new ChannelInitializer<SocketChannel>() {
@Override
public void initChannel(SocketChannel channel) throws Exception {
channel.pipeline()
.addLast(new IdleStateHandler(0, 0, 30 * 3, TimeUnit.SECONDS)) // beat 3N, close if idle
.addLast(new HttpServerCodec())
.addLast(new HttpObjectAggregator(5 * 1024 * 1024)) // merge request & response to FULL
.addLast(new EmbedHttpServerHandler(executorBiz, accessToken, bizThreadPool));
}
})
.childOption(ChannelOption.SO_KEEPALIVE, true);
// bind
ChannelFuture future = bootstrap.bind(port).sync();
// service registration & heartbeat
startRegistry(appname, address);
// wait until stop
future.channel().closeFuture().sync();
} catch (InterruptedException e) {
// ...
}
}
});
thread.setDaemon(true); // daemon, service jvm, user thread leave >>> daemon leave >>> jvm leave
thread.start();
}
}
Netty is used to create an HTTP server that listens for requests from the scheduling center.
public interface ExecutorBiz {
// heartbeat
public ReturnT<String> beat();
// idle check
public ReturnT<String> idleBeat(IdleBeatParam idleBeatParam);
// run a task
public ReturnT<String> run(TriggerParam triggerParam);
// kill (terminate) a task
public ReturnT<String> kill(KillParam killParam);
// fetch log content
public ReturnT<LogResult> log(LogParam logParam);
}
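It handles five actions: heartbeat probing, idle checking, task execution, task termination (kill), and log retrieval.
Note that this heartbeat is different from the executor's registry heartbeat. It is not used to maintain the list of live executor addresses; it is an active probe issued by the server to implement the failover routing strategy.
Failover looks for the first live instance address among the executors.
The idle check, in turn, exists to implement the busy-over routing strategy.
Task execution, as the name suggests, handles task triggers coming from the scheduling center; the process is analyzed in detail later.
Kill and log retrieval let the console manually terminate a task and view its detailed execution log.
3. How the Scheduling Center Schedules Tasks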
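Scheduling is based on a simple time wheel. ringData is a time wheel with 60 slots, one per second, and each slot holds the set of tasks to run in that second. scheduleThread pre-reads the tasks that are about to run and writes them into the time wheel; ringThread triggers the tasks stored on the wheel.
3.1 Pre-read logic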
scheduleThread = new Thread(new Runnable() {
@Override
public void run() {
// ...
while (!scheduleThreadToStop) {
// ...
// pre-read: tasks whose next trigger time is earlier than 5 seconds from now
long nowTime = System.currentTimeMillis();
List<XxlJobInfo> scheduleList =XxlJobAdminConfig.getAdminConfig().getXxlJobInfoDao().scheduleJobQuery(nowTime + PRE_READ_MS, preReadCount);
if (scheduleList!=null && scheduleList.size()>0) {
for (XxlJobInfo jobInfo: scheduleList) {
// 1. the task expired more than 5 seconds ago
if (nowTime > jobInfo.getTriggerNextTime() + PRE_READ_MS) {
// misfire match: apply the misfire strategy
MisfireStrategyEnum misfireStrategyEnum = MisfireStrategyEnum.match(jobInfo.getMisfireStrategy(), MisfireStrategyEnum.DO_NOTHING);
if (MisfireStrategyEnum.FIRE_ONCE_NOW == misfireStrategyEnum) {
JobTriggerPoolHelper.trigger(jobInfo.getId(), TriggerTypeEnum.MISFIRE, -1, null, null, null);
logger.debug(">>>>>>>>>>> xxl-job, schedule push trigger : jobId = " + jobInfo.getId() );
}
// refresh the next trigger time
refreshNextValidTime(jobInfo, new Date());
} else if (nowTime > jobInfo.getTriggerNextTime()) {
// 2. the task expired less than 5 seconds ago
// trigger it now
JobTriggerPoolHelper.trigger(jobInfo.getId(), TriggerTypeEnum.CRON, -1, null, null, null);
refreshNextValidTime(jobInfo, new Date());
// the next trigger time is still within the next 5 seconds: pre-read again
if (jobInfo.getTriggerStatus()==1 && nowTime + PRE_READ_MS > jobInfo.getTriggerNextTime()) {
// compute the ring slot
int ringSecond = (int)((jobInfo.getTriggerNextTime()/1000)%60);
// push into the time ring
pushTimeRing(ringSecond, jobInfo.getId());
refreshNextValidTime(jobInfo, new Date(jobInfo.getTriggerNextTime()));
}
} else {
// 3. the task fires within the next 5 seconds: normal pre-read
int ringSecond = (int)((jobInfo.getTriggerNextTime()/1000)%60);
pushTimeRing(ringSecond, jobInfo.getId());
refreshNextValidTime(jobInfo, new Date(jobInfo.getTriggerNextTime()));
}
}
// ...
}
// ...
scheduleThread.start();
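The pre-read logic first loads from the database the tasks whose next trigger time is earlier than 5 seconds from now, then iterates over them and handles three cases.
If a task is fully expired (more than 5 seconds overdue), the misfire strategy is applied: ignore it, or fire once immediately. In both cases the next trigger time is refreshed.
If a task is overdue by less than 5 seconds, it is triggered immediately; its refreshed next trigger time then decides whether it is pre-read into the time ring again.
If a task is due within the next 5 seconds, it is pre-read normally: it is pushed into the time ring and its next trigger time is refreshed.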
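The three cases reduce to two comparisons against the current time. The sketch below isolates that decision; PreReadClassifier and the Action enum are hypothetical names introduced for illustration, and PRE_READ_MS is assumed to be 5000 ms to match the 5-second window described in the text.

```java
// Hypothetical helper isolating the three-way decision of the pre-read loop.
class PreReadClassifier {
    static final long PRE_READ_MS = 5000; // assumed value, matching the 5-second window

    enum Action { MISFIRE, TRIGGER_NOW, PUSH_TO_RING }

    static Action classify(long nowTime, long triggerNextTime) {
        if (nowTime > triggerNextTime + PRE_READ_MS) {
            return Action.MISFIRE;       // overdue by more than 5 seconds
        } else if (nowTime > triggerNextTime) {
            return Action.TRIGGER_NOW;   // overdue by at most 5 seconds
        } else {
            return Action.PUSH_TO_RING;  // due now or within the pre-read window
        }
    }
}
```

Note the boundaries: a task overdue by exactly 5 seconds is still triggered immediately, and a task due exactly now goes to the time ring.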
3.2 Time ring consumption logic
ringThread = new Thread(new Runnable() {
@Override
public void run() {
while (!ringThreadToStop) {
// align to the next whole second
TimeUnit.MILLISECONDS.sleep(1000 - System.currentTimeMillis() % 1000);
try {
List<Integer> ringItemData = new ArrayList<>();
int nowSecond = Calendar.getInstance().get(Calendar.SECOND); // processing may take too long and skip a slot, so also check one slot back
for (int i = 0; i < 2; i++) {
List<Integer> tmpData = ringData.remove( (nowSecond+60-i)%60 );
if (tmpData != null) {
ringItemData.addAll(tmpData);
}
}
// trigger the tasks
if (ringItemData.size() > 0) {
for (int jobId: ringItemData) {
JobTriggerPoolHelper.trigger(jobId, TriggerTypeEnum.CRON, -1, null, null, null);
}
ringItemData.clear();
}
} catch (Exception e) {
if (!ringThreadToStop) {
logger.error(">>>>>>>>>>> xxl-job, JobScheduleHelper#ringThread error:{}", e);
}
}
}
}
});
ringThread.setDaemon(true);
ringThread.setName("xxl-job, admin JobScheduleHelper#ringThread");
ringThread.start();
}
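ringThread's logic is simple: on every whole second it pulls the tasks for the current second and the previous second from the time ring and triggers them.
Reading one slot back guards against processing that takes so long that a slot is skipped. Although trigger itself is asynchronous, a second with a very large number of tasks can still run past the slot boundary, so this is a robustness safeguard.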
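A compact model of the 60-slot ring makes the "current slot plus one slot back" rule concrete. This is a single-threaded sketch under assumptions: TimeRing, slotOf, push, and poll are hypothetical names, and the real ringData is shared between scheduleThread and ringThread.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single-threaded model of the 60-slot time ring.
class TimeRing {
    private final Map<Integer, List<Integer>> ringData = new HashMap<>();

    // Maps a trigger time in epoch milliseconds to its slot (second of the minute).
    static int slotOf(long triggerTimeMillis) {
        return (int) ((triggerTimeMillis / 1000) % 60);
    }

    void push(long triggerTimeMillis, int jobId) {
        ringData.computeIfAbsent(slotOf(triggerTimeMillis), k -> new ArrayList<>()).add(jobId);
    }

    // Removes and returns the jobs in the current slot and in the slot before it.
    List<Integer> poll(int nowSecond) {
        List<Integer> due = new ArrayList<>();
        for (int i = 0; i < 2; i++) {
            List<Integer> slot = ringData.remove((nowSecond + 60 - i) % 60);
            if (slot != null) {
                due.addAll(slot);
            }
        }
        return due;
    }
}
```

Adding 60 before subtracting keeps the index non-negative, so polling at second 0 also checks slot 59. Because slots are removed when read, a job picked up by the look-back is not triggered a second time.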
3.3 Task trigger logic
public void addTrigger(final int jobId,
final TriggerTypeEnum triggerType,
final int failRetryCount,
final String executorShardingParam,
final String executorParam,
final String addressList) {
// choose the thread pool: fast or slow
ThreadPoolExecutor triggerPool_ = fastTriggerPool;
AtomicInteger jobTimeoutCount = jobTimeoutCountMap.get(jobId);
if (jobTimeoutCount!=null && jobTimeoutCount.get() > 10) { // job-timeout 10 times in 1 min
triggerPool_ = slowTriggerPool;
}
// submit to the thread pool for asynchronous execution
triggerPool_.execute(new Runnable() {
@Override
public void run() {
long start = System.currentTimeMillis();
try {
// trigger
XxlJobTrigger.trigger(jobId, triggerType, failRetryCount, executorShardingParam, executorParam, addressList);
} catch (Exception e) {
logger.error(e.getMessage(), e);
} finally {
// jobTimeoutCountMap is cleared once a minute, so a job is moved to the slow pool once more than 10 of its triggers within the same minute took longer than 500ms
long minTim_now = System.currentTimeMillis()/60000;
if (minTim != minTim_now) {
minTim = minTim_now;
jobTimeoutCountMap.clear();
}
// a trigger that took longer than 500ms counts as one slow request
long cost = System.currentTimeMillis()-start;
if (cost > 500) { // job-timeout threshold 500ms
AtomicInteger timeoutCount = jobTimeoutCountMap.putIfAbsent(jobId, new AtomicInteger(1));
if (timeoutCount != null) {
timeoutCount.incrementAndGet();
}
}
}
}
});
}
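When a task is triggered, it is first assigned to either the fast or the slow thread pool, so that slow tasks do not hold up fast ones. Every trigger runs asynchronously on a pool, and a pool has limited resources: if slow tasks block and never release their threads, overall throughput suffers. It also follows that the number of triggers a single xxl-job admin node can process concurrently is bounded by the sum of the maximum thread counts of the two pools.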
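The fast/slow decision can be separated from the thread pools themselves. The sketch below is an illustration only: TriggerPoolSelector and its method names are hypothetical, and the clock is passed in as a parameter so the per-minute reset is easy to follow. The 500 ms threshold and the limit of 10 come from the listing above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: tracks slow triggers per job within the current minute.
class TriggerPoolSelector {
    static final long SLOW_THRESHOLD_MS = 500;
    static final int SLOW_COUNT_LIMIT = 10;

    private final ConcurrentMap<Integer, AtomicInteger> timeoutCounts = new ConcurrentHashMap<>();
    private volatile long currentMinute = -1;

    // True when the job exceeded the slow-trigger limit in the current minute.
    boolean useSlowPool(int jobId) {
        AtomicInteger count = timeoutCounts.get(jobId);
        return count != null && count.get() > SLOW_COUNT_LIMIT;
    }

    // Records the duration of one finished trigger.
    void record(int jobId, long costMillis, long nowMillis) {
        long minute = nowMillis / 60000;
        if (minute != currentMinute) {   // a new minute started: forget old counts
            currentMinute = minute;
            timeoutCounts.clear();
        }
        if (costMillis > SLOW_THRESHOLD_MS) {
            timeoutCounts.computeIfAbsent(jobId, k -> new AtomicInteger()).incrementAndGet();
        }
    }
}
```

Since the counters are cleared every minute, a job that recovers returns to the fast pool on its own.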
public static void trigger(int jobId,
TriggerTypeEnum triggerType,
int failRetryCount,
String executorShardingParam,
String executorParam,
String addressList) {
// load the job info
XxlJobInfo jobInfo = XxlJobAdminConfig.getAdminConfig().getXxlJobInfoDao().loadById(jobId);
if (jobInfo == null) {
logger.warn(">>>>>>>>>>>> trigger fail, jobId invalid,jobId={}", jobId);
return;
}
if (executorParam != null) {
jobInfo.setExecutorParam(executorParam);
}
// resolve the retry count
int finalFailRetryCount = failRetryCount>=0?failRetryCount:jobInfo.getExecutorFailRetryCount();
XxlJobGroup group = XxlJobAdminConfig.getAdminConfig().getXxlJobGroupDao().load(jobInfo.getJobGroup());
// sharding parameter
int[] shardingParam = null;
if (executorShardingParam!=null){
String[] shardingArr = executorShardingParam.split("/");
if (shardingArr.length==2 && isNumeric(shardingArr[0]) && isNumeric(shardingArr[1])) {
shardingParam = new int[2];
shardingParam[0] = Integer.valueOf(shardingArr[0]);
shardingParam[1] = Integer.valueOf(shardingArr[1]);
}
}
// routing strategy
if (ExecutorRouteStrategyEnum.SHARDING_BROADCAST ==ExecutorRouteStrategyEnum.match(jobInfo.getExecutorRouteStrategy(), null)
&& group.getRegistryList() != null && !group.getRegistryList().isEmpty()
&& shardingParam == null) {
// broadcast
for (int i = 0; i < group.getRegistryList().size(); i++) {
processTrigger(group, jobInfo, finalFailRetryCount, triggerType, i, group.getRegistryList().size());
}
} else {
// single-shard execution (all other routing strategies)
if (shardingParam == null) {
shardingParam = new int[]{0, 1};
}
processTrigger(group, jobInfo, finalFailRetryCount, triggerType, shardingParam[0], shardingParam[1]);
}
}
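Load the job's basic information, resolve the retry count, and parse the sharding parameter. In sharding-broadcast mode the task is triggered once for every executor instance.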
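Parsing the "index/total" sharding parameter and falling back to a single shard can be sketched as below. ShardingParam, parse, and orSingleShard are hypothetical names for this illustration; the fallback value {0, 1} mirrors the listing above.

```java
// Hypothetical helper for the "index/total" sharding parameter.
class ShardingParam {
    // Parses "index/total" into {index, total}; returns null when the text is missing or malformed.
    static int[] parse(String text) {
        if (text == null) {
            return null;
        }
        String[] parts = text.split("/");
        if (parts.length != 2) {
            return null;
        }
        try {
            return new int[]{Integer.parseInt(parts[0]), Integer.parseInt(parts[1])};
        } catch (NumberFormatException e) {
            return null;
        }
    }

    // Non-broadcast triggers without an explicit parameter run as shard 0 of 1.
    static int[] orSingleShard(int[] parsed) {
        return parsed != null ? parsed : new int[]{0, 1};
    }
}
```

In broadcast mode no parsing is needed: the loop index and the size of the registry list play the roles of index and total.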
private static void processTrigger(XxlJobGroup group, XxlJobInfo jobInfo, int finalFailRetryCount, TriggerTypeEnum triggerType, int index, int total){
// ...
// assemble the trigger parameters
TriggerParam triggerParam = new TriggerParam();
// ...
// pick the executor address according to the routing strategy
String address = null;
ReturnT<String> routeAddressResult = null;
if (group.getRegistryList()!=null && !group.getRegistryList().isEmpty()) {
if (ExecutorRouteStrategyEnum.SHARDING_BROADCAST == executorRouteStrategyEnum) {
if (index < group.getRegistryList().size()) {
address = group.getRegistryList().get(index);
} else {
address = group.getRegistryList().get(0);
}
} else {
routeAddressResult = executorRouteStrategyEnum.getRouter().route(triggerParam, group.getRegistryList());
if (routeAddressResult.getCode() == ReturnT.SUCCESS_CODE) {
address = routeAddressResult.getContent();
}
}
} else {
routeAddressResult = new ReturnT<String>(ReturnT.FAIL_CODE, I18nUtil.getString("jobconf_trigger_address_empty"));
}
// call the executor remotely
ReturnT<String> triggerResult = null;
if (address != null) {
triggerResult = runExecutor(triggerParam, address);
} else {
triggerResult = new ReturnT<String>(ReturnT.FAIL_CODE, null);
}
// save the trigger log
jobLog.setExecutorAddress(address);
jobLog.setExecutorHandler(jobInfo.getExecutorHandler());
jobLog.setExecutorParam(jobInfo.getExecutorParam());
jobLog.setExecutorShardingParam(shardingParam);
jobLog.setExecutorFailRetryCount(finalFailRetryCount);
jobLog.setTriggerCode(triggerResult.getCode());
jobLog.setTriggerMsg(triggerMsgSb.toString());
XxlJobAdminConfig.getAdminConfig().getXxlJobLogDao().updateTriggerInfo(jobLog);
}
public static ReturnT<String> runExecutor(TriggerParam triggerParam, String address){
ReturnT<String> runResult = null;
try {
ExecutorBiz executorBiz = XxlJobScheduler.getExecutorBiz(address);
runResult = executorBiz.run(triggerParam);
} catch (Exception e) {
logger.error(">>>>>>>>>>> xxl-job trigger error, please check if the executor[{}] is running.", address, e);
runResult = new ReturnT<String>(ReturnT.FAIL_CODE, ThrowableUtil.toString(e));
}
// ...
return runResult;
}
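Assemble the parameters, pick the target address according to the routing strategy, execute remotely, and record the trigger log.
Remote execution sends an HTTP request through ExecutorBiz, which eventually reaches the executor.
On the executor, the task is run by a JobThread, asynchronously: the trigger parameters are first placed into triggerQueue, and a dedicated thread keeps consuming from it.
During consumption, futureTask.get(long timeout, TimeUnit unit) enforces the timeout policy, and the actual task method is finally invoked via reflection.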
@Override
public void execute() throws Exception {
Class<?>[] paramTypes = method.getParameterTypes();
if (paramTypes.length > 0) {
method.invoke(target, new Object[paramTypes.length]); // method-param can not be primitive-types
} else {
method.invoke(target);
}
}
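The timeout control mentioned above, futureTask.get(long timeout, TimeUnit unit), can be illustrated with plain JDK classes. A minimal sketch, assuming a hypothetical TimedRunner helper and string results chosen only for the example; it is not the JobThread implementation.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper: runs a job on its own thread and gives up after a timeout.
class TimedRunner {
    static String runWithTimeout(Callable<String> job, long timeoutMillis) {
        FutureTask<String> futureTask = new FutureTask<>(job);
        Thread worker = new Thread(futureTask, "job-worker");
        worker.setDaemon(true);
        worker.start();
        try {
            // wait at most timeoutMillis for the job to finish
            return futureTask.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            worker.interrupt();          // ask the overrunning job to stop
            return "TIMEOUT";
        } catch (ExecutionException e) {
            return "FAILED: " + e.getCause();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "INTERRUPTED";
        }
    }
}
```

Interrupting the worker only requests a stop; a job that ignores interruption keeps running in the background.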
4. Scheduling Center Startup Process
public void init() throws Exception {
// init i18n
initI18n();
// admin trigger pool start
JobTriggerPoolHelper.toStart();
// admin registry monitor run
JobRegistryHelper.getInstance().start();
// admin fail-monitor run
JobFailMonitorHelper.getInstance().start();
// admin lose-monitor run ( depend on JobTriggerPoolHelper )
JobCompleteHelper.getInstance().start();
// admin log report start
JobLogReportHelper.getInstance().start();
// start-schedule ( depend on JobTriggerPoolHelper )
JobScheduleHelper.getInstance().start();
logger.info(">>>>>>>>> init xxl-job admin success.");
}
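Steps:
- Initialize internationalization (i18n)
- Start the trigger thread pools (the fast and slow pools that trigger tasks on executors)
- Start the executor registry and heartbeat monitor, which maintains the live executor addresses
- Start the failed-job monitor, which retries failed tasks and applies the alarm policy
- Start the lost-result monitor, which compensates tasks whose results were lost
- Start the report generator, which produces the overview data shown in the console
- Start the scheduler, which runs the scheduling logic
5. Routing Strategies
The first, last, round-robin, and random strategies are trivial to implement and need no discussion. Let's see how xxl-job implements the other strategies.
5.1 Consistent hash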
public class ExecutorRouteConsistentHash extends ExecutorRouter {
private static int VIRTUAL_NODE_NUM = 100;
/**
* get hash code on 2^32 ring (the hash value is derived from an MD5 digest)
* @param key
* @return
*/
private static long hash(String key) {
// md5 byte
MessageDigest md5;
try {
md5 = MessageDigest.getInstance("MD5");
} catch (NoSuchAlgorithmException e) {
throw new RuntimeException("MD5 not supported", e);
}
md5.reset();
byte[] keyBytes = null;
try {
keyBytes = key.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
throw new RuntimeException("Unknown string :" + key, e);
}
md5.update(keyBytes);
byte[] digest = md5.digest();
// hash code, Truncate to 32-bits
long hashCode = ((long) (digest[3] & 0xFF) << 24)
| ((long) (digest[2] & 0xFF) << 16)
| ((long) (digest[1] & 0xFF) << 8)
| (digest[0] & 0xFF);
long truncateHashCode = hashCode & 0xffffffffL;
return truncateHashCode;
}
public String hashJob(int jobId, List<String> addressList) {
// ------A1------A2-------A3------
// -----------J1------------------
TreeMap<Long, String> addressRing = new TreeMap<Long, String>();
for (String address: addressList) {
for (int i = 0; i < VIRTUAL_NODE_NUM; i++) {
long addressHash = hash("SHARD-" + address + "-NODE-" + i);
addressRing.put(addressHash, address);
}
}
long jobHash = hash(String.valueOf(jobId));
// get all entries whose hash is greater than or equal to the job hash
SortedMap<Long, String> lastRing = addressRing.tailMap(jobHash);
if (!lastRing.isEmpty()) {
return lastRing.get(lastRing.firstKey());
}
return addressRing.firstEntry().getValue();
}
}
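The hash() method simply maps a key to a position on the ring.
Every address is hashed (with 100 virtual nodes each) and put into a TreeMap. A TreeMap keeps its entries sorted by key, which here means sorted by hash value.
Then jobId is hashed, and the first address whose hash is greater than or equal to the job's hash is selected; if there is none, the lookup wraps around to the first address on the ring.
// ------A1------A2-------A3------ (address hash positions)
// -----------J1------------------ (job ID hash position)
Suppose the hashes land at these positions, with hash values increasing from left to right.
Then J1 is matched to A2.
With this algorithm, as long as the address set does not change, each job is always scheduled to the same machine, while different jobs are spread across the addresses.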
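The ring lookup itself can be tried with hand-picked positions instead of MD5 hashes. In this sketch, RingLookup and locate are hypothetical names, the ring is assumed to be non-empty, and TreeMap.ceilingEntry is used as an equivalent of taking the first entry of tailMap(jobHash).

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: finds the owner of a key on a hash ring (ring must not be empty).
class RingLookup {
    // Returns the owner of the first ring position >= keyHash, wrapping to the smallest position.
    static String locate(TreeMap<Long, String> ring, long keyHash) {
        Map.Entry<Long, String> entry = ring.ceilingEntry(keyHash);
        return entry != null ? entry.getValue() : ring.firstEntry().getValue();
    }
}
```

With A1, A2, A3 placed at positions 10, 20, 30, a job hashed to 15 lands on A2, matching the diagram above, and a job hashed to 35 wraps around to A1.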
5.2 Least frequently used (LFU)
public class ExecutorRouteLFU extends ExecutorRouter {
private static ConcurrentMap<Integer, HashMap<String, Integer>> jobLfuMap = new ConcurrentHashMap<Integer, HashMap<String, Integer>>();
private static long CACHE_VALID_TIME = 0;
public String route(int jobId, List<String> addressList) {
// cache clear
if (System.currentTimeMillis() > CACHE_VALID_TIME) {
jobLfuMap.clear();
// 24-hour cycle
CACHE_VALID_TIME = System.currentTimeMillis() + 1000*60*60*24;
}
// lfu item init
HashMap<String, Integer> lfuItemMap = jobLfuMap.get(jobId); // sorting by key could use a TreeMap with a Comparator; sorting by value currently has to go through an ArrayList
if (lfuItemMap == null) {
lfuItemMap = new HashMap<String, Integer>();
jobLfuMap.putIfAbsent(jobId, lfuItemMap); // avoid overwriting an existing entry
}
// put new
for (String address: addressList) {
if (!lfuItemMap.containsKey(address) || lfuItemMap.get(address) >1000000 ) {
lfuItemMap.put(address, new Random().nextInt(addressList.size())); // start from a random count to ease the initial load
}
}
// remove old
List<String> delKeys = new ArrayList<>();
for (String existKey: lfuItemMap.keySet()) {
if (!addressList.contains(existKey)) {
delKeys.add(existKey);
}
}
if (delKeys.size() > 0) {
for (String delKey: delKeys) {
lfuItemMap.remove(delKey);
}
}
// load least used count address
List<Map.Entry<String, Integer>> lfuItemList = new ArrayList<Map.Entry<String, Integer>>(lfuItemMap.entrySet());
Collections.sort(lfuItemList, new Comparator<Map.Entry<String, Integer>>() {
@Override
public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
return o1.getValue().compareTo(o2.getValue());
}
});
Map.Entry<String, Integer> addressItem = lfuItemList.get(0);
String minAddress = addressItem.getKey();
addressItem.setValue(addressItem.getValue() + 1);
return addressItem.getKey();
}
}
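In essence it maintains a jobLfuMap of the form (jobId -> (address -> count)).
For each job it records how many times each address has been used, and on every call it sorts the addresses and returns the one with the lowest count.
jobLfuMap works on a 24-hour cycle: once 24 hours have passed, the data is cleared and counting starts over.
5.3 Least recently used (LRU)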
public class ExecutorRouteLRU extends ExecutorRouter {
private static ConcurrentMap<Integer, LinkedHashMap<String, String>> jobLRUMap = new ConcurrentHashMap<Integer, LinkedHashMap<String, String>>();
private static long CACHE_VALID_TIME = 0;
public String route(int jobId, List<String> addressList) {
// cache clear
if (System.currentTimeMillis() > CACHE_VALID_TIME) {
jobLRUMap.clear();
CACHE_VALID_TIME = System.currentTimeMillis() + 1000*60*60*24;
}
// init lru
LinkedHashMap<String, String> lruItem = jobLRUMap.get(jobId);
if (lruItem == null) {
/**
* LinkedHashMap
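* a. accessOrder: true = ordered by access (reordered on get/put); false = ordered by insertion;
* b. removeEldestEntry: called when an element is added; returning true removes the eldest element. Wrapping LinkedHashMap and overriding this method, e.g. defining a maximum capacity and returning true once it is exceeded, gives a fixed-size LRU;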
*/
lruItem = new LinkedHashMap<String, String>(16, 0.75f, true);
jobLRUMap.putIfAbsent(jobId, lruItem);
}
// put new
for (String address: addressList) {
if (!lruItem.containsKey(address)) {
lruItem.put(address, address);
}
}
// remove old
List<String> delKeys = new ArrayList<>();
for (String existKey: lruItem.keySet()) {
if (!addressList.contains(existKey)) {
delKeys.add(existKey);
}
}
if (delKeys.size() > 0) {
for (String delKey: delKeys) {
lruItem.remove(delKey);
}
}
// load
String eldestKey = lruItem.entrySet().iterator().next().getKey();
String eldestValue = lruItem.get(eldestKey);
return eldestValue;
}
}
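Under the hood this also relies on the classic LRU building block, LinkedHashMap (in access order): each job has its own LinkedHashMap, which is reset every 24 hours.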
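How an access-ordered LinkedHashMap yields the least recently used address can be seen in a stripped-down version for a single job. LruRouter is a hypothetical class for illustration; it omits the per-job map and the 24-hour reset, and assumes the address list is not empty.

```java
import java.util.LinkedHashMap;
import java.util.List;

// Hypothetical single-job version of the LRU routing idea.
class LruRouter {
    // accessOrder = true: get/put move an entry to the end, so the head is the least recently used.
    private final LinkedHashMap<String, String> lru = new LinkedHashMap<>(16, 0.75f, true);

    String route(List<String> addressList) {
        // add new addresses; containsKey does not count as an access
        for (String address : addressList) {
            if (!lru.containsKey(address)) {
                lru.put(address, address);
            }
        }
        // drop addresses that are no longer registered
        lru.keySet().retainAll(addressList);
        // the head of the map is the least recently used address
        String eldestKey = lru.entrySet().iterator().next().getKey();
        return lru.get(eldestKey); // get() marks it as most recently used
    }
}
```

With a stable address list this behaves like round-robin. The explicit containsKey check matters: calling putIfAbsent on an existing key would count as an access and disturb the order.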
5.4 Failover
public class ExecutorRouteFailover extends ExecutorRouter {
@Override
public ReturnT<String> route(TriggerParam triggerParam, List<String> addressList) {
StringBuffer beatResultSB = new StringBuffer();
for (String address : addressList) {
// beat
ReturnT<String> beatResult = null;
try {
ExecutorBiz executorBiz = XxlJobScheduler.getExecutorBiz(address);
beatResult = executorBiz.beat();
} catch (Exception e) {
logger.error(e.getMessage(), e);
beatResult = new ReturnT<String>(ReturnT.FAIL_CODE, ""+e );
}
beatResultSB.append( (beatResultSB.length()>0)?"<br><br>":"")
.append(I18nUtil.getString("jobconf_beat") + ":")
.append("<br>address:").append(address)
.append("<br>code:").append(beatResult.getCode())
.append("<br>msg:").append(beatResult.getMsg());
// beat success
if (beatResult.getCode() == ReturnT.SUCCESS_CODE) {
beatResult.setMsg(beatResultSB.toString());
beatResult.setContent(address);
return beatResult;
}
}
return new ReturnT<String>(ReturnT.FAIL_CODE, beatResultSB.toString());
}
}
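The failover logic is also simple: it looks for the first address in the address set that responds normally.
It calls executorBiz.beat(), i.e. the server actively probes the client's heartbeat to determine whether the client is alive.
5.5 Busy-over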
public class ExecutorRouteBusyover extends ExecutorRouter {
@Override
public ReturnT<String> route(TriggerParam triggerParam, List<String> addressList) {
StringBuffer idleBeatResultSB = new StringBuffer();
for (String address : addressList) {
// beat
ReturnT<String> idleBeatResult = null;
try {
ExecutorBiz executorBiz = XxlJobScheduler.getExecutorBiz(address);
idleBeatResult = executorBiz.idleBeat(new IdleBeatParam(triggerParam.getJobId()));
} catch (Exception e) {
logger.error(e.getMessage(), e);
idleBeatResult = new ReturnT<String>(ReturnT.FAIL_CODE, ""+e );
}
idleBeatResultSB.append( (idleBeatResultSB.length()>0)?"<br><br>":"")
.append(I18nUtil.getString("jobconf_idleBeat") + ":")
.append("<br>address:").append(address)
.append("<br>code:").append(idleBeatResult.getCode())
.append("<br>msg:").append(idleBeatResult.getMsg());
// beat success
if (idleBeatResult.getCode() == ReturnT.SUCCESS_CODE) {
idleBeatResult.setMsg(idleBeatResultSB.toString());
idleBeatResult.setContent(address);
return idleBeatResult;
}
}
return new ReturnT<String>(ReturnT.FAIL_CODE, idleBeatResultSB.toString());
}
}
// busy check
public ReturnT<String> idleBeat(IdleBeatParam idleBeatParam) {
// isRunningOrHasQueue
boolean isRunningOrHasQueue = false;
JobThread jobThread = XxlJobExecutor.loadJobThread(idleBeatParam.getJobId());
if (jobThread != null && jobThread.isRunningOrHasQueue()) {
isRunningOrHasQueue = true;
}
if (isRunningOrHasQueue) {
return new ReturnT<String>(ReturnT.FAIL_CODE, "job thread is running or has trigger queue.");
}
return ReturnT.SUCCESS;
}
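Busy-over is implemented by calling executorBiz.idleBeat to ask the client whether it is busy.
The client counts as busy when the job's thread is currently running that job or still has triggers queued for it.
6. How Glue Mode Works
Several run modes are supported; overall they fall into three types: Bean mode, Glue Java, and Glue script.
Bean mode is the most commonly used: the @XxlJob annotation marks the method the executor should run.
Glue Java suits scenarios where the execution logic must change after release: the Java code can be edited online and is compiled dynamically, which is implemented with Groovy underneath.
Glue script mainly supports scheduling scripts, extending xxl-job's scheduling to non-Java scenarios.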
public class ExecutorBizImpl implements ExecutorBiz {
// ...
@Override
public ReturnT<String> run(TriggerParam triggerParam) {
// load old:jobHandler + jobThread
JobThread jobThread = XxlJobExecutor.loadJobThread(triggerParam.getJobId());
IJobHandler jobHandler = jobThread!=null?jobThread.getHandler():null;
String removeOldReason = null;
GlueTypeEnum glueTypeEnum = GlueTypeEnum.match(triggerParam.getGlueType());
// run mode: Bean mode
if (GlueTypeEnum.BEAN == glueTypeEnum) {
IJobHandler newJobHandler = XxlJobExecutor.loadJobHandler(triggerParam.getExecutorHandler());
// ...
if (jobHandler == null) {
jobHandler = newJobHandler;
if (jobHandler == null) {
return new ReturnT<String>(ReturnT.FAIL_CODE, "job handler [" + triggerParam.getExecutorHandler() + "] not found.");
}
}
}
// run mode: Glue Java mode
else if (GlueTypeEnum.GLUE_GROOVY == glueTypeEnum) {
// ...
if (jobHandler == null) {
try {
IJobHandler originJobHandler = GlueFactory.getInstance().loadNewInstance(triggerParam.getGlueSource());
jobHandler = new GlueJobHandler(originJobHandler, triggerParam.getGlueUpdatetime());
} catch (Exception e) {
logger.error(e.getMessage(), e);
return new ReturnT<String>(ReturnT.FAIL_CODE, e.getMessage());
}
}
}
// run mode: Glue script mode
else if (glueTypeEnum!=null && glueTypeEnum.isScript()) {
// ...
if (jobHandler == null) {
jobHandler = new ScriptJobHandler(triggerParam.getJobId(), triggerParam.getGlueUpdatetime(), triggerParam.getGlueSource(), GlueTypeEnum.match(triggerParam.getGlueType()));
}
} else {
return new ReturnT<String>(ReturnT.FAIL_CODE, "glueType[" + triggerParam.getGlueType() + "] is not valid.");
}
// apply the block strategy
if (jobThread != null) {
ExecutorBlockStrategyEnum blockStrategy = ExecutorBlockStrategyEnum.match(triggerParam.getExecutorBlockStrategy(), null);
if (ExecutorBlockStrategyEnum.DISCARD_LATER == blockStrategy) {
if (jobThread.isRunningOrHasQueue()) {
return new ReturnT<String>(ReturnT.FAIL_CODE, "block strategy effect:"+ExecutorBlockStrategyEnum.DISCARD_LATER.getTitle());
}
} else if (ExecutorBlockStrategyEnum.COVER_EARLY == blockStrategy) {
if (jobThread.isRunningOrHasQueue()) {
removeOldReason = "block strategy effect:" + ExecutorBlockStrategyEnum.COVER_EARLY.getTitle();
jobThread = null;
}
} else {
// just queue trigger
}
}
// replace thread (new or exists invalid)
if (jobThread == null) {
jobThread = XxlJobExecutor.registJobThread(triggerParam.getJobId(), jobHandler, removeOldReason);
}
// push into the trigger queue for asynchronous execution
ReturnT<String> pushResult = jobThread.pushTriggerQueue(triggerParam);
return pushResult;
}
}
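The executor's entry point is ExecutorBiz.run(). It first resolves the jobHandler for the run mode, then applies the block strategy, and finally pushes the trigger into the jobThread's queue for asynchronous execution.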
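The block-strategy step in run() can be read as a small decision table. The sketch below restates it as a pure function, assuming the default strategy simply queues the trigger; the Action enum and the decide method are hypothetical names introduced for this example only.

```java
// Hypothetical sketch of the block-strategy decision; it mirrors the branches in run() above.
final class BlockStrategySketch {

    enum Strategy { SERIAL_EXECUTION, DISCARD_LATER, COVER_EARLY }

    enum Action { ENQUEUE, REJECT, REPLACE_THREAD_THEN_ENQUEUE }

    // Decides what to do with a new trigger, given whether the job thread is busy.
    static Action decide(Strategy strategy, boolean runningOrHasQueue) {
        if (!runningOrHasQueue) {
            return Action.ENQUEUE; // an idle thread always accepts the trigger
        }
        switch (strategy) {
            case DISCARD_LATER:
                return Action.REJECT; // drop the new trigger, keep the running one
            case COVER_EARLY:
                return Action.REPLACE_THREAD_THEN_ENQUEUE; // stop the old thread, run the new trigger
            default:
                return Action.ENQUEUE; // serial execution: wait in the queue
        }
    }

    public static void main(String[] args) {
        System.out.println(decide(Strategy.DISCARD_LATER, true));
        System.out.println(decide(Strategy.COVER_EARLY, true));
        System.out.println(decide(Strategy.SERIAL_EXECUTION, true));
    }
}
```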
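The jobHandler is the mode-specific execution wrapper. In Bean mode the target method is invoked through reflection. In GLUE Java mode the Java code edited online is loaded as a class by Groovy at runtime, instantiated, and then executed.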
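For the Bean-mode half of that statement, a minimal sketch of a reflection-backed handler might look like the following. DemoJobs and MethodHandler are hypothetical names for this example and are not the classes xxl-job uses; the point is only that the handler holds a target object plus a java.lang.reflect.Method and invokes it later.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

// Hypothetical sketch of a reflection-backed job handler; not the xxl-job implementation.
final class ReflectiveHandlerSketch {

    // Sample job bean; in Bean mode the method would carry the @XxlJob annotation.
    public static class DemoJobs {
        public int runs = 0;
        public void demoJob() { runs++; }
    }

    // Holds a target object and one of its public no-arg methods for later invocation.
    static final class MethodHandler {
        private final Object target;
        private final Method method;

        private MethodHandler(Object target, Method method) {
            this.target = target;
            this.method = method;
        }

        static MethodHandler of(Object target, String methodName) {
            try {
                return new MethodHandler(target, target.getClass().getMethod(methodName));
            } catch (NoSuchMethodException e) {
                throw new IllegalArgumentException("no public no-arg method named " + methodName, e);
            }
        }

        void execute() {
            try {
                method.invoke(target); // reflective call, no compile-time link to the job class
            } catch (IllegalAccessException | InvocationTargetException e) {
                throw new IllegalStateException("job method failed", e);
            }
        }
    }

    public static void main(String[] args) {
        DemoJobs jobs = new DemoJobs();
        MethodHandler.of(jobs, "demoJob").execute();
        System.out.println("runs = " + jobs.runs);
    }
}
```

The GLUE Java path, by contrast, starts from source text, which is what the original snippet that follows shows.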
IJobHandler originJobHandler = GlueFactory.getInstance().loadNewInstance(triggerParam.getGlueSource());
jobHandler = new GlueJobHandler(originJobHandler, triggerParam.getGlueUpdatetime());
// load the class and create a handler instance
public IJobHandler loadNewInstance(String codeSource) throws Exception{
if (codeSource!=null && codeSource.trim().length()>0) {
Class<?> clazz = getCodeSourceClass(codeSource);
if (clazz != null) {
Object instance = clazz.newInstance();
if (instance!=null) {
if (instance instanceof IJobHandler) {
this.injectService(instance);
return (IJobHandler) instance;
} else {
throw new IllegalArgumentException(">>>>>>>>>>> xxl-glue, loadNewInstance error, "
+ "cannot convert from instance["+ instance.getClass() +"] to IJobHandler");
}
}
}
}
throw new IllegalArgumentException(">>>>>>>>>>> xxl-glue, loadNewInstance error, instance is null");
}
// load the class dynamically through Groovy
private Class<?> getCodeSourceClass(String codeSource){
try {
// md5
byte[] md5 = MessageDigest.getInstance("MD5").digest(codeSource.getBytes());
String md5Str = new BigInteger(1, md5).toString(16);
// class cache keyed by MD5, so the same source is not parsed repeatedly (parsing is expensive)
Class<?> clazz = CLASS_CACHE.get(md5Str);
if(clazz == null){
clazz = groovyClassLoader.parseClass(codeSource);
CLASS_CACHE.putIfAbsent(md5Str, clazz);
}
return clazz;
} catch (Exception e) {
return groovyClassLoader.parseClass(codeSource);
}
}
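The caching idea in getCodeSourceClass can be isolated from Groovy. The sketch below is a hypothetical, generic version: the compiler is passed in as a function so the example stays self-contained, and it makes two small changes that are this example's choices, not xxl-job's. It pads the hex digest to 32 characters (toString(16) drops leading zeros, which is harmless for a cache key but surprising when logged), and it uses computeIfAbsent so that concurrent callers compile the same source only once.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical sketch of a digest-keyed compile cache; the "compiler" is any function of the source.
final class SourceCacheSketch<T> {
    private final ConcurrentMap<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> compiler;

    SourceCacheSketch(Function<String, T> compiler) {
        this.compiler = compiler;
    }

    // Hex MD5 of the source, zero-padded to 32 characters, with an explicit charset.
    static String md5Hex(String source) {
        try {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(source.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 should be available on every Java platform", e);
        }
    }

    // Compiles the source on first sight and returns the cached result afterwards.
    T load(String source) {
        return cache.computeIfAbsent(md5Hex(source), key -> compiler.apply(source));
    }

    public static void main(String[] args) {
        SourceCacheSketch<String> cache = new SourceCacheSketch<>(src -> {
            System.out.println("compiling " + src.length() + " chars");
            return src.toUpperCase();
        });
        cache.load("class A {}");
        cache.load("class A {}"); // second call is served from the cache
    }
}
```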
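GLUE script mode ultimately runs the script as an external command.
For example, with GLUE Python the executor machine ends up running python {script file} {job param} {shard index} {shard total}, so the script receives three parameters: the custom job parameter, the shard index, and the shard total.
Each script is also written to the GLUE source directory on the executor's local disk, so later runs can reuse the file.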
public void execute() throws Exception {
// cmd
String cmd = glueType.getCmd();
// make script file
String scriptFileName = XxlJobFileAppender.getGlueSrcPath()
.concat(File.separator)
.concat(String.valueOf(jobId))
.concat("_")
.concat(String.valueOf(glueUpdatetime))
.concat(glueType.getSuffix());
File scriptFile = new File(scriptFileName);
if (!scriptFile.exists()) {
ScriptUtil.markScriptFile(scriptFileName, gluesource);
}
// log file
String logFileName = XxlJobContext.getXxlJobContext().getJobLogFileName();
// script params: 0=job param, 1=shard index, 2=shard total
String[] scriptParams = new String[3];
scriptParams[0] = XxlJobHelper.getJobParam();
scriptParams[1] = String.valueOf(XxlJobContext.getXxlJobContext().getShardIndex());
scriptParams[2] = String.valueOf(XxlJobContext.getXxlJobContext().getShardTotal());
// invoke
XxlJobHelper.log("----------- script file:"+ scriptFileName +" -----------");
int exitValue = ScriptUtil.execToFile(cmd, scriptFileName, logFileName, scriptParams);
if (exitValue == 0) {
XxlJobHelper.handleSuccess();
return;
} else {
XxlJobHelper.handleFail("script exit value("+exitValue+") is failed");
return ;
}
}
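To make the resulting command line concrete, here is a small sketch that assembles the argument list in the order described above: interpreter, script file, then the three script parameters. ScriptCommandSketch and buildCommand are hypothetical names, the sample path is made up, and replacing a null job parameter with an empty string is this example's assumption. How ScriptUtil.execToFile builds its command internally is not shown in the excerpt.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: assembles the command line for one script run.
final class ScriptCommandSketch {

    // Order: interpreter, script file, job param, shard index, shard total.
    static List<String> buildCommand(String cmd, String scriptFile,
                                     String jobParam, int shardIndex, int shardTotal) {
        List<String> command = new ArrayList<>();
        command.add(cmd);
        command.add(scriptFile);
        command.add(jobParam == null ? "" : jobParam); // assumption: a missing param becomes ""
        command.add(String.valueOf(shardIndex));
        command.add(String.valueOf(shardTotal));
        return command;
    }

    public static void main(String[] args) {
        // The path below is illustrative only.
        List<String> command = buildCommand("python", "/tmp/gluesource/1_1700000000000.py", "hello", 0, 2);
        System.out.println(String.join(" ", command));
        // A list like this could be handed to new ProcessBuilder(command) to start the process.
    }
}
```

Inside the script, the three values arrive as ordinary positional arguments, which is what lets a sharded script job process only its own slice of the data.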
7. Reflections
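Overall, xxl-job is a simple, easy-to-use scheduling framework. The code is straightforward to read and not hard to extend, which makes it a good fit for small and mid-sized companies.
Its main weakness, in my view, is that it may run into performance problems in large-scale scheduling scenarios.
Once the job volume grows, I think the first thing to tune is the parameters of the trigger thread pools (the fast and slow pools). With the default settings only about 10 to 20 threads work at the same time, and any further triggers have to wait in the queue. Although xxl-job is asynchronous throughout, at high volume the trigger step itself can in principle block.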
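The reason so few threads are active is standard java.util.concurrent.ThreadPoolExecutor behaviour: threads beyond the core size are created only once the work queue is full. The sketch below demonstrates this with deliberately tiny numbers. It assumes the trigger pools are plain ThreadPoolExecutor instances with a bounded queue; the sizes here are illustrative and are not xxl-job's defaults.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of when a ThreadPoolExecutor grows beyond its core size.
final class PoolGrowthSketch {

    // Submits `tasks` blocking tasks and returns {poolSize, queueSize} before releasing them.
    static int[] observe(int core, int max, int queueCapacity, int tasks) {
        CountDownLatch release = new CountDownLatch(1);
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60L, TimeUnit.SECONDS, new ArrayBlockingQueue<>(queueCapacity));
        try {
            for (int i = 0; i < tasks; i++) {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy until the measurement is taken
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            return new int[] { pool.getPoolSize(), pool.getQueue().size() };
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // core=2, max=4, queue=2: tasks 3 and 4 are queued, and only task 5 adds a third thread.
        for (int tasks = 1; tasks <= 6; tasks++) {
            int[] r = observe(2, 4, 2, tasks);
            System.out.println(tasks + " tasks -> threads=" + r[0] + ", queued=" + r[1]);
        }
    }
}
```

With a queue holding thousands of entries, the maximum pool size is therefore rarely reached. Raising the core size, or shrinking the queue, is what actually increases concurrency.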
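xxl-job depends entirely on MySQL. Job scheduling and executor heartbeats generate many reads and writes, so MySQL is a critical factor in how well xxl-job performs as a whole.
xxl-job does not scale linearly for large job volumes. Adding more scheduling center instances alone does not help, because scheduling is effectively carried out by one instance at a time. Multiple instances exist mainly for high availability, much like a primary/standby setup: if one instance crashes unexpectedly, another healthy instance takes over the scheduling logic.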
scheduleThread = new Thread(new Runnable() {
@Override
public void run() {
// ...
while (!scheduleThreadToStop) {
Connection conn = null;
Boolean connAutoCommit = null;
PreparedStatement preparedStatement = null;
boolean preReadSuc = true;
try {
conn = XxlJobAdminConfig.getAdminConfig().getDataSource().getConnection();
connAutoCommit = conn.getAutoCommit();
conn.setAutoCommit(false);
// database lock
preparedStatement = conn.prepareStatement( "select * from xxl_job_lock where lock_name = 'schedule_lock' for update");
preparedStatement.execute();
// ...
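So how could xxl-job be optimized for large-scale workloads?
I think it could borrow from Nacos and Redis Cluster and spread the load, so that the scheduling center can scale linearly.
Heartbeats and scheduling would be sharded: each executor is hashed to one fixed scheduling center for writes, while any scheduling center can serve reads.
The scheduling centers would also no longer all depend on a single MySQL, where all the pressure ends up. Each could write to its own database, such as an embedded database.
Of course, this kind of load spreading makes the system more complex and raises harder problems such as data consistency and failover.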
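As a minimal sketch of the "hash an executor to a fixed scheduling center" idea, the code below maps an executor's appname to one entry of a scheduler list with a simple modulo. Everything here is hypothetical, including the names and the addresses, and none of it exists in xxl-job.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: choose the scheduling center that owns writes for a given executor.
final class SchedulerShardSketch {

    // The same appname always maps to the same scheduler while the list is unchanged.
    static String ownerOf(String executorAppName, List<String> schedulers) {
        if (schedulers.isEmpty()) {
            throw new IllegalArgumentException("no scheduling center available");
        }
        // floorMod keeps the index non-negative even when hashCode() is negative.
        int index = Math.floorMod(executorAppName.hashCode(), schedulers.size());
        return schedulers.get(index);
    }

    public static void main(String[] args) {
        List<String> schedulers = Arrays.asList("admin-0:8080", "admin-1:8080", "admin-2:8080");
        for (String app : Arrays.asList("order-executor", "report-executor", "mail-executor")) {
            System.out.println(app + " -> " + ownerOf(app, schedulers));
        }
    }
}
```

Plain modulo remaps most executors whenever a scheduling center joins or leaves. A fixed slot table (Redis Cluster uses 16384 hash slots) or consistent hashing keeps that movement small, which ties back to the failover concern above.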