For a distributed task-scheduling platform, how do you guarantee that only one server performs scheduling when multiple servers are deployed as a cluster?
xxl-job relies on a database lock, while PowerJob uses a lock-free design.
The lock-free design is achieved mainly through group isolation. What is group isolation? The core idea is that each application is associated with exactly one server, so on every scheduling pass each server only schedules the workers it is associated with, avoiding contention.
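To make the idea concrete, here is a minimal illustrative sketch (not PowerJob's actual code): the appId-to-server mapping, which PowerJob persists in the app_info table, decides which server may schedule which app.

import java.util.Map;

// Illustrative sketch of group isolation; not PowerJob's actual code.
public class GroupIsolationSketch {

    // appId -> address of the single server responsible for that app
    private final Map<Long, String> appId2Server;
    private final String myAddress;

    public GroupIsolationSketch(Map<Long, String> appId2Server, String myAddress) {
        this.appId2Server = appId2Server;
        this.myAddress = myAddress;
    }

    // A server only schedules apps bound to itself, so no two servers ever
    // schedule the same app and no global scheduling lock is needed.
    public boolean shouldSchedule(long appId) {
        return myAddress.equals(appId2Server.get(appId));
    }
}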
Let's look at this from the source code.
PowerJob does not depend on a registry, so how do servers and workers communicate?
public void start(ScheduledExecutorService timingPool) {
    // Run service discovery once at startup
    this.currentServerAddress = discovery();
    if (org.springframework.util.StringUtils.isEmpty(this.currentServerAddress) && !config.isEnableTestMode()) {
        throw new PowerJobException("can't find any available server, this worker has been quarantined.");
    }
    // Re-run discovery every 10 seconds to track server changes
    timingPool.scheduleAtFixedRate(() -> this.currentServerAddress = discovery(), 10, 10, TimeUnit.SECONDS);
}
At startup the worker performs service discovery once, and also schedules a recurring task that re-runs discovery periodically (every 10 seconds above).
The service discovery implementation:
// Excerpt from the worker's discovery() method
String result = null;
// First probe the server this worker is currently connected to
String currentServer = OhMyWorker.getCurrentServer();
if (!StringUtils.isEmpty(currentServer)) {
    String ip = currentServer.split(":")[0];
    // Requesting the current server's HTTP service directly saves one network
    // round trip and reduces the load on the other servers
    String firstServerAddress = IP2ADDRESS.get(ip);
    if (firstServerAddress != null) {
        result = acquire(firstServerAddress);
    }
}
// Fall back to trying each configured server address until one responds
for (String httpServerAddress : OhMyWorker.getConfig().getServerAddress()) {
    if (StringUtils.isEmpty(result)) {
        result = acquire(httpServerAddress);
    } else {
        break;
    }
}
It mainly calls the server's /acquire endpoint:
@GetMapping("/acquire")
public ResultDTO<String> acquireServer(Long appId, String protocol, String currentServer) {
return ResultDTO.success(serverElectionService.elect(appId, protocol, currentServer));
}
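The worker-side acquire(...) helper is not shown above. A plausible sketch, assuming it simply issues an HTTP GET against this endpoint and returns the body on success; the exact path, parameters, and ResultDTO deserialization in PowerJob may differ:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of the worker-side acquire(...) helper.
public class AcquireSketch {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String acquire(String httpServerAddress, long appId, String protocol, String currentServer) {
        // Assumed URL shape; the real path and parameter encoding may differ.
        String url = String.format("http://%s/server/acquire?appId=%d&protocol=%s&currentServer=%s",
                httpServerAddress, appId, protocol, currentServer);
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            // The real code would deserialize ResultDTO<String> and return its data field.
            return response.statusCode() == 200 ? response.body() : null;
        } catch (Exception e) {
            // A failed request just means this server candidate is unavailable.
            return null;
        }
    }
}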
public String elect(Long appId, String protocol, String currentServer) {
    if (!accurate()) {
        // If the worker's current server is this very machine, skip the
        // expensive database lookup and just confirm it
        if (getProtocolServerAddress(protocol).equals(currentServer)) {
            return currentServer;
        }
    }
    return getServer0(appId, protocol);
}
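accurate() is not shown above. A hedged guess at its intent: with some configurable probability, skip the fast local path and force the strict database-backed check, so that a worker pinned to a stale server is eventually corrected. The field name and default below are assumptions, not PowerJob's actual config:

import java.util.concurrent.ThreadLocalRandom;

// Sketch of the probabilistic "accurate" check; field name and default
// value are assumptions.
public class AccurateCheckSketch {

    private final int accurateSelectServerPercentage = 50; // assumed default

    // With N% probability, force the strict database-backed election path
    // instead of trusting the server address the worker already holds.
    private boolean accurate() {
        return ThreadLocalRandom.current().nextInt(100) < accurateSelectServerPercentage;
    }
}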
private String getServer0(Long appId, String protocol) {
    Set<String> downServerCache = Sets.newHashSet();
    for (int i = 0; i < RETRY_TIMES; i++) {
        // Lock-free read of the server currently recorded in the database
        Optional<AppInfoDO> appInfoOpt = appInfoRepository.findById(appId);
        if (!appInfoOpt.isPresent()) {
            throw new PowerJobException(appId + " is not registered!");
        }
        String appName = appInfoOpt.get().getAppName();
        String originServer = appInfoOpt.get().getCurrentServer();
        String activeAddress = activeAddress(originServer, downServerCache, protocol);
        if (StringUtils.isNotEmpty(activeAddress)) {
            return activeAddress;
        }
        // No server is alive: re-run the election, which does require a lock
        String lockName = String.format(SERVER_ELECT_LOCK, appId);
        boolean lockStatus = lockService.tryLock(lockName, 30000);
        if (!lockStatus) {
            try {
                Thread.sleep(500);
            } catch (Exception ignore) {
            }
            continue;
        }
        try {
            // Another machine may have finished the election already, so check again
            AppInfoDO appInfo = appInfoRepository.findById(appId).orElseThrow(() -> new RuntimeException("impossible, unless we just lost our database."));
            String address = activeAddress(appInfo.getCurrentServer(), downServerCache, protocol);
            if (StringUtils.isNotEmpty(address)) {
                return address;
            }
            // Usurp the role: this machine becomes the server
            // Note: AppInfoDO#currentServer always stores the ActorSystem address; it is only converted on return
            appInfo.setCurrentServer(transportService.getTransporter(Protocol.AKKA).getAddress());
            appInfo.setGmtModified(new Date());
            appInfoRepository.saveAndFlush(appInfo);
            log.info("[ServerElection] this server({}) become the new server for app(appId={}).", appInfo.getCurrentServer(), appId);
            return getProtocolServerAddress(protocol);
        } catch (Exception e) {
            log.error("[ServerElection] write new server to db failed for app {}.", appName, e);
        } finally {
            lockService.unlock(lockName);
        }
    }
    throw new PowerJobException("server elect failed for app " + appId);
}
In short: read the server currently bound to the appId from the database and send it a heartbeat (over AKKA). If the heartbeat succeeds, return that server's address; if it fails, the server is assumed to be down and a new election is run, which is the only step that takes a lock, and the winner finally writes its own address into the appInfo record.
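The activeAddress(...) probe itself is not shown above. A sketch of its contract, with the AKKA Ping replaced by a placeholder isAlive check (an assumption for illustration; the real code also converts the stored address to the requested protocol):

import java.util.Set;

// Sketch of the liveness probe behind activeAddress(...); illustrative only.
public class ActiveAddressSketch {

    String activeAddress(String serverAddress, Set<String> downServerCache, String protocol) {
        // Servers already known to be down are skipped without a new probe.
        if (serverAddress == null || downServerCache.contains(serverAddress)) {
            return null;
        }
        if (isAlive(serverAddress)) {
            // The real code maps the stored ActorSystem address to the
            // address of the requested protocol before returning it.
            return serverAddress;
        }
        // Cache the failure so retries within the same election skip this server.
        downServerCache.add(serverAddress);
        return null;
    }

    private boolean isAlive(String serverAddress) {
        // Placeholder for the AKKA heartbeat (Ping/Pong) used by PowerJob.
        return false;
    }
}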
The server, in turn, periodically pulls from the database all the jobs of all the apps it currently holds:
@Async(PJThreadPool.TIMING_POOL)
@Scheduled(fixedDelay = SCHEDULE_RATE)
public void timingSchedule() {
    long start = System.currentTimeMillis();
    Stopwatch stopwatch = Stopwatch.createStarted();
    // Query the DB first for the apps this machine is responsible for
    List<AppInfoDO> allAppInfos = appInfoRepository.findAllByCurrentServer(AkkaStarter.getActorSystemAddress());
    if (CollectionUtils.isEmpty(allAppInfos)) {
        log.info("[JobScheduleService] current server has no app's job to schedule.");
        return;
    }
    List<Long> allAppIds = allAppInfos.stream().map(AppInfoDO::getId).collect(Collectors.toList());
    // Clean up data that no longer needs to be maintained
    WorkerClusterManagerService.clean(allAppIds);
    // Schedule CRON-expression jobs
    try {
        scheduleCronJob(allAppIds);
    } catch (Exception e) {
        log.error("[CronScheduler] schedule cron job failed.", e);
    }
    String cronTime = stopwatch.toString();
    stopwatch.reset().start();
    // Schedule workflow jobs
    try {
        scheduleWorkflow(allAppIds);
    } catch (Exception e) {
        log.error("[WorkflowScheduler] schedule workflow job failed.", e);
    }
    String wfTime = stopwatch.toString();
    stopwatch.reset().start();
    // Schedule second-level (frequent) jobs
    try {
        scheduleFrequentJob(allAppIds);
    } catch (Exception e) {
        log.error("[FrequentScheduler] schedule frequent job failed.", e);
    }
    log.info("[JobScheduleService] cron schedule: {}, workflow schedule: {}, frequent schedule: {}.", cronTime, wfTime, stopwatch.stop());
    long cost = System.currentTimeMillis() - start;
    if (cost > SCHEDULE_RATE) {
        log.warn("[JobScheduleService] The database query is using too much time({}ms), please check if the database load is too high!", cost);
    }
}
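findAllByCurrentServer is a Spring Data JPA derived query, effectively filtering app_info rows by current_server. An assumed sketch of the repository interface it lives in:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

// Assumed shape of the repository; the derived query is roughly
// SELECT * FROM app_info WHERE current_server = ?
public interface AppInfoRepository extends JpaRepository<AppInfoDO, Long> {
    List<AppInfoDO> findAllByCurrentServer(String currentServer);
}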
Each job's nextTriggerTime is then used to compute a delay, and the job is pushed into the time wheel, which fires the trigger once the delay elapses:
// 2. Push into the time wheel to await scheduled execution
jobInfos.forEach(jobInfoDO -> {
    Long instanceId = jobId2InstanceId.get(jobInfoDO.getId());
    long targetTriggerTime = jobInfoDO.getNextTriggerTime();
    long delay = 0;
    if (targetTriggerTime < nowTime) {
        // Trigger time already passed: fire immediately (delay stays 0) and log the lag
        log.warn("[Job-{}] schedule delay, expect: {}, current: {}", jobInfoDO.getId(), targetTriggerTime, System.currentTimeMillis());
    } else {
        delay = targetTriggerTime - nowTime;
    }
    InstanceTimeWheelService.schedule(instanceId, delay, () -> dispatchService.dispatch(jobInfoDO, instanceId));
});
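InstanceTimeWheelService is essentially a hashed time wheel. As a stand-in illustration only (PowerJob ships its own implementation), Netty's HashedWheelTimer shows the same delay-then-fire pattern:

import java.util.concurrent.TimeUnit;
import io.netty.util.HashedWheelTimer;
import io.netty.util.Timer;

// Stand-in for InstanceTimeWheelService using Netty's HashedWheelTimer;
// this only illustrates the scheduling pattern, not PowerJob's own wheel.
public class TimeWheelSketch {

    // 1ms ticks with 4096 slots: fine-grained enough for trigger-time scheduling.
    private static final Timer TIMER = new HashedWheelTimer(1, TimeUnit.MILLISECONDS, 4096);

    public static void schedule(long instanceId, long delayMs, Runnable task) {
        // The wheel fires the task once its delay elapses; tasks whose trigger
        // time is already past are submitted with delay 0 and fire immediately.
        TIMER.newTimeout(timeout -> task.run(), delayMs, TimeUnit.MILLISECONDS);
    }
}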