For a distributed task-scheduling platform, how do you guarantee that only one server performs scheduling when multiple servers are deployed as a cluster?
xxl-job relies on a database lock, while PowerJob uses a lock-free design.
The lock-free design is achieved mainly through group isolation. What is group isolation? The core idea is that each application is associated with exactly one server, so on every scheduling pass each server only schedules the workers it is associated with, avoiding contention.
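To make the idea concrete, here is a minimal illustrative sketch (not PowerJob's actual code): the appId-to-server mapping, which PowerJob persists in the app_info table, decides which server may schedule which app.

import java.util.Map;

// Illustrative sketch of group isolation; not PowerJob's actual code.
public class GroupIsolationSketch {

    // appId -> address of the single server responsible for that app
    private final Map<Long, String> appId2Server;
    private final String myAddress;

    public GroupIsolationSketch(Map<Long, String> appId2Server, String myAddress) {
        this.appId2Server = appId2Server;
        this.myAddress = myAddress;
    }

    // A server only schedules apps bound to itself, so no two servers ever
    // schedule the same app and no global scheduling lock is needed.
    public boolean shouldSchedule(long appId) {
        return myAddress.equals(appId2Server.get(appId));
    }
}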
Let's look at this from the source code.
PowerJob does not depend on a registry, so how do servers and workers communicate?
public void start(ScheduledExecutorService timingPool) {
    // Run service discovery once at startup
    this.currentServerAddress = discovery();
    if (org.springframework.util.StringUtils.isEmpty(this.currentServerAddress) && !config.isEnableTestMode()) {
        throw new PowerJobException("can't find any available server, this worker has been quarantined.");
    }
    // Re-run discovery every 10 seconds to track server changes
    timingPool.scheduleAtFixedRate(() -> this.currentServerAddress = discovery(), 10, 10, TimeUnit.SECONDS);
}
At startup the worker performs service discovery once, and also schedules a recurring task that re-runs discovery periodically (every 10 seconds above).
The service discovery implementation:
// Excerpt from the worker's discovery() method
String result = null;
// First probe the server this worker is currently connected to
String currentServer = OhMyWorker.getCurrentServer();
if (!StringUtils.isEmpty(currentServer)) {
    String ip = currentServer.split(":")[0];
    // Requesting the current server's HTTP service directly saves one network
    // round trip and reduces the load on the other servers
    String firstServerAddress = IP2ADDRESS.get(ip);
    if (firstServerAddress != null) {
        result = acquire(firstServerAddress);
    }
}
// Fall back to trying each configured server address until one responds
for (String httpServerAddress : OhMyWorker.getConfig().getServerAddress()) {
    if (StringUtils.isEmpty(result)) {
        result = acquire(httpServerAddress);
    } else {
        break;
    }
}
It mainly calls the server's /acquire endpoint:
@GetMapping("/acquire")
public ResultDTO<String> acquireServer(Long appId, String protocol, String currentServer) {
return ResultDTO.success(serverElectionService.elect(appId, protocol, currentServer));
}
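The worker-side acquire(...) helper is not shown above. A plausible sketch, assuming it simply issues an HTTP GET against this endpoint and returns the body on success; the exact path, parameters, and ResultDTO deserialization in PowerJob may differ:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical sketch of the worker-side acquire(...) helper.
public class AcquireSketch {

    private static final HttpClient CLIENT = HttpClient.newHttpClient();

    static String acquire(String httpServerAddress, long appId, String protocol, String currentServer) {
        // Assumed URL shape; the real path and parameter encoding may differ.
        String url = String.format("http://%s/server/acquire?appId=%d&protocol=%s&currentServer=%s",
                httpServerAddress, appId, protocol, currentServer);
        try {
            HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
            HttpResponse<String> response = CLIENT.send(request, HttpResponse.BodyHandlers.ofString());
            // The real code would deserialize ResultDTO<String> and return its data field.
            return response.statusCode() == 200 ? response.body() : null;
        } catch (Exception e) {
            // A failed request just means this server candidate is unavailable.
            return null;
        }
    }
}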
public String elect(Long appId, String protocol, String currentServer) {
    if (!accurate()) {
        // If the worker's current server is this very machine, skip the
        // expensive database lookup and just confirm it
        if (getProtocolServerAddress(protocol).equals(currentServer)) {
            return currentServer;
        }
    }
    return getServer0(appId, protocol);
}
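accurate() is not shown above. A hedged guess at its intent: with some configurable probability, skip the fast local path and force the strict database-backed check, so that a worker pinned to a stale server is eventually corrected. The field name and default below are assumptions, not PowerJob's actual config:

import java.util.concurrent.ThreadLocalRandom;

// Sketch of the probabilistic "accurate" check; field name and default
// value are assumptions.
public class AccurateCheckSketch {

    private final int accurateSelectServerPercentage = 50; // assumed default

    // With N% probability, force the strict database-backed election path
    // instead of trusting the server address the worker already holds.
    private boolean accurate() {
        return ThreadLocalRandom.current().nextInt(100) < accurateSelectServerPercentage;
    }
}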
private String getServer0(Long appId, String protocol) {
    Set<String> downServerCache = Sets.newHashSet();
    for (int i = 0; i < RETRY_TIMES; i++) {
        // Lock-free read of the server currently recorded in the database
        Optional<AppInfoDO> appInfoOpt = appInfoRepository.findById(appId);
        if (!appInfoOpt.isPresent()) {
            throw new PowerJobException(appId + " is not registered!");
        }
        String appName = appInfoOpt.get().getAppName();
        String originServer = appInfoOpt.get().getCurrentServer();
        String activeAddress = activeAddress(originServer, downServerCache, protocol);
        if (StringUtils.isNotEmpty(activeAddress)) {
            return activeAddress;
        }
        // No server is alive: re-run the election, which does require a lock
        String lockName = String.format(SERVER_ELECT_LOCK, appId);
        boolean lockStatus = lockService.tryLock(lockName, 30000);
        if (!lockStatus) {
            try {
                Thread.sleep(500);
            } catch (Exception ignore) {
            }
            continue;
        }
        try {
            // Another machine may have finished the election already, so check again
            AppInfoDO appInfo = appInfoRepository.findById(appId).orElseThrow(() -> new RuntimeException("impossible, unless we just lost our database."));
            String address = activeAddress(appInfo.getCurrentServer(), downServerCache, protocol);
            if (StringUtils.isNotEmpty(address)) {
                return address;
            }
            // Usurp the role: this machine becomes the server
            // Note: AppInfoDO#currentServer always stores the ActorSystem address; it is only converted on return
            appInfo.setCurrentServer(transportService.getTransporter(Protocol.AKKA).getAddress());
            appInfo.setGmtModified(new Date());
            appInfoRepository.saveAndFlush(appInfo);
            log.info("[ServerElection] this server({}) become the new server for app(appId={}).", appInfo.getCurrentServer(), appId);
            return getProtocolServerAddress(protocol);
        } catch (Exception e) {
            log.error("[ServerElection] write new server to db failed for app {}.", appName, e);
        } finally {
            lockService.unlock(lockName);
        }
    }
    throw new PowerJobException("server elect failed for app " + appId);
}
In short: read the server currently bound to the appId from the database and send it a heartbeat (over AKKA). If the heartbeat succeeds, return that server's address; if it fails, the server is assumed to be down and a new election is run, which is the only step that takes a lock, and the winner finally writes its own address into the appInfo record.
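The activeAddress(...) probe itself is not shown above. A sketch of its contract, with the AKKA Ping replaced by a placeholder isAlive check (an assumption for illustration; the real code also converts the stored address to the requested protocol):

import java.util.Set;

// Sketch of the liveness probe behind activeAddress(...); illustrative only.
public class ActiveAddressSketch {

    String activeAddress(String serverAddress, Set<String> downServerCache, String protocol) {
        // Servers already known to be down are skipped without a new probe.
        if (serverAddress == null || downServerCache.contains(serverAddress)) {
            return null;
        }
        if (isAlive(serverAddress)) {
            // The real code maps the stored ActorSystem address to the
            // address of the requested protocol before returning it.
            return serverAddress;
        }
        // Cache the failure so retries within the same election skip this server.
        downServerCache.add(serverAddress);
        return null;
    }

    private boolean isAlive(String serverAddress) {
        // Placeholder for the AKKA heartbeat (Ping/Pong) used by PowerJob.
        return false;
    }
}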
The server, in turn, periodically pulls from the database all the jobs of all the apps it currently holds:
@Async(PJThreadPool.TIMING_POOL)
@Scheduled(fixedDelay = SCHEDULE_RATE)
public void timingSchedule() {
    long start = System.currentTimeMillis();
    Stopwatch stopwatch = Stopwatch.createStarted();
    // Query the DB first for the apps this machine is responsible for
    List<AppInfoDO> allAppInfos = appInfoRepository.findAllByCurrentServer(AkkaStarter.getActorSystemAddress());
    if (CollectionUtils.isEmpty(allAppInfos)) {
        log.info("[JobScheduleService] current server has no app's job to schedule.");
        return;
    }
    List<Long> allAppIds = allAppInfos.stream().map(AppInfoDO::getId).collect(Collectors.toList());
    // Clean up data that no longer needs to be maintained
    WorkerClusterManagerService.clean(allAppIds);
    // Schedule CRON-expression jobs
    try {
        scheduleCronJob(allAppIds);
    } catch (Exception e) {
        log.error("[CronScheduler] schedule cron job failed.", e);
    }
    String cronTime = stopwatch.toString();
    stopwatch.reset().start();
    // Schedule workflow jobs
    try {
        scheduleWorkflow(allAppIds);
    } catch (Exception e) {
        log.error("[WorkflowScheduler] schedule workflow job failed.", e);
    }
    String wfTime = stopwatch.toString();
    stopwatch.reset().start();
    // Schedule second-level (frequent) jobs
    try {
        scheduleFrequentJob(allAppIds);
    } catch (Exception e) {
        log.error("[FrequentScheduler] schedule frequent job failed.", e);
    }
    log.info("[JobScheduleService] cron schedule: {}, workflow schedule: {}, frequent schedule: {}.", cronTime, wfTime, stopwatch.stop());
    long cost = System.currentTimeMillis() - start;
    if (cost > SCHEDULE_RATE) {
        log.warn("[JobScheduleService] The database query is using too much time({}ms), please check if the database load is too high!", cost);
    }
}
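findAllByCurrentServer is a Spring Data JPA derived query, effectively filtering app_info rows by current_server. An assumed sketch of the repository interface it lives in:

import java.util.List;
import org.springframework.data.jpa.repository.JpaRepository;

// Assumed shape of the repository; the derived query is roughly
// SELECT * FROM app_info WHERE current_server = ?
public interface AppInfoRepository extends JpaRepository<AppInfoDO, Long> {
    List<AppInfoDO> findAllByCurrentServer(String currentServer);
}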
Each job's nextTriggerTime is then used to compute a delay, and the job is pushed into the time wheel, which fires the trigger once the delay elapses:
// 2. Push into the time wheel to await scheduled execution
jobInfos.forEach(jobInfoDO -> {
    Long instanceId = jobId2InstanceId.get(jobInfoDO.getId());
    long targetTriggerTime = jobInfoDO.getNextTriggerTime();
    long delay = 0;
    if (targetTriggerTime < nowTime) {
        // Trigger time already passed: fire immediately (delay stays 0) and log the lag
        log.warn("[Job-{}] schedule delay, expect: {}, current: {}", jobInfoDO.getId(), targetTriggerTime, System.currentTimeMillis());
    } else {
        delay = targetTriggerTime - nowTime;
    }
    InstanceTimeWheelService.schedule(instanceId, delay, () -> dispatchService.dispatch(jobInfoDO, instanceId));
});
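InstanceTimeWheelService is essentially a hashed time wheel. As a stand-in illustration only (PowerJob ships its own implementation), Netty's HashedWheelTimer shows the same delay-then-fire pattern:

import java.util.concurrent.TimeUnit;
import io.netty.util.HashedWheelTimer;
import io.netty.util.Timer;

// Stand-in for InstanceTimeWheelService using Netty's HashedWheelTimer;
// this only illustrates the scheduling pattern, not PowerJob's own wheel.
public class TimeWheelSketch {

    // 1ms ticks with 4096 slots: fine-grained enough for trigger-time scheduling.
    private static final Timer TIMER = new HashedWheelTimer(1, TimeUnit.MILLISECONDS, 4096);

    public static void schedule(long instanceId, long delayMs, Runnable task) {
        // The wheel fires the task once its delay elapses; tasks whose trigger
        // time is already past are submitted with delay 0 and fire immediately.
        TIMER.newTimeout(timeout -> task.run(), delayMs, TimeUnit.MILLISECONDS);
    }
}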