Apache Open-Source Project Linkis Source Code Analysis 03 -- Backend Code Initialization


First, open the official API documentation for reference; the link is:

Task Submission and Execution REST API documentation

On a careful read, though, the official docs are not detailed enough. For example, the API involves an execID, but a reader cannot tell from the docs what the value of ${execID} is in their own cluster. The best approach is to use the debugging method from my previous post, capture screenshots of the API calls you are interested in, and then read the official docs alongside them to combine practice with theory. The previous post is here:

Apache Open-Source Project Linkis Source Code Analysis 02 -- Frontend Debugging

Now let's analyze the backend interfaces. The first one to look at is the submission interface mentioned above, i.e. the execute interface.

From the frontend screenshots of the call (captured in the browser's developer tools; images omitted), we can see that it calls:

http://192.168.233.131:18088/api/rest_j/v1/entrance/execute

It is a POST request, with the following JSON body:

{
	"executeApplicationName": "hive",
	"executionCode": "show databases",
	"runType": "hql",
	"params": {
		"variable": {},
		"configuration": {}
	},
	"source": {
		"scriptPath": "file:///tmp/linkis/root/workDir/11111.hql"
	}
}

The response returned by this call is as follows:

{
    "method": "/api/entrance/execute",
    "status": 0,
    "message": "OK",
    "data": {
        "taskID": 27,
        "execID": "exec_id018014linkis-cg-entrancelocalhost:9140IDE_root_hive_2"
    }
}
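To make the request and response shapes concrete, here is a minimal Python sketch (not part of Linkis; the helper names are illustrative) that builds the same submission payload and extracts taskID and execID from a response:

```python
def build_execute_payload(code: str, engine: str = "hive", run_type: str = "hql",
                          script_path: str = "file:///tmp/linkis/root/workDir/11111.hql") -> dict:
    """Build the JSON body for POST /api/rest_j/v1/entrance/execute."""
    return {
        "executeApplicationName": engine,
        "executionCode": code,
        "runType": run_type,
        "params": {"variable": {}, "configuration": {}},
        "source": {"scriptPath": script_path},
    }

def parse_execute_response(resp: dict) -> tuple:
    """Extract (taskID, execID) from the response; status 0 means success."""
    if resp.get("status") != 0:
        raise RuntimeError(f"submit failed: {resp.get('message')}")
    data = resp["data"]
    return data["taskID"], data["execID"]
```

You would POST the payload to the gateway with your session cookie and feed the decoded JSON into `parse_execute_response`.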

Open the Linkis source code and locate the handler for this request (screenshot omitted). The full method is:

    /**
     * The execute function handles the request submitted by the user to execute a task,
     * and the execution ID is returned to the user.
     *
     * @param json the incoming key-value pairs
     * @return Response
     */
    @ApiOperation(value = "execute", notes = "execute the submitted task", response = Message.class)
    @ApiOperationSupport(ignoreParameters = {"json"})
    @Override
    @RequestMapping(path = "/execute", method = RequestMethod.POST)
    public Message execute(HttpServletRequest req, @RequestBody Map<String, Object> json) {
        Message message = null;
        logger.info("Begin to get an execID");
        json.put(TaskConstant.EXECUTE_USER, ModuleUserUtils.getOperationUser(req));
        json.put(TaskConstant.SUBMIT_USER, SecurityFilter.getLoginUsername(req));
        HashMap<String, String> map = (HashMap<String, String>) json.get(TaskConstant.SOURCE);
        if (map == null) {
            map = new HashMap<>();
            json.put(TaskConstant.SOURCE, map);
        }
        String ip = JobHistoryHelper.getRequestIpAddr(req);
        map.put(TaskConstant.REQUEST_IP, ip);
        Job job = entranceServer.execute(json);
        JobRequest jobReq = ((EntranceJob) job).getJobRequest();
        Long jobReqId = jobReq.getId();
        ModuleUserUtils.getOperationUser(req, "execute task,id: " + jobReqId);
        pushLog(
                LogUtils.generateInfo(
                        "You have submitted a new job, script code (after variable substitution) is"),
                job);
        pushLog(
                "************************************SCRIPT CODE************************************", job);
        pushLog(jobReq.getExecutionCode(), job);
        pushLog(
                "************************************SCRIPT CODE************************************", job);
        String execID =
                ZuulEntranceUtils.generateExecID(
                        job.getId(),
                        Sender.getThisServiceInstance().getApplicationName(),
                        new String[] {Sender.getThisInstance()});
        pushLog(
                LogUtils.generateInfo(
                        "Your job is accepted,    jobID is "
                                + execID
                                + " and taskID is "
                                + jobReqId
                                + " in "
                                + Sender.getThisServiceInstance().toString()
                                + ". Please wait it to be scheduled"),
                job);
        message = Message.ok();
        message.setMethod("/api/entrance/execute");
        message.data("execID", execID);
        message.data("taskID", jobReqId);
        logger.info("End to get an an execID: {}, taskID: {}", execID, jobReqId);
        return message;
    }

The most important statement in the code above is:

Job job = entranceServer.execute(json);

This line returns a Job object. The Job's getId method returns the job's ID; in this run the ID is 27, matching the taskID in the response JSON above. This ID is used afterwards to track the job's execution. In addition, there is a globally unique execID that can also be used to track progress; it is produced by the ZuulEntranceUtils.generateExecID call near the end of the method above, which we won't expand on here.
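Judging from the observed value exec_id018014linkis-cg-entrancelocalhost:9140IDE_root_hive_2, the execID appears to pack two zero-padded length fields (018 for linkis-cg-entrance, 014 for localhost:9140) followed by the application name, the instance, and the job id. The real encoding lives in ZuulEntranceUtils.generateExecID; this Python sketch is only my reading of the apparent layout, not the actual implementation:

```python
PREFIX = "exec_id"

def pack_exec_id(job_id: str, app_name: str, instance: str) -> str:
    """Pack the pieces into one string; each variable-length field is
    preceded by a 3-digit zero-padded length so it can be split back."""
    return (PREFIX
            + f"{len(app_name):03d}" + f"{len(instance):03d}"
            + app_name + instance + job_id)

def unpack_exec_id(exec_id: str) -> tuple:
    """Reverse of pack_exec_id: read the two length fields, then slice."""
    body = exec_id[len(PREFIX):]
    app_len, inst_len = int(body[:3]), int(body[3:6])
    app_name = body[6:6 + app_len]
    instance = body[6 + app_len:6 + app_len + inst_len]
    job_id = body[6 + app_len + inst_len:]
    return job_id, app_name, instance
```

Packing the values from this run reproduces the execID seen in the response above.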

This line calls the execute method of the EntranceServer class, which is written in Scala. Its execute method is shown below:

    /**
     * Execute a task and return a Job.
     * @param params
     * @return Job
     */
    def execute(params: java.util.Map[String, AnyRef]): Job = {
        if (!params.containsKey(EntranceServer.DO_NOT_PRINT_PARAMS_LOG)) {
            logger.debug("received a request: " + params)
        } else params.remove(EntranceServer.DO_NOT_PRINT_PARAMS_LOG)
        var jobRequest = getEntranceContext.getOrCreateEntranceParser().parseToTask(params)
        // todo: multi entrance instances
        jobRequest.setInstances(Sender.getThisInstance)
        Utils.tryAndWarn(CSEntranceHelper.resetCreator(jobRequest))
        // After parsing the map into a jobRequest, we need to store it in the database, so the jobRequest can get a unique taskID.
        getEntranceContext
            .getOrCreatePersistenceManager()
            .createPersistenceEngine()
            .persist(jobRequest)
        if (null == jobRequest.getId || jobRequest.getId <= 0) {
            throw new EntranceErrorException(
                PERSIST_JOBREQUEST_ERROR.getErrorCode,
                PERSIST_JOBREQUEST_ERROR.getErrorDesc
            )
        }
        logger.info(s"received a request,convert $jobRequest")

        LoggerUtils.setJobIdMDC(jobRequest.getId.toString)

        val logAppender = new java.lang.StringBuilder()
        Utils.tryThrow(
            getEntranceContext
                .getOrCreateEntranceInterceptors()
                .foreach(int => jobRequest = int.apply(jobRequest, logAppender))
        ) { t =>
            LoggerUtils.removeJobIdMDC()
            val error = t match {
                case error: ErrorException => error
                case t1: Throwable =>
                    val exception = new EntranceErrorException(
                        FAILED_ANALYSIS_TASK.getErrorCode,
                        MessageFormat.format(
                            FAILED_ANALYSIS_TASK.getErrorDesc,
                            ExceptionUtils.getRootCauseMessage(t)
                        )
                    )
                    exception.initCause(t1)
                    exception
                case _ =>
                    new EntranceErrorException(
                        FAILED_ANALYSIS_TASK.getErrorCode,
                        MessageFormat.format(
                            FAILED_ANALYSIS_TASK.getErrorDesc,
                            ExceptionUtils.getRootCauseMessage(t)
                        )
                    )
            }
            jobRequest match {
                case t: JobRequest =>
                    t.setErrorCode(error.getErrCode)
                    t.setErrorDesc(error.getDesc)
                    t.setStatus(SchedulerEventState.Failed.toString)
                    t.setProgress(EntranceJob.JOB_COMPLETED_PROGRESS.toString)
                    val infoMap = new util.HashMap[String, AnyRef]
                    infoMap.put(TaskConstant.ENGINE_INSTANCE, "NULL")
                    infoMap.put(TaskConstant.TICKET_ID, "")
                    infoMap.put("message", "Task interception failed and cannot be retried")
                    JobHistoryHelper.updateJobRequestMetrics(jobRequest, null, infoMap)
                case _ =>
            }
            getEntranceContext
                .getOrCreatePersistenceManager()
                .createPersistenceEngine()
                .updateIfNeeded(jobRequest)
            error
        }

        val job = getEntranceContext.getOrCreateEntranceParser().parseToJob(jobRequest)
        Utils.tryThrow {
            job.init()
            job.setLogListener(getEntranceContext.getOrCreateLogManager())
            job.setProgressListener(getEntranceContext.getOrCreatePersistenceManager())
            job.setJobListener(getEntranceContext.getOrCreatePersistenceManager())
            job match {
                case entranceJob: EntranceJob =>
                    entranceJob.setEntranceListenerBus(getEntranceContext.getOrCreateEventListenerBus)
                case _ =>
            }
            Utils.tryCatch {
                if (logAppender.length() > 0) {
                    job.getLogListener.foreach(_.onLogUpdate(job, logAppender.toString.trim))
                }
            } { t =>
                logger.error("Failed to write init log, reason: ", t)
            }

            /**
             * job.afterStateChanged() method is only called in job.run(), and job.run() is called only
             * after job is scheduled so it suggest that we lack a hook for job init, currently we call
             * this to trigger JobListener.onJobinit()
             */
            Utils.tryAndWarn(job.getJobListener.foreach(_.onJobInited(job)))
            getEntranceContext.getOrCreateScheduler().submit(job)
            val msg = LogUtils.generateInfo(
                s"Job with jobId : ${jobRequest.getId} and execID : ${job.getId()} submitted "
            )
            logger.info(msg)

            job match {
                case entranceJob: EntranceJob =>
                    entranceJob.getJobRequest.setReqId(job.getId())
                    if (jobTimeoutManager.timeoutCheck && JobTimeoutManager.hasTimeoutLabel(entranceJob)) {
                        jobTimeoutManager.add(job.getId(), entranceJob)
                    }
                    entranceJob.getLogListener.foreach(_.onLogUpdate(entranceJob, msg))
                case _ =>
            }
            LoggerUtils.removeJobIdMDC()
            job
        } { t =>
            LoggerUtils.removeJobIdMDC()
            job.onFailure("Submitting the query failed!(提交查询失败!)", t)
            val _jobRequest: JobRequest =
                getEntranceContext.getOrCreateEntranceParser().parseToJobRequest(job)
            getEntranceContext
                .getOrCreatePersistenceManager()
                .createPersistenceEngine()
                .updateIfNeeded(_jobRequest)
            t match {
                case e: LinkisException => e
                case e: LinkisRuntimeException => e
                case t: Throwable =>
                    new SubmitFailedException(
                        SUBMITTING_QUERY_FAILED.getErrorCode,
                        SUBMITTING_QUERY_FAILED.getErrorDesc + ExceptionUtils.getRootCauseMessage(t),
                        t
                    )
            }
        }
    }
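Stripped of error handling and listeners, the flow of execute can be condensed into Python pseudocode; persist, interceptors, and scheduler below stand in for the real Linkis components and are assumptions, not Linkis APIs:

```python
class EntranceError(Exception):
    pass

def execute(params: dict, persist, interceptors, scheduler) -> dict:
    """Sketch of EntranceServer.execute: parse the request into a job
    request, persist it to obtain a taskID, run the interceptors, build
    the Job, and hand it to the scheduler."""
    job_request = dict(params)                 # parseToTask(params)
    job_request["id"] = persist(job_request)   # persistence assigns the taskID
    if not job_request["id"] or job_request["id"] <= 0:
        raise EntranceError("persist jobRequest failed")
    for interceptor in interceptors:           # each may rewrite the request or raise
        job_request = interceptor(job_request)
    job = {"request": job_request, "state": "Inited"}   # parseToJob + job.init()
    scheduler(job)                             # getOrCreateScheduler().submit(job)
    return job
```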

The key line in this method is:

getEntranceContext.getOrCreateScheduler().submit(job)

This first obtains a Scheduler. Scheduler is an interface with three implementing classes (class-hierarchy screenshot omitted):

AbstractScheduler is an abstract class, extended by both FIFOScheduler and ParallelScheduler.

The submit call above resolves to AbstractScheduler's submit() method, since neither FIFOScheduler nor ParallelScheduler overrides submit.

So let's look at what AbstractScheduler.submit() does. Its code is as follows:

    override def submit(event: SchedulerEvent): Unit = {
        val group = getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(event)
        val consumer =
            getSchedulerContext.getOrCreateConsumerManager.getOrCreateConsumer(group.getGroupName)
        val index = consumer.getConsumeQueue.offer(event)
        index.map(getEventId(_, group.getGroupName)).foreach(event.setId)
        if (index.isEmpty) {
            throw new SchedulerErrorException(
                JOB_QUEUE_IS_FULL.getErrorCode,
                JOB_QUEUE_IS_FULL.getErrorDesc
            )
        }
    }
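The routing done by submit can be sketched in Python; the group-name scheme and the fixed capacity below are simplifications, not the real GroupFactory logic:

```python
class QueueFullError(Exception):
    pass

class MiniScheduler:
    """Sketch of AbstractScheduler.submit: route each event to the queue
    of its group, and fail when the bounded queue is full."""
    def __init__(self, capacity: int = 2):
        self.capacity = capacity
        self.queues = {}          # groupName -> list of pending events

    def group_of(self, event) -> str:
        # the real GroupFactory derives the group from creator/user/engine labels
        return f"{event['creator']}_{event['user']}_{event['engine']}"

    def submit(self, event) -> str:
        group = self.group_of(event)
        queue = self.queues.setdefault(group, [])
        if len(queue) >= self.capacity:       # offer returned None: queue is full
            raise QueueFullError(f"job queue of {group} is full")
        queue.append(event)
        event["id"] = f"{group}_{len(queue) - 1}"  # event id derived from the slot index
        return event["id"]
```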

First, a group is created (or fetched) for the event according to its job type; each group then gets its own consumer, and the event is offered into that consumer's queue via:

consumer.getConsumeQueue.offer(event)

Each groupName corresponds to one consumer, which holds the jobs of the same tenant and job type.

Note the queue semantics: put() would suspend the current thread when the queue is full, whereas offer() never blocks; here it returns None when the queue is full (analogous to offer() returning false in java.util.concurrent queues).

getConsumeQueue returns a ConsumeQueue. ConsumeQueue is an interface with a single implementation, LoopArrayQueue; its offer method is:

override def offer(event: SchedulerEvent): Option[Int] = {
  var index = -1
  writeLock synchronized {
    if (isFull) return None
    else {
      index = add(event)
    }
  }
  readLock synchronized { readLock.notify() }
  Some(index)
}
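The offer semantics can be mirrored in a small Python sketch. This is a simplified bounded queue, not the real ring buffer; in particular, real LoopArrayQueue indices grow monotonically, while the indices here restart as items are taken:

```python
import threading

class LoopQueue:
    """Sketch of LoopArrayQueue.offer: a non-blocking put that returns the
    slot index, or None when the queue is full, and wakes a waiting reader."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = []
        self.cond = threading.Condition()

    def offer(self, event):
        with self.cond:
            if len(self.items) >= self.capacity:
                return None                 # full: the caller decides what to do
            self.items.append(event)
            self.cond.notify()              # wake a consumer blocked in take()
            return len(self.items) - 1      # index used to build the event id

    def take(self):
        with self.cond:
            while not self.items:
                self.cond.wait()
            return self.items.pop(0)
```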

Next, look at how the consumer is obtained in AbstractScheduler.submit():

val consumer = getSchedulerContext.getOrCreateConsumerManager.getOrCreateConsumer(group.getGroupName)

Here, getOrCreateConsumerManager resolves to FIFOSchedulerContextImpl's getOrCreateConsumerManager method, shown below:

override def getOrCreateConsumerManager: ConsumerManager = {
  if (consumerManager != null) return consumerManager
  lock.synchronized {
    if (consumerManager == null) {
      consumerManager = createConsumerManager()
      consumerManager.setSchedulerContext(this)
    }
  }
  consumerManager
}
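This is the classic double-checked lazy-initialization pattern: a lock-free fast path, then a re-check under the lock so the instance is created exactly once. A minimal Python sketch:

```python
import threading

class LazyHolder:
    """Sketch of the double-checked lazy init in getOrCreateConsumerManager."""
    def __init__(self, factory):
        self._factory = factory
        self._lock = threading.Lock()
        self._value = None

    def get(self):
        if self._value is not None:      # fast path: no locking once created
            return self._value
        with self._lock:
            if self._value is None:      # re-check under the lock
                self._value = self._factory()
        return self._value
```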

This calls consumerManager.setSchedulerContext(this); stepping into it:

override def setSchedulerContext(schedulerContext: SchedulerContext): Unit = {
  super.setSchedulerContext(schedulerContext)
  group = getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(null)
  executorService = group match {
    case g: FIFOGroup =>
      Utils.newCachedThreadPool(g.getMaxRunningJobs + 2, groupName + "-Thread-")
    case _ =>
      throw new SchedulerErrorException(
        NEED_SUPPORTED_GROUP.getErrorCode,
        MessageFormat.format(NEED_SUPPORTED_GROUP.getErrorDesc, group.getClass)
      )
  }
  consumerQueue = new LoopArrayQueue(
    getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(null)
  )
  consumer = createConsumer(groupName)
}

Note the last line, consumer = createConsumer(groupName). Stepping into createConsumer:

override protected def createConsumer(groupName: String): Consumer = {
  val group = getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(null)
  val consumer = new FIFOUserConsumer(getSchedulerContext, getOrCreateExecutorService, group)
  consumer.setGroup(group)
  consumer.setConsumeQueue(consumerQueue)
  if (consumerListener != null) consumerListener.onConsumerCreated(consumer)
  consumer.start()
  consumer
}

Here we can see that consumer.start() is called. Stepping into start():

def start(): Unit = {
  future = executeService.submit(this)
  bdpFutureTask = new BDPFutureTask(this.future)
}

executeService.submit(this) hands the consumer object itself to the thread pool; when the pool runs it, the run method of this class (FIFOUserConsumer) is invoked:

override def run(): Unit = {
  Thread.currentThread().setName(s"${toString}Thread")
  logger.info(s"$toString thread started!")
  while (!terminate) {
    Utils.tryAndError(loop())
    Utils.tryAndError(Thread.sleep(10))
  }
  logger.info(s"$toString thread stopped!")
}
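The run loop can be sketched in Python; loop() here is a trivial stand-in for the real consumption logic, and the class names are illustrative:

```python
import threading
import time

class MiniConsumer(threading.Thread):
    """Sketch of FIFOUserConsumer.run: call loop() until terminate is set,
    swallowing per-iteration errors so one bad job cannot kill the thread."""
    def __init__(self, queue: list):
        super().__init__(daemon=True)
        self.queue = queue
        self.done = []
        self.terminate = False

    def loop(self):
        # stand-in for the real loop(): consume one pending job if any
        if self.queue:
            self.done.append(self.queue.pop(0))

    def run(self):
        while not self.terminate:
            try:
                self.loop()                 # Utils.tryAndError(loop())
            except Exception:
                pass                        # log and keep consuming
            time.sleep(0.01)                # Thread.sleep(10)
```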

This brings us to the most important part: the loop() method, which run() keeps invoking for as long as the consumer is alive. Its code is as follows:

protected def loop(): Unit = {
  var isRetryJob = false
  def getWaitForRetryEvent: Option[SchedulerEvent] = {
    val waitForRetryJobs = runningJobs.filter(job => job != null && job.isJobCanRetry)
    waitForRetryJobs.find { job =>
      isRetryJob = Utils.tryCatch(job.turnToRetry()) { t =>
        job.onFailure(
          "Job state flipped to Scheduled failed in Retry(Retry时,job状态翻转为Scheduled失败)!",
          t
        )
        false
      }
      isRetryJob
    }
  }
  var event: Option[SchedulerEvent] = getWaitForRetryEvent
  if (event.isEmpty) {
    val completedNums = runningJobs.filter(job => job == null || job.isCompleted)
    if (completedNums.length < 1) {
      Utils.tryQuietly(Thread.sleep(1000)) // TODO: could be optimized further by implementing a JobListener
      return
    }
    while (event.isEmpty) {
      val takeEvent = if (getRunningEvents.isEmpty) Option(queue.take()) else queue.take(3000)
      event =
        if (
            takeEvent.exists(e =>
              Utils.tryCatch(e.turnToScheduled()) { t =>
                takeEvent.get.asInstanceOf[Job].onFailure("Job状态翻转为Scheduled失败!", t)
                false
              }
            )
        ) {
          takeEvent
        } else getWaitForRetryEvent
    }
  }
  event.foreach { case job: Job =>
    Utils.tryCatch {
      val (totalDuration, askDuration) =
        (fifoGroup.getMaxAskExecutorDuration, fifoGroup.getAskExecutorInterval)
      var executor: Option[Executor] = None
      job.consumerFuture = bdpFutureTask
      Utils.waitUntil(
        () => {
          executor = Utils.tryCatch(
            schedulerContext.getOrCreateExecutorManager.askExecutor(job, askDuration)
          ) {
            case warn: WarnException =>
              job.getLogListener.foreach(_.onLogUpdate(job, LogUtils.generateWarn(warn.getDesc)))
              None
            case e: ErrorException =>
              job.getLogListener.foreach(_.onLogUpdate(job, LogUtils.generateERROR(e.getMessage)))
              throw e
            case error: Throwable =>
              job.getLogListener.foreach(
                _.onLogUpdate(job, LogUtils.generateERROR(error.getMessage))
              )
              throw error
          }
          Utils.tryQuietly(askExecutorGap())
          executor.isDefined
        },
        totalDuration
      )
      job.consumerFuture = null
      executor.foreach { executor =>
        job.setExecutor(executor)
        job.future = executeService.submit(job)
        job.getJobDaemon.foreach(jobDaemon => jobDaemon.future = executeService.submit(jobDaemon))
        if (!isRetryJob) putToRunningJobs(job)
      }
    } {
      case _: TimeoutException =>
        logger.warn(s"Ask executor for Job $job timeout!")
        job.onFailure(
          "The request engine times out (请求引擎超时,可能是EngineConnManager 启动EngineConn失败导致,可以去查看看EngineConnManager的linkis.out和linkis.log日志).",
          new SchedulerErrorException(
            REQUEST_ENGINE_TIME_OUT.getErrorCode,
            REQUEST_ENGINE_TIME_OUT.getErrorDesc
          )
        )
      case error: Throwable =>
        job.onFailure("请求引擎失败,可能是由于后台进程错误!请联系管理员", error)
        if (job.isWaitForRetry) {
          logger.warn(s"Ask executor for Job $job failed, wait for the next retry!", error)
          if (!isRetryJob) putToRunningJobs(job)
        } else logger.warn(s"Ask executor for Job $job failed!", error)
    }
  }
}
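The ask-until-timeout behavior of the Utils.waitUntil block above can be sketched as follows (interval and timeout handling simplified; the function name is illustrative):

```python
import time

class AskTimeout(Exception):
    pass

def ask_executor_until(ask, ask_interval: float, total_duration: float):
    """Sketch of the waitUntil block in loop(): keep asking for an executor
    at a fixed interval until one is returned or the total duration
    elapses, in which case the job fails with a timeout."""
    deadline = time.monotonic() + total_duration
    while time.monotonic() < deadline:
        executor = ask()                 # askExecutor(job, askDuration)
        if executor is not None:
            return executor
        time.sleep(ask_interval)         # askExecutorGap()
    raise AskTimeout("ask executor timeout")
```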

The key call inside loop() is:

schedulerContext.getOrCreateExecutorManager.askExecutor(job, askDuration)

getOrCreateExecutorManager returns an EntranceExecutorManager (an abstract class), and this line invokes its askExecutor() method:

override def askExecutor(schedulerEvent: SchedulerEvent): Option[Executor] =
  schedulerEvent match {
    case job: Job =>
      Option(createExecutor(job))
  }

askExecutor in turn calls createExecutor(); stepping in:

  override protected def createExecutor(schedulerEvent: SchedulerEvent): EntranceExecutor =
    schedulerEvent match {
      case job: EntranceJob =>
        job.getJobRequest match {
          case jobReq: JobRequest =>
            val entranceEntranceExecutor =
              new DefaultEntranceExecutor(jobReq.getId)
            // getEngineConn Executor
            job.getLogListener.foreach(
              _.onLogUpdate(
                job,
                LogUtils.generateInfo("Your job is being scheduled by orchestrator.")
              )
            )
            jobReq.setUpdatedTime(new Date(System.currentTimeMillis()))

            /**
             * // val engineConnExecutor = engineConnManager.getAvailableEngineConnExecutor(mark)
             * idToEngines.put(entranceEntranceExecutor.getId, entranceEntranceExecutor)
             */
//          instanceToEngines.put(engineConnExecutor.getServiceInstance.getInstance, entranceEntranceExecutor) // todo
//          entranceEntranceExecutor.setInterceptors(getOrCreateInterceptors()) // todo
            entranceEntranceExecutor
          case _ =>
            throw new EntranceErrorException(
              NOT_CREATE_EXECUTOR.getErrorCode,
              NOT_CREATE_EXECUTOR.getErrorDesc
            )
        }
      case _ =>
        throw new EntranceErrorException(
          ENTRA_NOT_CREATE_EXECUTOR.getErrorCode,
          ENTRA_NOT_CREATE_EXECUTOR.getErrorDesc
        )
    }

The key lines here are the ones that create the executor:

val entranceEntranceExecutor = new DefaultEntranceExecutor(jobReq.getId)

This creates a DefaultEntranceExecutor, whose important methods are callExecute and requestToComputationJobReq. When the job is picked up by the thread pool, the run method of the Job class is invoked, which contains this line:

val rs = Utils.tryCatch(executor.execute(jobToExecuteRequest)) 

This calls the execute method of EntranceExecutor, which in turn calls DefaultEntranceExecutor's callExecute method. The requestToComputationJobReq method is mainly used to convert the Job into an executable ComputationJobReq object.

The core orchestration is also initialized inside DefaultEntranceExecutor's callExecute method; from that point on, the orchestrator takes over scheduling.

That wraps up today's analysis; we'll continue in the next post.