Start by opening the official API documentation for reference:
Task Submission and Execution REST API documentation
On a closer read, though, the official docs are not detailed enough. For example, the API involves an execID, but a reader cannot tell from the docs what the ${execID} value is in their own cluster. The best approach is to use the debugging method from my previous post to capture the API calls you care about as screenshots first, then read them alongside the official docs so that practice and theory reinforce each other. The previous post:
Apache Linkis Source Code Analysis 02 -- Frontend Debugging
Now let's analyze the backend interfaces. The first one to look at is the submission interface mentioned above, i.e. the execute endpoint.
A frontend screenshot of the call shows the request URL:
http://192.168.233.131:18088/api/rest_j/v1/entrance/execute
It is a POST request, with the following JSON body:
{
  "executeApplicationName": "hive",
  "executionCode": "show databases",
  "runType": "hql",
  "params": {
    "variable": {},
    "configuration": {}
  },
  "source": {
    "scriptPath": "file:///tmp/linkis/root/workDir/11111.hql"
  }
}
The response from this call is:
{
  "method": "/api/entrance/execute",
  "status": 0,
  "message": "OK",
  "data": {
    "taskID": 27,
    "execID": "exec_id018014linkis-cg-entrancelocalhost:9140IDE_root_hive_2"
  }
}
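For reference, this call can be reproduced outside the browser. The sketch below uses the JDK's built-in HttpClient; the host/port and the payload come from the capture above, but the authentication header is only a placeholder (a valid Linkis login session is assumed in a real deployment, which is why the actual send is gated behind a command-line flag):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ExecuteDemo {

    // Assembles the same JSON body as the frontend capture above.
    static String buildPayload(String code) {
        return "{"
            + "\"executeApplicationName\": \"hive\","
            + "\"executionCode\": \"" + code + "\","
            + "\"runType\": \"hql\","
            + "\"params\": {\"variable\": {}, \"configuration\": {}},"
            + "\"source\": {\"scriptPath\": \"file:///tmp/linkis/root/workDir/11111.hql\"}"
            + "}";
    }

    public static void main(String[] args) throws Exception {
        String payload = buildPayload("show databases");
        System.out.println(payload);
        // Only send when explicitly asked to: the address below is my test
        // cluster, and a valid login session cookie is assumed.
        if (args.length > 0 && "send".equals(args[0])) {
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://192.168.233.131:18088/api/rest_j/v1/entrance/execute"))
                .header("Content-Type", "application/json")
                // .header("Cookie", "...") // placeholder for a real Linkis session
                .POST(HttpRequest.BodyPublishers.ofString(payload))
                .build();
            HttpResponse<String> resp = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(resp.body()); // on success: status 0 with data.execID / data.taskID
        }
    }
}
```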
Open the Linkis source and locate the code that handles this request. The full method is below:
/**
 * The execute method handles the request submitted by the user to execute a task,
 * and returns the execution ID to the user.
 *
 * @param json the incoming key-value pairs
 */
@ApiOperation(value = "execute", notes = "execute the submitted task", response = Message.class)
@ApiOperationSupport(ignoreParameters = {"json"})
@Override
@RequestMapping(path = "/execute", method = RequestMethod.POST)
public Message execute(HttpServletRequest req, @RequestBody Map<String, Object> json) {
  Message message = null;
  logger.info("Begin to get an execID");
  json.put(TaskConstant.EXECUTE_USER, ModuleUserUtils.getOperationUser(req));
  json.put(TaskConstant.SUBMIT_USER, SecurityFilter.getLoginUsername(req));
  HashMap<String, String> map = (HashMap<String, String>) json.get(TaskConstant.SOURCE);
  if (map == null) {
    map = new HashMap<>();
    json.put(TaskConstant.SOURCE, map);
  }
  String ip = JobHistoryHelper.getRequestIpAddr(req);
  map.put(TaskConstant.REQUEST_IP, ip);
  Job job = entranceServer.execute(json);
  JobRequest jobReq = ((EntranceJob) job).getJobRequest();
  Long jobReqId = jobReq.getId();
  ModuleUserUtils.getOperationUser(req, "execute task,id: " + jobReqId);
  pushLog(
      LogUtils.generateInfo(
          "You have submitted a new job, script code (after variable substitution) is"),
      job);
  pushLog(
      "************************************SCRIPT CODE************************************", job);
  pushLog(jobReq.getExecutionCode(), job);
  pushLog(
      "************************************SCRIPT CODE************************************", job);
  String execID =
      ZuulEntranceUtils.generateExecID(
          job.getId(),
          Sender.getThisServiceInstance().getApplicationName(),
          new String[] {Sender.getThisInstance()});
  pushLog(
      LogUtils.generateInfo(
          "Your job is accepted, jobID is "
              + execID
              + " and taskID is "
              + jobReqId
              + " in "
              + Sender.getThisServiceInstance().toString()
              + ". Please wait it to be scheduled"),
      job);
  message = Message.ok();
  message.setMethod("/api/entrance/execute");
  message.data("execID", execID);
  message.data("taskID", jobReqId);
  logger.info("End to get an execID: {}, taskID: {}", execID, jobReqId);
  return message;
}
The most important line in the method above is:
Job job = entranceServer.execute(json);
This line returns a Job object. The Job's getId method returns the job's id; in this run the id was 27, matching the taskID in the response JSON shown earlier. That id is used afterwards to track the job's execution. There is also the globally unique execID, which can likewise be used to follow the task's progress; see the ZuulEntranceUtils.generateExecID call in the code above for how it is constructed, which we won't expand on here.
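As an aside on the execID format: lining the returned value up against the generateExecID arguments (job id, application name, instance) suggests a length-prefixed layout of "exec_id", two zero-padded 3-digit lengths, then the application name, the instance, and the scheduler job id. This is inferred purely from the one sample value above, not from reading the ZuulEntranceUtils source, so treat the parser below as an illustration only:

```java
public class ExecIdDemo {

    // Assumed layout (inferred from a single observed value, NOT from Linkis source):
    // "exec_id" + 3-digit appName length + 3-digit instance length + appName + instance + jobId
    static String[] parse(String execId) {
        String rest = execId.substring("exec_id".length());
        int appLen = Integer.parseInt(rest.substring(0, 3));   // e.g. "018" -> 18
        int instLen = Integer.parseInt(rest.substring(3, 6));  // e.g. "014" -> 14
        String appName = rest.substring(6, 6 + appLen);
        String instance = rest.substring(6 + appLen, 6 + appLen + instLen);
        String jobId = rest.substring(6 + appLen + instLen);
        return new String[] {appName, instance, jobId};
    }

    public static void main(String[] args) {
        String[] parts = parse("exec_id018014linkis-cg-entrancelocalhost:9140IDE_root_hive_2");
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
    }
}
```

If this guess is right, the instance segment also tells you which linkis-cg-entrance node accepted the job.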
The line calls the execute method of the EntranceServer class, which is a Scala class. Its execute method is:
/**
 * Execute a task and return a Job.
 *
 * @param params
 * @return Job
 */
def execute(params: java.util.Map[String, AnyRef]): Job = {
  if (!params.containsKey(EntranceServer.DO_NOT_PRINT_PARAMS_LOG)) {
    logger.debug("received a request: " + params)
  } else params.remove(EntranceServer.DO_NOT_PRINT_PARAMS_LOG)
  var jobRequest = getEntranceContext.getOrCreateEntranceParser().parseToTask(params)
  // todo: multi entrance instances
  jobRequest.setInstances(Sender.getThisInstance)
  Utils.tryAndWarn(CSEntranceHelper.resetCreator(jobRequest))
  // After parsing the map into a jobRequest, we need to store it in the database,
  // so that the jobRequest gets a unique taskID.
  getEntranceContext
    .getOrCreatePersistenceManager()
    .createPersistenceEngine()
    .persist(jobRequest)
  if (null == jobRequest.getId || jobRequest.getId <= 0) {
    throw new EntranceErrorException(
      PERSIST_JOBREQUEST_ERROR.getErrorCode,
      PERSIST_JOBREQUEST_ERROR.getErrorDesc
    )
  }
  logger.info(s"received a request,convert $jobRequest")
  LoggerUtils.setJobIdMDC(jobRequest.getId.toString)
  val logAppender = new java.lang.StringBuilder()
  Utils.tryThrow(
    getEntranceContext
      .getOrCreateEntranceInterceptors()
      .foreach(int => jobRequest = int.apply(jobRequest, logAppender))
  ) { t =>
    LoggerUtils.removeJobIdMDC()
    val error = t match {
      case error: ErrorException => error
      case t1: Throwable =>
        val exception = new EntranceErrorException(
          FAILED_ANALYSIS_TASK.getErrorCode,
          MessageFormat.format(
            FAILED_ANALYSIS_TASK.getErrorDesc,
            ExceptionUtils.getRootCauseMessage(t)
          )
        )
        exception.initCause(t1)
        exception
      case _ =>
        new EntranceErrorException(
          FAILED_ANALYSIS_TASK.getErrorCode,
          MessageFormat.format(
            FAILED_ANALYSIS_TASK.getErrorDesc,
            ExceptionUtils.getRootCauseMessage(t)
          )
        )
    }
    jobRequest match {
      case t: JobRequest =>
        t.setErrorCode(error.getErrCode)
        t.setErrorDesc(error.getDesc)
        t.setStatus(SchedulerEventState.Failed.toString)
        t.setProgress(EntranceJob.JOB_COMPLETED_PROGRESS.toString)
        val infoMap = new util.HashMap[String, AnyRef]
        infoMap.put(TaskConstant.ENGINE_INSTANCE, "NULL")
        infoMap.put(TaskConstant.TICKET_ID, "")
        infoMap.put("message", "Task interception failed and cannot be retried")
        JobHistoryHelper.updateJobRequestMetrics(jobRequest, null, infoMap)
      case _ =>
    }
    getEntranceContext
      .getOrCreatePersistenceManager()
      .createPersistenceEngine()
      .updateIfNeeded(jobRequest)
    error
  }
  val job = getEntranceContext.getOrCreateEntranceParser().parseToJob(jobRequest)
  Utils.tryThrow {
    job.init()
    job.setLogListener(getEntranceContext.getOrCreateLogManager())
    job.setProgressListener(getEntranceContext.getOrCreatePersistenceManager())
    job.setJobListener(getEntranceContext.getOrCreatePersistenceManager())
    job match {
      case entranceJob: EntranceJob =>
        entranceJob.setEntranceListenerBus(getEntranceContext.getOrCreateEventListenerBus)
      case _ =>
    }
    Utils.tryCatch {
      if (logAppender.length() > 0) {
        job.getLogListener.foreach(_.onLogUpdate(job, logAppender.toString.trim))
      }
    } { t =>
      logger.error("Failed to write init log, reason: ", t)
    }
    /**
     * job.afterStateChanged() method is only called in job.run(), and job.run() is called only
     * after job is scheduled so it suggest that we lack a hook for job init, currently we call
     * this to trigger JobListener.onJobinit()
     */
    Utils.tryAndWarn(job.getJobListener.foreach(_.onJobInited(job)))
    getEntranceContext.getOrCreateScheduler().submit(job)
    val msg = LogUtils.generateInfo(
      s"Job with jobId : ${jobRequest.getId} and execID : ${job.getId()} submitted "
    )
    logger.info(msg)
    job match {
      case entranceJob: EntranceJob =>
        entranceJob.getJobRequest.setReqId(job.getId())
        if (jobTimeoutManager.timeoutCheck && JobTimeoutManager.hasTimeoutLabel(entranceJob)) {
          jobTimeoutManager.add(job.getId(), entranceJob)
        }
        entranceJob.getLogListener.foreach(_.onLogUpdate(entranceJob, msg))
      case _ =>
    }
    LoggerUtils.removeJobIdMDC()
    job
  } { t =>
    LoggerUtils.removeJobIdMDC()
    job.onFailure("Submitting the query failed!(提交查询失败!)", t)
    val _jobRequest: JobRequest =
      getEntranceContext.getOrCreateEntranceParser().parseToJobRequest(job)
    getEntranceContext
      .getOrCreatePersistenceManager()
      .createPersistenceEngine()
      .updateIfNeeded(_jobRequest)
    t match {
      case e: LinkisException => e
      case e: LinkisRuntimeException => e
      case t: Throwable =>
        new SubmitFailedException(
          SUBMITTING_QUERY_FAILED.getErrorCode,
          SUBMITTING_QUERY_FAILED.getErrorDesc + ExceptionUtils.getRootCauseMessage(t),
          t
        )
    }
  }
}
The key line of this method is:
getEntranceContext.getOrCreateScheduler().submit(job)
This first obtains a Scheduler. Scheduler is an interface with three implementations (see the class hierarchy):
AbstractScheduler is an abstract class, extended by both FIFOScheduler and ParallelScheduler.
The submit call above resolves to AbstractScheduler.submit(), because neither FIFOScheduler nor ParallelScheduler overrides submit.
So let's see what AbstractScheduler.submit() does:
override def submit(event: SchedulerEvent): Unit = {
  val group = getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(event)
  val consumer =
    getSchedulerContext.getOrCreateConsumerManager.getOrCreateConsumer(group.getGroupName)
  val index = consumer.getConsumeQueue.offer(event)
  index.map(getEventId(_, group.getGroupName)).foreach(event.setId)
  if (index.isEmpty) {
    throw new SchedulerErrorException(
      JOB_QUEUE_IS_FULL.getErrorCode,
      JOB_QUEUE_IS_FULL.getErrorDesc
    )
  }
}
First, a group is created according to the event, i.e. per job type, and a consumer is created for each group; then the consumer queue's offer method is called.
put() would suspend the calling thread if the queue were full, whereas offer() simply reports failure (here by returning None) when the queue is full.
Each groupName corresponds to one consumer, which holds the jobs of the same tenant and type.
consumer.getConsumeQueue.offer(event)
getConsumeQueue here returns a ConsumeQueue. ConsumeQueue is an interface with exactly one implementation, LoopArrayQueue, whose offer method is:
override def offer(event: SchedulerEvent): Option[Int] = {
  var index = -1
  writeLock synchronized {
    if (isFull) return None
    else {
      index = add(event)
    }
  }
  readLock synchronized { readLock.notify() }
  Some(index)
}
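The offer-versus-put distinction is easiest to see with the JDK's own bounded queue; LoopArrayQueue's offer behaves like the non-blocking variant, except that it signals a full queue with None instead of false:

```java
import java.util.concurrent.ArrayBlockingQueue;

public class OfferDemo {
    public static void main(String[] args) throws InterruptedException {
        // Bounded queue with capacity 1, standing in for the consume queue.
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(1);
        System.out.println(queue.offer("job-1")); // accepted -> true
        System.out.println(queue.offer("job-2")); // queue full -> false, caller decides what to do
        // queue.put("job-2") would instead block this thread until space frees up
        queue.take();                             // consume one element
        System.out.println(queue.offer("job-2")); // accepted again -> true
    }
}
```

This is why AbstractScheduler.submit() can throw JOB_QUEUE_IS_FULL immediately rather than hanging the caller: offer reports fullness instead of waiting.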
Next, go back to how AbstractScheduler.submit() obtains the consumer:
val consumer = getSchedulerContext.getOrCreateConsumerManager.getOrCreateConsumer(group.getGroupName)
getOrCreateConsumerManager here resolves to FIFOSchedulerContextImpl's implementation:
override def getOrCreateConsumerManager: ConsumerManager = {
  if (consumerManager != null) return consumerManager
  lock.synchronized {
    if (consumerManager == null) {
      consumerManager = createConsumerManager()
      consumerManager.setSchedulerContext(this)
    }
  }
  consumerManager
}
Note the call to consumerManager.setSchedulerContext(this); stepping into it:
override def setSchedulerContext(schedulerContext: SchedulerContext): Unit = {
  super.setSchedulerContext(schedulerContext)
  group = getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(null)
  executorService = group match {
    case g: FIFOGroup =>
      Utils.newCachedThreadPool(g.getMaxRunningJobs + 2, groupName + "-Thread-")
    case _ =>
      throw new SchedulerErrorException(
        NEED_SUPPORTED_GROUP.getErrorCode,
        MessageFormat.format(NEED_SUPPORTED_GROUP.getErrorDesc, group.getClass)
      )
  }
  consumerQueue = new LoopArrayQueue(
    getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(null)
  )
  consumer = createConsumer(groupName)
}
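Utils.newCachedThreadPool is a Linkis helper. Assuming from its name and arguments that it builds a bounded pool whose threads carry the group name as a prefix, a rough JDK-only equivalent might look like this (the sizing policy and naming here are my assumption, not the actual implementation):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.SynchronousQueue;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolDemo {

    // A cached-style pool with an upper bound and named threads,
    // sketching what Utils.newCachedThreadPool(maxRunningJobs + 2, prefix) might do.
    static ExecutorService newNamedPool(int maxThreads, String prefix) {
        AtomicInteger counter = new AtomicInteger(1);
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, prefix + counter.getAndIncrement());
            t.setDaemon(true);
            return t;
        };
        return new ThreadPoolExecutor(0, maxThreads, 60L, TimeUnit.SECONDS,
                new SynchronousQueue<>(), factory);
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = newNamedPool(5 + 2, "IDE-Thread-");
        // The task simply reports which pool thread ran it.
        String name = pool.submit(() -> Thread.currentThread().getName()).get();
        System.out.println(name);
        pool.shutdown();
    }
}
```

The thread-name prefix matters operationally: when you jstack an entrance process, the group name tells you which tenant's consumer a thread belongs to.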
Now look at consumer = createConsumer(groupName); stepping into createConsumer:
override protected def createConsumer(groupName: String): Consumer = {
  val group = getSchedulerContext.getOrCreateGroupFactory.getOrCreateGroup(null)
  val consumer = new FIFOUserConsumer(getSchedulerContext, getOrCreateExecutorService, group)
  consumer.setGroup(group)
  consumer.setConsumeQueue(consumerQueue)
  if (consumerListener != null) consumerListener.onConsumerCreated(consumer)
  consumer.start()
  consumer
}
Here consumer.start() is invoked. Stepping into start():
def start(): Unit = {
  future = executeService.submit(this)
  bdpFutureTask = new BDPFutureTask(this.future)
}
executeService.submit(this) hands the current object (the FIFOUserConsumer) to the thread pool; once it is scheduled, the consumer's run method is invoked:
override def run(): Unit = {
  Thread.currentThread().setName(s"${toString}Thread")
  logger.info(s"$toString thread started!")
  while (!terminate) {
    Utils.tryAndError(loop())
    Utils.tryAndError(Thread.sleep(10))
  }
  logger.info(s"$toString thread stopped!")
}
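The run()/loop() pair is a classic single-threaded consumer. A stripped-down sketch of the same pattern, with a plain BlockingQueue standing in for the ConsumeQueue and a list standing in for actual job execution:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class ConsumerDemo implements Runnable {
    final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    final List<String> consumed = new CopyOnWriteArrayList<>();
    volatile boolean terminate = false;

    @Override
    public void run() {
        while (!terminate) {
            try {
                loop(); // one iteration, mirroring FIFOUserConsumer.loop()
            } catch (Exception e) {
                e.printStackTrace(); // Utils.tryAndError: log and keep looping
            }
        }
    }

    void loop() throws InterruptedException {
        String event = queue.poll(10, TimeUnit.MILLISECONDS);
        if (event != null) consumed.add(event); // stand-in for "execute the job"
    }

    public static void main(String[] args) throws Exception {
        ConsumerDemo consumer = new ConsumerDemo();
        Thread t = new Thread(consumer, "FIFOUserConsumerThread");
        t.start();
        consumer.queue.offer("job-27");   // a producer (the REST thread) offers an event
        Thread.sleep(100);
        consumer.terminate = true;
        t.join();
        System.out.println(consumer.consumed);
    }
}
```

The key design point, as in Linkis, is that the exception handling lives around loop(): one bad job must never kill the consumer thread for the whole tenant.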
This brings us to the most important part: loop(). run() keeps invoking loop() for as long as the consumer is alive; loop() itself looks like this:
protected def loop(): Unit = {
  var isRetryJob = false
  def getWaitForRetryEvent: Option[SchedulerEvent] = {
    val waitForRetryJobs = runningJobs.filter(job => job != null && job.isJobCanRetry)
    waitForRetryJobs.find { job =>
      isRetryJob = Utils.tryCatch(job.turnToRetry()) { t =>
        job.onFailure(
          "Job state flipped to Scheduled failed in Retry(Retry时,job状态翻转为Scheduled失败)!",
          t
        )
        false
      }
      isRetryJob
    }
  }
  var event: Option[SchedulerEvent] = getWaitForRetryEvent
  if (event.isEmpty) {
    val completedNums = runningJobs.filter(job => job == null || job.isCompleted)
    if (completedNums.length < 1) {
      Utils.tryQuietly(Thread.sleep(1000)) // TODO could be optimized further by implementing a JobListener
      return
    }
    while (event.isEmpty) {
      val takeEvent = if (getRunningEvents.isEmpty) Option(queue.take()) else queue.take(3000)
      event =
        if (
          takeEvent.exists(e =>
            Utils.tryCatch(e.turnToScheduled()) { t =>
              takeEvent.get.asInstanceOf[Job].onFailure("Job状态翻转为Scheduled失败!", t)
              false
            }
          )
        ) {
          takeEvent
        } else getWaitForRetryEvent
    }
  }
  event.foreach { case job: Job =>
    Utils.tryCatch {
      val (totalDuration, askDuration) =
        (fifoGroup.getMaxAskExecutorDuration, fifoGroup.getAskExecutorInterval)
      var executor: Option[Executor] = None
      job.consumerFuture = bdpFutureTask
      Utils.waitUntil(
        () => {
          executor = Utils.tryCatch(
            schedulerContext.getOrCreateExecutorManager.askExecutor(job, askDuration)
          ) {
            case warn: WarnException =>
              job.getLogListener.foreach(_.onLogUpdate(job, LogUtils.generateWarn(warn.getDesc)))
              None
            case e: ErrorException =>
              job.getLogListener.foreach(_.onLogUpdate(job, LogUtils.generateERROR(e.getMessage)))
              throw e
            case error: Throwable =>
              job.getLogListener.foreach(
                _.onLogUpdate(job, LogUtils.generateERROR(error.getMessage))
              )
              throw error
          }
          Utils.tryQuietly(askExecutorGap())
          executor.isDefined
        },
        totalDuration
      )
      job.consumerFuture = null
      executor.foreach { executor =>
        job.setExecutor(executor)
        job.future = executeService.submit(job)
        job.getJobDaemon.foreach(jobDaemon => jobDaemon.future = executeService.submit(jobDaemon))
        if (!isRetryJob) putToRunningJobs(job)
      }
    } {
      case _: TimeoutException =>
        logger.warn(s"Ask executor for Job $job timeout!")
        job.onFailure(
          "The request engine times out (请求引擎超时,可能是EngineConnManager 启动EngineConn失败导致,可以去查看看EngineConnManager的linkis.out和linkis.log日志).",
          new SchedulerErrorException(
            REQUEST_ENGINE_TIME_OUT.getErrorCode,
            REQUEST_ENGINE_TIME_OUT.getErrorDesc
          )
        )
      case error: Throwable =>
        job.onFailure("请求引擎失败,可能是由于后台进程错误!请联系管理员", error)
        if (job.isWaitForRetry) {
          logger.warn(s"Ask executor for Job $job failed, wait for the next retry!", error)
          if (!isRetryJob) putToRunningJobs(job)
        } else logger.warn(s"Ask executor for Job $job failed!", error)
    }
  }
}
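Utils.waitUntil above keeps re-evaluating a condition (here: "did askExecutor hand back an executor?") until it holds or totalDuration elapses, and a TimeoutException lands in the failure branch. A hedged sketch of such a helper (the real Linkis Utils may differ in signature and details):

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class WaitUntilDemo {

    // Poll a condition until it becomes true or the total duration elapses.
    static void waitUntil(BooleanSupplier condition, Duration total)
            throws TimeoutException, InterruptedException {
        long deadline = System.nanoTime() + total.toNanos();
        while (!condition.getAsBoolean()) {   // e.g. "executor.isDefined"
            if (System.nanoTime() > deadline) {
                throw new TimeoutException("condition not met within " + total);
            }
            Thread.sleep(10);                 // askExecutorGap() plays this role in loop()
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Succeeds on the third poll, like an executor becoming available after retries.
        waitUntil(() -> ++attempts[0] >= 3, Duration.ofSeconds(1));
        System.out.println("executor acquired after " + attempts[0] + " attempts");
    }
}
```

Mapped back to loop(): the condition is one askExecutor attempt bounded by askDuration, and totalDuration (the group's MaxAskExecutorDuration) caps the whole retry budget before the job is failed with REQUEST_ENGINE_TIME_OUT.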
Inside loop(), the key call is:
schedulerContext.getOrCreateExecutorManager.askExecutor(job, askDuration)
getOrCreateExecutorManager actually returns an EntranceExecutorManager (an abstract class), and askExecutor(job, askDuration) ends up in its askExecutor() method, shown below:
override def askExecutor(schedulerEvent: SchedulerEvent): Option[Executor] =
  schedulerEvent match {
    case job: Job =>
      Option(createExecutor(job))
  }
That method in turn calls createExecutor(); stepping in:
override protected def createExecutor(schedulerEvent: SchedulerEvent): EntranceExecutor =
  schedulerEvent match {
    case job: EntranceJob =>
      job.getJobRequest match {
        case jobReq: JobRequest =>
          val entranceEntranceExecutor =
            new DefaultEntranceExecutor(jobReq.getId)
          // getEngineConn Executor
          job.getLogListener.foreach(
            _.onLogUpdate(
              job,
              LogUtils.generateInfo("Your job is being scheduled by orchestrator.")
            )
          )
          jobReq.setUpdatedTime(new Date(System.currentTimeMillis()))
          /**
           * // val engineConnExecutor = engineConnManager.getAvailableEngineConnExecutor(mark)
           * idToEngines.put(entranceEntranceExecutor.getId, entranceEntranceExecutor)
           */
          // instanceToEngines.put(engineConnExecutor.getServiceInstance.getInstance, entranceEntranceExecutor) // todo
          // entranceEntranceExecutor.setInterceptors(getOrCreateInterceptors()) // todo
          entranceEntranceExecutor
        case _ =>
          throw new EntranceErrorException(
            NOT_CREATE_EXECUTOR.getErrorCode,
            NOT_CREATE_EXECUTOR.getErrorDesc
          )
      }
    case _ =>
      throw new EntranceErrorException(
        ENTRA_NOT_CREATE_EXECUTOR.getErrorCode,
        ENTRA_NOT_CREATE_EXECUTOR.getErrorDesc
      )
  }
The key lines here are the ones creating the executor:
val entranceEntranceExecutor = new DefaultEntranceExecutor(jobReq.getId)
This creates a DefaultEntranceExecutor. Its two important methods are callExecute and requestToComputationJobReq. When the job is picked up and run by the thread pool, Job.run() is invoked, and run() contains this line:
val rs = Utils.tryCatch(executor.execute(jobToExecuteRequest))
This calls EntranceExecutor's execute method, which in turn invokes DefaultEntranceExecutor's callExecute method. requestToComputationJobReq is mainly used to convert the Job into an executable ComputationJobReq.
The core orchestration is also initialized inside DefaultEntranceExecutor's callExecute method; from that point on, the orchestrator takes over scheduling.
That's all for this analysis; to be continued in a follow-up post.