Scheduling Apache Kylin build jobs with a shell script

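Background

When Kylin serves as the OLAP query engine, cubes need to be built automatically: whenever incremental data arrives, the scheduling system triggers a Kylin build job, so that the data-warehouse computation and the Kylin cube build run as one chained pipeline.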

API overview

Kylin build jobs are triggered by calling Kylin's REST API. The API can be called from Java or Python code, or directly from a shell script.

Three endpoints are needed here.

  1. http://your-kylin-host:7070/kylin/api/user/authentication

User authentication endpoint; verifies the Kylin user's credentials.

  2. http://your-kylin-host:7070/kylin/api/cubes/${cubeName}/rebuild

Cube build endpoint; triggers a build job for the Kylin cube.

  3. http://your-kylin-host:7070/kylin/api/jobs/${id}

Job status endpoint; looks up a job by its jobId and is used to monitor the cube build.
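
As a rough sketch of how the three endpoints fit together from a shell, the helper below builds the HTTP Basic header that every request carries. `KYLIN_BASE`, `basic_auth_header`, the credentials, the cube name `my_cube`, and the `startMs`/`endMs` variables are placeholders introduced here for illustration, and the `curl` calls are left commented out because they need a reachable Kylin instance.

```shell
#!/bin/bash
# Hypothetical base URL; replace your-kylin-host with a real host.
KYLIN_BASE="http://your-kylin-host:7070/kylin/api"

# Print the HTTP Basic Authorization header for a user/password pair.
basic_auth_header() {
  local user=$1 pass=$2
  # tr removes the line wraps that base64 inserts into long output
  printf 'Authorization: Basic %s\n' "$(printf '%s' "$user:$pass" | base64 | tr -d '\n')"
}

header=$(basic_auth_header "some_user" "some_password")   # placeholder credentials
echo "$header"

# 1. authenticate
# curl -s -X POST -H "$header" "$KYLIN_BASE/user/authentication"
# 2. trigger a build (startMs and endMs are epoch milliseconds)
# curl -s -X PUT -H "$header" -H "Content-Type: application/json;charset=UTF-8" \
#      -d '{"startTime":'"$startMs"',"endTime":'"$endMs"',"buildType":"BUILD"}' \
#      "$KYLIN_BASE/cubes/my_cube/rebuild"
# 3. query a job by its id
# curl -s -X GET -H "$header" "$KYLIN_BASE/jobs/<job-id>"
```

Passing the credentials as a separate `printf` argument, rather than inside the format string, keeps a `%` in a password from being read as a format directive.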

Script implementation

Below is the kylin cube build script I wrote.

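#!/bin/bash
##******************************************************************************
## **  Description: build a kylin cube
## **
## **  Arguments: 2 to 4
## **    1. (required) Build date: a day, a month, or the string null. A day builds a daily cube,
## **       a month builds a monthly cube, and null builds a cube without a date partition.
## **       Each run builds one day of a daily cube, one month of a monthly cube, or one unpartitioned cube.
## **       Examples: 20200101, 202001, or the string null
## **    2. (required) Cube name.
## **    3. (optional) Hive table name, used to count the rows to be built.
## **    4. (optional) Name of the hive table's date partition column, used when counting the rows to be built;
## **       not needed for a full build or when the hive table name is empty. Defaults to dt.
## *****************************************************************************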

source /etc/profile
source ~/.bashrc


username="enter the kylin user name here"
password="enter the kylin password here"
cubeName=$2
echo "【cube name is $2】"

# Lowercase the first argument so that NULL, Null and null are treated alike
dateParam=$(echo "$1" | tr 'A-Z' 'a-z')
startTime="123"
endTime="123"
if [[ ${#dateParam} -eq 6 ]]; then
  endTime=${dateParam}"01"
  startTime=$(date -d "$endTime-1 days" +%Y%m%d)
  echo "【build month cube】"
elif [[ ${#dateParam} -eq 8 ]]; then
  startTime=${dateParam}
  endTime=$(date -d "$startTime+1 days" +%Y%m%d)
  echo "【build day cube】"
elif [[ "$dateParam" == "null" ]]; then
  startTime=
  endTime=
  echo "【build whole quantity cube】"
else
  echo "【dateParam input error】"
  exit 1
fi

tableName=$3
dateField=$4
countSql=''
if [[ ${#tableName} -ge 1 ]]; then
    echo "【hive table name is ${tableName}】"
    if [[ ${#dateField} -ge 1 ]]; then
        echo "【date field is ${dateField}】"
        
    else
        echo "【date field is null,use 'dt'】"
        dateField='dt'
    fi
    if [[ "$dateParam" == "null" ]]; then
        countSql="set hive.cli.print.header=false; select count(1) from ${tableName}"
    else
        countSql="set hive.cli.print.header=false; select count(1) from ${tableName} where ${dateField} >='${startTime}' and ${dateField} < '${endTime}'"
    fi
else
    echo "【hive table name is null】"
fi
if [[ ${#countSql} -ge 5 ]]; then
    echo "================================================================="
    echo "【count sql is:  ${countSql} 】"
    rowNum=`hive -e " ${countSql} "`
    echo "================================================================="
    echo "【the amount of data to build is ${rowNum}】"
    if [[ ${rowNum} -le 0 ]]; then
        echo "================================================================="
        echo "【don't need to build】"
        exit 0
    fi
fi

# Authenticate the user name and password
function auth() {
  username=$1
  password=$2

  # Pass the credentials as an argument so a % in them is not read as a format directive
  base64Encryption=$(printf '%s' "$username:$password" | base64)
  authentication=$(curl -s -X POST -H "Authorization: Basic $base64Encryption" -H "Content-Type: application/json;charset=UTF-8" http://your-kylin-host:7070/kylin/api/user/authentication)

  if [[ $authentication =~ "Unauthorized" ]]; then
    echo "Authentication failure: user name or password wrong"
    exit 1
  fi
}

# Build the cube and print the id of the submitted job
function build() {
  cubeName=$1
  startTime=$2
  endTime=$3

  startTimeTimestamp=$(date -d "$startTime 00:00:00" +%s)
  endTimeTimestamp=$(date -d "$endTime 00:00:00" +%s)
  # Shift local midnight to UTC midnight; this assumes the host runs in GMT+8
  GMT8=$((8 * 60 * 60 * 1000))
  kylinStartTime=$((startTimeTimestamp * 1000 + GMT8))
  kylinEndTime=$((endTimeTimestamp * 1000 + GMT8))
  buildInfo=$(curl -s -X PUT -H "Authorization: Basic $base64Encryption" -H "Content-Type: application/json;charset=UTF-8" -d '{"startTime":'$kylinStartTime', "endTime":'$kylinEndTime', "buildType":"BUILD"}' http://your-kylin-host:7070/kylin/api/cubes/${cubeName}/rebuild)

  uuid=$(echo "$buildInfo" | grep -oP '(?<={"uuid":").*(?=","last_modified")')
  echo "$uuid"
}

# Query the job status by job id
function getJobInfo() {
  uuid=$1

  jobInfo=$(curl -s -X GET -H "Authorization: Basic $base64Encryption" -H "Content-Type: application/json;charset=UTF-8" http://your-kylin-host:7070/kylin/api/jobs/$uuid)
  jobStatus=$(echo "$jobInfo" | grep -oP '(?<="job_status":").*(?=","progress")')
  progress=$(echo "$jobInfo" | grep -oP '(?<="progress":).*(?=})')
  echo "jobStatus:$jobStatus progress:$progress"
}

# Main flow. bash only knows a function once its definition has been read,
# so the calls below come after the three definitions above.
now=$(date "+%Y-%m-%d %H:%M:%S")

auth "$username" "$password"
jobid=$(build "$cubeName" "$startTime" "$endTime")
echo "================================================================="
echo "【$now build $startTime - $endTime jobId is $jobid】"
echo "================================================================="
if [[ ${#jobid} -lt 5 ]]; then
  echo "【jobid is not as expected,maybe the segment has been merged,the merged segment needs to be built manually】"
  exit 1
fi
# FLAG counts failed status fetches; the loop is left through the exit calls below
FLAG=0
while ((FLAG != -1)); do
  jobInfo=$(getJobInfo "$jobid")
  echo "================================================================="
  now_time=$(date "+%Y-%m-%d %H:%M:%S")
  if [[ $jobInfo =~ "FINISHED" ]]; then
    echo "$now_time 【build succeeded】"
    exit 0
  elif [[ $jobInfo =~ "ERROR" ]]; then
    echo "$now_time 【build failed】"
    exit 1
  elif [[ $jobInfo =~ "STOPPED" ]]; then
    echo "$now_time 【build stopped,it may be manually operated】"
    exit 1
  elif [[ $jobInfo =~ "DISCARDED" ]]; then
    echo "$now_time 【build discarded,it may be manually operated】"
    exit 1
  elif [[ $jobInfo =~ "PENDING" ]]; then
    echo "$now_time 【job status is PENDING, please wait while building...】"
    sleep 60
  elif [[ $jobInfo =~ "RUNNING" ]]; then
    echo "$now_time 【job status is RUNNING, please wait while building...】"
    sleep 60
  else
    if [[ "$FLAG" -lt 10 ]]; then
      echo "$now_time 【failed to get job status, the program exits automatically after 10 failed fetches】"
      FLAG=$((FLAG + 1))
      sleep 60
    else
      echo "$now_time 【failed to obtain the job status 10 times in total, the program quits】"
      exit 1
    fi
  fi
  echo "================================================================="
done
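
The `build` function adds a fixed eight hours to a local-time epoch, which only lines up with UTC midnight when the host runs in UTC+8. A timezone-independent variant, sketched here under the assumption that GNU `date` is available and that segment boundaries are meant to fall on UTC midnight, asks `date` for UTC directly. `to_kylin_millis` is a name made up for this example.

```shell
#!/bin/bash
# Convert a yyyymmdd date to epoch milliseconds at UTC midnight.
# Assumes GNU date (the -d option); to_kylin_millis is a hypothetical helper.
to_kylin_millis() {
  local day=$1 seconds
  seconds=$(date -u -d "$day 00:00:00" +%s) || return 1
  echo $((seconds * 1000))
}

start_ms=$(to_kylin_millis 20200101)
end_ms=$(to_kylin_millis 20200102)

# Request body in the same shape the build function sends
echo "{\"startTime\":$start_ms, \"endTime\":$end_ms, \"buildType\":\"BUILD\"}"
```

With this approach the result no longer depends on the `TZ` setting of the machine that runs the script.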

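Usage

  1. Create a new shell script file named kylin_build.sh. The name kylin_build is only an example; any name works.
  2. Copy the script above into kylin_build.sh, change the user name, password, and your-kylin-host in the script, then save it.
  3. Run the script to trigger a build job. Use bash rather than sh, because the script relies on bash-only syntax such as [[ ]]:

bash kylin_build.sh <date (required)> <cube name (required)> <hive table name (optional)> <hive date partition column (optional)>

  4. Wait for the build to finish.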
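
While the script waits, it polls the job endpoint once a minute and matches the response against status names. The status extraction can be tried offline as sketched below. The JSON string is a made-up fragment shaped only after the fields the script's own regular expressions expect (`uuid`, `last_modified`, `job_status`, `progress`), not a captured Kylin response, and `job_status_of` is a hypothetical helper. GNU `grep` with `-P` support is assumed, as in the script itself.

```shell
#!/bin/bash
# Extract the value of "job_status" from a JSON string (hypothetical helper).
# The character class stops at the closing quote, so the match cannot run on
# into later fields the way a greedy .* can.
job_status_of() {
  printf '%s' "$1" | grep -oP '(?<="job_status":")[^"]*'
}

# Made-up sample, for illustration only.
sample='{"uuid":"00000000-0000-0000-0000-000000000000","last_modified":0,"job_status":"RUNNING","progress":42.5}'

status=$(job_status_of "$sample")
case "$status" in
  FINISHED)                 echo "build succeeded" ;;
  ERROR|STOPPED|DISCARDED)  echo "build ended with status $status" ;;
  PENDING|RUNNING)          echo "still building ($status)" ;;
  *)                        echo "unknown status: $status" ;;
esac
```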

Further tips

The script can be scheduled with the Apache DolphinScheduler scheduling engine, or with any other scheduler.
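
Whatever scheduler is used, the task it runs usually only needs to compute the date to build and hand the script's exit code back, since kylin_build.sh exits with 0 on success and 1 on failure. A minimal wrapper might look like this; `previous_day`, the cube name `my_cube`, and the table name `my_table` are placeholders, and GNU `date` is assumed.

```shell
#!/bin/bash
# Print the day before a given yyyymmdd date (defaults to today).
# previous_day is a hypothetical helper; it relies on GNU date's -d option.
previous_day() {
  local day=${1:-$(date +%Y%m%d)}
  date -d "$day -1 day" +%Y%m%d
}

build_date=$(previous_day)
echo "would run: bash kylin_build.sh $build_date my_cube my_table dt"

# In a real scheduler task, run the script and propagate its exit code
# so the scheduler can mark the task as failed and retry or alert:
# bash kylin_build.sh "$build_date" my_cube my_table dt
# exit $?
```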