[Kettle] Scheduling Kettle jobs on Linux


1. Deployment prerequisites

Kettle is written in Java, so a JDK environment is required.

vi /etc/profile                 ## append the three export lines below
export JAVA_HOME=/usr/java/jre1.8.0_45
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
source /etc/profile             ## apply the changes to the current shell
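
Before moving on, it can help to confirm that JAVA_HOME really points at a Java installation. The following is a minimal sketch of such a check; `check_java_home` is a helper name introduced here for illustration and is not part of Kettle or the JDK.

```shell
#!/bin/bash
# Return 0 if the given directory looks like a usable Java home,
# i.e. it contains an executable bin/java.
check_java_home() {
    local java_home="$1"
    [ -n "$java_home" ] && [ -x "$java_home/bin/java" ]
}

if check_java_home "${JAVA_HOME:-}"; then
    # Print the version of the Java binary that Kettle will pick up.
    "$JAVA_HOME/bin/java" -version
else
    echo "JAVA_HOME is unset or has no bin/java: '${JAVA_HOME:-}'" >&2
fi
```

If the check fails, re-open /etc/profile and correct the JAVA_HOME path before installing Kettle.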

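2. Deploying Kettle

Download the installation package from the official site or a domestic mirror and place it in the target directory on the Linux host.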

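unzip pdi-ce-7.1.0.0-12.zip     ## unpack the archive
cd data-integration             ## Kettle root directory
chmod +x *.sh                   ## make the scripts executable
./kitchen.sh                    ## run without arguments to check that it starts

The check prints a warning: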

#######################################################################
WARNING:  no libwebkitgtk-1.0 detected, some features will be unavailable
   Consider installing the package with apt-get or yum.
   e.g. 'sudo apt-get install libwebkitgtk-1.0-0'
#######################################################################
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
Options:
 -rep            = Repository name
 -user           = Repository username
 -pass           = Repository password
 -job            = The name of the job to launch
 -dir            = The directory (dont forget the leading /)
 -file           = The filename (Job XML) to launch
 -level          = The logging level (Basic, Detailed, Debug, Rowlevel, Error, Minimal, Nothing)
 -logfile        = The logging file to write to
 -listdir        = List the directories in the repository
 -listjobs       = List the jobs in the specified directory
 -listrep        = List the available repositories
 -norep          = Do not log into the repository
 -version        = show the version, revision and build date
 -param          = Set a named parameter <NAME>=<VALUE>. For example -param:FILE=customers.csv
 -listparam      = List information concerning the defined parameters in the specified job.
 -export         = Exports all linked resources of the specified job. The argument is the name of a ZIP file.
 -custom         = Set a custom plugin specific option as a String value in the job using <NAME>=<Value>, for example: -custom:COLOR=Red
 -maxloglines    = The maximum number of log lines that are kept internally by Kettle. Set to 0 to keep all rows (default)
 -maxlogtimeout  = The maximum age (in minutes) of a log line while being kept internally by Kettle. Set to 0 to keep all rows indefinitely (default)

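According to the official documentation, Kettle expects libwebkitgtk to be present; installing it removes the warning:

 sudo apt-get install libwebkitgtk-1.0-0
 ./kitchen.sh   ## check again that Kettle starts cleanly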
OpenJDK 64-Bit Server VM warning: ignoring option MaxPermSize=256m; support was removed in 8.0
Options:
 -rep            = Repository name
 -user           = Repository username
 -pass           = Repository password
 -job            = The name of the job to launch
 -dir            = The directory (dont forget the leading /)
 -file           = The filename (Job XML) to launch
 -level          = The logging level (Basic, Detailed, Debug, Rowlevel, Error, Minimal, Nothing)
 -logfile        = The logging file to write to
 -listdir        = List the directories in the repository
 -listjobs       = List the jobs in the specified directory
 -listrep        = List the available repositories
 -norep          = Do not log into the repository
 -version        = show the version, revision and build date
 -param          = Set a named parameter <NAME>=<VALUE>. For example -param:FILE=customers.csv
 -listparam      = List information concerning the defined parameters in the specified job.
 -export         = Exports all linked resources of the specified job. The argument is the name of a ZIP file.
 -custom         = Set a custom plugin specific option as a String value in the job using <NAME>=<Value>, for example: -custom:COLOR=Red
 -maxloglines    = The maximum number of log lines that are kept internally by Kettle. Set to 0 to keep all rows (default)
 -maxlogtimeout  = The maximum age (in minutes) of a log line while being kept internally by Kettle. Set to 0 to keep all rows indefinitely (default)

3. Creating the script directories

mkdir -p /data/kettle/kettle_file/job      ## job files (.kjb)
mkdir /data/kettle/kettle_file/transition  ## transformation files (.ktr)
mkdir /data/kettle/kettle_sh               ## launcher scripts
mkdir /data/kettle/kettle_log              ## log files produced by Kettle runs

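Copy the .ktr and .kjb files built on Windows into the transition and job directories respectively.

Note: the paths stored in a .kjb file on Windows differ from those on Linux. Fix them before running the job on Linux, otherwise Kettle reports that the file cannot be found. Open the .kjb file as plain text and edit the paths in the fileName entries.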
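
For more than a couple of files, the path edit can be scripted. The sketch below assumes the paths live in `<filename>` elements of the job XML and that every Windows path shares one prefix; `fix_kjb_paths` and the `D:/etl` prefix are hypothetical names chosen for illustration, so check them against your own files first.

```shell
#!/bin/bash
# Print a copy of a .kjb/.ktr file with Windows paths converted for Linux.
#   $1  file to read
#   $2  Windows prefix, written with forward slashes (e.g. D:/etl)
#   $3  Linux prefix that replaces it
# Only lines containing <filename> are touched. The prefixes must not
# contain '|' or regular-expression metacharacters.
fix_kjb_paths() {
    local file="$1" win_prefix="$2" linux_prefix="$3"
    sed -e '/<filename>/ s|\\|/|g' \
        -e "/<filename>/ s|${win_prefix}|${linux_prefix}|" \
        "$file"
}

# Usage (hypothetical prefixes): write the converted job next to the original.
# fix_kjb_paths testJob.kjb 'D:/etl' '/data/kettle/kettle_file' > testJob.linux.kjb
```

The function writes to standard output rather than editing in place, so the original file stays available for comparison.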

  • Test a transformation
/home/kettle/data-integration/pan.sh -file=/home/kettle/workfile/kettleFile/test.ktr
  • Test a job
/home/kettle/data-integration/kitchen.sh -file=/home/kettle/workfile/kettleFile/testJob.kjb
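
When testing from the command line, the exit status of kitchen.sh says how the run ended. The helper below is a small sketch that turns the status into a message; the code meanings follow the list in the Pentaho Kitchen documentation and should be verified against the version you run. `describe_kitchen_status` is a name made up for this example.

```shell
#!/bin/bash
# Map a Kitchen exit status to a short description.
describe_kitchen_status() {
    case "$1" in
        0) echo "job ran without problems" ;;
        1) echo "errors occurred during processing" ;;
        2) echo "unexpected error during loading or running of the job" ;;
        7) echo "job could not be loaded from XML or the repository" ;;
        8) echo "error loading steps or plugins" ;;
        9) echo "command line usage printed" ;;
        *) echo "unknown status $1" ;;
    esac
}

# Usage: run the test job, then report how it ended.
# /home/kettle/data-integration/kitchen.sh -file=/home/kettle/workfile/kettleFile/testJob.kjb
# describe_kitchen_status "$?"
```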

See the official documentation for details.

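4. Setting up the scheduled task

  • Write the job script

Note: cron loads only /etc/environment, not /etc/profile or ~/.bash_profile, so the script has to set the environment variables it needs itself.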
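
To see the effect without waiting for cron to fire, a command can be run in a stripped-down environment. This is only an approximation of what cron provides (the variable set and the /usr/bin:/bin PATH are assumptions based on common cron defaults), and `run_like_cron` is a hypothetical helper name.

```shell
#!/bin/bash
# Run a command string with a minimal, cron-like environment:
# everything from the login shell is dropped, including JAVA_HOME.
run_like_cron() {
    env -i HOME="${HOME:-/}" LOGNAME="$(id -un)" SHELL=/bin/sh PATH=/usr/bin:/bin \
        /bin/sh -c "$1"
}

# JAVA_HOME exported in /etc/profile is not visible here.
run_like_cron 'echo "JAVA_HOME=${JAVA_HOME:-unset}"'
```

Running the job script itself through `run_like_cron` is a quick way to catch missing variables before adding it to the crontab.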

#!/bin/bash
cd /home/user/hzx/kettle/data-integration
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export CLASSPATH=.:$JAVA_HOME/lib/dt.jar:$JAVA_HOME/lib/tools.jar
export PATH=$JAVA_HOME/bin:$PATH
./kitchen.sh -file=/home/kettle/workfile/kettleFile/testJob.kjb -level=Basic >> /home/kettle/workfile/kettleLogs/testJob__$(date +%Y%m%d%H%M%S).log
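
A slightly more defensive variant of the script is sketched below. It creates the log directory if needed, captures stderr as well as stdout, and records failed runs in a separate file. The names `run_job`, `KITCHEN`, `LOG_DIR` and `failures.log` are introduced for this example, and calling kitchen.sh by absolute path instead of changing directory first is an assumption that should be tested on your installation.

```shell
#!/bin/bash
# Environment that cron does not provide (paths taken from the script above).
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export PATH=$JAVA_HOME/bin:$PATH

# Both locations can be overridden from the environment.
KITCHEN="${KITCHEN:-/home/user/hzx/kettle/data-integration/kitchen.sh}"
LOG_DIR="${LOG_DIR:-/home/kettle/workfile/kettleLogs}"

# Run one job file, log its output, and note the failure if it exits non-zero.
run_job() {
    local job_file="$1"
    local name log_file status
    name="$(basename "$job_file" .kjb)"
    mkdir -p "$LOG_DIR" || return 1
    log_file="$LOG_DIR/${name}__$(date +%Y%m%d%H%M%S).log"
    "$KITCHEN" -file="$job_file" -level=Basic >> "$log_file" 2>&1
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "$(date '+%F %T') $name failed with exit code $status" >> "$LOG_DIR/failures.log"
    fi
    return "$status"
}

# Run the job named on the command line, if any.
if [ "$#" -gt 0 ]; then
    run_job "$1"
fi
```

Saved as job.sh, it would be invoked as `job.sh /home/kettle/workfile/kettleFile/testJob.kjb`.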
  • Add the scheduled task
user@user:~/hzx/kettle/workfile/kettleShs$ crontab -e
no crontab for user - using an empty one
Select an editor.  To change later, run 'select-editor'.
  1. /bin/nano        <---- easiest
  2. /usr/bin/vim.basic
  3. /usr/bin/vim.tiny
  4. /bin/ed

Choose 1-4 [1]: 2  ## I chose 2 (vim); press Enter to edit the task list
# Edit this file to introduce tasks to be run by cron.
# 
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
# 
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
# 
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
# 
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
# 
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
# 
# For more information see the manual pages of crontab(5) and cron(8)
# 
# m h  dom mon dow   command
# Run at the start of every hour; adjust the schedule as needed
 0 */1 * * * /home/user/hzx/kettle/workfile/kettleShs/job.sh
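
The same entry can also be added without opening an editor, which is handy when provisioning several servers. The sketch below appends the line only if it is not already present; `merge_cron_line` is a hypothetical helper, and the final pipeline (left commented out) overwrites the current user's crontab with the merged result.

```shell
#!/bin/bash
# Read existing crontab text on stdin and print it back,
# appending the line in $1 only if no identical line exists.
merge_cron_line() {
    local line="$1" existing
    existing="$(cat)"
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$line"; then
        printf '%s\n' "$line"
    fi
}

# Usage (modifies the current user's crontab):
# crontab -l 2>/dev/null \
#   | merge_cron_line '0 */1 * * * /home/user/hzx/kettle/workfile/kettleShs/job.sh' \
#   | crontab -
```

Because the comparison is on the whole line, running the pipeline twice leaves a single entry.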

Save and exit, and the setup is complete.

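Note: when the server restarts, cron does not make up for the runs that were missed while it was down, so check whether the time parameters used in the job depend strongly on the server time.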
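
One way to reduce that dependency is to record the time of the last successful run and pass the window to the job explicitly, using the -param option shown in the Kitchen help output. This is a sketch only: the state file location, the helper names, and the START_TS and END_TS parameters are assumptions, and the job itself would have to define and use those two named parameters.

```shell
#!/bin/bash
# Hypothetical location of the state file; override via the environment.
STATE_FILE="${STATE_FILE:-/data/kettle/kettle_log/last_success.txt}"

# Print the timestamp of the last successful run,
# or the fallback in $1 (default: the epoch) if none is recorded.
last_success() {
    if [ -s "$STATE_FILE" ]; then
        cat "$STATE_FILE"
    else
        echo "${1:-19700101000000}"
    fi
}

# Record the given timestamp as the last successful run.
mark_success() {
    printf '%s\n' "$1" > "$STATE_FILE"
}

# Usage inside job.sh: process everything since the last success,
# and only advance the marker when Kitchen exits with status 0.
# now="$(date +%Y%m%d%H%M%S)"
# ./kitchen.sh -file=/home/kettle/workfile/kettleFile/testJob.kjb \
#     "-param:START_TS=$(last_success)" "-param:END_TS=$now" \
#   && mark_success "$now"
```

After an outage, the first run then covers the whole gap instead of just the most recent hour.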

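Reference links:

Kettle transformation videos and official documentation

Kettle community Q&A articles

Kettle Spoon user guide

Kettle documentation for all versions

Kettle data integration steps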