Hands-On Guide: Building a Big Data ETL Scheduling Platform with DolphinScheduler
I. DolphinScheduler Core Architecture
1. Core Components
- Master Server: schedules workflows and dispatches tasks
- Worker Server: executes the actual tasks
- Alert Server: sends alerts and notifications
- API Server: exposes the RESTful interface
- UI: web-based operations console
2. Deployment Mode Comparison
| Mode | Intended Use | Characteristics |
|---|---|---|
| Standalone | Development and testing | All components on a single node |
| Pseudo-cluster | Functional verification | Distributed deployment simulated on one machine |
| Cluster | Production | Genuinely distributed deployment |
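In cluster mode each component from the list above runs as its own daemon on the nodes you assign to it. A minimal sketch of the per-component start commands, assuming the 1.3.x bin/ layout (verify the script and service names against your release):
Bash
# On the designated master node(s)
sh ./bin/dolphinscheduler-daemon.sh start master-server
# On every worker node
sh ./bin/dolphinscheduler-daemon.sh start worker-server
# API and alert services (often co-located with a master)
sh ./bin/dolphinscheduler-daemon.sh start api-server
sh ./bin/dolphinscheduler-daemon.sh start alert-server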
II. Quick Installation Guide (Based on Version 1.3.9)
1. Environment Preparation
Bash
# Install dependencies
sudo yum install -y java-1.8.0-openjdk mysql-server
# Create the metadata database and account (CREATE USER + GRANT works on both MySQL 5.7 and 8.0)
mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8;
mysql> CREATE USER 'ds_user'@'%' IDENTIFIED BY 'ds_password';
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'ds_user'@'%';
mysql> FLUSH PRIVILEGES;
2. Standalone Deployment
Bash
wget https://mirrors.bfsu.edu.cn/apache/dolphinscheduler/1.3.9/apache-dolphinscheduler-1.3.9-bin.tar.gz
tar -zxvf apache-dolphinscheduler-1.3.9-bin.tar.gz
cd apache-dolphinscheduler-1.3.9-bin
sh ./bin/dolphinscheduler-daemon.sh start standalone-server
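Because the metadata database prepared above is MySQL, the service also has to be pointed at it rather than any embedded defaults. A hedged sketch, assuming the 1.3.x layout where the connection settings live in conf/datasource.properties and the schema is initialized by script/create-dolphinscheduler.sh (verify both against your release; note the MySQL JDBC driver jar is not bundled and must be copied into lib/):
Bash
# Point DolphinScheduler at the metadata database (keys follow the 1.3.x conf layout)
cat > conf/datasource.properties <<'EOF'
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8
spring.datasource.username=ds_user
spring.datasource.password=ds_password
EOF
# Initialize the metadata tables
sh ./script/create-dolphinscheduler.sh
Once the service is up, log in to the web UI with the default admin account documented upstream (admin / dolphinscheduler123); the UI address depends on how the front end is deployed (a separate nginx in 1.3.x, or the API server port in newer bundles).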
III. Hands-On ETL Task Development
1. Shell Task Example
Bash
#!/bin/bash
# Data extraction script example (extract_data.sh)
# Derive the date parameter (yesterday)
dt=$(date -d "-1 day" +%Y%m%d)
# Pull the raw file from HDFS into the staging area
hadoop fs -get /source/log_${dt}.csv /data/staging/
echo "Extract data for ${dt} completed"
2. Spark Task Configuration
Python
# pyspark_etl.py
import sys
from pyspark.sql import SparkSession
# The business date is passed in as the first main argument of the Spark task
# (for example ${system.biz.date} or a custom ${dt} parameter configured on the node)
dt = sys.argv[1]
spark = SparkSession.builder \
    .appName("ETL_Processing") \
    .enableHiveSupport() \
    .getOrCreate()
# Extract: read the staged CSV produced by the previous step
df = spark.read.csv(f"/data/staging/log_{dt}.csv", header=True)
# Transform: drop duplicates and fill missing values
df_clean = df.dropDuplicates().fillna(0)
# Load: write the cleaned data into Hive
df_clean.write.mode("overwrite").saveAsTable("ods.log_table")
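Outside the UI, the submission performed by a Spark task node can be sketched as a plain spark-submit call; on the real node you would instead fill in the deploy mode, resource file and "Main Arguments" fields on the task form. The master settings, script path and the ${system.biz.date} argument below are illustrative:
Bash
# Illustrative equivalent of a SPARK task node configuration
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  /path/to/pyspark_etl.py ${system.biz.date}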
IV. Workflow Design Best Practices
1. Typical ETL Workflow Structure
PlainText
[Shell: data extraction] → [Spark: data transformation] → [Hive: data loading] → [Email: notification]
2. Dependency Configuration Tips
Json
// Example task dependency configuration
{
  "dependTaskList": ["task_123"],
  "dependItemList": [
    {
      "depTasks": "task_123",
      "status": "SUCCESS"
    }
  ]
}
V. Advanced Scheduling Configuration
1. Using Time Parameters
Bash
# Using built-in system parameters in a Shell task
echo "Business date: ${system.biz.date}"     # yyyyMMdd, the day before the schedule time
echo "Schedule date: ${system.biz.curdate}"  # yyyyMMdd of the schedule time itself
# "Last month" style offsets are not built in; see the $[...] sketch below
2. Conditional Branching with a Switch Task
Json
// Branching logic with a Switch task (simplified illustration of the node configuration)
{
  "type": "SWITCH",
  "conditions": [
    {
      "condition": "${daily_flag} == true",
      "nextNode": "daily_task"
    },
    {
      "condition": "${weekly_flag} == true",
      "nextNode": "weekly_task"
    }
  ],
  "defaultBranch": "default_task"
}
VI. Monitoring and Alerting
1. Email Alert Settings
Properties
# Alert server configuration (alert.properties)
mail.protocol=SMTP
mail.server.host=smtp.163.com
mail.server.port=25
mail.sender=yourmail@163.com
mail.user=yourmail@163.com
mail.passwd=yourpassword
2. Custom Alert Script
Python
# wechat_alert.py
import sys
import requests
def send_wechat_alert(title, content):
    # WeCom (Enterprise WeChat) group-robot webhook; replace key=xxx with your robot key
    webhook = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
    msg = {
        "msgtype": "markdown",
        "markdown": {
            "content": f"**{title}**\n> {content}"
        }
    }
    requests.post(webhook, json=msg, timeout=10)
if __name__ == "__main__":
    # Usage: python wechat_alert.py "<title>" "<content>"
    send_wechat_alert(sys.argv[1], sys.argv[2])
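One hedged way to wire the script in is to call it from a Shell task on a failure branch (or at the end of a monitoring workflow), passing the title and body as arguments; the strings below are illustrative:
Bash
# Shell task content
python wechat_alert.py "ETL workflow failed" "Business date ${system.biz.date}: check the task instance logs"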
VII. Performance Tuning Tips
1. Worker Group Configuration
Properties
# Worker group membership (conf/worker.properties on each worker node)
# A worker declares the groups it joins; task nodes then pick a group to run on
worker.groups=group1,group2
2. Controlling Task Parallelism
Properties
# Global concurrency settings (conf/master.properties)
master.exec.threads=100   # max workflow instances a master runs in parallel
master.exec.task.num=20   # max parallel tasks within one workflow instance
VIII. System Integration
1. API Invocation Example
Python
import requests
# Trigger a workflow run through the REST API
# (endpoint path and field names vary across DolphinScheduler versions; check the API docs for your release)
def start_workflow(project, flow, params):
    url = f"http://ds-server:12345/dolphinscheduler/api/projects/{project}/executors/start-process-instance"
    headers = {"token": "your_token"}
    data = {
        "processDefinitionCode": flow,
        "scheduleTime": None,
        "failureStrategy": "CONTINUE",
        "execType": "START_PROCESS",
        "warningType": "NONE",
        "startParams": params
    }
    response = requests.post(url, json=data, headers=headers)
    return response.json()
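For a quick smoke test outside Python, the same request can be sketched with curl; the host, project code, workflow code and token are placeholders, and the exact parameter encoding (JSON body vs. form fields) differs between DolphinScheduler versions, so treat this as a template rather than a guaranteed call:
Bash
# Placeholder values throughout; adjust for your deployment and API version
curl -X POST \
  -H "token: your_token" \
  -H "Content-Type: application/json" \
  -d '{"processDefinitionCode": 987654321, "failureStrategy": "CONTINUE", "warningType": "NONE"}' \
  "http://ds-server:12345/dolphinscheduler/api/projects/123456/executors/start-process-instance"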
2. DataX Integration
Json
// Example DataX job configuration
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "password",
          "column": ["id", "name"],
          "connection": [{
            "table": ["table1"],
            "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/db"]
          }]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://cluster:8020",
          "fileType": "text",
          "path": "/data/output",
          "fileName": "data_${bizdate}",
          "column": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}]
        }
      }
    }]
  }
}
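Whether the job runs through a dedicated DataX node or a plain Shell task, the ${bizdate} variable in the configuration above has to be supplied at launch time. A hedged Shell-task sketch, assuming the job is saved as mysql_to_hdfs.json (hypothetical name) and DataX is installed under ${DATAX_HOME}:
Bash
# Inject the business date into the DataX job via -p; ${system.biz.date} is filled in by the scheduler
python ${DATAX_HOME}/bin/datax.py \
  -p "-Dbizdate=${system.biz.date}" \
  mysql_to_hdfs.json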
IX. Operations and Maintenance Tips
1. Log Analysis Script
Python
# analyze_ds_logs.py
import pandas as pd
def analyze_failure(log_path):
    # Skip malformed lines (on_bad_lines replaces the deprecated error_bad_lines argument)
    logs = pd.read_csv(log_path, sep="|", on_bad_lines="skip")
    failed = logs[logs['state'] == 'FAILURE']
    print(f"Top 5 failing tasks:\n{failed['task_name'].value_counts().head()}")
2. Database Maintenance Commands
SQL
-- Purge successful (state = 7) workflow instances older than 30 days
DELETE FROM t_ds_process_instance
WHERE state = 7 AND end_time < DATE_SUB(NOW(), INTERVAL 30 DAY);
-- Refresh index statistics (ANALYZE updates statistics rather than rebuilding indexes)
ANALYZE TABLE t_ds_task_instance;
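To keep the purge routine hands-off, the same statement can be run from cron or wrapped in a scheduled Shell/SQL task; the credentials below reuse the ds_user account created earlier:
Bash
# Example cron/Shell-task command for the periodic cleanup
mysql -u ds_user -pds_password dolphinscheduler \
  -e "DELETE FROM t_ds_process_instance WHERE state = 7 AND end_time < DATE_SUB(NOW(), INTERVAL 30 DAY);"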
With the configurations above you will be able to:
- Stand up an enterprise-grade scheduling platform quickly
- Orchestrate complex ETL workflows
- Build a reliable failure-alerting mechanism
- Integrate smoothly with your existing big data stack
For production, use the cluster deployment mode and back up the metadata database regularly to keep the scheduling service highly available.