DolphinScheduler Tutorial


Hands-on guide: building a big data ETL task scheduling platform with DolphinScheduler

I. DolphinScheduler Core Architecture

1. Core components

  • Master Server: responsible for workflow scheduling
  • Worker Server: executes the actual tasks
  • Alert Server: alerting service
  • API Server: exposes the RESTful interface
  • UI: web console

2. Deployment mode comparison

Mode | Target environment | Characteristics
Standalone | Development and testing | All components run on a single node
Pseudo-cluster | Functional verification | Distributed deployment simulated on a single machine
Cluster | Production | True distributed deployment

II. Quick Installation Guide (based on version 1.3.9)

1. Environment preparation

Bash

# Install dependencies (JDK 8 and MySQL)
sudo yum install -y java-1.8.0-openjdk mysql-server
# Create the metadata database
mysql> CREATE DATABASE dolphinscheduler DEFAULT CHARACTER SET utf8;
# MySQL 5.x syntax; on MySQL 8 run CREATE USER first, then GRANT without IDENTIFIED BY
mysql> GRANT ALL PRIVILEGES ON dolphinscheduler.* TO 'ds_user'@'%' IDENTIFIED BY 'ds_password';
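
After unpacking the binary package (next step), DolphinScheduler has to be pointed at this database before its services start. A minimal sketch, assuming the 1.3.x layout with conf/datasource.properties and the script/create-dolphinscheduler.sh schema-init script (key names and paths differ in newer releases, and the MySQL JDBC driver jar must be dropped into lib/ first):

Bash

# Assumed 1.3.x configuration keys; adjust to your package layout
cat >> conf/datasource.properties <<'EOF'
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://127.0.0.1:3306/dolphinscheduler?useUnicode=true&characterEncoding=UTF-8
spring.datasource.username=ds_user
spring.datasource.password=ds_password
EOF
# Initialize the metadata tables
sh script/create-dolphinscheduler.sh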

2. Standalone deployment

Bash

wget https://mirrors.bfsu.edu.cn/apache/dolphinscheduler/1.3.9/apache-dolphinscheduler-1.3.9-bin.tar.gz
tar -zxvf apache-dolphinscheduler-1.3.9-bin.tar.gz
cd apache-dolphinscheduler-1.3.9-bin
sh ./bin/dolphinscheduler-daemon.sh start standalone-server
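
Once the daemon reports started, a quick sanity check (process names and the default admin/dolphinscheduler123 login may differ between versions) confirms the components and the web UI are up:

Bash

# The component JVMs should show up in jps output
jps | grep -Ei 'master|worker|alert|api|standalone'
# The web UI is served by the API server on port 12345 by default
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:12345/dolphinscheduler/ui/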

III. Hands-on ETL Task Development

1. Shell task example

Bash

#!/bin/bash
# Data extraction script example (extract_data.sh)
# Derive yesterday's date
dt=$(date -d "-1 day" +%Y%m%d)
# Pull the raw log file from HDFS into the local staging area
hadoop fs -get /source/log_${dt}.csv /data/staging/
echo "Extract data for ${dt} completed"

2. Spark task configuration

Python

# pyspark_etl.py
import sys

from pyspark.sql import SparkSession

# Business date passed in as the first argument (e.g. from a DolphinScheduler parameter)
dt = sys.argv[1]

spark = SparkSession.builder \
    .appName("ETL_Processing") \
    .enableHiveSupport() \
    .getOrCreate()

# Read the staged data
df = spark.read.csv(f"/data/staging/log_{dt}.csv", header=True, inferSchema=True)

# Transform: drop duplicates and fill missing values
df_clean = df.dropDuplicates().fillna(0)

# Load into Hive
df_clean.write.mode("overwrite").saveAsTable("ods.log_table")
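
How the script is launched depends on the task type chosen in DolphinScheduler; below is a minimal sketch of the equivalent spark-submit call from a Shell task, with placeholder resource settings and the business date passed through as the dt argument:

Bash

# Illustrative resource settings; tune --num-executors / --executor-memory for your cluster
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 4g \
  pyspark_etl.py ${system.biz.date}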

IV. Workflow Design Best Practices

1. Typical ETL workflow structure

PlainText

[Shell: extract data] -> [Spark: transform data] -> [Hive: load data] -> [Email: notification]

2. Dependency configuration tips

Json

// Task dependency configuration example
{
  "dependTaskList": ["task_123"],
  "dependItemList": [
    {
      "depTasks": "task_123",
      "status": "SUCCESS"
    }
  ]
}

V. Advanced Scheduling Configuration

1. Using time parameters

Bash

# Use the built-in system parameters inside a Shell task
echo "Processing date: ${system.biz.date}"
# One month before the business date, via the derived time-parameter syntax
echo "Last month: $[add_months(yyyyMMdd,-1)]"

2. Conditional branching

Json

// Branching logic with a Switch task
{
  "type": "SWITCH",
  "conditions": [
    {
      "condition": "${daily_flag} == true",
      "nextNode": "daily_task"
    },
    {
      "condition": "${weekly_flag} == true",
      "nextNode": "weekly_task"
    }
  ],
  "defaultBranch": "default_task"
}

VI. Monitoring and Alerting Configuration

1. Email alert settings

PlainText

# Alert configuration (alert.properties)
mail.protocol=SMTP
mail.server.host=smtp.163.com
mail.server.port=25
mail.sender=yourmail@163.com
mail.user=yourmail@163.com
mail.passwd=yourpassword

2. Custom alert script

Python

# wechat_alert.py -- push an alert to a WeCom (WeChat Work) group robot webhook
import sys
import requests

def send_wechat_alert(title, content):
    webhook = "https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=xxx"
    msg = {
        "msgtype": "markdown",
        "markdown": {
            "content": f"**{title}**\n> {content}"
        }
    }
    # Fail fast if the webhook is unreachable or rejects the payload
    requests.post(webhook, json=msg, timeout=10).raise_for_status()

if __name__ == "__main__":
    send_wechat_alert(sys.argv[1], sys.argv[2])
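
One way to wire the script in (an assumption, not a built-in hook) is a Shell task on the failure branch of the workflow that calls it with a title and a message:

Bash

# Hypothetical failure-branch task: push a WeCom alert carrying the business date
python wechat_alert.py "ETL failure" "Daily ETL failed for ${system.biz.date}"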

VII. Performance Tuning

1. Worker group configuration

PlainText

# Worker group membership is declared on each worker node in its own worker.properties
# worker1 and worker2 (members of group1):
worker.groups=group1
# worker3 (member of group2):
worker.groups=group2

2. Task parallelism control

PlainText

# Global parallelism settings (master.properties)
master.exec.threads=100
master.exec.task.num=20

VIII. System Integration

1. API call example

Python

import requests

# Trigger a workflow run via the REST API
def start_workflow(project, flow, params):
    url = f"http://ds-server:12345/dolphinscheduler/api/projects/{project}/executors/start-process-instance"
    headers = {"token": "your_token"}
    data = {
        "processDefinitionCode": flow,
        "scheduleTime": None,
        "failureStrategy": "CONTINUE",
        "execType": "START_PROCESS",
        "warningType": "NONE",
        "startParams": params
    }
    response = requests.post(url, json=data, headers=headers)
    return response.json()

2. Integration with DataX

Json

// DataX job configuration example
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "password",
          "column": ["id", "name"],
          "connection": [{
            "table": ["table1"],
            "jdbcUrl": ["jdbc:mysql://127.0.0.1:3306/db"]
          }]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://cluster:8020",
          "fileType": "text",
          "path": "/data/output",
          "fileName": "data_${bizdate}",
          "column": [{"name": "id", "type": "long"}, {"name": "name", "type": "string"}]
        }
      }
    }]
  }
}
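
DolphinScheduler 1.3.x ships a dedicated DataX task type, but the job above can just as well be launched from a Shell task. A minimal sketch, assuming DATAX_HOME points at a local DataX install, /data/jobs/datax_job.json is where the JSON above is saved, and the -p flag is used to fill the ${bizdate} placeholder:

Bash

# Run the DataX job and substitute ${bizdate} with the DolphinScheduler business date
python ${DATAX_HOME}/bin/datax.py \
  -p "-Dbizdate=${system.biz.date}" \
  /data/jobs/datax_job.json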

IX. Operations and Maintenance Tips

1. Log analysis script

Python

# analyze_ds_logs.py
import pandas as pd

def analyze_failure(log_path):
    # Skip malformed lines (on pandas < 1.3 use error_bad_lines=False instead)
    logs = pd.read_csv(log_path, sep="|", on_bad_lines="skip")
    failed = logs[logs['state'] == 'FAILURE']
    print(f"Top 5 failed tasks:\n{failed['task_name'].value_counts().head()}")

2. Database maintenance commands

SQL

-- Purge historical process instances (state 7 corresponds to SUCCESS in the 1.3.x schema)
DELETE FROM t_ds_process_instance 
WHERE state = 7 AND end_time < DATE_SUB(NOW(), INTERVAL 30 DAY);

-- Refresh table statistics (ANALYZE updates statistics rather than rebuilding indexes)
ANALYZE TABLE t_ds_task_instance;
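
To run the purge unattended, one option (a sketch; credentials and schedule are placeholders) is a cron entry that replays the statement through the mysql client; the same statement could also be scheduled inside DolphinScheduler itself as an SQL task:

Bash

# Hypothetical weekly housekeeping job: purge successful instances older than 30 days
0 2 * * 0 mysql -h127.0.0.1 -uds_user -pds_password dolphinscheduler -e "DELETE FROM t_ds_process_instance WHERE state = 7 AND end_time < DATE_SUB(NOW(), INTERVAL 30 DAY);"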

With the configurations above, you will be able to:

  1. Stand up an enterprise-grade scheduling platform quickly
  2. Orchestrate complex ETL workflows
  3. Build a reliable failure alerting mechanism
  4. Integrate smoothly with your existing big data components

For production, use the cluster deployment mode and back up the metadata database regularly to keep the scheduling service highly available.