SeaTunnel & SeaTunnel Web Deployment


References

  1. Deployment of Apache SeaTunnel Web
  2. SeaTunnel deployment
  3. [Bug][Seatunnel-web] Data source already configured, but source name cannot be selected

Deployment Overview

In my setup, a 3-node seatunnel cluster runs on the EMR core nodes, and seatunnel-web is deployed on a single node outside the cluster.

| IP | seatunnel-engine | seatunnel-web | Notes |
|---|---|---|---|
| 10.6.4.24 | master | | passwordless SSH to the other nodes |
| 10.6.4.10 | core | | |
| 10.6.4.14 | core | | |
| 10.6.4.15 | core | | |
| 10.6.6.2 | | data-collect | seatunnel here is only a client; the service is not started |

Linux Environment Initialization

  1. Later steps use scp and similar operations; skip this part if it doesn't apply to your setup. First, create a hadoop user on the master node and grant it passwordless sudo

    sudo su - 
    groupadd -f hadoop    # no-op if the group already exists; useradd -g needs it
    useradd -m -g hadoop hadoop

    echo "hadoop ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
    
  2. Generate a key pair for the hadoop user

    # press Enter through all prompts
    su - hadoop
    ssh-keygen -t rsa
    
  3. On every other node that needs passwordless login, create the hadoop user and append the public key

    # (first create the hadoop user as in step 1 if it does not exist on this node)
    # replace xxxxxxxxxxxxx with the public key generated in the previous step
    # (cat ~/.ssh/id_rsa.pub on the master node prints it)
    su - hadoop 
    mkdir -p ~/.ssh
    chmod 700 ~/.ssh
    echo "xxxxxxxxxxxxx" >> ~/.ssh/authorized_keys
    chmod 600 ~/.ssh/authorized_keys
    exit
    exit
    exit
    
  4. Verify that passwordless login works (an ssh-copy-id shortcut is sketched below):

    ssh -p 1022 hadoop@10.6.4.10
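
    If you'd rather not paste the key by hand in step 3, ssh-copy-id does the same thing. A minimal sketch, assuming the hadoop user already exists on the target node and sshd listens on port 1022:

    # run on the master node as hadoop, once per target node
    ssh-copy-id -p 1022 hadoop@10.6.4.10     # appends ~/.ssh/id_rsa.pub to the remote authorized_keys
    ssh -p 1022 hadoop@10.6.4.10 hostname    # should print the hostname without asking for a password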
    

SeaTunnel Deployment

  1. On the master node, download seatunnel into the /opt/softs directory (download link: download). Version 2.3.3 is recommended; I initially tried a newer version and ran into problems, though I'm not sure whether they were caused by the version or by my own settings. seatunnel-web hasn't been updated for a while, so I don't know whether newer versions are compatible.

    sudo su - hadoop
    sudo mkdir -p /opt/softs /opt/service
    sudo chown -R hadoop:hadoop /opt/softs /opt/service

    cd /opt/softs
    export version="2.3.3"
    wget "https://archive.apache.org/dist/seatunnel/${version}/apache-seatunnel-${version}-bin.tar.gz"

    tar -zxvf apache-seatunnel-2.3.3-bin.tar.gz -C /opt/service
    
  2. Installing the plugins uses Maven to download the jars. The default script downloads its own mvnw; you can install Maven yourself or just use the bundled one. Either way, switch the Maven mirror first or downloads will be very slow. Point the mirror at Aliyun:

      <mirror>
        <id>alimaven</id>
        <name>aliyun maven</name>
        <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        <mirrorOf>central</mirrorOf>
      </mirror>
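
    If you don't have a settings.xml yet, here is a minimal sketch that writes one containing just this mirror (~/.m2/settings.xml is Maven's default location; the heredoc is my own shortcut, equivalent to copying a prepared settings.xml as done below):

    # write a minimal Maven settings.xml with only the aliyun mirror
    mkdir -p ~/.m2
    cat > ~/.m2/settings.xml <<'EOF'
    <settings>
      <mirrors>
        <mirror>
          <id>alimaven</id>
          <name>aliyun maven</name>
          <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
          <mirrorOf>central</mirrorOf>
        </mirror>
      </mirrors>
    </settings>
    EOF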
    

    Edit config/plugin_config and keep only the connectors you need; here is mine:

    # Don't modify the delimiter " -- ", just select the plugin you need
    --connectors-v2--
    connector-cdc-mysql
    connector-clickhouse
    connector-dingtalk
    connector-doris
    connector-elasticsearch
    connector-file-s3
    connector-hive
    connector-hudi
    connector-jdbc
    connector-kafka
    connector-kudu
    connector-redis
    connector-starrocks
    --end--
    

    Install the connector plugins; after the script finishes you should see the corresponding jars under connectors/seatunnel

    mkdir -p ~/.m2
    cp settings.xml ~/.m2/    # the settings.xml prepared above with the aliyun mirror

    sh bin/install-plugin.sh 
    

    Alternatively, you can download each jar manually from Maven and upload it to the right directory (not recommended). The official docs have a tip saying you need to create flink, flink-sql and similar subdirectories under connectors; in my testing that was not necessary.
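
    To double-check what install-plugin.sh actually pulled down, list the connector directory (path follows the layout used in this post):

    ls /opt/service/apache-seatunnel-2.3.3/connectors/seatunnel/
    # expect one connector-*.jar for every entry kept in config/plugin_config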


  3. Configure environment variables by adding the following to /etc/profile:

    export SEATUNNEL_HOME=/opt/service/apache-seatunnel-2.3.3
    export PATH=$PATH:$SEATUNNEL_HOME/bin

    # run `source /etc/profile` to apply
    
  4. Edit the startup script $SEATUNNEL_HOME/bin/seatunnel-cluster.sh and add the JVM memory setting JAVA_OPTS="-Xms2G -Xmx2G" at the top of the file; a sketch of the edit follows below.
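
    A minimal sketch of that edit, assuming the script keeps its shebang on line 1 (the setting is inserted right after it):

    sed -i '1a JAVA_OPTS="-Xms2G -Xmx2G"' $SEATUNNEL_HOME/bin/seatunnel-cluster.sh
    head -n 2 $SEATUNNEL_HOME/bin/seatunnel-cluster.sh    # confirm the JAVA_OPTS line is in place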

  5. Adjust the configuration files under the config directory. See the official docs for what each option does; I won't go through them one by one and simply list my configuration.

    1. seatunnel.yaml; for the HA HDFS settings, refer to the checkpoint storage docs

      seatunnel:
        engine:
          history-job-expire-minutes: 4320
          backup-count: 1
          queue-type: blockingqueue
          print-execution-info-interval: 60
          print-job-metrics-info-interval: 60
          slot-service:
            dynamic-slot: true
          checkpoint:
            interval: 10000
            timeout: 60000
            storage:
              type: hdfs
              max-retained: 3
              plugin-config:
                namespace: /tmp/seatunnel/checkpoint_snapshot
                storage.type: hdfs
                fs.defaultFS: hdfs://emr-cluster
                seatunnel.hadoop.dfs.nameservices: emr-cluster
                seatunnel.hadoop.dfs.ha.namenodes.emr-cluster: nn1,nn2
                seatunnel.hadoop.dfs.namenode.rpc-address.emr-cluster.nn1: 10.6.4.13:8020
                seatunnel.hadoop.dfs.namenode.rpc-address.emr-cluster.nn2: 10.6.4.24:8020
                seatunnel.hadoop.dfs.client.failover.proxy.provider.emr-cluster: org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider
      
    2. hazelcast.yaml; mainly set cluster-name and member-list

      hazelcast:
        cluster-name: seatunnel-test
        network:
          rest-api:
            enabled: true
            endpoint-groups:
              CLUSTER_WRITE:
                enabled: true
              DATA:
                enabled: true
          join:
            tcp-ip:
              enabled: true
              member-list:
                - 10.6.4.10
                - 10.6.4.14
                - 10.6.4.15
          port:
            auto-increment: false
            port: 5801
        properties:
          hazelcast.invocation.max.retry.count: 20
          hazelcast.tcp.join.port.try.count: 30
          hazelcast.logging.type: log4j2
          hazelcast.operation.generic.thread.count: 50
      
    3. hazelcast-client.yaml; note that cluster-name must match the one in hazelcast.yaml, otherwise jobs cannot be submitted

      hazelcast-client:
        cluster-name: seatunnel-test
        properties:
          hazelcast.logging.type: log4j2
        network:
          cluster-members:
            - 10.6.4.10:5801
            - 10.6.4.14:5801
            - 10.6.4.15:5801
      
    4. Create the logs directory

      mkdir -p $SEATUNNEL_HOME/logs
      
    5. Distribute the package to the other nodes and start the cluster

      ansible emr_core -m shell -a "sudo mkdir -p /opt/service"
      ansible emr_core -m shell -a "sudo chown -R hadoop:hadoop /opt/service"
      ansible emr_core -m copy -a "src=/opt/service/apache-seatunnel-2.3.3 dest=/opt/service/"
      ​
      ansible emr_core -m shell -a "sh /opt/service/apache-seatunnel-2.3.3/bin/seatunnel-cluster.sh -d"
      
    6. Go to a core node and check that a test job runs correctly; finishing without errors is enough (a cluster-wide check is sketched after the code below)

      cd /opt/service/apache-seatunnel-2.3.3
      sh bin/seatunnel.sh --config config/v2.batch.config.template
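
      To confirm the engine is up on every node, a quick check (assuming the server JVM shows up in jps as SeaTunnelServer and using the Ansible group emr_core from above):

      ansible emr_core -m shell -a "jps | grep SeaTunnelServer"
      # each core node should report exactly one SeaTunnelServer process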
      

SeaTunnel Web Deployment

  1. Deploy the SeaTunnel Engine client by simply copying the package over from the master node

    # on the master node
    cd /opt/service
    scp -P 1022 -r apache-seatunnel-2.3.3 hadoop@10.6.6.2:$PWD

    # on the data-collect node
    cd /opt/service/apache-seatunnel-2.3.3
    sh bin/seatunnel.sh --config config/v2.batch.config.template
    
  2. Download apache-seatunnel-web-1.0.0-bin.tar.gz into the /opt/softs directory and extract it

    sudo mkdir -p /opt/softs /opt/service
    sudo chown -R hadoop:hadoop /opt/softs /opt/service
    cd /opt/softs
    # closer.lua is a mirror-selection page; if wget saves an HTML file instead of the
    # tarball, fetch it from https://archive.apache.org/dist/seatunnel/seatunnel-web/1.0.0/ instead
    wget https://www.apache.org/dyn/closer.lua/seatunnel/seatunnel-web/1.0.0/apache-seatunnel-web-1.0.0-bin.tar.gz
    tar -zxvf apache-seatunnel-web-1.0.0-bin.tar.gz -C /opt/service
    
  3. Configure environment variables: add the following to /etc/profile and run source /etc/profile to apply:

    export SEATUNNEL_HOME=/opt/service/apache-seatunnel-2.3.3
    export SEATUNNEL_WEB_HOME=/opt/service/apache-seatunnel-web-1.0.0-bin
    export ST_WEB_BASEDIR_PATH=/opt/service/apache-seatunnel-web-1.0.0-bin/ui
    ​
    export PATH=$PATH:$SEATUNNEL_HOME/bin:$SEATUNNEL_WEB_HOME/bin
    
  4. Initialize the database:

    1. Edit apache-seatunnel-web-1.0.0-bin/script/seatunnel_server_env.sh and fill in the correct connection details; note that the seatunnel database is used by default

      export HOSTNAME="localhost"
      export PORT="3306"
      export USERNAME="root"
      export PASSWORD="123456"
      
    2. Run sh apache-seatunnel-web-1.0.0-bin/script/init_sql.sh to initialize the data; if it finishes without exceptions, it succeeded. A quick verification is sketched below.
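
      A quick way to confirm the schema landed, assuming the defaults above (database seatunnel, credentials from seatunnel_server_env.sh):

      mysql -h localhost -P 3306 -uroot -p -e "SHOW TABLES;" seatunnel
      # a non-empty table list means init_sql.sh ran successfully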

  5. Download the Datasource plugins

    cd /opt/service/apache-seatunnel-web-1.0.0-bin/bin
    wget https://seatunnel.apache.org/assets/files/download_datasource-4b79e6fafe80459590a6a0fc2865e5ac.sh
    mv download_datasource-4b79e6fafe80459590a6a0fc2865e5ac.sh download_datasource.sh
    ​
    # Before running this script, it's recommended to remove the datasource-hive entry from it;
    # its jetty-server dependency conflicts with the jar shipped in seatunnel-web and makes startup fail
    sh download_datasource.sh
    
  6. Fill in the missing dependencies and configuration (sketched as commands after this list)

    1. Manually download the MySQL JDBC driver into /opt/service/apache-seatunnel-web-1.0.0-bin/libs
    2. The datasource-* jars under /opt/service/apache-seatunnel-web-1.0.0-bin/libs need to be copied to /opt/service/apache-seatunnel-2.3.3/lib/ on the client node
    3. /opt/service/apache-seatunnel-2.3.3/config/hazelcast-client.yaml and /opt/service/apache-seatunnel-2.3.3/connectors/plugin-mapping.properties need to be copied to /opt/service/apache-seatunnel-web-1.0.0-bin/conf
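
    A sketch of those three steps as commands, with paths taken from this post (the MySQL driver version below is an assumption; pick whatever matches your MySQL server):

    # 1. MySQL JDBC driver into the seatunnel-web libs directory
    cd /opt/service/apache-seatunnel-web-1.0.0-bin/libs
    wget https://repo1.maven.org/maven2/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar

    # 2. datasource-* jars into the seatunnel client lib directory
    cp /opt/service/apache-seatunnel-web-1.0.0-bin/libs/datasource-*.jar /opt/service/apache-seatunnel-2.3.3/lib/

    # 3. client config and plugin mapping into the seatunnel-web conf directory
    cp /opt/service/apache-seatunnel-2.3.3/config/hazelcast-client.yaml \
       /opt/service/apache-seatunnel-2.3.3/connectors/plugin-mapping.properties \
       /opt/service/apache-seatunnel-web-1.0.0-bin/conf/
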
  7. Edit the seatunnel-web configuration /opt/service/apache-seatunnel-web-1.0.0-bin/conf/application.yml; the MySQL connection just needs to match what was used during database initialization

    server:
      port: 8801
    
    spring:
      application:
        name: seatunnel
      jackson:
        date-format: yyyy-MM-dd HH:mm:ss
      datasource:
        driver-class-name: com.mysql.jdbc.Driver
        url: jdbc:mysql://xxxx:3306/seatunnel?useSSL=false&useUnicode=true&characterEncoding=utf-8&allowMultiQueries=true&allowPublicKeyRetrieval=true
        username: xxx
        password: xxx
      mvc:
        pathmatch:
          matching-strategy: ant_path_matcher
    
    jwt:
      expireTime: 86400
      secretKey: https://github.com/apache/seatunnel
      algorithm: HS256
    
  8. Start the SeaTunnel Web service

    1. Start the service. Make sure to run the start command from the apache-seatunnel-web-1.0.0-bin directory, otherwise the frontend resources may not be found and the page will return errors

      cd /opt/service/apache-seatunnel-web-1.0.0-bin
      sh bin/seatunnel-backend-daemon.sh start
      
    2. Visit http://127.0.0.1:8801/ui; the default username and password are admin / admin
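
      Before opening the browser, you can confirm the backend is actually listening on the port from application.yml:

      ss -lntp | grep 8801    # a java process should be bound to port 8801
      curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:8801/ui/    # expect 200 once the UI resources are served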

Troubleshooting

jetty-server class version conflict

SeaTunnel Web fails at startup with:

An attempt was made to call a method that does not exist. The attempt was made from the following location:
​
    org.springframework.boot.web.embedded.jetty.JettyServletWebServerFactory.configureSession(JettyServletWebServerFactory.java:267)
​
The following method did not exist:
​
    org.eclipse.jetty.server.session.SessionHandler.setMaxInactiveInterval(I)V
​
The calling method's class, org.springframework.boot.web.embedded.jetty.JettyServletWebServerFactory, was loaded from the following location:
​
    jar:file:/datadir/seatunnel-web/libs/spring-boot-2.6.8.jar!/org/springframework/boot/web/embedded/jetty/JettyServletWebServerFactory.class
​
The called method's class, org.eclipse.jetty.server.session.SessionHandler, is available from the following locations:
​
    jar:file:/opt/service/apache-seatunnel-web-1.0.0-bin/libs/datasource-hive-1.0.0.jar!/org/eclipse/jetty/server/session/SessionHandler.class
    jar:file:/opt/service/apache-seatunnel-web-1.0.0-bin/libs/jetty-server-9.4.53.v20231009.jar!/org/eclipse/jetty/server/session/SessionHandler.class

This happens because /opt/service/apache-seatunnel-web-1.0.0-bin/libs/datasource-hive-1.0.0.jar conflicts with the bundled jetty-server-9.4.53.v20231009.jar. Deleting datasource-hive-1.0.0.jar fixes the startup, but then the Hive datasource is unavailable; if you need Hive, you'll have to find another solution.
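
If you hit a similar "method did not exist" error, the loop below shows which jars under libs/ bundle the conflicting class (class path taken from the stack trace above):

    cd /opt/service/apache-seatunnel-web-1.0.0-bin/libs
    for j in *.jar; do
      unzip -l "$j" | grep -q 'org/eclipse/jetty/server/session/SessionHandler.class' && echo "$j"
    done
    # here both datasource-hive-1.0.0.jar and jetty-server-9.4.53.v20231009.jar show up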

Data source cannot be created

If a data source cannot be created, check whether the datasource-* jars were successfully downloaded into the $SEATUNNEL_WEB_HOME/libs directory.

Data source can be created but no source can be selected

My setup is already working, so I don't have screenshots of the failing states; work through the following checks one by one (they are bundled into a single sketch after the list):

  1. Check whether the $SEATUNNEL_HOME/lib directory contains the datasource-* jars. I'm not sure whether only the client node needs them or the seatunnel-engine nodes do as well; I put them on every node.

  2. Check whether the $SEATUNNEL_HOME/connectors/seatunnel directory contains the relevant connector jars; if you downloaded any manually, make sure they end up in connectors/seatunnel.

  3. Check that the following environment variables are configured correctly and have taken effect:

    export SEATUNNEL_HOME=/opt/service/apache-seatunnel-2.3.3
    export SEATUNNEL_WEB_HOME=/opt/service/apache-seatunnel-web-1.0.0-bin
    export ST_WEB_BASEDIR_PATH=/opt/service/apache-seatunnel-web-1.0.0-bin/ui
    ​
    export PATH=$PATH:$SEATUNNEL_HOME/bin:$SEATUNNEL_WEB_HOME/bin
    

    If you're worried they aren't being picked up, you can put these exports at the top of $SEATUNNEL_WEB_HOME/bin/seatunnel-backend-daemon.sh. Once they take effect, the log will contain messages like:

    [AbstractPluginDiscovery.<init>():113] - Load SeaTunnelSink Plugin from /opt/service/apache-seatunnel-2.3.3/connectors/seatunnel
     
     
    [AbstractDataSourceClient.getCustomClassloader():225] - ST_WEB_BASEDIR_PATH is : /opt/service/apache-seatunnel-web-1.0.0-bin/ui
    
  4. Check whether $SEATUNNEL_HOME/lib contains the MySQL JDBC driver. I'm not sure whether this actually matters: I saw it mentioned online and added it, but never verified whether my eventual success depended on it. If the previous checks pass and things still don't work, give it a try.
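
All four checks bundled into one sketch, assuming the environment variables above are already exported:

    # 1. datasource jars visible to the seatunnel client
    ls $SEATUNNEL_HOME/lib/ | grep datasource- || echo "no datasource-* jars in \$SEATUNNEL_HOME/lib"

    # 2. connector jars in place
    ls $SEATUNNEL_HOME/connectors/seatunnel/ | grep connector- || echo "no connectors found"

    # 3. environment variables resolved
    echo "SEATUNNEL_HOME=$SEATUNNEL_HOME"
    echo "SEATUNNEL_WEB_HOME=$SEATUNNEL_WEB_HOME"
    echo "ST_WEB_BASEDIR_PATH=$ST_WEB_BASEDIR_PATH"

    # 4. mysql jdbc driver present
    ls $SEATUNNEL_HOME/lib/ | grep -i mysql || echo "no mysql driver in \$SEATUNNEL_HOME/lib"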

Whole-database sync unavailable

The data source can be created and single-table sync jobs can select a source, but multi-table and whole-database sync cannot. This is because no CDC data source has been configured; multi-table sync depends on it.

Finally, some screenshots of my working configuration (screenshots not reproduced here).

Other Notes

The SeaTunnel Web project has seen essentially no code updates since the 1.0.0 release in October 2023. The official documentation is also fairly poor, with many empty pages, and the GitHub issues tab isn't even enabled, so I'm somewhat worried about where it's heading; that's surprising for an Apache top-level project. Judging from replies in the user group, a new release doesn't look likely in the short term, so think carefully before using it in production; I don't want to keep switching ingestion components either.


Follow-up

After using seatunnel-web for a while, I found it genuinely painful: once the token expires you can't even log out and the only fix is clearing the browser cache, and if seatunnel-web is deployed on a non-EMR node, anything Hadoop-related keeps failing with missing jars.

Then I noticed that DolphinScheduler has built-in SeaTunnel integration: once seatunnel is installed, you only need to configure the environment in DolphinScheduler. I've decided to give up on seatunnel-web!