A Guide to Quickly Deploying Big Data Components with Docker


1. Single-Container Deployment of Hadoop/Spark

1.1 Custom Dockerfile

FROM ubuntu:22.04

# Base environment setup
RUN apt-get update && apt-get install -y \
    openjdk-11-jdk \
    ssh \
    wget \
    python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install Hadoop 3.3.6
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && \
    tar -xzf hadoop-3.3.6.tar.gz -C /opt && \
    ln -s /opt/hadoop-3.3.6 /opt/hadoop

# Install Spark 3.4.2
RUN wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz && \
    tar -xzf spark-3.4.2-bin-hadoop3.tgz -C /opt && \
    ln -s /opt/spark-3.4.2-bin-hadoop3 /opt/spark

# Configure environment variables
# (JAVA_HOME below assumes the amd64 OpenJDK 11 path on Ubuntu 22.04)
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 \
    HADOOP_HOME=/opt/hadoop \
    SPARK_HOME=/opt/spark
# PATH is set in a separate ENV so it can see the variables defined above
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin:$SPARK_HOME/sbin

# Initialize SSH keys for passwordless login between Hadoop daemons
RUN mkdir -p ~/.ssh && \
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 0600 ~/.ssh/authorized_keys && \
    echo "StrictHostKeyChecking no" >> /etc/ssh/ssh_config

# Expose commonly used ports (HDFS RPC 8020/9000, NameNode UI 9870, YARN UI 8088, Spark app UI 4040)
EXPOSE 8020 9000 9870 8088 4040

# Entrypoint script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
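
With entrypoint.sh from section 1.2 saved next to the Dockerfile, the image can be built locally; the tag bigdata-allinone is the one reused in the Compose and Kubernetes examples below:

docker build -t bigdata-allinone .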

1.2 Sample Startup Script (entrypoint.sh)

#!/bin/bash

# Hadoop 3 refuses to start daemons as root unless these users are declared
export HDFS_NAMENODE_USER=root HDFS_DATANODE_USER=root HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root YARN_NODEMANAGER_USER=root

# Format HDFS on first start (path must match dfs.namenode.name.dir)
if [ ! -f /opt/hadoop/name/current/VERSION ]; then
    hdfs namenode -format -force
fi

# Start the SSH service
service ssh start

# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# Start Spark in standalone mode
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077

# Keep the container running
tail -f /dev/null
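
A minimal sketch of running the all-in-one container; the published ports are assumptions matching the UIs used later (NameNode 9870, YARN 8088, Spark master 8080, Spark application 4040), since EXPOSE alone does not publish anything:

docker run -d --name namenode \
    -p 9870:9870 -p 8088:8088 -p 8080:8080 -p 4040:4040 \
    bigdata-allinone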

2. Orchestrating a Cluster with Docker Compose

2.1 Multi-Node Cluster Architecture

graph TD
    subgraph Hadoop Cluster
    NN[NameNode] --> DN1[DataNode1]
    NN --> DN2[DataNode2]
    RM[ResourceManager] --> NM1[NodeManager1]
    RM --> NM2[NodeManager2]
    end
    
    subgraph Spark Cluster
    SM[Spark Master] --> SW1[Spark Worker1]
    SM --> SW2[Spark Worker2]
    end
    
    ZK[ZooKeeper] -->|coordinates| NN
    ZK -->|coordinates| RM

2.2 docker-compose.yml

version: '3.7'

services:
  namenode:
    image: bigdata-allinone
    hostname: namenode
    ports:
      - "9870:9870"
      - "9000:9000"
    volumes:
      - hadoop_nn:/opt/hadoop/name
    environment:
      - CLUSTER_ROLE=namenode
    networks:
      - bigdata-net

  datanode1:
    image: bigdata-allinone
    hostname: datanode1
    volumes:
      - hadoop_dn:/opt/hadoop/data   # use the hadoop_dn volume declared at the bottom
    environment:
      - CLUSTER_ROLE=datanode
      - NN_HOST=namenode
    depends_on:
      - namenode
    networks:
      - bigdata-net

  spark-master:
    image: bigdata-allinone
    hostname: spark-master
    ports:
      - "8080:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - bigdata-net

  spark-worker1:
    image: bigdata-allinone
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - bigdata-net

volumes:
  hadoop_nn:
  hadoop_dn:

networks:
  bigdata-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
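
Assuming the file is saved as docker-compose.yml in the project directory, the cluster can be brought up and inspected with the standard Compose commands:

docker compose up -d
docker compose ps
docker compose logs -f namenode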

2.3 Dynamic Startup Script (Python Docker SDK)

import docker
import time

def start_cluster():
    client = docker.from_env()

    # Create a user-defined bridge network so containers can resolve each other by name
    if not client.networks.list(names=["bigdata-net"]):
        client.networks.create("bigdata-net", driver="bridge")

    # Start the Hadoop cluster
    namenode = client.containers.run(
        "bigdata-allinone",
        detach=True,
        name="namenode",
        network="bigdata-net",
        ports={'9870/tcp': 9870},
        environment={'CLUSTER_ROLE': 'namenode'}
    )

    time.sleep(30)  # wait for the NameNode to initialize

    for i in range(2):
        client.containers.run(
            "bigdata-allinone",
            detach=True,
            name=f"datanode{i+1}",
            network="bigdata-net",
            environment={
                'CLUSTER_ROLE': 'datanode',
                'NN_HOST': 'namenode'
            }
        )

    # Start the Spark cluster
    spark_master = client.containers.run(
        "bigdata-allinone",
        detach=True,
        name="spark-master",
        network="bigdata-net",
        ports={'8080/tcp': 8080},
        environment={'SPARK_MODE': 'master'}
    )

    for i in range(2):
        client.containers.run(
            "bigdata-allinone",
            detach=True,
            name=f"spark-worker{i+1}",
            network="bigdata-net",
            environment={
                'SPARK_MODE': 'worker',
                'SPARK_MASTER_URL': 'spark://spark-master:7077'
            }
        )

if __name__ == "__main__":
    start_cluster()
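
The script needs the Docker SDK for Python, and the image must already be built; the filename start_cluster.py below is only an assumption:

pip install docker
python start_cluster.py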

3. Verification and Usage

3.1 Service Verification Commands

# Check HDFS status
docker exec namenode hdfs dfsadmin -report

# Submit a Spark test job
docker exec spark-master spark-submit \
    --master spark://spark-master:7077 \
    /opt/spark/examples/src/main/python/pi.py 100
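
As an extra end-to-end check, a small file can be written to HDFS and read back (the paths here are arbitrary examples):

# Write a test file into HDFS and read it back
docker exec namenode bash -c "echo smoke-test > /tmp/smoke.txt && hdfs dfs -put /tmp/smoke.txt /tmp/ && hdfs dfs -cat /tmp/smoke.txt"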

3.2 Web UI Access

| Service | URL |
| --- | --- |
| HDFS NameNode | http://localhost:9870 |
| YARN ResourceManager | http://localhost:8088 |
| Spark Master | http://localhost:8080 |

4. Pitfalls and Troubleshooting

4.1 Common Problems and Solutions

| Symptom | Solution |
| --- | --- |
| DataNode cannot register | Check whether the NameNode container's IP has changed |
| Spark Worker shows as offline | Confirm that port 7077 is mapped correctly |
| Containers cannot reach each other | Use a user-defined Docker network (see the example below) |
| Volume permission errors | Add user: "root" to the service definition |
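
For the user-defined network fix, a minimal sketch (container names taken from the earlier examples; running containers can be attached without restarting them):

docker network create bigdata-net
docker network connect bigdata-net namenode
docker network connect bigdata-net datanode1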

4.2 Performance Tuning Parameters

# Add resource limits to a service in docker-compose.yml
deploy:
  resources:
    limits:
      cpus: '2'
      memory: 4G
    reservations:
      memory: 2G
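
When starting containers directly with docker run instead of Compose, roughly equivalent limits can be set with CLI flags (the values simply mirror the snippet above):

docker run -d --name spark-worker1 \
    --cpus="2" --memory="4g" --memory-reservation="2g" \
    bigdata-allinone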

5. Additional Deployment Modes

5.1 Kubernetes Deployment Example

apiVersion: apps/v1
kind: Deployment
metadata:
  name: hadoop-namenode
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hadoop-namenode
  template:
    metadata:
      labels:
        app: hadoop-namenode
    spec:
      containers:
      - name: namenode
        image: bigdata-allinone
        env:
        - name: CLUSTER_ROLE
          value: "namenode"
        ports:
        - containerPort: 9870
---
apiVersion: v1
kind: Service
metadata:
  name: hadoop-namenode
spec:
  type: NodePort
  ports:
  - port: 9870
    nodePort: 30007
  selector:
    app: hadoop-namenode
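
Assuming both manifests are saved in a single file namenode.yaml (the filename is arbitrary), they can be applied and checked with:

kubectl apply -f namenode.yaml
kubectl get pods -l app=hadoop-namenode
kubectl get svc hadoop-namenode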

Production Recommendations

  1. Use persistent volumes to keep HDFS data safe
  2. Set container resource limits to prevent OOM kills
  3. Enable health checks (a minimal example follows this list)
  4. Manage custom images in a private registry
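
A minimal health-check sketch using docker run flags, assuming the hdfs CLI is on the image's PATH; in Compose the same settings go under a service's healthcheck key:

# Mark the container unhealthy if an HDFS admin report fails
docker run -d --name namenode \
    --health-cmd "hdfs dfsadmin -report || exit 1" \
    --health-interval 30s \
    --health-retries 3 \
    bigdata-allinone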

Extended practice: integrate Prometheus to monitor the containerized cluster; see the GitHub repository for the complete code.

Appendix: Quick Reference for Common Docker Commands

| Task | Command |
| --- | --- |
| Build the image | docker build -t bigdata-allinone . |
| Tail container logs | docker logs -f namenode |
| Open a shell in a container | docker exec -it namenode bash |
| Clean up unused resources | docker system prune -af |