A Quick Guide to Deploying Big Data Components with Docker

1. Single-Container Deployment of Hadoop/Spark
1.1 Custom Dockerfile
```dockerfile
FROM ubuntu:22.04

# Base environment
RUN apt-get update && apt-get install -y \
        openjdk-11-jdk \
        ssh \
        wget \
        python3-pip \
    && rm -rf /var/lib/apt/lists/*

# Install Hadoop 3.3.6
RUN wget https://archive.apache.org/dist/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz && \
    tar -xzf hadoop-3.3.6.tar.gz -C /opt && \
    ln -s /opt/hadoop-3.3.6 /opt/hadoop && \
    rm hadoop-3.3.6.tar.gz

# Install Spark 3.4.2
RUN wget https://archive.apache.org/dist/spark/spark-3.4.2/spark-3.4.2-bin-hadoop3.tgz && \
    tar -xzf spark-3.4.2-bin-hadoop3.tgz -C /opt && \
    ln -s /opt/spark-3.4.2-bin-hadoop3 /opt/spark && \
    rm spark-3.4.2-bin-hadoop3.tgz

# Environment variables (Hadoop needs JAVA_HOME; sbin is added for the start scripts)
ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64 \
    HADOOP_HOME=/opt/hadoop \
    SPARK_HOME=/opt/spark \
    PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$SPARK_HOME/bin

# Passwordless SSH to localhost, required by the Hadoop start scripts
RUN mkdir -p ~/.ssh && \
    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
    chmod 0600 ~/.ssh/authorized_keys && \
    printf "Host *\n  StrictHostKeyChecking no\n" > ~/.ssh/config

# Expose common ports (NameNode RPC and Web UI, YARN RM UI, Spark application UI)
EXPOSE 8020 9000 9870 8088 4040

# Startup script
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
ENTRYPOINT ["/entrypoint.sh"]
```
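To build and smoke-test the image before moving on to a multi-container cluster (the container name and published ports here are illustrative; 8080 is the default Spark master UI port):

```bash
# Build the image from the Dockerfile above
docker build -t bigdata-allinone .

# Run a single all-in-one container and publish the HDFS, YARN and Spark UIs
docker run -d --name bigdata \
  -p 9870:9870 -p 8088:8088 -p 8080:8080 \
  bigdata-allinone

# Follow the startup logs
docker logs -f bigdata
```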
1.2 Example Startup Script (entrypoint.sh)
```bash
#!/bin/bash
# Hadoop 3.x refuses to run the start scripts as root unless these users are declared
export HDFS_NAMENODE_USER=root
export HDFS_DATANODE_USER=root
export HDFS_SECONDARYNAMENODE_USER=root
export YARN_RESOURCEMANAGER_USER=root
export YARN_NODEMANAGER_USER=root

# Format HDFS on first start (the path matches the NameNode volume mounted at /opt/hadoop/name)
if [ ! -f /opt/hadoop/name/current/VERSION ]; then
    hdfs namenode -format -force
fi

# Start the SSH service (needed by the Hadoop start scripts)
service ssh start

# Start HDFS
start-dfs.sh

# Start YARN
start-yarn.sh

# Start Spark Standalone
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh spark://localhost:7077

# Keep the container running
tail -f /dev/null
```
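Note that neither the Dockerfile nor entrypoint.sh shows the Hadoop configuration files they rely on; without at least fs.defaultFS and dfs.namenode.name.dir, start-dfs.sh will not bring up a usable HDFS. Below is a minimal sketch that could be run during the image build, with values assumed to match the ports and paths used elsewhere in this guide (in the multi-node setup of Part 2, localhost would be replaced by the NN_HOST value):

```bash
# Minimal Hadoop configuration written into the image (assumed values).
# fs.defaultFS matches port 9000 exposed above; dfs.namenode.name.dir matches
# the /opt/hadoop/name path checked in entrypoint.sh and mounted in docker-compose.yml.
cat > /opt/hadoop/etc/hadoop/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
EOF

cat > /opt/hadoop/etc/hadoop/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/opt/hadoop/name</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
```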
2. Orchestrating a Cluster with Docker Compose
2.1 Multi-Node Cluster Architecture
```mermaid
graph TD
    subgraph Hadoop Cluster
        NN[NameNode] --> DN1[DataNode1]
        NN --> DN2[DataNode2]
        RM[ResourceManager] --> NM1[NodeManager1]
        RM --> NM2[NodeManager2]
    end
    subgraph Spark Cluster
        SM[Spark Master] --> SW1[Spark Worker1]
        SM --> SW2[Spark Worker2]
    end
    ZK[ZooKeeper] -->|coordinates| NN
    ZK -->|coordinates| RM
```
2.2 docker-compose.yml
```yaml
version: '3.7'

services:
  namenode:
    image: bigdata-allinone
    hostname: namenode
    ports:
      - "9870:9870"
      - "9000:9000"
    volumes:
      - hadoop_nn:/opt/hadoop/name
    environment:
      - CLUSTER_ROLE=namenode
    networks:
      - bigdata-net

  datanode1:
    image: bigdata-allinone
    hostname: datanode1
    volumes:
      - hadoop_dn:/opt/hadoop/data   # assumed DataNode data dir (dfs.datanode.data.dir)
    environment:
      - CLUSTER_ROLE=datanode
      - NN_HOST=namenode
    depends_on:
      - namenode
    networks:
      - bigdata-net

  spark-master:
    image: bigdata-allinone
    hostname: spark-master
    ports:
      - "8080:8080"
    environment:
      - SPARK_MODE=master
    networks:
      - bigdata-net

  spark-worker1:
    image: bigdata-allinone
    environment:
      - SPARK_MODE=worker
      - SPARK_MASTER_URL=spark://spark-master:7077
    depends_on:
      - spark-master
    networks:
      - bigdata-net

volumes:
  hadoop_nn:
  hadoop_dn:

networks:
  bigdata-net:
    driver: bridge
    ipam:
      config:
        - subnet: 172.20.0.0/24
```
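Bring the cluster up and check that every service started (assuming the file above is saved as docker-compose.yml in the current directory):

```bash
# Start all services in the background
docker compose up -d

# Show service status and published ports
docker compose ps

# Follow the NameNode logs while HDFS formats and starts
docker compose logs -f namenode
```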
2.3 Dynamic Startup Script
```python
import time

import docker


def start_cluster():
    client = docker.from_env()

    # All containers join the same user-defined network so they can resolve each
    # other by name (the network is assumed to exist, e.g. created beforehand with
    # `docker network create bigdata-net`).
    net = "bigdata-net"

    # Start the Hadoop cluster
    namenode = client.containers.run(
        "bigdata-allinone",
        detach=True,
        name="namenode",
        network=net,
        ports={'9870/tcp': 9870},
        environment={'CLUSTER_ROLE': 'namenode'}
    )
    time.sleep(30)  # wait for the NameNode to initialize

    for i in range(2):
        client.containers.run(
            "bigdata-allinone",
            detach=True,
            name=f"datanode{i+1}",
            network=net,
            environment={
                'CLUSTER_ROLE': 'datanode',
                'NN_HOST': 'namenode'
            }
        )

    # Start the Spark cluster
    spark_master = client.containers.run(
        "bigdata-allinone",
        detach=True,
        name="spark-master",
        network=net,
        ports={'8080/tcp': 8080},
        environment={'SPARK_MODE': 'master'}
    )

    for i in range(2):
        client.containers.run(
            "bigdata-allinone",
            detach=True,
            name=f"spark-worker{i+1}",
            network=net,
            environment={
                'SPARK_MODE': 'worker',
                'SPARK_MASTER_URL': 'spark://spark-master:7077'
            }
        )


if __name__ == "__main__":
    start_cluster()
```
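The script only starts containers; a matching teardown for everything it creates (container names as used in the script, plus the assumed shared network) might look like this:

```bash
# Remove the containers created by start_cluster()
docker rm -f namenode datanode1 datanode2 \
  spark-master spark-worker1 spark-worker2

# Optionally remove the shared network as well
docker network rm bigdata-net
```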
3. Verification and Usage
3.1 Service Verification Commands
```bash
# Check HDFS status
docker exec namenode hdfs dfsadmin -report

# Submit a Spark test job
docker exec spark-master spark-submit \
  --master spark://spark-master:7077 \
  /opt/spark/examples/src/main/python/pi.py 100
```
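Both services also expose HTTP endpoints that are convenient for scripted checks (ports as published in the compose file above):

```bash
# NameNode: live/dead DataNode counts from the built-in JMX endpoint
curl -s "http://localhost:9870/jmx?qry=Hadoop:service=NameNode,name=FSNamesystemState" \
  | grep -E 'NumLiveDataNodes|NumDeadDataNodes'

# Spark standalone master: cluster state as JSON (alive workers, running applications)
curl -s http://localhost:8080/json
```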
3.2 Web UI Access
| Service | URL |
|---|---|
| HDFS NameNode | http://localhost:9870 |
| YARN ResourceManager | http://localhost:8088 |
| Spark Master | http://localhost:8080 |
4. Avoiding Common Pitfalls
4.1 Common Problems and Solutions
| Symptom | Solution |
|---|---|
| DataNode fails to register | Check whether the NameNode container's IP has changed (see the sketch below) |
| Spark Worker shows as offline | Confirm that port 7077 is mapped and reachable |
| Containers cannot reach each other | Use a user-defined Docker network |
| Volume permission errors | Add `user: "root"` to the service definition |
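A changing NameNode IP usually means the containers were started on the default bridge network; one way to avoid it is to put everything on a user-defined network and, if necessary, pin a fixed address for the NameNode. The subnet and address below are illustrative and match the compose example:

```bash
# Create the shared network with the same subnet as docker-compose.yml
docker network create --driver bridge --subnet 172.20.0.0/24 bigdata-net

# Pin a fixed address for the NameNode so DataNodes can always find it
docker run -d --name namenode \
  --network bigdata-net --ip 172.20.0.10 \
  -e CLUSTER_ROLE=namenode -p 9870:9870 \
  bigdata-allinone
```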
4.2 Performance Tuning Parameters
```yaml
# Add resource limits to a service in docker-compose.yml
deploy:
  resources:
    limits:
      cpus: '2'
      memory: 4G
    reservations:
      memory: 2G
```
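To check whether the limits take effect on a running cluster, `docker stats` reports per-container CPU and memory usage against the configured limit:

```bash
# One-shot snapshot of CPU/memory usage for the cluster containers
docker stats --no-stream namenode datanode1 spark-master spark-worker1
```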
5. Extended Deployment Modes
5.1 Kubernetes Deployment Example
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hadoop-namenode
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hadoop-namenode
  template:
    metadata:
      labels:
        app: hadoop-namenode
    spec:
      containers:
        - name: namenode
          image: bigdata-allinone
          env:
            - name: CLUSTER_ROLE
              value: "namenode"
          ports:
            - containerPort: 9870
---
apiVersion: v1
kind: Service
metadata:
  name: hadoop-namenode
spec:
  type: NodePort
  ports:
    - port: 9870
      nodePort: 30007
  selector:
    app: hadoop-namenode
```
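Applying the manifests and reaching the NameNode UI through the NodePort (the file name and node address are illustrative):

```bash
# Apply the Deployment and Service
kubectl apply -f hadoop-namenode.yaml

# Wait for the pod to become ready, then check the Service
kubectl get pods -l app=hadoop-namenode
kubectl get svc hadoop-namenode

# The Web UI is reachable on any cluster node at the NodePort (30007 above)
curl -s http://<node-ip>:30007
```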
Production recommendations:
- Use persistent volumes to keep data safe
- Set container resource limits to prevent OOM kills
- Enable health checks (see the sketch below)
- Manage custom images in a private registry
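For the health-check recommendation, a lightweight option that does not require changing the image is to attach a health command when starting a container; the probe below assumes the NameNode Web UI answers on port 9870 and uses wget because the image installs wget rather than curl:

```bash
# Start the NameNode with a Docker-level health check against its Web UI
docker run -d --name namenode --network bigdata-net \
  --health-cmd "wget -q --spider http://localhost:9870/ || exit 1" \
  --health-interval 30s --health-timeout 5s --health-retries 3 \
  -e CLUSTER_ROLE=namenode -p 9870:9870 \
  bigdata-allinone

# Inspect the current health status
docker inspect --format '{{.State.Health.Status}}' namenode
```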
Further practice: integrate Prometheus to monitor the containerized cluster; see the GitHub repository for the complete code.
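As a starting point for that monitoring setup, container-level metrics can be exported with cAdvisor and scraped by Prometheus; the mounts below follow cAdvisor's standard invocation, and host port 8081 is chosen to avoid the Spark master UI on 8080. Treat this as a sketch, not the complete setup:

```bash
# Run cAdvisor to expose per-container CPU/memory/IO metrics for Prometheus to scrape
docker run -d --name cadvisor -p 8081:8080 \
  -v /:/rootfs:ro \
  -v /var/run:/var/run:ro \
  -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor
```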
Appendix: Quick Reference for Common Docker Commands
| Task | Command |
|---|---|
| Build the image | docker build -t bigdata-allinone . |
| Follow container logs | docker logs -f namenode |
| Open a shell in a container | docker exec -it namenode bash |
| Clean up unused resources | docker system prune -af |