[Big Data] Python + Hadoop (Hands-On)

Python is usually combined with Hadoop through either the Hadoop Streaming API or PySpark.

The walkthrough below shows how to write a Hadoop MapReduce job in Python on top of Hadoop Streaming, and explains when this approach fits.

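I. Use Cases

Big data processing: the data volume exceeds what a single machine can handle (e.g. log analysis, ETL cleansing).
Reusing an existing Hadoop cluster: the company already runs Hadoop infrastructure, but the team is more comfortable with Python.
Rapid prototyping: Python's concise syntax makes it well suited to validating algorithm logic quickly.
Using the Python ecosystem: the job needs Python machine learning libraries (e.g. scikit-learn) to process data stored in Hadoop.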

II. Worked Example: WordCount with Python + Hadoop Streaming

1. Environment

A Hadoop cluster (or a single-node pseudo-distributed setup)
Python (must be installed on every node)
Test data: a text file (e.g. input.txt)

2. Prepare the Data

Create a local test file:

echo "hello world hello python" > input.txt     # truncate the file, then write
echo "hadoop python spark python" >> input.txt  # append to the file

3. Upload the Data to HDFS

hadoop fs -mkdir /input           # create the directory /input at the HDFS root
hadoop fs -put input.txt /input   # copy input.txt into /input

4. Write the Python MapReduce Scripts

Mapper script (mapper.py)
Reads the input line by line, splits it into words, and emits <word, 1>:

#!/usr/bin/env python3
import sys

for line in sys.stdin:
    line = line.strip()  # remove leading and trailing whitespace
    words = line.split() # split on whitespace; the result is a list
    for word in words:
        print(f"{word}\t1")

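Reducer script (reducer.py)
Sums the counts for each word. Hadoop sorts the mapper output by key before it reaches the reducer, so all lines for the same word arrive consecutively:

#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    line = line.strip()
    try:
        # Both the split and the int conversion can fail on a malformed line
        word, count = line.split('\t', 1)
        count = int(count)
    except ValueError:
        continue

    if current_word == word:
        current_count += count
    else:
        # A new key has started: emit the total for the previous one
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = count

# Emit the total for the last key
if current_word is not None:
    print(f"{current_word}\t{current_count}")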
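Before submitting to a cluster, it can help to check the map, sort, and reduce logic locally. The following is a minimal sketch, not part of the original scripts: the function names map_words, reduce_counts, and run_local are hypothetical, and a plain sorted() call stands in for Hadoop's shuffle phase.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Mimic mapper.py: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Mimic reducer.py: sum the counts of consecutive pairs with equal keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def run_local(lines):
    """Map, sort by key (standing in for the shuffle phase), then reduce."""
    pairs = sorted(map_words(lines), key=itemgetter(0))
    return list(reduce_counts(pairs))


# Same two lines as the test file from step 2
sample = ["hello world hello python", "hadoop python spark python"]
for word, count in run_local(sample):
    print(f"{word}\t{count}")
```

The same check is commonly done in the shell with cat input.txt | ./mapper.py | sort | ./reducer.py. Note that skipping the sort step splits the counts for a word across several output lines, which is why the reducer depends on sorted input.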

5. Make the Scripts Executable

chmod +x mapper.py reducer.py

6. Submit the Hadoop Streaming Job

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
-input /input \
-output /output \
-mapper "mapper.py" \
-reducer "reducer.py" \
-file mapper.py \
-file reducer.py

7. View the Result

hadoop fs -cat /output/part-00000

Sample output:

hadoop  1
hello   2
python  3
spark   1
world   1
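If the result needs further processing in Python, the tab-separated lines can be loaded into a dictionary. This is a small sketch under assumptions: parse_counts is a hypothetical helper, and output_lines stands in for the lines read from the part file (for example from the output of hadoop fs -cat).

```python
def parse_counts(lines):
    """Parse 'word<TAB>count' lines into a dict, skipping malformed lines."""
    counts = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2:
            continue
        try:
            counts[parts[0]] = int(parts[1])
        except ValueError:
            continue
    return counts


# Stand-in for the contents of /output/part-00000 shown above
output_lines = ["hadoop\t1\n", "hello\t2\n", "python\t3\n", "spark\t1\n", "world\t1\n"]
counts = parse_counts(output_lines)

# Most frequent word first
top = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(top[0])
```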

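III. Other Ways to Combine Python with Hadoop

PySpark
Run Spark on Hadoop (YARN) to process the data; suited to complex pipelines and machine learning:

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("yarn").appName("WordCount").getOrCreate()
text_file = spark.read.text("hdfs:///input/input.txt")  # file uploaded in step 3
counts = (text_file.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))
# Write to a new directory: /output already holds the streaming job's result
counts.saveAsTextFile("hdfs:///output_spark")
spark.stop()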

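HDFS Operations
Read and write HDFS directly with the hdfs library (a WebHDFS client):

from hdfs import InsecureClient
# WebHDFS address; 50070 is the Hadoop 2.x default port (Hadoop 3.x uses 9870)
client = InsecureClient('http://namenode:50070')
with client.read('/input/input.txt', encoding='utf-8') as reader:
    content = reader.read()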

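IV. Key Considerations

Performance: Hadoop Streaming is slower than native Java MapReduce because every record is piped through an external process. Pre-aggregating on the map side reduces the data shuffled to the reducers, for example by passing reducer.py as the -combiner or by summing counts inside the mapper.
Dependency management: for complex jobs, ship Python dependencies to the nodes with the -file option (recent Hadoop releases deprecate it in favor of the generic -files option).
Error handling: wrap record parsing in try-except inside the Python scripts so that a single malformed record does not fail the whole task.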
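Map-side pre-aggregation can be sketched as follows. This is an illustration under assumptions rather than a drop-in replacement for mapper.py: map_with_combining and its max_keys parameter are hypothetical names, and the cap simply bounds how many distinct words are held in memory before the partial counts are flushed.

```python
from collections import Counter


def map_with_combining(lines, max_keys=100_000):
    """Yield (word, partial_count) pairs, pre-aggregated in memory.

    max_keys is a hypothetical cap on the number of distinct words held at
    once; when it is exceeded, the partial counts are flushed and reset.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        if len(counts) > max_keys:
            yield from counts.items()
            counts.clear()
    # Flush whatever is left at the end of the input
    yield from counts.items()


# In a real mapper the lines would come from sys.stdin
sample = ["hello world hello python", "hadoop python spark python"]
for word, partial in map_with_combining(sample):
    print(f"{word}\t{partial}")
```

Because reducer.py sums whatever counts it receives, it works unchanged with these partial counts. A word may still appear more than once in the mapper output after a flush, and the reducer adds the pieces together.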

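V. Summary

Python + Hadoop is a good fit when large datasets on a Hadoop cluster need to be processed quickly while keeping Python's flexibility and rich library ecosystem. For more complex scenarios such as iterative computation, PySpark is the better choice.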