1. MapReduce Core Ideas
- 1. A distributed computation program usually needs to be split into at least two stages.
- 2. The concurrent MapTask instances of the first stage run fully in parallel and are completely independent of one another.
- 3. The concurrent ReduceTask instances of the second stage are likewise independent of one another, but their input depends on the output of every MapTask instance from the previous stage.
- 4. The MapReduce programming model can contain only one Map stage and one Reduce stage; if the business logic is more complex, the only option is to run several MapReduce jobs serially, as shown in the sketch below.
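To make point 4 concrete, here is a minimal sketch of one driver submitting two jobs serially, where the second job reads the first job's output. It reuses the WordcountMapper and WordcountReducer classes from the case study below; the scratch directory and the do-nothing identity second stage are assumptions made purely to show the wiring.
```java
// Sketch: two MapReduce jobs chained serially in a single driver.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1] + "_tmp"); // assumed scratch directory
        Path output = new Path(args[1]);

        // Stage 1: the WordCount job from the case study, writing to the scratch directory.
        Job job1 = Job.getInstance(conf, "stage-1-wordcount");
        job1.setJarByClass(ChainedDriver.class);
        job1.setMapperClass(WordcountMapper.class);
        job1.setReducerClass(WordcountReducer.class);
        job1.setMapOutputKeyClass(Text.class);
        job1.setMapOutputValueClass(IntWritable.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);

        // waitForCompletion blocks, so stage 2 starts only after stage 1 succeeds.
        if (!job1.waitForCompletion(true)) {
            System.exit(1);
        }

        // Stage 2: a second job whose input is stage 1's output. Here it is an
        // identity pass (default Mapper/Reducer) purely to show the hand-off.
        Job job2 = Job.getInstance(conf, "stage-2-identity");
        job2.setJarByClass(ChainedDriver.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}
```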
2. MapReduce Processes
A complete MapReduce program running in distributed mode involves three kinds of instance processes:
- (1) MrAppMaster: schedules the whole job and coordinates its state.
- (2) MapTask: handles the entire data-processing flow of the Map stage.
- (3) ReduceTask: handles the entire data-processing flow of the Reduce stage.
3. Common Data Serialization Types
| Java Type | Hadoop Writable Type |
|---|---|
| Boolean | BooleanWritable |
| Byte | ByteWritable |
| Integer | IntWritable |
| Float | FloatWritable |
| Long | LongWritable |
| Double | DoubleWritable |
| String | Text |
| Map | MapWritable |
| Array | ArrayWritable |
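Each type in the right-hand column implements Hadoop's Writable interface: the object writes itself to a binary DataOutput stream and reads itself back from a DataInput. A minimal, self-contained sketch of that round trip (not part of the original notes):
```java
// Demonstrates Writable serialization: write a (Text, IntWritable) pair to a
// byte stream, then read it back into fresh objects.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

public class WritableRoundTrip {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);

        // Serialize the pair, just as the shuffle does between Map and Reduce.
        new Text("zhangfei").write(out);
        new IntWritable(3).write(out);

        // Deserialize back from the raw bytes.
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        Text word = new Text();
        IntWritable count = new IntWritable();
        word.readFields(in);
        count.readFields(in);

        System.out.println(word + "\t" + count.get()); // prints: zhangfei	3
    }
}
```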
4. Case Study (1)
4.1 Requirements
Count the total number of occurrences of each word in a given text file. Input data: hello.txt
```
zhangfei
zhangfei
zhangfei
liubei
liubei
liubei
zhugeliang
zhugeliang
```
Expected output (reduce output is sorted by key):
```
liubei 3
zhangfei 3
zhugeliang 2
```
4.2 Writing the Mapper Class
```java
package com.jubull.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordcountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    Text k = new Text();
    IntWritable v = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // 1. Get one line of input
        String line = value.toString();
        // 2. Split the line into words
        String[] words = line.split(" ");
        // 3. Emit each word with a count of 1
        for (String word : words) {
            k.set(word);
            context.write(k, v);
        }
    }
}
```
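One design choice worth noting in this Mapper: k and v are created once as fields and reused on every map() call. This is safe because context.write() serializes the key/value pair immediately, and it avoids allocating two fresh objects for every word of input.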
4.3 Writing the Reducer Class
```java
package com.jubull.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordcountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    int sum;
    IntWritable v = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // 1. Sum all the counts for this word
        sum = 0;
        for (IntWritable count : values) {
            sum += count.get();
        }
        // 2. Emit the word and its total
        v.set(sum);
        context.write(key, v);
    }
}
```
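A Hadoop-specific caveat in this loop: the values Iterable can be traversed only once, and the framework reuses the same IntWritable instance across iterations, so the code must copy the primitive out with count.get() (as done above) rather than hold references to the objects themselves.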
4.4 Writing the Driver Class
```java
package com.jubull.mapreduce.wordcount;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordcountDriver {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        // 1. Get the configuration and create the job
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);
        // 2. Set the jar the job classes are loaded from
        job.setJarByClass(WordcountDriver.class);
        // 3. Set the Mapper and Reducer classes
        job.setMapperClass(WordcountMapper.class);
        job.setReducerClass(WordcountReducer.class);
        // 4. Set the map output key/value types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        // 5. Set the final output key/value types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // 6. Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // 7. Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }
}
```
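Because summing counts is associative and commutative, the same reducer class can also serve as a combiner, pre-aggregating map output locally and shrinking the shuffle. This is an optional addition, not part of the original driver: insert job.setCombinerClass(WordcountReducer.class); after step 3.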
4.5 Local Testing
- 1. Set up the environment first.
- 2. Download the Hadoop Linux tar.gz package from the Hadoop site: https://hadoop.apache.org/release/3.3.1.html
- 3. Copy the downloaded archive into /opt/module and extract it there: tar -zxvf hadoop-3.3.1.tar.gz -C ./
- 4. Configure the system environment variables in .bash_profile (this walkthrough uses a Mac).
- 5. Once configured, run source .bash_profile.
- 6. Add the dependencies to the Maven pom.xml:
```xml
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>2.12.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-client</artifactId>
        <!-- keep this in step with the Hadoop version installed above -->
        <version>3.3.1</version>
    </dependency>
</dependencies>
```
- 7. Local test: simply run WordcountDriver in IDEA, passing the input and output paths as program arguments.
4.6 Testing on the Cluster
To run on the cluster, package the job into a jar with Maven, which requires adding the packaging plugins.
1. Configure the build plugins in pom.xml:
```xml
<build>
    <plugins>
        <plugin>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.3.2</version>
            <configuration>
                <source>1.8</source>
                <target>1.8</target>
            </configuration>
        </plugin>
        <plugin>
            <artifactId>maven-assembly-plugin</artifactId>
            <configuration>
                <descriptorRefs>
                    <descriptorRef>jar-with-dependencies</descriptorRef>
                </descriptorRefs>
                <archive>
                    <manifest>
                        <mainClass>com.jubull.mapreduce.wordcount.WordcountDriver</mainClass>
                    </manifest>
                </archive>
            </configuration>
            <executions>
                <execution>
                    <id>make-assembly</id>
                    <phase>package</phase>
                    <goals>
                        <goal>single</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
```
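With this build section in place, mvn clean package produces two jars under target/: a plain jar and a *-jar-with-dependencies.jar that bundles the Hadoop client libraries. The wc.jar used in the command below is assumed to be one of these, copied to the cluster and renamed.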
2. Start the Hadoop cluster.
3. Run the WordCount program: hadoop jar wc.jar com.jubull.mapreduce.wordcount.WordcountDriver /user/atguigu/input /user/atguigu/output
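Two practical notes: FileOutputFormat fails the job if the output directory already exists, so remove it (hadoop fs -rm -r /user/atguigu/output) before re-running; the results are written to files named like part-r-00000 and can be inspected with hadoop fs -cat /user/atguigu/output/part-r-00000.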