Reference: 《尚硅谷大数据技术之Hadoop(MapReduce)》
1) MapReduce Overview
1.1) Principles
As shown above, a distributed computation program usually needs at least two stages:
- The MapTask instances of the first stage run fully in parallel and are independent of one another.
- The ReduceTask instances of the second stage are also independent of one another, but their input depends on the output of all MapTask instances of the previous stage.
- The MapReduce programming model allows only one Map stage and one Reduce stage. If the business logic is more complex than that, the only option is to chain multiple MapReduce programs and run them serially (see the sketch below).
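A minimal sketch of such serial chaining in a driver, reusing the WordCount MyMapper/MyReducer defined in section 1.3 purely as stand-ins (in a real chain each stage would have its own Mapper/Reducer; the paths come from args and are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of job 1, input of job 2
        Path output = new Path(args[2]);

        // Job 1: first Map + Reduce stage.
        Job job1 = Job.getInstance(conf, "stage-1");
        job1.setJarByClass(ChainedDriver.class);
        job1.setMapperClass(MyMapper.class);    // stand-in Mapper
        job1.setReducerClass(MyReducer.class);  // stand-in Reducer
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);                     // stop the chain if job 1 fails
        }

        // Job 2: starts only after job 1 has finished, reading job 1's output.
        Job job2 = Job.getInstance(conf, "stage-2");
        job2.setJarByClass(ChainedDriver.class);
        job2.setMapperClass(MyMapper.class);    // stand-in Mapper
        job2.setReducerClass(MyReducer.class);  // stand-in Reducer
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}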
1.2) MapReduce Processes
A complete MapReduce program running in distributed mode has three kinds of instance processes:
(1) MrAppMaster: responsible for scheduling the whole program and coordinating its state.
(2) MapTask: responsible for the entire data-processing flow of the Map stage.
(3) ReduceTask: responsible for the entire data-processing flow of the Reduce stage.
1.3) The HelloWorld of MR: the WordCount Demo
Common data serialization types
Key logic
Notes
The Mapper object reads the raw input line by line.
Each time a line is read, the map method is called once; it splits the line into words and emits key/value pairs.
K is the word and V is its count (the Mapper emits 1 for every occurrence; the Reducer sums them).
All key/value pairs are then written to intermediate files.
Finally the intermediate files are sorted and grouped by key, much like SQL's GROUP BY.
Then comes the shuffle.
Then the Reducer reads the data, one key group per call.
Note that the counter must not keep accumulating across key groups (otherwise sum ends up counting the whole txt); it has to be reset for every key group, as in the sketch below.
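A minimal sketch of the two patterns (the class name SumPatternReducer is only a placeholder); the broken version is kept as a comment:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class SumPatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // WRONG: accumulate into a field that is never reset. One Reducer object
    // handles many key groups, so the count keeps growing across groups and the
    // later keys end up carrying counts from the whole file.
    //
    //   private int sum = 0;
    //   protected void reduce(Text key, Iterable<IntWritable> values, Context context) ... {
    //       for (IntWritable iw : values) { sum += iw.get(); }
    //       context.write(key, new IntWritable(sum));
    //   }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // RIGHT: reset the counter at the start of every reduce() call,
        // i.e. once per key group, then sum only that group's values.
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}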
WordCount code
MyMapper:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Every word is emitted with a count of 1; the Reducer sums these up.
    IntWritable iw = new IntWritable(1);
    Text text = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // key is the byte offset of the line, value is the line itself
        String line = value.toString();
        String[] arr = line.split(" ");
        for (String str : arr) {
            text.set(str);
            context.write(text, iw);
        }
    }
}
MyReducer:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    int sum;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Reset the counter for every key group, then sum that group's values.
        sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        IntWritable i = new IntWritable(sum);
        context.write(key, i);
    }
}
MyDriver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class MyDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Wire up the job: the jar, the Mapper/Reducer classes, and the key/value types.
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file and output directory; the output directory must not already exist.
        FileInputFormat.setInputPaths(job, new Path("D:\\demo\\example\\news.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\demo\\example\\example_output"));

        job.waitForCompletion(true);
    }
}
Contents of the input file news.txt:
One artificial intelligence system for peritoneal carcinomatosis has been developed and published in top surgical journal from colorectal surgery of the Sixth Affiliated Hospital of Sun Yat-sen University
Last updated :2020-08-11
Source: The Sixth Affiliated Hospital
Written by: The Sixth Affiliated Hospital
Edited by: Tan Rongyu, Wang Dongmei
Peritoneal carcinomatosis (PC) is considered to be the terminal stage of colorectal cancer (CRC) and obtains poor prognosis. Currently, the imaging tools for detecting PC is limited by low sensitivity, especially for these small PC nodes < 5mm in size. Recently, the researchers in colorectal surgery in the Sixth Affiliated Hospital of Sun Yat-sen University (SYSU) of China has developed the first artificial intelligence (AI) system for diagnosis of PC in cooperation with Tencent AI lab in Shenzhen. The original article was published in the top surgical journal Annals of Surgery (Impact factor: 10.13 points) with the title “Development and Validation of an Image-based Deep Learning Algorithm for Detection of Synchronous Peritoneal Carcinomatosis in Colorectal Cancer”. Dr. Zixu Yuan from the Sixth Affiliated Hospital of SYSU is the first author, and Professor Hui Wang is the corresponding author. Vice chief doctor Jian Cai, Dr. Wuteng Cao in radiology, and Dr. Yebiao Zhao have made critical attributions to this article.
In the input directory:
Contents of the output file part-r-00000 (the input text is excerpted from www.sysu.edu.cn/en/news/new…):
5
(AI) 1
(CRC) 1
(Impact 1
(PC) 1
(SYSU) 1
10.13 1
5mm 1
:2020-08-11 1
< 1
AI 1
Affiliated 5
Algorithm 1
Annals 1
Cai, 1
Cancer”. 1
Cao 1
Carcinomatosis 1
China 1
Colorectal 1
Currently, 1
Deep 1
Detection 1
Dongmei 1
Dr. 3
Edited 1
Hospital 5
Hui 1
Image-based 1
Jian 1
Last 1
Learning 1
One 1
PC 3
Peritoneal 2
Professor 1
Recently, 1
Rongyu, 1
SYSU 1
Shenzhen. 1
Sixth 5
Source: 1
Sun 2
Surgery 1
Synchronous 1
Tan 1
Tencent 1
The 3
University 2
Validation 1
Vice 1
Wang 2
Written 1
Wuteng 1
Yat-sen 2
Yebiao 1
Yuan 1
Zhao 1
Zixu 1
an 1
and 5
article 1
article. 1
artificial 2
attributions 1
author, 1
author. 1
be 1
been 1
by 1
by: 2
cancer 1
carcinomatosis 2
chief 1
colorectal 3
considered 1
cooperation 1
corresponding 1
critical 1
detecting 1
developed 2
diagnosis 1
doctor 1
especially 1
factor: 1
first 2
for 5
from 2
has 2
have 1
imaging 1
in 9
intelligence 2
is 4
journal 2
lab 1
limited 1
low 1
made 1
nodes 1
obtains 1
of 10
original 1
peritoneal 1
points) 1
poor 1
prognosis. 1
published 2
radiology, 1
researchers 1
sensitivity, 1
size. 1
small 1
stage 1
surgery 2
surgical 2
system 2
terminal 1
the 11
these 1
this 1
title 1
to 2
tools 1
top 2
updated 1
was 1
with 2
“Development 1
In the example_output directory:
2) Hadoop Serialization
2.1) Key Points of Serialization
In enterprise development the commonly used basic serialization types often cannot cover every need. For example, to pass a bean object around inside the Hadoop framework, that object has to implement the serialization interface.
Implementing serialization for a bean object takes the following seven steps:
- 1. The bean must implement the Writable interface.
- 2. Deserialization reflectively calls the no-argument constructor, so a no-argument constructor is required:
public FlowBean() {
super();
}
- 3. Override the serialization method:
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
- 4. Override the deserialization method:
@Override
public void readFields(DataInput in) throws IOException {
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
- 5. Note that the order of deserialization must exactly match the order of serialization.
- 6. To make the result readable in the output file, override toString(); separating the fields with "\t" makes later processing easier (see the sketch after this list).
- 7. If the custom bean is to be transported as a key, it must also implement the Comparable interface, because the shuffle stage of the MapReduce framework requires keys to be sortable (see the sorting case study later on):
@Override
public int compareTo(FlowBean o) {
    // Descending order: larger sumFlow sorts first
    if (this.sumFlow == o.getSumFlow()) {
        return 0;
    }
    return this.sumFlow > o.getSumFlow() ? -1 : 1;
}
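For step 6, a minimal sketch of a toString() that separates the FlowBean fields with "\t" (the FlowBean in the next subsection uses spaces instead; either works, tabs are just easier to post-process):

// Sketch for step 6: tab-separated output so the fields line up in the result file.
@Override
public String toString() {
    return upFlow + "\t" + downFlow + "\t" + sumFlow;
}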
2.2) Serialization Case Study
Requirement: for every phone number, compute the total upstream traffic, total downstream traffic, and total traffic.
- 1. Input data
- 2. Input data format:
| id | Phone number | Network IP | Upstream traffic | Downstream traffic | Network status code |
|---|---|---|---|---|---|
| 7 | 13560436666 | 120.196.100.99 | 1116 | 954 | 200 |
- 3. Expected output data format:
| Phone number | Upstream traffic | Downstream traffic | Total traffic |
|---|---|---|---|
| 13560436666 | 1116 | 954 | 2070 |
Requirement analysis:
2.3) Writing the MapReduce Program
FlowBean:
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {
    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long sumFlow;   // total traffic

    // A no-argument constructor is required so the framework can
    // instantiate the bean reflectively during deserialization.
    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow, long sumFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // Serialization: write the fields in a fixed order...
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // ...and deserialization must read them back in exactly the same order.
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        // Controls how the bean is written to the output file (see step 6 above).
        return upFlow + " " + downFlow + " " + sumFlow;
    }
}
FlowCountMapper:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    Text text = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Fields are tab-separated; arr[1] is the phone number.
        String line = value.toString();
        String[] arr = line.split("\t");
        text.set(arr[1]);
        // Upstream and downstream traffic are the 3rd- and 2nd-to-last fields
        // (counted from the end, because the URL column is sometimes missing).
        FlowBean fb = new FlowBean(Long.parseLong(arr[arr.length - 3]), Long.parseLong(arr[arr.length - 2]));
        // Emit the key/value pair (phone number : traffic info)
        context.write(text, fb);
    }
}
FlowCountReducer:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        // Accumulate the upstream/downstream traffic of all records for this phone number.
        long totalUp = 0;
        long totalDown = 0;
        for (FlowBean fb : values) {
            totalUp += fb.getUpFlow();
            totalDown += fb.getDownFlow();
        }
        // The two-argument constructor computes sumFlow = up + down.
        FlowBean fb = new FlowBean(totalUp, totalDown);
        context.write(key, fb);
    }
}
FlowsumDriver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class FlowsumDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(FlowsumDriver.class);
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        // The custom bean is used as both the map output value and the final output value.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Input file and output directory; the output directory must not already exist.
        FileInputFormat.setInputPaths(job, new Path("D:\\demo\\input\\phone_data.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\demo\\input\\see"));

        job.waitForCompletion(true);
    }
}
Contents of phone_data.txt:
1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.atguigu.com 1527 2106 200
6 84188413 192.168.100.3 www.atguigu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.atguigu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
15 13682846555 192.168.100.12 www.qq.com 1938 2910 200
16 13992314666 192.168.100.13 www.gaga.com 3008 3720 200
17 13509468723 192.168.100.14 www.qinghua.com 7335 110349 404
18 18390173782 192.168.100.15 www.sogou.com 9531 2412 200
19 13975057813 192.168.100.16 www.baidu.com 11058 48243 200
20 13768778790 192.168.100.17 120 120 200
21 13568436656 192.168.100.18 www.alibaba.com 2481 24681 200
22 13568436656 192.168.100.19 1116 954 200
Contents of the output file part-r-00000:
13470253144 180 180 360
13509468723 7335 110349 117684
13560439638 918 4938 5856
13568436656 3597 25635 29232
13590439668 1116 954 2070
13630577991 6960 690 7650
13682846555 1938 2910 4848
13729199489 240 0 240
13736230513 2481 24681 27162
13768778790 120 120 240
13846544121 264 0 264
13956435636 132 1512 1644
13966251146 240 0 240
13975057813 11058 48243 59301
13992314666 3008 3720 6728
15043685818 3659 3538 7197
15910133277 3156 2936 6092
15959002129 1938 180 2118
18271575951 1527 2106 3633
18390173782 9531 2412 11943
84188413 4116 1432 5548
Contents of the output directory:
3) MapReduce Framework Internals
3.1) InputFormat Data Input
3.1.1) Splits and the Mechanism Determining MapTask Parallelism
- 1) Motivation
The parallelism of MapTasks determines the concurrency of the Map stage and therefore affects the processing speed of the whole Job. Consider:
Starting 8 MapTasks for 1G of data improves the cluster's concurrent processing capacity. Would starting 8 MapTasks for 1K of data also improve cluster performance?
Is more MapTask parallelism always better?
Which factors determine MapTask parallelism?
- 2) The mechanism that determines MapTask parallelism
Data block: a Block is how HDFS physically divides the data into chunks.
Data split: a split divides the input only logically; the data is not physically cut up and stored in pieces on disk.
==========================
As the figure below shows, with a block size of 128M but a split size of 100M, the splits leave 28M and 56M block fragments that must be read across nodes, wasting extra network bandwidth. Bad. In the next figure the block size and the split size are both 128M, which is the correct, efficient way.
In summary:
1) The Map-stage parallelism of a Job is decided by the number of splits the client computes when submitting the Job.
2) Each split is handled by one MapTask instance running in parallel.
3) By default, split size = BlockSize (see the sketch below).
4) Splitting does not consider the data set as a whole; every file is split individually.
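Point 3 comes from the way FileInputFormat derives the split size; a paraphrase of that rule (not a verbatim copy of the Hadoop source):

// The split size defaults to the block size, clamped between the configured
// minimum and maximum split sizes. With the defaults (minSize = 1,
// maxSize = Long.MAX_VALUE) this simply returns blockSize, which is why
// "split size = BlockSize" holds by default.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}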
MR framework overview
- Stages: Mapper ==> Shuffle ==> Reducer
- From the data-flow perspective: input data ==> InputFormat ==> Mapper ==> Shuffle ==> Reducer ==> OutputFormat ==> output data
- From the code perspective:
MapTask --> ReduceTask
map --> sort --> copy --> sort --> reduce
where
map --> sort works as follows:
and sort --> reduce as follows:
Actually submitting the job:
The configuration is copied and the client generates the split information:
Then the job is actually submitted.
With a split size of 128M, each MapTask reads 128M at a time.
InputFormat's job: 1. read the data; 2. generate the split information.
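If the default split size ever needs adjusting, the client can hint it when building the job; a minimal sketch using FileInputFormat's min/max split-size setters (the 256M/512M values are purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;

public class SplitSizeHint {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration());
        // Raising the minimum split size yields larger splits (fewer MapTasks);
        // lowering the maximum split size yields smaller splits (more MapTasks).
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // at least 256 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // at most 512 MB per split
    }
}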