Reference: 《尚硅谷大数据技术之Hadoop(MapReduce)》
1) MapReduce Overview
1.1) Principles
As shown above, a distributed computation program usually needs at least two stages:
- The MapTask instances of the first stage run fully in parallel and are independent of one another.
- The ReduceTask instances of the second stage are also independent of one another, but their input depends on the output of all MapTask instances of the previous stage.
- The MapReduce programming model allows only one Map stage and one Reduce stage. If the business logic is more complex than that, the only option is to chain multiple MapReduce programs and run them serially (see the sketch below).
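A minimal sketch of such serial chaining in a driver, reusing the WordCount MyMapper/MyReducer defined in section 1.3 purely as stand-ins (in a real chain each stage would have its own Mapper/Reducer; the paths come from args and are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainedDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // output of job 1, input of job 2
        Path output = new Path(args[2]);

        // Job 1: first Map + Reduce stage.
        Job job1 = Job.getInstance(conf, "stage-1");
        job1.setJarByClass(ChainedDriver.class);
        job1.setMapperClass(MyMapper.class);    // stand-in Mapper
        job1.setReducerClass(MyReducer.class);  // stand-in Reducer
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job1, input);
        FileOutputFormat.setOutputPath(job1, intermediate);
        if (!job1.waitForCompletion(true)) {
            System.exit(1);                     // stop the chain if job 1 fails
        }

        // Job 2: starts only after job 1 has finished, reading job 1's output.
        Job job2 = Job.getInstance(conf, "stage-2");
        job2.setJarByClass(ChainedDriver.class);
        job2.setMapperClass(MyMapper.class);    // stand-in Mapper
        job2.setReducerClass(MyReducer.class);  // stand-in Reducer
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(IntWritable.class);
        FileInputFormat.setInputPaths(job2, intermediate);
        FileOutputFormat.setOutputPath(job2, output);
        System.exit(job2.waitForCompletion(true) ? 0 : 1);
    }
}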
1.2) MapReduce Processes
A complete MapReduce program running in distributed mode has three kinds of instance processes:
(1) MrAppMaster: responsible for scheduling the whole program and coordinating its state.
(2) MapTask: responsible for the entire data-processing flow of the Map stage.
(3) ReduceTask: responsible for the entire data-processing flow of the Reduce stage.
1.3) The HelloWorld of MR: the WordCount Demo
Common data serialization types
Key logic
Notes
The Mapper object reads the raw input line by line.
Each time a line is read, the map method is called once; it splits the line into words and emits key/value pairs.
K is the word and V is its count (the Mapper emits 1 for every occurrence; the Reducer sums them).
All key/value pairs are then written to intermediate files.
Finally the intermediate files are sorted and grouped by key, much like SQL's GROUP BY.
Then comes the shuffle.
Then the Reducer reads the data, one key group per call.
Note that the counter must not keep accumulating across key groups (otherwise sum ends up counting the whole txt); it has to be reset for every key group, as in the sketch below.
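A minimal sketch of the two patterns (the class name SumPatternReducer is only a placeholder); the broken version is kept as a comment:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class SumPatternReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    // WRONG: accumulate into a field that is never reset. One Reducer object
    // handles many key groups, so the count keeps growing across groups and the
    // later keys end up carrying counts from the whole file.
    //
    //   private int sum = 0;
    //   protected void reduce(Text key, Iterable<IntWritable> values, Context context) ... {
    //       for (IntWritable iw : values) { sum += iw.get(); }
    //       context.write(key, new IntWritable(sum));
    //   }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // RIGHT: reset the counter at the start of every reduce() call,
        // i.e. once per key group, then sum only that group's values.
        int sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        context.write(key, new IntWritable(sum));
    }
}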
WordCount code
MyMapper:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // Every word is emitted with a count of 1; the Reducer sums these up.
    IntWritable iw = new IntWritable(1);
    Text text = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // key is the byte offset of the line, value is the line itself
        String line = value.toString();
        String[] arr = line.split(" ");
        for (String str : arr) {
            text.set(str);
            context.write(text, iw);
        }
    }
}
MyReducer:
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    int sum;

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        // Reset the counter for every key group, then sum that group's values.
        sum = 0;
        for (IntWritable iw : values) {
            sum += iw.get();
        }
        IntWritable i = new IntWritable(sum);
        context.write(key, i);
    }
}
MyDriver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class MyDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        // Wire up the job: the jar, the Mapper/Reducer classes, and the key/value types.
        job.setJarByClass(MyDriver.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input file and output directory; the output directory must not already exist.
        FileInputFormat.setInputPaths(job, new Path("D:\\demo\\example\\news.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\demo\\example\\example_output"));

        job.waitForCompletion(true);
    }
}
Contents of the input file news.txt:
One artificial intelligence system for peritoneal carcinomatosis has been developed and published in top surgical journal from colorectal surgery of the Sixth Affiliated Hospital of Sun Yat-sen University
Last updated :2020-08-11
Source: The Sixth Affiliated Hospital
Written by: The Sixth Affiliated Hospital
Edited by: Tan Rongyu, Wang Dongmei
Peritoneal carcinomatosis (PC) is considered to be the terminal stage of colorectal cancer (CRC) and obtains poor prognosis. Currently, the imaging tools for detecting PC is limited by low sensitivity, especially for these small PC nodes < 5mm in size. Recently, the researchers in colorectal surgery in the Sixth Affiliated Hospital of Sun Yat-sen University (SYSU) of China has developed the first artificial intelligence (AI) system for diagnosis of PC in cooperation with Tencent AI lab in Shenzhen. The original article was published in the top surgical journal Annals of Surgery (Impact factor: 10.13 points) with the title “Development and Validation of an Image-based Deep Learning Algorithm for Detection of Synchronous Peritoneal Carcinomatosis in Colorectal Cancer”. Dr. Zixu Yuan from the Sixth Affiliated Hospital of SYSU is the first author, and Professor Hui Wang is the corresponding author. Vice chief doctor Jian Cai, Dr. Wuteng Cao in radiology, and Dr. Yebiao Zhao have made critical attributions to this article.
In the input directory:
Contents of the output file part-r-00000 (the input text is excerpted from www.sysu.edu.cn/en/news/new…):
5
(AI) 1
(CRC) 1
(Impact 1
(PC) 1
(SYSU) 1
10.13 1
5mm 1
:2020-08-11 1
< 1
AI 1
Affiliated 5
Algorithm 1
Annals 1
Cai, 1
Cancer”. 1
Cao 1
Carcinomatosis 1
China 1
Colorectal 1
Currently, 1
Deep 1
Detection 1
Dongmei 1
Dr. 3
Edited 1
Hospital 5
Hui 1
Image-based 1
Jian 1
Last 1
Learning 1
One 1
PC 3
Peritoneal 2
Professor 1
Recently, 1
Rongyu, 1
SYSU 1
Shenzhen. 1
Sixth 5
Source: 1
Sun 2
Surgery 1
Synchronous 1
Tan 1
Tencent 1
The 3
University 2
Validation 1
Vice 1
Wang 2
Written 1
Wuteng 1
Yat-sen 2
Yebiao 1
Yuan 1
Zhao 1
Zixu 1
an 1
and 5
article 1
article. 1
artificial 2
attributions 1
author, 1
author. 1
be 1
been 1
by 1
by: 2
cancer 1
carcinomatosis 2
chief 1
colorectal 3
considered 1
cooperation 1
corresponding 1
critical 1
detecting 1
developed 2
diagnosis 1
doctor 1
especially 1
factor: 1
first 2
for 5
from 2
has 2
have 1
imaging 1
in 9
intelligence 2
is 4
journal 2
lab 1
limited 1
low 1
made 1
nodes 1
obtains 1
of 10
original 1
peritoneal 1
points) 1
poor 1
prognosis. 1
published 2
radiology, 1
researchers 1
sensitivity, 1
size. 1
small 1
stage 1
surgery 2
surgical 2
system 2
terminal 1
the 11
these 1
this 1
title 1
to 2
tools 1
top 2
updated 1
was 1
with 2
“Development 1
In the example_output directory:
2) Hadoop Serialization
2.1) Key Points of Serialization
In enterprise development the commonly used basic serialization types often cannot cover every need. For example, to pass a bean object around inside the Hadoop framework, that object has to implement the serialization interface.
Implementing serialization for a bean object takes the following seven steps:
- 1. The bean must implement the Writable interface.
- 2. Deserialization reflectively calls the no-argument constructor, so a no-argument constructor is required:
public FlowBean() {
super();
}
- 3. Override the serialization method:
@Override
public void write(DataOutput out) throws IOException {
out.writeLong(upFlow);
out.writeLong(downFlow);
out.writeLong(sumFlow);
}
- 4. Override the deserialization method:
@Override
public void readFields(DataInput in) throws IOException {
upFlow = in.readLong();
downFlow = in.readLong();
sumFlow = in.readLong();
}
- 5. Note that the order of deserialization must exactly match the order of serialization.
- 6. To make the result readable in the output file, override toString(); separating the fields with "\t" makes later processing easier (see the sketch after this list).
- 7. If the custom bean is to be transported as a key, it must also implement the Comparable interface, because the shuffle stage of the MapReduce framework requires keys to be sortable (see the sorting case study later on):
@Override
public int compareTo(FlowBean o) {
    // Descending order: larger sumFlow sorts first
    if (this.sumFlow == o.getSumFlow()) {
        return 0;
    }
    return this.sumFlow > o.getSumFlow() ? -1 : 1;
}
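For step 6, a minimal sketch of a toString() that separates the FlowBean fields with "\t" (the FlowBean in the next subsection uses spaces instead; either works, tabs are just easier to post-process):

// Sketch for step 6: tab-separated output so the fields line up in the result file.
@Override
public String toString() {
    return upFlow + "\t" + downFlow + "\t" + sumFlow;
}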
2.2) Serialization Case Study
Requirement: for every phone number, compute the total upstream traffic, total downstream traffic, and total traffic.
- 1. Input data
- 2. Input data format:
| id | Phone number | Network IP | Upstream traffic | Downstream traffic | Network status code |
|---|---|---|---|---|---|
| 7 | 13560436666 | 120.196.100.99 | 1116 | 954 | 200 |
- 3. Expected output data format:
| Phone number | Upstream traffic | Downstream traffic | Total traffic |
|---|---|---|---|
| 13560436666 | 1116 | 954 | 2070 |
Requirement analysis:
2.3) Writing the MapReduce Program
FlowBean:
import org.apache.hadoop.io.Writable;
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

public class FlowBean implements Writable {
    private long upFlow;    // upstream traffic
    private long downFlow;  // downstream traffic
    private long sumFlow;   // total traffic

    // A no-argument constructor is required so the framework can
    // instantiate the bean reflectively during deserialization.
    public FlowBean() {
    }

    public FlowBean(long upFlow, long downFlow, long sumFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = sumFlow;
    }

    public FlowBean(long upFlow, long downFlow) {
        this.upFlow = upFlow;
        this.downFlow = downFlow;
        this.sumFlow = upFlow + downFlow;
    }

    // Serialization: write the fields in a fixed order...
    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(upFlow);
        out.writeLong(downFlow);
        out.writeLong(sumFlow);
    }

    // ...and deserialization must read them back in exactly the same order.
    @Override
    public void readFields(DataInput in) throws IOException {
        upFlow = in.readLong();
        downFlow = in.readLong();
        sumFlow = in.readLong();
    }

    public long getUpFlow() {
        return upFlow;
    }

    public void setUpFlow(long upFlow) {
        this.upFlow = upFlow;
    }

    public long getDownFlow() {
        return downFlow;
    }

    public void setDownFlow(long downFlow) {
        this.downFlow = downFlow;
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public String toString() {
        // Controls how the bean is written to the output file (see step 6 above).
        return upFlow + " " + downFlow + " " + sumFlow;
    }
}
FlowCountMapper:
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import java.io.IOException;

public class FlowCountMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
    Text text = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // Fields are tab-separated; arr[1] is the phone number.
        String line = value.toString();
        String[] arr = line.split("\t");
        text.set(arr[1]);
        // Upstream and downstream traffic are the 3rd- and 2nd-to-last fields
        // (counted from the end, because the URL column is sometimes missing).
        FlowBean fb = new FlowBean(Long.parseLong(arr[arr.length - 3]), Long.parseLong(arr[arr.length - 2]));
        // Emit the key/value pair (phone number : traffic info)
        context.write(text, fb);
    }
}
FlowCountReducer:
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import java.io.IOException;

public class FlowCountReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
    @Override
    protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
        // Accumulate the upstream/downstream traffic of all records for this phone number.
        long totalUp = 0;
        long totalDown = 0;
        for (FlowBean fb : values) {
            totalUp += fb.getUpFlow();
            totalDown += fb.getDownFlow();
        }
        // The two-argument constructor computes sumFlow = up + down.
        FlowBean fb = new FlowBean(totalUp, totalDown);
        context.write(key, fb);
    }
}
FlowsumDriver:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import java.io.IOException;

public class FlowsumDriver {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf);

        job.setJarByClass(FlowsumDriver.class);
        job.setMapperClass(FlowCountMapper.class);
        job.setReducerClass(FlowCountReducer.class);

        // The custom bean is used as both the map output value and the final output value.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(FlowBean.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(FlowBean.class);

        // Input file and output directory; the output directory must not already exist.
        FileInputFormat.setInputPaths(job, new Path("D:\\demo\\input\\phone_data.txt"));
        FileOutputFormat.setOutputPath(job, new Path("D:\\demo\\input\\see"));

        job.waitForCompletion(true);
    }
}
Contents of phone_data.txt:
1 13736230513 192.196.100.1 www.atguigu.com 2481 24681 200
2 13846544121 192.196.100.2 264 0 200
3 13956435636 192.196.100.3 132 1512 200
4 13966251146 192.168.100.1 240 0 404
5 18271575951 192.168.100.2 www.atguigu.com 1527 2106 200
6 84188413 192.168.100.3 www.atguigu.com 4116 1432 200
7 13590439668 192.168.100.4 1116 954 200
8 15910133277 192.168.100.5 www.hao123.com 3156 2936 200
9 13729199489 192.168.100.6 240 0 200
10 13630577991 192.168.100.7 www.shouhu.com 6960 690 200
11 15043685818 192.168.100.8 www.baidu.com 3659 3538 200
12 15959002129 192.168.100.9 www.atguigu.com 1938 180 500
13 13560439638 192.168.100.10 918 4938 200
14 13470253144 192.168.100.11 180 180 200
15 13682846555 192.168.100.12 www.qq.com 1938 2910 200
16 13992314666 192.168.100.13 www.gaga.com 3008 3720 200
17 13509468723 192.168.100.14 www.qinghua.com 7335 110349 404
18 18390173782 192.168.100.15 www.sogou.com 9531 2412 200
19 13975057813 192.168.100.16 www.baidu.com 11058 48243 200
20 13768778790 192.168.100.17 120 120 200
21 13568436656 192.168.100.18 www.alibaba.com 2481 24681 200
22 13568436656 192.168.100.19 1116 954 200
Contents of the output file part-r-00000:
13470253144 180 180 360
13509468723 7335 110349 117684
13560439638 918 4938 5856
13568436656 3597 25635 29232
13590439668 1116 954 2070
13630577991 6960 690 7650
13682846555 1938 2910 4848
13729199489 240 0 240
13736230513 2481 24681 27162
13768778790 120 120 240
13846544121 264 0 264
13956435636 132 1512 1644
13966251146 240 0 240
13975057813 11058 48243 59301
13992314666 3008 3720 6728
15043685818 3659 3538 7197
15910133277 3156 2936 6092
15959002129 1938 180 2118
18271575951 1527 2106 3633
18390173782 9531 2412 11943
84188413 4116 1432 5548
Contents of the output directory:
3) MapReduce Framework Internals
3.1) InputFormat Data Input
3.1.1) Splits and the Mechanism Determining MapTask Parallelism
- 1) Motivation
The parallelism of MapTasks determines the concurrency of the Map stage and therefore affects the processing speed of the whole Job. Consider:
Starting 8 MapTasks for 1G of data improves the cluster's concurrent processing capacity. Would starting 8 MapTasks for 1K of data also improve cluster performance?
Is more MapTask parallelism always better?
Which factors determine MapTask parallelism?
- 2) The mechanism that determines MapTask parallelism
Data block: a Block is how HDFS physically divides the data into chunks.
Data split: a split divides the input only logically; the data is not physically cut up and stored in pieces on disk.
==========================
As the figure below shows, with a block size of 128M but a split size of 100M, the splits leave 28M and 56M block fragments that must be read across nodes, wasting extra network bandwidth. Bad. In the next figure the block size and the split size are both 128M, which is the correct, efficient way.
In summary:
1) The Map-stage parallelism of a Job is decided by the number of splits the client computes when submitting the Job.
2) Each split is handled by one MapTask instance running in parallel.
3) By default, split size = BlockSize (see the sketch below).
4) Splitting does not consider the data set as a whole; every file is split individually.
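Point 3 comes from the way FileInputFormat derives the split size; a paraphrase of that rule (not a verbatim copy of the Hadoop source):

// The split size defaults to the block size, clamped between the configured
// minimum and maximum split sizes. With the defaults (minSize = 1,
// maxSize = Long.MAX_VALUE) this simply returns blockSize, which is why
// "split size = BlockSize" holds by default.
long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
}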
MR framework overview
- Stages: Mapper ==> Shuffle ==> Reducer
- From the data-flow perspective: input data ==> InputFormat ==> Mapper ==> Shuffle ==> Reducer ==> OutputFormat ==> output data
- From the code perspective:
MapTask --> ReduceTask
map --> sort --> copy --> sort --> reduce
where
map --> sort works as follows:
and sort --> reduce as follows:
Actually submitting the job:
The configuration is copied and the client generates the split information:
Then the job is actually submitted.
With a split size of 128M, each MapTask reads 128M at a time.
InputFormat's job: 1. read the data; 2. generate the split information.
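If the default split size ever needs adjusting, the client can hint it when building the job; a minimal sketch using FileInputFormat's min/max split-size setters (the 256M/512M values are purely illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import java.io.IOException;

public class SplitSizeHint {
    public static void main(String[] args) throws IOException {
        Job job = Job.getInstance(new Configuration());
        // Raising the minimum split size yields larger splits (fewer MapTasks);
        // lowering the maximum split size yields smaller splits (more MapTasks).
        FileInputFormat.setMinInputSplitSize(job, 256L * 1024 * 1024); // at least 256 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 512L * 1024 * 1024); // at most 512 MB per split
    }
}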