HDFS exists to make distributed computing work better.
Computation moves to the data.

MapReduce (MR) workflow

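One kind of computation works on one record at a time: that is Map.
Another kind works on one group of records at a time: that is Reduce.
The smallest job has only a Map phase and no Reduce.
More complex jobs chain several of these together.
Each record goes through the map method and is mapped to key-value (KV) pairs. Records with the same key form a group, the reduce method is called once per group, and inside reduce the group is processed by iterating over it.
Rule of thumb: data sets are generally processed by iteration.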
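
The flow described above can be imitated in memory with plain Java. This is only a sketch of the idea, not Hadoop code: the class name MiniMapReduce and its methods are made up for illustration, and the grouping is done with a TreeMap instead of a real shuffle.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** In-memory imitation of map -> group by key -> reduce (hypothetical, not Hadoop). */
final class MiniMapReduce {

    /** Map step: one record in, N (word, 1) pairs out. */
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : record.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<String, Integer>(word, 1));
            }
        }
        return out;
    }

    /** Reduce step: called once per group, iterates over that group's values. */
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    /** Maps every record, groups the pairs by key (sorted), then reduces each group. */
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> kv : map(record)) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            result.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {hadoop=1, hello=2, yarn=1}
        System.out.println(run(Arrays.asList("hello hadoop", "hello yarn")));
    }
}
```

Note that map never looks at another record and reduce only ever sees one group, which is what lets the real framework run both in parallel.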

Map

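Processes records one at a time (records are produced by the InputFormat). Each record is handled independently; the map neither depends on nor cares about any other record.
One record in, N records out. Used for mapping, transformation, and filtering.
An HDFS block physically exists; a split is the logical unit used by the computation and equals the block size by default. The point is decoupling! Jobs can be I/O-intensive or CPU-intensive, so being able to choose a suitable split size is more flexible. The split controls the degree of parallelism.
Map input comes from HDFS. Parallelism is tied to splits: one split, one map task. A split carries the hosts that hold the block's replicas, which is what makes moving computation to the data possible.
Output is mapped to KV. Records with the same K form a group, and a group cannot be split. The partition P is computed from K, so each output record is really (K, V, P): it records which partition the record will finally go to.
How records are grouped depends on the requirement.
A map task's result is stored on the local machine, in the end as a single file.
If a map wrote each record straight to local disk as it processed it, that would be very inefficient: every I/O is a system call that switches from user mode to kernel mode. So output goes through buffered I/O, and the buffer is 100 MB by default. //todo: the buffer deserves its own write-up, a lot happens inside it! Records are sorted by partition first. Each spill creates a small file on local disk, so the task ends up with a pile of small files that are sorted internally (by partition) but not relative to each other. They are finally combined with a merge sort.
Two levels of sorting: ordered by partition, and ordered by key inside each partition! This lowers the cost on the reduce side.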
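
A minimal sketch of the two-level ordering described above, in plain Java. The Kvp class and sortForSpill method are hypothetical names, and the partition formula is the usual "non-negative hash modulo partition count" idea rather than a copy of framework code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class PartitionSortSketch {

    /** One map output record: key, value, and the partition computed from the key. */
    static final class Kvp {
        final String key;
        final int value;
        final int partition;

        Kvp(String key, int value, int numPartitions) {
            this.key = key;
            this.value = value;
            // Assumption for the sketch: non-negative hash modulo the partition count.
            this.partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    /** Returns a copy sorted by partition first, then by key inside each partition. */
    static List<Kvp> sortForSpill(List<Kvp> records) {
        List<Kvp> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.<Kvp>comparingInt(r -> r.partition)
                .thenComparing(r -> r.key));
        return sorted;
    }

    public static void main(String[] args) {
        List<Kvp> records = new ArrayList<>();
        for (String word : "yarn hello hdfs hello map".split(" ")) {
            records.add(new Kvp(word, 1, 2));
        }
        // Each line: partition, key, value. Equal keys end up next to each other.
        for (Kvp r : sortForSpill(records)) {
            System.out.println(r.partition + "\t" + r.key + "\t" + r.value);
        }
    }
}
```

Because equal keys always hash to the same partition and sit next to each other after sorting, a reducer can find each group by a single forward scan.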

Reduce

A group cannot be spread across multiple partitions; a group is the set of records that share the same feature (key).

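Computes one group at a time. Records with the same feature form a group: for K,V pairs, the same K is one group. The KV pairs are produced by the upstream Map.
Reduce parallelism does not correspond to the number of keys; it is set by the developer. **There could be a billion distinct keys with only two records in each group.** There cannot be a billion machines.
One reduce does not necessarily handle only one group. By default the framework runs 1 reduce.
One ReduceTask handles one partition!
A reduce gets an iterator over one partition. If a partition is very large (terabytes) and were passed into the reduce method as a whole, memory would overflow long before. That is what the iterator design pattern is for: one full sequential pass of I/O completes the computation.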

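Resource management and task scheduling

How is moving computation to the data actually implemented? How does the program get onto the DataNode (DN) nodes?

People are lazy, and that drives technology forward.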

1.x

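Roles

JobTracker (master): resource management and task scheduling. It is very busy and is a single point of failure. It has been retired.
- Fetches the split list from HDFS.
- Based on the resources reported by the TaskTrackers, decides which node each map task goes to.
- Each TaskTracker then picks up the tasks assigned to it during its heartbeat with the JobTracker.
TaskTracker (worker): task management and resource reporting
- Fetches tasks from the JobTracker (JT)
- Downloads the jar (this is the computation moving to the data)
- Starts the task
Client (the jar)
- For the data of each job, asks the NameNode (NN) for metadata (block information) and computes the splits. This yields the split list and therefore the number of maps. Each split carries its block information, which determines the nodes the map tasks should move to.
- Generates the configuration files the job will need at run time
- Uploads the jar, the split list, and the job configuration to HDFS (fairly reliable: the replication factor for these files defaults to 10, which also allows faster reads)
- Calls the JobTracker to announce that a job is about to start, and tells it where the uploaded files are on HDFS.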

New computation frameworks cannot reuse the JobTracker, so each one has to reinvent the wheel.

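2.x YARN

The App Master (ApplicationMaster) is roughly a JobTracker with resource management stripped out. It is started on demand.
The App Master takes the task list and asks the RM (ResourceManager) for resources.
A container is a description of resources.
Once a container starts, it registers with the App Master.
MR on YARN
The client uploads the split list, the job configuration, and the jar to HDFS, then contacts the RM to request an AppMaster.
The RM picks a node that is not busy and tells its NodeManager to start a container, in which an MRAppMaster is instantiated by reflection.
The AppMaster downloads the split list from HDFS and requests resources from the RM.
Based on the resources it knows about, the RM tells NodeManagers to start containers.
Once started, each container registers with the AppMaster.
The AppMaster sends messages to the containers telling them to start tasks.
Each container instantiates the corresponding Task class by reflection and calls its methods to run it.
Computation frameworks all have a retry mechanism for failed tasks. Conclusion: each job has its own lightweight AppMaster.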

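YARN only does resource management. Any computation framework that implements the YARN AppMaster interface can share one unified view of the resource layer.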

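YARN HA setup

DataNodes and NodeManagers correspond one to one.
Shuffle: the process of reduce pulling data from the maps.

Just follow the HA setup documentation.
The default web UI port is 8088. The CPU and memory shown there are virtual. On that page, mainly check the state of the cluster nodes.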

// MapReduce on YARN (this property goes in mapred-site.xml; the yarn.* properties further down go in yarn-site.xml)
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>

<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <!-- Enable HA -->
    <property>
        <name>yarn.resourcemanager.ha.enabled</name>
        <value>true</value>
    </property>
    <!-- ZooKeeper quorum -->
     <property>
       <name>yarn.resourcemanager.zk-address</name>
       <value>node2:2181,node3:2181,node4:2181</value>
     </property>
     <property>
       <name>yarn.resourcemanager.cluster-id</name>
       <value>yarn-cluster-id</value> <!-- placeholder: use your own cluster identifier -->
     </property>
     <property>
       <name>yarn.resourcemanager.ha.rm-ids</name>
       <value>rm1,rm2</value>
     </property>
     <!-- Hostname of each RM -->
     <property>
       <name>yarn.resourcemanager.hostname.rm1</name>
       <value>node3</value>
     </property>
     <property>
       <name>yarn.resourcemanager.hostname.rm2</name>
       <value>node4</value>
     </property>
</configuration>

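start-yarn.sh starts the NodeManagers according to the slaves file. It did not start the RMs here, so start each one manually: yarn-daemon.sh start resourcemanager

Run WordCount from the official examples. In the output, _SUCCESS is a marker file indicating the job succeeded, and part-r-00000 holds the result. The r in the middle means it is reduce output; for a map-only job it would be m.
When HDFS split the file into blocks it cut some words in two, yet the computed result is intact. (Think about why.)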

hadoop jar  hadoop-mapreduce-examples-2.6.5.jar   wordcount   /data/wc/input   /data/wc/output

MR development

<!-- This one dependency is enough -->
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-client -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <version>2.6.5</version>
</dependency>

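A Path object can give you its file system. The idiom path.getFileSystem(conf) appears all over the source code.
Distributed computing cannot avoid serialization and deserialization. Hadoop provides its own set of serializable types.

A custom type must implement the serialization and deserialization interface as well as the comparison interface (in Hadoop, WritableComparable covers both); sorting is just comparing. There are two kinds of ordering: lexicographic and numeric.

Developing a mapper means overriding the map method.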

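import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * @author huoyun
 * @date 2019/7/17-19:48
 */
public class MyMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    /**
     * Called once for every input record.
     *
     * @param key     byte offset of the record's first byte within the source file
     * @param value   text content of the record
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // StringTokenizer splits the line into words and is consumed like an iterator
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            // take each word in turn
            word.set(itr.nextToken());
            // word and one are reused objects passed by reference! This is safe because
            // write() serializes them right away, which breaks the link to the reference.
            context.write(word, one);
        }
    }
}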
==================================================================================================================
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

/**
 * @author huoyun
 * @date 2019/7/17-22:37
 */
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    /**
     * Records with the same key form a group; reduce is called once per group!
     *
     * @param key     a word
     * @param values  a bunch of 1s
     * @param context
     * @throws IOException
     * @throws InterruptedException
     */
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}

// Run with: hadoop jar demo-client-1.0-SNAPSHOT.jar com.huo.mapreduce.wc.MyWordCount

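Ways to submit an MR job

About job.setJar() and job.setJarByClass(): if you run the main method directly from the IDE, the classes are not loaded from a jar at all, so setJarByClass does not work because it cannot find a jar.

// If the job runs on the cluster, the jar must be uploaded to HDFS.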
/**
* Set the job's jar file by finding an example class location.
* 
* @param cls the example class.
*/
public void setJarByClass(Class cls) {
    // Uses the class loader to find the absolute path of the jar that contains the class.
    String jar = ClassUtil.findContainingJar(cls);
    if (jar != null) {
      setJar(jar);
    }   
}

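Development flow: package with Maven, upload the jar, and run hadoop jar xxx.jar args on the command line.
Submitting from code also runs on the cluster: mapreduce.framework.name = yarn selects MR on YARN, i.e. running on the cluster.
- The job learns where the RM is from conf
- Running main directly fails. The error shown on the 8088 page is: Application application_1563358934125_0004 failed 2 times due to AM Container for appattempt_1563358934125_0004_000002 exited with exitCode: 1. The AppMaster did not even start.
- If the code runs on Windows, set conf.set("mapreduce.app-submission.cross-platform", "true"); because Windows and Linux are heterogeneous platforms.
- With that setting enabled, it still fails: Error: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.huo.mapreduce.wc.MyMapper not found. The custom MyMapper cannot be found.
- The client also needs job.setJar("D:\\devlop\\代码\\马士兵VIP\\大数据\\demo-client\\target\\demo-client-1.0-SNAPSHOT.jar");.
Local single-machine debugging
- conf.set("mapreduce.framework.name","local"); MR on local
- No need to upload the jar.
- The cross-platform setting still has to be enabled.
- Note: Linux is the best platform for Hadoop. On Windows you need to
  - set up a HADOOP_HOME on Windows.
  - copy hadoop.dll into c:\windows\system32

A program accepts two kinds of arguments: -D xxx (global configuration) and ordinary arguments such as input and output. Both are passed to the main method in args. You would have to parse them by hand and put the -D ones into conf, but GenericOptionsParser does this for the programmer. This design idea is worth learning from.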
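
To make the idea concrete, here is a simplified stand-in written in plain Java. It is not Hadoop's GenericOptionsParser and does not mirror its implementation; the class DashDArgsSketch is hypothetical and only handles the two -D forms, storing them in a map where the real parser would fill a Configuration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: separates -D key=value settings from ordinary arguments. */
final class DashDArgsSketch {
    final Map<String, String> conf = new LinkedHashMap<>();
    final List<String> remaining = new ArrayList<>();

    DashDArgsSketch(String[] args) {
        for (int i = 0; i < args.length; i++) {
            String arg = args[i];
            String pair = null;
            if (arg.equals("-D") && i + 1 < args.length) {
                pair = args[++i];          // form: -D key=value
            } else if (arg.startsWith("-D") && arg.length() > 2) {
                pair = arg.substring(2);   // form: -Dkey=value
            }
            if (pair == null) {
                remaining.add(arg);        // ordinary argument such as input or output
                continue;
            }
            int eq = pair.indexOf('=');
            if (eq > 0) {
                conf.put(pair.substring(0, eq), pair.substring(eq + 1));
            }
            // A -D entry without '=' is ignored in this sketch.
        }
    }

    public static void main(String[] args) {
        DashDArgsSketch parsed = new DashDArgsSketch(new String[] {
            "-D", "mapreduce.job.reduces=2", "/data/wc/input", "/data/wc/output"});
        System.out.println(parsed.conf);       // {mapreduce.job.reduces=2}
        System.out.println(parsed.remaining);  // [/data/wc/input, /data/wc/output]
    }
}
```

The design point is the separation: framework-level settings are peeled off first, and the program's own main logic only sees the arguments it cares about.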

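Source code analysis

Goal: understand the technical details better. What distributed computing pursues is moving computation to the data.

client

No computation happens here.

The client entry point is job.waitForCompletion(true);. submit is asynchronous; passing true makes the client monitor the job and print its progress.
Inside submit, the submitter is obtained with final JobSubmitter submitter = getJobSubmitter(cluster.getFileSystem(),cluster.getClient());
The submitter's submitJobInternal() method does five things: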

Internal method for submitting jobs to the system.
The job submission process involves:
    1.Checking the input and output specifications of the job.
    2.Computing the InputSplits for the job.
    3.Setup the requisite accounting information for the DistributedCache of the job, if necessary.
    4.Copying the job's jar and configuration to the map-reduce system directory on the distributed file-system.
    5.Submitting the job to the JobTracker and optionally monitoring it's status.

In submitJobInternal, writeSplits produces the number of splits, i.e. the number of maps.

    // The framework's default input format class is TextInputFormat
    InputFormat<?, ?> input = ReflectionUtils.newInstance(job.getInputFormatClass(), conf);
    List<InputSplit> splits = input.getSplits(job);

getSplits is implemented by FileInputFormat. It is the most important method on the client side.

// Bounds used to compute the split size
long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job)); // 1 by default
long maxSize = getMaxSplitSize(job); // Long.MAX_VALUE by default
// Get the location info of all blocks of the file
blkLocations = fs.getFileBlockLocations(file, 0, length);
long blockSize = file.getBlockSize();
// Compute the split size; with the defaults, one split = one block
long splitSize = computeSplitSize(blockSize, minSize, maxSize);
// The split's start offset falls inside some block; the hosts are taken from that block
long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
    int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
    // makeSplit: important!
    splits.add(makeSplit(path, length-bytesRemaining, splitSize,
        blkLocations[blkIndex].getHosts(),
        blkLocations[blkIndex].getCachedHosts()));
    bytesRemaining -= splitSize;
}

A split has four important attributes: file (the file it belongs to), offset, length, and hosts (for moving computation to the data).
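
The arithmetic behind the split list can be tried out without a cluster. The sketch below is plain Java modelled on the FileInputFormat logic: computeSplitSize is max(minSize, min(maxSize, blockSize)) and SPLIT_SLOP is 1.1, so the last split may be up to 10% larger than the others. The class SplitSketch is hypothetical, and host lookup is left out.

```java
import java.util.ArrayList;
import java.util.List;

final class SplitSketch {
    /** The tail of a file is folded into the last split if it is within 10% of a split. */
    static final double SPLIT_SLOP = 1.1;

    /** With the defaults (minSize = 1, maxSize = Long.MAX_VALUE) this returns blockSize. */
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Returns {offset, length} pairs covering a file of the given length. */
    static List<long[]> splits(long fileLength, long splitSize) {
        List<long[]> result = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            result.add(new long[] {fileLength - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // Whatever is left becomes the final (possibly larger or smaller) split.
            result.add(new long[] {fileLength - bytesRemaining, bytesRemaining});
        }
        return result;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = computeSplitSize(128 * mb, 1L, Long.MAX_VALUE);
        for (long[] s : splits(260 * mb, splitSize)) {
            System.out.println("offset=" + s[0] / mb + "MB length=" + s[1] / mb + "MB");
        }
    }
}
```

With a 128 MB block, a 260 MB file yields two splits (128 MB and 132 MB) instead of three. Raising minSize above the block size makes splits larger than a block, and lowering maxSize below it makes them smaller, which is how the split size is tuned.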

MapTask

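Look at the run method of the MapTask class.
When the job has a reduce phase, the map and sort phases each get their own progress weight.
With the new API, runNewMapper is used. Go into that method and find the try block; in this method the Mapper object is instantiated by reflection from the class the client set with setMapperClass.

input

The map input phase, put plainly, comes down to obtaining a LineRecordReader:
1. The input goes through an initialization step (if this is not the first split, the offset moves down one line!).
2. Records are processed one by one: check whether there is a next record and, if so, return its KV.

TextInputFormat -> createRecordReader() creates the record reader, which ends up being a LineRecordReader. This is what does the real work.
mapper.run(mapperContext); the mapperContext passed in here holds the line record reader.
If a split's offset is not 0, i.e. it is not the first split, its first line is skipped, because that record may have been cut by the split boundary and the previous split reads one extra line to cover it. LineRecordReader's initialization method discards the first line of the split.

// LineRecordReader's initialization method.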
    // If this is not the first split, we always throw away first record
    // because we always (except the last split) read one extra line in
    // next() method.
    if (start != 0) {
      start += in.readLine(new Text(), 0, maxBytesToConsume(start));
    }
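
The effect of "skip the first line, read one line past the end" can be simulated on a byte array. This is a plain Java sketch under simplifying assumptions: lines end with '\n', there is no compression, and a split is just an offset and a length. SplitLineReaderSketch is a hypothetical class, not the real LineRecordReader.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SplitLineReaderSketch {

    /** Returns the index just past the next '\n' at or after pos (or data.length). */
    static int endOfLine(byte[] data, int pos) {
        while (pos < data.length && data[pos] != '\n') {
            pos++;
        }
        return Math.min(pos + 1, data.length);
    }

    /** Reads the lines that belong to the split [start, start + length). */
    static List<String> readSplit(byte[] data, int start, int length) {
        int end = start + length;
        int pos = start;
        if (start != 0) {
            // Not the first split: throw away the first (possibly partial) line.
            pos = endOfLine(data, pos);
        }
        List<String> lines = new ArrayList<>();
        // A line belongs to this split if it starts at or before 'end', so the
        // last line may run past the boundary: that is the "one extra line".
        while (pos <= end && pos < data.length) {
            int next = endOfLine(data, pos);
            int textEnd = (data[next - 1] == '\n') ? next - 1 : next;
            lines.add(new String(data, pos, textEnd - pos, StandardCharsets.UTF_8));
            pos = next;
        }
        return lines;
    }

    public static void main(String[] args) {
        byte[] data = "hello hadoop\nhello yarn\nhello hdfs\n".getBytes(StandardCharsets.UTF_8);
        // Splits of 10 bytes cut every line in two, yet each line is read exactly once.
        for (int start = 0; start < data.length; start += 10) {
            int length = Math.min(10, data.length - start);
            System.out.println("split@" + start + " -> " + readSplit(data, start, length));
        }
    }
}
```

This is also why a word cut in half by a block boundary still shows up intact in the WordCount result: the split that contains the start of a line reads that whole line, and the next split skips the fragment.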

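LineRecordReader -> nextKeyValue(): the line record reader checks whether there is a next line and, at the same time, assigns the key and value.

// The one actually doing the work inside this context is LineRecordReader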
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    try {
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      cleanup(context);
    }
  }

output

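Map output is handled by NewOutputCollector.
Output is KVP, written to the local buffer (the collector), which is a byte array.
Sorting inside the local buffer uses quicksort by default. The buffer is 100 MB by default, and a spill starts when it is 80% full.

Partitioner

When the number of reduces, i.e. the number of partitions, is greater than 1, the default partitioner is HashPartitioner.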

  public Class<? extends Partitioner<?,?>> getPartitionerClass() 
     throws ClassNotFoundException {
    return (Class<? extends Partitioner<?,?>>) 
      conf.getClass(PARTITIONER_CLASS_ATTR, HashPartitioner.class);
  }

// The partition method in HashPartitioner.
  public int getPartition(K key, V value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
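
A small plain Java sketch of why the hash is masked with Integer.MAX_VALUE before the modulo: hashCode() can be negative, and in Java a negative number modulo a positive one is negative, which is not a valid partition number. HashPartitionSketch is a hypothetical class that only reuses the formula shown above.

```java
final class HashPartitionSketch {

    /** Same formula as above: clear the sign bit, then take the remainder. */
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        Integer key = -7; // Integer.hashCode() is the value itself, so the hash is negative
        System.out.println(key.hashCode() % 3);    // prints -1, not usable as a partition
        System.out.println(partitionFor(key, 3));  // prints 1, always within [0, 3)
    }
}
```

Math.abs would not be a safe substitute here, because Math.abs(Integer.MIN_VALUE) is still negative; clearing the sign bit works for every hash value.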

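When the number of partitions is 1, i.e. reduceNum is 1, the partition number is simply 0!

  // partitions is always 1 here. Not sure why it computes partitions - 1 instead of returning 0 directly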
  @Override
  public int getPartition(K key, V value, int numPartitions) {
    return partitions - 1;
  }

The context.write() called from the map method of the Mapper class ends up calling write in NewOutputCollector.

protected void map(KEYIN key, VALUEIN value, Context context) throws IOException, InterruptedException {
  context.write((KEYOUT) key, (VALUEOUT) value);
}

// The call above ends up in
@Override
public void write(K key, V value) throws IOException, InterruptedException {
  // The record is now KVP
  collector.collect(key, value, partitioner.getPartition(key, value, partitions));
}

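The collector in the code above is the buffer that spills to local disk.

// By default the buffer is 100 MB in total and a spill starts when it is 80% full
final float spillper = job.getFloat(JobContext.MAP_SORT_SPILL_PERCENT, (float)0.8);
final int sortmb = job.getInt(JobContext.IO_SORT_MB, 100);

// Sorter
// Sorts by partition, then by key; quicksort by default
sorter = ReflectionUtils.newInstance(job.getClass("map.sort.class", QuickSort.class, IndexedSorter.class), job);

comparator = job.getOutputKeyComparator(); is the comparator. If one has been specified it is used; otherwise the key type's own comparator is used.

Why not always use the key's own comparator? The key might be of type Text while you want numeric ordering. This is more flexible: you only need to write a comparator. It is decoupling again.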
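
The difference between the two orderings is easy to see with string keys in plain Java. This sketch uses java.util.Comparator rather than Hadoop's comparator classes, and the NUMERIC comparator is a hypothetical example that assumes every key parses as a long.

```java
import java.util.Arrays;
import java.util.Comparator;

final class KeyOrderSketch {

    /** Hypothetical comparator: treats text keys as integers. */
    static final Comparator<String> NUMERIC =
            (a, b) -> Long.compare(Long.parseLong(a), Long.parseLong(b));

    /** Returns a sorted copy, leaving the input untouched. */
    static String[] sorted(String[] keys, Comparator<String> order) {
        String[] copy = keys.clone();
        Arrays.sort(copy, order);
        return copy;
    }

    public static void main(String[] args) {
        String[] keys = {"10", "9", "100", "2"};
        // Lexicographic (the text type's own order): [10, 100, 2, 9]
        System.out.println(Arrays.toString(sorted(keys, Comparator.<String>naturalOrder())));
        // Numeric (custom comparator plugged in): [2, 9, 10, 100]
        System.out.println(Arrays.toString(sorted(keys, NUMERIC)));
    }
}
```

The key type stays the same in both cases; only the comparator handed to the sort changes, which is the decoupling the notes point at.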

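The combiner is a small reduce that runs on the map side once the map has produced its output. The framework has no combiner by default.
spillThread.setDaemon(true);spillThread.setName("SpillThread"); This is the spill thread, dedicated to spilling.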

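Map output buffer: MapOutputBuffer (collector)

[Figure: the map output ring buffer. This one diagram is enough.]

The buffer is a ring buffer; underneath it is still a byte array.
KV data and index entries are written outward from the two sides of the equator.
Each index entry has a fixed width of 16 bytes:
- partition number: 1 int
- start position of k: 1 int
- start position of v: 1 int
- length of v: 1 int
When the buffer reaches 80%, the spill thread starts, and sorting happens as part of the spill. Meanwhile the map output thread keeps writing: it cuts the remaining 20% in the middle and uses that point as the new equator. Sorting moves only the index entries, so when the spill walks the index, the data comes out in sorted order.
Tuning point: the combiner (a reduce inside the map, computed per key). It runs after sorting and before the spill, so the spill I/O gets smaller! The combiner may run zero, one, or several times, so applying it repeatedly must not change the final result (for example sum or max, which are commutative and associative).
Each spill of the buffer produces a small file, and in the end these small files are merged into one large file. minSpillsForCombine = 3: if there are at least 3 spill files, the combiner is run again during the merge to improve efficiency.
Sequential disk reads and writes are fast; random reads and writes are very slow.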
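
Merging several sorted spill files into one sorted file is a k-way merge, and each input is only ever read front to back, which is why it suits sequential disk I/O. Below is a minimal plain Java sketch using in-memory lists in place of files; SpillMergeSketch and its Head class are hypothetical names, and the real merger additionally works per partition and on serialized bytes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

final class SpillMergeSketch {

    /** The current head of one sorted run, plus the iterator it came from. */
    private static final class Head {
        final String value;
        final Iterator<String> rest;

        Head(String value, Iterator<String> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /** Merges already-sorted runs into one sorted list, reading each run once. */
    static List<String> merge(List<List<String>> sortedRuns) {
        PriorityQueue<Head> heap = new PriorityQueue<>((a, b) -> a.value.compareTo(b.value));
        for (List<String> run : sortedRuns) {
            Iterator<String> it = run.iterator();
            if (it.hasNext()) {
                heap.add(new Head(it.next(), it));
            }
        }
        List<String> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            Head smallest = heap.poll();       // smallest head among all runs
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {     // refill from the same run
                heap.add(new Head(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<List<String>> spills = Arrays.asList(
            Arrays.asList("hadoop", "yarn"),
            Arrays.asList("hdfs", "hello"),
            Arrays.asList("hello", "map"));
        System.out.println(merge(spills)); // [hadoop, hdfs, hello, hello, map, yarn]
    }
}
```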

ReduceTask

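A ReduceTask gets one iterator over the data of one partition (the real iterator).
When the reduce method is called, the group's data is not actually loaded into memory. reduce gets an iterator over that group (the fake iterator).
    The fake iterator's hasNext checks nextKeyIsSame.
    next() calls nextKeyValue(), which pulls data from the real iterator and updates nextKeyIsSame.

Because the data is already sorted, the real and fake iterators cooperate and everything is done in a single pass of I/O.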
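
The cooperation between the two iterators can be sketched in plain Java. This is an illustration of the idea, not the framework's code: GroupIteratorSketch is a hypothetical class, and instead of a nextKeyIsSame flag it keeps one look-ahead record, which plays the same role.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NoSuchElementException;

final class GroupIteratorSketch {
    private final Iterator<Map.Entry<String, Integer>> real; // whole partition, sorted by key
    private Map.Entry<String, Integer> lookahead;            // next unread record, or null
    private String currentKey;

    GroupIteratorSketch(Iterator<Map.Entry<String, Integer>> sortedRecords) {
        this.real = sortedRecords;
        advance();
    }

    private void advance() {
        lookahead = real.hasNext() ? real.next() : null;
    }

    /** Skips what is left of the current group, then moves on to the next key. */
    boolean nextKey() {
        while (lookahead != null && lookahead.getKey().equals(currentKey)) {
            advance();
        }
        if (lookahead == null) {
            return false;
        }
        currentKey = lookahead.getKey();
        return true;
    }

    String currentKey() {
        return currentKey;
    }

    /** The "fake" iterator: it ends when the key changes, not when the data ends. */
    Iterator<Integer> values() {
        return new Iterator<Integer>() {
            @Override
            public boolean hasNext() {
                return lookahead != null && lookahead.getKey().equals(currentKey);
            }

            @Override
            public Integer next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                Integer value = lookahead.getValue();
                advance(); // pulls the next record from the real iterator
                return value;
            }
        };
    }

    /** Usage: sums the values of each group in one forward pass. */
    static Map<String, Integer> sumPerKey(List<Map.Entry<String, Integer>> sorted) {
        Map<String, Integer> sums = new LinkedHashMap<>();
        GroupIteratorSketch groups = new GroupIteratorSketch(sorted.iterator());
        while (groups.nextKey()) {
            int sum = 0;
            for (Iterator<Integer> it = groups.values(); it.hasNext();) {
                sum += it.next();
            }
            sums.put(groups.currentKey(), sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>();
        sorted.add(new SimpleEntry<String, Integer>("hadoop", 1));
        sorted.add(new SimpleEntry<String, Integer>("hello", 1));
        sorted.add(new SimpleEntry<String, Integer>("hello", 1));
        System.out.println(sumPerKey(sorted)); // {hadoop=1, hello=2}
    }
}
```

At no point is a whole group held in memory: only one look-ahead record exists, so a group of any size costs the same small amount of memory.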

input

rIter = shuffleConsumerPlugin.run(); returns the iterator over one partition, the real iterator!

  /** Start processing next unique key. */
  public boolean nextKey() throws IOException,InterruptedException {
    while (hasMore && nextKeyIsSame) {
      nextKeyValue();
    }
    if (hasMore) {
      if (inputKeyCounter != null) {
        inputKeyCounter.increment(1);
      }
      return nextKeyValue();
    } else {
      return false;
    }
  }
