MapReduce核心原理（中）携手创作，共同成长！这是我参与「掘金日新计划 · 8 月更文挑战」的第29天，点击查看活

携手创作，共同成长！这是我参与「掘金日新计划 · 8 月更文挑战」的第29天，点击查看活动详情

MapReduce 中的排序

MapTask 和 ReduceTask 都会对数据按key进行排序。该操作是 Hadoop 的默认行为，任何应用程序不管需不需要都会被排序。默认排序是字典顺序排序，排序方法是快速排序

下面介绍排序过程：

MapTask

它会将处理的结果暂时放到环形缓冲区中，当环形缓冲区使用率达到一定阈值后，再对缓冲区中的数据进行一次快速排序，并将这些有序数据溢写到磁盘
溢写完毕后，他会对磁盘所有文件进行归并排序

ReduceTask

当所有数据拷贝完后，会统一对内存和磁盘的所有数据进行一次归并排序。

排序方式

部分排序

MapReduce 根据输入记录的键值对数据集排序，保证输出的每个文件内部有序

全排序

最终输出结果只有一个文件，且文件内部有序。实现方式是只设置一个 ReduceTask，但是该方法在处理大型文件时效率极低。因为这样只有一台机器处理所有的文件，完全丧失了 MapReduce 所提供的并行架构

辅助排序（分组排序）

在 Reduce 端对 key 进行分组。应用于：在接受的 key 为 bean 对象时，想让一个或几个字段相同的 key 进入到同一个 reduce 方法时，可以采用分组排序。

二次排序

在自定义排序过程中，如果 compareTo 中的判断条件为两个即为二次排序。

排序接口 WritebleComparable

我们知道 MapReduce 过程是会对 key 进行排序的。那么如果我们将 Bean 对象作为 key 时，就需要实现 WritableComparable 接口并重写 compareTo 方法指定排序规则。

@Setter
@Getter
public class CustomSort implements WritableComparable<CustomSort> {

    private Long orderId;

    private String orderCode;



    @Override
    public int compareTo(CustomSort o) {
        return orderId.compareTo(o.orderId);
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeLong(orderId);
        dataOutput.writeUTF(orderCode);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.orderId = dataInput.readLong();
        this.orderCode = dataInput.readUTF();
    }
}

分组排序 GroupingComparator

GroupingComparator 是 Mapreduce 中 reduce 端的一个功能组件，主要的作用是决定哪些数据为一组，调用一次 reduce 逻辑。默认是每个不同的 key，作为不同的组。我们可以自定义 GroupingComparator 实现不同的 key 作为一个组，调用一次 reduce 逻辑。

案例实战：求出每一个订单中成交金额最大的一笔交易。下面的数据只给出了订单行的 id 和金额。订单行 id 中_前相等的算同一个订单

订单行 id	商品金额
order1_1	345
order1_2	4325
order1_3	44
order2_1	33
order2_2	11
order2_3	55

实现思路

Mapper:

读取一行文本数据，切分每个字段
把订单行 id 和金额封装为一个 bean 对象，作为 key，排序规则是订单行 id“_”前面的订单 id 来排序，如果订单 id 相等再按金额降序排
map 输出内容，key：bean 对象，value：NullWritable.get()

Shuffle:

自定义分区器，保证相同的订单 id 的数据去同一个分区

Reduce：

自定义 GroupingComparator，分组规则指定只要订单 id 相等则属于同一组
每个 reduce 方法写出同一组 key 的第一条数据就是最大金额的数据。

参考代码：

public class OrderBean implements WritableComparable<OrderBean> {

    private String orderLineId;

    private Double price;

    @Override
    public int compareTo(OrderBean o) {
        String orderId = o.getOrderLineId().split("_")[0];
        String orderId2 = orderLineId.split("_")[0];
        int compare = orderId.compareTo(orderId2);
        if(compare==0){
            return o.price.compareTo(price);
        }
        return compare;
    }

    @Override
    public void write(DataOutput dataOutput) throws IOException {
        dataOutput.writeUTF(orderLineId);
        dataOutput.writeDouble(price);
    }

    @Override
    public void readFields(DataInput dataInput) throws IOException {
        this.orderLineId = dataInput.readUTF();
        this.price = dataInput.readDouble();
    }
}

public class OrderMapper extends Mapper<LongWritable, Text,OrderBean, NullWritable> {

    @Override
    protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, OrderBean, NullWritable>.Context context) throws IOException, InterruptedException {
        String[] split = value.toString().split(" ");
        OrderBean orderBean=new OrderBean();
        orderBean.setOrderLineId(split[0]);
        orderBean.setPrice(Double.parseDouble(split[1]));

        context.write(orderBean,NullWritable.get());
    }
}

自定义分区：

public class OrderPartitioner extends Partitioner<OrderBean, NullWritable> {
    @Override
    public int getPartition(OrderBean orderBean, NullWritable nullWritable, int i) {
        //相同订单id的发到同一个reduce中去
        String orderId = orderBean.getOrderLineId().split("_")[0];
        return (orderId.hashCode() & Integer.MAX_VALUE) % i;
    }
}

组排序：

public class OrderGroupingComparator extends WritableComparator {

    public OrderGroupingComparator() {
        super(OrderBean.class,true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String aOrderId = ((OrderBean) a).getOrderLineId().split("_")[0];
        String bOrderId = ((OrderBean) b).getOrderLineId().split("_")[0];
        return aOrderId.compareTo(bOrderId);
    }
}

public class OrderReducer extends Reducer<OrderBean, NullWritable,OrderBean,NullWritable> {

    @Override
    protected void reduce(OrderBean key, Iterable<NullWritable> values, Reducer<OrderBean, NullWritable, OrderBean, NullWritable>.Context context) throws IOException, InterruptedException {
        context.write(key,NullWritable.get());
    }
}

driver 类：

public class OrderDriver {

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
//        System.setProperty("java.library.path","d://");
        Configuration conf = new Configuration();
        Job job=Job.getInstance(conf,"OrderDriver");

        //指定本程序的jar包所在的路径
        job.setJarByClass(OrderDriver.class);

        //指定本业务job要使用的mapper/Reducer业务类
        job.setMapperClass(OrderMapper.class);
        job.setReducerClass(OrderReducer.class);

        //指定mapper输出数据的kv类型
        job.setMapOutputKeyClass(OrderBean.class);
        job.setMapOutputValueClass(NullWritable.class);

        //指定reduce输出数据的kv类型
        job.setOutputKeyClass(OrderBean.class);
        job.setOutputValueClass(NullWritable.class);

        //指定job的输入文件目录和输出目录
        FileInputFormat.setInputPaths(job,new Path(args[0]));
        FileOutputFormat.setOutputPath(job,new Path(args[1]));

        job.setPartitionerClass(OrderPartitioner.class);
        job.setGroupingComparatorClass(OrderGroupingComparator.class);

//        job.setNumReduceTasks(2);
        boolean result = job.waitForCompletion(true);
        System.exit( result ? 0: 1);

    }
}

Shuffle 阶段数据的压缩机制

Hadoop 中支持的压缩算法

据压缩有两大好处，节约磁盘空间，加速数据在网络和磁盘上的传输！！

我们可以使用 bin/hadoop checknative 来查看我们编译之后的 hadoop 支持的各种压缩，如果出现 openssl 为 false，那么就在线安装一下依赖包！！

压缩格式	hadoop 自带	算法	文件扩展名	是否可切分	换压缩格式后，原来的程序是否需要修改
DEFLATE	是	DEFLATE	.deflate	否	不需要
Gzip	是	DEFLATE	.gz	否	不需要
bzip2	是	bzip2	.bz2	是	不需要
LZO	否	LZO	.lzo	是	需要建索引，还需要指定输入格式
Snappy	否	Snappy	.snappy	否	不需要

压缩效率对比：

压缩位置

Map 输入端压缩

此处使用压缩文件作为 Map 的输入数据，无需显示指定编解码方式，Hadoop 会自动检查文件扩展名，如果压缩方式能够匹配，Hadoop 就会选择合适的编解码方式进行压缩和解压。

Map 端输出压缩

Shuffle 是 MR 过程中资源消耗最多的阶段，如果有数据量过大造成网络传输速度缓慢，可以考虑使用压缩

Reduce 端输出压缩

输出的结果数据使用压缩能够减少存储的数据量，降低所需磁盘的空间，并且作为第二个 MR 的输入时可以复用压缩

压缩配置方式

在驱动代码中通过 Configuration 设置。

设置map阶段压缩
Configuration configuration = new Configuration();
configuration.set("mapreduce.map.output.compress","true");
configuration.set("mapreduce.map.output.compress.codec","org.apache.hadoop.i
o.compress.SnappyCodec");
设置reduce阶段的压缩
configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD"
);
configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.ap
ache.hadoop.io.compress.SnappyCodec");

配置 mapred-site.xml，这种方式是全局的，对所有 mr 任务生效

<property>   
<name>mapreduce.output.fileoutputformat.compress</name>
   <value>true</value>
</property>
<property>    
<name>mapreduce.output.fileoutputformat.compress.type</name>
   <value>RECORD</value>
</property>
<property>    
<name>mapreduce.output.fileoutputformat.compress.codec</name>
   <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>

压缩实战

在驱动代码中添加即可

configuration.set("mapreduce.output.fileoutputformat.compress","true");
configuration.set("mapreduce.output.fileoutputformat.compress.type","RECORD");
configuration.set("mapreduce.output.fileoutputformat.compress.codec","org.apache
.hadoop.io.compress.SnappyCodec");