Big Data Development: Hive Data Compression Formats (Part 21)

1. Common Data Compression Formats

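So far, Hive has been using its default TextFile format. Data in this format takes up a lot of storage space, which hurts both storage capacity and compute efficiency. To improve Hive's storage capacity and computing performance, we need to look beyond the default storage format. First, though, we need to understand the compression formats used by MapReduce, because a storage format only reaches its full performance when it is combined with a compression format.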

1.1 Common compression formats

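Compression format	File extension	Splittable	Compression ratio	Compression speed	Decompression speed
deflate	.deflate	No	Medium	Medium	Medium
Gzip	.gz	No	Medium	Medium	Medium
bzip2	.bz2	Yes	High	Low	Low
Lz4	.lz4	No	Low	High	High
Lzo (must be installed on Hadoop 3.x)	.lzo	Yes (after building an index)	Low	High	High
Snappy	.snappy	No	Low	High	High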

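Whether a format is splittable is decided in code. The classes involved are CompressionCodec, plus InputFormat and RecordReader from the org.apache.hadoop.mapreduce package.

In the IDE, select the RecordReader class and press Command+Alt+B to list its implementations, then open LineRecordReader.

Its initialize method handles a compressed input file with this branch, which reads by block only when the codec implements SplittableCompressionCodec: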

if (codec instanceof SplittableCompressionCodec) {
        final SplitCompressionInputStream cIn =
          ((SplittableCompressionCodec)codec).createInputStream(
            fileIn, decompressor, start, end,
            SplittableCompressionCodec.READ_MODE.BYBLOCK);
        in = new CompressedSplitLineReader(cIn, job,
            this.recordDelimiterBytes);
        start = cIn.getAdjustedStart();
        end = cIn.getAdjustedEnd();
        filePosition = cIn;
}

Query the current Hadoop's

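Splittable

Indicates whether a compressed file yields more than one InputSplit when MapReduce reads it. If the format is not splittable, the file produces a single Map task no matter how large it is. When the compressed file is small, around 100 MB, that has little effect on performance. When it is large, say 1 GB, a single Map task has to process all of it, so there is no parallelism and performance suffers. Splittability is therefore an important property, especially when we cannot control the size of individual compressed files.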
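
The effect of splittability on parallelism can be sketched with a little arithmetic. The class below is a simplified model, not Hadoop's own code: `SplitEstimator` and `estimateSplits` are hypothetical names, the 128 MB split size is an assumption (the usual default HDFS block size), and the 1.1 slop factor mirrors the rule FileInputFormat applies when it cuts one file into splits.

```java
// Simplified model of how many input splits (and so Map tasks) one file produces.
final class SplitEstimator {
    // The last split may be up to 10% larger than the split size.
    private static final double SPLIT_SLOP = 1.1;

    static long estimateSplits(long fileSize, long splitSize, boolean splittable) {
        if (fileSize <= 0 || splitSize <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        if (!splittable) {
            return 1; // a non-splittable file is always read by a single Map task
        }
        long splits = 0;
        long remaining = fileSize;
        while ((double) remaining / splitSize > SPLIT_SLOP) {
            splits++;
            remaining -= splitSize;
        }
        if (remaining != 0) {
            splits++; // whatever is left becomes the final split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long splitSize = 128 * mb; // assumed split size
        System.out.println(estimateSplits(1024 * mb, splitSize, false)); // prints 1
        System.out.println(estimateSplits(1024 * mb, splitSize, true));  // prints 8
        System.out.println(estimateSplits(100 * mb, splitSize, true));   // prints 1
    }
}
```

Under these assumptions a 1 GB file in a non-splittable format is handled by one Map task, while the same amount of data in a splittable format is spread over eight.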

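Compression ratio

Describes how well the format compresses. The higher the ratio, the better the compression and the smaller the resulting file.

Compression speed

The time it takes to compress the original file into the given format. This time is part of the job's total running time.

Decompression speed

The time it takes to decompress a file in the given format back to the original, which matters because MapReduce has to decompress a compressed file before it can use it.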
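
These three metrics can be measured on a small scale with the JDK alone. The sketch below uses `java.util.zip` for gzip, which stands in for the Hadoop codecs and is not the same implementation; `GzipMetrics` and its methods are hypothetical names, and the sample lines merely imitate the "hello_N you_N" shape of the test data used later. Timings vary by machine, so no expected numbers are given.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: measures gzip ratio and speed on sample data.
final class GzipMetrics {

    // Builds lines shaped like "hello_0 you_0", one per line.
    static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("hello_").append(i).append(" you_").append(i).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                bos.write(buffer, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bos.toByteArray();
    }

    // Compression ratio expressed as original size divided by compressed size.
    static double ratio(byte[] raw, byte[] compressed) {
        return (double) raw.length / compressed.length;
    }

    public static void main(String[] args) {
        byte[] raw = sampleData(1_000_000);
        long t0 = System.nanoTime();
        byte[] packed = gzip(raw);
        long t1 = System.nanoTime();
        byte[] restored = gunzip(packed);
        long t2 = System.nanoTime();
        System.out.printf("ratio=%.2f compress=%d ms decompress=%d ms restored=%d bytes%n",
                ratio(raw, packed), (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, restored.length);
    }
}
```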

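1.2 Advice on choosing a compression format

- Use a file format that has compression built in and supports splitting, such as Sequence File, RCFile or ORC.
- Use a compression format that supports splitting, such as Bzip2 or Lzo.
- Split a large file into several chunks in advance and compress each chunk separately, so that splittability is no longer a concern.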
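
The third option can be illustrated with a minimal sketch that uses only the JDK. It cuts a list of lines into fixed-size chunks and gzips each chunk on its own, so every chunk can be decompressed without the others, which is what lets separate Map tasks read them in parallel. `ChunkedGzip`, `compressInChunks` and `readChunk` are hypothetical names, and a real job would write each chunk to its own file rather than keep it in memory.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compresses data as independent gzip chunks.
final class ChunkedGzip {

    static List<byte[]> compressInChunks(List<String> lines, int linesPerChunk) {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        List<byte[]> chunks = new ArrayList<>();
        for (int from = 0; from < lines.size(); from += linesPerChunk) {
            int to = Math.min(from + linesPerChunk, lines.size());
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            // Closing the writer finishes the gzip stream for this chunk.
            try (Writer w = new OutputStreamWriter(new GZIPOutputStream(bos), StandardCharsets.UTF_8)) {
                for (String line : lines.subList(from, to)) {
                    w.write(line);
                    w.write('\n');
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            chunks.add(bos.toByteArray());
        }
        return chunks;
    }

    // Decompresses a single chunk without touching any other chunk.
    static List<String> readChunk(byte[] chunk) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(chunk)), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }
}
```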

1.3 Where to compress

Compression can be configured in two places: on the output of the Map stage and on the output of the Reduce stage.

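Map stage

Choose a format that compresses and decompresses quickly. Map output is written to disk and then shuffled, that is, sent over the network to the Reduce side, so compressing it improves network transfer efficiency.

Compressing Map output does cost extra CPU, however, and the Map stage already uses a lot of CPU just to process the data.

So the priority here is fast compression and decompression: Lzo or Snappy.

Reduce stage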

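The output of the Reduce stage falls into two scenarios:
- If the result is to be stored permanently, favor formats with a good compression ratio: Bzip2 or Gzip.
- If the result will be processed by another MapReduce job, focus on whether the compressed file can be split, for example bzip2 or Lzo.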
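
In Hadoop these two places correspond to two pairs of configuration properties, which can be passed as `-D` options to a job that parses them with GenericOptionsParser. The sketch below only builds the option strings; `CompressionOptions`, `mapOutput` and `jobOutput` are hypothetical helper names, and the choice of Snappy for Map output and Bzip2 for job output is just one combination that follows the advice above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: builds the -D options that switch compression on.
final class CompressionOptions {

    // Options for compressing the intermediate Map output.
    static List<String> mapOutput(String codecClass) {
        return Arrays.asList(
                "-Dmapreduce.map.output.compress=true",
                "-Dmapreduce.map.output.compress.codec=" + codecClass);
    }

    // Options for compressing the final job (Reduce) output.
    static List<String> jobOutput(String codecClass) {
        return Arrays.asList(
                "-Dmapreduce.output.fileoutputformat.compress=true",
                "-Dmapreduce.output.fileoutputformat.compress.codec=" + codecClass);
    }

    public static void main(String[] args) {
        String snappy = "org.apache.hadoop.io.compress.SnappyCodec";
        String bzip2 = "org.apache.hadoop.io.compress.BZip2Codec";
        System.out.println(String.join(" ", mapOutput(snappy)));
        System.out.println(String.join(" ", jobOutput(bzip2)));
    }
}
```

The other built-in codec classes live in the same package: DefaultCodec (deflate), GzipCodec and Lz4Codec.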

2. Data Compression Demo

2.1 Code to generate a 2 GB test file

package com.strivelearn.hadoop.hdfs.compress;

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;

/**
 * Generates a test file of roughly 2 GB.
 *
 * @author xys
 * @version GenerateData.java, 2022-10-19
 */
public class GenerateData {
    public static void main(String[] args) throws IOException {
        String fileName = "/Users/strivelearn/Desktop/words.data";
        System.out.println("start: generating the 2G file -> " + fileName);
        // try-with-resources flushes and closes the writer even if a write fails
        try (BufferedWriter bfw = new BufferedWriter(new FileWriter(fileName))) {
            int num = 0;
            while (num < 75000000) {
                // Generated content:
                // hello_0 you_0
                // hello_1 you_1
                // ....
                bfw.write("hello_" + num + " you_" + num);
                bfw.newLine();
                num++;
                if (num % 10000 == 0) {
                    bfw.flush();
                }
            }
        }
        System.out.println("end: the 2G file has been generated");
    }
}
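
A quick calculation shows why 75,000,000 lines come to about 2 GB. The sketch below adds up the byte length of every generated line, assuming a one-byte line separator (as on macOS or Linux) and one byte per character; `GeneratedSize` and `expectedBytes` are hypothetical names.

```java
// Hypothetical helper: predicts the size of the file written by GenerateData.
final class GeneratedSize {

    // Each line is "hello_" (6) + N + " you_" (5) + N + newline (1) = 12 + 2 * digits(N) bytes.
    static long expectedBytes(long lines) {
        long total = 0;
        long digits = 1;
        long nextPowerOfTen = 10;
        for (long i = 0; i < lines; i++) {
            if (i == nextPowerOfTen) { // N gained one more digit
                digits++;
                nextPowerOfTen *= 10;
            }
            total += 12 + 2 * digits;
        }
        return total;
    }

    public static void main(String[] args) {
        long bytes = expectedBytes(75_000_000L);
        System.out.printf("%d bytes, about %.2f GB%n", bytes, bytes / 1e9);
    }
}
```

Under these assumptions the total is 2,077,777,780 bytes, which is where the "2G" figure comes from.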

2.2 MapReduce code

package com.strivelearn.hadoop.hdfs.compress;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

/**
 * Requirement: read a file on HDFS and count how many times each word occurs in it.
 * For example, given a hello.txt with the following content:
 * hello world
 * say hello
 *
 * the final result is:
 * hello 2
 * world 1
 * say 1
 *
 * @author xys
 * @version WordCount.java
 */
public class MrDataCompress {

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Mapper<LongWritable, Text, Text, LongWritable>.Context context) throws IOException, InterruptedException {
            String[] words = value.toString().split(" ");
            // Iterate over the words produced by the split
            for (String word : words) {
                // Wrap each word as a <k2,v2> pair
                Text k2 = new Text(word);
                LongWritable v2 = new LongWritable(1);
                // Write <k2,v2> out
                context.write(k2, v2);
            }
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Reducer<Text, LongWritable, Text, LongWritable>.Context context) throws IOException, InterruptedException {
            // sum holds the total of the v2 values
            long sum = 0;
            // Add up the v2 values
            for (LongWritable v2 : values) {
                sum += v2.get();
            }

            // Assemble k3 and v3
            Text k3 = key;
            LongWritable v3 = new LongWritable(sum);
            // Write the result out
            context.write(k3, v3);
        }
    }

    public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
        // Configuration for the job
        Configuration configuration = new Configuration();
        // Parse the options passed with -D on the command line and add them to the configuration
        String[] remainingArgs = new GenericOptionsParser(configuration, args).getRemainingArgs();

        // Once the -D options are consumed, argument 1 is the input path and argument 2 is the output path.
        // Checking args.length here instead would reject every run that passes -D options.
        if (remainingArgs.length != 2) {
            // Exit immediately if the paths are missing
            System.exit(100);
        }

        // Create the job
        Job job = Job.getInstance(configuration);

        // Required: without this, the MrDataCompress class cannot be found when the job runs on the cluster
        job.setJarByClass(MrDataCompress.class);

        // Input path (a file or a directory)
        FileInputFormat.setInputPaths(job, new Path(remainingArgs[0]));
        // Output path (must be a directory that does not exist yet)
        FileOutputFormat.setOutputPath(job, new Path(remainingArgs[1]));

        // Map-side settings
        job.setMapperClass(MyMapper.class);
        // Type of k2
        job.setMapOutputKeyClass(Text.class);
        // Type of v2
        job.setMapOutputValueClass(LongWritable.class);

        // Uncomment to disable the reduce stage
        //job.setNumReduceTasks(0);

        // Reduce-side settings
        job.setReducerClass(MyReducer.class);
        // Type of k3
        job.setOutputKeyClass(Text.class);
        // Type of v3
        job.setOutputValueClass(LongWritable.class);

        // Submit the job and wait for it to finish
        job.waitForCompletion(true);
    }
}

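Upload the packaged jar and the test data to the server.

Upload the test file to HDFS.

2.3 Running without compression

Run the command from the Hadoop directory.

Start the MapReduce job.

Final result

The final result is uncompressed and in TextFile format.

View the data.

2.4 Running with deflate compression

2.5 Using Gzip compression

2.6 Using Bzip2 compression

2.7 Using Lz4 compression

2.8 Using Snappy compression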

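Testing whether the data can be split

Check the value after number of splits: in the job output. A value of 1 means the input was not split; a value greater than 1 means it was split.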
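
This check can also be done programmatically. The sketch below pulls the number out of a line containing "number of splits:" and applies the rule above; `SplitLogCheck` and its methods are hypothetical names, and the sample line is made up for illustration rather than copied from a real run.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: reads the split count from a line of job output.
final class SplitLogCheck {
    private static final Pattern SPLITS = Pattern.compile("number of splits:\\s*(\\d+)");

    // Returns the split count if the line contains one, otherwise an empty value.
    static OptionalInt parseSplits(String logLine) {
        Matcher m = SPLITS.matcher(logLine);
        return m.find() ? OptionalInt.of(Integer.parseInt(m.group(1))) : OptionalInt.empty();
    }

    // More than one split means the input file was split.
    static boolean wasSplit(int splits) {
        return splits > 1;
    }

    public static void main(String[] args) {
        String sample = "INFO mapreduce.JobSubmitter: number of splits:16"; // illustrative line
        OptionalInt splits = parseSplits(sample);
        if (splits.isPresent()) {
            System.out.println(splits.getAsInt() + " splits, split=" + wasSplit(splits.getAsInt()));
        }
    }
}
```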