A Brief Analysis of the Data Skew Problem

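This is the 7th note I have written for the note-writing event of the 3rd Youth Training Camp (青训营), backend track.

Processing large volumes of data in ByteDance search inevitably involves big-data frameworks, so here we briefly discuss the data skew problem in big-data scenarios.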

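Data skew

1. How data skew arises

A MapReduce job sends map output records with the same key to the same reducer, where they are aggregated. Between the map and reduce phases sits the shuffle, which partitions the map output so that all records sharing a key land in a single reduce task. Data skew means the records are spread very unevenly across keys: a few keys carry a huge number of records while the rest carry very few, so some reducers receive far more data than others.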
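
To make the effect of the shuffle concrete, here is a minimal in-memory sketch of how keys are assigned to reducers. The `partition` method uses the same formula as Hadoop's default `HashPartitioner`, but applied to `String.hashCode()` rather than `Text.hashCode()`, so the actual reducer numbers in a real job would differ. The class name, the sample keys, and the record counts are all made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// In-memory simulation of hash partitioning; not a real MapReduce job.
class PartitionSkewDemo {

    // Same formula as Hadoop's default HashPartitioner, applied to a String key.
    static int partition(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Counts how many records each reducer would receive.
    static int[] recordsPerReducer(Map<String, Integer> recordsPerKey, int numReduceTasks) {
        int[] load = new int[numReduceTasks];
        for (Map.Entry<String, Integer> e : recordsPerKey.entrySet()) {
            load[partition(e.getKey(), numReduceTasks)] += e.getValue();
        }
        return load;
    }

    public static void main(String[] args) {
        Map<String, Integer> recordsPerKey = new HashMap<>();
        recordsPerKey.put("hot_key", 1_000_000); // one key dominates the data set
        for (int i = 0; i < 100; i++) {
            recordsPerKey.put("key_" + i, 100);   // many small keys
        }
        int[] load = recordsPerReducer(recordsPerKey, 4);
        for (int i = 0; i < load.length; i++) {
            System.out.println("reducer " + i + " -> " + load[i] + " records");
        }
    }
}
```

Whichever reducer `hot_key` hashes to receives at least one million records, while the other three share the remaining ten thousand between them. Adding reducers does not change this, because all records of one key always go to the same partition.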

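2. Problems caused by data skew

One or more reduce tasks get stuck
Containers report OOM errors
A skewed reducer reads and writes an enormous amount of data, at the very least far more than the other, normal reducers
Along with the skew come other odd symptoms, such as tasks being killed.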

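3. Causes and solutions

Causes:

A single key value has a very large number of records (1. the reducer's memory is limited; 2. it may destabilize other jobs running on the cluster)
There are many distinct key values (the records of any single value do not take more memory than the reducer has been allocated)

Solutions:

Increase the number of reducers (this helps when many distinct keys share a reducer, not when a single key dominates)
Use a custom partitioner
Increase the JVM memory of the reducers (not very effective)
In the map phase, split each skewed key into several groups by appending a random number, then strip the random number again in the reduce phase
Address the skew at the business and data level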
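
As one possible reading of the custom partitioner idea, the sketch below shows the routing decision in plain Java: each known hot key gets a reducer of its own, and all other keys are hashed over the remaining reducers. This is an assumption about how such a partitioner might look, not something taken from a real job. The class name and the `hotKeys` list are hypothetical, and in Hadoop the logic would live in a subclass of `Partitioner` that overrides `getPartition` and is registered with `job.setPartitionerClass`.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the decision a custom partitioner could make.
class HotKeyPartitionSketch {

    // hotKeys is a hypothetical list of keys known in advance to be skewed.
    static int partition(String key, List<String> hotKeys, int numReduceTasks) {
        if (hotKeys.size() >= numReduceTasks) {
            throw new IllegalArgumentException("need at least one reducer for normal keys");
        }
        int hotIndex = hotKeys.indexOf(key);
        if (hotIndex >= 0) {
            return hotIndex; // reducers 0..hotKeys.size()-1 each serve exactly one hot key
        }
        int shared = numReduceTasks - hotKeys.size();
        // Normal keys are hashed over the remaining reducers.
        return hotKeys.size() + (key.hashCode() & Integer.MAX_VALUE) % shared;
    }

    public static void main(String[] args) {
        List<String> hotKeys = Arrays.asList("hot_key");
        System.out.println("hot_key -> reducer " + partition("hot_key", hotKeys, 4));
        System.out.println("key_1   -> reducer " + partition("key_1", hotKeys, 4));
    }
}
```

This keeps normal keys from queuing behind a hot key, but the hot key itself is still processed by a single reducer. Splitting one key across reducers needs the random-number approach from the list above.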

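We can also try to tackle it at the design level:
- Preprocess the data and filter out abnormal values
- Scatter the data to increase its parallelism, then gather the results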
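
The "scatter, then gather" idea can be sketched in memory as two aggregation steps. The first step appends a random suffix to every key, so the records of one hot key are spread over several groups. The second step strips the suffix and merges the partial sums. The class name, the `buckets` parameter, and the sample records are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// In-memory sketch of two-stage aggregation with salted keys.
class SaltedAggregationSketch {

    // Stage 1: append a random suffix to each key and sum per salted key.
    // Each record is {key, count}.
    static Map<String, Integer> saltAndSum(List<String[]> records, int buckets, Random random) {
        Map<String, Integer> partial = new HashMap<>();
        for (String[] record : records) {
            String saltedKey = record[0] + "_" + random.nextInt(buckets);
            partial.merge(saltedKey, Integer.parseInt(record[1]), Integer::sum);
        }
        return partial;
    }

    // Stage 2: strip the suffix and sum the partial results per original key.
    static Map<String, Integer> unsaltAndSum(Map<String, Integer> partial) {
        Map<String, Integer> total = new HashMap<>();
        for (Map.Entry<String, Integer> e : partial.entrySet()) {
            String saltedKey = e.getKey();
            String key = saltedKey.substring(0, saltedKey.lastIndexOf('_'));
            total.merge(key, e.getValue(), Integer::sum);
        }
        return total;
    }

    public static void main(String[] args) {
        List<String[]> records = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            records.add(new String[] {"hot_key", "1"});
        }
        records.add(new String[] {"rare_key", "5"});

        Map<String, Integer> partial = saltAndSum(records, 10, new Random());
        System.out.println("salted groups: " + partial.size());
        System.out.println("final totals: " + unsaltAndSum(partial));
    }
}
```

Because summing is associative, the totals after the second step are the same as a direct sum per key. The approach only works for aggregations that can be merged from partial results in this way.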

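Platform-level optimizations

For join operations, use a map join so the join is done on the map side and does not stall in the reduce phase
Where possible, run a group operation first so the keys go through one reduce, and only then run count or distinct count
Enable compression for map output and intermediate results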
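
The map join idea can be illustrated with a small in-memory sketch: the small table is held in a hash map, and each record of the large table is joined by a lookup, so no shuffle on the join key is needed. The class name and the two-column schema (user id, amount, user name) are made up for this example. In a Hadoop job the small table would typically be shipped to every mapper (for example with `job.addCacheFile`) and loaded in the mapper's `setup` method, which is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// In-memory sketch of a map-side (hash) join.
class MapSideJoinSketch {

    // largeRecords: {userId, amount}; smallTable: userId -> userName.
    static List<String> join(List<String[]> largeRecords, Map<String, String> smallTable) {
        List<String> joined = new ArrayList<>();
        for (String[] record : largeRecords) {
            String name = smallTable.get(record[0]);
            if (name != null) { // inner join: records without a match are dropped
                joined.add(record[0] + "\t" + name + "\t" + record[1]);
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> smallTable = new HashMap<>();
        smallTable.put("u1", "alice");
        smallTable.put("u2", "bob");

        List<String[]> largeRecords = new ArrayList<>();
        largeRecords.add(new String[] {"u1", "10"});
        largeRecords.add(new String[] {"u1", "25"});
        largeRecords.add(new String[] {"u3", "7"}); // no match in the small table

        for (String line : join(largeRecords, smallTable)) {
            System.out.println(line);
        }
    }
}
```

Since each record is joined where it is read, a hot join key no longer piles up on one reducer. The trade-off is that the small table has to fit in the memory of every map task.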

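Solving data skew with an MR chain

Core idea: the first MapReduce job salts the skewed keys (appends a random number) and aggregates; the second MapReduce job removes the salt from the first job's output and aggregates again. The drawback is that running two MR jobs means high I/O.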

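import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainFirstDriver {
    public static void main(String[] args) throws Exception {
        String input = "data/chain/data.txt";
        String output = "data/chain/first";

        // 1) Get the Job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // Delete the output directory if it already exists
        FileUtils.deleteOutput(configuration, output);

        // 2) Set the main class of this job
        job.setJarByClass(ChainFirstDriver.class);

        // 3) Set the Mapper and Reducer
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // Spread the salted keys over 10 reduce tasks
        job.setNumReduceTasks(10);

        // 4) Set the output types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5) Set the output types of the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6) Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // 7) Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final Random random = new Random();

        @Override
        protected void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            // Each input line is expected to be "key<TAB>count"
            String[] splits = values.toString().split("\t");
            // Salt: a random suffix from 1 to 10 spreads one key over up to 10 salted keys
            int r = random.nextInt(10) + 1;
            context.write(new Text(splits[0].trim() + "_" + r), new IntWritable(Integer.parseInt(splits[1])));
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Partial sum per salted key
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}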

The second MR job

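import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ChainSecondDriver {
    public static void main(String[] args) throws Exception {
        // The input is the output directory of the first job
        String input = "data/chain/first";
        String output = "data/chain/second";

        // 1) Get the Job object
        Configuration configuration = new Configuration();
        Job job = Job.getInstance(configuration);

        // Delete the output directory if it already exists
        FileUtils.deleteOutput(configuration, output);

        // 2) Set the main class of this job
        job.setJarByClass(ChainSecondDriver.class);

        // 3) Set the Mapper and Reducer
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        // 4) Set the output types of the map phase
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // 5) Set the output types of the reduce phase
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // 6) Set the input and output paths
        FileInputFormat.setInputPaths(job, new Path(input));
        FileOutputFormat.setOutputPath(job, new Path(output));

        // 7) Submit the job and wait for it to finish
        boolean result = job.waitForCompletion(true);
        System.exit(result ? 0 : 1);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        @Override
        protected void map(LongWritable key, Text values, Context context) throws IOException, InterruptedException {
            // Each input line is "saltedKey<TAB>partialSum" as written by the first job
            String[] splits = values.toString().split("\t");
            // Remove the salt: drop everything from the last "_" onwards
            int index = splits[0].lastIndexOf("_");
            String result = index >= 0 ? splits[0].substring(0, index) : splits[0];
            context.write(new Text(result), new IntWritable(Integer.parseInt(splits[1])));
        }
    }

    public static class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
            // Final sum per original key
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}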

The FileUtils helper used above

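import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FileUtils {
    // Deletes the output path recursively if it exists, so the job can be rerun
    public static void deleteOutput(Configuration configuration, String output) throws Exception {
        FileSystem fileSystem = FileSystem.get(configuration);
        Path path = new Path(output);
        if (fileSystem.exists(path)) {
            fileSystem.delete(path, true);
        }
    }
}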