Quick Tip: MapReduce's Serialization Mechanism


Last time I mentioned that MapReduce lets you define your own types for the data it writes to disk.

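This time we'll take the self-join query we did earlier with Spark and use it as the example for MapReduce's serialization and deserialization mechanism.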

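Serialization in MapReduce really just means implementing the Writable interface. For the self-join here, each record has to say which table it came from and carry the field used to build the Cartesian product. That makes two values in total: an integer, which is metadata naming the source table, and a String, which is used for the join.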

Like this:

static class Relation implements Writable, Cloneable {
        int type;
        String value;

        public Relation(int from, String value) {
            this.type = from;
            this.value = value;
        }
        public Relation() {
        }
        @Override
        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeUTF(value);
            dataOutput.writeInt(type);
        }

        @Override
        public void readFields(DataInput dataInput) throws IOException {
            value = dataInput.readUTF();
            type = dataInput.readInt();
        }
        @Override
        public Relation clone() {
            try {
                return (Relation) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError();
            }
        }
    }

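writeUTF takes care of the encoding for us: it writes a two-byte length prefix followed by the string in Java's modified UTF-8, and readUTF decodes it again, so we never have to convert to a byte array by hand. Because of that two-byte prefix, a single string is limited to 65535 encoded bytes.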
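
To make the byte layout concrete, here is a minimal sketch that uses only the JDK's DataOutputStream and DataInputStream, with no Hadoop on the classpath. The class name `WriteUtfDemo` and its helper methods are made up for this illustration. It writes the fields in the same order as Relation.write and reads them back in that same order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical demo class, not part of the original job.
class WriteUtfDemo {

    // Serialize a (value, type) pair in the same field order as Relation.write.
    static byte[] encode(String value, int type) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(value); // 2-byte length prefix + modified UTF-8 bytes
            out.writeInt(type);  // 4 bytes, big-endian
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Read the fields back in exactly the order they were written.
    static String decode(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String value = in.readUTF();
            int type = in.readInt();
            return type + ":" + value;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = encode("Alice", 1);
        System.out.println(data.length);  // 2 + 5 + 4 = 11
        System.out.println(decode(data)); // 1:Alice
    }
}
```

The thing to take away is that write and readFields must agree on field order. Nothing in the byte stream labels the fields, so reading the int before the string would misinterpret the bytes.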

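Note that a Writable type we define ourselves must have a no-argument constructor. Otherwise, when MapReduce later uses reflection to instantiate our class, it fails with a NoSuchMethodException because the constructor is missing.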
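
The following sketch shows the underlying JDK behavior, not Hadoop's own instantiation code. The class names are invented for the demo. Asking a class for its no-argument constructor fails as soon as the class declares only parameterized constructors.

```java
// Hypothetical demo classes, not part of the original job.
class NoArgConstructorDemo {

    // Only a parameterized constructor: the compiler adds no default one.
    static class WithoutDefault {
        final String value;
        WithoutDefault(String value) { this.value = value; }
    }

    // Declares the no-arg constructor explicitly, like Relation does.
    static class WithDefault {
        String value;
        WithDefault() { }
        WithDefault(String value) { this.value = value; }
    }

    // Try to build an instance from the class object alone, as a framework would.
    static String tryInstantiate(Class<?> type) {
        try {
            type.getDeclaredConstructor().newInstance();
            return "ok";
        } catch (NoSuchMethodException e) {
            return "NoSuchMethodException";
        } catch (ReflectiveOperationException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryInstantiate(WithDefault.class));    // ok
        System.out.println(tryInstantiate(WithoutDefault.class)); // NoSuchMethodException
    }
}
```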

The clone method will turn out to be very useful later on; I'll leave it aside for now.

With this class in place, we can start writing the code for the MapReduce job.

The Map side is straightforward: for each input relation it builds two records, one in each direction, and writes both to the context.

  static class MyMap extends Mapper<LongWritable, Text, Text, Relation> {
        @Override
        public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String[] s = value.toString().split(" ");
            String child = s[0];
            String parent = s[1];
            context.write(new Text(parent), new Relation(0, child));
            context.write(new Text(child), new Relation(1, parent));
        }
    }

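A few extra variables are declared here for readability. 0 stands for the left table and 1 for the right table; records are joined when their keys are equal.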

The Reduce side then has to take the Cartesian product of the values that came from the two different tables, like this:

static class MyReduce extends Reducer<Text, Relation, Text, Text> {
        @Override
        public void reduce(Text key, Iterable<Relation> relations, Context context) throws IOException, InterruptedException {
            ArrayList<Relation> grandChildren = new ArrayList<>();
            ArrayList<Relation> grandParents = new ArrayList<>();
            relations.iterator().forEachRemaining(
                    it -> {
                        if (it.type == 0) {
                            grandChildren.add(it.clone());
                        } else {
                            grandParents.add(it.clone());
                        }
                    }
            );
            for (Relation grandChild : grandChildren) {
                for (Relation grandParent : grandParents) {
                    context.write(new Text(grandChild.value), new Text(grandParent.value));
                }
            }
        }
    }
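
To see what the mapper and reducer compute together, here is a rough in-memory walk-through in plain Java with no Hadoop classes. The class `SelfJoinSketch`, the stand-in `Rel` type, and the two input lines are all made up for illustration; the grouping by key plays the role of the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory walk-through; no Hadoop classes are involved.
class SelfJoinSketch {

    // Stand-in for Relation: type 0 = left table, type 1 = right table.
    static final class Rel {
        final int type;
        final String value;
        Rel(int type, String value) { this.type = type; this.value = value; }
    }

    static List<String> grandPairs(List<String> lines) {
        // "Map" phase plus grouping by key (what the shuffle does).
        Map<String, List<Rel>> grouped = new TreeMap<>();
        for (String line : lines) {
            String[] s = line.split(" ");
            String child = s[0];
            String parent = s[1];
            grouped.computeIfAbsent(parent, k -> new ArrayList<>()).add(new Rel(0, child));
            grouped.computeIfAbsent(child, k -> new ArrayList<>()).add(new Rel(1, parent));
        }
        // "Reduce" phase: for each key, cross the two sides.
        List<String> out = new ArrayList<>();
        for (List<Rel> rels : grouped.values()) {
            List<String> grandChildren = new ArrayList<>();
            List<String> grandParents = new ArrayList<>();
            for (Rel r : rels) {
                (r.type == 0 ? grandChildren : grandParents).add(r.value);
            }
            for (String gc : grandChildren) {
                for (String gp : grandParents) {
                    out.add(gc + "\t" + gp);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up input: Tom's parent is Lucy, Lucy's parent is Mary.
        // The only key with records on both sides is Lucy, giving one pair: Tom and Mary.
        System.out.println(grandPairs(Arrays.asList("Tom Lucy", "Lucy Mary")));
    }
}
```

Since every record here is a freshly allocated object, this sketch never needs clone. That is the one point where it differs from the real reducer.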

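The Iterable passed to reduce hides a big trap: it hands back the same object on every iteration. Put differently, there is only one object; each step overwrites its fields and then hands it to the caller. If we don't copy it, every element stored in our lists is that same object, holding whatever was read last, so all the values look identical and the resulting Cartesian product is wrong.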

This is exactly why the class implements the Cloneable interface from the very beginning: it gives us a simple way to copy the object.
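
The effect is easy to reproduce without Hadoop. The sketch below is a simulation, not Hadoop's actual iterator: `reusing` is a hypothetical iterator that refills one shared `Holder` on every call to next, and `collect` keeps either the references themselves or clones of them.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical simulation; Hadoop's real value iterator is not reproduced here.
class ReusedObjectDemo {

    static final class Holder implements Cloneable {
        String value;

        @Override
        public Holder clone() {
            try {
                return (Holder) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e);
            }
        }
    }

    // An iterator that refills one shared Holder instead of allocating new ones.
    static Iterator<Holder> reusing(String... values) {
        Holder shared = new Holder();
        return new Iterator<Holder>() {
            int i = 0;

            @Override
            public boolean hasNext() { return i < values.length; }

            @Override
            public Holder next() {
                if (!hasNext()) throw new NoSuchElementException();
                shared.value = values[i++];
                return shared;
            }
        };
    }

    // Keep every element seen, either as-is or as a clone.
    static List<String> collect(boolean copy) {
        List<Holder> kept = new ArrayList<>();
        Iterator<Holder> it = reusing("Alice", "Jesse", "Mary");
        while (it.hasNext()) {
            Holder h = it.next();
            kept.add(copy ? h.clone() : h);
        }
        List<String> out = new ArrayList<>();
        for (Holder h : kept) out.add(h.value);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(collect(false)); // [Mary, Mary, Mary]
        System.out.println(collect(true));  // [Alice, Jesse, Mary]
    }
}
```

A shallow clone is enough for Relation because its fields are an int and an immutable String. A class holding mutable fields would need a deeper copy.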

Finally, let's try running a job:

        Job job = Job.getInstance();
        job.setJobName("Cartesian product");
        job.setJarByClass(MineApp.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Relation.class);
        job.setMapperClass(MyMap.class);
        job.setReducerClass(MyReduce.class);
        FileInputFormat.addInputPath(job, ....);
        FileOutputFormat.setOutputPath(job, ...);
        job.waitForCompletion(true);

And we get the result we wanted:

Mark    Alice
Mark    Jesse
Philip  Alice
Philip  Jesse
Jone    Jesse
Jone    Alice
Steven  Jesse
Steven  Alice
Jone    Mary
Jone    Frank
Steven  Mary
Steven  Frank

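Apart from the biggest trap, the Iterable, the serialization approach is fairly simple. If you need a custom class to serve as a key, it should implement the WritableComparable interface instead.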
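
As a rough sketch of what such a key class involves: `PersonKey` and its `generation` field below are invented for illustration. To keep the example compilable without Hadoop on the classpath, it implements only Comparable and declares write and readFields with the same signatures as Writable. In a real job the declaration would be `implements WritableComparable<PersonKey>`.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Objects;

// Hypothetical key type. In a real job it would implement
// org.apache.hadoop.io.WritableComparable<PersonKey>; here it implements only
// Comparable so that the sketch compiles without Hadoop.
class PersonKey implements Comparable<PersonKey> {
    String name = "";
    int generation;

    PersonKey() { } // no-arg constructor, needed for reflective instantiation

    PersonKey(String name, int generation) {
        this.name = name;
        this.generation = generation;
    }

    // Same signature as Writable.write.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(name);
        out.writeInt(generation);
    }

    // Same signature as Writable.readFields; reads in the order write used.
    public void readFields(DataInput in) throws IOException {
        name = in.readUTF();
        generation = in.readInt();
    }

    // Sort order of the keys: by name first, then by generation.
    @Override
    public int compareTo(PersonKey other) {
        int byName = name.compareTo(other.name);
        return byName != 0 ? byName : Integer.compare(generation, other.generation);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof PersonKey)) return false;
        PersonKey k = (PersonKey) o;
        return generation == k.generation && name.equals(k.name);
    }

    @Override
    public int hashCode() {
        return Objects.hash(name, generation);
    }

    public static void main(String[] args) {
        System.out.println(new PersonKey("Alice", 1).compareTo(new PersonKey("Alice", 2)) < 0); // true
    }
}
```

Overriding hashCode matters as much as compareTo for a key, because Hadoop's default HashPartitioner picks the reducer from the key's hashCode, and equal keys have to land on the same reducer.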

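Personally, I'm not very fond of this serialization mechanism. For one thing, key and value are separated by design, so their Writable classes end up separate as well, and once they are split like that, custom serialization classes are not necessarily pleasant to use. On top of that there is the Iterable trap in reduce.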

My own preference, where it's needed, is to pass Text and build the corresponding objects myself inside reduce.
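
One possible shape for that alternative, as a sketch only: the class `TextEncodedRelation` and its "type, tab, value" layout are assumptions of this example, not something the job above uses. The mapper would emit `new Text(rel.encode())`, and the reducer would call `decode(text.toString())` on each value.

```java
// Hypothetical helper; the "type<TAB>value" layout is an assumption of this sketch.
class TextEncodedRelation {
    final int type;
    final String value;

    TextEncodedRelation(int type, String value) {
        this.type = type;
        this.value = value;
    }

    // The string a mapper would wrap in a Text.
    String encode() {
        return type + "\t" + value;
    }

    // Parses one encoded string; always returns a fresh object.
    static TextEncodedRelation decode(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("missing tab separator: " + line);
        }
        int type = Integer.parseInt(line.substring(0, tab));
        return new TextEncodedRelation(type, line.substring(tab + 1));
    }

    public static void main(String[] args) {
        TextEncodedRelation back = decode(new TextEncodedRelation(1, "Alice").encode());
        System.out.println(back.type + " " + back.value); // 1 Alice
    }
}
```

The Text instance handed out by the Iterable is still reused between iterations, but toString builds a new String each time and decode allocates a new object, so nothing has to be cloned.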

If you see it differently, I'd be glad to discuss 👏.