Hadoop Serialization


1. Serialization Overview

  1. What is serialization

    Serialization is the process of converting in-memory objects into a byte sequence (or another data transfer protocol) so that they can be stored on disk (persisted) and transmitted over a network.

Deserialization is the reverse process: converting a received byte sequence (or other data transfer protocol), or data persisted on disk, back into in-memory objects.

  2. Why serialize

    Generally speaking, "live" objects exist only in memory and vanish once the machine shuts down. Moreover, a "live" object can only be used by the local process and cannot be sent to another computer over the network. Serialization makes it possible to store "live" objects and to send them to remote machines.

  3. Why not use Java's serialization

    Java serialization is a heavyweight framework (Serializable): a serialized object carries a lot of extra information (various checksums, headers, the inheritance hierarchy, and so on), which makes efficient network transmission difficult. Hadoop therefore developed its own serialization mechanism (Writable); a rough size comparison is sketched after this list.

  4. Characteristics of Hadoop serialization:

  • Compact: uses storage space efficiently.

  • Fast: low overhead when reading and writing data.

  • Interoperable: supports interaction across multiple languages.
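
A minimal sketch of the size difference, assuming the Hadoop client library is on the classpath (the Java figure is approximate and varies by JVM):

import org.apache.hadoop.io.LongWritable;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;

public class SizeComparison {

    public static void main(String[] args) throws IOException {

        // Java serialization: headers, class metadata, etc. come along for the ride
        ByteArrayOutputStream javaBuf = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(javaBuf)) {
            oos.writeObject(Long.valueOf(1116L));
        }

        // Hadoop Writable: just the raw 8 bytes of the long
        ByteArrayOutputStream writableBuf = new ByteArrayOutputStream();
        new LongWritable(1116L).write(new DataOutputStream(writableBuf));

        System.out.println("Java Serializable: " + javaBuf.size() + " bytes");     // ~80 bytes
        System.out.println("Hadoop Writable:   " + writableBuf.size() + " bytes"); // 8 bytes
    }
}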

2. Implementing the Serialization Interface (Writable) on a Custom Bean

In enterprise development, the commonly used built-in serialization types often cannot satisfy every requirement. For example, to pass a bean object around inside the Hadoop framework, that object must implement the serialization interface.

The concrete steps to make a bean serializable are the following 7:

  1. The class must implement the Writable interface.
  2. Deserialization uses reflection to call the no-arg constructor, so a no-arg constructor is mandatory:
public FlowBean() {
    super();
}
  3. Override the serialization method:
@Override
public void write(DataOutput out) throws IOException {
    out.writeLong(upFlow);
    out.writeLong(downFlow);
    out.writeLong(sumFlow);
}
  4. Override the deserialization method:
@Override
public void readFields(DataInput in) throws IOException {
    upFlow = in.readLong();
    downFlow = in.readLong();
    sumFlow = in.readLong();
}
  5. Note that the deserialization order must exactly match the serialization order.
  6. To make the results readable in output files, override toString(); separating the fields with "\t" is convenient for later processing.
  7. If the custom bean is to be transmitted as a key, it must also implement the Comparable interface, because the Shuffle phase of the MapReduce framework requires that keys be sortable (see the sorting case study later, and the combined-interface sketch after the code below):
@Override
public int compareTo(FlowBean o) {
    // Descending order, from largest to smallest; Long.compare returns 0 on
    // ties, honoring the compareTo contract
    return Long.compare(o.getSumFlow(), this.sumFlow);
}
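
For key types, Hadoop also provides the combined WritableComparable interface, which bundles Writable and Comparable into one. A minimal sketch with a hypothetical key class (the name and single field are illustrative, not from the case study below):

import org.apache.hadoop.io.WritableComparable;

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Hypothetical key bean: one long field, sorted in descending order
public class SumFlowKey implements WritableComparable<SumFlowKey> {

    private long sumFlow;

    public SumFlowKey() {
        // no-arg constructor, required for reflection during deserialization
    }

    public long getSumFlow() {
        return sumFlow;
    }

    public void setSumFlow(long sumFlow) {
        this.sumFlow = sumFlow;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(sumFlow);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        sumFlow = in.readLong(); // same order as write()
    }

    @Override
    public int compareTo(SumFlowKey o) {
        // descending order; returns 0 on ties, honoring the compareTo contract
        return Long.compare(o.sumFlow, this.sumFlow);
    }
}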

3. Serialization Case Study

  1. Requirements

    Compute the total upstream traffic, total downstream traffic, and total traffic consumed by each phone number.

    1. Input data

    2. Input data format:

      7    13560436666    120.196.100.99    1116     954        200
      id   phone number   network IP        up flow  down flow  network status code
    3. Expected output data format:

      13560436666     1116     954        2070
      phone number    up flow  down flow  total flow
  2. Write the MapReduce program

1. Write the traffic statistics Bean object
    package com.learning.mapreduce.writable; 
    
    import org.apache.hadoop.io.Writable;
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException; 
    
    //1 Implement the Writable interface
    public class FlowBean implements Writable {

        private long upFlow;   // upstream traffic
        private long downFlow; // downstream traffic
        private long sumFlow;  // total traffic

        //2 Provide a no-arg constructor
        public FlowBean() {
        }

        //3 Provide getters and setters for the three fields
        public long getUpFlow() {
            return upFlow;
        }

        public void setUpFlow(long upFlow) {
            this.upFlow = upFlow;
        }

        public long getDownFlow() {
            return downFlow;
        }

        public void setDownFlow(long downFlow) {
            this.downFlow = downFlow;
        }

        public long getSumFlow() {
            return sumFlow;
        }

        public void setSumFlow(long sumFlow) {
            this.sumFlow = sumFlow;
        }

        public void setSumFlow() {
            this.sumFlow = this.upFlow + this.downFlow;
        }

        //4 Implement serialization and deserialization; the field order must be identical in both
        @Override
        public void write(DataOutput dataOutput) throws IOException {
            dataOutput.writeLong(upFlow);
            dataOutput.writeLong(downFlow);
            dataOutput.writeLong(sumFlow);
        }

        @Override
        public void readFields(DataInput dataInput) throws IOException {
            this.upFlow = dataInput.readLong();
            this.downFlow = dataInput.readLong();
            this.sumFlow = dataInput.readLong();
        }

        //5 Override toString
        @Override
        public String toString() {
            return upFlow + "\t" + downFlow + "\t" + sumFlow;
        }
    }
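
    A quick way to verify that write and readFields agree on the field order is a round trip through an in-memory buffer. A minimal sketch, assuming the FlowBean class above is on the classpath:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class FlowBeanRoundTrip {

        public static void main(String[] args) throws IOException {

            FlowBean in = new FlowBean();
            in.setUpFlow(1116L);
            in.setDownFlow(954L);
            in.setSumFlow();

            // serialize into an in-memory buffer
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            in.write(new DataOutputStream(buf));

            // deserialize into a fresh instance, reading fields in the same order
            FlowBean out = new FlowBean();
            out.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));

            System.out.println(out); // 1116	954	2070
        }
    }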
    
2. Write the Mapper class

      package com.learning.mapreduce.writable;
      
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;
      import java.io.IOException;
      
      public class FlowMapper extends Mapper<LongWritable, Text, Text, FlowBean> {
      
          private Text outK = new Text();
          private FlowBean outV = new FlowBean();
      
          @Override
          protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      
              //1 Fetch one line of input and convert it to a String
              String line = value.toString();
      
              //2 Split the line on tabs
              String[] split = line.split("\t");
      
              //3 Grab the fields we need: phone number, up flow, down flow
              //  (see the note after this class on the trailing indices)
              String phone = split[1];
              String up = split[split.length - 3];
              String down = split[split.length - 2];
      
              //4 Populate outK and outV
              outK.set(phone);
              outV.setUpFlow(Long.parseLong(up));
              outV.setDownFlow(Long.parseLong(down));
              outV.setSumFlow();
      
              //5 Emit outK and outV
              context.write(outK, outV);
      
          }
      
      }
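
      Note that the traffic fields are indexed from the end of the split array (split.length - 3, split.length - 2) rather than from the front. Presumably this tolerates records that carry an optional extra field (such as a visited URL) between the IP and the traffic counters, so that only the trailing positions stay stable. A small sketch with hypothetical sample lines:

      // Sketch (hypothetical data): trailing indices work for both record shapes
      String withExtraField = "1\t13736230513\t192.196.100.1\twww.example.com\t2481\t24681\t200";
      String withoutExtra   = "2\t13846544121\t192.196.100.2\t264\t0\t200";

      for (String line : new String[]{withExtraField, withoutExtra}) {
          String[] f = line.split("\t");
          // phone is always f[1]; up/down flow are always 3rd and 2nd from the end
          System.out.println(f[1] + " up=" + f[f.length - 3] + " down=" + f[f.length - 2]);
      }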
      
3. Write the Reducer class

      package com.learning.mapreduce.writable;
      
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;
      import java.io.IOException; 
      
      public class FlowReducer extends Reducer<Text, FlowBean, Text, FlowBean> {
      
          private FlowBean outV = new FlowBean();
      
          @Override
          protected void reduce(Text key, Iterable<FlowBean> values, Context context) throws IOException, InterruptedException {
      
              long totalUp = 0;
              long totalDown = 0;
      
              //1 Iterate over values, accumulating the up flow and down flow separately
              //  (see the object-reuse note after this class)
              for (FlowBean flowBean : values) {
      
                  totalUp += flowBean.getUpFlow();
                  totalDown += flowBean.getDownFlow();
      
              }
      
              //2 Populate the output value
              outV.setUpFlow(totalUp);
              outV.setDownFlow(totalDown);
              outV.setSumFlow();
      
              //3 Emit the key and value
              context.write(key, outV);
      
          }
      
      }
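
      One design note: the Hadoop reduce-side iterator typically reuses a single deserialized FlowBean instance across iterations, which is why the loop above copies the primitive fields out immediately instead of holding on to the beans. A sketch of the anti-pattern this avoids (illustrative only):

      // BUG sketch: 'values' refills the same FlowBean object on every iteration,
      // so every cached reference would end up describing the last record.
      List<FlowBean> cached = new ArrayList<>(); // java.util imports assumed
      for (FlowBean flowBean : values) {
          cached.add(flowBean); // wrong: stores the reused reference
          // correct alternative: copy the fields into a fresh FlowBean before caching
      }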
      
4. Write the Driver class

      package com.learning.mapreduce.writable;
      
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      
      import java.io.IOException;
      
      public class FlowDriver {
      
          public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
      
              //1 Get the Job instance
              Configuration conf = new Configuration();
              Job job = Job.getInstance(conf);
      
              //2 Associate this Driver class
              job.setJarByClass(FlowDriver.class);
      
              //3 Associate the Mapper and Reducer
              job.setMapperClass(FlowMapper.class);
              job.setReducerClass(FlowReducer.class);
      
              //4 Set the map-side output key/value types
              job.setMapOutputKeyClass(Text.class);
              job.setMapOutputValueClass(FlowBean.class);
      
              //5 Set the final output key/value types
              job.setOutputKeyClass(Text.class);
              job.setOutputValueClass(FlowBean.class);
      
              //6 Set the input and output paths
              //  (a command-line variant is sketched after this class)
              FileInputFormat.setInputPaths(job, new Path("D:\\inputflow"));
              FileOutputFormat.setOutputPath(job, new Path("D:\\flowoutput"));
      
              //7 Submit the job and exit with its status
              boolean b = job.waitForCompletion(true);
      
              System.exit(b ? 0 : 1);
      
          }
      
      }
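
      The paths above are hard-coded for a local run. For submission to a cluster with hadoop jar, a common variant reads the paths from the command line instead; a minimal sketch replacing step 6 (the jar name is hypothetical):

      //6 Take the input and output paths from the command line, e.g.
      //   hadoop jar flow.jar com.learning.mapreduce.writable.FlowDriver /inputflow /flowoutput
      FileInputFormat.setInputPaths(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));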