Hadoop Study Notes - 05: MapReduce Client Source Code Analysis

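Foreword: by reading the MapReduce client source code, these notes take a closer look at how a split relates to a block, and at the ideas of divide and conquer and of moving computation to the data.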

Start with waitForCompletion(), the entry point of a MapReduce program:

  public boolean waitForCompletion(boolean verbose
                                   ) throws IOException, InterruptedException,
                                            ClassNotFoundException {
    if (state == JobState.DEFINE) {
      // submit the job asynchronously
      submit();
    }
    if (verbose) {
      // monitor the job and print its progress, e.g. the percentage of map tasks completed
      monitorAndPrintJob();
    } else {
      // get the completion poll interval from the client.
      int completionPollIntervalMillis = 
        Job.getCompletionPollInterval(cluster.getConf());
      while (!isComplete()) {
        try {
          Thread.sleep(completionPollIntervalMillis);
        } catch (InterruptedException ie) {
        }
      }
    }
    return isSuccessful();
  }
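
The non-verbose branch above is a plain sleep-and-poll loop. The following is a minimal sketch of that pattern outside Hadoop, not the Hadoop implementation: `PollDemo` and `pollUntilComplete` are hypothetical names, and the completion check is a simulated counter rather than a real job. Unlike the excerpt, which ignores InterruptedException, this sketch restores the interrupt flag and stops waiting.

```java
import java.util.function.BooleanSupplier;

class PollDemo {
    // Hypothetical helper: sleeps between checks until isComplete reports true.
    // Returns the number of sleeps performed, only so the behavior can be observed.
    static int pollUntilComplete(BooleanSupplier isComplete, long intervalMillis) {
        int sleeps = 0;
        while (!isComplete.getAsBoolean()) {
            try {
                Thread.sleep(intervalMillis);
                sleeps++;
            } catch (InterruptedException ie) {
                // Assumption: give up waiting and keep the interrupt visible to the caller.
                Thread.currentThread().interrupt();
                break;
            }
        }
        return sleeps;
    }

    public static void main(String[] args) {
        // Simulated job that reports completion on the fourth check.
        int[] checks = {0};
        int sleeps = pollUntilComplete(() -> ++checks[0] >= 4, 10L);
        System.out.println("slept " + sleeps + " times"); // slept 3 times
    }
}
```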

The job is submitted asynchronously through submit(). Here is its implementation:

  public void submit() 
         throws IOException, InterruptedException, ClassNotFoundException {
    ensureState(JobState.DEFINE);
    setUseNewAPI(); // decide whether the job uses the new or the old API
    connect(); // the client connects to the cluster's master service
    final JobSubmitter submitter = 
        getJobSubmitter(cluster.getFileSystem(), cluster.getClient());
    status = ugi.doAs(new PrivilegedExceptionAction<JobStatus>() {
      public JobStatus run() throws IOException, InterruptedException,
      ClassNotFoundException {
        // this is where the job is actually submitted and the split list is computed
        return submitter.submitJobInternal(Job.this, cluster);
      }
    });
    state = JobState.RUNNING;
    LOG.info("The url to track the job: " + getTrackingURL());
   }

The source shown in these notes comes from the trunk branch, so it uses the new API. The method to focus on next is submitJobInternal():

  /**
   * Internal method for submitting jobs to the system.
   * 
   * <p>The job submission process involves:
   * <ol>
   *   <li>
   *   Checking the input and output specifications of the job. 
   *   </li>
   *   <li>
   *   Computing the {@link InputSplit}s for the job.
   *   </li>
   *   <li>
   *   Setup the requisite accounting information for the 
   *   {@link DistributedCache} of the job, if necessary.
   *   </li>
   *   <li>
   *   Copying the job's jar and configuration to the map-reduce system
   *   directory on the distributed file-system. 
   *   </li>
   *   <li>
   *   Submitting the job to the <code>JobTracker</code> and optionally
   *   monitoring it's status.
   *   </li>
   * </ol></p>
   * @param job the configuration to submit
   * @param cluster the handle to the Cluster
   * @throws ClassNotFoundException
   * @throws InterruptedException
   * @throws IOException
   */
  JobStatus submitJobInternal(Job job, Cluster cluster) 
  throws ClassNotFoundException, InterruptedException, IOException {
    ...
    Path submitJobDir = new Path(jobStagingArea, jobId.toString()); // staging directory for the job jar and related files
    JobStatus status = null;
    try {
      ...
      // Create the splits for the job
      LOG.debug("Creating splits at " + jtFs.makeQualified(submitJobDir));
      // compute and write the split list
      int maps = writeSplits(job, submitJobDir);
      conf.setInt(MRJobConfig.NUM_MAPS, maps);
      LOG.info("number of splits:" + maps);
    ...
    } finally {
      ...
    }
  }

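submitJobInternal does five things:

1. Checks the job's input and output specifications, including the input and output paths.
2. Computes the list of splits.
3. Sets up the accounting information needed for the job's DistributedCache, if necessary.
4. Copies the job's jar and configuration to the MapReduce system directory on HDFS.
5. Submits the job to the JobTracker (the name used in the Javadoc) and optionally monitors its status.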

The step to focus on is computing the split list, that is, the writeSplits method:

  private int writeSplits(org.apache.hadoop.mapreduce.JobContext job,
      Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    JobConf jConf = (JobConf)job.getConfiguration();
    int maps;
    // choose between the new and the old Mapper API; Hadoop 2.x uses the new Mapper
    if (jConf.getUseNewMapper()) {
      maps = writeNewSplits(job, jobSubmitDir);
    } else {
      maps = writeOldSplits(jConf, jobSubmitDir);
    }
    return maps;
  }

The trunk code discussed here uses the new Mapper class, so the next method to read is writeNewSplits:

  @SuppressWarnings("unchecked")
  private <T extends InputSplit>
  int writeNewSplits(JobContext job, Path jobSubmitDir) throws IOException,
      InterruptedException, ClassNotFoundException {
    Configuration conf = job.getConfiguration();
    InputFormat<?, ?> input =
      ReflectionUtils.newInstance(job.getInputFormatClass(), conf); // defaults to TextInputFormat

    List<InputSplit> splits = input.getSplits(job); // compute the splits
    T[] array = (T[]) splits.toArray(new InputSplit[splits.size()]);

    // sort the splits into order based on size, so that the biggest
    // go first
    Arrays.sort(array, new SplitComparator());
    JobSplitWriter.createSplitFiles(jobSubmitDir, conf, 
        jobSubmitDir.getFileSystem(conf), array);
    return array.length;
  }
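
The sort step in writeNewSplits puts the largest splits first. Below is a minimal sketch of that ordering, independent of Hadoop: `SplitSortDemo` and its `Split` class are hypothetical stand-ins for InputSplit and SplitComparator, and the sizes are made up for illustration.

```java
import java.util.Arrays;
import java.util.Comparator;

class SplitSortDemo {
    // Hypothetical stand-in for InputSplit: only a name and a length in bytes.
    static final class Split {
        final String name;
        final long length;

        Split(String name, long length) {
            this.name = name;
            this.length = length;
        }
    }

    // Returns a copy ordered by descending length ("the biggest go first").
    static Split[] sortBiggestFirst(Split[] splits) {
        Split[] copy = Arrays.copyOf(splits, splits.length);
        Arrays.sort(copy, Comparator.comparingLong((Split s) -> s.length).reversed());
        return copy;
    }

    public static void main(String[] args) {
        Split[] sorted = sortBiggestFirst(new Split[] {
            new Split("a", 10), new Split("b", 128), new Split("c", 64)});
        for (Split s : sorted) {
            System.out.println(s.name + " " + s.length); // b 128, then c 64, then a 10
        }
    }
}
```

One plausible reading of the ordering, which the source comment does not spell out, is that the largest splits correspond to the longest map tasks, so it helps to start them early.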

Here the JobContext job is implemented by JobContextImpl, so getInputFormatClass is found in that class. It shows that the InputFormat class can be set through the mapreduce.job.inputformat.class parameter and defaults to TextInputFormat.

Next comes getSplits, which produces the split list. TextInputFormat does not implement it; the implementation lives in its parent class FileInputFormat:

  /**
   * Generate the list of files and make them into FileSplits.
   * @param job the job context
   * @throws IOException
   */
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    StopWatch sw = new StopWatch().start();
    // minimum split size
    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    // maximum split size
    long maxSize = getMaxSplitSize(job);

    // generate splits
    List<InputSplit> splits = new ArrayList<InputSplit>();
    // FileStatus holds the input file's metadata: path, length, block size, and so on
    List<FileStatus> files = listStatus(job);

    boolean ignoreDirs = !getInputDirRecursive(job)
      && job.getConfiguration().getBoolean(INPUT_DIR_NONRECURSIVE_IGNORE_SUBDIRS, false);
    for (FileStatus file: files) {
      if (ignoreDirs && file.isDirectory()) {
        continue;
      }
      Path path = file.getPath(); // file path
      long length = file.getLen(); // file size in bytes
      if (length != 0) { // the file is not empty
        // look up the file's blocks and the machines that hold each block (sorted by distance from the client)
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length); // block locations for the whole file, starting at offset 0
        }
        // check whether the file can be split
        if (isSplitable(job, path)) {
          long blockSize = file.getBlockSize();
          // compute the actual split size
          long splitSize = computeSplitSize(blockSize, minSize, maxSize);

          // bytes of the file not yet assigned to a split; initially the whole file
          long bytesRemaining = length;
          // keep going while remaining bytes / split size exceeds SPLIT_SLOP
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            // the split starts at offset = length - bytesRemaining; find the index of
            // the block that contains that offset, i.e. the block this split is tied to
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            // build the split descriptor with makeSplit
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            // subtract the bytes just assigned; the rest is handled in the next iteration
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            // index of the block that contains the start of the last split
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            splits.add(makeSplit(path, length-bytesRemaining, bytesRemaining,
                       blkLocations[blkIndex].getHosts(),
                       blkLocations[blkIndex].getCachedHosts()));
          }
        } else { // not splitable
          if (LOG.isDebugEnabled()) {
            // Log only if the file is big enough to be splitted
            if (length > Math.min(file.getBlockSize(), minSize)) {
              LOG.debug("File is not splittable so no parallelization "
                  + "is possible: " + file.getPath());
            }
          }
          splits.add(makeSplit(path, 0, length, blkLocations[0].getHosts(),
                      blkLocations[0].getCachedHosts()));
        }
      } else {
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    // Save the number of input files for metrics/loadgen
    job.getConfiguration().setLong(NUM_INPUT_FILES, files.size());
    sw.stop();
    if (LOG.isDebugEnabled()) {
      LOG.debug("Total # of splits generated by getSplits: " + splits.size()
          + ", TimeTaken: " + sw.now(TimeUnit.MILLISECONDS));
    }
    return splits;
  }

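First the range of split sizes is determined. The minimum is the larger of 1 (the value from getFormatMinSplitSize()) and the configured minimum; the maximum comes straight from the configuration: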

    long minSize = Math.max(getFormatMinSplitSize(), getMinSplitSize(job));
    long maxSize = getMaxSplitSize(job);

Next, listStatus returns the list of file details. Each FileStatus carries the file's metadata, such as its path, size, and block size.

    List<FileStatus> files = listStatus(job);

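The method then iterates over these files and assigns splits to each one.

For every non-empty file it first fetches the block locations, that is, where each block sits in the file and which machines hold it: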

        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          FileSystem fs = path.getFileSystem(job.getConfiguration());
          blkLocations = fs.getFileBlockLocations(file, 0, length); // block locations for the whole file, starting at offset 0
        }

Then it computes the actual split size with computeSplitSize:

  protected long computeSplitSize(long blockSize, long minSize,
                                  long maxSize) {
    // by default the split size equals the block size
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

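As the code shows, by default a split is exactly one block in size. To make splits larger than a block, raise minSize above the block size; to make them smaller than a block, lower maxSize below it.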
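
The three cases can be checked with a small worked example. This is a sketch that copies only the max/min formula from computeSplitSize. The class name `SplitSizeDemo` is hypothetical, the 128 MB block size is an assumed value, and the defaults of 1 for minSize and Long.MAX_VALUE for maxSize are what the configuration yields when nothing is set.

```java
class SplitSizeDemo {
    static final long MB = 1024L * 1024L;

    // Same formula as the computeSplitSize excerpt above.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB; // assumed block size for illustration

        // Nothing configured: the split size equals the block size.
        System.out.println(computeSplitSize(blockSize, 1L, Long.MAX_VALUE) / MB); // 128

        // minSize raised above the block size: splits grow to minSize.
        System.out.println(computeSplitSize(blockSize, 256 * MB, Long.MAX_VALUE) / MB); // 256

        // maxSize lowered below the block size: splits shrink to maxSize.
        System.out.println(computeSplitSize(blockSize, 1L, 64 * MB) / MB); // 64
    }
}
```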

Next comes the code that actually builds the split list:

          // bytes of the file not yet assigned to a split; initially the whole file
          long bytesRemaining = length;
          // keep going while remaining bytes / split size exceeds SPLIT_SLOP
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            // the split starts at offset = length - bytesRemaining; find the index of
            // the block that contains that offset, i.e. the block this split is tied to
            int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
            // build the split descriptor with makeSplit
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                        blkLocations[blkIndex].getHosts(),
                        blkLocations[blkIndex].getCachedHosts()));
            // subtract the bytes just assigned; the rest is handled in the next iteration
            bytesRemaining -= splitSize;
          }

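bytesRemaining starts out as the full file size. As long as the ratio of bytesRemaining to the split size is greater than SPLIT_SLOP (1.1), another split is created; after each split, bytesRemaining is reduced by the split size and the next split is cut from what is left. Because of the 1.1 threshold, a tail of up to 10% of the split size is folded into the last split rather than becoming a tiny split of its own.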
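
To see the effect of SPLIT_SLOP on concrete numbers, here is a sketch that reproduces only the offset arithmetic of the loop, together with the `bytesRemaining != 0` tail step from getSplits. `SplitLoopDemo` and `computeSplits` are hypothetical names, block lookup and hosts are left out, and sizes are plain numbers that can be read as MB.

```java
import java.util.ArrayList;
import java.util.List;

class SplitLoopDemo {
    static final double SPLIT_SLOP = 1.1; // threshold stated in the text

    // Returns {offset, length} pairs for one file of the given length.
    static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // Whatever is left (at most 1.1 * splitSize) becomes the last split.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        // 260 with splitSize 128: the 132 left after the first split is within
        // the slop, so it stays one split -> (0,128) (128,132)
        for (long[] s : computeSplits(260, 128)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
        // 300 with splitSize 128 -> (0,128) (128,128) (256,44)
        for (long[] s : computeSplits(300, 128)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```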

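getBlockIndex returns an index into blkLocations[]; the block is located by the offset length-bytesRemaining. On the first iteration length-bytesRemaining is 0, so the search starts at offset 0 of the file. On the second iteration the offset is the start of the second split. The getBlockIndex method itself looks like this: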

  protected int getBlockIndex(BlockLocation[] blkLocations,
                              long offset) {
    // find which block of the file this split belongs to,
    // i.e. the block that contains the split's start offset
    for (int i = 0 ; i < blkLocations.length; i++) {
      // is the offset inside this block?
      if ((blkLocations[i].getOffset() <= offset) &&
          (offset < blkLocations[i].getOffset() + blkLocations[i].getLength())){
        return i;
      }
    }
    BlockLocation last = blkLocations[blkLocations.length -1];
    long fileLength = last.getOffset() + last.getLength() -1;
    throw new IllegalArgumentException("Offset " + offset +
                                       " is outside of file (0.." +
                                       fileLength + ")");
  }

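For each block, the method checks that the block's starting offset is less than or equal to the offset of the split being assigned, and that this offset is less than the block's end (its starting offset plus its length). The first block that satisfies both conditions is returned; if none does, the offset lies outside the file and an IllegalArgumentException is thrown.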
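
The lookup can be tried out without a cluster. The sketch below is a simplified stand-in, not the Hadoop method: `BlockIndexDemo` is a hypothetical class, and two parallel arrays of block offsets and lengths replace BlockLocation[]. The layout used is three blocks of 128, 128, and 44 units.

```java
class BlockIndexDemo {
    // Returns the index of the block whose range [offset, offset + length) contains
    // the given file offset, mirroring the condition in getBlockIndex.
    static int getBlockIndex(long[] blockOffsets, long[] blockLengths, long offset) {
        for (int i = 0; i < blockOffsets.length; i++) {
            if (blockOffsets[i] <= offset && offset < blockOffsets[i] + blockLengths[i]) {
                return i;
            }
        }
        throw new IllegalArgumentException("Offset " + offset + " is outside of file");
    }

    public static void main(String[] args) {
        long[] offsets = {0, 128, 256};
        long[] lengths = {128, 128, 44};
        System.out.println(getBlockIndex(offsets, lengths, 0));   // 0
        System.out.println(getBlockIndex(offsets, lengths, 127)); // 0
        System.out.println(getBlockIndex(offsets, lengths, 128)); // 1
        System.out.println(getBlockIndex(offsets, lengths, 299)); // 2
    }
}
```

Only the start offset of a split is looked up. In getSplits the hosts of a split are taken from that one block, so a split that is larger than a block still reports the hosts of the block in which it starts.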

Finally the split descriptor itself is created by makeSplit:

  protected FileSplit makeSplit(Path file, long start, long length,
                                String[] hosts, String[] inMemoryHosts) {
    return new FileSplit(file, start, length, hosts, inMemoryHosts);
  }

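A split therefore carries the following important attributes:

file: the file the split belongs to
start: the offset in the file at which the split begins
length: the size of the split in bytes
hosts: the hosts that store the corresponding block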

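The hosts attribute determines which servers a MapTask should be sent to for execution, and this is what makes moving computation to the data possible.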
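
How the hosts list can drive placement is illustrated below. This is a deliberately simplified sketch and not the scheduler Hadoop uses, which is outside the code covered in these notes: `LocalityDemo`, `pickNode`, and the node names are all hypothetical, and the only rule modeled is "prefer a node that already holds the data".

```java
import java.util.Arrays;
import java.util.List;

class LocalityDemo {
    // Hypothetical helper: choose a node for a split, preferring one listed in the
    // split's hosts. Returns null when no node has free capacity.
    static String pickNode(String[] splitHosts, List<String> nodesWithFreeSlots) {
        for (String host : splitHosts) {
            if (nodesWithFreeSlots.contains(host)) {
                return host; // data-local: the task reads the block from local disk
            }
        }
        // No data-local node is free: fall back to any free node (remote read).
        return nodesWithFreeSlots.isEmpty() ? null : nodesWithFreeSlots.get(0);
    }

    public static void main(String[] args) {
        String[] hosts = {"node1", "node3"};
        System.out.println(pickNode(hosts, Arrays.asList("node2", "node3"))); // node3
        System.out.println(pickNode(hosts, Arrays.asList("node2", "node4"))); // node2
    }
}
```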