Spark: Creating RDDs, Partitioning, and Split Rules


1. Creating RDDs

1. Creating an RDD from a collection

Spark mainly provides two functions for this: parallelize and makeRDD.

//1. Create an RDD with parallelize()
val rdd: RDD[Int] = sc.parallelize(Array(1, 2, 3, 4, 5, 6, 7, 8))
println("Number of partitions: " + rdd.partitions.size)
//2. Create an RDD with makeRDD()
val rdd1: RDD[Int] = sc.makeRDD(Array(1, 2, 3, 4, 5, 6, 7, 8))

2. Creating an RDD from an external dataset

Sources include the local file system and every dataset Hadoop supports, such as HDFS and HBase.

//Read a file. For a cluster path, use something like: hdfs://hadoop102:9000/input
val lineWordRdd: RDD[String] = sc.textFile("input")
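
textFile also accepts a minPartitions argument. A minimal sketch of requesting more partitions up front (the value 4 is just a placeholder):

//Ask for at least 4 partitions; the actual number of splits can be higher,
//because it is ultimately decided by FileInputFormat.getSplits (see below).
val hdfsRdd: RDD[String] = sc.textFile("hdfs://hadoop102:9000/input", minPartitions = 4)
println("Number of partitions: " + hdfsRdd.partitions.size)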

3. Creating an RDD from another RDD

A new RDD is mainly produced by running a computation on an existing RDD.
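
For example, a minimal sketch that derives a new RDD from the rdd created above:

//map is a transformation: it produces a new RDD from the existing one.
val doubled: RDD[Int] = rdd.map(_ * 2)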

2. Partitioning rules

1. Default partitioning source code (RDD created from a collection)

Conclusions (a sketch follows the list):

  • If a partition count is set, that many partitions are created (numSlices; the source shows this is the number of slices, i.e. partitions, that can be passed in explicitly)
  • If no partition count is specified, the number of CPU cores allocated to the application is used
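
A minimal sketch of both cases (local[4] and the numbers below are assumptions for a quick local test):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CollectionPartitionDemo {
  def main(args: Array[String]): Unit = {
    //Run locally with 4 cores so the default parallelism is predictable.
    val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[4]"))

    //No numSlices given: the default parallelism (here the 4 local cores) applies.
    val byDefault: RDD[Int] = sc.makeRDD(1 to 8)
    println(byDefault.partitions.size) // expected: 4

    //Explicit numSlices: the requested count wins.
    val explicit: RDD[Int] = sc.makeRDD(1 to 8, numSlices = 3)
    println(explicit.partitions.size) // expected: 3

    sc.stop()
  }
}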

2. Partitioning rules when creating an RDD from external files

Conclusion: the default minimum partition count is math.min(number of CPU cores allocated to the application, 2).

The source is as follows:

def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}
TextInputFormat uses LineRecordReader as its line record reader.
minPartitions: the minimum number of partitions; if not specified, the default below is used
=====>  
def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
=====>
override def defaultParallelism(): Int =
  scheduler.conf.getInt("spark.default.parallelism", totalCores)
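
So when spark.default.parallelism is not set, totalCores is used. A minimal sketch of overriding it (the value 8 is an arbitrary choice for illustration):

import org.apache.spark.{SparkConf, SparkContext}

//Setting spark.default.parallelism changes defaultParallelism, and with it
//defaultMinPartitions = math.min(defaultParallelism, 2), which textFile uses.
val conf = new SparkConf()
  .setAppName("demo")
  .setMaster("local[4]")
  .set("spark.default.parallelism", "8")
val sc = new SparkContext(conf)
println(sc.defaultParallelism)   // 8
println(sc.defaultMinPartitions) // 2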

Split planning. Note: getSplits only returns the plan for splitting the file; the actual reading happens in the compute method, which creates a LineRecordReader. Each split's range is defined by two key variables:

start = split.getStart()
end = start + split.getLength()
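
As a rough, self-contained sketch (not the actual Hadoop LineRecordReader code) of how a [start, end) byte range maps to whole lines: a reader whose split does not start at offset 0 skips the partial first line, and every reader finishes the line it is on even if that runs past end, so each line is read exactly once:

//Illustration only; the real logic lives in Hadoop's LineRecordReader.
def linesForSplit(data: String, start: Long, length: Long): Seq[String] = {
  val end = start + length
  //If the split does not begin at the start of the file, skip the (possibly
  //partial) first line -- the previous split's reader has already consumed it.
  var pos =
    if (start == 0) 0L
    else {
      val nl = data.indexOf('\n', start.toInt)
      if (nl < 0) data.length.toLong else nl + 1L
    }
  val out = scala.collection.mutable.ArrayBuffer[String]()
  //Emit lines while they start at or before end; the last line may run past
  //end, which is exactly why the next split skips its partial first line.
  while (pos <= end && pos < data.length) {
    val nl = data.indexOf('\n', pos.toInt)
    val stop = if (nl < 0) data.length else nl
    out += data.substring(pos.toInt, stop)
    pos = stop + 1L
  }
  out.toSeq
}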

3. Creating an RDD from another RDD

The partition count generally follows that of the upstream RDD.
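
A minimal sketch of checking this (reusing the sc from the earlier snippets):

val parent: RDD[Int] = sc.makeRDD(1 to 8, numSlices = 3)
val child: RDD[Int] = parent.map(_ + 1)
//Narrow transformations such as map keep the parent's partition count.
println(parent.partitions.size) // 3
println(child.partitions.size)  // 3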

3. Detailed source code for the split rules (to be expanded later)

1. Creating an RDD by reading files: analysis of partitions and splits

  1. The textFile method:
  • minPartitions: the minimum number of partitions; if no value is passed, it is Math.min(defaultParallelism, 2), which falls back to Math.min(totalCores, 2) when spark.default.parallelism is not set
//1. textFile: the method that reads text files
def textFile(
    path: String,
    minPartitions: Int = defaultMinPartitions): RDD[String] = withScope {
  assertNotStopped()
  hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
    minPartitions).map(pair => pair._2.toString).setName(path)
}

//2. hadoopFile
def hadoopFile[K, V](
    path: String,
    inputFormatClass: Class[_ <: InputFormat[K, V]],
    keyClass: Class[K],
    valueClass: Class[V],
    minPartitions: Int = defaultMinPartitions): RDD[(K, V)] = withScope {
  assertNotStopped()

  // This is a hack to enforce loading hdfs-site.xml.
  // See SPARK-11227 for details.
  FileSystem.getLocal(hadoopConfiguration)

  // A Hadoop configuration can be about 10 KB, which is pretty big, so broadcast it.
  val confBroadcast = broadcast(new SerializableConfiguration(hadoopConfiguration))
  val setInputPathsFunc = (jobConf: JobConf) => FileInputFormat.setInputPaths(jobConf, path)
  new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions).setName(path)
}

//3. The HadoopRDD that processes the text
 new HadoopRDD(
    this,
    confBroadcast,
    Some(setInputPathsFunc),
    inputFormatClass,
    keyClass,
    valueClass,
    minPartitions)
  2. TextInputFormat: handles plain text files; the split logic lives in its parent class, FileInputFormat
/** 
 * An {@link InputFormat} for plain text files.  Files are broken into lines.
 * Either linefeed or carriage-return are used to signal end of line.  Keys are
 * the position in the file, and values are the line of text.. 
 */
@InterfaceAudience.Public
@InterfaceStability.Stable
public class TextInputFormat extends FileInputFormat<LongWritable, Text>
  3. Getting the partition info; the split details are also stored inside each partition object (HadoopPartition)
//1. Get the partitions
override def getPartitions: Array[Partition] = {
  val jobConf = getJobConf()
  // add the credentials here as this can be called before SparkContext initialized
  SparkHadoopUtil.get.addCredentials(jobConf)
  val inputFormat = getInputFormat(jobConf)
  //get the split plan
  val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
  val array = new Array[Partition](inputSplits.size)
  //number of partitions = number of splits
  for (i <- 0 until inputSplits.size) {
    array(i) = new HadoopPartition(id, i, inputSplits(i))
  }
  array
}

//The split object
public FileSplit(Path file, long start, long length, String[] hosts) {
    this.file = file;     //path of the file
    this.start = start;   //start offset of the split within the file
    this.length = length; //number of bytes to read
    this.hosts = hosts;   //hosts holding the file's blocks
  }
 
  4. The split plan in detail
  • The key method for computing the split size:

    Math.max(minSize, Math.min(goalSize, blockSize));

    1. If the configured minimum split size is larger than the block size, the split size is the configured minSize
    2. Otherwise, the split size is whichever of goalSize and blockSize is smaller (as long as it is at least minSize, which defaults to 1)

minSize: the minimum split size; it can be set in the configuration, and defaults to 1 if not set

goalSize: the target split size = total file size / minimum number of partitions

blockSize: the HDFS block size
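
As a worked example with assumed numbers (a single 300 MB file, a 128 MB HDFS block size, and minPartitions = 2):

//Hypothetical numbers for illustration only.
val totalSize = 300L * 1024 * 1024  // one 300 MB file
val goalSize  = totalSize / 2       // minPartitions = 2  => 150 MB
val minSize   = 1L                  // default
val blockSize = 128L * 1024 * 1024  // HDFS block size

//Math.max(minSize, Math.min(goalSize, blockSize))
val splitSize = math.max(minSize, math.min(goalSize, blockSize)) // 128 MB
//300 MB / 128 MB ≈ 2.34, so the plan is three splits: 128 MB, 128 MB, 44 MB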

  • Key code for building the split plan:

The last planned split may be larger than splitSize, but it will not exceed 1.1 times splitSize (a worked example follows the code below).

//SPLIT_SLOP = 1.1
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  String[] splitHosts = getSplitHosts(blkLocations,
      length-bytesRemaining, splitSize, clusterMap);
  splits.add(makeSplit(path, length-bytesRemaining, splitSize,
      splitHosts));
  bytesRemaining -= splitSize;
}

if (bytesRemaining != 0) {
  String[] splitHosts = getSplitHosts(blkLocations, length
      - bytesRemaining, bytesRemaining, clusterMap);
  splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
      splitHosts));
}
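
As a worked example of the 1.1x slop with assumed numbers (splitSize = 128 MB):

//Hypothetical numbers for illustration only.
val splitSize = 128L * 1024 * 1024
val fileA     = 130L * 1024 * 1024 // 130/128 ≈ 1.016 <= 1.1 -> one 130 MB split
val fileB     = 160L * 1024 * 1024 // 160/128 = 1.25  >  1.1 -> 128 MB + 32 MB splits
println(fileA.toDouble / splitSize) // ~1.016
println(fileB.toDouble / splitSize) // 1.25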

The full code:

//4. Hadoop getSplits: computes the splits and builds the split plan
  /** 
   * Splits files returned by {@link #listStatus(JobConf)} when
   * they're too big.*/ 
  public InputSplit[] getSplits(JobConf job, int numSplits)
    throws IOException {
    FileStatus[] files = listStatus(job);
    
    // Save the number of input files for metrics/loadgen
    job.setLong(NUM_INPUT_FILES, files.length);
    long totalSize = 0;                           // compute total size
    for (FileStatus file: files) {                // check we have valid files
      if (file.isDirectory()) {
        throw new IOException("Not a file: "+ file.getPath());
      }
      totalSize += file.getLen();
    }
	//goal split size = total file size / minimum number of partitions
    long goalSize = totalSize / (numSplits == 0 ? 1 : numSplits);
    //minimum split size, defaults to 1
    long minSize = Math.max(job.getLong(org.apache.hadoop.mapreduce.lib.input.
      FileInputFormat.SPLIT_MINSIZE, 1), minSplitSize);

    // generate splits
    ArrayList<FileSplit> splits = new ArrayList<FileSplit>(numSplits);
    NetworkTopology clusterMap = new NetworkTopology();
    for (FileStatus file: files) {
      Path path = file.getPath();
      long length = file.getLen();
      if (length != 0) {
        FileSystem fs = path.getFileSystem(job);
        BlockLocation[] blkLocations;
        if (file instanceof LocatedFileStatus) {
          blkLocations = ((LocatedFileStatus) file).getBlockLocations();
        } else {
          blkLocations = fs.getFileBlockLocations(file, 0, length);
        }
        if (isSplitable(fs, path)) {
        //block size of the file
          long blockSize = file.getBlockSize();
       //computeSplitSize: the actual logic for computing the split size
       //Math.max(minSize, Math.min(goalSize, blockSize));
          long splitSize = computeSplitSize(goalSize, minSize, blockSize);

          long bytesRemaining = length;
          //build the array of planned splits
          while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
            String[] splitHosts = getSplitHosts(blkLocations,
                length-bytesRemaining, splitSize, clusterMap);
            splits.add(makeSplit(path, length-bytesRemaining, splitSize,
                splitHosts));
            bytesRemaining -= splitSize;
          }

          if (bytesRemaining != 0) {
            String[] splitHosts = getSplitHosts(blkLocations, length
                - bytesRemaining, bytesRemaining, clusterMap);
            splits.add(makeSplit(path, length - bytesRemaining, bytesRemaining,
                splitHosts));
          }
        } else {
          String[] splitHosts = getSplitHosts(blkLocations,0,length,clusterMap);
          splits.add(makeSplit(path, 0, length, splitHosts));
        }
      } else { 
        //Create empty hosts array for zero length files
        splits.add(makeSplit(path, 0, length, new String[0]));
      }
    }
    LOG.debug("Total # of splits: " + splits.size());
    return splits.toArray(new FileSplit[splits.size()]);
  }
  
  // 5. compute: reads the records
  override def compute(theSplit: Partition, context: TaskContext): InterruptibleIterator[(K, V)] = {
    val iter = new NextIterator[(K, V)] {

      private val split = theSplit.asInstanceOf[HadoopPartition]
      logInfo("Input split: " + split.inputSplit)
      private val jobConf = getJobConf()

      private val inputMetrics = context.taskMetrics().inputMetrics
      private val existingBytesRead = inputMetrics.bytesRead

      // Sets the thread local variable for the file's name
      split.inputSplit.value match {
        case fs: FileSplit => InputFileNameHolder.setInputFileName(fs.getPath.toString)
        case _ => InputFileNameHolder.unsetInputFileName()
      }

      // Find a function that will return the FileSystem bytes read by this thread. Do this before
      // creating RecordReader, because RecordReader's constructor might read some bytes
      private val getBytesReadCallback: Option[() => Long] = split.inputSplit.value match {
        case _: FileSplit | _: CombineFileSplit =>
          SparkHadoopUtil.get.getFSBytesReadOnThreadCallback()
        case _ => None
      }

      // For Hadoop 2.5+, we get our input bytes from thread-local Hadoop FileSystem statistics.
      // If we do a coalesce, however, we are likely to compute multiple partitions in the same
      // task and in the same thread, in which case we need to avoid override values written by
      // previous partitions (SPARK-13071).
      private def updateBytesRead(): Unit = {
        getBytesReadCallback.foreach { getBytesRead =>
          inputMetrics.setBytesRead(existingBytesRead + getBytesRead())
        }
      }

      private var reader: RecordReader[K, V] = null
      private val inputFormat = getInputFormat(jobConf)
      HadoopRDD.addLocalConfiguration(
        new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(createTime),
        context.stageId, theSplit.index, context.attemptNumber, jobConf)

      reader =
        try {
          inputFormat.getRecordReader(split.inputSplit.value, jobConf, Reporter.NULL)
        } catch {
          case e: IOException if ignoreCorruptFiles =>
            logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
            finished = true
            null
        }
      // Register an on-task-completion callback to close the input stream.
      context.addTaskCompletionListener{ context => closeIfNeeded() }
      private val key: K = if (reader == null) null.asInstanceOf[K] else reader.createKey()
      private val value: V = if (reader == null) null.asInstanceOf[V] else reader.createValue()

      override def getNext(): (K, V) = {
        try {
          finished = !reader.next(key, value)
        } catch {
          case e: IOException if ignoreCorruptFiles =>
            logWarning(s"Skipped the rest content in the corrupted file: ${split.inputSplit}", e)
            finished = true
        }
        if (!finished) {
          inputMetrics.incRecordsRead(1)
        }
        if (inputMetrics.recordsRead % SparkHadoopUtil.UPDATE_INPUT_METRICS_INTERVAL_RECORDS == 0) {
          updateBytesRead()
        }
        (key, value)
      }

      override def close() {
        if (reader != null) {
          InputFileNameHolder.unsetInputFileName()
          // Close the reader and release it. Note: it's very important that we don't close the
          // reader more than once, since that exposes us to MAPREDUCE-5918 when running against
          // Hadoop 1.x and older Hadoop 2.x releases. That bug can lead to non-deterministic
          // corruption issues when reading compressed input.
          try {
            reader.close()
          } catch {
            case e: Exception =>
              if (!ShutdownHookManager.inShutdown()) {
                logWarning("Exception in RecordReader.close()", e)
              }
          } finally {
            reader = null
          }
          if (getBytesReadCallback.isDefined) {
            updateBytesRead()
          } else if (split.inputSplit.value.isInstanceOf[FileSplit] ||
                     split.inputSplit.value.isInstanceOf[CombineFileSplit]) {
            // If we can't get the bytes read from the FS stats, fall back to the split size,
            // which may be inaccurate.
            try {
              inputMetrics.incBytesRead(split.inputSplit.value.getLength)
            } catch {
              case e: java.io.IOException =>
                logWarning("Unable to get input size to set InputMetrics for task", e)
            }
          }
        }
      }
    }
    new InterruptibleIterator[(K, V)](context, iter)
  }