What Exactly Is a Spark Streaming Micro-Batch?

This post is mainly a record of my process of reading the source code; anyone who would like to discuss it is welcome to get in touch.

In receiver mode we have already seen how the receiver gets started on an executor. Next, let's look at what happens after the receiver receives data, that is, how the received records are turned into a micro-batch.
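For context, here is a minimal receiver-based job. It is only a sketch: the app name, host "localhost", port 9999 and the 5-second batch interval are placeholders. socketTextStream is what creates the SocketInputDStream whose receiver code we trace below.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object SocketDemo {
    def main(args: Array[String]): Unit = {
      // local[2]: one core runs the receiver, the other processes the batches.
      val conf = new SparkConf().setAppName("SocketDemo").setMaster("local[2]")
      // Every 5 seconds the blocks collected so far are assembled into one micro-batch.
      val ssc = new StreamingContext(conf, Seconds(5))
      // Creates a SocketInputDStream; its receiver runs on an executor and calls store().
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
      ssc.start()
      ssc.awaitTermination()
    }
  }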

  • This is the code where the SocketInputDStream's receiver receives data; note the store call:
  /** Create a socket connection and receive data until receiver is stopped */
  def receive() {
    try {
      val iterator = bytesToObjects(socket.getInputStream())
      while(!isStopped && iterator.hasNext) {
        store(iterator.next()) // note this call
      }
      if (!isStopped()) {
        restart("Socket data stream had no more data")
      } else {
        logInfo("Stopped receiving")
      }
    } catch {
      case NonFatal(e) =>
        logWarning("Error receiving data", e)
        restart("Error receiving data", e)
    } finally {
      onStop()
    }
  }
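
A side note on bytesToObjects: for socketTextStream it is the line-splitting converter (SocketReceiver.bytesToLines). Conceptually it does something like the simplified sketch below; the real implementation also reads lines from a BufferedReader, just with its own iterator wrapper.

  import java.io.{BufferedReader, InputStream, InputStreamReader}
  import java.nio.charset.StandardCharsets

  // Turn the socket's InputStream into an iterator of lines (simplified sketch).
  def bytesToLines(inputStream: InputStream): Iterator[String] = {
    val reader = new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
    Iterator.continually(reader.readLine()).takeWhile(_ != null)
  }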

Drilling further in: individual records get aggregated into a data block, and only after that step are they put into Spark's memory (the BlockManager).

  /**
   * Store a single item of received data to Spark's memory.
   * These single items will be aggregated together into data blocks before
   * being pushed into Spark's memory.
   */
  def store(dataItem: T) {
    supervisor.pushSingle(dataItem)
  }

Without further ado, let's follow it into ReceiverSupervisorImpl#pushSingle; the logic matches what the comment above just described.

  /** Push a single record of received data into block generator. */
  def pushSingle(data: Any) {
    defaultBlockGenerator.addData(data)
  }

BlockGenerator#addData is where the record gets appended to a buffer.

  /**
   * Push a single data item into the buffer.
   */
  def addData(data: Any): Unit = {
    if (state == Active) {
      waitToPush()
      synchronized {
        if (state == Active) {
          currentBuffer += data
        } else {
          throw new SparkException(
            "Cannot add data as BlockGenerator has not been started or has been stopped")
        }
      }
    } else {
      throw new SparkException(
        "Cannot add data as BlockGenerator has not been started or has been stopped")
    }
  }
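
A quick note on the waitToPush() call above: BlockGenerator extends RateLimiter, so this call blocks once the receiver pushes records faster than the configured limit. A minimal sketch of the related settings (the values are illustrative, not recommendations):

  import org.apache.spark.SparkConf

  val conf = new SparkConf()
    // Hard cap on records per second for each receiver (unset means unlimited).
    .set("spark.streaming.receiver.maxRate", "10000")
    // Or let backpressure derive the rate from recent batch processing statistics.
    .set("spark.streaming.backpressure.enabled", "true")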

Take a look at the class-level comment on BlockGenerator to get the big picture.

/**
 * Generates batches of objects received by a
 * [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
 * named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch of as a block,
 * the other to push the blocks into the block manager.
 *
 * Note: Do not create BlockGenerator instances directly inside receivers. Use
 * `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
 */
Two threads: one periodically seals the records buffered so far into a block (the receiver thread itself just keeps calling addData to append records to that buffer);
the other thread pushes the finished blocks into the BlockManager.
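
Conceptually the pattern looks roughly like this. The sketch below is heavily simplified and uses made-up names (SimpleBlockGenerator, sealCurrentBuffer); it is not Spark's actual code, and error handling and shutdown are omitted.

  import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
  import scala.collection.mutable.ArrayBuffer

  class SimpleBlockGenerator(blockIntervalMs: Long, pushBlock: Seq[Any] => Unit) {
    private var currentBuffer = new ArrayBuffer[Any]
    private val blocksForPushing = new ArrayBlockingQueue[Seq[Any]](10)
    private val timer = Executors.newSingleThreadScheduledExecutor()
    private val pusher = Executors.newSingleThreadExecutor()

    // Called by the receiver thread for every record (cf. addData above).
    def addData(data: Any): Unit = synchronized { currentBuffer += data }

    // Swap out the current buffer and queue it as a finished "block".
    private def sealCurrentBuffer(): Unit = {
      val block = synchronized {
        val snapshot = currentBuffer
        currentBuffer = new ArrayBuffer[Any]
        snapshot
      }
      if (block.nonEmpty) blocksForPushing.put(block)
    }

    def start(): Unit = {
      // Thread 1: every blockIntervalMs, turn the buffered records into a block.
      timer.scheduleAtFixedRate(new Runnable {
        def run(): Unit = sealCurrentBuffer()
      }, blockIntervalMs, blockIntervalMs, TimeUnit.MILLISECONDS)

      // Thread 2: drain finished blocks and hand each one to the block manager.
      pusher.submit(new Runnable {
        def run(): Unit = while (true) pushBlock(blocksForPushing.take())
      })
    }
  }

Spark's real BlockGenerator does the same buffer swap under a lock, but additionally notifies its BlockGeneratorListener at each step and flushes any remaining blocks when it is stopped.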

As the line below shows, by default the buffered data is sealed into a block every 200 ms, and this interval can be tuned.

private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
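
This knob matters because each block becomes one partition (and hence one task) of the batch's RDD, so each receiver contributes roughly batchInterval / blockInterval blocks per micro-batch. A hedged example (values are illustrative; the Spark docs recommend not going below about 50 ms):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  // With a 2 s batch interval and a 100 ms block interval, each receiver yields
  // roughly 2000 / 100 = 20 blocks per batch, i.e. about 20 partitions per micro-batch.
  val conf = new SparkConf()
    .setAppName("BlockIntervalDemo") // name and master are placeholders
    .setMaster("local[2]")
    .set("spark.streaming.blockInterval", "100ms")
  val ssc = new StreamingContext(conf, Seconds(2))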

Putting the block into the BlockManager (BlockGenerator#pushBlock):

private def pushBlock(block: Block) {
    // Set a breakpoint here and step in; the listener is the one registered by ReceiverSupervisorImpl.
    listener.onPushBlock(block.id, block.buffer)
    logInfo("Pushed block " + block.id)
}
// The BlockGeneratorListener implemented inside ReceiverSupervisorImpl:
def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
    pushArrayBuffer(arrayBuffer, None, Some(blockId))
}
...
// Here is the report: the receiver has to report block metadata to the driver (the ReceiverTracker).
pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)

After the block is stored via the BlockManager (receivedBlockHandler.storeBlock), it is wrapped into a ReceivedBlockInfo carrying the stream id, record count and store result, and sent to the trackerEndpoint on the driver.

  /** Store block and report it to driver */
  def pushAndReportBlock(
      receivedBlock: ReceivedBlock,
      metadataOption: Option[Any],
      blockIdOption: Option[StreamBlockId]
    ) {
    val blockId = blockIdOption.getOrElse(nextBlockId)
    val time = System.currentTimeMillis
    val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
    logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
    val numRecords = blockStoreResult.numRecords
    val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
    trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))
    logDebug(s"Reported block $blockId")
  }
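
To tie this back to the micro-batch in the title, here is a hedged summary of the driver side, which the code above only touches through trackerEndpoint: the ReceiverTracker records every AddBlock in its ReceivedBlockTracker, and each time the batch timer fires, all blocks received since the previous batch are allocated to the new batch time; the batch's RDD then has one partition per block. A rough conceptual sketch with a made-up class, not Spark's code:

  import scala.collection.mutable

  // Stand-in for the reported block metadata (the real type is ReceivedBlockInfo).
  case class BlockInfoSketch(blockId: String, numRecords: Option[Long])

  class SimpleBlockTracker {
    private val unallocated = mutable.ArrayBuffer[BlockInfoSketch]()
    private val byBatch = mutable.Map[Long, Seq[BlockInfoSketch]]()

    // Driver-side handling of the AddBlock message seen above.
    def addBlock(info: BlockInfoSketch): Unit = synchronized { unallocated += info }

    // Called once per batch interval: everything received since the last batch
    // becomes this batch's set of blocks, i.e. the micro-batch.
    def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
      byBatch(batchTime) = unallocated.toList
      unallocated.clear()
    }

    def blocksOfBatch(batchTime: Long): Seq[BlockInfoSketch] =
      synchronized(byBatch.getOrElse(batchTime, Seq.empty))
  }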

The diagram below summarizes the whole flow: store() → pushSingle → addData, a new block every blockInterval, the block stored in the BlockManager, and its metadata reported to the ReceiverTracker on the driver, where it gets assembled into the next micro-batch.