This article mainly documents my process of reading the source code; anyone who wants to discuss it is welcome to get in touch.
In receiver mode we have already seen how the receiver is started on an executor. Next, let's look at what happens after the receiver gets the data, in other words, how the received records are turned into micro-batches.
- This is the code in SocketReceiver (the receiver created by SocketInputDStream) that receives data; note the call to store():
/** Create a socket connection and receive data until receiver is stopped */
def receive() {
  try {
    val iterator = bytesToObjects(socket.getInputStream())
    while(!isStopped && iterator.hasNext) {
      store(iterator.next()) // note: every record goes through store()
    }
    if (!isStopped()) {
      restart("Socket data stream had no more data")
    } else {
      logInfo("Stopped receiving")
    }
  } catch {
    case NonFatal(e) =>
      logWarning("Error receiving data", e)
      restart("Error receiving data", e)
  } finally {
    onStop()
  }
}
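For comparison, here is a minimal sketch of a user-defined receiver (a hypothetical LineReceiver, not Spark source); whatever a receiver reads, every record is funneled through the same store() call shown above.

import java.net.Socket
import java.nio.charset.StandardCharsets
import scala.io.Source
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class LineReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  override def onStart(): Unit = {
    // Receive on a separate thread so that onStart() returns immediately.
    new Thread("line-receiver") {
      override def run(): Unit = {
        val socket = new Socket(host, port)
        try {
          val lines = Source.fromInputStream(
            socket.getInputStream, StandardCharsets.UTF_8.name).getLines()
          while (!isStopped() && lines.hasNext) {
            store(lines.next()) // same entry point as SocketReceiver above
          }
          restart("Socket stream ended, restarting receiver")
        } finally {
          socket.close()
        }
      }
    }.start()
  }

  override def onStop(): Unit = {} // the receiving thread exits once isStopped() is true
}

It would be hooked up with ssc.receiverStream(new LineReceiver(host, port)), and from store() onwards the path below is identical.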
Digging further in: individual records are aggregated into a data block, and only after that step is the block pushed into Spark's memory (the BlockManager).
/**
* Store a single item of received data to Spark's memory.
* These single items will be aggregated together into data blocks before
* being pushed into Spark's memory.
*/
def store(dataItem: T) {
  supervisor.pushSingle(dataItem)
}
Without further ado, let's step into ReceiverSupervisorImpl#pushSingle; the logic is exactly what the comment above describes.
/** Push a single record of received data into block generator. */
def pushSingle(data: Any) {
  defaultBlockGenerator.addData(data)
}
BlockGenerator#addData appends the record to a buffer:
/**
* Push a single data item into the buffer.
*/
def addData(data: Any): Unit = {
  if (state == Active) {
    waitToPush()
    synchronized {
      if (state == Active) {
        currentBuffer += data
      } else {
        throw new SparkException(
          "Cannot add data as BlockGenerator has not been started or has been stopped")
      }
    }
  } else {
    throw new SparkException(
      "Cannot add data as BlockGenerator has not been started or has been stopped")
  }
}
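The waitToPush() call comes from the RateLimiter base class of BlockGenerator: it blocks the calling (receiver) thread once records arrive faster than the configured rate. A minimal sketch of capping that rate, assuming the standard spark.streaming.* config keys:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("receiver-rate-limit")
  // At most 1000 records per second per receiver; unset means unlimited.
  .set("spark.streaming.receiver.maxRate", "1000")
  // Optionally let backpressure adjust the effective rate at runtime.
  .set("spark.streaming.backpressure.enabled", "true")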
Now take a look at the class-level comment on BlockGenerator to get an overall picture:
/**
* Generates batches of objects received by a
* [[org.apache.spark.streaming.receiver.Receiver]] and puts them into appropriately
* named blocks at regular intervals. This class starts two threads,
 * one to periodically start a new batch and prepare the previous batch as a block,
* the other to push the blocks into the block manager.
*
* Note: Do not create BlockGenerator instances directly inside receivers. Use
* `ReceiverSupervisor.createBlockGenerator` to create a BlockGenerator and use it.
*/
Two threads: one periodically takes whatever has accumulated in the buffer and packages it as a block; the other takes the finished blocks and stores them in the BlockManager (a simplified sketch of this pattern follows below).
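To make the division of labour concrete, here is a simplified sketch of the same two-thread pattern (my own illustration, not the actual BlockGenerator code): a timer periodically swaps the current buffer out as a block, and a pusher thread drains the finished blocks.

import java.util.concurrent.{ArrayBlockingQueue, Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

object TwoThreadSketch {
  private var currentBuffer = new ArrayBuffer[Any]
  private val blocksForPushing = new ArrayBlockingQueue[ArrayBuffer[Any]](10)

  // Called by the receiver thread (the analogue of addData above).
  def addData(data: Any): Unit = synchronized { currentBuffer += data }

  // Swap the buffer that has filled up so far for an empty one.
  private def swapBuffer(): ArrayBuffer[Any] = synchronized {
    val full = currentBuffer
    currentBuffer = new ArrayBuffer[Any]
    full
  }

  def start(blockIntervalMs: Long): Unit = {
    // Thread 1: every blockIntervalMs, package the accumulated records as a "block".
    Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(new Runnable {
      def run(): Unit = {
        val block = swapBuffer()
        if (block.nonEmpty) blocksForPushing.put(block)
      }
    }, blockIntervalMs, blockIntervalMs, TimeUnit.MILLISECONDS)

    // Thread 2: drain the finished blocks and hand them to storage.
    new Thread("block-pusher") {
      override def run(): Unit = while (true) {
        val block = blocksForPushing.take()
        println(s"pushing block with ${block.size} records") // stands in for the BlockManager push
      }
    }.start()
  }
}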
As the config below shows, a block is cut every 200 ms by default, and this interval can be tuned.
private val blockIntervalMs = conf.getTimeAsMs("spark.streaming.blockInterval", "200ms")
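A practical consequence of the block interval: each block becomes one partition of that batch's RDD, so the number of tasks per receiver per batch is roughly batch interval divided by block interval. A minimal tuning sketch (hypothetical values):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("block-interval-tuning")
  .setMaster("local[2]") // at least one core for the receiver plus one for processing
  // Cut a block every 100 ms instead of the 200 ms default:
  // a 2 s batch then yields about 2000 / 100 = 20 partitions per receiver.
  .set("spark.streaming.blockInterval", "100ms")

val ssc = new StreamingContext(conf, Seconds(2))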
Putting the block into the BlockManager (BlockGenerator#pushBlock):
private def pushBlock(block: Block) {
  // Set a breakpoint here and step in.
  listener.onPushBlock(block.id, block.buffer)
  logInfo("Pushed block " + block.id)
}
def onPushBlock(blockId: StreamBlockId, arrayBuffer: ArrayBuffer[_]) {
  pushArrayBuffer(arrayBuffer, None, Some(blockId))
}
...
// Here comes the "report" part: the receiver has to report block info to the driver (ReceiverTracker).
pushAndReportBlock(ArrayBufferBlock(arrayBuffer), metadataOption, blockIdOption)
After the block has been stored (by default through the BlockManager), it is wrapped once more as ReceivedBlockInfo, which carries the stream id, record count and storage result, and is sent to the trackerEndpoint on the driver:
/** Store block and report it to driver */
def pushAndReportBlock(
    receivedBlock: ReceivedBlock,
    metadataOption: Option[Any],
    blockIdOption: Option[StreamBlockId]
  ) {
  val blockId = blockIdOption.getOrElse(nextBlockId)
  val time = System.currentTimeMillis
  val blockStoreResult = receivedBlockHandler.storeBlock(blockId, receivedBlock)
  logDebug(s"Pushed block $blockId in ${(System.currentTimeMillis - time)} ms")
  val numRecords = blockStoreResult.numRecords
  val blockInfo = ReceivedBlockInfo(streamId, numRecords, metadataOption, blockStoreResult)
  trackerEndpoint.askSync[Boolean](AddBlock(blockInfo))
  logDebug(s"Reported block $blockId")
}
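On the driver side, the ReceiverTracker passes the AddBlock message on to its ReceivedBlockTracker, which queues the block info per stream; at every batch interval the queued blocks are bound to that batch, and their block ids become the partitions of the batch's RDD. That is exactly how the received data ends up as a micro-batch. A simplified sketch of that bookkeeping (my own illustration, not Spark's actual code):

import scala.collection.mutable

case class BlockInfoSketch(streamId: Int, blockId: String, numRecords: Option[Long])

class BlockTrackerSketch {
  // Blocks reported by receivers that no batch has claimed yet, keyed by stream.
  private val unallocated = mutable.Map[Int, mutable.Queue[BlockInfoSketch]]()
  // batchTime -> blocks per stream, once a batch has claimed them.
  private val allocated = mutable.Map[Long, Map[Int, Seq[BlockInfoSketch]]]()

  // The analogue of handling the AddBlock message sent by pushAndReportBlock above.
  def addBlock(info: BlockInfoSketch): Boolean = synchronized {
    unallocated.getOrElseUpdate(info.streamId, mutable.Queue.empty) += info
    true
  }

  // Called once per batch interval: everything reported since the last batch
  // belongs to this batch; the block ids later back the batch's RDD partitions.
  def allocateBlocksToBatch(batchTime: Long): Unit = synchronized {
    val blocksPerStream = unallocated.map { case (streamId, queue) =>
      streamId -> queue.dequeueAll(_ => true).toSeq
    }.toMap
    allocated(batchTime) = blocksPerStream
  }

  def blocksOfBatch(batchTime: Long): Map[Int, Seq[BlockInfoSketch]] =
    allocated.getOrElse(batchTime, Map.empty)
}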
Finally, take a look at the diagram below, which summarizes the flow: Receiver.store() feeds the BlockGenerator, the BlockGenerator turns its buffer into a block every block interval and pushes it to the BlockManager, and each block is reported to the ReceiverTracker on the driver.
