Take val socket8888 = ssc.socketTextStream("localhost", 8888) as an example.
- ssc.socketTextStream internally does new SocketInputDStream(...), i.e. it constructs an input stream that receives socket data. InputDStream is its ancestor (the "grandparent" class). InputDStream is the top-level abstract class (classes like this deserve a careful read of the source comments); every receiver-based stream inherits from it, and its initialization code contains the line ssc.graph.addInputStream(this). In other words, every time a receiver stream is created, it registers itself with the Spark Streaming graph.
- Part of the InputDStream source comment, which is very clearly written: this kind of input stream generates RDDs on the driver side. The second half also says that input streams which need a receiver running on a worker should use ReceiverInputDStream instead.
Input streams that can generate RDDs from new data by running a service/thread only on the driver node (that is, without running a receiver on worker nodes), can be implemented by directly inheriting this InputDStream. For example, FileInputDStream, a subclass of InputDStream, monitors a HDFS directory from the driver for new files and generates RDDs with the new files.
For implementing input streams that requires running a receiver on the worker nodes, use org.apache.spark.streaming.dstream.ReceiverInputDStream as the parent class.
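For orientation, here is a condensed sketch of the class chain behind socketTextStream and the registration line mentioned above (trimmed from the Spark 2.x sources; constructors are simplified and almost all members are omitted, so treat it as a map of the hierarchy rather than the exact code):

// Every input stream registers itself with the streaming graph when it is constructed.
abstract class InputDStream[T: ClassTag](_ssc: StreamingContext) extends DStream[T](_ssc) {
  ssc.graph.addInputStream(this)
}

// Receiver-based streams only need to say which Receiver should run on a worker.
abstract class ReceiverInputDStream[T: ClassTag](_ssc: StreamingContext)
  extends InputDStream[T](_ssc) {
  def getReceiver(): Receiver[T]
}

class SocketInputDStream[T: ClassTag](
    _ssc: StreamingContext,
    host: String,
    port: Int,
    bytesToObjects: InputStream => Iterator[T],
    storageLevel: StorageLevel)
  extends ReceiverInputDStream[T](_ssc) {

  def getReceiver(): Receiver[T] = {
    new SocketReceiver(host, port, bytesToObjects, storageLevel)
  }
}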
How receivers are assigned to executors
- The driver side has a ReceiverTracker; its source comment says it manages the execution of the receivers.
- Source-reading path:
StreamingContext#start
JobScheduler#start (JobScheduler is one of the key components here)
receiverTracker#start
receiverTracker#launchReceivers
which sends a request: endpoint.send(StartAllReceivers(receivers)). The receivers here are exactly the ones registered with the graph earlier.
(part of the RPC round trip is omitted here)
ReceiverTrackerEndpoint#receive#StartAllReceivers: the endpoint receives the request and starts all the receivers.
val scheduledLocations = schedulingPolicy.scheduleReceivers(receivers, getExecutors). This is where the scheduling policy that decides on which executor each receiver starts comes into play. The ReceiverSchedulingPolicy source comment gives a brief summary:
ReceiverTracker will call scheduleReceivers at this phase.
(This is the point in the code path we just walked through.)
It will try to schedule receivers such that they are evenly distributed.
(That is, it tries to spread the receivers evenly across the executors.)
Then when a receiver is starting, it will send a register request and ReceiverTracker.registerReceiver will be called.
(Once a receiver has started on an executor, it registers itself back with the ReceiverTracker on the driver.)
- The ReceiverSchedulingPolicy#scheduleReceivers source comment:
Try our best to schedule receivers with evenly distributed.
(Do our best to distribute all the receivers evenly.)
However, if the preferredLocations of receivers are not even, we may not be able to schedule them evenly because we have to respect them.
(If receivers have preferredLocations set, an even distribution may not be achievable, because those locations must be respected.)
Here is the approach to schedule executors: (the concrete steps)
1. First, schedule all the receivers with preferred locations (hosts), evenly among the executors running on those host.
(First, place the receivers that do have preferred locations.)
2. Then, schedule all other receivers evenly among all the executors such that overall distribution over all the receivers is even.
(Then spread the remaining receivers, those without preferred locations, evenly.)
This method is called when we start to launch receivers at the first time.
(This method is used the first time the receivers are launched.)
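To make the two phases concrete, here is a small self-contained toy version of this scheduling idea (this is not Spark's ReceiverSchedulingPolicy code; ToyReceiver, the "host:slot" executor strings, and the tie-breaking are all invented for the illustration):

import scala.collection.mutable

object ToyReceiverScheduling {
  // A receiver may pin itself to a host via preferredLocation (None means "anywhere").
  case class ToyReceiver(id: Int, preferredLocation: Option[String])

  // Executors are identified here simply as "host:slot" strings.
  def schedule(receivers: Seq[ToyReceiver], executors: Seq[String]): Map[Int, String] = {
    val executorsByHost = executors.groupBy(_.split(":")(0))
    val load = mutable.Map(executors.map(_ -> 0): _*)

    def pickLeastLoaded(candidates: Seq[String]): String = {
      val chosen = candidates.minBy(load) // a Map is a function, so this compares current loads
      load(chosen) += 1
      chosen
    }

    // Phase 1: receivers with a preferred host go to the least-loaded executor on that host.
    val pinned = for {
      r <- receivers
      host <- r.preferredLocation
      candidates <- executorsByHost.get(host)
    } yield r.id -> pickLeastLoaded(candidates)

    // Phase 2: everything else is spread over all executors so the overall load stays even.
    val unpinned = receivers.filterNot(r => r.preferredLocation.exists(executorsByHost.contains))
    val free = unpinned.map(r => r.id -> pickLeastLoaded(executors))

    (pinned ++ free).toMap
  }

  def main(args: Array[String]): Unit = {
    val executors = Seq("hostA:0", "hostA:1", "hostB:0", "hostC:0")
    val receivers = Seq(
      ToyReceiver(0, Some("hostA")),
      ToyReceiver(1, None),
      ToyReceiver(2, None))
    // Prints one executor per receiver, with the load spread as evenly as possible.
    println(schedule(receivers, executors))
  }
}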
How a receiver is started once it has been assigned
- ReceiverTracker#startReceiver starts the receiver; note that at this point the code is still running on the driver.
// This function is what actually starts the receiver on an executor.
val startReceiverFunc: Iterator[Receiver[_]] => Unit =
  (iterator: Iterator[Receiver[_]]) => {
    if (!iterator.hasNext) {
      throw new SparkException(
        "Could not start receiver as object not found.")
    }
    if (TaskContext.get().attemptNumber() == 0) {
      // A single next() call, no loop: the iterator holds exactly one receiver.
      val receiver = iterator.next()
      assert(iterator.hasNext == false)
      // A new component shows up here: the supervisor, which wraps the receiver.
      val supervisor = new ReceiverSupervisorImpl(
        receiver, SparkEnv.get, serializableHadoopConf.value, checkpointDirOption)
      // The real logic lives in here; this call leads to the code pasted below.
      supervisor.start()
      supervisor.awaitTermination()
    } else {
      // It's restarted by TaskScheduler, but we want to reschedule it again. So exit it.
    }
  }
/** Start receiver */
def startReceiver(): Unit = synchronized {
  try {
    if (onReceiverStart()) {
      logInfo(s"Starting receiver $streamId")
      receiverState = Started
      // onStart() is where the concrete receiving logic lives.
      // For example, SocketReceiver starts reading network data with blocking I/O here.
      receiver.onStart()
      logInfo(s"Called receiver $streamId onStart")
    } else {
      // The driver refused us
      stop("Registered unsuccessfully because Driver refused to start receiver " + streamId, None)
    }
  } catch {
    case NonFatal(t) =>
      stop("Error starting receiver " + streamId, Some(t))
  }
}
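To make the onStart() remark concrete, here is a minimal line-based socket receiver in the same spirit as SocketReceiver (a sketch written against the public Receiver API only, not the actual SocketReceiver source; the class name and storage level are chosen for the example):

import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SimpleSocketReceiver(host: String, port: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER_2) {

  // onStart() must not block: spawn a reading thread and return immediately.
  def onStart(): Unit = {
    new Thread("Simple Socket Receiver") {
      override def run(): Unit = receive()
    }.start()
  }

  // The supervisor handles stopping; the reading thread checks isStopped().
  def onStop(): Unit = { }

  private def receive(): Unit = {
    var socket: Socket = null
    try {
      socket = new Socket(host, port)
      val reader = new BufferedReader(
        new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))
      var line = reader.readLine()
      while (!isStopped() && line != null) {
        store(line) // hand each record to the supervisor, which puts it into the BlockManager
        line = reader.readLine()
      }
      reader.close()
      socket.close()
      // Ask the supervisor to restart us and try connecting again.
      restart("Trying to connect again")
    } catch {
      case e: java.net.ConnectException =>
        restart(s"Error connecting to $host:$port", e)
      case t: Throwable =>
        restart("Error receiving data", t)
    }
  }
}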
So how does startReceiverFunc actually end up running on an executor?
- An RDD is created. One of the five defining properties of an RDD is its preferred locations, and here those are exactly the receiver's preferred locations from the scheduling step above. The whole idea of preferred locations is that computation moves toward data.
// Create the RDD using the scheduledLocations to run the receiver in a Spark job
val receiverRDD: RDD[Receiver[_]] =
  // Two cases: without preferred locations and with preferred locations.
  if (scheduledLocations.isEmpty) {
    ssc.sc.makeRDD(Seq(receiver), 1)
  } else {
    val preferredLocations = scheduledLocations.map(_.toString).distinct
    ssc.sc.makeRDD(Seq(receiver -> preferredLocations))
  }
receiverRDD.setName(s"Receiver $receiverId")
ssc.sparkContext.setJobDescription(s"Streaming job running receiver $receiverId")
ssc.sparkContext.setCallSite(Option(ssc.getStartSite()).getOrElse(Utils.getCallSite()))
// Here comes submitJob: this is plain Spark Core!
val future = ssc.sparkContext.submitJob[Receiver[_], Unit, Unit](
  receiverRDD, startReceiverFunc, Seq(0), (_, _) => Unit, ())
At first I could not make sense of this: you build an RDD, attach a function, submit a job, and the receiver is running? The real gap was that I had never truly understood Spark Core's submitJob.
Compare it with an ordinary computation:
sc.makeRDD(1 to 10).foreach(println)
This is the kind of code we write all the time, and we understand that foreach is applied to every record in every partition of the RDD. Now look at the source comment of submitJob:
* @param rdd target RDD to run tasks on
* @param processPartition a function to run on each partition of the RDD
Now map that back:
- receiverRDD is sc.makeRDD(Seq(receiver), 1): its only element is the receiver, and it has exactly one partition.
- startReceiverFunc is the function applied to the data in that RDD, which here is the receiver itself!
So in the end it is simply startReceiverFunc being applied to the receiver, as a normal one-task Spark job.
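To convince yourself this is ordinary Spark Core behaviour, here is a small standalone analogy (the app, oneElementRdd and runOnExecutor are made-up names; it relies only on the public SparkContext.submitJob API):

import org.apache.spark.{SparkConf, SparkContext}

object SubmitJobAnalogy {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("submitJob-analogy").setMaster("local[2]"))

    // One element, one partition: the "payload" plays the role of the receiver.
    val oneElementRdd = sc.makeRDD(Seq("payload"), 1)

    // This plays the role of startReceiverFunc: it runs inside the single task
    // (in local mode in a local thread; on a cluster, on the executor that gets partition 0).
    val runOnExecutor: Iterator[String] => Unit =
      (it: Iterator[String]) => it.foreach(x => println(s"task is running with: $x"))

    // Submit a job over partition 0 only; the returned future completes when the task finishes.
    val future = sc.submitJob[String, Unit, Unit](
      oneElementRdd, runOnExecutor, Seq(0), (_, _) => (), ())

    scala.concurrent.Await.result(future, scala.concurrent.duration.Duration.Inf)
    sc.stop()
  }
}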
Problems you may run into
After a receiver has received data, the data goes through a few steps to become RDD blocks stored in the local BlockManager. When the subsequent computation runs, computation moves toward data, so an uneven distribution of receivers leads to an uneven distribution of tasks.
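A common way to soften this (a general suggestion, not something covered by the source walked through above) is to create several receivers, union them, and repartition the received DStream before the heavy computation, so the blocks get shuffled across the cluster; the ports and partition count below are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object EvenReceiverExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("even-receivers"), Seconds(5))

    // Several receivers instead of one, so the received blocks are not all on a single node.
    val streams = Seq(8881, 8882, 8883).map(port => ssc.socketTextStream("localhost", port))
    val unioned = ssc.union(streams)

    // Repartitioning shuffles the received blocks across the cluster before the heavy work,
    // decoupling task placement from where the receivers happen to run.
    val rebalanced = unioned.repartition(12)

    rebalanced.flatMap(_.split(" ")).map((_, 1L)).reduceByKey(_ + _).print()

    ssc.start()
    ssc.awaitTermination()
  }
}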