RDD Summary

Spark's two main abstractions

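resilient distributed dataset (RDD): a collection of elements partitioned across the cluster nodes so that computations on it run in parallel
shared variables: variables shared across tasks, or between tasks and the driver program
- broadcast variables
- accumulators: variables updated across nodes and tasks, with the result read back in the driver program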
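As a rough illustration of a broadcast variable, the sketch below ships a small lookup table to the executors once and reads it inside `map`. The object name `BroadcastSketch`, the `local[2]` master, and the lookup table are assumptions made for the example, not part of the original notes.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastSketch {
  // Returns the country name looked up for each code.
  def run(): Seq[String] = {
    // local[2] runs Spark inside this JVM with two threads (assumption for the sketch).
    val conf = new SparkConf().setAppName("BroadcastSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical read-only lookup table, broadcast to the executors.
      val countryNames = sc.broadcast(Map("cn" -> "China", "fr" -> "France"))
      val codes = sc.parallelize(Seq("cn", "fr", "cn"))
      // Tasks read the broadcast value through .value instead of capturing the Map itself.
      codes.map(code => countryNames.value.getOrElse(code, "unknown")).collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run().mkString(", "))
}
```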

Ways to create an RDD

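Call SparkContext's parallelize on an existing collection in the driver program
Read an external dataset

Note: when a cluster reads a local file, the file must be available at the same path on every worker node. textFile accepts a directory, wildcards, and compressed files: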
textFile("/my/directory") 
textFile("/my/directory/*.txt")
textFile("/my/directory/*.gz")

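SparkContext.wholeTextFiles reads a directory of many small text files and returns each one as a (filename, content) pair
SequenceFiles and serialized object files can be read as well (SparkContext.sequenceFile, SparkContext.objectFile)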
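The creation methods above can be sketched as follows. The temporary directory, the file names, and the `CreateRddSketch` object are made up for the example; on a real cluster the paths would have to be reachable from every worker.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}

object CreateRddSketch {
  // Returns (elements from the collection, lines read, files read).
  def run(): (Long, Long, Long) = {
    val conf = new SparkConf().setAppName("CreateRddSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // 1. From an existing collection in the driver program.
      val fromCollection = sc.parallelize(Seq(1, 2, 3, 4, 5))

      // 2. From external files: write two small sample files first (hypothetical data).
      val dir = Files.createTempDirectory("rdd-sketch")
      Files.write(dir.resolve("a.txt"), "line 1\nline 2\n".getBytes("UTF-8"))
      Files.write(dir.resolve("b.txt"), "line 3\n".getBytes("UTF-8"))

      // textFile yields one element per line; wildcards are allowed in the path.
      val lines = sc.textFile(dir.toString + "/*.txt")
      // wholeTextFiles yields one (path, content) pair per file.
      val files = sc.wholeTextFiles(dir.toString)

      (fromCollection.count(), lines.count(), files.count())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```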

RDD operations

transformations
- lazy: nothing is computed until an action asks for a result
actions
persist or cache
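A minimal sketch of these three steps, assuming a local master. The accumulator `mapCalls` is used here only to observe that the `map` function has not run before the first action; all names are illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationSketch {
  // Returns (map calls seen before any action, sum of lengths, longest length).
  def run(): (Long, Int, Int) = {
    val conf = new SparkConf().setAppName("LazyEvaluationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val mapCalls = sc.longAccumulator("mapCalls")

      // Transformation: only recorded in the lineage, not executed yet.
      val lengths = sc.parallelize(Seq("a", "bb", "ccc")).map { s =>
        mapCalls.add(1)
        s.length
      }
      val callsBeforeAction = mapCalls.value.longValue()

      // Ask Spark to keep the computed partitions in memory for later actions.
      lengths.cache()

      // Actions: these trigger the computation and return values to the driver.
      val total = lengths.reduce(_ + _)
      val longest = lengths.reduce((a, b) => math.max(a, b))

      (callsBeforeAction, total, longest)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = println(run())
}
```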

Passing functions to Spark

Pass static methods defined in a global singleton object

object MyFunc {
  def func1(s: String): String = { ... }
}

myRdd.map(MyFunc.func1)

Pass a method of a class instance, captured through the closure (the function uses variables or objects defined outside of it)

class MyClass {
  def func1(s: String): String = { ... }
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(func1) }
}

Here, if we create a new `MyClass` instance and call `doStuff` on it, the `map` inside there references the `func1` method *of that `MyClass` instance*, so the whole object needs to be sent to the cluster. It is similar to writing `rdd.map(x => this.func1(x))`.

Similarly, accessing a field of the enclosing object references the whole object:

class MyClass {
  val field = "Hello"
  def doStuff(rdd: RDD[String]): RDD[String] = { rdd.map(x => field + x) }
}
This is equivalent to writing `rdd.map(x => this.field + x)`, which references all of `this`.

Both forms capture the entire class instance. The simplest way to avoid this is to copy `field` into a local variable instead of accessing it externally:

def doStuff(rdd: RDD[String]): RDD[String] = {
  val field_ = this.field
  rdd.map(x => field_ + x)
}
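To see the difference in a runnable form, here is a small sketch with a class that is deliberately not Serializable. `Greeter`, `ClosureCaptureSketch`, and the sample strings are hypothetical names for this example. Only the local-copy version is actually called; the version that captures `this` is kept for comparison and would typically be rejected by Spark with a "Task not serializable" error.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Deliberately not Serializable, so capturing `this` in a closure cannot work.
class Greeter {
  val field = "Hello, "

  // Captures `this` because `field` really means `this.field`.
  // Not called below; shown only for comparison.
  def doStuffCapturingThis(rdd: RDD[String]): RDD[String] =
    rdd.map(x => field + x)

  // Copies the field into a local val, so only the String goes into the closure.
  def doStuff(rdd: RDD[String]): RDD[String] = {
    val field_ = this.field
    rdd.map(x => field_ + x)
  }
}

object ClosureCaptureSketch {
  def run(): Seq[String] = {
    val conf = new SparkConf().setAppName("ClosureCaptureSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val words = sc.parallelize(Seq("Spark", "RDD"))
      new Greeter().doStuff(words).collect().toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = run().foreach(println)
}
```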

Closures

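A closure captures the variables in its scope, which is what lets the computation be shipped to other nodes in the cluster. Each executor, however, receives its own copy of those variables inside the serialized closure, not the variable defined on the driver node. Updates made inside the closure change only the executor's copy, so once the function finishes the driver's variable still holds its original value (a counter that started at 0 stays at 0). To avoid this, use a shared variable: an Accumulator.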

Accumulators in Spark are used specifically to provide a mechanism for safely updating a variable when execution is split up across worker nodes in a cluster.
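The counter problem and the accumulator fix can be sketched side by side. This assumes a local master; in local mode the plain `var` may behave differently from a real cluster, so the sketch relies only on the accumulator's value. The object name is illustrative.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AccumulatorSketch {
  // Returns (value of the plain driver-side counter, value of the accumulator).
  def run(): (Int, Long) = {
    val conf = new SparkConf().setAppName("AccumulatorSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 100)

      // Problematic: each task updates its own deserialized copy of `counter`,
      // so the driver's variable is not a reliable place to read the result.
      var counter = 0
      rdd.foreach(x => counter += x)

      // Accumulator: tasks add to it, and the driver reads the merged value.
      val sum = sc.longAccumulator("sum")
      rdd.foreach(x => sum.add(x))

      (counter, sum.value.longValue())
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (counter, sum) = run()
    println(s"plain var: $counter, accumulator: $sum")
  }
}
```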

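👀 The same applies beyond accumulation: on a cluster, rdd.foreach(println) does not print the elements on the driver node, because the output goes to each executor's stdout. To print the elements of an RDD that is spread over the nodes, first bring them back to the driver with rdd.collect().foreach(println). For a large RDD, rdd.take(100).foreach(println) is safer, since collect() can exhaust the driver's memory.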
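A short sketch of the printing options, again assuming a local master (where the executor output happens to land in the same console). `PrintSketch` is an illustrative name.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PrintSketch {
  // Prints the RDD in the ways discussed above and returns the first three elements.
  def run(): Seq[Int] = {
    val conf = new SparkConf().setAppName("PrintSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val rdd = sc.parallelize(1 to 10)

      // Runs on the executors: on a cluster this writes to their stdout, not the driver's.
      rdd.foreach(println)

      // Brings every element back to the driver before printing.
      rdd.collect().foreach(println)

      // Brings back only a few elements, which is safer for a large RDD.
      val firstThree = rdd.take(3)
      firstThree.foreach(println)
      firstThree.toSeq
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = run()
}
```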

Spark study notes