Flink Stream Splitting: Filter vs. Split vs. Side Output (Part 2)

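Drawbacks of Split:

Note, however, that a stream produced by the split operator cannot be split a second time. If you call split again on the littleStream or bigStream streams produced above, the console reports the following exception.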

Exception in thread "main" java.lang.IllegalStateException: Consecutive multiple splits are not supported. Splits are deprecated. Please use side-outputs.

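Why does this happen? The comments in the Flink source code show that split is deprecated, and they recommend using side outputs for stream splitting instead.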

3. Side Output splitting

Side output is the most recent splitting mechanism that Flink provides, and also the recommended one. Using it takes the following steps:
• Define an OutputTag
• Split the data by calling one of these functions:
  • ProcessFunction
  • KeyedProcessFunction
  • CoProcessFunction
  • KeyedCoProcessFunction
  • ProcessWindowFunction
  • ProcessAllWindowFunction

Here we use ProcessFunction to show how side outputs work:

Scala example

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object sideOutStreamExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // env.setParallelism(1)

    // Sample input rows:
    // 1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
    // 2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
    val inputStream: DataStream[String] = env.readTextFile("/home/rjxy/zlp/Code/CodePro/GuoSai/Task01/src/main/resources/day.csv")

    // Define two OutputTags, one per side output
    val littleOutTag = new OutputTag[String]("littleStream")
    val bigOutTag = new OutputTag[String]("bigStream")

    val processStream = inputStream.process(new ProcessFunction[String, String] {
      override def processElement(i: String,
                                  context: ProcessFunction[String, String]#Context,
                                  out: Collector[String]): Unit = {
        // Parse the first column once and route the row on its value
        val id = i.split(",")(0).toInt
        if (id < 500) {
          context.output(littleOutTag, i)
        } else {
          context.output(bigOutTag, i)
        }
      }
    })

    // Retrieve each side output by its tag
    val littleStream = processStream.getSideOutput(littleOutTag)
    val bigStream = processStream.getSideOutput(bigOutTag)

    littleStream.print("little------")
    bigStream.print("big------")

    env.execute()
  }
}

Output:
big------:15> 682,2012-11-12,4,1,11,1,1,0,1,0.485,0.475383,0.741667,0.173517,1097,5172,6269
little------:4> 184,2011-07-03,3,0,7,0,0,0,2,0.716667,0.668575,0.6825,0.228858,2282,2367,4649
big------:15> 683,2012-11-13,4,1,11,0,2,1,2,0.343333,0.323225,0.662917,0.342046,327,3767,4094
little------:4> 185,2011-07-04,3,0,7,1,1,0,2,0.726667,0.665417,0.637917,0.0814792,3065,2978,6043

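As the output shows, the stream was split and both branches were printed. Note that, unlike split, side outputs can be applied repeatedly: a stream obtained from a side output can be split again without triggering the exception.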
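To make the point about repeated splitting concrete, here is a minimal sketch that splits littleStream a second time. It assumes the same Scala DataStream API as the example above; the object name NestedSideOutputExample, the helper idOf, the odd/even rule, the oddStream/evenStream tags, and the in-memory sample rows are all hypothetical and only serve the illustration.

```scala
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

object NestedSideOutputExample {
  // Hypothetical helper: parses the first CSV column as an integer id.
  def idOf(line: String): Int = line.split(",")(0).trim.toInt

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // In-memory rows stand in for day.csv (assumption: only the id column matters here).
    val input: DataStream[String] = env.fromElements("1,a", "2,b", "501,c", "502,d")

    val littleOutTag = new OutputTag[String]("littleStream")
    val bigOutTag = new OutputTag[String]("bigStream")
    // Hypothetical tags for the second-level split.
    val oddOutTag = new OutputTag[String]("oddStream")
    val evenOutTag = new OutputTag[String]("evenStream")

    // First split: route by id threshold, as in the example above.
    val first = input.process(new ProcessFunction[String, String] {
      override def processElement(line: String,
                                  ctx: ProcessFunction[String, String]#Context,
                                  out: Collector[String]): Unit = {
        if (idOf(line) < 500) ctx.output(littleOutTag, line)
        else ctx.output(bigOutTag, line)
      }
    })
    val littleStream = first.getSideOutput(littleOutTag)

    // Second split: run the already-split stream through another ProcessFunction.
    val second = littleStream.process(new ProcessFunction[String, String] {
      override def processElement(line: String,
                                  ctx: ProcessFunction[String, String]#Context,
                                  out: Collector[String]): Unit = {
        if (idOf(line) % 2 == 0) ctx.output(evenOutTag, line)
        else ctx.output(oddOutTag, line)
      }
    })

    second.getSideOutput(oddOutTag).print("little-odd")
    second.getSideOutput(evenOutTag).print("little-even")
    first.getSideOutput(bigOutTag).print("big")

    env.execute("nested side outputs")
  }
}
```

Each level is just another process call with its own OutputTags, so the nesting depth is not limited the way consecutive split calls are.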

Summary

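  1. Filter splitting (the original stream is filtered multiple times, which costs performance)

  2. Split splitting (a split stream cannot be split again)

  3. Side output splitting (officially recommended)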