Flink Stream Splitting: Comparing Filter, Split, and SideOutput (Part 1)


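Stream-splitting scenarios

In production we often need to split an input source into several streams according to some criterion, for example splitting user access logs by the visitor's geographic location. How should such a requirement be handled?

Depending on the scenario, there are generally three ways to split a stream:

  • Filter
  • Split
  • SideOutput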

1. Splitting with Filter

Scala example

import org.apache.flink.streaming.api.scala._

/**
 * Ways to split a stream in Flink:
 * 1. Filter (filters the original stream several times, which costs performance)
 * 2. Split (does not support splitting a second time)
 * 3. SideOutput (officially recommended)
 */
object filterStreamExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    // Sample input rows:
    //1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
    //2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
    val inputStream: DataStream[String] = env.readTextFile("/home/rjxy/zlp/Code/CodePro/GuoSai/Task01/src/main/resources/day.csv")

    // The first CSV field is the record id; each filter scans the whole input stream
    val littleStream = inputStream.filter(_.split(",")(0).toInt < 500)
    val bigStream = inputStream.filter(_.split(",")(0).toInt >= 500)

    // Print the results
    littleStream.print("little------")
    bigStream.print("big------")
    env.execute()
  }
}
Output:
little------> 496,2012-05-10,2,1,5,0,4,1,1,0.505833,0.491783,0.552083,0.314063,1026,5546,6572
little------> 498,2012-05-12,2,1,5,0,6,0,1,0.564167,0.544817,0.480417,0.123133,2622,4807,7429
little------> 499,2012-05-13,2,1,5,0,0,0,1,0.6125,0.585238,0.57625,0.225117,2172,3946,6118
big------> 500,2012-05-14,2,1,5,0,1,1,2,0.573333,0.5499,0.789583,0.212692,342,2501,2843
big------> 501,2012-05-15,2,1,5,0,2,1,2,0.611667,0.576404,0.794583,0.147392,625,4490,5115
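
Both filters above call `.toInt` on the first field directly, which throws a `NumberFormatException` on any non-numeric line. The sketch below shows one defensive way to extract the id. `recordId` is a hypothetical helper, and the header row and shortened data rows are assumptions made for illustration, not taken from the original file.

```scala
import scala.util.Try

object SafeRecordId {
  // Hypothetical helper: parse the first comma-separated field as an Int.
  // Returns None for blank or non-numeric lines instead of throwing.
  def recordId(line: String): Option[Int] =
    line.split(",").headOption.flatMap(field => Try(field.trim.toInt).toOption)

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "instant,dteday,season", // assumed header row, for illustration only
      "499,2012-05-13,2",      // shortened data rows
      "500,2012-05-14,2"
    )
    // Lines without a numeric id match neither predicate and are dropped.
    val little = lines.filter(l => recordId(l).exists(_ < 500))
    val big = lines.filter(l => recordId(l).exists(_ >= 500))
    little.foreach(l => println(s"little------> $l"))
    big.foreach(l => println(s"big------> $l"))
  }
}
```

In the Flink job, the same helper could be used inside the `filter` predicates, for example `inputStream.filter(l => recordId(l).exists(_ < 500))`.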

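Drawback of Filter:

To obtain each output stream, the original stream has to be traversed again: every filter tests every record, so producing N streams costs N passes over the input, which quietly wastes cluster resources.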
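
To make that cost concrete, here is a minimal plain-Scala sketch (no Flink involved) that counts how many times the predicate runs with two independent filters versus a single routing pass. The names `twoFilters` and `singlePass` are hypothetical, and the count only models predicate evaluations, not Flink's actual scheduling, serialization, or operator chaining.

```scala
object FilterPassCount {
  // Two independent filters: every record is tested once per output branch.
  def twoFilters(ids: Seq[Int], threshold: Int): (Seq[Int], Seq[Int], Int) = {
    var checks = 0
    val little = ids.filter { id => checks += 1; id < threshold }
    val big = ids.filter { id => checks += 1; id >= threshold }
    (little, big, checks)
  }

  // Single pass: each record is tested once and routed to exactly one branch.
  def singlePass(ids: Seq[Int], threshold: Int): (Seq[Int], Seq[Int], Int) = {
    var checks = 0
    val little = Seq.newBuilder[Int]
    val big = Seq.newBuilder[Int]
    ids.foreach { id =>
      checks += 1
      if (id < threshold) little += id else big += id
    }
    (little.result(), big.result(), checks)
  }

  def main(args: Array[String]): Unit = {
    val ids = Seq(496, 498, 499, 500, 501) // ids taken from the sample output
    println(twoFilters(ids, 500)._3) // prints 10
    println(singlePass(ids, 500)._3) // prints 5
  }
}
```

With two branches the filter approach doubles the number of checks, and the gap grows linearly with the number of output streams.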

2. Splitting with Split

Scala example

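import java.{lang, util}

import org.apache.flink.streaming.api.collector.selector.OutputSelector
import org.apache.flink.streaming.api.scala._

// Note: split/select is deprecated and was removed from the DataStream API in Flink 1.12,
// so this example needs an older Flink version.
object splitStreamExample {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Parallelism is left at the default here
    //    env.setParallelism(1)

    // Sample input rows:
    //1,2011-01-01,1,0,1,0,6,0,2,0.344167,0.363625,0.805833,0.160446,331,654,985
    //2,2011-01-02,1,0,1,0,0,0,2,0.363478,0.353739,0.696087,0.248539,131,670,801
    val inputStream: DataStream[String] = env.readTextFile("/home/rjxy/zlp/Code/CodePro/GuoSai/Task01/src/main/resources/day.csv")

    // Tag each record once; the tag names are used later by select()
    val splitStream: SplitStream[String] = inputStream.split(new OutputSelector[String] {
      override def select(out: String): lang.Iterable[String] = {
        val tags = new util.ArrayList[String]()
        // The first CSV field is the record id
        if (out.split(",")(0).toInt < 500) {
          tags.add("littleStream")
        } else {
          tags.add("bigStream")
        }
        tags
      }
    })

    // Pick out each tagged sub-stream and print it
    splitStream.select("littleStream").print("little------")
    splitStream.select("bigStream").print("big------")

    env.execute()
  }
}
Output (parallelism is not set to 1 here, so each line also shows the index of the subtask that printed it):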
little------:13> 36,2011-02-05,1,0,2,0,6,0,2,0.233333,0.243058,0.929167,0.161079,100,905,1005
little------:15> 137,2011-05-17,2,0,5,0,2,1,2,0.561667,0.538529,0.837917,0.277354,678,3445,4123
little------:13> 37,2011-02-06,1,0,2,0,0,0,1,0.285833,0.291671,0.568333,0.1418,354,1269,1623
big------:9> 592,2012-08-14,3,1,8,0,2,1,1,0.726667,0.676779,0.686667,0.169158,1128,5656,6784
little------:13> 38,2011-02-07,1,0,2,0,1,1,1,0.271667,0.303658,0.738333,0.0454083,120,1592,1712
big------:9> 593,2012-08-15,3,1,8,0,3,1,1,0.706667,0.654037,0.619583,0.169771,1198,6149,7347
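
The tag-then-select idea behind the example above can be modelled in plain Scala without Flink. This is only a sketch of the semantics, not of Flink's internals: `tagsFor`, `split`, and `select` are hypothetical stand-ins for `OutputSelector.select`, `DataStream.split`, and `SplitStream.select`, and the input rows are shortened.

```scala
object SplitSelectModel {
  // Stand-in for OutputSelector.select: the tags assigned to one record.
  def tagsFor(line: String): List[String] = {
    val id = line.split(",")(0).trim.toInt
    if (id < 500) List("littleStream") else List("bigStream")
  }

  // Stand-in for split: every record is tagged exactly once.
  def split(records: Seq[String]): Seq[(String, List[String])] =
    records.map(r => (r, tagsFor(r)))

  // Stand-in for select: keep the records that carry the requested tag.
  def select(tagged: Seq[(String, List[String])], tag: String): Seq[String] =
    tagged.collect { case (record, tags) if tags.contains(tag) => record }

  def main(args: Array[String]): Unit = {
    // Shortened rows, using ids that appear in the sample output
    val tagged = split(Seq("36,2011-02-05", "592,2012-08-14", "137,2011-05-17"))
    select(tagged, "littleStream").foreach(r => println(s"little------> $r"))
    select(tagged, "bigStream").foreach(r => println(s"big------> $r"))
  }
}
```

Unlike the Filter approach, the routing decision is made once per record; `select` only looks at the tags that were already assigned.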

Continued in the next part.