Flink in Practice (3): Using Sources and Sinks


The previous article covered Flink's programming model; this time we look at Flink's sources and sinks. Flink can read from and write to files, sockets, collections, and so on, and it also ships with many connectors, e.g. for Kafka, Hadoop, and Redis. These built-in connectors can maintain exactly-once semantics, whereas with Spark Streaming users have to maintain exactly-once semantics themselves when integrating with Kafka, Redis, etc. Next, let's see how to implement a custom source in Flink; this article does not cover checkpointing and related mechanisms.

Custom Sources

According to the official documentation, Flink provides the following three interfaces for implementing your own source:

  • SourceFunction: the top-level interface for all stream sources; a source that implements this interface directly cannot have its parallelism set above 1
import scala.util.Random

/**
 * run() is invoked when the source starts; cancel() is called when the job
 * is stopped, so the flag below is marked @volatile to make the change
 * visible to the thread executing run().
 */
class MyNonParallelSource extends SourceFunction[Access] {

  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val domains = List("ruozedata.com", "dongqiudi.com", "zhibo8.com")
    val random = new Random()
    while (isRunning) {
      val time = System.currentTimeMillis() + ""
      val domain = domains(random.nextInt(domains.length))
      val flow = random.nextInt(10000)
      (1 to 10).foreach(_ => ctx.collect(Access(time, domain, flow)))
      Thread.sleep(1000) // throttle so we don't busy-spin and flood downstream
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

// Generates data; parallelism cannot be set above 1 -- calling
// setParallelism(2) on this source fails with an IllegalArgumentException
val ds = env.addSource(new MyNonParallelSource)

  • ParallelSourceFunction: sources implementing this interface can have their parallelism set above 1
class MyParallelSource extends ParallelSourceFunction[Access] {

  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val domains = List("ruozedata.com", "dongqiudi.com", "zhibo8.com")
    val random = new Random()
    while (isRunning) {
      val time = System.currentTimeMillis() + ""
      val domain = domains(random.nextInt(domains.length))
      val flow = random.nextInt(10000)
      (1 to 10).foreach(_ => ctx.collect(Access(time, domain, flow)))
      Thread.sleep(1000) // throttle the generator
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

// Generates data; parallelism can now be set above 1, and each parallel
// subtask runs its own copy of run()
val ds2 = env.addSource(new MyParallelSource)

  • RichParallelSourceFunction: besides allowing parallelism above 1, implementing this interface gives you lifecycle methods such as open and close
/**
 * Using MySQL as a Flink source
 */
class MySQLSource extends RichParallelSourceFunction[City]{

  private var conn:Connection = _
  private var state:PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    val url = "jdbc:mysql://localhost:3306/g7"
    val user = "ruoze"
    val password = "ruozedata"
    conn = MySQLUtil.getConnection(url,user,password)
  }

  override def close(): Unit = {
    MySQLUtil.close(conn,state)
  }

  override def run(ctx: SourceFunction.SourceContext[City]): Unit = {
    val sql = "select * from city_info"
    state = conn.prepareStatement(sql)
    val rs = state.executeQuery()
    while(rs.next()){
      val id = rs.getInt(1)
      val name = rs.getString(2)
      val area = rs.getString(3)
      ctx.collect(City(id,name,area))
    }
  }

  override def cancel(): Unit = {}
}

// Reads rows from MySQL; parallelism above 1 is allowed, but note that every
// parallel subtask executes run() -- and hence the same query -- independently
val ds3 = env.addSource(new MySQLSource)
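A subtlety worth spelling out with a plain-Scala model (no Flink involved, names are illustrative only): each parallel subtask of a RichParallelSourceFunction executes run() independently, so with parallelism p the MySQL source above would emit every row p times unless run() partitions the work, e.g. by filtering on getRuntimeContext.getIndexOfThisSubtask.

```scala
// Illustrative model: each of the p subtasks re-reads the whole table,
// so the emitted stream contains p copies of every row.
object ParallelReadModel {
  def emitted(rows: Seq[Int], parallelism: Int): Seq[Int] =
    (1 to parallelism).flatMap(_ => rows)

  // One way to avoid duplicates: give each subtask a disjoint slice of rows,
  // mimicking a filter on the subtask index.
  def emittedPartitioned(rows: Seq[Int], parallelism: Int): Seq[Int] =
    (0 until parallelism).flatMap { idx =>
      rows.zipWithIndex.collect { case (row, i) if i % parallelism == idx => row }
    }
}
```

With `emitted(Seq(1, 2, 3), 2)` every row appears twice, while `emittedPartitioned` yields each row exactly once across subtasks.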

Kafka Source Connector

Flink supports every Kafka version; the mapping between versions and artifacts is as follows:

| Maven Dependency                | Consumer and Producer Class Name              | Kafka Version |
| ------------------------------- | --------------------------------------------- | ------------- |
| flink-connector-kafka-0.8_2.11  | FlinkKafkaConsumer08 / FlinkKafkaProducer08   | 0.8.x         |
| flink-connector-kafka-0.9_2.11  | FlinkKafkaConsumer09 / FlinkKafkaProducer09   | 0.9.x         |
| flink-connector-kafka-0.10_2.11 | FlinkKafkaConsumer010 / FlinkKafkaProducer010 | 0.10.x        |
| flink-connector-kafka-0.11_2.11 | FlinkKafkaConsumer011 / FlinkKafkaProducer011 | 0.11.x        |
| flink-connector-kafka_2.11      | FlinkKafkaConsumer / FlinkKafkaProducer       | >= 1.0.0      |

Pull in the connector dependency that matches your Kafka version. For example, my Kafka version is 4.1.0, which falls under the universal (>= 1.0.0) connector, so the following goes into pom.xml:

<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.11</artifactId>
  <version>1.9.0</version>
</dependency>

Using the Kafka source, again with Kafka 4.1.0 as the example:

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
// only required for Kafka 0.8, which still tracks offsets in ZooKeeper
// properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")
env
  .addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))
  .print()

Flink's Kafka consumer also lets you configure the position in each Kafka partition from which to start reading:

val env = StreamExecutionEnvironment.getExecutionEnvironment()

val myConsumer = new FlinkKafkaConsumer08[String](...)
myConsumer.setStartFromEarliest()      // start from the earliest record possible
myConsumer.setStartFromLatest()        // start from the latest record
myConsumer.setStartFromTimestamp(...)  // start from specified epoch timestamp (milliseconds)
myConsumer.setStartFromGroupOffsets()  // the default behaviour

val stream = env.addSource(myConsumer)
...
  • setStartFromGroupOffsets (default): start from the partition offsets committed by the consumer group to the brokers (or to ZooKeeper for Kafka 0.8); if no offset is found for a partition, fall back to the earliest position
  • setStartFromEarliest() / setStartFromLatest(): start from the earliest/latest position; in these modes, offsets committed to Kafka are ignored and not used as the starting position
  • setStartFromTimestamp(long): start from the specified epoch timestamp (in milliseconds); for each partition, records with a timestamp greater than or equal to the given one are read, and if no such record exists the partition is read from the latest position; here too, committed offsets are ignored
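The per-partition rule behind setStartFromTimestamp can be sketched as a small self-contained Scala function. This is a simplified model for intuition only; the real consumer resolves the offset on the broker (via the offsets-for-times lookup), not client-side.

```scala
// records: (offset, recordTimestampMillis) pairs for ONE partition, in offset
// order. Returns the start offset for setStartFromTimestamp(ts): the earliest
// record with timestamp >= ts wins; if there is none, start from the "latest"
// position (one past the last offset).
object StartOffsetModel {
  def resolveFromTimestamp(records: Seq[(Long, Long)], ts: Long): Long =
    records.find { case (_, recTs) => recTs >= ts }
      .map { case (offset, _) => offset }
      .getOrElse(records.lastOption.map(_._1 + 1).getOrElse(0L))
}
```

So for a partition holding offsets 0..2 with timestamps 100, 200, 300, asking for timestamp 150 starts at offset 1, while asking for 400 starts at the latest position (offset 3).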

Of course, you can also specify an exact offset per partition. The example below starts reading the three partitions of topic myTopic from different offsets; if a specified offset does not exist in its partition, the consumer falls back to the default setStartFromGroupOffsets behaviour for that partition.

val specificStartOffsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L)
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L)
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L)

myConsumer.setStartFromSpecificOffsets(specificStartOffsets)

Note ⚠️: none of the settings above affect the reading position when a job auto-recovers from a failure or is restored manually from a savepoint, because in those cases the position comes from the offsets stored in the savepoint or checkpoint. Checkpoints and savepoints will be covered in a later article.

Custom Sinks

Flink supports writing data out in the following ways:

  • writeAsText: write elements as text
  • writeAsCsv: write elements in CSV format
  • print() / printToErr(): print each element's toString value to stdout/stderr. An optional prefix msg can be supplied, which is prepended to the output and helps distinguish different print calls. With parallelism greater than 1, the output is also prefixed with the identifier of the task that produced it
  • writeUsingOutputFormat(): base method for custom file output
  • writeToSocket: write serialized elements to a socket
  • addSink: invoke a custom sink function; Flink ships with connectors to other systems (e.g. Kafka) implemented as sink functions

How do we write our own sink? A good starting point is the built-in writeAsText method: following it into the source code quickly leads to the snippet below, where we see the familiar addSink call, and OutputFormatSinkFunction is the example we want to imitate.

@PublicEvolving
public DataStreamSink<T> writeUsingOutputFormat(OutputFormat<T> format) {
    return addSink(new OutputFormatSinkFunction<>(format));
}


Opening its source code makes clear that OutputFormatSinkFunction simply writes with the OutputFormat it is given. Its class declaration extends RichSinkFunction, and that is the key to our custom sinks.

@PublicEvolving
@Deprecated
public class OutputFormatSinkFunction<IN> extends RichSinkFunction<IN> implements InputTypeConfigurable {


As with sources, SinkFunction is the base interface of RichSinkFunction, but it is rarely implemented directly. Instead we extend RichSinkFunction, because its lifecycle methods let each parallel instance open a connection once in open() and reuse it for every record, as in this custom MySQL sink:

class MySQLRichSink extends RichSinkFunction[Access] {

  private var conn: Connection = _
  private var state: PreparedStatement = _
  private val sql = "INSERT INTO access_log (time, domain, flow) VALUES (?, ?, ?) ON DUPLICATE KEY UPDATE flow = ?"

  override def open(parameters: Configuration): Unit = {
    val url = "jdbc:mysql://localhost:3306/g7"
    val user = "ruoze"
    val password = "ruozedata"
    conn = MySQLUtil.getConnection(url, user, password)
    // prepare once and reuse for every record instead of re-preparing in invoke()
    state = conn.prepareStatement(sql)
  }

  override def close(): Unit = {
    MySQLUtil.close(conn, state)
  }

  override def invoke(value: Access, context: SinkFunction.Context[_]): Unit = {
    state.setString(1, value.time)
    state.setString(2, value.domain)
    state.setInt(3, value.flow)
    state.setInt(4, value.flow)
    state.executeUpdate()
  }
}

The same caveat ⚠️ applies here: none of the sinks above implement checkpointing, so they provide at-least-once rather than exactly-once semantics, and duplicate writes are likely after a job restart.
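One mitigation is already visible in the SQL above: assuming access_log has a unique key covering (time, domain) (an assumption; the table definition is not shown in the article), ON DUPLICATE KEY UPDATE makes a replayed record overwrite the existing row rather than insert a duplicate. A tiny in-memory model of that upsert:

```scala
// In-memory model of INSERT ... ON DUPLICATE KEY UPDATE with (time, domain)
// assumed to be the unique key: replaying the same record leaves the table
// unchanged, and a new flow for an existing key simply wins.
object UpsertModel {
  type Table = Map[(String, String), Int] // (time, domain) -> flow

  def upsert(table: Table, time: String, domain: String, flow: Int): Table =
    table + ((time, domain) -> flow)
}
```

Replaying an identical record is idempotent under this scheme; a replay carrying a different flow for the same key overwrites it, matching the `UPDATE flow = ?` clause.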

Redis Connector

Flink has a connector for writing to Redis (provided via Apache Bahir); before using it, add the following dependency to pom.xml:

<dependency>
   <groupId>org.apache.bahir</groupId>
   <artifactId>flink-connector-redis_${scala.binary.version}</artifactId>
   <version>1.0</version>
</dependency>

It can talk to three kinds of Redis deployments: a single Redis server, a Redis cluster, or Redis Sentinel. The example below targets a single Redis server:

// assumes a recent Redis version
class RedisSinkMapper extends RedisMapper[(String,Int)]{

  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET, "traffic")
  }

  override def getKeyFromData(data: (String, Int)): String = data._1

  override def getValueFromData(data: (String, Int)): String = data._2 + ""
}

val conf = new FlinkJedisPoolConfig.Builder().setHost("localhost").build()
env.readTextFile("input/access.log")
  .map(x => {
    val splits = x.split(",")
    val domain = splits(1)
    val flow = splits(2).toInt
    (domain, flow)
  }).keyBy(0).sum(1)
  .addSink(new RedisSink[(String, Int)](conf, new RedisSinkMapper))
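To see what ends up in the Redis hash, the map -> keyBy -> sum pipeline can be mimicked on a plain Scala collection. Note that keyBy(0).sum(1) emits a running total after every record; only the final value per domain matters for the HSET into "traffic". The object name is illustrative, and the same "time,domain,flow" line format is assumed.

```scala
// Plain-Scala mimic of the pipeline above: parse each line, key by domain,
// sum the flow. The result is the final set of (field, value) pairs the
// RedisSink would leave in the "traffic" hash once the input is drained.
object TrafficRollup {
  def finalTotals(lines: Seq[String]): Map[String, Int] =
    lines
      .map { x =>
        val splits = x.split(",")
        (splits(1), splits(2).toInt) // (domain, flow), as in the Flink map()
      }
      .groupBy { case (domain, _) => domain }
      .map { case (domain, pairs) => domain -> pairs.map(_._2).sum }
}
```

For the lines `"1,a.com,10"`, `"2,b.com,5"`, `"3,a.com,7"` this yields `Map("a.com" -> 17, "b.com" -> 5)`.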

The example code and data used above have been uploaded to GitHub; feel free to clone the repository and run it locally. 🔗 Link:
github.com/liverrrr/fl…
If you spot any mistakes or omissions above, 👏 please point them out in the comments below so we can all learn from each other 📖