The previous article introduced Flink's programming model; this time we look at Flink's sources and sinks. Flink can read from and write to files, sockets, collections, and more, and it also ships many built-in connectors, e.g. for Kafka, Hadoop, and Redis. Several of these built-in connectors maintain exactly-once semantics for you (the Kafka connector, for example), whereas with Spark Streaming you have to maintain exactly-once semantics yourself when integrating with Kafka, Redis, and the like. Next, let's see how to write a custom source in Flink; this article does not cover checkpointing.
Custom Sources
Per the official docs, Flink provides the following three interfaces for implementing the source you need:
- SourceFunction: the top-level interface for all stream sources; a source that implements this interface directly cannot have its parallelism set above 1 (a sketch after the code below shows what happens if you try)
```scala
import scala.util.Random
import org.apache.flink.streaming.api.functions.source.SourceFunction

// Access is assumed elsewhere in the project as:
// case class Access(time: String, domain: String, flow: Int)

/**
 * run() is called when the source starts; cancel() is called when it stops.
 */
class MyNonParallelSource extends SourceFunction[Access] {
  // volatile so the cancel() call from another thread is visible inside run()
  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val domains = List("ruozedata.com", "dongqiudi.com", "zhibo8.com")
    val random = new Random()
    while (isRunning) {
      val time = System.currentTimeMillis() + ""
      val domain = domains(random.nextInt(domains.length))
      val flow = random.nextInt(10000)
      1.to(10).foreach(_ => ctx.collect(Access(time, domain, flow)))
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

// Custom data generator; its parallelism cannot be set above 1
val ds = env.addSource(new MyNonParallelSource)
```
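If you do try to raise the parallelism of this source, the job graph refuses to build. A minimal sketch (the exact exception message may vary by Flink version):

```scala
// fine: a plain SourceFunction runs with parallelism 1
env.addSource(new MyNonParallelSource).print()

// throws IllegalArgumentException ("The parallelism of non parallel
// operator must be 1.") before the job even starts:
// env.addSource(new MyNonParallelSource).setParallelism(2)
```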
- ParallelSourceFunction: a source that implements this interface can run with parallelism greater than 1
```scala
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction

class MyParallelSource extends ParallelSourceFunction[Access] {
  @volatile private var isRunning = true

  override def run(ctx: SourceFunction.SourceContext[Access]): Unit = {
    val domains = List("ruozedata.com", "dongqiudi.com", "zhibo8.com")
    val random = new Random()
    while (isRunning) {
      val time = System.currentTimeMillis() + ""
      val domain = domains(random.nextInt(domains.length))
      val flow = random.nextInt(10000)
      1.to(10).foreach(_ => ctx.collect(Access(time, domain, flow)))
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}

// Custom data generator; its parallelism can be set above 1
val ds2 = env.addSource(new MyParallelSource)
```
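With a ParallelSourceFunction, each of the N parallel subtasks runs its own copy of run(), so the sketch below would start three independent generators (the parallelism value here is arbitrary):

```scala
env.addSource(new MyParallelSource)
  .setParallelism(3) // three subtasks, each executing run() independently
  .print()
```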
- RichParallelSourceFunction: extending this class not only allows parallelism greater than 1 but also gives you lifecycle methods such as open and close
```scala
import java.sql.{Connection, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}

// City is assumed elsewhere in the project as:
// case class City(id: Int, name: String, area: String)

/**
 * Using MySQL as a Flink source.
 */
class MySQLSource extends RichParallelSourceFunction[City] {
  private var conn: Connection = _
  private var state: PreparedStatement = _

  // open() runs once per parallel subtask, before run(): set up the connection here
  override def open(parameters: Configuration): Unit = {
    val url = "jdbc:mysql://localhost:3306/g7"
    val user = "ruoze"
    val password = "ruozedata"
    conn = MySQLUtil.getConnection(url, user, password)
  }

  // close() runs when the source shuts down: release the connection here
  override def close(): Unit = {
    MySQLUtil.close(conn, state)
  }

  override def run(ctx: SourceFunction.SourceContext[City]): Unit = {
    val sql = "select * from city_info"
    state = conn.prepareStatement(sql)
    val rs = state.executeQuery()
    while (rs.next()) {
      val id = rs.getInt(1)
      val name = rs.getString(2)
      val area = rs.getString(3)
      ctx.collect(City(id, name, area))
    }
  }

  override def cancel(): Unit = {}
}

// Custom source reading from MySQL; parallelism greater than 1 is supported
val ds3 = env.addSource(new MySQLSource)
```
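One thing to watch: with parallelism greater than 1, every subtask of MySQLSource runs the same `select * from city_info`, so each row is emitted once per subtask. A hedged sketch of one way around this, sharding the table by subtask index (the modulo scheme is my own illustration, not part of the original code):

```scala
class ShardedMySQLSource extends RichParallelSourceFunction[City] {
  private var conn: Connection = _
  private var state: PreparedStatement = _

  override def open(parameters: Configuration): Unit = {
    conn = MySQLUtil.getConnection("jdbc:mysql://localhost:3306/g7", "ruoze", "ruozedata")
  }

  override def run(ctx: SourceFunction.SourceContext[City]): Unit = {
    // each subtask only reads the rows whose id falls into its own shard
    val numTasks = getRuntimeContext.getNumberOfParallelSubtasks
    val taskIdx = getRuntimeContext.getIndexOfThisSubtask
    state = conn.prepareStatement("select * from city_info where mod(id, ?) = ?")
    state.setInt(1, numTasks)
    state.setInt(2, taskIdx)
    val rs = state.executeQuery()
    while (rs.next()) {
      ctx.collect(City(rs.getInt(1), rs.getString(2), rs.getString(3)))
    }
  }

  override def close(): Unit = MySQLUtil.close(conn, state)
  override def cancel(): Unit = {}
}
```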
Kafka Source Connector
Flink supports every major Kafka version; the mapping between versions and artifacts is shown below:
| Maven Dependency | Consumer and Producer Class name | Kafka Version |
|---|---|---|
| flink-connector-kafka-0.8_2.11 | FlinkKafkaConsumer08 / FlinkKafkaProducer08 | 0.8.x |
| flink-connector-kafka-0.9_2.11 | FlinkKafkaConsumer09 / FlinkKafkaProducer09 | 0.9.x |
| flink-connector-kafka-0.10_2.11 | FlinkKafkaConsumer010 / FlinkKafkaProducer010 | 0.10.x |
| flink-connector-kafka-0.11_2.11 | FlinkKafkaConsumer011 / FlinkKafkaProducer011 | 0.11.x |
| flink-connector-kafka_2.11 | FlinkKafkaConsumer / FlinkKafkaProducer | >= 1.0.0 |
Pull in the connector dependency that matches your Kafka version (the _2.11 suffix is the Scala binary version; use _2.12 if you build against Scala 2.12). My broker runs a Kafka version >= 1.0.0, so I add the universal connector to pom.xml:
```xml
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>1.9.0</version>
</dependency>
```
Using the Kafka source then looks like this (again with the universal connector):
```scala
import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val properties = new Properties()
properties.setProperty("bootstrap.servers", "localhost:9092")
// only required for Kafka 0.8, which still tracks offsets in ZooKeeper
// properties.setProperty("zookeeper.connect", "localhost:2181")
properties.setProperty("group.id", "test")

val stream = env
  .addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), properties))
stream.print()
```
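If the Kafka records are CSV lines like the access logs used earlier, a hedged sketch of parsing them into Access records (the topic name and field layout are assumptions):

```scala
val accessStream = env
  .addSource(new FlinkKafkaConsumer[String]("access_topic", new SimpleStringSchema(), properties))
  .map { line =>
    // assumes exactly three comma-separated fields per record
    val Array(time, domain, flow) = line.split(",")
    Access(time, domain, flow.toInt)
  }
```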
The Flink Kafka consumer also lets you configure the position from which each Kafka partition is first read:
```scala
val env = StreamExecutionEnvironment.getExecutionEnvironment()

val myConsumer = new FlinkKafkaConsumer08[String](...)
myConsumer.setStartFromEarliest()      // start from the earliest record possible
myConsumer.setStartFromLatest()        // start from the latest record
myConsumer.setStartFromTimestamp(...)  // start from specified epoch timestamp (milliseconds)
myConsumer.setStartFromGroupOffsets()  // the default behaviour

val stream = env.addSource(myConsumer)
...
```
- setStartFromGroupOffsets (the default): start from the offsets committed by the consumer group in the brokers (or in Zookeeper for Kafka 0.8); if no committed offset is found for a partition, the consumer falls back to the auto.offset.reset setting in the properties
- setStartFromEarliest() / setStartFromLatest(): start from the earliest/latest record; in these modes, offsets committed to Kafka are ignored and not used as starting positions
- setStartFromTimestamp(long): start from the given epoch timestamp in milliseconds; for each partition, the first record with a timestamp greater than or equal to the given one is used as the start position, and if a partition has no such record it is simply read from its latest record; in this mode too, offsets committed to Kafka are ignored (a short sketch follows this list)
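For example, a minimal sketch of setStartFromTimestamp; the concrete timestamp is just an illustration:

```scala
// replay every record whose timestamp is at or after 2019-10-01 00:00:00 UTC
myConsumer.setStartFromTimestamp(1569888000000L)
```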
You can also read from an exact offset per partition; the example below starts the three partitions of topic myTopic at different offsets. If the specified offset for a partition cannot be found, the consumer falls back to the default setStartFromGroupOffsets behaviour for that partition.
```scala
val specificStartOffsets = new java.util.HashMap[KafkaTopicPartition, java.lang.Long]()
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 0), 23L)
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 1), 31L)
specificStartOffsets.put(new KafkaTopicPartition("myTopic", 2), 43L)

myConsumer.setStartFromSpecificOffsets(specificStartOffsets)
```
Note ⚠️: none of these settings affect the read position when a job recovers automatically from a failure or is restored manually from a savepoint, because in those cases the consumer starts from the offsets stored in the savepoint or checkpoint. Checkpoints and savepoints will be covered in a later article.
Custom Sinks
Flink supports the following ways of writing data out (a short sketch follows the list):
- writeAsText: write elements as text
- writeAsCsv: write elements in CSV format
- print() / printToErr(): print each element's toString value to standard out / standard error; an optional prefix msg can be prepended to the output, which helps distinguish different print calls, and if the parallelism is greater than 1 the output is also prefixed with the identifier of the task that produced it
- writeUsingOutputFormat(): base method for custom file output
- writeToSocket: write serialized elements to a socket
- addSink: invoke a custom sink function; Flink ships connectors with ready-made sinks to other systems (such as Kafka)
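As a quick illustration, a minimal sketch of the simpler methods (the output path is hypothetical):

```scala
val result = env.fromElements(Access("1569888000000", "ruozedata.com", 2000))

result.print("access")                  // the prefix "access" marks this print call
result.writeAsText("output/access.txt") // hypothetical path
```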
So how do we build a custom sink? Start by looking at how the built-in writeAsText method is implemented; following the source quickly leads to the snippet below, where we meet the familiar addSink method, and OutputFormatSinkFunction is exactly the kind of class we need to model ours on.
```java
@PublicEvolving
public DataStreamSink<T> writeUsingOutputFormat(OutputFormat<T> format) {
    return addSink(new OutputFormatSinkFunction<>(format));
}
```
Drilling further into the source makes it clear that OutputFormatSinkFunction simply writes through whatever OutputFormat it is given. Its class declaration extends RichSinkFunction, and that is the key to our own custom sink.
```java
@PublicEvolving
@Deprecated
public class OutputFormatSinkFunction<IN> extends RichSinkFunction<IN> implements InputTypeConfigurable {
```
As with sources, SinkFunction is the base interface of RichSinkFunction, but it is rarely implemented directly; instead you extend RichSinkFunction, whose lifecycle methods let each parallel subtask open one connection and reuse it across invocations. For example, here is a custom sink that writes to MySQL:
```scala
import java.sql.{Connection, PreparedStatement}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}

class MySQLRichSink extends RichSinkFunction[Access] {
  private var conn: Connection = _
  private var state: PreparedStatement = _
  private val sql = "INSERT INTO access_log (time, domain, flow) VALUES (?, ?, ?) ON DUPLICATE KEY UPDATE flow = ?"

  override def open(parameters: Configuration): Unit = {
    val url = "jdbc:mysql://localhost:3306/g7"
    val user = "ruoze"
    val password = "ruozedata"
    conn = MySQLUtil.getConnection(url, user, password)
    // prepare the statement once here and reuse it for every record
    state = conn.prepareStatement(sql)
  }

  override def close(): Unit = {
    MySQLUtil.close(conn, state)
  }

  // invoke() is called once per incoming record
  override def invoke(value: Access, context: SinkFunction.Context[_]): Unit = {
    state.setString(1, value.time)
    state.setString(2, value.domain)
    state.setInt(3, value.flow)
    state.setInt(4, value.flow)
    state.executeUpdate()
  }
}
```
The same warning applies ⚠️: none of the sinks above implement checkpointing, so they provide at-least-once rather than exactly-once semantics, and after a job restart you may well see duplicated records.
Redis Connector
Flink has a ready-made Redis sink connector (maintained under Apache Bahir); before using it, add the following dependency to pom.xml:
```xml
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_${scala.binary.version}</artifactId>
    <version>1.0</version>
</dependency>
```
It can talk to three kinds of Redis deployments: a single Redis server, a Redis cluster, and Redis Sentinel. The example below targets a single Redis server:
```scala
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}

// written against a recent Redis server;
// maps each (domain, flow) pair to: HSET traffic <domain> <flow>
class RedisSinkMapper extends RedisMapper[(String, Int)] {
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET, "traffic")
  }
  override def getKeyFromData(data: (String, Int)): String = data._1
  override def getValueFromData(data: (String, Int)): String = data._2 + ""
}

val conf = new FlinkJedisPoolConfig.Builder().setHost("localhost").build()

env.readTextFile("input/access.log")
  .map(x => {
    val splits = x.split(",")
    val domain = splits(1)
    val flow = splits(2).toInt
    (domain, flow)
  })
  .keyBy(0).sum(1)
  .addSink(new RedisSink[(String, Int)](conf, new RedisSinkMapper))
```
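The same RedisSink can be pointed at a Redis cluster (or Sentinel) by swapping the config object; a hedged sketch, where the node host name is a placeholder:

```scala
import java.net.InetSocketAddress
import scala.collection.JavaConverters._
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisClusterConfig

// connect to a Redis cluster instead of a single server
val clusterConf = new FlinkJedisClusterConfig.Builder()
  .setNodes(Set(new InetSocketAddress("redis-node1", 6379)).asJava)
  .build()

// then: .addSink(new RedisSink[(String, Int)](clusterConf, new RedisSinkMapper))
```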
All the sample code and data used in this article have been uploaded to GitHub; feel free to clone the repo and run it locally. 🔗 Link:
github.com/liverrrr/fl…
If you spot any mistakes or omissions above, 👏 please point them out in the comments below so we can all learn from each other 📖