Integrating Flink with Kafka


Code demo

environment

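import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1);
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
// Enable checkpointing, one checkpoint every 1000 ms
env.enableCheckpointing(1000L);
// Store checkpoints on HDFS
env.setStateBackend(new FsStateBackend("hdfs://nameservice1/user/flink/checkpoint"));
// Retain checkpoints when the job is cancelled
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// Use exactly-once checkpointing semantics
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Fixed-delay restart: at most 2 attempts, 3000 ms apart
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(2, 3000));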

source

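import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

// Kafka configuration
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "127.0.0.1:9092");
properties.setProperty("group.id", "test");
properties.setProperty("auto.offset.reset", "latest");
// Check for newly added partitions every 5000 ms
properties.setProperty("flink.partition-discovery.interval-millis", "5000");
// FlinkKafkaConsumer reads raw bytes and decodes them with the schema passed to its constructor,
// so these two deserializer settings are overridden by the connector
properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
// The two auto-commit settings only take effect when checkpointing is disabled
properties.setProperty("enable.auto.commit", "true");
properties.setProperty("auto.commit.interval.ms", "5000");

FlinkKafkaConsumer<String> eventConsumer = new FlinkKafkaConsumer<>("flink_test_string", new SimpleStringSchema(), properties);
// Note: see the question about setCommitOffsetsOnCheckpoints further down
//eventConsumer.setCommitOffsetsOnCheckpoints(false);
DataStreamSource<String> eventStream = env.addSource(eventConsumer).setParallelism(1);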

operator

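import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.util.Collector;

// Split each message into words, then keep a running count per word
DataStream<Tuple2<String, Integer>> resultStream = eventStream
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                String[] words = s.split(" ");
                for (String word : words) {
                    collector.collect(new Tuple2<>(word, 1));
                }
            }
        })
        .keyBy(t -> t.f0)
        .sum(1)
        .setParallelism(1);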
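To make the behavior of this operator chain concrete, here is a minimal plain-Java sketch of what flatMap, keyBy and sum(1) compute together in streaming mode: every incoming word produces one updated running total. It has no Flink dependency; the class name RunningWordCount and the method process are hypothetical names chosen for illustration, and the HashMap only stands in for Flink's keyed state.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of flatMap -> keyBy -> sum(1); illustrative only, not Flink code.
class RunningWordCount {
    // Per-word running totals, standing in for Flink's keyed state.
    private final Map<String, Integer> counts = new HashMap<>();

    // Processes one message and returns the records the pipeline would emit for it.
    List<String> process(String line) {
        List<String> emitted = new ArrayList<>();
        for (String word : line.split(" ")) {
            // Add 1 to the word's total, starting from 1 if the word is new.
            int total = counts.merge(word, 1, Integer::sum);
            emitted.add("(" + word + "," + total + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        RunningWordCount wc = new RunningWordCount();
        System.out.println(wc.process("a b a")); // [(a,1), (b,1), (a,2)]
        System.out.println(wc.process("b"));     // [(b,2)]
    }
}
```

The second call shows why state matters for recovery: the count for b continues from 1 only because the earlier total was kept.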

sink

resultStream.print().setParallelism(1);
// Submit the job; nothing runs until execute() is called
env.execute();

Questions

What is the setCommitOffsetsOnCheckpoints method for?

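The official documentation covers this, and setting setCommitOffsetsOnCheckpoints to false in code is not recommended.
If checkpointing is disabled, the Flink Kafka Consumer relies on the periodic automatic offset commit of the Kafka client it uses internally (enable.auto.commit and auto.commit.interval.ms). The offsets are recorded in __consumer_offsets, the internal Kafka topic that stores committed offsets.
If checkpointing is enabled, the offsets are stored in the checkpointed state (StateBackend) and the client's auto-commit settings are ignored. When setCommitOffsetsOnCheckpoints is true, which is the default, the offsets are recorded both in the state and in the __consumer_offsets topic, with the commit to Kafka happening when a checkpoint completes. When it is false, the offsets are stored only in the state.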
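The three cases above can be summarized as a small decision function. This is a simplified plain-Java model of the rule as described here, not the connector's own code; the names OffsetCommitModel, Mode and resolve are hypothetical. DISABLED means that nothing is committed to Kafka, although with checkpointing enabled the offsets are still kept in the checkpointed state.

```java
// Simplified model of where consumer offsets get committed; illustrative only.
class OffsetCommitModel {
    enum Mode {
        ON_CHECKPOINTS, // committed to Kafka when a checkpoint completes
        KAFKA_PERIODIC, // committed by the Kafka client's auto-commit timer
        DISABLED        // never committed to Kafka
    }

    static Mode resolve(boolean checkpointingEnabled,
                        boolean commitOnCheckpoints,
                        boolean autoCommit) {
        if (checkpointingEnabled) {
            // enable.auto.commit is ignored once checkpointing is on.
            return commitOnCheckpoints ? Mode.ON_CHECKPOINTS : Mode.DISABLED;
        }
        // Without checkpointing, only the Kafka client's own setting matters.
        return autoCommit ? Mode.KAFKA_PERIODIC : Mode.DISABLED;
    }

    public static void main(String[] args) {
        // The demo job: checkpointing on, setCommitOffsetsOnCheckpoints left at its default.
        System.out.println(resolve(true, true, true));   // ON_CHECKPOINTS
        // Same job with setCommitOffsetsOnCheckpoints(false).
        System.out.println(resolve(true, false, true));  // DISABLED
        // Checkpointing off, enable.auto.commit=true.
        System.out.println(resolve(false, true, true));  // KAFKA_PERIODIC
    }
}
```

The second case is the one to watch: with the commented-out line in the demo enabled, nothing is written to Kafka for the group, so a restart without a checkpoint or savepoint has no committed offset to resume from.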

When the Flink job is restarted without a savepoint path, why can it still pick up where it left off?

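If a savepoint path (or a retained checkpoint path) is specified when the job is restarted, the job restores its data from that path, including the Kafka offsets stored in the state.
If no savepoint path is specified, the consumer falls back to its default start position, which is the group offsets: it looks in Kafka's __consumer_offsets topic for offsets committed for this topic under the same group.id and, if it finds them, continues reading from there. If no committed offset exists, auto.offset.reset decides where to start. Note that only the read position is recovered this way; operator state such as the running word counts is not, since that lives only in checkpoints and savepoints.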
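The lookup order for the start position can be sketched as follows. This is a plain-Java model under stated assumptions, not the connector's implementation: the consumer uses its default start position (group offsets), auto.offset.reset is either "earliest" or "latest", and each offset means the next record to read. StartOffsetModel and resolve are hypothetical names.

```java
// Simplified model of where one partition starts reading after a restart; illustrative only.
class StartOffsetModel {
    /**
     * @param restoredOffset  next offset from a restored checkpoint or savepoint, or null if none
     * @param committedOffset next offset committed in Kafka for the group.id, or null if none
     * @param autoOffsetReset "earliest" or "latest" (other values are not modeled)
     * @param earliest        first offset currently available in the partition
     * @param latest          offset just past the last record in the partition
     */
    static long resolve(Long restoredOffset, Long committedOffset,
                        String autoOffsetReset, long earliest, long latest) {
        if (restoredOffset != null) {
            return restoredOffset;  // restored state takes precedence
        }
        if (committedOffset != null) {
            return committedOffset; // otherwise resume from the group's committed offset
        }
        // Nothing recorded anywhere: fall back to auto.offset.reset.
        return "earliest".equals(autoOffsetReset) ? earliest : latest;
    }

    public static void main(String[] args) {
        System.out.println(resolve(42L, 17L, "latest", 0L, 100L));   // 42
        System.out.println(resolve(null, 17L, "latest", 0L, 100L));  // 17
        System.out.println(resolve(null, null, "latest", 0L, 100L)); // 100
    }
}
```

The middle case is the one the question asks about: no savepoint was given, but the group's committed offset lets the job continue from offset 17 instead of jumping to the end of the partition.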