Since I need Flink for work, I am recording the pitfalls I hit while learning it.


Sample code for learning Flink CEP:

        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();

        // LoginEvent fields: Integer id; Integer userId; String ip; String type; Long timestamp;
        List<LoginEvent> loginEvents = new ArrayList<>();
        
        loginEvents.add(new LoginEvent(1, 1, "127.0.0.1", "success", 16305733111L));
        loginEvents.add(new LoginEvent(2, 2, "127.0.0.2", "fail", 16305733211L));
        loginEvents.add(new LoginEvent(3, 1, "127.0.0.2", "success", 16305733311L));
        loginEvents.add(new LoginEvent(4, 2, "127.0.0.1", "fail", 16305734111L));
        loginEvents.add(new LoginEvent(5, 1, "127.0.0.1", "success", 16305733111L));
        loginEvents.add(new LoginEvent(6, 3, "127.0.0.2", "fail", 16305733011L));
        loginEvents.add(new LoginEvent(7, 3, "127.0.0.1", "success", 16305733711L));
        loginEvents.add(new LoginEvent(8, 4, "127.0.0.1", "fail", 16305733171L));
        loginEvents.add(new LoginEvent(9, 1, "127.0.0.2", "success", 16305733191L));
        loginEvents.add(new LoginEvent(10, 4, "127.0.0.1", "fail", 16305733211L));

        DataStreamSource<LoginEvent> dataStreamSource = environment.fromCollection(loginEvents);

        Pattern<LoginEvent, LoginEvent> pattern = Pattern
                .<LoginEvent>begin("one", AfterMatchSkipStrategy.skipPastLastEvent())
                .where(new SimpleCondition<LoginEvent>() {
                    @Override
                    public boolean filter(LoginEvent loginEvent) throws Exception {
                        return "fail".equals(loginEvent.getType());
                    }
                }).timesOrMore(2);

        PatternStream<LoginEvent> patternStream = CEP.pattern(
                dataStreamSource
                        .keyBy(LoginEvent::getUserId),
                pattern);
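//        // NOTE: the commented-out lambda version below is what triggers the
//        // InvalidTypesException discussed in problem 1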
//        patternStream
//                .select((PatternSelectFunction<LoginEvent, String>) map -> {
//                    return map.toString();
//                }).print();

        patternStream.process(new PatternProcessFunction<LoginEvent, String>() {
            @Override
            public void processMatch(Map<String, List<LoginEvent>> map, Context context, Collector<String> collector) throws Exception {
                collector.collect(map.get("one").toString());
            }
        }).print();

        environment.execute("TRUTH_TEST_TRUTH1");
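
The post never shows the LoginEvent class itself, only the field list in the comment above. Below is a minimal POJO sketch consistent with that comment (the real class most likely uses Lombok, judging by the toString() output shown later):

import java.io.Serializable;

public class LoginEvent implements Serializable {
    private Integer id;
    private Integer userId;
    private String ip;
    private String type;
    private Long timestamp;

    public LoginEvent() {} // Flink POJOs need a public no-arg constructor

    public LoginEvent(Integer id, Integer userId, String ip, String type, Long timestamp) {
        this.id = id;
        this.userId = userId;
        this.ip = ip;
        this.type = type;
        this.timestamp = timestamp;
    }

    public Integer getUserId() { return userId; }
    public String getType() { return type; }
    public Long getTimestamp() { return timestamp; }
    // remaining getters, setters, and toString() omitted for brevity
}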

1. Problems with lambda expressions

When I started out from the demo, I used lambda expressions heavily, but at execution time the following exception appeared:

Exception in thread "main" org.apache.flink.api.common.functions.InvalidTypesException: The types of the interface org.apache.flink.cep.PatternSelectFunction could not be inferred. Support for synthetic interfaces, lambdas, and generic or raw types is limited at this point
	at org.apache.flink.api.java.typeutils.TypeExtractor.getParameterType(TypeExtractor.java:1239)
	at org.apache.flink.api.java.typeutils.TypeExtractor.getParameterTypeFromGenericType(TypeExtractor.java:1263)
	at org.apache.flink.api.java.typeutils.TypeExtractor.getParameterType(TypeExtractor.java:1226)
	at org.apache.flink.api.java.typeutils.TypeExtractor.privateCreateTypeInfo(TypeExtractor.java:789)
	at org.apache.flink.api.java.typeutils.TypeExtractor.getUnaryOperatorReturnType(TypeExtractor.java:587)
	at org.apache.flink.cep.PatternStream.select(PatternStream.java:132)
	at com.truth.flinkdemo.flink.FlinkCEPKt.main(FlinkCEP.kt:59)
	at com.truth.flinkdemo.flink.FlinkCEPKt.main(FlinkCEP.kt)

In a lambda expression, the types of the input and output parameters do not need to be declared, because the Java compiler infers them. Flink can usually extract result type information from an implementation's method signature, but unfortunately lambdas are compiled by the Java compiler into synthetic classes, which leaves Flink unable to infer the output type automatically. The official documentation gives the following solutions:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;

// use Flink's "returns(...)" method to state the concrete type
env.fromElements(1, 2, 3)
    .map(i -> Tuple2.of(i, i))
    .returns(Types.TUPLE(Types.INT, Types.INT))
    .print();

// write a class that implements the corresponding interface
env.fromElements(1, 2, 3)
    .map(new MyTuple2Mapper())
    .print();

public static class MyTuple2Mapper implements MapFunction<Integer, Tuple2<Integer, Integer>> {
    @Override
    public Tuple2<Integer, Integer> map(Integer i) {
        return Tuple2.of(i, i);
    }
}

// use an anonymous inner class
env.fromElements(1, 2, 3)
    .map(new MapFunction<Integer, Tuple2<Integer, Integer>>() {
        @Override
        public Tuple2<Integer, Integer> map(Integer i) {
            return Tuple2.of(i, i);
        }
    })
    .print();

// in this example, use a tuple subclass instead
env.fromElements(1, 2, 3)
    .map(i -> new DoubleTuple(i, i))
    .print();

public static class DoubleTuple extends Tuple2<Integer, Integer> {
    public DoubleTuple(int f0, int f1) {
        this.f0 = f0;
        this.f1 = f1;
    }
}
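
Applied to the commented-out select(...) call in the CEP sample above, the first workaround would look like the sketch below; it relies on the PatternStream.select overload that takes an explicit TypeInformation, so Flink no longer has to infer the lambda's output type:

patternStream
        .select(
                (PatternSelectFunction<LoginEvent, String>) map -> map.toString(),
                Types.STRING) // output type stated explicitly
        .print();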

2. Exception "Could not create actor system" when using Flink 1.12.3

Exception in thread "main" java.lang.Exception: Could not create actor system
	at org.apache.flink.runtime.clusterframework.BootstrapTools.startLocalActorSystem(BootstrapTools.java:281)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:361)
	at org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
	at org.apache.flink.runtime.minicluster.MiniCluster.createLocalRpcService(MiniCluster.java:952)
	at org.apache.flink.runtime.minicluster.MiniCluster.start(MiniCluster.java:288)
	at org.apache.flink.client.program.PerJobMiniClusterFactory.submitJob(PerJobMiniClusterFactory.java:75)
	at org.apache.flink.client.deployment.executors.LocalExecutor.execute(LocalExecutor.java:85)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:1905)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1796)
	at org.apache.flink.streaming.api.environment.LocalStreamEnvironment.execute(LocalStreamEnvironment.java:69)
	at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1782)
	at org.truth.flink.FlinkCep.main(FlinkCep.java:104)
Caused by: java.lang.NoClassDefFoundError: akka/actor/ExtensionId$class
	at org.apache.flink.runtime.akka.RemoteAddressExtension$.<init>(RemoteAddressExtension.scala:32)
	at org.apache.flink.runtime.akka.RemoteAddressExtension$.<clinit>(RemoteAddressExtension.scala)
	at org.apache.flink.runtime.akka.AkkaUtils$.getAddress(AkkaUtils.scala:804)
	at org.apache.flink.runtime.akka.AkkaUtils.getAddress(AkkaUtils.scala)
	at org.apache.flink.runtime.clusterframework.BootstrapTools.startActorSystem(BootstrapTools.java:298)
	at org.apache.flink.runtime.clusterframework.BootstrapTools.startLocalActorSystem(BootstrapTools.java:279)
	... 11 more
Caused by: java.lang.ClassNotFoundException: akka.actor.ExtensionId$class
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
	... 17 more

The exception above is caused by a mistake in the Flink 1.12.3 release process, which left the Scala 2.12 artifacts accidentally containing Scala 2.11 code. Scala 2.12 users should not use 1.12.3 at all but go straight to 1.12.4 to avoid the problem; after upgrading to 1.12.4 the exception disappears. See the original answer to this issue.
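
Since the fix is just a version bump, for a Maven build it amounts to updating every Flink artifact, e.g. (assuming the common Scala 2.12 artifact names; match the artifactIds your pom actually uses):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-java_2.12</artifactId>
    <version>1.12.4</version>
</dependency>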

3. After upgrading to 1.12+, the sample code runs but the console shows no output

On Flink 1.9.3 through 1.11.3, the sample code prints the expected output:

8> [LoginEvent(id=2, userId=2, ip=127.0.0.2, type=fail), LoginEvent(id=4, userId=2, ip=127.0.0.1, type=fail)]
1> [LoginEvent(id=8, userId=4, ip=127.0.0.1, type=fail), LoginEvent(id=10, userId=4, ip=127.0.0.1, type=fail)]

But when run on Flink 1.12, the same patternStream pipeline's print() produces no output at all. It turns out that starting with Flink 1.12 the default stream time characteristic changed to TimeCharacteristic.EventTime, i.e. from the previous default of processing time to event time, and that is the problem: with no timestamps or watermarks assigned, event time never advances, so CEP never emits a match.

// The several time semantics in Flink
Event Time: the time at which an event was created. It is usually described by a timestamp embedded in the event itself; for example, each record in collected log data carries its own generation time, which Flink accesses through a timestamp assigner.
Ingestion Time: the time at which data enters Flink;
Processing Time: the local system time of the machine executing an operator, so it is machine-dependent;

However, StreamExecutionEnvironment.setStreamTimeCharacteristic() is deprecated as of Flink 1.12, so the default stream time can no longer be set via setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime). Looking at the source of the PatternStream class reveals the following methods:


/** Sets the time characteristic to processing time. */
public PatternStream<T> inProcessingTime() {
    return new PatternStream<>(builder.inProcessingTime());
}

/** Sets the time characteristic to event time. */
public PatternStream<T> inEventTime() {
    return new PatternStream<>(builder.inEventTime());
}

The Javadoc tells us that inProcessingTime() switches the PatternStream to processing time, while inEventTime() switches it to event time. After modifying the sample code as follows, it prints output again.

// before
patternStream.process(new PatternProcessFunction<LoginEvent, String>() {
    @Override
    public void processMatch(Map<String, List<LoginEvent>> map, Context context, Collector<String> collector) throws Exception {
        collector.collect(map.get("one").toString());
    }
}).print();


// after
patternStream
        .inProcessingTime()
        .process(new PatternProcessFunction<LoginEvent, String>() {
            @Override
            public void processMatch(Map<String, List<LoginEvent>> map, Context context, Collector<String> collector) throws Exception {
                collector.collect(map.get("one").toString());
            }
        }).print();

But if we want event time rather than processing time, we have to assign each event a timestamp ourselves (a millisecond timestamp, otherwise the output is still blank). The configuration code is as follows:

// the * 1000L below converts second-precision timestamps to milliseconds; without it the output stays blank

// old API (the new official recommendation is assignTimestampsAndWatermarks(WatermarkStrategy))
// assign event timestamps and watermarks for in-order (ascending) data
dataStreamSource.assignTimestampsAndWatermarks(new AscendingTimestampExtractor<LoginEvent>() {
    @Override
    public long extractAscendingTimestamp(LoginEvent event) { 
        return event.getTimestamp() * 1000L; 
    }
});
// assign timestamps and watermarks for out-of-order data
dataStreamSource.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<LoginEvent>(Time.seconds(2)) {
    @Override 
    public long extractTimestamp(LoginEvent event) { 
        return event.getTimestamp() * 1000L; 
    } 
});

// new API
// creates a watermark strategy for records that arrive out of order, with an upper bound on how out of order they may be.
// watermarks are generated periodically; the latency this strategy introduces is the periodic interval plus the out-of-orderness bound.
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy
        .<LoginEvent>forBoundedOutOfOrderness(Duration.ofSeconds(1))
        .withTimestampAssigner(((loginEvent, l) -> loginEvent.getTimestamp() * 1000L)));
// creates a watermark strategy for strictly ascending timestamps.
// watermarks are generated periodically and tightly follow the latest timestamp in the data; the latency here is mainly the periodic interval.
dataStreamSource.assignTimestampsAndWatermarks(WatermarkStrategy
        .<LoginEvent>forMonotonousTimestamps()
        .withTimestampAssigner(((loginEvent, l) -> loginEvent.getTimestamp() * 1000L)));
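
Putting the pieces together, an event-time version of the original sample could look like the sketch below. It reuses the dataStreamSource and pattern defined earlier; inEventTime() is already the default in 1.12+ and is spelled out only for clarity:

PatternStream<LoginEvent> eventTimePatternStream = CEP.pattern(
        dataStreamSource
                .assignTimestampsAndWatermarks(WatermarkStrategy
                        .<LoginEvent>forBoundedOutOfOrderness(Duration.ofSeconds(1))
                        .withTimestampAssigner((loginEvent, l) -> loginEvent.getTimestamp() * 1000L))
                .keyBy(LoginEvent::getUserId),
        pattern);

eventTimePatternStream
        .inEventTime() // explicit here; event time is the 1.12+ default anyway
        .process(new PatternProcessFunction<LoginEvent, String>() {
            @Override
            public void processMatch(Map<String, List<LoginEvent>> map, Context context, Collector<String> collector) {
                collector.collect(map.get("one").toString());
            }
        }).print();

This works even on the finite fromCollection source because Flink emits a final Long.MAX_VALUE watermark when a bounded source reaches its end, which flushes any pending event-time matches.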

4. Serialization exception when using external variables in map-style functions

In a Flink + Kafka demo, I used Gson inside a function to convert records, and the job kept failing with serialization exceptions. The cause turned out to be that the function captured the external variable Gson gson = new Gson(); Spark shows the same behavior. Defining Gson as a static field, private static final Gson gson = new Gson(), and using that gson object inside the function solves the problem, as the sketch below shows. See the referenced article.
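
A minimal sketch of the fix (the class and field names here are mine for illustration, not from the demo):

import com.google.gson.Gson;
import org.apache.flink.api.common.functions.MapFunction;

public class LoginEventToJson implements MapFunction<LoginEvent, String> {

    // static final: created once per JVM and never captured in the function's
    // serialized closure, so Flink does not try to serialize the Gson instance
    private static final Gson GSON = new Gson();

    @Override
    public String map(LoginEvent event) {
        return GSON.toJson(event);
    }
}

An equivalent option is to declare the field transient and create the Gson instance in a RichMapFunction's open() method; either way the non-serializable object stays out of the serialized function.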