Flink is a distributed engine and framework for stateful computation over unbounded and bounded data streams.
Unbounded streams: have a defined start but no defined end; they produce data without ever terminating.
Bounded streams: have a defined start and a defined end; all data can be ingested before any computation or sorting takes place. This is commonly called batch processing.
The DataStream API provides processing primitives for many common stream-processing operations.
Maven Configuration
Flink version: 1.18.0
JDK version: 17
<properties>
    <maven.compiler.source>17</maven.compiler.source>
    <maven.compiler.target>17</maven.compiler.target>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <flink-version>1.18.0</flink-version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java</artifactId>
        <version>${flink-version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients</artifactId>
        <version>${flink-version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-runtime-web</artifactId>
        <version>${flink-version}</version>
        <scope>provided</scope>
    </dependency>
    <!-- File system connector -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-files</artifactId>
        <version>${flink-version}</version>
        <scope>provided</scope>
    </dependency>
    <!-- DataGen connector -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-connector-datagen</artifactId>
        <version>${flink-version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.1.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals>
                        <goal>shade</goal>
                    </goals>
                    <configuration>
                        <artifactSet>
                            <excludes>
                                <exclude>com.google.code.findbugs:jsr305</exclude>
                            </excludes>
                        </artifactSet>
                        <filters>
                            <filter>
                                <!-- Do not copy the signatures in the META-INF folder.
                                     Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                <artifact>*:*</artifact>
                                <excludes>
                                    <exclude>META-INF/*.SF</exclude>
                                    <exclude>META-INF/*.DSA</exclude>
                                    <exclude>META-INF/*.RSA</exclude>
                                </excludes>
                            </filter>
                        </filters>
                        <transformers>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                <!-- Replace this with the main class of your job -->
                                <mainClass>my.programs.main.clazz</mainClass>
                            </transformer>
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                        </transformers>
                    </configuration>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
<scope>provided</scope>
A provided dependency is available on the compile and test classpaths but is not added to the runtime classpath, and it is not transitive.
If you run the main method directly from IDEA, it will fail with errors about the missing dependencies. Taking IDEA 2022.2 as an example, enable the run-configuration option "Add dependencies with 'provided' scope to classpath".
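Alternatively, the scope problem can be handled in the POM itself. A hedged sketch, following the profile pattern used by the Flink quickstart archetype (the profile id and the idea.version activation property are assumptions here; repeat the override for each provided dependency you need inside the IDE):
<profiles>
    <profile>
        <id>add-dependencies-for-IDEA</id>
        <activation>
            <property>
                <!-- Assumed: IDEA sets this property when running Maven -->
                <name>idea.version</name>
            </property>
        </activation>
        <dependencies>
            <!-- Override the scope so the IDE run has the class on its classpath -->
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-java</artifactId>
                <version>${flink-version}</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </profile>
</profiles>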
Creating a DataStream
Get the environment --> get the data source --> extract, transform, load --> sink --> execute
public class DataStreamExample {
    public static void main(String[] args) throws Exception {
        // 1. Create the DataStream execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Get the data source
        DataStreamSource<String> streamSource = environment.fromElements("Hello Flink", "Hello SilverGravel", "Hello Java", "Hello Kotlin");
        // 3. Extract, transform, load
        SingleOutputStreamOperator<Tuple2<String, Integer>> streamOperator = streamSource
                .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                    String[] split = value.split(" ");
                    for (String elem : split) {
                        Tuple2<String, Integer> tuple2 = Tuple2.of(elem, 1);
                        out.collect(tuple2);
                    }
                })
                // Lambdas need an explicit return type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy((KeySelector<Tuple2<String, Integer>, String>) value -> value.f0)
                .sum(1);
        // 4. Sink
        streamOperator.print("print");
        // 5. Execute
        environment.execute("DataStream");
    }
}
Sources and Sinks
Flink ships with built-in sources and sinks and also supports third-party ones. The sections below cover the built-in element generators, sockets, DataGen, and files.
Element generation
Flink's built-in element-generation methods:
- fromElements
- fromSequence
- fromCollection
- fromParallelCollection
ArrayList<Integer> list = new ArrayList<>();
for (int i = 0; i < 10; i++) {
    list.add(i);
}
DataStreamSource<Integer> integerDataStreamSource = environment.fromCollection(list);
DataStreamSource<Integer> dataStreamSource = environment.fromElements(1, 2, 3, 4, 5, 6);
DataStreamSource<Long> longDataStreamSource = environment.fromSequence(2L, 3000L);
SplittableIterator<Long> longValueSequenceIterator = new NumberSequenceIterator(100L, 3000L);
DataStreamSource<Long> parallelDataStreamSource = environment.fromParallelCollection(longValueSequenceIterator, Long.class);
Socket
Flink has built-in support for reading data from sockets.
Source
DataStreamSource<String> streamSource = env.socketTextStream("192.168.88.129", 9999);
# On the Linux VM
nc -lk 9999
Sink
SocketClientSink<String> socketClientSink = new SocketClientSink<>(
        "192.168.88.129", 8888, new SimpleStringSchema(StandardCharsets.UTF_8));
# On the Linux VM
nc -lk 8888
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment(new Configuration());
    DataStreamSource<String> socket = environment.socketTextStream("192.168.88.129", 9999);
    SocketClientSink<String> socketClientSink = new SocketClientSink<>(
            "192.168.88.129", 8888, new SimpleStringSchema(StandardCharsets.UTF_8));
    socket.map(value -> "silver:" + value + "\r\n")
            .addSink(socketClientSink).name("socketSink").setParallelism(1);
    socket.print();
    environment.execute();
}
DataGen
As of Flink 1.18.0, the flink-connector-datagen dependency is the recommended way to generate data:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-datagen</artifactId>
    <version>${flink-version}</version>
    <scope>provided</scope>
</dependency>
Source
import org.apache.flink.connector.datagen.source.DataGeneratorSource;

DataGeneratorSource<Integer> source =
        new DataGeneratorSource<>(
                Long::intValue,                   // GeneratorFunction: sequence number -> record
                100,                              // total number of records to emit
                RateLimiterStrategy.perSecond(2), // emit 2 records per second
                Types.INT);
DataStreamSource<Integer> dataStreamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
File
For Flink 1.18.0 with JDK 17, flink-connector-files is the recommended way to read and write files:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-files</artifactId>
    <version>${flink-version}</version>
    <scope>provided</scope>
</dependency>
Source
FileSource<String> source = FileSource
        .forRecordStreamFormat(new TextLineInputFormat("UTF-8"), new Path("src/main/resources/test.txt"))
        .build();
DataStreamSource<String> silverName = env.fromSource(source, WatermarkStrategy.noWatermarks(), "SilverGravel");
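The source above is bounded: it reads the matched files once and finishes. A minimal sketch, assuming the same path, of turning it into an unbounded source via continuous directory monitoring:
// Sketch: rescan the path every 10 seconds so newly added files are read too
FileSource<String> streamingSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat("UTF-8"), new Path("src/main/resources/test.txt"))
        .monitorContinuously(Duration.ofSeconds(10))
        .build();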
Sink
FileSink<Tuple2<String, Integer>> fileSink = FileSink
        .forRowFormat(new Path("F://临时文件"), new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
        // Bucket the output into time-based directories
        .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd HH", ZoneId.systemDefault()))
        // Define the prefix and suffix of the part files
        .withOutputFileConfig(new OutputFileConfig("silver", ".log"))
        // Specify the rolling policy
        .withRollingPolicy(DefaultRollingPolicy.builder()
                // Roll to a new file once the current one reaches 1 GiB (1024 MiB)
                .withMaxPartSize(MemorySize.ofMebiBytes(1024))
                // Roll to a new file every 10 seconds
                .withRolloverInterval(Duration.ofSeconds(10))
                .build())
        .build();
forRowFormat: row-encoded output, one record at a time.
forBulkFormat: bulk-encoded output, writing batches of records (e.g. columnar formats).
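For comparison, a hedged sketch of the forBulkFormat variant. It assumes the flink-parquet format dependency is on the classpath (AvroParquetWriters, OnCheckpointRollingPolicy) and uses a hypothetical MyRecord POJO; bulk-encoded formats only roll files on checkpoint:
// Sketch only: columnar (Parquet) output via forBulkFormat.
// MyRecord is a hypothetical POJO; AvroParquetWriters comes from flink-parquet.
FileSink<MyRecord> bulkSink = FileSink
        .forBulkFormat(new Path("F://临时文件"), AvroParquetWriters.forReflectRecord(MyRecord.class))
        // Bulk formats do not support size- or time-based rolling
        .withRollingPolicy(OnCheckpointRollingPolicy.build())
        .build();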
public static void main(String[] args) throws Exception {
    String[] comment = {"Dawn", "Silver", "Gravel", "Star", "Flink"};
    ThreadLocalRandom current = ThreadLocalRandom.current();
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    (GeneratorFunction<Long, String>) aLong -> comment[current.nextInt(comment.length)] + ":" + aLong,
                    5000,
                    RateLimiterStrategy.perSecond(3),
                    Types.STRING);
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> dataStreamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "source");
    FileSink<Tuple2<String, Integer>> fileSink = FileSink
            .forRowFormat(new Path("F://临时文件"), new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
            // Bucket the output into time-based directories
            .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd HH", ZoneId.systemDefault()))
            // Define the prefix and suffix of the part files
            .withOutputFileConfig(new OutputFileConfig("silver", ".log"))
            // Specify the rolling policy
            .withRollingPolicy(DefaultRollingPolicy.builder()
                    // Roll to a new file once the current one reaches 1 GiB (1024 MiB)
                    .withMaxPartSize(MemorySize.ofMebiBytes(1024))
                    // Roll to a new file every 10 seconds
                    .withRolloverInterval(Duration.ofSeconds(10))
                    .build())
            .build();
    environment.setParallelism(2);
    // Checkpoint every 15 seconds
    environment.enableCheckpointing(Duration.ofSeconds(15).toMillis());
    dataStreamSource.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
        String[] split = value.split(":");
        out.collect(Tuple2.of(split[0], Integer.valueOf(split[1])));
    }).returns(Types.TUPLE(Types.STRING, Types.INT))
            .sinkTo(fileSink);
    environment.execute();
}
Because the code sets environment.setParallelism(2), two .inprogress files are produced. If checkpointing is not enabled via environment.enableCheckpointing(Duration.ofSeconds(15).toMillis()), the in-progress files are never finalized into .log files.
Basic Operators
map
Takes one element and produces one element.
DataStream → DataStream
// Convert each String element into a Tuple2 element
streamSource.map((MapFunction<String, Tuple2<String, Integer>>) value ->
        Tuple2.of(value, 1)
)
// Lambdas need an explicit return type
.returns(Types.TUPLE(Types.STRING, Types.INT))
flatMap
Takes one element and produces zero, one, or more elements.
DataStream → DataStream
// Split each String element on spaces into multiple elements, collected via out.collect(tuple2)
streamSource.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
    String[] split = value.split(" ");
    for (String elem : split) {
        Tuple2<String, Integer> tuple2 = Tuple2.of(elem, 1);
        out.collect(tuple2);
    }
})
// Lambdas need an explicit return type
.returns(Types.TUPLE(Types.STRING, Types.INT))
filter
Evaluates a boolean function for each element and keeps the elements for which the function returns true.
DataStream → DataStream
// Keep the non-zero elements
dataStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});
name, description
Attach a name and a description to an operator.
DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(
        Object::toString,
        Integer.MAX_VALUE,
        RateLimiterStrategy.perSecond(1),
        Types.STRING
);
try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
    DataStreamSource<String> data = environment.fromSource(dataGeneratorSource, WatermarkStrategy.noWatermarks(), "data");
    SingleOutputStreamOperator<Long> streamOperator = data
            .map(Long::valueOf).name("StringToLong").setDescription("convert String to Long")
            .setParallelism(2)
            .filter(value -> value % 2 == 1).name("filter even").setDescription("keep only the odd longs");
    streamOperator.print("nameDescription");
    environment.execute();
} catch (Exception e) {
    throw new RuntimeException(e);
}
Rich Functions (RichFunction)
Rich functions provide lifecycle methods as well as access to the runtime context of the executing function.
As the name suggests, map, filter, and flatMap all have rich variants, which can be seen as enhanced versions of the basic functions.
public class SilverRichMapFunction extends RichMapFunction<Integer, String> {
    @Override
    public void open(Configuration parameters) throws Exception {
        System.out.println("map open...");
    }

    @Override
    public void close() throws Exception {
        System.out.println("map close...");
    }

    @Override
    public String map(Integer value) throws Exception {
        return "silver" + value;
    }
}
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> streamSource = environment.fromElements(1, 2, 3, 4);
environment.setParallelism(3);
streamSource.map(new SilverRichMapFunction())
.print();
environment.execute();
As the output shows, the open and close methods are executed once per parallel subtask. In Flink 1.18.0, close is also invoked if the program fails.
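Besides the lifecycle hooks, rich functions expose the runtime context. A small sketch that tags each record with the subtask that processed it:
public class ContextRichMapFunction extends RichMapFunction<Integer, String> {
    @Override
    public String map(Integer value) throws Exception {
        // getRuntimeContext() is available in every rich function
        int subtask = getRuntimeContext().getIndexOfThisSubtask();
        return "subtask-" + subtask + ":" + value;
    }
}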
Partitioning
Flink provides the following partitioning implementations:
Logical partitioning
keyBy
Logically splits the stream into disjoint partitions. All records with the same key are assigned to the same partition.
DataStream → KeyedStream
Types that cannot be used as keys (see the sketch below this list):
- POJO types that do not override hashCode()
- arrays of any type
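For illustration, a minimal sketch of a POJO that qualifies as a key. WordKey is a hypothetical class; Flink's POJO rules additionally require a public no-argument constructor and accessible fields:
// Hypothetical key type: valid for keyBy because hashCode()/equals() are overridden
public class WordKey {
    public String word;

    public WordKey() {
    }

    @Override
    public int hashCode() {
        return word.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof WordKey && ((WordKey) o).word.equals(word);
    }
}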
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataStream = environment.fromElements("hello silver", "hello gravel", "hello dawn silver gravel");
    SingleOutputStreamOperator<Tuple2<String, Integer>> operator = stringDataStream
            .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                for (String s : value.split(" ")) {
                    Tuple2<String, Integer> tuple2 = Tuple2.of(s, 1);
                    out.collect(tuple2);
                }
            }).returns(Types.TUPLE(Types.STRING, Types.INT));
    // Group by the first field of the tuple
    operator.keyBy(value -> value.f0)
            // Rolling sum over the second field
            .sum(1)
            .print("keyBy");
    environment.execute();
}
Elements with the same key belong to the same partition: here, for example, hello lands in partition 5 and silver in partition 12.
Simple aggregation functions

| Function | Description |
|---|---|
| sum(int positionToSum) | Rolling sum of the given field within each key group. |
| max(int positionToMax) | Rolling maximum of the given field within each key group; the other fields keep the values of the first element. |
| maxBy(int positionToMaxBy) | Rolling maximum of the given field within each key group; the other fields take their values from the element holding the maximum. If several elements share the maximum, the first one is kept. |
| min(int positionToMin) | Rolling minimum of the given field within each key group; the other fields keep the values of the first element. |
| minBy(int positionToMinBy) | Rolling minimum of the given field within each key group; the other fields take their values from the element holding the minimum. If several elements share the minimum, the first one is kept. |
| reduce | Custom rolling aggregation over the keyed stream via the ReduceFunction interface; the simple aggregation APIs above all implement it. |
public static void main(String[] args) {
    // sum, reduce, max, maxBy, min, minBy
    String[] comment = {"Silver", "Gravel"};
    List<Tuple3<String, Integer, Integer>> list = new ArrayList<>();
    list.add(Tuple3.of(comment[0], 1, 11));
    list.add(Tuple3.of(comment[0], 2, 13));
    list.add(Tuple3.of(comment[0], 2, 10));
    list.add(Tuple3.of(comment[0], 2, 15));
    list.add(Tuple3.of(comment[0], 11, 10));
    list.add(Tuple3.of(comment[1], 3, 0));
    list.add(Tuple3.of(comment[1], 3, 1));
    list.add(Tuple3.of(comment[1], 2, 5));
    list.add(Tuple3.of(comment[1], 2, 10));
    System.out.println(Arrays.toString(list.toArray()));
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        environment.setParallelism(1);
        DataStreamSource<Tuple3<String, Integer, Integer>> dataStreamSource = environment.fromCollection(list);
        // sum
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = dataStreamSource
                .map((MapFunction<Tuple3<String, Integer, Integer>, Tuple2<String, Integer>>) value -> Tuple2.of(value.f0, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(value -> value.f0).sum(1);
        // reduce
        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> reduce = dataStreamSource.keyBy(value -> value.f0)
                .reduce(new ReduceFunction<Tuple3<String, Integer, Integer>>() {
                    @Override
                    public Tuple3<String, Integer, Integer> reduce(Tuple3<String, Integer, Integer> value1, Tuple3<String, Integer, Integer> value2) throws Exception {
                        // System.out.println(value1 + "<---->" + value2);
                        return Tuple3.of(value1.f0, value1.f1 + value2.f1, value2.f2);
                    }
                });
        // max
        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> max = dataStreamSource.keyBy(value -> value.f0)
                .max(1);
        // maxBy
        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> maxBy = dataStreamSource.keyBy(value -> value.f0)
                .maxBy(1);
        // sum.print("sum");
        // reduce.print("reduce");
        maxBy.print("maxBy");
        // max.print("max");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
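Tracing the Silver records through the table's semantics shows the difference between max and maxBy in the third field (expected rolling output, derived by hand from the input list above):
// Input for key "Silver" (f1, f2): (1,11) (2,13) (2,10) (2,15) (11,10)
// max(1)   -> (Silver,1,11) (Silver,2,11) (Silver,2,11) (Silver,2,11) (Silver,11,11)
//             f2 always keeps the first element's value 11
// maxBy(1) -> (Silver,1,11) (Silver,2,13) (Silver,2,13) (Silver,2,13) (Silver,11,10)
//             the whole element holding the current maximum f1 is emitted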
Physical partitioning
broadcast
Broadcasts every element to every partition. DataStream → DataStream
public static void main(String[] args) {
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    Object::toString,
                    Long.MAX_VALUE,
                    RateLimiterStrategy.perSecond(2),
                    Types.STRING);
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        // Set parallelism to 3
        environment.setParallelism(3);
        DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
        streamSource.broadcast().map(Long::valueOf).print("broadcast");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
global
Sends all elements to the first partition of the downstream operator. Use with care: this can create a severe performance bottleneck.
public static void main(String[] args) throws Exception {
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    Object::toString,
                    5,
                    RateLimiterStrategy.perSecond(3),
                    Types.STRING);
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    // Set parallelism to 3
    environment.setParallelism(3);
    DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
    streamSource.global().map(Long::valueOf).print("global");
    environment.execute();
}
rebalance
Distributes elements evenly across partitions in a round-robin fashion; useful for mitigating data skew from the source. DataStream → DataStream
DataGeneratorSource<String> source =
        new DataGeneratorSource<>(
                Object::toString,
                15,
                RateLimiterStrategy.perSecond(3),
                Types.STRING);
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
// Set parallelism to 3
environment.setParallelism(3);
DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
streamSource.rebalance().map(Long::valueOf).print("rebalanced");
environment.execute();
rescale
Distributes elements round-robin to a subset of the downstream partitions. DataStream → DataStream
This is a scaled-down balancing rather than the full balancing of rebalance: each upstream subtask only distributes to its local group of downstream subtasks, roughly handing off contiguous batches of elements instead of spreading every element across all partitions.
DataGeneratorSource<String> source =
        new DataGeneratorSource<>(
                Object::toString,
                15,
                RateLimiterStrategy.perSecond(3),
                Types.STRING);
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
// Set parallelism to 3
environment.setParallelism(3);
DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "source");
streamSource.rescale().map(Long::valueOf).print("rescale");
environment.execute();
Partition one: 5-9; partition two: 10-14; partition three: 0-4.
shuffle
Distributes elements uniformly at random; some partitions may receive no elements. DataStream → DataStream
DataGeneratorSource<String> source =
        new DataGeneratorSource<>(
                Object::toString,
                10,
                RateLimiterStrategy.perSecond(3),
                Types.STRING);
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
// Set parallelism to 3
environment.setParallelism(3);
DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
streamSource.shuffle().map(Long::valueOf).print("shuffle");
environment.execute();
partitionCustom
Custom assignment of each element to a specific partition.
partitionCustom(Partitioner<K> partitioner, KeySelector<T, K> keySelector)
public interface Partitioner<K> extends java.io.Serializable, Function {
    int partition(K key, int numPartitions);
}
numPartitions is the parallelism of the downstream operator; key is the value returned by the KeySelector implementation.
public static void main(String[] args) throws Exception {
    String[] comment = {"Dawn", "Silver", "Gravel", "Star"};
    ThreadLocalRandom current = ThreadLocalRandom.current();
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    (GeneratorFunction<Long, String>) aLong -> comment[current.nextInt(comment.length)] + ":" + aLong,
                    5000,
                    RateLimiterStrategy.perSecond(3),
                    Types.STRING);
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        // Set parallelism to 3
        environment.setParallelism(3);
        DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
        streamSource.partitionCustom(new SilverPartitioner(), value -> value.split(":")[1])
                .map(new MapFunction<String, Tuple3<String, Long, Integer>>() {
                    @Override
                    public Tuple3<String, Long, Integer> map(String value) throws Exception {
                        String[] split = value.split(":");
                        long aLong = Long.parseLong(split[1]);
                        int l = (int) (aLong % 3);
                        return Tuple3.of(split[0], aLong, l);
                    }
                }).print("customPartition");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
public static class SilverPartitioner implements Partitioner<String> {
    @Override
    public int partition(String key, int numPartitions) {
        return Integer.parseInt(key) % numPartitions;
    }
}
Merging Streams (Union, Connect) and Side Outputs
union
Unions two or more data streams into a new stream containing all elements from all of them. Note: if you union a data stream with itself, every element appears twice in the resulting stream (see the sketch after the example).
DataStream* → DataStream
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataStream = environment.fromElements("silver", "gravel", "dawn");
    DataStreamSource<Integer> integerDataStream = environment.fromElements(1, 11, 100);
    // The target stream holds Strings, so integerDataStream must be mapped to a String stream first
    stringDataStream.union(integerDataStream.map(Object::toString)).map(value -> "sink->" + value).print();
    environment.execute();
}
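As noted above, unioning a stream with itself duplicates every element. A minimal sketch, reusing the environment from the example:
// Sketch: every element appears twice in the result (subtask interleaving may vary)
DataStreamSource<Integer> numbers = environment.fromElements(1, 2, 3);
numbers.union(numbers).map(value -> "sink->" + value).print();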
connect
"Connects" two data streams, each keeping its own element type. Connecting allows shared state between the two streams.
DataStream, DataStream → ConnectedStreams
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataStream = environment.fromElements("silver", "gravel", "dawn");
    DataStreamSource<Integer> integerDataStream = environment.fromElements(1, 11, 100);
    stringDataStream.connect(integerDataStream).map(new CoMapFunction<String, Integer, String>() {
        @Override
        public String map1(String value) throws Exception {
            return "sink->" + value;
        }

        @Override
        public String map2(Integer value) throws Exception {
            return "sink->" + value;
        }
    }).print();
    environment.execute();
}
Side Outputs
A side-output stream branches off the main stream; its element type may differ from the main stream's, and a stream can have multiple side outputs.
Side outputs are useful for routing bad records, alert data, and similar cases.
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataGeneratorSource<Integer> source =
            new DataGeneratorSource<>(
                    Long::intValue,
                    100,
                    RateLimiterStrategy.perSecond(2),
                    Types.INT);
    DataStreamSource<Integer> dataStreamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
    OutputTag<String> odd = new OutputTag<String>("odd", Types.STRING);
    OutputTag<String> multipleOfFive = new OutputTag<String>("multipleOfFive", Types.STRING);
    SingleOutputStreamOperator<Integer> process = dataStreamSource.process(new ProcessFunction<>() {
        @Override
        public void processElement(Integer value, ProcessFunction<Integer, Integer>.Context ctx, Collector<Integer> out) throws Exception {
            if (value % 2 == 0) {
                // The main stream receives the even numbers
                out.collect(value);
            } else {
                // The main stream can receive the odd numbers as well
                out.collect(value);
                // Odd numbers also go to a side output
                ctx.output(odd, value.toString());
            }
            // An element may go to several outputs, similar to a broadcast
            if (value % 5 == 0) {
                // Side output for multiples of five
                ctx.output(multipleOfFive, "multipleOfFive:" + value);
            }
        }
    });
    process.getSideOutput(odd).map(value -> "odd:" + value).print("odd");
    process.getSideOutput(multipleOfFive).map(value -> {
        if (value.contains("0")) {
            return value.replaceAll("multipleOfFive", "silver");
        }
        return value;
    }).print("multipleOfFive");
    process.print("main");
    environment.execute();
}
Operator Chains
Chaining two operators together lets them run in the same thread, which improves performance. By default, Flink chains operators wherever possible. However, if an operator performs heavy work, chaining may hurt performance, and the chain can be split to optimize.
Conditions for operators to be chained:
- One-to-one connection: repartitioning breaks the chain.
- Equal parallelism: if an operator sets its own parallelism via setParallelism and it differs from its upstream or downstream neighbor, that operator does not chain with it.
Operator chain example:
public static void main(String[] args) {
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        String[] comment = {"Dawn", "Silver", "Gravel", "Star", "Flink"};
        ThreadLocalRandom current = ThreadLocalRandom.current();
        DataGeneratorSource<Integer> generatorSource = new DataGeneratorSource<>(
                Long::intValue,
                3000,
                RateLimiterStrategy.perSecond(2),
                Types.INT
        );
        DataStreamSource<Integer> streamSource = environment.fromSource(generatorSource, WatermarkStrategy.noWatermarks(), "operator-chain");
        SingleOutputStreamOperator<Tuple3<String, Integer, String>> dataStream = streamSource
                .map(value -> comment[current.nextInt(comment.length)] + ":" + value)
                .filter(value -> value.contains("1"))
                .flatMap((FlatMapFunction<String, Tuple3<String, Integer, String>>) (value, out) -> {
                    final String[] split = value.split(":");
                    Tuple3<String, Integer, String> tuple3 =
                            Tuple3.of(split[0], 1, split[1]);
                    out.collect(tuple3);
                }).returns(Types.TUPLE(Types.STRING, Types.INT, Types.STRING));
        SingleOutputStreamOperator<Tuple3<String, Integer, String>> streamOperator = dataStream.keyBy(value -> value.f0).sum(1);
        streamOperator.print("chain");
        System.out.println("http://localhost:8081");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}
flatMap and keyBy are not connected one-to-one, so the pipeline is split into two chains. The example above consists of the source, map, filter, flatMap, keyBy, and print operations.
disableOperatorChaining
Disables operator chaining globally. Add the following to the code above:
environment.disableOperatorChaining();
None of the operators will be chained.
disableChaining
Excludes a single operator from chaining with both its predecessor and its successor.
Append to the filter call in the code above:
.filter(value -> value.contains("1"))
.disableChaining()
The filter operator now stands alone in the job graph.
startNewChain
Excludes an operator from chaining with its predecessor; a new chain starts at this operator.
Append to the filter call in the code above:
.filter(value -> value.contains("1"))
.startNewChain()
The filter operator no longer chains with the map operator, but it does chain with the flatMap operator that follows.