Flink DataStream Summary (Part 1): Basics


Flink is a distributed engine and framework for stateful computation over unbounded and bounded data streams.

Unbounded streams: the start of the stream is defined but its end is not; data is produced endlessly.

Bounded streams: both the start and the end of the stream are defined, so all data can be ingested before computation and sorting begin. This is commonly called batch processing.

The DataStream API provides processing primitives for many common stream-processing operations.

Maven Configuration

Flink version: 1.18.0

JDK version: 17

<properties>  
    <maven.compiler.source>17</maven.compiler.source>  
    <maven.compiler.target>17</maven.compiler.target>  
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>  
    <flink-version>1.18.0</flink-version>  
</properties>  
<dependencies>  
    <dependency>  
        <groupId>org.apache.flink</groupId>  
        <artifactId>flink-streaming-java</artifactId>  
        <version>${flink-version}</version>  
        <scope>provided</scope>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.flink</groupId>  
        <artifactId>flink-clients</artifactId>  
        <version>${flink-version}</version>  
        <scope>provided</scope>  
    </dependency>  
    <dependency>  
        <groupId>org.apache.flink</groupId>  
        <artifactId>flink-runtime-web</artifactId>  
        <version>${flink-version}</version>  
        <scope>provided</scope>  
    </dependency>  
    <!-- file system connector -->
    <dependency>  
        <groupId>org.apache.flink</groupId>  
        <artifactId>flink-connector-files</artifactId>  
        <version>${flink-version}</version>  
        <scope>provided</scope>  
    </dependency>  
    <dependency>  
       <groupId>org.apache.flink</groupId>  
       <artifactId>flink-connector-datagen</artifactId>  
       <version>${flink-version}</version>  
       <scope>provided</scope>  
   </dependency>
</dependencies>
<build>  
    <plugins>  
        <plugin>  
            <groupId>org.apache.maven.plugins</groupId>  
            <artifactId>maven-shade-plugin</artifactId>  
            <version>3.1.1</version>  
            <executions>  
                <execution>  
                    <phase>package</phase>  
                    <goals>  
                        <goal>shade</goal>  
                    </goals>  
                    <configuration>  
                        <artifactSet>  
                            <excludes>  
                                <exclude>com.google.code.findbugs:jsr305</exclude>  
                            </excludes>  
                        </artifactSet>  
                        <filters>  
                            <filter>  
                                <!-- Do not copy the signatures in the META-INF folder.
                                Otherwise, this might cause SecurityExceptions when using the JAR. -->
                                <artifact>*:*</artifact>  
                                <excludes>  
                                    <exclude>META-INF/*.SF</exclude>  
                                    <exclude>META-INF/*.DSA</exclude>  
                                    <exclude>META-INF/*.RSA</exclude>  
                                </excludes>  
                            </filter>  
                        </filters>  
                        <transformers>  
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">  
                                <!-- Replace this with the main class of your job -->  
                                <mainClass>my.programs.main.clazz</mainClass>  
                            </transformer>  
                            <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>  
                        </transformers>  
                    </configuration>  
                </execution>  
            </executions>  
        </plugin>  
    </plugins>  
</build>

Dependencies with <scope>provided</scope> are available on the compile and test classpaths, but are not added to the runtime classpath, and the scope is not transitive.

Running the main method directly from the IDE will therefore fail with errors about missing dependencies. In IDEA 2022.2, for example, enable the run-configuration option "Add dependencies with 'provided' scope to classpath".
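
An alternative to the per-run-configuration IDE setting is a Maven profile that re-declares the Flink dependencies with compile scope for local runs only; the Flink quickstart POM ships a similar profile. A minimal sketch, with an arbitrary profile id:

<profiles>
    <profile>
        <id>run-in-ide</id>
        <dependencies>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-streaming-java</artifactId>
                <version>${flink-version}</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-clients</artifactId>
                <version>${flink-version}</version>
                <scope>compile</scope>
            </dependency>
            <dependency>
                <groupId>org.apache.flink</groupId>
                <artifactId>flink-runtime-web</artifactId>
                <version>${flink-version}</version>
                <scope>compile</scope>
            </dependency>
        </dependencies>
    </profile>
</profiles>

Activate it with mvn -Prun-in-ide, or by ticking the profile in the IDE's Maven panel.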

Creating a DataStream

Get the environment --> get the data source --> extract, transform, load --> output --> execute

public class DataStreamExample {
    public static void main(String[] args) throws Exception {
        // 1. Create the stream execution environment
        StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
        // 2. Get the data source
        DataStreamSource<String> streamSource = environment.fromElements("Hello Flink", "Hello SilverGravel", "Hello Java", "Hello Kotlin");
        // 3. Extract, transform, load
        SingleOutputStreamOperator<Tuple2<String, Integer>> streamOperator = streamSource
                .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                    String[] split = value.split(" ");
                    for (String elem : split) {
                        Tuple2<String, Integer> tuple2 = Tuple2.of(elem, 1);
                        out.collect(tuple2);
                    }
                })
                // Lambdas require an explicit return type
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy((KeySelector<Tuple2<String, Integer>, String>) value -> value.f0)
                .sum(1);
        // 4. Output the data
        streamOperator.print("print");
        // 5. Execute the job
        environment.execute("DataStream");
    }
}

Sources and Sinks

Flink supports third-party data sources and sinks; see the Flink connector documentation for the currently available source/sink implementations.

Generating Elements

Flink ships with built-in element-generation methods:

  • fromElements
  • fromSequence
  • fromCollection
  • fromParallelCollection

ArrayList<Integer> list = new ArrayList<>();  
for (int i = 0; i < 10; i++) {  
    list.add(i);  
}  
DataStreamSource<Integer> integerDataStreamSource = environment.fromCollection(list);  
DataStreamSource<Integer> dataStreamSource = environment.fromElements(1, 2, 3, 4, 5, 6);  
DataStreamSource<Long> longDataStreamSource = environment.fromSequence(2L, 3000L);  
  
SplittableIterator<Long> longValueSequenceIterator = new NumberSequenceIterator(100,3000);  
environment.fromParallelCollection(longValueSequenceIterator, Long.class);

Socket

Flink has built-in methods for receiving data from sockets.

Source

DataStreamSource<String> streamSource = env.socketTextStream("192.168.88.129",9999);
# on the Linux VM
nc -lk 9999

Sink

SocketClientSink<String> socketClientSink = new SocketClientSink<>(  
        "192.168.88.129", 8888, new SimpleStringSchema(StandardCharsets.UTF_8));
# on the Linux VM
nc -lk 8888

public static void main(String[] args) throws Exception {  
  
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment(new Configuration());  
    DataStreamSource<String> socket = environment.socketTextStream("192.168.88.129", 9999);  
    SocketClientSink<String> socketClientSink = new SocketClientSink<>(  
            "192.168.88.129", 8888, new SimpleStringSchema(StandardCharsets.UTF_8));  
    socket.map(value -> "silver:" + value + "\r\n")  
            .addSink(socketClientSink).name("socketSink").setParallelism(1);  
    socket.print();  
    environment.execute();  
}

DataGen

For Flink 1.18.0, the recommended way to generate test data is the flink-connector-datagen dependency.

<dependency>  
    <groupId>org.apache.flink</groupId>  
    <artifactId>flink-connector-datagen</artifactId>  
    <version>${flink-version}</version>  
    <scope>provided</scope>  
</dependency>

Source

import org.apache.flink.connector.datagen.source.DataGeneratorSource;

DataGeneratorSource<Integer> source =  
        new DataGeneratorSource<>(  
                Long::intValue,  
                100,  
                RateLimiterStrategy.perSecond(2),  
                Types.INT);  
DataStreamSource<Integer> dataStreamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");

File

With Flink 1.18.0 and JDK 17, flink-connector-files is the recommended way to read and write files.

<dependency>  
    <groupId>org.apache.flink</groupId>  
    <artifactId>flink-connector-files</artifactId>  
    <version>${flink-version}</version>  
    <scope>provided</scope>  
</dependency>

Source

FileSource<String> source = FileSource  
        .forRecordStreamFormat(new TextLineInputFormat("UTF-8"), new Path("src/main/resources/test.txt")).build();  
DataStreamSource<String> silverName = env.fromSource(source, WatermarkStrategy.noWatermarks(), "SilverGravel");
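
As written, this FileSource is bounded: it reads the files that exist when the job starts and then finishes. If the path should be watched for new files, the builder's monitorContinuously option turns the source into an unbounded one; a minimal sketch (the 5-second interval is arbitrary):

FileSource<String> streamingSource = FileSource
        .forRecordStreamFormat(new TextLineInputFormat("UTF-8"), new Path("src/main/resources"))
        // re-scan the path for new files every 5 seconds
        .monitorContinuously(Duration.ofSeconds(5))
        .build();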

Sink

FileSink<Tuple2<String, Integer>> fileSink = FileSink
        .forRowFormat(new Path("F://临时文件"), new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
        // bucket the output into time-based directories
        .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd HH", ZoneId.systemDefault()))
        // file-name prefix and suffix
        .withOutputFileConfig(new OutputFileConfig("silver", ".log"))
        // rolling policy
        .withRollingPolicy(DefaultRollingPolicy.builder()
                // roll to a new file once a part file reaches 1 GiB (1024 MiB)
                .withMaxPartSize(MemorySize.ofMebiBytes(1024))
                // roll to a new file every 10 seconds
                .withRolloverInterval(Duration.ofSeconds(10))
                .build())
        .build();

  • forRowFormat: row-encoded output
  • forBulkFormat: bulk-encoded output (see the sketch below)
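
forBulkFormat is normally used with columnar formats such as Parquet or ORC via their writer factories, which need extra connector dependencies. To show only the shape of the API, here is a toy sketch of a custom BulkWriter.Factory that writes plain lines (illustrative, not something to use in production):

import org.apache.flink.api.common.serialization.BulkWriter;
import org.apache.flink.core.fs.FSDataOutputStream;

import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ToyBulkWriterFactory implements BulkWriter.Factory<String> {
    @Override
    public BulkWriter<String> create(FSDataOutputStream out) {
        return new BulkWriter<>() {
            @Override
            public void addElement(String element) throws IOException {
                out.write((element + "\n").getBytes(StandardCharsets.UTF_8));
            }

            @Override
            public void flush() throws IOException {
                out.flush();
            }

            @Override
            public void finish() throws IOException {
                flush();
            }
        };
    }
}

// Usage: FileSink.forBulkFormat(new Path("F://临时文件"), new ToyBulkWriterFactory()).build();
// Bulk-encoded sinks always roll their part files on checkpoints (OnCheckpointRollingPolicy).
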
public static void main(String[] args) throws Exception {
    String[] comment = {"Dawn", "Silver", "Gravel", "Star", "Flink"};
    ThreadLocalRandom current = ThreadLocalRandom.current();
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    (GeneratorFunction<Long, String>) aLong -> comment[current.nextInt(comment.length)] + ":" + aLong,
                    5000,
                    RateLimiterStrategy.perSecond(3),
                    Types.STRING);
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> dataStreamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "source");
    FileSink<Tuple2<String, Integer>> fileSink = FileSink
            .forRowFormat(new Path("F://临时文件"), new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
            // bucket the output into time-based directories
            .withBucketAssigner(new DateTimeBucketAssigner<>("yyyy-MM-dd HH", ZoneId.systemDefault()))
            // file-name prefix and suffix
            .withOutputFileConfig(new OutputFileConfig("silver", ".log"))
            // rolling policy
            .withRollingPolicy(DefaultRollingPolicy.builder()
                    // roll to a new file once a part file reaches 1 GiB (1024 MiB)
                    .withMaxPartSize(MemorySize.ofMebiBytes(1024))
                    // roll to a new file every 10 seconds
                    .withRolloverInterval(Duration.ofSeconds(10))
                    .build())
            .build();
    environment.setParallelism(2);
    // checkpoint every 15 seconds
    environment.enableCheckpointing(Duration.ofSeconds(15).toMillis());
    dataStreamSource.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                String[] split = value.split(":");
                out.collect(Tuple2.of(split[0], Integer.valueOf(split[1])));
            }).returns(Types.TUPLE(Types.STRING, Types.INT))
            .sinkTo(fileSink);
    environment.execute();
}

Because the code sets environment.setParallelism(2), there are two .inprogress files. Part files are only finalized on checkpoints, so if environment.enableCheckpointing(Duration.ofSeconds(15).toMillis()) is not configured, no .log files will ever appear.

Basic Operators

map

Takes one element and produces one element.

DataStream → DataStream


// Convert each String element into a Tuple2 element
streamSource.map((MapFunction<String, Tuple2<String, Integer>>) value ->
                Tuple2.of(value, 1)
        )
        // Lambdas require an explicit return type
        .returns(Types.TUPLE(Types.STRING, Types.INT))

flatMap

Takes one element and produces zero, one, or more elements.

DataStream → DataStream


// Split each String element on spaces into multiple elements,
// which are collected via out.collect(tuple2)
streamSource.flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
            String[] split = value.split(" ");
            for (String elem : split) {
                Tuple2<String, Integer> tuple2 = Tuple2.of(elem, 1);
                out.collect(tuple2);
            }
        })
        // Lambdas require an explicit return type
        .returns(Types.TUPLE(Types.STRING, Types.INT))

filter

Evaluates a boolean function for each element and keeps the elements for which the function returns true.

DataStream → DataStream


// keep the non-zero elements
dataStream.filter(new FilterFunction<Integer>() {
    @Override
    public boolean filter(Integer value) throws Exception {
        return value != 0;
    }
});

name & description

Attach a display name and a description to an operator; both appear in the Web UI.


public static void main(String[] args) {
    DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(
            Object::toString,
            Integer.MAX_VALUE,
            RateLimiterStrategy.perSecond(1),
            Types.STRING
    );
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        DataStreamSource<String> data = environment.fromSource(dataGeneratorSource, WatermarkStrategy.noWatermarks(), "data");
        SingleOutputStreamOperator<Long> streamOperator = data.map(Long::valueOf).name("StringToLong").setDescription("convert String to Long")
                .setParallelism(2).filter(value -> value % 2 == 1).name("filter even").setDescription("filter out even longs, keeping odd ones");
        streamOperator.print("nameDescription");
        environment.execute();
    } catch (Exception e) {
        throw new RuntimeException(e);
    }
}

http://localhost:8081

Rich Functions (RichFunction)

Rich functions expose methods for the function lifecycle, as well as access to the runtime context in which the function executes.

As the name suggests, map, filter, and flatMap all have rich variants, which can be seen as enhanced versions of the basic functions.

public class SilverRichMapFunction extends RichMapFunction<Integer, String> {

    @Override
    public void open(Configuration parameters) throws Exception {
        System.out.println("map open...");
    }

    @Override
    public void close() throws Exception {
        System.out.println("map close...");
    }

    @Override
    public String map(Integer value) throws Exception {
        return "silver" + value;
    }
}

StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> streamSource = environment.fromElements(1, 2, 3, 4);
environment.setParallelism(3);
streamSource.map(new SilverRichMapFunction())
        .print();
environment.execute();

As the output shows, each parallel instance runs the open and close methods once. In Flink 1.18.0, close also runs if the program fails.

Partitioning

Flink provides the partitioning strategies covered below.

Logical Partitioning

keyBy

Logically splits the stream into disjoint partitions; all records with the same key are assigned to the same partition.

DataStream → KeyedStream

The following cannot be used as keys (a contrasting sketch of a valid POJO key follows this list):

  • POJO types that do not override hashCode()
  • arrays of any type
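
A minimal, illustrative sketch (not from the original post) of a POJO that does qualify as a keyBy key:

// A key POJO must be public, have a public no-arg constructor,
// and override hashCode() deterministically; a matching equals() is recommended.
public class WordKey {
    public String word;

    public WordKey() {
    }

    public WordKey(String word) {
        this.word = word;
    }

    @Override
    public int hashCode() {
        return word.hashCode();
    }

    @Override
    public boolean equals(Object obj) {
        return obj instanceof WordKey && word.equals(((WordKey) obj).word);
    }
}

// Usage: stream.keyBy(value -> new WordKey(value.f0))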

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataStream = environment.fromElements("hello silver", "hello gravel", "hello dawn silver gravel");
    SingleOutputStreamOperator<Tuple2<String, Integer>> operator = stringDataStream
            .flatMap((FlatMapFunction<String, Tuple2<String, Integer>>) (value, out) -> {
                for (String s : value.split(" ")) {
                    Tuple2<String, Integer> tuple2 = Tuple2.of(s, 1);
                    out.collect(tuple2);
                }
            }).returns(Types.TUPLE(Types.STRING, Types.INT));
    // group by the first field of the tuple
    operator.keyBy(value -> value.f0)
            // rolling sum over the second field
            .sum(1)
            .print("keyBy");
    environment.execute();
}

Records with the same key belong to the same partition; here, for example, hello lands in partition 5 and silver in partition 12.

Simple Aggregations

| Function | Description |
| --- | --- |
| sum(int positionToSum) | Rolling sum of the given field within each key group. |
| max(int positionToMax) | Rolling maximum of the given field within each key group; the other fields keep the values of the first element. |
| maxBy(int positionToMaxBy) | Rolling maximum of the given field within each key group; the other fields take the values of the element holding the maximum. If several elements share the maximum, the first one is kept. |
| min(int positionToMin) | Rolling minimum of the given field within each key group; the other fields keep the values of the first element. |
| minBy(int positionToMinBy) | Rolling minimum of the given field within each key group; the other fields take the values of the element holding the minimum. If several elements share the minimum, the first one is kept. |
| reduce | Arbitrary rolling transformation of the keyed stream via a custom ReduceFunction; the other simple aggregation APIs are implemented on top of this interface. |
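
To make the max/maxBy distinction concrete, here is a hand-traced run of the Silver records from the example below through both operators (derived from the semantics in the table, not program output):

Input (key Silver): (Silver,1,11) (Silver,2,13) (Silver,2,10) (Silver,2,15) (Silver,11,10)

max(1) emits:   (Silver,1,11) (Silver,2,11) (Silver,2,11) (Silver,2,11) (Silver,11,11)
                (f2 stays 11, the value of the first element)

maxBy(1) emits: (Silver,1,11) (Silver,2,13) (Silver,2,13) (Silver,2,13) (Silver,11,10)
                (the whole element holding the current maximum is emitted)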

public static void main(String[] args) {
    // sum, reduce, max, maxBy, min, minBy
    String[] comment = {"Silver", "Gravel"};
    List<Tuple3<String, Integer, Integer>> list = new ArrayList<>();
    list.add(Tuple3.of(comment[0], 1, 11));
    list.add(Tuple3.of(comment[0], 2, 13));
    list.add(Tuple3.of(comment[0], 2, 10));
    list.add(Tuple3.of(comment[0], 2, 15));
    list.add(Tuple3.of(comment[0], 11, 10));
    list.add(Tuple3.of(comment[1], 3, 0));
    list.add(Tuple3.of(comment[1], 3, 1));
    list.add(Tuple3.of(comment[1], 2, 5));
    list.add(Tuple3.of(comment[1], 2, 10));
    System.out.println(Arrays.toString(list.toArray()));
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        environment.setParallelism(1);
        DataStreamSource<Tuple3<String, Integer, Integer>> dataStreamSource = environment.fromCollection(list);
        // sum
        SingleOutputStreamOperator<Tuple2<String, Integer>> sum = dataStreamSource.map((MapFunction<Tuple3<String, Integer, Integer>, Tuple2<String, Integer>>) value -> Tuple2.of(value.f0, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(value -> value.f0).sum(1);

        // reduce
        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> reduce = dataStreamSource.keyBy(value -> value.f0)
                .reduce(new ReduceFunction<Tuple3<String, Integer, Integer>>() {
                    @Override
                    public Tuple3<String, Integer, Integer> reduce(Tuple3<String, Integer, Integer> value1, Tuple3<String, Integer, Integer> value2) throws Exception {
//                        System.out.println(value1 + "<---->" + value2);
                        return Tuple3.of(value1.f0, value1.f1 + value2.f1, value2.f2);
                    }
                });

        // max
        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> max = dataStreamSource.keyBy(value -> value.f0)
                .max(1);

        // maxBy
        SingleOutputStreamOperator<Tuple3<String, Integer, Integer>> maxBy = dataStreamSource.keyBy(value -> value.f0)
                .maxBy(1);

//        sum.print("sum");
//        reduce.print("reduce");
        maxBy.print("maxBy");
//        max.print("max");

        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Physical Partitioning

broadcast

Broadcasts every element to all partitions. DataStream → DataStream

public static void main(String[] args) {
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    Object::toString,
                    Long.MAX_VALUE,
                    RateLimiterStrategy.perSecond(2),
                    Types.STRING);

    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        // set parallelism to 3
        environment.setParallelism(3);
        DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
        streamSource.broadcast().map(Long::valueOf).print("broadcast");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

global

Sends every element to the first partition of the downstream operator, which can create a severe performance bottleneck.

public static void main(String[] args) throws Exception {  
    DataGeneratorSource<String> source =  
            new DataGeneratorSource<>(  
                    Object::toString,  
                    5,  
                    RateLimiterStrategy.perSecond(3),  
                    Types.STRING);  
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();  
    // set parallelism to 3
    environment.setParallelism(3);  
    DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");  
    streamSource.global().map(Long::valueOf).print("global");  
    environment.execute();  
}

rebalance

Distributes elements round-robin, evenly across all partitions; useful for mitigating data skew at the source. DataStream → DataStream

DataGeneratorSource<String> source =  
        new DataGeneratorSource<>(  
                Object::toString,  
                15,  
                RateLimiterStrategy.perSecond(3),  
                Types.STRING);  
  
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();  
// set parallelism to 3
environment.setParallelism(3);  
DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");  
streamSource.rebalance().map(Long::valueOf).print("rebalanced");  
environment.execute();

rescale

Round-robins elements to a subset of the downstream partitions. DataStream → DataStream

This is scaled, local balancing rather than rebalance's full repartitioning: each upstream subtask distributes only within its own group of downstream subtasks.

DataGeneratorSource<String> source =  
        new DataGeneratorSource<>(  
                Object::toString,  
                15,  
                RateLimiterStrategy.perSecond(3),  
                Types.STRING);  
  
  
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();  
// set parallelism to 3
environment.setParallelism(3);  
DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "source");  
  
streamSource.rescale().map(Long::valueOf).print("rescale");  
environment.execute();

Partition 1 received 5-9, partition 2 received 10-14, and partition 3 received 0-4.

shuffle

Distributes elements uniformly at random; some partitions may receive no elements. DataStream → DataStream

DataGeneratorSource<String> source =  
        new DataGeneratorSource<>(  
                Object::toString,  
                10,  
                RateLimiterStrategy.perSecond(3),  
                Types.STRING);  
  
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();  
// set parallelism to 3
environment.setParallelism(3);  
DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");  
streamSource.shuffle().map(Long::valueOf).print("shuffle");  
environment.execute();

partitionCustom

Uses a user-defined partitioner to assign each element to a specific partition.

partitionCustom(  
        Partitioner<K> partitioner, KeySelector<T, K> keySelector)
public interface Partitioner<K> extends java.io.Serializable, Function {  
  
    int partition(K key, int numPartitions);  
}

numPartitions is the parallelism of the downstream operator; key is the value returned by the KeySelector implementation.

public static void main(String[] args) throws Exception {
    String[] comment = {"Dawn", "Silver", "Gravel", "Star"};
    ThreadLocalRandom current = ThreadLocalRandom.current();
    DataGeneratorSource<String> source =
            new DataGeneratorSource<>(
                    (GeneratorFunction<Long, String>) aLong -> comment[current.nextInt(comment.length)] + ":" + aLong,
                    5000,
                    RateLimiterStrategy.perSecond(3),
                    Types.STRING);

    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        // set parallelism to 3
        environment.setParallelism(3);
        DataStreamSource<String> streamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
        streamSource.partitionCustom(new SilverPartitioner(), value -> value.split(":")[1])
                .map(new MapFunction<String, Tuple3<String, Long, Integer>>() {
                    @Override
                    public Tuple3<String, Long, Integer> map(String value) throws Exception {
                        String[] split = value.split(":");
                        long aLong = Long.parseLong(split[1]);
                        int l = (int) (aLong % 3);
                        return Tuple3.of(split[0], aLong, l);
                    }
                }).print("customPartition");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static class SilverPartitioner implements Partitioner<String> {

    @Override
    public int partition(String key, int numPartitions) {
        return Integer.parseInt(key) % numPartitions;
    }
}

Merging Streams (union, connect) and Side Outputs

union

Unions two or more data streams, creating a new stream that contains all elements from all of the input streams. Note: if you union a data stream with itself, the result stream contains each element twice.

DataStream* → DataStream

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataStream = environment.fromElements("silver", "gravel", "dawn");
    DataStreamSource<Integer> integerDataStream = environment.fromElements(1, 11, 100);
    // the target stream is of type String, so integerDataStream must first be mapped to String
    stringDataStream.union(integerDataStream.map(Object::toString)).map(value -> "sink->" + value).print();
    environment.execute();
}

connect

"Connects" two data streams, each of which keeps its own type. Connecting allows the two streams to share state.

DataStream, DataStream → ConnectedStreams

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStreamSource<String> stringDataStream = environment.fromElements("silver", "gravel", "dawn");
    DataStreamSource<Integer> integerDataStream = environment.fromElements(1, 11, 100);
    stringDataStream.connect(integerDataStream).map(new CoMapFunction<String, Integer, String>() {
        @Override
        public String map1(String value) throws Exception {
            return "sink->" + value;
        }

        @Override
        public String map2(Integer value) throws Exception {
            return "sink->" + value;
        }
    }).print();
    environment.execute();
}
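
The CoMapFunction above only merges the two inputs. What the shared state mentioned earlier buys you is shown in the following sketch (illustrative, not from the original post): key both streams by the same key and use a RichCoFlatMapFunction, so both flatMap methods read and write the same per-key state.

// A control stream switches forwarding of a data stream on/off per key.
public class ControlFunction extends RichCoFlatMapFunction<String, String, String> {

    private transient ValueState<Boolean> blocked;

    @Override
    public void open(Configuration parameters) {
        blocked = getRuntimeContext().getState(
                new ValueStateDescriptor<>("blocked", Types.BOOLEAN));
    }

    @Override
    public void flatMap1(String controlValue, Collector<String> out) throws Exception {
        // the control stream marks this key as blocked
        blocked.update(Boolean.TRUE);
    }

    @Override
    public void flatMap2(String dataValue, Collector<String> out) throws Exception {
        // the data stream consults the same per-key state
        if (blocked.value() == null) {
            out.collect(dataValue);
        }
    }
}

// Usage:
// control.keyBy(v -> v).connect(data.keyBy(v -> v)).flatMap(new ControlFunction()).print();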

Side Outputs

A side output is split off from the main stream; its element type may differ from the main stream's element type, and a single operator can emit to multiple side outputs.

Side outputs are useful for routing bad records, emitting alerts, and the like.

public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
    DataGeneratorSource<Integer> source =
            new DataGeneratorSource<>(
                    Long::intValue,
                    100,
                    RateLimiterStrategy.perSecond(2),
                    Types.INT);
    DataStreamSource<Integer> dataStreamSource = environment.fromSource(source, WatermarkStrategy.noWatermarks(), "data");
    OutputTag<String> odd = new OutputTag<String>("odd", Types.STRING);
    OutputTag<String> multipleOfFive = new OutputTag<String>("multipleOfFive", Types.STRING);
    SingleOutputStreamOperator<Integer> process = dataStreamSource.process(new ProcessFunction<>() {
        @Override
        public void processElement(Integer value, ProcessFunction<Integer, Integer>.Context ctx, Collector<Integer> out) throws Exception {
            if (value % 2 == 0) {
                // the main stream receives even numbers
                out.collect(value);
            } else {
                // the main stream can receive odd numbers as well
                out.collect(value);
                // odd-number side output
                ctx.output(odd, value.toString());
            }
            // the same element may go to several outputs, broadcast-style
            if (value % 5 == 0) {
                // multiples-of-five side output
                ctx.output(multipleOfFive, "multipleOfFive:" + value);
            }
        }
    });
    process.getSideOutput(odd).map(value -> "odd:" + value).print("odd");
    process.getSideOutput(multipleOfFive).map(value -> {
        if (value.contains("0")) {
            return value.replaceAll("multipleOfFive", "silver");
        }
        return value;
    }).print("multipleOfFive");
    process.print("main");
    environment.execute();
}

Operator Chains (OperatorChain)

Chaining two operators lets them run in the same thread, which improves performance, and Flink chains operators wherever possible by default. However, if an operator performs heavy work, it can drag down the whole chain, in which case splitting the chain can be an optimization.

Conditions for operators to be chained:

  • one-to-one connection: repartitioning breaks the chain
  • equal parallelism: if an operator sets its own parallelism via setParallelism and it differs from the parallelism of the operator before or after it, it does not chain with that neighbor

Chaining example:


public static void main(String[] args) {
    try (StreamExecutionEnvironment environment = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(new Configuration())) {
        String[] comment = {"Dawn", "Silver", "Gravel", "Star", "Flink"};
        ThreadLocalRandom current = ThreadLocalRandom.current();
        DataGeneratorSource<Integer> generatorSource = new DataGeneratorSource<>(
                Long::intValue,
                3000,
                RateLimiterStrategy.perSecond(2),
                Types.INT
        );
        DataStreamSource<Integer> streamSource = environment.fromSource(generatorSource, WatermarkStrategy.noWatermarks(), "operator-chain");
        SingleOutputStreamOperator<Tuple3<String, Integer, String>> dataStream = streamSource.map(value -> comment[current.nextInt(comment.length)] + ":" + value)
                .filter(value -> value.contains("1"))
                .flatMap((FlatMapFunction<String, Tuple3<String, Integer, String>>) (value, out) -> {
                    final String[] split = value.split(":");
                    Tuple3<String, Integer, String> tuple3 =
                            Tuple3.of(split[0], 1, split[1]);
                    out.collect(tuple3);
                }).returns(Types.TUPLE(Types.STRING, Types.INT, Types.STRING));
        SingleOutputStreamOperator<Tuple3<String, Integer, String>> streamOperator = dataStream.keyBy(value -> value.f0).sum(1);
        streamOperator.print("chain");
        System.out.println("http://localhost:8081");
        environment.execute();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

In the job graph, flatMap and keyBy are not connected one-to-one, so the pipeline is split into two chains. The example above consists of source, map, filter, flatMap, keyBy, and print operations.

disableOperatorChaining

Disables operator chaining globally.

Add this to the code above:

environment.disableOperatorChaining();

No operators will be chained together.

disableChaining

Excludes an operator from chaining with the operators before and after it.

Add this to the filter in the code above:

.filter(value -> value.contains("1"))  
.disableChaining()

In the resulting job graph, the filter operator now stands alone.

startNewChain

Excludes an operator from chaining with the previous operator; a new chain starts at this operator.

Add this to the filter in the code above:

.filter(value -> value.contains("1"))  
.startNewChain()

In the resulting job graph, the filter operator no longer chains with the map operator; instead it starts a new chain with the downstream flatMap operator.