Flink Stream Processing API



Environment

getExecutionEnvironment

Creates an execution environment that represents the context of the current program. If the program is invoked standalone, this method returns a local execution environment; if the program is invoked from a command-line client to be submitted to a cluster, it returns the execution environment of that cluster.
In other words, getExecutionEnvironment decides what kind of environment to return based on how the program is run, which makes it the most common way to create an execution environment.

// Create a batch execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Create a stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

createLocalEnvironment

Returns a local execution environment; the default parallelism is specified when calling it.

LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(1);

createRemoteEnvironment

Returns a cluster execution environment and submits the Jar to a remote server. You must specify the JobManager's hostname (or IP) and port when calling it, as well as the Jar package(s) to run on the cluster.

StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment("jobManager-hostname",6123,"path//xxx.jar");

Source

Reading data from a collection
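A minimal sketch of reading from an in-memory collection, using the Flink 1.12 API from the pom below; the sample elements are made up for illustration:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.Arrays;

public class SourceCollectionDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Build a bounded stream from an in-memory collection
        DataStream<String> collectionStream =
                env.fromCollection(Arrays.asList("sensor_1,35.8", "sensor_2,15.4"));

        // fromElements is a shorthand for individual values
        DataStream<Integer> elementStream = env.fromElements(1, 2, 3);

        collectionStream.print("collection");
        elementStream.print("elements");

        env.execute();
    }
}
```

Collection sources are bounded, so the job finishes once all elements are emitted; this makes them handy for local testing.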

Reading data from a file
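Reading a text file line by line can be sketched as follows; the file path here is only a placeholder:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceFileDemo {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Read each line of the file as one String element (path is hypothetical)
        DataStream<String> fileStream = env.readTextFile("/path/to/sensor.txt");

        fileStream.print();
        env.execute();
    }
}
```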

Reading data from Kafka

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hi-flink</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.12.1</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
</project>

package com.hi.flink.source;


import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;


public class SourceKafkaDemo {

    public static void main(String[] args) throws Exception{
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Set the parallelism to 1
        env.setParallelism(1);

        Properties p  = new Properties();
        p.setProperty("bootstrap.servers","localhost:9092");
        p.setProperty("group.id","consumer-group");
        p.setProperty("key.deserializer","org.apache.kafka.common.serialization.StringDeserializer");
        p.setProperty("value.deserializer","org.apache.kafka.common.serialization.StringDeserializer");
        p.setProperty("auto.offset.reset","latest");

        DataStream<String> dataStream = env.addSource(new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(),p));

        // Print the stream (just printing, no aggregation)
        dataStream.print();

        env.execute();
    }
}

Transform

map, flatMap, and filter are usually referred to collectively as the basic transformation operators (simple transformation operators).

Map


// Map each input string to its length
DataStream<Integer> mapStream = dataStream.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String value) throws Exception {
        return value.length();
    }
});

flatMap

// Split each comma-separated line and emit every field as its own element
DataStream<String> flatMapStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        String[] fields = value.split(",");
        for (String field : fields) {
            out.collect(field);
        }
    }
});

filter

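filter keeps only the elements for which the predicate returns true. A sketch in the same style as the map and flatMap snippets, assuming dataStream is a DataStream&lt;String&gt; of comma-separated sensor records (the predicate is illustrative):

```java
DataStream<String> filterStream = dataStream.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String value) throws Exception {
        // Keep only the records from sensor_1 (hypothetical condition)
        return value.startsWith("sensor_1");
    }
});
```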

keyBy


DataStream -> KeyedStream: logically splits a stream into disjoint partitions, where each partition contains elements with the same key. Internally this is implemented with hash partitioning.
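A sketch of keyBy on an assumed stream of (sensorId, temperature) tuples; since Flink 1.11 the KeySelector form is preferred over keying by field index or name:

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.KeyedStream;

// pairStream is an assumed DataStream<Tuple2<String, Double>> of (sensorId, temperature)
KeyedStream<Tuple2<String, Double>, String> keyedStream =
        pairStream.keyBy(t -> t.f0);
```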

Rolling aggregation operators (Rolling Aggregation)

These operators perform an aggregation on each keyed substream of a KeyedStream.

  • sum()
  • min()
  • max()
  • minBy()
  • maxBy()
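A sketch of the difference between max and maxBy, assuming keyedStream is a KeyedStream&lt;Tuple2&lt;String, Double&gt;, String&gt; of (sensorId, temperature) keyed by sensor id:

```java
// max(1): rolling maximum of field f1; other fields keep their first-seen values
DataStream<Tuple2<String, Double>> maxTemp = keyedStream.max(1);

// maxBy(1): emits the whole element that currently holds the maximum in field f1
DataStream<Tuple2<String, Double>> maxByTemp = keyedStream.maxBy(1);
```

In short, max only updates the aggregated field, while maxBy returns the complete record that achieved the extreme value; min and minBy differ in the same way.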

Reduce
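reduce combines the current element with the last reduced value to produce a new value of the same type, giving more general rolling aggregations than sum/min/max. A sketch, assuming keyedStream is a KeyedStream&lt;Tuple2&lt;String, Integer&gt;, String&gt; of (word, count) tuples:

```java
import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

// Emit a rolling per-key sum: each incoming element is folded into the previous result
DataStream<Tuple2<String, Integer>> summed = keyedStream.reduce(
        new ReduceFunction<Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> reduce(Tuple2<String, Integer> a,
                                                  Tuple2<String, Integer> b) throws Exception {
                return Tuple2.of(a.f0, a.f1 + b.f1);
            }
        });
```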