Flink - DataStream Example Program


Architecture Overview

The figure below shows Flink's overall architecture:

[Figure: Flink architecture]

As the diagram shows, the Runtime layer exposes two APIs: the DataStream API and the DataSet API. The DataSet API is for batch processing and the DataStream API is for stream processing. A simple way to tell batch from stream is whether the data is bounded.


  • When the data is bounded, for example the string "I have a cat", a finite sequence of characters, it can be processed with batch operations.
  • When a source keeps receiving data and there is no defined point at which the input ends, the data is unbounded. The Linux nc (netcat) command, e.g. nc -lk 999, can simulate such a continuous input source.


Note that since Flink 1.12 the DataSet API is no longer recommended: the DataStream API now supports batch execution as well.

As the examples below show, bounded data can also be processed as a stream; in that sense, any data can be treated as a stream and handled with streaming operations.
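To make this concrete, below is a minimal sketch (the class name and sample words are illustrative, not from a real project) that runs a keyed count over a bounded source in batch mode; the word-count examples that follow use the same operators:

package com.learn.flink;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BoundedAsBatch {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Legal only because the source below is bounded; with an unbounded
        // source (e.g. a socket) BATCH mode fails at job startup.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.fromElements("cat", "dog", "cat")
                .map(word -> Tuple2.of(word, 1))
                .returns(Types.TUPLE(Types.STRING, Types.INT))
                .keyBy(t -> t.f0)
                .sum(1)
                .print(); // BATCH prints one final count per word;
                          // STREAMING would print every intermediate update
        env.execute("bounded-as-batch");
    }
}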

Programming Examples

The following Java examples implement a word-count program:

  • First, configure the pom dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.learn</groupId>
    <artifactId>flink-demo1</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>11</maven.compiler.source>
        <maven.compiler.target>11</maven.compiler.target>
        <java.version>11</java.version>
        <flink.version>1.13.2</flink.version>
    </properties>

    <!-- repository mirror -->
    <repositories>
        <repository>
            <id>aliyun</id>
            <name>maven-aliyun</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>

    <!-- dependencies -->
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- DataStream API (StreamExecutionEnvironment and friends) -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.12</artifactId>
            <version>${flink.version}</version>
        </dependency>

        <!-- slf4j-->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>1.7.25</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>1.7.25</version>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.8.1</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                </configuration>
            </plugin>
            <plugin>
                <artifactId>maven-release-plugin</artifactId>
                <version>2.5.3</version>
            </plugin>
            <plugin>
                <artifactId>maven-source-plugin</artifactId>
                <version>3.2.0</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>jar-no-fork</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

</project>
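A packaging note: the pom above is enough for running from the IDE. When submitting to a cluster with flink run (as in the YARN example below), the Flink dependencies are normally given provided scope and the jar is built with the maven-shade-plugin, since the cluster already ships the Flink runtime.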
  • WordCountStream demo1
package com.learn.flink;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStream {

    public static void main(String[] args) throws Exception {
        // 0: env
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // If no mode is set, STREAMING is used by default.
        // env.setRuntimeMode(RuntimeExecutionMode.BATCH);     // batch mode (bounded sources only)
        // env.setRuntimeMode(RuntimeExecutionMode.STREAMING); // streaming mode
        // env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); // choose automatically: batch if all sources are bounded, streaming otherwise

        // 1: source
        final DataStream<String> lines = env.fromElements("Who's there?",
                "I think I hear them. Stand, ho! Who's there?");

        // 2: transformation
        // split each line into words
        final DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                final String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });

        // map each word to a (word, 1) pair
        final DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String word) throws Exception {
                return Tuple2.of(word, 1);
            }
        });

        // group by the word
        final KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        // aggregate: sum the counts (tuple field 1)
        final SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        // 3: sink
        result.print();

        // 4: execute - submit the job and wait for it to finish
        env.execute("count2");

    }
}

Running the program prints a running count per word. In streaming mode every incoming element emits an updated total, so a word that appears twice shows up first as (word,1) and later as (word,2).
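A side note before the next demo: the anonymous inner classes above can be written as lambdas, but Java's type erasure then hides the generic output types from Flink, so an explicit .returns(...) hint is required (the WordCountStream_onYarn example below uses this style). A minimal sketch, assuming the same lines stream as in the demo and imports for org.apache.flink.api.common.typeinfo.Types and java.util.Arrays:

// Lambda form of the flatMap above. Without .returns(Types.STRING), Flink
// rejects the job with an InvalidTypesException because the Collector's
// type parameter has been erased.
final DataStream<String> words = lines
        .flatMap((String line, Collector<String> out) ->
                Arrays.stream(line.split(" ")).forEach(out::collect))
        .returns(Types.STRING);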

  • WordCountStream demo2
package com.learn.flink;

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class WordCountStream2 {

    public static void main(String[] args) throws Exception {
        // 0: env
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); // choose automatically: batch if all sources are bounded, streaming otherwise
        // 1: source - read lines from a socket; simulate the input with Linux nc on node01 (e.g. nc -lk 999)
        final DataStreamSource<String> lines = env.socketTextStream("node01", 999);

        // 2: transformation
        // split each line into words
        final DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) throws Exception {
                final String[] arr = value.split(" ");
                for (String word : arr) {
                    out.collect(word);
                }
            }
        });

        // map each word to a (word, 1) pair
        final DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public Tuple2<String, Integer> map(String word) throws Exception {
                return Tuple2.of(word, 1);
            }
        });

        // group by the word
        final KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        // aggregate: sum the counts (tuple field 1)
        final SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        // 3: sink
        result.print();

        // 4: execute - submit the job and wait for it to finish
        env.execute("count2");

    }
}

Start nc on node01 and type a few lines of words; each line is sent to the job as soon as it is entered. The console prints an updated count for every word as the lines arrive.

Note: STREAMING, AUTOMATIC, and leaving the mode unset all behave the same here; the job runs as a stream. Setting RuntimeExecutionMode.BATCH, however, makes the job fail at startup, because a socket source is unbounded and cannot be executed in batch mode.

  • WordCountStream_onYarn
package com.learn.flink;

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

import java.util.Arrays;

/**
 * Package the program, upload it, and run it with flink:
 * /usr/local/opt/flink/bin/flink run
 *   deployment target: -m yarn-cluster
 *   main class:        -c com.learn.flink.WordCountStream_onYarn
 *   jar:               flink-demo1-1.0-SNAPSHOT.jar
 *   program args:      --outPath hdfs://node01:9000/flink/completed-jobs/wordcount_
 */
public class WordCountStream_onYarn {

    public static void main(String[] args) throws Exception {

        // read the outPath argument
        final ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String outPath;
        if (parameterTool.has("outPath")) {
            outPath = parameterTool.get("outPath");
            System.out.println("Output path: " + outPath);
        } else {
            outPath = "hdfs://node01:9000/flink/output_";
            System.out.println("No output path given, using default: " + outPath);
        }

        // 0: env
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // 1: source
        final DataStream<String> lines = env.fromElements("Who's there?",
                "I think I hear them. Stand, ho! Who's there?");
        // 2: transformation
        // split each line into words
        final DataStream<String> words = lines.flatMap((String value, Collector<String> out) -> {
            Arrays.stream(value.split(" ")).forEach(out::collect);
        }).returns(Types.STRING);

        // map each word to a (word, 1) pair
        final SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = words.map(
                (String word) -> Tuple2.of(word, 1)).returns(Types.TUPLE(Types.STRING, Types.INT));

        // group by the word
        final KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);

        // aggregate: sum the counts (tuple field 1)
        final SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);

        // 3: sink - write the result to HDFS
        result.writeAsText(outPath + System.currentTimeMillis()).setParallelism(1);

        // 4: execute - submit the job and wait for it to finish
        env.execute("WordCountStream_onYarn");

    }
}
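One caveat on the sink: writeAsText is deprecated in recent Flink releases. A hedged sketch of the row-format replacement, StreamingFileSink, writing the same result stream (extra imports: org.apache.flink.api.common.serialization.SimpleStringEncoder, org.apache.flink.core.fs.Path, org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink):

// Sketch: StreamingFileSink instead of the deprecated writeAsText.
// Each record's toString() becomes one line of output.
final StreamingFileSink<Tuple2<String, Integer>> sink = StreamingFileSink
        .forRowFormat(new Path(outPath), new SimpleStringEncoder<Tuple2<String, Integer>>("UTF-8"))
        .build();
// Part files are only finalized on checkpoints, so enable checkpointing
// when using this sink.
result.addSink(sink);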

Package the class into a jar and upload it to the Flink host; here /usr/local/opt/flink/task is used as the directory for job jars:


Run it with: /usr/local/opt/flink/bin/flink run -m yarn-cluster -c com.learn.flink.WordCountStream_onYarn flink-demo1-1.0-SNAPSHOT.jar --outPath hdfs://node01:9000/flink/completed-jobs/wordcount_

  • -m yarn-cluster: run the job on a YARN per-job cluster
  • -c com.learn.flink.WordCountStream_onYarn: the entry (main) class
  • flink-demo1-1.0-SNAPSHOT.jar: the jar to submit
  • --outPath hdfs://node01:9000/flink/completed-jobs/wordcount_: program argument read by ParameterTool

Result: the YARN ResourceManager web UI shows a Flink per-job application; the HDFS web UI (port 9870) shows the output file written by the job; and the Flink history server (port 8082) lists the completed job.