Architecture Overview
The diagram below shows Flink's overall architecture:
As the diagram shows, the Runtime layer exposes two APIs: the DataStream API and the DataSet API. The DataSet API is for batch processing and the DataStream API for stream processing. Whether a job is batch or streaming can be judged simply by whether its data is bounded:
- When the data is bounded, for example the string "I have a cat", a finite sequence of characters, it can be processed with batch operations.
- When a data source keeps receiving data and there is no defined point at which it ends, the data is unbounded. Linux `nc` (netcat) can be used to simulate such a continuous input source.
Since Flink 1.12, however, the DataSet API is no longer recommended, because the DataStream API also supports batch execution.
As the examples below show, bounded data can be processed as a stream as well, so in practice any kind of data can be handled with stream operations.
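As a minimal sketch of how this works (the class name and element values here are illustrative, not part of the demos below), the execution mode is selected on the `StreamExecutionEnvironment`:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RuntimeModeSketch {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // AUTOMATIC picks BATCH when every source is bounded and STREAMING otherwise;
        // STREAMING (the default) or BATCH can also be forced explicitly.
        env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC);
        // fromElements is a bounded source, so under AUTOMATIC this runs as a batch job.
        env.fromElements("I have a cat", "I have a dog").print();
        env.execute("runtime-mode-sketch");
    }
}
```

The complete word-count examples below use the same `setRuntimeMode` call (commented out in demo1, enabled in demo2).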
Programming Examples
The following Java examples implement a program that counts word occurrences:
- First, configure the pom dependencies:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.learn</groupId>
<artifactId>flink-demo1</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>11</maven.compiler.source>
<maven.compiler.target>11</maven.compiler.target>
<java.version>11</java.version>
<flink.version>1.13.2</flink.version>
</properties>
<!-- Repository (Aliyun mirror) -->
<repositories>
<repository>
<id>aliyun</id>
<name>maven-aliyun</name>
<url>http://maven.aliyun.com/nexus/content/groups/public/</url>
</repository>
</repositories>
<!-- Dependencies -->
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients_2.12</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- DataStream API classes used by the examples below -->
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.12</artifactId>
<version>${flink.version}</version>
</dependency>
<!-- slf4j-->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>1.7.25</version>
</dependency>
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-simple</artifactId>
<version>1.7.25</version>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.8.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
<plugin>
<artifactId>maven-release-plugin</artifactId>
<version>2.5.3</version>
</plugin>
<plugin>
<artifactId>maven-source-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>jar-no-fork</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
- WordCountStream demo1
package com.learn.flink;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class WordCountStream {
public static void main(String[] args) throws Exception {
// 0: env
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// If no runtime mode is set, STREAMING is used by default.
// env.setRuntimeMode(RuntimeExecutionMode.BATCH); // force batch execution
// env.setRuntimeMode(RuntimeExecutionMode.STREAMING); // force stream execution
// env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); // choose automatically: batch for bounded sources, streaming for unbounded ones
// 1: source
final DataStream<String> lines = env.fromElements("Who's there?",
"I think I hear them. Stand, ho! Who's there?");
// 2: transformation
// Split each line into words
final DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
final String[] arr = value.split(" ");
for (String word : arr) {
out.collect(word);
}
}
});
// Map each word to a (word, 1) tuple
final DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String word) throws Exception {
return Tuple2.of(word, 1);
}
});
// Group by the word (the first tuple field)
final KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);
// Aggregate: sum the counts (the second tuple field)
final SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);
// 3: sink
result.print();
// 4: execute - submit the job and wait for it to finish
env.execute("count2");
}
}
The output: each record printed by `result.print()` is a (word, running count) tuple, and the count for a word increases each time it reappears in the input.
- WordCountStream demo2
package com.learn.flink;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class WordCountStream2 {
public static void main(String[] args) throws Exception {
// 0: env
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.AUTOMATIC); // choose automatically: batch for bounded sources, streaming for unbounded ones
// 1: source - read lines from a socket; start `nc -lk 999` on node01 to simulate the input
final DataStreamSource<String> lines = env.socketTextStream("node01", 999);
// 2: transformation
// Split each line into words
final DataStream<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
final String[] arr = value.split(" ");
for (String word : arr) {
out.collect(word);
}
}
});
// Map each word to a (word, 1) tuple
final DataStream<Tuple2<String, Integer>> wordAndOne = words.map(new MapFunction<String, Tuple2<String, Integer>>() {
@Override
public Tuple2<String, Integer> map(String word) throws Exception {
return Tuple2.of(word, 1);
}
});
// Group by the word (the first tuple field)
final KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);
// Aggregate: sum the counts (the second tuple field)
final SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);
// 3: sink
result.print();
// 4: execute - submit the job and wait for it to finish
env.execute("count2");
}
}
Input (lines typed into the nc session):
Output:
Note: whether the runtime mode is set to STREAMING, set to AUTOMATIC, or left unset, the behaviour is the same: the job runs as a streaming job, because the socket source is unbounded. Setting the mode to BATCH, however, causes the job to fail, since BATCH execution requires every source to be bounded.
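For contrast, here is a minimal sketch (the file path is a hypothetical placeholder) in which BATCH mode is legal, because the only source is bounded:

```java
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchModeSketch {
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH requires every source to be bounded: a file is read once and ends,
        // whereas the socket source above never ends, which is why BATCH fails there.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);
        env.readTextFile("file:///tmp/words.txt").print();
        env.execute("batch-mode-sketch");
    }
}
```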
- WordCountStream_onYarn
package com.learn.flink;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
import java.util.Arrays;
/**
 * Package the program, upload it to the Flink host, and submit it with:
 * /usr/local/opt/flink/bin/flink run
 * deployment mode: -m yarn-cluster
 * main class: -c com.learn.flink.WordCountStream_onYarn
 * jar: flink-demo1-1.0-SNAPSHOT.jar
 * program args: --outPath hdfs://node01:9000/flink/completed-jobs/wordcount_
 */
public class WordCountStream_onYarn {
public static void main(String[] args) throws Exception {
// Read the outPath argument from the command line
final ParameterTool parameterTool = ParameterTool.fromArgs(args);
String outPath;
if (parameterTool.has("outPath")) {
outPath = parameterTool.get("outPath");
System.out.println("Using specified output path: " + outPath);
} else {
outPath = "hdfs://node01:9000/flink/output_";
System.out.println("No output path specified, using default: " + outPath);
}
// 0: env
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// 1: source
final DataStream<String> lines = env.fromElements("Who's there?",
"I think I hear them. Stand, ho! Who's there?");
// 2: transformation
// Split each line into words
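// The lambda's generic output type is erased at compile time, so the element type
// has to be declared explicitly with returns(...); the same applies to the map below.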
final DataStream<String> words = lines.flatMap((String value, Collector<String> out) -> {
Arrays.stream(value.split(" ")).forEach(out::collect);
}).returns(Types.STRING);
// Map each word to a (word, 1) tuple
final SingleOutputStreamOperator<Tuple2<String, Integer>> wordAndOne = words.map(
(String word) -> Tuple2.of(word, 1)).returns(Types.TUPLE(Types.STRING, Types.INT));
// Group by the word (the first tuple field)
final KeyedStream<Tuple2<String, Integer>, String> grouped = wordAndOne.keyBy(t -> t.f0);
// Aggregate: sum the counts (the second tuple field)
final SingleOutputStreamOperator<Tuple2<String, Integer>> result = grouped.sum(1);
// 3: sink - write the result to HDFS as text
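// Note: writeAsText is deprecated in this Flink version; the StreamingFileSink/FileSink
// connector is the recommended replacement, but writeAsText keeps the example short.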
result.writeAsText(outPath + System.currentTimeMillis()).setParallelism(1);
// 4: execute - submit the job and wait for it to finish
env.execute("WordCountStream_onYarn");
}
}
Package the class into a jar and upload it to the Flink host; here /usr/local/opt/flink/task is used as the directory holding job jars.
Run it with:
/usr/local/opt/flink/bin/flink run -m yarn-cluster -c com.learn.flink.WordCountStream_onYarn flink-demo1-1.0-SNAPSHOT.jar --outPath hdfs://node01:9000/flink/completed-jobs/wordcount_
- -m yarn-cluster: run as a Flink per-job application on YARN
- -c com.learn.flink.WordCountStream_onYarn: the main class to execute
- flink-demo1-1.0-SNAPSHOT.jar: the jar containing the job
- --outPath hdfs://node01:9000/flink/completed-jobs/wordcount_: program argument passed to main(args)
Results:
The YARN ResourceManager web UI shows a Flink per-job application being executed.
The HDFS web UI (port 9870) shows the output file written by the job.
The Flink history server web UI (port 8082) shows the completed job.