"This is day 22 of my participation in the 2022 First Article Challenge; see the event page '2022 First Article Challenge' for details."
1. Preparation
Requirement: count how many times each word appears in a file and write the result to an output file. Steps:
- Read the data source
- Process the data source
  - Split each line of the input file on spaces
  - Map each word to a (word, 1) pair and group by word (bring identical words together)
  - Sum the counts for identical words (accumulate the trailing 1s)
- Save the result
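To make the steps concrete before introducing Flink, the same split/group/sum logic can be sketched in plain Java (a hypothetical `WordCountSketch` helper, no Flink involved):

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSketch {
    // Count words in a line the same way the Flink job will:
    // split on spaces, map each word to 1, then sum per key.
    static Map<String, Integer> count(String line) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : line.split(" ")) {
            counts.merge(word, 1, Integer::sum); // accumulate the trailing 1s
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count("hello donald hello"));
    }
}
```

Flink applies exactly this logic, but distributes the grouping and summing across parallel subtasks.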
Add the dependencies:

```xml
<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.12</artifactId>
        <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-clients_2.12</artifactId>
        <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-scala_2.12</artifactId>
        <version>1.11.1</version>
    </dependency>
    <!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.12</artifactId>
        <version>1.11.1</version>
    </dependency>
</dependencies>
```
2. Implementation
A Flink program generally follows these steps:
- Obtain an execution environment
- Load/create the initial data
- Specify the operators that transform the data
- Specify where to store the results
- Call execute() to trigger the program
Note: Flink programs are lazily evaluated; the program only actually runs when execute() is finally called.
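The lazy-evaluation point can be illustrated with a minimal plain-Java sketch (no Flink APIs; the `LazyPlan` class is made up for illustration): each map call only records a step in the plan, and nothing runs until execute() is called.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

public class LazyPlan {
    private final List<UnaryOperator<String>> ops = new ArrayList<>();
    private final String source;

    LazyPlan(String source) { this.source = source; }

    // Like a Flink operator: records the step, runs nothing yet.
    LazyPlan map(UnaryOperator<String> op) {
        ops.add(op);
        return this;
    }

    // Like env.execute(): only now is the recorded plan actually run.
    String execute() {
        String value = source;
        for (UnaryOperator<String> op : ops) {
            value = op.apply(value);
        }
        return value;
    }

    public static void main(String[] args) {
        LazyPlan plan = new LazyPlan("hello").map(String::toUpperCase);
        // Nothing has run yet; the result only materializes on execute()
        System.out.println(plan.execute()); // HELLO
    }
}
```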
(1) Batch processing
Java version

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * @author donald
 * @date 2021/04/15
 */
public class WordCount {
    public static void main(String[] args) throws Exception {
        // Input and output paths are passed as arguments:
        // the first argument is the input path, the second the output path
        String inPath = args[0];
        String outPath = args[1];
        // Obtain the Flink batch execution environment
        ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();
        // Read the file contents
        DataSet<String> text = executionEnvironment.readTextFile(inPath);
        // Process the data
        DataSet<Tuple2<String, Integer>> dataSet = text.flatMap(new LineSplitter())
                .groupBy(0) // group by the first tuple field, e.g. "hello" in (hello, 1)
                .sum(1);    // sum the second tuple field, e.g. 1 in (hello, 1)
        dataSet.writeAsCsv(outPath, "\n", " ").setParallelism(1);
        // Trigger execution
        executionEnvironment.execute("wordcount batch process");
    }

    static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) {
            // Split: "hello donald" -> (hello, 1) (donald, 1)
            for (String word : line.split(" ")) {
                collector.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
```
Scala version

```scala
import org.apache.flink.api.scala._

/**
 * @author donald
 * @date 2021/04/15
 */
object WordCountBatch {
  def main(args: Array[String]): Unit = {
    val inputPath = ""
    val outputPath = ""
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val text: DataSet[String] = environment.readTextFile(inputPath)
    val out: AggregateDataSet[(String, Int)] = text.flatMap(_.split(" ")).map((_, 1)).groupBy(0).sum(1)
    out.writeAsCsv(outputPath, "\n", " ").setParallelism(1)
    environment.execute("word count batch")
  }
}
```
(2) Stream processing
Use a socket to simulate words being sent in real time; Flink receives the data, aggregates it within a given time window (e.g. 5 s), computes a summary every 1 s, and prints each window's result.
Simulate sending messages to the port:

```shell
nc -lp 7788
# In practice:
donald@donald-pro:~$ nc -lp 7788
hello
hello donald
```

Scala version
```scala
import org.apache.flink.streaming.api.scala._

/**
 * @author donald
 * @date 2021/04/15
 *
 * Streaming data
 *
 * Feed input from a console: nc -lp 7788
 */
object WordCountStream {
  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val streamData: DataStream[String] = environment.socketTextStream("127.0.0.1", 7788)
    val out = streamData.flatMap(_.split(" ")).map((_, 1)).keyBy(0).sum(1)
    out.print()
    environment.execute()
  }
}
```
Input:

```
donald@donald-pro:~$ nc -lp 7788
hello
hello donald
I
have
not
seen
you
for
a
long
time
hello
```
The output is as follows:

```
3> (hello,1)
8> (donald,1)
3> (hello,2)
3> (I,1)
7> (have,1)
4> (not,1)
3> (seen,1)
5> (you,1)
3> (for,1)
6> (a,1)
1> (long,1)
5> (time,1)
3> (hello,3)
```
Java version

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * @author donald
 * @date 2021/04/15
 */
public class WordCountStream {
    public static void main(String[] args) throws Exception {
        // IP and port to listen on
        String ip = "127.0.0.1";
        int port = 7788;
        // Obtain the Flink streaming execution environment
        StreamExecutionEnvironment streamExecutionEnvironment =
                StreamExecutionEnvironment.getExecutionEnvironment();
        // Read input data from the socket
        DataStreamSource<String> textStream =
                streamExecutionEnvironment.socketTextStream(ip, port, "\n");
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordTuples =
                textStream.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
                        String[] splits = s.split("\\s");
                        for (String word : splits) {
                            collector.collect(Tuple2.of(word, 1));
                        }
                    }
                });
        SingleOutputStreamOperator<Tuple2<String, Integer>> word = wordTuples
                .keyBy(0)
                .timeWindow(Time.seconds(2), Time.seconds(1))
                .sum(1);
        // Print the results
        word.print();
        // Trigger execution
        streamExecutionEnvironment.execute("wordcount stream process");
    }
}
```
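The `timeWindow(Time.seconds(2), Time.seconds(1))` call above defines a sliding window 2 s wide that advances every 1 s, so each element is counted in two overlapping windows. A minimal sketch of the standard sliding-window start computation (plain Java, timestamps in milliseconds; the `SlidingWindows` helper name is made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindows {
    // For a sliding window of the given size (ms) advancing every `slide` ms,
    // return the start timestamps of all windows containing `timestamp`.
    static List<Long> windowStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        // The latest window containing this element starts at the most
        // recent slide boundary at or before the timestamp
        long lastStart = timestamp - (timestamp % slide);
        // Walk backwards one slide at a time while the window still covers the element
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // An element at t = 2500 ms falls into windows [1000, 3000) and [2000, 4000)
        System.out.println(windowStarts(2500, 2000, 1000)); // [2000, 1000]
    }
}
```

This also explains why `(hello,1)` can be printed again as `(hello,2)`: successive overlapping windows each emit their own count for the same key.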