【Flink】Word Count

一、Preparation

Requirement: count how many times each word appears in a file and write the result to a file. Steps:

  1. Read the data source
  2. Process the data
    • Split every line of the input on spaces
    • Turn each word into a (word, 1) pair and group by word (identical words go together)
    • Accumulate identical words (sum up the 1s attached to each word)
  3. Save the result
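The split/group/sum logic above, stripped of Flink entirely, can be sketched in plain Java (the `WordCountSketch` class and its `count` method are hypothetical names for illustration, not part of the Flink program that follows):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class WordCountSketch {

    // Steps 1-3 without Flink: split each line on spaces,
    // group identical words, and accumulate a 1 per occurrence.
    public static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            for (String word : line.split(" ")) {
                counts.merge(word, 1, Integer::sum); // sum up the 1s per word
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(count(List.of("hello donald", "hello")));
        // {hello=2, donald=1}
    }
}
```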

Add the dependencies:

	<dependencies>
		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-java -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-java</artifactId>
			<version>1.11.1</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-java -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-java_2.12</artifactId>
			<version>1.11.1</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-clients -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-clients_2.12</artifactId>
			<version>1.11.1</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-scala -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-scala_2.12</artifactId>
			<version>1.11.1</version>
		</dependency>
		<!-- https://mvnrepository.com/artifact/org.apache.flink/flink-streaming-scala -->
		<dependency>
			<groupId>org.apache.flink</groupId>
			<artifactId>flink-streaming-scala_2.12</artifactId>
			<version>1.11.1</version>
		</dependency>
	</dependencies>

二、Implementation

The workflow of developing a Flink program can be summarized as follows:

  1. Obtain an execution environment
  2. Load/create the initial data
  3. Specify the operators that transform the data
  4. Specify where to store the result
  5. Call execute() to trigger execution

Note: Flink programs are lazily evaluated; the program only actually runs when execute() is finally called.
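The same kind of laziness can be observed in a plain `java.util.stream` pipeline, which makes a reasonable mental model (this is only an analogy, not how Flink's runtime actually works):

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class LazyDemo {
    public static void main(String[] args) {
        AtomicInteger mapCalls = new AtomicInteger();

        // Building the pipeline runs nothing, just like chaining Flink operators.
        Stream<String> pipeline = List.of("hello", "donald").stream()
                .map(w -> { mapCalls.incrementAndGet(); return w.toUpperCase(); });
        System.out.println("before terminal op: " + mapCalls.get()); // 0

        // The terminal operation plays the role of execute(): it triggers the work.
        List<String> result = pipeline.collect(Collectors.toList());
        System.out.println("after terminal op: " + mapCalls.get());  // 2
        System.out.println(result);
    }
}
```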

(1) Batch processing

  1. Java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

/**
 * @author donald
 * @date 2021/04/15
 */
public class WordCount {

    public static void main(String[] args) throws Exception {
        // Input and output paths are passed as arguments:
        // the first argument is the input path, the second is the output path
        String inPath = args[0];
        String outPath = args[1];

        // Obtain the Flink batch execution environment
        ExecutionEnvironment executionEnvironment = ExecutionEnvironment.getExecutionEnvironment();

        // Read the file contents
        DataSet<String> text = executionEnvironment.readTextFile(inPath);

        // Process the data
        DataSet<Tuple2<String, Integer>> dataSet = text.flatMap(new LineSplitter())
                .groupBy(0) // group by the first field of (hello, 1), i.e. the word
                .sum(1);    // sum the second field of (hello, 1), i.e. the count

        dataSet.writeAsCsv(outPath, "\n", " ").setParallelism(1);

        // Trigger execution
        executionEnvironment.execute("wordcount batch process");
    }

    static class LineSplitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String line, Collector<Tuple2<String, Integer>> collector) {
            // Split: "hello donald" -> (hello, 1) (donald, 1)
            for (String word : line.split(" ")) {
                collector.collect(new Tuple2<>(word, 1));
            }
        }
    }
}
  2. Scala
import org.apache.flink.api.scala._

/**
  * @author donald
  * @date 2021/04/15
  */
object WordCountBatch {

  def main(args: Array[String]): Unit = {
    val inputPath = ""
    val outputPath = ""
    val environment: ExecutionEnvironment = ExecutionEnvironment.getExecutionEnvironment
    val text: DataSet[String] = environment.readTextFile(inputPath)
    val out: AggregateDataSet[(String, Int)] = text.flatMap(_.split(" ")).map((_,1)).groupBy(0).sum(1)
    out.writeAsCsv(outputPath, "\n", " ").setParallelism(1)
    environment.execute("word count batch")
  }
}

(2) Stream processing

A socket simulates words being sent in real time; Flink receives the data live, aggregates the data inside a specified time window (e.g., 5 s), recomputes every 1 s, and prints the result computed for each window.

Send messages to the port to simulate input:

nc -lp 7788

# In practice:
donald@donald-pro:~$ nc -lp 7788
hello
hello donald
  1. Scala
import org.apache.flink.streaming.api.scala._

/**
  * @author donald
  * @date 2021/04/15
  *
  * Streaming word count
  *
  * Feed input from the console: nc -lp 7788
  */
object WordCountStream {

  def main(args: Array[String]): Unit = {
    val environment: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
    val streamData: DataStream[String] = environment.socketTextStream("127.0.0.1", 7788)
    val out = streamData.flatMap(_.split(" ")).map((_, 1)).keyBy(0).sum(1)
    out.print()
    environment.execute()
  }
}

Input:

donald@donald-pro:~$ nc -lp 7788
hello
hello donald
I   
have 
not 
seen
you 
for 
a
long
time
hello

Output:

3> (hello,1)
8> (donald,1)
3> (hello,2)
3> (I,1)
7> (have,1)
4> (not,1)
3> (seen,1)
5> (you,1)
3> (for,1)
6> (a,1)
1> (long,1)
5> (time,1)
3> (hello,3)
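The running totals above come from keyed state: `keyBy(0).sum(1)` keeps one accumulator per word and emits an updated total for every incoming element, which is why `hello` appears as 1, then 2, then 3. That behavior can be simulated in plain Java (the `RunningSum` class and `process` method are hypothetical names, not Flink API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RunningSum {

    // Emits one (word, runningTotal) pair per input word,
    // mirroring what keyBy(0).sum(1) does on an unbounded stream.
    public static List<String> process(List<String> words) {
        Map<String, Integer> state = new HashMap<>(); // one accumulator per key
        List<String> emitted = new ArrayList<>();
        for (String word : words) {
            int total = state.merge(word, 1, Integer::sum);
            emitted.add("(" + word + "," + total + ")");
        }
        return emitted;
    }

    public static void main(String[] args) {
        process(List.of("hello", "hello", "donald", "hello"))
                .forEach(System.out::println);
        // (hello,1) (hello,2) (donald,1) (hello,3)
    }
}
```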
  2. Java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * @author donald
 * @date 2021/04/15
 */
public class WordCountStream {

    public static void main(String[] args) throws Exception {

        // IP and port to listen on
        String ip = "127.0.0.1";
        int port = 7788;

        // Obtain the Flink stream execution environment
        StreamExecutionEnvironment streamExecutionEnvironment =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Read input from the socket
        DataStreamSource<String> textStream =
                streamExecutionEnvironment.socketTextStream(ip, port, "\n");

        // Split each line into (word, 1) pairs
        SingleOutputStreamOperator<Tuple2<String, Integer>> wordPairs
                = textStream.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) {
                String[] splits = s.split("\\s");
                for (String word : splits) {
                    collector.collect(Tuple2.of(word, 1));
                }
            }
        });

        // Count per key over a 2 s window that slides every 1 s
        SingleOutputStreamOperator<Tuple2<String, Integer>> counts = wordPairs
                .keyBy(0)
                .timeWindow(Time.seconds(2), Time.seconds(1))
                .sum(1);

        // Print the results
        counts.print();

        // Trigger job execution
        streamExecutionEnvironment.execute("wordcount stream process");
    }
}
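With `timeWindow(Time.seconds(2), Time.seconds(1))`, each element falls into size/slide = 2 overlapping windows, so every word is counted in two consecutive results. The window starts can be sketched with the standard sliding-window assignment formula (a plain-Java sketch with hypothetical names, not the actual Flink `TimeWindow` API):

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingWindows {

    // Returns the [start, end) windows (in ms) containing the given timestamp,
    // for a sliding window of the given size and slide (offset 0).
    public static List<long[]> assign(long timestamp, long size, long slide) {
        List<long[]> windows = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide); // latest window start covering timestamp
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            windows.add(new long[]{start, start + size});
        }
        return windows;
    }

    public static void main(String[] args) {
        // An element at t = 1500 ms with a 2 s window sliding every 1 s
        // lands in [1000, 3000) and [0, 2000).
        for (long[] w : assign(1500, 2000, 1000)) {
            System.out.println("[" + w[0] + ", " + w[1] + ")");
        }
    }
}
```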