Preface
Lately I've had some free time, so I've been going back over technologies I've used before. My own Flink work was mainly real-time processing; for offline jobs I mostly used Spark. A brief Flink vs. Spark comparison is worth a read; in short, Flink is stronger for real-time (stream) processing and Spark for batch. I won't cover installation and deployment here, since I set that up long ago. This post focuses on everyday Flink usage: at work we simply write the code locally, package it, and submit the jar to YARN.
1.1 Flink Overview
1.1.1 What Flink Is
Flink official website: flink.apache.org/
Apache Flink® — Stateful Computations over Data Streams
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink runs in all common cluster environments and performs computations at in-memory speed and at any scale.
Any kind of data is produced as a stream of events. Credit card transactions, sensor measurements, machine logs, user interactions on a website or mobile application: all of this data is generated as a stream.
Data can be processed as unbounded or bounded streams.
-
Unbounded streams: they have a defined start but no defined end, and they produce data endlessly. Unbounded data must be processed continuously, i.e. as soon as it is ingested; we cannot wait for all of the input to arrive, because the input is infinite and will never be complete at any point in time. Processing unbounded data typically requires ingesting events in a specific order, such as the order in which they occurred, so that the completeness of the results can be reasoned about.
-
Bounded streams: they have a defined start and a defined end. A bounded stream can be computed after all of its data has been ingested; the data can be sorted, so ordered ingestion is not required. Processing bounded streams is also known as batch processing.
Apache Flink excels at processing both unbounded and bounded data sets. Precise control of time and state allows Flink's runtime to run any kind of application on unbounded streams. Bounded streams are processed internally by algorithms and data structures designed specifically for fixed-size data sets, which yields excellent performance.
1.1.2 Flink Architecture
1.2 Flink Getting-Started Examples
1.2.1 Maven Dependencies
These are the dependencies from an old demo project of mine. They include Redis, Kafka, Scala, and Kylin dependencies as well as the packaging plugins; pick whatever you need.
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>cn.tao</groupId>
<artifactId>flinkTestDemo</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<flink.version>1.8.2</flink.version>
<scala.version>2.11.8</scala.version>
</properties>
<dependencies>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.44</version>
</dependency>
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
<exclusion>
<artifactId>scala-parser-combinators_2.11</artifactId>
<groupId>org.scala-lang.modules</groupId>
</exclusion>
</exclusions>
<!--<scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-runtime-web_2.11</artifactId>
<version>${flink.version}</version>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<artifactId>scala-library</artifactId>
<groupId>org.scala-lang</groupId>
</exclusion>
</exclusions>
<!-- <scope>provided</scope>-->
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>${flink.version}</version>
<exclusions>
<exclusion>
<artifactId>snappy-java</artifactId>
<groupId>org.xerial.snappy</groupId>
</exclusion>
<exclusion>
<artifactId>slf4j-api</artifactId>
<groupId>org.slf4j</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-statebackend-rocksdb_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>joda-time</groupId>
<artifactId>joda-time</artifactId>
<version>2.7</version>
</dependency>
<dependency>
<groupId>org.apache.kylin</groupId>
<artifactId>kylin-jdbc</artifactId>
<version>2.5.1</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
<testExcludes>
<testExclude>/src/test/**</testExclude>
</testExcludes>
<encoding>utf-8</encoding>
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.2</version>
<executions>
<execution>
<id>compile-scala</id>
<phase>compile</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile-scala</id>
<phase>test-compile</phase>
<goals>
<goal>add-source</goal>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
<configuration>
<scalaVersion>${scala.version}</scalaVersion>
</configuration>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id> <!-- this is used for inheritance merges -->
<phase>package</phase> <!-- merge the jars during the package phase -->
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
1.2.2 Real-Time Processing (Java)
First we need the nc (netcat) command installed.
Look up how to install nc on macOS, Windows, or Linux as needed. I have installed it on all three systems before, but it was a long time ago, so treat my notes as a rough guide only.
We use a local socket as the data source for debugging. First start the socket locally:
nc -lk 8888
It blocks like this, waiting for input.
Java code:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;
/**
* Every 1 second, count the words seen in the last 2 seconds.
*/
public class JavaFlinkStreamingDemo {
public static void main(String[] args) throws Exception {
//get the stream execution environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//build a socket-based source; other kinds of sources are demonstrated later
DataStreamSource<String> stream = env.socketTextStream("localhost", 8888);
/**
* The first type parameter is the input type,
* the second is the return type.
*/
SingleOutputStreamOperator<WordCount> flatStream = stream.flatMap(new FlatMapFunction<String, WordCount>() {
@Override
public void flatMap(String value, Collector<WordCount> out) throws Exception {
String[] split = value.split(",");
for (String word : split) {
out.collect(new WordCount(word, 1));
}
}
});
//key by a field of the POJO, window the stream, then aggregate
SingleOutputStreamOperator<WordCount> sum = flatStream.keyBy("word")
.timeWindow(Time.seconds(2), Time.seconds(1))
.sum("count");
sum.print();
env.execute("javaFlinkStreamingDemo");
}
public static class WordCount {
public String word;
public long count;
public WordCount() {
}
public WordCount(String word, long count) {
this.word = word;
this.count = count;
}
@Override
public String toString() {
return "单词:" + word + "数量" + count;
}
}
}
Start the program.
Then type some data into the nc console.
On the IDEA side the results are printed to the console.
1.2.3 Real-Time Processing (Scala)
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.windowing.time.Time
object ScalaFlinkStreamingDemo {
def main(args: Array[String]): Unit = {
//import the implicit conversions
import org.apache.flink.api.scala._
//step 1: get the execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//step 2: get the data source
val textStream = env.socketTextStream("localhost",8888)
//step 3: process the data
val wordCountStream = textStream.flatMap(line => line.split(","))
.map((_, 1))
.keyBy(0)
.timeWindow(Time.seconds(2), Time.seconds(1))
.sum(1)
wordCountStream.print()
env.execute("ScalaFlinkStreamingDemo")
}
}
The counts come out as expected; a simple real-time computation is done.
1.2.4 Batch Processing: Reading a File
Create a file with contents like the following (comma-separated words).
The code:
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.AggregateOperator;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.FlatMapOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
public class ReadTextFileDemo {
public static void main(String[] args) throws Exception {
//get the batch execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//build a source with readTextFile
DataSource<String> source = env.readTextFile("/Users/xxx/workspace/flinkTestDemo/src/main/resources/wordFile.txt");
/**
* The first type parameter is the input type,
* the second is the return type.
*/
FlatMapOperator<String, Tuple2<String,Integer>> flatSource = source.flatMap(new FlatMapFunction<String, Tuple2<String,Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String,Integer>> out) throws Exception {
String[] split = value.split(",");
for (String word : split) {
out.collect(new Tuple2<String,Integer>(word, 1));
}
}
});
/**
* Group by the word (field 0) and sum the counts (field 1).
*/
AggregateOperator<Tuple2<String,Integer>> sum = flatSource.groupBy(0).sum(1);
//output path; parallelism is set to 1 here
sum.writeAsText("/Users/xxx/workspace/flinkTestDemo/src/main/resources/result.txt").setParallelism(1);
env.execute("ReadTextFileDemo");
}
}
For the output path I set the parallelism to 1. If you don't, Flink defaults to the number of CPU cores, which on my machine is effectively setParallelism(16) (this differs per machine), and the output becomes a directory containing 16 files, most of them empty, since there are only 5 lines of input here. Parallelism is covered in more detail in a later post. As shown below.
The result file after running with setParallelism(1):
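As an aside (my own note, not part of the original post), the parallelism can also be set once on the execution environment instead of per operator; a minimal sketch:
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//default parallelism for every operator of this job, unless overridden on an individual operator
env.setParallelism(1);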
2.1 The DataStream API
2.1.1 Built-in Connectors
2.1.2 Sources
-
File-based
-
readTextFile(path)
Reads a text file line by line and returns each line as an element.
DataSource<String> source = env.readTextFile("/Users/xxx/workspace/flinkTestDemo/src/main/resources/wordFile.txt");
- Socket-based
socketTextStream reads data from a socket; elements can be split by a delimiter.
DataStreamSource<String> stream = env.socketTextStream("localhost", 8888);
- Collection-based
Creates a data stream from a Java Collection; all elements in the collection must be of the same type.
// create the collection
ArrayList<String> data = new ArrayList<String>();
data.add("flink");
data.add("spark");
data.add("hive");
DataStreamSource<String> dataStream = env.fromCollection(data);
//apply whatever operators you need on dataStream, then print and execute
dataStream.print();
env.execute("fromCollectionDemo");
- Custom sources
addSource lets you read from third-party data sources.
Single parallelism
A source that implements SourceFunction is single-parallelism: you cannot call setParallelism() with a value greater than 1 on it, otherwise an exception is thrown. socketTextStream, for example, is a single-parallelism source.
Let's set the parallelism to 2 and run it to see the error (a driver sketch follows the source class below).
import org.apache.flink.streaming.api.functions.source.SourceFunction;
import java.util.Random;
public class CustomizeSourceSingleClass implements SourceFunction<Integer> {
boolean flag = true;
@Override
public void run(SourceContext<Integer> out) throws Exception {
while (flag){
out.collect(new Random().nextInt(1000));
Thread.sleep(1000);
}
}
@Override
public void cancel() {
flag = false;
}
}
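Below is a minimal driver sketch of my own (not from the original demo) that wires up CustomizeSourceSingleClass and asks for a parallelism of 2; because a plain SourceFunction is non-parallel, Flink rejects this when building the job, throwing an IllegalArgumentException along the lines of "The parallelism of non parallel operator must be 1".
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class SingleSourceParallelismTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//a SourceFunction-based source only supports parallelism 1; asking for 2 triggers the exception
env.addSource(new CustomizeSourceSingleClass()).setParallelism(2).print();
env.execute("SingleSourceParallelismTest");
}
}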
Multiple parallelism
Implement ParallelSourceFunction instead:
import org.apache.flink.streaming.api.functions.source.ParallelSourceFunction;
import java.util.Random;
public class CustomizeSourceParalleClass implements ParallelSourceFunction<Integer> {
boolean flag = true;
@Override
public void run(SourceContext<Integer> out) throws Exception {
while (flag){
out.collect(new Random().nextInt(1000));
Thread.sleep(1000);
}
}
@Override
public void cancel() {
flag = false;
}
}
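And a matching driver sketch (again my own assumption, not taken from the original post) showing that a ParallelSourceFunction accepts a parallelism greater than 1, with each source subtask running its own copy of run():
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class ParallelSourceTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//with parallelism 2 you should see roughly two random numbers printed per second
env.addSource(new CustomizeSourceParalleClass()).setParallelism(2).print();
env.execute("ParallelSourceTest");
}
}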
2.1.3 Common Transformation Operators
2.1.3.1 map and filter
import demo.CustomizeSourceSingleClass;
import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
public class MapAndFilter {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> numberStream = env.addSource(new CustomizeSourceSingleClass());
SingleOutputStreamOperator<Integer> dataStream = numberStream.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) throws Exception {
System.out.println("map接受到了数据:"+value);
return value;
}
});
//keep only numbers greater than 100
SingleOutputStreamOperator<Integer> streamOperator = dataStream.filter(new FilterFunction<Integer>() {
@Override
public boolean filter(Integer number) throws Exception {
return number > 100 ;
}
});
streamOperator.print().setParallelism(1);
env.execute("MapAndFilter");
}
}
2.1.3.2 flatMap, keyBy, sum
See the code in section 1.2.2 (Real-Time Processing, Java). A compact tuple-based variant follows below.
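For completeness, here is that variant: a sketch of my own (not from the original post) that uses Tuple2 instead of a POJO, so keyBy and sum take field indexes.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;
public class FlatMapKeyBySumDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.socketTextStream("localhost", 8888)
//split each line on commas and emit (word, 1) tuples
.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
@Override
public void flatMap(String value, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word : value.split(",")) {
out.collect(new Tuple2<>(word, 1));
}
}
})
.keyBy(0) //key by the word
.sum(1) //sum the counts
.print();
env.execute("FlatMapKeyBySumDemo");
}
}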
2.1.3.3 union
import demo.CustomizeSourceSingleClass;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
/**
* Streams can only be unioned when their element types are the same.
**/
public class UnionDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> numberStream1 = env.addSource(new CustomizeSourceSingleClass());
DataStreamSource<Integer> numberStream2 = env.addSource(new CustomizeSourceSingleClass());
//add 1000 to every element of the second stream
SingleOutputStreamOperator<Integer> mapAdd1000 = numberStream2.map(a -> a + 1000);
DataStream<Integer> union = numberStream1.union(mapAdd1000);
//values above 1000 came from the second stream
SingleOutputStreamOperator<Integer> dataStream = union.map(new MapFunction<Integer, Integer>() {
@Override
public Integer map(Integer value) throws Exception {
System.out.println("map接受到了数据:"+value);
return value;
}
});
dataStream.print().setParallelism(1);
env.execute("UnionDemo");
}
}
2.1.3.4 connect
import demo.CustomizeSourceSingleClass;
import org.apache.flink.streaming.api.datastream.ConnectedStreams;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.CoMapFunction;
public class ConnectiDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> numberStream1 = env.addSource(new CustomizeSourceSingleClass());
DataStreamSource<Integer> numberStream2 = env.addSource(new CustomizeSourceSingleClass());
//append "AA" to every element of the second stream, turning it into a stream of Strings
SingleOutputStreamOperator<String> mapAddAA = numberStream2.map(a -> a + "AA");
ConnectedStreams<Integer, String> connect = numberStream1.connect(mapAddAA);
//values ending in AA came from the second stream
//CoMapFunction takes three type parameters: the first stream's element type, the second stream's element type, and the output type
SingleOutputStreamOperator<Object> dataStream = connect.map(new CoMapFunction<Integer, String, Object>() {
//map1/map2 receive the elements of the corresponding input stream
@Override
public Object map1(Integer value) throws Exception {
return value;
}
@Override
public Object map2(String value) throws Exception {
return value;
}
});
dataStream.print().setParallelism(1);
env.execute("ConnectiDemo");
}
}
2.1.3.5 Split和Select
import demo.CustomizeSourceSingleClass;
import org.apache.flink.streaming.api.collector.selector.OutputSelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SplitStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import java.util.ArrayList;
public class SplitAndSelectDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<Integer> stream = env.addSource(new CustomizeSourceSingleClass());
SplitStream<Integer> splitStream = stream.split(new OutputSelector<Integer>() {
@Override
public Iterable<String> select(Integer value) {
ArrayList<String> outPut = new ArrayList<>();
if (value % 2 != 0) {
outPut.add("odd");//奇数
} else {
outPut.add("even");//偶数
}
return outPut;
}
});
//select one or more of the split streams
DataStream<Integer> evenStream = splitStream.select("even");
DataStream<Integer> oddStream = splitStream.select("odd");
DataStream<Integer> moreStream = splitStream.select("odd","even");
//print the result
evenStream.print().setParallelism(1);
env.execute("SplitAndSelectDemo");
}
}
2.1.4 Sink Operations
2.1.4.1 print
evenStream.print().setParallelism(1);
2.1.4.2 writeAsText
stream.writeAsText("/xxx/xxx/xxx.txt").setParallelism(1);
2.1.4.3 Built-in Sinks
- Apache Kafka (source/sink)
- Apache Cassandra (sink)
- Amazon Kinesis Streams (source/sink)
- Elasticsearch (sink)
- Hadoop FileSystem (sink)
- RabbitMQ (source/sink)
- Apache NiFi (source/sink)
- Twitter Streaming API (source)
- Google PubSub (source/sink)
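As a quick sketch of one of these built-in connectors (my own example, not from the original post; the topic name and broker address are placeholders), this is roughly how the Kafka producer sink from the flink-connector-kafka dependency declared above can be attached to a stream in Flink 1.8:
import java.util.Properties;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
public class SinkKafkaDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> text = env.socketTextStream("localhost", 8888);
Properties props = new Properties();
props.setProperty("bootstrap.servers", "node2:9092"); //placeholder broker address
//write every line received from the socket into the demo-topic Kafka topic
text.addSink(new FlinkKafkaProducer<String>("demo-topic", new SimpleStringSchema(), props));
env.execute("SinkKafkaDemo");
}
}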
2.1.4.4 Custom Sinks
At work we often need to write data into Redis, so here is a Redis sink example.
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.redis.RedisSink;
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommand;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisCommandDescription;
import org.apache.flink.streaming.connectors.redis.common.mapper.RedisMapper;
public class SinkRedisDemo {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> text = env.socketTextStream("localhost", 8888);
DataStream<Tuple2<String, String>> l_wordsData = text.map(new MapFunction<String, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> map(String value) throws Exception {
System.out.println(value);
return new Tuple2<>("listData", value);
}
});
//Redis connection config
FlinkJedisPoolConfig conf = new FlinkJedisPoolConfig.Builder().setHost("node2").setPort(6379).build();
//create the sink
RedisSink<Tuple2<String, String>> redisSink = new RedisSink<>(conf, new MyRedisMapper());
l_wordsData.addSink(redisSink);
env.execute("SinkRedisDemo");
}
public static class MyRedisMapper implements RedisMapper<Tuple2<String, String>> {
//extract the Redis key from the record
@Override
public String getKeyFromData(Tuple2<String, String> data) {
return data.f0;
}
//extract the Redis value from the record
@Override
public String getValueFromData(Tuple2<String, String> data) {
return data.f1;
}
//choose the Redis command used to write the data
@Override
public RedisCommandDescription getCommandDescription() {
return new RedisCommandDescription(RedisCommand.LPUSH);
}
}
}
Then check Redis on the server; the data has indeed been written:
redis-cli
127.0.0.1:6379> LRANGE listData 0 10
1) "asdasd"
2) "asdsad"
3) "sadasd"
4) "asdsa"
5) "23423423 32423432 32423 4"
127.0.0.1:6379>
2.2 The DataSet API
2.2.1 Sources
File-based:
env.readTextFile(path)
Collection-based:
env.fromCollection(Collection)
2.2.2 Transformations
2.2.2.1 map and mapPartition
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.MapOperator;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.Iterator;
public class MapAndMapPartition {
public static void main(String[] args) throws Exception {
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<String> data = new ArrayList<>();
data.add("flink ");
data.add("spark");
data.add("hadoop");
DataSource<String> text = env.fromCollection(data);
//suppose we want to insert the data into a database
MapOperator<String, String> map = text.map(new MapFunction<String, String>() {
@Override
public String map(String value) throws Exception {
//with map, you would open a connection, process the record, and close the connection for every single element
return value;
}
});
DataSet<String> mapPartition = map.mapPartition(new MapPartitionFunction<String, String>() {
@Override
public void mapPartition(Iterable<String> values, Collector<String> out) throws Exception {
//with mapPartition, open one connection per partition
//then iterate over all records of the partition
Iterator<String> it = values.iterator();
while (it.hasNext()) {
out.collect(it.next());
}
//close the connection once the whole partition has been processed
}
});
mapPartition.print();
}
}
2.2.2.2 distinct
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.FlatMapOperator;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
public class DistinctDemo {
public static void main(String[] args) throws Exception {
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<String> data = new ArrayList<>();
data.add("spark flink");
data.add("spark");
DataSource<String> text = env.fromCollection(data);
FlatMapOperator<String, String> flatMap = text.flatMap(new FlatMapFunction<String, String>() {
@Override
public void flatMap(String value, Collector<String> out) throws Exception {
String[] split = value.toLowerCase().split("\\W+");
for (String word : split) {
System.out.println("单词 " + word);
out.collect(word);
}
}
});
flatMap.distinct().print();
}
}
Result:
2.2.2.3 join
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import java.util.ArrayList;
public class JoinDemo {
public static void main(String[] args) throws Exception {
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<Tuple2<Integer, String>> tuple1 = new ArrayList<>();
tuple1.add(new Tuple2<>(1, "xiaoming"));
tuple1.add(new Tuple2<>(2, "xiaoli"));
tuple1.add(new Tuple2<>(3, "wangcai"));
//Tuple2<user id, user's city>
ArrayList<Tuple2<Integer, String>> tuple2 = new ArrayList<>();
tuple2.add(new Tuple2<>(1, "guangzhou"));
tuple2.add(new Tuple2<>(2, "shenzhen"));
tuple2.add(new Tuple2<>(3, "foshan"));
DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(tuple1);
DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(tuple2);
data1.join(data2).where(0) //index of the join key in the first data set
.equalTo(0) //index of the join key in the second data set
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second)
throws Exception {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
}).print();
}
}
Result:
2.2.2.4 Outer Joins
import org.apache.flink.api.common.functions.JoinFunction;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import java.util.ArrayList;
public class OutJoin {
public static void main(String[] args) throws Exception {
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<Tuple2<Integer, String>> tuple1 = new ArrayList<>();
tuple1.add(new Tuple2<>(1, "xiaoming"));
tuple1.add(new Tuple2<>(2, "xiaoli"));
tuple1.add(new Tuple2<>(3, "wangcai"));
tuple1.add(new Tuple2<>(4, "zhangfei"));
//Tuple2<user id, user's city>
ArrayList<Tuple2<Integer, String>> tuple2 = new ArrayList<>();
tuple2.add(new Tuple2<>(1, "guangzhou"));
tuple2.add(new Tuple2<>(2, "shenzhen"));
tuple2.add(new Tuple2<>(3, "foshan"));
tuple2.add(new Tuple2<>(6, "zhaoyun"));
DataSource<Tuple2<Integer, String>> data1 = env.fromCollection(tuple1);
DataSource<Tuple2<Integer, String>> data2 = env.fromCollection(tuple2);
//left outer join: second may be null, so a null check is needed
data1.leftOuterJoin(data2).where(0) //index of the join key in the first data set
.equalTo(0) //index of the join key in the second data set
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second)
throws Exception {
if (second == null) {
return new Tuple3<>(first.f0, first.f1, null);
} else {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
}
}).print();
System.out.println("===========" + "左连接结束" + "===========");
//右连接 意味着first可能为空 需要增加判断
data1.rightOuterJoin(data2).where(0) //指定第一个数据集中需要进行比较的元素角标
.equalTo(0) //指定第二个数据集中需要进行比较的元素角标
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second)
throws Exception {
if (first == null) {
return new Tuple3<>(second.f0, null, second.f1);
} else {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
}
}).print();
System.out.println("===========" + "右连接结束" + "===========");
//全连接 意味着2个 可能为空 需要增加判断
data1.fullOuterJoin(data2).where(0) //指定第一个数据集中需要进行比较的元素角标
.equalTo(0) //指定第二个数据集中需要进行比较的元素角标
.with(new JoinFunction<Tuple2<Integer, String>, Tuple2<Integer, String>, Tuple3<Integer, String, String>>() {
@Override
public Tuple3<Integer, String, String> join(Tuple2<Integer, String> first, Tuple2<Integer, String> second)
throws Exception {
if (first == null) {
return new Tuple3<>(second.f0, null, second.f1);
}
if (second == null) {
return new Tuple3<>(first.f0, first.f1, null);
} else {
return new Tuple3<>(first.f0, first.f1, second.f1);
}
}
}).print();
System.out.println("===========" + "全连接结束" + "===========");
}
}
Result:
2.2.2.5 cross (Cartesian Product)
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.CrossOperator;
import org.apache.flink.api.java.operators.DataSource;
import java.util.ArrayList;
public class CrossDemo {
public static void main(String[] args) throws Exception{
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<String> data1 = new ArrayList<>();
data1.add("A");
data1.add("B");
ArrayList<Integer> data2 = new ArrayList<>();
data2.add(1);
data2.add(2);
DataSource<String> text1 = env.fromCollection(data1);
DataSource<Integer> text2 = env.fromCollection(data2);
CrossOperator.DefaultCross<String, Integer> cross = text1.cross(text2);
cross.print();
}
}
Result:
2.2.2.6 sortPartition and first
import org.apache.flink.api.common.operators.Order;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.operators.GroupReduceOperator;
import org.apache.flink.api.java.operators.SortPartitionOperator;
import org.apache.flink.api.java.tuple.Tuple2;
import java.util.ArrayList;
public class FirstAndSortDemo {
public static void main(String[] args) throws Exception {
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<Tuple2<Integer, String>> data = new ArrayList<>();
data.add(new Tuple2<>(2,"aa"));
data.add(new Tuple2<>(4,"bb"));
data.add(new Tuple2<>(3,"cc"));
data.add(new Tuple2<>(1,"ab"));
data.add(new Tuple2<>(1,"bc"));
data.add(new Tuple2<>(1,"bd"));
DataSource<Tuple2<Integer, String>> dataSource = env.fromCollection(data);
GroupReduceOperator<Tuple2<Integer, String>, Tuple2<Integer, String>> first2 = dataSource.first(2);
System.out.println("=====打印前2个 基于读取顺序=====");
first2.print();
System.out.println("=====按第一个属性升序排序 基于读取顺序=====");
SortPartitionOperator<Tuple2<Integer, String>> sort1 = dataSource.sortPartition(0, Order.ASCENDING);
sort1.print();
System.out.println("=====按第一列升序排序 再按第二列倒序排序 =====");
SortPartitionOperator<Tuple2<Integer, String>> sort2 = dataSource.sortPartition(0, Order.ASCENDING)
.sortPartition(1,Order.DESCENDING);
sort2.print();
System.out.println("=====分组后 按第二列升序排列 取第一个=====");
SortPartitionOperator<Tuple2<Integer, String>> sort3 = dataSource.groupBy(0).sortGroup(1, Order.ASCENDING).first(1)
.sortPartition(1,Order.DESCENDING);
sort3.print();
}
}
Result:
2.2.2.7 Partitioning and Custom Partitioners
import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;
import java.util.ArrayList;
import java.util.Iterator;
public class PartitionDemo {
public static void main(String[] args) throws Exception {
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
ArrayList<Tuple2<Integer, String>> data = new ArrayList<>();
data.add(new Tuple2<>(1, "demo1"));
data.add(new Tuple2<>(1, "demo2"));
data.add(new Tuple2<>(1, "demo3"));
data.add(new Tuple2<>(2, "demo4"));
data.add(new Tuple2<>(2, "demo5"));
data.add(new Tuple2<>(2, "demo6"));
data.add(new Tuple2<>(3, "demo7"));
data.add(new Tuple2<>(3, "demo8"));
data.add(new Tuple2<>(4, "demo9"));
data.add(new Tuple2<>(4, "demo10"));
data.add(new Tuple2<>(4, "demo11"));
data.add(new Tuple2<>(4, "demo12"));
data.add(new Tuple2<>(5, "demo13"));
data.add(new Tuple2<>(5, "demo14"));
data.add(new Tuple2<>(5, "demo15"));
data.add(new Tuple2<>(5, "demo16"));
data.add(new Tuple2<>(5, "demo17"));
data.add(new Tuple2<>(6, "demo18"));
data.add(new Tuple2<>(6, "demo19"));
data.add(new Tuple2<>(6, "demo20"));
data.add(new Tuple2<>(6, "demo21"));
DataSource<Tuple2<Integer, String>> testSource = env.fromCollection(data);
/* // partition by hash on the first field: records with the same key go to the same partition
testSource.partitionByHash(0).mapPartition(new MapPartitionFunction<Tuple2<Integer, String>, Tuple2<Integer, String>>() {
@Override
public void mapPartition(Iterable<Tuple2<Integer, String>> values, Collector<Tuple2<Integer, String>> out) throws Exception {
Iterator<Tuple2<Integer, String>> it = values.iterator();
while (it.hasNext()) {
Tuple2<Integer, String> next = it.next();
System.out.println("线程id:" + Thread.currentThread().getId() + "," + next);
}
}
}).print();
//partition by range on the first field: records with the same number go to the same partition
testSource.partitionByRange(0).mapPartition(new MapPartitionFunction<Tuple2<Integer,String>, Tuple2<Integer,String>>() {
@Override
public void mapPartition(Iterable<Tuple2<Integer, String>> values, Collector<Tuple2<Integer, String>> out) throws Exception {
Iterator<Tuple2<Integer, String>> it = values.iterator();
while (it.hasNext()){
Tuple2<Integer, String> next = it.next();
System.out.println("线程id:"+Thread.currentThread().getId()+","+next);
}
}
}).print();*/
//use 4 partitions
testSource.partitionCustom(new MyPartition(4), 0).mapPartition(new MapPartitionFunction<Tuple2<Integer, String>, Tuple2<Integer, String>>() {
@Override
public void mapPartition(Iterable<Tuple2<Integer, String>> values, Collector<Tuple2<Integer, String>> out) throws Exception {
Iterator<Tuple2<Integer, String>> iterator = values.iterator();
while (iterator.hasNext()) {
Tuple2<Integer, String> next = iterator.next();
System.out.println("线程id:" + Thread.currentThread().getId() + "," + next);
}
}
}).print();
}
//custom partitioner
public static class MyPartition implements Partitioner<Integer> {
private int numPartition;
public MyPartition() {
}
public MyPartition(int numPartitions) {
this.numPartition = numPartitions;
}
public int partition(Integer key, int numPartitions) {
System.out.println("分区总数" + numPartitions);
return key % numPartition;
}
}
}
The output is too long to paste here.
2.2.3 Sinks
Here I simply reuse the examples from the official documentation:
// the data
DataSet<String> textData = // [...]
// write the DataSet line by line to a local path
textData.writeAsText("file:///my/result/on/localFS");
// write the DataSet line by line to a path on HDFS
textData.writeAsText("hdfs://$Host:$Port/my/result/on/localFS");
// overwrite the file if it already exists
textData.writeAsText("file:///my/result/on/localFS", WriteMode.OVERWRITE);
// write tuples as CSV with the given line delimiter and field delimiter
DataSet<Tuple3<String, Integer, Double>> values = // [...]
values.writeAsCsv("file:///path/to/the/result/file", "\n", "|");
// this writes tuples in text format "(a,b,c)" rather than as CSV lines
values.writeAsText("file:///path/to/the/result/file");
// write the output to disk using a user-defined formatter
values.writeAsFormattedText("file:///path/to/the/result/file",
new TextFormatter<Tuple2<Integer, Integer>>() {
public String format (Tuple2<Integer, Integer> value) {
return value.f1 + " - " + value.f0;
}
});
2.2.4 Broadcast Variables
A broadcast variable is essentially a shared, read-only data set: a DataSet is broadcast so that every task can read it on its local node, and the data is kept only once per node. Without broadcasting, every task on every node would hold its own copy, which wastes memory. Spark has broadcast variables as well.
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
public class BroadCastDemo {
public static void main(String[] args) throws Exception{
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
//build the data to broadcast
ArrayList<Tuple2<String, Integer>> broadData = new ArrayList<>();
broadData.add(new Tuple2<>("zhaoyun",33));
broadData.add(new Tuple2<>("zhangfei",30));
broadData.add(new Tuple2<>("liubei",40));
DataSet<Tuple2<String, Integer>> tupleData = env.fromCollection(broadData);
//simulate the real data
DataSource<String> data = env.fromElements("zhaoyun", "zhangfei", "liubei");
//use a RichMapFunction to access the broadcast variable; Rich functions have an extra open() method for initialization
DataSet<String> result = data.map(new RichMapFunction<String, String>() {
HashMap<String, Integer> map = new HashMap<String, Integer>();
/**
* open() runs only once per task,
* so initialization work can be done here.
*/
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
List<Tuple2<String, Integer>> broadCastData = getRuntimeContext().getBroadcastVariable("broadCastData");
for (Tuple2<String, Integer> tuple2 : broadCastData) {
map.put(tuple2.f0,tuple2.f1);
}
}
@Override
public String map(String value) throws Exception {
Integer age = map.get(value);
return value + "," + age;
}
}).withBroadcastSet(tupleData, "broadCastData"); //broadcast the data set
result.print();
}
}
Result:
zhaoyun,33
zhangfei,30
liubei,40
2.2.5 Counters (Accumulators)
In a distributed job each task keeps its own count and the counts are merged at the end, much like MapReduce counters.
The merged result is only available after the job has finished.
import org.apache.flink.api.common.JobExecutionResult;
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.operators.DataSource;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.core.fs.FileSystem;
public class CounterDemo {
public static void main(String[] args) throws Exception {
//get the execution environment
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSource<String> data = env.fromElements("1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11");
//register the accumulator in the open() method of a RichMapFunction
DataSet<String> result = data.map(new RichMapFunction<String, String>() {
//create the accumulator
private IntCounter numLines = new IntCounter();
@Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
//register the accumulator
getRuntimeContext().addAccumulator("num-sum", this.numLines);
}
@Override
public String map(String value) throws Exception {
//the job usually runs with multiple parallel tasks, so a plain local counter would be wrong; add to the accumulator instead
this.numLines.add(1);
return value;
}
}).setParallelism(16);
result.writeAsText("/Users/xxxx/workspace/flinkTestDemo/src/main/resources/result.txt", FileSystem.WriteMode.OVERWRITE).setParallelism(1);
//get the job execution result
JobExecutionResult counter = env.execute("counter");
//read the accumulator value from the result
int num = counter.getAccumulatorResult("num-sum");
System.out.println("总数:" + num);
}
}
Result:
total: 11
Wrap-up
That covers basic Flink usage and its common operators. Anything not demonstrated here can be explored through the official documentation or other resources. The next post, on Flink state, checkpoints, and savepoints, focuses on Flink state; state is extremely important and one of Flink's key features.