I. Overview
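As of version 1.10, Flink still uses two separate APIs, DataSet and DataStream, to serve different application scenarios.
Apache Flink's design philosophy from the start was to support multiple kinds of computation, including batch processing, stream processing, and machine learning, with a single engine. In stream computing in particular, Flink unifies stream and batch processing at the level of the compute engine.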
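The core classes of DataSet are in the flink-java module, while the core implementation classes of DataStream are in the flink-streaming-java module.
Both offer rich and very similar APIs, including common transformation functions such as map, filter, and join.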
In Flink's programming model, a DataSet Source usually reads from files, tables, or Java collections, whereas a DataStream Source is usually a message middleware such as Kafka.
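Taking DataStream as the example:
The basic building blocks of a Flink program are streams and transformations. Every dataflow starts at one or more Sources and ends at one or more Sinks, and the dataflow resembles a directed acyclic graph (DAG).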
III. Operators
(1) Custom real-time data source
Flink lets you plug in a custom Source, which we use here to implement a custom real-time data source:
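import java.util.Random;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

public class MyStreamingSource implements SourceFunction<MyStreamingSource.Item> {
    // volatile because cancel() is invoked from a different thread than run()
    private volatile boolean isRunning = true;

    /**
     * Overrides run() to emit an endless stream of items.
     * @param ctx the context used to emit items
     * @throws Exception if the thread is interrupted while sleeping
     */
    @Override
    public void run(SourceContext<Item> ctx) throws Exception {
        while (isRunning) {
            Item item = generateItem();
            ctx.collect(item);
            // Emit one item per second
            Thread.sleep(1000);
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    // Generates one random product item
    private Item generateItem() {
        int i = new Random().nextInt(100);
        Item item = new Item();
        item.setName("name" + i);
        item.setId(i);
        return item;
    }

    // public static with public accessors so that the demo classes can call getName() and getId()
    public static class Item {
        private String name;
        private Integer id;

        public Item() {
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public Integer getId() {
            return id;
        }

        public void setId(Integer id) {
            this.id = id;
        }

        @Override
        public String toString() {
            return "Item{" +
                    "name='" + name + '\'' +
                    ", id=" + id +
                    '}';
        }
    }
}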
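import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

class StreamingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Get the data source.
        // Note: parallelism is set to 1; parallelism is covered in detail in a later lesson.
        DataStreamSource<MyStreamingSource.Item> text =
                env.addSource(new MyStreamingSource()).setParallelism(1);
        DataStream<MyStreamingSource.Item> item = text.map(
                (MapFunction<MyStreamingSource.Item, MyStreamingSource.Item>) value -> value);
        // Print the result
        item.print().setParallelism(1);
        String jobName = "user defined streaming source";
        env.execute(jobName);
    }
}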
(2) Map
Map takes one element as input and outputs one element after applying the developer's custom logic.
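import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

class StreamingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Get the data source
        DataStreamSource<MyStreamingSource.Item> items = env.addSource(new MyStreamingSource()).setParallelism(1);
        // Map: turn every Item into its name
        SingleOutputStreamOperator<String> mapItems = items.map(new MapFunction<MyStreamingSource.Item, String>() {
            @Override
            public String map(MyStreamingSource.Item item) throws Exception {
                return item.getName();
            }
        });
        // Print the result
        mapItems.print().setParallelism(1);
        String jobName = "user defined streaming source";
        env.execute(jobName);
    }
}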
You can also define your own Map function class:
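import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

class StreamingDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Get the data source
        DataStreamSource<MyStreamingSource.Item> items = env.addSource(new MyStreamingSource()).setParallelism(1);
        SingleOutputStreamOperator<String> mapItems = items.map(new MyMapFunction());
        // Print the result
        mapItems.print().setParallelism(1);
        String jobName = "user defined streaming source";
        env.execute(jobName);
    }

    // A reusable Map implementation that extracts the item name
    static class MyMapFunction extends RichMapFunction<MyStreamingSource.Item, String> {
        @Override
        public String map(MyStreamingSource.Item item) throws Exception {
            return item.getName();
        }
    }
}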
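(3) FlatMap
FlatMap takes one element and returns zero, one, or more elements.
FlatMap is similar to Map, but when the result is a list, FlatMap "flattens" it and emits its entries as individual elements.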
SingleOutputStreamOperator<String> flatMapItems = items.flatMap(new FlatMapFunction<MyStreamingSource.Item, String>() {
    @Override
    public void flatMap(MyStreamingSource.Item item, Collector<String> collector) throws Exception {
        // Each call to collect() emits one output element; here exactly one per input
        String name = item.getName();
        collector.collect(name);
    }
});
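The snippet above emits exactly one element per input, so it does not show the flattening itself. As a rough illustration of the zero-to-many behavior, here is a plain-Java sketch that uses `java.util.stream` instead of Flink. The class name `FlatMapSketch` and the method `splitWords` are hypothetical names chosen for this example, and the sketch only mimics the semantics; it is not Flink's implementation.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class FlatMapSketch {
    // Splits every line into words and emits the words one by one,
    // so an empty line contributes zero elements and a long line several.
    static List<String> splitWords(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+")))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello flink", "", "map flatMap filter");
        // Three input lines become five output words
        System.out.println(splitWords(lines));
    }
}
```

In a Flink FlatMapFunction the same effect comes from calling `collector.collect(...)` zero or more times per input element.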
(4) Filter
Filter drops the data you do not need. Every element is passed to the filter function: if the function returns true the element is kept, otherwise it is discarded.
SingleOutputStreamOperator<MyStreamingSource.Item> filterItems = items.filter(new FilterFunction<MyStreamingSource.Item>() {
    @Override
    public boolean filter(MyStreamingSource.Item item) throws Exception {
        return item.getId() % 2 == 0;
    }
});
(5) KeyBy
You often need to group data by some attribute, or simply by one field, and then process each group separately.
// Split the incoming data into words, group by word, apply a window, and aggregate the output
DataStream<WordWithCount> windowCounts = text
        .flatMap(new FlatMapFunction<String, WordWithCount>() {
            @Override
            public void flatMap(String value, Collector<WordWithCount> out) {
                for (String word : value.split("\\s")) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        })
        .keyBy("word")
        .timeWindow(Time.seconds(5), Time.seconds
....
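To make the grouping idea concrete without the windowing part, here is a small plain-Java sketch. It is not Flink code: `KeyBySketch` and `countPerKey` are hypothetical names, and the running count merely stands in for whatever per-key processing follows a keyBy. The point it tries to show is that all records sharing a key end up in the same group and are processed together.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class KeyBySketch {
    // Routes every word to the group identified by its key (the word itself)
    // and keeps one running count per group.
    static Map<String, Long> countPerKey(List<String> words) {
        Map<String, Long> counts = new LinkedHashMap<>();
        for (String word : words) {
            counts.merge(word, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=3, b=2, c=1}
        System.out.println(countPerKey(Arrays.asList("a", "b", "a", "c", "b", "a")));
    }
}
```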
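(6) Aggregations
Aggregations is the collective name for aggregate functions; common ones include, but are not limited to, sum, max, and min.
Avoid using Aggregations on an unbounded stream where possible, since the per-key state they maintain keeps growing as new keys arrive.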
keyedStream.sum(0);
keyedStream.sum("key");
keyedStream.min(0);
keyedStream.min("key");
keyedStream.max(0);
keyedStream.max("key");
keyedStream.minBy(0);
keyedStream.minBy("key");
keyedStream.maxBy(0);
keyedStream.maxBy("key");
For example:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Build the data source
List<Tuple3<Integer, Integer, Integer>> data = new ArrayList<>();
data.add(new Tuple3<>(0, 1, 0));
data.add(new Tuple3<>(0, 1, 1));
data.add(new Tuple3<>(0, 2, 2));
data.add(new Tuple3<>(0, 1, 3));
data.add(new Tuple3<>(1, 2, 5));
data.add(new Tuple3<>(1, 2, 9));
data.add(new Tuple3<>(1, 2, 11));
data.add(new Tuple3<>(1, 2, 13));
DataStreamSource<Tuple3<Integer, Integer, Integer>> items = env.fromCollection(data);
// Key by field 0, take the rolling max of field 2, and print the result to stderr
items.keyBy(0).max(2).printToErr();
String jobName = "user defined streaming source";
env.execute(jobName);
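The list of aggregations includes both max and maxBy, and the difference is easy to miss. As far as I understand Flink's behavior, max only updates the aggregated field and leaves the other fields as they were in the first element seen for that key, while maxBy keeps the whole element that holds the maximum. The plain-Java sketch below simulates this on the key-0 tuples from the example; `MaxVsMaxBySketch` and its methods are hypothetical names, `int[]` stands in for Tuple3, and the result is worth checking against your own Flink version.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MaxVsMaxBySketch {
    // Rolling aggregation on field 2, keyed by field 0.
    // One result is emitted for every incoming element.
    private static List<String> rolling(List<int[]> input, boolean keepWholeElement) {
        Map<Integer, int[]> state = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (int[] t : input) {
            int[] current = state.get(t[0]);
            if (current == null) {
                current = t.clone(); // first element of this key
            } else if (t[2] > current[2]) {
                current = keepWholeElement
                        ? t.clone()                                  // maxBy: take the whole element
                        : new int[] {current[0], current[1], t[2]};  // max: update field 2 only
            }
            state.put(t[0], current);
            out.add(Arrays.toString(current));
        }
        return out;
    }

    static List<String> rollingMax(List<int[]> input) {
        return rolling(input, false);
    }

    static List<String> rollingMaxBy(List<int[]> input) {
        return rolling(input, true);
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
                new int[] {0, 1, 0}, new int[] {0, 1, 1},
                new int[] {0, 2, 2}, new int[] {0, 1, 3});
        System.out.println("max:   " + rollingMax(data));
        System.out.println("maxBy: " + rollingMaxBy(data));
    }
}
```

For the third input (0, 2, 2) the simulated max emits [0, 1, 2], because field 1 still comes from the first element, whereas the simulated maxBy emits [0, 2, 2].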
(7) Reduce
Reduce is applied to each group of a keyedStream and aggregates the elements of that group with user-defined logic.
List<Tuple3<Integer, Integer, Integer>> data = new ArrayList<>();
data.add(new Tuple3<>(0, 1, 0));
data.add(new Tuple3<>(0, 1, 1));
data.add(new Tuple3<>(0, 2, 2));
data.add(new Tuple3<>(0, 1, 3));
data.add(new Tuple3<>(1, 2, 5));
data.add(new Tuple3<>(1, 2, 9));
data.add(new Tuple3<>(1, 2, 11));
data.add(new Tuple3<>(1, 2, 13));
DataStreamSource<Tuple3<Integer, Integer, Integer>> items = env.fromCollection(data);
//items.keyBy(0).max(2).printToErr();
SingleOutputStreamOperator<Tuple3<Integer, Integer, Integer>> reduce = items.keyBy(0).reduce(new ReduceFunction<Tuple3<Integer, Integer, Integer>>() {
    @Override
    public Tuple3<Integer, Integer, Integer> reduce(Tuple3<Integer, Integer, Integer> t1, Tuple3<Integer, Integer, Integer> t2) throws Exception {
        // Keep the key in field 0 (the original hard-coded 0 here) and sum field 2
        Tuple3<Integer, Integer, Integer> newTuple = new Tuple3<>();
        newTuple.setFields(t1.f0, 0, t1.f2 + t2.f2);
        return newTuple;
    }
});
reduce.printToErr().setParallelism(1);
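A point the example does not spell out is that a reduce on a keyed stream is a rolling operation: it emits one result per incoming element, each combining the previous result for that key with the new element. The following plain-Java sketch simulates this for the sum of field 2 on the same data. `RollingReduceSketch` and `rollingReduce` are hypothetical names and `int[]` stands in for Tuple3, so treat it as an approximation of the semantics rather than Flink's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class RollingReduceSketch {
    // Keyed by field 0; combines field 2 of each new element with the
    // previous result for the same key and emits "(key, result)" every time.
    static List<String> rollingReduce(List<int[]> input, BinaryOperator<Integer> fn) {
        Map<Integer, Integer> state = new HashMap<>();
        List<String> out = new ArrayList<>();
        for (int[] t : input) {
            // merge() stores the value as-is for a new key, otherwise applies fn
            int result = state.merge(t[0], t[2], fn);
            out.add("(" + t[0] + ", " + result + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
                new int[] {0, 1, 0}, new int[] {0, 1, 1}, new int[] {0, 2, 2}, new int[] {0, 1, 3},
                new int[] {1, 2, 5}, new int[] {1, 2, 9}, new int[] {1, 2, 11}, new int[] {1, 2, 13});
        // prints [(0, 0), (0, 1), (0, 3), (0, 6), (1, 5), (1, 14), (1, 25), (1, 38)]
        System.out.println(rollingReduce(data, Integer::sum));
    }
}
```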