1. Big Data Flink Basics Tutorial (1): A One-Hour Quick Start to Flink
1.1. What is Flink
- Flink Batch/Stream
  - Spark Streaming: Batch & Stream -- batch unifies streaming (a stream is processed as a series of micro-batches)
  - Flink: Batch & Stream -- streaming unifies batch (a batch is treated as a bounded stream)
- API
  - SQL: Stream & Batch
  - DataStream / DataSet (Spark: DStreams & DataSet)
- Unbounded / bounded
  - Unbounded streams: have a start but no defined end
  - Bounded streams: have a defined start and end
- Deploy: local, standalone, YARN, Mesos, K8s, cloud
  - roughly 90% of deployments run on YARN
  - Spark/Flink on YARN has no HA of its own
  - standalone mode has HA
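The bounded/unbounded split maps directly onto Flink's two classic entry points. A minimal sketch (class name and element values are illustrative; it assumes the Flink 1.9 dependencies from the setup section are on the classpath):

```java
import org.apache.flink.api.java.ExecutionEnvironment;

public class ApiEntryPoints {
    public static void main(String[] args) throws Exception {
        // DataSet API: bounded input, batch-style execution
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        batchEnv.fromElements("a", "b", "c").print(); // the job ends once the input is exhausted

        // DataStream API: unbounded input, e.g. a socket source that runs until cancelled:
        // StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        // streamEnv.socketTextStream("localhost", 9999).print();
        // streamEnv.execute("unbounded-job");
    }
}
```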
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
1.2. Programming Model
- Flink abstraction levels
- Programs
  - MapReduce: input -> map(reduce) -> output
  - Spark: input -> transformations -> action -> output
  - Storm: input -> Spout -> Bolt -> output
  - Flink: source -> transformations -> sink
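The Flink flow above (source -> transformations -> sink) can be sketched as a minimal job; the element values and job name are illustrative, and the Flink 1.9 dependencies added later in this tutorial are assumed to be on the classpath:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("flink", "spark", "storm") // source
           .map(String::toUpperCase)                // transformation
           .print();                                // sink
        env.execute("PipelineSketch");              // nothing runs until execute() is called
    }
}
```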
1.3. Development Environment Setup
Create a Maven project on Windows and add the following dependencies:
<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.9.1</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.9.1</version>
        <!--<scope>provided</scope>-->
    </dependency>
</dependencies>
[Note]
`<scope>provided</scope>` must stay commented out when running from the IDE; otherwise the program fails with:
java.lang.NoClassDefFoundError: org/apache/flink/streaming/api/datastream/DataStream
1.4. WordCount Example
- Requirement: every 2 seconds, compute the word counts over the last 4 seconds.
- POJO entity class
package com.betop.flinktrain.entity;

/**
 * @Author: eastlong
 * @Date 2020/2/7
 * @function: WordCount class
 *
 * To be recognized as a POJO, the class must:
 * 1) be public
 * 2) have a no-argument constructor
 * 3) provide getters/setters
 * 4) have serializable fields where needed
 **/
public class WC {
    private String word;
    private long count;

    public WC() {
    }

    public WC(String word, long count) {
        this.word = word;
        this.count = count;
    }

    // getters/setters and toString omitted
}
- FlinkSocketApp
package com.betop.flinktrain.impl;

import com.betop.flinktrain.entity.WC;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

/**
 * @Author: eastlong
 * @Date 2020/2/7
 * @function: every 2 seconds, compute the data of the last 4 seconds
 **/
public class FlinkSocketApp {
    public static void main(String[] args) throws Exception {
        // create the streaming execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // source
        DataStream<String> lines = env.socketTextStream("192.168.211.128", 9999);

        // transformations
        DataStream<WC> results = lines.flatMap(new FlatMapFunction<String, WC>() {
            @Override
            public void flatMap(String s, Collector<WC> collector) throws Exception {
                String[] words = s.split(","); // split each comma-separated line into words
                for (String word : words) {
                    collector.collect(new WC(word, 1));
                }
            }
        }).keyBy("word")
          .timeWindow(Time.seconds(4), Time.seconds(2)) // every 2 seconds, compute the last 4 seconds
          .sum("count");

        // sink
        results.print().setParallelism(1);

        env.execute("FlinkSocketApp");
    }
}
- Run netcat on the Linux machine:
[hadoop@hadoop101 files]$ nc -l 9999
hello,hadoop,spark
hello,hive,maven
aa,bb,cc
- Start the Java program
- Observe the program's output:
WC{word='spark', count=1}
WC{word='hive', count=1}
WC{word='hadoop', count=1}
WC{word='hello', count=2}
......
[Supplement]
- What is a POJO?
  A POJO (Plain Old Java Object) is just an ordinary JavaBean; the name was coined to avoid confusion with EJB, and the short form is simply convenient.
- Conditions for a class to be recognized as a POJO:
  - the class is public
  - it has a no-argument constructor
  - all fields have getters/setters
  - some fields need to be serializable
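Whether Flink actually treats a class as a POJO can be checked through its type extractor. A small sketch (class names here are illustrative, assuming flink-java 1.9.x on the classpath):

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.typeutils.PojoTypeInfo;
import org.apache.flink.api.java.typeutils.TypeExtractor;

public class PojoCheck {
    // A class meeting the POJO rules above: public, no-arg constructor, getters/setters
    public static class Word {
        private String word;
        private long count;
        public Word() {}
        public String getWord() { return word; }
        public void setWord(String word) { this.word = word; }
        public long getCount() { return count; }
        public void setCount(long count) { this.count = count; }
    }

    // returns true if Flink's type extraction classifies the class as a POJO type
    public static boolean isPojo(Class<?> clazz) {
        TypeInformation<?> info = TypeExtractor.createTypeInfo(clazz);
        return info instanceof PojoTypeInfo;
    }

    public static void main(String[] args) {
        System.out.println(isPojo(Word.class));
    }
}
```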