Big Data Flink Basics Tutorial (1): Flink Lecture 1, a 1-Hour Quick Start

1. Big Data Flink Basics Tutorial (1): Flink Lecture 1, a 1-Hour Quick Start

1.1. What is Flink

  • Flink Batch/Stream

    • Spark Streaming: Batch & Stream -- batch unifies stream (a stream is processed as a series of small batches)
    • Flink: Batch & Stream -- stream unifies batch (a batch is processed as a bounded stream)
  • API

    • SQL: Stream & Batch
    • DataStream & DataSet (Spark: DStreams & DataSet)
  • unbounded / bounded

    • Unbounded streams: have a start but no defined end.
    • Bounded streams: have a defined start and end.

[Figure: unbounded/bounded streams]

  • Deploy: local, standalone, YARN, Mesos, K8s, cloud
    • roughly 90% of deployments run on YARN (per the lecture)
    • Spark/Flink on YARN: no HA
    • standalone: has HA

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.
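
To make the bounded/unbounded distinction concrete, here is a minimal plain-Java sketch. It uses java.util.stream purely as an analogy and is not the Flink API; the class and method names are hypothetical. A bounded input can be read to its end and reduced to one final result, whereas an unbounded input never ends, so a result can only be produced over a finite slice of it (the role that windows play later in this tutorial).

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Analogy only: java.util.stream stands in for Flink streams here.
class BoundedVsUnbounded {

    // Bounded: the input has a defined start and end, so a final sum exists.
    static long sumBounded(List<Integer> input) {
        return input.stream().mapToLong(Integer::longValue).sum();
    }

    // Unbounded: 1, 2, 3, ... has a start but no defined end.
    static Stream<Integer> naturals() {
        return Stream.iterate(1, n -> n + 1);
    }

    // A result over an unbounded input is only possible for a finite slice of it.
    static List<Integer> firstN(Stream<Integer> unbounded, int n) {
        return unbounded.limit(n).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(sumBounded(Arrays.asList(1, 2, 3, 4)));
        System.out.println(firstN(naturals(), 3));
    }
}
```

Calling sumBounded on naturals() would never return, which is why an unbounded stream has to be cut into finite pieces before anything can be aggregated.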

1.2. Programming model

  • Flink levels of abstraction

[Figure: Flink levels of abstraction]

  • Programs
    • MapReduce: input -> map(reduce) -> output
    • Spark: input -> transformations -> action -> output
    • Storm: input -> Spout -> Bolt -> output
    • Flink: source -> transformations -> sink
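
The source -> transformations -> sink shape can be sketched in plain Java before looking at real Flink code. This is an analogy built on java.util collections, not the Flink API; the class name, the method names, and the upper-casing step are hypothetical choices made only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.stream.Collectors;

// Analogy only: plain Java collections, not the Flink API.
class PipelineShape {

    // transformations: split each line on commas, then upper-case every word
    static List<String> transform(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split(",")))
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    // source -> transformations -> sink, wired together
    static void run(List<String> source, Consumer<String> sink) {
        transform(source).forEach(sink);
    }

    public static void main(String[] args) {
        List<String> collected = new ArrayList<>();
        run(Arrays.asList("hello,hadoop", "spark"), collected::add);
        System.out.println(collected);
    }
}
```

Keeping the sink as a parameter mirrors the idea that the same transformations can feed different outputs, such as the console in the WordCount example of section 1.4.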

1.3. Setting up the development environment

On Windows, create a Maven project and add the following dependencies:

<dependencies>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>1.9.1</version>
        <!--<scope>provided</scope>-->
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>1.9.1</version>
        <!--<scope>provided</scope>-->
    </dependency>
</dependencies>

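[Note] <scope>provided</scope> must stay commented out when the program is run locally from the IDE, because provided dependencies are not put on the runtime classpath. Otherwise the program fails with: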

java.lang.NoClassDefFoundError: org/apache/flink/streaming/api/datastream/DataStream

1.4. WordCount example

  1. Requirement: every 2 seconds, compute the word count over the most recent 4 seconds of data
  2. POJO entity class
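package com.betop.flinktrain.entity;

/**
 * @Author: eastlong
 * @Date 2020/2/7
 * @function: WordCount class
 *
 * To be recognized by Flink as a POJO, a class must:
 * 1) be public
 * 2) have a public no-argument constructor
 * 3) expose every non-public field through a getter and a setter
 * 4) use only field types that Flink can serialize
 **/
public class WC {
    private String word;
    private long count;

    public WC() {
    }

    public WC(String word, long count) {
        this.word = word;
        this.count = count;
    }

    // Getters and setters are required because the fields are private
    public String getWord() { return word; }
    public void setWord(String word) { this.word = word; }
    public long getCount() { return count; }
    public void setCount(long count) { this.count = count; }

    @Override
    public String toString() {
        return "WC{word='" + word + "', count=" + count + "}";
    }
}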

  3. FlinkSocketApp
package com.betop.flinktrain.impl;

import com.betop.flinktrain.entity.WC;
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;


/**
 * @Author: eastlong
 * @Date 2020/2/7
 * @function: every 2 seconds, compute the word count over the last 4 seconds
 **/
public class FlinkSocketApp {
    public static void main(String[] args) throws Exception {
        // Create the execution environment
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // source: read text lines from a socket
        DataStream<String> lines = env.socketTextStream("192.168.211.128", 9999);
        // transformations
        DataStream<WC> results = lines.flatMap(new FlatMapFunction<String, WC>() {
            @Override
            public void flatMap(String s, Collector<WC> collector) throws Exception {
                String[] datas = s.split(","); // flatMap flattens each line into individual words
                for (String data : datas) {
                    collector.collect(new WC(data, 1));
                }
            }
        }).keyBy("word") // group by the "word" field of the WC POJO
                .timeWindow(Time.seconds(4), Time.seconds(2)) // sliding window: size 4 s, slide 2 s
                .sum("count");

        // sink: print the results with a single parallel task
        results.print().setParallelism(1);
        // Submit the job; nothing runs until execute() is called
        env.execute("FlinkSocketApp");
    }
}
  4. Run the following command on the Linux host
[hadoop@hadoop101 files]$ nc -l 9999
hello,hadoop,spark
hello,hive,maven
aa,bb,cc
  5. Start the Java program
  6. Observe the program output
WC{word='spark', count=1}
WC{word='hive', count=1}
WC{word='hadoop', count=1}
WC{word='hello', count=2}
......
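
Why a word can be counted more than once across outputs follows from how sliding windows overlap: with a 4-second size and a 2-second slide, every element belongs to two windows. The sketch below simulates this in plain Java rather than with the Flink API. The class name, helper names, and arrival times are hypothetical, and the start-time arithmetic is a simplified version that assumes a window offset of zero.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation of a sliding-window word count; not the Flink API.
class SlidingWordCount {
    static final long SIZE_MS = 4000;   // window length: 4 seconds
    static final long SLIDE_MS = 2000;  // a new window starts every 2 seconds

    // Start times of every window [start, start + SIZE_MS) that contains ts.
    static List<Long> windowStarts(long ts) {
        List<Long> starts = new ArrayList<>();
        long lastStart = ts - Math.floorMod(ts, SLIDE_MS);
        for (long start = lastStart; start > ts - SIZE_MS; start -= SLIDE_MS) {
            starts.add(start);
        }
        return starts;
    }

    // Adds one comma-separated line, seen at time ts, to every window it falls into.
    static void add(Map<Long, Map<String, Long>> counts, long ts, String line) {
        for (long start : windowStarts(ts)) {
            Map<String, Long> perWord = counts.computeIfAbsent(start, k -> new TreeMap<>());
            for (String word : line.split(",")) {
                perWord.merge(word, 1L, Long::sum);
            }
        }
    }

    public static void main(String[] args) {
        Map<Long, Map<String, Long>> counts = new TreeMap<>();
        add(counts, 500, "hello,hadoop,spark");   // hypothetical arrival times in ms
        add(counts, 1500, "hello,hive,maven");
        add(counts, 5000, "aa,bb,cc");
        counts.forEach((start, perWord) ->
                System.out.println("[" + start + ", " + (start + SIZE_MS) + ") " + perWord));
    }
}
```

In this simulation the first two lines arrive within the same windows, so "hello" reaches a count of 2 there, while the third line arrives late enough to land in different windows. In the real job the window an input falls into depends on when it is typed into nc.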

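[Supplement]

  • What is a POJO?
    A POJO (Plain Old Java Object) is a simple Java object, in practice an ordinary JavaBean. The name is used to avoid confusion with EJB, and the short form is convenient.
  • Conditions for a class to be recognized as a POJO:
    1. The class is public.
    2. It has a public no-argument constructor.
    3. Every non-public field has a getter and a setter.
    4. The field types must be ones that Flink can serialize.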
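
The first three conditions can be checked mechanically with reflection. The following is a rough approximation written for illustration, not Flink's own type analysis, which is more thorough; the class PojoCheck, the method looksLikePojo, and the two sample classes are hypothetical. It does not check condition 4, and it only recognizes getters named getX (not isX).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Rough approximation of conditions 1-3 above; not Flink's actual type extraction.
class PojoCheck {

    // Sample class that satisfies the conditions (a reduced version of WC).
    public static class Good {
        private String word;
        public Good() { }
        public String getWord() { return word; }
        public void setWord(String word) { this.word = word; }
    }

    // Sample class that violates condition 2: it has no no-argument constructor.
    public static class NoDefaultCtor {
        public String word;
        public NoDefaultCtor(String word) { this.word = word; }
    }

    static boolean looksLikePojo(Class<?> c) {
        if (!Modifier.isPublic(c.getModifiers())) {
            return false;                       // condition 1: public class
        }
        try {
            c.getConstructor();                 // condition 2: public no-argument constructor
        } catch (NoSuchMethodException e) {
            return false;
        }
        for (Field f : c.getDeclaredFields()) { // condition 3: accessible fields
            int m = f.getModifiers();
            if (Modifier.isStatic(m) || Modifier.isTransient(m) || Modifier.isPublic(m)) {
                continue;                       // public fields need no accessors
            }
            String name = f.getName();
            String suffix = Character.toUpperCase(name.charAt(0)) + name.substring(1);
            try {
                c.getMethod("get" + suffix);
                c.getMethod("set" + suffix, f.getType());
            } catch (NoSuchMethodException e) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(looksLikePojo(Good.class));
        System.out.println(looksLikePojo(NoDefaultCtor.class));
    }
}
```

A check like this is useful mainly as a reminder of why the WC class needs its empty constructor and its getters and setters even though the application code never calls them directly.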

1.5. References

1. Flink official website
2. Flink Concepts
3. Ruoze Big Data (若泽大数据): Flink Lecture 1, a 1-Hour Quick Start