Environment
getExecutionEnvironment
Creates an execution environment that represents the context in which the current program runs. If the program is invoked standalone, this method returns a local execution environment; if it is invoked from the command-line client to be submitted to a cluster, it returns that cluster's execution environment.
In other words, getExecutionEnvironment decides which kind of environment to return based on how the program is being run, and it is the most common way to create an execution environment.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
// Create a stream processing environment
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
createLocalEnvironment
Returns a local execution environment; the default parallelism can be specified at call time.
LocalStreamEnvironment env = StreamExecutionEnvironment.createLocalEnvironment(1);
createRemoteEnvironment
Returns a cluster execution environment and submits the Jar to a remote server. You must specify the JobManager's hostname and port when calling it, as well as the Jar package to run on the cluster.
StreamExecutionEnvironment env = StreamExecutionEnvironment.createRemoteEnvironment("jobManager-hostname", 6123, "path//xxx.jar");
Source
Reading data from a collection
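A minimal sketch of a collection source; the class name and sample sensor strings are illustrative, not from the original text:

```java
import java.util.Arrays;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceCollectionDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // fromCollection builds a bounded stream from a Java collection
        DataStream<String> dataStream = env.fromCollection(
                Arrays.asList("sensor_1,35.8", "sensor_2,15.4", "sensor_3,6.7"));

        // fromElements is a shortcut for a handful of literal values
        DataStream<Integer> intStream = env.fromElements(1, 2, 4, 67, 189);

        dataStream.print("data");
        intStream.print("int");
        env.execute("collection source");
    }
}
```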
Reading data from a file
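A sketch of a file source using readTextFile, which reads a text file line by line; the file name "sensor.txt" is a placeholder:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SourceFileDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        // Each line of the file becomes one String element of the stream
        DataStream<String> dataStream = env.readTextFile("sensor.txt");

        dataStream.print();
        env.execute("file source");
    }
}
```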
Reading data from Kafka
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>hi-flink</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <flink.version>1.12.1</flink.version>
        <scala.binary.version>2.12</scala.binary.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-scala_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
        <!-- kafka -->
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_${scala.binary.version}</artifactId>
            <version>${flink.version}</version>
        </dependency>
    </dependencies>
</project>
package com.hi.flink.source;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class SourceKafkaDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Set the parallelism to 1
        env.setParallelism(1);

        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "localhost:9092");
        p.setProperty("group.id", "consumer-group");
        p.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        p.setProperty("auto.offset.reset", "latest");

        DataStream<String> dataStream = env.addSource(
                new FlinkKafkaConsumer<String>("test", new SimpleStringSchema(), p));

        // Print the output (just printing, no aggregation)
        dataStream.print();
        env.execute();
    }
}
Transform
map, flatMap, and filter are commonly referred to together as the basic transformation operators (simple transformation operators).
Map
DataStream<Integer> mapStream = dataStream.map(new MapFunction<String, Integer>() {
    @Override
    public Integer map(String value) throws Exception {
        return value.length();
    }
});
flatMap
DataStream<String> flatMapStream = dataStream.flatMap(new FlatMapFunction<String, String>() {
    @Override
    public void flatMap(String value, Collector<String> out) throws Exception {
        String[] fields = value.split(",");
        for (String field : fields) {
            out.collect(field);
        }
    }
});
filter
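filter keeps only the elements for which the predicate returns true. A sketch in the same style as the map and flatMap examples, assuming the same dataStream of strings (the "sensor_1" prefix is illustrative):

```java
DataStream<String> filterStream = dataStream.filter(new FilterFunction<String>() {
    @Override
    public boolean filter(String value) throws Exception {
        // keep only the lines that start with "sensor_1"
        return value.startsWith("sensor_1");
    }
});
```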
keyBy
DataStream -> KeyedStream: logically splits a stream into disjoint partitions, with each partition containing the elements that share the same key; internally this is implemented with hashing.
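For example, assuming a stream of comma-separated sensor readings as in the earlier snippets, keying by the first field with a KeySelector might look like this (the field layout is illustrative):

```java
KeyedStream<String, String> keyedStream = dataStream.keyBy(new KeySelector<String, String>() {
    @Override
    public String getKey(String value) throws Exception {
        // use the first comma-separated field (the sensor id) as the key
        return value.split(",")[0];
    }
});
```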
滚动聚合算子(Rolling Aggregation)
These operators perform aggregations on each substream (key partition) of a KeyedStream.
- sum()
- min()
- max()
- minBy()
- maxBy()
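A sketch combining keyBy with a rolling aggregation, assuming a hypothetical POJO SensorReading with id and temperature fields (not defined in the original text):

```java
// SensorReading is a hypothetical POJO with getId()/getTemperature()
DataStream<SensorReading> maxTempStream = sensorStream
        .keyBy(SensorReading::getId)   // partition by sensor id
        .max("temperature");           // rolling max of the temperature field
```

Note the difference: max() updates only the aggregated field and keeps the other fields from the first record seen, while maxBy() emits the entire record that contains the maximum value; min() and minBy() behave analogously.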