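Stateful computation is one of Flink's defining features. Flink operators can keep state while they process events, and that state is stored in memory or on local disk, depending on the state backend.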

State data can be:

The sequence of events received so far
Intermediate results of a computation
Historical data

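Flink persists this state in checkpoints and savepoints, which is how it provides fault tolerance and keeps results consistent after a failure.
State can also be queried from outside the job: the client that Flink provides (the Queryable State feature) can read it.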

State types

KeyedState

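KeyedState is used in operators that come after a keyBy. Once the stream is partitioned by key, an operator can record intermediate state for each key and use it in custom computations.
It includes:

ValueState
MapState
ListState
ReducingState
AggregatingState
FoldingState (deprecated; use AggregatingState instead)

KeyedState stores data scoped to the current key, so conceptually it is key-value data, and it can only be used in operators on a keyed stream.
Usage:
Extend a RichFunction class.
Define a state descriptor.
Obtain the state handle with getRuntimeContext().getState(...Descriptor).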

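import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

// Extend a RichFunction class (a map operator in this example)
class TestClass extends RichMapFunction<Tuple2<String, Long>, Long> {
	private transient ValueState<Long> valueState;

	@Override
	public void open(Configuration parameters) throws Exception {
		super.open(parameters);
		// Define the state descriptor
		ValueStateDescriptor<Long> stateDescriptor = new ValueStateDescriptor<>("name1", LongSerializer.INSTANCE);
		// Obtain the state handle from the runtime context
		valueState = getRuntimeContext().getState(stateDescriptor);
	}

	@Override
	public Long map(Tuple2<String, Long> value) throws Exception {
		Long length = valueState.value();
		// Initialize: value() returns null until the state is first updated for this key
		if (length == null) {
			length = 0L;
		}
		long newValue = length + value.f1 * 2;
		// Update the state
		valueState.update(newValue);
		return newValue;
	}
}

// Usage: stream must be keyed (the result of keyBy) for keyed state to work
stream.map(new TestClass());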
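
To make "state scoped to the current key" concrete, here is a minimal plain-Java sketch that mimics the behaviour of the function above without Flink. The class name KeyedValueStateSketch and the HashMap are illustrative assumptions, not Flink API: in a real job the state backend stores the value, and the runtime selects the current key before map is called.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java sketch (not Flink API): imitates how a keyed ValueState<Long>
// keeps one independent value per key.
class KeyedValueStateSketch {
    // Stand-in for the state backend: one slot per key.
    private final Map<String, Long> statePerKey = new HashMap<>();

    // Mirrors TestClass.map: read the state for this key, default to 0,
    // add value * 2, write the result back, and return it.
    long map(String key, long value) {
        Long current = statePerKey.get(key);   // like valueState.value()
        if (current == null) {
            current = 0L;                       // nothing stored for this key yet
        }
        long newValue = current + value * 2;
        statePerKey.put(key, newValue);         // like valueState.update(newValue)
        return newValue;
    }

    public static void main(String[] args) {
        KeyedValueStateSketch sketch = new KeyedValueStateSketch();
        System.out.println(sketch.map("a", 1)); // 2
        System.out.println(sketch.map("b", 5)); // 10
        System.out.println(sketch.map("a", 3)); // 8
    }
}
```

Key "b" starts from zero even though key "a" already has a value, which is the isolation that keyBy plus KeyedState provides.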

OperatorState

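Unlike KeyedState, OperatorState can be used in operators that are not keyed.
OperatorState supports state redistribution. When the parallelism of an operator changes, redistribution determines how the state held by the old parallel instances is assigned to the new ones. The scheme depends on the state type, as described for each type below.
OperatorState is used less often than KeyedState, mainly in source and sink operators such as a buffering sink. It includes:

ListState: when the parallelism changes, the entries of all old instances are split evenly across the new instances.
UnionListState: when the parallelism changes, the entries of all old instances are merged, and every new instance receives the complete list.
BroadcastState: every instance holds the same data, so when the parallelism changes, each new instance receives a copy of one old instance's state.

Usage:
Implement the CheckpointedFunction interface (or the deprecated ListCheckpointed interface).
Define a state descriptor.
Obtain the state handle in initializeState with context.getOperatorStateStore().getListState(...Descriptor).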

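// The CheckpointedFunction interface
@Public
public interface CheckpointedFunction {
    // Called when a checkpoint is taken
    void snapshotState(FunctionSnapshotContext var1) throws Exception;
    // Called when the function is created or restored; used for initialization
    void initializeState(FunctionInitializationContext var1) throws Exception;
}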

// Example from the official documentation: counts elements per key and per parallel partition.
// AddFunction is a user-defined ReduceFunction that is not shown here.
public class MyFunction<T> implements MapFunction<T, T>, CheckpointedFunction {
     private ReducingState<Long> countPerKey;
     private ListState<Long> countPerPartition;
     private long localCount;
     // Called when each parallel operator instance is created; used for initialization
     public void initializeState(FunctionInitializationContext context) throws Exception {
         countPerKey = context.getKeyedStateStore().getReducingState(
                 new ReducingStateDescriptor<>("perKeyCount", new AddFunction<>(), Long.class));
         // getListState replaces the older, deprecated getOperatorState
         countPerPartition = context.getOperatorStateStore().getListState(
                 new ListStateDescriptor<>("perPartitionCount", Long.class));
         // Rebuild the local counter from the restored state
         for (Long l : countPerPartition.get()) {
             localCount += l;
         }
     }
     // Called when a checkpoint is taken: write the local counter into operator state
     public void snapshotState(FunctionSnapshotContext context) throws Exception {
         countPerPartition.clear();
         countPerPartition.add(localCount);
     }
     public T map(T value) throws Exception {
         // update the states
         countPerKey.add(1L);
         localCount++;
         return value;
     }
 }
 
 // CheckpointedFunction is not the simplest way to work with keyed state. When only keyed state
 // is needed, a RichMapFunction that uses getRuntimeContext() is a more convenient shortcut,
 // so the form below is recommended in that case.
 public class CountPerKeyFunction<T> extends RichMapFunction<T, T> {
     private ValueState<Long> count;
     public void open(Configuration cfg) throws Exception {
         count = getRuntimeContext().getState(new ValueStateDescriptor<>("myCount", Long.class));
     }
     public T map(T value) throws Exception {
         // ValueState is read with value(), which returns null before the first update
         Long current = count.value();
         count.update(current == null ? 1L : current + 1);

         return value;
     }
 }

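BufferSink example:

import java.util.ArrayList;
import java.util.List;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

class BufferingSinkFunction implements SinkFunction<Tuple2<String, Integer>>, CheckpointedFunction {
    // Number of buffered elements that triggers a batch send
    private final int threshold;
    private transient ListState<Tuple2<String, Integer>> checkpointedState;
    // Buffer list holding the elements waiting to be sent in a batch
    private List<Tuple2<String, Integer>> bufferedElements;
    public BufferingSinkFunction(int threshold) {
        this.threshold = threshold;
        this.bufferedElements = new ArrayList<>();
    }
    // Called once for every record
    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        bufferedElements.add(value);
        // Use >= rather than ==: after a restore the buffer may already hold more than threshold elements
        if (bufferedElements.size() >= threshold) {
            for (Tuple2<String, Integer> element : bufferedElements) {
                //todo send it to the sink
            }
            bufferedElements.clear();
        }
    }
    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedState.clear();
        // Persist the elements that have not been sent yet
        for (Tuple2<String, Integer> element : bufferedElements) {
            checkpointedState.add(element);
        }
    }
    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // Initialize the state
        ListStateDescriptor<Tuple2<String, Integer>> descriptor =
                new ListStateDescriptor<>(
                        "buffered-elements",
                        TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {
                        }));
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        if (context.isRestored()) {
            // Put the elements restored from the checkpoint back into the buffer list
            for (Tuple2<String, Integer> element : checkpointedState.get()) {
                bufferedElements.add(element);
            }
        }
    }
}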
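
If the buffering sink above is restored with a different parallelism, its buffered elements are redistributed in the even-split mode, because it obtains its state with getListState. The plain-Java sketch below contrasts that mode with the union mode. It is not Flink API: the class RedistributionSketch and its method names are hypothetical, and the round-robin assignment is an assumption made for illustration, since the exact order in which Flink assigns entries is an implementation detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch (not Flink API) of two operator-state redistribution modes.
class RedistributionSketch {

    // Even-split mode (ListState): entries of all old instances are pooled
    // and dealt out round-robin to the new parallel instances.
    static <T> List<List<T>> evenSplit(List<List<T>> oldInstances, int newParallelism) {
        List<List<T>> result = new ArrayList<>();
        for (int n = 0; n < newParallelism; n++) {
            result.add(new ArrayList<>());
        }
        int i = 0;
        for (List<T> instance : oldInstances) {
            for (T entry : instance) {
                result.get(i % newParallelism).add(entry);
                i++;
            }
        }
        return result;
    }

    // Union mode (UnionListState): every new instance receives the full merged list.
    static <T> List<List<T>> union(List<List<T>> oldInstances, int newParallelism) {
        List<T> merged = new ArrayList<>();
        for (List<T> instance : oldInstances) {
            merged.addAll(instance);
        }
        List<List<T>> result = new ArrayList<>();
        for (int n = 0; n < newParallelism; n++) {
            result.add(new ArrayList<>(merged));
        }
        return result;
    }

    public static void main(String[] args) {
        // Two old instances holding 3 and 1 entries, rescaled to parallelism 3.
        List<List<Integer>> old = Arrays.asList(Arrays.asList(1, 2, 3), Arrays.asList(4));
        System.out.println(evenSplit(old, 3)); // [[1, 4], [2], [3]]
        System.out.println(union(old, 3));     // [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
    }
}
```

With the union mode each instance sees every entry, so the function itself has to decide which entries it keeps after a restore.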

BroadcastState

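BroadcastState is a special kind of OperatorState. Its main use is to broadcast a low-volume stream A to every parallel instance of the operator that processes stream B, so that B can use A's data during its computation.
Broadcast state is stored as key-value pairs, much like a Map, so many broadcast entries can be held at once and are distinguished by key.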

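Use cases:

Joining a large stream with a small one: the small stream can be broadcast.
Dynamic parameter management: the large stream's logic depends on parameters that change at runtime, and it picks up the updates from the broadcast stream.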

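Points to note:

The broadcast stream should carry a small amount of data.
Reserve enough memory for the broadcast state.
Broadcast state is not stored in RocksDB: it is kept in memory at runtime even when the RocksDB state backend is configured.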

Usage:

// Define the descriptor for the broadcast state
MapStateDescriptor<String, MyPojo> stateDescriptor = new MapStateDescriptor<>(
        "BroadcastStateName",
        BasicTypeInfo.STRING_TYPE_INFO,
        TypeInformation.of(new TypeHint<MyPojo>() {
        }));
// Create the broadcast stream from stream A
BroadcastStream<MyPojo> broadcastAStream = keyedAStream
        .broadcast(stateDescriptor);
// Connect stream B with the broadcast stream
DataStream<String> output = bStream
        .connect(broadcastAStream)
        .process(
        // KS: key type of bStream (bStream must be a KeyedStream here)
        // IN1: element type of bStream
        // IN2: element type of the broadcast stream (MyPojo here)
        // OUT: output type (String here)
        // For a non-keyed bStream, extend BroadcastProcessFunction<IN1, IN2, OUT> instead
        new KeyedBroadcastProcessFunction<KS, IN1, MyPojo, String>() {
            // Descriptor for the broadcast state; same name and types as stateDescriptor above
            private final MapStateDescriptor<String, MyPojo> broadcastStateDescriptor =
                new MapStateDescriptor<>(
                        "BroadcastStateName",
                        BasicTypeInfo.STRING_TYPE_INFO,
                        TypeInformation.of(new TypeHint<MyPojo>() {
                        }));
            // Called for each element of the data stream (stream B)
            @Override
            public void processElement(final IN1 value, final ReadOnlyContext ctx, final Collector<String> out) throws Exception {
                // Read the broadcast state; it is read-only in this method
                ReadOnlyBroadcastState<String, MyPojo> broadcastState = ctx.getBroadcastState(broadcastStateDescriptor);
                //todo process ...
                out.collect(...);
            }
            // Called for each element of the broadcast stream (stream A)
            @Override
            public void processBroadcastElement(final MyPojo value, final Context ctx, final Collector<String> out) throws Exception {
                // Update the broadcast state; value.name is assumed to be a field of MyPojo
                ctx.getBroadcastState(broadcastStateDescriptor).put(value.name, value);
            }
        });
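
The division of labour between the two callbacks can be sketched in plain Java. This is not Flink API: the class BroadcastStateSketch, the rule name, and the threshold values are hypothetical and only illustrate the dynamic-parameter use case, where each parallel instance keeps its own copy of a small map, a broadcast element updates every copy, and a data element only reads the copy of the instance that processes it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java sketch (not Flink API) of the broadcast state pattern.
class BroadcastStateSketch {
    // One copy of the broadcast map per parallel instance.
    private final List<Map<String, Integer>> copies = new ArrayList<>();

    BroadcastStateSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            copies.add(new HashMap<>());
        }
    }

    // Like processBroadcastElement: runs on every instance and may write.
    void onBroadcastElement(String ruleName, int threshold) {
        for (Map<String, Integer> copy : copies) {
            copy.put(ruleName, threshold);
        }
    }

    // Like processElement: runs on one instance and only reads its copy.
    // Returns true when a rule exists and the value exceeds its threshold.
    boolean onDataElement(int instance, String ruleName, int value) {
        Integer threshold = copies.get(instance).get(ruleName);
        return threshold != null && value > threshold;
    }

    public static void main(String[] args) {
        BroadcastStateSketch sketch = new BroadcastStateSketch(2);
        System.out.println(sketch.onDataElement(0, "max", 10)); // false: no rule yet
        sketch.onBroadcastElement("max", 5);
        System.out.println(sketch.onDataElement(1, "max", 10)); // true
        sketch.onBroadcastElement("max", 20);                   // parameter updated at runtime
        System.out.println(sketch.onDataElement(1, "max", 10)); // false
    }
}
```

Because every instance applies the same broadcast elements, all copies stay identical, which is why the data-side callback gets read-only access in Flink.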
