Flink Real-Time Deduplication: A Bloom Filter (BloomFilter) Implementation


Common approaches to real-time deduplication in Flink:

  • Based on the state backend
  • Based on HyperLogLog
  • Based on a Bloom filter (BloomFilter)
  • Based on a BitMap
  • Based on an external database

Real-time deduplication in Flink with a Bloom filter

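As in the previous post, the source emits strings built from random digits 0-9. A BloomFilter records which elements have already been seen, and the filter itself is kept in a ValueState so that it can be updated as new elements arrive. The implementation is as follows: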

package others;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;
import org.apache.flink.shaded.guava30.com.google.common.hash.BloomFilter;
import org.apache.flink.shaded.guava30.com.google.common.hash.Funnels;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.Random;

/**
 * @projectName: wc
 * @package: others
 * @className: boolmFilterApp
 * @author: NelsonWu
 * @description: Combines a BloomFilter with ValueState in Flink to deduplicate data
 * @date: 2024/2/25 0:06
 * @version: 1.0
 */
public class boolmFilterApp {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);  // Set the global parallelism to 1.

        // Override the data generator's map method to produce random numbers in 0-9 for testing
        // Output is of type String: num:0, num:1, num:2, ..., num:9
        Random random = new Random();
        DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(
                new GeneratorFunction<Long, String>() {
                    @Override
                    public String map(Long aLong) throws Exception {
                        int i = random.nextInt(10);
                        return "num:" + i;
                    }
                },
                20,
                RateLimiterStrategy.perSecond(1),
                Types.STRING
        );

        DataStreamSource<String> stringDataStreamSource = env.fromSource(
                dataGeneratorSource,
                WatermarkStrategy.noWatermarks(),
                "data-generator"
        );
        KeyedStream<String, String> keyedStream = stringDataStreamSource.keyBy(
                new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) throws Exception {
                        String[] split = value.split(":");
                        return split[1];
                    }
                }
        );

        keyedStream.process(new KeyedProcessFunction<String, String, String>() {

            // Keyed state: Flink scopes this ValueState to the current key, so each key gets its own filter
            private transient ValueState<BloomFilter<CharSequence>> bloomFilterState;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                // Register the state without a default value (the default-value constructor is deprecated)
                ValueStateDescriptor<BloomFilter<CharSequence>> descriptor = new ValueStateDescriptor<>(
                                "bloomFilterState",
                                TypeInformation.of(new TypeHint<BloomFilter<CharSequence>>() {}));
                bloomFilterState = getRuntimeContext().getState(descriptor);
            }

            @Override
            public void processElement(String s, KeyedProcessFunction<String, String, String>.Context context, Collector<String> collector) throws Exception {
                String key = s.split(":")[1]; // Extract the key used for the membership check and for insertion into state
                BloomFilter<CharSequence> bloomFilter = bloomFilterState.value();
                if (bloomFilter == null) {
                    // First element for this key: create the Bloom filter lazily
                    bloomFilter = BloomFilter.create(Funnels.unencodedCharsFunnel(), 10000000);
                }
                if (!bloomFilter.mightContain(key)) {
                    // Not seen before: record it, write the filter back to state, and emit the element
                    bloomFilter.put(key);
                    bloomFilterState.update(bloomFilter);
                    collector.collect(s);
                }
            }
        }).print();

        env.execute("bloomFilterApp");

    }
}
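
The listing creates its filter with 10000000 expected insertions and no explicit false-positive rate. As a rough guide to what that costs, the sketch below applies the textbook Bloom filter sizing formulas, assuming the 3% false-positive probability that Guava documents as the default for the two-argument create overload. The class and method names here are hypothetical and the arithmetic is only an approximation of what the library allocates internally.

```java
/** Hypothetical helper: textbook Bloom filter sizing, not Guava's internal code. */
class BloomSizing {

    // Number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalNumOfBits(long expectedInsertions, double fpp) {
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2));
    }

    // Number of hash functions: k = (m / n) * ln 2, rounded, and at least 1
    static int optimalNumOfHashFunctions(long expectedInsertions, long numBits) {
        return Math.max(1, (int) Math.round((double) numBits / expectedInsertions * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 10_000_000L;   // expected insertions, as in the listing
        double p = 0.03;        // assumed default false-positive probability
        long m = optimalNumOfBits(n, p);
        int k = optimalNumOfHashFunctions(n, m);
        System.out.printf("bits=%d (~%.1f MiB), hashes=%d%n", m, m / 8.0 / 1024 / 1024, k);
    }
}
```

Under these assumptions the filter needs on the order of 73 million bits (roughly 9 MB) and 5 hash functions. Because the state is keyed, every distinct key holds its own filter of that size, so for a test stream with only ten keys a much smaller expected-insertions value would likely be enough.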

Result: with 20 input records, 9 records are output in the end. The job filters the data (deduplicates it) as expected.
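
The mightContain / put check that drives the job can be tried without Flink. The sketch below is a deliberately simplified, hypothetical Bloom filter built on java.util.BitSet (it is not Guava's implementation, and the hash scheme is an assumption chosen for brevity), followed by the same "emit only if not seen" loop used in processElement.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical, simplified Bloom filter for illustration only. */
class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a over the UTF-8 bytes; the seed is mixed in to derive a second hash.
    private static int fnv1a(String s, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        return h;
    }

    // Double hashing: the i-th index is h1 + i * h2, reduced modulo the bit-array size.
    private int index(String s, int i) {
        int h1 = fnv1a(s, 0);
        int h2 = fnv1a(s, 0x9E3779B9);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    boolean mightContain(String s) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(s, i))) {
                return false; // at least one bit is unset: definitely never added
            }
        }
        return true; // all bits set: probably added, but false positives are possible
    }

    void put(String s) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(s, i));
        }
    }
}

class BloomDedupDemo {
    // Emit an element only the first time the filter reports it as unseen.
    static List<String> dedup(List<String> input, SimpleBloomFilter filter) {
        List<String> out = new ArrayList<>();
        for (String s : input) {
            if (!filter.mightContain(s)) {
                filter.put(s);
                out.add(s);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("num:3", "num:7", "num:3", "num:1", "num:7");
        System.out.println(dedup(input, new SimpleBloomFilter(1 << 16, 3)));
    }
}
```

A repeated element is never emitted twice, because every bit it set on the first pass is still set on the second. The trade-off is that a never-seen element can occasionally be reported as present and dropped, so this style of deduplication suits cases where losing a small fraction of unique records is acceptable.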

Blog

blog.csdn.net/u010271601/…