Flink Real-Time Deduplication: A BitMap Implementation


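Common approaches to real-time deduplication in Flink:

  • Based on the state backend
  • Based on HyperLogLog
  • Based on a Bloom filter (BloomFilter)
  • Based on a BitMap
  • Based on an external database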

How bitmap and RoaringBitmap work

cloud.tencent.com/developer/a…

www.cnblogs.com/cjsblog/p/1…

www.cnblogs.com/huangxinche…

cloud.tencent.com/developer/a…
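
To make the idea behind those links concrete, here is a minimal sketch of a plain, uncompressed bitmap: one bit per possible integer, set when the integer has been seen. The class name SimpleBitmap and its methods are made up for this illustration and are not part of any library. Covering the full 32-bit integer range this way would take 2^32 bits, i.e. 512 MiB, which is the cost that compressed structures such as RoaringBitmap are designed to avoid.

```java
// A minimal fixed-size bitmap: bit i of the backing array records whether integer i was seen.
// Hypothetical illustration class, not a library API.
class SimpleBitmap {
    private final long[] words;

    // capacity = how many distinct non-negative integers [0, capacity) can be tracked
    SimpleBitmap(int capacity) {
        this.words = new long[(capacity + 63) >>> 6];  // 64 bits per long
    }

    // Sets the bit for value; returns true if it was not set before (first occurrence).
    boolean add(int value) {
        int word = value >>> 6;          // which long holds the bit
        long mask = 1L << (value & 63);  // which bit inside that long
        boolean isNew = (words[word] & mask) == 0;
        words[word] |= mask;
        return isNew;
    }

    boolean contains(int value) {
        return (words[value >>> 6] & (1L << (value & 63))) != 0;
    }

    // Memory used by the backing array, in bytes.
    long sizeInBytes() {
        return (long) words.length * Long.BYTES;
    }

    public static void main(String[] args) {
        SimpleBitmap bitmap = new SimpleBitmap(1_000_000);
        System.out.println(bitmap.add(42));        // true: first time 42 is seen
        System.out.println(bitmap.add(42));        // false: duplicate
        System.out.println(bitmap.contains(7));    // false: never added
        System.out.println(bitmap.sizeInBytes());  // 125000: one bit per possible value
    }
}
```

Membership checks and inserts are a shift and a mask, which is why bitmaps are attractive for deduplication when the keys are (or can be mapped to) integers.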

Deduplicating Flink data with a bitmap

RoaringBitmap is used here, so the following Maven dependency needs to be added:

<dependency>
    <groupId>org.roaringbitmap</groupId>
    <artifactId>RoaringBitmap</artifactId>
    <version>0.9.21</version>
</dependency>
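
As a rough picture of what the library does internally, the sketch below shows the two-level layout RoaringBitmap is built on: the high 16 bits of a value select a container, and the low 16 bits are stored inside that container. This is a simplified illustration, not the library's code. The class TwoLevelBitmapSketch is invented for this post, and it uses a java.util.BitSet for every container, whereas the real library switches between array, bitmap and run containers depending on how dense each chunk is.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Simplified illustration of a two-level (Roaring-style) layout, NOT the library's implementation.
class TwoLevelBitmapSketch {
    // Containers are created lazily, so untouched ranges of the int space cost nothing.
    private final Map<Integer, BitSet> containers = new HashMap<>();

    // Returns true if value was newly added, false if it was already present.
    boolean add(int value) {
        int high = value >>> 16;   // container key (0..65535)
        int low = value & 0xFFFF;  // position inside the container (0..65535)
        BitSet container = containers.computeIfAbsent(high, k -> new BitSet());
        if (container.get(low)) {
            return false;
        }
        container.set(low);
        return true;
    }

    boolean contains(int value) {
        BitSet container = containers.get(value >>> 16);
        return container != null && container.get(value & 0xFFFF);
    }

    int containerCount() {
        return containers.size();
    }

    public static void main(String[] args) {
        TwoLevelBitmapSketch sketch = new TwoLevelBitmapSketch();
        sketch.add(1);
        sketch.add(2);
        sketch.add(-5);  // negative ints work: >>> treats the value as unsigned
        System.out.println(sketch.contains(-5));      // true
        System.out.println(sketch.containerCount());  // 2: 1 and 2 share a container, -5 is in another
    }
}
```

Because only the chunks that actually contain values are allocated, sparse sets of 32-bit integers stay small, which matters when the bitmap is kept in Flink state.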

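The complete Flink program: the data source is the same as before; only the process function changes, and it now uses a bitmap object for deduplication.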

package others;

import com.google.common.base.Charsets;
import com.google.common.hash.Hashing;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.connector.source.util.ratelimit.RateLimiterStrategy;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.connector.datagen.source.DataGeneratorSource;
import org.apache.flink.connector.datagen.source.GeneratorFunction;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.KeyedStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.roaringbitmap.RoaringBitmap;
import java.util.Random;

/**
 * @projectName: wc
 * @package: others
 * @className: bitMapFilterDemo
 * @author: NelsonWu
 * @description: deduplication with a bitmap
 * @date: 2024/2/25 15:16
 * @version: 1.0
 */
public class bitMapFilterDemo {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);  // set the global parallelism to 1

        // Override the data generator's map method to emit random integers in 0-9 for testing
        // Output is of type String: num:0, num:1, num:2, ... num:9
        Random random = new Random();
        DataGeneratorSource<String> dataGeneratorSource = new DataGeneratorSource<>(
                new GeneratorFunction<Long, String>() {
                    @Override
                    public String map(Long aLong) throws Exception {
                        int i = random.nextInt(10);
                        return "num:" + i;
                    }
                },
                20,
                RateLimiterStrategy.perSecond(1),
                Types.STRING
        );

        DataStreamSource<String> stringDataStreamSource = env.fromSource(
                dataGeneratorSource,
                WatermarkStrategy.noWatermarks(),
                "data-generator"
        );
        KeyedStream<String, String> keyedStream = stringDataStreamSource.keyBy(
                new KeySelector<String, String>() {
                    @Override
                    public String getKey(String value) throws Exception {
                        String[] split = value.split(":");
                        return split[1];
                    }
                }
        );

        keyedStream.process(
                new ProcessFunction<String, String>() {

                    private ValueState<RoaringBitmap> bitmapValueState;

                    @Override
                    public void open(Configuration parameters) throws Exception {
                        super.open(parameters);
                        // Register the state that holds the bitmap. No default value is passed,
                        // because the default-value constructor is deprecated; an empty state
                        // is handled in processElement instead.
                        ValueStateDescriptor<RoaringBitmap> descriptor = new ValueStateDescriptor<>(
                                "bitmapState",
                                TypeInformation.of(new TypeHint<RoaringBitmap>() {}));
                        bitmapValueState = getRuntimeContext().getState(descriptor);
                    }

                    @Override
                    public void processElement(String s, ProcessFunction<String, String>.Context context, Collector<String> collector) throws Exception {
                        String key = s.split(":")[1]; // extract the key used for the membership check and the state update
                        int intKey = hash2Int(key);
                        // Read the bitmap from state; keyed state is scoped to the current key
                        RoaringBitmap bitmap = bitmapValueState.value();
                        if (bitmap == null) {
                            bitmap = new RoaringBitmap();  // first element seen for this key
                        }
                        if (!bitmap.contains(intKey)) {
                            bitmap.add(intKey);
                            bitmapValueState.update(bitmap);
                            collector.collect(s);
                        }
                    }
                }
        ).print();

        env.execute("bitMapApplication");
    }

    public static int hash2Int(String value){
        // The data is a String but the bitmap only stores ints, so a hash function converts the String to an int
        return Hashing.murmur3_32().hashString(value, Charsets.UTF_8).asInt();
    }

}

Execution result:

image.png
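
The check-then-add step inside processElement can also be tried without a Flink cluster. The sketch below is a stand-alone approximation under a few assumptions: the class DedupLogicSketch and its dedup method are invented for illustration, a java.util.BitSet stands in for RoaringBitmap, and because the test keys are the digits 0-9 the key is parsed directly instead of being hashed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

// Stand-alone approximation of the dedup step; not the Flink job itself.
class DedupLogicSketch {
    // Emits a record only the first time its numeric key is seen.
    static List<String> dedup(List<String> records) {
        BitSet seen = new BitSet();  // stand-in for the RoaringBitmap kept in state
        List<String> out = new ArrayList<>();
        for (String record : records) {
            int key = Integer.parseInt(record.split(":")[1]);  // "num:3" -> 3
            if (!seen.get(key)) {
                seen.set(key);
                out.add(record);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("num:3", "num:1", "num:3", "num:7", "num:1");
        System.out.println(dedup(input));  // [num:3, num:1, num:7]
    }
}
```

In the Flink job the same logic runs once per incoming element, and the bitmap is read from and written back to state rather than held in a local variable.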

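PS: The core deduplication logic is written the same way as the Bloom filter version. In theory a bitmap is more accurate than a Bloom filter, because it records each integer exactly and has no false positives of its own. In this program the string key is first hashed to a 32-bit int, though, so two different keys that happen to hash to the same value would be treated as duplicates.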
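
The hash2Int step is where exactness can be lost: a hash can map two different strings to the same int. The sketch below illustrates this and one possible alternative, under stated assumptions. String.hashCode stands in for murmur3 (the strings "Aa" and "BB" are a well-known hashCode collision), a HashSet<Integer> stands in for the bitmap since only set-of-int semantics matter here, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustration only: compares hashing strings to ints with assigning exact ids.
class HashVsDictionarySketch {
    // Approach 1: hash the string to an int. Returns true if the value is treated as new.
    static boolean addHashed(Set<Integer> seen, String value) {
        return seen.add(value.hashCode());  // String.hashCode stands in for murmur3 here
    }

    // Approach 2: give each distinct string the next dense id, so no two strings share an id.
    static boolean addWithDictionary(Set<Integer> seen, Map<String, Integer> dictionary, String value) {
        Integer id = dictionary.get(value);
        if (id == null) {
            id = dictionary.size();
            dictionary.put(value, id);
        }
        return seen.add(id);
    }

    public static void main(String[] args) {
        Set<Integer> hashedSeen = new HashSet<>();
        System.out.println(addHashed(hashedSeen, "Aa"));  // true
        System.out.println(addHashed(hashedSeen, "BB"));  // false: collides with "Aa" and is wrongly dropped

        Set<Integer> exactSeen = new HashSet<>();
        Map<String, Integer> dictionary = new HashMap<>();
        System.out.println(addWithDictionary(exactSeen, dictionary, "Aa"));  // true
        System.out.println(addWithDictionary(exactSeen, dictionary, "BB"));  // true: distinct ids, both kept
        System.out.println(addWithDictionary(exactSeen, dictionary, "Aa"));  // false: a real duplicate
    }
}
```

The dictionary approach is exact but has to store every distinct string somewhere, so it trades memory for accuracy. When the keys are already integers, as with the 0-9 test data here, they can be put into the bitmap directly and neither step is needed.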

References:

cloud.tencent.com/developer/a…

www.jianshu.com/p/201b45f2a…

blog.csdn.net/fan_yi_bo/a…