Flume 01: [Case Study] Channel Selectors 01: Multiple Channels with the Multiplexing Channel Selector


1. Multiple Channels with the Multiplexing Channel Selector


In this case study we use the multiplexing selector to route the data collected by the source to two channels according to a rule, and then store the data from the different channels in different media.

We need the regex_extractor interceptor here to generate a key-value pair in each Event's header, which the multiplexing selector then uses as its routing rule.

Assume our raw data looks like this:

{"name":"jack","age":19,"city":"bj"}
{"name":"tom","age":26,"city":"sh"}

Now configure the Agent. Copy the contents of tcp-to-replicatingchannel.conf, then add the source interceptor, change the channel selector, and adjust the path of the HDFS sink. On bigdata04, create a new file tcp-to-multiplexingchannel.conf:

[root@bigdata04 conf]# vi tcp-to-multiplexingchannel.conf
# The agent's name is a1
# Name the source, channel, and sink components
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Configure the source component
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Configure the source interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = "city":"(\\w+)"
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = city

# Configure the channel selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = city
a1.sources.r1.selector.mapping.bj = c1
a1.sources.r1.selector.default = c2

# Configure the channel components
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Configure the sink components
a1.sinks.k1.type = logger

a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://192.168.182.100:9000/multiplexing
a1.sinks.k2.hdfs.fileType = DataStream
a1.sinks.k2.hdfs.writeFormat = Text
a1.sinks.k2.hdfs.rollInterval = 3600
a1.sinks.k2.hdfs.rollSize = 134217728
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.useLocalTimeStamp = true
a1.sinks.k2.hdfs.filePrefix = data
a1.sinks.k2.hdfs.fileSuffix = .log

# Wire the components together
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
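The multiplexing selector reads the `city` header and routes each event: `bj` goes to c1 (the logger sink), and everything else falls through to the default channel c2 (the HDFS sink). A rough Python model of that decision, where the channel names are just labels:

```python
# Mirrors the a1.sources.r1.selector.* lines in the config above
MAPPING = {"bj": "c1"}
DEFAULT = "c2"

def select_channel(headers: dict) -> str:
    """Pick a channel the way the multiplexing selector would."""
    return MAPPING.get(headers.get("city"), DEFAULT)

print(select_channel({"city": "bj"}))  # c1
print(select_channel({"city": "sh"}))  # c2
```

Note that an event with no `city` header at all also lands in the default channel c2.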

Start the Agent:

[root@bigdata04 apache-flume-1.9.0-bin]#  bin/flume-ng agent --name a1 --conf conf --conf-file conf/tcp-to-multiplexingchannel.conf -Dflume.root.logger=INFO,console

Generate test data by connecting to the socket with telnet:

[root@bigdata04 ~]# telnet localhost 44444                  
Trying ::1...
Connected to localhost.
Escape character is '^]'.
{"name":"jack","age":19,"city":"bj"}
OK
{"name":"tom","age":26,"city":"sh"}
OK
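If telnet is not available, the same test lines can be pushed from a short Python script. The host and port below match the netcat source configured above; the function only opens a TCP connection and writes newline-terminated lines, exactly what the telnet session does by hand:

```python
import socket

def send_events(host: str, port: int, lines: list) -> None:
    """Send each line, newline-terminated, over a TCP connection."""
    with socket.create_connection((host, port)) as sock:
        for line in lines:
            sock.sendall((line + "\n").encode("utf-8"))

if __name__ == "__main__":
    # Assumes the agent's netcat source is listening on localhost:44444
    send_events("localhost", 44444, [
        '{"name":"jack","age":19,"city":"bj"}',
        '{"name":"tom","age":26,"city":"sh"}',
    ])
```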

Check the results. The Flume console prints log output like this:

2020-05-03 10:19:58,181 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{city=bj} body: 7B 22 6E 61 6D 65 22 3A 22 6A 61 63 6B 22 2C 22 {"name":"jack"," }
2020-05-03 10:20:43,058 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.hdfs.BucketWriter.open(BucketWriter.java:246)] Creating hdfs://192.168.182.100:9000/multiplexing/data.1588472338039.tmp

Inspect the data that sink k2 wrote to HDFS:

[root@bigdata04 ~]# hdfs dfs -cat hdfs://192.168.182.100:9000/multiplexing/data.1588472338039.tmp
{"name":"tom","age":26,"city":"sh"}

This achieves the goal: data collected by the source is routed to different channels according to a rule and ultimately written to different storage media.

That is the Multiplexing Channel Selector in action. Here is a second example, which routes lines to different Kafka topics based on their first letter:

# Requirement: match the first letter of each line
# Lines starting with C go to the Kafka topic ChangeRecord
# Lines starting with P go to the Kafka topic ProduceRecord
# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2 k3
a1.channels = c1 c2 c3

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /home/data_log/data.log

# Configure the source interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = ^(C|P|E)
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = mytype

# Configure the channel selector
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = mytype
a1.sources.r1.selector.mapping.C = c1
a1.sources.r1.selector.mapping.P = c2
a1.sources.r1.selector.default = c3

# Describe the sink
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = ChangeRecord
a1.sinks.k1.kafka.bootstrap.servers = localhost:9092

a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.topic = ProduceRecord
a1.sinks.k2.kafka.bootstrap.servers = localhost:9092

a1.sinks.k3.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k3.topic = EnvironmentData
a1.sinks.k3.kafka.bootstrap.servers = localhost:9092

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
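As in the first example, the selector keys off the header written by regex_extractor, here the line's leading letter stored under `mytype`. Combining the two steps, a Python sketch of the end-to-end routing, where the Kafka topics are just return values rather than real producers:

```python
import re

# Mirrors the interceptor regex and selector mapping in the config above
TYPE_PATTERN = re.compile(r"^(C|P|E)")
MAPPING = {"C": "ChangeRecord", "P": "ProduceRecord"}
DEFAULT = "EnvironmentData"  # everything else goes to c3 -> k3

def route(line: str) -> str:
    """Extract the 'mytype' header and map it to a Kafka topic."""
    match = TYPE_PATTERN.match(line)
    mytype = match.group(1) if match else None
    return MAPPING.get(mytype, DEFAULT)

print(route("C101,changed"))   # ChangeRecord
print(route("P202,produced"))  # ProduceRecord
print(route("E303,env"))       # EnvironmentData
```

Note that lines starting with E match the interceptor regex but have no explicit mapping, so they also fall through to the default channel c3, along with lines that match nothing at all.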