Big Data Development: Flume Advanced Components (Part 14)


1. Flume's Advanced Components

  1. Source Interceptors: a Source can be configured with one or more interceptors that process the collected data one after another, in the order they are listed.
  2. Channel Selectors: control how a Source distributes events when it is connected to multiple Channels, i.e. whether every Channel receives all events or events are routed to different Channels according to rules.
  3. Sink Processors: control how Sinks consume data. A Channel can be followed by multiple Sinks, and the Sink Processor decides which Sink takes the data from the Channel. (A config skeleton showing where all three hooks attach follows this list.)
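As a rough orientation, the skeleton below shows where each of the three hooks plugs into an agent definition. This is a sketch only: the names a1, r1, c1/c2, k1/k2 and g1 are placeholders, and the chosen types (timestamp, replicating, load_balance) are just examples.

# Interceptors hang off the source
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = timestamp

# The channel selector also hangs off the source and decides which channel(s) receive each event
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2

# The sink processor is configured on a sink group and decides which sink drains the channel
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance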
1.1 Event

An Event is the basic unit of data transfer in Flume and also the basic unit of a transaction. For text files, one line is typically one Event. An Event consists of a header and a body (a sketch of a single event follows the list below).

  • The header has type Map<String,String> and can hold attribute information for later use; a Source can add key-value pairs to each event's header, and Channels and Sinks can then read those header values.
  • The body is the raw content of the collected record (e.g. the line of text).
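For intuition, one event from the example in section 2 might look roughly like this after the interceptors have run (a sketch; the exact header contents depend on the configured interceptors):

header: { "logType" : "videoInfo", "timestamp" : "1664613474505" }
body:   {"id":"1","start_time":"1111","type":"videoInfo"}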
1.2 Source Interceptors

Common Source Interceptor types (a combined configuration example follows this list):

  1. Timestamp Interceptor

    Adds a timestamp entry to the event header.

  2. Host Interceptor

    Adds a host entry to the event header; the value is the hostname or IP of the current machine.

  3. Search and Replace Interceptor

    Searches the event body according to a configured pattern and replaces the matches. This interceptor modifies the event body, i.e. it changes the originally collected content.

  4. Static Interceptor

    Adds a fixed key and value to the event header.

  5. Regex Extractor Interceptor

    Extracts data from the event body according to a configured regex, producing keys and values that are added to the header.

  6. ...
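A minimal sketch combining several of these interceptors on one source; the agent/source names and the static key/value are placeholders, not taken from the example in the next section:

a1.sources.r1.interceptors = i1 i2 i3
# Timestamp Interceptor: adds a "timestamp" key to the event header
a1.sources.r1.interceptors.i1.type = timestamp
# Host Interceptor: adds a "host" key with the current machine's hostname/IP
a1.sources.r1.interceptors.i2.type = host
# Static Interceptor: adds a fixed key/value pair to the event header
a1.sources.r1.interceptors.i3.type = static
a1.sources.r1.interceptors.i3.key = datacenter
a1.sources.r1.interceptors.i3.value = bj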

2. Example

The goal is to store the collected data in per-day, per-type directories. For example, a live-streaming site produces log data that we want to collect into HDFS, partitioned by type: video records in one directory, user records in another, and gift records in a third.

For this requirement, the agent uses an exec source (file-based input), a file channel (to keep the data complete and accurate), and an HDFS sink. The HDFS path cannot be hard-coded: it must contain a date variable so data is partitioned by day, plus a data-type variable so different types of records land in different directories.
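The HDFS sink supports exactly this kind of templating: %Y%m%d escape sequences expand to a date taken from the event timestamp (supplied here by useLocalTimeStamp), and %{headerKey} expands to a value from the event header. The path used in the config below is therefore:

a1.sinks.k1.hdfs.path = hdfs://192.168.234.100:9000/moreType/%Y%m%d/%{logType}

where logType is put into the header by a Regex Extractor interceptor.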

The Flume config is as follows:

Official exec source documentation: flume.apache.org/releases/co…

Official file channel documentation: flume.apache.org/releases/co…

Official search and replace interceptor documentation: flume.apache.org/releases/co…

Official regex-filtering-interceptor documentation: flume.apache.org/releases/co…

file-to-hdfs-moreType.conf

# Define the source, sink, and channel names
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /root/log/moreType.log

# Configure the interceptors. Multiple interceptors run in the order listed
# a1.sources.r1.interceptors = search-replace
# Give each interceptor an alias
a1.sources.r1.interceptors = i1 i2 i3 i4
# Rule 1: rename video_info
a1.sources.r1.interceptors.i1.type = search_replace
a1.sources.r1.interceptors.i1.searchPattern = "type":"video_info"
a1.sources.r1.interceptors.i1.replaceString = "type":"videoInfo"
# Rule 2: rename user_info
a1.sources.r1.interceptors.i2.type = search_replace
a1.sources.r1.interceptors.i2.searchPattern = "type":"user_info"
a1.sources.r1.interceptors.i2.replaceString = "type":"userInfo"
# Rule 3: rename gift_record
a1.sources.r1.interceptors.i3.type = search_replace
a1.sources.r1.interceptors.i3.searchPattern = "type":"gift_record"
a1.sources.r1.interceptors.i3.replaceString = "type":"giftRecord"

# Rule 4: extract the type value into the header as logType
# (double backslash so the backslash survives properties-file parsing)
a1.sources.r1.interceptors.i4.type = regex_extractor
a1.sources.r1.interceptors.i4.regex = "type":"(\\w+)"
a1.sources.r1.interceptors.i4.serializers = s1
a1.sources.r1.interceptors.i4.serializers.s1.name = logType

# Configure the channel (file channel for durability)
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /root/software/apache-flume-1.10.1-bin/data/moreType/checkpoint
a1.channels.c1.dataDirs = /root/software/apache-flume-1.10.1-bin/data/moreType/data

# Configure the sink
a1.sinks.k1.type = hdfs
# HDFS output directory: per-day and per-type, using the logType header value
a1.sinks.k1.hdfs.path = hdfs://192.168.234.100:9000/moreType/%Y%m%d/%{logType}
# Default fileType is SequenceFile; DataStream does not compress the data
a1.sinks.k1.hdfs.fileType = DataStream
# Plain text output
a1.sinks.k1.hdfs.writeFormat = Text
# Default rollInterval is 30 s; roll a new file every 3600 s (1 hour) here
a1.sinks.k1.hdfs.rollInterval = 3600
# Default rollSize is 1024 bytes; 0 disables size-based rolling. Set to 128 MB here
a1.sinks.k1.hdfs.rollSize = 134217728
# Default rollCount is 10 events per file; 0 disables count-based rolling
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true

# File prefix added to the files generated on HDFS
a1.sinks.k1.hdfs.filePrefix = data
# File suffix
a1.sinks.k1.hdfs.fileSuffix = .log

# Bind the source and sink to the channel
# Tell the source which channel to write to and the sink which channel to read from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Test data

{"id":"1","start_time":"1111","type":"video_info"}
{"id":"1","start_time":"222","type":"user_info"}
{"id":"1","start_time":"333","type":"gift_record"}
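Walking one record through the interceptor chain (a sketch; the date in the resulting path is the local date of the run):

# input line read by the exec source (event body)
{"id":"1","start_time":"1111","type":"video_info"}

# after search_replace interceptor i1 rewrites the body
{"id":"1","start_time":"1111","type":"videoInfo"}

# after regex_extractor interceptor i4 adds a header entry
header: logType=videoInfo

# resulting HDFS directory (useLocalTimeStamp supplies the date)
/moreType/20221001/videoInfo/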

Start Flume

bin/flume-ng agent --name a1 --conf conf --conf-file conf/file-to-hdfs-moreType.conf -Dflume.root.logger=INFO,console
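Since the exec source tails /root/log/moreType.log, the test records have to be appended to that file (in another shell) while the agent is running, for example:

echo '{"id":"1","start_time":"1111","type":"video_info"}' >> /root/log/moreType.log
echo '{"id":"1","start_time":"222","type":"user_info"}' >> /root/log/moreType.log
echo '{"id":"1","start_time":"333","type":"gift_record"}' >> /root/log/moreType.log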

Check HDFS

hdfs dfs -ls -R /moreType

Result:


View the file contents:

hdfs dfs -cat /moreType/20221001/videoInfo/data.1664613474505.log.tmp


3. Channel Selectors

The Channel Selector types are:

  1. Replicating Channel Selector (the default)
  2. Multiplexing Channel Selector

3.1 Replicating Channel Selector

Sends every Event collected by the Source to all of the configured Channels.

flume.apache.org/releases/co…

a1.sources = r1
a1.channels = c1 c2 c3
a1.sources.r1.selector.type = replicating
a1.sources.r1.channels = c1 c2 c3
a1.sources.r1.selector.optional = c3

In the configuration above, c3 is an optional channel: a failure to write to c3 is simply ignored. Since c1 and c2 are not marked optional, failing to write to either of them causes the transaction to fail. In plain terms, the source sends its data to all three channels c1, c2 and c3; delivery to c1 and c2 is guaranteed, but delivery to c3 is not.


Using the Replicating selector, the data collected by the source is duplicated to two Channels, each followed by its own sink that stores the data in a different storage system for later use. This is common in practice because different storage systems have different strengths and use cases; the typical pair is an HDFS sink plus a Kafka sink (a Kafka sink sketch follows this list).

  • The HDFS sink persists data for offline (batch) processing.
  • The Kafka sink feeds data to real-time (streaming) processing.
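The demo below keeps things simple by using a logger sink and an HDFS sink instead; for reference, a Kafka sink on one of the channels would look roughly like this (a sketch only: the broker address and topic name are assumptions, not part of this setup):

# hypothetical Kafka sink draining channel c2
a1.sinks.k2.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k2.kafka.bootstrap.servers = 192.168.234.100:9092
a1.sinks.k2.kafka.topic = flume_events
a1.sinks.k2.channel = c2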
Create tcp-to-replicatingchannel.conf in Flume's conf directory:

# Define the source, sink, and channel names
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Configure the channel selector [replicating is the default, so this line could be omitted]
a1.sources.r1.selector.type = replicating

# Configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Configure sink 1
a1.sinks.k1.type = logger

# Configure sink 2
a1.sinks.k2.type = hdfs
# HDFS output directory
a1.sinks.k2.hdfs.path = hdfs://192.168.234.100:9000/replicating
# Default fileType is SequenceFile; DataStream does not compress the data
a1.sinks.k2.hdfs.fileType = DataStream
# Plain text output
a1.sinks.k2.hdfs.writeFormat = Text
# Default rollInterval is 30 s; roll a new file every 3600 s (1 hour) here
a1.sinks.k2.hdfs.rollInterval = 3600
# Default rollSize is 1024 bytes; 0 disables size-based rolling. Set to 128 MB here
a1.sinks.k2.hdfs.rollSize = 134217728
# Default rollCount is 10 events per file; 0 disables count-based rolling
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.useLocalTimeStamp = true
# File prefix added to the files generated on HDFS
a1.sinks.k2.hdfs.filePrefix = data
# File suffix
a1.sinks.k2.hdfs.fileSuffix = .log

# Bind the source and sinks to the channels
# Tell the source which channels to write to and each sink which channel to read from
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
  1. Start Flume

    bin/flume-ng agent --name a1 --conf conf --conf-file conf/tcp-to-replicatingchannel.conf -Dflume.root.logger=INFO,console

  2. Connect to the socket with telnet

    telnet localhost 44444

    Type hello


  3. Check the Flume log


  4. Check the HDFS output

    hdfs dfs -ls -R /replicating


  5. View the file contents on HDFS

    hdfs dfs -cat hdfs://192.168.234.100:9000/replicating/data.1665024181894.log.tmp


3.2 Multiplexing Channel Selector

Routes each Event to a Channel based on a value in the Event's header.

a1.sources = r1
a1.channels = c1 c2 c3 c4
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = state
a1.sources.r1.selector.mapping.CZ = c1
a1.sources.r1.selector.mapping.US = c2 c3
a1.sources.r1.selector.default = c4

This example configures four channels: c1, c2, c3 and c4. Which channel an event is sent to depends on the value of the state attribute in the event header; the header key is chosen with selector.header.

  1. If state is CZ, the event goes to c1.
  2. If state is US, the event goes to c2 and c3.
  3. For any other value, the event goes to c4.

These rules are defined with selector.mapping and selector.default, which is how data can be routed to different channels according to rules.


Here a Regex Extractor interceptor is used to generate a key-value pair in the Event header, which the Multiplexing selector then uses as its routing key.

Assume the raw input data is:

{"name":"jack","age":19,"city":"bj"}
{"name":"tom","age":26,"city":"sh"}

Now configure the agent. The contents of tcp-to-multiplexingchannel.conf:

# Define the source, sink, and channel names
a1.sources = r1
a1.channels = c1 c2
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Configure the source interceptor: extract the city value into the header
# (double backslash so the backslash survives properties-file parsing)
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
a1.sources.r1.interceptors.i1.regex = "city":"(\\w+)"
a1.sources.r1.interceptors.i1.serializers = s1
a1.sources.r1.interceptors.i1.serializers.s1.name = city

# Configure the channel selector (multiplexing routes by the city header value)
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = city

# Channel selector rules
# Beijing (bj) events go to c1
a1.sources.r1.selector.mapping.bj = c1
# Events from any other city go to c2
a1.sources.r1.selector.default = c2

# Configure the channels
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100

# Configure sink 1
a1.sinks.k1.type = logger

# Configure sink 2
a1.sinks.k2.type = hdfs
# HDFS output directory
a1.sinks.k2.hdfs.path = hdfs://192.168.234.100:9000/replicating
# Default fileType is SequenceFile; DataStream does not compress the data
a1.sinks.k2.hdfs.fileType = DataStream
# Plain text output
a1.sinks.k2.hdfs.writeFormat = Text
# Default rollInterval is 30 s; roll a new file every 3600 s (1 hour) here
a1.sinks.k2.hdfs.rollInterval = 3600
# Default rollSize is 1024 bytes; 0 disables size-based rolling. Set to 128 MB here
a1.sinks.k2.hdfs.rollSize = 134217728
# Default rollCount is 10 events per file; 0 disables count-based rolling
a1.sinks.k2.hdfs.rollCount = 0
a1.sinks.k2.hdfs.useLocalTimeStamp = true
# File prefix added to the files generated on HDFS
a1.sinks.k2.hdfs.filePrefix = data
# File suffix
a1.sinks.k2.hdfs.fileSuffix = .log

# Bind the source and sinks to the channels
# Tell the source which channels to write to and each sink which channel to read from
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
  1. Start Flume

    bin/flume-ng agent --name a1 --conf conf --conf-file conf/tcp-to-multiplexingchannel.conf -Dflume.root.logger=INFO,console

  2. Connect to the socket with telnet

    telnet localhost 44444

    Type: {"name":"jack","age":19,"city":"bj"}

    {"name":"tom","age":26,"city":"sh"}

  3. Check HDFS

    hdfs dfs -ls -R /replicating

    hdfs dfs -cat hdfs://192.168.234.100:9000/replicating/data.1665026324530.log.tmp


  4. Check the Flume log


4. Sink Processors

There are three Sink Processor types:

  1. Default Sink Processor

    Requires no sink group configuration; this is the plain form used so far, with a single sink behind each channel.

  2. Load Balancing Sink Processor

    A load-balancing processor: a channel can be followed by multiple sinks that belong to one sink group, and events are distributed among them round-robin or randomly according to the configured selector, reducing the load on any single sink.

  3. Failover Sink Processor

    A failover processor: a channel can be followed by multiple sinks that belong to one sink group, each with a priority. The highest-priority sink handles the data by default; if it fails, a lower-priority sink takes over, so no data is lost.

4.1 Load Balancing Sink Processor


Flume configuration for the agent that hosts the sink group:

# Define the source, sink, and channel names
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Configure the sinks [the avro sink's default batch-size is 100 events;
# set it to 1 here so every event is forwarded immediately for the demo]
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.182.101
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = 192.168.182.102
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 1

# Configure the sink processor (load balancing, round-robin)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.backoff = true
a1.sinkgroups.g1.processor.selector = round_robin

# Bind the source and sinks to the channel
# Tell the source which channel to write to and each sink which channel to read from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

balancing-01.conf

# Define the source, channel, and sink names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink [filePrefix differs between the two agents so their output files can be told apart]
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.182.100:9000/load_balance
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = data101
a1.sinks.k1.hdfs.fileSuffix = .log
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

balancing-02.conf

# The agent name is a1
# Define the source, channel, and sink names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink [filePrefix differs between the two agents so their output files can be told apart]
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.182.100:9000/load_balance
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = data102
a1.sinks.k1.hdfs.fileSuffix = .log
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

Note: start balancing-01 and balancing-02 first, then start the agent with the sink group (a startup sketch follows).
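A sketch of the startup sequence, assuming balancing-01.conf runs on 192.168.182.101, balancing-02.conf runs on 192.168.182.102, and the sink-group config (file name assumed here as load-balancing.conf; the article does not name it) runs on the machine hosting the netcat source:

# on 192.168.182.101
bin/flume-ng agent --name a1 --conf conf --conf-file conf/balancing-01.conf -Dflume.root.logger=INFO,console

# on 192.168.182.102
bin/flume-ng agent --name a1 --conf conf --conf-file conf/balancing-02.conf -Dflume.root.logger=INFO,console

# finally, on the collector machine, the agent that contains the sink group
bin/flume-ng agent --name a1 --conf conf --conf-file conf/load-balancing.conf -Dflume.root.logger=INFO,console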

4.2 Failover Sink Processor


# Define the source, sink, and channel names
a1.sources = r1
a1.channels = c1
a1.sinks = k1 k2

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Configure the sinks [the avro sink's default batch-size is 100 events;
# set it to 1 here so every event is forwarded immediately for the demo]
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = 192.168.182.101
a1.sinks.k1.port = 41414
a1.sinks.k1.batch-size = 1

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = 192.168.182.102
a1.sinks.k2.port = 41414
a1.sinks.k2.batch-size = 1

# Configure the sink processor (failover: the highest-priority sink handles the data)
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 10
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Bind the source and sinks to the channel
# Tell the source which channel to write to and each sink which channel to read from
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1

failover01

# The agent name is a1
# Define the source, channel, and sink names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink [filePrefix differs between the two agents so their output files can be told apart]
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.182.100:9000/failover
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = data101
a1.sinks.k1.hdfs.fileSuffix = .log
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

failover02

# The agent name is a1
# Define the source, channel, and sink names
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Configure the source
a1.sources.r1.type = avro
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 41414
# Configure the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Configure the sink [filePrefix differs between the two agents so their output files can be told apart]
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://192.168.182.100:9000/failover
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
a1.sinks.k1.hdfs.rollInterval = 3600
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.filePrefix = data102
a1.sinks.k1.hdfs.fileSuffix = .log
# Wire the components together
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
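To verify the failover behaviour, a sketch of a test, assuming failover01/failover02 are started on 192.168.182.101/192.168.182.102 first and then the sink-group agent (startup commands analogous to the load-balancing case above):

# send data through the netcat source on the collector machine
telnet localhost 44444
# -> events should land on 192.168.182.102, since k2 has the higher priority (10),
#    producing data102* files under /failover

# stop the Flume agent on 192.168.182.102 (e.g. Ctrl+C), then send more data
telnet localhost 44444
# -> new events should now be handled by k1 (priority 5) and land on 192.168.182.101,
#    producing data101* files

# inspect the HDFS output
hdfs dfs -ls -R /failover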