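Flume01: [Case Study] Matching File Names with Regular Expressions

This article is part of the Juejin "Newcomer Creation Gift" campaign. Let's start the Juejin writing journey together.

Flume Chinese documentation: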

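Flume NG advanced components

Besides Source, Channel, and Sink, a Flume Agent lets users configure additional components for finer control over the data flow: Interceptor, Channel Selector, and Sink Processor.

Interceptor

An Interceptor in Flume processes Events after the Source reads them and before they are written to the Channel on their way to the Sink. It can add useful information to the Event header or filter Events by their content, which gives a first pass of data cleaning.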

Users can configure several Interceptors, which are applied in the listed order and form an Interceptor chain.

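# Interceptors run in the order listed here: i1 first, then i2
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = regex_filter
# .* keeps every Event; replace it with the pattern for the Events you want to keep
a1.sources.r1.interceptors.i1.regex = .*
a1.sources.r1.interceptors.i2.type = timestamp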

This is very useful in real business scenarios. Flume NG 1.7 provides a number of interceptors, including the following:

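- Timestamp Interceptor: inserts a timestamp into the header of each Event. The key is timestamp and the value is the current time in milliseconds.
- Host Interceptor: inserts the host name or IP address of the machine running the current Agent into the header of each Event. The key is host by default and can be customized.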
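As a rough sketch, the two interceptors above might be configured together as follows. The agent name a1, the source name r1, the interceptor IDs, and the custom header name hostname are illustrative assumptions rather than part of the original setup.

```properties
# Attach two interceptors to source r1; they run in the listed order
a1.sources.r1.interceptors = i1 i2
# Timestamp Interceptor: adds the header "timestamp" with the processing time in milliseconds
a1.sources.r1.interceptors.i1.type = timestamp
# Host Interceptor: adds the host name or IP address of the machine running the agent
a1.sources.r1.interceptors.i2.type = host
# Store the host name rather than the IP address (useIP defaults to true)
a1.sources.r1.interceptors.i2.useIP = false
# Assumed custom header key; without this line the key is "host"
a1.sources.r1.interceptors.i2.hostHeader = hostname
```

With this in place, a downstream sink could in principle use both headers, for example to partition output by time and by originating machine.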

Channel Selector

A Channel Selector lets a Flume Source choose one or more target Channels and write the current Event to them.

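Flume provides two Channel Selector implementations:

- Replicating Channel Selector: writes each Event to every configured Channel. With this selector, Flume can deliver the same data to several systems so that each can process it differently. This is Flume's default Channel Selector.
- Multiplexing Channel Selector: routes each Event to one or more Channels chosen according to the value of an Event header.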
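A minimal sketch of a replicating setup, assuming one source r1 fanning out to three channels named c1, c2, and c3 (all names are illustrative; each channel still needs its own type and a sink bound to it):

```properties
a1.sources = r1
a1.channels = c1 c2 c3

# The source writes every Event to all three channels
a1.sources.r1.channels = c1 c2 c3
# Replicating is the default, so this line only makes the choice explicit
a1.sources.r1.selector.type = replicating
# Channels listed here are optional: a failed write to c3 is ignored,
# while a failed write to c1 or c2 fails the whole transaction
a1.sources.r1.selector.optional = c3
```

Marking a channel as optional is a judgment call: it suits a secondary consumer whose outage should not block the main pipeline.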

Sink Processor

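Flume allows several Sinks to be combined into one logical entity called a Sink Group. A Sink Processor builds on the Sink Group to provide load balancing and failover: when one Sink goes down, another Sink can take over.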
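The failover behavior described above could be sketched like this, assuming a group g1 that wraps two sinks k1 and k2 (the names, priorities, and penalty value are illustrative assumptions):

```properties
a1.sinks = k1 k2
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2

# Failover: Events go to the highest-priority sink that is still healthy
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
# Upper bound, in milliseconds, on how long a failed sink is held back before a retry
a1.sinkgroups.g1.processor.maxpenalty = 10000

# Load-balancing alternative: comment out the failover lines above and use
# a1.sinkgroups.g1.processor.type = load_balance
# a1.sinkgroups.g1.processor.selector = round_robin
# a1.sinkgroups.g1.processor.backoff = true
```

Here k1 receives all traffic while it is up, and k2 takes over only when k1 fails.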

Case study: the configuration below uses a regular expression to match file names.

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /opt/hadoop-2.7.7/logs/
a1.sources.r1.includePattern = .*log$
#a1.sources.r1.includePattern = ^hadoop-root-namenode.*

# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /tmp/flume

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1

To pick up only files whose names end with a given string:

a1.sources.r1.includePattern = .*log$

To pick up only files whose names start with a given string:

a1.sources.r1.includePattern = ^hadoop-root-namenode.*
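The two patterns can also be combined into one. The sketch below is an assumption about what a stricter pattern might look like, not part of the original case. Note that .*log$ matches any name that merely ends in the letters "log", with or without a dot before them. Writing the dot as [.] matches a literal dot without using a backslash, which would otherwise need to be doubled in a properties file.

```properties
# Names that start with hadoop-root-namenode and end with the literal suffix ".log"
a1.sources.r1.includePattern = ^hadoop-root-namenode.*[.]log$

# Suffix only, but requiring a real ".log" extension
# a1.sources.r1.includePattern = .*[.]log$
```

Only one includePattern line should be active per source, because a later assignment to the same key replaces the earlier one.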