Flume example: tailing multiple appending files under monitored directories and writing them to HDFS


Data flow: Taildir Source -> Memory Channel -> HDFS Sink

Configuration file:

a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = TAILDIR
# JSON file that records each tailed file's inode, last read offset, and absolute path
a1.sources.r1.positionFile = /home/admin/taildir/taildir.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /home/admin/file/.*txt.*
a1.sources.r1.filegroups.f2 = /home/admin/logs/.*log.*

a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flumeTail/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = tail
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.hdfs.fileType = DataStream

a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
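
Once the agent has read some data, taildir.json holds one JSON entry per tailed file; its contents look roughly like this (the inode and pos values here are illustrative):

[{"inode":271633,"pos":14,"file":"/home/admin/file/3.txt"},{"inode":271634,"pos":21,"file":"/home/admin/logs/1.log"}]

This is what lets the source resume from the last read offset after a restart instead of re-reading files from the beginning.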

Create the monitored directories in advance:

~ mkdir -p /home/admin/file /home/admin/logs
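Depending on the Flume version, the Taildir source may not create the position-file directory for you, so it is safest to create that as well:

~ mkdir -p /home/admin/taildir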

Run:

~ flume-ng agent -n a1 -c conf -f conf/taildir-memory-hdfs.conf
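
To watch log output like the lines shown below directly on the console, append Flume's standard logger property:

~ flume-ng agent -n a1 -c conf -f conf/taildir-memory-hdfs.conf -Dflume.root.logger=INFO,console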

Prepare data by copying files into the monitored directories:

~ cp 1.log 2.log /home/admin/logs/
~ cp 3.txt 4.txt /home/admin/file/
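
The Taildir source also follows appends to files it is already tailing, so writing extra lines to an existing file is picked up as well, e.g.:

~ echo "new line" >> /home/admin/logs/1.log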

Flume log:

2023-02-24 03:21:18,946 INFO hdfs.BucketWriter: Creating /flumeTail/20230224/03/tail.1677180078858.tmp
2023-02-24 03:21:49,545 INFO hdfs.HDFSEventSink: Writer callback called.
2023-02-24 03:21:49,545 INFO hdfs.BucketWriter: Closing /flumeTail/20230224/03/tail.1677180078858.tmp
2023-02-24 03:21:49,606 INFO hdfs.BucketWriter: Renaming /flumeTail/20230224/03/tail.1677180078858.tmp to /flumeTail/20230224/03/tail.1677180078858
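
Once the .tmp suffix is dropped by the rename, the file is final on HDFS and can be inspected with the usual HDFS commands (the path and filename here come from the log above):

~ hdfs dfs -ls /flumeTail/20230224/03
~ hdfs dfs -cat /flumeTail/20230224/03/tail.1677180078858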