1 Troubleshooting Record
The online logging system loses data under low-traffic conditions. The current data flow of the log collection pipeline is:
event-logger -> flume -> hdfs / kafka
The event-logger component is provided to applications as a Maven dependency; internally it uses log4j2 to build a rolling logger:
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.rolling.CompositeTriggeringPolicy;
import org.apache.logging.log4j.core.appender.rolling.SizeBasedTriggeringPolicy;
import org.apache.logging.log4j.core.appender.rolling.TimeBasedTriggeringPolicy;
import org.apache.logging.log4j.core.config.AppenderRef;
import org.apache.logging.log4j.core.config.Configuration;
import org.apache.logging.log4j.core.config.LoggerConfig;
import org.slf4j.LoggerFactory;
public static org.slf4j.Logger getLogger(String catalog, String pattern) {
    LoggerContext ctx = (LoggerContext) org.apache.logging.log4j.LogManager.getContext(false);
    final Configuration config = ctx.getConfiguration();
    // String catalog = "event-logger-test";
    final String FILE_APPENDER = "file-appender";
    // Triggering policies: roll by time and by file size
    TimeBasedTriggeringPolicy timeBasedTriggeringPolicy = TimeBasedTriggeringPolicy
            .createPolicy("1", "false");
    timeBasedTriggeringPolicy.start();
    SizeBasedTriggeringPolicy sizeBasedTriggeringPolicy = SizeBasedTriggeringPolicy
            .createPolicy("200M");
    sizeBasedTriggeringPolicy.start();
    CompositeTriggeringPolicy compositeTriggeringPolicy = CompositeTriggeringPolicy
            .createPolicy(timeBasedTriggeringPolicy, sizeBasedTriggeringPolicy);
    compositeTriggeringPolicy.start();
    // Build the rolling file appender
    org.apache.logging.log4j.core.Appender fileAppender = org.apache.logging.log4j.core.appender.RollingFileAppender
            .newBuilder()
            .withName(catalog + FILE_APPENDER)
            .withFilePattern("/Users/shanjunwei/data/agent-smith" + "/" + catalog + "/" + catalog + ".%d{" + pattern + "}.%i.rio2")
            .withPolicy(compositeTriggeringPolicy)
            // .withLayout(layout)
            .withBufferedIo(true)
            .withImmediateFlush(false)
            // If set to true, an AppenderLoggingException (e.g. when the disk is full) is not propagated
            // to the caller but handled internally by log4j2
            .withIgnoreExceptions(false)
            .setConfiguration(config)
            .build();
    fileAppender.start();
    config.addAppender(fileAppender);
    AppenderRef[] appenderRefs = new AppenderRef[1];
    AppenderRef fileAppenderRef = AppenderRef.createAppenderRef(
            catalog + FILE_APPENDER,
            org.apache.logging.log4j.Level.INFO,
            null);
    appenderRefs[0] = fileAppenderRef;
    LoggerConfig loggerConfig = LoggerConfig.createLogger(
            false,
            org.apache.logging.log4j.Level.INFO,
            catalog,
            "false",
            appenderRefs,
            null,
            config,
            null);
    loggerConfig.addAppender(fileAppender, org.apache.logging.log4j.Level.INFO, null);
    config.addLogger(catalog, loggerConfig);
    ctx.updateLoggers();
    System.out.println("new logger " + catalog);
    org.slf4j.Logger logger = LoggerFactory.getLogger(catalog);
    return logger;
}
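For reference, here is a minimal usage sketch of the API above. The wrapper class name EventLoggerFactory, the catalog value, and the date pattern are illustrative assumptions; only getLogger(String, String) comes from the code shown above.

import org.slf4j.Logger;

public class EventLoggerUsageSketch {
    public static void main(String[] args) {
        // Assumed wrapper class that hosts the static getLogger(...) method shown above.
        // With a pattern such as "yyyy-MM-dd-HH" the TimeBasedTriggeringPolicy rolls the file hourly,
        // and the SizeBasedTriggeringPolicy rolls it once it reaches 200 MB.
        Logger logger = EventLoggerFactory.getLogger("event-logger-test", "yyyy-MM-dd-HH");

        // Because the appender is built with bufferedIo=true and immediateFlush=false,
        // this call may only land in the in-memory buffer rather than on disk.
        logger.info("one serialized event line");
    }
}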
Stepping through the code shows that the class that ultimately writes the file is org.apache.logging.log4j.core.appender.OutputStreamManager.
Its write logic is: with buffered I/O enabled, data that does not fill the buffer is first written to the in-memory buffer; with buffered I/O disabled, every EventLogger.log() call triggers a flush, which writes the data to disk regardless of whether the buffer is full.
protected synchronized void write(final byte[] bytes, final int offset, final int length, final boolean immediateFlush) {
    if (immediateFlush && byteBuffer.position() == 0) {
        writeToDestination(bytes, offset, length);
        flushDestination();
        return;
    }
    if (length >= byteBuffer.capacity()) {
        // if request length exceeds buffer capacity, flush the buffer and write the data directly
        flush();
        writeToDestination(bytes, offset, length);
    } else {
        if (length > byteBuffer.remaining()) {
            flush();
        }
        byteBuffer.put(bytes, offset, length);
    }
    if (immediateFlush) {
        flush();
    }
}
protected synchronized void flushBuffer(final ByteBuffer buf) {
    ((Buffer) buf).flip();
    try {
        if (buf.remaining() > 0) {
            writeToDestination(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
        }
    } finally {
        buf.clear();
    }
}
/**
 * Flushes any buffers.
 */
public synchronized void flush() {
    flushBuffer(byteBuffer);
    flushDestination();
}
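Under low traffic this buffering is exactly what delays persistence: small events keep landing in byteBuffer without ever reaching the file. A rough back-of-the-envelope sketch (the 200-byte event size is an assumption; the 8 KB buffer is log4j2's default, as noted in the analysis below):

import java.nio.ByteBuffer;

public class BufferMathSketch {
    public static void main(String[] args) {
        // log4j2's default appender buffer size is 8 KB (8192 bytes).
        ByteBuffer byteBuffer = ByteBuffer.allocate(8192);
        int eventSize = 200; // illustrative size of one serialized log event

        // With bufferedIo=true and immediateFlush=false, write() only calls byteBuffer.put(...)
        // while the event fits, so roughly this many events must arrive before anything
        // is flushed to the file:
        int eventsBeforeFlush = byteBuffer.capacity() / eventSize;
        System.out.println("events buffered before a size-triggered flush: " + eventsBeforeFlush);
        // Under low traffic, accumulating ~40 such events can easily take longer than the 70 s
        // threshold used by the cleanup job on the Flume side.
    }
}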
Flume then processes the log data; once Flume has finished processing a log file, it deletes it. The deletion logic is:
// Flume-side cleanup: delete a processed log file once it has not been modified for more than 70 s (70 000 ms)
File logDictionary = new File("/data/agent-smith/" + catLog);
for (File file : logDictionary.listFiles()) {
    System.out.println(file.lastModified());
    if (file.lastModified() + 70000 < System.currentTimeMillis()) {
        file.delete();
    }
}
With buffered I/O enabled in log4j2, data is persisted from the in-memory buffer to disk in two situations:
- the buffer is full (8 KB by default)
- the file rolls over; a flush is triggered right before switching to the new file
Because the deletion logic uses the last-modified time to decide whether a file will still be written to, the following situation can occur: when the check runs, the current file has not rolled over and the buffer is not full, but the file has not been modified for more than 70 seconds. Deleting the file at that moment loses the data still sitting in the buffer. The amount lost is small, at most around 8 KB; the sketch below walks through the window.
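A minimal sketch of that window. EventLoggerFactory is the same assumed wrapper class as in the earlier usage sketch; the catalog, pattern, and sleep duration are illustrative, while the 70 s threshold comes from the Flume-side cleanup code:

public class LossWindowSketch {
    public static void main(String[] args) throws Exception {
        // Assumed wrapper around the getLogger(...) method shown earlier.
        org.slf4j.Logger logger = EventLoggerFactory.getLogger("event-logger-test", "yyyy-MM-dd-HH");

        // Suppose earlier traffic has already been flushed, so the current .rio2 file exists on disk.
        // A single small event now stays in the 8 KB buffer (bufferedIo=true, immediateFlush=false),
        // and the file's lastModified timestamp does not advance.
        logger.info("one small event under low traffic");

        // No further events, no rollover, no full buffer for more than 70 seconds.
        Thread.sleep(75_000);

        // If the Flume-side cleanup runs now, lastModified + 70000 < System.currentTimeMillis()
        // is true, the file is deleted, and the event still sitting in the buffer is lost.
    }
}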
Why delete the files at all? One reason is to reclaim disk space once a file has been processed; the other is that the remaining files are used to determine whether the current hour's data has been fully processed.
2 Reflections
- When relying on a particular feature of a tool, make sure you understand that feature well enough.
- The assertions and implicit assumptions made while writing code need to be checked for how reliable they really are.
- It is fine to suspect a bug in an open-source component, but before doing so, confirm that your own usage is correct.