Troubleshooting a Log4j2 Data Loss Issue


1 Troubleshooting notes

The online logging system loses data in low-traffic scenarios. The current data flow of the log collection pipeline is:

event-logger -> flume -> hdfs / kafka

The event-logger component is provided to applications as a Maven dependency (API); internally it uses Log4j2 to build a rolling logger:

import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.rolling.CompositeTriggeringPolicy;
import org.apache.logging.log4j.core.appender.rolling.SizeBasedTriggeringPolicy;
import org.apache.logging.log4j.core.appender.rolling.TimeBasedTriggeringPolicy;
import org.apache.logging.log4j.core.config.AppenderRef;
import org.apache.logging.log4j.core.config.Configuration;
import org.apache.logging.log4j.core.config.LoggerConfig;
import org.slf4j.LoggerFactory;
public static org.slf4j.Logger getLogger(String catalog,String pattern) {
    LoggerContext ctx = (LoggerContext) org.apache.logging.log4j.LogManager.getContext(false);
    final Configuration config = ctx.getConfiguration();
   // String catalog = "event-logger-test";
    final String FILE_APPENDER = "file-appender";
    // Triggering policies: time-based + size-based
    TimeBasedTriggeringPolicy timeBasedTriggeringPolicy = TimeBasedTriggeringPolicy
            .createPolicy("1", "false");
    timeBasedTriggeringPolicy.start();
    SizeBasedTriggeringPolicy sizeBasedTriggeringPolicy = SizeBasedTriggeringPolicy
            .createPolicy("200M");
    sizeBasedTriggeringPolicy.start();
    CompositeTriggeringPolicy compositeTriggeringPolicy = CompositeTriggeringPolicy
            .createPolicy(timeBasedTriggeringPolicy, sizeBasedTriggeringPolicy);
    compositeTriggeringPolicy.start();
    // Set up the file appender
    org.apache.logging.log4j.core.Appender fileAppender = org.apache.logging.log4j.core.appender.RollingFileAppender
            .newBuilder()
            .withName(catalog + FILE_APPENDER)
            .withFilePattern("/Users/shanjunwei/data/agent-smith" + "/" + catalog + "/" + catalog +  ".%d{"+pattern+"}.%i.rio2")
            .withPolicy(compositeTriggeringPolicy)
            // .withLayout(layout)
            .withBufferedIo(true)
            .withImmediateFlush(false)
            // If set to true, AppenderLoggingException (e.g. when the disk is full) is not propagated to the caller; it is only handled inside log4j2
            .withIgnoreExceptions(false)
            .setConfiguration(config)
            .build();
    fileAppender.start();
    config.addAppender(fileAppender);

    AppenderRef[] appenderRefs = new AppenderRef[1];
    AppenderRef fileAppenderRef = AppenderRef.createAppenderRef(
            catalog + FILE_APPENDER,
            org.apache.logging.log4j.Level.INFO,
            null);
    appenderRefs[0] = fileAppenderRef;
    LoggerConfig loggerConfig = LoggerConfig.createLogger(
            false,
            org.apache.logging.log4j.Level.INFO,
            catalog,
            "false",
            appenderRefs,
            null,
            config,
            null);
    loggerConfig.addAppender(fileAppender, org.apache.logging.log4j.Level.INFO, null);
    config.addLogger(catalog, loggerConfig);
    ctx.updateLoggers();
    System.out.println("new logger " + catalog);
    org.slf4j.Logger logger = LoggerFactory.getLogger(catalog);
    return logger;
}
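
For context, a minimal caller might look like the following sketch; the catalog name, date pattern, and log payload are made up for illustration, and getLogger refers to the method above.

// Hypothetical caller of the getLogger method above.
org.slf4j.Logger logger = getLogger("demo-events", "yyyy-MM-dd-HH");
// With bufferedIo=true and immediateFlush=false, this line first lands in the
// appender's 8 KB in-memory buffer instead of going straight to disk.
logger.info("{\"id\":1,\"type\":\"click\"}");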

Stepping through the code shows that the class that ultimately writes the file is org.apache.logging.log4j.core.appender.OutputStreamManager.

Its write logic works as follows: when buffered IO is enabled and the data does not fill the buffer, it is first written to the in-memory buffer. When buffered IO is disabled, every call to EventLogger.log() triggers a flush, which pushes the data to disk regardless of whether the buffer is full.

protected synchronized void write(final byte[] bytes, final int offset, final int length, final boolean immediateFlush) {
    if (immediateFlush && byteBuffer.position() == 0) {
        writeToDestination(bytes, offset, length);
        flushDestination();
        return;
    }
    if (length >= byteBuffer.capacity()) {
        // if request length exceeds buffer capacity, flush the buffer and write the data directly
        flush();
        writeToDestination(bytes, offset, length);
    } else {
        if (length > byteBuffer.remaining()) {
            flush();
        }
        byteBuffer.put(bytes, offset, length);
    }
    if (immediateFlush) {
        flush();
    }
}
protected synchronized void flushBuffer(final ByteBuffer buf) {
    ((Buffer) buf).flip();
    try {
        if (buf.remaining() > 0) {
            writeToDestination(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
        }
    } finally {
        buf.clear();
    }
}

/**
 * Flushes any buffers.
 */
public synchronized void flush() {
    flushBuffer(byteBuffer);
    flushDestination();
}
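
To see the buffering behaviour in isolation, here is a small sketch (an illustration under assumptions, not code from the pipeline): it logs one short line through the appender built by getLogger, and the bytes typically only reach disk once something forces a flush, for example stopping the LoggerContext, which stops the appender and closes its OutputStreamManager.

// Illustration only: assumes the getLogger method above; the catalog name is hypothetical.
org.slf4j.Logger logger = getLogger("flush-demo", "yyyy-MM-dd-HH");
logger.info("a single short line"); // sits in the 8 KB buffer, far from full

// No rollover has happened and the buffer is not full, so the line is
// typically not yet visible on disk at this point.

// Stopping the LoggerContext stops all appenders, which closes the
// OutputStreamManager and flushes its buffer.
org.apache.logging.log4j.core.LoggerContext ctx =
        (org.apache.logging.log4j.core.LoggerContext)
                org.apache.logging.log4j.LogManager.getContext(false);
ctx.stop();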

The log data is then consumed by Flume; once Flume finishes processing a log file, it deletes it. The deletion logic is:

// Delete any file that has not been modified for more than 70 seconds (70000 ms).
File logDirectory = new File("/data/agent-smith/" + catLog);
File[] files = logDirectory.listFiles();
if (files != null) {
    for (File file : files) {
        System.out.println(file.lastModified());
        if (file.lastModified() + 70000 < System.currentTimeMillis()) {
            file.delete();
        }
    }
}

With buffered IO enabled in log4j2, data is persisted from the in-memory buffer to disk in two cases:

  1. The buffer fills up (8 KB by default)
  2. The file rolls over; a flush is triggered just before the switch

Because the deletion logic uses the last-modified time to decide whether a file will still be written to, the following scenario is possible: when the check runs, the current file has not rolled over and the buffer is not full, but the file has been idle for more than 70 seconds. Deleting the file at that point loses whatever data is still sitting in the buffer. The loss is small, at most around 8 KB.
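
One way to keep the 70-second cleanup safe, sketched here as an assumption rather than the fix that was actually adopted, is to periodically flush the appender on the producer side so that buffered data never outlives the deletion threshold. getManager().flush() is the public flush() shown in the OutputStreamManager source above, the appender name follows the catalog + "file-appender" convention used in getLogger, and the 30-second interval is an arbitrary example.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import org.apache.logging.log4j.core.LoggerContext;
import org.apache.logging.log4j.core.appender.RollingFileAppender;

// Sketch under assumptions: flush the rolling appender every 30 seconds so the
// 8 KB buffer never holds data longer than the 70-second delete threshold.
public static void schedulePeriodicFlush(String catalog) {
    LoggerContext ctx = (LoggerContext) org.apache.logging.log4j.LogManager.getContext(false);
    // The appender was registered as catalog + "file-appender" in getLogger above.
    RollingFileAppender appender = ctx.getConfiguration().getAppender(catalog + "file-appender");
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    // OutputStreamManager.flush() is public and synchronized (see source above).
    scheduler.scheduleAtFixedRate(() -> appender.getManager().flush(), 30, 30, TimeUnit.SECONDS);
}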

Why delete the files at all? One reason is to free disk space once a file has been processed; the other is that the files are used to determine whether the current hour's data has been fully processed.

2 Takeaways

  1. When using a tool, understand the specific features you rely on well enough
  2. When writing code, verify how solid the assertions and assumptions you depend on really are
  3. It is fine to suspect a bug in an open-source component, but first confirm that your own usage is correct