分布式系统日志链路跟踪优化

1,757 阅读5分钟

一、背景

排查系统问题最多的手段便是查看系统日志,在分布式系统中,引入了traceId,作为全链路日志跟踪。

<property name="pattern" value="%d{yyyy-MM-dd HH:mm:ss.SSS} [${applicationName},%X{X-B3-TraceId:-}] [%thread] %-5level [%uid] [%C:%L]   %msg%n"/>

traceIdX-B3-TraceId 在MDC中值是一样的。但由于MDC内部使用的是ThreadLocal所以只有本线程才有效,子线程和下游的服务MDC里的值会丢失; 所以方案主要的难点是解决值的传递问题,主要包括以几下部分:

  • 异步情况(线程池)如何传递traceId给子线程
  • 异步服务调用(RPC)如何传递 本文将围绕上述两个问题,提供解决思路。

二、解决方案

1、扩展线程池

1.1 对ThreadLocalThreadPoolExecutor进行扩展,实现将父线程的MDC内容复制给子线程。

public class LogThreadLocalThreadPoolExecutor extends ThreadLocalThreadPoolExecutor {
	
    ...

    @Override
    public void execute(Runnable command) {
        super.execute(new MDCRunnable() {
            @Override
            protected void innerRun() {
                command.run();
            }
        });
    }

    public static abstract class MDCRunnable implements Runnable {

        private Map<String, String> copyOfContextMap = MDC.getCopyOfContextMap();

        @Override
        public void run() {
            if (copyOfContextMap != null) {
                MDC.setContextMap(copyOfContextMap);
            }
            try {
                innerRun();
            } finally {
                if (MapUtils.isNotEmpty(copyOfContextMap)) {
                    MDC.clear();
                }
            }
        }

        protected abstract void innerRun();
    }

}

1.2 阿里巴巴 TTL 传递traceId问题

项目中,我们采用TtlRunnable包了一层,由于logback的MDC实现内部使用的是ThreadLocal不能传递子线程,所以需要重写替换为阿里的TransmittableThreadLocal。 注:虽然InheritableThreadLocal能够传递给子线程,但对于线程池中的线程复用,将当前线程中的ThreadLocal传递给线程池中的其他线程,TransmittableThreadLocal提供了解决方案。

在ThreadLocal的需求场景即是TTL的潜在需求场景,如果你的业务需要『在使用线程池等会缓存线程的组件情况下传递ThreadLocal』则是TTL目标场景

public class ThreadLocalThreadPoolExecutor extends ThreadPoolExecutor {
    
    ...
 
    @Override
    public void execute(Runnable command) {
        super.execute(TtlRunnable.get(command));
    }
 
}

重构LogbackMDCAdapter,将ThreadLocalMap替换成TransmittableThreadLocal 注: 包名一定得是org.slf4j

image.png 初始化TtlMDCAdapter实例,并替换MDC中的adapter对象。 自动注入配置spring.factories

org.springframework.context.ApplicationContextInitializer=\
  com.moka.hcm.common.log.config.TtlMDCAdapterInitializer
public class TtlMDCAdapterInitializer implements ApplicationContextInitializer<ConfigurableApplicationContext> {
    @Override
    public void initialize(ConfigurableApplicationContext applicationContext) {
        //加载TtlMDCAdapter实例
        TtlMDCAdapter.getInstance();
    }
}

2、异步Feign调用问题

在线程池中通过feign调用其它服务时,由于hystrix是通过线程池来发送请求的,而MDC中ThreadLocal不能传递子线程变量,因此导致traceId不一致。

2021-12-21 19:20:58.385 [50c8a6e474e8d858] [http-nio-8080-exec-5] INFO  
2021-12-21 19:20:58.388 [50c8a6e474e8d858] [http-nio-8080-exec-5] INFO  
2021-12-21 19:21:05.195 [50c8a6e474e8d858] [pool-15-thread-1] INFO  
2021-12-21 19:21:05.219 [f85da8ad351a0ff3] [hystrix-9] INFO  
2021-12-21 19:21:05.319 [50c8a6e474e8d858] [http-nio-8080-exec-5] INFO 

解决方法:

public class ThreadLocalHystrixConcurrencyStrategy extends HystrixConcurrencyStrategy {
    /**
     * hystrix的线程池传递 MDC的context
     */
    @Override
    public <T> Callable<T> wrapCallable(Callable<T> callable) {
        Map<String, String> map = MDC.getCopyOfContextMap();
        return () -> {
            try {
                MDC.setContextMap(map);
                return callable.call();
            } finally {
                MDC.clear();
            }
        };
    }
}

看看修改后的效果:

2021-12-23 14:52:32.714 [205932514ea19d87] [http-nio-8080-exec-1] INFO  
2021-12-23 14:52:32.850 [205932514ea19d87] [hystrix-2] INFO  
2021-12-23 14:52:32.890 [205932514ea19d87] [http-nio-8080-exec-1] INFO  
2021-12-23 14:52:32.892 [205932514ea19d87] [http-nio-8080-exec-1] INFO  
2021-12-23 14:52:32.937 [205932514ea19d87] [http-nio-8080-exec-1] INFO  
2021-12-23 14:52:32.987 [205932514ea19d87] [pool-15-thread-2] INFO  
2021-12-23 14:52:33.036 [205932514ea19d87] [hystrix-3] INFO  
2021-12-23 14:52:33.153 [205932514ea19d87] [http-nio-8080-exec-1] INFO  

3、Slf4jCurrentTraceContext使用InheritableThreadLocal问题

日志打印的CurrentTraceContext默认使用的是InheritableThreadLocalInheritableThreadLocal可以解决父线程生成的变量传递到子线程中进行使用,而遇到了线程池时,就会出现一种问题:

在线程池状态下,第一次创建线程的时候会从父线程中copy inheritableThreadLocals中的数据,因此开始执行任务时候就会拿到父线程设置的变量,当核心线程数都创建完成后复用了前面创建的线程时,并不会触发复制父线程中的inheritableThreadLocals中的数据操作,因此即便在主线程中设置了新的traceId,也不会生效,异步的线程traceId使用的是旧traceId。

那么应该如何解决这个问题呢?解决方法如下:

自定义一个DefaultTraceContext,定义三类线程副本ThreadLocalInheritableThreadLocalTransmittableThreadLocal(方便拓展),本次方案采用的是TransmittableThreadLocal

/**
 * DefaultTraceContext
 *
 * @author weiyilin
 * @date 2022-01-29 16:12:43
 */
public class DefaultTraceContext extends CurrentTraceContext {
 
    static final ThreadLocal<TraceContext> DEFAULT = new ThreadLocal<>();
    static final InheritableThreadLocal<TraceContext> INHERITABLE = new InheritableThreadLocal<>();
    static final TransmittableThreadLocal<TraceContext> TRANSMITTABLE = new TransmittableThreadLocal<>();
 
 
    final ThreadLocal<TraceContext> local;
 
    /**
     * Uses a non-inheritable static thread local
     */
    public static CurrentTraceContext create() {
        return new DefaultTraceContext(DEFAULT);
    }
 
    /**
     * Uses an inheritable static thread local which allows arbitrary calls to {@link
     * Thread#start()} to automatically inherit this context. This feature is available as it is was
     * the default in Brave 3, because some users couldn't control threads in their applications.
     *
     * <p>This can be a problem in scenarios such as thread pool expansion, leading to data being
     * recorded in the wrong span, or spans with the wrong parent. If you are impacted by this,
     * switch to {@link #create()}.
     */
    public static CurrentTraceContext inheritable() {
        return new DefaultTraceContext(INHERITABLE);
    }
 
    public static CurrentTraceContext transmittable() {
        return new DefaultTraceContext(TRANSMITTABLE);
    }
 
    DefaultTraceContext(ThreadLocal<TraceContext> local) {
        if (local == null) {
            throw new NullPointerException("local == null");
        }
        this.local = local;
    }
 
    @Override
    public TraceContext get() {
        return local.get();
    }
 
    @Override
    public CurrentTraceContext.Scope newScope(@Nullable TraceContext currentSpan) {
        final TraceContext previous = local.get();
        local.set(currentSpan);
        class DefaultCurrentTraceContextScope implements CurrentTraceContext.Scope {
            @Override
            public void close() {
                local.set(previous);
            }
        }
        return new DefaultCurrentTraceContextScope();
    }
 
}

注入bean对象

/**
 * TraceConfiguration
 *
 * @author weiyilin
 * @date 2022-01-29 16:10:16
 */
@Configuration
public class TraceConfiguration {
 
    @Bean
    public CurrentTraceContext slf4jSpanLogger() {
        return Slf4jCurrentTraceContext.create(DefaultTraceContext.transmittable());
    }
 
}

image.png

参考文档