记一次 opentelemetry 日志打印 traceId 缺失问题分析

579 阅读3分钟

背景:项目中有一些消费 kafka 消息者打印的日志中缺失 traceId,项目中使用的是 javaagent 的方式接入 opentelemetry,使用的版本是 1.4.x。

Kafka 生产者

首先看看生产者消息发送方法的方法增强

public static class ProducerAdvice {

    @Advice.OnMethodEnter(suppress = Throwable.class)
    public static void onEnter(
        @Advice.FieldValue("apiVersions") ApiVersions apiVersions,
        @Advice.Argument(value = 0, readOnly = false) ProducerRecord<?, ?> record,
        @Advice.Argument(value = 1, readOnly = false) Callback callback,
        @Advice.Local("otelContext") Context context,
        @Advice.Local("otelScope") Scope scope) {
      // 获取当前线程上下文
      Context parentContext = Java8BytecodeBridge.currentContext();
      // 生成 context (包括生成 span)
      context = tracer().startProducerSpan(parentContext, record);

      callback = new ProducerCallback(callback, parentContext, context);
    
      // 判断是否需要传播链路
      if (tracer().shouldPropagate(apiVersions)) {
        try {
          // 往请求头中添加 span 信息
          tracer().inject(context, record.headers(), SETTER);
        } catch (IllegalStateException e) {
          // headers must be read-only from reused record. try again with new one.
          record =
              new ProducerRecord<>(
                  record.topic(),
                  record.partition(),
                  record.timestamp(),
                  record.key(),
                  record.value(),
                  record.headers());

          tracer().inject(context, record.headers(), SETTER);
        }
      }
      // 将生成的 context 设置到当前上下文(默认使用threadLocal)
      scope = context.makeCurrent();
    }

    @Advice.OnMethodExit(onThrowable = Throwable.class, suppress = Throwable.class)
    public static void stopSpan(
        @Advice.Thrown Throwable throwable,
        @Advice.Local("otelContext") Context context,
        @Advice.Local("otelScope") Scope scope) {
      
      // 将当前上下文清空,并恢复到父上下文(如果存在的话)
      scope.close();

      if (throwable != null) {
        tracer().endExceptionally(context, throwable);
      }
      // span finished by ProducerCallback
    }
  }

从上述源码可以看到:

  1. 方法增强会在发送消息前生成 context,如果存在 parentContext 则使用 parentContext 的 traceId。
  2. 如果支持传播,则会将 span 组装好放到消息 header,具体参数名看使用的是什么协议,默认是 traceparent,这里使用 jaeger,所以参数名是 uber-trace-id。
  3. 将生成的 context 设置到当前线程上下文
  4. 消息发送结束时会将线程上下文恢复到原本的上下文

Producer traceId 消息缺失分析

  1. 由于当前不存在上下文,生产者在发送完消息后就直接将上下文清空了,所以后续打印的日志都不存在上下文
  2. 通过 callback 打印日志,由于 callback 是在另外一个线程执行,所以也获取不到上下文

Kafka 消费者

首先看看消费者消息消费方法的方法增强

public void transform(TypeTransformer transformer) {
    transformer.applyAdviceToMethod(
        isMethod()
            .and(isPublic())
            .and(named("records"))
            .and(takesArgument(0, String.class))
            .and(returns(Iterable.class)),
        KafkaConsumerInstrumentation.class.getName() + "$IterableAdvice");
    transformer.applyAdviceToMethod(
        isMethod()
            .and(isPublic())
            .and(named("records"))
            .and(takesArgument(0, named("org.apache.kafka.common.TopicPartition")))
            .and(returns(List.class)),
        KafkaConsumerInstrumentation.class.getName() + "$ListAdvice");
    transformer.applyAdviceToMethod(
        isMethod()
            .and(isPublic())
            .and(named("iterator"))
            .and(takesArguments(0))
            .and(returns(Iterator.class)),
        KafkaConsumerInstrumentation.class.getName() + "$IteratorAdvice");
  }

  @SuppressWarnings("unused")
  public static class IterableAdvice {

    @Advice.OnMethodExit(suppress = Throwable.class)
    public static void wrap(
        @Advice.Return(readOnly = false) Iterable<ConsumerRecord<?, ?>> iterable) {
      if (iterable != null) {
        // 覆盖迭代器
        iterable = new TracingIterable(iterable, tracer());
      }
    }
  }

  @SuppressWarnings("unused")
  public static class ListAdvice {

    @Advice.OnMethodExit(suppress = Throwable.class)
    public static void wrap(@Advice.Return(readOnly = false) List<ConsumerRecord<?, ?>> iterable) {
      if (iterable != null) {
        // 覆盖List
        iterable = new TracingList(iterable, tracer());
      }
    }
  }

  @SuppressWarnings("unused")
  public static class IteratorAdvice {

    @Advice.OnMethodExit(suppress = Throwable.class)
    public static void wrap(
        @Advice.Return(readOnly = false) Iterator<ConsumerRecord<?, ?>> iterator) {
      if (iterator != null) {
        // 覆盖迭代器
        iterator = new TracingIterator(iterator, tracer());
      }
    }
  }
  
  
public class TracingIterator implements Iterator<ConsumerRecord<?, ?>> {

  private final Iterator<ConsumerRecord<?, ?>> delegateIterator;
  private final KafkaConsumerTracer tracer;

  /**
   * Note: this may potentially create problems if this iterator is used from different threads. But
   * at the moment we cannot do much about this.
   */
  @Nullable private Context currentContext;

  @Nullable private Scope currentScope;

  public TracingIterator(
      Iterator<ConsumerRecord<?, ?>> delegateIterator, KafkaConsumerTracer tracer) {
    this.delegateIterator = delegateIterator;
    this.tracer = tracer;
  }

  @Override
  public boolean hasNext() {
    closeScopeAndEndSpan();
    return delegateIterator.hasNext();
  }

  @Override
  public ConsumerRecord<?, ?> next() {
    // in case they didn't call hasNext()...
    // 清除上一条消息的上下文
    closeScopeAndEndSpan();

    ConsumerRecord<?, ?> next = delegateIterator.next();

    if (next != null) {
      // 生成上下文,包含span,如果消息有上下文则提取
      currentContext = tracer.startSpan(next);
      currentScope = currentContext.makeCurrent();
    }
    return next;
  }

  private void closeScopeAndEndSpan() {
    if (currentScope != null) {
      currentScope.close();
      currentScope = null;
      tracer.end(currentContext);
      currentContext = null;
    }
  }

  @Override
  public void remove() {
    delegateIterator.remove();
  }
}

从上述源码可以看到:

  1. 迭代器每获取下一个元素,会先清除上下文,然后再生成新的上下文,如果消息头有上下文信息则提取
  2. 每次判断 hasNext 是都会清除上下文,所以遍历时循环结束,上下文也会被清除

Consumer traceId 消息缺失分析

  1. 同上在消费消息时,会使用 foreach,如果逻辑不在 foreach 之间,则会造成 traceId 缺失。
  2. 有的是使用线程池,同样会造成 traceId 缺失

Opentelemetry 日志上下文

同样是先看看源码

public class OpenTelemetryContextDataProvider implements ContextDataProvider {

  /**
   * Returns context from the current span when available.
   *
   * @return A map containing string versions of the traceId, spanId, and traceFlags, which can then
   *     be accessed from layout components
   */
  @Override
  public Map<String, String> supplyContextData() {
    Span currentSpan = Span.current();
    if (!currentSpan.getSpanContext().isValid()) {
      return Collections.emptyMap();
    }

    Map<String, String> contextData = new HashMap<>();
    SpanContext spanContext = currentSpan.getSpanContext();
    contextData.put(TRACE_ID, spanContext.getTraceId());
    contextData.put(SPAN_ID, spanContext.getSpanId());
    contextData.put(TRACE_FLAGS, spanContext.getTraceFlags().asHex());
    return contextData;
  }
}

从源码中可以看出,不是使用的 MDC,另外可以尝试使用下面方式使用自定义的上下文数据存储(亲测不行,这个可以通过修改源码解决)

java -javaagent:path/to/opentelemetry-javaagent.jar -Dio.opentelemetry.context.contextStorageProvider=your.package.CustomContextStorageProvider -jar your-application.jar