背景:项目中有一些消费 kafka 消息者打印的日志中缺失 traceId,项目中使用的是 javaagent 的方式接入 opentelemetry,使用的版本是 1.4.x。
Kafka 生产者
首先看看生产者消息发送方法的方法增强
public static class ProducerAdvice {
@Advice.OnMethodEnter(suppress = Throwable.class)
public static void onEnter(
@Advice.FieldValue("apiVersions") ApiVersions apiVersions,
@Advice.Argument(value = 0, readOnly = false) ProducerRecord<?, ?> record,
@Advice.Argument(value = 1, readOnly = false) Callback callback,
@Advice.Local("otelContext") Context context,
@Advice.Local("otelScope") Scope scope) {
// 获取当前线程上下文
Context parentContext = Java8BytecodeBridge.currentContext();
// 生成 context (包括生成 span)
context = tracer().startProducerSpan(parentContext, record);
callback = new ProducerCallback(callback, parentContext, context);
// 判断是否需要传播链路
if (tracer().shouldPropagate(apiVersions)) {
try {
// 往请求头中添加 span 信息
tracer().inject(context, record.headers(), SETTER);
} catch (IllegalStateException e) {
// headers must be read-only from reused record. try again with new one.
record =
new ProducerRecord<>(
record.topic(),
record.partition(),
record.timestamp(),
record.key(),
record.value(),
record.headers());
tracer().inject(context, record.headers(), SETTER);
}
}
// 将生成的 context 设置到当前上下文(默认使用threadLocal)
scope = context.makeCurrent();
}
@Advice.OnMethodExit(onThrowable = Throwable.class, suppress = Throwable.class)
public static void stopSpan(
@Advice.Thrown Throwable throwable,
@Advice.Local("otelContext") Context context,
@Advice.Local("otelScope") Scope scope) {
// 将当前上下文清空,并恢复到父上下文(如果存在的话)
scope.close();
if (throwable != null) {
tracer().endExceptionally(context, throwable);
}
// span finished by ProducerCallback
}
}
从上述源码可以看到:
- 方法增强会在发送消息前生成 context,如果存在 parentContext 则使用 parentContext 的 traceId。
- 如果支持传播,则会将 span 组装好放到消息 header,具体参数名看使用的是什么协议,默认是 traceparent,这里使用 jaeger,所以参数名是 uber-trace-id。
- 将生成的 context 设置到当前线程上下文
- 消息发送结束时会将线程上下文恢复到原本的上下文
Producer traceId 消息缺失分析
- 由于当前不存在上下文,生产者在发送完消息后就直接将上下文清空了,所以后续打印的日志都不存在上下文
- 通过 callback 打印日志,由于 callback 是在另外一个线程执行,所以也获取不到上下文
Kafka 消费者
首先看看消费者消息消费方法的方法增强
public void transform(TypeTransformer transformer) {
transformer.applyAdviceToMethod(
isMethod()
.and(isPublic())
.and(named("records"))
.and(takesArgument(0, String.class))
.and(returns(Iterable.class)),
KafkaConsumerInstrumentation.class.getName() + "$IterableAdvice");
transformer.applyAdviceToMethod(
isMethod()
.and(isPublic())
.and(named("records"))
.and(takesArgument(0, named("org.apache.kafka.common.TopicPartition")))
.and(returns(List.class)),
KafkaConsumerInstrumentation.class.getName() + "$ListAdvice");
transformer.applyAdviceToMethod(
isMethod()
.and(isPublic())
.and(named("iterator"))
.and(takesArguments(0))
.and(returns(Iterator.class)),
KafkaConsumerInstrumentation.class.getName() + "$IteratorAdvice");
}
@SuppressWarnings("unused")
public static class IterableAdvice {
@Advice.OnMethodExit(suppress = Throwable.class)
public static void wrap(
@Advice.Return(readOnly = false) Iterable<ConsumerRecord<?, ?>> iterable) {
if (iterable != null) {
// 覆盖迭代器
iterable = new TracingIterable(iterable, tracer());
}
}
}
@SuppressWarnings("unused")
public static class ListAdvice {
@Advice.OnMethodExit(suppress = Throwable.class)
public static void wrap(@Advice.Return(readOnly = false) List<ConsumerRecord<?, ?>> iterable) {
if (iterable != null) {
// 覆盖List
iterable = new TracingList(iterable, tracer());
}
}
}
@SuppressWarnings("unused")
public static class IteratorAdvice {
@Advice.OnMethodExit(suppress = Throwable.class)
public static void wrap(
@Advice.Return(readOnly = false) Iterator<ConsumerRecord<?, ?>> iterator) {
if (iterator != null) {
// 覆盖迭代器
iterator = new TracingIterator(iterator, tracer());
}
}
}
public class TracingIterator implements Iterator<ConsumerRecord<?, ?>> {
private final Iterator<ConsumerRecord<?, ?>> delegateIterator;
private final KafkaConsumerTracer tracer;
/**
* Note: this may potentially create problems if this iterator is used from different threads. But
* at the moment we cannot do much about this.
*/
@Nullable private Context currentContext;
@Nullable private Scope currentScope;
public TracingIterator(
Iterator<ConsumerRecord<?, ?>> delegateIterator, KafkaConsumerTracer tracer) {
this.delegateIterator = delegateIterator;
this.tracer = tracer;
}
@Override
public boolean hasNext() {
closeScopeAndEndSpan();
return delegateIterator.hasNext();
}
@Override
public ConsumerRecord<?, ?> next() {
// in case they didn't call hasNext()...
// 清除上一条消息的上下文
closeScopeAndEndSpan();
ConsumerRecord<?, ?> next = delegateIterator.next();
if (next != null) {
// 生成上下文,包含span,如果消息有上下文则提取
currentContext = tracer.startSpan(next);
currentScope = currentContext.makeCurrent();
}
return next;
}
private void closeScopeAndEndSpan() {
if (currentScope != null) {
currentScope.close();
currentScope = null;
tracer.end(currentContext);
currentContext = null;
}
}
@Override
public void remove() {
delegateIterator.remove();
}
}
从上述源码可以看到:
- 迭代器每获取下一个元素,会先清除上下文,然后再生成新的上下文,如果消息头有上下文信息则提取
- 每次判断 hasNext 是都会清除上下文,所以遍历时循环结束,上下文也会被清除
Consumer traceId 消息缺失分析
- 同上在消费消息时,会使用 foreach,如果逻辑不在 foreach 之间,则会造成 traceId 缺失。
- 有的是使用线程池,同样会造成 traceId 缺失
Opentelemetry 日志上下文
同样是先看看源码
public class OpenTelemetryContextDataProvider implements ContextDataProvider {
/**
* Returns context from the current span when available.
*
* @return A map containing string versions of the traceId, spanId, and traceFlags, which can then
* be accessed from layout components
*/
@Override
public Map<String, String> supplyContextData() {
Span currentSpan = Span.current();
if (!currentSpan.getSpanContext().isValid()) {
return Collections.emptyMap();
}
Map<String, String> contextData = new HashMap<>();
SpanContext spanContext = currentSpan.getSpanContext();
contextData.put(TRACE_ID, spanContext.getTraceId());
contextData.put(SPAN_ID, spanContext.getSpanId());
contextData.put(TRACE_FLAGS, spanContext.getTraceFlags().asHex());
return contextData;
}
}
从源码中可以看出,不是使用的 MDC,另外可以尝试使用下面方式使用自定义的上下文数据存储(亲测不行,这个可以通过修改源码解决)
java -javaagent:path/to/opentelemetry-javaagent.jar -Dio.opentelemetry.context.contextStorageProvider=your.package.CustomContextStorageProvider -jar your-application.jar