How to Get End-to-End Tracing with Sleuth + OkHttp

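Background

Sleuth provides a complete service-tracing solution that covers distributed tracing and performance analysis. In a distributed system, Sleuth is responsible for collecting the trace data and Zipkin for displaying it. With Sleuth, the nodes a request passes through can be linked together, which makes problems easier to locate and troubleshoot. Recently, however, I ran into a problem while using Sleuth.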

[Figure: image.png]

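As shown in the figure, after an external request passes through the Gateway, the Gateway forwards it to the backend services using an OkHttp utility class. At runtime we found that Sleuth automatically generates a traceId when the request reaches the Gateway, but once the request is sent on to serviceA or serviceB, the traceId generated in the Gateway is replaced by a new one. As a result, the traceId returned to the client cannot be used to follow the whole call chain. Requests made through Feign, however, keep their traceId, which suggests that Sleuth itself is working correctly.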

Problem Analysis

Checking TraceFilter

The traceId returned to the client is carried in the HTTP response, so the first thing to check is TraceFilter.

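import java.io.IOException;

import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

import brave.Span;
import brave.Tracer;
import org.springframework.cloud.sleuth.instrument.web.TraceWebServletAutoConfiguration;
import org.springframework.core.annotation.Order;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.GenericFilterBean;

@Component
@Order(TraceWebServletAutoConfiguration.TRACING_FILTER_ORDER + 1) // run right after Sleuth's own filter
public class TraceFilter extends GenericFilterBean {

    private final Tracer tracer;

    TraceFilter(Tracer tracer) {
        this.tracer = tracer;
    }

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {
        Span currentSpan = this.tracer.currentSpan();
        if (currentSpan == null) {
            // no active span: pass the request through unchanged
            chain.doFilter(request, response);
            return;
        }
        String traceId = currentSpan.context().traceIdString();

        // expose the current traceId to the client
        ((HttpServletResponse) response).addHeader("TRACE-ID", traceId);
        chain.doFilter(request, response);
    }
}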

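TraceFilter only copies the traceId generated by the Tracer into the HTTP response, so it cannot be the source of the problem. It does, however, reference TraceWebServletAutoConfiguration in its @Order annotation. Looking this class up shows that it belongs to Sleuth, the protagonist of this article.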

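How Sleuth Tracks the traceId

There is plenty of material online about Sleuth, but most of it only explains how to use it. The official documentation already covers that well, so those articles add little. Feign calls work, which means Sleuth supports Feign, so the natural question is whether OkHttp can be supported in the same way. After searching around, I found that Sleuth supports many components, commonly Feign, gRPC, Zuul, Redis, and messaging. HTTP clients form one large category, with support for RestTemplate, WebClient, Netty HttpClient, HttpClientBuilder, and so on. Unfortunately, OkHttp is not on the list. Does that mean OkHttp cannot be used with Sleuth? It was time to read the Sleuth source code.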

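Sleuth Source Code Analysis

Spring Cloud Sleuth propagates the traceId through a Servlet Filter. No matter how many layers are wrapped around it, this basic principle does not change. So we go straight to the TracingFilter class in the source and look at its implementation.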

 public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    HttpServletRequest httpRequest = (HttpServletRequest) request;
    HttpServletResponse httpResponse = servlet.httpResponse(response);

    // Prevent duplicate spans for the same request
    TraceContext context = (TraceContext) request.getAttribute(TraceContext.class.getName());
    if (context != null) {
      // A forwarded request might end up on another thread, so make sure it is scoped
      Scope scope = currentTraceContext.maybeScope(context);
      try {
        chain.doFilter(request, response);
      } finally {
        scope.close();
      }
      return;
    }

    Span span = handler.handleReceive(extractor, httpRequest);

    // Add attributes for explicit access to customization or span context
    request.setAttribute(SpanCustomizer.class.getName(), span.customizer());
    request.setAttribute(TraceContext.class.getName(), span.context());

    Throwable error = null;
    Scope scope = currentTraceContext.newScope(span.context());
    try {
      // any downstream code can see Tracer.currentSpan() or use Tracer.currentSpanCustomizer()
      chain.doFilter(httpRequest, httpResponse);
    } catch (IOException | ServletException | RuntimeException | Error e) {
      error = e;
      throw e;
    } finally {
      scope.close();
      if (servlet.isAsync(httpRequest)) { // we don't have the actual response, handle later
        servlet.handleAsync(handler, httpRequest, httpResponse, span);
      } else { // we have a synchronous response, so we can finish the span
        handler.handleSend(ADAPTER.adaptResponse(httpRequest, httpResponse), error, span);
      }
    }
  }

The core logic is Span span = handler.handleReceive(extractor, httpRequest);. Following that call further:

  /** Creates a potentially noop span representing this request */
  Span nextSpan(TraceContextOrSamplingFlags extracted, Req request) {
    Boolean sampled = extracted.sampled();
    // only recreate the context if the http sampler made a decision
    if (sampled == null && (sampled = sampler.trySample(adapter, request)) != null) {
      extracted = extracted.sampled(sampled.booleanValue());
    }
    return extracted.context() != null
        ? tracer.joinSpan(extracted.context())
        : tracer.nextSpan(extracted);
  }

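The logic is clear: if extracted.context() is not null, joinSpan is called, which in essence reuses the traceId carried by the HTTP request; otherwise a new span is created. The key is therefore how TraceContextOrSamplingFlags is created.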

  static final class ExtraFieldExtractor<C, K> implements Extractor<C> {
    final ExtraFieldPropagation<K> propagation;
    final Extractor<C> delegate;
    final Propagation.Getter<C, K> getter;

    ExtraFieldExtractor(ExtraFieldPropagation<K> propagation, Getter<C, K> getter) {
      this.propagation = propagation;
      this.delegate = propagation.delegate.extractor(getter); // extraction is delegated to the wrapped propagation
      this.getter = getter;
    }

The class that actually performs the extraction is B3Propagation.

    @Override public TraceContextOrSamplingFlags extract(C carrier) {
      if (carrier == null) throw new NullPointerException("carrier == null");

      // try to extract single-header format
      TraceContextOrSamplingFlags extracted = singleExtractor.extract(carrier);
      if (!extracted.equals(TraceContextOrSamplingFlags.EMPTY)) return extracted;

      // Start by looking at the sampled state as this is used regardless
      // Official sampled value is 1, though some old instrumentation send true
      String sampled = getter.get(carrier, propagation.sampledKey);
      Boolean sampledV = sampled != null
          ? sampled.equals("1") || sampled.equalsIgnoreCase("true")
          : null;
      boolean debug = "1".equals(getter.get(carrier, propagation.debugKey));

      String traceIdString = getter.get(carrier, propagation.traceIdKey);
      // It is ok to go without a trace ID, if sampling or debug is set
      if (traceIdString == null) return TraceContextOrSamplingFlags.create(sampledV, debug);

      // Try to parse the trace IDs into the context
      TraceContext.Builder result = TraceContext.newBuilder();
      if (result.parseTraceId(traceIdString, propagation.traceIdKey)    // parses X-B3-TraceId from the request
          && result.parseSpanId(getter, carrier, propagation.spanIdKey) // parses X-B3-SpanId from the request
          && result.parseParentId(getter, carrier, propagation.parentSpanIdKey)) { // parses X-B3-ParentSpanId from the request
        if (sampledV != null) result.sampled(sampledV.booleanValue());
        if (debug) result.debug(true);
        return TraceContextOrSamplingFlags.create(result.build());
      }
      return TraceContextOrSamplingFlags.EMPTY; // trace context is malformed so return empty
    }
  }
}

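Only when all three conditions hold is a TraceContextOrSamplingFlags carrying the incoming trace context created; otherwise TraceContextOrSamplingFlags.EMPTY is returned. The fix now becomes clear: as long as the request carries X-B3-TraceId, X-B3-SpanId, and X-B3-ParentSpanId, the loss of the traceId should be resolved.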
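
The decision described above can be illustrated with a small JDK-only model. It is a simplified sketch, not Brave's implementation: the class name B3ExtractionSketch, the Outcome enum, and the validity rule (16 or 32 lower-hex characters for the trace ID, 16 for span IDs, following the B3 convention) are assumptions for illustration, and the parent ID is treated as optional here. Brave's real parsing rules may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the multi-header B3 extraction decision (illustrative only). */
final class B3ExtractionSketch {

    /** What the server-side filter ends up doing with the incoming request. */
    enum Outcome { JOIN_EXISTING_TRACE, START_NEW_TRACE }

    /**
     * Decides whether the incoming headers carry a usable trace context.
     * Note: real HTTP header lookup is case-insensitive; this map lookup is not.
     */
    static Outcome decide(Map<String, String> headers) {
        String traceId = headers.get("X-B3-TraceId");
        String spanId = headers.get("X-B3-SpanId");
        if (!isLowerHex(traceId, 16, 32) || !isLowerHex(spanId, 16)) {
            return Outcome.START_NEW_TRACE; // missing or malformed IDs: nothing to join
        }
        // Assumption: the parent ID may be absent, but if present it must be well-formed.
        String parentId = headers.get("X-B3-ParentSpanId");
        if (parentId != null && !isLowerHex(parentId, 16)) {
            return Outcome.START_NEW_TRACE;
        }
        return Outcome.JOIN_EXISTING_TRACE;
    }

    /** True if s is non-null, has one of the allowed lengths, and is lower-case hex. */
    static boolean isLowerHex(String s, int... allowedLengths) {
        if (s == null) return false;
        boolean lengthOk = false;
        for (int len : allowedLengths) {
            if (s.length() == len) lengthOk = true;
        }
        if (!lengthOk) return false;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
            if (!hex) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // A request sent by plain OkHttp: no B3 headers at all.
        Map<String, String> withoutB3 = new HashMap<>();
        System.out.println(decide(withoutB3)); // prints START_NEW_TRACE

        // The same request once the trace headers have been added.
        Map<String, String> withB3 = new HashMap<>();
        withB3.put("X-B3-TraceId", "463ac35c9f6413ad");
        withB3.put("X-B3-SpanId", "a2fb4a1d1a96d312");
        System.out.println(decide(withB3)); // prints JOIN_EXISTING_TRACE
    }
}
```

A request that OkHttp sends without B3 headers lands in START_NEW_TRACE in this model, which matches the behaviour observed behind the Gateway.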

Solution

Based on the analysis above, the solution is as follows:

  1. Obtain the tracer while handling the request.
Tracer tracer = Tracing.currentTracer();
  2. Before OkHttp sends the request, add X-B3-TraceId, X-B3-SpanId, and X-B3-ParentSpanId to the headers.
   private void addTracer(Map<String, String> headers, Tracer tracer) {
        Span span = tracer.currentSpan();
        if (span == null) {
            return; // no active span, so there is nothing to propagate
        }
        TraceContext context = span.context();
        headers.put("X-B3-TraceId", context.traceIdString());
        headers.put("X-B3-SpanId", context.spanIdString());
        if (StringUtils.isBlank(context.parentIdString())) {
            // no parent span: fall back to the current span ID
            headers.put("X-B3-ParentSpanId", context.spanIdString());
        } else {
            headers.put("X-B3-ParentSpanId", context.parentIdString());
        }
    }
  3. After the change, verification showed the expected result.
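
The header-building step can also be sketched without any tracing or HTTP library, which makes it easy to unit test. The class name B3HeaderSketch and the method b3Headers are hypothetical. Unlike the snippet above, this variant leaves out X-B3-ParentSpanId when there is no parent instead of falling back to the span ID; that choice is an assumption based on the B3 convention that a root span carries no parent ID.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Library-independent sketch of building the B3 headers for an outgoing request. */
final class B3HeaderSketch {

    /**
     * Builds the multi-header B3 representation of a trace context.
     * traceId and spanId are required; parentId may be null or empty.
     */
    static Map<String, String> b3Headers(String traceId, String spanId, String parentId) {
        if (traceId == null || spanId == null) {
            throw new IllegalArgumentException("traceId and spanId are required");
        }
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("X-B3-TraceId", traceId);
        headers.put("X-B3-SpanId", spanId);
        if (parentId != null && !parentId.isEmpty()) {
            headers.put("X-B3-ParentSpanId", parentId);
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers =
            b3Headers("463ac35c9f6413ad", "a2fb4a1d1a96d312", null);
        // With OkHttp, each entry would be copied onto the outgoing request,
        // for example via Request.Builder.header(name, value) (not executed here).
        headers.forEach((name, value) -> System.out.println(name + ": " + value));
    }
}
```

In the Gateway, the three arguments would come from tracer.currentSpan().context(), using traceIdString(), spanIdString(), and parentIdString() as in the snippet above.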

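Summary

  1. The same approach should work for any HTTP client that Sleuth does not support.
  2. This solution is admittedly not very elegant. A better one would be to wrap the OkHttp calls, following the way RestTemplate and the other components are integrated.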
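
For other HTTP clients, a single header may be easier to attach than three. The extractor shown earlier tries a single-header format before the X-B3-* headers; in B3 this is the b3 header, laid out as {TraceId}-{SpanId}-{SamplingState}-{ParentSpanId} with the last two fields optional. The following is a minimal sketch of formatting that value. The class name B3SingleHeaderSketch and its format method are hypothetical, and the sketch only emits the parent ID when a sampling decision is also present.

```java
/** Sketch of formatting the single "b3" propagation header (illustrative only). */
final class B3SingleHeaderSketch {

    /**
     * Formats traceId-spanId[-samplingState[-parentId]].
     * sampled may be null (no decision); parentId may be null (root span).
     */
    static String format(String traceId, String spanId, Boolean sampled, String parentId) {
        if (traceId == null || spanId == null) {
            throw new IllegalArgumentException("traceId and spanId are required");
        }
        StringBuilder sb = new StringBuilder(traceId).append('-').append(spanId);
        if (sampled != null) {
            sb.append('-').append(sampled ? '1' : '0');
            // The parent ID is positional, so it is only written after a sampling state.
            if (parentId != null) {
                sb.append('-').append(parentId);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String value = format("463ac35c9f6413ad", "a2fb4a1d1a96d312", true, null);
        // The outgoing request would then carry the header "b3" with this value.
        System.out.println("b3: " + value);
    }
}
```

Whether the receiving service accepts the single-header form depends on its Brave and Sleuth versions, so it is worth verifying against the actual backend before relying on it.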