Building an Observability Stack: Logs, Metrics, and Distributed Tracing

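Observability is the "eyes" of a production system. This article walks through building a complete observability stack out of logs, metrics, and distributed traces.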


1. What Is Observability?

1.1 The Three Pillars

        ┌─────────────────┐
        │  Observability  │
        └────────┬────────┘
                 │
       ┌─────────┼─────────┐
       │         │         │
   ┌───▼────┐ ┌──▼────┐ ┌──▼───┐
   │  Logs  │ │Metrics│ │Traces│
   └────────┘ └───────┘ └──────┘
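
Type      Purpose           Examples
Logs      Event details     Error details, debug information
Metrics   Trend analysis    QPS, response time, error rate
Traces    Request tracing   Call chains, service dependencies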

2. Logging

2.1 ELK Architecture

Application logs → Filebeat → Kafka → Logstash → Elasticsearch → Kibana

2.2 Spring Boot Integration

<dependency>
    <groupId>net.logstash.logback</groupId>
    <artifactId>logstash-logback-encoder</artifactId>
    <version>7.4</version>
</dependency>

<!-- logback-spring.xml -->
<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
        <includeContext>true</includeContext>
        <includeMdc>true</includeMdc>
    </encoder>
</appender>

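2.3 Logging Conventions

// Requires: import static net.logstash.logback.argument.StructuredArguments.kv;

// ✅ Structured logging: each key-value pair becomes its own JSON field
log.info("Order placed", kv("userId", userId), kv("orderId", orderId));

// Output (abridged, illustrative values): {"@timestamp":"2024-01-01T00:00:00.000Z","level":"INFO","message":"Order placed","userId":"123","orderId":"456"}

// ❌ Avoid: string concatenation buries the values in free text, which is hard to parse
log.info("User " + userId + " created order " + orderId);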
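To make the benefit of structured logging concrete, here is a minimal sketch that renders a log event as one JSON line using only the JDK. The class `StructuredLogSketch` and its methods are hypothetical names introduced for illustration. In a real project the encoder configured above does this work, and its output contains more fields than this sketch produces.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of any logging library.
class StructuredLogSketch {

    // Escapes backslashes, double quotes, and common control characters for a JSON string value.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '"': sb.append("\\\""); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Renders one log event as a single-line JSON object; field order follows the map's iteration order.
    static String toJsonLine(String level, String message, Map<String, String> fields) {
        StringBuilder sb = new StringBuilder("{");
        sb.append("\"level\":\"").append(escape(level)).append("\",");
        sb.append("\"message\":\"").append(escape(message)).append("\"");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(",\"").append(escape(e.getKey())).append("\":\"")
              .append(escape(e.getValue())).append("\"");
        }
        return sb.append("}").toString();
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("userId", "123");
        fields.put("orderId", "456");
        System.out.println(toJsonLine("INFO", "Order placed", fields));
    }
}
```

Because every value sits in its own named field, a log pipeline can filter on `userId` or `orderId` directly instead of applying a regular expression to the message text.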

3. Metrics

3.1 Prometheus + Grafana

# prometheus.yml
scrape_configs:
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app:8080']
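
// Micrometer metrics
import io.micrometer.core.instrument.Counter;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import org.springframework.stereotype.Component;

@Component
public class OrderMetrics {

    private final Counter orderCounter;
    private final Timer orderTimer;

    public OrderMetrics(MeterRegistry registry) {
        // Counts created orders; the value only ever increases
        orderCounter = Counter.builder("orders.created")
            .tag("service", "order")
            .register(registry);
        // Records how long order processing takes
        orderTimer = Timer.builder("orders.duration")
            .register(registry);
    }

    public void recordOrder() {
        orderCounter.increment();
    }

    // Runs the action and records its execution time
    public void recordDuration(Runnable action) {
        orderTimer.record(action);
    }
}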

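3.2 Key Metrics

Metric             Type        Meaning
request_count      Counter     Total number of requests
request_duration   Histogram   Distribution of request latency
error_rate         Gauge       Error rate
jvm_memory_used    Gauge       JVM memory in use
thread_active      Gauge       Number of active threads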
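The metric types in the table behave differently, and a small JDK-only sketch may help show how. It models a counter that only increases and a histogram with cumulative buckets (each bucket counts observations at or below its upper bound, which is how Prometheus histograms are laid out), and it derives an error ratio from two counts. All class names, method names, and bucket bounds here are assumptions made for illustration; they are not Micrometer or Prometheus APIs.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of metric types, for illustration only.
class MetricTypesSketch {

    // Counter: a value that only goes up (for example, total requests).
    static final class Counter {
        private final AtomicLong value = new AtomicLong();
        void increment() { value.incrementAndGet(); }
        long get() { return value.get(); }
    }

    // Histogram with cumulative buckets: each bucket counts observations <= its upper bound.
    static final class Histogram {
        private final double[] upperBounds;
        private final long[] cumulativeCounts;
        private long total;

        Histogram(double... upperBounds) {
            this.upperBounds = upperBounds.clone();
            this.cumulativeCounts = new long[upperBounds.length];
        }

        void observe(double v) {
            total++;
            for (int i = 0; i < upperBounds.length; i++) {
                if (v <= upperBounds[i]) cumulativeCounts[i]++;
            }
        }

        long countAtOrBelow(int bucketIndex) { return cumulativeCounts[bucketIndex]; }
        long total() { return total; }
    }

    // Error ratio derived from two counts; returns 0 when there is no traffic.
    static double errorRatio(long errors, long requests) {
        return requests == 0 ? 0.0 : (double) errors / requests;
    }

    public static void main(String[] args) {
        Counter requests = new Counter();
        Counter errors = new Counter();
        Histogram durations = new Histogram(0.1, 0.5, 1.0); // bucket bounds in seconds (assumed)

        double[] samples = {0.05, 0.2, 0.7, 2.0};
        for (double s : samples) {
            requests.increment();
            durations.observe(s);
            if (s > 1.0) errors.increment(); // treat very slow requests as failures, for the demo only
        }

        System.out.println("requests=" + requests.get());
        System.out.println("requests <= 0.5s: " + durations.countAtOrBelow(1));
        System.out.println("errorRatio=" + errorRatio(errors.get(), requests.get()));
    }
}
```

A design point worth noting: storing raw counters and deriving ratios at query time keeps the data reusable, since the same counters can answer questions over any time window.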

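4. Distributed Tracing

4.1 SkyWalking

<!-- Add the dependency. No <version> is given here, so it must be supplied or managed by a parent POM. -->
<dependency>
    <groupId>org.apache.skywalking</groupId>
    <artifactId>apm-toolkit-trace</artifactId>
</dependency>

// Automatic tracing: the application code needs no changes.
// Instrumentation comes from the SkyWalking Java agent attached at JVM startup (-javaagent).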
@SpringBootApplication
public class Application {
    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }
}

4.2 Custom Tracing

@Trace
@Tag(key = "userId", value = "arg[0]")
public User getUser(Long userId) {
    return userMapper.findById(userId);
}

// Add a custom tag to the currently active span
ActiveSpan.tag("custom_tag", "value");

4.3 Tracing Across Services

// Feign integration (TraceConfig is a configuration class that is not shown here)
@FeignClient(name = "user-service", configuration = TraceConfig.class)
public interface UserClient {
    @GetMapping("/users/{id}")
    User getUser(@PathVariable Long id);
}
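What makes a trace "distributed" is that the caller passes a trace identifier to the callee, usually in a request header, so that spans from different services can be stitched together. The sketch below illustrates that idea with the W3C Trace Context `traceparent` layout (`version-traceId-parentSpanId-flags`). This is an assumption chosen for illustration: SkyWalking uses its own propagation header, and the agent handles propagation automatically. The class and method names are hypothetical.

```java
import java.security.SecureRandom;

// Hypothetical illustration of trace context propagation; not a SkyWalking API.
class TraceContextSketch {
    private static final SecureRandom RANDOM = new SecureRandom();

    // Returns a random lowercase hex string representing the given number of bytes.
    static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) sb.append(String.format("%02x", b));
        return sb.toString();
    }

    // Starts a new trace: fresh 16-byte trace ID and 8-byte span ID, with the sampled flag set.
    static String newRootHeader() {
        return "00-" + randomHex(16) + "-" + randomHex(8) + "-01";
    }

    // Builds the header for an outgoing call: same trace ID, new span ID.
    static String childHeader(String incoming) {
        String[] parts = incoming.split("-");
        if (parts.length != 4 || parts[1].length() != 32 || parts[2].length() != 16) {
            throw new IllegalArgumentException("malformed traceparent: " + incoming);
        }
        return parts[0] + "-" + parts[1] + "-" + randomHex(8) + "-" + parts[3];
    }

    // Extracts the trace ID, which stays constant across every hop of one request.
    static String traceId(String header) {
        return header.split("-")[1];
    }

    public static void main(String[] args) {
        String atGateway = newRootHeader();
        String toUserService = childHeader(atGateway);
        System.out.println("gateway      : " + atGateway);
        System.out.println("user-service : " + toUserService);
        System.out.println("same trace   : " + traceId(atGateway).equals(traceId(toUserService)));
    }
}
```

The trace ID is what a tracing backend groups spans by, while the per-hop span ID lets it reconstruct the parent-child call tree.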

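5. A Unified Observability Platform

5.1 OpenTelemetry

// OpenTelemetry: a vendor-neutral standard for telemetry
// Note: JaegerGrpcSpanExporter is deprecated in recent OpenTelemetry Java releases;
// newer setups usually export OTLP (OtlpGrpcSpanExporter) to Jaeger instead.
OpenTelemetrySdk openTelemetry = OpenTelemetrySdk.builder()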
    .setTracerProvider(
        SdkTracerProvider.builder()
            .addSpanProcessor(
                BatchSpanProcessor.builder(
                    JaegerGrpcSpanExporter.builder()
                        .setEndpoint("http://jaeger:14250")
                        .build()
                ).build()
            )
            .build()
    )
    .build();

5.2 An Integrated Setup

# Recommended architecture
logs:
  - fluentd → Loki → Grafana
  
metrics:
  - app → Prometheus → Grafana
  
traces:
  - app → Jaeger/SkyWalking → Grafana

# Grafana serves as the single place to view all three signals

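6. Alerting

# Prometheus alerting rule; alerts that fire are routed through Alertmanager.
# In a rules file, this list sits under groups: -> name: -> rules:
rules:
  - alert: HighErrorRate
    # Share of 5xx responses among all responses over the last 5 minutes (threshold: 1%)
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Error rate is too high"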
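
The `for: 1m` clause in the rule means the condition must hold continuously for one minute before the alert fires; until then the alert is only pending, and a single evaluation below the threshold resets the timer. A minimal sketch of that state machine follows. The class name, the 15-second evaluation interval, and the sample values are assumptions made for illustration, not Prometheus internals.

```java
// Hypothetical model of the "for" clause in an alerting rule, for illustration only.
class AlertForClauseSketch {
    enum State { INACTIVE, PENDING, FIRING }

    private final double threshold;
    private final long forSeconds;
    private long breachStart = -1; // time the condition first became true, or -1 if not breaching

    AlertForClauseSketch(double threshold, long forSeconds) {
        this.threshold = threshold;
        this.forSeconds = forSeconds;
    }

    // Feeds one evaluation (timestamp in seconds, measured value) and returns the alert state.
    State evaluate(long nowSeconds, double value) {
        if (value <= threshold) {
            breachStart = -1; // condition cleared: reset the pending timer
            return State.INACTIVE;
        }
        if (breachStart < 0) breachStart = nowSeconds;
        return (nowSeconds - breachStart >= forSeconds) ? State.FIRING : State.PENDING;
    }

    public static void main(String[] args) {
        AlertForClauseSketch alert = new AlertForClauseSketch(0.01, 60);
        double[] values = {0.005, 0.02, 0.03, 0.004, 0.02, 0.02, 0.02, 0.02, 0.02};
        for (int i = 0; i < values.length; i++) {
            long t = i * 15L; // one evaluation every 15 seconds (assumed interval)
            System.out.println("t=" + t + "s value=" + values[i] + " -> " + alert.evaluate(t, values[i]));
        }
    }
}
```

In this run the brief spike early on never fires because it clears before a full minute has passed, which is the purpose of the clause: it filters out short blips at the cost of a delayed notification.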