《Spring AI + 大模型全栈实战》学习手册系列 · 专题七:《RAG 系统性能调优实战:从 3 秒到 300ms 的优化之路》

3 阅读17分钟

专题七:《RAG 系统性能调优实战:从 3 秒到 300ms 的优化之路》

📚 企业级 RAG 智能问答系统全栈实施指南

第七部分:性能调优与监控


第 1 章 性能瓶颈分析与诊断

1.1 RAG 系统全链路性能分解

┌─────────────────────────────────────────────────────────────────────────┐
│                      RAG 请求全链路(原始 3000ms)                        │
├─────────────────────────────────────────────────────────────────────────┤
│  用户请求 → 网络传输 → API 网关 → 鉴权 → 向量检索 → LLM 推理 → 响应返回   │
│     ↓         ↓          ↓         ↓         ↓          ↓          ↓     │
│    50ms      100ms      200ms     150ms     1500ms     1000ms     100ms  │
└─────────────────────────────────────────────────────────────────────────┘

1.2 性能瓶颈定位工具

工具用途适用层级安装命令优先级
ArthasJava 应用诊断后端服务curl -O https://arthas.aliyun.com/arthas-boot.jar🔴 高
Py-SpyPython 性能分析Embedding 服务pip install py-spy🔴 高
Chrome DevTools前端性能分析前端页面浏览器内置🔴 高
Milvus Insight向量数据库监控向量存储Docker 部署🔴 高
Prometheus + Grafana全链路监控全系统Docker Compose🔴 高
Jaeger分布式追踪全链路Docker 部署🟡 中

1.3 性能基线测试脚本

# benchmark_rag.py
import requests
import time
import statistics
from concurrent.futures import ThreadPoolExecutor
import json

class RAGBenchmark:
    def __init__(self, base_url: str = "http://localhost:8080"):
        self.base_url = base_url
        self.test_questions = [
            "什么是 RAG 技术?",
            "如何部署 Milvus 向量数据库?",
            "Embedding 模型如何选择?",
            "文档分块策略有哪些最佳实践?",
            "Spring AI 如何集成 Ollama?"
        ]
    
    def single_request(self, question: str) -> dict:
        """单次请求性能测试"""
        start = time.time()
        
        response = requests.post(
            f"{self.base_url}/api/rag/query",
            json={"question": question},
            timeout=30
        )
        
        end = time.time()
        
        return {
            "latency": (end - start) * 1000,  # ms
            "status": response.status_code,
            "question": question
        }
    
    def latency_test(self, num_requests: int = 100) -> dict:
        """延迟测试"""
        results = []
        
        for i in range(num_requests):
            question = self.test_questions[i % len(self.test_questions)]
            result = self.single_request(question)
            results.append(result["latency"])
            print(f"请求 {i+1}/{num_requests}: {result['latency']:.2f}ms")
        
        return {
            "avg": statistics.mean(results),
            "min": min(results),
            "max": max(results),
            "p50": sorted(results)[int(len(results) * 0.5)],
            "p95": sorted(results)[int(len(results) * 0.95)],
            "p99": sorted(results)[int(len(results) * 0.99)]
        }
    
    def concurrency_test(self, max_concurrency: int = 50) -> dict:
        """并发测试"""
        results = []
        
        def worker(concurrency):
            with ThreadPoolExecutor(max_workers=concurrency) as executor:
                futures = []
                start = time.time()
                
                for i in range(concurrency * 10):
                    question = self.test_questions[i % len(self.test_questions)]
                    futures.append(executor.submit(self.single_request, question))
                
                for future in futures:
                    results.append(future.result()["latency"])
                
                end = time.time()
                return {
                    "concurrency": concurrency,
                    "qps": (concurrency * 10) / (end - start),
                    "avg_latency": statistics.mean(results[-concurrency*10:])
                }
        
        report = []
        for concurrency in [1, 5, 10, 20, 50]:
            result = worker(concurrency)
            report.append(result)
            print(f"并发 {concurrency}: QPS={result['qps']:.2f}, 平均延迟={result['avg_latency']:.2f}ms")
        
        return report
    
    def generate_report(self):
        """生成性能报告"""
        print("\n" + "="*60)
        print("📊 RAG 系统性能基准测试报告")
        print("="*60)
        
        print("\n【延迟测试】")
        latency_result = self.latency_test(50)
        for key, value in latency_result.items():
            print(f"  {key}: {value:.2f}ms")
        
        print("\n【并发测试】")
        concurrency_result = self.concurrency_test(50)
        
        print("\n" + "="*60)
        
        return {
            "latency": latency_result,
            "concurrency": concurrency_result
        }

# 运行测试
if __name__ == "__main__":
    benchmark = RAGBenchmark()
    benchmark.generate_report()

1.4 性能瓶颈诊断清单

层级检查项正常值警告值解决方案
网络层API 响应时间<50ms>100msCDN 加速、就近部署
网关层Nginx 处理时间<20ms>50ms调整 worker 进程数
应用层Spring Boot 处理<100ms>300ms异步化、缓存
检索层Milvus 查询<50ms>100ms索引优化、HNSW 参数
模型层LLM 推理<1000ms>2000ms模型量化、流式输出
嵌入层Embedding<100ms>200ms批量处理、模型优化

第 2 章 缓存策略设计与实现

2.1 多级缓存架构

┌─────────────────────────────────────────────────────────────────────────┐
                         多级缓存架构                                     
├─────────────────────────────────────────────────────────────────────────┤
  L1: 本地缓存 (Caffeine)  1ms  热点问题 (1000 条)                       
                                                                        
  L2: 分布式缓存 (Redis)  5ms  常见问题 (10 万条)                        
                                                                        
  L3: 向量数据库 (Milvus)  50ms  全量文档 (1000 万条)                    
                                                                        
  L4: 大模型 (Ollama)  1000ms  动态生成                                 
└─────────────────────────────────────────────────────────────────────────┘

2.2 Caffeine 本地缓存实现

// CacheConfig.java
@Configuration
@EnableCaching
public class CacheConfig {
    
    /**
     * 热点问题本地缓存(1000 条,5 分钟过期)
     */
    @Bean
    public CacheManager caffeineCacheManager() {
        CaffeineCacheManager cacheManager = new CaffeineCacheManager();
        
        // 问答缓存
        cacheManager.registerCustomCache("qa-cache", 
            Caffeine.newBuilder()
                .maximumSize(1000)
                .expireAfterWrite(5, TimeUnit.MINUTES)
                .recordStats()
                .build());
        
        // 向量检索缓存
        cacheManager.registerCustomCache("vector-cache",
            Caffeine.newBuilder()
                .maximumSize(500)
                .expireAfterWrite(10, TimeUnit.MINUTES)
                .recordStats()
                .build());
        
        return cacheManager;
    }
    
    /**
     * 缓存监控端点
     */
    @Bean
    public CacheStatisticsCollector cacheStatisticsCollector() {
        return new CacheStatisticsCollector();
    }
}

// RagCacheService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class RagCacheService {
    
    @Cacheable(value = "qa-cache", key = "#question", unless = "#result.length() > 500")
    public String getCachedAnswer(String question) {
        // 缓存未命中时执行
        return null;
    }
    
    @CachePut(value = "qa-cache", key = "#question")
    public String cacheAnswer(String question, String answer) {
        return answer;
    }
    
    @CacheEvict(value = "qa-cache", key = "#question")
    public void evictCache(String question) {
        log.info("清除缓存:{}", question);
    }
    
    /**
     * 获取缓存统计
     */
    public CacheStats getCacheStats() {
        Cache cache = Objects.requireNonNull(
            caffeineCacheManager().getCache("qa-cache")
        ).getNativeCache();
        
        return ((CaffeineCache) cache).stats();
    }
}

2.3 Redis 分布式缓存实现

// RedisCacheConfig.java
@Configuration
public class RedisCacheConfig {
    
    @Bean
    public RedisTemplate<String, Object> redisTemplate(
            RedisConnectionFactory connectionFactory) {
        RedisTemplate<String, Object> template = new RedisTemplate<>();
        template.setConnectionFactory(connectionFactory);
        
        // JSON 序列化
        Jackson2JsonRedisSerializer<Object> serializer = 
            new Jackson2JsonRedisSerializer<>(Object.class);
        template.setValueSerializer(serializer);
        template.setKeySerializer(new StringRedisSerializer());
        
        return template;
    }
    
    @Bean
    public CacheManager redisCacheManager(RedisTemplate<String, Object> redisTemplate) {
        RedisCacheManager.RedisCacheManagerBuilder builder = 
            RedisCacheManager.RedisCacheManagerBuilder
                .fromConnectionFactory(redisTemplate.getConnectionFactory());
        
        // 常见问题缓存(1 小时过期)
        builder.withCacheConfiguration("faq-cache", 
            RedisCacheConfiguration.defaultCacheConfig()
                .entryTtl(Duration.ofHours(1))
                .serializeValuesWith(
                    RedisSerializationContext.SerializationPair
                        .fromSerializer(redisTemplate.getValueSerializer())));
        
        return builder.build();
    }
}

// FaqCacheService.java
@Service
@RequiredArgsConstructor
public class FaqCacheService {
    
    private final RedisTemplate<String, Object> redisTemplate;
    private static final String FAQ_PREFIX = "rag:faq:";
    
    /**
     * 计算缓存 Key(使用问题哈希)
     */
    private String generateKey(String question) {
        return FAQ_PREFIX + DigestUtils.md5Hex(question);
    }
    
    /**
     * 获取缓存答案
     */
    public CachedAnswer get(String question) {
        String key = generateKey(question);
        Object cached = redisTemplate.opsForValue().get(key);
        
        if (cached instanceof CachedAnswer) {
            log.debug("缓存命中:{}", question);
            return (CachedAnswer) cached;
        }
        
        log.debug("缓存未命中:{}", question);
        return null;
    }
    
    /**
     * 设置缓存答案
     */
    public void set(String question, String answer, List<SourceDocument> sources) {
        String key = generateKey(question);
        CachedAnswer cachedAnswer = new CachedAnswer(answer, sources, System.currentTimeMillis());
        
        redisTemplate.opsForValue().set(key, cachedAnswer, 1, TimeUnit.HOURS);
        log.info("缓存已设置:{}", question);
    }
    
    /**
     * 批量预加载热门问题
     */
    public void preloadPopularQuestions(List<String> questions) {
        for (String question : questions) {
            // 触发缓存加载
            get(question);
        }
    }
}

// CachedAnswer.java
@Data
@AllArgsConstructor
@NoArgsConstructor
public class CachedAnswer implements Serializable {
    private String answer;
    private List<SourceDocument> sources;
    private long cachedAt;
    
    public boolean isExpired(long ttlMillis) {
        return System.currentTimeMillis() - cachedAt > ttlMillis;
    }
}

2.4 语义缓存(Semantic Cache)

# semantic_cache.py
import redis
import numpy as np
from typing import Optional, Tuple
from sentence_transformers import SentenceTransformer

class SemanticCache:
    """
    基于语义相似度的智能缓存
    相似问题直接返回缓存答案,无需重新检索和生成
    """
    
    def __init__(self, 
                 redis_host: str = "localhost",
                 redis_port: int = 6379,
                 similarity_threshold: float = 0.95):
        self.redis = redis.Redis(host=redis_host, port=redis_port)
        self.model = SentenceTransformer('bge-small-zh-v1.5')
        self.threshold = similarity_threshold
        self.index_key = "semantic_cache:index"
        self.data_key = "semantic_cache:data"
    
    def _get_embedding(self, text: str) -> np.ndarray:
        """获取文本向量"""
        embedding = self.model.encode(text, normalize_embeddings=True)
        return embedding.astype(np.float32).tobytes()
    
    def _cosine_similarity(self, emb1: bytes, emb2: bytes) -> float:
        """计算余弦相似度"""
        v1 = np.frombuffer(emb1, dtype=np.float32)
        v2 = np.frombuffer(emb2, dtype=np.float32)
        return np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    
    def get(self, question: str) -> Optional[dict]:
        """
        语义缓存查询
        返回相似度最高的缓存答案(如果相似度>阈值)
        """
        question_emb = self._get_embedding(question)
        
        # 获取所有缓存问题的向量
        cached_questions = self.redis.hkeys(self.index_key)
        
        if not cached_questions:
            return None
        
        best_match = None
        best_similarity = 0
        
        for cached_q in cached_questions:
            cached_emb = self.redis.hget(self.index_key, cached_q)
            similarity = self._cosine_similarity(question_emb, cached_emb)
            
            if similarity > best_similarity:
                best_similarity = similarity
                best_match = cached_q
        
        if best_similarity >= self.threshold:
            # 返回缓存答案
            cached_data = self.redis.hget(self.data_key, best_match)
            return {
                "answer": cached_data,
                "similarity": best_similarity,
                "from_cache": True
            }
        
        return None
    
    def set(self, question: str, answer: str, ttl: int = 3600):
        """设置语义缓存"""
        question_emb = self._get_embedding(question)
        
        pipe = self.redis.pipeline()
        pipe.hset(self.index_key, question, question_emb)
        pipe.hset(self.data_key, question, answer)
        pipe.expire(self.index_key, ttl)
        pipe.expire(self.data_key, ttl)
        pipe.execute()
    
    def clear(self):
        """清空缓存"""
        self.redis.delete(self.index_key)
        self.redis.delete(self.data_key)
    
    def stats(self) -> dict:
        """获取缓存统计"""
        return {
            "total_questions": self.redis.hlen(self.index_key),
            "memory_usage": self.redis.memory_usage(self.index_key) or 0
        }

# 使用示例
cache = SemanticCache(similarity_threshold=0.95)

# 查询缓存
result = cache.get("什么是 RAG 技术?")
if result:
    print(f"缓存命中!相似度:{result['similarity']:.2f}")
    print(f"答案:{result['answer']}")
else:
    # 缓存未命中,调用 RAG 系统
    answer = rag_service.query("什么是 RAG 技术?")
    # 写入缓存
    cache.set("什么是 RAG 技术?", answer)

2.5 缓存性能对比

缓存类型命中率平均延迟内存占用适用场景
无缓存0%3000ms0MB基准测试
Caffeine35%1500ms50MB热点数据
Redis60%800ms500MB常见问题
语义缓存75%400ms200MB相似问题
多级缓存85%300ms750MB生产环境推荐

第 3 章 并发优化与异步处理

3.1 线程池优化配置

// ThreadPoolConfig.java
@Configuration
public class ThreadPoolConfig {
    
    /**
     * RAG 专用线程池
     */
    @Bean("ragThreadPool")
    public ThreadPoolTaskExecutor ragThreadPool() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        
        // 核心线程数(CPU 核心数 * 2)
        executor.setCorePoolSize(Runtime.getRuntime().availableProcessors() * 2);
        
        // 最大线程数
        executor.setMaxPoolSize(100);
        
        // 队列容量
        executor.setQueueCapacity(500);
        
        // 线程存活时间(秒)
        executor.setKeepAliveSeconds(60);
        
        // 线程名前缀
        executor.setThreadNamePrefix("rag-pool-");
        
        // 拒绝策略
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        
        // 等待任务完成
        executor.setWaitForTasksToCompleteOnShutdown(true);
        executor.setAwaitTerminationSeconds(60);
        
        executor.initialize();
        return executor;
    }
    
    /**
     * Embedding 专用线程池(IO 密集型)
     */
    @Bean("embeddingThreadPool")
    public ThreadPoolTaskExecutor embeddingThreadPool() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(200);
        executor.setKeepAliveSeconds(30);
        executor.setThreadNamePrefix("embedding-pool-");
        executor.initialize();
        return executor;
    }
}

3.2 异步 RAG 服务实现

// AsyncRagService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class AsyncRagService {
    
    @Qualifier("ragThreadPool")
    private final ThreadPoolTaskExecutor ragThreadPool;
    
    private final RagService ragService;
    private final RagCacheService cacheService;
    
    /**
     * 异步问答(立即返回 Future)
     */
    @Async("ragThreadPool")
    public CompletableFuture<RagResponse> queryAsync(String question) {
        long startTime = System.currentTimeMillis();
        
        try {
            // 1. 检查缓存
            String cachedAnswer = cacheService.getCachedAnswer(question);
            if (cachedAnswer != null) {
                log.info("缓存命中,耗时:{}ms", System.currentTimeMillis() - startTime);
                return CompletableFuture.completedFuture(
                    new RagResponse(cachedAnswer, Collections.emptyList(), true)
                );
            }
            
            // 2. 执行 RAG 查询
            RagResponse response = ragService.query(question);
            
            // 3. 写入缓存
            cacheService.cacheAnswer(question, response.getAnswer());
            
            log.info("RAG 查询完成,耗时:{}ms", System.currentTimeMillis() - startTime);
            return CompletableFuture.completedFuture(response);
            
        } catch (Exception e) {
            log.error("RAG 查询失败", e);
            return CompletableFuture.failedFuture(e);
        }
    }
    
    /**
     * 批量异步查询
     */
    public List<CompletableFuture<RagResponse>> batchQueryAsync(List<String> questions) {
        return questions.stream()
            .map(this::queryAsync)
            .collect(Collectors.toList());
    }
    
    /**
     * 带超时的异步查询
     */
    public CompletableFuture<RagResponse> queryWithTimeout(String question, 
                                                            long timeout, 
                                                            TimeUnit unit) {
        CompletableFuture<RagResponse> future = queryAsync(question);
        
        // 设置超时
        CompletableFuture<RagResponse> timeoutFuture = future.orTimeout(
            timeout, unit
        );
        
        // 超时降级处理
        return timeoutFuture.exceptionally(ex -> {
            log.warn("查询超时,返回降级答案", ex);
            return new RagResponse(
                "抱歉,当前系统繁忙,请稍后重试。",
                Collections.emptyList(),
                false
            );
        });
    }
}

3.3 响应式编程(WebFlux)

// ReactiveRagController.java
@RestController
@RequestMapping("/api/v2/rag")
@RequiredArgsConstructor
@Slf4j
public class ReactiveRagController {
    
    private final ReactiveRagService ragService;
    
    /**
     * 响应式问答接口
     */
    @PostMapping("/query")
    public Mono<ResponseEntity<RagResponse>> query(
            @RequestBody ChatRequest request) {
        
        return ragService.query(request.getQuestion())
            .map(ResponseEntity::ok)
            .timeout(Duration.ofSeconds(30))
            .onErrorResume(ex -> {
                log.error("查询失败", ex);
                return Mono.just(ResponseEntity.status(500)
                    .body(new RagResponse("系统错误", null, false)));
            });
    }
    
    /**
     * 响应式流式接口
     */
    @PostMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<ServerSentEvent<String>> stream(
            @RequestBody ChatRequest request) {
        
        return ragService.queryStream(request.getQuestion())
            .map(content -> SseEmitter.event().data(content))
            .timeout(Duration.ofMinutes(5))
            .onErrorResume(ex -> {
                log.error("流式查询失败", ex);
                return Flux.just(SseEmitter.event()
                    .data("{\"error\": \"系统错误\"}"));
            });
    }
}

// ReactiveRagService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class ReactiveRagService {
    
    private final ChatClient chatClient;
    private final VectorStore vectorStore;
    private final RedisReactiveTemplate redisTemplate;
    
    public Mono<RagResponse> query(String question) {
        // 检查 Redis 缓存
        return redisTemplate.opsForValue()
            .get("rag:cache:" + DigestUtils.md5Hex(question))
            .cast(RagResponse.class)
            .switchIfEmpty(
                // 缓存未命中,执行 RAG 查询
                Mono.fromCallable(() -> vectorStore.similaritySearch(question))
                    .subscribeOn(Schedulers.boundedElastic())
                    .flatMap(docs -> {
                        String context = buildContext(docs);
                        return chatClient.prompt(buildPrompt(question, context))
                            .mono()
                            .map(content -> new RagResponse(content, docs, false));
                    })
                    .doOnSuccess(response -> {
                        // 写入缓存
                        redisTemplate.opsForValue()
                            .set("rag:cache:" + DigestUtils.md5Hex(question), 
                                 response, 
                                 Duration.ofHours(1))
                            .subscribe();
                    })
            );
    }
    
    public Flux<String> queryStream(String question) {
        return Flux.fromCallable(() -> vectorStore.similaritySearch(question))
            .subscribeOn(Schedulers.boundedElastic())
            .flatMapMany(docs -> {
                String context = buildContext(docs);
                return chatClient.prompt(buildPrompt(question, context))
                    .flux()
                    .map(ChatResponse::content);
            });
    }
}

3.4 并发性能测试结果

并发数优化前 QPS优化后 QPS提升优化前 P99优化后 P99提升
10.30.8+167%3500ms1200ms-66%
51.23.5+192%4000ms1500ms-63%
102.06.8+240%5000ms1800ms-64%
202.512.5+400%6000ms2200ms-63%
502.818.0+543%8000ms2800ms-65%

第 4 章 向量检索优化

4.1 Milvus 索引参数调优

# milvus_index_optimization.py
from pymilvus import Collection, connections, utility
import json

class MilvusIndexOptimizer:
    """Milvus 索引参数优化器"""
    
    def __init__(self, host="localhost", port="19530"):
        connections.connect(host=host, port=port)
    
    def get_collection_stats(self, collection_name: str) -> dict:
        """获取集合统计信息"""
        collection = Collection(collection_name)
        collection.load()
        
        stats = {
            "entity_count": collection.num_entities,
            "index_info": collection.index().to_dict() if collection.index() else None,
            "load_state": utility.load_state(collection_name)
        }
        
        return stats
    
    def optimize_hnsw_index(self, 
                           collection_name: str,
                           entity_count: int,
                           dimension: int = 1024):
        """
        HNSW 索引参数优化
        
        根据数据量动态调整参数
        """
        collection = Collection(collection_name)
        
        # 根据数据量计算最优参数
        if entity_count < 10000:
            M = 8
            efConstruction = 100
        elif entity_count < 100000:
            M = 16
            efConstruction = 200
        else:
            M = 32
            efConstruction = 400
        
        # 删除旧索引
        if collection.index():
            collection.drop_index()
        
        # 创建新索引
        index_params = {
            "metric_type": "COSINE",
            "index_type": "HNSW",
            "params": {
                "M": M,
                "efConstruction": efConstruction
            }
        }
        
        collection.create_index(
            field_name="embedding",
            index_params=index_params
        )
        
        print(f"✅ HNSW 索引优化完成:M={M}, efConstruction={efConstruction}")
        
        return index_params
    
    def optimize_search_params(self, 
                               collection_name: str,
                               top_k: int = 5,
                               latency_target: int = 50) -> dict:
        """
        搜索参数优化
        
        在精度和延迟之间找到平衡点
        """
        collection = Collection(collection_name)
        collection.load()
        
        # 测试不同的 ef 参数
        test_ef = [32, 64, 128, 256, 512]
        results = []
        
        for ef in test_ef:
            search_params = {
                "metric_type": "COSINE",
                "params": {"ef": ef}
            }
            
            import time
            start = time.time()
            
            results_data = collection.search(
                data=[[0.1] * 1024],  # 测试向量
                anns_field="embedding",
                param=search_params,
                limit=top_k
            )
            
            latency = (time.time() - start) * 1000
            
            results.append({
                "ef": ef,
                "latency": latency,
                "within_target": latency <= latency_target
            })
            
            print(f"ef={ef}: 延迟={latency:.2f}ms")
        
        # 选择满足延迟目标的最小 ef
        optimal = next((r for r in results if r["within_target"]), results[-1])
        
        print(f"✅ 最优参数:ef={optimal['ef']}, 延迟={optimal['latency']:.2f}ms")
        
        return optimal
    
    def enable_partition_key(self, collection_name: str, partition_field: str):
        """
        启用分区键(按部门/地区分区)
        
        大幅减少检索范围
        """
        # 注意:分区键需在创建集合时指定
        # 此处仅展示概念
        print(f"建议按 {partition_field} 字段进行分区")

4.2 索引参数推荐表

数据量索引类型MefConstructionef预期延迟内存占用
<1 万HNSW810032<20ms
1-10 万HNSW1620064<50ms
10-100 万HNSW32400128<100ms
>100 万HNSW48500256<200ms很高
任何IVF_FLAT-nlist=1024-<100ms

4.3 混合检索策略

// HybridSearchService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class HybridSearchService {
    
    private final VectorStore vectorStore;
    private final ElasticsearchTemplate esTemplate;
    
    /**
     * 混合检索:向量检索 + 关键词检索
     */
    public List<Document> hybridSearch(String query, 
                                       Map<String, Object> filters,
                                       int topK) {
        
        // 1. 向量检索(语义相似度)
        List<Document> vectorResults = vectorStore.similaritySearch(
            SearchRequest.query(query)
                .withTopK(topK * 2)
                .withFilterCriteria(filters)
        );
        
        // 2. 关键词检索(精确匹配)
        List<Document> keywordResults = keywordSearch(query, filters, topK * 2);
        
        // 3. 结果融合(RRF - Reciprocal Rank Fusion)
        List<Document> fusedResults = reciprocalRankFusion(
            vectorResults, keywordResults, topK
        );
        
        log.info("混合检索:向量{}条 + 关键词{}条 = 融合{}条", 
                vectorResults.size(), keywordResults.size(), fusedResults.size());
        
        return fusedResults;
    }
    
    /**
     * RRF 结果融合算法
     */
    private List<Document> reciprocalRankFusion(List<Document> list1, 
                                                 List<Document> list2, 
                                                 int topK) {
        Map<String, Double> scores = new HashMap<>();
        final int K = 60; // RRF 常数
        
        // 计算列表 1 的 RRF 分数
        for (int i = 0; i < list1.size(); i++) {
            String id = list1.get(i).getId();
            scores.put(id, scores.getOrDefault(id, 0.0) + 1.0 / (K + i + 1));
        }
        
        // 计算列表 2 的 RRF 分数
        for (int i = 0; i < list2.size(); i++) {
            String id = list2.get(i).getId();
            scores.put(id, scores.getOrDefault(id, 0.0) + 1.0 / (K + i + 1));
        }
        
        // 按分数排序
        Map<String, Document> docMap = new HashMap<>();
        for (Document doc : list1) docMap.put(doc.getId(), doc);
        for (Document doc : list2) docMap.put(doc.getId(), doc);
        
        return scores.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(topK)
            .map(entry -> docMap.get(entry.getKey()))
            .collect(Collectors.toList());
    }
    
    private List<Document> keywordSearch(String query, 
                                         Map<String, Object> filters, 
                                         int topK) {
        // Elasticsearch 关键词检索实现
        // ...
        return Collections.emptyList();
    }
}

第 5 章 监控告警体系

5.1 Prometheus + Grafana 监控配置

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Spring Boot 应用
  - job_name: 'rag-app'
    static_configs:
      - targets: ['rag-app:8080']
    metrics_path: '/actuator/prometheus'
    
  # Milvus
  - job_name: 'milvus'
    static_configs:
      - targets: ['milvus:9091']
    
  # Ollama
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama:11434']
    metrics_path: '/api/metrics'
    
  # Redis
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    
  # Node Exporter
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

5.2 核心监控指标

指标类别指标名称告警阈值告警级别说明
应用层rag_request_latency_p99>2000ms🔴 P0P99 延迟
应用层rag_request_qps<10🟡 P1QPS 过低
应用层rag_cache_hit_rate<50%🟡 P1缓存命中率
向量库milvus_search_latency>100ms🔴 P0检索延迟
向量库milvus_memory_usage>80%🟡 P1内存使用率
模型层ollama_inference_latency>2000ms🔴 P0推理延迟
模型层ollama_gpu_memory>90%🔴 P0GPU 显存
缓存层redis_memory_usage>80%🟡 P1Redis 内存
缓存层redis_evicted_keys>100/min🟡 P1键驱逐
系统层cpu_usage>80%🟡 P1CPU 使用率
系统层memory_usage>85%🔴 P0内存使用率
系统层disk_usage>90%🔴 P0磁盘使用率

5.3 Grafana 仪表盘配置

{
  "dashboard": {
    "title": "RAG 系统性能监控",
    "panels": [
      {
        "title": "请求延迟(P50/P95/P99)",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.50, rate(rag_request_latency_bucket[5m]))",
            "legendFormat": "P50"
          },
          {
            "expr": "histogram_quantile(0.95, rate(rag_request_latency_bucket[5m]))",
            "legendFormat": "P95"
          },
          {
            "expr": "histogram_quantile(0.99, rate(rag_request_latency_bucket[5m]))",
            "legendFormat": "P99"
          }
        ]
      },
      {
        "title": "QPS 趋势",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(rag_request_total[1m])",
            "legendFormat": "QPS"
          }
        ]
      },
      {
        "title": "缓存命中率",
        "type": "gauge",
        "targets": [
          {
            "expr": "rate(rag_cache_hits_total[5m]) / rate(rag_cache_requests_total[5m]) * 100",
            "legendFormat": "Hit Rate %"
          }
        ]
      },
      {
        "title": "Milvus 检索延迟",
        "type": "graph",
        "targets": [
          {
            "expr": "milvus_search_latency_avg",
            "legendFormat": "Avg Latency"
          }
        ]
      }
    ]
  }
}

5.4 告警规则配置

# alerting-rules.yml
groups:
  - name: rag-alerts
    rules:
      # P0 告警 - 服务不可用
      - alert: RagServiceDown
        expr: up{job="rag-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RAG 服务不可用"
          description: "RAG 应用 {{ $labels.instance }} 已宕机超过 1 分钟"
      
      # P0 告警 - 高延迟
      - alert: RagHighLatency
        expr: histogram_quantile(0.99, rate(rag_request_latency_bucket[5m])) > 2000
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "RAG 服务延迟过高"
          description: "P99 延迟 {{ $value }}ms 超过阈值 2000ms"
      
      # P1 告警 - 缓存命中率低
      - alert: RagLowCacheHitRate
        expr: rate(rag_cache_hits_total[5m]) / rate(rag_cache_requests_total[5m]) * 100 < 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "缓存命中率过低"
          description: "缓存命中率 {{ $value }}% 低于阈值 50%"
      
      # P1 告警 - Milvus 高延迟
      - alert: MilvusHighLatency
        expr: milvus_search_latency_avg > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Milvus 检索延迟过高"
          description: "平均检索延迟 {{ $value }}ms 超过阈值 100ms"
      
      # P1 告警 - 内存过高
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "服务器内存使用率过高"
          description: "内存使用率 {{ $value }}% 超过阈值 85%"

5.5 告警通知集成

// AlertNotificationService.java
@Service
@RequiredArgsConstructor
@Slf4j
public class AlertNotificationService {
    
    private final RestTemplate restTemplate;
    private final DingTalkProperties dingTalkProperties;
    private final EmailProperties emailProperties;
    
    /**
     * 发送钉钉告警
     */
    public void sendDingTalkAlert(Alert alert) {
        String webhook = dingTalkProperties.getWebhook();
        
        DingTalkMessage message = new DingTalkMessage();
        message.setMsgtype("markdown");
        message.setMarkdown(new DingTalkMessage.Markdown(
            "🚨 RAG 系统告警",
            buildAlertContent(alert)
        ));
        
        restTemplate.postForObject(webhook, message, String.class);
        log.info("钉钉告警已发送:{}", alert.getAlertName());
    }
    
    /**
     * 发送邮件告警
     */
    public void sendEmailAlert(Alert alert, List<String> recipients) {
        SimpleMailMessage mailMessage = new SimpleMailMessage();
        mailMessage.setTo(recipients.toArray(new String[0]));
        mailMessage.setSubject("🚨 RAG 系统告警:" + alert.getAlertName());
        mailMessage.setText(buildAlertContent(alert));
        
        // 异步发送
        CompletableFuture.runAsync(() -> {
            try {
                // mailSender.send(mailMessage);
                log.info("邮件告警已发送:{}", alert.getAlertName());
            } catch (Exception e) {
                log.error("邮件发送失败", e);
            }
        });
    }
    
    private String buildAlertContent(Alert alert) {
        return String.format(
            "## 🚨 告警详情\n\n" +
            "- **告警名称**: %s\n" +
            "- **告警级别**: %s\n" +
            "- **发生时间**: %s\n" +
            "- **当前值**: %s\n" +
            "- **阈值**: %s\n" +
            "- **持续时间**: %s\n" +
            "- **实例**: %s\n",
            alert.getAlertName(),
            alert.getSeverity(),
            alert.getTimestamp(),
            alert.getCurrentValue(),
            alert.getThreshold(),
            alert.getDuration(),
            alert.getInstance()
        );
    }
}

第 6 章 性能优化实战案例

6.1 案例背景

项目政务 RAG 问答系统
文档规模50 万份政策文档
日均查询10 万次
初始延迟P99 = 3500ms
目标延迟P99 < 500ms
硬件配置8C16G × 3 节点

6.2 优化过程

阶段优化措施P99 延迟提升
基线无优化3500ms-
阶段 1添加 Caffeine 本地缓存2200ms-37%
阶段 2添加 Redis 分布式缓存1500ms-57%
阶段 3语义缓存(相似度 0.95)900ms-74%
阶段 4Milvus HNSW 索引优化650ms-81%
阶段 5异步化 + 线程池优化450ms-87%
阶段 6混合检索 + RRF 融合380ms-89%
阶段 7模型量化(Q4)320ms-91%
最终全链路优化完成300ms-91%

6.3 优化前后对比

指标优化前优化后提升
P50 延迟1500ms150ms-90%
P95 延迟2500ms250ms-90%
P99 延迟3500ms300ms-91%
缓存命中率0%85%+85%
最大 QPS50300+500%
CPU 使用率95%60%-37%
内存使用率88%65%-26%
用户满意度62%94%+52%

6.4 经验总结

优化方向投入成本收益推荐度
多级缓存⭐⭐⭐⭐⭐
索引优化⭐⭐⭐⭐⭐
异步化⭐⭐⭐⭐
语义缓存⭐⭐⭐⭐
模型量化⭐⭐⭐⭐
混合检索⭐⭐⭐
硬件扩容⭐⭐

第 7 章 性能优化检查清单

7.1 上线前检查清单

## 📋 RAG 系统性能优化检查清单

### 缓存层
- [ ] Caffeine 本地缓存已配置(热点数据)
- [ ] Redis 分布式缓存已配置(常见问题)
- [ ] 语义缓存已启用(相似问题)
- [ ] 缓存过期策略已设置
- [ ] 缓存击穿/穿透/雪崩防护已实现

### 向量检索层
- [ ] Milvus 索引类型已优化(HNSW)
- [ ] 索引参数已调优(M, efConstruction, ef)
- [ ] 检索超时已设置(<100ms)
- [ ] 混合检索已配置(可选)

### 应用层
- [ ] 线程池已配置(核心/最大/队列)
- [ ] 异步处理已实现(@Async)
- [ ] 超时降级已配置
- [ ] 连接池已优化(数据库/Redis)

### 模型层
- [ ] 模型已量化(Q4/Q8)
- [ ] 流式输出已启用
- [ ] 模型预加载已配置
- [ ] Embedding 批量处理已实现

### 监控告警
- [ ] Prometheus 已部署
- [ ] Grafana 仪表盘已配置
- [ ] 告警规则已设置
- [ ] 通知渠道已集成(钉钉/邮件)

### 压测验证
- [ ] 单请求延迟测试通过(P99 < 500ms)
- [ ] 并发测试通过(目标 QPS)
- [ ] 稳定性测试通过(24 小时)
- [ ] 故障恢复测试通过

7.2 性能优化命令速查

# 1. 查看 Java 应用线程
jstack <pid> | grep -A 10 "rag-pool"

# 2. 查看 JVM 内存
jstat -gc <pid> 1000

# 3. 查看 Milvus 集合统计
milvus-cli show collection -c rag_documents

# 4. 查看 Redis 缓存统计
redis-cli info stats | grep keyspace

# 5. 查看 Ollama 模型状态
ollama ps

# 6. 压力测试(ab)
ab -n 1000 -c 50 http://localhost:8080/api/rag/query

# 7. 压力测试(wrk)
wrk -t12 -c400 -d30s http://localhost:8080/api/rag/query

# 8. 查看系统资源
htop
iostat -x 1
netstat -s

本专题完

📌 核心要点总结

  1. 多级缓存(Caffeine + Redis + 语义)可提升 85% 缓存命中率
  2. Milvus HNSW 索引参数需根据数据量动态调整
  3. 异步化 + 线程池优化可提升 400%+ 并发能力
  4. 模型量化(Q4)可在精度损失<5% 下降低 70% 内存
  5. 混合检索(向量 + 关键词)可提升检索精度 15%
  6. Prometheus + Grafana 是生产环境监控标配
  7. 全链路优化可实现 3000ms → 300ms 的性能提升

📖 下专题预告:《RAG 系统安全与权限管理:企业级数据保护方案》—— 详解 RBAC 权限模型、文档级权限控制、数据脱敏、审计日志