Vectorizing Documents and Storing Them in a Vector Database with Spring Boot


1. Environment Setup

1.1 Start the Milvus Service

  • Standalone mode (quick start with Docker):

    docker run -d --name milvus-standalone \
      -p 19530:19530 \
      -p 9091:9091 \
      milvusdb/milvus:v2.3.4-standalone
    

1.2 Add the Java SDK Dependency

Add the Milvus Java SDK to pom.xml:

<dependency>
    <groupId>io.milvus</groupId>
    <artifactId>milvus-sdk-java</artifactId>
    <version>2.3.4</version>
</dependency>

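2. Configure the Milvus Connection

2.1 Create a Configuration Class

import io.milvus.client.MilvusClient;
import io.milvus.client.MilvusServiceClient;
import io.milvus.param.ConnectParam;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class MilvusConfig {
    @Value("${milvus.host:localhost}")
    private String host;

    @Value("${milvus.port:19530}")
    private int port;

    @Bean
    public MilvusClient milvusClient() {
        // MilvusClient is an interface; MilvusServiceClient is the implementation
        ConnectParam connectParam = ConnectParam.newBuilder()
            .withHost(host)
            .withPort(port)
            .build();
        return new MilvusServiceClient(connectParam);
    }
}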

2.2 Configuration File

application.properties:

milvus.host=localhost
milvus.port=19530

3. Data Model and Collection Management

3.1 Define the Collection Schema

import io.milvus.client.MilvusClient;
import io.milvus.grpc.DataType;
import io.milvus.param.collection.CreateCollectionParam;
import io.milvus.param.collection.FieldType;

public class MilvusCollection {
    public static final String COLLECTION_NAME = "doc_vectors";
    public static final int VECTOR_DIM = 1536; // adjust to the actual dimension of your embeddings

    // Field names
    public static final String ID_FIELD = "id";
    public static final String CONTENT_FIELD = "content";
    public static final String VECTOR_FIELD = "vector";

    // Create the collection
    public static void createCollection(MilvusClient client) {
        // Primary key, generated by Milvus
        FieldType idField = FieldType.newBuilder()
            .withName(ID_FIELD)
            .withDataType(DataType.Int64)
            .withPrimaryKey(true)
            .withAutoID(true)
            .build();

        // Original document text
        FieldType contentField = FieldType.newBuilder()
            .withName(CONTENT_FIELD)
            .withDataType(DataType.VarChar)
            .withMaxLength(1000)
            .build();

        // Embedding vector
        FieldType vectorField = FieldType.newBuilder()
            .withName(VECTOR_FIELD)
            .withDataType(DataType.FloatVector)
            .withDimension(VECTOR_DIM)
            .build();

        CreateCollectionParam createParam = CreateCollectionParam.newBuilder()
            .withCollectionName(COLLECTION_NAME)
            .addFieldType(idField)
            .addFieldType(contentField)
            .addFieldType(vectorField)
            .build();

        client.createCollection(createParam);
    }
}
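
The schema above fixes two limits that every row has to respect: the vector dimension (1536) and the maximum length of the content field (1000). A minimal sketch of a pre-insert check follows. DocumentValidator is a hypothetical helper, not part of the Milvus SDK, and it assumes the VarChar limit is counted in UTF-8 bytes rather than characters.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of the Milvus SDK) that checks a row against the schema. */
final class DocumentValidator {
    static final int VECTOR_DIM = 1536;          // mirrors MilvusCollection.VECTOR_DIM
    static final int CONTENT_MAX_LENGTH = 1000;  // mirrors withMaxLength(1000)

    private DocumentValidator() {}

    /** Throws IllegalArgumentException if the row would not fit the collection schema. */
    static void validate(String content, float[] vector) {
        if (content == null || vector == null) {
            throw new IllegalArgumentException("content and vector must not be null");
        }
        if (vector.length != VECTOR_DIM) {
            throw new IllegalArgumentException(
                "expected " + VECTOR_DIM + " dimensions but got " + vector.length);
        }
        // Assumption: the limit applies to UTF-8 bytes, so measure bytes, not chars.
        int byteLength = content.getBytes(StandardCharsets.UTF_8).length;
        if (byteLength > CONTENT_MAX_LENGTH) {
            throw new IllegalArgumentException(
                "content is " + byteLength + " bytes, limit is " + CONTENT_MAX_LENGTH);
        }
    }
}
```

Calling DocumentValidator.validate(content, vector) before building the insert request turns a schema mismatch into a clear local error instead of a rejected write.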

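4. Implement the Storage Service

4.1 Insert Service

import io.milvus.client.MilvusClient;
import io.milvus.grpc.MutationResult;
import io.milvus.param.R;
import io.milvus.param.collection.FlushParam;
import io.milvus.param.dml.InsertParam;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

@Service
public class MilvusStorageService {
    private final MilvusClient milvusClient;

    @Autowired
    public MilvusStorageService(MilvusClient milvusClient) {
        this.milvusClient = milvusClient;
    }

    public void storeDocument(String content, float[] vector) {
        // The SDK expects each float vector as a List<Float>, not a float[]
        List<Float> vectorValues = new ArrayList<>(vector.length);
        for (float value : vector) {
            vectorValues.add(value);
        }

        // Build the insert data: one list of values per field, one entry per row
        List<InsertParam.Field> fields = new ArrayList<>();
        fields.add(new InsertParam.Field(MilvusCollection.CONTENT_FIELD, Collections.singletonList(content)));
        fields.add(new InsertParam.Field(MilvusCollection.VECTOR_FIELD, Collections.singletonList(vectorValues)));

        InsertParam insertParam = InsertParam.newBuilder()
            .withCollectionName(MilvusCollection.COLLECTION_NAME)
            .withFields(fields)
            .build();

        // Run the insert and fail loudly if Milvus reports an error
        R<MutationResult> response = milvusClient.insert(insertParam);
        if (response.getStatus() != R.Status.Success.getCode()) {
            throw new IllegalStateException("Milvus insert failed: " + response.getMessage());
        }

        // Flush so the data is persisted right away (use with caution in production)
        milvusClient.flush(FlushParam.newBuilder()
            .addCollectionName(MilvusCollection.COLLECTION_NAME)
            .build());
    }
}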

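4.2 End-to-End Call Flow

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestBody;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class DocumentController {
    // OpenAIVectorizationService is not defined in this article
    @Autowired
    private OpenAIVectorizationService vectorizationService;

    @Autowired
    private MilvusStorageService storageService;

    @PostMapping("/upload")
    public ResponseEntity<String> uploadDocument(@RequestBody String content) {
        // Vectorize the document
        float[] vector = vectorizationService.vectorize(content);

        // Store it in Milvus
        storageService.storeDocument(content, vector);

        return ResponseEntity.ok("Document stored successfully");
    }
}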
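
The endpoint above stores the whole request body as one row, while the content field only holds 1000 units of text. One way to handle longer documents is to split them into chunks first and vectorize each chunk separately. The sketch below is an illustration under stated assumptions: TextChunker is a hypothetical helper, and it measures chunk size in UTF-8 bytes on the assumption that this is how the VarChar limit is counted.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits text into pieces small enough for the content field. */
final class TextChunker {
    private TextChunker() {}

    /** Splits text into consecutive chunks whose UTF-8 encoding is at most maxBytes long. */
    static List<String> chunk(String text, int maxBytes) {
        if (maxBytes < 4) {
            // A single code point can need up to 4 bytes in UTF-8.
            throw new IllegalArgumentException("maxBytes must be at least 4");
        }
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int currentBytes = 0;
        int i = 0;
        while (i < text.length()) {
            // Walk by code point so surrogate pairs are never split across chunks.
            int codePoint = text.codePointAt(i);
            int byteCount = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8).length;
            if (currentBytes + byteCount > maxBytes) {
                chunks.add(current.toString());
                current.setLength(0);
                currentBytes = 0;
            }
            current.appendCodePoint(codePoint);
            currentBytes += byteCount;
            i += Character.charCount(codePoint);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

A fixed-size split like this ignores sentence boundaries. For retrieval quality, splitting on paragraphs or sentences and only falling back to a hard cut is usually the better choice.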

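5. Collection Management

5.1 Initialize the Collection (at Application Startup)

import io.milvus.client.MilvusClient;
import io.milvus.param.R;
import io.milvus.param.collection.HasCollectionParam;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Application implements CommandLineRunner {
    @Autowired
    private MilvusClient milvusClient;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... args) {
        // hasCollection takes a parameter object and returns a wrapped Boolean
        R<Boolean> exists = milvusClient.hasCollection(HasCollectionParam.newBuilder()
            .withCollectionName(MilvusCollection.COLLECTION_NAME)
            .build());
        if (!Boolean.TRUE.equals(exists.getData())) {
            MilvusCollection.createCollection(milvusClient);
        }
    }
}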
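
The startup check runs as soon as the application boots, which may be before a freshly started Milvus container is ready to accept requests. A generic retry with exponential backoff is one way to tolerate that. The sketch below is plain Java and makes no claim about what the SDK retries on its own. Retry and withBackoff are hypothetical names introduced for illustration.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper: retries an action, doubling the wait after each failure. */
final class Retry {
    private Retry() {}

    static <T> T withBackoff(Callable<T> action, int maxAttempts, long initialDelayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        long delay = initialDelayMillis;
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry when the thread was asked to stop.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delay);
                    delay *= 2; // exponential backoff
                }
            }
        }
        // Every attempt failed: surface the most recent error.
        throw last;
    }
}
```

The collection check in run could then be wrapped as Retry.withBackoff(() -> ..., 5, 500), where the attempt count and initial delay are example values to tune. The action has to throw on failure for the retry to trigger, so a response whose status reports an error would need to be converted into an exception first.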

5.2 Create an Index (Optional)

import io.milvus.param.IndexType;
import io.milvus.param.MetricType;
import io.milvus.param.index.CreateIndexParam;

public void createIndex() {
    IndexType indexType = IndexType.IVF_FLAT;
    // Index parameters are passed as a JSON string; the inner quotes must be escaped
    String indexParam = "{\"nlist\":1024}";

    CreateIndexParam createIndexParam = CreateIndexParam.newBuilder()
        .withCollectionName(MilvusCollection.COLLECTION_NAME)
        .withFieldName(MilvusCollection.VECTOR_FIELD)
        .withIndexType(indexType)
        .withMetricType(MetricType.L2)
        .withExtraParam(indexParam)
        .build();

    milvusClient.createIndex(createIndexParam);
}
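
The index above uses MetricType.L2, which ranks results by Euclidean distance: the smaller the value, the closer the vectors. As far as the Milvus metric documentation describes it, the score is reported without the final square root, which does not change the ranking. The sketch below computes that squared distance in plain Java so the scores are easier to interpret. L2Distance is a hypothetical helper, not an SDK class.

```java
/** Hypothetical helper illustrating the squared Euclidean (L2) distance. */
final class L2Distance {
    private L2Distance() {}

    /** Returns the sum of squared component differences between a and b. */
    static double squaredL2(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = (double) a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }
}
```

For the vectors (0, 0) and (3, 4) this gives 25, the square of the familiar distance 5.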

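6. Key Considerations

  1. Batch inserts: when inserting many rows, pass one list per field with one entry per row, so that a single insert call carries the whole batch

    // Batch insert example
    List<String> contents = ...; // contents of multiple documents
    List<List<Float>> vectors = ...; // matching vectors, in the same order

    List<InsertParam.Field> fields = new ArrayList<>();
    fields.add(new InsertParam.Field(CONTENT_FIELD, contents));
    fields.add(new InsertParam.Field(VECTOR_FIELD, vectors));
    
  2. Connection management

    • Use a connection pool configuration (a clustered Milvus deployment is recommended; Milvus coordinates its nodes through etcd)
    • Handle connection timeouts and add a retry mechanism
  3. Data consistency

    • Call flush() after inserting important data
    • Use the atomic operations Milvus provides to keep data complete
  4. Performance tuning

    • Choose an index type (IVF_FLAT, HNSW, and so on) that fits the size of the data set
    • Tune index parameters such as nlist and M to improve query performance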
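
For the batch insert in item 1, a very large document set is usually sent as several moderately sized batches instead of one huge request. A minimal sketch of the splitting step follows. Batches is a hypothetical helper, and the batch size is left to the caller because the right value depends on vector dimension and request size limits.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that splits a list into consecutive fixed-size batches. */
final class Batches {
    private Batches() {}

    /** Returns batches of at most batchSize items, preserving the original order. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the view so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

The contents and the vectors need to be partitioned with the same batch size so that row i of each content batch still lines up with row i of the matching vector batch.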

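7. Overall Architecture Diagram

Spring Boot application
    │
    │ vectorization request
    ▼
[OpenAI/Cohere API]
    │
    │ returned vector
    ▼
[Business logic layer]
    │
    │ structured data
    ▼
[Milvus vector database]
    │
    ▼
(supports RAG retrieval later)

With these steps in place, a Spring Boot application can vectorize documents and store them in the Milvus vector database from end to end.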