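Milvus hit a bug: every collection was stuck at 0%. The logs indicated that the data in Milvus's internal RocksMQ message queue was corrupted and could not be repaired, so the only option was to re-import everything. So it's time to set up backups after all, and to replace RocksMQ with Pulsar while I'm at it.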
[2026/04/07 10:32:03.284 +00:00] [WARN] [adaptor/scanner_switchable.go:200] ["create underlying scanner for wal scanner, start a backoff"] [module=streamingnode] [component=scanner] [name=recovery] [channel=by-dev-rootcoord-dml_12:rw@104] [startMessageID=4/2] [nextInterval=2.535581757s] [error="decode rmqID fail with err: strconv.ParseUint: parsing \"CAQQAg==\": invalid syntax, id: CAQQAg==: invalid message id"] [errorVerbose="decode rmqID fail with err: strconv.ParseUint: parsing \"CAQQAg==\": invalid syntax, id: CAQQAg==: invalid message id\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/v2/streaming/walimpls/impls/rmq.unmarshalMessageID\n | \t/workspace/source/pkg/streaming/walimpls/impls/rmq/message_id.go:33\n | github.com/milvus-io/milvus/pkg/v2/streaming/walimpls/impls/rmq.(*walImpl).Read\n | \t/workspace/source/pkg/streaming/walimpls/impls/rmq/wal.go:86\n | github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*catchupScanner).createReaderWithBackoff\n | \t/workspace/source/internal/streamingnode/server/wal/adaptor/scanner_switchable.go:187\n | github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*catchupScanner).Do\n | \t/workspace/source/internal/streamingnode/server/wal/adaptor/scanner_switchable.go:86\n | github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*scannerAdaptorImpl).produceEventLoop\n | \t/workspace/source/internal/streamingnode/server/wal/adaptor/scanner_adaptor.go:198\n | [...repeated from below...]\nWraps: (2) decode rmqID fail with err: strconv.ParseUint: parsing \"CAQQAg==\": invalid syntax, id: CAQQAg==\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/v2/streaming/util/message.init\n | \t/workspace/source/pkg/streaming/util/message/message_id.go:16\n | runtime.doInit1\n | \t/usr/local/go/src/runtime/proc.go:7410\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:7377\n | runtime.main\n | 
\t/usr/local/go/src/runtime/proc.go:254\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1700\nWraps: (4) invalid message id\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.leafError"]
[2026/04/07 10:32:03.456 +00:00] [WARN] [adaptor/scanner_switchable.go:200] ["create underlying scanner for wal scanner, start a backoff"] [module=streamingnode] [component=scanner] [name=recovery] [channel=by-dev-rootcoord-dml_10:rw@104] [startMessageID=2/41] [nextInterval=4.105696061s] [error="decode rmqID fail with err: strconv.ParseUint: parsing \"CAIQKQ==\": invalid syntax, id: CAIQKQ==: invalid message id"] [errorVerbose="decode rmqID fail with err: strconv.ParseUint: parsing \"CAIQKQ==\": invalid syntax, id: CAIQKQ==: invalid message id\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/v2/streaming/walimpls/impls/rmq.unmarshalMessageID\n | \t/workspace/source/pkg/streaming/walimpls/impls/rmq/message_id.go:33\n | github.com/milvus-io/milvus/pkg/v2/streaming/walimpls/impls/rmq.(*walImpl).Read\n | \t/workspace/source/pkg/streaming/walimpls/impls/rmq/wal.go:86\n | github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*catchupScanner).createReaderWithBackoff\n | \t/workspace/source/internal/streamingnode/server/wal/adaptor/scanner_switchable.go:187\n | github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*catchupScanner).Do\n | \t/workspace/source/internal/streamingnode/server/wal/adaptor/scanner_switchable.go:86\n | github.com/milvus-io/milvus/internal/streamingnode/server/wal/adaptor.(*scannerAdaptorImpl).produceEventLoop\n | \t/workspace/source/internal/streamingnode/server/wal/adaptor/scanner_adaptor.go:198\n | [...repeated from below...]\nWraps: (2) decode rmqID fail with err: strconv.ParseUint: parsing \"CAIQKQ==\": invalid syntax, id: CAIQKQ==\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/pkg/v2/streaming/util/message.init\n | \t/workspace/source/pkg/streaming/util/message/message_id.go:16\n | runtime.doInit1\n | \t/usr/local/go/src/runtime/proc.go:7410\n | runtime.doInit\n | \t/usr/local/go/src/runtime/proc.go:7377\n | runtime.main\n | 
\t/usr/local/go/src/runtime/proc.go:254\n | runtime.goexit\n |
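A quick way to look at the message IDs quoted in these errors is to decode them. The snippet below is only a sketch: it assumes GNU `base64`, `od`, `tr` and `sed` are available, and `decode_msg_id` is a helper name made up for this post.

```shell
#!/usr/bin/env bash
# Print the raw bytes of a base64-encoded message ID as decimal numbers.
decode_msg_id() {
    printf '%s' "$1" | base64 -d | od -An -tu1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

decode_msg_id 'CAQQAg=='; echo   # 8 4 16 2
decode_msg_id 'CAIQKQ=='; echo   # 8 2 16 41
```

The decoded values 4 and 2 line up with `startMessageID=4/2` on the first log line, and 2 and 41 with `startMessageID=2/41` on the second. So the stored ID looks like a binary-encoded pair of numbers rather than the plain integer that `strconv.ParseUint` expects. This is only an observation from the log; on its own it does not show what wrote the ID in that form.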
Steps:
- Download milvus-backup
- Extract it and place it in the target directory
- Edit the configuration, including milvus.yaml
- Restart Milvus
Download milvus-backup
milvus-backup seems to support only Linux.
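Extract it and place it in the target directory
# Upload the archive to the server, then extract it
tar -zxvf milvus-backup_0.5.12_Linux_x86_64.tar.gz
# Place milvus-backup and create backup.yaml in this layout
├── configs
│   └── backup.yaml
└── milvus-backup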
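Before going further, it can help to confirm that the files ended up where the tool will look for them. A minimal sketch, assuming milvus-backup reads `configs/backup.yaml` relative to the directory it is run from (the cleanup script later in this post also refers to a configs folder); `check_layout` is a hypothetical helper, not part of milvus-backup.

```shell
#!/usr/bin/env bash
# Report anything missing from the expected layout; returns non-zero on problems.
check_layout() {
    local dir="${1:-.}" status=0
    if [ ! -x "$dir/milvus-backup" ]; then
        echo "missing or not executable: $dir/milvus-backup"
        status=1
    fi
    if [ ! -f "$dir/configs/backup.yaml" ]; then
        echo "missing: $dir/configs/backup.yaml"
        status=1
    fi
    return "$status"
}
```

Run it as `check_layout /path/to/workdir`; no output and exit code 0 mean both files are in place.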
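Edit the configuration
backup.yaml
# milvus: connection details of the source Milvus instance to back up
milvus:
  address: 127.0.0.1 # Milvus host; use the container name from docker-compose-base.yml instead if milvus-backup runs inside the same Docker network
  port: 19530
  # If authentication is not enabled, leave the lines below commented out
  # authorizationEnabled: true
  # user: "root" # your username, if authentication is enabled
  # password: "Milvus" # your password, if authentication is enabled
  # tlsMode: 0 # 0 disables TLS
# minio: object storage used by the Milvus instance
minio:
  storageType: "minio" # use "minio" for this setup
  address: 127.0.0.1 # MinIO host; use the container name from docker-compose-base.yml instead if milvus-backup runs inside the same Docker network
  port: 9000
  accessKeyID: ${MINIO_USER} # replace with the actual value of MINIO_USER
  secretAccessKey: ${MINIO_PASSWORD} # replace with the actual value of MINIO_PASSWORD
  useSSL: false
  bucketName: "a-bucket" # default Milvus bucket; a-bucket in Docker Compose deployments
  rootPath: "files" # Milvus storage root path, same as in docker-compose-base.yml
# backup: target storage for the backup files
# Note: the sample backup.yaml shipped with milvus-backup lists the backup* keys (with backupAddress rather than address) under minio:, so check this layout against your milvus-backup version
backup:
  backupStorageType: "minio" # backup storage type
  address: 127.0.0.1 # backup storage address, same as minio.address
  backupPort: 9000
  backupAccessKeyID: ${MINIO_USER} # same as accessKeyID
  backupSecretAccessKey: ${MINIO_PASSWORD} # same as secretAccessKey
  backupBucketName: "a-bucket" # bucket that holds the backup files; using the same bucket as bucketName is recommended
  backupRootPath: "backup" # root path for backup files; a dedicated directory is recommended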
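Since the `${MINIO_USER}` and `${MINIO_PASSWORD}` entries are meant to be replaced by hand, it is easy to leave one behind. The check below is a small sketch for catching that; `has_placeholders` is a hypothetical helper, and it treats everything after a `#` as a comment, which is a simplification that would misfire on values containing `#`.

```shell
#!/usr/bin/env bash
# Exit 0 and print the offending lines if the file still contains ${NAME}
# placeholders outside of comments; exit non-zero if none are found.
has_placeholders() {
    sed 's/#.*//' "$1" | grep -nE '\$\{[A-Za-z_][A-Za-z0-9_]*\}'
}
```

For example, `if has_placeholders configs/backup.yaml; then echo "fill these in first"; fi` prints the line numbers that still need real values.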
In milvus.yaml, change the mq type value and the pulsar address value
...
mq:
  # Default value: "default"
  # Valid values: [default, pulsar, kafka, rocksmq, woodpecker]
  type: pulsar # change this
...
pulsar:
  # IP address of Pulsar service.
  # Environment variable: PULSAR_ADDRESS
  # pulsar.address and pulsar.port together generate the valid access to Pulsar.
  # Pulsar preferentially acquires the valid IP address from the environment variable PULSAR_ADDRESS when Milvus is started.
  # Default value applies when Pulsar is running on the same network with Milvus.
  address: pulsar # change this; use the Compose service name, not the container_name
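To double-check the edit before restarting, a small helper can print the `mq.type` value from the file. This is a minimal sketch, assuming the real milvus.yaml keeps `mq:` at the top level with its keys indented beneath it; `mq_type` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print the value of "type" found directly under the top-level "mq:" key.
mq_type() {
    awk '
        /^mq:/   { in_mq = 1; next }   # entering the mq section
        /^[^ #]/ { in_mq = 0 }         # any other top-level key ends it
        in_mq && /^[[:space:]]+type:/ { print $2; exit }
    ' "$1"
}
```

Running `mq_type ./milvus.yaml` should print `pulsar` once the change is in place.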
Docker configuration. Make sure you set up your own .env file. If pulling the Docker images is slow or blocked, consider using a proxy.
services:
  minio:
    image: quay.io/minio/minio:RELEASE.2023-12-20T01-00-02Z
    container_name: ragflow-minio
    command: server --console-address ":9001" /data
    ports:
      - ${MINIO_PORT}:9000
      - ${MINIO_CONSOLE_PORT}:9001
    env_file: .env
    environment:
      - MINIO_ROOT_USER=${MINIO_USER}
      - MINIO_ROOT_PASSWORD=${MINIO_PASSWORD}
      - TZ=${TIMEZONE}
    volumes:
      - minio_data:/data
    networks:
      - ragflow
    restart: unless-stopped
  etcd:
    container_name: ragflow-milvus-etcd
    image: quay.io/coreos/etcd:v3.5.18
    environment:
      - ETCD_AUTO_COMPACTION_MODE=revision
      - ETCD_AUTO_COMPACTION_RETENTION=1000
      - ETCD_QUOTA_BACKEND_BYTES=4294967296
      - ETCD_SNAPSHOT_COUNT=50000
    volumes:
      - etcd_data:/etcd # named volume, avoids path conflicts
    command: etcd -advertise-client-urls=http://etcd:2379 -listen-client-urls http://0.0.0.0:2379 --data-dir /etcd
    healthcheck:
      test: ["CMD", "etcdctl", "endpoint", "health"]
      interval: 30s
      timeout: 20s
      retries: 3
    networks:
      - ragflow
    restart: unless-stopped
  standalone:
    container_name: ragflow-milvus-standalone
    image: milvusdb/milvus:v2.6.13
    command: ["milvus", "run", "standalone"]
    security_opt:
      - seccomp:unconfined
    environment:
      MINIO_ACCESS_KEY_ID: ${MINIO_USER} # same as MINIO_USER in the base .env
      MINIO_SECRET_ACCESS_KEY: ${MINIO_PASSWORD} # same as MINIO_PASSWORD in the base .env
      ETCD_ENDPOINTS: etcd:2379
      # MinIO address: points to the minio service from the base file (service name minio, internal port 9000)
      MINIO_ADDRESS: minio:9000
      # MQ_TYPE: woodpecker
    volumes:
      - milvus_data:/var/lib/milvus # named volume
      - ./milvus.yaml:/milvus/configs/milvus.yaml # added: mount the config file
    ports:
      - "19530:19530" # Milvus gRPC
      - "9091:9091" # Milvus metrics
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
              device_ids: ["1"]
    depends_on:
      - etcd
      - minio # depends on minio from the base file
      - pulsar # added: Milvus now uses Pulsar as its message queue
    networks:
      - ragflow
    restart: unless-stopped
  # Attu, the Milvus web UI
  attu:
    container_name: ragflow-milvus-attu
    image: zilliz/attu:latest
    environment:
      MILVUS_URL: standalone:19530
    ports:
      - "8987:3000"
    depends_on:
      - "standalone"
    # profiles:
    #   - all
    #   - attu
    networks:
      - ragflow # put Attu on the same network so it can reach standalone:19530
    restart: unless-stopped
  pulsar:
    container_name: ragflow-milvus-pulsar
    image: apachepulsar/pulsar:3.3.4 # a stable release is recommended
    command: bin/pulsar standalone
    ports:
      - "6650:6650" # Pulsar client port
      - "8080:8080" # Pulsar HTTP API port
    volumes:
      - pulsar_data:/pulsar/data # persist data
    networks:
      - ragflow
    restart: unless-stopped
volumes:
  minio_data:
    driver: local
  etcd_data:
    driver: local
  milvus_data:
    driver: local
  pulsar_data:
    driver: local
networks:
  ragflow:
    driver: bridge
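Milvus only becomes usable once etcd, MinIO and Pulsar are up, so a script that runs a backup right after starting the stack may want to wait first. The helper below is a generic retry sketch; `wait_for` is a hypothetical name, and the example URL assumes `curl` is installed and that the Milvus health endpoint is served at `/healthz` on the published port 9091.

```shell
#!/usr/bin/env bash
# Retry a command until it succeeds or the attempts run out.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Pause between attempts, but not after the last one
        if ((i < attempts)); then
            sleep "$delay"
        fi
    done
    return 1
}

# Example (not executed here):
#   wait_for 30 5 curl -fsS http://localhost:9091/healthz
```

The function returns 0 as soon as the command succeeds and 1 if every attempt fails, so it can gate the backup step with a plain `&&`.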
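Restart Milvus
# Start the containers
docker compose -f <path to the compose file> -p <project name> up -d
# Back up Milvus
./milvus-backup create -n my_backup_$(date +%Y%m%d_%H%M%S)
# List the backups
./milvus-backup list
# Restore a backup (-s appends a suffix to the restored collection names)
./milvus-backup restore -n <backup name> -s _restored
To check whether the backup succeeded, open the MinIO console at localhost:9001 and confirm that the a-bucket bucket contains the backup path (backupRootPath in backup.yaml).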
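The same check can be done from the command line by looking for the backup name in the `list` output. This is a sketch under one assumption worth verifying on your version: that `milvus-backup list` prints each backup name on its own line. `backup_exists` is a hypothetical helper.

```shell
#!/usr/bin/env bash
BACKUP_BIN="${BACKUP_BIN:-./milvus-backup}"

# Succeed if a backup with exactly this name appears in the list output.
backup_exists() {
    "$BACKUP_BIN" list 2>/dev/null | grep -qxF -- "$1"
}
```

For example, `backup_exists my_backup_20260407_103203 && echo "backup found"`; the name here is only an illustration.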
Scheduled backups
- Create milvus_cleanup_backup.sh, which deletes backups older than the retention period
#!/bin/bash
# --- Configuration ---
# 1. Path to the milvus-backup binary
BACKUP_BIN="<path to milvus-backup>" # e.g. "/home/ykt-root/文档/_dockers/ragflow/milvus-backup"
# 2. Working directory for the script (the directory that contains the configs folder)
WORK_DIR="<path to the directory containing configs>" # e.g. "/home/ykt-root/文档/_dockers/ragflow"
# 3. Number of days to keep backups
RETENTION_DAYS=30
# 4. Log file path
LOG_FILE="/var/log/milvus_backup.log"
# --- Current timestamp, used for the age comparison ---
CURRENT_DATE=$(date +%s)
# --- Switch to the working directory; stop if that fails ---
cd "$WORK_DIR" || exit 1
echo "$(date) - Starting cleanup of backups older than $RETENTION_DAYS days." >> "$LOG_FILE"
# 1. Get all backup names with the list command
BACKUP_LIST=$("$BACKUP_BIN" list 2>/dev/null)
# 2. Iterate over each backup name
echo "$BACKUP_LIST" | while read -r BACKUP_NAME; do
    # Skip empty lines
    if [ -z "$BACKUP_NAME" ]; then
        continue
    fi
    # Extract the date from a name ending in _YYYYMMDD_HHMMSS,
    # e.g. auto_backup_20231015_143000 or my_backup_20231015_143000
    BACKUP_DATE_STR=$(echo "$BACKUP_NAME" | grep -oP '_\K[0-9]{8}(?=_[0-9]{6}$)')
    if [ -z "$BACKUP_DATE_STR" ]; then
        echo "Warning: could not parse a date from backup name $BACKUP_NAME, skipping." >> "$LOG_FILE"
        continue
    fi
    # Convert the date string to a timestamp (requires GNU date)
    BACKUP_DATE=$(date -d "${BACKUP_DATE_STR}" +%s 2>/dev/null)
    if [ -z "$BACKUP_DATE" ]; then
        echo "Warning: date $BACKUP_DATE_STR is invalid, skipping." >> "$LOG_FILE"
        continue
    fi
    # Age of the backup in days
    AGE_DAYS=$(( (CURRENT_DATE - BACKUP_DATE) / 86400 ))
    # Delete the backup if it is older than the retention period
    if [ "$AGE_DAYS" -gt "$RETENTION_DAYS" ]; then
        echo "$(date) - Backup $BACKUP_NAME is $AGE_DAYS days old, more than $RETENTION_DAYS days; deleting..." >> "$LOG_FILE"
        if "$BACKUP_BIN" delete --name "$BACKUP_NAME" >> "$LOG_FILE" 2>&1; then
            echo "$(date) - Backup $BACKUP_NAME deleted." >> "$LOG_FILE"
        else
            echo "$(date) - Failed to delete backup $BACKUP_NAME." >> "$LOG_FILE"
        fi
    fi
done
echo "----------------------------------------" >> "$LOG_FILE"
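The age check in the script boils down to three steps: pull the date out of the backup name, convert it to an epoch timestamp, and divide the difference by 86400. The function below isolates that logic as a sketch. It assumes GNU `date` (for `-d`) and names ending in `_YYYYMMDD_HHMMSS`; `backup_age_days` is a hypothetical helper, and it pins `TZ=UTC` so the result does not depend on the local time zone.

```shell
#!/usr/bin/env bash
# Print the age in whole days of a backup named like prefix_YYYYMMDD_HHMMSS,
# relative to the epoch timestamp given as the second argument.
backup_age_days() {
    local name="$1" now="$2" day ts
    day=$(printf '%s\n' "$name" | grep -oE '[0-9]{8}_[0-9]{6}$' | cut -c1-8)
    [ -n "$day" ] || return 1
    ts=$(TZ=UTC date -d "$day" +%s 2>/dev/null) || return 1
    echo $(( (now - ts) / 86400 ))
}
```

For instance, `backup_age_days auto_backup_20231015_143000 "$(TZ=UTC date -d 20231115 +%s)"` gives 31, so with a 30-day retention that backup would be deleted.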
- Make the script executable:
chmod +x <path to milvus_cleanup_backup.sh>
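- Schedule it with cron
# Edit the crontab (on first use, crontab -e may ask you to choose a default editor)
crontab -e
# Run the cleanup script every day at 03:00
0 3 * * * <path to milvus_cleanup_backup.sh>
# List the scheduled jobs
crontab -l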
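The cron entry above only removes old backups. If backups should also be created on a schedule, a companion script could look like the sketch below. `create_backup` is a hypothetical helper, the `auto_backup_` prefix follows the example name in the cleanup script's comments, and the function assumes it is run from the directory that holds milvus-backup and its configuration.

```shell
#!/usr/bin/env bash
BACKUP_BIN="${BACKUP_BIN:-./milvus-backup}"

# Create a timestamped backup and print its name on success.
create_backup() {
    local name
    name="auto_backup_$(date +%Y%m%d_%H%M%S)"
    # Send the tool's own output to stderr so stdout carries only the name
    "$BACKUP_BIN" create -n "$name" >&2 && echo "$name"
}
```

Saved as its own script that ends with a call to `create_backup`, it can be given a second crontab line that runs shortly before the 03:00 cleanup.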