Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: full handleAddOrUpdateVMIMigration ===" && git show upstream/master:pkg/controller/kubevirt.go | sed -n '96,210p'
Ran terminal command: cd /root/f/kube-ovn && echo "=== MC: full migrateVM function ===" && git show upstream/release-1.12-mc:pkg/controller/pod.go | sed -n '2297,2360p'
Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: VM cleanup on Pod deletion ===" && git show upstream/master:pkg/controller/pod.go | grep -A 30 "CleanLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && echo "=== MC: VM cleanup on Pod deletion ===" && git show upstream/release-1.12-mc:pkg/controller/pod.go | grep -A 30 "CleanLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: context of the Clean call ===" && git show upstream/master:pkg/controller/pod.go | grep -B 30 "CleanLogicalSwitchPortMigrateOptions" | tail -40
Now let me look at the actual implementation of the Reset function:
Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: ResetLogicalSwitchPortMigrateOptions ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 40 "func.*ResetLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: CleanLogicalSwitchPortMigrateOptions ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 30 "func.*CleanLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: SetLogicalSwitchPortMigrateOptions ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 50 "func.*SetLogicalSwitchPortMigrateOptions"
Now let me check whether the MC branch's Reset/Set/Clean implementations are the same:
Ran terminal command: cd /root/f/kube-ovn && git diff upstream/master upstream/release-1.12-mc -- pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 10 -B 5 "ResetLogicalSwitchPortMigrateOptions|CleanLogicalSwitchPortMigrateOptions|SetLogicalSwitchPortMigrateOptions" | head -60
There is a small difference! Let me look at it in more detail:
Ran terminal command: cd /root/f/kube-ovn && echo "=== Master Clean ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 15 "CleanLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && echo "=== MC Clean ===" && git show upstream/release-1.12-mc:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 20 "CleanLogicalSwitchPortMigrateOptions"
In-depth analysis summary
1. Differences in the LSP migration function implementations
| Function | Master | MC | Difference |
|---|---|---|---|
| SetLogicalSwitchPortMigrateOptions | ✅ Same | ✅ Same | None |
| ResetLogicalSwitchPortMigrateOptions | ✅ Same | ✅ Same | None |
| CleanLogicalSwitchPortMigrateOptions | Deletes requested-chassis + activation-strategy | Deletes only requested-chassis | Differs |
2. Full live-migration flow comparison
┌─────────────────────────────────────────────────────────────────────────────┐
│ Live migration phases                                                       │
├─────────────┬───────────────────────────────┬───────────────────────────────┤
│ Phase       │ Master                        │ MC                            │
├─────────────┼───────────────────────────────┼───────────────────────────────┤
│ Start       │ Set(src,target)               │ Set(src,target)               │
│             │ requested-chassis=src,target  │ requested-chassis=src,target  │
│             │ activation-strategy=rarp      │ activation-strategy=rarp      │
├─────────────┼───────────────────────────────┼───────────────────────────────┤
│ Success     │ Reset(false)                  │ Reset(false)                  │
│             │ requested-chassis=target      │ requested-chassis=target      │
│             │ delete activation-strategy    │ delete activation-strategy    │
├─────────────┼───────────────────────────────┼───────────────────────────────┤
│ Failure     │ Reset(true)                   │ Reset(true)                   │
│             │ requested-chassis=src         │ requested-chassis=src         │
│             │ delete activation-strategy    │ delete activation-strategy    │
├─────────────┼───────────────────────────────┼───────────────────────────────┤
│ Pod delete  │ Clean()                       │ Clean()                       │
│             │ delete requested-chassis      │ delete requested-chassis      │
│             │ delete activation-strategy ⚠️ │ ❌ keeps activation-strategy  │
└─────────────┴───────────────────────────────┴───────────────────────────────┘
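The phase table can be modelled on a plain options map to make the state transitions easier to follow. This is only a minimal sketch, not the kube-ovn code: `setMigrate`, `resetMigrate`, `cleanMigrate` and the `dropActivation` flag are hypothetical names, and the real functions update an OVN NB Logical_Switch_Port through the OVN client rather than a Go map.

```go
package main

import "fmt"

// Option keys as they appear in the comparison above.
const (
	requestedChassis   = "requested-chassis"
	activationStrategy = "activation-strategy"
)

// setMigrate models the "Start" row: both chassis are requested and the
// activation strategy is set to rarp.
func setMigrate(opts map[string]string, src, target string) {
	opts[requestedChassis] = fmt.Sprintf("%s,%s", src, target)
	opts[activationStrategy] = "rarp"
}

// resetMigrate models the "Success" and "Failure" rows: the port is pinned
// to a single chassis (src if the migration failed, target otherwise) and
// activation-strategy is removed.
func resetMigrate(opts map[string]string, src, target string, failed bool) {
	if failed {
		opts[requestedChassis] = src
	} else {
		opts[requestedChassis] = target
	}
	delete(opts, activationStrategy)
}

// cleanMigrate models the "Pod delete" row. dropActivation=true mirrors the
// Master behaviour described above; false mirrors the MC behaviour.
func cleanMigrate(opts map[string]string, dropActivation bool) {
	delete(opts, requestedChassis)
	if dropActivation {
		delete(opts, activationStrategy)
	}
}

func main() {
	opts := map[string]string{}

	setMigrate(opts, "node-a", "node-b")
	fmt.Println("after Set:         ", opts)

	resetMigrate(opts, "node-a", "node-b", false)
	fmt.Println("after Reset(false):", opts)

	cleanMigrate(opts, true)
	fmt.Println("after Clean:       ", opts)
}
```

Calling `cleanMigrate(opts, false)` right after `setMigrate` leaves `activation-strategy` in the map, which is the MC-side difference flagged in the last row.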
3. Key bugs in Master
Problem 1: MigrationUID is not validated when reading node information
// Current Master code
if vmi.Status.MigrationState != nil {
    srcNodeName = vmi.Status.MigrationState.SourceNode    // may be left over from the previous migration!
    targetNodeName = vmi.Status.MigrationState.TargetNode // may be left over from the previous migration!
}
Problem 2: An early return causes cleanup to be skipped
if vmiMigration.Status.MigrationState.Completed {
    return nil // the MigrationSucceeded/Failed handling may be skipped entirely!
}
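One way to keep the cleanup from being skipped is to handle the terminal phases before consulting the Completed flag. The sketch below illustrates only that ordering. The `migration` type, the phase constants, the action strings and `handle` are all hypothetical stand-ins rather than KubeVirt or kube-ovn types, and this is not claimed to be the fix adopted upstream.

```go
package main

import "fmt"

// phase is a simplified stand-in for the migration phase reported by KubeVirt.
type phase string

const (
	phaseScheduling phase = "Scheduling"
	phaseSucceeded  phase = "Succeeded"
	phaseFailed     phase = "Failed"
)

// migration is a hypothetical, flattened view of a migration event.
type migration struct {
	Phase     phase
	Completed bool
}

// handle returns the action the controller should take for an event.
// Terminal phases are checked first, so a Completed flag can no longer
// short-circuit the cleanup path.
func handle(m migration) string {
	switch m.Phase {
	case phaseSucceeded:
		return "cleanup-after-success"
	case phaseFailed:
		return "cleanup-after-failure"
	}
	if m.Completed {
		return "skip" // nothing left to do for a finished, non-terminal event
	}
	if m.Phase == phaseScheduling {
		return "set-migrate-options"
	}
	return "skip"
}

func main() {
	events := []migration{
		{Phase: phaseScheduling},
		{Phase: phaseSucceeded, Completed: true},
		{Phase: phaseFailed, Completed: true},
	}
	for _, m := range events {
		fmt.Printf("%s completed=%v -> %s\n", m.Phase, m.Completed, handle(m))
	}
}
```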
4. The actual scenario in Issue #6220
First migration: Node-A → Node-B (succeeded)
- vmi.Status.MigrationState = {SourceNode: A, TargetNode: B, MigrationUID: xxx}
Second migration: Node-B → Node-A (starting)
- A new vmiMigration is created with UID = yyy
- vmi.Status.MigrationState has not been updated yet and still describes the previous migration!
What the Master controller does:
- Reads srcNodeName = A (wrong! should be B)
- Reads targetNodeName = B (wrong! should be A)
- Check: sourceNode == targetPod.Spec.NodeName (B == A? depends on timing)
- May skip SetLogicalSwitchPortMigrateOptions!
Result:
- OVN does not get the correct requested-chassis
- The OVS interface never becomes ready
- Timeout error: "ovs interface xxx is not ready after 30s"
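The stale-state race in this scenario can be guarded against by comparing UIDs before trusting the node names. The following is a minimal sketch under assumptions: the structs are simplified stand-ins for the KubeVirt objects, `migrationNodes` is a hypothetical helper, and the actual check in PR #6239 may be written differently.

```go
package main

import "fmt"

// migrationState is a simplified stand-in for vmi.Status.MigrationState.
type migrationState struct {
	MigrationUID string
	SourceNode   string
	TargetNode   string
}

// vmi is a simplified stand-in for the VirtualMachineInstance status.
type vmi struct {
	MigrationState *migrationState
}

// migrationNodes returns the source and target nodes only when the VMI's
// migration state belongs to the migration identified by uid. A nil state
// or a UID mismatch means the state is missing or stale, so the caller
// should retry later instead of acting on old node names.
func migrationNodes(v vmi, uid string) (src, target string, ok bool) {
	st := v.MigrationState
	if st == nil || st.MigrationUID != uid {
		return "", "", false
	}
	return st.SourceNode, st.TargetNode, true
}

func main() {
	// State left behind by the first migration (A -> B, UID xxx).
	v := vmi{MigrationState: &migrationState{
		MigrationUID: "xxx", SourceNode: "node-a", TargetNode: "node-b",
	}}

	// The second migration has UID yyy, so the old state must be rejected.
	if _, _, ok := migrationNodes(v, "yyy"); !ok {
		fmt.Println("migration yyy: state is stale, retry later")
	}

	// Once the VMI status catches up, the node names can be used.
	v.MigrationState = &migrationState{
		MigrationUID: "yyy", SourceNode: "node-b", TargetNode: "node-a",
	}
	if src, target, ok := migrationNodes(v, "yyy"); ok {
		fmt.Printf("migration yyy: %s -> %s\n", src, target)
	}
}
```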
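5. Conclusion
| Dimension | Master | MC |
|---|---|---|
| LSP operation functions | Essentially the same | Essentially the same |
| Node information source | Has a stale-state bug | Read from Pod annotations, so not affected |
| Migration trigger | Watches the VMIMigration CRD | Triggered by Pod annotation changes |
| Reliability | Depends on when vmi.Status is synced | Depends on kubevirt writing the annotation |
The MigrationUID validation in PR #6239 is the key fix for the Master problem.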
Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/ovs/ovn-nb-logical_switch_port.go 2>/dev/null | grep -A 80 "SetLogicalSwitchPortMigrateOptions|ResetLogicalSwitchPortMigrateOptions|CleanLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/daemon/handler.go 2>/dev/null | grep -B 5 -A 50 "SetLogicalSwitchPortMigrateOptions|MigrateOptions|migrate"
Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/daemon/controller.go 2>/dev/null | grep -B 5 -A 50 "SetLogicalSwitchPortMigrateOptions|MigrateOptions|Migrate|kubevirt"
Ran terminal command: cd /root/f/kube-ovn && git ls-tree --name-only -r upstream/release-1.12-mc | xargs -I {} sh -c 'git show upstream/release-1.12-mc:{} 2>/dev/null | grep -l "SetLogicalSwitchPortMigrateOptions|CleanLogicalSwitchPortMigrateOptions" && echo {}'
Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/controller/pod.go 2>/dev/null | grep -B 10 -A 30 "SetLogicalSwitchPortMigrateOptions|CleanLogicalSwitchPortMigrateOptions"
Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/controller/pod.go 2>/dev/null | grep -B 40 "isMigrate|migrated|migratedFail" | head -100
Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/controller/pod.go 2>/dev/null | grep -A 80 "func.*migrateVM"
Now let me compare this with the implementation on the current fix-migration branch:
Ran terminal command: cd /root/f/kube-ovn && git show HEAD:pkg/controller/pod.go 2>/dev/null | grep -B 5 -A 30 "CleanLogicalSwitchPortMigrateOptions"
I now have enough information for a full comparison. Let me summarize:
Release 1.12-mc vs. the current fix-migration branch: LSP control logic comparison
I. Different trigger mechanisms
| Dimension | release-1.12-mc | fix-migration (master) |
|---|---|---|
| Source of migration state | Pod annotations (e.g. util.MigrationPhaseAnnotation) | VirtualMachineInstanceMigration CRD status |
| When it fires | Checked in reconcileAllocateSubnets on Pod Add/Update | Add/Update events of VirtualMachineInstanceMigration |
| Controller | migrateVM() in pod.go | handleAddOrUpdateVMIMigration() in kubevirt.go |
II. LSP options operations compared
1. At migration start (SetLogicalSwitchPortMigrateOptions)
| | release-1.12-mc | fix-migration |
|---|---|---|
| Trigger condition | MigrationPhaseStarted annotation | MigrationScheduling phase |
| Operation | Identical ✅ | |
| Result | requested-chassis=src,target; activation-strategy=rarp | Identical ✅ |
// Both implementations are identical
lsp.Options["requested-chassis"] = fmt.Sprintf("%s,%s", srcNodeName, targetNodeName)
lsp.Options["activation-strategy"] = "rarp"
2. On migration success
| | release-1.12-mc | fix-migration |
|---|---|---|
| Trigger condition | MigrationPhaseSucceeded annotation | MigrationSucceeded phase |
| Operation | ResetLogicalSwitchPortMigrateOptions(..., false) | CleanLogicalSwitchPortMigrateOptions() |
| Result | requested-chassis=targetNode; deletes activation-strategy | Deletes requested-chassis; deletes activation-strategy |
⚠️ Difference: 1.12-mc keeps requested-chassis=target, while fix-migration clears everything
3. On migration failure
| | release-1.12-mc | fix-migration |
|---|---|---|
| Trigger condition | MigrationPhaseFailed annotation | MigrationFailed phase |
| Operation | ResetLogicalSwitchPortMigrateOptions(..., true) | CleanLogicalSwitchPortMigrateOptions() |
| Result | requested-chassis=srcNode; deletes activation-strategy | Deletes requested-chassis; deletes activation-strategy |
⚠️ Difference: 1.12-mc rolls back to requested-chassis=src, while fix-migration clears everything
4. Cleanup on Pod deletion
| | release-1.12-mc | fix-migration |
|---|---|---|
| Operation | CleanLogicalSwitchPortMigrateOptions() | CleanLogicalSwitchPortMigrateOptions() |
| Result | Same call on both sides ✅ (the Clean implementations differ, see item 5) | |
5. CleanLogicalSwitchPortMigrateOptions implementations compared
| | release-1.12-mc | fix-migration |
|---|---|---|
| Deletes requested-chassis | ✅ | ✅ |
| Deletes activation-strategy | ❌ Not deleted | ✅ Deleted |
Key difference: CleanLogicalSwitchPortMigrateOptions in 1.12-mc deletes only requested-chassis and leaves activation-strategy in place!
III. Key code compared
Clean implementation in release-1.12-mc:
delete(lsp.Options, "requested-chassis")
// Note: activation-strategy is not deleted
Clean implementation in fix-migration:
delete(lsp.Options, "requested-chassis")
delete(lsp.Options, "activation-strategy") // ✅ this extra line
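IV. Conclusion
| Aspect | Consistent? | Notes |
|---|---|---|
| Migration start (Set) | ✅ Consistent | Both set requested-chassis=src,target + activation-strategy=rarp |
| Migration success (Reset/Clean) | ❌ Inconsistent | 1.12-mc uses Reset and keeps target; fix-migration uses Clean and clears everything |
| Migration failure (Reset/Clean) | ❌ Inconsistent | 1.12-mc uses Reset and rolls back to src; fix-migration uses Clean and clears everything |
| Cleanup on Pod deletion | ⚠️ Partly consistent | Both call Clean, but Clean in 1.12-mc does not delete activation-strategy |
| Clean method implementation | ❌ Inconsistent | fix-migration additionally deletes activation-strategy |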
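V. Recommendations
Your fix-migration branch is actually more thorough than 1.12-mc:
- It uses Clean uniformly once a migration finishes. This is correct: OVN routes traffic based on the actual chassis binding, so requested-chassis does not need to be kept.
- Its Clean method also deletes activation-strategy. This is also correct, because it avoids leftover state.
To match 1.12-mc exactly, you would need to:
- Change MigrationSucceeded back to ResetLogicalSwitchPortMigrateOptions(..., false)
- Change MigrationFailed back to ResetLogicalSwitchPortMigrateOptions(..., true)
However, I recommend keeping your current implementation, because:
- Clean is the safer approach
- It is also the direction PR #5833 took
- A leftover requested-chassis may cause problems in later migrations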
You're right, let me re-verify:
Ran terminal command: cd /root/f/kube-ovn && git fetch upstream master && git log upstream/master --oneline --grep="5833|clean migrate" | head -10
Ran terminal command: cd /root/f/kube-ovn && git show upstream/master:pkg/controller/kubevirt.go 2>/dev/null | grep -A 20 "MigrationSucceeded|MigrationFailed"
Strange! Let me look at the contents of commit 9681fc70:
Ran terminal command: cd /root/f/kube-ovn && git show 9681fc70 --stat && echo "---" && git show 9681fc70 -p -- "*.go" | head -80
Ran terminal command: cd /root/f/kube-ovn && git log upstream/master --oneline | head -20
Ran terminal command: cd /root/f/kube-ovn && git branch -a --contains 9681fc70 | head -10
Ran terminal command: cd /root/f/kube-ovn && git merge-base upstream/master 9681fc70
Ran terminal command: cd /root/f/kube-ovn && git log upstream/master --oneline --all -- pkg/controller/kubevirt.go | head -20
I found something! Let me check PR #5873 "fix migrate failed":
Ran terminal command: cd /root/f/kube-ovn && git show 25504091 -p -- pkg/controller/kubevirt.go | head -80
Found it! PR #5873 "fix migrate failed" reverted the change made in #5833!
Let me confirm the commit order:
Ran terminal command: cd /root/f/kube-ovn && git log upstream/master --oneline -- pkg/controller/kubevirt.go | head -15
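Mystery solved!
Timeline (oldest to newest):
| Order | PR | Description | Change |
|---|---|---|---|
| 1 | #5833 | "clean migrate state when migrate is done" | Reset → Clean ✅ |
| 2 | #5873 | "fix migrate failed" | Clean → Reset (revert!) ❌ |
| 3 | #6066 | "fix occasional migration failures..." | Other fixes |
| 4 | #6081 | "refactor: unify type constants" | Refactoring |
So the current state of upstream/master is:
- PR #5833 first changed Reset to Clean
- But PR #5873 then changed it back to Reset!
Conclusion
Your PR #6242 in effect restores the change from PR #5833, which PR #5873 mistakenly reverted.
This shows that:
- changluyi originally considered Clean to be correct (#5833)
- and later decided that rolling back to Reset was necessary (#5873)
- while you now believe Clean should be used (#6242)
In the PR description you need to explain clearly why Clean is better than Reset; otherwise reviewers may question why the fix from #5873 is being changed back again.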