Kube-OVN live migration fix: comparison with the mc branch


Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: full handleAddOrUpdateVMIMigration ===" && git show upstream/master:pkg/controller/kubevirt.go | sed -n '96,210p'

Ran terminal command: cd /root/f/kube-ovn && echo "=== MC: full migrateVM function ===" && git show upstream/release-1.12-mc:pkg/controller/pod.go | sed -n '2297,2360p'

Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: VM cleanup on Pod deletion ===" && git show upstream/master:pkg/controller/pod.go | grep -A 30 "CleanLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && echo "=== MC: VM cleanup on Pod deletion ===" && git show upstream/release-1.12-mc:pkg/controller/pod.go | grep -A 30 "CleanLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: context of the Clean call ===" && git show upstream/master:pkg/controller/pod.go | grep -B 30 "CleanLogicalSwitchPortMigrateOptions" | tail -40

Now let me look at how the Reset function is actually implemented:

Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: ResetLogicalSwitchPortMigrateOptions ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 40 "func.*ResetLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: CleanLogicalSwitchPortMigrateOptions ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 30 "func.*CleanLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && echo "=== Master: SetLogicalSwitchPortMigrateOptions ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 50 "func.*SetLogicalSwitchPortMigrateOptions"

Now let me check whether the Reset/Set/Clean implementations on the MC branch are the same:

Ran terminal command: cd /root/f/kube-ovn && git diff upstream/master upstream/release-1.12-mc -- pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 10 -B 5 "ResetLogicalSwitchPortMigrateOptions\|CleanLogicalSwitchPortMigrateOptions\|SetLogicalSwitchPortMigrateOptions" | head -60

There is one small difference! Let me look at it in more detail:

Ran terminal command: cd /root/f/kube-ovn && echo "=== Master Clean ===" && git show upstream/master:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 15 "CleanLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && echo "=== MC Clean ===" && git show upstream/release-1.12-mc:pkg/ovs/ovn-nb-logical_switch_port.go | grep -A 20 "CleanLogicalSwitchPortMigrateOptions"

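In-depth analysis summary

1. Differences in the LSP migration function implementations

| Function | Master | MC | Difference |
|---|---|---|---|
| SetLogicalSwitchPortMigrateOptions | ✅ Same | ✅ Same | None |
| ResetLogicalSwitchPortMigrateOptions | ✅ Same | ✅ Same | None |
| CleanLogicalSwitchPortMigrateOptions | Removes requested-chassis + activation-strategy | Removes only requested-chassis | Differs |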

2. Full live migration flow compared

┌────────────────────────────────────────────────────────────────────────────────────────┐
│                                 Live migration stages                                  │
├────────────────────┬─────────────────────────────────┬─────────────────────────────────┤
│ Stage              │ Master                          │ MC                              │
├────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Migration start    │ Set(src,target)                 │ Set(src,target)                 │
│                    │ requested-chassis=src,target    │ requested-chassis=src,target    │
│                    │ activation-strategy=rarp        │ activation-strategy=rarp        │
├────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Migration succeeds │ Reset(false)                    │ Reset(false)                    │
│                    │ requested-chassis=target        │ requested-chassis=target        │
│                    │ removes activation-strategy     │ removes activation-strategy     │
├────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Migration fails    │ Reset(true)                     │ Reset(true)                     │
│                    │ requested-chassis=src           │ requested-chassis=src           │
│                    │ removes activation-strategy     │ removes activation-strategy     │
├────────────────────┼─────────────────────────────────┼─────────────────────────────────┤
│ Pod deletion       │ Clean()                         │ Clean()                         │
│                    │ removes requested-chassis       │ removes requested-chassis       │
│                    │ removes activation-strategy ⚠️  │ ❌ keeps activation-strategy    │
└────────────────────┴─────────────────────────────────┴─────────────────────────────────┘
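The Set and Reset rows of the stage table can be modelled on a plain options map. This is a minimal sketch for illustration only: the helper names `setMigrateOptions` and `resetMigrateOptions` are hypothetical and are not the kube-ovn functions, which operate on OVN NB logical switch ports rather than on a Go map.

```go
package main

import "fmt"

// Option keys as they appear in the stage table.
const (
	requestedChassis   = "requested-chassis"
	activationStrategy = "activation-strategy"
)

// setMigrateOptions models Set(src,target): while the migration is running,
// both chassis are listed and activation-strategy is set to rarp.
func setMigrateOptions(opts map[string]string, src, target string) {
	opts[requestedChassis] = fmt.Sprintf("%s,%s", src, target)
	opts[activationStrategy] = "rarp"
}

// resetMigrateOptions models Reset(migratedFail): the port is pinned to the
// target on success or back to the source on failure, and
// activation-strategy is removed in both cases.
func resetMigrateOptions(opts map[string]string, src, target string, migratedFail bool) {
	if migratedFail {
		opts[requestedChassis] = src
	} else {
		opts[requestedChassis] = target
	}
	delete(opts, activationStrategy)
}

func main() {
	opts := map[string]string{}
	setMigrateOptions(opts, "node-a", "node-b")
	fmt.Println(opts) // options while the migration is in progress
	resetMigrateOptions(opts, "node-a", "node-b", false)
	fmt.Println(opts) // options after a successful migration
}
```

Note that after Reset the map still carries requested-chassis, which is the state the Pod deletion row then has to clean up.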

3. Key bugs in Master

Problem 1: MigrationUID is not validated when reading node information

// Current Master code
if vmi.Status.MigrationState != nil {
    srcNodeName = vmi.Status.MigrationState.SourceNode      // may belong to the previous migration!
    targetNodeName = vmi.Status.MigrationState.TargetNode   // may belong to the previous migration!
}

Problem 2: An early return causes cleanup to be skipped

if vmiMigration.Status.MigrationState.Completed {
    return nil  // the MigrationSucceeded/Failed handling may be skipped entirely!
}
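To make the effect of the early return concrete, the sketch below contrasts the check order described in Problem 2 with one possible reordering. Everything here is an assumption for illustration: `migration` is a trimmed-down stand-in rather than the kubevirt type, the action strings only name the LSP operation that would run, and `actionTerminalFirst` is not claimed to be the upstream fix.

```go
package main

import "fmt"

// migration is a hypothetical stand-in for the VMIMigration fields
// discussed above; it is not the kubevirt API type.
type migration struct {
	Phase     string // "Scheduling", "Succeeded" or "Failed" in this sketch
	Completed bool   // mirrors Status.MigrationState.Completed
}

// actionForPhase maps a phase to the LSP operation from the stage table.
func actionForPhase(phase string) string {
	switch phase {
	case "Scheduling":
		return "Set(src,target)"
	case "Succeeded":
		return "Reset(false)"
	case "Failed":
		return "Reset(true)"
	default:
		return "skip"
	}
}

// actionEarlyReturn follows the order described in Problem 2: the Completed
// check comes first, so a finished migration never reaches cleanup.
func actionEarlyReturn(m migration) string {
	if m.Completed {
		return "skip"
	}
	return actionForPhase(m.Phase)
}

// actionTerminalFirst is one possible reordering (an assumption): terminal
// phases are handled before the Completed check.
func actionTerminalFirst(m migration) string {
	if m.Phase == "Succeeded" || m.Phase == "Failed" {
		return actionForPhase(m.Phase)
	}
	if m.Completed {
		return "skip"
	}
	return actionForPhase(m.Phase)
}

func main() {
	done := migration{Phase: "Succeeded", Completed: true}
	fmt.Println(actionEarlyReturn(done))   // skip
	fmt.Println(actionTerminalFirst(done)) // Reset(false)
}
```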

4. The actual scenario behind Issue #6220

First migration: Node-A → Node-B (succeeded)
  - vmi.Status.MigrationState = {SourceNode: A, TargetNode: B, MigrationUID: xxx}
  
Second migration: Node-B → Node-A (starting)
  - A new vmiMigration is created, UID = yyy
  - vmi.Status.MigrationState has not been updated yet and still holds the previous migration's values!
  
Master controller processing:
  - Reads srcNodeName = A (wrong! should be B)
  - Reads targetNodeName = B (wrong! should be A)
  - Check: sourceNode == targetPod.Spec.NodeName (B == A? depends on timing)
  - May skip SetLogicalSwitchPortMigrateOptions!
  
Result:
  - OVN does not set requested-chassis correctly
  - The OVS interface never becomes ready
  - Timeout error: "ovs interface xxx is not ready after 30s"

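5. Conclusion

| Dimension | Master | MC |
|---|---|---|
| LSP operation functions | Essentially the same | Essentially the same |
| Node information lookup | Has the stale state bug | Read from the Pod annotation, so not affected |
| Migration trigger | Watches the VMIMigration CRD | Triggered by Pod annotation changes |
| Reliability | Depends on the timing of vmi.Status synchronization | Depends on kubevirt writing the annotation |

The MigrationUID check in PR #6239 is the key fix for the Master problem.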
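The idea of a MigrationUID check can be sketched as follows. This is an illustration under stated assumptions, not the code from PR #6239: `migrationState` is a hypothetical stand-in for the three fields of vmi.Status.MigrationState listed in the Issue #6220 scenario, and `nodesForMigration` is an invented helper name.

```go
package main

import "fmt"

// migrationState is a hypothetical stand-in for the fields of
// vmi.Status.MigrationState used in the Issue #6220 scenario.
type migrationState struct {
	SourceNode   string
	TargetNode   string
	MigrationUID string
}

// nodesForMigration returns the source and target nodes only when the state
// recorded on the VMI belongs to the migration currently being handled.
// A nil state or a UID mismatch means the state is stale and must not be used.
func nodesForMigration(state *migrationState, migrationUID string) (src, target string, ok bool) {
	if state == nil || state.MigrationUID != migrationUID {
		return "", "", false
	}
	return state.SourceNode, state.TargetNode, true
}

func main() {
	// State left over from the first migration (A → B, UID xxx).
	stale := &migrationState{SourceNode: "A", TargetNode: "B", MigrationUID: "xxx"}

	// The second migration has UID yyy, so the stale state is rejected.
	_, _, ok := nodesForMigration(stale, "yyy")
	fmt.Println(ok) // false

	// Once the VMI status catches up, the nodes can be trusted.
	fresh := &migrationState{SourceNode: "B", TargetNode: "A", MigrationUID: "yyy"}
	src, target, ok := nodesForMigration(fresh, "yyy")
	fmt.Println(src, target, ok) // B A true
}
```

When the check fails, the caller would presumably wait for a later update event instead of acting on the old nodes; how the real controller requeues is not shown in the source.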

Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/ovs/ovn-nb-logical_switch_port.go 2>/dev/null | grep -A 80 "SetLogicalSwitchPortMigrateOptions\|ResetLogicalSwitchPortMigrateOptions\|CleanLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/daemon/handler.go 2>/dev/null | grep -B 5 -A 50 "SetLogicalSwitchPortMigrateOptions\|MigrateOptions\|migrate"

Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/daemon/controller.go 2>/dev/null | grep -B 5 -A 50 "SetLogicalSwitchPortMigrateOptions\|MigrateOptions\|Migrate\|kubevirt"

Ran terminal command: cd /root/f/kube-ovn && git ls-tree --name-only -r upstream/release-1.12-mc | xargs -I {} sh -c 'git show upstream/release-1.12-mc:{} 2>/dev/null | grep -l "SetLogicalSwitchPortMigrateOptions\|CleanLogicalSwitchPortMigrateOptions" && echo {}'

Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/controller/pod.go 2>/dev/null | grep -B 10 -A 30 "SetLogicalSwitchPortMigrateOptions\|CleanLogicalSwitchPortMigrateOptions"

Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/controller/pod.go 2>/dev/null | grep -B 40 "isMigrate\|migrated\|migratedFail" | head -100

Ran terminal command: cd /root/f/kube-ovn && git show upstream/release-1.12-mc:pkg/controller/pod.go 2>/dev/null | grep -A 80 "func.*migrateVM"

Now let me compare against the implementation on the current fix-migration branch:

Read , lines 1040 to 1073

Ran terminal command: cd /root/f/kube-ovn && git show HEAD:pkg/controller/pod.go 2>/dev/null | grep -B 5 -A 30 "CleanLogicalSwitchPortMigrateOptions"

I now have enough information for a full comparison. Let me summarize:


Release 1.12-mc vs. the current fix-migration branch: LSP control logic compared

I. Different trigger mechanisms

| Dimension | release-1.12-mc | fix-migration (master) |
|---|---|---|
| Source of migration state | Pod annotations (e.g. util.MigrationPhaseAnnotation) | VirtualMachineInstanceMigration CRD status |
| Trigger point | Checked in reconcileAllocateSubnets on Pod Add/Update | Add/Update events of VirtualMachineInstanceMigration |
| Controller | migrateVM() in pod.go | handleAddOrUpdateVMIMigration() in kubevirt.go |

II. LSP options operations compared

1. At migration start (SetLogicalSwitchPortMigrateOptions)

| | release-1.12-mc | fix-migration |
|---|---|---|
| Trigger condition | MigrationPhaseStarted annotation | MigrationScheduling phase |
| Operation | Identical ✅ | Identical ✅ |
| Result | requested-chassis=src,target; activation-strategy=rarp | Identical ✅ |

// The two implementations are exactly the same
lsp.Options["requested-chassis"] = fmt.Sprintf("%s,%s", srcNodeName, targetNodeName)
lsp.Options["activation-strategy"] = "rarp"

2. When migration succeeds

| | release-1.12-mc | fix-migration |
|---|---|---|
| Trigger condition | MigrationPhaseSucceeded annotation | MigrationSucceeded phase |
| Operation | ResetLogicalSwitchPortMigrateOptions(..., false) | CleanLogicalSwitchPortMigrateOptions() |
| Result | requested-chassis=targetNode; removes activation-strategy | Removes requested-chassis; removes activation-strategy |

⚠️ Difference: 1.12-mc keeps requested-chassis=target, while fix-migration cleans up everything

3. When migration fails

| | release-1.12-mc | fix-migration |
|---|---|---|
| Trigger condition | MigrationPhaseFailed annotation | MigrationFailed phase |
| Operation | ResetLogicalSwitchPortMigrateOptions(..., true) | CleanLogicalSwitchPortMigrateOptions() |
| Result | requested-chassis=srcNode; removes activation-strategy | Removes requested-chassis; removes activation-strategy |

⚠️ Difference: 1.12-mc rolls back to requested-chassis=src, while fix-migration cleans up everything

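4. Cleanup on Pod deletion

| | release-1.12-mc | fix-migration |
|---|---|---|
| Operation | CleanLogicalSwitchPortMigrateOptions() | CleanLogicalSwitchPortMigrateOptions() |
| Result | Identical ✅ | Identical ✅ |

5. CleanLogicalSwitchPortMigrateOptions implementation compared

| | release-1.12-mc | fix-migration |
|---|---|---|
| Removes requested-chassis | ✅ Removed | ✅ Removed |
| Removes activation-strategy | ❌ Not removed | ✅ Removed |

Key difference: CleanLogicalSwitchPortMigrateOptions in 1.12-mc removes only requested-chassis and leaves activation-strategy in place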


III. Key code compared

Clean implementation in release-1.12-mc:

delete(lsp.Options, "requested-chassis")
// Note: activation-strategy is not removed

Clean implementation in fix-migration:

delete(lsp.Options, "requested-chassis")
delete(lsp.Options, "activation-strategy")  // ✅ this line is extra
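The practical effect of the two Clean variants can be seen on a plain map. The sketch below is illustrative only: `cleanMC` and `cleanFixMigration` are hypothetical names that wrap the delete calls shown above, and a Go map stands in for lsp.Options.

```go
package main

import "fmt"

// cleanMC mirrors the release-1.12-mc variant shown above:
// only requested-chassis is removed.
func cleanMC(opts map[string]string) {
	delete(opts, "requested-chassis")
}

// cleanFixMigration mirrors the fix-migration variant shown above:
// both migration-related options are removed.
func cleanFixMigration(opts map[string]string) {
	delete(opts, "requested-chassis")
	delete(opts, "activation-strategy")
}

// midMigrationOptions returns the options as they look while a migration
// is in progress, using placeholder node names.
func midMigrationOptions() map[string]string {
	return map[string]string{
		"requested-chassis":   "node-a,node-b",
		"activation-strategy": "rarp",
	}
}

func main() {
	mc := midMigrationOptions()
	cleanMC(mc)
	fmt.Println(mc) // map[activation-strategy:rarp]

	fix := midMigrationOptions()
	cleanFixMigration(fix)
	fmt.Println(fix) // map[]
}
```

The difference only shows up when Clean runs while activation-strategy is still set, for example if the Pod is deleted mid-migration; after a Reset the option is already gone and both variants leave the same result.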

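IV. Conclusion

| Aspect | Consistent? | Notes |
|---|---|---|
| Migration start (Set) | ✅ Consistent | Both set requested-chassis=src,target + activation-strategy=rarp |
| Migration success (Reset/Clean) | ❌ Inconsistent | 1.12-mc uses Reset and keeps target; fix-migration uses Clean and removes everything |
| Migration failure (Reset/Clean) | ❌ Inconsistent | 1.12-mc uses Reset to roll back to src; fix-migration uses Clean and removes everything |
| Pod deletion cleanup | ⚠️ Partially consistent | Both call Clean, but the 1.12-mc Clean does not remove activation-strategy |
| Clean method implementation | ❌ Inconsistent | fix-migration additionally removes activation-strategy |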

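V. Recommendations

Your fix-migration branch is actually more thorough than 1.12-mc

  1. Use Clean uniformly once the migration is done. This is correct, because OVN routes traffic according to the actual chassis binding, so requested-chassis does not need to be kept
  2. Have the Clean method also remove activation-strategy. This is correct as well, since it avoids leftover state

If you want to match 1.12-mc exactly, you need to:

  1. Change MigrationSucceeded back to ResetLogicalSwitchPortMigrateOptions(..., false)
  2. Change MigrationFailed back to ResetLogicalSwitchPortMigrateOptions(..., true)

But I recommend keeping your current implementation, because:

  • Clean is the safer approach
  • It is also the direction PR #5833 took
  • A leftover requested-chassis may cause problems for later migrations

You're right, let me re-verify: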

Ran terminal command: cd /root/f/kube-ovn && git fetch upstream master && git log upstream/master --oneline --grep="5833\|clean migrate" | head -10

Ran terminal command: cd /root/f/kube-ovn && git show upstream/master:pkg/controller/kubevirt.go 2>/dev/null | grep -A 20 "MigrationSucceeded\|MigrationFailed"

Odd! Let me look at what commit 9681fc70 contains:

Ran terminal command: cd /root/f/kube-ovn && git show 9681fc70 --stat && echo "---" && git show 9681fc70 -p -- "*.go" | head -80

Ran terminal command: cd /root/f/kube-ovn && git log upstream/master --oneline | head -20

Ran terminal command: cd /root/f/kube-ovn && git branch -a --contains 9681fc70 | head -10

Ran terminal command: cd /root/f/kube-ovn && git merge-base upstream/master 9681fc70

Ran terminal command: cd /root/f/kube-ovn && git log upstream/master --oneline --all -- pkg/controller/kubevirt.go | head -20

Found something! Let me check PR #5873 "fix migrate failed":

Ran terminal command: cd /root/f/kube-ovn && git show 25504091 -p -- pkg/controller/kubevirt.go | head -80

Found it! PR #5873 "fix migrate failed" reverted the change made in #5833!

Let me confirm the commit order:

Ran terminal command: cd /root/f/kube-ovn && git log upstream/master --oneline -- pkg/controller/kubevirt.go | head -15


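Mystery solved!

Timeline (oldest to newest):

| Order | PR | Description | Change |
|---|---|---|---|
| 1 | #5833 | "clean migrate state when migrate is done" | Reset → Clean |
| 2 | #5873 | "fix migrate failed" | Clean → Reset (revert!) ❌ |
| 3 | #6066 | "fix occasional migration failures..." | Other fixes |
| 4 | #6081 | "refactor: unify type constants" | Refactoring |

So the current state of upstream/master is:

  • PR #5833 first changed Reset to Clean
  • But PR #5873 then changed it back to Reset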

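Conclusion

Your PR #6242 is in effect restoring the change from PR #5833, because PR #5873 mistakenly reverted it.

This shows that:

  1. Initially, changluyi considered Clean to be correct (#5833)
  2. Later, they decided a rollback via Reset was needed (#5873)
  3. And you now think Clean should be used (#6242)

You need to explain clearly in the PR description why Clean is better than Reset; otherwise a reviewer may question why the fix from #5873 is being undone again.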