Curosr Debug模式让我发现了 2 > 3 > 1 > 4本文通过一个 Embedding 情感分析 Bug，实

什么是 Debug 模式？

Cursor 提供了多种工作模式，其中 Debug 模式 专为解决那些难以复现或难以理解的 Bug 而设计。与传统调试不同，它不会立即尝试修复代码，而是先生成假设、添加日志、收集运行时数据，最后才进行精准修复。

根据 Cursor 官方文档和官方博客：

How it works

Explore and hypothesize: The agent explores relevant files, builds context, and generates multiple hypotheses about potential root causes.
Add instrumentation: The agent adds log statements that send data to a local debug server running in a Cursor extension.
Reproduce the bug: Debug Mode asks you to reproduce the bug and provides specific steps. This keeps you in the loop and ensures the agent captures real runtime behavior.
Analyze logs: After reproduction, the agent reviews the collected logs to identify the actual root cause based on runtime evidence.
Make targeted fix: The agent makes a focused fix that directly addresses the root cause—often just a few lines of code.
Verify and clean up: You can re-run the reproduction steps to verify the fix. Once confirmed, the agent removes all instrumentation.

相比 Agent、Plan 模式，Debug 模式用的次数确实太少了。这次终于找了一个练手文件试试，防止它在大项目里面乱来。

实战案例：Embedding 情感分析 Bug

问题描述

使用 Embedding 向量进行情感分析，一段明显的好评文本被错误判断为差评。这里使用的是阿里云通义千问的 text-embedding-v4 模型：

EMBEDDING_MODEL = "text-embedding-v4"

# 参考向量
positive_review = get_embedding("好评")
negative_review = get_embedding("差评")

# 待分析文本（明显的好评）
text_to_analyze = "买的银色版真的很好看，一天就到了，晚上就开始拿起来完系统很丝滑流畅，做工扎实，手感细腻，很精致哦苹果一如既往的好品质"

# 计算相似度并判断
score = cosine_similarity(text_embedding, positive_review) - \
        cosine_similarity(text_embedding, negative_review)
# 结果：score < 0 → 错误判断为差评

Debug 模式分析

进入 Debug 模式后，AI 生成了 5 个假设：

假设	内容
A	参考向量本身有问题
B	相似度计算结果异常
C	文本 embedding 数据异常
D	判断逻辑有误
E	"好评"与"差评"向量相似度过高

AI 自动插入日志后，让我们介入重新运行程序，保证 human-in-the-loop：

收集到的关键日志数据：

{"hypothesisId": "E", "message": "好评与差评向量的相似度", "data": {"pos_neg_similarity": 0.6828}}
{"hypothesisId": "B", "message": "相似度计算结果", "data": {"pos_sim": 0.4622, "neg_sim": 0.4642, "score": -0.002}}

根因确认

日志揭示了问题：

"好评"和"差评"的向量相似度高达 68%，区分度不足
两个相似度差值仅 0.002，几乎无法区分

修复方案

使用更具语义区分度的参考句子：

# 修复前
positive_review = get_embedding("好评")
negative_review = get_embedding("差评")

# 修复后
positive_review = get_embedding("这个产品非常好，质量很棒，很满意，强烈推荐购买")
negative_review = get_embedding("这个产品非常差，质量很烂，很失望，不推荐购买")

验证结果

指标	修复前	修复后
与正面参考相似度	0.4622	0.5615
与负面参考相似度	0.4642	0.5563
score	-0.002	+0.005
结果	差评 ❌	好评 ✅

意外发现：Embedding 模型版本对比

Debug 结束后，我突然想到：修复后的 score 只有 +0.005，效果似乎并不理想。于是我测试了不同版本的 embedding 模型，使用原始的简单参考词（"好评"/"差评"）：

各版本测试结果

模型版本	与"好评"相似度	与"差评"相似度	score	结果
v1	0.3578	0.3375	+0.0203	好评 ✅
v2	0.1787	-0.0325	+0.2113	好评 ✅
v3	0.6925	0.5481	+0.1444	好评 ✅
v4	0.4622	0.4642	-0.0020	差评 ❌

令人意外的是，v4 是唯一判断错误的版本！

官方评测基准

根据阿里云官方文档，v4 在 MTEB/CMTEB 评测中分数最高：

模型	MTEB	CMTEB
text-embedding-v1	58.30	59.84
text-embedding-v2	60.13	62.17
text-embedding-v3（1024 维度）	63.39	68.92
text-embedding-v4（2048 维度）	71.58	71.99

即使将 v4 的维度提升至 2048，结果依然是差评：

与'好评'的相似度: 0.444916
与'差评'的相似度: 0.446109
综合分数: -0.001192
结果: 差评 ❌

结论

在这个简单情感分类任务上，模型表现排序：v2 > v3 > v1 > v4

这说明：评测基准分数高 ≠ 所有任务都表现好。选择模型时需要在实际场景中测试。

代码参考：geektime-ai-course