Claude 3.7 Sonnet & DeepSeek R1 & Qwen2.5-Max: An Informal Test

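Background

Anthropic recently released Claude 3.7 Sonnet, and reviews of it are already all over the internet. Since trying something yourself beats reading about it, I wanted to run my own informal, hands-on tests of Claude 3.7 Sonnet and get a feel for how its capabilities have changed in real scenarios. The models compared in this round are: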

Claude 3.7 Sonnet（2025_02_25）
Claude 3.5 Sonnet（2024_06_21）
DeepSeek_R1（2025_01_20）
Qwen2.5-Max（2025_01_29）

Disclaimer: this test is purely for fun and does not represent the models' actual capabilities. The results are for reference only.

Claude 3.7 Sonnet

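Claude 3.7 Sonnet was developed by the AI research company Anthropic and released on February 25, 2025, as the latest iteration of the Claude series. According to Anthropic's official announcement, the model is positioned as its "most intelligent model to date" and is the first to introduce hybrid reasoning, combining the capabilities of an ordinary large language model (LLM) with those of a dedicated reasoning model.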

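Technical details

The key innovation in Claude 3.7 Sonnet is its dual-mode design:

Standard mode: gives near-instant responses, suited to quick interactions.
Extended thinking mode: lets the model reason step by step and show its thought process, which is particularly useful for complex problems. On Amazon Bedrock this mode can be switched on with a toggle, and it reportedly improves performance noticeably on math, physics, and instruction-following tasks.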
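To make the two modes concrete, here is a minimal sketch of how a request body might differ between them. It assumes the field names of Anthropic's Messages API as I understand them (`thinking`, `budget_tokens`); the model ID, the token numbers, and the `buildRequest` helper itself are illustrative assumptions, so check the official documentation before relying on any of it. The function only builds an object and sends nothing.

```javascript
// Sketch only: builds a request body, it does not call any API.
// The model ID and the token numbers are assumptions for illustration.
function buildRequest(prompt, { thinking = false, maxTokens = 4096, budgetTokens = 2048 } = {}) {
  const body = {
    model: "claude-3-7-sonnet-20250219", // assumed model identifier
    max_tokens: maxTokens,
    messages: [{ role: "user", content: prompt }],
  };
  if (thinking) {
    // Extended thinking: the thinking budget has to fit inside max_tokens.
    if (budgetTokens >= maxTokens) {
      throw new RangeError("budgetTokens must be smaller than maxTokens");
    }
    body.thinking = { type: "enabled", budget_tokens: budgetTokens };
  }
  return body;
}

// Standard mode: no `thinking` field at all.
const standardRequest = buildRequest("Implement a bouncing-ball game.");
// Extended thinking mode: same request plus a thinking budget.
const extendedRequest = buildRequest("Measure 15 minutes with two hourglasses.", { thinking: true });
console.log(JSON.stringify(extendedRequest, null, 2));
```

The only difference between the two bodies is the optional `thinking` block, which matches the idea of one model with a switchable reasoning mode.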

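The core of this update is the new thinking mode, that is, the reasoning mode. Some of the scenarios below are tested with thinking mode enabled.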

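Benchmarks

Test comparison

This comparison covers three dimensions:

Programming test
Algorithm test
Reasoning test

If you have other angles in mind, feel free to get in touch.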

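Programming test

The classic bouncing-ball programming task.

Task: implement a bouncing-ball game:

Environment: create a rotating hexagon as the area the ball moves in.
Physics: the ball must be affected by gravity and friction.
Collision detection: when the ball hits a wall of the hexagon, it must bounce off according to the laws of physics.

Before giving your final answer, run the code yourself to check for syntax errors, then send the correct answer. (I added this sentence to the prompt because syntax errors came up often in early test runs.)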
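For readers who want to see what the task is really asking for, here is a minimal sketch of one physics step. It is my own illustration, not the output of any of the tested models. The constants, the `step` and `hexagonVertices` names, and the simplification of friction as plain velocity damping are all assumptions. The part that matters for the results in this test is the push-out: the ball is moved back inside the wall before its velocity is reflected, which is what keeps it from clipping through.

```javascript
// Hypothetical constants, chosen only for illustration.
const GRAVITY = 980;      // px/s^2, pointing down the canvas y axis
const FRICTION = 0.1;     // fraction of velocity lost per second (simplified drag)
const RESTITUTION = 0.85; // fraction of normal speed kept after a bounce

// Vertices of a regular hexagon centred on the origin, rotated by `angle`.
function hexagonVertices(radius, angle) {
  return Array.from({ length: 6 }, (_, i) => {
    const a = angle + (i * Math.PI) / 3;
    return { x: radius * Math.cos(a), y: radius * Math.sin(a) };
  });
}

// Advance the ball by dt seconds inside a hexagon spinning at hex.omega rad/s.
function step(ball, hex, dt) {
  // Integrate gravity and friction, then move the ball and turn the hexagon.
  ball.vy += GRAVITY * dt;
  const damping = Math.max(0, 1 - FRICTION * dt);
  ball.vx *= damping;
  ball.vy *= damping;
  ball.x += ball.vx * dt;
  ball.y += ball.vy * dt;
  hex.angle += hex.omega * dt;

  const verts = hexagonVertices(hex.radius, hex.angle);
  for (let i = 0; i < 6; i++) {
    const a = verts[i];
    const b = verts[(i + 1) % 6];
    // Unit normal of the edge a->b, flipped if needed to point at the centre.
    const len = Math.hypot(b.x - a.x, b.y - a.y);
    let nx = -(b.y - a.y) / len;
    let ny = (b.x - a.x) / len;
    if (nx * a.x + ny * a.y > 0) { nx = -nx; ny = -ny; }

    // Signed distance from the ball centre to the wall (positive = inside).
    const dist = (ball.x - a.x) * nx + (ball.y - a.y) * ny;
    if (dist >= ball.r) continue;

    // Push the ball back inside so it cannot clip through the wall.
    ball.x += (ball.r - dist) * nx;
    ball.y += (ball.r - dist) * ny;

    // Velocity of the rotating wall at the ball's position (omega x position).
    const wx = -hex.omega * ball.y;
    const wy = hex.omega * ball.x;
    // Reflect the velocity relative to the wall, only if moving into it.
    const vn = (ball.vx - wx) * nx + (ball.vy - wy) * ny;
    if (vn < 0) {
      ball.vx -= (1 + RESTITUTION) * vn * nx;
      ball.vy -= (1 + RESTITUTION) * vn * ny;
    }
  }
}

// Usage: simulate 5 seconds at 120 steps per second.
const demoBall = { x: 0, y: 0, vx: 150, vy: 0, r: 10 };
const demoHex = { radius: 200, angle: 0, omega: 1 };
for (let i = 0; i < 600; i++) step(demoBall, demoHex, 1 / 120);
console.log(demoBall.x.toFixed(1), demoBall.y.toFixed(1));
```

Each wall is treated as a half-plane, which is enough here because a regular hexagon is convex. A real game would call `step` from an animation loop and draw the result on a canvas.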

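Claude 3.7 result

The result is very good, and this was without reasoning mode enabled.

Claude 3.5 result

The ball clearly clips through the wall.

DeepSeek_R1 result

Similar to Claude 3.5: the ball also clips through the wall. Reasoning also took more than 10 minutes, so R1 clearly has a lot of room to improve in programming scenarios.

Qwen2.5-Max result

Credit to Qwen here: the result is also fairly good, and it responded quickly.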

Programming conclusion

	Claude 3.7	Claude 3.5	DeepSeek R1	Qwen2.5-Max
Bouncing-ball programming	✔️	❎	❎	✔️

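Algorithm test

The algorithm problems in this test all come from edabit.com/. If you are interested, you can pick problems of any difficulty and test them yourself.

The difficulty chosen for this test is "Very Hard".

The problem is edabit.com/challenge/K…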

Create a function that returns all pairs of numbers in an array that sum to a target. Sort the pairs in ascending order with respect to the smaller number, then order each pair in this order: [smaller, larger].

Examples

allPairs([2, 4, 5, 3], 7) ➞ [[2, 5], [3, 4]]
// 2 + 5 = 7, 3 + 4 = 7

allPairs([5, 3, 9, 2, 1], 3) ➞ [[1, 2]]

allPairs([4, 5, 1, 3, 6, 8], 9) ➞ [[1, 8], [3, 6], [4, 5]]
// Sorted: 1 < 3 < 4; each pair is ordered [smaller, larger]

Notes

If no pairs are found, return an empty array [].
You are only allowed to use each number once in a pair.
See Comments for a hint.
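A quick way to sanity-check any candidate solution is to run it against the three published examples. The sketch below pairs a small checker with an independent reference implementation that sorts the array and walks two pointers inward. It is not the official Edabit solution, and the names `allPairsReference` and `passesExamples` are my own. It assumes the note about using each number once means each array element can appear in at most one pair.

```javascript
// Independent reference sketch, not the official solution.
// Assumption: each array element may be used in at most one pair.
function allPairsReference(arr, target) {
  const sorted = [...arr].sort((a, b) => a - b);
  const pairs = [];
  let lo = 0;
  let hi = sorted.length - 1;
  while (lo < hi) {
    const sum = sorted[lo] + sorted[hi];
    if (sum === target) {
      // lo only moves right, so pairs come out sorted by the smaller number.
      pairs.push([sorted[lo], sorted[hi]]);
      lo++;
      hi--;
    } else if (sum < target) {
      lo++;
    } else {
      hi--;
    }
  }
  return pairs;
}

// The three examples from the challenge statement.
const examples = [
  { arr: [2, 4, 5, 3], target: 7, expected: [[2, 5], [3, 4]] },
  { arr: [5, 3, 9, 2, 1], target: 3, expected: [[1, 2]] },
  { arr: [4, 5, 1, 3, 6, 8], target: 9, expected: [[1, 8], [3, 6], [4, 5]] },
];

// Returns true when `fn` reproduces every published example.
function passesExamples(fn) {
  return examples.every(
    ({ arr, target, expected }) =>
      JSON.stringify(fn(arr, target)) === JSON.stringify(expected)
  );
}

console.log(passesExamples(allPairsReference)); // true
```

Passing these three examples is a necessary check, not a proof of correctness: none of them contains duplicate values, which is where hash-map based solutions tend to differ.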

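Let's go straight to the conclusion.

Algorithm conclusion

	Claude 3.7	Claude 3.5	DeepSeek R1	Qwen2.5-Max
Algorithm test	Passed on the first try ✔️	Passed after 2 revisions ✔️	Reasoned for ten to twenty minutes with no result ❎	Passed after 1 revision ✔️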

Claude 3.7 solution

function allPairs(arr, target) {
  // Create a map to store the numbers we've seen
  const numMap = new Map();
  // Array to store our result pairs
  const result = [];
  
  for (const num of arr) {
    // Calculate the complement needed to reach target
    const complement = target - num;
    
    // If we've seen the complement before, we found a pair
    if (numMap.has(complement) && numMap.get(complement)) {
      // Add the pair in [smaller, larger] order
      const pair = [Math.min(num, complement), Math.max(num, complement)];
      result.push(pair);
      
      // Mark the complement as used so we don't use it again
      numMap.set(complement, false);
    } else {
      // Mark this number as available for pairing
      numMap.set(num, true);
    }
  }
  
  // Sort pairs by the smaller number
  result.sort((a, b) => a[0] - b[0]);
  
  return result;
}

Claude 3.5 solution

function allPairs(numbers, target) {
    // Create a map to count occurrences of each number
    const numCount = new Map();
    for (let num of numbers) {
        numCount.set(num, (numCount.get(num) || 0) + 1);
    }
    
    // Initialize result array
    const result = [];
    
    // Iterate through the numbers
    for (let num of numbers) {
        const complement = target - num;
        
        if (num === complement) {
            // If the number is half of the target, we need at least 2 occurrences
            if (numCount.get(num) >= 2) {
                result.push([num, num]);
                // Remove this pair to avoid duplicates
                numCount.set(num, numCount.get(num) - 2);
            }
        } else if (numCount.get(complement) > 0) {
            result.push(num < complement ? [num, complement] : [complement, num]);
            // Remove these numbers to avoid duplicates
            numCount.set(num, numCount.get(num) - 1);
            numCount.set(complement, numCount.get(complement) - 1);
        }
    }
    
    // Sort the result based on the smaller number in each pair
    result.sort((a, b) => a[0] - b[0]);
    
    return result;
}

Qwen2.5-Max solution

function allPairs(arr, target) {
    const seen = new Set(); // To track numbers we've already seen
    const used = new Set(); // To track numbers already used in a pair
    const pairs = [];       // To store the resulting pairs

    for (const num of arr) {
        if (used.has(num)) {
            continue; // Skip numbers already used in a pair
        }

        const complement = target - num;
        if (seen.has(complement) && !used.has(complement)) {
            // Form the pair [smaller, larger]
            const pair = [Math.min(num, complement), Math.max(num, complement)];
            pairs.push(pair);
            // Mark both numbers as used
            used.add(num);
            used.add(complement);
        }

        // Add the current number to the set of seen numbers
        seen.add(num);
    }

    // Sort the pairs based on the smaller number in each pair
    pairs.sort((a, b) => a[0] - b[0]);
    return pairs;
}

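Reasoning test

Question:

You have two hourglasses, one that runs for 7 minutes and one that runs for 11 minutes. How can you use them to measure 15 minutes?

Reference answer:

Start both hourglasses at the same time.
When the 7-minute hourglass runs out, flip it.
When the 11-minute hourglass runs out, flip the 7-minute hourglass again. By then it has been running for 4 minutes since its last flip, so flipping it makes it run for another 4 minutes rather than the remaining 3.
When the 7-minute hourglass runs out, 11 + 4 = 15 minutes have passed.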
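The reference answer can be checked with a tiny simulation. This is only a sketch: the `Hourglass` class and the `measureFifteen` function are names I made up, and the model assumes that flipping is instantaneous and that sand falls at a constant rate.

```javascript
// Minimal hourglass model: `top` is the minutes of sand left in the upper bulb.
class Hourglass {
  constructor(minutes) {
    this.capacity = minutes;
    this.top = minutes; // starts full and running
  }
  flip() {
    // Whatever has already fallen becomes the new upper bulb.
    this.top = this.capacity - this.top;
  }
  run(minutes) {
    this.top = Math.max(0, this.top - minutes);
  }
}

function measureFifteen() {
  const seven = new Hourglass(7);
  const eleven = new Hourglass(11);
  const log = [];
  let elapsed = 0;

  // Let time pass until `glass` runs out, draining both hourglasses.
  const waitFor = (glass, label) => {
    const dt = glass.top;
    seven.run(dt);
    eleven.run(dt);
    elapsed += dt;
    log.push(`t=${elapsed}: ${label} runs out`);
  };

  waitFor(seven, "7-minute glass");   // t = 7
  seven.flip();                       // full again: 7 minutes on top
  waitFor(eleven, "11-minute glass"); // t = 11, the 7-minute glass has 3 left on top
  seven.flip();                       // the 4 minutes that already fell are now on top
  waitFor(seven, "7-minute glass");   // t = 11 + 4
  return { elapsed, log };
}

const { elapsed, log } = measureFifteen();
console.log(log.join("\n"));
console.log(`Total: ${elapsed} minutes`); // Total: 15 minutes
```

The second flip is the step the puzzle hinges on: it puts the 4 minutes of fallen sand back on top instead of letting the remaining 3 run out.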

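Reasoning conclusion

Judging by the reasoning results, all of the models above failed. I was not ready to give up, so I also tried GPT-4o and o1-mini, and both failed as well. Reports online say that only o3-mini answered correctly; if you have access, give it a try.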

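Test summary

My impressions as an individual user, based on these hands-on tests:

Claude 3.7: a genuinely commendable performance, and its reputation has held up. Among closed-source AI models it is one of the more dependable, and it is a sizeable improvement over Claude 3.5.

DeepSeek R1: disappointing. It routinely reasoned for ten to twenty minutes without reaching a conclusion, and real users do not have the patience to wait that long.

Qwen2.5-Max: overall a dependable model. Its performance was fairly stable, its responses were fast, and its results were fairly accurate. A big thumbs up.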