Development of Numerical Error Detection Tasks to Analyze the Numerical Capabilities of Language Models
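1. Model:
A numerical error detection task for analyzing the numerical capabilities of language models.
Link: github.com/cogma/BeNED…
Dataset: BeNEDect (understand numerical values). The data sources are as follows:
It contains 4 error categories:
2. Current status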
Although GPT-3.5, GPT-4, and Llama 3 performed well on the numerical error detection task, their accuracy was still not as high as that of humans, indicating room for improvement.
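- The gap is largest for numerical errors that require arithmetic calculation or expert knowledge.
- LLMs are more likely than humans to misjudge correct numbers as errors.
- Slight changes to the prompt have a large effect on the results, so the models are not robust.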
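The last two points can be made concrete with a small scoring sketch. This is only an illustration, not the authors' evaluation code: the `Prediction` record, the prompt IDs "A" and "B", and the toy labels are all hypothetical and do not come from BeNEDect. It computes how often correct sentences are wrongly flagged (false positive rate) and how much accuracy moves between prompt variants.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """One model judgment on one sentence (hypothetical record layout)."""
    prompt_id: str   # which prompt variant produced this judgment
    has_error: bool  # gold label: the sentence contains a numerical error
    flagged: bool    # model output: the model says there is an error


def accuracy(preds):
    """Fraction of judgments that match the gold label."""
    if not preds:
        return 0.0
    return sum(p.flagged == p.has_error for p in preds) / len(preds)


def false_positive_rate(preds):
    """Fraction of error-free sentences that the model flagged anyway."""
    clean = [p for p in preds if not p.has_error]
    if not clean:
        return 0.0
    return sum(p.flagged for p in clean) / len(clean)


def accuracy_spread(preds):
    """Max minus min accuracy across prompt variants (0.0 if no data)."""
    by_prompt = {}
    for p in preds:
        by_prompt.setdefault(p.prompt_id, []).append(p)
    scores = [accuracy(group) for group in by_prompt.values()]
    return max(scores) - min(scores) if scores else 0.0


# Toy data, invented purely for illustration.
toy = [
    Prediction("A", True, True),
    Prediction("A", False, False),
    Prediction("A", False, True),
    Prediction("A", True, True),
    Prediction("B", True, False),
    Prediction("B", False, True),
    Prediction("B", False, True),
    Prediction("B", True, True),
]

print("accuracy:", accuracy(toy))
print("false positive rate:", false_positive_rate(toy))
print("accuracy spread across prompts:", accuracy_spread(toy))
```

A large spread between prompt variants is one simple way to quantify the lack of robustness; a high false positive rate corresponds to misjudging correct numbers.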
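3. Findings
- Numerical NLP Tasks: compared with ordinary error detection, BeNEDect focuses on analyzing LLMs' numerical commonsense, calculation ability, and memorization ability.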