Code LLM Benchmark Papers

Large Language Model-Based Agents for Software Engineering: A Survey

Evaluating Large Language Models Trained on Code

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

DebugBench: Evaluating Debugging Capability of Large Language Models

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code