Large Language Model-Based Agents for Software Engineering: A Survey
Evaluating Large Language Models Trained on Code
HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
DebugBench: Evaluating Debugging Capability of Large Language Models
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code