OJBench: A Competition Level Code Benchmark For Large Language Models Paper • 2506.16395 • Published Jun 19, 2025 • 4
Kimi-Dev: Agentless Training as Skill Prior for SWE-Agents Paper • 2509.23045 • Published Sep 27, 2025 • 3
Multi-Docker-Eval: A 'Shovel of the Gold Rush' Benchmark on Automatic Environment Building for Software Engineering Paper • 2512.06915 • Published 25 days ago • 12