SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models • Paper 2511.05459 • Published Nov 7, 2025 • 3
Scaling Latent Reasoning via Looped Language Models • Paper 2510.25741 • Published Oct 29, 2025 • 221
AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? • Paper 2509.03312 • Published Sep 3, 2025 • 5
OmniVideoBench: Towards Audio-Visual Understanding Evaluation for Omni MLLMs • Paper 2510.10689 • Published Oct 12, 2025 • 46