Beyond Accuracy: Unveiling Inefficiency Patterns in Tool-Integrated Reasoning Paper • 2604.05404 • Published 4 days ago • 38
Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation Paper • 2604.02368 • Published 15 days ago • 9
SocialOmni: Benchmarking Audio-Visual Social Interactivity in Omni Models Paper • 2603.16859 • Published 24 days ago • 248
CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation Paper • 2602.24286 • Published Feb 27 • 97
When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning Paper • 2602.08236 • Published Feb 9 • 9
Everything in Its Place: Benchmarking Spatial Intelligence of Text-to-Image Models Paper • 2601.20354 • Published Jan 28 • 112
CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding Paper • 2602.01785 • Published Feb 2 • 96
3D-Aware Implicit Motion Control for View-Adaptive Human Video Generation Paper • 2602.03796 • Published Feb 3 • 64
Balancing Understanding and Generation in Discrete Diffusion Models Paper • 2602.01362 • Published Feb 1 • 17
Latent Chain-of-Thought as Planning: Decoupling Reasoning from Verbalization Paper • 2601.21358 • Published Jan 29 • 7
Vision-DeepResearch Benchmark: Rethinking Visual and Textual Search for Multimodal Large Language Models Paper • 2602.02185 • Published Feb 2 • 118