On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral • Paper • 2512.04220 • Published Dec 3, 2025
Token Hidden Reward: Steering Exploration-Exploitation in Group Relative Deep Reinforcement Learning • Paper • 2510.03669 • Published Oct 4, 2025