ThinkTwice: Jointly Optimizing Large Language Models for Reasoning and Self-Refinement
Abstract
ThinkTwice is a two-phase framework that jointly optimizes large language models for reasoning and self-refinement using Group Relative Policy Optimization, demonstrating improved performance on mathematical reasoning benchmarks.
We introduce ThinkTwice, a simple two-phase framework that jointly optimizes LLMs to both solve reasoning problems and refine their answers, based on Group Relative Policy Optimization (GRPO). In each pair of training steps, ThinkTwice first optimizes the model on solving reasoning problems, then optimizes it on refining its own solutions to the same problems, using the same binary correctness reward in both phases and requiring no auxiliary supervision such as critique annotations. Across five mathematical reasoning benchmarks and two model families, including Qwen3-4B and OLMo3-7B, ThinkTwice substantially improves both reasoning and refinement performance over competitive online policy optimization baselines. Specifically, on Qwen3-4B, ThinkTwice outperforms GRPO on AIME by 5 percentage points before refinement and by 11.5 points after one self-refinement step, measured by pass@4. Analysis of ThinkTwice's training dynamics reveals an implicit rectify-then-fortify curriculum: refinement predominantly corrects errors early in training and naturally shifts toward preserving already-correct solutions as the model improves, yielding a progressively more informative reward signal. Our work establishes joint training of reasoning and self-refinement as a principled and effective methodology for RLVR.
Community
ThinkTwice is a two-phase GRPO framework that trains large language models to both solve reasoning problems and refine their own solutions, using only binary correctness rewards and no critique annotations. The method discovers an implicit "rectify-then-fortify" curriculum where early training corrects errors and later training preserves already-correct solutions.
Key Idea
The core insight is that reasoning and self-refinement can be jointly optimized in a simple two-phase loop. In Phase 1 the model generates an initial solution to a reasoning problem. In Phase 2 the model attempts to refine that solution. Both phases use the same binary correctness reward signal under GRPO, eliminating the need for expensive critique or preference annotations.
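The two-phase loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `policy.solve`, `policy.refine`, `policy.check`, and `policy.update` are hypothetical stand-ins for rollout sampling, refinement prompting, answer verification, and the policy-gradient update; only the group-relative advantage normalization follows the standard GRPO formula.

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style normalization of per-rollout rewards within one group:
    advantage_i = (r_i - mean(r)) / (std(r) + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def thinktwice_pair_of_steps(policy, problems, n_rollouts=4):
    """One ThinkTwice iteration: a solve step followed by a refine step,
    both scored with the same binary correctness reward (hypothetical API)."""
    for problem in problems:
        # Phase 1: sample a group of initial solutions and update on them.
        solutions = [policy.solve(problem) for _ in range(n_rollouts)]
        rewards = [policy.check(problem, s) for s in solutions]  # 0 or 1
        policy.update(solutions, group_relative_advantages(rewards))

        # Phase 2: refine the same solutions; identical reward, no critiques.
        refined = [policy.refine(problem, s) for s in solutions]
        rewards = [policy.check(problem, s) for s in refined]
        policy.update(refined, group_relative_advantages(rewards))
```

Note that when every rollout in a group receives the same reward (all correct or all wrong), the normalized advantages are all zero and the update is effectively a no-op, which is a standard property of group-relative baselines.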
Method / Approach
During training, an emergent "rectify-then-fortify" curriculum naturally arises. In early iterations the refinement phase primarily corrects wrong initial answers, learning to identify and fix mistakes. As training progresses and the model's initial solutions improve, the refinement phase shifts to preserving correct answers, learning when not to change things. This two-regime dynamic is not engineered but emerges from the joint optimization.
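One simple way to observe this dynamic, assuming access to the binary rewards of each (initial, refined) solution pair, is to tally refinement transitions per training batch. The function and category names below are our own illustration, not the paper's code.

```python
def refinement_transitions(initial_rewards, refined_rewards):
    """Tally how refinement changed binary correctness rewards.

    rectified: 0 -> 1 (a wrong answer was fixed)
    fortified: 1 -> 1 (a correct answer was preserved)
    degraded:  1 -> 0 (a correct answer was broken)
    stuck:     0 -> 0 (still wrong after refinement)
    """
    counts = {"rectified": 0, "fortified": 0, "degraded": 0, "stuck": 0}
    for r0, r1 in zip(initial_rewards, refined_rewards):
        if r0 == 0 and r1 == 1:
            counts["rectified"] += 1
        elif r0 == 1 and r1 == 1:
            counts["fortified"] += 1
        elif r0 == 1 and r1 == 0:
            counts["degraded"] += 1
        else:
            counts["stuck"] += 1
    return counts
```

Under the rectify-then-fortify curriculum described above, the `rectified` count would dominate early in training and the `fortified` count later, as initial solutions become correct more often.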
Results
On Qwen3-4B, ThinkTwice outperforms standard GRPO on AIME by 5 percentage points before refinement and 11.5 percentage points after refinement. The approach generalizes across architectures, showing consistent gains on both Qwen3-4B and OLMo3-7B.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- iGRPO: Self-Feedback-Driven LLM Reasoning (2026)
- Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning (2026)
- Apriel-1.5-OpenReasoner: RL Post-Training for General-Purpose and Efficient Reasoning (2026)
- $\textbf{Re}^{2}$: Unlocking LLM Reasoning via Reinforcement Learning with Re-solving (2026)
- Beyond Correctness: Learning Robust Reasoning via Transfer (2026)
- Seeing with You: Perception-Reasoning Coevolution for Multimodal Reasoning (2026)
- Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models (2026)


