Condition A — GRPO alone (100 RL steps, no curriculum)

This is the final (step 100) policy from Condition A of the 2×2 ablation in Cao (2026), which disentangles the effects of MaxRL and curriculum learning.
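As a rough illustration of how this checkpoint might be loaded for inference, here is a minimal sketch using the Hugging Face `transformers` library. The repository ID is taken from this page; the prompt format in `build_prompt` and the generation settings are hypothetical and are not the ones used in training or evaluation.

```python
# Minimal loading sketch (assumes `transformers` and `torch` are installed).
MODEL_ID = "Sean13/grpo_nocurriculum_Qwen3-1.7B-100step"


def build_prompt(problem: str) -> str:
    # Hypothetical plain-text prompt; the card does not document the
    # template used during RL training or validation.
    return f"Problem: {problem}\nSolution:"


def generate_solution(problem: str, max_new_tokens: int = 512) -> str:
    # Imported lazily so the helpers above can be used without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    # torch_dtype="auto" keeps the dtype stored in the checkpoint.
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

    inputs = tokenizer(build_prompt(problem), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens and decode only the newly generated ones.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Example call (downloads the weights on first use):
# print(generate_solution("What is 12 * 13?"))
```

Decoding here is greedy by default; sampling-based metrics such as P@32 would need multiple sampled completions per problem.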

Base model: Qwen/Qwen3-1.7B-Base
Algorithm: GRPO (actor_rollout_ref.algorithm.adv_estimator=grpo)
Curriculum: OFF (uniform sampling over POLARIS-53K)
Steps: 100
Hardware: 4×H200 on ou_sloan_gpu (MIT ORCD)
Training-time validation (step 100): MATH P@32 = 0.892, AIME P@32 = 0.247
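For readers unfamiliar with the GRPO setting listed above, the following is a minimal sketch of the group-relative advantage at its core: each sampled response's reward is normalized by the mean and standard deviation of the rewards for the same prompt. This is the textbook form, not the training framework's code; the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's group of sampled responses.

    rewards: scalar rewards, one per sampled response to the same prompt.
    eps: small constant (assumed value) that avoids division by zero when
         every response in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; an implementation may use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Four responses to one prompt with binary correctness rewards.
group_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(group_rewards)
```

Correct responses receive positive advantages and incorrect ones negative, while a group in which every response earns the same reward yields all-zero advantages and therefore no learning signal for that prompt.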

Reproduction

Safetensors · Model size: 2B params · Tensor type: BF16

Model: Sean13/grpo_nocurriculum_Qwen3-1.7B-100step, fine-tuned from Qwen/Qwen3-1.7B-Base on POLARIS-53K.