Condition A — GRPO alone (100 RL steps, no curriculum)

This is the final-step (step 100) policy from Condition A of the 2×2 ablation in Cao 2026, which disentangles the effects of MaxRL and curriculum learning.


| | |
|---|---|
| Base | `Qwen/Qwen3-1.7B-Base` |
| Algorithm | GRPO (`actor_rollout_ref.algorithm.adv_estimator=grpo`) |
| Curriculum | OFF (uniform sampling over POLARIS-53K) |
| Steps | 100 |
| Hardware | 4×H200 on ou_sloan_gpu (MIT ORCD) |
| Training-time val (step 100) | MATH P@32 = 0.892, AIME P@32 = 0.247 |
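
For readers unfamiliar with the GRPO setting listed above, the following is a minimal sketch of the group-relative advantage that GRPO is generally described as using: each rollout's reward is normalized against the other rollouts sampled for the same prompt. It is not taken from this model's training code. The function name `group_relative_advantages`, the `eps` value, and the choice of population standard deviation are assumptions made for illustration; the actual training framework may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rollout rewards within their own group.

    Hypothetical helper for illustration only: `eps` and the use of the
    population standard deviation are assumptions, not settings taken
    from this model card.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts above the group mean get positive advantages, those below
    # get negative ones; eps avoids division by zero when all rewards match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for a single prompt, scored with binary correctness rewards.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, all advantages are zero, so that prompt contributes no learning signal under this formulation.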

Reproduction

Safetensors: model size 2B params, tensor type BF16
