Final-step (step 100) policy from Condition A of the 2×2 ablation in Cao 2026 disentangling MaxRL and curriculum learning.

| Setting | Value |
|---|---|
| Base | Qwen/Qwen3-1.7B-Base |
| Algorithm | GRPO (`actor_rollout_ref.algorithm.adv_estimator=grpo`) |
| Curriculum | OFF (uniform sampling over POLARIS-53K) |
| Steps | 100 |
| Hardware | 4×H200 on ou_sloan_gpu (MIT ORCD) |
| Training-time val (step 100) | MATH P@32 = 0.892, AIME P@32 = 0.247 |

Training scripts: `scripts/06_train_A_*.sbatch` + `scripts/_train_inner_AB.sh`
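
For readers unfamiliar with the GRPO estimator named above, the following is a minimal sketch of group-relative advantage normalization as it is commonly described: each sampled response's reward is centered and scaled by the statistics of its own prompt group. It is not the training code behind this checkpoint. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled responses.

    Hypothetical helper for illustration only; the population standard
    deviation and eps=1e-6 are assumptions, not settings from this run.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four sampled responses, binary correctness rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response scores the same yields zero advantages,
# so such a prompt contributes no policy gradient signal.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

Correct responses receive positive advantages and incorrect ones negative, with the group summing to roughly zero.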