lihaocruiser's Collection: LLM-RL
Direct Preference Optimization: Your Language Model is Secretly a Reward Model (arXiv:2305.18290)
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training (arXiv:2306.01693)
Self-Rewarding Language Models (arXiv:2401.10020)
Secrets of RLHF in Large Language Models Part II: Reward Modeling (arXiv:2401.06080)
ReFT: Reasoning with Reinforced Fine-Tuning (arXiv:2401.08967)
sDPO: Don't Use Your Data All at Once (arXiv:2403.19270)
The Lessons of Developing Process Reward Models in Mathematical Reasoning (arXiv:2501.07301)
O1 Replication Journey -- Part 3: Inference-time Scaling for Medical Reasoning (arXiv:2501.06458)
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs (arXiv:2412.21187)
Reinforcement Learning for Reasoning in Large Language Models with One Training Example (arXiv:2504.20571)