Hacker News

by delifue on 3/22/2025, 7:59 AM with 1 comment

by delifue on 3/22/2025, 8:01 AM

DeepSeek-V3-Base already exhibits an "Aha moment" before RL-tuning

The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO

Understanding R1-Zero-Like Training: A Critical Perspective [pdf]