From: https://x.com/zzlccc/status/1903162768083259703
DeepSeek-V3-Base already exhibits an "Aha moment" before RL-tuning
The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO
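The post does not say which term causes the bias. One possible reading, assumed here, is the per-response length normalization (the 1/|o_i| factor) in the GRPO objective. A minimal sketch under that assumption, with a hypothetical helper `per_token_weight` and made-up lengths and advantages:

```python
def per_token_weight(advantage: float, length: int, length_normalize: bool = True) -> float:
    """Weight each token of one response receives in the loss (hypothetical helper).

    With length_normalize=True the response's advantage is divided by its own
    token count, mimicking a 1/|o_i| factor (an assumption about the bias).
    """
    return advantage / length if length_normalize else advantage


# Two incorrect responses with the same negative advantage but different lengths.
short_wrong = per_token_weight(-1.0, length=100)
long_wrong = per_token_weight(-1.0, length=1000)

# With length normalization, the long wrong answer is penalized less per token.
print(abs(long_wrong) < abs(short_wrong))

# Without it, both responses receive the same per-token penalty.
short_unbiased = per_token_weight(-1.0, length=100, length_normalize=False)
long_unbiased = per_token_weight(-1.0, length=1000, length_normalize=False)
print(short_unbiased == long_unbiased)
```

Under this assumption, wrong answers are discouraged less per token the longer they get, which could nudge output length upward over the course of training.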