That's why DeepSeek R1-Zero doesn't perform well on traditional tasks like language consistency, helpfulness, harmlessness, etc.
Eventually, they needed multi-stage RL combined with a supervised dataset generated from V3 (or even ChatGPT, for that matter) to counter this effect.
So, calling DeepSeek R1 fully unsupervised is not fair.
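Just to illustrate what "multi-stage RL with a supervised dataset" can look like, here's a minimal conceptual sketch in Python: supervised fine-tuning (SFT) and RL stages alternating over the same model. Everything here (Stage, sft_stage, rl_stage) is a hypothetical placeholder, not DeepSeek's actual training code; the stage ordering loosely follows what the R1 paper describes (cold-start SFT, reasoning RL, rejection-sampling SFT, final all-scenario RL).

```python
from dataclasses import dataclass
from typing import Callable, List


class Model:
    """Placeholder for model weights/state."""


@dataclass
class Stage:
    name: str
    run: Callable[[Model], Model]


def sft_stage(model: Model) -> Model:
    # Fine-tune on a supervised dataset (e.g., data distilled from a
    # stronger model) to restore readability and helpfulness.
    return model


def rl_stage(model: Model) -> Model:
    # Reinforcement learning against reasoning rewards (and, in a later
    # stage, preference rewards for helpfulness/harmlessness).
    return model


def train(stages: List[Stage], model: Model) -> Model:
    # Run each stage in order; later stages build on earlier ones,
    # which is what makes the supervised stages part of the recipe.
    for stage in stages:
        print(f"Running stage: {stage.name}")
        model = stage.run(model)
    return model


pipeline = [
    Stage("cold-start SFT", sft_stage),
    Stage("reasoning RL", rl_stage),
    Stage("rejection-sampling SFT", sft_stage),
    Stage("all-scenario RL", rl_stage),
]

final_model = train(pipeline, Model())
```

The point of the sketch: two of the four stages consume supervised data, which is exactly why "fully unsupervised" is a stretch, even though the headline contribution is the RL part.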