in Cua
The Training Gauntlet
Five SFT runs that scored zero, a failed RL attempt, switching models, and learning to decompose the problem.