SFT can generalize as well as—or better than—RL when trained with the right data.
CUDA 12.2 and cuDNN 9.1.0 work, but the official docs recommend CUDA >= 12.4 and cuDNN >= 9.8.0.
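To see where a machine stands relative to the recommended CUDA version, a check along the following lines may help. It is only a sketch: `version_ge` and `RECOMMENDED_CUDA` are hypothetical names that are not part of this repository, the comparison assumes GNU `sort -V`, and cuDNN is not checked because it has no comparable command-line tool.

```shell
#!/usr/bin/env bash
# Sketch: compare the local CUDA toolkit version with the recommended minimum.
# version_ge is a hypothetical helper (not part of this repo); needs GNU `sort -V`.
version_ge() {
  # Succeeds when version $1 is greater than or equal to version $2.
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

RECOMMENDED_CUDA="12.4"

if command -v nvcc >/dev/null 2>&1; then
  # nvcc prints a line such as "Cuda compilation tools, release 12.2, V12.2.140".
  cuda_version="$(nvcc --version | sed -n 's/.*release \([0-9][0-9.]*\).*/\1/p')"
  if version_ge "$cuda_version" "$RECOMMENDED_CUDA"; then
    echo "CUDA $cuda_version meets the recommended minimum ($RECOMMENDED_CUDA)."
  else
    echo "CUDA $cuda_version is below the recommended $RECOMMENDED_CUDA; it may still work."
  fi
else
  echo "nvcc not found; skipping the CUDA version check."
fi
```

The same helper can compare cuDNN versions once you know the installed one, for example `version_ge 9.1.0 9.8.0` fails while `version_ge 9.8.0 9.8.0` succeeds.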
conda create -n debunk_sft python=3.10
conda activate debunk_sft
USE_MEGATRON=0 bash setup.sh
git submodule init
git submodule update
pip install -e thirdparty/verl --no-dependencies
pip install -e thirdparty/ragen --no-dependencies
pip install -e thirdparty/alfworld --no-dependencies
pip install -e thirdparty/trl --no-dependencies

| Task | Method | Diversity | Format | Link |
|---|---|---|---|---|
| Sokoban | RL | non-diverse | — | 🤗 |
| Sokoban | RL | diverse | — | 🤗 |
| Sokoban | SFT | non-diverse | answer-only | 🤗 |
| Sokoban | SFT | diverse | answer-only | 🤗 |
| Sokoban | SFT | non-diverse | cot | 🤗 |
| Sokoban | SFT | diverse | cot | 🤗 |
| General Points | RL | non-diverse | — | 🤗 |
| General Points | RL | diverse | — | 🤗 |
| General Points | SFT | non-diverse | answer-only | 🤗 |
| General Points | SFT | diverse | answer-only | 🤗 |
| General Points | SFT | non-diverse | cot | 🤗 |
| General Points | SFT | diverse | cot | 🤗 |
Specify your model and data beforehand. For Sokoban:
bash debunk_sft/scripts/sokoban/sokoban_train_and_eval.sh
For General Points:
bash debunk_sft/scripts/gp_l/gp_l_train_and_eval.sh
Specify your model and data beforehand. For Sokoban with GRPO:
bash debunk_sft/scripts/sokoban/sokoban_grpo.sh
For General Points with GRPO:
bash debunk_sft/scripts/gp_l/gp_l_grpo.sh
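If you want to launch both GRPO runs back to back, a small wrapper like the one below could do it. This is a sketch under assumptions: `run_logged` and `LOG_DIR` are hypothetical names that the repository does not define, and the wrapper only starts when it is invoked from the repository root where the two scripts exist.

```shell
#!/usr/bin/env bash
# Sketch: run training scripts one after another, writing one log file each.
# LOG_DIR and run_logged are hypothetical names, not defined by this repo.
LOG_DIR="${LOG_DIR:-logs}"

run_logged() {
  # $1: path to a script. Returns that script's exit status.
  local script="$1"
  local name
  name="$(basename "$script" .sh)"
  if [ ! -f "$script" ]; then
    echo "missing: $script" >&2
    return 1
  fi
  mkdir -p "$LOG_DIR"
  bash "$script" >"$LOG_DIR/$name.log" 2>&1
}

# Only launch from the repository root, where the scripts exist.
if [ -f debunk_sft/scripts/sokoban/sokoban_grpo.sh ]; then
  # The second run starts only if the first one exits successfully.
  run_logged debunk_sft/scripts/sokoban/sokoban_grpo.sh &&
    run_logged debunk_sft/scripts/gp_l/gp_l_grpo.sh
fi
```

Set `LOG_DIR` to change where the logs are written, for example `LOG_DIR=runs/grpo`.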
If you use this work in academic research, please cite:
@article{lin2025debunk,
title={Debunk the Myth of SFT Generalization},
author={Lin, Xiaofeng and Sang, Hejian and Wang, Zhipeng and Zhang, Xuezhou},
journal={arXiv preprint arXiv:2510.00237},
year={2025}
}