In compute_grpo_outcome_advantage, the implementation currently computes the standard deviation of group scores with:
id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
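This is incorrect because wrapping id2score[idx] in an extra list creates a tensor of shape (1, N) instead of (N,), where N is the number of rollouts in the group, so the standard deviation is not computed over the intended 1-D vector of group scores.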
This bug affects the calculation of group-relative advantages in GRPO outcome supervision and may produce unstable training signals, especially when there are multiple rollouts per prompt.
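A minimal sketch of the shape difference and of a fix that makes the reduction axis explicit. It assumes id2score[idx] is a plain Python list of floats; the helper names group_std and group_advantages, the epsilon value, and the sample scores are hypothetical and not taken from the repository. Note that torch.std called without a dim argument reduces over every element of its input, so it is worth checking both the shape and the resulting value when verifying the fix.

```python
import torch


def group_std(scores):
    """Sample std over a flat list of per-rollout scores (hypothetical helper)."""
    flat = torch.tensor(scores, dtype=torch.float32)  # shape (N,)
    return torch.std(flat, dim=0)  # reduce explicitly over the rollout axis


def group_advantages(scores, eps=1e-6):
    """Group-relative advantages: (score - group mean) / (group std + eps).

    eps is an assumed small constant that guards against division by zero.
    """
    flat = torch.tensor(scores, dtype=torch.float32)
    return (flat - flat.mean()) / (group_std(scores) + eps)


scores = [1.0, 0.0, 1.0, 0.0]  # made-up scores for four rollouts of one prompt

nested = torch.tensor([scores])  # extra list adds a leading dimension
flat = torch.tensor(scores)      # intended 1-D vector

print(nested.shape)  # torch.Size([1, 4])
print(flat.shape)    # torch.Size([4])
print(group_std(scores))
print(group_advantages(scores))
```

With the nested tensor, any reduction along dim=0 would run over a dimension of size 1 rather than over the N rollouts, which is one way the extra list can change the result. Building the tensor from the flat list and passing dim=0 keeps the intent unambiguous.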