Bug in GRPO outcome advantage: torch.std(torch.tensor([id2score[idx]])) incorrectly computes group std. #236

@Hoyant-Su

Description

In compute_grpo_outcome_advantage, the implementation currently computes the standard deviation of group scores with:
id2std[idx] = torch.std(torch.tensor([id2score[idx]]))
This appears incorrect: wrapping id2score[idx] in an extra list creates a tensor of shape (1, N) instead of (N,), so the std is not computed over the intended flat vector of the group's N scores.
This bug affects the calculation of group-relative advantages in GRPO outcome supervision and may produce unstable training signals, especially when there are multiple rollouts per prompt.
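
For reference, here is a minimal pure-Python sketch of the intended group-relative normalization, where the std is taken over the flat list of N scores in each group. It is not the repository's code: the function name group_relative_advantages, the eps value, and the handling of single-rollout groups (mean 0, std 1) are assumptions made for illustration. statistics.stdev is used because it is Bessel-corrected, matching the default of torch.std.

```python
from collections import defaultdict
from statistics import mean, stdev


def group_relative_advantages(scores, group_ids, eps=1e-6):
    """Normalize each score by the mean and std of its prompt group.

    Hypothetical helper for illustration only; not the library's API.
    """
    # Collect the scalar outcome scores of all rollouts per prompt id.
    id2score = defaultdict(list)
    for score, idx in zip(scores, group_ids):
        id2score[idx].append(score)

    id2mean, id2std = {}, {}
    for idx, group in id2score.items():
        if len(group) == 1:
            # Assumption: a single-rollout group is left unnormalized.
            id2mean[idx], id2std[idx] = 0.0, 1.0
        else:
            id2mean[idx] = mean(group)
            # Sample std over the flat list of N scores, i.e. shape (N,).
            id2std[idx] = stdev(group)

    # Advantage of each rollout relative to its own group.
    return [
        (s - id2mean[i]) / (id2std[i] + eps)
        for s, i in zip(scores, group_ids)
    ]


if __name__ == "__main__":
    # Two rollouts for prompt "a", one rollout for prompt "b".
    print(group_relative_advantages([1.0, 3.0, 2.0], ["a", "a", "b"]))
```

In the PyTorch version, the equivalent of the flat list would be a 1-D tensor of the group's scores. torch.std called without a dim argument reduces over all elements, so comparing the (1, N) and (N,) forms on a small group is a quick way to check how much the reported shape difference changes the resulting value.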
