Bug in GRPO outcome advantage: torch.std(torch.tensor([id2score[idx]])) incorrectly computes group std.

In `compute_grpo_outcome_advantage`, the implementation currently computes the standard deviation of each group's scores with:
`id2std[idx] = torch.std(torch.tensor([id2score[idx]]))`
This looks incorrect: wrapping `id2score[idx]` in an extra list creates a tensor of shape `(1, N)` instead of `(N,)`, so the std is not computed on the flat vector of group scores as intended.
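A minimal sketch for checking the shape claim, assuming `id2score[idx]` holds plain scalar scores (an assumption; the issue does not show how the list is built, and `group_scores` is a hypothetical stand-in). Note that `torch.std` called without `dim` reduces over all elements, so in this sketch the two calls return the same value even though the shapes differ; the extra dimension would matter for a reduction along a specific `dim`.

```python
import torch

# Hypothetical scores for one prompt's rollouts (illustrative values only).
group_scores = [1.0, 2.0, 4.0]

wrapped = torch.tensor([group_scores])  # extra list -> shape (1, N)
flat = torch.tensor(group_scores)       # no extra list -> shape (N,)

# With no `dim` argument, torch.std reduces over every element of its input.
std_wrapped = torch.std(wrapped)
std_flat = torch.std(flat)

print(wrapped.shape, flat.shape)
print(torch.allclose(std_wrapped, std_flat))
```

If the two values agree on real data as well, the extra list is a shape inconsistency rather than a numerical error, and that is worth confirming before changing the training code.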
This bug affects the calculation of group-relative advantages in GRPO outcome supervision and may produce unstable training signals, especially when there are multiple rollouts per prompt.
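For context, group-relative advantages are usually obtained by normalizing each rollout's score with its group's mean and std. The sketch below uses only the standard library; the function name, the `eps` value, and the zero fallback for single-rollout groups are assumptions made for illustration, not the repository's implementation.

```python
import statistics


def group_relative_advantages(scores, eps=1e-6):
    """Normalize one prompt's rollout scores by the group mean and std."""
    if len(scores) < 2:
        # Sample std is undefined for a single rollout; returning zeros
        # here is an assumption of this sketch.
        return [0.0 for _ in scores]
    mean = statistics.fmean(scores)
    # Sample std (n - 1 denominator), the same convention as torch.std's default.
    std = statistics.stdev(scores)
    return [(s - mean) / (std + eps) for s in scores]


advantages = group_relative_advantages([1.0, 2.0, 3.0])
print(advantages)
```

Any error in the per-group std scales every advantage in that group, which is why the std input deserves a careful check.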

Bug in GRPO outcome advantage: torch.std(torch.tensor([id2score[idx]])) incorrectly computes group std. #236
