<sup>1</sup>University of Washington, <sup>2</sup>Allen Institute for Artificial Intelligence
We propose 🔄 FLIP, a reference-free and rubric-free reward modeling approach: infer the instruction that would most plausibly produce a given response, and use the similarity between the inferred and the original instructions as the reward signal.
| Path | Description |
|---|---|
| `prompts/` | Prompt templates: FLIP (instruction inference) and LLM-judge baselines (pointwise, pairwise, listwise). |
| `metrics.py` | F1 score and normalization utilities for comparing inferred vs. ground-truth instructions. |
| `open-instruct/` | Open Instruct fork with FLIP integrated as a GRPO reward (judge type `flip`). |
### Step 1 — Get an inferred instruction from a response

- Use the templates in `prompts/`:
  - `prompts/FLIP_SYSTEM.prompt`: system message for the instruction-reconstruction task.
  - `prompts/FLIP_USER.prompt`: user message template; replace `{response}` with the model's response.
- Call your LLM with these prompts to obtain an inferred instruction (and optional reasoning). The model should output JSON with keys like `"REASONING"` and `"INFERRED INSTRUCTION"`.
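The step above can be sketched as follows. This is a hypothetical illustration, not code from the repo: the helper names (`build_flip_messages`, `parse_inferred_instruction`) and the sample judge output are made up, while the `{response}` placeholder and the JSON keys follow the description above.

```python
import json

def build_flip_messages(system_prompt: str, user_template: str, response: str) -> list:
    """Fill the {response} placeholder and assemble chat messages for the judge LLM."""
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_template.replace("{response}", response)},
    ]

def parse_inferred_instruction(judge_output: str) -> str:
    """Extract the inferred instruction from the judge's JSON output."""
    return json.loads(judge_output)["INFERRED INSTRUCTION"]

# Made-up example of what the judge might return:
sample_output = '{"REASONING": "...", "INFERRED INSTRUCTION": "Write a haiku about autumn."}'
```

In practice you would send `build_flip_messages(...)` to your LLM client of choice and pass its raw completion through `parse_inferred_instruction`.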
### Step 2 — Compute the reward with F1

```python
from metrics import f1_score

# prediction = inferred instruction from the LLM
# ground_truth = original instruction
result = f1_score(prediction, ground_truth)["f1"]
```

`metrics.py` also provides `normalize_answer(s)` for normalizing text before comparison (lowercasing, removing punctuation/articles, fixing whitespace).
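For intuition, a token-overlap F1 of the kind used for QA evaluation can be sketched as below. This is a simplified illustration; the repo's `f1_score` may differ in details (e.g. it applies the `normalize_answer` step first).

```python
from collections import Counter

def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-overlap F1 between two strings (simplified sketch)."""
    pred_tokens = prediction.lower().split()
    gt_tokens = ground_truth.lower().split()
    # Multiset intersection counts each shared token at most min(count_pred, count_gt) times
    num_same = sum((Counter(pred_tokens) & Counter(gt_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)
```

A close paraphrase of the original instruction scores high, while an unrelated inferred instruction scores near zero, which is what makes this usable as a reward signal.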
Training runs use the Open Instruct codebase under `open-instruct/`, with FLIP wired in as the judge type `flip`.
### Example datasets
- FLIP: `yikeee/rlvr_general_chat_flip`
- LLM Judge: `yikeee/rlvr_general_chat`
Both datasets share the same structure and content: 12k English prompts from the WildChat dataset. They differ only in the judge type specified by the `"dataset"` attribute.
You can use these datasets as-is, or adapt your own data to match the same schema.
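A conversion might look like the sketch below. The `"dataset"` field follows the description above; the other field names (`"messages"`, `"ground_truth"`) are assumptions based on the common open-instruct RLVR layout, so check the Hub datasets for the exact schema before adapting your own data.

```python
def to_rlvr_record(prompt: str, judge_type: str = "flip") -> dict:
    """Hypothetical sketch: wrap a raw prompt in the assumed RLVR record schema."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        # For FLIP, the original instruction itself is the reconstruction target
        "ground_truth": prompt,
        # Selects the judge type at training time (e.g. "flip")
        "dataset": judge_type,
    }
```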
### Running training
- FLIP reward (instruction inference + F1): `bash open-instruct/scripts/train/grpo_flip.sh`
- LLM-judge baseline (e.g. pointwise quality score): `bash open-instruct/scripts/train/grpo_llmjudge.sh`
### Important script variables (in the `.sh` scripts)
- `DATASETS` — Training mix, e.g. `"yikeee/rlvr_general_chat_flip 1.0"`.
- `MODEL_NAME_OR_PATH` — Policy model (e.g. `allenai/Olmo-3-7B-Think-DPO`).
- `--llm_judge_model` — Judge model used for FLIP (instruction inference) or the LLM-judge baseline (e.g. `hosted_vllm/Qwen/Qwen3-4B`).
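Edits to the scripts typically look like the fragment below. The values come from the examples above; the `LLM_JUDGE_MODEL` variable name is an assumption for illustration, so check the actual `.sh` files for how `--llm_judge_model` is set.

```shell
# Hypothetical excerpt of the variables edited in grpo_flip.sh
DATASETS="yikeee/rlvr_general_chat_flip 1.0"       # training mix
MODEL_NAME_OR_PATH="allenai/Olmo-3-7B-Think-DPO"   # policy model
LLM_JUDGE_MODEL="hosted_vllm/Qwen/Qwen3-4B"        # passed via --llm_judge_model
```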
Inside Open Instruct, the FLIP pipeline uses the same idea as the standalone usage: the judge model is prompted to infer the instruction from the response, then `f1_score(inferred_instruction, ground_truth)` (in `open-instruct/open_instruct/judge_utils.py` and `ground_truth_utils.py`) gives the reward.
If you have any questions or comments about our paper, or notice any issues in the code, feel free to reach out at yikewang@cs.washington.edu. We will do our best to respond within one business day.
If you found this work helpful, please consider starring this repository and citing our paper as shown below:
```bibtex
@article{wang2026small,
  title={Small Reward Models via Backward Inference},
  author={Wang, Yike and Brahman, Faeze and Feng, Shangbin and Xiao, Teng and Hajishirzi, Hannaneh and Tsvetkov, Yulia},
  journal={arXiv preprint arXiv:2602.13551},
  year={2026}
}
```