yikee/FLIP

Small Reward Models via Backward Inference

Yike Wang1, Faeze Brahman2, Shangbin Feng1, Teng Xiao2, Hannaneh Hajishirzi1,2, Yulia Tsvetkov1
1University of Washington, 2Allen Institute for Artificial Intelligence

(Overview figure)

We propose 🔄 FLIP, a reference-free and rubric-free reward modeling approach: infer the instruction that would most plausibly produce a given response, and use the similarity between the inferred and the original instructions as the reward signal.

Repository structure

| Path | Description |
| --- | --- |
| `prompts/` | Prompt templates: FLIP (instruction inference) and LLM-judge baselines (pointwise, pairwise, listwise). |
| `metrics.py` | F1 score and normalization utilities for comparing inferred vs. ground-truth instructions. |
| `open-instruct/` | Open Instruct fork with FLIP integrated as a GRPO reward (judge type `flip`). |

1. Using the FLIP reward (standalone)

Step 1 — Get inferred instruction from a response

  • Use the templates in prompts/:
    • prompts/FLIP_SYSTEM.prompt: system message for the instruction-reconstruction task.
    • prompts/FLIP_USER.prompt: user message template; replace {response} with the model’s response.
  • Call your LLM with these prompts to obtain an inferred instruction (and optional reasoning). The model should output JSON with keys like "REASONING" and "INFERRED INSTRUCTION".
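Step 1 can be sketched as a small helper that fills the user template and parses the judge's JSON reply. The template string below is a hypothetical stand-in for the contents of prompts/FLIP_USER.prompt; only the `{response}` placeholder and the JSON keys "REASONING" and "INFERRED INSTRUCTION" come from the repository.

```python
import json

# Hypothetical stand-in for prompts/FLIP_USER.prompt; the real template text
# lives in the repo and only needs a {response} placeholder.
FLIP_USER_TEMPLATE = (
    "Here is a model response:\n{response}\n"
    "Infer the instruction that most plausibly produced it."
)

def build_user_message(response: str) -> str:
    """Substitute the model's response into the FLIP user template."""
    return FLIP_USER_TEMPLATE.format(response=response)

def parse_inferred_instruction(llm_output: str) -> str:
    """Extract the inferred instruction from the judge's JSON output."""
    data = json.loads(llm_output)
    return data["INFERRED INSTRUCTION"]

# Example round trip with a mocked judge reply:
msg = build_user_message("Paris is the capital of France.")
inferred = parse_inferred_instruction(
    '{"REASONING": "...", "INFERRED INSTRUCTION": "What is the capital of France?"}'
)
```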

Step 2 — Compute reward with F1

```python
from metrics import f1_score

# prediction = inferred instruction from the LLM
# ground_truth = original instruction
result = f1_score(prediction, ground_truth)["f1"]
```

metrics.py also provides normalize_answer(s) for normalizing text before comparison (lowercasing, removing punctuation/articles, fixing whitespace).
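For intuition, here is a minimal sketch of SQuAD-style normalization and token-level F1, assumed to match the behavior described above; the exact implementation in metrics.py may differ.

```python
import re
import string
from collections import Counter

def normalize_answer(s: str) -> str:
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def f1_score(prediction: str, ground_truth: str) -> dict:
    """Token-level F1 between normalized prediction and ground truth."""
    pred_tokens = normalize_answer(prediction).split()
    gt_tokens = normalize_answer(ground_truth).split()
    common = Counter(pred_tokens) & Counter(gt_tokens)  # multiset overlap
    num_same = sum(common.values())
    if num_same == 0:
        return {"f1": 0.0}
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gt_tokens)
    return {"f1": 2 * precision * recall / (precision + recall)}

# All 4 inferred tokens match; ground truth has one extra token ("short").
reward = f1_score("Write a poem about the sea",
                  "Write a short poem about the sea.")["f1"]
```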


2. RL training with GRPO (FLIP as reward)

Training runs use the Open Instruct codebase under open-instruct/, with FLIP wired in as the judge type flip.

Example datasets

Both example datasets share the same structure and content: 12k English prompts from the WildChat dataset. They differ only in the judge type specified by the "dataset" attribute. You can use them as-is, or adapt your own data to match the same schema.
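As a rough illustration of the schema, a single row might look like the dict below. This assumes an Open Instruct RLVR-style layout; the field names here are illustrative, so check the actual datasets for the authoritative format.

```python
# Hypothetical example row; only the "dataset" attribute's role (selecting the
# judge type) is stated in this README, the rest is an assumed layout.
example_row = {
    "messages": [
        # The original instruction serves as the prompt to the policy model.
        {"role": "user", "content": "Write a haiku about autumn."}
    ],
    # The same instruction is kept as ground truth for the F1 comparison.
    "ground_truth": "Write a haiku about autumn.",
    # Selects the judge type, e.g. "flip" vs. an LLM-judge baseline.
    "dataset": "flip",
}
```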

Running training

  • FLIP reward (instruction inference + F1):
    bash open-instruct/scripts/train/grpo_flip.sh
  • LLM-judge baseline (e.g. pointwise quality score):
    bash open-instruct/scripts/train/grpo_llmjudge.sh

Important script variables (in the .sh scripts)

  • DATASETS — Training mix, e.g. "yikeee/rlvr_general_chat_flip 1.0".
  • MODEL_NAME_OR_PATH — Policy model (e.g. allenai/Olmo-3-7B-Think-DPO).
  • --llm_judge_model — Judge model used for FLIP (infer instruction) or LLM-judge (e.g. hosted_vllm/Qwen/Qwen3-4B).
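The variables above would be edited near the top of the .sh script; the fragment below is a sketch using the example values from this README, not the script's verbatim contents.

```shell
# Sketch of the knobs in open-instruct/scripts/train/grpo_flip.sh.
# Values are the README's examples; swap in your own dataset and models.
DATASETS="yikeee/rlvr_general_chat_flip 1.0"       # training mix: dataset + weight
MODEL_NAME_OR_PATH="allenai/Olmo-3-7B-Think-DPO"   # policy model to train
# Judge model is passed to the trainer via:
#   --llm_judge_model hosted_vllm/Qwen/Qwen3-4B
```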

Inside Open Instruct, the FLIP pipeline follows the same recipe as the standalone usage above: the judge model is prompted to infer the instruction from the response, then f1_score(inferred_instruction, ground_truth) (in open-instruct/open_instruct/judge_utils.py and ground_truth_utils.py) supplies the reward.
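That wiring can be condensed into a few lines; `judge` below is a hypothetical stand-in for a call to the `--llm_judge_model` endpoint, and the stubs exist only to make the sketch runnable.

```python
from typing import Callable

def flip_reward(response: str, ground_truth: str,
                judge: Callable[[str], str],
                f1_score: Callable[[str, str], dict]) -> float:
    """Infer the instruction behind `response`, then score it against the original."""
    inferred = judge(response)  # judge model prompted with the FLIP templates
    return f1_score(inferred, ground_truth)["f1"]

# Stubbed components for illustration only:
def stub_judge(response: str) -> str:
    return "What is 2 + 2?"

def stub_f1(prediction: str, ground_truth: str) -> dict:
    return {"f1": 1.0 if prediction == ground_truth else 0.0}

r = flip_reward("4", "What is 2 + 2?", stub_judge, stub_f1)
```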


Questions

If you have any questions or comments about our paper, or notice any issues in the code, feel free to reach out at yikewang@cs.washington.edu. We will do our best to respond within one business day.


Citing

If you find this work helpful, please consider starring this repository and citing our paper as shown below:

@article{wang2026small,
  title={Small Reward Models via Backward Inference},
  author={Wang, Yike and Brahman, Faeze and Feng, Shangbin and Xiao, Teng and Hajishirzi, Hannaneh and Tsvetkov, Yulia},
  journal={arXiv preprint arXiv:2602.13551},
  year={2026}
}
