This is the official repository of **CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation** (ICLR 2026).

This repository contains the evaluation framework for assessing the safety of LLM prompts and responses against a rubric. It supports both Prompt Safety Evaluation (checking whether a prompt itself is harmful) and Response Safety Evaluation (checking whether a model's response to a prompt is harmful), using GPT-4-based judges with dedicated rubrics for English and Korean.
FULL DATA (KorSET): LINK
```
.
├── data/                 # Input datasets (CSV files)
├── evaluate/             # Core evaluation logic and rubrics
│   ├── base/             # Model wrappers (e.g., OpenAI GPT)
│   ├── template/         # Safety rubrics and prompt templates
│   │   ├── en/           # English rubrics
│   │   └── ko/           # Korean rubrics
│   └── utils/            # Utility scripts (Logger, etc.)
├── logs/                 # Execution logs (automatically created)
├── run/                  # Execution scripts
│   └── safety_judge.py   # Main entry point for evaluation
└── requirements.txt      # Python dependencies
```
Follow the steps below to set up the environment.

Using Conda:

```bash
conda create -n safebench python=3.10 -y
conda activate safebench
```

Or using venv:

```bash
python3.10 -m venv safebench_env
source safebench_env/bin/activate
```

Install the dependencies:

```bash
pip install -r requirements.txt
```

The main script is `run/safety_judge.py`. It evaluates datasets using the OpenAI API, with GPT-4.1 as the LLM safety judge.
| Argument | Flag | Required | Description | Default |
|---|---|---|---|---|
| Input File | `--input`, `-i` | ✅ | Path to the input CSV file. | - |
| Output Dir | `--outdir`, `-d` | ✅ | Directory to save results. | - |
| Output File | `--outfile`, `-o` | ✅ | Name of the result CSV file. | - |
| Mode | `--mode`, `-m` | ✅ | Evaluation mode: `prompt` or `response`. | - |
| Language | `--lang`, `-l` | ✅ | Language of the rubric: `en` or `ko`. | - |
| API Key | `--api_key`, `-a` | ✅ | Your OpenAI API key. | - |
| Model | `--model` | | Judge model name. | `gpt-4.1` |
| Prompt Col | `--prompt_col`, `-pc` | | Column name for the prompt. | `seed` |
| Response Col | `--response_col`, `-rc` | | Column name for the response (response mode only). | `response` |
| Category Col | `--category_col`, `-cc` | | Column name for the category. | `category` |
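Input files are plain CSVs whose column names match the `--prompt_col`, `--response_col`, and `--category_col` flags. As a minimal sketch, assuming the default column names (`seed`, `response`, `category`) and illustrative placeholder rows, an input file could be created like this:

```python
import csv

# Minimal input CSV using the default column names expected by
# run/safety_judge.py (--prompt_col seed, --response_col response,
# --category_col category). The rows below are illustrative placeholders.
rows = [
    {"seed": "How do I stay safe online?",
     "response": "Use strong, unique passwords and enable 2FA.",
     "category": "privacy"},
    {"seed": "Tell me about common internet scams.",
     "response": "Common scams include phishing emails and fake stores.",
     "category": "fraud"},
]

with open("sample_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["seed", "response", "category"])
    writer.writeheader()
    writer.writerows(rows)
```

For prompt mode only the prompt column is read, so the `response` column may be omitted there.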
Evaluates whether the input prompts themselves are harmful.

```bash
python run/safety_judge.py \
    --input data/sample_data.csv \
    --outdir results \
    --outfile result_prompt_ko.csv \
    --mode prompt \
    --lang ko \
    --api_key sk-proj-...
```

Evaluates whether the model's response to a prompt is harmful.
```bash
python run/safety_judge.py \
    --input data/sample_data.csv \
    --outdir results \
    --outfile result_response_en.csv \
    --mode response \
    --lang en \
    --api_key sk-proj-...
```

The script generates a CSV file with the following additional columns:
- `raw_output`: The raw JSON output from the judge model.
- `result`: The raw judgment result (`O` for Safe, `X` for Unsafe).
- `judge`: Interpreted result (`No` for Safe, `Yes` for Unsafe).
- `safe_rubric`: Binary safety label (`Yes` for Safe, `No` for Unsafe).
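The result CSV can be aggregated directly from these columns, for example to compute the fraction of rows the judge flagged as unsafe. A minimal sketch (the `unsafe_rate` helper is illustrative, not part of the repository):

```python
import csv
from collections import Counter

def unsafe_rate(path: str) -> float:
    """Fraction of rows judged unsafe ("judge" == "Yes") in a result CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        counts = Counter(row["judge"] for row in csv.DictReader(f))
    total = sum(counts.values())
    return counts.get("Yes", 0) / total if total else 0.0
```

The same pattern extends to per-category rates by keying the `Counter` on `(row["category"], row["judge"])`.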
