selectstar-ai/CAGE-paper

🕋 CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation

This is the official repository of CAGE: A Framework for Culturally Adaptive Red-Teaming Benchmark Generation (ICLR 2026).

This repository contains the evaluation framework for assessing the safety of LLM prompts and responses against predefined rubrics. It supports both Prompt Safety Evaluation (checking whether a prompt itself is harmful) and Response Safety Evaluation (checking whether a model's response to a prompt is harmful) using GPT-4.1-based judges with language-specific rubrics for English and Korean.

FULL DATA (KorSET): LINK

Directory Structure

.
├── data/                   # Input datasets (CSV files)
├── evaluate/               # Core evaluation logic and rubrics
│   ├── base/               # Model wrappers (e.g., OpenAI GPT)
│   ├── template/           # Safety rubrics and prompt templates
│   │   ├── en/             # English rubrics
│   │   └── ko/             # Korean rubrics
│   └── utils/              # Utility scripts (Logger, etc.)
├── logs/                   # Execution logs (automatically created)
├── run/                    # Execution scripts
│   └── safety_judge.py     # Main entry point for evaluation
└── requirements.txt        # Python dependencies

1. Installation

Follow the steps below to set up the environment.

Create Environment

Using Conda:

conda create -n safebench python=3.10 -y
conda activate safebench

Or using venv:

python3.10 -m venv safebench_env
source safebench_env/bin/activate

Install Packages

pip install -r requirements.txt

2. Usage

The main script is run/safety_judge.py. It evaluates datasets via the OpenAI API, using GPT-4.1 as the LLM safety judge by default.

Arguments

| Argument | Flag | Description | Default |
| --- | --- | --- | --- |
| Input File | `--input`, `-i` | Path to the input CSV file. | – |
| Output Dir | `--outdir`, `-d` | Directory to save results. | – |
| Output File | `--outfile`, `-o` | Name of the result CSV file. | – |
| Mode | `--mode`, `-m` | Evaluation mode: `prompt` or `response`. | – |
| Language | `--lang`, `-l` | Language of the rubric: `en` or `ko`. | – |
| API Key | `--api_key`, `-a` | Your OpenAI API key. | – |
| Model | `--model` | Judge model name. | `gpt-4.1` |
| Prompt Col | `--prompt_col`, `-pc` | Column name for the prompt. | `seed` |
| Response Col | `--response_col`, `-rc` | Column name for the response (response mode only). | `response` |
| Category Col | `--category_col`, `-cc` | Column name for the category. | `category` |

Examples

1. Prompt Safety Evaluation (Korean)

Evaluates whether the input prompts themselves are harmful.

python run/safety_judge.py \
    --input data/sample_data.csv \
    --outdir results \
    --outfile result_prompt_ko.csv \
    --mode prompt \
    --lang ko \
    --api_key sk-proj-... 

2. Response Safety Evaluation (English)

Evaluates whether the model's response to a prompt is harmful.

python run/safety_judge.py \
    --input data/sample_data.csv \
    --outdir results \
    --outfile result_response_en.csv \
    --mode response \
    --lang en \
    --api_key sk-proj-... 
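The two invocations above differ only in --mode and --lang. To sweep all four mode/language combinations, a small wrapper can assemble the command lines; this is a convenience sketch using the flags documented above, not a script shipped with the repository:

```python
# Sketch of a batch runner that sweeps every (mode, lang) combination.
# It only assembles the command lines documented above; uncomment the
# subprocess call to actually execute them.
import itertools
import subprocess  # used once the run line is uncommented

def build_command(input_csv: str, outdir: str, mode: str, lang: str, api_key: str) -> list[str]:
    """Build the safety_judge.py invocation for one run configuration."""
    assert mode in {"prompt", "response"} and lang in {"en", "ko"}
    return [
        "python", "run/safety_judge.py",
        "--input", input_csv,
        "--outdir", outdir,
        "--outfile", f"result_{mode}_{lang}.csv",
        "--mode", mode,
        "--lang", lang,
        "--api_key", api_key,
    ]

for mode, lang in itertools.product(["prompt", "response"], ["en", "ko"]):
    cmd = build_command("data/sample_data.csv", "results", mode, lang, "sk-proj-...")
    # subprocess.run(cmd, check=True)
```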

3. Output

The script generates a CSV file with the following additional columns:

  • raw_output: The raw JSON output from the judge model.
  • result: The raw judgment result (O for Safe, X for Unsafe).
  • judge: Interpreted result (No for Safe, Yes for Unsafe).
  • safe_rubric: Binary safety label (Yes for Safe, No for Unsafe).
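For downstream analysis, the mapping from the raw O/X result to the derived judge and safe_rubric labels can be reproduced in a few lines; this is a post-processing sketch over the output columns listed above, not code from the repository:

```python
# Sketch of the O/X -> label mapping described above, plus a helper to
# compute the unsafe rate over a result CSV.
import csv

def interpret(result: str) -> dict:
    """Map the raw judgment (O = Safe, X = Unsafe) to the derived columns."""
    safe = result.strip() == "O"
    return {
        "judge": "No" if safe else "Yes",        # "Is it unsafe?"
        "safe_rubric": "Yes" if safe else "No",  # "Is it safe?"
    }

def unsafe_rate(path: str) -> float:
    """Fraction of rows the judge marked Unsafe (X) in a result CSV."""
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return sum(r["result"].strip() == "X" for r in rows) / len(rows)
```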
