verl-agent

arXiv Paper   Homepage

verl-agent is an extension of veRL, specifically designed for training large language model (LLM) agents via reinforcement learning (RL).

Unlike prior approaches that concatenate full interaction histories, verl-agent processes each step independently and is therefore highly scalable for very long-horizon, multi-turn RL training (e.g., tasks in ALFWorld can require up to 50 steps to complete).

verl-agent provides a diverse set of RL algorithms (including our new algorithm GiGPO) and a rich suite of agent environments, enabling the development of reasoning agents in both visual and text-based tasks.

News

  • [2025.5.22] Add support for RLOO.
  • [2025.5.19] Our paper has been released. See link.

Key Features

  • Multi-Turn Agent-Environment Interaction

    verl-agent supports multi-step interactive loops between agents and environments. Agents perceive environmental feedback after each step, forming the basis for reinforcement learning.

  • Scalable for Very Long-Horizon, Multi-Turn Optimization

    Prior works like RAGEN and Search-R1 concatenate the entire history of states and responses. This causes the input/output length to grow rapidly with the number of turns, making them difficult to scale to long-horizon scenarios. We implement a step-wise independent interaction paradigm that aligns with standard RL pipelines. Each step is processed individually, without concatenating the entire interaction history into a single input. This makes verl-agent highly scalable for long-horizon tasks.

  • Parallelized Gym-Style Environments and Group Environments

    verl-agent provides a gym-style interface with support for parallelized environments. This enables high-throughput rollouts, speeding up training. In addition, verl-agent introduces the concept of group environments: all environments within a group share identical initial states during reset(). This is especially useful for algorithms like GRPO and DAPO that require multiple rollouts from the same state. You can configure the number of rollouts per group via env.rollout.n in the ppo_trainer.yaml config file.

  • Diverse RL Algorithms

    verl-agent includes implementations of various RL algorithms, such as GRPO, PPO, DAPO, and our new state-of-the-art algorithm GiGPO. It also supports several variants enhanced with dynamic sampling and clip-higher techniques.

  • Rich Suite of Environments

    verl-agent offers a diverse set of interactive environments including embodied AI environments like ALFWorld, visual games such as Sokoban and Gym Cards, and digital interface control tasks like WebShop and AppWorld (experimental).

  • Vision-Language Agent Support

    Beyond text-based agents, verl-agent also supports training vision-language agents. This enables multi-modal reasoning in environments where both visual perception and language understanding are required.
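The group-environment idea above can be sketched in a few lines. This is a hypothetical illustration of the concept (the class and method names here are illustrative, not verl-agent's actual API): every environment in a group is reset with the same seed, so it starts from an identical state.

```python
import random

class GroupedEnvs:
    """Illustrative sketch: parallel envs partitioned into groups that
    share identical initial states on reset(), as needed by GRPO/GiGPO."""

    def __init__(self, make_env, group_size, num_groups):
        self.group_size = group_size
        self.envs = [make_env() for _ in range(group_size * num_groups)]

    def reset(self):
        # One seed per group; every env in a group gets the same seed,
        # so all members of a group begin from the same initial state.
        num_groups = len(self.envs) // self.group_size
        seeds = [random.randrange(2**31) for _ in range(num_groups)]
        return [env.reset(seed=seeds[i // self.group_size])
                for i, env in enumerate(self.envs)]
```

With group_size set to env.rollout.n, each group yields n rollouts from one shared starting state.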

Results

| Algorithm | Task | Model | Success Rate | Training Log | Model Checkpoint [Coming Soon] |
|---|---|---|---|---|---|
| GiGPO | ALFWorld | Qwen2.5-1.5B-Instruct | 86.1% | wandb | HF |
| GiGPO | WebShop | Qwen2.5-1.5B-Instruct | 67.4% | wandb | HF |
| GiGPO (dynamic) | WebShop | Qwen2.5-1.5B-Instruct | 75.0% | wandb | HF |
| GiGPO | Sokoban [6x6] | Qwen2.5-VL-3B-Instruct | 81.0% | wandb | HF |
| GiGPO | NumberLine | Qwen2.5-VL-3B-Instruct | 100.0% | wandb | HF |
| GiGPO | EZPoints | Qwen2.5-VL-3B-Instruct | 100.0% | wandb | HF |

Note: The W&B logs also include the training records for GRPO.

Installation

Install veRL

conda create -n verl-agent python==3.12 -y
conda activate verl-agent

pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Install FlashAttention
pip3 install flash-attn --no-build-isolation

# Install verl-agent
pip3 install -e .

# Install compatible vLLM
pip3 install vllm==0.8.2

Install Supported Environments

⚠️ Important: To run an agent in any of these environments, you must first install and configure the corresponding environment. We strongly recommend installing each environment in its own dedicated conda environment to avoid potential package version conflicts.

1. ALFWorld

Install with pip:

pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
pip install thinc==8.3.4
pip install vllm==0.8.2

Download PDDL & Game files and pre-trained MaskRCNN detector (will be stored in ~/.cache/alfworld/):

alfworld-download -f

Use --extra to download pre-trained checkpoints and seq2seq data.

Play a Textworld game:

alfworld-play-tw

2. WebShop

WebShop requires Python 3.9, so begin by creating a new verl-agent-webshop conda environment:

conda create -n verl-agent-webshop python==3.9.18 -y
conda activate verl-agent-webshop

Install WebShop

cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all

Note: If you encounter issues with gdown, you may need to visit https://drive.google.com/, retrieve your Google Drive cookie, and paste it into .cache/gdown/cookies.txt. Alternatively, download the files manually.

Verify that WebShop was installed correctly by running:

python run_web_agent_text_env.py

After WebShop is installed, return to the root directory of the repository and install the verl package in the verl-agent-webshop environment:

cd repo_root/
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2
# spacy 3.7.2 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.
# weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.

The warnings can be safely ignored.


3. Sokoban

pip install matplotlib
pip install gym==0.26.2
pip install gym_sokoban==0.0.6

4. Gym Cards

cd repo_root/
pip3 install -e ./agent_system/environments/env_package/gym_cards/gym-cards/
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0

5. APPWorld (Experimental)

Install the APPWorld package in verl-agent (some warnings may appear; they can be ignored):

cd repo_root/
cd ./agent_system/environments/env_package/appworld/appworld
pip install -e .
python -m appworld.cli install
appworld download data

cd repo_root/
appworld download data

Refresh dependencies in the verl-agent environment:

cd repo_root/
pip install -e .
pip install vllm==0.8.2

You can ignore the incompatibility warnings for appworld, because appworld does not run inside the verl-agent environment.

Create a Dedicated Conda Environment appworld for the APPWorld Server:

conda create -n appworld python=3.12 -y
conda activate appworld

cd ./agent_system/environments/env_package/appworld/appworld
pip install -e .
python -m appworld.cli install

Run Examples

RL Training

We provide out-of-the-box scripts in the "examples/" directory for training agents in different environments.

Here are some examples:

1. GiGPO

GiGPO is our novel algorithm designed to support fine-grained credit assignment in long-horizon LLM agent training. It introduces a two-level grouping mechanism:

  • Episode-level groups capture overall task success via total returns (like GRPO).
  • Step-level groups gather repeated states across trajectories to compute relative advantages for individual actions.

GiGPO is fully critic-free, maintains the same GPU memory footprint and LLM rollout cost as GRPO, yet achieves significantly better training efficiency and performance.
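The two-level grouping above can be sketched as follows. This is a simplified illustration under assumed conventions (function name, the mixing weight `w`, and the use of per-step returns are illustrative, not the repo's real implementation): episode-level advantages compare total returns within a group, while step-level advantages compare each step's return against all steps across trajectories that share the same environment state.

```python
from collections import defaultdict
from statistics import mean

def gigpo_advantages(trajectories, w=1.0):
    """trajectories: one list of (state, step_return) pairs per episode,
    all episodes rolled out from the same initial state."""
    # Episode level: compare each episode's total return to the group mean.
    totals = [sum(r for _, r in traj) for traj in trajectories]
    mu = mean(totals)
    episode_adv = [t - mu for t in totals]

    # Step level: bucket steps by identical state, compare within a bucket.
    buckets = defaultdict(list)
    for i, traj in enumerate(trajectories):
        for j, (state, r) in enumerate(traj):
            buckets[state].append((i, j, r))

    step_adv = [[0.0] * len(traj) for traj in trajectories]
    for entries in buckets.values():
        mu_s = mean(r for _, _, r in entries)
        for i, j, r in entries:
            step_adv[i][j] = r - mu_s

    # Combine both levels into one per-step advantage.
    return [[episode_adv[i] + w * step_adv[i][j] for j in range(len(traj))]
            for i, traj in enumerate(trajectories)]
```

The step-level buckets are what give fine-grained credit: two trajectories that reach the same state but act differently get directly comparable per-action signals, at no extra rollout cost.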

bash examples/gigpo_trainer/run_alfworld.sh # ALFWorld
bash examples/gigpo_trainer/run_webshop.sh # WebShop
bash examples/gigpo_trainer/run_sokoban.sh # Sokoban

2. GRPO

GRPO is a critic-free algorithm that estimates relative advantages based on a group of full episode trajectories.
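A minimal sketch of the group-relative advantage GRPO relies on (illustrative names, not verl-agent's API): each episode's total return is normalized against the mean and standard deviation of its group.

```python
from statistics import mean, pstdev

def grpo_advantages(returns, eps=1e-8):
    """Normalize each episode return against its group's statistics."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]
```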

bash examples/grpo_trainer/run_alfworld.sh # ALFWorld
bash examples/grpo_trainer/run_webshop.sh # WebShop
bash examples/grpo_trainer/run_sokoban.sh # Sokoban

3. PPO

PPO is a classic actor-critic algorithm that updates the policy using a clipped objective to ensure stable learning. It requires a separate value network (critic) to estimate state values.
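The clipped objective mentioned above can be written per token as a sketch (standard PPO formulation; the function name is illustrative): the probability ratio between the new and old policies is clipped to [1-eps, 1+eps], and the pessimistic minimum of the two surrogates is taken.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum keeps the update conservative: a large ratio cannot inflate a positive advantage beyond the clip range, while a negative advantage is never understated.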

bash examples/ppo_trainer/run_alfworld.sh # ALFWorld
bash examples/ppo_trainer/run_webshop.sh # WebShop

4. RLOO

Our RLOO implementation uses a leave-one-out baseline and the PPO-clip update (instead of the REINFORCE update), making it closer to LOOP.
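The leave-one-out estimate can be sketched like this (illustrative name, not the repo's implementation): each episode's baseline is the mean return of the other episodes in the group, which keeps the advantage estimate unbiased.

```python
def rloo_advantages(returns):
    """Leave-one-out baseline: each return minus the mean of the others."""
    n = len(returns)
    total = sum(returns)
    return [r - (total - r) / (n - 1) for r in returns]
```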

bash examples/rloo_trainer/run_alfworld.sh # ALFWorld
bash examples/rloo_trainer/run_webshop.sh # WebShop

5. DAPO

DAPO enhances GRPO with techniques like dynamic sampling and clip-higher.
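The two DAPO techniques can be sketched as follows (names and the specific eps values are illustrative): dynamic sampling discards groups whose returns are all identical, since group-normalized advantages would be zero there; clip-higher uses an asymmetric clip range with a larger upper bound so low-probability tokens can still grow.

```python
def keep_group(returns):
    """Dynamic sampling: drop groups whose returns are all identical,
    as they carry no group-relative learning signal."""
    return max(returns) != min(returns)

def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-higher: asymmetric clipping with a looser upper bound."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```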

bash examples/dapo_trainer/run_alfworld.sh # ALFWorld
bash examples/dapo_trainer/run_webshop.sh # WebShop

6. GiGPO (dynamic)

GiGPO (dynamic) augments GiGPO with dynamic sampling and clip-higher from DAPO.

bash examples/gigpo_dynamic_trainer/run_alfworld.sh # ALFWorld
bash examples/gigpo_dynamic_trainer/run_webshop.sh # WebShop
bash examples/gigpo_dynamic_trainer/run_sokoban.sh # Sokoban

Prompt-based Agent with GPT-4o

We also provide a prompt-based GPT-4o agent.

bash examples/prompt_agent/run_gpt4o_agent.sh # ALFWorld

Acknowledgement

We gratefully acknowledge the contributions of the veRL team for providing a solid RL infrastructure.

Special thanks to the RAGEN project for their codebase, which inspired early design choices during the development of verl-agent.

We also thank the developers of ALFWorld, Sokoban, Gym Cards, WebShop, and AppWorld for providing high-quality interactive environments used in this project.

Citation

If you find verl-agent useful in your research or applications, we would appreciate it if you could cite our work:

@article{feng2025group,
  title={Group-in-Group Policy Optimization for LLM Agent Training},
  author={Feng, Lang and Xue, Zhenghai and Liu, Tingcong and An, Bo},
  journal={arXiv preprint arXiv:2505.10978},
  year={2025}
}

About

verl-agent is an extension of veRL, designed for training LLM/VLM agents via RL. It is also the official codebase for the paper "Group-in-Group Policy Optimization for LLM Agent Training".
