verl-agent

arXiv Paper   Homepage

verl-agent is an extension of veRL, specifically designed for training large language model (LLM) agents via reinforcement learning (RL).

Unlike prior approaches that concatenate full interaction histories, verl-agent processes each step independently and is therefore highly scalable for very long-horizon, multi-turn RL training (e.g., tasks in ALFWorld can require up to 50 steps to complete).

verl-agent provides a diverse set of RL algorithms (including our new algorithm GiGPO) and a rich suite of agent environments, enabling the development of reasoning agents in both visual and text-based tasks.

News

  • [2025.5.22] Add support for RLOO.
  • [2025.5.19] Our paper has been released. See link.

Key Features

  • Multi-Turn Agent-Environment Interaction

    verl-agent supports multi-step interactive loops between agents and environments. Agents perceive environmental feedback after each step, forming the basis for reinforcement learning.

  • Scalable for Very Long-Horizon, Multi-Turn Optimization

    Prior works like RAGEN and Search-R1 concatenate the entire history of states and responses. This causes the input/output length to grow rapidly with the number of turns, making them difficult to scale to long-horizon scenarios. We implement a step-wise independent interaction paradigm that aligns with standard RL pipelines. Each step is processed individually, without concatenating the entire interaction history into a single input. This makes verl-agent highly scalable for long-horizon tasks.

  • Parallelized Gym-Style Environments and Group Environments

    verl-agent provides a gym-style interface with support for parallelized environments. This enables high-throughput rollouts, speeding up training. In addition, verl-agent introduces the concept of group environments: all environments within a group share identical initial states during reset(). This is especially useful for algorithms like GRPO and DAPO that require multiple rollouts from the same state. You can configure the number of rollouts per group via env.rollout.n in the ppo_trainer.yaml config file.

  • Diverse RL Algorithms

    verl-agent includes implementations of various RL algorithms, such as GRPO, PPO, DAPO, and our new state-of-the-art algorithm GiGPO. It also supports several variants enhanced with dynamic sampling and clip-higher techniques.

  • Rich Suite of Environments

    verl-agent offers a diverse set of interactive environments including embodied AI environments like ALFWorld, visual games such as Sokoban and Gym Cards, and digital interface control tasks like WebShop and AppWorld (experimental).

  • Vision-Language Agent Support

    Beyond text-based agents, verl-agent also supports training vision-language agents. This enables multi-modal reasoning in environments where both visual perception and language understanding are required.
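The group-environment idea above can be sketched in a few lines. This is a hypothetical illustration of the concept (the class and method names here are illustrative, not verl-agent's actual API): every environment in a group is reset with the same seed, so it starts from an identical state.

```python
import random

class GroupedEnvs:
    """Illustrative sketch: parallel envs partitioned into groups that
    share identical initial states on reset(), as needed by GRPO/GiGPO."""

    def __init__(self, make_env, group_size, num_groups):
        self.group_size = group_size
        self.envs = [make_env() for _ in range(group_size * num_groups)]

    def reset(self):
        # One seed per group; every env in a group gets the same seed,
        # so all members of a group begin from the same initial state.
        num_groups = len(self.envs) // self.group_size
        seeds = [random.randrange(2**31) for _ in range(num_groups)]
        return [env.reset(seed=seeds[i // self.group_size])
                for i, env in enumerate(self.envs)]
```

With group_size set to env.rollout.n, each group yields n rollouts from one shared starting state.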

Results

| Algorithm | Task | Model | Success Rate | Training Log | Model Checkpoint [Coming Soon] |
|---|---|---|---|---|---|
| GiGPO | ALFWorld | Qwen2.5-1.5B-Instruct | 86.1% | wandb | HF |
| GiGPO | WebShop | Qwen2.5-1.5B-Instruct | 67.4% | wandb | HF |
| GiGPO (dynamic) | WebShop | Qwen2.5-1.5B-Instruct | 75.0% | wandb | HF |
| GiGPO | Sokoban [6x6] | Qwen2.5-VL-3B-Instruct | 81.0% | wandb | HF |
| GiGPO | NumberLine | Qwen2.5-VL-3B-Instruct | 100.0% | wandb | HF |
| GiGPO | EZPoints | Qwen2.5-VL-3B-Instruct | 100.0% | wandb | HF |

Note: The W&B logs also include the training records for GRPO.

Installation

Install veRL

conda create -n verl-agent python==3.12 -y
conda activate verl-agent

pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
# Install FlashAttention
pip3 install flash-attn --no-build-isolation

# Install verl-agent
pip3 install -e .

# Install compatible vLLM
pip3 install vllm==0.8.2

Install Supported Environments

⚠️ Important: To run an agent in any of these environments, you must first install and configure the corresponding environment. We strongly recommend installing each environment in its own dedicated conda environment to avoid potential package version conflicts.

1. ALFWorld

Install with pip:

pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
pip install thinc==8.3.4
pip install vllm==0.8.2

Download PDDL & Game files and pre-trained MaskRCNN detector (will be stored in ~/.cache/alfworld/):

alfworld-download -f

Use --extra to download pre-trained checkpoints and seq2seq data.

Play a Textworld game:

alfworld-play-tw

2. WebShop

WebShop requires Python 3.9, so begin by creating a new verl-agent-webshop conda environment:

conda create -n verl-agent-webshop python==3.9.18 -y
conda activate verl-agent-webshop

Install WebShop

cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all

Note: If you encounter issues with gdown, you may need to visit https://drive.google.com/, retrieve your Google Drive cookie, and paste it into .cache/gdown/cookies.txt. Alternatively, download the files manually.

Verify that WebShop was installed correctly by running:

python run_web_agent_text_env.py

After WebShop is installed, return to the root directory of the repository and install the verl package in the verl-agent-webshop environment:

cd repo_root/
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2
# spacy 3.7.2 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.
# weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.

The warnings can be safely ignored.


3. Sokoban

pip install matplotlib
pip install gym==0.26.2
pip install gym_sokoban==0.0.6

4. Gym Cards

cd repo_root/
pip3 install -e ./agent_system/environments/env_package/gym_cards/gym-cards/
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0

5. APPWorld (Experimental)

Install the APPWorld package in verl-agent (some warnings may appear; they can be ignored):

cd repo_root/
cd ./agent_system/environments/env_package/appworld/appworld
pip install -e .
python -m appworld.cli install
appworld download data

cd repo_root/
appworld download data

Refresh dependencies in the verl-agent environment:

cd repo_root/
pip install -e .
pip install vllm==0.8.2

You can ignore the incompatibility warnings for appworld, because appworld does not run inside the verl-agent environment.

Create a Dedicated Conda Environment appworld for the APPWorld Server:

conda create -n appworld python=3.12 -y
conda activate appworld

cd ./agent_system/environments/env_package/appworld/appworld
pip install -e .
python -m appworld.cli install

Run Examples

RL Training

We provide out-of-the-box scripts in the "examples/" directory for training agents in different environments.

Here are some examples:

1. GiGPO

GiGPO is our novel algorithm designed to support fine-grained credit assignment in long-horizon LLM agent training. It introduces a two-level grouping mechanism:

  • Episode-level groups capture overall task success via total returns (like GRPO).
  • Step-level groups gather repeated states across trajectories to compute relative advantages for individual actions.

GiGPO is fully critic-free, maintains the same GPU memory footprint and LLM rollout cost as GRPO, yet achieves significantly better training efficiency and performance.
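The two-level grouping above can be sketched as follows. This is a simplified illustration under assumed conventions (function name, the mixing weight `w`, and the use of per-step returns are illustrative, not the repo's real implementation): episode-level advantages compare total returns within a group, while step-level advantages compare each step's return against all steps across trajectories that share the same environment state.

```python
from collections import defaultdict
from statistics import mean

def gigpo_advantages(trajectories, w=1.0):
    """trajectories: one list of (state, step_return) pairs per episode,
    all episodes rolled out from the same initial state."""
    # Episode level: compare each episode's total return to the group mean.
    totals = [sum(r for _, r in traj) for traj in trajectories]
    mu = mean(totals)
    episode_adv = [t - mu for t in totals]

    # Step level: bucket steps by identical state, compare within a bucket.
    buckets = defaultdict(list)
    for i, traj in enumerate(trajectories):
        for j, (state, r) in enumerate(traj):
            buckets[state].append((i, j, r))

    step_adv = [[0.0] * len(traj) for traj in trajectories]
    for entries in buckets.values():
        mu_s = mean(r for _, _, r in entries)
        for i, j, r in entries:
            step_adv[i][j] = r - mu_s

    # Combine both levels into one per-step advantage.
    return [[episode_adv[i] + w * step_adv[i][j] for j in range(len(traj))]
            for i, traj in enumerate(trajectories)]
```

The step-level buckets are what give fine-grained credit: two trajectories that reach the same state but act differently get directly comparable per-action signals, at no extra rollout cost.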

bash examples/gigpo_trainer/run_alfworld.sh # ALFWorld
bash examples/gigpo_trainer/run_webshop.sh # WebShop
bash examples/gigpo_trainer/run_sokoban.sh # Sokoban

2. GRPO

GRPO is a critic-free algorithm that estimates relative advantages based on a group of full episode trajectories.
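A minimal sketch of the group-relative advantage GRPO relies on (illustrative names, not verl-agent's API): each episode's total return is normalized against the mean and standard deviation of its group.

```python
from statistics import mean, pstdev

def grpo_advantages(returns, eps=1e-8):
    """Normalize each episode return against its group's statistics."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(r - mu) / (sigma + eps) for r in returns]
```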

bash examples/grpo_trainer/run_alfworld.sh # ALFWorld
bash examples/grpo_trainer/run_webshop.sh # WebShop
bash examples/grpo_trainer/run_sokoban.sh # Sokoban

3. PPO

PPO is a classic actor-critic algorithm that updates the policy using a clipped objective to ensure stable learning. It requires a separate value network (critic) to estimate state values.
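The clipped objective mentioned above can be written per token as a sketch (standard PPO formulation; the function name is illustrative): the probability ratio between the new and old policies is clipped to [1-eps, 1+eps], and the pessimistic minimum of the two surrogates is taken.

```python
import math

def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Per-token PPO clipped surrogate (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * advantage, clipped * advantage)
```

Taking the minimum keeps the update conservative: a large ratio cannot inflate a positive advantage beyond the clip range, while a negative advantage is never understated.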

bash examples/ppo_trainer/run_alfworld.sh # ALFWorld
bash examples/ppo_trainer/run_webshop.sh # WebShop

4. RLOO

Our RLOO implementation uses a leave-one-out baseline and the PPO-clip update (instead of the REINFORCE update), making it closer to LOOP.
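The leave-one-out estimate can be sketched like this (illustrative name, not the repo's implementation): each episode's baseline is the mean return of the other episodes in the group, which keeps the advantage estimate unbiased.

```python
def rloo_advantages(returns):
    """Leave-one-out baseline: each return minus the mean of the others."""
    n = len(returns)
    total = sum(returns)
    return [r - (total - r) / (n - 1) for r in returns]
```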

bash examples/rloo_trainer/run_alfworld.sh # ALFWorld
bash examples/rloo_trainer/run_webshop.sh # WebShop

5. DAPO

DAPO enhances GRPO with techniques like dynamic sampling and clip-higher.
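The two DAPO techniques can be sketched as follows (names and the specific eps values are illustrative): dynamic sampling discards groups whose returns are all identical, since group-normalized advantages would be zero there; clip-higher uses an asymmetric clip range with a larger upper bound so low-probability tokens can still grow.

```python
def keep_group(returns):
    """Dynamic sampling: drop groups whose returns are all identical,
    as they carry no group-relative learning signal."""
    return max(returns) != min(returns)

def clip_higher_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clip-higher: asymmetric clipping with a looser upper bound."""
    clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```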

bash examples/dapo_trainer/run_alfworld.sh # ALFWorld
bash examples/dapo_trainer/run_webshop.sh # WebShop

6. GiGPO (dynamic)

GiGPO (dynamic) augments GiGPO with dynamic sampling and clip-higher from DAPO.

bash examples/gigpo_dynamic_trainer/run_alfworld.sh # ALFWorld
bash examples/gigpo_dynamic_trainer/run_webshop.sh # WebShop
bash examples/gigpo_dynamic_trainer/run_sokoban.sh # Sokoban

Prompt-based Agent with GPT-4o

We also provide a prompt-based GPT-4o agent.

bash examples/prompt_agent/run_gpt4o_agent.sh # ALFWorld

Acknowledgement

We gratefully acknowledge the contributions of the veRL team for providing a solid RL infrastructure.

Special thanks to the RAGEN project for their codebase, which inspired early design choices during the development of verl-agent.

We also thank the developers of ALFWorld, Sokoban, Gym Cards, WebShop, and AppWorld for providing high-quality interactive environments used in this project.

Citation

If you find verl-agent useful in your research or applications, we would appreciate it if you could cite our work:

@article{feng2025group,
  title={Group-in-Group Policy Optimization for LLM Agent Training},
  author={Feng, Lang and Xue, Zhenghai and Liu, Tingcong and An, Bo},
  journal={arXiv preprint arXiv:2505.10978},
  year={2025}
}

About

verl-agent is an extension of veRL, designed for training LLM/VLM agents via RL. It is also the official codebase for the paper "Group-in-Group Policy Optimization for LLM Agent Training".
