verl-agent is an extension of veRL, specifically designed for training large language model (LLM) agents via reinforcement learning (RL).
Unlike prior approaches that concatenate full interaction histories, verl-agent processes each step independently and is therefore highly scalable for very long-horizon, multi-turn RL training (e.g., tasks in ALFWorld can require up to 50 steps to complete).
verl-agent provides a diverse set of RL algorithms (including our new algorithm GiGPO) and a rich suite of agent environments, enabling the development of reasoning agents in both visual and text-based tasks.
- [2025.5.22] Add support for RLOO.
- [2025.5.19] Our paper has been released. See link.
- **Multi-Turn Agent-Environment Interaction**: verl-agent supports multi-step interactive loops between agents and environments. Agents perceive environmental feedback after each step, forming the basis for reinforcement learning.
- **Scalable for Very Long-Horizon, Multi-Turn Optimization**: Prior works like RAGEN and Search-R1 concatenate the entire history of states and responses. This causes the input/output length to grow rapidly with the number of turns, making them difficult to scale to long-horizon scenarios. verl-agent instead implements a step-wise independent interaction paradigm that aligns with standard RL pipelines: each step is processed individually, without concatenating the entire interaction history into a single input. This makes verl-agent highly scalable for long-horizon tasks.
- **Parallelized Gym-Style Environments and Group Environments**: verl-agent provides a gym-style interface with support for parallelized environments, enabling high-throughput rollouts that speed up training. In addition, verl-agent introduces the concept of group environments: all environments within a group share identical initial states during `reset()`. This is especially useful for algorithms like GRPO and DAPO that require multiple rollouts on the same state. You can configure the number of rollouts per group via `env.rollout.n` in the ppo_trainer.yaml config file (see the sketch after this list).
- **Diverse RL Algorithms**: verl-agent includes implementations of various RL algorithms, such as GRPO, PPO, DAPO, and our new state-of-the-art algorithm GiGPO. It also supports several variants enhanced with dynamic sampling and clip-higher techniques.
- **Rich Suite of Environments**: verl-agent offers a diverse set of interactive environments, including embodied AI environments like ALFWorld, visual games such as Sokoban and Gym Cards, and digital interface control tasks like WebShop and AppWorld (experimental).
- **Vision-Language Agent Support**: Beyond text-based agents, verl-agent also supports training vision-language agents, enabling multi-modal reasoning in environments where both visual perception and language understanding are required.
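To make the step-wise paradigm and group environments concrete, here is a rough, runnable Python sketch. Everything in it (the `DummyGroupEnv` class and the loop structure) is an illustration under our own assumptions, not the actual verl-agent API:

```python
import random

class DummyGroupEnv:
    """Toy stand-in for a group of parallel environments: every env in the
    group is reset to the SAME initial state, mirroring what verl-agent's
    group environments do in reset(). Purely illustrative."""
    def __init__(self, group_size):
        self.group_size = group_size

    def reset(self):
        self.seed = random.randint(0, 10**6)          # shared initial state
        return [f"state-{self.seed}"] * self.group_size

    def step(self, actions):
        obs = [f"state-{self.seed}-after-{a}" for a in actions]
        rewards = [random.random() for _ in actions]  # dummy rewards
        dones = [False] * self.group_size
        return obs, rewards, dones, [{}] * self.group_size

envs = DummyGroupEnv(group_size=4)  # group size would come from env.rollout.n
obs_batch = envs.reset()
for step in range(3):
    # Step-wise paradigm: each prompt is built from the CURRENT observation
    # only, so input length stays bounded no matter how many turns pass.
    prompts = [f"Observation: {obs}\nNext action:" for obs in obs_batch]
    actions = [f"act-{step}" for _ in prompts]        # stand-in for LLM output
    obs_batch, rewards, dones, infos = envs.step(actions)
    if all(dones):
        break
```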
Note: The W&B logs also include the training records for GRPO.
```bash
conda create -n verl-agent python==3.12 -y
conda activate verl-agent

pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124

# Install FlashAttention
pip3 install flash-attn --no-build-isolation

# Install verl-agent
pip3 install -e .

# Install compatible vLLM
pip3 install vllm==0.8.2
```
⚠️ Important: To run an agent in any of these environments, you must first install and configure the corresponding environment. We strongly recommend installing each environment in its own dedicated conda environment to avoid potential package version conflicts.
Install with pip:

```bash
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
pip install alfworld
pip install thinc==8.3.4
pip install vllm==0.8.2
```

Download the PDDL & Game files and the pre-trained MaskRCNN detector (stored in `~/.cache/alfworld/`):

```bash
alfworld-download -f
```

Use `--extra` to download pre-trained checkpoints and seq2seq data.

Play a TextWorld game:

```bash
alfworld-play-tw
```

WebShop requires Python 3.9, so begin by creating a new `verl-agent-webshop` environment:
```bash
conda create -n verl-agent-webshop python==3.9.18 -y
conda activate verl-agent-webshop
```

Install WebShop:

```bash
cd ./agent_system/environments/env_package/webshop/webshop
./setup.sh -d all
```

Note: If you encounter issues with gdown, you may need to visit https://drive.google.com/, get your Google Drive cookie, and paste it into `.cache/gdown/cookies.txt`. Alternatively, you may need to download the files manually.

Verify that WebShop was installed correctly by running:

```bash
python run_web_agent_text_env.py
```

After WebShop is installed, return to the root directory of the repository and install the verl package in verl-agent:
```bash
cd repo_root/
pip3 install torch==2.6.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
pip3 install -e .
pip3 install vllm==0.8.2
# spacy 3.7.2 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.
# weasel 0.3.4 requires typer<0.10.0,>=0.3.0, but you have typer 0.15.2 which is incompatible.
```

These warnings can be safely ignored.
```bash
pip install matplotlib
pip install gym==0.26.2
pip install gym_sokoban==0.0.6
```

For Gym Cards:

```bash
cd repo_root/
pip3 install -e ./agent_system/environments/env_package/gym_cards/gym-cards/
pip3 install gymnasium==0.29.1
pip3 install stable-baselines3==2.6.0
```

Install the APPWorld package in verl-agent (some warnings may be raised; you can ignore them):
```bash
cd repo_root/
cd ./agent_system/environments/env_package/appworld/appworld
pip install -e .
python -m appworld.cli install
appworld download data
cd repo_root/
appworld download data
```

Refresh dependencies in the verl-agent environment:
```bash
cd repo_root/
pip install -e .
pip install vllm==0.8.2
```

You can ignore the incompatibility warnings about appworld, because we don't run appworld in the verl-agent environment.
Create a dedicated conda environment `appworld` for the APPWorld server:

```bash
conda create -n appworld python=3.12 -y
conda activate appworld
cd ./agent_system/environments/env_package/appworld/appworld
pip install -e .
python -m appworld.cli install
```

We provide out-of-the-box scripts in the `examples/` directory for training agents in different environments.
Here are some examples:
GiGPO is our novel algorithm designed to support fine-grained credit assignment in long-horizon LLM agent training. It introduces a two-level grouping mechanism:
- Episode-level groups capture overall task success via total returns (like GRPO).
- Step-level groups gather repeated states across trajectories to compute relative advantages for individual actions.
GiGPO is fully critic-free and maintains the same GPU memory footprint and LLM rollout cost as GRPO, yet achieves significantly better training efficiency and performance.
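For intuition, here is a hedged sketch of the two-level advantage computation (simplified; the function name and data layout are our own illustration, not the verl-agent source):

```python
import numpy as np

def gigpo_advantages(episode_returns, step_groups):
    """Illustrative sketch of GiGPO's two-level grouping; see the paper
    for the exact formulation."""
    # Episode-level grouping (GRPO-style): normalize each trajectory's
    # total return against the other trajectories in its group.
    R = np.asarray(episode_returns, dtype=float)
    episode_adv = (R - R.mean()) / (R.std() + 1e-8)

    # Step-level grouping: actions taken from the same repeated state
    # across trajectories are compared against one another.
    step_adv = {}
    for state_key, action_returns in step_groups.items():
        G = np.asarray(action_returns, dtype=float)
        step_adv[state_key] = (G - G.mean()) / (G.std() + 1e-8)

    # The two levels are then combined into a per-action advantage
    # (e.g., episode-level plus a weighted step-level term).
    return episode_adv, step_adv
```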
```bash
bash examples/gigpo_trainer/run_alfworld.sh  # ALFWorld
bash examples/gigpo_trainer/run_webshop.sh   # WebShop
bash examples/gigpo_trainer/run_sokoban.sh   # Sokoban
```

GRPO is a critic-free algorithm that estimates relative advantages based on a group of full episode trajectories.
```bash
bash examples/grpo_trainer/run_alfworld.sh  # ALFWorld
bash examples/grpo_trainer/run_webshop.sh   # WebShop
bash examples/grpo_trainer/run_sokoban.sh   # Sokoban
```

PPO is a classic actor-critic algorithm that updates the policy using a clipped objective to ensure stable learning. It requires a separate value network (critic) to estimate state values.
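For reference, PPO's standard clipped objective, with probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_\text{old}}(a_t \mid s_t)$ and advantage estimate $\hat{A}_t$:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\,\hat{A}_t,\ \text{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\right)\right]$$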
```bash
bash examples/ppo_trainer/run_alfworld.sh  # ALFWorld
bash examples/ppo_trainer/run_webshop.sh   # WebShop
```

RLOO: our implementation uses a leave-one-out advantage estimate and the PPO-clip update (instead of the REINFORCE update), making it closer to LOOP.
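A minimal sketch of the leave-one-out baseline (the function name is ours, for illustration only):

```python
import numpy as np

def leave_one_out_advantages(returns):
    """Baseline each trajectory's return against the mean return of the
    OTHER trajectories in its group (illustrative sketch)."""
    R = np.asarray(returns, dtype=float)
    n = len(R)
    baseline = (R.sum() - R) / (n - 1)  # mean of the remaining n-1 returns
    return R - baseline                 # == (n / (n - 1)) * (R - R.mean())
```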
```bash
bash examples/rloo_trainer/run_alfworld.sh  # ALFWorld
bash examples/rloo_trainer/run_webshop.sh   # WebShop
```

DAPO enhances GRPO with techniques like dynamic sampling and clip-higher.
```bash
bash examples/dapo_trainer/run_alfworld.sh  # ALFWorld
bash examples/dapo_trainer/run_webshop.sh   # WebShop
```

This GiGPO variant uses dynamic sampling and clip-higher from DAPO:
```bash
bash examples/gigpo_dynamic_trainer/run_alfworld.sh  # ALFWorld
bash examples/gigpo_dynamic_trainer/run_webshop.sh   # WebShop
bash examples/gigpo_dynamic_trainer/run_sokoban.sh   # Sokoban
```

We also provide a prompt-based GPT-4o agent:
```bash
bash examples/prompt_agent/run_gpt4o_agent.sh  # ALFWorld
```

We gratefully acknowledge the contributions of the veRL team for providing a solid RL infrastructure.
Special thanks to the RAGEN project for their codebase, which inspired early design choices during the development of verl-agent.
We also thank the developers of ALFWorld, Sokoban, Gym Cards, WebShop, and AppWorld for providing high-quality interactive environments used in this project.
If you find verl-agent useful in your research or applications, we would appreciate it if you could cite our work:
@article{feng2025group,
title={Group-in-Group Policy Optimization for LLM Agent Training},
author={Feng, Lang and Xue, Zhenghai and Liu, Tingcong and An, Bo},
journal={arXiv preprint arXiv:2505.10978},
year={2025}
}