sid-in-the-loop/verl-agent

Verl-agent-deepresearch

This repository contains an implementation of the deep research agents from the verl-agent project.


Overview

  • The core implementation of the deep research agent, which defines how the agent interacts with the environment, is located in agent_system/environments/env_package/deepresearch.

  • The rollout logic, responsible for generating trajectories, can be found in agent_system/multi_turn_rollout/rollout_loop.py.

  • The Reinforcement Learning (RL) logic is implemented in verl/trainer/ppo/ray_trainer.py.


How to Train the Agent

Data Preparation

  1. Create a new directory for your dataset at agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name.

  2. Place your train.json and val.json files inside this new directory. Ensure they follow the same format as the files in the other existing dataset folders.

  3. Run the following command to convert the JSON files into the Parquet format:

    python examples/data_preprocess/deep_research_data_prepare.py \
        --train_json agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name/train.json \
        --val_json agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name/val.json
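Before converting, it can help to sanity-check that each split is well-formed JSON. A minimal sketch, assuming each split is a non-empty JSON array of record objects (the `question`/`answer` fields in the sample are illustrative only; the real schema should mirror the existing dataset folders):

```python
import json

def validate_split(records):
    """Sanity-check one dataset split before Parquet conversion.

    Assumes a split is a non-empty JSON array of objects; the exact
    keys should match the existing dataset folders.
    """
    if not isinstance(records, list) or not records:
        raise ValueError("split must be a non-empty JSON array")
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            raise ValueError(f"record {i} is not a JSON object")
    return len(records)

# Example with an inline sample (hypothetical fields):
sample = json.loads('[{"question": "example question", "answer": "example answer"}]')
print(validate_split(sample))
```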

Note: The agent reads data directly from the environments (see the environment implementation in agent_system/environments/env_package/deepresearch). The Parquet files are used primarily to ensure data-format compatibility with the original veRL framework and for global step counting.

Start Training

To start training, run one of the following scripts depending on your available GPU memory:

./examples/grpo_trainer/run_deepresearch.sh # for 8 × 80GB GPUs

or

./examples/grpo_trainer/run_deepresearch_l40s.sh # for 8 × 48GB GPUs

Before running a script, set env.env_name in its configuration to the your_dataset_name directory you created in the previous step.

You may also want to adjust the following parameters:

  • env.rollout.n: The group size for GRPO.

  • env.max_steps: The maximum number of steps for the search agent.

  • trainer.save_freq: The step frequency for saving checkpoints.

  • trainer.test_freq: The step frequency for performing validation.

  • trainer.total_epochs: The total number of training epochs.
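If the entry scripts follow the usual veRL/Hydra key=value override convention, the parameters above can be adjusted directly on the trainer's launch line inside the run script. A sketch only (the entry module and all values below are assumptions and placeholders, not a verified invocation):

```shell
# Hypothetical override block for the launch command in
# examples/grpo_trainer/run_deepresearch.sh (values are examples only):
python3 -m verl.trainer.main_ppo \
    env.env_name=your_dataset_name \
    env.rollout.n=8 \          # GRPO group size
    env.max_steps=10 \         # max steps for the search agent
    trainer.save_freq=50 \     # checkpoint every 50 steps
    trainer.test_freq=25 \     # validate every 25 steps
    trainer.total_epochs=3     # total training epochs
```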

If you use Slurm, you can launch training with the appropriate resource headers via this command (after setting the necessary configurations in the entry script under examples/grpo_trainer/):

./scripts/run_sbatch.sh

Note:

  1. The parameters env.use_critique, env.use_dense_reward, and env.use_rule_reward correspond to features that are currently under development. Please ensure they are disabled during training.
  2. Training may require substantial CPU resources, since multiple agent environments run in parallel during rollouts. If CPU capacity is insufficient, the program may stall. You can monitor system usage with ray status.
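Assuming the same key=value override style, the notes above might translate to something like the following (a sketch; the override names come from the note, but the invocation itself is not verified):

```shell
# Keep the in-development features disabled during training, e.g.:
#   env.use_critique=False env.use_dense_reward=False env.use_rule_reward=False

# While training runs, monitor cluster resource usage (CPU, memory, workers)
# from another shell to catch rollout stalls early:
ray status
```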

About

verl-agent is an extension of veRL, designed for training LLM/VLM agents via RL. verl-agent is also the official code for the paper "Group-in-Group Policy Optimization for LLM Agent Training".
