This repository contains an implementation of the deep research agents from the verl-agent project.
- The core implementation of the deep research agent, which defines how the agent interacts with the environment, is located in `agent_system/environments/env_package/deepresearch`.
- The rollout logic, responsible for generating trajectories, can be found in `agent_system/multi_turn_rollout/rollout_loop.py`.
- The Reinforcement Learning (RL) logic is implemented in `verl/trainer/ppo/ray_trainer.py`.
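At a high level, the rollout logic drives the agent through repeated observe–act cycles until the environment signals completion. The sketch below is illustrative only; `DummyEnv`, `rollout`, and `policy` are placeholder names, not the actual classes or functions in this repository:

```python
# Illustrative multi-turn rollout sketch. DummyEnv and policy are
# placeholders, NOT the actual API of this repository.

class DummyEnv:
    """Toy environment: the episode ends after max_steps steps."""
    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return "initial observation"

    def step(self, action):
        self.step_count += 1
        done = self.step_count >= self.max_steps
        reward = 1.0 if done else 0.0
        return f"observation {self.step_count}", reward, done


def rollout(env, policy):
    """Collect one trajectory of (observation, action, reward) tuples."""
    trajectory = []
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)               # in practice, an LLM generation call
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


traj = rollout(DummyEnv(), policy=lambda obs: f"act on {obs}")
print(len(traj))  # 3 steps collected
```

In the real trainer, trajectories like this are collected in parallel across many environments and then fed to the GRPO update in the RL code.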
1. Create a new directory for your dataset at `agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name`.
2. Place your `train.json` and `val.json` files inside this new directory. Ensure they follow the same format as the files in the other existing dataset folders.
3. Run the following command to convert the JSON files into the Parquet format:

```bash
python examples/data_preprocess/deep_research_data_prepare.py \
    --train_json agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name/train.json \
    --val_json agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name/val.json
```
Note: The agent reads data directly from the environments (see the relevant code here). The Parquet file is used primarily to ensure data format compatibility and for global step counting within the original Verl framework.
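Before running the conversion, it can help to sanity-check that every record carries the expected fields. The snippet below is a generic sketch; the field names `question` and `answer` are assumptions for illustration, so match them to whatever schema the existing dataset folders actually use:

```python
# Minimal pre-conversion sanity check.
# NOTE: "question" and "answer" are assumed field names for illustration;
# check the existing dataset folders for the real schema.
# In practice, load records with: records = json.load(open("train.json"))
REQUIRED_FIELDS = {"question", "answer"}

def validate_records(records):
    """Return the indices of records missing any required field."""
    bad = []
    for i, record in enumerate(records):
        if not REQUIRED_FIELDS.issubset(record):
            bad.append(i)
    return bad

records = [
    {"question": "Who wrote Dune?", "answer": "Frank Herbert"},
    {"question": "Capital of France?"},  # missing "answer"
]
print(validate_records(records))  # [1]
```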
To start training, run one of the following scripts depending on your available GPU memory:

```bash
./examples/grpo_trainer/run_deepresearch.sh        # for 8 × 80GB GPUs
```

or

```bash
./examples/grpo_trainer/run_deepresearch_l40s.sh   # for 8 × 48GB GPUs
```

Before running the script, make sure to set `env.env_name` in the configuration to the `your_dataset_name` you created in the previous step.
You may also want to adjust the following parameters:
- `env.rollout.n`: The group size for GRPO.
- `env.max_steps`: The maximum number of steps for the search agent.
- `trainer.save_freq`: The step frequency for saving checkpoints.
- `trainer.test_freq`: The step frequency for performing validation.
- `trainer.total_epochs`: The total number of training epochs.
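If the entry scripts pass configuration through Hydra-style `key=value` overrides (as verl's example trainers do), these parameters can typically be adjusted inline when invoking the trainer. The fragment below is a hypothetical sketch, not a command taken from this repository's scripts, and all values are placeholders:

```shell
# Hypothetical Hydra-style overrides; values are placeholders.
# Adapt to the actual entry point used by the run_deepresearch*.sh scripts.
python3 -m verl.trainer.main_ppo \
    env.env_name=your_dataset_name \
    env.rollout.n=8 \
    env.max_steps=10 \
    trainer.save_freq=50 \
    trainer.test_freq=25 \
    trainer.total_epochs=3
```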
For users with Slurm, you can launch training with resource headers using this command (after setting the necessary configurations in the entry script under `examples/grpo_trainer/`):

```bash
./scripts/run_sbatch.sh
```

Note:

- The parameters `env.use_critique`, `env.use_dense_reward`, and `env.use_rule_reward` correspond to features that are currently under development. Please ensure they are disabled during training.
- Training may require substantial CPU resources, since multiple agent environments run in parallel during rollouts. If CPU capacity is insufficient, the program may stall. You can monitor system usage with `ray status`.
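As a quick complement to `ray status`, a stdlib-only check of the 1-minute load average can flag CPU oversubscription on the head node. This is a simple sketch (Unix-only, since it relies on `os.getloadavg`), not part of this repository:

```python
import os

def cpu_oversubscribed(threshold=1.0):
    """Return True if the 1-minute load average exceeds
    threshold x the number of CPUs. Unix-only (os.getloadavg)."""
    load_1min, _, _ = os.getloadavg()
    return load_1min > threshold * os.cpu_count()

if cpu_oversubscribed():
    print("CPU may be oversubscribed; rollouts could stall.")
else:
    print("CPU load looks OK.")
```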