This repository contains an implementation of the deep research agents from the verl-agent project.
- The core implementation of the deep research agent, which defines how the agent interacts with the environment, is located in `agent_system/environments/env_package/deepresearch`.
- The rollout logic, responsible for generating trajectories, can be found in `agent_system/multi_turn_rollout/rollout_loop.py`.
- The Reinforcement Learning (RL) logic is implemented in `verl/trainer/ppo/ray_trainer.py`.
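At a high level, the rollout logic drives the agent through repeated observe–act cycles until the environment signals completion. The sketch below is illustrative only; `DummyEnv`, `rollout`, and `policy` are placeholder names, not the actual classes or functions in this repository:

```python
# Illustrative multi-turn rollout sketch. DummyEnv and policy are
# placeholders, NOT the actual API of this repository.

class DummyEnv:
    """Toy environment: the episode ends after max_steps steps."""
    def __init__(self, max_steps=3):
        self.max_steps = max_steps
        self.step_count = 0

    def reset(self):
        self.step_count = 0
        return "initial observation"

    def step(self, action):
        self.step_count += 1
        done = self.step_count >= self.max_steps
        reward = 1.0 if done else 0.0
        return f"observation {self.step_count}", reward, done


def rollout(env, policy):
    """Collect one trajectory of (observation, action, reward) tuples."""
    trajectory = []
    obs = env.reset()
    done = False
    while not done:
        action = policy(obs)               # in practice, an LLM generation call
        next_obs, reward, done = env.step(action)
        trajectory.append((obs, action, reward))
        obs = next_obs
    return trajectory


traj = rollout(DummyEnv(), policy=lambda obs: f"act on {obs}")
print(len(traj))  # 3 steps collected
```

In the real trainer, trajectories like this are collected in parallel across many environments and then fed to the GRPO update in the RL code.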
1. Create a new directory for your dataset at `agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name`.
2. Place your `train.json` and `val.json` files inside this new directory. Ensure they follow the same format as the files in the other existing dataset folders.
3. Run the following command to convert the JSON files into the Parquet format:

```bash
python examples/data_preprocess/deep_research_data_prepare.py \
    --train_json agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name/train.json \
    --val_json agent_system/environments/env_package/deepresearch/deepresearch/data/your_dataset_name/val.json
```
Note: The agent reads data directly from the environments (see the relevant code here). The Parquet file is used primarily to ensure data format compatibility and for global step counting within the original Verl framework.
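Before running the conversion, it can help to sanity-check that every record carries the expected fields. The snippet below is a generic sketch; the field names `question` and `answer` are assumptions for illustration, so match them to whatever schema the existing dataset folders actually use:

```python
# Minimal pre-conversion sanity check.
# NOTE: "question" and "answer" are assumed field names for illustration;
# check the existing dataset folders for the real schema.
# In practice, load records with: records = json.load(open("train.json"))
REQUIRED_FIELDS = {"question", "answer"}

def validate_records(records):
    """Return the indices of records missing any required field."""
    bad = []
    for i, record in enumerate(records):
        if not REQUIRED_FIELDS.issubset(record):
            bad.append(i)
    return bad

records = [
    {"question": "Who wrote Dune?", "answer": "Frank Herbert"},
    {"question": "Capital of France?"},  # missing "answer"
]
print(validate_records(records))  # [1]
```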
To start training, run one of the following scripts depending on your available GPU memory:

```bash
./examples/grpo_trainer/run_deepresearch.sh        # for 8 × 80GB GPUs
```

or

```bash
./examples/grpo_trainer/run_deepresearch_l40s.sh   # for 8 × 48GB GPUs
```

Before running the script, make sure to set `env.env_name` in the configuration to the `your_dataset_name` you created in the previous step.
You may also want to adjust the following parameters:
- `env.rollout.n`: The group size for GRPO.
- `env.max_steps`: The maximum number of steps for the search agent.
- `trainer.save_freq`: The step frequency for saving checkpoints.
- `trainer.test_freq`: The step frequency for performing validation.
- `trainer.total_epochs`: The total number of training epochs.
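If the entry scripts pass configuration through Hydra-style `key=value` overrides (as verl's example trainers do), these parameters can typically be adjusted inline when invoking the trainer. The fragment below is a hypothetical sketch, not a command taken from this repository's scripts, and all values are placeholders:

```shell
# Hypothetical Hydra-style overrides; values are placeholders.
# Adapt to the actual entry point used by the run_deepresearch*.sh scripts.
python3 -m verl.trainer.main_ppo \
    env.env_name=your_dataset_name \
    env.rollout.n=8 \
    env.max_steps=10 \
    trainer.save_freq=50 \
    trainer.test_freq=25 \
    trainer.total_epochs=3
```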
For users with Slurm, you can launch training with resource headers using this command (after setting the necessary configurations in the entry script under `examples/grpo_trainer/`):

```bash
./scripts/run_sbatch.sh
```

Note:

- The parameters `env.use_critique`, `env.use_dense_reward`, and `env.use_rule_reward` correspond to features that are currently under development. Please ensure they are disabled during training.
- Training may require substantial CPU resources, since multiple agent environments run in parallel during rollouts. If CPU capacity is insufficient, the program may stall. You can monitor system usage with `ray status`.
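As a quick complement to `ray status`, a stdlib-only check of the 1-minute load average can flag CPU oversubscription on the head node. This is a simple sketch (Unix-only, since it relies on `os.getloadavg`), not part of this repository:

```python
import os

def cpu_oversubscribed(threshold=1.0):
    """Return True if the 1-minute load average exceeds
    threshold x the number of CPUs. Unix-only (os.getloadavg)."""
    load_1min, _, _ = os.getloadavg()
    return load_1min > threshold * os.cpu_count()

if cpu_oversubscribed():
    print("CPU may be oversubscribed; rollouts could stall.")
else:
    print("CPU load looks OK.")
```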