Mitigating Preference Hacking in Policy Optimization with Pessimism.

Dhawal Gupta

Dhawal Gupta Adam Fisch Christoph Dann Alekh Agarwal Mitigating Preference Hacking in Policy Optimization with Pessimism. 2025 March abs/2503.06810 CoRR https://doi.org/10.48550/arXiv.2503.06810 db/journals/corr/corr2503.html#abs-2503-06810 streams/journals/corr

Alex Ayoub David Szepesvari Francesco Zanini Bryan Chan Dhawal Gupta Bruno Castro da Silva Dale Schuurmans Mitigating the Curse of Horizon in Monte-Carlo Returns. 563-572 2024 RLJ 2 db/journals/rlj/rlj2.html#AyoubSZCGSS24 https://rlj.cs.umass.edu/2024/papers/Paper80.html

Kartik Choudhary Dhawal Gupta Philip S. Thomas ICU-Sepsis: A Benchmark MDP Built from Real Medical Data. 1546-1566 2024 RLJ 4 db/journals/rlj/rlj4.html#ChoudharyGT24 https://rlj.cs.umass.edu/2024/papers/Paper194.html

Mehwash Weqar Shabana Mehfuz Dhawal Gupta Shabana Urooj Adaptive Switching Based Data-Communication Model for Internet of Healthcare Things Networks. 11530-11548 2024 12 IEEE Access https://doi.org/10.1109/ACCESS.2024.3354722 https://www.wikidata.org/entity/Q130050779 db/journals/access/access12.html#WeqarMGU24

Dhawal Gupta Scott M. Jordan Shreyas Chaudhari Bo Liu 0006 Philip S. Thomas Bruno Castro da Silva From Past to Future: Rethinking Eligibility Traces. 12253-12260 2024 AAAI https://doi.org/10.1609/aaai.v38i11.29115 conf/aaai/2024 db/conf/aaai/aaai2024.html#GuptaJCLTS24

Mehwash Weqar Shabana Mehfuz Dhawal Gupta Authentication in IoT Networks via Machine Learning and Deep Learning: A Review. 1-6 2024 ICCCNT https://doi.org/10.1109/ICCCNT61001.2024.10724010 conf/icccnt/2024 db/conf/icccnt/icccnt2024.html#WeqarMG24

Kartik Choudhary Dhawal Gupta Philip S. Thomas ICU-Sepsis: A Benchmark MDP Built from Real Medical Data. 2024 abs/2406.05646 CoRR https://doi.org/10.48550/arXiv.2406.05646 db/journals/corr/corr2406.html#abs-2406-05646

Erfan Entezami Mahsa Sahebdel Dhawal Gupta A Safe Exploration Strategy for Model-free Task Adaptation in Safety-constrained Grid Environments. 2024 abs/2408.00997 CoRR https://doi.org/10.48550/arXiv.2408.00997 db/journals/corr/corr2408.html#abs-2408-00997 streams/journals/corr

Yinlam Chow Aza Tulepbergenov Ofir Nachum Dhawal Gupta Moonkyung Ryu Mohammad Ghavamzadeh Craig Boutilier A Mixture-of-Expert Approach to RL-based Dialogue Management. 2023 ICLR https://openreview.net/forum?id=4FBUihxz5nm conf/iclr/2023 db/conf/iclr/iclr2023.html#ChowTNGRGB23

Dhawal Gupta Yash Chandak Scott M. Jordan Philip S. Thomas Bruno C. da Silva 0001 Behavior Alignment via Reward Function Optimization. 2023 NeurIPS http://papers.nips.cc/paper_files/paper/2023/hash/a5357781c204d4412e44ed9cbcdb08d5-Abstract-Conference.html conf/nips/2023 db/conf/nips/neurips2023.html#GuptaCJT023

Dhawal Gupta Yinlam Chow Azamat Tulepbergenov Mohammad Ghavamzadeh Craig Boutilier Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management. 2023 NeurIPS http://papers.nips.cc/paper_files/paper/2023/hash/12bcf58a1c09a0fcb5310f3589291ab4-Abstract-Conference.html conf/nips/2023 db/conf/nips/neurips2023.html#GuptaCTGB23

Dhawal Gupta Yinlam Chow Mohammad Ghavamzadeh Craig Boutilier Offline Reinforcement Learning for Mixture-of-Expert Dialogue Management. 2023 abs/2302.10850 CoRR https://doi.org/10.48550/arXiv.2302.10850 db/journals/corr/corr2302.html#abs-2302-10850

James E. Kostas Scott M. Jordan Yash Chandak Georgios Theocharous Dhawal Gupta Martha White Bruno Castro da Silva Philip S. Thomas Coagent Networks: Generalized and Scaled. 2023 abs/2305.09838 CoRR https://doi.org/10.48550/arXiv.2305.09838 db/journals/corr/corr2305.html#abs-2305-09838

Simeng Sun Dhawal Gupta Mohit Iyyer Exploring the impact of low-rank adaptation on the performance, efficiency, and regularization of RLHF. 2023 abs/2309.09055 CoRR https://doi.org/10.48550/arXiv.2309.09055 db/journals/corr/corr2309.html#abs-2309-09055

Dhawal Gupta Yash Chandak Scott M. Jordan Philip S. Thomas Bruno Castro da Silva Behavior Alignment via Reward Function Optimization. 2023 abs/2310.19007 CoRR https://doi.org/10.48550/arXiv.2310.19007 db/journals/corr/corr2310.html#abs-2310-19007

Dhawal Gupta Scott M. Jordan Shreyas Chaudhari Bo Liu 0006 Philip S. Thomas Bruno Castro da Silva From Past to Future: Rethinking Eligibility Traces. 2023 abs/2312.12972 CoRR https://doi.org/10.48550/arXiv.2312.12972 db/journals/corr/corr2312.html#abs-2312-12972

Tulika Saha Dhawal Gupta Sriparna Saha 0001 Pushpak Bhattacharyya Emotion Aided Dialogue Act Classification for Task-Independent Conversations in a Multi-modal Framework. 277-289 2021 13 Cogn. Comput. 2 https://doi.org/10.1007/s12559-019-09704-5 https://www.wikidata.org/entity/Q126301585 db/journals/cogcom/cogcom13.html#SahaGSB21

Tulika Saha Dhawal Gupta Sriparna Saha 0001 Pushpak Bhattacharyya A hierarchical approach for efficient multi-intent dialogue policy learning. 35025-35050 2021 80 Multim. Tools Appl. 28-29 https://doi.org/10.1007/s11042-020-09070-7 db/journals/mta/mta80.html#SahaGSB21

Tulika Saha Dhawal Gupta Sriparna Saha 0001 Pushpak Bhattacharyya A Unified Dialogue Management Strategy for Multi-intent Dialogue Conversations in Multiple Languages. 99:1-99:22 2021 20 ACM Trans. Asian Low Resour. Lang. Inf. Process. 6 https://doi.org/10.1145/3461763 db/journals/talip/talip20.html#SahaGSB21

Dhawal Gupta Gabor Mihucz Matthew Schlegel James E. Kostas Philip S. Thomas Martha White Structural Credit Assignment in Neural Networks using Reinforcement Learning. 30257-30270 2021 NeurIPS https://proceedings.neurips.cc/paper/2021/hash/fe1f9c70bdf347497e1a01b6c486bdb9-Abstract.html conf/nips/2021 db/conf/nips/neurips2021.html#GuptaMSKTW21

Tulika Saha Dhawal Gupta Sriparna Saha 0001 Pushpak Bhattacharyya Towards integrated dialogue policy learning for multiple domains and intents using Hierarchical Deep Reinforcement Learning. 113650 2020 162 Expert Syst. Appl. https://doi.org/10.1016/j.eswa.2020.113650 db/journals/eswa/eswa162.html#SahaGSB20

Sina Ghiassian Andrew Patterson Shivam Garg 0006 Dhawal Gupta Adam White 0001 Martha White Gradient Temporal-Difference Learning with Regularized Corrections. 3524-3534 2020 ICML http://proceedings.mlr.press/v119/ghiassian20a.html conf/icml/2020 db/conf/icml/icml2020.html#GhiassianP0GWW20

Sina Ghiassian Andrew Patterson Shivam Garg 0006 Dhawal Gupta Adam White 0001 Martha White Gradient Temporal-Difference Learning with Regularized Corrections. 2020 abs/2007.00611 CoRR https://arxiv.org/abs/2007.00611 db/journals/corr/corr2007.html#abs-2007-00611

Tulika Saha Dhawal Gupta Sriparna Saha 0001 Pushpak Bhattacharyya Reinforcement Learning Based Dialogue Management Strategy. 359-372 2018 ICONIP (3) https://doi.org/10.1007/978-3-030-04182-3_32 conf/iconip/2018-3 db/conf/iconip/iconip2018-3.html#SahaGSB18