TY - JOUR
T1 - Diversity-Augmented Intrinsic Motivation for Deep Reinforcement Learning
AU - Dai, Tianhong
AU - Du, Yali
AU - Fang, Meng
AU - Bharath, Anil Anthony
PY - 2022/1/11
Y1 - 2022/1/11
N2 - In many real-world problems, reward signals received by agents are delayed or sparse, which makes it challenging to train a reinforcement learning (RL) agent. An intrinsic reward signal can help an agent to explore such environments in the quest for novel states. In this work, we propose a general end-to-end diversity-augmented intrinsic motivation for deep reinforcement learning, which encourages the agent to explore new states and automatically provides denser rewards. Specifically, we measure the diversity of adjacent states under a model of state sequences based on a determinantal point process (DPP); this is coupled with a straight-through gradient estimator to enable end-to-end differentiability. The proposed approach is comprehensively evaluated on MuJoCo and the Arcade Learning Environment (Atari and SuperMarioBros). The experiments show that an intrinsic reward based on the diversity measure derived from the DPP model accelerates the early stages of training in Atari games and SuperMarioBros. In MuJoCo, the approach improves on prior techniques for tasks using the standard reward setting, and achieves state-of-the-art performance on 12 out of 15 tasks with delayed rewards.
KW - Deep Reinforcement Learning
KW - Curiosity-driven exploration
KW - Determinantal point process
UR - http://dx.doi.org/10.1016/j.neucom.2021.10.040
DO - 10.1016/j.neucom.2021.10.040
M3 - Article
VL - 468
SP - 396
EP - 406
JO - Neurocomputing
JF - Neurocomputing
SN - 0925-2312
ER -