Guiding Offline Reinforcement Learning Using Safety Expert

Published in "CODS-COMAD '24: Proceedings of the 7th Joint International Conference on Data Science & Management of Data (11th ACM IKDD CODS and 29th COMAD), January 2024, Pages 82–90"
Richa Verma, Kartik Bharadwaj, Harshad Khadilkar and Balaraman Ravindran

Offline reinforcement learning is used to train policies in situations where it is expensive or infeasible to access the environment during training. An agent trained in such a setting receives no corrective feedback once the learned policy starts diverging and may fall prey to the overestimation bias commonly seen in offline RL. This increases the chances of the agent choosing potentially unsafe actions, especially in states that are under-represented in the training dataset. In this paper, we explore the problem of acting safely in sparsely observed regions of the state space. We propose leveraging a safety expert to nudge an offline RL agent towards choosing safe actions in such under-represented states. The proposed framework transfers the safety expert’s knowledge into the offline setting for states with high uncertainty, preventing catastrophic failures in safety-critical domains. We use a simple but effective approach to quantify state uncertainty based on how frequently a state appears in the training dataset. In states with high uncertainty, the offline RL agent mimics the safety expert; otherwise, it maximizes the long-term reward. Our approach is plug-and-play: any existing value-based or actor-critic style offline RL algorithm can be guided by a safety expert. Finally, we show that such guided offline RL algorithms can outperform their state-of-the-art counterparts, reducing the chance of taking unsafe actions while retaining competitive performance.
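To make the guidance rule concrete, the sketch below shows one minimal, count-based way such a wrapper could work: states that rarely appear in the offline dataset are treated as uncertain and deferred to the safety expert, while well-covered states use the offline RL policy. The class and method names (SafetyGuidedPolicy, act, discretize) and the count threshold are illustrative assumptions, not the paper's actual implementation.

```python
from collections import Counter

class SafetyGuidedPolicy:
    """Hypothetical wrapper around an offline RL policy and a safety expert.

    In states that are under-represented in the offline dataset (high
    uncertainty), the wrapper defers to the safety expert; otherwise it
    returns the offline RL policy's reward-maximizing action.
    """

    def __init__(self, offline_policy, safety_expert, dataset_states,
                 discretize, count_threshold=10):
        # `discretize` maps a raw state to a hashable key (e.g., a rounded tuple).
        # `offline_policy` and `safety_expert` are assumed to expose an `act(state)` method.
        self.offline_policy = offline_policy
        self.safety_expert = safety_expert
        self.discretize = discretize
        self.count_threshold = count_threshold
        # Count how often each (discretized) state appears in the offline dataset.
        self.state_counts = Counter(discretize(s) for s in dataset_states)

    def uncertainty(self, state):
        # Simple count-based proxy: rarely visited states are more uncertain.
        return 1.0 / (1.0 + self.state_counts[self.discretize(state)])

    def act(self, state):
        if self.state_counts[self.discretize(state)] < self.count_threshold:
            # Sparse coverage: mimic the safety expert to avoid unsafe actions.
            return self.safety_expert.act(state)
        # Sufficient coverage: trust the offline RL policy's learned action.
        return self.offline_policy.act(state)
```

At deployment, the wrapper is queried like any other policy (e.g., `action = guided.act(state)`), which reflects the plug-and-play claim: the underlying value-based or actor-critic offline RL algorithm is untouched, and only action selection is gated by the uncertainty check.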