--Dr. Gugan Thoppe--
Reinforcement learning (RL) is a machine learning technique in which an agent learns from an interactive environment by trial and error; it has applications in sectors such as self-driving cars, marketing, robotics, gaming, and manufacturing. However, RL can be difficult to apply in situations where properties of the objective function such as linearity, convexity, or differentiability are absent or hard to verify. This is where evolution-strategy RL can help, although sample efficiency remains a major challenge for it. Dr. Gugan Thoppe, Assistant Professor at the Department of Computer Science and Automation, Indian Institute of Science (IISc), discussed his research on improving sample efficiency in the RBCDSAI seminar entitled “Improving Sample Efficiency in Evolutionary RL using Off-policy Ranking”, held on 19th April 2022 between 4 and 5 PM.
Dr. Thoppe commenced his talk with a one-slide summary in which he defined reinforcement learning as a model of learning in an interactive environment by trial and error, and said their work would interest the practical RL community, whose goal is to learn in an environment with as few interactions as possible. He emphasized that fewer interactions translate into good sample efficiency, which evolution-strategy algorithms can achieve, and that his team has developed a novel technique to further improve the sample efficiency of such algorithms. He added that they demonstrated the efficacy of their approach using simulations in MuJoCo, a standard testing environment. Their method is essentially an off-policy variant of the established Augmented Random Search (ARS) method. They found that although their variant and the original ARS have similar running times, the variant requires about 30% less data for learning, i.e., it provides better sample efficiency.
Next, he explained MuJoCo (Multi-Joint dynamics with Contact), a general-purpose physics engine that provides a simulation environment for tasks such as Ant, Hopper, HalfCheetah, Walker, Humanoid, etc., and is used to compare algorithms when theoretical results are not available. He then showed a performance comparison of ARS, TRES (Trust Region Evolution Strategies), and their algorithm (off-policy ARS) on different locomotion tasks (HalfCheetah, Hopper, Walker, Ant) and found that their method outperformed ARS and TRES in terms of the sample complexity needed for learning on HalfCheetah, Hopper, and Walker. On Ant, however, it gets stuck in a local maximum and is unable to escape, so its performance on that locomotion task is poor. They also compared these algorithms on the Swimmer and Humanoid tasks and found that their algorithm outperforms the others on Humanoid and was also overall better than the other two versions of ARS on Swimmer.
Next, he gave an example of how evolution strategies were used earlier in reinforcement learning by citing the work of Salimans et al. (2017). He also discussed the work of Mania et al. (2018) on ARS, the simplest ES method, which suffices to work with linear policies, requires less CPU time, and needs an order of magnitude fewer interactions than methods such as TRPO on MuJoCo benchmarks. He then pointed out that ARS is not optimal from the number-of-interactions perspective, which is why his team worked on making ARS better in terms of sample efficiency. He further explained that sample efficiency is a critical aspect of practical RL settings such as robotics, where sample collection is expensive: a robot can often fail and requires active calibration, maintenance, and safety checks, so an algorithm that can learn with fewer environment interactions is the goal of researchers. He then said that evolution strategies are iterative optimization methods inspired by natural evolution; in the context of RL, their goal is to find an optimal policy, so each candidate solution is a candidate policy.
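The idea of evolution strategies as iterative optimization over candidate policies can be sketched as follows. This is a minimal toy reconstruction, not the speaker's code: a parameter vector plays the role of a policy, and the hypothetical `episodic_return` function stands in for the return of one rollout in the environment.

```python
import random

def episodic_return(theta):
    # Toy stand-in for an RL rollout; the true optimum is theta = (1, 2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] - 2.0) ** 2)

def evolution_strategy(theta, step=0.1, noise=0.5, pop=50, iters=300, seed=0):
    rng = random.Random(seed)
    d = len(theta)
    for _ in range(iters):
        grad = [0.0] * d
        for _ in range(pop):
            # Antithetic perturbations: evaluate theta +/- noise * eps.
            eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
            r_plus = episodic_return([t + noise * e for t, e in zip(theta, eps)])
            r_minus = episodic_return([t - noise * e for t, e in zip(theta, eps)])
            for i in range(d):
                grad[i] += (r_plus - r_minus) * eps[i]
        # Move along the estimated ascent direction of the expected return.
        theta = [t + step * g / (2 * noise * pop) for t, g in zip(theta, grad)]
    return theta

theta = evolution_strategy([0.0, 0.0])
```

Each iteration perturbs the current candidate policy in random directions, compares the returns of the perturbed candidates, and moves toward the better ones, with no gradient of the objective ever computed analytically.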
Next, he described the evolution strategy behind Augmented Random Search, a variation of basic random search with some tweaks, and the idea of off-policy ranking. He showed simulation results demonstrating that their algorithm is able to learn with less data than ARS on HalfCheetah.
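The ARS-style update that the talk builds on can be sketched as below. This is a hedged toy reconstruction, not the speaker's implementation: directions are sampled antithetically, ranked by the better of their two returns, only the top-b directions are kept, and the step is normalized by the standard deviation of the rewards used; the speaker's variant replaces this on-policy ranking with an off-policy one. The `toy_return` objective is a hypothetical stand-in for an episodic rollout.

```python
import random
import statistics

def toy_return(theta):
    # Stand-in for an episodic rollout; the true optimum is theta = (1, 2).
    return -((theta[0] - 1.0) ** 2 + (theta[1] - 2.0) ** 2)

def ars_step(theta, rng, step=0.05, noise=0.3, n_dirs=16, top_b=8):
    scored = []
    for _ in range(n_dirs):
        delta = [rng.gauss(0.0, 1.0) for _ in theta]
        r_plus = toy_return([t + noise * d for t, d in zip(theta, delta)])
        r_minus = toy_return([t - noise * d for t, d in zip(theta, delta)])
        scored.append((max(r_plus, r_minus), r_plus, r_minus, delta))
    # Rank directions by their best observed reward and keep the top b.
    scored.sort(key=lambda s: s[0], reverse=True)
    top = scored[:top_b]
    rewards = [r for _, rp, rm, _ in top for r in (rp, rm)]
    sigma_r = statistics.pstdev(rewards) or 1.0  # reward-std normalization
    scale = step / (top_b * sigma_r)
    return [t + scale * sum((rp - rm) * d[i] for _, rp, rm, d in top)
            for i, t in enumerate(theta)]

rng = random.Random(0)
theta = [0.0, 0.0]
for _ in range(500):
    theta = ars_step(theta, rng)
```

The ranking step is what the off-policy idea targets: if the relative ordering of directions can be estimated from previously collected rollouts rather than fresh ones, fewer environment interactions are needed per update.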
He concluded the talk by listing future directions of work, which included evaluating the performance of off-policy ARS in practical RL and extending the off-policy ranking idea to other evolution-strategy methods, including hybrid algorithms such as deep RL + evolution strategies. The talk was well received by the audience, and a question-and-answer round followed.
The video is available on our YouTube channel: Link.
Reinforcement Learning, Augmented Random Search, Optimization