Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics
We present Bootstrapped Dual Policy Iteration (BDPI), an actor-critic reinforcement learning algorithm designed for high sample-efficiency and high-quality exploration. Unlike conventional actor-critic algorithms, BDPI's actor is robust to off-policy critics, which allows state-of-the-art critics to be used. Such strong critics, combined with a good actor, are what give BDPI its high sample-efficiency.
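The core idea can be illustrated with a minimal tabular sketch (my own construction for illustration, not the authors' code): an ensemble of critics is trained off-policy with Q-Learning, while the actor is nudged toward the greedy action of the averaged critics, so the actor never needs the critics to evaluate its own policy. The toy chain environment, the hyper-parameters, and the specific update rules below are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS, N_CRITICS = 5, 2, 4
GAMMA, ALPHA, LAM, EPS = 0.9, 0.2, 0.1, 0.2

def step(s, a):
    """Toy chain MDP (illustrative only): action 1 moves right and pays 1,
    action 0 resets to state 0 and pays nothing."""
    if a == 1:
        return min(s + 1, N_STATES - 1), 1.0
    return 0, 0.0

critics = [np.zeros((N_STATES, N_ACTIONS)) for _ in range(N_CRITICS)]
actor = np.full((N_STATES, N_ACTIONS), 1.0 / N_ACTIONS)  # stochastic policy

s = 0
for _ in range(20000):
    # Act with the actor (plus a little epsilon exploration).
    if rng.random() < EPS:
        a = int(rng.integers(N_ACTIONS))
    else:
        a = int(rng.choice(N_ACTIONS, p=actor[s]))
    s2, r = step(s, a)

    # Train one randomly-chosen critic with Q-Learning: the max over
    # next-state actions makes this update off-policy, independent of
    # the actor's current behavior.
    Q = critics[rng.integers(N_CRITICS)]
    Q[s, a] += ALPHA * (r + GAMMA * Q[s2].max() - Q[s, a])

    # Move the actor a small step toward the greedy action of the
    # averaged critics, in the style of Conservative Policy Iteration.
    greedy = int(np.mean(critics, axis=0)[s].argmax())
    actor[s] = (1.0 - LAM) * actor[s] + LAM * np.eye(N_ACTIONS)[greedy]
    s = s2
```

After training, the actor prefers the rewarding "move right" action in every state, even though the critics were never trained on-policy; this decoupling is what lets an actor of this kind tolerate aggressive off-policy critics.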
PhD student specializing in sample-efficient Reinforcement Learning and applications of Reinforcement Learning to real-world tasks. Denis holds a Master's degree in Computer Science (Artificial Intelligence option), and started his PhD in Reinforcement Learning in 2016.