Publications

When Do Off-Policy and On-Policy Policy Gradient Methods Align?

Davide Mambelli, Stephan Bongers, Onno Zoeter, Matthijs T. J. Spaan, and Frans A. Oliehoek. When Do Off-Policy and On-Policy Policy Gradient Methods Align? arXiv:2402.12034, 2024.

Abstract

Policy gradient methods are widely adopted reinforcement learning algorithms for tasks with continuous action spaces. These methods have succeeded in many application domains; however, because of their notorious sample inefficiency, their use remains limited to problems where fast and accurate simulations are available. A common way to improve sample efficiency is to modify the objective function so that it can be computed from off-policy samples without importance sampling. A well-established off-policy objective is the excursion objective. This work studies the difference between the excursion objective and the traditional on-policy objective, which we refer to as the on-off gap. We provide the first theoretical analysis showing conditions under which the on-off gap can be reduced, and we establish empirical evidence of the shortfalls that arise when these conditions are not met.
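
For reference, the two objectives compared in the abstract are commonly written as follows. This is a minimal sketch in the standard notation of the off-policy actor-critic literature; the paper's own definitions and assumptions may differ.

  % On-policy objective: states drawn from the target policy's own state distribution
  J_{\text{on}}(\theta)  = \mathbb{E}_{s \sim d^{\pi_\theta}} \left[ V^{\pi_\theta}(s) \right]

  % Excursion objective: states drawn from the behavior policy's state distribution
  J_{\text{exc}}(\theta) = \mathbb{E}_{s \sim d^{\mu}} \left[ V^{\pi_\theta}(s) \right]

Here d^{\pi_\theta} and d^{\mu} denote the state distributions induced by the target policy \pi_\theta and the behavior policy \mu, respectively. The on-off gap refers to the discrepancy between these two objectives (or between their gradients); it vanishes when d^{\mu} coincides with d^{\pi_\theta}, since the two expressions then become identical.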

BibTeX Entry

@Misc{Mambelli24arxiv,
  author =       {Davide Mambelli and Stephan Bongers and Onno Zoeter
                  and Matthijs T. J. Spaan and Frans A. Oliehoek},
  title =        {When Do Off-Policy and On-Policy Policy Gradient
                  Methods Align?},
  howpublished = {arXiv:2402.12034},
  year =         2024
}

Note: This material is presented to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author's copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
