arxiv:2103.00107

Revisiting Peng's Q(λ) for Modern Reinforcement Learning

Published on Feb 27, 2021

AI-generated summary

Peng's Q(λ) converges to an optimal policy when the behavior policy slowly tracks a greedy policy; together with strong results in complex continuous control tasks, this shows the algorithm is both theoretically sound and practically effective.

Abstract

Off-policy multi-step reinforcement learning algorithms fall into two classes: conservative algorithms, which actively cut traces, and non-conservative algorithms, which do not. Munos et al. (2016) proved that conservative algorithms converge to an optimal Q-function. Non-conservative algorithms, in contrast, are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by these empirical results and the lack of theory, we carry out a theoretical analysis of Peng's Q(λ), a representative non-conservative algorithm. We prove that it, too, converges to an optimal policy, provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result had been conjectured but never proven. We also experiment with Peng's Q(λ) in complex continuous control tasks, confirming that it often outperforms conservative algorithms despite its simplicity. These results indicate that Peng's Q(λ), long thought to be unsafe, is a theoretically sound and practically effective algorithm.
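To make the contrast concrete: Peng's Q(λ) mixes n-step returns along the behavior trajectory without ever cutting the trace. Below is a minimal NumPy sketch of the standard Peng's Q(λ) backward recursion for computing update targets in a tabular setting. The function name and array layout are illustrative choices, not taken from the paper, whose experiments use deep actor-critic variants rather than this form.

```python
import numpy as np

def pengs_q_lambda_targets(rewards, max_q_next, gamma=0.99, lam=0.9):
    """Backward recursion for the Peng's Q(lambda) return along one trajectory.

    rewards[t]    : reward r_t received at step t (t = 0 .. T-1)
    max_q_next[t] : max_a Q(s_{t+1}, a), the bootstrap value after step t
                    (use 0.0 for a terminal state)

    The lambda-return satisfies
        G_t = r_t + gamma * ((1 - lam) * max_a Q(s_{t+1}, a) + lam * G_{t+1}),
    and, crucially, the trace is never cut: every step of the behavior
    trajectory contributes, no matter how its action was chosen.
    """
    T = len(rewards)
    targets = np.empty(T)
    g = max_q_next[-1]  # G_T bootstraps from the value of the final state
    for t in reversed(range(T)):
        g = rewards[t] + gamma * ((1.0 - lam) * max_q_next[t] + lam * g)
        targets[t] = g  # regression target for Q(s_t, a_t)
    return targets
```

Because the recursion applies no importance-sampling or trace-cutting correction (compare a conservative algorithm such as Retrace, which scales each step by min(1, π/μ)), the targets are biased off-policy; the paper's convergence result shows this is harmless when the behavior policy slowly tracks the greedy policy.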
