Abstract:
Policy gradient methods and their variants, which aim to find near-optimal policies via gradient-type methods, are an attractive class of RL algorithms as they are applicable to various kinds of policy parameterizations. Despite the empirical success, theoretical underpinnings remain limited until recent years due to the intrinsic non-concavity underlying the value maximization problem. Thanks to the performance difference lemma, the global convergence of policy gradient methods can be established. Furthermore, some variants exhibit linear or even superlinear convergence rates.
In this talk, we will consider infinite-horizon, discounted Markov decision processes (MDPs) with finite states and actions, and focus on the convergence rates of policy gradient type methods with either direct or softmax parameterization (the tabular case). Throughout the talk, we assume exact gradients are accessible.