Abstract:
Robust Markov Decision Processes (MDPs) have gained increasing attention as a way to learn policies that are less sensitive to changes in the environment. While there has been growing interest in analyzing the sample efficiency of robust MDPs, most existing work has focused on the model-based regime, in which the transition probabilities must be estimated explicitly; this is computationally intensive and requires O(|S|^2 |A|) memory. Designing a model-free method for robust MDPs, however, is difficult, and the main barrier is the nonlinearity of the robust optimization operator.
In this talk, we will first review background on robust MDPs. We will then discuss how to construct a 'good' estimator for a nonlinear functional of an expectation operator, and introduce two prior works that use this estimator to solve robust MDPs. Building on this, we will also introduce an alternative formulation of robust MDPs that preserves robustness and is easier to solve with sample-efficient methods.