Machine Learning and Data Science PhD Student Forum Series (Session 38): A Statistical Perspective on Off-policy Evaluation in Reinforcement Learning

Abstract:
 
Off-policy evaluation (OPE) is one of the most important tasks in offline reinforcement learning. In online reinforcement learning, the agent interacts directly with the environment and receives rewards immediately. OPE, by contrast, assumes only a fixed dataset of trajectories generated in advance by an unknown behavior policy. OPE has a purely statistical formulation; however, most existing work focuses on point estimation and lacks statistical interpretation and theoretical guarantees, which may impede its use in fields that demand high precision.

In this talk, we aim to give a statistical understanding of recent OPE algorithms. In particular, we will review three popular methods for OPE, namely the direct method (DM), importance sampling (IS), and the doubly robust (DR) estimator, and discuss their close relationship with the estimation of the average treatment effect (ATE) in the causal inference literature. These estimators rely on two nuisance components that must themselves be estimated, and we will present how to estimate them with theoretical guarantees. Finally, we will give a selective introduction to recent progress on OPE problems, including remedies for different data assumptions as well as new combinations with traditional statistical approaches.
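
To make the three estimators concrete, the following is a minimal sketch in Python, not the speaker's implementation. It assumes a simplified one-step (contextual bandit) setting rather than full trajectories, and it assumes the two nuisance components are the behavior policy probabilities and a reward model, both supplied as already-estimated arrays. The function name `ope_estimates` and all argument names are hypothetical.

```python
import numpy as np

def ope_estimates(actions, rewards, behavior_probs, target_probs, q_hat):
    """One-step OPE sketch (assumed setting: contextual bandit, k actions).

    actions        : (n,) int array of logged actions
    rewards        : (n,) float array of observed rewards
    behavior_probs : (n, k) behavior policy probabilities (nuisance 1, assumed given)
    target_probs   : (n, k) target policy probabilities
    q_hat          : (n, k) estimated reward model (nuisance 2, assumed given)
    """
    n = len(actions)
    idx = np.arange(n)

    # Importance weight of each logged action: target / behavior probability.
    w = target_probs[idx, actions] / behavior_probs[idx, actions]

    # DM: average the reward model over the target policy's action distribution.
    dm_terms = (target_probs * q_hat).sum(axis=1)
    dm = dm_terms.mean()

    # IS: reweight observed rewards by the importance weights.
    is_est = (w * rewards).mean()

    # DR: DM baseline plus an importance-weighted correction of the model residual.
    dr = (dm_terms + w * (rewards - q_hat[idx, actions])).mean()

    return {"dm": dm, "is": is_est, "dr": dr}


# Toy data: uniform behavior policy, target policy prefers action 0.
actions = np.array([0, 1, 0, 1])
rewards = np.array([1.0, 0.0, 1.0, 0.0])
behavior = np.full((4, 2), 0.5)
target = np.tile([0.8, 0.2], (4, 1))

no_model = ope_estimates(actions, rewards, behavior, target, np.zeros((4, 2)))
exact_model = ope_estimates(actions, rewards, behavior, target,
                            np.tile([1.0, 0.0], (4, 1)))
```

In this toy data, setting the reward model to zero makes the DR estimate coincide with IS, while an exact reward model makes the correction term vanish so that DR coincides with DM. This mirrors the structure of doubly robust ATE estimators, where the outcome model and the propensity score play the roles of the two nuisance components.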