There has been considerable interest across several fields in methods that reduce the problem of learning good treatment assignment policies to the problem of accurate policy evaluation. Given a class of candidate policies, these methods in effect first evaluate each candidate policy individually, and then learn a policy by optimizing the estimated value function; such approaches are guaranteed to be risk-consistent whenever the policy value estimates are uniformly consistent. However, despite the wealth of proposed methods, the literature remains largely silent on questions of statistical efficiency: there are only limited results characterizing which policy evaluation strategies lead to better learned policies than others, or what the optimal policy evaluation strategies are. In this paper, we build on classical results in semiparametric efficiency theory to develop quasi-optimal methods for policy learning; in particular, we propose a class of policy value estimators that, when optimized, yield regret bounds for the learned policy that scale with the semiparametric efficient variance for policy evaluation. On a practical level, our result suggests new methods for policy learning motivated by semiparametric efficiency theory.
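To make the abstract's "risk-consistent whenever the policy value estimates are uniformly consistent" claim concrete, here is the standard one-step argument in notation assumed for this note (the symbols $V$, $\hat{V}$, $\Pi$, and $R$ are not defined in the source itself):

```latex
% Policy value over a candidate class \Pi, and regret of a learned policy \hat\pi:
V(\pi) = \mathbb{E}\bigl[ Y\bigl(\pi(X)\bigr) \bigr], \qquad
R(\hat\pi) = \sup_{\pi \in \Pi} V(\pi) - V(\hat\pi).

% If \hat\pi maximizes an estimate \hat V over \Pi, then for any \pi^* \in \Pi
% we have \hat V(\pi^*) \le \hat V(\hat\pi), and so
V(\pi^*) - V(\hat\pi)
  \le \bigl( V(\pi^*) - \hat V(\pi^*) \bigr)
    + \bigl( \hat V(\hat\pi) - V(\hat\pi) \bigr)
  \le 2 \sup_{\pi \in \Pi} \bigl| \hat V(\pi) - V(\pi) \bigr|.

% Taking the supremum over \pi^* gives
R(\hat\pi) \le 2 \sup_{\pi \in \Pi} \bigl| \hat V(\pi) - V(\pi) \bigr|,
% so uniform consistency of \hat V over \Pi implies R(\hat\pi) \to 0.
```

This crude $2\sup$ bound is exactly what the abstract argues is too blunt: it cannot distinguish between different uniformly consistent evaluation strategies, which motivates the paper's sharper, efficiency-based regret bounds.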
The learning objective throughout is regret minimization: the learned policy should come as close as possible, in value, to the best policy in the candidate class.
<quote> A treatment assignment policy is a mapping from characteristics of units (patients or customers) to which of a set of treatments the customer should receive. Recently, new datasets have become available to researchers and practitioners that make it possible to estimate personalized policies in settings ranging from personalized offers and marketing in a digital environment to online education. In addition, technology companies, educational institutions, and researchers have begun to use explicit randomization with the goal of creating data that can be used to estimate personalized policies.</quote>

<quote> In this paper, we showed how classical concepts from the literature on semiparametric efficiency can be used to develop performant algorithms for policy learning with strong asymptotic guarantees. Our regret bounds may prove to be particularly relevant in applications since, unlike existing bounds, they are sharp enough to distinguish between different a priori reasonable policy learning schemes (e.g., ones based on inverse-propensity weighting versus double machine learning), and thus provide methodological guidance to practitioners. More generally, our experience shows that results on semiparametrically efficient estimation are not just useful for statistical inference, but are also directly relevant to applied decision making problems. It will be interesting to see whether related insights will prove to be more broadly helpful for, e.g., sequential problems with contextual bandits, or non-discrete decision making problems involving, say, price setting or capacity allocation.</quote>
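The second quote contrasts policy learning based on inverse-propensity weighting (IPW) with learning based on doubly robust (AIPW) scores of the kind used in double machine learning. The sketch below illustrates the two approaches on simulated data. Everything here is an illustrative assumption, not the paper's method: the data-generating process, the single-threshold policy class, and the use of oracle nuisance values (in practice the propensity and outcome models would be fitted on held-out folds, i.e. cross-fitted).

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated observational data with a known propensity score (an assumption
# made for illustration; real applications estimate these nuisances).
n = 5000
x = rng.uniform(-1, 1, size=n)      # single covariate
e = np.full(n, 0.5)                 # propensity P(W = 1 | X)
w = rng.binomial(1, e)              # treatment indicator
tau = x                             # true effect: treatment helps iff x > 0
y = w * tau + rng.normal(size=n)    # observed outcome (baseline mean zero)

# Oracle nuisance values; hats indicate what would normally be cross-fitted.
mu1_hat, mu0_hat, e_hat = tau, np.zeros(n), e

# AIPW (doubly robust) score for the effect of treating each unit.
gamma_aipw = (mu1_hat - mu0_hat
              + w * (y - mu1_hat) / e_hat
              - (1 - w) * (y - mu0_hat) / (1 - e_hat))

# IPW score, which ignores the outcome model entirely.
gamma_ipw = w * y / e_hat - (1 - w) * y / (1 - e_hat)

# Policy class: threshold rules pi_c(x) = 1{x > c}. "Optimizing the estimated
# value function" here is a grid search over c.
thresholds = np.linspace(-1, 1, 201)

def best_threshold(scores):
    values = [np.mean((x > c) * scores) for c in thresholds]
    return thresholds[int(np.argmax(values))]

print("AIPW-learned threshold:", best_threshold(gamma_aipw))
print("IPW-learned threshold:", best_threshold(gamma_ipw))
```

Both learned thresholds should land near the true optimum $c = 0$; the point of the paper's sharper regret bounds is precisely to quantify how the variance of the chosen score (IPW versus AIPW) translates into regret, which the crude uniform-consistency argument cannot do.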