Policy gradient methods are a class of reinforcement learning algorithms that optimize a parametric policy by maximizing an objective function that directly measures the policy's performance. Although these methods appear in many high-profile applications of reinforcement learning, the update rules they use in practice are not well understood. Furthermore, under conditions such as partial observability, these update rules can be highly suboptimal from the perspective of variance analysis, producing gradient estimates with unnecessarily high variance. This thesis presents a comprehensive mathematical analysis of policy gradient methods, uncovering misconceptions and proposing novel solutions to improve their performance.
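For concreteness, the objective these methods maximize can be formalized as follows (this notation is standard in the literature and is added here for illustration; it is not quoted from the thesis). A policy $\pi_\theta$ with parameters $\theta$ is evaluated by its expected discounted return,
\[
J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_t \,\middle|\, \pi_\theta\right],
\]
and the policy gradient theorem expresses the update direction as
\[
\nabla_\theta J(\theta) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, G_t\, \nabla_\theta \log \pi_\theta(A_t \mid S_t)\right],
\qquad
G_t = \sum_{k=t}^{\infty} \gamma^{\,k-t} R_k .
\]
Notably, practical implementations often drop the leading $\gamma^{t}$ factor, one well-known example of the gap between the theoretical gradient and the update rules actually used in practice.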
Advisor: Philip S. Thomas