Speaker

Dhawal Gupta

Abstract

Reinforcement learning (RL) offers a powerful framework for sequential decision-making, enabling agents to solve complex tasks in dynamic environments. Despite its successes, RL faces significant challenges, particularly the temporal credit assignment problem (TCAP): determining the influence of individual actions on delayed outcomes. This dissertation addresses key aspects of TCAP by proposing novel methodologies that enhance value estimation, reformulate algorithmic principles related to credit assignment, and dynamically incorporate additional reward information provided by practitioners. These methods are designed so that auxiliary signals improve credit assignment without inducing unintended behaviors.

Value estimation is a fundamental component of many RL algorithms, and TCAP often manifests as high variance in value estimates. The first contribution focuses on variance reduction in linear temporal difference (TD) learning methods. By reinterpreting value function estimation as a matrix-multiplication problem, this work applies techniques from approximate matrix multiplication to reduce variance and improve sample efficiency. In initial experiments, the proposed norm-based sampling approach shows improved performance in both on-policy and off-policy learning settings, although it requires access to the environment’s model.
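
To give a flavor of the underlying idea, the sketch below shows generic norm-based row sampling for approximate matrix multiplication and applies it to the linear TD fixed point A w = b, where A and b are built from the feature matrix, state weighting, transition model, and rewards (as in least-squares TD). This is an illustrative sketch under standard definitions, not the dissertation’s algorithm; the function name approx_matmul and the random problem instance are hypothetical.

```python
import numpy as np

def approx_matmul(X, Y, num_samples, rng):
    """Estimate X.T @ Y by sampling rows with probability proportional to
    ||X_i|| * ||Y_i|| and rescaling, so the estimator stays unbiased
    (classic norm-based approximate matrix multiplication)."""
    norms = np.linalg.norm(X, axis=1) * np.linalg.norm(Y, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(len(p), size=num_samples, p=p)
    scale = 1.0 / (num_samples * p[idx])
    return (X[idx].T * scale) @ Y[idx]

# Hypothetical linear policy-evaluation instance: the TD fixed point solves
# A w = b with A = Phi^T D (Phi - gamma * P @ Phi) and b = Phi^T D r,
# and both products can be estimated by sampling rows (model access is needed for P).
rng = np.random.default_rng(0)
n_states, n_features, gamma = 200, 10, 0.95
Phi = rng.normal(size=(n_states, n_features))        # feature matrix
P = rng.dirichlet(np.ones(n_states), size=n_states)  # row-stochastic transition matrix
r = rng.normal(size=n_states)                        # reward vector
d = np.ones(n_states) / n_states                     # state weighting (diagonal of D)
X = Phi * d[:, None]                                 # rows of D @ Phi
Y = Phi - gamma * P @ Phi
A_hat = approx_matmul(X, Y, num_samples=50, rng=rng)
b_hat = approx_matmul(X, r[:, None], num_samples=50, rng=rng).ravel()
w_hat, *_ = np.linalg.lstsq(A_hat, b_hat, rcond=None)
```

Sampling rows in proportion to their norms keeps the product estimate unbiased while concentrating samples where they contribute most, which is the source of the variance reduction.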

The second contribution re-examines eligibility traces, a widely used extension of TD methods for policy evaluation. Eligibility traces were designed to enhance credit assignment for delayed rewards but may assign credit incorrectly in some learning settings. Specifically, we revisit TD(λ) and its behavior under nonlinear function approximation, highlighting a fundamental misconception in how credit assignment is performed. Correcting this misconception leads to a computationally expensive algorithm. However, further analysis reveals that eligibility traces can be reinterpreted as learning a new form of the value function, whose components have been explored in other contexts.
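
For reference, the sketch below is the textbook TD(λ) update with accumulating traces in the linear setting; it is background context rather than the corrected algorithm described above, and the names used are illustrative.

```python
import numpy as np

def td_lambda_update(w, z, phi_s, phi_s_next, reward, gamma, lam, alpha, done):
    """One step of textbook TD(lambda) with accumulating traces and a
    linear value estimate v(s) = w @ phi(s)."""
    v_next = 0.0 if done else w @ phi_s_next
    delta = reward + gamma * v_next - w @ phi_s   # one-step TD error
    z = gamma * lam * z + phi_s                   # decay old credit, add current features
    w = w + alpha * delta * z                     # spread the TD error along the trace
    if done:                                      # traces reset between episodes
        z = np.zeros_like(z)
    return w, z

# Illustrative usage on random features (hypothetical problem):
rng = np.random.default_rng(1)
w, z = np.zeros(8), np.zeros(8)
w, z = td_lambda_update(w, z, rng.normal(size=8), rng.normal(size=8),
                        reward=1.0, gamma=0.99, lam=0.9, alpha=0.1, done=False)
```

With nonlinear function approximation, the trace is typically built from the gradient of the value estimate rather than raw features, and how that trace distributes credit is what this chapter revisits.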

The third contribution introduces a bi-level optimization framework for reward function design to address TCAP in sparse-reward environments. In these settings, agents often receive feedback only at the end of an episode, making it difficult to associate rewards with earlier actions. We consider the common scenario where practitioners can provide auxiliary information through additional reward functions. Our proposed method, Behavior Alignment via Reward Function’s Implicit optimization (BARFI), dynamically integrates these auxiliary signals with the primary sparse reward. The framework leverages implicit gradients to adapt reward functions efficiently and scales to high-dimensional control tasks. It ensures that auxiliary signals accelerate learning when they are beneficial while keeping the agent robust to auxiliary rewards that would otherwise induce misaligned behavior.
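
One generic way to write the bi-level structure described above, with illustrative symbols rather than the dissertation’s exact notation: the inner level trains the agent’s parameters θ on the combined (auxiliary plus primary) reward parameterized by φ, while the outer level adjusts φ to maximize performance under the primary sparse reward alone, differentiating through the inner solution with the implicit function theorem.

```latex
\begin{aligned}
&\max_{\phi}\; J_{\text{primary}}\big(\theta^{*}(\phi)\big)
  &&\text{(outer level: only the sparse primary reward counts)}\\
&\text{s.t.}\;\; \theta^{*}(\phi) \in \arg\max_{\theta}\; J_{\text{aux}}(\theta, \phi)
  &&\text{(inner level: the agent trains on the combined reward)}\\
&\frac{d\theta^{*}}{d\phi} = -\big(\nabla^{2}_{\theta\theta} J_{\text{aux}}\big)^{-1}\,\nabla^{2}_{\theta\phi} J_{\text{aux}}
  &&\text{(implicit gradient at the inner stationary point)}
\end{aligned}
```

Implicit-gradient methods of this kind typically approximate the inverse-Hessian-vector product rather than forming it explicitly, which is one common route to the scalability in high-dimensional control tasks noted above.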

Advisors

Bruno Castro da Silva and Philip Thomas