Reinforcement learning's policy gradient methods optimize the policy directly for reward maximization, bypassing explicit value function estimation. Monte Carlo methods are introduced first, but their high variance motivates actor-critic approaches, which use a critic to guide policy improvement. Techniques such as baselines and compatible function approximation further improve efficiency and stability.

This segment discusses the advantages and disadvantages of policy gradient methods compared to value-based methods. Key advantages include better convergence properties, effectiveness in continuous action spaces, and the ability to learn stochastic policies. The main disadvantage is that naive policy-based methods can be slower to learn and suffer from higher variance.

This segment contrasts value function approximation methods (from the previous lecture) with direct parameterization of the policy, which is the focus of the current lecture. The policy is now a probability distribution over actions that is manipulated directly through learned parameters.

This segment explains why stochastic policies are sometimes preferable to deterministic ones. It uses the example of Rock, Paper, Scissors to illustrate how a stochastic policy (playing uniformly at random) is necessary for optimal behavior against strategic opponents, and it introduces the concept of a Nash equilibrium.

This segment explains how stochastic policies can outperform deterministic policies under state aliasing (partial observability or limited function approximation). It highlights that while a deterministic optimal policy always exists in a fully observable Markov decision process (MDP), a stochastic policy can be strictly better in a partially observable MDP, making policy search methods the more effective choice.

This segment details three objective functions for policy optimization: the start value (episodic environments), the average value (continuing environments), and the average reward per time step (continuing environments). The speaker explains when each objective applies and emphasizes that the same policy gradient machinery works for all three, which simplifies the optimization problem.

This segment compares gradient-based and gradient-free optimization methods for policy search. It explains the advantages of gradient-based methods when the gradient is available, highlighting their efficiency compared to gradient-free methods (hill climbing, the simplex method, genetic algorithms). The speaker emphasizes plain gradient ascent for its simplicity and effectiveness, while mentioning extensions to more sophisticated methods.

This segment introduces the likelihood ratio trick, a crucial technique for computing the policy gradient analytically. The trick rewrites the gradient of an expectation as an expectation involving the score function, making it feasible to optimize policies from samples, particularly in high-dimensional spaces. The segment also introduces the score function and its role in adjusting the policy toward better outcomes.

This segment explains one-step Markov decision processes (MDPs) and uses them to derive the policy gradient in the simplest setting, setting the stage for the more general cases in subsequent sections.
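To make the score function concrete, here is a minimal sketch, assuming a linear-softmax policy over a small discrete action set; the feature matrix phi_s, the parameters theta, and the random data are illustrative placeholders rather than anything from the lecture. For this policy the score is the chosen action's feature vector minus the policy-weighted average feature vector, and the snippet checks the likelihood-ratio identity that the score has zero mean under the policy.

```python
import numpy as np

def softmax_policy(theta, phi_s):
    """Action probabilities pi(a|s) for a linear-softmax policy.
    phi_s has shape (num_actions, num_features)."""
    prefs = phi_s @ theta                     # one preference per action
    prefs -= prefs.max()                      # numerical stability
    probs = np.exp(prefs)
    return probs / probs.sum()

def score_function(theta, phi_s, a):
    """Score function grad_theta log pi(a|s): for the softmax policy this is
    the chosen action's features minus the policy-averaged features."""
    probs = softmax_policy(theta, phi_s)
    return phi_s[a] - probs @ phi_s

# Toy check of the likelihood ratio identity: the score has zero mean under pi.
rng = np.random.default_rng(0)
theta = rng.normal(size=4)
phi_s = rng.normal(size=(3, 4))               # 3 actions, 4 features (illustrative)
probs = softmax_policy(theta, phi_s)
mean_score = sum(p * score_function(theta, phi_s, a) for a, p in enumerate(probs))
print(np.allclose(mean_score, 0.0))           # True: E_pi[grad log pi] = 0
```

This zero-mean property is also what later makes it legitimate to subtract a baseline from the return without biasing the gradient.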
This segment introduces actor-critic methods as a solution to the high variance of Monte Carlo policy gradient methods. It explains the roles of the actor (the policy) and the critic (a value function approximator) and how combining them leads to more stable and efficient learning.

This segment draws a parallel between the policy gradient theorem in reinforcement learning and supervised learning, highlighting the key difference: reinforcement learning weights the update by a value term that assesses how good an action was, whereas supervised learning relies on teacher feedback about the correct action.

This segment introduces the Monte Carlo policy gradient algorithm (REINFORCE), a straightforward approach to policy gradient optimization. The algorithm uses sampled returns to estimate the gradient and update the policy parameters, providing a practical implementation of the theory.

This segment clarifies the model-free nature of the policy gradient approach, emphasizing that it does not require a model of the environment's dynamics. The algorithm updates the policy from sampled rewards alone, which makes it practical for real-world applications.

This segment classifies reinforcement learning algorithms by whether they use value functions, policies, or both (actor-critic methods). It distinguishes value-based methods (e.g., acting epsilon-greedily with respect to a learned value function) from policy-based methods, the key difference being that the latter parameterize the policy directly. The discussion also touches on the relationship between policy parameterization and value-based methods.

This segment details the actor-critic method and explains how it differs from approaches such as epsilon-greedy control. It describes initializing a policy and improving it iteratively with gradient steps, with no greedy or epsilon-greedy step anywhere in the method. A Q&A session addresses common questions about action selection and algorithm behavior.

This segment addresses a crucial question about the convergence properties of policy gradient methods. It compares the global-optimum guarantees of value-based methods (via contraction mappings) with the weaker guarantees of policy-based methods. Guaranteeing global optimality with general function approximators such as neural networks remains an open research area.

This segment focuses on a key technique for improving actor-critic algorithms: variance reduction using a baseline. Subtracting a baseline function from the policy gradient changes neither its expectation nor the direction of ascent. The discussion then turns to the advantage function, which makes updates more efficient by measuring how much better an action is than the average value of being in that state. The segment concludes by showing that the TD error is an unbiased estimate of the advantage function, leading to a more practical algorithm.

This segment surveys different estimators of the policy gradient: the Monte Carlo return, the TD error, and eligibility traces that combine critic information from multiple time steps. The explanation emphasizes the bias-variance trade-off among these estimators and how eligibility traces strike a balance.
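As a concrete illustration of the Monte Carlo policy gradient described above, here is a minimal sketch with a tabular softmax policy on a toy episodic chain; the environment, the step size ALPHA, and the episode count are illustrative assumptions, not details from the lecture. Each update moves the parameters along the score of the action taken, scaled by the sampled return from that time step.

```python
import numpy as np

# Toy episodic chain: states 0..4, actions 0 = left, 1 = right, reward -1 per
# step, episode ends at state 4. Environment and hyperparameters are illustrative.
N_STATES, N_ACTIONS, GAMMA, ALPHA = 5, 2, 1.0, 0.02

def step(s, a):
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return s_next, -1.0, s_next == N_STATES - 1      # (next state, reward, done)

def pi(theta, s):
    prefs = theta[s] - theta[s].max()                # softmax with stable exponent
    p = np.exp(prefs)
    return p / p.sum()

rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))              # tabular softmax parameters

for episode in range(3000):
    # Sample one complete episode by following the current stochastic policy.
    s, traj, done = 0, [], False
    while not done:
        a = rng.choice(N_ACTIONS, p=pi(theta, s))
        s_next, r, done = step(s, a)
        traj.append((s, a, r))
        s = s_next
    # REINFORCE update: ascend G_t * grad log pi(a_t | s_t) for every step t.
    G = 0.0
    for s_t, a_t, r_t in reversed(traj):
        G = r_t + GAMMA * G                          # Monte Carlo return from t
        grad_log_pi = -pi(theta, s_t)                # softmax score function:
        grad_log_pi[a_t] += 1.0                      # indicator(a_t) - pi(.|s_t)
        theta[s_t] += ALPHA * G * grad_log_pi

print(pi(theta, 0))    # probability mass should have shifted toward "right"
```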
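And here, under the same illustrative assumptions, is a sketch of a one-step actor-critic on the same style of toy chain: a tabular critic is learned by TD(0), and its TD error stands in for the advantage in the actor's update, as described above.

```python
import numpy as np

# One-step actor-critic on a toy chain, using the TD error
# delta = r + gamma * V(s') - V(s) as the advantage estimate.
# Environment and step sizes are illustrative, not taken from the lecture.
N_STATES, N_ACTIONS, GAMMA = 5, 2, 1.0
ALPHA_ACTOR, ALPHA_CRITIC = 0.05, 0.1

def step(s, a):
    s_next = max(s - 1, 0) if a == 0 else s + 1
    return s_next, -1.0, s_next == N_STATES - 1

def pi(theta, s):
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

rng = np.random.default_rng(0)
theta = np.zeros((N_STATES, N_ACTIONS))   # actor: tabular softmax policy
v = np.zeros(N_STATES)                    # critic: tabular state-value function

for episode in range(2000):
    s, done = 0, False
    while not done:
        a = rng.choice(N_ACTIONS, p=pi(theta, s))
        s_next, r, done = step(s, a)
        # Critic: TD(0) update; the terminal state has value zero.
        target = r + (0.0 if done else GAMMA * v[s_next])
        delta = target - v[s]
        v[s] += ALPHA_CRITIC * delta
        # Actor: ascend delta * grad log pi(a|s); delta plays the advantage role.
        grad_log_pi = -pi(theta, s)
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA_ACTOR * delta * grad_log_pi
        s = s_next

print(pi(theta, 0))   # should again favor the action that ends the episode sooner
```

Compared with the Monte Carlo version, the updates here use a one-step bootstrapped target, trading some bias for much lower variance, which is exactly the trade-off that eligibility traces then interpolate.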
This segment addresses a crucial question in actor-critic methods: how can we be sure that following an approximate critic does not mean following the wrong gradient? It introduces compatible function approximation, which guarantees an unbiased gradient estimate by choosing the critic's features to be the score function of the policy, and it sketches the theoretical justification for this choice.

This segment introduces a recent advance in actor-critic methods: the deterministic policy gradient theorem. It contrasts stochastic policies with deterministic ones, highlighting a limitation of the stochastic approach: as the policy improves and its noise shrinks, the variance of the gradient estimates grows. The deterministic approach offers a simpler, more efficient alternative, particularly in continuous action spaces, by updating the policy parameters directly along the gradient of the Q-function.
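A minimal sketch of the deterministic policy gradient update is given below, assuming a linear deterministic policy and, purely for illustration, a known quadratic critic in place of a learned one; the names w_true, mu, and dq_da are hypothetical placeholders, not from the lecture. The update follows the chain rule: the gradient of the policy with respect to its parameters, times the gradient of the critic with respect to the action, evaluated at the policy's own action.

```python
import numpy as np

# Deterministic policy gradient update (sketch):
#   theta <- theta + alpha * grad_theta mu_theta(s) * grad_a Q(s, a) at a = mu_theta(s)
# with mu_theta(s) = theta . s and a toy critic Q(s, a) = -(a - w_true . s)^2.
rng = np.random.default_rng(0)
STATE_DIM, ALPHA = 3, 0.05
w_true = np.array([1.0, -2.0, 0.5])        # the optimal action is a* = w_true . s
theta = np.zeros(STATE_DIM)                # deterministic policy parameters

def mu(theta, s):
    return theta @ s                        # noise-free action in a continuous space

def dq_da(s, a):
    return -2.0 * (a - w_true @ s)          # grad_a Q(s, a) for the toy critic

for _ in range(1000):
    s = rng.normal(size=STATE_DIM)          # states sampled from some distribution
    a = mu(theta, s)
    # For a linear policy grad_theta mu_theta(s) = s, so the chain rule gives:
    theta += ALPHA * s * dq_da(s, a)

print(np.round(theta, 2))    # approaches w_true, i.e. near-optimal actions
```

Because the update differentiates the critic with respect to the action, it applies when actions are continuous and the critic is differentiable in them, which is exactly the setting the deterministic policy gradient targets.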