Reinforcement learning with function approximation scales control to large state spaces. Incremental methods (e.g., TD learning, Sarsa) and batch methods (e.g., LSPI, experience replay) approximate the value functions V(s) and Q(s,a) with parametric models updated by gradient descent. Linear function approximation is a key technique that generalizes table lookup, and the algorithms aim for approximate solutions via iterative policy improvement.

This segment introduces the core concept of value function approximation with parametric function approximators such as neural networks. These approximators estimate the true value function using a compact representation (a parameter vector w), enabling generalization across unseen states and actions.

This segment explains the limitations of traditional table-based methods in reinforcement learning when dealing with large state spaces. It introduces value function approximation as a solution, emphasizing its ability to generalize across states and actions and so deliver more efficient learning and memory usage.

This segment details the use of feature vectors to represent states compactly, simplifying the learning process. Linear combinations of features, weighted by learned parameters, can estimate the value function, which makes the optimization problem convex and easier to solve.

This segment focuses on the advantages of using linear combinations of features for value function approximation. The objective function becomes quadratic (convex), so gradient descent is guaranteed to converge to the global optimum, and the resulting update rule is simple.

This segment explains how stochastic gradient descent updates the parameters of a function approximator to minimize the mean squared error between its predictions and the values supplied by an oracle. Sampling states and multiplying the error term by the gradient gives efficient online learning without explicitly computing the expectation.

This segment discusses function approximators suitable for reinforcement learning, focusing on differentiable approximators such as linear combinations of features and neural networks. It highlights the challenges unique to reinforcement learning compared with supervised learning, in particular the non-stationarity of value functions and the need to handle non-i.i.d. data.

This segment connects table lookup with linear function approximation by showing that a table lookup is a special case of linear function approximation with a particular (indicator) feature vector, clarifying the relationship between the two approaches.

This segment addresses the unrealistic assumption of having an oracle and introduces Monte Carlo and temporal-difference learning as ways to estimate target values for the function approximator: the return (Monte Carlo) or the TD target is substituted for the oracle's value during learning.

This segment explains the core concept of temporal-difference (TD) learning: the error between the predicted and observed values (the TD error) is used to adjust the weights of the function approximator. The analogy of a game, in which a blunder creates an error signal used to correct the prediction, makes the idea easy to grasp. The explanation includes the mathematical formulation of the weight update rule and clarifies how it applies in both the linear and non-linear cases; a minimal sketch of that update appears below.
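As a rough illustration of the update rule just described, the sketch below implements semi-gradient TD(0) prediction with a linear approximator. The minimal environment interface (reset() returning a state, step() returning a state, reward, and done flag), the feature map phi, the policy function, and the hyperparameter values are all assumptions made for this example, not anything prescribed by the lecture.

```python
import numpy as np

def semi_gradient_td0(env, phi, num_features, policy,
                      alpha=0.01, gamma=0.99, num_episodes=500):
    """Estimate v_pi(s) ~ phi(s) . w with semi-gradient TD(0).

    `env`, `phi`, and `policy` are placeholders: `env` is assumed to expose
    reset() -> state and step(action) -> (next_state, reward, done), and
    `phi(state)` returns a feature vector of length `num_features`.
    """
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # The TD target bootstraps from the current estimate of the
            # next state's value; terminal states are valued at zero.
            v_next = 0.0 if done else phi(next_state) @ w
            td_error = reward + gamma * v_next - phi(state) @ w
            # Semi-gradient step: differentiate only the prediction
            # phi(state) . w, never the target.
            w += alpha * td_error * phi(state)
            state = next_state
    return w
```

For a linear approximator the gradient of the prediction is just the feature vector, which is why the update multiplies the TD error by phi(state); with a neural network the same rule applies with the network's gradient in place of the features.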
This segment addresses a viewer question, clarifying the incremental nature of TD learning. It contrasts this online, step-by-step approach with batch methods and highlights the relationship between TD learning and supervised learning. The discussion differentiates between the possible targets (the Monte Carlo return, the TD target, and the TD(λ) target), emphasizing the ongoing association between states and their target values during learning, and the mention of a convergence result by Tsitsiklis and Van Roy adds theoretical weight to the practical approach.

This segment delves into a crucial detail of TD learning: why the gradient update flows only through the predicted value function and not through the target. The explanation uses intuitive reasoning, comparing the process to a spring system and highlighting the importance of not sending information backwards in time. Residual gradient methods are mentioned as an alternative, but their complexity and potential pitfalls are emphasized.

This segment shifts the focus to control, explaining how approximate policy evaluation is integrated into the generalized policy iteration framework. It details the iterative process of acting greedily (with exploration) with respect to a value function (e.g., one represented by a neural network), updating the function approximator, and repeating, stressing the incremental nature of the updates and the rationale for always using the freshest information available.

This segment provides a concrete example using the classic Mountain Car problem. It describes the state space (position and velocity), the visualization of the value function, and the effect of policy improvement on the value function's shape. It introduces the SARSA algorithm, highlighting its incremental nature and its use of one-step TD returns for Q-value updates (see the SARSA sketch below), demonstrating a practical application of the theoretical framework.

This segment details the function approximation used in the Mountain Car problem, explaining how the algorithm adjusts a coarse-coded approximator towards the optimal value function. It covers the idea of overlapping tiles and how the shape of the approximation evolves during learning, ultimately producing the spiral pattern of the optimal value function.

This segment introduces batch methods as a remedy for the sample inefficiency of online gradient descent. It explains the idea of finding the value function that best fits all observed data in a least-squares sense, then details experience replay, in which past experiences are stored and sampled at random to update the function approximator (see the replay-buffer sketch below), making more efficient use of the data and converging to the least-squares solution.
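Tying the control and Mountain Car segments together, here is a minimal sketch of semi-gradient SARSA(0) with a linear action-value approximator. The feature map phi_sa (e.g., a tile coding over position and velocity), the list of actions, the environment interface, and the hyperparameters are all illustrative assumptions.

```python
import numpy as np

def semi_gradient_sarsa(env, phi_sa, num_features, actions,
                        alpha=0.05, gamma=1.0, epsilon=0.1, num_episodes=200):
    """On-policy control with Q(s, a) ~ phi_sa(s, a) . w.

    `phi_sa(state, action)` is an assumed feature map (e.g. tile coding for
    Mountain Car) returning a vector of length `num_features`; `env` exposes
    the same minimal reset()/step() interface as the TD(0) sketch above.
    """
    w = np.zeros(num_features)

    def q(state, action):
        return phi_sa(state, action) @ w

    def epsilon_greedy(state):
        if np.random.rand() < epsilon:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(state, a))

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            if done:
                target = reward                      # no bootstrap at the end
            else:
                next_action = epsilon_greedy(next_state)
                target = reward + gamma * q(next_state, next_action)
            # One-step SARSA update on the linear weights (semi-gradient).
            w += alpha * (target - q(state, action)) * phi_sa(state, action)
            if not done:
                state, action = next_state, next_action
    return w
```

Each update uses the freshest weights both to select the next action and to compute the one-step TD return, matching the incremental flavour described above.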
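The experience replay idea described above might be sketched as a simple buffer of transitions sampled uniformly at random; the class name, capacity, and batch size here are illustrative choices, not details from the lecture.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # old transitions drop off the end

    def __len__(self):
        return len(self.buffer)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling decorrelates consecutive transitions, so the
        # resulting minibatch updates look much more like supervised learning.
        return random.sample(self.buffer, batch_size)
```

Repeatedly sampling minibatches from such a buffer and regressing the approximator onto the corresponding targets is what lets the batch approach converge towards the least-squares fit over all stored data.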
This segment addresses whether a different representation of the state space could simplify the value function's shape. The speaker clarifies that the state is defined by the MDP and that the true value function's shape is inherent, but appropriate features can transform it into a simpler form that is easier to learn.

This segment presents empirical evidence on the impact of eligibility traces (λ) and bootstrapping in TD learning. A balance between TD(0) and Monte Carlo methods often yields the best performance, underlining the importance of bootstrapping for efficiency while acknowledging the potential instability of TD methods; the speaker emphasizes the need for algorithms that remain effective when bootstrapping.

This segment discusses the convergence properties of TD learning algorithms, particularly the distinction between on-policy and off-policy learning. It explains the conditions under which TD learning is guaranteed to converge and when it may diverge, separating the linear and non-linear cases and highlighting the difficulties that arise when off-policy learning is combined with bootstrapping. Gradient TD and emphatic TD methods are introduced as responses to these convergence issues.

This segment introduces least-squares methods as an alternative to iterative approaches such as experience replay for policy evaluation, particularly attractive with linear function approximation. The speaker explains how to find the least-squares solution by setting the expected update to zero and solving for the weights, leading to algorithms such as LSMC, LSTD, and LSTD(λ) (see the LSTD sketch at the end of this section), and compares the advantages and limitations of this approach with experience replay.

This segment provides a detailed explanation of the two Q-network method, clarifying how old and new parameter vectors are used to generate target values. It emphasizes bootstrapping towards frozen targets to avoid the instability caused by constantly moving targets, which is particularly important with non-linear function approximation (see the sketch below). The speaker also discusses how often to switch between the two networks and the practical effectiveness of the approach.

This segment details two key techniques, experience replay and the use of two Q-networks, that significantly improve the stability of reinforcement learning with neural networks, preventing the "blowing up" seen with earlier methods. Together they decorrelate trajectories and produce more stable updates that resemble supervised learning.

This segment introduces Least Squares Policy Iteration (LSPI), an algorithm that uses least-squares methods for both policy evaluation and control. The speaker explains how LSPI converges to the optimal value function and optimal policy, illustrating its application with a random walk example that shows LSPI quickly finding the optimal value function and policy even with function approximation, highlighting its efficiency and effectiveness.
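Returning to the two Q-network discussion above, the following sketch shows the core of the idea with a linear Q-approximator standing in for the neural network: targets are computed from a frozen copy of the parameters while a separate copy is updated. The batch format, function name, and hyperparameters are illustrative assumptions.

```python
import numpy as np

def fitted_q_step(batch, w, w_frozen, alpha=0.01, gamma=0.99):
    """One minibatch update towards targets computed with frozen parameters.

    `batch` holds (phi_sa, reward, next_phis, done) tuples, where `phi_sa`
    is the feature vector of the taken (state, action) pair and `next_phis`
    lists the feature vectors of every action in the next state. A linear
    Q-approximator phi . w stands in for the neural network in the lecture;
    all names and the batch format are illustrative assumptions.
    """
    grad = np.zeros_like(w)
    for phi_sa, reward, next_phis, done in batch:
        if done:
            target = reward
        else:
            # Bootstrap from the frozen copy, not the weights being updated,
            # so the regression targets stay fixed between synchronisations.
            target = reward + gamma * max(phi @ w_frozen for phi in next_phis)
        grad += (target - phi_sa @ w) * phi_sa
    return w + alpha * grad / len(batch)
```

In a full training loop, `w_frozen` would be overwritten with a copy of `w` every fixed number of updates; how often to do so is the switching parameter mentioned above.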
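Finally, the least-squares solution that underlies LSTD (and the evaluation step of LSPI) can be written in closed form rather than reached by iteration. This sketch assumes a dataset of transitions gathered under the policy being evaluated and a feature map `phi`; the small ridge term is a common practical addition rather than something stated in the lecture.

```python
import numpy as np

def lstd(transitions, phi, num_features, gamma=0.99, reg=1e-3):
    """Least-squares TD: solve A w = b directly instead of stepping towards it.

    `transitions` is an assumed list of (state, reward, next_state, done)
    tuples; `phi(state)` returns a feature vector of length `num_features`.
    """
    A = reg * np.eye(num_features)       # ridge term keeps A invertible
    b = np.zeros(num_features)
    for state, reward, next_state, done in transitions:
        x = phi(state)
        x_next = np.zeros(num_features) if done else phi(next_state)
        # Accumulate A = sum x (x - gamma x')^T and b = sum r x; setting the
        # expected TD update to zero gives the linear system A w = b.
        A += np.outer(x, x - gamma * x_next)
        b += reward * x
    return np.linalg.solve(A, b)
```

LSPI alternates this kind of least-squares evaluation (over state-action features) with greedy policy improvement, which is what lets it find the optimal value function and policy quickly in examples like the random walk mentioned above.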