Chapter 3 Policy Gradient Methods
In Chapter 2, we relaxed two key assumptions of the MDP introduced in Chapter 1:
- Unknown dynamics: the transition function \(P\) was no longer assumed to be known.
- Continuous states: the state space \(\mathcal{S}\) was extended from finite to continuous.
When only the dynamics are unknown but the MDP remains tabular, we introduced generalized versions of policy iteration (e.g., SARSA) and value iteration (e.g., Q-learning). These algorithms can recover near-optimal value functions with strong convergence guarantees.
When both the dynamics are unknown and the state space is continuous, tabular methods become infeasible. In this setting, we employed function approximation to represent value functions, and generalized SARSA and Q-learning accordingly. We also introduced stabilization techniques such as experience replay and target networks to ensure more reliable learning.
In this chapter, we relax a third assumption: the action space \(\mathcal{A}\) is also continuous. This setting captures many important real-world systems, such as autonomous vehicles and robots. Handling continuous actions requires a departure from the value-based methods of Chapter 2. The key difficulty is that even if we had access to a near-optimal action-value function \(Q(s,a)\), selecting the control action requires solving \[ \max_a Q(s,a), \] which is often computationally expensive and can lead to suboptimal solutions.
To address this challenge, we introduce a new paradigm: policy gradient methods. Rather than learning value functions to derive policies indirectly, we directly optimize parameterized policies using gradient-based methods.
We begin this chapter by reviewing the fundamentals of gradient-based optimization, and then build upon them to develop algorithms that search for optimal policies via policy gradients.
3.1 Gradient-based Optimization
Gradient-based optimization is the workhorse behind most modern machine learning algorithms, including policy gradient methods. The central idea is to iteratively update the parameters of a model in the direction that most improves an objective function.
3.1.1 Basic Setup
Suppose we have a differentiable objective function \(J(\theta)\), where \(\theta \in \mathbb{R}^d\) represents the parameter vector. The goal is to find \[ \theta^\star \in \arg\max_\theta J(\theta). \]
The gradient of the objective with respect to the parameters, \[ \nabla_\theta J(\theta) = \begin{bmatrix} \frac{\partial J}{\partial \theta_1} & \frac{\partial J}{\partial \theta_2} & \cdots & \frac{\partial J}{\partial \theta_d} \end{bmatrix}^\top, \] provides the local direction of steepest ascent. Gradient-based optimization uses this direction to iteratively update the parameters. Note that modern machine learning software tools such as PyTorch allow the user to conveniently query the gradient of any function \(J\) defined by neural networks.
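As a concrete illustration, the minimal sketch below (assuming PyTorch is available; the quadratic objective is only a stand-in for whatever differentiable \(J\) a neural network defines) queries \(\nabla_\theta J(\theta)\) with automatic differentiation.

```python
import torch

# Parameters theta in R^3, tracked for automatic differentiation.
theta = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)

# A stand-in differentiable objective J(theta); in practice this would be
# defined by a neural network.
J = -(theta ** 2).sum() + 3.0 * theta[0]

J.backward()        # populates theta.grad with the gradient of J at theta
print(theta.grad)   # tensor([ 1.,  4., -1.])
```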
3.1.2 Gradient Ascent and Descent
The simplest method is gradient ascent (for maximization):
\[
\theta_{k+1} = \theta_k + \alpha \nabla_\theta J(\theta_k),
\]
where \(\alpha > 0\) is the learning rate.
For minimization, the update rule uses gradient descent:
\[
\theta_{k+1} = \theta_k - \alpha \nabla_\theta J(\theta_k).
\]
The choice of learning rate \(\alpha\) is critical:
- If \(\alpha\) is too large, the updates can overshoot and diverge.
- If \(\alpha\) is too small, convergence becomes very slow.
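The sensitivity to \(\alpha\) is easy to see on a one-dimensional quadratic. The sketch below is a toy illustration (plain Python, nothing beyond the update rule itself): for \(f(\theta)=\theta^2\) the gradient-descent map is \(\theta \mapsto (1-2\alpha)\theta\), which contracts toward the minimizer for \(0<\alpha<1\) and diverges for \(\alpha>1\).

```python
def gradient_descent(alpha, theta0=1.0, steps=20):
    """Gradient descent on f(theta) = theta^2, whose gradient is 2*theta."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta   # theta <- (1 - 2*alpha) * theta
    return theta

print(gradient_descent(alpha=0.4))    # converges rapidly toward the minimizer 0
print(gradient_descent(alpha=0.01))   # converges, but slowly
print(gradient_descent(alpha=1.1))    # |1 - 2*alpha| > 1: the iterates diverge
```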
3.1.2.1 Convergence Guarantees
For convex functions \(J(\theta)\), gradient descent (or ascent) can be shown to converge to the global optimum under appropriate conditions on the learning rate.
For non-convex functions—which are common in reinforcement learning—gradient methods may only find so-called first-order stationary points, i.e., points \(\theta\) at which the gradient \(\nabla_\theta J(\theta) = 0\). Nevertheless, they remain effective in practice.
TODO: graph different stationary points
We now formalize the convergence speed of Gradient Descent (GD) for minimizing a smooth convex function. We switch to the minimization convention and write the objective as \(f:\mathbb{R}^d \to \mathbb{R}\) (to avoid sign confusions with \(J\) used for maximization). We assume exact gradients \(\nabla f(\theta)\) are available.
Setup and Assumptions.
(Convexity) For all \(\theta,\vartheta\in\mathbb{R}^d\), \[\begin{equation} f(\vartheta) \;\ge\; f(\theta) + \nabla f(\theta)^\top(\vartheta-\theta). \tag{3.1} \end{equation}\]
(\(L\)-smoothness) The gradient is \(L\)-Lipschitz: for all \(\theta,\vartheta\), \[\begin{equation} \|\nabla f(\vartheta)-\nabla f(\theta)\| \;\le\; L\|\vartheta-\theta\|. \tag{3.2} \end{equation}\] Equivalently (the descent lemma), for all \(\theta,\Delta\), \[\begin{equation} f(\theta+\Delta) \;\le\; f(\theta) + \nabla f(\theta)^\top \Delta + \frac{L}{2}\|\Delta\|^2. \tag{3.3} \end{equation}\]
Consider Gradient Descent with a constant stepsize \(\alpha>0\): \[ \theta_{k+1} \;=\; \theta_k \;-\; \alpha\, \nabla f(\theta_k). \]
Theorem 3.1 (GD on smooth convex function) Let \(f\) be convex and \(L\)-smooth with a minimizer \(\theta^\star\in\arg\min_\theta f(\theta)\) and global minimum \(f^\star = f(\theta^\star)\). If \(0<\alpha\le \frac{2}{L}\), then the GD iterates satisfy for all \(k\ge 0\): \[\begin{equation} f(\theta_k) - f^\star \leq \frac{2 (f(\theta_0) - f^\star) \Vert \theta_0 - \theta^\star \Vert^2 }{2 \Vert \theta_0 - \theta^\star \Vert^2 + k\alpha ( 2 - L \alpha) (f(\theta_0) - f^\star)} \tag{3.4} \end{equation}\] In particular, choosing \(\alpha=\frac{1}{L}\) yields the canonical \(O(1/k)\) convergence rate in suboptimality: \[\begin{equation} f(\theta_k) - f^\star \leq \frac{2L \Vert \theta_0 - \theta^\star \Vert^2}{k+4} \tag{3.5} \end{equation}\]
Proof. See Theorem 2.1.14 and Corollary 2.1.2 in (Nesterov 2018).
Strongly Convex Case (Linear Rate). Suppose, in addition, that \(f\) is \(\mu\)-strongly convex (\(\mu>0\)), i.e., for all \(\theta,\vartheta\in\mathbb{R}^d\), \[\begin{equation} f(\vartheta)\;\ge\; f(\theta) + \nabla f(\theta)^\top(\vartheta-\theta) \;+\; \frac{\mu}{2}\,\|\vartheta-\theta\|^2. \tag{3.6} \end{equation}\] Then GD with \(0<\alpha\le \frac{2}{\mu + L}\) enjoys a linear (geometric) rate:
Theorem 3.2 (GD on smooth strongly convex function) If \(f\) is \(L\)-smooth and \(\mu\)-strongly convex, then for \(0<\alpha\le \frac{2}{\mu + L}\), \[\begin{equation} \Vert \theta_k - \theta^\star \Vert^2 \leq \left( 1 - \frac{2\alpha \mu L}{\mu + L} \right)^k \Vert \theta_0 - \theta^\star \Vert^2. \tag{3.7} \end{equation}\] If \(\alpha = \frac{2}{\mu + L}\), then \[\begin{equation} \begin{split} \Vert \theta_k - \theta^\star \Vert & \leq \left( \frac{Q_f - 1}{Q_f + 1} \right)^k \Vert \theta_0 - \theta^\star \Vert \\ f(\theta_k) - f^\star & \leq \frac{L}{2} \left( \frac{Q_f - 1}{Q_f + 1} \right)^{2k} \Vert \theta_0 - \theta^\star \Vert^2, \end{split} \tag{3.8} \end{equation}\] where \(Q_f = L/\mu\).
Proof. See Theorem 2.1.15 in (Nesterov 2018).
Practical Notes.
- The step size \(\alpha=\frac{1}{L}\) is optimal among fixed stepsizes for the above worst-case bounds on smooth convex \(f\).
- In practice, backtracking line search or adaptive schedules can approach similar behavior without knowing \(L\).
- For policy gradients (which maximize \(J\)), apply the results to \(f=-J\) and flip the update sign (gradient ascent). The smooth/convex assumptions rarely hold globally in RL, but these results calibrate expectations about step sizes and motivate variance reduction and curvature-aware methods used later.
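As a numerical sanity check on Theorem 3.2 (a toy illustration, not part of the formal development), the sketch below runs GD with \(\alpha=\frac{2}{\mu+L}\) on the quadratic \(f(\theta)=\frac{1}{2}\theta^\top \mathrm{diag}(\mu,L)\,\theta\), whose strong-convexity and smoothness constants are exactly \(\mu\) and \(L\), and compares \(\Vert\theta_k-\theta^\star\Vert\) against the geometric bound in (3.8).

```python
import numpy as np

mu, L = 1.0, 10.0                        # strong convexity / smoothness constants
H = np.diag([mu, L])                     # f(theta) = 0.5 * theta^T H theta, grad f = H theta
alpha = 2.0 / (mu + L)
Q_f = L / mu
rate = (Q_f - 1.0) / (Q_f + 1.0)         # contraction factor from Theorem 3.2

theta = np.array([1.0, 1.0])             # the minimizer is theta* = 0
err0 = np.linalg.norm(theta)
for k in range(20):
    theta = theta - alpha * H @ theta    # GD step

print("error after 20 steps:", np.linalg.norm(theta))
print("bound from (3.8):    ", rate ** 20 * err0)
```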
3.1.3 Stochastic Gradients
In reinforcement learning and other large-scale machine learning problems, computing the exact gradient \(\nabla_\theta J(\theta)\) is often infeasible. Instead, we use an unbiased estimator \(\hat{\nabla}_\theta J(\theta)\) computed from a subset of data (or trajectories in RL). The update becomes \[ \theta_{k+1} = \theta_k + \alpha \hat{\nabla}_\theta J(\theta_k). \]
This approach, known as stochastic gradient ascent/descent (SGD), trades off exactness for computational efficiency. Variance in the gradient estimates plays an important role in convergence speed and stability.
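To make this tradeoff concrete, the sketch below (a toy problem with synthetic Gaussian noise, not an RL objective) runs SGD on \(f(\theta)=\frac{1}{2}\Vert\theta\Vert^2\) with an unbiased gradient estimate \(g_k=\nabla f(\theta_k)+\varepsilon_k\): a constant stepsize makes fast initial progress but then stalls at a noise-induced plateau, while a diminishing stepsize keeps improving.

```python
import numpy as np

rng = np.random.default_rng(0)

def run_sgd(stepsize, steps=5000, sigma=1.0, dim=10):
    """SGD on f(theta) = 0.5 * ||theta||^2 using unbiased noisy gradients."""
    theta = np.ones(dim)
    for k in range(steps):
        g = theta + sigma * rng.standard_normal(dim)  # E[g | theta] = grad f(theta)
        theta = theta - stepsize(k) * g
    return 0.5 * np.sum(theta ** 2)                   # suboptimality f(theta) - f*

print(run_sgd(lambda k: 0.1))                   # constant stepsize: stalls at a noise floor
print(run_sgd(lambda k: 1.0 / np.sqrt(k + 1)))  # diminishing stepsize: keeps shrinking
```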
3.1.3.1 Convergence Guarantees
We now turn to the convergence guarantees of stochastic gradient methods, which replace exact gradients with unbiased noisy estimates. Throughout this section we consider the minimization problem \(\min_\theta f(\theta)\) and assume \(\nabla f\) is available only through a stochastic oracle.
Setup and Assumptions.
Let \(f:\mathbb{R}^d\!\to\!\mathbb{R}\) be differentiable. At iterate \(\theta_k\), we observe a random vector \(g_k\) such that \[ \mathbb{E}[\,g_k \mid \theta_k\,] = \nabla f(\theta_k) \quad\text{and}\quad \mathbb{E}\!\left[\|g_k-\nabla f(\theta_k)\|^2 \mid \theta_k\right] \le \sigma^2. \] We will also use one of the following standard regularity conditions:
- (Convex + \(L\)-smooth) \(f\) is convex and the gradient is \(L\)-Lipschitz.
- (Strongly convex + \(L\)-smooth) \(f\) is \(\mu\)-strongly convex and \(L\)-smooth.
We consider the SGD update \[ \theta_{k+1} \;=\; \theta_k - \alpha_k\, g_k, \] and define the averaged iterate \[ \bar\theta_K := \frac{1}{K+1}\sum_{k=0}^{K}\theta_k. \]
Theorem 3.3 (SGD on smooth convex function) Assume \(f\) is convex and \(L\)-smooth. Suppose there exists \(G\!>\!0\) with \(\mathbb{E}\|g_k\|^2 \le G^2\) for all \(k\).
Choose a constant stepsize \(\alpha_k = \alpha > 0\). Then for all \(K \ge 1\), \[\begin{equation} \mathbb{E}\big[f(\bar\theta_K)\big] - f^\star \leq \frac{\Vert \theta_0 - \theta^\star \Vert^2}{2 \alpha (K+1)} + \frac{\alpha G^2}{2}. \tag{3.9} \end{equation}\]
If instead we choose the diminishing stepsize \(\alpha_k = \frac{\Vert \theta_0 - \theta^\star \Vert}{G \sqrt{k+1}}\), then \[\begin{equation} \mathbb{E}\big[f(\bar\theta_K)\big] - f^\star \leq \frac{\Vert \theta_0 - \theta^\star \Vert G}{\sqrt{K+1}} = \mathcal{O}\left( \frac{1}{\sqrt{K}} \right). \tag{3.10} \end{equation}\]
Proof. See this lecture note and (Garrigos and Gower 2023).
Remarks.
- The bound is on the averaged iterate \(\bar\theta_K\); without further assumptions, the last iterate may be worse by constant factors.
- Replacing the second-moment bound by a variance bound \(\sigma^2\) yields the same rate with \(G^2\) replaced by \(\sigma^2 + \sup_k\|\nabla f(\theta_k)\|^2\).
- With a constant stepsize, SGD converges at rate \(\mathcal{O}(1/K)\) only up to a neighborhood of the optimum whose size is set by the gradient noise (the \(\frac{\alpha G^2}{2}\) term in (3.9)).
The next theorem states the convergence rate of SGD for minimizing strongly convex functions.
Theorem 3.4 (SGD on smooth strongly convex function) Assume \(f\) is \(\mu\)-strongly convex and \(L\)-smooth, and \(\mathbb{E}\!\left[\|g_k\|^2 \right]\le G^2\).
With stepsize \(\alpha_k = \frac{1}{\mu(k+1)}\), the SGD iterates satisfy for all \(K\!\ge\!1\),
\[\begin{equation}
\begin{split}
\mathbb{E}[f(\bar\theta_K)] - f^\star & \leq \frac{G^2}{2 \mu (K+1)} (1 + \log(K+1)), \\
\mathbb{E} \Vert \bar\theta_K - \theta^\star \Vert^2 & \leq \frac{Q}{K+1}, \ \ Q = \max \left( \frac{G^2}{\mu^2}, \Vert \theta_0 - \theta^\star \Vert^2 \right).
\end{split}
\tag{3.11}
\end{equation}\]
Proof. See this lecture note and (Garrigos and Gower 2023).
Practical Takeaways for Policy Gradients.
- Use diminishing stepsizes for theoretical convergence (\(\alpha_k \propto 1/\sqrt{k}\) for general convex, \(\alpha_k \propto 1/k\) for strongly convex surrogates).
- With constant stepsizes, expect fast initial progress down to a variance-limited plateau; lowering variance (e.g., via baselines/advantage estimation) is as important as tuning \(\alpha\).
TODO: graph the different trajectories between minimizing a convex function using GD and SGD.
3.1.4 Beyond Vanilla Gradient Methods
Several refinements to basic gradient updates are widely used (a short usage sketch follows the list):
- Momentum methods: incorporate past gradients to smooth updates and accelerate convergence.
- Adaptive learning rates (Adam, RMSProp, AdaGrad): adjust the learning rate per parameter based on historical gradient magnitudes.
- Second-order methods: approximate or use curvature information (the Hessian) for more informed updates, though often impractical in high dimensions.
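These refinements share a common interface in modern libraries. The sketch below (assuming PyTorch; the toy objective and all hyperparameters are placeholder choices) minimizes the same loss with vanilla SGD, SGD with momentum, and Adam, purely to illustrate how the update rule is swapped without changing the training loop.

```python
import torch

def train(make_optimizer, steps=200):
    theta = torch.ones(5, requires_grad=True)      # parameters to optimize
    opt = make_optimizer([theta])
    for _ in range(steps):
        loss = (theta ** 2).sum()                  # toy objective to minimize
        opt.zero_grad()
        loss.backward()                            # compute gradients
        opt.step()                                 # apply the optimizer's update rule
    return (theta ** 2).sum().item()               # final objective value

print(train(lambda p: torch.optim.SGD(p, lr=0.05)))                 # vanilla gradient descent
print(train(lambda p: torch.optim.SGD(p, lr=0.05, momentum=0.9)))   # momentum
print(train(lambda p: torch.optim.Adam(p, lr=0.05)))                # adaptive learning rates
```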
3.2 Policy Gradients
Policy gradients optimize a parameterized stochastic policy directly, without requiring an explicit action-value maximization step. They are applicable to both finite and continuous action spaces and are especially useful when actions are continuous or when “\(\arg\max\)” over \(Q(s,a)\) is costly or ill-posed.
3.2.1 Setup
We consider a Markov decision process (MDP) with (possibly continuous) state space \(\mathcal{S}\), action space \(\mathcal{A}\), unknown dynamics \(P\), reward function \(R(s,a)\), and discount factor \(\gamma\in[0,1)\). Let \(\pi_\theta(a\mid s)\) be a differentiable stochastic policy with parameters \(\theta\in\mathbb{R}^d\).
Trajectory. A state-action trajectory is \(\tau=(s_0,a_0,s_1,a_1,\dots,s_{T})\) with probability density/mass \[\begin{equation} p_\theta(\tau) = \rho(s_0)\prod_{t=0}^{T-1} \pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t), \tag{3.12} \end{equation}\] where \(\rho\) is the initial state distribution and \(T\) is the (random or fixed) episode length.
Return. Define the (discounted) return \[\begin{equation} R(\tau) \;=\; \sum_{t=0}^{T-1}\gamma^t R(s_t,a_t), \tag{3.13} \end{equation}\] and the return-to-go \[\begin{equation} g_t \;=\; \sum_{t'=t}^{T-1}\gamma^{t'-t} R(s_{t'},a_{t'}). \tag{3.14} \end{equation}\]
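Both quantities are easy to compute from a recorded reward sequence. The helper below is a minimal sketch (plain Python; the reward list is a made-up example) that computes every return-to-go \(g_t\) with one backward recursion \(g_t = r_t + \gamma\, g_{t+1}\), so that \(R(\tau) = g_0\).

```python
def returns_to_go(rewards, gamma=0.99):
    """g_t = sum_{t'>=t} gamma^(t'-t) * r_{t'}, via g_t = r_t + gamma * g_{t+1}."""
    g = 0.0
    out = []
    for r in reversed(rewards):
        g = r + gamma * g
        out.append(g)
    return list(reversed(out))

rewards = [1.0, 0.0, 2.0]            # hypothetical rewards r_0, r_1, r_2
print(returns_to_go(rewards))        # [2.9602, 1.98, 2.0]
print(returns_to_go(rewards)[0])     # equals the full discounted return R(tau)
```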
Optimization objective. The goal is to maximize the expected return \[\begin{equation} J(\theta) \;\equiv\; \mathbb{E}_{\tau\sim p_\theta}\!\left[R(\tau)\right] \;=\; \mathbb{E}\!\left[\sum_{t=0}^{T-1}\gamma^t R(s_t,a_t)\right], \tag{3.15} \end{equation}\] where the expectation is taken over the randomness in (i) the initial state \(s_0 \sim \rho\), (ii) the policy \(\pi_\theta\), and (iii) the transition dynamics \(P\).
3.2.1.1 Policy models
Finite action spaces (\(\mathcal{A}\) discrete). A common choice is a softmax (categorical) policy over a score (logit) function \(f_\theta(s,a)\): \[ \pi_\theta(a\mid s) \;=\; \frac{\exp\{f_\theta(s,a)\}}{\sum_{a'\in\mathcal{A}}\exp\{f_\theta(s,a')\}}. \] Here \(\exp\{f_\theta(s,a)\}\) denotes \(e^{f_\theta(s,a)}\); the braces are used only for readability. Typically \(f_\theta\) is a neural network or a linear function of state features.
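A minimal sketch of such a softmax policy, assuming PyTorch (the network architecture is an arbitrary placeholder), is shown below; `torch.distributions.Categorical` takes the logits \(f_\theta(s,\cdot)\) and provides sampling and \(\log \pi_\theta(a\mid s)\).

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi_theta(a|s) proportional to exp{f_theta(s,a)} over a finite action set."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.logits = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, num_actions)
        )

    def forward(self, state):
        return torch.distributions.Categorical(logits=self.logits(state))

policy = SoftmaxPolicy(state_dim=4, num_actions=3)
dist = policy(torch.zeros(4))     # distribution over the 3 actions at some state
a = dist.sample()                 # sampled action index
log_prob = dist.log_prob(a)       # log pi_theta(a|s), differentiable w.r.t. theta
```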
Continuous action spaces (\(\mathcal{A}\subseteq\mathbb{R}^m\)). A standard choice is a Gaussian policy: \[ \pi_\theta(a\mid s) \;=\; \mathcal{N}\!\big(a;\;\mu_\theta(s),\,\Sigma_\theta(s)\big), \] where the mean \(\mu_\theta(s)\) and the (often diagonal) covariance \(\Sigma_\theta(s)\) are differentiable functions (e.g., neural networks) parameterized by \(\theta\); actions are sampled from this Gaussian. Other choices include squashed Gaussians (e.g., via \(\tanh\)) or Beta distributions for bounded actions.
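An analogous sketch of a Gaussian policy is given below (again assuming PyTorch; here a state-independent log standard deviation is used as a common simplification of \(\Sigma_\theta(s)\), and the network sizes are arbitrary).

```python
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi_theta(a|s) = N(a; mu_theta(s), diag(exp(2*log_std)))."""
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.mu = nn.Sequential(
            nn.Linear(state_dim, 64), nn.Tanh(), nn.Linear(64, action_dim)
        )
        # State-independent log std (a common simplification of Sigma_theta(s)).
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        return torch.distributions.Normal(self.mu(state), self.log_std.exp())

policy = GaussianPolicy(state_dim=3, action_dim=2)
dist = policy(torch.zeros(3))
a = dist.sample()                          # action a in R^2
log_prob = dist.log_prob(a).sum(-1)        # sum over action dims for a diagonal Gaussian
```

In both sketches, the `log_prob` values are exactly the quantities \(\log\pi_\theta(a\mid s)\) that the policy gradient lemma below differentiates.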
3.2.2 The Policy Gradient Lemma
With the gradient-based optimization machinery from Section 3.1, a natural strategy for the policy optimization problem in (3.15) is gradient ascent on the objective \(J(\theta)\). Consequently, the central task is to characterize the ascent direction, i.e., to compute \(\nabla_\theta J(\theta)\).
The policy gradient lemma, stated below, provides exactly this characterization. Crucially, it expresses \(\nabla_\theta J(\theta)\) in terms of the policy’s score function \(\nabla_\theta \log \pi_\theta(a\mid s)\) and returns, without differentiating through the environment dynamics. This likelihood-ratio form makes policy optimization feasible even when the transition model is unknown or non-differentiable.
Theorem 3.5 (Policy Gradient Lemma) Let \(J(\theta)=\mathbb{E}_{\tau \sim p_\theta}[R(\tau)]\) be as defined in (3.15). Then: \[\begin{equation} \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau\sim p_\theta} \Big[R(\tau)\,\nabla_\theta \log p_\theta(\tau)\Big] \;=\; \mathbb{E}_{\tau\sim p_\theta} \Bigg[\sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;R(\tau)\Bigg]. \tag{3.16} \end{equation}\] By causality (future actions do not affect past rewards), the full return can be replaced by the return-to-go: \[\begin{equation} \nabla_\theta J(\theta) \;=\; \mathbb{E}_{\tau\sim p_\theta} \Bigg[\sum_{t=0}^{T-1} \gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\;g_t\Bigg]. \tag{3.17} \end{equation}\] Equivalently, using value functions, \[\begin{equation} \nabla_\theta J(\theta) \;=\; \frac{1}{1-\gamma} \mathbb{E}_{s\sim d_\theta,\;a\sim\pi_\theta} \Big[\nabla_\theta \log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a)\Big], \tag{3.18} \end{equation}\] where \(d_\theta\) is the (discounted) on-policy state visitation distribution for infinite-horizon MDPs: \[\begin{equation} d_\theta(s) = (1 - \gamma) \sum_{t=0}^{\infty} \gamma^t \Pr_\theta(s_t=s). \tag{3.19} \end{equation}\]
Proof. We prove the three equivalent forms step by step. Throughout, we assume \(\theta\) parameterizes only the policy \(\pi_\theta\) (not the dynamics \(P\) nor the initial distribution \(\rho\)), and that interchanging \(\nabla_\theta\) with the trajectory integral/sum is justified (e.g., bounded rewards and finite horizon or standard dominated-convergence conditions). Let the return-to-go \(g_t\) be defined as in (3.14).
Step 1 (Log-derivative trick). Write the objective as an expectation over trajectories: \[ J(\theta) \;=\; \int R(\tau)\, p_\theta(\tau)\, d\tau. \] Differentiating under the integral and using the identity \[\begin{equation} \nabla_\theta p_\theta(\tau)=p_\theta(\tau)\nabla_\theta\log p_\theta(\tau), \tag{3.20} \end{equation}\] we can write \[ \nabla_\theta J(\theta) = \int R(\tau)\,\nabla_\theta p_\theta(\tau)\, d\tau = \int R(\tau)\, p_\theta(\tau)\,\nabla_\theta \log p_\theta(\tau)\, d\tau = \mathbb{E}_{\tau\sim p_\theta}\!\big[R(\tau)\,\nabla_\theta \log p_\theta(\tau)\big], \] which is (3.16) up to expanding \(\log p_\theta(\tau)\). The identity (3.20) follows from the chain rule: \[ \nabla_\theta \log p_\theta(\tau) = \frac{1}{p_\theta(\tau)} \nabla_\theta p_\theta(\tau). \]
Step 2 (Policy-only dependence). Factor the trajectory likelihood/mass: \[ p_\theta(\tau) = \rho(s_0)\,\prod_{t=0}^{T-1}\pi_\theta(a_t\mid s_t)\,P(s_{t+1}\mid s_t,a_t). \] Since \(\rho\) and \(P\) do not depend on \(\theta\), \[ \log p_\theta(\tau) = \text{const} \;+\; \sum_{t=0}^{T-1}\log \pi_\theta(a_t\mid s_t) \quad\Rightarrow\quad \nabla_\theta \log p_\theta(\tau) \;=\; \sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t). \] Substitute into Step 1 to obtain the second equality in (3.16): \[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau\sim p_\theta}\!\Bigg[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,R(\tau)\Bigg]. \]
Step 3 (Causality \(\Rightarrow\) return-to-go). Expand \(R(\tau)=\sum_{t=0}^{T-1}\gamma^{t} r_{t}\) (with \(r_{t}:=R(s_{t},a_{t})\)) and swap sums: \[ \mathbb{E} \Bigg[\sum_{t=0}^{T-1}\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,R(\tau)\Bigg] = \sum_{t=0}^{T-1}\sum_{t'=0}^{T-1}\mathbb{E} \big[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\gamma^{t'} r_{t'}\big]. \] For \(t'<t\), the factor \(\gamma^{t'} r_{t'}\) is measurable w.r.t. the history \(\mathcal{F}_t=\sigma(s_0,a_0,\dots,s_t)\), while \[ \mathbb{E} \big[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\big|\,\mathcal{F}_t\big] = \sum_{a} \pi_\theta(a\mid s_t)\,\nabla_\theta \log \pi_\theta(a\mid s_t) = \nabla_\theta \sum_{a}\pi_\theta(a\mid s_t) = \nabla_\theta 1 = 0, \] (and analogously with integrals for continuous \(\mathcal{A}\)). Hence by the tower property, \[ \mathbb{E} \big[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\gamma^{t'} r_{t'}\big]=0\quad\text{for all }t'<t. \] Therefore only the terms with \(t'\ge t\) survive, and \[ \nabla_\theta J(\theta) = \sum_{t=0}^{T-1}\mathbb{E} \Big[\nabla_\theta \log \pi_\theta(a_t\mid s_t)\,\sum_{t'=t}^{T-1}\gamma^{t'} r_{t'}\Big] = \mathbb{E} \Bigg[\sum_{t=0}^{T-1} \gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,g_t\Bigg], \] which is (3.17).
Step 4 (Value-function form). Condition on \((s_t,a_t)\) and use the definition of the action-value function: \[ Q^{\pi_\theta}(s_t,a_t) \;\equiv\; \mathbb{E}\!\left[g_t \,\middle|\, s_t,a_t\right]. \] Taking expectations then yields \[ \mathbb{E} \big[\gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,g_t\big] = \mathbb{E} \big[\gamma^t \nabla_\theta \log \pi_\theta(a_t\mid s_t)\,Q^{\pi_\theta}(s_t,a_t)\big]. \] Summing over \(t\) and collecting terms with the (discounted) on-policy state visitation distribution \(d_\theta\) (for the infinite-horizon case, e.g., \(d_\theta(s)=(1-\gamma)\sum_{t=0}^\infty \gamma^t\,\Pr_\theta(s_t=s)\); for finite \(T\), use the corresponding finite-horizon weighting), we obtain \[ \nabla_\theta J(\theta) \;=\; \frac{1}{1-\gamma}\, \mathbb{E}_{s\sim d_\theta,\;a\sim \pi_\theta} \Big[\nabla_\theta \log \pi_\theta(a\mid s)\,Q^{\pi_\theta}(s,a)\Big], \] which is (3.18); the factor \(\frac{1}{1-\gamma}\) compensates for the \((1-\gamma)\) normalization in the definition of \(d_\theta\).
Conclusion. Combining Steps 1–4 proves all three stated forms of the policy gradient.
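The identity at the heart of Step 3, \(\mathbb{E}_{a\sim\pi_\theta(\cdot\mid s)}\big[\nabla_\theta \log \pi_\theta(a\mid s)\big]=0\), is also easy to verify numerically. The sketch below (assuming PyTorch; the three-action softmax policy with the state suppressed is a toy example) estimates the expected score by Monte Carlo and obtains a vector close to zero.

```python
import torch

# Toy softmax policy over three actions; the state dependence is suppressed.
theta = torch.tensor([0.3, -1.2, 0.7], requires_grad=True)
dist = torch.distributions.Categorical(logits=theta)

actions = dist.sample((100_000,))                  # i.i.d. actions a ~ pi_theta
mean_log_prob = dist.log_prob(actions).mean()      # (1/N) sum_i log pi_theta(a_i)
(score_mean,) = torch.autograd.grad(mean_log_prob, theta)

print(score_mean)  # Monte Carlo estimate of E[grad log pi_theta(a)]: approximately zero
```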
3.2.3 REINFORCE
The policy gradient lemma immediately gives us an algorithm. Specifically, the gradient recipe in (3.16) tells us that if we generate one trajectory \(\tau\) by following the policy \(\pi_\theta\), then \[\begin{equation} \widehat{\nabla_\theta J} = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t) R(\tau) \tag{3.21} \end{equation}\] is an unbiased estimator of the true gradient.
With this sample gradient estimator, we obtain the classical REINFORCE algorithm.
- Initialize \(\theta_0\) for the initial policy \(\pi_{\theta_0}(a \mid s)\)
- For \(k=0,1,\dots\), do:
    - Obtain a trajectory \(\tau \sim p_{\theta_k}\)
    - Compute the stochastic gradient \(g_k\) as in (3.21)
    - Update \(\theta_{k+1} = \theta_k + \alpha_k g_k\)
To reduce the variance of the gradient estimator, we can use a minibatch of trajectories. For example, given a batch of \(N\) trajectories \(\{\tau^{(i)}\}_{i=1}^N\) collected by \(\pi_\theta\), define for each timestep the return-to-go \[ g_t^{(i)} = \sum_{t'=t}^{T^{(i)}-1} \gamma^{t'-t} R\!\left(s_{t'}^{(i)},a_{t'}^{(i)}\right). \] From (3.17), an unbiased gradient estimator is \[\begin{equation} \widehat{\nabla_\theta J} \;=\; \frac{1}{N}\sum_{i=1}^N \sum_{t=0}^{T^{(i)}-1} \gamma^t \nabla_\theta \log \pi_\theta \big(a_t^{(i)}\mid s_t^{(i)}\big) g_t^{(i)}. \tag{3.22} \end{equation}\]
This leads to the following minibatch REINFORCE algorithm; a code sketch follows the listing.
- Initialize \(\theta_0\) for the initial policy \(\pi_{\theta_0}(a \mid s)\)
- For \(k=0,1,\dots\), do:
    - Obtain \(N\) trajectories \(\{ \tau^{(i)} \}_{i=1}^N \sim p_{\theta_k}\)
    - Compute the stochastic gradient \(g_k\) as in (3.22)
    - Update \(\theta_{k+1} = \theta_k + \alpha_k g_k\)
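One possible PyTorch instantiation of this minibatch REINFORCE loop is sketched below. It assumes the `gymnasium` package with its standard environment interface (`env.reset()` / `env.step()`); the environment name, network architecture, optimizer, and hyperparameters are placeholder choices rather than prescriptions.

```python
import gymnasium as gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                     # placeholder discrete-action environment
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

# Logit network f_theta(s, .) defining a softmax policy pi_theta(a|s).
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)
gamma, N = 0.99, 8                                # discount factor and minibatch size

for k in range(200):                              # outer policy-gradient iterations
    surrogate = 0.0
    for _ in range(N):                            # collect N trajectories under pi_{theta_k}
        obs, _ = env.reset()
        log_probs, rewards, done = [], [], False
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))
            obs, reward, terminated, truncated, _ = env.step(action.item())
            rewards.append(float(reward))
            done = terminated or truncated
        # Return-to-go via the backward recursion g_t = r_t + gamma * g_{t+1},
        # accumulating the terms of the estimator (3.22) with a sign flip
        # (we minimize the negated surrogate, i.e., perform gradient ascent on J).
        g = 0.0
        for t in reversed(range(len(rewards))):
            g = rewards[t] + gamma * g
            surrogate = surrogate - (gamma ** t) * log_probs[t] * g
    loss = surrogate / N
    optimizer.zero_grad()
    loss.backward()                               # gradient equals the negative of (3.22)
    optimizer.step()
```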