Chapter 2 Exact Dynamic Programming
In Chapter 1, we introduced the basic formulation of the finite-horizon, discrete-time optimal control problem, presented the Bellman principle of optimality, and derived the dynamic programming (DP) algorithm. We mentioned that, although DP is a general-purpose algorithm, it can be difficult to implement exactly in practical applications.
In this chapter, we introduce two problem setups where DP can in fact be implemented exactly.
2.1 Linear Quadratic Regulator
Consider a linear discrete-time dynamical system \begin{equation} x_{k+1} = A_k x_k + B_k u_k + w_k, \quad k = 0,1,\dots,N-1, \tag{2.1} \end{equation} where x_k \in \mathbb{R}^n is the state, u_k \in \mathbb{R}^m is the control, w_k \in \mathbb{R}^n is an independent, zero-mean disturbance with a given probability distribution that does not depend on x_k, u_k, and A_k \in \mathbb{R}^{n \times n}, B_k \in \mathbb{R}^{n \times m} are known matrices determining the transition dynamics.
We want to solve the following optimal control problem \begin{equation} \min_{\mu_0,\dots,\mu_{N-1}} \mathbb{E} \left\{ x_N^T Q_N x_N + \sum_{k=0}^{N-1} \left( x_k^T Q_k x_k + u_k^T R_k u_k \right) \right\}, \tag{2.2} \end{equation} where the expectation is taken over the randomness in w_0,\dots,w_{N-1}. In (2.2), \{Q_k\}_{k=0}^{N} are positive semidefinite matrices, and \{R_k\}_{k=0}^{N-1} are positive definite matrices. The formulation (2.2) is typically known as the linear quadratic regulator (LQR) problem because the dynamics is linear, the cost is quadratic, and the formulation can be considered to “regulate” the system around the origin x = 0.
We will now show that the DP algorithm in Theorem 1.2 can be exactly implemented for LQR.
The DP algorithm computes the optimal cost-to-go backwards in time. The terminal cost is J_N(x_N) = x_N^T Q_N x_N by definition.
The optimal cost-to-go at time N-1 is equal to \begin{equation} J_{N-1}(x_{N-1}) = \min_{u_{N-1}} \mathbb{E}_{w_{N-1}} \left\{ x_{N-1}^T Q_{N-1} x_{N-1} + u_{N-1}^T R_{N-1} u_{N-1} + \Vert A_{N-1} x_{N-1} + B_{N-1} u_{N-1} + w_{N-1} \Vert_{Q_N}^2 \right\}, \tag{2.3} \end{equation} where \Vert v \Vert_Q^2 = v^T Q v for Q \succeq 0. Now observe that the objective in (2.3) is \begin{equation} \begin{split} x_{N-1}^T Q_{N-1} x_{N-1} + u_{N-1}^T R_{N-1} u_{N-1} + \Vert A_{N-1} x_{N-1} + B_{N-1} u_{N-1} \Vert_{Q_N}^2 + \\ \mathbb{E}_{w_{N-1}} \left[ 2(A_{N-1} x_{N-1} + B_{N-1} u_{N-1} )^T Q_N w_{N-1} \right] + \\ \mathbb{E}_{w_{N-1}} \left[ w_{N-1}^T Q_N w_{N-1} \right], \end{split} \end{equation} where the second line is zero because \mathbb{E}(w_{N-1}) = 0 and the third line is a constant with respect to u_{N-1}. Consequently, the optimal control u_{N-1}^\star can be computed by setting the derivative of the objective with respect to u_{N-1} equal to zero: \begin{equation} u_{N-1}^\star = - \left[ \left( R_{N-1} + B_{N-1}^T Q_N B_{N-1} \right)^{-1} B_{N-1}^T Q_N A_{N-1} \right] x_{N-1}. \tag{2.4} \end{equation} Plugging the optimal controller u^\star_{N-1} back into the objective of (2.3) leads to \begin{equation} J_{N-1}(x_{N-1}) = x_{N-1}^T S_{N-1} x_{N-1} + \mathbb{E} \left[ w_{N-1}^T Q_N w_{N-1} \right], \tag{2.5} \end{equation} with S_{N-1} = Q_{N-1} + A_{N-1}^T \left[ Q_N - Q_N B_{N-1} \left( R_{N-1} + B_{N-1}^T Q_N B_{N-1} \right)^{-1} B_{N-1}^T Q_N \right] A_{N-1}. We note that S_{N-1} is positive semidefinite (this is an exercise for you to convince yourself).
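To spell out the step leading to (2.4): dropping the terms that do not depend on u_{N-1} and writing A = A_{N-1}, B = B_{N-1} for brevity, we set the gradient with respect to u_{N-1} to zero, \begin{equation} \nabla_{u_{N-1}} \left( u_{N-1}^T R_{N-1} u_{N-1} + \Vert A x_{N-1} + B u_{N-1} \Vert_{Q_N}^2 \right) = 2 R_{N-1} u_{N-1} + 2 B^T Q_N \left( A x_{N-1} + B u_{N-1} \right) = 0, \end{equation} and solving for u_{N-1} (the coefficient matrix R_{N-1} + B^T Q_N B is invertible since R_{N-1} \succ 0 and Q_N \succeq 0) gives exactly (2.4).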
Now we realize that something surprising and nice has happened.
The optimal controller u^{\star}_{N-1} in (2.4) is a linear feedback policy of the state x_{N-1}, and
The optimal cost-to-go J_{N-1}(x_{N-1}) in (2.5) is quadratic in x_{N-1}, just the same as J_{N}(x_N).
This implies that, if we continue to compute the optimal cost-to-go at time N-2, we will again obtain a linear optimal controller and a quadratic optimal cost-to-go. This is a rare and convenient property of the LQR problem:
The (representation) complexity of the optimal controller and cost-to-go does not grow as we run the DP recursion backwards in time.
We summarize the solution for the LQR problem (2.2) as follows.
Proposition 2.1 (Solution of Discrete-Time Finite-Horizon LQR) The optimal controller for the LQR problem (2.2) is a linear state-feedback policy \begin{equation} \mu_k^\star(x_k) = - K_k x_k, \quad k=0,\dots,N-1. \tag{2.6} \end{equation} The gain matrix K_k can be computed as K_k = \left( R_k + B_k^T S_{k+1} B_k \right)^{-1} B_k^T S_{k+1} A_k, where the matrix S_k satisfies the following backwards recursion \begin{equation} \hspace{-6mm} \begin{split} S_N &= Q_N \\ S_k &= Q_k + A_k^T \left[ S_{k+1} - S_{k+1}B_k \left( R_k + B_k^T S_{k+1} B_k \right)^{-1} B_k^T S_{k+1} \right] A_k, k=N-1,\dots,0. \end{split} \tag{2.7} \end{equation} The optimal cost-to-go is given by J_0(x_0) = x_0^T S_0 x_0 + \sum_{k=0}^{N-1} \mathbb{E} \left[ w_k^T S_{k+1} w_k\right]. The recursion (2.7) is called the discrete-time Riccati equation.
Proposition 2.1 states that, to evaluate the optimal policy (2.6), one can first run the backwards Riccati equation (2.7) to compute all the positive semidefinite matrices S_k, and then compute the gain matrices K_k. For systems of reasonable dimensions, evaluating the matrix inversion in (2.7) should be fairly efficient.
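A minimal Matlab sketch of the recursion (2.7) might look as follows; the horizon N and the cell arrays A, B, Q, R are assumed to be defined by the user (these names are our own, not from the text), and time k = 0,\dots,N is mapped to Matlab indices k+1:

```matlab
% Backwards Riccati recursion (2.7) and LQR gains (Proposition 2.1).
% Assumes cell arrays A{1..N}, B{1..N}, R{1..N}, Q{1..N+1} and horizon N exist.
S = cell(N+1, 1);
K = cell(N, 1);
S{N+1} = Q{N+1};                         % S_N = Q_N
for k = N:-1:1                           % Matlab index k corresponds to time k-1
    M    = R{k} + B{k}' * S{k+1} * B{k};             % R_k + B_k^T S_{k+1} B_k
    K{k} = M \ (B{k}' * S{k+1} * A{k});              % gain K_k
    S{k} = Q{k} + A{k}' * (S{k+1} - S{k+1} * B{k} * (M \ (B{k}' * S{k+1}))) * A{k};
end
% Optimal policy: u_k = -K{k+1} * x_k   (note the Matlab index shift).
```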
2.1.1 Infinite-Horizon LQR
In many robotics applications, it is often more useful to study the infinite-horizon LQR problem \begin{align} \min_{u_k} & \quad \sum_{k=0}^{\infty} \left( x_k^T Q x_k + u_k^T R u_k \right) \tag{2.8} \\ \text{subject to} & \quad x_{k+1} = A x_k + B u_k, \quad k=0,\dots,\infty, \tag{2.9} \end{align} where Q \succeq 0, R \succ 0, and A,B are constant matrices. The reason for studying the formulation (2.8) is twofold. First, for nonlinear systems, we often linearize the nonlinear dynamics around an (equilibrium) point we care about, leading to constant A and B matrices. Second, we care more about the asymptotic effect of our controller than its behavior in a fixed number of steps. We will soon see an example of this formulation for balancing a simple pendulum.
The infinite-horizon formulation is essentially the finite-horizon formulation (2.2) with N \rightarrow \infty. Based on our intuition in deriving the finite-horizon LQR solution, we may want to hypothesize that the optimal cost-to-go is a quadratic function \begin{equation} J_{k}(x_{k}) = x_{k}^T S x_{k}, k=0,\dots,\infty \tag{2.10} \end{equation} for some positive definite matrix S, and proceed to invoke the DP algorithm. Notice that we hypothesize the matrix S is in fact stationary, i.e., it does not change with respect to time. This hypothesis makes sense because the A,B,Q,R matrices are stationary in the formulation (2.8). Invoking the DP algorithm we have \begin{equation} x_k^T S x_k = J_k(x_k) = \min_{u_k} \left\{ x_k^T Q x_k + u_k^T R u_k + \Vert \underbrace{A x_k + B u_k}_{x_{k+1}} \Vert_S^2 \right\}. \tag{2.11} \end{equation} The minimization over u_k in (2.11) can again be solved in closed-form by setting the gradient of the objective with respect to u_k to be zero \begin{equation} u_k^\star = - \underbrace{\left[ \left( R + B^T S B \right)^{-1} B^T S A \right]}_{K} x_k. \tag{2.12} \end{equation} Plugging the optimal u_k^\star back into (2.11), we see that the matrix S has to satisfy the following equation \begin{equation} S = Q + A^T \left[ S - SB \left( R + B^T S B \right)^{-1} B^T S \right] A. \tag{2.13} \end{equation} Equation (2.13) is the famous algebraic Riccati equation.
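To spell out the substitution step from (2.11) to (2.13): writing the minimizer (2.12) as u_k^\star = -K x_k with K = (R + B^T S B)^{-1} B^T S A, the minimized objective in (2.11) equals \begin{equation} x_k^T \left[ Q + K^T R K + (A - BK)^T S (A - BK) \right] x_k. \end{equation} Expanding the bracket and using K^T (R + B^T S B) K = K^T B^T S A, the bracketed matrix reduces to Q + A^T \left[ S - SB \left( R + B^T S B \right)^{-1} B^T S \right] A, so matching x_k^T S x_k for all x_k gives exactly (2.13).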
Let’s zoom out to see what we have done. We started with a hypothetical optimal cost-to-go (2.10) that is stationary, and invoked the DP algorithm in (2.11), which led us to the algebraic Riccati equation (2.13). Therefore, if there actually exists a solution to the algebraic Riccati equation (2.13), then the linear controller (2.12) is indeed optimal (by the optimality of DP)!
So the question boils down to whether the algebraic Riccati equation admits a positive definite solution S. The following proposition gives an answer.
Proposition 2.2 (Solution of Discrete-Time Infinite-Horizon LQR) Consider a linear system x_{k+1} = A x_k + B u_k, with (A,B) controllable (see Appendix C.2). Let Q \succeq 0 in (2.8) be such that Q can be written as Q = C^T C with (A,C) observable.
Then the optimal controller for the infinite-horizon LQR problem (2.8) is a stationary linear policy \mu^\star (x) = - K x, with K = \left( R + B^T S B \right)^{-1} B^T S A. The matrix S is the unique positive definite matrix that satisfies the algebraic Riccati equation S = Q + A^T \left[ S - SB \left( R + B^T S B \right)^{-1} B^T S \right] A.
Moreover, the closed-loop system x_{k+1} = A x_k + B (-K x_k) = (A - BK) x_k is stable, i.e., the eigenvalues of the matrix A - BK are strictly within the unit circle (see Appendix C.1.2).
A rigorous proof of Proposition 2.2 is available in Proposition 3.1.1 of (Bertsekas 2012). The proof essentially studies the limit of the discrete-time Riccati equation (2.7) as N \rightarrow \infty; indeed, the algebraic Riccati equation (2.13) is precisely that limit. The assumptions of (A,B) being controllable and (A,C) being observable can be relaxed to (A,B) being stabilizable and (A,C) being detectable (for definitions of stabilizability and detectability, see Appendix C).
We have not discussed how to solve the algebraic Riccati equation (2.13). It is clear that (2.13) is not a linear system of equations in S. In fact, numerical algorithms for solving the algebraic Riccati equation can be highly nontrivial; see, for example, (Arnold and Laub 1984). Fortunately, such algorithms are readily available, and as practitioners we rarely need to solve the algebraic Riccati equation ourselves. For example, the Matlab dlqr function computes the K and S matrices from A, B, Q, R.
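In practice one line suffices. Below is a minimal sketch with placeholder A, B, Q, R chosen arbitrarily for illustration, together with a naive fixed-point iteration of (2.13) that is sometimes handy for small problems:

```matlab
% Infinite-horizon LQR via the algebraic Riccati equation (2.13).
A = [1, 0.01; 0.098, 0.999];    % placeholder system matrices (a controllable pair)
B = [0; 0.01];
Q = eye(2);  R = 1;

[K, S, e] = dlqr(A, B, Q, R);   % K: gain, S: Riccati solution, e: eig(A - B*K)

% Naive alternative: iterate (2.13) as a fixed point, i.e., run the recursion (2.7)
% with constant matrices until S stops changing.
S2 = Q;
for it = 1:10000
    Snew = Q + A' * (S2 - S2*B*((R + B'*S2*B) \ (B'*S2))) * A;
    if norm(Snew - S2, 'fro') < 1e-12, S2 = Snew; break; end
    S2 = Snew;
end
% S2 should agree with S up to numerical tolerance.
```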
Let us now apply the infinite-horizon LQR solution to stabilizing a simple pendulum.
Example 2.1 (Pendulum Stabilization by LQR) Consider the simple pendulum in Fig. 2.1 with dynamics \begin{equation} x = \begin{bmatrix} \theta \\ \dot{\theta} \end{bmatrix}, \quad \dot{x} = f(x,u) = \begin{bmatrix} \dot{\theta} \\ -\frac{1}{ml^2}(b \dot{\theta} + mgl \sin \theta) + \frac{1}{ml^2} u \end{bmatrix} \tag{2.14} \end{equation} where m is the mass of the pendulum, l is the length of the pole, g is the gravitational constant, b is the damping ratio, and u is the torque applied to the pendulum.
We are interested in applying the LQR controller to balance the pendulum at the upright position x_d = [\pi,0]^T, i.e., with zero angular velocity.

Figure 2.1: A Simple Pendulum.
Let us first shift the dynamics so that “0” is the upright position. This can be done by defining a new variable z = x - x_d = [\theta - \pi, \dot{\theta}]^T, which leads to \begin{equation} \dot{z} = \dot{x} = f(x,u) = f(z + x_d,u) = \begin{bmatrix} z_2 \\ \frac{1}{ml^2} \left( u - b z_2 + mgl \sin z_1 \right) \end{bmatrix} = f'(z,u). \tag{2.15} \end{equation} We then linearize the nonlinear dynamics \dot{z} = f'(z,u) at the point z^\star = 0, u^\star = 0: \begin{align} \dot{z} & \approx f'(z^\star,u^\star) + \left( \frac{\partial f'}{\partial z} \right)_{z^\star,u^\star} (z - z^\star) + \left( \frac{\partial f'}{\partial u} \right)_{z^\star,u^\star} (u - u^\star) \\ & = \begin{bmatrix} 0 & 1 \\ \frac{g}{l} \cos z_1 & - \frac{b}{ml^2} \end{bmatrix}_{z^\star, u^\star} z + \begin{bmatrix} 0 \\ \frac{1}{ml^2} \end{bmatrix} u \\ & = \underbrace{\begin{bmatrix} 0 & 1 \\ \frac{g}{l} & - \frac{b}{ml^2} \end{bmatrix}}_{A_c} z + \underbrace{\begin{bmatrix} 0 \\ \frac{1}{ml^2} \end{bmatrix}}_{B_c} u. \end{align} Finally, we convert the continuous-time dynamics to discrete time with a fixed discretization h z_{k+1} = \dot{z}_k \cdot h + z_k = \underbrace{(h \cdot A_c + I )}_{A} z_k + \underbrace{(h \cdot B_c)}_{B} u_k.
We are now ready to implement the LQR controller. In the formulation (2.8), we choose Q = I, R = I, and compute the gain matrix K using the Matlab dlqr function.
Fig. 2.2 shows the simulation result for m=1,l=1,b=0.1, g = 9.8, and h = 0.01, with an initial condition z^0 = [0.1,0.1]^T. We can see that the LQR controller successfully stabilizes the pendulum at z^\star, the upright position.
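For reference, a minimal Matlab sketch of this experiment is given below; the simulation loop and variable names are our own, and the linked code may differ in its details:

```matlab
% LQR stabilization of the simple pendulum (Example 2.1).
m = 1; l = 1; b = 0.1; g = 9.8; h = 0.01;

% Linearization around the upright position, in the shifted coordinates z.
Ac = [0, 1; g/l, -b/(m*l^2)];
Bc = [0; 1/(m*l^2)];
A  = eye(2) + h*Ac;                % Euler discretization
B  = h*Bc;

% Infinite-horizon LQR gain with Q = I, R = I (R is scalar here).
Q = eye(2);  R = 1;
K = dlqr(A, B, Q, R);

% Simulate the *nonlinear* shifted dynamics (2.15) under u = -K z.
z = [0.1; 0.1];                    % initial condition z^0
T = 1000;  traj = zeros(2, T);
for k = 1:T
    u    = -K * z;
    zdot = [z(2); (u - b*z(2) + m*g*l*sin(z(1))) / (m*l^2)];
    z    = z + h * zdot;           % forward Euler step
    traj(:, k) = z;
end
plot((1:T)*h, traj');  xlabel('time [s]');  legend('z_1', 'z_2');
```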
You can play with the Matlab code here.

Figure 2.2: LQR stabilization of a simple pendulum.
2.1.2 LQR with Constraints
Let’s explore LQR with constraints in Exercise 9.2.
2.2 Markov Decision Process
In Section 2.1, we saw that linear dynamics and quadratic costs lead to exact dynamic programming. We now introduce another setup, in which the number of states and controls is finite (as opposed to the LQR case, where x_k and u_k live in continuous spaces). We will see that we can execute DP exactly in this setup as well.
Optimal control in the case of finite states and controls is typically introduced in the framework of a Markov Decision Process (MDP), a formulation that is common in reinforcement learning. There are many variations of an MDP; here we focus only on the discounted infinite-horizon MDP. For a more complete treatment of MDPs, I suggest checking out this course at Harvard.
Formally, a discounted infinite-horizon MDP \mathcal{M} = (\mathbb{X},\mathbb{U},P,g,\gamma,\sigma) is specified by
a state space \mathbb{X} that is finite with size |\mathbb{X}|
a control space \mathbb{U} that is finite with size |\mathbb{U}|
a transition function P: \mathbb{X} \times \mathbb{U} \rightarrow \Delta(\mathbb{X}), where \Delta(\mathbb{X}) is the space of probability distributions over \mathbb{X}; specifically, P(x' \mid x, u) is the probability of transitioning into state x' from state x using control u. If the system is deterministic, then P(x' \mid x, u) is nonzero only for a single next state x'
a cost function g: \mathbb{X} \times \mathbb{U} \rightarrow [0,1]; g(x,u) is the cost of taking the control u at state x
a discount factor \gamma \in [0,1)
an initial state distribution \sigma \in \Delta(\mathbb{X}) that specifies how the initial state x_0 is generated; in many cases we will assume x_0 is fixed and \sigma is a distribution supported only on x_0.
In an MDP, the system starts at some state x_0 \sim \sigma. At each step k = 0,1,2,\dots, the system decides a control u_k \in \mathbb{U} and incurs a cost g(x_k,u_k). The control u_k brings the system into a new state x_{k+1} \sim P(\cdot \mid x_k, u_k), at which the controller decides a new control u_{k+1}. This process continues forever.
Controller (policy). In general, a time-varying controller \pi = (\pi_0,\dots,\pi_k,\dots) is a mapping from all previous states and controls to a distribution over current controls. The mapping \pi_k at timestep k is \pi_k: (x_0,u_0,x_1,u_1,\dots,x_k) \mapsto u_k \sim q_k \in \Delta(\mathbb{U}). Note that u_k can be randomized and it is drawn from a distribution q_k supported on the set of controls \mathbb{U}. A stationary controller (policy) \pi: \mathbb{X} \rightarrow \Delta(\mathbb{U}) specifies a decision-making strategy that is purely based on the current state x_k. A deterministic and stationary controller \pi: \mathbb{X} \rightarrow \mathbb{U} executes a deterministic control u_k at each step.
Cost-to-go and Q-value. Given a controller \pi and an initial state x_0, we associate with it the following discounted infinite-horizon cost \begin{equation} J_\pi(x_0) = \mathbb{E} \left\{ \sum_{k=0}^{\infty} \gamma^k g(x_k, u_k^{\pi}) \right\}, \tag{2.16} \end{equation} where the expectation is taken over the randomness of the transition P and the controller \pi. Note that we have used u^{\pi}_k to denote the control at step k by following the controller \pi. Similarly, we define the Q-value function as \begin{equation} Q_\pi(x_0,u_0) = \mathbb{E} \left\{ g(x_0,u_0) + \sum_{k=1}^{\infty} \gamma^k g(x_k,u_k^{\pi}) \right\}. \tag{2.17} \end{equation} The difference between Q_\pi(x_0,u_0) and J_\pi(x_0) is that at step zero, J_\pi(x_0) follows the controller \pi while Q_\pi(x_0,u_0) assumes the control u_0 is given. By the assumption that g(x_k,u_k) \in [0,1], we have 0 \leq J_\pi(x_0), Q_\pi(x_0,u_0) \leq \sum_{k=0}^{\infty} \gamma^k = \frac{1}{1-\gamma}, \quad \forall \pi.
Our goal is to find the best controller that minimizes the cost function \begin{equation} \pi^\star \in \arg\min_{\pi \in \Pi} J_\pi(x_0) \tag{2.18} \end{equation} for a given initial state x_0, where \Pi is the space of all non-stationary and randomized controllers.
A remarkable property of MDPs is that there exists an optimal controller that is stationary and deterministic.
Theorem 2.1 (Deterministic and Stationary Optimal Policy) Let \Pi be the space of all non-stationary and randomized policies. Define J^\star(x) = \min_{\pi \in \Pi} J_\pi(x), \quad Q^\star(x,u) = \min_{\pi \in \Pi } Q_\pi(x,u). There exists a deterministic and stationary policy \pi^\star such that for all x \in \mathbb{X} and u \in \mathbb{U}, J_{\pi^\star}(x) = J^\star(x), \quad Q_{\pi^\star}(x,u) = Q^\star(x,u). We call such a policy \pi^\star an optimal policy.
Proof. See Theorem 1.7 in (Agarwal et al. 2022).
This Theorem shows that we can restrict ourselves to stationary and deterministic policies without losing performance.
Next, we show how to characterize the optimal policy and value function.
2.2.1 Bellman Optimality Equations
We now restrict ourselves to stationary policies. We first introduce the Bellman Consistency Equations for stationary policies.
Lemma 2.1 (Bellman Consistency Equations) Let \pi be a stationary policy. Then J_\pi and Q_\pi satisfy the following Bellman consistency equations \begin{align} J_\pi(x) &= \mathbb{E}_{u \sim \pi(\cdot \mid x)} Q_\pi(x,u), \tag{2.19} \\ Q_\pi(x,u) &=g(x,u) + \gamma \mathbb{E}_{x' \sim P(\cdot \mid x,u)} J_\pi(x'). \tag{2.20} \end{align}
Proof. By the definition of the cost-to-go function in (2.16), we have J_\pi(x_0) = \mathbb{E}\left\{ \sum_{k=0}^{\infty} \gamma^k g(x_k, \pi(x_k)) \right\} = \mathbb{E}_{u_0 \sim \pi(\cdot \mid x_0)} \underbrace{\mathbb{E} \left\{ g(x_0,u_0) + \sum_{k=1}^{\infty} \gamma^k g(x_k,\pi(x_k)) \right\}}_{Q_\pi(x_0,u_0)}. The above equation holds for any x_0, proving (2.19).
To show (2.20), we recall the definition of the Q-value function (2.17) \begin{align} Q_\pi(x_0,u_0) & = \mathbb{E} \left\{ g(x_0,u_0) + \sum_{k=1}^{\infty} \gamma^k g(x_k,\pi(x_k)) \right\} \\ &= g(x_0,u_0) + \gamma \mathbb{E} \left\{ \sum_{k=1}^{\infty} \gamma^{k-1} g(x_k,\pi(x_k)) \right\}. \tag{2.21} \end{align} Now observe that the expectation of the second term in (2.21) is taken over both the randomness of x_1 and the randomness of the policy after x_1 is reached. Therefore, \mathbb{E} \left\{ \sum_{k=1}^{\infty} \gamma^{k-1} g(x_k,\pi(x_k)) \right\} = \mathbb{E}_{x_1 \sim P(\cdot \mid x_0,u_0)} \underbrace{\left\{ \mathbb{E} \left\{ \sum_{k=1}^{\infty} \gamma^{k-1} g(x_k,\pi(x_k)) \,\middle|\, x_1 \right\} \right\}}_{J_\pi(x_1)}. Plugging the above equation back into (2.21), we obtain the desired result in (2.20).
Matrix Representation. It is useful to think of P,g,J_\pi,Q_\pi as matrices and vectors. In particular, the transition function P can be considered as a matrix of dimension |\mathbb{X}||\mathbb{U}| \times |\mathbb{X}|, where P_{(x,u),x'} = P(x' \mid x,u) is the entry of P at the row (x,u) (there are |\mathbb{X}||\mathbb{U}| such rows) and column x' (there are |\mathbb{X}| such columns). The running cost g is a vector with |\mathbb{X}||\mathbb{U}| entries. The cost-to-go J_\pi(x) is a vector with |\mathbb{X}| entries. The Q-value function Q_\pi(x,u) is a vector with |\mathbb{X}||\mathbb{U}| entries. We also introduce P^{\pi}, of dimension |\mathbb{X}||\mathbb{U}| \times |\mathbb{X}||\mathbb{U}|, as the transition matrix induced by a stationary policy \pi. In particular, P^\pi_{(x,u),(x',u')} = P(x' \mid x,u) \pi(u'\mid x'). In words, P^\pi_{(x,u),(x',u')} is the probability that (x',u') follows (x,u).
With the matrix representation, we can compactly write the Bellman consistency equation (2.20) as \begin{equation} Q_\pi = g + \gamma P J_\pi. \tag{2.22} \end{equation} We can also combine (2.19) and (2.20) and write Q_\pi(x,u) = g(x,u) + \gamma \mathbb{E}_{x' \sim P(\cdot \mid x,u)} \left\{ \mathbb{E}_{u' \sim \pi(\cdot \mid x')} Q_\pi(x',u') \right\} = g(x,u) + \gamma \mathbb{E}_{(x',u') \sim P^\pi(\cdot \mid (x,u))} Q_\pi(x',u'), which, using matrix representation, becomes \begin{equation} Q_\pi = g + \gamma P^\pi Q_\pi. \tag{2.23} \end{equation} Equation (2.23) immediately yields \begin{equation} Q_\pi = (I - \gamma P^\pi)^{-1} g, \tag{2.24} \end{equation} that is, the Q-value function associated with a stationary policy \pi can simply be computed by solving the linear system in (2.24) (one can show that the matrix I - \gamma P^\pi is indeed invertible; see Corollary 1.5 in (Agarwal et al. 2022)).
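To make (2.23)-(2.24) concrete, here is a small Matlab sketch of exact policy evaluation on a randomly generated MDP; the sizes, the random P, g, and the random policy below are all made up for illustration, and state-control pairs (x,u) are indexed as (u-1)*nx + x:

```matlab
% Exact policy evaluation by solving the linear system (2.24).
nx = 5;  nu = 3;  gamma = 0.9;

P  = rand(nx*nu, nx);  P  = P  ./ sum(P, 2);   % each row of P is a distribution over x'
g  = rand(nx*nu, 1);                           % running cost in [0,1]
Pi = rand(nx, nu);     Pi = Pi ./ sum(Pi, 2);  % stationary stochastic policy pi(u|x)

% Build P^pi:  P^pi_{(x,u),(x',u')} = P(x'|x,u) * pi(u'|x').
Ppi = zeros(nx*nu, nx*nu);
for x = 1:nx
  for u = 1:nu
    i = (u-1)*nx + x;
    for xp = 1:nx
      for up = 1:nu
        Ppi(i, (up-1)*nx + xp) = P(i, xp) * Pi(xp, up);
      end
    end
  end
end

Qpi = (eye(nx*nu) - gamma * Ppi) \ g;          % (2.24)
norm(g + gamma * Ppi * Qpi - Qpi)              % consistency check of (2.23), ~ 0
```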
Lemma 2.1, together with the equivalent matrix equations (2.22) and (2.23), provides the conditions that J_\pi and Q_\pi, induced by any stationary policy \pi, need to satisfy. Next, we describe the conditions that characterize the optimal policy.
Theorem 2.2 (Bellman Optimality Equations) A vector Q \in \mathbb{R}^{|\mathbb{X}||\mathbb{U}|} is said to satisfy the Bellman optimality equation if \begin{equation} Q(x,u) = g(x,u) + \gamma \mathbb{E}_{x' \sim P(\cdot \mid x,u)} \left\{ \min_{u' \in \mathbb{U}} Q(x',u') \right\}, \quad \forall (x,u) \in \mathbb{X} \times \mathbb{U}. \tag{2.25} \end{equation}
A vector Q^\star is the optimal Q-value function if and only if it satisfies (2.25). Moreover, the deterministic policy defined by \pi^\star(x) \in \arg\min_{u \in \mathbb{U}} Q^\star(x,u) with ties broken arbitrarily is an optimal policy.
Proof. See Theorem 1.8 in (Agarwal et al. 2022).
We now make a few definitions to interpret Theorem 2.2. For any vector Q \in \mathbb{R}^{|\mathbb{X}||\mathbb{U}|}, define the greedy policy as \begin{equation} \pi_{Q}(x) \in \arg\min_{u \in \mathbb{U}} Q(x,u) \tag{2.26} \end{equation} with ties broken arbitrarily. With this notation, by Theorem 2.2, the optimal policy is \pi^\star = \pi_{Q^\star}, where Q^\star is the optimal Q-value function. Similarly, let us define J_Q(x) = \min_{u \in \mathbb{U}} Q(x,u). Note that J_Q has dimension |\mathbb{X}|. With these notations, the Bellman optimality operator is defined as \begin{equation} \mathcal{T}Q = g + \gamma P J_Q, \tag{2.27} \end{equation} which is nothing but a matrix representation of the right-hand side of (2.25). This allows us to concisely write the Bellman optimality equation (2.25) as \begin{equation} Q = \mathcal{T}Q. \tag{2.28} \end{equation} Therefore, an equivalent way to interpret Theorem 2.2 is that Q = Q^\star if and only if Q is a fixed point to the Bellman optimality operator \mathcal{T}.
2.2.2 Value Iteration
Interpreting the optimal Q-value function as the fixed point of the Bellman optimality operator (2.28) leads us to a natural algorithm for solving the optimal control problem.
We start with Q^{(0)} being an all-zero vector and then at iteration t, we perform Q^{(t+1)} \leftarrow \mathcal{T} Q^{(t)}, with \mathcal{T} defined in (2.27). Let us observe the simplicity of this algorithm: at each iteration, one only needs to perform \min_{u \in \mathbb{U}} Q^{(t)}(x,u), which is very efficient when |\mathbb{U}| is not too large.
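In the matrix representation, one sweep of \mathcal{T} takes only a few lines of Matlab. The sketch below reuses the made-up P, g, and the (u-1)*nx + x pair indexing from the policy-evaluation snippet above:

```matlab
% Value iteration: Q <- T Q = g + gamma * P * J_Q,  with  J_Q(x) = min_u Q(x,u).
Q = zeros(nx*nu, 1);                           % Q^{(0)} = 0
for t = 1:1000
    Jq   = min(reshape(Q, nx, nu), [], 2);     % J_Q, an nx-vector
    Qnew = g + gamma * P * Jq;                 % Bellman optimality operator (2.27)
    if max(abs(Qnew - Q)) < 1e-8
        Q = Qnew;  break;
    end
    Q = Qnew;
end
[~, greedy] = min(reshape(Q, nx, nu), [], 2);  % greedy policy (2.26): one control per state
```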
The next theorem states that this simple algorithm converges to the optimal value function.
Theorem 2.3 (Value Iteration) Set Q^{(0)} = 0. For t=0,\dots, perform Q^{(t+1)} \leftarrow \mathcal{T} Q^{(t)}. Let \pi^{(t)} = \pi_{Q^{(t)}} (see the definition in (2.26)). For t \geq \frac{\log \frac{2}{(1-\gamma)^2 \epsilon}}{1-\gamma}, we have J^{\pi^{(t)}} \leq J^\star + \epsilon \mathbb{1}, where \mathbb{1} is the all-ones vector.
Essentially, the value function obtained from value iteration converges to the optimal cost-to-go.
Let us use an example to appreciate this algorithm.
Example 2.2 (Shortest Path in Grid World) Consider the following 10 \times 10 grid world, where the top-right cell is the goal location, and the dark blue colored cells are obstacles.

Figure 2.3: Grid World with Obstacles.
We want to find the shortest path from a given cell to the target cell, while not hitting obstacles.
To do so, we set the state space of the system as \mathbb{X} = \left\{ \begin{bmatrix} r \\ c \end{bmatrix} \middle\vert r,c \in \{ 1,\dots,10 \} \right\} where r is the row index (from top to bottom) and c is the column index (from left to right). The control space consists of moving down, up, right, left, or doing nothing: \mathbb{U} = \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \begin{bmatrix} 0 \\ -1 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \end{bmatrix} \right\}. The system dynamics are deterministic: x' = \begin{cases} x + u & \text{if } x + u \text{ is inside the grid} \\ x & \text{otherwise} \end{cases}. We then design the following running cost function g: g(x,u) = \begin{cases} 0 & \text{if } x = [1,10]^T \text{ is the target} \\ 20 & \text{if } x \text{ is an obstacle} \\ 1 & \text{otherwise} \end{cases}. Note that g(x,u) defined above does not even satisfy g \in [0,1], and we also take \gamma = 1; value iteration still behaves well here because the target cell is absorbing and cost-free, so the cost-to-go remains finite. We then use value iteration to solve the optimal control problem J(x_0) = \min_{\pi} \sum_{k=0}^{\infty} g(x_k,\pi(x_k)).
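For concreteness, a compact Matlab sketch of this construction and the value iteration loop is given below; the obstacle set here is chosen arbitrarily for illustration and does not reproduce the layout in Fig. 2.3:

```matlab
% Value iteration for the grid-world shortest-path problem (Example 2.2).
n = 10;  nS = n*n;
U = [1 0; -1 0; 0 1; 0 -1; 0 0];           % down, up, right, left, stay
goal = [1, 10];
obst = [4 4; 4 5; 5 4; 5 5; 7 8];          % arbitrary obstacle cells (illustration only)

idx = @(r, c) (c-1)*n + r;                  % linear index of cell (r, c)
J = zeros(nS, 1);                           % cost-to-go, initialized to zero

for it = 1:500
    Jnew = J;
    for r = 1:n
        for c = 1:n
            % running cost at the current cell: 0 at the goal, 20 on obstacles, 1 otherwise
            if r == goal(1) && c == goal(2)
                gcost = 0;
            elseif ismember([r c], obst, 'rows')
                gcost = 20;
            else
                gcost = 1;
            end
            best = inf;
            for a = 1:size(U, 1)
                rn = r + U(a, 1);  cn = c + U(a, 2);
                if rn < 1 || rn > n || cn < 1 || cn > n, rn = r; cn = c; end
                best = min(best, gcost + J(idx(rn, cn)));   % gamma = 1
            end
            Jnew(idx(r, c)) = best;
        end
    end
    if max(abs(Jnew - J)) < 1e-9, J = Jnew; break; end
    J = Jnew;
end
surf(reshape(J, n, n));   % visualize the optimal cost-to-go
```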
The Matlab script of value iteration converges in 27 iterations, and we obtain the optimal cost-to-go in Fig. 2.4.
Starting from the cell [8,5]^T, the red line in Fig. 2.4 plots the optimal trajectory that clearly avoids the obstacles.
Feel free to play with the size of the grid and the number of obstacles.

Figure 2.4: Optimal cost-to-go and an optimal trajectory.
Example 2.2 shows the simplicity and power of value iteration. However, the states and controls in the grid world are naturally discrete and finite. Is it possible to apply value iteration to optimal control problems where the states and controls live in continuous spaces?
2.2.3 Value Iteration with Barycentric Interpolation
Let us consider the discrete-time dynamics x_{k+1} = f(x_k, u_k) where both x_k and u_k live in a continuous space, say \mathbb{R}^{n} and \mathbb{R}^m, respectively.
A natural idea for applying value iteration is to discretize the state space and control space. For example, suppose x \in \mathbb{R}^2 and we have discretized \mathbb{R}^2 using N points \mathcal{S} = \{s_1,\dots,s_N\} that lie on a 2D grid, as shown in Fig. 2.5. Assuming x_k \in \mathcal{S} lies on the mesh grid, the next state x_{k+1} = f(x_k,u_k) will, however, most likely not lie exactly on one of the grid points.

Figure 2.5: Barycentric Interpolation.
Nevertheless, x_{k+1} will lie inside a triangle with vertices s_p, s_q, s_m. We will now try to write x_{k+1} using the vertices, that is, to find three numbers \lambda_p, \lambda_q, \lambda_m such that \lambda_p, \lambda_q, \lambda_m \geq 0, \quad \lambda_p + \lambda_q + \lambda_m = 1, \quad x_{k+1} = \lambda_p s_p + \lambda_q s_q + \lambda_m s_m. \lambda_p,\lambda_q,\lambda_m are called the barycentric coordinates of x_{k+1} in the triangle formed by s_p,s_q,s_m. With the barycentric coordinates, we will assign the transition matrix P(x_{k+1}=s_p \mid x_k,u_k) = \lambda_p, \quad P(x_{k+1}=s_q \mid x_k,u_k) = \lambda_q, \quad P(x_{k+1}=s_m \mid x_k,u_k) = \lambda_m.
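Computing barycentric coordinates amounts to solving a small linear system; here is a minimal Matlab sketch with arbitrary vertex locations chosen for illustration:

```matlab
% Barycentric coordinates of a point x inside the triangle (sp, sq, sm):
% solve [sp sq sm; 1 1 1] * lambda = [x; 1];  lambda >= 0 when x is inside.
sp = [0; 0];  sq = [1; 0];  sm = [0; 1];     % arbitrary triangle vertices
x  = [0.3; 0.4];                             % a point inside the triangle

M      = [sp, sq, sm; 1, 1, 1];
lambda = M \ [x; 1];                         % lambda = [lambda_p; lambda_q; lambda_m]

% These weights define the interpolated transition probabilities:
% P(x_{k+1} = sp | x_k, u_k) = lambda(1),  and similarly for sq, sm.
```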
Let us apply value iteration with barycentric interpolation to the simple pendulum.
Example 2.3 (Value Iteration with Barycentric Interpolation on a Simple Pendulum) Consider the continuous-time pendulum dynamics in (2.15), which is already shifted such that z=0 corresponds to the upright position. With time discretization h, we can write the discrete-time dynamics as z_{k+1} = \dot{z}_k \cdot h + z_k = f'(z_k,u_k) \cdot h + z_k. We are interested in solving the optimal control problem J(z_0) = \min_{u_k} \left\{ \sum_{k=0}^{\infty} \gamma^k g(z_k,u_k) \right\}, where the running cost is simply g(z_k,u_k) = z_k^T z_k + u_k^2.
We will use the parameters m=1,g=9.8,l=1,b=0.1, and assume the control is bounded in [-4.9,4.9].
We want to compute the optimal cost-to-go in the range z_1 \in [-\pi,\pi] and z_2 \in [-\pi,\pi]. We discretize both z_1 and z_2 using N points, leading to N^2 points in the state space. We also discretize u using N points.
Applying value iteration with \gamma=0.9 and N=50, we obtain the optimal cost-to-go in Fig. 2.6. The value iteration converges in 277 iterations.


Figure 2.6: Optimal cost-to-go with discount factor 0.9.
Applying value iteration with \gamma=0.99 and N=50, we obtain the optimal cost-to-go in Fig. 2.7. The value iteration converges in 2910 iterations.


Figure 2.7: Optimal cost-to-go with discount factor 0.99.
Applying value iteration with \gamma=0.999 and N=50, we obtain the optimal cost-to-go in Fig. 2.8. The value iteration converges in 28850 iterations.


Figure 2.8: Optimal cost-to-go with discount factor 0.999.
You can find the Matlab code here.
References