Chapter 1 The Optimal Control Formulation

1.1 The Basic Problem

Consider a discrete-time dynamical system \[\begin{equation} x_{k+1} = f_k (x_k, u_k, w_k), \quad k =0,1,\dots,N-1 \tag{1.1} \end{equation}\] where

  • \(x_k \in \mathbb{X} \subseteq \mathbb{R}^n\) is the state of the system,

  • \(u_k \in \mathbb{U} \subseteq \mathbb{R}^m\) is the control we wish to design,

  • \(w_k \in \mathbb{W} \subseteq \mathbb{R}^p\) is a random disturbance or noise (e.g., due to unmodelled dynamics), described by a probability distribution \(P_k(\cdot \mid x_k, u_k)\) that may depend on \(x_k\) and \(u_k\) but not on the prior disturbances \(w_0,\dots,w_{k-1}\),

  • \(k\) indexes the discrete time,

  • \(N\) denotes the horizon,

  • \(f_k\) models the transition function of the system (typically \(f_k \equiv f\) is time-invariant, especially for robotics systems; we use \(f_k\) here to keep full generality).
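
To fix ideas, below is a minimal Python sketch of one transition of (1.1) for a hypothetical scalar system \(x_{k+1} = x_k + u_k + w_k\) with Gaussian noise; the model, noise level, and numbers are illustrative only and not part of the formulation above.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w):
    """Transition function of a toy scalar system: x_{k+1} = x_k + u_k + w_k."""
    return x + u + w

x0 = 1.0                               # initial state
u0 = -0.5                              # some admissible control
w0 = 0.1 * rng.standard_normal()       # disturbance w_0 ~ N(0, 0.01)
x1 = f(x0, u0, w0)                     # next state is a random variable through w_0
print(x1)
```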

Remark (Deterministic vs. Stochastic). When \(w_k \equiv 0\) for all \(k\), we say the system (1.1) is deterministic; otherwise we say the system is stochastic. In the following we will deal with the stochastic case, but most of the methodology should carry over to the deterministic setup.

We consider the class of controllers (also called policies) that consist of a sequence of functions \[ \pi = \{ \mu_0,\dots,\mu_{N-1} \}, \] where \(\mu_k (x_k) \in \mathbb{U}\) for all \(x_k\), i.e., \(\mu_k\) is a feedback controller that maps the state to an admissible control. Given an initial state \(x_0\) and an admissible policy \(\pi\), the state trajectory of the system is a sequence of random variables that evolve according to \[\begin{equation} x_{k+1} = f_k(x_k,\mu_k(x_k),w_k), \quad k=0,\dots,N-1 \tag{1.2} \end{equation}\] where the randomness comes from the disturbance \(w_k\).

We assume the control trajectory \(\{u_k\}_{k=0}^{N-1}\) and the state trajectory \(\{x_k \}_{k=0}^{N}\) induce an additive cost \[\begin{equation} g_N(x_N) + \sum_{k=0}^{N-1} g_k(x_k,u_k) \tag{1.3} \end{equation}\] where \(g_k\), \(k=0,\dots,N\), are user-designed functions.

With (1.2) and (1.3), for any admissible policy \(\pi\), we denote its induced expected cost with initial state \(x_0\) as \[\begin{equation} J_\pi (x_0) = \mathbb{E} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k (x_k, \mu_k(x_k)) \right\}, \tag{1.4} \end{equation}\] where the expectation is taken over the randomness of the disturbances \(w_0,\dots,w_{N-1}\).
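
As a concrete (hypothetical) illustration of (1.2)–(1.4), the sketch below rolls out a simple feedback policy \(\mu_k(x) = -0.5x\) on the toy scalar system from before and estimates \(J_\pi(x_0)\) by Monte Carlo; all model choices (dynamics, costs, noise level, horizon) are illustrative, not prescribed by the formulation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20                                   # horizon

def f(x, u, w):  return x + u + w        # toy dynamics x_{k+1} = x_k + u_k + w_k
def g(x, u):     return x**2 + u**2      # stage cost g_k(x, u)
def g_term(x):   return x**2             # terminal cost g_N(x)
def mu(k, x):    return -0.5 * x         # feedback policy mu_k(x), same at every stage

def rollout_cost(x0):
    """One realization of the additive cost (1.3) under the policy."""
    x, cost = x0, 0.0
    for k in range(N):
        u = mu(k, x)
        cost += g(x, u)
        x = f(x, u, 0.1 * rng.standard_normal())
    return cost + g_term(x)

# Monte Carlo estimate of the expected cost J_pi(x0) in (1.4)
x0 = 1.0
J_hat = np.mean([rollout_cost(x0) for _ in range(10_000)])
print(f"estimated J_pi({x0}) ~ {J_hat:.3f}")
```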

Definition 1.1 (Discrete-time, Finite-horizon Optimal Control) Find the best admissible controller that minimizes the expected cost in (1.4) \[\begin{equation} \pi^\star \in \arg\min_{\pi \in \Pi} J_\pi(x_0), \end{equation}\] where \(\Pi\) is the set of all admissible controllers. The cost attained by the optimal controller, i.e., \(J^\star(x_0) = J_{\pi^\star}(x_0)\), is called the optimal cost-to-go, or the optimal value function.

Remark (Open-loop vs. Closed-loop). An important feature of the basic problem in Definition 1.1 is that the problem seeks feedback policies, instead of numerical values of the controls, i.e., \(u_k = \mu_k(x_k)\) is in general a function of the state \(x_k\). In other words, the controls are executed sequentially, one at a time, after observing the state at each time step. This is called closed-loop control, and is in general better than open-loop control \[ \min_{u_0,\dots,u_{N-1}} \mathbb{E} \left\{ g_N(x_N) + \sum_{k=0}^{N-1} g_k (x_k, u_k) \right\} \] where all the controls are planned at \(k=0\). Intuitively, a closed-loop policy is able to utilize the extra information received at each timestep (i.e., it observes \(x_{k+1}\) and hence also observes the disturbance \(w_k\)) to obtain a lower cost than an open-loop controller. Example 1.2.1 in (Bertsekas 2012) gives a concrete application where a closed-loop policy attains a lower cost than an open-loop policy.

In deterministic control (i.e., when \(w_k \equiv 0,\forall k\)), however, a closed-loop policy has no advantage over an open-loop controller. This is because at \(k=0\) the open-loop controller can already predict perfectly the consequences of all its actions, so there is no extra information to be gained at later time steps. In fact, even in stochastic problems, a closed-loop policy may offer no advantage; see Exercise 1.27 in (Bertsekas 2012).
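
The sketch below illustrates this gap numerically on the same toy system: the closed-loop controller applies a feedback law to the realized state, while the open-loop controller commits at \(k=0\) to the control sequence computed on the nominal (noise-free) trajectory. The feedback law \(u = -0.5x\) is an arbitrary illustrative choice, not an optimal policy, so the numbers are only indicative of the general phenomenon.

```python
import numpy as np

rng = np.random.default_rng(1)
N, x0, sigma = 20, 1.0, 0.3

def f(x, u, w): return x + u + w          # toy dynamics, as before
def g(x, u):    return x**2 + u**2        # stage cost; terminal cost is x**2

def closed_loop_cost():
    """One noisy rollout where the control reacts to the realized state."""
    x, cost = x0, 0.0
    for _ in range(N):
        u = -0.5 * x
        cost += g(x, u)
        x = f(x, u, sigma * rng.standard_normal())
    return cost + x**2

# Open loop: plan all controls at k = 0 on the nominal (w = 0) trajectory.
u_plan, x_nom = [], x0
for _ in range(N):
    u_plan.append(-0.5 * x_nom)
    x_nom = f(x_nom, u_plan[-1], 0.0)

def open_loop_cost():
    """One noisy rollout where the pre-planned controls are applied blindly."""
    x, cost = x0, 0.0
    for u in u_plan:
        cost += g(x, u)
        x = f(x, u, sigma * rng.standard_normal())
    return cost + x**2

M = 10_000
print("closed-loop expected cost ~", np.mean([closed_loop_cost() for _ in range(M)]))
print("open-loop  expected cost ~", np.mean([open_loop_cost() for _ in range(M)]))
```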

1.2 Dynamic Programming and Principle of Optimality

We now introduce a general and powerful algorithm, namely dynamic programming (DP), for solving the optimal control problem 1.1. The DP algorithm builds upon a quite simple intuition called the Bellman principle of optimality.

Theorem 1.1 (Bellman Principle of Optimality) Let \(\pi^\star = \{ \mu_0^\star,\mu_1^\star,\dots,\mu_{N-1}^\star \}\) be an optimal policy for the optimal control problem 1.1. Assume that when using \(\pi^\star\), a given state \(x_i\) occurs at timestep \(i\) with positive probability (i.e., \(x_i\) is reachable at time \(i\)).

Now consider the following subproblem where we are at \(x_i\) at time \(i\) and wish to minimize the cost-to-go from time \(i\) to time \(N\) \[ \min_{\mu_i,\dots,\mu_{N-1}} \mathbb{E} \left\{ g_N(x_N) + \sum_{k=i}^{N-1} g_k (x_k, \mu_k(x_k)) \right\}. \] Then the truncated policy \(\{\mu^\star_i,\mu^\star_{i+1},\dots, \mu^\star_{N-1}\}\) must be optimal for the subproblem.

Theorem 1.1 can be proved intuitively by contradiction: suppose the truncated policy \(\{\mu^\star_i,\mu^\star_{i+1},\dots, \mu^\star_{N-1}\}\) were not optimal for the subproblem; then there would exist a different policy \(\{\mu_i',\mu_{i+1}',\dots, \mu_{N-1}'\}\) that attains a lower cost for the subproblem starting at \(x_i\) at time \(i\). But then the combined policy \(\{\mu_0^\star,\dots,\mu^\star_{i-1},\mu_i',\dots,\mu_{N-1}'\}\) would attain a lower cost for the original optimal control problem 1.1, because the additive cost structure leaves the cost of the first \(i\) stages unaffected, contradicting the optimality of \(\pi^\star\).

The Bellman principle of optimality is more than just a principle; it is also an algorithm. It suggests that, to build an optimal policy, one can start by solving the last-stage subproblem to obtain \(\{\mu^\star_{N-1} \}\), and then proceed to solve the subproblem containing the last two stages to obtain \(\{ \mu^\star_{N-2},\mu^\star_{N-1} \}\). The recursion continues until optimal policies at all stages are computed. The following theorem formalizes this concept.

Theorem 1.2 (Dynamic Programming) The optimal value function \(J^\star(x_0)\) of the optimal control problem 1.1 (starting from any given initial condition \(x_0\)) is equal to \(J_0(x_0)\), which can be computed backwards and recursively as \[\begin{align} J_N(x_N) &= g_N(x_N) \\ J_k(x_k) &= \min_{u_k \in \mathbb{U}} \displaystyle \mathbb{E}_{w_k \sim P_k(\cdot \mid x_k, u_k)} \displaystyle \left\{ g_k(x_k,u_k) + J_{k+1}(f_k(x_k,u_k,w_k) ) \right\}, \ k=N-1,\dots,1,0. \tag{1.5} \end{align}\] Moreover, if \(u_k^\star = \mu_k^\star(x_k)\) is a minimizer of (1.5) for every \(x_k\), then the policy \(\pi^\star = \{\mu_0^\star,\dots,\mu_{N-1}^\star \}\) is optimal.

Proof. For any admissible policy \(\pi = \{ \mu_0,\dots,\mu_{N-1} \}\), denote by \(\pi^k = \{ \mu_k,\dots,\mu_{N-1} \}\) the last-\((N-k)\)-stage truncated policy. Consider the subproblem consisting of the last \(N-k\) stages starting from \(x_k\), and let \(J^\star_k(x_k)\) be its optimal cost-to-go. Mathematically, this is \[\begin{equation} J^\star_{k}(x_k) = \min_{\pi^k} \mathbb{E}_{w_k,\dots,w_{N-1}} \left\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i (x_i,\mu_i(x_i)) \right\}, \quad k=0,1,\dots,N-1. \tag{1.6} \end{equation}\] We define \(J^\star_N(x_N) = g_N(x_N)\) for \(k=N\).

Our goal is to prove that the \(J_k(x_k)\) computed by the dynamic programming recursion (1.5) equals \(J^\star_k (x_k)\) for all \(k=0,\dots,N\). We prove this by backward induction on \(k\).

Firstly, we already have \(J^\star_N(x_N) = J_N(x_N) = g_N(x_N)\), so the base case \(k=N\) holds automatically.

Now we assume \(J^\star_{k+1}(x_{k+1}) = J_{k+1}(x_{k+1})\) for all \(x_{k+1}\), and we wish to show \(J^\star_{k}(x_{k}) = J_{k}(x_{k})\). To this end, we write \[\begin{align} \hspace{-16mm} J^\star_{k}(x_k) &= \min_{\pi^k} \mathbb{E}_{w_k,\dots,w_{N-1}} \left\{ g_N(x_N) + \sum_{i=k}^{N-1} g_i (x_i,\mu_i(x_i)) \right\} \tag{1.7}\\ &= \min_{\mu_k,\pi^{k+1}} \mathbb{E}_{w_k,\dots,w_{N-1}} \left\{ g_k(x_k,\mu_k(x_k)) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i,\mu_i(x_i)) \right\} \tag{1.8}\\ &= \min_{\mu_k} \left[ \min_{\pi^{k+1}} \mathbb{E}_{w_k,\dots,w_{N-1}} \left\{ g_k(x_k,\mu_k(x_k)) + g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i,\mu_i(x_i)) \right\}\right] \tag{1.9}\\ &= \min_{\mu_k} \mathbb{E}_{w_k} \left\{ g_k(x_k,\mu_k(x_k)) + \min_{\pi^{k+1}} \left[ \mathbb{E}_{w_{k+1},\dots,w_{N-1}} \left\{ g_N(x_N) + \sum_{i=k+1}^{N-1} g_i(x_i,\mu_i(x_i)) \right\} \right] \right\} \tag{1.10}\\ &= \min_{\mu_k} \mathbb{E}_{w_k} \left\{ g_k(x_k,\mu_k(x_k)) + J^\star_{k+1}(f_k(x_k,\mu_k(x_k),w_k)) \right\} \tag{1.11}\\ &= \min_{\mu_k} \mathbb{E}_{w_k} \left\{ g_k(x_k,\mu_k(x_k)) + J_{k+1}(f_k(x_k,\mu_k(x_k),w_k)) \right\} \tag{1.12}\\ &= \min_{u_k \in \mathbb{U}} \mathbb{E}_{w_k} \left\{ g_k(x_k,u_k) + J_{k+1}(f_k(x_k,u_k,w_k)) \right\} \tag{1.13}\\ &= J_k(x_k), \tag{1.14} \end{align}\] where (1.7) follows from the definition (1.6); (1.8) expands \(\pi^k = \{ \mu_k, \pi^{k+1}\}\) and splits \(\sum_{i=k}^{N-1} g_i = g_k + \sum_{i=k+1}^{N-1} g_i\); (1.9) writes the joint minimization over \(\mu_k\) and \(\pi^{k+1}\) as equivalently first minimizing over \(\pi^{k+1}\) and then minimizing over \(\mu_k\); (1.10) is the key step and holds because the stage-\(k\) cost and the distribution of \(w_k\) depend on \(\mu_k\) but not on \(\pi^{k+1}\), so the inner minimization can be pushed inside the expectation over \(w_k\); (1.11) follows again from the definition (1.6) with \(k\) replaced by \(k+1\); (1.12) uses the induction hypothesis; (1.13) holds because, for a fixed \(x_k\), minimizing over the feedback function \(\mu_k\) is equivalent to minimizing over its value \(u_k = \mu_k(x_k) \in \mathbb{U}\); and lastly (1.14) is exactly the dynamic programming recursion (1.5).

By induction, this shows that \(J^\star_k(x_k) = J_k(x_k)\) for all \(k=0,\dots,N\).
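
In the special case where the state space \(\mathbb{X}\) and control space \(\mathbb{U}\) are finite, the recursion (1.5) can be carried out exactly as a table computation. Below is a minimal Python sketch on a small randomly generated (hypothetical) instance, where transition probabilities \(P[x,u,x']\) encode the effect of the disturbance distribution; the continuous case is discussed next.

```python
import numpy as np

# A tiny tabular instance (hypothetical random data) with n_x states and n_u controls.
n_x, n_u, N = 3, 2, 5
rng = np.random.default_rng(0)
P = rng.random((n_x, n_u, n_x)); P /= P.sum(axis=2, keepdims=True)  # P[x, u, x'] = Prob(x_{k+1}=x' | x, u)
g = rng.random((n_x, n_u))                                          # stage cost g(x, u)
g_terminal = rng.random(n_x)                                        # terminal cost g_N(x)

# Backward recursion (1.5): J[k] is the optimal cost-to-go at stage k, mu[k] an optimal policy.
J = np.zeros((N + 1, n_x))
mu = np.zeros((N, n_x), dtype=int)
J[N] = g_terminal
for k in range(N - 1, -1, -1):
    Q = g + P @ J[k + 1]          # Q[x, u] = g(x, u) + E[ J_{k+1}(x') | x, u ]
    mu[k] = Q.argmin(axis=1)      # minimizer for every state x
    J[k] = Q.min(axis=1)

print("J_0 =", J[0])              # optimal value function at stage 0, one entry per state
```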

The careful reader, especially one from a robotics background, may soon become disappointed when seeing the DP algorithm (1.5), because it is more conceptual than practical. To see this, we only need to run DP for \(k=N-1\): \[\begin{equation} J_{N-1}(x_{N-1}) = \min_{u_{N-1} \in \mathbb{U}} \mathbb{E}_{w_{N-1}} \left\{ g_{N-1}(x_{N-1},u_{N-1}) + J_N(f_{N-1}(x_{N-1},u_{N-1},w_{N-1})) \right\}. \tag{1.15} \end{equation}\]

Two challenges immediately show up:

  • How to perform the minimization over \(u_{N-1}\) when \(\mathbb{U}\) is a continuous constraint set? Even if we assume \(g_{N-1}\) is convex1 in \(u_{N-1}\), \(J_N\) is convex in \(x_{N}\), and the dynamics \(f_{N-1}\) is also convex in \(u_{N-1}\) (so that the optimization (1.15) is convex), we may be able to solve the minimization numerically for each \(x_{N-1}\) using a convex optimization solver, but rarely will we be able to find an analytical policy \(\mu_{N-1}^\star\) such that \(u_{N-1}^\star = \mu_{N-1}^\star (x_{N-1})\) for every \(x_{N-1}\) (i.e., the optimal policy \(\mu_{N-1}^\star\) is implicit rather than explicit; see the numerical sketch after this list).

  • Suppose we can find an analytical optimal policy \(\mu_{N-1}^\star\), say the linear policy \(\mu_{N-1}^\star(x_{N-1}) = K x_{N-1}\); how will plugging \(\mu_{N-1}^\star\) into (1.15) affect the complexity of \(J_{N-1}(x_{N-1})\)? One can see that even if \(\mu_{N-1}^\star\) is linear in \(x_{N-1}\), \(J_{N-1}\) may be highly nonlinear in \(x_{N-1}\) due to the composition with \(g_{N-1}\), \(f_{N-1}\), and \(J_N\). If \(J_{N-1}(x_{N-1})\) becomes too complex, then it clearly becomes more challenging to carry out the minimization (1.15) at the next step \(k=N-2\).
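
To make the first challenge concrete, the sketch below approximately solves (1.15) at a single state \(x_{N-1}\) by brute force, gridding over \(u_{N-1}\) and sampling \(w_{N-1}\); the toy model is the same illustrative one as before. Note that this yields the value and a minimizer at one state only; obtaining \(\mu_{N-1}^\star\) as a function would require repeating the computation for every state.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, u, w): return x + u + w              # toy dynamics, as before
def g(x, u):    return x**2 + u**2            # stage cost g_{N-1}
def J_N(x):     return x**2                   # terminal value function J_N = g_N

def solve_stage_at(x, u_grid=np.linspace(-2.0, 2.0, 401), n_samples=2_000):
    """Approximately evaluate (1.15) at one state x: grid over u, sample over w."""
    w = 0.1 * rng.standard_normal(n_samples)
    costs = [g(x, u) + np.mean(J_N(f(x, u, w))) for u in u_grid]
    i = int(np.argmin(costs))
    return costs[i], u_grid[i]                # value J_{N-1}(x) and a minimizer at this x only

value, u_star = solve_stage_at(1.0)
print(value, u_star)
```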

Due to these challenges, only in a very limited number of cases will we be able to perform exact dynamic programming. For example, when the state space \(\mathbb{X}\) and control space \(\mathbb{U}\) are discrete, we can design efficient algorithms for exact DP. For another example, when the dynamics \(f_k\) is linear and the cost \(g_k\) is quadratic, we can compute \(J_k(x_k)\) in closed form (though this may sound a bit surprising!). We will study these problems in more detail in Chapter 2.
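
As a preview of Chapter 2, here is a minimal sketch of the closed-form computation in the linear-quadratic case: with dynamics \(x_{k+1} = Ax_k + Bu_k + w_k\) and quadratic stage and terminal costs, the recursion (1.5) reduces to a backward Riccati recursion, with \(J_k(x) = x^\top P_k x + \text{const}\) and \(\mu_k^\star(x) = -K_k x\) (zero-mean additive noise only shifts the constant and does not change the gains). The matrices below are illustrative placeholders.

```python
import numpy as np

# Illustrative linear dynamics x_{k+1} = A x_k + B u_k (+ w_k) and quadratic costs.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.array([[0.01]]); Qf = np.eye(2)
N = 50

# Backward Riccati recursion: J_k(x) = x^T P_k x (+ const), mu_k(x) = -K_k x.
P = Qf
gains = []
for k in range(N - 1, -1, -1):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K_k
    P = Q + A.T @ P @ (A - B @ K)                       # P_k
    gains.append(K)
gains.reverse()                        # gains[k] = K_k
print(gains[0])                        # first-stage feedback gain
```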

For general optimal control problems with continuous state space and control space (and most problems we care about in robotics), unfortunately, we will have to resort to approximate dynamic programming, basically variations of the DP algorithm (1.5) where approximate value functions \(J_k(x_k)\) and/or control policies \(\mu_k(x_k)\) are used (e.g., with neural networks and machine learning).2 We will introduce several popular approximation schemes in Chapter 3. We will see that, although exact DP is no longer possible, the Bellman principle of optimality still remains one of the most important guidelines for designing approximation algorithms. Efficient algorithms for approximate dynamic programming, preferably with performance guarantees, remain an active area of research.

1.3 Infinite-horizon Formulation

So far we have focused on problems with a finite horizon \(N\); what happens if the horizon \(N\) tends to infinity?

In particular, suppose the controller \(\pi\) now consists of an infinite sequence of functions \[ \pi = \{ \mu_0,\dots \} \] and let us try to find the best policy that minimizes the cost-to-go starting from \(x_0\), subject to the same dynamics as in (1.1) (with \(N\) tending to infinity and \(f_k \equiv f\)) \[\begin{equation} J_{\pi}(x_0) = \mathbb{E} \left\{ \sum_{k=0}^{\infty} g(x_k, \mu_k(x_k)) \right\} \tag{1.16}, \end{equation}\] where the expectation is taken over the (infinitely many) disturbances \(\{w_0,\dots \}\).

We can write (1.16) equivalently as \[ J_{\pi}(x_0) = \lim_{N \rightarrow \infty} J_\pi^N(x_0), \] where, with a slight abuse of notation, \(J_\pi^N(x_0)\) denotes (1.4) with \(g_k \equiv g\) and the terminal cost \(g_N(x_N)\) set to zero.

Now we invoke the dynamic programming algorithm in Theorem 1.2. We first set \(J_N(x_N) = g_N(x_N)=0\), and then compute backwards in time \[ J_k(x_k) = \min_{u_k \in \mathbb{U}} \mathbb{E}_{w_k} \left\{ g(x_k,u_k) + J_{k+1}(f(x_k,u_k,w_k)) \right\}, \quad k=N-1,\dots,0. \] To ease the presentation later, we rewrite the above DP iterations equivalently as \[\begin{align} J_0(x_0) &= 0 \\ J_{k+1}(x_{k+1}) &= \min_{u_k \in \mathbb{U}} \mathbb{E}_{w_k} \left\{ g(x_k,u_k) + J_k(f(x_k,u_k,w_k)) \right\}, \quad k=0,1,\dots,N-1, \tag{1.17} \end{align}\] where we have done nothing but reverse the time indexing, so that \(J_k\) now denotes the optimal cost-to-go of a \(k\)-stage problem.

Observe that when \(N \rightarrow \infty\), (1.17) performs the recursion an infinite number of times.

We may want to conjecture three natural consequences of the infinite-horizon solution:

  1. The optimal infinite-horizon cost is the limit of the corresponding \(N\)-stage optimal cost as \(N \rightarrow \infty\), i.e., \[ J^\star(x) = \lim_{N \rightarrow \infty} J_N(x), \] where \(J_N(x)\) is computed from the DP recursion (1.17).

  2. Because \(J^\star\) is the result of DP (1.17) when \(N\) tends to infinity, if the DP algorithm converges to \(J^\star\), then \(J^\star\) should satisfy \[\begin{equation} J^\star(x) = \min_{u \in \mathbb{U}} \mathbb{E}_w \left\{ g(x,u) + J^\star(f(x,u,w)) \right\}, \quad \forall x \tag{1.18} \end{equation}\] Note that (1.18) is an equation that \(J^\star(x)\) should satisfy for all \(x\). In fact, this is called the Bellman Optimality Equation.

  3. If \(\mu\) attains the minimum in the Bellman equation (1.18), i.e., \(u = \mu(x)\) minimizes the right-hand side of (1.18) for every \(x\), then the stationary policy \(\pi = \{\mu,\mu,\dots \}\) should be optimal. That is to say, the optimal policy is time-invariant.

In fact, all of the conjectures above are true for most infinite-horizon problems. For example, in Chapter 2.2, we will investigate the Markov Decision Process (MDP) formulation, under which the above conjectures all hold. However, one should know that there also exist many infinite-horizon problems where our conjectures fail, and there are many mathematical subtleties in rigorously proving the conjectures.

The reader should see why it can be more convenient to study the infinite-horizon formulation: (i) the optimal cost-to-go is only a function of the state \(x\), but not a function of timestep \(k\); (ii) the optimal policy is time-invariant and easier to implement.

Value Iteration. The Bellman optimality equation (1.18) also suggests a natural algorithm for computing \(J^\star(x)\). We start with \(J(x)\) being all zero, and then iteratively update \(J(x)\) by applying the right-hand side of (1.18) as an update rule. This is the famous value iteration algorithm. We will study it in Chapter 2.2.
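
Below is a minimal sketch of value iteration on a small randomly generated (hypothetical) tabular problem. To keep the undiscounted iteration convergent, state 0 is made a cost-free absorbing goal state that is reachable from every state under every control, so all stationary policies eventually reach it; the numbers are illustrative only.

```python
import numpy as np

# A tiny (hypothetical) tabular instance. State 0 is a cost-free absorbing goal that is
# reachable from every state, which guarantees convergence of the undiscounted iteration.
n_x, n_u = 4, 2
rng = np.random.default_rng(0)
P = rng.random((n_x, n_u, n_x)); P /= P.sum(axis=2, keepdims=True)   # P[x, u, x']
P[0, :, :] = 0.0; P[0, :, 0] = 1.0                                   # goal is absorbing
g = 1.0 + rng.random((n_x, n_u)); g[0, :] = 0.0                      # positive costs, zero at goal

J = np.zeros(n_x)                             # start from J = 0 everywhere
for _ in range(10_000):
    Q = g + P @ J                             # Q[x, u] = g(x, u) + E[ J(x') | x, u ]
    J_new = Q.min(axis=1)                     # apply the right-hand side of (1.18)
    if np.max(np.abs(J_new - J)) < 1e-10:     # stop at a (numerical) fixed point
        break
    J = J_new
mu = Q.argmin(axis=1)                         # greedy stationary policy extracted from J
print("J* ~", J, "  mu* ~", mu)
```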

As practitioners, we may simply execute the dynamic programming (value iteration) algorithm without carefully checking whether our problem satisfies the required assumptions. If the algorithm converges, oftentimes the problem indeed satisfies the assumptions. Otherwise, the algorithm may fail to converge, as we will see in Example 2.3.

References

Bertsekas, Dimitri P. 2012. Dynamic Programming and Optimal Control, Vol. I. Athena Scientific.

  1. You may want to read Appendix B if this is your first time seeing “convex” things.↩︎

  2. Another possible solution is to discretize continuous states and controls. However, when the dimension of state and control is high, discretization becomes too expensive in terms of memory and computational complexity.↩︎