Creating Artificial Intelligence with Model Predictive Control


There is a lot of hype around reinforcement learning right now. Many of the approaches devise methods to learn policies or value functions for the task at hand. These are then used to create agents that select actions that are good in some sense.

However, one issue with reinforcement learning is that it tends to need many training iterations to converge. Interestingly enough, if we have a model of the environment we can solve for the optimal actions directly, which is much faster than the typical reinforcement learning loop. In this post I explore Model Predictive Control, an example of how we can use an environment model to create agents that act optimally in continuous systems.

Model Predictive Control for Continuous Systems

To keep things simple, we consider a linear control system $$ x_{t+1} = Ax_t + Bu_t $$ Here $x$ is the system state vector, $u$ is the control vector, $A$ is the matrix governing how the next state depends on the current state, and $B$ determines how our controls influence the next state.
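As a concrete illustration, here is a minimal NumPy sketch of rolling such a system forward from an initial state under a given sequence of controls. The function name is my own choice for this post, not necessarily what appears in the linked code.

```python
import numpy as np

def rollout(A, B, x0, us):
    """Return the states x_1, ..., x_N produced by applying u_0, ..., u_{N-1}."""
    xs, x = [], x0
    for u in us:
        x = A @ x + B @ u          # x_{t+1} = A x_t + B u_t
        xs.append(x)
    return np.array(xs)
```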

The problem now is: given some initial state $x_0$, how do we find suitable controls $u^*$ to achieve some objective? For example, we could be interested in reaching some goal state $x^*$, or maximizing some reward $r(x)$, all the while not spending too much energy, which puts limits on the possible $u$. Typically we can formulate the objective as a cost function. A simple but flexible cost function is $$ J(x,u) = (x_N-x^*)^TP(x_N-x^*) + \sum_{i=1}^{N-1}(x_i-x^*)^TQ(x_i-x^*) + \sum_{i=0}^{N-1}u_i^TRu_i $$ Here $P$ is a matrix scaling the error between the final state and the goal state. $Q$ does the same for all the intermediate states. $R$ can be thought of as scaling the control energy, which measures how much control effort we are expending. If we consider a one-dimensional case and let all the matrices be the identity, the cost becomes $$ J(x,u) = (x_N-x^*)^2 + \sum_{i=1}^{N-1}(x_i-x^*)^2 + \sum_{i=0}^{N-1}u_i^2 $$ So we see that, in some sense, the cost is similar to a least squares error on the goal state combined with an $L_2$ regularization of the controls.
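To make the cost concrete, here is a rough NumPy sketch that evaluates $J$ for a given trajectory. The indexing convention (states $x_1,\dots,x_N$ produced by controls $u_0,\dots,u_{N-1}$) matches the formulation used later in the post; the function name is again my own.

```python
import numpy as np

def cost(xs, us, x_goal, P, Q, R):
    """Evaluate J for states xs = [x_1, ..., x_N] and controls us = [u_0, ..., u_{N-1}]."""
    e_N = xs[-1] - x_goal
    J = e_N @ P @ e_N                      # terminal cost (x_N - x*)^T P (x_N - x*)
    for x in xs[:-1]:                      # intermediate states x_1, ..., x_{N-1}
        e = x - x_goal
        J += e @ Q @ e
    for u in us:                           # control effort penalty for every control
        J += u @ R @ u
    return J
```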

Finding the optimal controls

Given our system and cost function, how do we find the controls that minimize the cost? At first glance, why not just differentiate the cost with respect to the controls and find the controls that make the derivative zero? This is exactly what we are going to do, but there is one detail we need to deal with first. The state $x_t$ at a given time depends on all the previous controls $u_{t-1},\dots,u_0$. So if we want to differentiate with respect to the controls we need to know what that relationship looks like. Let us write out the evolution of the system for a few steps $$ x_0 = x_0 $$

$$ x_1 = Ax_0 + Bu_0 $$

$$ x_2 = Ax_1 + Bu_1 = A^2x_0 + ABu_0 + Bu_1 $$

$$ x_3 = Ax_2 + Bu_2 = A^3x_0 + A^2Bu_0 + ABu_1 + Bu_2 $$

In general $$ x_t = A^tx_0 + \sum_{i=0}^{t-1}A^{t-1-i}Bu_{i} $$
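As a quick sanity check, this closed-form expression can be compared numerically against simply iterating the dynamics; the random system below is purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, t = 3, 2, 5
A = rng.standard_normal((n, n))
B = rng.standard_normal((n, m))
x0 = rng.standard_normal(n)
us = rng.standard_normal((t, m))

# Iterate the dynamics directly
x = x0
for u in us:
    x = A @ x + B @ u

# Closed form: x_t = A^t x_0 + sum_i A^{t-1-i} B u_i
x_closed = np.linalg.matrix_power(A, t) @ x0
x_closed = x_closed + sum(np.linalg.matrix_power(A, t - 1 - i) @ B @ us[i] for i in range(t))

print(np.allclose(x, x_closed))  # True
```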

We can write this as a system $$ X = \begin{bmatrix} A \\ \vdots \\ A^N \end{bmatrix} x_0 + \begin{bmatrix} B & 0 & \dots & 0 \\ AB & B & \ddots & \vdots \\ \vdots & \vdots & \ddots & 0 \\ A^{N-1}B & A^{N-2}B & \dots & B \end{bmatrix} \begin{bmatrix} u_0 \\ u_1 \\ u_2 \\ \vdots \\ u_{N-1} \end{bmatrix} = S_x x_0 + S_u U $$ where $N$ is the number of steps ahead we are considering. There is a subtle issue here: how do we know how far ahead we should plan? At the beginning we do not know how long it will take us to reach our goal, and unfortunately there is no easy answer. What we can do is simply pick a horizon, use it to plan our controls, and then replan frequently. This is known as receding horizon control.
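Here is a sketch, assuming NumPy, of how the stacked prediction matrices $S_x$ and $S_u$ could be assembled for a horizon of $N$ steps (the helper name is my own).

```python
import numpy as np

def prediction_matrices(A, B, N):
    """Build S_x and S_u so that X = S_x x_0 + S_u U over a horizon of N steps."""
    n, m = B.shape
    Sx = np.vstack([np.linalg.matrix_power(A, k) for k in range(1, N + 1)])
    Su = np.zeros((N * n, N * m))
    for row in range(N):                   # block row corresponding to x_{row+1}
        for col in range(row + 1):         # controls u_0, ..., u_row affect x_{row+1}
            Su[row*n:(row+1)*n, col*m:(col+1)*m] = np.linalg.matrix_power(A, row - col) @ B
    return Sx, Su
```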

Using the matrices $S_x$ and $S_u$ we can write the cost function as $$ \begin{aligned} J(x,u) &= (S_x x_0 + S_u U - X^*)^T\tilde{Q}(S_x x_0 + S_u U - X^*) + U^T\tilde{R}U \\ &= z^T\tilde{Q}z + U^T(S_u^T\tilde{Q}S_u+\tilde{R})U + 2z^T\tilde{Q}S_uU \\ &= z^T \tilde{Q} z + U^THU + 2z^T\tilde{Q}S_uU \end{aligned} $$ where $$ z = S_x x_0 - X^* $$ $$ H = S_u^T\tilde{Q}S_u+\tilde{R} $$ $$ X^* = \begin{bmatrix} x^* \\ \vdots \\ x^* \end{bmatrix} $$ $$ \tilde{Q} = \begin{bmatrix} Q & 0 & \dots & 0 \\ 0 & Q & \ddots & \vdots \\ \vdots &\ddots & \ddots & 0\\ 0 & \dots & 0 & P \end{bmatrix} $$ and $$ \tilde{R} = \begin{bmatrix} R & 0 & \dots & 0 \\ 0 & R & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \dots & 0 & R \end{bmatrix} $$
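The block-diagonal weights $\tilde{Q}$, $\tilde{R}$ and the offset $z$ could be assembled along these lines; again a sketch, with helper names of my own choosing.

```python
import numpy as np

def stacked_weights(Q, P, R, N):
    """Build the block-diagonal weights Q~ (with P in the last block) and R~."""
    n = Q.shape[0]
    Q_tilde = np.kron(np.eye(N), Q)
    Q_tilde[-n:, -n:] = P                  # the terminal state is weighted by P
    R_tilde = np.kron(np.eye(N), R)
    return Q_tilde, R_tilde

def offset(Sx, x0, x_goal, N):
    """Compute z = S_x x_0 - X*, with X* the goal state stacked N times."""
    return Sx @ x0 - np.tile(x_goal, N)
```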

We can now differentiate the cost function with respect to $U$ and set the derivative to zero, which (after dropping a factor of two) gives $$ HU + S_u^T\tilde{Q}z = 0 $$ So we can solve for the controls as $$ U = -H^{-1}S_u^T\tilde{Q}z $$
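Putting the pieces together, a sketch of solving for the optimal stacked control vector $U$ might look as follows, reusing the helper sketches above and solving the linear system rather than forming $H^{-1}$ explicitly.

```python
import numpy as np

def solve_mpc(A, B, Q, P, R, x0, x_goal, N):
    """Solve the unconstrained MPC problem for the stacked controls U."""
    Sx, Su = prediction_matrices(A, B, N)
    Q_tilde, R_tilde = stacked_weights(Q, P, R, N)
    z = offset(Sx, x0, x_goal, N)
    H = Su.T @ Q_tilde @ Su + R_tilde
    U = np.linalg.solve(H, -Su.T @ Q_tilde @ z)   # solve H U = -S_u^T Q~ z
    return U.reshape(N, -1)                        # one row of controls per time step
```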

Receding Horizon Control

As mentioned earlier, we do not know beforehand how many steps ahead we have to plan. So even if we solve for the optimal controls at the current step, our solution might not be accurate many steps into the future. In that case we will not get to where we want to be.

However, executing a few of the calculated controls will likely bring us closer to our goal. So with receding horizon control what we do is:

  • Plan $N$ steps ahead to get the controls $U$
  • Execute the first (or the first few) of the planned control actions and arrive at a new state $x_t$
  • Go back to the planning stage and compute new controls from the new state

In this way we move closer and closer to our goal state, and our controls become more and more accurate.
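A sketch of this loop, reusing the `solve_mpc` sketch from above and applying only the first planned control at each step, could look like this.

```python
import numpy as np

def receding_horizon(A, B, Q, P, R, x0, x_goal, N, steps):
    """Repeatedly plan N steps ahead but execute only the first planned control."""
    x = x0
    trajectory, controls = [x0], []
    for _ in range(steps):
        U = solve_mpc(A, B, Q, P, R, x, x_goal, N)
        u = U[0]                           # first planned control action
        x = A @ x + B @ u                  # step the (model of the) system
        trajectory.append(x)
        controls.append(u)
    return np.array(trajectory), np.array(controls)
```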

Example

Consider a simple one-dimensional car $$ \begin{aligned} s_{t+1} &= s_t + v_t \\ v_{t+1} &= v_t + u_t \end{aligned} $$ with $s_t$ the position, $v_t$ the velocity, and $u_t$ the acceleration. Letting $x_t = [s_t,v_t]^T$ we then have

$$ A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} , B = \begin{bmatrix} 0 \\ 1 \end{bmatrix} $$

Our problem now is that the car starts in some state $x_0$ and we want it to end up in some other state $x^*$. For the sake of simplicity let us assume $$ x^* = \begin{bmatrix} 0 \\ 0 \end{bmatrix} $$ Let us also assume that the car starts at rest, some distance away from the goal, e.g. $$ x_0 = \begin{bmatrix} -1 \\ 0 \end{bmatrix} $$

For the cost function we have no particular reason to prefer one choice of the matrices $Q, P$ and $R$ over another, so let us take $Q=I$, $P=I$ and, since the control is scalar, $R=r_0=1$.
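For completeness, here is how the car example could be set up and run with the sketches above. The 20 simulation steps are an arbitrary choice of mine; the horizon and weights mirror the choices in this section.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
P = np.eye(2)
R = np.array([[1.0]])                 # r_0 = 1; try 10.0 for a heavier control penalty
x0 = np.array([-1.0, 0.0])            # start at rest, one unit away from the goal
x_goal = np.array([0.0, 0.0])

trajectory, controls = receding_horizon(A, B, Q, P, R, x0, x_goal, N=5, steps=20)
print(trajectory[-1])                 # should end up close to the goal [0, 0]
```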

We can now solve the control problem. Here are solutions when planning $N=5$ steps ahead.

We see that we start out with a large acceleration, which then decreases due to the control effort penalty. If we set $r_0=10$, making the penalty significantly larger, we get the following solutions

So we see that increasing the control penalty leads to smaller control values and smoother state changes, but it does take longer to reach the goal state.

Code for this post is available here.

A lot of inspiration came from Ben Recht's survey paper on reinforcement learning (paper link) and the textbook Predictive Control for Linear and Hybrid Systems by Borrelli, Bemporad and Morari.