LPs and MDPs

(I originally wrote this post in response to a student’s question for Georgia Tech’s CS 7642 Reinforcement Learning and Decision Making course.)

Solving MDPs via linear programming

Markov decision processes (MDPs) arise in control and reinforcement learning (RL) problems. Modulo notational and cultural differences, control and RL study many similar problems. I highly recommend checking out Ben Recht’s blog series An Outsider’s Tour of Reinforcement Learning for some quality commentary on the control versus RL divide.¹

There are a variety of methods that we can use to solve MDPs, including value iteration, policy iteration, and linear programming. And when we say that MDPs are polynomial-time solvable, we’re implicitly relying on the fact that linear programs are polynomial-time solvable.

That said, linear programming solutions tend to be overlooked. As is often the case with linear programming formulations, the primal and dual problems provide perspective on the problem at hand. In this post, we derive a dual that offers a network flow interpretation of an MDP. Let’s go.

Let’s say we have an MDP with $m$ states $s_1, \dots, s_m \in \mathcal{S}$ and $n$ actions $a_1, \dots, a_n \in \mathcal{A}$. We’ll write the value function $v(\cdot)$ as an $m$-dimensional vector $v \in \reals^m$ and the action-value function $q(\cdot, \cdot)$ as an $m \times n$ matrix $q \in \reals^{m \times n}$.

We can find the optimal value function

$$ {\small v^\star(s) \stackrel{\cdot}{=} \max_\pi v_\pi(s) = \max_a \left( R(s, a) + \gamma \sum_{s'} T(s, a, s') v^\star(s') \right), \quad \text{for all $s \in \mathcal{S}$} } $$

by solving the linear program

$$ \begin{array}{ll} \mbox{minimize} & \ones^T v \\
\mbox{subject to} & v \succeq R_a + \gamma T_a v \quad \text{for $a = 1, \dots, n$}, \end{array} $$

where $v$ is the optimization (aka, decision) variable, $R_a \in \reals^m$ is the vector of rewards for action $a$, and $T_a \in \reals^{m \times m}$ is the transition matrix corresponding to action $a$. We’ll call this the primal linear program.
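
To make this concrete, here’s a small Python sketch (not from the course; just an illustration on a made-up two-state, two-action MDP with invented transitions and rewards). It runs value iteration on the Bellman optimality equation above, then solves the primal LP with `scipy.optimize.linprog`, rearranging the constraint $v \succeq R_a + \gamma T_a v$ into linprog’s $A_{\mathrm{ub}} x \le b_{\mathrm{ub}}$ form as $(\gamma T_a - I) v \preceq -R_a$.

```python
import numpy as np
from scipy.optimize import linprog

# A made-up toy MDP: 2 states, 2 actions.
# T[a, s, s1] = probability of landing in state s1 after taking action a in state s.
# R[a, s]     = expected reward for taking action a in state s.
gamma = 0.9
T = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
m, n = 2, 2  # number of states, number of actions

# Sanity check via value iteration: repeatedly apply the Bellman optimality operator.
v = np.zeros(m)
for _ in range(1000):
    v = np.max(R + gamma * T @ v, axis=0)  # max over a of R(s, a) + gamma * sum_s' T(s, a, s') v(s')

# Primal LP: minimize 1^T v  subject to  (gamma * T_a - I) v <= -R_a  for each action a.
I = np.eye(m)
A_ub = np.vstack([gamma * T[a] - I for a in range(n)])
b_ub = np.concatenate([-R[a] for a in range(n)])
c = np.ones(m)

primal = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * m)  # v is free in sign
v_star = primal.x
print("value iteration:", v)       # the two answers should agree up to solver tolerance
print("linear program: ", v_star)
```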

We form a dual linear program of the primal problem by calculating its Lagrangian, eliminating the primal variable, and maximizing the dual function (aka, the Lagrangian with the primal variable eliminated).

Calculating the Lagrangian

The Lagrangian is the primal’s objective function plus penalties for violating the primal’s constraints (rewritten as $0 \succeq R_a + \gamma T_a v - v$)

$$ {\small L(v, \lambda) = \ones^T v + \underbrace{\lambda_1^T (R_1 + \gamma T_1 v - v)}_{\text{nonpositive when the constraint holds}} + \cdots + \lambda_n^T (R_n + \gamma T_n v - v), } $$

where $\lambda_a \in \reals^m$ and $\lambda_a \succeq 0$. *Since this isn’t a detective novel, I’ll clue you in to the fact that these $\lambda$s correspond to the action-value function $q$!*

Before eliminating the primal variable, we’ll rewrite the Lagrangian so that all the terms involving $v$ are grouped together

$$ {\small L(v, \lambda) = \left( \ones + (\gamma T_1 - I)^T \lambda_1 + \cdots + (\gamma T_n - I)^T \lambda_n \right)^T v + \lambda_1^T R_1 + \cdots + \lambda_n^T R_n, } $$

where $I$ is the identity matrix.
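
(In case the regrouping looks like sleight of hand, each penalty term just gets transposed:

$$ \lambda_a^T (\gamma T_a v - v) = \lambda_a^T (\gamma T_a - I) v = \left( (\gamma T_a - I)^T \lambda_a \right)^T v, $$

and the $\lambda_a^T R_a$ pieces don’t involve $v$, so they pass through unchanged.)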


And see that $\ones$? That’s the 1 unit of policy flow that we inject at each state!


Eliminating the primal variable

Now let’s eliminate the primal variable by minimizing $L(v, \lambda)$ over $v$

$$ {\small g(\lambda) = \inf_v L(v, \lambda) = \begin{cases} \lambda_1^T R_1 + \cdots + \lambda_n^T R_n & 0 = \ones + (\gamma T_1 - I)^T \lambda_1 + \cdots + (\gamma T_n - I)^T \lambda_n \\
-\infty & \text{otherwise}. \end{cases} } $$

To see this, note that $L(v, \lambda)$ is affine in $v$: unless the coefficient multiplying $v$ is zero, we can send $L$ to $-\infty$. So let’s take the gradient of $L(v, \lambda)$ with respect to $v$ and set it equal to zero to get exactly that condition:

$$ \nabla_v L(v, \lambda) = \ones + (\gamma T_1 - I)^T \lambda_1 + \cdots + (\gamma T_n - I)^T \lambda_n \stackrel{\text{set}}{=} 0. $$

Maximizing the dual function

The last step is to maximize the dual function $g$ subject to the constraints on $\lambda$ (i.e., $\lambda_a \succeq 0$) and the constraints that arose from eliminating $v$; doing so gives the dual problem

$$ \begin{array}{ll} \mbox{maximize} & \lambda_1^T R_1 + \cdots + \lambda_n^T R_n \\
\mbox{subject to} & 0 = \ones + (\gamma T_1 - I)^T \lambda_1 + \cdots + (\gamma T_n - I)^T \lambda_n \\
& \lambda_a \succeq 0, \quad \text{for $a = 1, \dots, n$}. \end{array} $$
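
Continuing the toy sketch from the primal (again, just an illustration with the made-up numbers from before), we can hand this dual to `linprog` by stacking the $n$ multiplier vectors into one long variable. By strong duality, its optimal value should match $\ones^T v^\star$ from the primal.

```python
# Dual LP, continuing the sketch above: stack lambda_1, ..., lambda_n into one
# vector of length m * n, with equality constraint 0 = 1 + sum_a (gamma T_a - I)^T lambda_a.
A_eq = np.hstack([(gamma * T[a] - I).T for a in range(n)])  # shape (m, m * n)
b_eq = -np.ones(m)
c_dual = -np.concatenate([R[a] for a in range(n)])          # negated because linprog minimizes

dual = linprog(c_dual, A_eq=A_eq, b_eq=b_eq)                # default bounds give lambda >= 0
lam = dual.x.reshape(n, m)                                  # lam[a, s] is lambda_a(s)
print("dual optimal value:", -dual.fun)                     # equals sum(v_star) by strong duality
```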

Translating to a policy flow problem

To tie this problem back to the policy flow problem from Professors Isbell and Littman’s CS 7642 AAA lecture, let’s rewrite things in terms of scalar variables

$$ {\small \begin{array}{ll} \mbox{maximize} & \displaystyle \sum_s \sum_a \lambda_a(s) R(s, a) \\
\mbox{subject to} & 0 = 1 + \gamma \displaystyle \sum_s \sum_a T(s, a, s') \lambda_a(s) - \displaystyle \sum_a \lambda_a(s'), \quad \text{for $s' = 1, \dots, m$}\\
& \lambda_a(s) \ge 0, \quad \text{for $a = 1, \dots, n$ and $s = 1, \dots, m$}. \end{array} } $$

And our last step is to write $\lambda_a(s) = q_{s, a}$ so that we recover

$$ {\small \begin{array}{ll} \mbox{maximize} & \displaystyle \sum_s \sum_a q_{s, a} R(s, a) \\
\mbox{subject to} & 1 + \gamma \displaystyle \sum_s \sum_a T(s, a, s') q_{s, a} = \displaystyle \sum_a q_{s', a}, \quad \text{for $s' = 1, \dots, m$}\\
& q_{s, a} \ge 0, \quad \text{for $a = 1, \dots, n$ and $s = 1, \dots, m$}. \end{array} } $$
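
To close the loop on the flow story, here’s a rough reading on the same toy sketch (under the assumption that the optimal action in each state is unique): $q_{s, a}$ is the discounted amount of flow through the state–action pair $(s, a)$ when 1 unit is injected at every state, so a deterministic optimal policy takes the action carrying each state’s flow, and the equality constraint above is conservation of that flow.

```python
# Reading off a policy and checking flow conservation, continuing the sketch above.
q = lam                                             # q[a, s] plays the role of q_{s, a}
policy = q.argmax(axis=0)                           # the action carrying each state's flow (assumes no ties)
inflow = 1 + gamma * np.einsum('as,ast->t', q, T)   # 1 unit injected at s' plus discounted flow arriving
outflow = q.sum(axis=0)                             # total flow leaving each state s'
print("greedy policy:", policy)
print("flow conserved:", np.allclose(inflow, outflow))
```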

Next up

tbd.

References

This discussion draws on topics from two of my favorite books:


  1. Bertsekas' recent book, Reinforcement Learning and Optimal Control, also highlights some of the similarities and differences between the fields (terminology, notation, etc.). ↩︎
