A Gentle Intro to MDP

5 minute read


Suffering from my 'memoryless property' a lot, with a MDP coursework alarm ringing in my head at the moment, and vaguely remember this dynamic programming research talk from Chris Kirkbride recently; I decided to organize all that I know about MDP in this one blog post. Hopefully after finish writing this post, I'll have a clean and organized storage of MDP in my head; and hopefully after finish reading this post,  you'll get something useful as well.

Just some basics, a Markov Decision Process (MDP) can be described by the following quintuple:

$latex MDP=<S, A, T, R, \gamma> $

where S are the different states, A the actions, T the transition probabilities under certain action, i.e. $latex P(s^{'}|s,a) $, R the rewards, and $latex \gamma $ the discount factor which is usually used to discount the importance of future rewards. Now that we've defined all the important ingredients for MDP, we can choose to either use value iteration or policy iteration to calculate the expected rewards for each state.

The idea of MDP is being widely used in places where sequential decision making is needed. It is kind of a nondeterministic search, where the outcome of the actions are uncertain. I scribbled below a simplified grid environment for a robot. Image a robot starts at somewhere in this grid, each time the robot can decide to move to any one of the neighbouring grids, get either rewards or punishments. We assume the actions are stochastic, so the resulting next state is a function of the current state and action taken. As the robot moves around in this grid, a sequence of rewards/punishments will be generated.

The main goal here is to compare these different sequences of rewards, and choose one route with the highest cumulative reward.


  • Value Iteration

Obviously in the above environment, we want to avoid the -5, -8 grids, and step into the +3, +10 grids. We can formulate this problem with a beginning point and an end point. There is really no end, and we can simply use an arbitrary end point. In value iteration, we let $latex V_{k} $ denote the value function assuming there are k steps to go, and let $latex Q_{k} $ denote the state-action value function assuming there are k steps to go. These can then be calculated recursively. We initialize our algorithm with an arbitrary function $latex V_{0} $, and use the following Bellman equation to get the functions for k+1 steps to go: (remember we are going backwards in time)

$latex Q_{k+1}^{\pi}(s,a)= r(s) + \gamma \sum_{s^{'}}P(s^{'}|s,a)\sum_{a^{'}}\pi(s^{'}|s,a)Q^{\pi}_{k}(s^{'},a^{'}) $

In the case of deterministic policies, the action is given by the policy and the above state-action value function $latex Q_{k+1}^{\pi}(s,a) $ reduces to $latex V^{\pi}(s) $, which then simplifies the above equation into:

$latex  V^{\pi}(s) = r(s) + \gamma \sum_{s^{'}}P(s^{'}|s,a)V^{\pi}(s^{'}) $

In an environment of N states, the Bellman equation is a set of N linear equations, one for each state. We can start with a guess V for the value of each state, and calculating from this a better estimate: V

  • Policy Iteration

In policy iteration, as the name suggests, we iterate over many different policies and choose the one that gives us the highest payoff. To be specific, we usually choose a threshold and iterate our algorithm over different policies many times, stop when the difference of value functions for different policies is smaller than this threshold. You can basically think of a policy as a string of actions through time, thus the number of policies for one MDP problem is the number of all possible actions to the power of the number of states. This exponential growth of problem size with the number of states, as you'd expect, results in serious dimensionality problems. It is one of the main challenges in the area of reinforcement learning, and was termed curse of dimensionality by Richard Bellman.

Let's think about the simple grid example above in the policy iteration setting. Suppose the robot starts at top left and is heading to the bottom right as its destination. We give our algorithm an initial value, say zero. We now have to initialize the policy, which in effect, is to decide which grid to take as a next step. Surely, in order to get higher rewards the robot should avoid the grids with negative returns. With full knowledge of the environment, we can then calculate the corresponding value function according to the above Bellman equation. This process, to put in an academic jargon, is called an evaluation of the specific policy (policy evaluation).

The next step, is to take this value function and to calculate the corresponding best set of actions for it. Forget about all these obscure terms, put yourself in the robot's shoes, and you'll know what to do. Assume now we have full information about the environment (each grid can only be visited once), the following random drawing is the route I take without much thinking. (probably some policy iteration needed :D )


The best actions to take for a specific value function is to take the action from each state that maximize the corresponding future payoff. This is termed, policy improvement. These two steps, policy evaluation and policy improvement are repeated until the policy does not change anymore. I find this problem so much more interesting and approachable once I put it in a game setting, and envision myself as the one playing games (making decisions). Of course, MDP has so much more serious applications in real life. Similar areas such as reinforcement learning and optimal control are things I'd like to look further into the details, and I shall talk more about those things later in my post.

Stay tuned, keen beans!