Model-Free Reinforcement Learning

4 minute read


In one of my previous posts, I wrote briefly about Markov Decision Processes (MDP). Today, let's move into the area of reinforcement learning (RL), which is strongly linked to MDP, in that it also deals with problems that require sequential decision making.  The difference lies in that, instead of waiting till the end of the time horizon before we choose our policy (a set of actions specified for each time point at each state), we base our decisions on all the accumulative experience earned in the past. Real-life applications abound in a wide range of fields, including robotics, economics, and control theory. We term the decision maker in a system as an agent. The idea of RL is to empower the agent with both retro-dictive and pre-dictive powers.

In general, RL methods fall into two broad classes: model-based and model-free. Model-based methods use an explicit form to model the environment and the agent, which is a reasonable choice when we have a very good understanding of the mechanism of the system. Whereas when it comes to systems that are too complicated to model, we might as well learn from past experience, which is exactly what human brains are good at.  That's part of the reason why in some literature, RL is referred to as Neuro-Dynamic Programming (see here for a comprehensive summary of this area). We can think of model-free methods as trial-and-error learning in a Pavlovian way.

Following is a dummy proof drawing of an agent interacting with the environment. (It seems I'm getting better at my Inkscape skills :D) Every time the agent takes an action from the current state, a new state will be reached and a corresponding reward earned.

[caption id=”attachment_201” align=”aligncenter” width=”274”]interact Figure 1: agent interacting with environment.[/caption]

The rest of this post will mainly be focused on model-free methods, and especially temporal difference (TD) learning. It is one of the most well-known RL methods that combines classic dynamic programming and Monte Carlo methods. Future rewards are being estimated based on partial knowledge of the system at present. The simplest TD method called TD(0) is described as follows:

$latex V(s_{t})\leftarrow V(s_{t})+\alpha \left [ r_{t+1}+\gamma V(s_{t+1})-V(s_{t}) \right ] $

where  $latex \alpha $ is called the learning rate, which decides how much weight we should put on the newly gained experience against our current knowledge. $latex \gamma V(s_{t+1})-V(s_{t})  $  is the actual "temporal difference" part. In problems with higher dimensions, or when the state space is too large, an exhaustive exploration of every state is simply infeasible. The problem facing us is when to exploit our current knowledge about the system, when to explore the unknown. Depending on this trade-off, methods again can be classified into two classes: on-policy SARSA and off-policy Q-Learning.

  • On-policy SARSA

SARSA stands for state-action-reward-state-action. The state-action value function is updated as follows,

$latex Q(s,a)\leftarrow Q(s,a)+\alpha \left \{ r(s')+\gamma Q(s',a')-Q(s,a) \right \} $

which is mostly the same as the TD method described above. The main difference between SARSA and the aforementioned TD(0) is that, now we consider the state-action pair together rather than only the states as in TD(0).  Also, now that the action chosen is not necessarily the one that maximizes the immediate reward, as we allow some exploration of the unknown. SARSA learns from the feedback of a new experience through the current policy being implemented.

  • Off-policy Q-Learning

Q-learning is the off-policy version of SARSA. It takes the action that maximizes the immediate state-action function value. The update rule for state-action value function is shown as follows:

$latex Q(s,a)\leftarrow Q(s,a)+\alpha \left \{ r(s')+\gamma \underset{a'}{maxa} Q(s',a')-Q(s,a) \right \} $

Essentially, Q-learning method is adding on top of value iteration some probabilistic opportunity to explore the state space. Instead of updating the value function estimate for every new experience, we choose to take the action in next state that gives us the maximum value function estimate among all possible states. Intuitively, we can think of the agent in Q-learning is not as smart enough as the agent in SARSA to learn from their past mistakes immediately. The agent does send feedback to the current state to update the value estimate at state s, but the action a at state s will not be updated.

Please see the following slide show for a toy example on how to update the value function:

[gallery ids=”244,243,242,241,240” type=”slideshow” orderby=”rand”]


In case you're interested in knowing more about RL, I recommend you to take this Machine Learning: Reinforcement Learning online course on Udacity.


[1] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press, 2nd ed., 2017.

[2] P. Dayan and Y. Niv, Reinforcement Learning: The Good, The Bad and The Ugly. ELSEVIER, 2008.