Reinforcement Learning Soup: MDPs, Policy vs. Value Learning, Q-Learning and Deep-Q-Networks

Jesus Rodriguez
3 min read · Aug 31, 2017


Reinforcement learning is one of the hottest topics in artificial intelligence (AI) and one that has been broadly covered in this blog. Today, I would like to get a bit more technical, refresh some of the concepts I previously covered about policy vs. value learning, and discuss Q-Learning, one of the most interesting techniques in modern reinforcement learning applications.

MDPs and Policy vs. Value Learning

Reinforcement learning is the AI discipline that most closely resembles human thinking. Essentially, reinforcement learning models AI scenarios using a combination of environments and rewards. In that world, the role of an AI agent is to learn about the environment while maximizing its total reward. One of the most popular mechanisms for representing reinforcement learning problems is the Markov Decision Process (MDP), which decomposes a scenario into a series of states connected by actions and associated with specific rewards. In an MDP, an AI agent transitions from state to state by selecting an action and obtaining the corresponding reward.
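As a rough illustration, here is a minimal sketch of an MDP with a made-up two-state environment (the states, actions, and rewards are hypothetical, chosen only to show the structure):

```python
import random

# Toy MDP for illustration: each entry maps (state, action) -> (next_state, reward).
transitions = {
    ("start", "left"):  ("start", 0.0),
    ("start", "right"): ("goal", 1.0),
    ("goal", "left"):   ("start", 0.0),
    ("goal", "right"):  ("goal", 0.0),
}

def step(state, action):
    """Apply an action in a state and return (next_state, reward)."""
    return transitions[(state, action)]

# The agent transitions from state to state by selecting actions
# and collecting the corresponding rewards.
state, total_reward = "start", 0.0
for _ in range(5):
    action = random.choice(["left", "right"])
    state, reward = step(state, action)
    total_reward += reward
print("total reward:", total_reward)
```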

Conceptually, MDPs aim to help AI agents find the optimal policy in a target environment. A policy is defined by the action an AI agent takes in a specific state. The objective of an MDP policy is to maximize the future return for the AI agent. The biggest challenge in any MDP scenario is always how to guide the AI agent to the reward. Broadly speaking, the solutions to this challenge fall into two main categories: policy learning and value learning.

Policy learning focuses on directly inferring a policy that maximizes the reward in a specific environment. Value learning, in contrast, tries to quantify the value of every state-action pair. Let's explain those concepts using the example of an AI agent trying to learn a new chess opening. Using policy learning, the AI agent would try to infer a strategy for developing the pieces in a way that reaches a certain well-known position. Using value learning, the AI agent would assign a value to every position and select the moves that score highest. Taking a psychological perspective, policy learning is closer to how adults reason through cognitive challenges, while value learning is closer to how babies learn.
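To make the distinction concrete, here is a minimal, hypothetical sketch (the state names, actions, and values are invented): policy learning produces a direct mapping from states to actions, while value learning produces scores for state-action pairs and picks the best-scoring action.

```python
# Policy learning: the learned object is a direct mapping state -> action.
policy = {"opening": "develop_knight", "midgame": "castle"}

def act_with_policy(state):
    return policy[state]

# Value learning: the learned object is a score for every (state, action) pair;
# the agent picks the action with the highest value in the current state.
q_values = {
    ("opening", "develop_knight"): 0.8,
    ("opening", "push_rook_pawn"): 0.1,
}

def act_with_values(state, actions):
    return max(actions, key=lambda a: q_values[(state, a)])

print(act_with_policy("opening"))                                  # develop_knight
print(act_with_values("opening", ["develop_knight", "push_rook_pawn"]))  # develop_knight
```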

Q-Learning and Deep-Q-Networks

Q-Learning is one of the most popular value-based reinforcement learning techniques. Conceptually, Q-Learning algorithms focus on learning a Q-Function that scores each state-action pair. A Q-Value represents the expected long-term reward, assuming the agent takes a perfect sequence of actions starting from a specific state.

One of the main theoretical artifacts behind Q-Learning is known as the Bellman Equation, which states that "the maximum future reward for a specific action is the current reward plus the maximum reward for taking the next action". That recursive rule seems to make a lot of sense, but it runs into all sorts of practical issues.
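In code, that rule usually appears as the standard tabular Q-Learning update. Here is a minimal sketch with a made-up toy transition; `alpha` is the learning rate and `gamma` discounts future rewards:

```python
# Tabular Q-Learning update derived from the Bellman Equation:
#   Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))
from collections import defaultdict

q = defaultdict(float)      # Q(s, a), initialized to 0
alpha, gamma = 0.1, 0.99    # learning rate and discount factor

def q_update(state, action, reward, next_state, actions):
    """One Q-Learning step for a single observed transition."""
    best_next = max(q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    q[(state, action)] += alpha * (td_target - q[(state, action)])

# Toy transition: from 'start', taking 'right' gave reward 1 and led to 'goal'.
q_update("start", "right", 1.0, "goal", ["left", "right"])
print(q[("start", "right")])
```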

The main challenge with Q-Learning and the Bellman Equation comes down to the computational cost of estimating the rewards for all state-action combinations. That cost quickly gets out of control in problems involving even a modest number of states. To deal with this challenge, there are techniques that approximate the Q-Function instead of learning an exact one by evaluating every possible Q-Value.

One of the biggest breakthroughs in Q-Learning came from Alphabet's subsidiary DeepMind, which used deep neural networks to estimate the Q-Values of all possible actions for a given state. This technique is called Deep-Q-Networks and has become one of the best-known forms of Q-Learning. I will deep dive into Deep-Q-Networks in a future post.
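As a rough sketch of the idea (assuming PyTorch and made-up state/action dimensions; a full DQN also needs experience replay and a target network), the network takes a state as input and outputs one Q-Value per possible action:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions, for illustration only.
STATE_DIM, NUM_ACTIONS = 4, 2

# The Q-network maps a state to one Q-Value per action.
q_net = nn.Sequential(
    nn.Linear(STATE_DIM, 64),
    nn.ReLU(),
    nn.Linear(64, NUM_ACTIONS),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def td_update(state, action, reward, next_state, done):
    """One gradient step toward the Bellman target for a single transition."""
    q_value = q_net(state)[action]
    with torch.no_grad():
        target = reward + gamma * q_net(next_state).max() * (1.0 - done)
    loss = nn.functional.mse_loss(q_value, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Toy transition with random tensors, just to show the shapes involved.
s, s2 = torch.rand(STATE_DIM), torch.rand(STATE_DIM)
td_update(s, action=1, reward=torch.tensor(1.0), next_state=s2, done=torch.tensor(0.0))
```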


Jesus Rodriguez

CEO of IntoTheBlock, President of Faktory, President of NeuralFabric and founder of The Sequence, Lecturer at Columbia University, Wharton, Angel Investor...