The Bandit Problem in Reinforcement Learning

Reinforcement learning is becoming one of the most popular disciplines in modern artificial intelligence(AI). even though many experts believe that reinforcement learning is far from becoming mainstream, it is hard to ignore the large number of cognitive scenarios that can be address with some variation of reinforcement learning techniques. In recent months, the release of tools and frameworks such as OpenAI Gym or DeepMind Lab are helping AI researchers to streamline reinforcement learning tasks but many challenges still remain. One of the most well know challenges in reinforcement learning is known as the Exploration-Exploitation Dilemma and is illustrated in computer science literature in a series of techniques known as the bandit problems.

To understand the relationship between bandit problems and reinforcement learning let’s start with a quick recap of the latter (you can see previous articles in this blog about reinforcement learning). Conceptually, reinforcement learning learning helps AI agents to learn a specific environment via a combination of exploration, reward and punishment. Imagine playing a game without knowing the rules and, after a certain number of moves, learning that you’ve won or lost. That’s the essence of reinforcement learning.

AI theory covers two fundamental types of reinforcement learning algorithms: active and passive. In passive reinforcement learning, an AI agent is assumed to know what to do but it must learn the “utility” associated with the different states of the environment. An active reinforcement learning agent must learn what to do as well as the utility of the states. The concept of “exploration” plays a key role in active reinforcement learning scenarios. By exploration we are referring to the task of an AI agent learning as much as possible about an environment in order to behave in it. The concept of exploration is directly related to the bandit problems.

The Exploration-Exploitation Dilemma and the Bandit Problem

Exploration is a key piece of reinforcement learning based agents but also a really expensive one. By exploring an environment, an AI agent can improve its knowledge of it but it doesn’t directly maximizes the reward or utility of its specific state. At any given point, a reinforcement learning agent faces the tradeoff between exploration to maximize its long-term knowledge and exploitation to maximize its reward. Let’s call that friction the Exploration-Exploitation Dilemma and is well represented in AI theory with the bandit problems. Here is textbook example:

In a casino, a one-armed bandit is a slot machine. An n-armed bandit has n levers. A gambler can insert a coin, pull a lever and collect the winnings(if any). At any give state, the gambler must decide which lever to play on each successive coin; the one with the best payoff or the one that has not been tried yet? With one decision, the gambler will choose to maximize the reward based on its current knowledge of the environment at the risk of never finding an optimal strategy( exploring new levers) while the other strategy optimizes the long term knowledge of the environment but sacrifices immediate rewards.

The Exploration-Exploration-Exploitation Dilemma is present in many well-known active reinforcement learning scenarios such as stock trading strategies or budget allocation. In order to avoid the heavy computation burden of finding a completely optimal strategy to solve the Exploration-Exploitation Dilemma, many reinforcement learning algorithms rely on heuristics that make assumptions about the knowledge. More about that in a future post…

CEO of IntoTheBlock, Chief Scientist at Invector Labs, I write The Sequence Newsletter, Guest lecturer at Columbia University, Angel Investor, Author, Speaker.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store