Reinforcement Learning

Michael Notter @ EPFL Extension School · 16 minutes

How AIs can use punishment and reward as a motivation – just like us

Reinforcement learning is one of the three main categories of machine learning. In reinforcement learning, the machine ‘lives’ in an environment and learns through its own behavior how to make the right decisions to achieve a specific goal. The learning strategy behind this approach is very similar to how we humans learn to make our decisions.

A short example

Let’s consider a child learning how to ride a bicycle. She will try things out and fall multiple times, in all directions, before finally managing to move forward for a few metres. Falling down is scary and painful, so she will avoid actions that lead to such outcomes in the future. Riding forward, on the other hand, is a thrilling experience, so she will perform actions that lead to this outcome more often.

In other words, actions that lead to a reward will be reinforced, and actions that lead to painful and negative experiences will be punished or avoided.

It is important to note that reaching the reward is not about just one single individual action. It is much more about the right sequence of actions and decisions: getting on the bike, shifting the balance on the seat to avoid a fall, pedalling smoothly with both legs at the right speed, steering the wheel to keep the balance and avoid obstacles.

Being able to ride the bike means the child has learned through experience how to behave in her environment to ride the bike (i.e. to get the reward). She now has some notion of how riding a bike works: what to avoid, how important keeping her balance is, what will happen if there is not enough speed, and so on. In other words, she has learned some rules (or policies) about her environment.

It’s important to highlight that the child learned these properties of the environment without needing to understand the exact physical concepts of gravitation, motion, trajectories, etc. It’s not about finding the ‘true’ rules of the environment, but about learning strategies to survive and behave optimally in that environment.

What makes reinforcement learning special?

Reinforcement learning is quite different from the other two categories of machine learning. Rather than finding the right labels (supervised learning) or an underlying structure in the data (unsupervised learning), reinforcement learning is about finding the right sequence of actions to reach a goal. While training data is available from the start in both supervised and unsupervised learning, that’s not the case with reinforcement learning. Instead, the AI creates its own data by being an active participant in an environment of reward and punishment. Through time and experience, the AI learns which actions eventually lead to a good or a bad outcome.

How reinforcement learning works

At its core, reinforcement learning is about an agent learning, through trial and error, how to perform a sequence of actions in an interactive environment to achieve a specific goal. Before each action, the agent can observe the current state of itself and the environment and, based on that and its previous experiences, decide what to do next.

The notion of trial and error is very important for this kind of learning. The goal for the agent is not to fully understand the environment or find the best solution – the environment is usually too complex and too full of uncertainty for that, so it’s not about learning the single best sequence to solve a problem – but rather how to react to an ever-changing environment. It is much more about finding the best strategy (also called a policy¹) at any given point in time, depending on the current situation the agent finds itself in.
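The observe–act–reward loop described above can be sketched in a few lines of code. Everything here is illustrative – a made-up toy “corridor” environment rather than any specific RL library – but the loop structure (observe the state, choose an action, receive a reward) is the core of every reinforcement learning setup:

```python
import random

# A toy 1-D "corridor" environment: the agent starts at position 0 and
# receives a reward of +1 only when it reaches position 5 (the goal).
class CorridorEnv:
    def __init__(self, goal=5):
        self.goal = goal
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):              # action: -1 (left) or +1 (right)
        self.state = max(0, self.state + action)
        done = self.state == self.goal
        reward = 1.0 if done else 0.0    # sparse reward at the goal only
        return self.state, reward, done

# The agent's trial-and-error loop: observe, act, receive a reward.
env = CorridorEnv()
state = env.reset()
total_reward = 0.0
for _ in range(100):                     # cap the episode length
    action = random.choice([-1, 1])      # a purely random policy, for now
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

A learning agent would replace the random `choice` with a policy that is gradually updated from the rewards it receives.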

For example, a young bird learning to steal bread crumbs in the park needs to understand that some food vendors might react differently when they are approached, while some customers might have pets with them, which requires a different strategy altogether. And depending on the weather and past experiences, certain days might prove to be more rewarding than others.

However, where this analogy breaks down is that a reinforcement learning AI would explore all potential paths and actions. In full trial and error fashion, such an AI bird would curiously fly towards a cat, dive to the bottom of a lake, or fly away from the bread crumbs to see if that increases the chance of acquiring food. By doing so, the AI bird might also discover some unique and innovative solutions. For example, stealing money from people in the park and bringing it to the vendor might lead to the biggest reward of receiving a complete hot dog and not just some bread crumbs.

It is this balance between the exploration of uncharted territory and exploitation of current knowledge (e.g. the food vendor has food) that can lead to very strong reinforcement learning models.

Early steps of reinforcement learning

One of the challenges for reinforcement learning is the availability of a useful or realistic environment where an agent can go through millions or even billions of iterations to learn the right sequence of decisions.

It therefore shouldn’t come as a surprise that the first famous examples of reinforcement learning are connected to video games. In a video game, the actual environment is already simulated and controlled by the computer. Having a fully computerized and digital environment means that each iteration of trial and error can be executed as quickly as the computer allows and not be limited by the speed of a human playing the video game.

In 2013, the British AI company DeepMind introduced a reinforcement learning algorithm that was able to play Atari video games – most of the time much better than humans. This was a remarkable achievement. And what was particularly amazing was how the same AI system was able to learn how to play not just one Atari video game, but multiple ones! Pong, Breakout, Space Invaders – with the same learning strategy, one single AI was able to learn all of these video games at a superhuman level.

Even more remarkable is how the AI learned to do these tasks in the first place, because the only information the AI receives is the pixel values on the screen and whether the game is over. The goal of the task was to play for as long as possible: the longer it played, the higher the reward. So the AI learned to avoid ‘game over’ and, as an indirect consequence, learned how to increase the game’s score. That’s it! The AI does not receive any particular information about the score, which objects on the screen represent enemies, what effect moving left or right has, and so on.

Initially, therefore, the AI does random things: moving in all directions, trying action 1 (e.g. shooting) or action 2 (e.g. jumping) in random sequences. After much trial and error, the AI might learn that when the pixels in the top corner change (i.e. the score goes up, which is a positive reward), that is a good thing. However, the AI does not understand what scores are or how to read them – it just learns what to look for in its environment to increase the chance of an eventual reward. Much like the child learning to ride a bike, it does not need to know the full physics behind the task to get the rewarding thrill of moving forward.
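The way such an agent turns trial and error into knowledge can be sketched with the Q-learning update rule. This is a simplified, tabular version – DeepMind’s Atari agent used a deep neural network rather than a lookup table, and the state and action names below are invented for illustration – but the underlying idea of nudging value estimates towards observed rewards is the same:

```python
# A simplified tabular sketch of learning from trial and error.
alpha, gamma = 0.5, 0.9   # learning rate and discount factor
q = {}                    # q[(state, action)] -> estimated long-term value

def update(state, action, reward, next_state, actions):
    # Best value the agent currently believes it can get from the next state.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Nudge the estimate towards "reward now + discounted future value".
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)

# One experience: in (invented) state "s0", action "right" yielded reward 1.
update("s0", "right", 1.0, "s1", actions=["left", "right"])
```

After millions of such small updates, actions that tend to precede rewards accumulate high estimated values, even though the agent never “understands” the game.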

Thanks to this and other success stories, DeepMind was acquired by Google the following year for a staggering $500 million.

But this was only the start of DeepMind’s success story and the beginning of impressive reinforcement learning AI systems. In April 2016, DeepMind’s AlphaGo beat world champion Lee Sedol in the game of Go, winning four games out of five.²

Go is a strategic board game similar to chess, but it is considered much more complicated due to the staggering number of possible moves at each turn. For a long time, it was considered impossible for an AI to play Go well, as the game requires creativity and strategic thinking.

However, the environment and rules of this board game can easily be simulated on a computer. DeepMind’s researchers were able to expose AI agents to this environment, but letting the AI play against humans to train its strategies was far too slow. That’s why DeepMind came up with an impressive little tweak: letting the AI agent play against itself! In doing so, AlphaGo was able to train on millions of games against itself to find the initial strategies and approaches that eventually let it beat the best human player in the world.

Application of reinforcement learning

While the reinforcement learning approach has been shown to work very well for certain research and video game problems, the transition to everyday life and industry is still rather slow. Again, this is due to the fact that simulating the environment and iterating through trial and error is not always feasible or possible. Nonetheless, reinforcement learning has already led to impressive AI agents influencing the world around us, as can be seen in the following examples:

Autonomous cars

One of the more obvious examples would be self-driving cars. In such a framework, the AI agent’s reward could be to put safety first, to minimize ride time, or to maximize a driver’s comfort, for example. While we definitely cannot put thousands of cars into a city and just “let them learn the right behavior”, we can, however, use reinforcement learning to train AI models to develop autonomous driving capabilities.

Initially, such AI agents are trained in a simulated computer environment that closely mirrors the roads and cities of the real world. While the simulation runs on a computer, the input data these AI agents receive is recorded sensor information from real-world cars driven by humans. Although it would take an individual a long time to record sufficient data, a big car manufacturer with thousands of cars on the road could collect millions of hours of driving examples very quickly. Using this data from human drivers, the agent can become good enough to eventually replace the human driver – first in some situations, then in most, before finally taking over in all of them.

The reason we would use reinforcement learning for such a problem – and not a supervised learning approach – is simply that conditions on the road change constantly, and reinforcement learning models are masters at adapting to environmental changes. We don’t need a system that learns the best single action for most situations – we need an agent capable of performing the best sequence of actions given the current state of the environment, i.e. the sequence that leads to the highest reward and thus the best outcome.

Assembly line

Reinforcement learning agents are also very powerful in assembly lines or instances where robots are used. Imagine a robot in a supermarket warehouse that needs to fetch items from storage to put them into a cardboard box which then goes out to the customers. Each item that the robot must pick up has a different shape, density and weight, meaning the grip of the robot’s arm needs to be adjusted to prevent dropping or breaking the item.

There are multiple reasons why this framework calls for reinforcement learning. Initially, the robot arm can be trained in a simulated computer environment where the robot just learns how to pick up things of different shapes. Subsequently, the arm can test its skills in the physical world with a few example items. While iteration in the physical world is usually much slower than we would like, the fact that we can train the same AI agent on a few hundred robot arms in parallel – plus the fact that each grab of an item takes only a few seconds to execute – allows us to run millions of iterations each day and arrive at a functioning AI model quickly.

Other applications

Using reinforcement learning, drones are able to learn how to navigate difficult terrain, trains learn how to find the optimal schedule to manage traffic on complex rail networks, and tech companies like Google are able to reduce the cooling costs for their data centres by 40%.

In 2020, DeepMind was able to solve a 50-year-old problem in biology with its AlphaFold AI model.³ AlphaFold is capable of predicting the shape into which a protein will fold with very high accuracy, thus providing a viable solution to the famous protein-folding problem.

And with regards to video games, reinforcement learning has come a long way since the time it played Atari games. Nowadays, AI agents learn to successfully play real-time strategy video games like StarCraft (AlphaStar), Dota 2 (OpenAI Five) and Minecraft.

The challenges with reinforcement learning

We’ve seen that the biggest challenges for reinforcement learning are the availability of a useful environment and the amount of time required to go through a sufficient amount of trial and error. The solution to these problems is to initially train the AI agent in a simulated, simplified version of the actual environment before bringing it into the real world.

Unexpected behavior and unwanted solutions

However, this is not the only challenge we face when training a reinforcement learning agent. As mentioned, reinforcement learning works by rewarding or punishing certain behavior. Human involvement during this learning process is limited to defining the environment and specifying the system of rewards and penalties. So if our environment is faulty, or our definition of the reward system is sub-optimal, then the agent might discover these issues and as a consequence come up with an unexpected solution.

Let’s return to the example of a child learning how to ride a bike. If she is told she will receive a reward if she doesn’t fall down, her solution might be to never get on the bike in the first place – and thus never fall down. Or if we tell her that the size of the reward depends on the distance the bike travels, she might exploit the loophole by carrying the bike to a train station and putting it on an overnight train to the other side of the continent.

Every parent has learned this lesson: clear instructions are crucial, especially when it’s about a lucrative reward. It’s no different for AI agents. Even if we don’t know how the optimal goal should be achieved, by clearly specifying what behavior should take place (e.g. the child needs to be sitting on the bike and the bike needs to move forward due to the child’s own muscle motion) and by specifying what behavior will be punished (e.g. standing still), the AI agent can hopefully be guided along the intended path.
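The loophole from the bike example can be made concrete in code. The sketch below is purely illustrative – the reward functions and their parameters are invented for this example – but it shows how a loosely specified reward admits a degenerate “never get on the bike” solution, while a tighter specification rules it out:

```python
# A loosely specified reward: "you get a reward if you don't fall".
# An agent that never gets on the bike also never falls, so it collects
# the full reward while doing nothing useful.
def loose_reward(fell):
    return 0.0 if fell else 1.0

# A tighter specification: the child must be on the bike, the bike must
# move forward, and the motion must come from her own pedalling.
def tighter_reward(fell, on_bike, moved_forward, self_propelled):
    if fell or not on_bike:
        return 0.0
    return 1.0 if (moved_forward and self_propelled) else 0.0

# The degenerate strategy scores under the loose reward...
lazy = loose_reward(fell=False)
# ...but earns nothing once the behavior itself is specified.
lazy_tight = tighter_reward(fell=False, on_bike=False,
                            moved_forward=False, self_propelled=False)
```

Designing the reward so that only the intended behavior scores is, in practice, one of the hardest parts of setting up a reinforcement learning problem.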

Delayed reward

Another challenge in reinforcement learning is the issue of delayed rewards. Each action an AI agent takes influences not only the immediate reward that the agent receives, but also the next environment state and, through that, all subsequent rewards. This makes it very difficult for the AI to establish which sequence of actions eventually leads to the highest reward.

One solution to this problem is a huge amount of training, in which the agent explores many different behaviors to (hopefully!) eventually stumble upon something that gives the first reward signal and thus ‘gets the ball rolling’ in the right direction.
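A standard formal tool for crediting early actions for later rewards – not spelled out in this article, but used by virtually all reinforcement learning algorithms – is the discounted return: rewards far in the future count for less, controlled by a discount factor gamma between 0 and 1. A minimal sketch:

```python
# Discounted return: sum rewards backwards, shrinking future rewards by
# gamma at every step. With gamma < 1, a reward that arrives after many
# steps contributes only a small amount to the value of the first action.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# A sparse, delayed reward: nothing for nine steps, then a reward of 1.
rewards = [0.0] * 9 + [1.0]
value = discounted_return(rewards)   # 0.9 ** 9, roughly 0.387
```

The delayed reward still reaches back to the first action, just weakened – which is exactly what lets the agent eventually link early decisions to the final payoff.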

Exploration versus exploitation

Another important consideration when training an AI agent is striking the right balance between exploration and exploitation. If an agent only applies what it has tried in the past and knows to be effective, it might never realize there are even better strategies leading to even bigger rewards. Conversely, if the agent explores too many different strategies, it might never settle on the combination of actions that reliably leads to a reward. Ultimately, the AI agent needs to exploit the actions it knows are good while also exploring new ones, to make sure it doesn’t miss out on better strategies.
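The simplest and most common way to strike this balance is the epsilon-greedy rule: with a small probability epsilon, explore a random action; otherwise exploit the action currently believed to be best. The value estimates below are made-up numbers, purely for illustration:

```python
import random

# Epsilon-greedy action selection.
def epsilon_greedy(action_values, epsilon=0.1):
    if random.random() < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(len(action_values))
    # Exploit: pick the action with the highest estimated value.
    return max(range(len(action_values)), key=lambda a: action_values[a])

values = [0.2, 0.8, 0.5]    # estimated value of each of three actions
choice = epsilon_greedy(values, epsilon=0.1)   # usually action 1
```

In practice, epsilon is often started high (lots of exploration early on) and decayed over training, so the agent gradually shifts from exploring to exploiting what it has learned.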

Technical challenges

Additionally, there are also some technical challenges that can arise when training a reinforcement learning agent – most notably the amount of training needed to develop a useful AI agent. In the case of the Atari games, for example, the agent needs to go through over 20 million timesteps to solve the task, which corresponds to around 80 hours of gameplay. This makes experimentation with different agents and reward and punishment strategies quite cumbersome.

As a consequence of this, it is often necessary to dedicate a large amount of computing power to training an agent. This makes reinforcement learning very resource-hungry. Let’s look at an extreme case: OpenAI Five is an agent able to play the online multiplayer game Dota 2 at the level of a human world champion. However, this agent took 10 months to train – during which it experienced about 45,000 years of gameplay – running on 256 GPUs and over 128,000 CPU cores in parallel. As a consequence, the cost of training such an agent runs into the millions of dollars.

Unknown randomness

Finally, reinforcement learning algorithms are still brittle. Even the most reliable algorithms, implemented bug-free by experts, will sometimes fail to learn a good strategy. This can happen simply because the random initialization of the model parameters before training started was unfavorable. In supervised learning, an unlucky random initialization may cause training to converge more slowly; in reinforcement learning, there is a real chance that training will not converge at all.


In contrast to the other two machine learning categories, reinforcement learning is still a rather young discipline, mostly used in research settings. But as the examples in this article show, where reinforcement learning can be applied, the resulting AI agents are very strong.

Even if the fruits of these approaches are not yet very plentiful, this kind of machine learning speaks to us much more than others – probably because it is very similar to how we learn to navigate the world ourselves, where good actions are usually rewarded and bad ones punished.

It is also these types of AI models that currently seem to be the closest to what one could call a machine capability of creativity. From time to time, an AI agent might find a legitimate and acceptable solution to a problem that is completely different from what experts thought possible, reasonable or useful. For example, in the case of the board game Go, the AI agent AlphaGo has shown unexpected and innovative new strategies to play the game.

One thing is clear: this is just the beginning of reinforcement learning! And we can expect to hear much more about this new and exciting category of machine learning in the years to come.

  1. A policy defines which action the agent should choose in a given situation. 

  2. For more on the story of how AlphaGo won against Lee Sedol, check out the corresponding Wikipedia article or the AI developers’ homepage

  3. For more on AlphaFold, check out the corresponding Wikipedia article or the AI developers’ homepage
