Understanding Reinforcement Learning

Reinforcement Learning is a set of methods inspired by psychology that focuses on learning behavior from rewards. In the general setting, a virtual agent or robot receives information about the state of the environment, performs an action and obtains some amount of reward. These three steps repeat forever or until some final state is reached. The goal of the agent is to learn which actions lead to the highest benefit in both the short and the long run.

Basic Intuition

Let's start with a common diagram:

At each time step the Agent receives the State of the Environment, selects an Action and gains a Reward.
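This loop can be sketched in a few lines of code. The `LineWorld` environment below is a made-up toy for illustration, not a real library: the agent walks along a line and is rewarded for reaching position 5.

```python
import random

class LineWorld:
    """Toy environment: the agent walks along a line, reward at position 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):
        # action is -1 (step left) or +1 (step right); a wall keeps position >= 0
        self.position = max(0, self.position + action)
        reward = 1.0 if self.position == 5 else 0.0
        done = self.position == 5
        return self.position, reward, done

# The agent-environment loop from the diagram, with a purely random agent:
env = LineWorld()
total_reward, done = 0.0, False
for t in range(1000):                # cap the episode length
    action = random.choice([-1, 1])  # the agent's decision
    state, reward, done = env.step(action)
    total_reward += reward
    if done:
        break
```

The `step` method plays the role of the Environment box in the diagram: it consumes an Action and returns the next State, a Reward and a flag marking the final state.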

State s may represent the positions of the chess pieces, a video game screen, sensor readings from a factory machine, self-driving vehicle cameras or any other real or virtual inputs of the agent. Overall, the State contains all the data about the environment that is available to the agent.

Action a may be a key press in a game, a robot movement, a change in a system's parameters or any other available output.

As a result, the Agent collects some reward r that shows how well it performs. Usually it is a single number, where positive values signal good behavior, negative values signal bad behavior, and 0 represents the absence of reward.

During training the agent learns a policy π that determines which action to take in the current state. A common approach is to learn from (state, action, reward) triplets and follow a greedy tactic, which simply means taking the most profitable action at each step. Usually the expected reward for a particular action also includes future rewards that may be obtained. This allows the agent to take the far-reaching consequences of its current actions into account, but their effect decreases according to a discount rate γ.
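The discounting idea can be shown with a tiny helper (a sketch, not library code): the reward at time step t is weighted by γ to the power t, so rewards far in the future matter less.

```python
def discounted_return(rewards, gamma=0.9):
    """Total reward where each later reward is scaled down by gamma per step."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))

# A reward of 1 received three steps in the future is worth gamma**3 today:
discounted_return([0, 0, 0, 1], gamma=0.9)  # about 0.729 (= 0.9**3)
```

With γ close to 1 the agent cares a lot about the long run; with γ close to 0 it becomes short-sighted.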

An example of a policy for a simple robot:

  • If there is an obstacle ahead, turn to the right
  • Otherwise, move forward
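Written as code, this policy is just a function from state to action. The `obstacle_ahead` sensor reading and the action names are made up for illustration:

```python
def robot_policy(state):
    """The two-rule policy above: avoid obstacles, otherwise go forward."""
    if state["obstacle_ahead"]:
        return "turn_right"
    return "move_forward"

robot_policy({"obstacle_ahead": True})   # "turn_right"
robot_policy({"obstacle_ahead": False})  # "move_forward"
```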

You may find some interesting online demos of Reinforcement Learning in the playground.


Before training the agent may move in one direction all the time, but given some reward for avoiding barriers it learns when and how to do that. Additionally, to force more exploration of potential behaviors, most algorithms include a probability ε of taking a random action. In the beginning almost any motion may be picked by chance, but over time the agent learns which of them provide the biggest rewards.
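This exploration trick is usually called ε-greedy action selection. A minimal sketch (actions are indexed 0, 1, 2, … and `q_values[a]` is the estimated reward of action a):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon explore: pick a random action index.
    Otherwise exploit: pick the action with the highest estimated reward."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the choice is purely greedy:
epsilon_greedy([0.1, 0.5, 0.2], epsilon=0.0)  # 1
```

In practice ε often starts near 1 (almost pure exploration) and decays toward a small value as the agent's estimates become trustworthy.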

The environment type is equally important. Many virtual environments have the Markov property: the subsequent state depends only on the current one. It means that to predict the next state you don't need to remember the previous history. However, some abstract and most real-world settings are not so simple.

To be able to construct plans, the agent may have an internal model of the world. This model predicts the possible next states s' given the previous state s and the selected action a. In Markov settings this model may be quite accurate and allow planning far ahead. Agents often learn internal models as Probabilistic Graphical Models, where nodes represent state variables. On the contrary, some environments are so straightforward that no model is required, or so complex that the agent will never be able to give even short-term predictions.

Broadly, there are two general types of learning: Q-learning and policy learning.

The first is focused on estimating the function Q(s, a), which represents the expected future reward for taking action a in state s. Once you have a good approximation of Q, you can simply follow a greedy policy and get good results. This approach underlies a large share of modern RL.
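The classic tabular version of this idea nudges each Q(s, a) estimate toward the observed reward plus the discounted value of the best next action. A minimal sketch of one such update step (states, actions and parameter values are illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])

Q = defaultdict(float)  # unseen (state, action) pairs start at 0
q_learning_update(Q, s=0, a="right", reward=1.0, s_next=1,
                  actions=["left", "right"])
Q[(0, "right")]  # 0.1 — moved 10% of the way toward the target of 1.0
```

Repeating this update over many episodes propagates reward information backwards through the state space, so earlier states gradually learn the value of actions whose payoff only arrives later.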

The second is focused on a direct search for an optimal policy π*. This search skips the approximation of Q or an internal model and constructs the final state→action mapping directly. Sometimes even elementary Genetic Algorithms can find a good policy much quicker than Q-learning techniques.
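The simplest form of direct policy search is hill climbing: randomly mutate the policy's parameters and keep the mutation if it scores better. The fitness function below is a made-up stand-in for "total reward of one episode":

```python
import random

def evaluate(policy_params):
    """Hypothetical fitness: how close the parameters are to a hidden optimum."""
    target = [0.5, -0.2]
    return -sum((p - t) ** 2 for p, t in zip(policy_params, target))

def hill_climb(steps=200, noise=0.1, seed=0):
    """Direct policy search: mutate parameters, keep them if they score better."""
    rng = random.Random(seed)
    params = [0.0, 0.0]
    best = evaluate(params)
    for _ in range(steps):
        candidate = [p + rng.gauss(0, noise) for p in params]
        score = evaluate(candidate)
        if score > best:
            params, best = candidate, score
    return params, best

params, best = hill_climb()
```

A full Genetic Algorithm adds a population and crossover on top of this mutate-and-select loop, but the core idea of searching policy space directly is the same.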

While Reinforcement Learning is a broad field with hundreds of different methods, some general classes deserve at least a brief description.

  • On-policy techniques use the agent's current policy both for acting and for training. In contrast, off-policy methods learn about one policy (the target) while acting according to another (the behavior policy), which separates exploration from exploitation.
  • Actor-Critic methods combine Q-learning and policy learning to get the best of both worlds.
  • The Bayesian approach places prior beliefs on the policy and the internal model and updates them with Bayes' rule.
  • Reinforcement Learning with deep neural networks has shown great results across many different approaches in the last few years.
  • Multi-Agent Reinforcement Learning is an active area of research. Some teams have been able to teach virtual agents to communicate and solve toy problems jointly, but it is still quite far from being a practical method.


Current state-of-the-art techniques in Reinforcement Learning are able to beat the best human players in chess, Go, poker and many video games. They can also efficiently control datacenter cooling systems, and are applied in manufacturing, trading, advertising and robotics in general. DeepMind, the AI startup that developed some of those methods, was acquired by Google for a reported sum of over $600 million.

Reinforcement Learning is quite universal in its applications. Combined with neural networks, it can work with inputs like video streams and perform complex operations in the real world.




Egor Dezhic. Trying to explain complex things in simple words. edezhic@gmail.com
