Updated: Aug 6, 2021

Reinforcement learning is an area of machine learning concerned with taking suitable actions to maximize reward in a particular situation. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. Various software systems and machines employ it to find the best possible behavior or path to take in a specific situation.


Main points in reinforcement learning:

  • Input: The input should be an initial state from which the model will start.

  • Output: There are many possible outputs, as there are a variety of solutions to a particular problem.

  • Training: Training is based on the input. The model returns a state, and the user decides whether to reward or punish the model based on its output.

  • The model continues to learn.

  • The best solution is decided based on the maximum reward.

Types of Reinforcement:

There are two types of Reinforcement:

1. Positive: Positive reinforcement occurs when an event that follows a particular behaviour increases the strength and frequency of that behaviour. In other words, it has a positive effect on behaviour.

Advantages of positive reinforcement:

  • Maximizes performance.

  • Sustains change for a long period of time.

Disadvantages of positive reinforcement:

  • Too much reinforcement can lead to an overload of states, which can diminish the results.

2. Negative: Negative reinforcement is the strengthening of a behaviour because a negative condition is stopped or avoided.

Advantages of negative reinforcement:

  • Increases behaviour.

  • Provides defiance to a minimum standard of performance.

Disadvantages of negative reinforcement:

  • Provides only enough to meet the minimum behaviour.

Reinforcement Learning Workflow Overview:

In general, five different areas need to be addressed with reinforcement learning.

1. Environment: You need an environment where your agent can learn. You need to choose what should exist within the environment and whether it’s a simulation or a physical setup.

2. Reward: You need to think about what you ultimately want your agent to do and craft a reward function that will incentivize the agent to do just that.

3. Policy: You need to choose a way to represent the policy. Consider how you want to structure the parameters and logic that make up the decision-making part of the agent.

4. Training: You need to choose an algorithm to train the agent that works to find the optimal policy parameters.

5. Deploy: Finally, you need to exploit the policy by deploying it in the field and verifying the results.
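As a rough illustration of the first two areas, the sketch below defines a small simulated environment and its reward function. The class name `Corridor`, its `reset` and `step` methods, the five-cell layout, and the reward values are all assumptions made for this example and do not come from any particular library. The loop at the end uses a placeholder "always move right" policy only to check that the environment behaves as intended.

```python
class Corridor:
    """Hypothetical 1-D corridor: states 0..4, the goal is state 4."""

    def __init__(self, size=5):
        self.size = size
        self.state = 0

    def reset(self):
        # Every episode starts in the left-most cell.
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the move.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.size - 1)
        done = self.state == self.size - 1
        # Reward function: 1 for reaching the goal, 0 otherwise.
        reward = 1.0 if done else 0.0
        return self.state, reward, done


env = Corridor()
state = env.reset()
total_reward = 0.0
done = False
while not done:
    # Placeholder policy: always move right.
    state, reward, done = env.step(1)
    total_reward += reward
print(state, total_reward)
```

Returning the new state, the reward, and a done flag from `step` is one common way to structure a simulated environment; a physical setup would replace the body of `step` with sensor readings and actuator commands.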

Q-learning is one of the most commonly used model-free RL algorithms. It uses estimates of future reward to influence the action chosen in the current state, which helps the agent select the actions that maximize total reward. Its basic update rule can be written in a few lines of Python.

Q-learning algorithm:


Q-learning is an off-policy reinforcement learning algorithm that seeks to find the best action to take given the current state. It’s considered off-policy because the q-learning function can learn from actions that are outside the current policy, such as random actions, so the agent does not have to follow the policy it is learning about. More specifically, q-learning seeks to learn a policy that maximizes the total reward.


What’s ‘Q’?

The ‘q’ in q-learning stands for quality. Quality in this case represents how useful a given action is in gaining some future reward.

Create a q-table:

When q-learning is performed, we create what’s called a q-table, a matrix of shape [state, action], and initialize its values to zero. We then update and store our q-values as the agent acts. This q-table becomes a reference table from which our agent selects the best action based on the q-value.

import numpy as np

# Initialize q-table values to 0
Q = np.zeros((state_size, action_size))
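The snippet above leaves `state_size` and `action_size` undefined. The sketch below fills them in with assumed values (5 states, 2 actions) purely for illustration, and shows how the table can be read back with `np.argmax` to find the best-known action for a state.

```python
import numpy as np

state_size = 5   # assumed number of states, for illustration only
action_size = 2  # assumed number of actions, for illustration only

# One row per state, one column per action, all q-values start at 0.
Q = np.zeros((state_size, action_size))
print(Q.shape)

# Pretend the agent has learned that action 1 is useful in state 3.
Q[3, 1] = 0.5

# The best-known action for a state is the column with the largest q-value.
best_action = int(np.argmax(Q[3, :]))
print(best_action)
```

Note that `np.argmax` returns the first index when several actions tie, so on a freshly zeroed table it always picks action 0.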

Q-learning and making updates:

The next step is simply for the agent to interact with the environment and make updates to the state-action pairs in our q-table Q[state, action].

Taking Action: Explore or Exploit

An agent interacts with the environment in one of two ways. The first is to use the q-table as a reference and view all possible actions for a given state. The agent then selects the action with the max value among those actions. This is known as exploiting, since we use the information we have available to us to make a decision.

The second way to take action is to act randomly. This is called exploring. Instead of selecting actions based on the max future reward, we select an action at random. Acting randomly is important because it allows the agent to explore and discover new states that otherwise may not be selected during the exploitation process. You can balance exploration and exploitation using epsilon (ε), which sets how often you want to explore vs exploit. Here’s some rough code that will depend on how the state and action space are set up.

import random

# Set the fraction of steps on which you want to explore
epsilon = 0.2

if random.uniform(0, 1) < epsilon:
    pass  # Explore: select a random action
else:
    pass  # Exploit: select the action with max value (future reward)
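For a q-table stored as a NumPy array, the explore/exploit choice can be wrapped in a small helper. This is a minimal sketch: the function name `choose_action` and the example q-values are made up for illustration, and it assumes actions are numbered 0 to `Q.shape[1] - 1`.

```python
import random

import numpy as np


def choose_action(Q, state, epsilon):
    """Epsilon-greedy selection over the row Q[state, :]."""
    if random.uniform(0, 1) < epsilon:
        # Explore: pick any action uniformly at random.
        return random.randrange(Q.shape[1])
    # Exploit: pick the action with the highest q-value.
    return int(np.argmax(Q[state, :]))


# Hypothetical q-table with 2 states and 2 actions.
Q = np.array([[0.1, 0.9], [0.4, 0.2]])

# With epsilon = 0.0 the agent never explores, so the choice is purely greedy.
greedy = choose_action(Q, state=0, epsilon=0.0)
print(greedy)
```

Setting `epsilon=0.0` gives pure exploitation, while values close to 1 make the agent act almost entirely at random.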

Updating the q-table:

Updates occur after each step or action and end when an episode is done. Done in this case means that the agent has reached some terminal point. A terminal state can be anything like landing on a checkout page, reaching the end of some game, or completing some desired objective. The agent will not learn much after a single episode, but eventually, with enough exploring (steps and episodes), it will converge and learn the optimal q-values or q-star (Q∗).

Here are the 3 basic steps:

  1. Agent starts in a state (s1), takes an action (a1), and receives a reward (r1).

  2. Agent selects the next action either by taking the one with the highest value (max) in the Q-table or at random (epsilon, ε).

  3. Update q-values.
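Put together, the steps above can be sketched as a complete training loop. Everything about the environment here is an assumption made for the example: a hypothetical five-cell corridor where action 1 moves right, action 0 moves left, and reaching the last cell gives a reward of 1 and ends the episode. The values of `lr`, `gamma`, `epsilon`, and the episode count are likewise illustrative choices, and the update line inside the loop is the rule given in the next paragraph.

```python
import random

import numpy as np

random.seed(0)  # fixed seed so repeated runs behave the same

n_states, n_actions = 5, 2          # corridor cells; actions: 0 = left, 1 = right
lr, gamma, epsilon = 0.1, 0.9, 0.2  # illustrative hyperparameters
Q = np.zeros((n_states, n_actions))


def step(state, action):
    """Hypothetical environment: reward 1 on reaching the last cell."""
    move = 1 if action == 1 else -1
    new_state = min(max(state + move, 0), n_states - 1)
    done = new_state == n_states - 1
    return new_state, (1.0 if done else 0.0), done


for episode in range(500):
    state, done = 0, False
    while not done:
        # Select an action: explore with probability epsilon, else exploit.
        if random.uniform(0, 1) < epsilon:
            action = random.randrange(n_actions)
        else:
            action = int(np.argmax(Q[state, :]))
        # Take the action and observe the reward and the next state.
        new_state, reward, done = step(state, action)
        # Update the q-value of the state-action pair just visited.
        Q[state, action] += lr * (
            reward + gamma * np.max(Q[new_state, :]) - Q[state, action]
        )
        state = new_state

# Greedy action per state; the terminal state's row is never updated.
policy = np.argmax(Q, axis=1)
print(policy[:-1])
```

After training, the greedy action in every non-terminal state should be 1 (move right), since that is the shortest route to the reward. The first few episodes are slow because a zeroed table makes the greedy choice default to action 0, so early progress relies on exploration.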

Here is the basic update rule for q-learning:

# Update q values
Q[state, action] = Q[state, action] + lr * (reward + gamma * np.max(Q[new_state, :]) - Q[state, action])
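To see the arithmetic of a single update, the sketch below applies the rule once with made-up numbers. The values of `lr`, `gamma`, the reward, and the pre-filled q-values are assumptions chosen for easy mental arithmetic, and the names `target` and `difference` are introduced here only for readability.

```python
import numpy as np

lr, gamma = 0.1, 0.9   # assumed values, for illustration only
Q = np.zeros((3, 2))
Q[1, :] = [0.5, 2.0]   # pretend these q-values were already learned for state 1

# One hypothetical transition: in state 0, action 1 gives reward 1 and leads to state 1.
state, action, reward, new_state = 0, 1, 1.0, 1

target = reward + gamma * np.max(Q[new_state, :])      # 1.0 + 0.9 * 2.0 = 2.8
difference = target - Q[state, action]                 # 2.8 - 0.0 = 2.8
Q[state, action] = Q[state, action] + lr * difference  # 0.0 + 0.1 * 2.8 = 0.28
print(Q[state, action])
```

The old q-value of 0 moves one tenth of the way toward the target of 2.8, which is what a learning rate of 0.1 means in practice.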

In the update above there are a couple of variables that we haven’t mentioned yet. What’s happening here is that we adjust our q-values based on the difference between the discounted new values and the old values. We discount the new values using gamma and we adjust our step size using the learning rate (lr). Below are some references.

  • Learning Rate: lr or learning rate, often referred to as alpha or α, can simply be defined as how much you accept the new value vs the old value. Above we are taking the difference between new and old and then multiplying that value by the learning rate. This value then gets added to our previous q-value which essentially moves it in the direction of our latest update.

  • Gamma: gamma or γ is a discount factor. It’s used to balance immediate and future reward. From our update rule above you can see that we apply the discount to the future reward. Typically this value can range anywhere from 0.8 to 0.99.

  • Reward: reward is the value received after completing a certain action at a given state. A reward can happen at any given time step or only at the terminal time step.

  • Max: np.max() comes from the numpy library. Here it takes the maximum q-value available in the new state, which serves as the estimate of future reward, and that value is discounted and added to the reward for the current step. What this does is let the possible future reward influence the value of the current action. This is the beauty of q-learning. We’re allocating future reward to current actions to help the agent select the highest return action at any given state.
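The effect of gamma is easier to see on a concrete reward sequence. The sketch below weights the reward received t steps in the future by gamma to the power t and sums the result; the helper name `discounted_return` and the four-step episode are hypothetical, chosen only to compare the two ends of the range mentioned above.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards where the reward at step t is weighted by gamma**t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


# Hypothetical episode: no reward until the final step.
rewards = [0.0, 0.0, 0.0, 1.0]

for gamma in (0.8, 0.99):
    print(gamma, discounted_return(rewards, gamma))
```

With gamma at 0.8, a reward three steps away is worth roughly 0.51 now, while at 0.99 it is worth roughly 0.97, so a higher gamma makes the agent care more about distant rewards.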






