Q-Learning: Tabular to Neural Networks
Rahul Dass
ECE 753: Project Report
May 1, 2018
1. Introduction
In the field of Machine Learning, there are three primary forms of learning: Supervised Learning (SL),
Unsupervised Learning and Reinforcement Learning (RL). SL has been the most extensively studied and
commonly forms the basis of classification techniques. In such domains, a classification algorithm needs
large amounts of data upfront to achieve any measurable success. However, more often than not, we do not
have access to such data. In RL, by contrast, an agent experiments with its environment through trial and
error in an attempt to discover an optimal way to achieve a goal. This requires a reward mechanism in which
good decisions are rewarded and bad decisions are punished. The agent then looks for a solution that
maximizes the reward obtained and minimizes the punishment. In this project, we explore two environments
within OpenAI gym, FrozenLake-v0 and CartPole-v0, to experiment with an RL agent following an algorithm
called Q-learning.
2. Q-Learning
In this project, we implemented Q-learning, an algorithm that belongs to a family of RL algorithms called
Model-Free Learning (MFL). The agent experiences a sequence of tuples <s_t, a_t, r_t, s_{t+1}>, each representing
<current state, current action, immediate reward, next state>, generated by the transition function and the
expected reward sequences. The transition function describes how the agent moves from state to state within
its environment. The expected reward sequence is what any RL agent is tasked with maximizing as it
transitions to a goal state.
When the agent takes action a_t in state s_t, it updates its estimate of the value of taking a when it was
in s. The value of a state-action pair <s_t, a_t> is known as the Q-value. In MFL, the algorithm uses the
agent’s experience tuples to build an estimate of the Q-function, and every time a new experience comes
along, we improve that estimate. The current best guess of the Q-values then tells the agent which action
is best to take next, namely a_{t+1}.
Q-learning is an algorithm that can help an agent learn its long-term expected rewards. There are a few
variations of the algorithm, but they all share the same update mechanism for the Q-values, governed by the
Bellman Equation. The following is the pseudo-code for the general algorithm:
1. Initialize Q
2. From s, choose a to maximize Q(s, a) (or explore)
3. Observe the transition <s, a, r, s′>:
a. ΔQ(s, a) = α_t (r + γ max_{a′} Q(s′, a′) − Q(s, a))
NB: every Q above is an estimate Q̂ of the true Q-function
4. Decay α_t
Step 1: We begin by initializing the Q-values. Section 3 discusses this further for the two different
implementations of the Q-learning algorithm. Regardless, the initial Q-values will most likely be
incorrect, otherwise the agent would not need to learn. However, it is important to note that initial
guesses that are too high or too low change how the algorithm behaves.
Step 2: The agent acts. At each step of Q-learning, the agent is in some state and asks which action is the
best one to take. Naively, we could “look up” the action with the highest estimated Q-value and simply take
it. However, the Q-function is usually not correct in the beginning, so always following this strategy
yields experience that does not help the agent optimize its rewards. Suppose the Q-function holds the
“belief” that a particular action has a low value: the agent would never choose that action, and so it would
never learn whether the belief was wrong. To correct such beliefs, the agent needs to explore.
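One common way to balance looking up the best action against exploring is an epsilon-greedy rule. The report does not state which exploration strategy was used, so the sketch below is an assumption for illustration only; the function name `epsilon_greedy` and its parameters are hypothetical.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index from one row of Q-value estimates.

    q_row   : list of estimated Q-values for the current state, one per action
    epsilon : probability of exploring, i.e. choosing a uniformly random action
    rng     : source of randomness (the random module or a random.Random instance)
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))  # explore
    best = max(q_row)
    # Break ties at random so an all-zero row does not always yield action 0.
    candidates = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(candidates)  # exploit the current estimate
```

With `epsilon = 0` the agent always trusts its current estimate; with `epsilon = 1` it ignores the estimate and acts at random.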
Step 3: As the agent builds its experience, represented by tuples <s, a, r, s′>, we use that information to
update Q(s, a) using the Bellman Equation: ΔQ(s, a) = α_t (r + γ max_{a′} Q(s′, a′) − Q(s, a)), where Q(s, a)
is the value of the state-action pair that the agent just took (this is what we are learning about). The
Q-value is moved by a small amount, set by the learning rate α_t, toward the target r + γ max_{a′} Q(s′, a′):
the immediate reward r plus the discounted estimated value of the next state. Subtracting Q(s, a), the
Q-value of the state-action pair that we just left, gives the size of the correction.
This difference is key: it compares how valuable we thought the state-action pair (s, a) was with how
valuable we should now think it is, given that we just received r and transitioned to the next state. Thus,
this difference represents an error signal. Should the two be exactly the same, then
r + γ max_{a′} Q(s′, a′) − Q(s, a) = 0, implying that ΔQ(s, a) = 0 and the estimate does not change at all.
If the original estimate was too high, the difference is negative and the update, scaled by the learning
rate, decreases the Q-value for (s, a). If the original estimate was too low, the difference is positive and
the update increases it.
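The update and its error signal can be written as a few lines of code. This is a minimal sketch, assuming the Q estimates are stored as a list of rows (one row per state, one entry per action); the name `q_update` and the numbers in the example are hypothetical.

```python
def q_update(q, s, a, r, s_next, alpha, gamma):
    """Apply one Q-learning update in place and return the error signal."""
    # Error signal: target (r + gamma * best next value) minus the current estimate.
    td_error = r + gamma * max(q[s_next]) - q[s][a]
    q[s][a] += alpha * td_error
    return td_error


# Worked example with two states and two actions.
q = [[0.0, 0.0],
     [0.5, 1.0]]
error = q_update(q, s=0, a=1, r=0.0, s_next=1, alpha=0.8, gamma=0.95)
# error is 0 + 0.95 * 1.0 - 0 = 0.95 (the old estimate was too low),
# so Q(0, 1) rises from 0 to approximately 0.8 * 0.95 = 0.76.
```

A positive error raises the estimate, a negative error lowers it, and an error of zero leaves it unchanged.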
Step 4: We also need to decide what to do with α_t. As we shall see in Section 3, we can change this
quantity over time, and perhaps we should if we would like certain convergence guarantees.
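One possible way to decay α_t is a simple inverse-time schedule. The report does not specify a schedule, so the form below, the name `decayed_alpha` and the `decay` constant are assumptions chosen purely for illustration.

```python
def decayed_alpha(alpha0, t, decay=0.01):
    """Inverse-time decay: alpha_t = alpha0 / (1 + decay * t).

    alpha0 : initial learning rate
    t      : number of updates (or episodes) completed so far
    decay  : hypothetical constant controlling how quickly alpha shrinks
    """
    return alpha0 / (1.0 + decay * t)


# Starting from 0.8, the rate has halved after 100 steps when decay = 0.01.
schedule = [decayed_alpha(0.8, t) for t in (0, 100, 1000)]
```

A schedule of this shape shrinks slowly enough that the steps still sum to infinity while their squares do not, which is the usual condition on the learning rate in convergence results for Q-learning.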
In conclusion, suitable choices for steps 1, 2 and 4 (in particular, continuing to try every state-action
pair and decaying α_t appropriately) guarantee that Q(s, a) → Q*(s, a) in the limit, i.e. our estimate of
the Q-function converges to the optimal solution Q*(s, a) of the Bellman Equation.
ΔQ(s, a) = α_t (r + γ max_{a′} Q(s′, a′) − Q(s, a)) is quite a powerful equation. In one line of code, it
enables us to solve the Bellman Equation given enough time and experience. A disadvantage of this algorithm
is that it takes time to find the optimal Q-values. To avoid being stuck in a local optimum, the learning
algorithm has to take some actions that may seem pointless or a waste of time, but the agent continually
updates the Q-values, and so it is learning.[1]
This form of learning is known as off-policy: the agent learns about the optimal behavior even though it is
not necessarily behaving optimally. The update equation takes the max over all actions a′ in the next state
s′, even though the agent may not actually take that best action because of the exploratory or random
nature of its behavior. Nonetheless, the updated Q-values converge to those of the optimal policy, even
though that is not the policy that the agent is following.[5-8]
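The off-policy idea can be made concrete by comparing the Q-learning target with the target of an on-policy method such as SARSA, which bootstraps from the action actually taken next. The sketch below is illustrative only; the function names and the numbers are hypothetical.

```python
def q_learning_target(q, r, s_next, gamma):
    """Off-policy target: assumes the best action is taken in s_next."""
    return r + gamma * max(q[s_next])


def sarsa_target(q, r, s_next, a_next, gamma):
    """On-policy target: uses the action a_next that the agent really takes."""
    return r + gamma * q[s_next][a_next]


# Suppose the agent explores and picks the worse action (index 0) in state 1.
q = [[0.0, 0.0],
     [0.2, 0.9]]
off_policy = q_learning_target(q, r=0.0, s_next=1, gamma=0.95)       # uses 0.9
on_policy = sarsa_target(q, r=0.0, s_next=1, a_next=0, gamma=0.95)   # uses 0.2
```

The Q-learning target is unaffected by the exploratory choice, which is why the learned values describe the optimal policy rather than the policy being followed.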
3. FrozenLake-v0
We implemented and tested the Q-learning algorithm using an environment provided by OpenAI gym called
FrozenLake-v0. In a world where frisbees are in short supply and of the utmost importance and value, an RL
agent is tasked with navigating from the start state to the goal state, where the frisbee is located,
without moving into a hole and without any guidance or prior information about the world it is in. The
environment is modeled as a 4x4 gridworld where each cell is one of four types: S - start block (safe),
F - frozen block, H - hole (danger) and G - goal state (frisbee located). The agent has four possible
moves: Up, Down, Right, Left. To make the endeavor even more compelling, the environment introduces an
obstacle in the form of wind, which can blow the agent into a block that it did not choose to move into.
The reward provided to the agent is a single scalar value: 1 if the agent reaches the goal, and 0 for every
other step taken.
During the first half of the project, we implemented the Q-learning algorithm using a lookup table that
stores the Q-value for every state (row) and action (column) pair. Recall from section 2 that the Q-value
is an evaluation of how good it is to take an action in a given state. Then, in an attempt to generalize
the approach to more realistic “open world” environments, as opposed to a 4x4 gridworld, we implemented the
Q-learning algorithm using a single-layer neural network.
Tabular approach - In a 4x4 gridworld there are 16 possible states (one per block), and with four possible
actions in each state this gives a Q-table of size 16x4. We first initialize the table by setting all
values to 0. Then we observe the reward for each action taken and update the table accordingly using the
Bellman equation outlined in section 2. In our implementation, we ran a total of 2000 games (epochs), and
within each game the agent could take a maximum of 99 steps. If it was not able to find the goal state
within 99 steps, we considered that epoch a failed game. The learning rate was set to 0.8 and the discount
factor to 0.95.
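The tabular procedure described above can be sketched end to end in plain Python. This is not the report's implementation: to stay self-contained it replaces the gym environment with a hypothetical deterministic 4x4 grid (no wind), and the grid layout, the epsilon-greedy exploration rule and all names are assumptions. Only the table size, the zero initialization, the 2000 games, the 99-step limit, the learning rate 0.8 and the discount factor 0.95 come from the text.

```python
import random

# Hypothetical 4x4 layout (S start, F frozen, H hole, G goal), for illustration only.
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]
N_STATES, N_ACTIONS = 16, 4
MOVES = [(0, -1), (1, 0), (0, 1), (-1, 0)]  # left, down, right, up as (d_row, d_col)


def step(state, action):
    """Deterministic transition (no wind): returns (next_state, reward, done)."""
    row, col = divmod(state, 4)
    d_row, d_col = MOVES[action]
    row = min(max(row + d_row, 0), 3)  # moving into a wall leaves the agent in place
    col = min(max(col + d_col, 0), 3)
    cell = GRID[row][col]
    return row * 4 + col, (1.0 if cell == "G" else 0.0), cell in "HG"


def train(episodes=2000, max_steps=99, alpha=0.8, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Q-learning; returns the 16x4 table and the fraction of games won."""
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]  # all values start at 0
    wins = 0
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            if rng.random() < epsilon:  # explore (assumed strategy)
                action = rng.randrange(N_ACTIONS)
            else:  # exploit, breaking ties at random
                best = max(q[state])
                action = rng.choice([a for a in range(N_ACTIONS) if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Terminal cells have no future value, so do not bootstrap from them.
            target = reward + (0.0 if done else gamma * max(q[nxt]))
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
            if done:
                wins += int(reward > 0)
                break
    return q, wins / episodes


def greedy_rollout(q, max_steps=99):
    """Follow the learned table without exploring; True if the goal is reached."""
    state = 0
    for _ in range(max_steps):
        action = max(range(N_ACTIONS), key=lambda a: q[state][a])
        state, reward, done = step(state, action)
        if done:
            return reward > 0
    return False
```

Calling `q, win_rate = train()` returns the learned table and the fraction of training games that reached the goal. Because this stand-in grid has no wind, its win rate is not comparable to the score reported for FrozenLake-v0.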
After running the simulation for 2000 games, we obtained a score of 0.448, with a final Q-table of:
Neural Network approach - While the agent can learn by looking up in the Q-table which action would give it
the most reward, in a real-world environment the state-action space is far too large to store in a table.
Thus, instead of a table, we use a function approximator to represent the values of the state-action pairs.
Each state is encoded as a one-hot vector (1x16) and fed to the network, which outputs a vector of 4
Q-values, one per action. The task for the agent is then to learn how to map the state vector to the
Q-values.
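The mapping from a one-hot state vector to four Q-values can be illustrated without any deep learning library. The sketch below assumes a single linear layer with a 16x4 weight matrix and no bias; the report does not give these details, and the names `one_hot` and `forward` are hypothetical.

```python
def one_hot(state, n_states=16):
    """Encode a state index as a 1x16 vector with a single 1."""
    vec = [0.0] * n_states
    vec[state] = 1.0
    return vec


def forward(x, weights):
    """Linear layer without bias: computes x @ weights, giving one Q-value per action."""
    n_actions = len(weights[0])
    return [sum(x[i] * weights[i][a] for i in range(len(x))) for a in range(n_actions)]


# Hypothetical weights: row i holds the Q-value estimates for state i.
weights = [[0.01 * (i + a) for a in range(4)] for i in range(16)]
q_values = forward(one_hot(3), weights)  # picks out row 3 of the weight matrix
```

With a one-hot input, the layer simply selects one row of the weight matrix, so this network starts out equivalent to the 16x4 table; the benefit of the network form is that it extends to inputs that are not one-hot.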
The update in this case does not take place in a table directly. Instead, we use backpropagation and a loss
function, and the weights of the neural network play the role of the values in the “old table”. The loss is
the sum-of-squares loss, representing the difference between the current predicted Q-values and the
“target” values. The target values are computed in the same way as the target in the Bellman equation,
r + γ max_{a′} Q(s′, a′), and the gradients of the loss are passed back through the network. The loss
function is: Loss = Σ (Q_target − Q)^2.
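For a single linear layer with one-hot inputs, one gradient step on this loss can be written out by hand. The sketch below is an assumption-laden illustration, not the report's code: it stores the 16x4 weight matrix as a list of rows, uses plain gradient descent, and the name `sgd_step` and the learning rate are hypothetical.

```python
def sgd_step(weights, state, action, reward, next_state, done, lr=0.1, gamma=0.95):
    """One gradient-descent step on Loss = sum_a (Q_target[a] - Q[a])^2.

    A one-hot input selects row `state` of `weights`, so the predicted
    Q-values are that row. Returns the loss measured before the step.
    """
    q_now = list(weights[state])      # predicted Q-values for the current state
    q_next = list(weights[next_state])
    # The target equals the prediction except for the action that was taken.
    target = list(q_now)
    target[action] = reward + (0.0 if done else gamma * max(q_next))
    loss = sum((t - q) ** 2 for t, q in zip(target, q_now))
    # d(Loss)/d(weights[state][a]) = -2 * (target[a] - q_now[a])
    for a in range(len(q_now)):
        weights[state][a] += lr * 2.0 * (target[a] - q_now[a])
    return loss


weights = [[0.0] * 4 for _ in range(16)]
weights[1][2] = 0.5
loss = sgd_step(weights, state=0, action=2, reward=0.0, next_state=1, done=False)
# The target for (0, 2) is 0.95 * 0.5 = 0.475, so weights[0][2] moves to about 0.095.
```

In this special case the step has the same form as the tabular update with α = 2 · lr, which shows how the weights act as the values of the old table.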
Figure 3 shows that at around episode 750 the agent has found the optimal path to the goal state, and from
then on it obtains the full reward more frequently and in fewer steps, as confirmed by Figure 2, which
shows the total number of steps taken by the agent per episode. The odd white lines in Figure 3 after the
750 mark indicate episodes in which the agent, while exploring, did not reach the goal state even though it
had already found the optimal policy. This further demonstrates that the agent keeps exploring to check
whether there is another policy that it has not found yet.[3]
Thus, we have demonstrated that the agent is able to estimate future reward for every state using Q-learning
via a single-layer neural network. However, despite the added flexibility provided by a neural network, the
predictions are not stable due to the non-linear nature of the loss function. There are “tricks” that we
could have explored to make it converge faster, although training would still take a long time. For
instance, we could use a notion called “experience replay”, which stores the <s, a, r, s′> experiences and
randomly reuses known experiences, such as those that have led to the goal state, during the training
period instead of just looking at the most recent transition. This helps avoid similarity in training
samples and helps the agent in finding the global optimum.
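Experience replay was not implemented in this project, but the idea can be sketched as a fixed-size buffer of transitions that is sampled at random during training. The class name `ReplayBuffer`, its methods and the capacity are hypothetical.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of <s, a, r, s'> experiences with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # the oldest experience is dropped when full
        self.rng = random.Random(seed)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Return up to batch_size distinct stored experiences, chosen at random."""
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

During training, each new transition would be added with `add`, and each learning step would use a batch from `sample` rather than only the latest transition, which breaks up the correlation between consecutive samples.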
4. CartPole-v0
We further explored the notion of using neural networks in place of the Q-table in another environment
provided by OpenAI gym called CartPole-v0. Here the control task is simple: a cart with two possible
motions, left or right, has to balance a pole that is attached to it. Below is a screenshot of what an
instance of such an environment looks like:
The goal in this case is to balance this “inverted pendulum” over a span of 200 frames, where each frame is
the consequence of an action being taken. However, if the pole tilts by more than ±15°, that is considered
a failed state.
We split the problem into four phases, outlined as follows:
● Phase 1: Generate random games. This helped us explore what the environment looks like. In terms of
output, we expected to fail very quickly, i.e. the pole kept falling every time the cart moved.
● Phase 2: Within this environment, generate initial training samples by moving the cart randomly. We
realize that this data is imperfect and that none of these games will beat the required score.
Nevertheless, we use these samples to build, train and optimize the network.
● Phase 3: Create a neural network consisting of five layers. Train the network on the training data from
the previous step, fitting the data so that the model is able to play the game.
● Phase 4: With the model trained over five epochs, play the game.
As we can see from Figure 5, as the model goes through the five epochs the loss decreases, and the final
accuracy is 62.5%. This is what we want, because a higher accuracy could imply that the model has overfit
to the training data, and it would then be more likely to fail if the game were not played under exactly
the conditions on which it was trained. Overall, the model opted for choice 1 (moving left) 49.3% of the
time and choice 2 (moving right) 50.7% of the time, obtaining an overall score of 170.9. We also
experimented with setting the number of epochs to 3 and found generally better overall scores.[4]
5. Conclusion
We set out to experiment with how an RL algorithm called Q-learning helps an agent optimize its decisions
in order to reach a goal. We explored two environments, FrozenLake-v0 and CartPole-v0, and implemented
Q-learning using the naive “lookup” table approach and using neural networks, which generalize the
state-action space to model real-world environments. We conclude that both approaches have merits and were
successful in helping an agent learn how to navigate its environment without prior knowledge in order to
reach its goal. We are excited and encouraged by the results to further explore other MFL algorithms such
as SARSA, as well as the other family of RL algorithms called Model-Based Learning.
6. References
[1] Watkins C. J. C. H. and Dayan P., Q-learning (Machine Learning Springer 1992)
[2] OpenAI Gym, www.gym.openai.com
[3] Juliani A., Simple Reinforcement Learning with TensorFlow Part 0, 2015
[4] Sentex, Machine Learning with Python - Youtube Series
[5] Matiisen T., Demystifying Deep Reinforcement Learning, 2015
[6] Littman M., Reinforcement Learning improves behavior from evaluative feedback (Nature, 2015)
[7] Sutton, R. and Barto A. G., Reinforcement Learning: An Introduction 1st Edition (MIT Press, 1998)
[8] Littman M., Basics of Computational Reinforcement Learning (RLDM, 2015)