Q-Learning: Tabular to Neural Networks
Rahul Dass
ECE 753: Project Report
May 1, 2018

1. Introduction

In the field of Machine Learning, there are three primary forms of learning: Supervised Learning (SL), Unsupervised Learning, and Reinforcement Learning (RL). SL has been the most extensively studied and commonly forms the basis of classification techniques. For any measurable success in such domains, the classification algorithm requires large amounts of data upfront. More often than not, however, we do not have access to such data. Within the realm of RL, by contrast, a system acts through an agent that experiments with its environment by trial and error in an attempt to discover an optimal way to achieve a goal. This requires a reward mechanism in which positive decisions are rewarded and negative decisions are punished. In this paradigm, known as RL, the agent looks for a solution that maximizes the reward obtained and minimizes the punishment. In this project, we explore two environments within OpenAI Gym, FrozenLake-v0 and CartPole-v0, to experiment with an RL agent following an algorithm called Q-learning.

2. Q-Learning

In this project, we implemented Q-learning, an algorithm that belongs to a family of RL algorithms called Model-Free Learning (MFL). The agent experiences a sequence of tuples <s_t, a_t, r_t, s_{t+1}>, representing <current state, current action, immediate reward, next state>, determined by the transition function and the rewards it receives. The transition function describes the mechanism by which an agent moves from state to state within its environment. The expected sum of rewards is what any RL agent is tasked with maximizing as it transitions toward a goal state. Upon deciding to take action a_t in state s_t, the agent updates the value it received for taking a in s. This state-action value for <s_t, a_t> is known as the Q-value. In MFL, the algorithm uses the agent's experience sequences to build an estimate of the Q-function, so every time a new experience arrives we tweak and improve that estimate. It is this best guess of the Q-values that tells the agent which action is best to take next, a_{t+1}.

Q-learning is an algorithm that can help an agent learn its long-term expected rewards. There are a few variations of the algorithm, but they all share the same update mechanism for the Q-values, governed by the Bellman equation. The general algorithm in pseudo-code:

1. Initialize Q.
2. From s, choose a to maximize Q(s, a) (or explore).
3. Observing the transition <s, a, r, s'>, update
   ΔQ(s, a) = α_t (r + γ max_{a'} Q(s', a') − Q(s, a))
   NB: the Q's above are all estimates Q̂ of the true Q-function.
4. Decay α_t.

Step 1: We begin by initializing the Q-values. In Section 3, we discuss this further for the two different implementations of the Q-learning algorithm. Regardless, those initial Q-values will most likely be incorrect, else the agent would not need to learn. It is important to note, however, that initial guesses that are either too high or too low will change how the algorithm behaves.
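As a concrete illustration of Step 1, the following minimal numpy sketch shows two common initialization choices: the all-zeros table used for FrozenLake-v0 in Section 3, and an optimistic alternative (not used in this report) that deliberately starts the estimates high to encourage a greedy agent to try untested actions.

```python
import numpy as np

n_states, n_actions = 16, 4   # e.g. FrozenLake-v0: a 4x4 grid with 4 moves per cell

# Initialization used in Section 3: every Q-value starts at zero.
Q = np.zeros((n_states, n_actions))

# Alternative (not used in this report): optimistic initialization. Starting the
# estimates at or above the largest attainable return nudges a greedy agent to
# try actions it has not yet taken, at the cost of early over-estimation.
Q_optimistic = np.full((n_states, n_actions), 1.0)
```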
Step 2: The agent acts. At each step of Q-learning, the agent is in some state and asks: what is the best thing to do? Naively, we could simply "look up" in the estimated Q-function the action with the highest estimated reward and take it. However, the Q-function is usually not correct in the beginning, and following this greedy strategy yields information that is not helpful to the agent when trying to optimize for reward. If the Q-function holds the "belief" that an action has a low value, the agent will never opt to take that action leading to a particular state, and since that action is never taken, its value is never learned. The agent therefore also needs to explore.

Step 3: As the agent builds its experience, represented as <s, a, r, s'>, we use that information to update Q(s, a) using the Bellman equation:

ΔQ(s, a) = α_t (r + γ max_{a'} Q(s', a') − Q(s, a))

where Q(s, a) is the value of the state-action pair the agent just undertook, and this is what we are going to learn about. The Q-value is adjusted by a small amount, via the learning rate α_t, in the direction of the immediate reward r plus γ max_{a'} Q(s', a') − Q(s, a), the discounted estimated value of the next state minus the Q-value of the state we just left. This difference is key: it represents how valuable we thought the state-action pair (s, a) was versus how valuable we should now think it is, given that we just received r and transitioned to the next state. The difference is therefore an error signal. If the two quantities are exactly equal, then r + γ max_{a'} Q(s', a') − Q(s, a) = 0 and ΔQ(s, a) changes nothing. If the new evidence is lower than the old estimate, the update (scaled by the learning rate and the discount factor) decreases the estimate for (s, a); if it is higher, the update increases it.

Step 4: We also need to decide what to do with α_t. As we shall see in Section 3, we can change this quantity over time, and we should if we would like certain convergence guarantees.

In conclusion, with suitable choices for Steps 1, 2 and 4, Q(s, a) → Q*(s, a) in the limit: our estimate of the Q-function converges to the optimal solution of the Bellman equation, Q*(s, a). The update ΔQ(s, a) = α_t (r + γ max_{a'} Q(s', a') − Q(s, a)) is quite a powerful equation: in one line of code, it enables us to solve the Bellman equation given enough time and experience. A disadvantage of this algorithm is that it takes time to find the optimal Q-values. To avoid being stuck in a local optimum, the learning algorithm has to take some actions that may seem wasteful or pointless, but the agent continually updates the Q-values and so it keeps learning.[1]

This form of learning is known as off-policy: the agent learns about the optimal behavior even though it is not necessarily behaving optimally. The update equation takes a hypothetical max over the actions a' available in the next state s', even though the agent may not in fact take that maximizing action, owing to the exploratory or random nature of Q-learning. Nonetheless, the updated Q-values converge to the optimal policy, even though that is not the policy the agent is following.[5-8]
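Before turning to the experiments, the following is a minimal Python sketch of the full loop. It assumes a discrete gym-style environment with the classic API (reset() returns a state index, step() returns a 4-tuple). The epsilon-greedy exploration and the decay schedules are illustrative choices rather than the report's exact settings; the default α = 0.8 and γ = 0.95 match the values used for FrozenLake-v0 in Section 3.

```python
import numpy as np

def q_learning(env, episodes=2000, alpha=0.8, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning following Steps 1-4 above."""
    # Step 1: initialize the Q-table to zeros.
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for episode in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Step 2: act greedily w.r.t. the current estimate, or explore.
            if np.random.rand() < epsilon:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # Step 3: Bellman update on the observed <s, a, r, s'>.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
        # Step 4: decay the learning rate (and, here, also the exploration rate).
        alpha = max(0.01, alpha * 0.999)
        epsilon = max(0.01, epsilon * 0.999)
    return Q
```

For FrozenLake-v0, this would be invoked as Q = q_learning(gym.make("FrozenLake-v0")).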
3. FrozenLake-v0

We implemented and tested the Q-learning algorithm using an environment provided by OpenAI Gym called FrozenLake-v0: in a world where frisbees are in short supply and of the utmost importance and value, an RL agent is tasked with navigating from the start state to the goal state where the frisbee lies, without stepping into a hole and without any guidance or prior information about the world it is in. The environment is modeled as a 4x4 gridworld where each cell is one of four types: S - start block (safe), F - frozen block, H - hole (danger), and G - goal state (frisbee located). The agent has four degrees of freedom of motion: Up, Down, Right, Left. To make the endeavor even more compelling, the environment introduces an obstacle in the form of wind, which can blow the agent into a block it did not choose to move into. The reward provided to the agent is a single scalar value of 1 if the agent reaches the goal; for every other step taken, the agent receives 0.

During the first half of the project, we implemented the Q-learning algorithm using a lookup table that stores the Q-value for every state (row) and action (column) pair. Recall from Section 2 that the Q-value is an evaluation of how good it is to take an action in a given state. To generalize the problem toward more realistic "open world" environments, as opposed to a 4x4 gridworld, we then implemented the Q-learning algorithm using a single-layer neural network.

Tabular approach - In a 4x4 gridworld there are 16 possible states (one per block), and with four possible actions in each state this gives a Q-table of size 16x4. We first initialize the table by setting all values to 0, then observe the reward for each action taken and update the table using the Bellman equation outlined in Section 2. A few implementation details: we ran a total of 2000 games (episodes), and within each game we gave the agent a maximum of 99 steps. If it was not able to find the goal state within 99 steps, we consider that episode a failed game. The learning rate was set to 0.8 and the discount factor to 0.95. Running the simulation for 2000 games, we obtained a score of 0.448; the final learned Q-table is shown in the accompanying figure.

Neural network approach - While the agent can learn by looking up in the Q-table which action would give it the most reward, in a real-world environment the state-action space is vastly larger. Thus, instead of a table, we use a function approximator to represent the state-action values. Each state is represented as a one-hot vector (1x16), and the network produces 4 Q-values, one per action; the task for the agent is to learn how to map this vector to the Q-values. The updating in this case does not take place in a table directly. Instead, we use backpropagation with a loss function, where the weights of the neural network play the role of the entries of the "old table". The loss is the sum-of-squared difference between the currently predicted Q-value and a "target value" computed from the Bellman equation; the target values are computed and the gradients are passed back through the network:

Loss = Σ (Q_target − Q)^2
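The following is a minimal sketch of this single-layer approach, written here in PyTorch for brevity; the report's own implementation, its exploration schedule and its learning rate may differ. It assumes the classic gym API, in which FrozenLake-v0 is registered, reset() returns a state index, and step() returns a 4-tuple.

```python
import gym
import torch
import torch.nn as nn

env = gym.make("FrozenLake-v0")
q_net = nn.Linear(16, 4, bias=False)   # the 16x4 weight matrix plays the role of the Q-table
optimizer = torch.optim.SGD(q_net.parameters(), lr=0.1)
gamma = 0.95

def one_hot(state):
    vec = torch.zeros(1, 16)
    vec[0, state] = 1.0
    return vec

for episode in range(2000):
    s = env.reset()
    eps = 1.0 / (episode / 50.0 + 10.0)          # assumed exploration schedule
    for step in range(99):
        q_values = q_net(one_hot(s))             # predicted Q(s, .)
        if torch.rand(1).item() < eps:
            a = env.action_space.sample()        # explore
        else:
            a = int(q_values.argmax())           # exploit the current estimate
        s_next, r, done, _ = env.step(a)

        # Build the target: the predicted vector with the taken action replaced
        # by r + gamma * max_a' Q(s', a'). The bootstrap term is not differentiated.
        with torch.no_grad():
            q_target = q_values.clone()
            q_target[0, a] = r + gamma * q_net(one_hot(s_next)).max()

        loss = ((q_target - q_values) ** 2).sum()   # sum-of-squared loss from above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        s = s_next
        if done:
            break
```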
The results in Figure 3 show that at approximately episode 750 the agent found the optimal path to the goal state, and from then on it obtains the full reward more frequently and in fewer steps, as confirmed by Figure 2, which shows the total number of steps taken by the agent per episode. The occasional white lines in Figure 3 after the 750 mark indicate episodes in which the agent, despite having already found the optimal policy, explored and failed to reach the goal state. This further demonstrates that the agent keeps exploring for another policy it may not yet have found.[3] Thus, we have demonstrated that the agent is able to estimate the future reward for every state using Q-learning via a single-layer neural network. However, despite the added flexibility provided by a neural network, the predictions are not stable, since the training targets are themselves estimates produced by the network being trained. There are "tricks" we could have explored to make training converge faster, although it would still take a long time. For instance, we could use "experience replay", which stores <s, a, r, s'> experiences and randomly replays stored experiences (including those that led to the goal state) during training, instead of only using the most recent transition. This reduces the similarity between consecutive training samples and helps the agent find the global optimum.

4. CartPole-v0

We further explored the idea of using a neural network to model the Q-table in another environment provided by OpenAI Gym, CartPole-v0. The control task is simple: a cart with two possible motions, left or right, has to balance a pole fixed on top of it. A screenshot of an instance of this environment is shown in the accompanying figure. The goal is to balance this "inverted pendulum" within a span of 200 frames, where each frame is the consequence of an action being taken; whenever the pole tilts beyond ±15°, the episode is considered a failure. We split the problem into four phases, outlined as follows:

● Phase 1: Generate random games. This helped us explore what the environment looks like; as expected, the agent failed very quickly, i.e. the pole fell over every time the cart moved.
● Phase 2: Generate initial training samples by moving the cart randomly. The data in this case are imperfect and no random game beats the required score; nevertheless, we use these samples to build and train the network.
● Phase 3: Create a neural network consisting of five layers and train it on the data from the previous step, fitting the data so that the network can play the game (a sketch appears at the end of this section).
● Phase 4: With the trained model, trained over five epochs, play the game.

As we can see from Figure 5, over the five training epochs the loss decreases and the final accuracy is 62.5%. This is what we want: a much higher accuracy could mean the model has overfit to the training data and would thus be more likely to fail whenever the game deviates from the conditions under which it was trained. Overall, the model opted for choice 1 (moving left) 49.3% of the time and choice 2 (moving right) 50.7% of the time, obtaining an overall score of 170.9. We also experimented with setting the number of training epochs to 3 and generally found better overall scores.[4]
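For concreteness, here is a sketch of Phases 2 and 3 in Python (PyTorch). The score cutoff for keeping a random game, the layer widths, the optimizer and the number of collected games are assumptions made for illustration; only the environment, the random data collection, the five-layer architecture and the five training epochs come from the description above. The classic gym API is assumed (reset() returns the observation, step() returns a 4-tuple).

```python
import gym
import numpy as np
import torch
import torch.nn as nn

env = gym.make("CartPole-v0")
SCORE_CUTOFF = 50                                  # assumed threshold for keeping a random game

def collect_training_data(n_games=10000):
    data = []
    for _ in range(n_games):
        obs = env.reset()
        memory, score = [], 0.0
        for _ in range(200):
            action = env.action_space.sample()     # Phase 2: act randomly
            memory.append((obs, action))
            obs, reward, done, _ = env.step(action)
            score += reward
            if done:
                break
        if score >= SCORE_CUTOFF:                  # keep only the luckier games
            data.extend(memory)
    return data

# Phase 3: a five-layer fully connected network mapping the 4-dimensional
# observation to a left/right decision.
model = nn.Sequential(
    nn.Linear(4, 128), nn.ReLU(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 128), nn.ReLU(),
    nn.Linear(128, 2),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

data = collect_training_data()
X = torch.tensor(np.array([obs for obs, _ in data]), dtype=torch.float32)
y = torch.tensor([action for _, action in data], dtype=torch.long)

for epoch in range(5):                             # the report trains for five epochs
    logits = model(X)
    loss = loss_fn(logits, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"epoch {epoch + 1}: loss {loss.item():.4f}")
```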
5. Conclusion

We set out to examine how an RL algorithm called Q-learning helps an agent optimize its decisions in order to reach a goal. We explored two environments, FrozenLake-v0 and CartPole-v0, and implemented Q-learning both with the naive lookup-table approach and, to generalize the state-action space toward real-world environments, with neural networks. We conclude that both approaches have merit and were successful in helping an agent learn, without prior knowledge, how to navigate its environment in order to reach its goal. We are excited and encouraged by these results to further explore other MFL algorithms, such as SARSA, as well as the other family of RL algorithms, Model-Based Learning.

6. References

[1] Watkins, C. J. C. H. and Dayan, P., Q-learning (Machine Learning, Springer, 1992).
[2] OpenAI Gym, www.gym.openai.com
[3] Juliani, A., Simple Reinforcement Learning with TensorFlow, Part 0, 2015.
[4] Sentdex, Machine Learning with Python (YouTube series).
[5] Matiisen, T., Demystifying Deep Reinforcement Learning, 2015.
[6] Littman, M., Reinforcement learning improves behaviour from evaluative feedback (Nature, 2015).
[7] Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, 1st Edition (MIT Press, 1998).
[8] Littman, M., Basics of Computational Reinforcement Learning (RLDM, 2015).