RL With LCS

This document discusses using LCS (learning classifier systems) for reinforcement learning tasks. It defines reinforcement learning problems using the Markov Decision Process (MDP) framework. It describes common reinforcement learning algorithms like SARSA and Q-learning that aim to learn value functions. Long path learning, where optimal policies require long action sequences, remains a challenge for LCS. Maintaining exploration vs exploitation is also important for reinforcement learning algorithms.


Towards Reinforcement Learning with LCS

Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks.

Problem Definition

The sequential decision tasks that will be considered are the ones describable by a Markov Decision Process (MDP). Some of the previously used symbols will be assigned a new meaning.

Problem Definition

Let X be the set of states x ∈ X of the problem domain, which is assumed to be of finite size N and is hence mapped into the natural numbers ℕ. In every state x_i ∈ X, an action a out of a finite set A is performed and causes a state transition to x_j. The probability of getting to state x_j after performing action a in state x_i is given by the transition function p(x_j | x_i, a), which is a probability distribution over X, conditional on X × A. The positive discount factor γ ∈ ℝ with 0 < γ ≤ 1 determines the preference of immediate reward over future reward.
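
As a concrete illustration of this formalism, the following is a minimal Python sketch of a finite MDP with tabular transition and reward functions. The class and attribute names (`MDP`, `transition`, `reward`, `step`) are chosen for illustration only and do not appear in the text.

```python
import random

class MDP:
    """Minimal finite MDP: states 0..N-1, a finite action set, tabular dynamics."""

    def __init__(self, n_states, actions, transition, reward, gamma):
        self.n_states = n_states      # |X| = N
        self.actions = actions        # finite action set A
        self.transition = transition  # transition[(x, a)] -> {x_j: p(x_j | x, a)}
        self.reward = reward          # reward[(x, a, x_j)] -> immediate reward of that transition
        self.gamma = gamma            # discount factor, 0 < gamma <= 1

    def step(self, x, a):
        """Sample a successor state and the associated reward for action a in state x."""
        probs = self.transition[(x, a)]
        successors = list(probs.keys())
        x_next = random.choices(successors, weights=[probs[s] for s in successors])[0]
        return x_next, self.reward[(x, a, x_next)]
```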

Problem Definition

The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.

The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function V : X → ℝ that expresses for each state how much reward we can expect to receive in the long run.
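
Spelled out for a fixed policy μ, writing r(x, a, x') for the expected reward of the transition from x to x' under action a, the value function and the Bellman equation it satisfies take the standard forms

```latex
V^{\mu}(x) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, x_{0} = x,\; \mu\right],
\qquad
V^{\mu}(x) = \sum_{x' \in X} p(x' \mid x, \mu(x)) \bigl( r(x, \mu(x), x') + \gamma V^{\mu}(x') \bigr),
```

and analogously for the optimal value function V*, with an additional maximisation over actions.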

Problem Types

The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17]. Here, only discounted problems and stochastic shortest path problems are considered, where for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed.

Dynamic Programming and Reinforcement Learning

In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions. These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method.

Dynamic Programming Operators

Bellman's Equation is a set of equations that cannot be solved analytically. Fortunately, several methods have been developed that make finding its solution easier, all of which are based on the DP operators T and T_μ.
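
The operators themselves are not restated on this slide; in their usual form, T performs a one-step lookahead with a maximisation over actions, and T_μ does the same for a fixed policy μ:

```latex
(T V)(x) = \max_{a \in A} \sum_{x' \in X} p(x' \mid x, a)\bigl( r(x, a, x') + \gamma V(x') \bigr),
\qquad
(T_{\mu} V)(x) = \sum_{x' \in X} p(x' \mid x, \mu(x))\bigl( r(x, \mu(x), x') + \gamma V(x') \bigr).
```

Bellman's equation can then be written compactly as V* = T V* and V^μ = T_μ V^μ.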

Value Iteration and Policy Iteration

The method of value iteration is a straightforward application of the contraction property of T and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*. Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after a few iterations.
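
A minimal sketch of value iteration over the tabular MDP structure sketched earlier; the stopping tolerance and iteration cap are illustrative choices, not prescribed by the text.

```python
def value_iteration(mdp, tol=1e-8, max_iter=10_000):
    """Apply the operator T repeatedly to an arbitrary initial V until it stops changing."""
    V = [0.0] * mdp.n_states
    for _ in range(max_iter):
        # (T V)(x) = max_a sum_x' p(x'|x,a) * (r(x, a, x') + gamma * V(x'))
        V_new = [
            max(
                sum(p * (mdp.reward[(x, a, x_j)] + mdp.gamma * V[x_j])
                    for x_j, p in mdp.transition[(x, a)].items())
                for a in mdp.actions
            )
            for x in range(mdp.n_states)
        ]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V
```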

Value Iteration and Policy Iteration

Various variants of these methods exist, such as asynchronous value iteration, which at each application of T updates only a single state of V. Modified policy iteration performs the policy evaluation step by approximating the policy's value vector by T_μ^n V for some small n. Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy for some set of states by policy improvement. Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
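
The following sketch illustrates modified policy iteration in this spirit: the policy evaluation step applies T_μ only n times rather than solving for the policy's value vector exactly, and is followed by a greedy policy improvement step. Function names and parameter defaults are illustrative assumptions.

```python
def modified_policy_iteration(mdp, n=5, iterations=100):
    """Alternate approximate policy evaluation (n applications of T_mu) with policy improvement."""
    V = [0.0] * mdp.n_states
    policy = [mdp.actions[0]] * mdp.n_states

    def backup(x, a, V):
        # One-step lookahead for action a in state x under the current value vector V.
        return sum(p * (mdp.reward[(x, a, x_j)] + mdp.gamma * V[x_j])
                   for x_j, p in mdp.transition[(x, a)].items())

    for _ in range(iterations):
        # Policy evaluation: approximate the policy's value vector by T_mu^n V.
        for _ in range(n):
            V = [backup(x, policy[x], V) for x in range(mdp.n_states)]
        # Policy improvement: act greedily with respect to the current V.
        policy = [max(mdp.actions, key=lambda a, x=x: backup(x, a, V))
                  for x in range(mdp.n_states)]
    return policy, V
```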

Approximate Dynamic Programming

If N is large, we prefer to approximate the value function rather than representing the value of each state explicitly. Approximate value iteration is performed by approximating the value iteration update Vt+1 = TVt by the approximate update Ṽt+1 = Π T Ṽt,

Approximate Dynamic Programming

where Π is the approximation operator that, for the used function approximation technique, returns the value function estimate Ṽt+1 that is closest to T Ṽt. The only approximation that will be considered is the one most similar to approximate value iteration: the temporal-difference solution, which aims at finding the fixed point Ṽ = Π T(λ) Ṽ of the combined approximation and temporal-difference operators.
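
As a concrete instance of such an approximation, the following sketches a single temporal-difference update for a linear approximation Ṽ(x) = φ(x)·w; the feature function φ, the step size α, and the function name are assumptions made for illustration.

```python
import numpy as np

def td0_update(w, phi, x, r, x_next, gamma, alpha=0.1):
    """One TD(0) update of the weight vector w of the linear approximation V~(x) = phi(x).w."""
    v_x = np.dot(phi(x), w)
    v_next = np.dot(phi(x_next), w)
    td_error = r + gamma * v_next - v_x   # temporal-difference error for this transition
    return w + alpha * td_error * phi(x)  # move the estimate towards the TD target
```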

SARSA(λ)

Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition. It conceptually performs policy iteration and uses TD(λ) to update its action-value function Q. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
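
A minimal tabular SARSA(0) update, given purely to illustrate the state-action-reward-state-action information flow; Q is assumed to be a dictionary (for example a defaultdict(float)) keyed by state/action pairs, and the step size α is an illustrative parameter.

```python
def sarsa_update(Q, x, a, r, x_next, a_next, gamma, alpha=0.1):
    """Move Q(x, a) towards the SARSA(0) target r + gamma * Q(x', a')."""
    target = r + gamma * Q[(x_next, a_next)]   # uses the action a' actually chosen in x'
    Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q
```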

Q-Learning
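
For comparison with SARSA, the standard Q-learning update replaces the value of the action actually chosen in the next state by a maximisation over all actions, making it an off-policy method:

```latex
Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha \Bigl( r_t + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t) \Bigr).
```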

9.5 Further Issues

Besides the stability concerns when using LCS to perform RL, there are still some further issues to consider, two of which will be discussed in this section: the learning of long paths, and how best to handle the explore/exploit dilemma.

9.5.1 Long Path Learning

The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning action sequences of substantial length. While a solution was proposed to handle this problem [12], it was only designed to work for a particular problem class, as will be shown after discussing how XCS fails at long path learning. The classifier set optimality criterion from Chap. 7 might provide better results, but in general, long path learning remains an open problem. Long path learning is not only an issue for LCS, but for approximate DP and RL in general.

XCS and Long Path Learning

Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that reaches the terminal state x6 from the initial state x1a in the smallest number of steps. In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions. The optimal policy is to alternately choose actions 0 and 1, starting with action 1 in state x1a.

XCS and Long Path Learning

The optimal value function V* as a function of the number of steps to the terminal state is shown in Fig. 9.3(a) for a 15-step corridor finite state world. As can be seen, the difference between the values of V* for two adjacent states decreases with the distance from the terminal state.
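
To see why these differences shrink, note that with a reward of 1 on reaching the terminal state, 0 elsewhere, and discount factor γ, the optimal value of a state that is k steps away from the terminal state is γ^(k-1), so the difference between two adjacent states is

```latex
V^{*}_{k} - V^{*}_{k+1} = \gamma^{k-1} - \gamma^{k} = \gamma^{k-1}(1 - \gamma),
```

which decays geometrically with the distance k from the terminal state.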

Using the Relative Error

Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function [12]. The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.

Using the Relative Error

A second, more promising alternative proposed in his study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values.
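
Written as a formula, this relative error scaling amounts to (a paraphrase of the description above, not Barry's exact formulation)

```latex
\epsilon_{\mathrm{rel}} = \frac{\epsilon}{\lvert \hat{V} \rvert},
```

where ε is the measured prediction error and V̂ is the classifier's prediction of the value function.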

A Possible Alternative?

It was shown in Sect. 8.3.4 that the optimality criterion introduced in Chap. 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implementation be able to perform long path learning?

A Possible Alternative?

Let us assume that the optimality criterion causes the size of the area of the input space that is matched by a classifier to be proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accurately represent the data-generating process. Considering only measurement noise, when applied to value function approximation this would lead to more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low. Therefore, the optimality criterion should provide an adequate approximation of the optimal value function, even in cases where long action sequences need to be represented.

9.5.2 Exploration and Exploitation

Maintaining the balance between exploiting current knowledge to guide action selection and exploring the state space to gain new knowledge is an essential problem for reinforcement learning. Too much exploration implies the frequent selection of sub-optimal actions and causes the accumulated reward to decrease. Too much emphasis on exploitation of current knowledge, on the other hand, might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important as it has a significant impact on the performance of RL methods.

9.5.2 Exploration and Exploitation

There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge, or one can take this certainty into account and choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable.
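
The first of these approaches is commonly realised as ε-greedy action selection. A minimal sketch, where the exploration rate ε and the dictionary representation of Q are assumptions for illustration:

```python
import random

def epsilon_greedy(Q, x, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)             # explore: occasionally pick a possibly sub-optimal action
    return max(actions, key=lambda a: Q[(x, a)])  # exploit: act greedily w.r.t. current knowledge
```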

9.6 Summary

Despite sequential decision tasks being the prime motivator for LCS, they are still the tasks that LCS handle least successfully. This chapter provides a primer on how to use dynamic programming and reinforcement learning to handle such tasks, and on how LCS can be combined with either approach from first principles.

9.6 Summary

An essential part of the LCS type discussed in this book is that classifiers are trained independently. This is not completely true when using LCS with reinforcement learning, as the target values that the classifiers are trained on are based on the global prediction, which is formed by all matching classifiers in combination. In that sense, classifiers interact when forming their action-value function estimates. Nonetheless, besides combining classifier predictions to form the target values, independent classifier training still forms the basis of this model type, even when used in combination with RL.
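
To illustrate this interaction, the global prediction can be thought of as a weighted combination of the predictions of all matching classifiers, which then enters the RL target that each classifier is trained on. The classifier interface and the per-classifier mixing weights below are assumptions for illustration, not the specific mixing model developed in the book.

```python
def global_prediction(classifiers, x, a):
    """Combine the predictions of all classifiers matching (x, a) into one global estimate."""
    matching = [cl for cl in classifiers if cl.matches(x, a)]   # hypothetical matching test
    total_weight = sum(cl.mixing_weight for cl in matching)     # hypothetical mixing weights
    return sum(cl.mixing_weight * cl.predict(x, a) for cl in matching) / total_weight
```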

9.6 Summary

Overall, using LCS to approximate the value or action-value function in RL is appealing, as LCS dynamically adjust to the form of this function and thus might provide a better approximation than standard function approximation techniques. It should be noted, however, that the field of RL is moving quickly, and that Q-learning is far from the best method currently available. Hence, in order for LCS to be a competitive approach to sequential decision tasks, they also need to keep pace with new developments in RL, some of which were discussed when detailing the exploration/exploitation dilemma that is an essential component of RL. In summary, there is clearly still plenty of work to be done before LCS can provide the same formal development as RL currently does. Nonetheless, this chapter provides an initial formal basis upon which further research can build analysis and improvements to how LCS handle sequential decision tasks effectively, competitively, and with high reliability.
