RL With LCS

This document discusses using LCS (learning classifier systems) for reinforcement learning tasks. It defines reinforcement learning problems using the Markov Decision Process (MDP) framework. It describes common reinforcement learning algorithms like SARSA and Q-learning that aim to learn value functions. Long path learning, where optimal policies require long action sequences, remains a challenge for LCS. Maintaining exploration vs exploitation is also important for reinforcement learning algorithms.


Towards Reinforcement Learning with LCS

Having until now concentrated on how LCS can handle regression and classification tasks, this chapter returns to the prime motivator for LCS: sequential decision tasks.

Problem Definition

The sequential decision tasks that will be considered are the ones describable by a Markov Decision Process (MDP). Some of the previously used symbols will be assigned a new meaning.

Problem Definition

Let X be the set of states x ∈ X of the problem domain, which is assumed to be of finite size N and is hence mapped into the natural numbers ℕ. In every state x_i ∈ X, an action a out of a finite set A is performed and causes a state transition to x_j. The probability of getting to state x_j after performing action a in state x_i is given by the transition function p(x_j | x_i, a), which is a probability distribution over X, conditional on X × A. The positive discount factor γ ∈ ℝ with 0 < γ ≤ 1 determines the preference of immediate reward over future reward.
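
As a concrete illustration of this formalism, the following is a minimal Python sketch of a finite MDP with tabular transition and reward functions. The class and attribute names (`MDP`, `transition`, `reward`, `step`) are chosen for illustration only and do not appear in the text.

```python
import random

class MDP:
    """Minimal finite MDP: states 0..N-1, a finite action set, tabular dynamics."""

    def __init__(self, n_states, actions, transition, reward, gamma):
        self.n_states = n_states      # |X| = N
        self.actions = actions        # finite action set A
        self.transition = transition  # transition[(x, a)] -> {x_j: p(x_j | x, a)}
        self.reward = reward          # reward[(x, a, x_j)] -> immediate reward of that transition
        self.gamma = gamma            # discount factor, 0 < gamma <= 1

    def step(self, x, a):
        """Sample a successor state and the associated reward for action a in state x."""
        probs = self.transition[(x, a)]
        successors = list(probs.keys())
        x_next = random.choices(successors, weights=[probs[s] for s in successors])[0]
        return x_next, self.reward[(x, a, x_next)]
```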

Problem Definition

The aim is for every state to choose the action that maximises the reward in the long run, where future rewards are possibly valued less than immediate rewards.

The Value Function, the Action-Value Function and Bellman's Equation

The approach taken by dynamic programming (DP) and reinforcement learning (RL) is to define a value function V : X → ℝ that expresses for each state how much reward we can expect to receive in the long run.
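
Spelled out for a fixed policy μ, writing r(x, a, x') for the expected reward of the transition from x to x' under action a, the value function and the Bellman equation it satisfies take the standard forms

```latex
V^{\mu}(x) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\middle|\, x_{0} = x,\; \mu\right],
\qquad
V^{\mu}(x) = \sum_{x' \in X} p(x' \mid x, \mu(x)) \bigl( r(x, \mu(x), x') + \gamma V^{\mu}(x') \bigr),
```

and analogously for the optimal value function V*, with an additional maximisation over actions.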

Problem Types

The three basic classes of infinite horizon problems are stochastic shortest path problems, discounted problems, and average reward per step problems, all of which are well described by Bertsekas and Tsitsiklis [17]. Here, only discounted problems and stochastic shortest path problems are considered, where for the latter only proper policies that are guaranteed to reach the desired terminal state are assumed.

Dynamic Programming and Reinforcement Learning

In this section, some common RL methods are introduced that learn these functions while traversing the state space, without building a model of the transition and reward functions. These methods are simulation-based approximations to DP methods, and their stability is determined by the stability of the corresponding DP method.

Dynamic Programming Operators

Bellman's Equation is a set of equations that cannot be solved analytically. Fortunately, several methods have been developed that make finding its solution easier, all of which are based on the DP operators T and T_μ.
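
The operators themselves are not restated on this slide; in their usual form, T performs a one-step lookahead with a maximisation over actions, and T_μ does the same for a fixed policy μ:

```latex
(T V)(x) = \max_{a \in A} \sum_{x' \in X} p(x' \mid x, a)\bigl( r(x, a, x') + \gamma V(x') \bigr),
\qquad
(T_{\mu} V)(x) = \sum_{x' \in X} p(x' \mid x, \mu(x))\bigl( r(x, \mu(x), x') + \gamma V(x') \bigr).
```

Bellman's equation can then be written compactly as V* = T V* and V^μ = T_μ V^μ.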

Value Iteration and Policy Iteration

The method of value iteration is a straightforward application of the contraction property of T and is based on applying T repeatedly to an initially arbitrary value vector V until it converges to the optimal value vector V*. Convergence can only be guaranteed after an infinite number of steps, but the value vector V is usually already close to V* after a few iterations.
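
A minimal sketch of value iteration over the tabular MDP structure sketched earlier; the stopping tolerance and iteration cap are illustrative choices, not prescribed by the text.

```python
def value_iteration(mdp, tol=1e-8, max_iter=10_000):
    """Apply the operator T repeatedly to an arbitrary initial V until it stops changing."""
    V = [0.0] * mdp.n_states
    for _ in range(max_iter):
        # (T V)(x) = max_a sum_x' p(x'|x,a) * (r(x, a, x') + gamma * V(x'))
        V_new = [
            max(
                sum(p * (mdp.reward[(x, a, x_j)] + mdp.gamma * V[x_j])
                    for x_j, p in mdp.transition[(x, a)].items())
                for a in mdp.actions
            )
            for x in range(mdp.n_states)
        ]
        if max(abs(v1 - v0) for v1, v0 in zip(V_new, V)) < tol:
            return V_new
        V = V_new
    return V
```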

Value Iteration and Policy Iteration

Various variants of these methods exist, such as asynchronous value iteration, which at each application of T updates only a single state of V. Modified policy iteration performs the policy evaluation step by approximating the policy's value vector by T_μ^n V for some small n. Asynchronous policy iteration mixes asynchronous value iteration with policy iteration by, at each step, either i) updating some states of V by asynchronous value iteration, or ii) improving the policy for some set of states by policy improvement. Convergence criteria for these variants are given by Bertsekas and Tsitsiklis [17].
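
The following sketch illustrates modified policy iteration in this spirit: the policy evaluation step applies T_μ only n times rather than solving for the policy's value vector exactly, and is followed by a greedy policy improvement step. Function names and parameter defaults are illustrative assumptions.

```python
def modified_policy_iteration(mdp, n=5, iterations=100):
    """Alternate approximate policy evaluation (n applications of T_mu) with policy improvement."""
    V = [0.0] * mdp.n_states
    policy = [mdp.actions[0]] * mdp.n_states

    def backup(x, a, V):
        # One-step lookahead for action a in state x under the current value vector V.
        return sum(p * (mdp.reward[(x, a, x_j)] + mdp.gamma * V[x_j])
                   for x_j, p in mdp.transition[(x, a)].items())

    for _ in range(iterations):
        # Policy evaluation: approximate the policy's value vector by T_mu^n V.
        for _ in range(n):
            V = [backup(x, policy[x], V) for x in range(mdp.n_states)]
        # Policy improvement: act greedily with respect to the current V.
        policy = [max(mdp.actions, key=lambda a, x=x: backup(x, a, V))
                  for x in range(mdp.n_states)]
    return policy, V
```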

Approximate Dynamic Programming

If N is large, we prefer to approximate the value function rather than representing the value of each state explicitly. Approximate value iteration is performed by approximating the value iteration update Vt+1 = TVt by the approximate update Ṽt+1 = Π T Ṽt,

Approximate Dynamic Programming

where Π is the approximation operator that, for the used function approximation technique, returns the value function estimate Ṽt+1 that is closest to T Ṽt. The only approximation that will be considered is the one most similar to approximate value iteration: the temporal-difference solution, which aims at finding the fixed point Ṽ = Π T(λ) Ṽ of the combined approximation and temporal-difference operators.
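
As a concrete instance of such an approximation, the following sketches a single temporal-difference update for a linear approximation Ṽ(x) = φ(x)·w; the feature function φ, the step size α, and the function name are assumptions made for illustration.

```python
import numpy as np

def td0_update(w, phi, x, r, x_next, gamma, alpha=0.1):
    """One TD(0) update of the weight vector w of the linear approximation V~(x) = phi(x).w."""
    v_x = np.dot(phi(x), w)
    v_next = np.dot(phi(x_next), w)
    td_error = r + gamma * v_next - v_x   # temporal-difference error for this transition
    return w + alpha * td_error * phi(x)  # move the estimate towards the TD target
```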

SARSA(λ)

Coming to the first reinforcement learning algorithm, SARSA stands for State-Action-Reward-State-Action, as SARSA(0) requires only information on the current and next state/action pair and the reward that was received for the transition. It conceptually performs policy iteration and uses TD(λ) to update its action-value function Q. More specifically, it performs optimistic policy iteration, where, in contrast to standard policy iteration, the policy improvement step is based on an incompletely evaluated policy.
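
A minimal tabular SARSA(0) update, given purely to illustrate the state-action-reward-state-action information flow; Q is assumed to be a dictionary (for example a defaultdict(float)) keyed by state/action pairs, and the step size α is an illustrative parameter.

```python
def sarsa_update(Q, x, a, r, x_next, a_next, gamma, alpha=0.1):
    """Move Q(x, a) towards the SARSA(0) target r + gamma * Q(x', a')."""
    target = r + gamma * Q[(x_next, a_next)]   # uses the action a' actually chosen in x'
    Q[(x, a)] += alpha * (target - Q[(x, a)])
    return Q
```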

Q-Learning
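
For comparison with SARSA, the standard Q-learning update replaces the value of the action actually chosen in the next state by a maximisation over all actions, making it an off-policy method:

```latex
Q(x_t, a_t) \leftarrow Q(x_t, a_t) + \alpha \Bigl( r_t + \gamma \max_{a'} Q(x_{t+1}, a') - Q(x_t, a_t) \Bigr).
```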

9.5 Further Issues

Besides the stability concerns when using LCS to perform RL, there are still some further issues to consider, two of which will be discussed in this section: the learning of long paths, and how best to handle the explore/exploit dilemma.

9.5.1 Long Path Learning

The problem of long path learning is to find the optimal policy in sequential decision tasks when the solution requires learning action sequences of substantial length. While a solution was proposed to handle this problem [12], it was only designed to work for a particular problem class, as will be shown after discussing how XCS fails at long path learning. The classifier set optimality criterion from Chap. 7 might provide better results, but in general, long path learning remains an open problem. Long path learning is not only an issue for LCS, but for approximate DP and RL in general.

XCS and Long Path Learning

Consider the problem that is shown in Fig. 9.2. The aim is to find the policy that reaches the terminal state x6 from the initial state x1a in the smallest number of steps. In RL terms, this aim is described by giving a reward of 1 upon reaching the terminal state, and a reward of 0 for all other transitions. The optimal policy is to alternately choose actions 0 and 1, starting with action 1 in state x1a.

XCS and Long Path Learning

The optimal value function V* as a function of the number of steps to the terminal state is shown in Fig. 9.3(a) for a 15-step corridor finite state world. As can be seen, the difference between the values of V* for two adjacent states decreases with the distance from the terminal state.
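
To see why these differences shrink, note that with a reward of 1 on reaching the terminal state, 0 elsewhere, and discount factor γ, the optimal value of a state that is k steps away from the terminal state is γ^(k-1), so the difference between two adjacent states is

```latex
V^{*}_{k} - V^{*}_{k+1} = \gamma^{k-1} - \gamma^{k} = \gamma^{k-1}(1 - \gamma),
```

which decays geometrically with the distance k from the terminal state.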

Using the Relative Error

Barry proposed two preliminary approaches to handle the problem of long path learning in XCS, both based on making the error calculation of a classifier relative to its prediction of the value function [12]. The first approach is to estimate the distance of the matched states to the terminal state and scale the error accordingly, but this approach suffers from the inaccuracy of predicting this distance.

Using the Relative Error

A second, more promising alternative proposed in his study is to scale the measured prediction error by the inverse absolute magnitude of the prediction. The underlying assumption is that the difference in optimal values between two successive states is proportional to the absolute magnitude of these values.
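
Written as a formula, this relative error scaling amounts to (a paraphrase of the description above, not Barry's exact formulation)

```latex
\epsilon_{\mathrm{rel}} = \frac{\epsilon}{\lvert \hat{V} \rvert},
```

where ε is the measured prediction error and V̂ is the classifier's prediction of the value function.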

A Possible Alternative?

It was shown in Sect. 8.3.4 that the optimality criterion introduced in Chap. 7 is able to handle problems where the noise differs in different areas of the input space. Given that it is possible to use this criterion in an incremental implementation, will such an implementation be able to perform long path learning?

A Possible Alternative?

Let us assume that the optimality criterion causes the size of the area of the input space that is matched by a classifier to be proportional to the level of noise in the data, such that the model is refined in areas where the observations are known to accurately represent the data-generating process. Considering only measurement noise, when applied to value function approximation this would lead to more specific classifiers in states where the difference in magnitude of the value function for successive states is low, as in such areas this noise is deemed to be low. Therefore, the optimality criterion should provide an adequate approximation of the optimal value function, even in cases where long action sequences need to be represented.

9.5.2 Exploration and Exploitation

Maintaining the balance between exploiting current knowledge to guide action selection and exploring the state space to gain new knowledge is an essential problem for reinforcement learning. Too much exploration implies the frequent selection of sub-optimal actions and causes the accumulated reward to decrease. Too much emphasis on exploitation of current knowledge, on the other hand, might cause the agent to settle on a sub-optimal policy due to insufficient knowledge of the reward distribution [228, 209]. Keeping a good balance is important as it has a significant impact on the performance of RL methods.

9.5.2 Exploration and Exploitation

There are several approaches to handling exploration and exploitation: one can choose a sub-optimal action every now and then, independent of the certainty of the available knowledge, or one can take this certainty into account and choose actions that increase it. A variant of the latter is to use Bayesian statistics to model this uncertainty, which seems the most elegant solution but is unfortunately also the least tractable.
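
The first of these approaches is commonly realised as ε-greedy action selection. A minimal sketch, where the exploration rate ε and the dictionary representation of Q are assumptions for illustration:

```python
import random

def epsilon_greedy(Q, x, actions, epsilon=0.1):
    """With probability epsilon explore uniformly at random, otherwise exploit the current Q estimates."""
    if random.random() < epsilon:
        return random.choice(actions)             # explore: occasionally pick a possibly sub-optimal action
    return max(actions, key=lambda a: Q[(x, a)])  # exploit: act greedily w.r.t. current knowledge
```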

9.6 Summary

Despite sequential decision tasks being the prime motivator for LCS, they are still the tasks that LCS handle least successfully. This chapter provides a primer on how to use dynamic programming and reinforcement learning to handle such tasks, and on how LCS can be combined with either approach from first principles.

9.6 Summary

An essential part of the LCS type discussed in this book is that classifiers are trained independently. This is not completely true when using LCS with reinforcement learning, as the target values that the classifiers are trained on are based on the global prediction, which is formed by all matching classifiers in combination. In that sense, classifiers interact when forming their action-value function estimates. Nonetheless, besides combining classifier predictions to form the target values, independent classifier training still forms the basis of this model type, even when used in combination with RL.
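
To illustrate this interaction, the global prediction can be thought of as a weighted combination of the predictions of all matching classifiers, which then enters the RL target that each classifier is trained on. The classifier interface and the per-classifier mixing weights below are assumptions for illustration, not the specific mixing model developed in the book.

```python
def global_prediction(classifiers, x, a):
    """Combine the predictions of all classifiers matching (x, a) into one global estimate."""
    matching = [cl for cl in classifiers if cl.matches(x, a)]   # hypothetical matching test
    total_weight = sum(cl.mixing_weight for cl in matching)     # hypothetical mixing weights
    return sum(cl.mixing_weight * cl.predict(x, a) for cl in matching) / total_weight
```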

9.6 Summary

Overall, using LCS to approximate the value or action-value function in RL is appealing, as LCS dynamically adjust to the form of this function and thus might provide a better approximation than standard function approximation techniques. It should be noted, however, that the field of RL is moving quickly, and that Q-learning is far from the best method currently available. Hence, in order for LCS to be a competitive approach to sequential decision tasks, they also need to keep pace with new developments in RL, some of which were discussed when detailing the exploration/exploitation dilemma that is an essential component of RL. In summary, there is clearly still plenty of work to be done before LCS can provide the same formal development as RL currently does. Nonetheless, this chapter provides an initial formal basis upon which further research can build analysis and improvements to how LCS handle sequential decision tasks effectively, competitively, and with high reliability.
