
SARSA Reinforcement Learning Algorithm: A Guide


State-action-reward-state-action (SARSA) is an on-policy reinforcement learning
algorithm used to teach a new Markov decision process policy. Learn how it works
and how to code it.

State-action-reward-state-action (SARSA) is an on-policy algorithm designed to teach a machine learning model a new Markov decision process policy in order to solve reinforcement learning challenges. It’s an algorithm where, in the current state (S), an action (A) is taken, the agent gets a reward (R), ends up in the next state (S1), and takes action (A1) in S1. The tuple (S, A, R, S1, A1) spells out the acronym SARSA.

It’s called an on-policy algorithm because it updates its estimates based on the actions actually taken under the current policy.

WHAT IS SARSA?
SARSA is an on-policy algorithm used in reinforcement learning to train a Markov decision process model on a new policy. It’s an algorithm where, in the current state (S), an action (A) is taken, the agent gets a reward (R), ends up in the next state (S1), and takes action (A1) in S1; in other words, the tuple (S, A, R, S1, A1).

SARSA Algorithm
The SARSA algorithm differs slightly from Q-learning. In SARSA, the Q-value is updated taking into account the action, A1, actually performed in the next state, S1. In Q-learning, the action with the highest Q-value in the next state, S1, is used to update the Q-table instead.
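
As a rough sketch (not taken from the original article), the two update rules can be written side by side in Python. The Q-table sizes, learning rate (alpha) and discount factor (gamma) below are illustrative assumptions:

import numpy as np

n_states, n_actions = 16, 4          # example sizes (assumed, not from the article)
alpha, gamma = 0.1, 0.99             # learning rate and discount factor (assumed)
Q = np.zeros((n_states, n_actions))  # Q-table: one row per state, one column per action

def sarsa_update(s, a, r, s1, a1):
    # SARSA: the target uses the action A1 actually taken in the next state S1.
    Q[s, a] += alpha * (r + gamma * Q[s1, a1] - Q[s, a])

def q_learning_update(s, a, r, s1):
    # Q-learning: the target uses the highest Q-value available in the next state S1.
    Q[s, a] += alpha * (r + gamma * Q[s1].max() - Q[s, a])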

A video tutorial on how SARSA works in machine learning. | Video: Pankaj Porwal.


How Does the SARSA Algorithm Work?
The SARSA algorithm works by carrying out actions based on rewards received from previous actions. To do this, SARSA stores an estimated Q-value for every state (S)-action (A) pair. This table of estimates is known as a Q-table, and its state-action entries are denoted Q(S, A).

The SARSA process starts by initializing Q(S, A) to arbitrary values. In this step, the initial current state (S) is set, and the initial action (A) is selected using an epsilon-greedy policy based on the current Q-values. An epsilon-greedy policy usually selects the action with the highest estimated reward, but with a small probability (epsilon) it picks a random action instead, balancing exploitation and exploration during learning.

Exploitation involves using already known, estimated values to keep collecting the rewards the agent has previously learned how to earn. Exploration involves trying actions whose outcomes are not yet well known; this may result in short-term, sub-optimal actions during learning, but it can yield long-term benefits by uncovering better actions and larger rewards.
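
As a minimal sketch of this trade-off (the function name and the epsilon value are assumptions, not taken from the article), epsilon-greedy action selection can be written in Python as:

import random
import numpy as np

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    # Exploration: with probability epsilon, try a random action.
    if random.random() < epsilon:
        return random.randrange(n_actions)
    # Exploitation: otherwise pick the action with the highest current estimate.
    return int(np.argmax(Q[s]))

With a small epsilon, the agent mostly exploits its current estimates but still occasionally explores new actions.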

From here, the selected action is taken, and the reward (R) and
next state (S1) are observed. Q(S, A) is then updated, and the
next action (A1) is selected based on the updated Q-values.
In this way, the action-value estimate for each visited state-action pair is updated, reflecting the expected reward for taking that action in that state.

The steps from observing R through selecting A1 are repeated until the episode ends, that is, until the final (terminal) state is reached. An episode is the sequence of states, actions and rewards experienced along the way, and the state, action and reward experience at each step is used to update the Q(S, A) values.
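
Putting these steps together, a minimal SARSA training loop might look like the sketch below. It assumes a classic Gym-style environment whose reset() returns a state and whose step() returns (next state, reward, done, info); the environment sizes and hyperparameters are illustrative assumptions, not values from the article:

import random
import numpy as np

def train_sarsa(env, n_states, n_actions, episodes=500,
                alpha=0.1, gamma=0.99, epsilon=0.1):
    # env is assumed to follow the classic Gym API (reset/step); adjust as needed.
    # Initialize Q(S, A) to arbitrary values (zeros here).
    Q = np.zeros((n_states, n_actions))

    def select(s):
        # Epsilon-greedy action selection based on the current Q-values.
        if random.random() < epsilon:
            return random.randrange(n_actions)
        return int(np.argmax(Q[s]))

    for _ in range(episodes):
        s = env.reset()   # set the initial state S
        a = select(s)     # choose the initial action A
        done = False
        while not done:
            s1, r, done, _ = env.step(a)  # take A, observe R and next state S1
            a1 = select(s1)               # choose A1 in S1 with the same policy
            # SARSA update: the target uses the action A1 actually chosen in S1.
            target = r + gamma * Q[s1, a1] * (not done)
            Q[s, a] += alpha * (target - Q[s, a])
            s, a = s1, a1                 # move on to the next step
    return Q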


SARSA vs. Q-learning


The main difference between SARSA and Q-learning is that
SARSA is an on-policy learning algorithm, while Q-learning is an
off-policy learning algorithm.

In reinforcement learning, two different policies are also used for active agents: a behavior policy and a target policy. A behavior policy is used to decide actions in a given state (the behavior the agent is currently using to interact with its environment), while a target policy is used to learn about desired actions and what rewards are received (the ideal policy the agent seeks to use to interact with its environment).

If an algorithm’s behavior policy matches its target policy, it is an on-policy algorithm. If the two policies don’t match, it is an off-policy algorithm.

SARSA operates by choosing an action following the current epsilon-greedy policy and updating its Q-values accordingly. On-policy algorithms like SARSA give non-greedy actions some probability of being selected, providing a balance between exploitation and exploration. Since SARSA’s Q-values are learned using the same epsilon-greedy policy for both behavior and target, it is classified as on-policy.

Q-learning, unlike SARSA, updates toward the greedy action at each step. A greedy action is the one that gives the maximum Q-value for the state under the current estimates, that is, the action a greedy policy would follow. Off-policy algorithms like Q-learning learn a target policy regardless of which actions are selected during exploration. Since Q-learning updates using greedy actions and can learn about a target policy while following a separate behavior policy, it is classified as off-policy.

The SARSA algorithm is a slight variation of the popular Q-learning algorithm. For a learning agent in any reinforcement learning algorithm, its policy can be of two types:

1. On-policy: The learning agent learns the value function according to the current action derived from the policy currently being used.
2. Off-policy: The learning agent learns the value function according to the action derived from another policy.

Q-learning is an off-policy technique and uses the greedy approach to learn the Q-value. SARSA, on the other hand, is on-policy and uses the action performed by the current policy to learn the Q-value. This difference is visible in the update rule for each technique:

SARSA:      Q(s, a) <- Q(s, a) + alpha * [r + gamma * Q(s', a') - Q(s, a)]
Q-learning: Q(s, a) <- Q(s, a) + alpha * [r + gamma * max_a' Q(s', a') - Q(s, a)]

Here, the update equation for SARSA depends on the current state, the current action, the reward obtained, the next state and the next action. This observation led to the naming of the technique: SARSA stands for State-Action-Reward-State-Action, which symbolizes the tuple (s, a, r, s', a').
