Dynamic Programming in Reinforcement Learning
1. What is Dynamic Programming in Reinforcement Learning?
- Dynamic Programming (DP) is a group of algorithms used in reinforcement learning to find the
optimal policy and value functions when we have a perfect model of the environment.
- The environment is modeled as a Markov Decision Process (MDP).
- DP uses Bellman equations to compute the value of states or state-action pairs.
- The main idea is to break down a complex problem into smaller subproblems and solve them
recursively.
- DP provides exact solutions, but it is computationally expensive, so it is rarely applied directly to large real-world problems.
- Other RL methods, such as Monte Carlo and Temporal Difference learning, can be viewed as approximations of DP that work from sampled experience instead of a full model.
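As a concrete illustration of what a "perfect model" means here, the sketch below writes out a tiny two-state MDP as a lookup table in Python. This is a minimal sketch for these notes: the layout P[s][a] = [(probability, next_state, reward), ...] and the names STATES, ACTIONS, GAMMA and P are conventions chosen for the example, not a standard API, and the numbers are arbitrary.

# A "perfect model" of a tiny MDP, written out as a lookup table.
# Assumed convention for these notes: P[s][a] is a list of
# (probability, next_state, reward) triples, i.e. the dynamics p(s', r | s, a).
STATES = ["s0", "s1"]
ACTIONS = ["left", "right"]
GAMMA = 0.9  # discount factor

P = {
    "s0": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(0.8, "s1", 1.0), (0.2, "s0", 0.0)],
    },
    "s1": {
        "left":  [(1.0, "s0", 0.0)],
        "right": [(1.0, "s1", 2.0)],
    },
}

The DP sketches in the following sections all assume a model in this same format.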
2. Policy Evaluation (Prediction)
- Policy Evaluation calculates how good a policy $\pi$ is by estimating its state-value function $v_\pi(s)$ for all states $s$.
- It uses the Bellman expectation equation: $v_\pi(s) = \sum_a \pi(a|s) \sum_{s',r} p(s', r|s, a)\,[r + \gamma\, v_\pi(s')]$.
- Instead of solving this system of linear equations directly, we use iterative updates starting from an initial guess.
- This method is called Iterative Policy Evaluation; it continues until the value function converges (a short sketch follows this list).
- The updates are based on expected values, not samples, and are done through multiple sweeps of
the state space.
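A minimal sketch of Iterative Policy Evaluation, assuming a model in the P[s][a] = [(prob, next_state, reward), ...] format of the toy example above; the stopping threshold theta and the in-place (single-array) update are illustrative choices.

def policy_evaluation(P, states, actions, policy, gamma=0.9, theta=1e-8):
    # policy[s][a] is the probability pi(a|s) of taking action a in state s.
    V = {s: 0.0 for s in states}              # arbitrary initial guess
    while True:
        delta = 0.0
        for s in states:                      # one sweep over the state space
            # Bellman expectation update (expected values, not samples)
            v_new = sum(policy[s][a] * prob * (r + gamma * V[s2])
                        for a in actions
                        for prob, s2, r in P[s][a])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new                      # in-place update
        if delta < theta:                     # changes are below the tolerance
            return V

# Example: evaluate the uniform-random policy on the toy model above.
# random_policy = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in STATES}
# V = policy_evaluation(P, STATES, ACTIONS, random_policy, GAMMA)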
3. Policy Improvement
- Policy Improvement improves a given policy by checking if different actions provide higher value.
- We compute the action-value function with a one-step lookahead: $q_\pi(s, a) = \sum_{s',r} p(s', r|s, a)\,[r + \gamma\, v_\pi(s')]$.
- If some action $a$ satisfies $q_\pi(s, a) > v_\pi(s)$, we switch the policy to that action in state $s$.
- The Policy Improvement Theorem guarantees that if $q_\pi(s, \pi'(s)) \ge v_\pi(s)$ for all states $s$, then the new policy $\pi'$ is at least as good: $v_{\pi'}(s) \ge v_\pi(s)$ for all $s$.
- Acting greedily with respect to $v_\pi$ therefore yields a policy that is at least as good as $\pi$, and strictly better unless $\pi$ is already optimal (see the sketch below).
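A minimal sketch of greedy Policy Improvement under the same assumed model format; q_from_v is a hypothetical helper name used only in these notes, and ties between equally good actions are broken arbitrarily by max.

def q_from_v(P, V, s, actions, gamma=0.9):
    # One-step lookahead: q_pi(s, a) for every action, given an estimate of v_pi.
    return {a: sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
            for a in actions}

def policy_improvement(P, V, states, actions, gamma=0.9):
    # Return a deterministic policy that is greedy with respect to V.
    new_policy = {}
    for s in states:
        q = q_from_v(P, V, s, actions, gamma)
        best = max(q, key=q.get)                  # argmax_a q(s, a)
        new_policy[s] = {a: 1.0 if a == best else 0.0 for a in actions}
    return new_policy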
4. Policy Iteration
- Policy Iteration finds the optimal policy by repeating Policy Evaluation and Policy Improvement.
- It starts with any policy and evaluates it using Iterative Policy Evaluation.
- Then it improves the policy using the Policy Improvement step.
- These steps are repeated until the policy does not change anymore.
- Because a finite MDP has only finitely many deterministic policies, this loop terminates, and the final policy and value function are both optimal (the loop is sketched below).
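A minimal sketch of the Policy Iteration loop, reusing the policy_evaluation and policy_improvement sketches from the previous sections, so it assumes the same model format; starting from the uniform-random policy is an arbitrary choice.

def policy_iteration(P, states, actions, gamma=0.9):
    # Start from the uniform-random policy; any initial policy works.
    policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}
    while True:
        V = policy_evaluation(P, states, actions, policy, gamma)        # evaluate
        new_policy = policy_improvement(P, V, states, actions, gamma)   # improve
        if new_policy == policy:      # greedy policy no longer changes -> stop
            return new_policy, V
        policy = new_policy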
5. Value Iteration
- Value Iteration simplifies Policy Iteration by combining evaluation and improvement into one step.
- It updates values using the Bellman Optimality Equation: $v(s) \leftarrow \max_a \sum_{s',r} p(s', r|s, a)\,[r + \gamma\, v(s')]$.
- Values are updated directly and repeatedly until they converge.
- Once the values stabilize, the optimal policy is extracted by choosing, in each state, the action that maximizes the one-step lookahead $\sum_{s',r} p(s', r|s, a)\,[r + \gamma\, v(s')]$.
- Each iteration is cheaper than a full Policy Iteration cycle because the policy is never evaluated to convergence (see the sketch below).
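A minimal sketch of Value Iteration under the same assumed model format; the threshold theta and the final greedy policy extraction mirror the description above.

def value_iteration(P, states, actions, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality update: max over actions of the one-step lookahead
            v_new = max(sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                        for a in actions)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break
    # Extract the greedy policy once the values have (approximately) converged.
    policy = {s: max(actions,
                     key=lambda a: sum(prob * (r + gamma * V[s2])
                                       for prob, s2, r in P[s][a]))
              for s in states}
    return policy, V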
6. Asynchronous Dynamic Programming
- Asynchronous DP updates the value of states in any order instead of all at once.
- In regular DP, we perform full sweeps of all states, but in Asynchronous DP, we update one or a
few states at a time.
- This is useful in large problems where full sweeps are expensive.
- It still converges to the optimal values as long as every state continues to be selected for updates, i.e. no state is neglected forever (a sketch follows this list).
- It is more practical and flexible for real-world applications.
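A minimal sketch of one asynchronous variant (random single-state Bellman optimality updates) under the same assumed model format; choosing states uniformly at random and using a fixed number of updates are illustrative choices, and other orderings, such as prioritizing states with large errors, work as well.

import random

def async_value_iteration(P, states, actions, gamma=0.9, num_updates=10_000):
    # In-place asynchronous DP: update one state per step instead of full sweeps.
    V = {s: 0.0 for s in states}
    for _ in range(num_updates):
        s = random.choice(states)     # any order is allowed, as long as no state
                                      # is permanently neglected
        V[s] = max(sum(prob * (r + gamma * V[s2]) for prob, s2, r in P[s][a])
                   for a in actions)
    return V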
7. Generalized Policy Iteration (GPI)
- GPI is the general idea of letting Policy Evaluation and Policy Improvement interact, at whatever granularity.
- Evaluation and improvement need not alternate in complete, fixed steps; they can be interleaved as finely as one state at a time.
- Evaluation makes the value function more consistent with the current policy, while improvement makes the policy greedy with respect to the current value function.
- The process stabilizes only when the policy is greedy with respect to its own value function, which is exactly the condition for both to be optimal.
- GPI is the foundation for many advanced reinforcement learning algorithms.