Finding optimal policies with Dynamic Programming
Dynamic Programming (DP) forms the basis of many RL algorithms. The main paradigm of DP algorithms is to use the state- and action-value functions as tools to find the optimal policy, given a fully known model of the environment. In this section, we'll see how to do that.
Policy evaluation
We'll start with policy evaluation, or how to compute the state-value function, vπ(s), given a specific policy, π. This task is also known as prediction. As a reminder, we'll assume that the state-value function is a table. We'll implement policy evaluation using the state-value Bellman equation we defined in the Bellman equations section.
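As a quick reminder, this is the standard form of that equation (the notation here may differ slightly from the earlier section; γ is the discount factor and p(s', r | s, a) is the environment's transition dynamics):

$$v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \left[ r + \gamma v_{\pi}(s') \right]$$

Let's start: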
- Input the following:
- The policy, π.
- A small threshold value, θ, which is used to assess when to stop.
- Initialize the following:
- The Δ variable with 0. We'll use it in combination with θ to assess whether to stop.
- The vπ(s) table with some value for all states.
- Repeat until Δ < θ:
- For each state si in the set of all states, do the following:
- Extract...