Policy objective functions
Let's now discuss how to optimize a policy. In policy-based methods, the main objective is to find, for a given policy $\pi_\theta(s, a)$ with parameter vector $\theta$, the best values of that parameter vector. To decide which values are best, we measure the quality of the policy $\pi_\theta$ with an objective function $J(\theta)$ for different values of the parameter vector $\theta$.
Before discussing the optimization methods, let's first look at the different ways to measure the quality of a policy $\pi_\theta$:
- If it's an episodic environment, the measure can be the value function of the start state. That is, if the agent starts from a state $s_1$, the quality of the policy is the expected sum of rewards from that state onward. Therefore, $J_1(\theta) = V^{\pi_\theta}(s_1)$.

- If it's a continuing environment, the measure can be the average value function over the states. Since the environment goes on forever, the quality of the policy can be measured as the sum, over all states $s$, of the probability of being in state $s$, that is $d^{\pi_\theta}(s)$, times the value of that state, that is the expected reward from that state onward. Therefore...
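
The two measures above can be sketched numerically. This is a minimal illustration, not a definitive implementation, and everything in it is an assumption made for the example: the two-state transition matrix `P` and reward vector `r` stand for the Markov chain that some fixed policy induces, a discount factor `gamma` keeps the expected sum of rewards finite, the start-state measure is taken as the value of state 0, and the average-value measure is taken as the stationary-distribution-weighted sum of state values.

```python
import numpy as np


def state_values(P, r, gamma):
    """Solve the Bellman equation V = r + gamma * P @ V for a fixed policy."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, r)


def stationary_distribution(P):
    """Return d with d @ P = d and sum(d) = 1.

    Assumes the chain has a unique stationary distribution.
    """
    n = P.shape[0]
    # Stack the balance equations (P^T - I) d = 0 with the constraint sum(d) = 1.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d


def start_value_objective(P, r, gamma, start_state=0):
    """Episodic-style measure: value of the (assumed) start state."""
    return state_values(P, r, gamma)[start_state]


def average_value_objective(P, r, gamma):
    """Continuing-style measure: sum over s of d(s) * V(s)."""
    return stationary_distribution(P) @ state_values(P, r, gamma)


# Hypothetical two-state chain induced by some fixed policy (illustrative numbers).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])
r = np.array([1.0, 0.0])   # expected immediate reward in each state
gamma = 0.9                # assumed discount factor

print("start value:  ", start_value_objective(P, r, gamma))
print("average value:", average_value_objective(P, r, gamma))
```

Changing the policy parameters would change `P` and `r`, and therefore both numbers; optimizing the policy means searching for the parameters that make the chosen measure as large as possible.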