Throughout this book, we have covered two main types of model-free algorithms: those based on the gradient of the policy, and those based on the value function. From the first family, we saw REINFORCE, actor-critic, PPO, and TRPO. From the second, we saw Q-learning, SARSA, and DQN. Besides the way in which the two families learn a policy (policy gradient algorithms use stochastic gradient ascent in the direction of the steepest increase of the estimated return, while value-based algorithms learn an action value for each state-action pair and then derive a policy from it), there are key differences that lead us to prefer one family over the other. These are the on-policy or off-policy nature of the algorithms, and their ability to handle large action spaces. We already discussed the differences between on-policy and off-policy in the previous...
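
To make the contrast between the two families concrete, here is a minimal sketch of one update step from each, written for a small tabular problem. It is an illustration under stated assumptions, not the implementation used elsewhere in the book: the function names, the tabular softmax parameterization of the policy, and the default values of `alpha`, `gamma`, and `lr` are all hypothetical choices made for this example.

```python
import numpy as np


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """Value-based step: move Q[s, a] toward the bootstrapped target."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q


def greedy_policy(Q, s):
    """The policy is derived from the learned action values."""
    return int(np.argmax(Q[s]))


def softmax(x):
    """Numerically stable softmax over a 1-D array of preferences."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def reinforce_update(theta, s, a, G, lr=0.1):
    """Policy gradient step for a tabular softmax policy (assumed form).

    theta[s] holds one preference per action; G is the sampled return.
    """
    probs = softmax(theta[s])
    grad_log_pi = -probs          # d log pi(a|s) / d theta[s] ...
    grad_log_pi[a] += 1.0         # ... equals one_hot(a) - probs
    theta[s] += lr * G * grad_log_pi  # gradient ascent on the return
    return theta


# Toy usage with 2 states and 2 actions
Q = q_learning_update(np.zeros((2, 2)), s=0, a=1, r=1.0, s_next=1)
theta = reinforce_update(np.zeros((2, 2)), s=0, a=1, G=2.0)
print(greedy_policy(Q, 0), softmax(theta[0]))
```

In the value-based step, the policy only appears indirectly through `greedy_policy`, whereas in the policy gradient step the parameters of the policy itself are adjusted so that actions followed by a high return become more probable.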