Chapter 5, Picking up the Toys
- The Q-learning name originates from the doctoral thesis of Christopher John Cornish Hellaby Watkins at King’s College, Cambridge, in May 1989. Evidently, the Q just stands for “quantity”.
- Only pick the Q-states that are valid follow-ons to the current state. If a state is impossible to reach from the current position (state), then don’t consider it in the update.
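As a sketch of how this can be done (the R matrix and the -1 marker for unreachable moves are illustrative, not taken from the book's listing):

    import numpy as np

    # Hypothetical reward matrix: R[state, action] == -1 marks moves that are
    # impossible from that state, so they are never considered in the update.
    R = np.array([[-1,  0, -1],
                  [ 0, -1, 10],
                  [-1,  0, -1]])
    Q = np.zeros(R.shape)

    def valid_actions(state):
        # Return only the actions that are actually reachable from this state.
        return np.where(R[state] >= 0)[0]

    def best_next_q(state):
        # Take the max of Q over the reachable follow-on states only.
        acts = valid_actions(state)
        return Q[state, acts].max() if acts.size else 0.0

    print(valid_actions(0))   # only action 1 is a legal follow-on from state 0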
- If the learning rate is too small, training can take a very long time. If it is too large, the system does not learn a path but instead “jumps around”: it may miss the minimum or optimum solution, fail to converge, or suddenly drop off.
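Note that the example program's learning rate appears to be on a different scale (see the valid values quoted later in this list); for comparison only, here is the conventional tabular update, where a learning rate alpha between 0 and 1 blends the old estimate with the new target (a minimal sketch; the variable names and values are illustrative):

    import numpy as np

    alpha = 0.1    # learning rate: too small -> very slow; too large -> estimates jump around
    gamma = 0.93   # discount factor

    Q = np.zeros((6, 6))           # illustrative 6-state, 6-action table
    stat, action, stat2 = 0, 1, 1  # a single hypothetical transition
    reward = 10.0

    # Standard temporal-difference update: move the old estimate a fraction
    # alpha of the way toward the new target.
    Q[stat, action] += alpha * (reward + gamma * np.max(Q[stat2]) - Q[stat, action])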
- The discount factor works by decreasing the reward as the path length gets longer. It is usually a value just short of 1.0 (for example, 0.93). Raising the discount factor may cause the system to reject valid longer paths and not find a solution; if the discount factor is too small, paths may become very long.
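A quick way to see the discounting effect is to print what a reward received n steps away is worth now, assuming it is simply multiplied by gamma at each step:

    gamma = 0.93  # discount factor just below 1.0

    # The reward that finally arrives n steps away is worth gamma**n now,
    # so longer paths earn less credit.
    for n in (1, 5, 10, 20):
        print(n, round(gamma ** n, 3))
    # prints roughly: 1 0.93, 5 0.696, 10 0.484, 20 0.234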
- You would adjust the fitness function to include path length as a factor in the score.
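A minimal sketch of what that could look like; the names toys_collected and path_length and the weights are hypothetical, not taken from the book's program:

    def fitness(toys_collected, path_length, length_weight=0.5):
        # Reward each toy picked up, then subtract a penalty that grows with
        # the length of the path, so shorter solutions score higher.
        return toys_collected * 10.0 - length_weight * path_length

    print(fitness(6, 40), fitness(6, 80))   # the shorter path scores higher: 40.0 vs 20.0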
- You can implement the SARSA technique in program 2 as follows:
    # SARSA = State, Action, Reward, State, Action
    Q[lastStat, lastAction] = reward + gamma * Q[stat2, action]
    # Q[stat, action] = reward + gamma * np.max(Q[stat2])
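For context, here is a self-contained sketch of the full on-policy SARSA loop on a small placeholder environment (this chain world and its parameters are illustrative, not program 2's maze). The key difference from Q-learning is that the update uses the Q-value of the action actually chosen next, rather than the max over all actions:

    import numpy as np

    n_states, n_actions = 6, 2            # toy chain world: action 0 = left, action 1 = right
    gamma, alpha, epsilon = 0.93, 0.1, 0.1
    Q = np.zeros((n_states, n_actions))
    rng = np.random.default_rng(0)

    def step(state, action):
        # Each move costs -1; reaching the last state pays +10 and ends the episode.
        nxt = min(state + 1, n_states - 1) if action == 1 else max(state - 1, 0)
        return nxt, (10.0 if nxt == n_states - 1 else -1.0)

    def choose(state):
        # Epsilon-greedy: SARSA learns the value of the policy it actually follows.
        if rng.random() < epsilon:
            return int(rng.integers(n_actions))
        return int(np.argmax(Q[state]))

    for episode in range(200):
        stat, action = 0, choose(0)
        for _ in range(100):                      # cap the episode length
            stat2, reward = step(stat, action)
            action2 = choose(stat2)               # pick the next action first
            # SARSA uses Q[stat2, action2] here, where Q-learning uses np.max(Q[stat2]).
            Q[stat, action] += alpha * (reward + gamma * Q[stat2, action2] - Q[stat, action])
            stat, action = stat2, action2
            if stat == n_states - 1:
                break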
- Generally, increasing the learning rate shortens the learning time (measured in generations), up to a limit where the path jumps out of the valid range. For our example program, the lowest learning rate that returns a valid solution is 5, and the highest is 15.
- It causes the simulation to run much faster, but it takes many more generations to find a solution.