
Update rule in temporal-difference learning

The update rule for TD(0) Q-learning is:

Q(t-1) = (1 - alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max(Q(t)))

where Max(Q(t)) (MaxNextQ) is the maximum Q-value that can be obtained in the next state... Then take either the current best action (to exploit) or a random action (to explore).

But in TD(1) I think the update rule will be: Q(t-2) = (1-alpha) * Q(t...
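
For reference, here is a minimal Python sketch of the TD(0) Q-learning update and the explore/exploit choice as described at the start of this post. It only illustrates the TD(0) case. The names `td0_q_update`, `choose_action`, `epsilon`, and the small Q-table are assumptions made up for illustration; the table is a plain list of lists indexed as `q[state][action]`.

```python
import random


def td0_q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply one TD(0) Q-learning update to q[state][action] in place."""
    # Max(Q(t)): the largest Q-value available in the next state.
    max_next_q = max(q[next_state])
    # Q(t-1) = (1 - alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max(Q(t)))
    q[state][action] = (1 - alpha) * q[state][action] + alpha * (
        reward + gamma * max_next_q
    )
    return q[state][action]


def choose_action(q, state, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the current best."""
    n_actions = len(q[state])
    if rng.random() < epsilon:
        return rng.randrange(n_actions)  # explore
    return max(range(n_actions), key=lambda a: q[state][a])  # exploit


# Hypothetical table with 2 states and 2 actions (values are made up).
q = [[0.0, 0.0], [1.0, 2.0]]
new_value = td0_q_update(
    q, state=0, action=1, reward=1.0, next_state=1, alpha=0.5, gamma=0.9
)
greedy_action = choose_action(q, state=0, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy, which makes the behaviour easy to check by hand; a nonzero `epsilon` adds the random exploration step.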