questions about q-learning

q-learning

Update Rule in Temporal difference

The update rule for TD(0) Q-learning is: Q(t-1) = (1-alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max(Q(t))), where Max(Q(t)) (MaxNextQ) is the maximum Q that can be obtained in the next state... Then take either the current best action (to optimize) or a random action (to explore). But in TD(1) I think the update rule will be: Q(t-2) = (1-alpha) * Q(t...
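
For reference, the TD(0) update and the choice between best and random action described in the question could be sketched roughly as follows. This is only a minimal tabular sketch, not a definitive implementation: the names `q_update` and `choose_action`, the dictionary-based Q-table keyed by (state, action), the `epsilon` parameter, and the default values of `alpha` and `gamma` are all assumptions made for illustration and do not come from the question.

```python
import random
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Apply one TD(0) Q-learning update to q[(state, action)].

    q is a hypothetical Q-table: a dict mapping (state, action) -> value.
    """
    # Max(Q(t)): the best value available from the next state.
    max_next_q = max(q[(next_state, a)] for a in actions)
    # Q(t-1) = (1 - alpha) * Q(t-1) + alpha * (Reward(t-1) + gamma * Max(Q(t)))
    q[(state, action)] = (1 - alpha) * q[(state, action)] + alpha * (
        reward + gamma * max_next_q
    )
    return q[(state, action)]


def choose_action(q, state, actions, epsilon=0.1, rng=random):
    """Pick a random action with probability epsilon, else the current best one."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # explore
    return max(actions, key=lambda a: q[(state, a)])  # take the current best action


# Usage with made-up states and actions (unseen entries default to 0.0).
q = defaultdict(float)
actions = ["left", "right"]
q[("s1", "right")] = 2.0
new_value = q_update(q, "s0", "left", 1.0, "s1", actions, alpha=0.5, gamma=0.9)
best = choose_action(q, "s0", actions, epsilon=0.0)
```

With `epsilon=0.0` the choice is purely greedy; a small positive value mixes in random actions for exploration.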

artificial-intelligence

machine-learning

markov-models

q-learning

temporal-difference

ansaurus
