Let's imagine we have an (x,y) plane where a robot can move. Now we define the middle of our world as the goal state, which means that we are going to give a reward of 100 to our robot once it reaches that state.

Now, let's say that there are 4 states (which I will call A, B, C and D) that can lead to the goal state.

The first time we are in A and go to the goal state, we will update our Q-values table as follows:

Q(state = A, action = going to goal state) = 100 + 0

One of two things can happen. I can end the episode here and start a new one where the robot has to find the goal state again, or I can continue exploring the world even after finding the goal state. If I try the latter, though, I see a problem. If I am in the goal state and go back to state A, its Q-value will be the following:

Q(state = goalState, action = going to A) = 0 + gamma * 100

Now, if I try to go from A to the goal state again:

Q(state = A, action = going to goal state) = 100 + gamma * (gamma * 100)

Which means that if I keep doing this, since 0 <= gamma <= 1, both Q-values are going to keep rising forever.
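
To make the arithmetic concrete, here is a minimal sketch of the loop I have in mind (my own toy setup, just for illustration: a two-state A <-> goal cycle, a reward of 100 for entering the goal, and the same bare update as above, with no learning rate):

    # Repeatedly bounce between A and the goal and apply
    # Q(s, a) = r + gamma * max_a' Q(s', a'), as in the arithmetic above.
    gamma = 0.9  # discount factor, 0 <= gamma <= 1

    Q = {
        ("A", "to_goal"): 0.0,   # Q(state = A, action = going to goal state)
        ("goal", "to_A"): 0.0,   # Q(state = goalState, action = going to A)
    }

    def max_q(state):
        """Best Q-value over the actions available in `state` (only one each here)."""
        return max(value for (s, _), value in Q.items() if s == state)

    for step in range(1, 51):
        # A -> goal: reward 100, then bootstrap from the goal state's best action.
        Q[("A", "to_goal")] = 100 + gamma * max_q("goal")
        # goal -> A: reward 0, bootstrap from A's best action.
        Q[("goal", "to_A")] = 0 + gamma * max_q("A")
        if step in (1, 2, 10, 50):
            print(step, Q)

    # With gamma = 1 both values grow without bound; with gamma < 1 they still
    # keep climbing for a while, but level off at 100 / (1 - gamma**2) for A and
    # gamma times that for the goal state.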

Is this the expected behavior of Q-learning? Am I doing something wrong? If this is the expected behavior, can't it lead to problems? I know that, probabilistically, all four states (A, B, C and D) will grow at the same rate, but even so it kind of bugs me to have them growing forever.

The idea behind allowing the agent to continue exploring even after finding the goal is that the nearer it is to the goal state, the more likely it is to be in states whose Q-values can be updated at that moment.

+1  A: 

This is as expected, since the Q estimate isn't the expected reward; it's the expected return, which is the (possibly discounted via gamma) amount of reward I'd expect to reap from that state/action if I started there and followed my policy until the end of the episode or forever.

If you give me some buttons, and one of those buttons always produces $1 when pressed, then the true expected reward for pressing that button is $1. But the true expected return for pressing that button is infinity dollars, assuming I get an infinite number of chances to push it.
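
As a quick numerical check (my own numbers, just for illustration): the return from pressing that button forever is the geometric sum 1 + gamma + gamma^2 + ..., which is 1 / (1 - gamma) when gamma < 1 and grows without bound when gamma = 1:

    # Discounted return of pressing the $1 button `presses` times in a row.
    def button_return(gamma, presses=10_000):
        return sum(gamma ** t * 1.0 for t in range(presses))

    print(button_return(0.9))  # ~10.0, i.e. 1 / (1 - 0.9): finite despite endless presses
    print(button_return(1.0))  # 10000.0, and it keeps growing the longer you press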

kwatford
I didn't understand your point very well. It is clear to me that the Q estimate isn't the expected reward, but I don't see the point: if I leave this running all week long, I will find that the states near the goal state have Q-values around 9M or something, instead of a kind of gradient with 100 at the goal state that gets lower and lower the farther I get from it.
devoured elysium
Regardless of what state it starts in, the agent can get to the goal state in a few steps, after which it can simply revisit the goal state as often as it likes. So the expected return of nearly any state/action pair will go towards infinity (or the upper limit dictated by your gamma value). If you want to get meaningful Q-values out of a continuing task, then you need to design the rewards to be meaningful.
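
That upper limit is easy to compute; a small sketch, taking the question's reward of 100 as the largest per-step reward (an assumption just for illustration):

    # No discounted return can exceed r_max + gamma*r_max + gamma**2*r_max + ...,
    # i.e. r_max / (1 - gamma), so that is a hard ceiling on every Q-value.
    def q_value_ceiling(r_max, gamma):
        return r_max / (1.0 - gamma)

    print(q_value_ceiling(100, 0.9))  # 1000.0 -- values keep climbing, but never past this
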
kwatford