Let's assume we're in a room where our agent can move along the x and y axes. At each point he can move up, down, right, or left, so our state space can be defined by (x, y) and our actions at each point are (up, down, right, left). Let's assume that whenever our agent takes an action that would make him hit a wall, we give him a negative reward of -1 and put him back in the state he was in before. If he finds the puppet in the center of the room, he wins a +10 reward.
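
A minimal sketch of that room, to make the setup concrete (the 21x21 size and the screen-style coordinates, with (0, 0) in the top-left corner, are just assumptions of mine, not part of the problem):

    # Sketch of the room described above. The 21x21 size and the screen-style
    # coordinates ((0, 0) is the top-left corner, so UP from (0, 0) goes out of
    # the room) are assumptions for concreteness.
    SIZE = 21
    GOAL = (10, 10)                        # the puppet in the center of the room
    ACTIONS = {"UP": (0, -1), "DOWN": (0, 1), "LEFT": (-1, 0), "RIGHT": (1, 0)}

    def step(state, action):
        """Apply an action; hitting a wall gives -1 and leaves the state unchanged."""
        dx, dy = ACTIONS[action]
        nx, ny = state[0] + dx, state[1] + dy
        if not (0 <= nx < SIZE and 0 <= ny < SIZE):
            return state, -1.0             # bounced off a wall, stay where we were
        if (nx, ny) == GOAL:
            return (nx, ny), 10.0          # found the puppet
        return (nx, ny), 0.0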

When we update our Q-value for a given state/action pair, we look at which actions can be taken in the new state and compute the maximum Q-value obtainable there, and use that to update Q(s, a) for our current state/action. What this means is that if we have a goal state at the point (10, 10), all the states around it will have Q-values that get a bit smaller the farther away they are. With the walls, however, the same does not seem to hold.
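
In code, the update I have in mind looks roughly like this (reusing the sketch above; no learning rate, since the world is deterministic, and gamma = 0.9 is an arbitrary choice):

    # Tabular Q-learning backup for the deterministic case:
    # Q(s, a) <- r + gamma * max_a' Q(s', a')
    from collections import defaultdict

    GAMMA = 0.9
    Q = defaultdict(float)                 # Q[(state, action)], defaults to 0

    def update(state, action):
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] = reward + GAMMA * best_next
        return next_state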

When the agent hits a wall (let's assume he's at position (0, 0) and took the action UP), he will receive a reward of -1 for that state/action, thus getting a Q-value of -1.

Now, if later I am in the state (0, 1), and assuming all the other actions of state (0, 0) are zero, the Q-value of (0, 1) for the action LEFT will be computed as follows:

Q([0,1], LEFT) = 0 + gamma * (max { 0, 0, 0, -1 } ) = 0 + 0 = 0

That is, having hit the wall doesn't propagate to nearby states, contrary to what happens with positive-reward states.
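
A quick check with the sketch above gives the same numbers (with my coordinate convention the move from (0, 1) back into (0, 0) happens to be UP rather than LEFT, but the point is the same):

    update((0, 0), "UP")                   # bounces off the wall: Q((0,0), UP) = -1
    update((0, 1), "UP")                   # lands in (0, 0); max{0, 0, 0, -1} = 0
    print(Q[((0, 0), "UP")])               # -1.0
    print(Q[((0, 1), "UP")])               # 0.0  -> the -1 did not propagate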

To me this seems odd. At first I thought that finding state/action pairs with negative rewards would be just as useful for learning as finding ones with positive rewards, but from the example above that doesn't seem to hold. The algorithm appears to be biased toward taking positive rewards far more into consideration than negative ones.

Is this the expected behavior of Q-learning? Shouldn't bad rewards be just as important as good ones? What are the "work-arounds" for this?

+2  A: 

Negative feedback only propagates when it is the only possible outcome from a particular move.

Whether this is deliberate or unintentional I do not know.

Anon.
That's a nice conclusion, yes.
devoured elysium
+1  A: 

You can avoid negative rewards by increasing the default reward from 0 to 1, the goal reward from 10 to 11, and the penalty from -1 to 0.

There are tons of scientific publications on Q-learning, so I'm sure there are other formulations that would allow for negative feedback.

EDIT: I stand corrected; this doesn't change the behaviour, as I claimed above. My thought process was that the formulation with negative feedback could be replaced by one without.

The reason for your observation is that there is no uncertainty about the outcome of your actions or about the state the agent is in, so your agent can always choose the action it believes has the optimal reward (i.e. the maximum Q-value over all future actions). This is why your negative feedback doesn't propagate: the agent will simply avoid that action in the future.

If, however, your model included uncertainty about the outcome of your actions (e.g. there is always a 10% probability of moving in a random direction), your learning rule should integrate over all possible future rewards (basically replacing the max by a weighted sum). In that case negative feedback can be propagated too (this is why I thought it should be possible :p). Examples of such models are POMDPs.
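
For example, reusing the grid sketch from the question, an expected backup with a 10% slip probability could look roughly like this (the slip model and the name expected_update are just illustrative):

    # Backup against the expected outcome of an action: with probability
    # 1 - SLIP the chosen action happens, otherwise a uniformly random one does.
    SLIP = 0.1

    def expected_update(state, action):
        probs = {a: SLIP / len(ACTIONS) for a in ACTIONS}
        probs[action] += 1 - SLIP
        expected = 0.0
        for actual, p in probs.items():
            next_state, reward = step(state, actual)
            best_next = max(Q[(next_state, a)] for a in ACTIONS)
            expected += p * (reward + GAMMA * best_next)
        Q[(state, action)] = expected

    # Next to a wall, every action now has some chance of the -1 bump, so the
    # expected values of wall-adjacent states drop below those farther away,
    # and the negative feedback propagates.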

catchmeifyoutry
That's a very good idea indeed.
devoured elysium
Could you elaborate on how you think this works? Because it doesn't: in your case, max() will end up returning 1 (the default value) even if you decrease some (but not all) of the components.
Anon.
Actually, now that I think of it, it will make the agent want to go to the (formerly) negative-reward states, as they are higher than 0 (the default value in the empty Q-value table).
devoured elysium
I just noticed I misread your original post. Your idea will yield exactly the same results as the original Q-learning formulation, catch.
devoured elysium
Yes, sorry guys, I've updated my post.
catchmeifyoutry
I get everything you said except your last paragraph. Why would uncertainty make negative rewards propagate? If for each action I take there is a 10% probability of a different one being chosen, then on average I will commit the same "error" for all of them, just decreasing the overall expected value of all the other actions. Is this what you mean?
devoured elysium
Yes, except "you" wouldn't choose a different one, "fate" would. If I'm on a square next to a wall, any action has a small (or large, when moving toward the wall) probability of a negative reward. This means that the expected reward of standing next to a wall becomes less than that of standing farther away from the wall, as the probability of accidentally ending up in a bad state is smaller there.
catchmeifyoutry
Yes, thanks, I get it. Anyway, for the deterministic case, is there any way I could make the states I don't want my agent to get into propagate just as positive rewards do?
devoured elysium
Not in the way you propose, no. But why would you not want the agent to get into those states? As long as the penalty can be avoided in that state, moving to the state is not necessarily bad in itself. Plus, if a move to the state is suboptimal, the agent will learn to prefer better actions, given sufficient time. If there were a way to propagate negative rewards in your example, the agent might never succeed when reaching the reward requires passing near a wall, after one early bad experience. Which answers your original question: bad rewards shouldn't propagate (if there is no uncertainty).
catchmeifyoutry