What difference does having a big or small gamma value make to the algorithm? In my view, as long as it is neither 0 nor 1, it should work exactly the same. On the other hand, whatever gamma I choose, the Q-values seem to get very close to zero really quickly (in a quick test I'm already seeing values on the order of 10^-300). How do people usually plot Q-values (I'm plotting (x, y, best Q-value for that state)) given that problem? I'm trying to get around it with logarithms, but even then it feels kind of awkward.
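For reference, this is roughly the kind of plot I mean (the grid size, the action set and the layout of the Q dictionary here are just placeholders, not my actual code):

```
# Rough sketch of plotting max_a Q((x, y), a) as a heatmap over a grid.
# WIDTH, HEIGHT, ACTIONS and the Q[((x, y), action)] layout are assumptions
# made up for this sketch, not details from my actual setup.
import numpy as np
import matplotlib.pyplot as plt

WIDTH, HEIGHT = 10, 10
ACTIONS = ["up", "down", "left", "right"]

def plot_best_q(Q):
    best = np.zeros((HEIGHT, WIDTH))
    for y in range(HEIGHT):
        for x in range(WIDTH):
            # best Q-value over all actions for this (x, y) state
            best[y, x] = max(Q.get(((x, y), a), 0.0) for a in ACTIONS)
    plt.imshow(best, origin="lower")
    plt.colorbar(label="max_a Q((x, y), a)")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()
```

If the values really span down to 1e-300, a logarithmic colour normalisation (matplotlib has LogNorm/SymLogNorm in matplotlib.colors) is usually less awkward than taking logarithms by hand.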

Also, I don't get the reason behind having an alpha parameter in the Q-learning update function. It basically sets the magnitude of the update we are going to make to the Q-value function. I have the idea that it is usually decreased over time. What is the point of having it decrease over time? Should an update at the beginning really have more importance than one 1000 episodes later?
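For concreteness, this is the update I'm referring to, written out as a rough sketch (the names here are just placeholders):

```
# Minimal sketch of the standard tabular Q-learning update, just to show
# where alpha and gamma enter. Q, state, action, reward, next_state and
# actions are placeholder names, not taken from my actual code.
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> value, defaults to 0
alpha = 0.1              # learning rate: how far each sample moves the estimate
gamma = 0.9              # discount: how much the next state's value contributes

def q_update(state, action, reward, next_state, actions):
    # best Q-value reachable from the next state
    best_next = max(Q[(next_state, a)] for a in actions)
    # temporal-difference target and error
    td_target = reward + gamma * best_next
    td_error = td_target - Q[(state, action)]
    # alpha only scales how far we move toward the target
    Q[(state, action)] += alpha * td_error
```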

Also, I was thinking that a good idea for exploring the state space, every time the agent doesn't want to take the greedy action, would be to explore any action that still has a zero Q-value (which means, at least most of the time, a state-action pair never tried before), but I don't see that mentioned anywhere in the literature. Are there any downsides to this? I know this can't be used with (at least some) generalization functions.
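Something roughly like this is what I had in mind (sketch only; Q and the action list are hypothetical names):

```
# Sketch of the idea above: when not acting greedily, prefer actions whose
# Q-value is still exactly zero (very likely never updated before).
# Q is a dict keyed by (state, action); the names are placeholders.
import random

def explore_action(Q, state, actions):
    untried = [a for a in actions if Q.get((state, a), 0.0) == 0.0]
    if untried:
        return random.choice(untried)   # favour never-updated actions first
    return random.choice(actions)       # otherwise fall back to uniform exploration
```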

Another idea would be to keep a table of visited states/actions and try the actions that have been tried the fewest times in that state. Of course this can only be done in relatively small state spaces (in my case it is definitely possible).
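Again as a rough sketch (the table name and keying are placeholders):

```
# Sketch of the visit-count idea: keep a table N[(state, action)] and, when
# exploring, pick the action tried fewest times in the current state.
# Only practical for small, discrete state spaces, as noted above.
from collections import defaultdict
import random

N = defaultdict(int)   # visit counts per (state, action)

def least_tried_action(state, actions):
    fewest = min(N[(state, a)] for a in actions)
    candidates = [a for a in actions if N[(state, a)] == fewest]
    return random.choice(candidates)    # break ties randomly

def record_visit(state, action):
    N[(state, action)] += 1
```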

A third idea, for later in the exploration process, would be to look not only at the Q-value of the selected action, but also at the Q-values of all the other actions available in that state, and then at those of the states reachable from it, and so on.

I know these questions are kind of unrelated, but I'd like to hear the opinions of people who have worked with this before and (probably) struggled with some of them too.

A: 

I haven't worked with systems exactly like this before, so I don't know how useful I can be, but...

Gamma is a measure of the agent's tendency to look forward to future rewards. The smaller it is, the more the agent will tend to take the action with the greatest immediate reward, regardless of the resultant state. Agents with larger gamma will learn long paths to big rewards. As for all Q-values approaching zero, have you tried a very simple state map (say, one state and two actions) with gamma=0? That should quickly approach Q=reward.
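As a quick sanity check, something like this should show it (the rewards 1.0 and 5.0 are made up purely for the demonstration):

```
# One state, two actions, gamma = 0. With gamma = 0 the update reduces to
# Q += alpha * (reward - Q), so each Q-value should converge to its action's
# immediate reward. The rewards here are made-up example numbers.
alpha, gamma = 0.5, 0.0
rewards = {"a": 1.0, "b": 5.0}
Q = {"a": 0.0, "b": 0.0}

for _ in range(50):
    for action, r in rewards.items():
        # the "next state" is the same single state; its value is irrelevant when gamma = 0
        best_next = max(Q.values())
        Q[action] += alpha * (r + gamma * best_next - Q[action])

print(Q)   # should print values very close to {'a': 1.0, 'b': 5.0}
```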

The idea of reducing alpha is to damp down oscillations in the Q values, so that the agent can settle into a stable pattern after a wild youth.
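One common schedule (just one standard choice among many, not something your problem requires) is to decay alpha with the number of visits:

```
# Sketch of a per-(state, action) learning-rate decay. The 1/(1 + visits)
# schedule is one conventional option, not something prescribed above.
from collections import defaultdict

visits = defaultdict(int)   # how many times each (state, action) has been updated

def decayed_alpha(state, action, alpha0=1.0):
    visits[(state, action)] += 1
    return alpha0 / (1.0 + visits[(state, action)])
```

Early updates then use nearly the full step size, while later ones change Q only slightly, which is what damps the oscillations.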

Exploring the state space? Why not just iterate over it and have the agent try everything? There's no reason to have the agent actually follow a course of action in its learning, unless that's the point of your simulation. If the idea is just to find the optimal behavior pattern, adjust all Q's, not just the highest ones along a path.
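Sketched out, I mean something like a full sweep over every state-action pair, assuming you can query the environment directly (step(state, action) -> (reward, next_state) here is a hypothetical deterministic model function, not something from your code):

```
# Sketch of "adjust all Q's": a synchronous sweep over every (state, action)
# pair, repeated a fixed number of times (or until the values stop changing).
# states, actions and step() are hypothetical; step() is assumed deterministic.
def sweep_all(states, actions, step, gamma=0.9, iterations=100):
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(iterations):
        new_Q = {}
        for s in states:
            for a in actions:
                reward, s_next = step(s, a)
                new_Q[(s, a)] = reward + gamma * max(Q[(s_next, b)] for b in actions)
        Q = new_Q
    return Q
```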

Beta
The point of doing Q-learning is not to iterate over the whole space. It's precisely to learn as fast as possible (i.e., with giant state spaces, to learn quickly how to explore them well enough for a given task). If the idea were to iterate over everything, I'd use a typical search method (breadth-first, depth-first, etc.). Also, I don't get the point of setting gamma to zero. Then only the actions that lead to the goal would get updated; all the others would stay at zero.
devoured elysium