Hello
I am currently using Q-Learning to try to teach a bot how to move in a room filled with walls/obstacles. It must start in any place in the room and get to the goal state (this might be the tile that has a door, for example). Currently, when it wants to move to another tile, it simply goes to that tile, though I was thinking that in the future I might add a random chance of ending up on a different tile instead. It can only move up, down, left and right. Reaching the goal state yields +100 and all other actions yield 0.
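To make the setup concrete, this is roughly what my current implementation looks like (a simplified sketch with made-up room dimensions, wall positions and goal tile; the real room is bigger):

```python
import random

# Simplified sketch of my setup: a small grid with walls, one goal tile,
# reward +100 on reaching the goal and 0 everywhere else.
ROWS, COLS = 5, 5
WALLS = {(1, 1), (2, 3)}          # example obstacle tiles
GOAL = (4, 4)                     # e.g. the tile with the door
ACTIONS = ['up', 'down', 'left', 'right']
MOVES = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}

# Q-table: Q[(state, action)] -> value, initialised to 0
Q = {((r, c), a): 0.0 for r in range(ROWS) for c in range(COLS) for a in ACTIONS}

def step(state, action):
    """Deterministic move (no random slip yet); bumping a wall keeps you in place."""
    dr, dc = MOVES[action]
    r, c = state[0] + dr, state[1] + dc
    if not (0 <= r < ROWS and 0 <= c < COLS) or (r, c) in WALLS:
        r, c = state                          # blocked: stay where you are
    reward = 100 if (r, c) == GOAL else 0
    return (r, c), reward
```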
I am using the algorithm found here, which can be seen in the image below.
Now, regarding this, I have some questions:
- When using Q-Learning, a bit like with Neural Networks, must I make a distinction between a learning phase and a using phase? I mean, it seems that what they show in the first picture is the learning phase and what the second picture shows is the using phase. (I sketch what I mean by the two phases right after this list.)
- I read somewhere that it would take an infinite number of steps to reach the optimal Q-value table. Is that true? I'd say it isn't, but I must be missing something here.
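To illustrate what I mean by the two phases, here is a rough sketch on top of the setup above (alpha, gamma and epsilon are placeholder values I picked, not tuned):

```python
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1

def train_episode(start):
    """'Learning' phase: explore with epsilon-greedy moves and update the Q-table."""
    state = start
    while state != GOAL:
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)                      # explore
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit
        next_state, reward = step(state, action)
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state

def run_greedy(start, max_steps=1000):
    """'Using' phase: just follow the learned Q-values, no more updates."""
    state, path = start, [start]
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = max(ACTIONS, key=lambda a: Q[(state, a)])
        state, _ = step(state, action)
        path.append(state)
    return path
```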
I've also heard about TD (Temporal Difference) learning, which seems to be represented by the following update:

Q(a, s) = Q(a, s) + alpha * [R(a, s) + gamma * max_{a'} Q(a', s') - Q(a, s)]

which, for alpha = 1, just seems to be the one shown first in the picture. What difference does that gamma make here?
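For example, my own back-of-the-envelope check of what gamma does (assuming the goal tile is terminal): on a straight corridor leading to the goal, the converged Q-value of "move towards the goal" should fall off geometrically with gamma, one factor per step away:

```python
# Q at the tile next to the goal is 100 (the immediate reward); every extra
# tile away multiplies by gamma once, since the +100 is discounted per step.
GAMMA = 0.9
for k in range(1, 6):
    print(k, "tile(s) from the goal ->", GAMMA ** (k - 1) * 100)
# prints roughly 100, 90, 81, 72.9, 65.6
```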
- I have run into some complications when I try a very big room (300x200 pixels, for example). As it essentially moves randomly, if the room is very big it takes a lot of time to get randomly from the first state to the goal state. What methods can I use to speed it up? I thought of keeping a table of trues and falses recording whether I have already been in each state during the current episode. If I have, I'd discard that move; if not, I'd go there. If I had already been in all the reachable states, I'd just go to a random one. This way it would be just like what I am doing now, except that I'd repeat states less often than I currently do. (There is a sketch of this idea after this list.)
- I'd like to try something other than my lookup table for the Q-values, so I was thinking of using Neural Networks with back-propagation for this. I will probably try having one Neural Network for each action (up, down, left, right), as that seems to be what yields the best results. (A rough sketch of what I mean by that is below as well.) Are there any other methods (besides SVMs, which seem way too hard to implement myself) that I could use and implement that would give me a good Q-value function approximation?
- Do you think Genetic Algorithms would yield good results in this situation, using the Q-values matrix as the basis for them? How could I test my fitness function? (I sketch one possible fitness function below.) GAs give me the impression of being used for things that are far more random/complex. If we look carefully we notice that the Q-values follow a clear trend: higher Q-values near the goal and lower ones the farther away you are from it. Trying to reach that conclusion with a GA would probably take way too long?
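For the "avoid already-visited tiles" idea, this is roughly the action selection I have in mind, building on the sketch near the top of the post (the visited set is per episode; the names are just mine):

```python
def explore_action(state, visited):
    """Prefer moves whose destination tile has not been visited yet in this
    episode; fall back to a completely random move if they all have."""
    unvisited = [a for a in ACTIONS if step(state, a)[0] not in visited]
    return random.choice(unvisited) if unvisited else random.choice(ACTIONS)
```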
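For the Neural Network question, this is the kind of thing I mean by "one network per action": a tiny one-hidden-layer net per action, trained by plain back-propagation towards the TD target (only a sketch; the hidden size and learning rate are arbitrary):

```python
import numpy as np

class TinyNet:
    """Minimal one-hidden-layer net mapping a (row, col) state to one Q-value.
    One instance per action, trained by plain backprop on the squared TD error."""
    def __init__(self, hidden=16, lr=0.01):
        self.w1 = np.random.randn(2, hidden) * 0.1
        self.b1 = np.zeros(hidden)
        self.w2 = np.random.randn(hidden, 1) * 0.1
        self.b2 = np.zeros(1)
        self.lr = lr

    def predict(self, state):
        self.x = np.array(state, dtype=float)
        self.h = np.tanh(self.x @ self.w1 + self.b1)   # hidden activations kept for backprop
        return float(self.h @ self.w2 + self.b2)

    def update(self, state, target):
        """One gradient-descent step towards the TD target R + gamma * max Q'."""
        err = self.predict(state) - target
        dh = (err * self.w2.ravel()) * (1 - self.h ** 2)   # backprop through tanh
        self.w2 -= self.lr * np.outer(self.h, err)
        self.b2 -= self.lr * np.array([err])
        self.w1 -= self.lr * np.outer(self.x, dh)
        self.b1 -= self.lr * dh

# one network per action, as I describe above
nets = {a: TinyNet() for a in ACTIONS}
```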
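And for the Genetic Algorithm question, the only fitness function I can think of so far would treat a whole Q-values matrix as one individual and score it by how quickly the greedy policy it induces reaches the goal from random starts (again, only a sketch on top of the setup above):

```python
def fitness(q_matrix, trials=20, max_steps=200):
    """Average number of steps the greedy policy derived from q_matrix needs
    to reach the goal (fewer is better, so return the negative)."""
    total = 0
    for _ in range(trials):
        state = (random.randrange(ROWS), random.randrange(COLS))
        while state in WALLS or state == GOAL:
            state = (random.randrange(ROWS), random.randrange(COLS))
        steps = 0
        while state != GOAL and steps < max_steps:
            action = max(ACTIONS, key=lambda a: q_matrix[(state, a)])
            state, _ = step(state, action)
            steps += 1
        total += steps
    return -total / trials
```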
Thanks