There is a pattern to dealing with most MDP problems, but I think you've probably omitted some information from the problem description, most likely it has to do with the state you're trying to reach, or the way an episode ends (what happens if you run off the edge of the grid). I've done my best to answer your questions, but I've appended a primer on the process I use to deal with these types of problems.
Firstly utility is a fairly abstract measure of how much you want to be in a given state. It's definitely possible to have two states with equal utility, even when you measure utility with simple heuristics (Euclidean or Manhattan distance). In this case, I assume that the utility value and reward are interchangeable.
In the long term, the objective in these types of problems tends to be, how do you maximise your expected (long term) reward? The learning rate, gamma, controls how much emphasis you place on the current state versus where you would like to end up - effectively you can think of gamma as a spectrum going from, 'do the thing the benefits me most in this timestep' to at the other extreme 'explore all my options, and go back to the best one'. Sutton and Barto in there book on reinforcement learning have some really nice explanations of how this works.
Before you get started, go back through the question and make sure that you can confidently answer the following questions.
- What is a state? How many states are there?
- What is an action? How many actions are there?
- If you start in state u, and you apply an action a, what is the probability of reaching a new state v?
So the answers to the questions?
- A state is a vector (x,y). The grid is 5 by 5, so there are 25 states.
- There are four possible actions, {E,N,S,W}
- The probability of successfully reaching an adjacent state after applying a suitable action is 0.7, the probability of not moving (staying in the same state is 0.3). Assuming (0,0) is the top left cell and (4,4) is the bottom right cell, the following table shows a small subset of all of the possible transitions.
Start State Action Final State Probability
---------------------------------------------------
(0,0) E (0,0) 0.3
(0,0) E (1,0) 0.7
(0,0) E (2,0) 0
...
(0,0) E (0,1) 0
...
(0,0) E (4,4) 0
(0,0) N (0,0) 0.3
...
(4,4) W (3,4) 0.7
(4,4) W (4,4) 0.3
How can we check that this makes sense for this problem?
- Check that the table has an appropriate number of entries. On a 5 by 5 grid there are 25 states and 4 actions, so the table should have 100 entries.
- Check to make sure that for an start state / action pair, only two entries have non-zero probability of occuring.
Edit. answering the request for the transition probabilities to the target state. The notation below assumes
- v is the final state
- u is the source state
- a is the action, where it isn't mentioned, it's implied that the action applied isn't relevant.
P( v=(3,3) | u =(2,3), a=E ) = 0.7
P( v=(3,3) | u =(4,3), a=W ) = 0.7
P( v=(3,3) | u =(3,2), a=N ) = 0.7
P( v=(3,3) | u =(3,4), a=S ) = 0.7
P( v=(3,3) | u =(3,3) ) = 0.3