This question focuses on Markov Decision Processes (MDPs) and the Bellman equations in reinforcement learning. Assume an agent is navigating a grid world with the following characteristics:
- The grid is a 3x3 matrix, with each cell representing a state s \in \{s_1, s_2, \dots, s_9\}.
- The agent can move up, down, left, or right. If the agent tries to move outside the grid, it stays in the same position.
- The rewards are given as follows:
  - R(s_1, \text{attempts right}) = 1
  - R(s_2, \text{attempts down}) = 2
  - All other rewards are 0.
- The transition probabilities are stochastic:
  - With probability 0.7, the agent actually moves in the chosen direction.
  - With probability 0.1, the agent actually moves left.
  - With probability 0.1, the agent actually moves right.
  - With probability 0.1, the agent actually moves down.
The agent starts at s_1. Assume a discount factor of \gamma = 0.9.
Consider performing Q-learning with all Q-values (and hence the value function V) initialised to 0. Calculate the expected state-action value Q(s_1, \text{right}) after the agent attempts to move right in the first iteration.
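
For reference, the tabular Q-learning update is Q(s, a) \leftarrow Q(s, a) + \alpha \big[ R(s, a) + \gamma \max_{a'} Q(s', a') - Q(s, a) \big]. The question does not fix a learning rate, so the sketch below assumes \alpha = 1; it also reads the slip probabilities literally as listed above and treats the reward as depending only on the state and the attempted action, as stated. Because every Q-value starts at 0, the bootstrap term \gamma \max_{a'} Q(s', a') is 0 for every possible successor state, so only the immediate reward contributes to the expected target.

```python
# Minimal sketch (not the official solution): expected one-step Q-learning
# target for Q(s_1, right), assuming learning rate alpha = 1 and the slip
# probabilities exactly as listed in the problem statement.

GAMMA = 0.9

# Possible actual movements when the agent attempts "right" from s_1,
# with their probabilities (0.7 intended + the three 0.1 slips).
outcomes = [
    ("right", 0.7),  # intended move succeeds
    ("left", 0.1),   # slip left
    ("right", 0.1),  # slip right (coincides with the intended direction here)
    ("down", 0.1),   # slip down
]

# Reward for attempting "right" in s_1; it does not depend on where the
# agent actually ends up.
reward = 1.0

# All Q-values are initialised to 0, so max_a' Q(s', a') = 0 for every
# successor state s', whichever cell the agent lands in.
bootstrap = 0.0

# Expected target: sum over outcomes of p * (R + gamma * max_a' Q(s', a')).
expected_target = sum(p * (reward + GAMMA * bootstrap) for _, p in outcomes)

alpha = 1.0  # assumed; the question does not specify a learning rate
q_s1_right = 0.0 + alpha * (expected_target - 0.0)

print(expected_target, q_s1_right)  # 1.0 1.0 under these assumptions
```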
