2024 IAIO Problem 6.1

This question concerns Markov Decision Processes (MDPs) and the Bellman equations in reinforcement learning. Assume an agent is navigating a grid world with the following characteristics:

  • The grid is a 3x3 matrix, with each cell representing a state s \in \{s_1, s_2, \dots, s_9\}. States are numbered row by row, so s_1 is the top-left cell, s_2 is immediately to its right, and s_4 is immediately below s_1.

  • The agent can move up, down, left, or right. If the agent tries to move outside the grid, it stays in the same position.

  • The rewards are given as follows:

    • R(s_1, \text{attempts right}) = 1
    • R(s_2, \text{attempts down}) = 2
    • All other rewards are 0.
  • The transition probabilities are stochastic; the 0.1 slip directions below are absolute grid directions, regardless of which action was attempted:

    • With probability 0.7, the agent actually moves in the chosen direction.
    • With probability 0.1, the agent actually moves left.
    • With probability 0.1, the agent actually moves right.
    • With probability 0.1, the agent actually moves down.

The agent starts at s_1. Assume a discount factor of \gamma = 0.9.
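For concreteness, the dynamics above can be encoded in a short Python sketch. The helper names (next_state, transition_probs, reward) are illustrative choices made here, not part of the problem statement; the state numbering follows the row-by-row layout noted above.

```python
# Minimal sketch of the 3x3 grid world described above.
# Assumption: states are numbered 1..9 row by row, with s_1 in the top-left corner.
GAMMA = 0.9

def next_state(s, direction):
    """Deterministic move from state s (1..9); moves off the grid leave the agent in place."""
    row, col = divmod(s - 1, 3)
    if direction == "up":
        row = max(row - 1, 0)
    elif direction == "down":
        row = min(row + 1, 2)
    elif direction == "left":
        col = max(col - 1, 0)
    elif direction == "right":
        col = min(col + 1, 2)
    return 3 * row + col + 1

def transition_probs(s, attempted):
    """Successor distribution: 0.7 for the attempted direction, 0.1 each for left, right, down."""
    probs = {}
    for direction, p in [(attempted, 0.7), ("left", 0.1), ("right", 0.1), ("down", 0.1)]:
        s_next = next_state(s, direction)
        probs[s_next] = probs.get(s_next, 0.0) + p
    return probs

def reward(s, attempted):
    """Rewards as specified: R(s_1, attempts right) = 1, R(s_2, attempts down) = 2, else 0."""
    if s == 1 and attempted == "right":
        return 1
    if s == 2 and attempted == "down":
        return 2
    return 0
```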

Consider performing Q-learning with the value function V initialised to 0. Calculate the expected state-action value Q(s_1, \text{right}) after attempting to move right in the first iteration.

Note: In the Q function in this problem, the second argument should be understood as the agent’s attempted action, not the realized action.

From s_1, attempting to move right, the agent lands in s_2 with probability 0.7 (the intended move) plus 0.1 (the right slip), stays in s_1 with probability 0.1 (the left slip is blocked by the wall), and moves to s_4 with probability 0.1 (the down slip). We therefore have

\begin{align*}
Q \left( s_1, \text{attempts right} \right) & = R \left( s_1, \text{attempts right} \right) + \gamma \left( 0.7\, V \left( s_2 \right) + 0.1\, V \left( s_1 \right) + 0.1\, V \left( s_2 \right) + 0.1\, V \left( s_4 \right) \right) \\
& = 1 + 0.9 \left( 0.7 \cdot 0 + 0.1 \cdot 0 + 0.1 \cdot 0 + 0.1 \cdot 0 \right) \\
& = 1 .
\end{align*}
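Reusing next_state, transition_probs, reward, and GAMMA from the sketch above, the same number falls out of the one-step backup when every V(s') is initialised to 0:

```python
# One-step backup Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s'),
# with the value function V initialised to 0 (first iteration).
V = {s: 0.0 for s in range(1, 10)}

def q_value(s, attempted):
    return reward(s, attempted) + GAMMA * sum(
        p * V[s_next] for s_next, p in transition_probs(s, attempted).items()
    )

print(q_value(1, "right"))  # 1.0
```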