2025 USA-NA-AIO Round 2, Problem 2, Part 4

Part 4 (5 points, non-coding task)

At position l_1 in the attending sequence, for head h, the information extracted by attending to the attended sequence is given by

\mathbf{o}_{h,l_1} = \sum_{l_2 = 0}^{L_2 - 1} \alpha_{h, l_1 l_2} \mathbf{v}_{l_2,h} .

We hereafter call \mathbf{o}_{h,l_1} a pre-out-projection output vector.

Do the following tasks.

  1. What is the shape of vector \mathbf{o}_{h,l_1}?

  2. We concatenate \left\{\mathbf{o}_{h,l_1} : h \in \left\{ 0, 1 , \cdots , H-1 \right\} \right\} along axis 0:

    \mathbf{o}_{l_1} = \begin{bmatrix} \mathbf{o}_{0,l_1} \\ \mathbf{o}_{1,l_1} \\ \vdots \\ \mathbf{o}_{H-1,l_1} \end{bmatrix}

    What is the shape of \mathbf{o}_{l_1}?

  3. We project \mathbf{o}_{l_1} to a post-out-projection output vector via an out-projection matrix:

    \mathbf{x}_{l_1}^{out} = \mathbf{W}^O \mathbf{o}_{l_1} \in \Bbb R^{D_1} ,

    where

    \mathbf{W}^O = \begin{bmatrix} \mathbf{W}^O_0 & \mathbf{W}^O_1 & \cdots & \mathbf{W}^O_{H-1} \end{bmatrix}

    What are the shapes of \mathbf{W}^O_h, for each h \in \left\{ 0 , 1 , \cdots , H-1 \right\}, and of \mathbf{W}^O?

\color{green}{\text{### WRITE YOUR SOLUTION HERE ###}}

  1. The shape of \mathbf{o}_{h,l_1} is \left( D_v, \right).

  2. The shape of \mathbf{o}_{l_1} is \left( H \cdot D_v, \right).

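  3. For each head h, the shape of \mathbf{W}^O_h is \left( D_1, D_v \right), because \mathbf{W}^O \mathbf{o}_{l_1} = \sum_{h=0}^{H-1} \mathbf{W}^O_h \mathbf{o}_{h,l_1}, so each block maps \mathbf{o}_{h,l_1} \in \Bbb R^{D_v} to \Bbb R^{D_1}.

    Placing the H blocks side by side, the shape of \mathbf{W}^O is \left( D_1, H \cdot D_v \right).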
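
The shapes in this solution can be checked numerically. The following is a minimal sketch, not part of the official solution: it uses plain Python lists so that it needs no external library, the sizes H = 3, D_v = 4, D_1 = 5 are arbitrary values assumed for illustration, and the helper `matvec` and all variable names are hypothetical. It builds \mathbf{o}_{l_1} and \mathbf{W}^O from per-head pieces and confirms that the full projection equals the sum of the per-head projections.

```python
import random

# Hypothetical small sizes, chosen only for illustration.
H, D_v, D_1 = 3, 4, 5

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    assert all(len(row) == len(x) for row in W), "inner dimensions must match"
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

rng = random.Random(0)

# One pre-out-projection vector per head, each of shape (D_v,).
o_heads = [[rng.gauss(0, 1) for _ in range(D_v)] for _ in range(H)]

# Concatenate the heads along axis 0: shape (H * D_v,).
o = [value for o_h in o_heads for value in o_h]

# One block W^O_h per head, each of shape (D_1, D_v).
W_blocks = [
    [[rng.gauss(0, 1) for _ in range(D_v)] for _ in range(D_1)]
    for _ in range(H)
]

# Place the blocks side by side: shape (D_1, H * D_v).
W_O = [[w for h in range(H) for w in W_blocks[h][r]] for r in range(D_1)]

# Post-out-projection output: shape (D_1,).
x_out = matvec(W_O, o)

# The same output computed as a sum of per-head projections.
per_head = [matvec(W_blocks[h], o_heads[h]) for h in range(H)]
x_sum = [sum(per_head[h][r] for h in range(H)) for r in range(D_1)]

print("o:", (len(o),))
print("W_O:", (len(W_O), len(W_O[0])))
print("x_out:", (len(x_out),))
```

The inner-dimension assertion in `matvec` fails if a block \mathbf{W}^O_h is given any column count other than D_v, which is the constraint that fixes its shape.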

\color{red}{\text{""" END OF THIS PART """}}