Part 4 (5 points, non-coding task)
At position l_1 in an attending sequence, for head h, the information extracted from attending to a being attended sequence is given by
We hereafter call \mathbf{o}_{h,l_1} a pre-out-projection output vector.
Do the following tasks.
- What is the shape of vector \mathbf{o}_{h,l_1}?
- We concatenate \left\{\mathbf{o}_{h,l_1} : h \in \left\{ 0, 1 , \cdots , H-1 \right\} \right\} along axis 0:
  \mathbf{o}_{l_1} = \begin{bmatrix} \mathbf{o}_{0,l_1} \\ \mathbf{o}_{1,l_1} \\ \vdots \\ \mathbf{o}_{H-1,l_1} \end{bmatrix}
  What is the shape of \mathbf{o}_{l_1}?
- We project \mathbf{o}_{l_1} to a post-out-projection output vector via an out-projection matrix:
  \mathbf{x}_{l_1}^{out} = \mathbf{W}^O \mathbf{o}_{l_1} \in \mathbb{R}^{D_1} ,
  where
  \mathbf{W}^O = \begin{bmatrix} \mathbf{W}^O_0 & \mathbf{W}^O_1 & \cdots & \mathbf{W}^O_{H-1} \end{bmatrix} .
  What are the shapes of \mathbf{W}^O_h, for each h \in \left\{ 0 , 1 , \cdots , H-1 \right\}, and of \mathbf{W}^O?
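
As an optional sanity check on the block structure above, the following minimal sketch (not part of the graded task) concatenates per-head vectors along axis 0, places the per-head blocks side by side, and confirms that the full projection equals the sum of the per-head projections. The sizes `H`, `D_HEAD`, `D_OUT`, the integer test values, and the helper `matvec` are all hypothetical and chosen only for illustration; they are not the assignment's symbols or expected answers.

```python
# Hypothetical sizes for illustration only (not the assignment's values):
# H heads, per-head vector length D_HEAD, projected output length D_OUT.
H, D_HEAD, D_OUT = 3, 2, 4


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


# One small integer vector per head, standing in for o_{h,l_1}.
heads = [[h * D_HEAD + i + 1 for i in range(D_HEAD)] for h in range(H)]

# Concatenation along axis 0: stack the head vectors end to end.
o = [x for head in heads for x in head]

# One block per head, standing in for W^O_h (D_OUT rows, D_HEAD columns).
blocks = [
    [[(h + 1) * (r + 1) + c for c in range(D_HEAD)] for r in range(D_OUT)]
    for h in range(H)
]

# W^O = [W^O_0  W^O_1 ... W^O_{H-1}]: blocks placed side by side,
# so row r of W^O is row r of every block joined together.
W = [[w for block in blocks for w in block[r]] for r in range(D_OUT)]

# Full projection versus the sum of per-head projections.
x_out = matvec(W, o)
per_head_sum = [
    sum(vals)
    for vals in zip(*(matvec(blocks[h], heads[h]) for h in range(H)))
]

print(len(o), len(W), len(W[0]))
print(x_out == per_head_sum)
```

The check works because multiplying a side-by-side block matrix by a stacked vector splits into one product per block, which is why the column count of each block must match the length of its head's vector.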