At each position l_1 in the attending sequence, we concatenate the queries \left\{ \mathbf{q}_{l_1,h} : h \in \left\{ 0, 1, \cdots , H-1 \right\} \right\} along axis 0 to get \mathbf{q}_{l_1}.
At each position l_2 in the attended sequence, for \mathbf{m} \in \left\{ \mathbf{k}, \mathbf{v} \right\} we concatenate the keys/values \left\{ \mathbf{m}_{l_2,h} : h \in \left\{ 0, 1, \cdots , H-1 \right\} \right\} along axis 0 to get \mathbf{m}_{l_2}.
In part 1, we found the per-head shapes of W^Q, W^K and W^V to be D_{qk}\times D_1, D_{qk}\times D_2 and D_{v}\times D_2. Since the heads are now concatenated along axis 0, we simply multiply the number of rows by H. This yields \boxed{HD_{qk}\times D_1}, \boxed{HD_{qk}\times D_2} and \boxed{HD_{v}\times D_2}.
\mathbf{q}_{l_1} is the concatenation of the H per-head query vectors, each of shape D_{qk}\times 1. So we get \boxed{HD_{qk}\times 1}.
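The concatenation above can be checked numerically; a minimal sketch with NumPy, where H and D_{qk} are arbitrary illustrative sizes:

```python
import numpy as np

# Hypothetical sizes for illustration only.
H, D_qk = 4, 8

# One query vector of shape (D_qk, 1) per head.
per_head_queries = [np.random.randn(D_qk, 1) for _ in range(H)]

# Concatenate along axis 0 to form q_{l_1}.
q_l1 = np.concatenate(per_head_queries, axis=0)

# The result has shape (H * D_qk, 1), i.e. HD_qk x 1.
assert q_l1.shape == (H * D_qk, 1)
```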
\mathbf{q}_{l_1} is obtained by multiplying W^Q with the input \mathbf{x}_{l_1}.
As in question 2, we have \boxed{HD_{qk}\times 1} and \boxed{HD_{v}\times 1}.
They are obtained by multiplying W^K and W^V with \mathbf{y}_{l_2}, respectively.
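The projections above can be sketched end to end. This is a shape check only, assuming hypothetical sizes and random concatenated weight matrices of the shapes derived in question 1:

```python
import numpy as np

# Hypothetical sizes for illustration only.
H, D_qk, D_v, D_1, D_2 = 4, 8, 16, 32, 24

# Concatenated (stacked-over-heads) projection matrices,
# with the shapes from question 1.
W_Q = np.random.randn(H * D_qk, D_1)
W_K = np.random.randn(H * D_qk, D_2)
W_V = np.random.randn(H * D_v, D_2)

x_l1 = np.random.randn(D_1, 1)  # input at attending position l_1
y_l2 = np.random.randn(D_2, 1)  # input at attended position l_2

q_l1 = W_Q @ x_l1
k_l2 = W_K @ y_l2
v_l2 = W_V @ y_l2

# Shapes match questions 2 and 3.
assert q_l1.shape == (H * D_qk, 1)
assert k_l2.shape == (H * D_qk, 1)
assert v_l2.shape == (H * D_v, 1)
```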
Should we write these down as pure math or as Python shapes? For math, should it be A \in \mathbb{R}^{m \times n} instead of (m, n), and for a vector \mathbf{v} \in \mathbb{R}^n instead of (n,)?