2025 USA-NA-AIO Round 2, Problem 2, Part 3

Part 3 (10 points, non-coding task)

Define the function \text{Softmax}: \Bbb R^d \rightarrow \Bbb R^d, whose $i$th output value is

\text{Softmax}_i \left( \mathbf{z} \right) = \frac{\exp \left( z_i \right)}{\sum_{j=0}^{d-1} \exp \left( z_j \right)} .
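As a quick numerical check of this definition, here is a minimal NumPy sketch (the max-subtraction trick is an assumption for numerical stability; it leaves the result unchanged because the common factor cancels in the ratio):

```python
import numpy as np

def softmax(z):
    # Subtract the max before exponentiating for numerical stability;
    # exp(z_i - m) / sum_j exp(z_j - m) equals exp(z_i) / sum_j exp(z_j).
    e = np.exp(z - np.max(z))
    return e / e.sum()

# The outputs are positive and sum to 1, e.g. softmax([0, 0]) = [0.5, 0.5].
```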

For head h, the attention score from position l_1 in the attending sequence to position l_2 in the attended sequence is denoted \alpha_{h, l_1 l_2}.

We can write \alpha_{h, l_1 l_2} in the following form:

\alpha_{h, l_1 l_2} = \text{Softmax}_{l_2} \left( \color{red}{\boxed{???}} \right) .

What is the formula in the above red box (reasoning is not required)?

\color{green}{\text{### WRITE YOUR SOLUTION HERE ###}}

\alpha_{h, l_1 l_2} = \text{Softmax}_{l_2} \left( \frac{\mathbf{q}_{h, l_1}^\top \mathbf{K}_h^\top}{\sqrt{D_{qk}}} \right) ,

where

\mathbf{K}_h = \begin{bmatrix} \mathbf{k}_{h, 0}^\top \\ \mathbf{k}_{h, 1}^\top \\ \vdots \\ \mathbf{k}_{h, L_2-1}^\top \end{bmatrix} \in \Bbb R^{L_2 \times D_{qk}} .
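The boxed formula can be verified numerically. Below is a hedged NumPy sketch under assumed toy dimensions (L2 = 4, D_qk = 8, random q and K standing in for one head's query at position l_1 and its stacked key matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
L2, D_qk = 4, 8                   # assumed toy sequence length and head dim
q = rng.normal(size=D_qk)         # query vector q_{h, l_1}
K = rng.normal(size=(L2, D_qk))   # K_h: rows are the key vectors k_{h, 0..L2-1}

# Scaled dot-product logits: (q^T K^T)^T = K q, one logit per position l_2.
logits = K @ q / np.sqrt(D_qk)    # shape (L2,)

# Softmax over l_2 gives the attention scores alpha_{h, l_1 l_2}.
e = np.exp(logits - logits.max())
alpha = e / e.sum()
```

Note that the row vector \mathbf{q}_{h, l_1}^\top \mathbf{K}_h^\top in the boxed formula is the transpose of \mathbf{K}_h \mathbf{q}_{h, l_1}, which is why the code computes `K @ q`; the resulting scores are nonnegative and sum to 1 over l_2.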

\color{red}{\text{""" END OF THIS PART """}}