2025 USA-NA-AIO Round 2, Problem 2, Part 14

Part 14 (5 points, non-coding task)

In generative AI models such as GPT, tokens are generated autoregressively. For a given position l, the key and value at this position, \mathbf{k}_l and \mathbf{v}_l, are repeatedly reused when generating tokens at positions l' > l.

Therefore, the values of \mathbf{k}_l and \mathbf{v}_l are typically stored in a cache (no need to revise your code in earlier parts if your code does not support this). We call this storage the kv-cache.
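To make the reuse pattern concrete, here is a minimal sketch of a kv-cache during autoregressive decoding. The dimension D, the random projections, and the single-head setup are illustrative assumptions, not part of the problem statement:

```python
import numpy as np

D = 8                        # assumed per-position key/value dimension
rng = np.random.default_rng(0)

k_cache, v_cache = [], []    # one entry per generated position

def attend(q, k_cache, v_cache):
    """Attention of query q over all cached positions l' <= current."""
    K = np.stack(k_cache)              # (L, D)
    V = np.stack(v_cache)              # (L, D)
    scores = K @ q / np.sqrt(D)        # (L,)
    w = np.exp(scores - scores.max())  # softmax over cached positions
    w /= w.sum()
    return w @ V                       # (D,)

for step in range(4):
    q = rng.standard_normal(D)
    # k_l and v_l are computed ONCE at position l and stored;
    # every later step reuses them from the cache.
    k_cache.append(rng.standard_normal(D))
    v_cache.append(rng.standard_normal(D))
    out = attend(q, k_cache, v_cache)
```

Each cached position holds one D-dimensional key and one D-dimensional value, which is exactly the 2D count asked about in task 1.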

Do the following tasks to compute the kv-cache in different models during autoregressive inference (reasoning is required):

  1. In MHA, the kv-cache at each position is 2D. Explain why.

  2. In MLA, what is the kv-cache at each position?

\color{green}{\text{### WRITE YOUR SOLUTION HERE ###}}

  1. In MHA, \mathbf{k}_l, \mathbf{v}_l \in \Bbb R^D, so caching both vectors requires D + D entries per position. Therefore, the kv-cache at each position is \boxed{2D}.

  2. In MLA, because \mathbf{W}^{\mathbf{DKV}} \in \Bbb R^{r \times D}, the compressed latents satisfy \mathbf{\hat k}_l, \mathbf{\hat v}_l \in \Bbb R^r.

    In addition, because \mathbf{\hat k}_l = \mathbf{\hat v}_l (both equal the shared down-projected latent), only a single r-dimensional vector needs to be stored. Therefore, the kv-cache at each position is \boxed{r}.
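The two answers can be compared numerically. The concrete values of D and r below are assumptions chosen only for illustration; the document itself fixes only the symbolic counts 2D and r:

```python
# Per-position kv-cache size (in stored scalars), per the answers above.
# D = per-position key/value dimension, r = MLA latent rank (assumed values).
D, r = 4096, 512

mha_cache_per_pos = 2 * D   # MHA: k_l and v_l, each in R^D
mla_cache_per_pos = r       # MLA: one shared latent in R^r (k_hat_l = v_hat_l)

print(mha_cache_per_pos)                       # 8192
print(mla_cache_per_pos)                       # 512
print(mha_cache_per_pos / mla_cache_per_pos)   # 16.0
```

With these assumed sizes, MLA stores 16x fewer scalars per position than MHA, which is the practical motivation for compressing keys and values into a shared latent.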

\color{red}{\text{""" END OF THIS PART """}}