Part 14 (5 points, non-coding task)
In generative AI, such as GPT, we autoregressively generate tokens. For a given position l, the key and value at this position, \mathbf{k}_l and \mathbf{v}_l, are repeatedly used when generating tokens for positions l' > l.
Therefore, the values of \mathbf{k}_l and \mathbf{v}_l are typically stored in a cache (no need to revise your code from earlier parts if it does not support this). This storage is called the KV-cache.
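As an illustration only (this sketch is not part of the assignment code, and the names `attend`, `Wq`, `Wk`, `Wv` are made up for the example), the following single-head attention step shows how \mathbf{k}_l and \mathbf{v}_l are computed once, appended to the cache, and then reused at every later position:

```python
import numpy as np

D = 8                                   # model dimension (illustrative value)
rng = np.random.default_rng(0)
# Hypothetical projection matrices for queries, keys, and values.
Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))

k_cache, v_cache = [], []               # grows by one entry per generated position

def attend(x_l):
    """Process one new token embedding x_l, reusing all cached keys/values."""
    k_cache.append(Wk @ x_l)            # k_l is computed once and stored
    v_cache.append(Wv @ x_l)            # v_l likewise
    q = Wq @ x_l
    K = np.stack(k_cache)               # (l, D): keys for all positions <= l
    V = np.stack(v_cache)               # (l, D): values for all positions <= l
    scores = K @ q / np.sqrt(D)
    w = np.exp(scores - scores.max())   # numerically stable softmax
    w /= w.sum()
    return w @ V                        # attention output for the new position

for _ in range(4):                      # four autoregressive decoding steps
    out = attend(rng.standard_normal(D))

assert len(k_cache) == 4                # one (k_l, v_l) pair cached per position
```

Note that without the cache, every step would have to recompute \mathbf{k}_{l'} and \mathbf{v}_{l'} for all earlier positions l' from scratch.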
Do the following tasks to determine the size of the KV-cache in different models during autoregressive inference (reasoning is required):
- In MHA, the KV-cache at each position is 2D. Explain why.
- In MLA, what is the size of the KV-cache at each position?