Part 5 (10 points, coding task)
In this part, you are asked to build your own multi-head attention module that subclasses nn.Module.
For simplicity, we ignore any masking. That is, each position in the attending sequence attends to all positions in the attended-to sequence.

You do not need to worry about whether your code is efficient during autoregressive token generation when your module is used for inference in a GPT-like task. That is, if we use your code in a GPT-like task to autoregressively generate tokens, it is totally fine if you repeatedly recompute the same key and value at a given position rather than more efficiently storing them in a cache.

The class name is MyMHA.
Attributes:
- D_1: Dimension of a hidden state/token in the attending sequence.
- D_2: Dimension of a hidden state/token in the attended-to sequence.
- D_v: Dimension of a value vector.
- D_qk: Dimension of a query/key vector.
- H: Number of heads.
- W_Q: A linear module whose weight is the query-projection matrix. The shape should be consistent with your answer in Part 2. No bias.
- W_K: A linear module whose weight is the key-projection matrix. The shape should be consistent with your answer in Part 2. No bias.
- W_V: A linear module whose weight is the value-projection matrix. The shape should be consistent with your answer in Part 2. No bias.
- W_O: A linear module whose weight is the out-projection matrix. The shape should be consistent with your answer in Part 4. No bias.
Method __init__:
- Inputs:
  - D_1
  - D_2
  - D_qk
  - D_v
  - H
- Outputs:
  - None
- What to do inside this method:
  - Initialize attribute values.
Method forward:
- Inputs:
  - An attending sequence (tensor) with shape (B, L_1, D_1)
  - An attended-to sequence (tensor) with shape (B, L_2, D_2)
- Outputs:
  - Post-out-projection outputs with shape (B, L_1, D_1)
- What to do inside this method:
  - Compute the outputs.
  - After each operation, add a comment on the tensor shape.
  - Do not use any loop.
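For reference, here is one possible sketch of MyMHA that satisfies the spec above. Since the answers to Parts 2 and 4 are not shown here, the projection shapes below (W_Q: D_1 → H·D_qk, W_K/W_V: D_2 → H·D_qk / H·D_v, W_O: H·D_v → D_1) are an assumption based on the standard multi-head attention formulation; adjust them if your Part 2/Part 4 answers differ.

```python
import math
import torch
import torch.nn as nn


class MyMHA(nn.Module):
    def __init__(self, D_1, D_2, D_qk, D_v, H):
        super().__init__()
        # Initialize attribute values.
        self.D_1, self.D_2, self.D_qk, self.D_v, self.H = D_1, D_2, D_qk, D_v, H
        # All four projections are bias-free, as the spec requires.
        # NOTE: these in/out dimensions are assumptions; they follow the
        # standard MHA convention, not necessarily your Part 2/4 answers.
        self.W_Q = nn.Linear(D_1, H * D_qk, bias=False)
        self.W_K = nn.Linear(D_2, H * D_qk, bias=False)
        self.W_V = nn.Linear(D_2, H * D_v, bias=False)
        self.W_O = nn.Linear(H * D_v, D_1, bias=False)

    def forward(self, x, y):
        # x: attending sequence, (B, L_1, D_1)
        # y: attended-to sequence, (B, L_2, D_2)
        B, L_1, _ = x.shape
        _, L_2, _ = y.shape
        Q = self.W_Q(x).view(B, L_1, self.H, self.D_qk).transpose(1, 2)  # (B, H, L_1, D_qk)
        K = self.W_K(y).view(B, L_2, self.H, self.D_qk).transpose(1, 2)  # (B, H, L_2, D_qk)
        V = self.W_V(y).view(B, L_2, self.H, self.D_v).transpose(1, 2)   # (B, H, L_2, D_v)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(self.D_qk)          # (B, H, L_1, L_2)
        attn = scores.softmax(dim=-1)                                    # (B, H, L_1, L_2)
        ctx = attn @ V                                                   # (B, H, L_1, D_v)
        ctx = ctx.transpose(1, 2).reshape(B, L_1, self.H * self.D_v)     # (B, L_1, H*D_v)
        return self.W_O(ctx)                                             # (B, L_1, D_1)
```

Note that no loops are used: the heads are handled by reshaping to a (B, H, L, d) layout and letting batched matrix multiplication broadcast over the head dimension, and every operation carries a shape comment as required.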