Part 5 (10 points, coding task)
In this part, you are asked to build your own multi-head attention module that subclasses `nn.Module`.

- For simplicity, we ignore any masking. That is, each position in the attending sequence attends to all positions in the sequence being attended to.
- You do not need to worry about whether your code is efficient during autoregressive token generation when your module is used for inference in a GPT-like task. That is, if we use your code in a GPT-like task to autoregressively generate tokens, it is totally fine to repeatedly recompute the key and value at a given position rather than, more efficiently, storing them in a cache.
- The class name is `MyMHA`.
- Attributes:
  - `D_1`: Dimension of a hidden state/token in the attending sequence.
  - `D_2`: Dimension of a hidden state/token in the sequence being attended to.
  - `D_v`: Dimension of a value vector.
  - `D_qk`: Dimension of a query/key vector.
  - `H`: Number of heads.
  - `W_Q`: A linear module whose weight is the query-projection matrix. The shape should be consistent with your answer in Part 2. No bias.
  - `W_K`: A linear module whose weight is the key-projection matrix. The shape should be consistent with your answer in Part 2. No bias.
  - `W_V`: A linear module whose weight is the value-projection matrix. The shape should be consistent with your answer in Part 2. No bias.
  - `W_O`: A linear module whose weight is the out-projection matrix. The shape should be consistent with your answer in Part 4. No bias.
- Method `__init__`:
  - Inputs: `D_1`, `D_2`, `D_qk`, `D_v`, `H`
  - Outputs: None
  - What to do inside this method:
    - Initialize the attribute values (a sketch of one possible `__init__` follows this list).
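As a starting point, here is a minimal sketch of `__init__`. The projection shapes below are an assumption based on the common convention of packing all heads into one linear module (queries/keys of size `H * D_qk`, values of size `H * D_v`); your own shapes must follow from your answers in Parts 2 and 4.

```python
import torch
import torch.nn as nn

class MyMHA(nn.Module):
    def __init__(self, D_1, D_2, D_qk, D_v, H):
        super().__init__()
        # Store the dimensions and the number of heads as attributes.
        self.D_1, self.D_2 = D_1, D_2
        self.D_qk, self.D_v = D_qk, D_v
        self.H = H
        # Assumed convention: all H heads are packed into a single linear module per projection.
        self.W_Q = nn.Linear(D_1, H * D_qk, bias=False)  # queries come from the attending sequence
        self.W_K = nn.Linear(D_2, H * D_qk, bias=False)  # keys come from the sequence being attended to
        self.W_V = nn.Linear(D_2, H * D_v, bias=False)   # values come from the sequence being attended to
        self.W_O = nn.Linear(H * D_v, D_1, bias=False)   # out-projection back to dimension D_1
```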
- Method `forward`:
  - Inputs:
    - An attending sequence (tensor) with shape `(B, L_1, D_1)`
    - A sequence being attended to (tensor) with shape `(B, L_2, D_2)`
  - Outputs:
    - Post-out-projection outputs with shape `(B, L_1, D_1)`
  - What to do inside this method:
    - Compute the outputs (a sketch of one possible `forward` follows this list).
    - After each operation, add a comment on the tensor shape.
    - Do not use any loops.
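Continuing the sketch above, a possible `forward` under the same assumed head-packing convention (no masking, no key/value caching, no Python loops); the per-line shape comments are part of what the task asks for:

```python
    # This method goes inside the MyMHA class sketched above; the head-splitting
    # layout is an assumption and must match your Part 2 / Part 4 answers.
    def forward(self, x1, x2):
        # x1: attending sequence, (B, L_1, D_1); x2: sequence being attended to, (B, L_2, D_2)
        B, L_1, _ = x1.shape
        B, L_2, _ = x2.shape
        H, D_qk, D_v = self.H, self.D_qk, self.D_v

        Q = self.W_Q(x1)                                    # (B, L_1, H * D_qk)
        K = self.W_K(x2)                                    # (B, L_2, H * D_qk)
        V = self.W_V(x2)                                    # (B, L_2, H * D_v)

        # Split the packed projections into heads and move H next to the batch dimension.
        Q = Q.view(B, L_1, H, D_qk).transpose(1, 2)         # (B, H, L_1, D_qk)
        K = K.view(B, L_2, H, D_qk).transpose(1, 2)         # (B, H, L_2, D_qk)
        V = V.view(B, L_2, H, D_v).transpose(1, 2)          # (B, H, L_2, D_v)

        # Scaled dot-product attention scores; no masking, per the simplification above.
        scores = Q @ K.transpose(-2, -1) / D_qk ** 0.5      # (B, H, L_1, L_2)
        attn = scores.softmax(dim=-1)                       # (B, H, L_1, L_2)

        out = attn @ V                                      # (B, H, L_1, D_v)
        out = out.transpose(1, 2).reshape(B, L_1, H * D_v)  # (B, L_1, H * D_v)
        return self.W_O(out)                                # (B, L_1, D_1)
```

Once `forward` is added to the class, a quick shape check with hypothetical sizes might look like:

```python
mha = MyMHA(D_1=16, D_2=24, D_qk=8, D_v=8, H=4)
x1 = torch.randn(2, 5, 16)   # attending sequence, (B, L_1, D_1)
x2 = torch.randn(2, 7, 24)   # sequence being attended to, (B, L_2, D_2)
print(mha(x1, x2).shape)     # expected: torch.Size([2, 5, 16])
```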