2024 IAIO Question 4.2

The feed-forward network within each Transformer layer consists of two linear transformations: one with input dimension d and output dimension f=2048, and another with input dimension f and output dimension d. Compute the total number of parameters for the feed-forward network in a single Transformer layer.

In this part, bias terms are ignored, so only the weight matrices are counted.

The first linear transformation maps dimension d to dimension f, so its weight matrix has df parameters.

The second linear transformation maps dimension f back to dimension d, so its weight matrix also has df parameters.

Therefore, with d = 512 and f = 2048, the total number of parameters in the feed-forward network of a single Transformer layer is

\begin{align*} df + df & = 2 df \\ & = 2 \times 512 \times 2048 \\ & = 2{,}097{,}152 . \end{align*}
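The count can be checked numerically with a minimal sketch. The function name `ffn_param_count` and its `bias` flag are hypothetical, introduced only for illustration. The value d = 512 is an assumption: it is not stated in this part, but it is the value consistent with f = 2048 and the total of 2,097,152.

```python
def ffn_param_count(d: int, f: int, bias: bool = False) -> int:
    """Count the parameters of a two-layer feed-forward block (d -> f -> d)."""
    first = d * f    # weights of the d -> f transformation
    second = f * d   # weights of the f -> d transformation
    total = first + second
    if bias:
        total += f + d  # one bias per output unit of each transformation
    return total


d = 512   # assumed model dimension, inferred from the stated total
f = 2048  # hidden dimension given in the question

print(ffn_param_count(d, f))  # 2097152
```

Setting `bias=True` would add f + d further parameters, which this part of the question deliberately leaves out.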