The feed-forward network within each Transformer layer consists of two linear transformations: one with input dimension $d$ and output dimension $f = 2048$, and another with input dimension $f$ and output dimension $d$. Compute the total number of parameters for the feed-forward network in a single Transformer layer.
In this part, we ignore the bias terms and count only the weights.
The first linear layer maps dimension $d$ to dimension $f$, so it has $df$ parameters (weights).
The second linear layer maps dimension $f$ back to dimension $d$, so it also has $df$ parameters (weights).
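Equivalently, the count can be read off the shapes of the two weight matrices. The symbols $W_1$ and $W_2$ are illustrative notation introduced here, not part of the problem statement:
\begin{align*}
W_1 &: \; d \times f \text{ matrix, with } df \text{ entries}, \\
W_2 &: \; f \times d \text{ matrix, with } df \text{ entries}.
\end{align*}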
Therefore, the total number of parameters in the feed-forward network of a single Transformer layer is
\begin{align*}
df + df
& = 2 df \\
& = 2 \cdot 512 \cdot 2048 \\
& = 2{,}097{,}152 .
\end{align*}
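As a quick sanity check, assume $d = 512$, the value implied by the total above since $2{,}097{,}152 / (2 \cdot 2048) = 512$. Every factor is then a power of two:
\begin{align*}
2 df = 2^{1} \cdot 2^{9} \cdot 2^{11} = 2^{21} = 2{,}097{,}152 .
\end{align*}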