2024 IAIO Question 4.1

DALL-E uses a discrete VAE (Variational Autoencoder) to encode images into tokens and then generates images from these tokens using a Transformer architecture. Suppose the model uses a vocabulary of V=8192 discrete tokens. Each token is represented by an embedding vector of dimensionality d=512.

Calculate the total number of parameters in the embedding matrix used for encoding these tokens.

V \cdot d = 8192 \cdot 512 = 4,194,304.

The embedding matrix maps a one-hot vector of size 8192 to an embedding vector of size 512, which requires an 8192 \times 512 matrix with 2^{13}\cdot 2^9 = \boxed{2^{22}} = 4,194,304 parameters.
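
As a quick sanity check, the count can be reproduced in a few lines of Python. This is only a sketch: the helper names `embedding_param_count` and `embed_one_hot` are hypothetical, the matrix is assumed to have no bias term, and the toy 4-by-3 matrix merely illustrates that multiplying a one-hot vector by the matrix selects a single row.

```python
def embedding_param_count(vocab_size: int, embed_dim: int) -> int:
    """Entries in a vocab_size x embed_dim embedding matrix (assumes no bias)."""
    return vocab_size * embed_dim


def embed_one_hot(one_hot, matrix):
    """Multiply a one-hot row vector by the matrix, which selects one row."""
    rows, cols = len(matrix), len(matrix[0])
    return [sum(one_hot[i] * matrix[i][j] for i in range(rows)) for j in range(cols)]


# Values from the problem statement.
V, d = 8192, 512
total = embedding_param_count(V, d)
print(total)             # 4194304
print(total == 2 ** 22)  # True

# Toy illustration (hypothetical 4-token vocabulary, 3-dimensional embeddings).
toy_matrix = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    [10, 11, 12],
]
one_hot = [0, 0, 1, 0]  # token index 2
print(embed_one_hot(one_hot, toy_matrix))  # [7, 8, 9]
```

The one-hot product is why an embedding layer is usually implemented as a row lookup: the parameter count is the same either way, V \cdot d.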