Part 10 (5 points, non-coding task)
In this part, you are asked to answer some questions about a CLIP model that you will build in the next part.

- Write your answers in the text cell below.
- To get answers, you may need to run experimental code to better understand the ViT and BERT models (for example, along the lines of the sketches given below).
- We only grade your answers in the text cell.
Image encoder

- Define `model_image = ViTModel.from_pretrained('google/vit-base-patch16-224')`. We use all blocks except the last pooler layer. That is, this ViT model produces two outputs, with key names `last_hidden_state` and `pooler_output`. You should take the value associated with the key `last_hidden_state`.
- From the last hidden state, we project the vector at position 0 into a latent space of dimension `embedding_size` (e.g., 512). The output is called the image embedding (a minimal sketch follows this list).
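A minimal sketch of this image-encoder path, assuming PyTorch and the Hugging Face `transformers` library; the projection layer `image_proj`, the dummy batch, and `embedding_size = 512` are illustrative choices, not part of the required interface.

```python
import torch
import torch.nn as nn
from transformers import ViTModel

embedding_size = 512  # example latent dimension

model_image = ViTModel.from_pretrained('google/vit-base-patch16-224')

# Linear projection applied to position 0 of the last hidden state.
# (`image_proj` is a hypothetical name; the input width is read from the config.)
image_proj = nn.Linear(model_image.config.hidden_size, embedding_size)

image_batch = torch.randn(4, 3, 224, 224)  # dummy batch with B = 4
with torch.no_grad():
    outputs = model_image(image_batch)
    last_hidden_state = outputs['last_hidden_state']       # ignore 'pooler_output'
    image_embedding = image_proj(last_hidden_state[:, 0])  # position 0 only

print(last_hidden_state.shape, image_embedding.shape)
```

Printing the shapes here is also a quick way to check your answer to Question 1 below.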
Text encoder

- Define `model_text = BertModel.from_pretrained('bert-base-uncased')`. We use all blocks except the last pooler layer. That is, this BERT model produces two outputs, with key names `last_hidden_state` and `pooler_output`. You should take the value associated with the key `last_hidden_state`.
- From the last hidden state, we project the vector at position 0 into a latent space of dimension `embedding_size` (e.g., 512). The output is called the text embedding (a minimal sketch follows this list).
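Likewise, a minimal sketch of the text-encoder path, again assuming PyTorch and `transformers`; the projection layer `text_proj`, the tokenizer, and the example sentences are illustrative.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

embedding_size = 512  # example latent dimension

model_text = BertModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Linear projection applied to position 0 ([CLS]) of the last hidden state.
# (`text_proj` is a hypothetical name; the input width is read from the config.)
text_proj = nn.Linear(model_text.config.hidden_size, embedding_size)

batch = tokenizer(['a photo of a dog', 'a photo of a cat'],
                  padding=True, return_tensors='pt')
with torch.no_grad():
    outputs = model_text(input_ids=batch['input_ids'],
                         attention_mask=batch['attention_mask'])
    last_hidden_state = outputs['last_hidden_state']      # ignore 'pooler_output'
    text_embedding = text_proj(last_hidden_state[:, 0])   # position 0 only

print(last_hidden_state.shape, text_embedding.shape)
```

Printing the shapes here is likewise a quick way to check your answer to Question 2 below.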
Answer the following questions. (Reasoning is required only for Question 3.)

1. Let `image_batch` have shape `(B, 3, 224, 224)`. What is the shape of `model_image(image_batch)['last_hidden_state']`?
2. Let `token_id_batch` and `attention_mask_batch` each have shape `(B, L)`. What is the shape of `model_text(input_ids=token_id_batch, attention_mask=attention_mask_batch)['last_hidden_state']`?
3. For both the image encoder and the text encoder, we project the last hidden state at position 0 into a latent space with the same dimension `embedding_size`.
   - 3.1. Why do we add this additional output-projection layer?
   - 3.2. Why is this layer applied at position 0 only?
   - 3.3. Why are the output dimensions of the two encoders the same?