Part 10 (5 points, non-coding task)
In this part, you are asked to answer some questions about a CLIP model that you will build in the next part.
- Write your answers in the text cell below.
- To get answers, you may need to run experimental code to better understand the ViT and BERT models.
- We only grade your answers in the text cell.
- Image encoder
  - Define `model_image = ViTModel.from_pretrained('google/vit-base-patch16-224')`. We use all blocks except the last pooler layer. That is, this ViT model has two outputs, with key names `last_hidden_state` and `pooler_output`. You should take the value associated with the key `last_hidden_state`.
  - From the last hidden state, we project the vector at position 0 to a latent space with dimension `embedding_size` (e.g., 512). The output is called the image embedding.
- Text encoder
  - Define `model_text = BertModel.from_pretrained('bert-base-uncased')`. We use all blocks except the last pooler layer. That is, this BERT model has two outputs, with key names `last_hidden_state` and `pooler_output`. You should take the value associated with the key `last_hidden_state`.
  - From the last hidden state, we project the vector at position 0 to a latent space with dimension `embedding_size` (e.g., 512). The output is called the text embedding.
- Answer the following questions. (Reasoning is required only for Question 3.)
  1. Let `image_batch` have shape `(B,3,224,224)`. What is the shape of `model_image(image_batch)['last_hidden_state']`?
  2. Let `token_id_batch` and `attention_mask_batch` have shape `(B,L)`. What is the shape of `model_text(input_ids = token_id_batch, attention_mask = attention_mask_batch)['last_hidden_state']`?
  3. For both the image encoder and the text encoder, we project the last hidden state at position 0 to a latent space with the same dimension `embedding_size`.
     3.1. Why do we add this additional out-projection layer?
     3.2. Why is this layer added on position 0 only?
     3.3. Why are the output dimensions of these two encoders the same?
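
If it helps to experiment with the projection step described for both encoders, the following is a minimal, framework-agnostic NumPy sketch of "project position 0 to a latent space". It is not the assignment's reference implementation. The function name `project_position_zero`, the random weights, and all sizes (sequence lengths 5 and 7, hidden sizes 8 and 6, `embedding_size = 4`) are hypothetical toy values chosen for illustration and are deliberately not the real dimensions of the ViT or BERT models. In an actual model, the weight and bias would be the parameters of a learnable linear layer.

```python
import numpy as np


def project_position_zero(last_hidden_state, weight, bias):
    """Linearly project the position-0 vector of every sequence in a batch.

    last_hidden_state: array of shape (batch, seq_len, hidden_size)
    weight:            array of shape (hidden_size, embedding_size)
    bias:              array of shape (embedding_size,)
    Returns an array of shape (batch, embedding_size).
    """
    first_position = last_hidden_state[:, 0, :]  # (batch, hidden_size)
    return first_position @ weight + bias


rng = np.random.default_rng(0)
batch, embedding_size = 2, 4  # toy values, not the assignment's sizes

# Random stand-ins for the encoders' last hidden states (toy shapes).
image_hidden = rng.normal(size=(batch, 5, 8))
text_hidden = rng.normal(size=(batch, 7, 6))

# One projection per encoder; each maps its own hidden size to embedding_size.
w_image, b_image = rng.normal(size=(8, embedding_size)), np.zeros(embedding_size)
w_text, b_text = rng.normal(size=(6, embedding_size)), np.zeros(embedding_size)

image_embedding = project_position_zero(image_hidden, w_image, b_image)
text_embedding = project_position_zero(text_hidden, w_text, b_text)

print(image_embedding.shape, text_embedding.shape)
```

Swapping the random arrays for the real `last_hidden_state` tensors (converted to arrays) is one way to check your understanding of the shapes asked about above.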