Part 5 (5 points, non-coding task)
Note that our final goal is to build a CLIP neural network. For the image data, we will use Vision Transformers (ViT) to extract image embeddings.
With the above high-level information, please explain the reasons behind the following choices that you made in Part 4.
- Why is the channel dimension placed ahead of the height and width dimensions?
- Why are the sizes of all images normalized to (224, 224)?
- Why is each pixel value normalized to between -1 and 1?
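
For reference, the three preprocessing choices in these questions can be sketched in a few lines. This is only a minimal illustration under stated assumptions, not the Part 4 solution: the Part 4 code is not shown here, so the helper name `preprocess`, the use of NumPy, the nearest-neighbor resizing, and the random input image are all hypothetical stand-ins.

```python
import numpy as np


def preprocess(image_hwc: np.ndarray, size: int = 224) -> np.ndarray:
    """Hypothetical helper: uint8 (H, W, C) image -> float32 (C, size, size) in [-1, 1]."""
    h, w, _ = image_hwc.shape

    # Nearest-neighbor resize (an assumption; Part 4 may use another method):
    # pick one source row and column for every target row and column.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image_hwc[rows[:, None], cols[None, :], :]  # (size, size, C)

    # Map pixel values from [0, 255] to [-1, 1].
    scaled = resized.astype(np.float32) / 127.5 - 1.0

    # Move the channel axis first: (H, W, C) -> (C, H, W).
    return np.transpose(scaled, (2, 0, 1))


# Random stand-in for a decoded RGB image of arbitrary size.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
tensor = preprocess(image)
print(tensor.shape, tensor.dtype)
```

Every image leaves this function with the same shape and value range, whatever its original resolution, which is the property the questions above ask you to justify.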