2025 USA-NA-AIO Round 2, Problem 3, Part 5

USAAIO · May 14, 2025, 10:51pm

Part 5 (5 points, non-coding task)

Note that our final goal is to build a CLIP neural network. For the image data, we will use Vision Transformers (ViT) to extract image embeddings.

With the above high-level information, please explain the reasons behind the following things that you did in Part 4.

USAAIO · May 14, 2025, 10:51pm

\color{green}{\text{### WRITE YOUR SOLUTION HERE ###}}

ViT expects its input in channel-first layout, that is, the channel dimension must come before the height and width dimensions.
The ViT model requires its input to have this dimension.
The ViT model requires its input values to fall within this range.
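
As an illustration of the first and third points, here is a minimal NumPy sketch. It is not the Part 4 code: the helper names `to_channel_first` and `scale_to_unit_range`, the 224 x 224 image size, and the [0, 1] target range are all assumptions made for the example.

```python
import numpy as np

def to_channel_first(image_hwc: np.ndarray) -> np.ndarray:
    """Reorder an (H, W, C) image into (C, H, W) layout."""
    return np.transpose(image_hwc, (2, 0, 1))

def scale_to_unit_range(image: np.ndarray) -> np.ndarray:
    """Map uint8 pixel values in [0, 255] to floats in [0, 1] (assumed target range)."""
    return image.astype(np.float32) / 255.0

# Hypothetical 224 x 224 RGB image stored channel-last, the layout most image loaders return.
rng = np.random.default_rng(0)
image_hwc = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)

image_chw = scale_to_unit_range(to_channel_first(image_hwc))
print(image_chw.shape)  # (3, 224, 224)
```

The transpose only changes the axis order, so each channel plane keeps exactly the pixel values it had in the channel-last image.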

\color{red}{\text{""" END OF THIS PART """}}
