Part 15 (5 points, coding task)
In this part, you are asked to define a loss function.
Let I_i and T_j be the embeddings of image i and text j, respectively, where text i is the text paired with image i. Let B be the batch size and \tau > 0 the temperature.
Then the loss function is defined as
L = \frac{1}{2}
\left(- \frac{1}{B} \sum_{i = 0}^{B-1} \log \frac{\exp \left( \text{SIM} \left( I_i, T_i \right) / \tau \right) }
{\sum_{j = 0}^{B-1} \exp \left( \text{SIM} \left( I_i, T_j \right) / \tau \right)}
- \frac{1}{B} \sum_{i = 0}^{B-1} \log \frac{\exp \left( \text{SIM} \left( I_i, T_i \right) / \tau \right)}
{\sum_{j = 0}^{B-1} \exp \left( \text{SIM} \left( I_j, T_i \right) / \tau \right)}
\right) ,
where
\text{SIM} \left( I_i, T_j \right)
= \frac{I_i^\top T_j}{\| I_i \|_2 \, \| T_j \|_2} .
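
As a reference point, the definition above can be sketched in plain Python. This is only an illustration under stated assumptions, not the expected solution: the function names (`cosine_similarity`, `log_sum_exp`, `symmetric_contrastive_loss`) are hypothetical, embeddings are taken to be lists of floats with nonzero norm, and the log-sum-exp shift is an added numerical-stability step that does not change the value of L.

```python
import math


def cosine_similarity(u, v):
    """SIM(u, v): dot product divided by the product of the L2 norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)  # assumes both norms are nonzero


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def symmetric_contrastive_loss(image_embeddings, text_embeddings, tau):
    """Average of the image-to-text and text-to-image terms of L."""
    B = len(image_embeddings)
    # logits[i][j] = SIM(I_i, T_j) / tau
    logits = [
        [cosine_similarity(I, T) / tau for T in text_embeddings]
        for I in image_embeddings
    ]
    # First term: for each image i, normalize over all texts j (row i).
    image_to_text = -sum(
        logits[i][i] - log_sum_exp(logits[i]) for i in range(B)
    ) / B
    # Second term: for each text i, normalize over all images j (column i).
    text_to_image = -sum(
        logits[i][i] - log_sum_exp([logits[j][i] for j in range(B)])
        for i in range(B)
    ) / B
    return 0.5 * (image_to_text + text_to_image)


# Two orthogonal image/text pairs with tau = 1: logits are [[1, 0], [0, 1]],
# so both terms equal log(1 + e) - 1, about 0.3133.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
loss = symmetric_contrastive_loss(images, texts, tau=1.0)
```

Each of the two terms is a softmax cross-entropy over the B x B matrix of scaled similarities, once along rows and once along columns, with target index i for row (or column) i. In a tensor framework the same computation is typically vectorized as a cross-entropy over this matrix and its transpose.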