2025 USA-NA-AIO Round 2, Problem 3, Part 15

Part 15 (5 points, coding task)

In this part, you are asked to define a loss function.

Let I_i and T_j be image i's embedding and text j's embedding, respectively. Let B be the batch size. Let \tau be the temperature.

Then the loss function is defined as

L = \frac{1}{2} \left(- \frac{1}{B} \sum_{i = 0}^{B-1} \log \frac{\exp \left( \text{SIM} \left( I_i, T_i \right) / \tau \right) } {\sum_{j = 0}^{B-1} \exp \left( \text{SIM} \left( I_i, T_j \right) / \tau \right)} - \frac{1}{B} \sum_{i = 0}^{B-1} \log \frac{\exp \left( \text{SIM} \left( I_i, T_i \right) / \tau \right)} {\sum_{j = 0}^{B-1} \exp \left( \text{SIM} \left( I_j, T_i \right) / \tau \right)} \right) ,

where

\text{SIM} \left( I_i, T_j \right) = \frac{I_i^\top T_j}{|| I_i ||_2 || T_j ||_2} .
### WRITE YOUR SOLUTION HERE ###

def CLIP_loss_fn(image_embedding, text_embedding):
    # L2-normalize each embedding so a dot product equals the cosine similarity SIM.
    image_embedding = image_embedding / torch.norm(image_embedding, dim = -1, keepdim = True)
    text_embedding = text_embedding / torch.norm(text_embedding, dim = -1, keepdim = True)
    # sim[i, j] = SIM(I_i, T_j), shape (B, B).
    sim = image_embedding @ text_embedding.T
    # Scale by the temperature tau = exp(log_tau); model_CLIP is defined in an earlier part.
    logits = sim / torch.exp(model_CLIP.log_tau)
    # Average the text-to-image term (softmax over images, dim = 0) and the
    # image-to-text term (softmax over texts, dim = 1), each evaluated on the diagonal.
    loss = .5 * (-torch.mean(torch.diagonal(torch.log_softmax(logits, dim = 0))) \
                 -torch.mean(torch.diagonal(torch.log_softmax(logits, dim = 1))))
    return loss
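As a sanity check, the same symmetric contrastive loss can be sketched in plain NumPy (the function name `clip_loss_numpy` and the explicit `tau` argument are our own choices for illustration, not part of the task's API):

```python
import numpy as np

def clip_loss_numpy(image_emb, text_emb, tau):
    # L2-normalize rows so I @ T.T gives the cosine-similarity matrix sim[i, j] = SIM(I_i, T_j).
    I = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    T = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = I @ T.T / tau

    def log_softmax(x, axis):
        # Numerically stable log-softmax along the given axis.
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    # Diagonal entries are the matched (I_i, T_i) pairs; average both directions.
    text_to_image = np.diagonal(log_softmax(logits, axis=0))
    image_to_text = np.diagonal(log_softmax(logits, axis=1))
    return 0.5 * (-text_to_image.mean() - image_to_text.mean())
```

For perfectly matched, mutually orthogonal embeddings and a small temperature, the loss is close to 0; it is always nonnegative, and it is symmetric in the two embedding sets.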

""" END OF THIS PART """