Shibani Santurkar
2 years
We perform an apples-to-apples comparison of CLIP with a matched image-only approach (a variant of SimCLR). We train both with the same loss function, architecture, training data, data augmentations, etc., to isolate the effect of language (caption) supervision.
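To make the matched setup concrete, here is a minimal sketch of how one contrastive loss could serve both models, so that only the source of the positive pair changes. It assumes a CLIP-style symmetric InfoNCE objective; the exact loss used in the comparison may differ. The name `info_nce`, the temperature value, and the toy embeddings are all hypothetical and stand in for real encoder outputs.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def info_nce(a, b, temperature=0.1):
    """Symmetric contrastive loss where a[i] and b[i] form a positive pair.

    Assumption: this is a generic CLIP-style objective, not necessarily
    the exact loss used in the comparison described above.
    """
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    n = len(a)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], b[j])) / temperature
               for j in range(n)] for i in range(n)]
    # Cross-entropy with the diagonal as the correct class, in both directions.
    loss_ab = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    logits_t = [list(col) for col in zip(*logits)]
    loss_ba = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_ab + loss_ba)

# Toy embeddings (hypothetical values, not real model outputs).
image_emb = [[1.0, 0.0], [0.0, 1.0]]
caption_emb = [[0.9, 0.1], [0.2, 0.8]]    # language supervision: matching captions
augmented_emb = [[0.8, 0.2], [0.1, 0.9]]  # image-only: second augmented view

language_loss = info_nce(image_emb, caption_emb)      # CLIP-style pairing
image_only_loss = info_nce(image_emb, augmented_emb)  # SimCLR-style pairing
```

In this sketch the two training setups call the same function with the same first argument; the only thing that changes is whether the second view of each example comes from a caption or from another augmentation of the image.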