A Simple Framework for Contrastive Learning of Visual Representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton. International Conference on Machine Learning (2020).
Why is this work important?
The contrastive learning idea1 (making the representations of different views of the same image agree with each other) is not new; it dates back to a Becker & Hinton paper2 from 1992. But previously proposed contrastive learning methods required specialized architectures or a memory bank.
This new framework removes that requirement. The authors spell out the elements necessary for a simple yet successful contrastive learning procedure.
Why is this work different?
The SimCLR algorithm is composed of three steps:
- transformation,
- representation,
- projection.
The latter two steps are carried out by neural networks: a representation network (the encoder) and a projection head. The projection head is typically a small network of one or two layers, linear or non-linear.
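As a rough illustration of the representation and projection steps, here is a minimal PyTorch sketch assuming a ResNet-50 encoder and a 128-dimensional projection output; the class name `SimCLRModel` and the exact dimensions are illustrative choices, not the authors' code.

```python
import torch.nn as nn
from torchvision.models import resnet50

class SimCLRModel(nn.Module):
    """Encoder f(.) plus a small non-linear projection head g(.)."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = resnet50()                     # any encoder works; ResNet-50 is a common choice
        feat_dim = backbone.fc.in_features        # 2048 for ResNet-50
        backbone.fc = nn.Identity()               # drop the classifier; keep h = f(x)
        self.encoder = backbone
        self.projection_head = nn.Sequential(     # g(h): one hidden layer with a ReLU
            nn.Linear(feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)                       # representation reused downstream
        z = self.projection_head(h)               # projection fed to the contrastive loss
        return h, z
```

The forward pass returns both h and z: z feeds the contrastive loss during pre-training, while h is what gets reused for downstream tasks (as discussed in the results below).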
How does it work?
In this contrastive learning setup, an image is transformed in two different ways. The two resulting data points are called a positive pair. Each is fed to the same neural network (composed of a representation network and a projection head) that extracts a feature vector for it.
We want the extracted representations of the two transformed images to agree with each other. The contrastive learning task uses a loss function that maximizes the "agreement" between these two feature vectors while distinguishing them from the vectors of other images in the batch (the negatives).
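To make the notion of "agreement" concrete, below is a minimal PyTorch sketch of the NT-Xent loss that the paper ends up favoring (see the results section). The function name and the default temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over a batch of N positive pairs (z1[i], z2[i])."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # l2-normalize the 2N projections
    sim = z @ z.t() / temperature                        # scaled cosine similarities, shape (2N, 2N)
    sim.fill_diagonal_(float('-inf'))                    # a sample is never contrasted with itself
    # the positive of sample i is sample i + N, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```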
These representations can then be used for linear classification, or the trained network can serve as an initialization for fine-tuning or transfer learning.
The authors try out many different conditions, including combinations of data augmentations, architectures, and loss functions.
The goal of these experiments is to determine which conditions lead to the best representations for a subsequent supervised learning task. They use a "linear evaluation protocol": a linear classifier is trained directly on the frozen representations extracted under each setting. The better the final classification performance, the better the representation is considered to be.
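A sketch of what this linear evaluation might look like in PyTorch, assuming `model` is a pretrained encoder like the earlier sketch and `train_loader` yields labeled images; both names, the feature dimension, and the optimizer settings are hypothetical.

```python
import torch
import torch.nn as nn

# `model` (a pretrained SimCLRModel) and `train_loader` (images, labels) are assumed to exist.
linear_probe = nn.Linear(2048, 1000)   # a single linear layer on top of the frozen 2048-d features
optimizer = torch.optim.SGD(linear_probe.parameters(), lr=0.1, momentum=0.9)

model.eval()                            # the encoder stays frozen
for images, labels in train_loader:
    with torch.no_grad():
        h, _ = model(images)            # use h (before the projection head), per Section 4.2
    logits = linear_probe(h)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```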
What are the results?
Learning representations in the unsupervised setting benefits from3:
- Composing augmentations, particularly the combination of random cropping and random color distortion (a sketch of this composition is shown after this list). Unsupervised learning benefits more from this than its supervised counterpart.
- Augmentations can replace the complex architecture designs previously used in contrastive learning. For example, random cropping of an image can easily provide adjacent views as well as global/local views of the object in that image.
- Increasing the depth and width of the network. Unsupervised learning benefits more from this than its supervised counterpart.
- Adding a projection head improves the learned representation, and a non-linear projection works better than a linear one. Note, however, that the representation used for the downstream task should be the one from the layer before the projection head, not the one after it (see Section 4.2).
- The contrastive training loss that results in the best performance is the NT-Xent loss (normalized temperature-scaled cross-entropy loss), sketched earlier. The authors find that l2 normalization and temperature scaling are crucial for the quality of the resulting representation.
- Training longer and with larger batch sizes also leads to better representations. This is because the network ends up seeing more negative examples, which facilitates convergence.
- Compared with the state of the art (other representation learning methods), SimCLR consistently outperforms them both under linear evaluation and in the semi-supervised setting (in which the model is fine-tuned on only a few labeled ImageNet examples).
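For the augmentation findings above, here is a sketch of the crop-plus-color-distortion composition using torchvision. The parameter values roughly follow the paper's ImageNet setup but should be read as illustrative rather than a faithful reproduction (the paper additionally uses Gaussian blur for ImageNet, omitted here).

```python
from torchvision import transforms

# Treat the exact strengths and probabilities as illustrative assumptions.
color_jitter = transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)
simclr_augment = transforms.Compose([
    transforms.RandomResizedCrop(size=224),            # random crop, then resize back
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([color_jitter], p=0.8),     # random color distortion
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor(),
])

# Applying the same stochastic pipeline twice to one image yields a positive pair:
# x_i, x_j = simclr_augment(img), simclr_augment(img)
```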