“Comparison is the thief of joy”, the adage goes.
But comparison is a powerful technique in machine learning that has only become more important with recent advances in AI. A recent paper by Yann LeCun et al., “On the duality between contrastive and non-contrastive self-supervised learning”, tries to summarize the state of contrastive learning and unify its variants, so this is a good opportunity to brush up on the fundamentals.
Intuition: Labels are bad for learning
Supervised learning requires large datasets of feature-label pairs, so that models can be trained to use the features to predict the labels. Back around 2012-2015, when deep learning shot to the top of the field by beating all other methods, supervised learning was the dominant paradigm. But we reached the limits of this paradigm very quickly.
Imagine you want to train a model to classify whether an image is a dog or a cat. You create a labeled dataset of thousands (or millions) of images of dogs and cats. You train a supervised model with the best SOTA neural network architecture. The model works very well in 95% of cases, but you find a long tail of errors. That’s fine, no model is perfect; let’s see if we can understand the errors and find ways to fix them.
All cat pics are indoors and most dog pics happen to be outdoors. So the model classifies the rare outdoor cat pics as dogs.
Many images have visible pet food: dog pics show dog treats, while cat pics show cat food. So the model misclassifies the rare images of cats with dog food as dogs, and vice versa.
How would you fix these errors? The fact that all cat pics are indoors, and that dog food almost always appears in images of dogs rather than cats, is not some false pattern in the noise. These are legitimate signals, facts that do contain some information about the world; it is just not the information we wanted our model to learn.
Unfortunately, in supervised learning, it is hard to control what the model actually learns. The model has only one objective, predicting the labels, and everything it learns is put to use predicting that label. The only way to prevent the model from learning the idiosyncrasies of the dataset is to remove the idiosyncrasies of the dataset, i.e., add more labeled data. We need more pics of cats outdoors and dogs with cat food. But this is clearly a rabbit hole. As we get closer to 100% accuracy, the long tail of edge cases keeps getting longer. What if labrador images are always outdoors and poodle images always indoors? Do we now collect indoor labradors and outdoor poodles? Or outdoor cats with dog food and indoor dogs with cat food?
Intuition: Learning representations
Say we give a child two images of dogs and two images of cats. The child can tell immediately that the two dogs look similar, the two cats look similar, and the dogs look different from the cats. The child does not even need to know the labels, dog or cat. They can still tell which animals are similar and which are different. In fact, the label (dog vs cat) is something people made up ‘after’ they observed the similarities and differences between dogs and cats and decided to distinguish the two groups using the label.
Secondly, when the child learns that a specific animal is a “dog”, they also instantly learn that all the similar animals are “dogs” and all the dissimilar ones are “not dogs”. This is extremely label efficient, since labeling just a single animal automatically labels hundreds of other objects. Thus having a pre-computed comparison graph codifying the relationships between objects allows the child to extract a lot of knowledge from a single piece of information.
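To make this concrete, here is a minimal sketch in plain Python of one label spreading through a pre-computed comparison graph. The graph below is made up purely for illustration: nodes are images and edges mean “these two look similar”.

```python
# A minimal sketch: one label propagating through a similarity graph.
# The adjacency list is hypothetical, for illustration only.
from collections import deque

similar = {  # image id -> ids it looks similar to
    "img0": ["img1", "img2"],
    "img1": ["img0", "img3"],
    "img2": ["img0"],
    "img3": ["img1"],
    "img4": ["img5"],  # a separate cluster (the cats, say)
    "img5": ["img4"],
}

def propagate_label(seed: str, label: str) -> dict:
    """Label every image reachable from `seed` via similarity edges (BFS)."""
    labels, queue = {seed: label}, deque([seed])
    while queue:
        node = queue.popleft()
        for nbr in similar.get(node, []):
            if nbr not in labels:
                labels[nbr] = label
                queue.append(nbr)
    return labels

# Labeling one image as "dog" automatically labels its whole cluster.
print(propagate_label("img0", "dog"))
# {'img0': 'dog', 'img1': 'dog', 'img2': 'dog', 'img3': 'dog'}
```

With a real dataset, the similarity edges would come from the comparisons themselves rather than being hand-written, but the leverage is the same: one label, many labeled objects.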
This is the basis of ‘contrastive’ learning, a method of representation learning that was all the rage back in 2018-2019, before next-token-prediction GPTs stole the spotlight. In contrastive learning, we create “labels” based on similarities and contrasts in the data. We create pairs of images and train models to classify whether each pair is similar or dissimilar: a pair of two cats is labeled 0 (‘similar’) and a cat-dog pair is labeled 1 (‘dissimilar’). A model trained to do this implicitly learns a representation of dogs vs cats, but in a much more robust way.
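One classic instance of this setup is the pairwise contrastive loss of Hadsell, Chopra and LeCun (2006), which uses the same 0/1 convention as above. Here is a minimal PyTorch sketch; the `encoder` in the usage comment is a placeholder for any image network that produces embeddings.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, y, margin=1.0):
    """Classic pairwise contrastive loss (Hadsell, Chopra & LeCun, 2006).

    z1, z2 : embeddings of the two images in each pair, shape (batch, dim)
    y      : 0.0 for similar pairs (cat-cat, dog-dog), 1.0 for dissimilar
    Similar pairs are pulled together; dissimilar pairs are pushed apart
    until they are at least `margin` apart in embedding space.
    """
    d = F.pairwise_distance(z1, z2)                  # Euclidean distance
    loss_similar = (1 - y) * d.pow(2)                # pull together
    loss_dissimilar = y * F.relu(margin - d).pow(2)  # push apart
    return 0.5 * (loss_similar + loss_dissimilar).mean()

# Hypothetical usage, with `encoder` standing in for any embedding network:
# z1, z2 = encoder(imgs_a), encoder(imgs_b)
# loss = contrastive_loss(z1, z2, pair_labels.float())
```

Note there is no dog/cat classifier head anywhere: the only supervision is the geometry of pairs, which is exactly why the learned representation is not tied to the original labels.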
How does this new model handle images with dog food? The model is trying to learn that pairs of dog images are similar, including pairs where one image has dog food and one does not. This teaches the model that dog food is not what makes a dog, and forces it to learn what dogs actually look like. In fact, the more diverse, noisy, and irrelevant features the dataset has, the more robust the model becomes: it learns not only what makes a dog, but what doesn’t. This makes contrastive models not only very good at most prediction/detection tasks, but also highly robust to noise and to adversarial attacks that try to trick the model with tampered data.
Conjecture: Learning from reality
Contrastive learning has been shown to be more accurate and robust than supervised learning. Interestingly, these state-of-the-art results are mostly demonstrated on well-studied, fully labeled academic datasets. This leads me to two interesting conclusions.
Learning objectives have huge alpha. Even on fully labeled datasets, and with identical model sizes, how you use the labels to set up the training objective makes a huge difference to what the model learns. This is a very good sign, because it gives hope that future advances can make deep learning learn more with less data and compute, and that scaling is not our only path forward.
The real power of contrastive learning kicks in when there are no labels at all. For example, if we were using images uploaded by Instagram users, we could define images uploaded by the same user as similar and images uploaded by different users as dissimilar. Such a model would learn all kinds of interesting information about the world. In the limit, temporal and sensory continuity is an infinite fount of contrastive knowledge. Two frames of video one second apart are more similar than two frames an hour apart. People with similar diets are likely to have similar builds. Animals that look similar are likely to have similar diets. Environments that look the same have similar temperatures. All kinds of such relationships can be inferred simply by observing the world around us. As you can guess, this sounds eerily like how humans learn from their environment.
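As a rough sketch of the temporal-continuity idea, pairs can be mined straight from a video stream, with time itself supplying the supervision. The window sizes and sampling scheme below are my own illustrative choices, not taken from any specific paper.

```python
import random

def make_pairs(frames, times, pos_window=1.0, neg_window=3600.0, n=1000):
    """Mine contrastive pairs from temporal continuity (illustrative sketch).

    frames : list of frame arrays/tensors from one long video
    times  : timestamp in seconds for each frame
    Frames within `pos_window` seconds are treated as similar (label 0);
    frames at least `neg_window` seconds apart as dissimilar (label 1).
    No human labels are used anywhere.
    """
    pairs, attempts = [], 0
    while len(pairs) < n and attempts < 100 * n:  # bound the sampling loop
        attempts += 1
        i = random.randrange(len(frames))
        j = random.randrange(len(frames))
        gap = abs(times[i] - times[j])
        if i != j and gap <= pos_window:
            pairs.append((frames[i], frames[j], 0))  # similar
        elif gap >= neg_window:
            pairs.append((frames[i], frames[j], 1))  # dissimilar
    return pairs
```

These pairs can be fed directly into a pairwise loss like the one sketched earlier, turning raw observation into training signal.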
Thus contrastive learning is a powerful method of training AI that can learn just by observing and finding patterns in the environment. This makes it a natural complement to forecasting (see my article on next-token prediction) and imagination (see my article on generative diffusion) as one of the fundamental building blocks of intelligence.
And there you have it: my summary of contrastive learning, how it is inspired by human learning, and how it makes much more efficient use of data to learn useful information and discard noise. For more intuitions on AI/ML, subscribe here and follow me on Twitter. You can also check out my other blog and projects on nirsd.com.