After being busy for the last couple of months (and some major life and career changes), I am back to our regular programming. Here’s to a great 2024!
Let’s welcome the new year by taking a fresh look at the question on everyone’s mind - are we in an AI boom or a bubble? GPT-4 clearly represents a huge leap forward in AI capabilities. But why did no one see it coming? And is it the beginning of a rapid acceleration of AI capabilities, or have we just found a new ceiling for at least the next 5-10 years? There is no definitive answer to this big question, but this year’s NeurIPS best paper analyzes one aspect of it. The paper is titled "Are Emergent Abilities of Large Language Models a Mirage?".
What is emergence?
The scaling laws of deep neural networks (more on scaling laws) discovered by DeepMind and OpenAI suggest that increasing model and data sizes has a predictable effect on the training objective loss. However, the scaling laws are unable to predict the “capabilities” of the model, as evidenced by the phenomenon of emergence. Emergent capabilities are defined as abilities that are not present in smaller models but suddenly and unpredictably appear beyond a certain scale threshold.
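To make “predictable” concrete, here is a minimal sketch of what a loss-level scaling law looks like: loss falling smoothly as a power of model size. The single-variable form and every constant below are purely illustrative, not taken from any actual paper.

```python
# Illustrative scaling law: loss falls smoothly as a power of model size.
# The constants are made up; the point is only that the curve is smooth
# and predictable, which is what loss-level scaling laws capture.
def predicted_loss(num_params, irreducible=1.7, coeff=400.0, alpha=0.34):
    return irreducible + coeff / num_params ** alpha

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.2f}")
```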
Intuition : Emergent capabilities are just discontinuous metrics
The paper makes a bold claim - emergence is a mirage; there is no such thing.
Capabilities of models are measured by metrics such as accuracy, but not all metrics are created equal. Therefore in addition to the scale of the models and the loss, we must also account for the behavior of the metrics. Let’s look at an example -
When an LLM produces an answer to a prompt, it is spitting out a long sequence of tokens. In effect, the LLM is making tens or even hundreds of predictions (one for each token). If we used a metric like Token Edit Distance, which measures how far the predicted token sequence is from the label sequence, the performance of an LLM would increase linearly with the probability of predicting each token correctly.
However, if we use a metric like Exact Match, which is 1 if every token in the predicted output matches the corresponding token in the label and 0 if one or more tokens are mismatched, the performance is much more unpredictable. As the model gets better and better at predicting the next token, Token Edit Distance keeps improving, but Exact Match stays at 0 - until a certain threshold is reached, the model starts getting the exact answer right, and Exact Match suddenly jumps from 0 to 1. Thus Exact Match is a discontinuous metric. (In the paper’s figures, the top row uses Accuracy and shows much sharper increases than the bottom row, which uses Token Edit Distance.)
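Here is a minimal simulation of that intuition. It assumes every token is predicted correctly, independently, with probability p - the sequence length, sample count, and metric stand-ins are all illustrative rather than anything taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len = 10        # tokens per answer (illustrative)
n_samples = 10_000  # simulated answers per accuracy level

# Compare a "linear" metric (fraction of correct tokens, a stand-in for
# Token Edit Distance) with the discontinuous Exact Match metric, as the
# per-token accuracy p improves.
for p in [0.5, 0.7, 0.9, 0.95, 0.99]:
    correct = rng.random((n_samples, seq_len)) < p  # True where a token matches the label
    per_token = correct.mean()                      # grows linearly with p
    exact = correct.all(axis=1).mean()              # ~ p**seq_len: near 0 for a long time, then shoots up
    print(f"p={p:.2f}  per-token={per_token:.2f}  exact-match={exact:.3f}")
```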
The paper claims that all so-called emergent capabilities of LLMs are actually just caused by discontinuous metrics. This is why we observe a large, sudden rise in capabilities on only certain metrics, while other metrics that scale linearly with loss show no emergent behavior.
One interesting finding in the paper is that nonlinear metrics also show quasi-emergent behavior: they can increase sharply with scale, but these increases can be predicted by (nonlinear) scaling laws. For example, imagine a metric Atleast-k-Match which is 1 if at least k tokens match and 0 otherwise. For small k, this behaves like Token Edit Distance; as k increases, it becomes a better approximation of Exact Match. Importantly, Atleast-k-Match is not discontinuous - rather, it is a nonlinear metric which scales roughly as the kth power of the probability of predicting a single token correctly. This metric exhibits quasi-emergence: it increases sharply with increasing scale, but it is still predictable using scaling laws.
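A small sketch of that idea, under the same independent-per-token assumption as above (the expected Atleast-k-Match is then just a binomial tail probability; the numbers are illustrative):

```python
from scipy.stats import binom

seq_len = 10  # same illustrative sequence length as before

# Expected Atleast-k-Match when each of seq_len tokens is correct
# independently with probability p: P(number of correct tokens >= k).
def at_least_k_match(p, k, n=seq_len):
    return binom.sf(k - 1, n, p)  # survival function gives P(X >= k)

for k in [1, 5, 10]:
    curve = [round(at_least_k_match(p, k), 3) for p in (0.5, 0.7, 0.9, 0.99)]
    print(f"k={k:2d}: {curve}")
# Small k saturates early; k=10 (the Exact Match limit, expectation p**10)
# stays near 0 and then rises sharply - continuous and predictable, but
# strongly nonlinear in p.
```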
Conjecture : Which metrics matter?
The paper ends here but leaves the most important question unanswered. If different metrics scale differently with model size, which metrics should we use?
My intuition is that the nonlinear/discontinuous metrics are very important.
For example, consider the Turing test. An AI and a human talk to each other, and an evaluator looks at the transcript and tries to identify the AI. What metric is the evaluator likely to use? For bad language models, it is possible that there are obviously bad individual words/tokens. But this stops working as the AI gets better and better at mimicking human conversation. If you have ever tried to figure out whether a tweet or a website was written by ChatGPT, individual words are rarely ever wrong. When it is possible to tell, it is usually because ChatGPT doesn’t make a cohesive point, gets sidetracked, or forgets its chain of thought. Such things cannot be measured at the token level; they only vary at the global narrative level. Therefore, a metric that could detect human vs AI text would likely be discontinuous.
A similar intuition can be drawn about images generated with AI. CGI in movies, even in 2023, has not been able to cross the uncanny valley (the more realistic it is, the more unnerving and creepy it is). Now Stable Diffusion, DALL-E and their ilk seem to have finally made it across - they can make images that are aesthetically pleasing. Some of these images are hard to tell apart from real photos, but even when you can tell that an image is AI-generated, it is rarely because of errors in generation (when there are no hands in the frame :). So what metrics make an image ‘uncanny’ or identifiable as AI-generated? Just like text, the distinguishing metrics would be at the global narrative level. Maybe the characters are all looking at the camera, or the shadows are striking a different pose than the character, or perhaps the grain of the camera or the texture of the background is not realistic. These are not errors at the individual pixel level. These are discontinuous metrics, and they depend on whether a model can represent these concepts internally. In a way, we can think of the image as telling a story, and the story can either be a sequence of unrelated but logically and grammatically correct sentences, or it can be a cohesive narrative that describes a single scene or concept vividly. Thus the test of realistic image generation is very much like the Turing test for realistic language generation, and it is only measurable through highly nonlinear/discontinuous metrics.
Conclusion
Let us summarize what we have understood so far.
Known - Scaling laws show that larger models and larger datasets lead to higher per-token or per-pixel accuracies.
Known - Large models show emergence - at some scale, the models suddenly and unpredictably acquire a capability not present in smaller models.
Paper - Emergent capabilities are almost always measured using metrics that are nonlinear or discontinuous with respect to individual token accuracy. Metrics that scale linearly with token accuracy show smooth, predictable scaling, i.e. no emergence.
Hypothesis - Once AI improves beyond a threshold, local per-token accuracy ceases to be a good metric to judge quality. The best metrics - the ones that measure conceptual understanding, narrative and consistency - are global and discontinuous metrics.
Hypothesis - This means that all the capabilities we really care about (or are scared of) are likely to be emergent capabilities that suddenly show up at some scale.
Hypothesis - Finally, nonlinear approximations of discontinuous metrics might give us approximate scaling laws for even these emergent capabilities. This seems like the most promising direction of research to me. If we could approximate highly discontinuous 0-1 capabilities like intelligence, persuasion and self-awareness using nonlinear metrics, we might be able to predict how much compute and data we would need to create intelligent, persuasive or self-aware AI - a toy sketch of that idea follows below.
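To make that last hypothesis concrete, here is a toy sketch of what such a prediction could look like. Everything in it - the functional form, the compute values, and the measured capability numbers - is invented purely for illustration; it is emphatically not a real scaling law.

```python
import numpy as np
from scipy.optimize import curve_fit

# Pretend per-token accuracy improves smoothly with log-compute, and the
# capability we care about behaves like accuracy**k for some k. Fit that
# continuous form on small scales, then extrapolate to larger ones.
# All numbers here are invented for illustration.
log_compute = np.array([18.0, 19.0, 20.0, 21.0, 22.0])   # log10 FLOPs (made up)
capability  = np.array([0.001, 0.004, 0.02, 0.08, 0.25])  # measured metric (made up)

def model(c, a, b, k):
    token_acc = 1.0 / (1.0 + np.exp(-(a * c + b)))  # smooth rise in per-token accuracy
    return token_acc ** k                           # capability as its k-th power

params, _ = curve_fit(model, log_compute, capability, p0=[0.5, -9.0, 10.0], maxfev=10000)
for c in [23, 24, 25]:
    print(f"log10 FLOPs = {c}: extrapolated capability ~ {model(c, *params):.2f}")
```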
So are you excited for what’s coming in 2024?
And there you have it, my thoughts on why some AI capabilities seem to ‘emerge’ spontaneously at large scales, which capabilities are likely to be emergent and how we may be able to predict them better. For more intuitions on AI/ML, subscribe here and follow me on Twitter. You can also check out my other blog and projects on nirsd.com.