Ever since the craze for all things AI-generated began, I've wondered what would happen if the world became so full of AI-generated content (text, software, pictures, music) that AI-generated material dominated the training sets. We can already see hints of this on GitHub: in February 2023, GitHub stated that 46% of all checked-in code was written by Copilot. That's good for business, but what does it mean for future generations of Copilot? At some point in the near future, new models will be trained on code that earlier models wrote. The same goes for every other generative AI application: DALL-E 4 will be trained on data that includes images generated by DALL-E 3, Stable Diffusion, Midjourney, and the rest; GPT-5 will be trained on a corpus that includes text generated by GPT-4; and so on. This is inevitable. What does it mean for the quality of the output these models produce? Will that quality improve or degrade?
I'm not the only one wondering about this. At least one research group has experimented with training generative models on content produced by generative AI, and found that over successive generations the output becomes more tightly constrained and less likely to be original or unique: generative AI output grows more similar to itself, with less variation, over time. They reported their results in “The Curse of Recursion,” a paper that's well worth reading. (Andrew Ng's newsletter has an excellent summary of these results.)
I don't have the resources to train large models recursively, but I thought of a simple experiment that might be analogous. What would happen if you took a list of numbers, computed their mean and standard deviation, used those statistics to generate a new list, and repeated the process over and over? This experiment requires nothing but simple statistics: no AI involved.
Even without AI, this experiment can show how a model might break down when trained on data it generated itself. In many respects, a generative model is a correlation engine. Given a prompt, it generates the word most likely to come next, then the word most likely to come after that, and so on. If the prompt contains the words “to be,” odds are good that the next word will be “or,” and that the word after that will be “not,” and so forth. The model's output is strongly correlated with the words that came before it. If we train a new model on that output and repeat the process, what is the result? Do we end up with more variation, or less?
To answer these questions, I wrote a Python program that generated a long list of random numbers (1,000 elements) drawn from a Gaussian distribution with mean 0 and standard deviation 1. It then took the mean and standard deviation of that list, used them to generate another list of random numbers, and repeated the process for 1,000 iterations before recording the final mean and standard deviation. That final standard deviation was almost always much smaller than the initial value of 1, but it varied widely from run to run, so I repeated the experiment 1,000 times and averaged the final standard deviations. (1,000 experiments is overkill; 100 or even 10 would show similar results.)
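The procedure is easy to sketch in NumPy. Here's a minimal version, not my exact script: refit_gaussian is just an illustrative name, the replicate count is scaled down to 100, and np.std computes the population standard deviation by default.

```python
import numpy as np

rng = np.random.default_rng()

def refit_gaussian(n=1_000, iterations=1_000):
    """Repeatedly fit a Gaussian to samples drawn from the previous fit."""
    mean, std = 0.0, 1.0
    for _ in range(iterations):
        sample = rng.normal(mean, std, size=n)   # draw from the current fit
        mean, std = sample.mean(), sample.std()  # refit to what we just drew
    return mean, std

# Average the final standard deviation over many independent runs.
finals = [refit_gaussian()[1] for _ in range(100)]
print(f"average final standard deviation: {np.mean(finals):.3f}")
```

Raising the iterations parameter reproduces the longer runs described below.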
When I did this, the average of the final standard deviations came out to roughly 0.45 (I won't say it “converged”). It still varied, but it was almost always between 0.4 and 0.5. (I also computed the standard deviation of the standard deviations, which was neither interesting nor suggestive.) This result was significant. My intuition had told me the standard deviation wouldn't collapse: I expected it to stay close to 1, and that the experiment would accomplish nothing beyond getting my laptop's fan running. But with these initial results in hand, I had no choice but to go further. As I increased the number of iterations, the final standard deviation shrank further and further, dropping to 0.0004 at 10,000 iterations.
I think I know why. (It's very likely that a real statistician would look at this problem and say, “It's an obvious consequence of the law of large numbers.”) It helps to look at the standard deviations one step at a time. The first list is generated with a standard deviation of 1, but when you compute the standard deviation of that data, you'll get something slightly different: 1.1, or 0.9, or almost anything nearby. As you repeat the process, standard deviations below 1 come to dominate, even though each step is only slightly biased downward, and that shrinks the “tails” of the distribution. Once you've generated a list of numbers with a standard deviation of 0.9, you're far less likely to get back to a list with a standard deviation of 1.1 than to slip down to one of 0.8. And once the tails of a distribution start to disappear, they have little chance of growing back.
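As a rough check on that intuition (my own back-of-the-envelope analysis, not anything from the paper): each iteration effectively multiplies the current standard deviation by the ratio of the sample standard deviation to the true one. That ratio hovers around 1, but its logarithm averages slightly below zero, so the log of the standard deviation takes a random walk with a small downward drift.

```python
import numpy as np

rng = np.random.default_rng(0)

# One refitting step multiplies the current sigma by (sample std / sigma).
# The ratio is centered near 1, but its log has a slightly negative mean,
# so log(sigma) drifts downward step after step and the tails erode.
n, reps, chunk = 1_000, 100_000, 10_000
log_steps = np.concatenate([
    np.log(rng.standard_normal((chunk, n)).std(axis=1))  # population std
    for _ in range(reps // chunk)
])
print(f"mean log step: {log_steps.mean():+.6f}")  # roughly -1/n = -0.001
```

Compounded over 1,000 iterations, a drift of about -0.001 per step shrinks a typical run's standard deviation by a factor of roughly 1/e ≈ 0.37; over 10,000 iterations it collapses toward zero. That's at least the right order of magnitude for the numbers above.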
What does this mean?
My experiment shows that feeding the output of a random process back into its input shrinks the standard deviation. That is exactly what the authors of “The Curse of Recursion” observed when working directly with generative AI: the tails of the distribution disappeared, almost completely. My experiment offers a simplified way to think about collapse, and it shows that model collapse is something we should expect.
Model collapse poses a serious problem for AI development. On the surface, preventing it is easy: just exclude AI-generated data from the training set. But at least for now that isn't possible, because tools for detecting AI-generated content have proven inaccurate. Watermarking might help, although it brings its own challenges, including whether generative AI developers will actually implement it. And as hard as it may be to exclude AI-generated content, collecting human-generated content could become just as hard: if AI-generated content displaces human-generated content, high-quality human-generated data may be difficult to find.
If so, the future of generative AI may be bleak. As AI-generated output becomes the majority of the training data, models' ability to surprise and delight will diminish. They will become predictable and dull, and probably no less likely to “hallucinate” than they are now. To be unpredictable, interesting, and creative, we still need ourselves.