The Sequence Chat: The End of Data. Or Maybe Not
One of the most passionate arguments in generative AI.
Are we running out of data? This is a contentious debate in the world of generative AI, with passionate supporters and detractors on both sides. Most large foundation models have already been trained on virtually the entirety of the public internet, using datasets like FineWeb that capture much of the publicly available data. As a result, the question of AI hitting a 'data wall' has become increasingly relevant. After all, the famous scaling laws depend fundamentally on the availability of vast amounts of training data; if that supply runs dry, the scaling laws stop paying off, and one could argue that the entire foundation model ecosystem risks plateauing.
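To see why data availability is baked into the scaling-law argument, here is a minimal sketch based on the Chinchilla-style scaling law (Hoffmann et al., 2022). The fitted constants are the published estimates from that paper; the parameter count and token budgets are illustrative assumptions, not figures from this essay.

```python
# Hedged sketch: Chinchilla-style loss prediction L(N, D) = E + A / N^alpha + B / D^beta.
# Constants are the published fits from Hoffmann et al. (2022); the model size and
# token budgets below are illustrative assumptions only.

def chinchilla_loss(params: float, tokens: float) -> float:
    """Predicted pretraining loss as a function of parameters (N) and training tokens (D)."""
    E, A, B = 1.69, 406.4, 410.7      # irreducible loss and fitted coefficients
    alpha, beta = 0.34, 0.28          # fitted exponents for parameters and tokens
    return E + A / params**alpha + B / tokens**beta

N = 400e9                             # a hypothetical 400B-parameter model
capped_D = 1.5e12                     # assume ~1.5T tokens of usable curated web text
optimal_D = 20 * N                    # the ~20 tokens-per-parameter compute-optimal heuristic

# Once the data term B / D^beta dominates, adding parameters buys very little:
# this gap is the plateau the "data wall" camp worries about.
print(f"loss with capped data (~1.5T tokens):      {chinchilla_loss(N, capped_D):.3f}")
print(f"loss with compute-optimal data (~8T tokens): {chinchilla_loss(N, optimal_D):.3f}")
```

With the dataset capped, the data term sets a floor on loss no matter how many parameters are added, which is exactly the plateau the data-wall camp worries about.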
This essay explores the "end of data" thesis for AI models, examining both sides of the argument and exploring potential solutions such as extracting higher-quality data and generating synthetic datasets.
Let’s start with some points that support the data-wall argument: