This view is further reinforced by independent observations from Ilya Sutskever, former Chief Scientist of OpenAI. Speaking at the NeurIPS machine learning conference in December 2024, Sutskever introduced the concept of 'Peak Data'. Echoing 'Peak Oil' in the petroleum industry, the concept holds that the supply of high-quality training data has already peaked and is now in decline.
Modern AI development, and the training of large language models (LLMs) in particular, depends on vast amounts of text available on the Internet. These models have drawn on virtually every form of text humanity has produced, including web pages, books, academic papers, and social media posts. Industry leaders now recognize, however, that the supply of high-quality data is clearly limited, especially data containing specialized knowledge or academic content.