Synthetic data may lead to “model collapse” in AI systems

July 28, 2024

As AI models grow larger, their need for data is increasingly being met by other AI models that generate synthetic data. While this can fill knowledge gaps for chatbots, it also risks destabilizing them. AI-generated data has supplemented fields such as medical imaging for years, but it is now being used more broadly because high-quality human-made data is becoming more expensive and more restricted.

Big tech companies like Meta, Google, and Anthropic are combining synthetic data with human-generated data to train their AI models. For instance, Google DeepMind used synthetic data to train AlphaGeometry 2, a system capable of solving math Olympiad problems. However, new research indicates that over-reliance on AI-generated data can lead to “model collapse,” where a model’s outputs become incoherent after several generations of training on its own generated data.

This phenomenon occurs because the AI model gradually loses information about less common data in the training set, leading to increased errors and eventual collapse. This issue poses a significant risk for underrepresented groups and languages, potentially leading to a loss of fairness even in initially unbiased datasets. Nonetheless, targeted use of AI-generated data can address some of these limitations, such as reducing toxic responses and improving model fairness through algorithmic reparation.
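The loss of rare data described above can be illustrated with a toy simulation. This is only a minimal sketch, not the method used in the research: it assumes a "model" that is nothing more than a table of token frequencies, and a made-up vocabulary of 50 token types with a long tail of rare ones. The function names and parameters (`fit_token_model`, `simulate_generations`, `n_samples`) are hypothetical and chosen for illustration.

```python
import random
from collections import Counter


def fit_token_model(samples, vocab):
    """'Train' a toy model: estimate each token's probability by counting."""
    counts = Counter(samples)
    total = len(samples)
    return [counts[token] / total for token in vocab]


def simulate_generations(generations=10, n_samples=200, vocab_size=50, seed=0):
    """Repeatedly train a toy model on data generated by the previous model.

    Returns the number of token types that still have non-zero probability
    after each generation.
    """
    rng = random.Random(seed)
    vocab = list(range(vocab_size))

    # Assumed starting distribution: a few common tokens and many rare ones.
    weights = [1.0 / (rank + 1) for rank in range(vocab_size)]
    weight_sum = sum(weights)
    probs = [w / weight_sum for w in weights]

    surviving = []
    for _ in range(generations):
        # The current model generates the next model's entire training set.
        samples = rng.choices(vocab, weights=probs, k=n_samples)
        probs = fit_token_model(samples, vocab)
        surviving.append(sum(1 for p in probs if p > 0))
    return surviving


if __name__ == "__main__":
    print(simulate_generations())
```

In this sketch the count of surviving token types can only stay flat or fall from one generation to the next: once a rare token fails to appear in a generated training set, its estimated probability becomes zero and later models never produce it again. Real language models are far more complex, so this shows only the direction of the effect.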

The debate continues on whether synthetic data can truly represent the breadth of human experience and surpass current models. While retaining a portion of human-generated data can help prevent model collapse, distinguishing real data from synthetic data remains challenging. Properly leveraged, AI-generated data holds immense potential, but indiscriminate use could lead to significant problems. The future of AI training lies in balancing real and synthetic data to harness the best outcomes.
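Keeping a fixed share of human-generated data in each training set could be sketched as follows. This is an illustrative example only, and it rests on a strong assumption: that every record is already reliably labeled as human or synthetic, which, as noted above, is hard in practice. The function name `mix_training_set` and its parameters are hypothetical.

```python
import random


def mix_training_set(human, synthetic, human_fraction, size, seed=0):
    """Build a training set of `size` records with a fixed share of human data.

    Assumes `human` and `synthetic` are lists whose provenance is known.
    """
    if not 0.0 <= human_fraction <= 1.0:
        raise ValueError("human_fraction must be between 0 and 1")

    n_human = round(size * human_fraction)
    n_synthetic = size - n_human
    if n_human > len(human) or n_synthetic > len(synthetic):
        raise ValueError("not enough records to build the requested mix")

    rng = random.Random(seed)
    # Sample without replacement from each pool, then shuffle so that the
    # two sources are interleaved in the final training set.
    batch = rng.sample(human, n_human) + rng.sample(synthetic, n_synthetic)
    rng.shuffle(batch)
    return batch
```

For example, `mix_training_set(human, synthetic, human_fraction=0.3, size=50)` returns 50 records, 15 of them drawn from the human pool. The right fraction is not stated in the article and would have to be determined empirically.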

Jim Love

Jim Love's career in technology spans more than four decades. He's been a CIO and headed a worldwide management consulting practice. As an entrepreneur, he built his own tech business. Today he is a podcast host with the popular tech podcasts Hashtag Trending and Cybersecurity Today, which have over 14 million downloads. As a novelist, his latest book, "Elisa: A Tale of Quantum Kisses," is an Audible best seller. In addition, Jim is a songwriter and recording artist with a Juno nomination and a gold album to his credit. His music can be found at music.jimlove.com