Date: 2025-01-09

Tesla CEO Elon Musk has said that all available human data for AI training, including books, was exhausted last year, aligning with other experts who had reached similar conclusions.

Musk, who also owns the AI company xAI, made the remarks during a conversation with Stagwell chairman Mark Penn that was livestreamed on X.

Former OpenAI chief scientist Ilya Sutskever had hinted at this in December, noting that the AI industry had reached what he called “peak data” and predicting that a lack of training data would force a shift away from the way models are developed today.

Shift to synthetic data

According to Musk, the next available option for training AI is synthetic data, which is data generated by AI models themselves.

“AI is advancing on the hardware front, and on the software front, it’s now moving to synthetic data, because we’ve actually run out of all human data. We’ve literally run out of the entire internet, all books ever written, and all interesting videos.

“We’ve now exhausted the cumulative sum of human knowledge in AI training and that happened last year. So, the only way to then supplement that is with synthetic data, which AI creates.

“It’ll sort of write an essay or come up with a thesis, and then it will grade itself and sort of go through this process of self-learning with synthetic data,” Musk said.

Challenges of using synthetic data

The Tesla CEO noted, however, that training AI on synthetic data comes with its own challenges, especially when it comes to verifying that the generated answers are correct.

“This is always challenging because how do you know the answer is hallucinated or real? So, it’s difficult to find the ground truth,” he said.

Meanwhile, some researchers have suggested that synthetic data can lead to model collapse, where a model becomes less “creative” and more biased in its outputs, eventually compromising its functionality.

What you should know

Tech giants like Microsoft, Meta, OpenAI, and Anthropic are already using synthetic data to train flagship AI models.

Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated.

Microsoft’s Phi-4, which was open-sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google’s Gemma models.

Anthropic used some synthetic data to develop one of its most performant systems, Claude 3.5 Sonnet, and Meta fine-tuned its most recent Llama series of models using AI-generated data.