Escaping the Data Vacuum: Rethinking LLM Training and the Future of AI
By Nick Reese
May 29, 2024
On April 16, 2024, the New York Times released a story about artificial intelligence (AI), large language models (LLMs), and what it called "AI's Original Sin." Featured on the Times podcast The Daily, reporter Cade Metz describes some of the lengths to which the big-name LLM companies went in their early days to obtain data to train their models. The article is replete with stories of copyright infringement and paints a dark picture of a future in which LLMs are trained on synthetic data generated by other LLMs. That is a tangled mess, and it is one very stark potential outcome for some of the most advanced AI in the world. But the future of AI is far from certain, and a world where LLMs are a copy of a copy does not have to be our destiny. If we can shift how we think about and apply LLMs in our lives and our business processes, we can paint a much brighter picture. First, a word on the "original sin."
According to the NYT article, large LLM tech companies had a problem early in their training cycle. They were running out of data. Literally. They were running out of available data on the internet to feed into their models to train them. That sounds impossible when one considers not just the “size” of the internet but the combined total of human knowledge in the form of text and photos that resides there. But it’s true. OpenAI gobbled up so much data it had to start looking in other places, which led it to YouTube. In another mind-stretching exercise, try to imagine the hours of video content on YouTube, in total. All one must do is transcribe the audio of these videos and create a new data source. Never mind that doing so is explicitly a violation of YouTube’s terms of service. After all, what’s the best type of data? MORE data!
This is exactly what happened, and Mr. Metz's reporting uncovered this voracious data appetite. But it does not stop there. Officials inside Meta recognized they had to keep pace and needed to find EVEN MORE data. According to the NYT, these officials talked seriously about acquiring a major publishing company just to get access to its entire catalog of text for MORE DATA STILL.
The problem is that the entire catalog of human-created content is actually finite. New content is created every day, but not as quickly as web scrapers and automated ingest pipelines feed it to the models for training. One source in the article estimated that tech companies will "run out" of new human-created data by 2026, necessitating the creation of synthetic data, or data created by AI. You think LLM hallucinations are bad now? Wait until you train AI on data it created itself.
This raises some compelling questions about the future of AI, especially LLMs. Right now, users of Gemini, Claude, and ChatGPT can ask these models ANYTHING. Need a funky picture of a kangaroo giving a presentation to a corporate board? Easy. Want someone to talk to late on a Wednesday night? Done. Need a legal case briefing for court? OK, bad example, but you get the idea. The premise is that these models contain ALL the knowledge of the internet and can answer any question you give them. The trillions of pages of text they have learned from give them tremendous power to help humans with certain kinds of tasks. How we use our LLM tools is still being decided while the designers of these models continue to feed them new data, probably including this very post. Hello, Claude! But will we always use LLMs the way we do now, and will those models need to be trained on the entire body of human knowledge plus the entire body of synthetic AI knowledge?
As LLM technology vacuums up ever more data, it will create a data vacuum of its own, one where LLMs enter unknown and thus untrusted territory. No one is sure what training on synthetic data will do to our LLMs, but there is reason for skepticism. Already, hallucinations are an ever-present possibility that everyone from students to lawyers submitting legal briefs must watch out for. Synthetic training data could also have effects no one has predicted on accuracy and robustness, and thus on the quality of model output. But the data vacuum problem is an unforced error, one we can avoid if we think more deeply about how we use LLM technology and how we train it. The future of LLMs does not have to be a giant, copyright-infringing vacuum sucking up all available data and leading to a new data vacuum of a different kind. Instead, LLMs can be narrowly trained for specific purposes and deployed to users securely, with measures built in to increase trust. Now entering a vacuum-less AI future.
LLM technology can save humans incredible amounts of time and allow domain experts to remain the experts while being better informed. This is AI's superpower. It is good at things that humans are not, like finding patterns in HUGE datasets and presenting data in powerful visuals. Making sense of that data and making decisions based on it is the human superpower. An LLM does not need to be trained on all human knowledge for this to occur.
The vacuum future can be avoided altogether by applying LLMs to small, targeted use cases and training them on the data necessary for the desired outcome. Do we need (or want) the entire body of Reddit to generate a report on weather data? Lawyers often receive hundreds of thousands of pages of discovery in a legal case. An LLM can be trained on just that data and give those lawyers hundreds of hours back: instead of reading every document over late-night Chinese food, they can query the LLM and get the answer back in seconds. Moving time-consuming tasks to AI frees human brains to think critically, evaluate data, and make decisions. A minimal sketch of that discovery example follows.
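To make the idea concrete, here is a rough sketch of what "query the LLM over just that data" can look like in practice. In many real deployments the model is not retrained at all; instead, the relevant passages are retrieved from the specific document set and the model is told to answer only from them. Everything in this sketch is illustrative: the discovery_docs folder, the chunk sizes, and the hand-off to a model at the end are assumptions rather than any particular vendor's product, and the simple word-overlap scoring stands in for real vector search.

    # Minimal sketch, Python standard library only: ground a model in a
    # specific document set (e.g., legal discovery) instead of "all of the
    # internet". The folder name and the final model hand-off are
    # hypothetical; swap in whatever narrowly deployed model you actually use.
    import pathlib
    import re
    from collections import Counter

    def load_chunks(folder, chunk_words=200):
        """Read every .txt file in the discovery folder and split it into chunks."""
        chunks = []
        for path in pathlib.Path(folder).glob("*.txt"):
            words = path.read_text(encoding="utf-8", errors="ignore").split()
            for i in range(0, len(words), chunk_words):
                chunks.append((path.name, " ".join(words[i:i + chunk_words])))
        return chunks

    def top_chunks(question, chunks, k=3):
        """Rank chunks by word overlap with the question (a stand-in for
        real vector search) and keep the k best."""
        q_terms = Counter(re.findall(r"[a-z']+", question.lower()))
        def score(chunk):
            c_terms = Counter(re.findall(r"[a-z']+", chunk[1].lower()))
            return sum(min(q_terms[t], c_terms[t]) for t in q_terms)
        return sorted(chunks, key=score, reverse=True)[:k]

    def build_prompt(question, folder="./discovery_docs"):
        """Assemble a prompt that confines the model to the retrieved passages."""
        context = "\n\n".join(
            f"[{name}] {text}"
            for name, text in top_chunks(question, load_chunks(folder))
        )
        return (
            "Answer using ONLY the passages below. "
            "If the answer is not there, say so.\n\n"
            f"{context}\n\nQuestion: {question}"
        )

    if __name__ == "__main__":
        prompt = build_prompt("When was the supplier contract signed?")
        print(prompt)  # hand this prompt to the locally deployed model of your choice

The point is not this particular code; it is that the model only ever sees the case's own documents, so there is no need to vacuum up the internet, synthetic or otherwise, to answer the question.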
Instead of talking about an LLM data vacuum, we should be talking about how to make LLM training more efficient and how to get the most accurate, hallucination-free outputs. Not every model needs to be trained on the entirety of the internet plus the synthetic internet AI eventually creates. LLM technology can move forward as a secure, privacy-preserving tool that enables humans to get the best results and make the best decisions. Meta can buy all the publishing companies it wants. LLM use can, and will, evolve. We collectively must step away from how we have grown accustomed to using the current generation of LLMs and start thinking about ways to apply them to specific problems with specific datasets. Real datasets, not synthetic.