As a programmer and CTO, I've developed a rough rule of thumb when scaling software systems. When you scale your load (users, page views, messages, etc.) by 10x, something breaks. Usually, it's something pretty fundamental, and the result is that you need to replace a critical component or rearchitect the system entirely. Concrete examples include adding caches, introducing load balancers, or sharding databases.
But instead of adding 10x more load to a system, what happens when you add 10x capacity? Presumably, the system won't break. But I still think there's a fundamental change to the system - wholly new capabilities are unlocked, even if they're not immediately apparent.
History is littered with examples of 10x improvements - the spinning jenny, the Bessemer process, the assembly line - and their downstream effects. Perhaps one of the most famous capacity increases is Moore's Law, which observed that the number of transistors on a chip doubles roughly every two years, and which has held up for decades. It's hard to imagine life as we know it without Moore's Law. If computers had "maxed out" in the 80s or 90s, it's unclear whether we would ever have gotten to AI, smartphones, or even the modern internet.
The context of this discussion is two (arguably three) 10x improvements that have gotten a lot of attention in the last week. To be clear, the attention is what's new: very smart people have been working on them for months, if not years. Yet it still raises the question: what new capabilities have we just unlocked that we haven't fully grasped yet?
Gemini 1.5 and the end of RAG
Gemini 1.5 was released last week, and with it a whole bunch of benchmarks. I haven't had a chance to play with it yet - though if you work at Google and can get me access, email me - but early reports seem to suggest it's pretty good.
The new context window stood out the most: 1 million tokens. Relative to existing models, that's a considerable increase. GPT-4 Turbo has a context window of 128,000 tokens, while Claude 2.1 goes up to 200K. But Google's managed to grow that to 1 million, and even 10 million in research settings. Moreover, Google's put out some impressive test results regarding recall - the model can accurately find results even in the middle of hours of audio and video, or millions of words.
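Those recall results are based on "needle in a haystack"-style tests: bury one specific fact somewhere in an enormous pile of text and ask the model to find it. Here's a minimal sketch of the idea - `call_model` is a hypothetical stand-in for whichever LLM API you're using, not a real client library:

```python
# Rough sketch of a "needle in a haystack" recall test. `call_model` is a
# hypothetical stand-in for whichever LLM API you're using - not a real client.
import random

def build_haystack(filler: str, needle: str, num_sentences: int) -> str:
    """Bury a single 'needle' fact at a random position in a wall of filler text."""
    sentences = [filler] * num_sentences
    sentences.insert(random.randint(0, num_sentences), needle)
    return " ".join(sentences)

def recall_rate(call_model, needle: str, question: str, expected: str, trials: int = 10) -> float:
    """Fraction of trials in which the model retrieves the buried fact."""
    hits = 0
    for _ in range(trials):
        context = build_haystack("The quick brown fox jumps over the lazy dog.",
                                 needle, num_sentences=50_000)
        answer = call_model(f"{context}\n\nQuestion: {question}")
        hits += expected.lower() in answer.lower()
    return hits / trials

# Usage (assuming you've wired call_model up to some long-context model):
# recall_rate(call_model,
#             needle="The secret launch code is 7-4-1-9.",
#             question="What is the secret launch code?",
#             expected="7-4-1-9")
```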
There are a lot of cool things that you can do with a 10x bigger context window. You can toss in an entire codebase for analysis, test generation, or API documentation. You can "read" a novel or "watch" a movie in seconds - and get back detailed summaries. You can filter massive datasets like server logs, financial data, or research measurements to look for patterns or find outliers.
And you can start to build a crude approximation of "short-term memory," one that's much closer to a human's than what we currently have. One of the most significant downsides of ChatGPT is how forgetful it is - after a few dozen messages, it loses context from the beginning of your conversation. There are workarounds - we can summarize past context and feed it back in as part of the prompt - but they still eat up precious tokens.
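Here's a minimal sketch of that "summarize old context" workaround, assuming a hypothetical `call_model(prompt) -> str` helper and a crude ~4-characters-per-token estimate (neither is a real library API):

```python
# A minimal sketch of the "summarize old context" workaround. `call_model` is a
# hypothetical LLM helper, and the token estimate is a rough heuristic.

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude approximation, good enough for budgeting

def build_prompt(history: list[str], new_message: str, call_model,
                 max_tokens: int = 8_000) -> str:
    """Keep recent messages verbatim; compress everything older into a summary."""
    budget = max_tokens // 2
    recent: list[str] = []
    older: list[str] = []
    used = 0
    for i, message in enumerate(reversed(history)):    # walk newest-first
        used += estimate_tokens(message)
        if used < budget:
            recent.insert(0, message)
        else:
            older = history[:len(history) - i]          # everything older, in order
            break

    summary = ""
    if older:
        summary = call_model("Summarize this conversation in a few sentences:\n"
                             + "\n".join(older))

    prefix = f"Summary of earlier conversation: {summary}\n\n" if summary else ""
    return prefix + "\n".join(recent) + f"\nUser: {new_message}"
```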
Ten million tokens, on the other hand, is about 7.5 million words. That's roughly fifty thousand minutes of conversation or thirty thousand minutes of reading - several hundred hours either way. Would you remember 99% of the details from your last few hundred hours of conversation? When paired with some form of "long-term memory," we may soon be able to talk to LLMs that can remember who we are and what we've previously discussed.
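For a sense of scale, here's the back-of-the-envelope math (the per-token and words-per-minute rates are rough, commonly cited assumptions):

```python
# Back-of-the-envelope math behind those numbers. The rates are rough assumptions.
tokens = 10_000_000
words = tokens * 0.75            # ~0.75 English words per token

speaking_rate = 150              # words per minute in conversation
reading_rate = 250               # words per minute of silent reading

print(words / speaking_rate)     # ~50,000 minutes (~830 hours) of conversation
print(words / reading_rate)      # ~30,000 minutes (~500 hours) of reading
```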
The other impact of Gemini's huge context window is that it makes much of our existing RAG (retrieval-augmented generation) tooling less valuable (and potentially obsolete in the long run). We've done some RAG tutorials before, but the main idea is breaking a large corpus into searchable chunks, then retrieving the most relevant chunks and passing them to an LLM so it has extra context when generating an answer.
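For reference, that pipeline looks roughly like this - a bare-bones sketch, with hypothetical `embed` and `call_model` helpers standing in for whatever embedding and LLM APIs (and vector database) a production system would use:

```python
# Bare-bones sketch of the RAG pipeline described above: chunk a corpus, embed
# the chunks, retrieve the most similar ones to the query, and pass them to the
# LLM as context. `embed` and `call_model` are hypothetical stand-ins.
import numpy as np

def chunk(text: str, size: int = 1_000) -> list[str]:
    return [text[i:i + size] for i in range(0, len(text), size)]

def top_k(query: str, chunks: list[str], embed, k: int = 5) -> list[str]:
    chunk_vecs = np.array([embed(c) for c in chunks])
    query_vec = np.array(embed(query))
    # cosine similarity between the query and every chunk
    scores = chunk_vecs @ query_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(query: str, corpus: str, embed, call_model) -> str:
    context = "\n---\n".join(top_k(query, chunk(corpus), embed))
    return call_model(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```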
As our context windows grow, the use cases for RAG become more narrow. There are some situations where it still makes sense, like searching vast databases of scholarly papers or building AI-powered search engines. But many of the consumer-facing use cases I've seen for RAG, like chatting with your company's internal documents, will probably be much better served by a 1 million token context window with 99% retrieval accuracy.
Groq's turbocharged tokens
Groq is a pretty interesting example of a company that has been doing its thing for a long time, quietly, and is now having a bit of a viral moment. The website demo hosts open-source models and serves responses incredibly fast.
Many have been wondering how that's possible - and the answer is custom AI chips. Groq has been building chips explicitly designed for AI since 2016: its CEO, Jonathan Ross, previously worked at Google on the Tensor Processing Unit, or TPU, which powers much of Google's cloud AI workloads.
In digging into Groq, I stumbled across a Forbes article from 2021, and it's remarkable how much of the AI narrative resembles that of 2024.
Groq’s chips are next-generation ones that are geared towards so-called inference tasks. They use knowledge from deep learning to make new predictions on data. Groq says that its chips, called tensor streaming processors, are 10 times faster than competitors. “It’s the most powerful one ever built,” Ross says. “It’s not just how many operations per second, but latency.”
Chipmakers have been in a race to power the rapid development in AI applications. Nvidia, whose chips were first invented for rendering video games, has been in the lead.
Instead of "tensor streaming processors," their current LLM demos are powered by LPUs, or Language Processing Units. I won't go into all of the technical details - Groq's website does a pretty good job of explaining why LPUs are much better for doing LLM interference.
Suffice it to say, the LPU is intended to provide several advantages over the traditional graphics processing units (GPUs) commonly used for AI tasks: faster token generation, lower latency, and increased efficiency. And as far as I can tell, their chips do the job - benchmarks from Artificial Analysis point to Groq serving tokens up to 13 times faster than Microsoft Azure.
When it comes to the capabilities enabled by Groq's lightning-fast speed, those are a little harder to predict. The obvious use case is real-time AI conversations - talking to a chatbot with near-zero latency and getting responses back instantly. In a situation like that, we would come pretty close to simulating a live conversation with a person.
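To put rough numbers on it (the throughput figures below are illustrative assumptions, not benchmarks):

```python
# Rough latency math for a single chatbot reply. The throughput numbers are
# illustrative assumptions, not measured benchmarks.
reply_tokens = 300  # a few paragraphs of response

for label, tokens_per_second in [("typical GPU-backed API", 40),
                                 ("Groq-style LPU serving", 500)]:
    seconds = reply_tokens / tokens_per_second
    print(f"{label}: ~{seconds:.1f}s for {reply_tokens} tokens")

# ~7.5s versus ~0.6s: the difference between waiting on a chatbot and something
# that feels like a live conversation.
```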
There are more opportunities outside of chatbots - agents that can handle data in real time, at high speed, could unlock use cases in finance, medicine, and robotics. And if LPUs (or other specialized chips) can handle media like audio and video, then we're on the verge of being able to create infinite, real-time movies or even simulated worlds.
In truth, I'm sure there are many, many use cases that are waiting to be discovered - we'll figure them out once we actually have blazing fast LLMs and other models in the hands of many. For now, the only way to build with LPUs is to partner with Groq (and likely spend millions on an on-prem system), or apply for API access (research teams only).
The limits of scale
In all of this, I still haven't mentioned Sora, which many are hailing as a massive leap forward in video generation. I'm inclined to see it as impressive, but not necessarily a 10x breakthrough, given things like Google's Project Lumiere. But it's without a doubt at least a 10x improvement from where AI-generated video was 12 months ago, and shows that it's not just text: videos can also massively benefit in quality from simply adding more computing power.
That, of course, hints at the question(s) at the heart of all these improvements: When will "adding more computing power" start to produce diminishing returns? How much more computing power can we build as a species? And with each 10x improvement bringing fundamentally new capabilities, how many more jumps should we expect?
The short answer is: we don't know. I've talked before about how we're somewhere on the S-curve with AI:
The adoption of every new technology, from landlines to laptops, takes the shape of an S-curve, or sigmoid. It starts off slow, while the tech is in its infancy. Then, it begins to grow exponentially as mass adoption occurs. And finally, inevitably, it plateaus as the technology reaches maturity. With the AI of today (ChatGPT and friends), we really don't know where we are on the S-curve.
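For the mathematically inclined, that S-curve is just the logistic function - slow start, steep middle, flat top (the parameters here are purely illustrative):

```python
# The S-curve is the logistic (sigmoid) function: slow start, steep middle,
# plateau at saturation. Parameters here are purely illustrative.
import math

def adoption(t: float, ceiling: float = 1.0, midpoint: float = 0.0,
             steepness: float = 1.0) -> float:
    return ceiling / (1 + math.exp(-steepness * (t - midpoint)))

for t in range(-6, 7, 2):
    print(t, round(adoption(t), 3))  # crawls, then surges, then flattens out
```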
We don't know when increasing parameters or datasets will plateau. We don't know when we'll discover the next breakthrough architecture akin to Transformers. And we don't know how good GPUs, or LPUs, or whatever else we're going to have, will become.
Yet, when you consider that Moore's Law held for decades... suddenly Sam Altman's goal of raising seven trillion dollars to build AI chips seems a little less crazy.