Large language models have been around for a few years, and we're starting to get the hang of building actual products with them. Every company is wondering how to "add AI" and take advantage of the new capabilities that LLMs bring. But the more you work with them, the more you realize these models have limitations.
It may not seem obvious when you're using ChatGPT for the first time, or if you're building on a simple prototype or demo. But as you get more experienced and comfortable with large language models, you notice some rough edges. I'm far from the first person to point this out - there have been some excellent analyses of these downsides in the past.
However, I wanted to take a look at some of the major constraints when it comes to building applications with large language models.
Hallucinations
First and foremost are hallucinations. I've said it before and I'll say it again: ChatGPT will lie to you.
At its core, a language model is a word prediction machine. If we're being reductive, it's autocomplete on steroids. At a large enough scale, this kind of tech can appear to grasp how the world works and what is true or false - but that isn't the case.
As a result, we get hallucinations - facts and figures the model confidently cites that are incorrect or invented out of thin air. While treating an LLM's output as gospel can be tempting, doing so can get you into trouble. One of the most extreme cases we've seen involves New York lawyers now facing sanctions after not only using ChatGPT to write a legal brief, but relying on hallucinated court cases as precedent in their arguments.
Hallucinations are a pretty major hurdle for LLMs. We're working to solve them, but they also go hand-in-hand with the lack of a long-term memory system for language models.
Long Term Memory
Every model has some concept of a "knowledge cutoff date" - the point past which nothing appears in its training data. If you know a model's knowledge cutoff, you know not to trust anything it says about events after that date. But wouldn't it be better to give language models access to data in real time, rather than trying to bake it all in during training?
We have some workarounds, like retrieval-augmented generation (RAG), which pulls the most relevant pieces of information out of a document library and attempts to layer that information onto an existing language model. But RAG has its own limitations - the output can get sloppy as multiple sources are added, and your retrieval queries might not be pulling back the answers you actually want.
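To make that concrete, here's a rough sketch of the RAG pattern in Python. The embed() stand-in and the tiny document list are placeholders of my own - a real system would call an embedding model and a vector database, and would hand the assembled prompt to whatever LLM you're using.

```python
# Minimal retrieval-augmented generation (RAG) sketch.
# embed() is a stand-in: a real system would call an embedding model.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: deterministic random vector per string.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.standard_normal(384)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

documents = [
    "Our refund policy allows returns within 30 days.",
    "Support hours are 9am to 5pm Eastern, Monday through Friday.",
    "Enterprise plans include a dedicated account manager.",
]
doc_vectors = [embed(d) for d in documents]

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank documents by similarity to the query and keep the top k.
    q = embed(query)
    ranked = sorted(zip(documents, doc_vectors),
                    key=lambda dv: cosine(q, dv[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def build_prompt(query: str) -> str:
    # Stuff the retrieved snippets into the prompt sent to the LLM.
    context = "\n".join(f"- {d}" for d in retrieve(query))
    return ("Answer using only the context below. If the answer isn't there, say so.\n\n"
            f"Context:\n{context}\n\nQuestion: {query}")

print(build_prompt("When can customers get a refund?"))
```

The weak points the paragraph above describes live in retrieve() and build_prompt(): if the wrong snippets come back, or too many get stuffed in, the model's answer degrades no matter how good the underlying LLM is.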
A rough analogy here would be something like SQL databases. We take for granted that modern databases can be queried and will return fast, accurate, consistent answers. But that isn't the case with LLMs - relying on their long-term memory (especially given hallucinations) is risky at best, and dangerous at worst.
Short Term Memory
LLMs are better at managing information in their short-term or working memory. We refer to this as a "context window" - the number of words (or tokens) a model can keep track of before forgetting previous text or messages.
Right now, ChatGPT has a context window of up to 32,000 tokens, depending on the model (larger windows are available via the API). Anthropic's Claude model has a working memory of up to 100,000 tokens, which is impressive. But anecdotally, it doesn't seem to do a stellar job of using that whole context, and its answers degrade as conversations get longer.
To some extent, increasing the context window can solve a range of problems. In my mind, the "realism" of state-of-the-art chatbots is mainly due to the bigger context windows. Earlier chatbots would immediately forget what you had said to them, while GPT-4 can maintain a coherent conversation for quite a while. But at a certain point, the conversation is lost to the model's working memory, and we're facing our long-term memory problem again.
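If you're managing this yourself, the usual trick is to trim the oldest turns so the conversation stays under the model's limit. Here's a rough sketch using OpenAI's tiktoken tokenizer; the 8,000-token budget is an illustrative number I picked, not tied to any particular model tier.

```python
# Trim the oldest turns of a conversation to fit a token budget.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(message: dict) -> int:
    return len(enc.encode(message["content"]))

def trim_history(messages: list[dict], budget: int = 8000) -> list[dict]:
    # Keep the system prompt, then drop the oldest user/assistant turns
    # until what remains fits under the token budget.
    system, rest = messages[0], messages[1:]
    while rest and count_tokens(system) + sum(count_tokens(m) for m in rest) > budget:
        rest.pop(0)
    return [system] + rest
```

Of course, dropping old turns is exactly the long-term memory problem showing up again - the model simply never sees what got trimmed.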
Prompting
Yet even when working within the current memory constraints of LLMs, things aren't as stable as we might like. The fact that "prompt engineering" is a job is a testament to this - it can take quite a lot of coaxing to get what we want out of models. If you're trying to build a reliable, consistent product, that means finding ways to manage your library of prompts, and benchmarking their results.
And even if you can get the perfect prompt for your use case, you might find that 1) it doesn't work 20% of the time, 2) it stops working with the latest version of GPT-4, or 3) changing a single word produces dramatically different results. That means more tooling, more edge case checking, and more complexity to manage your LLM infrastructure.
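One lightweight defense is to treat prompts like code and run them against a fixed test suite whenever the wording or the underlying model changes. A sketch of what I mean - the template, the test cases, and the complete() callable are all placeholders for your own setup:

```python
# Prompt-regression check: run a template against fixed cases and flag failures.
from typing import Callable

PROMPT = "Extract the city name from this sentence, answering with a single word:\n{text}"

TEST_CASES = [
    {"text": "I flew into Denver late last night.", "expect": "denver"},
    {"text": "The conference is being held in Toronto this year.", "expect": "toronto"},
]

def run_suite(complete: Callable[[str], str]) -> None:
    # complete() is whatever function calls your model and returns its text.
    passed = 0
    for case in TEST_CASES:
        answer = complete(PROMPT.format(text=case["text"])).strip().lower()
        if case["expect"] in answer:
            passed += 1
        else:
            print(f"FAIL: {case['text']!r} -> {answer!r}")
    print(f"{passed}/{len(TEST_CASES)} cases passed")

# Fake model so the script runs end to end; swap in a real API call.
run_suite(lambda prompt: "Denver" if "Denver" in prompt else "Paris")
```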
This is a very nascent space, and there are plenty of startups trying to solve these problems. That said, it's likely too early to say precisely what the best practices are around managing these issues - we're all figuring it out in real-time.
Speed / Cost
When using LLMs, you quickly encounter a delicate balance of speed, cost, and quality. The best results come from models like GPT-4, but using it is both slow and (relatively) expensive. GPT-3.5-Turbo is extremely fast and cheap, but isn’t as sophisticated in many areas.
Besides third-party models, you can choose to fine-tune your own model (whether for accuracy or privacy reasons), but that brings with it a host of new challenges. You'll need to curate your training data, set up training runs, and benchmark the results of your fine-tuning. You'll need to host the model somewhere and set up streaming API endpoints. And you'll need to pay for ongoing inference, which in all likelihood will cost as much as or more than using something like GPT-3.5-Turbo.
Each application is different, so there's no one-size-fits-all solution. You may need to run a combination of fine-tuned models and GPT-4, or maybe something like Llama 2 just works out of the box for your needs.
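One common pattern for mixing models is a simple router that sends easy requests to the cheap, fast model and escalates harder ones. A minimal sketch - the model names and thresholds here are placeholders, not a recommendation:

```python
# Route requests between a cheap model and a stronger, pricier one.
def choose_model(prompt: str) -> str:
    complex_markers = ("step by step", "analyze", "compare", "write code")
    if len(prompt) > 2000 or any(m in prompt.lower() for m in complex_markers):
        return "gpt-4"          # slower and pricier, but more capable
    return "gpt-3.5-turbo"      # fast and cheap for routine requests

print(choose_model("Summarize this paragraph in one sentence."))
print(choose_model("Analyze the trade-offs between these two designs step by step."))
```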
Quicksand
There's a term I've used for this - "AI quicksand." Right now, things are progressing so fast that it's easy to get squashed by new launches from OpenAI or Google. Just this past week, OpenAI announced new image and voice features for ChatGPT, killing dozens of products that had been trying to do the same thing.
Even if you're avoiding ChatGPT, the same problem applies to the tools and infrastructure you're using. The "best" open-source model might be totally different in 6 months. Building your own feature for LLMs to use tools or templatize prompts might be a waste of time when leading frameworks like LangChain or LlamaIndex add new features weekly.
There is some value in being a first mover, but not much. And as new models and features shift the ground beneath our feet, choosing your tech stack has never been more crucial, or more challenging.
Alignment
And of course, there's always the underlying problem of alignment. I don't mean alignment in the "AI is going to kill us all" kind of way - I mean more in the "AI is a tool that might not do what I expect" kind of way.
In its purest form, a language model tries to guess the next word in a sequence. It has no concept of following instructions, being helpful, or avoiding dangerous or offensive content. Some may now take this for granted with ChatGPT, but OpenAI has done a lot of work to avoid confusing or problematic responses. Their models are given plenty of human feedback on what constitutes a "good" or "bad" answer - feedback aimed at making responses more helpful while avoiding harmful content.
If you're not using ChatGPT, or if you're fine-tuning a foundation model, you'll want to take this into account. Billions of pieces of text went into the model's training, and no human read all of it. The model could decide to go on an expletive-filled rant, or spit out someone's specific address and phone number. You can work your way around this by fine-tuning a model to get the specific kind of output you want, but again - that means more infrastructure.
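At a minimum, that usually means putting a filter between the raw model and your users. Here's a deliberately crude sketch of the idea - the regex and blocklist are placeholders, and production systems typically layer on a dedicated moderation model or API instead:

```python
# Minimal output filter between a raw model and end users.
import re

PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
BLOCKLIST = {"some_banned_term", "another_banned_term"}  # placeholder terms

def is_safe(output: str) -> bool:
    if PHONE_RE.search(output):
        return False  # looks like a leaked phone number
    if any(term in output.lower() for term in BLOCKLIST):
        return False
    return True

def respond(raw_output: str) -> str:
    return raw_output if is_safe(raw_output) else "Sorry, I can't share that."
```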
What else?
These are the challenges that I've seen building my own projects and talking to others building LLM-powered applications. If you're one of those people - let's chat! I'm interested in interviewing founders, investors, or anyone with a firsthand perspective on building AI companies.
But what else is there? There are certainly challenges to AI research and training these massive models that I didn't outline here: reliance on tokenizers, high pre-training costs, lack of reproducibility. What are you seeing, though? Leave a comment or a reply!
Great roundup! I'm working on combining language models with formal methods (e.g., via intermediate code generation) to try and make them more robust, but it is definitely proving extra hard: even if you embed the actual correct answer in the prompt, the LM can still bypass it completely and pull from its faulty, hallucination-prone long-term memory. One approach I'm very excited about is grammar-restricted output.