In space travel, there’s a concept known as the "Wait Calculation" or the "Wait Paradox."
The paradox suggests that as spaceship technology advances, we'll drastically shorten interstellar travel times. That means ships launched later could overtake those launched earlier. So there's a difficult choice - launch now and bet on today's tech, or wait for the possibility of faster, better options.
In AI, we see a similar problem. The field is advancing so fast that launching a product today risks being outpaced by future technology. There is some value in being a first mover, but not much.
As a result, AI infrastructure is about as stable as quicksand when it comes to building products. And as new models and features shift the ground beneath our feet, choosing your tech stack has never been more crucial, or more challenging.
Product quicksand
An example use case is building a custom wrapper around a large language model. Say you're building a fine-tuned version of ChatGPT, or making a productivity tool using Claude. For most startups, getting the technology stack right is a key early decision. A bad choice won't kill the company outright, but it can drag on output. If the founders are focused on the stability of their technology, they're not focused on product and growth.
But large language models are advancing so fast it's not clear which one to choose today. If you want your chatbot to have the biggest working memory possible, then Claude is the way to go (for now). The latest Claude models have a context window of 100,000 tokens (~75,000 words). On the other hand, GPT-4 is still considered the most advanced model available. And while it only has an 8,000 token context window, OpenAI is also slowly rolling out a 32,000 token model.
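To make the constraint concrete, here's a minimal sketch of checking whether a prompt fits a model's context window, using OpenAI's tiktoken tokenizer. The fits_context helper is a made-up name, and the 8,192-token limit is an assumption based on the figures above; Claude uses its own tokenizer, so for non-OpenAI models a count like this is only an approximation.

```python
import tiktoken  # OpenAI's open-source tokenizer library

def fits_context(text: str, model: str = "gpt-4", limit: int = 8192) -> bool:
    """Hypothetical helper: does `text` fit in `model`'s context window?"""
    enc = tiktoken.encoding_for_model(model)
    # tiktoken only knows OpenAI's encodings, so treat the token count
    # as a rough proxy when estimating for other providers' models.
    return len(enc.encode(text)) <= limit

print(fits_context("A short prompt."))  # True - well under 8,192 tokens
```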
So at best, you're making an educated bet. And the wrong choice means either being tied to the wrong technology or spending precious time rewriting your code for a different model.
Some projects are trying to make this easier by building layers on top of different LLMs. Ideally, using these projects means changing models only takes a simple configuration change. But these projects are running into their own quicksand when it comes to duplicating LLM features.
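To illustrate, here's a minimal sketch of what such an abstraction layer might look like. The ChatModel interface and class names are hypothetical - not from any particular project - and the underlying calls follow the current openai and anthropic Python SDKs, which may well change as fast as everything else here.

```python
import os
from abc import ABC, abstractmethod

import anthropic
import openai


class ChatModel(ABC):
    """Hypothetical provider-agnostic interface."""

    @abstractmethod
    def complete(self, prompt: str) -> str: ...


class OpenAIChat(ChatModel):
    def __init__(self, model: str = "gpt-4"):
        self.model = model

    def complete(self, prompt: str) -> str:
        resp = openai.ChatCompletion.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return resp["choices"][0]["message"]["content"]


class ClaudeChat(ChatModel):
    def __init__(self, model: str = "claude-v1"):
        self.client = anthropic.Client(os.environ["ANTHROPIC_API_KEY"])
        self.model = model

    def complete(self, prompt: str) -> str:
        resp = self.client.completion(
            prompt=f"{anthropic.HUMAN_PROMPT} {prompt}{anthropic.AI_PROMPT}",
            model=self.model,
            max_tokens_to_sample=512,
        )
        return resp["completion"]


# Swapping providers becomes a one-line configuration change:
PROVIDERS = {"openai": OpenAIChat, "claude": ClaudeChat}
llm = PROVIDERS[os.environ.get("LLM_PROVIDER", "openai")]()
```

The catch is that a common interface only covers the features all the models share.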
Feature quicksand
In case you missed it, OpenAI launched ChatGPT function calling two weeks ago. With it, you can now "describe" functions or API calls to ChatGPT, and it can decide to use them. If it decides to use a function, it will send back structured data for you to pass to those functions or API calls.
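Here's roughly what that looks like with the openai Python library. The get_weather function is a made-up example; the shape of the request and response follows OpenAI's function-calling announcement.

```python
import json

import openai

# Describe a (hypothetical) function to the model using JSON Schema.
functions = [{
    "name": "get_weather",
    "description": "Get the current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}]

resp = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",  # first model version with function calling
    messages=[{"role": "user", "content": "What's the weather in Oslo?"}],
    functions=functions,
    function_call="auto",  # let the model decide whether to call anything
)

msg = resp["choices"][0]["message"]
if msg.get("function_call"):
    # The model only returns a name plus JSON-encoded arguments;
    # actually executing the function is up to your code.
    args = json.loads(msg["function_call"]["arguments"])
    print(msg["function_call"]["name"], args)  # get_weather {'city': 'Oslo'}
```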
It might sound tedious, but this is a pretty big deal. It gives ChatGPT the native ability to use software tools. And the ability to delegate work and execute sub-tasks is a building block for autonomous AI agents. ChatGPT can create arbitrary plans and use external software to make them a reality.
The thing is, though, this isn't a new idea. Langchain, a library for building complex applications on top of LLMs, has had equivalent tool-calling functionality for a while. It's the scaffolding behind many of the open-source agent projects we've seen, like AutoGPT.
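For comparison, here's a minimal sketch of tool use with LangChain's agent API (as it stands today - the library moves quickly, so treat this as illustrative). The word_count tool is a made-up example.

```python
from langchain.agents import AgentType, Tool, initialize_agent
from langchain.llms import OpenAI

def word_count(text: str) -> str:
    return str(len(text.split()))

tools = [Tool(
    name="WordCounter",
    func=word_count,
    description="Counts the words in a piece of text.",
)]

# ZERO_SHOT_REACT_DESCRIPTION prompts the model to reason step-by-step
# about which tool to use - no native function-calling support needed.
agent = initialize_agent(
    tools, OpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=True,
)
agent.run("How many words are in 'the quick brown fox jumps over the lazy dog'?")
```

Same capability, two different mechanisms - which is exactly the duplication problem.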
But with ChatGPT functions, Langchain's developers have new competition for users building agents. And to stay competitive, they still need to duplicate their existing work by integrating ChatGPT's new feature. It's a bit of a hamster wheel, and one that will get worse as LLMs diversify their feature sets.
This is a tiny example, though - OpenAI has been doing this all year at a far bigger scale. GPT-4 mostly obsoleted GPT-3, except it can't be fine-tuned - meaning companies that customized GPT-3 now have a difficult choice. Should they switch to the state-of-the-art model, double down on their customizations, or wait and hope GPT-4 becomes fine-tunable?
Open source quicksand
Of course, there's also the question of whether to use a proprietary model at all. If your organization requires strict data security, then self-hosting an open-source model is the obvious choice. Or, if you have the resources and talent and need to own your tech stack, you should likely go the open-source route. But for everyone else, it can be a hard decision.
Over the last few years, we've seen powerful open-source models surpassed by private ones. Stable Diffusion is a foundational open-source model, and it has improved at a remarkable rate. But Midjourney, reportedly built atop Stable Diffusion, has somehow improved even faster. It's hard to argue that Midjourney is not the most realistic AI image generator right now.
And this is true across multiple mediums - images, speech, transcription, chat. ElevenLabs has more realistic voices than Coqui. GPT-4 is still better than Llama, Alpaca, Falcon, or any other zoological open-source LLM. In today's world, centralized companies have the resources and focus to create better models.
But that might change! With image generation, a new technique known as LoRA (low-rank adaptation) decentralizes model fine-tuning. Users with everyday hardware can fine-tune models at a far faster rate than a single company could. So while Midjourney may be the best general-purpose option, Stable Diffusion + LoRA might be better for your specific use case.
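The trick is that LoRA trains small low-rank "adapter" matrices instead of the full model weights. Here's a minimal sketch using Hugging Face's peft library on a small language model - the same idea applies to Stable Diffusion's attention layers via similar tooling - with illustrative hyperparameters.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# A small stand-in model; the same recipe scales to much larger ones.
base = AutoModelForCausalLM.from_pretrained("gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the updates
    target_modules=["c_attn"],  # which layers get adapters (model-specific)
    lora_dropout=0.05,
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# -> roughly "trainable params: 294,912 || all params: 124,734,720 || trainable%: 0.24"
# Only the tiny adapter weights are trained - which is why a consumer GPU is enough.
```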
Currently, you have to choose whether to rent or own your AI technology - without knowing which option will produce better models in the long run.
So what to build with, then?
To be clear, I'm not suggesting that you should avoid working with generative AI tools entirely. Regardless of your choice, you're still ahead of the curve by using and building with these new technologies, period. It's still crazy to me that only 14% of US adults have tried ChatGPT. But you can do a few things when building to try and save yourself some future headaches.
Be wary of integrating too deeply with any given AI tool. While tight integrations can lead to better UX, know that you're making a technology bet. And until the dust settles, it's better to stay agile with lightweight integrations across the various large language models.
If you have the resources, have a migration plan to avoid vendor lock-in. Vendor lock-in isn't a new problem - companies have dealt with it for decades, particularly with cloud platforms. And we can use similar tactics to plan around AI infrastructure changes. Understand what the alternatives are and how you would migrate to them. Develop fallback plans in case you need to switch platforms.
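Building on the hypothetical ChatModel interface sketched earlier, a fallback plan can be as simple as a wrapper that tries a primary provider and degrades to a backup:

```python
class FallbackChat(ChatModel):
    """Hypothetical wrapper: try the primary provider, fall back on failure."""

    def __init__(self, primary: ChatModel, backup: ChatModel):
        self.primary, self.backup = primary, backup

    def complete(self, prompt: str) -> str:
        try:
            return self.primary.complete(prompt)
        except Exception:
            # In production, log and alert here - silently degrading to a
            # different model can noticeably change output quality.
            return self.backup.complete(prompt)

llm = FallbackChat(OpenAIChat(), ClaudeChat())
```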
Work on valuable features that are not AI-focused. Any product that's just a thin wrapper around ChatGPT or Stable Diffusion will not have a long shelf life. But if your product is valuable even without AI, then having the latest models and features won't be critically important. And you'll have more time to wait or switch if you need to.
If this all seems a bit overwhelming - it is! Even folks doing this sort of research for a living are having difficulty keeping up. Some aren't keeping up at all - ChatGPT has obsoleted the work of ML teams at tech companies, universities, and governments. But as I said, you're still way ahead of the curve by using this tech. And the uncertainty is the price of being an early adopter.
I've actually been wondering just how difficult it is to "plug into" the underlying LLM from a coding perspective.
In my simple brain it was about "just" making some boxes for input that send it to the model and "just" some UI to make the output formatted to whatever the interface/app is supposed to do.
So I kind of assumed switching which model you communicate with (Claude, GPT, or other) was a relatively simple tweak. Sounds like it's a lot more complex and the degree of lock-in isn't trivial.
By the way, I'm quite sure Midjourney isn't built on top of Stable Diffusion. There was a brief period late last year when Midjourney V3 had a "--beta" parameter that generated images using a separate Stable Diffusion-based model. Then Emad (Stability AI) miscommunicated / oversold what was happening, which is where the confusion came from. They also dropped SD altogether for V4 and onwards. (More good discussion about this in the comments here: https://www.reddit.com/r/StableDiffusion/comments/10liqip/if_midjourney_runs_stable_diffusion_why_is_its/)
Then again, since the MJ model is a black box, it's hard to have a definitive read on it.