Editor’s note: An earlier version of this post appeared on AI Supremacy.
Most generative AI enthusiasts have heard of Stable Diffusion, the open-source image-generation model that helped kick off the current deluge of AI-generated images. But many people don't know that the company behind it, Stability AI, has released an entire array of models - pushing the boundaries of what's possible in image, video, language, code, and 3D modeling.
In the past six months alone, Stability has released or announced an impressive roster of new AI models, giving the open-source community a chance to keep pace with proprietary AI. If you're a proponent of democratizing access to AI, it's worth understanding what's now available and what's coming soon.
Stable Diffusion 3: Open source, state-of-the-art images
The splashiest of the new releases is Stable Diffusion 3, the company's most capable text-to-image model. SD3 is supposed to perform much better with multi-subject prompts, written text, and overall image quality. Stability CEO Emad Mostaque also noted that it will be able to accept multimodal inputs, including video.
Stable Diffusion 3 is actually a family of models ranging in size from 800M to 8B parameters. Unlike previous versions, it uses a new diffusion transformer architecture, which the company has detailed in a recent research paper. It also employs "flow matching" to smoothly transition from noise to image without simulating every intermediate step.
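The core intuition behind flow matching can be shown with a toy sketch (illustrative only - this is not SD3's actual implementation, and all names and shapes here are invented): points along a straight path from noise to data have a constant velocity, so a model trained to predict that velocity can take large steps instead of simulating many small ones.

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.standard_normal(4)            # x0: a stand-in for pure noise
image = np.array([1.0, 2.0, 3.0, 4.0])    # x1: a stand-in for a real image

def interpolate(t, x0, x1):
    """Point on the straight-line path from x0 to x1 at time t in [0, 1]."""
    return (1 - t) * x0 + t * x1

# Along this straight path, the regression target at every t is the same
# constant velocity, x1 - x0 -- that's what a flow-matching model learns.
velocity = image - noise
assert np.allclose(interpolate(0.5, noise, image), noise + 0.5 * velocity)

# A single big Euler step from t=0 to t=1 recovers the endpoint exactly,
# which is the intuition behind skipping intermediate denoising steps.
x = noise + 1.0 * velocity
assert np.allclose(x, image)
```

In the real model, a learned network replaces the known `velocity`, and the path runs in a high-dimensional latent space rather than over four numbers.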
Early samples suggest SD3 outputs are comparable to other state-of-the-art models like DALL-E 3 and Midjourney in quality and prompt-following ability. But for now, the model is only available through an early preview waitlist.
Stable Cascade: A new, more efficient image generator
Stable Cascade was announced just before SD3. It's also a text-to-image model, but it has some critical architectural differences from the company's other image generators. First, it's based on the Würstchen architecture and is internally composed of three models called "Stages." Without getting too deep into the technical details, the latter two stages will be available in multiple sizes, ranging from 700M to 3.6B parameters.
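The pipeline shape can be sketched as a chain of stages (a hypothetical toy - the function names, shapes, and expansion ratios below are invented for illustration, not Würstchen's real dimensions): a small text-conditioned prior produces a highly compressed latent, which successive decoders expand toward the final image.

```python
# Toy sketch of a three-stage cascade in the spirit of Würstchen.
# Each "stage" is a stand-in for a neural network; here they just
# expand a list so the compression ratios are visible.

def stage_c(prompt):
    """Prior: text -> tiny compressed latent (4 values here)."""
    return [float(len(prompt))] * 4

def stage_b(latent):
    """Mid decoder: expand the tiny latent 4x into a larger latent."""
    return [v / 2 for v in latent for _ in range(4)]

def stage_a(latent):
    """Final decoder: expand the latent 4x again into 'pixels'."""
    return [v for v in latent for _ in range(4)]

pixels = stage_a(stage_b(stage_c("a red fox")))
assert len(pixels) == 4 * 4 * 4  # each stage works at a coarser level
```

Because the expensive text-conditioned generation happens only in the tiny Stage C latent space, training and fine-tuning that stage is far cheaper than training a full-resolution diffusion model.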
As a result, Stable Cascade is much more efficient than previous models and can be far easier to train and fine-tune on consumer hardware - no massive GPU farm needed. Early testing has also indicated that the model offers much more aesthetically pleasing images right from the start, which has historically been a drawback of Stable Diffusion compared to other tools like Midjourney.
The existence of Stable Cascade suggests that Stability is pursuing different types of image-generation models for various use cases. While Stable Diffusion 3 is meant to be a flagship model, it likely requires a significant amount of compute to run at scale. With Stable Cascade (and SDXL Turbo, which we'll talk about next), there may be an option for smaller teams or individuals to run a model locally and quickly, even if it isn't the most advanced option available. Stable Cascade's licensing seems to reflect that positioning: it's currently unavailable for commercial use.
SDXL Turbo: Real-time text-to-image creation
In the beginning, there was Stable Diffusion. The first version of Stability AI's image model (version 1.4) was released in August 2022, with versions 1.5, 2.0, and 2.1 all following within just four months. But as the industry advanced, there was a need for bigger and better models, so Stability built SDXL, which was over three times larger than prior versions. However, the bigger size meant slower image generation - which led to SDXL Turbo.
Released last November, SDXL Turbo enables text-to-image generation in real time. The quality is likely lower relative to other models like SD3, but seeing how fast the model can run is very impressive. I’m used to waiting seconds, if not minutes, for images to generate, and seeing SDXL Turbo create new images as fast as I can type out a prompt is magical.
A new distillation technique, Adversarial Diffusion Distillation, makes this possible, reducing the number of "steps" required to produce a final image from 50 down to one. Adding more steps will still improve the quality of the final image, but SDXL Turbo's use case is speed.
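A toy sampler makes the step-count trade-off concrete (a minimal sketch with invented numbers - neither predictor resembles the real networks): an undistilled model that removes only a little noise per step needs many iterations, while a distilled model trained to predict the final image can jump there in one.

```python
import numpy as np

target = np.array([0.5, -1.0, 2.0])  # stand-in for the finished image
start = np.zeros(3)                   # stand-in for the noisy starting point

def base_step(x):
    """Undistilled model: each step moves only 10% of the way to the target."""
    return x + 0.1 * (target - x)

def distilled_step(x):
    """Distilled model: trained to predict the final image in a single shot."""
    return target.copy()

# The slow sampler needs ~50 small steps to get close...
x = start
for _ in range(50):
    x = base_step(x)
assert np.allclose(x, target, atol=0.02)

# ...while the distilled sampler lands on the same answer in one step.
assert np.allclose(distilled_step(start), target)
```

The residual after 50 steps shrinks as 0.9^50 (about 0.5%), which is why extra steps still nudge quality upward even when one step is already usable.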
Like some of the other models we'll discuss, commercial use of SDXL Turbo is available through Stability's paid membership. For everyone else, it's available as a research preview - the code and weights are freely accessible, but you aren't allowed to use it to make money.
Stable Video Diffusion: AI-generated video clips
As the name suggests, Stable Video Diffusion brings Stable Diffusion's powerful image synthesis capabilities to video. The models (a 14-frame version and a 25-frame version) were also announced last November and brought with them a number of highly realistic-looking (if very short) video clips. At the time of the announcement, the models were slightly ahead of competitors such as Runway and Pika Labs (though two months later, OpenAI would stun the world with its video model Sora).
The company has more plans for the model, as it has noted that Stable Video can be easily adapted to downstream tasks such as multi-view synthesis from a single image. But for now, commercial use of the model requires a paid membership, though there is a waitlist for those who want to try it in the browser.
Stable Audio: Music and sound from text
In September 2023, the company released Stable Audio, which generates 90-second music and audio clips from a text prompt. Six months later, we now have Stable Audio 2.0, which produces audio clips up to three minutes long. The model also supports audio-to-audio generation, which lets users upload and transform existing samples.
Interestingly, Stability has made a point of training both models exclusively on licensed music and sound-effects datasets. While there isn't yet a clear legal precedent on what IP protections apply to training AI models, Stability seems to be trying to sidestep the debate entirely with a tool that's less likely to run into legal trouble.
Stable LM 2 and Stable Code 3B: Language and coding models
Last year, Stability AI released its first language model: Stable LM. In January of this year, the company put out a sequel: Stable LM 2. The model is a 1.6B-parameter "small" language model (SLM), trained on English, Spanish, German, Italian, French, Portuguese, and Dutch. At the time of release, the company compared Stable LM 2 to other small models, such as Phi-2 and Falcon 1B, noting Stable LM 2's favorable benchmark performance.
Stable LM 2 is currently available for commercial use as part of the Stability AI membership. It's difficult to know precisely what separates Stable LM from an ever-increasing field of large language models, and I can't say I've seen much evidence that Stable LM is being used in the wild. That said, the company is planning more language model releases, and there's room to differentiate later on.
One of those differentiated use cases for language models is code generation. And there's a model for that: Stable Code 3B, which is fine-tuned for code and is competitive with Meta's CodeLlama 7B.
Stable Zero123 and TripoSR: Exploring the world of 3D modeling
The last major bucket from Stability AI is 3D modeling, with two models: Stable Zero123 and TripoSR.
Stable Zero123 is an AI model specialized in generating 3D objects from a given image. Two versions are available, one for research purposes only and another (Zero123C) for commercial applications. The model builds on existing research but uses the same architecture as Stable Diffusion 1.5 while significantly improving the training dataset. However, Stable Zero123 requires some beefy hardware to run.
TripoSR is a more recent release from the first week of March. Unlike the other models above, TripoSR was developed in partnership with another company, Tripo AI. Like Zero123, TripoSR generates 3D objects from input images. Unlike Zero123, it requires far less compute (it can theoretically run even without a GPU) and is available under an MIT license, meaning there are no fees for commercial use.
Stability's path forward
All these models were released in just the last six months - an indicator of how fast the AI space continues to move. Moreover, this isn't even the full suite of Stability's releases. We didn't cover MindEye2, a model developed in partnership with MedARC that turns fMRI data into images. There are also older releases, such as the Stable Beluga LLMs and the Stable LM Zephyr chatbot, but they're about a year old and haven't kept up with the competition.
However, Stability's path forward is not without challenges. As the company grapples with the financial realities of sustaining open-source development (despite repeated denials from the CEO, the company may be facing financial and/or investor pressure), it may need to find new ways to monetize its technology without compromising its core values.
Introducing paid memberships for commercial use is a step in this direction, though some feel this betrays Stability's mission, as the models aren't truly "open-source." The company is also experimenting with paid hosting, recently putting Stable Diffusion 3 behind an easily accessible API. It has also created Stable Artisan, a Discord bot that leverages Stable Diffusion 3 and Stable Video Diffusion. A look at the pricing page shows the various ways the company charges for usage.
Despite these challenges, one thing is clear: Stability AI has already left an indelible mark on the world of AI. Its commitment to open source has accelerated the pace of innovation and sparked a wave of creativity across the field. As the company continues to push forward with new releases and partnerships, it's poised to shape the future of AI in profound and exciting ways. I (and the rest of the AI community) will undoubtedly be watching closely to see what Stability comes up with next.