The AI email startup that's taking on Gmail
A conversation with Andrew Lee, CEO of Shortwave and cofounder of Firebase.
If you're not familiar with Firebase, it's a platform for developers who don't want to host their own backend infrastructure. Google acquired it in 2014, and it's now a key part of Google Cloud Platform, used by millions of businesses.
These days, Andrew works on Shortwave, an email client that started as a replacement for Google Inbox, but has quickly become a leader in the AI-for-email space. The Shortwave AI assistant can summarize threads, search your inbox, or write your replies for you.
Full disclosure: I've been using Shortwave for a couple of months now, and I'm a pretty big fan. I used to use Google Inbox pretty heavily, so it was refreshing to find a worthy successor. Having the ability to summarize emails was the cherry on top - I wanted that feature in Gmail so badly, I made my own Chrome extension to do it.
But Shortwave's approach is much, much more thorough than my slapped-together approach. In our conversation, Andrew and I dove into the company's AI architecture, what he learned building Firebase and how he's applying that to Shortwave, and how he thinks about competing with Google/Gmail.
Five key takeaways
With AI, being lean is a big advantage. Shortwave is able to outpace Google because it can iterate faster with new AI technology, and it doesn't have to worry about working with potential “competitors.”
I think we have a few advantages. One is we're just a startup. We can move really fast. So we have something live that works today. Google has Duet AI, which hasn't launched anything at this level. It has some very basic writing features, but most of the stuff we've talked to salespeople about is "coming next year."
I used to work at Google. I have some good insight into why it's hard for a big company to move very quickly. People at Google are very sharp and they're good at what they do, but it is a very big challenge to move a huge organization with billions of people forward at a rapid clip. And so we can outrun them.
We also have the benefit of being able to use the best technology, wherever it is. I think Google is gonna be extremely reluctant to just start calling the OpenAI API, for example. I think they'd be very reluctant to use, like, open source models from Microsoft, which we do.
Making a fast, capable AI app takes more than "just" a few LLM calls. Shortwave's architecture, which they recently detailed in a great blog post, shows the lengths they've gone to in building something that is both more capable than a basic ChatGPT integration and lightning fast.
Every time you make a request in our product, there's a dozen LLM calls going out and there's a cross encoder that's running, and there's embeddings that are being generated.
The first thing we do is we take your query and the history that you've had with the chat and a bunch of contextual information about the state of the world. For example, what's open on your screen, whether you've written a draft, what time zone you're in, things like that.
All so we can figure out what you're talking about, and we ask an LLM "what information would be useful in answering this question?" Do we check your calendar? Do we search email history? Do we pull in documentation? Do we look at your settings? Do we take the current thread and just stick it in there?
There's a whole bunch of stuff that we can do. And once we've determined that, it allows us to kind of modularize our system where we say, "Hey we know we need these three things." And each one of those pieces of information can then go off in parallel and load information. The most interesting one by far is our AI search infrastructure, where we go off and we use a bunch of AI tech to find emails that are relevant to the query and allow it to answer questions that are answerable in your email history.
But then we take the output of all those tools, we bring them back together, we throw them in a big master prompt and we have it answer your question. And we do that whole thing, the dozen or so calls, the cross encoder, and the embeddings, and all of that - in three seconds.
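The flow Andrew describes (pick the useful sources, load them in parallel, merge everything into one final prompt) can be sketched roughly as follows. This is an illustration under stated assumptions, not Shortwave's code: the tool names, the canned return values, and the keyword rule standing in for the tool-selection LLM call are all hypothetical.

```python
import asyncio

# Hypothetical tools: each loads one kind of context. Real ones would call
# calendar, search, and thread backends; these return canned strings.
async def load_calendar(query: str) -> str:
    await asyncio.sleep(0.01)  # stand-in for network latency
    return "calendar: flight departs Friday 9:40am"

async def search_email(query: str) -> str:
    await asyncio.sleep(0.01)
    return "email: itinerary confirmation from the airline"

async def load_current_thread(query: str) -> str:
    await asyncio.sleep(0.01)
    return "thread: contents of the open thread"

TOOLS = {
    "email_search": search_email,
    "calendar": load_calendar,
    "current_thread": load_current_thread,
}

def select_tools(query: str, context: dict) -> list[str]:
    """Stand-in for the LLM call that decides which information is useful.
    A keyword rule keeps the sketch runnable without a model."""
    chosen = ["email_search"]
    if any(word in query.lower() for word in ("when", "friday", "meeting")):
        chosen.append("calendar")
    if context.get("open_thread"):
        chosen.append("current_thread")
    return chosen

async def build_prompt(query: str, context: dict) -> str:
    names = select_tools(query, context)
    # Run every selected tool concurrently rather than one after another.
    results = await asyncio.gather(*(TOOLS[name](query) for name in names))
    # Merge the tool output into one final prompt; a real system would
    # send this prompt to an LLM to produce the answer.
    return "Context:\n" + "\n".join(results) + f"\n\nQuestion: {query}"

prompt = asyncio.run(build_prompt("When am I leaving on Friday?", {"open_thread": False}))
print(prompt)
```

Because the tools run under `asyncio.gather`, total latency is roughly that of the slowest tool rather than the sum of all of them, which is one plausible way to keep a multi-step request within a few seconds.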
The current RAG approaches have significant limitations. RAG (retrieval augmented generation) is currently the most popular way of giving LLMs "long-term memory," by fetching relevant documents and handing them to a prompt. But Andrew discussed why that doesn't work amazingly well, and how they're trying to work around it.
The kind of standard approach to document search that AI folks are doing is the embedding plus vector database approach. Where you take all of the history, you embed it, you store that in a vector database somewhere, and then you take the query, you embed that, you do a search with cosine similarity, you find the relevant documents, you pull those in and you put them in a prompt.
But it doesn't actually produce results as good as you might like, because it only works in cases where the documents that answer your question have semantic similarity with the question itself. Sometimes this is true, right? But say I ask, "when am I leaving on Friday," and what I'm really looking for is the time of my flight. That email doesn't have the word "leaving" in there at all.
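To make the embed-then-search flow concrete, here is a minimal sketch in plain Python. The `embed` function is an assumption for illustration: it uses word counts where a real system would call an embedding model, and the in-memory list stands in for a vector database. Word counts exaggerate the mismatch Andrew describes, but the shape of the failure is the same.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. An assumption for illustration only;
    a real system would call an embedding model here."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

emails = [
    "Team lunch is leaving the office at noon on Friday",
    "Your flight departs SFO at 9:40am, boarding at gate 12",
    "Invoice attached for last month",
]
# The "vector database": every email stored alongside its embedding.
index = [(email, embed(email)) for email in emails]

def search(query: str, k: int = 1) -> list[tuple[float, str]]:
    """Embed the query and return the k most similar emails."""
    q = embed(query)
    scored = sorted(((cosine(q, vec), email) for email, vec in index), reverse=True)
    return scored[:k]

query = "when am I leaving on Friday"
results = search(query)
flight_score = cosine(embed(query), embed(emails[1]))
print(results[0][1])
print(flight_score)
```

In this toy setup the lunch email wins because it shares words with the question, while the flight email, which actually answers it, shares none.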
So we wanted to go a step further and say, okay, we want to be even smarter than this. We wanna find things that don't have necessarily semantic similarity. And still answer your question, pull those in. So the way we do that is, we have a whole bunch of different what we call fetchers, a whole bunch of different methods for loading those emails into memory.
So we look for names. We look for time ranges, we look for labels, we look for keywords. There's a few other things that we look for. And then we go and we just load all of those emails. We're going to pull all the things that are semantically similar, and the ones that match relevant keywords and the ones in the date range, and the ones involving the right people, et cetera.
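A rough sketch of the fetcher idea follows: several independent lookups whose results are pooled into one candidate set. The `Email` fields, the fetcher names, and the hand-written `features` dictionary (which stands in for whatever extracts names, dates, labels, and keywords from the query) are all hypothetical. The semantic-similarity fetcher is left out to keep the example short.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Email:
    id: int
    sender: str
    sent: date
    labels: frozenset
    body: str

# Made-up inbox for illustration.
INBOX = [
    Email(1, "airline@example.com", date(2024, 5, 10), frozenset({"travel"}), "Your flight departs at 9:40am"),
    Email(2, "sam@example.com", date(2024, 5, 8), frozenset(), "Lunch on Friday?"),
    Email(3, "billing@example.com", date(2024, 3, 1), frozenset({"finance"}), "Invoice attached"),
]

def fetch_by_keywords(inbox, keywords):
    return {e.id for e in inbox if any(k in e.body.lower() for k in keywords)}

def fetch_by_date_range(inbox, start, end):
    return {e.id for e in inbox if start <= e.sent <= end}

def fetch_by_people(inbox, names):
    return {e.id for e in inbox if any(n in e.sender.lower() for n in names)}

def fetch_by_labels(inbox, labels):
    return {e.id for e in inbox if e.labels & set(labels)}

def fetch_candidates(inbox, features):
    """Run every fetcher that has something to look for and union the results."""
    ids = set()
    if features.get("keywords"):
        ids |= fetch_by_keywords(inbox, features["keywords"])
    if features.get("date_range"):
        ids |= fetch_by_date_range(inbox, *features["date_range"])
    if features.get("people"):
        ids |= fetch_by_people(inbox, features["people"])
    if features.get("labels"):
        ids |= fetch_by_labels(inbox, features["labels"])
    return ids

# Features a query like "when am I leaving on Friday" might yield (assumed).
features = {
    "keywords": ["flight"],
    "date_range": (date(2024, 5, 6), date(2024, 5, 12)),
    "labels": ["travel"],
}
candidates = fetch_candidates(INBOX, features)
print(sorted(candidates))
```

Taking the union favors recall: an email only needs to match one fetcher to be loaded, so a later step has to decide which candidates are actually worth putting in the prompt.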
As always, talking to users is incredibly important. This is one of the things that Y Combinator drills into its founders, and with good reason. Shortwave spent over a year experimenting with crazy collaboration features, but ultimately came back to focus on a great single-user experience.
When I started this company, I said to myself, I'm not going to be like all those other second time founders that think they know everything. That jump in and think it's going to be easy. I'm going to do this from first principles, and we're gonna talk to our users and we're going to iterate really fast, and we're going to be scrappy.
And we did that. We talked to our users, we were disciplined. But it was still just a brutal refresher on how much you have to do that. Like how much you have to talk to users, how much you have to be willing to admit your ignorance and throw out stuff that isn't working.
We tried all kinds of features. Like the current state of the product is iteration number, I don't know, 10 or something. For the first year of the product, basically everybody churned. Because we had this much more crazy rethink about how email works, which in retrospect was not a particularly good idea.
Sometimes backwards compatibility is inevitable. Many founders are trying to build new and better software, and as a result ship their minimum viable product (MVP) with a bare-bones set of features. But sometimes you need to support the entire universe of features that your customers actually want - especially if you have established competitors.
One of the decisions I wish I would've made earlier is to say that we're going all in on supporting the full breadth of email. There's a lot of stuff in email that feels ancient that you might, starting out fresh, be like, we shouldn't bother doing this. A good example would be like BCC.
Kids these days haven’t heard of a carbon copy, much less a blind carbon copy. It's kind of this weird, esoteric thing. And for a while we didn't have it. We said, we're gonna build a different primitive that's gonna do some of the things that BCC does.
And I think what we learned was people are so used to some of these things. And in order to play nicely with existing systems, to play nicely with Gmail, to play nicely with other people's email clients - you really have to support these things fully.
You can build cool stuff, but it has to be layered on top. So you can build a nicer interface doing X, Y, Z, but the underlying stuff needs to be, like, totally standard. And I think we should have accepted that much sooner and said, we are just going to support everything that email does and then build simplifications on top as workflows rather than trying to simplify the underlying model.
And three things I learned for the first time:
Gmail popularized the idea of threaded conversations in email, which was not something email was originally designed for. As a result, Gmail still has a setting where you can disable threads entirely!
Firebase started as an app called SendMeHome, which was meant to help people return lost items. The founders pivoted twice, listening to what their users wanted, and eventually landed on Firebase.
HyDE (Hypothetical Document Embeddings) is a RAG technique that involves creating fake documents that might contain relevant words that aren't in the query itself (like "flight" vs "leaving" from the quotes above), and using those as a stepping stone to find the right underlying documents.
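As a rough illustration of HyDE, the sketch below searches with a made-up "answer" email instead of the raw question. Everything here is an assumption for demonstration: `write_hypothetical_email` returns canned text where a real system would ask an LLM to draft it, and `embed` uses word counts in place of an embedding model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding (word counts); a real system would use an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def write_hypothetical_email(query: str) -> str:
    """Stand-in for an LLM asked to draft an email that would answer the
    query. The canned text is an assumption, not real model output."""
    return "Your flight departs Friday at 10am. Boarding begins 30 minutes before departure."

emails = [
    "Team lunch is leaving the office at noon on Friday",
    "Your flight departs SFO at 9:40am, boarding at gate 12",
]

def best_match(text: str) -> str:
    """Return the stored email most similar to the given text."""
    q = embed(text)
    return max(emails, key=lambda email: cosine(q, embed(email)))

query = "when am I leaving on Friday"
direct = best_match(query)                               # search with the question
via_hyde = best_match(write_hypothetical_email(query))   # search with a fake answer
print(direct)
print(via_hyde)
```

Searching with the question lands on the lunch email, while searching with the hypothetical answer lands on the flight email, because the fake document uses the vocabulary that a real answer would.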
The full interview (both audio and transcript) is available for paid subscribers.
Artificial Ignorance is reader-supported. If you found this interesting or insightful, consider becoming a free or paid subscriber.