A deep dive on AI-generated voices

I tried designing synthetic speech and cloning my voice.

Mar 01, 2023

Note: I strongly recommend listening to the audio while you read this post. For the best experience, open it in a browser or via the Substack app.

A neon microphone. Artwork created with Midjourney.

Ever since I can remember, I've loved accents. Maybe it's how my brain is wired, but I love listening to the slight changes in emphasis and phonetics. I've spent hours "reverse engineering" the differences between Scottish and Irish accents[1].

We don’t often consider the value of our voices, but it's incredible how expressive they are. With a single sound, we can broadcast emotions, intentions, and identity. Our voices are fundamental to who we are, as much as our name or our culture. And hearing the voice of a familiar friend, a loved one, or even a celebrity can trigger all sorts of memories and emotions in an instant.

The richness of our voices, paired with our ability to generate speech and language, is part of humanity's secret sauce. It's what distinguishes us from almost all other animals. And it's something that, despite all our advances, technology hasn't been able to replicate. Until now.

Take a listen to the audio clip below, of an upbeat English gentleman.

[Audio clip, 0:06]

It probably sounds pretty normal to you. And yet, that voice doesn't exist. It is 100% AI-generated.

Recently, I spent some time experimenting with Resemble AI and ElevenLabs, two companies that do AI voice generation[2]. Resemble AI is decently established with advanced features, while ElevenLabs made headlines recently with their beta release.

And after spending a few days playing with this technology, I'm completely blown away by how advanced it is. In my mind, we were still years away from anything this good.

How it works

At the heart of both platforms, you are using text-to-speech (TTS) models to convert words into audio. TTS systems have been around for a while; we're just used to hearing them as flat, robotic voices. But recent advances in deep learning have led to far more natural-sounding speech.

Resemble AI and ElevenLabs provide out-of-the-box voices for you to use. Resemble AI's were decent, but ElevenLabs' were outstanding.

For the sake of consistency, I used the same script to test all the voices - an excerpt from a recent post:

Since we've seen so many "state of the art" chatbots before, there's a trap that we fall into (or at least I did). We want to see if we can "outsmart" it, so we ask it questions about things that we already know, as a test. I was 100% guilty of this - my first experiments with ChatGPT and GitHub Copilot left me unimpressed, because I wanted answers that I already knew.

Here are two of Resemble AI's default voices, Beth and Justin.

[Audio clip, 0:19]

[Audio clip, 0:22]

And here are two of ElevenLabs' default voices, Adam and Antoni.

[Audio clip, 0:22]

[Audio clip, 0:23]

Overall, Resemble AI has advanced project management features and offers a ton of fine-tuning options for your audio. However, the voice output from ElevenLabs was unmatched. Their platform is still in beta and lacks plenty of features, so keep that in mind. I’ve documented a full feature breakdown in a separate post.

Lab grown audio

Without a doubt though, the most fun I had in this whole process was in the ElevenLabs VoiceLab.

Both platforms can only save a fixed number of voices, and with ElevenLabs those voices are either designed or cloned.

Designing a voice, on its face, is deceptively simple. Choose a gender (male/female), an age (young/middle-aged/old), an accent (American/British/African/Australian/Indian), and an accent strength. Give it some text to read, and randomly generate a new voice.

But I spent hours (and all my free-tier credits) creating voices this way. Since the underlying model is non-deterministic, it creates a slightly (or not so slightly) different voice each time. I kept tweaking the accent strength or the age range to see what the software would do. It felt a bit like playing a slot machine, where a unique persona would emerge each time. In the end, I generated a couple of voices that I loved listening to: Hugh and Allison[3].

[Audio clip, 0:22]

[Audio clip, 0:24]
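For the programmers reading along, the voice design options described above can be sketched as a small config object. This is only an illustration, not ElevenLabs' actual API: the option lists come from the VoiceLab form, but `VoiceDesign`, `generate_voice`, the 0.0 to 1.0 scale for accent strength, and the made-up voice ID format are all assumptions. The random suffix stands in for the non-deterministic model, which is why the same settings produce a different voice on every pull of the lever.

```python
import random
from dataclasses import dataclass
from typing import Optional

# Option lists as described in the post; everything else is hypothetical.
GENDERS = ("male", "female")
AGES = ("young", "middle-aged", "old")
ACCENTS = ("American", "British", "African", "Australian", "Indian")


@dataclass(frozen=True)
class VoiceDesign:
    """Hypothetical container for the four voice design settings."""
    gender: str
    age: str
    accent: str
    accent_strength: float  # assumed scale: 0.0 (faint) to 1.0 (strong)

    def __post_init__(self) -> None:
        # Reject anything the form would not let you pick.
        if self.gender not in GENDERS:
            raise ValueError(f"unknown gender: {self.gender!r}")
        if self.age not in AGES:
            raise ValueError(f"unknown age: {self.age!r}")
        if self.accent not in ACCENTS:
            raise ValueError(f"unknown accent: {self.accent!r}")
        if not 0.0 <= self.accent_strength <= 1.0:
            raise ValueError("accent_strength must be between 0.0 and 1.0")


def generate_voice(design: VoiceDesign, rng: Optional[random.Random] = None) -> str:
    """Stand-in for the model: returns a made-up voice ID.

    The random suffix mimics non-determinism. Identical settings give a
    different result per call unless a seeded generator is passed in.
    """
    if rng is None:
        rng = random.Random()
    suffix = f"{rng.getrandbits(32):08x}"
    return f"{design.accent.lower()}-{design.age}-{design.gender}-{suffix}"


if __name__ == "__main__":
    design = VoiceDesign("male", "middle-aged", "British", 0.7)
    for _ in range(3):
        # Same settings, three different (fake) voices.
        print(generate_voice(design))
```

Passing a seeded `random.Random` makes the sketch repeatable, which is handy for testing. The real VoiceLab form, as far as I could tell, offers no such control.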

The other part of the VoiceLab is cloning a voice. This feature exists on a few different platforms - record yourself speaking, then create a copy of your voice to use over and over again.

To be honest, I felt a bit nervous about cloning my voice. Although both companies say that you own all the rights to your voice, we still don’t know all of the implications of this technology. But in the name of science, I gave it a try.

In practice, I was able to create a voice that sounded somewhat natural, but the consensus among my friends was that it didn't really sound like me. As much as I'd like to include an audio clip of my cloned voice, there are some pretty good reasons why that’s a very bad idea. See the Abuse cases section below.

Use cases

After experimenting with this software, I'm already thinking about potential use cases. Sure, the tech still has some maturing to do. But as we've talked about before, this is the worst it's ever going to be. Now almost anyone can have a bespoke voice actor without hiring a person. That has some major implications for businesses[4].

Narration/Voiceover. This one is pretty obvious. Independent authors can now get an audiobook recorded in a matter of minutes. Publications that want to offer audio versions of their content can do so without breaking the bank. In fact, I'm planning on doing so with future posts on Artificial Ignorance.

Entertainment. We've been on the path to fully synthetic entertainment for a while now[5], and this was the missing piece. We can now create characters (in real-time) that are separate from the look and sound of their creators. If this becomes widespread enough, we could also see custom voices become a social media standard like Snapchat filters or Zoom backgrounds.

Software assistants. If we combine this tech with ChatGPT, we could finally realize the vision of home assistants like Alexa. Maybe one day you could have a normal conversation with your Google Home. Even better, you could play the right Spotify playlist on the first try.

Sales/Marketing. Salespeople adore tools to tailor their outreach, and this is no exception. Expect companies to send custom voice messages about how their product would be a great fit. On the marketing side, this simplifies one step of recording advertisements or podcasts. Pair it with ChatGPT to supercharge your marketing content by translating and re-recording for international markets.

Customer service. In the world of customer service, it's dramatically more expensive to offer live phone support versus text-based chat. Tools like these might bring down the cost of call centers if everyone is typing rather than talking. Though if that doesn't work, at least call trees will sound more exciting.

Abuse cases

But on top of the use cases, I'm also thinking of the abuse cases. This technology is ripe for impersonation and misuse. I mentioned earlier that ElevenLabs made headlines with their beta release - that was because 4chan immediately used it to clone celebrity voices reading offensive material.

In earlier drafts of this post, I was tempted to include audio samples of my cloned voice. But as I learned more about the space, it became pretty obvious that that was a bad idea. Just last week, a journalist broke into his own bank account using an AI-cloned voice. Today, scammers send phishing emails from "a family member stranded overseas" - what if they could also leave a voicemail?

Would not send money to that voice claiming to be you stuck in Mexico and needing cash. – My friend Trevor

Currently, it takes a few minutes' worth of audio to clone someone's voice. But that threshold is getting a lot smaller, very quickly. Microsoft announced (but has not released) a model that can clone someone's voice with 3 seconds of audio. In the future, I'm going to be paranoid anytime I'm being recorded.

Resemble AI has a stated ethics policy, and both companies make you confirm that you have proper consent before cloning a voice. But a checkbox isn't going to stop determined bad actors, and it's an open question what happens when this technology is more widely available.

Beyond illegal activity, it's also worth considering the societal impacts. If I were a voice actor, I'd feel pretty nervous right now. And in fact, we're already seeing contracts asking for the digital rights to actors' voices. My guess is we'll see more lawsuits as voiceover jobs are replaced with AI.

Maybe I’m wrong here. There’s still a lot of work to be done in terms of adding emotion, tone, or inflection. Synthetic voices might remain a niche thing, useful only to big companies hell-bent on automating every last piece of their customer service setup. Plus, voice actors have the advantage of being able to take feedback quickly.

But I keep coming back to the value of our voices. Natural speech has been one of the goalposts of AI for a very long time, and it feels like we’re finally on the cusp of reaching that goal. I’m not sure what happens when we commoditize that part of ourselves.

For a more in-depth look at the bells and whistles of each platform, take a look at the feature breakdown post.


[1] Fun fact: after a few drinks, I can do a pretty convincing Scottish accent.

[2] There are dozens more out there though! This is an ongoing issue - there are SO many new AI companies that it’s impossible even to keep track of them. Eventually, leaders will emerge, and many will die out, but for now I’m mostly going by word of mouth and news coverage. Feel free to suggest products for me to try!

[3] When you save a voice, you’re asked to give it a name, which felt oddly intimate - like naming a pet.

[4] These are only the ones I could think of off the top of my head! If you’re thinking of more use cases, send me an email or leave a comment.

[5] I'm talking about VTubers, cartoon avatars animated using face and body tracking cameras (some as simple as your iPhone camera). But up until now, a human still had to voice the avatar.
