Designing an AI voiceover system for my newsletter
Meet the cast of Artificial Ignorance.
If you've read the archives, you're familiar with Hugh. Hugh is a prototype smart assistant I built, with a persona and voice to match.
Hugh's voice comes from ElevenLabs, the runner-up in my "most realistic voices" review. Hugh's voice was so realistic, in fact, that it fooled NPR for a hot second:
I deeply enjoy Hugh's voice because most AI-narrated content is mediocre. It gets the job done but in a sterile, perfunctory way. It's hard to find synthetic speakers with emotion or pauses for dramatic effect. I want fewer robot voices and more realistic AI speakers!
So, in an effort to be the change I want to see, I decided to add voiceovers to the Artificial Ignorance archives. Along the way, I learned a lot about what not to do and built a bunch of custom software to make the process go faster.
Keep reading to learn more about the process. But without further ado, here's the audio cast of Artificial Ignorance:
Hugh will be voicing long-form essays.
Evelyn is your go-to voice for weekly news roundups.
Chase walks through the nuts and bolts of AI projects.
Lauren nicely narrates the product review posts.
The quick and dirty approach
I want to acknowledge something upfront: I put way more effort into this project than I had to. There are plenty of voiceover tools available, though they don't have Hugh.
And even if I had to keep using Hugh's voice, I could have pasted the entire text of a post into ElevenLabs' dashboard. But the quick and dirty approach has a few problems:
Fixing mistakes is pretty costly. A word or phrase may get mispronounced, or I might have a typo in my post. Not something everyone will care about! But I did, so fixing it would mean regenerating the entire text - and burning thousands of credits. Instead, I wanted to regenerate small sections to correct the pronunciation errors.
There's no reliable way to add pauses. As with any good narration, I wanted silence after the title and between major sections. There isn't a way to control this in ElevenLabs yet, and I didn't want to use an audio editor. Plus, there isn't a way to test different pause lengths without regenerating the entire text.
Each post can only have a single voice. I have a few ideas for multi-speaker content, but creating said content would mean stitching together multiple files.
So, I did what any good engineer would do: I made my own half-baked solution.
Building the narration rig
With my use case in mind, the architecture of what I wanted was pretty straightforward.
Given a post URL, break the post down into roughly paragraph-sized clips.
For each clip, configure the default settings: text, voice, and starting/ending pauses.
Create a UI to adjust the settings and generate the audio. Audio clips should be playable via the UI.
Export a final file that stitches together the different clips, including the silences before and after each one.
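The first two steps can be sketched in a few lines. This is a minimal illustration, not the code behind the tool: the `Clip` fields, the default pause lengths, and the `split_into_clips` helper are all hypothetical, and it assumes the post text has already been extracted with blank lines between paragraphs.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    """One paragraph-sized chunk of narration (hypothetical structure)."""
    text: str
    voice: str = "Hugh"
    pause_before_ms: int = 0    # silence inserted before the clip
    pause_after_ms: int = 500   # silence inserted after the clip (assumed default)


def split_into_clips(post_text: str, voice: str = "Hugh") -> list[Clip]:
    """Split extracted post text into one clip per non-empty paragraph."""
    paragraphs = [p.strip() for p in post_text.split("\n\n")]
    clips = [Clip(text=p, voice=voice) for p in paragraphs if p]
    if clips:
        # Assumption: the first paragraph is the title, so it gets a longer pause.
        clips[0].pause_after_ms = 1000
    return clips


clips = split_into_clips(
    "Meet the cast\n\nHugh voices the essays.\n\n\n\nEvelyn reads the news."
)
for clip in clips:
    print(clip.voice, clip.pause_after_ms, clip.text)
```

Because each clip carries its own text, voice, and pauses, fixing a mispronunciation only means regenerating that one clip rather than the whole post.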
The end result:
In practice, most of the code was scaffolding to store and easily edit the clips. All of the heavy lifting takes place within a few libraries and tools:
Django: I've been writing Django code for over a decade. It has its tradeoffs, but as a solo developer it is the fastest framework I know for getting something off the ground.
Newspaper: In planning this project, I discovered a fascinating library called newspaper. It's gone unmaintained for a while, but it worked remarkably well at extracting the text of an article and breaking it into paragraphs.
HTMX: The first version of the narrator was simple HTML forms, but that became clunky pretty fast. Rather than integrating an entire front-end JS framework, I opted to use HTMX. It's a lightweight library to build more modern user interfaces.
ElevenLabs: After months of hacking together my own library, I was finally able to use the official ElevenLabs Python SDK. At some point I'll need to migrate the original Hugh code to the official library.
PyDub: A pretty versatile tool for managing audio clips in Python.
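The export step boils down to concatenating audio with silence in between. The tool itself uses PyDub for this; the sketch below swaps in Python's standard-library `wave` module so it runs without extra dependencies. It assumes every clip is 16-bit PCM WAV with the same sample rate and channel count, and the `stitch`, `silence`, and `make_test_clip` helpers are hypothetical names.

```python
import io
import wave


def silence(ms, params):
    """Return raw zeroed frames lasting `ms` milliseconds (valid for 16-bit PCM)."""
    n_frames = params.framerate * ms // 1000
    return b"\x00" * (n_frames * params.sampwidth * params.nchannels)


def stitch(clips):
    """Join clips given as (wav_bytes, pause_before_ms, pause_after_ms) tuples."""
    if not clips:
        raise ValueError("need at least one clip")
    params = None
    frames = []
    for wav_bytes, before_ms, after_ms in clips:
        with wave.open(io.BytesIO(wav_bytes), "rb") as reader:
            if params is None:
                params = reader.getparams()  # format of the first clip wins
            frames.append(silence(before_ms, params))
            frames.append(reader.readframes(reader.getnframes()))
            frames.append(silence(after_ms, params))
    out = io.BytesIO()
    with wave.open(out, "wb") as writer:
        writer.setnchannels(params.nchannels)
        writer.setsampwidth(params.sampwidth)
        writer.setframerate(params.framerate)
        writer.writeframes(b"".join(frames))
    return out.getvalue()


def make_test_clip(duration_ms, framerate=16000):
    """Build a tiny mono 16-bit WAV in memory to stand in for generated speech."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(framerate)
        w.writeframes(b"\x01\x00" * (framerate * duration_ms // 1000))
    return buf.getvalue()


title = make_test_clip(1000)
body = make_test_clip(2000)
final = stitch([(title, 0, 1000), (body, 250, 0)])
with wave.open(io.BytesIO(final), "rb") as r:
    total_ms = r.getnframes() * 1000 // r.getframerate()
print(total_ms)
```

Since pauses are added at export time, trying a longer or shorter pause only requires re-running the stitch, not regenerating any speech.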
How to get the most from your AI voices
I learned a few tips and tricks for working with AI-generated voices. Some are broadly applicable, but others are specific to the tool I was using.
Use longer clips. Clips composed of a single phrase or short sentence tended to be less accurate than longer clips. I often had to regenerate a title or header several times to get the right result.
Play with formatting. While the voices can't understand bold or italic text, they did adjust when given quotes, dashes, and ellipses. For what it's worth, ElevenLabs plans to introduce more tools this year to help control tone and emphasis.
Try a lot of voices. Specifically for ElevenLabs, I generated a ton of different voices. Each one has a unique personality. Many can sound somewhat dull (at least to me), but every so often you find a voice that’s the perfect fit.
Some edits are required. Not all written text translates well to a spoken medium. Terms like "this/that" should probably be converted to "this and that". Bullet points might make more sense as numbers, or section titles might benefit from a "part one" when read aloud.
Be prepared for edge cases. It was interesting to see which words tripped the AI voices up. "GitHub" is pronounced correctly 99% of the time, but "Github" had around a 20% error rate. Words that are spelled out letter by letter, like "LLM" or "www", were often slightly incorrect.
When in doubt, regenerate. The audio is non-deterministic, so regenerating will change up the cadence. While there are some basic controls, I find that you can often get wildly different results by regenerating the audio with different settings.
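A couple of these tweaks are mechanical enough to script before the text is sent to a voice. A rough sketch, assuming simple regex substitutions are good enough: the rules and the `prepare_for_speech` name are illustrative rather than part of the narration tool, and whether a given rewrite actually helps a particular voice is something to check by ear.

```python
import re

# Each rule is (compiled pattern, replacement). All rules are illustrative.
REPLACEMENTS = [
    # "this/that" reads better aloud as "this and that". The lookarounds
    # skip slashes inside URLs and file paths.
    (re.compile(r"(?<![\w/.:])([A-Za-z]+)/([A-Za-z]+)(?![\w/])"), r"\1 and \2"),
    # Normalize the casing that caused more pronunciation errors.
    (re.compile(r"\bGithub\b"), "GitHub"),
]


def prepare_for_speech(text):
    """Apply every replacement rule in order and return the rewritten text."""
    for pattern, replacement in REPLACEMENTS:
        text = pattern.sub(replacement, text)
    return text


print(prepare_for_speech("Terms like this/that trip up Github demos."))
```

A blunt rule like the first one will also rewrite phrases such as "and/or", so a list of exceptions would likely be needed in practice.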
This is just a prototype, but I’m going to keep working on this as Artificial Ignorance grows. It’s got plenty of rough edges, and I want to make a bunch of usability improvements. Stuff like:
Rearranging the order of clips.
Regenerating all the clips at once.
Uploading separate audio to splice between clips - like the audio from this post.
Adding background music or sound effects.
Making custom formatting tweaks automatic.
One more thing
As of today, all published essays have voiceovers, as does the latest AI roundup. Project writeups, product reviews, and the rest of the archives will get voiceovers shortly (as soon as I buy more ElevenLabs credits).