I made an Alexa
So I built it. Meet Hugh.
To build Hugh, I first sketched out my plan at a high level.
Write the server code:
Transcribe audio and return the resulting text.
Generate an AI response given a text prompt.
Convert an AI response into speech and save it as an mp3 file.
Send an mp3 file to the user to be played.
Write the browser code:
Record audio and send it off for transcription.
Submit a text box to generate a spoken AI response.
Play an audio file.
Add some polish (nice design, images, etc).
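The server side of that plan is really just a three-step pipeline. Here's a minimal sketch of the flow — the function names are hypothetical stand-ins for the real Whisper, ChatGPT, and ElevenLabs calls, not the actual project code:

```python
# A sketch of the server-side flow from the plan above. The three step
# functions are hypothetical placeholders, passed in so each piece can
# be swapped or tested independently.

def handle_voice_query(audio_bytes, transcribe, generate_reply, synthesize_speech):
    """One round trip: recorded audio in, spoken mp3 reply out."""
    prompt = transcribe(audio_bytes)     # 1. audio -> text (Whisper)
    reply = generate_reply(prompt)       # 2. text -> AI response (ChatGPT)
    return synthesize_speech(reply)      # 3. response -> mp3 bytes (ElevenLabs)
```

Keeping each step injectable like this also makes the pipeline easy to test with fakes before wiring up any real APIs.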
I also wanted to try using a few different AI tools:
ChatGPT to write the code itself (as much as it could)
Whisper to transcribe audio
ChatGPT (API) to respond to questions
ElevenLabs (and the Hugh voice I created) to generate speech
Midjourney to create an avatar
I was pleasantly surprised by how good the results were. ChatGPT created backend code that was about 80% of what I needed. And to top it off, it explained the results step by step. I wish I could've had something like this back when I was studying programming!
Ultimately, I did have to edit a decent bit of the code. But I'm guessing it saved at least 20-30 minutes of fiddling around with initial setup work.
Then it was time to actually add the AI! There were three pieces to this:
OpenAI's Whisper, which transcribes audio to text
OpenAI's ChatGPT, which answers questions given a user prompt
ElevenLabs, which converts text to spoken audio
OpenAI's APIs were ridiculously easy to use. Seriously, here's the entire code for both transcribing audio prompts and generating ChatGPT responses:
But after filling in the AI gaps, I had a working (if barebones) app.
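The remaining gap — text to speech — is essentially one HTTP request to ElevenLabs. Here's a hedged sketch using only the standard library; the endpoint path matches ElevenLabs' public REST API, but `voice_id` is a placeholder and the key is assumed to live in an `ELEVENLABS_API_KEY` environment variable:

```python
# Text-to-speech via the ElevenLabs REST API, sketched with the standard
# library. The voice_id argument is a placeholder for a real voice ID.
import json
import os
import urllib.request

def synthesize_speech(text: str, voice_id: str, out_path: str = "reply.mp3") -> str:
    """POST text to ElevenLabs and save the returned mp3 locally."""
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        data=json.dumps({"text": text}).encode(),
        headers={
            "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```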
Unfortunately, ChatGPT didn’t have a flair for the artistic.
Luckily, I could ask it to help with that too.
With some more massaging, I had a reasonable-looking page.
But the last thing missing was some character: specifically, an avatar for Hugh. To create one, I turned to Midjourney, an image-generating AI.
If you aren't familiar, Midjourney works through Discord: you join the Midjourney Discord server, then use the `/imagine` command to whip up an image. Writing an effective prompt is an art form in itself, but even vague prompts usually get decent results.
In this case, I started with “a smart british gentleman, studio ghibli, chibi, digital art.” Midjourney generates four different low-resolution images per prompt, which can be further modified. Here were Hugh's:
For any of the four, you can either 1) generate more variations of that image, or 2) upscale it to a higher resolution. I liked the top right image the best, so I made some variations.
I was happy with the bottom left image, so I had Midjourney create an upscaled version.
And with a little bit of image editing work, I had my avatar!
Hugh is a decent conversation partner, but there are certainly some upgrades he could benefit from. What I’d love to do is dig deeper into the mechanics of GPT: prompt engineering is well and good, but what does it take to fine-tune a model? And I'm still planning on building more voice projects, including narration for Artificial Ignorance.
In the meantime, I don’t think I’ll be asking Hugh for help with any coding.
You'll need your own API keys for OpenAI and ElevenLabs. They both have a reasonable free tier to play with, but heavy usage will cost a little bit.
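If you keep the keys in environment variables (the variable names below are my convention, not anything the services require), a small helper can fail fast with a clear message instead of a confusing API error:

```python
# Fail fast if an API key is missing. The environment variable names
# are an assumed convention, not mandated by either service.
import os

def require_key(name: str) -> str:
    """Fetch an API key from the environment or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return value

# Example usage (uncomment once the keys are set):
# openai_key = require_key("OPENAI_API_KEY")
# eleven_key = require_key("ELEVENLABS_API_KEY")
```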
Streaming the audio rather than saving it to a file would make things faster. And currently, there’s a limit on how long conversations can run.
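On the streaming point: ElevenLabs exposes a streaming variant of its text-to-speech endpoint, so the server could forward audio chunks to the browser as they arrive instead of waiting for the full mp3. A rough sketch, again with a placeholder `voice_id` and the key assumed in `ELEVENLABS_API_KEY`:

```python
# Streaming text-to-speech from ElevenLabs: yield mp3 chunks as they
# arrive rather than saving a complete file first. voice_id is a
# placeholder for a real voice ID.
import json
import os
import urllib.request

def stream_speech(text: str, voice_id: str):
    """Yield mp3 chunks from the ElevenLabs streaming endpoint."""
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream",
        data=json.dumps({"text": text}).encode(),
        headers={
            "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(4096):
            yield chunk
```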