I made an Alexa

Meet Hugh.

Mar 08, 2023

∙ Paid

I recently came across this tweet 1:

Here’s the recipe to make Siri/Alexa 10x better: 1. Whisper to convert speech to text. Best open-source speech model out there. 2. ChatGPT to generate smart home API calls and/or text response. 3. VALL-E to synthesize speech. It can mimic anyone’s voice sample! Quick figure 1/3

And I realized that with OpenAI and ElevenLabs, you could do this - make your own smart assistant - just by connecting APIs. No fancy machine learning necessary.

So I built it. Meet Hugh.

The plan

To build Hugh, I first sketched out my plan at a high level.

Write the server code:
- Transcribe audio and return the resulting text.
- Generate an AI response given a text prompt.
- Convert an AI response into speech and save it as an mp3 file.
- Send an mp3 file to the user to be played.
Write the browser code:
- Record audio and send it off for transcription.
- Submit a text box to generate a spoken AI response.
- Play an audio file.
Add some polish (nice design, images, etc).

I also wanted to try using a few different AI tools:

ChatGPT to write the code itself (as much as it could)
Whisper to transcribe audio
ChatGPT (API) to respond to questions
ElevenLabs (and the Hugh voice I created) to generate speech
Midjourney to create an avatar

I made an Alexa

Meet Hugh.

The plan

This post is for paid subscribers