I made an Alexa
So I built it. Meet Hugh.
To build Hugh, I first sketched out my plan at a high level.
Write the server code:
Transcribe audio and return the resulting text.
Generate an AI response given a text prompt.
Convert an AI response into speech and save it as an mp3 file.
Send an mp3 file to the user to be played.
Write the browser code:
Record audio and send it off for transcription.
Submit a text box to generate a spoken AI response.
Play an audio file.
Add some polish (nice design, images, etc).
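The server side of that plan is really just a three-step pipeline. Here's a minimal sketch of the flow — the function names are hypothetical stand-ins for the real Whisper, ChatGPT, and ElevenLabs calls, not the actual project code:

```python
# A sketch of the server-side flow from the plan above. The three step
# functions are hypothetical placeholders, passed in so each piece can
# be swapped or tested independently.

def handle_voice_query(audio_bytes, transcribe, generate_reply, synthesize_speech):
    """One round trip: recorded audio in, spoken mp3 reply out."""
    prompt = transcribe(audio_bytes)     # 1. audio -> text (Whisper)
    reply = generate_reply(prompt)       # 2. text -> AI response (ChatGPT)
    return synthesize_speech(reply)      # 3. response -> mp3 bytes (ElevenLabs)
```

Keeping each step injectable like this also makes the pipeline easy to test with fakes before wiring up any real APIs.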
I also wanted to try using a few different AI tools:
ChatGPT to write the code itself (as much as it could)
Whisper to transcribe audio
ChatGPT (API) to respond to questions
ElevenLabs (and the Hugh voice I created) to generate speech
Midjourney to create an avatar
I was pleasantly surprised by how good the results were. ChatGPT created backend code that was about 80% of what I needed. And to top it off, it explained the results step by step. I wish I could've had something like this back when I was studying programming!
Ultimately, I did have to edit a decent bit of the code. But I'm guessing it saved at least 20-30 minutes of fiddling around with initial setup work.
Then it was time to actually add the AI! There were three pieces to this:
OpenAI's Whisper, which transcribes audio to text
OpenAI's ChatGPT, which answers questions given a user prompt
ElevenLabs, which converts text to spoken audio
OpenAI's APIs were ridiculously easy to use. Seriously, here's the entire code for both transcribing audio prompts and generating ChatGPT responses:
But after filling in the AI gaps, I had a working (if barebones) app.
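The remaining gap — text to speech — is essentially one HTTP request to ElevenLabs. Here's a hedged sketch using only the standard library; the endpoint path matches ElevenLabs' public REST API, but `voice_id` is a placeholder and the key is assumed to live in an `ELEVENLABS_API_KEY` environment variable:

```python
# Text-to-speech via the ElevenLabs REST API, sketched with the standard
# library. The voice_id argument is a placeholder for a real voice ID.
import json
import os
import urllib.request

def synthesize_speech(text: str, voice_id: str, out_path: str = "reply.mp3") -> str:
    """POST text to ElevenLabs and save the returned mp3 locally."""
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}",
        data=json.dumps({"text": text}).encode(),
        headers={
            "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
    return out_path
```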
Unfortunately, ChatGPT didn’t have a flair for the artistic.
Luckily, I could ask it to help with that too.
With some more massaging, I had a reasonable-looking page.
But the last thing missing was some character: specifically, an avatar for Hugh. To create one, I turned to Midjourney, an image-generating AI.
If you aren't familiar, Midjourney works through Discord: you join the Midjourney Discord server, then use the `/imagine` command to whip up an image. Writing an effective prompt is an art form in itself, but even vague prompts usually get decent results.
In this case, I started with “a smart british gentleman, studio ghibli, chibi, digital art.” Midjourney generates four different low-resolution images per prompt, which can be further modified. Here were Hugh's:
For any of the four, you can either 1) generate more variations of that image, or 2) upscale it to a higher resolution. I liked the top right image the best, so I made some variations.
I was happy with the bottom left image, so I had Midjourney create an upscaled version.
And with a little bit of image editing work, I had my avatar!
Hugh is a decent conversation partner, but there are certainly some upgrades he could benefit from. What I’d love to do is dig deeper into the mechanics of GPT: prompt engineering is well and good, but what does it take to fine-tune a model? And I'm still planning on building more voice projects, including narration for Artificial Ignorance.
In the meantime, I don’t think I’ll be asking Hugh for help with any coding.
You'll need your own API keys for OpenAI and ElevenLabs. They both have a reasonable free tier to play with, but heavy usage will cost a little bit.
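If you keep the keys in environment variables (the variable names below are my convention, not anything the services require), a small helper can fail fast with a clear message instead of a confusing API error:

```python
# Fail fast if an API key is missing. The environment variable names
# are an assumed convention, not mandated by either service.
import os

def require_key(name: str) -> str:
    """Fetch an API key from the environment or raise a clear error."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return value

# Example usage (uncomment once the keys are set):
# openai_key = require_key("OPENAI_API_KEY")
# eleven_key = require_key("ELEVENLABS_API_KEY")
```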
Streaming the audio rather than saving it to a file would make things faster. And currently, there’s a limit on how long conversations can run.
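On the streaming point: ElevenLabs exposes a streaming variant of its text-to-speech endpoint, so the server could forward audio chunks to the browser as they arrive instead of waiting for the full mp3. A rough sketch, again with a placeholder `voice_id` and the key assumed in `ELEVENLABS_API_KEY`:

```python
# Streaming text-to-speech from ElevenLabs: yield mp3 chunks as they
# arrive rather than saving a complete file first. voice_id is a
# placeholder for a real voice ID.
import json
import os
import urllib.request

def stream_speech(text: str, voice_id: str):
    """Yield mp3 chunks from the ElevenLabs streaming endpoint."""
    req = urllib.request.Request(
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream",
        data=json.dumps({"text": text}).encode(),
        headers={
            "xi-api-key": os.environ["ELEVENLABS_API_KEY"],
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        while chunk := resp.read(4096):
            yield chunk
```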