

Discover more from Artificial Ignorance
I recently came across this tweet1:


And I realized that with OpenAI and ElevenLabs, you could do this - make your own smart assistant - just by connecting APIs. No fancy machine learning necessary.
So I built it. Meet Hugh.
The plan
To build Hugh, I first sketched out my plan at a high level.
Write the server code:
Transcribe audio and return the resulting text.
Generate an AI response given a text prompt.
Convert an AI response into speech and save it as an mp3 file.
Send an mp3 file to the user to be played.
Write the browser code:
Record audio and send it off for transcription.
Submit a text box to generate a spoken AI response.
Play an audio file.
Add some polish (nice design, images, etc).
I also wanted to try using a few different AI tools:
ChatGPT to write the code itself (as much as it could)
Whisper to transcribe audio
ChatGPT (API) to respond to questions
ElevenLabs (and the Hugh voice I created) to generate speech
Midjourney to create an avatar
The code
Apart from the newfangled AI bits, I wanted to stick with tools I already knew, so I used Python, HTML, and JavaScript. I started by asking ChatGPT to write the Python server code I needed.
I was pleasantly surprised by how good the results were. ChatGPT created backend code that was about 80% of what I needed. And to top it off, it explained the results step by step. I wish I could've had something like this back when I was studying programming!
Ultimately, I did have to edit a decent bit of the code. But I'm guessing it saved at least 20-30 minutes of fiddling around with initial setup work.
I tried the same thing with the HTML and JavaScript, but the results were less accurate. I had to re-word my prompt a few times to get ChatGPT to stop barking up the wrong tree. The issue was partly due to my inability to articulate what I wanted. For example, I had never built a project that records browser audio before, so I didn't know what to ask for.
Then it was time to actually add the AI! There were three pieces to this:
OpenAI's Whisper, which transcribes audio to text
OpenAI's ChatGPT, which answers questions given a user prompt
ElevenLabs, which converts text to spoken audio
The OpenAI software was ridiculously easy to use. Seriously, here's the entire code for both transcribing audio prompts and generating ChatGPT responses:
But after filling in the AI gaps, I had a working (if barebones) app.
Unfortunately, ChatGPT didn’t have a flair for the artistic.
The polish
Luckily, I could ask it to help with that too.
With some more massaging, I had a reasonable-looking page.
But the last thing missing was some character. Specifically, an avatar for Hugh. To generate Hugh's Avatar, I turned to Midjourney, an image-generating AI.
If you aren't familiar, the way Midjourney works is you have to join the Midjourney Discord server, then use the `imagine` prompt to whip up an image. Writing an effective prompt is an art form in itself, but even vague prompts usually get decent results.
In this case, I started with “a smart british gentleman, studio ghibli, chibi, digital art
.” Midjourney generates four different low-resolution images per prompt, which can be further modified. Here were Hugh's:
For any image, you can choose to 1) generate more variations of the image, or 2) upscale the image to a higher resolution. I liked the top right image the best, so I made some variations.
I was happy with the bottom left image, so I had Midjourney create an upscaled version.
And with a little bit of image editing work, I had my avatar!
Next steps
Building this project was a ton of fun. You can find the code here: https://github.com/IgnoranceAI/hugh.2
Hugh is a decent conversation partner, but there are certainly some upgrades he could benefit from3. What I’d love to do is dig deeper into the mechanics of GPT - prompt engineering is well and good, but what does it take to fine-tune a model? And I'm still planning on building more voice projects, including narration for Artificial Ignorance.
In the meantime, I don’t think I’ll be asking Hugh for help with any coding.
You'll need your own API keys for OpenAI and ElevenLabs. They both have a reasonable free tier to play with, but heavy usage will cost a little bit.
Streaming the audio rather than saving it to a file would make things faster. And currently, there’s a limit on how long conversations can run.
I made an Alexa
Dope!
Hi, my name is Alejandro from Costa Rica, how can I contact you to talk about a project