Tutorial: How to narrate video with Sora, GPT-Vision, and ElevenLabs
The future of entertainment is going to be a wild ride.
Everyone is fascinated by Sora, the new text-to-video model from OpenAI. That’s mainly because of how real many of the outputs look, and how well objects move and interact. We're not going to get into the technical details of Sora today, but if you want to dive deeper, there are several good posts from Visually AI, Interconnects, AI Supremacy, and the Algorithmic Bridge.
One of my great joys in this world is nature documentaries - I've watched Planet Earth more times than I can count. And what I realized, watching the Sora demo videos, is that we're on the verge of being able to create infinite nature documentaries.
So, I created a process to make these without human intervention. Take a look at the results for yourself (and turn your audio on!):
Here's a rough breakdown of how I did it:
Take an existing Sora video (as I don't have access to Sora yet)
Process the video frames to send them to GPT-Vision
Use ChatGPT to write a voiceover script
Create the audio narration using ElevenLabs
Edit the audio, then merge it with the original video
Bonus: Generate new videos using Runway’s text-to-video model
Creating a video script using GPT-Vision
GPT-Vision is capable of understanding images but not video¹. So, as a workaround, we can break up the video frames using OpenCV and batch them together.
Before you start, you'll want to run:
pip install openai opencv-python requests
Once we have our requirements installed, we can use OpenCV to read the video frames:
import base64

import cv2


def get_video_frames(video_filename: str) -> list:
    """Read a video file and return its frames as base64-encoded JPEGs."""
    video = cv2.VideoCapture(video_filename)
    base64Frames = []
    while video.isOpened():
        success, frame = video.read()
        if not success:
            break
        # Encode each frame as a JPEG, then base64-encode it for the API
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))
    video.release()
    return base64Frames
In doing so, we now have a bag of JPEGs that we can pass to ChatGPT (or more specifically, "GPT-4 with Vision"). I was surprised by how easy this was: you literally just provide the list of image URLs (or data contents) to the API. I'm also passing in a prompt to get a response back once ChatGPT has analyzed the still frames.
from openai import OpenAI

client = OpenAI(api_key="OPENAI_API_KEY")


def generate_script(video_filename: str) -> str:
    video_frames = get_video_frames(video_filename)
    if not video_frames:
        raise ValueError("No frames found.")
    prompt = (
        "These are frames of a video. "
        "Write a voiceover script in the style of a nature documentary. "
        "Include only the first sentence of the narration."
    )
    # Send the prompt plus every 60th frame (resized to 768px) in a single user message
    messages = [
        {
            "role": "user",
            "content": [
                prompt,
                *map(lambda x: {"image": x, "resize": 768}, video_frames[0::60]),
            ],
        },
    ]
    params = {
        "model": "gpt-4-vision-preview",
        "messages": messages,
        "max_tokens": 120,
    }
    result = client.chat.completions.create(**params)
    return result.choices[0].message.content
There are a few nuances here. For starters, I've attached a custom prompt to specify that I want the output to be a nature documentary voiceover. I'm explicitly asking for the first sentence (as well as limiting the output to 120 tokens) because in practice, the video clips I'm working with are ~20 seconds or less, and I need to keep the narration short. In the future, I'd like to dynamically adjust the prompt and the token count based on the video length.
I'm also specifying the gpt-4-vision-preview model, as other ChatGPT versions (such as GPT-4 Turbo) don't work with vision yet.
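As a rough illustration of the dynamic token adjustment mentioned above, here's one way it might look. This is just a sketch - the narration pace (~2.5 spoken words per second) and the tokens-per-word ratio are assumptions made for illustration, not tuned values:

def estimate_max_tokens(video_duration_seconds: float) -> int:
    # Assumed narration pace of ~2.5 spoken words per second,
    # and roughly 1.3 tokens per English word (both rough guesses)
    words = video_duration_seconds * 2.5
    return int(words * 1.3)

# Hypothetical usage: a 20-second clip would get roughly 65 tokens
# params["max_tokens"] = estimate_max_tokens(20)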
Narrating the video with ElevenLabs
As I've mentioned repeatedly, ElevenLabs is one of my favorite tools for generating speech. While it's definitely on the expensive side, the expressiveness and realism of the voices are hard to beat. In this case, I used ElevenLabs to generate a voice that would go well with our AI-generated nature scenes.
I've used ElevenLabs before, so creating the audio was pretty straightforward. While they have a Python library, I opted to use the API directly with requests.
import requests

# VOICE_ID and ELEVENLABS_API_KEY are defined elsewhere in the script


def generate_audio(text: str, output_path: str) -> str:
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}"
    headers = {
        "xi-api-key": ELEVENLABS_API_KEY,
        "Content-Type": "application/json",
    }
    payload = {
        "model_id": "eleven_multilingual_v2",
        "text": text,
    }
    response = requests.post(url, json=payload, headers=headers)
    # The response body is the raw MP3 audio
    with open(output_path, "wb") as output:
        output.write(response.content)
    return output_path
We now have an audio file to go with our video file! There's just one problem: the timing isn't going to sync up exactly. The audio could be shorter or longer, depending on the length of the script and the pronunciation of each word. It's not a complete showstopper - you can still merge the two and either end the audio early, or have still frames at the end of the video. But I wanted to take a shot at dynamically editing the audio.
Editing and combining the audio and video
There are a few ways to go about merging the two, but the simplest was to modify the audio - that way, only one input is changing, and we're not stuck retiming video frames and dealing with the visual artifacts that come with it.
With that in mind, we can install some new tools to help us work with the audio and video (you'll also need ffmpeg installed as a command-line tool).
pip install pydub ffmpeg-python
This approach has some pitfalls - the biggest one being that our editing method, ffmpeg's atempo filter, can only stretch or shrink audio by a factor of 2 in one go (meaning half speed or double speed).
That's probably a good thing - more audio distortion would make our final output sound weirdly fast or slow. But it puts extra emphasis on our script being tight and closely aligned with the actual length of our input video.
import ffmpeg


def adjust_audio_duration(video_filename, audio_filename):
    video_duration = get_video_duration(video_filename)
    audio_duration = get_audio_duration(audio_filename)
    output_filename = audio_filename.replace(".mp3", "_adjusted.mp3")
    # If the audio is longer than the video, atempo > 1 speeds it up (and vice versa)
    slowdown_factor = audio_duration / video_duration
    if slowdown_factor < 0.5 or slowdown_factor > 2:
        raise ValueError("Slowdown factor must be between 0.5 and 2.")
    (ffmpeg
        .input(audio_filename)
        .filter_('atempo', slowdown_factor)  # the fast/slow adjustment
        .output(output_filename)
        .run(overwrite_output=True)
    )
    return output_filename
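The function above calls two duration helpers, get_video_duration and get_audio_duration, that I haven't shown. Here's a minimal sketch of how they could look - one using ffmpeg.probe, the other using pydub (which we installed earlier); your implementations may differ:

import ffmpeg
from pydub import AudioSegment


def get_video_duration(video_filename: str) -> float:
    # ffmpeg.probe returns container metadata, including the duration in seconds
    probe = ffmpeg.probe(video_filename)
    return float(probe["format"]["duration"])


def get_audio_duration(audio_filename: str) -> float:
    # pydub reports length in milliseconds
    audio = AudioSegment.from_file(audio_filename)
    return len(audio) / 1000.0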
And with that out of the way, we can take the final step of combining our audio and video files:
def combine_audio_and_video(video_filename, audio_filename):
    output_filename = video_filename.replace(".mp4", "_final.mp4")
    # Copy the video stream as-is and encode the narration as AAC
    ffmpeg.output(
        ffmpeg.input(video_filename),
        ffmpeg.input(audio_filename),
        output_filename,
        vcodec='copy',
        acodec='aac',
        strict='experimental'
    ).run(overwrite_output=True)
    return output_filename
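To tie everything together, the full pipeline is just a chain of the functions defined above. Here's a rough sketch - the filenames are placeholders:

def narrate_video(video_filename: str) -> str:
    # 1. Look at the frames and write a short nature-documentary script
    script = generate_script(video_filename)
    # 2. Turn the script into narration with ElevenLabs
    audio_filename = generate_audio(script, "narration.mp3")
    # 3. Stretch or shrink the narration to match the clip length
    adjusted_audio = adjust_audio_duration(video_filename, audio_filename)
    # 4. Merge the narration with the original video
    return combine_audio_and_video(video_filename, adjusted_audio)


final_video = narrate_video("sora_clip.mp4")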
Et voilà! On-demand nature documentary narration.
Bonus: Generating videos with Runway
Unfortunately, I was mostly limited to the videos that OpenAI has posted publicly online, as I don't have direct access to Sora yet. If and when access becomes more widely available, I plan to update the process with an actual video-generation step using the Sora API.
But there are a number of existing text-to-video tools out there, and we can use them to generate additional (albeit lower quality) videos. It also serves as a stark example of why people can’t stop talking about Sora - the leap in realism is massive.
Here’s a video that I generated using Runway, a model that was considered state-of-the-art just a few months ago:
Creating the video was pretty easy: all I had to do was enter a prompt, optionally configure a few stylistic settings, and then wait for the model to render the video. The generated videos are only 4 seconds long, but you can extend the video in additional 4-second increments (up to 3 times). I did that in the example above - but notice how the seahorses start to warp and morph into other things.
That’s one of the things that sets Sora apart: it “understands” how objects should behave over time, and correctly generates a subsequent video frame, rather than taking the pixels and smoothly warping them into the next frame.
Key takeaways
I’m excited (and a little terrified) for the future of entertainment and media. While I normally dislike companies making product announcements with nothing more than a waitlist, in this case I respect that OpenAI announced Sora without making it publicly available. It gives the world at least a little time to inoculate itself against potential video-based sources of misinformation.
Right now, generating 10-second clips takes between minutes and hours. But we know it’s possible to dramatically speed up the generation rate, especially with custom AI chips. In the coming years, it’s not difficult to imagine custom, real-time entertainment: ChatGPT writes a script, Sora generates the video, and audio models create sound effects and voiceovers.
Tech like this is going to be amazing for indie filmmakers and other creatives - which is likely why OpenAI is giving them early access to Sora. I’m excited too: each new foundation model release has unlocked my ability to make projects that I never would have considered before. But it’s also a reminder of how fast things are moving, and how much change is yet to come.
¹ That will likely change, as other models like Gemini 1.5 can take video input and understand it.