
SlideGen

Posted on 9 August 2023.
python llm linux flask html

About

In this post I will showcase how to use several of the OpenAI API endpoints to build a simple tool that autogenerates short videos on a given topic in a slideshow format.

NOTE: This project requires a paid OpenAI account. This means you will need an API key, which you can find in your OpenAI account settings. You will have to provide this API key to the application.

Quickstart

To run the application you will need ffmpeg and poetry installed. Also make sure to export your API key:

export OPENAI_API_KEY=sk-...
sudo apt install ffmpeg
poetry install
echo "Please create a presentation about sunflowers." | poetry run slide-gen

Walkthrough

Requirements

For this project I wanted to experiment with a new package manager for Python, so I tried poetry. Compared to a plain pyenv + pip workflow, this should make it easier to install and run dependencies without manually creating the environment yourself (although under the hood it still uses virtual environments).

For this project I used Python 3.10. For accessing the OpenAI tools I used the openai library, and for the TTS I used the fakeyou library. We also need ffmpeg to assemble the final video, driven through the ffmpeg-python bindings, which means ffmpeg itself has to be installed. On Ubuntu:

sudo apt install ffmpeg

Poetry keeps the dependency list in the pyproject.toml file. To create a new project using poetry you can run

poetry new slide-gen

Then you can add the dependencies

[tool.poetry.dependencies]
python = "^3.10"
openai = "^0.27.8"
fakeyou = "^1.2.5"
ffmpeg-python = "^0.2.0"

and then install them

poetry install

NOTE: Make sure you have ffmpeg installed (you can check with which ffmpeg) and that you have also installed the Python requirements.
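
One more detail: for the poetry run slide-gen command from the quickstart to work, pyproject.toml also needs a script entry point. A minimal sketch, assuming the package exposes a main function in slide_gen/main.py (both names are my assumption, not taken from the project):

[tool.poetry.scripts]
slide-gen = "slide_gen.main:main"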

Prompt

The easiest way to work with ChatGPT on a task that requires strictly structured output is to use JSON. ChatGPT (and some other LLMs) is good at following a JSON-only rule if you state it in a system prompt.

SYSTEM = """Your job is to create a slide presentation for a video. \
In this presentation you must include a speech for the current slide and a \
description for the background image. You need to make it as story-like as \
possible. The format of the output must be in JSON. You have to output a list \
of objects. Each object will contain a key for the speech called "text" and a \
key for the image description called "image".

For example for a slide presentation about the new iphone you could output \
something like:

```
[
  {
    "text": "Hello. Today we will discuss about the new iphone",
    "image": "Image of a phone on a business desk with a black background"
  },
  {
    "text": "Apple is going to release this new iphone this summer",
    "image": "A group of happy people with phones in their hand"
  },
  {
    "text": "Thank you for watching my presentation",
    "image": "A thank you message on white background"
  }
]
```

Make sure to output only JSON text. Do not output any extra comments.
"""

In this system prompt we give the LLM an explanation of the task and state that it must use JSON. Then we give it an example of how a presentation should be formatted.

Next, to get the response from ChatGPT we set the messages to the system prompt and the user's text read from stdin:

import sys

import openai

# Read the presentation topic from stdin
prompt = sys.stdin.read().strip()

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": SYSTEM,
        },
        {"role": "user", "content": prompt},
    ],
)

Slides

Then we parse the JSON from the response

presentation = json.loads(response.choices[0].message.content)

Now, it can still happen that ChatGPT outputs a string that is not valid JSON, but this should be a very rare case.
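
If you want to guard against that case, a minimal sketch (my own addition, reusing SYSTEM, prompt, and the openai import from above; the retry count is an arbitrary choice) is to retry the request until the reply parses:

import json

presentation = None
for _ in range(3):  # retry a few times; 3 is arbitrary
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": prompt},
        ],
    )
    try:
        presentation = json.loads(response.choices[0].message.content)
        break
    except json.JSONDecodeError:
        continue  # the model ignored the JSON-only rule; ask again

if presentation is None:
    raise RuntimeError("The model never returned valid JSON")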

Next, we can iterate over all the objects in the JSON list and generate the image and the TTS for each slide; the full loop is sketched after the two calls below.

We can use the Image endpoint and request the largest size. The generated image is hosted at a URL, so we also download it locally.

import urllib.request

# Generate one 1024x1024 image for this slide and grab its URL
response = openai.Image.create(
    prompt=slide["image"], n=1, size="1024x1024"
)
image_url = response["data"][0]["url"]

# Download the hosted image locally
urllib.request.urlretrieve(image_url, path)

And to generate the TTS we can run

fakeyou.FakeYou().say(slide["text"], speaker).save(path)
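
Putting the two calls together, the per-slide loop could look roughly like this (the file naming scheme and the output variable are my own choices for illustration, not the project's):

import os

for i, slide in enumerate(presentation):
    image_path = os.path.join(output, f"slide_{i}.png")
    audio_path = os.path.join(output, f"slide_{i}.wav")

    # Generate and download the background image for this slide
    response = openai.Image.create(prompt=slide["image"], n=1, size="1024x1024")
    urllib.request.urlretrieve(response["data"][0]["url"], image_path)

    # Generate the narration for this slide
    fakeyou.FakeYou().say(slide["text"], speaker).save(audio_path)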

Voice

To allow changing the speaker's voice we use the FakeYou API to list all available voices and search for the one we want to use.

import fakeyou

def find_voice_token(speaker):
    """Return the FakeYou model token for a given voice title."""
    try:
        voices = fakeyou.FakeYou().list_voices()
        index = voices.title.index(speaker)
        return voices.modelTokens[index]
    except ValueError:
        print("Speaker not found, using default...")
        return SPEAKER

The limitation of this approach is that you need to use the exact same voice title as the one listed on https://fakeyou.com/.
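
Used together with the TTS call from earlier, it would look something like this (the voice title below is a placeholder, not a verified entry):

# Resolve the voice title to a model token, then synthesize the speech
token = find_voice_token("Narrator Voice")
fakeyou.FakeYou().say(slide["text"], token).save(path)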

FFmpeg

To create the video we want to concatenate all the images and audio files. Basically, we set the first image as the video source with the first audio file as its audio track, then continue with the second pair, and so on.

We can use the Python bindings for ffmpeg to achieve this. First, we prepare the image and audio inputs in order

import ffmpeg

# Interleave the inputs: image1, audio1, image2, audio2, ...
input_streams = []
for image_file, audio_file in zip(image_files, audio_files):
    input_streams.append(ffmpeg.input(image_file))
    input_streams.append(ffmpeg.input(audio_file))

Then we run concat and write the output file

import os

# v=1, a=1: each segment contributes one video and one audio stream
ffmpeg.concat(*input_streams, v=1, a=1).output(
    os.path.join(output, "video.mp4"),
    pix_fmt="yuv420p",
).run()

The pix_fmt that ffmpeg picks by default does not play back well in browsers, so the best alternative is to set pix_fmt to yuv420p explicitly.
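
If you want to verify which pixel format a generated file actually uses, ffprobe (installed alongside ffmpeg) can report it:

ffprobe -v error -select_streams v:0 -show_entries stream=pix_fmt -of csv=p=0 video.mp4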

Example

To create a first presentation you can use a pipe to send the input into the tool

echo "Please create a presentation about sunflowers." | poetry run slide-gen

Conclusion

This project makes use of multiple AI API endpoints to create different resources and then merges all of them into a final product. It showcases how you can combine different AI services to build tools of your own.