We are training AI to understand and simulate how the physical world moves. Our goal is to create models that help people solve problems that require real-world interaction.

Meet Sora, our text-to-video model. Sora can create videos up to a minute long, maintaining both visual quality and the details you request in your instructions.  

Today, we are sharing Sora with red teamers to identify potential risks and harms. At the same time, we are inviting visual artists, designers, and filmmakers to give feedback on how to make Sora more useful for creative work. This collaborative input helps shape Sora for future applications.

To support all of these goals, we are sharing our research progress early. This helps us collaborate with people outside OpenAI, gather diverse feedback, and give the public insight into near-future AI capabilities.  

Sora can create elaborate scenes with several characters, varied movements, and accurate details in both the subject and the background. The model understands what you ask for in your instructions and how these elements work in the real world.  

The model has a strong understanding of language, enabling it to interpret prompts accurately and generate compelling characters. Sora can also create multiple shots within a single video while keeping characters and visual style consistent throughout.

The current model is not perfect. It can struggle to simulate the physics of complex scenes and may miss cause-and-effect details, such as a cookie not showing a bite mark after someone bites it. The model may also mix up spatial details (e.g., left and right) or struggle with precise descriptions of events that unfold over time (e.g., a specific camera trajectory).

Safety 

We will take several important safety steps before making Sora available in OpenAI's products. We are working with red teamers, domain experts in areas like misinformation, hateful content, and bias, who are adversarially testing the model.

We are also creating tools to help spot misleading content. These include a detection classifier that determines whether a video was made by Sora. In the future, if we release Sora in an OpenAI product, we plan to include C2PA metadata.  

Along with developing new safety techniques for Sora, we are also using the safety methods we created for our products that use DALL·E 3. These methods apply to Sora too.

For example, once Sora is part of an OpenAI product, our text classifier will check and reject prompts that violate our usage policies, such as those requesting extreme violence, sexual content, hateful imagery, celebrity likenesses, or someone else's intellectual property. We have also built robust image classifiers that review every frame of a generated video to ensure it complies with our policies before it is shown to the user.
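
To make those two gates concrete, here is a minimal sketch of the flow in Python, assuming hypothetical `prompt_allowed` and `frame_allowed` classifiers. OpenAI's actual classifiers and policy rules are not public, so the keyword check below is only a stand-in.

```python
# A minimal sketch of the two moderation gates described above:
# check the input prompt first, then every frame of the output video.
def prompt_allowed(prompt: str) -> bool:
    # Hypothetical text classifier: reject prompts that violate policy.
    # A real classifier is a trained model, not a keyword list.
    banned = ("extreme violence", "sexual content", "hateful imagery")
    return not any(term in prompt.lower() for term in banned)

def frame_allowed(frame) -> bool:
    # Hypothetical image classifier applied to a single video frame.
    return True  # stand-in: a real classifier scores the frame's content

def moderate(prompt: str, frames) -> bool:
    """Gate both the input prompt and every output frame."""
    if not prompt_allowed(prompt):
        return False
    return all(frame_allowed(f) for f in frames)

print(moderate("a calm beach at sunset", frames=[object()] * 3))  # True
```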

We will engage policymakers, educators, and artists worldwide to understand their concerns and to identify positive use cases for this new technology. Even with extensive research and testing, we cannot predict every beneficial or harmful use. That is why we believe learning from real-world use is a critical part of making AI systems safer over time.

Research Techniques 

Sora is a diffusion model: it generates a video by starting with one that looks like static noise and gradually transforming it by removing the noise over many steps. Sora can generate entire videos all at once or extend generated videos to make them longer. By giving the model foresight of many frames at a time, we address the challenging problem of keeping a subject consistent even when it temporarily leaves the frame.
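
As an illustration of that denoising idea, here is a minimal sketch of a diffusion-style sampling loop in Python. The `denoiser` stand-in, the noise schedule, and the array shapes are all assumptions for illustration; Sora's actual model and schedule are not public.

```python
# A toy ancestral-style sampling loop: start from pure noise and
# repeatedly remove a little noise until a sample emerges.
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, t):
    # Hypothetical stand-in for the learned model: it simply pulls the
    # sample toward zero. A real model predicts the noise (or the clean
    # video) from the noisy input x and the timestep t.
    return x * 0.9

def sample(shape, steps=50):
    x = rng.standard_normal(shape)  # start from pure "static" noise
    for t in reversed(range(steps)):
        x = denoiser(x, t)  # remove a bit of noise at each step
        if t > 0:
            # Re-inject a small amount of fresh noise, as ancestral
            # samplers do, so intermediate steps stay stochastic.
            x = x + 0.05 * rng.standard_normal(shape)
    return x

# Treat a "video" as a (frames, height, width, channels) array.
video = sample((16, 32, 32, 3))
print(video.shape)  # (16, 32, 32, 3)
```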

Like GPT models, Sora uses a transformer architecture, which allows it to scale up its performance.  

We represent videos and images as collections of smaller units of data called patches, each of which is akin to a token in GPT. By unifying how we represent data, we can train diffusion transformers on a wider range of visual data, including images and videos of varying lengths, resolutions, and aspect ratios. Sora builds on past research in DALL·E and GPT models. It uses the recaptioning technique from DALL·E 3, which entails generating highly descriptive captions for the visual training data. As a result, the model can more faithfully follow the user's text instructions in the generated video.
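
To make the patch idea concrete, here is a minimal sketch of splitting a video array into non-overlapping spacetime patches, each flattened into a token-like vector. The patch sizes and grid layout are illustrative assumptions, not Sora's actual tokenization, which is not public.

```python
# Split a (frames, height, width, channels) video into spacetime
# patches, each flattened into one vector -- analogous to tokens.
import numpy as np

def patchify(video, pt=2, ph=4, pw=4):
    """Split a video into flat patch tokens of size pt x ph x pw."""
    f, h, w, c = video.shape
    assert f % pt == 0 and h % ph == 0 and w % pw == 0
    # Reshape into a grid of patches, then flatten each patch.
    x = video.reshape(f // pt, pt, h // ph, ph, w // pw, pw, c)
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)   # (nf, nh, nw, pt, ph, pw, c)
    return x.reshape(-1, pt * ph * pw * c)  # (num_patches, patch_dim)

video = np.zeros((16, 32, 32, 3))
tokens = patchify(video)
print(tokens.shape)  # (512, 96): 8*8*8 patches, each 2*4*4*3 values
```

Because images are just single-frame videos, the same patch grid covers both, which is what lets one model train across mixed durations, resolutions, and aspect ratios.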

Besides generating videos from text instructions, the model can turn a still image into a video by animating its contents with careful attention to detail. It can also extend an existing video or fill in missing frames. You can learn more in our technical report.  
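
One common way such frame-conditioned generation can work, sketched below under stated assumptions, is to clamp the known frames back into the sample after each denoising step. This is a standard inpainting/outpainting trick, not a description of Sora's actual method; the `denoiser` stand-in and shapes are illustrative.

```python
# Extend a clip by sampling extra frames while holding the known
# frames fixed at every denoising step.
import numpy as np

rng = np.random.default_rng(0)

def denoiser(x, t):
    return x * 0.9  # hypothetical stand-in for the learned model

def extend(known, extra_frames=8, steps=50):
    f, h, w, c = known.shape
    x = rng.standard_normal((f + extra_frames, h, w, c))
    mask = np.zeros(f + extra_frames, dtype=bool)
    mask[:f] = True  # the frames we already have
    for t in reversed(range(steps)):
        x = denoiser(x, t)
        x[mask] = known  # clamp known frames back in each step
        if t > 0:
            x[~mask] += 0.05 * rng.standard_normal((extra_frames, h, w, c))
    return x

clip = np.zeros((16, 32, 32, 3))
longer = extend(clip)
print(longer.shape)  # (24, 32, 32, 3)
```

The same masking idea covers the other cases mentioned above: animating a still image (one known frame) or filling in missing frames (a known mask with gaps).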

Sora is a starting point for models that can understand and simulate the real world. We believe this is an important step toward reaching AGI.

Source: Creating video from text 
