Just when it seemed that Google had made OpenAI and Microsoft dance with the Gemini 1.5 release, OpenAI chief Sam Altman stole the spotlight with Sora – a super-cool text-to-video tool that generates lifelike, hyper-realistic footage of a kind the world has never seen before.
“The difference that is getting more and more obvious between OpenAI and Google every day is one company talks, and the other company shows,” wrote Nikunj Kothari, a partner at Khosla Ventures.
In a way, Google did make OpenAI dance: Altman personally sought prompts from users on X to create buzz around Sora. “We’d like to show you what Sora can do, please reply with captions for videos you’d like to see and we’ll start making some!” wrote Altman, asking users not to hold back on the details or difficulty. The rest is history.
“Have to step out, more videos coming in about 45 mins,” posted Altman, as the wave of requests continued to pour in. Upon his return, a prompt from CRED founder Kunal Shah was among the cherry-picked requests brought to life.
Altman didn’t stop there. He gave much-deserved credit to the team behind Sora, saying, “OpenAI is the most talented and nicest group of people I have ever seen in one place.” He went on to say that they were working on the toughest, most interesting, and most important problems with all the resources in place, focused on building AGI. “You should perhaps consider joining us,” he added.
The team behind Sora is led by OpenAI research scientists Tim Brooks and William Peebles, along with Aditya Ramesh, the creator of DALL·E and head of Videogen.
All this feels like a stopgap from OpenAI as it prepares to release GPT-5 in the coming weeks. “It will be more intelligent, multimodal, and faster,” said Altman at the World Government Summit in Dubai. On a recent episode of Unconfuse Me with Bill Gates, too, he highlighted multimodality as the key aspect of GPT-5, enabling it to process video input and generate new videos.
Sora might be just what GPT-5 needs.
A ChatGPT Moment in Video Generation?
OpenAI’s all-new text-to-video tool can create videos of up to 60 seconds featuring highly detailed scenes, complex camera motion, and multiple characters with vibrant emotions. Many are calling this a ChatGPT moment in video generation.
Sora is not just a video-generation model for OpenAI; it’s a stepping stone to AGI. OpenAI is teaching Sora to understand and simulate the physical world in motion, with the goal of training models that help people solve problems requiring real-world interaction.
How Sora Works
Sora is a diffusion model built on a Transformer architecture, much like an LLM. But where LLMs operate on text tokens, Sora operates on visual patches: small spacetime chunks of video. Patches serve as a highly scalable and effective representation for training generative models on diverse types of videos and images. And just as LLMs learn to predict the next word, Sora is trained to predict the original ‘clean’ patches given noisy input patches.
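To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of patch-based denoising: a video tensor is chopped into flattened spacetime patches, Gaussian noise is added, and a tiny Transformer learns to predict the clean patches. The patch sizes, model dimensions, and single fixed noise level are all illustrative assumptions; the real Sora works on compressed latent representations with a full diffusion noise schedule and text conditioning.

```python
import torch
import torch.nn as nn

def patchify(video, t=2, p=16):
    """Split a video tensor (T, C, H, W) into flattened spacetime patches."""
    T, C, H, W = video.shape
    patches = video.unfold(0, t, t).unfold(2, p, p).unfold(3, p, p)
    # shape is now (T//t, C, H//p, W//p, t, p, p); flatten each patch
    return patches.permute(0, 2, 3, 1, 4, 5, 6).reshape(-1, C * t * p * p)

class ToyDenoiser(nn.Module):
    """A tiny Transformer that maps noisy patches to predicted clean patches."""
    def __init__(self, patch_dim=1536, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)    # patch -> token
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.unembed = nn.Linear(d_model, patch_dim)  # token -> patch

    def forward(self, noisy_patches):  # (batch, num_patches, patch_dim)
        return self.unembed(self.backbone(self.embed(noisy_patches)))

# Toy training step: corrupt clean patches with Gaussian noise at one
# fixed level, then train the model to recover the clean patches.
video = torch.randn(8, 3, 64, 64)              # 8 frames of fake 64x64 RGB
clean = patchify(video).unsqueeze(0)           # (1, 64 patches, 1536)
noisy = clean + 0.5 * torch.randn_like(clean)  # crude, single-step noising
model = ToyDenoiser()
loss = nn.functional.mse_loss(model(noisy), clean)
loss.backward()
print(f"toy denoising loss: {loss.item():.4f}")
```

Because every patch is just a token to the Transformer, the same model can in principle train on videos of varying lengths, resolutions, and aspect ratios, which is what makes the patch representation so scalable.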
Beyond plain text-to-video generation, Sora offers features like animating DALL·E images, extending generated videos, video-to-video editing, and connecting two videos. The possibilities reach further still: it can simulate some aspects of people, animals, and environments from the physical world.

It can also simulate digital worlds, including popular games like Minecraft: Sora can simultaneously control the player with a basic policy while rendering the world and its dynamics in high fidelity.
The introduction of Sora comes on the heels of Google’s recent launch of Lumiere, a text-to-video diffusion model crafted to synthesise realistic, diverse, and coherent motion. Sora translates to ‘sky’ in Japanese, and it is surely a sky full of stars rather than a mere constellation. It is also a wake-up call for other text-to-video generators like RunwayML, Pika, and Stable Video.