Eleven Labs, a voice technology research company that develops AI for speech synthesis and text-to-speech software, recently added voice to videos generated by Sora, showcasing a holistic example of what voice can bring to AI-generated video. This is not the first development of its kind, but it reflects how voice modality is increasingly being brought to the forefront.
It’s Not All Easy with Voice
Voice is considered a uniquely difficult interface modality for AI because modern systems rely on probabilistic generative models, as opposed to the largely deterministic, machine learning-based voice services such as Apple's Siri and other home assistant products.
Technology investor Michael Parekh believes that implementing AI voice modality well on devices will take a long time. “It’s going to be a long road to get it right, likely as long as it took to even get the previous versions like Apple Siri, Google Nest, and Amazon Alexa/Echo especially, to barely tell us the time, set timers, and play some music on demand,” he said.
Voice has also been adopted as a primary user interface in devices such as the Rabbit r1. The Humane Ai Pin, a futuristic, small wearable AI device that can be pinned to one’s clothing, is operated through finger gestures and voice.
SoundHound Inc., an AI voice and speech recognition company that develops technologies for speech recognition, NLP and more, predicted this back in 2020: “Although voice does not need to be the only method of interaction (nor should it be), voice assistants will soon become a primary user interface in a world where people will never casually touch shared surfaces again.”
Voice for Video
The stream of AI voice integration announcements has spiked in the last few weeks. Pika Labs, which creates AI-powered tools for generating and editing videos, came into the limelight a few months ago with $55 million in funding. It recently announced early access to its ‘Lip Sync’ feature for Pro users, which enables voice and dialogue in AI-generated videos.
Alibaba’s EMO (Emote Portrait Alive), an AI generator that produces expressive portrait videos using audio2video diffusion models, was released last week as direct competition to Pika Labs. The company released demo videos in which still images were made to talk and sing with expressive facial gestures.
Voice has also been integrated to simplify podcasts. Eleven Labs partnered with Perplexity to bring out ‘Discover Daily’, a daily podcast narrated by Eleven Labs’ AI-generated voices: another example of how combining voice technology with other functionalities can create tangible use cases.
Theme for 2024
Multimodal AI was among the top three AI trends that Microsoft identified for 2024. “Multimodality has the power to create more human-like experiences that can better take advantage of the range of senses we use as humans, such as sight, speech and hearing,” said Jennifer Marsman, principal engineer in AI (Office of the CTO) at Microsoft.
Microsoft’s efforts in the same direction are reflected in its AI offering, Microsoft Copilot. Catering to enterprises and consumers alike, Copilot’s multimodal capabilities can process various formats, including images, natural language and Bing search data. Multimodal AI also powers Microsoft Designer, a graphic design tool for creating designs, logos, banners and more from a simple text prompt.
Perplexity, the latest AI kid on the block, has also integrated multimodal features: users on a Pro account can upload images and get relevant answers based on them. A common theme runs through all these functionalities, raising the question: is ‘voice’ truly just an added feature?
Big Tech’s Foray Into Voice
OpenAI released ChatGPT’s voice feature, which allows one to easily converse with the model, almost six months after launching the multimodal GPT-4, fully integrating voice capability into the model. Google Gemini, Google’s most powerful AI model, is also multimodal.
While the advancements are promising, risks of misuse still persist, the most prominent being deepfakes. With an increasing number of companies entering the space, adding voice to AI-generated videos only increases the potential for abuse, and stringent copyright and privacy laws may be the only safeguard.