Last updated March 29, 2024

OpenAI’s Voice Engine Can Recreate Human Voices with Emotions

Voice Engine can create emotive and realistic voices with a single 15-second sample.

Published on March 29, 2024

by Siddharth Jindal

OpenAI has announced Voice Engine, its latest model, built to generate natural-sounding speech from text input and a single 15-second audio sample. Notably, the voices it produces from this brief sample are emotive and realistic.

We're sharing our learnings from a small-scale preview of Voice Engine, a model which uses text input and a single 15-second audio sample to generate natural-sounding speech that closely resembles the original speaker. https://t.co/yLsfGaVtrZ
— OpenAI (@OpenAI) March 29, 2024

OpenAI said it began developing Voice Engine in late 2022 and first used it to power the preset voices in its text-to-speech API, ChatGPT Voice, and Read Aloud features. However, due to concerns about potential misuse, the company has not yet released the model to the public, as is also the case with its text-to-video generation model, Sora.

“We recognize that generating speech that resembles people’s voices has serious risks, which are especially top of mind in an election year. We are engaging with U.S. and international partners from across government, media, entertainment, education, civil society and beyond to ensure we are incorporating their feedback as we build,” the company wrote in the blog post.

OpenAI has been privately testing Voice Engine with a small group of trusted partners. Early trials have yielded promising applications across various sectors. These include:

Enhancing Education: Age of Learning, an education technology company, utilises Voice Engine to create pre-scripted voice-over content for reading assistance among non-readers and children. The integration of Voice Engine and GPT-4 enables personalised real-time interactions, expanding content accessibility and audience reach.

Global Content Translation: HeyGen, an AI visual storytelling platform, employs Voice Engine for translation of videos and podcasts into multiple languages while preserving the native accent of the original speaker. This innovation facilitates global content dissemination and audience engagement.

Community Health Services: Dimagi leverages Voice Engine and GPT-4 to enhance essential service delivery in remote areas, particularly in healthcare settings. Interactive feedback in local languages helps community health workers provide counselling and support services effectively.

Assistive Communication: Livox, an AI communication app, integrates Voice Engine to offer non-robotic and customisable voices for individuals with speech-related disabilities. This advancement empowers users to express themselves authentically across different languages and communication contexts.

Clinical Applications: The Norman Prince Neurosciences Institute at Lifespan is exploring Voice Engine’s potential in clinical settings to restore speech for patients with speech impairments caused by medical conditions. The short audio sample requirement makes Voice Engine a viable tool for speech rehabilitation and patient care.

Siddharth Jindal

Siddharth is a media graduate who loves to explore tech through journalism and to put forward ideas worth pondering in the era of artificial intelligence.