Text-to-speech (TTS) models are easier to build for English than for other languages. To fill this gap, IIT Guwahati alumni Sudarshan Kamath and Akshat Mandloi started smallest.ai and decided to build one for Hindi as well. They call it AWAAZ.
With state-of-the-art Mean Opinion Scores (MOS) in Hindi and Indian English, AWAAZ can fluently converse in over ten accents, reflecting the diverse linguistic landscape of India.
The inception of AWAAZ was driven by the founders’ recognition of a gap in the market for high-quality, affordable TTS models for Indian languages. “When we started building, we realised that the models required for a voice bot were not mature for Indian languages. Existing models for non-English languages were nowhere close to production,” explained Kamath in an exclusive interaction with AIM.
Citing OpenAI’s GPT-4o as an example of a generalised model, Kamath said that the company aims to build specialised models that can be tailored for customer support, even for small businesses. AWAAZ is also cheaper than other Indian-language TTS offerings, such as those from Veed.io and Murf.ai.
Janta ki AWAAZ
AWAAZ stands out for its single-shot voice cloning capability, which can replicate a voice from a mere five-second audio clip. The model also boasts a low streaming latency of just 200 milliseconds.
To make this technology accessible, smallest.ai has set an introductory price of INR 999 for 500,000 characters. The company positions AWAAZ as a cost-effective solution and claims it is ten times cheaper than competitors such as ElevenLabs.
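To put the introductory price in per-unit terms, here is a minimal sketch of the arithmetic. It assumes simple pro-rated (linear) pricing, which the article does not confirm; the function name `cost_inr` and the constants are illustrative only.

```python
# Introductory pricing quoted in the article: INR 999 for 500,000 characters.
PLAN_PRICE_INR = 999
PLAN_CHARACTERS = 500_000


def cost_inr(num_characters: int) -> float:
    """Pro-rated cost in INR, assuming linear pricing (an assumption)."""
    return PLAN_PRICE_INR * num_characters / PLAN_CHARACTERS


# Cost of synthesising 1,000 characters under that assumption.
print(f"INR {cost_inr(1_000):.3f} per 1,000 characters")
```

Under that assumption, the plan works out to roughly INR 2 per 1,000 characters.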
Kamath said that the language model has about 750 million parameters and was built by leveraging existing open-source models.
Kamath attributes the affordability of AWAAZ to their focus on data quality and model efficiency. “Our model is much smaller than those of competitors like ElevenLabs. Despite this, we achieve high-quality speech because our data is highly refined,” he explained.
smallest.ai uses AWS for cloud services, although they remain flexible about potential future partnerships.
The Dataset of AWAAZ was the Critical Part
Kamath and Mandloi launched smallest.ai in October 2023. The initial goal was to create a voice bot for India capable of qualifying leads and handling customer support. This led to the development of SAPIEN, a voice bot for sales, marketing, and customer support.
However, the lack of robust TTS models for Indian languages led them to focus on core model development, resulting in the creation of AWAAZ. “The data quality for TTS models reduced drastically when we moved away from English to other languages. It is worse for South Asian languages,” said Kamath.
The Indic data problem has been highlighted several times by researchers when speaking with AIM, be it for text or voice models.
“We spent a lot of time perfecting the dataset, using over thousands of hours of audio from various people from different states in India. We focused on data quality to ensure a diverse representation, making our model suitable for production-level deployment,” Kamath said.
The team invested significant resources in this endeavour, dedicating over six months purely to building the dataset and iterating on its quality.
AWAAZ is currently limited to Hindi and Indian English, but Kamath emphasises the importance of understanding the quality of the output. “The most difficult part is the data. If you tried our model in Tamil, it might respond a little, but we don’t advertise that capability because it’s not up to our standards yet,” he said.
Way Forward
The company’s ambitious roadmap includes expanding the model’s capabilities. “Our next step is moving closer to GPT-4o-like abilities for Indian languages, where the model can generate answers with a voice, enhancing the interactive experience,” Kamath revealed.
Additionally, smallest.ai is exploring the development of voice-to-voice models, aiming to offer custom solutions for specific business needs such as lead qualification and customer support.
The founders are committed to advancing AI’s understanding of multimodal data.
“We’ve been fascinated by AI’s potential to understand more than just text. Speech is one of those areas where AI can truly start to seem human, much like in the movie ‘Her’,” Kamath said, reflecting on the broader vision that drives their work.