Graphics processing units (GPUs) have become highly sought-after in the AI field, especially for training models. However, when it comes to inference, they might not always be the most efficient or cost-effective choice.
d-Matrix, a startup incorporated in 2019 and headquartered in Santa Clara, California, is developing silicon better suited for generative AI inference.
Currently, only a handful of companies in the world are training AI models. But when it comes to deploying these models, the numbers could run into the millions.
In an interaction with AIM, Sid Sheth, the founder & CEO of d-Matrix, said that 90% of AI workloads today involve training a model, whereas around 10% involve inference.
“But it is rapidly changing to a world, say, five to 10 years from now, when it will be 90% inference, 10% training. The transition is already underway. We are building the world’s most efficient inference computing platform for generative AI because our platform was built for transformer acceleration.
“Moreover, we are seeing a dramatic increase in GPU costs and power consumption. For example, an NVIDIA GPU that consumed 300 watts three years ago now consumes 1.2 kilowatts—a 4x increase. This trend is clearly unsustainable,” Sheth said.
Enterprises Like Small Language Models
The startup is also aligning its business model with enterprises' growing preference for smaller language models, which they can fine-tune and train on their own enterprise data.
Some of the best LLMs, such as GPT-4 or Llama 3.1, are enormous, with hundreds of billions or potentially trillions of parameters, and are trained on broad, general-purpose knowledge of the world.
However, enterprises need models that are specialised, efficient, and cost-effective to meet their specific requirements and constraints.
Over time, we have seen OpenAI, Microsoft and Meta launch small language models (SLMs) such as GPT-4o mini, Phi-3 and Llama 3 8B.
“Smaller models are now emerging, ranging from 2 billion to 100 billion parameters, and they prove to be highly capable—comparable to some of the leading frontier models. This is promising news for the inference market, as these smaller models require less computational power and are therefore more cost-effective,” Sheth pointed out.
The startup believes enterprises don’t need to rely on expensive NVIDIA GPUs for inference. Its flagship product, Corsair, is specifically designed for inferencing generative AI models (100 billion parameters or fewer) and is much more cost-effective compared to GPUs.
“We believe that the majority of enterprises and individuals interested in inference will prefer to work with models of up to 100 billion parameters. Deploying models larger than this becomes prohibitively expensive, making it less practical for most applications,” he said.
(Jayhawk by d-Matrix)
Pioneering Digital In-Memory Compute
The startup was one of the pioneers in developing a digital in-memory compute (DIMC) engine, which it asserts effectively addresses the limitations of traditional AI hardware and is rapidly gaining traction for inference applications.
“In older architectures, inference involves separate memory and computation units. Our approach integrates memory and computation into a single array, where the model is stored in memory and computations occur directly within it,” Sheth revealed.
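Conceptually, the contrast can be sketched in a few lines of Python: in the conventional path the weights must be copied from a separate memory pool to the compute unit before every matrix-vector product, while in an in-memory-compute-style array the weights stay resident where the multiply-accumulate happens. The class and function names below are purely illustrative assumptions and do not represent d-Matrix’s actual hardware or software.

```python
import numpy as np

# Conventional path (illustrative): weights live in a separate memory pool
# and must be moved to the compute unit for every matrix-vector product.
def conventional_inference(weight_memory: np.ndarray, x: np.ndarray) -> np.ndarray:
    weights = weight_memory.copy()       # models the costly memory-to-compute transfer
    return weights @ x                   # compute happens only after the transfer

# DIMC-style path (illustrative): the model is stored inside the compute
# array, so multiply-accumulate runs directly on the resident weights and
# no bulk weight transfer is needed per inference.
class InMemoryComputeArray:
    def __init__(self, weights: np.ndarray):
        self.resident_weights = weights  # model stored "in" the compute array

    def multiply_accumulate(self, x: np.ndarray) -> np.ndarray:
        return self.resident_weights @ x # compute where the weights already live

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((4, 8))
    x = rng.standard_normal(8)
    dimc = InMemoryComputeArray(W)
    # Both paths produce the same result; the difference is where the data moves.
    assert np.allclose(conventional_inference(W, x), dimc.multiply_accumulate(x))
```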
Based on this approach, the startup has developed its first chip called Jayhawk II, which powers its flagship product Corsair.
The startup claims that the Jayhawk II platform delivers up to 150 trillion operations per second (TOPS), a 10-fold reduction in total cost of ownership (TCO), and up to 20 times more inferences per second compared to high-end GPUs.
Rather than focusing on creating very large chips, the startup’s philosophy is to design smaller chiplets and connect them into a flexible fabric.
“One chiplet of ours is approximately 400 square millimetres in size. We connect eight of these chiplets on a single card, resulting in a total of 3,200 square millimetres of silicon. In comparison, NVIDIA’s current maximum is 800 square millimetres.
“This advantage of using chiplets allows us to integrate a higher density of smaller chips on a single card, thereby increasing overall functionality,” Sheth revealed.
This approach allows the startup to scale computational power up or down based on the size of the model: larger models get more computational resources, smaller ones less. According to Sheth, this method is a key innovation from d-Matrix.
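As a rough, hypothetical back-of-the-envelope sketch of this scaling idea, the Python snippet below sizes a deployment by working out how many eight-chiplet cards a model’s weights would occupy. The per-chiplet capacity and the INT8 weight assumption are made-up placeholders for illustration, not d-Matrix specifications.

```python
import math

CHIPLETS_PER_CARD = 8  # per the article: eight chiplets on a single card

def cards_needed(params_billion: float,
                 bytes_per_param: int = 1,            # e.g. INT8 weights (assumption)
                 chiplet_capacity_gb: float = 16.0):  # hypothetical placeholder capacity
    """Back-of-the-envelope card count for hosting a model's weights."""
    model_gb = params_billion * bytes_per_param        # 1B params at 1 byte each ~ 1 GB
    chiplets = math.ceil(model_gb / chiplet_capacity_gb)
    return math.ceil(chiplets / CHIPLETS_PER_CARD)

# A 100B-parameter model at INT8 needs ~100 GB of weight storage,
# i.e. 7 chiplets -> 1 card under these illustrative assumptions.
print(cards_needed(100))
```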
Corsair is Coming in the Second Half of 2024
The startup plans to launch Corsair in November and enter production by 2025. It is already in talks with hyperscalers, AI cloud service providers, sovereign AI cloud providers, and enterprises looking for on-prem solutions.
Sheth revealed that the company has customers in North America, Asia, and the Middle East and has signed a multi-million dollar contract with one of these customers.
While Sheth refrained from revealing who this customer is, citing non-disclosure agreements, Microsoft, a model maker and an investor in d-Matrix, could stand to benefit significantly from the startup’s silicon.
Expanding in India
In 2022, the startup established an R&D centre in Bengaluru. The team there currently numbers around 20-30 people, and the company plans to double its engineering headcount in the coming months.
Previously, Sheth had highlighted his intention to increase the Indian workforce to 25-30% of the company’s global personnel.
(d-Matrix office in Bengaluru, India)
In India, the startup is actively leading a skilling initiative, engaging with universities, ecosystem peers, and entrepreneurs.
Through the initiative, it aims not only to create awareness about AI but also to strengthen the country’s talent pipeline, providing universities with curriculum, updates, guidelines, and more.