NVIDIA’s Text-to-Image Model eDiffi Completes the Picture

What eDiffi does differently is that it trains a group of expert denoisers, each specialised for a different interval of the denoising process.

AI text-to-image generators have become as commonplace as having an ‘opinion’; if everyone has an opinion, every tech company worth its salt has its own AI text-to-image generator. All the big tech companies have one: Microsoft-backed OpenAI has ‘DALL·E 2’, Google has ‘Imagen’ and Meta has ‘Make-A-Scene’, while buzzy startups like Emad Mostaque’s Stability AI have ‘Stable Diffusion’. Now, US semiconductor design giant NVIDIA has also entered the mix with its text-to-image model called ‘ensemble diffusion for images’, or ‘eDiffi’. However, unlike Stable Diffusion, which is open source, and DALL·E 2, which is publicly accessible, eDiffi isn’t open to the public for use yet.

Some old, some new

Diffusion models synthesise images through an iterative denoising process that gradually turns random noise into an image. Traditionally, a single model is trained to denoise across the entire noise distribution. What eDiffi does differently is that it trains a group of expert denoisers, each specialised for a different interval of the denoising process. Alongside the announcement, NVIDIA released a research paper titled ‘eDiff-I: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers’, which claimed that this improves generation quality without making sampling any slower.

Denoising involves solving a reverse differential equation, during which a denoising network is called many times. NVIDIA wanted the model to be easily scalable, which is harder when adding capacity to each denoising step also increases the test-time computational complexity of sampling. The study found that eDiffi’s ensemble achieved this scaling goal without eating into test-time computational complexity, since only one expert is active at any given step.
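A minimal sketch of that routing idea, assuming hypothetical expert networks and a simple Euler sampler over a decreasing noise schedule; none of the names below come from NVIDIA's code:

```python
import torch

class ExpertEnsemble:
    """Toy ensemble-of-denoisers sampler: one expert per noise interval."""

    def __init__(self, experts, boundaries):
        # experts: denoising networks ordered from the high-noise expert to the
        #          low-noise expert
        # boundaries: descending noise-level thresholds, one per expert except
        #             the last, which handles whatever remains
        self.experts = experts
        self.boundaries = boundaries

    def pick_expert(self, sigma):
        # Route the current noise level to the expert trained for that interval.
        for expert, threshold in zip(self.experts, self.boundaries):
            if sigma >= threshold:
                return expert
        return self.experts[-1]

    @torch.no_grad()
    def sample(self, x, sigmas, text_embedding):
        # x starts as pure Gaussian noise; sigmas is a decreasing noise schedule.
        for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
            expert = self.pick_expert(sigma)
            denoised = expert(x, sigma, text_embedding)   # expert predicts the clean image
            d = (x - denoised) / sigma                    # Euler step of the reverse ODE
            x = x + d * (sigma_next - sigma)
        return x
```

Each step still calls exactly one network, which is why the ensemble can add capacity without increasing sampling time.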

Which model is the best?

The paper concluded that eDiffi had managed to outperform competitors like DALL·E 2, Make-A-Scene, GLIDE and Stable Diffusion on the basis of the Fréchet Inception Distance, or FID, a metric for evaluating the quality of AI-generated images. eDiffi also achieved a slightly better FID score than Google’s Imagen and Parti. However, while each new model seems to better the previous one in terms of accuracy and quality, it must be noted that researchers cherry-pick the examples they showcase as their best illustrations.
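For context, FID measures how far the Inception-v3 feature statistics of generated images are from those of real images, so a lower score means the generated distribution is closer to the real one. A rough sketch of the standard computation (the generic metric, not NVIDIA's evaluation code) might look like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(real_feats, fake_feats):
    """Both inputs are (N, 2048) arrays of Inception-v3 pool features."""
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)      # matrix square root of the covariance product
    if np.iscomplexobj(covmean):        # numerical error can leave tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2 * covmean))
```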

The model’s best configuration was then compared with DALL·E 2 and Stable Diffusion, both publicly available text-to-image generative models. The experiment found that the other models tended to mix up attributes from different entities or ignore some attributes altogether, while eDiffi correctly modelled the attributes of all entities.

(In the comparison images, the first is generated by Stable Diffusion, the second by DALL·E 2 and the third by eDiffi.)

When it came to generating text, which has been a sticky spot for most text-to-image generators, both Stable Diffusion and DALL·E 2 tended to misspell or even ignore words, while eDiffi was able to render the text accurately.

With long descriptions, eDiffi was also shown to handle long-range dependencies much better than DALL·E 2 and Stable Diffusion, indicating that it retains more of the prompt than the other two.

New features added

NVIDIA’s eDiffi feeds its text-to-image model with a combination of pretrained text encoders: the CLIP text encoder, which aligns text embeddings with matching image embeddings, and the T5 text encoder, which is trained purely for language modelling. While older models pick one, with DALL·E 2 using only CLIP and Imagen using T5, eDiffi uses both encoders in the same model.

This enables eDiffi to produce quite different images from the same text input depending on which embeddings are used. CLIP lends a stylised look to the generated images, but the output often misses details from the text. Images generated from T5 text embeddings, on the other hand, capture individual objects better rather than an overall style. By using the two together, eDiffi produces images with both qualities.
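A hedged sketch of how two frozen text encoders can condition one model, using Hugging Face's public CLIP and T5 checkpoints as smaller stand-ins; the projection layer and the way the two embeddings are fused here are illustrative assumptions, not NVIDIA's released code:

```python
import torch
from transformers import CLIPTextModel, CLIPTokenizer, T5EncoderModel, T5Tokenizer

# Public checkpoints standing in for eDiffi's frozen encoders.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_enc = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
t5_tok = T5Tokenizer.from_pretrained("t5-large")
t5_enc = T5EncoderModel.from_pretrained("t5-large")

# Illustrative fusion: project T5 features to CLIP's width and concatenate.
proj = torch.nn.Linear(1024, 768)

@torch.no_grad()
def encode_prompt(prompt: str) -> torch.Tensor:
    # CLIP text features are trained to align with images, so they carry
    # style and overall composition well.
    clip_ids = clip_tok(prompt, padding="max_length", truncation=True,
                        return_tensors="pt").input_ids
    clip_emb = clip_enc(clip_ids).last_hidden_state          # (1, 77, 768)

    # T5 features come from pure language modelling, so they preserve
    # fine-grained details and wording from the prompt.
    t5_ids = t5_tok(prompt, padding="max_length", max_length=77,
                    truncation=True, return_tensors="pt").input_ids
    t5_emb = t5_enc(t5_ids).last_hidden_state                # (1, 77, 1024)

    # Concatenating along the token axis lets a denoiser cross-attend to both.
    return torch.cat([clip_emb, proj(t5_emb)], dim=1)        # (1, 154, 768)
```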

The model was also tested on standard datasets such as MS-COCO, which demonstrated that CLIP+T5 embeddings lead to much better trade-off curves than either encoder used individually. On the Visual Genome dataset, T5 embeddings alone performed better than CLIP embeddings alone; the study found that the more descriptive the text prompt, the more T5 outperforms CLIP. Overall, however, the blend of the two worked best.

This setup also allows eDiffi to offer what the team calls ‘style transfer’. In the first step, CLIP image embeddings are extracted from a reference image and used as a style reference vector. In the second step, style conditioning is enabled, and the model generates an image that follows both the reference style and the caption. In the third step, style conditioning is disabled, and images are generated in the model’s natural style.
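A small sketch of how the step-one style vector could be obtained with the public CLIP vision encoder; the file name is a placeholder, and since eDiffi's way of injecting the vector into its denoisers is only described in the paper, the remaining steps are left as comments:

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Step 1: extract a CLIP image embedding from the style reference image.
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
vision = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")

reference = Image.open("reference_painting.png")            # placeholder file name
pixels = processor(images=reference, return_tensors="pt").pixel_values

with torch.no_grad():
    style_vector = vision(pixels).pooler_output             # (1, 1024) style reference vector

# Step 2: with style conditioning enabled, a sampler would pass style_vector to the
#         denoisers alongside the caption, pulling the output towards the reference's look.
# Step 3: with style conditioning disabled, the same caption is sampled without the
#         vector, producing images in the model's natural style.
```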

The study also generated images using CLIP text embeddings and T5 text embeddings separately. Images generated by the former often contained the correct foreground objects but with blurry fine-grained details, while images generated by the latter showed incorrect objects at times.

eDiffi also introduced a feature called ‘Paint with Words’, which lets users control where objects appear in the image: they select a phrase within the text prompt and scribble the corresponding region onto a canvas. The model then produces an image that matches both this input map and the caption.
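The paper describes this as boosting the cross-attention scores between the scribbled image region and the tokens of the selected phrase. A toy sketch of that mechanism follows; the function name, mask format and weight are illustrative assumptions, not NVIDIA's implementation:

```python
import torch

def paint_with_words_attention(scores, token_region_masks, weight=0.3):
    """
    Toy version of phrase-to-region control in cross-attention.

    scores: (heads, image_tokens, text_tokens) raw cross-attention logits
    token_region_masks: dict mapping a text-token index to a (image_tokens,)
        binary mask, obtained by flattening the user's scribble for the phrase
        tied to that token
    weight: how strongly the scribbled region is pulled towards the phrase
    """
    scores = scores.clone()
    for token_idx, mask in token_region_masks.items():
        # Boost the logits inside the scribbled region so those image tokens
        # attend more strongly to the selected phrase.
        scores[:, :, token_idx] += weight * mask
    return scores.softmax(dim=-1)

# Example: 8 heads, a 64x64 latent (4096 image tokens), 77 text tokens, with the
# phrase at text position 5 scribbled over the top-left quarter of the canvas.
scores = torch.randn(8, 4096, 77)
mask = torch.zeros(64, 64)
mask[:32, :32] = 1.0
attn = paint_with_words_attention(scores, {5: mask.flatten()})
```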


Poulomi Chatterjee

Poulomi is a Technology Journalist with Analytics India Magazine. Her fascination with tech and eagerness to dive into new areas led her to the dynamic world of AI and data analytics.