In the current era of artificial intelligence, computers can generate their own “art” by way of diffusion models, iteratively adding structure to a noisy initial state until a clear image or video emerges. Diffusion models have suddenly grabbed a seat at everyone’s table: enter a few words and experience instantaneous, dopamine-spiking dreamscapes at the intersection of reality and fantasy. Behind the scenes, however, lies a complex, time-intensive process requiring numerous iterations for the algorithm to perfect the image.
Researchers at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) have introduced a new framework that addresses these limitations by simplifying the multistep process of existing diffusion models into a single step. It does so through a kind of teacher-student model: a new computer model is taught to mimic the behavior of more complicated, original models that generate images. This approach, known as Distribution Matching Distillation (DMD), retains the quality of the generated images and allows for much faster generation.
“Our work is a novel method that accelerates current diffusion models such as Stable Diffusion and DALLE-3 by a factor of 30,” says Tianwei Yin, an MIT doctoral student in electrical engineering and computer science, a CSAIL affiliate, and the lead researcher on the DMD framework. “These advances not only significantly reduce computation time, but also maintain, if not surpass, the quality of the generated visual content. Theoretically, the approach marries the principles of generative adversarial networks (GANs) with those of diffusion models, achieving visual content generation in a single step, in stark contrast to the hundred steps of iterative refinement required by current diffusion models. This could potentially be a new generative modeling method with superior speed and quality.”
This single-step diffusion model could enhance design tools, enabling quicker content creation and potentially supporting advances in drug discovery and 3D modeling, where speed and efficiency are key.
Distribution dreams
DMD cleverly has two components. First, it uses a regression loss, which anchors the mapping to ensure a coarse organization of the space of images, making training more stable. Next, it uses a distribution matching loss, which ensures that the probability of generating a given image with the student model corresponds to its real-world occurrence frequency. To do this, it leverages two diffusion models that act as guides, helping the system understand the difference between real and generated images and making it possible to train the speedy one-step generator.
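To make those two ingredients concrete, here is a minimal, illustrative PyTorch sketch of the regression-loss half. It is not the authors’ released code: the toy `Generator`, the tensor sizes, and the use of a plain MSE against teacher outputs precomputed offline are all stand-in assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Generator(nn.Module):
    """Toy stand-in for the one-step student generator."""
    def __init__(self, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noise: torch.Tensor) -> torch.Tensor:
        return self.net(noise)

generator = Generator()

# Regression loss: a fixed bank of (noise, teacher image) pairs anchors the
# student's noise-to-image mapping, giving the space of images a coarse
# organization and keeping training stable.
fixed_noise = torch.randn(8, 3, 32, 32)     # noise inputs, held fixed
teacher_images = torch.randn(8, 3, 32, 32)  # placeholder: what the teacher
                                            # produced from the same noise
regression_loss = F.mse_loss(generator(fixed_noise), teacher_images)
```

The distribution matching half, which is what lets the student collapse the sampling loop to a single step, is sketched after the next paragraph.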
The system achieves faster generation by training a new network to minimize the distribution divergence between its generated images and those in the training dataset used by traditional diffusion models. “Our key insight is to use two diffusion models to approximate a gradient that guides the refinement of the new model,” says Yin. “In this way, we distill the knowledge of the original, more complex model into the simpler, faster one, while bypassing the notorious instability and mode-collapse problems of GANs.”
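That gradient approximation can be sketched as well. In the snippet below, `real_score` stands for the frozen teacher (a diffusion model trained on real data) and `fake_score` for a second diffusion model continually fine-tuned on the generator’s own samples; the linear noising schedule and the detached-target trick are simplifications for illustration, not the paper’s exact formulation.

```python
import torch
import torch.nn.functional as F

def dmd_loss(x_gen, real_score, fake_score, num_steps=1000):
    """Sketch of a distribution matching loss for a one-step generator.

    x_gen:      images produced by the student generator (requires grad)
    real_score: frozen diffusion model trained on real data (the teacher)
    fake_score: diffusion model tracking the generator's current outputs
    """
    b = x_gen.shape[0]
    with torch.no_grad():
        # Noise the generated images to a random diffusion timestep.
        t = torch.randint(1, num_steps, (b,))
        alpha = 1.0 - t.float().view(b, 1, 1, 1) / num_steps
        x_t = alpha.sqrt() * x_gen + (1.0 - alpha).sqrt() * torch.randn_like(x_gen)

        # The gap between the two guides' denoised predictions approximates
        # the gradient pushing the generator's distribution toward the data.
        grad = fake_score(x_t, t) - real_score(x_t, t)
        target = x_gen - grad

    # An MSE against the detached, shifted target injects `grad` into
    # backprop: d(loss)/d(x_gen) works out to grad / x_gen.numel().
    return 0.5 * F.mse_loss(x_gen, target)

# Purely illustrative usage with identity "denoisers":
toy = lambda x_t, t: x_t
x = torch.randn(4, 3, 32, 32, requires_grad=True)
dmd_loss(x, toy, toy).backward()
```

In a full training loop, `fake_score` would itself be periodically updated with an ordinary denoising loss on fresh generator samples, which is how the system keeps tracking the difference between real and generated images.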
Yin and colleagues simplified the process by using a pre-trained network for their new student model. By copying and fine-tuning the parameters of the original model, the team achieved fast training convergence of a new model capable of producing high-quality images based on the same architecture. “This can be combined with other system optimizations based on the original architecture to further accelerate the creation process,” adds Yin.
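Warm-starting is the easiest piece to show. Continuing with the toy `Generator` from the first sketch, and assuming the student and teacher share an architecture as described above, initialization reduces to a parameter copy; the optimizer and learning rate below are illustrative only.

```python
import copy
import torch

teacher = Generator()             # a pre-trained model in practice; toy here
student = copy.deepcopy(teacher)  # same architecture, same starting weights
# equivalently: student.load_state_dict(teacher.state_dict())

# Fine-tuning from the teacher's weights converges far faster than training
# the one-step generator from scratch.
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)
```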
When tested against the usual methods on a wide range of benchmarks, DMD showed consistent performance. On the popular benchmark of generating images based on specific classes in ImageNet, DMD is the first one-step diffusion technique to produce pictures roughly on par with those from the original, more complex models, with an ultra-close Fréchet inception distance (FID) score of just 0.3, which is impressive since FID judges the quality and diversity of the generated images. Additionally, DMD excels at industrial-scale text-to-image generation, achieving state-of-the-art one-step generation performance. There is still a slight quality gap on more demanding text-to-image applications, suggesting room for future improvement.
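For context on the metric, FID measures the distance between Gaussian fits to deep-network features of real and generated image sets; lower is better. A minimal implementation of the distance itself (omitting the Inception-v3 feature extraction, with illustrative names) might look like this:

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Fréchet distance between Gaussians N(mu1, sigma1) and N(mu2, sigma2).

    In FID, mu/sigma are the mean and covariance of Inception-v3 features
    computed over the real and generated image sets.
    """
    diff = mu1 - mu2
    covmean = linalg.sqrtm(sigma1 @ sigma2)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # discard numerical noise
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

# Sanity check: identical distributions are at distance zero.
mu, sigma = np.zeros(4), np.eye(4)
assert abs(frechet_distance(mu, sigma, mu, sigma)) < 1e-6
```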
Additionally, the performance of DMD-generated images is intrinsically linked to the capabilities of the teacher model used during the distillation process. In its current form, which uses Stable Diffusion v1.5 as the teacher model, the student inherits limitations such as rendering detailed depictions of text and small faces, suggesting that DMD-generated images could be further enhanced by more advanced teacher models.
“Reducing the number of iterations has been the holy grail since the inception of diffusion models,” says Frédo Durand, MIT professor of electrical engineering and computer science, CSAIL principal investigator, and a lead author of the paper. “We are very excited to finally enable single-step image generation, which can significantly reduce compute costs and accelerate the process.”
“Finally, a paper that successfully combines the versatility and high visual quality of diffusion models with the real-time performance of GANs,” says Alexei Efros, professor of electrical engineering and computer science at the University of California at Berkeley, who was not involved in this study. “I expect this work to open up fantastic possibilities for high-quality, real-time visual editing.”
Yin and Durand’s co-authors are William T. Freeman, MIT professor of electrical engineering and computer science and CSAIL principal investigator, as well as Adobe research scientists Michaël Gharbi SM '15, PhD '18; Richard Zhang; Eli Shechtman; and Taesung Park. Their work was supported, in part, by U.S. National Science Foundation grants (including one for the Institute for Artificial Intelligence and Fundamental Interactions), the Singapore Defense Science and Technology Agency, and funding from the Gwangju Institute of Science and Technology and Amazon. Their work will be presented at the Conference on Computer Vision and Pattern Recognition in June.