StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation

Due to its vast potential and commercialization opportunities, particularly in gaming, broadcasting, and video streaming, the Metaverse is currently one of the fastest-growing technologies. Modern Metaverse applications utilize AI frameworks, including computer vision and diffusion models, to enhance their realism. A significant challenge for Metaverse applications is integrating various diffusion pipelines that provide low latency and high throughput, ensuring effective interaction between humans and these applications.

Today's diffusion-based AI frameworks excel in creating images from textual or image prompts but fall short in real-time interactions. This limitation is particularly evident in tasks that require continuous input and high throughput, such as video game graphics, Metaverse applications, broadcasting, and live video streaming.

In this article, we will discuss StreamDiffusion, a real-time diffusion pipeline developed to generate interactive and realistic images, addressing the current limitations of diffusion-based frameworks in tasks involving continuous input. StreamDiffusion transforms the original sequential denoising process into batched denoising, aiming to enable high throughput and fluid streams. This approach moves away from the traditional wait-and-interact method used by existing diffusion-based frameworks. In the upcoming sections, we will look at the StreamDiffusion framework in detail, exploring how it works, its architecture, and its results compared against current state-of-the-art frameworks. Let's get started.

StreamDiffusion: An Introduction to Real-Time Interactive Generation

Metaverse applications are performance intensive: they process large amounts of data, including text, animations, videos, and images, in real time to provide users with their trademark interactive interfaces and experiences. Modern Metaverse applications rely on AI-based frameworks, including computer vision, image processing, and diffusion models, to attain low latency and high throughput and so ensure a seamless user experience. Currently, most of these applications reduce the number of denoising iterations to raise throughput and improve real-time interactivity. The common strategy is either to re-frame the diffusion process with neural ODEs (Ordinary Differential Equations) or to reduce multi-step diffusion models to a few steps or even a single step. Although this approach delivers satisfactory results, it has limitations, including limited flexibility and high computational costs.

StreamDiffusion, on the other hand, is a pipeline-level solution that starts from an orthogonal direction and enhances the framework's ability to generate interactive images in real time while maintaining high throughput. Its strategy is simple: instead of denoising each input sequentially, one step after another, the framework batches the denoising steps. The strategy takes inspiration from asynchronous processing, as the framework does not have to wait for the first denoising stage to complete before it can move on to the second stage, as demonstrated in the following image. To handle the mismatch between the U-Net processing frequency and the input frequency, the StreamDiffusion framework implements a queue strategy to cache the inputs and the outputs.

Although the StreamDiffusion pipeline takes inspiration from asynchronous processing, it differs in that it relies on GPU parallelism, which allows the framework to use a single UNet component to denoise a batch of noisy latent features. Furthermore, existing diffusion-based pipelines emphasize the given prompt in the generated images by incorporating classifier-free guidance, which burdens them with redundant computational overhead. To avoid the same issue, the StreamDiffusion pipeline implements Residual Classifier-Free Guidance (RCFG), an approach that uses a virtual residual noise to approximate the negative condition, allowing the framework to calculate the negative-condition noise only in the initial stage of the process. Additionally, the StreamDiffusion pipeline reduces the computational requirements of a traditional diffusion pipeline by implementing a stochastic similarity filtering strategy, which decides whether the pipeline should process an input image by computing the similarity between consecutive inputs.

The StreamDiffusion framework builds on prior work on diffusion models and accelerated diffusion models.

Diffusion models are known for their exceptional image generation capabilities and the amount of control they offer. Owing to these capabilities, diffusion models have found applications in image editing, text-to-image generation, and video generation. Furthermore, the development of consistency models has shown that sampling efficiency can be improved without compromising the quality of the generated images, which opens new doors to expand the applicability and efficiency of diffusion models by reducing the number of sampling steps. Although extremely capable, diffusion models have a major limitation: slow image generation. To tackle this limitation, developers introduced accelerated diffusion models, approaches that either require no additional training or use predictor-corrector strategies and adaptive step-size solvers to increase output speed.

The distinguishing factor between StreamDiffusion and traditional diffusion-based frameworks is that while the latter focus primarily on the low latency of individual models, the former introduces a pipeline-level approach designed to achieve high throughput, enabling efficient interactive diffusion.

StreamDiffusion: Working and Architecture

The StreamDiffusion pipeline is a real-time diffusion pipeline developed for generating interactive and realistic images, and it employs six key components: Residual Classifier-Free Guidance (RCFG), the Stream Batch strategy, a Stochastic Similarity Filter, an input-output queue, model acceleration tools with a tiny autoencoder, and a pre-computation procedure. Let's talk about these components in detail.

Stream Batch Strategy

Traditionally, the denoising steps in a diffusion model are performed sequentially, so the U-Net processing time grows in proportion to the number of denoising steps. However, more denoising steps are needed to generate high-fidelity images, and the StreamDiffusion framework introduces the Stream Batch strategy to overcome the resulting high latency in interactive diffusion frameworks.

In the Stream Batch strategy, the sequential denoising operations are restructured into batched processes, with the batch size corresponding to the predetermined number of denoising steps. With this approach, a single pass through the UNet advances every element in the batch one step further in the denoising sequence. By applying the Stream Batch strategy iteratively, the input images encoded at timestep “t” are transformed into their respective image-to-image results at timestep “t+n”, thus streamlining the denoising process.
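The batching idea above can be illustrated with a toy simulation. This is a minimal sketch, not the StreamDiffusion implementation: latents are plain floats, and `toy_unet_batch` and `StreamBatch` are hypothetical names introduced here, with the "UNet" simply halving each value to stand in for one denoising step.

```python
from typing import Callable, List, Optional, Tuple


def toy_unet_batch(latents: List[float], steps: List[int]) -> List[float]:
    """Hypothetical stand-in for ONE batched UNet call: every latent advances one step."""
    return [x * 0.5 for x in latents]


class StreamBatch:
    """Keeps up to `num_steps` frames in flight, each at a different denoising step."""

    def __init__(
        self,
        num_steps: int,
        unet: Callable[[List[float], List[int]], List[float]] = toy_unet_batch,
    ) -> None:
        self.num_steps = num_steps
        self.unet = unet
        self.slots: List[Tuple[float, int]] = []  # (latent, steps already applied)
        self.unet_calls = 0

    def push(self, new_latent: float) -> Optional[float]:
        """Add a new frame, run one batched UNet pass, return a finished frame if any."""
        self.slots.append((new_latent, 0))
        latents = [x for x, _ in self.slots]
        steps = [s for _, s in self.slots]
        denoised = self.unet(latents, steps)  # one call advances the whole batch
        self.unet_calls += 1
        self.slots = [(x, s + 1) for x, s in zip(denoised, steps)]
        # The oldest frame sits at the front; emit it once it has had all its steps.
        if self.slots[0][1] == self.num_steps:
            return self.slots.pop(0)[0]
        return None


stream = StreamBatch(num_steps=3)
outputs = [stream.push(frame) for frame in [8.0, 16.0, 24.0, 32.0]]
print(outputs, stream.unet_calls)
```

In this toy setup, a sequential loop would need three UNet calls per frame, whereas the batched version issues one call per incoming frame once the batch has filled.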

Residual Classifier Free Guidance

Classifier-Free Guidance (CFG) is an algorithm that performs vector calculations between the original conditioning term and a negative conditioning (or unconditional) term to enhance the effect of the original conditioning. The algorithm strengthens the effect of the prompt, but at a cost: to compute the negative conditioning residual noise, each input latent variable must be paired with the negative conditioning embedding and passed through the UNet at inference time.

To tackle this issue with the Classifier-Free Guidance algorithm, the StreamDiffusion framework introduces the Residual Classifier-Free Guidance (RCFG) algorithm, which aims to reduce the computational cost of the additional UNet inference for the negative conditioning embedding. First, the encoded input latent is transferred to the noise distribution using values determined by the noise scheduler. Once the latent consistency model has been applied, the algorithm can predict the data distribution and use the CFG residual noise to generate the noise distribution for the next step.
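To make the difference concrete, here is a minimal sketch that contrasts the standard CFG combination with an RCFG-style one. The standard CFG formula is well established. The RCFG part is an assumption based on the description above: the "virtual" negative noise is recovered analytically from the noisy latent and a reference clean latent (for example the encoded input image) using the usual forward-noising relation, instead of a second UNet pass. All function names and the `delta` damping factor are hypothetical and introduced only for illustration.

```python
import math
from typing import List

Vector = List[float]  # stand-in for a latent or noise tensor


def cfg_noise(eps_cond: Vector, eps_uncond: Vector, guidance_scale: float) -> Vector:
    """Standard CFG: needs eps_uncond, i.e. an extra UNet pass per step."""
    return [u + guidance_scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def virtual_residual_noise(x_t: Vector, x0_ref: Vector, alpha_bar_t: float) -> Vector:
    """Assumed RCFG ingredient: invert x_t = sqrt(a) * x0 + sqrt(1 - a) * eps for eps.

    No UNet call is involved; x0_ref is a reference clean latent (an assumption).
    """
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    return [(xt - signal * x0) / noise for xt, x0 in zip(x_t, x0_ref)]


def rcfg_noise(
    eps_cond: Vector, eps_virtual: Vector, guidance_scale: float, delta: float = 1.0
) -> Vector:
    """RCFG-style combination: the virtual residual replaces the unconditional noise."""
    return [
        delta * v + guidance_scale * (c - delta * v)
        for c, v in zip(eps_cond, eps_virtual)
    ]


# Toy check: build x_t from a known x0 and eps, then recover eps without a model call.
x0, eps, a_bar = [1.0, -1.0], [2.0, 0.5], 0.25
x_t = [math.sqrt(a_bar) * a + math.sqrt(1.0 - a_bar) * e for a, e in zip(x0, eps)]
recovered = virtual_residual_noise(x_t, x0, a_bar)
guided = rcfg_noise(eps_cond=[1.0, 2.0], eps_virtual=recovered, guidance_scale=1.5)
```

The point of the sketch is the cost profile: `cfg_noise` needs two model outputs per step, while the RCFG-style path needs one model output plus cheap element-wise arithmetic.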

Input Output Queue

The main bottleneck in high-speed image generation frameworks is their neural network modules, including the UNet and VAE components. To maximize efficiency and overall output speed, image generation frameworks move operations that do not need the neural network modules, such as image pre-processing and post-processing, outside of the pipeline, where they are processed in parallel. Furthermore, when handling the input image, the pipeline carefully executes specific operations including tensor format conversion, input image resizing, and normalization.

To tackle the disparity in processing frequencies between the model throughput and the human input, the pipeline integrates an input-output queuing system that enables efficient parallelization as demonstrated in the following image.

The processed input tensors are first queued for the diffusion model. During each frame, the model retrieves the most recent tensor from the input queue and forwards it to the VAE encoder, thus initiating the image generation process. At the same time, the tensor output from the VAE decoder is fed into the output queue. Finally, the processed image data is transmitted to the rendering client.
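The queueing behavior described above can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: `LatestQueue` and `run_frame` are hypothetical names, `generate` stands in for the whole VAE encoder, UNet, and VAE decoder path, and discarding stale frames when the newest one is taken is an assumed policy rather than something the text specifies.

```python
from collections import deque
from typing import Any, Callable, Deque, Optional


class LatestQueue:
    """Bounded queue whose consumer always takes the most recent item."""

    def __init__(self, maxlen: int = 8) -> None:
        self._items: Deque[Any] = deque(maxlen=maxlen)

    def put(self, item: Any) -> None:
        # When full, deque(maxlen=...) silently drops the oldest item.
        self._items.append(item)

    def pop_latest(self) -> Optional[Any]:
        """Return the newest item and drop older ones (assumed policy)."""
        if not self._items:
            return None
        latest = self._items.pop()
        self._items.clear()
        return latest

    def __len__(self) -> int:
        return len(self._items)


def run_frame(
    input_q: LatestQueue, output_q: LatestQueue, generate: Callable[[Any], Any]
) -> bool:
    """One model tick: newest input -> generate -> output queue. False if no input."""
    tensor = input_q.pop_latest()
    if tensor is None:
        return False
    output_q.put(generate(tensor))
    return True


inputs, outputs = LatestQueue(), LatestQueue()
for frame in ["frame-1", "frame-2", "frame-3"]:  # inputs arrive faster than the model
    inputs.put(frame)
run_frame(inputs, outputs, generate=lambda t: f"generated({t})")
```

In a real deployment the producer, the model loop, and the rendering client would run concurrently, so the queue would need to be safe for that kind of access.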

Stochastic Similarity Filter

In scenarios where the images remain unchanged or change only minimally, such as a static environment or a period without active user interaction, nearly identical input images are fed repeatedly into the UNet and VAE components. The repeated feeding leads to the generation of near-identical images and wasted GPU resources. Furthermore, in scenarios involving continuous inputs, unmodified input images might surface occasionally. To overcome this issue and prevent unnecessary use of resources, the StreamDiffusion pipeline employs a Stochastic Similarity Filter component. The Stochastic Similarity Filter first calculates the cosine similarity between the reference image and the input image, and then uses the cosine similarity score to calculate the probability of skipping the subsequent UNet and VAE processes.

Based on this probability, the pipeline decides whether subsequent processes such as VAE encoding, U-Net inference, and VAE decoding should be skipped. If they are not skipped, the pipeline saves the current input image and updates the reference image for future comparisons. This probability-based skipping mechanism lets the StreamDiffusion pipeline operate fully in dynamic scenarios with low inter-frame similarity, whereas in static scenarios with high inter-frame similarity it skips processing more often. The approach conserves computational resources and keeps GPU utilization proportionate to how much the input images actually change.
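The filter logic above can be sketched in a few lines. This is an illustrative sketch, not the pipeline's code: frames are flat lists of floats, the class and function names are hypothetical, and the mapping from similarity to skip probability (a linear ramp from 0 at `threshold` to 1 at similarity 1) is an assumption, since the text only says the probability is derived from the cosine similarity.

```python
import math
import random
from typing import List, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def skip_probability(similarity: float, threshold: float) -> float:
    """Assumed mapping: 0 below the threshold, rising linearly to 1 at similarity 1."""
    return min(1.0, max(0.0, (similarity - threshold) / (1.0 - threshold)))


class StochasticSimilarityFilter:
    def __init__(self, threshold: float = 0.98, rng: Optional[random.Random] = None) -> None:
        self.threshold = threshold
        self.rng = rng or random.Random()
        self.reference: Optional[List[float]] = None

    def should_skip(self, frame: Sequence[float]) -> bool:
        """Return True if the VAE/UNet work for this frame should be skipped."""
        if self.reference is None:
            self.reference = list(frame)  # first frame is always processed
            return False
        p = skip_probability(cosine_similarity(self.reference, frame), self.threshold)
        if self.rng.random() < p:
            return True
        self.reference = list(frame)  # processed frames become the new reference
        return False


ssf = StochasticSimilarityFilter(threshold=0.98, rng=random.Random(0))
decisions = [ssf.should_skip(f) for f in ([3.0, 4.0], [3.0, 4.0], [4.0, -3.0])]
```

Making the decision probabilistic rather than a hard cutoff means that a nearly static stream is still refreshed occasionally instead of freezing on one output.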

Pre-Computation

The UNet architecture needs both conditioning embeddings and input latent variables. Traditionally, the conditioning embeddings are derived from prompt embeddings, which remain constant across frames. To avoid recomputing them, the StreamDiffusion pipeline pre-computes these prompt embeddings and stores them in a cache, from which they are retrieved in streaming or interactive mode. Within the UNet, the key-value pairs are computed from each frame's pre-computed prompt embedding, and with slight modifications to the U-Net, these key-value pairs can be reused.
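The caching idea can be sketched as a simple memoization layer. This is a minimal illustration under stated assumptions: `PromptEmbeddingCache` and `toy_text_encoder` are hypothetical names, and the toy encoder merely stands in for an expensive text-encoder forward pass.

```python
from typing import Callable, Dict, List

Embedding = List[float]  # stand-in for a prompt embedding tensor

encoder_calls: List[str] = []  # records how often the "expensive" encoder runs


def toy_text_encoder(prompt: str) -> Embedding:
    """Hypothetical stand-in for a text encoder forward pass."""
    encoder_calls.append(prompt)
    return [float(len(prompt)), float(sum(map(ord, prompt)) % 97)]


class PromptEmbeddingCache:
    """Computes each prompt embedding once and reuses it for every later frame."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._cache: Dict[str, Embedding] = {}

    def get(self, prompt: str) -> Embedding:
        if prompt not in self._cache:
            self._cache[prompt] = self._encode(prompt)  # only on first use
        return self._cache[prompt]


cache = PromptEmbeddingCache(toy_text_encoder)
for _ in range(100):  # 100 streamed frames sharing one prompt
    embedding = cache.get("a cat wearing sunglasses")
print(len(encoder_calls))
```

The same pattern extends to other per-stream constants mentioned in this article, such as sampled noise and scheduler values, which can be computed once and read back on every frame.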

Model Acceleration and Tiny AutoEncoder

The StreamDiffusion pipeline employs TensorRT, NVIDIA's optimization toolkit for deep learning inference, to build the VAE and UNet engines and accelerate inference. TensorRT performs numerous optimizations on neural networks that are designed to boost efficiency and enhance throughput for deep learning frameworks and applications.

To optimize speed, StreamDiffusion configures the framework to use fixed input dimensions and static batch sizes, so that memory allocation and computational graphs are optimized for a specific input size, yielding faster processing times.

The above figure provides an overview of the inference pipeline. The core diffusion pipeline houses the UNet and VAE components. The pipeline incorporates a denoising batch, a sampled noise cache, a pre-computed prompt embedding cache, and a scheduler values cache to improve speed and the pipeline's ability to generate images in real time. The Stochastic Similarity Filter (SSF) is deployed to optimize GPU usage and to dynamically gate whether a frame passes through the diffusion model.

StreamDiffusion: Experiments and Results

To evaluate its capabilities, the StreamDiffusion pipeline is implemented on the LCM and SD-turbo frameworks. NVIDIA's TensorRT is used as the model accelerator, and the pipeline employs TAESD as a lightweight, efficient VAE. Let's now look at how the StreamDiffusion pipeline performs compared against current state-of-the-art frameworks.

Quantitative Evaluation

The following figure compares the efficiency of the original sequential UNet loop with the denoising batch used in the pipeline. As can be seen, the denoising batch approach reduces processing time by almost 50% compared to traditional UNet loops over sequential denoising steps.

Furthermore, the average inference time at different numbers of denoising steps also improves substantially, with varying speedup factors, when compared against current state-of-the-art pipelines. The results are demonstrated in the following image.

Moving along, the StreamDiffusion pipeline with the RCFG component achieves shorter inference times than pipelines that use the traditional CFG component.

Furthermore, the impact of using the RCFG component, compared to using the CFG component, is evident in the following images.

As can be seen, the use of CFG intensifies the impact of the textual prompt on image generation, and the images resemble the input prompts much more closely than the images generated by the pipeline without CFG. The results improve further with the RCFG component, as the influence of the prompts on the generated images is quite significant compared to the original CFG component.

Final Thoughts

In this article, we have talked about StreamDiffusion, a real-time diffusion pipeline developed for generating interactive and realistic images and for tackling the current limitations of diffusion-based frameworks on tasks involving continuous input. StreamDiffusion is a simple and novel approach that transforms the original sequential denoising process into batched denoising. It aims to enable high throughput and fluid streams by eliminating the traditional wait-and-interact approach adopted by current diffusion-based frameworks. The efficiency gains highlight the potential of the StreamDiffusion pipeline for commercial applications, offering high-performance computing and compelling solutions for generative AI.

The post StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation appeared first on Unite.AI.