Stable Diffusion Web User Interface, or SD-WebUI, is a comprehensive project for Stable Diffusion models that utilizes the Gradio library to provide a browser interface. Today, we're going to talk about EasyPhoto, an innovative WebUI plugin enabling end users to generate AI portraits and images. The EasyPhoto WebUI plugin creates AI portraits using various templates, supporting different photo styles and multiple modifications. Additionally, to enhance EasyPhoto’s capabilities further, users can generate images using the SDXL model for more satisfactory, accurate, and diverse results. Let's begin.
An Introduction to EasyPhoto and Stable Diffusion
The Stable Diffusion framework is a popular and robust diffusion-based generation framework used by developers to generate realistic images based on input text descriptions. Thanks to its capabilities, the Stable Diffusion framework boasts a wide range of applications, including image outpainting, image inpainting, and image-to-image translation. The Stable Diffusion Web UI, or SD-WebUI, stands out as one of the most popular and well-known applications of this framework. It features a browser interface built on the Gradio library, providing an interactive and user-friendly interface for Stable Diffusion models. To further enhance control and usability in image generation, SD-WebUI integrates numerous Stable Diffusion applications.
Owing to the convenience offered by the SD-WebUI framework, the developers of the EasyPhoto framework decided to create it as a web plugin rather than a full-fledged application. In contrast to existing methods that often suffer from identity loss or introduce unrealistic features into images, the EasyPhoto framework leverages the image-to-image capabilities of the Stable Diffusion models to produce accurate and realistic images. Users can easily install the EasyPhoto framework as an extension within the WebUI, enhancing user-friendliness and accessibility to a broader range of users. The EasyPhoto framework allows users to generate identity-guided, high-quality, and realistic AI portraits that closely resemble the input identity.
First, the EasyPhoto framework asks users to create their digital doppelganger by uploading a few images to train a face LoRA or Low-Rank Adaptation model online. The LoRA framework quickly fine-tunes the diffusion models by making use of low-rank adaptation technology. This process allows the based model to understand the ID information of specific users. The trained models are then merged & integrated into the baseline Stable Diffusion model for interference. Furthermore, during the interference process, the model uses stable diffusion models in an attempt to repaint the facial regions in the interference template, and the similarity between the input and the output images are verified using the various ControlNet units.
The EasyPhoto framework also deploys a two-stage diffusion process to tackle potential issues like boundary artifacts & identity loss, thus ensuring that the images generated minimizes visual inconsistencies while maintaining the user’s identity. Furthermore, the interference pipeline in the EasyPhoto framework is not only limited to generating portraits, but it can also be used to generate anything that is related to the user’s ID. This implies that once you train the LoRA model for a particular ID, you can generate a wide array of AI pictures, and thus it can have widespread applications including virtual try-ons.
Tu summarize, the EasyPhoto framework
- Proposes a novel approach to train the LoRA model by incorporating multiple LoRA models to maintain the facial fidelity of the images generated.
- Makes use of various reinforcement learning methods to optimize the LoRA models for facial identity rewards that further helps in enhancing the similarity of identities between the training images, and the results generated.
- Proposes a dual-stage inpaint-based diffusion process that aims to generate AI photos with high aesthetics, and resemblance.
EasyPhoto : Architecture & Training
The following figure demonstrates the training process of the EasyPhoto AI framework.
As it can be seen, the framework first asks the users to input the training images, and then performs face detection to detect the face locations. Once the framework detects the face, it crops the input image using a predefined specific ratio that focuses solely on the facial region. The framework then deploys a skin beautification & a saliency detection model to obtain a clean & clear face training image. These two models play a crucial role in enhancing the visual quality of the face, and also ensure that the background information has been removed, and the training image predominantly contains the face. Finally, the framework uses these processed images and input prompts to train the LoRA model, and thus equipping it with the ability to comprehend user-specific facial characteristics more effectively & accurately.
Furthermore, during the training phase, the framework includes a critical validation step, in which the framework computes the face ID gap between the user input image, and the verification image that was generated by the trained LoRA model. The validation step is a fundamental process that plays a key role in achieving the fusion of the LoRA models, ultimately ensuring that the trained LoRA framework transforms into a doppelganger, or an accurate digital representation of the user. Additionally, the verification image that has the optimal face_id score will be selected as the face_id image, and this face_id image will then be used to enhance the identity similarity of the interference generation.
Moving along, based on the ensemble process, the framework trains the LoRA models with likelihood estimation being the primary objective, whereas preserving facial identity similarity is the downstream objective. To tackle this issue, the EasyPhoto framework makes use of reinforcement learning techniques to optimize the downstream objective directly. As a result, the facial features that the LoRA models learn display improvement that leads to an enhanced similarity between the template generated results, and also demonstrates the generalization across templates.
The following figure demonstrates the interference process for an individual User ID in the EasyPhoto framework, and is divided into three parts
- Face Preprocess for obtaining the ControlNet reference, and the preprocessed input image.
- First Diffusion that helps in generating coarse results that resemble the user input.
- Second Diffusion that fixes the boundary artifacts, thus making the images more accurate, and appear more realistic.
For the input, the framework takes a face_id image(generated during training validation using the optimal face_id score), and an interference template. The output is a highly detailed, accurate, and realistic portrait of the user, and closely resembles the identity & unique appearance of the user on the basis of the infer template. Let’s have a detailed look at these processes.
A way to generate an AI portrait based on an interference template without conscious reasoning is to use the SD model to inpaint the facial region in the interference template. Additionally, adding the ControlNet framework to the process not only enhances the preservation of user identity, but also enhances the similarity between the images generated. However, using ControlNet directly for regional inpainting can introduce potential issues that may include
- Inconsistency between the Input and the Generated Image : It is evident that the key points in the template image are not compatible with the key points in the face_id image which is why using ControlNet with the face_id image as reference can lead to some inconsistencies in the output.
- Defects in the Inpaint Region : Masking a region, and then inpainting it with a new face might lead to noticeable defects, especially along the inpaint boundary that will not only impact the authenticity of the image generated, but will also negatively affect the realism of the image.
- Identity Loss by Control Net : As the training process does not utilize the ControlNet framework, using ControlNet during the interference phase might affect the ability of the trained LoRA models to preserve the input user id identity.
To tackle the issues mentioned above, the EasyPhoto framework proposes three procedures.
- Align and Paste : By using a face-pasting algorithm, the EasyPhoto framework aims to tackle the issue of mismatch between facial landmarks between the face id and the template. First, the model calculates the facial landmarks of the face_id and the template image, following which the model determines the affine transformation matrix that will be used to align the facial landmarks of the template image with the face_id image. The resulting image retains the same landmarks of the face_id image, and also aligns with the template image.
- Face Fuse : Face Fuse is a novel approach that is used to correct the boundary artifacts that are a result of mask inpainting, and it involves the rectification of artifacts using the ControlNet framework. The method allows the EasyPhoto framework to ensure the preservation of harmonious edges, and thus ultimately guiding the process of image generation. The face fusion algorithm further fuses the roop(ground truth user images) image & the template, that allows the resulting fused image to exhibit better stabilization of the edge boundaries, which then leads to an enhanced output during the first diffusion stage.
- ControlNet guided Validation : Since the LoRA models were not trained using the ControlNet framework, using it during the inference process might affect the ability of the LoRA model to preserve the identities. In order to enhance the generalization capabilities of EasyPhoto, the framework considers the influence of the ControlNet framework, and incorporates LoRA models from different stages.
The first diffusion stage uses the template image to generate an image with a unique id that resembles the input user id. The input image is a fusion of the user input image, and the template image, whereas the calibrated face mask is the input mask. To further increase the control over image generation, the EasyPhoto framework integrates three ControlNet units where the first ControlNet unit focuses on the control of the fused images, the second ControlNet unit controls the colors of the fused image, and the final ControlNet unit is the openpose (real-time multi-person human pose control) of the replaced image that not only contains the facial structure of the template image, but also the facial identity of the user.
In the second diffusion stage, the artifacts near the boundary of the face are refined and fine tuned along with providing users with the flexibility to mask a specific region in the image in an attempt to enhance the effectiveness of generation within that dedicated area. In this stage, the framework fuses the output image obtained from the first diffusion stage with the roop image or the result of the user’s image, thus generating the input image for the second diffusion stage. Overall, the second diffusion stage plays a crucial role in enhancing the overall quality, and the details of the generated image.
Multi User IDs
One of EasyPhoto’s highlights is its support for generating multiple user IDs, and the figure below demonstrates the pipeline of the interference process for multi user IDs in the EasyPhoto framework.
To provide support for multi-user ID generation, the EasyPhoto framework first performs face detection on the interference template. These interference templates are then split into numerous masks, where each mask contains only one face, and the rest of the image is masked in white, thus breaking the multi-user ID generation into a simple task of generating individual user IDs. Once the framework generates the user ID images, these images are merged into the inference template, thus facilitating a seamless integration of the template images with the generated images, that ultimately results in a high-quality image.
Experiments and Results
Now that we have an understanding of the EasyPhoto framework, it is time for us to explore the performance of the EasyPhoto framework.
The above image is generated by the EasyPhoto plugin, and it uses a Style based SD model for the image generation. As it can be observed, the generated images look realistic, and are quite accurate.
The image added above is generated by the EasyPhoto framework using a Comic Style based SD model. As it can be seen, the comic photos, and the realistic photos look quite realistic, and closely resemble the input image on the basis of the user prompts or requirements.
The image added below has been generated by the EasyPhoto framework by making the use of a Multi-Person template. As it can be clearly seen, the images generated are clear, accurate, and resemble the original image.
With the help of EasyPhoto, users can now generate a wide array of AI portraits, or generate multiple user IDs using preserved templates, or use the SD model to generate inference templates. The images added above demonstrate the capability of the EasyPhoto framework in producing diverse, and high-quality AI pictures.
In this article, we have talked about EasyPhoto, a novel WebUI plugin that allows end users to generate AI portraits & images. The EasyPhoto WebUI plugin generates AI portraits using arbitrary templates, and the current implications of the EasyPhoto WebUI supports different photo styles, and multiple modifications. Additionally, to further enhance EasyPhoto’s capabilities, users have the flexibility to generate images using the SDXL model to generate more satisfactory, accurate, and diverse images. The EasyPhoto framework utilizes a stable diffusion base model coupled with a pretrained LoRA model that produces high quality image outputs.