Stable Diffusion 1.5 v/s Stable Diffusion XL

Lalith Gudipati

Apr 15, 2024 • 12 min read

The open-source community has been actively engaged in exploring Stable Diffusion 2 ever since its recent release just a couple of weeks ago. While some users have reported perceived differences in performance when compared to Stable Diffusion 1, this release has sparked significant interest and exploration within the community.

Since its recent release a couple of weeks ago, the open-source community has been energetically delving into Stable Diffusion XL. Although some users have noted variances in performance compared to Stable Diffusion 1, this release has ignited considerable enthusiasm and investigation among community members.

OpenClip

Stable Diffusion XL introduces a significant shift by replacing the text encoder. While Stable Diffusion 1 utilises OpenAI's CLIP, a model that evaluates how accurately a caption describes an image, it's worth noting that the dataset on which CLIP was trained is not publicly accessible.

In contrast, Stable Diffusion XL employs OpenCLIP, an open-source adaptation of CLIP, trained on a known dataset—aesthetic subset of LAION-5B, excluding NSFW images. According to Stability AI, OpenCLIP notably enhances the quality of generated images and surpasses an unreleased version of CLIP based on metrics.

Why this matters

Putting aside inquiries regarding the comparative performance of these models, the transition from CLIP to OpenCLIP is responsible for several variances between Stable Diffusion 1 and Stable Diffusion XL.

Specifically, numerous users of Stable Diffusion XL have observed that it may not capture celebrities or artistic styles as effectively as Stable Diffusion 1. Interestingly, the training data for Stable Diffusion XL did not undergo deliberate filtration to exclude artists. However, this disparity arises from the fact that CLIP's training data contained a greater representation of celebrities and artists compared to the LAION dataset. As CLIP's dataset remains inaccessible to the public, replicating the same functionality solely using the LAION dataset is unattainable. Consequently, many conventional prompting techniques utilised for Stable Diffusion 1 are nearly obsolete for Stable Diffusion XL.

What this means

The transition to a fully open-source, open-data model represents a significant milestone in the evolution of the Stable Diffusion project. It places the responsibility on the shoulders of the open-source community to refine Stable Diffusion XL and enhance its functionalities according to user preferences—a fundamental aspect of Stable Diffusion's original vision as a community-driven, fully open project. While some users may currently express disappointment in the relative performance of Stable Diffusion XL, it's important to recognize that the StabilityAI team has dedicated over 1 million A100 hours to establish a robust foundation for future development.

Additionally, although not explicitly stated by the creators, moving away from using CLIP may offer some protection against potential liability issues for project contributors—an aspect of significance considering the anticipated surge in intellectual property litigation surrounding models of this nature.

With this insightful context in mind, let's now delve into the practical distinctions between Stable Diffusion 1 and XL.

Negative Prompts

Let's start by examining negative prompting, a factor that appears to play a significantly more crucial role in achieving strong performance in Stable Diffusion (SD) XL compared to SD 1, as illustrated below. Now, let's delve deeper into the concept of negative prompting.

No Negative Prompt

With Negative Prompt

Negative prompting in Studio Dashtoon

Simple Prompt

We initiate the process by providing the prompt "infinity pool" to both Stable Diffusion 1.5 and Stable Diffusion XL without applying any negative prompt. Subsequently, three images generated by each model are displayed, with each column representing a distinct random seed.

It appears that Stable Diffusion 1.5 demonstrates better overall performance compared to Stable Diffusion XL. In Stable Diffusion XL, the image on the left exhibits a poorly fitting patch within the composition, while the image on the right appears almost incoherent.

Now, we proceed to generate images using the same method and initial noise, but this time employing negative prompting. We include a negative prompt consisting of descriptors such as "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, mutation, mutated, extra limbs, extra legs, extra arms, disfigured, deformed, cross-eye, body out of frame, blurry, bad art, bad anatomy, blurred, text, watermark, grainy," as suggested by Emad Mostaque.

With the addition of negative prompting, Stable Diffusion 1.5 generally exhibits improved performance, although there may be potential issues with caption alignment in the middle image. In the case of Stable Diffusion XL, there are more noticeable improvements, although its overall performance still falls short of Stable Diffusion 1.5.

No Negative Prompt

With Negative Prompt

Simple Prompting in Studio Dashtoon

Complicated Prompt

We repeat the previous experiment with a more intricate positive prompt. Instead of using "infinity pool," we utilise the prompt "infinity pool with a tropical forest in the background, high resolution, detail, 8k, DSLR, good lighting, ray tracing, realistic." While we could have excluded the "with a tropical forest in the background" segment to focus solely on aesthetic enhancements, we include it to assess the semantic alignment with a more complex prompt.

Once again, we present the outcomes without negative prompting. The resulting images deviate from photorealism, and the alignment with the captions is arguably improved. Additionally, the water texture exhibits notable improvement with Stable Diffusion 1.5.

When we incorporate the identical negative prompt as in the previous instance, intriguing observations emerge. Specifically, it seems that the negative prompt may indeed have a detrimental impact on SD 1, yet consistently benefit SD XL. Each image from SD XL exhibits improvement with negative prompting, while the alignment of captions for SD 1 appears to universally decline. Interestingly, the addition of the negative prompt seems to steer the generated images towards photorealism.

Stable Diffusion 1.5

Stable Diffusion XL

Textual Inversion

In addition to conventional negative prompts, Stable Diffusion also offers support for textual inversion, a technique where a small set of reference images can be utilised to generate a new "word" representing these images. Once learned, this "word" can be employed in prompts as usual, enabling the generation of images closely resembling the reference images. In the example provided, four images of a small figure are inverted to "S_*", which is then integrated into various prompts alongside other semantic concepts:

In the following illustration, several images are produced using Stable Diffusion XL from the base prompt "a rabbit in a park". This prompt is then enhanced with either a positive prompt or a textually-inverted token, along with a negative prompt or textually-inverted token. For instance, the rightmost image in the second row supplements the base prompt with a textually-inverted token referencing Midjourney, alongside a standard negative prompt "ugly, boring, bad anatomy".

It's evident that the utilisation of textual inversion notably enhances the performance of Stable Diffusion XL. The image above is sourced from a blog post by Max Woolf, providing a comprehensive exploration of this subject matter.

Text to image prompt on Dashtoon Studio

Celebrities

Considering that CLIP's training data includes more celebrity images compared to LAION, it's understandable that some SD XL users have noticed a reduced ability to generate celebrity images compared to SD 1.5.

Below, we present images generated from 3 random seeds (columns) with and without negative prompting for both SD 1.5 and SD XL using the prompt "Keanu Reeves." A full-resolution version of this image is also provided.

Overall, SD XL performs comparably to SD 1.5 concerning this specific prompt. However, when celebrities are combined with semantic concepts, SD XL's ability to depict them seems to diminish. We provide comparisons for two such prompts below, with each column within an image corresponding to a specific random seed. Negative prompting is employed in each case.

As illustrated, SD 1.5 generally surpasses SD XL in this aspect (even portraying Steve Carell instead of Robert Downey Jr. at one point). While this difference is anticipated, its extent may be larger than anticipated based on the Keanu Reeves example.

Stable Diffusion 1.5

Stable Diffusion XL

Stable Diffusion 1.5

Stable Diffusion XL

Artistic Images

As discussed in the OpenCLIP section, the LAION dataset not only contains fewer celebrity images compared to the CLIP training data but also fewer artistic images. Consequently, generating stylized images becomes more challenging, and the traditional method of "_____ in the style of _____" no longer functions as effectively as it did in Stable Diffusion 1. Below, we present a comparison of images generated from 4 random seeds using Stable Diffusion 1.5 and Stable Diffusion XL, attempting to create an image in the style of Greg Rutkowski.

The disparity in results is notable - Stable Diffusion 1.5 emerges as the clear victor over Stable Diffusion XL (out of the box). While augmenting the prompt with descriptors not explicitly referencing artists can still yield stylized images with SD XL, its performance still lags behind SD 1.5, as demonstrated below:

On the flip side, some users have discovered that SD XL excels at generating photorealistic images:

Stable Diffusion 1.5

Stable Diffusion XL

Text Coherence

One area where Stable Diffusion XL could potentially outperform Stable Diffusion 1 is in text coherence. Many text-to-image models struggle significantly with accurately representing text. This is not surprising given the complexity of language and the diverse ways in which words are structured and visually represented.

Despite these challenges, it seems that Stable Diffusion XL may have a slight edge over Stable Diffusion 1 in conveying text. Below, we present several images for comparison:

While neither model produces stellar results, it's worth noting that negative prompting has minimal impact in improving text coherence. Although quantifying the effectiveness of text generation is subjective, it could be argued that Stable Diffusion XL performs marginally better than Stable Diffusion 1.

Stable Diffusion 1.5

Stable Diffusion XL

Depth Model

Alongside SD XL, a depth model was introduced. This model accepts a 2D image as input and produces a depth map prediction for it. This depth information can then be utilised to influence image generation alongside text inputs, enabling users to produce new images that maintain the geometric integrity of a reference image.

Below, we showcase a variety of images generated using this approach, all of which uphold the consistent geometric structure.

Upscaling Model

Stable Diffusion XL also introduced an upscaling model, capable of increasing the size of images to four times their original dimensions. This signifies that the upscaled images boast a sixteen-fold increase in area compared to the original ones!

Upscaling options on Studio Dashtoon

Inpainting Model

Stable Diffusion XL also introduces an enhanced inpainting model, empowering users to seamlessly modify specific portions of an image, ensuring that the alterations blend aesthetically with the overall composition.

768 x 768 Model

Lastly, Stable Diffusion XL now extends support to 768 x 768 images, providing more than double the area compared to the 512 x 512 images in Stable Diffusion 1.

Stable Diffusion 2.1

Stable Diffusion 2.1 was launched shortly after the debut of Stable Diffusion 2.0, aiming to enhance several aspects that were perceived as drawbacks compared to Stable Diffusion 1.5. Let's explore the improvements introduced by 2.1.

NSFW Filter

The primary enhancement introduced by version 2.1 compared to 2.0 involves a revised NSFW filter. In version 2.0, the training data from a subset of the LAION dataset underwent filtration for inappropriate content using an NSFW filter. This led to a somewhat reduced capacity to represent humans effectively.

In Stable Diffusion 2.1, a similar NSFW filter is employed, albeit with modifications to make it less restrictive. Specifically, the filter now produces fewer false positives, significantly increasing the number of images eligible for model training. This expanded training dataset consequently enhances the model's ability to depict people. Below, we showcase several images featuring Robert Downey Jr., generated under identical settings except for the model version used, now incorporating Stable Diffusion 2.1.

As evident from the results, Stable Diffusion 2.1 represents a significant advancement compared to Stable Diffusion 2, successfully rendering images of Robert Downey Jr. Furthermore, the skin texture quality achieved in SD 2.1 surpasses that of SD 1.5.

Stable Diffusion 1.5

Stable Diffusion XL

A picture of Robert Downey Jr. generated by every model under the same settings.

Artistic Styles

Regrettably, SD 2.1's capacity to replicate specific artist styles seems to lag behind that of SD 1.5. Below, we once more present images generated with identical settings, differing only in the model utilised. These images aim to emulate the style of Greg Rutkowski.

As illustrated, Stable Diffusion 1.5 continues to excel in this aspect.

Stable Diffusion 1.5

Stable Diffusion XL

General Images

We replicate the experiment conducted in the preceding section comparing plain and "augmented" prompts, once more altering only the model version.

Upon examination, we observe that the "original" textures in 2.1 exhibit an enhancement compared to 2.0. Furthermore, the "augmented" image generated by 2.1 appears to be more stylized than that of 2.0, albeit with a largely similar overall appearance.

Original

Augmented

Conclusion

While these experiments are not intended to be exhaustive or rigorously scientific, they offer valuable insights into the comparative performance of Stable Diffusion 1 and Stable Diffusion XL.

OpenClip

Why this matters

What this means

Negative Prompts

Simple Prompt

Complicated Prompt

Textual Inversion

Celebrities

Artistic Images

Text Coherence

Depth Model

Upscaling Model

Inpainting Model

Stable Diffusion 2.1

NSFW Filter

Artistic Styles

General Images

Conclusion

Sign up for more like this.