Stable Diffusion, which is mostly used to generate images from text, is an evolving application of AI technology in the content creation industry. To run Stable Diffusion on your local computer, you need a powerful GPU that can handle its heavy requirements. A powerful GPU generates images faster, and one with plenty of VRAM also lets you create higher-resolution images. So, what are the best consumer GPUs for Stable Diffusion? Let’s check out Stable Diffusion performance on a range of NVIDIA and AMD GPUs to find the answer.
Stable Diffusion is a machine-learning model. It is increasingly being used in content creation thanks to its ability to generate images from text prompts. What makes Stable Diffusion unique is that it has no single commercially developed application and instead relies on various open-source ones. Besides, unlike other similar text-to-image models, it is often run on local systems rather than through online web services.
Stable Diffusion can run on mid-range GPUs with at least 8GB of VRAM. However, it benefits greatly from powerful modern GPUs with more VRAM.
You can directly use the Stable Diffusion version developed by Stability AI and Runway. However, most people use a web-based interface created by third parties. The most commonly used implementations are:
- Automatic 1111: This is mostly used with NVIDIA GPUs, though forks exist for AMD and Apple Silicon. It allows you to use xformers, which can significantly boost performance on NVIDIA GPUs.
- SHARK: SHARK is an alternative to Automatic 1111. It natively supports NVIDIA and AMD GPUs. However, compared with Automatic 1111, it tends to perform better on AMD GPUs and worse on NVIDIA GPUs.
- Custom: Some people create their own applications with the features they need because Stable Diffusion is publicly available for anyone to use directly.
Each implementation has unique advantages and drawbacks regarding features and usability. From a performance and benchmarking view, Automatic 1111 and SHARK are recommended. Which one to use depends on the GPU you intend to test: use Automatic 1111 for NVIDIA GPUs and SHARK for AMD GPUs.
Note: Stable Diffusion is constantly updated, so performance can change depending on the version you use.
Firstly, Stable Diffusion settings & models
The most frequently adjusted settings, such as the prompt, negative prompt, cfg scale, and seed, do not meaningfully affect performance. It takes the same amount of time to generate an image of a dog as of a mountain landscape. Even the model selected tends to result in only minor differences in generation time. The images below have different prompts and cfg scales, yet they take almost exactly the same amount of time to generate.
Image Credit: Puget Systems
Other settings like the steps, resolution, and sampling method will impact Stable Diffusion’s performance.
- Steps: Adjusting the step count changes the time needed to generate an image but does not alter the processing speed in iterations per second. Though many users choose between 20 and 50 steps, increasing the step count to around 200 for benchmarking tends to produce more consistent results from run to run.
- Resolution: The image resolution not only has the greatest impact on performance but also influences how much VRAM is needed to generate the image. For benchmarking purposes, you can use a 512×512 resolution to ensure compatibility with various GPU models.
- Sampling method (Euler, DPM, etc.): The sampling method can significantly impact generation time. “Euler” and “Euler a” are the most widely used and tend to provide the best performance, while other methods like DPM2 tend to take about twice as long. For the purpose of GPU benchmarking, sticking with a variation of Euler for consistency is recommended.
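As a rough illustration of the settings above, the sketch below gathers the benchmark-relevant values in one place and estimates generation time from a step count. The names `BenchmarkSettings` and `estimated_seconds`, and the sample speed of 10 it/s, are assumptions made for this example and are not part of any Stable Diffusion implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkSettings:
    """Hypothetical container for the settings that affect performance."""
    steps: int = 200        # a high step count gives steadier run-to-run timing
    width: int = 512        # 512x512 stays within reach of most GPUs
    height: int = 512
    sampler: str = "Euler"  # keep to one Euler variant for comparable numbers


def estimated_seconds(settings: BenchmarkSettings, its_per_second: float) -> float:
    """Estimate generation time; steps change the time, not the it/s rate."""
    if its_per_second <= 0:
        raise ValueError("its_per_second must be positive")
    return settings.steps / its_per_second


if __name__ == "__main__":
    assumed_speed = 10.0  # illustrative value only, not a measured result
    for steps in (20, 50, 200):
        settings = BenchmarkSettings(steps=steps)
        print(f"{steps} steps -> {estimated_seconds(settings, assumed_speed):.1f} s")
```

At a fixed it/s rate, ten times the steps simply means ten times the wait, which is why step count affects generation time without changing the measured speed.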
Secondly, the hardware
- GPU: The GPU has the biggest impact on generation speed and on the image sizes you can produce. More powerful GPUs with higher memory bandwidth and more VRAM can generate Stable Diffusion images much faster, especially at higher resolutions. The amount of VRAM on the GPU determines the maximum resolution of the images that can be generated. At least 8GB is recommended, and higher resolutions require 12GB or more.
- CPU: While the GPU handles most of the heavy lifting, a fast CPU can still improve performance to a lesser extent. CPUs with higher clock speeds and more cores can provide a small boost.
- RAM: The system memory helps feed data to the GPU, so at least 16GB of RAM is advisable. More RAM, up to 32GB or 64GB, can help further.
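The VRAM guidance above can be summarized in a small helper. This is only a sketch of the thresholds stated in this article (8GB recommended, 12GB or more for higher resolutions); the function name `vram_guidance` is hypothetical, and the real limits depend on the implementation and its settings.

```python
def vram_guidance(vram_gb: float) -> str:
    """Map a VRAM size in GB to the rough guidance given in this article."""
    if vram_gb < 0:
        raise ValueError("vram_gb must not be negative")
    if vram_gb < 8:
        return "below the recommended 8GB"
    if vram_gb < 12:
        return "meets the 8GB recommendation"
    return "12GB or more, suited to higher resolutions"


if __name__ == "__main__":
    for size in (6, 8, 12, 24):
        print(f"{size}GB: {vram_guidance(size)}")
```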
To find the best consumer GPUs for Stable Diffusion, we will examine how a range of GPUs perform on its two most popular implementations (using their latest public releases).
Many Stable Diffusion implementations report their speed in “iterations per second”, or “it/s”. This makes it/s a convenient and widely used metric for comparing Stable Diffusion performance. It is calculated by dividing the number of iterations by the number of seconds it takes to generate an image. For example, if generating an image with 200 iterations takes 15 seconds, the speed is about 13.3 it/s (200 iterations divided by 15 seconds).
First, let’s look at the benchmark results from Puget Systems, who tested NVIDIA’s 4000 series GPUs alongside the top-tier GPUs from NVIDIA’s last three generations and AMD’s RX 7900 XTX and RX 6900 XT.
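The calculation can be sketched in a few lines of Python. The helper names `iterations_per_second` and `time_generation` are hypothetical, and `generate` stands in for whatever call actually produces an image in your setup; neither is taken from Automatic 1111 or SHARK.

```python
import time
from typing import Callable


def iterations_per_second(iterations: int, seconds: float) -> float:
    """Divide the number of iterations by the elapsed time in seconds."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return iterations / seconds


def time_generation(generate: Callable[[int], None], steps: int) -> float:
    """Time one call to a generation function and return its it/s.

    `generate` is a placeholder for a real image-generation call.
    """
    start = time.perf_counter()
    generate(steps)
    elapsed = time.perf_counter() - start
    return iterations_per_second(steps, elapsed)


if __name__ == "__main__":
    # The worked example from the text: 200 iterations in 15 seconds.
    print(round(iterations_per_second(200, 15), 1))  # 13.3
```

Implementations that print it/s usually do this bookkeeping for you; timing the call yourself is mainly useful when comparing tools that report speed differently.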
Image Credit: Puget Systems
Automatic 1111, Stable Diffusion’s most commonly used implementation, usually provides the best performance on NVIDIA GPUs.
NVIDIA clearly outperforms AMD here. Among NVIDIA’s GPUs, the RTX 4090 is the winner, delivering the highest performance on Automatic 1111. Even the RTX 3060 Ti is twice as fast as the Radeon GPU. Only the GTX 1080 Ti is slower than the RX 7900 XTX.
The newer 4000 series GPUs offer a clear advantage in image generation speed, with performance that scales roughly linearly with price. This is shown by the RTX 4070 Ti being about 5% faster than the previous-generation RTX 3090 Ti, and the RTX 4060 Ti being nearly 43% faster than the 3060 Ti. If you still have a 2000 or 1000 series GPU, even a mid-range 4000 series GPU will provide a noticeable performance boost.
Image Credit: Puget Systems
Even though SHARK is less commonly used than Automatic 1111, it is preferred by many AMD users. The benchmark results above make it clear why.
The RX 7900 XTX sees its performance quadruple with SHARK, resulting in iterations per second similar to the RTX 4090 running Automatic 1111. The RX 6900 XT sees an even larger increase of 1100%, though this only makes it competitive with the low-end NVIDIA GPUs tested.
With SHARK, NVIDIA GPUs perform around 30% worse than with Automatic 1111, although their ranking relative to one another stays the same.
Important note: Using the proper implementation of Stable Diffusion matters because it can greatly impact performance, anywhere from a 30% decrease to a massive 1100% increase! Compatibility also varies: the GTX 1080 Ti could not run SHARK at all in Puget Systems’ testing.
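Percentage changes of this size are easy to misread, so here is a minimal sketch that converts between a percent change and a speed multiplier. The function names are hypothetical, and the inputs are the percentages quoted in this article rather than new measurements.

```python
def multiplier_from_percent_change(percent_change: float) -> float:
    """Convert a percent change (e.g. +1100 or -30) into a speed multiplier."""
    return 1.0 + percent_change / 100.0


def percent_change_from_multiplier(multiplier: float) -> float:
    """Convert a speed multiplier (e.g. 4 for 'quadruple') into a percent change."""
    return (multiplier - 1.0) * 100.0


if __name__ == "__main__":
    # A 1100% increase means the new speed is 12 times the old one.
    print(f"{multiplier_from_percent_change(1100):g}x")
    # A 30% decrease leaves about 0.7 times the original speed.
    print(f"{multiplier_from_percent_change(-30):g}x")
    # Quadrupling performance corresponds to a 300% increase.
    print(f"{percent_change_from_multiplier(4):g}%")
```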
What stands out the most is the huge difference in performance between the various Stable Diffusion implementations. NVIDIA GPUs offer the highest performance on Automatic 1111, while AMD GPUs work best with SHARK. The top GPUs on their respective implementations have similar performance.
If you have not yet settled on a particular implementation, both NVIDIA and AMD provide great performance with their high-end GPUs. The GeForce RTX 4090 and Radeon RX 7900 XTX both provide around 21 it/s in their preferred implementation of Stable Diffusion.
It is very important to note that Stable Diffusion is a constantly evolving model with a constantly evolving set of tools. How it works today is remarkably different from how it worked months ago, and its performance is going to keep changing in the coming months and years. The results in this article are therefore likely to change over time, so please treat these benchmark results as a reference only.
If you are interested in testing the performance of your currently-used Stable Diffusion implementation on top-tier GPUs like the RTX 4090, check out our service below.
Stable Diffusion is primarily designed for single-GPU usage; however, with some additional software and configuration, it can take advantage of multiple GPUs. By splitting the work across multiple GPUs, overall throughput can be increased. Most Stable Diffusion implementations run on a single GPU by default, but one commonly used implementation, Automatic 1111, can be pointed at a specific GPU through its launch options, so a separate instance can be run on each card with minimal additional configuration.
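One simple way to picture splitting work across several GPUs is to give each GPU its own queue of prompts. The sketch below only plans the assignment in round-robin order; the function name `assign_round_robin` is hypothetical, and it does not launch or call any real Stable Diffusion implementation.

```python
from typing import Dict, List, Sequence


def assign_round_robin(prompts: Sequence[str], gpu_count: int) -> Dict[int, List[str]]:
    """Distribute prompts across GPU indices 0..gpu_count-1 in turn."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be at least 1")
    queues: Dict[int, List[str]] = {gpu: [] for gpu in range(gpu_count)}
    for index, prompt in enumerate(prompts):
        queues[index % gpu_count].append(prompt)
    return queues


if __name__ == "__main__":
    plan = assign_round_robin(["a dog", "a mountain landscape", "a city at night"], 2)
    for gpu, queue in plan.items():
        print(f"GPU {gpu}: {queue}")
```

With this kind of split, each image is still generated by a single GPU, so the gain shows up as more images per minute rather than a faster individual image.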
Regardless of which implementation of Stable Diffusion you utilize, you can still benefit from iRender’s high-performance GPU machines to accelerate image generation.
We offer flexible configurations of 1, 2, 4, 6, and 8 GPU machines using the top-tier RTX 4090 and RTX 3090. Built with powerful AMD Ryzen Threadripper PRO CPUs with up to 64 cores, 256GB RAM, and 2TB NVMe SSD storage, our servers can quickly handle even the most demanding AI art workloads in Stable Diffusion.
We have just released an iRender GPU desktop application, which lets you use our services more easily and efficiently. See how our service works:
iRender – Happy Rendering, Happy Training
Reference source: pugetsystems.com