URL: https://www.analyticsvidhya.com/blog/2023/12/what-is-stable-diffusion/

Everything You Need To Know About Stable Diffusion

Sarvagya Agrawal Last Updated : 29 Apr, 2025
7 min read

Introduction

With recent advances in AI, the capabilities of Generative AI are being widely explored, and generating images from text is one of them. Models and techniques in this space include Stable Diffusion, Imagen, Dall-E 3, Midjourney, Dreambooth, DreamFusion, and many more. In this article, we review the diffusion model concept used in Stable Diffusion along with its fine-tuning using LoRA.

Learning Objectives

  • Understand the basic concept behind Stable Diffusion.
  • Learn the components involved in image generation.
  • Get hands-on experience generating images with Stable Diffusion.

This article was published as a part of the Data Science Blogathon.

Introduction to Stable Diffusion

The diffusion model is a class of deep learning models capable of generating new data similar to what they have seen during the training. Stable diffusion is one such model which has the following capabilities:

Text-to-Image Generation

  • The Stable Diffusion model translates textual descriptions into visually coherent images. It uses the patterns learned from its training data to create images that align with the provided text prompts.
  • Applications of this capability include content creation, where users describe a scene or concept in text and the model generates an image based on that description.
  • Additionally, developers can leverage the Stable Diffusion API to integrate text-to-image generation into their applications, enabling programmatic creation of images from textual prompts.

Image-to-Image Generation

  • This functionality lets users supply an input image together with a textual prompt that guides the modification. The model then combines the visual information from the image with the contextual cues from the text to produce a modified version of the input image.
  • Use cases for this feature range from creative design to image enhancement, where users can specify desired changes or adjustments through both text and visual input.

Inpainting

  • Inpainting is a specialized form of image-to-image generation in which the model restores or completes specific regions of an image that are missing or corrupted. An essential part of the technique is adding noise to these regions, which the model then denoises to fill them in.
  • This capability finds applications in image restoration, where the model can reconstruct damaged or incomplete images based on the provided information.

Depth-to-Image

  • The depth-to-image functionality involves the transformation of depth information into a visual representation. Depth information typically describes the distance of objects in a scene, and the model can convert this data into a corresponding image.
  • Applications of this feature include computer vision tasks such as 3D reconstruction and scene understanding, where depth information is crucial for interpreting the spatial layout of a scene.

In summary, the Stable Diffusion model is a versatile deep-learning model with capabilities ranging from creative content generation to image manipulation and restoration. Its adaptability to diverse tasks makes it a valuable tool in various fields, including computer vision, graphics, and creative arts.

Understanding the Working of Stable Diffusion

Let's start with the components involved in the Stable Diffusion model:

Text Encoder

The task of the text encoder is to transform the input prompt into an embedding space that the U-Net can comprehend. Typically implemented as a simple transformer-based encoder, it maps a sequence of input tokens to a set of latent text embeddings.

Following Imagen, Stable Diffusion does not train the text encoder during its own training phase. Instead, it uses the pretrained text encoder from CLIP, specifically CLIPTextModel. CLIP is a multi-modal vision and language model that can be used for tasks such as image-text similarity and zero-shot image classification. It uses a ViT-like transformer for visual features and a causal language model for text features. The text and visual features are then projected into a latent space with identical dimensions.
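
To make the idea of a shared latent space concrete, here is a minimal sketch of how image-text similarity can drive zero-shot classification. The three-dimensional vectors and the labels are made-up stand-ins, not real CLIP outputs, and the helper name `cosine_similarity` is chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings that already live in the same latent space.
image_embedding = [0.9, 0.1, 0.2]
text_embeddings = {
    "a photo of a horse": [1.0, 0.0, 0.1],
    "a photo of a car": [0.0, 1.0, 0.3],
}

# Zero-shot classification: pick the caption closest to the image.
scores = {label: cosine_similarity(image_embedding, vec)
          for label, vec in text_embeddings.items()}
best_label = max(scores, key=scores.get)
print(best_label, scores)
```

Because both modalities are projected to vectors of the same size, a single similarity function is enough to compare any image with any caption.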

U-Net Model as Noise Predictor

The U-Net architecture consists of an encoder and a decoder, each comprising ResNet blocks. The encoder compresses an image representation into a lower-resolution representation, and the decoder reconstructs that lower-resolution representation back to the original, higher resolution, ideally with less noise. Specifically, the U-Net output predicts the noise residual, which is used to compute the denoised image representation.

To mitigate the loss of crucial information during downsampling, short-cut connections are typically introduced. These connections link the encoder's downsampling ResNets to the decoder's upsampling ResNets. Furthermore, the Stable Diffusion U-Net can condition its output on text embeddings by incorporating cross-attention layers. Both the encoder and decoder sections of the U-Net integrate these cross-attention layers, usually positioning them between ResNet blocks.
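
The cross-attention idea can be sketched in a few lines of plain Python. This is a toy version under simplifying assumptions: the learned query, key, and value projections of a real U-Net are omitted, and the two-dimensional feature vectors are invented for illustration.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(queries, keys, values):
    """Each query (image position) takes a weighted average of the values (text tokens)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in keys])
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Two image positions attend over three text tokens (toy 2-d features).
image_queries = [[1.0, 0.0], [0.0, 1.0]]
text_keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_values = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
attended = cross_attention(image_queries, text_keys, text_values)
print(attended)
```

The queries come from the image side and the keys and values from the text side, which is how the prompt influences every spatial position of the latent.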

Autoencoder (VAE)

The VAE model has two parts: an encoder and a decoder. The encoder converts an image into a low-dimensional latent representation, which serves as the input to the U-Net model. The decoder transforms a latent representation back into an image. During latent diffusion training, the encoder converts the training images into latent representations, and the forward diffusion process then adds progressively more noise to these latents at each step. During inference, the denoised latents produced by the reverse diffusion process are converted back into images by the VAE decoder. As we will see, inference therefore needs only the VAE decoder.
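
The forward diffusion step can be illustrated with the standard variance-preserving formula x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. The sketch below is a toy in plain Python: the four-value "latent" and the three alpha_bar values are arbitrary choices for illustration, not values taken from Stable Diffusion.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Sample x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, with eps ~ N(0, 1)."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
latent = [0.5, -0.25, 1.0, 0.0]  # stand-in for a flattened latent

# alpha_bar near 1 keeps the signal; alpha_bar near 0 leaves almost pure noise.
for alpha_bar in (0.99, 0.5, 0.01):
    noisy = add_noise(latent, alpha_bar, rng)
    print(alpha_bar, [round(v, 3) for v in noisy])
```

As alpha_bar shrinks over the diffusion steps, less of the original latent survives, which is the gradual noising that training relies on.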

Steps to Generate Images with Stable Diffusion

This section uses the Diffusers library to write our own inference pipeline.

Step 1.

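Import all the pretrained models using the Diffusers library:

import torch
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, PNDMScheduler

# Run on the GPU when one is available.
torch_device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. The VAE for decoding the latents into image space.
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae").to(torch_device)

# 2. The tokenizer and text encoder for tokenizing and encoding the prompt.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(torch_device)

# 3. The UNet model for generating the latents.
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet").to(torch_device)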

Step 2.

In this step, we define a K-LMS scheduler instead of the pre-defined PNDM scheduler. A scheduler is the algorithm that, at each denoising step, takes the noise predicted by the U-Net and computes a slightly less noisy latent from the current one.

from diffusers import LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", 
subfolder="scheduler")
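
To give a feel for what a scheduler step does, here is a deliberately simplified sketch. It uses a plain first-order Euler update rather than the multistep K-LMS rule, and it assumes a noise-level (sigma) parameterization in which a noisy sample equals the clean values plus sigma times the noise. The function name, the sigma values, and the "oracle" noise prediction are all hypothetical.

```python
def euler_step(sample, noise_pred, sigma, sigma_next):
    """Move the sample from noise level sigma to sigma_next along the predicted noise."""
    return [x + (sigma_next - sigma) * e for x, e in zip(sample, noise_pred)]

clean = [0.2, -0.7, 1.5]
noise = [1.0, -0.5, 0.25]
sigmas = [10.0, 4.0, 1.0, 0.0]  # hypothetical decreasing noise levels

# Start from the clean values plus noise at the highest level.
sample = [c + sigmas[0] * n for c, n in zip(clean, noise)]

for sigma, sigma_next in zip(sigmas[:-1], sigmas[1:]):
    # A real pipeline would call the U-Net here; this toy "oracle" returns the true noise.
    noise_pred = noise
    sample = euler_step(sample, noise_pred, sigma, sigma_next)

print(sample)
```

With a perfect noise prediction the loop walks the sample back to the clean values; real schedulers differ mainly in how they combine imperfect predictions across steps.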

Step 3.

Let's define a few parameters to be used for generating images:

prompt = ["an astronaut riding a horse"]

height = 512  # default height of Stable Diffusion
width = 512  # default width of Stable Diffusion
num_inference_steps = 100  # number of denoising steps
guidance_scale = 7.5  # scale for classifier-free guidance
generator = torch.manual_seed(32)  # seeded generator for the initial latent noise
batch_size = 1

Step 4.

Get the text embeddings for the prompt, which will be used for the U-Net model.

text_input = tokenizer(prompt, padding="max_length", 
 max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt")


with torch.no_grad():
 text_embeddings = text_encoder(text_input.input_ids.to(torch_device))[0]
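
The `padding="max_length"` and `truncation=True` arguments ensure every prompt becomes a sequence of the same length. The sketch below imitates that behavior in plain Python. The whitespace split, the toy vocabulary, and the helper `pad_to_max_length` are hypothetical; the real CLIP tokenizer uses byte-pair encoding and its own special-token ids, while 77 is the sequence length used by the CLIP text model.

```python
def pad_to_max_length(token_ids, max_length, pad_id, bos_id, eos_id):
    """Wrap ids in start/end markers, then truncate or pad to max_length."""
    # Leave room for the two special tokens.
    body = token_ids[: max_length - 2]
    ids = [bos_id] + body + [eos_id]
    # Right-pad so every prompt has the same length.
    return ids + [pad_id] * (max_length - len(ids))

# Hypothetical toy vocabulary; real CLIP ids differ.
vocab = {"<bos>": 0, "<eos>": 1, "<pad>": 2,
         "an": 3, "astronaut": 4, "riding": 5, "a": 6, "horse": 7}

words = "an astronaut riding a horse".split()
token_ids = [vocab[w] for w in words]
padded = pad_to_max_length(token_ids, max_length=77, pad_id=vocab["<pad>"],
                           bos_id=vocab["<bos>"], eos_id=vocab["<eos>"])
print(len(padded), padded[:8])
```

A fixed length matters because the text embeddings for different prompts, including the empty prompt used later, must have identical shapes.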

Step 5.

Next, we obtain the unconditional text embeddings needed for classifier-free guidance. These are simply the embeddings of the padding token (that is, of empty text). They must have the same shape as the conditional text embeddings, matching both the batch size and the sequence length.

max_length = text_input.input_ids.shape[-1]

uncond_input = tokenizer(
    [""] * batch_size, padding="max_length", max_length=max_length, return_tensors="pt"
)

with torch.no_grad():
  uncond_embeddings = text_encoder(uncond_input.input_ids.to(torch_device))[0]

Step 6.

Classifier-free guidance needs two noise predictions at every step: one conditioned on the text embeddings and one using the unconditional embeddings (uncond_embeddings). Rather than running the U-Net twice, a more efficient approach is to concatenate both sets of embeddings into a single batch, so that one batched forward pass produces both predictions.

text_embeddings = torch.cat([uncond_embeddings, text_embeddings])

Step 7.

Generate initial latent noise:

latents = torch.randn(
  (batch_size, unet.config.in_channels, height // 8, width // 8),
  generator=generator,
)
latents = latents.to(torch_device)
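
The role of the seed and the latent shape can be checked without any deep learning library. The following sketch draws standard-normal noise with Python's `random` module; the helpers `make_latents` and `shape_of` are hypothetical, and the 4 latent channels and the downscale factor of 8 are assumed to match the Stable Diffusion v1 U-Net and VAE used above.

```python
import random

def make_latents(seed, batch_size, channels, height, width, downscale=8):
    """Draw standard-normal noise with shape (batch, channels, height/8, width/8)."""
    rng = random.Random(seed)
    h, w = height // downscale, width // downscale
    return [[[[rng.gauss(0.0, 1.0) for _ in range(w)] for _ in range(h)]
             for _ in range(channels)] for _ in range(batch_size)]

def shape_of(nested):
    """Return the dimensions of a regularly nested list."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)

a = make_latents(seed=32, batch_size=1, channels=4, height=512, width=512)
b = make_latents(seed=32, batch_size=1, channels=4, height=512, width=512)
c = make_latents(seed=33, batch_size=1, channels=4, height=512, width=512)

print(shape_of(a))  # latent shape for a 512x512 image
print(a == b)       # same seed, same starting noise
print(a == c)       # different seed, different starting noise
```

Fixing the seed is what makes a generated image reproducible: the same prompt, settings, and starting noise lead to the same result.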

Step 8.

The initialization of the scheduler involves specifying the chosen num_inference_steps. During this initialization, the scheduler computes the sigmas and determines the exact time step values to use throughout the denoising process.

scheduler.set_timesteps(num_inference_steps)

latents = latents * scheduler.init_noise_sigma
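
As a rough picture of what `set_timesteps` prepares, the sketch below derives noise levels (sigmas) from a beta schedule and picks evenly spaced timesteps. It is an approximation under stated assumptions: the `scaled_linear` schedule with `beta_start=0.00085`, `beta_end=0.012`, and 1000 training steps is assumed to match the Stable Diffusion v1 scheduler configuration, the timesteps are rounded to integers for simplicity, and all function names are hypothetical.

```python
import math

def scaled_linear_betas(num_train_steps, beta_start, beta_end):
    """Betas spaced linearly in square-root space, then squared."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    return [(lo + (hi - lo) * i / (num_train_steps - 1)) ** 2
            for i in range(num_train_steps)]

def sigmas_from_betas(betas):
    """sigma_t = sqrt((1 - alpha_bar_t) / alpha_bar_t)."""
    sigmas, alpha_bar = [], 1.0
    for beta in betas:
        alpha_bar *= 1.0 - beta
        sigmas.append(math.sqrt((1.0 - alpha_bar) / alpha_bar))
    return sigmas

def pick_timesteps(num_train_steps, num_inference_steps):
    """Evenly spaced training timesteps, ordered from most to least noisy."""
    step = (num_train_steps - 1) / (num_inference_steps - 1)
    return [round(i * step) for i in range(num_inference_steps)][::-1]

train_sigmas = sigmas_from_betas(scaled_linear_betas(1000, 0.00085, 0.012))
timesteps = pick_timesteps(1000, 100)
schedule = [train_sigmas[t] for t in timesteps]
print(timesteps[:3], schedule[0], schedule[-1])
```

The largest sigma plays the role of the initial noise scale: multiplying unit-variance noise by it puts the starting latents at the noise level the first denoising step expects.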

Step 9.

Let's write the denoising loop:

from tqdm.auto import tqdm

for t in tqdm(scheduler.timesteps):
  # expand the latents if we are doing classifier-free guidance to avoid doing two forward passes.
  latent_model_input = torch.cat([latents] * 2)
  latent_model_input = scheduler.scale_model_input(latent_model_input, t)

  # predict the noise residual
  with torch.no_grad():
    noise_pred = unet(latent_model_input, t, encoder_hidden_states=text_embeddings).sample

  # perform guidance
  noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
  noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

  # compute the previous noisy sample x_t -> x_t-1
  latents = scheduler.step(noise_pred, t, latents).prev_sample
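
The guidance line in the loop is easiest to understand on small numbers. This sketch reproduces the same formula on plain lists; the prediction values are invented, and `chunk_in_two` and `apply_guidance` are hypothetical helpers that mirror `chunk(2)` and the guidance expression.

```python
def chunk_in_two(batch):
    """Split a batch into its first and second halves."""
    half = len(batch) // 2
    return batch[:half], batch[half:]

def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """noise = uncond + scale * (text - uncond), applied element by element."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]

# One batched prediction: first half unconditional, second half text-conditioned.
batched_pred = [[0.10, -0.20, 0.30], [0.30, -0.10, 0.10]]
(uncond,), (text,) = chunk_in_two(batched_pred)

for scale in (0.0, 1.0, 7.5):
    print(scale, apply_guidance(uncond, text, scale))
```

A scale of 0 ignores the prompt, a scale of 1 returns the text-conditioned prediction, and larger values push the result further in the direction the prompt suggests.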

Step 10.

Let's use the VAE to decode the generated latent into the image.

# scale and decode the image latents with vae

latents = 1 / 0.18215 * latents

with torch.no_grad():

  image = vae.decode(latents).sample

Step 11.

Let's convert the image to PIL to display or save it.

from PIL import Image

image = (image / 2 + 0.5).clamp(0, 1)
image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
images = (image * 255).round().astype("uint8")
pil_images = [Image.fromarray(image) for image in images]
pil_images[0]
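
The value conversion in this step can be followed on single numbers. The sketch below applies the same shift, clamp, and scale to individual decoder outputs in plain Python; `to_uint8` is a hypothetical helper, and the decoder output is assumed to lie roughly in the range [-1, 1].

```python
def to_uint8(value):
    """Map a decoder output in [-1, 1] to an integer pixel value in [0, 255]."""
    unit = min(max(value / 2 + 0.5, 0.0), 1.0)  # shift to [0, 1] and clamp
    return int(round(unit * 255))

# Values outside [-1, 1] are clamped to the nearest valid pixel value.
print([to_uint8(v) for v in (-3.0, -1.0, 0.0, 1.0, 1.7)])
```

The clamp matters because the decoder can slightly overshoot its nominal range, and unclamped values would wrap around when cast to 8-bit integers.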

The below image is generated using the above code:

Conclusion

In the above article, we explored the components involved in image generation by Stable Diffusion and its capabilities. Following are the key takeaways:

  • Comprehensive insight into the capabilities of diffusion models.
  • Overview of the critical components integral to Stable Diffusion.
  • Practical, hands-on experience in constructing a personalized diffusion pipeline.

Frequently Asked Questions

Q1. Why is Stable Diffusion faster than other models like Imagen?

Unlike models such as Imagen, which operate in pixel space, Stable Diffusion operates in a lower-dimensional latent space.

Q2. What is the role of the text encoder in the Stable Diffusion?

It converts the text input into text embeddings, which can be used as input for U-Net.

Q3. What is latent diffusion?

Latent diffusion runs the diffusion process in a lower-dimensional latent space instead of the actual pixel space. This substantially reduces both memory use and compute cost.
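
A quick count shows how much smaller the latent space is. The numbers below assume a 512x512 RGB image and the 4-channel, 8x-downscaled latent used in the pipeline above; `num_values` is a hypothetical helper.

```python
def num_values(channels, height, width):
    """Number of scalar values in a (channels, height, width) tensor."""
    return channels * height * width

pixel_values = num_values(channels=3, height=512, width=512)
latent_values = num_values(channels=4, height=512 // 8, width=512 // 8)

print(pixel_values, latent_values, pixel_values / latent_values)
```

Under these assumptions the U-Net processes 48 times fewer values per image than it would in pixel space.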

Q4. What is a latent seed?

A latent seed generates random latent image representations of size 64×64.

Q5. What are schedulers?

They are the algorithms that drive the denoising process: at each step, a scheduler uses the noise predicted by the U-Net model to remove some noise from the latent image.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author's discretion.

Hi, I'm Sarvagya Agrawal, Software Engineer, with a strong passion for utilizing technology to drive positive change in society. I believe that technology is not just a skill, but an art form that can be leveraged to transform the world.
My primary focus lies in machine learning and web development, with strong programming skills in Python. I have worked on innovative projects, including developing an AI model to calculate cardiovascular risk factors from OCTA scans using computer vision algorithms and creating an AI-based web application for calculating financial risk based on an individual's spending trends.
