ComfyUI is a node-based GUI designed for generative AI models. It is one of the most versatile ways to generate AI images, video, and audio locally on your own hardware. Free from software censorship or subscription requirements, it offers thousands of nodes and AI models with exceptional user control. ComfyUI is an incredibly powerful tool for anyone aiming to stay ahead of the AI curve.

It’s a myth that you need an insanely powerful GPU or need to be a programmer, although both do speed up the process. You can run many image generation models on 6GB to 8GB of VRAM on a good laptop. ComfyUI and other GUI options like Automatic1111 or InvokeAI may have a significant learning curve, but they’re no more challenging than learning Adobe Photoshop.

What you need to get started

Setting up ComfyUI manually requires some technical knowledge and familiarity with Python. The V1 ComfyUI Desktop Application, now in closed beta, promises to simplify the setup process. Until its release, you can use the Pinokio virtual computer for an easy installation of ComfyUI if you don't want to mess with terminal commands and virtual Python environments.

This guide isn’t meant to be just a beginner’s tutorial. Instead, it provides the foundational knowledge needed to experiment and learn on your own. Each step highlights essential techniques and leads to a basic workflow. The goal is to help you grasp the core elements of a workflow without the frustration of deciphering someone else’s complex process.

Installing Checkpoint models, LoRAs, VAEs, and custom nodes

First, install ComfyUI manually or using Pinokio by visiting the resources above. For learning purposes, this guide uses an older Stable Diffusion 1.5 (SD 1.5) model, which has lower VRAM requirements and faster generation speeds.

Civitai is an excellent resource for downloading models, tools, and workflows. You can filter by Highest Rated and sort to find the most popular options. To get started, I suggest you try MeinaMix V12 - Final for anime and Realistic Vision V6.0 B1 for photorealism. They are generally easy to use and consistently produce decent results, even when using very basic workflows.

Base models are the original trained models, like SD 1.5, SDXL, Pony, or Flux. Think of checkpoint models as base models that have been refined or merged with other checkpoints to produce certain types of images, the broadest categories being anime and photorealistic.

Civitai.com has a lot of NSFW content, while Civitai.green provides a safe-for-work alternative. Make sure to double-check what URL you are visiting.

You’ll be using LoRA (Low-Rank Adaptation) models frequently. Think of these as smaller models that make very specific adjustments to the checkpoint model. For example, you can use a Studio Ghibli style LoRA with MeinaMix to generate images in that art style. LoRAs aren’t limited to style, though. There are LoRAs for specific characters and people, pieces of clothing, poses, hairstyles, environments, and more. For this guide, we’ll use a simple LoRA meant to increase quality called Perfection “SD1.5” v0.9.

Launch ComfyUI and click on the Manager button in the upper right. Click on Custom Nodes Manager. Search for “chooser” and click Install for ID #241 Image chooser. Click Close to return to the ComfyUI Manager Menu and click on Model Manager. Search for “vae” and click Install for ID #105 vae-ft-mse-840000-ema-pruned. Close the ComfyUI Manager Menu to return to the main screen. Click Refresh at the top of the screen if you are using Pinokio or just refresh the webpage if you are using a standard web browser.

The Image chooser custom node will be the only non-core node we’ll be using. It pauses a workflow after an image is generated, so you can cancel before the image is saved or, in more complex workflows, before later time-consuming steps run. A VAE (Variational Autoencoder) is “baked in” to most checkpoint models, but we’re installing one separately so you know how to use different VAEs in the future. We’ll tackle what a VAE is and what it does in the next section.

I highly recommend the Stable Diffusion Art website if you get stuck at any step. The tutorials are well written and easy to follow.

Core nodes in a basic workflow

I suggest you build the basic elements of a workflow from scratch every time you start a new project, until you can do it from memory. Understanding this core structure will help you troubleshoot others’ workflows when you start trying them out and inevitably run into problems. ComfyUI and custom nodes are updated frequently, and you can easily run into missing nodes in older workflows.

Double-click anywhere on the screen and search for a node by name to add it to the workflow.

Start with a Load Checkpoint, Load VAE, and Empty Latent Image node. Click on the text fields in the nodes to select the models you downloaded earlier. For the files to appear there, they need to be in the right place: inside the ComfyUI folder is a folder named models, and inside that are subfolders named for each model type (for example, checkpoints, loras, and vae) where you place your downloads.
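If it helps to see the folder layout spelled out, here is a minimal Python sketch that builds the destination path for a download. The install location and filename are hypothetical, and the helper function is only an illustration, not part of ComfyUI.

```python
from pathlib import Path

# Subfolders of ComfyUI/models for the model types used in this guide.
MODEL_SUBFOLDERS = {
    "checkpoint": "checkpoints",
    "lora": "loras",
    "vae": "vae",
}

def model_destination(comfyui_root, kind, filename):
    """Return the path where a downloaded model file should be placed."""
    if kind not in MODEL_SUBFOLDERS:
        raise ValueError(f"unknown model kind: {kind!r}")
    return Path(comfyui_root) / "models" / MODEL_SUBFOLDERS[kind] / filename

# Hypothetical install folder and filename, for illustration only.
dest = model_destination("ComfyUI", "lora", "example_lora.safetensors")
print(dest.as_posix())  # ComfyUI/models/loras/example_lora.safetensors
```

After moving a file into place, refresh ComfyUI so the new model shows up in the node’s list.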

Load Checkpoint node outputs

The CLIP output (Contrastive Language-Image Pre-Training) from the Load Checkpoint node will connect to the nodes where you enter your text prompt. CLIP is another AI model that is embedded in the checkpoint model. It’s been trained to determine how well text captions fit with their images. Think of it as a translator that can convert your natural language prompt into a language the AI understands.

Load VAE node

Typically, you will use the VAE output of the Load Checkpoint node. The VAE is also embedded in most checkpoints, but you can use the Load VAE node if you want to override that and use a different VAE or when a checkpoint does not have a VAE baked in. The VAE converts (decodes) the images generated in latent space to the final viewable images. In an image-to-image workflow, the input image is converted (encoded) to a latent image by the VAE.

Empty Latent Image node

The Empty Latent Image node provides an empty “canvas” for image generation. Defining the width and height of an empty latent is like setting the size of a canvas, even though latents aren’t simply a new empty page to draw on. Base models and checkpoints are trained to generate images at specific sizes. You can look up the recommended sizes for a base model or use custom nodes, such as the Aspect Ratio options in Comfyroll Studio (ID #78), to select common sizes. The batch_size parameter in this node determines how many images will be generated.
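To make the “canvas” idea a little more concrete, the sketch below computes the shape of the latent that an SD 1.5 workflow works on. It assumes the usual SD 1.5 layout of 4 latent channels at one-eighth of the pixel resolution; other model families can differ, and the function name is my own.

```python
def latent_shape(width, height, batch_size=1, channels=4, downscale=8):
    """Shape (batch, channels, height, width) of an SD 1.5 empty latent."""
    if width % downscale or height % downscale:
        raise ValueError("width and height should be multiples of 8")
    return (batch_size, channels, height // downscale, width // downscale)

print(latent_shape(512, 512))     # (1, 4, 64, 64)
print(latent_shape(512, 768, 4))  # (4, 4, 96, 64)
```

A 512x512 image is therefore generated on a 64x64 latent, which is a big part of why SD 1.5 is light on VRAM.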

Add a CLIP Set Last Layer, Repeat Latent Batch, Load LoRA, and two CLIP Text Encode (Prompt) nodes as shown in the image. You don’t need CLIP Set Last Layer or Repeat Latent Batch nodes in a simple workflow, but you are likely to use them as you advance.

Repeat Latent Batch node

Repeat Latent Batch does the same job as the batch_size parameter. It becomes useful in image-to-image generation: when you convert an image to a latent, there is no Empty Latent Image node, so you need another way to set how many variations are generated each time the workflow runs.
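As a rough sketch of that image-to-image case, here is how the start of such a workflow could look in ComfyUI’s API (JSON) format, written as a Python dictionary. Each link is a pair of node ID and output index. The filenames are hypothetical, and the rest of the workflow (prompts, KSampler, decoding) is left out.

```python
# Start of an image-to-image workflow in API format. Filenames are made up.
img2img_start = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "example_checkpoint.safetensors"}},
    "2": {"class_type": "LoadImage",
          "inputs": {"image": "example_input.png"}},
    # Encode the image to a latent with the checkpoint's VAE (output 2).
    "3": {"class_type": "VAEEncode",
          "inputs": {"pixels": ["2", 0], "vae": ["1", 2]}},
    # Repeat that single latent so one run produces four variations.
    "4": {"class_type": "RepeatLatentBatch",
          "inputs": {"samples": ["3", 0], "amount": 4}},
}

# The KSampler's latent_image input would then connect to ["4", 0].
print(img2img_start["4"]["inputs"])
```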

CLIP Set Last Layer node changes CLIP Skip

You will often see "CLIP Skip" mentioned when discussing Stable Diffusion. The CLIP Set Last Layer node changes this parameter. Sometimes checkpoints will recommend a CLIP Skip setting, but most of the time you don’t need to set this yourself. This is an easy parameter to experiment with, and I encourage you to try it. The setting stops the CLIP text encoder early instead of running your prompt through every layer: -1 uses the output of the last layer as normal, -2 uses the second-to-last layer, and so on. Each layer skipped tends to make prompt adherence more generalized.
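The negative numbers are easier to read once you map them to actual layers. This small sketch assumes the 12-layer CLIP text encoder used by SD 1.5; the function is only an illustration of the counting, not ComfyUI code.

```python
def clip_layer_used(stop_at_clip_layer, num_layers=12):
    """Return the 1-based text encoder layer whose output is used."""
    if not -num_layers <= stop_at_clip_layer <= -1:
        raise ValueError("expected a value between -num_layers and -1")
    return num_layers + 1 + stop_at_clip_layer

for setting in (-1, -2, -3):
    print(setting, "-> layer", clip_layer_used(setting))
```

With 12 layers, -1 uses layer 12 (nothing skipped) and -2 uses layer 11.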

Load LoRA node

The Load LoRA node parameters are simple. I suggest not changing strength_clip when you are starting out, and use the strength_model parameter to adjust how much the LoRA affects the final image. LoRAs often have suggested ranges that work best, which you can find in their descriptions on Civitai. You can string together multiple LoRAs by connecting their MODEL and CLIP outputs to the model and clip inputs of another LoRA loader. For example, you might use a character LoRA from your favorite anime in combination with an art style you want to see that character depicted in.
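Chaining is easier to picture as data. The sketch below uses ComfyUI’s API (JSON) format, where each link is a pair of node ID and output index, and adds one LoraLoader node per LoRA. The helper function, node IDs, filenames, and strengths are all made up for illustration.

```python
def chain_loras(workflow, model_link, clip_link, loras, first_id=100):
    """Add one LoraLoader node per LoRA, each fed by the previous one.

    Returns the (model, clip) links of the last loader in the chain.
    """
    for offset, (lora_name, strength_model) in enumerate(loras):
        node_id = str(first_id + offset)
        workflow[node_id] = {
            "class_type": "LoraLoader",
            "inputs": {
                "model": model_link,
                "clip": clip_link,
                "lora_name": lora_name,
                "strength_model": strength_model,
                "strength_clip": 1.0,  # left at its default
            },
        }
        # Output 0 of a LoraLoader is MODEL, output 1 is CLIP.
        model_link, clip_link = [node_id, 0], [node_id, 1]
    return model_link, clip_link

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "example_checkpoint.safetensors"}},
}
model, clip = chain_loras(
    workflow, ["1", 0], ["1", 1],
    [("character_example.safetensors", 0.8), ("style_example.safetensors", 0.6)],
)
print(model, clip)  # ['101', 0] ['101', 1]
```

The KSampler and the CLIP Text Encode nodes would then connect to the last loader in the chain rather than to the checkpoint directly.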

CLIP Text Encode (Prompt) node

The two CLIP Text Encode (Prompt) nodes are used for your positive and negative prompts where you tell the AI what you want to see or not see in the image.

KSampler node and connections

Add a KSampler node and connect the CONDITIONING outputs to the positive and negative inputs, along with the model and latent_image connections. The KSampler can be thought of as a specialized type of processor. It processes the information from the checkpoint model, LoRAs, and your prompts to generate an image.

The KSampler has several key parameters. The seed parameter initializes the random noise that fills the empty latent image. Using the same seed with the same settings lets you recreate an identical image. The control_after_generate setting determines how the seed changes, or whether it stays fixed, after each generation. The steps setting determines how many steps are taken to refine noise into a coherent image. More steps generally yield more detail, with diminishing returns, but require more time. The cfg setting adjusts how strongly the model adheres to your text prompts, balancing creativity and control.
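The reason a fixed seed reproduces an image is that the starting noise is generated deterministically from it. Here is a tiny demonstration of that idea using Python’s built-in random module as a stand-in; ComfyUI itself generates noise with PyTorch, so this is an analogy rather than its actual code.

```python
import random

def make_noise(seed, count=4):
    """Draw `count` Gaussian noise values from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(count)]

same_a = make_noise(12345)
same_b = make_noise(12345)
different = make_noise(12346)

print(same_a == same_b)     # True
print(same_a == different)  # False
```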

A lower cfg value allows the model to be more “creative” in what it produces, while a higher value gives your prompts more control over what is produced. Higher values can begin to produce artifacts, distortions, and harsh images that are oversaturated and high contrast. The workable cfg range tends to depend on LoRA strength, so you may be able to reduce LoRA strength in order to increase cfg further.
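Under the hood, cfg comes from classifier-free guidance. In its usual formulation, the model makes one prediction with your positive prompt and one with your negative prompt, and cfg scales the difference between them. The numbers below are made up purely to show the arithmetic.

```python
def apply_cfg(cond, uncond, cfg):
    """Classifier-free guidance: start from the negative-prompt prediction
    and move toward the positive-prompt prediction, scaled by cfg."""
    return [u + cfg * (c - u) for c, u in zip(cond, uncond)]

cond = [0.5, -0.25]    # made-up prediction with the positive prompt
uncond = [0.25, 0.25]  # made-up prediction with the negative prompt

print(apply_cfg(cond, uncond, 1.0))  # [0.5, -0.25]
print(apply_cfg(cond, uncond, 8.0))  # [2.25, -3.75]
```

At a cfg of 1 the result is just the positive-prompt prediction. At 8 the difference is exaggerated eightfold, which hints at why very high values lead to harsh, oversaturated images.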

TIP: Set the number of steps low for faster image generation. When you see an image you think has potential or a composition you like, fix the seed and increase the number of steps.

Samplers and Schedulers in the KSampler node

The sampler_name and scheduler parameters can seem overwhelming given the number of options. The sampler is the tool and method used to denoise the image. In a very simplified way, you can think of it like brushstrokes. Some painting techniques are quick and rough, and some are slow and precise. Schedulers determine the overall plan for denoising, telling the sampler how much noise should remain at each step.
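For a feel of what a scheduler’s “plan” looks like, this sketch computes the noise levels (sigmas) for the karras scheduler using the spacing formula from Karras et al. (2022). The sigma_min and sigma_max defaults are rounded values I’m assuming for SD 1.5, and real implementations usually append a final sigma of 0, which is left out here.

```python
def karras_sigmas(steps, sigma_min=0.03, sigma_max=14.6, rho=7.0):
    """Noise levels from sigma_max down to sigma_min with Karras spacing."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    min_inv_rho = sigma_min ** (1.0 / rho)
    max_inv_rho = sigma_max ** (1.0 / rho)
    sigmas = []
    for i in range(steps):
        t = i / (steps - 1)  # 0.0 at the first step, 1.0 at the last
        sigmas.append((max_inv_rho + t * (min_inv_rho - max_inv_rho)) ** rho)
    return sigmas

for step, sigma in enumerate(karras_sigmas(10)):
    print(f"step {step}: sigma = {sigma:.3f}")
```

The printed values fall quickly at first and then slowly, so most of the steps are spent on fine detail at low noise levels.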

Not all samplers and schedulers work together, and not every combination works well with every checkpoint model. I suggest you don’t spend too much time changing these, unless your hardware is capable of generating images very quickly. Look at the checkpoint model description, or at images that have been generated using that model on Civitai, for samplers that work with that model.

You’ll likely be more interested in playing with LoRAs and prompts. My suggestion is to stick with a few sampler and scheduler combos that are used frequently and typically work with most models. The euler and euler_ancestral samplers work well with the normal, karras, and exponential schedulers. The dpmpp_2m_sde and dpmpp_3m_sde samplers are frequently used with the karras scheduler. The ddim sampler paired with the ddim_uniform scheduler can sometimes work well for photorealistic images. Unless otherwise suggested on Civitai, stick to those until you feel like you have a good grasp on prompting and using LoRAs.
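If you like keeping notes as code, the pairings above fit in a small lookup table. This is just a cheat sheet of the suggestions in this guide (reading the DDIM suggestion as the ddim sampler with the ddim_uniform scheduler), not a list of what ComfyUI allows.

```python
# Sampler and scheduler pairings suggested in this guide.
SUGGESTED_COMBOS = {
    "euler": {"normal", "karras", "exponential"},
    "euler_ancestral": {"normal", "karras", "exponential"},
    "dpmpp_2m_sde": {"karras"},
    "dpmpp_3m_sde": {"karras"},
    "ddim": {"ddim_uniform"},
}

def is_suggested(sampler_name, scheduler):
    """True if the pairing is one of the starter combos listed above."""
    return scheduler in SUGGESTED_COMBOS.get(sampler_name, set())

print(is_suggested("euler", "karras"))         # True
print(is_suggested("dpmpp_2m_sde", "normal"))  # False
```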

VAE Decode, Preview Chooser, and Save Image nodes

Use the VAE Decode node to convert the latent into an image you can view. Use the Preview Chooser node to review the image, and pass it to the Save Image node if you want to keep it. This basic workflow provides a clear understanding of each step in the image generation process.
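For reference, here is a rough sketch of the bare-minimum text-to-image graph in ComfyUI’s API (JSON) format, written as a Python dictionary. Each link is a pair of node ID and output index. The checkpoint filename, prompts, and settings are placeholders, and the optional nodes (Load VAE, Load LoRA, CLIP Set Last Layer, Repeat Latent Batch) and the Preview Chooser custom node are left out to keep it short. The small helper that checks for broken links is my own addition.

```python
import json

CKPT = "example_checkpoint.safetensors"  # hypothetical filename

workflow = {
    "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": CKPT}},
    "2": {"class_type": "EmptyLatentImage",
          "inputs": {"width": 512, "height": 512, "batch_size": 1}},
    # Positive and negative prompts both use the checkpoint's CLIP (output 1).
    "3": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "a lighthouse at sunset"}},
    "4": {"class_type": "CLIPTextEncode",
          "inputs": {"clip": ["1", 1], "text": "blurry, low quality"}},
    "5": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["3", 0],
                     "negative": ["4", 0], "latent_image": ["2", 0],
                     "seed": 12345, "steps": 20, "cfg": 7.0,
                     "sampler_name": "euler", "scheduler": "normal",
                     "denoise": 1.0}},
    # Decode with the VAE baked into the checkpoint (output 2).
    "6": {"class_type": "VAEDecode",
          "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
    "7": {"class_type": "SaveImage",
          "inputs": {"images": ["6", 0], "filename_prefix": "basic_workflow"}},
}

def dangling_links(wf):
    """Return (node_id, input_name) pairs whose link targets a missing node."""
    missing = []
    for node_id, node in wf.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] not in wf:
                missing.append((node_id, name))
    return missing

print(dangling_links(workflow))  # []
payload = json.dumps({"prompt": workflow})  # the graph as a JSON string
```

Reading the dictionary top to bottom mirrors the order in which you placed the nodes on the canvas.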

ComfyUI is worth learning as a generative AI tool

Critics may claim AI image generation produces low-quality results, but ComfyUI offers detailed control over the process. Its node-based interface may seem intimidating at first, but learning the basics equips users to troubleshoot and create effectively. Understanding checkpoints, latent images, prompts, samplers, and VAEs builds a solid foundation that supports AI in creative projects.

Checkpoint models contain information on what different images can look like. Latent images are like the canvas the image is generated on. Your prompts, encoded by CLIP, determine the content of the image. The KSampler adds noise to the latent and then processes everything to create an image by removing a little noise at each step. VAE models convert images to latents and latents to images, so you can work with your existing images or see the results after processing.

AI can be a helpful tool in your creative process, even if you aren’t the best traditional artist. Self-expression comes in many forms, and everyone should have the opportunity to pursue it. Professional creatives will benefit from learning how to use these tools now, instead of scrambling to catch up later. Other professional creative tools like Adobe Photoshop have been integrating their own generative AI powered by Firefly, and many more creative-focused companies are sure to follow.
