
How to Run a Diffusion Model Locally (without ComfyUI) — Using Qwen-Image-Edit with Nunchaku

Background

I’ve been experimenting with image generation models and wanted to run them locally on my own hardware. My laptop setup is:

  • Intel Core Ultra 9, 32GB RAM
  • NVIDIA RTX 4080, 12GB VRAM

The key challenge? Most tutorials online point to ComfyUI, which is great for tinkering, but I only needed the generation pipeline to integrate directly into my application. That’s when I started exploring quantized diffusion models and tools like Nunchaku.

In this post, I’ll walk through a bit of background on quantized models, explain why Nunchaku is useful compared to the current Hugging Face API, and then show the exact steps I used to run Qwen-Image-Edit locally on my laptop.

Quantized Models

Diffusion models are notoriously heavy — often tens of gigabytes in full precision. Running a full model locally can be a pain if you don’t have enterprise GPUs.

That’s where quantization comes in. Instead of storing every weight at full 16/32-bit precision, quantized models compress them down (e.g. 4-bit, 5-bit) while keeping most of the performance intact.

  • ✅ Saves VRAM and RAM
  • ✅ Runs faster on consumer GPUs
  • ⚠️ Slight trade-off in fidelity

On Hugging Face, you’ll often see these models in GGUF format (a general quantization format popularized by llama.cpp for LLMs, but now adopted for vision models too).

So instead of running a full 20GB model, you can load a quantized GGUF model and still get high-quality image outputs.
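To make the idea concrete, here is a minimal sketch of symmetric 4-bit weight quantization in PyTorch. It is purely illustrative — not how Nunchaku or GGUF implement it internally — but it shows the core trick: a scale maps each weight into a 16-level integer range, and dequantization multiplies back.

import torch

def quantize_int4(w: torch.Tensor):
    # One scale per tensor (real formats use per-group scales for better accuracy)
    scale = w.abs().max() / 7
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)  # 16 levels; packed into 4 bits on disk
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float32) * scale

w = torch.randn(4096, 4096)          # a single fp32 weight matrix: ~64 MB
q, scale = quantize_int4(w)          # ~8 MB once two 4-bit values are packed per byte
print("max abs error:", (w - dequantize(q, scale)).abs().max().item())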

Why Nunchaku?

Hugging Face’s Diffusers library has recently added GGUF support, but there are still limitations:

  • Pipelines like StableDiffusionPipeline don’t support quantized checkpoints yet.
  • You often have to manually stitch together encoders, decoders, and the UNet.

While Diffusers technically lets you run GGUF models by manually loading components with from_single_file, you’d also need to handle the VAE, text encoder, and scheduler yourself — and write the sampling loop. This works, but in practice it’s cumbersome if all you want is quick local inference.
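For reference, this is roughly what that manual route looks like, sketched with a FLUX GGUF checkpoint since that is the case Diffusers documents (the repo and file name here are examples, not something I tested for Qwen-Image-Edit). Loading the quantized transformer is the easy part; everything around it is still on you.

import torch
from diffusers import FluxTransformer2DModel, GGUFQuantizationConfig

# Example GGUF checkpoint path (illustrative)
gguf_url = "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf"
transformer = FluxTransformer2DModel.from_single_file(
    gguf_url,
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)
# From here you still have to assemble the VAE, text encoder(s), scheduler,
# and the denoising loop (or wire the transformer into a full pipeline) yourself.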

This is where Nunchaku comes in. It’s designed for running quantized diffusion models locally with optimized CUDA kernels. Think of it as the missing link: it fills the gap between “nice quantized weights on Hugging Face” and actually running them in a local pipeline.

Implementation: Running Qwen-Image-Edit Locally

Here’s a minimal setup I used to get Qwen-Image-Edit running with Nunchaku.

1. Install Nunchaku

Follow the official installation guide: Nunchaku Installation. Make sure to match your PyTorch version, Python version, and platform (Windows/Linux, CUDA version).
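Before installing, it helps to know exactly what you are matching against. A quick sanity check (just plain PyTorch, nothing Nunchaku-specific) prints the versions the wheel needs to line up with:

import sys
import torch

# Print the versions the Nunchaku build must match
print("python :", sys.version.split()[0])
print("torch  :", torch.__version__)
print("cuda   :", torch.version.cuda)
print("gpu    :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device found")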

2. Download the Quantized Model

Go to Hugging Face and download the quantized model provided by Nunchaku for Qwen-Image-Edit.

When choosing a model, there are several parameters to consider. The r128 variant gives higher output quality than r32, but runs slower. If you have a non-Blackwell GPU (pre-RTX 50 series), the int4 variant is recommended; if you have a Blackwell GPU (RTX 50 series), pick the fp4 variant for optimal performance. The 4-step models generate results faster, while the 8-step models produce smoother, more refined edits.
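As a convenience, Nunchaku's get_precision() helper (also imported in the snippet below) reports whether your GPU wants int4 or fp4, so you can build the filename programmatically. A small sketch — the exact naming of the 4-step/8-step variants may differ, so double-check the model repository:

from nunchaku.utils import get_precision

precision = get_precision()   # "int4" on pre-Blackwell GPUs, "fp4" on RTX 50-series
rank = 32                     # or 128 for higher quality at the cost of speed
filename = f"svdq-{precision}_r{rank}-qwen-image-edit.safetensors"
print(filename)               # e.g. svdq-int4_r32-qwen-image-edit.safetensors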

Important: Be careful not to download a quantized model that was developed by another party — it may be incompatible with Nunchaku.

3. Run the Model in Python

Note: The following code snippet is adapted directly from the Nunchaku documentation.

The two major parameters to consider when running inference are the number of steps and the CFG scale. Make sure the inference steps match the step count of the model you downloaded (e.g., 4-step or 8-step) for optimal performance. Increasing the CFG scale can improve alignment with the text prompt but may also slow down the generation process.

import torch
from diffusers import QwenImageEditPipeline
from diffusers.utils import load_image
from nunchaku import NunchakuQwenImageTransformer2DModel
from nunchaku.utils import get_gpu_memory, get_precision

# Replace with your file path of the model
rank = 32  # matches the r32 checkpoint used below
transformer = NunchakuQwenImageTransformer2DModel.from_pretrained(
    f"nunchaku-tech/nunchaku-qwen-image-edit/svdq-int4_r{rank}-qwen-image-edit.safetensors"
)
pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", transformer=transformer, torch_dtype=torch.bfloat16
)

if get_gpu_memory() > 18:
    pipeline.enable_model_cpu_offload()
else:
    # Use per-layer offloading for low VRAM. This only requires 3-4GB of VRAM.
    transformer.set_offload(
        True, use_pin_memory=False, num_blocks_on_gpu=1
    )  # increase num_blocks_on_gpu if you have more VRAM
    pipeline._exclude_from_cpu_offload.append("transformer")
    pipeline.enable_sequential_cpu_offload()

image = load_image(
    "https://huggingface.co/datasets/nunchaku-tech/test-data/resolve/main/inputs/neon_sign.png"
)
image = image.convert("RGB")
prompt = "change the text to read '双截棍 Qwen Image Edit is here'"
inputs = {
    "image": image,
    "prompt": prompt,
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 10,  # match the number of steps with your downloaded model
}
output = pipeline(**inputs)
output_image = output.images[0]
output_image.save(f"qwen-image-edit-r{rank}.png")
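If you downloaded one of the few-step checkpoints instead, lower the step count (and typically the CFG scale) to match. A rough sketch, reusing the same inputs dict as above — the exact values are assumptions, so verify them against the checkpoint's model card:

# Hypothetical adjustment for a 4-step checkpoint (values are assumptions)
inputs["num_inference_steps"] = 4
inputs["true_cfg_scale"] = 1.0  # distilled few-step models usually run with little or no CFG
output = pipeline(**inputs)
output.images[0].save("qwen-image-edit-4step.png")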

And that’s it — you’re editing images locally with a quantized diffusion model, without ever opening ComfyUI.

Frequently Asked Questions


What hardware did the author use to run image generation models locally?

The author used a laptop with an Intel Ultra 9 CPU, 32GB RAM, and an NVIDIA RTX 4080 GPU with 12GB VRAM.

Why is quantization used for diffusion models?

Quantization compresses model weights from full 16/32-bit precision down to lower bit widths (e.g., 4-bit, 5-bit), which saves VRAM and RAM and enables faster runs on consumer GPUs while trading off slight fidelity loss.

What is GGUF and how is it used?

GGUF is a general quantization format popularized for LLMs and adopted for vision models; quantized models on Hugging Face are often distributed in GGUF format so they can be loaded in a compressed form for local inference.

What limitations of Hugging Face Diffusers motivated using Nunchaku?

Diffusers added GGUF support but pipelines like StableDiffusionPipeline don’t support quantized checkpoints yet, requiring manual stitching of encoder/decoder/UNet components, handling VAE/text encoder/scheduler, and writing the sampling loop, which is cumbersome for quick local inference.

What does Nunchaku provide compared to using Diffusers directly?

Nunchaku is designed for running quantized diffusion models locally with optimized CUDA kernels and provides a more integrated, convenient pipeline for quantized weights, filling the gap between GGUF model files and runnable local inference.

Which quantized model variants and parameters should be considered when downloading Qwen-Image-Edit from Hugging Face?

Consider model rank (r128 for higher output quality but slower than r32), GPU family (download int4 variants for non-Blackwell GPUs and fp4 variants for Blackwell GPUs), and step-count variants (4-step models are faster, 8-step models give smoother edits).

What important warning is given about downloading quantized models?

Do not download a quantized model developed by another party that may be incompatible with Nunchaku; use the quantized model provided by Nunchaku for compatibility.

What major inference parameters should match the downloaded quantized model when running it?

The number of inference steps must match the step count of the downloaded model (e.g., 4-step or 8-step), and the CFG scale (true_cfg_scale) controls prompt alignment and can affect generation speed.

How does the provided example handle low VRAM versus higher VRAM GPUs?

If available GPU memory is greater than 18GB, the example enables model CPU offload; otherwise it uses per-layer offloading via transformer.set_offload with num_blocks_on_gpu=1, appends the transformer to the CPU offload exclusion list, and enables sequential CPU offload to run on low VRAM (requiring only 3–4GB VRAM).

What steps are recommended to install Nunchaku correctly?

Follow the official Nunchaku installation guide and ensure the installation matches the system’s PyTorch version, Python version, platform (Windows/Linux), and CUDA version.

What high-level Python components are used in the example to run Qwen-Image-Edit with Nunchaku?

The example imports the Nunchaku transformer class (NunchakuQwenImageTransformer2DModel), the QwenImageEditPipeline from diffusers, the load_image helper from diffusers.utils, and Nunchaku's get_gpu_memory and get_precision utilities, which are used to configure offloading and run the pipeline.

What practical benefit did the author seek by using Nunchaku instead of ComfyUI?

The author wanted only the generation pipeline integrated directly into an application for local inference and found Nunchaku provided a minimal, direct way to run quantized diffusion models locally without using ComfyUI’s broader tinkering environment.