How to Run a Diffusion Model Locally (without ComfyUI) — Using Qwen-Image-Edit with Nunchaku

Background
I’ve been experimenting with image generation models and wanted to run them locally on my own hardware. My laptop setup is:
- Intel Core Ultra 9, 32GB RAM
- NVIDIA RTX 4080, 12GB VRAM
The key challenge? Most tutorials online point to ComfyUI, which is great for tinkering, but I only needed a generation pipeline I could integrate directly into my application. That’s when I started exploring quantized diffusion models and tools like Nunchaku.
In this post, I’ll walk through a bit of background on quantized models, explain why Nunchaku is useful compared to the current Hugging Face API, and then show the exact steps I used to run Qwen-Image-Edit locally on my laptop.
Quantized Models
Diffusion models are notoriously heavy — often tens of gigabytes in full precision. Running a full model locally can be a pain if you don’t have enterprise GPUs.
That’s where quantization comes in. Instead of storing every weight at full 16/32-bit precision, quantized models compress them down (e.g. 4-bit, 5-bit) while keeping most of the performance intact.
- ✅ Saves VRAM and RAM
- ✅ Runs faster on consumer GPUs
- ⚠️ Slight trade-off in fidelity
On Hugging Face, you’ll often see these models in GGUF format (a general quantization format popularized by llama.cpp for LLMs, but now adopted for vision models too).
So instead of running a full 20GB model, you can load a quantized GGUF model and still get high-quality image outputs.
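To get a feel for the savings, here’s a rough back-of-the-envelope calculation. The 10B parameter count is a made-up illustration rather than the size of any particular checkpoint, and real memory use also includes activations and the other pipeline components:

```python
def approx_size_gb(num_params: float, bits_per_weight: float) -> float:
    """Rough checkpoint size: parameters x bits per weight, converted to gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9

params = 10e9  # hypothetical 10B-parameter model, purely for illustration

print(f"16-bit: ~{approx_size_gb(params, 16):.0f} GB")  # ~20 GB
print(f" 8-bit: ~{approx_size_gb(params, 8):.0f} GB")   # ~10 GB
print(f" 4-bit: ~{approx_size_gb(params, 4):.0f} GB")   # ~5 GB
```

Same weights, a quarter of the memory at 4-bit, which is the difference between “doesn’t fit” and “fits comfortably” on a 12GB consumer GPU.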
Why Nunchaku?
Hugging Face’s Diffusers library has recently added GGUF support, but there are still limitations:
- Pipelines like `StableDiffusionPipeline` don’t support quantized checkpoints yet.
- You often have to manually stitch together encoders, decoders, and the UNet.
While Diffusers technically lets you run GGUF models by manually loading components with `from_single_file`, you’d also need to handle the VAE, text encoder, and scheduler yourself, and write the sampling loop. This works, but in practice it’s cumbersome if all you want is quick local inference.
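For context, the “manual” Diffusers route looks roughly like this. This is a sketch based on the Diffusers GGUF loading API (`GGUFQuantizationConfig` plus `from_single_file`), using a FLUX GGUF checkpoint as the example; the checkpoint URL is illustrative, and the point is that you only get the transformer back and must assemble the rest of the pipeline around it:

```python
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel, GGUFQuantizationConfig

# Load only the quantized transformer from a GGUF file
# (checkpoint URL is illustrative; point it at whatever GGUF you downloaded).
transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/city96/FLUX.1-dev-gguf/blob/main/flux1-dev-Q2_K.gguf",
    quantization_config=GGUFQuantizationConfig(compute_dtype=torch.bfloat16),
    torch_dtype=torch.bfloat16,
)

# The VAE, text encoders, tokenizers, and scheduler still come from the base repo;
# you wire the pipeline together around the quantized transformer yourself.
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)
pipe.enable_model_cpu_offload()

image = pipe("a neon sign that says hello", num_inference_steps=28).images[0]
image.save("flux-gguf-test.png")
```

Workable, but every model family needs its own wiring, which is exactly the friction Nunchaku removes.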
This is where Nunchaku comes in. It’s designed for running quantized diffusion models locally with optimized CUDA kernels. Think of it as the missing link: it fills the gap between “nice quantized weights on Hugging Face” and actually running them in a local pipeline.
Implementation: Running Qwen-Image-Edit Locally
Here’s a minimal setup I used to get Qwen-Image-Edit running with Nunchaku.
1. Install Nunchaku
Follow the official installation guide: Nunchaku Installation. Make sure to match your PyTorch version, Python version, and platform (Windows/Linux, CUDA version).
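The prebuilt wheels are tied to a specific Python/PyTorch/CUDA combination, so it helps to print exactly what your environment has before picking one. A quick check, assuming PyTorch is already installed:

```python
import platform
import sys

import torch

# Print the details you need to match against the Nunchaku release table.
print("OS:", platform.system(), platform.machine())
print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```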
2. Download the Quantized Model
Go to Hugging Face and download the **quantized model** provided by Nunchaku for Qwen-Image-Edit.
When choosing the model, there are a few parameters to consider. The r128 model provides higher output quality than r32, though it runs slower. If you have a non-Blackwell GPU (anything before the RTX 50 series), download the int4 variant; if you have a Blackwell GPU (RTX 50 series), the fp4 variant is the better fit. Finally, the 4-step models generate results faster, whereas the 8-step models provide smoother, more refined edits.
⚠ Important: Be careful not to download a quantized model that was developed by another party — it may be incompatible with Nunchaku.
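If you’d rather script the download than click through the Hub UI, `huggingface_hub` can fetch a single file. The repo and filename below match the int4 r32 variant used in the code later in this post; swap in whichever variant you chose:

```python
from huggingface_hub import hf_hub_download

# Fetch one quantized checkpoint from the official Nunchaku repo on Hugging Face.
model_path = hf_hub_download(
    repo_id="nunchaku-tech/nunchaku-qwen-image-edit",
    filename="svdq-int4_r32-qwen-image-edit.safetensors",
)
print("Downloaded to:", model_path)
```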
3. Run the Model in Python
Note: The following code snippet is adapted from the Nunchaku documentation.
The two major parameters to consider when running inference are the number of steps and the CFG scale. Make sure the inference steps match the step count of the model you downloaded (e.g., 4-step or 8-step) for optimal performance. Increasing the CFG scale can improve alignment with the text prompt but may also slow down the generation process.
```python
import torch
from diffusers import QwenImageEditPipeline
from diffusers.utils import load_image
from nunchaku import NunchakuQwenImageTransformer2DModel
from nunchaku.utils import get_gpu_memory, get_precision

# Replace with the file path of the model you downloaded.
# get_precision() returns "int4" or "fp4" based on your GPU, if you
# want to build the filename dynamically instead of hardcoding it.
transformer = NunchakuQwenImageTransformer2DModel.from_pretrained(
    "nunchaku-tech/nunchaku-qwen-image-edit/svdq-int4_r32-qwen-image-edit.safetensors"
)
pipeline = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", transformer=transformer, torch_dtype=torch.bfloat16
)

if get_gpu_memory() > 18:
    pipeline.enable_model_cpu_offload()
else:
    # Use per-layer offloading for low VRAM. This only requires 3-4GB of VRAM.
    transformer.set_offload(
        True, use_pin_memory=False, num_blocks_on_gpu=1
    )  # increase num_blocks_on_gpu if you have more VRAM
    pipeline._exclude_from_cpu_offload.append("transformer")
    pipeline.enable_sequential_cpu_offload()

image = load_image(
    "https://huggingface.co/datasets/nunchaku-tech/test-data/resolve/main/inputs/neon_sign.png"
)
image = image.convert("RGB")

prompt = "change the text to read '双截棍 Qwen Image Edit is here'"
inputs = {
    "image": image,
    "prompt": prompt,
    "true_cfg_scale": 4.0,
    "negative_prompt": " ",
    "num_inference_steps": 10,  # match the number of steps with your downloaded model
}

output = pipeline(**inputs)
output_image = output.images[0]
output_image.save("qwen-image-edit-r32.png")
```
And that’s it — you’re editing images locally with a quantized diffusion model, without ever opening ComfyUI.