Local AI Image Generation: Run Stable Diffusion & FLUX Locally
The model landscape in 2026
Hugging Face hosts over 90,000 text-to-image models, but the practical shortlist is small [1]. FLUX.2, released in November 2025 by Black Forest Labs in four variants ([pro], [flex], [dev], and [klein]), is the current quality leader for local users. FLUX.2 [dev] is a 32-billion-parameter rectified-flow transformer that handles generation, editing, and combining images from text instructions, and accepts up to 10 reference images in a single run while preserving character identity, product appearance, and style [2]. FLUX.2 [klein] ships as a distilled 9B and 4B family; the 4B variant runs on consumer GPUs with around 13 GB VRAM at sub-second end-to-end inference. FLUX.2 [pro] and [flex] remain API-only, not open weights.
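A rough weights-only estimate shows why a model of this size is hard to fit on consumer cards. The sketch below is back-of-envelope arithmetic, not a measurement: it multiplies parameter count by bits per weight and ignores the text encoder, VAE, activations, and framework overhead, so real VRAM use will be higher. The function name is made up for illustration.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate size of the model weights alone, in decimal gigabytes."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Parameter counts are the ones quoted in the text above.
models = [("FLUX.2 [dev]", 32), ("FLUX.2 [klein] 9B", 9), ("FLUX.2 [klein] 4B", 4)]

for name, params in models:
    bf16 = weight_memory_gb(params, 16)  # 2 bytes per weight
    q4 = weight_memory_gb(params, 4)     # half a byte per weight
    print(f"{name}: ~{bf16:.0f} GB at bf16, ~{q4:.1f} GB at 4-bit")
```

At bf16 the 32B transformer's weights alone come to about 64 GB, well beyond a 24 GB card, while a 4-bit version of the same weights is about 16 GB.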
SDXL holds the middle ground. Its native resolution is 1024×1024, it needs 10–12 GB VRAM, and it offers far better composition and prompt following than SD 1.5, with an optional refiner for extra detail [3]. The community LoRA library around SDXL is the deepest of any base model, which keeps it the sweet spot for most creators in 2026 [4]. Stable Diffusion 3 Medium sits alongside SDXL with claimed gains in typography and complex-prompt understanding, though community testing showed weaker human-figure generation than SDXL on some prompts [5]. In a controlled three-attempt comparison, average per-image times were SD 2 ~12.02 s, SDXL ~21.21 s, and SD 3 ~16.86 s for a text-rendering pixel-art prompt, with SD 3 producing the most readable in-image typography.
Stable Diffusion 1.5 remains the legacy floor. Its native resolution is 512×512, it runs on 6–8 GB GPUs, and it carries a massive ecosystem of fine-tunes, LoRAs, and embeddings, so it is still the right choice for anime and niche styles served by community models. The honest summary for 2026: FLUX is the quality king but needs 12 GB+ VRAM, SDXL is the sweet spot, and SD 1.5 runs on anything.
Hardware: VRAM is the only number that matters
VRAM determines which models you can load before any speed discussion begins. The rules of thumb are clean: SD 1.5 needs 4 GB minimum and 8 GB comfortable; SDXL needs 8 GB minimum and 12–16 GB comfortable; FLUX needs 12 GB minimum and 16 GB+ comfortable; LoRA training needs 12 GB minimum and 24 GB ideal [4].
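The rules of thumb above can be written down as a small lookup, which is handy when comparing cards. This is a minimal sketch: the tier numbers come from the paragraph above (using 12 GB as the lower edge of SDXL's "comfortable" range), while the function name and the three labels are hypothetical choices made for illustration.

```python
# (minimum GB, comfortable GB) per workload, from the rules of thumb above.
VRAM_TIERS = {
    "SD 1.5": (4, 8),
    "SDXL": (8, 12),
    "FLUX": (12, 16),
    "LoRA training": (12, 24),
}


def classify(vram_gb: float) -> dict[str, str]:
    """Label each workload as comfortable, tight, or not recommended."""
    result = {}
    for workload, (minimum, comfortable) in VRAM_TIERS.items():
        if vram_gb >= comfortable:
            result[workload] = "comfortable"
        elif vram_gb >= minimum:
            result[workload] = "tight"
        else:
            result[workload] = "not recommended"
    return result


for card_gb in (8, 12, 16, 24):
    print(card_gb, "GB:", classify(card_gb))
```

Treat the output as a first filter only; resolution, batch size, and quantisation all shift the real limits.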
Benchmark times on common cards make the tiers concrete. An RTX 4060 8GB generates SD 1.5 at 512×512 in roughly 8 seconds per image and SDXL at 1024×1024 in about 25 seconds, but cannot run FLUX [3]. An RTX 4060 Ti 16GB drops those to ~5 s for SD 1.5, ~15 s for SDXL, and ~30 s for FLUX. The RTX 4090 24GB reaches ~2 s SD 1.5, ~6 s SDXL, and ~12 s FLUX, and on a single 4090 you can produce thousands of images per day with no API costs or limits. By image-per-minute throughput, 8 GB cards manage 3–5 images/min on SD 1.5, 12 GB cards reach 5–8/min on SDXL, 16 GB cards hit 8–12/min, and 24 GB cards top out at 15–25/min with LoRA training viable.
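The "thousands of images per day" figure follows directly from the per-image times. The sketch below does that arithmetic for the RTX 4090 numbers quoted above; the duty_cycle parameter is an assumption added here to model a card that is not generating around the clock.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def images_per_day(seconds_per_image: float, duty_cycle: float = 1.0) -> int:
    """Images produced in 24 hours if the GPU generates for duty_cycle of the time."""
    return int(SECONDS_PER_DAY * duty_cycle / seconds_per_image)


# Approximate per-image times for an RTX 4090, as quoted in the text.
rtx_4090_seconds = {"SD 1.5": 2, "SDXL": 6, "FLUX": 12}

for model, seconds in rtx_4090_seconds.items():
    print(f"{model}: {images_per_day(seconds)} images/day at full utilisation")
```

Even FLUX at roughly 12 s per image works out to 7,200 images in 24 hours of continuous generation, ignoring model load time and any thermal throttling.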
"FLUX can match Midjourney quality locally."
AMD is workable but not equal. AMD GPUs run via DirectML or ROCm and you should expect 20–40% slower speeds with occasional compatibility issues compared with NVIDIA CUDA, though the AMD 7900 XTX is popular thanks to its 24 GB VRAM. Newer runtimes are closing the gap: Modular's MAX runtime exceeds torch.compile performance on FLUX.2-dev by 1.25× on AMD MI355X, landing within 4% of an NVIDIA B200 on time-to-generation [6].

The runtime (the front-end that loads models, accepts prompts, and runs the diffusion loop) shapes your daily experience more than the model does. Four open-source options dominate.
ComfyUI is a node-based workflow editor and the de facto standard for power users [3]. It is the most powerful and flexible front-end, preferred by professionals and required for advanced techniques, at the cost of a steeper learning curve than the alternatives. Automatic1111 WebUI, usually called A1111, is a traditional Gradio-based web interface that is easier than ComfyUI and carries a strong extension ecosystem, but it is less flexible for complex workflows. It is among the most popular open-source local interfaces and lets users switch between text-to-image, image-to-image, and outpainting/inpainting without coding; its biggest advantage over ComfyUI is simplicity [1].
Fooocus is a simplified, Midjourney-like experience with minimal settings: good for beginners, but with limited customisation. InvokeAI sits in between, offering a balance of power and usability with a unified canvas for inpainting, well suited to intermediate users. For headless or server workloads, LocalAI runs Stable Diffusion on CPU via C++ and Python implementations and exposes an OpenAI-compatible /v1/images/generations endpoint [7]. Its stablediffusion-ggml backend is built on stable-diffusion.cpp, flux.1-dev-ggml is available via the model gallery, and the Diffusers backend can pull models like Linaqruf/animagine-xl from Hugging Face on first use, with parameters such as clip_skip, scheduler_type, cfg_scale, and CUDA/fp16 toggles. Negative prompts in LocalAI are passed by splitting the prompt with |, e.g. "a cute baby sea otter|malformed".
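As an illustration of the LocalAI endpoint, here is a minimal client sketch using only the Python standard library. It assumes a LocalAI server is already running with an image model configured; the base URL, the helper names, and the default size are assumptions for illustration, not values taken from the LocalAI documentation cited above. The payload uses the OpenAI-style "prompt" and "size" fields.

```python
import json
import urllib.request


def build_image_request(prompt: str, negative: str = "", size: str = "512x512") -> dict:
    """Build an OpenAI-style images payload, appending the negative prompt after '|'."""
    full_prompt = f"{prompt}|{negative}" if negative else prompt
    return {"prompt": full_prompt, "size": size}


def generate(base_url: str, payload: dict, timeout: float = 300.0) -> dict:
    """POST the payload to /v1/images/generations and return the decoded JSON reply."""
    request = urllib.request.Request(
        f"{base_url.rstrip('/')}/v1/images/generations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


payload = build_image_request("a cute baby sea otter", negative="malformed")
print(json.dumps(payload))

# Uncomment once a server is reachable; the URL below is an assumed local address.
# print(generate("http://localhost:8080", payload))
```

The long default timeout is deliberate: CPU-only generation can take minutes per image, so a short client timeout would fail before the server responds.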
Control techniques: ControlNet, LoRA, inpainting, upscaling
Base-model output gets you started; control techniques get you to production. ControlNet guides image generation with reference images for pose, depth, edges, and more, and is essential for consistent characters and scenes [3]. LoRA fine-tunes are small additive models that modify style or add subjects. Thousands are available on CivitAI, and you can train your own on 12 GB+ GPUs.
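To see why LoRA files are small, it helps to count parameters. In the standard low-rank adaptation formulation, a weight matrix of shape d_out × d_in is left frozen and the adapter stores two thin matrices of rank r whose product is added to it, so the adapter holds r × (d_out + d_in) values instead of d_out × d_in. The sketch below works one example; the layer size and rank are hypothetical values chosen for illustration, not taken from any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters in a dense weight matrix of shape d_out x d_in."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in a rank-r adapter: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 1280, 1280, 16

dense = full_params(d_out, d_in)
adapter = lora_params(d_out, d_in, rank)
print(f"dense layer: {dense:,} parameters")
print(f"rank-{rank} adapter: {adapter:,} parameters ({adapter / dense:.1%} of the layer)")
```

Because the adapter is added on top of frozen base weights, several LoRAs can be swapped or combined without touching the base checkpoint.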
Inpainting and outpainting edit specific parts of an image or extend it beyond the original boundaries, and upscaling models like 4x-UltraSharp can take output from 1024 to 4K+ with detail intact. For sampling, the recommended baseline is DPM++ 2M Karras with 20–30 steps for drafts and 40–50 for finals, at CFG 7–8 for balance; lower CFG gives more creative freedom. FLUX.2 [dev] follows a different recipe: the diffusers reference workflow uses a 4-bit quantised checkpoint (diffusers/FLUX.2-dev-bnb-4bit) with a remote text encoder, bf16 VAE, 50 inference steps (28 is a good trade-off), and guidance scale 4 on an RTX 4090 or RTX 5090 [2].
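For SD 1.5 or SDXL, the baseline sampler settings could be wired up in the diffusers library roughly as follows. This is a sketch under stated assumptions: the step counts and CFG value are midpoints of the ranges given above, the helper names are made up, and the SDXL model ID is an assumed default. In diffusers, DPM++ 2M Karras corresponds to DPMSolverMultistepScheduler with use_karras_sigmas=True. The heavy imports sit inside the loader so the presets can be used without a GPU.

```python
def sampling_preset(stage: str) -> dict:
    """Return pipeline keyword arguments for a 'draft' or 'final' render."""
    presets = {
        "draft": {"num_inference_steps": 25, "guidance_scale": 7.5},  # 20-30 steps
        "final": {"num_inference_steps": 45, "guidance_scale": 7.5},  # 40-50 steps
    }
    if stage not in presets:
        raise ValueError(f"unknown stage: {stage!r}")
    return dict(presets[stage])


def load_sdxl_dpmpp_2m_karras(model_id: str = "stabilityai/stable-diffusion-xl-base-1.0"):
    """Load SDXL in fp16 and swap its scheduler for DPM++ 2M with Karras sigmas."""
    import torch
    from diffusers import DPMSolverMultistepScheduler, StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
    pipe.scheduler = DPMSolverMultistepScheduler.from_config(
        pipe.scheduler.config, use_karras_sigmas=True
    )
    return pipe.to("cuda")


print(sampling_preset("draft"))

# Example use on a CUDA machine with diffusers and torch installed:
# pipe = load_sdxl_dpmpp_2m_karras()
# image = pipe(prompt="a lighthouse at dusk", **sampling_preset("final")).images[0]
# image.save("final.png")
```

Keeping the presets separate from the pipeline makes it easy to iterate at draft settings and re-render only the keepers at final quality.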
Use cases and licensing reality
Local generation suits concept art, product mockups, illustration, photo editing, and open-ended creative exploration. The economics are unusual for AI: once the GPU is paid for, marginal cost per image is electricity. On an RTX 4090, that is thousands of images per day with no API costs or rate limits [3]. SDXL with strong community models and LoRAs produces excellent results, and FLUX matches hosted services on prompt understanding and text rendering.
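The electricity-only claim can be made concrete with a quick estimate. All three inputs below are assumptions for illustration: a 450 W full-load draw for the card, the roughly 6 s per SDXL image quoted earlier for an RTX 4090, and a hypothetical tariff of 0.15 per kWh in your local currency. Idle power and the rest of the system are ignored.

```python
JOULES_PER_KWH = 3_600_000


def cost_per_image(watts: float, seconds_per_image: float, price_per_kwh: float) -> float:
    """Electricity cost of one image, given GPU draw, render time, and tariff."""
    kwh = watts * seconds_per_image / JOULES_PER_KWH
    return kwh * price_per_kwh


# Assumed inputs: 450 W draw, 6 s per image, 0.15 per kWh.
per_image = cost_per_image(watts=450, seconds_per_image=6, price_per_kwh=0.15)
print(f"per image: {per_image:.6f}")
print(f"per 1,000 images: {per_image * 1000:.2f}")
```

Under these assumptions a thousand images cost on the order of a tenth of one currency unit, which is why the GPU purchase, not the running cost, dominates the budget.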
Licensing deserves a careful read before commercial work. FLUX.2 [dev] outputs may be used for personal, scientific, and commercial purposes only as described in the FLUX [dev] Non-Commercial License, and commercial use of the model requires separate licensing through Black Forest Labs [1][2]. SDXL and SD 1.5 carry their own licenses through Stability AI; community LoRAs on CivitAI vary individually. Read each model card before shipping.
The trade-off is straightforward: cloud services give you instant scale and zero setup; local generation gives you ownership, privacy, and unlimited iteration once you commit to a GPU and a runtime. For most serious creators in 2026, start with SDXL on ComfyUI on a 12–16 GB card, add ControlNet and a few LoRAs, and graduate to FLUX.2 [dev] when your workflow demands it.
Sources
[1] The Best Open-Source Image Generation Models in 2026
[2] A 32B open-weight model
[3] AI Image Generation Guide | SD, SDXL, Flux | 2025
[4] Best GPU for Stable Diffusion & Local Image Generation (2026) | OwnYourAI
[5] See our blog post
[6] ~4× faster image generation than torch.compile while maintaining image quality
[7] Image Generation :: LocalAI
