Diffusers integrates FLUX.2: new architecture for image generation
FLUX.2 arrives as the next generation of image models from Black Forest Labs, and Diffusers integrates it with support for inference and fine-tuning. What changes compared to Flux.1 and why should it matter if you work with multimodal generation? I'll walk you through it step by step, with practical examples you can run on real hardware.
What is FLUX.2
FLUX.2 is a new family of image-generation models trained from scratch with a different architecture than Flux.1. It's multimodal: it works as text-to-image and image-to-image, and it also accepts up to 10 reference images per prompt, letting you combine references by index or by natural language.
In practical terms, this means you can ask the model to blend two different photos and describe which elements to take from each. Useful, right? But watch out: every extra image increases VRAM usage.
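If you want to try the multi-reference mode, here's a minimal sketch. It assumes, based on how Diffusers editing pipelines usually work, that Flux2Pipeline accepts a list of reference images through an image argument; the file names are placeholders.

import torch
from diffusers import Flux2Pipeline
from diffusers.utils import load_image

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Two hypothetical reference images; the prompt refers to them in natural language.
refs = [load_image("reference_1.png"), load_image("reference_2.png")]
image = pipe(
    prompt="Combine the jacket from the first image with the background of the second image",
    image=refs,  # assumption: reference images are passed through the image argument
    num_inference_steps=50,
    guidance_scale=2.5,
).images[0]
image.save("combined.png")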
Key architectural changes
FLUX.2 keeps the general idea of a multimodal diffusion transformer (MM-DiT) and DiT blocks in parallel, but introduces several optimizations and important design choices.
It uses a single text encoder: Mistral Small 3.1. That simplifies embedding computation and allows a max_sequence_length of 512.
The MM-DiT blocks keep the initial separation between image latents and text (double-stream) and then process them together in single-stream blocks. This follows Flux.1's philosophy but with different balances between both block types.
Time and guidance information (such as the AdaLayerNorm-Zero modulation parameters) is shared across all double-stream and single-stream blocks instead of being computed per block. Fewer redundant parameters and more coherent diffusion dynamics.
There are no bias parameters in the model layers: neither in attention nor in feedforward blocks. That's a design decision that affects numerical stability and checkpoint size.
Single-stream blocks fuse projections: in addition to merging the attention output projection with the feedforward output projection, they now fuse the attention QKV projections with the feedforward input projection, creating a more parallel transformer block.
FLUX.2 uses SwiGLU activation in the MLPs instead of GELU (see the sketch below).
Much more weight is placed on single-stream blocks than on double-stream: for example, a comparative configuration shows 8 double-stream blocks vs 48 single-stream in FLUX.2, and the parameter proportion in double-stream drops from ~54% (Flux.1) to ~24% (FLUX.2[dev]-32B).
There's a new autoencoder and better handling of timestep schedules that depend on resolution.
In short: same multimodal principles, but redesigned for parallelism, less redundancy, and better compute/memory trade-offs.
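To make the no-bias and SwiGLU points concrete, here's a minimal PyTorch sketch of a bias-free SwiGLU feedforward block. It illustrates the idea only; it is not FLUX.2's actual implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    # out = W_down(silu(W_gate x) * W_up x), with no bias terms anywhere
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))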
Practical inference with Diffusers
The combined model (the large DiT plus the Mistral Small 3.1 text encoder) can exceed 80 GB of VRAM if you load it without optimizations. Luckily, Diffusers and several engineering techniques let you run FLUX.2 on more modest hardware.
Basic CPU offloading example
This flow was tested on an H100. Enabling offloading is practically mandatory for the model to fit:
from diffusers import Flux2Pipeline
import torch
repo_id = "black-forest-labs/FLUX.2-dev"
pipe = Flux2Pipeline.from_pretrained(repo_id, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
image = pipe(
    prompt="dog dancing near the sun",
    num_inference_steps=50,
    guidance_scale=2.5,
    height=1024,
    width=1024,
).images[0]
With CPU offload on, that setup takes around 62 GB on an H100. Don't have an H100? Keep reading.
Flash Attention 3 for Hopper GPUs
If you have a Hopper GPU (for example H100 or equivalent) you can accelerate attention with Flash Attention 3:
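A minimal sketch, assuming the attention-dispatcher API available in recent Diffusers releases; the backend identifier is an assumption and may differ between versions, so check the attention backends documentation for your install.

import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

# Route the transformer's attention through Flash Attention 3.
# Requires FA3 to be installed and a Hopper-class GPU; the backend name may vary
# with your Diffusers version.
pipe.transformer.set_attention_backend("_flash_3")

image = pipe("dog dancing near the sun", num_inference_steps=50, guidance_scale=2.5).images[0]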
Diffusers exposes several attention backends; look for the one that best uses your hardware.
Load models in 4-bit with bitsandbytes (NF4)
With bitsandbytes you can load the transformer and text encoder in 4-bit to run FLUX.2 on ~24 GB GPUs or even 20 GB with optimizations.
from diffusers import Flux2Pipeline, Flux2Transformer2DModel
from transformers import Mistral3ForConditionalGeneration
import torch
repo_id = "diffusers/FLUX.2-dev-bnb-4bit"
# Load transformer and text encoder in 4-bit and use CPU offload
# (subfolder names assume the standard Diffusers layout of the pre-quantized checkpoint)
transformer = Flux2Transformer2DModel.from_pretrained(repo_id, subfolder="transformer", torch_dtype=torch.bfloat16)
text_encoder = Mistral3ForConditionalGeneration.from_pretrained(repo_id, subfolder="text_encoder", torch_dtype=torch.bfloat16)
pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", transformer=transformer, text_encoder=text_encoder, torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()
The blog example shows how to generate realistic images with a long prompt and save the result.
Local + remote: outsource the text encoder
Because the design is modular, you can deploy the text encoder to a remote endpoint and keep the DiT and VAE locally. This greatly reduces local VRAM usage.
The idea: the local pipeline receives prompt_embeds computed by a remote service. Ideal if you have a single GPU with 18–20 GB of VRAM.
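Here's a hedged sketch of that idea. The endpoint URL and its response format are hypothetical, and it assumes the pipeline accepts precomputed prompt_embeds like other Diffusers pipelines; check the Flux2Pipeline signature for the exact embedding inputs it expects.

import io
import requests
import torch
from diffusers import Flux2Pipeline

REMOTE_TE_URL = "https://your-endpoint.example/embed"  # placeholder for your deployment

def get_prompt_embeds(prompt: str) -> torch.Tensor:
    # Assumption: the remote service runs Mistral Small 3.1 and returns a torch-saved tensor.
    resp = requests.post(REMOTE_TE_URL, json={"prompt": prompt}, timeout=60)
    resp.raise_for_status()
    return torch.load(io.BytesIO(resp.content), map_location="cpu")

# Load only the DiT + VAE locally; skip the text encoder entirely.
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev", text_encoder=None, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # combine with 4-bit quantization (previous section) to fit ~18-20 GB

prompt_embeds = get_prompt_embeds("dog dancing near the sun").to("cuda", torch.bfloat16)
image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=50, guidance_scale=2.5).images[0]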
Another option is group offloading, which lets you run on GPUs with 8 GB of free VRAM at the cost of system RAM (around 32 GB of RAM is recommended; low_cpu_mem_usage=True drops that to roughly 10 GB of RAM, but runs slower).
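A minimal sketch of group offloading, assuming the enable_group_offload API present in recent Diffusers releases (the exact parameters may vary with your version):

import torch
from diffusers import Flux2Pipeline

pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)

# Stream the transformer's weights between CPU and GPU in groups instead of keeping them all resident.
# low_cpu_mem_usage=True trades speed for lower system-RAM pressure, as noted above.
pipe.transformer.enable_group_offload(
    onload_device=torch.device("cuda"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True,
    low_cpu_mem_usage=True,
)
# The text encoder and VAE still need to fit too; apply the same call to them or offload them separately.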
Quantizations and backends
NF4 (via bitsandbytes) is recommended for 4-bit inference.
Diffusers documents several quantization backends and techniques to test how they affect quality; they provide a Space to experiment interactively.
If you want to compare visual results, try the quantization playground that accompanies the release.
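If you prefer to quantize on the fly instead of downloading the pre-quantized checkpoint, here's a sketch using Diffusers' BitsAndBytesConfig; the subfolder name assumes the standard repo layout.

import torch
from diffusers import Flux2Pipeline, Flux2Transformer2DModel, BitsAndBytesConfig

# 4-bit NF4 quantization applied while loading the transformer.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = Flux2Transformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.2-dev",
    subfolder="transformer",  # assumption: standard Diffusers repo layout
    quantization_config=quant_config,
    torch_dtype=torch.bfloat16,
)
# The text encoder can be quantized the same way with transformers' own BitsAndBytesConfig.
pipe = Flux2Pipeline.from_pretrained(
    "black-forest-labs/FLUX.2-dev", transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()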
Fine-tuning LoRA: how to do it on limited GPUs
FLUX.2 is excellent for LoRA fine-tuning (text-to-image and image-to-image). But training requires the same memory care as inference.
Techniques used in the post that you can combine:
Remote text encoder during training: pass --remote_text_encoder to avoid holding the text encoder in local VRAM.
CPU offloading: --offload to move VAE and text encoder to CPU and only load them when needed.
Latent caching: pre-encode images with the VAE and remove the VAE during active training with --cache_latents.
QLoRA / NF4 with bitsandbytes: define a config.json with load_in_4bit and bnb_4bit_quant_type: nf4 and pass it with --bnb_quantization_config_path.
FP8 training with accelerators that support FP8 (requires compute capability >= 8.9) to use FP8 cores and reduce memory.
Gradient checkpointing and gradient accumulation: reduce memory at the cost of time. Use --gradient_checkpointing and adjust --gradient_accumulation_steps.
8-bit optimizers: --use_8bit_adam with bitsandbytes to reduce optimizer memory.
Alternative tools: SimpleTuner or Ostris AI Toolkit if your GPU is older or you want lighter trainers.
If your hardware doesn't support FP8, use QLoRA with bitsandbytes and the JSON config for 4-bit NF4.
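As a reference for that JSON file, here's a small Python snippet that writes a 4-bit NF4 config of the kind --bnb_quantization_config_path expects; the key names follow bitsandbytes/transformers' BitsAndBytesConfig fields, so treat them as an assumption and check the training script's docs.

import json

# Hypothetical 4-bit NF4 quantization config for QLoRA training.
bnb_config = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_use_double_quant": True,
}
with open("bnb_nf4_config.json", "w") as f:
    json.dump(bnb_config, f, indent=2)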
Best practices and quick tips
Start with moderate prompts and resolutions (512 or 768) to iterate fast and measure consumption.
Test num_inference_steps between 28 and 50; 28 is often a good quality/speed trade-off.
In production, outsourcing the text encoder is an effective way to scale and reduce GPU instance costs.
Combine quantization + offloading + group offload according to your GPU/CPU memory; some techniques are mutually exclusive, so read the docs before mixing everything.
Final thoughts
FLUX.2 is an interesting leap: it keeps Flux.1's multimodal idea, but rebalances the architecture toward more parallel blocks and less redundancy. For the community it means more creative power and more deployment options thanks to offloading and quantization techniques. Got 24 GB of VRAM or less? It's not an absolute barrier: with bitsandbytes, remote encoders and group offload you can experiment and fine-tune models with accessible resources.
If you're interested, we can review a concrete pipeline for your hardware and use case: commercial imaging, generative art or a specific dataset. Which one do you want to optimize first?