Train a text-to-image model in 24h with PRX | Keryc
You're back just in time for the hands-on part. What happens if you gather all the ideas that actually work and combine them in a 24-hour training run on a tight budget? That's exactly what this PRX experiment does: it stacks architecture tricks, perceptual losses, and optimizations to get a useful text-to-image model in a day of compute.
The challenge: a 24-hour speedrun
A clear, realistic goal: train a competitive model in 24 hours using 32 H200s with an approximate budget of $1500 (about $2/hour per GPU). This isn't pure theoretical research: it's engineering to maximize performance under strong constraints. How far can you get by combining what already works? Much further than you'd think.
Architecture and formulation: pixel-space with x-prediction
Instead of training in latents and relying on a VAE, they use the x-prediction formulation (Back to Basics: Let Denoising Generative Models Denoise). That lets you train directly in pixels and reuse the classic computer vision toolbox.
Key decisions:
- Patch size 32 with an initial projection to a 256-dimensional bottleneck to control sequence length. This makes pixel-space training viable even at high resolutions.
- Training starts directly at 512px and then fine-tunes at 1024px (skipping the usual 256 -> 512 -> 1024 ladder), concentrating most of the training budget on the resolution that matters.
Practical advantage: predicting pixels directly means you can use perceptual losses exactly as they were designed, without the detour of decoding latents.
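The patchify-and-project step described above can be sketched in a few lines. This is a minimal numpy illustration under the stated settings (patch size 32, 256-dim bottleneck); the function name and the random linear projection are illustrative, not from the PRX code:

```python
import numpy as np

def patchify(img, patch=32):
    """Split an HxWx3 image into non-overlapping patch*patch*3 vectors."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, c) -> (gh*gw, patch*patch*c)
    patches = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch * patch * c)

rng = np.random.default_rng(0)
img = rng.standard_normal((512, 512, 3)).astype(np.float32)
tokens = patchify(img)                   # (256, 3072): a 16x16 grid of 32px patches
W_in = (rng.standard_normal((3072, 256)) * 0.02).astype(np.float32)
embedded = tokens @ W_in                 # (256, 256): short sequence, small width
```

At 512px this yields only 256 tokens, which is what keeps pixel-space attention affordable.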
Perceptual losses: LPIPS and DINOv2
They take inspiration from PixelGen and add two auxiliary losses:
- LPIPS for low-level perceptual similarity
- A DINOv2-based perceptual loss for a stronger semantic signal
Implementation details that made a difference:
- Apply them to pooled full-image features rather than patch-wise features
- Apply them at all noise levels during training
- Weights (empirical values that worked well): LPIPS 0.1, DINO perceptual 0.01

Both losses add little overhead compared to the transformer step, but they speed up convergence and improve final quality.
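Combining the terms with those weights looks roughly like this. The sketch below uses stand-in callables for LPIPS and the DINO feature extractor (the real ones are networks); only the weighting scheme reflects the post:

```python
import numpy as np

def total_loss(x_pred, x_target, lpips_fn, dino_feat_fn,
               w_lpips=0.1, w_dino=0.01):
    """Pixel MSE plus two weighted perceptual terms (weights from the post)."""
    mse = np.mean((x_pred - x_target) ** 2)
    lp = lpips_fn(x_pred, x_target)              # low-level perceptual distance
    # pooled (global) features rather than patch-wise, at every noise level
    f_p, f_t = dino_feat_fn(x_pred), dino_feat_fn(x_target)
    dino = np.mean((f_p - f_t) ** 2)
    return mse + w_lpips * lp + w_dino * dino

# stand-in callables for illustration only
fake_lpips = lambda a, b: float(np.mean(np.abs(a - b)))
fake_dino = lambda a: a.mean(axis=(0, 1))        # "pooled" global feature
x = np.zeros((64, 64, 3)); y = np.ones((64, 64, 3))
val = total_loss(x, y, fake_lpips, fake_dino)    # 1.0 + 0.1*1.0 + 0.01*1.0
```

Because the model predicts pixels directly, both perceptual terms can be applied to the prediction with no latent decoding step.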
Efficiency: routing with TREAD and guidance for routed tokens
To make each step cheaper they use token routing with TREAD, which randomly selects a fraction of tokens and lets them bypass a contiguous span of transformer blocks before being reinjected. Practical choices versus other options:
- TREAD, for its simplicity and a good balance between savings and complexity (example: sequence length 64 vs. 128 in their setting)
- Routing applied to 50% of tokens, from block 2 to the penultimate block
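The routing pattern can be sketched as follows (a toy numpy version with stand-in "blocks"; the selection and reinjection mechanics are simplified relative to the actual TREAD implementation):

```python
import numpy as np

def tread_forward(tokens, blocks, route_start, route_end, keep_ratio=0.5, seed=0):
    """Keep a random fraction of tokens inside blocks [route_start, route_end);
    the rest bypass that span and are reinjected afterwards."""
    rng = np.random.default_rng(seed)
    n = tokens.shape[0]
    keep = rng.permutation(n)[: int(n * keep_ratio)]  # tokens that stay in the route
    x = tokens.copy()
    for i, block in enumerate(blocks):
        if route_start <= i < route_end:
            x[keep] = block(x[keep])   # only kept tokens pay for these blocks
        else:
            x = block(x)               # dense computation outside the routed span
    return x

blocks = [lambda t: t + 1.0] * 6       # toy stand-ins for transformer blocks
# route 50% of tokens from block 2 up to the penultimate block, as in the post
out = tread_forward(np.zeros((8, 4)), blocks, route_start=2, route_end=5)
```

Kept tokens pass through all six toy blocks; routed-out tokens only see the three dense ones, which is where the compute saving comes from.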
Known issue: routed models can perform worse under conventional CFG when undertrained. Practical fix: self-guidance inspired by Guiding Token-Sparse Diffusion Models, which guides with a dense conditional prediction against the routed one, instead of an unconditional branch.
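One plausible reading of that guidance rule is a CFG-style extrapolation where the weak branch is the routed (sparse) prediction and the strong branch is a dense conditional prediction. This sketch is an assumption about the exact formula, not taken from the PRX code:

```python
import numpy as np

def self_guidance(pred_routed, pred_dense, w=2.0):
    """CFG-style extrapolation: replace the unconditional branch with the
    routed prediction and guide toward the dense conditional prediction."""
    return pred_routed + w * (pred_dense - pred_routed)

g = self_guidance(np.array([0.0, 0.0]), np.array([1.0, 1.0]), w=2.0)
```

With w=1 this reduces to just using the dense prediction; larger w pushes further in the dense direction, like a guidance scale.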
Representation alignment with REPA and DINOv3 teacher
They use REPA to align representations with a DINOv3 teacher (the best performer in prior experiments). Concretely:
- Alignment applied once, at block 8
- REPA loss weight = 0.5
Because they combine REPA with TREAD, the loss is computed only over non-routed tokens (those that pass through the blocks where the loss is applied). This avoids inconsistent signals comparing tokens that skipped the route.
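A minimal sketch of that masked alignment term, using negative cosine similarity between student hidden states and teacher features over the kept (non-routed) tokens only. The projection head that REPA normally uses to map student features to the teacher's dimension is omitted here for brevity:

```python
import numpy as np

def repa_loss(hidden, teacher, kept_idx, weight=0.5):
    """Alignment loss (1 - cosine similarity) between student hidden states
    and teacher features (e.g. DINOv3), computed only on non-routed tokens."""
    h = hidden[kept_idx]
    t = teacher[kept_idx]
    h = h / np.linalg.norm(h, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    cos = np.sum(h * t, axis=-1)
    return weight * float(np.mean(1.0 - cos))

rng = np.random.default_rng(0)
hidden = rng.standard_normal((16, 8))
kept = np.arange(8)                         # the half of tokens not routed out
loss_aligned = repa_loss(hidden, hidden, kept)   # ~0 for identical features
```

Masking to `kept` is the detail the post calls out: tokens that skipped the routed blocks never produced hidden states there, so they carry no valid alignment signal.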
Optimization: Muon + Adam (FSDP)
Main optimizer for 2D matrices: Muon with FSDP (muon_fsdp_2). Everything else (biases, norms, embeddings) uses Adam. Two pragmatic parameter groups:
| Group | Application | Key parameters |
| --- | --- | --- |
| Muon | 2D parameters (matrices) | lr=1e-4, momentum=0.95, nesterov=true, ns_steps=5 |
| Adam | Non-2D parameters | lr=1e-4, betas=(0.9, 0.95), eps=1e-8 |
Result: Muon showed clear improvement over pure Adam in prior runs, so they applied it selectively.
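Splitting parameters into those two groups can be sketched like this. The ndim-based rule plus a name check for embeddings is an assumption about how the split might be done (the post only says matrices go to Muon and biases/norms/embeddings go to Adam); parameter names here are illustrative:

```python
import numpy as np

def build_param_groups(named_params):
    """Route 2D weight matrices to Muon; biases, norms, and embeddings
    (flagged by name, since embedding tables are also 2D) go to Adam."""
    muon, adam = [], []
    for name, p in named_params.items():
        if p.ndim == 2 and "embed" not in name:
            muon.append(name)
        else:
            adam.append(name)
    return (
        {"optimizer": "muon", "params": muon,
         "lr": 1e-4, "momentum": 0.95, "nesterov": True, "ns_steps": 5},
        {"optimizer": "adam", "params": adam,
         "lr": 1e-4, "betas": (0.9, 0.95), "eps": 1e-8},
    )

params = {
    "blocks.0.attn.qkv.weight": np.zeros((256, 768)),
    "blocks.0.attn.qkv.bias": np.zeros(768),
    "norm.weight": np.zeros(256),
    "tok_embed.weight": np.zeros((1000, 256)),
}
muon_group, adam_group = build_param_groups(params)
```

Only the attention matrix lands in the Muon group; the bias, norm, and embedding table fall through to Adam.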
Data and training schedule
Public synthetic sets used:
- Flux generated (1.7M) - lehduong/flux_generated
- FLUX-Reason-6M (6M) - LucasFang/FLUX-Reason-6M
- midjourney-v6-llava (1M) - brivangl/midjourney-v6-llava, re-captioned with Gemini 2.5 Flash to standardize prompts and reduce caption noise
Practical training recipe:
- 512px: 100k steps, batch size 1024
- 1024px: 20k steps, batch size 512 (no REPA in this stage)
EMA for sampling and evaluation:
- smoothing = 0.999
- update_interval = 10ba
- ema_start = 0ba
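Those three settings translate to a simple weight-averaging loop. A minimal sketch, assuming "ba" denotes batches (as in MosaicML Composer's time units):

```python
import numpy as np

def ema_update(ema, current, smoothing=0.999):
    """In-place exponential moving average: ema <- s*ema + (1-s)*current."""
    for k in ema:
        ema[k] = smoothing * ema[k] + (1.0 - smoothing) * current[k]

weights = {"w": np.ones(4)}       # stand-in for the live training weights
ema = {"w": np.zeros(4)}          # ema_start = 0ba: begin averaging immediately
for step in range(1, 101):
    if step % 10 == 0:            # update_interval = 10ba
        ema_update(ema, weights)  # smoothing = 0.999
```

Sampling and evaluation then read from the EMA copy rather than the live weights.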
The pipeline was designed to be configurable: you can swap datasets, tweak routing, REPA, perceptual losses, or the Muon configuration.
Results and practical lessons
Does it work in 24 hours? Yes, and usefully. Main observations:
- Quality: strong prompt following, consistent aesthetics; the 1024px stage sharpens details without breaking composition.
- Failures: texture glitches, occasionally odd anatomy, and trouble with very hard prompts. These look more like undertraining or limited data diversity than fundamental flaws in the recipe.
Key lesson: by combining pixel-space, efficient routing, representation alignment, and lightweight perceptual losses, you can get a meaningful model in a single day with a moderate budget. It's not magic: it's careful engineering and choosing proven components.
What’s next and how to reproduce it
This is a starting point. The team plans to scale the recipe, iterate on the data mix, and improve captioning. If you want to reproduce it or try variants, the code and configs are open to the community.
You can also see the repo with all the code and configs: PRX repository
Summary: PRX shows that by combining pixel-space, perceptual losses, token routing, and representation alignment, it's possible to train a useful text-to-image model in 24 hours with 32 H200s and ~$1500. The experiment paves the way for larger, reproducible iterations thanks to open source code.