Personalized Listing Photo Editing

Highlights

Uses FLUX.1 Kontext, a flow-matching diffusion transformer, as the teacher for content-preserving, instruction-following style edits.
Distills the teacher's editing behavior into InstructPix2Pix + LoRA adapters, shrinking the per-photographer footprint from a multi-billion-parameter model to an adapter of ~3 MB.
BLIP-2 converts visual style signals into language-conditioned instructions, bridging the teacher-student representation gap.

Diffusion as teacher

FLUX.1 Kontext is a flow-matching diffusion transformer built for content-preserving image edits: it follows language instructions while leaving scene geometry and composition intact. That makes it a precise teacher, able to generate stylistically edited images faithful to the original listing photo. The challenge is that running a model of this scale at inference time is impractical for a product serving many photographers.

Distillation pipeline

InstructPix2Pix with LoRA acts as the student, learning to approximate the teacher's edit behavior from teacher-generated training pairs. LoRA freezes the base model weights and trains only a small set of low-rank update matrices, keeping the final adapter under 3 MB per photographer. The result: the heavy teacher computation happens once at distillation time, and inference runs on the shared InstructPix2Pix base plus a compact, photographer-specific adapter, so the teacher model is never needed at serving time.
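
To make the low-rank idea concrete, here is a minimal NumPy sketch of a single LoRA-adapted linear layer with a helper that estimates adapter storage. It is an illustration under stated assumptions, not the pipeline's actual code: the class name LoRALinear, the rank and alpha defaults, and the fp16 storage assumption (2 bytes per parameter) are all hypothetical, and the source does not say which InstructPix2Pix layers receive adapters.

```python
import numpy as np


class LoRALinear:
    """Frozen dense layer W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration; rank and alpha values are assumptions.
    """

    def __init__(self, weight, rank=4, alpha=4.0, seed=0):
        rng = np.random.default_rng(seed)
        out_features, in_features = weight.shape
        self.weight = weight  # frozen base weight, never updated during training
        # Trainable factors: A starts small and random, B starts at zero,
        # so the adapted layer initially reproduces the base layer exactly.
        self.A = rng.normal(0.0, 0.01, size=(rank, in_features))
        self.B = np.zeros((out_features, rank))
        self.scale = alpha / rank

    def __call__(self, x):
        # Base output plus the scaled low-rank correction.
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def num_adapter_params(self):
        # Only A and B are stored per photographer; W is shared.
        return self.A.size + self.B.size


def adapter_size_mb(layers, bytes_per_param=2):
    """Adapter-only storage in MB, assuming fp16 weights (2 bytes each)."""
    total = sum(layer.num_adapter_params() for layer in layers)
    return total * bytes_per_param / 1e6
```

For a 320 x 320 projection at rank 4, the adapter holds 4 * 320 + 320 * 4 = 2,560 parameters against 102,400 in the frozen weight, which is why only the small A and B factors need to be stored per photographer while the base weights are shared.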