April 20, 2026 · 3 min read

Mixed-Prompt Composition: One Image, Five Typed References

You want one image that satisfies five constraints at once: this person, in this pose, in that environment, wearing those clothes, holding that product. Five references attached. The generator pulls everything from everything — the person's identity becomes a hybrid, the product's color drifts toward the wardrobe's, the pose lands somewhere between three. Silent-reference hallucination at full volume. The generator did its job; the prompt just didn't tell it which reference to consult for which question.

The fix is a single LLM call that builds one coherent composite prompt by explicitly typing each reference and naming, inline, what each one is for — so the generator receives one instruction with one role per attached image, instead of five competing image-pulls.

This is the mixed-prompt composition builder: a single-shot prompt-construction pattern that merges multiple reference images, each contributing only its assigned domain (subject identity from one, pose from another, background from a third, wardrobe from a fourth, product from a fifth). It is a single-shot instance of typed-reference-composition.

The problem

Users often want one image that combines several independent constraints:

This person (identity reference)
In this pose (pose reference — pose only, ignore appearance)
In this environment (background reference — place only, ignore subjects)
Wearing these clothes (wardrobe references — garments only)
Holding this product (product reference — object only)

Attaching all five as raw reference images to a generator produces incoherent output, because the model tries to use everything from every image. This is the silent-reference failure mode at its worst.

A multi-stage campaign pipeline solves this with architect cascades, IMAGE ORDER numbering, and identity / product Locks — but at the cost of running a full multi-stage pipeline. For one-shot composition, that's too much machinery.

The pattern

A single LLM call constructs one composite prompt by explicitly typing each reference:

Mixed-prompt input
├── subject image        → refType: "identity"
├── pose reference       → refType: "pose"
├── background reference → refType: "background"
├── clothing images      → refType: "clothing"
├── product image        → refType: "product"
└── technical / style    → refType: "technical"

The LLM receives the five (or fewer) images plus a structured parts array declaring each one's refType. The system prompt's job is narrow:

Build one coherent image-generation prompt. For each reference, use only its declared domain. Ignore everything else in every image. Reference each image by its explicit number so the generator knows which image to consult for which domain.
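
As a rough illustration of the input side, here is a minimal Python sketch of a typed parts array and the image numbering that goes with it. The Part class, the uri field, and the label_parts and build_manifest helpers are hypothetical names invented for this example; only the refType values and the system prompt text come from the description above. Grouping consecutive same-type parts under one number with letter suffixes (4a, 4b) is an assumption chosen to match the numbering used in the output example.

```python
from dataclasses import dataclass
from itertools import groupby
from string import ascii_lowercase

# refType values from the article's typology.
REF_TYPES = ("identity", "pose", "background", "clothing", "product", "technical")

# System prompt text as quoted in the article.
SYSTEM_PROMPT = (
    "Build one coherent image-generation prompt. For each reference, use only "
    "its declared domain. Ignore everything else in every image. Reference each "
    "image by its explicit number so the generator knows which image to consult "
    "for which domain."
)


@dataclass(frozen=True)
class Part:
    """One attached reference plus its declared role (hypothetical shape)."""
    ref_type: str
    uri: str  # placeholder for wherever the image actually lives


def label_parts(parts):
    """Assign 'Image N' labels; consecutive parts that share a refType
    share one number and get letter suffixes (assumed convention)."""
    labelled = []
    number = 0
    for ref_type, group in groupby(parts, key=lambda p: p.ref_type):
        if ref_type not in REF_TYPES:
            raise ValueError(f"unknown refType: {ref_type!r}")
        number += 1
        members = list(group)
        for i, part in enumerate(members):
            suffix = ascii_lowercase[i] if len(members) > 1 else ""
            labelled.append((f"Image {number}{suffix}", part))
    return labelled


def build_manifest(parts):
    """Render the text manifest that would accompany the images in the LLM call."""
    return "\n".join(
        f'{label}: refType="{part.ref_type}"' for label, part in label_parts(parts)
    )


if __name__ == "__main__":
    demo = [
        Part("identity", "subject.jpg"),
        Part("pose", "pose.jpg"),
        Part("background", "location.jpg"),
        Part("clothing", "jacket.jpg"),
        Part("clothing", "trousers.jpg"),
        Part("product", "bottle.jpg"),
    ]
    print(SYSTEM_PROMPT)
    print(build_manifest(demo))
```

The manifest is sent alongside the images, so the LLM sees both the pixels and the role each image is allowed to play. How the call itself is made depends on the provider and is left out here.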

The output is one prompt with inline image-number declarations:

A [subject from Image 1, physical identity preserved],
[pose/body position from Image 2, wardrobe and identity ignored],
wearing [garments from Images 4a, 4b — color and fabric only, not the models
that wear them], standing in [environment from Image 3, subjects in Image 3
ignored], holding [product from Image 5, rigid form preserved]...
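
Because the LLM writes this prompt, it can be worth checking that every attached image is actually named in it before the prompt goes to the generator. The following is a minimal sketch of such a check, not part of the pattern as described; the function names and the regular expression are assumptions for illustration, and the pattern only recognises the "Image N" and "Images Na, Nb" forms shown above.

```python
import re

# Matches "Image 2" as well as grouped mentions such as "Images 4a, 4b".
_MENTION = re.compile(r"\bImages?\s+(\d+[a-z]?(?:\s*,\s*\d+[a-z]?)*)")


def cited_labels(prompt):
    """Return the set of image labels ('1', '4a', ...) mentioned in the prompt."""
    labels = set()
    for match in _MENTION.finditer(prompt):
        labels.update(item.strip() for item in match.group(1).split(","))
    return labels


def uncited_labels(prompt, attached):
    """Labels that were attached but never named in the prompt."""
    return sorted(set(attached) - cited_labels(prompt))
```

An image that is attached but never cited is exactly the case the pattern is meant to prevent: the generator receives it with no declared role. A non-empty result from uncited_labels is a reasonable trigger for regenerating the composite prompt.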

Why the refType typology is load-bearing

Without a declared role per image, the LLM defaults to "use everything from every image" and the generator does the same. The refType typology is an explicit contract:

identity — face, body, hair, skin tone, distinguishing features. Everything else from this image is ignored.
pose — body position, angle, gesture. Appearance, clothing, background from this image are ignored.
background — environment, lighting state, set dressing. Subjects and products in this image are ignored.
clothing — garment form, fabric, color. The model who wears it in the reference is ignored.
product — object form, orientation, text. Anything around the product is ignored.
technical — aspect ratio, format, composition hints. No subject content.

The declaration is written into the prompt text itself ("using Image 2 for pose only, ignoring the person depicted in Image 2"). Spelling it out is what stops the model from silently pulling the wrong dimension from an image.
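
One way to keep the contract and the declaration in sync is to store the use/ignore pairs as data and generate the clause from them. This is a minimal sketch under that assumption: the REF_TYPE_CONTRACT table is transcribed from the typology above, while the declare helper and the exact wording of the clause it emits are hypothetical.

```python
# Use/ignore domains per refType, transcribed from the typology in the article.
REF_TYPE_CONTRACT = {
    "identity": ("face, body, hair, skin tone, distinguishing features", "everything else"),
    "pose": ("body position, angle, gesture", "appearance, clothing, background"),
    "background": ("environment, lighting state, set dressing", "subjects and products"),
    "clothing": ("garment form, fabric, color", "the model wearing the garment"),
    "product": ("object form, orientation, text", "anything around the product"),
    "technical": ("aspect ratio, format, composition hints", "all subject content"),
}


def declare(label, ref_type):
    """Build an explicit role declaration for one image (illustrative wording)."""
    use, ignore = REF_TYPE_CONTRACT[ref_type]  # KeyError on an undeclared refType
    return f"using {label} for {ref_type} only ({use}), ignoring {ignore} in {label}"
```

For example, declare("Image 2", "pose") yields "using Image 2 for pose only (body position, angle, gesture), ignoring appearance, clothing, background in Image 2". Clauses like this could be handed to the LLM as hints or used to audit the prompt it returns.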

Relationship to A.O.C. and the heavy path

Mixed-prompt composition is not an A.O.C. prompt. It is a single composite prompt built from typed parts. The two patterns solve different problems:

A.O.C. → how to describe one shot (three axes: anchor, optics, chemistry).
Mixed-prompt → how to source the shot's contents from multiple reference images.

The two compose — a mixed-prompt-built description can fill the Anchor of an A.O.C. block — but they are normally alternatives. Mixed-prompt is the lightweight path when the user wants one image with many inputs; A.O.C. plus a multi-stage campaign pipeline is the heavyweight path when the user wants five consistent shots.

Why it's worth naming

The pattern ("typed reference composition with explicit per-image role declaration") generalises beyond product work:

Fashion try-on — garments from one reference, model from another, environment from a third.
Character design iteration — face from one sheet, costume from another, pose from a third.
Editorial compositing — subject from a shoot, location from a moodboard, props from a library.

Any single-shot compositor faces the same silent-reference failure mode and benefits from the same refType discipline.

Pairs with

typed-reference-composition — the broader architecture this is a single-shot instance of.