
Typed-Reference Composition: Every Reference Image Has One Job

Your pipeline attaches four references to a generator: a person, a pose, a background, a piece of clothing. You write the prompt: "the subject in this pose, wearing these clothes, in this background." The model pulls clothes from the person reference and from the clothing reference. The pose reference also contributes a hairstyle. The background contributes a stray figure. Your prompt didn't tell the model "use reference 4 for clothing only, ignore everything else in it" — so it didn't. And you can't diagnose which reference caused which contamination because all four were silent contributors.

The fix is to give every reference an explicit typed role and to declare that role inside the prompt text the generator reads — so the model never has to guess which reference answers which question.

Typed-reference composition is a prompt architecture for multi-input generation where every reference image carries an explicit typed role and the prompt text declares, inline, which reference contributes which dimension. It stops silent-reference hallucination at its source.

The shape

Input references
├── Image 1 → refType: "identity"       (face, body, skin tone — ignore clothes/background)
├── Image 2 → refType: "pose"           (body position — ignore appearance/clothes/location)
├── Image 3 → refType: "background"     (environment — ignore subjects)
├── Image 4 → refType: "clothing"       (garment form, fabric, colour — ignore the model wearing it)
└── Image 5 → refType: "product"        (object form, orientation, text — ignore context)

System prompt:
  IMAGE ORDER
  ─────────────
  Image 1: Identity reference
  Image 2: Pose reference (pose only — ignore person)
  Image 3: Background reference
  ...

  RULE: In every prompt, state explicitly which image contributes what.
  RULE: Do not use image numbers that don't exist in IMAGE ORDER.

Emitted prompt (for the generator):
  "... using Image 1 for identity (face/body/skin), Image 2 for pose only
   (ignoring the person depicted), Image 3 for environment (ignoring any
   subjects present), ..."

The three moving parts

  1. refType: a typed role per reference. The type names which dimensions the reference provides and, by implication, which it does not.
  2. IMAGE ORDER: a positional declaration shared between the system prompt and the emitted prompt. It gives every reference a stable, explicit identity.
  3. Inline declaration: the prompt-writing LLM is instructed to name, inside every prompt it emits, what each attached reference is doing. No silent reference usage.

All three are load-bearing. Remove any one and silent-reference hallucination returns.
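The three parts can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the `ROLES` table, the `Reference` class, the helper names and the file paths are all hypothetical, and `ref_type` stands in for the article's refType. The role wording is taken from the diagram above.

```python
from dataclasses import dataclass

# Hypothetical role table: refType -> (what to use, what to ignore).
ROLES = {
    "identity": ("face, body, skin tone", "clothes and background"),
    "pose": ("body position", "appearance, clothes and location"),
    "background": ("environment", "subjects"),
    "clothing": ("garment form, fabric, colour", "the model wearing it"),
    "product": ("object form, orientation, text", "context"),
}

@dataclass(frozen=True)
class Reference:
    path: str      # illustrative file name only
    ref_type: str  # the article's refType

def image_order(refs):
    """Build the IMAGE ORDER block from the references attached to this call."""
    lines = ["IMAGE ORDER"]
    for i, ref in enumerate(refs, start=1):
        uses, ignores = ROLES[ref.ref_type]
        lines.append(f"Image {i}: {ref.ref_type} reference, use for {uses}; ignore {ignores}")
    return "\n".join(lines)

def inline_declaration(refs):
    """Build the clause that must appear inside the emitted prompt."""
    parts = []
    for i, ref in enumerate(refs, start=1):
        uses, ignores = ROLES[ref.ref_type]
        parts.append(f"Image {i} for {ref.ref_type} ({uses}), ignoring {ignores}")
    return "using " + "; ".join(parts)

refs = [
    Reference("person.png", "identity"),
    Reference("stance.png", "pose"),
    Reference("street.png", "background"),
]
print(image_order(refs))
print(inline_declaration(refs))
```

Because both helpers number the images from the list passed in, the numbering follows whatever is attached on a given call: dropping the identity reference makes the pose reference Image 1 in both the system prompt block and the inline clause.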

Why it works

A generative image model given N reference images can draw on every one of them; that is the point of conditioning. The problem isn't "too many references"; it's "no guidance on which reference answers which question." A clothing reference that also happens to contain a model becomes a second identity source unless it is explicitly typed as clothing-only.

The pattern converts the implicit "use these images however seems sensible" contract into an explicit "Image 2 is pose only, ignore the person depicted" contract. The model still makes judgment calls, but within a much narrower space.

Failure modes

  • Untyped reference attachment. A reference without a declared role is interpreted by the model as "use however seems relevant" — which silently pulls the wrong dimension. Catch: every reference must carry a refType at ingest time.
  • Implicit typing via image position alone. "Image 1 is always identity" breaks as soon as a user attaches no identity reference and a different image takes the first slot. IMAGE ORDER has to be rebuilt on every call from the references actually attached.
  • Declaration in system prompt only, not in the emitted prompt. If the LLM merely knows about the refType but doesn't write it into the output prompt, the downstream generator doesn't see it. The inline declaration is what reaches the model that matters.
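The first and third failure modes, plus the rule against nonexistent image numbers, lend themselves to mechanical checks. The sketch below is one way to do that under stated assumptions: references arrive as plain dicts with a `refType` key, the emitted prompt names images as "Image N", and the function names `check_ingest` and `check_emitted_prompt` are hypothetical.

```python
import re

# The five roles shown in the diagram; extend as needed.
KNOWN_TYPES = {"identity", "pose", "background", "clothing", "product"}

def check_ingest(refs):
    """Reject any reference that arrives without a recognised refType."""
    for i, ref in enumerate(refs, start=1):
        if ref.get("refType") not in KNOWN_TYPES:
            raise ValueError(f"Image {i} has no valid refType: {ref.get('refType')!r}")

def check_emitted_prompt(prompt, n_refs):
    """Return problems found in the prompt that will be sent to the generator."""
    mentioned = {int(n) for n in re.findall(r"\bImage (\d+)\b", prompt)}
    attached = set(range(1, n_refs + 1))
    problems = []
    for n in sorted(attached - mentioned):
        problems.append(f"Image {n} is attached but never declared inline")
    for n in sorted(mentioned - attached):
        problems.append(f"Image {n} is named but not attached")
    return problems

prompt = "Portrait using Image 1 for identity and Image 3 for environment."
print(check_emitted_prompt(prompt, 2))
```

With two references attached, the example prompt is flagged twice: Image 2 is silent, and Image 3 does not exist. The check only confirms that each image number is mentioned; whether the stated role matches the refType would need a stricter parser.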

Composition

The pattern slots cleanly into bigger architectures:

  • Often wrapped by a structured-fidelity Lock that restates the IMAGE REFERENCE MAP at the top of every final prompt — the typing reaches the generator on every call, not just at the start of a session.
  • Typically produced by the Architect half of a two-stage director / architect cascade.
  • Composes with reference-condensation: many typed references can be merged into one coherent composite reference to reduce attention load.
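The article does not spell out how the Lock is built, so the following is only one possible reading of the first bullet: a wrapper that prepends the reference map to every final prompt and leaves an already-locked prompt unchanged. The function names and the exact header layout are assumptions.

```python
def reference_map(ref_types):
    """Render the map from the refTypes attached to this call, in order."""
    lines = ["IMAGE REFERENCE MAP"]
    lines += [f"Image {i}: {t} reference" for i, t in enumerate(ref_types, start=1)]
    return "\n".join(lines)

def lock(final_prompt, ref_types):
    """Prepend the map so the typing reaches the generator on every call."""
    header = reference_map(ref_types)
    if final_prompt.startswith(header):
        return final_prompt  # already locked; keep the wrapper idempotent
    return f"{header}\n\n{final_prompt}"

locked = lock("A portrait using Image 1 for identity, Image 2 for pose only.",
              ["identity", "pose"])
print(locked)
```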

A single-shot worked example of the pattern is the mixed-prompt-composition-builder — explicit typing for identity, pose, background, clothing, product, and technical references in one composite prompt.

Generalises beyond image generation

Any LLM flow taking multiple typed inputs benefits:

  • Code generation from multiple file contexts — spec file / test file / existing module, each with a declared role.
  • Document synthesis from multiple source documents — primary / supporting / rebuttal.
  • Tool-using agents with multiple data sources — every source declared at ingest.
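As a rough illustration of the first bullet, the same shape carries over to code generation with almost no change. The role names follow the bullet (spec, test, existing module); the one-line role descriptions, the `SOURCE ORDER` header, the file names and the function name are hypothetical.

```python
# Hypothetical role table for file contexts: role -> what the source is for.
SOURCE_ROLES = {
    "spec": "required behaviour",
    "test": "acceptance checks",
    "existing_module": "current interfaces and style",
}

def source_order(sources):
    """sources: list of (name, role) pairs in the order they are attached."""
    lines = ["SOURCE ORDER"]
    for i, (name, role) in enumerate(sources, start=1):
        if role not in SOURCE_ROLES:
            raise ValueError(f"{name} has no declared role")
        lines.append(f"Source {i} ({name}): {role}, use for {SOURCE_ROLES[role]} only")
    return "\n".join(lines)

print(source_order([("api.md", "spec"), ("test_api.py", "test")]))
```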

The discipline — every input has a declared role, and that role is named inline in the output — is the whole pattern.