April 20, 2026

Sequential Consistency via Prompt Architecture

Holding Coherence Across Multi-Shot and Multi-Scene Generative AI Without Training

Author: Alex Nix
Status: Working draft, for public release

Abstract

The canonical answer to "how do I keep my character looking the same across a thirty-shot video" is to train a model on the character: LoRA, DreamBooth, textual inversion, or a full fine-tune. These work, and they are expensive. For a single production (thirty shots across a week, one time) training is overkill: it buys consistency at a higher cost than necessary and gives up flexibility to do so.

This paper describes four composable prompt architectures that together hold consistency across long sequences of generative outputs without training any per-subject model. The four are:

Reference-Inheritance — the pipeline's own prior outputs become inputs to later stages
Series-Aware Anti-Repetition — prior outputs attached as constraint-what-to-avoid, not what-to-copy
Neighbor-Aware Prompts — each output describes how it connects to adjacent outputs
Reference Condensation — many typed references collapsed into one coherent reference for downstream consumption

The four are operationally independent but architecturally composed: each answers a different question about sequence consistency, and a production pipeline typically uses all four. This paper describes each, how they compose, and when to reach for them rather than for training.

1. The sequential-consistency problem

Generate thirty shots of a character in a commercial. Thirty separate generations produce thirty subtly-different characters — wardrobe drift, face drift, environment drift, lighting drift. Individually each shot looks fine; as a sequence, the viewer registers "these are different people across scenes."

Two canonical solutions:

Solution A — train a model on the character. Five hundred reference images, a LoRA run, deployment. Reliable; expensive; inflexible (the character is now locked into whatever pose/wardrobe/lighting the training data implied). Overkill for a one-time production.

Solution B — attach reference images to every generation and hope. Cheap; brittle. Generator behavior with attached references is inconsistent. Silent-reference failure modes abound. Consistency improves but doesn't land.

This paper describes Solution C — prompt architecture that imposes consistency structurally, without training. The mechanisms below work because generator models already accept reference conditioning; the pattern is about how references are attached and what the prompt text says about them.

2. The graph model

Before the four patterns, name the graph the sequence lives in.

A production is not a flat list of thirty generations. It is a forest of linear chains grouped into scenes:

scene 1:  1A → 1B → 1C → 1D        (linear within scene)
scene 2:  2A → 2B → 2C             (linear, parallel to scene 1)
scene 3:  3A → 3B → 3C → 3D → 3E

Within a scene: shots are a linear chain. Shot 1B depends on shot 1A's output; shot 1C on 1B's.
Across scenes: no dependency. Scene 2 and scene 3 render independently.

This shape is not an incidental implementation detail — it is the architecture that makes the patterns below work. Consistency is enforced within a chain; cross-scene consistency is handled by external references (a shared character reference, a shared style reference), not by chained inheritance.
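The forest-of-chains shape can be written down directly. The following is a minimal sketch, not part of any particular pipeline: the names `Scene`, `predecessor`, and `render_waves` are hypothetical. It shows that dependencies exist only inside a scene, so the k-th shots of all scenes can render in parallel.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Scene:
    """One scene: an ordered, linear chain of shot ids."""
    name: str
    shots: Tuple[str, ...]


def predecessor(scene: Scene, shot: str) -> Optional[str]:
    """Return the shot that `shot` depends on, or None for a scene opener."""
    i = scene.shots.index(shot)
    return scene.shots[i - 1] if i > 0 else None


def render_waves(scenes: List[Scene]) -> List[List[str]]:
    """Group shots into waves that can render in parallel.

    Wave k holds the k-th shot of every scene. Shots in one wave never
    depend on each other, because dependencies exist only inside a scene.
    """
    longest = max((len(s.shots) for s in scenes), default=0)
    return [
        [s.shots[k] for s in scenes if k < len(s.shots)]
        for k in range(longest)
    ]


production = [
    Scene("scene 1", ("1A", "1B", "1C", "1D")),
    Scene("scene 2", ("2A", "2B", "2C")),
    Scene("scene 3", ("3A", "3B", "3C", "3D", "3E")),
]

for wave in render_waves(production):
    print(wave)
```

The first wave is `['1A', '2A', '3A']`; the last wave contains only `3E`, because scene 3 is the longest chain.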

3. Pattern 1 — Reference-Inheritance

The pipeline's own prior outputs become inputs to later stages.

Rendering shot 1B:
  text:        describes only subject action, location, camera (what varies)
  vision refs: external moodboard + storyboard_1B + [rendered_1A.jpg]
  output:      rendered_1B.jpg

Shot 1B's render job attaches shot 1A's rendered image as a vision reference. Character identity, wardrobe, environment state — all carried visually by 1A's render, not re-described in 1B's text.

Why it works

Generators accept vision conditioning natively. No new model infrastructure needed.
Dense signal in the reference. Text cannot specify "exact same shirt button arrangement"; the image can.
Linear chains limit compound drift. Each shot only has to match its immediate predecessor, so the error introduced at each step stays small. Drift still accumulates, but slowly enough to be tolerable across a short chain (3–6 shots typical).

What it requires of the text prompt

Deliberate omission. If the text tries to describe identity/wardrobe in addition to vision-inheriting from the prior render, the two channels compete. The text must describe only what changes — action, location (if it varies), camera. Identity is vision-carried.
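One way to make deliberate omission enforceable is to check the text channel before a job is submitted. The sketch below is illustrative only: `RenderJob`, `build_render_job`, and the word list `IDENTITY_WORDS` are assumptions, and a real pipeline would need a far better identity detector than a word list.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, deliberately tiny list of words that describe identity or
# wardrobe. These belong to the vision channel once a prior render exists.
IDENTITY_WORDS = {"wearing", "hair", "shirt", "jacket", "face", "beard"}


@dataclass
class RenderJob:
    shot: str
    text: str
    vision_refs: List[str] = field(default_factory=list)


def build_render_job(
    shot: str,
    text: str,
    external_refs: List[str],
    prior_render: Optional[str] = None,
) -> RenderJob:
    """Attach the predecessor's render and keep identity out of the text."""
    refs = list(external_refs)
    if prior_render is not None:
        words = set(re.findall(r"[a-z]+", text.lower()))
        leaked = sorted(IDENTITY_WORDS & words)
        if leaked:
            # Text and image would both describe identity and compete.
            raise ValueError(f"text re-describes identity: {leaked}")
        refs.append(prior_render)  # identity, wardrobe, environment travel here
    return RenderJob(shot, text, refs)


job = build_render_job(
    "1B",
    "Subject turns toward the window. Slow push-in, medium close-up.",
    ["moodboard.jpg", "storyboard_1B.jpg"],
    prior_render="rendered_1A.jpg",
)
print(job.vision_refs)
```

The opening shot of a scene passes `prior_render=None`, so the check is skipped and the text is free to describe whatever the external references do not already carry.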

Limitations

Short chains only. Drift accumulates; beyond roughly 5–8 shots the pattern degrades.
Hard predecessor dependency. Shot 1B cannot render before shot 1A completes. This imposes strict in-scene ordering.
No cross-scene inheritance. Scene 2 cannot inherit from scene 1 — too many shots intervene, drift would dominate.
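The short-chain limit can be respected mechanically by cutting a long scene into sub-series and restarting inheritance from an external reference at each boundary, as the failure-mode discussion in Section 9 recommends. This is a rough sketch under assumptions: the function names, the default `max_len=6`, and the file name `character_ref.jpg` are illustrative choices, not values prescribed by the text.

```python
from typing import List


def split_chain(shots: List[str], max_len: int = 6) -> List[List[str]]:
    """Split one scene's shots into consecutive sub-series of at most max_len."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [shots[i:i + max_len] for i in range(0, len(shots), max_len)]


def inheritance_source(
    shots: List[str],
    shot: str,
    max_len: int = 6,
    external_ref: str = "character_ref.jpg",
) -> str:
    """Pick what a shot inherits from.

    A shot that opens a sub-series refreshes from the external reference;
    every other shot inherits from its immediate predecessor.
    """
    i = shots.index(shot)
    return external_ref if i % max_len == 0 else shots[i - 1]


scene = ["1A", "1B", "1C", "1D", "1E", "1F", "1G", "1H", "1I", "1J"]
print(split_chain(scene, max_len=4))
print(inheritance_source(scene, "1E", max_len=4))
```

With `max_len=4`, shot `1E` opens the second sub-series and therefore inherits from the external reference rather than from `1D`.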

4. Pattern 2 — Series-Aware Anti-Repetition

Prior outputs are attached as a constraint on what to avoid, not as a model to copy.

Generating shot 1B:
  text:        current shot's description
  vision refs: prior shots in the series (1A)
  system prompt: "Analyse the prior frames. Identify their camera angles
                  and framing. For this shot, intentionally choose a
                  DIFFERENT angle while keeping the same style."

The reference is a negative anchor. The LLM reads it to identify what has been done and picks something different along named dimensions.

Why it works

LLMs default to imitating attached references. Explicitly inverting the default with a named instruction produces variety.
Separate vary-vs-constant dimensions. The pattern requires declaring what must vary (camera angle, framing) and what must hold (style, character identity). Without this split, "produce something different" drifts on everything.
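The vary-versus-hold split can be made explicit in code, so that a prompt cannot be built without declaring both sides. A minimal sketch follows; the function name `anti_repetition_prompt` and the exact wording of the generated instruction are assumptions modeled on the system prompt quoted above.

```python
from typing import List


def anti_repetition_prompt(vary: List[str], hold: List[str], prior_count: int) -> str:
    """Build a system prompt that inverts the imitate-the-reference default."""
    if not vary:
        raise ValueError("declare at least one dimension to vary")
    overlap = sorted(set(vary) & set(hold))
    if overlap:
        # A dimension cannot be both varied and held constant.
        raise ValueError(f"dimensions declared on both sides: {overlap}")
    varied = " and ".join(vary)
    held = ", ".join(hold)
    if prior_count == 0:
        # Opening frame: nothing to avoid yet, only constants to establish.
        return f"First frame of the series. Choose {varied} freely. Establish: {held}."
    return (
        f"Analyse the {prior_count} attached prior frame(s). "
        f"Identify the {varied} they use. "
        f"For this shot, intentionally choose a DIFFERENT {varied}. "
        f"Keep these unchanged: {held}."
    )


prompt = anti_repetition_prompt(
    vary=["camera angle", "framing"],
    hold=["style", "character identity"],
    prior_count=1,
)
print(prompt)
```

Rejecting overlapping dimensions at construction time is the point of the sketch: the failure the text warns about ("produce something different" drifting on everything) becomes an error instead of a silent degradation.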

Canonical case: storyboard generation

Storyboards within a scene share visual style (line weight, hatching, red accents — see the style-prefix-architecture concept) but must vary in camera angle and framing. The LLM reads prior storyboard frames to identify already-used angles, then intentionally picks unused ones.

Relationship to Reference-Inheritance

These two patterns pull in opposite directions. Inheritance pulls outputs together (hold identity constant); anti-repetition pushes them apart (vary camera). They compose only when the dimensions are cleanly separated: inherit identity, vary camera. Confusing the two — inheriting camera or varying identity — defeats both.

5. Pattern 3 — Neighbor-Aware Prompts

Each output describes how it connects to adjacent outputs.

Shot 1A motion prompt:
  "Camera pushes in. Subject's gaze rotates from off-frame to the camera.
   Motion lands on a pose that sets up 1B's opening action."

Shot 1B motion prompt:
  "Subject, now making eye contact (continuing the gaze landed in 1A),
   begins to speak."

The prompt for output n explicitly references state from n−1 and setup for n+1. The generator produces motion that dovetails with its neighbors rather than with imaginary other shots.
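The n−1 / n+1 structure can be sketched as a small prompt builder over an abstract shot plan. Everything named here (`ShotPlan`, `lands_on`, `neighbor_aware_prompt`) is hypothetical, and the end state given for shot 1B is invented for illustration. The edge cases use fallback wording for outputs that lack a predecessor or successor.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ShotPlan:
    shot: str
    action: str    # what happens inside the shot
    lands_on: str  # the state the shot ends in


def neighbor_aware_prompt(plans: List[ShotPlan], n: int) -> str:
    """Wrap shot n's action with its connection to shots n-1 and n+1."""
    cur = plans[n]
    if n > 0:
        prev = plans[n - 1]
        opening = f"Continuing from {prev.shot} ({prev.lands_on})."
    else:
        opening = "Opening shot; begin fresh; no predecessor state."
    if n + 1 < len(plans):
        closing = f"Motion lands on {cur.lands_on}, setting up {plans[n + 1].shot}."
    else:
        closing = "Final shot; resolve the motion; no successor to set up."
    return f"{opening} {cur.action} {closing}"


plans = [
    ShotPlan("1A", "Camera pushes in. Subject's gaze rotates from off-frame to the camera.",
             "direct eye contact"),
    ShotPlan("1B", "Subject begins to speak.", "a held pause after the first line"),
]

for i in range(len(plans)):
    print(neighbor_aware_prompt(plans, i))
```

Because the builder takes the whole plan, it presupposes that all shots have been planned abstractly before any single prompt is written.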

Where it matters

Video transition stitching. When two adjacent shots must connect seamlessly (a continuous gesture, a narrative beat hand-off), the motion between them is generator-driven, and the prompt must specify the connection.
Narrative scene hand-offs. Each scene's generation prompt inherits unresolved threads and leaves threads for the next.
Multi-turn dialog generation. Each turn's prompt knows what the prior speaker said and what the next expects.

Composition with first/last-frame constraints

The pattern is most powerful when paired with a hard visual constraint: start frame = rendered shot n, end frame = rendered shot n+1. The generator must produce motion that begins at the start and lands at the end; the prompt describes how the motion connects rather than what the endpoints look like.
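Pairing adjacent renders into start/end constraints is mostly bookkeeping. The sketch below assumes a generator that accepts a start frame, an end frame, and a text prompt; `TransitionJob` and `transition_jobs` are hypothetical names, and no specific generator API is implied.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TransitionJob:
    start_frame: str  # rendered shot n
    end_frame: str    # rendered shot n+1
    prompt: str       # how the motion connects, not what the endpoints look like


def transition_jobs(renders: List[str], connections: List[str]) -> List[TransitionJob]:
    """Build one transition job per adjacent pair of rendered shots."""
    expected = max(len(renders) - 1, 0)
    if len(connections) != expected:
        raise ValueError(f"need {expected} connection prompt(s), got {len(connections)}")
    return [
        TransitionJob(start, end, text)
        for start, end, text in zip(renders, renders[1:], connections)
    ]


jobs = transition_jobs(
    ["rendered_1A.jpg", "rendered_1B.jpg", "rendered_1C.jpg"],
    [
        "Gaze rotates from off-frame to the camera.",
        "Subject leans back as the camera pulls out.",
    ],
)
for job in jobs:
    print(job.start_frame, "->", job.end_frame)
```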

Failure modes

Over-inheritance. Content-within-the-output contaminated by neighbor content. Counter: isolate "connection points at edges" from "content within."

Missing neighbors. First and last outputs in a sequence have no predecessor or successor. Counter: fallback rules — "opening shot; begin fresh; no predecessor state."
Backfill cost. If shot n+1 is generated after shot n, shot n can't have known about it. Counter: two-pass planning — plan all shots abstractly first, then refine each with neighbor awareness.

6. Pattern 4 — Reference Condensation

Many typed references → one coherent single-image reference.

Per-scene reference set:
  location_ref.jpg, character_ref.jpg, style_ref.jpg,
  prop_ref.jpg, color_ref.jpg, camera_ref.jpg

     ↓ LLM + generator condensation step
     ↓

One composite reference image (typically 16:9 cinematic moodboard)

     ↓

Downstream stages attach ONLY the composite.

The per-scene references are condensed into a single anchor that downstream generation stages can use efficiently. The originals are retained for debugging but no longer attached directly.

Why condense

Generators have a finite attention budget for reference conditioning. Attaching more than ~3 references produces diminishing and then negative returns — the model tries to satisfy all references and produces washed-out, incoherent output.

Condensation converts a many-reference problem into a one-reference problem at the cost of one additional generation step. The composite carries the distilled intent of the originals in a form the downstream generator can actually use.

How the condenser stage works

Reads every reference via vision
Reads the typed role of each reference (the refType: location, character, style, prop, color, camera)
Applies a declared resolution order when references disagree (typically: environment → style → color → subject)
Emits a single-reference prompt (itself a deliberate-omission prompt) that describes the intended composite
Passes the prompt to a generator to produce the composite

The output is a generated image that is itself used as a reference in the next stage. Recursive use of generation for reference construction.
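The conflict-resolution step of the condenser can be sketched without any model calls. In the example below the per-reference readings are supplied by hand, where a real condenser would obtain them from a vision model. The mapping `TIER_OF` from refType to resolution tier is an assumption (the text gives the order but not the mapping), as are the names `TypedRef`, `resolve`, and `composite_prompt`.

```python
from dataclasses import dataclass
from typing import Dict, List

# Resolution order quoted from the text; earlier entries win on conflict.
RESOLUTION_ORDER = ("environment", "style", "color", "subject")

# Assumed mapping from refType to resolution tier (not given in the source).
TIER_OF = {
    "location": "environment",
    "camera": "environment",
    "style": "style",
    "color": "color",
    "character": "subject",
    "prop": "subject",
}


@dataclass
class TypedRef:
    path: str
    ref_type: str
    reads: Dict[str, str]  # attribute -> value, as a vision model might report it


def resolve(refs: List[TypedRef]) -> Dict[str, str]:
    """Merge readings; when references disagree, the higher tier wins."""
    ranked = sorted(refs, key=lambda r: RESOLUTION_ORDER.index(TIER_OF[r.ref_type]))
    merged: Dict[str, str] = {}
    for ref in ranked:
        for attribute, value in ref.reads.items():
            merged.setdefault(attribute, value)  # first writer is highest tier
    return merged


def composite_prompt(merged: Dict[str, str]) -> str:
    """Describe the intended composite as a single-reference prompt."""
    parts = "; ".join(f"{k}: {v}" for k, v in sorted(merged.items()))
    return f"Single 16:9 cinematic moodboard. {parts}."


refs = [
    TypedRef("color_ref.jpg", "color", {"palette": "muted earth tones"}),
    TypedRef("style_ref.jpg", "style", {"lighting": "hard neon", "palette": "teal and orange"}),
    TypedRef("location_ref.jpg", "location", {"lighting": "overcast daylight"}),
]
print(composite_prompt(resolve(refs)))
```

Here the location reference outranks the style reference on lighting, and the style reference outranks the color reference on palette, so the composite prompt asks for overcast daylight with a teal and orange palette.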

Role in the overall sequential-consistency stack

Enables deliberate omission downstream. When the composite reliably carries style / colour / lighting, the text prompt in the downstream stage can omit those dimensions confidently.
Stable per-scene anchor. Scenes that span multiple shots all inherit from the same composite; style consistency across shots is structural.
Debuggable. A bad output traces to either the composite (inspect it) or the downstream text prompt (inspect it).

7. How the four patterns compose

A production pipeline typically uses all four, in a specific arrangement:

Per-scene setup (once per scene):
  - typed references: location, character, style, prop, color, camera
  - Pattern 4 (Condensation): → composite moodboard image

Rendering scene shots (linear within scene, all four patterns active):
  Shot 1A:
    - Pattern 4 output attached as vision (style/colour/lighting)
    - Pattern 2 (Anti-Repetition): prior shots (none — opening shot)
    - Pattern 1 (Inheritance): prior shot (none)
    - Text: deliberately minimal — subject/location/camera only

  Shot 1B:
    - Pattern 4 output attached as vision
    - Pattern 2: prior shot 1A considered for angle variety
    - Pattern 1: rendered_1A.jpg attached as continuity reference
    - Pattern 3 (Neighbor-Aware): text notes connection from 1A, setup for 1C
    - Text: minimal, neighbor-aware

  Shot 1C:
    - all four patterns active, referencing 1B
  ...

The ordering matters: condensation is setup; the other three are active during in-scene rendering. Anti-repetition and inheritance push in opposite directions and must be cleanly separated by dimension. Neighbor-awareness operates at the connection points between shots.
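The arrangement above can be expressed as one assembly loop per scene. This is a simplified sketch under assumptions: `ShotJob`, `assemble_scene`, and the `rendered_<shot>.jpg` naming are illustrative, and Pattern 2 is reduced to a system note rather than a separate LLM pass over the prior frames.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ShotJob:
    shot: str
    text: str
    vision_refs: List[str] = field(default_factory=list)
    system_notes: List[str] = field(default_factory=list)


def assemble_scene(shots: List[str], actions: Dict[str, str], composite: str) -> List[ShotJob]:
    """Build one job per shot with all four patterns applied where possible."""
    jobs = []
    for i, shot in enumerate(shots):
        refs = [composite]  # Pattern 4: the condensed per-scene anchor
        notes = []
        text = actions[shot]  # deliberately minimal: what varies
        if i > 0:
            prev = shots[i - 1]
            refs.append(f"rendered_{prev}.jpg")  # Pattern 1: inherit identity
            notes.append(  # Pattern 2: vary camera against every prior shot
                "Choose a camera angle and framing not used in: " + ", ".join(shots[:i])
            )
            text = f"Continues from {prev}. {text}"  # Pattern 3: incoming edge
        if i + 1 < len(shots):
            text = f"{text} Sets up {shots[i + 1]}."  # Pattern 3: outgoing edge
        jobs.append(ShotJob(shot, text, refs, notes))
    return jobs


scene_jobs = assemble_scene(
    ["1A", "1B", "1C"],
    {
        "1A": "Subject enters. Wide shot.",
        "1B": "Subject sits. Medium shot.",
        "1C": "Subject speaks. Close-up.",
    },
    composite="composite_scene1.jpg",
)
for job in scene_jobs:
    print(job.shot, job.vision_refs, job.text)
```

Because each job names `rendered_<previous>.jpg`, the jobs within a scene must still run in order; only the job descriptions can be prepared up front.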

No single pattern is sufficient. Inheritance without condensation produces over-referenced mush. Anti-repetition without separation of dimensions produces style drift. Neighbor-awareness without inheritance produces fine transitions between incompatible subjects. All four together produce consistent, directable sequential output.

8. When to reach for these patterns vs. training

| Pattern stack | Training (LoRA / fine-tune) |
|---|---|
| Fast to set up | Days to prepare + train |
| Per-project | Per-subject (reusable) |
| Flexible: subject can do anything references suggest | Inflexible: subject locked into training distribution |
| Limited within-chain length (5–8 shots before drift) | No chain-length limit |
| Cross-scene consistency requires external references | Cross-scene consistency is inherent |
| Cost: marginal generator-reference tokens | Cost: upfront training + inference |

Rule of thumb:

One production (any length) with one subject → pattern stack
Many productions with a recurring subject → train a LoRA
Productions combining recurring subjects + new variations → both, with the LoRA providing cross-scene anchoring and the pattern stack handling within-scene drift

9. Failure modes of the stack

Drift accumulation in long chains. Reference-Inheritance is only robust for short chains. Long scenes (8+ shots) should break into sub-series with external reference refreshes at boundaries.
Dimension confusion between inheritance and anti-repetition. If the two patterns are applied without separation-of-dimensions, both degrade. Explicit declaration of vary-vs-constant is load-bearing.
Composite drift. The composite reference from condensation is itself generated; if the generator produces a stylistically-distinctive composite (strong unintended aesthetic), downstream stages over-imitate. Counter: condensation system prompt should produce neutral composites that anchor dimensions without imposing new style.
Stale composite. A user updates a source reference but the composite is not regenerated, so downstream stages keep inheriting from the outdated composite. Counter: invalidate the composite when any of its sources change.
Neighbor-awareness without hard-constraint endpoints. Describing how shot 1A connects to shot 1B is less useful when the generator doesn't see 1B's actual endpoint. Counter: pair with first/last-frame image constraints where the generator supports them.
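The stale-composite counter is the easiest of these to automate: record a fingerprint of the source references when the composite is generated, and compare it before each downstream use. A minimal sketch, assuming the sources are available as bytes keyed by file name; `fingerprint` and `composite_is_stale` are hypothetical names.

```python
import hashlib
from typing import Dict


def fingerprint(sources: Dict[str, bytes]) -> str:
    """Hash the names and contents of all source references.

    Names are sorted first, so the result does not depend on dict order.
    """
    digest = hashlib.sha256()
    for name in sorted(sources):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content hash
        digest.update(hashlib.sha256(sources[name]).digest())
    return digest.hexdigest()


def composite_is_stale(sources: Dict[str, bytes], recorded: str) -> bool:
    """True when any source was added, removed, renamed, or edited."""
    return fingerprint(sources) != recorded


sources = {
    "location_ref.jpg": b"location-bytes-v1",
    "style_ref.jpg": b"style-bytes-v1",
}
recorded = fingerprint(sources)  # stored alongside the generated composite

sources["style_ref.jpg"] = b"style-bytes-v2"  # the user swaps the style reference
print(composite_is_stale(sources, recorded))
```

The byte strings stand in for real image files; in practice the contents would be read from disk or object storage before hashing.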

10. Generalisations beyond image/video

The stack applies to any sequential generation task where consistency matters and per-subject training is infeasible:

Long-form narrative generation. Each chapter inherits from the prior (Pattern 1), varies from it along named dimensions (Pattern 2), dovetails at its edges (Pattern 3). Character descriptions condensed into a persona anchor (Pattern 4).
Code-base-consistent code generation. Each function inherits style and patterns from prior generated siblings; a condensed "style anchor" captures the code-base voice.
Podcast-episode series. Each episode's outline inherits from the prior; varies in topic while holding tone; dovetails at cold-open and sign-off.
Conversational agent with long context. Each turn inherits from the last few turns (Pattern 1), varies topic along user-intent (Pattern 2), signals transitions cleanly (Pattern 3), and condenses long histories into a persona/context anchor (Pattern 4).

The shape is the same: within-chain linear consistency, between-chain parallelism, explicit dimension separation, condensation for attention economy.

11. Open research questions

Empirical validation. Quantified comparison of pattern-stack vs. LoRA-trained vs. naïve-generation across a standard consistency benchmark. Needed badly; the field currently argues from anecdote.
Maximum chain length per pattern. Reference-Inheritance is observed to degrade at 5–8 shots; what are the analogous limits for neighbor-awareness in long narratives? Anti-repetition in long series?
Automated composition tuning. Given a new task, how to decide which of the four patterns to activate and how to configure them? Currently manual; a pipeline-designer assistant is plausible.
Cross-modal transfer. Do the patterns transfer between image/video and text/code? Informally yes; formally unstudied.
Pattern interaction failure modes. When do inheritance and anti-repetition conflict? When does neighbor-awareness over-ride inheritance? A calculus of the pattern interactions would be useful.
Condensation quality metrics. How to measure whether a condensed reference captures the intent of its source set? Possibly CLIP-style similarity with role-weighted attention; untested.

12. Conclusion

Sequential consistency in generative AI does not require training. It requires careful composition of four prompt architectures, each addressing a specific sub-problem of sequence coherence: carrying state forward (Reference-Inheritance), varying along explicit dimensions while holding others constant (Anti-Repetition), connecting outputs at their boundaries (Neighbor-Awareness), and reducing attention load from many references to one (Condensation).

The four compose into a stack that produces directable, consistent sequential output from off-the-shelf generators. It is not a silver bullet — long chains drift, composites can go stale, pattern interactions require care. But for the common case (one production, one or more subjects, reasonable chain lengths, desire for flexibility), the stack is the smallest commitment that reliably works.

Training remains the right answer for recurring subjects and cross-scene consistency. The stack is the right answer for within-production coherence. Most real pipelines will use both.

Citation

Nix, A. (2026). Sequential Consistency via Prompt Architecture — Holding Coherence Across Multi-Shot and Multi-Scene Generative AI Without Training. Working paper.