April 20, 2026

A Generative-AI Video Production Pipeline

From Client Brief to Finished Shot, Stage by Stage

Author: Alex Nix
Status: Working draft, for public release
Companion paper: A.O.C. — A Prompt Framework for Generative Image Models

Abstract

Most writing about generative AI for video treats generation as a single step: a prompt goes in, a clip comes out. Production work doesn't behave this way. A thirty-shot commercial is not thirty independent generations — it is a dependency graph with scene-series consistency requirements, a cross-reference system for characters and locations, a style lock that has to hold across half an hour of footage, and an LLM orchestration layer that makes all of it tractable to write.

This paper is a structured description of such a pipeline, built and operated in production over several years. It treats generative AI not as a model to prompt but as a stage inside a larger architecture. The paper names the stages, the consistency mechanisms, the dependency graph, and the design choices that have turned out to be load-bearing. The aim is practical: give other builders a reference architecture to start from and disagree with. The orchestration layer (§ 5) is operated through a small set of Claude Code skills — A.O.C. application skills for the prompt backbone, creative direction skills for the pre-prompt half, Happy Horse Prompt for HH-specific video, MiniMax Voice for speech, and /logo-builder (in generative-logo-design) for identity work.

1. The thesis

Generative AI video, at production quality, is a constraint-propagation pipeline — not a generation model.

Reliable, consistent, directable output emerges from a pipeline that: (a) locks style early and cheaply, (b) separates composition, rendering, and motion as independent stages, (c) enforces scene-series consistency through linear reference inheritance, and (d) uses an LLM orchestration layer to keep every stage's output well-formed for the next.

The generation models are interchangeable components inside this pipeline. Swap one video model for another, swap one image model for another — the pipeline keeps working, because the pipeline is what holds the project together, not any single model call.

2. The six stages

The pipeline decomposes into six stages, each addressing one class of decision:

setup → scenario → storyboard → moodboard → scenes → video

Each stage takes the output of the previous stage and adds one new class of information. Each stage can be independently refined without re-running earlier stages. Single-shot refinement is a first-class operation, deliberately much lighter than fresh generation, so iteration is structurally encouraged.

2.1 Setup

The minimum viable description of the project: title, aspect ratio, duration estimate, technical notes, knowledge-base files (pre-extracted client briefs, scripts, reference docs, notes). Nothing generated yet. Setup is a staging area for the next stages' inputs.

2.2 Scenario

A brief or screenplay becomes a structured shot list. An LLM acting as production director emits a list of shots, each with:

Shot name in {scene-number}{letter} form (1A, 1B, 1C, 2A, …)
Seven-element description: subject+action, location, framing, angle, movement, lens, lighting
Shot type: full-ai / composite / live
Duration
isKeyFrame: optional pairing flag for transition stitching (see §2.6)

The shot-naming scheme is load-bearing. Shots sharing a scene number form a series that must remain visually consistent; the letter suffix orders them; the scheme survives unchanged through every downstream stage.
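As a minimal sketch of how the naming scheme can drive series grouping, the snippet below parses shot names and groups them by scene number. The helper names `parseShotName` and `groupIntoSeries` are hypothetical, and the sketch assumes a single uppercase letter suffix; only the {scene-number}{letter} format itself comes from the text.

```typescript
// Hypothetical helpers; only the {scene-number}{letter} format is from the text.
interface ShotName {
  scene: number;  // shots sharing this number form one series
  letter: string; // orders shots inside the series
}

function parseShotName(name: string): ShotName {
  // Assumption: one uppercase letter after the scene number (1A, 12C, ...).
  const match = /^(\d+)([A-Z])$/.exec(name);
  if (match === null) {
    throw new Error("Not a {scene-number}{letter} shot name: " + name);
  }
  return { scene: parseInt(match[1], 10), letter: match[2] };
}

// Group shot names into series keyed by scene number, each ordered by letter.
function groupIntoSeries(names: string[]): Record<number, string[]> {
  const series: Record<number, string[]> = {};
  for (const name of names) {
    const scene = parseShotName(name).scene;
    if (series[scene] === undefined) {
      series[scene] = [];
    }
    series[scene].push(name);
  }
  for (const key of Object.keys(series)) {
    // Names in one series share the numeric prefix, so a plain sort orders by letter.
    series[Number(key)].sort();
  }
  return series;
}

const grouped = groupIntoSeries(["1B", "2A", "1A", "1C"]);
// grouped[1] is ["1A", "1B", "1C"]; grouped[2] is ["2A"]
```

Because the name alone identifies the series, downstream stages need no separate lookup table to know which shots must stay consistent with each other.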

Design choice: why a fixed seven-element description? The seven elements are what the downstream stages need: storyboards need framing, angle, and movement; stills need subject and location; video needs movement and action. Encoding the description as free prose would force each downstream stage to re-extract those elements; encoding it as a rigid object would force a schema that is always incomplete. The compromise is seven named elements, always present, each written in prose.
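One way to picture "seven named elements, always present, in prose form" is a record with seven required string fields. This is an illustrative sketch, not the pipeline's actual schema: the field names, the `durationSeconds` unit, and both helper functions are assumptions.

```typescript
// Illustrative field names; the text names the elements but not a schema.
type ShotType = "full-ai" | "composite" | "live";

interface ShotDescription {
  subjectAction: string;
  location: string;
  framing: string;
  angle: string;
  movement: string;
  lens: string;
  lighting: string;
}

interface Shot {
  name: string;             // e.g. "1A"
  description: ShotDescription;
  shotType: ShotType;
  durationSeconds: number;  // unit is an assumption
  isKeyFrame?: boolean;     // optional pairing flag
}

const ELEMENTS: (keyof ShotDescription)[] = [
  "subjectAction", "location", "framing", "angle", "movement", "lens", "lighting",
];

// Return the names of elements that are empty, so a caller can reject or re-ask.
function missingElements(d: ShotDescription): string[] {
  return ELEMENTS.filter((key) => d[key].trim().length === 0);
}

// The storyboard stage reads only framing, angle, and movement.
function storyboardView(d: ShotDescription): string {
  return [d.framing, d.angle, d.movement].join(", ");
}
```

Each value stays free prose, so nothing is lost to a rigid schema, while the fixed keys let a downstream stage pick out what it needs without re-extracting it.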

2.3 Storyboard

A fast, cheap visual lock on composition and camera language. An LLM emits per-shot prompts prefixed with a fixed hand-drawn storyboard style string ("traditional hand-drawn storyboard frame on white paper, clean black ink line art with selective red accent lines…"). The actual storyboard images render in seconds at a fraction of the effort of a final render.

Two properties make this stage load-bearing:

The same style prefix for every project. This turns storyboard images into an interchangeable vocabulary — they all "look the same way", so the LLM architects downstream can reason about them as composition references rather than style references.
Earlier-in-series storyboards become references for later-in-series ones. The LLM reads their line weight, hatching direction, and camera angles, then intentionally chooses different angles for subsequent shots — enforcing variety while holding style.
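The two properties above can be sketched as a job builder that always prepends the same style prefix and attaches earlier same-series storyboards as references. The `StoryboardJob` shape and `buildStoryboardJob` are hypothetical names, and the style string is quoted as it appears in the text, truncation included.

```typescript
// Quoted from the text, which truncates the string with an ellipsis;
// the full style string is not reproduced here.
const STORYBOARD_STYLE_PREFIX =
  "traditional hand-drawn storyboard frame on white paper, " +
  "clean black ink line art with selective red accent lines…";

interface StoryboardJob {
  shotName: string;
  prompt: string;
  referenceImages: string[]; // earlier storyboards from the same series
}

// `drawn` maps shot name to storyboard image path for shots already drawn.
function buildStoryboardJob(
  shotName: string,
  shotPrompt: string,
  seriesOrder: string[],
  drawn: Record<string, string>
): StoryboardJob {
  const index = seriesOrder.indexOf(shotName);
  const earlier = index < 0 ? [] : seriesOrder.slice(0, index);
  const referenceImages: string[] = [];
  for (const name of earlier) {
    const path = drawn[name];
    if (path !== undefined) {
      referenceImages.push(path);
    }
  }
  return {
    shotName,
    prompt: STORYBOARD_STYLE_PREFIX + " " + shotPrompt,
    referenceImages,
  };
}
```

Keeping the prefix in one constant is what makes every project's storyboards share a look; the per-shot text only ever adds composition and camera language.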

2.4 Moodboard (runs parallel to storyboard)

Two-tier reference system. A project-level (global) moodboard holds refs shared across shots; a per-scene moodboard holds scene-specific overrides. Each ref has a typed role (location / character / style / prop / color / camera), which downstream stages consult to decide which ref answers which question.

A composite 16:9 moodboard image is generated per scene — an AI-summarised single-image reference that condenses the scene's many refs into one visual. This is how style, colour, and lighting reach the stills stage: as one vision input, not as many raw reference images that would overconstrain generation.
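A rough sketch of the two-tier lookup follows. The text says per-scene refs are overrides but does not spell out the rule, so this assumes a scene-level ref replaces global refs of the same role; `MoodboardRef`, `resolveRefs`, and `refsFor` are hypothetical names.

```typescript
type RefRole = "location" | "character" | "style" | "prop" | "color" | "camera";

interface MoodboardRef {
  role: RefRole;
  image: string;
  caption: string;
}

// Assumption: a scene-level ref overrides every global ref of the same role.
function resolveRefs(globalRefs: MoodboardRef[], sceneRefs: MoodboardRef[]): MoodboardRef[] {
  const overriddenRoles = sceneRefs.map((ref) => ref.role);
  const inherited = globalRefs.filter((ref) => overriddenRoles.indexOf(ref.role) === -1);
  return inherited.concat(sceneRefs);
}

// The typed role decides which ref answers which question.
function refsFor(refs: MoodboardRef[], role: RefRole): MoodboardRef[] {
  return refs.filter((ref) => ref.role === role);
}
```

The resolved list is what would be condensed into the per-scene composite image; the captions travel separately as text.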

2.5 Stills

Per-shot photographic renders; this is the stage labelled "scenes" in the pipeline diagrams. The stage consumes six input sources:

Shot description (text)
Storyboard image (vision — composition only, style ignored)
Previous same-series renders (vision — continuity)
Composite moodboard image (vision — style / colour / lighting)
Moodboard captions (text — environment context)
Shot type flag

The prompt-construction system prompt deliberately forbids describing style, colour, or lighting in text — those dimensions arrive through the composite moodboard and moodboard refs as vision. The text prompt describes only subject, location, camera. See deliberate-omission-paper.

This is a deliberate inversion of the A.O.C. approach used in single-shot and campaign contexts, where A.O.C. text prompts specify all three axes. The difference: production shots have dense visual references; campaign shots often have only one or two. Attention allocation follows available anchors.

Linear series inheritance is the cheap substitute for training a custom character model. Inside a series (1A → 1B → 1C …), each shot's render job attaches the previous shot's rendered image as a vision reference. Character design established in 1A propagates through 1B, 1C without training. See sequential-consistency-prompt-architecture-paper for the full mechanism.
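Linear series inheritance can be sketched as a job builder that attaches the previous shot's render and refuses to proceed if that render does not exist yet. The `StillsInputs` and `StillsJob` shapes and `buildStillsJob` are hypothetical; the fields mirror the six input sources listed above.

```typescript
type ShotType = "full-ai" | "composite" | "live";

interface StillsInputs {
  shotName: string;
  textPrompt: string;          // subject, location, camera only
  storyboardImage: string;     // vision: composition only, style ignored
  compositeMoodboard: string;  // vision: style / colour / lighting
  moodboardCaptions: string[]; // text: environment context
  shotType: ShotType;
}

interface StillsJob extends StillsInputs {
  continuityRefs: string[];    // vision: previous same-series render
}

// `renders` maps shot name to rendered still path for shots already rendered.
function buildStillsJob(
  inputs: StillsInputs,
  seriesOrder: string[],
  renders: Record<string, string>
): StillsJob {
  const index = seriesOrder.indexOf(inputs.shotName);
  if (index < 0) {
    throw new Error("Shot not in series: " + inputs.shotName);
  }
  const continuityRefs: string[] = [];
  if (index > 0) {
    const previous = seriesOrder[index - 1];
    const render = renders[previous];
    if (render === undefined) {
      throw new Error("Blocked: " + previous + " has no render yet");
    }
    continuityRefs.push(render);
  }
  return { ...inputs, continuityRefs };
}
```

The first shot of a series carries no continuity reference, which is why its render effectively sets the character design for everything after it.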

2.6 Video

Motion generation. An LLM emits motion-only prompts: one or two sentences describing camera movement and subject action. No aesthetic adjectives, no scene description, no mood words — the scene is already rendered; only motion remains to specify.

The signature capability is first / last frame stitching. For shots flagged as paired key-frames at scenario time, the video job receives the start shot's render as the first frame and the end shot's render as the last frame. The model generates only the motion between, producing a seamless morph — no crossfade, no cut. This is only possible because the pipeline is frame-first: motion is the last decision, not the first.
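A minimal sketch of first / last frame pairing is shown below. The text does not define exactly how a flagged shot finds its partner, so this assumes a shot flagged `isKeyFrame` pairs with the shot that follows it in the list; `VideoShot`, `VideoJob`, and `buildVideoJobs` are hypothetical names.

```typescript
interface VideoShot {
  name: string;
  render: string;        // still produced by the previous stage
  motionPrompt: string;  // motion only, one or two sentences
  isKeyFrame?: boolean;
}

interface VideoJob {
  shotName: string;
  prompt: string;
  firstFrame: string;
  lastFrame?: string;    // set only for paired key-frames
}

// Assumption: a flagged shot pairs with the next shot in list order.
function buildVideoJobs(shots: VideoShot[]): VideoJob[] {
  return shots.map((shot, i) => {
    const job: VideoJob = {
      shotName: shot.name,
      prompt: shot.motionPrompt,
      firstFrame: shot.render,
    };
    const next = shots[i + 1];
    if (shot.isKeyFrame === true && next !== undefined) {
      job.lastFrame = next.render;
    }
    return job;
  });
}
```

Both frames already exist as finished stills when the job is built, which is the practical meaning of the pipeline being frame-first.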

3. The consistency problem and its three solutions

Consistency is where naïve pipelines fail. A thirty-shot project has thirty opportunities for a character's wardrobe to drift, for lighting to shift, for a product's label text to mangle. The pipeline has three interlocking mechanisms.

3.1 Scene-series inheritance (intra-series)

Already described in §2.5: rendering is ordered linearly inside a scene, and each shot inherits the previous shot's render as a vision reference. This gives continuity across a cut sequence without training.

3.2 Reference typology (across shots, within a project)

Every reference image carries a typed role, so "Which reference tells the model where the scene takes place?" has a well-defined answer (the location refs), and so does "Which one tells the model what the character looks like?" (the character refs). Silent-reference usage is disallowed: the LLM architect is instructed to declare, inline, what every attached reference contributes.
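The ban on silent references could be enforced mechanically. The text does not say how declarations are written inline, so the sketch below invents a `[ref:<id>]` tag purely for illustration; `AttachedRef` and `silentReferences` are likewise hypothetical.

```typescript
interface AttachedRef {
  id: string;
  role: string; // location / character / style / prop / color / camera
}

// Return the ids of attached references that the prompt never declares.
// The "[ref:<id>]" tag is a made-up convention for this sketch.
function silentReferences(prompt: string, refs: AttachedRef[]): string[] {
  return refs
    .filter((ref) => prompt.indexOf("[ref:" + ref.id + "]") === -1)
    .map((ref) => ref.id);
}
```

An empty result means every attached image has a stated job; anything else can be sent back to the LLM architect before a generation is spent on it.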

3.3 Lock layers (per-shot, at the prompt level)

For campaign-style shots specifically, A.O.C. prompts are wrapped in a Lock layer — IDENTITY LOCK or PRODUCT LOCK — that replays a structured descriptor of the subject (face, body, hair, skin tone; or form, material, orientation, text, rigid-unit rule). The Lock is immutable across the batch and survives generation variance.
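A Lock layer can be pictured as a descriptor rendered once and replayed verbatim in front of every prompt in the batch. The descriptor fields below follow the lists in the text; the rendered text layout and the names `renderLock` and `wrapBatch` are assumptions, not the A.O.C. format.

```typescript
interface IdentityLock {
  kind: "IDENTITY LOCK";
  face: string;
  body: string;
  hair: string;
  skinTone: string;
}

interface ProductLock {
  kind: "PRODUCT LOCK";
  form: string;
  material: string;
  orientation: string;
  text: string;
  rigidUnitRule: string;
}

type Lock = Readonly<IdentityLock> | Readonly<ProductLock>;

// Render the descriptor as a text header; the layout is illustrative only.
function renderLock(lock: Lock): string {
  const lines: string[] = [lock.kind + ":"];
  if (lock.kind === "IDENTITY LOCK") {
    lines.push("face: " + lock.face, "body: " + lock.body);
    lines.push("hair: " + lock.hair, "skin tone: " + lock.skinTone);
  } else {
    lines.push("form: " + lock.form, "material: " + lock.material);
    lines.push("orientation: " + lock.orientation, "text: " + lock.text);
    lines.push("rigid-unit rule: " + lock.rigidUnitRule);
  }
  return lines.join("\n");
}

// The header is rendered once, so every prompt in the batch gets identical text.
function wrapBatch(lock: Lock, prompts: string[]): string[] {
  const header = renderLock(lock);
  return prompts.map((prompt) => header + "\n\n" + prompt);
}
```

Rendering the header a single time per batch is the simplest way to keep the Lock immutable: no prompt can receive a paraphrased or partially edited descriptor.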

4. The dependency graph

The pipeline is mostly linear between stages, but the inter-shot graph inside the late stages is a forest of linear chains:

setup → scenario → storyboard → moodboard → scenes → video
                                              ↓
                 scene 1: 1A → 1B → 1C → ...        (linear)
                 scene 2: 2A → 2B → 2C → ...        (linear, parallel to scene 1)
                 scene 3: 3A → 3B → ...

Across scenes: no dependency. Scenes render in parallel.
Inside a scene: strict linear order. A shot's render is blocked until prior same-scene shots complete.

Two gating conditions enforce this:

canGenerateStoryboard(shot) // prior same-series shots have storyboard
canGenerateScene(shot)      // prior same-series shots have render

The graph shape is what makes a big project tractable: the work parallelises across scenes while series consistency is guaranteed inside each scene.
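The two gating conditions can be sketched as predicates over a flat shot list. The original signatures take only `shot`; passing the full list as a second argument, the `ShotState` shape, and the `readyToRender` helper are assumptions made so the example is self-contained.

```typescript
interface ShotState {
  name: string;          // e.g. "1B"
  scene: number;         // series key
  hasStoryboard: boolean;
  hasRender: boolean;
}

// Earlier shots in the same series. Names in one series share a numeric
// prefix, so string comparison orders them by letter.
function priorInSeries(shot: ShotState, all: ShotState[]): ShotState[] {
  return all.filter((s) => s.scene === shot.scene && s.name < shot.name);
}

// Prior same-series shots have a storyboard.
function canGenerateStoryboard(shot: ShotState, all: ShotState[]): boolean {
  return priorInSeries(shot, all).every((s) => s.hasStoryboard);
}

// Prior same-series shots have a render.
function canGenerateScene(shot: ShotState, all: ShotState[]): boolean {
  return priorInSeries(shot, all).every((s) => s.hasRender);
}

// Shots that can render right now: at most one per scene, scenes in parallel.
function readyToRender(all: ShotState[]): string[] {
  return all
    .filter((s) => !s.hasRender && canGenerateScene(s, all))
    .map((s) => s.name);
}
```

With 1A rendered and nothing else done, the ready set is the next shot of scene 1 plus the first shot of every other scene, which is the forest-of-chains shape in practice.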

5. The LLM orchestration layer

Every stage except moodboard compositing is an LLM call. The orchestration contract for each call is narrow:

| Stage | LLM role | Temperature | Strict output |
|---|---|---|---|
| analyze-brief | Production director | 0.4 | Project overview + shot list JSON |
| parse-scenario | Cinematographer cut-finder | 0.3 | Shot list JSON |
| edit-scenario | Director edit-enforcer | 0.3 | Modified shot list JSON |
| storyboard-prompts | Storyboard artist | — | {shotName: prompt} JSON |
| image-prompts | Minimal shot-describer | — | {shotName: prompt} JSON, no style/colour/lighting |
| video-prompts | Motion describer | — | {shotName: prompt} JSON, motion only, 1–2 sentences |
| scene-moodboard | Production designer | — | One 16:9 moodboard prompt |
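The strict-output column lends itself to a mechanical check before a stage's output is handed on. The sketch below validates a {shotName: prompt} JSON reply and flags video prompts longer than two sentences. Function names are hypothetical, the shot-name pattern assumes a single uppercase letter suffix, and the sentence counter is a rough punctuation-based heuristic rather than a real tokenizer.

```typescript
// Parse an LLM reply that should be a {shotName: prompt} JSON object.
function parsePromptMap(raw: string): Record<string, string> {
  const parsed: unknown = JSON.parse(raw);
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    throw new Error("Expected a JSON object keyed by shot name");
  }
  const source = parsed as Record<string, unknown>;
  const result: Record<string, string> = {};
  for (const key of Object.keys(source)) {
    const value = source[key];
    if (!/^\d+[A-Z]$/.test(key)) {
      throw new Error("Key is not a shot name: " + key);
    }
    if (typeof value !== "string") {
      throw new Error("Prompt for " + key + " is not a string");
    }
    result[key] = value;
  }
  return result;
}

// Rough heuristic: count non-empty chunks between sentence-ending punctuation.
function sentenceCount(prompt: string): number {
  return prompt.split(/[.!?]+/).filter((part) => part.trim().length > 0).length;
}

// Shot names whose video prompt breaks the 1–2 sentence rule.
function overlongVideoPrompts(raw: string): string[] {
  const prompts = parsePromptMap(raw);
  return Object.keys(prompts).filter((name) => {
    const count = sentenceCount(prompts[name]);
    return count < 1 || count > 2;
  });
}
```

A check like this catches only the structural half of the contract; whether a prompt smuggles in style or mood words still depends on the system prompt's charter.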

Two properties of this layer are load-bearing:

System-prompt discipline. Each role has a narrow charter. The image-prompts system prompt explicitly forbids style / colour / lighting description. The video-prompts system prompt forbids aesthetic adjectives. A broad system prompt lets the LLM contaminate the next stage with concerns that should have stayed in this one. See two-stage-architect-pattern-paper for the temperature-and-role split. In practice these system-prompt charters are operated through the A.O.C. application skills (image and video prompt assembly) and creative direction skills (brief → moodboard / scenario); engine-specific exits like happy-horse-prompt-skill and minimax-voice-skill handle Happy Horse and TTS respectively.

Refinement variants. Every generative stage has a cheap single-shot refinement mode that reuses the existing prompt as a baseline. Refinement is the default iteration mode. The economics of this matter: if refinement were as expensive as fresh generation, producers would either commit early or burn budget; neither produces good output. Cheap refinement keeps producers directing.

6. What the pipeline deliberately does not do

It does not train custom character models as the default consistency mechanism. Training is available but reserved for long-term persona reuse across many projects. For a single production, linear series inheritance + reference typology is cheaper and fast enough.
It does not expose a single "generate the whole project" button. Stage-by-stage progression is the interface — producers remain in the loop at every stage.
It does not attempt narrative authorship. Scenario parsing restructures; it does not invent. The brief is the author; the pipeline is the crew.
It does not try to fight model-specific quirks with fallback logic. Generation failures surface back to the producer as iteration invitations, not as silent retries.

7. What has turned out to be load-bearing

The hard-won decisions, ranked roughly by how painful their absence would be:

The {scene-number}{letter} shot-naming scheme. Every consistency mechanism downstream depends on it. Without the scheme, there is no "series" to enforce continuity over.
The fixed storyboard style. The same style for every project means the downstream stages can treat storyboard images as a neutral composition vocabulary.
Minimalism in image prompts. Forbidding style / colour / lighting in image-prompt text and routing those dimensions through vision is what keeps the generated stills from being over-constrained noise.
The cheap refinement stages. Single-shot iteration is what makes the pipeline useful instead of theoretical.
First / last frame stitching. The pipeline's most visible differentiator at final-video time.
Predictable per-stage effort. Producers trust the system because they can forecast work.

8. Open research questions

A scoring rubric for well-formed shot descriptions (the seven elements). Scoring is ad hoc today; a rubric would improve LLM architect reliability.
Cross-scene consistency for recurring characters. The pipeline enforces series consistency; cross-scene character consistency currently relies on moodboard refs and user discipline. A lightweight cross-scene anchor ("scene 3 features the same character as scene 1") would close the gap without needing a trained model.
Automatic shot-type selection. full-ai / composite / live is currently set at brief time. An LLM-driven heuristic could reduce manual tagging.
DAG rather than forest. Some productions have genuine cross-scene dependencies (a matched cut from scene 1's last shot to scene 2's first shot). A generalised DAG would model this; today it's handled by first/last frame stitching only at adjacent indices.
Measuring consistency. No automated scoring for whether 1A and 1B "feel like the same scene". Could be approached with CLIP similarity, face-embedding distance, or scene-graph comparison.
Cross-modality generalisation. Does the same six-stage shape work for audio-led productions (music videos, podcasts)? For interactive productions (games, experiential video)?
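For the "Measuring consistency" question, one possible starting point is cosine similarity between embeddings of adjacent shots in a series. This is a speculative sketch, not something the pipeline does today: it assumes embeddings have already been computed by some image encoder, and the threshold value is arbitrary.

```typescript
interface EmbeddedShot {
  name: string;
  embedding: number[]; // assumed precomputed by an image encoder
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("Embeddings must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    throw new Error("Cannot compare a zero vector");
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Flag adjacent pairs in a series whose similarity falls below the threshold.
function driftPairs(series: EmbeddedShot[], threshold: number): string[] {
  const flagged: string[] = [];
  for (let i = 1; i < series.length; i++) {
    const score = cosineSimilarity(series[i - 1].embedding, series[i].embedding);
    if (score < threshold) {
      flagged.push(series[i - 1].name + "->" + series[i].name);
    }
  }
  return flagged;
}
```

Comparing only adjacent shots mirrors the linear inheritance chain, so a flagged pair points at the exact render where drift entered the series.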

9. Conclusion

The pipeline described here is the result of running many productions through generative AI models and learning, the hard way, what breaks. What remains is a structure that makes generative AI behave like a production crew: constraints propagate forward, iteration is cheap, style locks hold, consistency is enforced by reference inheritance rather than by training, and the producer stays in the loop at every stage.

The specific implementation choices are not the only shape this architecture can take. But the stage decomposition, the series-based consistency mechanism, the minimality rule at the image-prompt layer, and the first-frame-then-motion ordering have turned out to be the smallest set of commitments that reliably produces consistent, directable, production-quality output from off-the-shelf generative models.

The invitation is the same as the companion A.O.C. paper: adopt it, break it, extend it, replace it.

Citation

Nix, A. (2026). A Generative-AI Video Production Pipeline. Working paper.