A.O.C. — A Prompt Framework for Generative Image Models

Anchor · Optics · Chemistry

Author: Alex Nix
Status: Working draft, for public release


Abstract

Prompts for generative image models are usually written as free-form prose, a checklist of style keywords, or a templated form with loosely named fields. None of these scales to production: free prose is ambiguous, keyword stacks are non-orthogonal, and templates collapse into parameter pages nobody reads.

A.O.C. (Anchor · Optics · Chemistry) is a structured prompt framework for image-generation models. It reduces every visual decision a shot requires to three orthogonal axes, corresponding to the three independently controllable stages of real photographic production (staging, camera, and lighting/film). The framework has been operationalized in production across multi-shot campaigns and cinematic video storyboards, and is offered here as a reusable discipline for prompt engineers, directors, and tool-builders.

This paper defines the framework, the composition rules that make it reliable in production, the failure modes it prevents, and the open research questions that remain.


1. Motivation

The generative-image community has converged on three default prompting strategies, each of which fails differently under production pressure:

  • Free prose. "A moody portrait of a woman holding a red bottle in a warehouse, cinematic, 8K." Fast to write, impossible to debug. When the output is wrong, there is no handle to grasp.
  • Keyword stacks. "portrait, woman, red bottle, warehouse, cinematic, volumetric light, 85mm, bokeh, 8K." Feels more controllable, but the keywords are not orthogonal: "cinematic" overlaps with everything, "volumetric light" prejudges lighting, "85mm" couples subject and camera. Fixing one keyword breaks another.
  • Named-field templates. {subject}, {style}, {lighting}, {camera}, {mood}, {aesthetic}. The fields turn out to overlap: style encodes lighting, lighting encodes mood, mood encodes aesthetic. Users either duplicate or leave fields blank.

Production systems — multi-shot campaigns, narrative storyboards, character-consistent series — need a prompt format that:

  1. Can be composed (one shot of five must fit alongside the other four)
  2. Can be debugged (a bad image must be traceable to a specific decision)
  3. Can be taught (new practitioners must be able to learn it)
  4. Can be automated (LLM "architect" agents must emit it reliably)

A.O.C. is the framework the author converged on through several years of building production generative-image pipelines. It is opinionated by design and optimized for the production case.


2. The framework

A.O.C. decomposes every shot into three orthogonal axes:

2.1 A — Anchor

What is in the frame and where it is.

The Anchor specifies:

  • Subject — what the frame is about (one subject, one moment)
  • Location — where it is
  • Orientation — how the subject faces the camera, how it sits in the environment, scale relationship to surroundings

The Anchor answers "what am I looking at, where is it, and how is it placed." It does not say anything about how the camera sees it or how the light renders it.

2.2 O — Optics

How the camera sees it.

Optics specifies the photographic mechanics:

  • Lens behavior (focal length effect: environmental, natural, compressed, macro)
  • Camera distance (measured, not vague)
  • Depth of field (shallow / deep, specified by f-stop behavior)
  • Focus plane (where the sharpness sits)
  • Exposure (clean, long, brief)
  • Effects (motion blur, occlusion, smear, bokeh — off by default, described as optical mechanisms when used)

Effects must earn their place. A clean, sharp image is the baseline. When an effect is used, it must be described by the physical behavior that produces it — "directional smear from subject motion during a 1/4-second exposure" — not by a brand or an aesthetic word.

Optics is described as function, not as brand. "85mm" is acceptable as a shorthand for "moderate portrait compression with natural subject separation"; "shot on an ARRI" is not.

2.3 C — Chemistry

How light and material interact to form the image.

Chemistry specifies the imaging physics:

  • Lighting behavior (source position, directionality, falloff, shadow-edge quality, contrast ratio)
  • Tonal response (exposure state, highlight / shadow handling)
  • Color interaction (temperature in K, saturation, cast)
  • Grain / texture (film-stock behavior, noise, compression, surface rendering)
  • Atmosphere (haze, fog, particulate, depth separation)

Mood is engineered through physical imaging properties, not through mood words. Instead of "moody and cinematic", the Chemistry section should read:

Single directional light from left at 45°, soft shadow edge (diffused through scrim), 4:1 contrast ratio, tungsten colour temperature around 3200K, heavy grain consistent with ISO 3200 film stock, slight atmospheric haze lifting the blacks in the midground.

The generated image will feel moody and cinematic — but now the feeling is a consequence of specified physics, not an incantation.

2.4 Why these three, not more, not fewer

Orthogonality. In a real photo shoot, three independent teams make three independent decisions: the director stages the scene (Anchor), the DP frames and shoots it (Optics), the gaffer and colourist light and grade it (Chemistry). The framework mirrors the division of labour that photography has already converged on.

Completeness. Every visual decision a still-image model must make falls into one of the three. Anything that falls outside — motion over time, sound, narrative causality — is handled by a sibling framework (motion language for video, scenario and storyboard systems for narrative).

Teachability. Three is the largest set a practitioner can reliably hold in working memory while writing. Two axes collapse too much together; four starts to feel like a checklist.


3. A well-formed A.O.C. prompt

ANCHOR:
A 50ml amber glass dropper bottle with a black matte dropper cap, label text
"Serum 01" readable across the front, standing upright on a polished dark-walnut
countertop in a product-prep kitchen. The bottle is centred in the frame,
three-quarter angle toward camera, grounded with its shadow, surrounded at the
periphery by soft-focus studio props (linen towel, ceramic dish, frosted glass
pitcher) that establish the environment without competing for attention.

OPTICS:
Medium-close product framing, 85mm equivalent, camera at bottle-height (low
table level, very slight downward tilt of 5°). Shallow depth of field, f/2.8
behavior, focus locked on the label text. Clean exposure, no motion blur,
no lens flare. Slight peripheral defocus falling off from the subject plane
to a creamy environment.

CHEMISTRY:
Single directional key light from upper-left at 45°, diffused through a
large softbox (shadow-edge soft). Bounce fill from right at ratio 1:3 to
preserve form modelling on the bottle's amber body. Color temperature 4200K
(neutral-warm). Midtone-biased tonal curve, slight lift in the blacks,
highlights rolled off. Fine film grain (ISO 400 stock). No atmospheric
haze. Subtle specular highlight travelling down the bottle's right shoulder
suggests the glass surface.

What this prompt does well:

  • Every Anchor statement is about what is in the frame and where.
  • Every Optics statement is about how the camera captures it.
  • Every Chemistry statement is about how light and film render it.
  • No axis overlaps with another. Changing the lens doesn't disturb the lighting; changing the lighting doesn't disturb the composition.
  • A failure ("label unreadable") is traceable to one axis: here Optics (focus plane), not Anchor (position).
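
As a minimal sketch of the same idea in code, the three axes can be held as separate fields and rendered into the block layout used above. The class name `AOCPrompt` and its `render` method are illustrative assumptions, not part of the framework itself.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AOCPrompt:
    """One shot, factored into the three A.O.C. axes (hypothetical container)."""
    anchor: str     # what is in the frame and where
    optics: str     # how the camera sees it
    chemistry: str  # how light and material render it

    def render(self) -> str:
        # Emit the axes in the fixed ANCHOR / OPTICS / CHEMISTRY order.
        sections = (
            ("ANCHOR", self.anchor),
            ("OPTICS", self.optics),
            ("CHEMISTRY", self.chemistry),
        )
        return "\n\n".join(f"{name}:\n{text.strip()}" for name, text in sections)


shot = AOCPrompt(
    anchor="Amber glass dropper bottle, upright on a dark-walnut countertop.",
    optics="85mm equivalent, f/2.8 behavior, focus locked on the label text.",
    chemistry="Single key light from upper-left at 45°, 4200K, fine film grain.",
)

# Swapping the lens produces a new shot; Anchor and Chemistry are untouched.
wider = replace(shot, optics="35mm equivalent, deep depth of field, f/8 behavior.")
print(wider.render())
```

Keeping the axes as separate fields makes the debugging claim concrete: a fix for an unreadable label is a change to `optics` only.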

4. Composition rules

A.O.C. as axes is a start. A.O.C. as a reliable production tool requires these composition rules, which emerged from operating multi-shot campaigns under it:

4.1 Resolution order

When multiple input references disagree about a dimension, resolve in a fixed sequence: environment → style → color → subject/product. The first reference to claim a dimension wins; others adapt. Without this rule, LLM architects and human writers alternate which reference to privilege and consistency collapses.
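
The "first claim wins" rule can be sketched as a single pass over the references in the fixed order. The function name `resolve_dimensions`, the dictionary layout, and the example dimension names are assumptions made for illustration only.

```python
RESOLUTION_ORDER = ["environment", "style", "color", "subject"]


def resolve_dimensions(
    references: dict[str, dict[str, str]],
) -> dict[str, tuple[str, str]]:
    """Map each dimension to (winning reference role, value).

    `references` maps a reference role to the dimensions it claims.
    Roles are visited in RESOLUTION_ORDER; the first to claim a dimension wins.
    """
    resolved: dict[str, tuple[str, str]] = {}
    for role in RESOLUTION_ORDER:
        for dimension, value in references.get(role, {}).items():
            # setdefault keeps an earlier claim and ignores later ones.
            resolved.setdefault(dimension, (role, value))
    return resolved


# Hypothetical conflict: both references claim the color temperature.
refs = {
    "style": {"color_temperature": "3200K", "grain": "heavy"},
    "environment": {"color_temperature": "5600K", "location": "warehouse"},
}
winners = resolve_dimensions(refs)
print(winners["color_temperature"])  # ('environment', '5600K')
```

The input order of `refs` does not matter here; only the position of each role in `RESOLUTION_ORDER` decides the winner.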

4.2 Continuity rule (for multi-shot campaigns)

Across all shots of a campaign: the Anchor's where and the Chemistry's lighting state are held constant. Only Optics varies shot-to-shot. This is what makes five photographs look like one shoot.

This single-environment constraint is load-bearing. It is why a multi-shot campaign can be generated from one shared creative direction block applied to all shots: the generator is not re-rolling the scene, only re-framing it.
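
A continuity check along these lines could run before any shot is sent to a generator. This is a rough sketch: the per-shot dictionary and its `location`, `lighting`, and `optics` keys are hypothetical, chosen only to illustrate the rule.

```python
def continuity_violations(shots: list[dict[str, str]]) -> list[str]:
    """Return messages for shots whose location or lighting differ from shot 1."""
    if not shots:
        return []
    first = shots[0]
    problems = []
    for index, shot in enumerate(shots[1:], start=2):
        # Only the Anchor's "where" and the Chemistry's lighting are held constant;
        # Optics is free to vary, so it is not compared.
        for key in ("location", "lighting"):
            if shot[key] != first[key]:
                problems.append(f"shot {index}: {key} differs from shot 1")
    return problems


campaign = [
    {"location": "product-prep kitchen", "lighting": "softbox key, 4200K", "optics": "85mm, f/2.8"},
    {"location": "product-prep kitchen", "lighting": "softbox key, 4200K", "optics": "macro on label"},
    {"location": "warehouse", "lighting": "softbox key, 4200K", "optics": "35mm, f/8"},
]
print(continuity_violations(campaign))  # ['shot 3: location differs from shot 1']
```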

4.3 Shot-role sequence

Each shot position in a multi-shot campaign has a pre-declared role:

  • Product campaign: environment-master → part-of-product → full-product → detail-closeup → atmospheric-frame.
  • Character / persona campaign: lifestyle-wide → portrait-hero → product-interaction → detail-moment → atmospheric-story.

The roles force variety in camera distance, framing, and emphasis. Without them, LLM architects drift toward five similar shots because each concept "felt right on its own".
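
One way to make the roles pre-declared rather than improvised is to assign them by shot position before any prompt is written. In the sketch below the role names come from the lists above; the helper `assign_roles` and the choice to cycle when more shots than roles are requested are assumptions.

```python
from itertools import cycle, islice

PRODUCT_ROLES = [
    "environment-master", "part-of-product", "full-product",
    "detail-closeup", "atmospheric-frame",
]
CHARACTER_ROLES = [
    "lifestyle-wide", "portrait-hero", "product-interaction",
    "detail-moment", "atmospheric-story",
]


def assign_roles(roles: list[str], n_shots: int) -> list[str]:
    """Return one role per shot position, cycling if n_shots exceeds the list."""
    if n_shots < 0:
        raise ValueError("n_shots must be non-negative")
    return list(islice(cycle(roles), n_shots))


print(assign_roles(PRODUCT_ROLES, 5))
print(assign_roles(CHARACTER_ROLES, 7)[-2:])  # ['lifestyle-wide', 'portrait-hero']
```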

4.4 Image-input declaration

Every A.O.C. prompt must state, inline, which reference image contributes what. No reference may silently influence the output. A prompt that attaches Image 2 without declaring "environment and shadow direction sourced from Image 2" permits the model to invent a use for Image 2 — which it will.
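
A crude automated check for this rule might compare the images attached to a request with the image numbers the prompt text mentions. The sketch assumes references are written as "Image N" and treats any mention as a declaration, which is only a proxy for a real statement of use; the function name is hypothetical.

```python
import re


def check_image_declarations(prompt: str, attached: set[int]) -> dict[str, set[int]]:
    """Compare attached image numbers with the 'Image N' mentions in the prompt."""
    mentioned = {int(n) for n in re.findall(r"\bImage (\d+)\b", prompt)}
    return {
        # Attached but never mentioned: a silent reference.
        "undeclared": attached - mentioned,
        # Mentioned but not attached: a reference that does not exist.
        "nonexistent": mentioned - attached,
    }


prompt = (
    "Environment and shadow direction sourced from Image 2. "
    "Palette sourced from Image 4."
)
report = check_image_declarations(prompt, attached={1, 2})
print(report)
```

Here Image 1 is attached without a declared use and Image 4 is declared without being attached, so both sets in the report are non-empty.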

4.5 Identity / rigid-unit preservation

When a subject identity must survive across shots (a person, a specific product), the Lock is a separate layer, not part of the A.O.C. block. A.O.C. describes the shot. The Lock (IDENTITY LOCK, PRODUCT LOCK) guarantees the subject survives the shot unchanged. Mixing these concerns pollutes the A.O.C. axes and loses both. See the companion paper lock-layer-pattern-paper.


5. Failure modes

A.O.C. is valuable mostly because of the specific failures it rules out.

| Failure | Symptom | A.O.C. fix |
|---|---|---|
| Mood words in Chemistry | "moody", "cinematic", "atmospheric" | Replace with physical lighting and film-stock behavior |
| Brand words in Optics | "shot on ARRI", "Fujifilm look" | Replace with optical mechanics and film behavior |
| Collapsed axes | Lighting described in Anchor, framing in Chemistry | Split. Keep each axis disciplined |
| Silent references | Images attached with no declared use | Require image-input declaration for every prompt |
| Default-on effects | Blur, bokeh, grain added "for cinema feel" | Effects off by default; must be justified |
| Adjective chain | "moody, cinematic, dramatic, epic" | Each must translate to a physical mechanism or be cut |
| Axis bleed | Optics value that implies lighting ("natural light 85mm") | Extract the lighting claim to Chemistry |
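
Some of these failures lend themselves to a simple word-level lint. The sketch below flags brand words in Optics and mood words in Chemistry; the word lists are small illustrative assumptions (a real list would be longer and would need care with words such as "atmospheric", which is legitimate in "atmospheric haze"), and `lint_block` is a hypothetical name.

```python
import re

# Illustrative, deliberately incomplete word lists.
MOOD_WORDS = {"moody", "cinematic", "dramatic", "epic"}
BRAND_WORDS = {"arri", "fujifilm"}


def lint_block(optics: str, chemistry: str) -> list[str]:
    """Flag brand words in Optics and mood words in Chemistry."""
    findings = []
    checks = (
        ("OPTICS", optics, BRAND_WORDS, "brand word"),
        ("CHEMISTRY", chemistry, MOOD_WORDS, "mood word"),
    )
    for axis, text, banned, label in checks:
        # Compare whole lowercase words so "epic" does not match "epicentre".
        words = set(re.findall(r"[a-z]+", text.lower()))
        for word in sorted(words & banned):
            findings.append(f"{axis}: {label} '{word}'")
    return findings


print(lint_block("85mm, shot on ARRI", "moody and cinematic, 3200K"))
```

A lint like this only catches vocabulary; collapsed axes and axis bleed need a reader, or a model, that understands what each sentence is claiming.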


6. Operationalization: LLM architects

A.O.C. works as a human discipline. It works better as an LLM-emitted output. Two specialized LLM "architects" are typically deployed:

  • Product A.O.C. architect — emits exactly 5 A.O.C. blocks covering the 5 product shot roles.
  • Character A.O.C. architect — emits N blocks (configurable) over cycling character shot roles.

Architect design principles that fell out of production use:

  1. Shared Creative Direction. A prior LLM stage ("Creative Director") emits one environment description. The A.O.C. architect receives that single block and writes all N shots inside it. This enforces the continuity rule (§4.2) structurally. See two-stage-architect-pattern-paper.
  2. Explicit image order. The architect's input includes an IMAGE ORDER section listing reference images by position (Image 1: background, Image 2: style, …). The system prompt forbids using numbers that don't exist. This eliminates reference hallucination.
  3. Declared fallbacks. When a reference is missing, the architect must declare the fallback in the prompt ("no background reference provided; environment drawn from the Creative Direction"). Silent substitution breaks debugging.
  4. Separator discipline. Blocks are separated by a simple, distinctive delimiter (e.g. ^ on its own line) that the architect is unlikely to produce inside a block, which keeps downstream parsing trivial.
  5. Temperature sweet spot. 0.5 for the A.O.C. stage. Higher temperatures produce undisciplined prose; lower ones produce five nearly-identical shots.

Sample architect system-message skeleton (abridged):

ROLE: You are the A.O.C. Architect. Generate exactly N A.O.C. prompts...
CRITICAL: Use the exact image numbers from IMAGE ORDER — do not use numbers that don't exist.
FORMAT: ANCHOR / OPTICS / CHEMISTRY, blocks separated by ^ on its own line.
RULES: In every prompt, explicitly state how each provided reference is used...
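
Downstream, output in this format can be split and validated with very little code. The following is a sketch under stated assumptions: the parser name, the choice to raise on a missing axis, and the sample text are illustrative and do not describe the production pipeline.

```python
import re

AXES = ("ANCHOR", "OPTICS", "CHEMISTRY")


def parse_architect_output(raw: str) -> list[dict[str, str]]:
    """Split architect output on '^' lines and extract the three axes per block."""
    shots = []
    for block in re.split(r"^[ \t]*\^[ \t]*$", raw, flags=re.MULTILINE):
        if not block.strip():
            continue
        sections: dict[str, list[str]] = {}
        current = None
        for line in block.strip().splitlines():
            stripped = line.strip()
            if stripped.endswith(":") and stripped[:-1] in AXES:
                # A header line such as "OPTICS:" starts a new section.
                current = stripped[:-1]
                sections[current] = []
            elif current is not None:
                sections[current].append(stripped)
        missing = [axis for axis in AXES if axis not in sections]
        if missing:
            raise ValueError(f"block is missing axes: {missing}")
        shots.append({axis: " ".join(sections[axis]).strip() for axis in AXES})
    return shots


raw = """ANCHOR:
Bottle on walnut countertop.
OPTICS:
85mm equivalent, f/2.8 behavior.
CHEMISTRY:
Key light upper-left, 4200K.
^
ANCHOR:
Bottle on walnut countertop.
OPTICS:
Macro framing on the label.
CHEMISTRY:
Key light upper-left, 4200K.
"""
shots = parse_architect_output(raw)
print(len(shots), shots[1]["OPTICS"])
```

Rejecting a block with a missing axis, rather than filling it in silently, follows the same reasoning as the declared-fallbacks principle: a malformed block should be visible.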

7. Relationship to adjacent frameworks

  • Reference authority hierarchies. A metadata-level framework that specifies which reference wins on which dimension. A.O.C. is the prose level; an authority hierarchy is the schema level. They compose cleanly.
  • STCV, SPECS, PromptMaster prompt frameworks. Varying levels of structure, none with A.O.C.'s photographic orthogonality. A.O.C. states why it uses these three axes, which the alternatives typically do not. A controlled study would be valuable.
  • Cinematic motion frameworks. The video counterpart adds a Motion axis (camera movement + subject action) while keeping Anchor / Optics / Chemistry as-is. Early evidence that A.O.C. extends cleanly to time-based media. See generative-ai-production-pipeline-paper.
  • ControlNet stack. Operates at the conditioning-tensor level, not the prompt level. A.O.C. is orthogonal and composes with ControlNet cleanly (Anchor can encode pose-reference intent, Chemistry can encode depth-map intent).
  • Sequential-consistency patterns. The four patterns of sequential-consistency-prompt-architecture-paper (reference-inheritance, anti-repetition, neighbor-aware, reference-condensation) compose with A.O.C. as the per-shot vocabulary while they handle the across-shot dimension.

8. Open research questions

  • Formal rubric. A scoring rubric for "well-formed A.O.C. prompt" — likely three axes × required/optional fields × pass/fail per field. Needed for teaching and for automated QA of architect outputs.
  • Controlled human study. A.O.C. vs. free prose vs. keyword stack, blinded human raters, fixed concept, multiple models. Effect size on output quality and on consistency across repeated generations.
  • Cross-modal extension. Does A.O.C. survive the jump to video with only a Motion axis added? To 3D with Chemistry split into Light and Material? To audio? Probably not; a sibling framework is likely cleaner.
  • LLM architect failure atlas. When LLMs emit A.O.C. at scale, what specific failure modes appear? Axis bleed? Template collapse under temperature? Need a corpus of broken outputs classified by mode.
  • Empirical basis for the resolution order. The environment → style → color → subject precedence is based on production experience, not on user study. Alternative orderings are plausible.
  • Naming. "Chemistry" is a slight stretch — it covers light + film + atmosphere together. A cleaner term may exist; the community is invited to propose one.

9. Conclusion

A.O.C. is a small framework making a large claim: that every still-image generation prompt can be factored into three orthogonal axes, that those three axes correspond to three real stages of photographic production, and that disciplined adherence to the factorization gives prompts that compose, that debug, that teach, and that an LLM can reliably emit.

It is offered here not as a finished specification but as a working tool already in production. The invitation is open: adopt it, break it, extend it, replace it.


Citation

Nix, A. (2026). A.O.C. — A Prompt Framework for Generative Image Models. Working paper.