
An Autonomous AI Agent in Digital Production — Architecture, Cases, and Lessons

Abstract

This paper documents the design, deployment, and operation of an autonomous AI agent — code-named Arina — inside a real digital production studio between March and May 2026. The agent operates a Telegram user account (not the Bot API) indistinguishable from a human team member, runs as a cron-fired process rather than a real-time daemon, and acts as a producer-coordinator across three concurrent projects. Over the observation window the agent handled more than 3000 messages in one crisis project, processed 95 hiring candidates through a six-stage funnel, maintained file-based memory that became the substrate of a commercial exit negotiation in another project, and — most consequentially — assembled a 22-line scope-creep ledger of significant additional costs on the Swiss-watch project that no human producer could have reconstructed.

The paper treats the agent not as a chatbot but as a production stage inside a small organisation. It names seven load-bearing patterns: identity by persona-through-prohibitions, tiered-agent-autonomy, two-chat-architecture, agent-file-memory-as-asset, incident-driven-configuration, context-before-action, and scope-creep-ledger — the last of these being the deployment's signature use case: a task the agent performs that a human producer is structurally incapable of doing. It documents three real failure modes — the eighteen-template hiring incident, the triple-sent message, and the residual memory leak of a deleted project — and the recoveries from each. It closes with the ethical questions an agent that passes as human raises about informed consent, AI-assisted monitoring of work conversations, and operator accountability.


1. Setup

The deployment took place inside a small digital production studio working concurrently on three projects:

  • Project A — 3D renders and animation for a Swiss watch brand (renamed throughout this paper). Coordination across two Telegram forum chats — an internal team chat in Russian, a client chat in English. The team is the lead producer (also the agent's operator), two specialists (textures and lighting; 3D modelling and parts), and the client lead.
  • Project B — an advertising campaign for a major retail client. Blue-screen footage composited with AI-generated backgrounds. A 29-working-day production that became a crisis by week three; the agent's role shifted from coordinator to systematic documentarian. Over 3000 messages, 138 deliveries from the studio side, 46 rounds of art-director edits, 27 instances of changed inputs.
  • Project C — a 30-second AI + CG advertising film for a startup client. The project is at the pre-sale stage and involves an international partner. The agent's mode here is fully autonomous because the partner is collaborating, not commissioning.

In parallel, the agent runs an autonomous hiring channel to recruit AI artists and 3D modellers from a Telegram freelance community: 95 candidates processed in a single batch in about four hours.

The agent's operator is the lead producer of the studio. Everywhere this paper says "the operator" it means a single named human who built and supervises the agent. Names of clients, team members, and client-side personnel are anonymised; their roles are preserved because the narrative needs them.

The observation window is March–May 2026. The platform underneath the agent is OpenClaw, an open-source framework for autonomous agents. OpenClaw provides the runtime, the model-routing layer, the tool integrations (shell, file, web search, image analysis, memory search), and the cron scheduler. Everything specific to this deployment — the persona, the memory, the scripts, the chat configuration — is laid on top in the agent's workspace directory.

The aim of the paper is practical: to describe what a working production deployment of an autonomous agent looks like in enough detail that another team can disagree with the choices.


2. System architecture

2.1 Hardware and runtime

The agent runs on a Mac Mini (ARM64, macOS) dedicated to the deployment. The choice of platform was pragmatic: Telethon (the Telegram client library) is stable on macOS; Apple-ecosystem compatibility helps because the studio's media tools already live there; and the host can run a local quantised embedding model for the memory's semantic index.

OpenClaw itself runs as a small daemon coordinating the model calls and tool invocations. The model used is configurable per call. Two models are wired in:

| Tier | Model | Use |
|---|---|---|
| Routine | zai/glm-5.1 | Chat monitoring, template responses, status updates |
| Decisions | anthropic/claude-opus-4-6 | Situation analysis, document drafting, decision-grade reasoning |

The routing pattern is cheap model for routine, expensive model for decisions. The cheap model handles the high-frequency calls (one or more per cron firing). The expensive model is reserved for the moments where reasoning quality matters: drafting a long client-facing message, analysing a complex chat thread, composing the evidence document discussed in § 6.
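A minimal sketch of this routing rule follows. The two model identifiers come from the table above; the task labels and the function name `pick_model` are hypothetical, since the paper does not document how the deployment labels its calls.

```python
from enum import Enum


class Tier(Enum):
    ROUTINE = "routine"
    DECISION = "decision"


# Model identifiers as named in the paper's routing table.
MODEL_BY_TIER = {
    Tier.ROUTINE: "zai/glm-5.1",
    Tier.DECISION: "anthropic/claude-opus-4-6",
}

# Hypothetical task labels; the real task taxonomy is an assumption.
DECISION_TASKS = {"situation_analysis", "document_drafting", "evidence_document"}


def pick_model(task: str) -> str:
    """Return the model id for a task: the expensive tier only for decision-grade work."""
    tier = Tier.DECISION if task in DECISION_TASKS else Tier.ROUTINE
    return MODEL_BY_TIER[tier]
```

Defaulting unknown tasks to the cheap tier keeps the high-frequency cron calls inexpensive; a stricter deployment might prefer to fail on unknown labels instead.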

2.2 Telegram as the agent's body

The single most consequential architectural decision is that the agent operates a real Telegram user account, not a bot. The library is Telethon v1.42.0. The account looks, in any chat member list, like an ordinary person: photo, name, online/offline status, "last seen" timestamps. There is no BOT badge.

This matters because the persona built on top of it (§ 3) rests on the agent being read as human by collaborators. A bot badge would break the construction immediately. The user-account choice also enables capabilities the Bot API does not give equally: full participation in Telegram forum topics, message editing and "delete for everyone", and unconstrained reading of chat history. The agent uses all three of these during operations and recoveries.

The connection runs through an HTTP proxy required by regional restrictions. The proxy is unstable: the connection drops every 3–4 requests, so the agent's scripts are written assuming a reconnection every 2–3 operations. The trade-off is documented as long-running technical debt; a SOCKS5 alternative is on the roadmap. The instability is also the architectural reason the agent runs as a cron job and not as a real-time daemon (see § 2.4).

A practical detail worth noting: the proxy must be passed directly into the Telethon constructor, and the http_proxy environment variable must be unset before the Telethon process starts. Otherwise other network tools on the same host conflict with the agent's proxy state. This belongs to the catalogue of incident-driven-configuration entries promoted to the agent's TOOLS.md.
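The two steps can be sketched as below. This is an illustration under assumptions, not the deployment's actual script: the helper names are hypothetical, the list of environment variables is broader than the single `http_proxy` the text names, and the dict passed as `proxy` follows the form Telethon documents for its python-socks integration (which has to be installed separately).

```python
import os

# The text names http_proxy; the other three are an assumption added for safety.
PROXY_ENV_VARS = ("http_proxy", "https_proxy", "HTTP_PROXY", "HTTPS_PROXY")


def scrub_proxy_env(environ=None) -> list:
    """Remove proxy variables from the environment and return the names removed."""
    env = os.environ if environ is None else environ
    removed = []
    for name in PROXY_ENV_VARS:
        if env.pop(name, None) is not None:
            removed.append(name)
    return removed


def build_proxy(host: str, port: int) -> dict:
    """Proxy settings in the dict form accepted by Telethon's `proxy` argument."""
    return {"proxy_type": "http", "addr": host, "port": port}


def make_client(session: str, api_id: int, api_hash: str, host: str, port: int):
    """Create a Telethon client with the proxy passed explicitly to the constructor."""
    # Imported lazily so the helpers above stay usable without Telethon installed.
    from telethon import TelegramClient

    scrub_proxy_env()
    return TelegramClient(session, api_id, api_hash, proxy=build_proxy(host, port))
```

Keeping the proxy out of the environment means other tools on the host never inherit it, and the agent never inherits theirs.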

2.3 The three-layer workspace

The agent's workspace lives in a dedicated directory and follows a deliberate three-layer separation:

  • SOUL.md: who the agent is. Identity, tone rules, prohibitions, the persona's grammatical gender rule, the operator-isolation rule. The contents of this file are private to the agent and never disclosed in conversation.
  • AGENTS.md: what the agent does. Operational rules, role boundaries, the autonomous-vs-approval-gated split per chat (see tiered-agent-autonomy).
  • TOOLS.md: what the agent uses. API keys, session paths, script locations, the proxy quirks, the per-tool gotchas.

Auxiliary files extend the spine: IDENTITY.md (the agent's name and character details), HEARTBEAT.md (the cron algorithm and the "read 20–30 messages before any send" rule from context-before-action), PERSONA.md (detailed communication patterns), and hiring_criteria.md (criteria for the candidate funnel).

The separation matters because it lets the operator modify one without touching the others. Adding a new tool does not require rewriting the persona; refining a tonal rule does not affect the script configuration. The persona, the rules, and the tools accumulate at independent rates.

2.4 Cron heartbeat plus hot mode

The agent does not run continuously. It is fired by cron once per hour during the working day (10:00–23:00 MSK), plus three late-night safety checks (21:00, 22:00, 23:00), plus a morning summary at 10:00. Each firing is independent: the agent loads its persona, reads its state files, checks the configured chats, replies if anything requires reply, updates state, exits.

The choice of cron-over-daemon was forced by the proxy instability of § 2.2 and made into a virtue by the recovery property: if a firing fails (proxy timeout, JSON parse error, model error), the next firing starts clean. There is no daemon state to corrupt.

The cost of cron is latency. The worst case is 60 minutes between an incoming message and the agent's reply, which would be unacceptable in conversational settings. The compensating mechanism is hot mode: whenever the agent sends a message or detects activity in a chat, it schedules 15 one-time crons — one per minute for 15 minutes, each set to delete itself after running — to re-check the same chat. The agent then has a 15-minute window of approximately-real-time responsiveness around the burst, which covers the typical conversational beat: most replies arrive within a few minutes of the previous message.
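How the one-time jobs are registered depends on the OpenClaw scheduler, which this paper does not detail. The timing itself is simple. A minimal sketch, with the hypothetical helper `hot_mode_schedule` computing the fifteen re-check times:

```python
from datetime import datetime, timedelta

HOT_MODE_MINUTES = 15  # window length stated in the text


def hot_mode_schedule(now: datetime, minutes: int = HOT_MODE_MINUTES) -> list:
    """Return one re-check time per minute, starting with the next whole minute."""
    start = now.replace(second=0, microsecond=0)
    return [start + timedelta(minutes=i) for i in range(1, minutes + 1)]


# Each returned timestamp would be handed to the scheduler as a self-deleting job.
for run_at in hot_mode_schedule(datetime(2026, 3, 25, 14, 7, 30))[:3]:
    print(run_at.strftime("%H:%M"))
```

Because each job is independent and removes itself, a failed re-check costs one minute of latency rather than the whole window.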

The heartbeat algorithm at each firing:

1. Check working hours. If outside, exit silently.
1a. If first heartbeat of the day: read all chats overnight,
    refresh the operator's task list, send the morning summary.
    Don't write to work chats unless prompted.
2. Read monitor_state.json (last_seen per chat).
3. Read monitored_chats.json (list of chats with autonomous flag).
4. For each chat with autonomous: true:
   4a. Connect through Telethon.
   4b. Get messages newer than last_seen.
   4c. If any are not from the agent's own account:
       - Load PERSONA.md.
       - Read the last 20–30 messages of the chat for context.
       - Generate the reply.
       - Send via tg_send.py.
       - Update monitor_state.json.
5. If nothing requires reply, exit silently.
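The numbered steps above can be sketched in Python as follows. This is a simplified illustration, not the deployment's code: the JSON layouts (a mapping of chat id to a numeric last-seen marker, and a list of chat records) are assumptions, the four callables are hypothetical stand-ins for the Telethon-backed scripts and the model call, and step 1a (the morning summary) is omitted.

```python
import json
from datetime import datetime, time
from pathlib import Path

WORK_START, WORK_END = time(10, 0), time(23, 0)  # working hours from the text
CONTEXT_WINDOW = 30  # upper bound of the "20-30 messages" rule


def in_working_hours(now: datetime) -> bool:
    return WORK_START <= now.time() <= WORK_END


def heartbeat(now, state_path: Path, chats_path: Path,
              fetch_new, read_context, make_reply, send) -> list:
    """Run one cron firing and return the (chat_id, text) pairs that were sent."""
    if not in_working_hours(now):
        return []                                        # step 1: exit silently
    state = json.loads(state_path.read_text())           # step 2: last_seen per chat
    chats = json.loads(chats_path.read_text())           # step 3: chats + autonomous flag
    sent = []
    for chat in chats:
        if not chat.get("autonomous"):
            continue                                     # step 4: autonomous chats only
        chat_id = str(chat["id"])
        new = fetch_new(chat_id, state.get(chat_id, 0))  # step 4b
        incoming = [m for m in new if not m["from_self"]]
        if not incoming:
            continue                                     # step 5: nothing to answer
        # Context-before-action: re-read the chat itself, not just the new messages.
        context = read_context(chat_id, CONTEXT_WINDOW)
        text = make_reply(context, incoming)
        send(chat_id, text)
        state[chat_id] = max(m["ts"] for m in new)
        sent.append((chat_id, text))
    state_path.write_text(json.dumps(state))
    return sent
```

Passing the I/O functions in as arguments keeps the control flow testable without a Telegram connection.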

The "read 20–30 messages before any send" step at 4c is the context-before-action rule. It looks redundant — the agent already knows what's new — but the cautionary example of § 7 explains why it is non-negotiable.

2.5 State files

Cron is stateless across firings; persistent state lives in JSON files in the scripts directory:

| File | Holds |
|---|---|
| monitor_state.json | last_seen / last_sent timestamps per chat. If corrupted, the agent re-processes messages already replied to. |
| monitored_chats.json | List of chats the agent operates in, each with an autonomous flag. |
| pending_replies.json | Queue of drafted replies awaiting approval. |
| pm_rate.json | Rate limits for private-message sends (anti-spam). |
| arina_auto_state.json | Autonomous-mode toggle and per-chat overrides. |
| hiring_state.json | The 95-candidate funnel (see § 7). |

The state files are an index over the chats, not the source of truth. The context-before-action rule exists precisely because the index can drift; the 18-template incident in § 7 is what happens when a script trusts the index instead of reading the chat.
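The paper does not say how the state files are read and written. One common way to limit the corruption risk is an atomic write plus a tolerant read; the sketch below assumes that approach, and both helper names are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path


def load_state(path: Path) -> dict:
    """Read a JSON state file, treating a missing or corrupt file as empty.

    An empty result means the index is unreliable, so the caller should fall
    back to reading the chat itself before sending anything.
    """
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return {}
    return data if isinstance(data, dict) else {}


def save_state(path: Path, state: dict) -> None:
    """Write to a temporary file in the same directory, then rename it into place."""
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(state, fh, ensure_ascii=False, indent=2)
    os.replace(tmp, path)  # the rename is atomic on the same filesystem
```

With this shape, a firing that dies mid-write leaves the previous state file intact instead of a truncated one.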

2.6 Scripts

A small set of Python scripts implements the chat-side operations. The two foundational ones:

  • arina_check.py — given a chat ID (or all chats), returns JSON describing new messages since last_seen.
  • tg_send.py <chat_id> "<text>" normal — sends a message in the agent's voice. The third parameter is the send mode (no special formatting).

The hiring pipeline adds: hiring_scan.py, hiring_processor.py, hiring_sender.py, hiring_reply_scan.py, hiring_send_replies.py, plus the reactive recovery scripts hiring_audit.py, hiring_delete_extra.py, hiring_context_dump.py, and the wrapping hiring_loop.sh. The full hiring sequence is described in § 7.

2.7 Auxiliary capabilities

Two external services support the agent on the long tail:

  • Tavily API for web search.
  • Jina AI (r.jina.ai) for extracting content from web pages.

Voice messages are handled locally:

  • Speech to text: a local Whisper model (base, 139 MB) runs over .ogg files: whisper audio.ogg --model base --language ru.
  • Text to speech: Replicate's minimax/speech-2.8-turbo, post-processed through ffmpeg into ogg/opus for direct Telegram voice-message sending.

The semantic memory index uses nomic-embed-text-v1.5 (GGUF quantised, local). The index covers MEMORY.md, memory/*.md, and session transcripts. The agent runs a semantic query before answering any question about prior work, decisions, or people, which is what stops the model from hallucinating about past events the operator has not actually told it about.

2.8 Architecture evolution

Architecture in production drifts. The case-study deployment went through several iterations in March alone:

| Date | Change | Cause |
|---|---|---|
| Early March | Daemon → cron | Daemon connections dropped on every proxy reset |
| 20–22 March | Monitoring extended to 02:00 MSK | Late client activity went unanswered |
| 22 March | Weekdays-only → daily monitoring | Saturday reminders were missed |
| 22 March | TTS pipeline added | Voice replies were needed |
| Mid-March | Hot mode added | 60-minute heartbeat latency was too slow for live dialogue |

Each change is tied to a concrete observed problem. The architecture did not arrive whole; it accreted by precedent (see incident-driven-configuration).


3. Identity engineering

The agent's persona — what makes it read as a human team member rather than a bot — is the most counterintuitive part of the deployment. The persona is not built by prescription ("be a friendly producer"); it is built by the prohibition list described in persona-through-prohibitions. The full pattern lives in that concept page; this section covers the deployment-specific choices.

3.1 The core file

SOUL.md defines: name (Arina), role (producer at the studio), gender (women's grammatical forms always), time zone (MSK, UTC+3, always; never UTC), tone (dry-professional by default, warmer if the client is warm). Then a list of hard prohibitions: no exclamation marks; no "of course", "with pleasure", "happy to help", "happy to meet you", "ready to work"; no em-dash, only hyphen or en-dash; no emoji except an occasional 👍; no first-message-without-reason.

The prohibitions are specific enough that the model cannot satisfy them with a near-miss. Vague guidance ("be sober in tone") gets compiled away by the model into its idea of sober; specific bans like "never write of course" are syntactic and apply uniformly.
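Because the bans are syntactic, they can also be checked mechanically before a draft is sent. The paper does not say whether the deployment runs such a check; the sketch below is an assumption, uses the English renderings of the banned phrases given above, and approximates emoji with two rough Unicode ranges.

```python
import re

BANNED_PHRASES = (
    "of course", "with pleasure", "happy to help",
    "happy to meet you", "ready to work",
)
ALLOWED_EMOJI = {"\U0001F44D"}  # thumbs-up, the one permitted emoji
# Rough emoji ranges; an approximation, not a complete emoji definition.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def violations(text: str) -> list:
    """Return the prohibition rules that a draft reply breaks."""
    found = []
    if "!" in text:
        found.append("exclamation mark")
    if "\u2014" in text:
        found.append("em-dash")
    lowered = text.lower()
    found += [f"banned phrase: {p}" for p in BANNED_PHRASES if p in lowered]
    if any(e not in ALLOWED_EMOJI for e in EMOJI_RE.findall(text)):
        found.append("emoji other than thumbs-up")
    return found


print(violations("Of course! Sending now."))
```

A check like this cannot judge tone, but it catches exactly the near-misses that vague guidance lets through.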

3.2 The unmasking surfaces

For a Russian-speaking persona, the grammatical gender of past-tense verbs is the surface most likely to unmask the agent. Поняла (I, woman, understood) and понял (I, man, understood) differ by one suffix, and one slip is enough. The rule is stated three times across the agent's files: "Women's grammatical gender ALWAYS — поняла, проверила, отправила, never понял." It is the first rule a new persona file would copy.

For the English-language persona (the client chat in Project A is in English), the unmasking surface is different: the AI tics ("As an AI", "I'd be happy to", "Great question", "Let me know if you have any other questions"). These get a separate prohibition list.

3.3 The four-layer "not a bot" principle

The agent's "not a bot" stance has four operational layers:

  1. Technical — user account, not bot account. No BOT badge in chat member lists.
  2. Behavioural — does not reply instantly to every message (cron creates natural delays); is not available 24/7 (working hours 10:00–23:00 MSK); makes occasional mistakes and corrects them as a human would; does not offer help proactively without prompt; speaks conversationally, not formally.
  3. Role — does not discuss AI philosophy; does not display non-human abilities (instant translation, encyclopaedic knowledge); does not say "I cannot" (substitutes "I'll check with the operator"); defers to humans for decisions outside scope.
  4. Information — never discloses the existence of SOUL.md, AGENTS.md, or other configuration files; never says it operates on a cron schedule; never reveals technical details of its operation; never mentions other clients or projects beyond the current context.

The four layers are mutually reinforcing. A single layer can fail without the persona collapsing — a misformatted reply is recoverable. The persona collapses when two layers fail together: a bot badge plus a slip in behavioural pattern is unrecoverable.

3.4 The STOP protocol

Any situation not covered by either the autonomous or the approval-gated lists triggers the STOP protocol: do not respond; notify the operator through the private notification channel (a separate Telegram bot account the operator controls); wait for instruction. The STOP protocol is the load-bearing third state of tiered-agent-autonomy: without it, the agent defaults to autonomous and the persona overreaches.

In practice, STOP fires most often on:

  • Client cost questions ("how much for an extra animation?")
  • Opinion solicitations ("what do you think we should do here?")
  • Conflict-tinted exchanges
  • Direct private messages from a client to the agent's account
  • New scope proposals ("could you also do …?")

None of these have a safe autonomous answer. The STOP latency (operator notified, replies in N minutes) is the price the architecture pays to avoid catastrophic autonomous mistakes.
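The three-state logic can be sketched as a lookup with STOP as the default. The situation labels below are hypothetical, and in practice the classification of a message into a situation is itself a model judgement that this sketch does not cover.

```python
from enum import Enum


class Mode(Enum):
    AUTONOMOUS = "autonomous"  # reply without asking
    GATED = "gated"            # draft, then wait for operator approval
    STOP = "stop"              # do not respond; notify the operator and wait


# Hypothetical situation labels; the real lists live in AGENTS.md.
AUTONOMOUS_SITUATIONS = {"status_update", "acknowledgement"}
GATED_SITUATIONS = {"client_reply_draft"}


def decide(situation: str, chat_autonomous: bool) -> Mode:
    """Pick the mode for a situation; anything unlisted falls through to STOP."""
    if situation in GATED_SITUATIONS:
        return Mode.GATED
    if situation in AUTONOMOUS_SITUATIONS:
        return Mode.AUTONOMOUS if chat_autonomous else Mode.GATED
    return Mode.STOP


print(decide("cost_question", chat_autonomous=True))  # Mode.STOP
```

The design choice is in the last line of `decide`: the fallback is the most conservative state, so a new kind of message never silently inherits autonomy.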

3.5 The isolation rule

The agent has zero access to the operator's personal context: no NixOS memory, no health data, no personal correspondence. The agent's workspace is a self-contained universe that holds only the work it is supposed to do. Symmetrically, the agent never messages the operator from its own account. Notifications to the operator go through a separate channel — a NixOS-bot the operator controls — which is invisible to clients.

This separation is double-protective: it stops the operator's personal data from leaking through the agent, and it stops the agent's existence from being inferable by anyone watching the operator's account.

3.6 The persona evolution cycle

The persona is not static. New prohibitions get added in response to incidents. Three from March 2026:

  • 22 March, triple-send incident. The agent posted three copies of "OK, received" into the internal Project A chat because the proxy timed out and the script retried. The operator deleted the duplicates. The rule added to HEARTBEAT.md: "Duplicate messages — NEVER. After every send, verify exactly one message left. If duplicated, delete immediately, no questions. On timeout, first verify what was sent, then decide whether to retry."
  • 22 March, leaked project name. The agent referenced a deleted project's name in a report. The operator caught it before it reached the client chat. The rule added: "Deleted projects are removed from everywhere: files, memory, tools, persona, tasks, chats. Never mention this name."
  • 22 March, UTC/MSK confusion. A deadline calculation slipped between time zones. The rule added: "Always compute and display time in MSK (UTC+3). Never confuse with UTC."

This is the incident-driven-configuration cycle in action. Each incident becomes a one-line rule with traceable provenance. After a month of operation, the persona file has accumulated about a dozen of these — small, specific, each one tied to an observed mistake.
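The duplicate-message rule translates naturally into code: verify after every send, and on timeout check what actually landed before retrying. The sketch below is one possible reading of that rule; the function name and the three callables (`send`, `own_recent`, `delete`) are hypothetical stand-ins for the Telethon-backed helpers.

```python
def send_exactly_once(text, send, own_recent, delete, max_attempts=3):
    """Send `text` so that exactly one copy remains in the chat.

    `own_recent()` must return the agent's own messages posted since this call
    began, oldest first. Returns the id of the surviving message, or None if
    nothing could be verified as sent.
    """
    for _ in range(max_attempts):
        try:
            send(text)
        except TimeoutError:
            pass  # the message may or may not have landed; verify below
        copies = [m for m in own_recent() if m["text"] == text]
        if copies:
            for extra in copies[1:]:
                delete(extra["id"])  # "If duplicated, delete immediately"
            return copies[0]["id"]
    return None
```

The point of the sketch is the order of operations: the retry decision is made from the chat's actual contents, never from the fact that the send call raised.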


4. The memory system

4.1 Why file memory

The agent's memory is stored entirely in markdown files on disk, not in a database. The rationale is laid out in agent-file-memory-as-asset: file memory serves two audiences (the agent across cron firings, and the operator and team), is editable by both, is versionable in git, and has no schema to migrate. The trade-off — no integrity constraints — is accepted because the system has one operator who reads the files often, not many writers competing for a schema.

4.2 The daily journal

The operational core is a daily journal at memory/YYYY-MM-DD.md. Each journal follows a five-section template (plan, fact-per-chat, status, problems, for-tomorrow). The agent appends to the journal at each heartbeat. The morning heartbeat reads yesterday's journal to build today's plan.
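A minimal sketch of the journal mechanics, assuming the five sections are stored as `## ` headings in the markdown file (the paper names the sections but not their exact markup); the helper names are hypothetical.

```python
from datetime import date
from pathlib import Path

SECTIONS = ("Plan", "Fact", "Status", "Problems", "For tomorrow")


def journal_path(root: Path, day: date) -> Path:
    return root / "memory" / f"{day.isoformat()}.md"


def ensure_journal(root: Path, day: date) -> Path:
    """Create the day's journal with its five empty sections if it is missing."""
    path = journal_path(root, day)
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("".join(f"## {s}\n\n" for s in SECTIONS), encoding="utf-8")
    return path


def append_entry(root: Path, day: date, section: str, text: str) -> Path:
    """Append one bullet to the end of the named section."""
    path = ensure_journal(root, day)
    lines = path.read_text(encoding="utf-8").splitlines()
    start = lines.index(f"## {section}")
    end = next((i for i in range(start + 1, len(lines))
                if lines[i].startswith("## ")), len(lines))
    pos = end
    while pos > start + 1 and lines[pos - 1] == "":
        pos -= 1  # insert before the blank line that closes the section
    lines.insert(pos, f"- {text}")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path
```

Each heartbeat would call `append_entry` for what it observed; the morning heartbeat would read the previous day's file through `journal_path`.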

A typical day's "Fact" section from the case study (25.03.2026):

Internal team chat (12 messages)

  • Forwarded client feedback on Variant A Metal (textures, light, swisstransfer link).
  • Forwarded edit on wrong logo on bracelet → replace with vector from drawings.
  • Operator set priorities: video + 39.5 statics revisions + bracelet texture.
  • Specialist asks order: statics or video first? Operator: latest model + re-render of bracelet frames only.

Retailer project chat (30+ messages)

  • Client review at 16:00 MSK.
  • Art director approved colours.
  • Specialist uploaded backgrounds to Miro (18:04 MSK).
  • Crystals finished, handed to executor.
  • Animation call moved to 09:00 the next day.

The "Problems" section captures the open issues; "For tomorrow" carries forward into the next day's plan. The schema is what turns the memory into something the team can read — a single page that summarises the day's state.

4.3 Silence as signal

A journal entry from 27 March is the canonical example of silence-as-information. Both work chats returned timeouts on the proxy. The agent's "Fact" section read:

Internal team chat (0 new messages, timeout)

  • Quiet day. Monitoring returned timeout.
  • ❌ Deadline for one model SHIPS TODAY — no confirmation of delivery.
  • ❌ PDF the client is waiting for — 5th day, no movement.
  • ❌ Yesterday's deadline for another model — status unknown.

The absence of activity is itself a signal. Three deadlines are critically late and the chat has been quiet. The agent does not generate this insight as a separate alert — it falls out of the schema. The "Status" section is structurally not an empty list; it is a list of items with state markers, and when state markers are red, the read is unambiguous.
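The way the signal "falls out of the schema" can be illustrated with a small sketch. The `Deadline` record and `status_lines` function are hypothetical; the only idea taken from the text is that every unconfirmed deadline gets a state marker whether or not the chat was active.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Deadline:
    item: str
    due: date
    confirmed: bool = False  # set once delivery is confirmed in the chat


def status_lines(deadlines, today: date, new_messages: int) -> list:
    """Build the status entries; silence does not suppress the red markers."""
    lines = []
    if new_messages == 0:
        lines.append("Quiet day.")
    for d in deadlines:
        if d.confirmed:
            continue
        if d.due < today:
            lines.append(f"❌ {d.item}: due {d.due.isoformat()}, status unknown")
        elif d.due == today:
            lines.append(f"❌ {d.item}: ships today, no confirmation of delivery")
    return lines
```

Nothing in the function treats a quiet chat as an alert condition. The red lines appear because the deadlines are evaluated on every run, independent of message volume.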

4.4 Project folders

Each active project has a folder of trackers:

  • overview.md — context, team, mode, communication rules
  • deadlines.md — the production schedule, rendered as a table with state markers (⚠️ approaching, ❌ missed, ← nearest)
  • A handful of project-specific trackers (video shots, approval rounds, 3D model status, additional-costs log)
  • For projects in crisis: an evidence file (see § 6 for the canonical case)

The Project A deadlines.md content as of the audit day:

| # | Date | Deliverable | Status |
|---|---|---|---|
| 1 | 12 March | Priority frontal shots | ⚠️ past |
| 2 | 16 March | Animated 39.5 mm Variant A 4K + detail | ⚠️ past |
| 3 | 19 March | Variant A 39.5 mm Leather & Metal, Star — all renders | ← nearest |
| 4 | 26 March | Variant A 39.5 mm Animation — films and shorts | In progress |
| 5 | 1 April | Variant B 38 mm and Variant C 43 mm — all renders | ❌ missed |
| 6 | 6 April | Model B Animation — product film | ⚠️ not started |

The table is regenerated at every heartbeat with whatever state markers fit. It is consultable by the team the same way a shared Notion table would be — except the agent maintains it without being asked.

4.5 Semantic search

A local nomic-embed-text-v1.5 (GGUF quantised) embedding model indexes MEMORY.md, memory/*.md, and session transcripts. The agent queries the index before answering any question about prior work, decisions, or people. The purpose is to stop hallucination: the agent cannot invent a fact about a past delivery if the index is silent on it; either the fact is in memory and the agent retrieves it, or the agent says "I'll check with the operator".

The search has known limitations: index quality degrades as memory grows, the local model is below cloud-grade embeddings, and there is no reranking step. These are accepted because the alternative (cloud embedding API) would create a network dependency on the proxy.
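The retrieve-or-defer behaviour can be sketched with plain cosine similarity over precomputed vectors. The embedding step itself is left out, and the similarity threshold of 0.75 is a hypothetical value, not one reported for the deployment.

```python
import math

FALLBACK = "I'll check with the operator"


def cosine(a, b) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def recall(query_vec, index, threshold: float = 0.75) -> str:
    """Return the best-matching memory text, or the fallback if nothing is close.

    `index` is a list of (text, vector) pairs produced by the embedding model.
    """
    best_text, best_score = None, -1.0
    for text, vec in index:
        score = cosine(query_vec, vec)
        if score > best_score:
            best_text, best_score = text, score
    return best_text if best_score >= threshold else FALLBACK
```

The threshold is what turns a weak match into an explicit deferral rather than a confident but unsupported answer. With no reranking step, it is also the only guard against near-miss retrievals.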

4.6 The .learnings/ triumvirate and the promotion cycle

Three files in .learnings/ capture working notes that are not yet rules:

  • LEARNINGS.md — behaviour corrections to be promoted into SOUL.md once the pattern repeats.
  • ERRORS.md — technical failures; promoted into TOOLS.md if a workaround is found.
  • FEATURE_REQUESTS.md — capabilities the agent is missing; promoted into the roadmap if pursued.

The promotion cycle is the incident-driven-configuration mechanism: the drawer accumulates incidents, the operator reviews, and once a pattern repeats it gets promoted to a hard rule in the main config files. Across one month, the case-study deployment accumulated nine promotions: four behavioural rules into SOUL.md / HEARTBEAT.md, three process rules into AGENTS.md, two technical workarounds into TOOLS.md.

4.7 What is and isn't memory

The system is deliberate about what is not saved:

  • Code patterns, architecture, file paths — derivable from the current state of the project.
  • Git history — git log is authoritative.
  • Debugging solutions — the fix is in code, the context in the commit message.
  • Ephemeral session details — not memory, just current state.

The discipline keeps the memory from bloating into a generic dump that loses search quality.


5. Case I — Two-chat architecture for the Swiss-watch project

Project A is the cleanest illustration of the two-chat-architecture pattern. It runs in two Telegram forum chats: an internal team chat in Russian with the lead producer, two specialists, and the agent; and a client chat in English with the lead producer, the agent, and the client lead. The internal chat is configured autonomous: true; the client chat is autonomous: false — every reply gates through the operator.

5.1 Information flow

The diagram from § 4 of the source:

┌──────────────────┐                    ┌──────────────────┐
│  Client chat     │                    │  Internal chat   │
│  (gated, EN)     │                    │  (autonomous, RU)│
│                  │    ┌──────────┐    │                  │
│  Client → msg    │───►│  Agent   │───►│  Agent → team    │
│                  │    │          │    │  (rewrite, RU)   │
│  Client ← msg    │◄───│  via op  │◄───│  Team → result   │
│                  │    │          │    │                  │
└──────────────────┘    └──────────┘    └──────────────────┘

Four flow rules govern the architecture:

  • Client → Agent → Team: autonomous. The agent rewrites the client's message into the team's working language and frame. Never a forward; never a quote; a paraphrase with the agent as the author.
  • Team → Agent → Client: gated. The agent drafts the client-facing version, the operator reviews, the operator authorises the send.
  • Never forward. Both directions. Forwarding preserves the original author's tone and language; rewriting is the only legal mode.
  • Topic discipline. Forum chats have topics per deliverable; before sending, the agent verifies the target topic matches the subject. The rule was promoted after a wrong-topic incident on 22 March.
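The four rules above can be condensed into a small routing sketch. The `Relay` record, the chat labels, and the deliverable identifiers are hypothetical; the rewrite step itself is a model call and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relay:
    target: str           # "internal" or "client"
    needs_approval: bool  # True means the operator must authorise the send
    language: str         # language the message is rewritten into


def plan_relay(source: str) -> Relay:
    """Decide where a message goes next; the content is always rewritten, never forwarded."""
    if source == "client":
        return Relay(target="internal", needs_approval=False, language="ru")
    if source == "internal":
        return Relay(target="client", needs_approval=True, language="en")
    raise ValueError(f"unknown source chat: {source!r}")


def check_topic(subject_deliverable: str, topic_deliverable: str) -> None:
    """Refuse to send when the forum topic does not match the message subject."""
    if subject_deliverable != topic_deliverable:
        raise ValueError("target topic does not match the subject")
```

There is deliberately no "forward" branch: the only operations the sketch offers are a rewrite into the target chat's language and, for the client direction, an approval gate.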

5.2 Worked examples

Forwarding feedback (25.03) — the agent received client feedback on Variant A Metal in the client chat (English, technical specifics about texture and light). It paraphrased into Russian for the team chat, naming itself as the author: "Forwarded the client's feedback on Variant A Metal — texture and light comments, references at swisstransfer." It also paraphrased a side note about wrong logos on the bracelet straps: "Forwarded the edit: the logos on the leather and metal bracelet aren't right — replace with the vector from the drawings."

Priority coordination (25.03) — the operator settled a priority sequence in the internal chat. The agent logged the decision in the day's journal and surfaced it to the specialist who asked for clarification. No client-facing message was generated; the decision is internal.

Deadline visibility (27.03) — three production deadlines past or imminent, the chat silent. The agent's morning journal entry made the situation visible: three red markers in a table that the operator otherwise would have had to assemble from scrolling. The agent did not write into the chat to push (it does not push); the surface was the journal.

Triple-send incident (22.03) — the agent posted three copies of an acknowledgement because the proxy timed out and the send was retried. The operator caught it within minutes and deleted the duplicates using the user-account capability. The rule was promoted to HEARTBEAT.md the same day.

5.3 The scope-creep ledger — the deployment's signature result

The Swiss-watch project surfaces what, on reflection, is the deployment's most consequential outcome: the agent maintained, in real time, a structured ledger of every scope addition observed in the chat, classified against the studio's staged approval pipeline and pricing table, with the originating message context preserved per line. Across four weeks the ledger accumulated 22 lines of significant additional costs. None of the 22 lines could have been reliably reconstructed at billing time from human memory. This is the use case where the agent does work that a human producer is structurally incapable of doing; the scope-creep-ledger concept page is the full pattern documentation.

5.3.1 The structural problem

Digital production studios lose margin to invisible scope creep — the slow leak of work outside the approved spec into a stream of casual chat messages that no human producer can systematically track. The client lead writes "oh and the clasp should face the other way" between an approved-then-revised camera angle and a question about delivery dates. By message 200, no one remembers whether that change was in spec or out. By billing time, the rework has happened, the bill has not been raised, the cost has been absorbed silently. Across forty similar small additions per engagement, the leak adds up to four-figure margin loss. The loss is not a discipline failure; it is the rational behaviour of a human running at full capacity who has to choose which battles to fight. What the producer lacks is spare attention.

5.3.2 The mechanism — staged pipeline, discriminator rule, real-time logging

The studio's production process is decomposed into eight staged approvals: Input Files → Modeling → Texturing → Variations → Positioning → Camera → Lighting → Rendering. Each stage carries a published price list with base, variation, and after-approval-rework costs. The agent reads the price table at startup and applies a single discriminator rule to every incoming client message: feedback against an unmet spec is free; feedback introducing new inputs or changing an approved stage is billed. When billed, the agent writes a ledger line — model, stage, description, amount, context — and the ledger compiles continuously across the engagement.

The price table the agent operates over is concrete. It covers base model costs, variation costs, per-shot render costs, per-camera rework costs, and rush surcharges — each line mapped to a specific pipeline stage. The table includes entries for base model work, strap and dial variations, texture rework, camera and light rework, render iterations, urgent rendering, and video shot re-rendering. All invoices carry a standard operational fee. One full-quality render iteration is included in the base scope; everything after that is per-iteration.
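The discriminator and the ledger line can be sketched as below. The stage names and the five ledger fields come from the text; the function names are hypothetical, and the price in the usage example is a placeholder, since the paper does not publish real amounts.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ("Input Files", "Modeling", "Texturing", "Variations",
          "Positioning", "Camera", "Lighting", "Rendering")


@dataclass
class LedgerLine:
    model: str
    stage: str
    description: str
    amount: float
    context: str  # originating message context, preserved per line


def is_billable(stage_approved: bool, new_inputs: bool) -> bool:
    """Feedback against an unmet spec is free; new inputs or changes to an approved stage are billed."""
    return new_inputs or stage_approved


def record(ledger: list, *, model: str, stage: str, description: str, context: str,
           stage_approved: bool, new_inputs: bool, price_table: dict) -> Optional[LedgerLine]:
    """Append a ledger line for billable feedback; return None when it is free."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    if not is_billable(stage_approved, new_inputs):
        return None
    line = LedgerLine(model, stage, description, price_table[stage], context)
    ledger.append(line)
    return line


ledger = []
record(ledger, model="Model A", stage="Positioning",
       description="Flip leather strap, re-render affected back shots",
       context="clasp should be at 12 o'clock",
       stage_approved=True, new_inputs=False,
       price_table={"Positioning": 100.0})  # placeholder price
print(len(ledger))
```

Deciding whether a stage was approved, and whether a message introduces new inputs, is the classification work the agent performs on each message. The sketch only shows what happens once those two facts are known.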

5.3.3 The ledger — four worked lines

The reversed clasp. The client lead noticed the leather strap on Model A was mounted with the clasp at 6 o'clock instead of 12 o'clock. The fix is small — flip the strap, re-render the affected back shots — but the original positioning had been approved at the Positioning stage. By the discriminator, any change to an approved stage is billed. The ledger line preserves the message context: "the client lead noticed the leather strap was mounted the wrong way — clasp should be at 12 o'clock, long leather part at 6 o'clock. Affected back shots."

The cascade re-render. Late in the project the client lead realised the dial typeface on Model D didn't match spec. The typeface change itself was a minor texturing rework. But by that point all 1,804 final renders (41 shots × 44 variant combinations) had been produced — each carrying the wrong font. The re-render multiplied across every shot. Add an overnight render-farm session to make the deadline, plus the texture rework — the total cascade cost landed in three separate ledger lines. Had the typeface error been caught at Texturing, three weeks earlier, the cost would have been minimal. Caught after Rendering it was eleven times bigger. This is what cascade means in production work: a tiny mid-pipeline error multiplied by the number of artefacts downstream of it.

The render-farm rental. A separate line, distinct from the per-shot cost. Renting external render-farm capacity for an accelerated deadline is an infrastructure expense; the classifier knows to split it because the price table has a row for "render-farm rental" distinct from the row for per-shot re-render. Folding the two together would understate the true cost of urgency.

The old logo version. After delivering Model A's video, the client lead noticed the logo on the rendered models was an old version. The logo had been taken from the client's original input files — so the error was in the inputs, not in production. But the fix still required work: identify affected frames, swap the logo, re-export. The ledger line is classified as minor correction not per inputs, with the context: "after delivering Model A video, the client lead spotted the logo was from an old model version. Required identification, fix, and re-export of affected frames." The line is defensible because the context is verifiable in the chat.

5.3.4 The complete ledger

| Model | Lines | Share of total |
|---|---|---|
| Model A (39.5 mm equivalents) | 11 | ~31% |
| Model B | 3 | ~12% |
| Model C (43 mm equivalent) | 3 | ~6% |
| Model D (38 mm with cascade re-render) | 3 | ~45% |
| Render-farm sessions | 1 | ~6% |
| Total | 22 | 100% |

The cascade case (Model D) generates 44.6% of total recovered revenue from only 3 lines — a single late-discovered font error multiplied across 1,804 renders. The point of the line-level breakdown is that this kind of cascade only becomes visible — and only becomes billable with evidence — when the ledger has been kept in real time.

5.3.5 Why this is the agent's signature use case

Most of what the agent does in the rest of the deployment is better-than-human on volume but equivalent in kind: coordination, status, documentation, all tasks a skilled producer with infinite time could in principle do. The scope-creep ledger is structurally different. It is a task no human producer can perform reliably regardless of skill or time, because the cognitive load of continuous classification across thousands of messages exceeds what a single human running an engagement can sustain. The agent is not a force multiplier here; it is a capability extension. Without the agent, the work does not get done: not less well, but not at all.

The total amount recovered across one engagement is not the headline. The headline is that the number was recoverable at all. Multiplied across a studio's portfolio of concurrent engagements, the ledger pattern is a margin-recovery operation that requires no change in how clients communicate, no extra producers, no enterprise change-management software. Clients still write in chat. The producer still reads. Behind the producer, the agent keeps score in real time, and the score is auditable.

The full pattern — including deployment requirements, two failure modes, and generalisation beyond watch rendering — is in scope-creep-ledger.

5.4 What the agent could do and what it could not

The agent successfully:

  • Translated and relayed feedback in both directions (one direction autonomous, the other through the operator).
  • Tracked six concurrent deadlines on a single tracker.
  • Coordinated work between two specialists by surfacing blockers.
  • Logged decisions for re-derivation later.
  • Tracked client promises (files, feedback) and flagged when they slipped.
  • Maintained the scope-creep ledger that recovered significant additional costs (§ 5.3).

The agent could not:

  • Produce the diff PDF the client was waiting for (§ 5.2 worked example). That task required visual expertise — comparing renders with references, describing discrepancies in detail. It sat blocked for five days. The eventual rule was: tasks requiring visual expertise are tagged "operator-only" and escalated immediately.
  • Break the silence when both chats were quiet on 27 March. The agent could document the silence; it could not initiate (the autonomous tier permits proactive blocker probes only within the internal chat, and the operator was not in the internal chat at the time).
  • Access the Google Sheets master file holding production data. The agent monitored its status indirectly through chat references; full integration with the sheet remained in the roadmap.

5.5 What this case demonstrates

The Swiss-watch project shows two patterns operating in tandem at production tempo. The first is two-chat-architecture: an agent simultaneously running two chats with opposite tier configurations, in two languages, mediated by a small set of flow rules. The pattern's value is that the chat geometry physically realises the tier configuration — there is no judgement call about whether a given message is internal-or-client because the chat tells you.

The second is the scope-creep-ledger (§ 5.3) — the project's standout result and, in the operator's reading, the deployment's signature use case. The two patterns reinforce each other: the two-chat architecture gives the agent a clean reading of which messages are client-originated (and therefore candidates for ledger classification); the ledger turns that clean reading into recovered revenue.

The case also surfaces a generic limit of agentic systems: they are good at coordination, documentation, and continuous classification, less good at tasks requiring visual or aesthetic expertise. The five-day-blocked PDF is not a bug in the deployment; it is a class boundary that needs explicit handling.


6. Case II — Observer-documentarian mode in a crisis

Project B was meant to be a 29-day production. By the end of week one, the patterns characteristic of a doomed project were already visible: changed inputs after approval, late briefs, edit rounds without final decisions, communication channels routinely violated. By week three, the producer on the client side messaged the operator: "We won't bother you until Friday", which became — twelve hours later — four new threads with fifteen messages of requests from the client coordinator.

The agent's role on this project was, by configuration, narrower than on Project A. The chat was configured for observation: the agent reads everything, replies only on direct address, and uses a tiny set of acknowledgement phrases ("received", "we'll get back to you", "thanks, we have it"). All judgement was approval-gated; almost everything the agent did was passive.

What the agent did from inside that narrow brief turned out to be the most consequential thing it did on the entire deployment: it documented the project.

6.1 Scale

Over 29 working days the agent monitored:

| Metric | Value |
|---|---|
| Total messages in the chat | 3000+ |
| Messages from the studio side | 370 (~12.5 %) |
| Deliveries logged (uploads to disk, Miro, cloud) | 138 |
| Pipeline / input changes documented | 27 |
| Art-director edit rounds | 46 |
| Pressure-by-deadline episodes | 23 |
| Late briefs | 8 |
| Parallel forum topics | 10+ |
| Final delivery deadline | 24 April |

The chat ran in a Telegram forum with topics per stream — concepts, animation, roto, grayscale, crystals, matchmove, montage, shoot, and others. The agent monitored every topic; the client side fragmented work across parallel threads, which became part of the documented pattern (§ 6.4).

6.2 What "observer-documentarian" looks like in operation

The agent's working pattern on this project was:

  1. Cron heartbeat every hour plus hot mode when activity was detected.
  2. On every firing: read the new messages across all topics, append to the day's journal, update the project's overview.
  3. Reply only when directly addressed, only with the short acknowledgement phrases.
  4. Escalate any out-of-scope situation (cost questions, scope changes, conflict-tinted messages) to the operator immediately.
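One firing of this loop can be sketched as a pure function over the new messages. Everything here is illustrative: the `Message` and `FiringResult` types are hypothetical, and the keyword list used to spot out-of-scope messages is an assumption, since the paper names the escalation categories but not how they are detected.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Acknowledgement phrases quoted in the text.
ACK_PHRASES = ("received", "we'll get back to you", "thanks, we have it")
# Hypothetical keyword triggers for cost and scope questions.
ESCALATION_HINTS = ("cost", "price", "scope", "budget")


@dataclass
class Message:
    msg_id: int
    topic: str
    sender: str
    text: str
    addresses_agent: bool = False


@dataclass
class FiringResult:
    journal_lines: List[str] = field(default_factory=list)
    replies: List[Tuple[int, str]] = field(default_factory=list)  # (msg_id, phrase)
    escalations: List[int] = field(default_factory=list)          # msg_ids for the operator


def heartbeat(new_messages: List[Message]) -> FiringResult:
    """Journal everything, acknowledge direct address, escalate out-of-scope."""
    result = FiringResult()
    for m in new_messages:
        result.journal_lines.append(f"[{m.msg_id}] {m.topic} / {m.sender}: {m.text}")
        lowered = m.text.lower()
        if any(hint in lowered for hint in ESCALATION_HINTS):
            result.escalations.append(m.msg_id)  # the operator decides, not the agent
        elif m.addresses_agent:
            result.replies.append((m.msg_id, ACK_PHRASES[0]))
    return result
```

Escalation is checked before the reply branch so that a direct question about cost or scope is never answered with a stock acknowledgement.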

The journal entries for this project ran longer than for Project A — 30+ messages on a typical day, spread across four to six topics. The "Status" section accumulated a list of blockers; the "Problems" section accumulated the day's escalations.

6.3 The evidence file

The single most consequential artefact of the entire deployment is the project's evidence file, a 131-line document the agent assembled from the daily journals after the project's third week. Its structure:

  1. Overall statistics. The counts in § 6.1.
  2. Nine categories of violation, each with concrete examples, each example tagged with the source message ID.
  3. A blocker table. Who is blocking what, since when, with the responsible role identified.
  4. Work performed beyond contract. A list of items the studio did that exceeded the agreed scope.
  5. Final position. Pre-drafted negotiating language and a list of items to require.

The nine violation categories from the case study (paraphrased; identifying details removed):

  • Systematic breach of agreed communication protocols (the "won't bother you until Friday" example, the agreed feedback format becoming voice-notes-and-screencasts).
  • Creating impossible working conditions (the parallel-threads pattern in § 6.4).
  • Changing inputs after approval (the camera angle locked then changed — message [<msg-id>]: "Why did we approve this angle and lock it, when now there's a request to change it? Hearing it for the first time").
  • Late briefs (the lab scene briefed eight days into the work on it).
  • Edit rounds without final decision (the crystals topic: "too murky" → "60% of this" → "4–5% on the tips" → after a week of work: "Maybe we'll switch to classic symmetric"; the executor's reply: "I understand. I don't understand how to do it").
  • Approvals without information transfer (the producer's note "Your approvals — I don't trust them anymore").
  • Attempt to renegotiate cost downward (the side-channel message proposing to "review the scope" — i.e., reduce it).
  • Work performed beyond contract (animation, character generation, multiple-worlds statics, all not in the original agreement).
  • All delays on the client side.

The list is precedent-grade. Every claim is tied to a specific message. The aggregate is unarguable.
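The property that makes the file usable as evidence, every example carrying its source message ID, is easy to preserve mechanically. The sketch below assumes journal entries have already been tagged as `(category, msg_id, quote)` tuples; that tuple shape and the function names are hypothetical, not the agent's actual journal format.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, int, str]  # (violation category, source message ID, quote)


def build_evidence(entries: Iterable[Entry]) -> Dict[str, List[Tuple[int, str]]]:
    """Group tagged journal entries by category, ordered by message ID."""
    grouped = defaultdict(list)
    for category, msg_id, quote in entries:
        grouped[category].append((msg_id, quote))
    return {category: sorted(items) for category, items in grouped.items()}


def render(evidence: Dict[str, List[Tuple[int, str]]]) -> str:
    """Render each category with its count and cited examples."""
    lines = []
    for category, items in evidence.items():
        lines.append(f"{category} ({len(items)} examples)")
        for msg_id, quote in items:
            lines.append(f"  [{msg_id}] {quote}")
    return "\n".join(lines)
```

Because counts are derived from the cited entries rather than typed in separately, the statistics section and the examples cannot drift apart.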

6.4 The parallel-thread pattern

One sub-case is worth lifting out because it generalises. Late in the project the client side began creating many parallel threads in the forum and tagging team members directly in several of them within a short window. From the case study:

31 March (12 hours after the producer's "won't bother you until Friday"): the coordinator on the client side creates four new threads with about fifteen messages of requests. Tags the specialist directly in threads t:<id-1>, t:<id-2>, t:<id-3>, t:<id-4>, plus the crystals topic t:<id-5> and the compositing topic t:<id-6>. Bypasses the agreed communication chain.

The agent documented it as: parallel-scope-creep — the multiplication of threads as a mechanism for creating the appearance that the studio is non-responsive on multiple fronts simultaneously. Each individual message looks like a reasonable request. The aggregate, taken across threads in 30 minutes, is something else.

A human producer with one screen and a normal attention budget can absorb maybe two of those threads in real time. The agent absorbs all of them, in parallel, with no fall-off, and registers the pattern — the simultaneity is documented as one event with multiple legs.
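Registering the simultaneity as one event amounts to a sliding-window count of distinct threads. The sketch below is one way to do that, assuming each message is reduced to a `(timestamp, thread_id)` pair; the 30-minute window comes from the text, while the threshold of four threads and the function name are assumptions.

```python
from datetime import datetime, timedelta
from typing import List, Set, Tuple


def parallel_burst(events: List[Tuple[datetime, str]],
                   window: timedelta = timedelta(minutes=30),
                   min_threads: int = 4) -> Set[str]:
    """Return the largest set of distinct threads active inside any one window.

    Returns an empty set when no window reaches min_threads.
    """
    events = sorted(events)
    best: Set[str] = set()
    start = 0
    for end in range(len(events)):
        # Shrink the window from the left until it spans at most `window`.
        while events[end][0] - events[start][0] > window:
            start += 1
        threads = {thread for _, thread in events[start:end + 1]}
        if len(threads) > len(best):
            best = threads
    return best if len(best) >= min_threads else set()
```

Each message on its own passes any per-message check; only the window-level view shows that several threads were opened against the same people at once.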

6.5 What the agent could do that a human could not

Five capacities surfaced in this case that a single human producer cannot match:

  • Volume. 3000+ messages across 10+ topics over 29 days, with zero skipped messages.
  • Persistence. Seven days a week, no fatigue, no end-of-week drop in attention.
  • Systematic citation. Every claim has a message ID. Aggregation is exact, not impressionistic.
  • Impartiality. The agent does not feel exhausted, slighted, or politically constrained by a difficult chat. The documentation is what the messages say.
  • Re-assembly speed. The 131-line evidence file was assembled in hours from a month of journals. A human producer would take days, and the result would still be patchier.

6.6 What the agent could not do

  • Participate in calls. All conference calls took place without the agent; the operator transcribed key moments into the agent's memory afterward.
  • Make any decision. Every commercial decision — when to escalate, how to negotiate, whether to exit — was made by the operator.
  • Judge work quality. The agent registered "46 edit rounds from the art director" but could not assess whether the rounds were reasonable.
  • Lead negotiation. The agent drafted the final-position language; a human conducted the conversation.

6.7 Implications

Project B is the case that flipped the deployment's framing. The original design intent was active coordinator; what the agent did best in this project was passive documentarian. The two are not the same role, and the same agent can switch between them when the configuration allows it.

The pattern this generalises into is: in calm projects the agent is most valuable as a coordinator; in crisis projects the agent's most valuable function is systematic, citation-grade documentation. That insight surfaces only in production. It would not be visible from a clean-state deployment.

It also surfaces an ethical question (§ 9). Participants in this chat were not informed that an AI agent was monitoring them and accumulating an evidence file. The chat was corporate work communication, not private correspondence, so the legal frame is benign. The asymmetry — operator knows, others do not — is a real question the deployment did not settle.


7. Case III — Autonomous hiring and the templates-without-context failure

The third deployment was an autonomous hiring pipeline. The aim: source AI artists and 3D modellers from a Telegram freelance community channel, qualify them, run the initial outreach, collect portfolios and rates, and surface qualified candidates to the operator. The pipeline ran on 95 candidates in a single batch.

7.1 The funnel

Five states joined by four transitions, each transition with a templated reply:

new → asked_portfolio → asked_rate → review
                            └──→ waitlist
  • Stage 1: new → asked_portfolio. First message from an unknown contact in DMs. Agent: "Hi. Glad you're interested. Send a link to your portfolio — we'll take a look."
  • Stage 2: asked_portfolio → asked_rate. Candidate sent a portfolio URL (behance, artstation, dribbble, personal site, drive). Agent: "Thanks. What's your rate range for projects?"
  • Stage 3a: asked_rate → review. Candidate provided a rate. Agent: "Got it, passing to the team. I'll come back with feedback." Operator notified.
  • Stage 3b: asked_rate → waitlist. Portfolio and rate received but something unclear. Agent: "Thanks, noted. I'll come back if a suitable project surfaces."

State for the funnel sits in hiring_state.json. Each candidate's record: username, name, current stage, first-message timestamp, last-message timestamp, portfolio URL, rate, vacancy (ai_artist, 3d_modeler, or both).
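A candidate record and the legal stage transitions might look like the sketch below. The field names are guesses based on the list above, and the on-disk layout of hiring_state.json is not shown in the paper, so treat both as assumptions.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Allowed transitions, read off the funnel diagram.
TRANSITIONS = {
    "new": {"asked_portfolio"},
    "asked_portfolio": {"asked_rate"},
    "asked_rate": {"review", "waitlist"},
    "review": set(),
    "waitlist": set(),
}


@dataclass
class Candidate:
    username: str
    name: str
    stage: str = "new"
    first_msg_ts: Optional[float] = None
    last_msg_ts: Optional[float] = None
    portfolio_url: Optional[str] = None
    rate: Optional[str] = None
    vacancy: str = "ai_artist"  # ai_artist, 3d_modeler, or both


def advance(candidate: Candidate, new_stage: str) -> None:
    """Move a candidate forward, rejecting jumps the funnel does not allow."""
    if new_stage not in TRANSITIONS[candidate.stage]:
        raise ValueError(f"illegal transition {candidate.stage} -> {new_stage}")
    candidate.stage = new_stage


c = Candidate(username="example_user", name="Example")
advance(c, "asked_portfolio")
print(json.dumps(asdict(c), indent=2))
```

Encoding the transitions as data means a script that tries to move a candidate from `new` straight to `review` fails loudly instead of sending the wrong template.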

Filtering rules: valid portfolio links pass, media attachments without text are ignored (no photo / video / file / voice / sticker analysis — explicit policy), spam is ignored.
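The text side of these rules reduces to a small function. The sketch accepts any http(s) URL as a portfolio link, which is an assumption (the paper lists example hosts but no validation rule), and it leaves spam detection out because the paper does not describe it.

```python
import re
from typing import Optional

URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)


def extract_portfolio_url(text: Optional[str]) -> Optional[str]:
    """Return the first URL in a message, or None if the message is ignored.

    Media-only messages arrive with no text and are ignored by policy.
    """
    if not text or not text.strip():
        return None
    match = URL_RE.search(text)
    if match is None:
        return None
    # Drop punctuation that commonly trails a pasted link.
    return match.group(0).rstrip(".,;)")
```

A call such as `extract_portfolio_url("my work: https://www.behance.net/someone.")` returns the link without the trailing full stop, while a message with no text returns `None`.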

7.2 The script chain

The following scripts implement the chain:

hiring_scan.py         → Scan channel for candidates
hiring_processor.py    → Qualify against criteria
hiring_sender.py       → Send initial outreach
hiring_reply_scan.py   → Detect candidate replies
hiring_send_replies.py → Send personalised follow-up

The scripts run in sequence, not as a single monolith. Between each step the state file is updated; the operator can stop the chain at any boundary. This decomposition turns out to be load-bearing in the recovery (§ 7.4).

7.3 The 18-template incident

hiring_reply_scan.py had a bug. The script's job was to detect candidates who had replied to the initial outreach. The logic compared last_msg_ts > last_sent_ts per candidate: if the candidate's last-message timestamp was later than the agent's last-sent timestamp, the candidate was marked "replied" and passed to the next stage. The bug was in how last_msg_ts was being updated by an upstream scan — older messages were getting timestamps rewritten on re-scan, which made the comparison fire for candidates who had not actually responded.

Eighteen candidates were flagged as having replied. The next stage's send went out: a templated "Got it, passing to the team. I'll come back with feedback." Eighteen people received that message in response to nothing.

The failure is the canonical example of why context-before-action exists. The script trusted the state file; the chat would have shown the truth (the candidate said nothing); the script never read the chat.
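The difference between trusting the state file and reading the chat can be shown side by side. The sketch is a reconstruction from the description above, not the original script: `ChatMessage`, `replied_by_state`, and `replied_by_chat` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChatMessage:
    ts: float
    outgoing: bool  # True if the agent sent it


def replied_by_state(last_msg_ts: float, last_sent_ts: float) -> bool:
    """The comparison described in the text: trusts state-file timestamps."""
    return last_msg_ts > last_sent_ts


def replied_by_chat(history: List[ChatMessage]) -> bool:
    """Check the chat: is there an incoming message after the last outgoing one?"""
    last_sent = max((m.ts for m in history if m.outgoing), default=None)
    if last_sent is None:
        return False  # nothing was sent, so nothing can be a reply to it
    return any(not m.outgoing and m.ts > last_sent for m in history)


# The candidate wrote at t=100, the agent answered at t=200, and a re-scan
# rewrote last_msg_ts in the state file to t=300.
history = [ChatMessage(100.0, outgoing=False), ChatMessage(200.0, outgoing=True)]
print(replied_by_state(300.0, 200.0))  # True: the state file says "replied"
print(replied_by_chat(history))        # False: the chat shows no reply
```

The second function costs one history read per candidate, which is the price the context-before-action rule accepts.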

7.4 The recovery

The script chain's decomposition made the recovery possible. The recovery ran as six steps in sequence, four of them scripted:

  1. hiring_audit.py — for each candidate, read the full conversation history, compare to expected per-stage messages, identify duplicates and incorrect replies.
  2. hiring_delete_extra.py — delete the 18 erroneous messages using Telethon's user-account "delete for both" capability. (The Bot API does not allow this fully — another consequence of the user-account choice.)
  3. State rollback in hiring_state.json — the 18 candidates put back to their correct stages.
  4. hiring_context_dump.py — assemble the full chat context per candidate, ready for hand-crafted replies.
  5. Personalised replies — the operator and the agent together hand-crafted 82 personalised replies to the candidates who had actually responded. Each reply referenced the candidate's specific portfolio, the specific rate they had named, a concrete next step.
  6. hiring_send_replies.py — sent the 82 messages: 81 delivered, 1 blocked by privacy settings.
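The audit step in this sequence compares what was actually sent with what a candidate's true stage should have produced. A minimal sketch, assuming each outgoing message has been mapped to a template identifier; the identifiers and the `audit` function are hypothetical, and the real hiring_audit.py is not shown in the paper.

```python
from collections import Counter
from typing import Dict, List

# Templates a candidate should have received by the time they reach each stage.
EXPECTED_BY_STAGE: Dict[str, List[str]] = {
    "new": [],
    "asked_portfolio": ["ask_portfolio"],
    "asked_rate": ["ask_portfolio", "ask_rate"],
    "review": ["ask_portfolio", "ask_rate", "passing_to_team"],
    "waitlist": ["ask_portfolio", "ask_rate", "noted"],
}


def audit(true_stage: str, sent_templates: List[str]) -> Dict[str, List[str]]:
    """Flag templates sent more often than expected, or not expected at all."""
    expected = Counter(EXPECTED_BY_STAGE[true_stage])
    actual = Counter(sent_templates)
    duplicates = sorted(t for t, n in actual.items()
                        if t in expected and n > expected[t])
    unexpected = sorted(t for t in actual if t not in expected)
    return {"duplicates": duplicates, "unexpected": unexpected}
```

A candidate whose chat supports only `asked_portfolio` but who received the "passing to the team" template shows up under `unexpected`, which is the signature of the eighteen erroneous sends.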

The whole loop — error detection, audit, deletion, rollback, dump, hand-craft, send — took roughly four hours. The promoted rule was put into HEARTBEAT.md immediately: Before any send, dump the last 20–30 messages of the chat. No script was allowed to send without that step from then on.
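One way to make the rule hard to bypass is to put it in the send path itself. The wrapper below is a sketch of that idea, not the deployment's mechanism: `GuardedSender` and `ContextNotLoaded` are hypothetical names, and the two callables stand in for whatever the real scripts use to read and write the chat.

```python
from typing import Callable, Dict, List


class ContextNotLoaded(RuntimeError):
    """Raised when a send is attempted without a fresh context dump."""


class GuardedSender:
    def __init__(self,
                 fetch_recent: Callable[[str, int], List[str]],
                 send: Callable[[str, str], None],
                 context_size: int = 30):
        self._fetch_recent = fetch_recent
        self._send = send
        self._context_size = context_size
        self._context: Dict[str, List[str]] = {}

    def dump_context(self, chat_id: str) -> List[str]:
        """Read the last messages of the chat and remember that we did."""
        self._context[chat_id] = self._fetch_recent(chat_id, self._context_size)
        return self._context[chat_id]

    def send(self, chat_id: str, text: str) -> None:
        if chat_id not in self._context:
            raise ContextNotLoaded(f"dump the chat context for {chat_id} first")
        self._send(chat_id, text)
        # The dump is stale once a message has gone out; require a new one.
        del self._context[chat_id]
```

The guard only proves that the context was fetched, not that it was read with care; the judgement about what the context implies stays with the agent and the operator.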

7.5 Distribution of candidates

| Vacancy | Count | Share |
|---|---|---|
| AI Artist | 72 | 75.8 % |
| 3D Modeler | 22 | 23.2 % |
| Both | 1 | 1.0 % |
| Total | 95 | 100 % |

The skew reflects the current freelance market: AI generation specialists are significantly more numerous than 3D modellers who are also AI-pipeline literate. The one candidate matching both vacancies stands out as a rare profile.

7.6 What this case demonstrates

Three things worth lifting beyond the immediate failure:

  • The script-chain decomposition saved the recovery. A monolithic pipeline could have sent the bad template to all 95 candidates before the bug was noticed. Because the chain was a sequence of separate scripts, each writing state, the bug reached only 18 candidates before the operator stepped in.
  • Tonal consistency at scale is an AI-specific value. All 82 personalised replies were written in the same persona: no exclamation marks, feminine grammatical gender, dry-professional tone. A human recruiter handling 95 candidates inevitably drifts: more energetic in the morning, terser by evening, friendlier with candidates who sound friendly. The agent applied an identical tonal standard across the whole batch.
  • State files are an index, not a source of truth. This is the recurring lesson of the deployment — the chat is the source of truth, the state file is a fast index over it, and any action that affects the chat must verify against the chat.

The hiring incident produced the context-before-action rule. It also produced a general design principle for any future agent pipelines: every pipeline includes an audit step; the first run is always step-by-step under control; automation comes only after a verified run; state lives in files for rollback; the chat (or external source of truth) is checked before any send.


8. What worked, what didn't

8.1 What worked

| Pattern | Why |
|---|---|
| Cron heartbeat + hot mode | Predictable, recoverable; the 15-minute hot-mode window covers conversational beats |
| Persona by prohibition (persona-through-prohibitions) | Hard "never do X" rules hold consistency across thousands of messages where "be friendly" drifts |
| Operator isolation | Personal context of the operator is invisible to the agent; no cross-context leaks |
| File memory as asset (agent-file-memory-as-asset) | Same artefact serves agent context and team consumption; evidence file in § 6 is the limit case |
| Tiered autonomy (tiered-agent-autonomy) | Routine actions go fast; decisions go through a human; the two-chat geometry makes the split physical |
| Incident-driven configuration (incident-driven-configuration) | Rules grow by precedent, not by upfront design; provenance of every rule is traceable |
| Context before action (context-before-action) | State files are an index, not ground truth; chat is the source of truth |
| Scope-creep ledger (scope-creep-ledger) | Real-time classification of every scope addition against a staged pricing pipeline; recovered significant additional costs across one engagement that no human could have reconstructed |

8.2 What didn't

| Failure | Cause | Fix |
|---|---|---|
| Templates without context (hiring) | Script trusted state file timestamps instead of reading chat history | The context-before-action rule, promoted to HEARTBEAT.md |
| HTTP-proxy instability | Proxy drops every 3–4 ops | Reconnect every 2–3 ops; SOCKS5 alternative in roadmap |
| Visual / aesthetic-judgement tasks | The PDF diff task on Project A required visual expertise | Tag such tasks "operator-only", escalate immediately |
| Silent-deadline situations | Chat quiet on 27 March, deadlines past, agent could only document | Automated escalation when a chat is silent past a critical deadline (in roadmap) |
| Residual memory of deleted projects | Information removed from files persisted in model context | Explicit prohibition in SOUL.md plus file deletion; automated verification still absent |

8.3 Architectural decisions revisited

The case study reaches the same conclusions as the architecture's own justifications:

  • User account beats Bot API for this deployment: no BOT badge, full forum-topic support, "delete for both" capability used in the hiring recovery. Higher ban risk and harder setup are accepted costs.
  • Cron beats daemon under unstable network conditions: stateless across firings, recoverable, easier to debug. The cost is latency, paid back by hot mode.
  • Files beat database under the constraints of a single-operator system: legible to humans, version-controllable by git, modifiable by both agent and operator, searchable two ways. The cost is no integrity constraints — accepted.
  • One account beats multiple for a single agent: easier session management, lower ban risk through phone-number consistency, no cross-context confusion. The cost is the entire role surface is one identity.

9. Ethical questions

Three questions the deployment surfaces and does not settle.

9.1 An AI that passes as human

The agent's persona is designed not to disclose its AI nature. Collaborators — clients, team members, candidates — interact with it as a human producer. The full mechanism of persona-through-prohibitions is in service of this opacity.

Arguments for:

  • Communication quality does not suffer; the agent performs the producer role at human standard.
  • Disclosure creates bias — collaborators would micromanage if they knew, lowering the value of the agent's coordination work.
  • The "black box" principle: the result matters, not the executor's nature.
  • The agent does not make critical decisions autonomously — the operator does — so the AI nature is, in a sense, decorative.

Arguments against:

  • Informed consent: people have a right to know they are talking to an AI system.
  • Data collection: the agent systematically reads, indexes, and analyses messages (§ 6 is the extreme example).
  • Information asymmetry: the operator knows the agent is an AI; the collaborators do not.
  • Legal exposure: AI-collected evidence used in commercial disputes.

The open question is where the principled line sits between "a tool the operator uses" and "an autonomous agent passing as human". A human secretary who sends pre-drafted templated replies is uncontroversial; an AI agent that generates replies in real time, from the same templates, with the same content, looks on paper like the same arrangement under a different name. The intuition that the two cases are morally different is strong; the principled justification for that intuition is harder to state.

9.2 AI-monitored work chats

Project B is the case in point. Over 29 days the agent monitored 3000+ messages and built an evidence file used in a commercial exit negotiation. Participants in the chat were not notified.

The context is corporate communication, not private correspondence. A work chat is not private in the sense personal correspondence is private; a human producer reads everything in the same chat. The agent's difference is in systematicity: it does not miss, does not forget, does not get tired. That makes it a qualitatively different monitor, not just a faster one. A human reading the same chat in the same week would absorb perhaps 60–70 % of the messages and remember maybe 20 % a month later. The agent indexes 100 % of them and can recall any one for the cost of a retrieval.

The disclosure question is whether the difference of degree is enough to make a difference of kind. The deployment did not settle this; it continued running with the operator's tacit position that workplace chat communications are non-private by default and the agent's monitoring is permissible within that frame.

9.3 Responsibility

Who is accountable for the agent's actions?

  • For an erroneous message (a duplicate, a wrong template) — the operator.
  • For an autonomous action (a status acknowledgement, a routine reply) — pre-authorised through the AGENTS.md configuration, so still the operator.
  • For an escalation that should have happened but did not — operator's responsibility, as a configuration error.

In every case, the operator carries the responsibility. The agent is a tool, not a subject. This frame is stable for now because the deployment is small and the operator is a single named person. It will not be stable for larger or multi-operator deployments; the question of how to distribute responsibility across multiple humans for the actions of one autonomous agent is genuinely open.


10. Open research questions

The deployment surfaces a list of questions that need further work.

Technical:

  1. Real-time monitoring. How to migrate from cron to real-time without losing stability? SOCKS5 plus a webhook layer is the candidate.
  2. Scaling. The deployment runs one agent on one Mac Mini. How does this scale to multiple agents, multiple projects, multiple operators?
  3. Routing intelligence. The dual-model routing (glm-5.1 / claude-opus-4-6) is operator-configured. Can it become an automatic decision?
  4. Media. The agent ignores photos, videos, files. In a visual-production studio, this is a 90 % blind spot.

Organisational:

  1. Disclosure. Is there a moment in a project when the AI nature of the agent should be disclosed? The deployment did not disclose; the principled position is unclear.
  2. Transferability. The agent's SOUL.md / AGENTS.md are specific to one studio. How much of this transfers to other production studios? Other domains entirely?
  3. Regulation. How will AI regulation affect the practice of operator-controlled agents passing as human?

Research:

  1. Metrics. How to quantitatively compare an AI producer with a human producer? Message volume, response latency, coordination quality, error rate?
  2. Autonomy / control balance. The current boundaries in AGENTS.md were arrived at empirically. Is there a principled optimum?
  3. Long-term team dynamics. What happens to a team's behaviour over months of working with an AI agent? Adaptation? Habituation?

11. Conclusion

The deployment described in this paper started as an experiment in autonomous coordination and turned into something more specific: a working production system in which one AI agent simultaneously plays three distinct roles in three concurrent projects. The active-coordinator role on the watch project. The passive-documentarian role in the crisis project. The pipeline-operator role on the hiring batch. One identity, three configurations.

Two insights from the observation period stand out and were not designed in advance.

The first is the agent's role-shift between calm and crisis: in calm projects the agent's value lies in coordination, but in crisis projects the value shifts to documentation. The same agent, the same architecture, the same persona. What changes is what the deployment uses the agent for. This is not a designed feature; it emerged from the way file memory plus tiered autonomy plus a working observation mode compose. It may be the most valuable function of an AI agent in conflict-prone projects, and it was discovered, not designed.

The second — and on reflection the most consequential outcome of the deployment — is the scope-creep ledger (§ 5.3). Across one engagement the agent assembled, in real time, a 22-line ledger of significant additional costs that no human producer could have reconstructed. This is the agent's signature use case: not a faster version of human work but a capability extension into work humans cannot reliably do. The deployment's other patterns are valuable for force-multiplication; this one is valuable for being structurally impossible without an agent. The studio's portfolio of concurrent engagements multiplies the recoverable margin proportionally, with no change required in how clients communicate.

The seven patterns this paper extracts — persona-through-prohibitions, tiered-agent-autonomy, agent-file-memory-as-asset, incident-driven-configuration, context-before-action, two-chat-architecture, and scope-creep-ledger — together form a small reference architecture for autonomous agents in real production. None of the seven is exotic. The system's value is in the discipline of all seven, with provenance, in one running deployment. The architecture grew from incidents, accreted by precedent, and arrived at its present shape through small daily promotions of rules. That is also, on the evidence of this case study, the only honest way agentic systems will arrive at their working shape: by precedent, by patching, and by the patient documentation of what worked and what didn't.


Citation

Nix, A. (2026). An Autonomous AI Agent in Digital Production — Architecture, Cases, and Lessons. Working paper.