April 12, 2026 · 2 min read

MiniMax Voice Skill — Free Text to Production-Ready T2A JSON Request

A studio is voicing a multi-character animated piece. The actor lines are ready, but every character needs a different MiniMax T2A request — different model, different speed and pitch, different emotion, different inline markup. Half the team has the JSON spec open in another tab and is hand-copying values; the other half is asking each other what vol does. A line read in plain text comes back flat because no one inserted <#0.3#> pauses; another comes back at the wrong pitch because the model code is wrong for that emotion. By take ten, the inconsistency between characters is louder than the performances.

What follows is the MiniMax Voice Skill — one Claude Code skill that converts a free-form line into ONE production-ready MiniMax T2A JSON request, with all the inline markup, voice tuning, and character-preset routing baked in. Verbatim below. It stays standalone because TTS is a different domain than prompt-for-generation: different engine, different output shape, different artist controls.

The skill picks the right model (speech-2.8-hd by default, since a 2.8 model is required for interjections, dropping to speech-2.8-turbo only when low latency matters), sets language_boost from text detection, inserts <#0.2#>–<#0.5#> pause markers between phrases, adds interjections matching the character preset ((chuckle), (sighs), (breath)), and emits the JSON plus a single notes: line with one tuning hint. A built-in set of character presets (eight named characters plus a default narrator, each with a curated speed/pitch/emotion/interjection tuple) is included, so a request like "voice for Mark" returns a complete, tuned JSON in one shot.

The skill — `/minimax-voice`

---
name: minimax-voice
description: Generate a production-ready MiniMax T2A speech-synthesis JSON request from free-form text and an optional character/voice preset. Triggers on "minimax voice", "voice via minimax", "make minimax json", "voice prompt for T2A", "prompt for speech synthesis", or any explicit request to format text for MiniMax speech synthesis. Inserts pause markers `<#x#>` and interjection tags like `(chuckle)`, picks model/speed/pitch/emotion based on character. Honors built-in character presets when the character matches one of: Mark, Mila, Artur, Lyubov, Korney, Sonya, Lyala, Prapraded.
---

MiniMax Voice Prompt

Convert a free-form Russian/English line into ONE production-ready MiniMax T2A JSON request, with proper inline markup and voice tuning.

When to run

User asks for:

- "Voice this via minimax: …"
- "Make a minimax json for Mark: …"
- "MiniMax voice prompt for grandma Lyubov"
- "T2A prompt for text …"
- "I want speech synthesis via MiniMax — here's the text"

If the user asks for a different TTS engine (ElevenLabs, OpenAI TTS, Yandex SpeechKit) — say so and stop. This skill is MiniMax-only.

Protocol

1. Read the API reference. Consult the MiniMax T2A documentation for model names, allowed values, inline markup, and error codes. If it is unavailable, fall back to the inline cheatsheet below.
2. Identify the character. Look for a character name in the request. If it matches a preset character (see the preset table), use the preset. Otherwise ask once for the emotion/tempo intent, or default to the "narrator" preset.
3. Choose the model. Use speech-2.8-hd by default; a 2.8 model is required for interjections. Drop to speech-2.8-turbo only if the user asks for low latency.
4. Set language_boost. Detect the language from the input text: Russian for Cyrillic, English for clearly English text, auto otherwise.
5. Mark up the text:
   - Insert <#0.2#>–<#0.5#> pauses between phrases for natural cadence. Avoid pauses >1s in conversational lines.
   - Add interjections matching the character preset (e.g. (chuckle) for Mark, (sighs) for Mila on Friday-chaos lines, (breath) between long phrases).
   - Don't strip or rewrite the user's text: only add markup, and preserve punctuation.
6. Pick voice tuning from the preset or, if none applies, use emotion calm, speed 1.0, pitch 0.
7. Emit the JSON block with no prose around it. After the JSON, add a single notes: line with one tweak hint (when to raise or lower pitch or speed if the result feels off).
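
The language-detection and pause-insertion steps above can be sketched in Python. This is a minimal illustration, not the skill's actual implementation: the function names are hypothetical, mapping Latin-script text to English is an assumption, and splitting on sentence-ending punctuation is only a rough stand-in for judging where a phrase ends.

```python
import re

CYRILLIC = re.compile(r"[\u0400-\u04FF]")
LATIN = re.compile(r"[A-Za-z]")
PAUSE_MARKER = re.compile(r"\s*<#\d+(?:\.\d+)?#>\s*")


def detect_language_boost(text: str) -> str:
    """Pick a language_boost value from the script used in the text."""
    if CYRILLIC.search(text):
        return "Russian"
    if LATIN.search(text):
        return "English"  # assumption: Latin-script input is English
    return "auto"


def insert_pauses(text: str, pause: float = 0.3) -> str:
    """Put a <#x#> marker between sentences without touching the words."""
    if not 0.2 <= pause <= 0.5:
        raise ValueError("between-phrase pauses should stay within 0.2-0.5 s")
    marker = f" <#{pause}#> "
    # Replace the whitespace that follows . ! or ? when more text comes after it.
    return re.sub(r"(?<=[.!?])\s+(?=\S)", marker, text.strip())


def strip_pauses(marked: str) -> str:
    """Remove pause markers again, to check the user's text was preserved."""
    return PAUSE_MARKER.sub(" ", marked).strip()


line = "Hi! I just built a courier drone. It can deliver sandwiches!"
marked = insert_pauses(line, 0.2)
print(marked)
print(strip_pauses(marked) == line)
```

The round trip through `strip_pauses` is a cheap way to confirm the "only add markup" rule: if removing the markers does not give back the original line, the markup step changed the user's text.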

Character presets (canonical — see project character bible)

Source: project character bible document.

| Character | Voice profile | speed | pitch | emotion | Default interjections | voice_id slot |
|---|---|---|---|---|---|---|
| Mark (12, inventor) | Energetic boy, jumpy | 1.15 | +2 | happy | (inhale), (chuckle), (gasps), (laughs) | TBD_russian_boy_voice |
| Sonya (≈8, dreamer) | Girl, dreamy, sing-songy | 1.0 | +3 | happy | (chuckle), (humming) | TBD_russian_girl_voice |
| Lyala (toddler, chaos) | Toddler, one-word, emotional | 1.0 | +5 | happy | (chuckle), (gasps), (laughs) | TBD_russian_toddler_voice |
| Mila (37, mom-coach) | Calm, warm, mindful | 0.95 | 0 | calm | (breath), (sighs) | TBD_russian_woman_voice |
| Artur (43, dad-entrepreneur) | Confident, ironic, light humor | 1.0 | -2 | calm | (chuckle), (sighs) | TBD_russian_man_voice |
| Lyubov (65, grandma-coder) | Warm, explanatory, patient | 0.95 | 0 | calm | (breath) | TBD_russian_grandma_voice |
| Korney (70, grandpa-quirk) | Mysterious smile, storyteller | 0.9 | -3 | calm | (chuckle), (humming), (sighs) | TBD_russian_grandpa_voice |
| Prapraded (1841, nobleman) | Archaic speech, surprised-slow | 0.9 | -1 | surprised | (gasps), (inhale), (sighs) | TBD_russian_aristocrat_voice |
| narrator (default) | Neutral narrator | 1.0 | 0 | calm | none | TBD_russian_narrator_voice |

voice_id slots are placeholders — the user supplies the real ID (system preset or cloned voice). Always emit the slot as a placeholder string TBD_<role>_voice so it's grep-able.
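
As a rough sketch, the preset table could be held as a lookup keyed by character name. The values below are copied from the table; the `Preset` class, the `PRESETS` dictionary and `resolve_preset` are hypothetical names chosen for illustration, and the sketch always takes the "default to narrator" branch rather than asking the user.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Preset:
    speed: float
    pitch: int
    emotion: str
    interjections: tuple
    voice_id: str  # placeholder slot, replaced by the user with a real ID


# Values copied from the preset table; the dictionary layout is illustrative.
PRESETS = {
    "mark": Preset(1.15, 2, "happy", ("(inhale)", "(chuckle)", "(gasps)", "(laughs)"), "TBD_russian_boy_voice"),
    "sonya": Preset(1.0, 3, "happy", ("(chuckle)", "(humming)"), "TBD_russian_girl_voice"),
    "lyala": Preset(1.0, 5, "happy", ("(chuckle)", "(gasps)", "(laughs)"), "TBD_russian_toddler_voice"),
    "mila": Preset(0.95, 0, "calm", ("(breath)", "(sighs)"), "TBD_russian_woman_voice"),
    "artur": Preset(1.0, -2, "calm", ("(chuckle)", "(sighs)"), "TBD_russian_man_voice"),
    "lyubov": Preset(0.95, 0, "calm", ("(breath)",), "TBD_russian_grandma_voice"),
    "korney": Preset(0.9, -3, "calm", ("(chuckle)", "(humming)", "(sighs)"), "TBD_russian_grandpa_voice"),
    "prapraded": Preset(0.9, -1, "surprised", ("(gasps)", "(inhale)", "(sighs)"), "TBD_russian_aristocrat_voice"),
    "narrator": Preset(1.0, 0, "calm", (), "TBD_russian_narrator_voice"),
}


def resolve_preset(request: str) -> Preset:
    """Return the preset whose name appears as a whole word in the request."""
    words = set(re.findall(r"[a-z]+", request.lower()))
    for name, preset in PRESETS.items():
        if name in words:
            return preset
    return PRESETS["narrator"]  # no character named: fall back to the default


print(resolve_preset("Make a minimax json for Mark: hi"))
print(resolve_preset("Voice this via minimax: hello").voice_id)
```

Matching on whole words keeps a request containing "remark" from selecting Mark by accident.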

Inline cheatsheet (fallback if the API reference is unavailable)

Models: speech-2.8-hd / speech-2.8-turbo — required for interjections. speech-2.6-* — only path to whisper/fluent emotions.
Pause: <#x#> where x ∈ [0.01, 99.99] sec.
Interjections (2.8 only): (laughs) (chuckle) (coughs) (clear-throat) (groans) (breath) (pant) (inhale) (exhale) (gasps) (sniffs) (sighs) (snorts) (burps) (lip-smacking) (humming) (hissing) (emm) (sneezes).
Emotions: happy, sad, angry, fearful, disgusted, surprised, calm, fluent, whisper.
Ranges: speed 0.5–2.0, vol >0–10, pitch -12..+12.
Audio: sample_rate 32000 + bitrate 128000 + format mp3 + channel 1 — safe defaults.
Text limit: 10 000 chars; >3 000 → use streaming.
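
The cheatsheet limits lend themselves to a pre-flight check before a request is sent. The sketch below is one possible reading of them, not an official validator: `validate` is a hypothetical helper, "vol >0–10" is interpreted as 0 < vol ≤ 10, and model families are recognised by name prefix only.

```python
import re

INTERJECTIONS = {
    "laughs", "chuckle", "coughs", "clear-throat", "groans", "breath", "pant",
    "inhale", "exhale", "gasps", "sniffs", "sighs", "snorts", "burps",
    "lip-smacking", "humming", "hissing", "emm", "sneezes",
}
EMOTIONS = {"happy", "sad", "angry", "fearful", "disgusted",
            "surprised", "calm", "fluent", "whisper"}
PAUSE = re.compile(r"<#(\d+(?:\.\d+)?)#>")
TAG = re.compile(r"\(([a-z-]+)\)")


def validate(request: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    model = request.get("model", "")
    text = request.get("text", "")
    voice = request.get("voice_setting", {})

    if not 0.5 <= voice.get("speed", 1.0) <= 2.0:
        problems.append("speed must be within 0.5-2.0")
    if not 0 < voice.get("vol", 1.0) <= 10:
        problems.append("vol must be greater than 0 and at most 10")
    if not -12 <= voice.get("pitch", 0) <= 12:
        problems.append("pitch must be within -12..+12")

    emotion = voice.get("emotion", "calm")
    if emotion not in EMOTIONS:
        problems.append(f"unknown emotion: {emotion}")
    elif emotion in {"whisper", "fluent"} and not model.startswith("speech-2.6"):
        problems.append(f"{emotion} needs a speech-2.6-* model")

    for value in PAUSE.findall(text):
        if not 0.01 <= float(value) <= 99.99:
            problems.append(f"pause out of range: <#{value}#>")

    # Only known tags count; other parentheticals may be the user's own text.
    used = {tag for tag in TAG.findall(text) if tag in INTERJECTIONS}
    if used and not model.startswith("speech-2.8"):
        problems.append("interjections need a speech-2.8 model")

    if len(text) > 10_000:
        problems.append("text exceeds 10 000 characters")
    elif len(text) > 3_000 and not request.get("stream", False):
        problems.append("text over 3 000 characters: set stream to true")
    return problems


request = {
    "model": "speech-2.6-hd",
    "text": "(chuckle) Hello <#0.3#> there",
    "voice_setting": {"speed": 1.0, "vol": 1.2, "pitch": 0, "emotion": "calm"},
}
print(validate(request))
```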

Output format (exact)

```json
{
  "model": "speech-2.8-hd",
  "text": "<marked-up text>",
  "stream": false,
  "language_boost": "Russian",
  "voice_setting": {
    "voice_id": "TBD_<role>_voice",
    "speed": <num>,
    "vol": 1.2,
    "pitch": <int>,
    "emotion": "<emotion>"
  },
  "audio_setting": {
    "sample_rate": 32000,
    "bitrate": 128000,
    "format": "mp3",
    "channel": 1
  }
}
```
notes: <one short tuning hint — e.g. "if it sounds too flat — pitch +1, speed +0.05">

No headers, no commentary outside the JSON block and the notes: line.
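
For illustration, the exact output shape above (one fenced JSON block followed by one notes: line) could be assembled as follows. `build_request` and `render` are hypothetical helper names; the fixed vol of 1.2 and the audio settings simply mirror the template.

```python
import json

FENCE = "`" * 3  # built at runtime so this listing's own code fence stays intact

SAFE_AUDIO = {"sample_rate": 32000, "bitrate": 128000, "format": "mp3", "channel": 1}


def build_request(text, voice_id, speed, pitch, emotion,
                  language_boost="Russian", model="speech-2.8-hd"):
    """Assemble the request dict in the same key order as the template."""
    return {
        "model": model,
        "text": text,
        "stream": False,
        "language_boost": language_boost,
        "voice_setting": {
            "voice_id": voice_id,
            "speed": speed,
            "vol": 1.2,
            "pitch": pitch,
            "emotion": emotion,
        },
        "audio_setting": dict(SAFE_AUDIO),
    }


def render(request, note):
    """Return the fenced JSON block followed by a single notes: line."""
    # ensure_ascii=False keeps Cyrillic text readable in the emitted JSON.
    body = json.dumps(request, ensure_ascii=False, indent=2)
    return f"{FENCE}json\n{body}\n{FENCE}\nnotes: {note}"


req = build_request("(inhale) Hi!", "TBD_russian_boy_voice", 1.15, 2, "happy", "English")
print(render(req, "if it sounds too flat, pitch +1, speed +0.05"))
```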

Live example

User: "MiniMax voice for Mark: Hi! I just built a courier drone and it can deliver sandwiches!"

Output:

{
  "model": "speech-2.8-hd",
  "text": "(inhale) Hi! <#0.2#> I just built a courier drone (chuckle) and it can deliver sandwiches!",
  "stream": false,
  "language_boost": "English",
  "voice_setting": {
    "voice_id": "TBD_russian_boy_voice",
    "speed": 1.15,
    "vol": 1.2,
    "pitch": 2,
    "emotion": "happy"
  },
  "audio_setting": {
    "sample_rate": 32000,
    "bitrate": 128000,
    "format": "mp3",
    "channel": 1
  }
}

notes: if it sounds too grown-up — pitch +3; if too shrill — pitch +1 and speed 1.1.

Special: long-form ("full intro line") workflow

If the user asks for a full character intro (>20 sec / multi-paragraph), don't cram into one JSON — split into 2–3 takes by emotional beat (e.g. greeting / show-and-tell / outro), output each as a separate JSON block. Each take should be ≤ 30 sec of speech.
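
A rough way to sketch this split in Python is to group paragraphs greedily until the estimated duration would pass 30 seconds. Everything here is an assumption made for illustration: paragraphs stand in for emotional beats, the 14 characters-per-second speaking rate is a guess rather than a measured MiniMax figure, and the function names are hypothetical.

```python
def estimate_seconds(text: str, speed: float = 1.0,
                     chars_per_second: float = 14.0) -> float:
    """Very rough duration estimate; chars_per_second is an assumed rate."""
    return len(text) / (chars_per_second * speed)


def split_into_takes(paragraphs, speed: float = 1.0, max_seconds: float = 30.0):
    """Greedily merge consecutive paragraphs while a take stays under the cap."""
    takes, current = [], ""
    for paragraph in paragraphs:
        candidate = f"{current} {paragraph}".strip()
        if current and estimate_seconds(candidate, speed) > max_seconds:
            takes.append(current)  # close the take before it runs too long
            current = paragraph
        else:
            current = candidate
    if current:
        takes.append(current)
    # Note: a single paragraph longer than max_seconds is kept as one take;
    # splitting inside a paragraph would need a human call on the beat.
    return takes


intro = ["Greeting. " * 30, "Show and tell. " * 20, "Outro. " * 40]
for number, take in enumerate(split_into_takes(intro), start=1):
    print(number, round(estimate_seconds(take), 1), "sec")
```

Each resulting take would then go through the normal protocol and be emitted as its own JSON block.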

Cross-skills

For shot-list and scenario work — [[creative-direction-skills|/brief-to-scenario]] and [[aoc-application-skills|/generation-prompts]].
API reference: MiniMax T2A voice format documentation.
