# Witbitz Agent Manifest — author spec

This is the spec for the **agent manifest** (the registry record) third parties submit to run their own AI agent on Witbitz.

A manifest is a JSON object. The control plane validates it (`control-plane/src/lib/agentRecord.mjs` — the source of truth), stores it in the registry, and on each launch **bakes it into the agent container's environment** (`taskenv.mjs`) so the one shared image (`Dockerfile`) runs as *your* agent: your name, persona, wake words, tile, and tools.

> One image, many agents. You don't ship code or a container — you ship a manifest. The image is generic;
> the manifest configures it per launch.

Related: [affordance-contract.md](affordance-contract.md) (how a tool result reaches the call), [third-party-mcp-agents.md](third-party-mcp-agents.md) (the program, identity, metering, revenue share), [agent-actions.md](agent-actions.md) (the in-call menu the agent publishes).

---

## How a manifest becomes a running agent

```
your manifest ──POST /agents──▶ validateAgentRecord ──▶ registry (DynamoDB)
                                                        │
launch (coupon) ──▶ /launch-agent ──▶ identityEnvFromRecord ──▶ task env ──▶ ECS Fargate
                                      + per-tool catalog        (WAKE_WORD,     runs the generic
                                                                 AGENT_PERSONA,  image AS your agent
                                                                 AGENT_TOOLS…)
```

- **Identity** (name/persona/wake/tile…) → env the agent reads at boot.
- **Tools** (`declaredTools`) → `AGENT_TOOLS` JSON the agent offers the LLM as function-calling tools and routes through the metered `/tool` gateway (or, operator-only, spawns locally).

---

## Top-level fields

| Field | Type | Req | Rules / default |
|---|---|---|---|
| `agentId` | string | ✓ | lowercase slug `^[a-z0-9][a-z0-9-]{0,63}$`. Stable id; the launch key. |
| `author` | string | ✓ | Your keyless author identity (`a_` from your coupon). **Injected server-side** for self-serve — don't set it by hand. |
| `defaultLang` | string | – | 2-letter (`en` default). Your agent's own default language — shown **first** in the marketplace and the fallback for every localized field. See [Localization](#localization-one-manifest-many-languages). |
| `display.name` | string \| {lang:text} | ✓ | ≤80 chars. The tile/menu label. **Localizable** (see Localization). |
| `display.emoji` | string | – | ≤8 codepoints. The tile's avatar (when `runtime.tile`). |
| `display.tagline` | string \| {lang:text} | – | ≤280 chars. One-line description in the catalog. **Localizable**. |
| `disclosure` | string | – | ≤500. Shown to humans: what the agent does with their audio/data. |
| `media` | enum | – | `audio` (default) · `video` · `text`. |
| `status` | enum | – | `draft` (default) · `active` · `suspended`. Only `active` launches. |
| `requiredKeys` | string[] | – | BYOK env-var names the agent/tools need (e.g. `OPENAI_API_KEY`). Each must be a safe, non-reserved env name. |
| `caps` | object | – | `{ dailyUsd, perSessionUsd }` spend caps. |
| `declaredTools` | object[] | – | The agent's tools (see below). Absent → a chat/voice agent with no tools. |
| `runtime` | object | – | Behavior + wake/tile config (see below). |
| `settings` | object[] | – | Launch-time knobs the user fills; each value is templated into the persona via `{{key}}` (see below). |

`comped` and `sealedDefaultKeys` are **operator-only** (billing exemption / custodied default keys) and are ignored from self-serve submissions.

---

## Localization (one manifest, many languages)

A manifest serves every language from **one record**. A **localized field** is either a plain string (your agent's one language) **or** a per-language map of 2-letter codes to text:

```json
"display": { "name": { "en": "Expert", "he": "מומחה", "es": "Experto" } }
```

**Localizable fields:** `display.name`, `display.tagline`, `runtime.persona`, `runtime.greeting`, `runtime.wakeWord`. (Per-tool `pending` notices are also a string-or-map.)
Everything else — `musicStyle`, `wakePrefix`, `disclosure`, ids, prices — is single-value.

- **`defaultLang`** (top-level, default `en`) is your agent's own language: it's shown **first** in the marketplace, and it's the fallback for any viewer/session language you didn't translate. A plain string is treated as your `defaultLang`.
- **At launch**, each localized field resolves to the **session language** (`field[sessionLang] ?? your default`) — so your agent's **in-call name, greeting and wake word follow the language of the call**.
- **In the catalog**, the headline is your `defaultLang`; the public `name`/`tagline` maps are returned so a localized marketplace can switch. (`persona`/`greeting`/`wakeWord` stay server-side — never published.)

### THE RULE — completeness (a manifest is rejected if it breaks it)

> If your manifest **declares** a language anywhere, that language must be defined in **every** localized
> field. Half-translated manifests are an **error**, not a silent fallback.

A manifest declares a language as the union of the languages used across its localized fields, plus `defaultLang`. So if `display.name` has `{en, he}`, then `tagline`, `persona`, `greeting`, and `wakeWord` (any you include) must **each** define both `en` and `he`. A single-language agent declares only its `defaultLang` → trivially complete. (Validated by `enforceLocalized` in `agentRecord.mjs`.)

```jsonc
// ✗ REJECTED — name declares he+en, but greeting is missing "he"
{
  "defaultLang": "en",
  "display": { "name": { "en": "Expert", "he": "מומחה" } },
  "runtime": { "greeting": { "en": "Hi there" } }
}
// error: runtime.greeting missing "he"
```

---

## `runtime` — behavior, voice & presence

All optional; each absent → the image default. Sanitized (they become container env / prompt text).

| Field | Type | Rules | Becomes env |
|---|---|---|---|
| `persona` | string \| {lang:text} | ≤2000 | `AGENT_PERSONA` — the system prompt voice. **Localizable**. |
| `greeting` | string \| {lang:text} | ≤500 | `AGENT_GREETING` — spoken/chatted on join. **Localizable**. |
| `lang` | string | 2-letter (`he`,`en`,…) | `AGENT_LANG` — STT + menu language. (Not the same as `defaultLang`: this is the agent's *runtime* STT/menu language; the session usually overrides it.) |
| `speak` | bool | | `SPEAK` — TTS replies into the call. |
| `songMs` | int | 3000–600000 | `SONG_MS` — song length (song agents). |
| `wakeWord` | string \| {lang:text} | ≤40 | `WAKE_WORD` — the agent's name, matched as `hey <name>`. **Localizable** (so "hey Expert" / "היי מומחה" each match in their language). |
| `wakePrefix` | string | ≤60, comma-list | `WAKE_PREFIX` — Siri-style prefix, e.g. `hey,היי,הי`. **With a prefix the bare word never triggers** (kills false triggers from passing mentions). |
| `wakeGeneric` | string | ≤40 | `WAKE_GENERIC` — a shared friendly wake (e.g. `חבר`/friend) every agent answers to **when it's the only agent**; auto-suppressed with ≥2 agents (then use the name), and the display name shrinks "friend · name" → "name". |
| `wakeRegex` | string | ≤200, must compile | `WAKE_REGEX` — a hand-tuned name matcher (the *core*, after the prefix), overriding the auto-derived one. Use when STT mis-transcribes your name (validated; an invalid regex is rejected at submit). |
| `tile` | bool | | `AGENT_TILE` — show a real tile with the emoji avatar (vs. an unseen kibitzer). |
| `smartTurn` | bool | | `SMART_TURN` — semantic endpointing (merge a mid-thought pause into one turn). |

**Wake-word reliability (important for voice agents):** the STT mis-transcribes distinctive/uncommon names many ways (a Hebrew example: `פרשן`→פושן/פאשן/פונצ׳ן). Prefer a **common, clearly-articulated** word as the name, always set a `wakePrefix`, and supply a `wakeRegex` for stubborn names. The generic `wakeGeneric` ("hey friend") is the most reliable trigger when your agent is solo.
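
To make the wake rules above concrete, here is a minimal sketch of how `wakeWord`, `wakePrefix` and `wakeRegex` could combine into one matcher. It is an illustration only: `buildWakeMatcher` and `escapeRegex` are hypothetical names, and the image's real matcher (including how it auto-derives the name core) is not shown in this spec and may differ.

```javascript
// Illustration only: a hypothetical matcher, not the image's actual implementation.

// Escape regex metacharacters so a plain name or prefix is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Combine the three manifest fields into a single RegExp.
function buildWakeMatcher({ wakeWord, wakePrefix = '', wakeRegex = '' }) {
  // wakeRegex, when given, replaces the name core that follows the prefix.
  const core = wakeRegex || escapeRegex(wakeWord);
  const prefixes = wakePrefix
    .split(',')
    .map((p) => p.trim())
    .filter(Boolean)
    .map(escapeRegex);
  // With a prefix list, the name only matches when a prefix precedes it.
  const lead = prefixes.length ? `(?:${prefixes.join('|')})[\\s,]+` : '';
  // Unicode-aware word edges, so Hebrew and Latin names behave the same way.
  const edge = '[\\p{L}\\p{N}]';
  return new RegExp(`(?<!${edge})${lead}(?:${core})(?!${edge})`, 'iu');
}

const meteo = buildWakeMatcher({ wakeWord: 'Meteo', wakePrefix: 'hey,היי,הי' });
console.log(meteo.test('hey Meteo, will it rain?')); // true
console.log(meteo.test('I asked Meteo yesterday'));  // false: bare name, no prefix
```

Passing `wakeRegex: 'meteo|mateo|medio'` in the same call would accept common mis-transcriptions of the name while still requiring the prefix.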

---

## `declaredTools[]` — tools the agent can call

Each tool is offered to the agent's LLM as a function-calling tool (name + description + JSON-Schema), and its result is delivered into the call via the **affordance contract**.

| Field | Type | Req | Rules |
|---|---|---|---|
| `name` | string | ✓ | lowercase, unique within the manifest. |
| `description` | string | – | ≤500. **The LLM reads this** to decide when to call — write it for the model. |
| `inputSchema` | object | – | A JSON-Schema object (size-bounded). The tool's arguments. |
| `transport` | enum | – | `http` (default) · `mcp` · `stdio`. `http`/`mcp` = the gateway calls your **URL**; `stdio` = the agent **spawns a local server** (operator-only — see security). |
| `endpoint` | string | ✓* | A **public https URL** (required for `http`/`mcp`). SSRF-guarded: no private/loopback/metadata hosts. |
| `keyName` | string | – | One of `requiredKeys` — the key the gateway presents upstream. |
| `priceCents` | int | ✓* | ≥0. Per-call price (required for `http`/`mcp`). |
| `byokPriceCents` | int | – | ≥0, default 0. Price when the caller brought their own key. |
| `timeoutMs` | int | – | 1000–28000. |
| `direct` | bool | – | `true` → not offered to the LLM (invoked directly by the agent), default false. |

### Affordance fields (how the result reaches the call)

The image owns a **closed surface vocabulary** — `chat`, `speak`, `audio`, `image`, `tile`, `screen`, `file`, `map`, `widget` (`SURFACE_WIRED`; `menu` is declared but not yet wired). Your tool binds its output to one. Surfaces are **medium-typed**: an image/file targets its own surface, never `chat` (which carries text). See [affordance-contract.md](affordance-contract.md) for the full model.

| Field | Type | Rules |
|---|---|---|
| `result` | enum | `text` (default) · `audio` · `image` · `file` · `map` · `widget` · `none` — the modality your tool returns. |
| `surface` | enum/array | where it lands. Allowed per result: `text`→`chat`/`speak`, `audio`→`audio`/`file` (a clip that plays AND downloads), `image`→`image`/`tile`/`screen`, `file`→`file`, `map`→`map`, `widget`→`widget`, `none`→`menu`. |
| `resultPath` | string | dot-path into the tool's JSON result, e.g. `content.0.resource.blob` (`^[A-Za-z0-9_.]{1,128}$`). |
| `invoke` | enum | `brain` (the LLM decides via function-calling — default/recommended) · `intent` (deterministic wake-routed). |
| `intent` | slug | required for `invoke:intent` (`^[a-z][a-z0-9_-]{0,31}$`). |
| `caption` | string | ≤200. Chat caption shown alongside a non-text result. |
| `pending` | string \| {lang:text} | ≤300 each. A "one moment…" notice shown while the tool runs. |
| `perceive` | enum/array | `image` · `file` — an INPUT the loop SUPPLIES to your tool (the bytes a participant shared), so the model never handles base64. e.g. `read_pdf` declares `perceive:["file"]`. |
| `fileMime` / `fileName` | string | ≤100. For a `file` surface, the mime / name the posted file downloads with (the singer's audio → `audio/mpeg`, `song.mp3`). |
| `menu` | object | A menu action for this tool (see below) — puts it in the in-call menu AND the agent's capability list. |

#### `menu` — surface a tool as an in-call menu action

A tool that should appear in the agent's in-call menu (and in the capability list the LLM is told it can act on) declares a `menu` block. So a painter shows **"Paint the conversation"** and a music agent **"Summarize with a song"** — each agent's menu reflects *its own* tools, instead of a fixed list.

| Field | Type | Req | Rules |
|---|---|---|---|
| `id` | slug | – | `^[a-z][a-z0-9_-]{0,31}$`. The action id + its `/slash` command (e.g. `paint` → `/paint`). Defaults to the tool name. |
| `label` | string \| {lang:text} | ✓ | The menu button text. **Localizable**. |
| `request` | string | ✓ | ≤500. The prompt a tap / slash routes to the brain (which then calls the tool). Language-neutral. |
| `desc` | string \| {lang:text} | – | A one-line description under the label. **Localizable**. |
| `voice` | string \| {lang:text} | – | A short spoken phrase shown in the hint that tells users what to say. **Localizable**. |

### `stdio` tools — the bundled local MCP servers

`transport:stdio` lets the agent spawn a local MCP server **inside the container** (instead of the gateway calling a URL). Because that runs a program on our box, the `command` must be one of a fixed, operator-baked, **vetted allow-list** of bundled servers — any designer may declare those; an arbitrary command stays **operator-only** (the RCE guard).

See **[bundled-mcp-tools.md](./bundled-mcp-tools.md)** for the full catalog (Wikipedia, DuckDuckGo, calculator, memory, weather, Brave/Tavily/Exa search, Firecrawl, Google Maps, …), how to declare them, and the security model.

Fields:

| Field | Rules |
|---|---|
| `name` | must equal the **MCP server's own tool name** (e.g. `search`, `search_wikipedia`, `get_weather`, `calculate`) — that's the name the agent calls over the wire. |
| `command` | a **bare** program name from the bundled allow-list (`^[A-Za-z0-9._-]{1,64}$`, no path/shell). Anything not baked + allow-listed is operator-only. |
| `args` | ≤32 strings, each ≤256 chars. |
| `env` | `{ NAME: value }` — **non-secret config only**. Names are gated by `isSafeKeyName` (no `AWS_*`/`LD_*`/`PATH`/`NODE_OPTIONS`/`PYTHON*`/`UV_*`/`WITBITZ_*`/`AGENT_*`…), ≤32 entries, values ≤512. |

For a **key-required** bundled server (Brave/Tavily/Exa/Firecrawl/Google Maps), add its key (e.g. `BRAVE_API_KEY`) to your `requiredKeys` — the server reads it from the agent's inherited env (cloud creds + session secrets are stripped from the child). Keyless servers (Wikipedia, DuckDuckGo, calculator, memory, weather) need nothing. stdio tools take no `endpoint`/`price` — they don't traverse the gateway.
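
The field rules in the stdio table can be checked locally before submitting. The sketch below is a hypothetical pre-submit helper, not the control plane's validator: `checkStdioTool` is an invented name, `BLOCKED_ENV` only approximates `isSafeKeyName` using the name patterns listed above (the real list is longer), and `example-wiki-mcp` is a made-up command, since the real allow-list lives in bundled-mcp-tools.md.

```javascript
// Hypothetical pre-submit check; the authoritative validation is in agentRecord.mjs.
const COMMAND_RE = /^[A-Za-z0-9._-]{1,64}$/;

// Partial stand-in for isSafeKeyName: only the name patterns listed in the table above.
const BLOCKED_ENV = [
  /^AWS_/, /^LD_/, /^PATH$/, /^NODE_OPTIONS$/, /^PYTHON/, /^UV_/, /^WITBITZ_/, /^AGENT_/,
];

// Returns a list of human-readable problems; an empty list means the checks passed.
function checkStdioTool(tool, allowList) {
  const errors = [];
  if (typeof tool.command !== 'string' || !COMMAND_RE.test(tool.command)) {
    errors.push('command: must be a bare program name');
  } else if (!allowList.has(tool.command)) {
    errors.push('command: not on the bundled allow-list (operator-only)');
  }
  const args = tool.args ?? [];
  if (!Array.isArray(args) || args.length > 32 ||
      args.some((a) => typeof a !== 'string' || a.length > 256)) {
    errors.push('args: at most 32 strings of at most 256 chars');
  }
  const env = Object.entries(tool.env ?? {});
  if (env.length > 32) errors.push('env: at most 32 entries');
  for (const [name, value] of env) {
    if (BLOCKED_ENV.some((re) => re.test(name))) errors.push(`env.${name}: reserved name`);
    if (String(value).length > 512) errors.push(`env.${name}: value longer than 512 chars`);
  }
  return errors;
}

// "example-wiki-mcp" is a made-up command name used only for this demo.
const allowList = new Set(['example-wiki-mcp']);
const tool = {
  name: 'search_wikipedia',
  transport: 'stdio',
  command: 'example-wiki-mcp',
  args: ['--lang', 'en'],
};
console.log(checkStdioTool(tool, allowList)); // []
```

A path such as `./run.sh` fails the bare-name pattern, and a syntactically valid name that is not on the list is reported as operator-only, which mirrors the two cases the table distinguishes.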

---

## `settings[]` — launch-time knobs that shape the persona

Let the **person launching** your agent customize it, without you writing code. Each setting is a typed input shown on the launch form; its value is **substituted into your persona** wherever you write `{{key}}`. This is the one lever a manifest-only agent has over its own behavior (the shared image reads a fixed env set, so a custom setting feeds the **system prompt**, not an arbitrary env).

| Field | Type | Req | Rules |
|---|---|---|---|
| `key` | string | ✓ | lowercase slug `^[a-z][a-z0-9_]{0,31}$`. Used as the `{{key}}` placeholder in `runtime.persona`. Unique. |
| `type` | enum | – | `text` (default) · `textarea` · `number` · `select` · `color` · `emoji`. |
| `label` | string | – | ≤80. The field label on the form. |
| `default` | string\|number | – | Used when the user doesn't set it. |
| `options` | (string\|{value,label})[] | ✓* | Required for `select` — the choices (≤32). |
| `help` | string | – | ≤200. A hint under the field. |
| `maxLength` | int | – | 1–2000, for `text`/`textarea`. |

At launch each `{{key}}` is replaced by the user's value → else the setting's `default` → else empty. A `{{x}}` that isn't a declared setting is left as literal text. (At most 16 settings.)

```json
{
  "agentId": "tutor",
  "runtime": {
    "persona": "You are a patient tutor for {{subject}} at a {{level}} level. Keep it simple."
  },
  "settings": [
    { "key": "subject", "type": "text", "label": "Subject", "default": "math" },
    { "key": "level", "type": "select", "label": "Level", "default": "beginner",
      "options": ["beginner", "intermediate", "advanced"] }
  ]
}
```

A launcher who picks *subject = physics, level = advanced* gets the system prompt *"You are a patient tutor for physics at a advanced level. Keep it simple."* Substitution is literal (the article "a" is not adjusted to "an"), so word the persona so that every option reads naturally.

---

## Lifecycle

1. **Register** — `POST /agents` with your manifest (auth = your coupon-derived author identity). It lands as `draft`.
2. **Activate** — flip `status:'active'` (subject to review for the catalog). Only `active` agents launch.
3. **Launch** — a human funds a session with a coupon → `/launch-agent` bakes your manifest into a Fargate task. New launches pick up manifest changes immediately (no image rebuild — the image is generic).
4. **Kill switch** — `status:'suspended'` blocks all launches instantly.

See [third-party-mcp-agents.md](third-party-mcp-agents.md) for identity, metering, and revenue share.

---

## Complete example (a voice agent with a metered HTTP tool)

```json
{
  "agentId": "acme-weather",
  "display": { "name": "Meteo", "emoji": "🌦️", "tagline": "Live forecasts, right in your call." },
  "media": "audio",
  "disclosure": "An AI agent from witbitz.chat is on this call. It listens and transcribes; audio goes to OpenAI & ElevenLabs.",
  "requiredKeys": ["OPENAI_API_KEY", "ELEVENLABS_API_KEY", "WEATHER_API_KEY"],
  "declaredTools": [
    {
      "name": "forecast",
      "description": "Get the weather forecast for a place. Call when someone asks about weather.",
      "inputSchema": {
        "type": "object",
        "properties": { "place": { "type": "string" }, "when": { "type": "string" } },
        "required": ["place"]
      },
      "transport": "http",
      "endpoint": "https://acme.example/mcp/forecast",
      "keyName": "WEATHER_API_KEY",
      "priceCents": 2,
      "byokPriceCents": 0,
      "result": "text",
      "surface": "speak",
      "invoke": "brain",
      "pending": { "en": "Checking the forecast…", "he": "בודק את התחזית…" }
    }
  ],
  "runtime": {
    "persona": "You are Meteo, a concise, upbeat weather companion on a live call. Speak only when addressed; give one clear, useful forecast.",
    "greeting": "Hi, I'm Meteo — say 'hey Meteo' or 'hey friend' for a forecast.",
    "lang": "en",
    "speak": true,
    "wakeWord": "Meteo",
    "wakePrefix": "hey",
    "wakeGeneric": "friend",
    "tile": true,
    "smartTurn": true
  }
}
```

This runs the generic image as **Meteo 🌦️**: a tiled, English voice agent that wakes on "hey Meteo" / "hey friend", endpoints with smart-turn, and offers the LLM a metered `forecast` tool whose text result is spoken into the call (`result: text`, `surface: speak`).
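
Two launch-time rules from the earlier sections, language resolution (`field[sessionLang] ?? your default`) and `{{key}}` substitution into the persona, can be sketched together as follows. Both helpers are hypothetical names written from the rules as stated in this spec; the control plane's real code is not shown here and may differ, for example in how it treats an empty user value.

```javascript
// Hypothetical helpers that mirror the documented rules; not the control plane's code.

// A localized field is either a plain string or a { lang: text } map.
function resolveLocalized(field, sessionLang, defaultLang = 'en') {
  if (typeof field === 'string') return field; // a plain string serves every session
  if (!field) return undefined;                // absent field: the image default applies
  return field[sessionLang] ?? field[defaultLang];
}

// Replace each declared {{key}} with the user's value, else the default, else empty.
function renderPersona(persona, settings = [], values = {}) {
  const declared = new Map(settings.map((s) => [s.key, s]));
  return persona.replace(/\{\{([a-z][a-z0-9_]{0,31})\}\}/g, (placeholder, key) => {
    if (!declared.has(key)) return placeholder; // undeclared placeholders stay literal
    return String(values[key] ?? declared.get(key).default ?? '');
  });
}

const name = { en: 'Expert', he: 'מומחה' };
console.log(resolveLocalized(name, 'he', 'en')); // מומחה
console.log(resolveLocalized(name, 'fr', 'en')); // Expert (falls back to defaultLang)

const settings = [
  { key: 'subject', type: 'text', default: 'math' },
  { key: 'level', type: 'select', default: 'beginner', options: ['beginner', 'advanced'] },
];
const persona = 'You are a patient tutor for {{subject}} at a {{level}} level. {{unknown}}';
console.log(renderPersona(persona, settings, { subject: 'physics' }));
// You are a patient tutor for physics at a beginner level. {{unknown}}
```

The sketch uses `??` so that only a missing value falls through to the default; whether an empty string typed by the user counts as "set" is an assumption here.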