Realtime voice — Kataleptic Docs

Quickstart

Open a WebSocket, configure the session, send a bare response.create to make the agent speak first, then stream microphone audio in and play audio deltas out. That is the whole loop.

// Browser / Cloudflare Workers — auth via subprotocol
const ws = new WebSocket(
  "wss://api.kataleptic.com/v1/realtime?model=kataleptic-realtime",
  ["realtime", "openai-insecure-api-key." + KATALEPTIC_API_KEY],
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are the booking agent for a small hotel. " +
                    "Open by greeting the caller and asking how you can help.",
      turn_detection: { type: "server_vad", silence_duration_ms: 400 },
    },
  }));
  // Bare response.create → the agent speaks its opening line.
  ws.send(JSON.stringify({ type: "response.create" }));
};

ws.onmessage = (e) => {
  const ev = JSON.parse(e.data);
  if (ev.type === "response.output_audio.delta") {
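    // atob() returns a binary string, so convert it to bytes before playback.
    // playPcm16 is your own playback helper.
    const pcm16 = Uint8Array.from(atob(ev.delta), (c) => c.charCodeAt(0));
    playPcm16(pcm16);                     // PCM16 mono @ 24 kHz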
  } else if (ev.type === "response.output_audio_transcript.done") {
    console.log("agent said:", ev.transcript);
  }
};

// Stream microphone audio as base64-encoded PCM16:
function sendChunk(base64Pcm16) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Pcm16 }));
}

# Python server: auth via Authorization header
import asyncio, base64, json, os
import websockets

URL = "wss://api.kataleptic.com/v1/realtime?model=kataleptic-realtime"

async def main():
    async with websockets.connect(
        URL,
        additional_headers={
            "Authorization": f"Bearer {os.environ['KATALEPTIC_API_KEY']}",
        },
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": (
                    "You are the booking agent for a small hotel. "
                    "Open by greeting the caller and asking how you can help."
                ),
                "turn_detection": {"type": "server_vad", "silence_duration_ms": 400},
            },
        }))
        # Bare response.create → the agent speaks its opening line.
        await ws.send(json.dumps({"type": "response.create"}))

        async for raw in ws:
            ev = json.loads(raw)
            if ev["type"] == "response.output_audio.delta":
                pcm16 = base64.b64decode(ev["delta"])   # PCM16 mono @ 24 kHz
                # Hand pcm16 to your audio output here.
            elif ev["type"] == "response.output_audio_transcript.done":
                print("agent said:", ev["transcript"])

asyncio.run(main())
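The Python example above only receives audio. A minimal sketch of the sending side, assuming 24 kHz mono PCM16 input and a 20 ms frame size (the frame size is an illustrative choice, not a documented requirement), could build the input_audio_buffer.append messages like this:

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000   # PCM16 mono, as in the quickstart
BYTES_PER_SAMPLE = 2      # 16-bit samples
FRAME_MS = 20             # assumed frame size; tune for your capture pipeline


def append_events(pcm16: bytes, frame_ms: int = FRAME_MS):
    """Yield one input_audio_buffer.append JSON message per audio frame."""
    frame_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * frame_ms // 1000
    for start in range(0, len(pcm16), frame_bytes):
        chunk = pcm16[start:start + frame_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Inside the `async with` block, each message would then be sent with `await ws.send(msg)` as the microphone produces audio.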

Already on the OpenAI SDK? Unmodified OpenAI realtime clients work as-is — change the host to api.kataleptic.com and keep your code. We speak the GA dialect by default and switch to the beta dialect automatically when your client sends the OpenAI-Beta: realtime=v1 header or the openai-beta.realtime-v1 subprotocol.

Authentication

Three ways to present your dg_… key, in order of preference:

Header — Authorization: Bearer dg_…. Use this from servers.
Query parameter — ?token=dg_… appended to the WebSocket URL, for clients that cannot set headers.
Subprotocol — openai-insecure-api-key.dg_… in the WebSocket subprotocol list, the same convention OpenAI uses for browser and Workers clients. As the name says: only use this with short-lived keys you are comfortable exposing to the client.
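As a sketch, the three options can be expressed as one helper that returns the URL, headers, and subprotocol list to hand to your WebSocket client. The helper name and the `dg_example` key used in the usage note are placeholders, not part of the API:

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.kataleptic.com/v1/realtime"


def connection_params(api_key: str, method: str,
                      model: str = "kataleptic-realtime"):
    """Return (url, headers, subprotocols) for one of the three auth methods."""
    query = {"model": model}
    headers = {}
    subprotocols = []
    if method == "header":            # preferred: use from servers
        headers["Authorization"] = f"Bearer {api_key}"
    elif method == "query":           # for clients that cannot set headers
        query["token"] = api_key
    elif method == "subprotocol":     # browsers / Workers, short-lived keys only
        subprotocols = ["realtime", f"openai-insecure-api-key.{api_key}"]
    else:
        raise ValueError(f"unknown auth method: {method}")
    return f"{BASE_URL}?{urlencode(query)}", headers, subprotocols
```

For example, `connection_params("dg_example", "query")` puts the key in the URL and leaves the headers and subprotocols empty.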

The three tiers

One endpoint, three engines. The model id in ?model= selects the engine; everything else about the protocol stays the same.

Model id	Engine	First audio	Residency	Transcripts	Typical price
`kataleptic-realtime`	Cascade: Whisper STT → chat model → Piper TTS	~250 ms	EU, our own fleet	Exact	≈$0.0133/min
`kataleptic-realtime-hd`	Azure Voice Live (Sweden Central)	~1.2 s	EU	Exact	≈$0.03/min
`gpt-realtime-2`	Native speech-to-speech	~1.0 s	Global routing — not EU-pinned	Model approximation	≈$0.07/min

kataleptic-realtime — the default

A fully EU-resident cascade on our own fleet: streaming Whisper speech-to-text, a catalogue chat model in the middle, Piper text-to-speech on the way out. The brain is swappable per session — pass any catalogue chat model id in ?model= (default mistral-nemo-12b) and the cascade uses it. Ten languages are auto-detected per utterance — EN, DE, FR, ES, NL, SV, DA, IT, FI, RU — and the TTS voice follows the detected language. Server-side VAD with barge-in; speech recognition is noise-gated (Silero VAD plus no-speech and language-probability thresholds), so breathing and background noise do not become turns.

kataleptic-realtime-hd — premium voices

The same WebSocket, served by Azure Voice Live in Sweden Central: 600+ HD neural voices, deep noise suppression, echo cancellation, and semantic turn detection (the model judges whether the caller is done, not just the silence timer). EU-resident processing, exact transcripts.

gpt-realtime-2 — native speech-to-speech

No cascade — one model hears audio and speaks audio. Best prosody and expressiveness of the three; it responds to tone, hesitation, and emphasis, not just words. The trade-offs are real and listed in caveats: inference is globally routed (not EU-pinned) and transcripts are the model's own approximation of what was said.
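One way to reason about the choice is to encode the tier table as data and filter it by your constraints. The sketch below copies the approximate figures from the table; the helper and its parameter names are illustrative:

```python
# Figures copied from the tier table above; first-audio times are approximate.
TIERS = {
    "kataleptic-realtime": {
        "first_audio_s": 0.25, "eu_resident": True,
        "exact_transcripts": True, "usd_per_min": 0.0133,
    },
    "kataleptic-realtime-hd": {
        "first_audio_s": 1.2, "eu_resident": True,
        "exact_transcripts": True, "usd_per_min": 0.03,
    },
    "gpt-realtime-2": {
        "first_audio_s": 1.0, "eu_resident": False,
        "exact_transcripts": False, "usd_per_min": 0.07,
    },
}


def candidate_tiers(require_eu=False, require_exact_transcripts=False,
                    max_first_audio_s=None):
    """Return model ids that meet the constraints, cheapest first."""
    matches = []
    for model_id, tier in TIERS.items():
        if require_eu and not tier["eu_resident"]:
            continue
        if require_exact_transcripts and not tier["exact_transcripts"]:
            continue
        if max_first_audio_s is not None and tier["first_audio_s"] > max_first_audio_s:
            continue
        matches.append(model_id)
    return sorted(matches, key=lambda m: TIERS[m]["usd_per_min"])
```

Voice quality and prosody are not captured here; they remain a judgment call between the HD and speech-to-speech tiers.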

session.update reference

Send session.update as your first message to configure the conversation. The supported subset:

Field	Type	What it does
`instructions`	string	The system prompt. Persona, opening line, guardrails.
`voice`	string	Voice selection. On the default tier the voice follows the detected language; on kataleptic-realtime-hd and gpt-realtime-2, pick a name from that tier's voice catalogue.
`turn_detection.type`	`"server_vad"`	Server-side voice activity detection. The server decides when the caller's turn ends.
`turn_detection.threshold`	number	VAD sensitivity. Higher = needs louder/clearer speech to open a turn.
`turn_detection.prefix_padding_ms`	number	Audio retained from before speech onset, so first syllables are not clipped.
`turn_detection.silence_duration_ms`	number	Trailing silence that ends the turn. Lower = snappier, more interruptions.
`turn_detection.create_response`	boolean	Auto-respond when a turn ends. Set `false` to drive responses yourself with `response.create`.
`turn_detection.interrupt_response`	boolean	Barge-in: caller speech cancels the agent's in-flight reply.
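The dotted names in the table are nested keys under session.turn_detection. A small sketch of a builder that sets every field in the subset follows; the threshold and prefix padding values are illustrative, not documented defaults:

```python
import json


def session_update(instructions, voice=None, **turn_detection):
    """Build a session.update message; turn_detection keys are passed through."""
    session = {"instructions": instructions}
    if voice is not None:
        session["voice"] = voice
    if turn_detection:
        session["turn_detection"] = {"type": "server_vad", **turn_detection}
    return json.dumps({"type": "session.update", "session": session})


msg = session_update(
    "You are the booking agent for a small hotel.",
    threshold=0.5,             # illustrative value, not a documented default
    prefix_padding_ms=300,     # illustrative value
    silence_duration_ms=400,   # same value as the quickstart
    create_response=True,
    interrupt_response=True,
)
```

Send `msg` as the first message on the socket, as in the quickstart.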

Protocol subset

Client events we accept, on every tier:

session.update — configure instructions, voice, turn detection (see above).
input_audio_buffer.append / .commit / .clear — stream caller audio; commit manually if you run your own VAD.
conversation.item.create / .delete / .truncate — edit conversation history, including previous_item_id placement and root insertion.
response.create / response.cancel — request or cancel an agent reply.

Audio is PCM16 at 16 or 24 kHz in both directions on all tiers. kataleptic-realtime-hd and gpt-realtime-2 additionally accept G.711 for telephony (see Telephony / G.711).
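If you detect the end of a caller turn yourself, the client events above combine into an append, commit, response.create sequence. The sketch below assumes the session was configured with turn_detection.create_response set to false; how to switch server VAD off entirely is not covered here, so treat this as an outline of the message order only:

```python
import base64
import json

# Assumption: the session was configured with turn_detection.create_response
# set to false, so the server waits for an explicit response.create.


def manual_turn(pcm16: bytes) -> list:
    """Return the three messages for one caller turn whose end you detected yourself."""
    audio = base64.b64encode(pcm16).decode("ascii")
    return [
        json.dumps({"type": "input_audio_buffer.append", "audio": audio}),
        json.dumps({"type": "input_audio_buffer.commit"}),
        json.dumps({"type": "response.create"}),
    ]
```

In a real bridge the append messages would be streamed as audio arrives, with the commit and response.create sent once your own VAD decides the caller has finished.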

Transcripts & call logging

Both directions of the conversation arrive as text events, which is all you need to build a call log:

Caller side — conversation.item.input_audio_transcription.completed fires once per caller utterance with the final transcript.
Agent side — response.output_audio_transcript.delta streams the agent's words as it speaks; response.output_audio_transcript.done carries the full utterance.

gpt-realtime-2 only: caller transcripts are on by default (we enable input_audio_transcription with whisper-1 for you; override or disable it in session.update). The agent-side transcript is the model's approximation of its own speech, not an exact STT transcript; if your call logs have compliance weight, use the default or HD tier.

On the default tier (kataleptic-realtime), transcription events also carry language and language_probability fields.
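A call log can be assembled from just the two final-transcript events. This sketch assumes the caller-side event carries its text in a `transcript` field, as the agent-side done event does in the quickstart; the entry layout is an arbitrary choice:

```python
import json


def log_entry(raw: str):
    """Return a call-log entry for a final transcript event, or None otherwise."""
    ev = json.loads(raw)
    kind = ev.get("type")
    if kind == "conversation.item.input_audio_transcription.completed":
        # Assumed field name: transcript (mirrors the agent-side event).
        entry = {"speaker": "caller", "text": ev.get("transcript", "")}
        # language and language_probability are only present on some tiers.
        for key in ("language", "language_probability"):
            if key in ev:
                entry[key] = ev[key]
        return entry
    if kind == "response.output_audio_transcript.done":
        return {"speaker": "agent", "text": ev.get("transcript", "")}
    return None  # deltas and unrelated events are not logged
```

Calling `log_entry(raw)` on every incoming message and appending the non-None results gives a two-sided transcript in arrival order.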

Greeting pattern

Phone agents should speak first. Put the opening line in instructions, then send a bare response.create — no conversation items needed:

{ "type": "session.update",
  "session": { "instructions": "Greet the caller: 'Grüß Gott, Hotel Sacher reception.' Then assist." } }

{ "type": "response.create" }

The agent speaks the greeting per its instructions, and the normal turn-taking loop begins from there.

Telephony / G.711

SIP trunks and most PSTN gateways hand you G.711. On kataleptic-realtime-hd and gpt-realtime-2 you can pass it straight through without transcoding:

Beta-dialect flat fields: "input_audio_format": "g711_ulaw" (or "g711_alaw"), same for output.
GA-dialect format objects: {"type": "audio/pcmu"} / {"type": "audio/pcma"}.

The default kataleptic-realtime tier is PCM16-only — transcode at your media gateway if you bridge it to a trunk.
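A sketch of the two session.update shapes follows. The flat beta key for output (`output_audio_format`) and the GA nesting under `session.audio.input.format` / `session.audio.output.format` follow the OpenAI Realtime layout and are assumptions here; only the format values themselves come from the text above:

```python
def g711_session(dialect: str, law: str = "ulaw") -> dict:
    """Build a session.update that selects G.711 in the given dialect."""
    if law not in ("ulaw", "alaw"):
        raise ValueError("law must be 'ulaw' or 'alaw'")
    if dialect == "beta":
        fmt = f"g711_{law}"
        # Assumed key name for the output side: output_audio_format.
        session = {"input_audio_format": fmt, "output_audio_format": fmt}
    elif dialect == "ga":
        mime = "audio/pcmu" if law == "ulaw" else "audio/pcma"
        # Assumed nesting: session.audio.input.format / session.audio.output.format.
        session = {"audio": {"input": {"format": {"type": mime}},
                             "output": {"format": {"type": mime}}}}
    else:
        raise ValueError("dialect must be 'beta' or 'ga'")
    return {"type": "session.update", "session": session}
```

Serialize the result with `json.dumps` and send it before any audio, on kataleptic-realtime-hd or gpt-realtime-2 only.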

Caveats per tier

kataleptic-realtime

Cascade voices are functional, not studio-grade — if voice quality is the product, use HD.
One voice per language; the voice field has limited effect because the voice follows the detected language.

kataleptic-realtime-hd

First audio ~1.2 s — noticeably slower to open than the default tier's ~250 ms.

gpt-realtime-2

Not EU-pinned — inference uses global routing. Do not put it behind a residency commitment.
Agent transcripts are model approximations, not exact STT output (caller transcripts use whisper-1, on by default).

Function calling

All three tiers support OpenAI Realtime function calling. Define tools in session.update (flat realtime shape: {"type": "function", "name", "description", "parameters"}); when the model decides to call one you receive response.function_call_arguments.delta events, a final response.function_call_arguments.done with the JSON arguments, and a function_call item in response.done. Send the result back as a conversation.item.create with {"type": "function_call_output", "call_id", "output"} followed by response.create.

On the default tier the cascade brain issues the tool call. Small models occasionally write a call as prose instead of invoking it; the server strips text that exactly matches a defined tool-call pattern from the spoken audio, so the agent never says "end_call()" aloud.
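The round trip can be sketched as follows. The `check_availability` tool is hypothetical, and the `tools` key in session.update, the `item` wrapper on conversation.item.create, and the `name` / `call_id` / `arguments` fields on the done event follow the OpenAI Realtime convention rather than anything spelled out above:

```python
import json

# Hypothetical tool for illustration; the name and schema are not part of the API.
TOOLS_UPDATE = {
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "check_availability",
            "description": "Check whether a room is free on a given date.",
            "parameters": {
                "type": "object",
                "properties": {"date": {"type": "string"}},
                "required": ["date"],
            },
        }],
    },
}


def check_availability(date: str) -> dict:
    """Stand-in for a real booking lookup."""
    return {"date": date, "available": True}


HANDLERS = {"check_availability": check_availability}


def answer_tool_call(done_event: dict) -> list:
    """Turn a response.function_call_arguments.done event into the two reply messages."""
    # Assumed field names (OpenAI Realtime convention): name, call_id, arguments.
    args = json.loads(done_event["arguments"])
    result = HANDLERS[done_event["name"]](**args)
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": done_event["call_id"],
                "output": json.dumps(result),
            },
        }),
        json.dumps({"type": "response.create"}),
    ]
```

Send `json.dumps(TOOLS_UPDATE)` at session start, then send both messages returned by `answer_tool_call` whenever a done event arrives.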

Session limits & lifecycle

Max session duration: 60 minutes. One minute before the cutoff the server emits a vendor-extension event {"type": "session.expiring", "reason": "max_session_duration", "expires_in_seconds": …} so bridges can reconnect gracefully. Clients that ignore unknown events lose nothing.
Idle timeout: 5 minutes without any WebSocket message (continuous audio streaming counts as activity).
Server deploys can terminate live sessions; production bridges should reconnect on unexpected close and re-send session.update.
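A bridge might handle both cases with two small helpers: one that reads session.expiring and decides when to open a replacement session, and one that spaces out retries after an unexpected close. The safety margin and backoff constants are illustrative assumptions:

```python
import json


def reconnect_delay(raw: str, safety_margin_s: float = 10.0):
    """Seconds to wait before opening a replacement session, or None if not expiring."""
    ev = json.loads(raw)
    if ev.get("type") != "session.expiring":
        return None
    remaining = float(ev.get("expires_in_seconds", 0))
    return max(0.0, remaining - safety_margin_s)


def backoff_s(attempt: int, base_s: float = 0.5, cap_s: float = 30.0) -> float:
    """Exponential backoff for reconnects after an unexpected close."""
    return min(cap_s, base_s * (2 ** attempt))
```

After reconnecting, re-send session.update before streaming audio again, since a new session starts unconfigured.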

Voice catalog per tier

kataleptic-realtime — voice follows the detected caller language automatically across all ten languages. Before the first caller utterance, the initial voice seeds from input_audio_transcription.language when set, or from the language of your instructions — so instruction-driven greetings come out in the right voice with zero configuration. The language field is a seed and STT-accuracy hint, not a cage: once real speech arrives, per-utterance detection overrides it — in the reported language field, the voice, and what the model is told — even when it contradicts the seed. To pin a voice explicitly, pass a Piper id as voice: en_US-lessac-medium, de_DE-thorsten-medium, fr_FR-siwis-medium, es_ES-sharvard-medium, nl_NL-mls-medium, sv_SE-nst-medium, da_DK-talesyntese-medium, it_IT-paola-medium, fi_FI-harri-medium, ru_RU-irina-medium. OpenAI voice names are accepted and ignored in favor of language-matching.
kataleptic-realtime-hd — any Azure neural voice name passes through (e.g. de-DE-SeraphinaMultilingualNeural, 600+ voices); OpenAI voice names (alloy, marin, …) map to Azure multilingual voices; Piper ids map to the closest Azure voice. Default is a multilingual voice, so language-follow works with no configuration.
gpt-realtime-2 — OpenAI voices only (marin, cedar, alloy, …); non-OpenAI names coerce to marin. Voices are natively multilingual.
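To pin a Piper voice on kataleptic-realtime, a sketch like the following could map a language code to the ids listed in that entry. Indexing by two-letter code is a convenience of this example, and the table is a snapshot of the text above rather than the live catalog:

```python
import json

# Piper ids copied from the kataleptic-realtime entry above. The live map is
# served at GET /v1/realtime/voices and may differ from this snapshot.
PIPER_VOICES = {
    "en": "en_US-lessac-medium",
    "de": "de_DE-thorsten-medium",
    "fr": "fr_FR-siwis-medium",
    "es": "es_ES-sharvard-medium",
    "nl": "nl_NL-mls-medium",
    "sv": "sv_SE-nst-medium",
    "da": "da_DK-talesyntese-medium",
    "it": "it_IT-paola-medium",
    "fi": "fi_FI-harri-medium",
    "ru": "ru_RU-irina-medium",
}


def pin_voice(language: str) -> str:
    """Build a session.update that pins the Piper voice for a two-letter language code."""
    voice = PIPER_VOICES[language.lower()]
    return json.dumps({"type": "session.update", "session": {"voice": voice}})
```

For example, `pin_voice("de")` produces a session.update whose voice is de_DE-thorsten-medium.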

The full machine-readable catalog (including the live per-language Piper map) is served at GET /v1/realtime/voices, with no auth required. One namespace, three tiers: OpenAI names work everywhere; engine-native names (Piper ids, Azure voice names) work on their own tier and degrade gracefully elsewhere.

Unknown transcription.model values return an error event; supported values are whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe (→ turbo) and whisper-large-v3 (full model, ~+110 ms, better on noisy audio).

Choosing the cascade brain

mistral-nemo-12b (default) — fastest replies (~0.3 s first audio), solid small-talk and form-filling; weaker at multi-step reasoning (dates, arithmetic) and occasionally imperfect language adherence on long prompts.
llama-3.3-70b — strong reasoning and reliable multilingual replies at ~1–1.4 s first audio. Recommended for production receptionists that must reason about schedules.
gpt-5.4-mini / other catalogue models — pick any chat model via ?model=; latency is dominated by that model's time-to-first-token.

Pricing & billing

kataleptic-realtime — $0.0033/min audio in + $0.01/min audio out, plus the chat model's tokens at its catalogue rate. ≈$0.0133/min all-in with the default brain.
kataleptic-realtime-hd — billed per token at Azure Voice Live rates with a service margin; ≈$0.03/min typical.
gpt-realtime-2 — billed per text + audio token; ≈$0.07/min typical.

Usage shows up on your key under the model ids kataleptic-realtime, kataleptic-realtime-hd and gpt-realtime-2 — same GET /v1/auth/key surface as everything else.

Voice minutes are cheap to try: the $5 of free signup credit buys roughly six hours of conversation on the default tier. Get a key and say hello to it.
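The default-tier arithmetic can be checked with a few lines. This sketch uses only the per-minute audio rates and the ≈$0.0133/min figure quoted above; chat-model token charges are not modelled:

```python
AUDIO_IN_USD_PER_MIN = 0.0033    # kataleptic-realtime, audio in
AUDIO_OUT_USD_PER_MIN = 0.01     # kataleptic-realtime, audio out


def default_tier_audio_cost(minutes_in: float, minutes_out: float) -> float:
    """Audio cost in USD on the default tier, excluding chat-model tokens."""
    return minutes_in * AUDIO_IN_USD_PER_MIN + minutes_out * AUDIO_OUT_USD_PER_MIN


def hours_for_credit(credit_usd: float, usd_per_min: float = 0.0133) -> float:
    """Conversation hours a credit buys at a flat per-minute rate."""
    return credit_usd / usd_per_min / 60
```

With these rates, one minute billed in both directions comes to $0.0133, and `hours_for_credit(5)` lands a little above six hours.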