Realtime voice
Speech-to-speech over WebSocket, wire-compatible with the OpenAI
Realtime API. Point an unmodified OpenAI realtime client at
api.kataleptic.com and it works — GA dialect by default,
beta dialect auto-detected. Three tiers behind one endpoint,
selected by the model id.
Endpoint: wss://api.kataleptic.com/v1/realtime?model=<id>
Auth: Bearer dg_... · ?token= · subprotocol
Protocol: OpenAI Realtime (GA; beta auto-detected)
Audio: PCM16 @ 16/24 kHz · G.711 on Azure tiers
Quickstart
Open a WebSocket, configure the session, send a bare
response.create to make the agent speak first, then
stream microphone audio in and play audio deltas out. That is the
whole loop.
// Browser / Cloudflare Workers: auth via subprotocol
const ws = new WebSocket(
  "wss://api.kataleptic.com/v1/realtime?model=kataleptic-realtime",
  ["realtime", "openai-insecure-api-key." + KATALEPTIC_API_KEY],
);

ws.onopen = () => {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      instructions: "You are the booking agent for a small hotel. " +
        "Open by greeting the caller and asking how you can help.",
      turn_detection: { type: "server_vad", silence_duration_ms: 400 },
    },
  }));
  // Bare response.create: the agent speaks its opening line.
  ws.send(JSON.stringify({ type: "response.create" }));
};

ws.onmessage = (e) => {
  const ev = JSON.parse(e.data);
  if (ev.type === "response.output_audio.delta") {
    playPcm16(atob(ev.delta)); // PCM16 mono @ 24 kHz
  } else if (ev.type === "response.output_audio_transcript.done") {
    console.log("agent said:", ev.transcript);
  }
};

// Stream microphone audio as base64-encoded PCM16:
function sendChunk(base64Pcm16) {
  ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64Pcm16 }));
}
# Python server: auth via the Authorization header
import asyncio, base64, json, os

import websockets

URL = "wss://api.kataleptic.com/v1/realtime?model=kataleptic-realtime"

async def main():
    # additional_headers is the keyword used by current websockets releases;
    # older releases named it extra_headers.
    async with websockets.connect(
        URL,
        additional_headers={
            "Authorization": f"Bearer {os.environ['KATALEPTIC_API_KEY']}",
        },
    ) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "instructions": (
                    "You are the booking agent for a small hotel. "
                    "Open by greeting the caller and asking how you can help."
                ),
                "turn_detection": {"type": "server_vad", "silence_duration_ms": 400},
            },
        }))
        # Bare response.create: the agent speaks its opening line.
        await ws.send(json.dumps({"type": "response.create"}))
        async for raw in ws:
            ev = json.loads(raw)
            if ev["type"] == "response.output_audio.delta":
                # Decoded audio; hand pcm16 to your audio output.
                pcm16 = base64.b64decode(ev["delta"])  # PCM16 mono @ 24 kHz
            elif ev["type"] == "response.output_audio_transcript.done":
                print("agent said:", ev["transcript"])

asyncio.run(main())
Already on the OpenAI SDK? Unmodified OpenAI realtime
clients work as-is — change the host to
api.kataleptic.com and keep your code. We speak the GA
dialect by default and switch to the beta dialect automatically when
your client sends the OpenAI-Beta: realtime=v1 header
or the openai-beta.realtime-v1 subprotocol.
Authentication
Three ways to present your dg_… key, in order of preference:
- Header: send Authorization: Bearer dg_…. Use this from servers.
- Query parameter: ?token=dg_… appended to the WebSocket URL, for clients that cannot set headers.
- Subprotocol: openai-insecure-api-key.dg_… in the WebSocket subprotocol list, the same convention OpenAI uses for browser and Workers clients. As the name says: only use this with short-lived keys you are comfortable exposing to the client.
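The three options can be compared side by side with a small helper. This is a minimal sketch: connection_params is a hypothetical name, not part of any SDK, and the "realtime" subprotocol entry is copied from the browser quickstart above.

```python
from urllib.parse import urlencode

BASE_URL = "wss://api.kataleptic.com/v1/realtime"


def connection_params(api_key: str, model: str, method: str = "header") -> dict:
    """Return the URL, headers and subprotocols for one auth method.

    Hypothetical helper for illustration; pass the pieces to whatever
    WebSocket client you use.
    """
    query = {"model": model}
    headers: dict[str, str] = {}
    subprotocols: list[str] = []

    if method == "header":
        # Preferred for servers: the key never appears in the URL.
        headers["Authorization"] = f"Bearer {api_key}"
    elif method == "query":
        # For clients that cannot set headers.
        query["token"] = api_key
    elif method == "subprotocol":
        # Browser / Workers convention; use short-lived keys only.
        subprotocols = ["realtime", f"openai-insecure-api-key.{api_key}"]
    else:
        raise ValueError(f"unknown auth method: {method}")

    return {
        "url": f"{BASE_URL}?{urlencode(query)}",
        "headers": headers,
        "subprotocols": subprotocols,
    }
```

Keeping the method a parameter makes it easy to use the header form on servers and fall back to the other two only where headers are unavailable.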
The three tiers
One endpoint, three engines. The model id in
?model= selects the engine; everything else about the
protocol stays the same.
| Model id | Engine | First audio | Residency | Transcripts | Typical price |
|---|---|---|---|---|---|
| kataleptic-realtime | Cascade: Whisper STT → chat model → Piper TTS | ~250 ms | EU, our own fleet | Exact | ≈$0.0133/min |
| kataleptic-realtime-hd | Azure Voice Live (Sweden Central) | ~1.2 s | EU | Exact | ≈$0.03/min |
| gpt-realtime-2 | Native speech-to-speech | ~1.0 s | Global routing, not EU-pinned | Model approximation | ≈$0.07/min |
kataleptic-realtime — the default
A fully EU-resident cascade on our own fleet: streaming Whisper
speech-to-text, a catalogue chat model in the middle, Piper
text-to-speech on the way out. The brain is swappable per session —
pass any catalogue chat model id in ?model=
(default mistral-nemo-12b) and the cascade uses it.
Ten languages are auto-detected per utterance — EN, DE, FR, ES, NL,
SV, DA, IT, FI, RU — and the TTS voice follows the detected
language. Server-side VAD with barge-in; speech recognition is
noise-gated (Silero VAD plus no-speech and language-probability
thresholds), so breathing and background noise do not become turns.
kataleptic-realtime-hd — premium voices
The same WebSocket, served by Azure Voice Live in Sweden Central: 600+ HD neural voices, deep noise suppression, echo cancellation, and semantic turn detection (the model judges whether the caller is done, not just the silence timer). EU-resident processing, exact transcripts.
gpt-realtime-2 — native speech-to-speech
No cascade — one model hears audio and speaks audio. Best prosody and expressiveness of the three; it responds to tone, hesitation, and emphasis, not just words. The trade-offs are real and listed in caveats: inference is globally routed (not EU-pinned) and transcripts are the model's own approximation of what was said.
session.update reference
Send session.update as your first message to configure
the conversation. The supported subset:
| Field | Type | What it does |
|---|---|---|
| instructions | string | The system prompt. Persona, opening line, guardrails. |
| voice | string | Voice selection. On the default tier the voice follows the detected language; on the Azure tiers pick from their voice catalogues. |
| turn_detection.type | "server_vad" | Server-side voice activity detection. The server decides when the caller's turn ends. |
| turn_detection.threshold | number | VAD sensitivity. Higher = needs louder/clearer speech to open a turn. |
| turn_detection.prefix_padding_ms | number | Audio retained from before speech onset, so first syllables are not clipped. |
| turn_detection.silence_duration_ms | number | Trailing silence that ends the turn. Lower = snappier, more interruptions. |
| turn_detection.create_response | boolean | Auto-respond when a turn ends. Set false to drive responses yourself with response.create. |
| turn_detection.interrupt_response | boolean | Barge-in: caller speech cancels the agent's in-flight reply. |
Protocol subset
Client events we accept, on every tier:
- session.update: configure instructions, voice, turn detection (see above).
- input_audio_buffer.append / .commit / .clear: stream caller audio; commit manually if you run your own VAD.
- conversation.item.create / .delete / .truncate: edit conversation history, including previous_item_id placement and root insertion.
- response.create / response.cancel: request or cancel an agent reply.
Audio is PCM16 at 16 or 24 kHz in both directions on all tiers. The two Azure tiers additionally accept G.711 for telephony — see below.
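A minimal sketch of the audio-in side, assuming 24 kHz mono PCM16 (2 bytes per sample). append_events is a hypothetical helper and the 20 ms chunk size is an illustrative choice, not a documented requirement.

```python
import base64
import json

SAMPLE_RATE_HZ = 24_000   # assumption: 24 kHz; 16 kHz is also accepted
BYTES_PER_SAMPLE = 2      # PCM16 mono


def append_events(pcm16: bytes, chunk_ms: int = 20):
    """Yield input_audio_buffer.append messages for a buffer of raw PCM16."""
    chunk_bytes = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * chunk_ms // 1000
    for start in range(0, len(pcm16), chunk_bytes):
        chunk = pcm16[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


# Only needed when you run your own VAD instead of server_vad.
COMMIT_EVENT = json.dumps({"type": "input_audio_buffer.commit"})
```

With server_vad enabled you only send the append messages; COMMIT_EVENT is for clients that decide turn boundaries themselves.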
Transcripts & call logging
Both directions of the conversation arrive as text events, which is all you need to build a call log:
- Caller side: conversation.item.input_audio_transcription.completed fires once per caller utterance with the final transcript.
- Agent side: response.output_audio_transcript.delta streams the agent's words as it speaks; response.output_audio_transcript.done carries the full utterance.
gpt-realtime-2 only: caller transcripts are on by default
(we enable input_audio_transcription with
whisper-1 for you; override or disable it in
session.update). The agent-side transcript is
the model's approximation of its own speech, not an exact STT
transcript; if your call logs have compliance weight, use the
standard or HD tier. On the standard tier, transcription events
also carry language and
language_probability fields.
Greeting pattern
Phone agents should speak first. Put the opening line in
instructions, then send a bare
response.create — no conversation items needed:
{ "type": "session.update",
"session": { "instructions": "Greet the caller: 'Grüß Gott, Hotel Sacher reception.' Then assist." } }
{ "type": "response.create" }
The agent speaks the greeting per its instructions, and the normal turn-taking loop begins from there.
Telephony / G.711
SIP trunks and most PSTN gateways hand you G.711. On
kataleptic-realtime-hd and gpt-realtime-2
you can pass it straight through without transcoding:
- Beta-dialect flat fields: "input_audio_format": "g711_ulaw" (or "g711_alaw"), same for output.
- GA-dialect format objects: {"type": "audio/pcmu"} / {"type": "audio/pcma"}.
The default kataleptic-realtime tier is PCM16-only —
transcode at your media gateway if you bridge it to a trunk.
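The two spellings can be sketched as follows. g711_session_update is a hypothetical helper. Two things here are assumptions rather than statements from this page: the beta output field is taken to be output_audio_format, and the GA format objects are placed under session.audio.input.format and session.audio.output.format, following the OpenAI GA session layout as I understand it.

```python
def g711_session_update(dialect: str, law: str = "ulaw") -> dict:
    """Build a session.update selecting G.711 in the beta or GA dialect."""
    if law not in ("ulaw", "alaw"):
        raise ValueError("law must be 'ulaw' or 'alaw'")

    if dialect == "beta":
        # Beta dialect: flat string fields on the session.
        fmt = f"g711_{law}"
        session = {"input_audio_format": fmt, "output_audio_format": fmt}
    elif dialect == "ga":
        # GA dialect: format objects; the nesting below is an assumption.
        mime = "audio/pcmu" if law == "ulaw" else "audio/pcma"
        session = {
            "audio": {
                "input": {"format": {"type": mime}},
                "output": {"format": {"type": mime}},
            }
        }
    else:
        raise ValueError("dialect must be 'beta' or 'ga'")

    return {"type": "session.update", "session": session}
```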
Caveats per tier
kataleptic-realtime
- Cascade voices are functional, not studio-grade — if voice quality is the product, use HD.
- One voice per language; the voice field has limited effect because the voice follows the detected language.
kataleptic-realtime-hd
- First audio ~1.2 s — noticeably slower to open than the default tier's ~250 ms.
gpt-realtime-2
- Not EU-pinned — inference uses global routing. Do not put it behind a residency commitment.
- Agent transcripts are model approximations, not exact STT output (caller transcripts use whisper-1, on by default).
Function calling
All three tiers support OpenAI Realtime function calling. Define tools in session.update (flat realtime shape: {"type": "function", "name", "description", "parameters"}); when the model decides to call one you receive response.function_call_arguments.delta events, a final response.function_call_arguments.done with the JSON arguments, and a function_call item in response.done. Send the result back as a conversation.item.create with {"type": "function_call_output", "call_id", "output"} followed by response.create.
On the standard tier the cascade brain executes the tool call; small models occasionally write a call as prose instead of invoking it — the server strips text that exactly matches a defined tool-call pattern from the spoken audio, so the agent never says "end_call()" aloud.
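The round trip described above can be sketched as below. The tool check_availability and its handler are hypothetical. The sketch assumes the response.function_call_arguments.done event carries name, call_id and arguments, and that the function_call_output object is wrapped in an item key as in the OpenAI event shape; if name is absent in your dialect, read it from the function_call item in response.done instead.

```python
import json

# Flat realtime tool shape, to be sent in session.update under "tools".
TOOLS = [{
    "type": "function",
    "name": "check_availability",  # hypothetical tool
    "description": "Check whether a room is free for a given stay.",
    "parameters": {
        "type": "object",
        "properties": {
            "date": {"type": "string"},
            "nights": {"type": "integer"},
        },
        "required": ["date", "nights"],
    },
}]


def check_availability(date: str, nights: int) -> dict:
    """Stand-in for a real booking lookup."""
    return {"date": date, "nights": nights, "available": True}


HANDLERS = {"check_availability": check_availability}


def tool_reply(ev: dict) -> list[str]:
    """Return the messages to send after a completed function call, if any."""
    if ev.get("type") != "response.function_call_arguments.done":
        return []
    args = json.loads(ev["arguments"])
    result = HANDLERS[ev["name"]](**args)
    return [
        json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": ev["call_id"],
                "output": json.dumps(result),
            },
        }),
        # Ask the agent to continue speaking with the tool result in context.
        json.dumps({"type": "response.create"}),
    ]
```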
Session limits & lifecycle
- Max session duration: 60 minutes. One minute before the cutoff the server emits a vendor-extension event {"type": "session.expiring", "reason": "max_session_duration", "expires_in_seconds": …} so bridges can reconnect gracefully. Clients that ignore unknown events lose nothing.
- Idle timeout: 5 minutes without any WebSocket message (continuous audio streaming counts as activity).
- Server deploys can terminate live sessions; production bridges should reconnect on unexpected close and re-send session.update.
Voice catalog per tier
- kataleptic-realtime: voice follows the detected caller language automatically across all ten languages. Before the first caller utterance, the initial voice seeds from input_audio_transcription.language when set, or from the language of your instructions, so instruction-driven greetings come out in the right voice with zero configuration. The language field is a seed and STT-accuracy hint, not a cage: once real speech arrives, per-utterance detection overrides it (in the reported language field, the voice, and what the model is told), even when it contradicts the seed. To pin a voice explicitly, pass a Piper id as voice: en_US-lessac-medium, de_DE-thorsten-medium, fr_FR-siwis-medium, es_ES-sharvard-medium, nl_NL-mls-medium, sv_SE-nst-medium, da_DK-talesyntese-medium, it_IT-paola-medium, fi_FI-harri-medium, ru_RU-irina-medium. OpenAI voice names are accepted and ignored in favor of language-matching.
- kataleptic-realtime-hd: any Azure neural voice name passes through (e.g. de-DE-SeraphinaMultilingualNeural, 600+ voices); OpenAI voice names (alloy, marin, …) map to Azure multilingual voices; Piper ids map to the closest Azure voice. Default is a multilingual voice, so language-follow works with no configuration.
- gpt-realtime-2: OpenAI voices only (marin, cedar, alloy, …); non-OpenAI names coerce to marin. Voices are natively multilingual.
The full machine-readable catalog (including the live per-language Piper map) is served at GET /v1/realtime/voices, no auth required. One namespace, three tiers: OpenAI names work everywhere; engine-native names (Piper ids, Azure voice names) work on their own tier and degrade gracefully elsewhere.
Unknown transcription.model values return an error event; supported values are whisper-1, gpt-4o-transcribe, gpt-4o-mini-transcribe (→ turbo) and whisper-large-v3 (full model, ~+110 ms, better on noisy audio).
Choosing the cascade brain
- mistral-nemo-12b (default): fastest replies (~0.3 s first audio), solid small-talk and form-filling; weaker at multi-step reasoning (dates, arithmetic) and occasionally imperfect language adherence on long prompts.
- llama-3.3-70b: strong reasoning and reliable multilingual replies at ~1–1.4 s first audio. Recommended for production receptionists that must reason about schedules.
- gpt-5.4-mini / other catalogue models: pick any chat model via ?model=; latency is dominated by that model's time-to-first-token.
Pricing & billing
- kataleptic-realtime: $0.0033/min audio in + $0.01/min audio out, plus the chat model's tokens at its catalogue rate. ≈$0.0133/min all-in with the default brain.
- kataleptic-realtime-hd: billed per token at Azure Voice Live rates with a service margin; ≈$0.03/min typical.
- gpt-realtime-2: billed per text + audio token; ≈$0.07/min typical.
Usage shows up on your key under the model ids
kataleptic-realtime, kataleptic-realtime-hd
and gpt-realtime-2 — same
GET /v1/auth/key surface as everything else.
Voice minutes are cheap to try: the $5 of free signup credit buys roughly six hours of conversation on the default tier. Get a key and say hello to it.
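The six-hour figure can be checked with a few lines of arithmetic. This sketch treats the default tier's ≈$0.0133/min as a flat rate, which is an approximation: actual billing also depends on the chat model's token usage.

```python
AUDIO_IN_PER_MIN = 0.0033   # USD per minute, default tier
AUDIO_OUT_PER_MIN = 0.01    # USD per minute, default tier


def minutes_for_credit(
    credit_usd: float,
    per_min: float = AUDIO_IN_PER_MIN + AUDIO_OUT_PER_MIN,
) -> float:
    """Conversation minutes a given credit buys at a flat per-minute rate."""
    return credit_usd / per_min


hours = minutes_for_credit(5.0) / 60
print(f"{hours:.1f} h")  # about 6.3 h at the approximate default rate
```

At the HD tier's ≈$0.03/min, the same call with per_min=0.03 gives roughly 167 minutes.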