Text-to-speech
OpenClaw can convert outbound replies into audio across 15 speech providers
and deliver native voice messages on Feishu, Matrix, Telegram, and WhatsApp,
audio attachments everywhere else, and PCM/Ulaw streams for telephony and Talk.
Quick start
Pick a provider
OpenAI and ElevenLabs are the most reliable hosted options. Microsoft and
Local CLI work without an API key. See the [provider matrix](#supported-providers)
for the full list.
Set the API key
Export the env var for your provider (for example `OPENAI_API_KEY`,
`ELEVENLABS_API_KEY`). Microsoft and Local CLI need no key.
Enable in config
Set `messages.tts.auto: "always"` and `messages.tts.provider` in your config.
`/tts status` shows the current state. `/tts audio Hello from OpenClaw`
sends a one-off audio reply.
Note: Auto-TTS is **off** by default. When `messages.tts.provider` is unset,
OpenClaw picks the first configured provider in registry auto-select order.
Supported providers
| Provider | Auth | Notes |
| --- | --- | --- |
| Azure Speech | `AZURE_SPEECH_KEY` + `AZURE_SPEECH_REGION` (also `AZURE_SPEECH_API_KEY`, `SPEECH_KEY`, `SPEECH_REGION`) | Native Ogg/Opus voice-note output and telephony. |
| DeepInfra | `DEEPINFRA_API_KEY` | OpenAI-compatible TTS. Defaults to `hexgrad/Kokoro-82M`. |
| ElevenLabs | `ELEVENLABS_API_KEY` or `XI_API_KEY` | Voice cloning, multilingual, deterministic via `seed`. |
| Google Gemini | `GEMINI_API_KEY` or `GOOGLE_API_KEY` | Gemini API TTS; persona-aware via `promptTemplate: "audio-profile-v1"`. |
| Gradium | `GRADIUM_API_KEY` | Voice-note and telephony output. |
| Inworld | `INWORLD_API_KEY` | Streaming TTS API. Native Opus voice-note and PCM telephony. |
| Local CLI | none | Runs a configured local TTS command. |
| Microsoft | none | Public Edge neural TTS via `node-edge-tts`. Best-effort, no SLA. |
| MiniMax | `MINIMAX_API_KEY` (or Token Plan: `MINIMAX_OAUTH_TOKEN`, `MINIMAX_CODE_PLAN_KEY`, `MINIMAX_CODING_API_KEY`) | T2A v2 API. Defaults to `speech-2.8-hd`. |
| OpenAI | `OPENAI_API_KEY` | Also used for auto-summary; supports persona `instructions`. |
| OpenRouter | `OPENROUTER_API_KEY` (can reuse `models.providers.openrouter.apiKey`) | Default model `hexgrad/kokoro-82m`. |
| Volcengine | `VOLCENGINE_TTS_API_KEY` or `BYTEPLUS_SEED_SPEECH_API_KEY` (legacy AppID/token: `VOLCENGINE_TTS_APPID`/`_TOKEN`) | BytePlus Seed Speech HTTP API. |
| Vydra | `VYDRA_API_KEY` | Shared image, video, and speech provider. |
| xAI | `XAI_API_KEY` | xAI batch TTS. Native Opus voice-note is not supported. |
| Xiaomi MiMo | `XIAOMI_API_KEY` | MiMo TTS through Xiaomi chat completions. |
If multiple providers are configured, the selected one is used first and the
others are fallback options. Auto-summary uses `summaryModel` (or
`agents.defaults.model.primary`), so that provider must also be authenticated
if you keep summaries enabled.
Warning: The bundled **Microsoft** provider uses Microsoft Edge's online neural TTS
service via `node-edge-tts`. It is a public web service without a published
SLA or quota, so treat it as best-effort. The legacy provider id `edge` is
normalized to `microsoft` and `openclaw doctor --fix` rewrites persisted
config; new configs should always use `microsoft`.
A persona is a stable spoken identity that can be applied deterministically
across providers. It can prefer one provider, define provider-neutral prompt
intent, and carry provider-specific bindings for voices, models, prompt
templates, seeds, and voice settings.
```json5
{
  messages: {
    tts: {
      auto: "always",
      persona: "alfred",
      personas: {
        alfred: {
          label: "Alfred",
          description: "Dry, warm British butler narrator.",
          provider: "google",
          fallbackPolicy: "preserve-persona",
          prompt: {
            profile: "A brilliant British butler. Dry, witty, warm, charming, emotionally expressive, never generic.",
            scene: "A quiet late-night study. Close-mic narration for a trusted operator.",
            sampleContext: "The speaker is answering a private technical request with concise confidence and dry warmth.",
            style: "Refined, understated, lightly amused.",
            accent: "British English.",
            pacing: "Measured, with short dramatic pauses.",
            constraints: ["Do not read configuration values aloud.", "Do not explain the persona."],
          },
          providers: {
            google: {
              model: "gemini-3.1-flash-tts-preview",
              voiceName: "Algieba",
              promptTemplate: "audio-profile-v1",
            },
            openai: { model: "gpt-4o-mini-tts", voice: "cedar" },
            elevenlabs: {
              voiceId: "voice_id",
              modelId: "eleven_multilingual_v2",
              seed: 42,
              voiceSettings: {
                stability: 0.65,
                similarityBoost: 0.8,
                style: 0.25,
                useSpeakerBoost: true,
                speed: 0.95,
              },
            },
          },
        },
      },
    },
  },
}
```
Persona resolution
The active persona is selected deterministically:

1. `/tts persona <id>` local preference, if set.
2. `messages.tts.persona`, if set.
3. No persona.

Provider selection runs explicit-first:

1. Direct overrides (CLI, gateway, Talk, allowed TTS directives).
2. `/tts provider <id>` local preference.
3. Active persona's `provider`.
4. `messages.tts.provider`.
5. Registry auto-select.
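
The two precedence chains above can be read as first-match lookups. The following is a minimal sketch under that reading; the `ResolutionInputs` shape and both function names are hypothetical and are not OpenClaw's actual API.

```typescript
// Hypothetical input shape; field names are illustrative, not OpenClaw's API.
interface ResolutionInputs {
  directOverrideProvider?: string; // CLI, gateway, Talk, or an allowed directive
  localProvider?: string;          // written by `/tts provider <id>`
  localPersona?: string;           // written by `/tts persona <id>`
  configPersona?: string;          // messages.tts.persona
  configProvider?: string;         // messages.tts.provider
  personaProviders: Record<string, string | undefined>; // persona id -> its `provider`
  registryOrder: string[];         // configured providers in auto-select order
}

// Persona: local preference first, then config, otherwise no persona.
function resolvePersona(input: ResolutionInputs): string | undefined {
  return input.localPersona ?? input.configPersona;
}

// Provider: explicit sources first, registry auto-select last.
function resolveProvider(input: ResolutionInputs): string | undefined {
  const persona = resolvePersona(input);
  const personaProvider = persona ? input.personaProviders[persona] : undefined;
  return (
    input.directOverrideProvider ??
    input.localProvider ??
    personaProvider ??
    input.configProvider ??
    input.registryOrder[0]
  );
}

const example: ResolutionInputs = {
  configPersona: "alfred",
  configProvider: "openai",
  personaProviders: { alfred: "google" },
  registryOrder: ["elevenlabs", "openai"],
};
const chosenProvider = resolveProvider(example);
```

In this example the persona's preferred provider wins over `messages.tts.provider` because it sits higher in the list; a local `/tts provider` preference would win over both.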
For each provider attempt, OpenClaw merges configs in this order:

1. `messages.tts.providers.<id>`
2. `messages.tts.personas.<persona>.providers.<id>`
3. Trusted request overrides
4. Allowed model-emitted TTS directive overrides
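
One way to picture the merge order is as layered objects where later layers win. The sketch below assumes a shallow, last-writer-wins merge, which the text does not spell out; `mergeProviderConfig` and `ProviderSettings` are hypothetical names.

```typescript
type ProviderSettings = Record<string, unknown>;

// Layers are passed in the documented order. This sketch ASSUMES a shallow
// merge in which later layers override earlier ones.
function mergeProviderConfig(
  base: ProviderSettings,               // messages.tts.providers.<id>
  personaBinding: ProviderSettings,     // messages.tts.personas.<persona>.providers.<id>
  requestOverrides: ProviderSettings,   // trusted request overrides
  directiveOverrides: ProviderSettings, // allowed model-emitted directive overrides
): ProviderSettings {
  return { ...base, ...personaBinding, ...requestOverrides, ...directiveOverrides };
}

// Illustrative values only.
const merged = mergeProviderConfig(
  { model: "gpt-4o-mini-tts", voice: "alloy" },
  { voice: "cedar" },
  {},
  { speed: 1.1 },
);
```

Here the persona binding replaces the base voice, the base model survives untouched, and the directive adds a speed for this one reply.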
How providers use persona prompts
Persona prompt fields (`profile`, `scene`, `sampleContext`, `style`, `accent`,
`pacing`, `constraints`) are provider-neutral. Each provider decides how
to use them.
Fallback policy
`fallbackPolicy` controls behavior when a persona has no binding for the
attempted provider:

| Policy | Behavior |
| --- | --- |
| `preserve-persona` | Default. Provider-neutral prompt fields stay available; the provider may use them or ignore them. |
| `provider-defaults` | Persona is omitted from prompt preparation for that attempt; the provider uses its neutral defaults while fallback to other providers continues. |
| `fail` | Skip that provider attempt with `reasonCode: "not_configured"` and `personaBinding: "missing"`. Fallback providers are still tried. |

The whole TTS request only fails when every attempted provider is skipped
or fails.
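
The three policies can be sketched as a per-attempt decision. This is an illustrative model only: `planAttempt`, `firstUsableProvider`, and the `AttemptPlan` shape are hypothetical, and the sketch ignores synthesis errors, which in practice would also move on to the next provider.

```typescript
type FallbackPolicy = "preserve-persona" | "provider-defaults" | "fail";

type AttemptPlan =
  | { action: "synthesize"; usePersonaPrompt: boolean }
  | { action: "skip"; reasonCode: "not_configured"; personaBinding: "missing" };

// Decide how a single provider attempt treats the active persona.
function planAttempt(
  hasBinding: boolean,
  policy: FallbackPolicy = "preserve-persona",
): AttemptPlan {
  if (hasBinding) return { action: "synthesize", usePersonaPrompt: true };
  switch (policy) {
    case "preserve-persona":
      // Prompt fields stay available; the provider may still ignore them.
      return { action: "synthesize", usePersonaPrompt: true };
    case "provider-defaults":
      return { action: "synthesize", usePersonaPrompt: false };
    case "fail":
      return { action: "skip", reasonCode: "not_configured", personaBinding: "missing" };
  }
}

// Walk the provider chain; undefined means every attempt was skipped.
function firstUsableProvider(
  providers: string[],
  bindings: Set<string>,
  policy: FallbackPolicy,
): string | undefined {
  return providers.find((id) => planAttempt(bindings.has(id), policy).action === "synthesize");
}
```

With `fail`, a chain of `["openai", "google"]` where only `google` has a binding skips `openai` and lands on `google`; the request as a whole fails only when no provider in the chain is usable.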
Model-driven directives
By default, the assistant can emit `[[tts:...]]` directives to override
voice, model, or speed for a single reply, plus an optional
`[[tts:text]]...[[/tts:text]]` block for expressive cues that should appear in
audio only:

```text
Here you go.
[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]
[[tts:text]](laughs) Read the song once more.[[/tts:text]]
```
When `messages.tts.auto` is `"tagged"`, directives are required to trigger
audio. Streaming block delivery strips directives from visible text before the
channel sees them, even when split across adjacent blocks.

`provider=...` is ignored unless `modelOverrides.allowProvider: true`. When a
reply declares `provider=...`, the other keys in that directive are parsed
only by that provider; unsupported keys are stripped and reported as TTS
directive warnings.
Available directive keys:

- `provider` (registered provider id; requires `allowProvider: true`)
- `voice` / `voiceName` / `voice_name` / `google_voice` / `voiceId`
- `model` / `google_model`
- `stability`, `similarityBoost`, `style`, `speed`, `useSpeakerBoost`
- `vol` / `volume` (MiniMax volume, 0–10)
- `pitch` (MiniMax integer pitch, −12 to 12; fractional values are truncated)
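
To make the directive syntax concrete, here is a minimal parsing sketch. It assumes whitespace-separated `key=value` pairs inside `[[tts:...]]`, as in the example earlier in this section; `parseTtsDirectives` and `ParsedTts` are hypothetical names, and the sketch does not cover provider gating, per-provider key validation, or directives split across streaming blocks.

```typescript
interface ParsedTts {
  visibleText: string;               // reply text with directives removed
  overrides: Record<string, string>; // key=value pairs from [[tts:...]]
  spokenText: string | undefined;    // contents of [[tts:text]]...[[/tts:text]]
}

function parseTtsDirectives(reply: string): ParsedTts {
  const overrides: Record<string, string> = {};
  const spoken: string[] = [];

  // Pull out the audio-only block first so it is not parsed as key=value.
  let rest = reply.replace(
    /\[\[tts:text\]\]([\s\S]*?)\[\[\/tts:text\]\]/g,
    (_match: string, body: string) => {
      spoken.push(body.trim());
      return "";
    },
  );

  // Then collect key=value pairs from the remaining [[tts:...]] directives.
  rest = rest.replace(/\[\[tts:([^\]]*)\]\]/g, (_match: string, body: string) => {
    for (const pair of body.trim().split(/\s+/)) {
      const eq = pair.indexOf("=");
      if (eq > 0) overrides[pair.slice(0, eq)] = pair.slice(eq + 1);
    }
    return "";
  });

  return {
    visibleText: rest.trim(),
    overrides,
    spokenText: spoken.length > 0 ? spoken.join(" ") : undefined,
  };
}

const parsed = parseTtsDirectives(
  "Here you go.\n[[tts:voiceId=pMsXgVXv3BLzUgSXRplE model=eleven_v3 speed=1.1]]\n" +
    "[[tts:text]](laughs) Read the song once more.[[/tts:text]]",
);
```

For the sample reply, only `Here you go.` remains visible, the three overrides are captured as strings, and the expressive cue is kept separately for synthesis.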
Slash commands

```text
/tts off | on | status
/tts chat on | off | default
/tts latest
/tts provider <id>
/tts persona <id> | off
/tts limit <chars>
/tts summary off
/tts audio <text>
```
Note: Commands require an authorized sender (allowlist/owner rules apply) and either
`commands.text` or native command registration must be enabled.
Behavior notes:

- `/tts on` writes the local TTS preference to `always`; `/tts off` writes it to `off`.
- `/tts chat on|off|default` writes a session-scoped auto-TTS override for the current chat.
- `/tts persona <id>` writes the local persona preference; `/tts persona off` clears it.
- `/tts latest` reads the latest assistant reply from the current session transcript and sends it as audio once. It stores only a hash of that reply on the session entry to suppress duplicate voice sends.
- `/tts audio` generates a one-off audio reply (does not toggle TTS on).
- `limit` and `summary` are stored in local prefs, not the main config.
- `/tts status` includes fallback diagnostics for the latest attempt: `Fallback: <primary> -> <used>`, `Attempts: ...`, and per-attempt detail (`provider:outcome(reasonCode) latency`).
- `/status` shows the active TTS mode plus configured provider, model, voice, and sanitized custom endpoint metadata when TTS is enabled.
Per-user preferences
Slash commands write local overrides to `prefsPath`. The default is
`~/.openclaw/settings/tts.json`; override with the `OPENCLAW_TTS_PREFS` env var
or `messages.tts.prefsPath`.

| Stored field | Effect |
| --- | --- |
| `auto` | Local auto-TTS override (`always`, `off`, …) |
| `provider` | Local primary provider override |
| `persona` | Local persona override |
| `maxLength` | Summary threshold (default `1500` chars) |
| `summarize` | Summary toggle (default `true`) |

These override the effective config from `messages.tts` plus the active
`agents.list[].tts` block for that host.
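
The override relationship can be pictured as one more layer on top of the effective config. The sketch below is a simplification under stated assumptions: it treats the config blocks as if they used the same field names as the stored prefs, assumes the agent block overrides `messages.tts`, and uses a shallow merge. `TtsPrefs` and `applyLocalPrefs` are hypothetical names.

```typescript
// Stored pref fields from the table above; all optional.
interface TtsPrefs {
  auto?: string;
  provider?: string;
  persona?: string;
  maxLength?: number;
  summarize?: boolean;
}

// Documented defaults for the two summary-related fields.
const PREF_DEFAULTS = { maxLength: 1500, summarize: true };

// ASSUMPTION: later layers win (defaults < messages.tts < agent block < local prefs).
// Only the "local prefs override the effective config" part comes from the text.
function applyLocalPrefs(configTts: TtsPrefs, agentTts: TtsPrefs, localPrefs: TtsPrefs) {
  return { ...PREF_DEFAULTS, ...configTts, ...agentTts, ...localPrefs };
}

const effective = applyLocalPrefs(
  { provider: "openai", auto: "off" },
  {},
  { provider: "elevenlabs", maxLength: 800 },
);
```

In this example the local provider and summary threshold replace the configured values, while `auto` and the `summarize` default are left as they were.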
Output formats (fixed)
TTS voice delivery is channel-capability driven. Channel plugins advertise
whether voice-style TTS should ask providers for a native `voice-note` target or
keep normal `audio-file` synthesis and only mark compatible output for voice
delivery.

- **Voice-note capable channels**: voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). 48 kHz / 64 kbps is a good voice message tradeoff.
- **Feishu / WhatsApp**: when a voice-note reply is produced as MP3/WebM/WAV/M4A or another likely audio file, the channel plugin transcodes it to 48 kHz Ogg/Opus with `ffmpeg` before sending the native voice message. WhatsApp sends the result through the Baileys `audio` payload with `ptt: true` and `audio/ogg; codecs=opus`. If conversion fails, Feishu receives the original file as an attachment; the WhatsApp send fails rather than posting an incompatible PTT payload.
- **BlueBubbles**: keeps provider synthesis on the normal audio-file path; MP3 and CAF outputs are marked for iMessage voice memo delivery.
- **Other channels**: MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). 44.1 kHz / 128 kbps is the default balance for speech clarity.
- **MiniMax**: MP3 (`speech-2.8-hd` model, 32 kHz sample rate) for normal audio attachments. For channel-advertised voice-note targets, OpenClaw transcodes the MiniMax MP3 to 48 kHz Opus with `ffmpeg` before delivery when the channel advertises transcoding.
- **Xiaomi MiMo**: MP3 by default, or WAV when configured. For channel-advertised voice-note targets, OpenClaw transcodes Xiaomi output to 48 kHz Opus with `ffmpeg` before delivery when the channel advertises transcoding.
- **Local CLI**: uses the configured `outputFormat`. Voice-note targets are converted to Ogg/Opus and telephony output is converted to raw 16 kHz mono PCM with `ffmpeg`.
- **Google Gemini**: Gemini API TTS returns raw 24 kHz PCM. OpenClaw wraps it as WAV for audio attachments, transcodes it to 48 kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
- **Gradium**: WAV for audio attachments, Opus for voice-note targets, and `ulaw_8000` at 8 kHz for telephony.
- **Inworld**: MP3 for normal audio attachments, native `OGG_OPUS` for voice-note targets, and raw `PCM` at 22050 Hz for Talk/telephony.
- **xAI**: MP3 by default; `responseFormat` may be `mp3`, `wav`, `pcm`, `mulaw`, or `alaw`. OpenClaw uses xAI's batch REST TTS endpoint and returns a complete audio attachment; xAI's streaming TTS WebSocket is not used by this provider path. Native Opus voice-note format is not supported by this path.
- **Microsoft**: uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`).
  - The bundled transport accepts an `outputFormat`, but not all formats are available from the service.
  - Output format values follow Microsoft Speech output formats (including Ogg/WebM Opus).
  - Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages.
  - If the configured Microsoft output format fails, OpenClaw retries with MP3.

OpenAI/ElevenLabs output formats are fixed per channel (see above).
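
Several entries above rely on `ffmpeg` transcoding. As a rough illustration of what such a conversion involves, the sketch below builds an `ffmpeg` argument list for the two targets mentioned: 48 kHz Ogg/Opus voice notes and raw 16 kHz mono PCM. It is not the plugin's actual invocation. The `ffmpegArgs` helper is hypothetical, and the 64 kbps bitrate, mono downmix, and 16-bit little-endian sample format are assumptions.

```typescript
type TranscodeTarget = "voice-note" | "telephony";

// Build an ffmpeg argument list (to be passed to a process spawner).
function ffmpegArgs(input: string, output: string, target: TranscodeTarget): string[] {
  // -y overwrites the output file, -vn drops any video or cover-art stream.
  const common = ["-y", "-i", input, "-vn"];
  if (target === "voice-note") {
    // 48 kHz Ogg/Opus; the 64 kbps bitrate and mono downmix are assumptions.
    return [...common, "-c:a", "libopus", "-ar", "48000", "-ac", "1", "-b:a", "64k", "-f", "ogg", output];
  }
  // Raw 16 kHz mono PCM; 16-bit little-endian samples are an assumption.
  return [...common, "-c:a", "pcm_s16le", "-ar", "16000", "-ac", "1", "-f", "s16le", output];
}

const voiceNoteArgs = ffmpegArgs("reply.mp3", "reply.ogg", "voice-note");
const telephonyArgs = ffmpegArgs("reply.wav", "reply.pcm", "telephony");
```

Returning an argument array rather than a shell string avoids quoting problems when file paths contain spaces.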
Auto-TTS behavior
When `messages.tts.auto` is enabled, OpenClaw:

- Skips TTS if the reply already contains media or a `MEDIA:` directive.
- Skips very short replies (under 10 chars).
- Summarizes long replies when summaries are enabled, using `summaryModel` (or `agents.defaults.model.primary`).
- Attaches the generated audio to the reply.
- In `mode: "final"`, still sends audio-only TTS for streamed final replies after the text stream completes; the generated media goes through the same channel media normalization as normal reply attachments.

If the reply exceeds `maxLength` and summary is off (or no API key for the
summary model), audio is skipped and the normal text reply is sent.
```text
Reply -> TTS enabled?
  no  -> send text
  yes -> has media / MEDIA: / short?
          yes -> send text
          no  -> length > limit?
                   no  -> TTS -> attach audio
                   yes -> summary enabled?
                            no  -> send text
                            yes -> summarize -> TTS -> attach audio
```
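
The same decision flow can be written as a small pure function. This is a sketch of the documented rules only; `decideAutoTts`, `AutoTtsInput`, and the decision labels are hypothetical names, and `summaryEnabled` is assumed to already account for whether the summary model has an API key.

```typescript
interface AutoTtsInput {
  ttsEnabled: boolean;
  text: string;
  hasMedia: boolean;       // reply already carries media or a MEDIA: directive
  maxLength: number;       // summary threshold, for example 1500
  summaryEnabled: boolean; // summaries on AND the summary model is authenticated
}

type AutoTtsDecision = "send-text" | "tts" | "summarize-then-tts";

function decideAutoTts(input: AutoTtsInput): AutoTtsDecision {
  if (!input.ttsEnabled) return "send-text";
  // Replies with media, or shorter than 10 characters, are never voiced.
  if (input.hasMedia || input.text.length < 10) return "send-text";
  if (input.text.length <= input.maxLength) return "tts";
  // Over the limit: summarize if possible, otherwise fall back to text only.
  return input.summaryEnabled ? "summarize-then-tts" : "send-text";
}

const base: AutoTtsInput = {
  ttsEnabled: true,
  text: "A reply that is long enough to be spoken aloud.",
  hasMedia: false,
  maxLength: 1500,
  summaryEnabled: true,
};
const decision = decideAutoTts(base);
```

Keeping the decision separate from synthesis makes each branch of the flow easy to check in isolation.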
Output formats by channel
| Target | Format |
| --- | --- |
| Feishu / Matrix / Telegram / WhatsApp | Voice-note replies prefer Opus (`opus_48000_64` from ElevenLabs, `opus` from OpenAI). 48 kHz / 64 kbps balances clarity and size. |
| Other channels | MP3 (`mp3_44100_128` from ElevenLabs, `mp3` from OpenAI). 44.1 kHz / 128 kbps default for speech. |
| Talk / telephony | Provider-native PCM (Inworld 22050 Hz, Google 24 kHz), or `ulaw_8000` from Gradium for telephony. |
Per-provider notes:

- **Feishu / WhatsApp transcoding**: When a voice-note reply lands as MP3/WebM/WAV/M4A, the channel plugin transcodes to 48 kHz Ogg/Opus with `ffmpeg`. WhatsApp sends through Baileys with `ptt: true` and `audio/ogg; codecs=opus`. If conversion fails, Feishu falls back to attaching the original file; the WhatsApp send fails rather than posting an incompatible PTT payload.
- **MiniMax / Xiaomi MiMo**: Default MP3 (32 kHz for MiniMax `speech-2.8-hd`); transcoded to 48 kHz Opus for voice-note targets via `ffmpeg`.
- **Local CLI**: Uses the configured `outputFormat`. Voice-note targets are converted to Ogg/Opus and telephony output to raw 16 kHz mono PCM.
- **Google Gemini**: Returns raw 24 kHz PCM. OpenClaw wraps it as WAV for attachments, transcodes it to 48 kHz Opus for voice-note targets, and returns PCM directly for Talk/telephony.
- **Inworld**: MP3 attachments, native `OGG_OPUS` voice-note, raw `PCM` at 22050 Hz for Talk/telephony.
- **xAI**: MP3 by default; `responseFormat` may be `mp3|wav|pcm|mulaw|alaw`. Uses xAI's batch REST endpoint; streaming WebSocket TTS is not used. Native Opus voice-note format is not supported.
- **Microsoft**: Uses `microsoft.outputFormat` (default `audio-24khz-48kbitrate-mono-mp3`). Telegram `sendVoice` accepts OGG/MP3/M4A; use OpenAI/ElevenLabs if you need guaranteed Opus voice messages. If the configured Microsoft format fails, OpenClaw retries with MP3.

OpenAI and ElevenLabs output formats are fixed per channel as listed above.
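
For the two providers with fixed formats, the table reduces to a lookup keyed by provider and delivery target. The sketch below is illustrative: the lowercase channel ids and the hard-coded channel set are assumptions (per the text, channel plugins advertise voice-note capability themselves), and `pickFormat` is a hypothetical helper.

```typescript
type Target = "voice-note" | "audio-file";
type FixedProvider = "elevenlabs" | "openai";

// Format names as listed in the table above.
const FIXED_FORMATS: Record<FixedProvider, Record<Target, string>> = {
  elevenlabs: { "voice-note": "opus_48000_64", "audio-file": "mp3_44100_128" },
  openai: { "voice-note": "opus", "audio-file": "mp3" },
};

// ASSUMPTION: hypothetical channel ids standing in for plugin-advertised capability.
const VOICE_NOTE_CHANNELS = new Set(["feishu", "matrix", "telegram", "whatsapp"]);

function pickFormat(provider: FixedProvider, channel: string): string {
  const target: Target = VOICE_NOTE_CHANNELS.has(channel) ? "voice-note" : "audio-file";
  return FIXED_FORMATS[provider][target];
}

const telegramFormat = pickFormat("elevenlabs", "telegram");
const otherFormat = pickFormat("openai", "discord");
```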
Field reference
Agent tool
The `tts` tool converts text to speech and returns an audio attachment for
reply delivery. On Feishu, Matrix, Telegram, and WhatsApp, the audio is
delivered as a voice message rather than a file attachment. Feishu and
WhatsApp can transcode non-Opus TTS output on this path when `ffmpeg` is
available.

WhatsApp sends audio through Baileys as a PTT voice note (`audio` with
`ptt: true`) and sends visible text separately from PTT audio because
clients do not consistently render captions on voice notes.

The tool accepts optional `channel` and `timeoutMs` fields; `timeoutMs` is a
per-call provider request timeout in milliseconds.