Audio and voice notes

Audio / Voice Notes (2026-01-17)

What works

- Media understanding (audio): If audio understanding is enabled (or auto-detected), OpenClaw:
  1. Locates the first audio attachment (local path or URL) and downloads it if needed.
  2. Enforces `maxBytes` before sending to each model entry.
  3. Runs the first eligible model entry in order (provider or CLI).
  4. If it fails or skips (size/timeout), it tries the next entry.
  5. On success, it replaces `Body` with an `[Audio]` block and sets `{{Transcript}}`.
- Command parsing: When transcription succeeds, `CommandBody`/`RawBody` are set to the transcript so slash commands still work.
- Verbose logging: In `--verbose`, we log when transcription runs and when it replaces the body.
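
The ordered fallback described above can be sketched as a short loop. This is an illustrative sketch, not OpenClaw's implementation: `ModelEntry`, `transcribe_with_fallback`, and the `transcribe` callable are hypothetical names, and the 1024-byte floor and 20MB cap are the defaults listed under Notes & limits.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

MIN_BYTES = 1024                      # files smaller than this are skipped outright
DEFAULT_MAX_BYTES = 20 * 1024 * 1024  # 20MB default cap


@dataclass
class ModelEntry:
    """Hypothetical stand-in for one entry of tools.media.audio.models."""
    name: str
    transcribe: Callable[[bytes], str]  # provider request or CLI run; may raise
    max_bytes: int = DEFAULT_MAX_BYTES


def transcribe_with_fallback(audio: bytes, entries: List[ModelEntry]) -> Optional[str]:
    """Return the first successful transcript, or None if no entry produced one."""
    if len(audio) < MIN_BYTES:
        return None
    for entry in entries:
        if len(audio) > entry.max_bytes:
            continue  # oversize for this entry: skip it and try the next
        try:
            transcript = entry.transcribe(audio)
        except Exception:
            continue  # failure or timeout: try the next entry
        if transcript:
            return transcript
    return None


def _unreachable_provider(_audio: bytes) -> str:
    # Simulates a provider that times out.
    raise TimeoutError("simulated provider timeout")


entries = [
    ModelEntry("provider", _unreachable_provider),
    ModelEntry("cli", lambda _audio: "hello from the fallback"),
]
result = transcribe_with_fallback(b"\0" * 4096, entries)
```

Here the first entry raises, so the second entry supplies the transcript.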

Auto-detection (default)

If you don’t configure models and `tools.media.audio.enabled` is not set to `false`, OpenClaw auto-detects in this order and stops at the first working option:

1. Active reply model when its provider supports audio understanding.
2. Local CLIs (if installed)
   - `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   - `whisper-cli` (from `whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   - `whisper` (Python CLI; downloads models automatically)
3. Gemini CLI (`gemini`) using `read_many_files`
4. Provider auth
   - Configured `models.providers.*` entries that support audio are tried first
   - Bundled fallback order: OpenAI → Groq → xAI → Deepgram → Google → SenseAudio → ElevenLabs → Mistral

To disable auto-detection, set `tools.media.audio.enabled: false`. To customize, set `tools.media.audio.models`. Note: Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
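
To check which local CLI would be found on your own machine, a lookup along these lines may help. It is a rough approximation that assumes a plain `PATH` search in the documented order; `first_available_cli` is a hypothetical helper, and OpenClaw's actual detection may differ in its details.

```python
import os
import shutil
from typing import Callable, Mapping, Optional

# Local CLI order as documented for auto-detection.
LOCAL_CLI_ORDER = ("sherpa-onnx-offline", "whisper-cli", "whisper")


def first_available_cli(
    which: Callable[[str], Optional[str]] = shutil.which,
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Return the resolved path of the first usable local CLI, or None."""
    for name in LOCAL_CLI_ORDER:
        # sherpa-onnx-offline is only usable when its model directory is configured.
        if name == "sherpa-onnx-offline" and not env.get("SHERPA_ONNX_MODEL_DIR"):
            continue
        path = which(name)
        if path:
            return path
    return None
```

Calling `first_available_cli()` with no arguments searches the real `PATH`; the `which` and `env` parameters exist only so the lookup can be exercised without touching the system.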

Config examples

Provider + CLI fallback (OpenAI + Whisper CLI)


```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        maxBytes: 20971520,
        models: [
          { provider: "openai", model: "gpt-4o-mini-transcribe" },
          {
            type: "cli",
            command: "whisper",
            args: ["--model", "base", "{{MediaPath}}"],
            timeoutSeconds: 45,
          },
        ],
      },
    },
  },
}
```

Provider-only with scope gating


```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        scope: {
          default: "allow",
          rules: [{ action: "deny", match: { chatType: "group" } }],
        },
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```
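
Scope rules are evaluated first-match-wins, with `default` applied when no rule matches. The sketch below illustrates that evaluation for the config above. It assumes exact equality on each `match` key, and `scope_decision` is a hypothetical helper rather than an OpenClaw API.

```python
from typing import Any, Dict


def scope_decision(scope: Dict[str, Any], context: Dict[str, str]) -> str:
    """Return the action of the first matching rule, else the scope default."""
    for rule in scope.get("rules", []):
        match = rule.get("match", {})
        # Assumption: a rule matches when every key in `match` equals the context value.
        if all(context.get(key) == value for key, value in match.items()):
            return rule["action"]
    # Assumption: a missing default behaves like "allow".
    return scope.get("default", "allow")


scope = {
    "default": "allow",
    "rules": [{"action": "deny", "match": {"chatType": "group"}}],
}
group_action = scope_decision(scope, {"chatType": "group"})
direct_action = scope_decision(scope, {"chatType": "direct"})
```

With this scope, audio understanding is denied in group chats and allowed everywhere else.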

Provider-only (Deepgram)


```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "deepgram", model: "nova-3" }],
      },
    },
  },
}
```

Provider-only (Mistral Voxtral)


```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "mistral", model: "voxtral-mini-latest" }],
      },
    },
  },
}
```

Provider-only (SenseAudio)


```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        models: [{ provider: "senseaudio", model: "senseaudio-asr-pro-1.5-260319" }],
      },
    },
  },
}
```

Echo transcript to chat (opt-in)


```json5
{
  tools: {
    media: {
      audio: {
        enabled: true,
        echoTranscript: true, // default is false
        echoFormat: '📝 "{transcript}"', // optional, supports {transcript}
        models: [{ provider: "openai", model: "gpt-4o-mini-transcribe" }],
      },
    },
  },
}
```
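
To preview what an `echoFormat` string would produce, the placeholder substitution can be approximated as below. This sketch assumes a literal replacement of `{transcript}`; `render_echo` is a hypothetical name.

```python
def render_echo(echo_format: str, transcript: str) -> str:
    """Substitute the {transcript} placeholder with the transcript text."""
    # A literal replace (rather than str.format) keeps braces in the transcript intact.
    return echo_format.replace("{transcript}", transcript)


echo = render_echo('📝 "{transcript}"', "remind me at 5pm")
```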

Notes & limits

- Provider auth follows the standard model auth order (auth profiles, env vars, `models.providers.*.apiKey`).
- Groq setup details: Groq.
- Deepgram picks up `DEEPGRAM_API_KEY` when `provider: "deepgram"` is used.
- Deepgram setup details: Deepgram (audio transcription).
- Mistral setup details: Mistral.
- SenseAudio picks up `SENSEAUDIO_API_KEY` when `provider: "senseaudio"` is used.
- SenseAudio setup details: SenseAudio.
- Audio providers can override `baseUrl`, `headers`, and `providerOptions` via `tools.media.audio`.
- Default size cap is 20MB (`tools.media.audio.maxBytes`). Oversize audio is skipped for that model and the next entry is tried.
- Tiny/empty audio files below 1024 bytes are skipped before provider/CLI transcription.
- Default `maxChars` for audio is unset (full transcript). Set `tools.media.audio.maxChars` or per-entry `maxChars` to trim output.
- OpenAI auto default is `gpt-4o-mini-transcribe`; set `model: "gpt-4o-transcribe"` for higher accuracy.
- Use `tools.media.audio.attachments` to process multiple voice notes (`mode: "all"` + `maxAttachments`).
- Transcript is available to templates as `{{Transcript}}`.
- `tools.media.audio.echoTranscript` is off by default; enable it to send transcript confirmation back to the originating chat before agent processing.
- `tools.media.audio.echoFormat` customizes the echo text (placeholder: `{transcript}`).
- CLI stdout is capped (5MB); keep CLI output concise.
- CLI `args` should use `{{MediaPath}}` for the local audio file path. Run `openclaw doctor --fix` to migrate deprecated `{input}` placeholders from older `audio.transcription.command` configs.
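
Two of the notes above, `{{MediaPath}}` substitution in CLI `args` and `maxChars` trimming, can be illustrated with a small sketch. Both helpers are hypothetical, and the trimming is assumed to be a plain truncation, which may not match OpenClaw's exact behavior.

```python
from typing import List, Optional


def render_cli_args(args: List[str], media_path: str) -> List[str]:
    """Replace the {{MediaPath}} placeholder in each argument with the local file path."""
    return [arg.replace("{{MediaPath}}", media_path) for arg in args]


def trim_transcript(transcript: str, max_chars: Optional[int] = None) -> str:
    """Return the full transcript when max_chars is unset, else the first max_chars characters."""
    if max_chars is None:
        return transcript
    return transcript[:max_chars]


args = render_cli_args(["--model", "base", "{{MediaPath}}"], "/tmp/voice-note.ogg")
```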

Proxy environment support

Provider-based audio transcription honors standard outbound proxy env vars:

- `HTTPS_PROXY`
- `HTTP_PROXY`
- `ALL_PROXY`
- `https_proxy`
- `http_proxy`
- `all_proxy`

If no proxy env vars are set, direct egress is used. If proxy config is malformed, OpenClaw logs a warning and falls back to direct fetch.
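
The fallback rule can be sketched as follows. The precedence among the variables is not stated here, so checking them in the listed order is an assumption, as is the validity check (a scheme plus a hostname); `resolve_proxy` is a hypothetical helper.

```python
import logging
import os
from typing import Mapping, Optional
from urllib.parse import urlsplit

logger = logging.getLogger("audio.proxy")

# Assumption: variables are checked in the order they are listed above.
PROXY_ENV_VARS = (
    "HTTPS_PROXY", "HTTP_PROXY", "ALL_PROXY",
    "https_proxy", "http_proxy", "all_proxy",
)


def resolve_proxy(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the proxy URL to use, or None for direct egress."""
    for name in PROXY_ENV_VARS:
        value = env.get(name, "").strip()
        if not value:
            continue
        try:
            parts = urlsplit(value)
            valid = bool(parts.scheme and parts.hostname)
        except ValueError:
            valid = False
        if valid:
            return value
        # Malformed proxy config: warn and fall back to direct fetch.
        logger.warning("Ignoring malformed proxy value in %s; using direct fetch", name)
        return None
    return None
```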

Mention detection in groups

When `requireMention: true` is set for a group chat, OpenClaw now transcribes audio before checking for mentions. This allows a voice note that contains a spoken mention to be processed even though it has no text body.

How it works:

- If a voice message has no text body and the group requires mentions, OpenClaw performs a "preflight" transcription.
- The transcript is checked for mention patterns (e.g., `@BotName`, emoji triggers).
- If a mention is found, the message proceeds through the full reply pipeline.
- The transcript is used for mention detection so voice notes can pass the mention gate.

Fallback behavior:

- If transcription fails during preflight (timeout, API error, etc.), the message is processed based on text-only mention detection.
- This ensures that mixed messages (text + audio) are never incorrectly dropped.

Opt-out per Telegram group/topic:

- Set `channels.telegram.groups.<chatId>.disableAudioPreflight: true` to skip preflight transcript mention checks for that group.
- Set `channels.telegram.groups.<chatId>.topics.<threadId>.disableAudioPreflight` to override per-topic (`true` to skip, `false` to force-enable).
- Default is `false` (preflight enabled when mention-gated conditions match).
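
The override order described above (topic setting first, then group setting, then the `false` default) can be expressed as a small resolver. `audio_preflight_disabled` is a hypothetical helper in which `None` stands for "not set".

```python
from typing import Optional


def audio_preflight_disabled(group_value: Optional[bool], topic_value: Optional[bool]) -> bool:
    """Resolve the effective disableAudioPreflight for a group/topic pair."""
    if topic_value is not None:
        return topic_value  # per-topic override wins, including an explicit False
    if group_value is not None:
        return group_value
    return False  # documented default: preflight stays enabled
```

For example, a group with `disableAudioPreflight: true` and a topic set to `false` still runs preflight in that topic.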

Example: A user sends a voice note saying "Hey @Claude, what's the weather?" in a Telegram group with `requireMention: true`. The voice note is transcribed, the mention is detected, and the agent replies.

Gotchas

- Scope rules use first-match wins. `chatType` is normalized to `direct`, `group`, or `room`.
- Ensure your CLI exits 0 and prints plain text; JSON needs to be massaged via `jq -r .text`.
- For `parakeet-mlx`, if you pass `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when `--output-format` is `txt` (or omitted); non-`txt` output formats fall back to stdout parsing.
- Keep timeouts reasonable (`timeoutSeconds`, default 60s) to avoid blocking the reply queue.
- Preflight transcription only processes the first audio attachment for mention detection. Additional audio is processed during the main media understanding phase.
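
If `jq` is not available, a small wrapper can turn a CLI's JSON output into the plain text OpenClaw expects. The sketch below is roughly what `jq -r .text` does for a string field, and it assumes the CLI emits a JSON object with a top-level `text` field; `transcript_from_cli_json` is a hypothetical name.

```python
import json


def transcript_from_cli_json(stdout: str) -> str:
    """Extract the top-level "text" field from a CLI's JSON output as plain text."""
    return str(json.loads(stdout)["text"]).strip()


plain = transcript_from_cli_json('{"text": " hello world ", "language": "en"}')
```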
