Media understanding
OpenClaw can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
Vendor-specific media behavior is registered by vendor plugins, while OpenClaw core owns the shared `tools.media` config, fallback order, and reply-pipeline integration.
Goals
* Optional: pre-digest inbound media into short text for faster routing + better command parsing.
* Preserve original media delivery to the model (always).
* Support provider APIs and CLI fallbacks.
* Allow multiple models with ordered fallback (error/size/timeout).
1. For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
2. **Choose model.** Pick the first eligible model entry (size + capability + auth).
3. **Fallback on failure.** If a model fails or the media is too large, **fall back to the next entry**.
4. **Apply success block.** On success:
   * `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
   * Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
   * Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, the reply flow continues with the original body + attachments.
Config overview
`tools.media` supports shared models plus per-capability overrides. Each entry can be a provider or a CLI:
```json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.5",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760, // 10 MB
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}
```
```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800, // 50 MB
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}
```
CLI templates can also use:
* `{{MediaDir}}` (directory containing the media file)
* `{{OutputDir}}` (scratch dir created for this run)
* `{{OutputBase}}` (scratch file base path, no extension)
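Putting the pieces together, a full `tools.media` layout could combine a shared model list with a per-capability override. This is a sketch only: the `models` lists under `tools.media` and under each capability are assumptions based on the description above, and the Deepgram model name is a placeholder.

```json5
{
  tools: {
    media: {
      // Shared list: tried in order for any capability an entry supports.
      models: [
        {
          provider: "openai",
          model: "gpt-5.5",
          maxChars: 500,
          maxBytes: 10485760, // 10 MB
          capabilities: ["image"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m", "gemini-3-flash",
            "--allowed-tools", "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          maxBytes: 52428800, // 50 MB
          capabilities: ["video", "image"],
        },
      ],
      // Per-capability override: audio keeps the full transcript (no maxChars).
      audio: {
        models: [
          {
            provider: "deepgram",
            model: "nova-3", // placeholder model name
            maxBytes: 20971520, // 20 MB
          },
        ],
      },
    },
  },
}
```

Entries are tried in order; a later entry only runs when an earlier one errors, times out, or the media exceeds its size limit.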
Defaults and limits
Recommended defaults:
* `maxChars`: 500 for image/video (short, command-friendly)
* `maxChars`: unset for audio (full transcript unless you set a limit)
* `maxBytes`:
  * image: 10MB
  * audio: 20MB
  * video: 50MB
Auto-detect media understanding (default)
If `tools.media.<capability>.enabled` is not set to `false` and you haven't configured models, OpenClaw auto-detects in this order and stops at the first working option:
1. **Active reply model.** Used when its provider supports the capability.
2. **`agents.defaults.imageModel`.** Primary/fallback refs (image only). Prefer `provider/model` refs; bare refs are qualified from configured image-capable provider model entries only when the match is unique.
3. **Local CLIs (audio only).** If installed:
   * `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   * `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   * `whisper` (Python CLI; downloads models automatically)
4. **Gemini CLI.** `gemini` using `read_many_files`.
5. **Provider auth.**
   * Configured `models.providers.*` entries that support the capability are tried before the bundled fallback order.
   * Image-only config providers with an image-capable model auto-register for media understanding even when they are not a bundled vendor plugin.
   * Ollama image understanding is available when selected explicitly, for example through `agents.defaults.imageModel` or `openclaw infer image describe --model ollama/`.
Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
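If auto-detection cannot find a binary, one option is to configure the CLI entry yourself with an absolute command path. The sketch below reuses the gemini CLI example from the config overview; the install path is an assumption about your machine.

```json5
{
  type: "cli",
  command: "/usr/local/bin/gemini", // absolute path; adjust to your install
  args: [
    "-m", "gemini-3-flash",
    "--allowed-tools", "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  capabilities: ["image", "video"],
}
```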
Proxy environment support (provider models)
When provider-based audio and video media understanding is enabled, OpenClaw honors standard outbound proxy environment variables for provider HTTP calls:
* `HTTPS_PROXY`
* `HTTP_PROXY`
* `ALL_PROXY`
* `https_proxy`
* `http_proxy`
* `all_proxy`
If no proxy env vars are set, media understanding uses direct egress. If the proxy value is malformed, OpenClaw logs a warning and falls back to direct fetch.
Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. For shared lists, OpenClaw can infer defaults:

* `openai`, `anthropic`, `minimax`: image
* `minimax-portal`: image
* `moonshot`: image + video
* `openrouter`: image
* `google` (Gemini API): image + audio + video
* `qwen`: image + video
* `mistral`: audio
* `zai`: image
* `groq`: audio
* `xai`: audio
* `deepgram`: audio
* Any `models.providers.<id>.models[]` catalog with an image-capable model: image
For CLI entries, set `capabilities` explicitly to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.
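For example, you could pin a shared-list entry to a single media type even when its provider would otherwise be inferred as multi-modal. A sketch; the model ref is illustrative, not prescribed:

```json5
{
  provider: "google",
  model: "gemini-3-flash", // illustrative ref; google entries otherwise infer image + audio + video
  capabilities: ["image"], // restrict this entry to image understanding only
  maxChars: 500,
}
```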
Provider video understanding comes from vendor plugins; Qwen video understanding uses the standard DashScope endpoints.
**MiniMax note**

`minimax` and `minimax-portal` image understanding comes from the plugin-owned `MiniMax-VL-01` media provider. The bundled MiniMax text catalog still starts text-only; explicit `models.providers.minimax` entries materialize image-capable M2.7 chat refs.
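A sketch of what such an explicit entry could look like. Everything here beyond the `models.providers.minimax.models[]` path is an assumption; the field names and model id are placeholders, not a verified schema.

```json5
{
  models: {
    providers: {
      minimax: {
        models: [
          {
            id: "MiniMax-M2.7",      // placeholder id
            capabilities: ["image"], // assumed field marking the ref image-capable
          },
        ],
      },
    },
  },
}
```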
Model selection guidance
* Prefer the strongest latest-generation model available for each media capability when quality and safety matter.
* For tool-enabled agents handling untrusted inputs, avoid older/weaker media models.
* Keep at least one fallback per capability for availability (quality model + faster/cheaper model).
* CLI fallbacks (`whisper-cli`, `whisper`, `gemini`) are useful when provider APIs are unavailable.
`parakeet-mlx` note: with `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when the output format is `txt` (or unspecified); non-`txt` formats fall back to stdout.
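A sketch of a `parakeet-mlx` CLI entry that writes into the per-run scratch directory. The positional media-path argument and flag spelling are assumptions about the CLI; adjust them to your install.

```json5
{
  type: "cli",
  command: "parakeet-mlx",
  args: [
    "--output-dir", "{{OutputDir}}", // OpenClaw then reads {{OutputDir}}/<media-basename>.txt
    "{{MediaPath}}",                 // assumed positional input path
  ],
  capabilities: ["audio"],
  timeoutSeconds: 120,
}
```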
Attachment policy
Per-capability `attachments` controls which attachments are processed:

* Whether to process the first selected attachment or all of them.
* Cap the number processed.
* Selection preference among candidate attachments.
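A sketch of what a per-capability attachment policy could look like. The field names (`mode`, `max`, `prefer`) are assumptions chosen to mirror the three knobs above, not a documented schema.

```json5
{
  tools: {
    media: {
      image: {
        attachments: {
          mode: "all",       // assumed: "first" (default) or "all"
          max: 3,            // assumed: cap on how many attachments are processed
          prefer: "largest", // assumed: selection preference among candidates
        },
      },
    },
  },
}
```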