Media understanding
OpenClaw can summarize inbound media (image/audio/video) before the reply pipeline runs. It auto-detects when local tools or provider keys are available, and can be disabled or customized. If understanding is off, models still receive the original files/URLs as usual.
Vendor-specific media behavior is registered by vendor plugins, while OpenClaw core owns the shared `tools.media` config, fallback order, and reply-pipeline integration.
Goals
* Optional: pre-digest inbound media into short text for faster routing + better command parsing.
* Preserve original media delivery to the model (always).
* Support provider APIs and CLI fallbacks.
* Allow multiple models with ordered fallback (error/size/timeout).
1. For each enabled capability (image/audio/video), select attachments per policy (default: **first**).
2. **Choose model.** Pick the first eligible model entry (size + capability + auth).
3. **Fallback on failure.** If a model fails or the media is too large, **fall back to the next entry**.
4. **Apply success block.** On success:
   * `Body` becomes an `[Image]`, `[Audio]`, or `[Video]` block.
   * Audio sets `{{Transcript}}`; command parsing uses caption text when present, otherwise the transcript.
   * Captions are preserved as `User text:` inside the block.
If understanding fails or is disabled, the reply flow continues with the original body + attachments.
Config overview
`tools.media` supports shared models plus per-capability overrides. Each entry can be a provider or a CLI:
```json5
{
  type: "provider", // default if omitted
  provider: "openai",
  model: "gpt-5.5",
  prompt: "Describe the image in <= 500 chars.",
  maxChars: 500,
  maxBytes: 10485760, // 10 MB
  timeoutSeconds: 60,
  capabilities: ["image"], // optional, used for multi-modal entries
  profile: "vision-profile",
  preferredProfile: "vision-fallback",
}
```
```json5
{
  type: "cli",
  command: "gemini",
  args: [
    "-m",
    "gemini-3-flash",
    "--allowed-tools",
    "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  maxBytes: 52428800, // 50 MB
  timeoutSeconds: 120,
  capabilities: ["video", "image"],
}
```
CLI templates can also use:
* `{{MediaDir}}` (directory containing the media file)
* `{{OutputDir}}` (scratch dir created for this run)
* `{{OutputBase}}` (scratch file base path, no extension)
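Putting the pieces together, a full `tools.media` layout could combine a shared model list with a per-capability override. This is a sketch only: the `models` lists under `tools.media` and under each capability are assumptions based on the description above, and the Deepgram model name is a placeholder.

```json5
{
  tools: {
    media: {
      // Shared list: tried in order for any capability an entry supports.
      models: [
        {
          provider: "openai",
          model: "gpt-5.5",
          maxChars: 500,
          maxBytes: 10485760, // 10 MB
          capabilities: ["image"],
        },
        {
          type: "cli",
          command: "gemini",
          args: [
            "-m", "gemini-3-flash",
            "--allowed-tools", "read_file",
            "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
          ],
          maxBytes: 52428800, // 50 MB
          capabilities: ["video", "image"],
        },
      ],
      // Per-capability override: audio keeps the full transcript (no maxChars).
      audio: {
        models: [
          {
            provider: "deepgram",
            model: "nova-3", // placeholder model name
            maxBytes: 20971520, // 20 MB
          },
        ],
      },
    },
  },
}
```

Entries are tried in order; a later entry only runs when an earlier one errors, times out, or the media exceeds its size limit.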
Defaults and limits
Recommended defaults:
* `maxChars`: 500 for image/video (short, command-friendly)
* `maxChars`: unset for audio (full transcript unless you set a limit)
* `maxBytes`:
  * image: 10MB
  * audio: 20MB
  * video: 50MB
Auto-detect media understanding (default)
If `tools.media.<capability>.enabled` is not set to `false` and you haven't configured models, OpenClaw auto-detects in this order and stops at the first working option:
1. **Active reply model.** Used when its provider supports the capability.
2. **`agents.defaults.imageModel`.** Primary/fallback refs (image only). Prefer `provider/model` refs; bare refs are qualified from configured image-capable provider model entries only when the match is unique.
3. **Local CLIs (audio only).** If installed:
   * `sherpa-onnx-offline` (requires `SHERPA_ONNX_MODEL_DIR` with encoder/decoder/joiner/tokens)
   * `whisper-cli` (`whisper-cpp`; uses `WHISPER_CPP_MODEL` or the bundled tiny model)
   * `whisper` (Python CLI; downloads models automatically)
4. **Gemini CLI.** `gemini` using `read_many_files`.
5. **Provider auth.**
   * Configured `models.providers.*` entries that support the capability are tried before the bundled fallback order.
   * Image-only config providers with an image-capable model auto-register for media understanding even when they are not a bundled vendor plugin.
   * Ollama image understanding is available when selected explicitly, for example through `agents.defaults.imageModel` or `openclaw infer image describe --model ollama/`.
Binary detection is best-effort across macOS/Linux/Windows; ensure the CLI is on `PATH` (we expand `~`), or set an explicit CLI model with a full command path.
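If auto-detection cannot find a binary, one option is to configure the CLI entry yourself with an absolute command path. The sketch below reuses the gemini CLI example from the config overview; the install path is an assumption about your machine.

```json5
{
  type: "cli",
  command: "/usr/local/bin/gemini", // absolute path; adjust to your install
  args: [
    "-m", "gemini-3-flash",
    "--allowed-tools", "read_file",
    "Read the media at {{MediaPath}} and describe it in <= {{MaxChars}} characters.",
  ],
  maxChars: 500,
  capabilities: ["image", "video"],
}
```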
Proxy environment support (provider models)
When provider-based audio and video media understanding is enabled, OpenClaw honors standard outbound proxy environment variables for provider HTTP calls:
* `HTTPS_PROXY`
* `HTTP_PROXY`
* `ALL_PROXY`
* `https_proxy`
* `http_proxy`
* `all_proxy`
If no proxy env vars are set, media understanding uses direct egress. If the proxy value is malformed, OpenClaw logs a warning and falls back to direct fetch.
Capabilities (optional)
If you set `capabilities`, the entry only runs for those media types. For shared lists, OpenClaw can infer defaults:

* `openai`, `anthropic`, `minimax`: image
* `minimax-portal`: image
* `moonshot`: image + video
* `openrouter`: image
* `google` (Gemini API): image + audio + video
* `qwen`: image + video
* `mistral`: audio
* `zai`: image
* `groq`: audio
* `xai`: audio
* `deepgram`: audio
* Any `models.providers.<id>.models[]` catalog with an image-capable model: image
For CLI entries, set `capabilities` explicitly to avoid surprising matches. If you omit `capabilities`, the entry is eligible for the list it appears in.
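For example, you could pin a shared-list entry to a single media type even when its provider would otherwise be inferred as multi-modal. A sketch; the model ref is illustrative, not prescribed:

```json5
{
  provider: "google",
  model: "gemini-3-flash", // illustrative ref; google entries otherwise infer image + audio + video
  capabilities: ["image"], // restrict this entry to image understanding only
  maxChars: 500,
}
```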
Provider video understanding comes from vendor plugins; Qwen video understanding uses the standard DashScope endpoints.
**MiniMax note**

`minimax` and `minimax-portal` image understanding comes from the plugin-owned `MiniMax-VL-01` media provider. The bundled MiniMax text catalog still starts text-only; explicit `models.providers.minimax` entries materialize image-capable M2.7 chat refs.
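A sketch of what such an explicit entry could look like. Everything here beyond the `models.providers.minimax.models[]` path is an assumption; the field names and model id are placeholders, not a verified schema.

```json5
{
  models: {
    providers: {
      minimax: {
        models: [
          {
            id: "MiniMax-M2.7",      // placeholder id
            capabilities: ["image"], // assumed field marking the ref image-capable
          },
        ],
      },
    },
  },
}
```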
Model selection guidance
* Prefer the strongest latest-generation model available for each media capability when quality and safety matter.
* For tool-enabled agents handling untrusted inputs, avoid older/weaker media models.
* Keep at least one fallback per capability for availability (quality model + faster/cheaper model).
* CLI fallbacks (`whisper-cli`, `whisper`, `gemini`) are useful when provider APIs are unavailable.
`parakeet-mlx` note: with `--output-dir`, OpenClaw reads `<output-dir>/<media-basename>.txt` when the output format is `txt` (or unspecified); non-`txt` formats fall back to stdout.
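A sketch of a `parakeet-mlx` CLI entry that writes into the per-run scratch directory. The positional media-path argument and flag spelling are assumptions about the CLI; adjust them to your install.

```json5
{
  type: "cli",
  command: "parakeet-mlx",
  args: [
    "--output-dir", "{{OutputDir}}", // OpenClaw then reads {{OutputDir}}/<media-basename>.txt
    "{{MediaPath}}",                 // assumed positional input path
  ],
  capabilities: ["audio"],
  timeoutSeconds: 120,
}
```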
Attachment policy
Per-capability `attachments` controls which attachments are processed:

* Whether to process the first selected attachment or all of them.
* Cap the number processed.
* Selection preference among candidate attachments.
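A sketch of what a per-capability attachment policy could look like. The field names (`mode`, `max`, `prefer`) are assumptions chosen to mirror the three knobs above, not a documented schema.

```json5
{
  tools: {
    media: {
      image: {
        attachments: {
          mode: "all",       // assumed: "first" (default) or "all"
          max: 3,            // assumed: cap on how many attachments are processed
          prefer: "largest", // assumed: selection preference among candidates
        },
      },
    },
  },
}
```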