Image & Media Support (2025-12-05)
The WhatsApp channel runs via Baileys Web. This document captures the current media handling rules for send, gateway, and agent replies.
Goals
- Send media with optional captions via openclaw message send --media.
- Allow auto-replies from the web inbox to include media alongside text.
- Keep per-type limits sane and predictable.
CLI Surface
openclaw message send --media <path-or-url> [--message <caption>]
- optional; caption can be empty for media-only sends.
- prints the resolved payload; emits { channel, to, messageId, mediaUrl, caption }.
WhatsApp Web channel behavior
- Input: local file path or HTTP(S) URL.
- Flow: load into a Buffer, detect media kind, and build the correct payload:
  - Images: resize & recompress to JPEG (max side 2048px) targeting channels.whatsapp.mediaMaxMb (default: 50 MB).
  - Audio/Voice/Video: pass-through up to 16 MB; audio is sent as a voice note ().
  - Documents: anything else, up to 100 MB, with filename preserved when available.
- WhatsApp GIF-style playback: send an MP4 with (CLI: ) so mobile clients loop inline.
- MIME detection prefers magic bytes, then headers, then file extension.
- Caption comes from or ; empty caption is allowed.
- Logging: non-verbose shows /; verbose includes size and source path/URL.
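The detection order and per-kind caps described above can be sketched roughly as follows. This is an illustration only, not the actual OpenClaw code: every name here (`resolveMime`, `kindForMime`, `capBytes`, `MimeHints`, the extension table) is hypothetical, and treating 1 MB as 1024 × 1024 bytes is an assumption.

```typescript
// Hypothetical sketch: only the precedence order and the caps come from the text above.
type MediaKind = "image" | "audio" | "video" | "document";

interface MimeHints {
  sniffed?: string;   // MIME inferred from magic bytes, if any
  header?: string;    // Content-Type header for URL sources, if any
  extension?: string; // file extension without the dot, e.g. "mp4"
}

// Small illustrative table; a real implementation would use a fuller mapping.
const EXTENSION_MIME: Record<string, string> = {
  jpg: "image/jpeg",
  jpeg: "image/jpeg",
  png: "image/png",
  mp3: "audio/mpeg",
  ogg: "audio/ogg",
  mp4: "video/mp4",
  pdf: "application/pdf",
};

// Magic bytes win, then the header, then the file extension.
function resolveMime(hints: MimeHints): string | undefined {
  if (hints.sniffed) return hints.sniffed;
  if (hints.header) {
    // Drop parameters such as "; charset=..." before comparing.
    const base = hints.header.split(";")[0] ?? "";
    return base.trim().toLowerCase();
  }
  if (hints.extension) return EXTENSION_MIME[hints.extension.toLowerCase()];
  return undefined;
}

// Anything that is not image/audio/video is treated as a document.
function kindForMime(mime: string | undefined): MediaKind {
  if (mime?.startsWith("image/")) return "image";
  if (mime?.startsWith("audio/")) return "audio";
  if (mime?.startsWith("video/")) return "video";
  return "document";
}

const MB_BYTES = 1024 * 1024; // assumption: binary megabytes

// mediaMaxMb stands in for channels.whatsapp.mediaMaxMb (default 50).
function capBytes(kind: MediaKind, mediaMaxMb: number = 50): number {
  const caps: Record<MediaKind, number> = {
    image: mediaMaxMb * MB_BYTES,
    audio: 16 * MB_BYTES,
    video: 16 * MB_BYTES,
    document: 100 * MB_BYTES,
  };
  return caps[kind];
}
```

For example, a URL served as `application/octet-stream` whose bytes sniff as `image/png` would be classified as an image under this sketch, because the sniffed type takes precedence over the header.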
Auto-Reply Pipeline
- returns { text?, mediaUrl?, mediaUrls? }.
- When media is present, the web sender resolves local paths or URLs using the same pipeline as .
- Multiple media entries are sent sequentially if provided.
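As a rough illustration of the fan-out described above, the sketch below turns a reply of the shape { text?, mediaUrl?, mediaUrls? } into an ordered list of sends and then dispatches them one at a time. The names `planSends`, `sendSequentially`, and `OutboundSend` are hypothetical. Attaching the text to the first media item and de-duplicating URLs are assumptions made for the example, not behavior stated in this document.

```typescript
// Hypothetical sketch of the reply fan-out; not the actual OpenClaw code.
interface ReplyPayload {
  text?: string;
  mediaUrl?: string;
  mediaUrls?: string[];
}

interface OutboundSend {
  text?: string;
  mediaUrl?: string;
}

function planSends(reply: ReplyPayload): OutboundSend[] {
  // Merge the single and plural fields, keeping order and skipping duplicates (assumption).
  const media: string[] = [];
  for (const url of [reply.mediaUrl, ...(reply.mediaUrls ?? [])]) {
    if (url && !media.includes(url)) media.push(url);
  }

  // Text-only reply: one plain send, or nothing if the reply is empty.
  if (media.length === 0) {
    return reply.text ? [{ text: reply.text }] : [];
  }

  // Assumption: the text rides along with the first media item only.
  return media.map((mediaUrl, index) =>
    index === 0 && reply.text ? { mediaUrl, text: reply.text } : { mediaUrl },
  );
}

// Awaiting inside the loop keeps the sends strictly sequential.
async function sendSequentially(
  sends: OutboundSend[],
  sendOne: (send: OutboundSend) => Promise<void>,
): Promise<void> {
  for (const send of sends) {
    await sendOne(send);
  }
}
```

A test for the fan-out could pass a fake `sendOne` that records its arguments and then assert that the recorded order matches the order of `mediaUrls`.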
Inbound media to commands (Pi)
- When inbound web messages include media, OpenClaw downloads to a temp file and exposes templating variables:
  - pseudo-URL for the inbound media.
  - local temp path written before running the command.
- When a per-session Docker sandbox is enabled, inbound media is copied into the sandbox workspace and / are rewritten to a relative path like .
- Media understanding (if configured via or shared ) runs before templating and can insert , , and blocks into .
- Audio sets and uses the transcript for command parsing so slash commands still work.
- Video and image descriptions preserve any caption text for command parsing.
- If the active primary image model already supports vision natively, OpenClaw skips the summary block and passes the original image to the model instead.
- By default only the first matching image/audio/video attachment is processed; set tools.media.<cap>.attachments to process multiple attachments.
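The default "first matching attachment only" rule can be pictured with a small sketch. The `Attachment` shape, the `selectAttachments` function, and the `processAll` flag are all hypothetical stand-ins. In particular, `processAll` does not reflect the real shape of tools.media.<cap>.attachments, which this document does not spell out.

```typescript
// Hypothetical sketch of attachment selection; not the actual OpenClaw code.
type Capability = "image" | "audio" | "video";

interface Attachment {
  path: string; // local temp path of the downloaded media
  mime: string; // detected MIME type, e.g. "audio/ogg"
}

function selectAttachments(
  attachments: Attachment[],
  capability: Capability,
  processAll: boolean = false, // stand-in for the multi-attachment setting
): Attachment[] {
  // Keep only attachments whose MIME type falls under this capability.
  const matching = attachments.filter((a) => a.mime.startsWith(`${capability}/`));
  // Default behavior: only the first match is processed.
  return processAll ? matching : matching.slice(0, 1);
}
```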
Limits & Errors
Outbound send caps (WhatsApp web send)
- Images: up to channels.whatsapp.mediaMaxMb (default: 50 MB) after recompression.
- Audio/voice/video: 16 MB cap; documents: 100 MB cap.
- Oversize or unreadable media → clear error in logs and the reply is skipped.
Media understanding caps (transcription/description)
- Image default: 10 MB (tools.media.image.maxBytes).
- Audio default: 20 MB (tools.media.audio.maxBytes).
- Video default: 50 MB (tools.media.video.maxBytes).
- Oversize media skips understanding, but replies still go through with the original body.
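A minimal sketch of the size gate above, assuming 1 MB means 1024 × 1024 bytes and that the understanding output is appended to the body on a new line. Both are assumptions for illustration, and the names (`withinUnderstandingCap`, `bodyForCommand`) are hypothetical.

```typescript
// Hypothetical sketch of the media-understanding size gate; not the actual OpenClaw code.
type UnderstandingKind = "image" | "audio" | "video";

const MIB = 1024 * 1024; // assumption: binary megabytes

// Defaults from the list above; the real values come from tools.media.<kind>.maxBytes.
const DEFAULT_MAX_BYTES: Record<UnderstandingKind, number> = {
  image: 10 * MIB,
  audio: 20 * MIB,
  video: 50 * MIB,
};

function withinUnderstandingCap(
  kind: UnderstandingKind,
  sizeBytes: number,
  maxBytes: number = DEFAULT_MAX_BYTES[kind],
): boolean {
  return sizeBytes <= maxBytes;
}

// Oversize media skips understanding, and the original body passes through unchanged.
function bodyForCommand(
  originalBody: string,
  kind: UnderstandingKind,
  sizeBytes: number,
  understand: () => string, // produces a transcript or description
): string {
  if (!withinUnderstandingCap(kind, sizeBytes)) return originalBody;
  // Assumption: the understanding output is appended after the original body.
  return `${originalBody}\n${understand()}`;
}
```

The point of the sketch is that the gate never blocks the reply: it only decides whether the extra transcript or description step runs.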
Notes for Tests
- Cover send + reply flows for image/audio/document cases.
- Validate recompression for images (size bound) and voice-note flag for audio.
- Ensure multi-media replies fan out as sequential sends.