OpenClaw generates images, videos, and music, understands inbound media (images, audio, video), and speaks replies aloud with text-to-speech. All media capabilities are tool-driven: the agent decides when to use them based on the conversation, and each tool only appears when at least one backing provider is configured.
Create and edit images from text prompts or reference images via `image_generate`. Synchronous: completes inline with the reply.
Generate video from text, images, or existing video via `video_generate`. Asynchronous: runs in the background and posts the result when ready.
Generate music or audio tracks via `music_generate`. Asynchronous on shared providers; the ComfyUI workflow path runs synchronously.
Convert outbound replies to spoken audio via the `tts` tool plus the `messages.tts` config. Synchronous.
Summarize inbound images, audio, and video using vision-capable model providers and dedicated media-understanding plugins.
Transcribe inbound voice messages through batch STT or Voice Call streaming STT providers.
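The rule that a tool only appears when at least one backing provider is configured can be sketched as a set intersection. This is a minimal illustration, not OpenClaw's implementation: the `TOOL_BACKENDS` registry, the `visible_tools` helper, and the `provider-*` ids are all hypothetical placeholders. Only the tool names come from this page.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical registry: which providers could back each tool.
# The provider ids are placeholders, not real OpenClaw provider ids.
TOOL_BACKENDS: Dict[str, Set[str]] = {
    "image_generate": {"provider-a", "provider-b"},
    "video_generate": {"provider-b", "provider-c"},
    "music_generate": {"provider-d"},
    "tts": {"provider-a"},
}


def visible_tools(configured: Iterable[str]) -> List[str]:
    """Return the tools that have at least one configured backing provider."""
    configured_set = set(configured)
    return sorted(
        tool
        for tool, backends in TOOL_BACKENDS.items()
        if backends & configured_set  # non-empty intersection means the tool is usable
    )
```

With only `provider-a` configured, this sketch exposes `image_generate` and `tts`; with nothing configured, no media tools are exposed at all.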
| Provider | Image | Video | Music | TTS | STT | Realtime voice | Media understanding |
|---|---|---|---|---|---|---|---|
| Alibaba | ✓ | ||||||
| BytePlus | ✓ | ||||||
| ComfyUI | ✓ | ✓ | ✓ | ||||
| DeepInfra | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Deepgram | ✓ | ✓ | |||||
| ElevenLabs | ✓ | ✓ | |||||
| fal | ✓ | ✓ | |||||
| ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Gradium | ✓ | ||||||
| Local CLI | ✓ | ||||||
| Microsoft | ✓ | ||||||
| MiniMax | ✓ | ✓ | ✓ | ✓ | |||
| Mistral | ✓ | ||||||
| OpenAI | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | |
| OpenRouter | ✓ | ✓ | ✓ | ✓ | |||
| Qwen | ✓ | ||||||
| Runway | ✓ | ||||||
| SenseAudio | ✓ | ||||||
| Together | ✓ | ||||||
| Vydra | ✓ | ✓ | ✓ | ||||
| xAI | ✓ | ✓ | ✓ | ✓ | ✓ | ||
| Xiaomi MiMo | ✓ | ✓ | ✓ |
| Capability | Mode | Why |
|---|---|---|
| Image | Synchronous | Provider responses return in seconds; completes inline with reply. |
| Text-to-speech | Synchronous | Provider responses return in seconds; attached to the reply audio. |
| Video | Asynchronous | Provider processing takes 30 s to several minutes. |
| Music (shared) | Asynchronous | Same provider-processing characteristic as video. |
| Music (ComfyUI) | Synchronous | Local workflow runs inline against the configured ComfyUI server. |
For async tools, OpenClaw submits the request to the provider, returns a task id immediately, and tracks the job in the task ledger. The agent continues responding to other messages while the job runs. When the provider finishes, OpenClaw wakes the agent so it can post the finished media back into the original channel.
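The submit, track, and wake flow described above can be sketched with `asyncio`. This is a simplified model under stated assumptions, not OpenClaw's actual task ledger: the `TaskLedger` class, the `Task` fields, the `on_done` callback, and the `fake_video_provider` coroutine are hypothetical names introduced for illustration.

```python
import asyncio
import itertools
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict, List, Optional


@dataclass
class Task:
    task_id: str
    channel: str
    status: str = "running"
    result: Optional[str] = None


class TaskLedger:
    """Hypothetical ledger: tracks background jobs and reports when each finishes."""

    def __init__(self, on_done: Callable[[Task], None]) -> None:
        self._ids = itertools.count(1)
        self.tasks: Dict[str, Task] = {}
        self._jobs: List["asyncio.Task[None]"] = []
        self._on_done = on_done

    def submit(self, channel: str, job: Callable[[], Awaitable[str]]) -> str:
        """Start the job in the background and return a task id immediately.

        Must be called from inside a running event loop.
        """
        task_id = f"task-{next(self._ids)}"
        self.tasks[task_id] = Task(task_id, channel)
        self._jobs.append(asyncio.create_task(self._run(task_id, job)))
        return task_id

    async def _run(self, task_id: str, job: Callable[[], Awaitable[str]]) -> None:
        task = self.tasks[task_id]
        try:
            task.result = await job()
            task.status = "done"
        except Exception as exc:  # record provider failures instead of losing them
            task.result = str(exc)
            task.status = "failed"
        self._on_done(task)  # stands in for "wake the agent"

    async def drain(self) -> None:
        """Wait for every submitted job to finish."""
        await asyncio.gather(*self._jobs)


async def fake_video_provider() -> str:
    await asyncio.sleep(0.01)  # stands in for minutes of provider processing
    return "video.mp4"


async def demo() -> List[str]:
    events: List[str] = []
    ledger = TaskLedger(lambda t: events.append(f"post {t.result} to {t.channel}"))
    task_id = ledger.submit("#general", fake_video_provider)
    events.append(f"replied with {task_id}")  # the agent keeps responding meanwhile
    await ledger.drain()
    return events
```

Running `asyncio.run(demo())` records the immediate reply before the completion callback, which mirrors the ordering described above: the task id comes back first, and the finished media is posted to the original channel later.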
Deepgram, DeepInfra, ElevenLabs, Mistral, OpenAI, SenseAudio, and xAI can all transcribe inbound audio through the batch `tools.media.audio` path.
Deepgram, ElevenLabs, Mistral, OpenAI, and xAI also register Voice Call streaming STT providers, so live phone audio can be forwarded to the selected vendor without waiting for a completed recording.
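The two provider lists above can be captured as sets to check which STT modes a vendor supports. The provider names are taken from this page; the `BATCH_STT` and `STREAMING_STT` constants and the `stt_modes` helper are hypothetical and are not part of any OpenClaw API.

```python
from typing import List

# Providers listed on this page as supporting batch transcription.
BATCH_STT = frozenset(
    {"Deepgram", "DeepInfra", "ElevenLabs", "Mistral", "OpenAI", "SenseAudio", "xAI"}
)

# Providers listed on this page as also registering Voice Call streaming STT.
STREAMING_STT = frozenset({"Deepgram", "ElevenLabs", "Mistral", "OpenAI", "xAI"})


def stt_modes(provider: str) -> List[str]:
    """Return the STT modes ("batch", "streaming") listed for a provider."""
    modes: List[str] = []
    if provider in BATCH_STT:
        modes.append("batch")
    if provider in STREAMING_STT:
        modes.append("streaming")
    return modes
```

By these lists, DeepInfra and SenseAudio are batch-only, so live phone audio would need one of the other five vendors.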