Use this file to discover all available pages before exploring further.

Video generation

OpenClaw agents can generate videos from text prompts, reference images, or existing videos. Sixteen provider backends are supported, each with different model options, input modes, and feature sets. The agent picks the right provider automatically based on your configuration and available API keys.

note

The `video_generate` tool only appears when at least one video-generation provider is available. If you do not see it in your agent tools, set a provider API key or configure `agents.defaults.videoGenerationModel`.

OpenClaw treats video generation as three runtime modes:

text
generate
— text-to-video requests with no reference media.
text
imageToVideo
— request includes one or more reference images.
text
videoToVideo
— request includes one or more reference videos.

Providers can support any subset of those modes. The tool validates the active mode before submission and reports supported modes in

text

action=list

Quick start

Configure auth

Set an API key for any supported provider:


text
```bash}
export GEMINI_API_KEY="your-key"
```

Pick a default model (optional)

```bash} openclaw config set agents.defaults.videoGenerationModel.primary "google/veo-3.1-fast-generate-preview" ```

Ask the agent

> Generate a 5-second cinematic video of a friendly lobster surfing at sunset.


text
The agent calls `video_generate` automatically. No tool allowlisting
is needed.

How async generation works

Video generation is asynchronous. When the agent calls

text

video_generate

in a session:

OpenClaw submits the request to the provider and immediately returns a task id.
The provider processes the job in the background (typically 30 seconds to 5 minutes depending on the provider and resolution).
When the video is ready, OpenClaw wakes the same session with an internal completion event.
The agent posts the finished video back into the original conversation.

While a job is in flight, duplicate

text

video_generate

calls in the same session return the current task status instead of starting another generation. Use

text

openclaw tasks list

text

openclaw tasks show <taskId>

to check progress from the CLI.

Outside of session-backed agent runs (for example, direct tool invocations), the tool falls back to inline generation and returns the final media path in the same turn.

Generated video files are saved under OpenClaw-managed media storage when the provider returns bytes. The default generated-video save cap follows the video media limit, and

text

agents.defaults.mediaMaxMb

raises it for larger renders. When a provider also returns a hosted output URL, OpenClaw can deliver that URL instead of failing the task if local persistence rejects an oversized file.

Task lifecycle

State	Meaning
text `queued`	Task created, waiting for the provider to accept it.
text `running`	Provider is processing (typically 30 seconds to 5 minutes depending on provider and resolution).
text `succeeded`	Video ready; the agent wakes and posts it to the conversation.
text `failed`	Provider error or timeout; the agent wakes with error details.

Check status from the CLI:


bash
openclaw tasks list
openclaw tasks show <taskId>
openclaw tasks cancel <taskId>

If a video task is already

text

queued

text

running

for the current session,

text

video_generate

returns the existing task status instead of starting a new one. Use

text

action: "status"

to check explicitly without triggering a new generation.

Supported providers

Provider	Default model	Text	Image ref	Video ref	Auth
Alibaba	text `wan2.6-t2v`	✓	Yes (remote URL)	Yes (remote URL)	text `MODELSTUDIO_API_KEY`
BytePlus (1.0)	text `seedance-1-0-pro-250528`	✓	Up to 2 images (I2V models only; first + last frame)	—	text `BYTEPLUS_API_KEY`
BytePlus Seedance 1.5	text `seedance-1-5-pro-251215`	✓	Up to 2 images (first + last frame via role)	—	text `BYTEPLUS_API_KEY`
BytePlus Seedance 2.0	text `dreamina-seedance-2-0-260128`	✓	Up to 9 reference images	Up to 3 videos	text `BYTEPLUS_API_KEY`
ComfyUI	text `workflow`	✓	1 image	—	text `COMFY_API_KEY` or text `COMFY_CLOUD_API_KEY`
DeepInfra	text `Pixverse/Pixverse-T2V`	✓	—	—	text `DEEPINFRA_API_KEY`
fal	text `fal-ai/minimax/video-01-live`	✓	1 image; up to 9 with Seedance reference-to-video	Up to 3 videos with Seedance reference-to-video	text `FAL_KEY`
Google	text `veo-3.1-fast-generate-preview`	✓	1 image	1 video	text `GEMINI_API_KEY`
MiniMax	text `MiniMax-Hailuo-2.3`	✓	1 image	—	text `MINIMAX_API_KEY` or MiniMax OAuth
OpenAI	text `sora-2`	✓	1 image	1 video	text `OPENAI_API_KEY`
OpenRouter	text `google/veo-3.1-fast`	✓	Up to 4 images (first/last frame or references)	—	text `OPENROUTER_API_KEY`
Qwen	text `wan2.6-t2v`	✓	Yes (remote URL)	Yes (remote URL)	text `QWEN_API_KEY`
Runway	text `gen4.5`	✓	1 image	1 video	text `RUNWAYML_API_SECRET`
Together	text `Wan-AI/Wan2.2-T2V-A14B`	✓	1 image	—	text `TOGETHER_API_KEY`
Vydra	text `veo3`	✓	1 image ( text `kling` )	—	text `VYDRA_API_KEY`
xAI	text `grok-imagine-video`	✓	1 first-frame image or up to 7 text `reference_image` s	1 video	text `XAI_API_KEY`

Some providers accept additional or alternate API key env vars. See individual provider pages for details.

Run

text

video_generate action=list

to inspect available providers, models, and runtime modes at runtime.

Capability matrix

The explicit mode contract used by

text

video_generate

, contract tests, and the shared live sweep:

Provider	text `generate`	text `imageToVideo`	text `videoToVideo`	Shared live lanes today
Alibaba	✓	✓	✓	text `generate` , text `imageToVideo` ; text `videoToVideo` skipped because this provider needs remote text `http(s)` video URLs
BytePlus	✓	✓	—	text `generate` , text `imageToVideo`
ComfyUI	✓	✓	—	Not in the shared sweep; workflow-specific coverage lives with Comfy tests
DeepInfra	✓	—	—	text `generate` ; native DeepInfra video schemas are text-to-video in the bundled contract
fal	✓	✓	✓	text `generate` , text `imageToVideo` ; text `videoToVideo` only when using Seedance reference-to-video
Google	✓	✓	✓	text `generate` , text `imageToVideo` ; shared text `videoToVideo` skipped because the current buffer-backed Gemini/Veo sweep does not accept that input
MiniMax	✓	✓	—	text `generate` , text `imageToVideo`
OpenAI	✓	✓	✓	text `generate` , text `imageToVideo` ; shared text `videoToVideo` skipped because this org/input path currently needs provider-side inpaint/remix access
OpenRouter	✓	✓	—	text `generate` , text `imageToVideo`
Qwen	✓	✓	✓	text `generate` , text `imageToVideo` ; text `videoToVideo` skipped because this provider needs remote text `http(s)` video URLs
Runway	✓	✓	✓	text `generate` , text `imageToVideo` ; text `videoToVideo` runs only when the selected model is text `runway/gen4_aleph`
Together	✓	✓	—	text `generate` , text `imageToVideo`
Vydra	✓	✓	—	text `generate` ; shared text `imageToVideo` skipped because bundled text `veo3` is text-only and bundled text `kling` requires a remote image URL
xAI	✓	✓	✓	text `generate` , text `imageToVideo` ; text `videoToVideo` skipped because this provider currently needs a remote MP4 URL

Tool parameters

Required

Text description of the video to generate. Required for `action: "generate"`.

Content inputs

Single reference image (path or URL). Multiple reference images (up to 9).

Optional per-position role hints parallel to the combined image list. Canonical values: `first_frame`, `last_frame`, `reference_image`.

Single reference video (path or URL). Multiple reference videos (up to 4).

Optional per-position role hints parallel to the combined video list. Canonical value: `reference_video`. Single reference audio (path or URL). Used for background music or voice reference when the provider supports audio inputs.

Multiple reference audios (up to 3).

Optional per-position role hints parallel to the combined audio list. Canonical value: `reference_audio`.

note

Role hints are forwarded to the provider as-is. Canonical values come from the `VideoGenerationAssetRole` union but providers may accept additional role strings. `*Roles` arrays must not have more entries than the corresponding reference list; off-by-one mistakes fail with a clear error. Use an empty string to leave a slot unset. For xAI, set every image role to `reference_image` to use its `reference_images` generation mode; omit the role or use `first_frame` for single-image image-to-video.

Style controls

`1:1`, `2:3`, `3:2`, `3:4`, `4:3`, `4:5`, `5:4`, `9:16`, `16:9`, `21:9`, or `adaptive`.

text

480P

text

720P

text

768P

, or

text

1080P

Target duration in seconds (rounded to nearest provider-supported value).

Size hint when the provider supports it.

Enable generated audio in the output when supported. Distinct from `audioRef*` (inputs).

Toggle provider watermarking when supported.

text

adaptive

is a provider-specific sentinel: it is forwarded as-is to providers that declare

text

adaptive

in their capabilities (e.g. BytePlus Seedance uses it to auto-detect the ratio from the input image dimensions). Providers that do not declare it surface the value via

text

details.ignoredOverrides

in the tool result so the drop is visible.

Advanced

`"status"` returns the current session task; `"list"` inspects providers.

Provider/model override (e.g.

text

runway/gen4.5

). Output filename hint. Optional provider request timeout in milliseconds.

Provider-specific options as a JSON object (e.g. `{"seed": 42, "draft": true}`). Providers that declare a typed schema validate the keys and types; unknown keys or mismatches skip the candidate during fallback. Providers without a declared schema receive the options as-is. Run `video_generate action=list` to see what each provider accepts.

note

Not all providers support all parameters. OpenClaw normalizes duration to the closest provider-supported value, and remaps translated geometry hints such as size-to-aspect-ratio when a fallback provider exposes a different control surface. Truly unsupported overrides are ignored on a best-effort basis and reported as warnings in the tool result. Hard capability limits (such as too many reference inputs) fail before submission. Tool results report applied settings; `details.normalization` captures any requested-to-applied translation.

Reference inputs select the runtime mode:

No reference media →
text
generate
Any image reference →
text
imageToVideo
Any video reference →
text
videoToVideo
Reference audio inputs do not change the resolved mode; they apply on top of whatever mode the image/video references select, and only work with providers that declare
text
maxInputAudios
.

Mixed image and video references are not a stable shared capability surface. Prefer one reference type per request.

Fallback and typed options

Some capability checks are applied at the fallback layer rather than the tool boundary, so a request that exceeds the primary provider's limits can still run on a capable fallback:

Active candidate declaring no
text
maxInputAudios
(or
text
0
) is skipped when the request contains audio references; next candidate is tried.
Active candidate's
text
maxDurationSeconds
below the requested
text
durationSeconds
with no declared
text
supportedDurationSeconds
list → skipped.
Request contains
text
providerOptions
and the active candidate explicitly declares a typed
text
providerOptions
schema → skipped if supplied keys are not in the schema or value types do not match. Providers without a declared schema receive options as-is (backward-compatible pass-through). A provider can opt out of all provider options by declaring an empty schema (
text
capabilities.providerOptions: {}
), which causes the same skip as a type mismatch.

The first skip reason in a request logs at

text

warn

so operators see when their primary provider was passed over; subsequent skips log at

text

debug

to keep long fallback chains quiet. If every candidate is skipped, the aggregated error includes the skip reason for each.

Actions

Action	What it does
text `generate`	Default. Create a video from the given prompt and optional reference inputs.
text `status`	Check the state of the in-flight video task for the current session without starting another generation.
text `list`	Show available providers, models, and their capabilities.

Model selection

OpenClaw resolves the model in this order:

text
model
tool parameter — if the agent specifies one in the call.
text
videoGenerationModel.primary
from config.
text
videoGenerationModel.fallbacks
in order.
Auto-detection — providers that have valid auth, starting with the current default provider, then remaining providers in alphabetical order.

If a provider fails, the next candidate is tried automatically. If all candidates fail, the error includes details from each attempt.

Set

text

agents.defaults.mediaGenerationAutoProviderFallback: false

to use only the explicit

text

model

text

primary

, and

text

fallbacks

entries.


json5
{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "google/veo-3.1-fast-generate-preview",
        fallbacks: ["runway/gen4.5", "qwen/wan2.6-t2v"],
      },
    },
  },
}

Provider notes

Provider capability modes

The shared video-generation contract supports mode-specific capabilities instead of only flat aggregate limits. New provider implementations should prefer explicit mode blocks:


typescript
capabilities: {
  generate: {
    maxVideos: 1,
    maxDurationSeconds: 10,
    supportsResolution: true,
  },
  imageToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputImages: 1,
    maxInputImagesByModel: { "provider/reference-to-video": 9 },
    maxDurationSeconds: 5,
  },
  videoToVideo: {
    enabled: true,
    maxVideos: 1,
    maxInputVideos: 1,
    maxDurationSeconds: 5,
  },
}

Flat aggregate fields such as

text

maxInputImages

and

text

maxInputVideos

are not enough to advertise transform-mode support. Providers should declare

text

generate

text

imageToVideo

, and

text

videoToVideo

explicitly so live tests, contract tests, and the shared

text

video_generate

tool can validate mode support deterministically.

When one model in a provider has wider reference-input support than the rest, use

text

maxInputImagesByModel

text

maxInputVideosByModel

, or

text

maxInputAudiosByModel

instead of raising the mode-wide limit.

Live tests

Opt-in live coverage for the shared bundled providers:


bash
OPENCLAW_LIVE_TEST=1 pnpm test:live -- extensions/video-generation-providers.live.test.ts

Repo wrapper:


bash
pnpm test:live:media video

This live file loads missing provider env vars from

text

~/.profile

, prefers live/env API keys ahead of stored auth profiles by default, and runs a release-safe smoke by default:

text
generate
for every non-FAL provider in the sweep.
One-second lobster prompt.
Per-provider operation cap from
text
OPENCLAW_LIVE_VIDEO_GENERATION_TIMEOUT_MS
(
text
180000
by default).

FAL is opt-in because provider-side queue latency can dominate release time:


bash
pnpm test:live:media video --video-providers fal

Set

text

OPENCLAW_LIVE_VIDEO_GENERATION_FULL_MODES=1

to also run declared transform modes the shared sweep can exercise safely with local media:

text
imageToVideo
when
text
capabilities.imageToVideo.enabled
.
text
videoToVideo
when
text
capabilities.videoToVideo.enabled
and the provider/model accepts buffer-backed local video input in the shared sweep.

Today the shared

text

videoToVideo

live lane covers

text

runway

only when you select

text

runway/gen4_aleph

Configuration

Set the default video-generation model in your OpenClaw config:


json5
{
  agents: {
    defaults: {
      videoGenerationModel: {
        primary: "qwen/wan2.6-t2v",
        fallbacks: ["qwen/wan2.6-r2v-flash"],
      },
    },
  },
}

Or via the CLI:


bash
openclaw config set agents.defaults.videoGenerationModel.primary "qwen/wan2.6-t2v"

Alibaba Model Studio
Background tasks — task tracking for async video generation
BytePlus
ComfyUI
Configuration reference
fal
Google (Gemini)
MiniMax
Models
OpenAI
Qwen
Runway
Together AI
Tools overview
Vydra
xAI

OpenClaw Docs

Video generation

note

Quick start

Configure auth

Pick a default model (optional)

Ask the agent

How async generation works

Task lifecycle

Supported providers

Capability matrix

Tool parameters

Required

Content inputs

note

Style controls

Advanced

note

Fallback and typed options

Actions

Model selection

Provider notes

Provider capability modes

Live tests

Configuration

Related