
Testing

OpenClaw has three Vitest suites (unit/integration, e2e, live) and a small set of Docker runners. This doc is a "how we test" guide:

- What each suite covers (and what it deliberately does not cover).
- Which commands to run for common workflows (local, pre-push, debugging).
- How live tests discover credentials and select models/providers.
- How to add regressions for real-world model/provider issues.

note

**QA stack (qa-lab, qa-channel, live transport lanes)** is documented separately:

- QA overview: architecture, command surface, scenario authoring.
- Matrix QA: reference for `pnpm openclaw qa matrix`.
- QA channel: the synthetic transport plugin used by repo-backed scenarios.

This page covers running the regular test suites and Docker/Parallels runners. The QA-specific runners section below lists the concrete `qa` invocations and points back at the references above.

Quick start

Most days:

- Full gate (expected before push): `pnpm build && pnpm check && pnpm check:test-types && pnpm test`
- Faster local full-suite run on a roomy machine: `pnpm test:max`
- Direct Vitest watch loop: `pnpm test:watch`
- Direct file targeting now routes extension/channel paths too: `pnpm test extensions/discord/src/monitor/message-handler.preflight.test.ts`
- Prefer targeted runs first when you are iterating on a single failure.
- Docker-backed QA site: `pnpm qa:lab:up`
- Linux VM-backed QA lane: `pnpm openclaw qa suite --runner multipass --scenario channel-chat-baseline`

When you touch tests or want extra confidence:

- Coverage gate: `pnpm test:coverage`
- E2E suite: `pnpm test:e2e`

When debugging real providers/models (requires real creds):

- Live suite (models + gateway tool/image probes): `pnpm test:live`
- Target one live file quietly: `pnpm test:live -- src/agents/models.profiles.live.test.ts`
- Docker live model sweep: `pnpm test:docker:live-models`
  - Each selected model now runs a text turn plus a small file-read-style probe. Models whose metadata advertises `image` input also run a tiny image turn. Disable the extra probes with `OPENCLAW_LIVE_MODEL_FILE_PROBE=0` or `OPENCLAW_LIVE_MODEL_IMAGE_PROBE=0` when isolating provider failures.
  - CI coverage: daily `OpenClaw Scheduled Live And E2E Checks` and manual `OpenClaw Release Checks` both call the reusable live/E2E workflow with `include_live_suites: true`, which includes separate Docker live model matrix jobs sharded by provider.
  - For focused CI reruns, dispatch `OpenClaw Live And E2E Checks (Reusable)` with `include_live_suites: true` and `live_models_only: true`.
  - Add new high-signal provider secrets to `scripts/ci-hydrate-live-auth.sh` plus `.github/workflows/openclaw-live-and-e2e-checks-reusable.yml` and its scheduled/release callers.
- Native Codex bound-chat smoke: `pnpm test:docker:live-codex-bind`
  - Runs a Docker live lane against the Codex app-server path, binds a synthetic Slack DM with `/codex bind`, exercises `/codex fast` and `/codex permissions`, then verifies a plain reply and an image attachment route through the native plugin binding instead of ACP.
- Codex app-server harness smoke: `pnpm test:docker:live-codex-harness`
  - Runs gateway agent turns through the plugin-owned Codex app-server harness, verifies `/codex status` and `/codex models`, and by default exercises image, cron MCP, sub-agent, and Guardian probes. Disable the sub-agent probe with `OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_PROBE=0` when isolating other Codex app-server failures. For a focused sub-agent check, disable the other probes: `OPENCLAW_LIVE_CODEX_HARNESS_IMAGE_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_MCP_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_GUARDIAN_PROBE=0 OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_PROBE=1 pnpm test:docker:live-codex-harness`. This exits after the sub-agent probe unless `OPENCLAW_LIVE_CODEX_HARNESS_SUBAGENT_ONLY=0` is set.
- Crestodian rescue command smoke: `pnpm test:live:crestodian-rescue-channel`
  - Opt-in belt-and-suspenders check for the message-channel rescue command surface. It exercises `/crestodian status`, queues a persistent model change, replies `/crestodian yes`, and verifies the audit/config write path.
- Crestodian planner Docker smoke: `pnpm test:docker:crestodian-planner`
  - Runs Crestodian in a configless container with a fake Claude CLI on `PATH` and verifies the fuzzy planner fallback translates into an audited typed config write.
- Crestodian first-run Docker smoke: `pnpm test:docker:crestodian-first-run`
  - Starts from an empty OpenClaw state dir, routes bare `openclaw` to Crestodian, applies setup/model/agent/Discord plugin + SecretRef writes, validates config, and verifies audit entries. The same Ring 0 setup path is also covered in QA Lab by `pnpm openclaw qa suite --scenario crestodian-ring-zero-setup`.
- Moonshot/Kimi cost smoke: with `MOONSHOT_API_KEY` set, run `openclaw models list --provider moonshot --json`, then run an isolated `openclaw agent --local --session-id live-kimi-cost --message 'Reply exactly: KIMI_LIVE_OK' --thinking off --json` against `moonshot/kimi-k2.6`. Verify the JSON reports Moonshot/K2.6 and the assistant transcript stores normalized `usage.cost`.
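
The live model sweep's probe toggles follow a "set to `0` to disable" pattern. The sketch below shows one way such flags could be combined with model metadata to decide which turns run. It is an illustration only: `probeEnabled` and `planModelProbes` are hypothetical names, and it assumes that only the literal value `0` disables a probe, which may not match OpenClaw's actual parsing.

```typescript
// Hypothetical sketch of the documented probe toggles; not OpenClaw's implementation.
type EnvMap = Record<string, string | undefined>;

// Assumption: a probe stays enabled unless its env var is exactly "0".
function probeEnabled(env: EnvMap, key: string): boolean {
  return (env[key] ?? "").trim() !== "0";
}

function planModelProbes(
  env: EnvMap,
  modelInputs: string[],
): { text: boolean; file: boolean; image: boolean } {
  return {
    // The text turn always runs for a selected model.
    text: true,
    // The file-read-style probe runs unless explicitly disabled.
    file: probeEnabled(env, "OPENCLAW_LIVE_MODEL_FILE_PROBE"),
    // The image turn needs both image-capable metadata and an enabled probe.
    image:
      modelInputs.indexOf("image") !== -1 &&
      probeEnabled(env, "OPENCLAW_LIVE_MODEL_IMAGE_PROBE"),
  };
}

// Isolating a provider failure: keep only the text turn.
const isolated = planModelProbes(
  { OPENCLAW_LIVE_MODEL_FILE_PROBE: "0", OPENCLAW_LIVE_MODEL_IMAGE_PROBE: "0" },
  ["text", "image"],
);
console.log(isolated);
```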

tip

When you only need one failing case, prefer narrowing live tests via the allowlist env vars described below.

QA-specific runners

These commands sit beside the main test suites when you need QA-lab realism:

CI runs QA Lab in dedicated workflows. `Parity gate` runs on matching PRs and from manual dispatch with mock providers. `QA-Lab - All Lanes` runs nightly on `main` and from manual dispatch with the mock parity gate, live Matrix lane, Convex-managed live Telegram lane, and Convex-managed live Discord lane as parallel jobs. Scheduled QA and release checks pass Matrix `--profile fast` explicitly, while the Matrix CLI and manual workflow input default remain `all`; manual dispatch can shard `all` into `transport`, `media`, `e2ee-smoke`, `e2ee-deep`, and `e2ee-cli` jobs. `OpenClaw Release Checks` runs parity plus the fast Matrix and Telegram lanes before release approval, using `mock-openai/gpt-5.5` for release transport checks so they stay deterministic and avoid normal provider-plugin startup. These live transport gateways disable memory search; memory behavior stays covered by the QA parity suites.

Full release live media shards use `ghcr.io/openclaw/openclaw-live-media-runner:ubuntu-24.04`, which already has `ffmpeg` and `ffprobe`. Docker live model/backend shards use the shared `ghcr.io/openclaw/openclaw-live-test:<sha>` image built once per selected commit, then pull it with `OPENCLAW_SKIP_DOCKER_BUILD=1` instead of rebuilding inside every shard.

- `pnpm openclaw qa suite`
  - Runs repo-backed QA scenarios directly on the host.
  - Runs multiple selected scenarios in parallel by default with isolated gateway workers. `qa-channel` defaults to concurrency 4 (bounded by the selected scenario count). Use `--concurrency <count>` to tune the worker count, or `--concurrency 1` for the older serial lane.
  - Exits non-zero when any scenario fails. Use `--allow-failures` when you want artifacts without a failing exit code.
  - Supports provider modes `live-frontier`, `mock-openai`, and `aimock`. `aimock` starts a local AIMock-backed provider server for experimental fixture and protocol-mock coverage without replacing the scenario-aware `mock-openai` lane.
- `pnpm test:gateway:cpu-scenarios`
  - Runs the gateway startup bench plus a small mock QA Lab scenario pack (`channel-chat-baseline`, `memory-failure-fallback`, `gateway-restart-inflight-run`) and writes a combined CPU observation summary under `.artifacts/gateway-cpu-scenarios/`.
  - Flags only sustained hot CPU observations by default (`--cpu-core-warn` plus `--hot-wall-warn-ms`), so short startup bursts are recorded as metrics without looking like the minutes-long gateway peg regression.
  - Uses built `dist` artifacts; run a build first when the checkout does not already have fresh runtime output.
- `pnpm openclaw qa suite --runner multipass`
  - Runs the same QA suite inside a disposable Multipass Linux VM.
  - Keeps the same scenario-selection behavior as `qa suite` on the host.
  - Reuses the same provider/model selection flags as `qa suite`.
  - Live runs forward the supported QA auth inputs that are practical for the guest: env-based provider keys, the QA live provider config path, and `CODEX_HOME` when present.
  - Output dirs must stay under the repo root so the guest can write back through the mounted workspace.
  - Writes the normal QA report + summary plus Multipass logs under `.artifacts/qa-e2e/...`.
- `pnpm qa:lab:up`
  - Starts the Docker-backed QA site for operator-style QA work.
- `pnpm test:docker:npm-onboard-channel-agent`
  - Builds an npm tarball from the current checkout, installs it globally in Docker, runs non-interactive OpenAI API-key onboarding, configures Telegram by default, verifies enabling the plugin installs runtime dependencies on demand, runs doctor, and runs one local agent turn against a mocked OpenAI endpoint.
  - Use `OPENCLAW_NPM_ONBOARD_CHANNEL=discord` to run the same packaged-install lane with Discord.
- `pnpm test:docker:session-runtime-context`
  - Runs a deterministic built-app Docker smoke for embedded runtime context transcripts. It verifies hidden OpenClaw runtime context is persisted as a non-display custom message instead of leaking into the visible user turn, then seeds an affected broken session JSONL and verifies `openclaw doctor --fix` rewrites it to the active branch with a backup.
- `pnpm test:docker:npm-telegram-live`
  - Installs an OpenClaw package candidate in Docker, runs installed-package onboarding, configures Telegram through the installed CLI, then reuses the live Telegram QA lane with that installed package as the SUT Gateway.
  - Defaults to `OPENCLAW_NPM_TELEGRAM_PACKAGE_SPEC=openclaw@beta`; set `OPENCLAW_NPM_TELEGRAM_PACKAGE_TGZ=/path/to/openclaw-current.tgz` or `OPENCLAW_CURRENT_PACKAGE_TGZ` to test a resolved local tarball instead of installing from the registry.
  - Uses the same Telegram env credentials or Convex credential source as `pnpm openclaw qa telegram`. For CI/release automation, set `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_SOURCE=convex` plus `OPENCLAW_QA_CONVEX_SITE_URL` and the role secret. If `OPENCLAW_QA_CONVEX_SITE_URL` and a Convex role secret are present in CI, the Docker wrapper selects Convex automatically.
  - `OPENCLAW_NPM_TELEGRAM_CREDENTIAL_ROLE=ci|maintainer` overrides the shared `OPENCLAW_QA_CREDENTIAL_ROLE` for this lane only.
  - GitHub Actions exposes this lane as the manual maintainer workflow `NPM Telegram Beta E2E`. It does not run on merge. The workflow uses the `qa-live-shared` environment and Convex CI credential leases.
- GitHub Actions also exposes `Package Acceptance` for side-run product proof against one candidate package. It accepts a trusted ref, published npm spec, HTTPS tarball URL plus SHA-256, or tarball artifact from another run, uploads the normalized `openclaw-current.tgz` as `package-under-test`, then runs the existing Docker E2E scheduler with smoke, package, product, full, or custom lane profiles. Set `telegram_mode=mock-openai` or `live-frontier` to run the Telegram QA workflow against the same `package-under-test` artifact.
  - Latest beta product proof:

    ```bash
    gh workflow run package-acceptance.yml --ref main \
      -f source=npm \
      -f package_spec=openclaw@beta \
      -f suite_profile=product \
      -f telegram_mode=mock-openai
    ```

  - Exact tarball URL proof requires a digest:

    ```bash
    gh workflow run package-acceptance.yml --ref main \
      -f source=url \
      -f package_url=https://registry.npmjs.org/openclaw/-/openclaw-VERSION.tgz \
      -f package_sha256=<sha256> \
      -f suite_profile=package
    ```

  - Artifact proof downloads a tarball artifact from another Actions run:

    ```bash
    gh workflow run package-acceptance.yml --ref main \
      -f source=artifact \
      -f artifact_run_id=<run-id> \
      -f artifact_name=<artifact-name> \
      -f suite_profile=smoke
    ```

- `pnpm test:docker:bundled-channel-deps`
  - Packs and installs the current OpenClaw build in Docker, starts the Gateway with OpenAI configured, then enables bundled channel/plugins via config edits.
  - Verifies setup discovery leaves unconfigured plugin runtime dependencies absent, the first configured Gateway or doctor run installs each bundled plugin's runtime dependencies on demand, and a second restart does not reinstall dependencies that were already activated.
  - Also installs a known older npm baseline, enables Telegram before running `openclaw update --tag <candidate>`, and verifies the candidate's post-update doctor repairs bundled channel runtime dependencies without a harness-side postinstall repair.
- `pnpm test:parallels:npm-update`
  - Runs the native packaged-install update smoke across Parallels guests. Each selected platform first installs the requested baseline package, then runs the installed `openclaw update` command in the same guest and verifies the installed version, update status, gateway readiness, and one local agent turn.
  - Use `--platform macos`, `--platform windows`, or `--platform linux` while iterating on one guest. Use `--json` for the summary artifact path and per-lane status.
  - The OpenAI lane uses `openai/gpt-5.5` for the live agent-turn proof by default. Pass `--model <provider/model>` or set `OPENCLAW_PARALLELS_OPENAI_MODEL` when deliberately validating another OpenAI model.
  - Wrap long local runs in a host timeout so Parallels transport stalls cannot consume the rest of the testing window:

    ```bash
    timeout --foreground 150m pnpm test:parallels:npm-update -- --json
    timeout --foreground 90m pnpm test:parallels:npm-update -- --platform windows --json
    ```

  - The script writes nested lane logs under `/tmp/openclaw-parallels-npm-update.*`. Inspect `windows-update.log`, `macos-update.log`, or `linux-update.log` before assuming the outer wrapper is hung.
  - Windows update can spend 10 to 15 minutes in post-update doctor/runtime dependency repair on a cold guest; that is still healthy when the nested npm debug log is advancing.
  - Do not run this aggregate wrapper in parallel with individual Parallels macOS, Windows, or Linux smoke lanes. They share VM state and can collide on snapshot restore, package serving, or guest gateway state.
  - The post-update proof runs the normal bundled plugin surface because capability facades such as speech, image generation, and media understanding are loaded through bundled runtime APIs even when the agent turn itself only checks a simple text response.
- `pnpm openclaw qa aimock`
  - Starts only the local AIMock provider server for direct protocol smoke testing.
- `pnpm openclaw qa matrix`
  - Runs the Matrix live QA lane against a disposable Docker-backed Tuwunel homeserver. Source-checkout only: packaged installs do not ship `qa-lab`.
  - Full CLI, profile/scenario catalog, env vars, and artifact layout: Matrix QA.
- `pnpm openclaw qa telegram`
  - Runs the Telegram live QA lane against a real private group using the driver and SUT bot tokens from env.
  - Requires `OPENCLAW_QA_TELEGRAM_GROUP_ID`, `OPENCLAW_QA_TELEGRAM_DRIVER_BOT_TOKEN`, and `OPENCLAW_QA_TELEGRAM_SUT_BOT_TOKEN`. The group id must be the numeric Telegram chat id.
  - Supports `--credential-source convex` for shared pooled credentials. Use env mode by default, or set `OPENCLAW_QA_CREDENTIAL_SOURCE=convex` to opt into pooled leases.
  - Exits non-zero when any scenario fails. Use `--allow-failures` when you want artifacts without a failing exit code.
  - Requires two distinct bots in the same private group, with the SUT bot exposing a Telegram username.
  - For stable bot-to-bot observation, enable Bot-to-Bot Communication Mode in `@BotFather` for both bots and ensure the driver bot can observe group bot traffic.
  - Writes a Telegram QA report, summary, and observed-messages artifact under `.artifacts/qa-e2e/...`. Replying scenarios include RTT from driver send request to observed SUT reply.

Live transport lanes share one standard contract so new transports do not drift; the per-lane coverage matrix lives in QA overview → Live transport coverage. `qa-channel` is the broad synthetic suite and is not part of that matrix.
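
The worker-count and exit-code rules described for `pnpm openclaw qa suite` can be sketched as two small functions. This is a sketch under stated assumptions, not the CLI's implementation: the function names are hypothetical, and it assumes the "bounded by the selected scenario count" rule also applies to an explicit `--concurrency` value.

```typescript
// Hypothetical sketch of the documented `qa suite` concurrency and exit-code rules.
const DEFAULT_QA_CHANNEL_CONCURRENCY = 4; // documented default for qa-channel

function resolveConcurrency(scenarioCount: number, requested?: number): number {
  const wanted = requested ?? DEFAULT_QA_CHANNEL_CONCURRENCY;
  // Never start more workers than selected scenarios, and always at least one.
  return Math.max(1, Math.min(wanted, scenarioCount));
}

function suiteExitCode(failedScenarios: number, allowFailures: boolean): number {
  // --allow-failures keeps the artifacts but suppresses the failing exit code.
  return failedScenarios > 0 && !allowFailures ? 1 : 0;
}

console.log(resolveConcurrency(10)); // default worker count for a large selection
console.log(resolveConcurrency(2)); // bounded by the two selected scenarios
console.log(resolveConcurrency(10, 1)); // the older serial lane
console.log(suiteExitCode(1, false), suiteExitCode(1, true));
```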

Shared Telegram credentials via Convex (v1)

When `--credential-source convex` (or `OPENCLAW_QA_CREDENTIAL_SOURCE=convex`) is enabled for `openclaw qa telegram`, QA lab acquires an exclusive lease from a Convex-backed pool, heartbeats that lease while the lane is running, and releases the lease on shutdown.

Reference Convex project scaffold: `qa/convex-credential-broker/`

Required env vars:

- `OPENCLAW_QA_CONVEX_SITE_URL` (for example `https://your-deployment.convex.site`)
- One secret for the selected role:
  - `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` for `maintainer`
  - `OPENCLAW_QA_CONVEX_SECRET_CI` for `ci`
- Credential role selection:
  - CLI: `--credential-role maintainer|ci`
  - Env default: `OPENCLAW_QA_CREDENTIAL_ROLE` (defaults to `ci` in CI, `maintainer` otherwise)
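
The role selection rules above can be read as a small resolution function. The sketch below is illustrative: `resolveCredentialRole` and `secretEnvVarFor` are hypothetical names, and the precedence (CLI flag over env var over the CI-based default) is an assumption inferred from the "Env default" wording.

```typescript
// Hypothetical sketch of credential role resolution; precedence is an assumption.
type CredentialRole = "maintainer" | "ci";

function resolveCredentialRole(opts: {
  cliRole?: string; // value of --credential-role, if passed
  envRole?: string; // value of OPENCLAW_QA_CREDENTIAL_ROLE, if set
  isCi: boolean;
}): CredentialRole {
  const candidate = opts.cliRole ?? opts.envRole ?? (opts.isCi ? "ci" : "maintainer");
  if (candidate === "maintainer" || candidate === "ci") return candidate;
  throw new Error(`unknown credential role: ${candidate}`);
}

// Each role reads its own broker secret.
function secretEnvVarFor(role: CredentialRole): string {
  return role === "maintainer"
    ? "OPENCLAW_QA_CONVEX_SECRET_MAINTAINER"
    : "OPENCLAW_QA_CONVEX_SECRET_CI";
}

const localRole = resolveCredentialRole({ isCi: false });
console.log(localRole, secretEnvVarFor(localRole));
```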

Optional env vars:

- `OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS` (default `1200000`)
- `OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS` (default `30000`)
- `OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS` (default `90000`)
- `OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS` (default `15000`)
- `OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX` (default `/qa-credentials/v1`)
- `OPENCLAW_QA_CREDENTIAL_OWNER_ID` (optional trace id)
- `OPENCLAW_QA_ALLOW_INSECURE_HTTP=1` allows loopback `http://` Convex URLs for local-only development.

`OPENCLAW_QA_CONVEX_SITE_URL` should use `https://` in normal operation.
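
To make the defaults and the `https://` rule concrete, here is a minimal sketch that resolves the lease settings from an env-like object. Only the env var names and default values come from this page; `resolveLeaseSettings`, `positiveInt`, the loopback host list, and the error messages are assumptions made for illustration.

```typescript
// Hypothetical sketch: resolve documented lease settings and enforce the URL scheme rule.
type EnvMap = Record<string, string | undefined>;

interface LeaseSettings {
  siteUrl: string;
  endpointPrefix: string;
  leaseTtlMs: number;
  heartbeatIntervalMs: number;
  acquireTimeoutMs: number;
  httpTimeoutMs: number;
}

function positiveInt(env: EnvMap, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw.trim() === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value <= 0) {
    throw new Error(`${key} must be a positive integer, got "${raw}"`);
  }
  return value;
}

function resolveLeaseSettings(env: EnvMap): LeaseSettings {
  const siteUrl = env["OPENCLAW_QA_CONVEX_SITE_URL"];
  if (!siteUrl) throw new Error("OPENCLAW_QA_CONVEX_SITE_URL is required");

  const url = new URL(siteUrl);
  // Assumption: these hostnames count as loopback for the insecure-HTTP escape hatch.
  const isLoopback = ["localhost", "127.0.0.1", "[::1]"].indexOf(url.hostname) !== -1;
  const insecureAllowed = env["OPENCLAW_QA_ALLOW_INSECURE_HTTP"] === "1" && isLoopback;
  if (url.protocol !== "https:" && !(url.protocol === "http:" && insecureAllowed)) {
    throw new Error("Convex site URL must use https:// outside local-only development");
  }

  return {
    siteUrl,
    endpointPrefix: env["OPENCLAW_QA_CONVEX_ENDPOINT_PREFIX"] ?? "/qa-credentials/v1",
    leaseTtlMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_LEASE_TTL_MS", 1200000),
    heartbeatIntervalMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_HEARTBEAT_INTERVAL_MS", 30000),
    acquireTimeoutMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_ACQUIRE_TIMEOUT_MS", 90000),
    httpTimeoutMs: positiveInt(env, "OPENCLAW_QA_CREDENTIAL_HTTP_TIMEOUT_MS", 15000),
  };
}

console.log(resolveLeaseSettings({ OPENCLAW_QA_CONVEX_SITE_URL: "https://your-deployment.convex.site" }));
```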

Maintainer admin commands (pool add/remove/list) require `OPENCLAW_QA_CONVEX_SECRET_MAINTAINER` specifically.

CLI helpers for maintainers:


```bash
pnpm openclaw qa credentials doctor
pnpm openclaw qa credentials add --kind telegram --payload-file qa/telegram-credential.json
pnpm openclaw qa credentials list --kind telegram
pnpm openclaw qa credentials remove --credential-id <credential-id>
```

Use `doctor` before live runs to check the Convex site URL, broker secrets, endpoint prefix, HTTP timeout, and admin/list reachability without printing secret values. Use `--json` for machine-readable output in scripts and CI utilities.

Default endpoint contract (`OPENCLAW_QA_CONVEX_SITE_URL` + `/qa-credentials/v1`):

- `POST /acquire`
  - Request: `{ kind, ownerId, actorRole, leaseTtlMs, heartbeatIntervalMs }`
  - Success: `{ status: "ok", credentialId, leaseToken, payload, leaseTtlMs?, heartbeatIntervalMs? }`
  - Exhausted/retryable: `{ status: "error", code: "POOL_EXHAUSTED" | "NO_CREDENTIAL_AVAILABLE", ... }`
- `POST /heartbeat`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken, leaseTtlMs }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /release`
  - Request: `{ kind, ownerId, actorRole, credentialId, leaseToken }`
  - Success: `{ status: "ok" }` (or empty `2xx`)
- `POST /admin/add` (maintainer secret only)
  - Request: `{ kind, actorId, payload, note?, status? }`
  - Success: `{ status: "ok", credential }`
- `POST /admin/remove` (maintainer secret only)
  - Request: `{ credentialId, actorId }`
  - Success: `{ status: "ok", changed, credential }`
  - Active lease guard: `{ status: "error", code: "LEASE_ACTIVE", ... }`
- `POST /admin/list` (maintainer secret only)
  - Request: `{ kind?, status?, includePayload?, limit? }`
  - Success: `{ status: "ok", credentials, count }`
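
The exclusive-lease semantics behind this contract can be illustrated with an in-memory stand-in. This is not the Convex broker: `FakeBroker` is a hypothetical class, it covers only acquire, release, and remove, it omits heartbeats, TTL expiry, and authentication, and the `LEASE_MISMATCH` code is invented for the sketch. The `POOL_EXHAUSTED` and `LEASE_ACTIVE` codes are the ones listed above.

```typescript
// In-memory illustration of the lease contract; the real broker is a Convex deployment.
type BrokerError = { status: "error"; code: string };
type AcquireOk = { status: "ok"; credentialId: string; leaseToken: string; payload: unknown };
type PlainOk = { status: "ok" };
type RemoveOk = { status: "ok"; changed: boolean };

interface PoolEntry {
  credentialId: string;
  kind: string;
  payload: unknown;
  leaseToken?: string; // set while a lane holds the credential
}

class FakeBroker {
  private entries: PoolEntry[] = [];
  private nextLease = 1;

  add(credentialId: string, kind: string, payload: unknown): void {
    this.entries.push({ credentialId, kind, payload });
  }

  // POST /acquire: hand out one unleased credential of the requested kind.
  acquire(kind: string): AcquireOk | BrokerError {
    const free = this.entries.find((e) => e.kind === kind && e.leaseToken === undefined);
    if (!free) return { status: "error", code: "POOL_EXHAUSTED" };
    free.leaseToken = `lease-${this.nextLease++}`;
    return { status: "ok", credentialId: free.credentialId, leaseToken: free.leaseToken, payload: free.payload };
  }

  // POST /release: only the current lease holder may release.
  release(credentialId: string, leaseToken: string): PlainOk | BrokerError {
    const entry = this.entries.find((e) => e.credentialId === credentialId);
    if (!entry || entry.leaseToken !== leaseToken) {
      return { status: "error", code: "LEASE_MISMATCH" }; // hypothetical code
    }
    delete entry.leaseToken;
    return { status: "ok" };
  }

  // POST /admin/remove: refused while a lease is active.
  remove(credentialId: string): RemoveOk | BrokerError {
    const index = this.entries.findIndex((e) => e.credentialId === credentialId);
    if (index >= 0 && this.entries[index].leaseToken !== undefined) {
      return { status: "error", code: "LEASE_ACTIVE" };
    }
    if (index >= 0) this.entries.splice(index, 1);
    return { status: "ok", changed: index >= 0 };
  }
}

const broker = new FakeBroker();
broker.add("cred-1", "telegram", { groupId: "-1001234567890" });
const lease = broker.acquire("telegram");
console.log(broker.acquire("telegram")); // second acquire is refused while the lease is held
console.log(broker.remove("cred-1")); // removal is blocked by the active lease
if (lease.status === "ok") console.log(broker.release(lease.credentialId, lease.leaseToken));
```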

Payload shape for Telegram kind:

- `{ groupId: string, driverToken: string, sutToken: string }`
- `groupId` must be a numeric Telegram chat id string.
- `admin/add` validates this shape for `kind: "telegram"` and rejects malformed payloads.
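
A validation pass for this payload shape might look like the sketch below. It is an assumption-laden illustration rather than the broker's validator: `parseTelegramPayload` is a hypothetical name, "numeric" is interpreted as an optionally negative integer string (Telegram group chat ids are typically negative), and the error messages are invented.

```typescript
// Hypothetical validator for the documented Telegram payload shape.
interface TelegramCredentialPayload {
  groupId: string;
  driverToken: string;
  sutToken: string;
}

function parseTelegramPayload(value: unknown): TelegramCredentialPayload {
  if (typeof value !== "object" || value === null) {
    throw new Error("payload must be an object");
  }
  const { groupId, driverToken, sutToken } = value as Record<string, unknown>;
  // Assumption: "numeric chat id string" means an integer with an optional leading minus.
  if (typeof groupId !== "string" || !/^-?\d+$/.test(groupId)) {
    throw new Error("groupId must be a numeric Telegram chat id string");
  }
  if (typeof driverToken !== "string" || driverToken === "") {
    throw new Error("driverToken must be a non-empty string");
  }
  if (typeof sutToken !== "string" || sutToken === "") {
    throw new Error("sutToken must be a non-empty string");
  }
  return { groupId, driverToken, sutToken };
}

console.log(parseTelegramPayload({ groupId: "-1001234567890", driverToken: "driver", sutToken: "sut" }));
```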

Adding a channel to QA

The architecture and scenario-helper names for new channel adapters live in QA overview → Adding a channel. The minimum bar: implement the transport runner on the shared `qa-lab` host seam, declare `qaRunners` in the plugin manifest, mount as `openclaw qa <runner>`, and author scenarios under `qa/scenarios/`.

Test suites (what runs where)

Think of the suites as “increasing realism” (and increasing flakiness/cost):

Unit / integration (default)

- Command: `pnpm test`
- Config: untargeted runs use the `vitest.full-*.config.ts` shard set and may expand multi-project shards into per-project configs for parallel scheduling
- Files: core/unit inventories under `src/**/*.test.ts`, `packages/**/*.test.ts`, and `test/**/*.test.ts`; UI unit tests run in the dedicated `unit-ui` shard
- Scope:
  - Pure unit tests
  - In-process integration tests (gateway auth, routing, tooling, parsing, config)
  - Deterministic regressions for known bugs
- Expectations:
  - Runs in CI
  - No real keys required
  - Should be fast and stable
  - Resolver and public-surface loader tests must prove broad `api.js` and `runtime-api.js` fallback behavior with generated tiny plugin fixtures, not real bundled plugin source APIs. Real plugin API loads belong in plugin-owned contract/integration suites.

Stability (gateway)

- Command: `pnpm test:stability:gateway`
- Config: `vitest.gateway.config.ts`, forced to one worker
- Scope:
  - Starts a real loopback Gateway with diagnostics enabled by default
  - Drives synthetic gateway message, memory, and large-payload churn through the diagnostic event path
  - Queries `diagnostics.stability` over the Gateway WS RPC
  - Covers diagnostic stability bundle persistence helpers
  - Asserts the recorder remains bounded, synthetic RSS samples stay under the pressure budget, and per-session queue depths drain back to zero
- Expectations:
  - CI-safe and keyless
  - Narrow lane for stability-regression follow-up, not a substitute for the full Gateway suite

E2E (gateway smoke)

- Command: `pnpm test:e2e`
- Config: `vitest.e2e.config.ts`
- Files: `src/**/*.e2e.test.ts`, `test/**/*.e2e.test.ts`, and bundled-plugin E2E tests under `extensions/`
- Runtime defaults:
  - Uses Vitest `threads` with `isolate: false`, matching the rest of the repo.
  - Uses adaptive workers (CI: up to 2, local: 1 by default).
  - Runs in silent mode by default to reduce console I/O overhead.
- Useful overrides:
  - `OPENCLAW_E2E_WORKERS=<n>` to force worker count (capped at 16).
  - `OPENCLAW_E2E_VERBOSE=1` to re-enable verbose console output.
- Scope:
  - Multi-instance gateway end-to-end behavior
  - WebSocket/HTTP surfaces, node pairing, and heavier networking
- Expectations:
  - Runs in CI (when enabled in the pipeline)
  - No real keys required
  - More moving parts than unit tests (can be slower)
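
The E2E worker defaults and the `OPENCLAW_E2E_WORKERS` override can be summarized in a few lines. The sketch below is hypothetical: `resolveE2eWorkers` is not a function from the repo, invalid override values are assumed to fall back to the defaults, and "adaptive, up to 2 in CI" is simplified to a bound by the available CPU count.

```typescript
// Hypothetical sketch of the documented E2E worker-count rules.
type EnvMap = Record<string, string | undefined>;

const MAX_E2E_WORKERS = 16; // documented cap for OPENCLAW_E2E_WORKERS

function resolveE2eWorkers(env: EnvMap, isCi: boolean, cpuCount: number): number {
  const raw = env["OPENCLAW_E2E_WORKERS"];
  if (raw !== undefined) {
    const forced = Number.parseInt(raw, 10);
    // A valid override wins, but never exceeds the cap.
    if (Number.isInteger(forced) && forced > 0) return Math.min(forced, MAX_E2E_WORKERS);
  }
  // Assumption: "adaptive" means CI uses up to 2 workers, limited by CPU count.
  return isCi ? Math.max(1, Math.min(2, cpuCount)) : 1;
}

console.log(resolveE2eWorkers({}, false, 8)); // local default
console.log(resolveE2eWorkers({}, true, 8)); // CI default
console.log(resolveE2eWorkers({ OPENCLAW_E2E_WORKERS: "64" }, false, 8)); // capped override
```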

E2E: OpenShell backend smoke

- Command: `pnpm test:e2e:openshell`
- File: `extensions/openshell/src/backend.e2e.test.ts`
- Scope:
  - Starts an isolated OpenShell gateway on the host via Docker
  - Creates a sandbox from a temporary local Dockerfile
  - Exercises OpenClaw's OpenShell backend over real `sandbox ssh-config` + SSH exec
  - Verifies remote-canonical filesystem behavior through the sandbox fs bridge
- Expectations:
  - Opt-in only; not part of the default `pnpm test:e2e` run
  - Requires a local `openshell` CLI plus a working Docker daemon
  - Uses isolated `HOME` / `XDG_CONFIG_HOME`, then destroys the test gateway and sandbox
- Useful overrides:
  - `OPENCLAW_E2E_OPENSHELL=1` to enable the test when running the broader e2e suite manually
  - `OPENCLAW_E2E_OPENSHELL_COMMAND=/path/to/openshell` to point at a non-default CLI binary or wrapper script

Live (real providers + real models)

- Command: `pnpm test:live`
- Config: `vitest.live.config.ts`
- Files: `src/**/*.live.test.ts`, `test/**/*.live.test.ts`, and bundled-plugin live tests under `extensions/`
- Default: enabled by `pnpm test:live` (sets `OPENCLAW_LIVE_TEST=1`)
- Scope:
  - “Does this provider/model actually work today with real creds?”
  - Catch provider format changes, tool-calling quirks, auth issues, and rate limit behavior
- Expectations:
  - Not CI-stable by design (real networks, real provider policies, quotas, outages)
  - Costs money / uses rate limits
  - Prefer running narrowed subsets instead of “everything”
- Live runs source `~/.profile` to pick up missing API keys.
- By default, live runs still isolate `HOME` and copy config/auth material into a temp test home so unit fixtures cannot mutate your real `~/.openclaw`.
- Set `OPENCLAW_LIVE_USE_REAL_HOME=1` only when you intentionally need live tests to use your real home directory.
- `pnpm test:live` now defaults to a quieter mode: it keeps `[live] ...` progress output, but suppresses the extra `~/.profile` notice and mutes gateway bootstrap logs/Bonjour chatter. Set `OPENCLAW_LIVE_TEST_QUIET=0` if you want the full startup logs back.
- API key rotation (provider-specific): set `*_API_KEYS` with comma/semicolon format or `*_API_KEY_1`, `*_API_KEY_2` (for example `OPENAI_API_KEYS`, `ANTHROPIC_API_KEYS`, `GEMINI_API_KEYS`) or per-live override via `OPENCLAW_LIVE_*_KEY`; tests retry on rate limit responses.
- Progress/heartbeat output:
  - Live suites now emit progress lines to stderr so long provider calls are visibly active even when Vitest console capture is quiet.
  - `vitest.live.config.ts` disables Vitest console interception so provider/gateway progress lines stream immediately during live runs.
  - Tune direct-model heartbeats with `OPENCLAW_LIVE_HEARTBEAT_MS`.
  - Tune gateway/probe heartbeats with `OPENCLAW_LIVE_GATEWAY_HEARTBEAT_MS`.
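
The API key rotation inputs described above (`*_API_KEYS` lists and numbered `*_API_KEY_<n>` variables) could be gathered and used roughly as follows. This is a sketch, not the live-test helper: `collectApiKeys` and `callWithRotation` are hypothetical names, the ordering and de-duplication are assumptions, numbered keys are assumed to stop at the first gap, and moving to the next key on a rate limit is one possible reading of "tests retry on rate limit responses".

```typescript
// Hypothetical sketch of collecting rotation keys and retrying on rate limits.
type EnvMap = Record<string, string | undefined>;

function collectApiKeys(env: EnvMap, prefix: string): string[] {
  const candidates: string[] = [];
  // List form, for example OPENAI_API_KEYS="k1,k2;k3".
  candidates.push(...(env[`${prefix}_API_KEYS`] ?? "").split(/[,;]/));
  // Numbered form: <PREFIX>_API_KEY_1, <PREFIX>_API_KEY_2, ... until the first gap.
  for (let i = 1; ; i++) {
    const value = env[`${prefix}_API_KEY_${i}`];
    if (value === undefined) break;
    candidates.push(value);
  }
  const trimmed = candidates.map((key) => key.trim()).filter((key) => key !== "");
  // Drop duplicates while keeping first-seen order.
  return trimmed.filter((key, index) => trimmed.indexOf(key) === index);
}

// Try each key in order, moving on only when the call reports a rate limit.
function callWithRotation<T>(
  keys: string[],
  call: (key: string) => { rateLimited: boolean; value?: T },
): T {
  for (const key of keys) {
    const result = call(key);
    if (!result.rateLimited) return result.value as T;
  }
  throw new Error("all API keys were rate limited");
}

const keys = collectApiKeys({ OPENAI_API_KEYS: "key-a, key-b;key-c" }, "OPENAI");
console.log(keys.length);
```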

Which suite should I run?

Use this decision table:

- Editing logic/tests: run `pnpm test` (and `pnpm test:coverage` if you changed a lot)
- Touching gateway networking / WS protocol / pairing: add `pnpm test:e2e`
- Debugging “my bot is down” / provider-specific failures / tool calling: run a narrowed `pnpm test:live`

Live (network-touching) tests

For the live model matrix, CLI backend smokes, ACP smokes, Codex app-server harness, and all media-provider live tests (Deepgram, BytePlus, ComfyUI, image, music, video, media harness) — plus credential handling for live runs — see Testing — live suites.

Docker runners (optional "works in Linux" checks)

These Docker runners split into two buckets:

- Live-model runners: `test:docker:live-models` and `test:docker:live-gateway` run only their matching profile-key live file inside the repo Docker image (`src/agents/models.profiles.live.test.ts` and `src/gateway/gateway-models.profiles.live.test.ts`), mounting your local config dir and workspace (and sourcing `~/.profile` if mounted). The matching local entrypoints are `test:live:models-profiles` and `test:live:gateway-profiles`.
- Docker live runners default to a smaller smoke cap so a full Docker sweep stays practical: `test:docker:live-models` defaults to `OPENCLAW_LIVE_MAX_MODELS=12`, and `test:docker:live-gateway` defaults to `OPENCLAW_LIVE_GATEWAY_SMOKE=1`, `OPENCLAW_LIVE_GATEWAY_MAX_MODELS=8`, `OPENCLAW_LIVE_GATEWAY_STEP_TIMEOUT_MS=45000`, and `OPENCLAW_LIVE_GATEWAY_MODEL_TIMEOUT_MS=90000`. Override those env vars when you explicitly want the larger exhaustive scan.
- `test:docker:all` builds the live Docker image once via `test:docker:live-build`, packs OpenClaw once as an npm tarball through `scripts/package-openclaw-for-docker.mjs`, then builds/reuses two `scripts/e2e/Dockerfile` images. The bare image is only the Node/Git runner for install/update/plugin-dependency lanes; those lanes mount the prebuilt tarball. The functional image installs the same tarball into `/app` for built-app functionality lanes. Docker lane definitions live in `scripts/lib/docker-e2e-scenarios.mjs`; planner logic lives in `scripts/lib/docker-e2e-plan.mjs`; `scripts/test-docker-all.mjs` executes the selected plan. The aggregate uses a weighted local scheduler: `OPENCLAW_DOCKER_ALL_PARALLELISM` controls process slots, while resource caps keep heavy live, npm-install, and multi-service lanes from all starting at once. If a single lane is heavier than the active caps, the scheduler can still start it when the pool is empty and then keeps it running alone until capacity is available again. Defaults are 10 slots, `OPENCLAW_DOCKER_ALL_LIVE_LIMIT=9`, `OPENCLAW_DOCKER_ALL_NPM_LIMIT=10`, and `OPENCLAW_DOCKER_ALL_SERVICE_LIMIT=7`; tune `OPENCLAW_DOCKER_ALL_WEIGHT_LIMIT` or `OPENCLAW_DOCKER_ALL_DOCKER_LIMIT` only when the Docker host has more headroom. The runner performs a Docker preflight by default, removes stale OpenClaw E2E containers, prints status every 30 seconds, stores successful lane timings in `.artifacts/docker-tests/lane-timings.json`, and uses those timings to start longer lanes first on later runs. Use `OPENCLAW_DOCKER_ALL_DRY_RUN=1` to print the weighted lane manifest without building or running Docker, or `node scripts/test-docker-all.mjs --plan-json` to print the CI plan for selected lanes, package/image needs, and credentials.
text
Package Acceptance
is the GitHub-native package gate for "does this installable tarball work as a product?" It resolves one candidate package from
text
source=npm
,
text
source=ref
,
text
source=url
, or
text
source=artifact
, uploads it as
text
package-under-test
, then runs the reusable Docker E2E lanes against that exact tarball instead of repacking the selected ref.
text
workflow_ref
selects the trusted workflow/harness scripts, while
text
package_ref
selects the source commit/branch/tag to pack when
text
source=ref
; this lets current acceptance logic validate older trusted commits. Profiles are ordered by breadth:
text
smoke
is quick install/channel/agent plus gateway/config,
text
package
is the package/update/plugin contract plus the keyless upgrade-survivor fixture, the published-baseline upgrade survivor lane, and the default native replacement for most Parallels package/update coverage,
text
product
adds MCP channels, cron/subagent cleanup, OpenAI web search, and OpenWebUI, and
text
full
runs the release-path Docker chunks with OpenWebUI. For
text
published-upgrade-survivor
, Package Acceptance always uses
text
package-under-test
as the candidate and
text
published_upgrade_survivor_baseline
as the published baseline, defaulting to
text
openclaw@latest
; shard broader coverage by dispatching multiple runs with exact baseline values. The published lane configures its baseline with a baked
text
openclaw config set
command recipe, then records recipe steps in the lane summary. Release validation runs a custom package delta (
text
bundled-channel-deps-compat plugins-offline
) plus Telegram package QA because the release-path Docker chunks already cover the overlapping package/update/plugin lanes. Targeted GitHub Docker rerun commands generated from artifacts include prior package artifact, prepared image inputs, and the published upgrade-survivor baseline when available, so failed lanes can avoid rebuilding the package and images.
Build and release checks run
text
scripts/check-cli-bootstrap-imports.mjs
after tsdown. The guard walks the static built graph from
text
dist/entry.js
and
text
dist/cli/run-main.js
and fails if startup imports package dependencies such as Commander, prompt UI, undici, or logging before command dispatch; it also keeps the bundled gateway run chunk under budget and rejects static imports of known cold gateway paths. Packaged CLI smoke also covers root help, onboard help, doctor help, status, config schema, and a model-list command.
Package Acceptance legacy compatibility is capped at
text
2026.4.25
(
text
2026.4.25-beta.*
included). Through that cutoff, the harness tolerates only shipped-package metadata gaps: omitted private QA inventory entries, missing
text
gateway install --wrapper
, missing patch files in the tarball-derived git fixture, missing persisted
text
update.channel
, legacy plugin install-record locations, missing marketplace install-record persistence, and config metadata migration during
text
plugins update
. For packages after
text
2026.4.25
, those paths are strict failures.
Container smoke runners:
text
test:docker:openwebui
,
text
test:docker:onboard
,
text
test:docker:npm-onboard-channel-agent
,
text
test:docker:update-channel-switch
,
text
test:docker:upgrade-survivor
,
text
test:docker:published-upgrade-survivor
,
text
test:docker:session-runtime-context
,
text
test:docker:agents-delete-shared-workspace
,
text
test:docker:gateway-network
,
text
test:docker:browser-cdp-snapshot
,
text
test:docker:mcp-channels
,
text
test:docker:pi-bundle-mcp-tools
,
text
test:docker:cron-mcp-cleanup
,
text
test:docker:plugins
,
text
test:docker:plugin-update
, and
text
test:docker:config-reload
boot one or more real containers and verify higher-level integration paths.
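
The legacy compatibility cutoff described above (tolerant through `2026.4.25`, including `2026.4.25-beta.*`, strict afterwards) can be sketched as a small version check. This is only an illustration, not the harness's actual code: `compat_mode` and `CUTOFF` are hypothetical names, and it assumes versions always have three dot-separated numeric parts with an optional `-suffix`.

```bash
#!/bin/sh
# Hypothetical helper, not part of the OpenClaw harness: classify a package
# version against the legacy compatibility cutoff.
CUTOFF="2026.4.25"

compat_mode() (
  # Drop a prerelease suffix so 2026.4.25-beta.1 compares as 2026.4.25.
  base=${1%%-*}
  IFS=.
  # Split both versions on dots: $1.$2.$3 is the candidate, $4.$5.$6 the cutoff.
  set -- $base $CUTOFF
  if [ "$1" -lt "$4" ]; then echo tolerant
  elif [ "$1" -gt "$4" ]; then echo strict
  elif [ "$2" -lt "$5" ]; then echo tolerant
  elif [ "$2" -gt "$5" ]; then echo strict
  elif [ "$3" -le "$6" ]; then echo tolerant
  else echo strict
  fi
)
```

Calling `compat_mode 2026.4.25-beta.2` prints `tolerant`, while `compat_mode 2026.4.26` prints `strict`. The comparison is numeric per component, so `2026.4.3` sorts before `2026.4.25`.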

The live-model Docker runners also bind-mount only the needed CLI auth homes (or all supported ones when the run is not narrowed), then copy them into the container home before the run so external-CLI OAuth can refresh tokens without mutating the host auth store:

Direct models:
text
pnpm test:docker:live-models
(script:
text
scripts/test-live-models-docker.sh
)
ACP bind smoke:
text
pnpm test:docker:live-acp-bind
(script:
text
scripts/test-live-acp-bind-docker.sh
; covers Claude, Codex, and Gemini by default, with strict Droid/OpenCode coverage via
text
pnpm test:docker:live-acp-bind:droid
and
text
pnpm test:docker:live-acp-bind:opencode
)
CLI backend smoke:
text
pnpm test:docker:live-cli-backend
(script:
text
scripts/test-live-cli-backend-docker.sh
)
Codex app-server harness smoke:
text
pnpm test:docker:live-codex-harness
(script:
text
scripts/test-live-codex-harness-docker.sh
)
Gateway + dev agent:
text
pnpm test:docker:live-gateway
(script:
text
scripts/test-live-gateway-models-docker.sh
)
Observability smoke:
text
pnpm qa:otel:smoke
is a private QA source-checkout lane. It is intentionally not part of package Docker release lanes because the npm tarball omits QA Lab.
Open WebUI live smoke:
text
pnpm test:docker:openwebui
(script:
text
scripts/e2e/openwebui-docker.sh
)
Onboarding wizard (TTY, full scaffolding):
text
pnpm test:docker:onboard
(script:
text
scripts/e2e/onboard-docker.sh
)
Npm tarball onboarding/channel/agent smoke:
text
pnpm test:docker:npm-onboard-channel-agent
installs the packed OpenClaw tarball globally in Docker, configures OpenAI via env-ref onboarding plus Telegram by default, verifies doctor repairs activated plugin runtime deps, and runs one mocked OpenAI agent turn. Reuse a prebuilt tarball with
text
OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz
, skip the host rebuild with
text
OPENCLAW_NPM_ONBOARD_HOST_BUILD=0
, or switch channel with
text
OPENCLAW_NPM_ONBOARD_CHANNEL=discord
.
Update channel switch smoke:
text
pnpm test:docker:update-channel-switch
installs the packed OpenClaw tarball globally in Docker, switches from package
text
stable
to git
text
dev
, verifies the persisted channel and plugin post-update work, then switches back to package
text
stable
and checks update status.
Upgrade survivor smoke:
text
pnpm test:docker:upgrade-survivor
installs the packed OpenClaw tarball over a dirty old-user fixture with agents, channel config, plugin allowlists, stale plugin runtime-deps state, and existing workspace/session files. It runs package update plus non-interactive doctor without live provider or channel keys, then starts a loopback Gateway and checks config/state preservation plus startup/status budgets.
Published upgrade survivor smoke:
text
pnpm test:docker:published-upgrade-survivor
installs
text
openclaw@latest
by default, seeds realistic existing-user files, configures that baseline with a baked command recipe, validates the resulting config, updates that published install to the candidate tarball, runs non-interactive doctor, writes
text
.artifacts/upgrade-survivor/summary.json
, then starts a loopback Gateway and checks configured intents, state preservation, startup, and status budgets. Override the baseline with
text
OPENCLAW_UPGRADE_SURVIVOR_BASELINE_SPEC
; Package Acceptance exposes the same value as
text
published_upgrade_survivor_baseline
.
Session runtime context smoke:
text
pnpm test:docker:session-runtime-context
verifies hidden runtime context transcript persistence plus doctor repair of affected duplicated prompt-rewrite branches.
Bun global install smoke:
text
bash scripts/e2e/bun-global-install-smoke.sh
packs the current tree, installs it with
text
bun install -g
in an isolated home, and verifies
text
openclaw infer image providers --json
returns bundled image providers instead of hanging. Reuse a prebuilt tarball with
text
OPENCLAW_BUN_GLOBAL_SMOKE_PACKAGE_TGZ=/path/to/openclaw-*.tgz
, skip the host build with
text
OPENCLAW_BUN_GLOBAL_SMOKE_HOST_BUILD=0
, or copy
text
dist/
from a built Docker image with
text
OPENCLAW_BUN_GLOBAL_SMOKE_DIST_IMAGE=openclaw-dockerfile-smoke:local
.
Installer Docker smoke:
text
bash scripts/test-install-sh-docker.sh
shares one npm cache across its root, update, and direct-npm containers. Update smoke defaults to npm
text
latest
as the stable baseline before upgrading to the candidate tarball. Override with
text
OPENCLAW_INSTALL_SMOKE_UPDATE_BASELINE=2026.4.22
locally, or with the Install Smoke workflow's
text
update_baseline_version
input on GitHub. Non-root installer checks keep an isolated npm cache so root-owned cache entries do not mask user-local install behavior. Set
text
OPENCLAW_INSTALL_SMOKE_NPM_CACHE_DIR=/path/to/cache
to reuse the root/update/direct-npm cache across local reruns.
Install Smoke CI skips the duplicate direct-npm global update with
text
OPENCLAW_INSTALL_SMOKE_SKIP_NPM_GLOBAL=1
; run the script locally without that env when direct
text
npm install -g
coverage is needed.
Agents delete shared workspace CLI smoke:
text
pnpm test:docker:agents-delete-shared-workspace
(script:
text
scripts/e2e/agents-delete-shared-workspace-docker.sh
) builds the root Dockerfile image by default, seeds two agents with one workspace in an isolated container home, runs
text
agents delete --json
, and verifies valid JSON plus retained workspace behavior. Reuse the install-smoke image with
text
OPENCLAW_AGENTS_DELETE_SHARED_WORKSPACE_E2E_IMAGE=openclaw-dockerfile-smoke:local OPENCLAW_AGENTS_DELETE_SHARED_WORKSPACE_E2E_SKIP_BUILD=1
.
Gateway networking (two containers, WS auth + health):
text
pnpm test:docker:gateway-network
(script:
text
scripts/e2e/gateway-network-docker.sh
)
Browser CDP snapshot smoke:
text
pnpm test:docker:browser-cdp-snapshot
(script:
text
scripts/e2e/browser-cdp-snapshot-docker.sh
) builds the source E2E image plus a Chromium layer, starts Chromium with raw CDP, runs
text
browser doctor --deep
, and verifies CDP role snapshots cover link URLs, cursor-promoted clickables, iframe refs, and frame metadata.
OpenAI Responses web_search minimal reasoning regression:
text
pnpm test:docker:openai-web-search-minimal
(script:
text
scripts/e2e/openai-web-search-minimal-docker.sh
) runs a mocked OpenAI server through Gateway, verifies
text
web_search
raises
text
reasoning.effort
from
text
minimal
to
text
low
, then forces a provider schema rejection and checks that the raw detail appears in Gateway logs.
MCP channel bridge (seeded Gateway + stdio bridge + raw Claude notification-frame smoke):
text
pnpm test:docker:mcp-channels
(script:
text
scripts/e2e/mcp-channels-docker.sh
)
Pi bundle MCP tools (real stdio MCP server + embedded Pi profile allow/deny smoke):
text
pnpm test:docker:pi-bundle-mcp-tools
(script:
text
scripts/e2e/pi-bundle-mcp-tools-docker.sh
)
Cron/subagent MCP cleanup (real Gateway + stdio MCP child teardown after isolated cron and one-shot subagent runs):
text
pnpm test:docker:cron-mcp-cleanup
(script:
text
scripts/e2e/cron-mcp-cleanup-docker.sh
)
Plugins (install smoke, ClawHub kitchen-sink install/uninstall, marketplace updates, and Claude-bundle enable/inspect):
text
pnpm test:docker:plugins
(script:
text
scripts/e2e/plugins-docker.sh
). Set
text
OPENCLAW_PLUGINS_E2E_CLAWHUB=0
to skip the ClawHub block, or override the default kitchen-sink package/runtime pair with
text
OPENCLAW_PLUGINS_E2E_CLAWHUB_SPEC
and
text
OPENCLAW_PLUGINS_E2E_CLAWHUB_ID
. Without
text
OPENCLAW_CLAWHUB_URL
/
text
CLAWHUB_URL
, the test uses a hermetic local ClawHub fixture server.
Plugin update unchanged smoke:
text
pnpm test:docker:plugin-update
(script:
text
scripts/e2e/plugin-update-unchanged-docker.sh
)
Config reload metadata smoke:
text
pnpm test:docker:config-reload
(script:
text
scripts/e2e/config-reload-source-docker.sh
)
Bundled plugin runtime deps:
text
pnpm test:docker:bundled-channel-deps
builds a small Docker runner image by default, builds and packs OpenClaw once on the host, then mounts that tarball into each Linux install scenario. Reuse the image with
text
OPENCLAW_SKIP_DOCKER_BUILD=1
, skip the host rebuild after a fresh local build with
text
OPENCLAW_BUNDLED_CHANNEL_HOST_BUILD=0
, or point at an existing tarball with
text
OPENCLAW_CURRENT_PACKAGE_TGZ=/path/to/openclaw-*.tgz
. The full Docker aggregate and release-path bundled-channel chunks pre-pack this tarball once, then shard bundled channel checks into independent lanes, including separate update lanes for Telegram, Discord, Slack, Feishu, memory-lancedb, and ACPX. Release chunks split channel smokes, update targets, and setup/runtime contracts into
text
bundled-channels-core
,
text
bundled-channels-update-a
,
text
bundled-channels-update-b
, and
text
bundled-channels-contracts
; the aggregate
text
bundled-channels
chunk remains available for manual reruns. The release workflow also splits provider installer chunks and bundled plugin install/uninstall chunks; legacy
text
package-update
,
text
plugins-runtime
, and
text
plugins-integrations
chunks remain aggregate aliases for manual reruns. Use
text
OPENCLAW_BUNDLED_CHANNELS=telegram,slack
to narrow the channel matrix when running the bundled lane directly, or
text
OPENCLAW_BUNDLED_CHANNEL_UPDATE_TARGETS=telegram,acpx
to narrow the update scenario. Per-scenario Docker runs default to
text
OPENCLAW_BUNDLED_CHANNEL_DOCKER_RUN_TIMEOUT=900s
; the multi-target update scenario defaults to
text
OPENCLAW_BUNDLED_CHANNEL_UPDATE_DOCKER_RUN_TIMEOUT=2400s
. The lane also verifies that
text
channels.<id>.enabled=false
and
text
plugins.entries.<id>.enabled=false
suppress doctor/runtime-dependency repair.
Narrow bundled plugin runtime deps while iterating by disabling unrelated scenarios, for example:
text
OPENCLAW_BUNDLED_CHANNEL_SCENARIOS=0 OPENCLAW_BUNDLED_CHANNEL_UPDATE_SCENARIO=0 OPENCLAW_BUNDLED_CHANNEL_ROOT_OWNED_SCENARIO=0 OPENCLAW_BUNDLED_CHANNEL_SETUP_ENTRY_SCENARIO=0 pnpm test:docker:bundled-channel-deps
.
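
Several runners above accept a prebuilt tarball through `OPENCLAW_CURRENT_PACKAGE_TGZ`. Because a glob inside a variable assignment is not expanded by the shell, it can help to resolve the path first. The following is a minimal sketch under that assumption; `pick_package_tgz` and the directory passed to it are hypothetical and not part of the repo scripts.

```bash
#!/bin/sh
# Hypothetical helper: resolve exactly one packed tarball in a directory so it
# can be exported as OPENCLAW_CURRENT_PACKAGE_TGZ.
pick_package_tgz() (
  dir=$1
  set -- "$dir"/openclaw-*.tgz
  # An unmatched glob stays literal, so check that the first entry exists.
  if [ ! -e "$1" ]; then
    echo "no openclaw-*.tgz in $dir" >&2
    exit 1
  fi
  # Refuse to guess when more than one tarball is present.
  if [ $# -gt 1 ]; then
    echo "multiple tarballs in $dir; pass one explicitly" >&2
    exit 1
  fi
  printf '%s\n' "$1"
)

# Example use (the directory name is a placeholder):
#   OPENCLAW_CURRENT_PACKAGE_TGZ="$(pick_package_tgz ./packed)" \
#   OPENCLAW_NPM_ONBOARD_HOST_BUILD=0 \
#   pnpm test:docker:npm-onboard-channel-agent
```

Failing on zero or multiple matches keeps a rerun from silently testing a stale tarball.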

To prebuild and reuse the shared functional image manually:


bash
OPENCLAW_DOCKER_E2E_IMAGE=openclaw-docker-e2e-functional:local pnpm test:docker:e2e-build
OPENCLAW_DOCKER_E2E_IMAGE=openclaw-docker-e2e-functional:local OPENCLAW_SKIP_DOCKER_BUILD=1 pnpm test:docker:mcp-channels

Suite-specific image overrides such as

text

OPENCLAW_GATEWAY_NETWORK_E2E_IMAGE

still win when set. When

text

OPENCLAW_SKIP_DOCKER_BUILD=1

points at a remote shared image, the scripts pull it if it is not already local. The QR and installer Docker tests keep their own Dockerfiles because they validate package/install behavior rather than the shared built-app runtime.
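
The override precedence can be pictured as a simple fallback chain: a suite-specific image variable wins, then the shared `OPENCLAW_DOCKER_E2E_IMAGE`, then whatever default the script uses. The sketch below is an assumption about the shape of that logic, not a copy of the scripts; `resolve_e2e_image` is a hypothetical name and the fallback tag is supplied by the caller rather than taken from the repo.

```bash
#!/bin/sh
# Sketch of the image precedence: suite-specific override first, shared image
# second, caller-supplied fallback last.
resolve_e2e_image() {
  suite_image=$1   # value of the suite-specific override, may be empty
  fallback=$2      # placeholder default tag, not a documented value
  if [ -n "$suite_image" ]; then
    printf '%s\n' "$suite_image"
  elif [ -n "${OPENCLAW_DOCKER_E2E_IMAGE:-}" ]; then
    printf '%s\n' "$OPENCLAW_DOCKER_E2E_IMAGE"
  else
    printf '%s\n' "$fallback"
  fi
}

# Example use:
#   resolve_e2e_image "${OPENCLAW_GATEWAY_NETWORK_E2E_IMAGE:-}" some-default:local
```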

The live-model Docker runners also bind-mount the current checkout read-only and stage it into a temporary workdir inside the container. This keeps the runtime image slim while still running Vitest against your exact local source/config. The staging step skips large local-only caches and app build outputs such as

text

.pnpm-store

,

text

.worktrees

,

text

__openclaw_vitest__

, and app-local

text

.build

or Gradle output directories so Docker live runs do not spend minutes copying machine-specific artifacts. They also set

text

OPENCLAW_SKIP_CHANNELS=1

so gateway live probes do not start real Telegram/Discord/etc. channel workers inside the container.
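
To illustrate the staging idea, here is a rough sketch that copies a checkout into a work directory while skipping the cache and build directories named above. It uses `tar --exclude` as a stand-in; the real runner scripts may stage files differently, `stage_checkout` is a hypothetical name, and Gradle output directories are left out because their exact names are not listed here.

```bash
#!/bin/sh
# Hypothetical staging helper: copy SRC into DST, skipping large local-only
# directories at any depth.
stage_checkout() {
  src=$1
  dst=$2
  mkdir -p "$dst"
  # Stream an archive of SRC and unpack it in DST; excluded names are matched
  # against every path component, so nested app-local .build dirs are skipped.
  tar -C "$src" \
    --exclude=.pnpm-store \
    --exclude=.worktrees \
    --exclude=__openclaw_vitest__ \
    --exclude=.build \
    -cf - . | tar -C "$dst" -xf -
}
```

Streaming through `tar` avoids writing an intermediate archive and keeps the source tree untouched, which matches the read-only bind mount described above.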

text

test:docker:live-models

still runs

text

pnpm test:live

, so pass through

text

OPENCLAW_LIVE_GATEWAY_*

as well when you need to narrow or exclude gateway live coverage from that Docker lane.

text

test:docker:openwebui

is a higher-level compatibility smoke: it starts an OpenClaw gateway container with the OpenAI-compatible HTTP endpoints enabled, starts a pinned Open WebUI container against that gateway, signs in through Open WebUI, verifies

text

/api/models

exposes

text

openclaw/default

, then sends a real chat request through Open WebUI's

text

/api/chat/completions

proxy. The first run can be noticeably slower because Docker may need to pull the Open WebUI image and Open WebUI may need to finish its own cold-start setup. This lane expects a usable live model key, and

text

OPENCLAW_PROFILE_FILE

(

text

~/.profile

by default) is the primary way to provide it in Dockerized runs. Successful runs print a small JSON payload like

text

{ "ok": true, "model": "openclaw/default", ... }

.

text

test:docker:mcp-channels

is intentionally deterministic and does not need a real Telegram, Discord, or iMessage account. It boots a seeded Gateway container, starts a second container that spawns

text

openclaw mcp serve

, then verifies routed conversation discovery, transcript reads, attachment metadata, live event queue behavior, outbound send routing, and Claude-style channel + permission notifications over the real stdio MCP bridge. The notification check inspects the raw stdio MCP frames directly so the smoke validates what the bridge actually emits, not just what a specific client SDK happens to surface.

text

test:docker:pi-bundle-mcp-tools

is deterministic and does not need a live model key. It builds the repo Docker image, starts a real stdio MCP probe server inside the container, materializes that server through the embedded Pi bundle MCP runtime, executes the tool, then verifies

text

coding

and

text

messaging

keep

text

bundle-mcp

tools while

text

minimal

and

text

tools.deny: ["bundle-mcp"]

filter them.

text

test:docker:cron-mcp-cleanup

is deterministic and does not need a live model key. It starts a seeded Gateway with a real stdio MCP probe server, runs an isolated cron turn and a

text

/subagents spawn

one-shot child turn, then verifies the MCP child process exits after each run.

Manual ACP plain-language thread smoke (not CI):

text
bun scripts/dev/discord-acp-plain-language-smoke.ts --channel <discord-channel-id> ...
Keep this script for regression/debug workflows. It may be needed again for ACP thread routing validation, so do not delete it.

Useful env vars:

text
OPENCLAW_CONFIG_DIR=...
(default:
text
~/.openclaw
) mounted to
text
/home/node/.openclaw
text
OPENCLAW_WORKSPACE_DIR=...
(default:
text
~/.openclaw/workspace
) mounted to
text
/home/node/.openclaw/workspace
text
OPENCLAW_PROFILE_FILE=...
(default:
text
~/.profile
) mounted to
text
/home/node/.profile
and sourced before running tests
text
OPENCLAW_DOCKER_PROFILE_ENV_ONLY=1
to verify only env vars sourced from
text
OPENCLAW_PROFILE_FILE
, using temporary config/workspace dirs and no external CLI auth mounts
text
OPENCLAW_DOCKER_CLI_TOOLS_DIR=...
(default:
text
~/.cache/openclaw/docker-cli-tools
) mounted to
text
/home/node/.npm-global
for cached CLI installs inside Docker
External CLI auth dirs/files under
text
$HOME
are mounted read-only under
text
/host-auth...
, then copied into
text
/home/node/...
before tests start
- Default dirs:
  text
  .minimax
- Default files:
  text
  ~/.codex/auth.json
  ,
  text
  ~/.codex/config.toml
  ,
  text
  .claude.json
  ,
  text
  ~/.claude/.credentials.json
  ,
  text
  ~/.claude/settings.json
  ,
  text
  ~/.claude/settings.local.json
- Narrowed provider runs mount only the needed dirs/files inferred from
  text
  OPENCLAW_LIVE_PROVIDERS
  /
  text
  OPENCLAW_LIVE_GATEWAY_PROVIDERS
- Override manually with
  text
  OPENCLAW_DOCKER_AUTH_DIRS=all
  ,
  text
  OPENCLAW_DOCKER_AUTH_DIRS=none
  , or a comma list like
  text
  OPENCLAW_DOCKER_AUTH_DIRS=.claude,.codex
text
OPENCLAW_LIVE_GATEWAY_MODELS=...
/
text
OPENCLAW_LIVE_MODELS=...
to narrow the run
text
OPENCLAW_LIVE_GATEWAY_PROVIDERS=...
/
text
OPENCLAW_LIVE_PROVIDERS=...
to filter providers in-container
text
OPENCLAW_SKIP_DOCKER_BUILD=1
to reuse an existing
text
openclaw:local-live
image for reruns that do not need a rebuild
text
OPENCLAW_LIVE_REQUIRE_PROFILE_KEYS=1
to ensure creds come from the profile store (not env)
text
OPENCLAW_OPENWEBUI_MODEL=...
to choose the model exposed by the gateway for the Open WebUI smoke
text
OPENCLAW_OPENWEBUI_PROMPT=...
to override the nonce-check prompt used by the Open WebUI smoke
text
OPENWEBUI_IMAGE=...
to override the pinned Open WebUI image tag
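
As a rough illustration of the `OPENCLAW_DOCKER_AUTH_DIRS` values listed above (`all`, `none`, or a comma list), the sketch below turns a value into one directory per line. It is not the runner's implementation: `resolve_auth_dirs` is a hypothetical name, and the set used for `all` is passed in by the caller because the full supported list is not spelled out here.

```bash
#!/bin/sh
# Hypothetical parser for OPENCLAW_DOCKER_AUTH_DIRS-style values.
resolve_auth_dirs() (
  value=$1
  all_dirs=$2   # comma list the caller treats as "all supported"
  case $value in
    none) exit 0 ;;              # mount nothing
    all)  value=$all_dirs ;;     # expand to the caller's full list
  esac
  IFS=,
  # Split the comma list and print one non-empty entry per line.
  for dir in $value; do
    [ -n "$dir" ] && printf '%s\n' "$dir"
  done
  exit 0
)

# Example use:
#   resolve_auth_dirs "${OPENCLAW_DOCKER_AUTH_DIRS:-all}" ".claude,.codex,.minimax"
```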

Docs sanity

Run docs checks after doc edits:

text

pnpm check:docs

. Run full Mintlify anchor validation when you need in-page heading checks too:

text

pnpm docs:check-links:anchors

Offline regression (CI-safe)

These are “real pipeline” regressions without real providers:

Gateway tool calling (mock OpenAI, real gateway + agent loop):
text
src/gateway/gateway.test.ts
(case: "runs a mock OpenAI tool call end-to-end via gateway agent loop")
Gateway wizard (WS
text
wizard.start
/
text
wizard.next
, writes config + auth enforced):
text
src/gateway/gateway.test.ts
(case: "runs wizard over ws and writes auth token config")

Agent reliability evals (skills)

We already have a few CI-safe tests that behave like “agent reliability evals”:

Mock tool-calling through the real gateway + agent loop (
text
src/gateway/gateway.test.ts
).
End-to-end wizard flows that validate session wiring and config effects (
text
src/gateway/gateway.test.ts
).

What’s still missing for skills (see Skills):

Decisioning: when skills are listed in the prompt, does the agent pick the right skill (or avoid irrelevant ones)?
Compliance: does the agent read
text
SKILL.md
before use and follow required steps/args?
Workflow contracts: multi-turn scenarios that assert tool order, session history carryover, and sandbox boundaries.

Future evals should stay deterministic first:

A scenario runner using mock providers to assert tool calls + order, skill file reads, and session wiring.
A small suite of skill-focused scenarios (use vs avoid, gating, prompt injection).
Optional live evals (opt-in, env-gated) only after the CI-safe suite is in place.
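
One way to picture the "assert tool calls + order" part of such a scenario runner is a check that expected tool names appear in order within a recorded call log. The sketch below is purely illustrative: `assert_tool_order`, the one-name-per-line log format, and the tool names in the example are all assumptions, not an existing OpenClaw API.

```bash
#!/bin/sh
# Hypothetical order check: succeed when the expected tool names occur in the
# log in the given order (other calls may appear in between).
assert_tool_order() (
  log=$1
  shift
  while IFS= read -r tool; do
    # Stop early once every expected name has been matched.
    [ $# -eq 0 ] && break
    if [ "$tool" = "$1" ]; then
      shift
    fi
  done < "$log"
  # Any names left over were never seen in the required position.
  [ $# -eq 0 ]
)

# Example use (log file and tool names are placeholders):
#   assert_tool_order calls.log read_file send_message
```

Treating the expectation as an ordered subsequence keeps the check deterministic while tolerating extra, unrelated tool calls from a mock provider.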

Contract tests (plugin and channel shape)

Contract tests verify that every registered plugin and channel conforms to its interface contract. They iterate over all discovered plugins and run a suite of shape and behavior assertions. The default

text

pnpm test

unit lane intentionally skips these shared seam and smoke files; run the contract commands explicitly when you touch shared channel or provider surfaces.

Commands

All contracts:
text
pnpm test:contracts
Channel contracts only:
text
pnpm test:contracts:channels
Provider contracts only:
text
pnpm test:contracts:plugins

Channel contracts

Located in

text

src/channels/plugins/contracts/*.contract.test.ts

plugin - Basic plugin shape (id, name, capabilities)
setup - Setup wizard contract
session-binding - Session binding behavior
outbound-payload - Message payload structure
inbound - Inbound message handling
actions - Channel action handlers
threading - Thread ID handling
directory - Directory/roster API
group-policy - Group policy enforcement

Provider status contracts

Located in

text

src/plugins/contracts/*.contract.test.ts

status - Channel status probes
registry - Plugin registry shape

Provider contracts

Located in

text

src/plugins/contracts/*.contract.test.ts

auth - Auth flow contract
auth-choice - Auth choice/selection
catalog - Model catalog API
discovery - Plugin discovery
loader - Plugin loading
runtime - Provider runtime
shape - Plugin shape/interface
wizard - Setup wizard

When to run

After changing plugin-sdk exports or subpaths
After adding or modifying a channel or provider plugin
After refactoring plugin registration or discovery

Contract tests run in CI and do not require real API keys.
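
For local iteration, the commands above can be paired with the area you changed. The mapping below is a hypothetical convenience, not a repo script: `contract_command_for` is an invented name, the prefixes mirror the contract locations listed above, and falling back to the full contract run for any other path is an assumption.

```bash
#!/bin/sh
# Hypothetical mapping from a changed path to the narrowest contract command.
contract_command_for() {
  case $1 in
    src/channels/*) echo "pnpm test:contracts:channels" ;;
    src/plugins/*)  echo "pnpm test:contracts:plugins" ;;
    *)              echo "pnpm test:contracts" ;;   # unsure, run everything
  esac
}

# Example use:
#   $(contract_command_for src/channels/plugins/contracts/setup.contract.test.ts)
```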

Adding regressions (guidance)

When you fix a provider/model issue discovered in live:

Add a CI-safe regression if possible (mock/stub provider, or capture the exact request-shape transformation)
If it’s inherently live-only (rate limits, auth policies), keep the live test narrow and opt-in via env vars
Prefer targeting the smallest layer that catches the bug:
- provider request conversion/replay bug → direct models test
- gateway session/history/tool pipeline bug → gateway live smoke or CI-safe gateway mock test
SecretRef traversal guardrail:
- text
  src/secrets/exec-secret-ref-id-parity.test.ts
  derives one sampled target per SecretRef class from registry metadata (
  text
  listSecretTargetRegistryEntries()
  ), then asserts traversal-segment exec ids are rejected.
- If you add a new
  text
  includeInPlan
  SecretRef target family in
  text
  src/secrets/target-registry-data.ts
  , update
  text
  classifyTargetClass
  in that test. The test intentionally fails on unclassified target ids so new classes cannot be skipped silently.
