Portable Coordinator

Read this when you want to run the Crabbox coordinator on an ordinary container platform instead of Cloudflare Workers, or when you are reviewing the runtime, database, ingress, and failure-recovery requirements of that deployment.

The portable coordinator is a shipped Node.js 22 service backed by PostgreSQL. It runs the same FleetCoordinator behavior, routes, portal, and bridge protocols as the Cloudflare Worker deployment. The runtime choice changes how state, alarms, HTTP, and WebSockets are hosted; it does not change the CLI or provider contracts.

#Runtime choices

Concern	Cloudflare runtime	Portable runtime
HTTP and portal	Worker	Node.js HTTP server
Durable records	Durable Object storage	PostgreSQL key/value compatibility table
Scheduled cleanup	Durable Object alarms and cron	pg-boss delayed jobs and reconciliation schedule
WebSockets	Durable Object sockets	`ws` sockets owned by the Node process
Mutation ordering	Durable Object event ordering	explicit lifecycle mutex
Packaging	Wrangler deployment	OCI image from `worker/Dockerfile.node`

Choose Cloudflare when its Worker, Durable Object, routing, and Access products already fit the deployment. Choose Node/PostgreSQL when the coordinator must run on a conventional container platform, use an existing PostgreSQL service, or sit behind an identity-aware reverse proxy.

#Topology

Crabbox CLI ---- HTTPS / WebSocket ---- ingress ---- Node coordinator
     |                                               |        |
     +---- SSH + rsync directly to runner -----------+        +---- PostgreSQL
                                                              +---- pg-boss

The coordinator remains a control plane. SSH readiness, repository sync, command execution, and stdout/stderr streaming stay direct between the CLI and the runner for normal CLI leases. The specialized private AWS workspace service instead uses SSM for an API-managed workspace bootstrap and intentionally exposes no SSH path. The coordinator handles authentication, leases, run records, sharing, usage, cleanup, and optional live bridges.

#Runtime boundary

Shared coordinator behavior depends on the CoordinatorRuntime interface:

key/value get, put, delete, and prefix list storage;
serialized lifecycle operations through runExclusive;
WebSocket upgrade, attachment, enumeration, and event handling;
alarm scheduling and cancellation.

CloudflareCoordinatorRuntime maps those operations to Durable Object APIs. NodeCoordinatorRuntime maps them to PostgreSQL, pg-boss, a process mutex, and the Node WebSocket server. Routes and fleet behavior stay in shared modules.

#Build and run

Build from the repository root:

npm ci --prefix worker
npm run check:node --prefix worker
npm run build:node --prefix worker
docker build -f worker/Dockerfile.node -t crabbox-coordinator:local worker

Run the built service with a protected environment file populated by the platform secret manager:

docker run --rm -p 8080:8080 \
  --env-file /secure/path/crabbox-coordinator.env \
  crabbox-coordinator:local

The image exposes port 8080 and starts dist-node/server.mjs. Terminate TLS at the ingress and forward WebSocket upgrades without buffering. The protected file supplies the variables below, including a DATABASE_URL with sslmode=verify-full and a trusted CA for any remote database. Keep database credentials and bearer tokens out of shell history and process arguments.

#Required platform resources

The initial production shape is intentionally small:

one always-on coordinator replica;
one PostgreSQL 13 or newer database reachable through DATABASE_URL;
durable PostgreSQL storage and backups;
an HTTPS ingress that supports WebSocket upgrades;
secret injection for coordinator auth and any managed-provider credentials that are not supplied by a workload identity or default provider chain;
outbound access to configured provider APIs;
an HTTP timeout long enough for persistent WebSockets.

Do not scale the Node service above one replica yet. Durable records and jobs are shared, but active socket ownership and lifecycle serialization are still process-local. Multi-replica operation requires distributed lifecycle locking and bridge-owner routing.

#Configuration groups

The portable runtime consumes the same coordinator and provider variables as the Worker, plus Node-specific settings.

Core runtime:

DATABASE_URL                         required PostgreSQL connection string; use verified TLS remotely
PORT                                 listener port, default 8080
CRABBOX_PUBLIC_URL                   canonical external origin
CRABBOX_CODE_ORIGIN_TEMPLATE         required for browser Code; HTTPS per-lease template needs wildcard TLS/WebSockets
CRABBOX_DATABASE_POOL_SIZE           PostgreSQL pool size, default 10
CRABBOX_DATABASE_CONNECT_TIMEOUT_MS  connection timeout, default 10000
CRABBOX_SHUTDOWN_TIMEOUT_MS          graceful shutdown budget, default 120000

Authentication requires at least one supported coordinator identity path. Use signed user sessions, a shared operator token, an admin token, or a trusted reverse-proxy identity as documented in Coordinator and Security. Managed-provider authentication can come from environment secrets or a provider default chain. On ECS, the dedicated private AWS deployment requires temporary task-role credentials and rejects static access keys. Registered direct leases need no provider credentials in the service.

#PostgreSQL state

Startup creates the crabbox schema and a compatibility table named coordinator_kv. Existing coordinator keys such as leases, runs, bridge tickets, image records, and cleanup records remain unchanged. Values are stored as JSONB with a text-preserving representation for exact compatibility.

pg-boss uses its own crabbox_jobs schema. It owns:

the next exact fleet alarm;
retryable alarm delivery;
a reconciliation job every 15 minutes;
a startup reconciliation pass after deployment or restart.

Back up both schemas together. If coordinator records are intact but job state is missing, startup and recurring reconciliation can rebuild scheduling from those records. Job state alone cannot reconstruct missing leases, runs, or cleanup obligations; recover crabbox.coordinator_kv before resuming the service or managed resources may be stranded. Provider operations and cleanup are idempotent, so repeated job delivery is expected and safe.

#Lifecycle serialization

Node does not inherit Durable Object event ordering, so lifecycle mutations use an explicit process mutex. Slow provider calls do not hold that mutex. Provisioning follows three phases:

reserve the lease and cost under the lifecycle lock;
call the provider outside the lock;
finalize the lease or persist cleanup-required failure state under the lock.

This keeps heartbeats, releases, portal requests, and bridge traffic responsive while a provider is slow. A release racing an in-flight create remains recoverable because the reservation exists before provider I/O begins.

HTTP routes are classified as lifecycle or direct operations. Direct routes perform their own smaller critical sections. Bridge data frames bypass the global lifecycle mutex and retain per-socket ordering; control messages that mutate fleet state stay serialized.

#WebSockets and shutdown

The Node runtime supports the same control, WebVNC, code, and egress bridge protocols as Cloudflare. It disables per-message compression, caps individual messages, pings sockets every 30 seconds, and terminates connections that stop answering.

On SIGTERM or SIGINT, the service:

marks the runtime as shutting down, stops new upgrades, and closes the HTTP listener plus idle connections;
drains active request handlers and lifecycle operations;
closes live sockets with restart code 1012 and drains their handlers;
waits for the current alarm, stops pg-boss gracefully, and closes PostgreSQL;
waits for the HTTP server to finish closing.

The supervised WebVNC daemon restarts foreground bridge processes and obtains fresh tickets after interruption. A foreground crabbox webvnc, code bridge, or ordinary egress bridge may exit on coordinator restart; rerun crabbox webvnc, crabbox code, or crabbox egress start unless that path has its own supervisor. Ingress rollouts should still use a termination grace period at least as long as CRABBOX_SHUTDOWN_TIMEOUT_MS.

#Trusted reverse proxies

Identity-aware ingress is opt-in. Configure all three:

CRABBOX_TRUSTED_USER_HEADER     header containing the authenticated user
CRABBOX_TRUSTED_USER_ORG        fixed organization for authenticated users
CRABBOX_TRUSTED_PROXY_CIDRS    comma-separated ingress peer CIDRs

The service trusts forwarded identity, host, protocol, and caller-address headers only when the actual TCP peer matches the configured CIDRs. It removes caller-supplied connection identity and derives the client address from the rightmost untrusted member of the forwarded chain. The ingress must strip any incoming copy of the trusted user header before setting its authenticated value. Set a stable trusted-user organization before creating records; without it, proxy-authenticated requests use a dedicated missing-org identity that is displayed as unknown but remains distinct from an org literally named unknown.

Trust only the actual ingress addresses or a dedicated ingress subnet, and use network policy to block direct access from other workloads. Do not configure 0.0.0.0/0, ::/0, or an entire shared private network. If the ingress source range cannot be made stable and narrow, use Crabbox bearer authentication instead of a trusted identity header.

#Health and request limits

Use these endpoints:

GET /v1/health   liveness; no database query
GET /v1/ready    readiness; verifies PostgreSQL
HEAD /v1/ready   readiness without a response body

The Node server limits unauthenticated bodies to 1 MiB, normal authenticated bodies to 16 MiB, and run-finish payloads to 64 MiB. Oversized requests return 413 and close the connection. Configure the ingress with limits at least as large as these application limits, but do not make unauthenticated limits larger.

#Deployment checklist

Before serving users:

create a dedicated PostgreSQL database and least-privilege role;
enable database backups and test a restore;
deploy exactly one service replica;
use a Recreate/no-surge rollout so old and new processes never overlap;
enable TLS and WebSocket upgrades at ingress;
keep request buffering disabled for upgrades;
inject credentials through the platform secret store, never image layers or argv;
require verified TLS and a trusted CA for remote PostgreSQL;
configure canonical public URL and auth;
configure exact trusted ingress peers and block direct application access;
set a termination grace period for orderly shutdown;
verify liveness and readiness separately;
verify unauthenticated requests fail closed;
create, heartbeat, inspect, share, and release a test lease;
connect and reconnect one WebVNC or code bridge;
restart the service and confirm records and scheduled cleanup survive.

#Upgrade and rollback

The Node and Cloudflare adapters share the same logical record keys, but the repository does not ship supported Durable Object export or PostgreSQL import tooling. Stateful cross-runtime migration is therefore unsupported. Do not move active leases, tickets, jobs, or pending cleanup between runtimes. A fresh cutover is safe only after draining and releasing all resources and verifying that the old coordinator has no remaining cleanup obligations.

Ordinary Node image upgrades are simpler. Back up PostgreSQL, deploy the new image with Recreate/no-surge semantics, wait for the old process to finish its graceful shutdown before starting the replacement, then wait for /v1/ready and verify startup reconciliation. Roll back the image with the same non-overlapping sequence and without rolling back the database unless release notes explicitly require it.

#Verification from source

npm run format:check --prefix worker
npm run lint --prefix worker
npm run check --prefix worker
npm run check:node --prefix worker
npm test --prefix worker
npm run build:node --prefix worker
npm run build --prefix worker

The Worker test suite covers shared fleet behavior. Node-specific tests cover PostgreSQL storage, request handling, proxy trust, WebSockets, lifecycle ordering, maintenance, and shutdown.