Compute diagrams

These diagrams describe the compute layer at a high level, for protocol and product docs rather than runtime observability.

Scheduler path

The scheduler selects an eligible idle worker. Busy workers remain visible in the registry but are not selected for the active request.

Client → router → worker
[Diagram: CLIENT sends a request to ROUTER, which matches the job to a worker. GPU A is idle; GPU B is busy.]
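
The selection rule described above can be sketched in a few lines. This is a minimal illustration, not the actual scheduler: the `Worker` record, its `status` and `eligible` fields, and the `select_worker` function are hypothetical names, and taking the first match is an assumed tie-break.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Worker:
    # Hypothetical registry entry; field names are assumptions for illustration.
    name: str
    status: str     # "idle" or "busy"
    eligible: bool  # whether the worker can serve this request at all


def select_worker(registry: list[Worker]) -> Optional[Worker]:
    """Return the first eligible idle worker, or None if there is none.

    Busy workers are skipped but are left in the registry unchanged.
    """
    for worker in registry:
        if worker.eligible and worker.status == "idle":
            return worker
    return None


# Mirrors the diagram: GPU A is idle, GPU B is busy.
registry = [
    Worker("GPU B", status="busy", eligible=True),
    Worker("GPU A", status="idle", eligible=True),
]
chosen = select_worker(registry)
```

Here `chosen` is the GPU A entry, and the registry still lists both workers afterwards, so the busy worker stays visible even though it was not selected.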

GPU execution model

A worker exposes enough capability data for routing without making the user understand GPU internals. The fields below are useful for benchmarks, route selection, and provider documentation.

Memory hierarchy

VRAM

Determines which model weights can fit on the device.

Runtime

The backend the worker runs on: CUDA, Metal, Vulkan, WebGPU, or managed bootstrap supply.

Throughput

Measured tokens per second for each supported model.

Latency

Observed queue delay and time to first token.
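
As a rough sketch of how these fields could feed route selection, the record below collects them into one structure. Every name here (`WorkerCapabilities`, `vram_gb`, `tokens_per_second`, `pick_route`) and the ranking rule are assumptions made for illustration; the text above does not specify a schema or a selection policy. The VRAM check is deliberately crude: it compares weight size to total VRAM and ignores memory needed for activations or caches. The numbers in the example are placeholders, not measurements.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class WorkerCapabilities:
    # Hypothetical schema mirroring the four fields described above.
    name: str
    vram_gb: float   # VRAM: bounds which model weights fit
    runtime: str     # e.g. "CUDA", "Metal", "Vulkan", "WebGPU"
    tokens_per_second: dict[str, float] = field(default_factory=dict)  # per model
    queue_delay_ms: float = 0.0   # observed queue delay
    first_token_ms: float = 0.0   # observed time to first token


def pick_route(
    workers: list[WorkerCapabilities], model: str, weights_gb: float
) -> Optional[WorkerCapabilities]:
    """Pick a worker for `model`, or None if no worker qualifies.

    Assumed policy: keep workers that report throughput for the model and
    have enough VRAM for its weights, then prefer the highest throughput,
    breaking ties by the lowest combined queue and first-token delay.
    """
    candidates = [
        w for w in workers
        if model in w.tokens_per_second and w.vram_gb >= weights_gb
    ]
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda w: (
            w.tokens_per_second[model],
            -(w.queue_delay_ms + w.first_token_ms),
        ),
    )


# Placeholder values for illustration only.
workers = [
    WorkerCapabilities("GPU A", 24.0, "CUDA", {"model-x": 40.0}, 120.0, 300.0),
    WorkerCapabilities("GPU B", 8.0, "Metal", {"model-x": 55.0}, 50.0, 200.0),
]
best = pick_route(workers, "model-x", weights_gb=14.0)
```

With these placeholder values, GPU B reports higher throughput but lacks the VRAM for 14 GB of weights, so `best` is GPU A. Filtering on fit before ranking on speed keeps the router from choosing a fast worker that cannot load the model.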