Scheduler path
The scheduler selects an eligible idle worker. Busy workers remain visible in the registry but are not selected for the active request.
Client → router → worker
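The selection rule above can be sketched in a few lines. This is a minimal illustration, not the actual scheduler: the `Worker` and `WorkerState` types, the `select_worker` function, and the assumption that "eligible" means "can serve the requested model" are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    IDLE = "idle"
    BUSY = "busy"


@dataclass
class Worker:
    worker_id: str
    state: WorkerState
    models: frozenset[str]  # assumption: the set of models this worker can serve


def select_worker(registry: list[Worker], model: str) -> Optional[Worker]:
    """Return the first idle worker that can serve `model`, or None if there is none."""
    for worker in registry:
        # Busy workers are skipped, not removed: the registry is never mutated here.
        if worker.state is WorkerState.IDLE and model in worker.models:
            return worker
    return None


# Made-up registry for illustration only.
registry = [
    Worker("w1", WorkerState.BUSY, frozenset({"model-a"})),
    Worker("w2", WorkerState.IDLE, frozenset({"model-b"})),
    Worker("w3", WorkerState.IDLE, frozenset({"model-a"})),
]
chosen = select_worker(registry, "model-a")
```

Here `w1` can serve the model but is busy, so it stays in the registry and is passed over. `w2` is idle but not eligible under the assumed rule. The sketch returns the first match; a real scheduler would likely rank candidates rather than take the first one.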
GPU execution model
A worker exposes enough capability data for routing without requiring the user to understand GPU internals. The fields below are useful for benchmarks, route selection, and provider documentation.
Memory hierarchy
VRAM
Determines which model weights can fit on the device.
Runtime
The compute backend the worker runs on: CUDA, Metal, Vulkan, WebGPU, or managed bootstrap supply.
Throughput
Measured tokens per second for each supported model.
Latency
Observed queue delay and first token response time.
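One way to picture the four fields is as a single capability record per worker. The sketch below is an assumption about shape only: the class name `WorkerCapabilities`, its field names and units, and the `can_fit` helper are hypothetical, and the sample values are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerCapabilities:
    vram_gb: float       # VRAM: bounds which model weights can fit on the device
    runtime: str         # Runtime: e.g. "CUDA", "Metal", "Vulkan", "WebGPU"
    # Throughput: measured tokens per second, keyed by supported model name
    tokens_per_second: dict[str, float] = field(default_factory=dict)
    queue_delay_ms: float = 0.0   # Latency: observed queue delay
    first_token_ms: float = 0.0   # Latency: observed first token response time

    def can_fit(self, weights_gb: float) -> bool:
        """Simplified check: compares weight size to total VRAM only.

        Real deployments also need headroom for activations and caches,
        which this sketch deliberately ignores.
        """
        return weights_gb <= self.vram_gb


# Illustrative values, not measurements.
caps = WorkerCapabilities(
    vram_gb=24.0,
    runtime="CUDA",
    tokens_per_second={"model-a": 85.0},
    queue_delay_ms=12.0,
    first_token_ms=180.0,
)
fits_small = caps.can_fit(16.0)
fits_large = caps.can_fit(40.0)
```

Keeping throughput keyed by model reflects that it is measured per supported model, while VRAM and runtime describe the device itself.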


