One request, one stream.
Send a model id and a prompt. The response streams back as server-sent events: one routing event, one token event per generated chunk, a usage event, then done.
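As a rough sketch of how a client might fold that event sequence into a result, the Python below parses one JSON event per line and accumulates the tokens. The function names `parse_events` and `collect` are hypothetical, and the handling of a `data:` prefix is an assumption about the wire format; the event shapes are taken from the sample on this page.

```python
import json
from typing import Iterable, Iterator


def parse_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per JSON event line, ignoring anything else."""
    for raw in lines:
        line = raw.strip()
        # Assumption: events may arrive bare or behind an SSE "data:" prefix.
        if line.startswith("data:"):
            line = line[len("data:"):].strip()
        # Skip blank separators and SSE comment lines such as ": ping".
        if line.startswith("{"):
            yield json.loads(line)


def collect(lines: Iterable[str]) -> dict:
    """Fold a stream into generated text plus routing and usage metadata."""
    result = {"text": "", "routing": None, "usage": None, "done": False}
    for event in parse_events(lines):
        kind = event.get("type")
        if kind == "token":
            result["text"] += event["token"]
        elif kind in ("routing", "usage"):
            result[kind] = event
        elif kind == "done":
            result["done"] = True
            break
    return result


# Event lines copied from the sample response on this page.
sample = [
    '{ "type": "routing", "workerId": "wkr_a1b2", "tps": 42 }',
    '{ "type": "token", "token": "Kunagi" }',
    '{ "type": "token", "token": " routes" }',
    '{ "type": "usage", "promptTokens": 12, "completionTokens": 80, "credits": 0.0046 }',
    '{ "type": "done" }',
]
print(collect(sample)["text"])  # prints "Kunagi routes"
```

Keeping parsing separate from accumulation means the same `collect` works on any iterable of lines, whether it comes from a live HTTP response or a saved log.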
curl -N -X POST https://kunagi-orchestrator.fly.dev/v1/inference \
-H 'content-type: application/json' \
-d '{"model":"kunagi-open-8b","prompt":"hello","stream":true}'

{ "type": "routing", "workerId": "wkr_a1b2", "tps": 42 }
{ "type": "token", "token": "Kunagi" }
{ "type": "token", "token": " routes" }
{ "type": "usage", "promptTokens": 12, "completionTokens": 80, "credits": 0.0046 }
{ "type": "done" }

Models
Every worker advertises which of these it can serve. A model is only routable while at least one online worker has it loaded.
General-purpose open model. Fast, balanced, the default.
Crisp instruction following with low latency.
Reasoning-tuned model for harder, multi-step prompts.
Tiny reasoning model that runs almost anywhere.
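The routability rule stated above (a model is routable only while at least one online worker has it loaded) can be sketched as a small set computation. This is an illustration, not the orchestrator's actual logic: the `Worker` type, its fields, the second worker id, and the model name `example-reasoning-model` are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    # Hypothetical shape of what a worker advertises to the orchestrator.
    worker_id: str
    online: bool
    loaded_models: set[str] = field(default_factory=set)


def routable_models(workers: list[Worker]) -> set[str]:
    """Return every model that at least one online worker has loaded."""
    routable: set[str] = set()
    for worker in workers:
        if worker.online:
            routable |= worker.loaded_models
    return routable


workers = [
    Worker("wkr_a1b2", online=True, loaded_models={"kunagi-open-8b"}),
    # Offline worker: its model does not count as routable.
    Worker("wkr_c3d4", online=False, loaded_models={"example-reasoning-model"}),
]
print(routable_models(workers))  # {'kunagi-open-8b'}
```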
How a GPU worker spends a request.
A simplified view of the memory hierarchy a worker walks on every forward pass: weights and cache sit in global memory, are staged through L2 and shared memory, and are operated on in registers.
Registers
Closest to compute. Fastest, smallest, scoped to a single thread.
Shared memory and L2
Shared memory is shared across a block of threads; the L2 cache is shared across the whole GPU. Together they form a staging area between registers and global memory.
HBM (global memory)
Largest and slowest tier. Model weights and the KV cache live here.
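To make the tier walk concrete, here is a loose Python analogy of a tiled matrix-vector product, the core operation of a forward pass. Plain lists stand in for each tier, so this only mimics the order of data movement and says nothing about real GPU performance. `TILE` and `matvec_tiled` are hypothetical names chosen for the sketch.

```python
TILE = 4  # hypothetical tile width; real kernels tune this per GPU


def matvec_tiled(weights: list[list[float]], x: list[float]) -> list[float]:
    """Compute weights @ x one tile at a time, mimicking the tier walk."""
    out = []
    for row in weights:  # the full weight matrix plays the role of global memory
        acc = 0.0  # per-row accumulator, standing in for a register
        for start in range(0, len(x), TILE):
            # Stage one slice of weights and input, as shared memory would.
            w_tile = row[start:start + TILE]
            x_tile = x[start:start + TILE]
            for w, v in zip(w_tile, x_tile):
                acc += w * v  # multiply-accumulate in the "register"
        out.append(acc)
    return out


weights = [[1.0, 2.0, 3.0, 4.0, 5.0], [0.5, 0.5, 0.5, 0.5, 0.5]]
x = [1.0, 1.0, 1.0, 1.0, 2.0]
print(matvec_tiled(weights, x))  # [20.0, 3.0]
```

The point of the tiling is that each staged slice is reused from the fast tier while it is resident, so the slow tier is read in a few large chunks rather than one element at a time.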


