One request, one stream.

Send a model id and a prompt. The response streams back as server-sent events: one routing event, a token event for each generated chunk, a usage event, and a final done event.

Request

curl -N -X POST https://kunagi-orchestrator.fly.dev/v1/inference \
  -H 'content-type: application/json' \
  -d '{"model":"kunagi-open-8b","prompt":"hello","stream":true}'

Response (text/event-stream)

{ "type": "routing", "workerId": "wkr_a1b2", "tps": 42 }
{ "type": "token", "token": "Kunagi" }
{ "type": "token", "token": " routes" }
{ "type": "usage", "promptTokens": 12, "completionTokens": 80, "credits": 0.0046 }
{ "type": "done" }
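
A client can fold this stream into a single result by switching on each event's type. The following is a minimal sketch, not an official client: the function names and the shape of the returned dict are hypothetical, and it assumes each event arrives as one JSON object per line, optionally behind the `data:` prefix that server-sent events normally carry.

```python
import json
from typing import Iterable, Optional


def parse_event(line: str) -> Optional[dict]:
    """Parse one stream line into a dict; return None for lines with no JSON payload."""
    line = line.strip()
    if line.startswith("data:"):
        # Assumption: on the wire, events may carry the usual SSE "data:" prefix.
        line = line[len("data:"):].strip()
    if not line.startswith("{"):
        # Blank lines, SSE comments, and other SSE fields are skipped.
        return None
    return json.loads(line)


def collect(lines: Iterable[str]) -> dict:
    """Fold routing, token, usage, and done events into one summary dict."""
    result = {"worker_id": None, "text": "", "usage": None, "done": False}
    for line in lines:
        event = parse_event(line)
        if event is None:
            continue
        kind = event.get("type")
        if kind == "routing":
            result["worker_id"] = event["workerId"]
        elif kind == "token":
            result["text"] += event["token"]
        elif kind == "usage":
            result["usage"] = event
        elif kind == "done":
            result["done"] = True
            break
    return result


# The sample events from the response above, one per line.
sample = [
    '{ "type": "routing", "workerId": "wkr_a1b2", "tps": 42 }',
    '{ "type": "token", "token": "Kunagi" }',
    '{ "type": "token", "token": " routes" }',
    '{ "type": "usage", "promptTokens": 12, "completionTokens": 80, "credits": 0.0046 }',
    '{ "type": "done" }',
]
summary = collect(sample)
print(summary["text"])
```

In a real client the lines would come from the HTTP response body rather than a list; the folding logic stays the same.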

Models

Every worker advertises which of these it can serve. A model is only routable while at least one online worker has it loaded.
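
The routability rule can be pictured as a set union over online workers. This is only an illustrative sketch: the `Worker` record, its fields, and the second worker id are hypothetical, and the orchestrator's real bookkeeping is not described here.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set


@dataclass
class Worker:
    # Hypothetical record of what a worker advertises.
    worker_id: str
    online: bool
    loaded_models: Set[str] = field(default_factory=set)


def routable_models(workers: Iterable[Worker]) -> Set[str]:
    """Return the models loaded on at least one online worker."""
    routable: Set[str] = set()
    for worker in workers:
        if worker.online:
            routable |= worker.loaded_models
    return routable


workers = [
    Worker("wkr_a1b2", online=True, loaded_models={"kunagi-open-8b"}),
    # Offline worker: its models do not count toward routability.
    Worker("wkr_c3d4", online=False, loaded_models={"kunagi-open-8b", "kunagi-reason-r1"}),
]
print(routable_models(workers))
```

Here only `kunagi-open-8b` is routable, because the one worker holding `kunagi-reason-r1` is offline.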

kunagi-open-8b · llama

General-purpose open model. Fast, balanced, the default.

8,192 ctx · 50 credits / M tok

kunagi-mistral · mistral

Crisp instruction following with low latency.

8,192 ctx · 50 credits / M tok

kunagi-reason-r1 · deepseek

Reasoning-tuned model for harder, multi-step prompts.

8,192 ctx · 120 credits / M tok

kunagi-reason-mini · deepseek

Tiny reasoning model that runs almost anywhere.

8,192 ctx · 20 credits / M tok
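
The per-million-token rates translate into a simple cost estimate. The sketch below assumes prompt and completion tokens are billed at the same rate, which is consistent with the sample usage event (12 + 80 tokens at 50 credits per million gives 0.0046 credits) but is not stated outright. The function name is hypothetical, and the model ids other than `kunagi-open-8b` are read from the list headings above.

```python
# Rates in credits per million tokens, taken from the model list above.
CREDITS_PER_MILLION_TOKENS = {
    "kunagi-open-8b": 50,
    "kunagi-mistral": 50,
    "kunagi-reason-r1": 120,
    "kunagi-reason-mini": 20,
}


def estimate_credits(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate credits for one request, assuming all tokens bill at one rate."""
    rate = CREDITS_PER_MILLION_TOKENS[model]
    return (prompt_tokens + completion_tokens) * rate / 1_000_000


# Token counts from the sample usage event.
print(estimate_credits("kunagi-open-8b", 12, 80))  # 0.0046
```

The same 92 tokens on the reasoning-tuned model would cost 92 × 120 / 1,000,000 = 0.01104 credits under this assumption.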

How a GPU worker spends a request.

A simplified view of the memory hierarchy a worker walks on every forward pass: weights and the KV cache sit in global memory, are staged through L2 and shared memory, and are operated on in registers.

Registers

Closest to compute. Fastest, smallest, scoped to a single thread.

Shared memory and L2

Shared memory is visible to one block of threads; L2 is shared by the whole device. Together they stage data between registers and global memory.

HBM (global memory)

Largest and slowest tier. Model weights and the KV cache live here.
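
Because the weights live in HBM, a single-sequence decode step has to read them from global memory once per generated token, so HBM bandwidth puts a rough ceiling on tokens per second. The sketch below is back-of-envelope arithmetic only. Every number in it is an assumption chosen for illustration (8 billion parameters, 2 bytes per parameter, 1 TB/s of bandwidth), not a measurement of any Kunagi worker, and it ignores KV-cache traffic, batching, and compute time.

```python
def decode_ceiling_tps(params: int, bytes_per_param: int, hbm_bytes_per_s: int) -> float:
    """Upper bound on single-sequence decode speed if weight reads were the only cost."""
    weight_bytes = params * bytes_per_param
    return hbm_bytes_per_s / weight_bytes


# Hypothetical figures: 8e9 parameters in 16-bit precision, 1e12 bytes/s of HBM bandwidth.
ceiling = decode_ceiling_tps(8_000_000_000, 2, 1_000_000_000_000)
print(ceiling)  # 62.5
```

Real throughput sits below this ceiling; the point is only that the slowest tier, not the registers, tends to set the pace.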