# kgpu — GPU rentals for researchers

`https://api.kgpu.net/v1` — rent a GPU, run your code, pull your results.

kgpu is a **compute provider**: you rent a GPU-backed container by the
minute, get a shell (or one-shot exec) into it, and move files in and out.
We don't host your data — the container's disk is **ephemeral** and is
destroyed when the rental ends, so copy results out before you release.

Two ways to drive it:

- **HTTPS REST** under `/v1` — rent / inspect / release pods and run
  one-shot commands. Plain `curl` + `jq` (or any HTTP client).
- **SSH bastion** at `ssh.kgpu.net:2222` — an interactive shell into your
  pod plus native `scp` / `rsync` / `sshfs` for file transfer.

---

## Quick start

```bash
# 1. Sign in at https://kgpu.net and copy your token from the dashboard.
export KGPU_API_TOKEN=kgpu_xxxxxxxxxxxxx
export KGPU_API_BASE=https://api.kgpu.net/v1
T=$KGPU_API_TOKEN; B=$KGPU_API_BASE

# 2. See what's available (no auth needed) and rent one.
curl -sS $B/gpu-models | jq
RENT=$(curl -sS -X POST $B/pods -H "Authorization: Bearer $T" \
  -H "Content-Type: application/json" \
  -d "$(jq -n --arg p "$(cat ~/.ssh/id_ed25519.pub)" \
        '{gpu_model:"rtx4090", ssh_pubkey:$p}')" | jq -r .pod_id)

# 3. Wait until it's ready (cold start is mostly image load, ~10-60s).
until [ "$(curl -sS $B/pods/$RENT -H "Authorization: Bearer $T" | jq -r .ready)" = "true" ]; do
  sleep 10
done

# 4. Use it — shell in over the bastion (your ~/.ssh/id_ed25519 is pinned).
ssh -p 2222 kgpu@ssh.kgpu.net nvidia-smi
scp -O -P 2222 mydata.zip kgpu@ssh.kgpu.net:/workspace/   # push data
ssh -p 2222 kgpu@ssh.kgpu.net 'setsid -f python /workspace/train.py < /dev/null > /tmp/train.log 2>&1'
scp -O -P 2222 kgpu@ssh.kgpu.net:/workspace/model.pt ./   # pull results

# 5. Release (the pod and its disk are destroyed).
curl -sS -X DELETE $B/pods/$RENT -H "Authorization: Bearer $T"
```

Every authenticated request carries `Authorization: Bearer $KGPU_API_TOKEN`.

**New accounts start with trial credits.** Maintainers can top up. Pricing
is per-minute; see [Pricing](#pricing).

---

## Tokens

- **Account token** — copy it from the dashboard at <https://kgpu.net>.
  It's your credential for everything. One active token per account: the
  dashboard's **Reissue** button revokes the old one and mints a fresh one
  (shown once — copy it then). We store only a salted hash, so a lost token
  can't be recovered, only reissued.
- **In-pod token** — inside a rented container, `$KGPU_API_TOKEN` and
  `$KGPU_API_BASE` are auto-injected, but the in-pod token is a
  **per-rental, scope-limited** bearer (`kgpu_pod_<id>.<sig>`), *not* your
  account token. It can read `/me`, read/release its own pod, and use
  `/objects` for the owning account. It **cannot** rent more pods, touch
  other pods, rotate its SSH key, or exec into itself. It dies when the
  rental ends. Use your account token for everything else.
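
Since the two token kinds differ by prefix, a script can refuse early when it is handed the wrong one. The sketch below assumes the formats shown in this README (`kgpu_pod_<id>.<sig>` for in-pod tokens, a plain `kgpu_` prefix for account tokens). `kgpu_token_kind` is a hypothetical helper, not part of any kgpu tooling.

```bash
# Classify a token by its prefix. Formats are taken from this README;
# the function name is illustrative only.
kgpu_token_kind() {
  case $1 in
    kgpu_pod_*.*) echo pod ;;       # per-rental, scope-limited bearer
    kgpu_*)       echo account ;;   # dashboard-issued account token
    *)            echo unknown ;;
  esac
}

# Example guard before an account-only call such as POST /pods:
# [ "$(kgpu_token_kind "$KGPU_API_TOKEN")" = account ] || { echo "need account token" >&2; exit 1; }
```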

---

## API reference

| Resource | Method · Path | What |
|---|---|---|
| **gpu-models** | `GET /gpu-models` | **public** — GPU classes available now, with `free`/`total`/price. No auth. |
| **me** | `GET /me` | credits, quota, role |
| **balance** | `GET /balance` | just the credit balance (pre-flight before renting) |
| **cluster** | `GET /cluster` | per-node availability (auth'd) |
| **pods** | `GET /pods` | your pods, newest first |
| | `POST /pods` | rent (body below; `ssh_pubkey` optional) |
| | `GET /pods/{id}` | pod state — phase, ready, idle clock, end_reason |
| | `PATCH /pods/{id}` | rotate the rental's SSH key |
| | `DELETE /pods/{id}` | release |
| | `POST /pods/{id}/exec` | one-shot command (synchronous) |
| **objects** | `POST /objects` | mint a 24 h transit object (presigned PUT) |
| | `GET /objects` · `GET /objects/{id}` | list / fetch (presigned GET) |
| | `DELETE /objects/{id}` | delete (idempotent) |

All paths are relative to `$KGPU_API_BASE`. The SSH bastion at
`ssh.kgpu.net:2222` is a separate endpoint, not part of the REST API.

**Machine-readable spec:** the full OpenAPI 3.1 schema is served at
[`/openapi.json`](/openapi.json) — point your client/codegen at it. (A
human-browsable version is at [`/docs`](/docs).)

`GET /me` returns `{user_id, email, name, role, status, credits, quota}`.

---

## GPU models & pricing

`GET /v1/gpu-models` is **public** (no Bearer) so you can check the menu
before signing in. It aggregates by `gpu_model` and is the source of truth
for what's rentable *right now*:

```bash
$ curl -sS https://api.kgpu.net/v1/gpu-models | jq
[
  {"gpu_model":"rtx3080","total":1,"free":1,"price_per_hour_credits":300},
  {"gpu_model":"rtx3090","total":2,"free":2,"price_per_hour_credits":600},
  {"gpu_model":"rtx4090","total":1,"free":0,"price_per_hour_credits":1000},
  {"gpu_model":"rtx5090","total":1,"free":1,"price_per_hour_credits":1400},
  {"gpu_model":"v100",   "total":4,"free":4,"price_per_hour_credits":600}
]
```

Use the returned `gpu_model` strings as the `gpu_model` value on
`POST /pods`. For per-node detail (hostname, busy count) use the auth'd
`GET /cluster`.
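
To turn the menu into a choice, a script might filter for classes with a free slot and take the cheapest. This is a minimal sketch that relies only on the `free` and `price_per_hour_credits` fields shown above; `cheapest_free_model` is a hypothetical helper name.

```bash
# Print the cheapest gpu_model that currently has a free slot.
# Reads the GET /gpu-models JSON array on stdin; prints nothing if all are busy.
cheapest_free_model() {
  jq -r '[.[] | select(.free > 0)]
         | min_by(.price_per_hour_credits)
         | .gpu_model // empty'
}

# Usage (network call, shown for illustration):
# MODEL=$(curl -sS "$KGPU_API_BASE/gpu-models" | cheapest_free_model)
```

With the sample response above this would select `rtx3080`. Availability can change between the lookup and the rent call, so the `POST /pods` result still needs checking.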

### Pricing

`$1 = 1500 credits`. Credits are deducted per minute (an `rtx4090` at 1000
credits/hr costs about 16.7 credits per minute), and a balance of zero
auto-releases the pod. N-GPU rentals bill at N × the per-GPU rate.

| `gpu_model` | hardware | credits/hr | ≈ $/hr |
|---|---|---|---|
| `rtx3080` | RTX 3080 | 300 | 0.20 |
| `rtx3090` | RTX 3090 | 600 | 0.40 |
| `v100` | Tesla V100 32 GB | 600 | 0.40 |
| `rtx4090` | RTX 4090 | 1000 | 0.67 |
| `rtx5090` | RTX 5090 | 1400 | 0.93 |

More classes may come online — `GET /gpu-models` always reflects the live
fleet and prices.
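
The pricing rules reduce to simple arithmetic, which is handy for a budget check before a long run. The sketch below assumes the bill is rounded up to a whole credit and that a `gpu_count` of 0 bills as one GPU; neither detail is confirmed here, and both function names are hypothetical.

```bash
# Credits for a rental: rate (credits/hr per GPU) x GPU count x minutes / 60.
# Rounds up to a whole credit; the server's actual rounding is an assumption.
estimate_credits() {
  local rate="$1" gpus="$2" minutes="$3"
  # Assumption: a CPU-only pod (gpu_count 0) bills like a single GPU.
  if [ "$gpus" -eq 0 ]; then gpus=1; fi
  echo $(( (rate * gpus * minutes + 59) / 60 ))
}

# Convert credits to dollars at $1 = 1500 credits, two decimals.
credits_to_usd() {
  awk -v c="$1" 'BEGIN { printf "%.2f\n", c / 1500 }'
}

estimate_credits 1000 2 90    # two rtx4090s for 90 minutes -> 3000
credits_to_usd 3000           # -> 2.00
```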

---

## Renting a pod — `POST /pods`

All body fields are optional; defaults shown.

| Field | Default | Notes |
|---|---|---|
| `name` | `"rent"` | human label (DNS-safe, ≤ 30 chars) |
| `image` | `kgpu-pytorch:latest` | the default image is preloaded on every worker (no pull wait). Use a `ghcr.io/...` / `docker.io/...` ref for anything else. |
| `gpu_model` | `"rtx4090"` | live menu via `GET /gpu-models` |
| `gpu_count` | `1` | 0–4 (0 = CPU-only pod, still bills at the class rate) |
| `cpu` | `"2"` | CPU cores, ≤ 16 |
| `memory` | `"16Gi"` | RAM, ≤ 128 GiB |
| `ephemeral_storage_gib` | `100` | scratch disk, 1–2000 GiB |
| `env` | `null` | extra pod env vars |
| `ssh_pubkey` | `null` | your OpenSSH public key — see below |

Returns:

```jsonc
{
  "pod_id": "gpu-rent-1779745883-ab12cd",
  "gpu_model": "rtx4090",
  "price_per_hour_credits": 1000,
  "balance_after_precheck": 99000
}
```
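
A request outside the documented ranges should come back as a validation error, so a local pre-flight check can save a round trip. The sketch below encodes the limits from the field table. It reads "DNS-safe" as lowercase letters, digits and inner hyphens, and it assumes whole-number `cpu` values and a `Gi` suffix on `memory`. Those readings and the `validate_rent` name are assumptions.

```bash
# Pre-flight checks for POST /pods fields, using the limits in the field table.
# Prints the first problem and returns 1, or returns 0 if all values are in range.
validate_rent() {
  local name="$1" gpu_count="$2" cpu="$3" memory="$4" disk_gib="$5"
  case $name in
    ''|*[!a-z0-9-]*|-*|*-)
      echo "name must be DNS-safe and <= 30 chars"; return 1 ;;
  esac
  if [ "${#name}" -gt 30 ]; then
    echo "name must be DNS-safe and <= 30 chars"; return 1
  fi
  if [ "$gpu_count" -lt 0 ] || [ "$gpu_count" -gt 4 ]; then
    echo "gpu_count must be 0-4"; return 1
  fi
  if [ "$cpu" -gt 16 ]; then
    echo "cpu must be <= 16 cores"; return 1
  fi
  if [ "${memory%Gi}" -gt 128 ]; then
    echo "memory must be <= 128Gi"; return 1
  fi
  if [ "$disk_gib" -lt 1 ] || [ "$disk_gib" -gt 2000 ]; then
    echo "ephemeral_storage_gib must be 1-2000"; return 1
  fi
}

# validate_rent ecg-run 1 2 16Gi 100 && echo "ok to rent"
```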

### SSH access — bring your own key, or let us mint one

| `ssh_pubkey` on `POST` / `PATCH` | You get back | When |
|---|---|---|
| ✓ your one-line OpenSSH pubkey | `ssh_access` with a `fingerprint` only — **the private half never crosses the wire** | You have a key (`~/.ssh/id_*`). Recommended. |
| ✗ omitted | `ssh_access` with a fresh ephemeral `private_key` — **returned once** | No local key, fresh laptop, or per-pod isolation in CI. |

```jsonc
// server-mint response (no ssh_pubkey given)
{ "ssh_access": {
    "private_key": "-----BEGIN OPENSSH PRIVATE KEY-----\n...\n",
    "fingerprint": "SHA256:abc123...",
    "ssh_host": "ssh.kgpu.net", "ssh_port": 2222, "ssh_user": "kgpu" } }
```

The minted `private_key` is shown **once** — we never persist it. Lost it?
`PATCH /v1/pods/{id}` rotates: send a new `ssh_pubkey` to pin, or an empty
body to mint a fresh keypair. The old key dies the instant the new
fingerprint lands. (`PATCH` is account-token only — a compromised pod
can't lock you out by swapping its key.)
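
Because a minted key is returned only once, it is worth writing it straight to a file with tight permissions. The following is a minimal sketch: `save_private_key` and the `~/.ssh/kgpu_pod` path are hypothetical, and the commented usage assumes the rent response is held in `$RESP`.

```bash
# Write a private key from stdin to a file that only the owner can read.
# umask 077 covers a newly created file; chmod covers a pre-existing one.
save_private_key() {
  local dest="$1"
  ( umask 077 && cat > "$dest" ) && chmod 600 "$dest"
}

# Usage with the rent response saved in $RESP (jq extracts the key text):
# echo "$RESP" | jq -r .ssh_access.private_key | save_private_key ~/.ssh/kgpu_pod
# ssh -i ~/.ssh/kgpu_pod -o IdentitiesOnly=yes -p 2222 kgpu@ssh.kgpu.net
```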

---

## Getting into the pod

### One-shot command — `POST /pods/{id}/exec`

Synchronous, no setup, no SSH key. Ideal for CI, dashboards, quick checks.

```bash
curl -sS -X POST $B/pods/$RENT/exec -H "Authorization: Bearer $T" \
     -H "Content-Type: application/json" \
     -d '{"cmd":["nvidia-smi"]}' | jq
# { "exit_code": 0, "stdout": "...", "stderr": "", "duration_ms": 1240, "truncated": false }
```

| Field | Default | Notes |
|---|---|---|
| `cmd` | required | argv list, e.g. `["bash","-c","ls /workspace"]` |
| `stdin` | `null` | piped to the command, ≤ 1 MiB |
| `timeout_seconds` | `60` | 1–300; killed on timeout, partial output returned |

Caps: combined stdout+stderr ≤ **1 MiB** (excess truncated, `truncated:true`);
timeout ≤ **300 s**; no PTY. Bigger output or longer jobs → use SSH.
Returns `409 pod_not_ready` if the pod isn't ready yet.
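
Hand-writing the `cmd` array gets error-prone once arguments contain quotes or spaces. One option, sketched below, is to let `jq` build the body from ordinary shell arguments. `exec_body` is a hypothetical helper, and it assumes no argument contains a newline because the arguments are passed to `jq` one per line.

```bash
# Build the JSON body for POST /pods/{id}/exec from a timeout and an argv list.
# jq does the quoting, so spaces and double quotes in arguments survive intact.
exec_body() {
  local timeout="$1"
  shift
  printf '%s\n' "$@" |
    jq -Rcn --argjson t "$timeout" '{cmd: [inputs], timeout_seconds: $t}'
}

# exec_body 120 bash -c 'ls "/workspace"'
# -> {"cmd":["bash","-c","ls \"/workspace\""],"timeout_seconds":120}
#
# curl -sS -X POST "$KGPU_API_BASE/pods/$RENT/exec" \
#   -H "Authorization: Bearer $KGPU_API_TOKEN" \
#   -H "Content-Type: application/json" \
#   -d "$(exec_body 120 bash -c 'ls "/workspace"')" | jq
```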

### Interactive shell + file transfer — SSH bastion

`ssh.kgpu.net:2222`, user always `kgpu`, authenticated by the key pinned to
your rental. The bastion looks up your key's fingerprint and routes you to
**exactly that one pod** — you can't land in someone else's.

```bash
ssh -p 2222 kgpu@ssh.kgpu.net              # interactive shell
ssh -p 2222 kgpu@ssh.kgpu.net nvidia-smi   # one-shot
```

A `~/.ssh/config` block per pod is handy:

```
Host kgpu-myrun
    HostName ssh.kgpu.net
    Port 2222
    User kgpu
    IdentityFile ~/.ssh/id_ed25519
    IdentitiesOnly yes
```

Locked down by design: **no port forwarding** (`-L`/`-R`/`-D`), no agent
or X11 forwarding, no tunnels. Don't enable `ControlMaster` — the session
wrapper is single-channel and multiplexing breaks it. The key dies the
moment the rental ends (`Permission denied (publickey)` thereafter).
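
If your global SSH settings enable multiplexing, a per-pod block can switch it off explicitly. The sketch below prints such a block. The `kgpu_ssh_block` name, the host alias and the key path are illustrative, while `ControlMaster` and `ControlPath` are standard OpenSSH client options.

```bash
# Print a ~/.ssh/config block for one pod. ControlMaster/ControlPath are
# pinned off so a global multiplexing setting does not apply to the bastion.
kgpu_ssh_block() {
  local host_alias="$1" keyfile="$2"
  cat <<EOF
Host $host_alias
    HostName ssh.kgpu.net
    Port 2222
    User kgpu
    IdentityFile $keyfile
    IdentitiesOnly yes
    ControlMaster no
    ControlPath none
EOF
}

# kgpu_ssh_block kgpu-myrun ~/.ssh/id_ed25519
```

OpenSSH uses the first value it finds for each option, so the printed block belongs above any `Host *` section in `~/.ssh/config`.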

---

## Moving files in and out

The pod disk is ephemeral — **pull your artifacts before releasing.**

### scp / rsync (over the bastion)

```bash
scp -O -P 2222 mit-bih.zip kgpu@ssh.kgpu.net:/tmp/             # push
rsync -av -e "ssh -p 2222" kgpu@ssh.kgpu.net:/workspace/out/ ./out/   # pull
```

`scp -O` (capital O) forces the legacy SCP protocol the bastion speaks —
without it OpenSSH 9.x tries SFTP and fails with `subsystem request
failed`. Transfers run over a direct byte stream, so they're reliable for
large files.

### Live mount — sshfs

Edit pod files with local tools (your editor, an LLM CLI, linters):

```bash
sshfs kgpu@ssh.kgpu.net:/workspace ./mnt -p 2222 \
  -o IdentityFile=~/.ssh/id_ed25519 \
  -o sftp_server=/usr/lib/openssh/sftp-server \
  -o reconnect,ServerAliveInterval=15
# ...edit ./mnt — changes hit the pod immediately...
fusermount -u ./mnt    # Linux   (umount on macOS)
```

The `-o sftp_server=...` flag is **required**. On Windows install
**WinFsp + SSHFS-Win** once (`winget install WinFsp.WinFsp
SSHFS-Win.SSHFS-Win`) and use a Windows-native `IdentityFile` path
(`C:/Users/<you>/.ssh/...`). The default image ships
`/usr/lib/openssh/sftp-server`; custom images need
`apt install openssh-sftp-server`.

### Large transfers & multi-pod fanout — transit objects (`/v1/objects`)

For anything over a few hundred MB, or the same dataset to many pods, push
once to Cloudflare R2 and pull directly in each pod — bytes go client → R2
→ pod over HTTPS, no bastion in the path.

```bash
# mint (returns a presigned PUT, valid 1 h) and upload once
RESP=$(curl -sS -X POST $B/objects -H "Authorization: Bearer $T" \
  -H "Content-Type: application/json" -d '{"label":"mit-bih"}')
OBJ=$(echo "$RESP" | jq -r .object_id)
curl -X PUT --upload-file mit-bih.zip "$(echo "$RESP" | jq -r .put_url)"

# fetch in each pod via a fresh GET url
GET=$(curl -sS -H "Authorization: Bearer $T" $B/objects/$OBJ | jq -r .get_url)
for POD in podA podB podC; do ssh kgpu-$POD "curl -L -o /tmp/data.zip '$GET'" & done; wait

curl -X DELETE -H "Authorization: Bearer $T" $B/objects/$OBJ   # tidy up
```

**This is scratch, not storage.** Objects auto-expire **24 h** after
creation (`expires_at` is absolute). Per-user quota 100 GB live; per-object
cap 50 GB. PHI / patient data must not outlive the active rental.

Traps:

- Use `--upload-file <path>` (not `-d @-`) so curl sends a `Content-Length`.
- Don't add custom headers to the presigned URL; they break the signature.
- The `get_url` is reusable by N pods within its 1 h window.
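
A quick size check before minting an object avoids starting an upload that cannot succeed. The sketch below assumes the 50 GB per-object cap means 50 × 10^9 bytes (the server may count differently). `fits_object_cap` is a hypothetical helper.

```bash
# Return 0 if a file fits the per-object cap, 1 otherwise.
# Assumes "50 GB" means 50 * 10^9 bytes; pass a second argument to override.
MAX_OBJECT_BYTES=50000000000

fits_object_cap() {
  local size
  [ -f "$1" ] || return 1
  size=$(wc -c < "$1")
  size=$((size))                      # strip padding some wc builds add
  [ "$size" -le "${2:-$MAX_OBJECT_BYTES}" ]
}

# fits_object_cap mit-bih.zip || echo "split the archive first" >&2
```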

---

## Inside the container

The default `kgpu-pytorch:latest` ships PyTorch (CUDA 12.9, nv25.05) with
prebuilt kernels for **sm_75 / 80 / 86 / 89 / 90 / 100 / 120** — every GPU
in the fleet, including Blackwell (RTX 5090 / B200). Also bundled: `wfdb`,
`vitaldb`, `scipy`, `scikit-learn`, `pandas`, `matplotlib`, `seaborn`,
`duckdb`, `pyarrow`, `tmux`, `uv`, `zstd`, `openssh-sftp-server`.

Auto-injected env:

| Var | Value |
|---|---|
| `KGPU_API_TOKEN` | per-rental scoped bearer (not your account token). Stripped from interactive SSH sessions; read it from `/proc/1/environ` if a script needs it. |
| `KGPU_API_BASE` | `https://api.kgpu.net/v1` |
| `KGPU_GPU_ID` / `KGPU_RENT_ID` | this rental's id |
| `NVIDIA_VISIBLE_DEVICES` | `void` — CDI handles GPU allocation; don't override |

Inside the container `nvidia-smi -L` and `/dev/nvidia*` expose only your
rental's GPU(s).
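
Since the token is stripped from interactive SSH sessions, a script started over SSH has to recover it from `/proc/1/environ`, which stores variables as NUL-separated `NAME=value` pairs. This is a minimal sketch, assuming your shell user may read that file. `pod_env` is a hypothetical helper.

```bash
# Print the value of one variable from a NUL-separated environ file.
# Defaults to /proc/1/environ (the container's init process).
pod_env() {
  local name="$1" file="${2:-/proc/1/environ}"
  tr '\0' '\n' < "$file" | sed -n "s/^${name}=//p" | head -n 1
}

# In an SSH session inside the pod:
# export KGPU_API_TOKEN=$(pod_env KGPU_API_TOKEN)
# curl -sS "$(pod_env KGPU_API_BASE)/me" -H "Authorization: Bearer $KGPU_API_TOKEN"
```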

### Detaching long-running jobs

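`nohup ... &` alone doesn't release the SSH session: OpenSSH keeps the
channel open while the job still holds the session's stdin, stdout or
stderr. Use `setsid -f` with all three redirected. It forks the job into a
new session, and the redirects detach it from the SSH channel, so the `ssh`
call returns in milliseconds: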

```bash
ssh kgpu-myrun 'setsid -f python /workspace/train.py < /dev/null > /tmp/train.log 2>&1'
ssh kgpu-myrun 'tail -f /tmp/train.log'   # check progress any time
```

`tmux new -d -s train 'python ...'` is the other safe option.
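
If you also want to check later whether the job is still alive, the `setsid -f` pattern can record a PID. The sketch below is meant to run inside the pod (for example from a script you copied in). `run_detached`, `job_alive` and the file paths are hypothetical names.

```bash
# Start a command in a new session, detached from the caller's terminal.
# The wrapper shell writes its PID, then exec replaces it with the command,
# so the recorded PID is the job's own.
run_detached() {
  local log="$1" pidfile="$2"
  shift 2
  setsid -f sh -c 'echo $$ > "$1"; shift; exec "$@"' sh "$pidfile" "$@" \
    < /dev/null > "$log" 2>&1
}

# Succeeds while the recorded PID still exists.
job_alive() {
  kill -0 "$(cat "$1")" 2>/dev/null
}

# run_detached /tmp/train.log /tmp/train.pid python /workspace/train.py
# job_alive /tmp/train.pid && echo "still training"
```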

---

## Limits

- **GPU cap**: 4 concurrent GPUs across all your rentals.
- **Per-rental**: ≤ 16 CPU cores, ≤ 128 GiB RAM, 1–2000 GiB ephemeral disk.
- **Idle**: 12 h of no SSH activity → `idle_warning`; 24 h → auto-release
  (`auto_idle`). **Override**: past the warn threshold we probe pod load,
  and any GPU > 5 % or process > 10 % CPU resets the clock, so a long
  unattended training run is *not* reaped.
- **Wall clock**: no hard ceiling. Rentals live until `DELETE`, idle-kill,
  or out-of-credits.
- **Disk eviction**: exceeding `ephemeral_storage_gib` terminates the pod
  (`evicted_by_kubelet:OutOfephemeral-storage`). Checkpoints, pip cache,
  dataset extracts and `torch.compile` cache all add up — bump the value or
  rotate checkpoints for multi-epoch runs.
- **Network isolation**: **no ingress** (pods accept nothing inbound).
  **Egress** to the public internet works (pip/git/HF) but is blocked into
  private ranges (RFC1918, link-local/metadata `169.254`, Tailscale CGNAT)
  and to other pods. No lateral scanning, no pod-to-pod.
- **Credits**: per-minute deduction; zero → auto-release
  (`out_of_credits`).
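
Rotating checkpoints is the cheapest defence against disk eviction. The sketch below keeps the newest N files by modification time and deletes the rest. It assumes checkpoints share one directory and suffix and that file names contain no newlines. `rotate_checkpoints` and the `/workspace/ckpt` path are illustrative.

```bash
# Keep only the newest N files with a given suffix in a directory.
rotate_checkpoints() {
  local dir="$1" keep="$2" suffix="${3:-.pt}"
  # ls -t lists newest first; everything after line N is deleted.
  ls -1t -- "$dir"/*"$suffix" 2>/dev/null | tail -n +"$((keep + 1))" |
    while IFS= read -r f; do rm -f -- "$f"; done
}

# Call after each epoch's save, e.g.:
# rotate_checkpoints /workspace/ckpt 3
```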

---

## Pod lifecycle

The `phase` field of `GET /v1/pods/{id}`:

| `phase` | meaning |
|---|---|
| `Pending` | scheduling / image load / container creating. Carries `pending_reason` (e.g. `starting`, `ImagePullBackOff: …`). `ready=false`. |
| `Running` | container up. `ready=true` once the readiness probe passes. |
| `Terminating` | released and tearing down; `ended_at` / `end_reason` already set. |
| `Ended` | pod object gone; the historical row remains so you can still read `end_reason`. |
| `Failed` | rare — kubelet failed it; see `end_reason`. |

**Polling**: poll `GET /pods/{id}` every 10–15 s, and only until
`ready=true`. Cold start is dominated by image load, so sub-5 s polling
buys nothing. For training liveness, use `ssh ... tail -f /tmp/train.log`
or `nvidia-smi` rather than hammering the API.

`GET /pods/{id}` works even after the pod is gone (falls back to the
historical row), so you can always read why it ended:

| `end_reason` | meaning |
|---|---|
| `returned` | you called `DELETE` |
| `auto_idle` | idle timeout |
| `out_of_credits` | balance hit zero |
| `evicted_by_kubelet:OutOfephemeral-storage` | disk exceeded — bump it or rotate checkpoints |
| `evicted_by_kubelet:OOMKilled` | memory limit hit |
| `start_failed:<reason>` | never escaped Pending (bad image tag, etc.); auto-released ~5 min after creation, quota refunded |
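
A readiness loop that only tests `ready` will spin forever if the pod fails to start. The sketch below also stops on a terminal phase and after a bounded number of polls. It relies on the `phase`, `ready` and `end_reason` fields described above. `fetch_pod_state` and `wait_ready` are hypothetical helpers, and the default of 30 polls is an arbitrary choice.

```bash
# Default fetcher: prints "<phase> <ready> <end_reason>" for a pod.
# Redefine this function to use another HTTP client.
fetch_pod_state() {
  curl -sS "$KGPU_API_BASE/pods/$1" -H "Authorization: Bearer $KGPU_API_TOKEN" |
    jq -r '"\(.phase) \(.ready) \(.end_reason // "-")"'
}

# Poll until ready=true. Returns 0 when ready, 1 if the pod ended,
# 2 if max_tries polls passed without either outcome.
wait_ready() {
  local id="$1" interval="${2:-10}" max_tries="${3:-30}"
  local phase ready reason i=0
  while [ "$i" -lt "$max_tries" ]; do
    read -r phase ready reason <<EOF
$(fetch_pod_state "$id")
EOF
    if [ "$ready" = true ]; then return 0; fi
    case $phase in
      Terminating|Ended|Failed) echo "pod ended: $reason" >&2; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$interval"
  done
  return 2
}

# wait_ready "$RENT" 10 30 && ssh -p 2222 kgpu@ssh.kgpu.net nvidia-smi
```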

---

## Windows / PowerShell

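In Windows PowerShell 5.1, `curl` is an alias for `Invoke-WebRequest`, which
takes different arguments and mangles JSON bodies. PowerShell 7+ drops the
alias, but **`Invoke-RestMethod`** is the safer choice in both: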

```powershell
$T = $env:KGPU_API_TOKEN
$B = "https://api.kgpu.net/v1"
$H = @{ Authorization = "Bearer $T" }

$rent = Invoke-RestMethod -Method Post -Uri "$B/pods" -Headers $H `
    -ContentType "application/json" `
    -Body (@{ gpu_model = "rtx4090"
              ssh_pubkey = (& ssh-keygen -y -f "$HOME\.ssh\id_ed25519") } | ConvertTo-Json)
$RENT = $rent.pod_id

do { Start-Sleep 10; $p = Invoke-RestMethod -Uri "$B/pods/$RENT" -Headers $H } until ($p.ready)

ssh -p 2222 kgpu@ssh.kgpu.net nvidia-smi       # uses ~/.ssh/id_ed25519 by default
Invoke-RestMethod -Method Delete -Uri "$B/pods/$RENT" -Headers $H
```

Or call the real `curl.exe` with a single-quoted here-string to copy the
bash examples verbatim (`@'...'@`, closing `'@` at column 0). `jq` works
the same once installed (`winget install jqlang.jq`).

---

## Status codes & errors

`200` ok · `400` bad request · `401` invalid/revoked token or account not
approved · `402` insufficient credits · `403` forbidden scope (in-pod token
tried to rent/exec/rotate) · `404` not found · `409` not ready / quota
exceeded / duplicate · `410` object expired · `422` validation failed ·
`500` server error.

Every response carries `X-Request-ID` (8-byte hex); 5xx bodies echo it as
`trace_id`. Quote it when filing a bug.
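
In scripts it helps to branch on the status code rather than parse error bodies. The sketch below groups the codes listed above into coarse actions. The grouping is a judgment call rather than documented kgpu behaviour, and `classify_status` is a hypothetical helper.

```bash
# Map an HTTP status from the kgpu API to a coarse action for scripts.
classify_status() {
  case $1 in
    2??)     echo ok ;;
    409|500) echo retry ;;    # often transient; 409 can also mean quota or duplicate
    401|403) echo auth ;;     # invalid, revoked, or under-scoped token
    402)     echo credits ;;  # top up before retrying
    *)       echo fatal ;;    # 400, 404, 410, 422: fix the request
  esac
}

# curl can report the status alongside the body:
# code=$(curl -sS -o /tmp/body.json -w '%{http_code}' "$KGPU_API_BASE/me" \
#          -H "Authorization: Bearer $KGPU_API_TOKEN")
# [ "$(classify_status "$code")" = ok ] || cat /tmp/body.json >&2
```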
