# kgpu-gateway — Agent guide

`https://api.kgpu.net` — Lab GPU dispatch for SNUH researchers.

## Access

**One-time setup** — sign in at with Google. Once approved, your dashboard at `/view` displays a personal Bearer token (format `kgpu_<48 hex>`). Copy the `export` line shown there and paste it into your shell rc file:

```bash
export KGPU_API_TOKEN=kgpu_xxxxxxxxxxxx...
```

Reopen Claude Code (or any terminal) — `$KGPU_API_TOKEN` is now visible. The token persists until you sign out — signing out revokes it; sign back in to get a new one.

**Every `/v1/*` request** carries:

```
Authorization: Bearer $KGPU_API_TOKEN
```

Inside a rented GPU container, `$KGPU_API_TOKEN` and `$KGPU_API_BASE` are auto-injected, so workloads can call `/v1/files` from within without embedding secrets.

**Approval status** — lab members whose email is in the admin allowlist are auto-approved on sign-in. Others land in `status=pending`: they can view cluster state but do not receive a token until an admin upgrades them.

## Credits

The lab GPU is a shared resource. Every rental burns **credits** per minute so that idle rentals are economically painful and capacity rotates.

- **$1 == 1500 credits.** New users start at **1,000,000 credits** (≈ $667).
- **Price** for the current GPU class (GB10, ≈ RTX 3090): **500 credits/hour** (≈ $0.333/hr — RunPod 3090 Community ballpark). Charged per minute, rounded up. A new user's 1,000,000 credits buys ~2,000 hours.
- **Precheck.** `POST /v1/gpus` returns `402 insufficient_credits` if your balance can't cover one hour at the current price.
- **Out-of-credits return.** A rental whose user balance hits zero is auto-returned (`end_reason: out_of_credits`).
- **Top-up.** Admin only — `POST /v1/admin/users/{id}/credits {"credits": N}`.

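
The arithmetic behind these bullets can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the rounding rule (every started minute billed whole, then fractional credits rounded up) is an assumption about what "charged per minute, rounded up" means, not the gateway's confirmed billing logic.

```python
CREDITS_PER_DOLLAR = 1500       # "$1 == 1500 credits"
PRICE_PER_HOUR_CREDITS = 500    # current GB10 price from the list above
STARTING_CREDITS = 1_000_000    # balance of a new user


def rental_cost_credits(elapsed_seconds, price_per_hour=PRICE_PER_HOUR_CREDITS):
    """Estimate the credits charged for a rental.

    ASSUMPTION: started minutes are billed whole and fractional credits
    are rounded up; the gateway's exact rounding may differ.
    """
    minutes = -(-int(elapsed_seconds) // 60)        # integer ceiling of seconds / 60
    return (minutes * price_per_hour + 59) // 60    # integer ceiling of minutes * price / 60


def credits_to_dollars(credits):
    """Convert credits to US dollars at the documented exchange rate."""
    return credits / CREDITS_PER_DOLLAR


def passes_precheck(balance, price_per_hour=PRICE_PER_HOUR_CREDITS):
    """Mirror the documented precheck: the balance must cover one hour."""
    return balance >= price_per_hour
```

Under these assumptions a full hour costs exactly 500 credits, a 61-second rental is billed as two minutes, and the starting balance covers 2,000 hours at the current price.
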
Always-on endpoints:

| Verb | Path | Auth | What |
|---|---|---|---|
| GET | `/v1/balance` | Bearer | `{balance, price_per_hour_credits, starting_credits}` |

The `GET /v1/gpus` response also carries `credits.balance` and a per-rental `credits_charged` / `price_per_hour_credits` snapshot.

## Resources (3) — 16 endpoints total

### `/v1/gpus` — rent (대여) and return (반납) GPUs

A GPU is the rental unit. POST a GPU into existence with the container image you want running on it; do your work over `/v1/runs` and `/v1/files`; DELETE to return it and free the hardware for someone else. The container image is fixed at rental time — to switch images, return this GPU and rent a fresh one.

| Verb | Path | Auth | What |
|---|---|---|---|
| GET | `/v1/gpus` | optional | public: cluster summary. authenticated: + the caller's `my_gpus` list |
| POST | `/v1/gpus` | Bearer | **Rent (대여)** a GPU. The container image is fixed for this rental's lifetime |
| GET | `/v1/gpus/{gpu_id}` | Bearer | rental status + child runs |
| DELETE | `/v1/gpus/{gpu_id}` | Bearer | **Return (반납)** the GPU. The container is destroyed and the hardware goes back to the pool |

### `/v1/runs` — execute commands inside a rented GPU

| Verb | Path | Auth | What |
|---|---|---|---|
| POST | `/v1/runs` | Bearer | start a run; long-polls up to `wait_seconds` (default 10, max 60). Returns sync result if quick, `phase: Running` + run_id otherwise |
| GET | `/v1/runs` | Bearer | list runs (`?gpu_id=` optional filter; without it lists across all your rented GPUs) |
| GET | `/v1/runs/{run_id}` | Bearer | current state (stdout snapshot + phase). `?gpu_id=` is an **optional hint** — without it the gateway scans your rented GPUs to locate the run |
| DELETE | `/v1/runs/{run_id}` | Bearer | SIGTERM the process (logs persist until the GPU is returned). `?gpu_id=` optional hint as above |

### `/v1/files` — persistent workspace

Single API surface, two kinds of drives split by URL prefix.
Both backed by Cloudflare R2; bytes never proxy through the gateway — every read/write redirects to a presigned R2 URL. Persistent across GPU rentals; the same files are reachable from any GPU you rent next time.

- `/v1/files/` → your **my drive**, scoped to R2 prefix `users/{user_id}/`
- `/v1/shared/` → list of **orgs** you belong to
- `/v1/shared//` → that **org's shared drive** — read by any member, write by uploader/admin role

The shared drive is org-namespaced (e.g. `shared/vitallab/datasets/ecg/...`). You can be a member of multiple orgs; the root listing shows each as a subdirectory. The first segment under `/v1/shared/` MUST be an org slug — there's no flat or default-org fallback.

**My drive — per-user R2 (free egress, 100 GB quota per user by default)**

| Verb | Path | Auth | What |
|---|---|---|---|
| GET | `/v1/files?prefix=&glob=` | Bearer | list your files (flat). Response includes `quota.used_bytes` / `quota.quota_bytes`. `?prefix=` narrows the R2 list; `?glob=` runs an fnmatch filter on the relative paths |
| GET | `/v1/files/{path}` | Bearer | **302 → R2 presigned GET URL** (`curl -L` follows). `Accept: application/json` or `?json=1` returns `{url, size, expires_in}` instead |
| PUT | `/v1/files/{path}` | Bearer | **307 → R2 presigned PUT URL** with `Connection: close`. Same Expect/no-Expect trade-off as the shared drive (see below). Soft quota: `Content-Length` + current `users/{user_id}/` usage must stay under `KGPU_USER_QUOTA_BYTES` (default 100 GiB) — otherwise 413 |
| DELETE | `/v1/files/{path}` | Bearer | delete a single object, or every object under a prefix when `path` ends with `/` (or when no exact-key match exists) |

**Shared drive — R2 (gateway issues redirects/URLs; bytes never proxy through it)**

| Verb | Path | Auth | What |
|---|---|---|---|
| GET | `/v1/shared` | Bearer | list orgs you belong to (HTML for rclone, JSON via `?json=1`) |
| GET | `/v1/shared//` | Bearer member | directory listing under that org |
| GET | `/v1/shared//` | Bearer member | **302 → R2 presigned GET URL** (Accept JSON for the envelope shape) |
| POST | `/v1/shared//` | Bearer + uploader/admin in org | mint presigned PUT URL (for clients without `Expect: 100-continue`) |
| PUT | `/v1/shared//` | Bearer + uploader/admin in org | **307 → R2 presigned PUT URL** with `Connection: close` |
| DELETE | `/v1/shared//` | Bearer + uploader/admin in org | delete the R2 object |
| POST | `/v1/shared///registered` | Bearer + uploader/admin in org | record an out-of-band PUT in the SharedFile FAT |

Role required per op: read (any member of the org), write (uploader or admin in that org). Membership is managed by admins via `/v1/admin/orgs//members` — see below. If R2 credentials are not configured the shared endpoints return 503.

Both writeable paths use the same R2 presign under the hood — `POST` mints a URL for callers that prefer the explicit two-step; `PUT` is the one-shot that works from any HTTP client (just send `Expect: 100-continue` if you don't want to double-upload your own bytes).

**Why GET stays a 302** — Cloudflare R2 egress is free; AWS Seoul egress is $0.126/GB.
Proxying GETs through the gateway would convert the high-volume side (downloads) into AWS egress costs that scale linearly with researcher count, so the gateway hands out a signed URL and steps out of the data path.

**One-shot examples**

```bash
# Download — curl follows the 302 to R2:
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  "$KGPU_API_BASE/v1/shared/vitallab/datasets/ecg/mit-bih-arrhythmia-database-1.0.0.zip" \
  -o mitdb.zip

# Upload — fast path. `Expect: 100-continue` avoids sending the body twice
# (curl 8.x no longer adds it automatically, so spell it out):
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" \
  -T mitdb.zip \
  "$KGPU_API_BASE/v1/shared/vitallab/datasets/ecg/mit-bih-arrhythmia-database-1.0.0.zip"

# Upload — also works without Expect, but the gateway eats your first copy
# of the body before redirecting (you pay 2× bandwidth, we still pay $0):
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -T mitdb.zip \
  "$KGPU_API_BASE/v1/shared/vitallab/datasets/ecg/mit-bih-arrhythmia-database-1.0.0.zip"
```

**Two-step upload (Python `requests`, PowerShell, browser fetch — any client where you want a single network transfer without the Expect dance)**

```python
import os
import requests

API = os.environ.get("KGPU_API_BASE", "https://api.kgpu.net")
tok = os.environ["KGPU_API_TOKEN"]
key = "vitallab/datasets/ecg/mit-bih-arrhythmia-database-1.0.0.zip"  # org slug comes first
local = "mitdb.zip"

# Step 1: mint a presigned PUT URL. Step 2: send the bytes straight to R2.
r = requests.post(f"{API}/v1/shared/{key}",
                  headers={"Authorization": f"Bearer {tok}", "x-content-type": "application/zip"}).json()
with open(local, "rb") as f:
    requests.put(r["url"], data=f, headers={"Content-Type": "application/zip"}).raise_for_status()
```

## Folders inside a rented GPU

When you rent with the recommended `ghcr.io/vitaldb/kgpu-pytorch:latest` image, the gateway's pod startup runs `kgpu-bootstrap`, which mounts two filesystems before the rental is reported `Ready`:

| Path | Backed by | RW? | Use it for |
|---|---|---|---|
| **`/shared//`** | each org's shared drive (rclone HTTP, vfs-cache full) | RO | reading lab datasets — `duckdb`, `pyarrow`, `wfdb` open files in place. Mount root lists every org you belong to as a subdir |
| **`/files`** | your mydrive (rclone WebDAV, vfs-cache writes) | **RW** | small editable files, results drops, notebook saves |
| `/workspace` | pod ephemeral local disk (≤100 GiB) | RW | training I/O — checkpoints, logs, scratch. Lost on rental end |

`/files` also exposes `shared//` subdirs (same data, just via the slower WebDAV path). Stick with `/shared//` for dataset reads — e.g. `/shared/vitallab/datasets/ecg/foo.zip`.

**Other images** — if you POST a `/v1/gpus` with a different image (e.g. `nvcr.io/nvidia/pytorch:24.10-py3` or your own), the pod falls back to plain `sleep infinity` and you set up the mounts manually:

```bash
apt-get update -qq && apt-get install -y -qq --no-install-recommends fuse3 ca-certificates curl unzip zstd
curl -fsSL https://rclone.org/install.sh | bash
# then either: kgpu-mount-shared && kgpu-mount-files (if helpers shipped)
# or the raw rclone commands below

# Read-only shared (HTTP backend — simpler, no metadata round trips per stat)
mkdir -p /shared
rclone mount :http: /shared \
  --http-url "$KGPU_API_BASE/v1/shared/" \
  --http-headers "Authorization,Bearer $KGPU_API_TOKEN" \
  --read-only --vfs-cache-mode full \
  --vfs-read-chunk-size 16M --vfs-read-chunk-streams 4 --daemon

# Read-write mydrive (WebDAV backend — also exposes shared/ as a subdir, RO)
mkdir -p /files
rclone mount :webdav: /files \
  --webdav-url "$KGPU_API_BASE/v1/files/" \
  --webdav-vendor other \
  --webdav-headers "Authorization,Bearer $KGPU_API_TOKEN" \
  --vfs-cache-mode writes --daemon
```

Cache modes:

- `--vfs-cache-mode full` — random access without downloading the whole file. Best for big zips you'll seek into (`duckdb`, `pyarrow.parquet`, `wfdb`).
- `--vfs-cache-mode writes` — writes go through local cache then upload. Required for any write workload on WebDAV.
- `--vfs-cache-mode off` — pure stream. Best for one-shot extract: `unzip /shared/vitallab/datasets/ecg/foo.zip -d /workspace/`.

To unmount: `fusermount3 -u /shared` (or `/files`).

### ⚠️ Filesystem write hygiene — read this before writing to the mount

The WebDAV mount is **convenient for browsing and dropping small results**, but it's not a real POSIX filesystem under the hood. R2 is an object store, and every "write" is at minimum one full object upload. Specifically:

- **Every append re-uploads the entire object.** `tee -a log.txt`, `>> file`, sqlite WAL — all of these will silently turn into "download → modify → upload entire file" cycles. Don't do this on training logs or checkpoints; the mount will crawl
- **Random in-place writes** (a file opened with `'r+'`, then `f.seek(N)` and `f.write(...)`) force a whole-object rewrite per call. Same trap as above
- **`utime()`/`touch` to update mtime is a no-op**, `chmod`/`chown` ignored, hardlinks / symlinks unsupported
- **Two writers to the same path → last-write-wins** (no locking)

**The pattern that works:**

```bash
# 1. Write everything during training to local pod disk
mkdir -p /workspace/exp42
python train.py --out /workspace/exp42   # checkpoints, logs, plots here

# 2. At the end, package + upload in ONE transfer
tar -czf /tmp/exp42.tgz -C /workspace exp42
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" \
  -T /tmp/exp42.tgz \
  "$KGPU_API_BASE/v1/files/exp42-outputs.tgz"
```

Or even simpler — stream the tar straight to the gateway without a temp file:

```bash
# `-X PUT` is needed: `--data-binary` alone would send a POST.
# curl buffers stdin in memory here, so keep the archive modest in size.
tar -czf - -C /workspace exp42 | \
  curl -L -X PUT -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" -H "Content-Type: application/gzip" \
  --data-binary @- \
  "$KGPU_API_BASE/v1/files/exp42-outputs.tgz"
```

**When the WebDAV mount IS the right tool:** Jupyter notebook save, dropping a small results.csv interactively, editing a config file across sessions. Not for the training loop's I/O path.

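
If the training script is Python, the "package once, upload once" pattern can also be driven from the script itself. The following is a minimal sketch, not part of the gateway tooling: `pack_outputs`, `upload_command` and `run_upload` are hypothetical helper names, and the upload simply shells out to the same `curl` flags shown above.

```python
import subprocess
import tarfile
from pathlib import Path


def pack_outputs(src_dir, archive_path):
    """Write src_dir into one gzip-compressed tarball on local disk."""
    src = Path(src_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # arcname keeps member paths relative, like `tar -C /workspace exp42`.
        tar.add(src, arcname=src.name)
    return archive_path


def upload_command(archive_path, api_base, token, dest_key):
    """Build the one-shot curl PUT used in this guide (hypothetical helper)."""
    return [
        "curl", "-L",
        "-H", f"Authorization: Bearer {token}",
        "-H", "Expect: 100-continue",
        "-T", str(archive_path),
        f"{api_base}/v1/files/{dest_key}",
    ]


def run_upload(cmd):
    """Run the curl command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)
```

At the end of training you would call `pack_outputs("/workspace/exp42", "/tmp/exp42.tgz")`, build the command from `os.environ["KGPU_API_BASE"]` and `os.environ["KGPU_API_TOKEN"]`, and pass it to `run_upload`. All writes during training stay on `/workspace`, so the object store sees exactly one upload.
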
### Per-experiment venv (recommended)

The base image ships with `uv`. Create a venv per experiment so its package set is isolated and reproducible:

```bash
uv venv /workspace/exp42 --python 3.12
source /workspace/exp42/bin/activate
uv pip install transformers wandb 'torch==2.5.*'
# ... train ...
```

To persist the venv across rentals (so the next rental doesn't reinstall), tar it onto your mydrive at the end:

```bash
# `-X PUT` is needed: `--data-binary` alone would send a POST.
tar -cf - -C /workspace exp42 | zstd -T0 | \
  curl -L -X PUT -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" -H "Content-Type: application/zstd" \
  --data-binary @- \
  "$KGPU_API_BASE/v1/files/envs/exp42.tar.zst"

# Next rental:
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  "$KGPU_API_BASE/v1/files/envs/exp42.tar.zst" | \
  zstd -dT0 | tar -xf - -C /workspace
source /workspace/exp42/bin/activate
```

### API surface (what the mounts call)

- `GET /v1/files/[/]` — HTML directory listing (rclone HTTP) or JSON with `?json=1`
- `HEAD /v1/files/` — size, content-type, ETag
- `GET /v1/files/` — 302 to a presigned R2 GET URL
- `PUT /v1/files/` — 307 to a presigned R2 PUT URL
- `DELETE /v1/files/` — single file or `/` prefix delete
- `OPTIONS / PROPFIND / MKCOL / MOVE / COPY` — WebDAV (mydrive only; shared/ subtree's writes return 403)

You can drive the same API with `curl`, `requests`, `boto3` (via the presigned URLs), or any HTTP client.

## Admin: organizations

The shared drive is namespaced by **org** (e.g. `vitallab`, plus whatever other research groups get onboarded). These endpoints require an admin Bearer token (the env token or a user with `role=admin`).
```bash
# $T holds an admin Bearer token
# List orgs
curl -sS -H "Authorization: Bearer $T" https://api.kgpu.net/v1/admin/orgs

# Create an org (slug must be DNS-friendly lowercase)
curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/admin/orgs \
  -d '{"slug":"snuhecg","name":"SNUH ECG Research","owner_user_id":1}'

# Add / change a member's role (admin | uploader | member)
curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/admin/orgs/snuhecg/members \
  -d '{"user_id":2,"role":"uploader"}'

# Remove a member
curl -sS -X DELETE -H "Authorization: Bearer $T" \
  https://api.kgpu.net/v1/admin/orgs/snuhecg/members/2
```

Roles:

- **admin** — manage members + read + write
- **uploader** — read + write
- **member** — read only

## Auth

```
Authorization: Bearer $KGPU_API_TOKEN
```

Inside a rented GPU's container, env vars `$KGPU_API_TOKEN` and `$KGPU_API_BASE` are auto-injected.

## Workflow A — quick one-shot

```bash
# $T is your Bearer token (e.g. T=$KGPU_API_TOKEN)
# 1. Rent a GPU with the image you want running on it
GPU=$(curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/gpus -d '{
    "name": "check",
    "image": "nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04"
  }' | jq -r .gpu_id)

# Wait for the GPU to become Ready (~5–30s if the image isn't already cached on the node)
sleep 10

# 2. Run command (sync return if <10s)
curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/runs -d "{
    \"gpu_id\": \"$GPU\",
    \"name\": \"nvidia-smi\",
    \"command\": [\"nvidia-smi\", \"-L\"]
  }"
# → {phase:"Succeeded", exit_code:0, stdout:"GPU 0: NVIDIA GB10...", ...}

# 3. Return (반납) the GPU
curl -sS -X DELETE -H "Authorization: Bearer $T" "https://api.kgpu.net/v1/gpus/$GPU"
```

## Persisting artifacts (important)

**The rented GPU's container disk is ephemeral.** When you return the GPU (`DELETE /v1/gpus/{id}`) or it auto-returns on idle / `duration_hours` expiry, everything inside the container — including `/tmp`, the writable layer, `$HOME`, and `/tmp/kgpu/runs/{run_id}/` — is destroyed. Pod-local run logs disappear with the pod; if you need them, mirror them through `/files/.../` (WebDAV mount) or curl-PUT to `/v1/files/...` before returning the rental. Anything *you* produced (model checkpoints, plots, predictions, intermediate csv, preprocessed datasets) is lost unless you upload it.

**Rule for agents and humans alike:** upload every artifact you'd want to see again to `/v1/files/{path}` (my drive — per-user R2, 100 GB quota, persists across rentals). From inside a rented GPU the call is one curl (the env vars are auto-injected):

```bash
# Save a checkpoint from inside the rented GPU's container
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" \
  -T /workspace/checkpoint.pt \
  "$KGPU_API_BASE/v1/files/checkpoints/exp42/checkpoint.pt"

# Save a results csv
curl -L -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" \
  -T /workspace/results.csv \
  "$KGPU_API_BASE/v1/files/exp42/results.csv"
```

A useful pattern is to make the **last command of your run** a `tar` of your output dir piped to `/v1/files/`:

```bash
# `-X PUT` is needed: `--data-binary` alone would send a POST.
tar -czf - -C /workspace outputs/ | \
  curl -L -X PUT -H "Authorization: Bearer $KGPU_API_TOKEN" \
  -H "Expect: 100-continue" \
  -H "Content-Type: application/gzip" \
  --data-binary @- \
  "$KGPU_API_BASE/v1/files/runs-out/${KGPU_GPU_ID}-outputs.tgz"
```

Anything you skip uploading is gone the moment the pod dies.

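
Because my-drive uploads are rejected with 413 once the quota would be exceeded, it can help to check the sizes before the final upload. The sketch below is an illustration only. It assumes the JSON from `GET /v1/files` nests the counters as `{"quota": {"used_bytes": ..., "quota_bytes": ...}}` (inferred from the field names listed earlier, not confirmed), and it treats "must stay under" as a strict inequality. The helper names are hypothetical.

```python
import os


def total_size_bytes(paths):
    """Sum the on-disk size of the files you plan to upload."""
    return sum(os.path.getsize(p) for p in paths)


def fits_quota(upload_bytes, listing):
    """Return True if the upload would stay under the reported quota.

    ASSUMPTION: `listing` is the parsed JSON of GET /v1/files and has the
    shape {"quota": {"used_bytes": int, "quota_bytes": int}}.
    """
    quota = listing["quota"]
    return quota["used_bytes"] + upload_bytes < quota["quota_bytes"]
```

A run could call `fits_quota(total_size_bytes(["/workspace/checkpoint.pt"]), listing)` right before uploading and, if it returns `False`, delete old objects or compress the artifact first instead of losing it to a failed upload at return time.
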
## Workflow B — iterative dev with pip cache

```bash
# $T is your Bearer token (e.g. T=$KGPU_API_TOKEN)
# Rent (대여) a GPU
GPU=$(curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/gpus -d '{
    "name": "exp",
    "image": "nvcr.io/nvidia/pytorch:24.10-py3"
  }' | jq -r .gpu_id)
sleep 15

# Install deps once (cached in container until DELETE /v1/gpus)
curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/runs -d "{
    \"gpu_id\": \"$GPU\",
    \"name\": \"pip\",
    \"command\": [\"pip\", \"install\", \"transformers\", \"datasets\"],
    \"wait_seconds\": 60
  }"

# Upload code — `-L` to follow the 307 to R2; `Expect: 100-continue` so the
# body never touches the gateway (curl 8.x doesn't add it automatically).
curl -L -H "Authorization: Bearer $T" -H "Expect: 100-continue" -T train.py \
  https://api.kgpu.net/v1/files/train.py

# Run (long-poll 10s; if not done, poll). The inner curl follows the 302
# from /v1/files/train.py to R2 — same `-L` flag. The `\$` escapes make the
# injected env vars expand inside the container, not in your local shell.
RUN=$(curl -sS -X POST -H "Authorization: Bearer $T" -H "Content-Type: application/json" \
  https://api.kgpu.net/v1/runs -d "{
    \"gpu_id\": \"$GPU\",
    \"name\": \"train\",
    \"command\": [\"bash\", \"-c\", \"curl -sL -H \\\"Authorization: Bearer \$KGPU_API_TOKEN\\\" \$KGPU_API_BASE/v1/files/train.py > /tmp/train.py && python /tmp/train.py\"]
  }")

# If phase="Running", poll:
RUN_ID=$(echo "$RUN" | jq -r .run_id)
while true; do
  ST=$(curl -sS -H "Authorization: Bearer $T" "https://api.kgpu.net/v1/runs/$RUN_ID?gpu_id=$GPU")
  PHASE=$(echo "$ST" | jq -r .phase)
  [ "$PHASE" != "Running" ] && break
  sleep 10
done
echo "$ST" | jq -r .stdout

# Return (반납) the GPU
curl -X DELETE -H "Authorization: Bearer $T" "https://api.kgpu.net/v1/gpus/$GPU"
```

## Run response shape

```json
{
  "run_id": "run---",
  "gpu_id": "gpu---",
  "phase": "Succeeded" | "Failed" | "Running",
  "exit_code": 0 | null,
  "stdout": "...",
  "stderr": "...",
  "started_at": "2026-05-17T01:00:00Z",
  "completed_at": "2026-05-17T01:00:02Z" | null,
  "elapsed_seconds": 2.3,
  "pid": 12345
}
```

If `phase: "Running"`, follow up with `GET /v1/runs/{run_id}?gpu_id=...` until `phase` changes.

## Cluster info (no auth)

```bash
curl https://api.kgpu.net/v1/gpus
# → {
#     cluster: [{node, gpu_model, gpu_busy, gpu_total, ready, price_per_hour_credits}],
#     active_total: N,
#     price_per_hour_credits: 500    # current rental price; Phase-1 single price
#   }
```

Authenticated callers also see `credits.balance` and a per-rental `credits_charged` / `price_per_hour_credits` snapshot under `my_gpus[]`.

## Phase 1 limits

- Single GPU node (spark-llm, NVIDIA GB10). Multiple concurrent rentals on the same node share the physical GPU (CUDA context contention)
- `/v1/files/{path}` (my drive) is per-user R2 with a 100 GB default soft quota (`KGPU_USER_QUOTA_BYTES`). Persistent across rentals, free egress, no gateway bandwidth penalty. For datasets shared across an org use `/v1/shared//*` (uploader/admin role required for writes)
- Run logs live in `/tmp/kgpu/runs/{run_id}/` inside the rented GPU. They are lost when the GPU is returned — if you want them afterwards, mirror them through the WebDAV mount at `/files/` or curl-PUT them out before returning
- **Idle timeout.** No `/v1/runs` POST activity on the rental for 1 hour → `idle_warning: true` appears on the GPU in `GET /v1/gpus`. 2 hours → the gateway auto-returns the GPU (`end_reason: auto_idle`). To keep a long-running training session alive, just keep your run going (any active `/v1/runs` counts as activity). `DELETE /v1/gpus/{gpu_id}` (반납) still works at any time, and `duration_hours` (default 12h, max 168h) is still the hard ceiling
- **Credits.** Every rental costs credits per minute — see the [Credits](#credits) section above. Out-of-credits → auto-return (`end_reason: out_of_credits`)
- Single admin user (`lucid80@gmail.com`). Phase 2: per-user namespaces and isolation
- No GPU resource scheduling (K8s device plugin incompatible with GB10) — concurrent rentals on one node share the physical GPU cooperatively
- The rented GPU's local disk (writable container layer + `/tmp`) is **ephemeral** and **capped at 100 GiB per rental** (`resources.limits.ephemeral-storage`). Exceeding the cap evicts your pod and you lose anything not uploaded. Anything you need to keep across rentals must be PUT to `/v1/files/...` before you return the GPU — see the "Persisting artifacts" section above

## Container env auto-injected

| Var | Value |
|---|---|
| `KGPU_GPU_ID` | this rental's id |
| `KGPU_API_TOKEN` | Bearer token |
| `KGPU_API_BASE` | `https://api.kgpu.net` |
| `NVIDIA_VISIBLE_DEVICES` | `all` |
| `NVIDIA_DRIVER_CAPABILITIES` | `compute,utility` |

## Useful base images

| Purpose | Image |
|---|---|
| **Recommended** — PyTorch + rclone + fuse + duckdb + `kgpu-mount-shared` | `ghcr.io/vitaldb/kgpu-pytorch:latest` |
| Bare CUDA / nvidia-smi | `nvcr.io/nvidia/cuda:13.0.0-base-ubuntu24.04` |
| PyTorch (vanilla NGC, no extras) | `nvcr.io/nvidia/pytorch:24.10-py3` |
| TensorFlow | `nvcr.io/nvidia/tensorflow:24.10-tf2-py3` |

The first image is built in [vitaldb/kgpu-images](https://github.com/vitaldb/kgpu-images) and saves ~30 s of `apt-get + rclone install` on every rental.
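
As a closing sketch for agents that drive `/v1/runs` from Python rather than bash: the request body can be built with `json.dumps` instead of hand-escaped quotes, and the polling loop can be written against the documented `phase` values. The helper names are hypothetical, and `fetch_state` stands for whatever function you use to call `GET /v1/runs/{run_id}`; nothing here is the gateway's own client.

```python
import json
import time


def build_run_payload(gpu_id, name, command, wait_seconds=10):
    """Serialize a POST /v1/runs body; json.dumps handles all quoting."""
    if wait_seconds > 60:
        raise ValueError("wait_seconds is capped at 60 by the gateway")
    return json.dumps({
        "gpu_id": gpu_id,
        "name": name,
        "command": list(command),
        "wait_seconds": wait_seconds,
    })


def wait_for_run(fetch_state, interval=10.0, timeout=3600.0, sleep=time.sleep):
    """Call fetch_state() until the run leaves phase "Running".

    fetch_state is any zero-argument callable returning the parsed JSON of
    GET /v1/runs/{run_id}; injecting it keeps the loop testable offline.
    """
    waited = 0.0
    while True:
        state = fetch_state()
        if state["phase"] != "Running":
            return state          # "Succeeded" or "Failed"
        if waited >= timeout:
            raise TimeoutError("run is still Running after the timeout")
        sleep(interval)
        waited += interval
```

With `requests`, `fetch_state` could be a lambda that issues the GET with the Bearer header and returns `.json()`. The returned dictionary then carries `exit_code`, `stdout` and `stderr` as described in the run response shape.
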