Products / For platform admins
TAIP Admin
AvailableThe whole AI cluster on one screen. kubectl optional.
TAIP Admin is a web-based administration console purpose-built for AI infrastructure. One Go binary serves the API and the SPA. It auto-detects metrics-server, Kueue, KServe, Training Operator, cert-manager, Gateway API, DRA, and VPA — each integration lights up when its API appears and disappears cleanly when it doesn't — and connects to Prometheus, Alertmanager, and Grafana with one URL each, degrading gracefully when they're absent. Resource accounting works in three tiers: requests and capacity from the K8s API alone, live usage with metrics-server, 30-day history with Prometheus. It is engineered, not just assembled: listing 617 secrets takes 0.5 seconds instead of 62 via metadata-only reads — and secret values never reach the browser.
Specification
- Version
- v1.6.8 — generally available
- Footprint
- Single Go binary · one Helm release · amd64 + arm64
- Auto-detected
- metrics-server · Kueue · KServe · Training Operator · cert-manager · Gateway API · DRA · VPA
- Connects to
- Prometheus · Alertmanager · Grafana — one URL each, optional
- GPU telemetry
- Per-accelerator — NVIDIA DCGM · Ascend NPU: utilization, memory, temp, power
- Roles
- OIDC · admin / viewer split · secrets never sent to browser
- Languages
- English · 简体中文
Proof, not promises
See it in one block.
No proprietary SDKs, no rewrites — TAIP Admin meets your tools where they already are.
$ helm install taip-admin taip/taip-admin
detected metrics-server ✓ kueue ✓ kserve ✓ cert-manager ✓
detected gateway-api ✓ dra ✓ vpa ✓ training-operator ✓
connected prometheus ✓ alertmanager ✓ grafana ✓ # one URL each
# remove a component and its pages vanish cleanly
# 617 secrets listed in 0.5s (metadata-only) · values never leave the cluster▌ One binary, one Helm release. The console grows and shrinks with your stack — across Kueue versions, without rebuilds.
Capabilities
What TAIP Admin gives you
GPU and AI workloads, first-class
Extended resources, DRA device browsing, and per-accelerator telemetry for NVIDIA DCGM and Ascend NPU — utilization, memory, temperature, power. A cluster GPU heatmap shows idle-vs-active capacity with owner attribution, alongside topology and MIG. KServe InferenceServices and ServingRuntimes, plus Kueue queue management via API discovery — one binary across versions.
Three-tier resource accounting
Requests, limits, and capacity from the K8s API alone; live CPU and memory when metrics-server is present; 1h–30d history when Prometheus is configured. The same UI scales with your stack — and stays fast: 617 secrets in 0.5s, not 62s.
Alerts, silences, and Grafana deep-links
Severity-coded alert tables with one-click silence creation pre-filled from the alert's matchers. Live alert badge in the sidebar. Open-in-Grafana buttons that carry cluster, node, namespace, and pod context with them. Configure outbound receivers — email/SMTP or a CloudSentry webhook — from the console.
War Room for incidents
A full-screen NOC dashboard with auto-refresh, live event SSE feed, node grid with per-node mini gauges, and resource panels — built for wall displays and on-call shifts. Node cordon and drain with real-time eviction progress.
Audit trail, idle reclaim, and an app catalog
A structured audit log of every mutating admin action, with optional persistent history you can query in-app. Idle-resource detection flags idle GPUs, inactive notebooks, and stale jobs to reclaim. An OCI/Helm app catalog browses charts from your registry and installs them through a guided wizard — and a bilingual user guide ships inside the binary.
Identity, topology, and queue analytics
Manage platform users and groups, sessions, and MFA devices from the console. Visualize cluster hierarchy and intra-node NUMA topology. Read Kueue queue analytics — wait time, depth, fairness, and preemptions — and curate the taip-portal app catalog and broadcast announcements.
How it works
Install it, then operate the cluster.
- Step 01
Point it at a cluster
One Go binary, one Helm release, one optional CRD. OIDC for SSO. The whole console is a single process.
- Step 02
Integrations light up automatically
Kueue, KServe, DRA, VPA, cert-manager, Gateway API — auto-detected from the cluster. Prometheus, Alertmanager, and Grafana connect with one URL each.
- Step 03
Operate and respond
Severity-coded alerts, one-click silences, War Room dashboard, live event SSE, node cordon and drain — without a kubectl tab open.
Who it's for
Built for these teams
- Platform engineers running shared AI clusters
- On-call responders investigating incidents
- Auditors and read-only viewers (admin/viewer roles built in)