InferX
Available · Self-hosted Model-as-a-Service for the agentic era
InferX is an LLM inference platform you run on your own Kubernetes. It speaks both the OpenAI and Anthropic Messages APIs at the front, runs vLLM, TensorRT-LLM, SGLang, or llama.cpp at the back, and adds the operational layer most gateways skip: cost attribution per user and per key, latency and error tracking per model, KServe deployment management, model downloads from Hugging Face and S3, and a built-in playground for testing before integration. The roadmap adds policy-based intelligent routing and safety modes for control-loop and safety-critical workloads.
- Protocols: OpenAI · Anthropic · streaming SSE
- Runtimes: vLLM · TRT-LLM · SGLang · llama.cpp
- Hardware: NVIDIA · AMD · Intel · Ascend · Cambricon
Capabilities
What InferX gives you
OpenAI- and Anthropic-compatible
Drop-in /v1/chat/completions, /v1/embeddings, and /anthropic/v1/messages. Streaming SSE. Point your existing SDK or `claude-code` at InferX by changing the base URL.
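A minimal sketch of the drop-in claim, using the official OpenAI Python SDK. The host, API key, and model name are placeholders for your own deployment, not values InferX ships:

```python
# Hypothetical host, key, and model name; substitute your own deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://inferx.example.com/v1",  # InferX instead of api.openai.com
    api_key="ix-...",                         # an InferX-issued API key
)

# Streamed chat completion over SSE, same shape as the upstream API.
stream = client.chat.completions.create(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Summarize KServe in one sentence."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```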
Multi-vendor GPU and KServe-native
NVIDIA, AMD, Intel, Huawei Ascend, and Cambricon accelerators are auto-detected. Deploy InferenceServices from typed templates with built-in vLLM, TRT-LLM, AWQ, BF16, and GGUF presets.
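To give a sense of what a typed template resolves to, here is roughly the KServe resource a vLLM deployment reduces to, submitted with the official `kubernetes` Python client. The names, image, and PVC below are illustrative assumptions, not InferX's actual template output:

```python
# Illustrative InferenceService for a vLLM runtime; all names are assumed.
from kubernetes import client, config

config.load_kube_config()

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",
    "kind": "InferenceService",
    "metadata": {"name": "llama-3-8b", "namespace": "inferx"},
    "spec": {
        "predictor": {
            "containers": [{
                "name": "kserve-container",
                "image": "vllm/vllm-openai:latest",
                "args": ["--model", "/mnt/models"],
                "resources": {"limits": {"nvidia.com/gpu": "1"}},
                "volumeMounts": [{"name": "weights", "mountPath": "/mnt/models"}],
            }],
            # Weights come from an existing PVC, as in Step 01 below.
            "volumes": [{
                "name": "weights",
                "persistentVolumeClaim": {"claimName": "llama-3-8b-weights"},
            }],
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="inferx",
    plural="inferenceservices",
    body=inference_service,
)
```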
Cost, latency, and errors per model
Every request is OTEL-instrumented. P50/P95/P99 latency, error rate, and cost are attributed per model and per API key — answerable from the admin dashboard, not a Grafana hunt.
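To make the mechanism concrete, here is a sketch of the kind of span a gateway emits per request, using the OpenTelemetry Python SDK. The `inferx.*` attribute names are hypothetical, not InferX's actual telemetry schema:

```python
# Per-request attribution sketch; the inferx.* attribute names are assumed.
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("inferx.gateway")

# One span per proxied request: latency percentiles fall out of span
# durations, cost out of the token counts recorded as attributes.
with tracer.start_as_current_span("chat.completions") as span:
    span.set_attribute("inferx.model", "llama-3-8b")
    span.set_attribute("inferx.api_key_id", "key_1234")
    time.sleep(0.05)  # stand-in for the actual runtime call
    span.set_attribute("inferx.tokens.prompt", 42)
    span.set_attribute("inferx.tokens.completion", 128)
```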
Built for agents
The roadmap brings policy-based routing, session affinity, content-aware deployment selection, and safety modes — verified, consensus, and human-in-the-loop — for control-loop and audit-grade workloads.
How it works
From model weights to a measured endpoint.
Step 01 · Deploy a model
Pick a runtime template — vLLM, TRT-LLM, GGUF — point at a PVC of weights, hit deploy. Multi-vendor GPU auto-detected.
Step 02 · Get an endpoint
OpenAI- and Anthropic-compatible URLs, streaming SSE on both. API keys with rate limits, budgets, and per-model whitelists. A streamed call is sketched after these steps.
Step 03 · Watch cost and latency
Every request is OTEL-instrumented. P50/P95/P99, error rate, and cost attributed per model and per key — straight from the dashboard.
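The streamed call sketched here makes Step 02 concrete from the Anthropic side: the same endpoint, consumed through the Anthropic SDK over SSE. Host, key, and model name are placeholders:

```python
# Placeholder host, key, and model; substitute your own deployment.
from anthropic import Anthropic

client = Anthropic(
    base_url="http://inferx.example.com/anthropic",  # InferX's Anthropic-compatible prefix
    api_key="ix-...",                                # the same InferX-issued key
)

with client.messages.stream(
    model="llama-3-8b",
    max_tokens=256,
    messages=[{"role": "user", "content": "What does Step 03 measure?"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
```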
Who it's for
Built for these teams
- Teams shipping LLM products on dedicated capacity
- Platform teams consolidating inference cost and access
- Builders of agentic systems with safety and audit needs
Pairs well with
Other builder products
ConsoleX
Available · The self-service Kubernetes workspace for every user
Each user gets an isolated namespace with quotas, storage, networking, and a web terminal — no kubectl, no tickets, no per-user RBAC.
DevSpace
Available · Managed AI development environments on Kubernetes
Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments — GPU-ready, isolated per user, idle-shutdown by default.
TrainX
Available · Curated, multi-tenant training on Kubernetes
Templates that describe themselves render directly into a UI form — admins control the script and defaults, users supply the parameters.