GrokX
Available
Ground your agents in your documents, scanned PDFs included, with every answer cited to the page.
GrokX is the knowledge component of TAIP, the third of the trio: InferX serves models, AgentX runs agents, GrokX serves knowledge. It turns a document corpus, including scanned PDFs, into something agents can ground their answers on, with page-level citations.
Ingestion runs once: born-digital pages are read from their text layer and scanned pages are OCR'd, then all text is chunked (by paragraph, heading, or sentence) and kept fresh with SHA-256 manifests so re-ingest skips unchanged docs. Retrieval is hybrid by default: sparse keyword vectors and dense embeddings are combined with rank fusion in the vector DB, then reordered by a cross-encoder reranker. A text-mirror mode (a markdown tree you mount and grep) is available for smaller corpora.
Everything is multi-tenant: a web console with OIDC SSO, scoped personal access tokens, and per-KB RBAC governs many isolated knowledge bases, each its own collection in the vector DB. When GrokX is registered on an AgentX Agent under the alias kb, mcp__kb__search(query, kb) returns matching passages with their source and page, so the model cites 'page 12 of report.pdf' instead of guessing. Embeddings and reranking call your own InferX endpoints. GrokX ships as a Helm-packaged TAIP app (MCP server, console, vector DB, and an indexer) and runs end-to-end in production today.
Specification
- Status: v0.7.0, shipped and running in production
- Ingest: Born-digital text + OCR for scans · PDF · Word · HTML · Markdown · text
- Retrieval: Hybrid (sparse keyword + dense) with rank fusion · cross-encoder rerank · page citations
- Embeddings: Pluggable model via InferX (OpenAI-compatible /embeddings)
- Store: Pluggable vector DB (named dense + sparse vectors) · local store for dev
- Access: Web console · OIDC SSO · scoped PATs · per-KB RBAC · audit trail
- Serving: MCP server (streamable HTTP) · Helm: server + console + vector DB + indexer
Proof, not promises
See it in one block.
No proprietary SDKs, no rewrites — GrokX meets your tools where they already are.
$ grokx push ./corpus --kb research # upload + OCR scans + index, resumable
ingested 142 docs · 38 OCR'd · 1,907 pages → indexed 9,841 chunks
$ grokx serve # MCP server (streamable HTTP) on :8080
serving 6 knowledge bases
# an AgentX Agent calls the tool — hybrid + reranked, with a citation:
mcp__kb__search("Q3 revenue", kb="research") → "…revenue was $4.2M…" [report.pdf p.12]
OCR and embedding run once at ingest, not per query. Sparse keyword and dense vectors are fused and reranked at search time, and every passage keeps its source and page so the answer can be cited.
Capabilities
What GrokX gives you
Ingestion that reads scanned PDFs
Walk a corpus and extract every page: born-digital text straight from the text layer, image-only scans via OCR. PDF, Word, HTML, Markdown, and plain text all ingest. Running grep over raw PDF bytes is useless, and text models can't see page images, so extraction is mandatory. GrokX does it once, degrading to ocr-skipped rather than failing.
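The per-page decision described above can be sketched roughly as follows. This is an illustration only, not GrokX's code: the function name, the status strings other than ocr-skipped, and the assumption that ocr-skipped is used when OCR is unavailable or raises an error are all hypothetical.

```python
def extract_page(text_layer, ocr=None):
    """Return (text, status) for one page.

    text_layer: text read from the page's text layer ("" for image-only pages).
    ocr: optional zero-argument callable that returns recognized text;
         None stands for "no OCR engine available" (an assumption).
    """
    if text_layer.strip():
        # Born-digital page: use the text layer directly, no OCR needed.
        return text_layer, "text-layer"
    if ocr is None:
        # Assumed behaviour: record the page as skipped instead of failing.
        return "", "ocr-skipped"
    try:
        return ocr(), "ocr"
    except Exception:
        # Assumed behaviour: an OCR error degrades to skipped as well.
        return "", "ocr-skipped"


def broken_ocr():
    raise RuntimeError("engine crashed")


born_digital = extract_page("Quarterly report")
scanned = extract_page("", ocr=lambda: "Scanned text")
no_engine = extract_page("")
failed = extract_page("", ocr=broken_ocr)
```

The point of the design is that one unreadable page yields an empty, clearly labeled result rather than aborting the whole ingest run.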
Hybrid retrieval, reranked
Sparse keyword vectors and dense embeddings are stored together in the vector DB and combined with rank fusion, then a cross-encoder reranker reorders the top results. Lexical recall and semantic recall in one query — or mount the markdown text mirror and grep it for smaller corpora.
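To make the fusion step concrete, here is a minimal sketch using reciprocal rank fusion (RRF), one common rank-fusion method. The page does not say which fusion formula GrokX or its vector DB applies, so the function name, the constant k=60, and the sample chunk IDs are assumptions chosen for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a conventional default (assumed).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


sparse_hits = ["c7", "c2", "c9"]  # hypothetical keyword-search order
dense_hits = ["c2", "c5", "c7"]   # hypothetical embedding-search order
fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
```

Chunks that rank well in both lists (c2 and c7 here) rise to the top, which is the effect a cross-encoder reranker would then refine.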
Page-level citations
Every chunk keeps its source and page, so an agent can answer 'per page 12 of report.pdf' instead of producing an unverifiable claim. Provenance is preserved from ingest through retrieval and rerank.
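One way to picture provenance being carried along is a chunk record that never loses its source and page. The sketch below assumes paragraph chunking on blank lines; the Chunk class, chunk_pages function, and sample text are hypothetical and are not GrokX's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    source: str  # file the passage came from
    page: int    # 1-based page number

    def citation(self) -> str:
        return f"{self.source} p.{self.page}"


def chunk_pages(source, pages):
    """Split each page's text into paragraph chunks, tagging every chunk
    with its source and page so provenance survives later stages."""
    chunks = []
    for page_number, text in enumerate(pages, start=1):
        for paragraph in text.split("\n\n"):
            paragraph = paragraph.strip()
            if paragraph:
                chunks.append(Chunk(paragraph, source, page_number))
    return chunks


pages = ["Intro paragraph.\n\nSecond paragraph.", "Revenue was up."]
chunks = chunk_pages("report.pdf", pages)
```

Because the record is immutable, retrieval and reranking can reorder chunks freely without any risk of detaching a passage from its citation.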
A tool AgentX can call
grokx serve exposes an MCP server. Registered on an AgentX Agent under the alias kb, the server's tools appear as mcp__kb__search(query, kb, k, source?, page?), plus list_knowledge_bases, list_sources, and get_document. The model decides when to search and gets passages back with citations. The vector store lives in GrokX, never inside the agent sandbox.
Many knowledge bases, governed
A web console with OIDC SSO manages multiple isolated knowledge bases — each its own collection in the vector DB. Per-KB RBAC (viewer / editor / owner), ACL sharing to users and groups, scoped personal access tokens, and an append-only audit trail. The KB is the unit of access control.
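A per-KB role check might look roughly like the sketch below. The three role names come from the text above; the ordering viewer < editor < owner, the action names, and the action-to-role mapping are assumptions made for illustration, not GrokX's actual policy.

```python
ROLE_RANK = {"viewer": 1, "editor": 2, "owner": 3}

# Hypothetical mapping of action -> minimum role; not taken from the product.
REQUIRED_ROLE = {"search": "viewer", "ingest": "editor", "share": "owner"}


def is_allowed(acl, kb, user, action):
    """acl maps KB name -> {user: role}.

    A user with no entry for a KB has no access to it, which is what
    keeps knowledge bases isolated from one another.
    """
    role = acl.get(kb, {}).get(user)
    if role is None:
        return False
    return ROLE_RANK[role] >= ROLE_RANK[REQUIRED_ROLE[action]]


acl = {
    "research": {"ana": "owner", "ben": "viewer"},
    "legal": {"ana": "viewer"},
}
```

Note that the lookup is keyed by knowledge base first: a role granted on one KB says nothing about any other, matching the idea that the KB is the unit of access control.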
Four ways to ingest, kept fresh
Upload through the web console (resumable), grokx push / sync from the CLI, mount a WebDAV folder, or wire a scheduled git connector. SHA-256 manifests track every source so re-ingest and re-index skip unchanged docs and prune deletions — expensive OCR and embedding work is never repeated.
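The manifest idea can be sketched as a comparison of stored SHA-256 digests against the current files. The manifest layout (path mapped to hex digest) and the function names are assumptions; GrokX's real manifest format is not described here.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def plan_reingest(manifest, current_files):
    """Compare a stored manifest (path -> digest) with the current
    corpus (path -> bytes) and decide what needs work."""
    new_manifest = {path: sha256_of(data) for path, data in current_files.items()}
    # New or modified files need OCR/embedding again.
    changed = sorted(p for p, h in new_manifest.items() if manifest.get(p) != h)
    # Identical digests mean the expensive work can be skipped.
    unchanged = sorted(p for p, h in new_manifest.items() if manifest.get(p) == h)
    # Paths that vanished should be pruned from the index.
    deleted = sorted(p for p in manifest if p not in new_manifest)
    return changed, unchanged, deleted, new_manifest


old = {
    "a.pdf": sha256_of(b"v1"),
    "b.pdf": sha256_of(b"same"),
    "c.pdf": sha256_of(b"gone"),
}
files = {"a.pdf": b"v2", "b.pdf": b"same", "d.pdf": b"new"}
changed, unchanged, deleted, manifest = plan_reingest(old, files)
```

Here only a.pdf (modified) and d.pdf (new) would be reprocessed, b.pdf is skipped, and c.pdf is pruned.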
Embeddings and rerank on your InferX
Embeddings and reranking call your own OpenAI-compatible InferX endpoints — no third-party embedding API, no data leaving the perimeter. A dependency-free local store and hash embedder cover dev with no infra.
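For a sense of what a dependency-free hash embedder can look like, the sketch below uses the hashing trick: each token is hashed into one of a fixed number of buckets and the result is normalized. This is a generic technique shown under assumptions (64 dimensions, whitespace tokenization, SHA-256 as the hash), not a description of the embedder GrokX ships.

```python
import hashlib
import math


def hash_embed(text, dim=64):
    """Map text to a fixed-size unit vector by hashing each token into a
    bucket. No model and no third-party packages are involved."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        # A hash-derived sign reduces the bias from bucket collisions.
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def cosine(a, b):
    # Inputs are unit vectors (or all zeros), so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))


query = hash_embed("quarterly revenue report")
same = hash_embed("Quarterly Revenue Report")
empty = hash_embed("")
```

Such vectors capture token overlap rather than meaning, which is enough to exercise an indexing and search pipeline locally before pointing it at a real embedding endpoint.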
How it works
From a pile of PDFs to a cited answer.
- Step 01
Ingest and OCR the corpus
grokx push (or the web console, WebDAV, or a git connector) extracts born-digital text and OCRs scanned pages, chunks them, and indexes — once, incrementally, with provenance preserved.
- Step 02
Index into hybrid search
Chunks are embedded and stored alongside sparse keyword vectors in a per-KB collection in the vector DB — ready for fused, reranked retrieval, or mounted as a markdown mirror to grep.
- Step 03
Serve it to agents over MCP
grokx serve exposes the MCP server, which is registered on an AgentX Agent under the alias kb. The model calls mcp__kb__search when it needs evidence and gets back passages with source and page.
- Step 04
Agents answer with citations
Responses are grounded in your documents and anchored to the exact page — verifiable, not guessed.
Who it's for
Built for these teams
- Teams building agents that must answer from private documents
- Anyone with a corpus of scanned PDFs that lexical search can't read
- AI app teams that need grounded, citable answers — not hallucinations
- Platform teams standing up a shared, governed, multi-tenant knowledge index
Pairs well with
Other builder products
ConsoleX
Available
Log in, get a governed Kubernetes workspace. No kubectl, no tickets.
On first SSO login every user gets an isolated namespace with quotas, default-deny networking, storage, and a web terminal — provisioned automatically, reconciled continuously.
DevSpace
Available
Jupyter or VS Code on a GPU in seconds. Idle environments shut themselves down.
Single-click Jupyter, Marimo, Streamlit, Gradio, and VS Code environments — GPU-ready, isolated per user behind a per-pod auth proxy, with SSH access and idle shutdown by default.
TrainX
Available
Admins write the template. Users fill a form. Kubernetes runs the job.
Self-describing training templates render straight into UI forms — with live quota checks, streaming logs, parsed progress bars, and one-click TensorBoard.