SYS.VER // 1.0.0
BUILD // 2026.04

LATENCY // <5MS
OVERHEAD // <2%

BRAHMAI // BRAHX LABS

PROTO.STAGE // PRODUCTION
DEPS // ZERO


INFERX

The most powerful
and production-ready inference system.

Priority scheduling, KV cache governance, and real-time observability for enterprise AI. Zero dependencies. Under 5ms overhead.


SDK / Client (bearer token) → Auth + Rate Limit (<1ms) → InferX Control Plane → Priority Queue (max-heap) → Tensor Core → GPU (inference)

Real-Time Dashboard

Live performance
at a glance.

Metrics pushed via WebSocket at 1-second intervals. GPU utilization, KV cache pressure, queue depth, per-key usage — everything queryable.

Dashboard on port 8001 · Inference on port 8000 · Inference is never impacted by dashboard traffic

Dashboard preview (Overview)

Main: Overview · Requests · Metrics · Scheduler

System: KV Cache · Alerts · API Keys · Config

Integrations: Prometheus · PagerDuty

Total Requests: 18,492 (+0.94 this session)

Tokens / sec: 12.4k (+0.91 vs last window)

KV Cache: 67% (NORMAL · below warn)

SYS.OBSERVABILITY // REQ-BY-REQ

12,430 tok/s · Token Gen · KV Cache Ops · Schedule Wait

System Architecture

The control plane
between clients
& GPUs.

An end-to-end inference stack from client request to native Tensor Core execution. Hover nodes to explore each component in the orbital topology.


NODES: 9 · EDGES: 8

SYS.STATUS // ● ACTIVE

Global Ingress

Edge-to-core
fabric.

Raw data points converging from global edge nodes onto the central InferX execution ring. Zero abstraction.

us-east-1 // eu-central-1 // ap-northeast-1

Request Lifecycle

Nine-stage
middleware pipeline.

Every request traverses the full stack. No shortcuts — every decision logged with a reason code.

CORS · Origin filter

ReqID · UUID4 inject

Auth · Key validate

Rate · Token bucket

Admit · Queue check

Registry · State track

Schedule · Priority heap

Exec · Tensor fwd

Stream · SSE tokens
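The staged flow above can be sketched as a chain of small functions that each either pass the request along or reject it with a reason code. This is a minimal illustration, not InferX's actual middleware: the stage functions, the reason codes (`AUTH_INVALID_KEY`, `ADMIT_QUEUE_FULL`), the request fields, and the queue limit of 200 are all assumptions made for the example, and only three of the nine stages are shown.

```python
import uuid
from typing import Callable, List, Optional, Tuple

# Hypothetical request context; the field names are illustrative only.
Request = dict
# A stage returns None to continue, or a reason-code string to reject.
Stage = Callable[[Request], Optional[str]]

VALID_KEYS = {"demo-key"}  # stand-in for a real key store (assumption)


def req_id(req: Request) -> Optional[str]:
    """ReqID stage: attach a UUID4 so later log lines can be correlated."""
    req["request_id"] = str(uuid.uuid4())
    return None


def auth(req: Request) -> Optional[str]:
    """Auth stage: reject unknown keys with a reason code."""
    return None if req.get("api_key") in VALID_KEYS else "AUTH_INVALID_KEY"


def admit(req: Request, max_queue: int = 200) -> Optional[str]:
    """Admit stage: refuse new work when the queue is already too deep."""
    return None if req.get("queue_depth", 0) < max_queue else "ADMIT_QUEUE_FULL"


def run_pipeline(req: Request, stages: List[Stage]) -> Tuple[bool, list]:
    """Run stages in order; stop at the first rejection and log every decision."""
    log = []
    for stage in stages:
        reason = stage(req)
        if reason is not None:
            log.append((stage.__name__, "REJECT", reason))
            return False, log
        log.append((stage.__name__, "PASS", "OK"))
    return True, log


if __name__ == "__main__":
    ok, decisions = run_pipeline({"api_key": "wrong"}, [req_id, auth, admit])
    print(ok, decisions)
```

Keeping each stage as a plain function makes the order explicit and lets every decision, pass or reject, land in one log with its reason code.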

Data Ingress

Global Proxy Router

Traffic hits the edge. Authentication and rate limiting execute in under 1ms. Invalid requests are instantly rejected before system impact.

Priority Scheduler

Max-Heap Tensor Queue

Valid requests enter the priority queue. Enterprise tokens bypass standard traffic. Operators can inject prioritized jobs.

Execution

Native Tensor Core

Direct GPU execution via InferX's proprietary engine. Zero intermediate abstraction. Pure high-performance compute.

Admin Control

Intervention & Triage

Administrators can step in manually: preempt requests or engage circuit breakers gracefully, on the fly.

Capabilities

Everything
operators need.

Six production-grade components. Each engineered for one job, done without compromise.

01 // AUTH + RATE LIMIT

Millisecond-cached API key validation

SHA-256 hashed keys. In-memory cache keeps auth under 1ms. Token-bucket rate limiting with per-key budgets for requests per minute (RPM), requests per day (RPD), and tokens per minute (TPM).

APIKeyManager
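To make the two ideas above concrete, here is a minimal sketch of SHA-256 key hashing and a token bucket. It is an illustration under stated assumptions, not the `APIKeyManager` implementation: the names `hash_key` and `TokenBucket`, the injectable `now` argument, and the example budget are hypothetical.

```python
import hashlib
import time
from typing import Optional


def hash_key(raw_key: str) -> str:
    """Store and compare only the SHA-256 digest, never the raw API key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


class TokenBucket:
    """Token bucket: holds up to `capacity` tokens, refilled at a steady rate."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 now: Optional[float] = None) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = capacity  # start full
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    # A 60 RPM budget modeled as capacity 60, refilled at 1 token per second.
    rpm_bucket = TokenBucket(capacity=60, refill_per_sec=1.0)
    print(hash_key("example-key")[:12], rpm_bucket.allow())
```

A TPM budget could reuse the same class by passing the request's token count as `cost`; an RPD budget would need a much slower refill rate or a separate daily counter.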

02 // PRIORITY SCHEDULER

Max-heap queue with FIFO tie-breaking

Priorities range from 1 to 100, and production traffic wins. Admins can bump a request's priority in flight. The dispatch tick defaults to 100ms and is configurable.

PriorityScheduler
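A max-heap with FIFO tie-breaking can be sketched with Python's `heapq`, which is a min-heap: negate the priority and add a monotonically increasing sequence number. This is a minimal illustration, not the `PriorityScheduler` itself; the class name `DemoPriorityQueue` is hypothetical, and it assumes a higher number means higher priority. In-flight priority bumps and the dispatch tick are not covered.

```python
import heapq
import itertools
from typing import Any, List, Tuple


class DemoPriorityQueue:
    """Highest priority first; equal priorities leave in arrival order."""

    def __init__(self) -> None:
        self._heap: List[Tuple[int, int, Any]] = []
        self._seq = itertools.count()  # arrival counter used as the tie-breaker

    def push(self, priority: int, item: Any) -> None:
        if not 1 <= priority <= 100:
            raise ValueError("priority must be between 1 and 100")
        # heapq is a min-heap, so negate priority to pop the largest first.
        # The unique sequence number means items themselves are never compared.
        heapq.heappush(self._heap, (-priority, next(self._seq), item))

    def pop(self) -> Any:
        _, _, item = heapq.heappop(self._heap)
        return item

    def __len__(self) -> int:
        return len(self._heap)


if __name__ == "__main__":
    q = DemoPriorityQueue()
    q.push(50, "batch-job-a")
    q.push(90, "production-request")
    q.push(50, "batch-job-b")
    while q:
        print(q.pop())
```

Push and pop are both O(log n), which matches the heap complexity quoted in the engine blueprint.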

03 // KV GUARDIAN [LIVE DEFRAGMENTATION]

VRAM Heatmap & Graduated Eviction

Polls every 500ms. Four states: NORMAL → PRESSURE → CRITICAL → EMERGENCY. Eviction mitigates block fragmentation through a real-time compaction engine.

KVCacheController

VRAM heatmap · State: NORMAL (42%) · Legend: FREE · ALLOCATED · PRESSURE · CRITICAL
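The four-state classification can be sketched as a simple threshold mapping. This is a minimal illustration, not the `KVCacheController` logic: the default thresholds (70%, 85%, 97%) are taken from the KV_PRESSURE, KV_CRITICAL, and KV_EMERGENCY rows of the Alert Conditions Registry, while the function name `classify` and the assumption that states depend only on current usage are hypothetical.

```python
from enum import Enum


class KVState(Enum):
    NORMAL = "NORMAL"
    PRESSURE = "PRESSURE"
    CRITICAL = "CRITICAL"
    EMERGENCY = "EMERGENCY"


def classify(usage_pct: float, warn: float = 70.0, evict: float = 85.0,
             emergency: float = 97.0) -> KVState:
    """Map a KV cache usage percentage to one of the four states."""
    if usage_pct >= emergency:
        return KVState.EMERGENCY
    if usage_pct >= evict:
        return KVState.CRITICAL
    if usage_pct >= warn:
        return KVState.PRESSURE
    return KVState.NORMAL


if __name__ == "__main__":
    for pct in (42.0, 67.0, 70.0, 85.0, 97.0):
        print(pct, classify(pct).value)
```

A poller would call `classify` on each 500ms sample and trigger graduated eviction when the state moves past PRESSURE.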

04 // CIRCUIT BREAKER

CLOSED → OPEN → HALF-OPEN automaton

Detects when the core engine is unhealthy. Stops forwarding during an outage. Resumes gracefully on recovery. Never cascades.

CircuitBreaker
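The CLOSED → OPEN → HALF-OPEN automaton can be sketched in a few lines. This is a minimal illustration of the general pattern, not InferX's breaker: the class name `DemoCircuitBreaker`, the failure threshold, the recovery timeout, and the injectable `now` argument are all assumptions.

```python
import time
from enum import Enum
from typing import Optional


class BreakerState(Enum):
    CLOSED = "CLOSED"        # traffic flows normally
    OPEN = "OPEN"            # traffic is blocked
    HALF_OPEN = "HALF-OPEN"  # trial traffic allowed after the timeout


class DemoCircuitBreaker:
    def __init__(self, failure_threshold: int = 3,
                 recovery_timeout: float = 5.0) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.state = BreakerState.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self, now: Optional[float] = None) -> bool:
        """OPEN blocks traffic until the timeout elapses, then goes HALF-OPEN."""
        now = time.monotonic() if now is None else now
        if (self.state is BreakerState.OPEN
                and now - self.opened_at >= self.recovery_timeout):
            self.state = BreakerState.HALF_OPEN
        return self.state is not BreakerState.OPEN

    def record_success(self) -> None:
        """Any success closes the breaker and clears the failure count."""
        self.failures = 0
        self.state = BreakerState.CLOSED

    def record_failure(self, now: Optional[float] = None) -> None:
        """Open on a failed trial request or once the threshold is reached."""
        now = time.monotonic() if now is None else now
        self.failures += 1
        if (self.state is BreakerState.HALF_OPEN
                or self.failures >= self.failure_threshold):
            self.state = BreakerState.OPEN
            self.opened_at = now


if __name__ == "__main__":
    cb = DemoCircuitBreaker(failure_threshold=2, recovery_timeout=1.0)
    cb.record_failure()
    cb.record_failure()
    print(cb.state.value, cb.allow_request())
```

A caller would check `allow_request()` before forwarding to the engine and report the outcome with `record_success()` or `record_failure()`.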

05 // METRICS ENGINE

GPU · CPU · RAM · EXEC — everything

Collects 40+ metrics every 500ms: pynvml for NVIDIA GPUs, psutil for CPU and RAM, and a Prometheus scrape for the engine's internal execution state.

MetricsCollector

06 // ALERT ENGINE

14 built-in conditions, cooldown logic

Dispatches to PagerDuty Events API v2, Slack, and webhooks. Per-alert cooldowns prevent alert storms. RESOLVED alerts are dispatched automatically.

AlertEngine

Alert Conditions

14 conditions,
zero noise.

Independent cooldown timers. RESOLVED events auto-dispatched when conditions clear. Configurable via config.yaml.

Alert Conditions Registry

14 CONDITIONS · 5 MIN COOLDOWN

Alert ID	Condition	Threshold	Severity
KV_PRESSURE	KV cache ≥ warn	70%	WARN
KV_CRITICAL	KV cache ≥ evict	85%	CRIT
KV_EMERGENCY	KV cache ≥ emergency	97%	EMRG
GPU_MEM_HIGH	GPU VRAM ≥ threshold	90%	WARN
GPU_UTIL_HIGH	GPU util ≥ threshold	95%	WARN
CPU_HIGH	CPU usage ≥ threshold	85%	WARN
RAM_HIGH	RAM usage ≥ threshold	90%	CRIT
QUEUE_DEEP	Queue depth ≥ threshold	200	CRIT
CIRCUIT_OPEN	Breaker opened	N/A	CRIT
ENGINE_DOWN	Health check failed	N/A	EMRG
HIGH_PREEMPT	Preemptions/min ≥	10/min	WARN
TPT_DROP	Tokens/sec drops ≥	50%	WARN
WAIT_LONG	Request waiting ≥	30s	WARN
KEY_EXPIRED	API key TTL hit	N/A	INFO
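Per-alert cooldowns with automatic RESOLVED events can be sketched as a small tracker keyed by alert ID. This is a minimal illustration, not the `AlertEngine` implementation: the class name `DemoAlertTracker`, the `TRIGGERED` label, and the choice to keep the cooldown running across a resolve are assumptions. The 300-second default mirrors the 5-minute cooldown stated in the registry header.

```python
from typing import Dict, Optional, Set


class DemoAlertTracker:
    """Independent cooldown timer per alert ID, plus a RESOLVED event."""

    def __init__(self, cooldown_s: float = 300.0) -> None:
        self.cooldown_s = cooldown_s
        self._last_fired: Dict[str, float] = {}  # alert ID -> last dispatch time
        self._active: Set[str] = set()           # alerts currently firing

    def evaluate(self, alert_id: str, firing: bool, now: float) -> Optional[str]:
        """Return "TRIGGERED", "RESOLVED", or None when nothing is dispatched."""
        if firing:
            self._active.add(alert_id)
            last = self._last_fired.get(alert_id)
            if last is None or now - last >= self.cooldown_s:
                self._last_fired[alert_id] = now
                return "TRIGGERED"
            return None  # still inside this alert's cooldown window
        if alert_id in self._active:
            self._active.discard(alert_id)
            return "RESOLVED"  # condition cleared after having fired
        return None


if __name__ == "__main__":
    tracker = DemoAlertTracker()
    print(tracker.evaluate("KV_PRESSURE", True, now=0.0))
    print(tracker.evaluate("KV_PRESSURE", True, now=60.0))
    print(tracker.evaluate("KV_PRESSURE", False, now=120.0))
```

Because each alert ID has its own timestamp, a noisy KV_PRESSURE condition cannot suppress an unrelated CPU_HIGH alert.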

4.2ms · P99 latency overhead

<2% · CPU overhead at peak load

186MB · Memory footprint

500ms · KV cache poll interval

Benchmarks

Empirical
proof.

Head-to-head against raw inference engines. Same model (Llama-3 70B), same hardware (8× H100 SXM5), same prompt distribution. InferX adds governance at near-zero cost.

Metric	INFERX	Raw vLLM	TGI	TensorRT-LLM
TTFT (Time to First Token)	18ms BEST	22ms	31ms	19ms
ITL (Inter-Token Latency)	4.8ms BEST	5.1ms	7.2ms	5.0ms
P99 Tail Latency	42ms BEST	48ms	89ms	45ms
Max Concurrent QPS	45,200 BEST	38,100	21,500	41,800
Control Plane Overhead	<1.8%	N/A (no control plane)	N/A	N/A
Priority Scheduling	✓ Native 1-100	✗	✗	✗
KV Cache Governance	✓ 4-state FSM	✗	✗	Partial

Benchmark conducted 2026-04-15 · H100 SXM5 80GB × 8 · NVLink 4.0 · Llama-3 70B FP16 · Prompt: 512 tokens avg · ShareGPT distribution

Hardware Topology

Multi-GPU
orchestration.

InferX maps tensor parallelism and pipeline parallelism across your full GPU fabric. Every device tracked, every link monitored.

NVLink 4.0 · 900 GB/s

PCIe Gen5 · x16

Tensor Parallel · TP=8

H100 SXM5 · 72.4 / 80 GB

H100 SXM5 · 68.1 / 80 GB

H100 SXM5 · 71.8 / 80 GB

H100 SXM5 · 65.3 / 80 GB

H100 SXM5 · 70.2 / 80 GB

H100 SXM5 · 73.6 / 80 GB

H100 SXM5 · 69.9 / 80 GB

H100 SXM5 · 74.1 / 80 GB

Configuration

Operator
cockpit.

Every parameter is runtime-tunable. Zero restarts required. Drag sliders to see the live config output.

MAX_BATCH_SIZE 64

KV_EVICTION_THRESHOLD 85%

DISPATCH_TICK_MS 100

DEFAULT_PRIORITY 50

PAGERDUTY_ENABLED

PROMETHEUS_EXPORT

CIRCUIT_BREAKER

REQUEST_LOGGING

ARCH.BLUEPRINT // ENGINE


INGRESS_TCP

PORT 8000 / ASGI

Accepts incoming TCP connections, manages Uvicorn worker pools.

AUTH_GATE

API_KEY / LIMITS

O(1) in-memory hash lookup. Token bucket enforces RPM/TPM limits. Unauthenticated requests are dropped immediately.

CIRCUIT_BREAKER

STATE: CLOSED

Gates traffic based on core health. Prevents cascading failures under high pressure.

PRIORITY_DISPATCH

MAX-HEAP 100MS TICK

O(LOG N) HEAP QUEUE

Sorts 10,000+ pending blocks. Preempts low-priority inferences for production traffic.

KV VRAM GUARDIAN

4-state FSM actively defragments VRAM. Evicts sequence blocks if threshold hit.

TENSOR_CORE

VLLM KERNEL · CUDA GRAPH

ZERO OVERHEAD EXEC

Continuous batching and SSE streaming directly from compiled C++/CUDA engines.


Deploy

Four commands
to production.

Zero external dependencies. Bare-metal or Docker. OpenAI-compatible — any existing SDK works unchanged.

inferx-cli — bash

$ git clone https://github.com/brahmai/inferx

Cloning into 'inferx'...

$ pip install -e .

✓ Installed (0 external deps)

$ inferx init --config config.yaml

✓ Config validated · Key created

$ inferx serve --port 8000

✓ INFERX v1.0.0 on :8000

✓ Dashboard :8001 · Admin :8002

Circuit: CLOSED · KV: 0.0%

OpenAI Compatible

# Drop-in replacement: point the official OpenAI client at InferX
import openai

client = openai.OpenAI(
    base_url="http://localhost:8000/v1",  # InferX inference port
    api_key="inferx_sk_...",
)

Integrations

📊 Prometheus

🔔 PagerDuty

🗄 MongoDB

💬 Slack

All opt-in. None required for boot.