SYS.VER // 1.0.0
BUILD // 2026.04
LATENCY // <5MS
OVERHEAD // <2%
BRAHMAI // BRAHX LABS
PROTO.STAGE // PRODUCTION
DEPS // ZERO

INFERX

The most powerful
and production-ready inference system.

Priority scheduling, KV cache governance, and real-time observability for enterprise AI. Zero dependencies. Under 5ms overhead.

Request path:
SDK / Client (bearer token)
Auth + Rate Limit (<1ms)
InferX Control Plane
Priority Queue (max-heap)
Tensor Core → GPU (inference)

P99 Latency Overhead: 4.2ms (↓ +18% vs baseline)
CPU Overhead: 1.8% (non-blocking async)
Memory: 186MB (zero external deps)
Alert Rules: 14 (built-in conditions)
Real-Time Dashboard
Live performance
at a glance.

Metrics pushed via WebSocket at 1-second intervals. GPU utilization, KV cache pressure, queue depth, per-key usage — everything queryable.

Dashboard on port 8001 · Inference on 8000 · Inference path never impacted by dashboard traffic
Dashboard preview
Main: Overview, Requests, Metrics, Scheduler. System: KV Cache, Alerts, API Keys, Config. Integrations: Prometheus, PagerDuty.
Total Requests: 18,492 (+0.94 this session)
Tokens / sec: 12.4k (+0.91 vs last window)
KV Cache: 67% (NORMAL · below warn)
SYS.OBSERVABILITY // REQ-BY-REQ: 12,430 tok/s, broken down by Token Gen, KV Cache Ops, and Schedule Wait
System Architecture
The control plane
between clients
& GPUs.

An end-to-end inference stack from client request to native Tensor Core execution.
NODES: 9 · EDGES: 8
SYS.STATUS // ● ACTIVE

Global Ingress
Edge-to-core
fabric.

Raw data points converging from global edge nodes onto the central InferX execution ring. Zero abstraction.

us-east-1 // eu-central-1 // ap-northeast-1
Request Lifecycle
Nine-stage
middleware pipeline.

Every request traverses the full stack. No shortcuts — every decision logged with a reason code.

01 CORS: Origin filter
02 ReqID: UUID4 inject
03 Auth: Key validate
04 Rate: Token bucket
05 Admit: Queue check
06 Registry: State track
07 Schedule: Priority heap
08 Exec: Tensor fwd
09 Stream: SSE tokens
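
The staged pipeline above can be sketched as a chain of small functions that each return a pass/fail decision plus a reason code. This is a minimal illustration, not InferX's implementation: only the first three stages are shown, and every name here (`Request`, `run_pipeline`, `ALLOWED_ORIGINS`, `VALID_KEYS`, the reason codes) is hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# Hypothetical allow-lists, for illustration only.
ALLOWED_ORIGINS = {"https://app.example.com"}
VALID_KEYS = {"demo-key"}


@dataclass
class Request:
    origin: str
    api_key: str
    request_id: Optional[str] = None
    # One (stage name, reason code) entry per stage that ran.
    log: List[Tuple[str, str]] = field(default_factory=list)


# A stage inspects the request and returns (ok, reason_code).
Stage = Callable[[Request], Tuple[bool, str]]


def cors(req: Request) -> Tuple[bool, str]:
    ok = req.origin in ALLOWED_ORIGINS
    return ok, ("ORIGIN_OK" if ok else "ORIGIN_BLOCKED")


def req_id(req: Request) -> Tuple[bool, str]:
    req.request_id = str(uuid.uuid4())  # UUID4 inject
    return True, "REQID_SET"


def auth(req: Request) -> Tuple[bool, str]:
    ok = req.api_key in VALID_KEYS
    return ok, ("KEY_OK" if ok else "KEY_INVALID")


STAGES: List[Tuple[str, Stage]] = [("CORS", cors), ("ReqID", req_id), ("Auth", auth)]


def run_pipeline(req: Request, stages: List[Tuple[str, Stage]]) -> bool:
    """Run stages in order, log every decision, stop at the first rejection."""
    for name, stage in stages:
        ok, reason = stage(req)
        req.log.append((name, reason))
        if not ok:
            return False
    return True
```

The remaining stages (Rate, Admit, Registry, Schedule, Exec, Stream) would be appended to `STAGES` in the same shape.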
Data Ingress
Global Proxy Router
Traffic hits the edge. Authentication and rate limiting execute in under 1ms. Invalid requests are rejected immediately, before they reach the scheduler or the GPU.
Priority Scheduler
Max-Heap Tensor Queue
Valid requests enter the priority queue. Enterprise tokens bypass standard traffic. Operators can inject prioritized jobs.
Execution
Native Tensor Core
Direct GPU execution via InferX's proprietary engine. Zero intermediate abstraction. Pure high-performance compute.
Admin Control
Intervention & Triage
Administrators can step in manually: preempt individual requests or engage the circuit breaker on the fly, with graceful handling.
Capabilities
Everything
operators need.

Six production-grade components. Each engineered for one job, done without compromise.

01 // AUTH + RATE LIMIT
Millisecond-cached API key validation
SHA-256 hashed keys. In-memory cache keeps auth under 1ms. Token-bucket rate limiting with per-key RPM, RPD, and TPM budgets.
APIKeyManager
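
As a rough sketch of how hashed-key lookup and a token bucket can fit together: keys are stored only as SHA-256 digests in a dictionary, and each key owns a bucket sized from its RPM budget. The class and method names (`KeyCache`, `TokenBucket`, `check`) and the result strings are assumptions for illustration, not the `APIKeyManager` API; RPD and TPM budgets are omitted.

```python
import hashlib
import time


def hash_key(raw_key: str) -> str:
    """Return the SHA-256 hex digest used as the cache key."""
    return hashlib.sha256(raw_key.encode("utf-8")).hexdigest()


class TokenBucket:
    """Classic token bucket: refills continuously, spends one token per request."""

    def __init__(self, capacity: float, refill_per_sec: float, now=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._now = now  # injectable clock, which makes the bucket testable
        self.tokens = float(capacity)
        self.updated = now()

    def allow(self, cost: float = 1.0) -> bool:
        t = self._now()
        elapsed = t - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = t
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class KeyCache:
    """In-memory map from key digest to that key's rate-limit bucket."""

    def __init__(self):
        self._buckets = {}

    def register(self, raw_key: str, rpm: int, now=time.monotonic) -> None:
        # An RPM budget becomes a bucket of `rpm` tokens refilled at rpm/60 per second.
        self._buckets[hash_key(raw_key)] = TokenBucket(rpm, rpm / 60.0, now)

    def check(self, raw_key: str) -> str:
        bucket = self._buckets.get(hash_key(raw_key))  # O(1) dictionary lookup
        if bucket is None:
            return "KEY_INVALID"
        return "OK" if bucket.allow() else "RATE_LIMITED"
```

The raw key never needs to be stored: only its digest is kept, and lookups hash the presented key before comparing.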
02 // PRIORITY SCHEDULER
Max-heap queue with FIFO tie-breaking
Priority 1–100. Production traffic wins. Admin can bump priority in-flight. Dispatch tick every 100ms, configurable.
PriorityScheduler
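
A max-heap with FIFO tie-breaking can be sketched on top of Python's `heapq` (a min-heap) by negating the priority and adding a monotonically increasing sequence number. This is an illustrative sketch, not the `PriorityScheduler` source; `HeapQueueSketch` and its methods are hypothetical names, and the default of 50 mirrors the `DEFAULT_PRIORITY` shown in the configuration panel.

```python
import heapq
import itertools


class HeapQueueSketch:
    """Max-heap by priority (1-100); equal priorities dequeue in arrival order."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival counter used as the tie-breaker

    def push(self, request_id: str, priority: int = 50) -> None:
        if not 1 <= priority <= 100:
            raise ValueError("priority must be between 1 and 100")
        # heapq is a min-heap, so the priority is negated to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._seq), request_id))

    def pop(self) -> str:
        _, _, request_id = heapq.heappop(self._heap)
        return request_id

    def __len__(self) -> int:
        return len(self._heap)


queue = HeapQueueSketch()
queue.push("batch-job-a")            # default priority 50
queue.push("prod-request", 90)       # higher priority, dispatched first
queue.push("batch-job-b")            # same priority as job a, arrives later
order = [queue.pop() for _ in range(len(queue))]
```

Both `push` and `pop` are O(log N), which matches the heap-queue cost quoted in the architecture blueprint.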
03 // KV GUARDIAN [LIVE DEFRAGMENTATION]
VRAM Heatmap & Graduated Eviction
Polls every 500ms. Four states: NORMAL → PRESSURE → CRITICAL → EMERGENCY. Graduated eviction frees sequence blocks as pressure rises, and a real-time compaction engine defragments VRAM.
KVCacheController
VRAM heatmap · state: NORMAL · 42% · legend: FREE, ALLOCATED, PRESSURE, CRITICAL, SCANNER
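
The four-state model can be sketched as a simple threshold classifier. This assumes the state boundaries are the 70% / 85% / 97% thresholds listed for KV_PRESSURE, KV_CRITICAL, and KV_EMERGENCY in the alert registry; the function name and signature are hypothetical, and the real `KVCacheController` may add hysteresis or other logic not shown here.

```python
from enum import Enum


class KVState(Enum):
    NORMAL = "NORMAL"
    PRESSURE = "PRESSURE"
    CRITICAL = "CRITICAL"
    EMERGENCY = "EMERGENCY"


def classify_kv_usage(
    usage_pct: float,
    warn: float = 70.0,       # assumed from the KV_PRESSURE alert threshold
    evict: float = 85.0,      # assumed from the KV_CRITICAL alert threshold
    emergency: float = 97.0,  # assumed from the KV_EMERGENCY alert threshold
) -> KVState:
    """Map a KV cache usage percentage to one of the four states."""
    if usage_pct >= emergency:
        return KVState.EMERGENCY
    if usage_pct >= evict:
        return KVState.CRITICAL
    if usage_pct >= warn:
        return KVState.PRESSURE
    return KVState.NORMAL
```

A poller running every 500ms would call this on each reading and trigger eviction when the state reaches CRITICAL or above.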
04 // CIRCUIT BREAKER
CLOSED → OPEN → HALF-OPEN automaton
Detects when the core engine is unhealthy. Stops forwarding during an outage and resumes gracefully on recovery, so failures never cascade.
CircuitBreaker
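
The CLOSED → OPEN → HALF-OPEN cycle can be sketched as a small state machine. The failure threshold, recovery delay, and all names below are assumptions chosen for illustration; the page does not state the values the `CircuitBreaker` component uses.

```python
import time


class BreakerSketch:
    """Three-state circuit breaker: CLOSED, OPEN, HALF-OPEN."""

    def __init__(self, failure_threshold: int = 3, recovery_seconds: float = 10.0,
                 now=time.monotonic):
        self.failure_threshold = failure_threshold  # hypothetical default
        self.recovery_seconds = recovery_seconds    # hypothetical default
        self._now = now
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def allow_request(self) -> bool:
        """Return True if traffic may be forwarded to the engine."""
        if self.state == "OPEN":
            if self._now() - self.opened_at >= self.recovery_seconds:
                # Recovery window elapsed: let traffic through as a probe.
                self.state = "HALF-OPEN"
                return True
            return False
        # CLOSED and HALF-OPEN both forward; a stricter breaker would
        # admit only a single probe request while HALF-OPEN.
        return True

    def record(self, success: bool) -> None:
        """Report the outcome of a forwarded request or health check."""
        if success:
            self.failures = 0
            self.state = "CLOSED"
            return
        self.failures += 1
        # A failed probe reopens immediately; otherwise wait for the threshold.
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self._now()
```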
05 // METRICS ENGINE
GPU · CPU · RAM · EXEC — everything
40+ metrics per 500ms cycle. pynvml for NVIDIA, psutil for CPU/RAM, Prometheus scrape for native internal execution states.
MetricsCollector
06 // ALERT ENGINE
14 built-in conditions, cooldown logic
Dispatches to PagerDuty Events API v2, Slack, and webhooks. Per-alert cooldown prevents alert storms. RESOLVED alerts are dispatched automatically.
AlertEngine
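
Per-alert cooldown with automatic RESOLVED events can be sketched as follows. The 300-second default reflects the 5-minute cooldown shown in the alert registry; the class name, method, and return values are hypothetical, and actual dispatch to PagerDuty, Slack, or a webhook is left out.

```python
import time
from typing import Optional


class AlertCooldownSketch:
    """Tracks one independent cooldown timer per alert ID."""

    def __init__(self, cooldown_seconds: float = 300.0, now=time.monotonic):
        self.cooldown = cooldown_seconds
        self._now = now
        self._last_fired = {}  # alert_id -> time of the last FIRING dispatch
        self._active = set()   # alert IDs whose condition is currently met

    def evaluate(self, alert_id: str, condition_met: bool) -> Optional[str]:
        """Return 'FIRING', 'RESOLVED', or None when nothing should be dispatched."""
        t = self._now()
        if condition_met:
            self._active.add(alert_id)
            last = self._last_fired.get(alert_id)
            if last is None or t - last >= self.cooldown:
                self._last_fired[alert_id] = t
                return "FIRING"
            return None  # still inside the cooldown window: suppress
        if alert_id in self._active:
            self._active.discard(alert_id)
            return "RESOLVED"  # condition cleared since the last evaluation
        return None
```

Because timers are keyed by alert ID, a noisy `KV_PRESSURE` condition does not delay the first `CPU_HIGH` notification.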
Alert Conditions
14 conditions,
zero noise.

Independent cooldown timers. RESOLVED events auto-dispatched when conditions clear. Configurable via config.yaml.

Alert Conditions Registry
14 CONDITIONS · 5 MIN COOLDOWN
Alert ID        Condition                  Threshold   Severity
KV_PRESSURE     KV cache ≥ warn            70%         WARN
KV_CRITICAL     KV cache ≥ evict           85%         CRIT
KV_EMERGENCY    KV cache ≥ emergency       97%         EMRG
GPU_MEM_HIGH    GPU VRAM ≥ threshold       90%         WARN
GPU_UTIL_HIGH   GPU util ≥ threshold       95%         WARN
CPU_HIGH        CPU usage ≥ threshold      85%         WARN
RAM_HIGH        RAM usage ≥ threshold      90%         CRIT
QUEUE_DEEP      Queue depth ≥ threshold    200         CRIT
CIRCUIT_OPEN    Breaker opened             N/A         CRIT
ENGINE_DOWN     Health check failed        N/A         EMRG
HIGH_PREEMPT    Preemptions/min ≥          10/min      WARN
TPT_DROP        Tokens/sec drops ≥         50%         WARN
WAIT_LONG       Request waiting ≥          30s         WARN
KEY_EXPIRED     API key TTL hit            N/A         INFO
4.2ms · P99 latency overhead
<2% · CPU overhead at peak load
186MB · Memory footprint
500ms · KV cache poll interval
Benchmarks
Empirical
proof.

Head-to-head against raw inference engines. Same model (Llama-3 70B), same hardware (8× H100 SXM5), same prompt distribution. InferX adds governance at near-zero cost.

Metric                       INFERX          Raw vLLM                 TGI      TensorRT-LLM
TTFT (Time to First Token)   18ms (best)     22ms                     31ms     19ms
ITL (Inter-Token Latency)    4.8ms (best)    5.1ms                    7.2ms    5.0ms
P99 Tail Latency             42ms (best)     48ms                     89ms     45ms
Max Concurrent QPS           45,200 (best)   38,100                   21,500   41,800
Control Plane Overhead       <1.8%           N/A (no control plane)   N/A      N/A
Priority Scheduling ✓ Native 1-100
KV Cache Governance ✓ 4-state FSM Partial
Benchmark conducted 2026-04-15 · H100 SXM5 80GB × 8 · NVLink 4.0 · Llama-3 70B FP16 · Prompt: 512 tokens avg · ShareGPT distribution
Hardware Topology
Multi-GPU
orchestration.

InferX maps tensor parallelism and pipeline parallelism across your full GPU fabric. Every device tracked, every link monitored.

G0 · H100 SXM5 · 72.4 / 80 GB
G1 · H100 SXM5 · 68.1 / 80 GB
G2 · H100 SXM5 · 71.8 / 80 GB
G3 · H100 SXM5 · 65.3 / 80 GB
G4 · H100 SXM5 · 70.2 / 80 GB
G5 · H100 SXM5 · 73.6 / 80 GB
G6 · H100 SXM5 · 69.9 / 80 GB
G7 · H100 SXM5 · 74.1 / 80 GB
Configuration
Operator
cockpit.

Every parameter is runtime-tunable. Zero restarts required.

MAX_BATCH_SIZE 64
KV_EVICTION_THRESHOLD 85%
DISPATCH_TICK_MS 100
DEFAULT_PRIORITY 50
PAGERDUTY_ENABLED
PROMETHEUS_EXPORT
CIRCUIT_BREAKER
REQUEST_LOGGING
ARCH.BLUEPRINT // ENGINE
INGRESS_TCP
PORT 8000 / ASGI
Accepts incoming TCP connections, manages Uvicorn worker pools.
AUTH_GATE
API_KEY / LIMITS
O(1) in-memory hash lookup. Token bucket enforces RPM/TPM. Unauthenticated requests are dropped immediately.
CIRCUIT_BREAKER
STATE: CLOSED
Gates traffic based on core health. Prevents cascading failures under high pressure.
PRIORITY_DISPATCH
MAX-HEAP 100MS TICK
O(LOG N) HEAP QUEUE
Sorts 10,000+ pending blocks. Preempts low-priority inferences for production traffic.
KV VRAM GUARDIAN
4-state FSM actively defragments VRAM. Evicts sequence blocks if threshold hit.
TENSOR_CORE
VLLM KERNEL · CUDA GRAPH
ZERO OVERHEAD EXEC
Continuous batching and SSE streaming directly from compiled C++/CUDA engines.
Deploy
Four commands
to production.

Zero external dependencies. Bare-metal or Docker. OpenAI-compatible — any existing SDK works unchanged.

inferx-cli — bash
$ git clone https://github.com/brahmai/inferx && cd inferx
Cloning into 'inferx'...
$ pip install -e .
✓ Installed (0 external deps)
$ inferx init --config config.yaml
✓ Config validated · Key created
$ inferx serve --port 8000
✓ INFERX v1.0.0 on :8000
✓ Dashboard :8001 · Admin :8002
Circuit: CLOSED · KV: 0.0%
$
OpenAI Compatible
# Drop-in replacement
import openai

client = openai.OpenAI(
  base_url="http://localhost:8000/v1",
  api_key="inferx_sk_..."
)
Integrations
📊 Prometheus
🔔 PagerDuty
🗄 MongoDB
💬 Slack
All opt-in. None required for boot.