Priority scheduling, KV cache governance, and real-time observability for enterprise AI. Zero dependencies. Under 5ms overhead.
Metrics pushed via WebSocket at 1-second intervals. GPU utilization, KV cache pressure, queue depth, per-key usage — everything queryable.
An end-to-end inference stack from client request to native Tensor Core execution. Hover nodes to explore each component in the orbital topology.
Raw data points converging from global edge nodes onto the central InferX execution ring. Zero abstraction.
Every request traverses the full stack. No shortcuts — every decision logged with a reason code.
Six production-grade components. Each engineered for one job, done without compromise.
Independent cooldown timers. RESOLVED events auto-dispatched when conditions clear. Configurable via config.yaml.
| Alert ID | Condition | Threshold | Severity |
|---|---|---|---|
| KV_PRESSURE | KV cache ≥ warn | 70% | WARN |
| KV_CRITICAL | KV cache ≥ evict | 85% | CRIT |
| KV_EMERGENCY | KV cache ≥ emergency | 97% | EMRG |
| GPU_MEM_HIGH | GPU VRAM ≥ threshold | 90% | WARN |
| GPU_UTIL_HIGH | GPU util ≥ threshold | 95% | WARN |
| CPU_HIGH | CPU usage ≥ threshold | 85% | WARN |
| RAM_HIGH | RAM usage ≥ threshold | 90% | CRIT |
| QUEUE_DEEP | Queue depth ≥ threshold | 200 | CRIT |
| CIRCUIT_OPEN | Breaker opened | N/A | CRIT |
| ENGINE_DOWN | Health check failed | N/A | EMRG |
| HIGH_PREEMPT | Preemptions/min ≥ | 10/min | WARN |
| TPT_DROP | Tokens/sec drops ≥ | 50% | WARN |
| WAIT_LONG | Request waiting ≥ | 30s | WARN |
| KEY_EXPIRED | API key TTL hit | N/A | INFO |
Head-to-head against raw inference engines. Same model (Llama-3 70B), same hardware (8× H100 SXM5), same prompt distribution. InferX adds governance at near-zero cost.
| Metric | INFERX | Raw vLLM | TGI | TensorRT-LLM |
|---|---|---|---|---|
| TTFT (Time to First Token) | 18ms BEST | 22ms | 31ms | 19ms |
| ITL (Inter-Token Latency) | 4.8ms BEST | 5.1ms | 7.2ms | 5.0ms |
| P99 Tail Latency | 42ms BEST | 48ms | 89ms | 45ms |
| Max Concurrent QPS | 45,200 BEST | 38,100 | 21,500 | 41,800 |
| Control Plane Overhead | <1.8% | N/A (no control plane) | N/A | N/A |
| Priority Scheduling | ✓ Native 1-100 | ✗ | ✗ | ✗ |
| KV Cache Governance | ✓ 4-state FSM | ✗ | ✗ | Partial |
InferX maps tensor parallelism and pipeline parallelism across your full GPU fabric. Every device tracked, every link monitored.
Every parameter is runtime-tunable. Zero restarts required. Drag sliders to see the live config output.
Zero external dependencies. Bare-metal or Docker. OpenAI-compatible — any existing SDK works unchanged.