K-AI 192 RomeDual 4090 5288TOPS — 8× RTX 4090 — Dual EPYC Milan
192 GB VRAM 8-GPU Inference Server
8x RTX 4090 | Dual EPYC Milan | 5 288 TOPS INT8 | 192 GB VRAM pool | 8-GPU tensor parallel | Dual CPU 96C/192T

Flagship 8x gaming-GPU inference box: a 192 GB pool at consumer-card economics on a dual-socket EPYC Milan platform.

A 7U 8-GPU chassis built around dual EPYC 7643 Milan CPUs (96C/192T total), an ASRock Rack ROME2D32GM-NL dual-SP3 motherboard, 512 GB DDR4 ECC, a 2 TB NVMe boot drive, and a 5x 1200 W server PSU set. Eight GeForce RTX 4090 cards connect via active PCIe Gen4 retimer risers at full x16. The cheapest path to 192 GB frontier MoE inference on Kentino hardware.

Hardware
Component | Detail
GPUs | 8x NVIDIA GeForce RTX 4090 24 GB GDDR6X (Ada Lovelace, 450 W, PCIe 4.0 x16)
VRAM pool | 192 GB total across 8 cards (no NVLink on consumer RTX 4090)
CPU | 2x AMD EPYC 7643 Milan (48C/96T each, 96C/192T total, 225 W each, 2x 128 PCIe 4.0 lanes)
Motherboard | ASRock Rack ROME2D32GM-NL (dual SP3, PCIe 4.0, 32x DDR4 ECC DIMM slots)
System RAM | 512 GB DDR4-2666 ECC RDIMM (8x 64 GB, 4 per socket for 8-channel balance)
Boot / storage | 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply | 5x 1200 W server PSU set (HP-compatible, hot-swap) + full 12VHPWR adapter set
Chassis | 7U 8-GPU chassis (up to 10 PCIe cards including risers)
Risers | 8x active PCIe Gen4 x16 retimer risers (required over cable length)
Cooling | 2x Arctic Freezer 4U-M SP3 tower coolers + rack-mount front-to-back airflow (industrial fans)
Network | Onboard dual 10 GbE (Intel X550)

Power envelope
- GPU draw: 8 x 450 W = 3 600 W
- CPU draw: 2 x 225 W = 450 W
- System total at full load: ~4 200 W
- PSU total: 6 000 W all-active (5x 1200 W), 30 % headroom

Lane topology
ROME2D32GM-NL exposes 2x 128 PCIe Gen4 lanes (one 128-lane pool per EPYC socket) direct to the GPU slots. Active Gen4 retimer risers preserve signal integrity. No PCIe switch. No NVLink. Measured 19-22 GB/s inter-GPU peer-to-peer on the 4-GPU bench.
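The power envelope figures above can be re-derived with a few lines of arithmetic. The sketch below is illustrative only: the function name `power_budget` and its keys are hypothetical, the "other" draw is simply what remains of the quoted ~4 200 W total after GPUs and CPUs, and the one-PSU-down margin assumes the remaining supplies share the full load evenly, which the listing does not state.

```python
def power_budget(gpus=8, gpu_w=450, cpus=2, cpu_w=225,
                 psus=5, psu_w=1200, system_total_w=4200):
    """Recompute the listing's power envelope from its quoted inputs."""
    gpu_draw = gpus * gpu_w
    cpu_draw = cpus * cpu_w
    psu_total = psus * psu_w
    return {
        "gpu_draw_w": gpu_draw,
        "cpu_draw_w": cpu_draw,
        # Implied remainder (board, RAM, storage, fans); not itemised in the listing.
        "other_draw_w": system_total_w - gpu_draw - cpu_draw,
        "psu_total_w": psu_total,
        # Fraction of PSU capacity left unused at the quoted full load.
        "headroom": 1 - system_total_w / psu_total,
        # Assumption: load is shared evenly if one supply drops out.
        "margin_one_psu_down_w": (psus - 1) * psu_w - system_total_w,
    }


if __name__ == "__main__":
    budget = power_budget()
    for key, value in budget.items():
        shown = f"{value:.1%}" if key == "headroom" else f"{value} W"
        print(f"{key}: {shown}")
```

Changing `gpu_w` to a lower power limit shows how much extra headroom a capped configuration would leave.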
What you can run
With 192 GB across 8 cards, this server handles 200B+ frontier MoE at Q4, 8-way tensor-parallel inference, tenant-isolated multi-model serving, and high-batch throughput at consumer-card economics.

LLMs: text / reasoning / coding

Chinese frontier
- Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q4 (~132 GB) with long context, the hero config (~15-25 tok/s single-stream on 8x RTX 4090); Qwen3-Coder-480B-A35B Q2 (~160 GB); Qwen3.5-122B-A10B fp8 (~75 GB) multi-stream; Qwen3-32B dense bf16 x multiple concurrent streams
- DeepSeek: DeepSeek-V3/R1 Q2 (~215 GB with 512 GB host spill); DeepSeek-R2 32B bf16, up to 8 concurrent streams, one per card (~30-40 tok/s per stream)
- GLM / Z.ai: GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB); GLM-4.5-Air fp8 or bf16; GLM-4.6V 106B
- Tencent Hunyuan: Hunyuan-Large Q3 (~160 GB); Hunyuan-A13B Q4/Q6 (RTX 4090 is Ada: fp8 upcasts to bf16, use GGUF quants)
- Others: Baidu ERNIE-4.5-424B Q3 (~180 GB); InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3.5-397B Q3 (~170 GB); MiniMax-M1 Q3 (~180 GB)

Western frontier
- Meta Llama: Llama 3.3 70B bf16 with massive KV (~20 tok/s single-stream Q4, ~179 tok/s batch-32 vLLM; Kentino measured on 4-GPU bench); Llama 4 Scout bf16 (~218 GB, tight); Llama 4 Maverick Q3 (~188 GB)
- Mistral: Mistral Large 2 / Pixtral Large 123B Q6 (comfortable) or bf16 (~248 GB, with spill); Mistral Small 3 multi-stream
- OpenAI (open weights): gpt-oss-120b MXFP4 native (80 GB) with huge KV
- NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q4 (~147 GB); Super 49B bf16
- Others: Cohere Command R+ 104B Q6 (~85 GB); Google Gemma 3 27B bf16 x multiple streams

Vision-Language Models
InternVL3.5-241B-A28B Q4 (~135 GB); Qwen3-VL-235B-A22B Q4; Qwen3-VL-32B bf16 multi-stream; Llama 3.2 90B Vision bf16 (~180 GB); Pixtral Large 124B Q6; Molmo 72B bf16; GLM-4.6V 106B fp8/Q6; Gemma 3 27B multimodal x multiple streams.
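The footprints quoted above can be checked against the 192 GB pool with a rough placement estimate. This is a minimal sketch, not a sizing tool: `place`, `Placement`, and the `reserve_gb_per_gpu` parameter are hypothetical names, the even split across cards assumes weights shard uniformly under tensor parallelism, and real runtimes add overhead for activations, CUDA context, and KV cache that is not modelled here.

```python
from dataclasses import dataclass


@dataclass
class Placement:
    fits_in_vram: bool      # True if weights fit in the usable VRAM pool
    per_gpu_gb: float       # weight slice per card if sharded evenly
    kv_headroom_gb: float   # pool left over for KV cache and activations
    host_spill_gb: float    # weights that would have to live in host RAM


def place(weights_gb: float, gpus: int = 8, gb_per_gpu: float = 24.0,
          reserve_gb_per_gpu: float = 0.0) -> Placement:
    """Estimate whether a quantized checkpoint fits in the GPU pool.

    reserve_gb_per_gpu is an assumed per-card overhead; the listing
    does not give one, so it defaults to zero.
    """
    usable = gpus * (gb_per_gpu - reserve_gb_per_gpu)
    if weights_gb <= usable:
        return Placement(True, weights_gb / gpus, usable - weights_gb, 0.0)
    return Placement(False, usable / gpus, 0.0, weights_gb - usable)


if __name__ == "__main__":
    # Footprints taken from the model list above.
    for name, gb in [("Qwen3-235B-A22B Q4", 132),
                     ("GLM-4.5 Q4", 177),
                     ("DeepSeek-V3/R1 Q2", 215)]:
        print(name, place(gb))
```

Setting `reserve_gb_per_gpu` to a non-zero value gives a more conservative view of the configurations that sit close to the 192 GB limit.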
Image generation
FLUX.1 [dev] bf16, up to 8 concurrent generation streams (one per card, ~15-25 s/image at fp8); FLUX.1 Kontext [dev]; FLUX Tools; SD 3.5 Large bf16 x 8; HunyuanImage-2.1 bf16 (~34 GB) x 2-4 concurrent; HunyuanImage-3.0 base (80B MoE, 13B active) bf16; HunyuanDiT; Kolors / Kolors 2.0; AuraFlow; OmniGen v1; PixArt-Sigma.

Video generation
Wan 2.2 MoE dual-expert bf16 with full context, multiple concurrent streams; Wan 2.2 TI2V-5B x 8 concurrent; HunyuanVideo 13B bf16 both experts; HunyuanVideo 1.5; CogVideoX-5B bf16; Open-Sora 2.0 11B bf16; Genmo Mochi-1 bf16; LTX-Video x 8 concurrent; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos.

Audio / Speech / TTS
- ASR: Whisper v3 large / turbo x 8 concurrent (~50x realtime per stream); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
- TTS: CosyVoice 2/3; Kokoro 82M; XTTS v2; Stable Audio Open
- Realtime / S2S: Kyutai Moshi 7B x 8 concurrent voice streams; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
- Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving
- 8-way tensor-parallel inference of 200-250B MoE at Q4 (Qwen3-235B, GLM-4.5/4.6/4.7)
- Tenant-isolated 8-stream serving: one Q4 model per 24 GB card (e.g. 8x Qwen3-14B agents)
- Large-batch 70B: tensor-parallel vLLM / SGLang, batch-64 aggregate
- Mixed fleet: 235B MoE on 4 cards (TP4) + FLUX + video + realtime voice on the remaining 4
- Fine-tuning lab: 7-34B LoRA / QLoRA with large batch

Target workloads
- 8-GPU tensor-parallel inference at the 192 GB pool: Qwen3-235B Q4, GLM-4.5/4.6/4.7 Q4, Llama 4 Scout bf16
- Dense 70B bf16 (Llama 3.3 70B) with massive KV headroom for long context and high batch
- High-throughput batch inference gateway: vLLM / SGLang tensor-parallel at large batch
- Fine-tuning of 7-34B class models with high-batch LoRA / QLoRA
- Wan 2.2 dual-expert / HunyuanImage-3.0 / FLUX.1 full-workflow video-image studio

Measured performance
Kentino bench (4-GPU reference) | 2026-04-10 | 4x RTX 4090 + EPYC 7542 + 512 GB DDR4 + ROMED8-2T
Benchmark | Result
Sustained compute (fp16, 4-card ref) | 647 TFLOPS
vLLM, Llama 3.3 70B AWQ INT4 (single) | 8.0 tok/s
vLLM, Llama 3.3 70B AWQ INT4 (batch-32) | 179 tok/s aggregate
llama.cpp, Llama 3.3 70B Q4_K_M (single) | 20.3 tok/s decode
8-GPU aggregate compute (extrapolation) | ~1 294 TFLOPS fp16 expected (near-linear)
235B Q4 tensor-parallel 8-way (community) | 15-25 tok/s single-stream on 8x RTX 4090
The 4-card data was measured on Kentino hardware. The 8-GPU rows are an extrapolation and a published external reference, not Kentino measurements. Kentino will publish first-party 8-GPU numbers after the first customer build.

Not ideal for
- 5090-generation workloads (Blackwell fp8 native + higher TOPS): see K-AI 256 TurinDual 5090
- Training from scratch (no NVLink on consumer RTX 4090)
- ECC-sensitive 24/7 production: consumer RTX 4090 has no ECC; prefer 4x L40 or 2x RTX Pro 6000 Server Edition
- Hunyuan / DeepSeek fp8 native: RTX 4090 is Ada, so fp8 checkpoints upcast to bf16

Warranty and lead time
- 2 years parts warranty
- 1 year labor warranty
- 10-28 days lead time
Build includes assembly, BIOS config with dual-socket NUMA tuning, driver install, burn-in, memtest, full 8-GPU stress test, and LLM environment setup.
Lead time depends on component availability and is confirmed at order.

Recommended add-ons
- 4 TB additional NVMe for weight staging and MoE offload workloads
- NVIDIA ConnectX-5 100 GbE for multi-node serving
- RAM upgrade to 1 TB (16x 64 GB) or 2 TB (32x 64 GB); the board supports 32 DIMM slots
- Full 24U rack cabinet + online UPS 5 kVA
Price
- 38 327.00 USD, in stock