K-AI 96 Rome L40 724TOPS — 2x NVIDIA L40 ECC Production Inference Server
K-AI 96 Rome L40 724TOPS 2x L40 ECC Production Server 96 GB ECC VRAM | EPYC Milan | 724 TOPS INT8 724 TOPS INT8 96 GB ECC VRAM ECC datacenter grade 24/7 production Entry enterprise ECC 24/7 box — 2x L40 passive, 96 GB ECC VRAM pool, datacenter-grade alternative to the 4090 tier for regulated deployments. A two-GPU production-class inference server built on ROMED8-2T / EPYC Milan with two passive NVIDIA L40 cards. 96 GB ECC GDDR6 pool at the same VRAM envelope as the 4x RTX 4090 workhorse, but with full datacenter certification, ECC memory on every card, and a thermal design built for 24/7 duty cycle. The right call where RTX 4090 would raise warranty, reliability or compliance concerns — finance, healthcare, formal verification, and any sustained-production LLM / VLM serving. Hardware Component Detail GPUs 2x NVIDIA L40 48 GB GDDR6 ECC (Ada Lovelace, passive, 300 W, dual-slot, PCIe 4.0 x16) VRAM pool 96 GB ECC (no NVLink) CPU AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) Motherboard ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) System RAM 256 GB DDR4-2666 ECC RDIMM (4x 64 GB) Boot / storage 1 TB NVMe M.2 (PCIe 4.0 x4) Power supply Single 2 kW ATX PSU Chassis 4U rack-mount, passive Gen4 x16 risers Cooling SP3 tower cooler (Arctic Freezer 4U-M), 3x 120 mm front intake + 1x 120 mm rear exhaust Network Onboard dual 10 GbE (Intel X550) + IPMI Power envelope GPU draw: 2 x 300 W = 600 W System total at full load: ~925 W PSU total: 2 000 W — 53.8 % headroom Comfortable single-PSU margin, quiet operation Lane topology PCIe Gen4 x16 at both GPUs (L40 is native Gen4 x16). 16 lanes direct from CPU root complex — no PCIe switch. NVLink not present on L40 — inter-GPU comms via PCIe P2P. 864 GB/s memory bandwidth per card. What you can run With 96 GB of ECC VRAM across 2 passive L40 cards, this server handles enterprise 24/7 LLM serving, regulated deployments, image and video generation, and multi-tenant inference where ECC reliability and datacenter warranty matter. LLMs — text / reasoning / coding Chinese frontier Qwen3-32B bf16 single-GPU on one L40 with 32k ctx headroom (~18-22 tok/s single-stream on L40, published reference) Qwen3.5-27B bf16; Qwen3-30B-A3B / Qwen3-Coder-30B-A3B bf16 (~60 GB) 256k ctx Qwen3.5-122B-A10B Q4 (~70 GB) — MoE flagship, long ctx QwQ-32B bf16; Hunyuan-A13B Q6 (~48 GB) DeepSeek-R2 32B sparse MoE bf16 — single-GPU capable, two parallel streams GLM-4.5-Air 106B/12B Q4-Q5 (60-70 GB comfortable) Seed-OSS-36B bf16 — 512k native ctx; ERNIE-4.5-47B-A3B Q6-Q8 Baichuan-M2-32B bf16 (medical reasoning — ECC advantage here) Western frontier Llama 3.3 70B Q6 (~58 GB) with KV headroom; Q4_K_M (~43 GB) very long ctx (~15-18 tok/s single-stream on 2x L40, published reference) Hermes 3 70B / Tulu 3 70B Q4-Q6; Llama 4 Scout 109B/17B MoE Q4 (~63 GB) Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) bf16; Mixtral 8x22B Q3-Q4 gpt-oss-120b MXFP4 (~80 GB) with KV room Gemma 3 27B multimodal bf16 with 128k ctx Phi-4 14B / Phi-4-reasoning / Phi-4-multimodal bf16 Nemotron-Super 49B Q6-Q8; IBM Granite 4.0 H-Small 32B/9B — enterprise compliance Reka Flash 3 21B bf16; OLMo 2 32B / OLMo 3.1-32B-Think bf16 Vision-Language Models Qwen3-VL-8B / 32B, Qwen3-VL-30B-A3B MoE, Qwen3-Omni-30B-A3B; InternVL3 up to 78B Q4 (~48 GB); InternVL3.5-38B bf16; DeepSeek-VL2; ERNIE-4.5-VL-28B-A3B-Thinking; Llama 3.2 11B Vision bf16; Pixtral 12B bf16; Gemma 3 12B / 27B multimodal; PaliGemma 2 (3/10/28B); MiniCPM-V 2.6 / MiniCPM-o 2.6; GLM-4.6V-Flash; Molmo 72B Q4; Aya Vision 32B. Image generation L40 has Ada tensor cores and 864 GB/s memory bandwidth per card — solid for production image pipelines: FLUX.1 [dev] / [schnell] fp16 (~24 GB) or fp8 (~12 GB) (~15-25 seconds per 1024x1024 image at fp8, published reference); FLUX.1 Kontext [dev]; FLUX Tools (Fill / Depth / Canny / Redux); SD 3.5 Large (18 GB fp16 / 11 GB fp8); SDXL 1.0 + ControlNet + AnimateDiff; HunyuanImage-2.1 bf16 (~34 GB); Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma. Video generation HunyuanVideo 13B bf16 fits on one L40 at 720p short clip; Wan 2.2 T2V-A14B / I2V-A14B bf16 (~54 GB) tensor-parallel 2-way; Wan 2.2 TI2V-5B bf16 per card; Wan 2.1 14B fp8 / bf16; HunyuanVideo 1.5 (8.3B) bf16; Open-Sora 2.0 (11B) bf16; CogVideoX-5B / 1.5 bf16; Mochi-1 bf16 (~42 GB); LTX-Video 2B; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2. Audio / Speech / TTS ASR: Whisper v3 large / turbo (~50x realtime on single GPU, published reference); Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice TTS: CosyVoice 2 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open; Coqui XTTS v2; StyleTTS 2; Step-Audio-EditX Realtime / S2S: Kyutai Moshi (200 ms latency full-duplex); Step-Audio 2 mini / R1 / R1.1; Qwen2.5-Omni-7B Music / SFX / translation: MusicGen; AudioGen; Suno Bark; SeamlessM4T v2; MMS Multi-model / multi-tenant serving 4-8 concurrent users on 32-70B class LLMs via vLLM tensor-parallel or per-card partition Mixed stack: Qwen3-32B + FLUX.1 + Whisper-turbo + Moshi resident with partitioned VRAM LoRA inference + light fine-tuning of 7-14B; full-param possible on smaller models RAG pipelines with Command R / Qwen3 + BGE-M3 / E5 / Jina embeddings Target workloads Enterprise 24/7 LLM serving — 70B Q4-Q6, Qwen3-32B bf16, Mistral Small 3 bf16 Regulated deployment requiring ECC memory (finance, healthcare, formal verification) Long-context serving — Seed-OSS-36B 512k ctx fits comfortably on the 96 GB pool Mid-tier MoE serving — Hunyuan-A13B Q6, GLM-4.5-Air Q4, Qwen3-30B-A3B bf16 VLM document processing — InternVL3.5-38B, Pixtral 12B bf16, Qwen3-VL-32B Published performance references Published reference | 2x NVIDIA L40 comparable hardware Benchmark Result Llama 3.3 70B Q4_K_M across 2x L40 tensor-split ~15-18 tok/s single-stream Qwen3-32B bf16 single-GPU on one L40 ~18-22 tok/s single-stream vLLM Hunyuan-A13B Q6 on 2x L40 pool ~28-34 tok/s single-stream HunyuanVideo 13B bf16 on one L40 720p short clip — fits in 48 GB Per-card metrics 362 TOPS INT8, 864 GB/s, 300 W TDP Published, not measured on Kentino hardware. Not ideal for Cost-per-TFLOPS optimization — 4x RTX 4090 gives 2 644 aggregate TOPS at ~40 % of the component cost (without ECC / datacenter warranty) Frontier 200B+ dense models — 96 GB pool ceiling applies (need 192+ GB SKU) Video generation at bf16 long-form full-resolution (Wan 2.2 MoE two-expert wants more VRAM) Training from scratch — L40 is inference-certified; use RTX Pro 6000 / workstation Blackwell for training Warranty and lead time 2 years parts warranty 1 year labor warranty 10-28 days lead time NVIDIA OEM 3-year datacenter warranty on L40 + Kentino integration warranty (2 years parts, 1 year labor). Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Recommended add-ons Upgrade to 4x L40 (K-AI 192 Rome L40 1448TOPS) for 192 GB ECC pool and frontier-tier serving Upgrade RAM to 512 GB (add 4x 64 GB DDR4) for larger embedding / reranker stacks Upgrade NVMe to 4 TB for model library + dataset staging Redundant PSU upsell (dual 2 kW synced) available on request Rack PDU + 3 kVA online UPS for production colo
Variants (1)
- Default Title — 27480.00 USD — In stock
AI Readiness
Good foundation, but some important product data is still missing.