K-AI 192 Rome RTXPro6000 4000TOPS — 2× RTX Pro 6000 Blackwell Server Edition — EPYC Milan

Name: K-AI 192 Rome RTXPro6000 4000TOPS — 2× RTX Pro 6000 Blackwell Server Edition — EPYC Milan
Brand: Kentino s.r.o.

Price: 29876.00 USD
Availability: In stock

K-AI 192 Rome RTXPro6000 4000TOPS
192 GB ECC Blackwell Flagship Pair
2x RTX Pro 6000 Server Edition | EPYC Milan | 4 000 TOPS INT8

4 000 INT8 TOPS | 192 GB ECC VRAM | Blackwell fp8 native | 2-card minimal TP

Two passive RTX Pro 6000 Blackwell Server Edition cards, 96 GB ECC each. Less tensor-parallel overhead than 4- or 8-card builds. Datacenter flagship pair.

A 4U rack-mount inference server with two passive RTX Pro 6000 Blackwell Server Edition cards (96 GB ECC GDDR7 per card), one AMD EPYC 7643 Milan CPU (48C/96T), 256 GB DDR4 ECC, 2 TB NVMe boot, and a single 2 kW ATX PSU. For 70B dense bf16 and mid-size MoE, fewer big cards beat more small cards: two-card tensor parallelism has minimal communication overhead, and each 96 GB card carries a complete copy of most models.

Hardware

Component: Detail
GPUs: 2x NVIDIA RTX Pro 6000 Blackwell Server Edition, 96 GB ECC GDDR7 (passive, 600 W, PCIe 5.0 x16, dual-slot)
VRAM pool: 192 GB ECC (96 GB x 2); each card holds a 70B bf16 model standalone
CPU: AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes)
Motherboard: ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI)
System RAM: 256 GB DDR4-2666 ECC RDIMM (4x 64 GB)
Boot / storage: 2 TB NVMe M.2 (PCIe 4.0 x4)
Power supply: 1x 2 kW ATX PSU
Chassis: 4U rack-mount with front-to-back directed airflow
Cooling: Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust
Network: Onboard dual 10 GbE (Intel X550)

Power envelope

GPU draw: 2 x 600 W = 1 200 W
System total at full load: ~1 525 W
PSU total: 2 000 W (single 2 kW), 23.7 % headroom
A single PSU is sufficient; a dual-PSU upgrade is optional for N+1 redundancy.

Lane topology

PCIe Gen4 x16 per GPU (the card is Gen5 native; the ROMED8-2T board caps at Gen4).
Direct root-complex connection, no PCIe switch.
No NVLink; inter-GPU traffic is peer-to-peer over PCIe.
Five x16 slots remain open for expansion.
The Gen4 vs Gen5 difference is negligible for inference at this VRAM density.
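The power-envelope and pool figures in the hardware summary are plain arithmetic, and it can help to recompute them when considering a different PSU or card count. The sketch below is a minimal Python check using only numbers quoted in this listing; the variable names are illustrative, and the ~1 525 W full-load figure is the vendor's estimate, not a measurement.

```python
# Figures copied from the listing. SYSTEM_FULL_LOAD_W is the vendor's
# estimate ("~1 525 W"), not a measured value.
GPU_COUNT = 2
GPU_BOARD_POWER_W = 600
GPU_VRAM_GB = 96
GPU_INT8_TOPS = 2000
SYSTEM_FULL_LOAD_W = 1525
PSU_CAPACITY_W = 2000

# Power envelope
gpu_draw_w = GPU_COUNT * GPU_BOARD_POWER_W           # GPUs only
non_gpu_draw_w = SYSTEM_FULL_LOAD_W - gpu_draw_w     # CPU, RAM, fans, storage
headroom_w = PSU_CAPACITY_W - SYSTEM_FULL_LOAD_W
headroom_pct = 100.0 * headroom_w / PSU_CAPACITY_W

# Aggregate resources across both cards
vram_pool_gb = GPU_COUNT * GPU_VRAM_GB
aggregate_int8_tops = GPU_COUNT * GPU_INT8_TOPS

print(f"GPU draw:        {gpu_draw_w} W")
print(f"Non-GPU draw:    {non_gpu_draw_w} W")
print(f"PSU headroom:    {headroom_w} W ({headroom_pct:.2f} %)")
print(f"VRAM pool:       {vram_pool_gb} GB")
print(f"Aggregate INT8:  {aggregate_int8_tops} TOPS")
```

The computed headroom is 475 W, or 23.75 % of the 2 000 W PSU, which the listing quotes as 23.7 %. The remaining 325 W of the full-load estimate covers the 225 W CPU plus memory, storage, and fans.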
What you can run

With 192 GB ECC VRAM on just two Blackwell cards with native fp8/fp4, this is the cleanest path to dense 70B at bf16 and mid-size MoE. Two independent 70B streams, one per card, or 200B MoE across both with minimal 2-way TP overhead.

LLMs: text / reasoning / coding

Chinese frontier
Qwen3 / Qwen3.5 (Alibaba): Qwen3-235B-A22B Q4 (~132 GB) comfortable with long ctx (~15-25 tok/s single-stream across 2 cards); Qwen3-Coder-480B-A35B Q2 (~160 GB); Qwen3.5-122B-A10B fp8 (~75 GB); Qwen3-32B dense bf16 with huge KV; QwQ-32B bf16
DeepSeek: DeepSeek-V3/R1 Q2 (~215 GB with small RAM spill), Blackwell runs fp8 natively; DeepSeek-R2 32B bf16 two concurrent streams (one per card)
GLM / Z.ai: GLM-4.5 / 4.6 / 4.7 Q4 (~177 GB), hero config at this tier; GLM-4.5-Air fp8 or bf16 with huge KV
Tencent Hunyuan: Hunyuan-Large Q3 (~160 GB), 389B MoE with 256k ctx; Hunyuan-A13B fp8 native (~80 GB) with huge KV
Others: Baidu ERNIE-4.5-424B Q3 (~180 GB); InternVL3.5-241B-A28B Q4 (~135 GB); MiniMax-M1 Q3 (~180 GB)

Western frontier
Meta Llama: Llama 3.3 70B bf16 on one card, two independent concurrent 70B streams (~20-30 tok/s per stream); Llama 4 Scout bf16 (~218 GB, tight); Llama 4 Maverick Q3 (~188 GB)
Mistral: Mistral Large 2 / Pixtral Large / Devstral 2 123B Q6 (~88 GB) single-card or bf16 across both; Mistral Small 3 multi-stream
OpenAI (open weights): gpt-oss-120b MXFP4 native (80 GB), fits on one card, two independent concurrent streams
NVIDIA Nemotron: Llama-3.1-Nemotron Ultra 253B Q4 (~147 GB); Super 49B bf16 on single card
Others: Cohere Command R+ 104B Q6 (~85 GB) on one card; Google Gemma 3 27B bf16 multiple concurrent streams

Vision-Language Models

InternVL3.5-241B-A28B Q4 (~135 GB)
Qwen3-VL-235B-A22B Q4
Qwen3-VL-32B bf16 single-card
Pixtral Large 124B bf16 or Q6
Llama 3.2 90B Vision bf16 (~180 GB)
Molmo 72B bf16 (~144 GB)
GLM-4.6V 106B fp8
Gemma 3 27B multimodal x 2-3 concurrent streams
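The size estimates quoted above follow roughly from parameter count times bits per weight. The sketch below is a minimal Python estimator under stated assumptions: roughly 4.5 effective bits per weight for a Q4 quantization (an assumption, not a figure from this listing), 1 GB = 10^9 bytes, and weights only. KV cache, activations, and runtime overhead are ignored, so the placement it reports is optimistic. The function names are hypothetical.

```python
def weight_footprint_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8.0


def placement(size_gb: float, card_gb: float = 96.0, cards: int = 2) -> str:
    """Classify a weight footprint against one card and against the pooled VRAM.

    Weights only: real deployments also need room for KV cache and runtime
    buffers, so treat a narrow fit as a warning rather than a guarantee.
    """
    if size_gb <= card_gb:
        return "one card"
    if size_gb <= card_gb * cards:
        return "both cards (2-way tensor parallel)"
    return "exceeds VRAM pool (needs system-RAM spill)"


# Examples drawn from the listing. The 4.5 bits/weight for Q4 is an assumption.
examples = [
    ("Qwen3-32B bf16", weight_footprint_gb(32, 16)),
    ("Qwen3-235B-A22B Q4", weight_footprint_gb(235, 4.5)),
    ("Molmo 72B bf16", weight_footprint_gb(72, 16)),
    ("DeepSeek-V3/R1 Q2 (size as quoted in the listing)", 215.0),
]

for name, size_gb in examples:
    print(f"{name}: ~{size_gb:.0f} GB -> {placement(size_gb)}")
```

Under these assumptions the estimator reproduces the listing's ~132 GB for Qwen3-235B-A22B at Q4 and ~144 GB for Molmo 72B at bf16, and it flags the ~215 GB DeepSeek configuration as needing the RAM spill the listing mentions.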
Image generation

FLUX.1 [dev] bf16 multiple concurrent streams
FLUX.1 Kontext [dev]
FLUX Tools
SD 3.5 Large bf16 concurrent
HunyuanImage-2.1 bf16 (~34 GB) x 2-4 concurrent
HunyuanImage-3.0 base (80B MoE, 13B active) bf16, fits on one card
HunyuanDiT
Kolors / Kolors 2.0
AuraFlow
OmniGen v1
PixArt-Sigma

Video generation

Wan 2.2 MoE dual-expert bf16 full context, fits on one card, two concurrent generation streams
Wan 2.2 TI2V-5B
HunyuanVideo 13B bf16 both experts
HunyuanVideo 1.5
CogVideoX-5B bf16
Open-Sora 2.0 11B bf16
Mochi-1 bf16 (~42 GB)
LTX-Video
Pyramid Flow
SVD / SV3D / SV4D
NVIDIA Cosmos Predict 2

Audio / Speech / TTS

ASR: Whisper v3 large / turbo (~50x realtime); Parakeet-TDT; Canary 1B; Qwen3-ASR; SenseVoice
TTS: CosyVoice 2/3; Kokoro 82M; XTTS v2; Stable Audio Open; Step-Audio-EditX
Realtime / S2S: Kyutai Moshi 7B; Step-Audio 2 mini/R1; Qwen2.5-Omni-7B
Music / SFX: MusicGen / AudioGen / Bark; SeamlessM4T v2

Multi-model / multi-tenant serving

Two independent 70B streams: one per card, the simplest form of tenant isolation
Dense 70B bf16 + supporting stack: LLM on card 1, image/video/audio on card 2
200B MoE across both cards: minimal tensor-parallel overhead (2-way split)
fp8-native frontier: DeepSeek V3 family, Hunyuan-Large fp8 with Blackwell native paths

Target workloads

Dense 70B bf16 inference: two cards tensor-parallel with minimal overhead, or one model per card for streaming
100-150B MoE at Q4-Q6 (GLM-4.5-Air, Qwen3.5-122B-A10B, Hunyuan-A13B, Llama 4 Scout)
FP8-native frontier inference (DeepSeek V3 family, Hunyuan, Llama 4); Blackwell runs fp8 natively
Image + video generation studio at bf16 (Wan 2.2 T2V-A14B, HunyuanVideo 13B, FLUX.1 [dev])
Long-context document analysis (MiniMax-M1, Kimi-K2 1.58-bit UD with spill)

Measured performance

Published references | NVIDIA RTX Pro 6000 Blackwell Server Edition datasheet + community benchmarks

Benchmark: Result
Per-card INT8 TOPS (NVIDIA datasheet): 2 000 TOPS
Aggregate INT8 TOPS (2 cards): 4 000 TOPS
Memory bandwidth per card: ~1 800 GB/s, 96 GB ECC GDDR7
Llama 3.3 70B bf16 per-card (community): 15-25 tok/s single-stream, 60-90 tok/s batch
Dual-card tensor-parallel 70B (community): ~30-45 tok/s single-stream expected
Blackwell fp8 native: DeepSeek-V3 fp8, Hunyuan-A13B fp8 run without bf16 upcast

These are published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build.

Not ideal for

Very high concurrency multi-tenant serving: 4x L40 or 6x L4 distributes better across more cards
Heavy KV cache at very long context: step up to K-AI 384 RTXPro6000 8000TOPS
Training: Kentino does not sell H-class NVLink fabrics
Budget inference at 192 GB pool: 8x RTX 4090 is cheaper (trading ECC and passive cooling for cost)

Warranty and lead time

2 years parts warranty
1 year labor warranty
10-28 days lead time

NVIDIA OEM 3-year warranty on RTX Pro 6000 Server Edition + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability and is confirmed at order.

Recommended add-ons

Upgrade to dual 2 kW synced PSU for N+1 redundancy
Upgrade RAM to 512 GB (4 DIMM slots open)
4 TB NVMe for large weight libraries and model staging
Expand to 4-card configuration (K-AI 384 RTXPro6000 8000TOPS); chassis has slot capacity
24U rack cabinet + online UPS 5 kVA
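The two serving layouts this listing describes (one model per card, or one model split across both cards) differ only in which GPUs a server process can see and how many ways it shards the model. The sketch below shows this in Python, assuming vLLM as the inference server. That is an assumption: the listing does not name a serving stack, and the model identifiers are placeholders. The code only builds the environment and command line and does not launch anything.

```python
import os
import shlex


def serve_command(model: str, gpu_ids: list, port: int):
    """Return (env, argv) for one vLLM server pinned to the given GPUs.

    CUDA_VISIBLE_DEVICES restricts the process to the listed cards, and the
    tensor-parallel size is set to the number of cards it may use.
    """
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = [
        "vllm", "serve", model,
        "--tensor-parallel-size", str(len(gpu_ids)),
        "--port", str(port),
    ]
    return env, argv


# Layout A: two independent streams, one model per card (tenant isolation).
# Model names are placeholders, not recommendations from the listing.
layout_per_card = [
    serve_command("placeholder/model-a", [0], 8000),
    serve_command("placeholder/model-b", [1], 8001),
]

# Layout B: one larger model split across both cards (2-way tensor parallel).
layout_tensor_parallel = [
    serve_command("placeholder/large-moe", [0, 1], 8000),
]

for env, argv in layout_per_card + layout_tensor_parallel:
    print(f"CUDA_VISIBLE_DEVICES={env['CUDA_VISIBLE_DEVICES']} {shlex.join(argv)}")
```

To actually start a server, each `(env, argv)` pair could be passed to `subprocess.Popen(argv, env=env)`. Layout A keeps each tenant on its own card with no inter-GPU traffic, while layout B sends tensor-parallel traffic over PCIe because the cards have no NVLink.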

Variants (1)

Default Title — 29876.00 USD — In stock
