K-AI 96 Rome RTXPro6000 2000TOPS — Single-Card 96 GB Blackwell Workstation Server
K-AI 96 Rome RTXPro6000 2000TOPS 96 GB ECC Single-Card Workstation Server 1x RTX Pro 6000 Blackwell | EPYC Milan | 2 000 TOPS INT8 2 000 INT8 TOPS 96 GB ECC VRAM single card design fp8 native Blackwell One card, 96 GB ECC VRAM, the entire Blackwell tensor pipeline. 70B dense bf16 on a single GPU — no tensor-parallel overhead. A 4U rack-mount workstation server with a single NVIDIA RTX Pro 6000 Blackwell Workstation card (96 GB ECC GDDR7), one AMD EPYC 7643 Milan CPU (48C/96T), 256 GB DDR4 ECC, 2 TB NVMe boot, and one 2 kW ATX PSU with 54 % headroom. The simplest software path Kentino ships — no tensor-parallel config, no multi-GPU debugging. vLLM, SGLang, llama.cpp, ComfyUI run single-device and just work. Hardware Component Detail GPU 1x NVIDIA RTX Pro 6000 Blackwell Workstation 96 GB ECC GDDR7 (600 W, PCIe 5.0 x16) VRAM 96 GB ECC on a single card — no pooling, no tensor-parallel overhead CPU AMD EPYC 7643 Milan (48C/96T, 225 W, 128x PCIe 4.0 lanes) Motherboard ASRock Rack ROMED8-2T (SP3, 7x PCIe 4.0 x16, 8x DDR4 ECC, 2x 10 GbE, IPMI) System RAM 256 GB DDR4-2666 ECC RDIMM (4x 64 GB) Boot / storage 2 TB NVMe M.2 (PCIe 4.0 x4) Power supply 1x 2 kW ATX PSU Chassis 4U rack-mount (4-slot capacity, 1 populated — room to expand) Cooling Arctic Freezer 4U-M SP3 tower + 3x 120 mm front intake + 1x 120 mm rear exhaust Network Onboard dual 10 GbE (Intel X550) Power envelope GPU draw: 1 x 600 W = 600 W System total at full load: ~925 W PSU total: 2 000 W — 53.8 % headroom Single PSU, simple cabling — generous margin for single-card build Lane topology PCIe Gen4 x16 at the GPU (card is Gen5 native; Rome board caps at Gen4). Direct root-complex connection — no PCIe switch. No NVLink required — single card, no inter-GPU link at all. Six x16 slots remain open for NIC / storage / expansion. What you can run With 96 GB of ECC VRAM on a single Blackwell card, this server handles 70B dense bf16 on one GPU, open-weight LLMs, vision models, image and video generation, speech AI, and production inference — no tensor-parallel coordination needed. LLMs — text / reasoning / coding Chinese frontier Qwen3 / Qwen3.5 (Alibaba): Qwen3-32B dense bf16 (~65 GB) with generous KV; Qwen3-72B Q6 (~58 GB, ~25-35 tok/s single-stream); Qwen3-30B-A3B MoE bf16; Qwen3-Coder-30B-A3B agentic at 256k ctx; Qwen3.5-122B-A10B Q4 (~70 GB) with tight KV; QwQ-32B bf16 reasoning DeepSeek: DeepSeek-R2 32B sparse MoE bf16 (~64 GB, 92.7 % AIME 2025 single-card); DeepSeek-R1-Distill-Qwen-32B bf16; DeepSeek-V2-Lite 16B full precision GLM / Z.ai: GLM-4.5-Air 106B/12B Q4-Q5 (60-70 GB); GLM-4.6V 106B Q4 Tencent Hunyuan: Hunyuan-A13B 80B/13B MoE Q4-fp8 (~48-80 GB) with 256k ctx and dual-mode reasoning ByteDance Seed-OSS-36B bf16 (~72 GB tight) or fp8 (~36 GB) with full 512k native context Baidu ERNIE-4.5-47B-A3B Q4-fp8 with long context Western frontier Meta Llama: Llama 3.3 70B at bf16 (~70 GB) on a single card with 8-16k ctx — the hero config; Llama 3.3 70B Q6 (~58 GB, ~35-50 tok/s single-stream); Llama 3.1 8B bf16 (~80-120 tok/s); Llama 3.2 90B Vision Q4 (~52 GB); Llama 4 Scout 109B/17B MoE Q4 (~63 GB) Mistral: Mistral Small 3 / Magistral Small 1.2 / Devstral Small 2 (24B) all at bf16 with 256k ctx; Mixtral 8x7B Q6; Codestral Mamba 7B; Pixtral 12B bf16 OpenAI (open weights): gpt-oss-20b MXFP4 native (16 GB); gpt-oss-120b MXFP4 native (80 GB) — single-card single-stream Google Gemma 3: 27B multimodal bf16 (~54 GB) with 128k ctx; 12B / 4B bf16 Microsoft Phi-4 14B dense bf16; Phi-4-reasoning; Phi-4-multimodal NVIDIA Nemotron: Llama-3.1-Nemotron-Super 49B Q6 (~40 GB); Nemotron-Nano 8B Others: IBM Granite 4.0 H-Small 32B/9B; OLMo 2 32B; Reka Flash 3 21B; Falcon H1R 7B; Command R 35B Vision-Language Models Qwen3-VL-8B / 32B bf16, Qwen3-VL-30B-A3B MoE bf16, Qwen3-Omni-30B-A3B; InternVL3 up to 78B Q4 (~48 GB); InternVL3.5-38B bf16; DeepSeek-VL2 full range; Llama 3.2 11B Vision bf16; Llama 3.2 90B Vision Q4 (~52 GB); Pixtral 12B bf16; Molmo 72B Q4; Molmo 7B bf16; Gemma 3 12B / 27B multimodal; PaliGemma 2 28B; Phi-3.5-Vision; Aya Vision 8B / 32B; MiniCPM-V 2.6 / MiniCPM-o 2.6; GLM-4.6V. Image generation FLUX.1 [dev] / [schnell] bf16 (~24 GB) and quantized (~15-25 s/image at fp8); FLUX.1 Kontext [dev] in-context editing; FLUX Tools (Fill / Depth / Canny / Redux); SD 3.5 Large bf16 (~18 GB); SDXL 1.0; HunyuanImage-2.1 bf16 (~34 GB) at 2K native; HunyuanDiT 1.5B; Kolors / Kolors 2.0; AuraFlow v0.3; OmniGen v1; PixArt-Sigma. Video generation Wan 2.2 T2V-A14B / I2V-A14B MoE bf16 (~54 GB, both experts resident); Wan 2.2 TI2V-5B fast path; HunyuanVideo 13B bf16 (~60-80 GB, tight at 720p); HunyuanVideo 1.5 (8.3B); CogVideoX-5B; Open-Sora 2.0 (11B) bf16; Genmo Mochi-1 bf16 (~42 GB); LTX-Video; Pyramid Flow; SVD / SV3D / SV4D; NVIDIA Cosmos Predict 2. Audio / Speech / TTS ASR: Whisper v3 large / turbo (~50x realtime); NVIDIA Parakeet-TDT 1.1B; Canary 1B; Qwen3-ASR; SenseVoice TTS: CosyVoice 2 / Fun-CosyVoice 3.0; Kokoro 82M; Stable Audio Open; Coqui XTTS v2; StyleTTS 2; Step-Audio-EditX Realtime / S2S: Kyutai Moshi (200 ms full-duplex); Step-Audio 2 mini; Step-Audio-R1 / R1.1; Qwen2.5-Omni-7B Music / SFX: Meta MusicGen; AudioGen; Suno Bark; SeamlessM4T v2 Multi-model / multi-tenant serving Single-tenant streaming coding assistant — 70B dense bf16, low latency, no TP penalty Mixed resident stack: Qwen3-32B bf16 + FLUX.1 fp8 + Whisper-turbo + Moshi on one card with partitioned VRAM Fine-tuning: LoRA / QLoRA on 13-34B models; full-param on 7B Embedding service: BGE-M3 / E5 / Jina resident alongside a generator LLM Target workloads Single-tenant streaming coding assistant running Llama 3.3 70B bf16 or Qwen3-Coder-30B-A3B — no TP coordination overhead Developer workstation for a single engineer or tight team needing a 70B-class model with 32-128k context Video or image generation lab — HunyuanVideo 13B, Wan 2.2 dual-expert, HunyuanImage-2.1 all at bf16 resident VLM / OCR bench — Qwen3-VL-32B bf16 or InternVL3.5-38B with long-document pipelines Clean appliance for a small LLM API gateway — one model, one card, easy ops Measured performance Published references | NVIDIA RTX Pro 6000 Blackwell datasheet + community benchmarks Benchmark Result Per-card INT8 TOPS (NVIDIA datasheet) 2 000 TOPS VRAM per card 96 GB ECC GDDR7 Memory bandwidth ~1 800 GB/s Llama 3.3 70B Q6 single-GPU (community) 40-55 tok/s single-stream Llama 3.3 70B bf16 single-GPU (community) 15-25 tok/s single-stream Blackwell fp8 native DeepSeek-V3 fp8, Hunyuan-A13B fp8 run without bf16 upcast Published external references, not measured on Kentino hardware. Kentino will publish first-party numbers after the first customer build. Not ideal for Training large models from scratch (single GPU — no tensor/pipeline parallelism) Frontier 200B+ MoE at real quantizations (Qwen3-235B Q4, GLM-4.5/4.6 — use K-AI 192 RTXPro6000 or larger) High-concurrency multi-tenant inference (single card caps aggregate throughput; 4x RTX 4090 or 4x L40 scale better) Warranty and lead time 2 years parts warranty 1 year labor warranty 10-28 days lead time NVIDIA OEM 3-year warranty on RTX Pro 6000 + Kentino integration warranty. Build includes assembly, BIOS configuration, driver install, burn-in testing, and functional verification. Lead time depends on component availability, confirmed at order. Recommended add-ons Upgrade RAM to 512 GB (add 4x 64 GB DDR4 — four DIMM slots still open) 4 TB NVMe secondary drive for model library / dataset staging 24U open cabinet for production rack-mount For Gen5 x16 link speed consider the Genoa-platform variant on request
Variants (1)
- Default Title — 18816.00 USD — In stock
AI Readiness
Good foundation, but some important product data is still missing.