GPU POD Solution

AI-ready supercomputing infrastructure solution for all workloads at scale

Scalable AI infrastructure incorporates best-of-breed compute, networking, storage, power, and cooling to deliver the fastest application performance and meet the demands of evolving AI workloads.

Providing the computational power to train deep learning models

The AMAX GPU POD with NVIDIA A100 GPUs is an artificial intelligence (AI) supercomputing infrastructure, providing the computational power necessary to train today’s state-of-the-art deep learning (DL) models and to fuel innovation well into the future. The AMAX GPU POD delivers groundbreaking performance and is designed to solve the world’s most challenging computational problems.

This GPU POD reference architecture is the result of co-design between data scientists, application performance engineers, and system architects to build a system capable of supporting the widest range of deep learning workloads.


Powered by

NVIDIA A100 Tensor Core GPU

The NVIDIA A100 Tensor Core GPU delivers unprecedented acceleration at every scale for AI, data analytics, and high-performance computing (HPC) to tackle the world’s toughest computing challenges. As the engine of the NVIDIA data center platform, A100 can efficiently scale to thousands of GPUs or, with NVIDIA Multi-Instance GPU (MIG) technology, be partitioned into seven GPU instances to accelerate workloads of all sizes.
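As a hedged illustration of MIG partitioning, the sketch below shows how a single A100 might be split into seven instances using the standard nvidia-smi MIG commands; the GPU index and the profile ID (19 corresponds to the 1g.5gb profile on the A100 40GB) are assumptions that depend on the specific system.

```shell
# Sketch: partition A100 GPU 0 into seven MIG instances.
# GPU index and profile IDs are assumptions; list the profiles
# available on your system before creating instances.
nvidia-smi -i 0 -mig 1        # enable MIG mode (a GPU reset may be required)
nvidia-smi mig -i 0 -lgip     # list the GPU instance profiles the A100 supports
nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C   # create seven 1g.5gb instances
nvidia-smi -L                 # verify the seven MIG devices are visible
```

Each resulting MIG device has isolated memory and compute, so seven smaller workloads can share one physical A100 without interfering with each other.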

AMAX AceleMax™ DGS-428A

Each AceleMax DGS-428A system supports flexible configurations of up to eight NVIDIA A100 Tensor Core GPUs, powered by dual-socket AMD EPYC™ 7003 series processors in a 4U form factor.

The AceleMax DGS-428A features up to 11 PCIe 4.0 slots and up to 160 PCIe lanes for compute, graphics, storage, and networking expansion. PCIe 4.0 provides transfer speeds of up to 16 GT/s per lane – double the bandwidth of PCIe 3.0 – while delivering lower power consumption, better lane scalability, and backward compatibility.
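As a back-of-the-envelope check of the figures above, the short Python sketch below estimates the effective per-direction throughput of a PCIe 4.0 x16 link from the 16 GT/s line rate and the standard 128b/130b encoding (protocol overhead is ignored):

```python
# Estimate PCIe 4.0 effective throughput from the line rate.
# 16 GT/s per lane, 128b/130b encoding, protocol overhead ignored.
LINE_RATE = 16e9        # transfers per second per lane
ENCODING = 128 / 130    # payload bits per bit on the wire
LANES = 16

per_lane_gbs = LINE_RATE * ENCODING / 8 / 1e9   # bytes/s -> GB/s
x16_gbs = per_lane_gbs * LANES

print(f"~{per_lane_gbs:.2f} GB/s per lane, ~{x16_gbs:.1f} GB/s per direction (x16)")
```

Running the same arithmetic at the PCIe 3.0 line rate of 8 GT/s (with its 128b/130b successor's predecessor encoding aside) halves the result, which is why PCIe 4.0 is described as doubling the bandwidth.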


NVIDIA InfiniBand Network


NVIDIA provides the world’s smartest switches, enabling in-network computing through Scalable Hierarchical Aggregation and Reduction Protocol (SHARP)™ technology. The QM8700 series offers the highest fabric performance available on the market, with up to 16 Tb/s of non-blocking bandwidth and sub-130 ns port-to-port latency.

For this reference architecture, the StorMax® storage connects to the AceleMax DGS-428A systems over two NVIDIA HDR InfiniBand networks (for high availability) to provide the most efficient scaling of GPU workloads and datasets. Built with the NVIDIA Quantum InfiniBand switch device, the QM8700 series provides forty ports of 200 Gb/s full bidirectional bandwidth.
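The per-port and aggregate switch figures quoted above are internally consistent, as a quick arithmetic sketch shows:

```python
# QM8700 aggregate bandwidth from its port count and port speed:
# forty ports at 200 Gb/s each.
ports = 40
port_gbps = 200

per_direction_tbps = ports * port_gbps / 1000   # 8 Tb/s one way
bidirectional_tbps = 2 * per_direction_tbps     # 16 Tb/s, matching the
                                                # non-blocking figure above
print(per_direction_tbps, bidirectional_tbps)
```
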

AMAX StorMax® Storage Systems


AMAX, together with Excelero, delivers StorMax® all-flash and hybrid flash storage solutions, featuring 200 Gb/s NVMe over Fabrics on InfiniBand with NVIDIA® ConnectX-6 adapters. StorMax® platforms combine high performance, strong security, and extremely flexible architectures with unmatched price-performance, accelerating AI computing, database, big data analytics, cloud, web 2.0, and video processing workloads.

StorMax A-1110NV (1U) and StorMax A-2440 (2U) offer two ports of 200 Gb/s InfiniBand and Ethernet connectivity, sub-600-nanosecond latency, and 215 million messages per second. The two systems deliver low-latency distributed block storage for web-scale applications, enabling shared NVMe across any network and supporting any local or distributed file system. These StorMax® solutions feature an intelligent management layer that abstracts the underlying hardware with CPU offload, creates logical volumes with redundancy, and provides centralized, intelligent management and monitoring.

All applications benefit from the ultra-low latency, extremely high throughput and high IOPs of a local NVMe device with the convenience of centralized storage while avoiding proprietary hardware lock-in and reducing the overall TCO.
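To illustrate how a host consumes centralized NVMe as if it were a local device, the sketch below uses the standard nvme-cli NVMe-over-Fabrics commands over RDMA; the target address, port, and subsystem NQN are placeholders for illustration, not actual StorMax values.

```shell
# Sketch: attach a remote NVMe-oF namespace over RDMA (InfiniBand).
# Address, port, and NQN below are placeholders, not real values.
modprobe nvme-rdma                               # load the RDMA transport
nvme discover -t rdma -a 192.168.0.10 -s 4420    # list subsystems the target exports
nvme connect -t rdma -a 192.168.0.10 -s 4420 \
    -n nqn.2021-01.com.example:storage-vol0      # namespace appears as /dev/nvmeXnY
```

Once connected, the remote namespace shows up as an ordinary block device, which is what lets any local or distributed file system sit on top of it.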

GPU POD Reference Architecture

Designed for any dataset size, the GPU POD enables training at vastly improved performance and is available in three deployment options.

SMALL REFERENCE ARCHITECTURE: 61.44 TB Raw


GPU Server:

  • 1x AceleMax DGS-428A
  • 4x NVIDIA A100 GPUs
  • 5x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters


Networking:

  • 1x NVIDIA QM8700 Switch

  Performance   Reads     Writes
  Bandwidth     20 GB/s   7.5 GB/s
  IOPS          5M        340K
  Latency       95 µs     21 µs

High-Performance Storage:

  • 1x StorMax® A-1110NV
  • 1x 2nd or 3rd Gen AMD EPYC™ Processor
  • 128GB RAM (8x 16GB) DDR4-3200 DIMMs
  • 2x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters
  • 4x Kioxia CM6-R 15.36TB NVMe

MEDIUM REFERENCE ARCHITECTURE: 245.76 TB Raw


GPU Server:

  • 2x AceleMax DGS-428A, each with:
  • 4x NVIDIA A100 GPUs
  • 5x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters


Networking:

  • 2x NVIDIA QM8700 Switch

  Performance   Reads     Writes
  Bandwidth     40 GB/s   15 GB/s
  IOPS          10M       680K
  Latency       95 µs     21 µs

High-Performance Storage:

  • 2x StorMax® A-1110NV
  • 1x 2nd or 3rd Gen AMD EPYC™ Processor
  • 128GB RAM (8x 16GB) DDR4-3200 DIMMs
  • 2x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters
  • 4x Kioxia CM6-R 15.36TB NVMe

LARGE REFERENCE ARCHITECTURE: 368.64 TB Raw


  Performance   Reads      Writes
  Bandwidth     160 GB/s   46 GB/s
  IOPS          30M        2M
  Latency       95 µs      21 µs

GPU Server:

  • 4x AceleMax DGS-428A, each with:
  • 4x NVIDIA A100 GPUs
  • 6x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters

Networking:

  • 2x NVIDIA QM8700 Switch

High-Performance Storage:

  • 1x StorMax® A-2440 (2U4N), which includes:
  • 1x 2nd or 3rd Gen AMD EPYC™ Processor
  • 128GB RAM (8x 16GB) DDR4-3200 DIMMs
  • 2x NVIDIA ConnectX-6 VPI HDR/200GbE dual-port adapters
  • 24x Kioxia CM6-R 15.36TB NVMe
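The raw-capacity figures for the three deployments line up with whole multiples of the 15.36 TB CM6-R drive size; the sketch below cross-checks that (the drive counts here are derived from the stated totals, not from the per-node lists):

```python
# Cross-check each pod's stated raw capacity against the
# 15.36 TB Kioxia CM6-R drive size used throughout.
DRIVE_TB = 15.36

pods = {"small": 61.44, "medium": 245.76, "large": 368.64}

for name, raw_tb in pods.items():
    drives = raw_tb / DRIVE_TB
    print(f"{name}: {raw_tb} TB raw -> {drives:.0f} drives")
```
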

Contact us to learn more or to request a quote.

Get in touch