[SMART]Rack AI

Inference & Training at Scale

“Our requirements for computational power, density, and network speed are very different from what conventional servers provide. AMAX was able to deliver the [SMART]Rack AI, a fully customized rack-scale Deep Learning solution with comprehensive [SMART]DC power management, an in-rack cooling system, and ultra-fast networking, that solved our problems above and beyond what we thought we were looking for.”

AMAX [SMART]Rack AI is a turnkey Machine Learning cluster designed for optimal manageability and performance, featuring 96x NVIDIA® P40 or V100 GPUs for up to 1.34 PFLOPS per rack. Delivered plug-and-play and fully loaded, the solution features All-Flash storage for an ultra-fast in-rack data repository, 25G high-speed networking, the [SMART]DC Data Center Manager, and an in-rack battery for graceful shutdown in power-loss scenarios. [SMART]Rack AI is the perfect platform for on-premise AI clouds and DL-as-a-Service, or to drop into any data center environment for the highest-performance training and inference at scale.

[SMART]DC Data Center Manager

[SMART]DC is an HPC-optimized, fully integrated DCIM to remotely monitor, manage, and orchestrate power-dense, GPU-based ML deployments, where real-time temperature, power, and system health monitoring are critical to ensure uninterrupted operation. Features include policy-based emergency power and resource management, remote KVM, alert and event notification, and advanced analytics.
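[SMART]DC's own interface is not documented here, but the telemetry loop at the heart of any such DCIM is straightforward to picture. The sketch below polls one node's BMC over the standard Redfish API for power draw and inlet temperature and flags policy violations; the endpoint address, credentials, chassis ID, and thresholds are hypothetical placeholders, not [SMART]DC's actual API.

```python
# Minimal sketch of a DCIM-style telemetry loop over standard Redfish.
# The BMC address, credentials, chassis ID, and thresholds below are
# hypothetical placeholders; [SMART]DC's real interface is not shown here.
import time
import requests

BMC = "https://10.0.0.42"      # hypothetical BMC address of one GPU node
AUTH = ("admin", "password")   # placeholder credentials
POWER_CAP_W = 2500             # hypothetical per-node power policy
INLET_MAX_C = 35               # hypothetical inlet temperature policy

def read_telemetry():
    """Read chassis power draw (W) and inlet temperature (C) via Redfish."""
    power = requests.get(f"{BMC}/redfish/v1/Chassis/1/Power",
                         auth=AUTH, verify=False).json()
    thermal = requests.get(f"{BMC}/redfish/v1/Chassis/1/Thermal",
                           auth=AUTH, verify=False).json()
    watts = power["PowerControl"][0]["PowerConsumedWatts"]
    inlet_c = thermal["Temperatures"][0]["ReadingCelsius"]
    return watts, inlet_c

while True:
    watts, inlet_c = read_telemetry()
    if watts > POWER_CAP_W or inlet_c > INLET_MAX_C:
        # A real DCIM would cap power or raise an alert/event here.
        print(f"ALERT: {watts} W / {inlet_c} C exceeds policy")
    time.sleep(10)
```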

25GbE Fabrics Featuring RoCE

Benefit from the latest 25G network technology for increased in-rack bandwidth and productivity. The 48-port ToR switch's 48x 25G downlinks and 6x 100G uplinks remove existing bottlenecks between compute and SSD/NVMe storage and accelerate application workloads. AMAX's 25G fabric supports RDMA over Converged Ethernet (RoCE) and link aggregation, delivering connectivity previously available only to HPC architectures.
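The port counts quoted above pin down the fabric's aggregate bandwidth; a quick arithmetic check shows the 2:1 oversubscription between server-facing and uplink capacity:

```python
# Aggregate bandwidth of the 48x 25G / 6x 100G ToR configuration above.
downlink_gbps = 48 * 25   # server-facing capacity: 1,200 Gb/s
uplink_gbps = 6 * 100     # uplink capacity: 600 Gb/s

print(downlink_gbps, uplink_gbps)    # 1200 600
print(downlink_gbps / uplink_gbps)   # 2.0 -> 2:1 oversubscription
```

Within the rack, server ports can run at full 25G line rate (assuming a non-blocking switch ASIC); only traffic leaving the rack contends for the 600 Gb/s of uplink.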

10kW In-Rack Backup Battery Solution

Designed to bridge short power outages and to safely shut down servers without an external UPS, the in-rack battery provides 2.5 minutes of backup power at a 10 kW load per battery. In addition, [SMART]DC smart power policies can reduce the power consumption of the GPU servers and stretch the battery hold-up time to 5 minutes, comparable to state-of-the-art centralized UPS solutions.
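The two hold-up figures are consistent with a fixed energy budget: 2.5 minutes at 10 kW is roughly 0.42 kWh of usable energy per battery, so halving the rack's draw via power capping doubles the hold-up time. A quick check using only the numbers quoted above:

```python
# Battery hold-up time scales inversely with load for a fixed energy budget.
# Inputs are the figures quoted above: 2.5 min of backup at a 10 kW load.
energy_kwh = 10 * (2.5 / 60)        # ~0.417 kWh usable per battery

def holdup_minutes(load_kw):
    """Minutes of backup at a given sustained load."""
    return energy_kwh / load_kw * 60

print(holdup_minutes(10))   # 2.5 min at the full 10 kW load
print(holdup_minutes(5))    # 5.0 min once [SMART]DC caps draw to ~5 kW
```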

Compute Module

One [SMART]Rack AI Compute Module consists of four MATRIX 280 2U 8-GPU servers featuring P40, P100, or V100 GPUs. Each rack encloses up to three Compute Modules, providing over 1 PFLOPS of compute power.
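These counts line up with the rack-level figures quoted earlier: 3 modules x 4 servers x 8 GPUs = 96 GPUs. As a sanity check on the 1.34 PFLOPS claim (the ~14 TFLOPS FP32 rating per V100 is NVIDIA's published figure, assumed here rather than stated in the text):

```python
# Sanity check of the per-rack compute figures.
# 14.0 TFLOPS FP32 per V100 is NVIDIA's published rating (an assumption
# here; the text itself only quotes the rack-level totals).
gpus_per_server = 8
servers_per_module = 4
modules_per_rack = 3
fp32_tflops_per_gpu = 14.0

gpus = gpus_per_server * servers_per_module * modules_per_rack
print(gpus)                                # 96 GPUs per rack
print(gpus * fp32_tflops_per_gpu / 1000)   # ~1.34 PFLOPS per rack
```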

  • 4x MATRIX 280 2U 8-GPU Servers per Compute Module
  • All-Flash Storage Appliance
  • Scalable Multi-Framework Deep Learning IDE
  • High Power Density Rack Cooling System

MATRIX Deep Learning In A Box Solutions

The MATRIX is a fully integrated software/hardware platform geared toward increased efficiency, providing unprecedented resource flexibility and utilization to accelerate all phases of AI and Deep Learning projects, including model development/testing, training, and inference at scale. Its core GPU over Fabrics technology is available for bare-metal, VM, and container applications:

  • GPU over Fabrics technology enables sharing & scaling of large numbers of GPUs across systems for multi-tenancy and highly customizable self-service features (illustrated in the sketch after this list)
  • Dynamically allocates GPUs across multiple jobs and users for optimal resource utilization & efficiency
  • Connects any compute server remotely, over any Ethernet, IB, or RoCE network, to GPU server pools
  • Attaches and detaches GPUs to workloads in real time, offering unprecedented utilization of GPUs
  • Runs in user space and is proven to work in public cloud, private cloud, and on-premise hardware, with any hypervisor or container runtime
  • Supports FPGAs and ASICs (any OpenCL-compliant hardware)
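GPU over Fabrics' programming interface is not documented in this text; purely as an illustration of the attach/detach workflow the list describes, a client session might look like the sketch below, in which the fabric_client module and every call on it are hypothetical:

```python
# Illustrative sketch only: "fabric_client" and all of its calls are
# hypothetical stand-ins for GPU over Fabrics' real, undocumented API.
import fabric_client  # hypothetical SDK

def run_training_job(devices):
    """Placeholder for the user's actual training workload."""
    print(f"training on {len(devices)} pooled GPUs")

# Connect to a shared GPU pool over RoCE (Ethernet or IB would also work,
# per the list above); the pool name is a made-up example.
pool = fabric_client.connect("rack1-gpu-pool", transport="roce")

# Attach four remote V100s for the duration of one job; to CUDA
# applications they would appear as local devices.
gpus = pool.attach(count=4, model="V100")
try:
    run_training_job(devices=gpus)
finally:
    pool.detach(gpus)  # return the GPUs to the pool for other tenants
```

The point of the workflow is the final step: because attach and detach happen in real time, GPUs left idle by one job can be reclaimed immediately for another, which is what drives the utilization gains described above.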
