Getting Over Cloud Limits to Operationalize AI

Many businesses are now experimenting with artificial intelligence (AI) applications such as machine learning and deep learning, yet few have fully integrated AI into business and IT operations. That is changing, though: IDC expects that 90% of new enterprise applications will use AI by 2025.

AI can no longer be treated as a science experiment. As models become more infused into the business, organizations must take control of development activities and align them with business needs. This creates a compelling argument for operationalizing AI within the data center.

Cloud infrastructure is a favored platform for experimentation, because of its easy accessibility and abundance of platform and tool set choices. However, pilot projects in the cloud often lack the visibility and oversight that IT organizations require. Also:

  • Developers’ experiments may have little relevance to the company’s strategic imperatives.
  • Cost/benefit analysis, issues of privacy, and intellectual property preservation may all be overlooked.
  • Cloud platforms don’t permit the same level of scrutiny as systems within the company’s walls.

As AI becomes more ingrained into business priorities, organizations need to take a disciplined approach to choosing priority projects and allocating funds and people. Projects must be assessed for bottom-line impact and integration with operational systems. Realistic timeframes and budgets need to be allocated. Regulatory issues must be taken into account.

Cloud Limits

Every cloud service provider (CSP) offers high-performance computing (HPC) platforms for AI training. At a base level, the specs and performance benchmarks are about the same, but cloud platforms have some limitations that can make them a less attractive choice for large-scale model training. For example, it is often assumed that cloud platforms are less costly than on-premises environments because they are built from commodity, off-the-shelf components. However, running AI training models isn’t like processing payroll. The more specific the application, the more value organizations can realize from purpose-built hardware. Significant resources, planning, time, and effort are required to prepare and run AI training models. Processors left idle because of insufficient bandwidth or storage performance not only drive up cloud costs but also increase time to value.

Use of cloud instances imposes some configuration limits that may reduce the performance of AI training models. For example, according to tests conducted by AMAX, compared to Amazon Web Services’ top-of-the-line EC2 P4d instances for HPC, an on-premises NVIDIA DGX™ A100 can provide:

  • Up to 2.6 times the number of cores and virtual CPUs
  • A 20% increase in GPU compute performance through liquid cooling
  • Up to 4 times as much server memory capacity
  • Up to 4.5 times as much network bandwidth
  • Up to 12 times the storage capacity

Another factor is the amount of data that must be processed. Training is typically conducted on copies of production datasets, which can easily grow to terabyte or even petabyte scale. Depending on connection speed, uploading that much data to the cloud can take hours or even days.
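The transfer-time math is straightforward. The sketch below is a back-of-the-envelope estimate; the dataset size, link speed, and efficiency figure are illustrative assumptions, not measurements from the source.

```python
# Rough estimate of how long it takes to upload a training dataset
# to the cloud. All figures below are illustrative assumptions.

def upload_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Hours to upload `dataset_tb` terabytes over a `link_gbps` link,
    assuming only `efficiency` of nominal bandwidth is achieved."""
    bits = dataset_tb * 1e12 * 8                 # terabytes -> bits
    effective_bps = link_gbps * 1e9 * efficiency  # usable bits per second
    return bits / effective_bps / 3600

# Example: a 100 TB dataset over a 1 Gbps link at 80% efficiency
print(f"{upload_hours(100, 1.0):.1f} hours")  # 277.8 hours, about 11.6 days
```

Even generous assumptions about link utilization put a terabyte-scale upload in the range of hours to days, which is why data gravity alone can favor training where the data already lives.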

Very large models can consume so much memory that they overwhelm the capacity of a single server. In such a case, processing must be spread across multiple GPUs in a rack. Synchronizing nodes requires a high-speed network. Bandwidth can become a significant bottleneck at this point. Processing training workloads in the cloud can be 30% slower than on a local server, due to network factors alone.

Commodity cloud object storage is also too slow to meet the I/O needs of multi-GPU systems, so data lakes hosted in intermediate ultrafast storage are needed to serve GPUs as quickly as possible. Systems built to process multi-GPU workloads can do this faster than most clouds, reducing AI model training times by up to 30%.
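A quick aggregate-throughput check shows why object storage becomes the bottleneck. The per-GPU ingest rate and storage throughput below are hypothetical numbers chosen for illustration, not benchmarks from the source.

```python
# Sanity check: can the storage tier keep a multi-GPU node fed?
# Per-GPU ingest rate and storage throughput (GB/s) are assumptions.

def is_storage_bottleneck(num_gpus: int, per_gpu_gbs: float, storage_gbs: float) -> bool:
    """True if aggregate GPU data demand exceeds storage throughput (GB/s)."""
    return num_gpus * per_gpu_gbs > storage_gbs

# Eight GPUs each streaming ~2 GB/s need 16 GB/s in aggregate;
# an object store delivering ~1 GB/s to the node falls far short.
print(is_storage_bottleneck(8, 2.0, 1.0))  # True
```

Whenever demand outruns delivery, the GPUs sit idle waiting on I/O, which is exactly the idle-processor cost the article describes.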

In addition, artificial neural networks—statistical models inspired by biological neural networks—involve deep learning techniques often developed by trial and error. Multiple network derivatives are created and run continuously to see which produce the best output. This can consume a huge amount of cloud processor capacity and drive up costs accordingly. Developers also often experiment with multiple learning frameworks, which requires running tests across each one, further escalating utilization costs.

Best of Both Worlds

The best solution is a combination of both approaches. Use the cloud as a “sandbox” for experimentation, and move operational AI processing to a platform that is easier to fine-tune, control, and monitor. As AI development expands, local processing also demonstrates superior economics. Tests conducted by AMAX found that the total cost of ownership of an NVIDIA DGX™ A100 with eight NVIDIA A100 GPUs was 27% less than the comparable AWS EC2 P4d instance under a three-year contract.

On-premises systems afford much faster data ingestion and parameter tuning. IT gains full visibility into the work AI developers are doing. A local HPC platform provides better security than a multitenant cloud instance as well as protection against the risk of inadvertent data exposure. Organizations can test in the cloud and then move promising projects to the data center for training and integration with operational systems.

The Bottom Line

Operational AI shouldn’t be an either/or proposition. Align the platform with the task, and take AI development to the next level.

Discover how to accelerate AI initiatives. Visit
