Storage Infrastructure for Artificial Intelligence and Deep Learning

November 2022 | us49723322
Peter Rutten

Research Vice President, Infrastructure Systems, Platforms and Technologies Group; Performance-Intensive Computing Solutions Global Research Lead

Product Type: IDC White Paper
Sponsored by: NetApp

IDC Opinion

As enterprises move into the digital era through a process termed digital transformation, they are adopting more data-centric business models. While there are big data and analytics workloads that do not use artificial intelligence (AI), AI-driven applications are projected to grow rapidly over the next five years. AI workloads include machine learning (ML) and deep learning (DL) workloads, and while more data helps drive better business insights across both application types, this is particularly true for DL workloads. Experience with these types of workloads over the past three years indicates that outdated storage architectures can pose serious challenges to efficiently scaling large AI-driven workloads.

AI-driven applications commonly have a multistage data pipeline that includes ingest, transformation, training, inferencing, production, and archiving, with much of the data shared between stages (see Figure 1). 

Figure 1: AI Data Pipeline Stages

Source: IDC 2022

When enterprises consolidate the storage for all of these stages onto a single storage system, they gain the most cost-effective infrastructure. This type of consolidation can, however, put performance, availability, and security at risk if the underlying storage does not support low latency, high data concurrency, and the multitenant management needed to handle the data for each stage according to its individual requirements.

Enterprises often want to use public cloud–based services during the data life cycle, so systems need to support cloud-native capabilities and integration. And finally, because these systems tend to be quite large (often growing to multi-petabyte [PB] scale and beyond), high infrastructure efficiency is important to drive a low total cost of ownership.

Because of these requirements, enterprises are increasingly gravitating to software-defined, scale-out storage infrastructures that support extensive hybrid cloud integration.

NetApp ONTAP AI provides a prepackaged solution that includes the accelerated NVIDIA compute and networking that AI workloads often require, along with a software-defined, scale-out storage architecture based on ONTAP, the vendor's enterprise-class storage operating system.

Converged infrastructure stack offerings like NetApp ONTAP AI deliver fast time to value for enterprises deploying AI workloads, and they include access to common AI tools and other components that enterprises invariably need as they harness AI to drive faster, better business decisions.

Situation Overview

AI technologies are becoming an increasingly important driver of business success for digitally transformed enterprises today. ML, a subset of AI, reviews data inputs, identifies correlations and patterns in that data, and then applies that learning to make informed decisions across a wide variety of use cases — everything from recommendation engines and fraud detection to customer analysis and forecasting events — that can supplement and help guide better human decision making. DL is an evolution of machine learning that uses more complex, multilayered neural networks that can actually learn and make intelligent decisions on their own without human involvement.

AI workloads in general perform better when they leverage larger data sets for training purposes, and this is particularly true for DL workloads. AI life-cycle applications are one of the fastest-growing workloads over the next five years, and they are contributing strongly to the projected data growth rates in the enterprise. Most enterprises are experiencing data growth rates from 30% to 40% per year and will soon be managing multi-petabyte storage environments. Roughly 70% of enterprises undergoing digital transformation will be modernizing their storage infrastructure over the next two years to deal with the performance, availability, scalability, and security requirements for the new workloads they are deploying in an era of heightened privacy concerns and rampant cyberattacks.
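The growth rates cited above can be made concrete with a short sketch. The 500 TB starting capacity and 35% annual rate below are illustrative assumptions, not figures from this report:

```python
# Illustrative arithmetic: how quickly compound data growth turns a
# mid-size storage estate into a multi-petabyte one.
# Assumptions (not from the report): 500 TB starting capacity, 35%/year growth.

def years_to_reach(start_tb: float, target_tb: float, annual_growth: float) -> int:
    """Return the number of whole years until capacity reaches the target."""
    years = 0
    capacity = start_tb
    while capacity < target_tb:
        capacity *= 1 + annual_growth
        years += 1
    return years

if __name__ == "__main__":
    # 500 TB growing at 35% per year passes 2 PB (2,000 TB) in 5 years:
    # 500 * 1.35**5 ≈ 2,240 TB.
    print(years_to_reach(500, 2000, 0.35))  # → 5
```

At the midpoint of the 30%–40% range IDC cites, an estate well under a petabyte today crosses into multi-petabyte territory within five years, which is why storage modernization decisions are being made now.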

Implementing the right storage infrastructure can also make a big difference, particularly for larger-scale AI workloads. Already, over 43% of enterprises running these workloads operate in integrated hybrid cloud environments, and 69% of them are regularly moving data between on- and off-premises locations. Software-defined storage is important in providing the flexibility and data mobility needed in these hybrid cloud environments. 

Figure 2: Percentage of Organizations with Hybrid Cloud Environments and On-Premises–Cloud Data Movement for AI Workloads

Source: IDC 2022

AI Training Compute Infrastructure Considerations

AI infrastructure has begun to resemble high-performance computing (HPC) infrastructure: large-scale AI systems are starting to look more and more like supercomputers, while supercomputers are increasingly used to execute AI workloads. AI and HPC are converging not only on the same infrastructure but also as workloads, with AI front-ending an HPC simulation to cut back on the number of simulation runs or with an AI model operating as a side process inside an HPC loop. IDC refers to these types of applications as performance-intensive computing (PIC), an umbrella term for AI, HPC, and big data and analytics workloads. PIC has upended the homogeneous, general-purpose datacenter and led to a revolution of purpose-built designs.

The major driver for these developments is the AI model size. To achieve high levels of accuracy, it is not enough to feed an AI training model large amounts of data; the model also needs to have a large number of layers and parameters, which are the weights of the connections in the neural network.

Unfortunately, the number of parameters in a neural network and the volumes of data fed into it correlate directly with the amount of compute required. In other words, the more capable and/or accurate an AI model needs to be, the more compute is required to train that model. What’s more, the faster an organization wants to have its model trained, the more compute is required as well since model training can be distributed across many nodes in a cluster; hence a larger cluster trains a model with greater speed.
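As a rough illustration of this relationship, a widely used rule of thumb estimates total training compute at about 6 × N × D floating-point operations, where N is the parameter count and D is the volume of training data. The model size, data volume, and GPU throughput below are hypothetical assumptions, not figures from this report:

```python
# Back-of-the-envelope sketch of why parameter count and data volume
# drive compute requirements. Uses the common ~6*N*D FLOPs approximation
# for dense model training; all concrete numbers below are assumptions.

def training_flops(params: float, tokens: float) -> float:
    """Approximate total training compute via the ~6 * N * D rule of thumb."""
    return 6 * params * tokens

def gpu_days(total_flops: float, flops_per_gpu: float, utilization: float = 0.4) -> float:
    """Wall-clock GPU-days at a given sustained utilization fraction."""
    seconds = total_flops / (flops_per_gpu * utilization)
    return seconds / 86_400

if __name__ == "__main__":
    # Hypothetical 1-billion-parameter model trained on 100 billion tokens:
    flops = training_flops(1e9, 100e9)        # 6e20 FLOPs
    # Assume a GPU sustaining 100 TFLOPS peak at 40% utilization.
    print(round(gpu_days(flops, 100e12), 1))  # → 173.6 GPU-days
```

Doubling either the parameter count or the data volume doubles the compute in this approximation, which is also why distributing training across a larger cluster (more aggregate FLOPS) shortens time to a trained model.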

The most relevant paper on this subject is The Computational Limits of Deep Learning (July 2020) by Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. These researchers have demonstrated that the computational requirements of AI applications vary widely, as do their error rates and economic costs (see Figure 3).

Figure 3: Computational Requirements (in PFLOPS) of Select Application Areas

Source: Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso, The Computational Limits of Deep Learning, July 2020

AI Training Storage Infrastructure Considerations

Enterprises by and large realize that AI training workloads have requirements that are not necessarily well met by legacy storage infrastructure, and over 88% of them purchase a storage system specifically for the new workload.

While data scientists may not care much about the details of the storage infrastructure that supports the AI data pipeline, they do care about its ability to scale gracefully to accommodate data growth over time without impacting performance. The ability to deliver consistent performance at scale is important not only to handle data growth but also to accommodate new applications and users that may want to use the same data set simultaneously. End users also care about the integrity of the data; how easily they can create copies of data and make it available to other applications, to colleagues, and for rapid recovery purposes; and that the data is protected against failures so that it is not lost.

Those higher-level objectives translate into a number of specific requirements for the IT administrators actually managing the storage infrastructure for AI training workloads. A primary capability needed in storage that will be used for AI is its ability to support high degrees of concurrency. The same data is generally used across different stages, each of which can have very different I/O profiles, and multiple stages of the AI data pipeline will be operating concurrently much of the time. For real-time workloads, which are on the rise for enterprises, the production inferencing stage can also require extremely rapid response, driving the need for very low latencies. 

When it comes to high-level architectural considerations, a two-tiered storage infrastructure is strongly preferred: 90% of survey respondents are working with a file system–based front end that tiers data to a back-end object-based storage platform. For many enterprises, the front end is all-flash while the back end is HDD based.

Figure 4: Storage Requirements for AI Workflows

Source: IDC 2022
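The two-tier layout described above can be sketched as a simple placement policy. The tier names, 30-day threshold, and logic below are illustrative assumptions, not NetApp-specific behavior:

```python
# Hypothetical sketch of a two-tier policy: a hot all-flash file tier
# in front of a colder object-based back end. All names and thresholds
# here are illustrative assumptions.
from datetime import datetime, timedelta

HOT_TIER, COLD_TIER = "all-flash-file", "object-store"
COLD_AFTER = timedelta(days=30)  # demote data untouched for 30 days

def place(last_access: datetime, now: datetime) -> str:
    """Pick a tier based on how recently the data was accessed."""
    return COLD_TIER if now - last_access > COLD_AFTER else HOT_TIER

if __name__ == "__main__":
    now = datetime(2022, 11, 1)
    print(place(datetime(2022, 10, 25), now))  # recently read → all-flash-file
    print(place(datetime(2022, 8, 1), now))    # stale → object-store
```

The appeal of this design is economic: frequently accessed training data stays on flash for low-latency, high-concurrency reads, while the bulk of a multi-petabyte estate sits on cheaper HDD-backed object storage.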

Enterprise Deployment Model Preferences

Simplified deployment models are very important to enterprises as they offer easier ordering, quicker installation, faster time to value and, in many cases, streamlined support. Converged infrastructure offerings, which bundle compute, storage, and networking into a factory-configured rack that can be purchased under a single SKU and offers a single point of support contact for all infrastructure components, were introduced in the early 2010s and have grown into a $21.3 billion market (in 2021). This same idea has taken hold in the AI infrastructure market, and there are now several enterprise storage providers that have created converged infrastructure stacks that are specifically targeted at AI workloads. 

Reference architectures are available from many vendors as well, although their benefits do not go quite as far as converged infrastructure stacks. A reference architecture specifies a pretested configuration using multivendor components so that they have been validated to work together but leaves it up to the customer to buy the components from their various vendors. Ordering a complete system requires more manual effort on the part of customers, support contacts are split across the various vendors, and there is no unified management interface. When converged infrastructure stacks that include the products an enterprise wants are not available, it may be better to work from a reference architecture since enterprises will not need to validate all the product combinations themselves.

Considering NetApp

Today, NetApp is recognized as a leader in enterprise storage, and its broad portfolio includes block-, file-, and object-based storage platforms as well as converged infrastructure, technical support, and consulting services — all based around a software-defined product strategy that offers outright purchase and subscription-based deployment options. NetApp has an extremely mature hybrid cloud offering, allowing customers to run their enterprise-grade solutions either on premises or in major hyperscaler public cloud environments and manage them all under a unified management interface that supports hybrid multicloud operations.

At the heart of ONTAP AI are NVIDIA DGX compute systems, fully integrated, turnkey hardware and software AI platforms purpose built for analytics, AI training, and AI inferencing. The storage in NetApp ONTAP AI is based on the vendor's flagship scale-out storage operating system, ONTAP. All ONTAP storage systems come with high availability, storage efficiency, data management, scalable NAS data protection, security, compliance, and cloud integration to ensure data integrity.

Enterprises vary in terms of who makes the buying decisions for AI infrastructure. To select the best system for a given enterprise’s AI training workloads, decision makers must have a good idea of the requirements from the data scientist/developer side as well as the IT operations side. When selecting the right storage infrastructure for these workloads, it is important that all those affected are consulted and have an opportunity to agree on objectives and priorities. 

The opportunity for enterprises when adding a new storage system specifically for AI workloads is to determine how much other workload consolidation that platform could (and should) support. The more AI data pipeline stages can be consolidated onto a single storage platform, the better, and when that platform has the performance and scalability to support additional workloads beyond AI, enterprises can reap a great return on investment. Vendors like NetApp have an opportunity to help enterprises understand their objectives and the requirements of different constituencies up front to help make a quick, AI workload–enabling storage decision.


While it is still early in the move toward AI in the enterprise, it is clear that this will be a central and strategic workload in digitally transformed enterprises.

NetApp ONTAP AI is a converged infrastructure stack that combines NVIDIA DGX accelerated compute, NVIDIA high-speed switching, ONTAP-based storage, and a large selection of tools to help manage AI workloads effectively. The stack combines all of those components into an integrated system, purchased under a single SKU, that is easy to buy and deploy, is fully supported by NVIDIA, and includes a unified management interface that boosts administrative productivity for these types of configurations.

Enterprises in the process of adding and/or scaling AI-driven workloads would do well to consider NetApp as the storage platform for these environments.

Storage Infrastructure Considerations for Artificial Intelligence and Deep Learning Training Workloads in Enterprises

NetApp provides a comprehensive range of AI training solutions that tackle customer challenges at every stage of the AI journey. Our solutions, developed in partnership with NVIDIA and other industry leaders, are designed to simplify operations, minimize deployment risk, and often lower cloud costs, regardless of whether customer workflows run on premises, in the cloud, or across multiple clouds.

For more information on our AI solutions, please visit