AI Training Compute Infrastructure Considerations
AI infrastructure has begun to resemble high-performance computing (HPC) infrastructure: large-scale AI systems increasingly look like supercomputers, while supercomputers are increasingly used to execute AI workloads. AI and HPC are converging not only on the same infrastructure but also as workloads, with AI front-ending an HPC simulation to reduce the number of simulation runs, or with an AI model operating as a side process inside an HPC loop. IDC refers to these types of applications as performance-intensive computing (PIC), an umbrella term for AI, HPC, and big data and analytics workloads. PIC has upended the homogeneous, general-purpose datacenter and driven a shift toward purpose-built designs.
The major driver of these developments is AI model size. To achieve high levels of accuracy, it is not enough to feed an AI training model large amounts of data; the model also needs a large number of layers and parameters, which are the weights of the connections in the neural network.
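As a rough illustration (not drawn from the article), the sketch below counts the weights in a plain fully connected network to show how width and depth drive parameter count; the layer sizes are arbitrary examples.

```python
# Illustrative sketch: parameter counting for a simple fully connected network.
# Layer sizes are arbitrary; real AI models use more elaborate architectures.

def dense_layer_params(n_in: int, n_out: int) -> int:
    """Weights plus biases for one fully connected layer."""
    return n_in * n_out + n_out

def mlp_params(layer_sizes: list[int]) -> int:
    """Total parameters for a stack of dense layers, e.g. [1024, 1024, 1024, 1024]."""
    return sum(dense_layer_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

# Doubling the width roughly quadruples the parameter count,
# while adding layers grows it roughly linearly.
print(mlp_params([1024] * 4))   # ~3.1M parameters
print(mlp_params([2048] * 4))   # ~12.6M parameters
```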
Unfortunately, the number of parameters in a neural network and the volume of data fed into it correlate directly with the amount of compute required. In other words, the more capable and/or accurate an AI model needs to be, the more compute is required to train it. What's more, the faster an organization wants its model trained, the more compute is required as well: because model training can be distributed across many nodes in a cluster, a larger cluster trains a model faster.
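To make that scaling concrete, here is a minimal sketch using the commonly cited approximation of roughly 6 FLOPs per parameter per training token. The per-accelerator throughput, model size, and token count below are assumed figures for illustration, and perfectly linear scaling across the cluster is an idealization rather than something the article claims.

```python
# Minimal sketch of training-compute scaling, under two stated assumptions:
# (1) training a dense model costs ~6 FLOPs per parameter per training token,
# (2) sustained per-accelerator throughput is an assumed figure, not a measurement.

def training_flops(parameters: float, tokens: float) -> float:
    """Approximate total training compute in FLOPs (~6 * N * D rule of thumb)."""
    return 6.0 * parameters * tokens

def training_days(parameters: float, tokens: float,
                  accelerators: int,
                  sustained_flops_per_accel: float = 150e12) -> float:
    """Idealized wall-clock days, assuming perfectly linear scaling across the cluster."""
    total = training_flops(parameters, tokens)
    seconds = total / (accelerators * sustained_flops_per_accel)
    return seconds / 86_400

# Hypothetical 70B-parameter model trained on 1T tokens:
# more accelerators shorten the (ideal) training time proportionally.
for n in (256, 1024, 4096):
    print(n, "accelerators:", round(training_days(70e9, 1e12, n), 1), "days")
```

Under these assumptions the same training run takes roughly 127, 32, or 8 days on 256, 1,024, or 4,096 accelerators respectively, which is why organizations that want faster training buy larger clusters.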
The most relevant paper on this subject is The Computational Limits of Deep Learning (July 2020) by Neil C. Thompson, Kristjan Greenewald, Keeheon Lee, and Gabriel F. Manso. These researchers have demonstrated that the computational requirements of AI applications vary widely, as do their error rates and economic costs (see Figure 3).