The deep learning component of AI can be a high-performance computing problem, as it demands both a large amount of computation and heavy data motion (I/O and network). Deep learning requires computationally intensive training, and abundant compute power helps shorten training cycles.
High-performance computing (HPC) allows businesses to scale computationally and build deep learning algorithms that can take advantage of high volumes of data. More data brings the need for more computing capacity with stronger performance specifications, and this is driving HPC and AI to converge, unleashing a new era.
Compared to traditional servers or cloud platforms, HPC is quite different in terms of cost, computational power, specifications and architecture. Here are some use cases where HPC is being applied:
- Develop, redesign and model products
- Analyse large data sets
- Conduct large-scale research projects
Fast streaming data, coupled with high-performance computing, makes training cycles shorter so that companies can adapt their neural networks to new types of scenarios. Hundreds of fast cluster nodes make it possible to automatically evaluate complex, multi-component nonlinear applications in automobiles and manufacturing.
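To make the shorter-training-cycle idea concrete, here is a minimal sketch of data-parallel training, assuming a PyTorch environment launched with `torchrun`; the toy linear model and random batches are placeholders, not anything described in this article:

```python
# Minimal data-parallel training sketch with PyTorch DistributedDataParallel.
# Assumes a launcher such as torchrun sets RANK, WORLD_SIZE and LOCAL_RANK.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")        # join the training cluster
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    device = f"cuda:{local_rank}"

    model = DDP(torch.nn.Linear(1024, 10).to(device),
                device_ids=[local_rank])           # toy stand-in model
    optimiser = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(100):                        # each rank trains its shard
        x = torch.randn(64, 1024, device=device)   # stand-in for real data
        y = torch.randint(0, 10, (64,), device=device)
        optimiser.zero_grad()
        loss_fn(model(x), y).backward()            # grads all-reduced over nodes
        optimiser.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, say, `torchrun --nproc_per_node=8 train.py` on each node, every extra GPU works on its own slice of the data while gradients stay synchronised, which is how more hardware translates into shorter training cycles.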
An HPC system consists of hundreds to thousands of physical servers, each powered by high-end processors. The servers act as one, with all of their nodes cooperating on a single task. Such custom-designed systems can be used on-prem or provided through the cloud.
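As a rough illustration of many servers acting as one, here is a hedged sketch using MPI, the standard message-passing layer on HPC clusters, via the mpi4py bindings (an assumption; the article names no specific software stack):

```python
# Sketch: every process computes a partial result, and one collective call
# combines all of them. Requires an MPI installation plus mpi4py.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's ID within the job
size = comm.Get_size()   # total number of cooperating processes

# Each process works on its own slice of the problem...
partial = sum(range(rank * 1_000, (rank + 1) * 1_000))

# ...and allreduce merges everyone's work into one shared answer.
total = comm.allreduce(partial, op=MPI.SUM)

if rank == 0:
    print(f"{size} processes produced a combined total of {total}")
```

Run with, for example, `mpirun -np 8 python sum.py`; the same script scales from a single workstation to thousands of cluster nodes.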
With the expansion of AI/ML over the last few years, there has been an increasing need for HPC within data centres. These platforms typically combine large numbers of GPU and CPU cores working in parallel, permitting complex workloads, such as AI in medical diagnostics or large computational fluid dynamics models, to execute many more instructions than they could on traditional server infrastructure.
HPC For A Competitive Advantage In Data-Intensive Workloads
HPC advances have helped scientists and researchers achieve breakthrough innovations in healthcare, technology, retail, banking and other fields. Oregon State University, for example, harnesses advanced HPC supercomputing to help save snow leopards from extinction. Gene sequencing work is extremely data-intensive and needs ultra-powerful workstations to process the ever-growing data.
Running machine simulations to test designs without having to build anything physical is a necessary tool for any engineering company, whether it is producing self-driving cars or bridges. HPC considerably decreases the time and the number of simulations required, enabling engineers to complete designs and move them into production sooner.
Adding more cloud capacity allows researchers to run critical digital simulations with faster iterations, enabling more rapid technological breakthroughs such as developing AI-powered autonomous vehicles or improving automotive fuel efficiency and safety features.
Convergence Of AI And HPC
HPC is a cluster of systems working together seamlessly as one unit to achieve performance goals such as processing data and performing complex calculations at high speed. It was precisely the high computational demands of AI/ML and data analysis that drove the innovation of high-performance computing.
Because AI systems are designed to process vast amounts of data, they need to run on optimised hardware capable of performing trillions of calculations per second or more. This is where HPC and AI converge: HPC utilises dense computer clusters working in sync with one another to run the most advanced AI.
For example, many large banks are leveraging AI on HPC systems in new ways, processing large data sets at speed to recognise and prevent fraudulent transactions. The interconnects are optimised to ensure the clusters work quickly and in sync, and to cope with the density of components, the energy supply and cooling have to be considerably improved.
HPC: Hardware Customisations For Machine Learning
There is a growing demand for GPU- and CPU-based compute at scale among traditional enterprises as well as start-ups looking to exploit machine learning. Both high-core-count CPUs and GPUs are needed to build an HPC cluster, and demand for GPU-based processing is rising rapidly because GPUs excel at floating-point calculations. These applications require racks full of high-end AMD or NVIDIA GPUs such as the Titan X or Tesla cards (currently NVIDIA leads the field with its CUDA language and cores).
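A quick way to see that floating-point advantage in practice: the sketch below times the same single-precision matrix multiplication on the CPU and, if one is present, on a CUDA GPU. The matrix size and the choice of PyTorch are illustrative, and actual speedups depend entirely on the hardware.

```python
# Time one large single-precision matmul on CPU vs GPU with PyTorch.
import time
import torch

n = 4096
a = torch.randn(n, n)
b = torch.randn(n, n)

start = time.perf_counter()
_ = a @ b                               # dense float32 matmul on the CPU
cpu_s = time.perf_counter() - start

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    torch.cuda.synchronize()            # CUDA calls are asynchronous,
    start = time.perf_counter()         # so sync before and after timing
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()
    gpu_s = time.perf_counter() - start
    print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s")
else:
    print(f"CPU: {cpu_s:.3f}s (no CUDA device visible)")
```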
Here are a few enterprise examples that illustrate how fast HPC is evolving:
Oracle Cloud Infrastructure (OCI) recently ramped up its HPC prowess with major announcements, including the general availability of cloud instances with NVIDIA's latest powerful GPU, the A100. Oracle also expanded collaborations with Rescale and Altair to build robust cloud HPC infrastructure. Oracle has long focused on the growing commercial use of HPC among its enterprise users.
Looking at the specs, the new bare metal instance, GPU4.8, includes eight NVIDIA A100 Tensor Core GPUs with 40 GB of memory each, all connected with NVIDIA NVLink. On the CPU side are 64 physical cores of AMD Rome processors operating at 2.9 GHz, backed by 2,048 GB of RAM and 24 TB of NVMe storage. Oracle's latest bare-metal GPU instance combines this with its high-speed, low-latency Cluster Network architecture, bringing the NVIDIA A100 Tensor Core GPUs into its cloud service and offering the ability to scale to more than 500 GPUs interconnected with Mellanox networking for large-scale distributed workloads.
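From the software side, such a node simply appears as eight addressable CUDA devices. The small illustrative check below runs on any machine with PyTorch and CUDA drivers; the per-device comments reflect roughly what an A100 node would report, not Oracle-specific output.

```python
# Enumerate the GPUs a multi-GPU node exposes to software.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):      # e.g. 8 on a GPU4.8 node
        props = torch.cuda.get_device_properties(i)
        mem_gb = props.total_memory / 1024**3       # e.g. ~40 GB per A100
        print(f"GPU {i}: {props.name}, {mem_gb:.0f} GB, "
              f"{props.multi_processor_count} SMs")
else:
    print("No CUDA devices visible")
```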
A pre-configured Data Science and AI image, available in the Oracle Cloud Marketplace, includes NVIDIA's deep neural network libraries, common ML/deep learning frameworks, Jupyter Notebooks and popular Python/R integrated development environments, all of which can run on the HPC products.
Speaking of Oracle's partners for its HPC project, Rescale, in collaboration with Hyundai Motor Group, also announced that it is building a multi-cloud high-performance computing (HPC) environment for innovation in the smart mobility industry. Rescale's turnkey platform can help companies efficiently run hundreds of simulation software packages on multi-cloud HPC infrastructure, empowering researchers to scale and accelerate research as required on a single platform.
Another firm, NUVIA, raised $240 million in a Series B funding round. The firm is building a leading-edge SoC and CPU core, codenamed “Orion” and “Phoenix” respectively, designed to deliver industry-leading performance on real cloud workloads. NUVIA's first-generation CPU, Phoenix, will be a custom core based on the Arm architecture. NUVIA was founded with the goal of reimagining chip design for high-performance computing applications and is focused on building products that blend the best attributes of computing performance, power efficiency and scalability.
These developments and more suggest that high-performance computing is indeed the next frontier for enterprise AI, and many large tech companies are seizing the opportunity to facilitate these developments.