Scaling HPC applications

Applies to: ✔️ Linux VMs ✔️ Windows VMs ✔️ Flexible scale sets ✔️ Uniform scale sets

Optimal scale-up and scale-out performance of HPC applications on Azure requires performance tuning and optimization experiments for the specific workload. This section and the VM series-specific pages offer general guidance for scaling your applications.

Application setup

The azurehpc repo contains many examples of:

Setting up and running applications optimally.
Configuring file systems and clusters.
Tutorials on how to get started easily with some common application workflows.

Optimally scaling MPI

The following suggestions apply for optimal application scaling efficiency, performance, and consistency:

For smaller scale jobs (< 256K connections) use: UCX_TLS=rc,sm
For larger scale jobs (> 256K connections) use: UCX_TLS=dc,sm
To calculate the number of connections for your MPI job, use: Max Connections = (processes per node) × (number of nodes per job) × (number of nodes per job)
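The connection formula and the two UCX_TLS suggestions above can be combined into a small helper. The following Python sketch is illustrative only: the function names are hypothetical, "256K" is assumed to mean 256 × 1024, and a job that lands exactly on the threshold (a case the guidance doesn't cover) is treated as large scale. The sample values of 120 processes per node are also assumptions chosen for the demonstration.

```python
# Assumption: "256K" is read as 256 * 1024. The guidance does not say
# whether K means 1000 or 1024.
CONNECTION_THRESHOLD = 256 * 1024


def max_connections(processes_per_node: int, nodes_per_job: int) -> int:
    """Max Connections = (processes per node) x (nodes per job) x (nodes per job)."""
    if processes_per_node < 1 or nodes_per_job < 1:
        raise ValueError("processes_per_node and nodes_per_job must be positive")
    return processes_per_node * nodes_per_job * nodes_per_job


def suggested_ucx_tls(processes_per_node: int, nodes_per_job: int) -> str:
    """Return the UCX_TLS value suggested for the job's connection count."""
    connections = max_connections(processes_per_node, nodes_per_job)
    # Assumption: a job exactly at the threshold is treated as large scale.
    return "rc,sm" if connections < CONNECTION_THRESHOLD else "dc,sm"


if __name__ == "__main__":
    # Hypothetical job shapes: 120 processes per node on 16 and 64 nodes.
    for ppn, nodes in [(120, 16), (120, 64)]:
        print(
            f"{ppn} processes/node x {nodes} nodes: "
            f"{max_connections(ppn, nodes)} connections, "
            f"UCX_TLS={suggested_ucx_tls(ppn, nodes)}"
        )
```

The value returned by the helper is what you would export as UCX_TLS in the job environment before launching the MPI application.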

Adaptive Routing

Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting optimal network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency. For more information, see the TechCommunity article.

Process pinning

Pin processes to cores using a sequential pinning approach (as opposed to an autobalance approach).
Binding by NUMA node, core, or hardware thread is better than default binding.
For hybrid parallel applications (OpenMP+MPI), use four threads and one MPI rank per CCX on HB and HBv2 VM sizes. For information on CCXs, see the HB-series virtual machines overview.
For pure MPI applications, experiment with one to four MPI ranks per CCX for optimal performance on HB and HBv2 VM sizes.
Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using three or two cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI 'Allreduce' may benefit from this approach.
For larger scale runs, it's recommended to use UD or hybrid RC+UD transports. Many MPI libraries and runtime libraries (such as UCX or MVAPICH2) use these transports internally. Check your transport configurations for large-scale runs.
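The pinning suggestions above can be turned into explicit core lists. The following Python sketch is a minimal illustration, not a definitive layout: the function names are hypothetical, it assumes four cores per CCX, and it assumes that core IDs are numbered contiguously within each CCX. Check the actual topology of your VM size before relying on either assumption.

```python
from typing import Dict, List


def cores_to_pin(num_ccx: int, cores_per_ccx: int, used_per_ccx: int) -> List[int]:
    """List core IDs in sequential order, using the first `used_per_ccx` cores of each CCX.

    Assumption: cores are numbered contiguously within a CCX, so CCX i
    owns cores i * cores_per_ccx through (i + 1) * cores_per_ccx - 1.
    """
    if num_ccx < 1:
        raise ValueError("num_ccx must be positive")
    if not 1 <= used_per_ccx <= cores_per_ccx:
        raise ValueError("used_per_ccx must be between 1 and cores_per_ccx")
    return [
        ccx * cores_per_ccx + offset
        for ccx in range(num_ccx)
        for offset in range(used_per_ccx)
    ]


def hybrid_layout(num_ccx: int, cores_per_ccx: int = 4) -> Dict[int, List[int]]:
    """Map one MPI rank per CCX to the cores that its OpenMP threads would use."""
    if num_ccx < 1:
        raise ValueError("num_ccx must be positive")
    return {
        rank: list(range(rank * cores_per_ccx, (rank + 1) * cores_per_ccx))
        for rank in range(num_ccx)
    }


if __name__ == "__main__":
    # Pure MPI with a reduced core count: two of four cores on each of three CCXs.
    print(cores_to_pin(num_ccx=3, cores_per_ccx=4, used_per_ccx=2))
    # Hybrid OpenMP+MPI: one rank and four threads per CCX, shown for two CCXs.
    print(hybrid_layout(num_ccx=2))
```

The resulting lists can be passed to whatever explicit pinning mechanism your MPI library or job launcher provides. Varying `used_per_ccx` from one to four corresponds to the experiments suggested for pure MPI and memory-bandwidth-sensitive applications.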

Compiling applications

Next steps

Test your knowledge with a learning module on optimizing HPC applications on Azure.
Review the HBv3-series overview and HC-series overview.
Read about the latest announcements, HPC workload examples, and performance results at the Azure Compute Tech Community Blogs.
Learn more about HPC on Azure.

URL: https://learn.microsoft.com/en-us/azure/virtual-machines/compiling-scaling-applications