Scaling HPC applications
Applies to: ✔️ Linux VMs ✔️ Windows VMs ✔️ Flexible scale sets ✔️ Uniform scale sets
Optimal scale-up and scale-out performance of HPC applications on Azure requires performance tuning and optimization experiments for the specific workload. This section and the VM series-specific pages offer general guidance for scaling your applications.
Application setup
The azurehpc repo contains many examples of:
- Setting up and running applications optimally.
- Configuring file systems and clusters.
- Tutorials on how to get started easily with some common application workflows.
Optimally scaling MPI
The following suggestions apply for optimal application scaling efficiency, performance, and consistency:
For smaller scale jobs (< 256K connections) use:
UCX_TLS=rc,sm
For larger scale jobs (> 256K connections) use:
UCX_TLS=dc,sm
To calculate the number of connections for your MPI job, use:
Max Connections = (processes per node) x (number of nodes per job) x (number of nodes per job)
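As a rough illustration, the connection formula and the transport choice above could be scripted in a job launcher as follows. This is a minimal sketch, not part of the azurehpc repo: the function names `max_connections` and `select_ucx_tls` and the `PPN` and `NODES` variables are hypothetical, "256K" is assumed to mean 256 x 1024 = 262144, and a job exactly at the threshold (which the guidance doesn't cover) is treated as a larger scale job.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; they are not part of UCX or azurehpc.

# Max Connections = (processes per node) x (nodes per job) x (nodes per job)
max_connections() {
  # $1 = processes per node, $2 = number of nodes per job
  echo $(( $1 * $2 * $2 ))
}

# Choose the UCX_TLS value from the connection count.
# Assumption: "256K" is read as 256 * 1024, and a count exactly at the
# threshold is treated as a larger scale job.
select_ucx_tls() {
  # $1 = connection count
  if [ "$1" -lt $(( 256 * 1024 )) ]; then
    echo "rc,sm"
  else
    echo "dc,sm"
  fi
}

# Example values; override them from the environment of your job script.
ppn=${PPN:-120}
nodes=${NODES:-16}

connections=$(max_connections "$ppn" "$nodes")
UCX_TLS=$(select_ucx_tls "$connections")
export UCX_TLS

echo "connections=$connections UCX_TLS=$UCX_TLS"
```

With the example defaults (120 processes per node on 16 nodes), the formula gives 30,720 connections, which falls in the smaller scale range. Treat the result as a starting point and confirm it with scaling experiments for your workload.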
Adaptive Routing
Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting optimal network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency. For more information, see the TechCommunity article.
Process pinning
- Pin processes to cores using a sequential pinning approach (as opposed to an autobalance approach).
- Binding by NUMA/Core/HwThread is better than default binding.
- For hybrid parallel applications (OpenMP+MPI), use four threads and one MPI rank per CCX on HB and HBv2 VM sizes. For information about CCXs, see the HB-series virtual machines overview.
- For pure MPI applications, experiment with between one and four MPI ranks per CCX for optimal performance on HB and HBv2 VM sizes.
- Some applications with extreme sensitivity to memory bandwidth may benefit from using a reduced number of cores per CCX. For these applications, using three or two cores per CCX may reduce memory bandwidth contention and yield higher real-world performance or more consistent scalability. In particular, MPI 'Allreduce' may benefit from this approach.
- For larger scale runs, it's recommended to use UD or hybrid RC+UD transports. Many MPI libraries/runtime libraries use these transports internally (such as UCX or MVAPICH2). Check your transport configurations for large-scale runs.
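To experiment with a reduced number of cores per CCX, as suggested in the pinning guidance above, one option is to generate an explicit core list and hand it to your MPI launcher's pinning setting. The sketch below assumes that core IDs are numbered consecutively within each CCX, which you should confirm on the VM with a tool such as lscpu or lstopo. The `ccx_core_list` function is a hypothetical helper, and the 8-core layout in the example is a toy value, not a real VM size.

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
# Prints a comma-separated list of core IDs that uses only the first
# "use" cores of every CCX.
# Assumption: core IDs are consecutive within a CCX (verify with lscpu/lstopo).
ccx_core_list() {
  # $1 = total cores, $2 = cores per CCX, $3 = cores to use per CCX
  total=$1
  per_ccx=$2
  use=$3
  list=""
  core=0
  while [ "$core" -lt "$total" ]; do
    # Keep the core only if its position inside its CCX is below "use".
    if [ $(( core % per_ccx )) -lt "$use" ]; then
      list="${list:+$list,}$core"
    fi
    core=$(( core + 1 ))
  done
  echo "$list"
}

# Toy example: 8 cores, 4 cores per CCX, use 3 cores per CCX.
ccx_core_list 8 4 3   # prints 0,1,2,4,5,6
```

The resulting list can then be supplied to whichever pinning option your MPI library provides. Compare runs with four, three, and two cores per CCX to see which gives the best performance and consistency for your application.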
Compiling applications
Next steps
- Test your knowledge with a learning module on optimizing HPC applications on Azure.
- Review the HBv3-series overview and HC-series overview.
- Read about the latest announcements, HPC workload examples, and performance results at the Azure Compute Tech Community Blogs.
- Learn more about HPC on Azure.
