OpenMP Offload in Applications of the Exascale Computing Project
Authors: Vivek Kale and Matthijs van Waveren
OpenMP is a crucial ingredient for porting and running applications on supercomputers, in particular on exascale supercomputers whose nodes contain accelerators such as GPUs. Experience with OpenMP at the US Department of Energy (DoE), and specifically in its Exascale Computing Project (ECP), helps to show this. A variety of the DoE's ECP applications run on its emerging supercomputers have made use of open-source OpenMP implementations from the LLVM® and GNU® projects as well as proprietary OpenMP implementations from vendors such as HPE®, NVIDIA®, AMD®, IBM®, and Intel®. A critical use of the OpenMP API in these applications is the set of features that offload an application's work from a CPU (host) to an accelerator (device). The DoE uses the OpenMP API in many ECP applications, such as GAMESS, GenASiS, Grid, LSMS, GESTS, and QMCPACK. The DoE also uses the OpenMP API in libraries that other applications link to, such as RAJA, a library that provides performance portability of C++ applications across accelerators, and PLASMA and SLATE, libraries for numerical linear algebra. This blog post describes the strategies used to implement OpenMP offload functionality in these DoE projects. We hope to inspire readers to use OpenMP offload in their applications. More detail on the implementations can be found in the original article.
QMCPACK is an application for quantum Monte Carlo simulations. On the CPU, the population of "walking" electrons in the simulation is divided into smaller crowds of walkers, which are parallelized with #pragma omp parallel for. Each crowd of walkers is then partitioned into sets of one or more walkers, each assigned to a stream, i.e., an accelerator thread team; a league of OpenMP thread teams runs on the accelerator. The work for the electrons of each walker is then parallelized within its stream, or thread team, with omp parallel for simd. One can view the implementation of the OpenMP offload version in the GitHub repo. More details can be found by viewing the SC20 booth talk here.
Grid is an application that simulates particle interactions between quarks and gluons, which helps in understanding the beginning of the universe. Grid is part of the ECP LatticeQCD project investigating quantum chromodynamics. GridMini is a mini-app for the full Grid application code: it represents the most time-consuming parts of the Grid application, with a focus on their use of OpenMP offload. GridMini uses an accelerator_for macro, which can parallelize a for loop with CUDA®, OpenACC®, OpenMP, or another node-level programming model. The OpenMP version of the accelerator_for macro uses target teams distribute parallel for thread_limit(thread_lim) num_threads(nthreads), and the programmer tunes the thread_lim and nthreads parameters. GridMini primarily uses LLVM's implementation of the OpenMP API. One can see how GridMini uses OpenMP offload features directly by looking at the GridMini code. The full Grid code is available on GitHub.
RAJA is an application library, developed at Lawrence Livermore National Laboratory (LLNL), used for performance-portable node-level parallelization of the DoE's C/C++ applications. RAJA uses OpenMP as one of its backends for parallelization on accelerator nodes; other backends target CUDA, OpenACC, and TBB®. To test how well RAJA uses OpenMP offload, an OpenMP-friendly RAJA version of a daxpy loop using a reduction was developed: a local reduction variable passed in by reference is used for efficient and portable use of accelerators. For more information, see the RAJA GitHub file. Passing a bare double reference to the loop body makes it possible to use OpenMP reductions and to have only a single reduction tree rather than one per variable being reduced. With an experimental microbenchmark using RAJA's new interface for reductions, the OpenMP target region was found to be 7x faster than even the corresponding highly optimized CUDA implementation. RAJA is open-source and available on GitHub.
SLATE and PLASMA
PLASMA and SLATE are numerical linear algebra libraries that rely heavily on OpenMP to express runtime dependencies between linear algebra kernels. These kernels represent basic on-core or on-device units of work that were extracted from BLAS and LAPACK code to form a task-based representation of common linear algebra functions. SLATE and PLASMA use OpenMP throughout, and each numerical linear algebra routine uses a set of OpenMP offload features. Both libraries use OpenMP tasks to improve the overlap of MPI communication with OpenMP computation and for load balancing within a node, and both make heavy use of tiling and blocking to improve spatial data locality. For a QR factorization in SLATE run on Cori-GPU with 4 nodes (8 GPUs each), compiled with GCC version 8.3, a CPU+GPU version with OpenMP offload obtains 12 teraflop/s while the CPU-only version with OpenMP obtains 3 teraflop/s, a 4x performance increase. One can use SLATE or PLASMA on DoE supercomputers by loading the libraries through the Spack package manager. More information can be found by viewing the SC20 OpenMP booth talk here.
GenASiS is a General Astrophysical Simulation System used for large-scale simulations of astrophysical phenomena. OpenMP offload has been used to increase the size of the systems that can be simulated. First, the data allocation on the host is mirrored on the device, and second, the memory allocation on the host is associated with that on the device. This association, accomplished with the OpenMP library routine omp_target_associate_ptr(), avoids implicit data mapping and data transfer when the OpenMP runtime encounters the host variable within a target region. One of the lessons learned is that a static schedule with a chunk size of 1, i.e., schedule(static, 1), gives the best performance on the device with both IBM's and GCC's OpenMP implementations. Benchmarks of GenASiS in 3D with 256³ cells for 50 cycles show that the performance of the OpenMP offload version matches that of the CUDA version. More details can be found in the SC booth talk here.
GESTS (GPUs for Extreme Scale Turbulence Simulations) is a pseudo-spectral direct numerical simulation (DNS) application, written in Fortran, that is capable of simulating turbulent flows using trillions of grid points. It uses advanced OpenMP features to target GPUs portably. Communication between nodes, data copies between the CPU and the GPU, and computations on the GPU are overlapped with each other to improve performance. More specifically, it uses omp_target_memcpy_rect() to copy a specified subvolume to the GPU and the OpenMP task depend clause to invoke vendor libraries like rocFFT and cuFFT asynchronously; the depend clause helps to enforce the necessary synchronization. On Summit, for a problem size of 1.8 trillion grid points (12,288³), the basic synchronous version of the code shows performance comparable to a synchronous CUDA Fortran version, and a speedup of 2.57x over the CPU version is obtained with OpenMP offload. Larger speedups are expected for larger problem sizes and the full DNS code. For more information, see the OpenMP SC booth talk here.
GAMESS is a package used for electronic structure quantum chemistry computations. It uses OpenMP to resolve bottlenecks in the computation of a large number, O(N⁴), of integrals and in operations on large, O(N²), matrices, where N is a measure of the size of the molecules. To develop the OpenMP offload implementation, load imbalance and branching were first reduced by refactoring the code; then OpenMP offload directives were added, in particular omp target teams distribute parallel do. Using IBM's XL OpenMP implementation on one GPU of a Summit node, the OpenMP offload code obtains between ⅕x and 5x speedup over code running on two Power9 CPUs, depending on the size of the molecules. The 5x speedup on the largest molecule demonstrates near-peak floating-point performance on the GPU, since the ratio of the peak performance of one V100 to that of two Power9 CPUs is 7x. For more information, see the OpenMP booth talk here.
LSMS is an application for understanding the behavior of materials; its runtime is dominated by complex dense linear algebra. It uses OpenMP for performance portability together with a variety of compiler optimizations, and it evaluates the performance gap between OpenMP offload and a library-based approach to GPU parallelization. In its OpenMP offload implementation, a matrix is built on the device and then solved with a standard matrix solver; the implementation uses target teams distribute parallel for for many routines and the target update directive for explicit data updates from the device to the host. When running LSMS on one NVIDIA Volta® GPU, the OpenMP offload version is 2x slower than the CUDA version. Improvements can likely be made to the OpenMP offload version through, among other things, in-depth experimentation with LLVM OpenMP's support for complex numbers.
All the applications described here show that OpenMP offload allows for performance portability on DoE supercomputers and that it shows promise for heterogeneous accelerators. Experiments show that the performance of OpenMP offload comes close to that of the corresponding CUDA versions and illustrate the progress of OpenMP implementations over time. The OpenMP application experience within the US DoE drives DoE efforts to improve the performance and correctness of OpenMP 5.x offload features in OpenMP implementations. It also drives DoE contributions to proposals for novel features in upcoming versions of the OpenMP specification. Members of the ECP's Software Technology projects involving OpenMP contribute to the development of these OpenMP 5.1 features, focusing on LLVM's OpenMP implementation given the DoE's investment in it. They also actively participate in the development of the OpenMP 6.0 specification, contributing OpenMP features relevant to the next generation of DoE supercomputers. For example, ECP's SOLLVE members are working on OpenMP loop transformation features for CPUs and GPUs, and the tile and unroll loop transformations are supported in clang/LLVM as of LLVM 13. By documenting such application experiences with OpenMP and identifying gaps in required OpenMP support based on them, the DoE's ECP can provide a valuable contribution to the OpenMP community and help with the development of OpenMP for emerging and next-generation supercomputers. We hope that this process of identifying OpenMP application experiences, useful for driving contributions to the development of OpenMP, is also useful to organizations around the world that want to make lasting and high-impact contributions to the OpenMP API.
Seonmyeong Bak, Colleen Bertoni, Swen Boehm, Reuben Budiardja, Barbara M. Chapman, Johannes Doerfert, Markus Eisenbach, Hal Finkel, Oscar Hernandez, Joseph Huber, Shintaro Iwasaki, Vivek Kale, Paul R. C. Kent, JaeHyuk Kwack, Meifeng Lin, Piotr Luszczek, Ye Luo, Buu Pham, Swaroop Pophale, Kiran Ravikumar, Vivek Sarkar, Thomas Scogland, Shilei Tian, P. K. Yeung, "OpenMP Application Experiences: Porting to Accelerated Nodes", Parallel Computing, Volume 109, March 2022, https://doi.org/10.1016/j.parco.2021.102856