OpenMP Accelerator Support for GPUs

A contributed article by Kelvin Li, an advisory software developer at the IBM Toronto Lab.

Over the past decades, the growth in CPU speed has been slowing down while the problems we want to solve have become more complex. With the advancement of GPU technology, harnessing the computing power of the GPU has become a promising approach. This presents a challenge to the programming model because the architecture is no longer homogeneous. Programmers who want to offload the compute-intensive parts of an application to a GPU or other device may not want to deal with different ISAs (Instruction Set Architectures) in a single application. A programming model that makes the underlying hardware transparent and provides a high level of usability is needed.

The OpenMP language committee has been adding features to the specification to exploit hardware with offloading capability. OpenMP API 4.0 (published in 2013) introduced a set of directives that instruct the compiler and runtime to offload a block of code to a device, which can be a GPU, an FPGA, etc. The accelerator subcommittee continued this effort by adding more features and clarifications to the device constructs in OpenMP API 4.5 (published in 2015).
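
As a minimal sketch of what such a directive looks like in practice (my own illustration, not text from the specification), the following fragment offloads a simple loop to the default device; the array names and sizes are invented for the example.

#include <cstdio>

int main() {
  const int n = 1024;
  double a[n], b[n];
  for (int i = 0; i < n; ++i) { a[i] = 0.0; b[i] = i; }

  // Offload the loop to the default device; the map clauses describe
  // which data moves to the device (b) and back to the host (a).
  #pragma omp target teams distribute parallel for map(to: b[0:n]) map(tofrom: a[0:n])
  for (int i = 0; i < n; ++i)
    a[i] = 2.0 * b[i];

  printf("a[10] = %f\n", a[10]);
  return 0;
}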

Bring Us Closer to Exascale Computing

In recent generations of the POWER architecture, the POWER processor can be attached to an Nvidia GPU via the high-speed NVLink interconnect for fast data transfer between the CPU and the GPU. This hardware configuration is an essential part of the CORAL project with the U.S. national labs and brings us closer to exascale computing. The IBM XL compilers have a long history of supporting the OpenMP API, starting from the first version of the specification. The XL compilers continue to support the OpenMP specification and to exploit the POWER hardware architecture with GPUs. The XL compiler team works closely with the IBM Research team to develop the compiler infrastructure for the offloading mechanism, and also collaborates with the open source community on the runtime interface for the GPU device runtime library. The first XL compiler that supports Nvidia GPU offloading was released in December 2016.

Offloading Compute Intensive Code to the GPU

I will take the LULESH benchmark as a simple example to illustrate the advantage of offloading compute-intensive code to the GPU. A hot loop, shown below, is annotated with “#pragma omp parallel for” for parallelization on the CPU, or with “#pragma omp target teams distribute parallel for” for offloading to the GPU (an annotated sketch follows the loop). The speedup from offloading to the GPU (OpenMP GPU) is much higher than that from parallelizing on the CPU (OpenMP CPU). In this case, the loop accesses memory in a regular pattern, which helps exploit the GPU computing power.

for (int i = 0 ; i < n ; ++i) {
  int indx = idxlist[i] ;
  double dtf = 1.0e+20;
  double dtdvov = 1.0e+20;
  if (vdov[indx] != double(0.)) {
    dtf = ss[indx] * ss[indx] ;
    dtdvov = dvovmax / (FABS(vdov[indx]) + double(1.e-20));
    if (vdov[indx] < double(0.)) {
      dtf = dtf + qqc2 * arealg[indx] * arealg[indx] * vdov[indx] * vdov[indx] ;
    }
    dtf = SQRT(dtf) ;
    dtf = arealg[indx] / dtf ;
  }
  dtcourant = dtf < dtcourant ? dtf : dtcourant;
  dthydro = dtdvov < dthydro ? dtdvov : dthydro;
}
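
The sketch below shows how the two annotations can be applied to this loop. It is illustrative rather than the benchmark's exact code: the function signature, the map clauses (with array sections such as idxlist[0:n]), and the min reductions on the two results are assumptions I add here to make the example self-contained and race-free.

#include <cmath>

void calc_constraints(int n, const int *idxlist, const double *vdov,
                      const double *ss, const double *arealg,
                      double dvovmax, double qqc2,
                      double &dtcourant, double &dthydro) {
  double dtc = dtcourant, dth = dthydro;

  // GPU version using the OpenMP 4.5 device constructs.  To run the same
  // loop on the CPU instead, replace the directive with
  // "#pragma omp parallel for reduction(min: dtc, dth)" and drop the map clauses.
  #pragma omp target teams distribute parallel for \
      map(to: idxlist[0:n], vdov[0:n], ss[0:n], arealg[0:n]) \
      map(tofrom: dtc, dth) reduction(min: dtc, dth)
  for (int i = 0; i < n; ++i) {
    int indx = idxlist[i];
    double dtf = 1.0e+20;
    double dtdvov = 1.0e+20;
    if (vdov[indx] != 0.0) {
      dtf = ss[indx] * ss[indx];
      dtdvov = dvovmax / (std::fabs(vdov[indx]) + 1.0e-20);
      if (vdov[indx] < 0.0)
        dtf += qqc2 * arealg[indx] * arealg[indx] * vdov[indx] * vdov[indx];
      dtf = std::sqrt(dtf);
      dtf = arealg[indx] / dtf;
    }
    dtc = dtf < dtc ? dtf : dtc;
    dth = dtdvov < dth ? dtdvov : dth;
  }

  dtcourant = dtc;
  dthydro = dth;
}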

Performance Results – End-to-End

LULESH – Speedup Over Serial (Higher is Better)

[Chart: LULESH speedup over serial for the OpenMP CPU and OpenMP GPU versions]

The XL compiler takes a holistic approach to exploiting the POWER processor connected to Nvidia GPUs. In typical applications, some parts are suitable for running on the GPU to get a speedup, while other parts are better run on the CPU. Hence, performance on both the CPU and the GPU matters for the overall application. The diagram below outlines the compilation flow and its components, illustrating how GPU offloading is supported.

[Diagram: outline of the XL compiler compilation flow for GPU offloading]

OpenMP 4.5 Support in XL C/C++ and Fortran

An OpenMP program (C, C++ or Fortran) with device constructs is fed into the High-Level Optimizer and partitioned into CPU and GPU parts. The intermediate code is optimized by the High-Level Optimizer; this optimization benefits the code generated for both the CPU and the GPU. The CPU part is sent to the POWER Low-Level Optimizer for further optimization and code generation. The GPU part is translated to LLVM IR and then fed into the LLVM optimizer in the CUDA Toolkit for Nvidia-specific optimization and PTX code generation. Finally, the linker is invoked to link the objects into an executable. From this outline, one can see that the compiler draws on expertise from both worlds to ensure the application is optimized accordingly: for the CPU part, the POWER Low-Level Optimizer, which accumulates many years of optimization knowledge on the POWER architecture, generates optimized code; for the GPU part, the GPU expertise from the CUDA Toolkit is used to generate optimized code for the Nvidia device. As a result, the entire application is optimized in a balanced way.

The XL C/C++ V13.1.5 and XL Fortran V15.1.5 compilers are among the first compilers to support Nvidia GPU offloading using the OpenMP 4.5 programming model. This release supports the basic device constructs (i.e. the target, target update and target data directives), allowing users to experiment with the offloading mechanism and port code to the GPU. The other important aspect of offloading computation to devices is data mapping; the map clause is also supported in this release. As with any leading-edge technology, the development process is iterative: we provide functionality, receive user feedback, and use that feedback to improve the next iteration. The same applies to the XL compiler. In addition, the compiler team collects feedback from users and channels it back to the committee to improve the usability of the OpenMP features. This effort is ongoing in both the development of the specification and its implementation. The XL compiler team continues to develop the remaining OpenMP 4.5 features after this release. More updates will come. Stay tuned.
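
To illustrate how these constructs fit together, here is a small sketch of my own (not taken from the product documentation): a target data region keeps an array resident on the device across two kernels, and target update refreshes the host copy in between. The array name and sizes are invented for the example.

#include <cstdio>

int main() {
  const int n = 1000;
  double x[n];
  for (int i = 0; i < n; ++i) x[i] = i;

  // Keep x mapped on the device for the whole region to avoid
  // transferring it before and after every kernel.
  #pragma omp target data map(tofrom: x[0:n])
  {
    // First kernel: runs on the device using the mapped copy of x.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i) x[i] *= 2.0;

    // Copy the current device values back so the host can inspect them
    // without ending the data region.
    #pragma omp target update from(x[0:n])
    printf("after first kernel: x[1] = %f\n", x[1]);

    // Second kernel: reuses the data already resident on the device.
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < n; ++i) x[i] += 1.0;
  }  // x is copied back to the host here (tofrom).

  printf("after data region: x[1] = %f\n", x[1]);
  return 0;
}

With the XL compilers, such a program is built with the OpenMP offloading options (for example, -qsmp=omp -qoffload); see the product documentation for the exact invocation.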

Author Biography

Kelvin Li is an advisory software developer at the IBM Toronto Lab. He has been working on the compiler team for more than 15 years, on the C, C++ and Fortran front ends. He also participates in the llvm/clang projects, implementing OpenMP pragmas in the front end. Kelvin represents IBM in the OpenMP ARB and actively participates in the development of the OpenMP specifications. He also chairs the OpenMP Fortran subcommittee, which rebases the Fortran language supported in the specification and resolves Fortran-specific issues.

Additional Material