

# Accelerating HPC Applications on AMD Instinct<sup>TM</sup> GPUs with OpenMP® offloading: An Overview

Suyash Tandon, Member of Technical Staff

May 2023



#### CAUTIONARY STATEMENT

This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) such as the features, functionality, performance, availability, timing and expected benefits of AMD's current products, future products and markets, which are made pursuant to the Safe Harbor provisions of the Private Securities Litigation Reform Act of 1995. Forward-looking statements are commonly identified by words such as "would," "may," "expects," "believes," "plans," "intends," "projects" and other terms with similar meaning. Investors are cautioned that the forward-looking statements in this presentation are based on current beliefs, assumptions and expectations, speak only as of the date of this presentation and involve risks and uncertainties that could cause actual results to differ materially from current expectations. Such statements are subject to certain known and unknown risks and uncertainties, many of which are difficult to predict and generally beyond AMD's control, that could cause actual results and other future events to differ materially from those expressed in, or implied or projected by, the forward-looking information and statements. Investors are urged to review in detail the risks and uncertainties in AMD's Securities and Exchange Commission filings, including but not limited to AMD's most recent reports on Forms 10-K and 10-Q.

AMD does not assume, and hereby disclaims, any obligation to update forward-looking statements made in this presentation, except as may be required by law.

#### **Agenda**

- 1. Introduction to MI 200 hardware
- 2. Software stack and tools
- 3. Basics of OpenMP® offloading
- 4. HIP & OpenMP® compatibility
- 5. Case studies
- 6. Heterogenous memory management (HMM)

#### **COMPUTE GPU ARCHITECTURE ROADMAP**







2020

■ 2024
AMD

together we advance\_

## 2ND GENERATION CDNA ARCHITECTURE TAILORED-BUILT FOR HPC & AI



TSMC 6NM **TECHNOLOGY** 4 MATRIX CORES PER **COMPUTE UNIT** 8 INFINITY FABRIC LINKS PER DIE

UP TO 110 CU PER **GRAPHICS CORE DIE** MATRIX CORES ENHANCED FOR HPC SPECIAL FP32 OPS FOR **DOUBLE THROUGHPUT** 

#### **MULTI-CHIP DESIGN**

#### TWO GPU DIES IN PACKAGE TO MAXIMIZE COMPUTE & DATA THROUGHPUT

INFINITY FABRIC FOR CROSS-DIE CONNECTIVITY

4 LINKS RUNNING AT 25GBPS

400GB/S OF BI-DIRECTIONALBANDWIDTH





#### From AMD MI100 to AMD MI210

MI 100

- 32GB of HBM2 memory
- 11.5 TFLOPS peak performance
- 1.2 TB/s peak memory bandwidth
- 120 CU

AMD CDNA™ 2 white paper: https://www.amd.com/system/files/documents/am d-cdna2-white-paper.pdf MI 210

- 64GB of HBM2e memory
- 26.5 TFLOPS peak performance
- 1.6 TB/s peak memory bandwidth
- 108 CU
- 128 single precision FMA operations per cycle
- AMD CDNA 2 Matrix Core supports doubleprecision data

#### SCIENTISTS TARGET APPLICATIONS FOR WIDE RANGE OF SYSTEMS

**Pre-Exascale** Systems [Aggregate Linpack (Rmax) = 323 PF!]

Cray/Intel Xeon®/KNL

First U.S Exascale Systems



HPE/Intel

IBM/NVIDIA

IBM BG/Q

# IDEAL APPLICATION DEVELOPMENT FROM THE SCIENTIST'S PERSPECTIVE

#### **Performant**

Efficient use of hardware resources for energy consumed

Scale from single to multi-node

#### **Portable**

Support both CPUs and GPUs

Execute application on various platform architectures

#### **Productive**

Optimize time to solution for new research

Abstract the computer science (code, data movement, scaling, etc)

#### PERFORMANCE VS PORTABILITY TRADEOFF

Portability drops as software is tuned for specific HW features



#### GPU PROGRAMMING IS DIFFICULT – BUT EASIER IF HW IS ABSTRACTED



#### Agenda

- Introduction to MI 200 hardware
- 2. Software stack and tools
- 3. Basics of OpenMP® offloading
- 4. HIP & OpenMP® compatibility
- 5. Case studies
- 6. Heterogenous memory management (HMM)

#### **Open Software Platform For GPU Compute**

### ROCm

- Unlocked GPU Power To Accelerate Computational Tasks
- Optimized for HPC and Deep Learning Workloads at Scale
- Open Source Enabling Innovation,
   Differentiation, and Collaboration



#### Compiler with OpenMP® support on AMD GPUs



#### **AMD** development tools





GMI

hipPushCallConfigur

ipGetDevicePropertie

GetDeviceCount

ipGetLastError

ipEventCreate

inEventDestroy

nipGetDevice

pSetDevice

1856 224177

1494 100458

330 64671

76675

51808

11611

401

120 0.000308

67 0.000138

232 0.000105

195 8.87F-05

1102 7.11E-05

181 1.59E-05

401 5.50E-07

220 3.02E-07

#### **Agenda**

- Introduction to MI 200 hardware
- 2. Software stack and tools
- 3. Basics of OpenMP® offloading
- 4. HIP & OpenMP® compatibility
- 5. Case studies
- 6. Summary

#### **Basics of OpenMP® offloading**

**OMP TARGET** 

Defines a target region to be offloaded on device

```
program target_example
    complex :: M,N
    N=(2,2)
    M=(0,0)

    !$omp target map(from:M) map(to:N)
    M=N
    !$omp end target
    write(*,*) "M= ", M

end program target_example
```

PARALLEL

Defines a parallel region within the code, usually loops

```
!$omp target parallel do map(from:A)
          do i=1, 500
          A(i) = (2,2)
          enddo
!$omp end target parallel do
```

```
!$omp target teams distribute parallel do map(from:A)
     do i=1, 500
     A(i) = (2,2)
     enddo
!$omp end target teams distribute parallel do
```

```
coe62819@cedar003:~/kernel/teamsdist> ./teamsdis
ACC: Version 5.0 of HIP already initialized, runtime version 50120532
ACC: Get Device 0
ACC: Set Thread Context
ACC: Start transfer 1 items from ./teamsdis2.f90:13
          allocate 'a(:)' (4000 bytes)
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel test $ck L13 1 blocks:1 threads:256 async(auto) from ./teamsdis2.f90:13
ACC: Wait async(auto) from ./teamsdis2.190:1/
ACC: Start transfer 1 items from ./teamsdis2.f90:17
          copy to host, free 'a(:)' (4000 bytes)
ACC: End transfer (to acc 0 bytes, to host 4000 bytes)
A(1)=(2.,2.)
coe62819@cedar003:~/kernel/teamsdist> ./teamsdis
ACC: Version 5.0 of HIP already initialized, runtime version 50120532
ACC: Get Device 0
ACC: Set Thread Context
ACC: Start transfer 1 items from ./teamsdis2.f90:13
          allocate 'a(:)' (4000 bytes)
ACC: End transfer (to acc 0 bytes, to be
ACC: Execute kernel test $ck L13 1 blocks:2 threads:256 async(auto) from ./teamsdis2.f90:13
ACC: Wait async(auto) from ./teamsdisz.190:1/
ACC: Start transfer 1 items from ./teamsdis2.f90:17
          copy to host, free 'a(:)' (4000 bytes)
ACC: End transfer (to acc 0 bytes, to host 4000 bytes)
A(1)=(2.,2.)
```

#### Basics of OpenMP® offloading

**OMP TARGET** 

Defines a target region to be offloaded on device

```
program target_example
    complex :: M,N
    N=(2,2)
    M=(0,0)

    !$omp target map(from:M) map(to:N)
    M=N
    !$omp end target
    write(*,*) "M= ", M
end program target_example
```

PARALLEL

Defines a parallel region within the code, usually loops

```
!$omp target parallel do map(from:A)
          do i=1, 500
          A(i) = (2,2)
          enddo
!$omp end target parallel do
```

DATA

Ensures that data is correctly exposed to a device

#### **Common errors**

HSA\_STATUS\_ERROR\_MEMORY\_FAULT: Agent attempted to access an inaccessible address. code: 0x2b

Data is not present on GPU!

Host region (7ffc4df0dd20 to 7ffc4df1dd20) overlaps present region (7ffc4df19e80 to 7ffc4df22e80 index 42) but is not contained for A in hamil.f90

Data is mapped to device but is not deleted/released!



#### **Debugging with AOMP: LIBOMPTARGET\_DEBUG**

```
LIBOMPTARGET_KERNEL_TRACE

Print useful statistics for device operations. Setting to 1 emits name of every kernel, number of teams, threads, and register usage. Setting to 2 prints timing and data transfer information.

Value of 1 or higher to print information from device runtime. Setting to -1 will print all information

Setting to 1 emits further detailed debugging information about data transfer operations and kernel launch.
```

```
LIBOMPTARGET DEBUG = 1
Libomptarget --> Init target library!
Libomptarget --> OMPT: library_ompt_connect = libomp_ompt_connect
Libomptarget --> OMPT: library_ompt_connect = 0x7f16ea9b1bd0
Libomptarget --> OMPT: Exit ompt init
Libomptarget --> register_image_info image 0 of 1 offload-arch:gfx90a VERSION:1
Libomptarget --> Loading RTLs...
Libomptarget --> Loading library 'libomptarget.rtl.ppc64.so'..
Libomptarget --> Unable to load 'libomptarget.rtl.ppc64.so': libomptarget.rtl.ppc64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.x86_64.so'
Libomptarget --> Successfully loaded library 'libomptarget.rtl.x86_64.so'!
Libomptarget --> Registering RTL libomptarget.rtl.x86_64.so supporting 4 devices!
Libomptarget --> Loading library 'libomptarget.rtl.cuda.so'...
Target CUDA RTL --> Start initializing CUDA
Target CUDA RTL --> Unable to load library 'libcuda.so': libcuda.so: cannot open shared object file: No such file or directory!
Target CUDA RTL --> Failed to load CUDA shared library
Libomptarget --> Successfully loaded library 'libomptarget.rtl.cuda.so'!
Libomptarget --> No devices supported in this RTL
Libomptarget --> Loading library 'libomptarget.rtl.aarch64.so'.
Libomptarget --> Unable to load 'libomptarget.rtl.aarch64.so': libomptarget.rtl.aarch64.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.ve.so'..
Libomptarget --> Unable to load 'libomptarget.rtl.ve.so': libomptarget.rtl.ve.so: cannot open shared object file: No such file or directory!
Libomptarget --> Loading library 'libomptarget.rtl.amdgpu.so'...
Target AMDGPU RTL --> Start initializing AMDGPU
Target AMDGPU RTL --> There are 8 devices supporting HSA.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Alloc allowed in memory pool check failed: HSA_STATUS_ERROR: A generic error has occurred.
Target AMDGPU RTL --> Device 0: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 1: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 2: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 3: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 4: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 5: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 6: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> Device 7: Initial groupsPerDevice 128 & ThreadsPerGroup 256
Target AMDGPU RTL --> OMPT: Entering ompt_init
Target AMDGPU RTL --> OMPT: library_ompt_connect = libomptarget_ompt_connect
Target AMDGPU RTL --> OMPT: library_ompt_connect = 0x7f16ea889f80
Libomptarget --> OMPT: Enter libomptarget_ompt_connect
Libomptarget --> OMPT: Leave libomptarget_ompt_connect
Target AMDGPU RTL --> OMPT: Exiting ompt_init
Libomptarget --> Successfully loaded library 'libomptarget.rtl.amdgpu.so'!
Libomptarget --> Registering RTL libomptarget.rtl.amdgpu.so supporting 8 devices!
Libomptarget --> Loading library 'libomptarget.rtl.rpc.so'
Libomptarget --> Unable to load 'libomptarget.rtl.rpc.so': libomptarget.rtl.rpc.so: cannot open shared object file: No such file or directory!
Libomptarget --> RTLs loaded!
Libomptarget --> Image 0x0000563eb018fb40 is NOT compatible with RTL libomptarget.rtl.x86_64.so!
Libomptarget --> Image 0x0000563eb018fb40 is compatible with RTL libomptarget.rtl.amdgpu.so!
Libomptarget --> RTL 0x0000563eb18eda40 has index 0!
Libomptarget --> Registering image 0x0000563eb018fb40 with RTL libomptarget.rtl.amdgpu.so!
Libomptarget --> Done registering entries!
M= (2.000000,2.000000)
Target AMDGPU RTL --> Finalizing the AMDGPU DeviceInfo.
Libomptarget --> Unloading target library!
Libomptarget --> Image 0x0000563eb018fb40 is compatible with RTL 0x0000563eb18eda40!
Libomptarget --> Unregistered image 0x0000563eb018fb40 from RTL 0x0000563eb18eda40!
Libomptarget --> Done unregistering images!
Libomptarget --> Removing translation table for descriptor 0x00000563eb01931f8
Libomptarget --> Done unregistering library!
```

Libomptarget --> Deinit target library!

#### CRAY\_ACC\_DEBUG=3

#### Debugging with Cray compiler: CRAY\_ACC\_DEBUG

#### CRAY ACC DEBUG=1

```
ACC: Transfer 2 items (to acc 8 bytes, to host 0 bytes) from ./teamsdis3.f90:6
ACC: Execute kernel target example <a href="mailto:sck_l6_1">$ck_l6_1</a> async(auto) from ./teamsdis3.f90:6
ACC: Wait async(auto) from ./teamsdis3.f90:8
ACC: Transfer 2 items (to acc 0 bytes, to host 8 bytes) from ./teamsdis3.f90:8
M = (2.,2.)
ACC: Version 5.0 of HIP already initialized, runtime version 50120532
ACC: Get Device 0
ACC: Set Thread Context
ACC: Start transfer 2 items from ./teamsdis3.f90:6
           allocate 'm' (8 bytes)
           allocate, copy to acc 'n' (8 bytes)
ACC: End transfer (to acc 8 bytes, to host 0 bytes)
ACC: Execute kernel target_example_$ck_L6_1 blocks:1 threads:1 async(auto) from ./teamsdis3.f90:6
ACC: Wait async(auto) from ./teamsdis3.f90:8
ACC: Start transfer 2 items from ./teamsdis3.f90:8
ACC:
           copy to host, free 'm' (8 bytes)
           free 'n' (8 bytes)
ACC: End transfer (to acc 0 bytes, to host 8 bytes)
M = (2.,2.)
                                                                          CRAY ACC DEBUG=2
```

```
ACC: Version 5.0 of HIP already initialized, runtime version 50120532
ACC: Get Device 0
ACC: Compute level 9.0
ACC: Device Name:
ACC: Number of cus 120
ACC: Device name
ACC: AMD GCN arch name: gfx908:sramecc+:xnack-
ACC: Max shared memory 65536
ACC: Max thread blocks per cu 8
ACC: Max concurrent kernels 8
ACC: Async table size 8
ACC: Set Thread Context
ACC: Establish link bewteen libcrayacc and libcraymp
ACC: libcrayacc interface v5
      libcraymp interface v5
ACC: Start transfer 2 items from ./teamsdis3.f90:6
       flags:
ACC:
ACC:
       Trans 1
ACC:
           Simple transfer of 'm' (8 bytes)
ACC:
ACC:
ACC:
                host ptr 4053c0
                acc ptr 0
                flags: ALLOCATE ACQ PRESENT REG PRESENT
ACC:
                memory not found in present table
ACC:
                allocate (8 bytes)
ACC:
                  get new reusable memory, added entry
ACC:
                new allocated ptr (7f4d67608000)
ACC:
                add to present table index 0: host 4053c0 to 4053c8, acc 7f4d67608000
ACC:
                new acc ptr 7f4d67608000
ACC:
ACC:
       Trans 2
ACC:
           Simple transfer of 'n' (8 bytes)
ACC:
                host ptr 4053c8
ACC:
ACC:
                acc ptr 0
                flags: ALLOCATE COPY HOST TO ACC ACQ PRESENT REG PRESENT
                memory not found in present table
ACC:
                allocate (8 bytes)
                  get new reusable memory, added entry
ACC:
ACC:
                new allocated ptr (7f4d67609000)
ACC:
                add to present table index 1: host 4053c8 to 4053d0, acc 7f4d67609000
ACC:
                copy host to acc (4053c8 to 7f4d67609000)
ACC:
                     internal copy host to acc (host 4053c8 to acc 7f4d67609000) size = 8
ACC:
                new acc ptr 7f4d67609000
ACC: End transfer (to acc 8 bytes, to host 0 bytes)
ACC: Start kernel target_example_$ck_L6_1 async(auto) from ./teamsdis3.f90:6
ACC: flags: CACHE_MOD CACHE_FUNC AUTO_ASYNC
        mod cache: 0x405640
ACC: kernel cache: 0x405440
       async info: 0x7f4d7b0918d0
ACC:
       arguments: GPU argument info
ACC:
                param size: 16
             param pointer: 0x7ffcd9b8ffc0
```

#### Debugging with Cray compiler: -hlist=aimd

```
teamdis.f90
  p<mark>rogram test</mark>
            integer :: i
           complex, pointer :: A(:)
           allocate(A(500))
           do i=1, 500
           A(i) = (0,0)
           enddo
10
            !$omp target teams distribute parallel do simo map(from:A)
11
12
           do i=1,500
13
           A(i) = (2,2)
14
           enddo
15
           !$omp end target teams distribute parallel do simd
16
17
           write(*,*) "A(1)= ", A(1)
18
19 end program test
```

```
Compiling with Cray Fortran

$ftn -hnoacc -homp -fopenmp -hlist=aimd -o ./teamsdis ./teamsdis.f90
```

```
*.lst
                 program test
                         integer :: i
                         complex, pointer :: A(:)
                         allocate(A(500))
                         do i=1, 500
         A----<
ftn-6202 ftn: VECTOR TEST, File = teamsdis.f90, Line = 7
 A loop starting at line 7 was replaced by a library call.
                         A(i) = (0,0)
         A---->
                         enddo
                        !$omp target teams distribute parallel do simd map(from:A)
ftn-6405 ftn: ACCEL TEST, File = teamsdis.f90, Line = 11
 A region starting at line 11 and ending at line 15 was placed on the accelerator.
ftn-6823 ftn: THREAD TEST, File = teamsdis.f90, Line = 11
 A region starting at line 11 and ending at line 15 was multi-threaded.
ftn-6420 ftn: ACCEL TEST, File = teamsdis.f90, Line = 11
If not already present: allocate memory for user shaped variable "a" on accelerator, copy back at line 15 (acc cop
ftn-6823 ftn: THREAD TEST, File = teamsdis.f90, Line = 11
 A region starting at line 11 and ending at line 15 was multi-threaded.
ftn-6823 ftn: THREAD TEST, File = teamsdis.f90, Line = 11
 A region starting at line 11 and ending at line 15 was multi-threaded.
ftn-7256 ftn: WARNING TEST, File = teamsdis.f90, Line = 11
  An OpenMP parallel construct in a target region is limited to a single thread.
  12. MG q--<
                         do i=1, 500
ftn-6430 ftn: ACCEL TEST. File = teamsdis.f90. Line = 12
A loop starting at line 12 was partitioned across the threadblocks and the 256 threads within a threadblock.
         MG a
                         A(i) = (2,2)
         MG a-->
         MG--->
                         !$omp end target teams distribute parallel do simd
                         write(*,*) "A(1)= ", A(1)
                 end program test
```

#### Profiling OpenMP® offloading code on AMD GPUs

```
Basic profiling with rocprof:

Compile:

$ftn -hnoacc -fopenmp -homp -o ./test ./test.f90

Profile and collect HIP trace:

$rocprof -hip-trace ./test

Open the .json file in <a href="mailto:chrome://tracing/">chrome://tracing/</a> or <a href="https://ui.perfetto.dev/">https://ui.perfetto.dev/</a>
```





#### **Agenda**

- Introduction to MI 200 hardware
- 2. Software stack and tools
- 3. Basics of OpenMP® offloading
- 4. HIP & OpenMP® compatibility
- 5. Case studies
- 6. Heterogenous memory management (HMM)

#### HIP & OpenMP® – Hybrid programming: compatible & competitive

Hybrid programming here stands for the interaction of OpenMP with a lower-level programming model like HIP. In other words, one can program with OpenMP in the style one might program with HIP.

OpenMP supports the following interactions:

- Calling low-level HIP kernels from OpenMP application code
- Calling HIP/ROCM math libraries (rocBLAS, rocFFT, etc.) from OpenMP application code
- Calling OpenMP kernels from low-level HIP application code

#### HIP & OpenMP® – Saxpy example

```
Basic profiling with rocprof:

void example() {
    float a = 2.0;
    float * x;
    float * y;
    #pragma omp target data map(to:x[0:count]) map(tofrom:y [0:count])
    {
       compute_1(n, x);
       compute_2(n, y);
       #pragma omp target update to(x[0:count]) to(y[0:count])
       saxpy(n, a, x, y)
       compute_3(n, y);
    }
}
```

#### HIP & OpenMP® – HIP kernel for saxpy()

A HIP version of the SAXPY kernel:

We need a way to translate the host pointer that was mapped by OpenMP directives and retrieve the associated device pointer.

#### HIP & OpenMP® – Putting it together

```
global void saxpy kernel (size t n, float a, float * x , float * y ){
           size t i = threadIdx.x + blockIdx.x * blockDim.x;
          y[i] = a * x[i] + y[i];
                                                                                         Translation unit 1
                                                                                                                       hipcc
Void saxpy hip (size t n, float a, float * x , float * y ){
           assert(n % 256 == 0);
           saxpy kernel <<< n/256,256,0,NULL>>>(n, a, x, y);
void example() {
          float a = 2.0;
                                                                                         Translation unit 2
           float * x = ...; //assume: x = 0xabcd
           float * y = ...;
           // allocate the device memory
           #pragma omp target data map(to:x [0:count]) tofrom:y [0:count])
           compute_1(n, x); // mapping table: x:[0xabcd ,0xef12 ], x = 0xabcd
                                                                                                                       clang
           compute 2(n, y);
           #pragma omp target update to(x[0:count]) to(y[0:count]) // update x and y on the target
           #pragma omp target data use device ptr (x,y)
                saxpy_hip(n, a, x, y) // mapping table: 0xabcd ,0xef12 ], x = 0xef12
           compute 3(n, y);
```

#### HIP & OpenMP® – Fortran and DGEMM example

```
You can either create your own FORTRAN
                                                                       ▶ module rocm interface
                                                                             interface
subroutine example
    use rocm interface
                                                                               subroutine init_rocblas(handle) bind(C)
   use iso_c_binding
                                                                                     use iso_c_binding
   implicit none
                                                                                     type(c ptr)
                                                                                                        :: handle
   real(8),allocatable,target,dimension(:,:) :: a, b, c
                                                                                 end subroutine init rocblas
    type(c_ptr)
                                              :: rocblas handle
                                                                                 subroutine omp_dgemm(handle,ma,mb,m,n,k,alpha, &
    . . .
                                                                                     a,lda,b,ldb,beta,c,ldc) bind(C)
                                                                                     use iso c binding
   allocate(da(M,N),db(N,K),dc(M,K))
                                                                                     type(c ptr), value :: a,b,c
   call init_matrices(da,db,dc,M,N,K)-
                                           Initialize matrices
                                          ! Initialize rocBLAS
    call init rocblas(rocblas handle)
                                                                                     type(c ptr)
                                                                                                        :: handle
                                                                                                        :: ma,mb,m,n,k,lda,ldb,ldc
                                                                                     integer(c_int)
                                                                                     real(c double)
                                                                                                        :: alpha, beta
    !$OMP target enter data map(to:a,b,c)
                                                   Translation unit 1
                                                                                 end subroutine omp dgemm
    !$OMP target data use device ptr(a,b,c)
                                                                             end interface
   call omp dgemm(rocblas handle, modea, modeb, M, N, K, alpha, &
                                                                         end module rocm interface
       c loc(a),lda,c loc(b),ldb,beta,c loc(c),ldc)
    !$OMP end target data
                                                                        #include <rocblas.h>
    !$OMP target update from(c)
                                                                         extern "C" {
   !$OMP target exit data map(delete:a,b,c)
                                                            flang
                                                                             void omp dgemm(void *ptr, int modeA, int modeB, int m, int n,
end subroutine example
                                                                                     int k, double alpha, double *A, int lda,
                                                                                     double *B, int ldb, double beta, double *C, int ldc/
                                                                                 rocblas handle *handle = (rocblas handle *) ptr;
 ... or build hipfort with flang and use their readily
                                                                                 rocblas dgemm(*handle,convert(modeA),convert(modeB),m/n,k,
                                                                                     &alpha,A,lda,B,ldb,&beta,C,ldc);
 available FORTRAN to HIP interface
 https://github.com/ROCmSoftwarePlatform/hipfort
                                                                             void init rocblas(void *ptr) {
                                                                                 rocblas handle *handle = (rocblas handle *)
                                                                                 rocblas_create_handle(handle);
```

#### HIP & OpenMP® – Babelstream case study

Full comparison of OpenMP Offloading vs HIP for all kernels in single precision and double precision All experiments performed on a single Instinct MI100 using AOMP 13.06

Default Threads \* Teams configuration already optimal for some kernels

| Single Precision      | Default Threads * Teams                          | OpenMP/HIP ratio             | Optimal Threads * Teams             | Optimal OpenMP/HIP ratio      |
|-----------------------|--------------------------------------------------|------------------------------|-------------------------------------|-------------------------------|
| Read                  | 256 * 480                                        | 1.48                         | -                                   | -                             |
| Write                 | 256 * 480                                        | 1.96                         | 1024 * 1440                         | 2.05                          |
| Сору                  | 256 * 480                                        | 0.92                         | 128 * 1920                          | 0.97                          |
| Mul                   | 256 * 480                                        | 0.92                         | 128 * 1440                          | 0.97                          |
| Add                   | 256 * 480                                        | 0.89                         | 128 * 1680                          | 0.93                          |
| Triad                 | 256 * 480                                        | 0.88                         | 128 * 1440                          | 0.92                          |
| Dot                   | 256 * 480                                        | 0.57                         | 64 * 1920                           | 0.72                          |
|                       |                                                  |                              |                                     |                               |
| Double Precision      | Default Threads * Teams                          | OpenMP/HIP ratio             | Optimal Threads * Teams             | Optimal OpenMP/HIP ratio      |
| Double Precision Read | Default Threads * Teams<br>256 * 480             | OpenMP/HIP ratio             | Optimal Threads * Teams  1024 * 960 | Optimal OpenMP/HIP ratio 1.06 |
|                       |                                                  |                              |                                     |                               |
| Read                  | 256 * 480                                        | 1.01                         | 1024 * 960                          | 1.06                          |
| Read<br>Write         | 256 * 480<br>256 * 480                           | 1.01<br>0.90                 | 1024 * 960<br>1024 * 60             | 1.06<br>0.95                  |
| Read Write Copy       | 256 * 480<br>256 * 480<br>256 * 480              | 1.01<br>0.90<br>0.93         | 1024 * 960<br>1024 * 60<br>-        | 1.06<br>0.95<br>-             |
| Read Write Copy Mul   | 256 * 480<br>256 * 480<br>256 * 480<br>256 * 480 | 1.01<br>0.90<br>0.93<br>0.92 | 1024 * 960<br>1024 * 60<br>-<br>-   | 1.06<br>0.95<br>-<br>-        |

Optimization for BabelStream would require a different number of Threads\*Teams for each of the sub-benchmarks



#### **Agenda**

- Introduction to MI 200 hardware
- 2. Software stack and tools
- 3. Basics of OpenMP® offloading
- 4. HIP & OpenMP® compatibility
- 5. Case studies
- 6. Heterogenous memory management (HMM)

#### Case Study 1 – VASP (Vienna Ab Initio Simulation Package)

- A computer program for atomic scale materials modelling, e.g., electronic structure calculations and quantum-mechanical molecular dynamics
- Currently used by more than 1400 research groups in academia and industry worldwide
- Software license agreements with the University of Vienna
- ~550K lines of FORTRAN 90 code (some FORTRAN 77)



#### Supporting concurrent directive-based paradigms in VASP

- Switch between different directive-based paradigms without letting them impact on each other
- Take advantage of source preprocessing
  - Pros: switch between different directive-based paradigms
  - Cons: makes the code messy

```
#ifdef OFFLOAD
#define D00FF
#define D00MP
                            !!
#else
                                                                             Used when VASP is compiled with OpenACC
#define D00FF
                            !!
#define D00MP
#endif
                                               LOOP PRESENT(CH.CW.DATAKE.WDES1) PRIVATE(MM)
                                 OOMP NOACC !$OMP PARALLEL DO SHARED(WDES1,CH,ISPINOR,DATAKE,EVALUE) PRIVATE(M,MM
                                            TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD PRIVATE(M,MM)
                                        DO M=1,WDES1%NGVECTOR
                                           MM=M+ISPINOR*WDES1%NGVECTOR
       Used when OpenMP
                                                                                                                      Used when OpenMP (host) is
                                           CH(MM)=CH(MM)+CW(MM)*(WDES1%DATAKE(M, ISPINOR+1)-EVALUE)
       offloading is enabled
                                                                                                                         enabled and OpenMP
                                        ENDDO
                                                                                                                     offloading/OpenACC is disabled
```



#### Enable/disable offloading in different code paths

- Many of the VASP subroutines are called from different code paths
  - How can we enable offloading for a subroutine in one path and disable offloading for others
    - It would be useful for code development and debugging

```
DOOFF !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO COLLAPSE(2) REDUCTION(+:EKIN) PRIVATE(MM,CPT) IF(OMP EXEC ON)
DO ISPINOR=0, WDES1%NRSPINORS-1
   DO M=1,WDES1%NGVECTOR
      MM=M+ISPINOR*WDES1%NGVECTOR
      CPT=W1%CW(MM)
      EKIN =EKIN+ REAL( CPT*CONJG(CPT) ,KIND=q) * WDES1%DATAKE(M, ISPINOR+1)
   ENDDO
DOOFF !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
```

We can call OMP\_PUSH\_EXEC\_ON(.TRUE.) or OMP\_PUSH\_EXEC\_ON(.FALSE.) to enable or disable offloading in different code paths

```
MODULE moffload struct def
#ifdef OFFLOAD
    PUBLIC :: OMP PUSH EXEC ON, OMP POP EXEC ON
    INTEGER, PARAMETER :: MAXLEVEL=20
    INTEGER :: OMP EXEC ON LEVEL=0
    LOGICAL :: OMP EXEC ON STACK(MAXLEVEL)=.FALSE.
    LOGICAL, PUBLIC :: OMP EXEC ON=.TRUE.
    CONTAINS
    SUBROUTINE OMP PUSH EXEC ON(VAR)
       LOGICAL :: VAR
       IF (OMP EXEC ON LEVEL==MAXLEVEL) THEN
        WRITE(*,*) "OMP PUSH EXEC ON: ERROR: stack is full"
       ENDIF
       OMP EXEC ON LEVEL=OMP EXEC ON LEVEL+1
       OMP EXEC ON STACK(OMP EXEC ON LEVEL)=OMP EXEC ON
       OMP EXEC ON=VAR
    END SUBROUTINE OMP PUSH EXEC ON
    SUBROUTINE OMP POP EXEC ON
    IF (OMP EXEC ON LEVEL==0) THEN
      WRITE(*,*) "OMP POP EXEC ON: ERROR: stack is empty"
    OMP EXEC ON=OMP EXEC ON STACK(OMP EXEC ON LEVEL)
    OMP EXEC ON LEVEL=OMP EXEC ON LEVEL-1
    END SUBROUTINE OMP POP EXEC ON
#endif
    END MODULE moffload struct def
```

#### **Interface OMP offloading with ROCM libraries**

- VASP uses FFT, BLAS, and LAPACK extensively
- Developed a wrapper to interface OMP target regions with ROCM libraries
  - rocFFT
  - rocBLAS
  - rocSolver

```
CALL OFF_ZGEMM('N', 'N', m_ WDES1%NPL_RED , NSIM_, NSIM_*ITER, one, & WOPT%CW_RED(1,1), m_ WDES%NRPLWV_RED, CEIG(1,1), NSUBD, & zero, WA%CW_RED(1,NPOS_RED+1), m_ WDES%NRPLWV_RED)
```

```
SUBROUTINE OFF ZGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
USE MROCBLAS
USE moffload struct def
USE moffload
INTEGER
                 :: M,N,K,LDA,LDB,LDC
CHARACTER(1)
                 :: TRANSA, TRANSB
                 :: A(LDA, COLNUM(TRANSA, K, M)), B(LDB, COLNUM(TRANSB, N, K)), C(LDC, N)
COMPLEX(q)
COMPLEX(q)
                 :: ALPHA, BETA
DOOFF !$OMP TARGET DATA USE DEVICE PTR(A,B,C)
   CALL HIP ZGEMM(ROCBLAS HANDLE, CHAR TO OP(TRANSA), CHAR TO OP(TRANSB), M, N, K, &
            ALPHA, C_LOC(A), LDA, C_LOC(B), LDB, BETA, C_LOC(C), LDC)
DOOFF !$OMP END TARGET DATA
END SUBROUTINE OFF ZGEMM
```

WOPT%CW\_RED(A), CEIG(B), and WA%CW\_RED(C) are mapped to device with "omp target enter data map" directive

#### **Exponential of Complex Variables**

#### Original Code:

```
implicit none
integer :: i
complex(8) :: D,R

D=(1,2)

!$OMP TARGET MAP(FROM:R) MAP(TO:D)
R=EXP(D)
!$OMP END TARGET
write(*,*) "R= ", R
end program loop test
```

```
coeb2819@cedar004:~/kernel/exp>
Coe62819@cedar004:~/kernel/exp> ftn -hnoacc -fopenmp -homp -o ./exp ./exp.f90
Error message :: Unimplemented: Complex lib call for AMDGCN
Error detected :: File 'pdgcs/llvm-call-expr.c', line 283
          Optimizer built :: 2022-02-11 (production)
         File
                                                              :: ./exp.f90
          Function
                                                             :: loop_test_
         at or near line
                                                            :: 11
         File path
                                                              :: /home/users/coe62819/kernel/exp/exp.f90
         Compiler hash
                                                              :: 47886aea6358792836950db7cc33c2061a1a9d36
          Target
                                                             :: Heterogeneous
               ACCEL
                                                              :: amdgcn-gfx908
Creating internal compiler error backtrace (please wait):
[0x00000000110d873] linux backtrace ??:?
[0x00000000110d8752] pdgcs_internal_error(char_const*, char_const*, int) ??:?
[0x000000012d263c] llvm_cg::gen_llvm_complex_lib_call_expr(EXP_INFO, std::pair<llvm::Value*, llvm::Value*>*) ??:?
[0x0000000013d9ee4d] llvm_cg::gen_llvm_unary_expr(EXP_INFO) ??:?
[0x000000013d9a9] llvm_cg::gen_llvm_expr(EXP_INFO, bool) ??:?
[0x000000013d1b524] llvm_cg::gen_llvm_expr_stmt(int) llvm-stmt.c:?
[0x000000011d1bf2] llvm_cg::gen_llvm_stmt(int) ??:?
[0x0000000011d17] llvm_cg::lvm_function with_body() llvm-pdgcs.c:?
[0x000000011e448c] internal_llvm_function(int, int, int, int, llvm_cg::FunctionTranslateMode) llvm-pdgcs.c:?
[0x0000000011e6a52] llvm_function(int) ??:?
[0x0000000011e6a52] llvm_function(int) ??:?
  Creating internal compiler error backtrace (please wait):
   [0x000000000718dca] PDGCS do proc ??:?
 [0x0000000006767f4] cvrt_proc_to_pdg m_cvrt.c:?

[0x0000000067e7f4] cvrt_proc_to_pdg m_cvrt.c:?

[0x0000000067f2a8] m_cvrt_to_pdg ??:?

[0x000000006ad67e] m_start_tpa ??:?

[0x000000006ad67e] m_start_tpa ??:?
   [0x007f56974eb34c] ?? ??:0
  [0x00000000639e3d] start /home/abuild/rpmbuild/BUILD/glibc-2.19/csu/../sysdeps/x86_64/start.S:122
 Note: This is a non-debug compiler. Technical support should continue problem isolation using a compiler built for
                     debugging.
  ftn-7991 ftn: INTERNAL LOOP_TEST, File = ./exp.f90, Line = 11
      INTERNAL COMPILER ERROR: "Unimplemented: Complex lib call for AMDGCN" (pdgcs/llvm-call-expr.c, line 283, version 47886aea6358792836950db7cc33c2061a1a9d36)
 coe62819@cedar004:~/kernel/exp>
coe62819@cedar004:~/kernel/exp>
  coe62819@cedar004:~/kernel/exp> vi exp.f90
  coe62819@cedar004:~/kernel/exp> cp exp.f90 exp2.f90
  coe62819@cedar004:~/kernel/exp> vi exp2.f90
  coe62819@cedar004:~/kernel/exp>
  coe62819@cedar004:~/kernel/exp>
coe62819@cedar004:~/kernel/exp> ftn -hnoacc -fopenmp -homp -o ./exp ./exp2.f90
Error message :: Unimplemented: Complex lib call for AMDGCN
Error detected :: File 'pdgcs/llvm-call-expr.c', line 283
         Optimizer built :: 2022-02-11 (production)
                                                              :: ./exp2.f90
         Function
                                                             :: loop_test_
         at or near line
         File path
                                                              :: /home/users/coe62819/kernel/exp/exp2.f90
          Compiler hash
                                                              :: 47886aea6358792836950db7cc33c2061a1a9d36
                                                              :: Heterogeneous
                                                              :: x86-rome
                                                             :: amdgcn-gfx908
 Creating internal compiler error backtrace (please wait):
[0x00000001104873] linux_backtrace ??:?
[0x0000000110e752] pdgcs_internal_error(char const*, char const*, int) ??:?
[0x000000012d63c1]lym_cg::gen_llym_cgn:plex_llym_cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn:plex_llym.cgn
```

#### **Exponential of Complex Variables**

#### Workaround:

```
program loop_test
      implicit none
      integer :: i
      complex(8) :: D,R
      REAL
              :: CE img
      D=(1,2)
      !$OMP TARGET MAP(FROM:R) MAP(TO:D)
      CE_img= AIMAG(D)
      R = 2.71828 * * (REAL(D))
      R=R*cmplx(COS(CE_img),SIN(CE_img))
      !$OMP END TARGET
      write(*,*) "R= ", R
end program loop_test
```

```
$ftn -hnoacc -fopenmp -homp -o ./exp ./exp_workaround2.f90
$ ./exp
R= (-1.1312035958327016,2.4717250246105067)
```

 $Exp(a+bj)=e^a*(cos(b)+sin(b)j)$ 

#### **Mapping Scalar Variables**

#### Original Code:

```
program test
    real(8),target :: CE
    CE=0

!$OMP TARGET ENTER DATA MAP(TO:CE)

!$OMP TARGET
    CE=1
    !$OMP END TARGET

!$OMP END TARGET

!$OMP TARGET UPDATE FROM(CE)
    write(*,*) "CE= ", CE

!$OMP TARGET EXIT DATA MAP(RELEASE:CE)
end program test
```

```
$ftn -hnoacc -fopenmp -homp -o ./enter_scalar ./enter_scalar.f90
$./enter_scalar
CE= 0.
```

#### Workaround:

```
program test
    real(8),target :: CE
    CE=0

!$OMP TARGET MAP(FROM:CE)
    CE=1
    !$OMP END TARGET
    write(*,*) "CE= ", CE
end program test
```

```
$ftn -hnoacc -fopenmp -homp -o
./enter_scalar ./enter_scalar_workaround.f90
$./enter_scalar
CE= 1.
```



#### Pointer aliasing

```
1 MODULE wave struct def
 2 TYPE wavedes
           INTEGER,POINTER :: LMMAXX(:)
           END TYPE wavedes
 6 TYPE wavedes1
           INTEGER,POINTER :: LMMAXX(:) => NULL()
           END TYPE wavedes1
 9 END MODULE wave struct def
11
12 program test
13
         use wave struct def
14
         integer :: i, j, k, N
15
         TYPE (wavedes) WDES
16
         TYPE (wavedes1) WDES1
17
         INTEGER,POINTER :: OUTPUT(:)
18
19
20
         ALLOCATE(WDES%LMMAXX(N))
         ALLOCATE(OUTPUT(N))
22
23
24
         do i=1, N
         WDES%LMMAXX(i) = 1
25
         OUTPUT(i) = 0
26
27
         !$OMP TARGET ENTER DATA MAP(T0:WDES)
28
29
         !$OMP TARGET ENTER DATA MAP(T0:WDES%LMMAXX)
30
         ! use WDES / WDES%LMMAXX in different loops/directives
31
32
         WDES1%LMMAXX => WDES%LMMAXX
         !$OMP TARGET ENTER DATA MAP(T0:WDES1)
         !$OMP TARGET ENTER DATA MAP(T0:WDES1%LMMAXX)
37
         !$OMP TARGET TEAMS DISTRIBUTE MAP(FROM:OUTPUT)
38
39
         do i=1, N
         OUTPUT(i) = WDES1%LMMAXX(i)
40
41
         !$OMP END TARGET TEAMS DISTRIBUTE
42
43
44
         write(*,*) "OUTPUT(", i, ")=", OUTPUT(i)
45
46
         !$OMP TARGET EXIT DATA MAP(DELETE:WDES1%LMMAXX)
47
         !$OMP TARGET EXIT DATA MAP(DELETE:WDES1)
48
49
         !$OMP TARGET EXIT DATA MAP(DELETE:WDES%LMMAXX)
50
         !$OMP TARGET EXIT DATA MAP(DELETE:WDES)
51
52 end program test
```

- Pointer aliasing occurs a lot in VASP
  - It can be challenging for the compilers to deal with pointer aliasing on device
- Set CRAY\_ACC\_DEBUG=3 as environment variable to get the log
- This issue is resolved in CCE15

```
coe62819@cedar004:~/kernel/ticket5/crayticket> ./map aliased
ACC: Version 5.0 of HIP already initialized, runtime version 50120532
ACC: Get Device 0
ACC: Set Thread Context
ACC: Start transfer 1 items from ./map aliased orig.f90:27
           allocate, copy to acc 'wdes' (72 bytes)
ACC: End transfer (to acc 72 bytes, to host 0 bytes)
ACC: Start transfer 3 items from ./map aliased orig.f90:28
           allocate, copy to acc 'wdes%lmmaxx(:)' (40 bytes)
ACC:
ACC:
           present 'wdes' (72 bytes)
           attach pointer 'wdes%lmmaxx' (72 bytes)
ACC:
ACC: End transfer (to acc 40 bytes, to host 0 bytes)
ACC: Start transfer 1 items from ./map aliased orig.f90:34
ACC:
           allocate, copy to acc 'wdes1' (72 bytes)
ACC: End transfer (to acc 72 bytes, to host 0 bytes)
ACC: Start transfer 3 items from ./map aliased orig.f90:35
           present 'wdes1%lmmaxx(:)' (\overline{40} \text{ bytes})
ACC:
ACC:
           present 'wdes1' (72 bytes)
           no attach pointer 'wdes1%lmmaxx' (72 bytes)
ACC:
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Start transfer 2 items from ./map aliased orig.f90:37
           allocate 'output(:)' (40 bytes)
ACC:
           present 'wdes1' (72 bytes)
ACC:
ACC: End transfer (to acc 0 bytes, to host 0 bytes)
ACC: Execute kernel test $ck L37 1 blocks:1 threads:256 async(auto) from ./map aliased orig.f90:3
:0:rocdevice.cpp
                            :2615: 15286944398 us: 9126 : [tid:0x7fc99b2af700] Device::callbackQu
: 0x2b
Aborted
```

#### Pointer aliasing (alternative methods)

```
1 MODULE wave struct def
 2 TYPE wavedes
            INTEGER,POINTER :: LMMAXX(:)
 4
           END TYPE wavedes
 6 TYPE wavedes 1
            INTEGER,POINTER :: LMMAXX(:) => NULL()
           END TYPE wavedes1
 9 END MODULE wave struct def
10
11
12 program test
13
         use wave struct def
          integer :: i, j, k, N
15
         TYPE (wavedes) WDES
16
         TYPE (wavedes1) WDES1
17
         INTEGER,POINTER :: OUTPUT(:)
18
         N = 10
19
20
21
         ALLOCATE(WDES%LMMAXX(N))
         ALLOCATE(OUTPUT(N))
23
24
25
26
27
28
29
         do i=1, N
         WDES%LMMAXX(i) = 1
         OUTPUT(i) = 0
          !$OMP TARGET ENTER DATA MAP(TO:WDES)
          !$OMP TARGET ENTER DATA MAP(TO:WDES%LMMAXX)
30
         ! use WDES / WDES%LMMAXX in different loops/directives
31
32
33
          !$OMP TARGET
         WDES1%LMMAXX => WDES%LMMAXX
                                             —→Launch a kernel
34
35
36
         !$0MP END TARGET
          !$OMP TARGET ENTER DATA MAP(T0:WDES1)
          !$OMP TARGET ENTER DATA MAP(T0:WDES1%LMMAXX)
38
39
40
          !$OMP TARGET TEAMS DISTRIBUTE MAP(FROM:OUTPUT)
         do i=1, N
41
         OUTPUT(i) = WDES1%LMMAXX(i)
42
43
44
          !$OMP END TARGET TEAMS DISTRIBUTE
45
46
         write(*,*) "OUTPUT(", i, ")=", OUTPUT(i)
47
48
49
50
51
          !$OMP TARGET EXIT DATA MAP(DELETE:WDES1%LMMAXX)
          !$OMP TARGET EXIT DATA MAP(DELETE:WDES1)
          !$OMP TARGET EXIT DATA MAP(DELETE:WDES%LMMAXX)
          !$OMP TARGET EXIT DATA MAP(DELETE:WDES)
```

```
ODULE wave struct def
TYPE wavedes
        REAL, POINTER :: LMMAXX(:)
        END TYPE wavedes
TYPE wavedes1
        REAL, POINTER :: LMMAXX(:) => NULL()
        END TYPE wavedes1
END MODULE wave struct def
program test
      use wave struct def
      !!$omp requires unified_shared_memory
integer :: i, j, k, N,q
      TYPE (wavedes) WDES
      TYPE (wavedes1) WDES1
      REAL, POINTER :: OUTPUT(:)
      !do q=1, 1000000
      ALLOCATE(WDES%LMMAXX(N))
      ALLOCATE(OUTPUT(N))
      do i=1, N
      WDES%LMMAXX(i) = 1
      OUTPUT(i) = 0
      !$OMP TARGET ENTER DATA MAP(T0:WDES)
      !$OMP TARGET ENTER DATA MAP(TO:WDES%LMMAXX)
      !use WDES / WDES%LMMAXX in different loops/directives
      !$OMP TARGET DATA USE DEVICE PTR(WDES
                                              Using target data construct
      WDES1%LMMAXX => WDES%LMMAXX
      !$OMP END TARGET DATA
      !$OMP TARGET TEAMS DISTRIBUTE MAP(FROM:OUTPUT)
      do i=1. N
      OUTPUT(i) = WDES1%LMMAXX(i)
      !$OMP END TARGET TEAMS DISTRIBUTE
      do i=1, N
      write(*,*) "OUTPUT(", i, ")=", OUTPUT(i)
      !$OMP TARGET EXIT DATA MAP(DELETE:WDES%LMMAXX)
      !$OMP TARGET EXIT DATA MAP(DELETE:WDES)
      deallocate(OUTPUT)
      deallocate(WDES%LMMAXX)
      !enddo
```

together we advance\_

end program test

#### Pointer mismatch in subroutine calls

```
MODULE wave struct def
 TYPE wavespin
          COMPLEX(8), POINTER
                                       :: CPTWFP(:,:)
          REAL
                    .POINTER
                                       :: CPROJ(:,:)
  END TYPE wavespin
  TYPE wavefun1
          COMPLEX(8), POINTER, CONTIGUOUS :: CPTWFP(:) => NULL()
                    , POINTER, CONTIGUOUS :: CPROJ(:) => NULL()
          COMPLEX(8), POINTER, CONTIGUOUS :: CR(:)
                                                       => NULL()
  END TYPE wavefun1
END MODULE wave struct def
program PointerAliasing
      use wave struct def
      TYPE (wavespin) :: W
      TYPE (wavefun1), TARGET :: W1(10)
      INTEGER NP, NSIM
      ALLOCATE(W%CPTWFP(100,100))
      ALLOCATE(W%CPROJ(100,100))
      NSIM=10
!$OMP TARGET ENTER DATA MAP(T0:W1)
      DO NP=1, NSIM
     CALL NEWWAY R(W1(NP))
      ENDDO
      DO NP=1, NSIM
     DO I=1, 10
     CALL SETWAV(W,W1(NP),I)
      CALL ECCP(W1(NP))
      ENDDO
      ENDDO
      DO NP=1, NSIM
      CALL DELWAY R(W1(NP))
      ENDDO
     end program
```

```
SUBROUTINE NEWWAY R(W1)
  use wave struct def
  TYPE (wavefun1), INTENT(INOUT) :: W1
  INTEGER MPLWV
  MPLWV=100
  ALLOCATE(W1%CR(MPLWV))
  !$OMP TARGET ENTER DATA MAP(ALLOC:W1%CR)
  !$OMP TARGET
  W1\%CR=(1,1)
  !$OMP END TARGET
END SUBROUTINE
SUBROUTINE DELWAY R(W1)
       use wave struct def
       TYPE (wavefun1) W1
       !$OMP TARGET EXIT DATA MAP(DELETE:W1%CR)
       DEALLOCATE(W1%CR)
END SUBROUTINE
BROUTINE ECCP(W1)
       use wave struct def
       TYPE (wavefun1) :: W1
       INTEGER MM
       COMPLEX(8), TARGET :: CE
       CE=0
       !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD REDUCTION(+:CE)
        DO MM =1, 100
        CE=CE+W1%CR(MM)
        !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD
        write(*,*) "ce= ", CE
ND SUBROUTINE
SUBROUTINE SETWAV(W,W1,I)
       use wave struct def
       TYPE (wavespin), INTENT(IN) :: W
       TYPE (wavefun1), INTENT(INOUT) :: W1
       INTEGER I,J,NP
       !$OMP TARGET EXIT DATA MAP(DELETE:W1%CPTWFP)
       !$OMP TARGET EXIT DATA MAP(DELETE:W1%CPROJ)
       W1%CPTWFP=>W%CPTWFP(:,I)
       W1%CPROJ =>W%CPROJ(:,I)
       !$OMP TARGET ENTER DATA MAP(T0:W1%CPTWFP,W1%CPROJ)
END SUBROUTINE
```

```
$./aliasing
ce= (100.,100.)
:0:rocdevice.cpp
                      :2660: 1637590862517 us:
86531: [tid:0x7fbe82217700]
Device::callbackQueue aborting with error :
HSA STATUS ERROR MEMORY APERTURE V
IOLATION: The agent attempted to access memory
beyond the largest legal address. code: 0x29
Aborted
```



#### Pointer mismatch in subroutine calls (alternative method)

```
MODULE wave struct def
 TYPE wavespin
          COMPLEX(8), POINTER
                                       :: CPTWFP(:,:)
                                       :: CPROJ(:,:)
                    ,POINTER
  END TYPE wavespin
  TYPE wavefun1
          COMPLEX(8), POINTER, CONTIGUOUS :: CPTWFP(:) => NULL()
                    , POINTER, CONTIGUOUS :: CPROJ(:) => NULL()
          COMPLEX(8), POINTER, CONTIGUOUS :: CR(:)
                                                       => NULL()
  END TYPE wavefun1
END MODULE wave struct def
program PointerAliasing
     use wave struct def
     TYPE (wavespin) :: W
      TYPE (wavefun1), TARGET :: W1(10)
      INTEGER NP, NSIM
      ALLOCATE(W%CPTWFP(100,100))
      ALLOCATE(W%CPROJ(100,100))
     NSIM=10
!$OMP TARGET ENTER DATA MAP(T0:W1)
     DO NP=1, NSIM
     CALL NEWWAY R(W1(NP))
      ENDDO
      DO NP=1, NSIM
     DO I=1, 10
     CALL SETWAV(W,W1(NP),I)
      CALL ECCP(W1,NP)
      ENDDO
      ENDDO
      DO NP=1, NSIM
      CALL DELWAY R(W1(NP))
      ENDD0
      end program
```

```
SUBROUTINE NEWWAY R(W1)
   use wave struct def
   TYPE (wavefun1), INTENT(INOUT) :: W1
   INTEGER MPLWV
   MPLWV=100
   ALLOCATE(W1%CR(MPLWV))
   !$OMP TARGET ENTER DATA MAP(ALLOC:W1%CR)
   !$OMP TARGET
   W1%CR=(1,1)
   !$OMP END TARGET
END SUBROUTINE
SUBROUTINE DELWAY R(W1)
        use wave struct def
        TYPE (wavefun1) W1
        !$OMP TARGET EXIT DATA MAP(DELETE:W1%CR)
        DEALLOCATE(W1%CR)
END SUBROUTINE
 UBROUTINE ECCP(W1,NP)
        use wave struct def
        TYPE (wavefun1) :: W1(10)
        INTEGER MM
        COMPLEX(8), TARGET :: CE
        CE=0
        !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD REDUCTION(+:CE)
         DO MM = 1, 100
         CE=CE+W1(NP)%CR(MM)
         !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD
         write(*,*) "ce= ", CE
  D SUBROUTINE
SUBROUTINE SETWAV(W,W1,I)
        use wave struct def
        TYPE (wavespin), INTENT(IN) :: W
        TYPE (wavefun1), INTENT(INOUT) :: W1
        INTEGER I, J, NP
        !$OMP TARGET EXIT DATA MAP(DELETE:W1%CPTWFP)
        !$OMP TARGET EXIT DATA MAP(DELETE:W1%CPROJ)
        W1%CPTWFP=>W%CPTWFP(:,I)
        W1%CPROJ =>W%CPROJ(:,I)
        !$OMP TARGET ENTER DATA MAP(T0:W1%CPTWFP,W1%CPROJ)
END SUBROUTINE
```

#### **Atomic update for complex(8)**

```
program test
      integer :: i,j,N,M,k2,k
                                                          Original code
      complex(8) :: B(51,42), C(51,42),X
      M=100
      do i=1, 51
      do j=1, 41
      B(i,j)=0
      C(i,j)=0
      enddo
      X=(1,1)
 !$omp target teams distribute map(tofrom:B) private(k,k2)
      do ĭ=1, M
!$omp parallel do
      do j=1, N/M
      k=(i*(N/M))+j
      k2 = mod(k, 40) + 1
!$omp atomic update
      B(k,k2)%re=B(k,k2)%re+REAL(X)
!$omp atomic update
      B(k,k2)%im=B(k,k2)%im+AIMAG(X)
!$omp end parallel do
!$omp end target teams distribute
      write(*,*) "B(1,1)%im= ", B(1,1)%im
      do i=1, M
      do j=1, N/M
      k=(i*(N/M))+j
      k2 = mod(k, 40) + 1
      k = mod(k, 50) + 1
      C(k,k2)=C(k,k2)+(1,1)
      enddo
      do i=1,51
      do j=1, 41
      if(B(i,j)/=C(i,j)) then
              write(*,*) "error at index (", i, j, ") B= ", B(i,j), "C= ", C(i,j)
      endif
      enddo
      end program test
```

```
program test
                                                          Alternative
      integer :: i,j,N,M,k2,k
      complex(8) :: B(51,42), C(51,42), X
      N=3000
      M=100
      do i=1, 51
      do j=1, 41
      B(i,j)=0
      C(i,j)=0
      enddo
      X=(1,1)
!$omp target teams distribute map(tofrom:B) private(k,k2)
      do i=1, M
!$omp parallel do
      do j=1, N/M
      k=(i*(N/M))+j
      k2 = mod(k, 40) + 1
      k = mod(k.50) + 1
     call SPLIT_CMPLX_ATOMIC_ADD_FROM_CMPLX(B(k,k2),X)
!$omp end parallel do
!$omp end target teams distribute
                             SUBROUTINE SPLIT CMPLX ATOMIC ADD FROM CMPLX(SPLIT CMPLX, TO ADD)
      do i=1, M
                                REAL(8), DIMENSION(2) :: SPLIT CMPLX
      do j=1, N/M
                                COMPLEX(8) :: TO ADD
      k=(i*(N/M))+i
                             !$OMP ATOMIC UPDATE
      k2 = mod(k, 40) + 1
                                SPLIT CMPLX(1)=SPLIT CMPLX(1)+REAL(TO ADD) ! real part
      k=mod(k,50)+1
                             !$OMP ATOMIC UPDATE
      C(k,k2)=C(k,k2)+(1,1)
                                SPLIT CMPLX(2)=SPLIT CMPLX(2)+AIMAG(TO ADD) ! imaginary part
                               END SUBROUTINE SPLIT CMPLX ATOMIC ADD FROM CMPLX
      enddo
      do i=1,51
      do j=1,41
      if(B(i,j)/=C(i,j)) then
      write(*,*) "error at index (", i, j, ") B= ", B(i,j), "C= ", C(i,j)
      endif
      enddo
      enddo
                                                                                   AMD
      end program test
```

# The overhead of subroutine call assuming there is no need for atomic update

```
DOOFF !SOMP TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD PRIVATE(ISPIRAL,NI,NP,NT,LMMAXC,INDMAX,LMBASE,NLIIND,IBLO
OODFF !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD PRIVATE(ISPIRAL,NI,NP.NT,LMMAXC,INDMAX,LMBASE,NLIIND,IBLO
                                                                                                                   DO ITER=0, COUNTER-1
0 ITER=0,COUNTER-:
                                                                                                                           ISPINOR=TOT ITER(ITER*3+1)
       ISPINOR=TOT_ITER(ITER*3+1)
                                                                                                                           NI=TOT ITER(ITER*3+2)
       NI=TOT ITER(ITER*3+2)
                                                                                                                           NP=TOT ITER(ITER*3+3)
       NP=TOT ITER(ITER*3+3)
                                                                                                                                NT=NONLR_S%ITYP(NI)
           NT=NONLR S%ITYP(NI)
                                                                                                                                LMMAXC=NONLR S%LMMAX(NT)
           LMMAXC=NONLR_S%LMMAX(NT)
                                                                                                                                 INDMAX=NONLR_S%NLIMAX(NI __NOACC_omp_arg(i))
            INDMAX=NONLR_S%NLIMAX(NI __NOACC_omp_arg(i))
                                                                                                                                 LMBASE=NONLR S%LMBASE(NI)+ISPINOR*NONLR S%LMBASE(NONLR S%NIONS+1)
            LMBASE=NONLR S%LMBASE(NI)+ISPINOR*NONLR S%LMBASE(NONLR S%NIONS+1)
                                                                                                                                NLIIND=NONLR S%NLIBASE(NI NOACC omp arg(i))
           NLIIND=NONLR S%NLIBASE(NI NOACC omp arg(i))
                                                                                                                                ISPIRAL=1: IF (NONLR S%LSPIRAL) ISPIRAL=ISPINOR+1
           ISPIRAL=1; IF (NONLR S%LSPIRAL) ISPIRAL=ISPINOR+1
                                                                                                                                DO IBLOCK=0, INDMAX/BLOCKSIZE
           DO IBLOCK=0, INDMAX/BLOCKSIZE
                                                                                                                                DO IND=IBLOCK*BLOCKSIZE+1,MIN((IBLOCK+1)*BLOCKSIZE,INDMAX)
           DO IND=IBLOCK*BLOCKSIZE+1, MIN((IBLOCK+1)*BLOCKSIZE, INDMAX)
               CTMP=0
                                                                                                                                    DO L=1.LMMAXC
              DO L=1,LMMAXC
                                                                                                                                       CTMP=CTMP+CPROJ(L+LMBASE,NP)*NONLR S%RPROJ(IND+(L-1)*INDMAX+NLIIND NOACC omp arg(i))
                 CTMP=CTMP+CPROJ(L+LMBASE,NP)*NONLR_S%RPROJ(IND+(L-1)*INDMAX+NLIIND __NOACC_omp_arg(i))
                                                                                                                   #ifndef gammareal
#ifndef gammareal
                                                                                                                                    CTMP=CTMP*CONJG(NONLR_S%CRREXP(IND,NI,ISPIRAL __NOACC_omp_arg(i)))
               CTMP=CTMP*CONJG(NONLR S%CRREXP(IND,NI,ISPIRAL NOACC omp arg(i)))
                                                                                                                   #endif
#endif
              CALL SPLIT CMPLX ATOMIC ADD FROM CMPLX(CRACC(IP,NP),CTMP*WDES1%RINPL)
                                                                                                                                    CRACC2(IP.NP)=CRACC2(IP.NP)+CTMP*WDES1%RINPL
 DOFF !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD
                                                                                                                   DOOFF !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO SIMD
```

Kernel time= 80 ms

Kernel time= 22 ms



#### Declare target

```
PROGRAM reproducer
    IMPLICIT NONE
    INTEGER, PARAMETER :: DP = selected real kind(14, 200)
    COMPLEX(DP), ALLOCATABLE :: psi(:,:), ew(:)
   INTEGER :: n, notcnv, nbn, npwx, npol, nvecx, ierr, nbase, npw
   REAL(DP), EXTERNAL :: MYDDOT VECTOR GPU
    nbase = 1
   n = 10
   nbn = 2
   notcnv = 1
   npwx = 2
   npw = 1
   npol = 2
   nvecx = 1
   allocate(ew(n))
    !$omp target data map(alloc: ew)
    ALLOCATE( psi( npwx*npol, nvecx ), STAT=ierr )
    !$omp target enter data map(alloc:psi)
    !$omp target teams distribute private(nbn)
   D0 n = 1, notcnv
     nbn = nbase + n
     ew(n) = ew(n) + MYDDOT VECTOR GPU( 2*npw, psi(npwx+1,nbn), psi(npwx+1,nbn) )
    END DO
    !$omp target update from(ew)
    !$omp end target data
    deallocate(ew)
    deallocate(psi)
END PROGRAM
```

```
DOUBLE PRECISION FUNCTION MYDDOT_VECTOR_GPU(N,DX,DY)

INTEGER, INTENT(IN) :: N

DOUBLE PRECISION, INTENT(IN) :: DX(*),DY(*)

DOUBLE PRECISION :: RES

INTEGER :: I

!$omp declare target

!$omp parallel do simd reduction(+:RES)

DO I = 1, N

RES = RES + DX(I) * DY(I)

END DO

!$omp end parallel do simd

MYDDOT_VECTOR_GPU = RES

END FUNCTION MYDDOT_VECTOR_GPU
```

```
$make
ftn -fopenmp -c myddot.f90 -o myddot.o
     !$omp parallel do simd reduction(+:RES)
ftn-7212 ftn: WARNING MYDDOT VECTOR GPU, File = myddot.f90, Line = 7
 Variable "res" is used before it is defined.
ftn-7256 ftn: WARNING MYDDOT_VECTOR_GPU, File = myddot.f90, Line = 7
  An OpenMP parallel construct in a target region is limited to a single thread.
Cray Fortran: Version 15.0.0.3 (20220920162820 088e5928c3724749216ddb6b2fbbcd2152ed2bb8)
Cray Fortran: Thu Jan 05, 2023 15:58:21
Cray Fortran: Compile time: 0.0472 seconds
Crav Fortran: 13 source lines
Cray Fortran: 0 errors, 2 warnings, 0 other messages, 0 ansi
Cray Fortran: "explain ftn-message number" gives more information about each message.
ftn -fopenmp -c reproducer.f90 -o reproducer.o
ftn -fopenmp myddot.o reproducer.o -o reproducer.x
error: reproducer.f90:28:0: in function reproducer $ck_L25_1 void (i64, i64, i64, i64, i64, i64): unsupported call,
to variadic function myddot_vector_gpu_
make: *** [Makefile:8: reproducer] Error 1
```

#### **Declare target (alternative method)**

```
DOUBLE PRECISION FUNCTION MYDDOT VECTOR GPU(N,DX,DY)
       INTEGER, INTENT(IN) :: N
       DOUBLE PRECISION, INTENT(IN) :: DX(*),DY(*)
       DOUBLE PRECISION :: RES
       INTEGER :: I
        !$omp declare target
        !$omp parallel do simd reduction(+:RES)
       DO I = 1. N
         RES = RES + DX(I) * DY(I)
        !$omp end parallel do simd
       MYDDOT VECTOR GPU = RES
END FUNCTION MYDDOT_VECTOR_GPU
PROGRAM reproducer
   IMPLICIT NONE
   INTEGER, PARAMETER :: DP = selected_real_kind(14, 200)
   COMPLEX(DP), ALLOCATABLE :: psi(:,:), ew(:)
   INTEGER :: n, notcnv, nbn, npwx, npol, nvecx, ierr, nbase, npw
   REAL(DP), EXTERNAL :: MYDDOT VECTOR GPU
   nbase = 1
   n = 10
   nbn = 2
   notcnv = 1
   npwx = 2
   npw = 1
   npol = 2
   nvecx = 1
   allocate(ew(n))
   !$omp target data map(alloc: ew)
   ALLOCATE( psi( npwx*npol, nvecx ), STAT=ierr )
   !$omp target enter data map(alloc:psi)
   !$omp target teams distribute private(nbn)
   D0 n = 1, notchv
     nbn = nbase + n
     ew(n) = ew(n) + MYDDOT VECTOR GPU( 2*npw, psi(npwx+1,nbn), psi(npwx+1,nbn) )
   !$omp target update from(ew)
   !$omp end target data
   deallocate(ew)
   deallocate(psi)
END PROGRAM
```

- To get around the error, we can define function in the same file as function call
  - It would be challenging to apply his workaround in the applications with many function/subroutine calls

## Case study 2 - MPAS

The Model for Prediction Across Scales (MPAS) is a collaborative project for developing atmosphere, ocean and other earth-system simulation components for use in climate, regional climate, and weather studies.

- Finite volume solver for non-hydrostatic atmospheric equations.
- Written in FORTRAN. Uses directives for GPU acceleration
  - ~2.5k lines of !\$acc code, still an ongoing effort
  - AMD approach: OpenMP® directives

See https://mpas-dev.github.io/ and https://github.com/MPAS-Dev/MPAS-Model for more information









#### MPAS code structure: Memory and data management

```
program mpas
    use mpas_subdriver
    use mpas_derived_types, only : core_type, domain_type
    implicit none
    type (core_type), pointer :: corelist => null()
    type (domain_type), pointer :: domain => null()
    call mpas_init(corelist,domain) ! Allocate domain and host arrays
    call mpas_run(domain)
    call mpas_finalize(corelist, domain)
end program mpas
```

```
subroutine mpas_pool_get_array_2d_real_gpu(inPool, key, array, timeLevel)!{{{
   implicit none
   type (mpas_pool_type), intent(in) :: inPool
   character (len=*), intent(in) :: key
   real (kind=RKIND), dimension(:,:), pointer :: array
   integer, intent(in), optional :: timeLevel
   type (field2DReal), pointer :: field
   type (mpas_coupler_type), pointer :: coupler
   nullify(field)
   call mpas pool get field 2d real(inPool, key, field, timeLevel)
   nullify(array)
   if (associated(field)) then
        coupler => field%block%domain%mpas cpl
       array => field % array
#ifdef CORE ATMOSPHERE
       if (coupler % role_includes(ROLE_INTEGRATE)) then
           !$acc enter data copyin(field%array)
       end if
#endif
    end if
end subroutine mpas_pool_get_array_2d_real_gpu!}}}
```

- All GPU memory buffers allocated at the first time step and is reused for subsequent time steps.
- Updating the host from device occurs at the end of every time step.
- Now we can strictly focus on porting and optimizing the compute

```
subroutine mpas_update_gpu_data_on_host(domain)
 ! extract information from domain ...
  call mpas_pool_get_array_gpu(state, 'w', w, 2)
  call mpas_pool_get_array_gpu(state, 'u', u, 2)
 !$acc update host(w, u)
end subroutine mpas_update_gpu_data_on_host
```



#### **Example #1: OPENACC code**

```
1 !$acc parallel vector_length(32)
    !$acc loop gang
    do iEdge=edgeStart,edgeEnd
        cell1 = cellsOnEdge(1,iEdge)
        cell2 = cellsOnEdge(2,iEdge)
        ! update edges for block-owned cells
        if (cell1 <= nCellsSolve .or. cell2 <= nCellsSolve ) then
 8 !DIR$ IVDEP
    !$acc loop vector
            do k=1,nVertLevels
                pgrad = ((rtheta_pp(k,cell2)-rtheta_pp(k,cell1))*invDcEdge(iEdge) )/(.5*(zz(k,cell2)+zz(k,cell1)))
                pgrad = cqu(k,iEdge)*0.5*c2*(exner(k,cell1)+exner(k,cell2))*pgrad
                pgrad = pgrad + 0.5*zxu(k,iEdge)*gravity*(rho_pp(k,cell1)+rho_pp(k,cell2))
                ru_p(k,iEdge) = ru_p(k,iEdge) + dts*(tend_ru(k,iEdge) - (1.0_RKIND - specZoneMaskEdge(iEdge))*pgrad)
            end do
            ! accumulate ru p for use later in scalar transport
17 !DIR$ IVDEP
18 !$acc loop vector
           do k=1,nVertLevels
                ruAvg(k,iEdge) = ruAvg(k,iEdge) + ru p(k,iEdge)
            end do
        end if ! end test for block-owned cells
23 end do ! end loop over edges
   !$acc end parallel
```

Although the existing OpenACC code may not be efficiently implemented, it still serves as a rough guideline for our OpenMP offloading port

First step of the porting & optimization process is to add existing OpenMP directives on top of the OpenACC directives

#### **Example #1: OpenMP® initial port**

```
!$omp target teams distribute
    !$acc parallel vector_length(32)
    !$acc loop gang
 4 do iEdge=edgeStart,edgeEnd
        cell1 = cellsOnEdge(1,iEdge)
       cell2 = cellsOnEdge(2,iEdge)
       ! update edges for block-owned cells
        if (cell1 <= nCellsSolve .or. cell2 <= nCellsSolve ) then
    !$omp parallel do simd
    !DIR$ IVDEP
           do k=1,nVertLevels
               pgrad = ((rtheta pp(k,cell2)-rtheta pp(k,cell1))*invDcEdge(iEdge) )/(.5*(zz(k,cell2)+zz(k,cell1)))
               pgrad = cqu(k,iEdge)*0.5*c2*(exner(k,cell1)+exner(k,cell2))*pgrad
               pgrad = pgrad + 0.5*zxu(k,iEdge)*gravity*(rho_pp(k,cell1)+rho_pp(k,cell2))
               ru_p(k,iEdge) = ru_p(k,iEdge) + dts*(tend_ru(k,iEdge) - (1.0_RKIND - specZoneMaskEdge(iEdge))*pgrad)
    !$omp end parallel do simd
           ! accumulate ru p for use later in scalar transport
    !$omp parallel do simd
    !DIR$ IVDEP
   !$acc loop vector
           do k=1,nVertLevels +-----
               ruAvg(k,iEdge) = ruAvg(k,iEdge) + ru_p(k,iEdge)
           end do
    !$omp end parallel do simd
                                                                                 be suboptimal
        end if ! end test for block-owned cells
28 end do! end loop over edges
    !$acc end parallel
```

Note: number of vertical levels (nVertLevels) depends on mesh. (e.g., nVertLevels = 26 in the JW Baroclinic Wave benchmark)

This may be okay for a hardware with shorter SIMD (warp). With warp size exceeding the nVertLevels use of recourses will be suboptimal

30 !\$omp end target teams distribute

#### Example #1: OpenMP® initial optimization – number of threads

```
1 !$omp target teams distribute thread_limit(64) ◀------
    !$acc parallel vector_length(32)
    !$acc loop gang
   do iEdge=edgeStart,edgeEnd
        cell1 = cellsOnEdge(1,iEdge)
        cell2 = cellsOnEdge(2,iEdge)
       ! update edges for block-owned cells
        if (cell1 <= nCellsSolve .or. cell2 <= nCellsSolve ) then</pre>
    !$omp parallel do simd
    !DIR$ IVDEP
    !$acc loop vector
                                                                                                    loops
            do k=1,nVertLevels
                pgrad = ((rtheta_pp(k,cell2)-rtheta_pp(k,cell1))*invDcEdge(iEdge) )/(.5*(zz(k,cell2)+zz(k,cell1)))
               pgrad = cqu(k,iEdge)*0.5*c2*(exner(k,cell1)+exner(k,cell2))*pgrad
               pgrad = pgrad + 0.5*zxu(k,iEdge)*gravity*(rho_pp(k,cell1)+rho_pp(k,cell2))
               ru_p(k,iEdge) = ru_p(k,iEdge) + dts*(tend_ru(k,iEdge) - (1.0_RKIND - specZoneMaskEdge(iEdge))*pgrad)
    !$omp end parallel do simd
            ! accumulate ru p for use later in scalar transport
    !$omp parallel do simd
    !DIR$ IVDEP
    !$acc loop vector
            do k=1,nVertLevels
                ruAvg(k,iEdge) = ruAvg(k,iEdge) + ru_p(k,iEdge)
            end do
    !$omp end parallel do simd
        end if ! end test for block-owned cells
28 end do! end loop over edges
    !$acc end parallel
   !$omp end target teams distribute
```

Default number of threads is 256

Obvious step, reduce it to 64.

Still <50% utilization. How to ensure most threads are doing useful work for these smaller meshes? One approach could be to collapse the inner do

AMD together we advance\_

## Example #1: OpenMP® better optimization – collapsed do loops

```
!$omp target teams distribute collapse(2) ←-----
                                                        Can now use default number of threads
                                                        (256)
!$acc loop gang
do iEdge=edgeStart,edgeEnd
GPUOMP do k=1,nVertLevels
    cell1 = cellsOnEdge(1,iEdge)
    cell2 = cellsOnEdge(2,iEdge)
    ! update edges for block-owned cells
    if (cell1 <= nCellsSolve .or. cell2 <= nCellsSolve ) then
!$omp parallel do simd
!DIR$ IVDEP
GPUACC do k=1,nVertLevels
           pgrad = ((rtheta_pp(k,cell2)-rtheta_pp(k,cell1))*invDcEdge(iEdge) )/(.5*(zz(k,cell2)+zz(k,cell1)))
           pgrad = cqu(k,iEdge)*0.5*c2*(exner(k,cell1)+exner(k,cell2))*pgrad
           pgrad = pgrad + 0.5*zxu(k,iEdge)*gravity*(rho pp(k,cell1)+rho pp(k,cell2))
           ru_p(k,iEdge) = ru_p(k,iEdge) + dts*(tend_ru(k,iEdge) - (1.0_RKIND - specZoneMaskEdge(iEdge))*pgrad)
           ruAvg(k,iEdge) = ruAvg(k,iEdge) + ru_p(k,iEdge)
GPUOMP
GPUACC end do
!$omp end parallel do simd
        ! accumulate ru p for use later in scalar transport
!$omp parallel do simd
!DIR$ IVDEP
GPUACC do k=1,nVertLevels
GPUACC
           ruAvg(k,iEdge) = ruAvg(k,iEdge) + ru p(k,iEdge)
GPUACC end do
!$omp end parallel do simd
    end if ! end test for block-owned cells
GPUOMP end do
end do ! end loop over edges
!$acc end parallel
!$omp end target teams distribute
```



Macro-definitions added at the top of each source file to distinguish do loops for the OpenACC backend from do loops for the new OpenMP offloading backend End goal is to have one code with few adaptations for optimal use of different directive-based programming models AMDI

#### Example #2: OpenACC code

```
!$acc parallel vector length(64)
                                                                 !DIR$ IVDEP
     !$acc loop gang private(wduz, tend wk, eoe w, we w)
     do iEdge=edgeSolveStart,edgeSolveEnd
                                                                    do k=1,nVertLevels
     !$acc cache(tend wk,wduz,eoe w,we w)
                                                                         q1 = pv edge(k,iEdge)
                                                           21
         cell1 = cellsOnEdge(1,iEdge)
                                                                         a2 = 0.0
                                                           22
         cell2 = cellsOnEdge(2,iEdge)
                                                           23 V !$acc loop seq
                                                                        do j = 1,nEdgesOnEdge(iEdge)
     !DIR$ IVDEP
     !$acc loop vector
                                                                             eoe = eoe w(j)
                                                           25
                                                                             workpv = 0.5 * (q1 + pv edge(k, eoe))
         do k=1,nVertLevels
             ! compute ...
                                                                             q2 = q2 + we w(j) * u(k,eoe) * workpv
10
                                                           27
         end do
11
                                                                         end do
         ! Compute ...
                                                                         t w = - rdzw(k)*(wduz(k+1)-wduz(k))
12
                                                           29
     !$acc loop vector shortloop
                                                                         tend u(k,iEdge) = t w + rho edge(k,iEdge) * &
13
                                                           30 🗸
         do j = 1,nEdgesOnEdge(iEdge)
                                                                                              (q2 - (ke(k,cell2) - ke(k,cell1)) * &
14
                                                                                              invDcEdge(iEdge)) - tend wk(k) * 0.5 * &
15
             eoe w(j) = edgesOnEdge(j,iEdge)
                                                           32
             we w(j) = weightsOnEdge(j,iEdge)
                                                                                              (h divergence(k,cell1)+h divergence(k,cell2))
         end do
17
                                                                          end do
                                                                       end do
Caches local arrays into shared memory
                                                                 !$acc end parallel
```

No OpenMP equivalent for !\$acc cache

Collapsing inner loops not always possible

#### **Example #2: OpenMP® initial port**

```
!$omp target teams distribute parallel do thread limit(64) 23
                                                                    !DIR$ IVDEP
                                                                    !$omp parallel do simd
     !$acc parallel vector length(64)
                                                                    !$acc loop vector
     !$acc loop gang private(wduz, tend wk, eoe w, we w)
                                                                        do k=1,nVertLevels
     do iEdge=edgeSolveStart,edgeSolveEnd
                                                                             q1 = pv_edge(k,iEdge)
                                                                             q2 = 0.0
         cell1 = cellsOnEdge(1,iEdge)
         cell2 = cellsOnEdge(2,iEdge)

✓ !$acc loop seq
     !$omp parallel do simd
                                                                             do j = 1,nEdgesOnEdge(iEdge)
                                                                                 eoe = eoe w(j)
     !DIR$ IVDEP
                                                                                workpv = 0.5 * (q1 + pv edge(k,eoe))
10 ∨ !$acc loop vector
                                                                                 q2 = q2 + we_w(j) * u(k,eoe) * workpv
         do k=1,nVertLevels
11 🗸
12
                                                                             end
                                                                                  - rdzw(k)*(wduz(k+1)-wduz(k))
         end do
                                                                             t w
                                                                             tend u(k,iEdge) = t w + rho edge(k,iEdge) * &
14 ∨ !$omp end parallel do simd
                                                                                                   (q2 - (ke(k,cell2) - ke(k,cell1)) * &
15
     !$omp parallel do simd
                                                                                                  invDcEdge(iEdge)) - tend wk(k) * 0.5 * &
                                                                                                   (h divergence(k,cell1)+h divergence(k,cell2))
17 ∨ !$acc loop vector shortloop
         do j = 1,nEdgesOnEdge(iEdge)
                                                                        end do
                                                                    !$omp end parallel do simd
             eoe w(j) = edgesOnEdge(j,iEdge)
                                                               41
             we w(j) = weightsOnEdge(j,iEdge)
                                                                    end do
                                                               42
                                                                    !$acc end parallel
         end do
                                                               43
     !$omp end parallel do simd
                                                                    !$omp end talget teams distribute
```

Directly apply thread\_limit(64) trick

Suboptimal performance of RHS loop – register spills and scratch usage according to rocprof Collapsing the loops not possible because of the reduction of q2 variable.

However, can we rearrange the order of the parallel and sequential loops?

#### Example #2: OpenMP® optimization – rearranging and splitting loops

```
#ifdef OPENACC
     !$omp target teams distribute parallel do thread limit(64) 23
     !$acc parallel vector length(64)
                                                                     #else
     !$acc loop gang private(wduz, tend wk, eoe w, we w)
                                                               26 ∨ !$omp parallel do simd
     do iEdge=edgeSolveStart,edgeSolveEnd
                                                                         do k=1,nVertLevels
                                                               27 🗸
                                                                             t w = - rdzw(k)*(wduz(k+1)-wduz(k))
         cell1 = cellsOnEdge(1,iEdge)
                                                                             tend u(k,iEdge) = t w + rho edge(k,iEdge) * &
         cell2 = cellsOnEdge(2,iEdge)
                                                                                              (-(ke(k,cell2) - ke(k,cell1))* invDcEdge(iEdge)) &
     !$omp parallel do simd
                                                                                              - tend wk(k) * 0.5 * (h divergence(k,cell1) &
     !DIR$ IVDEP
                                                                                              + h divergence(k,cell2))
10 ∨ !$acc loop vector
         do k=1,nVertLevels
                                                                         end do
11 🗸
                                                               34 ∨ !$omp end parallel do simd
12
                                                                        do j = 1,nEdgesOnEdge(iEdge)
         end do
                                                                            eoe = eoe w(j)
14 ∨ !$omp end parallel do simd
                                                               37 ✓ !$omp parallel do simd
         ! Compute ...
15
     !$omp parallel do simd
                                                               38 v
                                                                           do k=1,nVertLevels
                                                                               tend_u(k,iEdge) = tend_u(k,iEdge) + rho_edge(k,iEdge) * we w(j) &
                                                                                              * u(k,eoe) * 0.5 * (pv edge(k,iEdge) + pv edge(k,eoe))
         do j = 1,nEdgesOnEdge(iEdge)
             eoe w(j) = edgesOnEdge(j,iEdge)
                                                                           end do
             we w(j) = weightsOnFuge(j,iEdge)

√ !$omp end parallel do simd

                                                                         end do
21
         end do
                                                               43
     !$omp end parallel do simd
                                                                     #endif
                                                               44
                                                                     end do
                                                                     !$acc end parallel
Rearrange order of the sequential and parallel
                                                                     !$omp end target teams distribute
                                                               47
```

Rearrange order of the sequential and paralled do loops. Multiple parallel loops to minimize global memory reads/writes

# JW Baroclynic wave – Initial performance



Overall GPU port (including the OpenACC backend) still in progress

Only a couple variables copied back to the host about ~7% of time integration

- The "mpas update GPU data on host" event will significantly increase as more physics/variables are ported
- HMM can play a big role



# Agenda

- 1. Introduction to MI 200 hardware
- 2. Software stack and tools
- 3. Basics of OpenMP® offloading
- 4. HIP & OpenMP® compatibility
- 5. Case studies
- 6. Heterogenous memory management (HMM)

#### Heterogenous memory management (HMM)

HMM allows the same pointer to an object to be used both by the CPU and a device [GPU] even if the physical location of the object were moved by the operating system or device driver. Furthermore, the device driver can control the policy of whether the current physical location of the object is in CPU or device memory.

https://www.kernel.org/doc/html/v5.0/vm/hmm.html

#### OpenMP® programming on systems with HMM

#### 

#### Highlights:

Uses system memory allocators.

"Pointer is a pointer" data can be accessed by threads running on any device, regardless of the current physical location of the data

HMM allows OS, driver, and HW to manage physical memory location, while OpenMP directives are used primarily for expressing parallelism and execution space (HOST, DEVICE 0, DEVICE 1, etc.)

Footnotes: manual management of data, memory location, and expression of parallelism (for example using HIP programming models) may provide higher performance. Some performance optimizations may also be done via using additional directives, clauses, and APIs

#### Performance comparison of unified vs non-unified memory

```
DEVID: 0 SGN:2 ConstWGSize:1024 args: 5 teamsXthrds:( 128X1024) reqd:( 128X1024) lds_usage:16716B sgpr_count:60 vgpr_count:58 sgpr_spill_count:0 vgpr_spill_count:0 tripcount:131072 rpc:0 n:__omp_offloading_10304_ac24ca_main_l50
                                                                                                                                                                    131072, 0x7ffd4b4b40d0)
Call <u>tgt_rtl_run_target_team_region_async:</u>
                                                  10us
                                                                        0. 0x00000203cf40. 0x00000203bd20. 0x00000203bd70.
                                                                                                                                               128.
                                                                                                                                                          1024.
Call
            tgt rtl synchronize: 2207us
                                                             0, 0x7ffd4b4b40d0)
                                                                                            With UNIFIED MEMORY
GPU par: dot loop time = 0.00224113 [s] effective read BW = 0.871489 [GB/s]
DEVID: 0 SGN:2 ConstWGSize:1024 args: 4 teamsXthrds:(128X1024) reqd:(128X1024) lds_usage:16452B sgpr_count:24 vgpr_count:44 sgpr_spill_count:0 vgpr_spill_count:0 tripcount:131072 rpc:0 n:__omp_offloading_10304_ac24ca_main_l57
Call <u>__tgt_rtl_run_target_team_region_async</u>:
                                                                       0, 0x00000203d1b0, 0x00000203d170, 0x00000203a940,
                                                                                                                                               128.
                                                                                                                                                         1024,
                                                                                                                                                                    131072, 0x7ffd4b4b40d0)
                                                  7us
Call
            __tgt_rtl_synchronize: 18us
                                                  0 (
                                                           0, 0x7ffd4b4b40d0)
GPU par: read loop time = 4.1008e-05 [s] effective read BW = 47.6279 [GB/s]
             tgt rtl data alloc:
                                  Ous 0x7ff450408000 (
                                                                       8, 0x7ffe2056e990)
        tgt rtl data submit async:
                                                             0, 0x7ff450408000, 0x7ffe2056e990,
                                                                                                      8, 0x7ffe2056e8c0)
            __tgt_rtl_data_alloc:
                                  Ous 0x7ff450409000
         tgt rtl data submit async:
                                                                                                    1048576, 0x7ffe2056e8c0)
             _tgt_rtl_data_alloc: 1us 0x7ff440200000 (
         tgt rtl data submit async:
                                                             0, 0x7ff440200000, 0x0000026f7670,
DEVID: 0 SGN:2 ConstWGSize:1024 args: 5 teamsXthrds:(128X1024) reqd:(128X1024) lds usage:16716B sgpr count:60 vgpr spill count:0 vgpr spill count:0 tripcount:131072 rpc:0
n: omp offloading 10304 ac24ca main l50
Call <u>__tgt_rtl_run_target_team_region_async:</u>
                                                                                                                                                  1024,
                                               13us
                                                             0, 0x7ffe2056e990, 0x7ff450408000,
        tgt rtl data retrieve async: 517us
                                                                                                       8, 0x7ffe2056e8c0)
           __tgt_rtl_synchronize: 501us
           __tgt_rtl_data_delete:
                                                                                                     NO UNIFIED MEMORY
           __tgt_rtl_data_delete:
            tgt rtl data delete:
GPU par: dot loop time = 0.00127697 [s] effective read BW = 1.5295 [GB/s]
                                   Ous 0x7ff450408000 (
Call
             tgt rtl data alloc:
                                                                   1048576, 0x0000025f7660)
        __tgt_rtl_data_submit_async:
Call
                                                             0, 0x7ff450408000, 0x0000025f7660,
                                      39us
                                                                                                    1048576, 0x7ffe2056e8c0)
Call
             tgt_rtl_data_alloc:
                                   Ous 0x7ff440200000 (
                                                                    1048576, 0x0000026f7670)
Call
        __tgt_rtl_data_submit_async:
                                                             0. 0x7ff440200000. 0x0000026f7670.
                                                                                                    1048576, 0x7ffe2056e8c0)
DEVID: 0 SGN:2 ConstWGSize:1024 args: 4 teamsXthrds:(128X1024) read:(128X1024) lds usage:16452B sgpr count:24 vgpr count:44 sgpr spill count:0 vgpr spill count:0 tripcount:131072 rpc:0
n: omp offloading 10304 ac24ca main 157
Call <u>__tgt_rtl_run_target_team_region_async:</u>
                                                                   0, 0x00000285c180, 0x00000285ad20, 0x00000285ca60,
                                                                                                                                                           131072, 0x7ffe2056e8c0)
                                                                                                                                        128,
                                                                                                                                                 1024,
Call
           __tgt_rtl_synchronize:
                                                0 (
                                                         0, 0x7ffe2056e8c0)
Call
           __tgt_rtl_data_delete:
                                               0 (
                                                        0, 0x7ff440200000)
Call
           __tgt_rtl_data_delete:
                                               0 (
                                                        0. 0x7ff450408000)
GPU par: read loop time = 0.00051403 [s] effective read BW = 3.79963 [GB/s]
```

#### OpenMP® offloading with HMM – OpenFOAM® case-study

```
template < class T >
Foam::List < T > ::List (const Foam::one, const T& val)
:
    UList < T > (new T[1], 1)
{
    this - > v_[0] = val;
}
```

Memory allocations are decoupled from the "rest" of the code.

This makes altering memory allocation easier but it also makes it difficult to understand the mapping between variables and memory allocations.

```
> new
/oid Foam::List<T>::doResize(const label len)
   if (len == this->size )
       return;
   if (len > 0)
       // With sign-check to avoid spurious -Walloc-size-larger-than
       size t alignement = 16;
       size_t bytes_needed = sizeof(T)*len;
       if (bytes_needed > 2*128){ //LG1 AMD
          alignement = 512;  //LG1 AMD
       T* nv = new (std::align val t( alignement)) T[len]; //LG1 AMD
       //if (bytes needed > 2*128){      //LG1 AMD
            #pragma omp target enter data map(to:nv[0:len]) //LG1 AMD
      // } //LG1 AMD
```

# OpenFOAM® initial porting with HMM - choosing the executing device and expressing parallelism

```
//LG using OpenMP offloading and HMM
#include <omp.h>
#pragma omp requires unified_shared_memory
```

Notify OpenMP run-time about intent to use unified virtual memory. Not required in openmp spec 5.2

```
#pragma omp target teams distribute parallel for //LG
for (label cell=0; cell<nCells; cell++)
{
    pAPtr[cell] = wAPtr[cell] + beta*pAPtr[cell];
}</pre>
```

Compiler will produce a Host and a Device code, apply parallelization and offload computation to the Device

Note – no need to allocate memory on the device and copy data between the Host and the Device

```
#pragma omp target teams distribute parallel for //map(tofrom:ApsiPtr[0:nCells])
for (label face=0; face<nFaces; face++)

#pragma omp atomic hint(AMD_fast_fp_atomics)
    ApsiPtr[uPtr[face]] += lowerPtr[face]*psiPtr[lPtr[face]];
    #pragma omp atomic hint(AMD_fast_fp_atomics)
    ApsiPtr[lPtr[face]] += upperPtr[face]*psiPtr[uPtr[face]];
}</pre>
```

By default, all memory is fine-grained, hence atomics are system scope atomics. Our GPUs can not handle system scope atomics in a performing way ... by applying the "map" clause we can force memory to be coarse grained (==additional expense)

# HMM at work with OpenFOAM®



Current state: after adding about 60 lines of OpenMP® target directives ~50-60% of the code is executed on GPUs.

Switching execution between the CPU and GPU does not require explicit data transfers – HMM is moving pages as needed.

OpenMP standard and implementation are evolving/improving.



#### **Summary**

- OpenMP® offloading is compatible and competitive with HIP
- OpenMP® can interface to ROCm /HIP math libraries
- Performance of OpenMP® regions can be tuned by modifying the number of teams or threads
- Debugging and profiling OpenMP® offloading code on AMD GPUs
- Discussed the challenges in adding OpenMP® offloading support in HPC applications
- Compiler related challenges
  - Having a standard benchmark for capturing the compiler related issues would be helpful
- Heterogeneous Memory Management (HMM) is available to use on AMD systems

#### **Disclaimer**

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. Any computer system has risks of security vulnerabilities that cannot be completely prevented or mitigated. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

THIS INFORMATION IS PROVIDED 'AS IS." AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY RELIANCE, DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Third-party content is licensed to you directly by the third party that owns the content and is not licensed to you by AMD. ALL LINKED THIRD-PARTY CONTENT IS PROVIDED "AS IS" WITHOUT A WARRANTY OF ANY KIND. USE OF SUCH THIRD-PARTY CONTENT IS DONE AT YOUR SOLE DISCRETION AND UNDER NO CIRCUMSTANCES WILL AMD BE LIABLE TO YOU FOR ANY THIRD-PARTY CONTENT. YOU ASSUME ALL RISK AND ARE SOLELY RESPONSIBLE FOR ANY DAMAGES THAT MAY ARISE FROM YOUR USE OF THIRD-PARTY CONTENT.

© 2023 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ROCm<sup>™</sup>, EPYC<sup>™</sup>, Instinct<sup>™</sup> and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.

The OpenMP® name and the OpenMP logo are registered trademarks of the OpenMP® Architecture Review Board



#