December 1st, 2023 OpenMP Users Monthly Teleconferences # OpenMP offloaded Quantum ESPRESSO Ferrari-Ruffino Fabrizio CNR-IOM Bellentani Laura HPC - CINECA # QUANTUM ESPRESSO ON ACCELERATORS #### AB INITIO QUANTUM MECHANICS no input parameters for material modeling reduces costs, accelerates discoveries QUANTUM ESPRESSO is an integrated parallel suite of Open-Source computer codes for electronic-structure calculations and materials modeling at the nanoscale. Two-dimensional materials from high-throughput computational exfoliation of experimentally known compounds, Nature Nanotechnology **13**, 246 (2018). doi:10.1038/s41565-017-0035-5 # Materials design at the Exascale #### ICSC National Research Centre for High Performance Computing, Big Data and Quantum Computing Geographic distribution of the authors of the articles citing the main reference articles as QuantumESPRESSO #### **DENSITY FUNCTIONAL THEORY** $$\left[- rac{\hbar^2}{2m} abla^2 + V_{ extsf{s}}(\mathbf{r}) ight]arphi_i(\mathbf{r}) = arepsilon_iarphi_i(\mathbf{r}).$$ ### PLANE WAVES & PSEUDOPOTENTIAL $$\varphi_{\alpha}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}} \exp[iG_{\alpha} \cdot \mathbf{r}]$$ #### **DUAL SPACE TECHNIQUE** $$\psi(\textbf{r}) \rightarrow \psi(\textbf{k}) \rightarrow \psi(\textbf{r})$$ #### **DENSITY FUNCTIONAL THEORY** $$\left[- rac{\hbar^2}{2m} abla^2 + V_{ extsf{s}}(\mathbf{r}) ight]arphi_i(\mathbf{r}) = arepsilon_iarphi_i(\mathbf{r}).$$ ### PLANE WAVES & PSEUDOPOTENTIAL $$\varphi_{\alpha}(\mathbf{r}) = \frac{1}{\sqrt{\Omega}} \exp[iG_{\alpha} \cdot \mathbf{r}]$$ #### **DUAL SPACE TECHNIQUE** $$\psi(\textbf{r}) \rightarrow \psi(\textbf{k}) \rightarrow \psi(\textbf{r})$$ ### Quantum ESPRESSO on HPC clusters ### Quantum ESPRESSO on HPC clusters ### Quantum ESPRESSO on HPC clusters The transition from CUDA to Openacc DIRECTIVE-BASED PROGRAMMING MODELS MAINTAINABLE PORTABLE SINGLE SOURCE CODE Modularity supports interoperability and new programming models Modularity supports interoperability and new programming models Fortran / CUDAF + OpenACC V<7.2 Modularity supports interoperability and new programming models Modularity supports interoperability and new programming models ### **FFTXlib** #### SLAB DECOMPOSITION R planes distributed on z **G** sticks distributed on x and y to balance the workload #### MPI DISTRIBUTED 3D FFTs Local 1D and 2D FFTs MPI collective communications (All-to-all ### **FFTXlib** #### SLAB DECOMPOSITION R planes distributed on z **G** sticks distributed on x and y to balance the workload #### MPI DISTRIBUTED 3D FFTs Local 1D and 2D FFTs + MPI collective communications (All-to-all) ### Batched FFTs for the GPU driver - \* batching of FFTs - \* non blocking MPI communications - \* overlap data/computation/communication by using streams - \* exploits GPUdirect ### Performance assessment - Significant improvement from GPU to CPU - Time to solution decreases with GPU memory - Pool distribution can be added to further speedup Quantum ESPRESSO: one further step towards the exascale, I. Carnimeo et al., Journal of Chemical Theory and Computation J. Chem. Theory Comput. 19, 6992 (2023) # EXPANDING QE PORTABILITY WITH OpenMP ### Outline - Porting roadmap of QE and present status of the porting; - transition from CUDA to directive based porting (openACC and OpenMP); - results from first basic OpenMP porting; - FFTs in QE (FFTXlib library); - streams management in FFTXlib; - results from running on LUMI; - summary & outlook. # On the porting roadmap J. Chem. Phys. 152, 154105 (2020) J. Chem. Theory Comput.19, 6992 (2023) Until v 6.8; • from v 7.0; under development; current goal. ## QE & OpenMP5 - Main parts of FFTXlib ported for Intel® with OpenMP5 (Giacomo Rossi); - Intel<sup>®</sup> Hackathon (May 2022); - first scratch of PWscf porting with omp on Intel® DevCloud (June 2022); - AMD® collaboration (Ossian O'Reilly) for low-level libraries porting (starting July 2022); - LUMI available to QE developers (October 2022); - MAX-3 kick-off preliminary porting (Modena, February 2023); - develop\_omp5 branch on QE official repository: <a href="https://gitlab.com/QEF/q-e">https://gitlab.com/QEF/q-e</a> (July 2023); - LUMI Hackathon at CSC (Helsinki, September 2023). ## OMP porting of QE #### Basic features: - loop offloading; - global variables offloading and pinning; - manage different backends (linear algebra and FFTs); - streams and/or tasks (for async batched FFTs). ## Status of OMP porting in PWscf #### Ported: - FFTs (cpu driver); - KS\_Solver (except diagonalization); - Interfaces for mathematical libraries; - qe instrumentation routines (rocprof) have been added. To be ported: - diagonalization (zhegv); - forces, stress; - codes other than PW. # OpenMP5 Offload #### **CUF** only Host to Device if ( use\_gpu ) then arg\_d = arg endif Routine calls if ( use\_gpu ) then call abc( arg\_d ) else call abc( arg ) endif Interfaces interface abc subroutine abc\_cpu( v ) subroutine abc\_gpu( v\_d ) end interface # OpenMP5 Offload | | CUF only | CUF interfaces OpenACC parent code | |----------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------| | Host to Device | <pre>if ( use_gpu ) then arg_d = arg endif</pre> | !\$acc update device(arg) | | Routine calls | if ( use_gpu ) then call abc( arg_d ) else call abc( arg ) endif | !\$acc host_data use_device(arg) call abc( arg ) !\$acc end host_data | | Interfaces | interface abc subroutine abc_cpu( v ) subroutine abc_gpu( v_d ) end interface | | | | CUF only | CUF interfaces OpenACC parent code | OpenACC only | |----------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------|-----------------------| | Host to Device | <pre>if ( use_gpu ) then arg_d = arg endif</pre> | !\$acc update device(arg) | | | Routine calls | if ( use_gpu ) then call abc( arg_d ) else call abc( arg ) endif | !\$acc host_data use_device(arg) call abc( arg ) !\$acc end host_data | call abc_acc( arg ) | | Interfaces | interface abc subroutine abc_cpu( v ) subroutine abc_gpu( v_d ) end interface | | subroutine abc_acc(v) | | | CUF only | CUF interfaces OpenACC parent code | OpenACC only | OpenACC +<br>OpenMP5 | | |----------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------|---------------------|-------------------------------------------------------------------------------|--| | Host to Device | <pre>if ( use_gpu ) then arg_d = arg endif</pre> | !\$acc update device(arg) | | !\$acc update device(arg)<br>!\$omp target update to(arg) | | | Routine calls | if ( use_gpu ) then call abc( arg_d ) else call abc( arg ) endif | !\$acc host_data use_device(arg) call abc( arg ) !\$acc end host_data | call abc_acc( arg ) | #if defOPENACC call abc_acc( arg ) #elif defOPENMP call abc_omp( arg ) #endif | | | Interfaces | interface abc subroutine abc_cpu( v ) subroutine abc_gpu( v_d ) end interface | | | | | | | CUF only | CUF interfaces OpenACC parent code | OpenACC only | OpenACC +<br>OpenMP5 | |----------------|-------------------------------------------------------------------------------|-----------------------------------------------------------------------|-----------------------|-------------------------------------------------------------------------------| | Host to Device | <pre>if ( use_gpu ) then arg_d = arg endif</pre> | !\$acc update device(arg) | | !\$acc update device(arg)<br>!\$omp target update to(arg) | | Routine calls | if ( use_gpu ) then call abc( arg_d ) else call abc( arg ) endif | !\$acc host_data use_device(arg) call abc( arg ) !\$acc end host_data | call abc_acc( arg ) | #if defOPENACC call abc_acc( arg ) #elif defOPENMP call abc_omp( arg ) #endif | | Interfaces | interface abc subroutine abc_cpu( v ) subroutine abc_gpu( v_d ) end interface | | subroutine abc_acc(v) | subroutine abc_acc( v ) subroutine abc_omp( v ) | | | CUF only | CUF interfaces OpenACC parent code | OpenACC only | OpenACC +<br>OpenMP5 | |----------------|---------------------------------------------------------------------------|-----------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------| | Host to Device | <pre>if ( use_gpu ) then arg_d = arg endif</pre> | !\$acc update device(arg) | | !\$acc update device(arg)<br>!\$omp target update to(arg) | | Routine calls | if ( use_gpu ) then call abc( arg_d ) else call abc( arg ) endif | !\$acc host_data use_device(arg) call abc( arg ) !\$acc end host_data | call abc( arg, offload ) | call abc( arg, offload ) | | nterfaces | interface abc subroutine abc_cpu(v) subroutine abc_gpu(v_d) end interface | | interface abc subroutine abc_cpu( v, offload ) subroutine abc_acc( v, offload ) subroutine abc_omp( v, offload ) end interface | | Courtesy of Ivan Carnimeo (SISSA) The Yambo group in Modena is developing a portable library (devXlib) to manage porting to multiplatform heterogeneous architectures Main developers: A .Ferretti (CNR-NANO) N. Spallanzani (CNR-NANO) G. Rossi (Intel) Courtesy of Ivan Carnimeo (SISSA) #### devXlib - MAX component for wrapping device-oriented routines and utilities - Fortran 90/95 + iso c binding from F2003 - Source code is generated starting from Jinja2 sources via Python scripts - Main components: - Device buffers - Device memcpy / memset - Device allocation\* - Device linear algebra - Auxiliary functions - complex conjugate - scalar add - scale 1. max-centre / Components / deviceXlib · GitLab OpenMP offload enabling wasn't difficult: just added few lines into the original source code ``` !DEV_ACC data present(A,X,Y) !DEV_ACC host_data use_device(A,X,Y) !DEV_OMPGPU target variant dispatch use_device_ptr(A,X,Y) #ifdef _DXL_CUBLAS call cublasZGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY) #else call ZGEMV(TRANS,M,N,ALPHA,A,LDA,X,INCX,BETA,Y,INCY) #endif !DEV_OMPGPU end target variant dispatch !DEV_ACC end host_data !DEV_ACC end data ``` ## Wrappers instead of interfaces ``` SUBROUTINE wave r2g( f in, f out, dfft, igk, howmany set, omp mod ) USE fft helper subroutines, ONLY: fftx psi2c gamma, fftx psi2c k #if defined( OPENMP GPU) USE fft helper subroutines, ONLY: fftx psi2c gamma omp, fftx psi2c k omp #endif USE control flags, ONLY: many fft IMPLICIT NONE omp offload = .FALSE omp map #if defined( OPENMP GPU) IF (PRESENT(omp mod)) THEN omp offload = omp mod>=0 ! run FFT on device (data already mapped) = omp mod>=1 ! map data and run FFT on device omp map ENDIF #endif ``` ``` IF (PRESENT(howmany set)) THEN IF(omp offload) THEN #if defined ( OPENMP GPU) IF(omp map) THEN !$omp target data map(tofrom:f in) CALL fwfft y omp( 'Wave', f in, dfft, howmany=howmany set(3) ) !somp end target data ELSE CALL fwfft y omp( 'Wave', f in, dfft, howmany=howmany set(3) ) #endif ELSE CALL fwfft( 'Wave', f in, dfft, howmany=howmany set(3) ) ENDIF ELSE IF(omp offload) THEN #if defined ( OPENMP GPU) IF(omp map) THEN !$omp target data map(tofrom:f in) CALL fwfft y omp( 'Wave', f in, dfft ) !somp end target data ELSE CALL fwfft y omp( 'Wave', f in, dfft ) ENDIF #endif ELSE CALL fwfft( 'Wave', f in, dfft ) ENDIF ENDIF IF (gamma only) THEN ``` #### Some notes - We need both offloaded and non-offloaded low level routines (e.g. FFTXlib, LAXlib) at the same time; - dispatch construct currently not available on all compilers: not an option at the moment (in general); - we use wrappers with offloading switch to sort CPU and GPU low level library calls; - duplication ("\_omp") of low level routines still necessary (avoidable? In the future?); - omp target for GPU protected from openACC and from CPU omp. #### Backends ``` SUBROUTINE MYDGEMM2( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, & BETA, C, LDC, OMP OFFLOAD ) #if defined( CUDA) use cudafor use cublas #elif defined( OPENMP GPU) #if defined( ONEMKL) use onemkl blas gpu #endif #if defined( ROCBLAS) use rocblas utils #endif #endif CHARACTER*1, INTENT(IN) :: TRANSA, TRANSB INTEGER, INTENT(IN) :: M, N, K, LDA, LDB, LDC DOUBLE PRECISION, INTENT(IN) :: ALPHA, BETA DOUBLE PRECISION :: A( LDA, * ), B( LDB, * ), C( LDC, * ) LOGICAL, INTENT(IN) :: OMP OFFLOAD #if defined( CUDA) attributes(device) :: A, B, C CALL cublasdgemm( TRANSA, TRANSB, M. N. K. ALPHA, A. LDA, B. LDB, & BETA, C, LDC) #elif defined( ONEMKL) IF (OMP OFFLOAD) THEN !$omp target variant dispatch use device ptr(A, B, C) CALL dgemm( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, & C, LDC) !$omp end target variant dispatch ELSE CALL dgemm( TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, & C. LDC) ENDIF #elif defined( ROCBLAS) IF (OMP OFFLOAD) CALL rocblas dgemm( TRANSA, TRANSB, M, N, K, ALPHA, & A, LDA, B, LDB, BETA, C, LDC) IF (.NOT. OMP OFFLOAD) CALL dgemm( TRANSA, TRANSB, M, N, K, ALPHA, A, & LDA, B, LDB, BETA, C, LDC) #else CALL dgemm(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC) #endif END SUBROUTINE MYDGEMM2 ``` ## Exchange-Correlation library - XClib ## Exchange-Correlation library - XClib ``` #if defined( OPENACC) !sacc data present( rho in, grho in, sx out, sc out, vlx out, v2x out, v1c out, v2c out ) copy( err out ) #elif defined( OPENMP GPU) !somp target teams distribute parallel do #elif defined( OPENMP) MODULE exch qga !somp parallel if(ntids==1) default(none) & !$omp private( rho, grho, sx, sx , sxsr, v1x, v1x , v1xsr, & !$omp v2x, v2x , v2xsr, sc, v1c, v2c, iflag, in err ) & !$omp shared( rho in, grho in, length, igcx, exx started, & CONTAINS grho threshold gga, rho threshold gga, gau parameter, & !somp screening parameter, exx fraction, igcc, v1x out, v2x out, & !somp vlc out, v2c out, sx out, sc out, err out ) ! Somp SUBROUTINE becke88( rho, grho, sx, v1x, v2x ) !somp do #endif D0 ir = 1, length in err = 0 USE kind l, ONLY: DP grho = grho in(ir) IMPLICIT NONE rho = ABS(rho in(ir)) #if defined( OPENACC) SELECT CASE( igcx ) CASE(1) #elif defined( OPENMP GPU) !$omp declare target CALL becke88( rho, grho, sx, vlx, v2x ) #endif CASE(2) REAL(DP), INTENT(IN) :: rho, grho REAL(DP), INTENT(OUT) :: sx, v1x, v2x CALL ggax( rho, grho, sx, v1x, v2x ) CASE(3) REAL(DP) :: rho13, rho43, xs, xs2, sa2b8, shm1, dd, dd2, ee CALL pbex( rho, grho, 1, sx, v1x, v2x ) REAL(DP), PARAMETER :: beta=0.0042 DP REAL(DP), PARAMETER :: third=1. DP/3. DP, two13=1.259921049894873 DP CASE(4) rho13 = rho**third ``` rho43 = rho13\*\*4 ## Parent code porting - starting results Reference benchmark: **Ausurf 112** (1 scf step). Starting results (**April 2023**) with **basic porting only**: The total time on LUMI is still bigger than the reference one from M100. The main reasons: - diagonalization backend to be implemented (~70%); - batched FFTs (~15%); - smaller routines to be ported(~15%) # FFTXlib - slab decomposition # FFTXlib - slab decomposition ### FFTXlib - standard flow ## FFTXlib - basic porting Reference benchmark: Ausurf 112 (1 scf step). FFTXlib calls, preliminary results: 3d FFTs with SLAB decomposition (standard case): - reference runs: M100 (V100 gpus) - overall match between LUMI and M100; - H2D-D2H part of the FFT looks a bit slower on LUMI side (still under investigation). # FFTXlib - many bands # FFTXlib - slab decomp. & many bands #### Batched FFTs in QE - HIP - **Batched** 3d-FFT of the **wave-function**: - the input array divided in **4 batches** (on bands); - 1 stream for **FFTs**, 4 streams for **data movements**; - 4 async mpi communications (ISEND, IRECV). Notes: - non-asynchronous memcpy; - memcpy operations D2H/H2D much more time consuming than FFT calls; - memcpy operations **D2D** same order of FFT calls. ### Hip streams & OMP Besides ffts and memcpy, few loops converted into hip-kernels Pinning variables in order to get asynchronicity with data movements from/to streams ``` #if defined(_OPENMP_GPU) pinned_alloc = omp_init_allocator(omp_default_mem_alloc, ntraits, traits) !$omp_allocate(aux) allocator(pinned_alloc) ALLOCATE(aux(current_size), STAT=info) IF ( info /= 0 ) CALL fftx_error__( ' fft_buffers ', ' Allocation failed ', 4 ) !$omp_target_enter_data_map(alloc:aux) !$omp_allocate(aux2) allocator(pinned_alloc) ALLOCATE(aux2(current_size), STAT=info) IF ( info /= 0 ) CALL fftx_error__( ' fft_buffers ', ' Allocation failed ', 5 ) !$omp_target_enter_data_map(alloc:aux2) #else ``` #### Batched FFTs in QE - OMP task+HIP - Need to set up pure OMP porting of batched FFTs for the Intel<sup>®</sup> side; - setting up a starting scheme by using omp task with hip streams and detach clause; ``` !somp parallel !$omp single DO j = 0, batchsize-1, dfft%subbatchsize currsize = min(dfft%subbatchsize, batchsize - j) !$omp task firstprivate(j,currsize) private(i) shared(ptr callback) detach(event) D0 i = 0, currsize - 1 CALL cft 1z omp( f((j+i)*dfft nnr + 1:), sticks(me p), n3, nx3, isgn, & aux(j*dfft nnr + i*ncpx*nx3 +1:), & stream=dfft%bstreams(j/dfft%subbatchsize+1) ) CALL fft scatter many columns to planes store omp( dfft, aux(j*dfft nnr+1:), nx3, & dfft nnr, f(j*dfft nnr+1:), & sticks, dfft%nr3p, isgn, currsize, & i/dfft%subbatchsize+1 ) iadc = hipStreamAddCallback(dfft%bstreams(j/dfft%subbatchsize+1),ptr callback ,c_loc(event), 0) !somp end task ENDDO ``` ## Batched FFTs in QE - OMP task+dep - Starting point: oneMKL does not get explicit streams as input; - Simplest scheme given by n tasks associated to n subbatches; - still in progress ## Streams with multiple standards - Full implementation of hip-based streams FFTs (both aware and non-aware); - implementation of omp task+hip streams; - OpenMP tasks only version in progress. **Streamed FFTXlib** tested by **benchmarking** the **vloc\_psi** routine of PWscf: it is the part of the scf iteration that relies the most on FFTs. #### Next: - Merge of hip streamed code with the cuda one; - Aim: hip-cuda streams and omp tasks with minimal code duplication. ### Results apus - Gold surface; - 112 atoms; - ~1600 electrons. - vloc\_psi only. ### Results (2) cri3-small, vloc\_psi / call #### What's next - Complete the porting of PWscf minor routines; - complete the **openMP task** based version of the **batched FFTs** to enable them on **Intel**® architectures. → Medium-large size benchmarks by the first half of the next year. - Port QE codes other than PWscf (PHonon, CP, EELS, ...); - test openMP on Nvidia® architectures; - incorporate DevXlib. ### Acknowledgments #### QE dev group - Pietro Delugas, SISSA - Ivan Carnimeo, SISSA - Fabrizio Ferrari Ruffino, CNR-IOM - Oscar Baseggio, SISSA - Riccardo Bertossa, SISSA - · Aurora Ponzi, CNR-IOM - Stefano Baroni, SISSA, CNR-IOM - Paolo Giannozzi, UniUD, CNR-IOM #### **CINECA** - Fabio Affinito - . Laura Bellentani - Sergio Orlandini #### **QE** Foundation Francesca Garofalo #### ANL . Ye Luo #### - . Ossian O'Reilly - . Jakub Kurzak Intel® . Giacomo Rossi #### **Nvidia**® - . Filippo Spiga - . Louis Stuber ## THANK YOU