A number of compilers and tools from various vendors and open-source community initiatives implement the OpenMP API. If we are missing any, please Contact Us with your suggestions.

Updated: November 2024

Compilers

Vendor / Source

Compiler/Language

Information

AOCC – C/C++/Fortran

The AMD Optimizing C/C++ Compiler (AOCC) is a high-performance x86 CPU compiler suite supporting C/C++ and Fortran applications and providing advanced optimizations. It is a Clang/LLVM- and Classic Flang-based compiler suite that supports OpenMP 5.0 and a good subset of OpenMP 5.1 and 5.2 for C/C++, and OpenMP 4.5 for Fortran. OpenMP offloading is currently not supported with AOCC. Read More

ROCm – C/C++/Fortran

The ROCm compiler collection is part of the AMD ROCm software stack to support offloading to AMD Instinct accelerators and AMD Radeon GPUs. The C/C++ compiler is based on the latest LLVM compiler with additional open-source features and optimizations provided by AMD. Read More. The current Fortran support in the ROCm compiler is based on the open-source Classic Flang with limited offloading support. Improved support for Fortran and the OpenMP API is expected in 2025.

C/C++/Fortran – Available on Linux

C/C++ – Support for OpenMP 3.1 and all non-offloading features of OpenMP 4.0/4.5/5.0. Offloading features are under development. Fortran – Full support for OpenMP 3.1 and limited support for OpenMP 4.0/4.5. Compile and link your code with -fopenmp. Read More

Mercurium – C/C++/Fortran

Mercurium is a source-to-source research compiler that is available to download at https://github.com/bsc-pm/mcxx. OpenMP 3.1 is almost fully supported for C, C++, and Fortran. Apart from that, almost all tasking features introduced in newer versions of OpenMP are also supported. Read More

Flang – Classic

Flang – Fortran

Classic Flang is a Fortran compiler for LLVM. Classic Flang implements substantially all of OpenMP 4.5 on Linux/x86-64, Linux/ARM, and Linux/OpenPOWER, with limited target offload support on NVIDIA GPUs.

By default, TARGET regions are mapped to the multicore host CPU as the target with DO and DISTRIBUTE loops parallelized across all OpenMP threads. SIMD works by passing vectorisation metadata to LLVM. Known limitations: DECLARE SIMD has no effect on SIMD code generation; TASK DEPEND/PRIORITY, TASKLOOP FIRSTPRIVATE/LASTPRIVATE, DECLARE REDUCTION and the LINEAR/SCHEDULE/ORDERED(N) clauses on the DO construct are not supported. The limited support for target offload to NVIDIA GPUs includes basic support for offload of !$omp target combined constructs.

Compile with -mp to enable OpenMP for multicore CPUs on all platforms. Compile with -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda to enable target offload to NVIDIA GPUs. Read More.

C/C++/Fortran

The compilers in the software package of 'Technical Computing Suite for the PRIMEHPC FX1000/700' support OpenMP 4.5 features (*1) and some OpenMP 5.0 features. Read More.
*1: Device constructs are excluded.

GCC – C/C++/Fortran

The free and open-source GNU Compiler Collection (GCC) supports, among others, Linux, Solaris, AIX, MacOSX, Windows, FreeBSD, NetBSD, OpenBSD, DragonFly BSD, HPUX, RTEMS, for architectures such as x86_64, PowerPC, ARM, RISC-V, and many more.

Code offloading to NVIDIA GPUs (nvptx) and AMD GPUs (Graphics Core Next (GCN), Instinct CDNA, and Radeon (RDNA)) is supported on Linux.

GCC 14 supports all of OpenMP 4.5 and most of 5.0, 5.1, and 5.2 for C, C++, and Fortran; compared with GCC 13, the OpenMP 5.x and AMD GPU support was extended. The main missing features are metadirectives, some mapping features, interop, unified-shared memory, and OMPT/OMPD.

The GCC 15 development branch (mainline) is being updated and supports additional 5.x and initial 6.0 features, including loop transformations, unified-shared memory, and Fortran deep mapping, as well as performance improvements. The devel/omp/gcc-14 (OG14) branch is based on GCC 14, augmented by additional 5.x features.

OpenMP 4.0 is fully supported for C, C++ and Fortran since GCC 4.9; OpenMP 4.5 is fully supported for C and C++ since GCC 6 and partially for Fortran since GCC 7. OpenMP 5.0 is partially supported for C and C++ since GCC 9 and extended in GCC 10. Since GCC 11, OpenMP 4.5 is fully supported for Fortran and OpenMP 5.0 support has been extended for C, C++ and Fortran. GCC 12 has the initial support of OpenMP 5.1 and extends the OpenMP 5.0 coverage. GCC 13 and GCC 14 implement several of the OpenMP 5.2 features and extend the OpenMP 5.0 and 5.1 support.

Compile with -fopenmp to enable OpenMP.

GCC binary builds are provided by Linux distributions, often with offloading support provided by additional packages, and by multiple entities for other platforms – and you can build it from source.
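
For illustration, a minimal sketch (the file and program names here are made up, not taken from the GCC documentation) of a work-sharing loop built with -fopenmp:

    /* harmonic.c - minimal OpenMP work-sharing example (illustrative sketch).
       Build:  gcc -fopenmp harmonic.c -o harmonic
       Run:    OMP_NUM_THREADS=4 ./harmonic                                   */
    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        const int n = 1000000;
        double sum = 0.0;

        /* Loop iterations are divided among the threads; the reduction
           clause combines the per-thread partial sums at the end. */
        #pragma omp parallel for reduction(+: sum)
        for (int i = 1; i <= n; ++i)
            sum += 1.0 / i;

        printf("partial harmonic sum = %f (max threads: %d)\n",
               sum, omp_get_max_threads());
        return 0;
    }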

CCE – C/C++/Fortran

CCE is part of the HPE Cray Programming Environment. The HPE Cray Compiling Environment (CCE) 18.0.0 (August 2024) supports OpenMP 5.0 and partially supports OpenMP 5.1 and 5.2 for C, C++, and Fortran (see links below). OpenMP is turned off by default for all languages. For more information on OpenMP support in CCE, see: CCE Release Overview and Introduction to OpenMP manual page.

XL – C/C++/Fortran

XL C/C++ for Linux V16.1.1 and XL Fortran for Linux V16.1.1 fully support OpenMP 4.5 features including the target constructs. Compile with -qsmp=omp to enable OpenMP directives and with -qoffload for offloading the target regions to GPUs. For more information, please visit: IBM XL C/C++ for Linux and IBM XL Fortran for Linux
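
As an illustrative sketch (not taken from IBM documentation; the file name and driver invocation are assumptions), a simple OpenMP 4.5 target region built with these options might look like:

    /* offload_sum.c - illustrative OpenMP 4.5 target example.
       Build (driver name may vary, e.g. xlc_r):
           xlc_r -qsmp=omp -qoffload offload_sum.c -o offload_sum          */
    #include <stdio.h>

    int main(void) {
        enum { N = 1024 };
        double a[N], sum = 0.0;
        for (int i = 0; i < N; ++i) a[i] = i;

        /* Map the array to the device, compute the sum there,
           and copy the scalar result back. */
        #pragma omp target teams distribute parallel for \
                map(to: a[0:N]) map(tofrom: sum) reduction(+: sum)
        for (int i = 0; i < N; ++i)
            sum += a[i];

        printf("sum = %f\n", sum);
        return 0;
    }

The explicit map(tofrom: sum) reflects the OpenMP 4.5 rule that scalars referenced in a target region default to firstprivate.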

C/C++/Fortran

Available on Windows, Linux, and macOS.

  • OpenMP 3.1 C/C++/Fortran fully supported in version 12.0, 13.0, 14.0 compilers
  • OpenMP 4.0 C/C++/Fortran supported in version 15.0 and 16.0 compilers
  • OpenMP 4.5 C/C++/Fortran supported in version 17.0, 18.0, and 19.0 compilers
  • OpenMP 4.5 and a subset of OpenMP 5.0 in the classic C/C++/Fortran compilers 2021.1
  • OpenMP 4.5 and a subset of OpenMP 5.2 supported in the oneAPI SYCL/C++/Fortran compilers 2023.2 under -fiopenmp -fopenmp-targets=spir64
  • OpenMP 5.2 and a subset of OpenMP 6.0 supported in the oneAPI SYCL/C++/Fortran compilers 2024.2 and 2025.0 under -fiopenmp -fopenmp-targets=spir64

Compile with -Qopenmp on Windows, or with -qopenmp or -fiopenmp on Linux or macOS. Compile with -fiopenmp -fopenmp-targets=spir64 on Windows and Linux for offloading support. Read More
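
For illustration, the following minimal sketch (the file name and icx driver invocation are assumptions, not from Intel documentation) shows a loop offloaded with these flags:

    /* vec_scale.c - illustrative OpenMP target offload example.
       Build: icx -fiopenmp -fopenmp-targets=spir64 vec_scale.c -o vec_scale */
    #include <stdio.h>

    int main(void) {
        enum { N = 4096 };
        float x[N];

        /* Offload the loop to the default device; x is copied back afterwards. */
        #pragma omp target teams distribute parallel for map(from: x[0:N])
        for (int i = 0; i < N; ++i)
            x[i] = 2.0f * i;

        printf("x[N-1] = %f\n", x[N - 1]);
        return 0;
    }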

Clang – C/C++

Clang is an open-source (permissively licensed) C/C++ compiler that is available to download gratis at https://releases.llvm.org/download.html.

Support for all non-offloading features of OpenMP 4.5 has been available since Clang 3.9. Support for offload constructs that run on the host is available since Clang 7.0. Support for offloading to GPU devices has been available since Clang 8.0. Support for OpenMP 5.0 is almost complete. Most OpenMP 5.1 features are available as well, see https://clang.llvm.org/docs/OpenMPSupport.html .

Clang defaults to OpenMP 5.0 semantics since release 11.0. Read More

Frontend-independent LLVM/OpenMP documentation, including an FAQ and information about optimizations as well as runtime configurations, can be found at https://openmp.llvm.org. Many non-standard extensions and tooling capabilities for tuning and debugging are available; see also https://openmp.llvm.org/CommandLineArgumentReference.html and https://openmp.llvm.org/design/Runtimes.html.
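
For example, a small tasking sketch (illustrative only; the file name is made up) that builds with plain -fopenmp:

    /* fib_tasks.c - illustrative OpenMP tasking example.
       Build: clang -fopenmp fib_tasks.c -o fib_tasks                       */
    #include <stdio.h>

    static long fib(int n) {
        if (n < 2) return n;
        long a, b;
        #pragma omp task shared(a)       /* child task computes fib(n-1) */
        a = fib(n - 1);
        #pragma omp task shared(b)       /* child task computes fib(n-2) */
        b = fib(n - 2);
        #pragma omp taskwait             /* wait for both children       */
        return a + b;
    }

    int main(void) {
        long result;
        #pragma omp parallel
        #pragma omp single               /* one thread creates the root tasks */
        result = fib(20);
        printf("fib(20) = %ld\n", result);
        return 0;
    }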

Flang – Fortran

Flang is the Fortran frontend of the LLVM compiler infrastructure. The OpenMP support in Flang is a work in progress but has advanced considerably in recent releases. Flang supports parsing of all OpenMP 4.5 constructs and certain OpenMP 5.x constructs and clauses. Semantic checks of OpenMP 4.5 and 5.0 constructs are mostly complete. Code generation for the most common constructs in standards up to OpenMP 2.5 is ready; later standards are a work in progress. Offloading code and moving data across devices works for common cases but is still experimental. Most LLVM/Clang flags will work with Flang, including offloading-related flags and environment variables (see https://openmp.llvm.org/design/Runtimes.html). Read More.

MSVC – C/C++

The Microsoft Visual C/C++ compiler supports the OpenMP 2.0 standard with the -openmp switch.

Experimental support for more recent versions of the standard can be enabled by using the -openmp:llvm switch instead of the -openmp switch. As of Visual Studio 2022 version 17.7, this includes most of the OpenMP 3.1 standard (except the auto schedule, static class members in a threadprivate directive, const-qualified types in the firstprivate clause, and iterators). The implementation of loop collapse in parallel for loops conforms to the OpenMP 5.2 standard. Read More.

Support for the SIMD directives from the OpenMP 4.0 standard is enabled with the -openmp:experimental switch. Read More.
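
A minimal sketch (illustrative only; the cl invocation is an assumption) of a loop vectorized with the OpenMP SIMD directive under that switch:

    /* simd_axpy.c - illustrative OpenMP SIMD example for MSVC.
       Build: cl -openmp:experimental simd_axpy.c                           */
    #include <stdio.h>

    int main(void) {
        enum { N = 1024 };
        float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = (float)i; y[i] = 1.0f; }

        /* Ask the compiler to vectorize this loop using OpenMP SIMD semantics. */
        #pragma omp simd
        for (int i = 0; i < N; ++i)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[N-1] = %f\n", y[N - 1]);
        return 0;
    }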

Nagfor – Fortran

NAG Fortran Compiler 7.2 (Release Note) supports x86 and x64 for Linux, Mac, and Windows; Apple silicon Macs for macOS; and ARM for Linux. Partial support for OpenMP 4.0 and 4.5 is included in the initial release of 7.2, and an update to full support is planned for early 2024. Read More.

C/C++/Fortran

NVIDIA HPC Compilers support a subset of OpenMP 5.1 in Fortran/C/C++ on Linux/x86_64, Linux/AArch64, and NVIDIA GPUs, and fully support OpenMP 3.1 in Fortran/C/C++ on Linux/x86_64 and Linux/AArch64. Compile with -mp to enable OpenMP for multicore CPUs on all platforms. Compile with -mp=gpu to enable target offload to NVIDIA GPUs. Read More.
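
As an illustrative sketch (not from NVIDIA documentation; the file name is made up), the same source can be built for multicore or GPU execution:

    /* saxpy_gpu.c - illustrative OpenMP target offload example.
       Multicore build: nvc -mp saxpy_gpu.c -o saxpy_cpu
       GPU build:       nvc -mp=gpu saxpy_gpu.c -o saxpy_gpu                */
    #include <stdio.h>

    int main(void) {
        enum { N = 1 << 20 };
        static float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        /* With -mp=gpu this loop is offloaded to the GPU; with -mp it runs
           across the host cores. */
        #pragma omp target teams loop map(to: x[0:N]) map(tofrom: y[0:N])
        for (int i = 0; i < N; ++i)
            y[i] = 2.0f * x[i] + y[i];

        printf("y[0] = %f\n", y[0]);
        return 0;
    }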

C/C++/Fortran

The OpenUH 3.x compiler has a full open-source implementation of OpenMP 2.5 and near-complete support for OpenMP 3.0 (including explicit task constructs) on Linux 32-bit or 64-bit platforms.  Read More & Download

C/C++/Fortran

Oracle Developer Studio 12.6 compilers (C, C++, and Fortran) support OpenMP 4.0 features. Compile with -xopenmp to enable OpenMP in the compiler. For this to work, use at least optimization level -xO3, or the recommended -fast option to generate the most efficient code. To debug the code, compile without optimization options, add -g, and use -xopenmp=noopt. Use the -xvpara option for static correctness checking and the -xloopinfo option for loop-level messages. The latter is less comprehensive than the preferred er_src tool, which provides more detailed information on compiler optimizations; add the -g option to the compile options and execute the command "er_src file.o" to extract the information. Read More
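
For illustration, a small sketch (the file name and exact option set are assumptions) of a parallel loop compiled with these options:

    /* matvec.c - illustrative OpenMP example for Oracle Developer Studio.
       Build: cc -xopenmp -xO3 -xvpara -xloopinfo matvec.c -o matvec        */
    #include <stdio.h>

    int main(void) {
        enum { N = 512 };
        static double a[N][N], x[N], y[N];
        for (int i = 0; i < N; ++i) {
            x[i] = 1.0;
            for (int j = 0; j < N; ++j) a[i][j] = 1.0;
        }

        /* default(none) requires explicit data-sharing clauses, which the
           compiler's static checks can then verify. */
        #pragma omp parallel for default(none) shared(a, x, y)
        for (int i = 0; i < N; ++i) {
            double t = 0.0;
            for (int j = 0; j < N; ++j) t += a[i][j] * x[j];
            y[i] = t;
        }

        printf("y[0] = %f\n", y[0]);
        return 0;
    }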

C/C++/Fortran

Refer to the NVIDIA HPC Compilers entry. Read More

Pyjama research compiler – Java

Pyjama is a research compiler for OpenMP directives in Java developed by the Parallel and Reconfigurable Computing lab, University of Auckland. It supports most of the OpenMP 2.5 specification, corresponding to the Common Core. Beyond this, it supports advanced features, including GUI-aware directives, concepts and directives for asynchronous event-driven OpenMP programming, and Java-specific features such as exception handling and loops over iterators. It is based on a source-to-source compiler and a runtime library, both published as open source. The Pyjama website provides Pyjama, examples, documentation and more. The source code is hosted at: https://github.com/ParallelAndReconfigurableComputing/Pyjama.

Tools

Vendor / Source

Tools/Language

Information

Codee – C/C++/Fortran

Codee is a programming development tool that helps improve the performance of C/C++/Fortran applications on multicore CPUs and GPUs using OpenMP. The Codee Static Code Analyzer provides a systematic, predictable approach to enforcing the parallel programming coding rules of the Open Catalog of Best Practices for Modernization and Optimization (https://github.com/codee-com/open-catalog). It also provides Coding Assistant capabilities to enable semi-automatic source code rewriting using OpenMP directives for vectorization, multithreading, and offloading. Overall, it helps novice programmers write OpenMP codes at the expert level. Codee's integrations with IDEs and CI/CD frameworks through standard data-exchange file formats enable optimization improvements earlier in the development cycle. Read More.

Extrae – C, C++, Fortran, Java, Python

Extrae is an instrumentation package that collects performance data and saves it in Paraver trace format. It supports the instrumentation of MPI, OpenMP, OmpSs, pthreads, CUDA/CUPTI, HIP, OpenACC, OpenCL, and GASPI programming models, with programs written in C, C++, Fortran, Java, and Python, as well as combinations of different languages, hybrid, and modular codes. In addition to capturing the activity of the parallel runtime, Extrae can track I/O activity, memory operations, hardware counter metrics, including uncore and network counters, as well as references to the source code.

It is available for most UNIX-based operating systems and has been deployed on all major HPC architectures and platforms, including x86-64, ARM, ARM64, POWER, RISC-V, SPARC64, BlueGene, Cray, and NVIDIA GPUs. For OpenMP, it recognizes the primary runtime calls for Intel, LLVM, and GNU compilers and supports the latest OMPT interface standard, enabling instrumentation at loading time with the production binary. Extrae also features a new tracing mode that summarizes information at the level of long computing phases and parallel regions, supporting both pure-MPI and hybrid MPI + OpenMP applications. Read More

Paraver – C, C++, Fortran, Java, Python

Paraver is a performance analysis tool based on trace data, offering great flexibility for exploring collected information. It was developed to address the need for both a qualitative, high-level overview of application behavior through visual inspection and the ability to conduct detailed, quantitative analysis of specific issues. The tool functions as a data browser, capable of exploring any information contained within its trace format. While Extrae is the primary provider of Paraver traces, the trace format is public and has also been used to collect data on system behavior, power metrics, and user-defined custom metrics. Read More

GNU

gprofng

The gprofng application profiling tool is an open source project that is part of the GNU binutils tool set (https://sourceware.org/binutils). It supports profiling of a C, C++, Java, or Fortran application and fully supports multithreaded applications using OpenMP, Pthreads, or Java Threads. On the hardware side, processors from Intel, AMD, and Arm are supported.

There is no need to recompile the code. With gprofng, one can profile the same executable that is used in the production runs. A brief introduction to gprofng can be found in this blog: gprofng: The Next Generation GNU Profiling Tool.

There is also a GUI that runs on top of the gprofng tools with additional features (https://www.gnu.org/software/gprofng-gui/). For example, the Timeline shows a graphical representation of the run-time behavior of the threads. A brief introduction to the gprofng GUI can be found in this blog: The gprofng GUI: An Easy To Use Performance Analysis Tool.

Code Parallelization Assistant (Reveal) – C, C++, Fortran

HPE’s Code Parallelization Assistant, which is part of the HPE Cray Programming Environment, combines runtime performance statistics and program source code visualization with Cray Compiling Environment (CCE) compile-time optimization feedback to identify and exploit parallelism. This tool provides the ability to easily navigate through source code to highlight dependencies or bottlenecks during the optimization phase of program development or porting.

Using the program library provided by CCE and the performance data collected by HPE's Performance Analysis Tool, users can navigate through their source code to understand which high-level loops could benefit from OpenMP parallelism or from loop-level optimizations such as exposing vector parallelism. It provides dependency and variable-scoping information for those loops and assists the user with creating parallel directives. Read More

Performance Analysis Tools (CrayPat, Perftools, Apprentice2, Apprentice 3) – C, C++, Fortran

HPE’s Performance Analysis Tools suite, which is part of the HPE Cray Programming Environment, provides an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient computing resource usage. With both simple and advanced interfaces, HPE’s Performance Analysis Tools allow the user to easily extract performance information from applications and use the tools’ wealth of capability to profile large, complex codes at scale.

The toolset allows developers to perform sampling and tracing experiments on executables, collecting information at the whole program, function, loop, and line level. Programs that use MPI, SHMEM, OpenMP (including target offload), CUDA, HIP, or a combination of these programming models are supported. Profiling applications built with CCE, Intel, Arm Allinea, AMD, and GNU compilers are supported. Read More

VTune Profiler – SYCL, C, C++, C#, Fortran, Python, Go, Java, OpenCL

Intel® VTune™ Profiler is a low-overhead, high-resolution performance profiling and analysis tool which helps to find and fix performance bottlenecks quickly and realize all the value of your hardware. It is capable of collecting performance statistics on CPUs and GPUs for applications written in various languages, including SYCL, C, C++, Fortran, Python, or any combination of languages, and using OpenMP and MPI. Intel® VTune™ Profiler includes various analysis types such as Hotspots, Threading, HPC Performance Characterization, Memory Consumption, Memory Access, and Microarchitecture Exploration analysis. The expanded Accelerator Profiling enables you to identify issues that prevent GPU code from efficiently using available execution-unit threads by using the GPU Compute/Media Hotspots analysis, and to analyze data transfer to pinpoint inefficient code paths between host and device in the GPU Offload analysis. Intel® VTune™ Profiler also supports efficient workflows to analyze application performance characterization, with pointers to deeper analysis, using Performance Snapshot, as well as to gain insight into system configuration, performance, and behavior with Platform Profiler. Furthermore, the Application Performance Snapshot capability enables analyzing workloads at scale to identify outliers and where they occurred. Read More

Advisor – SYCL, C, C++, Fortran

Intel® Advisor is a design and analysis tool to help ensure your applications realize full performance potential on modern Intel processors. The tool supports C, C++, Fortran, SYCL, OpenMP, OpenCL™ code, and Python. It helps design performant applications on CPUs for efficient threading, vectorization, and memory use. The expanded Offload Modeling capability helps to ensure efficient GPU offload: identify parts of the code that can be profitably offloaded and optimize the code for compute and memory. The Flow Graph Analyzer enables the visualization and analysis of OpenMP task dependence graphs for performance bottlenecks. Additionally, with cache-aware Roofline Analysis, visualization of actual performance against hardware-imposed performance ceilings (rooflines), such as memory bandwidth and compute capacity, helps you identify effective optimization strategies. Intel® Advisor supports viewing the results of roofline profiling and offload modeling in a browser with an interactive HTML report. Read More

Inspector – C, C++, Fortran

Find errors early when they are less expensive to fix. Intel® Inspector is an easy-to-use memory and threading error debugger for C, C++, and Fortran applications that run on Windows* and Linux*. Memory and threading error analysis is enabled for SYCL and OpenMP offloaded code that runs on a CPU target. No special compilers or builds are required; just use a normal debug or production build. Use the graphical user interface or automate regression testing with the command line. It has a stand-alone user interface on Windows and Linux, or it can be integrated with Microsoft Visual Studio. Read More

Trace Analyzer & Collector – C, C++, Fortran

Intel Trace Collector is a low-overhead tracing library that performs event-based tracing in applications at runtime. It collects data about the application's MPI and serial or OpenMP* regions, and can trace a custom set of functions. Intel Trace Analyzer is a GUI-based tool that provides a convenient way to monitor application activities gathered by the Intel Trace Collector. The tools can help you evaluate profiling statistics and load balancing, analyze the performance of subroutines or code blocks, learn about communication patterns, parameters, and performance data, check MPI correctness, and identify communication hotspots. Read More

The Scalasca Trace Tools are a collection of trace-based performance analysis tools that have been specifically designed for use on large-scale systems. A distinctive feature is the scalable automatic trace-analysis component which provides the ability to identify wait states that occur, e.g., as a result of unevenly distributed workloads. Besides merely identifying wait states, the trace analyzer is also able to pinpoint their root causes and to identify the activities on the critical path of the target application, highlighting those routines which determine the length of the program execution and therefore constitute the best candidates for optimization. The Scalasca Trace Tools process traces generated by the Score-P measurement infrastructure and produce reports that can be explored with Cube or TAU ParaProf/PerfExplorer.  Read More

Linaro

Forge  (includes DDT, Map and Performance Reports) – C, C++, Fortran, Python

Linaro Forge is a software development toolkit designed to help Linux developers write correct, scalable, and performant applications for a variety of hardware architectures, including Arm (aarch64), x86_64, and NVIDIA, AMD, and Intel GPUs. Forge includes three components (DDT, MAP, and Performance Reports) and can be used for serial or parallel applications relying on MPI and/or OpenMP.

Linaro DDT is a powerful, easy-to-use graphical debugger. It includes static analysis that highlights potential problems in the source code, integrated memory debugging that can catch reads and writes outside of array bounds, integration with MPI message queues and much more. It provides a complete solution for finding and fixing problems whether on a single thread or thousands of threads. Debug with Linaro DDT

Linaro MAP is a parallel profiler that shows you which lines of code took the most time and why. It supports both interactive and batch modes for gathering profile data, and supports MPI, OpenMP, and single-threaded programs. Syntax-highlighted source code with performance annotations enables you to drill down to the performance of a single line, and a rich set of zero-configuration metrics shows memory usage, floating-point calculations, and MPI usage across processes. Profile with Linaro MAP

Linaro Performance Reports is a lightweight performance analysis tool that generates easy to read reports on an application. The tool processes data from a wide range of sources (including CPU, memory, IO or even energy sensors) and provides actionable feedback to help end-users improve the efficiency of their applications. Analyze with Performance Reports

NVIDIA

NVIDIA Nsight Systems

Nsight Systems for Linux is capable of capturing information about OpenMP events. This functionality is built on the OpenMP Tools Interface (OMPT); full support is available only for runtime libraries that implement the tools interface defined in OpenMP 5.0 or greater.

See the documentation for details and platform support.
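
For background, OMPT-based tools attach to an OpenMP runtime through the ompt_start_tool entry point defined by the specification. The following is a minimal, illustrative skeleton of such a tool (it is not related to Nsight Systems itself, and the tool and file names are made up):

    /* ompt_skeleton.c - illustrative OMPT tool skeleton (OpenMP 5.x tools interface).
       Build as a shared library and point OMP_TOOL_LIBRARIES at it, e.g.:
           clang -fPIC -shared ompt_skeleton.c -o libmytool.so
           OMP_TOOL_LIBRARIES=./libmytool.so ./my_openmp_app                */
    #include <omp-tools.h>
    #include <stdio.h>

    /* Called once the runtime activates the tool; return non-zero to stay active. */
    static int tool_initialize(ompt_function_lookup_t lookup,
                               int initial_device_num, ompt_data_t *tool_data) {
        (void)lookup; (void)initial_device_num; (void)tool_data;
        fprintf(stderr, "OMPT tool initialized\n");
        return 1;
    }

    static void tool_finalize(ompt_data_t *tool_data) {
        (void)tool_data;
        fprintf(stderr, "OMPT tool finalized\n");
    }

    /* The OpenMP runtime looks for this symbol when the program starts. */
    ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                              const char *runtime_version) {
        (void)omp_version; (void)runtime_version;
        static ompt_start_tool_result_t result = {&tool_initialize,
                                                  &tool_finalize, {0}};
        return &result;
    }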

Perforce Software
(RogueWave)

TotalView for HPC – C/C++/Fortran/Python

The TotalView for HPC debugger is designed to handle debugging of thousands of threads and processes at a time with an easy-to-use, modern user interface. TotalView allows you to get complete control over program execution: running, stepping, and halting line-by-line through code within a single thread or arbitrary groups of processes or threads. Resolve bugs faster by working backward from failure using reverse debugging. Track down and solve difficult problems in concurrent programs that use threads, OpenMP (including support for OpenMP 5.x), and MPI. Easily debug CUDA code running on NVIDIA GPUs and HIP code running on AMD GPUs. Find memory leaks, buffer overruns, and other memory problems with just a click of a button using TotalView’s memory debugging. Debug Python and C/C++ code with TotalView’s mixed language debugging capabilities. Quickly learn how to use TotalView with video tutorials and help documentation. Read More

HPCToolkit is an integrated suite of tools for measurement and analysis of program performance. HPCToolkit collects profiles and traces of CPU and GPU-accelerated programs with low measurement overhead on computers ranging from multicore desktop systems to the largest supercomputers. HPCToolkit supports measurement of C, C++, Fortran, and Python programs on Arm, AMD, Intel, and IBM processors as well as AMD, Intel, and NVIDIA GPUs. HPCToolkit can measure applications developed with one or more parallel programming models including MPI, OpenMP, OpenACC, RAJA, Kokkos, and DPC++. If an OpenMP runtime implements the OpenMP Standard’s OMPT interface for tools on CPUs and/or GPUs, HPCToolkit will use it to reconstruct and attribute costs to user-level calling contexts instead of implementation-level calling contexts. Read More

Archer – Data race detection, C, C++, Fortran

Archer is an extension of ThreadSanitizer for OpenMP-aware data race detection. LLVM releases have included the tool since version 10, and Archer is distributed with many LLVM-based compilers (AMD, HPE/Cray, Intel, …). As an OMPT-based tool, Archer understands all OpenMP synchronization semantics and feeds this information into ThreadSanitizer's runtime analysis, which results in highly accurate data race reports. Read more
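
To illustrate (a minimal sketch, assuming an LLVM-based toolchain where Archer is shipped with the OpenMP runtime), compiling a racy loop with -fsanitize=thread lets Archer report the race at run time:

    /* race.c - deliberately racy OpenMP example for Archer/ThreadSanitizer.
       Build: clang -fopenmp -fsanitize=thread -g race.c -o race
       Run:   ./race        (Archer should report a data race on 'counter') */
    #include <stdio.h>

    int main(void) {
        int counter = 0;

        /* Every thread performs an unsynchronized read-modify-write on
           'counter': a data race. Adding reduction(+: counter) or an
           atomic update would fix it. */
        #pragma omp parallel for
        for (int i = 0; i < 100000; ++i)
            counter++;

        printf("counter = %d\n", counter);
        return 0;
    }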

The Score-P measurement infrastructure is an extremely scalable and easy-to-use tool suite for call-path profiling and event tracing of applications written in C, C++, or Fortran. It supports a wide range of HPC platforms and programming models; besides OpenMP, Score-P can hook into other common models, including MPI, SHMEM, Pthreads, CUDA, HIP, OpenCL, OpenACC, Kokkos, and their valid combinations. Score-P is capable of gathering performance information through automatic instrumentation of functions, library interception/wrapping, source-to-source instrumentation, event- and interrupt-based sampling, and hardware performance counters. Score-P measurements are the primary input for a range of specialized analysis tools, such as: Cube, Vampir, Scalasca Trace Tools, TAU, or Extra-P. Read More

Vampir is an easy-to-use framework for performance analysis, which enables developers to quickly study program behavior at a fine-grained level of detail. Performance data obtained from a parallel program run can be analyzed with a collection of specialized performance views. Intuitive navigation and zooming are the key features of the tool, which help to quickly identify inefficient or faulty parts of a program code. Vampir allows analysis of load imbalances in OpenMP programs, visualizes the interplay of parallel APIs, such as MPI and OpenMP, and supports hardware performance counters to evaluate OpenMP code regions. Vampir can visualize multiple input file formats. The primary input format is OTF2 which is the tracing format produced by Score-P. Additionally Vampir also accepts the Trace Event Format, commonly known as Chrome Trace, which is produced by a wide variety of profiling tools. Lastly it also accepts the WfCommons format to visualize scientific workflows. Read More

TAU – C, C++, Fortran, Java, Python, Spark

TAU supports C, C++, Fortran, Java, Python, and Spark. For instrumentation of OpenMP programs, TAU includes source-level instrumentation (Opari), a runtime "collector" API (called ORA) built into an OpenMP compiler (OpenUH), and an OpenMP runtime library supporting OMPT from the OpenMP 5.2 standard, including asynchronous target offload events prototyped by the AMD compiler. View technical paper.

TAU supports both direct probe based measurements as well as event-based sampling modes for profiling. TAU supports an LLVM plugin for selective compiler-based instrumentation using exclude or include lists of functions for inserting probes at function boundaries. For tracing, TAU provides an open-source trace visualizer (Jumpshot) and can generate native OTF2 trace files that may be visualized in the Vampir trace visualizer. It can also generate Google Trace Events traces that may be visualized in Perfetto.dev.

TAU supports instrumentation of OpenMP target directives to offload computation to GPUs using OMPT.  TAU Commander simplifies the TAU workflow and installation. TAU supports both PAPI and LIKWID toolkits to access low-level processor specific hardware performance counter data to correlate it to the OpenMP code regions. TAU ships with a BSD style license and is integrated with the Extreme-scale Scientific Software Stack (E4S). Read More.

APEX – C/C++, Fortran

APEX is an introspection and runtime adaptation library for asynchronous multitasking runtime systems. However, APEX is not only useful for AMT/AMR runtimes; it can be used by any application wanting to perform runtime adaptation to deal with heterogeneous and/or variable environments. APEX provides an API for measuring actions within the OpenMP runtime using the OpenMP 5.2 OMPT interface, including support for target offload events prototyped by the AMD compiler. APEX can generate TAU profiles, CSV files, task graphs and trees, task scatterplots, OTF2 traces, or Google Trace Events traces. APEX also provides a policy engine for autotuning of OpenMP parameters such as thread count, scheduler, or chunk size, or for adaptation to a changing environment such as soft or hard power caps. In the last year, this support has been extended to enable OpenMP autotuning support for the Kokkos Performance Portability library. Read More and additional information.

ZeroSum – C/C++, Fortran

ZeroSum monitors OS threads, OpenMP threads, MPI processes, and the hardware assigned to them, including CPUs, memory usage, and GPU utilization. Supported systems include all Linux operating systems, as well as NVIDIA (CUDA/NVML), AMD (HIP/ROCm-SMI), and Intel (Intel SYCL) GPUs. Host-side monitoring happens through the virtual /proc filesystem, so it should be portable to all Linux systems. For more information see the GitHub page or the HUST @SC2023 publication.