OpenMP Compilers & Tools

A number of compilers and tools from various vendors or open-source community initiatives implement the OpenMP API. If we are missing any, please Contact Us with your suggestions.

Vendor/Source Compiler/Language Information
Absoft Pro Fortran
Fortran
Versions 11.1 and later of the Absoft Fortran 95 compiler for Linux, Windows and Mac OS X include integrated OpenMP 3.0 support. Version 18.0 supports OpenMP 3.1. Compile with -openmp.
More information
AMD AOMP
C/C++
AOMP is AMD’s LLVM/Clang based compiler that supports OpenMP and offloading to multiple GPU acceleration targets (multi-target).
More Information
ARM C/C++/Fortran
Available on Linux
C/C++ – Support for OpenMP 3.1 and all non-offloading features of OpenMP 4.0/4.5; offloading features are under development. Fortran – Full support for OpenMP 3.1 and limited support for OpenMP 4.0/4.5. Compile and link your code with -fopenmp.
More information
Barcelona Supercomputing Center Mercurium
C/C++/Fortran
Mercurium is a source-to-source research compiler that is available to download at https://github.com/bsc-pm/mcxx. OpenMP 3.1 is almost fully supported for C, C++, and Fortran. In addition, almost all tasking features introduced in newer versions of OpenMP are also supported.
More Information
Cray CCE
C/C++/Fortran
Cray Compiling Environment (CCE) 9.1 (November 2019) supports OpenMP 4.5 for C, C++ and Fortran. Limited support for OpenMP 5.0 is also available (see links below). As of CCE 9.0, the default C and C++ compiler is based on Clang and OpenMP is turned off by default for all languages.

For more information on OpenMP support in current and past versions of CCE, see:

Flang
Fortran
Fortran for LLVM. Substantially full OpenMP 4.5 support on Linux/x86-64, Linux/ARM and Linux/OpenPOWER, with limited target offload support on NVIDIA GPUs.

By default, TARGET regions are mapped to the multicore host CPU as the target with DO and DISTRIBUTE loops parallelized across all OpenMP threads. Known limitations: SIMD and DECLARE SIMD have no effect on SIMD code generation; TASK DEPEND/PRIORITY, TASKLOOP FIRSTPRIVATE/LASTPRIVATE, DECLARE REDUCTION and the LINEAR/SCHEDULE/ORDERED(N) clauses on the DO construct are not supported. The limited support for target offload to NVIDIA GPUs includes basic support for offload of !$omp target combined constructs.

Compile with -mp to enable OpenMP for multicore CPUs on all platforms. Compile with -fopenmp -fopenmp-targets=nvptx64-nvidia-cuda to enable target offload to NVIDIA GPUs.

More information.

GNU GCC

C/C++/Fortran

Free and open source – Linux, Solaris, AIX, MacOSX, Windows, FreeBSD, NetBSD, OpenBSD, DragonFly BSD, HPUX, RTEMS

  • From GCC 4.2.0, OpenMP 2.5 is fully supported for C/C++/Fortran.
  • From GCC 4.4.0, OpenMP 3.0 is fully supported for C/C++/Fortran.
  • From GCC 4.7.0, OpenMP 3.1 is fully supported for C/C++/Fortran.
  • In GCC 4.9.0, OpenMP 4.0 is supported for C and C++, but not Fortran.
  • From GCC 4.9.1, OpenMP 4.0 is fully supported for C/C++/Fortran.
  • From GCC 6.1, OpenMP 4.5 is fully supported for C and C++.
  • From GCC 7.1, OpenMP 4.5 is partially supported for Fortran.
  • From GCC 9.1, OpenMP 5.0 is partially supported for C and C++.

Compile with -fopenmp to enable OpenMP.
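
For example, a minimal OpenMP C program along the following lines (the file name and variable names are illustrative) builds with gcc -fopenmp:

    /* sum_omp.c: minimal OpenMP example.
     * Build (as described above): gcc -fopenmp sum_omp.c -o sum_omp
     */
    #include <omp.h>
    #include <stdio.h>

    int main(void)
    {
        const int n = 1000000;
        double sum = 0.0;

        /* Split the loop iterations across a team of threads and
           combine the per-thread partial sums with a reduction. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += 1.0 / (i + 1);

        printf("max threads: %d, sum = %f\n", omp_get_max_threads(), sum);
        return 0;
    }

The OMP_NUM_THREADS environment variable controls the number of threads used at run time.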

Online documentation: https://gcc.gnu.org/onlinedocs/libgomp/
OpenMP support history: https://gcc.gnu.org/projects/gomp/

IBM XL

C/C++/Fortran

XL C/C++ for Linux V16.1.1 and XL Fortran for Linux V16.1.1 fully support OpenMP 4.5 features including the target constructs.
Compile with -qsmp=omp to enable OpenMP directives and with -qoffload for offloading the target regions to GPUs.
For more information, please visit IBM XL C/C++ for Linux and IBM XL Fortran for Linux.
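
As a rough illustration of the target constructs mentioned above, the following C sketch offloads a SAXPY loop to a device. The file name, array size and compiler driver name are assumptions; the -qsmp=omp and -qoffload flags are the ones listed above.

    /* saxpy_offload.c: illustrative OpenMP target offload example.
     * A build line might look like (driver name is an assumption):
     *   xlc_r -qsmp=omp -qoffload saxpy_offload.c -o saxpy_offload
     */
    #include <stdio.h>

    #define N 1000000

    int main(void)
    {
        static float x[N], y[N];
        const float a = 2.0f;

        for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

        /* Map the arrays to the device and parallelize the loop there. */
        #pragma omp target teams distribute parallel for \
                map(to: x[0:N]) map(tofrom: y[0:N])
        for (int i = 0; i < N; i++)
            y[i] = a * x[i] + y[i];

        printf("y[0] = %f (expected 4.0)\n", y[0]);
        return 0;
    }
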
Intel
C/C++/Fortran
Windows, Linux, and MacOSX.

  • OpenMP 3.1 C/C++/Fortran fully supported in the version 12.0, 13.0, and 14.0 compilers
  • OpenMP 4.0 C/C++/Fortran supported in the version 15.0 and 16.0 compilers
  • OpenMP 4.5 C/C++/Fortran supported in the version 17.0, 18.0, and 19.0 compilers
  • OpenMP 4.5 and a subset of OpenMP 5.0 C/C++/Fortran supported in the 19.1 compilers under -qnextgen -fiopenmp

Compile with -Qopenmp on Windows, or with -openmp or -qopenmp on Linux or Mac OS X.
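
As an illustration of the OpenMP 4.x SIMD support listed above, the sketch below vectorizes a small reduction loop. The file name and the icc/icl driver names are assumptions; the -qopenmp and -Qopenmp flags are the ones given above.

    /* dot_simd.c: illustrative OpenMP SIMD example.
     * Possible build lines (driver names are assumptions):
     *   Linux:   icc -qopenmp dot_simd.c -o dot_simd
     *   Windows: icl -Qopenmp dot_simd.c
     */
    #include <stdio.h>

    #define N 4096

    int main(void)
    {
        float a[N], b[N];
        float dot = 0.0f;

        for (int i = 0; i < N; i++) { a[i] = 0.5f; b[i] = 2.0f; }

        /* Ask the compiler to vectorize the loop, reducing into dot. */
        #pragma omp simd reduction(+:dot)
        for (int i = 0; i < N; i++)
            dot += a[i] * b[i];

        printf("dot = %f (expected 4096.0)\n", dot);
        return 0;
    }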

More information

Lahey/Fujitsu Fortran 95
C/C++/Fortran
The compilers in the 'Technical Computing Suite for the PRIMEHPC FX100' software package support OpenMP 3.1.

More information

LLNL ROSE Research Compiler
C/C++/Fortran
ROSE is a source-to-source research compiler supporting OpenMP 3.0 and some OpenMP 4.0 accelerator features targeting NVIDIA GPUs.
More information
LLVM Clang
C/C++
Clang is an open-source (permissively licensed) C/C++ compiler that is available to download gratis at http://llvm.org/releases/download.html. Support for all non-offloading features of OpenMP 4.5 has been available since Clang 3.9. Support for offload constructs that run on the host is available in Clang 7.0. Support for offloading to GPU devices, with some limitations, is available in Clang 8.0. Support for OpenMP 5.0 is under active development.

For full details of supported OpenMP features, compiler flags to use and so on, see: https://clang.llvm.org/docs/OpenMPSupport.html
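
As one example of the non-offloading features, the sketch below uses OpenMP explicit tasks; the file name is illustrative, and -fopenmp is Clang's usual OpenMP switch (see the support page above for the authoritative list of flags).

    /* fib_tasks.c: illustrative OpenMP tasking example.
     * Typical build: clang -fopenmp fib_tasks.c -o fib_tasks
     */
    #include <stdio.h>

    /* Deliberately naive task-parallel Fibonacci, just to show task/taskwait. */
    static long fib(int n)
    {
        long x, y;
        if (n < 2)
            return n;

        #pragma omp task shared(x)
        x = fib(n - 1);

        #pragma omp task shared(y)
        y = fib(n - 2);

        #pragma omp taskwait
        return x + y;
    }

    int main(void)
    {
        long result;

        /* One thread creates the initial task; the whole team helps execute the task tree. */
        #pragma omp parallel
        #pragma omp single
        result = fib(20);

        printf("fib(20) = %ld (expected 6765)\n", result);
        return 0;
    }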

NAG Nagfor

Fortran

NAG Fortran Compiler 6.2 supports OpenMP 3.1 on x86 and x64, for Linux, Mac and Windows. Compile with -openmp.

More Information

OpenUH Research Compiler
C/C++/Fortran
The OpenUH 3.x compiler has a full open-source implementation of OpenMP 2.5 and near-complete support for OpenMP 3.0 (including explicit task constructs) on Linux 32-bit or 64-bit platforms. For more information or to download: https://github.com/uhhpctools/openuh
Oracle
C/C++/Fortran
Oracle Developer Studio 12.6 compilers (C, C++, and Fortran) support OpenMP 4.0 features.

Compile with -xopenmp to enable OpenMP in the compiler. For this to work, use at least optimization level -xO3, or the recommended -fast option, to generate the most efficient code.

To debug the code, compile without optimization, add -g, and use -xopenmp=noopt. Use the -xvpara option for static correctness checking and the -xloopinfo option for loop-level messages. The latter is less comprehensive than the preferred er_src tool, which gives more detailed information on compiler optimizations; to enable it, add -g to the compile options and run the command “er_src file.o” to extract the information.

More information

PGI
C/C++/Fortran
Support for substantially full OpenMP 4.5 in Fortran/C/C++ on Linux/x86-64 and Linux/OpenPOWER. TARGET regions are implemented with default support for the multicore host as the target, and PARALLEL and DISTRIBUTE loops are parallelized across all OpenMP threads.

Known limitations:  SIMD and DECLARE SIMD have no effect on SIMD code generation, except that the SIMD directive is interpreted to mean there are no dependences in a loop and it is safe to auto-vectorize;   TASK DEPEND/PRIORITY, DECLARE REDUCTION and the LINEAR/SCHEDULE/ORDERED(N) clauses on the LOOP construct are not supported.

Support for full OpenMP 3.1 in Fortran/C/C++ on MacOS/x86-64, and in Fortran/C on Windows/x86-64.  Compile with -mp to enable OpenMP on all platforms.

More information

Texas Instruments
C
The TI cl6x compiler v8.x supports OpenMP 3.0 for multicore C66x on TI’s Keystone I family of Multicore C667x/C665x Digital Signal Processor (DSP) SoCs using the Processor-SDK-RTOS.

The Linaro toolchain (gcc) 8.3.0 supports OpenMP 4.5 for multicore Cortex-A15 on TI’s AM572x and Keystone II family (K2H/K2K, K2E, K2L, K2G) SoCs using the Processor-SDK-Linux.

The TI clacc v1.x compiler supports OpenMP 3.0 and device constructs from OpenMP 4.0 for heterogeneous multicore Cortex-A15+C66x-DSP processing on TI’s AM57x and Keystone II family (K2H/K2K, K2E, K2L, K2G) SoCs using both the Processor-SDK-Linux (A15) and Processor-SDK-RTOS (C66x).

See here for the latest versions of the Processor-SDKs for various TI SoCs:

University of Auckland PARC Lab Pyjama – research compiler
Java
Pyjama is a research compiler for OpenMP directives in Java, developed by the Parallel and Reconfigurable Computing lab at the University of Auckland. It supports most of the OpenMP Version 2.5 specification, corresponding to the Common Core. Beyond this, it supports advanced features, including GUI-aware directives, concepts and directives for asynchronous event-driven OpenMP programming, and Java-specific features such as exception handling and loops over iterators. It is based on a source-to-source compiler and a runtime library, both published as open source.
The Pyjama website provides Pyjama, examples, documentation and more. The source code is hosted at: https://github.com/ParallelAndReconfigurableComputing/Pyjama.

(Updated November 2019)

Vendor/Sources Tools/Language Information
Appentra Parallelware Trainer
C, C++, Fortran
Parallelware Trainer is an integrated development environment (IDE) designed to facilitate the learning, usage, and implementation of parallel programming, along with the ability to test the performance improvements of particular parallel implementations. Among other technologies, it can parallelize code for multicore CPUs and GPUs through the multithreading, tasking or offloading paradigms using OpenMP 4.5 directives. Parallelware Trainer also looks for defects in your code as well as potential issues related to parallelism, reporting recommendations on how to fix them right from the integrated code editor.

More Information

Appentra Parallelware Analyzer
C, C++, Fortran
Parallelware Analyzer is a suite of command-line tools aimed at helping software developers build better-quality parallel software in less time. Designed around the needs of developers, Parallelware Analyzer provides the appropriate tools for the key stages of the parallel development workflow, aiding developers with code analysis that would otherwise be error-prone and done manually. The complexity of parallelism is addressed from three different perspectives: finding parallel defects in the code, discovering new opportunities for parallelization in the code, and generating parallel-equivalent code that enables tasks to complete in less time. It can also be easily integrated with DevOps tools to benefit from its automatic usage during continuous integration.

More Information

Arm Forge
(includes DDT, Map and Performance Reports)
C, C++, Fortran, Python
Arm Forge is a software development toolkit designed to help Linux developers write correct, scalable and high-performance applications for a variety of hardware architectures, including x86, Power, Armv8 and accelerators such as NVIDIA GPUs. Forge includes three components (DDT, MAP and Performance Reports) and can be used for serial or parallel applications relying on MPI and/or OpenMP.

Arm DDT is a powerful, easy-to-use graphical debugger. It includes static analysis that highlights potential problems in the source code, integrated memory debugging that can catch reads and writes outside of array bounds, integration with MPI message queues and much more. It provides a complete solution for finding and fixing problems whether on a single thread or thousands of threads. Debug with Arm DDT (https://developer.arm.com/products/software-development-tools/hpc/arm-forge/arm-ddt)

Arm MAP is a parallel profiler that shows you which lines of code took the most time and why. It supports both interactive and batch modes for gathering profile data, and supports MPI, OpenMP and single-threaded programs. Syntax-highlighted source code with performance annotations lets you drill down to the performance of a single line, and a rich set of zero-configuration metrics shows memory usage, floating-point calculations and MPI usage across processes. Profile with Arm MAP (https://developer.arm.com/products/software-development-tools/hpc/arm-forge/arm-map)

Arm Performance Reports is a lightweight performance analysis tool that generates easy-to-read reports on an application. The tool processes data from a wide range of sources (including CPU, memory, I/O and even energy sensors) and provides actionable feedback to help end users improve the efficiency of their applications. Analyze with Performance Reports.
https://developer.arm.com/tools-and-software/server-and-hpc/arm-architecture-tools/arm-performance-reports

BSC Extrae, Paraver
C, C++, Fortran, Java, Python
Extrae is an instrumentation package that collects performance data and saves it in Paraver trace format. It supports instrumentation of MPI, OpenMP, pthreads, OmpSs, CUDA and OpenCL, with C, C++, Fortran, Java and Python. With respect to OpenMP, it recognizes the main runtime calls for the Intel and GNU compilers, allowing instrumentation of the production binary at load time. Support for GASPI and XMP has been started in the scope of projects but is not yet included in the public release. The OMPT interface support is currently outdated but will be upgraded as soon as resources are available.

More information

Paraver is a performance analyzer based on traces, with great flexibility to explore the collected data. It was developed to respond to the need for a qualitative global perception of application behavior by visual inspection, and then to be able to focus on detailed quantitative analysis of the problems. The tool can be considered a data browser that can explore any information expressed in its trace format. Extrae is the main provider of Paraver traces, although the trace format is public and has also been used to collect information on system behavior, power metrics and user-customized metrics.

More information

Cray Reveal
Reveal is Cray’s performance analysis and code optimization tool, combining runtime performance statistics and program source code visualization with Cray Compiling Environment (CCE) compile-time optimization feedback. Reveal supports source code navigation using whole-program analysis data provided by the Cray Compiling Environment, coupled with performance data collected during program execution by the Cray performance tools, to understand which high-level serial loops could benefit from improved parallelism.

Cray Performance Measurement and Analysis Tools
(CrayPat, Apprentice2)
C/C++/Fortran
Cray’s Performance Measurement and Analysis Tools provide an integrated infrastructure for measurement, analysis, and visualization of computation, communication, I/O, and memory utilization to help users optimize programs for faster execution and more efficient use of computing resources. With both simple and advanced interfaces, Cray’s tools allow the user to easily extract performance information from applications and use the tools’ wealth of capability to profile large, complex codes at scale.

The toolset allows developers to perform sampling and tracing experiments on executables, extracting information at the whole-program, function, loop, and line level. Programs that use MPI, SHMEM, OpenMP (including target offload), CUDA, or a combination of these programming models are supported, as are applications built with the Cray, Intel, Arm Allinea, AMD, and GNU compilers.

More information

Intel VTune Amplifier
C, C++, C#, Fortran, Python*, Go*, Java*, OpenCL
Whether you are tuning a simple application for the first time or doing advanced performance optimization on a threaded MPI* application, Intel® VTune™ Amplifier collects the data you need. It gathers a rich set of performance data for hotspots, threading, locks and waits, DirectX*, OpenCL*, OpenMP*, Intel® Threading Building Blocks, bandwidth, cache, memory access, non-uniform memory, storage latency, and more. Profile C, C++, C#, Fortran, Python*, Go*, Java*, and OpenCL, or any mix; unlike single-language profilers, Intel VTune Amplifier analyzes mixed code. It lets you see more data (CPU, FPU, GPU, threading, memory access, and more), get fast answers through easy analysis that turns data into insight, and tune with accurate data and low overhead. Intel VTune Amplifier can also improve your workflow with both local and remote collection and a command-line/graphical interface.

Intel® VTune Amplifier gets the data you need, such as: hotspots (statistical call tree) and statistical call counts; thread profiling with locks and waits analysis; memory access, cache miss, bandwidth and NUMA analysis; FLOPS and FPU utilization; storage accesses mapped to source, latency histograms and I/O wait; OpenCL program kernel tracing and GPU offload; OpenMP scalability analysis and graphical frame analysis; and local and remote data collection with multi-rank setup for MPI applications. To aid in analysis, thread and task activity can be visualized on a timeline. In addition to basic analysis that works on both Intel® and compatible processors, Intel VTune Amplifier offers low-overhead, high-resolution hardware profiling that uses the on-chip Performance Monitoring Unit (PMU) on Intel processors to collect data with very low overhead. This also finds important performance issues such as cache misses, branch mispredictions, bandwidth limits, and more.

Intel® VTune™ Amplifier – Platform Profiler helps users identify how well an application uses the underlying architecture and how to optimize the hardware configuration of their system. It displays the high-level system configuration, such as processor, memory, storage layout, and PCIe* and network interfaces, as well as performance metrics observed on the system such as CPU and memory utilization, CPU frequency, cycles per instruction (CPI), memory and disk input/output (I/O) throughput, power consumption, cache miss rate per instruction, and so on. It is used for longer analyses to see how the system is performing and helps you answer questions such as:

  • Do you have the best configuration?
  • Would more memory help?
  • More I/O?
  • Is the workload performing well?
  • Which stage needs the most tuning?
  • Is it an I/O bottleneck, a memory bottleneck or a compute bottleneck?

VTune Amplifier and Platform Profiler support Intel® Optane DC persistent memory, and can help answer the questions: Can my application benefit from Persistent Memory? Is my application optimized for Persistent Memory?

Intel® VTune™ Amplifier’s Application Performance Snapshot gives you a fast way to see whether your HPC application is making the best use of modern computer hardware. Get a quick overview of MPI, OpenMP, memory and floating-point metrics to see what kind of optimization will have the most impact. Application Performance Snapshot and MPI Performance Snapshot have now been merged into a unified performance snapshot view that is easier, more convenient, and faster to view and analyze.

Intel® VTune™ Amplifier is available as part of Intel® Parallel Studio XE Professional and Cluster Edition.

More Information

Intel Advisor
C, C++, Fortran
Intel® Advisor provides two tools to help ensure your Fortran, C and C++ applications realize full performance potential on modern Intel processors:  Vectorization Advisor and Threading Advisor.

Vectorization Advisor is a vectorization optimization tool that lets you identify loops that will benefit most from vectorization, identify what is blocking effective vectorization, forecast the benefit of alternative data reorganizations, and increase the confidence that vectorization is safe. Additionally, cache-aware Roofline Analysis, which visualizes actual performance against hardware-imposed performance ceilings (rooflines) such as memory bandwidth and compute capacity, helps you identify effective optimization strategies.

Threading Advisor is a threading design and prototyping tool that lets you analyze, design, tune, and check threading design options without disrupting your normal development. Parallelism can be modeled using OpenMP, Threading Building Blocks or Microsoft TPL by adding simple annotations to the code; Threading Advisor then models the design, projects its scalability and performance, and identifies potential dependency errors.

Intel® Advisor is available as part of Intel® Parallel Studio XE Professional and Cluster Edition.

More information

Intel Inspector
C, C++, Fortran
Find errors early when they are less expensive to fix. Intel® Inspector is an easy-to-use memory and threading error debugger for C, C++, and Fortran applications that run on Windows* and Linux*. No special compilers or builds are required. Just use a normal debug or production build. Use the graphical user interface or automate regression testing with the command line. It has a stand-alone user interface on Windows and Linux or it can be integrated with Microsoft Visual Studio*.

Dynamic analysis reveals subtle defects or vulnerabilities when the cause is too complex to be discovered by static analysis. Unlike static analysis, debugger integration lets you diagnose the problem and find the root cause. Intel Inspector finds latent errors on the executed code path plus intermittent and nondeterministic errors, even if the timing scenario that caused the error does not happen.

Unlike other memory and threading analysis tools, Intel Inspector never requires any special recompiles for analysis. Just use your normal debug or production build. (Include symbols so we can map to the source.) This not only makes your workflow faster and easier, it increases reliability and accuracy.

Inspector supports Intel® Optane DC persistent memory, and can find persistent-memory errors such as missing or redundant cache flushes, missing store fences, out-of-order persistent memory stores, and PMDK transaction redo logging errors.

Intel® Inspector is available as part of Intel® Parallel Studio XE Professional and Cluster Edition.

More information

Intel Trace Analyzer & Collector
C, C++, Fortran
Intel® Trace Collector is a low-overhead tracing library that performs event-based tracing in applications at runtime. It collects data about the application’s MPI and serial or OpenMP* regions, and can trace a custom set of functions. The product is completely thread-safe and integrates with C/C++, Fortran, and multithreaded processes with and without MPI. Additionally, it can check for MPI programming and system errors. Recently, support for OpenSHMEM has been added.

Intel® Trace Analyzer is a GUI-based tool that provides a convenient way to monitor application activities gathered by the Intel Trace Collector. You can view the desired level of detail, quickly identify performance hotspots and bottlenecks, and analyze their causes. The tools can help you evaluate profiling statistics and load balancing, analyze performance of subroutines or code blocks, learn about communication patterns, parameters, performance data, check MPI correctness and identify communication hotspots.  The goal is to decrease time to solution and increase application efficiency.

Intel® Trace Analyzer and Collector is part of Intel® Parallel Studio XE Cluster Edition.

More information

Juelich Supercomputing Centre Scalasca Trace Tools
The Scalasca Trace Tools are a collection of trace-based performance analysis tools that have been specifically designed for use on large-scale systems. A distinctive feature is the scalable automatic trace-analysis component which provides the ability to identify wait states that occur, e.g., as a result of unevenly distributed workloads. Besides merely identifying wait states, the trace analyzer is also able to pinpoint their root causes and to identify the activities on the critical path of the target application, highlighting those routines which determine the length of the program execution and therefore constitute the best candidates for optimization. The Scalasca Trace Tools process traces generated by the Score-P measurement infrastructure and produce reports that can be explored with Cube or TAU ParaProf/PerfExplorer.

More information

ParaFormance Technologies ParaFormance
ParaFormance is a software tool-chain that allows software developers to quickly and easily write multi-core software. ParaFormance enables developers to find the sources of parallelism within their code, automatically inserting the parallel business logic (using OpenMP and TBB) under user-controlled guidance, and checking that the parallelized code is thread-safe.

More information

Perforce Software
(RogueWave)
TotalView for HPC
C/C++/Fortran/Python
The TotalView for HPC debugger was originally designed for debugging multi-threaded and multi-process code. Simultaneously debug many processes and threads in a single window to get complete control over program execution: run, step, and halt line-by-line through code within a single thread or within arbitrary groups of processes or threads. Work backwards from failure through reverse debugging, isolating the root cause faster by eliminating repeated restarts of the application. Reproduce difficult problems that occur in concurrent programs that use threads, OpenMP, MPI, and CUDA. Use TotalView’s memory debugging to find memory leaks, API errors and memory overruns in allocated memory. The new GUI extends TotalView’s mixed-language support to include Python wrappers and filters the stack trace of unwanted ‘glue’ routines. TotalView contains early support for OMPD as defined in OpenMP 5.0.

More Information

Rice University HPCToolkit
HPCToolkit is an integrated suite of tools for measurement and analysis of program performance on computers ranging from multicore desktop systems to the nation’s largest supercomputers. HPCToolkit provides accurate measurements of a program’s work, resource consumption, and inefficiency, correlates these metrics with the program’s source code, works with multilingual, fully optimized binaries, has very low measurement overhead, and scales to large parallel systems. HPCToolkit’s measurements support analysis of a program’s execution cost, inefficiency, and scaling characteristics both within and across nodes of a parallel system.

More Information

Score-P Developer Community Score-P
The Score-P measurement infrastructure is an extremely scalable and easy-to-use tool suite for call-path profiling, event tracing, and online analysis of applications written in C, C++, or Fortran. It supports a wide range of HPC platforms and programming models; besides OpenMP, Score-P can hook into other common models, including MPI, SHMEM, Pthreads, CUDA, OpenCL, OpenACC, and their valid combinations. Score-P is capable of gathering performance information through automatic instrumentation of functions, library interception/wrapping, source-to-source instrumentation, event- and interrupt-based sampling, and hardware performance counters. Score-P measurements are the primary input for a range of specialized analysis tools, such as Cube, Vampir, Scalasca Trace Tools, TAU, or Periscope.

More information

Signalogic CIM Heterogeneous Programming
C, C++
CIM enables code generation for combined Intel x86 and Texas Instruments c66x platforms. Within C/C++ source code, OpenMP pragmas can be used to mark sections of code that should be compiled and built for c66x run-time. c66x I/O functions are supported, allowing c66x to “front” incoming data for high-capacity media and streaming applications.

More Information: http://processors.wiki.ti.com/index.php/C66x_Heterogeneous_Programming

Technische Universität Dresden Vampir
Vampir is an easy-to-use framework for performance analysis, which enables developers to quickly study program behavior at a fine-grained level of detail. Performance data obtained from a parallel program run can be analyzed with a collection of specialized performance views. Intuitive navigation and zooming are the key features of the tool, which help to quickly identify inefficient or faulty parts of a program code. Vampir allows analysis of load imbalances in OpenMP programs, visualizes the interplay of parallel APIs, such as MPI and OpenMP, and supports hardware performance counters to evaluate OpenMP code regions. Score-P is the primary code instrumentation and run-time measurement framework for Vampir. It supports various instrumentation methods and tool interfaces, such as OMPT.

More Information.

University of Oregon TAU
C, C++, Fortran, Java, Python, Spark
TAU is a performance evaluation tool that supports both profiling and tracing for programs written in C, C++, Fortran, Java, Python, and Spark. For instrumentation of OpenMP programs, TAU includes source-level instrumentation (Opari), a runtime “collector” API (called ORA) built into an OpenMP compiler (OpenUH), and an OpenMP runtime library supporting OMPT from the OpenMP 5.0 standard. View technical paper. TAU supports both direct probe based measurements as well as event-based sampling modes for profiling. For tracing, TAU provides an open-source trace visualizer (Jumpshot) and can generate native OTF2 trace files that may be visualized in the Vampir trace visualizer. TAU Commander simplifies the TAU workflow and installation. TAU supports both PAPI and LIKWID toolkits to access low-level processor specific hardware performance counter data to correlate it to the OpenMP code regions. TAU ships with a BSD style license.

More Information.

(Updated November 2019)