|
|
|
Tim Mattson |
|
Intel Corporation |
|
Computational Sciences Laboratory |
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
|
Parallel computing is when a program uses
concurrency to either: |
|
decrease the runtime for the solution to a
problem. |
|
Increase the size of the problem that can be
solved. |
|
|
|
|
|
|
|
|
|
Most ISV’s have ignored parallel computing (other
than coarse-grained multithreading for GUI’s and systems programming) |
|
Why? |
|
The perceived difficulties of writing parallel
software out-weigh the benefits |
|
The benefits are clear. To increase the amount of parallel
software, we need to reduce the perceived difficulties. |
|
|
|
|
|
|
Thread Libraries |
|
Win32 API - a low level approach. |
|
threads.h++ - a high level class library. |
|
Compiler Directives |
|
OpenMP - portable shared memory parallelism. |
|
Message Passing Libraries |
|
MPI - www.mpi-softtech.com |
|
High Level Tools |
|
www.apri.com,
www.kai.com, www.pgroup.com |
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
|
|
OpenMP: An API for Writing Multithreaded Applications |
|
A set of compiler directives and library
routines for parallel application
programmers |
|
Makes it easy to create multi-threaded (MT)
programs in Fortran, C and C++ |
|
Standardizes last 15 years of SMP practice |
|
|
|
|
|
|
Hardware vendors |
|
Intel, HP, SGI, IBM, SUN, Compaq |
|
Software tools vendors |
|
KAI, PGI, PSR, APR, Absoft |
|
Applications vendors |
|
ANSYS, Fluent, Oxford Molecular, NAG, DOE ASCI,
Dash, Livermore Software, and many others |
|
|
|
|
|
|
|
|
OpenMP is usually used to parallelize loops: |
|
Find your most time consuming loops. |
|
Split them up between threads. |
|
|
|
|
|
|
OpenMP is a shared memory model. |
|
Threads communicate by sharing variables. |
|
Unintended sharing of data can lead to race
conditions: |
|
race condition: when the program’s outcome
changes as the threads are scheduled differently. |
|
To control race conditions: |
|
Use synchronization to protect data conflicts. |
|
Synchronization is expensive so: |
|
Change how data is stored to minimize the need
for synchronization. |
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
|
|
Most of the constructs in OpenMP are compiler
directives or pragmas. |
|
For C and C++, the pragmas take the form: |
|
#pragma omp construct [clause [clause]…] |
|
For Fortran, the directives take one of the
forms: |
|
C$OMP construct [clause [clause]…] |
|
!$OMP construct [clause [clause]…] |
|
*$OMP construct [clause [clause]…] |
|
Since the constructs are directives, an OpenMP
program can be compiled by
compilers that don’t support OpenMP. |
|
|
|
|
|
|
|
|
Most OpenMP constructs apply to structured
blocks. |
|
Structured block: a block of code with one point
of entry at the top and one point of exit at the bottom. The only other branches allowed are STOP
statements in Fortran and exit() in C/C++. |
|
|
|
|
|
OpenMP’s constructs fall into 5 categories: |
|
Parallel Regions |
|
Worksharing |
|
Data Environment |
|
Synchronization |
|
Runtime functions/environment variables |
|
OpenMP is basically the same between Fortran and
C/C++ |
|
|
|
|
You create threads in OpenMP with the “omp
parallel” pragma. |
|
For example, To create a 4 thread Parallel
region: |
|
|
|
|
Each thread executes the same code redundantly. |
|
|
|
|
Write a multithreaded program where each thread
prints a simple message (such as “hello world”). |
|
Use two separate printf statements and include
the thread ID: |
|
|
|
|
|
|
Dynamic mode (the default mode): |
|
The number of threads used in a parallel region
can vary from one parallel region to another. |
|
Setting the number of threads only sets the maximum number of threads - you
could get less. |
|
Static mode: |
|
The number of threads is fixed and controlled by
the programmer. |
|
OpenMP lets you nest parallel regions, but… |
|
A compiler can choose to serialize the nested
parallel region (i.e. use a team with only one thread). |
|
|
|
|
|
OpenMP’s constructs fall into 5 categories: |
|
Parallel Regions |
|
Worksharing |
|
Data Environment |
|
Synchronization |
|
Runtime functions/environment variables |
|
|
|
|
The “for” Work-Sharing construct splits up loop
iterations among the threads in a
team |
|
|
|
|
|
|
|
|
|
The schedule clause effects how loop iterations
are mapped onto threads |
|
schedule(static [,chunk]) |
|
Deal-out blocks of iterations of size “chunk” to
each thread. |
|
schedule(dynamic[,chunk]) |
|
Each thread grabs “chunk” iterations off a queue
until all iterations have been handled. |
|
schedule(guided[,chunk]) |
|
Threads dynamically grab blocks of iterations.
The size of the block starts large and shrinks down to size “chunk” as the
calculation proceeds. |
|
schedule(runtime) |
|
Schedule
and chunk size taken from the OMP_SCHEDULE environment variable. |
|
|
|
|
The Sections work-sharing construct gives a
different structured block to each thread. |
|
|
|
|
A short hand notation that combines the Parallel
and work-sharing construct. |
|
|
|
|
|
|
On the following slide, you’ll see a sequential
program that uses numerical integration to compute an estimate of PI. |
|
Parallelize this program using OpenMP. There are several options (do them all
if you have time): |
|
Do it as an SPMD program using a parallel region
only. |
|
Do it with a work sharing construct. |
|
Remember, you’ll need to make sure multiple
threads don’t overwrite each other’s variables. |
|
|
|
|
|
|
|
|
|
OpenMP’s constructs fall into 5 categories: |
|
Parallel Regions |
|
Worksharing |
|
Data Environment |
|
Synchronization |
|
Runtime functions/environment variables |
|
|
|
|
|
|
Shared Memory programming model: |
|
Most variables are shared by default |
|
Global variables are SHARED among threads |
|
Fortran: COMMON blocks, SAVE variables, MODULE
variables |
|
C: File scope variables, static |
|
But not everything is shared... |
|
Stack variables in sub-programs called from
parallel regions are PRIVATE |
|
Automatic variables within a statement block are
PRIVATE. |
|
|
|
|
|
|
|
|
One can selectively change storage attributes
constructs using the following clauses* |
|
SHARED |
|
PRIVATE |
|
FIRSTPRIVATE |
|
THREADPRIVATE |
|
The value of a private inside a parallel loop
can be transmitted to a global
value outside the loop with: |
|
LASTPRIVATE |
|
The default status can be modified with: |
|
DEFAULT (PRIVATE | SHARED | NONE) |
|
|
|
|
|
|
private(var)
creates a local copy of var for each thread. |
|
The value is uninitialized |
|
Private copy is not storage associated with the
original |
|
|
|
|
|
|
Firstprivate is a special case of private. |
|
Initializes each private copy with the
corresponding value from the master thread. |
|
|
|
|
Lastprivate passes the value of a private from the last iteration to a global variable. |
|
|
|
|
Here’s an example of PRIVATE and FIRSTPRIVATE |
|
|
|
|
|
Note that the default storage attribute is DEFAULT(SHARED)
(so no need to specify) |
|
To change default: DEFAULT(PRIVATE) |
|
each variable in static extent of the parallel
region is made private as if specified in a private clause |
|
mostly saves typing |
|
DEFAULT(NONE): no default for variables in
static extent. Must list storage attribute for each variable in static
extent |
|
|
|
|
|
|
|
Makes global data private to a thread |
|
Fortran: COMMON
blocks |
|
C: File scope and static variables |
|
Different from making them PRIVATE |
|
with PRIVATE global variables are masked. |
|
THREADPRIVATE preserves global scope within each
thread |
|
Threadprivate variables can be initialized using
COPYIN or by using DATA statements. |
|
|
|
|
|
|
|
|
Another clause that effects the way variables
are shared: |
|
reduction (op : list) |
|
The variables in “list” must be shared in the
enclosing parallel region. |
|
Inside a parallel or a worksharing construct: |
|
A local copy of each list variable is made and
initialized depending on the “op” (e.g. 0 for “+”) |
|
pair wise “op” is updated on the local value |
|
Local copies are reduced into a single global
copy at the end of the construct. |
|
|
|
|
|
|
Return to your “pi” program and this time, use
private, reduction and a worksharing construct to parallelize it. |
|
See how similar you can make it to the original
sequential program. |
|
|
|
|
|
|
The data scope clauses take a list argument |
|
The list can be a common block name with is a
short hand for listing all the variables in the common block. |
|
Default private for some loop indices: |
|
Fortran: loop indices are private even if they are specified as
shared. |
|
C: Loop indices on “work-shared loops” are
private when they otherwise would
be shared. |
|
Not all privates are undefined |
|
Allocatable arrays in Fortran |
|
Class type (I.e. non-POD) variables in C++. |
|
|
|
|
Variables privitized in a parallel region can
not be reprivitzied on an enclosed worksharing directive. |
|
Assumed size and assumed shape arrays can not be
privitized. |
|
Fortran pointers or allocatable arrays can be
private or shared but not lastprivate or firstprivate. |
|
When a common block is listed in a private,
firstprivate, or lastprivate clause, its constituent elements can’t appear
in other data scope clauses. |
|
If an element of a shared common block is
privitized, it is no longer storage associated with the common block. |
|
|
|
|
|
OpenMP’s constructs fall into 5 categories: |
|
Parallel Regions |
|
Worksharing |
|
Data Environment |
|
Synchronization |
|
Runtime functions/environment variables |
|
|
|
|
|
|
OpenMP has the following constructs to support
synchronization: |
|
atomic |
|
critical section |
|
barrier |
|
flush |
|
ordered |
|
single |
|
master |
|
|
|
|
Only one thread at a time can enter a critical
section. |
|
|
|
|
Atomic is a special case of a critical section
that can be used for certain simple statements. |
|
It applies only to the update of a memory
location (the update of X in the following example) |
|
|
|
|
Barrier: Each thread waits until all threads
arrive. |
|
|
|
|
The ordered construct enforces the sequential
order for a block. |
|
|
|
|
The master construct denotes a structured
block that is only executed by the
master thread. The other threads just skip it (no implied barriers or
flushes). |
|
|
|
|
The single construct denotes a block of code
that is executed by only one thread. |
|
A barrier and a flush are implied at the end of
the single block. |
|
|
|
|
|
|
The flush construct denotes a sequence point
where a thread tries to create a consistent view of memory. |
|
All memory operations (both reads and writes)
defined prior to the sequence point must complete. |
|
All memory operations (both reads and writes)
defined after the sequence point
must follow the flush. |
|
Variables in registers or write buffers must be
updated in memory. |
|
Arguments to flush specify which variables are
flushed. No arguments specifies that all thread visible variables are
flushed. |
|
|
|
|
|
|
|
|
Barriers are implied on the following OpenMP
constructs: |
|
|
|
|
For, sections and single directives binding to
the same parallel region can’t be nested. |
|
Critical sections with the same name can’t be
nested. |
|
For, sections, and single can not appear in the
dynamic extent of critical, ordered or master. |
|
Barrier can not appear in the dynamic extent of for,
ordered, sections, single., master or critical |
|
Master can not appear in the dynamic extent of for,
sections and single. |
|
Ordered are not allowed inside critical |
|
Any directives legal inside a parallel region
are also legal outside a parallel region in which case they are treated as
part of a team of size one. |
|
|
|
|
|
OpenMP’s constructs fall into 5 categories: |
|
Parallel Regions |
|
Worksharing |
|
Data Environment |
|
Synchronization |
|
Runtime functions/environment variables |
|
|
|
|
|
|
|
Lock routines |
|
omp_init_lock(), omp_set_lock(),
omp_unset_lock(), omp_test_lock() |
|
Runtime environment routines: |
|
Modify/Check the number of threads |
|
omp_set_num_threads(), omp_get_num_threads(),
omp_get_thread_num(), omp_get_max_threads() |
|
Turn on/off nesting and dynamic mode |
|
omp_set_nested(), omp_set_dynamic(),
omp_get_nested(), omp_get_dynamic() |
|
Are we in a parallel region? |
|
omp_in_parallel() |
|
How many processors in the system? |
|
omp_num_procs() |
|
|
|
|
Protect resources with locks. |
|
|
|
|
To fix the number of threads used in a program,
first turn off dynamic mode and then set the number of threads. |
|
|
|
|
|
|
Control how “omp for schedule(RUNTIME)” loop
iterations are scheduled. |
|
OMP_SCHEDULE “schedule[, chunk_size]” |
|
Set the default number of threads to use. |
|
OMP_NUM_THREADS int_literal |
|
Can the program use a different number of
threads in each parallel region? |
|
OMP_DYNAMIC TRUE || FALSE |
|
Will nested parallel regions create new teams of
threads, or will they be serialized? |
|
OMP_NESTED TRUE || FALSE |
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
|
|
Loops are the primary source of parallelism in
scientific and engineering applications. |
|
Compilers detect loops that have independent
iterations. |
|
|
|
|
|
|
|
|
Induction variable substitution: |
|
|
|
|
|
|
Examples of
options from the KAP parallelizing compiler (KAP includes some 60
options) |
|
optimization levels |
|
optimize : simple analysis, advanced analysis,
loop interchanging, array expansion |
|
aggressive: pad common blocks, adjust data
layout |
|
subroutine inline expansion |
|
inline all, specific routines, how to deal with
libraries |
|
try specific optimizations |
|
e.g., recurrence and reduction recognition, loop
fusion |
|
(These transformations may degrade performance) |
|
|
|
|
|
|
Limits on amount of optimization: |
|
e.g., size of optimization data structures,
number of optimization variants tried |
|
Make certain assumptions: |
|
e.g., array bounds are not violated, arrays are
not aliased |
|
Machine parameters: |
|
e.g., cache size, line size, mapping |
|
Listing control |
|
|
|
Note, compiler options can be a substitute for
advanced compiler strategies. If the compiler has limited information, the
user can help out. |
|
|
|
|
|
Source-to-source restructurers: |
|
transformed source code is the actual output |
|
Example:
KAP |
|
Code-generating compilers: |
|
typically have an option for viewing the translated (parallel)
code |
|
Example:
SGI f77 -apo -mplist |
|
|
|
This can be the starting point for code tuning |
|
|
|
|
|
The listing gives many useful clues for
improving the performance: |
|
Loop optimization tables |
|
Reports about data dependences |
|
Explanations about applied transformations |
|
The annotated, transformed code |
|
Calling tree |
|
Performance statistics |
|
|
|
The type of reports to be included in the
listing can be set through compiler options. |
|
|
|
|
|
|
|
|
This task is similar to explicit parallel
programming (will be discussed later) |
|
Two important differences : |
|
The compiler gives hints in its listing, which may tell you where to focus
attention. E.g., which variables have data dependences. |
|
You don’t need to perform all transformations by
hand. If you expose the right information to the compiler, it will do the
translation for you. |
|
(E.g., C$assert independent) |
|
|
|
|
|
|
|
Hand improvements can pay off because |
|
compiler techniques are limited |
|
E.g., array reductions are parallelized by only
few compilers |
|
compilers may have insufficient information |
|
E.g., |
|
loop iteration range may be input data |
|
variables are defined in other subroutines (no
interprocedural analysis) |
|
|
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
1. Performance tuning of several benchmarks |
|
2. Case study of a large-scale application |
|
|
|
|
MDG: A Fortran code of the “Perfect Benchmarks”. |
|
Automatic parallelization does not improve this
code. |
|
|
|
|
|
Step 1: parallelize the most time-consuming
loop. It consumes 95% of the serial execution time. This takes: |
|
array privatization |
|
reduction parallelization |
|
Step 2: balancing the iteration space of this
loop. |
|
Loop is “triangular”. By default unbalanced
assignment of iterations to processors. |
|
|
|
|
|
|
|
|
ARC2D: A Fortran code of the “Perfect
Benchmarks”. |
|
|
|
|
|
|
|
|
|
Step 1: |
|
Loop interchanging increases cache locality
through stride-1 references. |
|
Step 2: |
|
Move parallel loops to outer positions |
|
Step 3: |
|
Move synchronization points outward |
|
Step 4: |
|
Coalesce
loops |
|
|
|
|
|
|
|
|
|
|
|
|
TOMCATV: A Fortran code of the SPEC 95 benchmarks. |
|
|
|
|
|
|
|
Step1: |
|
Parallelizing MAX reduction |
|
Step2: |
|
Cache optimization and granularity increase
through loop interchange and array transpose |
|
|
|
|
|
|
|
|
|
Compilers |
|
the starting point for our performance tuning
was always the compiler-parallelized program. |
|
It reports: parallelized loops, data dependences |
|
Subroutine and loop profilers |
|
focusing attention on the most time-consuming loops is absolutely
essential. |
|
Performance tables: |
|
typically comparing performance differences at
the loop level. |
|
|
|
|
|
|
The methodology that worked for us: |
|
Use compiler-parallelized code as a starting
point |
|
Get loop profile and compiler listing |
|
Inspect time-consuming loops (biggest
potential for improvement) |
|
Case 1. Check for parallelism where the compiler
could not find it |
|
Case 2. Improve parallel loops where the speedup
is limited |
|
|
|
|
|
|
|
Case 1: if the loop is not parallelized
automatically, do this: |
|
Check for parallelism: |
|
read the compiler explanation |
|
a variable may be independent even if the
compiler detects dependences (compilers are conservative) |
|
check if conflicting array is privatizable
(compilers don’t perform array privatization well) |
|
If you find parallelism, add OpenMP parallel
directives, or make the information explicit for the parallelizer |
|
|
|
|
Case 2: if the loop is parallel but does not
perform well, consider several optimization factors: |
|
|
|
|
|
|
Converting a Seismic Processing Application |
|
to OpenMP |
|
Overview of the Application |
|
Basic use of OpenMP |
|
OpenMP Issues Encountered |
|
Performance Results |
|
|
|
|
Representative of modern seismic processing
programs used in the search for oil and gas. |
|
20,000 lines of Fortran. C subroutines interface
with the operating system. |
|
Available in a serial and a parallel variant. |
|
Parallel code is available in a message-passing
and an OpenMP form. |
|
Is part of the SPEChpc benchmark suite. Includes
4 data sets: small to x-large. |
|
|
|
|
|
Program structure: |
|
240 Fortran and 119 C subroutines. |
|
Main algorithms: |
|
FFT,
finite difference solvers |
|
Running time of Seismic (@ 500MFlops): |
|
small data set:
0.1 hours |
|
x-large data set: 48 hours |
|
IO requirement: |
|
small data set: 110 MB |
|
x-large data set: 93 GB |
|
|
|
|
|
Split into p parallel tasks |
|
(p = number of processors) |
|
|
|
|
|
Most data structures are private, |
|
i.e., Each thread has its own copy. |
|
Syntactic forms: |
|
|
|
|
|
|
|
Bulk of computation is done in Fortran |
|
Utility routines are in C: |
|
IO operations |
|
data partitioning routines |
|
communication/synchronization operations |
|
OpenMP-related issues: |
|
IF C/OpenMP compiler is not available, data
privatization must be done through “expansion”. |
|
Mix of Fortran and C is implementation dependent |
|
|
|
|
|
|
|
IO routines and memory allocation are called
within parallel threads, inside C utility routines. |
|
OpenMP requires all standard libraries and
instrinsics to be thread-save. However the implementations are not always
compliant. |
|
® system-dependent
solutions need to be found |
|
The same issue arises if standard C routines are
called inside a parallel Fortran region or in non-standard syntax. |
|
Standard
C compilers do not know anything about OpenMP and the thread-safe
requirement. |
|
|
|
|
|
OpenMP currently does not specify or provide
constructs for controlling the binding of threads to processors. |
|
|
|
|
|
|
|
|
|
Processors can migrate, causing overhead. This
behavior is system-dependent. |
|
System-dependent solutions may be available. |
|
|
|
|
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
|
Shared memory parallel programming is a mixed
bag: |
|
It saves the programmer from having to map data
onto multiple processors. In this
sense, its much easier. |
|
It opens up a range of new errors coming from
unanticipated shared resource conflicts. |
|
|
|
|
|
|
|
|
Race Conditions |
|
The outcome of a program depends on the detailed
timing of the threads in the team. |
|
Deadlock |
|
Threads lock up waiting on a locked resource
that will never become free. |
|
|
|
|
The result varies unpredictably based on
detailed order of execution for each section. |
|
Wrong answers produced without warning! |
|
|
|
|
|
In this example, we choose the assignments to
occur in the order A, B, C. |
|
ICOUNT forces this order. |
|
FLUSH so each thread sees updates to ICOUNT -
NOTE: you need the flush on each read and each write. |
|
|
|
|
The result varies unpredictably because the
value of X isn’t dependable until the barrier at the end of the do loop. |
|
Wrong answers produced without warning! |
|
Solution: Be careful when you use NOWAIT. |
|
|
|
|
The result varies unpredictably because access
to shared variable TMP is not
protected. |
|
Wrong answers produced without warning! |
|
The user probably wanted to make TMP private. |
|
|
|
|
|
Return to your “pi” program and this time, drop the private clause on x. In other words, let all threads use the
same global variable for x. |
|
Does your program still work? |
|
Run it many times and see what happens to the
answer. |
|
Change the number of threads. Does the answer change? |
|
|
|
|
This shows a race condition and a deadlock. |
|
If A is locked by one thread and B by another,
you have deadlock. |
|
If the same thread gets both locks, you get a
race condition - i.e. different behavior depending on detailed interleaving
of the thread. |
|
Avoid nesting different locks. |
|
|
|
|
This shows a race condition and a deadlock. |
|
If A is locked in the first section and the if
statement branches around the unset lock, threads running the other
sections deadlock waiting for the lock to be released. |
|
Make sure you release your locks. |
|
|
|
|
|
Are you using threadsafe libraries? |
|
I/O inside a parallel region can interleave
unpredictably. |
|
Make sure you understand what your constructors
are doing with private objects. |
|
Private variables can mask globals. |
|
Understand when shared memory is coherent. When in doubt, use FLUSH. |
|
NOWAIT
removes implied barriers. |
|
|
|
|
|
Option 1: Analyze your code to make sure every
semantically permitted interleaving of the threads yields the correct
results. |
|
This can be prohibitively difficult due to the
explosion of possible interleavings. |
|
Tools like KAI’s Assure can help. |
|
|
|
|
|
Option 2: Write SMP code that is portable and equivalent to the
sequential form. |
|
Use a safe subset of OpenMP. |
|
Follow a set of “rules” for Sequential
Equivalence. |
|
|
|
|
|
|
|
|
What is Portable Sequential Equivalence (PSE)? |
|
A program is sequentially equivalent if its
results are the same with one thread and many threads. |
|
For a program to be portable (i.e. runs the same
on different platforms/compilers)
it must execute identically when the OpenMP constructs are used or ignored. |
|
|
|
|
|
|
Advantages of PSE |
|
A PSE program can run on a wide range of
hardware and with different compilers - minimizes software development
costs. |
|
A PSE program can be tested and debugged in
serial mode with off the shelf tools - even if they don’t support OpenMP. |
|
|
|
|
|
|
|
|
Two forms of Sequential equivalence based on
what you mean by the phrase “equivalent to the single threaded execution”: |
|
Strong SE: bitwise identical results. |
|
Weak SE:
equivalent mathematically but due to quirks of floating point
arithmetic, not bitwise identical. |
|
|
|
|
|
|
Control data scope with the base language |
|
Avoid the data scope clauses. |
|
Only use private for scratch variables local to
a block (eg. temporaries or loop control variables) whose global
initialization don’t matter. |
|
Locate all cases where a shared variable can be
written by multiple threads. |
|
The access to the variable must be protected. |
|
If multiple threads combine results into a
single value, enforce sequential order. |
|
Do not use the reduction clause. |
|
|
|
|
Everything is shared except I and TMP. These can be private since they are not
initialized and they are unused outside the loop. |
|
The summation into RES occurs in the sequential
order so the result from the program is bitwise compatible with the
sequential program. |
|
Problem: Can be inefficient if threads finish in
an order that’s greatly different from the sequential order. |
|
|
|
|
|
|
For weak sequential equivalence only
mathematically valid constraints are enforced. |
|
Floating point arithmetic is not associative and
not commutative. |
|
In most cases, no particular grouping of
floating point operations is mathematically preferred so why take a
performance hit by forcing the sequential order? |
|
In most cases, if you need a particular grouping
of floating point operations, you have a bad algorithm. |
|
How do you write a program that is portable and
satisfies weak sequential equivalence? |
|
Follow the same rules as the strong case, but
relax sequential ordering constraints. |
|
|
|
|
The summation into RES occurs one thread at a
time, but in any order so the result is not bitwise compatible with the
sequential program. |
|
Much more efficient, but some users get upset
when low order bits vary between program runs. |
|
|
|
|
This program follows the weak PSE rules, but its
still wrong. |
|
In this example, RAND() may not be thread
safe. Even if it is, the
pseudo-random sequences might overlap thereby throwing off the basic statistics. |
|
|
|
|
|
OpenMP is: |
|
A great way to write fast executing code. |
|
Your gateway to special, painful errors. |
|
You can save yourself grief if you consider the possible danger zones as you write
your OpenMP programs. |
|
Tools and/or a discipline of writing portable
sequentially equivalent programs can help. |
|
|
|
|
|
|
Setting the stage |
|
Parallel computing, hardware, software, etc. |
|
OpenMP: A quick overview |
|
OpenMP: A detailed introduction |
|
Automatic Parallelism and Tools support. |
|
OpenMP case studies. |
|
Including performance tuning. |
|
Common bugs in OpenMP programs |
|
and how to avoid them. |
|
The future of OpenMP |
|
|
|
|
|
|
The future of OpenMP is in the hands of the
OpenMP Architecture Review Board (the ARB) |
|
Intel, KAI, IBM, HP, Compaq, Sun, SGI, DOE ASCI |
|
The ARB resolves interpretation issues and
manages the evolution of new OpenMP API’s. |
|
Membership in the ARB is Open to any
organization with a stake in OpenMP. |
|
Research organization (e.g. DOE ASCI) |
|
Hardware vendors (e.g. Intel or HP) |
|
Software vendors (e.g. KAI) |
|
|
|
|
|
We meet every two to four weeks and work on: |
|
Interpretation - resolve ambiguities in the
specifications. |
|
Answer questions - each member organization
takes a one month shift answering questions to the ARB. |
|
Develop new specifications - We usually have a
specification project underway . |
|
Whatever else it takes to further OpenMP’s
impact - press releases, tutorials, test suites, etc. |
|
Meetings are held by email and over the phone so
travel costs are nil. |
|
|
|
|
|
|
To produce API specifications that let
programmers write portable, efficient, and well understood parallel
programs for shared memory systems. |
|
To produce specifications that can be readily
implemented in robust commercial products. |
|
i.e. we
want to standardize common or well understood practices, not chart new
research agendas. |
|
To whatever extent makes sense, deliver
consistency between programming languages. |
|
The specification should map cleanly and
predictably between C, Fortran, and C++. |
|
|
|
|
|
|
|
|
We want OpenMP to be just large enough to
express important, control-parallel, shared memory programs -- but no larger. |
|
OpenMP needs to stay "lean and mean". |
|
Legal programs under an older version of an
OpenMP specification should
continue to be legal under newer specifications. |
|
To whatever extent possible, produce
specifications that are sequentially
consistent. |
|
If sequential consistency is violated, there
should be documented reasons for
doing so. |
|
|
|
|
|
|
|
|
|
OpenMP is an evolving standard. We will see to it that it is well
matched to the changing needs of the shard memory programming community. |
|
Here’s what’s coming in the future: |
|
OpenMP 1.1 for Fortran: |
|
This is a cleaned up version of 1.0 where we
have fixed typos and merged the interpretations into the specification. |
|
Status: We are in the final proofreading stage.
Release: this fall. |
|
OpenMP 2.0 for Fortran: |
|
This is a major update of OpenMP for Fortran. |
|
Status.
Under active development.
Done sometime in 2000. |
|
|
|
|
Began gathering feedback winter 1999. |
|
Started regular meetings to create the standard,
late spring 1999. |
|
Freeze list of features to consider, end of
September, 1999. |
|
Completion date? Probably sometime next year but
we will not rush it. Its better to
be a little late than to do a poor job. |
|
|
|
|
|
In addition to the general goals laid out by the
ARB, we have some specific goals for OpenMP 2.0: |
|
We want to support the Fortran95 language and
the programming practices Fortran95 programmers typically use. |
|
Fortran77 compilers must be able to conform to
OpenMP 2.0. |
|
We want to extend the range of applications that
can be parallelized with OpenMP. |
|
|
|
|
|
|
|
Define threadprivate module data. |
|
Our
number one request from Fortran90 users. |
|
Define worksharing constructs for Fortran90
array expressions. |
|
Allow arrays in reductions. |
|
Cleanup the language |
|
allow comments on a directive |
|
reprivatization of private variables |
|
provide a module defining runtime library
interfaces. |
|
… and many more subtle changes |
|
|
|
|
|
|
Parallel I/O. |
|
This would be a huge addition to the language …
violates the “lean and mean” goal. |
|
Explicit thread groups. |
|
Violates the “lean and mean” rule, but also, how
to implement and use this construct is a research question. |
|
Condition variable synchronization. |
|
violates the sequential consistency goal. Makes it too easy to write programs that
dead-lock under sequential readings. |
|
|
|
|
Get more performance from applications running
on multiprocessor workstations |
|
Get software to market sooner using a simplified
programming model |
|
Reduce support costs by developing for multiple
platforms with a single source code |
|
|
|
|
|
|
|
|
|
Let’s speed up the program with multiple
threads. |
|
Consider the Win32 threads library: |
|
Thread management and interaction is explicit. |
|
Programmer has full control over the threads |
|
|
|
|
|
|
|
|
Threads libraries: |
|
Pro: Programmer has control over everything |
|
Con: Programmer must control everything |
|
|
|
|
|
|
|
|
|
|
|