Notes
Outline
OpenMP: An API for Writing Portable SMP Application Software
Tim Mattson
Intel Corporation
Computational Sciences Laboratory
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
Parallel Computing
What is it?
Parallel computing is when a program uses concurrency to either:
decrease the runtime for the solution to a problem.
Increase the size of the problem that can be solved.
Parallel Computing:
Writing a parallel application.
Parallel Computing:
The Hardware is in great shape.
Parallel Computing:
… but where is the software?
Most ISV’s have ignored parallel computing (other than coarse-grained multithreading for GUI’s and systems programming)
Why?
The perceived difficulties of writing parallel software out-weigh the benefits
The benefits are clear.  To increase the amount of parallel software, we need to reduce the perceived difficulties.
Solution: Effective Standards for parallel programming
Thread Libraries
Win32 API - a low level approach.
threads.h++ - a high level class library.
Compiler Directives
OpenMP - portable shared memory parallelism.
Message Passing Libraries
MPI - www.mpi-softtech.com
High Level Tools
www.apri.com,  www.kai.com, www.pgroup.com
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
OpenMP: Introduction

OpenMP:  An API for Writing Multithreaded Applications
A set of compiler directives and library routines  for parallel application programmers
Makes it easy to create multi-threaded (MT) programs in Fortran, C and C++
Standardizes last 15 years of SMP practice
OpenMP: Supporters*
Hardware vendors
Intel, HP, SGI, IBM, SUN, Compaq
Software tools vendors
KAI, PGI, PSR, APR, Absoft
Applications vendors
ANSYS, Fluent, Oxford Molecular, NAG, DOE ASCI, Dash, Livermore Software, and many others
OpenMP: Programming Model
OpenMP:
How is OpenMP typically used?
OpenMP is usually used to parallelize loops:
Find your most time consuming loops.
Split them up between threads.
OpenMP:
How do threads interact?
OpenMP is a shared memory model.
Threads communicate by sharing variables.
Unintended sharing of data can lead to race conditions:
race condition: when the program’s outcome changes as the threads are scheduled differently.
To control race conditions:
Use synchronization to protect data conflicts.
Synchronization is expensive so:
Change how data is stored to minimize the need for synchronization.
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
OpenMP:
Some syntax details to get us started
Most of the constructs in OpenMP are compiler directives or pragmas.
For C and C++, the pragmas take the form:
#pragma omp construct [clause [clause]…]
For Fortran, the directives take one of the forms:
C$OMP construct [clause [clause]…]
!$OMP construct [clause [clause]…]
*$OMP construct [clause [clause]…]
Since the constructs are directives, an OpenMP program can  be compiled by compilers that don’t support OpenMP.
OpenMP:
Structured blocks
Most OpenMP constructs apply to structured blocks.
Structured block: a block of code with one point of entry at the top and one point of exit at the bottom.  The only other branches allowed are STOP statements in Fortran and exit() in C/C++.
OpenMP: Contents
OpenMP’s constructs fall into 5 categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
OpenMP is basically the same between Fortran and C/C++
OpenMP: Parallel Regions
You create threads in OpenMP with the “omp parallel” pragma.
For example, To create a 4 thread Parallel region:
OpenMP: Parallel Regions
Each thread executes the same code redundantly.
Exercise 1:
A multi-threaded “Hello world” program
Write a multithreaded program where each thread prints a simple message (such as “hello world”).
Use two separate printf statements and include the thread ID:
OpenMP: Some subtle details (don’t worry about these at first)
Dynamic mode (the default mode):
The number of threads used in a parallel region can vary from one parallel region to another.
Setting the number of threads only sets  the maximum number of threads - you could get less.
Static mode:
The number of threads is fixed and controlled by the programmer.
OpenMP lets you nest parallel regions, but…
A compiler can choose to serialize the nested parallel region (i.e. use a team with only one thread).
OpenMP: Contents
OpenMP’s constructs fall into 5 categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
OpenMP: Work-Sharing Constructs
The “for” Work-Sharing construct splits up loop iterations  among the threads in a team
Work Sharing Constructs
A motivating example
OpenMP For constuct:
The schedule clause
The schedule clause effects how loop iterations are mapped onto threads
schedule(static [,chunk])
Deal-out blocks of iterations of size “chunk” to each thread.
schedule(dynamic[,chunk])
Each thread grabs “chunk” iterations off a queue until all iterations have been handled.
schedule(guided[,chunk])
Threads dynamically grab blocks of iterations. The size of the block starts large and shrinks down to size “chunk” as the calculation proceeds.
schedule(runtime)
Schedule  and chunk size taken from the OMP_SCHEDULE environment variable.
OpenMP: Work-Sharing Constructs
The Sections work-sharing construct gives a different structured block to each thread.
OpenMP: Combined Parallel Work-Sharing Constructs
A short hand notation that combines the Parallel and work-sharing construct.
Exercise 2:
A multi-threaded “pi” program
On the following slide, you’ll see a sequential program that uses numerical integration to compute an estimate of PI.
Parallelize this program using OpenMP.  There are several options (do them all if you have time):
Do it as an SPMD program using a parallel region only.
Do it with a work sharing construct.
Remember, you’ll need to make sure multiple threads don’t overwrite each other’s variables.
PI Program:
The sequential program
OpenMP:
More details: Scope of OpenMP constructs
OpenMP: Contents
OpenMP’s constructs fall into 5 categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
Data Environment:
Default storage attributes
Shared Memory programming model:
Most variables are shared by default
Global variables are SHARED among threads
Fortran: COMMON blocks, SAVE variables, MODULE variables
C: File scope variables, static
But not everything is shared...
Stack variables in sub-programs called from parallel regions are PRIVATE
Automatic variables within a statement block are PRIVATE.
Data Environment:
Example storage attributes
Data Environment:
Changing storage attributes
One can selectively change storage attributes constructs using the following clauses*
SHARED
PRIVATE
FIRSTPRIVATE
THREADPRIVATE
The value of a private inside a parallel loop can be transmitted to a  global value outside the loop with:
LASTPRIVATE
The default status can be modified with:
DEFAULT (PRIVATE | SHARED | NONE)
Private Clause
private(var)  creates a local copy of var for each thread.
The value is uninitialized
Private copy is not storage associated with the original
Firstprivate Clause
Firstprivate is a special case of private.
Initializes each private copy with the corresponding value from the master thread.
Lastprivate Clause
Lastprivate passes the value of a  private from the last iteration  to a global variable.
OpenMP:
Another data environment example
Here’s an example of PRIVATE and FIRSTPRIVATE
OpenMP:
Default Clause
Note that the default storage attribute is DEFAULT(SHARED) (so no need to specify)
To change default: DEFAULT(PRIVATE)
each variable in static extent of the parallel region is made private as if specified in a private clause
mostly saves typing
DEFAULT(NONE): no default for variables in static extent. Must list storage attribute for each variable in static extent
OpenMP:
Default Clause Example
Threadprivate
Makes global data private to a thread
Fortran: COMMON  blocks
C: File scope and static variables
Different from making them PRIVATE
with PRIVATE global variables are masked.
THREADPRIVATE preserves global scope within each thread
Threadprivate variables can be initialized using COPYIN or by using DATA statements.
A threadprivate example
OpenMP: Reduction
Another clause that effects the way variables are shared:
reduction (op : list)
The variables in “list” must be shared in the enclosing parallel region.
Inside a parallel or a worksharing construct:
A local copy of each list variable is made and initialized depending on the “op” (e.g. 0 for “+”)
pair wise “op” is updated on the local value
Local copies are reduced into a single global copy at the end of the construct.
OpenMP: 
Reduction example
Exercise 3:
A multi-threaded “pi” program
Return to your “pi” program and this time, use private, reduction and a worksharing construct to parallelize it.
See how similar you can make it to the original sequential program.
OpenMP: Some subtle details (don’t worry about these at first)
The data scope clauses take a list argument
The list can be a common block name with is a short hand for listing all the variables in the common block.
Default private for some loop indices:
Fortran: loop indices are  private even if they are specified as shared.
C: Loop indices on “work-shared loops” are private when  they otherwise would be shared.
Not all privates are undefined
Allocatable arrays in Fortran
Class type (I.e. non-POD) variables in C++.
OpenMP: More subtle details (don’t worry about these at first)
Variables privitized in a parallel region can not be reprivitzied on an enclosed worksharing directive.
Assumed size and assumed shape arrays can not be privitized.
Fortran pointers or allocatable arrays can be private or shared but not lastprivate or firstprivate.
When a common block is listed in a private, firstprivate, or lastprivate clause, its constituent elements can’t appear in other data scope clauses.
If an element of a shared common block is privitized, it is no longer storage associated with the common block.
OpenMP: Contents
OpenMP’s constructs fall into 5 categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
OpenMP: Synchronization
OpenMP has the following constructs to support synchronization:
atomic
critical section
barrier
flush
ordered
single
master
OpenMP: Synchronization
Only one thread at a time can enter a critical section.
OpenMP: Synchronization
Atomic is a special case of a critical section that can be used for certain simple statements.
It applies only to the update of a memory location (the update of X in the following example)
OpenMP: Synchronization
Barrier: Each thread waits until all threads arrive.
OpenMP: Synchronization
The ordered construct enforces the sequential order for a block.
OpenMP: Synchronization
The master construct denotes a structured block  that is only executed by the master thread. The other threads just skip it (no implied barriers or flushes).
OpenMP: Synchronization
The single construct denotes a block of code that is executed by only one thread.
A barrier and a flush are implied at the end of the single block.
OpenMP: Synchronization
The flush construct denotes a sequence point where a thread tries to create a consistent view of memory.
All memory operations (both reads and writes) defined prior to the sequence point must complete.
All memory operations (both reads and writes) defined after  the sequence point must follow the flush.
Variables in registers or write buffers must be updated in memory.
Arguments to flush specify which variables are flushed. No arguments specifies that all thread visible variables are flushed.
OpenMP:
A flush example
OpenMP:
Implicit synchronization
Barriers are implied on the following OpenMP constructs:
OpenMP: Some subtle details on directive nesting
For, sections and single directives binding to the same parallel region can’t be nested.
Critical sections with the same name can’t be nested.
For, sections, and single can not appear in the dynamic extent of critical, ordered or master.
Barrier can not appear in the dynamic extent of for, ordered, sections, single., master or critical
Master can not appear in the dynamic extent of for, sections and single.
Ordered are not allowed inside critical
Any directives legal inside a parallel region are also legal outside a parallel region in which case they are treated as part of a team of size one.
OpenMP: Contents
OpenMP’s constructs fall into 5 categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
OpenMP: Library routines
Lock routines
omp_init_lock(), omp_set_lock(), omp_unset_lock(), omp_test_lock()
Runtime environment routines:
Modify/Check the number of threads
omp_set_num_threads(), omp_get_num_threads(), omp_get_thread_num(), omp_get_max_threads()
Turn on/off nesting and dynamic mode
omp_set_nested(), omp_set_dynamic(), omp_get_nested(), omp_get_dynamic()
Are we in a parallel region?
omp_in_parallel()
How many processors in the system?
omp_num_procs()
OpenMP: Library Routines
Protect resources with locks.
OpenMP: Library Routines
To fix the number of threads used in a program, first turn off dynamic mode and then set the number of threads.
OpenMP: Environment Variables
Control how “omp for schedule(RUNTIME)” loop iterations are scheduled.
OMP_SCHEDULE “schedule[, chunk_size]”
Set the default number of threads to use.
OMP_NUM_THREADS int_literal
Can the program use a different number of threads in each parallel region?
OMP_DYNAMIC TRUE || FALSE
Will nested parallel regions create new teams of threads, or will they be serialized?
OMP_NESTED TRUE || FALSE
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
Generating OpenMP Programs Automatically
The Basics About
Parallelizing Compilers
Loops are the primary source of parallelism in scientific and engineering applications.
Compilers detect loops that have independent iterations.
Basic Compiler Transformations
Data privatization:
Basic Compiler Transformations
Reduction recognition:
Basic Compiler Transformations
Induction variable substitution:
Compiler Options
Examples of  options from the KAP parallelizing compiler (KAP includes some 60 options)
optimization levels
optimize : simple analysis, advanced analysis, loop interchanging, array expansion
aggressive: pad common blocks, adjust data layout
subroutine inline expansion
inline all, specific routines, how to deal with libraries
try specific optimizations
e.g., recurrence and reduction recognition, loop fusion
(These transformations may degrade performance)
More About Compiler Options
Limits on amount of optimization:
e.g., size of optimization data structures, number of optimization variants tried
Make certain assumptions:
e.g., array bounds are not violated, arrays are not aliased
Machine parameters:
e.g., cache size, line size, mapping
Listing control
Note, compiler options can be a substitute for advanced compiler strategies. If the compiler has limited information, the user can help out.
Inspecting the Translated Program
Source-to-source restructurers:
 transformed source code is the actual output
 Example: KAP
Code-generating compilers:
 typically have an option for viewing the translated (parallel) code
 Example: SGI f77 -apo -mplist
This can be the starting point for code tuning
Compiler Listing
The listing gives many useful clues for improving the performance:
Loop optimization tables
Reports about data dependences
Explanations about applied transformations
The annotated, transformed code
Calling tree
Performance statistics
The type of reports to be included in the listing can be set through compiler options.
Slide 75
Tuning Automatically-
Parallelized Code
This task is similar to explicit parallel programming (will be discussed later)
Two important differences :
The compiler gives hints in its listing,  which may tell you where to focus attention. E.g., which variables have data dependences.
You don’t need to perform all transformations by hand. If you expose the right information to the compiler, it will do the translation for you.
(E.g., C$assert independent)
Why Tuning Automatically-Parallelized Code?
Hand improvements can pay off because
compiler techniques are limited
E.g., array reductions are parallelized by only few compilers
compilers may have insufficient information
E.g.,
loop iteration range may be input data
variables are defined in other subroutines (no interprocedural analysis)
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
Performance Tuning and Case Studies with Realistic Applications
1. Performance tuning of several benchmarks
2. Case study of a large-scale application
Performance Tuning
Example 1: MDG
MDG: A Fortran code of the “Perfect Benchmarks”.
Automatic parallelization does not improve this code.
MDG: Tuning Steps
Step 1: parallelize the most time-consuming loop. It consumes 95% of the serial execution time. This takes:
array privatization
reduction parallelization
Step 2: balancing the iteration space of this loop.
Loop is “triangular”. By default unbalanced assignment of iterations to processors.
MDG Code Sample
Performance Tuning
Example 2: ARC2D
ARC2D: A Fortran code of the “Perfect Benchmarks”.
ARC2D: Tuning Steps
Step 1:
Loop interchanging increases cache locality through stride-1 references.
Step 2:
Move parallel loops to outer positions
Step 3:
Move synchronization points outward
Step 4:
 Coalesce loops
ARC2D: Code  Samples
ARC2D: Code  Samples
ARC2D: Code  Samples
Performance Tuning Example 3: TOMCATV
TOMCATV: A Fortran code of the  SPEC 95 benchmarks.
TOMCATV: Tuning Steps
Step1:
Parallelizing MAX reduction
Step2:
Cache optimization and granularity increase through loop interchange and array transpose
TOMCATV Code Samples
TOMCATV Code Samples
What Tools Did We Use for Performance Analysis and Tuning?
Compilers
the starting point for our performance tuning was always the compiler-parallelized program.
It reports: parallelized loops, data dependences
Subroutine and loop profilers
focusing attention on the  most time-consuming loops is absolutely essential.
Performance tables:
typically comparing performance differences at the loop level.
Guidelines for Fixing “Performance Bugs”
The methodology that worked for us:
Use compiler-parallelized code as a starting point
Get loop profile and compiler listing
Inspect time-consuming loops (biggest potential  for improvement)
Case 1. Check for parallelism where the compiler could not find it
Case 2. Improve parallel loops where the speedup is limited
Performance Tuning
Case 1: if the loop is not parallelized automatically, do this:
Check for parallelism:
read the compiler explanation
a variable may be independent even if the compiler detects dependences (compilers are conservative)
check if conflicting array is privatizable (compilers don’t perform array privatization well)
If you find parallelism, add OpenMP parallel directives, or make the information explicit for the parallelizer
Performance Tuning
Case 2: if the loop is parallel but does not perform well, consider several optimization factors:
Case Study of a Large-Scale Application
Converting a Seismic Processing Application
to OpenMP
Overview of the Application
Basic use of OpenMP
OpenMP Issues Encountered
Performance Results
Overview of Seismic
Representative of modern seismic processing programs used in the search for oil and gas.
20,000 lines of Fortran. C subroutines interface with the operating system.
Available in a serial and a parallel variant.
Parallel code is available in a message-passing and an OpenMP form.
Is part of the SPEChpc benchmark suite. Includes 4 data sets: small to x-large.
Seismic:
Basic Characteristics
Program structure:
240 Fortran and 119 C subroutines.
Main algorithms:
 FFT, finite difference solvers
Running time of Seismic (@ 500MFlops):
small data set:  0.1 hours
x-large data set: 48 hours
IO requirement:
small data set: 110 MB
x-large data set: 93 GB
Basic OpenMP Use:
Parallelization Scheme
Split into p parallel tasks
(p = number of processors)
Basic OpenMP Use:
Data Privatization
Most data structures are private,
i.e., Each thread has its own copy.
Syntactic forms:
Basic OpenMP Use:
Synchronization and Communication
OpenMP Issues:
Mixing Fortran and C
Bulk of computation is done in Fortran
Utility routines are in C:
IO operations
data partitioning routines
communication/synchronization operations
OpenMP-related issues:
IF C/OpenMP compiler is not available, data privatization must be done through “expansion”.
Mix of Fortran and C is implementation dependent
OpenMP Issues:
Broadcast Common Blocks
OpenMP Issues:
Multithreading IO and malloc
IO routines and memory allocation are called within parallel threads, inside C utility routines.
OpenMP requires all standard libraries and instrinsics to be thread-save. However the implementations are not always compliant.
 ® system-dependent solutions need to be found
The same issue arises if standard C routines are called inside a parallel Fortran region or in non-standard syntax.
 Standard C compilers do not know anything about OpenMP and the thread-safe requirement.
OpenMP Issues:
Processor Affinity
OpenMP currently does not specify or provide constructs for controlling the binding of threads to processors.
Processors can migrate, causing overhead. This behavior is system-dependent.
System-dependent solutions may be available.
Performance Results
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
SMP Programming errors
Shared memory parallel programming is a mixed bag:
It saves the programmer from having to map data onto multiple processors.  In this sense, its much easier.
It opens up a range of new errors coming from unanticipated shared resource conflicts.
2 major SMP errors
Race Conditions
The outcome of a program depends on the detailed timing of the threads in the team.
Deadlock
Threads lock up waiting on a locked resource that will never become free.
Race Conditions
The result varies unpredictably based on detailed order of execution for each section.
Wrong answers produced without warning!
Race Conditions:
A complicated solution
In this example, we choose the assignments to occur in the order A, B, C.
ICOUNT forces this order.
FLUSH so each thread sees updates to ICOUNT - NOTE: you need the flush on each read and each write.
Race Conditions
The result varies unpredictably because the value of X isn’t dependable until the barrier at the end of the do loop.
Wrong answers produced without warning!
Solution: Be careful when you use NOWAIT.
Race Conditions
The result varies unpredictably because access to shared variable  TMP is not protected.
Wrong answers produced without warning!
The user probably wanted to make TMP private.
Exercise 4:
Race conditions and the “pi” program
Return to your “pi” program and this time,  drop the private clause on x.  In other words, let all threads use the same global variable for x.
Does your program still work?
Run it many times and see what happens to the answer.
Change the number of threads.  Does the answer change?
Deadlock
This shows a race condition and a deadlock.
If A is locked by one thread and B by another, you have deadlock.
If the same thread gets both locks, you get a race condition - i.e. different behavior depending on detailed interleaving of the thread.
Avoid nesting different locks.
Deadlock
This shows a race condition and a deadlock.
If A is locked in the first section and the if statement branches around the unset lock, threads running the other sections deadlock waiting for the lock to be released.
Make sure you release your locks.
OpenMP death-traps
Are you using threadsafe libraries?
I/O inside a parallel region can interleave unpredictably.
Make sure you understand what your constructors are doing with private objects.
Private variables can mask globals.
Understand when shared memory is coherent.  When in doubt, use FLUSH.
 NOWAIT removes implied barriers.
Navigating through the Danger Zones
Option 1: Analyze your code to make sure every semantically permitted interleaving of the threads yields the correct results.
This can be prohibitively difficult due to the explosion of possible interleavings.
Tools like KAI’s Assure can help.
Navigating through the Danger Zones
Option 2: Write SMP code that  is portable and equivalent to the sequential form.
Use a safe subset of OpenMP.
Follow a set of “rules” for Sequential Equivalence.
Portable Sequential Equivalence
What is Portable Sequential Equivalence (PSE)?
A program is sequentially equivalent if its results are the same with one thread and many threads.
For a program to be portable (i.e. runs the same on  different platforms/compilers) it must execute identically when the OpenMP constructs are used or ignored.
Portable Sequential Equivalence
Advantages of PSE
A PSE program can run on a wide range of hardware and with different compilers - minimizes software development costs.
A PSE program can be tested and debugged in serial mode with off the shelf tools - even if they don’t support OpenMP.
2 Forms of Sequential Equivalence
Two forms of Sequential equivalence based on what you mean by the phrase “equivalent to the single threaded execution”:
Strong SE: bitwise identical results.
Weak SE:  equivalent mathematically but due to quirks of floating point arithmetic, not bitwise identical.
Strong Sequential Equivalence: rules
Control data scope with the base language
Avoid the data scope clauses.
Only use private for scratch variables local to a block (eg. temporaries or loop control variables) whose global initialization don’t matter.
Locate all cases where a shared variable can be written by multiple threads.
The access to the variable must be protected.
If multiple threads combine results into a single value, enforce sequential order.
Do not use the reduction clause.
Strong Sequential Equivalence: example
Everything is shared except I and TMP.  These can be private since they are not initialized and they are unused outside the loop.
The summation into RES occurs in the sequential order so the result from the program is bitwise compatible with the sequential program.
Problem: Can be inefficient if threads finish in an order that’s greatly different from the sequential order.
Weak Sequential equivalence
For weak sequential equivalence only mathematically valid constraints are enforced.
Floating point arithmetic is not associative and not commutative.
In most cases, no particular grouping of floating point operations is mathematically preferred so why take a performance hit by forcing the sequential order?
In most cases, if you need a particular grouping of floating point operations, you have a bad algorithm.
How do you write a program that is portable and satisfies weak sequential equivalence?
Follow the same rules as the strong case, but relax sequential ordering constraints.
Weak equivalence: example
The summation into RES occurs one thread at a time, but in any order so the result is not bitwise compatible with the sequential program.
Much more efficient, but some users get upset when low order bits vary between program runs.
Sequential Equivalence isn’t a Silver Bullet
This program follows the weak PSE rules, but its still wrong.
In this example, RAND() may not be thread safe.  Even if it is, the pseudo-random sequences might overlap thereby throwing  off the basic statistics.
Conclusion
OpenMP is:
A great way to write fast executing code.
Your gateway to special, painful errors.
You can save yourself  grief if you consider the possible danger zones as you write your OpenMP programs.
Tools and/or a discipline of writing portable sequentially equivalent programs can help.
SC’99 Tutorial: Agenda
Setting the stage
Parallel computing, hardware, software, etc.
OpenMP: A quick overview
OpenMP: A detailed introduction
Automatic Parallelism and Tools support.
OpenMP case studies.
Including performance tuning.
Common bugs in OpenMP programs
and how to avoid them.
The future of OpenMP
OpenMP Futures: The ARB
The future of OpenMP is in the hands of the OpenMP Architecture Review Board (the ARB)
Intel, KAI, IBM, HP, Compaq, Sun, SGI, DOE ASCI
The ARB resolves interpretation issues and manages the evolution of new OpenMP API’s.
Membership in the ARB is Open to any organization with a stake in OpenMP.
Research organization (e.g. DOE ASCI)
Hardware vendors  (e.g. Intel or HP)
Software vendors (e.g. KAI)
ARB: What do we do
We meet every two to four weeks and work on:
Interpretation - resolve ambiguities in the specifications.
Answer questions - each member organization takes a one month shift answering questions to the ARB.
Develop new specifications - We usually have a specification project underway .
Whatever else it takes to further OpenMP’s impact - press releases, tutorials, test suites, etc.
Meetings are held by email and over the phone so travel costs are nil.
Goals of the ARB: Part 1
To produce API specifications that let programmers write portable, efficient, and well understood parallel programs for shared memory systems.
To produce specifications that can be readily implemented in robust commercial products.
 i.e. we want to standardize common or well understood practices, not chart new research agendas.
To whatever extent makes sense, deliver consistency between programming languages.
The specification should map cleanly and predictably between C, Fortran, and C++.
Goals of the ARB: Part 2.
We want OpenMP to be just large enough to express important, control-parallel, shared memory programs --  but no larger.
OpenMP needs to stay "lean and mean".
Legal programs under an older version of an OpenMP specification should  continue to be legal under newer specifications.
To whatever extent possible, produce specifications that are sequentially  consistent.
If sequential consistency is violated, there should be  documented reasons for doing so.
The Future of OpenMP
OpenMP is an evolving standard.  We will see to it that it is well matched to the changing needs of the shard memory programming community.
Here’s what’s coming in the future:
OpenMP 1.1 for Fortran:
This is a cleaned up version of 1.0 where we have fixed typos and merged the interpretations into the specification.
Status: We are in the  final proofreading stage.  Release: this fall.
OpenMP 2.0 for Fortran:
This is a major update  of OpenMP for Fortran.
Status.  Under  active development. Done sometime in 2000.
OpenMP version 2.0:
Time line.
Began gathering feedback winter 1999.
Started regular meetings to create the standard, late spring 1999.
Freeze list of features to consider, end of September, 1999.
Completion date? Probably sometime next year but we will not rush it.  Its better to be a little late than to do a poor job.
OpenMP 2.0: Goals*
In addition to the general goals laid out by the ARB, we have some specific goals for OpenMP 2.0:
We want to support the Fortran95 language and the programming practices Fortran95 programmers typically use.
Fortran77 compilers must be able to conform to OpenMP 2.0.
We want to extend the range of applications that can be parallelized with OpenMP.
OpenMP 2.0:
Some key items on “the list”
Define threadprivate module data.
Our  number one request from Fortran90 users.
Define worksharing constructs for Fortran90 array expressions.
Allow arrays in reductions.
Cleanup the language
allow comments on a directive
reprivatization of private variables
provide a module defining runtime library interfaces.
… and many more subtle changes
OpenMP 2.0:
Some key items dropped from “the list”
Parallel I/O.
This would be a huge addition to the language … violates the “lean and mean” goal.
Explicit thread groups.
Violates the “lean and mean” rule, but also, how to implement and use this construct is a research question.
Condition variable synchronization.
violates the sequential consistency goal.  Makes it too easy to write programs that dead-lock under sequential readings.
Summary: OpenMP Benefits*
Get more performance from applications running on multiprocessor workstations
Get software to market sooner using a simplified programming model
Reduce support costs by developing for multiple platforms with a single source code
Extra Slides
A series of parallel pi programs
PI Program: an example
Parallel Pi Program
Let’s speed up the program with multiple threads.
Consider the Win32 threads library:
Thread management and interaction is explicit.
Programmer has full control over the threads
Solution: Win32 API, PI
Solution: Keep it simple
Threads libraries:
Pro: Programmer has control over everything
Con: Programmer must control everything
OpenMP PI Program: 
Parallel Region example (SPMD Program)
OpenMP PI Program: 
Work sharing construct
OpenMP PI Program: 
private clause and a critical section
OpenMP PI Program : 
Parallel for with a reduction
Reference Material on OpenMP