Figure 1 presents a simple code example for computing PI using OpenMP. It is meant only to illustrate how a loop may be parallelized in a shared memory programming model. The code would look quite similar with either the DOACROSS or the X3H5 set of directives, except that X3H5 has no REDUCTION attribute, so you would have to code the reduction yourself.
Program execution begins as a single process. This initial process executes serially, so we can set up our problem in a standard sequential manner, reading input and writing to stdout as necessary. When we first encounter a PARALLEL construct, in this case a PARALLEL DO, a team of one or more processes is formed, and the data environment for each team member is created. The data environment here consists of one PRIVATE variable, x, one REDUCTION variable, sum, and one SHARED variable, w. All references to x and sum inside the parallel region address private, non-shared copies. The REDUCTION attribute takes an operator; at the end of the parallel region the private copies are reduced to the master copy using that operator. All references to w in the parallel region address the single master copy. The loop index variable, i, is PRIVATE by default. The compiler takes care of assigning the appropriate iterations to the individual team members, so in parallelizing this loop you don't even need to know how many processors you will run it on.
Figure 1: Computing PI in parallel using OpenMP.
Within the parallel region there may be additional control and synchronization constructs, but there are none in this simple example. The parallel region here terminates with the END DO which has an implied barrier. On exit of the parallel region, the initial process resumes execution using its updated data environment. In this case the only change to the master's data environment is in the reduced value of sum.
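As a rough C analogue of the loop described above (Figure 1 itself is in Fortran), the following sketch uses the same data environment: x is private, sum is a reduction variable, and w is shared. The function name compute_pi_omp, the midpoint rule, and the integrand 4/(1+x^2) are assumptions made for illustration; the text only says that the example integrates over a number of intervals.

```c
#include <assert.h>

/* Hypothetical helper: midpoint-rule estimate of pi as the integral of
   4/(1+x^2) over [0,1], using n intervals. The integrand and the rule
   are assumptions, not taken from Figure 1. */
double compute_pi_omp(int n)
{
    assert(n > 0);

    double w = 1.0 / n;  /* SHARED: interval width, only read in the loop  */
    double sum = 0.0;    /* REDUCTION(+): one private copy per team member */
    double x;            /* PRIVATE: scratch value for each team member    */
    int i;               /* loop index, private by default                 */

    /* Fork: a team is formed and the iterations are divided among it. */
    #pragma omp parallel for private(x) shared(w) reduction(+:sum)
    for (i = 1; i <= n; i++) {
        x = w * (i - 0.5);            /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    /* Join: implied barrier; the private sums have been combined. */

    return w * sum;
}
```

Nothing in the function refers to the number of processors. With a compiler that supports OpenMP (for example GCC with -fopenmp) the loop runs in parallel; compilers that do not recognize the pragma typically ignore it, and the loop then runs serially with the same result.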
This model of execution is referred to as the fork/join model. Throughout the course of a program, the initial process may fork and join a number of times. The fork/join execution model makes it easy to get loop level parallelism out of a sequential program. Unlike in message passing, where the program must be completely decomposed for parallel execution, in a shared memory model it is possible to parallelize just at the loop level without decomposing the data structures. Given a working sequential program, it becomes fairly straightforward to parallelize individual loops in an incremental fashion and thereby immediately realize the performance advantages of a multiprocessor system.
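To make the incremental style concrete, here is a minimal sketch in C of a routine in which two loops have been parallelized independently while the code between them stays serial. The routine center and its arguments are hypothetical and are not part of the paper's example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical routine: subtract the mean from every element of a[0..n-1].
   The initial process forks and joins twice, once per parallel loop. */
void center(double *a, int n)
{
    assert(a != NULL && n > 0);

    double sum = 0.0;
    int i;

    /* First fork/join: parallel reduction over the array. */
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i];

    /* Serial section: only the initial process runs here. */
    double mean = sum / n;

    /* Second fork/join: independent updates, no reduction needed. */
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        a[i] -= mean;
}
```

Either directive can be added or removed without touching the other loop or the layout of the array, which is the sense in which loops can be parallelized one at a time.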
For comparison with message passing, Figure 2 presents the same code example using MPI. Clearly there is additional complexity just in setting up the problem, for the simple reason that here we begin outright with a team of parallel processes. Consequently we need to isolate a root process to read input and write to stdout. Since there is no globally shared data, we have to explicitly broadcast the input parameters (in this case, the number of intervals for the integration) to all the processors. Furthermore, we have to explicitly manage the loop bounds. This requires identifying each processor (myid) and knowing how many processors will be used to execute the loop (numprocs). When we finally get to the loop, we can only sum into our private value for mypi. To reduce across processors we use the MPI_REDUCE routine and sum into pi. Note that the storage for pi is replicated across all processors, even though only the root process needs it. As a general rule, message passing programs are more wasteful of storage than shared memory programs [4]. Finally we can print out the result, again making sure to isolate just one process for this in order to avoid printing out numprocs messages.
Figure 2: Computing PI in parallel using MPI.
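The loop-bound bookkeeping described above can be isolated from the MPI calls. The sketch below, in C, shows what each process would compute given its myid and numprocs. It assumes a cyclic distribution of iterations and the integrand 4/(1+x^2); neither is confirmed by the text. The helper names partial_pi and simulate_reduce are hypothetical, and simulate_reduce is only a serial stand-in for the sum that MPI_REDUCE would deliver to the root process. No MPI routine is called here.

```c
#include <assert.h>

/* Hypothetical helper: the partial result (mypi) that process myid
   computes when n intervals are dealt out cyclically to numprocs
   processes. Iterations myid+1, myid+1+numprocs, ... belong to myid. */
double partial_pi(int n, int myid, int numprocs)
{
    assert(n > 0 && numprocs > 0 && myid >= 0 && myid < numprocs);

    double w = 1.0 / n;
    double mypi = 0.0;   /* private to this process */

    for (int i = myid + 1; i <= n; i += numprocs) {
        double x = w * (i - 0.5);
        mypi += 4.0 / (1.0 + x * x);
    }
    return w * mypi;
}

/* Serial stand-in for the reduction: adds the partial results of all
   ranks, which is what a sum reduction would place in pi on the root. */
double simulate_reduce(int n, int numprocs)
{
    double pi = 0.0;
    for (int rank = 0; rank < numprocs; rank++)
        pi += partial_pi(n, rank, numprocs);
    return pi;
}
```

Unlike the shared memory version, the loop header itself depends on myid and numprocs, so the decomposition is part of the program text rather than something left to the compiler.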
Finally, it is interesting to see how this example would look using pthreads. Figure 3 presents the pthreads version for computing PI. Naturally it's written in C, but we can still compare functionality with the Fortran examples given in Figures 1 and 2.

Figure 3: Computing PI in parallel using pthreads.
Clearly the pthreads version is much more complex than either the OpenMP or the MPI versions of the code. The reasons are twofold. Firstly, pthreads is aimed at providing task parallelism, whereas the example is one of data parallelism, i.e. parallelizing a loop. It is evident from the example why pthreads has not been widely used for scientific applications. Secondly, pthreads is somewhat lower level than we need, even in a task or threads based model. This becomes clearer as we go through the example.
As with the MPI version, we need to know how many threads will be executing the loop and be able to determine their IDs so we can manage the loop bounds. We get the number of threads as a command line argument, and then use it to allocate an array of thread IDs. At this time we also initialize a lock, reduction_mutex, which we'll need for reducing our partial sums into a global sum for PI. The basic approach is then to start a worker thread, PIworker, for every processor we want to work on the loop. In PIworker we first compute a zero-based thread ID and use this to map the loop iterations. The loop then computes the partial sums into mypi. We add these into the global result pi, making sure to protect against a race condition by locking. Finally, since there are no implied barriers on threads, we need to explicitly join all our threads before we can print out the result of the integration.
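The steps above can be sketched in C as follows. This is a condensed illustration, not the listing of Figure 3. The names PIworker, mypi, and reduction_mutex follow the text; pi_total, n_intervals, n_threads, and compute_pi_pthreads are hypothetical. Two simplifications are assumed: the thread count is a function argument rather than a command line argument, and each worker receives its zero-based ID through its argument instead of deriving it from the array of thread IDs. The cyclic mapping of iterations and the integrand 4/(1+x^2) are also assumptions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static double pi_total;                  /* global, so shared by all threads */
static int n_intervals, n_threads;       /* shared, only read by the workers */
static pthread_mutex_t reduction_mutex;  /* protects pi_total */

static void *PIworker(void *arg)
{
    int myid = *(int *)arg;              /* zero-based thread ID */
    double w = 1.0 / n_intervals;
    double mypi = 0.0;                   /* automatic, so thread private */

    /* Map iterations to this thread (cyclic mapping is an assumption). */
    for (int i = myid + 1; i <= n_intervals; i += n_threads) {
        double x = w * (i - 0.5);
        mypi += 4.0 / (1.0 + x * x);
    }
    mypi *= w;

    /* Reduce into the shared result; the lock prevents a race. */
    pthread_mutex_lock(&reduction_mutex);
    pi_total += mypi;
    pthread_mutex_unlock(&reduction_mutex);
    return NULL;
}

double compute_pi_pthreads(int n, int nthreads)
{
    assert(n > 0 && nthreads > 0);

    pthread_t *tids = malloc(nthreads * sizeof *tids);
    int *ids = malloc(nthreads * sizeof *ids);
    assert(tids != NULL && ids != NULL);

    n_intervals = n;
    n_threads = nthreads;
    pi_total = 0.0;
    pthread_mutex_init(&reduction_mutex, NULL);

    /* Start one worker per requested thread. */
    for (int t = 0; t < nthreads; t++) {
        ids[t] = t;
        int rc = pthread_create(&tids[t], NULL, PIworker, &ids[t]);
        assert(rc == 0);
    }

    /* No implied barrier: join every thread before reading pi_total. */
    for (int t = 0; t < nthreads; t++)
        pthread_join(tids[t], NULL);

    pthread_mutex_destroy(&reduction_mutex);
    free(ids);
    free(tids);
    return pi_total;
}
```

Even in this condensed form, thread creation, iteration mapping, locking, and joining are all written by hand. In the OpenMP version they are implied by a single directive.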
Notice in this example that all the data scoping is implicit: global variables are shared and automatic variables are thread private. There is no simple mechanism in pthreads for making global variables private. Implicit scoping is also more awkward in Fortran because the language is not as strongly scoped as C. We won't discuss pthreads further in this paper since the limitations of that model for expressing data parallelism should now be abundantly clear.
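For reference, the closest facility POSIX threads offers for per-thread copies of a global is thread-specific data, which is managed through explicit keys. The sketch below uses the standard calls pthread_key_create, pthread_setspecific, and pthread_getspecific. The names scratch_key, results, worker, and run_tsd_demo are hypothetical and appear nowhere in the paper; the example is only meant to show how much bookkeeping one private value requires.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t scratch_key;  /* one global key, one value per thread */
static int results[2];             /* shared; each thread writes its own slot */

static void *worker(void *arg)
{
    int myid = *(int *)arg;

    /* Each thread allocates and registers its own private copy. */
    int *scratch = malloc(sizeof *scratch);
    assert(scratch != NULL);
    *scratch = 100 + myid;
    pthread_setspecific(scratch_key, scratch);

    /* Later, possibly in another function, the thread looks up its copy. */
    int *mine = pthread_getspecific(scratch_key);
    results[myid] = *mine;
    return NULL;
}

void run_tsd_demo(void)
{
    pthread_t tids[2];
    int ids[2] = {0, 1};

    /* free is registered as the destructor, so each thread's copy is
       released when that thread exits. */
    pthread_key_create(&scratch_key, free);

    for (int t = 0; t < 2; t++) {
        int rc = pthread_create(&tids[t], NULL, worker, &ids[t]);
        assert(rc == 0);
    }
    for (int t = 0; t < 2; t++)
        pthread_join(tids[t], NULL);

    pthread_key_delete(scratch_key);
}
```

By comparison, the OpenMP version expresses the same intent by listing the variable in a PRIVATE attribute.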