Summary, OpenMP Programming Contest, SC’2005

Larry Meadows

January, 2006

 

In 2005 the OpenMP ARB agreed to sponsor a programming contest to promote the usage of OpenMP. Here is the announcement:

 

http://www.openmp.org/drupal/sc05/omp-contest.htm

 

The DARPA HPCS SSCA#2 benchmark was chosen because it appeared to be a challenging problem, and also had two existing shared-memory implementations and a reasonable set of criteria for judging correctness.

 

The two existing implementations are detailed in the above link. One was created by David Bader and Kamesh Madduri using Bader’s SIMPLE library; in this case, the pthreads version of the library was used. The other implementation was created by John Feo of Cray, using the Cray/Tera MTA directives, and so was restricted to execute only on an MTA system.

 

Originally, the Bader/Madduri implementation was posted as part of the contest challenge. Minor effort was required to port the program to run on a 4 processor shared memory Itanium® system (the reference platform). The runtimes are included as part of the above link. The intent of this port was simply to provide a reference for the contestants.

 

Later, the Feo code was modified to compile on a non-MTA system, with no parallelism, in order to provide a serial implementation that the contestants could study.

 

The intent was that three fairly substantial prizes would be awarded, with the judging criteria being half Performance and Scalability and half Usage of OpenMP.  A team of judges was drafted:

 

            Tim Mattson, Intel Corporation

            Kamesh Madduri, Georgia Tech

            John Feo, Cray

            Mark Bull, Edinburgh Parallel Computing Centre

            Henry Jin, NASA Ames Research Center

 

Six contest submissions were received. All submittals were compiled and executed to attempt to judge correctness and performance. The judges met at SC’05 in Seattle to discuss the submissions. The judges decided that any correctness errors in the code would automatically disqualify an entry.

 

After reviewing all the submissions, the judges decided that only one submission was worthy of a prize. So, only the first prize was awarded.

 

We learned several lessons from this contest. They were:

 

  • Timeliness. The contest was announced far too late. The reference implementations were even later. The submittal deadline was too late. Contestants had insufficient time to submit their entries, and judges had insufficient time to evaluate the entries.
  • Benchmark Specification. The specification of SSCA#2 was vague. It was difficult to determine correctness. The latitude in implementation was unclear (for example, could the graph be sorted in the generation phase). Future contest rules will include verification criteria.
  • Performance Measurement. It wasn’t clear how to measure performance. Should all codes be executed on a single platform, and measured by wall clock time? A reasonable measure, but it wasn’t clear that the codes were computing the same problem.
  • Access to systems. Many contestants had limited system access and were not able to measure scalability either in time or memory usage.
  • Judging criteria. It was not clear whether an entry should be implemented from scratch, or build off of one of the reference implementations.

 

We promise to do better next time!

 

The remainder of this document details each submission and the reasons that it was accepted or rejected.

 

The following are the principal contacts for the various submissions. I have chosen not to identify the particular submissions by the submitter. Two of the submitters have agreed to have their code available on the website: OpenMPContest.tgz and SSCA2-OMP.tar.gz. You may contact the other submitters directly if you wish to see a copy of their code. Note: don’t assume that the submitters are in the same order as the submissions.

 

The submitters:

 

Priya Unnikrishnan, IBM, priyau at ca dot ibm dot com

Eugene Loh, Eugene dot loh at sun dot com

Arun Kejariwal, UC Irvine, arun_kejariwal at ieee dot org

David Coakley, coakley at nwlink dot com

Lei Huang, leihuang at cs dot uh dot edu <code>

Ayon Basumallik, Purdue University, basumall at purdue dot edu <code>

 

The submissions:

 

Submission #1

 

Submission #1 is based on the serial version of the Feo code. The Feo code uses MTA fetch-and-add and full/empty bits for synchronization. The serial port of the code purposely created macros for these operations that were not suitable for parallel execution. The authors of the code did not alter these macros; they simply replaced the MTA parallelization directives with similar OpenMP directives. The code was clearly incorrect, because it did not contain the required synchronization.

 

Submission #2:

 

Submission #2 is based on the serial version of the Feo code. The authors of this code paid attention to the requirements for synchronization. The OpenMP atomic directive was used in many places, as was the critical directive. However, there was a race condition in one part of the code (in the computeGraph subroutine where the neighbors array is filled it) that caused the code to be disqualified..

 

Submission #3:

 

Submission #3 is based on the serial version of the Feo code. However, the majority of the code was rewritten to use C++ STL and other C++ features. The judges noted that the C++ STL is not guaranteed to be thread-safe. Also, the code did not scale beyond two processors. Note, however, that the author of the code asserted that the STL version used in his validation was in fact guaranteed to be thread safe. The author was unable to run the code on a system with large memory and more than two processors. It was rejected because it failed to scale, and because of the STL thread-safety issue.

 

Submission #4:

Submission #4 is based on the Bader/Madduri code. Attention was paid to synchronization issues. However, this code failed to execute on the dataset size that was used to validate the other codes, so it was rejected. Also, the timing functions were called inside of an OpenMP single region, which is incorrect (since the region might be executed by a different processor each time).

 

Submission #5:

 

Submission #5 is unique. It was written from scratch based on the written specification of the SSCA#2 benchmark. It is written in Fortran 90. The code compiled and executed immediately on the reference platform. The code is well documented. It performs well and scales well. The judges unanimously agreed that this code was the clear winner.

 

Submission #6:

Submission #6 is based on the Bader/Madduri code. It failed to execute correctly on the reference platform (it aborted). On inspection, one of the judges found a problem where the omp_get_num_threads() was called incorrectly (outside of a parallel region); another similar problem was also found. The code was therefore rejected.