Summary, OpenMP Programming Contest, SC’2005
Larry Meadows
January, 2006
In 2005 the OpenMP ARB agreed to sponsor a programming contest to promote the usage of OpenMP. Here is the announcement:
http://www.openmp.org/drupal/sc05/omp-contest.htm
The DARPA HPCS SSCA#2 benchmark was chosen because it appeared to be a challenging problem, and also had two existing shared-memory implementations and a reasonable set of criteria for judging correctness.
The two existing implementations are detailed in the above link. One was created by David Bader and Kamesh Madduri using Bader’s SIMPLE library; in this case, the pthreads version of the library was used. The other implementation was created by John Feo of Cray, using the Cray/Tera MTA directives, and so was restricted to execute only on an MTA system.
Originally, the Bader/Madduri implementation was posted as part of the contest challenge. Minor effort was required to port the program to run on a 4 processor shared memory Itanium® system (the reference platform). The runtimes are included as part of the above link. The intent of this port was simply to provide a reference for the contestants.
Later, the Feo code was modified to compile on a non-MTA system, with no parallelism, in order to provide a serial implementation that the contestants could study.
The intent was that three fairly substantial prizes would be awarded, with the judging criteria being half Performance and Scalability and half Usage of OpenMP. A team of judges was drafted:
Tim Mattson, Intel Corporation
John Feo, Cray
Mark Bull,
Henry Jin,
Six contest submissions were received. All submittals were compiled
and executed to attempt to judge correctness and performance. The judges met at
SC’05 in
After reviewing all the submissions, the judges decided that only one submission was worthy of a prize. So, only the first prize was awarded.
We learned several lessons from this contest. They were:
We promise to do better next time!
The remainder of this document details each submission and the reasons that it was accepted or rejected.
The following are the principal contacts for the various submissions. I have chosen not to identify the particular submissions by the submitter. Two of the submitters have agreed to have their code available on the website: OpenMPContest.tgz and SSCA2-OMP.tar.gz. You may contact the other submitters directly if you wish to see a copy of their code. Note: don’t assume that the submitters are in the same order as the submissions.
Priya Unnikrishnan, IBM, priyau at ca dot ibm dot com
Eugene Loh,
Arun Kejariwal, UC Irvine, arun_kejariwal at ieee dot org
David Coakley, coakley at nwlink dot com
Lei Huang, leihuang at cs dot uh dot edu <code>
Ayon Basumallik,
Submission #1 is based on the serial version of the Feo code. The Feo code uses MTA fetch-and-add and full/empty bits for synchronization. The serial port of the code purposely created macros for these operations that were not suitable for parallel execution. The authors of the code did not alter these macros; they simply replaced the MTA parallelization directives with similar OpenMP directives. The code was clearly incorrect, because it did not contain the required synchronization.
Submission #2 is based on the serial version of the Feo code. The authors of this code paid attention to the requirements for synchronization. The OpenMP atomic directive was used in many places, as was the critical directive. However, there was a race condition in one part of the code (in the computeGraph subroutine where the neighbors array is filled it) that caused the code to be disqualified..
Submission #3 is based on the serial version of the Feo code. However, the majority of the code was rewritten to use C++ STL and other C++ features. The judges noted that the C++ STL is not guaranteed to be thread-safe. Also, the code did not scale beyond two processors. Note, however, that the author of the code asserted that the STL version used in his validation was in fact guaranteed to be thread safe. The author was unable to run the code on a system with large memory and more than two processors. It was rejected because it failed to scale, and because of the STL thread-safety issue.
Submission #4 is based on the Bader/Madduri code. Attention was paid to synchronization issues. However, this code failed to execute on the dataset size that was used to validate the other codes, so it was rejected. Also, the timing functions were called inside of an OpenMP single region, which is incorrect (since the region might be executed by a different processor each time).
Submission #5 is unique. It was written from scratch based on the written specification of the SSCA#2 benchmark. It is written in Fortran 90. The code compiled and executed immediately on the reference platform. The code is well documented. It performs well and scales well. The judges unanimously agreed that this code was the clear winner.
Submission #6 is based on the Bader/Madduri code. It failed to execute correctly on the reference platform (it aborted). On inspection, one of the judges found a problem where the omp_get_num_threads() was called incorrectly (outside of a parallel region); another similar problem was also found. The code was therefore rejected.