
A User’s Experience with Parallel Sorting and OpenMP

Michael Süß and Claudia Leopold


University of Kassel
Research Group Programming Languages / Methodologies
Wilhelmshöher Allee 73, D-34121 Kassel, Germany
{msuess, leopold}@uni-kassel.de

Abstract
Some algorithmic patterns are difficult to express in
OpenMP. In this paper, we use a simple sorting algorithm
to illustrate problems with recursion and the avoidance of
busy waiting. We compare several solution approaches with
respect to programming expense and performance: stacks,
nesting and a workqueue (for recursion), as well as condition
variables and the sched_yield function (for busy waiting).
Enhancements of the OpenMP specification are suggested
where we saw the need. Performance measurements are
included to back up our claims.

1. Introduction

Parallel programming is still a challenging task these days. Although there are many powerful parallel programming systems, most of them operate on a relatively low abstraction level (e.g. POSIX threads, MPI). The OpenMP specification [2] promised advances in this regard and, since its introduction in 1997, has provided a relatively smooth way to incrementally parallelize existing programs, as well as to write powerful new applications on a high abstraction level. Its portability and vendor acceptance quickly helped the system to become a de facto standard for programming shared memory parallel machines.

Nevertheless, OpenMP is not without problems and rough edges, which version 2.0 of the specification was not able to straighten out fully. This paper uses a simple sorting algorithm to show some of these, as well as a couple of techniques developed to work around them. All examples were written in the C++ programming language. Some suggestions for future enhancements to the specification are included.

The next section gives a short summary of the sorting algorithm used. Section 3 describes the two problematic areas that we encountered in the course of our work: recursion and busy waiting. Figure 1 depicts a diagram that illustrates the history of the different program versions we wrote to solve these problems (some less important versions are left out of the diagram, causing missing entries in the numbering scheme). Three different solution strategies are presented for recursion:

• an iterative approach (sort_omp_1.0)
• an advanced iterative version using local stacks (sort_omp_2.0)
• nested parallelism (sort_omp_nested_1.0)
• a workqueue (sort_omp_taskq_1.0)

[Figure 1. Revision tree of Sorting Program]

The problem of busy waiting is approached with solutions borrowed from POSIX threads: condition variables (sort_pthreads_cv_1.0) and the sched_yield function (sort_pthreads_yield_1.0). The ideas for extensions to the OpenMP specification presented in section 3 are the addition of an omp_get_num_all_threads function, condition variables, and an omp_yield function. Note that at this point only data points are given for the inclusion of these extensions into the specification, as this paper merely describes our experiences with a certain problem area and does not include a complete proposal. Section 4 discusses the performance of our solutions. In section 5, we sum up our results and discuss further perspectives.

2. Sorting

Sorting data has always been one of the key problems of computer science. Many sequential algorithms have been suggested, of which Quicksort, invented and described by Hoare [4], is one of the most popular. It works recursively, using the divide and conquer principle:

1. Choose a pivot element, usually by just picking the last element out of the sorting area.

2. Iterate through the elements to be sorted, moving numbers smaller than the pivot to a position on its left, and numbers larger than the pivot to a position on its right, by swapping elements. After that, the sorting area is divided into two subareas: the left one contains all numbers smaller than the pivot element, and the right one contains all numbers larger than the pivot element.

3. Go to step 1 for the left and right subareas (if there is more than one element left in that area).

The algorithm has an average time complexity of O(n · log(n)) and a worst case time complexity of O(n²), making it one of the fastest sequential sorting algorithms available. Since we do not aim at producing the fastest parallel sorting algorithm ever, but merely try to show some problems and solutions with OpenMP, we chose this easy and widely used algorithm as the basis for our experiments, instead of a more advanced and complex (sequential or parallel) one. Previous experiments (although with a different focus) with quicksort and OpenMP have been conducted by Parikh [6].

template < typename T >
void myQuickSort(std::vector<T> &myVec, int q, int r)
{
    T pivot;
    int i, j;

    /* only small segment of numbers left -> use insertion sort! */
    if (r - q < SWITCH_THRESH) {
        myInsertSort(myVec, q, r);
        return;
    }

    /* choose pivot, initialize borders */
    pivot = myVec[r];
    i = q - 1;
    j = r;

    /* partition step, which moves smaller numbers to the left
       and larger numbers to the right of the pivot */
    while (true) {
        while (myVec[++i] < pivot);
        while (myVec[--j] > pivot);
        if (i >= j) break;
        std::swap(myVec[i], myVec[j]);
    }
    std::swap(myVec[i], myVec[r]);

    /* recursively call yourself with new subsegments,
       i is index of pivot */
    myQuickSort(myVec, q, i - 1);
    myQuickSort(myVec, i + 1, r);
}

Figure 2. Base version of the quicksort algorithm

3. A simple sorting algorithm and its problems

The base version for our tests was the simple quicksort algorithm described above, combined with insertion sort for small sorting areas. This algorithm (sort_seq_1.5) is sketched in Figure 2. We tried to speed up the algorithm through advanced (sequential) sorting techniques, but the performance gain was not worth the loss in simplicity.

Using this algorithm, we started our experiments with OpenMP, and soon after were confronted with our first problem: recursion.

3.1. Problem 1: Recursion

There is no easy and intuitive way to deal with recursion in OpenMP (yet), as the basic worksharing constructs provided by the specification (for and sections) do not seem to be well suited for recursive function calls. Our first solution to the problem involved getting rid of the recursion, as already suggested by an Mey [1]. It is a widely known fact that, by using a stack, every recursion can be changed into an iteration, and that is exactly what we did in sort_seq_1.6. This step was one of the most time consuming of all, as it involved the introduction of a new data structure, specially tailored to our problem. We called this structure globalTodoStack; it stores the intervals still to be sorted.

This program was later parallelized into sort_omp_1.0, making the following changes:

• a parallel region was added in main, around the first call of the myQuickSort function

• a call to omp_get_thread_num was added, so that only one thread initially executes the myQuickSort function; all others wait until work for them is put on the globalTodoStack
• all accesses to shared variables (especially the globalTodoStack) were protected by critical sections, to prevent multiple threads from accessing the variables at the same time

The resulting program is sketched in Figure 3.

template < typename T >
void myQuickSort(std::vector<T> &myVec, int q, int r,
                 std::stack<std::pair<int, int> > &globalTodoStack,
                 int &numBusyThreads, const int numThreads)
{
    bool idle = true;

    /* Skipped: Initialisation */

    while (true) {
        /* only small segment of numbers left -> use insertion sort! */
        if (r - q < SWITCH_THRESH) {
            myInsertSort(myVec, q, r);
            /* and mark the region as sorted, by setting q to r */
            q = r;
        }

        while (q >= r) { /* Thread needs new work */
            /* only one thread at a time should access the
               globalTodoStack, numBusyThreads and idle variables */
            #pragma omp critical
            {
                /* something left on the global stack to do? */
                if (false == globalTodoStack.empty()) {
                    if (true == idle) ++numBusyThreads;
                    idle = false;
                    /* Skipped: Pop a new segment off the stack */
                } else {
                    if (false == idle) --numBusyThreads;
                    idle = true;
                }
            }

            /* if all threads are done, break out of this function.
               note that the value of numBusyThreads is current, as there
               is a flush implied at the end of the last critical section */
            if (numBusyThreads == 0) {
                return;
            }
        }

        /* Skipped: choose pivot and do partitioning step */

        #pragma omp critical
        {
            globalTodoStack.push(std::make_pair(q, i - 1));
        }

        /* iteratively sort elements right of pivot */
        q = i + 1;
    }
}

int main(int argc, char *argv[])
{
    /* Skipped: Program Initialisation */

    #pragma omp parallel shared(myVec, \
            globalTodoStack, numThreads, numBusyThreads)
    {
        /* start sorting with one thread, the others wait
           for the stack to fill up */
        if (0 == omp_get_thread_num()) {
            myQuickSort(myVec, 0, myVec.size() - 1,
                        globalTodoStack, numBusyThreads, numThreads);
        } else {
            myQuickSort(myVec, 0, 0,
                        globalTodoStack, numBusyThreads, numThreads);
        }
    }

    /* Skipped: Tests and Program output */
}

Figure 3. Parallel part of sort_omp_1.0

In sort_omp_2.0, the global stack was complemented with one local stack per thread. All new segments to be sorted are put on the local stacks by default. Only when one of the local stacks is empty does it need to communicate with the global stack and poll for new work there. It works the other way around as well: when the global stack is getting empty, new work is pushed onto it from a local stack. This modification led to a significant performance gain, because the need for synchronisation via critical sections dropped considerably. The effect is most visible for high numbers of working threads, because that is when a lot of work sharing needs to be done.

Our second solution to the recursion problem (sort_omp_nested_1.0) involves nested parallelism, as illustrated in Figure 4. There are a few problems with this approach, though. First of all, the OpenMP specification allows compilers to serialize nested parallel directives, and many still do so. With these compilers, the version will not achieve considerable speedups; performance portability is not granted. Furthermore, nesting support in the specification seems to be a little immature. In particular, the omp_get_num_threads function is useless for our example, as it always returns two threads per parallel region. The value we were really interested in is the number of all running threads (to limit the creation of new threads as we dive deeper into the recursion). Since this number is not available through a simple function call, we needed to track it ourselves, thereby introducing new (and performance hindering) critical sections into the code. Perhaps this could be taken care of by the introduction of an omp_get_num_all_threads function call, or some similar mechanism, into the next iteration of the specification.

The third solution to the recursion problem was first suggested by Shah et al. [7]. It involves usage of the workqueuing model (sort_omp_taskq_1.0, depicted in Figure 5). This model has been proposed as an OpenMP extension, but has not been accepted yet. To the authors' knowledge, only two compilers understand the new pragmas, and so this solution has the drawback that the code presented here is not portable. This might change quickly, though, if the workqueuing proposal is accepted. Except for this drawback, the solution is easy and elegant. This version lacks the ability to add local queues (similar to the local stacks that brought a performance improvement in sort_omp_2.0), but a smart compiler might recognize this performance potential and insert them automatically.
template < typename T >
void myQuickSort(std::vector<T> &myVec, int q,
                 int r, int &numBusyThreads, const int numThreads)
{
    /* Skipped: Initialisation + Partitioning step */

    /* do not nest, if there are too many threads already */
    if (numBusyThreads >= numThreads) {
        myQuickSort(myVec, q, i - 1, numBusyThreads, numThreads);
        myQuickSort(myVec, i + 1, r, numBusyThreads, numThreads);
    } else {
        #pragma omp atomic
        numBusyThreads += 2;

        #pragma omp parallel shared(myVec, numThreads, \
                numBusyThreads, q, i, r)
        {
            #pragma omp sections nowait
            {
                #pragma omp section
                {
                    myQuickSort(myVec, q, i - 1,
                                numBusyThreads, numThreads);
                    #pragma omp atomic
                    numBusyThreads--;
                }
                #pragma omp section
                {
                    myQuickSort(myVec, i + 1, r,
                                numBusyThreads, numThreads);
                    #pragma omp atomic
                    numBusyThreads--;
                }
            }
        }
    }
}

Figure 4. Parallel part of sort_omp_nested_1.0

template < typename T >
void myQuickSort(std::vector<T> &myVec, int q, int r)
{
    /* Skipped: Initialisation + Partitioning step */

    #pragma omp taskq
    {
        #pragma omp task
        {
            myQuickSort(myVec, q, i - 1);
        }
        #pragma omp task
        {
            myQuickSort(myVec, i + 1, r);
        }
    }
}

int main(int argc, char *argv[])
{
    /* Skipped: Program Initialisation */

    #pragma omp parallel shared(myVec)
    {
        #pragma omp taskq
        {
            #pragma omp task
            {
                myQuickSort(myVec, 0, myVec.size() - 1);
            }
        }
    }

    /* Skipped: Tests and Program output */
}

Figure 5. Parallel part of sort_omp_taskq_1.0

Having solved the recursion problem, the next one appeared: busy waiting.

3.2. Problem 2: Busy Waiting

When a thread has nothing to do in our sorting application (because both its local stack and the global stack are presently empty, e.g. at the beginning), this does not mean that it is allowed to quit. New tasks can be put on the global stack at any time, and should of course be processed as soon as possible. The only way this can be accomplished with OpenMP is busy waiting, which means that the thread is constantly polling for work, wasting processor cycles that could be better spent in another thread.

For possible solutions, one need only look as far as the POSIX threads standard. A synchronisation primitive called condition variable is implemented there; it solves the problem fully, if in a way that is quite hard to understand when one is looking at it through the eyes of a beginner to parallel programming. Butenhof describes condition variables like this:

    A condition variable is a "signaling mechanism" associated with a mutex and by extension is also associated with the shared data protected by the mutex. Waiting on a condition variable atomically releases the associated mutex and waits until another thread signals (to wake up one waiter) or broadcasts (to wake all waiters) the condition variable. The mutex must always be locked when you wait on a condition variable and, when a thread wakes up from a condition variable wait, it always resumes with the mutex locked. ([3, p. 72])

The introduction of this concept into the OpenMP specification has already been suggested by Lu et al. [5], and we would also like to encourage the inclusion of this or a similar mechanism into the specification. To illustrate the savings possible when using condition variables, we ported our sorting application to POSIX threads (staying as close to the original version as possible). The resulting program is sort_pthreads_1.0, which should perform about equal to sort_omp_2.0 (see section 4 for details). Then, sort_pthreads_1.0 was enhanced with condition variables (sort_pthreads_cv_1.0). Every time a thread finishes its work and finds nothing else to do on the stacks, it puts itself to sleep, and is woken up by another thread only when there is new work to be done. This may lead to a significant performance gain on a heavily loaded machine.
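To make the idiom concrete, the following self-contained sketch (our illustration in C++11, using std::condition_variable rather than the raw Pthreads calls of sort_pthreads_cv_1.0, whose source the paper does not reproduce) shows an idle worker that sleeps until work arrives instead of polling; all names are ours:

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

std::mutex m;
std::condition_variable cv;
std::queue<int> todo;        // stand-in for the shared work stack
bool done = false;
long long sum = 0;           // result accumulated by the worker

void worker() {
    std::unique_lock<std::mutex> lock(m);
    while (true) {
        // wait() atomically releases the mutex and sleeps: no busy waiting.
        cv.wait(lock, [] { return !todo.empty() || done; });
        if (todo.empty()) return;    // done was set and no work is left
        sum += todo.front();
        todo.pop();
    }
}

long long runDemo() {
    std::thread t(worker);
    for (int i = 1; i <= 100; ++i) {
        { std::lock_guard<std::mutex> lock(m); todo.push(i); }
        cv.notify_one();             // wake the sleeping worker
    }
    { std::lock_guard<std::mutex> lock(m); done = true; }
    cv.notify_one();
    t.join();
    return sum;                      // worker drains the queue before exiting
}
```

The wait call atomically releases the mutex and blocks, exactly as in the Butenhof quotation above; notify_one corresponds to pthread_cond_signal, and notify_all to pthread_cond_broadcast.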
A second and somewhat easier solution to the problem of busy waiting can be observed in POSIX.1b (realtime extensions). This standard defines a function sched_yield, which puts the calling thread at the end of the ready queue of the operating system scheduler and selects a new thread to run. If there is no other thread waiting, the function returns immediately. The same could be done with a new function omp_yield in OpenMP. Whenever a thread runs out of work in our example program and the stacks are empty, it calls the suggested function. If other threads are waiting to be processed (which might not be out of work yet), these get a chance to run and produce more work for all idle threads. Though less powerful than condition variables, our experiments with sort_pthreads_yield_1.0 (a version incorporating sched_yield) suggest that this function is able to reduce busy waiting under heavy load for our problem as well (see Table 3).

4. Performance results

Performance tests were carried out on an otherwise unloaded node of an AMD Opteron 848 class computer with 4 processors at 2.2 GHz, located at the RWTH Aachen. Programs were compiled with the Intel C++ Compiler 8.1, using the -O3 -openmp compiler options. Further experiments were carried out on an otherwise unloaded node of a Sun Fire 6800 class computer with a maximum of 8 UltraSPARC III processors at 900 MHz, also located at the RWTH Aachen. Here, the Guide compiler by Kuck & Assoc. Inc. (KAI) was used with the options -fast --backend -xchip=ultra3cu --backend -xcache=64/32/4:8192/512/2 --backend -xarch=v8plusb.

Tables 1 and 2 show wall-clock times in seconds for all versions of our sorting program, with different numbers of threads. Only the time needed to actually perform the sorting algorithms was measured. All experiments were repeated at least three times, each time sorting 100 million random integers. For Tables 1 and 2, the best time achieved in each test was chosen.

Program                    1 Th.   2 Th.   4 Th.
sort_seq_1.5               23.8    23.8    23.8
sort_seq_1.6               23.6    23.6    23.6
sort_omp_1.0               24.0    13.7     8.1
sort_omp_2.0               24.3    12.6     7.5
sort_omp_nested_1.0        23.9    21.4    12.4
sort_omp_taskq_1.0         29.8    16.3     9.1
sort_pthreads_1.0          24.0    12.7     7.5
sort_pthreads_cv_1.0       24.8    12.9     7.6
sort_pthreads_yield_1.0    24.5    12.7     7.5

Table 1. Wall-clock time for sorting 100 million integers on an AMD Opteron 2200 in seconds

Program                    1 Th.   2 Th.   4 Th.   8 Th.
sort_seq_1.5               36.8    36.8    36.8    36.8
sort_seq_1.6               37.4    37.4    37.4    37.4
sort_omp_1.0               38.2    23.9    15.7    11.0
sort_omp_2.0               37.9    21.4    13.1    10.0
sort_omp_nested_1.0        43.4    25.2    25.2    25.2
sort_omp_taskq_1.0         n.a.    n.a.    n.a.    n.a.
sort_pthreads_1.0          37.6    21.2    13.3    10.1
sort_pthreads_cv_1.0       37.2    20.3    13.3     9.6
sort_pthreads_yield_1.0    37.2    22.1    14.2    10.9

Table 2. Wall-clock time for sorting 100 million integers on a Sun Fire 6800 in seconds

Table 1 shows good speedups for all parallel program versions. The best performing solutions are sort_omp_2.0 and the programs using Pthreads. Program sort_omp_2.0 outperforms sort_omp_1.0, which shows the relevance of the local stacks. The programs with nesting and the workqueue are slower than the iterative programs, but this might be due to the relative immaturity of both options in the specification. The results for sort_omp_nested_1.0 are to be taken with a grain of salt, since we were not able to fully control the number of threads used. Nevertheless, the best result we were able to achieve when running with different numbers of threads was 11.7 seconds, which is still slower than the other programs. No difference in speed is noticeable between the different versions using Pthreads. This is to be expected, as the advantages of these solutions will only show on heavily loaded systems, or when looking at the CPU time.

The results in Table 2 look similar, except for two differences:

• nested parallel regions are serialized by the Guide compiler; therefore, no speedup beyond 2 is possible for sort_omp_nested_1.0

• we were not able to get sort_omp_taskq to work reliably on this platform (we have no idea whether this is a compiler problem or a subtle bug in our implementation, but as soon as more than one thread was employed, it would either crash or run forever); therefore, no results for this platform are provided
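The measurement protocol described above (repeat each run at least three times on freshly generated random integers, time only the sort itself, and keep the best wall-clock time) can be sketched as follows; this is our illustration, not the authors' harness, and the helper name bestOfN is ours:

```cpp
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <vector>

// Best-of-N wall-clock timing as used for Tables 1 and 2 (N >= 3):
// data generation happens outside the timed region.
template <typename Sorter>
double bestOfN(Sorter sortFn, std::size_t numInts, int repetitions) {
    double best = 1e300;
    for (int rep = 0; rep < repetitions; ++rep) {
        std::vector<int> data(numInts);
        for (std::size_t i = 0; i < numInts; ++i) data[i] = std::rand();
        auto start = std::chrono::steady_clock::now();
        sortFn(data);                          // only the sort is timed
        auto stop = std::chrono::steady_clock::now();
        double secs = std::chrono::duration<double>(stop - start).count();
        best = std::min(best, secs);
    }
    return best;
}
```

Taking the minimum over repetitions filters out runs disturbed by the operating system; the averages of Table 3 below deliberately do the opposite, to expose scheduling effects under load.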
Table 3 demonstrates what happens on a heavily loaded system with and without busy waiting. The heavy load was built up by using four times as many threads as there are processors available on the machine. The table shows average wall-clock times, to reduce the chance of lucky scheduling decisions.

Program                    SUN / 96 Th.   AMD / 16 Th.
sort_omp_1.0               > 600          20.5
sort_omp_2.0               > 600          19.0
sort_pthreads_1.0          15.9            7.8
sort_pthreads_cv_1.0       10.6            7.7
sort_pthreads_yield_1.0    11.4            7.9

Table 3. Average wall-clock time for sorting 100 million integers on heavily loaded systems in seconds

On the SUN platform, the Pthreads solutions show the results we expected: program sort_pthreads_1.0 takes considerably longer than sort_pthreads_cv and sort_pthreads_yield. There were two relatively big surprises for us, though. First, the OpenMP versions are slow compared to the Pthreads versions (in the case of the SUN machine, so slow that we decided to cancel the runs after 10 minutes). The reasons for this are not yet clear to us and are still under investigation. Second, on the AMD machine, no performance difference is noticeable between the different Pthreads versions. A better scheduler might account for this, but we are still investigating this question as well.

5. Concluding remarks and perspectives

In this paper, we have used several versions of a simple parallel sorting program to show some weaknesses of the OpenMP specification, and a couple of ways to address them. Suggestions for enhancements to the specification were made wherever it seemed appropriate.

Section 3.1 addressed the problem of recursion. Three solutions to it were presented, only one of which could be implemented portably with the present state of the specification, while the other two, more elegant but less performant, required additional support for the workqueue extension or for nested parallelism, respectively.

In section 3.2, we discussed the problem of wasted processor cycles (busy waiting). As possible solutions, we suggest the introduction of condition variables and a new function omp_yield. Performance measurements showed that both approaches may provide adequate savings in processor time.

In the future, we plan to implement and test some of the ideas and additions to the specification we have suggested in an actual compiler, as well as to investigate other algorithmic problems beyond sorting.

6. Acknowledgments

We are grateful to Björn Knafla and Beliz Senyüz for proofreading the paper. We thank the Center for Computing and Communication (especially Dieter an Mey) at the RWTH Aachen and the University Computing Center at the Technical University of Darmstadt for providing the computing facilities used to carry out our performance measurements.

References

[1] D. an Mey. Two OpenMP programming patterns. In Proceedings of the Fifth European Workshop on OpenMP (EWOMP '03), September 2003.

[2] OpenMP Architecture Review Board. OpenMP specifications. http://www.openmp.org/specs.

[3] D. R. Butenhof. Programming with POSIX Threads. Addison-Wesley, 1997.

[4] C. Hoare. Quicksort. The Computer Journal, 5:10-15, 1962.

[5] H. Lu, C. Hu, and W. Zwaenepoel. OpenMP on networks of workstations. In Proc. of Supercomputing '98, 1998.

[6] R. Parikh. Accelerating quicksort on the Intel Pentium 4 processor with Hyper-Threading technology. www.intel.com/cd/ids/developer/asmo-na/eng/technologies/threading/hyperthreading/20372.htm, 2003.

[7] S. Shah, G. Haab, P. Petersen, and J. Throop. Flexible control structures for parallelism in OpenMP. In Proceedings of the Fourth European Workshop on OpenMP (EWOMP '02), September 2002.
