Computer Science 338
Tutorial Assignment 10: Parallel Sorting
Due: Before your tutorial meeting
Reminder
Final project progress reports are Monday, April 24.
Readings
As you do this week's readings, keep an outline of important points, and of points you wish to discuss during our tutorial meetings, in a plain text file tut10.txt or in something that can be turned into a PDF file tut10.pdf.
For most of the semester, we have looked at parallelism mainly from the point of view of implementation. This week we will look at parallel algorithms, much as you might study algorithms in CS 256.
Sometimes in the study of parallel algorithms, we state (possibly impractical) assumptions such as "suppose we have p processors," or perhaps even "suppose we have p^2 processors," or worse yet "suppose we have 2^p processors," to solve a problem of size p.
The Quinn textbook describes some parallel sorting procedures in Chapter 14. Read that chapter and do Exercise 14.10, parts a, b, and c. Include your answers in the submitted text or PDF file.
Some other parallel sorting procedures to consider:
We start with n integer values, uniformly distributed across a known interval [0, a-1], and we will use m buckets to sort. In the first step, we distribute the numbers into the buckets, with numbers up to a/m in the first bucket, up to 2a/m in the second, and so on. This requires n steps, one to determine the bucket for each number. In general, this requires a potentially expensive division operation, though if m is a power of 2, we can use fast bit-shift operators to achieve the division.
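For example, a minimal sketch of the bucket-assignment step (the function names are illustrative, not from the text; the shift version additionally assumes a and m are both powers of 2 so that a/m is one too):

/* Illustrative sketch: assign a value x in [0, a-1] to one of m buckets.
   In general the bucket index is x / (a/m); when a and m are powers of 2,
   a/m is as well, and the division becomes a right shift. */
int bucket_index(int x, int a, int m) {
    return x / (a / m);        /* general case: one division per value */
}

int bucket_index_pow2(int x, int shift) {
    return x >> shift;         /* shift = log2(a/m); assumes a and m are powers of 2 */
}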
Now, (only because we assumed that the numbers were uniformly distributed across the interval) we have approximately n/m numbers in each bucket.
Next, each bucket must be sorted, and we'll use our favorite "compare and exchange" sorting algorithm, perhaps quicksort or mergesort. For each bucket, with n/m numbers, we need (n/m) log(n/m) steps. And we need this for each of the m buckets.
Finally, the sorted numbers must be concatenated into a sorted list, which is, at worst, n steps.
So we have total time of
t_s = n + m((n/m) log(n/m)) = n + n log(n/m) = O(n log(n/m))
Furthermore, if n = km, where k is a constant (that is, we add buckets proportionally if we increase the problem size), this reduces to O(n). This is better than our usual bound on sorting algorithms, but it depends on the fact that the initial values are well distributed.
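A compact sequential sketch of the whole procedure might look like the following (illustrative only; it leans on qsort() from the C library for the per-bucket sort, and the helper names are not from the text):

#include <stdlib.h>

/* Sequential bucket sort sketch: n values in [0, a-1], m buckets.
   Assumes m divides a evenly so every value maps to a bucket 0..m-1. */
static int cmp_int(const void *x, const void *y) {
    int xv = *(const int *)x, yv = *(const int *)y;
    return (xv > yv) - (xv < yv);
}

void bucket_sort(int *vals, int n, int a, int m) {
    int **bucket = malloc(m * sizeof *bucket);
    int *count   = calloc(m, sizeof *count);
    for (int b = 0; b < m; b++)
        bucket[b] = malloc(n * sizeof(int));     /* worst case: everything in one bucket */

    for (int i = 0; i < n; i++) {                /* distribute: n steps */
        int b = vals[i] / (a / m);
        bucket[b][count[b]++] = vals[i];
    }

    int k = 0;
    for (int b = 0; b < m; b++) {                /* sort each bucket, then concatenate */
        qsort(bucket[b], count[b], sizeof(int), cmp_int);
        for (int i = 0; i < count[b]; i++)
            vals[k++] = bucket[b][i];
        free(bucket[b]);
    }
    free(bucket);
    free(count);
}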
How can we parallelize it?
The obvious place is in the sorting of the contents of each bucket. These are independent operations, so if we have p = m processors, we can have a processor dedicated to each, and we can reduce the second term in the equation from m((n/m) log(n/m)) to (n/p) log(n/p).
However, the whole process is still O(n), since every one of the original numbers must still be examined to fill the buckets initially.
One way to parallelize the initial bucket-filling is to partition the initial values into m parts. Each processor deals with its partition. Each processor could potentially put numbers directly into the correct bucket, if we have shared memory, though there would be the added cost of enforcing mutual exclusion when accessing the buckets.
If we don't have shared memory, we can have each processor put the numbers into "small buckets." Each processor has m small buckets, where it puts numbers from its partition that are intended for each of the regular buckets. After each processor has put all of the numbers from its range into its small buckets, each processor sends the contents of one small bucket to, and receives the contents of one small bucket from, each other processor.
This algorithm has 4 phases:
1. Partition the numbers into p regions, one per processor.
2. Each processor places its numbers into p small buckets.
3. The small buckets are distributed, so that each processor ends up with the contents of one large bucket.
4. Each processor sorts its large bucket.
We consider the costs at each stage, in detail:
Assume n computational steps to partition n numbers p ways:
t_comp1 = n
We then broadcast/scatter the partitions:
t_comm1 = t_startup + n t_data
Each processor has to partition n/p numbers into p small buckets:
t_comp2 = n/p
Distribution of the small buckets. For simplicity, we assume the uniform initial distribution has given about n/p^2 numbers in each small bucket on each processor. Each processor must send the contents of p-1 of its small buckets. This happens on each of the p processes.
If the communication is serialized:
t_comm3 = p(p-1)(t_startup + (n/p^2) t_data)
or if the communication can be concurrent:
t_comm3 = (p-1)(t_startup + (n/p^2) t_data)
Note that an implementation of this algorithm might employ an MPI_Alltoall() call.
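For instance, the exchange might be written roughly as follows (a sketch with illustrative names, using MPI_Alltoall for the counts and MPI_Alltoallv for the values, since the small buckets generally differ in size; each process has already grouped its values by destination in sendbuf, with sendcounts[j] values headed for process j, and recvbuf is assumed large enough to hold everything received):

#include <mpi.h>
#include <stdlib.h>

/* Exchange small buckets among all p processes (illustrative sketch). */
void exchange_small_buckets(int *sendbuf, int *sendcounts,
                            int *recvbuf, int p, MPI_Comm comm)
{
    int *recvcounts = malloc(p * sizeof(int));
    int *sdispls    = malloc(p * sizeof(int));
    int *rdispls    = malloc(p * sizeof(int));

    /* First tell every process how much it will receive from each other process. */
    MPI_Alltoall(sendcounts, 1, MPI_INT, recvcounts, 1, MPI_INT, comm);

    /* Displacements are running sums of the counts. */
    sdispls[0] = rdispls[0] = 0;
    for (int j = 1; j < p; j++) {
        sdispls[j] = sdispls[j-1] + sendcounts[j-1];
        rdispls[j] = rdispls[j-1] + recvcounts[j-1];
    }

    /* Now exchange the bucket contents themselves. */
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
                  recvbuf, recvcounts, rdispls, MPI_INT, comm);

    free(recvcounts);
    free(sdispls);
    free(rdispls);
}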
Each processor sorts its own "large" bucket's contents:
t_comp4 = (n/p) log(n/p)
We add all these up to get our estimate of the total running time.
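Taking the concurrent-communication case for the small-bucket distribution, that sum comes to roughly

t_p = n + (t_startup + n t_data) + n/p + (p-1)(t_startup + (n/p^2) t_data) + (n/p) log(n/p)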
Now we consider a system where we have a pipeline of p processing units. Each processor can hold one number. When a processor gets a new number, it compares it with the one it has, and keeps the larger (or smaller) and passes along the other to the next processor in the pipeline. The numbers to be sorted are fed in, one at a time, to the first processor. At the end, processor 0 will have the largest (or smallest) number, processor 1 the second largest (smallest), and so on.
This is a parallel version of an insertion sort.
Here is what such an algorithm would do to sort 5 numbers:
Time  Left to Sort  P0        P1        P2        P3        P4
   1  7 5 2 3 11
   2  7 5 2 3       11
   3  7 5 2         11 -3->
   4  7 5           11 -2->  3
   5  7             11 -5->  3 -2->
   6                11 -7->  5 -3->   2
   7                11       7 -5->   3 -2->
   8                11       7        5 -3->   2
   9                11       7        5        3 -2->
  10                11       7        5        3        2
This is nicely suited to a message passing program.
It takes 2p steps.
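A message-passing sketch of one pipeline stage, using illustrative MPI calls and names (process i receives n - i numbers from its left neighbor, keeps the largest it has seen, and passes the rest to the right; process 0 takes its input from an array instead of a message):

#include <mpi.h>

/* One stage of the pipeline sort (a sketch).  in[] is only used by rank 0;
   other ranks may pass NULL.  At the end, rank 0 holds the largest value,
   rank 1 the second largest, and so on. */
void pipeline_sort_stage(int *in, int n, int rank, int *kept)
{
    int have = 0, mine = 0, x;
    for (int step = 0; step < n - rank; step++) {
        if (rank == 0)
            x = in[step];                               /* feed in the next input value */
        else
            MPI_Recv(&x, 1, MPI_INT, rank - 1, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (!have) {                                    /* first number seen: just keep it */
            mine = x;
            have = 1;
        } else {
            int pass = (x > mine) ? mine : x;           /* keep the larger ...            */
            if (x > mine) mine = x;
            MPI_Send(&pass, 1, MPI_INT, rank + 1, 0,    /* ... pass along the smaller     */
                     MPI_COMM_WORLD);
        }
    }
    *kept = mine;
}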
A rank sort, or enumeration sort, counts, for each number, how many of the numbers are smaller than it. This determines its rank in the final ordering.
A sequential implementation that takes numbers in an array a of size n and sorts them into an array b of size n:
for (i=0; i<n; i++) {          /* look at each position */
    rank=0;
    for (j=0; j<n; j++) {      /* count how many are < a[i] */
        if (a[i] > a[j]) rank++;
    }
    b[rank] = a[i];            /* this is where it goes */
}
This is not an especially efficient sequential sorting algorithm, running in O(n^2) time.
But what if we have n processors? We can do the outer loop in parallel:
parallel for (i=0; i<n; i++) { /* look at each position */
    rank=0;
    for (j=0; j<n; j++) {      /* count how many are < a[i] */
        if (a[i] > a[j]) rank++;
    }
    b[rank] = a[i];            /* this is where it goes */
}
rank and j are private variables; everything else is shared.
Now, we have a running time of O(n).
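As one possible sketch, in OpenMP the parallel outer loop could be written as below (i, j, and rank are private simply by being declared inside the loop; like the sequential version, this assumes the values are distinct):

/* Rank sort with the outer loop parallelized (an OpenMP sketch; compile with -fopenmp). */
void rank_sort(const int *a, int *b, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {          /* look at each position */
        int rank = 0;
        for (int j = 0; j < n; j++)        /* count how many are < a[i] */
            if (a[i] > a[j]) rank++;
        b[rank] = a[i];                    /* this is where it goes */
    }
}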
What if we have n^2 processors? We can then have n processors available to work on finding the rank of the number at each position. What we can do in this situation depends on our assumptions about the available system.
If memory is fully shared, we can change the inside for loop to a parallel for, but the increments to rank need to be serialized. This leads to 1 + n steps: 1 step to do the n-1 comparisons of a[i] with each other entry (completely in parallel), and n steps to deal with rank (including one step for its initialization). So we still have overall complexity O(n).
The increments of rank could be improved by doing a tree reduction:
a[i]>a[0]  a[i]>a[1]      a[i]>a[2]  a[i]>a[3]     Compare
   0/1        0/1            0/1        0/1
     \        /                \        /
      \      /                  \      /
         +            Add          +
       0/1/2                     0/1/2
          \                        /
           \                      /
            \                    /
             \                  /
                      +
                  0/1/2/3/4
Now this is O(log n).
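A sequential rendering of that reduction pattern, for one position i (illustrative; on a shared-memory machine with enough processors, each level of the second loop would be a single parallel step):

#include <stdlib.h>

/* Compute the rank of a[i] by a pairwise (tree) reduction of 0/1 comparison results. */
int rank_by_tree(const int *a, int n, int i)
{
    int *flag = malloc(n * sizeof(int));
    for (int j = 0; j < n; j++)                     /* one parallel comparison step */
        flag[j] = (a[i] > a[j]) ? 1 : 0;

    for (int stride = 1; stride < n; stride *= 2)   /* about log2(n) reduction levels */
        for (int j = 0; j + stride < n; j += 2 * stride)
            flag[j] += flag[j + stride];            /* each level is one parallel step */

    int rank = flag[0];
    free(flag);
    return rank;
}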
If we allow concurrent writes to memory (hard) we can do this all in one step - making an O(1) sorting algorithm!
Most sorting algorithms we see in the sequential world are "compare and exchange." These algorithms are based on this operation, for two items A and B:
if (A > B) {
    temp = A;
    A = B;
    B = temp;
}
Now think of this in a message-passing system, where A and B are on different processors, P1 and P2. Suppose P1 initially has A and P2 initially has B, and at the end P1 should have the smaller value and P2 should have the larger.
One way to do this uses simple send/recv operations:
/* P1: */
send(&A, P2);
recv(&A, P2);
/* P2: */
recv(&A, P1);
if (A > B) {
    send(&B, P1);
    B = A;
} else {
    send(&A, P1);
}
Each process has one send and one receive.
Another possible approach is to have the processes exchange values, and only keep the one they're supposed to keep:
/* P1: */
send(&A, P2);
recv(&B, P2);
if (A > B) A = B;
/* P2: */
recv(&A, P1);
send(&B, P1);
if (A > B) B = A;
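With MPI, this second version maps naturally onto MPI_Sendrecv. A minimal sketch, where partner is the other process's rank and keep_min (an illustrative parameter, not from the text) says whether this process keeps the smaller value:

#include <mpi.h>

/* Compare-and-exchange between two processes using MPI_Sendrecv (a sketch).
   Each process sends its own value and receives its partner's, then keeps
   either the smaller or the larger, as directed by keep_min. */
void compare_exchange(int *my_value, int partner, int keep_min)
{
    int other;
    MPI_Sendrecv(my_value, 1, MPI_INT, partner, 0,
                 &other,   1, MPI_INT, partner, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    if (keep_min) {
        if (other < *my_value) *my_value = other;
    } else {
        if (other > *my_value) *my_value = other;
    }
}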
Remember everyone's favorite laughingstock sorting algorithm, the bubble sort...
for (i = n - 1; i > 0; i--) {
    for (j = 0; j < i; j++) {
        k = j + 1;
        if (a[j] > a[k]) {
            temp = a[j];
            a[j] = a[k];
            a[k] = temp;
        }
    }
}
This requires O(n^2) compare and exchange operations.
4 2 7 8 5 1 3 6
4x2 7 8 5 1 3 6
2 4x7 8 5 1 3 6
2 4 7x8 5 1 3 6
2 4 7 8x5 1 3 6
2 4 7 5 8x1 3 6
2 4 7 5 1 8x3 6
2 4 7 5 1 3 8x6
2x4 7 5 1 3 6 8
2 4x7 5 1 3 6 8
2 4 7x5 1 3 6 8
2 4 5 7x1 3 6 8
2 4 5 1 7x3 6 8
2 4 5 1 3 7x6 8
2x4 5 1 3 6 7 8
2 4x5 1 3 6 7 8
...
How can we parallelize this? It looks at first as though each pass depends on the previous pass being completed, but that is not the case.
As soon as the first pass has gotten to the comparison of the 3rd and 4th columns, the comparison of the 1st and 2nd columns for the second pass can begin:
4 2 7 8 5 1 3 6
4x2 7 8 5 1 3 6
2 4x7 8 5 1 3 6
2x4 7x8 5 1 3 6
2 4x7 8x5 1 3 6
2x4 7x5 8x1 3 6
2 4x5 7x1 8x3 6
2x4 5x1 7x3 8x6
2 4x1 5x3 7x6 8
2x1 4x3 5x6 7 8
1 2x3 4x5 6 7 8
...
This leads to an algorithm called an odd-even transposition sort:
4x2 7x8 5x1 3x6
2 4x7 8x1 5x3 6
2x4 7x1 8x3 5x6
2 4x1 7x3 8x5 6
2x1 4x3 7x5 8x6
1 2x3 4x5 7x6 8
1 2 3 4 5 6 7 8
First, the even-numbered entries are compared with their next-highest neighbors' values and swapped if necessary, then the odd-numbered entries. Then back to even.
This takes n steps, and can use up to p = n/2 processors.
If we cast this as a message-passing program:
/* Odd-numbered process Pi: */
send(&A, Pi-1);      /* even phase */
recv(&B, Pi-1);
if (A < B) A = B;
if (i <= n-3) {      /* odd phase */
    send(&A, Pi+1);
    recv(&B, Pi+1);
    if (A > B) A = B;
}
/* Even-numbered process Pi: */
recv(&A, Pi+1);      /* even phase */
send(&B, Pi+1);
if (A < B) B = A;
if (i >= 2) {        /* odd phase */
    recv(&A, Pi-1);
    send(&B, Pi-1);
    if (A > B) B = A;
}
...with appropriate while loops and determination of process rank.
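Filling in those details, one possible MPI rendering of the whole sort is sketched below (one value per process; computing a partner rank each phase replaces the separate odd/even code paths above, which is a simplification rather than the textbook's exact structure):

#include <mpi.h>

/* Odd-even transposition sort, one value per process (an MPI sketch).
   n = number of processes = number of values.  In even phases processes
   (0,1), (2,3), ... pair up; in odd phases (1,2), (3,4), ...  The lower-
   ranked process of a pair keeps the smaller value, so rank 0 ends with
   the minimum and rank n-1 with the maximum. */
void odd_even_sort(int *my_value, int rank, int n)
{
    for (int phase = 0; phase < n; phase++) {
        int partner;
        if (phase % 2 == 0)                    /* even phase */
            partner = (rank % 2 == 0) ? rank + 1 : rank - 1;
        else                                   /* odd phase */
            partner = (rank % 2 == 0) ? rank - 1 : rank + 1;
        if (partner < 0 || partner >= n)
            continue;                          /* no partner this phase */

        int other;
        MPI_Sendrecv(my_value, 1, MPI_INT, partner, phase,
                     &other,   1, MPI_INT, partner, phase,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rank < partner) {                  /* keep the smaller value */
            if (other < *my_value) *my_value = other;
        } else {                               /* keep the larger value */
            if (other > *my_value) *my_value = other;
        }
    }
}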
The next algorithm, bitonic mergesort, proceeds by rearranging bitonic sequences.
A bitonic sequence is a sequence of numbers a_0, a_1, ..., a_{n-1} which monotonically increases in value, reaches a single maximum, and then monotonically decreases in value. So:
a_0 < a_1 < ... < a_{i-1} < a_i > a_{i+1} > ... > a_{n-2} > a_{n-1}
for some value of i. A sequence is also considered to be bitonic if the relation above can be achieved by shifting the numbers cyclically.
Examples:
3 5 8 9 7 4 2 1 (increases, then decreases)
8 9 7 4 2 1 3 5 (a cyclic shift of the sequence above, so also bitonic)
This is useful to us because we can rearrange a bitonic sequence into two smaller bitonic sequences: compare each element a_i in the first half with the corresponding element a_{i+n/2} in the second half, and exchange them if a_i is the larger.
After this, we have two bitonic sequences, and all elements in the first bitonic sequence will be smaller than all elements in the second. So we have converted the problem of rearranging a bitonic sequence of size n into rearranging two bitonic sequences of size n/2, followed by concatenation. We apply this recursively until the whole sequence is sorted.
3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0
3 5 8 9 10 12 14 0 | 95 90 60 40 35 23 18 20
3 5 8 0 | 10 12 14 9 | 35 23 18 20 | 95 90 60 40
3 0 | 8 5 | 10 9 | 14 12 | 18 20 | 35 23 | 60 40 | 95 90
0 3 5 8 9 10 12 14 18 20 23 35 40 60 90 95
This takes log n steps to sort an n-element bitonic sequence.
What can be done in parallel? We can always make n/2 comparisons concurrently. So if we have O(n) processors, we can do this entire thing with O(log n) complexity.
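A sequential sketch of this rearrangement (the recursion mirrors the trace above; cnt is assumed to be a power of 2, and dir = 1 means increasing order, dir = 0 decreasing):

/* Rearrange a bitonic sequence a[lo..lo+cnt-1] into sorted order (a sketch).
   The first loop is the compare-and-exchange of a[i] with a[i + cnt/2];
   each level of the recursion is one parallel step if we have cnt/2 processors. */
void bitonic_merge(int *a, int lo, int cnt, int dir)
{
    if (cnt <= 1) return;
    int half = cnt / 2;
    for (int i = lo; i < lo + half; i++) {
        if ((a[i] > a[i + half]) == dir) {     /* exchange if out of order for dir */
            int temp = a[i];
            a[i] = a[i + half];
            a[i + half] = temp;
        }
    }
    bitonic_merge(a, lo, half, dir);
    bitonic_merge(a, lo + half, half, dir);
}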
But what if we don't have a bitonic sequence to start out? Well, we can make one:
Consider our example from above, but start with a completely unordered list of numbers:
10 20 5 9 3 8 12 14 90 0 60 40 23 35 95 18
First, we think of it as 8 little bitonic sequences (which it is). We sort them, so that the first pair is increasing, the second is decreasing, the third is increasing, and so on.
10 20 9 5 3 8 14 12 0 90 60 40 23 35 95 18
Now, we have 4 bitonic sequences of length 4. We sort each, using the approach we described previously. The first we sort in increasing order, the second in decreasing, the third in increasing, and the fourth in decreasing.
9 5 10 20 14 12 3 8 0 40 60 90 95 35 23 18
5 9 10 20 14 12 8 3 0 40 60 90 95 35 23 18
Now, we have two bitonic sequences of length 8. Sort these as described previously. The first in increasing order, the second in decreasing order.
5 9 8 3 14 12 10 20 95 40 60 90 0 35 23 18
5 3 8 9 10 12 14 20 95 90 60 40 23 35 0 18
3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0
This is our starting point for the original example, leaving one bitonic sequence to be sorted.
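The construction of the bitonic sequence can be written as a short recursion on top of the bitonic_merge sketch above (again illustrative, and n must be a power of 2):

/* Sort a[lo..lo+cnt-1] (a sketch): sort the first half increasing and the
   second half decreasing, which makes the whole range bitonic, then
   rearrange it with bitonic_merge() from the sketch above. */
void bitonic_sort(int *a, int lo, int cnt, int dir)
{
    if (cnt <= 1) return;
    int half = cnt / 2;
    bitonic_sort(a, lo, half, 1);             /* increasing half */
    bitonic_sort(a, lo + half, half, 0);      /* decreasing half */
    bitonic_merge(a, lo, cnt, dir);           /* now bitonic: rearrange into order dir */
}

/* Usage: bitonic_sort(a, 0, n, 1) sorts a[0..n-1] into increasing order. */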
What is the complexity to get all the way from the unordered sequence to the sorted sequence?
For our 16-element example, we can see 4 phases, each of which consists of a number of substeps, for a total of 10 steps:
Phase 1: sort each of the 8 pairs: 1 step.
Phase 2: sort the four 4-element bitonic sequences: 2 steps.
Phase 3: sort the two 8-element bitonic sequences: 3 steps.
Phase 4: sort the final 16-element bitonic sequence: 4 steps.
So the total number of compare and exchange steps we need for a sequence of size n = 2^k is
1 + 2 + ... + k = k(k+1)/2 = (log n)(log n + 1)/2 = O(log^2 n)
assuming again that we have O(n) processors.
This sort can be implemented in a hardware sorting network.
Submission and Grading Guidelines
Before your tutorial meeting, submit a plain text file tut10.txt or a PDF file tut10.pdf that contains the outline that you will use to help guide our discussion and your answer to the assigned question from the text. You may choose to work with your tutorial partner(s) or independently on this outline. Your grade for the week will be based on the quality of this outline.