So far, we have looked at parallelism mainly from a point of view of implementation. We will look this week at parallel algorithms, along the lines like you studied (or are studying, as the case may be) algorithms in CS 256.
Here, we will state (possibly impractical) assumptions such as "suppose we have p processors" or perhaps even "suppose we have p2 processors" or worse yet "suppose we have 2p processors" to solve a problem of size p.
We have all seen plenty of sorting algorithms. We will look at sorting numbers, though of course we can sort anything that we can compare (strings, complicated structures, etc).
Recall your favorite algorithms from the past: bubble sort, insertion sort, selection sort, quicksort, merge sort.
We start by considering a different one, which you may or may not have seen before: the bucket sort.
We start with n integer values, uniformly distributed across a known interval [0,a-1], and we will use m buckets to sort. In the first step, we distribute the numbers into the buckets, with numbers up to a/m in the first bucket, up to 2a/m in the second, and so on. This requires n steps, one to determine the bucket for each number. in general, this requires a potentially expensive division operation, though if m is a power of 2, we can use fast bit-shift operators to achieve the division.
Now, (only because we assumed that the numbers were uniformly distributed across the interval) we have appoximately n/m numbers in each bucket.
Next, each bucket must be sorted, and we'll use our favorite "compare and exchange" sorting algorithm, perhaps quicksort or mergesort. For each bucket, with n/m numbers, we need (n/m) log(n/m) steps. And we need this for each of the m buckets.
Finally, the sorted numbers must be concatenated into a sorted list, which is, at worst, n steps.
So we have total time of
ts = n + m((n/m) log(n/m)) = n + n log(n/m) = O(n log(n/m))
Furthermore, if n = km, where k is a constant (that is, we add buckets proportionally if we increase the problem size), this reduces to O(n). This is better than our usual bound on sorting algorithms, but depends on the fact that we have a well-distributed initial values.
How can we parallelize it?
The obvious place is in the sorting of the contents of each bucket. These are independent operations, so if we have p=m processors, we can have a processor dedicated to each, and we can reduce the second term in the equation from m((n/m) log(n/m)) to (n/p) log(n/p).
However, the whole process is still O(n), since they all need to look at all of the original numbers to fill their buckets initially.
One way to parallelize the initial bucket-filling is to partition the initial values into m parts. Each processor deals with its partition. Each processor could potentially put numbers directly into the correct bucket, if we have shared memory, though there would be the added cost of enforcing mutual exclusion when accessing the buckets.
If we don't have shared memory, we can have each processor put the numbers into "small buckets." Each processor has m small buckets, where it puts numbers from its partition that are intended for each of the regular buckets. After each processor has put all of the numbers from its range into its small buckets, each processor sends the contents of one small bucket and to and receives the contents of one small bucket from each other processor.
This algorithm has 4 phases:
We consider the costs at each stage, in detail:
Assume n computational steps to partition n numbers p ways:
tcomp1 = n
We then broadcast/scatter the partitions:
tcomm1 = tstartup + tdatan
Each processor has to partition n/p numbers into p small buckets:
tcomp2 = n/p
Distribution of the small buckets. For simplicity, we assume the uniform initial distribution has given about n/p2 numbers in each small bucket on each processor. Each must send the contents of p-1 of its small buckets. This happens on each of p processes.
If the communication is serialized:
tcomm3 = p(p-1)(tstartup + (n/p2)tdata)
or if the communication can be concurrent:
tcomm3 = (p-1)(tstartup + (n/p2)tdata)
Note that an implementation of this algorithm might employ an MPI_Alltoall() call.
Each processor sorts its own "large" bucket's contents:
tcomp4 = (n/p) log(n/p)
We add all these up to get our estimate of the total running time.
Now we consider a system where we have a pipeline of p processing units. Each processor can hold one number. When a processor gets a new number, it compares it with the one it has, and keeps the larger (or smaller) and passes along the other to the next processor in the pipeline. The numbers to be sorted are fed in, one at a time, to the first processor. At the end, the processor 0 will have the largest (or smallest) number, processor 1 the second largest (smallest), and so on.
This is a parallel version of an insertion sort.
Here is what such an algorithm would do to sort 5 numbers:
Time Left to Sort P0 P1 P2 P3 P4 1 7 5 2 3 11 2 7 5 2 3 11 3 7 5 2 11 -3-> 4 7 5 11 -2-> 3 5 7 11 -5-> 3 -2-> 6 11 -7-> 5 -3-> 2 7 11 7 -5-> 3 -2-> 8 11 7 5 -3-> 2 9 11 7 5 3 -2-> 10 11 7 5 3 2
This is nicely suited to a message passing program.
It takes 2p steps.
A rank, or enumeration, sort algorithm counts the number of numbers that are cmaller than each number. This determines its rank in the final ordering.
A sequential implementation that takes numbers in an array a of size n and sorts them into an array b of size n:
for (i=0; i<n; i++) { /* look at each position */ rank=0; for (j=0; j<n; j++) { /* count how many are < a[i] */ if (a[i] > a[j]) rank++; } b[rank] = a[i]; /* this is where it goes */ }
This is not an especially efficient sequential sorting algorithm, running in O(n2) time.
But what if we have n processors? We can do the outer loop in parallel:
parallel for (i=0; i<n; i++) { /* look at each position */ rank=0; for (j=0; j<n; j++) { /* count how many are < a[i] */ if (a[i] > a[j]) rank++; } b[rank] = a[i]; /* this is where it goes */ }
rank and j are private variables, everything else is shared.
Now, we have a running time of O(n).
What if we have n2 processors? We can then have n processors available to work on finding the rank of the number at each position. What we can do in this situation depends on our assumptions about the available system.
If memory is fully shared, we can change the inside for loop to a parallel for, but the increments to rank need to be serialized. This leads to 1+n steps, 1 to do each of the n-1 comparisons of a[i] to each other entry (completely in parallel), and n to deal with rank (including one step for the initialization). So we still have overall complexity O(n).
The increments of rank could be improved by doing a tree reduction:
a[i]a[0] a[i]a[1] a[i]a[2] a[i]a[3] \ / Compare \ / 0/1 0/1 0/1 0/1 \ / \ / + Add + \ / 0/1/2 0/1/2 \ / \ / \ / \ / + 0/1/2/3/4
Now this is O(logn).
If we allow concurrent writes to memory (hard) we can do this all in one step - making an O(1) sorting algorithm!
Most sorting algorithms we see in the sequential world are "compare and exchange." These algorithms are based on this operation, for two items A and B:
if (A > B) { temp = A; A = B; B = temp; }
If we think of this in a message passing system, and A and B are on different processors, P1 and P2. Suppose P1 initially has A and P2 intially has B, and at the end P1 should have the smaller value and P2 should have the larger.
One way to do this, using simple send/recv operations.
send(&A, P2); recv(&A, P2);
recv(&A, P1); if (A > B) { send(&B, P1); B = A; } else { send(&A, P1); }
Each process has one send and one receive.
Another possible approach is to have the processes exchange values, and only keep the one they're supposed to keep:
send(&A, P2); recv(&B, P2); if (A > B) A = B;
recv(&A, P1); send(&B, P1); if (A > B) B = A;