Lecture 17 - More Parallel Algorithms

Sorting with one hat instead of n processors.


Agenda

Announcements

Parallel Algorithms

We finished off last time looking at a few ways to do a "compare and exchange" step in a message-passing environment.

We move on now to a few more sorting algorithms.

Bubble Sort

Remember everyone's favorite laughingstock sorting algorithm, the bubble sort...

for (i = n -1; i > 0; i--) {
  for (j = 0; j < i; j++) {
    k = j + 1;
    if (a[j] > a[k]) {
      temp = a[j];
      a[j] = a[k];
      a[k] = temp;
    }
  }
}

This requires O(n2) compare and exchange operations.

4 2 7 8 5 1 3 6
4x2 7 8 5 1 3 6
2 4x7 8 5 1 3 6
2 4 7x8 5 1 3 6
2 4 7 8x5 1 3 6
2 4 7 5 8x1 3 6
2 4 7 5 1 8x3 6
2 4 7 5 1 3 8x6
2x4 7 5 1 3 6 8
2 4x7 5 1 3 6 8
2 4 7x5 1 3 6 8
2 4 5 7x1 3 6 8
2 4 5 1 7x3 6 8
2 4 5 1 3 7x6 8
2x4 5 1 3 6 7 8
2 4x5 1 3 6 7 8
...

How can we parallelize this? It looks at first that each pass depends on the previous pass being completed, but that is not the case.

As soon as the first pass has gotten to the comparison of the 3rd and 4th columns, the comparison of the 1st and 2nd columns for the second pass can begin:

4 2 7 8 5 1 3 6
4x2 7 8 5 1 3 6
2 4x7 8 5 1 3 6
2x4 7x8 5 1 3 6
2 4x7 8x5 1 3 6
2x4 7x5 8x1 3 6
2 4x5 7x1 8x3 6
2x4 5x1 7x3 8x6
2 4x1 5x3 7x6 8
2x1 4x3 5x6 7 8
1 2x3 4x5 6 7 8
...

This leads to an algorithm called an odd-even transposition sort:

4x2 7x8 5x1 3x6
2 4x7 8x1 5x3 6
2x4 7x1 8x3 5x6
2 4x1 7x3 8x5 6
2x1 4x3 7x5 8x6
1 2x3 4x5 7x6 8
1 2 3 4 5 6 7 8

First, the even-numbered entries are compared with their next-highest neighbor's value and swapped if necessary, then the odd-numbered entries. Then back to even.

This takes n steps, and can use up to p=n/2 processors.

If we cast this as a message-passing program:

..with appropriate while loops and determination of process rank.

Bitonic Sort

This algorithm proceeds by rearranging bitonic sequences.

A bitonic sequence is a sequence of numbers a0, a1, ..., an-1 which monotonically increases in value, reaches a single maximum, and then monotonically decreases in value. So:

a0 < a1 < ... < ai-1 < ai > ai+1 > ... > an-2 > an-1

for some value of i. A sequence is also considered to be bitonic if the relation above can be achieved by shifting the numbers cyclically.

Examples:

This is useful to us because we can rearrange a bitonic sequence in into two smaller bitonic sequences:

After this, we have two bitonic sequences, and all elements in the first bitonic sequence will be smaller than all elements in the second. So we have converted the problem of rearranging a bitonic sequence of size n to rearranging two bitonic sequences of n/2, followed by concatenation. We apply this recursively until the whole sequence is sorted.

  3  5  8  9 10 12 14 20 95 90 60 40 35 23 18  0
  3  5  8  9 10 12 14  0|95 90 60 40 35 23 18 20
  3  5  8  0|10 12 14  9|35 23 18 20|95 90 60 40
  3  0| 8  5|10  9|14 12|18 20|35 23|60 40|95 90
  0  3  5  8  9 10 12 14 18 20 23 35 40 60 90 95

This takes logn steps to sort an n-element bitonic sequence.

What can be done in parallel? We can always make n/2 comparisons concurrently. So if we have O(n) processors, we can do this entire thing with O(logn) complexity.

But what if we don't have a bitonic sequence to start out? Well, we can make one:

Consider our example from above, but start with a completely unordered list of numbers:

 10 20  5  9  3  8 12 14 90  0 60 40 23 35 95 18

First, we think of it as 8 little bitonic sequences (which is is). We sort them, so that the first pair is increasing, the second is decreasing, third is increasing, etc.

 10 20  9  5  3  8 14 12  0 90 60 40 23 35 95 18

Now, we have 4 bitonic sequences of length 4. We sort each, using the approach we described previously. The first, we sort in increasing order, the second in decreasing, the third increasing and the 4th decreasing.

  9  5 10 20 14 12  3  8  0 40 60 90 95 35 21 18
  5  9 10 20 14 12  8  3  0 40 60 90 95 35 23 18

Now, we have two bitonic sequences of length 8. Sort these as described previously. The first in increasing order, the second in decreasing order.

  5  9  8  3 14 12 10 20 95 40 60 90  0 35 23 18
  5  3  8  9 10 12 14 20 95 90 60 40 23 35  0 18
  3  5  8  9 10 12 14 20 95 90 60 40 35 23 18  0

Which is our starting point for the original example, leaving one bitonic sequence to be sorted.

What is the complexity to get all the way from the unordered sequence to the sorted sequence?

For our 16-element example, we can see 4 phases, each of which consists of a number of substeps for a total of 10 steps:

So the total number of compare and exchange steps we need for a sequence of size n = 2k is

1 + 2 + ... + k = (k(k+1))/(2) = (logn(logn + 1))/(2) = O(log2 n)

assuming again that we have O(n) processors.

This sort can be implemented in a hardware sorting network.

Graph Algorithms

We can also consider parallel versions of some of our favorite graph algorithms.

Prim's Algorithm

Given a weighted, undirected graph G = (V, E, w), the minimum spanning tree (MST) is a subgraph of G that is a tree containing all vertices of G with minimum edge weight.

Prim's Algorithm can be used to compute a MST. It is O(n2) for a graph with n vertices.

How can we parallelize it?

We cannot do different iterations of the outer loop concurrently, as a non-optimal edge may be added to the MST.

Parallelization must involve the inner loop. We partition the adjacency matrix and the distance array across the processors. Each processor determines the smallest weight to a vertex not yet in the MST, and a global reduction (all-reduce) is used to determine which vertex to choose. Processors can each update their own parts of the distance array, and the process continues.