## Computer Science 338

**Tutorial Assignment 10: Parallel Sorting**

**Due: Before your tutorial meeting**

**Reminder**

Final project progress reports are Monday, April 24.

**Readings**

As you do this week's readings, keep an outline of important points
and points you wish to discuss during our tutorial meetings in a plain
text file `tut10.txt` or something that can be turned into a PDF
file `tut10.pdf`.

For most of the semester, we have looked at parallelism mainly from the point of view of implementation. This week we will look at parallel algorithms, much as you might study algorithms in CS 256.

Sometimes in the study of parallel algorithms, we state (possibly
impractical) assumptions such as "suppose we have *p* processors," or
perhaps even "suppose we have *p^{2}* processors," or worse yet
"suppose we have an unlimited number of processors."

The Quinn textbook describes some parallel sorting procedures in Chapter 14. Read that chapter and do Exercise 14.10, parts a, b, and c. Include your answers in the submitted text or PDF file.

Some other parallel sorting procedures to consider:

First, consider a **bucket sort**. We start with *n* integer values,
uniformly distributed across a known interval *[0, a-1]*, and we will
use *m* buckets to sort. In the first step, we distribute the numbers
into the buckets: numbers less than *a/m* go in the first bucket,
numbers less than *2a/m* in the second, and so on. This requires *n*
steps, one to determine the bucket for each number. In general, this
requires a potentially expensive division operation, though if *m* is
a power of 2, we can use fast bit-shift operators to achieve the
division.

Now (only because we assumed that the numbers were uniformly
distributed across the interval) we have approximately *n/m* numbers
in each bucket.

Next, each bucket must be sorted, and we'll use our favorite "compare
and exchange" sorting algorithm, perhaps quicksort or mergesort. Each
bucket, with its *n/m* numbers, needs *(n/m) log(n/m)* steps, and we
need this for each of the *m* buckets.

Finally, the sorted numbers must be concatenated into a sorted list,
which is, at worst, *n* steps.

So we have total time of

*t_{s} = n + m((n/m) log(n/m)) = n + n log(n/m) = O(n log(n/m))*

Furthermore, if *n = km*, where *k* is a constant (that is, we add
buckets proportionally if we increase the problem size), this reduces
to *O(n)*. This is better than our usual bound on sorting algorithms,
but it depends on the fact that we have well-distributed initial
values.

How can we parallelize it?

The obvious place is in the sorting of the contents of each bucket.
These are independent operations, so if we have *p = m* processors, we
can dedicate a processor to each bucket, reducing the second term in
the equation from *m((n/m) log(n/m))* to *(n/p) log(n/p)*.

However, the whole process is still *O(n)*, since they all need to
look at all of the original numbers to fill their buckets initially.

One way to parallelize the initial bucket-filling is to partition the
initial values into *m* parts. Each processor deals with its
partition. Each processor could potentially put numbers directly into
the correct bucket, if we have shared memory, though there would be
the added cost of enforcing mutual exclusion when accessing the
buckets.

If we don't have shared memory, we can have each processor put the
numbers into "small buckets." Each processor has *m* small buckets,
where it puts numbers from its partition that are intended for each of
the regular buckets. After each processor has put all of the numbers
from its partition into its small buckets, each processor sends the
contents of one small bucket to, and receives the contents of one
small bucket from, each other processor.

This algorithm has 4 phases:

- Partition the initial collection of numbers
- Sort into small buckets
- Send to large buckets
- Sort large buckets

We consider the costs at each stage, in detail:

- Phase 1 - Computation and communication. Assume *n* computational
steps to partition *n* numbers *p* ways: *t_{comp1} = n*. We then
broadcast/scatter the partitions: *t_{comm1} = t_{startup} + t_{data} n*
- Phase 2 - Computation only. Each processor has to partition its
*n/p* numbers into *p* small buckets: *t_{comp2} = n/p*
- Phase 3 - Communication only. Distribution of the small buckets.
For simplicity, we assume the uniform initial distribution has given
about *n/p^{2}* numbers in each small bucket on each processor. Each
processor must send the contents of *p-1* of its small buckets, and
this happens on each of the *p* processes. If the communication is
serialized: *t_{comm3} = p(p-1)(t_{startup} + (n/p^{2}) t_{data})*
Or, if the communication can be concurrent:
*t_{comm3} = (p-1)(t_{startup} + (n/p^{2}) t_{data})*
Note that an implementation of this algorithm might employ an
`MPI_Alltoall()` call.
- Phase 4 - Computation only. Each processor sorts its own "large"
bucket's contents: *t_{comp4} = (n/p) log(n/p)*

We add all these up to get our estimate of the total running time.

Now we consider a system where we have a pipeline of *p* processing
units. Each processor can hold one number. When a processor gets a
new number, it compares it with the one it has, and keeps the larger
(or smaller) and passes along the other to the next processor in the
pipeline. The numbers to be sorted are fed in, one at a time, to the
first processor. At the end, processor 0 will have the largest
(or smallest) number, processor 1 the second largest (smallest), and
so on.

This is a parallel version of an insertion sort.

Here is what such an algorithm would do to sort 5 numbers:

```
Time  Left to Sort  P0       P1      P2      P3      P4
  1   7 5 2 3 11
  2   7 5 2 3      11
  3   7 5 2        11 -3->
  4   7 5          11 -2->  3
  5   7            11 -5->  3 -2->
  6                11 -7->  5 -3->  2
  7                11       7 -5->  3 -2->
  8                11       7       5 -3->  2
  9                11       7       5       3 -2->
 10                11       7       5       3       2
```

This is nicely suited to a message passing program.

It takes *2p* steps (the trace above sorts 5 numbers in 10 time steps).

A **rank**, or **enumeration**, **sort** algorithm counts the
number of numbers that are smaller than each number. This determines
its rank in the final ordering.

A sequential implementation that takes numbers in an array `a` of
size *n* and sorts them into an array `b` of size *n*:

```c
for (i = 0; i < n; i++) {        /* look at each position */
    rank = 0;
    for (j = 0; j < n; j++) {    /* count how many are < a[i] */
        if (a[i] > a[j]) rank++;
    }
    b[rank] = a[i];              /* this is where it goes */
}
```

(Note that this assumes the values are distinct; equal values would
compute equal ranks.)

This is not an especially efficient sequential sorting algorithm,
running in *O(n ^{2})* time.

But what if we have *n* processors? We can do the outer loop in
parallel:

```
parallel for (i = 0; i < n; i++) {   /* look at each position */
    rank = 0;
    for (j = 0; j < n; j++) {        /* count how many are < a[i] */
        if (a[i] > a[j]) rank++;
    }
    b[rank] = a[i];                  /* this is where it goes */
}
```

`rank` and `j` are private variables, everything else is
shared.

Now, we have a running time of *O(n)*.

What if we have *n^{2}* processors? We can then have *n* processors
working on each position *i*.

If memory is fully shared, we can change the inner for loop to a
parallel for, but the increments to `rank` need to be serialized.
This leads to *1 + n* steps: 1 to do all *n-1* comparisons of
`a[i]` with the other entries (completely in parallel), and *n* to
deal with `rank` (including one step for the initialization). So
we still have overall complexity *O(n)*.

The increments of `rank` could be improved by doing a tree
reduction:

```
a[i],a[0]   a[i],a[1]   a[i],a[2]   a[i],a[3]
  Compare     Compare     Compare     Compare
    0/1         0/1         0/1         0/1
       \        /              \        /
          Add                     Add
         0/1/2                   0/1/2
             \                     /
              \                   /
               +------- Add -----+
                    0/1/2/3/4
```

Now this is *O(log n)*.

If we allow concurrent writes to memory (hard) we can do this all in
one step - making an *O(1)* sorting algorithm!

Most sorting algorithms we see in the sequential world are "compare
and exchange." These algorithms are based on this operation, for two
items `A` and `B`:

```c
if (A > B) {
    temp = A;
    A = B;
    B = temp;
}
```

If we think of this in a message-passing system, where `A` and `B` are on different processors, *P_{1}* and *P_{2}*, we need to exchange messages to perform the operation.

One way to do this uses simple send/recv operations:

- Process *P_{1}*:

  ```
  send(&A, P2);
  recv(&A, P2);
  ```

- Process *P_{2}*:

  ```
  recv(&A, P1);
  if (A > B) {
      send(&B, P1);
      B = A;
  } else {
      send(&A, P1);
  }
  ```

Each process has one send and one receive.

Another possible approach is to have the processes exchange values, and only keep the one they're supposed to keep:

- Process *P_{1}*:

  ```
  send(&A, P2);
  recv(&B, P2);
  if (A > B) A = B;
  ```

- Process *P_{2}*:

  ```
  recv(&A, P1);
  send(&B, P1);
  if (A > B) B = A;
  ```

Remember everyone's favorite laughingstock sorting algorithm, the
**bubble sort**...

```c
for (i = n - 1; i > 0; i--) {
    for (j = 0; j < i; j++) {
        k = j + 1;
        if (a[j] > a[k]) {
            temp = a[j];
            a[j] = a[k];
            a[k] = temp;
        }
    }
}
```

This requires *O(n ^{2})* compare and exchange operations.

```
4 2 7 8 5 1 3 6
4x2 7 8 5 1 3 6
2 4x7 8 5 1 3 6
2 4 7x8 5 1 3 6
2 4 7 8x5 1 3 6
2 4 7 5 8x1 3 6
2 4 7 5 1 8x3 6
2 4 7 5 1 3 8x6
2x4 7 5 1 3 6 8
2 4x7 5 1 3 6 8
2 4 7x5 1 3 6 8
2 4 5 7x1 3 6 8
2 4 5 1 7x3 6 8
2 4 5 1 3 7x6 8
2x4 5 1 3 6 7 8
2 4x5 1 3 6 7 8
...
```

How can we parallelize this? It looks at first as though each pass depends on the previous pass being completed, but that is not the case.

As soon as the first pass has gotten to the comparison of the 3rd and 4th columns, the comparison of the 1st and 2nd columns for the second pass can begin:

```
4 2 7 8 5 1 3 6
4x2 7 8 5 1 3 6
2 4x7 8 5 1 3 6
2x4 7x8 5 1 3 6
2 4x7 8x5 1 3 6
2x4 7x5 8x1 3 6
2 4x5 7x1 8x3 6
2x4 5x1 7x3 8x6
2 4x1 5x3 7x6 8
2x1 4x3 5x6 7 8
1 2x3 4x5 6 7 8
...
```

This leads to an algorithm called an **odd-even transposition
sort**:

```
4x2 7x8 5x1 3x6
2 4x7 8x1 5x3 6
2x4 7x1 8x3 5x6
2 4x1 7x3 8x5 6
2x1 4x3 7x5 8x6
1 2x3 4x5 7x6 8
1 2 3 4 5 6 7 8
```

First, the even-numbered entries are compared with their next-highest neighbor's value and swapped if necessary, then the odd-numbered entries. Then back to even.

This takes *n* steps, and can use up to *p = n/2* processors.

If we cast this as a message-passing program:

*P_{i}* (odd), *i = 1, 3, 5, ..., n-3*:

```
send(&A, Pi-1);          /* even phase */
recv(&B, Pi-1);
if (A < B) A = B;
if (i <= n-3) {          /* odd phase */
    send(&A, Pi+1);
    recv(&B, Pi+1);
    if (A > B) A = B;
}
```

*P_{i}* (even), *i = 0, 2, 4, ..., n-2*:

```
recv(&A, Pi+1);          /* even phase */
send(&B, Pi+1);
if (A < B) B = A;
if (i >= 2) {            /* odd phase */
    recv(&A, Pi-1);
    send(&B, Pi-1);
    if (A > B) B = A;
}
```

...with appropriate while loops and determination of process rank.

The next algorithm, **bitonic mergesort**, proceeds by rearranging **bitonic sequences**.

A **bitonic sequence** is a sequence of numbers *a _{0}, a_{1}, ...,
a_{n-1}* which monotonically increases in value, reaches a single
maximum, and then monotonically decreases in value. So:

*a _{0} < a_{1} < ... < a_{i-1} < a_{i} > a_{i+1} > ... > a_{n-2} > a_{n-1}*

for some value of *i*. A sequence is also considered to be bitonic if
the relation above can be achieved by shifting the numbers cyclically.

Examples:

- 1, 2, 4, 7, 6, 0 (use *i = 3*)
- 8, 9, 2, 1, 0, 4 (rotate right by 1 or 2 slots, then *i = 2* or *i = 3*)

This is useful to us because we can rearrange a bitonic sequence into two smaller bitonic sequences:

- We apply a compare-and-exchange operation on the first element in the first half of the sequence with the first element in the second half, the second element in the first half with the second element in the second half, and so on.

After this, we have two bitonic sequences, and all elements in the
first bitonic sequence will be smaller than all elements in the
second. So we have converted the problem of rearranging a bitonic
sequence of size *n* to rearranging two bitonic sequences of *n/2*,
followed by concatenation. We apply this recursively until the whole
sequence is sorted.

```
3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0
3 5 8 9 10 12 14 0|95 90 60 40 35 23 18 20
3 5 8 0|10 12 14 9|35 23 18 20|95 90 60 40
3 0|8 5|10 9|14 12|18 20|35 23|60 40|95 90
0 3 5 8 9 10 12 14 18 20 23 35 40 60 90 95
```

This takes *log n* steps to sort an *n*-element bitonic sequence.

What can be done in parallel? The *n/2* comparisons of each step can
always be made concurrently. So if we have *O(n)* processors, we can
do this entire thing with *O(log n)* complexity.

But what if we don't have a bitonic sequence to start out? Well, we can make one:

- Two elements always form a bitonic sequence.
- Any unsorted sequence is a concatenation of bitonic sequences of size 2.
- Two bitonic sequences can be combined into one larger bitonic sequence by sorting the first in increasing order and the second in decreasing order.
- Merge these into larger and larger bitonic sequences until we have a sequence of size *n*.

Consider our example from above, but start with a completely unordered list of numbers:

```
10 20 5 9 3 8 12 14 90 0 60 40 23 35 95 18
```

First, we think of it as 8 little bitonic sequences (which it is). We sort them so that the first pair is increasing, the second is decreasing, the third is increasing, etc.

```
10 20 9 5 3 8 14 12 0 90 60 40 23 35 95 18
```

Now, we have 4 bitonic sequences of length 4. We sort each, using the approach we described previously. The first, we sort in increasing order, the second in decreasing, the third increasing and the 4th decreasing.

```
9 5 10 20 14 12 3 8 0 40 60 90 95 35 23 18
5 9 10 20 14 12 8 3 0 40 60 90 95 35 23 18
```

Now, we have two bitonic sequences of length 8. Sort these as described previously. The first in increasing order, the second in decreasing order.

```
5 9 8 3 14 12 10 20 95 40 60 90 0 35 23 18
5 3 8 9 10 12 14 20 95 90 60 40 23 35 0 18
3 5 8 9 10 12 14 20 95 90 60 40 35 23 18 0
```

Which is our starting point for the original example, leaving one bitonic sequence to be sorted.

What is the complexity to get all the way from the unordered sequence to the sorted sequence?

For our 16-element example, we can see 4 phases, each of which consists of a number of substeps for a total of 10 steps:

- Phase 1: Convert pairs of numbers into alternating increasing/decreasing sequences to get 4 bitonic sequences of size 4. This requires just one "compare and exchange" step.
- Phase 2: Sort each sequence of size 4 into alternating increasing/decreasing sequences to get 2 bitonic sequences of size 8. This requires *2 (= log_{2} 4)* compare and exchange steps.
- Phase 3: Sort each sequence of size 8 into an increasing and a decreasing sequence, to get one bitonic sequence of size 16. This requires *3 (= log_{2} 8)* compare and exchange steps.
- Phase 4: Sort the bitonic sequence, requiring *4 (= log_{2} 16)* steps.

So the total number of compare and exchange steps we need for a
sequence of size *n = 2 ^{k}* is

*1 + 2 + ... + k = k(k+1)/2 = (log n (log n + 1))/2 = O(log^{2} n)*

assuming again that we have *O(n)* processors.

This sort can be implemented in a hardware sorting network.

**Submission and Grading Guidelines**

Before your tutorial meeting, submit a plain text file `tut10.txt`
or a PDF file `tut10.pdf` that contains the outline that you will
use to help guide our discussion and your answer to the assigned
question from the text. You may choose to work with your tutorial
partner(s) or independently on this outline. Your grade for the week
will be based on the quality of this outline.