The Jordan and Alaghband text uses a running example of the computation of a matrix-matrix multiply. We will consider that same example to get started.
We start with a serial implementation of a matrix-matrix multiply:
matmult.tgz (also available in /home/faculty/terescoj/shared/cs338/lect02)
If we compile and run this program, it reports initialization and matrix multiplication times. Initialization is just filling in matrices a and b. Then we compute the value of each element of c using the dot product of the corresponding row of a and column of b.
Pop quiz: what is the complexity of matrix-matrix multiply?
The running time on bullpen is around 42 seconds.
We find opportunities for parallelism by looking for parts of the sequential program that can be run in any order.
But first, step back and look at a simpler example:
1: a = 10;
2: b = a + 5;
3: c = a - 3;
4: b = 7;
5: a = 3;
6: b = c - a;
7: print a, b, c;
Which statements can be run in a different order (or concurrently) but still produce the same answers at the end?
This is formalized in Jordan and Alaghband, Section 1.5, Equation 1.1, known as Bernstein's conditions. More on this in a homework.
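Briefly (the precise statement is in the text): if two statements S1 and S2 have input sets I1, I2 (the variables they read) and output sets O1, O2 (the variables they write), they may execute concurrently when I1 ∩ O2, I2 ∩ O1, and O1 ∩ O2 are all empty. In the fragment above, statements 2 and 3 satisfy the conditions (both only read a, and they write different variables), while statements 5 and 6 do not (statement 5 writes a, which statement 6 reads).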
Back to our example, let's see what can be done concurrently.
  /* initialize matrices, just fill with junk */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      a[i][j] = i+j;
      b[i][j] = i-j;
    }
  }

  /* matrix-matrix multiply */
  for (i=0; i<SIZE; i++) {        /* for each row */
    for (j=0; j<SIZE; j++) {      /* for each column */
      /* initialize result to 0 */
      c[i][j] = 0;
      /* perform dot product */
      for (k=0; k<SIZE; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
      }
    }
  }

  sum = 0;
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      sum += c[i][j];
    }
  }
The initialization can all be done in any order: each (i,j) combination is independent of the others, and the assignments to a[i][j] and b[i][j] can be done in either order.
In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before its dot product can start to be accumulated. Also, iteration k of the inner loop for c[i][j] can only be done after a[i][k] and b[k][j] have been initialized.
Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed, and after sum has been initialized to 0.
That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.
But all of these dependencies in this case can be determined by a relatively straightforward computation. Seems like a job for a compiler!
In the example, if we add the flag -xparallel to the compile command, the Sun compiler will determine what can be done in parallel and generate code to support it. With this executable, we can request a number of parallel processes by setting the environment variable PARALLEL. For example:
setenv PARALLEL 4
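The compile step itself would look something like the line below (assuming the source file is matmult.c; the exact flags in the provided Makefile may differ).

cc -xparallel -o matmult matmult.c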
If we run this program on a node with multiple processors, hopefully it will actually run faster. This is an example of speedup. Jordan and Alaghband formally define speedup and efficiency on pp. 36-37.
A perfectly efficient program is one that exhibits linear speedup: double the number of processors and the running time is halved.
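Using the usual definitions (which should match the text's): if T(p) is the running time on p processors, then the speedup is S(p) = T(1)/T(p) and the efficiency is E(p) = S(p)/p. Linear speedup means S(p) = p, or equivalently E(p) = 1.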
The theoretical upper bound on speedup for p processors is p. Anything greater is called superlinear speedup - can this happen?
Try also setting PARALLEL to a value greater than the number of available processors.
We will return to this example and parallelize it by hand.
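As a rough preview of what that might look like, here is a minimal sketch using POSIX threads that divides the rows of c among a fixed number of threads. This is not the program we will develop later; the matrix element type, the value of SIZE, and the names NTHREADS and multiply_rows are assumptions made for illustration.

#include <pthread.h>
#include <stdio.h>

#define SIZE 1000        /* assumed; the actual program's SIZE may differ */
#define NTHREADS 4       /* illustrative name, not from the original code */

double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];   /* element type assumed */

/* each thread computes a contiguous block of rows of c;
   each c[i][j] is independent of the others, so no locking is needed */
void *multiply_rows(void *arg) {
  long t = (long)arg;
  int first = t * (SIZE/NTHREADS);
  int last = (t == NTHREADS-1) ? SIZE : first + (SIZE/NTHREADS);
  int i, j, k;

  for (i=first; i<last; i++) {
    for (j=0; j<SIZE; j++) {
      c[i][j] = 0;
      for (k=0; k<SIZE; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
      }
    }
  }
  return NULL;
}

int main() {
  pthread_t threads[NTHREADS];
  long t;
  int i, j;
  double sum;

  /* phase 1: initialize a and b; iterations are independent,
     but all must finish before the multiply begins */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      a[i][j] = i+j;
      b[i][j] = i-j;
    }
  }

  /* phase 2: compute c, with rows divided among the threads */
  for (t=0; t<NTHREADS; t++)
    pthread_create(&threads[t], NULL, multiply_rows, (void *)t);
  for (t=0; t<NTHREADS; t++)
    pthread_join(threads[t], NULL);   /* wait: sum needs all of c */

  /* phase 3: accumulate the sum, entirely after c is computed */
  sum = 0;
  for (i=0; i<SIZE; i++)
    for (j=0; j<SIZE; j++)
      sum += c[i][j];

  printf("sum = %f\n", sum);
  return 0;
}

The join before the sum loop enforces the coarse-grained ordering described above: no contribution to sum is added until all of c has been computed.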
But not everything can be parallelized by the compiler:
matmult-serial-init.tgz (also available in /home/faculty/terescoj/shared/cs338/lect02)
The new initialization code:
  /* initialize matrices, but now most entries depend on a previously
     computed entry */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      if ((i == 0) || (j == 0)) {
        a[i][j] = i+j;
        b[i][j] = i-j;
      }
      else {
        /* loop-carried dependence: each entry needs the diagonally
           previous entry, computed on an earlier iteration */
        a[i][j] = a[i-1][j-1] + i + j;
        b[i][j] = b[i-1][j-1] + i - j;
      }
    }
  }
cannot be parallelized by the compiler because of the loop-carried dependence on a[i-1][j-1] and b[i-1][j-1], so no matter how many processors we throw at it, we can't speed this part up.
Any parallel program will have some fraction f of its running time that cannot be parallelized, leaving a fraction (1-f) that may be parallelized. This means that at best, we can expect the running time on p processors to be f + (1-f)/p times the serial running time. This is known as Amdahl's Law.
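In terms of speedup, this gives S(p) = 1 / (f + (1-f)/p), which can never exceed 1/f no matter how large p gets. For example, if 10% of the running time is inherently serial (f = 0.1), the best possible speedup is 10.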
Automatic parallelization is great, when it's possible. We got it for free (at least once we bought the compiler)! It does have limitations, though: