The Jordan and Alaghband text uses a running example of the computation of a matrix-matrix multiply. We will consider that same example to get started.
We start with a serial implementation of a matrix-matrix multiply:
matmult.tgz (also available in /home/faculty/terescoj/shared/cs338/lect02)
If we compile and run this program, it reports initialization and matrix multiplication times. Initialization is just filling in matrices a and b. Then we compute the value of each element of c using the dot product of the corresponding row of a and column of b.
Pop quiz: what is the complexity of matrix-matrix multiply?
The running time on bullpen is around 42 seconds.
We find opportunities for parallelism by looking for parts of the sequential program that can be run in any order.
But first, step back and look at a simpler example:
1: a = 10;
2: b = a + 5;
3: c = a - 3;
4: b = 7;
5: a = 3;
6: b = c - a;
7: print a, b, c;
Which statements can be run in a different order (or concurrently) but still produce the same answers at the end?
This is formalized in Jordan and Alaghband, Section 1.5, Equation 1.1, known as Bernstein's conditions. More on this in a homework.
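Briefly (the precise statement is in the text): if two statements S1 and S2 have input sets I1, I2 (the variables they read) and output sets O1, O2 (the variables they write), they may execute concurrently when I1 ∩ O2, I2 ∩ O1, and O1 ∩ O2 are all empty. In the fragment above, statements 2 and 3 satisfy the conditions (both only read a, and they write different variables), while statements 5 and 6 do not (statement 5 writes a, which statement 6 reads).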
Back to our example, let's see what can be done concurrently.
  /* initialize matrices, just fill with junk */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      a[i][j] = i+j;
      b[i][j] = i-j;
    }
  }

  /* matrix-matrix multiply */
  for (i=0; i<SIZE; i++) {        /* for each row */
    for (j=0; j<SIZE; j++) {      /* for each column */
      /* initialize result to 0 */
      c[i][j] = 0;
      /* perform dot product */
      for (k=0; k<SIZE; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
      }
    }
  }

  sum = 0;
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      sum += c[i][j];
    }
  }
The initialization can all be done in any order: each (i,j) combination is independent of the others, and the assignments to a[i][j] and b[i][j] can be done in either order.
In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before its dot product can start to be accumulated. Also, iteration k of the inner loop for c[i][j] can only be done after a[i][k] and b[k][j] have been initialized.
Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed, and after sum has been initialized to 0.
That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.
But all of these dependencies in this case can be determined by a relatively straightforward computation. Seems like a job for a compiler!
In the example, if we add the flag -xparallel to the compile command, the Sun compiler will determine what can be done in parallel and generate code to support it. With this executable, we can request a number of parallel processes by setting the environment variable PARALLEL. For example:
setenv PARALLEL 4
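The compile step itself would look something like the line below (assuming the source file is matmult.c; the exact flags in the provided Makefile may differ).

cc -xparallel -o matmult matmult.c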
If we run this program on a node with multiple processors, hopefully it will actually run faster. This is an example of speedup. Jordan and Alaghband formally define speedup and efficiency on pp. 36-37.
A perfectly efficient program is one that exhibits linear speedup: double the number of processors and the running time is halved.
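Using the usual definitions (which should match the text's): if T(p) is the running time on p processors, then the speedup is S(p) = T(1)/T(p) and the efficiency is E(p) = S(p)/p. Linear speedup means S(p) = p, or equivalently E(p) = 1.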
The theoretical upper bound on speedup for p processors is p. Anything greater is called superlinear speedup - can this happen?
Try also setting PARALLEL to a value greater than the number of available processors.
We will return to this example and parallelize it by hand.
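As a rough preview of what that might look like, here is a minimal sketch using POSIX threads that divides the rows of c among a fixed number of threads. This is not the program we will develop later; the matrix element type, the value of SIZE, and the names NTHREADS and multiply_rows are assumptions made for illustration.

#include <pthread.h>
#include <stdio.h>

#define SIZE 1000        /* assumed; the actual program's SIZE may differ */
#define NTHREADS 4       /* illustrative name, not from the original code */

double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];   /* element type assumed */

/* each thread computes a contiguous block of rows of c;
   each c[i][j] is independent of the others, so no locking is needed */
void *multiply_rows(void *arg) {
  long t = (long)arg;
  int first = t * (SIZE/NTHREADS);
  int last = (t == NTHREADS-1) ? SIZE : first + (SIZE/NTHREADS);
  int i, j, k;

  for (i=first; i<last; i++) {
    for (j=0; j<SIZE; j++) {
      c[i][j] = 0;
      for (k=0; k<SIZE; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
      }
    }
  }
  return NULL;
}

int main() {
  pthread_t threads[NTHREADS];
  long t;
  int i, j;
  double sum;

  /* phase 1: initialize a and b; iterations are independent,
     but all must finish before the multiply begins */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      a[i][j] = i+j;
      b[i][j] = i-j;
    }
  }

  /* phase 2: compute c, with rows divided among the threads */
  for (t=0; t<NTHREADS; t++)
    pthread_create(&threads[t], NULL, multiply_rows, (void *)t);
  for (t=0; t<NTHREADS; t++)
    pthread_join(threads[t], NULL);   /* wait: sum needs all of c */

  /* phase 3: accumulate the sum, entirely after c is computed */
  sum = 0;
  for (i=0; i<SIZE; i++)
    for (j=0; j<SIZE; j++)
      sum += c[i][j];

  printf("sum = %f\n", sum);
  return 0;
}

The join before the sum loop enforces the coarse-grained ordering described above: no contribution to sum is added until all of c has been computed.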
But not everything can be parallelized by the compiler:
matmult-serial-init.tgz (also available in /home/faculty/terescoj/shared/cs338/lect02)
The new initialization code:
  /* initialize matrices, but now most entries depend on a previously
     computed entry */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      if ((i == 0) || (j == 0)) {
        a[i][j] = i+j;
        b[i][j] = i-j;
      }
      else {
        /* loop-carried dependence: each entry needs the diagonally
           previous entry, computed on an earlier iteration */
        a[i][j] = a[i-1][j-1] + i + j;
        b[i][j] = b[i-1][j-1] + i - j;
      }
    }
  }
cannot be parallelized by the compiler because of the loop-carried dependence on a[i-1][j-1] and b[i-1][j-1], so no matter how many processors we throw at it, we can't speed this part up.
Any parallel program will have some fraction f of its running time that cannot be parallelized, leaving a fraction (1-f) that may be parallelized. This means that at best, we can expect the running time on p processors to be f + (1-f)/p times the serial running time. This is known as Amdahl's Law.
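In terms of speedup, this gives S(p) = 1 / (f + (1-f)/p), which can never exceed 1/f no matter how large p gets. For example, if 10% of the running time is inherently serial (f = 0.1), the best possible speedup is 10.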
Automatic parallelization is great, when it's possible. We got it for free (at least once we bought the compiler)! It does have limitations, though: