Computer Science 341-02
Assignment 3: Multithreaded Matrix-Matrix Multiplication
Due: 9:00 AM, Tuesday, October 2, 2007
Read Quinn Chapter 17.
Since we were able to improve things significantly by coarsening the parallelism in the bag of tasks, perhaps we can do even better by dividing up the entire computation ahead of time, avoiding any selection from the bag of tasks whatsoever.
With the matrix-matrix multiply example, this is easy enough - we can just give SIZE/numworkers rows to each thread, they all go off and compute, and they'll all finish in about the same amount of time:
See: /cluster/home/terescoj/shared/cs341/examples/matmult_decomp
Some things to notice about this example:
Explicit domain decomposition works out well in this example, since there's an easy way to break it up (by rows), and each row's computation is equal in cost.
Is there any advantage to breaking it down by columns instead? How about, in the case of 4 threads, into quadrants?
In more complicated examples, load balancing with a domain decomposition may be more difficult. We will consider such examples later in the semester.
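In outline, the row-wise decomposition might be coded something like this (a sketch with illustrative names such as worker, numworkers, and MAXWORKERS; the matmult_decomp example above differs in its details, and this assumes SIZE is evenly divisible by numworkers):

/* worker thread: compute only its own block of rows of c = a * b,
   where a, b, c are global SIZE x SIZE matrices */
void *worker(void *arg) {
    long id = (long) arg;              /* thread id passed at creation */
    int rows = SIZE / numworkers;
    int low = id * rows, high = low + rows;
    int i, j, k;

    for (i = low; i < high; i++)       /* only this thread's rows */
        for (j = 0; j < SIZE; j++) {
            c[i][j] = 0;
            for (k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return NULL;
}

/* main thread: create the workers, then wait for them all to finish */
pthread_t threads[MAXWORKERS];
long w;
for (w = 0; w < numworkers; w++)
    pthread_create(&threads[w], NULL, worker, (void *) w);
for (w = 0; w < numworkers; w++)
    pthread_join(threads[w], NULL);

Since every thread gets the same amount of work and no coordination is needed until the joins, there is no bag-of-tasks bookkeeping at all.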
A third major approach to data-parallel computation is to take the divide-and-conquer approach of many of your favorite algorithms, but instead of solving the subproblems one after the other, we solve them concurrently, with multiple threads/processes. Recursive calls can be made concurrently only if the calls write to disjoint parts of the program's data.
First, consider a parallel summation. We have an array of size n and we want to compute the sum. Sequentially, this takes n-1 steps, but if we can do things in parallel, we can reduce this to log2 n steps. How would you approach this? How many processors do you think you could make use of?
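One common answer is a tree-based combining pattern: add elements in pairs, then add the partial sums in pairs, and so on, for log2 n levels. A rough sketch of the pattern follows (written as sequential code; the iterations of the inner loop at each level are independent, so each could be handed to a separate thread; assumes n is a power of two):

/* tree-based summation: after level with stride s, a[i] holds the sum
   of the 2*s elements starting at i, so a[0] ends up with the total */
double tree_sum(double *a, int n) {
    int stride, i;
    for (stride = 1; stride < n; stride *= 2)
        for (i = 0; i + stride < n; i += 2 * stride)
            a[i] += a[i + stride];   /* these additions are independent */
    return a[0];
}

With n/2 processors, each level's additions can all proceed at once, giving the log2 n step count.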
Consider a similar idea, but instead of just getting a global sum, we want a prefix sum - each array element gets the sum of itself and all of its predecessors. What approach would you suggest?
An example of a more interesting computation that we might consider parallelizing using recursive parallelism is the adaptive quadrature problem. Quadrature can be used to approximate the value of an integral of a function f(x) from a to b (the area between f(x) and the x-axis in the interval [a,b]) by a sum of approximations of the area of smaller subintervals. Each subinterval may be approximated by a trapezoidal rule.
The straightforward (non-adaptive) approach uses fixed width subintervals, and works like this:
double f_left = f(a), f_right, area = 0.0;
double width = (b-a) / NUM_INTERVALS;
double x;

for (x = a + width; x <= b; x += width) {
    f_right = f(x);
    area += (f_left + f_right) * width / 2;
    f_left = f_right;
}
The more subintervals, the more accurate the approximation.
But the integrals of some functions are much more difficult to approximate than others. In areas of larger curvature, larger subintervals will lead to larger approximation errors.
If we allow different sized subintervals in different parts of the large interval, we can get the same accuracy, but with much less work. We can compute appropriately-sized intervals adaptively with a routine such as this:
double quad(double left, double right,
            double f_left, double f_right, double lr_area) {
    double mid = (left + right) / 2;
    double f_mid = f(mid);
    double l_area = (f_left + f_mid) * (mid - left) / 2;
    double r_area = (f_mid + f_right) * (right - mid) / 2;

    if (fabs((l_area + r_area) - lr_area) > EPSILON) {
        // recurse to integrate both halves
        l_area = quad(left, mid, f_left, f_mid, l_area);
        r_area = quad(mid, right, f_mid, f_right, r_area);
    }
    return (l_area + r_area);
}
This routine is called initially as:
area = quad(a, b, f(a), f(b), (f(a)+f(b))*(b-a)/2);
This procedure starts with an approximation using a single subinterval, then recursively refines the approximation until it is determined that a recursive step has not improved the approximation enough to justify the next recursive step. This happens when the difference between the new refined approximation and the previous approximation falls below the threshold EPSILON. At each recursive step, we divide the current subinterval in half.
A simple program to compute the integral of a function using adaptive quadrature:
See: /cluster/home/terescoj/shared/cs341/examples/adapt_quad
How can we think about parallelizing this? If we want to take the recursive parallelism approach, the place to look is the recursive calls.
Are these calls independent? Yes, except for the counter of the number of function calls, but that is only there to get a feel for how much work the computation is doing, and can be eliminated. Each recursive call operates only with its formal parameters and local variables. So we can execute the recursive calls concurrently. We just need to make sure both calls finish before we return their combined value.
Suppose we create a new thread for each recursive call. That's a lot of threads! Once we get much beyond the number of processors available, extra threads will not add much besides extra overhead. To program this efficiently, we would need to stop creating new threads after a few steps and instead make regular recursive function calls, as in the sketch below.
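One way to do this, assuming the f() and quad() routines above, is to pass a depth along with the interval and fall back to plain recursion once the depth limit is reached. The names quad_args, quad_thread, and MAX_DEPTH here are illustrative, not from the course example:

#include <pthread.h>

#define MAX_DEPTH 3    /* at most 2^MAX_DEPTH concurrent leaf calls */

struct quad_args {
    double left, right, f_left, f_right, lr_area;
    int depth;
    double result;
};

void *quad_thread(void *arg) {
    struct quad_args *qa = (struct quad_args *) arg;

    if (qa->depth >= MAX_DEPTH) {
        /* enough threads exist: finish with ordinary recursive calls */
        qa->result = quad(qa->left, qa->right, qa->f_left, qa->f_right,
                          qa->lr_area);
        return NULL;
    }

    /* otherwise split the interval and give each half its own thread */
    double mid = (qa->left + qa->right) / 2;
    double f_mid = f(mid);
    struct quad_args l = { qa->left, mid, qa->f_left, f_mid,
                           (qa->f_left + f_mid) * (mid - qa->left) / 2,
                           qa->depth + 1, 0.0 };
    struct quad_args r = { mid, qa->right, f_mid, qa->f_right,
                           (f_mid + qa->f_right) * (qa->right - mid) / 2,
                           qa->depth + 1, 0.0 };
    pthread_t lt, rt;
    pthread_create(&lt, NULL, quad_thread, &l);
    pthread_create(&rt, NULL, quad_thread, &r);
    pthread_join(lt, NULL);     /* both halves must finish before combining */
    pthread_join(rt, NULL);
    qa->result = l.result + r.result;
    return NULL;
}

This sketch always splits down to MAX_DEPTH before the adaptive test begins, which only refines further than the serial version would; the key point is that the two pthread_join calls guarantee both halves complete before their results are combined.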
What about tackling this problem with other approaches to parallelism? Bag of tasks? Explicit decomposition? How hard would it be to program each of these? What efficiency concerns arise?
A few things to consider for the bag of tasks approach:
We have seen how to implement algorithms using pthreads - shared variables can be declared globally, private variables can be declared local to our thread function or functions called by the thread function.
But a compiler should be able to do a little of this work for us - convert some higher level constructs into appropriate thread calls and mutex locks.
We saw that the Sun compilers could do some parallelization completely automatically, but this doesn't always work, and it may not choose to parallelize the way we would like. A standard set of compiler directives and library functions called OpenMP has been developed to allow programmers to specify parallelism.
Chapter 17 of Quinn describes OpenMP. We will go through some examples here.
As always, we start with a "Hello, world" program:
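A minimal OpenMP hello world looks roughly like this (a sketch; it may differ in details from the course example below):

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        /* each thread in the team executes this block */
        printf("Hello, world, from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}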
See: /cluster/home/terescoj/shared/cs341/examples/openmp_hello
Things to note about the OpenMP hello world:
Another example, back to the matrix-matrix multiply:
See: /cluster/home/terescoj/shared/cs341/examples/matmult_openmp
Again, we include omp.h, but the only other change from a straightforward serial version is the
#pragma omp parallel for
which tells OpenMP to parallelize the upcoming for loop.
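In context, the parallelized loop nest might look roughly like this (a sketch; SIZE and the matrices a, b, c are assumed to be declared as in the serial version, and loop indices declared inside the loops are automatically private to each thread):

#pragma omp parallel for
for (int i = 0; i < SIZE; i++) {
    for (int j = 0; j < SIZE; j++) {
        c[i][j] = 0;
        for (int k = 0; k < SIZE; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}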
We can also take more control, similar to the way we did with pthreads:
A parallel directive can take a number of clauses to define how variables are to be treated.
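For instance, a hand-decomposed version might use shared and private clauses along these lines (a sketch with illustrative variable names; the course's matmult_omp_explicit example differs in its details):

int i, j, k, myid, nthreads, low, high;
#pragma omp parallel shared(a, b, c) private(i, j, k, myid, nthreads, low, high)
{
    myid = omp_get_thread_num();
    nthreads = omp_get_num_threads();
    low = myid * SIZE / nthreads;          /* this thread's block of rows */
    high = (myid + 1) * SIZE / nthreads;

    for (i = low; i < high; i++)
        for (j = 0; j < SIZE; j++) {
            c[i][j] = 0;
            for (k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}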
There are several other directives worth looking at a bit:
Yet another matrix-matrix multiply example that uses some of these:
See: /cluster/home/terescoj/shared/cs341/examples/matmult_omp_explicit
Lab Tasks
There are several files to turn in for this assignment. They should all be included in a file named assignment03.tar that you submit as an e-mail attachment. Please use the filenames specified and be sure to include your name in each file.
Your submitted tar file should include the assignment03.txt file and subdirectories named pthreads and openmp, each of which contains an appropriate Makefile, your C or C++ source code (including the timer code from class, if you choose to use it), your PBS script(s), and a brief README file explaining how to run your program. Please do not include object files or your executable.
Honor code guidelines: While the programming is to be done individually, I want to encourage you to ask questions and discuss the program with me and with each other as you develop it. However, no sharing of code is permitted. If you have any doubts, please check first and avoid honor code problems later.
Grading guidelines: Your grade for the programming will be determined by correctness, design, documentation, and style, as well as the presentation of your timing results.