Computer Science 338
Parallel Processing

Williams College
Spring 2006

Tutorial Assignment 3: Multithreaded Matrix-Matrix Multiplication
Due: Tuesday, February 21, 2006 at 9:00 AM

Upcoming Schedule Changes

We will meet in the Unix Lab on Thursday, February 16 at 9:00 AM to make sure everyone's accounts are set up to run MPI jobs on both clusters and to discuss the MPI programming model.

There will be no regular tutorial meetings or group lab meeting the week of February 20. I will be away at a conference. Instead of the usual meeting, I would like each group to meet (perhaps during your regular slot) and compose an e-mail to me as you discuss the readings and examples. It should be the same sorts of things that we would talk about in the meeting. E-mail me this document on the day of your regular meeting.

This week's lab tasks are still due as usual on Tuesday morning. I expect to be available to help with this and next week's assignments by e-mail while I am away. I will distribute next week's readings and programming assignment before I leave.

Readings

Read Quinn Chapter 17.

Explicit Domain Decomposition

If we could improve things significantly by coarsening the parallelism in the bag of tasks, perhaps we can do even better by dividing up the entire computation ahead of time to avoid any selection from the bag of tasks whatsoever.

With the matrix-matrix multiply example, this is easy enough - we can just give SIZE/numworkers rows to each thread, they all go off and compute, and they'll all finish in about the same amount of time:

See: /usr/cs-local/share/cs338/examples/matmult_decomp

Some things to notice about this example:

During the setup, we compute the range of rows that each thread will be responsible for. We can't simply give every thread the same number of rows, in case the number of threads does not divide the matrix size evenly. The computation as shown also guarantees no more than a one-row imbalance.
We need to tell each thread its range of rows to compute. But thread functions are restricted to a single parameter, of type void *. We can use this pointer to pass in anything we want, by putting all of the parameters into a structure.
Notice that each thread needs its own copy of this structure - we can't just create a single copy, send it to pthread_create(), change the values, and use it again. Why?

Explicit domain decomposition works out well in this example, since there's an easy way to break it up (by rows), and each row's computation is equal in cost.

Is there any advantage to breaking it down by columns instead? How about, in the case of 4 threads, into quadrants?

In more complicated examples, load balancing with a domain decomposition may be more difficult. We will consider such examples later in the semester.

Divide and Conquer or Recursive Parallelism

A third major approach to data parallel computation is to take the divide and conquer approach of many of your favorite algorithms, but instead of taking each subproblem and solving one after the other, we solve the subproblems concurrently, with multiple threads/processes. Recursive calls can be made concurrently only if the calls write to disjoint parts of the program's data.

First, consider a parallel summation. We have an array of size n and we want to compute the sum. Sequentially, this takes n-1 steps, but if we can do things in parallel, we can reduce this to log₂ n steps. How would you approach this? How many processors do you think you could make use of?

Consider a similar idea, but instead of just getting a global sum, we want a prefix sum - each array element gets the sum of itself and all of its predecessors. What approach would you suggest?

An example of a more interesting computation that we might consider parallelizing using recursive parallelism is the adaptive quadrature problem. Quadrature can be used to approximate the value of an integral of a function f(x) from a to b (the area between f(x) and the x-axis in the interval [a,b]) by a sum of approximations of the area of smaller subintervals. Each subinterval may be approximated by a trapezoidal rule.

The straightforward (non-adaptive) approach uses fixed width subintervals, and works like this:

  double f_left = f(a), f_right, area = 0.0;
  double width = (b-a) / NUM_INTERVALS;
  for (x=a+width; x<=b; x+=width) {
    f_right = f(x);
    area += (f_left+f_right) * width / 2;
    f_left = f_right;
  }

The more subintervals, the more accurate the approximation.

But the integrals of some functions are much more difficult to approximate than others. In areas of larger curvature, larger subintervals will lead to larger approximation errors.

If we allow different sized subintervals in different parts of the large interval, we can get the same accuracy, but with much less work. We can compute appropriately-sized intervals adaptively with a routine such as this:

  double quad(double left, double right, 
              double f_left, double f_right,
              double lr_area) {
    double mid = (left + right) / 2;
    double f_mid = f(mid);
    double l_area = (f_left+f_mid) * (mid-left) / 2;
    double r_area = (f_mid+f_right) * (right-mid) / 2;
    if (abs((l_area+r_area) - lr_area) > EPSILON) {
       // recurse to integrate both halves
       l_area = quad(left, mid, f_left, f_mid, l_area);
       r_area = quad(mid, right, f_mid, f_right, r_area);
    }
    return (l_area + r_area);
  }

Which is called initially as:

  area = quad(a, b, f(a), f(b), (f(a)+f(b))*(b-a)/2);

This procedure starts with an approximation using a single subinterval, then recursively refines the approximation until it is determined that a recursive step has not improved the approximation enough to justify the next recursive step. This happens when the difference between the new refined approximation and the previous approximation fall below the threshold EPSILON. At each recursive step, we divide the current subinterval in half.

A simple program to compute the integral of a function using adaptive quadrature:

See: /usr/cs-local/share/cs338/examples/adapt_quad

How can we think about parallelizing this? If we want to take the recursive parallelism approach, the place to look is the recursive calls.

Are these calls independent? Yes, except for the counter of the number of function calls, but that is only to get an feel for how much work the computation is doing, and can be eliminated. Each recursive call operates only with its formal parameters and local variables. So we can execute the recursive calls concurrently. We just need to make sure both calls finish before we return the values.

Suppose we create a new thread for each recursive calls. That's a lot of threads! Once we get much beyond the number of processors available, extra threads will not add much besides extra overhead. To program this efficiently, we would need to stop creating new threads after a few steps and instead make regular recursive function calls.

Other Approaches to Adaptive Quadrature

What about tackling this problem with other approaches to parallelism? Bag of tasks? Explicit decomposition? How hard would it be to program each of these? What efficiency concerns arise?

A few things to consider for the bag of tasks approach:

We don't know all the tasks at the start! Each worker would grab a task from the bag, but may generate two new tasks!
How do we decide when we're done? Just because the bag of tasks is empty when a worker looks doesn't mean than another worker won't put a new task into the bag.
How do we accumulate the result? We can't "wait" for the tasks to compute subintervals to finish, since they just get into the bag of tasks. We need to accumulate the result only when we get to a task that doesn't generate two new tasks.

OpenMP

We have seen how to implement algorithms using pthreads - shared variables can be declared globally, private variables can be declared local to our thread function or functions called by the thread function.

But a compiler should be able to do a little of this work for us - convert some higher level constructs into appropriate thread calls and mutex locks.

We saw that the Sun compilers could do some parallelization completely automatically, but this doesn't always work, and it may not choose to parallelize the way we would like. A standard set of compiler directives and library functions called OpenMP have been developed to allow programmers to specify parallelism.

Chapter 17 of Quinn describes OpenMP. We will go through some examples here.

As always, we start with a "Hello, world" program:

See: /usr/cs-local/share/cs338/examples/openmp_hello

Things to note about the OpenMP hello world:

We include omp.h, the OpenMP header file.
We have some odd syntax just inside of main() that starts a parallel block:
```
#pragma omp parallel private(nthreads, tid)
```
It is a preprocessor directive, so the initial pass of the C compiler will replace this with some code to start up a number of tasks.
It means the block that follows is a parallel block which should give private copies to each task of the variables nthreads and tid.
Note the seemlingly extraneous curly braces following the #pragma. They define the extent of the parallel block.
Two self-explanatory query functions are called inside the parallel block:
- omp_get_thread_num()
- omp_get_num_threads()
To compile this on Solaris (bullpen), we need to use Sun's cc with options -xO3 and -xopenmp=parallel.
To request a number of threads to be created, we set the environment variable OMP_NUM_THREADS to the number of threads we want. The system will only start a number of threads up to the number of available processors, by default.

Another example, back to the matrix-matrix multiply:

See: /usr/cs-local/share/cs338/examples/matmult_openmp

Again, we include omp.h but the only other change from a straightforward serial version is the

#pragma omp parallel for

which tells OpenMP to parallelize the upcoming for loop.

Setting the maximum number of threads (OMP_NUM_THREADS environment variable or the omp_set_num_threads() function requests a certain number of threads. You are not guaranteed to get that many.
When a thread reaches a #pragma omp parallel directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team.
Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code.
There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point.

We can also take more control, similar to the way we did with pthreads:

Explicit domain decomposition:
See: /usr/cs-local/share/cs338/examples/matmult_omp_explicit
Things to note:
- The simple thread creation and destruction, calling worker()
- The worker has a bunch of local variables, all of which are private to the calling thread
- The computation of the row range for each thread, based on numthreads and threadid
Bag of tasks:
See: /usr/cs-local/share/cs338/examples/matmult_omp_bagoftasks
Things to note:
- No worker function here - the threads are defined by the block inside the { ... } pair following the #pragma omp parallel directive.
- The shared and private clauses tell OpenMP which variables that we use inside the multithreaded block should be shared or private to each thread.
- The critical section for the concurrent access to next_avail_task is taken care of by the #pragma omp critical(mutex) directives. By naming the critical sections with the name "mutex" they are essentially the same critical section (as both deal with the same variable). The name "mutex" here is just a name - it could be anything. This replaces all the declarations, initialization, locking and unlocking, and destruction of the mutex in the pthread version.

More clauses to parallel directives

A parallel directive can take a number of clauses to define how variables are to be treated.

private(variables): indicate that the variables in the list are to be private to each thread created.
Any previous value is not seen by the threads, and that value is still there when the parallel block ends.
See: /usr/cs-local/share/cs338/examples/openmp_private
shared(variables): indicate that the variables in the list are to be shared among all threads created.
Any previous value is seen by all threads, and any changes made by threads will persist when the parallel block ends.
See: /usr/cs-local/share/cs338/examples/openmp_shared
firstprivate(variables): Give the copies of the variables within each thread the initial value which is the the value the variable had outside the parallel block.
lastprivate(variables): Take the values of the variables in the "last" thread (for a parallel for loop, or parallel sections) and store that in the variable outside the parallel block.
reduction(op:variable): Perform a reduction on the variable and store the reduced value in the variable when the parallel block finishes.
See: /usr/cs-local/share/cs338/examples/openmp_reduction
What is a reduction? Basically, the operator is applied to combine the given variable's value in each thread, and the overall result is stored in the variable when the parallel block returns.
We'll see many examples when we talk about message passing. Here's one that's a little more interesting than the made up example above:
See: /usr/cs-local/share/cs338/examples/matmult_omp_explicit2

Other parallel directives

There are several other directives worth looking at a bit:

sections:
Define sections of code (that aren't a loop) that can be executed concurrently. An overly simplistic example:
See: /usr/cs-local/share/cs338/examples/openmp_sections
Each defined section is a block that can be assigned to a thread.
This is useful when we have different tasks to assign to each thread created.
single:
Used within a parallel block, this specifies that the block inside the single should be executed by exactly one thread.
master:
This is a lot like single, but we are guaranteed that the master thread does the execution.
critical:
We've seen this - it defines a critical section.
barrier:
Used within a parallel block, this causes the threads to synchronize at this point. This could be used, for example, to make sure that the threads all complete some preliminary computation before moving on to their next step.
atomic:
Force a simple statement that modifies a single variable to be atomic. It is essentially a critical section, but since it is more restrictive, the compiler may choose more efficient techniques.

Yet another matrix-matrix multiply example that uses some of these:

See: /usr/cs-local/share/cs338/examples/matmult_omp_explicit3

Lab Tasks

There are several files to turn in for this assignment. They should all be included in a file named tut03.tar that you submit using the turnin utility. Please use the filenames specified and be sure to include your name in each file.

Write a C or C++ program using pthreads that implements a domain decomposition approach to the matrix-matrix multiplication example. However, instead of dividing up the work by rows, your program should divide up the work by assigning square submatrices to each thread. To simplify things, you may assume that the number of threads is a perfect square, and that the square root of the number of threads evenly divides the number of rows and columns in the matrices. You may also define the number of threads as a constant in your program rather than on the command line. You may use one of the demo programs as your starting point.
Once your program works, run for 750 × 750 matrices, using 4 threads on the four-processor nodes of bullpen (wetteland or rivera; ppn=4 in PBS). Submit these runs through PBS. Compare the timings for the compute phase of your submatrix decomposition implementation with those for each of the multithreaded matrix-matrix multiplication examples. Explain the differences in running times. Write your answer to this part in a plain text file tut03.txt.
Convert your program to use OpenMP instead of pthreads. Compare the timings for the compute phase with the other OpenMP matrix-matrix multiply examples. Include a discussion of these results in tut03.txt. Also discuss how OpenMP compares with pthreads in efficiency and ease of programming.
Our multithreading capabilities are limited by the fact that we have nodes containing up to 4 processors. Next week, we'll start looking at message passing which will allow us to make use of multiple nodes. But for now, imagine you had access to a shared memory multiprocessor system with dozens or even hundreds of processors. Which of the approaches we have discussed do you think would be most effective? Why? Write your answer to this part in the file tut03.txt.

Your submitted tar file should include the tut03.txt file and subdirectories named pthreads and openmp, each of which contains an appropriate Makefile, your C or C++ source code (including the timer code from class, if you choose to use it), your PBS script(s), a brief README file expaining how to run your program. Please do not include object files or your executable.

Honor code guidelines: While the programming is to be done individually, along the lines of a laboratory program, I want to encourage you to ask questions and discuss the program with me and with each other as you develop it. However, no sharing of code is permitted. If you have any doubts, please check first and avoid honor code problems later.

Grading guidelines: Your grade for the programming will be determined by correctness, design, documentation, and style, as well as the presentation of your timing results.

Computer Science 338 Parallel Processing

Williams College Spring 2006