Lecture 5 - More Multithreading


Agenda

Announcements

Approaches to Data Parallel Computation

We have looked in some detail at the "Bag of Tasks" and Domain Decomposition approaches to data parallel computation. We'll continue with a very different kind of approach.

Divide and Conquer or Recursive Parallelism

A third major approach to data parallel computation is to take the divide and conquer approach of many of your favorite algorithms, but instead of solving the subproblems one after the other, we solve them concurrently, with multiple threads/processes. Recursive calls can be made concurrently only if the calls write to disjoint parts of the program's data.

One of the main examples in Chapter 2 of Jordan and Alaghband is the parallel prefix computation. This provides another example where recursive parallelism is often appropriate. Read back over Section 2.3 for more details.

First, consider a parallel summation. We have an array of size n and we want to compute the sum. Sequentially, this takes n-1 steps, but if we can do things in parallel, we can reduce this to log2 n steps.
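As a rough sketch of that pairwise-combining pattern, written here as ordinary sequential C (in a parallel version, the iterations of the inner loop within a pass are independent and could each be done by a different thread):

  /* Sketch: tree-style summation of a[0..n-1] (n a power of two for
     simplicity).  Each pass of the outer loop is one of the log2(n)
     parallel steps; the inner-loop iterations within a pass touch
     disjoint elements, so they could all run concurrently.
     Note: the partial sums overwrite a[], and the total ends up in a[0]. */
  double tree_sum(double *a, int n) {
    int stride, i;
    for (stride = 1; stride < n; stride *= 2)
      for (i = 0; i + stride < n; i += 2*stride)
        a[i] += a[i+stride];
    return a[0];
  }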

Consider a similar idea, but instead of just getting a global sum, we want a prefix sum - each array element gets the sum of itself and all of its predecessors.

See examples in Jordan and Alaghband. Don't worry so much about Ladner and Fischer's parallel prefix algorithm in Section 2.3.3.
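For a concrete picture of how the passes work, here is a sketch of one simple scheme (often called recursive doubling), again written as sequential C in which the updates within a pass are independent and could be done concurrently. This is only an illustration, not the algorithm from Section 2.3.3:

  /* Sketch: prefix sums of a[0..n-1] by recursive doubling.  After
     log2(n) passes, a[i] holds a[0]+a[1]+...+a[i].  Within a pass,
     every element reads only "old" values (via tmp[]), so the updates
     are independent and could all be done concurrently. */
  void prefix_sum(double *a, double *tmp, int n) {
    int stride, i;
    for (stride = 1; stride < n; stride *= 2) {
      for (i = 0; i < n; i++)
        tmp[i] = (i >= stride) ? a[i] + a[i-stride] : a[i];
      for (i = 0; i < n; i++)
        a[i] = tmp[i];            /* copy back before the next pass */
    }
  }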

An example of a more interesting computation that we might consider parallelizing using recursive parallelism is the adaptive quadrature problem. Quadrature can be used to approximate the value of an integral of a function f(x) from a to b (the area between f(x) and the x-axis in the interval [a,b]) by a sum of approximations of the areas over smaller subintervals. The area of each subinterval may be approximated using the trapezoidal rule.

The straightforward (non-adaptive) approach uses fixed width subintervals, and works like this:

  double f_left = f(a), f_right, area = 0.0;
  double width = (b-a) / NUM_INTERVALS;
  double x;
  for (x=a+width; x<=b; x+=width) {
    f_right = f(x);
    /* trapezoidal rule on the subinterval [x-width, x] */
    area += (f_left+f_right) * width / 2;
    f_left = f_right;
  }

The more subintervals, the more accurate the approximation.

But the integrals of some functions are much more difficult to approximate than others. In areas of larger curvature, larger subintervals will lead to larger approximation errors.

If we allow different sized subintervals in different parts of the large interval, we can get the same accuracy, but with much less work. We can compute appropriately-sized intervals adaptively with a routine such as this:

  /* adaptively approximate the integral of f on [left,right], where
     f_left = f(left), f_right = f(right), and lr_area is the trapezoidal
     approximation over the whole interval (fabs requires math.h) */
  double quad(double left, double right,
              double f_left, double f_right,
              double lr_area) {
    double mid = (left + right) / 2;
    double f_mid = f(mid);
    double l_area = (f_left+f_mid) * (mid-left) / 2;
    double r_area = (f_mid+f_right) * (right-mid) / 2;
    if (fabs((l_area+r_area) - lr_area) > EPSILON) {
       // recurse to integrate both halves
       l_area = quad(left, mid, f_left, f_mid, l_area);
       r_area = quad(mid, right, f_mid, f_right, r_area);
    }
    return (l_area + r_area);
  }

The initial call is:

  area = quad(a, b, f(a), f(b), (f(a)+f(b))*(b-a)/2);

This procedure starts with an approximation using a single subinterval, then recursively refines the approximation until a recursive step no longer improves the approximation enough to justify further refinement. This happens when the difference between the new, refined approximation and the previous approximation falls below the threshold EPSILON. At each recursive step, we divide the current subinterval in half.

A simple program to compute the integral of a function using adaptive quadrature:

adapt_quad.tgz Also available in /home/faculty/terescoj/shared/cs338/lect05.

How can we think about parallelizing this? If we want to take the recursive parallelism approach, the place to look is the recursive calls.

Are these calls independent? Yes, except for the counter of the number of function calls, but that is only there to get a feel for how much work the computation is doing, and it can be eliminated. Each recursive call operates only on its formal parameters and local variables, so we can execute the recursive calls concurrently. We just need to make sure both calls finish before we return their combined result.

Suppose we create a new thread for each recursive call. That's a lot of threads! Once we get much beyond the number of processors available, extra threads will not add much besides extra overhead. To program this efficiently, we would need to stop creating new threads after a few levels of recursion and instead make regular recursive function calls.
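Here is a sketch of one way this might be organized with pthreads: carry a depth parameter, create a thread for the left half (while the current thread handles the right half) until a depth limit is reached, and fall back on ordinary recursion beyond that. MAX_DEPTH is a made-up tuning constant, and f() and EPSILON are those of the sequential version above.

  #include <pthread.h>
  #include <math.h>

  #define MAX_DEPTH 3   /* hypothetical limit: roughly 2^MAX_DEPTH threads */

  struct quad_args {
    double left, right, f_left, f_right, lr_area;
    int depth;
    double result;
  };

  void *quad_thread(void *arg);

  /* like the sequential quad(), but spawn a thread for the left half
     until the depth limit is reached */
  double quad_par(double left, double right, double f_left, double f_right,
                  double lr_area, int depth) {
    double mid = (left + right) / 2;
    double f_mid = f(mid);
    double l_area = (f_left+f_mid) * (mid-left) / 2;
    double r_area = (f_mid+f_right) * (right-mid) / 2;
    if (fabs((l_area+r_area) - lr_area) > EPSILON) {
      if (depth < MAX_DEPTH) {
        pthread_t tid;
        struct quad_args args;
        args.left = left;      args.right = mid;
        args.f_left = f_left;  args.f_right = f_mid;
        args.lr_area = l_area; args.depth = depth + 1;
        /* left half in a new thread, right half in this thread */
        pthread_create(&tid, NULL, quad_thread, &args);
        r_area = quad_par(mid, right, f_mid, f_right, r_area, depth + 1);
        pthread_join(tid, NULL);   /* wait before combining results */
        l_area = args.result;
      } else {
        /* deep enough: plain recursive calls, no new threads */
        l_area = quad_par(left, mid, f_left, f_mid, l_area, depth + 1);
        r_area = quad_par(mid, right, f_mid, f_right, r_area, depth + 1);
      }
    }
    return (l_area + r_area);
  }

  void *quad_thread(void *arg) {
    struct quad_args *a = (struct quad_args *)arg;
    a->result = quad_par(a->left, a->right, a->f_left, a->f_right,
                         a->lr_area, a->depth);
    return NULL;
  }

The initial call then becomes quad_par(a, b, f(a), f(b), (f(a)+f(b))*(b-a)/2, 0).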

Other Approaches to Adaptive Quadrature

What about tackling this problem with other approaches to parallelism? Bag of tasks? Explicit decomposition? How hard would it be to program each of these? What efficiency concerns arise?

A few things to consider for the bag of tasks approach:

  • We don't know all the tasks at the start! Each worker would grab a task from the bag, but may generate two new tasks!
  • How do we decide when we're done? Just because the bag of tasks is empty when a worker looks doesn't mean that another worker won't put a new task into the bag. (One way to keep track of this is sketched after this list.)
  • How do we accumulate the result? We can't "wait" for the tasks that compute subintervals to finish, since new subintervals just go into the bag as more tasks. We add into the result only when we get to a task that doesn't generate two new tasks.
  • See the description of an approach to this in Jordan and Alaghband on p. 134.
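To make the termination question a little more concrete, here is one hypothetical way to handle it with pthreads: keep a count of tasks that are either in the bag or currently being worked on, and only declare the computation finished when that count reaches zero. The struct task and the bag routines (bag_empty(), bag_insert(), bag_remove()) are placeholders; this is just an illustration of the idea, not the book's code.

  #include <pthread.h>

  /* Sketch of termination detection for a dynamic bag of tasks.
     "outstanding" counts tasks in the bag plus tasks being processed. */
  struct task { double left, right, f_left, f_right, lr_area; };  /* hypothetical */

  pthread_mutex_t bag_lock = PTHREAD_MUTEX_INITIALIZER;
  pthread_cond_t  bag_cond = PTHREAD_COND_INITIALIZER;
  int outstanding = 1;          /* starts at 1: the initial whole-interval task */

  /* a worker calls this to add a newly generated task */
  void put_task(struct task t) {
    pthread_mutex_lock(&bag_lock);
    bag_insert(t);
    outstanding++;
    pthread_cond_signal(&bag_cond);       /* wake a waiting worker */
    pthread_mutex_unlock(&bag_lock);
  }

  /* fills in *t and returns 1, or returns 0 when the computation is done */
  int get_task(struct task *t) {
    pthread_mutex_lock(&bag_lock);
    while (bag_empty() && outstanding > 0)
      pthread_cond_wait(&bag_cond, &bag_lock);
    if (outstanding == 0) {               /* nothing left anywhere: all done */
      pthread_mutex_unlock(&bag_lock);
      return 0;
    }
    *t = bag_remove();                    /* task moves from bag to "in progress" */
    pthread_mutex_unlock(&bag_lock);
    return 1;
  }

  /* a worker calls this when it finishes a task (after any put_task calls) */
  void finish_task(void) {
    pthread_mutex_lock(&bag_lock);
    outstanding--;
    if (outstanding == 0)
      pthread_cond_broadcast(&bag_cond);  /* let all waiting workers exit */
    pthread_mutex_unlock(&bag_lock);
  }

A task that decides not to split would add its area into a shared total under the same (or a separate) lock before calling finish_task().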

OpenMP

The pseudocode language used by this and other examples in our text will be useful as we continue on. It looks a lot like Pascal, but with a number of keywords to indicate which variables are shared among the threads/processes, which are local (private) to each thread/process, and where the critical sections are.

We can implement the algorithms described in this pseudocode using pthreads - shared variables can be declared globally, and private variables can be declared local to our thread function or to functions called by the thread function.

But a compiler should be able to do a little of this work for us - convert some higher-level constructs into appropriate thread calls and mutex locks.

We saw that the Sun compilers could do some parallelization completely automatically, but this doesn't always work, and it may not choose to parallelize the way we would like. A standard set of compiler directives and library functions called OpenMP has been developed to allow programmers to specify parallelism.

Jordan and Alaghband describe OpenMP in Section 4.4, but unfortunately, their examples are all in Fortran. We will look at OpenMP in C instead.

As always, we start with a "Hello, world" program:

openmp_hello.tgz Also available in /home/faculty/terescoj/shared/cs338/lect05.
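A minimal program along those lines looks roughly like the following sketch (not necessarily identical to the code in the archive):

  /* Sketch of an OpenMP hello world.  Compile with OpenMP support
     enabled (e.g., cc -xO3 -xopenmp=parallel with the Sun compilers)
     and set OMP_NUM_THREADS to request a number of threads. */
  #include <stdio.h>
  #include <omp.h>                       /* the OpenMP header file */

  int main() {
    int nthreads, tid;

    /* each thread in the team gets private copies of nthreads and tid */
    #pragma omp parallel private(nthreads, tid)
    {
      tid = omp_get_thread_num();        /* which thread am I? */
      printf("Hello, world, from thread %d\n", tid);
      if (tid == 0) {                    /* have one thread report the total */
        nthreads = omp_get_num_threads();
        printf("%d threads total\n", nthreads);
      }
    }   /* the braces delimit the extent of the parallel block */
    return 0;
  }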

Things to note about the OpenMP hello world:

  • We include omp.h, the OpenMP header file.
  • We have some odd syntax just inside of main() that starts a parallel block:
    #pragma omp parallel private(nthreads, tid)
    

    It is a compiler directive: when OpenMP support is enabled, the compiler replaces it with code to start up a number of threads.

    It means the block that follows is a parallel block, and that each thread should get its own private copies of the variables nthreads and tid.

    Note the seemingly extraneous curly braces following the #pragma. They define the extent of the parallel block.

  • Two self-explanatory query functions are called inside the parallel block: omp_get_thread_num() and omp_get_num_threads().
  • To compile this, we need to use Sun's cc with options -xO3 and -xopenmp=parallel.
  • To request a number of threads, we set the environment variable OMP_NUM_THREADS to the number of threads we want. By default, the system will start at most as many threads as there are available processors.
Another example, back to our favorite matrix-matrix multiply:

matmult_openmp.tgz Also available in /home/faculty/terescoj/shared/cs338/lect05.

Again, we include omp.h, but the only other change from a straightforward serial version is the

  #pragma omp parallel for

which tells OpenMP to parallelize the upcoming for loop.
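The multiplication loops might look something like this sketch (not necessarily the exact code in the archive; SIZE and the matrices a, b, and c are assumed to be declared and initialized elsewhere):

  /* Sketch: parallelized matrix-matrix multiply.  The iterations of the
     outer loop are divided among the threads; variables declared inside
     the loops are automatically private to each thread. */
  #pragma omp parallel for
  for (int row = 0; row < SIZE; row++) {
    for (int col = 0; col < SIZE; col++) {
      c[row][col] = 0;
      for (int k = 0; k < SIZE; k++)
        c[row][col] += a[row][k] * b[k][col];
    }
  }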

We will get back to bigger and better OpenMP examples next week.