Lecture 4 - More Multithreading


Agenda

Announcements

Some Homework/Lab 2 Information

  • My version of the program runs for about 4 seconds using one thread on a cluster node.
  • I recommend -xO3 for timing runs. Use -g instead if you're debugging. We have two command-line debuggers installed: dbx and gdb, and one graphical debugger: ddd.
  • Use "top" to see how many threads your program is creating. It will be listed under the "THR" column.
    More Critical Sections

    We rushed through a lot of things last time to give you the background needed to get started on Homework/Lab 2. We'll revisit some of that today.

    Recall that concurrent access to a shared variable caused trouble in the "pthread danger" example.

    When we ran it on bullpen, a single-processor system, the problem didn't show itself - we still got the correct sum when we ran it. However, there is no guarantee that this would be the case. The operating system could switch threads in the middle of the load-increment-store, resulting in a race condition and an incorrect result. Try the program on bullpen with dozens of threads and you might start to run into problems.

    The "pthread nodanger" example used a mutex to ensure mutual exclusion: only one thread at a time could update the counter.
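
    The essential pattern looks something like the following sketch (a minimal, self-contained version with illustrative names, not the actual "pthread danger"/"pthread nodanger" source). Removing the lock/unlock calls reproduces the race:

      #include <pthread.h>
      #include <stdio.h>

      /* Sketch of the mutex-protected counter pattern (illustrative names,
         not the actual course example source). */
      #define NUM_THREADS 4
      #define INCREMENTS 100000

      long counter = 0;                 /* the shared variable */
      pthread_mutex_t counter_mutex = PTHREAD_MUTEX_INITIALIZER;

      void *worker(void *arg)
      {
          for (int i = 0; i < INCREMENTS; i++) {
              pthread_mutex_lock(&counter_mutex);    /* enter the critical section */
              counter++;                 /* the load-increment-store is now atomic
                                            with respect to the other threads */
              pthread_mutex_unlock(&counter_mutex);  /* leave the critical section */
          }
          return NULL;
      }

      int main(void)
      {
          pthread_t threads[NUM_THREADS];
          for (int i = 0; i < NUM_THREADS; i++)
              pthread_create(&threads[i], NULL, worker, NULL);
          for (int i = 0; i < NUM_THREADS; i++)
              pthread_join(threads[i], NULL);
          /* without the mutex, this sum would sometimes come up short */
          printf("counter = %ld (expected %d)\n", counter,
                 NUM_THREADS * INCREMENTS);
          return 0;
      }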

    There are a few things to consider about this approach; we will revisit them when we get back to the matrix-matrix multiplication example.

    More pthreads

    Last time, we talked about some settings to specify in hopes of getting the best performance out of your pthreads on Solaris. It is worth looking in more detail at what it means to set the concurrency level or to set the scope of a thread to process or system.

    Solaris implements a complex threading mechanism. We have thought of threads as multiple paths through our single process. Ideally, if we have multiple threads ready to run and multiple processors available, the threads would be assigned to those processors to run concurrently. But in Solaris, an extra entity, a lightweight process (LWP), falls between the user threads and the CPU scheduler. It is an LWP that gets scheduled on the CPU.

    The pthread functions we looked at last time have to do with how many LWPs your program has, and how your threads are assigned to those LWPs to gain access to a CPU.

    The call

      pthread_setconcurrency(n+1);
    
    requests that the system create n+1 LWPs for your process. This will allow up to n+1 of your threads to be running concurrently. There is no guarantee that the system will grant this many LWPs, especially if the number is large. Generally, you should set this to the number of threads you expect to be running concurrently in your program.

    The "scope" attributes indicate how you want to have your threads associated with your LWPs:

      pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);
    
  • PTHREAD_SCOPE_PROCESS creates unbound threads. That means that this thread can be associated with any LWP, and its LWP association can change during its lifetime.

  • PTHREAD_SCOPE_SYSTEM creates bound threads. Here, the thread is assigned to an LWP and remains assigned to that LWP throughout its lifetime.

  • There are relative advantages and disadvantages to each approach. Allowing the thread to move among the LWPs is more flexible, preventing a situation where you have 4 LWPs and 4 processors, but all of your threads that are active and ready to do work have been assigned to the same LWP. This means only one will be able to run at a time. On the other hand, the unbound threads require an extra level of scheduling - deciding which threads go with which LWPs and deciding when to migrate threads to different LWPs.
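
    Putting these calls together, thread creation might look something like this sketch (error checking is omitted, and worker and n are placeholders, not from the actual examples):

      #include <pthread.h>

      /* Sketch: create n bound worker threads with a matching concurrency
         level. worker() and n are placeholders; error checking omitted. */
      void start_workers(pthread_t *threads, int n, void *(*worker)(void *))
      {
          pthread_attr_t attr;

          pthread_setconcurrency(n + 1);   /* request n+1 LWPs: n workers + main */

          pthread_attr_init(&attr);
          /* PTHREAD_SCOPE_SYSTEM creates bound threads, each on its own LWP;
             substitute PTHREAD_SCOPE_PROCESS for unbound threads */
          pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

          for (int i = 0; i < n; i++)
              pthread_create(&threads[i], &attr, worker, NULL);

          pthread_attr_destroy(&attr);
      }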

    To see how these settings come into play in a real program, I took my solution to the palindromic word finder, modified it to use FreeBSD's larger dictionary file (235,881 words instead of 25,143 in Solaris), and ran it on one of the 4-processor Solaris systems in the cluster.

    When there is no call to pthread_setconcurrency(), my program's running times with unbound or bound threads using various numbers of threads:
      # Threads    unbound    bound
          1         379.1     384.0
          2         194.4     197.3
          3         197.3     136.8
          4         197.7      99.3

    And with a call to pthread_setconcurrency(n+1) for n worker threads:
      # Threads    unbound    bound
          1         367.0     357.3
          2         184.7     185.0
          3         132.0     127.2
          4          93.4      94.9

    What can we conclude about the default number of LWPs when we don't make a call to pthread_setconcurrency(n+1)?

    We also see that there is no consistent trend in bound vs. unbound threads in this example. However, in cases where we have more threads than LWPs and some threads may finish their work before others on the same LWP, there is likely to be more of an advantage to unbound threads.

    For a more complete discussion of this, see Chapter 5 of one of the books I have placed in the lab: Lewis, B., and Berg, D., Multithreaded Programming with Pthreads, Sun Microsystems Press, 1998.

    Approaches to Data Parallel Computation

    Tasks such as the palindromic word finder from your homework, as well as many scientific computations, can be solved using a data parallel programming style. A data parallel program is one in which each process executes the same actions concurrently, but on different parts of shared data.

    Contrast this with a task parallel approach, where different processes each perform a different step of the computation on the same data. This corresponds to the pipeline model we mentioned earlier.

    An important consideration in a data parallel computation is load balancing - making sure that each process/thread has about the same amount of work to do. Otherwise, some would finish before others, possibly leaving available processors idle while other processors continue to work. Load balancing will be an important topic throughout the course. Parallel efficiency and scalability of a data parallel computation will be highly dependent on a good load balance.

    Bag of Tasks Approach

    One specific way of deciding which processes/threads do the operations on which parts of the data is the bag of tasks. In this case, each thread/process is a worker that finds a task (unit of work) to do (from the "bag"), does it, then goes back for more:

      while (true) {
        // get a task from the bag
        if (no tasks remain) break;
        // execute the task
      }
    

    A nice feature of this approach is that load balancing comes for free, as long as we have more tasks than workers. Even if some tasks cost more than others, or some workers work more slowly than others, the remaining work can be handed out to the first available worker until no tasks remain.
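
    In pthreads, the "bag" is often just a shared counter protected by a mutex. A minimal sketch of a worker, where next_task, num_tasks, and do_task() are illustrative placeholders rather than the actual example code:

      #include <pthread.h>

      /* Sketch of a bag-of-tasks worker: the "bag" is a shared counter
         protected by a mutex. next_task, num_tasks, and do_task() are
         illustrative placeholders. */
      int next_task = 0;                /* the next task in the bag */
      int num_tasks = 750;              /* e.g., one task per row */
      pthread_mutex_t bag_mutex = PTHREAD_MUTEX_INITIALIZER;

      void do_task(int task) { /* placeholder: the actual work for one task */ }

      void *worker(void *arg)
      {
          while (1) {
              pthread_mutex_lock(&bag_mutex);
              int task = next_task++;         /* take a task from the bag */
              pthread_mutex_unlock(&bag_mutex);
              if (task >= num_tasks) break;   /* the bag is empty */
              do_task(task);                  /* do the work outside the lock */
          }
          return NULL;
      }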

    Returning to our matrix-matrix multiplication example, we can break the computation up into a bag of tasks. We'll choose a fine-grained parallelization, where the computation of each entry is one of our tasks.

    matmult_bagoftasks.tgz Also available in /home/faculty/terescoj/shared/cs338/lect04.

    If we run this on a four-processor node, we see that it is still pretty slow. Perhaps the granularity of the computation is too small - too much time is spent picking out a task, not enough time doing it. With a 750x750 result matrix, we created 562,500 tasks, which means the lock protecting the bag had to be acquired 562,500 times.

    We can easily break up our computation by row or column of the result matrix, as well. Here is a row-wise decomposition:

    matmult_smallbagoftasks.tgz Also available in /home/faculty/terescoj/shared/cs338/lect04.

    This is much more efficient! We coarsened the parallelism, but kept it fine enough that all of our threads could keep busy. We still had 750 tasks in the bag.

    Explicit Domain Decomposition

    If we could improve things significantly by coarsening the parallelism in the bag of tasks, perhaps we can do even better by dividing up the entire computation ahead of time to avoid any selection from the bag of tasks whatsoever.

    With the matrix-matrix multiply example, this is easy enough - we can just give SIZE/numworkers rows to each thread; they all go off and compute, and they'll all finish in about the same amount of time:

    matmult_decomp.tgz Also available in /home/faculty/terescoj/shared/cs338/lect04.

    Notice that explicit domain decomposition works out well in this example, since there's an easy way to break up the work (by rows), and each row's computation is equal in cost.
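
    The row assignment itself is just arithmetic on the thread's id. A sketch, where a, b, and c are assumed to be global SIZE x SIZE matrices (illustrative names, not the actual example code); each thread would be created with its id passed as the argument, e.g. pthread_create(&threads[id], NULL, worker, (void *)id):

      #include <pthread.h>

      #define SIZE 750
      #define NUMWORKERS 4

      /* Sketch of an explicit row-wise decomposition; a, b, and c are
         assumed to be global SIZE x SIZE matrices (illustrative names). */
      double a[SIZE][SIZE], b[SIZE][SIZE], c[SIZE][SIZE];

      void *worker(void *arg)
      {
          long id = (long)arg;                       /* worker number 0..NUMWORKERS-1 */
          int first = id * SIZE / NUMWORKERS;        /* this worker's rows:   */
          int last = (id + 1) * SIZE / NUMWORKERS;   /* [first, last)         */

          /* no locking needed: each worker writes a disjoint set of rows */
          for (int i = first; i < last; i++)
              for (int j = 0; j < SIZE; j++) {
                  c[i][j] = 0.0;
                  for (int k = 0; k < SIZE; k++)
                      c[i][j] += a[i][k] * b[k][j];
              }
          return NULL;
      }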

    Is there any advantage to breaking it down by columns instead? How about, in the case of 4 threads, into quadrants?

    In more complicated examples, load balancing with a domain decomposition may be more difficult. We will consider many such examples shortly.

    Divide and Conquer Parallelism

    A third major approach to data parallel computation is to take the divide and conquer approach of many of your favorite algorithms, but instead of solving each subproblem one after another, we solve the subproblems concurrently, with multiple threads/processes.
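
    As a preview, here is a sketch of the idea applied to summing an array: one half is handed to a new thread while the current thread recurses on the other half. The cutoff, the struct, and the names are illustrative, not from the course examples:

      #include <pthread.h>

      /* Sketch: divide-and-conquer parallel sum of an array. Below CUTOFF
         elements we sum sequentially; above it, one half is handed to a new
         thread while this thread recurses on the other half. CUTOFF and the
         struct are illustrative. */
      #define CUTOFF 10000

      struct range { double *data; int n; double sum; };

      void *sum_range(void *arg)
      {
          struct range *r = (struct range *)arg;
          if (r->n < CUTOFF) {                     /* base case: sum sequentially */
              r->sum = 0.0;
              for (int i = 0; i < r->n; i++)
                  r->sum += r->data[i];
          } else {                                 /* divide: two concurrent halves */
              struct range left = { r->data, r->n / 2, 0.0 };
              struct range right = { r->data + r->n / 2, r->n - r->n / 2, 0.0 };
              pthread_t t;
              pthread_create(&t, NULL, sum_range, &left);  /* new thread: left half */
              sum_range(&right);                           /* this thread: right half */
              pthread_join(t, NULL);
              r->sum = left.sum + right.sum;               /* conquer: combine */
          }
          return NULL;
      }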

    More on this approach soon...