Computer Science 400
Parallel Processing and High Performance Computing

Fall 2017, Siena College

Lab 11: Critical Sections
Due: 11:59 PM, Tuesday, October 31, 2017

In the previous lab, you saw a bit about how to use POSIX threads. They probably seem a lot easier to use than MPI (and generally, they are). In this lab, you will see the dangers that come up and how to deal with them.

You may work alone or in a group of two or three.

Getting Set Up

You will receive an email with the link to follow to set up your GitHub repository cs-lab-yourgitname for this Lab. One member of the group should follow the link to set up the repository on GitHub, then that person should email the instructor with the other group members' GitHub usernames so they can be granted access. This will allow all members of the group to clone the repository and commit and push changes to the origin on GitHub. At least one group member should make a clone of the repository to begin work.

Critical Sections

By using global variables, or passing pointers to thread functions that access the same stack or heap variables, threads can share memory. This is a wonderful advantage, and can often greatly simplify code and reduce memory overhead of parallelism. However, it comes with a price. Concurrent access to shared variables can be dangerous!

Consider the example in pthread_danger.

Question 1: Run this program on noreaster with 1 thread. What answer do you get and why? (1 point)

Question 2: What answers would you expect (don't run it yet!) with 2, 4, 8, 16, and 32 threads? (1 point)

Question 3: Now run it with 2, 4, 8, 16, and 32 threads. What answers do you get? (1 point)

If all went well, you did not get the answers you expected. Why?

The answer is that we have concurrent access to the shared variable counter. Suppose that two threads are each about to execute counter++. What can go wrong?

counter++ likely requires three machine instructions: (i) load a register with the value of counter's memory location, (ii) increment the register, and (iii) store the register value back in counter's memory location. Even on a single processor, the operating system could switch threads in the middle of this sequence. With multiple processors, the instructions really could be executing concurrently.

Consider two threads running the statements that modify counter:

  Thread A                 Thread B
  A1  R0 = counter;        B1  R1 = counter;
  A2  R0 = R0 + 1;         B2  R1 = R1 + 1;
  A3  counter = R0;        B3  counter = R1;

If each thread executes its code for counter++, we'd expect the value of counter to be two larger than it was when we started.

Question 4: Consider one possible ordering: A1 A2 B1 A3 B2 B3, where counter = 17 before starting. What is the final value of counter? (2 points)

What we have here is a race condition. We need to make sure that when one thread starts modifying counter, it finishes before the other can try to modify it. This requires synchronization of the threads.

If we run this on a single-processor system (rare in the 2010s), the problem is unlikely to show itself - we would almost certainly see the correct sum when we run it. However, there is no guarantee. The operating system could switch threads in the middle of the load-increment-store sequence, resulting in a race condition and an incorrect result.

Practice Program: Write a C program that will list all possible orderings of the machine instructions generated for the two threads above. Your program should list all possible interleavings of the statements A1, A2, A3, B1, B2, and B3. Also have your program print which interleavings produce a correct result (that is, counter ends up with a value two greater than it started with). Write your program in a file called interleaving.c. (15 points)

We need to make those statements that increment counter atomic. We say that the modification of counter is a critical section.

There are many solutions to the critical section problem that form a significant topic in an operating systems course. But for our purposes, at least for now, it is sufficient to recognize the problem, and use available tools to deal with it.

The pthread library provides a construct called a mutex that allows us to ensure that only one thread at a time is executing a particular block of code. We can use it to fix our "danger" program, producing the "no danger" version that you have in pthread_nodanger.

We declare a mutex like any other shared variable. It is of type pthread_mutex_t. Four functions are used:

  pthread_mutex_init(&mutex, NULL);    /* initialize, with default attributes */
  pthread_mutex_lock(&mutex);          /* block until the lock is available, then take it */
  pthread_mutex_unlock(&mutex);        /* release the lock */
  pthread_mutex_destroy(&mutex);       /* clean up when the mutex is no longer needed */
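
To see how these fit together, here is a minimal sketch in the spirit of the "no danger" program: each thread increments a shared counter, with the increment protected by a mutex. The names and iteration count here are made up for illustration; the actual pthread_nodanger code may differ. Compile with gcc -pthread.

  #include <pthread.h>
  #include <stdio.h>
  #include <stdlib.h>

  #define ITERS 100000             /* increments per thread (made-up value) */

  long counter = 0;                /* shared by all threads */
  pthread_mutex_t counter_mutex;   /* protects counter */

  void *worker(void *arg) {
    int i;
    for (i = 0; i < ITERS; i++) {
      pthread_mutex_lock(&counter_mutex);    /* enter the critical section */
      counter++;                             /* safe: only one thread at a time */
      pthread_mutex_unlock(&counter_mutex);  /* leave the critical section */
    }
    return NULL;
  }

  int main(int argc, char *argv[]) {
    int numthreads = (argc > 1) ? atoi(argv[1]) : 4;
    pthread_t threads[numthreads];
    int i;

    pthread_mutex_init(&counter_mutex, NULL);
    for (i = 0; i < numthreads; i++)
      pthread_create(&threads[i], NULL, worker, NULL);
    for (i = 0; i < numthreads; i++)
      pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&counter_mutex);

    printf("counter = %ld (expected %ld)\n", counter, (long)numthreads * ITERS);
    return 0;
  }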

A few things to consider about this:

Why isn't the access to the mutex a problem? Isn't it just a shared variable itself? - Yes, it's a shared variable, but access to it is only through the pthread API. Techniques that are discussed in detail in an operating systems course (and that we might still discuss more here) are used to ensure that access to the mutex itself does not cause a race condition.

Doesn't that lock/unlock have a significant cost? - Let's see. We can instrument these two programs, as seen in pthread_danger_timed and pthread_nodanger_timed.
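
Before running the experiments, it may help to see what such instrumentation typically looks like. Below is one common approach using gettimeofday(); the *_timed programs may measure things differently.

  #include <sys/time.h>

  /* return the current wall-clock time in seconds, as a double */
  double now_seconds(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
  }

  /* usage, wrapped around the thread creation and joining:

       double start = now_seconds();
       ... create and join the threads ...
       printf("elapsed: %f seconds\n", now_seconds() - start);
  */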

Question 5: Try these out for 1, 2, 4, 8, 16, and 32 threads on noreaster. What are the running times of each version and each number of threads? For each, run at least 5 times and take the shortest run time you get. (3 points)

Perhaps the cost is too high if we lock and unlock that often. In this case, we're pretty much just going to reacquire the lock as soon as we jump back around through the for loop. Here's an alternative:

pthread_nodanger_coarse

In this case, the coarse-grained locking (one thread gets and holds the lock for a long time) should improve the performance significantly.
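
To make the contrast concrete, here is a sketch of the two locking strategies side by side; the actual pthread_nodanger_coarse code may be organized differently.

  /* fine-grained: acquire and release the lock on every iteration */
  for (i = 0; i < iters; i++) {
    pthread_mutex_lock(&counter_mutex);
    counter++;
    pthread_mutex_unlock(&counter_mutex);
  }

  /* coarse-grained: one acquire/release around the entire loop */
  pthread_mutex_lock(&counter_mutex);
  for (i = 0; i < iters; i++) {
    counter++;
  }
  pthread_mutex_unlock(&counter_mutex);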

Question 6: Try this one out for 1, 2, 4, 8, 16, and 32 threads on noreaster. What are the running times of each number of threads? For each, run at least 5 times and take the shortest run time you get. (2 points)

Unfortunately, we've completely serialized the computation! Only one thread can actually be doing something at a time, so we can't take advantage of multiple processors. If the "computation" was something more significant, we would need to be more careful about the granularity of the locking.

Domain Decomposition vs. Bag of Tasks Paradigm

Tasks such as your next programming project, as well as many scientific computations, can be solved using a data parallel programming style. A data parallel program is one in which each process executes the same actions concurrently, but on different parts of shared data.

An important consideration in a data parallel computation is load balancing - making sure that each process/thread has about the same amount of work to do. Otherwise, some would finish before others, possibly leaving available processors idle while other processors continue to work. Load balancing will be an important topic throughout the course. Parallel efficiency and scalability of a data parallel computation will be highly dependent on a good load balance.

One specific way of deciding which processes/threads do the operations on which parts of the data is the bag of tasks. In this case, each thread/process is a worker that finds a task (unit of work) to do (from the "bag"), does it, then goes back for more:

  while (true) {
    // get a task from the bag
    if (no tasks remain) break;
    // execute the task
  }

A nice feature of this approach is that load balancing comes for free, as long as we have more tasks than workers. Even if some tasks cost more than others, or some workers work more slowly than others, any available work can be passed out to the first available worker until no tasks remain.
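
In shared memory, the "bag" can be as simple as a shared counter holding the index of the next unassigned task, protected by a mutex. Here is a hedged sketch of a worker; the names (NUM_TASKS, do_task) are hypothetical, and the matmult examples below organize this somewhat differently.

  #include <pthread.h>

  #define NUM_TASKS 750            /* total tasks in the bag (made-up value) */

  int next_task = 0;               /* shared: index of the next unassigned task */
  pthread_mutex_t bag_mutex;       /* protects next_task */

  void do_task(int task);          /* hypothetical: performs one unit of work */

  void *worker(void *arg) {
    while (1) {
      /* get a task from the bag */
      pthread_mutex_lock(&bag_mutex);
      int task = next_task++;
      pthread_mutex_unlock(&bag_mutex);

      if (task >= NUM_TASKS) break;  /* no tasks remain */

      do_task(task);                 /* execute the task */
    }
    return NULL;
  }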

Let's work with the matrix multiplication example we saw earlier, in matmult.

Question 7: How long does it take this serial version to run on noreaster? Run at least 5 times and take the shortest run time. (1 point)

For this problem, we can break up the computation into a bag of tasks. We'll choose a fine-grained parallelization, where the computation of each entry is one of our tasks in matmult_bagoftasks.

Question 8: Run this on noreaster for 1, 2, 4, 8, 16, and 32 threads. How long does this take to run? Run at least 5 times and take the shortest run time. (2 points)

You should see that it is still pretty slow. Perhaps the granularity of the computation is too small - too much time is spent picking out a task, not enough time doing it. We created 562,500 tasks (one for each entry of the 750x750 result matrix), which means we have to acquire the lock 562,500 times.

We can easily break up our computation by row or column of the result matrix, as well. Here is a row-wise decomposition:

matmult_smallbagoftasks

You should find that this is much more efficient! We coarsened the parallelism, but kept it fine enough that all of our threads could keep busy. We still had 750 tasks in the bag.

Question 9: Run this on noreaster for 1, 2, 4, 8, 16, and 32 threads. How long does this take to run? Run at least 5 times and take the shortest run time. (2 points)

If we could improve things significantly by coarsening the parallelism in the bag of tasks, perhaps we can do even better by dividing up the entire computation ahead of time to avoid any selection from the bag of tasks whatsoever.

With the matrix-matrix multiply example, this is easy enough - we can just give SIZE/numworkers rows to each thread; they all go off and compute, and they will all finish in about the same amount of time:

matmult_decomp
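
The heart of such a decomposition is a small structure telling each thread which rows are its responsibility. Here is a sketch, assuming the a, b, and c matrices and SIZE from the matmult example, with invented field and variable names (see matmult_decomp for the real code):

  /* per-thread work assignment (hypothetical names) */
  typedef struct {
    int start_row;   /* first row this thread computes */
    int end_row;     /* one past the last row */
  } workerinfo;

  void *worker(void *arg) {
    workerinfo *info = (workerinfo *)arg;
    int i, j, k;
    for (i = info->start_row; i < info->end_row; i++)
      for (j = 0; j < SIZE; j++) {
        c[i][j] = 0;
        for (k = 0; k < SIZE; k++)
          c[i][j] += a[i][k] * b[k][j];
      }
    return NULL;
  }

  /* in main: each thread gets its own workerinfo (see Question 11),
     assuming SIZE is evenly divisible by numworkers */
  for (t = 0; t < numworkers; t++) {
    info[t].start_row = t * (SIZE / numworkers);
    info[t].end_row = (t + 1) * (SIZE / numworkers);
    pthread_create(&threads[t], NULL, worker, &info[t]);
  }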

Question 10: Run this on noreaster for 1, 2, 4, 8, 16, and 32 threads. How long does this take to run? Run at least 5 times and take the shortest run time. (2 points)

Some things to notice about this example:

Question 11: Notice that each thread needs its own copy of this structure - we can't just create a single copy, send it to pthread_create(), change the values, and use it again. Why? (3 points)

Explicit domain decomposition works out well in this example, since there's an easy way to break it up (by rows), and each row's computation is equal in cost.

Question 12: Is there any advantage to breaking it down by columns instead? How about, in the case of 4 threads, into quadrants? (2 points)

In more complicated examples, load balancing with a domain decomposition may be more difficult. We will consider such examples later in the semester.

Question 13: Throughout this lab, you have been asked to run tests at least 5 times and take the minimum time. Why is this more appropriate than taking the average for these kinds of timing runs? (3 points)

Submitting

Your submission requires that all required deliverables are committed and pushed to the master branch of your repository on GitHub.

Grading

This assignment is worth 40 points, which are distributed as follows:

Feature                            Value  Score
Lab Question 1                         1
Lab Question 2                         1
Lab Question 3                         1
Lab Question 4                         2
interleaving.c practice program       15
Lab Question 5                         3
Lab Question 6                         2
Lab Question 7                         1
Lab Question 8                         2
Lab Question 9                         2
Lab Question 10                        2
Lab Question 11                        3
Lab Question 12                        2
Lab Question 13                        3
Total                                 40