Computer Science 400
Parallel Processing and High Performance Computing
Fall 2017, Siena College
In the previous lab, you saw a bit about how to use POSIX threads. They probably seem a lot easier to use than MPI (and generally, they are). In this lab, you will see the dangers that come up and how to deal with them.
You may work alone or in a group of two or three.
Getting Set Up
You will receive an email with the link to follow to set up your GitHub repository cs-lab-yourgitname for this Lab. One member of the group should follow the link to set up the repository on GitHub, then that person should email the instructor with the other group members' GitHub usernames so they can be granted access. This will allow all members of the group to clone the repository and commit and push changes to the origin on GitHub. At least one group member should make a clone of the repository to begin work.
Critical Sections
By using global variables, or passing pointers to thread functions that access the same stack or heap variables, threads can share memory. This is a wonderful advantage, and can often greatly simplify code and reduce memory overhead of parallelism. However, it comes with a price. Concurrent access to shared variables can be dangerous!
Consider the example in pthread_danger.
If all went well, you did not get the answers you expected. Why?
The answer is that we have concurrent access to the shared variable counter. Suppose that two threads are each about to execute counter++. What can go wrong?
counter++ likely requires three machine instructions: (i) load a register with the value of counter's memory location, (ii) increment the register, and (iii) store the register value back in counter's memory location. Even on a single processor, the operating system could switch the thread out in the middle of this sequence. With multiple processors, the statements really could be happening concurrently.
Consider two threads running the statements that modify counter:
Thread A              | Thread B
A1: R0 = counter;     | B1: R1 = counter;
A2: R0 = R0 + 1;      | B2: R1 = R1 + 1;
A3: counter = R0;     | B3: counter = R1;
If each thread executes its code for counter++, we'd expect the value of counter to be two larger than it was when we started. But consider the interleaving A1, B1, A2, B2, A3, B3: both threads load the same original value of counter, both increment their own registers, and both store back the same result, so counter ends up only one larger.
What we have here is a race condition. We need to make sure that when one thread starts modifying counter, it finishes before the other can try to modify it. This requires synchronization of the threads.
If we run this on a single-processor system (rare in the 2010s), the problem is unlikely to show itself - we would almost certainly see the correct sum when we run it. However, there is no guarantee that this would be the case. The operating system could switch threads in the middle of the load-increment-store sequence, resulting in a race condition and an incorrect result.
We need to make those statements that increment counter atomic. We say that the modification of counter is a critical section.
There are many solutions to the critical section problem that form a significant topic in an operating systems course. But for our purposes, at least for now, it is sufficient to recognize the problem, and use available tools to deal with it.
The pthread library provides a construct called a mutex that allows us to ensure that only one thread at a time is executing a particular block of code. We can use it to fix our "danger" program into a "no danger" version, which you have in pthread_nodanger.
We declare a mutex like any other shared variable. It is of type pthread_mutex_t. Four functions are used: pthread_mutex_init() to initialize a mutex, pthread_mutex_lock() to acquire it (waiting, if necessary, until it is available), pthread_mutex_unlock() to release it, and pthread_mutex_destroy() to clean it up when it is no longer needed.
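As a minimal sketch of the pattern (the names, thread count, and iteration count here are illustrative, not necessarily those used in pthread_nodanger):

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4
#define ITERATIONS 100000

pthread_mutex_t counter_mutex;   /* protects the shared counter */
long counter = 0;

/* each thread repeatedly increments the shared counter inside a critical section */
void *worker(void *arg) {
    for (int i = 0; i < ITERATIONS; i++) {
        pthread_mutex_lock(&counter_mutex);    /* acquire the lock */
        counter++;                             /* only one thread at a time runs this */
        pthread_mutex_unlock(&counter_mutex);  /* release the lock */
    }
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    pthread_mutex_init(&counter_mutex, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&counter_mutex);
    printf("counter = %ld (expected %d)\n", counter, NUM_THREADS * ITERATIONS);
    return 0;
}

Remember to compile with the -pthread flag (e.g., gcc -pthread).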
A few things to consider about this:
Why isn't the access to the mutex a problem? Isn't it just a shared variable itself? - Yes, it's a shared variable, but access to it is only through the pthread API. Techniques that are discussed in detail in an operating systems course (and that we might still discuss more here) are used to ensure that access to the mutex itself does not cause a race condition.
Doesn't that lock/unlock have a significant cost? - Let's see.
We can instrument these two programs, as seen in pthread_danger_timed and pthread_nodanger_timed.
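The instrumentation is just wall-clock timing around the threaded computation. A sketch of one way to do it, assuming gettimeofday (the actual lab code may time things differently):

#include <stdio.h>
#include <sys/time.h>

int main(void) {
    struct timeval start, end;
    gettimeofday(&start, NULL);

    /* ... the work being timed (thread creation through join) goes here ... */

    gettimeofday(&end, NULL);
    double elapsed = (end.tv_sec - start.tv_sec)
                   + (end.tv_usec - start.tv_usec) / 1000000.0;
    printf("elapsed time: %f seconds\n", elapsed);
    return 0;
}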
Perhaps the cost is too much if we're going to lock and unlock that much. Maybe we shouldn't do so much locking and unlocking. In this case, each thread reacquires the lock almost as soon as it releases it, each time around the for loop. Here's an alternative:
pthread_nodanger_coarse
In this case, the coarse-grained locking (one thread gets and holds the lock for a long time) should improve the performance significantly.
Unfortunately, we've completely serialized the computation! Only one thread can actually be doing something at a time, so we can't take advantage of multiple processors. If the "computation" was something more significant, we would need to be more careful about the granularity of the locking.
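A sketch of the difference, reusing the counter and mutex from the earlier sketch (details may differ from the repository code):

/* fine-grained: lock and unlock on every iteration (as in pthread_nodanger) */
for (int i = 0; i < ITERATIONS; i++) {
    pthread_mutex_lock(&counter_mutex);
    counter++;
    pthread_mutex_unlock(&counter_mutex);
}

/* coarse-grained: hold the lock for the whole loop (as in pthread_nodanger_coarse) */
pthread_mutex_lock(&counter_mutex);
for (int i = 0; i < ITERATIONS; i++)
    counter++;    /* no other thread can be incrementing concurrently */
pthread_mutex_unlock(&counter_mutex);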
Domain Decomposition vs. Bag of Tasks Paradigm
Tasks such as your next programming project, as well as many scientific computations, can be solved using a data parallel programming style. A data parallel program is one in which each process executes the same actions concurrently, but on different parts of shared data.
An important consideration in a data parallel computation is load balancing - making sure that each process/thread has about the same amount of work to do. Otherwise, some would finish before others, possibly leaving available processors idle while other processors continue to work. Load balancing will be an important topic throughout the course. Parallel efficiency and scalability of a data parallel computation will be highly dependent on a good load balance.
One specific way of deciding which processes/threads do the operations on which parts of the data is the bag of tasks. In this case, each thread/process is a worker that finds a task (unit of work) to do (from the "bag"), does it, then goes back for more:
while (true) {
    // get a task from the bag
    if (no tasks remain) break;
    // execute the task
}
A nice feature of this approach is that load balancing comes for free, as long as we have more tasks than workers. Even if some tasks cost more than others, or some workers work more slowly than others, any available work can be passed out to the first available worker until no tasks remain.
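For shared-memory threads, the "bag" can be as simple as a mutex-protected counter of the next undone task. A sketch of that idea (illustrative names; not necessarily how the lab code implements it):

#include <pthread.h>

#define NUM_TASKS 750

pthread_mutex_t bag_mutex;   /* protects next_task */
int next_task = 0;           /* index of the next task in the bag */

void do_task(int task);      /* the actual work, defined elsewhere */

void *worker(void *arg) {
    while (1) {
        /* get a task from the bag */
        pthread_mutex_lock(&bag_mutex);
        int task = next_task++;
        pthread_mutex_unlock(&bag_mutex);
        if (task >= NUM_TASKS) break;   /* no tasks remain */
        do_task(task);                  /* execute the task */
    }
    return NULL;
}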
Let's work with the matrix multiplication example we saw earlier, in matmult.
For this problem, we can break up the computation into a bag of tasks. We'll choose a fine-grained parallelization, where the computation of each entry is one of our tasks in matmult_bagoftasks.
You should see that it is still pretty slow. Perhaps the granularity of the computation is too small - too much time picking out a task, not enough time doing it. We created 562,500 tasks (one per entry of the 750x750 result matrix), which means we have to acquire the lock 562,500 times.
We can easily break up our computation by row or column of the result matrix, as well. Here is a row-wise decomposition:
matmult_smallbagoftasks
You should find that this is much more efficient! We coarsened the parallelism, but kept it fine enough that all of our threads could keep busy. We still had 750 tasks in the bag.
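In terms of the bag-of-tasks sketch above, coarsening the granularity just changes what one task means. If a task is a whole row of the result matrix (again assuming global matrices a, b, and c and a global SIZE, purely as an illustration):

/* one task = compute one full row i of the result matrix */
void do_task(int i) {
    for (int j = 0; j < SIZE; j++) {
        c[i][j] = 0;
        for (int k = 0; k < SIZE; k++)
            c[i][j] += a[i][k] * b[k][j];
    }
}

With one task per row, NUM_TASKS becomes SIZE rather than SIZE*SIZE, so the lock is acquired only once per row.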
If we could improve things significantly by coarsening the parallelism in the bag of tasks, perhaps we can do even better by dividing up the entire computation ahead of time to avoid any selection from the bag of tasks whatsoever.
With the matrix-matrix multiply example, this is easy enough - we can just give SIZE/numworkers rows to each thread, they all go off and compute, and they'll all finish in about the same amount of time:
matmult_decomp
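A sketch of the idea behind matmult_decomp (assuming global matrices a, b, and c, a global numworkers, and that SIZE divides evenly by numworkers; the actual lab code may differ):

/* each worker computes its own contiguous block of rows; no locking needed */
void *worker(void *arg) {
    int id = *((int *)arg);              /* this worker's number: 0..numworkers-1 */
    int rows_per_worker = SIZE / numworkers;
    int first = id * rows_per_worker;
    for (int i = first; i < first + rows_per_worker; i++)
        for (int j = 0; j < SIZE; j++) {
            c[i][j] = 0;
            for (int k = 0; k < SIZE; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
    return NULL;
}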
Some things to notice about this example:
Explicit domain decomposition works out well in this example, since there's an easy way to break it up (by rows), and each row's computation is equal in cost.
In more complicated examples, load balancing with a domain decomposition may be more difficult. We will consider such examples later in the semester.
Submitting
Your submission requires that all required deliverables are committed and pushed to the master branch of your repository on GitHub.
Grading
This assignment is worth 40 points, which are distributed as follows:
Feature | Value | Score |
Lab Question 1 | 1 | |
Lab Question 2 | 1 | |
Lab Question 3 | 1 | |
Lab Question 4 | 2 | |
interleavings.c practice program | 15 | |
Lab Question 5 | 3 | |
Lab Question 6 | 2 | |
Lab Question 7 | 1 | |
Lab Question 8 | 2 | |
Lab Question 9 | 2 | |
Lab Question 10 | 2 | |
Lab Question 11 | 3 | |
Lab Question 12 | 2 | |
Lab Question 13 | 3 | |
Total | 40 | |