Computer Science 338
Parallel Processing

Williams College
Spring 2006

Tutorial Assignment 2: Finding Palindromic Words with POSIX Threads
Due: Tuesday, February 14, 2006 at 9:00 AM


Be prepared to discuss the readings from the text and the notes on this handout during your tutorial meeting on February 14 and 15. In particular, we will discuss some of the questions in this section during the meetings. In particular, you are asked to run several of the example programs and observe their performance. I recommend taking notes on this handout to bring with you to the tutorial meeting. You are encouraged to work through the handout with your tutorial partner.

Read Quinn Chapter 7.1-7.3.

Parallel Algorithm Example: Matrix Multiplication

We'll get started by using a matrix-matrix multiply as a running example. Most of the class examples will be placed in subdirectories of /usr/cs-local/share/cs338/examples on the CSLab Unix systems (including both clusters).

We start with a serial implementation of a matrix-matrix multiply and use it to review a bit about C and Unix:

See: /usr/cs-local/share/cs338/examples/matmult

If we compile and run this program, it reports initialization and matrix multiplication times. Initialization is just filling in matrices a and b. Then we compute the value of each element of c using the dot product of the corresponding row of a and column of b.

Remember your 136: what is the complexity of matrix-matrix multiply?

How long does it take you to run this program on bullpen?

Opportunity for Parallelism

We find opportunities for parallelism by looking for parts of the sequential program that can be run in any order.

Before we look at the matrix-matrix multiply, we step back and look at a simpler example:

1: a = 10;
2: b = a + 5;
3: c = a - 3;
4: b = 7;
5: a = 3;
6: b = c - a;
7: print a, b, c;

Which statements can be run in a different order (or concurrently) but still produce the same answers at the end?

Back to our example, let's see what can be done concurrently.

  /* initialize matrices, just fill with junk */
  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      a[i][j] = i+j;
      b[i][j] = i-j;
  /* matrix-matrix multiply */
  for (i=0; i<SIZE; i++) {  /* for each row */
    for (j=0; j<SIZE; j++) { /* for each column */
      /* initialize result to 0 */
      c[i][j] = 0;

      /* perform dot product */
      for(k=0; k<SIZE; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];

  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      sum += c[i][j];

The initialization can all be done in any order - each i and j combination is independent of each other, and the assignment of a[i][j] and b[i][j] can be done in either order.

In the actual matrix-matrix multiply, each c[i][j] must be initialized to 0 before the sum can start to be accumulated. Also, iteration k of the inner loop can only be done after row i of a and column j of b have been initialized.

Finally, the sum contribution of each c[i][j] can be added as soon as that c[i][j] has been computed, and after sum has been initialized to 0.

That granularity seems a bit cumbersome, so we might step back and just say that we can initialize a and b in any order, but that it should be completed before we start computing values in c. Then we can initialize and compute each c[i][j] in any order, but we do not start accumulating sum until c is completely computed.

But all of these dependencies in this case can be determined by a relatively straightforward computation. Seems like a job for a compiler!

In the example, if we add the flag -xparallel to the compile command, the Sun compiler will determine what can be done in parallel and generate code to support it. With this executable, we can request a number of parallel processes by setting the environment variable PARALLEL. For example:

setenv PARALLEL 4

If we run this program on a node with multiple processors, hopefully it will actually run faster.. This is an example of a speedup. Quinn formally defines speedup and efficiency on p. 160.

An efficient program is one that exhibits linear speedup - double the number of processors, halve the running time.

The theoretical upper bound on speedup for p processors is p. Anything greater is called superlinear speedup - can this happen?

Try this out. Compile the matrix-matrix multiplication example with the -xparallel flag on bullpen and run it with the PARALLEL environment variable to numbers from 1-4. How does this affect the running time? Now, create a PBS script that runs this version of the program on one of bullpen's 2-processor nodes, again with values of PARALLEL ranging from 1-4. Finally, create another script to run on a 4-processor node and range PARALLEL between 1 and 8. What are the running times you get? Why?

We will return to this example and parallelize it by hand.

No Opportunity for Parallelism

But not everything can be parallelized by the compiler:

See: /usr/cs-local/share/cs338/examples/matmult-serial-init

The new initialization code:

  for (i=0; i<SIZE; i++) {
    for (j=0; j<SIZE; j++) {
      if ((i == 0) || (j == 0)) {
        a[i][j] = i+j;
        b[i][j] = i-j;
      else {
        a[i][j] = a[i-1][j-1] + i + j;
        b[i][j] = b[i-1][j-1] + i - j;

can't be parallelized, so no matter how many processors we throw at it, we can't speed it up.

Any parallel program will have some fraction f that cannot be parallelized, leaving (1-f) that may be parallelized. This means that at best, we can expect running time on p processors to be f + (1-f)/(p). This is known as Amdahl's Law. Quinn states Amdahl's Law in terms of maximum achievable speedup on p. 162.

Automatic parallelism is great, when it's possible. We got it for free (at least once we bought the compiler)! It does have limitations, though:

Approaches to Parallelism

Parallel programs can be categorized by how the cooperating processes communicate with each other:

These are functionally equivalent given operating system support. For example, one can write message-passing software using shared memory constructs, and one can simulate a shared memory by replacing accesses to non-local memory with a series of messages that access or modify the remote memory.

We will look at shared memory parallelism using threads first.

Brief Intro to POSIX threads

Multithreading usually allows for the use of shared memory. Many operating systems provide support for threads, and a standard interface has been developed: POSIX Threads or pthreads.

An online tutorial is available here

You do not need to read this thoroughly, but look through it and remember that it's there.

A google search for "pthread tutorial" yields many others.

Pthreads are available on bullpen.

The basic idea is that we can create and destroy threads of execution in a program, on the fly, during its execution.

Start with a look at a pthreads "Hello, world" program:

See: /usr/cs-local/share/cs338/examples/pthreadhello

The most basic functionality involves the creation and destruction of threads:

Prototypes for pthread functions are in pthread.h and programs need to link with libpthread.a (use -lpthread at link time). When using the Sun compiler, the -mt flag should also be specified to indicate multithreaded code.

It is also a good idea to do some extra initialization, to make sure the system will allow your threads to make use of all available processors. It may, by default, allow only one thread in your program to be executing at any given time. If your program will create up to n concurrent threads, you should make the call:


somewhere before your first thread creation. The "+1" is needed to account for the original thread plus the n you plan to create.

You may also want to specify actual attributes as the second argument to pthread_create(). To do this, declare a variable for the attributes:

  pthread_attr_t attr;

and initialize it with:


and set parameters on the attributes with calls such as:

  pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

I recommend the above setting for threads in Solaris.

Then, you can pass in &attr as the second parameter to pthread_create().

Any global variables in your program are accessible to all threads. Local variables are directly accessible only to the thread in which they were created, though the memory can be shared by passing a pointer as part of the last argument to pthread_create().

A slightly more interesting example:

See: /usr/cs-local/share/cs338/examples/proctree_threads

This example builds a "tree" of threads to a depth given on the command line. It includes calls to pthread_self(). This function returns the thread identifier of the calling thread.

Try it out.

Brief Intro to Critical Sections

As CS 432 veterans know and can tell the rest of you, concurrent access to shared variables can be dangerous.

Consider this example:

See: /usr/cs-local/share/cs338/examples/pthread_danger

Run it with one thread, and we get 100000. What if we run it with 2 threads? On a multiprocessor, it is going to give the wrong answer! Why?

The answer is that we have concurrent access to the shared variable counter. Suppose that two threads are each about to execute counter++, what can go wrong?

counter++ really requires three machine instructions: (i) load a register with the value of counter's memory location, (ii) increment the register, and (iii) store the register value back in counter's memory location. Even on a single processor, the operating system could switch the process out in the middle of this. With multiple processors, the statements really could be happening concurrently.

Consider two threads running the statements that modify counter:

Thread A Thread B
A1 R0 = counter; B1 R1 = counter;
A2 R0 = R0 + 1; B2 R1 = R1 + 1;
A3 counter = R0; B3 counter = R1;

Consider one possible ordering: A1 A2 B1 A3 B2 B3 , where counter=17 before starting. Uh oh.

What we have here is a race condition. We need to make sure that when one process starts modifying counter, that it finishes before the other can try to modify it. This requires synchronization of the processes.

When we run it on bullpen, a single-processor system, the problem is unlikely to show itself - we almost certainly the correct sum when we run it. However, there is no guarantee that this would be the case. The operating system could switch threads in the middle of the load-increment-store, resulting in a race condition and an incorrect result. Try the program on bullpen with dozens of threads and you might start to run into problems.

We need to make those statements that increment counter atomic. We say that the modification of counter is a critical section.

There are many solutions to the critical section problem that take up over a week of CS 432. But for our purposes, at least for now, it is sufficient to recognize the problem, and use available tools to deal with it.

The pthread library provides a construct called a mutex that allows us to ensure that only one thread at a time is executing a particular block of code. We can use it to fix our "danger" program:

See: /usr/cs-local/share/cs338/examples/pthread_nodanger

We declare a mutex like any other shared variable. It is of type pthread_mutex_t. Four functions are used:

A few things to consider about this:

Why isn't the access to the mutex a problem? Isn't it just a shared variable itself? - Yes, it's a shared variable, but access to it is only through the pthread API. Techniques that are discussed in detail in CS 432 (and that we will discuss more here) are used to ensure that access to the mutex itself does not cause a race condition.

Doesn't that lock/unlock have a significant cost? - Let's see. We can time the programs we've been looking at:

See: /usr/cs-local/share/cs338/examples/pthread_danger_timed

See: /usr/cs-local/share/cs338/examples/pthread_nodanger_timed

Try these out. What are the running times of each version? Perhaps the cost is too much if we're going to lock and unlock that much. Maybe we shouldn't do so much locking and unlocking. In this case, we're pretty much just going to lock again as soon as we can jump back around through the for-loop again. Here's an alternative:

See: /usr/cs-local/share/cs338/examples/pthread_nodanger_coarse

In this case, the coarse-grained locking (one thread gets and holds the lock for a long time) should improve the performance significantly. How fast does it run now? But at what cost? We've completely serialized the computation! Only one thread can actually be doing something at a time, so we can't take advantage of multiple processors. If the "computation" was something more significant, we would need to be more careful about the granularity of the locking.

More pthreads

Above, there is a brief discussion of some settings we can specify in hopes of getting the best performance out of pthreads on Solaris. It is worth looking in more detail at what it means to set the concurrency level or to set the scope of a thread to process or system.

Solaris implements a complex threading mechanism. We have thought of threads being multiple paths through our single process. Ideally, if we have multiple threads ready to run and multiple processors available, the threads would be assigned to those processors to run concurrently. But in Solaris, an extra entity, a lightweight process (LWP) falls between the user threads and the CPU scheduler. It is a LWP that gets scheduled on the CPU.

The pthread functions we looked at last time have to do with how many LWPs your program has, and how your threads are assigned to those LWPs to gain access to a CPU.

The call

requests that the system create n+1 LWPs for your process. This will allow up to n+1 of your threads to be running concurrently. There is no guarantee that the system will grant this many LWPs, especially if the number is large. Generally, you should set this to the number of threads you expect to be running concurrently in your program.

The "scope" attributes indicate how you want to have your threads associated with your LWPs:

  pthread_attr_setscope(&attr, PTHREAD_SCOPE_PROCESS);

There are relative advantages and disadvantages to each approach. Allowing the thread to move among the LWPs is more flexible, preventing a situation where you have 4 LWPs and 4 processors, but all of your threads that are active and ready to do work have been assigned to the same LWP. This means only one will be able to run at a time. On the other hand, the unbound threads require an extra level of scheduling - deciding which threads go with which LWPs and deciding when to migrate threads to different LWPs.

To see how these settings come into play in a real program, I took my solution to the palindromic word finder, modified it to use FreeBSD's larger dictionary file (235,881 words instead of 25,143 in Solaris), and ran it on one of the 4-processor Solaris systems in the cluster.

When there is no call to pthread_setconcurrency(), my program's running times with unbound or bound threads using various numbers of threads:

# Threads unbound bound
1 379.1 384.0
2 194.4 197.3
3 197.3 136.8
4 197.7 99.3

And with a call to pthread_setconcurrency(n+1) for n worker threads:

# Threads unbound bound
1 367.0 357.3
2 184.7 185.0
3 132.0 127.2
4 93.4 94.9

When we don't make a call to pthread_setconcurrency(n+1), what can we conclude about the default number of LWPs?

We also see that there is no consistent trend in bound vs. unbound threads in this example. However, in cases where we have more threads than LWPs and some threads may finish their work before others on the same LWP, there is likely to be more of an advantage to unbound threads.

For a more complete discussion of this, see Chapter 5 of one of the books I have placed in the lab: Lewis, B., and Berg, D., Multithreaded Programming with Pthreads, Sun Microsystems Press, 1998.

Approaches to Data Parallel Computation

Tasks such as this week's lab program, as well as many scientific computations, can be solved using a data parallel programming style. A data parallel program is one in which each process executes the same actions concurrently, but on different parts of shared data.

Contrast this with a task parallel approach, where different processes each perform a different step of the computation on the same data. This corresponds to the pipeline model mentioned in Quinn Chapter 1.

An important consideration in a data parallel computation is load balancing - making sure that each process/thread has about the same amount of work to do. Otherwise, some would finish before others, possibly leaving available processors idle while other processors continue to work. Load balancing will be an important topic throughout the course. Parallel efficiency and scalability of a data parallel computation will be highly dependent on a good load balance.

Bag of Tasks Paradigm

One specific way of deciding which processes/threads do the operations on which parts of the data is the bag of tasks. In this case, each thread/process is a worker that finds a task (unit of work) to do (from the "bag"), does it, then goes back for more:

  while (true) {
    // get a task from the bag
    if (no tasks remain) break;
    //execute the task

A nice feature of this approach is that load balancing comes for free, as long as we have more tasks than workers. Even if some tasks cost more than others, or some workers work more slowly than others, any available work can be passed out to the first available worker until no tasks remain.

Back to our matrix multiplication example, we can break up the computation into a bag of tasks. We'll choose a fine-grained parallelization, where the computation of each entry is one of our tasks.

See: /usr/cs-local/share/cs338/examples/matmult_bagoftasks

Run this on a four-processor node. You should see that it is still pretty slow. Perhaps the granularity of the computation is too small - too much time picking out a task, not enough time doing it. We created 562,500 tasks. This means we have to acquire the lock 562,500 times. How long does this take to run?

We can easily break up our computation by row or column of the result matrix, as well. Here is a row-wise decomposition:

See: /usr/cs-local/share/cs338/examples/matmult_smallbagoftasks

You should find that this is much more efficient! We coarsened the parallelism, but kept it fine enough that all of our threads could keep busy. We still had 750 tasks in the bag. How long does this take to run?

More C

There are a few other things you should be aware of for this week's lab program.

Lab Tasks

There are several files to turn in for this assignment. They should all be included in a file named tut02.tar that you submit using the turnin utility. Please use the filenames specified and be sure to include your name in each file.

There is an online dictionary /usr/dict/words on bullpen that is used by the spell command. You know that a palindrome is a word or phrase that reads the same in either direction, i.e., if you reverse all the letters you get the same word or phrase. A word is palindromic if its reverse is also in the dictionary. For example, "noon" is palindromic, because it is a palindrome and hence its reverse is trivially in the dictionary. A word like "draw" is palindromic because "ward" is also in the dictionary.

Your task is write a C or C++ program to find all palindromic words in the dictionary. You should first write a sequential program and then parallelize it, using the bag-of-tasks paradigm. However, keep your plan for parallelization in mind when designing and implementing the sequential version. You might want to do some things in the sequential program that you will need in the parallel program, such as finding where each letter begins in the dictionary (see below for more on this).

Your sequential program should have the following phases:

The first few words in the dictionary start with numbers; you can either skip over them or process them, as you wish. None are palindromic, so this choice will not affect your total count. Some words start with capital letters (and hence the dictionary is not sorted in ASCII order). To keep your program simple, leave the capitals alone and do case-sensitive comparisons.

The sequential program is something you should have no trouble designing, but as many of you are not expert C or C++ programmers (yet), it might take longer than you expect to implement. Please ask if you run into trouble with the sequential version! You will need to get it done with enough time left to work on the parallelization - which is the whole point of the assignment!

After you have a working sequential program, modify it to use the bag-of-tasks paradigm, implemented using the pthreads library. Your parallel program should use W worker threads, where W is a command-line argument. Use the workers just for the compute phase; do the input and output phases sequentially. Each worker should count the number of palindromic words that it finds. Sum these W values during the output phase. This avoids a critical section during the compute phase - you'll need to deal with others, though. Use 26 tasks in your program, one for each letter of the alphabet. In particular, the first task is to examine all words that begin with "a" (and numbers), the second task is to examine all words that begin with "b", and so on. During the input phase you should build an efficient representation for the bag of tasks; I suggest using an array, where the value in task[0] is the index of the first "a" word, task[1] is the index of the first "b" word, and so on. You can also use this array during the search phase to limit the scope of your linear searches. Your parallel program should also time the compute phase. You may use the timer.c and timer.h code from our class examples. Start the clock just before you create the workers; read it again as soon as they have finished. Write the elapsed time for the compute phase to stdout.

To summarize, your program should have the following output:

For the timing tests, execute your parallel program on the four-processor nodes of bullpen (wetteland or rivera; ppn=4 in PBS) using using 1, 2, 3, and 4 workers. Run each test 3 times. Submit your job through PBS. You may do all tests in a single PBS script, or create a script for each run and submit them separately. In the file tut02.txt, include a table of results; it should contain all the values written to standard output (but not the words themselves) for all 12 test runs, and a brief analysis of the speedup and parallel efficiency you have achieved.

Your submitted tar file should include your Makefile, your C source code (including the timer code from class, if you choose to use it), your PBS script(s), a brief README file expaining how to run your program, and the tut02.txt file. Please do not include object files or your executable.


Honor code guidelines: While the program is to be done individually, along the lines of a laboratory program, I want to encourage you to ask questions and discuss the program with me, and with each other, as you develop it. However, no direct sharing of code is permitted. If you have any doubts, please check first and avoid honor code problems later.

Grading guidelines: Your grade for the program will be determined by correctness, design, documentation, and style, as well as the presentation of your timing results.