Lecture 1 - Introduction and Overview


Welcome to CS 338, Parallel Processing.

Agenda

Announcements

Syllabus and Course Ground Rules

Please read the syllabus carefully. If there are questions, ask!

Why Parallel Computing?

Let's consider a few "real world" examples:

  • Taking a census of Williamstown.

    One person doing this would visit each house, count the people, and ask whatever questions are supposed to be asked, keeping running counts along the way. At the end, this one person has gathered everything.

    If there are two people, they can work concurrently. Each visits some houses, and they need to "report in" along the way or at the end to combine their information. But how to split up the work?

    • Each person could do what the individual was originally doing, but would check to make sure each house along the way had not yet been counted.
    • Each person could start at the town hall, get an address that has not yet been visited, go visit it, then go back to the town hall to report the result and get another address to visit. Someone at town hall keeps track of the cumulative totals. This is nice because neither person will be left without work to do until the whole thing is done. This is the master-slave method of breaking up the work (a small MPI sketch of this idea follows this example).
    • The town could be split up beforehand. Each could get a randomly selected collection of addresses to visit. Maybe one person takes all houses with even street numbers and the other all houses with odd street numbers. Or perhaps one person would take everything north of Route 2 and the other everything south of Route 2. The choice of how to divide up the town may have a big effect on the total cost. There could be excessive travel if one person walks right past a house that has not yet been visited. Also, one person could finish completely while the other still has a lot of work to do. This is a domain decomposition approach.
    • If there are a few dozen people, which approaches work well? How about 1000 people? 20,000 people?
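
    As a first taste of what the master-slave division of labor looks like in code, here is a minimal sketch written with MPI, the message-passing library we will use later in the course. The number of houses, the message tags, and the count_residents function are invented for illustration; this is a sketch of the pattern, not a finished program.

      #include <stdio.h>
      #include <mpi.h>

      #define NUM_HOUSES 100      /* hypothetical number of addresses to visit */
      #define TAG_WORK   1
      #define TAG_STOP   2

      /* stand-in for visiting a house and counting its residents */
      static int count_residents(int house) {
          return house % 4;
      }

      int main(int argc, char *argv[]) {
          int rank, size;

          MPI_Init(&argc, &argv);
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);
          MPI_Comm_size(MPI_COMM_WORLD, &size);

          if (rank == 0) {
              /* master ("town hall"): hand out one address at a time */
              int next = 0, total = 0, active = 0, result, w;
              MPI_Status status;

              for (w = 1; w < size; w++) {
                  if (next < NUM_HOUSES) {
                      MPI_Send(&next, 1, MPI_INT, w, TAG_WORK, MPI_COMM_WORLD);
                      next++;
                      active++;
                  } else {
                      MPI_Send(&next, 0, MPI_INT, w, TAG_STOP, MPI_COMM_WORLD);
                  }
              }
              while (active > 0) {
                  /* take a report from whichever worker finishes first */
                  MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, TAG_WORK,
                           MPI_COMM_WORLD, &status);
                  total += result;
                  if (next < NUM_HOUSES) {
                      MPI_Send(&next, 1, MPI_INT, status.MPI_SOURCE, TAG_WORK,
                               MPI_COMM_WORLD);
                      next++;
                  } else {
                      MPI_Send(&next, 0, MPI_INT, status.MPI_SOURCE, TAG_STOP,
                               MPI_COMM_WORLD);
                      active--;
                  }
              }
              printf("Census total: %d people\n", total);
          } else {
              /* worker (census taker): ask for addresses until told to stop */
              int house, result;
              MPI_Status status;

              while (1) {
                  MPI_Recv(&house, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD,
                           &status);
                  if (status.MPI_TAG == TAG_STOP)
                      break;
                  result = count_residents(house);
                  MPI_Send(&result, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD);
              }
          }

          MPI_Finalize();
          return 0;
      }

    Just as with the two census takers, no worker sits idle while unvisited addresses remain, which is the main appeal of this approach.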

  • Grading a stack of exams. Suppose each has several questions. Again, assume two graders to start.
    • Each person could take half of the stack. Simple enough. But we still have the potential of one person finishing before the other.
    • Each person could take a paper from the "ungraded" stack, grade it, then put it into the "graded" stack.
    • Perhaps it makes more sense to have each person grade half of the questions instead of half of the exams, maybe because it would be unfair to have the same question graded by different people. Here, we could use variations on the approaches above. Each takes half the stack, grades his own questions, then they swap stacks.
    • Or we form a pipeline, where each exam goes from one grader to the next to the finished pile. Some time is needed to start up the pipeline and drain it out, especially if we add more graders. These models could be applied to the census example, if different census takers each went to every house to ask different questions.
    • Suppose we also add in a "grade totaler and recorder" person. Does that make any of the approaches better or worse?
  • Adding two 1,000,000 × 1,000,000 matrices.
    • Each matrix entry in the sum can be computed independently, so we can break this up any way we like. Could use the master-slave approach, though a domain decomposition would probably make more sense. Depending on how many processes we have, we might break it down by individual entries, or maybe by rows or columns (a sketch of a row-wise decomposition appears below).
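
    As a concrete illustration of the domain decomposition idea, here is a sketch of a row-wise split for the matrix addition. Every process runs this same code but computes only its own block of rows; the function name and the assumption that the number of processes evenly divides the matrix dimension are purely for illustration.

      /* Row-wise domain decomposition of C = A + B: every process executes
         this same function, but only computes its own contiguous block of
         rows.  rank and nprocs would come from the parallel environment
         (e.g., MPI_Comm_rank / MPI_Comm_size); we assume nprocs divides n. */
      void add_my_rows(int n, int rank, int nprocs,
                       double **a, double **b, double **c)
      {
          int rows_per_proc = n / nprocs;
          int first = rank * rows_per_proc;        /* my first row */
          int last  = first + rows_per_proc;       /* one past my last row */
          int i, j;

          for (i = first; i < last; i++)
              for (j = 0; j < n; j++)
                  c[i][j] = a[i][j] + b[i][j];
      }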
    In each of these cases, we have taken what we might normally think of as a sequential process and taken advantage of the availability of concurrent processing to make use of multiple workers (processing units).

    Parallelism adds complexity (as we will see in great detail), so why bother?

  • want to solve the same problem but in a shorter time than possible on one processor
  • want to solve larger problems than can currently be solved at all on a single processor
  • some algorithms are more naturally expressed or organized concurrently
Some Basics

    Sequential Program: sequence of actions that produce a result (statements + variables), called a process, task, or thread (of control). The state of the program is determined by the code, data, and a single program counter.

    Concurrent Program: two or more processes that work together. Big difference: multiple program counters.

    To cooperate, the processes need communication and synchronization, which can be achieved through shared variables or message passing.
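
    A tiny illustration of the shared-variable flavor: two threads add their private counts into one shared total, with a mutex providing the synchronization. (With message passing, each process would instead keep a private total and explicitly send it to a partner, e.g., with MPI.) The thread library used here (POSIX threads) and the numbers are chosen just for illustration.

      #include <stdio.h>
      #include <pthread.h>

      int total = 0;                                /* shared variable */
      pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

      void *worker(void *arg) {
          int my_count = *(int *) arg;              /* this thread's private count */
          pthread_mutex_lock(&lock);                /* synchronize the shared update */
          total += my_count;
          pthread_mutex_unlock(&lock);
          return NULL;
      }

      int main(void) {
          pthread_t t1, t2;
          int count1 = 17, count2 = 25;             /* made-up partial counts */

          pthread_create(&t1, NULL, worker, &count1);
          pthread_create(&t2, NULL, worker, &count2);
          pthread_join(t1, NULL);
          pthread_join(t2, NULL);

          printf("total = %d\n", total);
          return 0;
      }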

    Hardware to run concurrent processes:

  • single processor - logical concurrency (see CS 432)
  • multiprocessor - shared memory
  • multicomputer - separate memories
  • network - slower communication
    Computers may be classified as:

  • SISD: single instruction, single data - one processor doing one thing at a time to one piece of data at a time
  • SIMD: single instruction, multiple data - multiple processors all doing the same thing at the same time, but operating on different data. a.k.a. vector computers. Program operates in "lock step" on each processor.
  • MIMD: multiple instruction, multiple data - multiple processors each doing their own thing.
  • SPMD: single program, multiple data - not really a classification of the computer, but of a model used to program a MIMD computer. Multiple processors run the same program, but do not operate in lock step. a.k.a. the "interacting peers" model. This is the model we will use most in this class.
    Some examples:

  • SISD: Squall, Williams College: Macintosh PowerBook, 800 MHz PowerPC G4
  • MIMD: Bullpen Cluster, Williams College: 8-node Sun cluster, total of 20 450 MHz UltraSparc II processors (plus some older stuff)
  • MIMD: ASCI Red, Sandia National Labs: 4600+ nodes, each with 2 Intel Pentium II Xeon processors, first TeraOp machine in 1997
  • MIMD: ASCI White, LLNL: 512 nodes, each with 16 Power3 Nighthawk-2 processors, 12 TeraOps total, was number 1 until 2002
  • Hybrid: Earth Simulator, Yokohama Institute for Earth Sciences, Japan: 640-node NEC system, each node with 8 vector processors, total of 5,120 CPUs, peak performance of 40 TeraOps
  • See http://www.top500.org/.

How to Achieve Parallelism

  • Need to determine where concurrency is possible, then break up the work accordingly
  • Easiest if a compiler can do this for you - take your sequential program and extract the concurrency automatically. This is sometimes possible, especially with fixed-size array computations.
  • If the compiler can't do it, it is possible to give "hints" to the compiler to tell it what is safe to parallelize (an example follows this list).
  • But often, the parallelization must be done explicitly: the programmer has to create the threads or processes, assign work to them, and manage necessary communication.
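
    One common way to give such a hint, assuming a compiler that understands OpenMP, is a compiler directive asserting that the iterations of a loop are independent and therefore safe to run in parallel. The array size here is made up for illustration.

      #include <stdio.h>

      #define N 1000000

      int main(void) {
          static double a[N], b[N], c[N];
          int i;

          for (i = 0; i < N; i++) {
              a[i] = i;
              b[i] = 2 * i;
          }

          /* the hint: these iterations are independent, so run them in parallel */
          #pragma omp parallel for
          for (i = 0; i < N; i++)
              c[i] = a[i] + b[i];

          printf("c[%d] = %f\n", N - 1, c[N - 1]);
          return 0;
      }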
The Bullpen Cluster

    For part of Homework 1, you will need to use the Bullpen Cluster, located here in the department. You will need to compile and run a parallel version of a "hello, world" program that uses the Message Passing Interface (MPI) to start the processes and for interprocess communication. We will look at MPI in great detail throughout the semester.

    MPI program demo.
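
    A minimal sketch of what such an MPI "hello, world" program might look like (the exact compile and run commands depend on the cluster's MPI installation, typically mpicc and mpirun):

      #include <stdio.h>
      #include <mpi.h>

      int main(int argc, char *argv[]) {
          int rank, size;

          MPI_Init(&argc, &argv);                   /* start up MPI */
          MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* which process am I? */
          MPI_Comm_size(MPI_COMM_WORLD, &size);     /* how many processes in total? */

          printf("Hello, world, from process %d of %d\n", rank, size);

          MPI_Finalize();                           /* shut down MPI */
          return 0;
      }

    Every process runs this same program (the SPMD model described above), and each reports its own rank.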