|
Computer Science 338 Parallel Processing Williams College Spring 2006
|
|
Tutorial Assignment 4: Jacobi Iteration with OpenMP
Due: Tuesday, February 28, 2006 at 9:00 AM
Schedule
We'll be back to a regular meeting schedule on February 28 and March 1.
Readings
This week's readings are your introduction to message passing, but you
will not do any real programming in the message passing paradigm until
next week.
Read Quinn Ch. 4.
So far we have seen three ways to create a parallel program:
- Let the compiler do whatever it can completely automatically
- Create threads explicitly using pthreads
- Specify parallel sections using OpenMP
These all suffer from one significant limitation - the cooperating
threads must be able to communicate through shared variables.
How can we design and run parallel programs to work when there is no
shared memory available?
We will now consider the message passing paradigm.
- Characteristics:
- Locality - each processor accesses only its local memory
- Explicit parallelism - messages are sent and received
explicitly - programmer controls all parallelism. The compiler
doesn't do it.
- Cooperation - every send must have a matching receive in
order for the communication to take place. Beware of
deadlock! One sided communication is possible, but doesn't really
fit the pure message-passing model.
- Advantages:
- Hardware - many current clusters of workstations and
supercomputers fit well into the message passing model.
- Functionality - full access to parallelism. We don't
have to rely on a compiler. The programmer can decide when
parallelism makes sense. But this is also a disadvantage - the full
burden is on the programmer! Advice: If the compiler can do it for
you, let it!
- Performance - data locality is important - especially in
a multi-level memory hierarchy which includes off-processor data.
There is a chance for superlinear speedup with added cache as we
talked about earlier in the course. Communication is often MUCH
more expensive than computation.
Unix programs can use fork() to create new processes. CS 432
veterans remember this from their Cow Shell projects.
The Unix system call fork() duplicates a process. The child is
a copy of the parent - in execution at the same point, the statement
after the return from fork().
The return value indicates whether you are the child or the parent:
0 means you are the child, a positive value (the child's PID) means
you are the parent, and -1 means the fork failed (process limit
reached, permission denied, etc.).
Example C program:

  #include <stdio.h>
  #include <unistd.h>

  int main() {
    pid_t pid = fork();
    if (pid > 0) {          /* parent stuff goes here */
      printf("parent: child pid is %d\n", (int) pid);
    } else if (pid == 0) {  /* child stuff goes here */
      printf("child\n");
    } else {
      perror("fork");       /* fork() failed */
    }
    return 0;
  }
A more complete program that uses fork() along with three other
system calls (wait(), getpid(), and getppid()):
See: /usr/cs-local/share/cs338/examples/forking
Processes created using fork() do not share context, and must
allocate shared memory explicitly, or rely on a form of message
passing to communicate.
We will not consider the Unix fork() method of multiprocessing
much, but here's an example to show you that in fact, even global
variables are not shared by processes that are created with fork():
See: /usr/cs-local/share/cs338/examples/what_shared
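If you run it, you will see each process working with its own copy of
the global. The idea is roughly the following (a minimal sketch of my
own, not the course example; the variable names are made up):

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/wait.h>
  #include <unistd.h>

  int counter = 0;   /* global -- copied, not shared, at fork() */

  int main() {
    pid_t pid = fork();
    if (pid == 0) {            /* child increments its own copy */
      counter++;
      printf("child sees counter = %d\n", counter);   /* prints 1 */
      exit(0);
    } else if (pid > 0) {      /* parent waits, then looks at its copy */
      wait(NULL);
      printf("parent sees counter = %d\n", counter);  /* still 0 */
    }
    return 0;
  }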
Remember that the advantage of using processes such as these instead
of threads is that the processes could potentially be running on
different systems. But if they are going to cooperate, they will need
to communicate:
- Two processes on the same system can communicate through a named
or an unnamed pipe.
- Two processes on different systems must communicate across a
network - most commonly this is done using sockets.
Sockets and pipes provide only very rudimentary interprocess
communication. Each "message" sent through a pipe or across a
socket has a unique sender and a unique receiver and is really nothing
more than a stream of bytes. The sender and receiver must impose any
structure on these communications.
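To make the "stream of bytes" point concrete, here is a minimal sketch
(my own, not one of the course examples) of a parent and child
communicating through an unnamed pipe:

  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int main() {
    int fd[2];
    char buf[32];

    pipe(fd);                  /* fd[0] is the read end, fd[1] the write end */
    if (fork() == 0) {         /* child: send a "message" */
      close(fd[0]);
      const char *msg = "hello";
      write(fd[1], msg, strlen(msg) + 1);   /* just bytes -- no structure */
      close(fd[1]);
    } else {                   /* parent: read the bytes back */
      close(fd[1]);
      read(fd[0], buf, sizeof(buf));
      printf("parent received: %s\n", buf);
      close(fd[0]);
    }
    return 0;
  }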
Here's a very simplistic example of two processes that can communicate
over raw sockets. It is included mainly to show you that you don't
want to be doing this if you can help it:
See: /usr/cs-local/share/cs338/examples/socket
For many applications, this primitive interface is unreasonable. We
want something at a higher level. Message passing libraries have
evolved to meet this need.
- Message passing is supported through a set of library routines.
This allows programmers to avoid dealing with the hardware directly.
Programmers want to concentrate on the problem they're trying to
solve, not worrying about writing to special memory buffers or making
TCP/IP calls or even creating sockets.
- Examples: P4, PVM, MPL, MPI, MPI-2, etc. MPI and PVM are
the most common.
- Common Characteristics:
- Process Management - start and stop processes, query
number of procs or PID.
- Point-to-Point Communications - send/receive between
processes
- Collective Communication - broadcast, scatter, gather,
synchronization
- Terminology:
- Buffering - copy into a buffer somewhere (in library, hardware)
- Blocking communication - wait for some "event" to complete a
communication routine
- Nonblocking communication - "post" a message and
return immediately
- Synchronous communication - special case of blocking - send
does not return until corresponding receive completes
- Asynchronous communication - essentially a synonym for nonblocking
All message passing is based on the simple send and receive operations
P0: send(addr,len,dest,tag)
P1: receive(addr,max_len,src,tag,rec_len)
These are basic components in any message-passing implementation.
There may be others introduced by a specific library.
- addr is the address of the send/receive buffer
- len is the length of the sent message
- max_len is the size of the receive buffer (to avoid overflow)
- rec_len is the length of the message actually received
- dest identifies destination of a message being sent
- src identifies desired sender of a message being received (or where it actually came from if "any source" is specified)
- tag is a user-defined identifier restricting receipt
Blocking communication has simple semantics:
- send completes when send buffers are ready for reuse,
after message received or at least copied into system buffers
- receive completes when the receive buffer's value is ready
to use
But beware of deadlock when using blocking routines!!
Proc 0 Proc 1
-------------------------------------
bsend(to 1) bsend(to 0)
brecv(from 1) brecv(from 0)
If both processors' send buffers cannot be copied into system buffers,
or if the calls are strictly synchronous, the calls will block until
the corresponding receive call is made... Neither can proceed... deadlock...
Possible solutions - reorder to guarantee matching send/receive pairs,
or use nonblocking routines...
- send or receive calls return immediately - but how
do we know when it's done? When can we use the value?
- can overlap computation and communication
- must call a wait routine to ensure communication has
completed before destroying send buffer or using receive buffer
Example:
Proc 0 Proc 1
-------------------------------------
nbsend(to 1) nbsend(to 0)
nbrecv(from 1) nbrecv(from 0)
compute...... compute......
waitall waitall
use result use result
During the "compute......" phase, it's possible that the
communication can be completed "in the background" while the
computation proceeds, so when the "waitall" lines are reached, the
program can just continue.
Deadlock is less likely but we still must be careful - the burden of
avoiding deadlock is on the programmer in a message passing model of
parallel computation.
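For reference, the reordering fix mentioned earlier looks like this in
the same notation as before:
Proc 0 Proc 1
-------------------------------------
bsend(to 1) brecv(from 0)
brecv(from 1) bsend(to 0)
Each blocking send now has a matching receive that its partner posts
without first waiting on anything else, so neither process can block
forever.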
Some common operations don't fit well into a point-to-point
communication scheme. Here, we may use collective communication
routines.
- collective communication occurs among a group of processors
- the group can be all or a subset of the processors in a computation
- collective routines are blocking
- types of collective operations
- synchronization/barrier - wait until all processors have
reached a given point
- data movement - broadcast (i.e. error condition,
distribute read-in values), scatter/gather (exchange boundary on a
finite element problem, for example), all-to-all (extreme case of
scatter/gather)
- reductions - collect data from all participating
processors and operate on it (i.e. add, multiply, min, max)
These kinds of operations can be achieved through a series of
point-to-point communication steps, but most libraries provide them
directly. Using the collective operations provided is generally better
than trying to do it yourself. In addition to being more convenient,
the message passing library can often perform the operations more
efficiently by taking advantage of low-level functionality.
The Message Passing Interface (MPI) was created by a standards
committee in the early 1990's.
- motivated by the lack of a good standard
- everyone had their own library
- PVM demonstrated that a portable library was feasible
- portability and efficiency were conflicting goals
- The MPI-1 standard was released in 1994, and many
implementations (free and proprietary) have since become available
- MPI specifies C and Fortran interfaces (125 functions in the
standard), more recently C++ as well
- parallelism is explicit - the programmer must identify
parallelism and implement a parallel algorithm using MPI constructs
- MPI-2 is an extension to the standard, developed later in the
1990's, and implementations are now available
- Rank - a unique identifier for a process
- values are 0...n-1 when n processes are used
- specify source and destination of messages
- used to control conditional execution
- Group - a set of processes, associated with a communicator
- processes in a group can take part in a collective
communication, for example
- we often use the predefined communicator MPI_COMM_WORLD, which
specifies the group of all processes in the computation
- the communicator ensures safe communication within a group -
avoid potential conflicts with other messages
- Application Buffer - application-owned memory that holds the data
to be sent or the data received
- System Buffer - system space used to hold pending messages
You already saw and executed a simple MPI "Hello, World" program
back in the first week.
Remember that you will need to run this with mpirun (on bullpen)
or mpiexec (on dhanni) if you want to get more than one process!
MPI calls and constructs in the "Hello, World" program:
- #include <mpi.h> - the standard MPI header file
- MPI_Init(int *argc, char ***argv) - MPI Initialization
- MPI_COMM_WORLD - the global communicator. Use for MPI_Comm args in most situations
- MPI_Abort(MPI_Comm comm, int rc) - MPI Abort function
- MPI_Comm_size(MPI_Comm comm, int *numprocs) - returns
the number of processes in a given communicator in numprocs
- MPI_Comm_rank(MPI_Comm comm, int *pid) - returns the
rank of the current process in the given communicator
- MPI_Get_processor_name(char *name, int *resultlen) - returns
the name of the node on which the current process is running (and the
name's length in resultlen)
- MPI_Finalize() - clean up MPI
The model of parallelism is very different from what we have seen.
All of our processes exist for the life of the program. We may not
make MPI calls before MPI_Init() or after MPI_Finalize(). We need to
think in terms of a number of copies of the same program all starting
up at MPI_Init().
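Putting those calls together, a minimal "Hello, World" along these
lines might look like this (a sketch, not necessarily identical to the
course example):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
    int numprocs, rank, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                      /* start up MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);    /* how many processes? */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);        /* which one am I? */
    MPI_Get_processor_name(name, &namelen);      /* what node am I on? */

    printf("Hello, World from process %d of %d on %s\n", rank, numprocs, name);

    MPI_Finalize();                              /* clean up MPI */
    return 0;
  }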
- MPI_Send/MPI_Recv - standard blocking calls (may have
system buffer)
- MPI_Isend/MPI_Irecv - standard nonblocking calls
- MPI_Ssend/MPI_Issend - synchronous blocking/nonblocking
send
- MPI_Bsend/MPI_Ibsend - buffered blocking/nonblocking
send - programmer allocates message buffer with MPI_Buffer_attach
- MPI_Rsend/MPI_Irsend - ready mode send - matching
receive must have been posted previously
- MPI_Sendrecv - combine send/recv into one call before
blocking
- wait calls for nonblocking communications: MPI_Wait,
MPI_Waitall, MPI_Waitsome, MPI_Waitany
- also: MPI_Probe and MPI_Test calls
A simple MPI program that sends a single message:
See: /usr/cs-local/share/cs338/examples/mpimsg
- MPI_Status status - structure which contains additional info
following a receive
- MPI_Send(void *buf, int count, MPI_Datatype type, int
dest, int tag, MPI_Comm comm) - blocking send - does not return until
the message has been received or copied into a system buffer, so that
buf may safely be reused. Sends count items of data of type type
located in buf to the process with rank dest
- MPI_Recv(void *buf, int count, MPI_Datatype type, int
src, int tag, MPI_Comm comm, MPI_Status *status) - blocking receive - does
not return until the message has been received. src may be a
specific rank or MPI_ANY_SOURCE, which matches any sender.
- MPI_Datatype examples: MPI_CHAR, MPI_INT,
MPI_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_BYTE, MPI_PACKED
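A minimal sketch of this kind of exchange (not the mpimsg example
itself), for exactly two processes:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
      value = 42;
      MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);          /* to rank 1, tag 0 */
    } else if (rank == 1) {
      MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status); /* from rank 0 */
      printf("rank 1 received %d from rank %d\n", value, status.MPI_SOURCE);
    }

    MPI_Finalize();
    return 0;
  }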
A slightly more interesting MPI program that sends one message from
each process:
See: /usr/cs-local/share/cs338/examples/mpiring
- MPI_Request request - structure which contains info
needed by nonblocking calls
- MPI_Isend(void *buf, int count, MPI_Datatype type, int
dest, int tag, MPI_Comm comm, MPI_Request *req) - nonblocking send -
returns immediately. buf must not be modified until a wait
function is called using this req
- MPI_Irecv(void *buf, int count, MPI_Datatype type, int
source, int tag, MPI_Comm comm, MPI_Request *req) - nonblocking
receive - returns immediately. buf must not be used until a wait
function is called using this req
- MPI_Wait(MPI_Request *req, MPI_Status *status) - wait
for completion of the message which had req as its request argument.
Additional info, such as the source of a message received with MPI_ANY_SOURCE, is contained in status
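A minimal sketch of the nonblocking exchange pattern from earlier, for
exactly two processes (again my own sketch, not the course example):

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
    int rank, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status statuses[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int partner = 1 - rank;      /* assumes exactly 2 processes */
    sendval = 100 * rank;

    MPI_Isend(&sendval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Irecv(&recvval, 1, MPI_INT, partner, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendval or recvval ... */

    MPI_Waitall(2, reqs, statuses);   /* now both buffers are safe to use */
    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
  }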
See: /usr/cs-local/share/cs338/examples/mpicoll
- MPI_Barrier(MPI_Comm comm) - synchronize procs
- MPI_Bcast(void *buf,int count,MPI_Datatype type,int
root,MPI_Comm comm) - broadcast - sends count copies of data
of type type located in buf on proc root to buf on all others.
- MPI_Reduce(void *sendbuf,void *recvbuf,int count,
MPI_Datatype type,MPI_Op op,int root,MPI_Comm comm) - combines
data in sendbuf on each proc using operation op and stores
the result in recvbuf on proc root
- MPI_Allreduce() - same as reduce except result is stored
in recvbuf on all procs
- MPI_Op values - MPI_MAX, MPI_MIN, MPI_SUM,
MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR,
MPI_BXOR, MPI_MAXLOC, MPI_MINLOC plus user-defined
- MPI_Scan(void *sendbuf, void *recvbuf, int count,
MPI_Datatype type, MPI_Op op, MPI_Comm comm) - parallel prefix
scan operations
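A minimal sketch combining a broadcast and a reduction (not the
mpicoll example itself): rank 0 broadcasts a value to everyone, and
the sum of all the ranks is collected back on rank 0.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[]) {
    int rank, numprocs, n = 0, sum;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (rank == 0) n = 100;                         /* e.g., a read-in value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every rank has n */

    MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
      printf("n = %d, sum of ranks 0..%d is %d\n", n, numprocs - 1, sum);

    MPI_Finalize();
    return 0;
  }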
See: /usr/cs-local/share/cs338/examples/mpiscatgath
- MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int
root, MPI_Comm comm) - root sends sendcount items from
sendbuf to each processor. Each processor receives recvcount items into recvbuf
- MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype
sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int
root, MPI_Comm comm) - each proc sends sendcount items from
sendbuf to root. root receives recvcount
items into recvbuf from each proc
- MPI_Scatterv/MPI_Gatherv work with variable-sized chunks
of data
- MPI_Allgather/MPI_Alltoall variations of scatter/gather
To understand what is going on with the various broadcast and
scatter/gather functions, consider the figure on p. 91 of the
MPI Standard.
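Here is a minimal scatter/gather sketch (not the mpiscatgath example
itself): rank 0 deals out one integer per process, each process
doubles its piece, and rank 0 gathers the results back.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char *argv[]) {
    int rank, numprocs, piece;
    int *data = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);

    if (rank == 0) {                   /* only the root holds the full array */
      data = malloc(numprocs * sizeof(int));
      for (int i = 0; i < numprocs; i++) data[i] = i + 1;
    }

    MPI_Scatter(data, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);
    piece *= 2;                        /* each process works on its own piece */
    MPI_Gather(&piece, 1, MPI_INT, data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
      for (int i = 0; i < numprocs; i++) printf("%d ", data[i]);
      printf("\n");
      free(data);
    }

    MPI_Finalize();
    return 0;
  }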
The Game of Life was invented by John Conway in 1970. The game is
played on a field of cells, each of which has eight neighbors (adjacent
cells). A cell is either occupied (by an organism) or not. The rules
for deriving a generation from the previous one are:
- Death: If an occupied cell has 0, 1, 4, 5, 6, 7, or 8 occupied
neighbors, it dies (of either boredom or overcrowding, as the case
may be)
- Survival: If an occupied cell has 2 or 3 occupied neighbors,
it survives to the next generation
- Birth: If an unoccupied cell has 3 occupied neighbors, it
becomes occupied.
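These rules boil down to counting occupied neighbors. A sketch of the
update for one interior cell (the grid representation and names here
are assumptions of mine, not taken from the course examples):

  #define N 10   /* grid dimension, assumed for this sketch */

  /* grid[i][j] is 1 if occupied, 0 if not; boundary handling omitted */
  int next_state(int grid[N][N], int i, int j) {
    int neighbors = 0;
    for (int di = -1; di <= 1; di++)
      for (int dj = -1; dj <= 1; dj++)
        if (di != 0 || dj != 0)
          neighbors += grid[i + di][j + dj];

    if (grid[i][j])
      return (neighbors == 2 || neighbors == 3);   /* survival, otherwise death */
    else
      return (neighbors == 3);                     /* birth */
  }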
The game is very interesting in that complex patterns and cycles
arise. Do a Google search to find plenty of Java applets you can try
out.
Serial version: See: /usr/cs-local/share/cs338/examples/life
MPI version: See: /usr/cs-local/share/cs338/examples/mpilife
Some Mandelbrot Set information and pictures
Java Threads version: See: /usr/cs-local/share/cs338/examples/mand-java
MPI version: See: /usr/cs-local/share/cs338/examples/mand-mpi
Improved MPI version:
See: /usr/cs-local/share/cs338/examples/mand-mpi-better
Matrix-matrix multiplication using message passing is not as
straightforward as matrix-matrix multiplication using shared memory
and threads. Why?
- Since our memory is not shared, which processes have copies of
the matrices?
- Where does the data start out? Where do we want the answer to
be in the end?
- How much data do we replicate?
- What are appropriate MPI calls to make all this happen?
Think about this example and we will discuss it in the tutorial
meeting and will consider it further in next week's readings.
Lab Tasks
You may work alone or in pairs on this week's program. Should you
choose to work in a group, you may but are not required to work with
your tutorial partner. Groups must be formed and confirmed by e-mail
no later than Friday, February 24. If you work in a group, I
recommend getting a Unix group set up to allow sharing of files
without sharing of passwords. See Mary about this.
There are several files to turn in for this assignment. They should
all be included in a file named tut04.tar that you submit using
the turnin utility. Please use the filenames specified
and be sure to include your name (and your partner's name, if you are
working in a group) in each file.
Write a C or C++ program using OpenMP that solves Laplace's equation
on a two-dimensional, uniform, square grid, using Jacobi iteration.
Don't worry if none of those terms make any sense - this document
tells you what little you need to know about the math and physics.
Some background
Laplace's equation is an elliptic partial differential equation that
governs physical phenomena such as heat. In two dimensions, it can be
written
Phi_xx + Phi_yy = 0.
Given a spatial region and values for points on the boundaries of the
region, the goal is to approximate the steady-state solution for
points in the interior. We do this by covering the region with an
evenly-spaced grid of points. An 8 × 8 grid would look like this:
* * * * * * * * * *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* . . . . . . . . *
* * * * * * * * * *
The 8 × 8 grid represented by the dots is surrounded by a layer
of boundary points, represented by *'s. Each interior point is
initialized to some value. Boundary points remain constant throughout
the simulation. The steady-state values of interior points are
calculated by repeated iterations. On each iteration, the new value
of a point is set to a combination of the old values of neighboring
points. The computation terminates either after a given number of
iterations or when every new value is within some acceptable
difference eps of every old value.
There are several iterative methods for solving Laplace's equation.
Your program is to use Jacobi iteration, which is the simplest and
most easily parallelized, though certainly not the most efficient in
terms of convergence rate.
In Jacobi iteration, the new value for each grid point in the interior
is set to the average of the old values of the four points left,
right, above, and below it. This process is repeated until the
program terminates. Note that some of the values used for the average
will be boundary points.
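In code, one sweep over the interior looks roughly like this (just a
sketch of the averaging rule with assumed array names, not a complete
solution; the outer layer of oldgrid holds the fixed boundary values):

  for (int i = 1; i <= n; i++)
    for (int j = 1; j <= n; j++)
      newgrid[i][j] = (oldgrid[i-1][j] + oldgrid[i+1][j] +
                       oldgrid[i][j-1] + oldgrid[i][j+1]) / 4.0;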
What to do and how to do it
Your submitted tar file should include the tut04.txt file, an
appropriate Makefile, your C or C++ source code (including the
timer code from class, if you choose to use it), your PBS script(s),
and a brief README file explaining how to run your program, along
with your answer. Please do not include object files or your
executable.
Honor code guidelines: While the program is to be done only by
you (meaning your group, if you choose to work in a group), along the
lines of a laboratory program, I want to encourage you to ask
questions and discuss the program with me, our TA, and with classmates
outside your group, as you develop it. However, no sharing of code
between groups is permitted. If you have any doubts, please check
first and avoid honor code problems later.
Grading guidelines: Your grade for the programming will be
determined by correctness, design, documentation, and style, as well
as the presentation of your timing results.