Lecture 23 - Architecture-Aware and Mixed-Mode Programming, Review


Agenda

Announcements

Mixed-Mode Programming

We looked at the three-dimensional cellular automaton program last time. That version used MPI to communicate among processes and pthreads to parallelize the actual computation step.

A more popular method for achieving this kind of "hybrid" shared memory/message passing program is to use MPI with OpenMP.

Much information about this programming model, its costs and its benefits, can be found on the web; a search for hybrid MPI/OpenMP programming turns up plenty of material.

The biggest cost is that programmers need to learn two models of parallel computation and apply them correctly to their (likely large and complex) programs. The advantage is a program that might be more efficient than a straight MPI-only program.

Some reasons:

As someone suggested on Tuesday, ideally a single programming model would allow the programmer to specify what can be parallelized and the compiler or run time system would use threads or processes with message passing as appropriate.

For now, there seem to be three approaches:

  1. program the hybrid model directly (IBM, among others, recommends this); a minimal sketch appears after this list
  2. use MPI only and count on the developers of the MPI implementation to "do the right thing" in different circumstances. This is hard because the address space of each process is independent, and messages still need to be implemented as one or more memory-to-memory copies
  3. program using a shared memory model and have the system generate messages automatically for non-local memory accesses. This places a significant burden on the compiler and the run-time system. It also means that programmers may not program with data locality in mind (see below)
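
As a concrete illustration of the first approach, here is a minimal sketch of a hybrid MPI + OpenMP program in the spirit of the cellular automaton example: MPI exchanges ghost cells between processes while OpenMP threads split the local update loop. The update rule, array sizes, and step count are made up for illustration, and the sketch assumes an MPI library that supports MPI_THREAD_FUNNELED.

  /* A minimal hybrid MPI + OpenMP sketch of a 1-D "update each cell from
   * its neighbors" step.  MPI moves ghost cells between processes; OpenMP
   * threads parallelize the local computation.  Build with something like
   * mpicc -fopenmp ca_hybrid.c -o ca_hybrid                               */
  #include <mpi.h>
  #include <string.h>

  #define LOCAL_N 100000            /* cells owned by each process */
  #define STEPS   100

  static double cells[LOCAL_N + 2]; /* [0] and [LOCAL_N+1] are ghost cells */
  static double next[LOCAL_N + 2];

  int main(int argc, char **argv)
  {
      int provided, rank, size;

      /* FUNNELED: only the main thread makes MPI calls. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
      int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

      for (int step = 0; step < STEPS; step++) {
          /* Message passing: exchange boundary cells with the neighbors. */
          MPI_Sendrecv(&cells[1], 1, MPI_DOUBLE, left, 0,
                       &cells[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          MPI_Sendrecv(&cells[LOCAL_N], 1, MPI_DOUBLE, right, 1,
                       &cells[0], 1, MPI_DOUBLE, left, 1,
                       MPI_COMM_WORLD, MPI_STATUS_IGNORE);

          /* Shared memory: threads split the local computation step. */
          #pragma omp parallel for
          for (int i = 1; i <= LOCAL_N; i++)
              next[i] = 0.5 * (cells[i - 1] + cells[i + 1]);  /* toy rule */

          memcpy(cells, next, sizeof(cells));
      }

      MPI_Finalize();
      return 0;
  }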

Architecture-Aware Computing

Run-time Discovery of a Parallel Architecture

See notes from Lecture 22 (Zoltan machine model, NWS)

Another Possibility to Account for Heterogeneous Networks

Some load-balancing algorithms (particularly the geometric algorithms) can use only a single "computing power" value per node to decide how much work to assign to a process. Can we distill the information from the machine model (both processor and network speeds) down to a single "power" number that indicates what percentage of the work to give to each process?

A function to do this must account for the time for a node to do all of its work based on the computing power of the node (comp), the bandwidth of the node (band), the computational work (work), and the communication time (comm). The communication time is proportional to the amount of work; thus, comm = tau * work, where tau represents the ratio of communication to computation for a particular application. This value can be selected empirically, but ideally would be estimated automatically, e.g., by using agents.

The time time_i to complete the work at node i at a given tree level is

  time_i = work_i/comp_i + comm_i/band_i
         = work_i/comp_i + (tau * work_i)/band_i
         = work_i * (1/comp_i + tau/band_i).

Since all nodes finish at the same time in a balanced computation, time_i should be a constant, i.e.,

  work_i * (1/comp_i + tau/band_i) = C.

After determining work_i (using, e.g., an equilibration algorithm), any load-balancing algorithm capable of supporting generalized partitions can be used to distribute the load. The workload distribution should also be constrained by the amount of available memory on each computing node, but this has not yet been included.
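
As a sketch of how these formulas turn into work percentages, the following function (all names and numbers are illustrative, not part of Zoltan or the machine model interface) collapses each node's comp and band values into a single power number, 1/(1/comp_i + tau/band_i), and normalizes the powers to fractions of the total work.

  /* Sketch: distill per-node processor speed (comp) and bandwidth (band)
   * into a single "power" number and then into the fraction of the total
   * work each node should receive.  From time_i = work_i * (1/comp_i +
   * tau/band_i) = C, work_i is proportional to 1/(1/comp_i + tau/band_i). */
  #include <stdio.h>

  void work_fractions(int n, const double comp[], const double band[],
                      double tau, double frac[])
  {
      double total = 0.0;
      for (int i = 0; i < n; i++) {
          frac[i] = 1.0 / (1.0 / comp[i] + tau / band[i]);   /* "power" */
          total  += frac[i];
      }
      for (int i = 0; i < n; i++)
          frac[i] /= total;        /* fraction of the work for node i */
  }

  int main(void)
  {
      /* Two fast nodes on a fast network and one slower, poorly connected
       * node; the numbers are made up for illustration.                  */
      double comp[] = { 2.0, 2.0, 1.0 };
      double band[] = { 10.0, 10.0, 1.0 };
      double frac[3];

      work_fractions(3, comp, band, 0.5, frac);
      for (int i = 0; i < 3; i++)
          printf("node %d gets %.1f%% of the work\n", i, 100.0 * frac[i]);
      return 0;
  }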

This approach may not work in all situations. Consider a system consisting of four clusters, each having four computing nodes. The nodes within each cluster are connected by a fast network, while the clusters are interconnected by a slower network. All computing nodes have equal processing power. Since the entire system is symmetric, work_i is the same for all compute nodes, and the network hierarchy does not come into play. Each node ends up being assigned the same amount of work - resulting in the same partitioning as we had when not considering network heterogeneities at all.

Hierarchical Partitioning and Load Balancing

A machine model tree can be used to drive a hierarchical partitioning or load-balancing procedure. This has not been implemented, but how might it work?

Consider the examples we looked at last time, specifically the cluster of 4 4-way SMP nodes. Suppose that, as in that example, we want to use ParMETIS to partition among the 4 nodes, then use inertial bisection within each node to achieve the final partitions.

The procedure would work its way down the machine model tree. First, the 16 processes need to cooperate to use ParMETIS to achieve 4 partitions, each of which is distributed across 4 processes. This is different from what we had done before - how can a partition live on 4 processes? To do this, a "representative" process must be selected for each of the 4 groups. That representative is responsible for making decisions about what needs to happen with the objects on its group's processes. However, it is not feasible to have all of the objects actually migrate to the representative process.

Then, once the top-level decomposition has been determined, each team of 4 processes computes a 4-way partitioning using inertial bisection. At this level, it will be a more typical procedure, where each process is responsible for its own objects.

An important consideration here is that the actual objects (mesh elements) cannot be migrated between the first and second phases; we migrate only their Zoltan names. Only when the final phase is complete does the application get the new decomposition and migrate the mesh accordingly.
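
A hedged sketch of how the two phases might be organized with MPI communicators follows. The two partitioning calls are empty stand-ins, not the real Zoltan or ParMETIS interfaces, and the contiguous rank-to-node mapping is an assumption; the point is how a team communicator and a representative per node scope each phase.

  /* Sketch of the two-phase hierarchical procedure for a cluster of 4
   * 4-way SMP nodes (16 MPI processes).  The partition functions are
   * empty stand-ins, not real Zoltan/ParMETIS calls.                     */
  #include <mpi.h>

  static void top_level_partition(MPI_Comm reps)   { (void)reps; }
  static void within_node_partition(MPI_Comm team) { (void)team; }

  int main(int argc, char **argv)
  {
      int rank, node, node_rank;
      MPI_Comm node_comm, rep_comm;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Group processes into teams of 4, one team per SMP node (assumes
       * a contiguous rank-to-node mapping; a real code would consult the
       * machine model tree instead).                                     */
      node = rank / 4;
      MPI_Comm_split(MPI_COMM_WORLD, node, rank, &node_comm);
      MPI_Comm_rank(node_comm, &node_rank);

      /* Phase 1: rank 0 of each team is the "representative" that takes
       * part in the top-level 4-way partitioning; only Zoltan names, not
       * mesh elements, would be migrated at this stage.                  */
      MPI_Comm_split(MPI_COMM_WORLD,
                     (node_rank == 0) ? 0 : MPI_UNDEFINED, rank, &rep_comm);
      if (rep_comm != MPI_COMM_NULL)
          top_level_partition(rep_comm);

      /* Phase 2: each team partitions its piece among its own 4 processes,
       * e.g., with inertial bisection; here each process owns its objects. */
      within_node_partition(node_comm);

      if (rep_comm != MPI_COMM_NULL)
          MPI_Comm_free(&rep_comm);
      MPI_Comm_free(&node_comm);
      MPI_Finalize();
      return 0;
  }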

Other Architecture-Aware Computing Issues

We recently looked at several of the latest architectures.

Today's supercomputers tend to be large clusters with complex network and memory hierarchies. There are the unusual things like HTMT and the Tera MTA systems, but for the most part, they're really just big clusters.

Since most systems have distributed memory, message passing is the most common programming paradigm, though there is a significant contingent working with OpenMP. SGI's approach is to present its Origin systems, which have distributed memory at some level, as shared-memory machines.

Even on the Origin, which you recall is a Cache-Coherent Non-Uniform Memory Access (CC-NUMA) system, there is some benefit to running an MPI program instead of a multithreaded program. Since MPI processes do not share memory, they can enhance the locality of a program.

For a computation that accumulates some value, it is tempting in a multithreaded program to write something like this inside a loop:

  total = total + contribution;

with protection from a pthread mutex or an OpenMP critical section, as appropriate. In this case, there is obvious contention for access to the shared variable.

An MPI program, in contrast, would accumulate local sums and use a global reduction to get the value.

The multithreaded program could be recast to have a private variable for each thread that gets combined at the end, as well.
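
Here is a small sketch of these three variants, assuming compilation with something like mpicc -fopenmp; the loop body is a toy contribution chosen only for illustration.

  /* The three accumulation styles discussed above.                         */
  #include <mpi.h>
  #include <stdio.h>

  #define N 1000000

  /* (a) Shared total updated inside the loop: correct when protected by a
   *     critical section, but every thread contends for the same variable. */
  double shared_total(void)
  {
      double total = 0.0;
      #pragma omp parallel for
      for (int i = 0; i < N; i++) {
          double contribution = 1.0 / (i + 1.0);   /* toy per-item work */
          #pragma omp critical
          total = total + contribution;
      }
      return total;
  }

  /* (b) Private per-thread sums combined at the end: OpenMP's reduction
   *     clause performs exactly the "private variable per thread" recast.  */
  double reduced_total(void)
  {
      double total = 0.0;
      #pragma omp parallel for reduction(+:total)
      for (int i = 0; i < N; i++)
          total += 1.0 / (i + 1.0);
      return total;
  }

  int main(int argc, char **argv)
  {
      int rank;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* (c) MPI style: each process accumulates a purely local sum, then
       *     one global reduction combines the per-process values.          */
      double local = reduced_total(), global;
      MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

      if (rank == 0)
          printf("critical: %f  reduction: %f  global: %f\n",
                 shared_total(), reduced_total(), global);

      MPI_Finalize();
      return 0;
  }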

Here, the locality gives a clear advantage. It is likely that better locality will help us throughout the entire memory hierarchy:

Even in the single-process, single-processor world, this is becoming more of an issue as CPU/register speeds increase much more quickly than memory speeds, and more levels of cache become available.

With a distributed-memory computation, the difference in speed between the fast and the slow parts of memory is likely to be one of the most significant factors.

What can be done?

What have we done?

We've looked at an overview of parallel programming: mostly how to write parallel programs in the early part of the course, how to parallelize scientific computations in the middle of the course, a look at a few parallel algorithms using models like the PRAM, and then back to current issues in cluster computing, supercomputing, and internet computing.

Some of the issues we haven't looked at much will be covered by talks from some of you next week:

Some other interesting issues: