Lecture 22 - Architecture-Aware Computing


Agenda

Announcements

Architecture-Aware Computing

Heterogeneous Processor Speeds

We discussed this last week. Here is an example of a simple computation that was modified to run more efficiently on a combination of the "fast" and "slow" nodes of the bullpen cluster.

The problem is a three-dimensional cellular automata simulator. We saw two-dimensional analogs of this earlier in the semester. In this case, the program computes the value of each cell in the next generation based on a rule that combines its current value and the values of its neighbors. The neighborhood can be defined by the 6 neighbors that differ in one dimension, the 18 neighbors that differ in one or two dimensions, or the 26 neighbors that differ in one, two, or three dimensions.

The computation is broken up, much like our earlier two-dimensional examples, by slices. However, in this case, the slices are two-dimensional. At each stage, each process must exchange the boundary cell values with its neighbors.

initialize
while (more iterations)
  send top slice to proc i-1
  send bottom slice to proc i+1
  recv top overlap from proc i-1
  recv bottom overlap from proc i+1
  wait for all sends/recvs
  compute local slices
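
To make the pattern concrete, here is a rough sketch of how this loop might look with non-blocking MPI calls. The buffer names, the slice size NX*NY, and the compute_local_slices routine are illustrative assumptions, not the actual course code.

/* Rough sketch (not the actual simulator): one process's halo exchange
   and compute loop using non-blocking MPI.  Each 2-D slice is NX*NY cells. */
#include <mpi.h>

void compute_local_slices(void);   /* hypothetical local update routine */

void simulate(int rank, int size, int num_iterations,
              double *top_slice, double *bottom_slice,
              double *ghost_top, double *ghost_bottom, int NX, int NY)
{
    int up   = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
    int down = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int iter = 0; iter < num_iterations; iter++) {
        MPI_Request req[4];

        /* post receives for the ghost (overlap) slices from the neighbors */
        MPI_Irecv(ghost_top,    NX * NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Irecv(ghost_bottom, NX * NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);

        /* send our top and bottom boundary slices to the neighbors */
        MPI_Isend(top_slice,    NX * NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
        MPI_Isend(bottom_slice, NX * NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

        /* wait for all sends/recvs, then compute the next generation locally */
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
        compute_local_slices();
    }
}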

Normally, each process is assigned the same number of slices, as the amount of work associated with each slice should be the same. In the presence of heterogeneous processor speeds, this partitioning will lead to a load imbalance. The entire computation will proceed at the pace set by the slowest processor.

Our target environment for a preliminary experiment is the bullpen cluster - in particular, we wish to run the program on a mixture of "fast" and "slow" nodes. The slow nodes are Ultra 10 workstations with 300 MHz or 333 MHz UltraSPARC II processors, and the fast nodes are server-class systems, each with one or more 450 MHz processors.

First, we need to determine the relative speeds of the processors. Knowing the clock speeds alone does not tell us how the actual performance of this program will vary. Differences in memory speed, cache size, and other factors may contribute, along with the processor speed itself.

To determine this, we ran a single-process version of the program on each of the nodes. In fact, the 450-MHz nodes are 1.5 times faster at computing this problem with this program than the 300-MHz nodes. So we will try to assign 1.5 times as many slices to processes assigned to fast nodes.

The full version of the program uses a machine model and determines the relative processor speeds automatically, but for the tests run here, the processor weights are entered manually. The number of slices to assign to each processor is determined from these weights.
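
As a rough illustration of the idea (not the actual implementation), the slice counts might be derived from the weights like this:

/* Hypothetical sketch: divide num_slices among nprocs processors in
   proportion to their weights (e.g., 1.5 for fast nodes, 1.0 for slow). */
void assign_slices(int num_slices, int nprocs,
                   const double *weight, int *slices)
{
    double total_weight = 0.0;
    for (int p = 0; p < nprocs; p++)
        total_weight += weight[p];

    int assigned = 0;
    for (int p = 0; p < nprocs; p++) {
        slices[p] = (int)(num_slices * weight[p] / total_weight);
        assigned += slices[p];
    }

    /* hand out any leftover slices (lost to rounding down) round-robin */
    for (int p = 0; assigned < num_slices; p = (p + 1) % nprocs) {
        slices[p]++;
        assigned++;
    }
}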

When we do this for 4 processors (2 fast, 2 slow), running on various problem sizes, we get the following execution times:
Size          Even Distribution   System-Sensitive Distribution
20x20x20            9.57422               8.11328
40x40x40           65.1445               52.4648
60x60x60          215.328               187.902
80x80x80          509.801               409.824
100x100x100       998.957               812.008

In some related work, Sinha and Parashar describe a 45% decrease in application execution time when using system-sensitive load balancing on artificially-loaded systems.

Heterogeneous and Hierarchical Networks

Accounting for processing speed heterogeneities alone can be beneficial, but network speeds likely play a significant role as well.

The CMAME paper describes tests in which the same computation was run on a number of systems with different networking capabilities. Not surprisingly, computations run on systems with faster networks ran more quickly overall.

[Figures: communication across the high-speed SP switch compared with communication across the Ethernet connecting the same nodes]



Some conclusions:

We always want to minimize communication, but in the face of a slow network we may want to do so even at the cost of some computational balance. For example, consider 4 different 16-processor systems and a mesh of 1,103,018 tetrahedral elements that represents part of the human arterial system.

We use inertial bisection as a fast algorithm that produces very strictly balanced partitions but relatively large partition boundaries, and ParMETIS as a more expensive algorithm that produces very small partition boundaries, but at times does so by sacrificing computational balance.

This sort of hierarchical partitioning is difficult and is not an automated process. It is also unclear which algorithms, or which parameters to those algorithms, should be chosen for different architectures.

Using Weighted Networks in a Graph Partitioner

Walshaw and Cross describe how they account for network heterogeneity in the Jostle graph partitioner. You will recall from earlier in the semester that Jostle is one of the graph partitioners commonly used to partition meshes.

The goal of the usual Jostle multilevel partitioner is to create partitions of the graph such that each partition contains the same weighted number of graph vertices, while minimizing the "edge cut weight" - the total weight of the graph edges whose end vertices are assigned to different partitions. In the unweighted case this is simply the number of cut edges; when edges carry weights, a heavily-weighted edge contributes a larger penalty to the edge cut weight if it is cut.

In the presence of heterogeneous networks, it makes sense for edges cut across a slow network interface to contribute more to a more general cost function than edges cut across a faster network interface.
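
One way to picture such a generalized cost function (an illustrative sketch, not Jostle's exact formulation) is to charge each cut edge its own weight multiplied by the network cost between the two partitions it spans:

/* Illustrative sketch: network-weighted edge cut.
   part[v] is the partition of vertex v, net_cost[a*nparts+b] is the
   network cost matrix entry for partitions a and b, and edge e connects
   edge_u[e] and edge_v[e] with weight edge_wgt[e]. */
double weighted_edge_cut(int num_edges, const int *edge_u, const int *edge_v,
                         const double *edge_wgt, const int *part,
                         const double *net_cost, int nparts)
{
    double cost = 0.0;
    for (int e = 0; e < num_edges; e++) {
        int pu = part[edge_u[e]], pv = part[edge_v[e]];
        if (pu != pv)                              /* edge is cut */
            cost += edge_wgt[e] * net_cost[pu * nparts + pv];
    }
    return cost;
}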

There are two times when this information could be taken into account:

  1. during the refinement/optimization operation during the graph partitioning
  2. after the partitions have been created, when assigning partitions to actual processing nodes

Before we can do the actual network-aware partitioning, we need some sort of weights on the network links. This is done by assigning values in a network cost matrix (see the paper for examples). One possible way to assign costs is with a quadratic path length (QPL) penalizing communication across multiple interfaces, specifically slow interfaces.
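
As a very rough sketch of the idea only (the exact QPL formula used by Jostle is in the paper; this is just one plausible interpretation), the cost between two processors might be taken as the square of the summed link costs along the route between them, so multi-hop routes and slow links are penalized superlinearly:

/* Rough sketch: build a network cost matrix by squaring the summed
   per-link costs along the path between each pair of processors.
   num_hops() and link_cost_on_path() are hypothetical helpers describing
   the route from processor a to processor b. */
int    num_hops(int a, int b);
double link_cost_on_path(int a, int b, int hop);

void build_cost_matrix(int nprocs, double *net_cost /* nprocs x nprocs */)
{
    for (int a = 0; a < nprocs; a++) {
        for (int b = 0; b < nprocs; b++) {
            double path = 0.0;
            for (int hop = 0; hop < num_hops(a, b); hop++)
                path += link_cost_on_path(a, b, hop);
            net_cost[a * nprocs + b] = path * path;   /* quadratic penalty */
        }
    }
}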

Given network costs, we can make use of them in the partitioning procedure.

The graph coarsening and the initial partitioning of the coarse graph do not make use of the network costs; these come into play during the refinement/optimization phase. Normally, the optimization looks at vertices on partition boundaries as candidates to move to another partition. A gain function is computed for each possible vertex movement, and vertices are moved so as to make the most improvement to the cost function. It is in these gain functions that the network costs are taken into account. For details, see the paper.
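
To see where the network costs might enter, here is an illustrative sketch of a gain computation (not Jostle's actual formulation): the gain of moving a boundary vertex v to a candidate partition q is the reduction in the network-weighted edge cut over v's incident edges.

/* Illustrative sketch: gain of moving vertex v from its current partition
   to partition q, measured as the reduction in the network-weighted edge
   cut over v's incident edges.  adj[]/adj_wgt[] list v's neighbors and
   edge weights; we assume net_cost is zero on its diagonal (no cost for
   edges internal to a partition). */
double move_gain(int v, int q, int degree, const int *adj,
                 const double *adj_wgt, const int *part,
                 const double *net_cost, int nparts)
{
    double gain = 0.0;
    int p = part[v];                          /* v's current partition */
    for (int i = 0; i < degree; i++) {
        int pu = part[adj[i]];                /* neighbor's partition */
        gain += adj_wgt[i] * (net_cost[p * nparts + pu]    /* cost now   */
                            - net_cost[q * nparts + pu]);  /* cost after */
    }
    return gain;                              /* positive means improvement */
}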

With the normal (no network weighting) algorithm, it is not necessary to consider moving vertices to non-adjacent partitions; such a move could never improve the gain function. However, with network costs taken into account, it might make sense to move a vertex to a partition that is not currently adjacent.

But do all other partitions really need to be considered? In fact, it's usually good enough to consider only partitions that are either currently adjacent, or connected across a fast network link.

See the paper for the results. They show a significant improvement in terms of several metrics.

Mixed-mode programming

In some circumstances, it is not just a matter of adjusting load balancing or other parameters to account for network and processor heterogeneity. On the SP systems, for example, IBM recommends using multithreading with pthreads or OpenMP to utilize multiple processors on a single node, while using MPI for inter-node communication. This approach should work for other SMP nodes as well.

We look again at the three-dimensional cellular automata simulator. The software uses pthreads to take advantage of multiple processors within an SMP node, and MPI calls to provide inter-node communication. A number of pthreads can be created to compute a number of slices concurrently within each node.
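
The structure might look roughly like the sketch below. This is a simplified outline, not the actual simulator; the thread count, slice ranges, and the compute_one_slice/exchange_boundaries helpers are illustrative assumptions.

/* Hypothetical outline of the mixed-mode structure: MPI between nodes,
   pthreads within a node.  Each thread computes a contiguous range of the
   node's local slices. */
#include <mpi.h>
#include <pthread.h>

#define NUM_THREADS 4              /* e.g., one per processor in the SMP node */

void compute_one_slice(int s);     /* hypothetical: update one slice */
void exchange_boundaries(void);    /* hypothetical: MPI halo exchange as above */

typedef struct { int first_slice, last_slice; } work_t;

void *compute_slices(void *arg)    /* thread worker: compute one slice range */
{
    work_t *w = (work_t *)arg;
    for (int s = w->first_slice; s <= w->last_slice; s++)
        compute_one_slice(s);
    return NULL;
}

void iterate(int my_num_slices)    /* one iteration on one node */
{
    pthread_t tid[NUM_THREADS];
    work_t    work[NUM_THREADS];

    exchange_boundaries();         /* inter-node communication via MPI */

    int per_thread = my_num_slices / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
        work[t].first_slice = t * per_thread;
        work[t].last_slice  = (t == NUM_THREADS - 1)
                              ? my_num_slices - 1
                              : (t + 1) * per_thread - 1;
        pthread_create(&tid[t], NULL, compute_slices, &work[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(tid[t], NULL);
}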

Below we see wall clock times for ten iterations of a 512 × 512 × 512 cellular automaton simulation on the bullpen cluster:

The runs use the Dolphin gigabit interconnect for inter-node communication, a neighborhood size of 26, and 10 iterations. The graphs show wall clock times for this simulation as the number of MPI processes or the number of threads is varied. We see the expected speedup until the number of processes equals the number of processors. Beyond that, adding MPI processes only slows the computation, since the operating system must switch between processes.

When we achieve parallelism through multithreading, we again get the expected speedup until the number of threads equals the number of processors. However, as we continue to add threads, performance continues to improve. This demonstrates a highly efficient pthread implementation in the Solaris 8 operating system: having more threads makes it more likely that a thread is ready to execute (i.e., its instructions and data reside in cache) as soon as a processor becomes available. Since threads can be created and destroyed inexpensively, they can also be used to add parallelism incrementally, only in the parts of the program that can take advantage of it.

Run-time Discovery of a Parallel Architecture

In order to do any sort of architecture-aware computing, we need to be able to provide or discover some information about the parallel environment. We saw an approach used by Jostle where a network cost matrix is provided. But where do we get such numbers? Is there other information that should come from such a model?

Hardware Model

Work is underway to establish a "machine model" for use by Zoltan load balancers, and by applications that use Zoltan.

Goal: a run-time model of the parallel execution environment.

[Figure: Computing Environment Hardware Model (tree)]

If we can build this tree, how can it be populated with information?

One popular monitoring tool is called the Network Weather Service. NWS gathers information about available CPU, memory, and network resources, and "predicts" what resources it expects to be available in the near future based on the recent past.

The system under development for Zoltan uses a similar idea - "agents," spawned when the computation begins, that are responsible for the construction and maintenance of the machine model tree.

NIC agents use the Simple Network Management Protocol (SNMP) to collect network traffic information. CpuMemory agents monitor available CPU and memory resources. This information is placed in the machine model tree.

Since many load balancing algorithms may only be able to create some number of weighted partitions, we distill the information in the machine model down to a single number for each processing node, called the power. This number represents the portion of the work that should be assigned to the node. It is a function of processing power and networking resources, but ideally it should take additional factors (e.g., memory) into account.
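
For instance (an illustrative sketch of the idea, not the actual Zoltan code), the power of each node might be computed as a normalized combination of its measured CPU and network capacities:

/* Illustrative sketch: distill per-node CPU and network measurements into
   a single "power" per node, normalized so the powers sum to 1.
   alpha is an assumed tuning parameter weighting CPU vs. network capacity. */
void compute_powers(int nnodes, const double *cpu, const double *net,
                    double alpha, double *power)
{
    double total = 0.0;
    for (int n = 0; n < nnodes; n++) {
        power[n] = alpha * cpu[n] + (1.0 - alpha) * net[n];
        total += power[n];
    }
    for (int n = 0; n < nnodes; n++)
        power[n] /= total;        /* fraction of the work for node n */
}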