We discussed this last week. Here is an example of a simple computation that was modified to run more efficiently when running on a combination of the "fast" and "slow" nodes of the bullpen cluster.
The problem is a three-dimensional cellular automata simulator. We saw two-dimensional analogs of this earlier in the semester. In this case, the program computes the value of each cell in the next generation from a rule that combines the cell's current value with the values of its neighbors. The neighborhood can be defined as the 6 neighbors that differ in one dimension, the 18 neighbors that differ in one or two dimensions, or the 26 neighbors that differ in one, two, or three dimensions.
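To make the neighborhood definitions concrete, here is a small C program (our illustration, not the simulator's code) that counts the offsets differing from a cell in at most k dimensions; k = 1, 2, and 3 give exactly the 6, 18, and 26 neighbors described above.

    /* Count neighbors of a cell: an offset (dx,dy,dz) is a neighbor if it is
       not (0,0,0) and differs from the cell in at most max_dims dimensions. */
    #include <stdio.h>
    #include <stdlib.h>

    int count_neighbors(int max_dims)
    {
        int count = 0;
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dz = -1; dz <= 1; dz++) {
                    int dims = abs(dx) + abs(dy) + abs(dz); /* dimensions that differ */
                    if (dims >= 1 && dims <= max_dims)
                        count++;
                }
        return count;
    }

    int main(void)
    {
        /* prints "6 18 26" */
        printf("%d %d %d\n", count_neighbors(1), count_neighbors(2), count_neighbors(3));
        return 0;
    }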
The computation is broken up, much like our earlier two-dimensional examples, by slices. However, in this case, the slices are two-dimensional. At each stage, each process must exchange the boundary cell values with its neighbors.
initialize
while (more iterations)
    send top slice to proc i-1
    send bottom slice to proc i+1
    recv top overlap from proc i-1
    recv bottom overlap from proc i+1
    wait for all sends/recvs
    compute local slices
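A minimal MPI rendering of this loop, assuming each process stores its nlocal slices contiguously with one ghost plane on each side (the layout and names here are our assumptions, not the program's actual code):

    /* Per-iteration ghost-slice exchange (illustrative sketch).
       A slice is a plane of ny*nz cells; grid[0] is the top ghost plane,
       grid[plane] the first owned slice, grid[nlocal*plane] the last owned
       slice, and grid[(nlocal+1)*plane] the bottom ghost plane. */
    #include <mpi.h>

    void exchange_and_compute(double *grid, int nlocal, int ny, int nz,
                              int rank, int nprocs, MPI_Comm comm)
    {
        int plane = ny * nz;
        int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Request req[4];

        MPI_Isend(&grid[plane],              plane, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Isend(&grid[nlocal * plane],     plane, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Irecv(&grid[0],                  plane, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Irecv(&grid[(nlocal+1) * plane], plane, MPI_DOUBLE, down, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* ... apply the rule to the nlocal owned slices ... */
    }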
Normally, each process is assigned the same number of slices, since the amount of work associated with each slice should be the same. In the presence of heterogeneous processor speeds, however, this even partitioning leads to a load imbalance: the entire computation proceeds at the pace set by the slowest processor.
Our target environment for a preliminary experiment is the bullpen cluster; in particular, we wish to run the program on a mixture of "fast" and "slow" nodes. The slow nodes are Ultra 10 workstations with 300 MHz or 333 MHz UltraSPARC II processors, and the fast nodes are server-class systems, each with one or more 450 MHz processors.
First, we need to determine the relative speeds of the processors. Knowing the clock speeds does not mean that actual performance on this program will vary in the same proportion: differences in memory speed, cache size, and other factors contribute along with raw processor speed.
To determine this, we ran a single-process version of the program on each type of node. It turns out that the 450 MHz nodes compute this problem, with this program, 1.5 times faster than the 300 MHz nodes, so we will try to assign 1.5 times as many slices to processes on fast nodes. For example, if two fast and two slow processors share 100 slices evenly, the fast processors finish their 25 slices and then sit idle while the slow ones complete theirs; a 30/30/20/20 split lets all four finish together (30/1.5 = 20).
The full version of the program uses a machine model to determine the relative processor speeds automatically, but for the tests run here, the processor weights are entered manually. The number of slices to assign to each processor is determined from these weights.
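One straightforward way to turn the weights into slice counts (a sketch using our own names, not the actual program's code) is to give each process the floor of its proportional share, then hand out the remaining slices one at a time:

    /* Distribute nslices among nprocs in proportion to relative speeds
       (illustrative sketch; the full program derives weights automatically). */
    void assign_slices(int nslices, int nprocs, const double *weight, int *count)
    {
        double total = 0.0;
        for (int i = 0; i < nprocs; i++)
            total += weight[i];

        int assigned = 0;
        for (int i = 0; i < nprocs; i++) {
            count[i] = (int)(nslices * weight[i] / total); /* floor of fair share */
            assigned += count[i];
        }
        /* Hand out any leftover slices round-robin. */
        for (int i = 0; assigned < nslices; i = (i + 1) % nprocs, assigned++)
            count[i]++;
    }

With weights {1.5, 1.5, 1.0, 1.0} and 100 slices, this assigns 30 slices to each fast processor and 20 to each slow one.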
When we do this for 4 processors (2 fast, 2 slow) on various problem sizes, we get the following execution times:
Size        | Even Distribution | System-Sensitive Distribution
20x20x20    | 9.57422           | 8.11328
40x40x40    | 65.1445           | 52.4648
60x60x60    | 215.328           | 187.902
80x80x80    | 509.801           | 409.824
100x100x100 | 998.957           | 812.008
In some related work, Sinha and Parashar describe a 45% decrease in application execution time when using system-sensitive load balancing on artificially loaded systems.
Accounting for processing speed heterogeneities alone can be beneficial, but network speeds likely play a significant role as well.
The CMAME paper describes some tests where the same computation was run on a number of systems, using different networking capabilities. Not surprisingly, computations run on systems with faster networks were able to run more quickly overall. Here are some of the numbers:
Architecture | IPC vehicle | Procs. | Computers | Key |
IBM SP2 | switch | 2-32 | 2-32 | SP2s |
IBM SP2 | Ethernet | 2-32 | 2-32 | SP2e |
Sun Ultra 2/2200 | shared memory | 1-2 | 1 | SUNsh |
Sun Ultra 2/2200 | local p4 | 1-2 | 1 | SUNlp4 |
Sun Ultra 2/2200 | Ethernet p4 | 1-6 | 1-3 | SUNp4(10) |
Sun Ultra 2/2200 | fast Ethernet p4 | 1-6 | 1-3 | SUNp4(100) |
SGI Onyx II | shared memory | 1-8 | 1 | SGIsh |
SGI Onyx II | local p4 | 1-8 | 1 | SGIlp4 |
A few of the graphs compare communication across the high-speed SP switch with communication across the Ethernet connecting the same nodes.
Some conclusions:
We always want to minimize communication, but perhaps we really want to do so in the face of a slow network, even if it means sacrificing some of the computational balance to do so. For example, consider 4 different 16-processor systems, and a mesh of 1,103,018 tetrahedral elements which represents part of the human arterial system.
We use inertial bisection as a fast algorithm that produces very strictly balanced partitions but relatively large partition boundaries, and ParMetis as a more expensive algorithm that produces very small partition boundaries, though at times by sacrificing computational balance.
Inertial bisection achieves an excellent computational balance, with partitions differing by no more than one element, and reasonable surface indices, with MLSI=1.77 and GSI=0.61.
ParMetis achieves excellent surface index values, MLSI=0.40 and GSI=0.20, but at the expense of a large computational imbalance: partition sizes range from 49,389 to 89,302 regions. For a computation running on a network of workstations (NOW), it may be worth accepting the significant load imbalance to achieve the smaller communication volume.
ParMetis is used to divide the computation between the two SMP nodes, resulting in partitions of 532,063 and 570,955 regions, with MLSI=0.06 and GSI=0.03. Within each SMP, the mesh is partitioned eight ways using inertial bisection; the partitions within each SMP are balanced to within one element, with overall MLSI=1.88 and GSI=0.56. Since the extra effort minimizes communication across the slow network, this is a very effective partitioning for this environment.
Here, we get a ParMetis partitioning across the four SMP nodes with 265,897, 272,976, 291,207, and 272,938 elements, and MLSI=0.23 and GSI=0.07. Within each SMP, partitions are again balanced to within one element, with overall MLSI=1.32 and GSI=0.32.
This sort of hierarchical partitioning is difficult and is not an automated process. It is also unclear which algorithms, and which parameters for those algorithms, should be chosen for different architectures.
Walshaw and Cross describe how they account for network heterogeneity in the Jostle graph partitioner. You will recall from earlier in the semester that Jostle is one of the graph partitioners commonly used to partition meshes.
The goal of the usual Jostle multilevel partitioner is to create partitions of the graph such that each partition contains the same weighted number of graph vertices, while minimizing the "edge cut weight" - the number of graph edges whose end vertices are assigned to different partitions. The edges may have weights as well, so a heavily-weighted edge contributes more of a penalty to the edge cut weight if it is cut.
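For concreteness, the edge cut weight of a given partition assignment can be computed directly from an edge list; this small sketch (our illustration, not Jostle code) sums the weights of edges whose end vertices lie in different partitions:

    /* Weighted edge cut: total weight of edges whose endpoints are assigned
       to different partitions. part[v] is the partition of vertex v. */
    typedef struct { int u, v; double weight; } Edge;

    double edge_cut_weight(const Edge *edges, int nedges, const int *part)
    {
        double cut = 0.0;
        for (int i = 0; i < nedges; i++)
            if (part[edges[i].u] != part[edges[i].v])
                cut += edges[i].weight;
        return cut;
    }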
In the presence of heterogeneous networks, it makes sense for edges cut across a slow network interface to contribute more to a generalized cost function than edges cut across a faster one.
This information comes into play at two points: when the network costs are specified, and when the partitioner makes use of them.
Before we can do the actual network-aware partitioning, we need weights on the network links. These are supplied as entries of a network cost matrix (see the paper for examples). One way to assign costs is with a quadratic path length (QPL), which penalizes communication that crosses multiple interfaces, and slow interfaces especially.
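A sketch of how such a matrix might be filled in, assuming we already have a weighted path length between each pair of processors (the paper's exact scheme may differ):

    /* Build a network cost matrix using a quadratic path length (QPL):
       the cost of communication between processors p and q grows with the
       square of the weighted path between them, so traffic crossing several
       interfaces -- or slow ones -- is penalized heavily. Illustrative only. */
    void qpl_cost_matrix(int nprocs,
                         const double *pathlen, /* [nprocs*nprocs] link distances */
                         double *cost)          /* [nprocs*nprocs] output */
    {
        for (int p = 0; p < nprocs; p++)
            for (int q = 0; q < nprocs; q++) {
                double l = pathlen[p * nprocs + q];
                cost[p * nprocs + q] = l * l; /* quadratic penalty */
            }
    }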
Given the network costs, we can make use of them in the partitioning procedure.
The graph coarsening and the initial partitioning of the coarse graph do not use the network costs; they come into play during the refinement/optimization phase. The optimization normally considers vertices on partition boundaries as candidates to move to another partition: a gain function is computed for each possible vertex movement, and vertices are moved so as to make the greatest improvement to the cost function. It is in these gain functions that the network costs are taken into account. For details, see the paper.
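As an illustration of the idea (not Jostle's actual code), the gain of moving a boundary vertex v from partition src to partition dst can charge each of v's edges by the network cost between the partitions of its endpoints; an uncut edge costs nothing, since the cost of a partition to itself is zero:

    /* Gain of moving vertex v from partition src to dst, with cut edges
       charged by the network cost between partitions. The graph is in CSR
       form: neighbors of v are adj[adj_start[v] .. adj_start[v+1]-1], with
       edge weights in ewt. netcost[p*nparts+p] is assumed to be 0. */
    double move_gain(int v, int src, int dst,
                     const int *adj_start, const int *adj, const double *ewt,
                     const int *part, const double *netcost, int nparts)
    {
        double gain = 0.0;
        for (int e = adj_start[v]; e < adj_start[v + 1]; e++) {
            int p = part[adj[e]]; /* partition of this neighbor of v */
            gain += ewt[e] * (netcost[src * nparts + p] - netcost[dst * nparts + p]);
        }
        return gain; /* positive gain: the move reduces communication cost */
    }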
With the normal (no network weighting) algorithm, there is no need to consider moving vertices to non-adjacent partitions; such a move could never improve the gain function. With network costs taken into account, however, it might make sense to move a vertex to a partition that is not currently adjacent.
But do all other partitions really need to be considered? In practice, it is usually sufficient to consider only partitions that are either currently adjacent or connected across a fast network link.
See the paper for the results. They show a significant improvement in terms of several metrics.
In some circumstances, it is not just a matter of adjusting load balancing or other parameters to account for network and processor heterogeneity. On the SP systems, for example, IBM recommends using multithreading with pthreads or OpenMP to utilize the multiple processors on a single node, while using MPI for inter-node communication. This approach should work well for other SMP nodes too.
We look again at the three-dimensional cellular automata simulator. The software uses pthreads to take advantage of the multiple processors within an SMP node and MPI calls to provide inter-node communication; a number of pthreads can be created to compute several slices concurrently within each node.
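One iteration of such a hybrid scheme might be structured as in the sketch below (the names and the slice-range split are our assumptions):

    /* Hybrid sketch: MPI between nodes, pthreads within a node. Each
       iteration, ghost slices are exchanged via MPI, then worker threads
       update disjoint ranges of the local slices concurrently. */
    #include <pthread.h>

    typedef struct { int first_slice, last_slice; } WorkerArg;

    void *compute_slices(void *argp)
    {
        WorkerArg *arg = (WorkerArg *)argp;
        for (int s = arg->first_slice; s <= arg->last_slice; s++) {
            /* apply the cellular automaton rule to every cell in slice s */
        }
        return NULL;
    }

    void run_iteration(int nlocal, int nthreads)
    {
        /* 1. exchange ghost slices with neighboring MPI processes
              (see the earlier exchange sketch) */

        /* 2. split the nlocal owned slices (numbered 1..nlocal) among
              worker threads */
        pthread_t tid[nthreads];
        WorkerArg arg[nthreads];
        int per = (nlocal + nthreads - 1) / nthreads; /* ceiling division */
        for (int t = 0; t < nthreads; t++) {
            arg[t].first_slice = t * per + 1;
            arg[t].last_slice  = (t + 1) * per < nlocal ? (t + 1) * per : nlocal;
            pthread_create(&tid[t], NULL, compute_slices, &arg[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }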
Below we see wall clock times for ten iterations of a 512x512x512 cellular automaton simulation (neighborhood size 26) on the bullpen cluster, using the Dolphin gigabit interconnect for inter-node communication. The graphs show wall clock times as we vary the number of MPI processes or the number of threads.

We see the expected speedup until the number of processes equals the number of processors. Beyond that, adding MPI processes only slows the computation, since the operating system must switch between processes. When we achieve parallelism through multithreading, we again get the expected speedup until the number of threads equals the number of processors; however, as we continue to add threads, performance continues to improve. This demonstrates a highly efficient pthread implementation in the Solaris 8 operating system: having more threads makes it more likely that a thread is ready to execute (i.e., its instructions and data reside in cache) as soon as a processor becomes available. Since threads can be created and destroyed inexpensively, they can also be used to add parallelism incrementally, only in the parts of the program that can take advantage of it.
In order to do any sort of architecture-aware computing, we need to be able to provide or discover some information about the parallel environment. We saw an approach used by Jostle where a network cost matrix is provided. But where do we get such numbers? Is there other information that should come from such a model?
Work is underway to establish a "machine model" for use by Zoltan load balancers, and by applications that use Zoltan.
Goal: A run-time model of the parallel execution environment
[Figure: a computing environment alongside the corresponding hardware model tree.]
If we can build this tree, how can it be populated with information?
One popular monitoring tool is called the Network Weather Service. NWS gathers information about available CPU, memory, and network resources, and "predicts" what resources it expects to be available in the near future based on the recent past.
The system under development for Zoltan uses a similar idea: "agents" spawned when the computation begins are responsible for constructing and maintaining the machine model tree.
NIC agents use the Simple Network Management Protocol (SNMP) to collect network traffic information. CpuMemory agents monitor available CPU and memory resources. This information is placed in the machine model tree.
Since many load balancing algorithms can only create some number of weighted partitions, we distill the information in the machine model down to a single number per processing node, called the power. This number represents the portion of the work that should be assigned to the node. It is a function of processing power and networking resources, but ideally it should take additional factors (e.g., memory) into account.
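How this distillation is done is itself a design question; one plausible sketch (our guess at a formula, not Zoltan's actual method) combines normalized CPU and network measurements with a tunable exponent, then normalizes so that the powers sum to one:

    /* Distill per-node CPU and network measurements into a normalized
       "power" per node. The weighted-product form and the parameter alpha
       are illustrative assumptions, not Zoltan's actual formula. */
    #include <math.h>

    void compute_powers(int nnodes, const double *cpu, const double *net,
                        double alpha, double *power)
    {
        /* alpha in [0,1]: relative importance of CPU vs. network capacity */
        double total = 0.0;
        for (int i = 0; i < nnodes; i++) {
            power[i] = pow(cpu[i], alpha) * pow(net[i], 1.0 - alpha);
            total += power[i];
        }
        for (int i = 0; i < nnodes; i++)
            power[i] /= total; /* node i gets fraction power[i] of the work */
    }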