We discussed this last week. Here is an example of a simple computation that was modified to run more efficiently when running on a combination of the "fast" and "slow" nodes of the bullpen cluster.
The problem is a three-dimensional cellular automata simulator. We saw two-dimensional analogs of this earlier in the semester. In this case, the program computes the value of each cell in the next generation from a rule that combines the cell's current value with the values of its neighbors. The neighborhood can be defined as the 6 neighbors that differ in one dimension, the 18 neighbors that differ in one or two dimensions, or the 26 neighbors that differ in one, two, or three dimensions.
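To make the neighborhood definitions concrete, here is a small C program (our illustration, not the simulator's code) that counts the offsets differing from a cell in at most k dimensions; k = 1, 2, and 3 give exactly the 6, 18, and 26 neighbors described above.

    /* Count neighbors of a cell: an offset (dx,dy,dz) is a neighbor if it is
       not (0,0,0) and differs from the cell in at most max_dims dimensions. */
    #include <stdio.h>
    #include <stdlib.h>

    int count_neighbors(int max_dims)
    {
        int count = 0;
        for (int dx = -1; dx <= 1; dx++)
            for (int dy = -1; dy <= 1; dy++)
                for (int dz = -1; dz <= 1; dz++) {
                    int dims = abs(dx) + abs(dy) + abs(dz); /* dimensions that differ */
                    if (dims >= 1 && dims <= max_dims)
                        count++;
                }
        return count;
    }

    int main(void)
    {
        /* prints "6 18 26" */
        printf("%d %d %d\n", count_neighbors(1), count_neighbors(2), count_neighbors(3));
        return 0;
    }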
The computation is broken up, much like our earlier two-dimensional examples, by slices. However, in this case, the slices are two-dimensional. At each stage, each process must exchange the boundary cell values with its neighbors.
initialize
while (more iterations)
    send top slice to proc i-1
    send bottom slice to proc i+1
    recv top overlap from proc i-1
    recv bottom overlap from proc i+1
    wait for all sends/recvs
    compute local slices
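A minimal MPI rendering of this loop, assuming each process stores its nlocal slices contiguously with one ghost plane on each side (the layout and names here are our assumptions, not the program's actual code):

    /* Per-iteration ghost-slice exchange (illustrative sketch).
       A slice is a plane of ny*nz cells; grid[0] is the top ghost plane,
       grid[plane] the first owned slice, grid[nlocal*plane] the last owned
       slice, and grid[(nlocal+1)*plane] the bottom ghost plane. */
    #include <mpi.h>

    void exchange_and_compute(double *grid, int nlocal, int ny, int nz,
                              int rank, int nprocs, MPI_Comm comm)
    {
        int plane = ny * nz;
        int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;
        MPI_Request req[4];

        MPI_Isend(&grid[plane],              plane, MPI_DOUBLE, up,   0, comm, &req[0]);
        MPI_Isend(&grid[nlocal * plane],     plane, MPI_DOUBLE, down, 1, comm, &req[1]);
        MPI_Irecv(&grid[0],                  plane, MPI_DOUBLE, up,   1, comm, &req[2]);
        MPI_Irecv(&grid[(nlocal+1) * plane], plane, MPI_DOUBLE, down, 0, comm, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* ... apply the rule to the nlocal owned slices ... */
    }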
Normally, each process is assigned the same number of slices, since the amount of work associated with each slice should be the same. In the presence of heterogeneous processor speeds, however, this even partitioning leads to a load imbalance: the entire computation proceeds at the pace set by the slowest processor.
Our target environment for a preliminary experiment is the bullpen cluster; in particular, we wish to run the program on a mixture of "fast" and "slow" nodes. The slow nodes are Ultra 10 workstations with 300 MHz or 333 MHz UltraSPARC II processors, and the fast nodes are server-class systems, each with one or more 450 MHz processors.
First, we need to determine the relative speeds of the processors. Knowing the clock speeds does not mean that actual performance on this program will vary in the same proportion: differences in memory speed, cache size, and other factors contribute along with raw processor speed.
To determine this, we ran a single-process version of the program on each type of node. It turns out that the 450 MHz nodes compute this problem, with this program, 1.5 times faster than the 300 MHz nodes, so we will try to assign 1.5 times as many slices to processes on fast nodes. For example, if two fast and two slow processors share 100 slices evenly, the fast processors finish their 25 slices and then sit idle while the slow ones complete theirs; a 30/30/20/20 split lets all four finish together (30/1.5 = 20).
The full version of the program uses a machine model to determine the relative processor speeds automatically, but for the tests run here, the processor weights are entered manually. The number of slices to assign to each processor is determined from these weights.
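One straightforward way to turn the weights into slice counts (a sketch using our own names, not the actual program's code) is to give each process the floor of its proportional share, then hand out the remaining slices one at a time:

    /* Distribute nslices among nprocs in proportion to relative speeds
       (illustrative sketch; the full program derives weights automatically). */
    void assign_slices(int nslices, int nprocs, const double *weight, int *count)
    {
        double total = 0.0;
        for (int i = 0; i < nprocs; i++)
            total += weight[i];

        int assigned = 0;
        for (int i = 0; i < nprocs; i++) {
            count[i] = (int)(nslices * weight[i] / total); /* floor of fair share */
            assigned += count[i];
        }
        /* Hand out any leftover slices round-robin. */
        for (int i = 0; assigned < nslices; i = (i + 1) % nprocs, assigned++)
            count[i]++;
    }

With weights {1.5, 1.5, 1.0, 1.0} and 100 slices, this assigns 30 slices to each fast processor and 20 to each slow one.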
When we do this for 4 processors (2 fast, 2 slow) on various problem sizes, we get the following execution times:
Size        | Even Distribution | System-Sensitive Distribution
20x20x20    | 9.57422           | 8.11328
40x40x40    | 65.1445           | 52.4648
60x60x60    | 215.328           | 187.902
80x80x80    | 509.801           | 409.824
100x100x100 | 998.957           | 812.008
In some related work, Sinha and Parashar describe a 45% decrease in application execution time when using system-sensitive load balancing on artificially loaded systems.
Accounting for processing speed heterogeneities alone can be beneficial, but network speeds likely play a significant role as well.
The CMAME paper describes some tests where the same computation was run on a number of systems, using different networking capabilities. Not surprisingly, computations run on systems with faster networks were able to run more quickly overall. Here are some of the numbers:
Architecture | IPC vehicle | Procs. | Computers | Key |
IBM SP2 | switch | 2-32 | 2-32 | SP2s |
IBM SP2 | Ethernet | 2-32 | 2-32 | SP2e |
Sun Ultra 2/2200 | shared memory | 1-2 | 1 | SUNsh |
Sun Ultra 2/2200 | local p4 | 1-2 | 1 | SUNlp4 |
Sun Ultra 2/2200 | Ethernet p4 | 1-6 | 1-3 | SUNp4(10) |
Sun Ultra 2/2200 | fast Ethernet p4 | 1-6 | 1-3 | SUNp4(100) |
SGI Onyx II | shared memory | 1-8 | 1 | SGIsh |
SGI Onyx II | local p4 | 1-8 | 1 | SGIlp4 |
A few of the graphs compare communication across the high-speed SP switch with communication across the Ethernet connecting the same nodes.
Some conclusions:
We always want to minimize communication, but perhaps we really want to do so in the face of a slow network, even if it means sacrificing some of the computational balance to do so. For example, consider 4 different 16-processor systems, and a mesh of 1,103,018 tetrahedral elements which represents part of the human arterial system.
We use inertial bisection as a fast algorithm that produces very strictly balanced partitions but relatively large partition boundaries, and ParMetis as a more expensive algorithm that produces very small partition boundaries, though at times by sacrificing computational balance.
Inertial bisection achieves an excellent computational balance, with partitions differing by no more than one element, and reasonable surface indices, with MLSI=1.77 and GSI=0.61.
ParMetis achieves excellent surface index values, MLSI=0.40 and GSI=0.20, but at the expense of a large computational imbalance: partition sizes range from 49,389 to 89,302 regions. For a computation running on a network of workstations (NOW), it may be worth accepting the significant load imbalance to achieve the smaller communication volume.
ParMetis is used to divide the computation between the two SMP nodes, resulting in partitions of 532,063 and 570,955 regions, with MLSI=0.06 and GSI=0.03. Within each SMP, the mesh is partitioned eight ways using inertial bisection; the partitions within each SMP are balanced to within one element, with overall MLSI=1.88 and GSI=0.56. Since the extra effort minimizes communication across the slow network, this is a very effective partitioning for this environment.
Here, we get a ParMetis partitioning across the four SMP nodes with 265,897, 272,976, 291,207, and 272,938 elements, and MLSI=0.23 and GSI=0.07. Within each SMP, partitions are again balanced to within one element, with overall MLSI=1.32 and GSI=0.32.
This sort of hierarchical partitioning is difficult and is not an automated process. It is also unclear which algorithms, and which parameters for those algorithms, should be chosen for different architectures.
Walshaw and Cross describe how they account for network heterogeneity in the Jostle graph partitioner. You will recall from earlier in the semester that Jostle is one of the graph partitioners commonly used to partition meshes.
The goal of the usual Jostle multilevel partitioner is to create partitions of the graph such that each partition contains the same weighted number of graph vertices, while minimizing the "edge cut weight" - the number of graph edges whose end vertices are assigned to different partitions. The edges may have weights as well, so a heavily-weighted edge contributes more of a penalty to the edge cut weight if it is cut.
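For concreteness, the edge cut weight of a given partition assignment can be computed directly from an edge list; this small sketch (our illustration, not Jostle code) sums the weights of edges whose end vertices lie in different partitions:

    /* Weighted edge cut: total weight of edges whose endpoints are assigned
       to different partitions. part[v] is the partition of vertex v. */
    typedef struct { int u, v; double weight; } Edge;

    double edge_cut_weight(const Edge *edges, int nedges, const int *part)
    {
        double cut = 0.0;
        for (int i = 0; i < nedges; i++)
            if (part[edges[i].u] != part[edges[i].v])
                cut += edges[i].weight;
        return cut;
    }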
In the presence of heterogeneous networks, it makes sense for edges cut across a slow network interface to contribute more to a generalized cost function than edges cut across a faster one.
This information comes into play at two points: when the network costs are specified, and when the partitioner makes use of them.
Before we can do the actual network-aware partitioning, we need weights on the network links. These are supplied as entries of a network cost matrix (see the paper for examples). One way to assign costs is with a quadratic path length (QPL), which penalizes communication that crosses multiple interfaces, and slow interfaces especially.
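A sketch of how such a matrix might be filled in, assuming we already have a weighted path length between each pair of processors (the paper's exact scheme may differ):

    /* Build a network cost matrix using a quadratic path length (QPL):
       the cost of communication between processors p and q grows with the
       square of the weighted path between them, so traffic crossing several
       interfaces -- or slow ones -- is penalized heavily. Illustrative only. */
    void qpl_cost_matrix(int nprocs,
                         const double *pathlen, /* [nprocs*nprocs] link distances */
                         double *cost)          /* [nprocs*nprocs] output */
    {
        for (int p = 0; p < nprocs; p++)
            for (int q = 0; q < nprocs; q++) {
                double l = pathlen[p * nprocs + q];
                cost[p * nprocs + q] = l * l; /* quadratic penalty */
            }
    }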
Given the network costs, we can make use of them in the partitioning procedure.
The graph coarsening and the initial partitioning of the coarse graph do not use the network costs; they come into play during the refinement/optimization phase. The optimization normally considers vertices on partition boundaries as candidates to move to another partition: a gain function is computed for each possible vertex movement, and vertices are moved so as to make the greatest improvement to the cost function. It is in these gain functions that the network costs are taken into account. For details, see the paper.
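As an illustration of the idea (not Jostle's actual code), the gain of moving a boundary vertex v from partition src to partition dst can charge each of v's edges by the network cost between the partitions of its endpoints; an uncut edge costs nothing, since the cost of a partition to itself is zero:

    /* Gain of moving vertex v from partition src to dst, with cut edges
       charged by the network cost between partitions. The graph is in CSR
       form: neighbors of v are adj[adj_start[v] .. adj_start[v+1]-1], with
       edge weights in ewt. netcost[p*nparts+p] is assumed to be 0. */
    double move_gain(int v, int src, int dst,
                     const int *adj_start, const int *adj, const double *ewt,
                     const int *part, const double *netcost, int nparts)
    {
        double gain = 0.0;
        for (int e = adj_start[v]; e < adj_start[v + 1]; e++) {
            int p = part[adj[e]]; /* partition of this neighbor of v */
            gain += ewt[e] * (netcost[src * nparts + p] - netcost[dst * nparts + p]);
        }
        return gain; /* positive gain: the move reduces communication cost */
    }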
With the normal (no network weighting) algorithm, there is no need to consider moving vertices to non-adjacent partitions; such a move could never improve the gain function. With network costs taken into account, however, it might make sense to move a vertex to a partition that is not currently adjacent.
But do all other partitions really need to be considered? In practice, it is usually sufficient to consider only partitions that are either currently adjacent or connected across a fast network link.
See the paper for the results. They show a significant improvement in terms of several metrics.
In some circumstances, it is not just a matter of adjusting load balancing or other parameters to account for network and processor heterogeneity. On the SP systems, for example, IBM recommends using multithreading with pthreads or OpenMP to utilize the multiple processors on a single node, while using MPI for inter-node communication. This approach should work well for other SMP nodes too.
We look again at the three-dimensional cellular automata simulator. The software uses pthreads to take advantage of the multiple processors within an SMP node and MPI calls to provide inter-node communication; a number of pthreads can be created to compute several slices concurrently within each node.
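One iteration of such a hybrid scheme might be structured as in the sketch below (the names and the slice-range split are our assumptions):

    /* Hybrid sketch: MPI between nodes, pthreads within a node. Each
       iteration, ghost slices are exchanged via MPI, then worker threads
       update disjoint ranges of the local slices concurrently. */
    #include <pthread.h>

    typedef struct { int first_slice, last_slice; } WorkerArg;

    void *compute_slices(void *argp)
    {
        WorkerArg *arg = (WorkerArg *)argp;
        for (int s = arg->first_slice; s <= arg->last_slice; s++) {
            /* apply the cellular automaton rule to every cell in slice s */
        }
        return NULL;
    }

    void run_iteration(int nlocal, int nthreads)
    {
        /* 1. exchange ghost slices with neighboring MPI processes
              (see the earlier exchange sketch) */

        /* 2. split the nlocal owned slices (numbered 1..nlocal) among
              worker threads */
        pthread_t tid[nthreads];
        WorkerArg arg[nthreads];
        int per = (nlocal + nthreads - 1) / nthreads; /* ceiling division */
        for (int t = 0; t < nthreads; t++) {
            arg[t].first_slice = t * per + 1;
            arg[t].last_slice  = (t + 1) * per < nlocal ? (t + 1) * per : nlocal;
            pthread_create(&tid[t], NULL, compute_slices, &arg[t]);
        }
        for (int t = 0; t < nthreads; t++)
            pthread_join(tid[t], NULL);
    }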
Below we see wall clock times for ten iterations of a 512x512x512 cellular automaton simulation (neighborhood size 26) on the bullpen cluster, using the Dolphin gigabit interconnect for inter-node communication. The graphs show wall clock times as we vary the number of MPI processes or the number of threads.

We see the expected speedup until the number of processes equals the number of processors. Beyond that, adding MPI processes only slows the computation, since the operating system must switch between processes. When we achieve parallelism through multithreading, we again get the expected speedup until the number of threads equals the number of processors; however, as we continue to add threads, performance continues to improve. This demonstrates a highly efficient pthread implementation in the Solaris 8 operating system: having more threads makes it more likely that a thread is ready to execute (i.e., its instructions and data reside in cache) as soon as a processor becomes available. Since threads can be created and destroyed inexpensively, they can also be used to add parallelism incrementally, only in the parts of the program that can take advantage of it.
In order to do any sort of architecture-aware computing, we need to be able to provide or discover some information about the parallel environment. We saw an approach used by Jostle where a network cost matrix is provided. But where do we get such numbers? Is there other information that should come from such a model?
Work is underway to establish a "machine model" for use by Zoltan load balancers, and by applications that use Zoltan.
Goal: A run-time model of the parallel execution environment
[Figure: a computing environment alongside the corresponding hardware model tree.]
If we can build this tree, how can it be populated with information?
One popular monitoring tool is called the Network Weather Service. NWS gathers information about available CPU, memory, and network resources, and "predicts" what resources it expects to be available in the near future based on the recent past.
The system under development for Zoltan uses a similar idea: "agents" spawned when the computation begins are responsible for constructing and maintaining the machine model tree.
NIC agents use the Simple Network Management Protocol (SNMP) to collect network traffic information. CpuMemory agents monitor available CPU and memory resources. This information is placed in the machine model tree.
Since many load balancing algorithms can only create some number of weighted partitions, we distill the information in the machine model down to a single number per processing node, called the power. This number represents the portion of the work that should be assigned to the node. It is a function of processing power and networking resources, but ideally it should take additional factors (e.g., memory) into account.
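How this distillation is done is itself a design question; one plausible sketch (our guess at a formula, not Zoltan's actual method) combines normalized CPU and network measurements with a tunable exponent, then normalizes so that the powers sum to one:

    /* Distill per-node CPU and network measurements into a normalized
       "power" per node. The weighted-product form and the parameter alpha
       are illustrative assumptions, not Zoltan's actual formula. */
    #include <math.h>

    void compute_powers(int nnodes, const double *cpu, const double *net,
                        double alpha, double *power)
    {
        /* alpha in [0,1]: relative importance of CPU vs. network capacity */
        double total = 0.0;
        for (int i = 0; i < nnodes; i++) {
            power[i] = pow(cpu[i], alpha) * pow(net[i], 1.0 - alpha);
            total += power[i];
        }
        for (int i = 0; i < nnodes; i++)
            power[i] /= total; /* node i gets fraction power[i] of the work */
    }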