Lecture 18 - More Parallel Algorithms


Agenda

Announcements

Parallel Graph Algorithms

Last time, we started to consider one parallel graph algorithm. We will look back at it today.

Prim's Algorithm

Given a weighted, undirected graph G = (V, E, w), a minimum spanning tree (MST) is a subgraph of G that is a tree containing all vertices of G and having minimum total edge weight.

Prim's Algorithm can be used to compute an MST. It is O(n^2) for a graph with n vertices.
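As a concrete baseline before parallelizing, here is a minimal serial O(n^2) Prim's implementation in C. The adjacency matrix below is an illustrative example, not a graph from the lecture:

```c
#include <assert.h>
#include <limits.h>

#define N 5
#define INF INT_MAX

/* O(n^2) Prim's algorithm on an adjacency matrix; returns the total MST
   weight. The graph is hypothetical, chosen just for illustration. */
int prim_mst(int n, int w[N][N]) {
    int dist[N];              /* lightest edge connecting each vertex to the tree */
    int in_tree[N] = {0};
    int total = 0;
    for (int v = 0; v < n; v++) dist[v] = w[0][v];
    in_tree[0] = 1;
    for (int step = 1; step < n; step++) {
        /* inner loop 1: pick the closest vertex not yet in the tree */
        int u = -1;
        for (int v = 0; v < n; v++)
            if (!in_tree[v] && (u == -1 || dist[v] < dist[u])) u = v;
        in_tree[u] = 1;
        total += dist[u];
        /* inner loop 2: update distances through the newly added vertex */
        for (int v = 0; v < n; v++)
            if (!in_tree[v] && w[u][v] < dist[v]) dist[v] = w[u][v];
    }
    return total;
}
```

The two inner loops are where the O(n) work per iteration lives; it is exactly this work that the parallel formulation below partitions across processors.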

How can we parallelize it?

We cannot do different iterations of the outer loop concurrently, as a non-optimal edge may be added to the MST.

Parallelization must involve the inner loop. We partition the adjacency matrix and the distance array across the processors. Each processor determines the smallest weight to a vertex not yet in the MST, and a global reduction (all-reduce) is used to determine which vertex to choose. Processors can each update their own parts of the distance array, and the process continues.
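To make the partitioned minimum-plus-reduction step concrete, here is a serial simulation sketch in C. Each "processor" owns a contiguous block of the distance array, finds its local minimum, and the loop over processors stands in for the all-reduce (the array contents and partition sizes are illustrative assumptions, not lecture data):

```c
#include <assert.h>
#include <limits.h>

#define N 8   /* number of vertices */
#define P 4   /* number of simulated processors */

/* Serial simulation of one parallel step of Prim's algorithm: each
   "processor" scans only its own block of the distance array for the
   closest vertex not yet in the tree, then the p local (distance, vertex)
   pairs are combined, standing in for an all-reduce. */
int global_min_vertex(const int dist[N], const int in_tree[N]) {
    int best_v = -1, best_d = INT_MAX;
    int block = N / P;
    for (int proc = 0; proc < P; proc++) {
        /* local minimum over this processor's block */
        int local_v = -1, local_d = INT_MAX;
        for (int v = proc * block; v < (proc + 1) * block; v++)
            if (!in_tree[v] && dist[v] < local_d) {
                local_d = dist[v];
                local_v = v;
            }
        /* the "all-reduce": combine the local results into a global minimum */
        if (local_d < best_d) { best_d = local_d; best_v = local_v; }
    }
    return best_v;
}
```

In a real message-passing implementation, the combining step would be something like an all-reduce with a min-location operation rather than a loop.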

What is the parallel complexity?

Each of the n iterations costs O(n/p) for the local scan of the distance array plus O(log p) for the all-reduce, so the overall parallel run time is O(n^2/p) + O(n log p) on p processors.

If we have O(n) processors, the whole thing is O(n) + O(n log n), though realistically, the cost of all that communication might make the constant on the communication term very large.

Dijkstra's Algorithm

We can parallelize Dijkstra's Algorithm (for computing single-source shortest paths) using a similar approach, and the complexities are the same.

Recall Dijkstra's Algorithm:

Dijkstra's Algorithm is a procedure to find shortest paths from a given vertex s in a graph G to all other vertices. The algorithm incrementally builds a sub-graph of G which is a tree containing shortest paths from s to every other vertex in the tree. A step of the algorithm consists of determining which vertex to add to the tree next.
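A minimal serial O(n^2) Dijkstra's in C, for reference. The adjacency matrix here is a made-up example, not the Williamstown graph from the lecture:

```c
#include <assert.h>
#include <limits.h>

#define N 5
#define INF INT_MAX

/* O(n^2) Dijkstra's algorithm on an adjacency matrix: fills dist[] with
   shortest-path costs from source s. The graph used in the test is an
   illustrative assumption. */
void dijkstra(int n, int w[N][N], int s, int dist[N]) {
    int done[N] = {0};
    for (int v = 0; v < n; v++) dist[v] = (v == s) ? 0 : INF;
    for (int step = 0; step < n; step++) {
        /* add the unfinished vertex with the smallest known distance */
        int u = -1;
        for (int v = 0; v < n; v++)
            if (!done[v] && (u == -1 || dist[v] < dist[u])) u = v;
        if (dist[u] == INF) break;   /* remaining vertices are unreachable */
        done[u] = 1;
        /* relax edges out of u */
        for (int v = 0; v < n; v++)
            if (!done[v] && w[u][v] != INF && dist[u] + w[u][v] < dist[v])
                dist[v] = dist[u] + w[u][v];
    }
}
```

Note the structural similarity to Prim's: the only real difference is that vertices are selected and relaxed by total path cost from s rather than by single edge weight.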

Consider the following graph:

From that graph, the algorithm would construct the following tree for a start node of Williamstown. Costs on edges indicate total cost from the root.

This is O(n^2) and can be parallelized just like Prim's Algorithm - it is essentially the same procedure.

But what about all-pairs shortest path?

Instead of the output being a list of shortest paths from one vertex to every other vertex, the output is the shortest path between each distinct pair of vertices in the graph.

One possibility is to apply the single-source Dijkstra's algorithm to each vertex, giving an algorithm of complexity O(n^3). This can be called Dijkstra's All-Sources Shortest Path algorithm.

How can we parallelize this?

Source-Partitioned Formulation

The straightforward approach would be to use n processors, assign each one of the n vertices, and have each run the serial single-source algorithm. This is perfectly parallel, but it can use at most n processors, giving an overall complexity of O(n^2). There is no communication component to the cost here.

Source-Parallel Formulation

If we want to have the possibility of more than n processors, up to n^2 processors, we can apply the parallel formulation of the single-source algorithm to each vertex.

To analyze this, we divide our p processors into n groups of size p/n. Each of these n groups works on one source, using its p/n processors in parallel, to get O(n^3/p) + O(n log p), where the first term represents time for computation and the second represents time for communication.

Floyd's Algorithm

Floyd's Algorithm is another approach to the all-pairs shortest paths problem. Here, we number the vertices of the graph v1, v2, ..., vn. We start by setting the "best known" distance between vi and vj as the weight of the edge between vi and vj, if it exists, infinity otherwise. Then we consider paths from vi to vj that pass through v1. The shortest path is now either the weight of the edge from vi to vj or a path from vi to v1 to vj. Then, we consider paths that include v2, then v3, and so on. At step k, we need to check if the path from vi to vk followed by the path from vk to vj is shorter than the best (so far) known path from vi to vj.

This is implemented with a triply nested for-loop, so the serial version is O(n^3). Below, A is the adjacency matrix of the graph, and D^k is the matrix of best known path lengths after considering intermediate vertices v1 through vk:

D^0 = A;
for (k=1; k<=n; k++)
  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
      D^k[i][j] = min(D^(k-1)[i][j], D^(k-1)[i][k] + D^(k-1)[k][j]);
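A runnable C version of this, as a sketch. A single matrix can be updated in place, because row k and column k do not change during iteration k (D[i][k] and D[k][j] pick up only zero-weight detours through vertex k itself). The test graph and the large-but-finite INF sentinel are illustrative choices:

```c
#include <assert.h>

#define N 4
#define INF 1000000   /* large sentinel, so INF + weight does not overflow */

/* Floyd's algorithm, updating D in place: after iteration k, d[i][j] is the
   shortest path from i to j using only intermediate vertices 1..k. */
void floyd(int n, int d[N][N]) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}
```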

What about parallelization?

Since D^k needs to know values from D^(k-1), we cannot easily parallelize the outer loop. But what can we do inside?

Let's think about it in terms of p processors. Furthermore, we'll assume that the processors are arranged logically as a sqrt(p) × sqrt(p) grid, and the matrix D is broken into p blocks of size (n/sqrt(p)) × (n/sqrt(p)). Each processor updates its own submatrix at each iteration.

What information does each processor need at the kth iteration? It needs the segments of the kth row and kth column that intersect its block. The processors holding a segment of the kth row broadcast it down their processor column, and the processors holding a segment of the kth column broadcast it across their processor row.

Pseudocode for this procedure:

  for (k=1; k<=n; k++) {
     each process P[i][j] that has a segment of the kth row of D^(k-1)
        broadcasts it to the P[*][j] processes
     each process P[i][j] that has a segment of the kth column of D^(k-1)
        broadcasts it to the P[i][*] processes
     each process waits to receive the needed segments
     each process P[i][j] computes its local part of D^k
  }
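As a sketch of what this computes, here is a serial simulation in C with the broadcasts modeled as buffer copies; each block of the loop over (bi, bj) stands in for one process in the sqrt(p) × sqrt(p) grid. The 4-vertex graph and the 2 × 2 grid are illustrative assumptions:

```c
#include <assert.h>

#define N 4            /* number of vertices */
#define Q 2            /* processor grid is Q x Q, i.e. p = 4 */
#define B (N / Q)      /* each block is B x B */
#define INF 1000000    /* large finite sentinel */

/* Serial simulation of the block-partitioned parallel Floyd's algorithm:
   at step k, the owners of row k and column k "broadcast" them (here,
   copies into buffers), then every block updates its own submatrix using
   only the buffers and its local data. */
void floyd_blocked(int d[N][N]) {
    int row_buf[N], col_buf[N];
    for (int k = 0; k < N; k++) {
        /* broadcast of the kth row and kth column to all blocks */
        for (int j = 0; j < N; j++) row_buf[j] = d[k][j];
        for (int i = 0; i < N; i++) col_buf[i] = d[i][k];
        /* each of the Q*Q processes updates its own B x B block */
        for (int bi = 0; bi < Q; bi++)
            for (int bj = 0; bj < Q; bj++)
                for (int i = bi * B; i < (bi + 1) * B; i++)
                    for (int j = bj * B; j < (bj + 1) * B; j++)
                        if (col_buf[i] + row_buf[j] < d[i][j])
                            d[i][j] = col_buf[i] + row_buf[j];
    }
}
```

The key point the simulation shows is that each block's update at step k touches only its own submatrix plus the broadcast row and column buffers, which is exactly why the message-passing version needs only the two broadcasts per iteration.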

In each iteration, the processors holding the kth row and kth column each do a one-to-all broadcast along their column or row of sqrt(p) processors. Each such processor broadcasts n/sqrt(p) elements, for a communication cost of O((n/sqrt(p)) log p).

The computation at each iteration is O(n^2/p).

So the total, over n iterations: O(n^3/p) + O((n^2/sqrt(p)) log p).