Lecture 18 - More Parallel Algorithms


Agenda

Announcements

Parallel Graph Algorithms

Last time, we started to consider one parallel graph algorithm. We will look back at it today.

Prim's Algorithm

Given a weighted, undirected graph G = (V, E, w), a minimum spanning tree (MST) is a subgraph of G that is a tree containing all vertices of G and having minimum total edge weight.

Prim's Algorithm can be used to compute an MST. It is O(n^2) for a graph with n vertices.
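As a concrete baseline before parallelizing, here is a minimal serial O(n^2) Prim's implementation in C. The adjacency matrix below is an illustrative example, not a graph from the lecture:

```c
#include <assert.h>
#include <limits.h>

#define N 5
#define INF INT_MAX

/* O(n^2) Prim's algorithm on an adjacency matrix; returns the total MST
   weight. The graph is hypothetical, chosen just for illustration. */
int prim_mst(int n, int w[N][N]) {
    int dist[N];              /* lightest edge connecting each vertex to the tree */
    int in_tree[N] = {0};
    int total = 0;
    for (int v = 0; v < n; v++) dist[v] = w[0][v];
    in_tree[0] = 1;
    for (int step = 1; step < n; step++) {
        /* inner loop 1: pick the closest vertex not yet in the tree */
        int u = -1;
        for (int v = 0; v < n; v++)
            if (!in_tree[v] && (u == -1 || dist[v] < dist[u])) u = v;
        in_tree[u] = 1;
        total += dist[u];
        /* inner loop 2: update distances through the newly added vertex */
        for (int v = 0; v < n; v++)
            if (!in_tree[v] && w[u][v] < dist[v]) dist[v] = w[u][v];
    }
    return total;
}
```

The two inner loops are where the O(n) work per iteration lives; it is exactly this work that the parallel formulation below partitions across processors.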

How can we parallelize it?

We cannot do different iterations of the outer loop concurrently, as a non-optimal edge may be added to the MST.

Parallelization must involve the inner loop. We partition the adjacency matrix and the distance array across the processors. Each processor determines the smallest weight to a vertex not yet in the MST, and a global reduction (all-reduce) is used to determine which vertex to choose. Processors can each update their own parts of the distance array, and the process continues.
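To make the partitioned minimum-plus-reduction step concrete, here is a serial simulation sketch in C. Each "processor" owns a contiguous block of the distance array, finds its local minimum, and the loop over processors stands in for the all-reduce (the array contents and partition sizes are illustrative assumptions, not lecture data):

```c
#include <assert.h>
#include <limits.h>

#define N 8   /* number of vertices */
#define P 4   /* number of simulated processors */

/* Serial simulation of one parallel step of Prim's algorithm: each
   "processor" scans only its own block of the distance array for the
   closest vertex not yet in the tree, then the p local (distance, vertex)
   pairs are combined, standing in for an all-reduce. */
int global_min_vertex(const int dist[N], const int in_tree[N]) {
    int best_v = -1, best_d = INT_MAX;
    int block = N / P;
    for (int proc = 0; proc < P; proc++) {
        /* local minimum over this processor's block */
        int local_v = -1, local_d = INT_MAX;
        for (int v = proc * block; v < (proc + 1) * block; v++)
            if (!in_tree[v] && dist[v] < local_d) {
                local_d = dist[v];
                local_v = v;
            }
        /* the "all-reduce": combine the local results into a global minimum */
        if (local_d < best_d) { best_d = local_d; best_v = local_v; }
    }
    return best_v;
}
```

In a real message-passing implementation, the combining step would be something like an all-reduce with a min-location operation rather than a loop.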

What is the parallel complexity?

Each of the n iterations costs O(n/p) for the local scan of the distance array plus O(log p) for the all-reduce, so the overall parallel run time is O(n^2/p) + O(n log p) on p processors.

If we have O(n) processors, the whole thing is O(n) + O(n log n), though realistically, the cost of all that communication might make the constant on the communication term very large.

Dijkstra's Algorithm

We can parallelize Dijkstra's Algorithm (for computing single-source shortest paths) using a similar approach, and the complexities are the same.

Recall Dijkstra's Algorithm:

Dijkstra's Algorithm is a procedure to find shortest paths from a given vertex s in a graph G to all other vertices. The algorithm incrementally builds a sub-graph of G which is a tree containing shortest paths from s to every other vertex in the tree. A step of the algorithm consists of determining which vertex to add to the tree next.
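A minimal serial O(n^2) Dijkstra's in C, for reference. The adjacency matrix here is a made-up example, not the Williamstown graph from the lecture:

```c
#include <assert.h>
#include <limits.h>

#define N 5
#define INF INT_MAX

/* O(n^2) Dijkstra's algorithm on an adjacency matrix: fills dist[] with
   shortest-path costs from source s. The graph used in the test is an
   illustrative assumption. */
void dijkstra(int n, int w[N][N], int s, int dist[N]) {
    int done[N] = {0};
    for (int v = 0; v < n; v++) dist[v] = (v == s) ? 0 : INF;
    for (int step = 0; step < n; step++) {
        /* add the unfinished vertex with the smallest known distance */
        int u = -1;
        for (int v = 0; v < n; v++)
            if (!done[v] && (u == -1 || dist[v] < dist[u])) u = v;
        if (dist[u] == INF) break;   /* remaining vertices are unreachable */
        done[u] = 1;
        /* relax edges out of u */
        for (int v = 0; v < n; v++)
            if (!done[v] && w[u][v] != INF && dist[u] + w[u][v] < dist[v])
                dist[v] = dist[u] + w[u][v];
    }
}
```

Note the structural similarity to Prim's: the only real difference is that vertices are selected and relaxed by total path cost from s rather than by single edge weight.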

Consider the following graph:

From that graph, the algorithm would construct the following tree for a start node of Williamstown. Costs on edges indicate total cost from the root.

This is O(n^2) and can be parallelized just like Prim's Algorithm - it is essentially the same procedure.

But what about all-pairs shortest path?

Instead of the output being a list of shortest paths from one vertex to every other vertex, the output is the shortest path between each distinct pair of vertices in the graph.

One possibility is to apply the single-source Dijkstra's algorithm to each vertex, giving an algorithm of complexity O(n^3). This can be called Dijkstra's All-Sources Shortest Path algorithm.

How can we parallelize this?

Source-Partitioned Formulation

The straightforward approach would be to use n processors, assign each one of the n vertices, and have each run the serial single-source algorithm. This is perfectly parallel, but it can use at most n processors, giving an overall complexity of O(n^2). There is no communication component to the cost here.

Source-Parallel Formulation

If we want to have the possibility of more than n processors, up to n^2 processors, we can apply the parallel formulation of the single-source algorithm to each vertex.

To analyze this, we divide our p processors into n groups of size p/n. Each of these n groups works on one source, using its p/n processors in parallel, to get O(n^3/p) + O(n log p), where the first term represents time for computation and the second represents time for communication.

Floyd's Algorithm

Floyd's Algorithm is another approach to the all-pairs shortest paths problem. Here, we number the vertices of the graph v1, v2, ..., vn. We start by setting the "best known" distance between vi and vj as the weight of the edge between vi and vj, if it exists, infinity otherwise. Then we consider paths from vi to vj that pass through v1. The shortest path is now either the weight of the edge from vi to vj or a path from vi to v1 to vj. Then, we consider paths that include v2, then v3, and so on. At step k, we need to check if the path from vi to vk followed by the path from vk to vj is shorter than the best (so far) known path from vi to vj.

This is implemented with a triply nested for-loop, so the serial version is O(n^3). Below, A is the adjacency matrix of the graph, and D^k is the matrix of best known path lengths after considering intermediate vertices v1 through vk:

D^0 = A;
for (k=1; k<=n; k++)
  for (i=1; i<=n; i++)
    for (j=1; j<=n; j++)
      D^k[i][j] = min(D^(k-1)[i][j], D^(k-1)[i][k] + D^(k-1)[k][j]);
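A runnable C version of this, as a sketch. A single matrix can be updated in place, because row k and column k do not change during iteration k (D[i][k] and D[k][j] pick up only zero-weight detours through vertex k itself). The test graph and the large-but-finite INF sentinel are illustrative choices:

```c
#include <assert.h>

#define N 4
#define INF 1000000   /* large sentinel, so INF + weight does not overflow */

/* Floyd's algorithm, updating D in place: after iteration k, d[i][j] is the
   shortest path from i to j using only intermediate vertices 1..k. */
void floyd(int n, int d[N][N]) {
    for (int k = 0; k < n; k++)
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];
}
```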

What about parallelization?

Since D^k needs to know values from D^(k-1), we cannot easily parallelize the outer loop. But what can we do inside?

Let's think about it in terms of p processors. Furthermore, we'll assume that the processors are arranged logically as a sqrt(p) × sqrt(p) grid, and the matrix D is broken into p blocks of size (n/sqrt(p)) × (n/sqrt(p)). Each processor updates its own submatrix at each iteration.

What information does each processor need at the kth iteration? It needs the segments of the kth row and kth column that intersect its block. The processors holding a segment of the kth row broadcast it down their processor column, and the processors holding a segment of the kth column broadcast it across their processor row.

Pseudocode for this procedure:

  for (k=1; k<=n; k++) {
     each process P[i][j] that has a segment of the kth row of D^(k-1)
        broadcasts it to the P[*][j] processes
     each process P[i][j] that has a segment of the kth column of D^(k-1)
        broadcasts it to the P[i][*] processes
     each process waits to receive the needed segments
     each process P[i][j] computes its local part of D^k
  }
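As a sketch of what this computes, here is a serial simulation in C with the broadcasts modeled as buffer copies; each block of the loop over (bi, bj) stands in for one process in the sqrt(p) × sqrt(p) grid. The 4-vertex graph and the 2 × 2 grid are illustrative assumptions:

```c
#include <assert.h>

#define N 4            /* number of vertices */
#define Q 2            /* processor grid is Q x Q, i.e. p = 4 */
#define B (N / Q)      /* each block is B x B */
#define INF 1000000    /* large finite sentinel */

/* Serial simulation of the block-partitioned parallel Floyd's algorithm:
   at step k, the owners of row k and column k "broadcast" them (here,
   copies into buffers), then every block updates its own submatrix using
   only the buffers and its local data. */
void floyd_blocked(int d[N][N]) {
    int row_buf[N], col_buf[N];
    for (int k = 0; k < N; k++) {
        /* broadcast of the kth row and kth column to all blocks */
        for (int j = 0; j < N; j++) row_buf[j] = d[k][j];
        for (int i = 0; i < N; i++) col_buf[i] = d[i][k];
        /* each of the Q*Q processes updates its own B x B block */
        for (int bi = 0; bi < Q; bi++)
            for (int bj = 0; bj < Q; bj++)
                for (int i = bi * B; i < (bi + 1) * B; i++)
                    for (int j = bj * B; j < (bj + 1) * B; j++)
                        if (col_buf[i] + row_buf[j] < d[i][j])
                            d[i][j] = col_buf[i] + row_buf[j];
    }
}
```

The key point the simulation shows is that each block's update at step k touches only its own submatrix plus the broadcast row and column buffers, which is exactly why the message-passing version needs only the two broadcasts per iteration.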

In each iteration, the processors holding the kth row and kth column each do a one-to-all broadcast along their column or row of sqrt(p) processors. Each such processor broadcasts n/sqrt(p) elements, for a communication cost of O((n/sqrt(p)) log p).

The computation at each iteration is O(n^2/p).

So the total, over n iterations: O(n^3/p) + O((n^2/sqrt(p)) log p).