Computer Science 335
Parallel Processing and High Performance Computing

Fall 2021, Siena College

Lab 6: MPI on Stampede2
Due: 4:00 PM, Monday, October 11, 2021

In this lab, you will learn how to run interactive and batch jobs with MPI on the Stampede2 production nodes.

You must work individually on this lab, so everyone learns how to use Stampede2.

Learning goals:

  1. To learn how to use Stampede2 to run MPI programs on interactive nodes.
  2. To learn how to use Stampede2 to run MPI programs on batch nodes.

Much of what we'll be doing is based on the Stampede2 User Guide.

Getting Set Up

You can find the link to follow to set up your GitHub repository stampedempi-lab-yourgitname for this lab in Canvas. Since you are working individually on this lab, follow the link to create the repository on GitHub. This will allow you to clone the repository and commit and push changes to the origin on GitHub. Make a clone of the repository to begin work.

You can create your repository and clone it on your computer right away if you'd like, but don't clone it on Stampede2 until instructed to do so.

You may choose to answer the lab questions in the README.md file in the top-level directory of your repository, or upload a document with your responses to your repository, or add a link to a shared document containing your responses to the README.md file.

Better GitHub Interaction on Stampede2

As we have seen, GitHub authentication on systems where you cannot authenticate in a browser is inconvenient at best. On noreaster, we used the gh tool to authenticate with an authentication token. We will do the same on Stampede2.

Following the procedure from the earlier lab, use the XSEDE single sign-on system to connect to the Stampede2 system at TACC. Recall that this is a two-step process: first you connect to the XSEDE SSO with your username and password, which requires the Duo app for two-factor authentication, and then you use gsissh to connect to stampede2.

Question 1: What is the hostname of the node to which you are connected on Stampede2? (Hint: this is the output of the hostname command) (2 points)

On the Stampede2 login node, create a directory named bin within your home directory. Unix systems generally use directories named bin to hold executables, and these directories are added to your shell's search path. When you have executables to install within your own account (as opposed to a system-wide location available to all users), the common convention is to place them in the bin directory at the top level of your home directory. On many systems, this directory is already in your search path, but that is not the case on Stampede2, so we will add it.

To do this, edit the .bashrc file in your account's home directory using your favorite text editor. Scroll down to line 63 and uncomment the line that modifies the PATH environment variable to include the bin directory you just created.
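
Once uncommented, the line should look something like this (the exact form in the TACC-provided .bashrc may differ slightly):

    export PATH=$HOME/bin:$PATH    # add your own bin directory to the search path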

Now, that line will execute every time you log into Stampede2, and any programs in your bin directory will be available. There's nothing there yet. Let's put a shell script there to make sure things are set up as intended:

  1. cd into your new bin directory.
  2. Create a new file binworks with the cat utility (this is a quick and convenient way to create small text files on a Unix system); a sketch of what to type appears after this list.
  3. Assuming you successfully got your shell prompt back, you now just need to change the permissions on the file so it's executable:
    chmod 755 binworks
    
  4. Return to your home directory with the cd command with no parameters
  5. Re-read your bash startup file with the command
    . .bashrc  
    
  6. Type binworks, and if you see "my bin works!" as output, everything worked.
  7. If it didn't work, log out, log back in, and try typing binworks again.
  8. If it still doesn't work, email your helpful instructor.
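
For step 2, a minimal version of the script, created with cat, could look like the following; type the two lines after the cat command, then press Ctrl-D at the start of a new line to end the input and get your shell prompt back. Any contents that print "my bin works!" will do; the shebang line here is just the usual convention.

    cat > binworks
    #!/bin/bash
    echo "my bin works!"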

Now, every time you log into Stampede2, anything you place in that bin directory will be in your search path.

The whole point of this was to be able to install a copy of the gh utility so you can use it to authenticate git to your GitHub account through your authentication token. You could download the cli repository from GitHub, configure, build, and install it in your Stampede2 account, but instead you can just copy in the executable I built.

  1. cd ~/bin
  2. Copy the gh executable from the course web site with the command
    wget --no-check-certificate https://courses.teresco.org/cs335_f21/gh
    

    (you might get some warnings, but the file should transfer)

  3. Make sure the file has execute permissions with the command
    chmod 755 gh
    

Finally, you can authenticate to GitHub so you can clone, push to, and pull from repositories on Stampede2. You will likely need to repeat these steps periodically, as it seems the Stampede2 login nodes don't remain authenticated for long. Run the command

gh auth login

and choose the responses as shown here:
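
If you prefer not to answer the interactive prompts each time you re-authenticate, gh can also read the token from standard input; for example (the token file name here is only an illustration, use wherever you saved yours):

    gh auth login --with-token < ~/github-token.txt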

You can now clone a copy of this lab's repository and your repository from Lab 4 onto Stampede2.

Using a Stampede2 Login Node

The default compiler and MPI configurations should be sufficient for most or all of our purposes. Compile with mpicc and run programs with mpirun.

We can compile but not run MPI programs on the login nodes. They are not intended for that purpose. But let's try anyway.
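
For example, assuming you name the executable mpihello, the compile and (attempted) run steps look roughly like this (mpirds is analogous):

    mpicc -o mpihello mpihello.c    # compile the MPI hello world program
    mpirun -np 2 ./mpihello         # attempt to run it with 2 processes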

Question 2: Compile the mpihello.c program (in your repository for this lab) and mpirds (from the earlier lab) on a Stampede2 login node and try to run each with 2 processes. What is your output? (2 points)

Parallel computations will be done on Stampede2's compute nodes, not the login nodes. The compute nodes are managed by a queueing system called Slurm, which grants subsets of the machine to individual users. When you are allocated a set of compute nodes, no other users can run jobs on them. But that time counts against our service allocation, so only request compute nodes when you really need them.

You can see current summaries of the jobs in the queueing system with the commands sinfo and squeue.

You can see the types of nodes available through the queueing system with the command

sinfo -o "%18P %8a %16F"

This will show a compact summary of the status of the queues. See the Stampede2 User Guide for how to interpret the output.

Question 3: When you executed the above command, what was the output? How many normal nodes are in the Active state? (2 points)

Using a Compute Node Interactively

A user can (and should) log into only those compute nodes currently allocated to that user. We can gain exclusive access to one with the idev command.

Run idev. When prompted, request the default of 68 tasks per node. After a (hopefully) short time, you should get a command prompt on a compute node.
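
Once you have the prompt on the compute node, the runs for Question 5 look much like they did on the login node, following the mpirun convention from above and assuming the executables are in your current directory:

    mpirun -np 64 ./mpihello    # 64 MPI processes on the allocated compute node
    mpirun -np 64 ./mpirds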

Question 4: What is the host name of the compute node you were allocated? (2 points)

Question 5: What is the output of the mpihello program and the mpirds program when executed with 64 processes on a compute node? (4 points)

Please log out from the compute node as soon as you are finished with the tasks above. The time the node is allocated to you is charged against our class allocation, regardless of whether you are actively using the CPUs.

Submitting Batch Jobs to the Compute Nodes

The most common way to use the compute nodes of this or any supercomputer is by submitting batch jobs. The idea is that you get your program ready to run, then submit it to be executed when the resources to do so become available.

To set up a batch job, you first need a batch script that can be configured to run your program.

  1. Copy the file /share/doc/slurm/knl.mpi.slurm to the directory that contains your clone of the repository.
  2. Edit it to use "hellotest" in place of "myjob" (on 3 lines), to request 1 node (the -N line), 32 MPI tasks (the -n line), and a run time of 5 minutes (the -t line), and to replace the email address in the --mail-user option with your own. Remove the "#SBATCH -A myproject" line, since we have only one allocation to charge for our runs. Finally, replace ./mycode.exe with the name of your executable (which should be mpihello for this first run). A sketch of the edited script appears below.
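
After these edits, the #SBATCH directives and the launch line should look roughly like the sketch below. The queue, mail-type, and ibrun launcher lines are assumed to match the TACC sample and may differ slightly in your copy; keep whatever your copy provides, change only the items listed above, and leave any other setup commands in the script alone. The email address shown is a placeholder.

    #!/bin/bash
    #SBATCH -J hellotest          # job name
    #SBATCH -o hellotest.o%j      # stdout output file (%j expands to the job id)
    #SBATCH -e hellotest.e%j      # stderr output file
    #SBATCH -p normal             # queue (partition) name
    #SBATCH -N 1                  # total number of nodes requested
    #SBATCH -n 32                 # total number of MPI tasks
    #SBATCH -t 00:05:00           # run time limit (hh:mm:ss)
    #SBATCH --mail-user=you@siena.edu
    #SBATCH --mail-type=all       # send email when the job begins and ends

    ibrun ./mpihello              # launch line from the sample, with our executable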

Note that the Slurm script also says you should be running out of the $SCRATCH directory. This means you should copy your executable to the $SCRATCH directory in your account, which is on a shared filesystem to which the compute nodes have faster access.

cp mpihello $SCRATCH
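
Since sbatch finds a relative script path from your current directory and the job's working directory defaults to where you submit it, you will probably also want the batch script in $SCRATCH and to submit from there; one way, assuming the script is in your current directory:

    cp knl.mpi.slurm $SCRATCH    # put the batch script alongside the executable
    cd $SCRATCH                  # submit the job from $SCRATCH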

Submit the job to the queueing system with the command

sbatch knl.mpi.slurm

Question 6: What output do you get at your terminal before you get your prompt back? (1 point)

Question 7: You should have received email when your program began executing, and again when it finished. What are the subject lines of those emails? (2 points)

Question 8: What file contains your program's output? How was it specified? Place this file in your repository for this lab submission (don't forget to add, commit and push it so it's on GitHub). (3 points)

Question 9: According to your program's output, what was the host name on which your program executed? (1 point)

Next, let's run with more nodes and processes. We will request 8 nodes, and run 64 processes per node (so the -n value in your script should be 512).
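
In the batch script, that amounts to changing just the resource request lines, roughly:

    #SBATCH -N 8                  # 8 nodes
    #SBATCH -n 512                # 64 MPI tasks per node across 8 nodes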

Question 10: What were the host names allocated to your processes on this run? (1 point)

Question 11: What are the ranks of the processes that were assigned to each node? Hint: the grep command might be helpful here. (3 points)
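
For example, if your output file is named hellotest.o followed by the job id, something along these lines (the host name and job id shown are hypothetical) pulls out the lines, and therefore the ranks, assigned to one host:

    grep c455-001 hellotest.o123456    # lines mentioning host c455-001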

Include the last batch script (3 points) and the last output file (4 points) in your repository.

Now let's do the same for the mpirds program. Create another Slurm script with appropriate values to run mpirds on 4 nodes and a total of 256 processes. Include this batch script and the output file in your repository for this lab (not the previous one where you wrote mpirds). (5 points)

Submission

Commit and push!

Grading

This assignment will be graded out of 35 points.

Feature                              Value  Score
Question 1                               2
Question 2                               2
Question 3                               2
Question 4                               2
Question 5                               4
Question 6                               1
Question 7                               2
Question 8                               3
Question 9                               1
Question 10                              1
Question 11                              3
mpihello batch script                    3
mpihello batch output file               4
mpirds batch script and output file      5
Total                                   35