Programming Project 3 -- Multithreaded Palindromic Word Finder

Computer Science 400
Parallel Processing and High Performance Computing

Fall 2017, Siena College

Programming Project 3: Multithreaded Palindromic Word Finder
Due: 11:59 PM, Thursday, November 2, 2017

For this project, you will be writing a program to find palindromic words in a large dictionary. You may work alone or in groups of two or three.

Getting Set Up

You will receive an email with the link to follow to set up your GitHub repository palindromic-project-yourgitname for this Programming Project. One member of the group should follow the link to set up the repository on GitHub, then that person should email the instructor with the other group members' GitHub usernames so they can be granted access. This will allow all members of the group to clone the repository and commit and push changes to the origin on GitHub. At least one group member should make a clone of the repository to begin work.

Palindromic Word Finder

There is an online dictionary /usr/share/dict/words on noreaster that can be used by programs that need to do spell checking. You know that a palindrome is a word or phrase that reads the same in either direction, i.e., if you reverse all the letters you get the same word or phrase. A word is palindromic if its reverse is also in the dictionary. For example, "noon" is palindromic, because it is a palindrome and hence its reverse is trivially in the dictionary. A word like "draw" is palindromic because "ward" is also in the dictionary.

Your task is write a C or C++ program to find all palindromic words in the dictionary. You should first write a sequential program and then parallelize it, using the bag-of-tasks paradigm. However, keep your plan for parallelization in mind when designing and implementing the sequential version. You might want to do some things in the sequential program that you will need in the parallel program, such as finding where each letter begins in the dictionary (see below for more on this).

Your sequential program should have the following phases:

Read the file /usr/share/dict/words into an array of strings. You can determine the size of this array ahead of time by running wc on the dictionary to see how many words it contains. You may assume that each word is at most 28 characters long when running on FreeBSD and 45 characters long when running on Linux.
Aside: I figured this out with the command
```
awk ' { if (length($1) > max) max = length($0) } \
  END { print max } ' words
```
and the longest word (on Linux) is "pneumonoultramicroscopicsilicovolcanoconiosis".
For each word, compute its reverse and do a linear search of the dictionary to see if the reverse of the word is in the dictionary. If so, mark the word and increment your counter of the total number of palindromic words. To improve efficiency, you might want to keep track of the positions in your array that store the first word beginning with each letter of the alphabet. Then you can search only within that letter's range when searching for a word's reverse.
Print the total number of palindromic words to the screen (stdout).

Words in the dictionary could start with numbers or punctuation; you can either skip over them or process them, as you wish. None are palindromic, so this choice will not affect your total count. Some words start with capital letters (and hence the dictionary is not sorted in ASCII order). To keep your program simple, leave the capitals alone and do case-sensitive comparisons.

The sequential program should be fairly straightforward, but it will likely take longer than you expect to implement. Please ask if you run into trouble with the sequential version! You will need to get it done soon enough to leave enough time left to work on the parallelization - which is the whole point of the assignment!

After you have a working sequential program, modify it to use the bag-of-tasks paradigm, implemented using the pthreads library. Your parallel program should use W worker threads, where W is a command-line argument. Use the workers just for the compute phase; do the input and output phases sequentially. Each worker should count the number of palindromic words that it finds. Sum these W values during the output phase. This avoids a critical section during the compute phase - you'll need to deal with others, though. Use 26 tasks in your program, one for each letter of the alphabet. In particular, the first task is to examine all words that begin with "a" (and numbers), the second task is to examine all words that begin with "b", and so on. During the input phase you should build an efficient representation for the bag of tasks; I suggest using an array, where the value in task[0] is the index of the first "a" word, task[1] is the index of the first "b" word, and so on. You can also use this array during the search phase to limit the scope of your linear searches. Your parallel program should also time the compute phase. You may use the timer.c and timer.h code from our class examples. Start the clock just before you create the workers; read it again as soon as they have finished. Write the elapsed time for the compute phase to stdout.

To summarize, your program should have the following output:

For the timing tests, execute your parallel program on noreaster using 1, 2, 4, 8, 16, and 26 workers. Run each test 3 times. Create a table of results; it should contain all the values written to standard output (but not the words themselves) for all 18 test runs, and a brief analysis of the speedup and parallel efficiency you have achieved.

Repeat your experiment on a 68-core node on Stampede2. Note that the dictionary file for Linux is different, and not installed by default on Stampede2 nodes. You should transfer a copy of /usr/share/dict/linux.words from another Linux machine such as olsen.cs.siena.edu (use sftp from a Stampede2 login node back here to olsen to keep this simple). Place it right in your directory with your code and executable, but since the file is large, please do not have it tracked by git. I will obtain my own copy for testing when I grade your submissions. This file has more words and includes longer words, so be sure to make any needed code changes before running there. You must follow the guidelines about where to compile (login nodes) and where you can run your tests (an interactive node allocated with idev or a batch node allocated through Slurm).

Your submission should include your Makefile, your C source code (including the timer code from class, if you choose to use it), and a note in README.md file expaining how to run your program Please do not include object files or your executable.

Notes:

Submitting

Your submission requires that all required deliverables are committed and pushed to the master for your repository on GitHub.

Grading

This assignment is worth 75 points, which are distributed as follows:

> Feature	Value	Score
Palindromic correctness	40
Palindromic style	5
Palindromic documentation	10
Timing studies	20
Total	75