Computer Science 210
Data Structures

Fall 2018, Siena College

Lab 8: Word Frequency Counting with Ordered Structures
Due: 3:45 PM, Tuesday, November 27, 2018

In this week's lab, we will use an ordered structure to improve the efficiency of a program that tracks the freqency of words it encounters on its input.

You will be assigned a partner to work with on this lab. Only one submission per group is needed.

As you finish each step, please have your instructor initial the hard copy of this lab handout you were issued to indicate successful completion.

Names:

Learning goals:

  1. To learn more about ordered data structures by studying its in an application.
  2. To learn how to run a Java program from the command line, passing command line parameters, and redirecting input and output.
  3. To gain extra experience with Comparable objects and using Comparators.
  4. To introduce Java's Arrays.sort generic sorting capability.

Submitting

As you complete each programming task, commit with a meaningful commit message and push to your group's GitHub repository for this lab.

Once all written items are initialed to indicate completion, turn in the hard copy of this lab handout you were issued.

Getting Set Up

You will receive an email with the link to follow to set up your GitHub repository for this Lab (ordered-lab-yourgitname). One member of the group should follow the link to set up the repository on GitHub. Other group members will receive a subsequent email with a link to follow that will grant them the rights to clone the repository and commit and push changes to the origin on GitHub.

Improving a Program with an Ordered Structure

In your repository, you will find a program in WordFrequency.java that reads in a series of words (well not necessarily words, but whatever Strings are created from whitespace-delimited groups of characters) from the input. As it reads, it keeps track of all the words it's found and how many times each has occurred on the input. At the end, it prints out each word with its frequency, a count of the total number of words, and how long the program took to process those words.

While you can compile and run this in BlueJ, there's no way (it seems) to tell BlueJ you're done giving input, which is now it knows to stop reading words and report results. So we will need to run this program outside of BlueJ at the command line.

In your Git Bash or Mac Terminal window (subsequently called the "command line"), you can compile and run Java applications. First, make sure your command line has been issued the proper cd command to have your repository as its working directory. If you type "ls" and see the files for your repository, you're good to go.

To compile a Java program with the command line, use the javac compiler. For this program, you'll need

javac WordFrequency.java

This is the equivalent of hitting the "Compile" button in BlueJ, and will either issue you some error messages, or will create the file WordFrequency.class.

Once you have successfully compiled and have the WordFrequency.class file, you can run it by passing the name of a class that has a proper main method as a parameter to the java command, which invokes the Java Virtual Machine, starting execution with that main method. In this case:

java WordFrequency

You can now type whatever words you'd like the program to count right in your window. To tell the program that you're done, type a Ctrl-d on a blank line. This send the "end of file" character and makes the Scanner's hasNext method return false.

Question 1: Demonstrate the completed execution of your test run. (1 point)

Once we are running at the command line, we can take advantage of other convenient features. One of the most useful is input and output redirection. We'll work first with the input.

First take a look at the contents of the file little.txt. One way to do this is the command

cat little.txt

You'll see a small collection of words. If we want to use the contents of the file as the input to our program, we can modify our command line to make that happen:

java WordFrequency < little.txt

The < followed by the name of the file directs the shell program that's interpreting your commands to tell the operating system to run the program, but instead of waiting for any System.in input (or in the Unix terminology, the standard input) to come from that file instead of from the keyboard. From your program's perspective, it's exactly the same as if you typed in the contents of that file while it was running.

Question 2: Demonstrate the result of your execution using input redirection from little.txt. (1 point)

Try again, this time redirect input from the file whosonfirst.txt.

Question 3: Demonstrate the result of your execution using input redirection from whosonfirst.txt. (1 point)

Unsurprisingly, a similar syntax can be used at the command line to perform output redirection, where anything that is printed to System.out (Unix term: the standard output) and normally displayed at the terminal is instead sent to a file.

To run our program using little.txt as input and redirecting the output to little-out1.txt, you can run:

java WordFrequency < little.txt > little-out1.txt

You won't see anything in your terminal, because everything the program sent to System.out was redirected by the operating system to the file little-out1.txt. Check it out by running the ls command to see that the file has been created, and with the cat command to make sure the expected output is in there.

Question 4: Demonstrate that your successfully redirected the output of the program to little-out1.txt (1 point)

Now, do the same with the somewhat larger input file whosonfirst.txt to create whosonfirst-out1.txt and the much larger input file sienacatalog.txt to create sienacatalog-out1.txt.

For these larger files, you might not want to look at the whole output file with cat. Instead, use the tail command to show only the last 10 lines of each.

Question 5: Demonstrate the existence of whosonfirst-out1.txt and sienacatalog-out1.txt by using tail to show the last 10 lines of each. (2 points)

Add, commit, and push these three files to your repository on GitHub for safekeeping. We'll need them again later.

Take a look at the program in WordFrequency.java.

Question 6: Add detailed comments to the while loop in the main method describing how this code accomplishes the task of counting the frequencies of each word. (6 points)

Question 7: Describe in some detail how the indexOf method call, searching for a WordCount object which always has a count of 0, will ever find a match, given that every entry in the ArrayList will be a reference to a WordCount object which has at a count of at least 1. (6 points)

Question 8: Draw a memory diagram, limited to the ArrayList and the WordCount objects in existence, after the first line of words from little.txt has been processed. (15 points)

Question 9: If n unique words are in the ArrayList, what are the best and worst case running times of the indexOf, add, and get method calls? Explain briefly. (6 points)

Next, we will use the ordered structure in OrderedArrayList as a replacement for the plain old ArrayList in the program to start improving its efficiency. The OrderedArrayList is essentially the same as Bailey's OrderedVector we considered in class, but it has no dependency on the rest of his structure package, and it has an additional method called get.

Question 10: If n elements are in an OrderedArrayList, what are the best and worst case running times of the add, get, and contains methods? Explain briefly. (6 points)

Make a copy of all of the WordFrequency class in WordFrequency.java and call it WordFrequency1 and make it a non-public class. This will allow you to keep it around as a reference as you modify the main method of the actual WordFrequency class in the remaining tasks of this lab.

Your programming task is to replace the ArrayList with an OrderedArrayList while maintaining the functionality. This should result in a more efficient implementation, and as a side effect, when the words and frequencies are printed at the end, they'll be in alphabetical order.

A few things to consider as you design and implement this improvement:

Question 11: Demonstrate your working version of WordFrequency using the OrderedArrayList. (25 points)

Using input and output redirection, create files little-out2.txt, whosonfirst-out2.txt, and sienacatalog-out2.txt containing the output for our three input files.

Question 12: What are the running times reported for each of the input files for the original and improved versions of the program? (You might find tail useful again.) Where does the new program gain in efficiency over the old? Be specific. (10 points)

Finding Most Frequent Words

As a final improvement, we'll add some functionality to report the (up to) 10 most frequent words encountered, in descending order. There are many ways to do this, but we'll take a straightforward approach that only involves creating a Comparator, putting our list of words into an array, and using a generic array sorting method.

Your tasks:

Question 13: Demonstrate your program with this enhancement implemented. (20 points)

Don't forget to commit and push all of your code for this lab to your group's GitHub repository.