Computer Science 210
Data Structures
Fall 2017, Siena College
Lab Bonus/10: Huffman Trees
Due: before the final exam
In this optional bonus lab, we will bring together several of the data
structures we have studied and build a program to perform Huffman
compression and decompression. There is a good amount of code to
write here (my solution to the programming project is about 300
lines).
You may work alone or in a group of two or three on this lab. Only
one submission per group is needed.
Getting Set Up
You will receive an email with the
link to follow to set up your GitHub repository
huffman-lab-yourgitname for this lab. One member of the
group should follow the link to set up the repository on GitHub,
then that person should email the instructor with the other group
members' GitHub usernames so they can be granted access. This will
allow all members of the group to clone the repository and commit
and push changes to the origin on GitHub.
Practice Program: Counting Word Frequencies
One of the steps in constructing a Huffman tree is to determine the
frequency of the objects to be encoded (in our case, these will be
words). Your task is to develop two classes: WordCount and
WordCountList to achieve this. The WordCount class is
straightforward: it encapsulates a String (a word from the
input) and an int (the number of times that word has been
encountered on the input). WordCountList maintains a
list of words, and has methods to update the list as words are
processed from the input.
Include a main method in WordCountList that reads in words
from the standard input, converts them to lower case, and adds them to
a WordCountList. After the input has been exhausted, it should
print out all the words encountered and their frequencies.
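One possible shape for these two classes is sketched below, using java.util rather than the course's structure5 library so the sketch stands alone; the method names beyond the two required class names are assumptions, not requirements.

```java
import java.util.ArrayList;
import java.util.Scanner;

// Pairs a word with the number of times it has been seen on the input.
class WordCount {
    private final String word;
    private int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() { return word; }
    public int getCount() { return count; }
    public void increment() { count++; }

    public String toString() { return word + ": " + count; }
}

// Maintains a list of WordCounts, one entry per distinct word seen.
class WordCountList {
    private final ArrayList<WordCount> counts = new ArrayList<>();

    // Record one occurrence of word, adding a new entry if it is new.
    public void add(String word) {
        for (WordCount wc : counts) {
            if (wc.getWord().equals(word)) {
                wc.increment();
                return;
            }
        }
        counts.add(new WordCount(word, 1));
    }

    public ArrayList<WordCount> getCounts() { return counts; }

    // Read words from standard input until EOF (Ctrl-D), then report.
    public static void main(String[] args) {
        WordCountList list = new WordCountList();
        Scanner in = new Scanner(System.in);
        while (in.hasNext()) {
            list.add(in.next().toLowerCase());
        }
        for (WordCount wc : list.getCounts()) {
            System.out.println(wc);
        }
    }
}
```

A linear scan through an ArrayList is fine at this scale; the point of the practice program is the interface, not the lookup speed.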
Note that when typing values in the terminal window using the
keyboard, you can indicate the end of input by typing a Ctrl-D.
Both of these classes will come in handy for the programming
assignment below.
Huffman Trees
The remaining task will be graded as a programming assignment, so pay
attention to style and documentation.
Our discussion in class and the example code in Bailey build
Huffman trees that encode a string character by character using
bit strings that are computed to minimize the length of the encoding
for that string. However, Huffman trees can be used to encode items
of any kind of data. The encoding only pays off when many of the
items are the same, and hence can share common encodings. Here, we
will encode a
piece of text by words.
For example, suppose we are presented with the text
I like the same words over and over and over and over I mean the same
words over and over you see I like the same words over and over and
over again
One possible word-based Huffman encoding (ignoring case) would be:
Encoding of and is 00 (frequency was 6)
Encoding of i is 010 (frequency was 3)
Encoding of you is 01100 (frequency was 1)
Encoding of mean is 01101 (frequency was 1)
Encoding of again is 01110 (frequency was 1)
Encoding of see is 01111 (frequency was 1)
Encoding of over is 10 (frequency was 9)
Encoding of like is 1100 (frequency was 2)
Encoding of words is 1101 (frequency was 3)
Encoding of same is 1110 (frequency was 3)
Encoding of the is 1111 (frequency was 3)
which results in the encoded text:
0101100111111101101100010001000100100110111111110110
110001001100011110101100111111101101100010001001110
Program Requirements
Your program should be run from a main method in a class named
HuffmanStrings. It should have three modes of operation, selected
based on the command-line parameters provided.
- When no command-line parameters are provided, the program reads
words from the standard input (the keyboard, unless input is
redirected from a file), constructs a Huffman tree, and prints out
the encodings it has computed for the words found on the input.
Here is an example run (from the command line) to use this mode to
compute and display the encoding for the text in a file
samewords.txt:
java HuffmanStrings < samewords.txt
- When two command-line parameters are provided, -e followed
by the name of a file, the program reads words from the standard
input (again the keyboard, unless input is redirected from a file),
constructs a Huffman tree for the words found on the input, and then
stores a representation of the tree followed by the actual encoding
of the input in the file given as the second command-line parameter.
Here is an example run (from the command line) to use this mode to
compute the encoding for the text in a file samewords.txt and
store the Huffman tree and the encoded data in the file
samewords.hc:
java HuffmanStrings -e samewords.hc < samewords.txt
The contents of samewords.hc:
6 and
3 i
1 you
1 mean
0
1 again
1 see
0
0
0
0
9 over
2 like
3 words
0
3 same
3 the
0
0
0
0
-1
0101100111111101101100010001000100100110111111110110
110001001100011110101100111111101101100010001001110
(though the code is all on one line, not wrapped as shown above).
The file format is described below.
- When two command-line parameters are provided, -d followed
by a file name, the program reads the encoded file, uses its
contents to recreate the Huffman tree, then uses the tree to decode
the original words, and prints those to the screen.
Here is an example run (from the command line) to use this mode to
decode samewords.hc:
java HuffmanStrings -d samewords.hc
The original text from the start of this section is printed.
File Format
You may design your own file format or you may use the one from the
example above. The format shown above is quite simple, and takes
advantage of the fact that Huffman trees are always structured such
that all internal nodes have both a left and a right child. Of
course, leaf nodes have no children. So there are no nodes with
either just a left or just a right child. This allows us to generate
the above file with a simple post-order traversal, and reconstruct the
tree from the file using a straightforward process.
To generate the file, perform a post-order traversal. For a leaf
node, output a line with the frequency of the word represented by that
leaf and the word. For an internal node, output a line with a 0.
After the traversal, a -1 is placed in the file to separate the tree
from the encoded text.
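The write step can be sketched as follows. The HNode class and its field names are illustrative stand-ins for whatever tree class you build, and the sketch appends to a StringBuilder rather than a file so that it is self-contained.

```java
// Minimal node for illustration: leaves carry a word, internal nodes don't.
class HNode {
    String word;      // null for internal nodes
    int freq;
    HNode left, right;
    HNode(String w, int f) { word = w; freq = f; }
    HNode(HNode l, HNode r) { left = l; right = r; freq = l.freq + r.freq; }
}

class TreeWriter {
    // Post-order traversal: a leaf contributes "freq word",
    // an internal node contributes "0" after both of its children.
    static void write(HNode n, StringBuilder out) {
        if (n.word != null) {
            out.append(n.freq).append(' ').append(n.word).append('\n');
        } else {
            write(n.left, out);
            write(n.right, out);
            out.append("0\n");
        }
    }

    // Produce the tree section of the file, ended by the -1 separator.
    static String fileFor(HNode root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb);
        sb.append("-1\n");
        return sb.toString();
    }
}
```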
To re-create the tree from the file, a process similar to that used to
create the original Huffman tree is used. When a line in the file is
processed, it will either be a 0 or a positive number followed by a
word. When the latter is encountered, a leaf node is created and
remembered in a list. When a 0 is encountered, the two most recently
created tree nodes become the right and left children of a new tree
node. When the -1 is encountered to indicate the end of this part of
the file, there should be a single tree in the list, and that is the
original Huffman tree!
At that point, the tree is ready to be used to decode the pattern of
0's and 1's back to the words they represent.
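The reconstruction described above can be sketched with a stack, since the two most recently created nodes are always the ones combined; the RNode class and names are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal node for illustration (field names are assumptions).
class RNode {
    String word;
    int freq;
    RNode left, right;
    RNode(String w, int f) { word = w; freq = f; }
}

class TreeReader {
    // Rebuild the tree from the file's lines, stopping at the -1 marker.
    static RNode rebuild(String[] lines) {
        Deque<RNode> stack = new ArrayDeque<>();
        for (String line : lines) {
            if (line.equals("-1")) break;   // end of the tree section
            if (line.equals("0")) {         // internal node: combine the
                RNode n = new RNode(null, 0);
                n.right = stack.pop();      // most recent node is the right child
                n.left = stack.pop();       // next most recent is the left
                n.freq = n.left.freq + n.right.freq;
                stack.push(n);
            } else {                        // "freq word" line: make a leaf
                String[] parts = line.split(" ", 2);
                stack.push(new RNode(parts[1], Integer.parseInt(parts[0])));
            }
        }
        return stack.pop();                 // the single remaining tree
    }
}
```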
Hints, Suggestions, Further Details
This section will expand as questions are asked and answered.
- Be sure you understand everything in the topic notes on Huffman
trees and in Sections 12.8 and 13.2 of Bailey. The
sample code in those sections is not ideal as a starting point, but
it can help guide your design, especially the class that implements
the Huffman tree using the structure5.BinaryTree class.
- Develop with small files (here's another:
ferris.txt), but be sure to test on larger
files also.
- Other than the main method in WordCountList, all of
the code you developed for the practice program should be ready to
use for the programming assignment.
- This is a large program, so tackle one piece at a time. Comment before you code! If you can't describe what you're
intending a piece of code to do in English then how can you describe
it in Java? The proper choice of data structures will simplify
your code. Here's one way to approach it in a step-by-step
procedure:
- Write a main method in HuffmanStrings that
computes the word frequencies. Yes, you do already have code
that does this. Use it!
- Set up to build your Huffman tree. You'll need a class (I
call mine HuffmanTree) that represents nodes of the tree.
The example in Bailey is a good one: a
HuffmanTree object encapsulates a binary tree node whose
value is one of your WordCount objects (for leaf nodes,
while you probably want it to be null for interior nodes)
and which also has a field to store the total frequency of the
words represented by all descendants of the node. You'll need
one of these for each WordCount that will become the leaf
nodes of your Huffman tree.
- Recall that the Huffman tree construction procedure involves
finding the two nodes (in this case, HuffmanTree objects)
with the lowest total word frequency that have not yet been
added as the child of some other node. If you make your
HuffmanTree objects Comparable, with "high priority"
nodes being ones with a low total frequency, a priority queue is
exactly what you need. Again, you can use the Bailey
sample code (in the priority queue chapter) as a guide.
- The construction of your entire tree then becomes very
straightforward: keep taking the smallest two values out of the
priority queue, create a new tree node with these as children,
and add that node to the priority queue. When the priority queue
eventually has just a single value remaining, it is the root of
your Huffman tree! By properly structuring your data (you made
HuffmanTree a Comparable) and letting existing data
structures (a priority queue written by someone else) do the
work, a complicated process is achieved with surprisingly simple
code. Since BinaryTrees are not allowed to have null
values, you'll need a placeholder WordCount for the internal
nodes. Something like this would work:
new WordCount("", frequency)
where the variable frequency holds the total frequency at that node.
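The construction loop described above can be sketched as follows, using java.util.PriorityQueue in place of the Bailey priority queue so the sketch stands alone; the class and field names are assumptions.

```java
import java.util.PriorityQueue;

// Minimal Huffman node; Comparable so the priority queue orders
// nodes by total frequency, smallest first.
class PNode implements Comparable<PNode> {
    String word;      // null for internal nodes
    int freq;
    PNode left, right;
    PNode(String w, int f) { word = w; freq = f; }
    PNode(PNode l, PNode r) { left = l; right = r; freq = l.freq + r.freq; }
    public int compareTo(PNode other) { return Integer.compare(freq, other.freq); }
}

class Builder {
    // Repeatedly merge the two lowest-frequency trees until one remains.
    static PNode build(PriorityQueue<PNode> pq) {
        while (pq.size() > 1) {
            PNode a = pq.poll();      // smallest total frequency
            PNode b = pq.poll();      // next smallest
            pq.add(new PNode(a, b));  // merged tree goes back in the queue
        }
        return pq.poll();             // the root of the Huffman tree
    }
}
```

Note how little code the loop itself needs once the ordering lives in compareTo and the queue does the searching.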
- Next, you'll need to be able to generate the Huffman codes
for individual words from the Huffman tree. Recall that the code
for, in our case, a word, is a series of 0's and 1's, where you
add to the code a 0 each time you take a left child and a 1 each
time you take a right child as you traverse the tree from the
root down to the leaf that represents the word you're encoding.
This is inconvenient for real processing, so it's probably a good
idea to create a table that associates words with their codes. A
more elaborate data structure like a binary search tree or hash
table could serve here, but since Huffman codes are intended for
cases where there are not too many distinct items to encode,
something simpler is reasonable for us. My lookup table is an
ArrayList of Association<String,String> where the key
of each association is one of the words we encoded, and the value
is its code. A great way to build this list is recursively, much
like the way the print method in the Bailey
example code does so (it's recursive, excellent!).
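A recursive build of such a table can be sketched like this; it uses java.util's SimpleEntry in place of Bailey's Association, and the node class is an illustrative stand-in for your own.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;

// Minimal node for illustration: leaves carry a word.
class CNode {
    String word;
    CNode left, right;
    CNode(String w) { word = w; }
    CNode(CNode l, CNode r) { left = l; right = r; }
}

class CodeTable {
    // Walk the tree, extending the code with 0 for each left branch and
    // 1 for each right branch; record the code on reaching a leaf.
    static void collect(CNode n, String code,
                        ArrayList<SimpleEntry<String, String>> table) {
        if (n.word != null) {
            table.add(new SimpleEntry<>(n.word, code));
        } else {
            collect(n.left, code + "0", table);
            collect(n.right, code + "1", table);
        }
    }
}
```

Starting the recursion with collect(root, "", table) fills the table with one (word, code) pair per leaf.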
- Now with a handy lookup table ready to map words to codes,
you can again traverse your input and output the code for each
word in the order it appeared in the original input. One wrinkle
here, though. Your input came in from a Scanner, and it's
gone now. So you'll need to remember it somewhere. Think about
an appropriate data structure to store this information. You'll
populate the data structure with the words as you're reading them
in (where you are building your WordCountList), and use
those values when you get to the point of creating the encoding
(i.e., this step). Congratulations! If you got this far
with well-documented code in a good style, you've earned over 3/4
of the credit for this lab. Save a copy of your code somewhere
safe just in case you break something. Maybe even turn it in.
You can always turn in a better version later.
- From here, you'll be adding support for storing the encoding
in a file and reading one of those files back in to recreate the
original text. So first enhance your program so it looks at its
args parameter to main to determine which of the modes
it should use. You already have code for the case when no
parameters are provided. Add conditionals so that your program
behaves as before when no command-line parameters are given, prints
a message like "encode to file" when the -e parameter and a
file name are given, and prints a message like "decode from file"
when the -d parameter and a file name are given. In all other
cases, print an error message and exit.
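The dispatch can be sketched as below; to keep the structure easy to see (and to test), the sketch returns a string describing the selected mode rather than performing it.

```java
// Sketch of the mode dispatch for HuffmanStrings.main;
// the returned strings are placeholders for the real behavior.
class HuffmanStringsSketch {
    static String mode(String[] args) {
        if (args.length == 0) {
            return "interactive";                    // read stdin, print codes
        } else if (args.length == 2 && args[0].equals("-e")) {
            return "encode to file " + args[1];      // write tree + encoding
        } else if (args.length == 2 && args[0].equals("-d")) {
            return "decode from file " + args[1];    // rebuild tree, decode
        } else {
            return "error";                          // bad usage: report and exit
        }
    }
}
```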
- Now, it's down to replacing those two printed messages with
the code that does those things. Since you can't really read and
decode a file until you have the ability to write the encoded
file, do the encoding first. Much of the code for this part is
the same as for the no-command-line-parameter mode. The difference
is that once you have constructed your tree, you need to write
that tree and the encoding to a file (see the "File Format"
section above).
- Last, you'll need to read an encoding file. This involves
re-creating your tree then traversing that tree to decode. In
this case, it is most convenient and efficient to have the actual
tree, and traverse it, going left when you see a 0, right when you
see a 1, and then when you get to a leaf, print the word, and jump
back to the root to decode the next word.
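That decoding loop can be sketched like this; the node class is an illustrative stand-in, and the bit string is assumed to contain only '0' and '1' characters.

```java
// Minimal node for illustration: leaves carry a word.
class DNode {
    String word;
    DNode left, right;
    DNode(String w) { word = w; }
    DNode(DNode l, DNode r) { left = l; right = r; }
}

class Decoder {
    // Walk the tree: 0 goes left, 1 goes right; at a leaf, emit the
    // word and jump back to the root for the next one.
    static String decode(DNode root, String bits) {
        StringBuilder out = new StringBuilder();
        DNode n = root;
        for (char c : bits.toCharArray()) {
            n = (c == '0') ? n.left : n.right;
            if (n.word != null) {          // reached a leaf
                if (out.length() > 0) out.append(' ');
                out.append(n.word);
                n = root;                  // restart for the next code
            }
        }
        return out.toString();
    }
}
```

Because Huffman codes are prefix-free, no code is a prefix of another, so this greedy walk always lands on the right leaf.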
Submitting
Your submission requires that all required deliverables are committed
and pushed to the master branch of your repository's origin on GitHub.
That's it! If you see everything you intend to submit when you visit
your repository's page on GitHub, you're set.
Grading
This assignment is worth 110 points, which are distributed as follows:
Feature | Value | Score |
WordCount class | 5 | |
WordCountList class | 8 | |
WordCountList main method | 5 | |
Huffman tree construction | 25 | |
Print list of codes for words | 10 | |
Generate encoding of text | 10 | |
Tree and encoding in file | 5 | |
Recreate tree from file | 10 | |
Decode and print original text from file | 10 | |
Program design | 4 | |
Program style | 3 | |
Program documentation | 15 | |
Total | 110 | |