Computer Science 210
Data Structures

Fall 2017, Siena College

Lab Bonus/10: Huffman Trees
Due: before the final exam

In this optional bonus lab, we will bring together several of the data structures we have studied and build a program to perform Huffman compression and decompression. There is a good amount of code to write here (my solution to the programming project is about 300 lines).

You may work alone or in a group of two or three on this lab. Only one submission per group is needed.

Getting Set Up

You will receive an email with the link to follow to set up your GitHub repository huffman-lab-yourgitname for this lab. One member of the group should follow the link to set up the repository on GitHub, then email the instructor with the other group members' GitHub usernames so they can be granted access. This will allow all members of the group to clone the repository and commit and push changes to the origin on GitHub.

Practice Program: Counting Word Frequencies

One of the steps in constructing a Huffman tree is to determine the frequency of the objects to be encoded (in our case, these will be words). Your task is to develop two classes, WordCount and WordCountList, to achieve this. The WordCount class is straightforward: it encapsulates a String (a word from the input) and an int (the number of times that word has been encountered in the input). WordCountList maintains a list of these word counts and has methods to update the list as words are processed from the input.
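
One reasonable shape for WordCount is sketched below; the field and method names here are only a suggestion, not a requirement.

    // A sketch of WordCount: pairs a word with a running count.
    public class WordCount {

        private String word;  // the word as read from the input
        private int count;    // number of times it has been seen so far

        public WordCount(String word) {
            this.word = word;
            count = 1;        // constructed the first time the word is seen
        }

        public String getWord() {
            return word;
        }

        public int getCount() {
            return count;
        }

        public void increment() {
            count++;
        }

        public String toString() {
            return word + ": " + count;
        }
    }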

Include a main method in WordCountList that reads in words from the standard input, converts them to lower case, and adds them to a WordCountList. After the input has been exhausted, it should print out all the words encountered and their frequencies.

Note that when typing values in the terminal window using the keyboard, you can indicate the end of input by typing a Ctrl-D.
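
A rough sketch of how WordCountList and its main method might fit together, assuming an ArrayList as the underlying list and an update method that processes one word at a time (again, these names are just one possibility):

    import java.util.ArrayList;
    import java.util.Scanner;

    // A sketch of WordCountList: maintains WordCount entries and updates
    // them as words are read from the input.
    public class WordCountList {

        private ArrayList<WordCount> counts = new ArrayList<WordCount>();

        // record one occurrence of word: increment its count if it has
        // been seen before, otherwise add a new entry for it
        public void update(String word) {
            for (WordCount wc : counts) {
                if (wc.getWord().equals(word)) {
                    wc.increment();
                    return;
                }
            }
            counts.add(new WordCount(word));
        }

        public ArrayList<WordCount> getCounts() {
            return counts;
        }

        // read words from standard input until end of input (Ctrl-D),
        // converting each to lower case, then print the frequencies
        public static void main(String[] args) {
            WordCountList list = new WordCountList();
            Scanner in = new Scanner(System.in);
            while (in.hasNext()) {
                list.update(in.next().toLowerCase());
            }
            for (WordCount wc : list.getCounts()) {
                System.out.println(wc);
            }
        }
    }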

Both of these classes will come in handy for the programming assignment below.

Huffman Trees

The remaining task will be graded as a programming assignment, so pay attention to style and documentation.

Our discussion in class and the example code in Bailey build Huffman trees that encode a string character by character, using bit strings computed to minimize the length of the encoding of that string. However, Huffman trees can be used to encode items of any kind of data. Encoding larger items only pays off when many of the items are the same, and hence can share common encodings. Here, we will encode a piece of text word by word.

For example, if we are presented the text

I like the same words over and over and over and over I mean the same
words over and over you see I like the same words over and over and
over again

One possible word-based Huffman encoding (ignoring case) would be:

Encoding of and is 00 (frequency was 6)
Encoding of i is 010 (frequency was 3)
Encoding of you is 01100 (frequency was 1)
Encoding of mean is 01101 (frequency was 1)
Encoding of again is 01110 (frequency was 1)
Encoding of see is 01111 (frequency was 1)
Encoding of over is 10 (frequency was 9)
Encoding of like is 1100 (frequency was 2)
Encoding of words is 1101 (frequency was 3)
Encoding of same is 1110 (frequency was 3)
Encoding of the is 1111 (frequency was 3)

which results in the encoded text:

0101100111111101101100010001000100100110111111110110
110001001100011110101100111111101101100010001001110
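
As a quick check, the length of the encoding is the sum over all of the words of frequency times code length: 6×2 + 9×2 + 3×3 + 2×4 + 3×4 + 3×4 + 3×4 + 1×5 + 1×5 + 1×5 + 1×5 = 103 bits, which matches the 103 binary digits shown above.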

Program Requirements

Your program should be run from a main method in a class named HuffmanStrings. It should have three modes of operation, selected based on the command-line parameters provided; a sketch of the argument handling appears after the list of modes below.

  1. When no command-line parameters are provided, the program reads words from the standard input (the keyboard, unless input is redirected from a file), constructs a Huffman tree, and prints out the encodings it has computed for the words found on the input.

    Here is an example run (from the command line) to use this mode to compute and display the encoding for the text in a file samewords.txt:

    java HuffmanStrings < samewords.txt
    
  2. When two command-line parameters are provided, -e followed by the name of a file, the program reads words from the standard input (again the keyboard, unless input is redirected from a file), constructs a Huffman tree for the words found on the input, and then stores a representation of the tree followed by the actual encoding of the input in the file given as the second command-line parameter.

    Here is an example run (from the command line) to use this mode to compute the encoding for the text in a file samewords.txt and store the Huffman tree and the encoded data in the file samewords.hc:

    java HuffmanStrings -e samewords.hc < samewords.txt
    

    The contents of samewords.hc:

    6 and
    3 i
    1 you
    1 mean
    0
    1 again
    1 see
    0
    0
    0
    0
    9 over
    2 like
    3 words
    0
    3 same
    3 the
    0
    0
    0
    0
    -1
    0101100111111101101100010001000100100110111111110110
    110001001100011110101100111111101101100010001001110
    

    (though the encoded bit string is all on one line in the file, not wrapped as shown above).

    The file format is described below.

  3. When two command-line parameters are provided, -d followed by a file name, the program reads the encoded file, uses its contents to recreate the Huffman tree, then uses the tree to decode the original words, and prints those to the screen.

    Here is an example run (from the command line) to use this mode to decode samewords.hc:

    java HuffmanStrings -d samewords.hc
    

    The original text from the start of this section is printed.
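
To tie the three modes together, the main method mostly just needs to examine the command-line parameters and dispatch to the appropriate mode. A possible skeleton is below; the three helper methods are stubs standing in for whatever decomposition you choose.

    public class HuffmanStrings {

        private static void printEncodings() { /* mode 1 */ }

        private static void encodeToFile(String filename) { /* mode 2 */ }

        private static void decodeFromFile(String filename) { /* mode 3 */ }

        public static void main(String[] args) {
            if (args.length == 0) {
                // no parameters: read words, build the tree, print the codes
                printEncodings();
            } else if (args.length == 2 && args[0].equals("-e")) {
                // -e filename: build the tree, write tree and encoding to the file
                encodeToFile(args[1]);
            } else if (args.length == 2 && args[0].equals("-d")) {
                // -d filename: rebuild the tree from the file, decode, print the text
                decodeFromFile(args[1]);
            } else {
                System.err.println("Usage: java HuffmanStrings [-e file | -d file]");
                System.exit(1);
            }
        }
    }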

File Format

You may design your own file format, or you may use the one from the example above. The format shown above is quite simple, and takes advantage of the fact that Huffman trees are always structured so that every internal node has both a left and a right child, while leaf nodes have no children; there is never a node with only a left or only a right child. This allows us to generate the above file with a simple post-order traversal, and to reconstruct the tree from the file with an equally straightforward process.

To generate the file, perform a post-order traversal of the Huffman tree. For a leaf node, output a line containing the frequency of the word represented by that leaf, followed by the word itself. For an internal node, output a line containing just a 0 (after both of its children have been written). After the traversal, a -1 is placed in the file to separate the tree from the encoded text.
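
For example, assuming a simple node class along the following lines (the class and all of its names are hypothetical, one possibility among many), the writing traversal might look like this (as a method of HuffmanStrings, say):

    // A hypothetical node class for the word-based Huffman tree.
    class HuffmanNode {
        String word;             // null for internal nodes
        int frequency;           // word frequency, or sum of the children's
        HuffmanNode left, right;

        HuffmanNode(int frequency, String word) {     // leaf
            this.frequency = frequency;
            this.word = word;
        }

        HuffmanNode(HuffmanNode left, HuffmanNode right) {   // internal node
            this.left = left;
            this.right = right;
            frequency = left.frequency + right.frequency;
        }
    }

    // post-order traversal that writes the tree in the format shown above
    static void writeTree(HuffmanNode node, java.io.PrintWriter out) {
        if (node.left == null) {
            // leaf: frequency, a space, and the word
            out.println(node.frequency + " " + node.word);
        } else {
            // internal node: both children first, then a line with just a 0
            writeTree(node.left, out);
            writeTree(node.right, out);
            out.println(0);
        }
    }

The caller would then write the -1 separator and the encoded bit string once the traversal returns.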

To re-create the tree from the file, a process similar to the one used to build the original Huffman tree is used. Each line in the tree section of the file is either a 0 or a positive number followed by a word. When a number and word are encountered, a leaf node is created and remembered in a list. When a 0 is encountered, the two most recently created trees become the right and left children of a new tree node (the most recent is the right child), and the new node replaces them in the list. When the -1 is encountered, marking the end of this part of the file, there should be exactly one tree left in the list, and that is the original Huffman tree!
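
Since the most recently created tree is always the one needed next, a stack is a natural way to remember the partial trees. Here is a sketch, reusing the hypothetical HuffmanNode from the previous sketch and a java.util.ArrayDeque as the stack:

    // Rebuild the Huffman tree from the tree section of the file.
    static HuffmanNode readTree(java.util.Scanner in) {
        java.util.Deque<HuffmanNode> trees = new java.util.ArrayDeque<HuffmanNode>();
        while (true) {
            int number = in.nextInt();
            if (number == -1) {
                break;                    // end of the tree section
            } else if (number == 0) {
                // internal node: the most recently built tree becomes the
                // right child, the one built before it the left child
                HuffmanNode right = trees.pop();
                HuffmanNode left = trees.pop();
                trees.push(new HuffmanNode(left, right));
            } else {
                // leaf: the number is the frequency, followed by the word
                trees.push(new HuffmanNode(number, in.next()));
            }
        }
        return trees.pop();               // the single remaining tree
    }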

At that point, the tree is ready to be used to decode the pattern of 0's and 1's back to the words they represent.
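
The decoding walk itself is short: start at the root, follow the left child on a 0 and the right child on a 1 (the convention consistent with the example above), and output a word each time a leaf is reached. A sketch, again using the hypothetical HuffmanNode:

    // Decode a string of '0' and '1' characters back into words.
    static void decode(HuffmanNode root, String bits) {
        HuffmanNode current = root;
        for (int i = 0; i < bits.length(); i++) {
            current = (bits.charAt(i) == '0') ? current.left : current.right;
            if (current.left == null) {
                // reached a leaf: output its word and restart at the root
                System.out.print(current.word + " ");
                current = root;
            }
        }
        System.out.println();
    }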

Hints, Suggestions, Further Details

This section will expand as questions are asked and answered.

Submitting

Your submission requires that all required deliverables are committed and pushed to the master branch of your repository's origin on GitHub. That's it! If you see everything you intend to submit when you visit your repository's page on GitHub, you're set.

Grading

This assignment is worth 110 points, which are distributed as follows:

Feature                                     Value   Score
WordCount class                                 5
WordCountList class                             8
WordCountList main method                       5
Huffman tree construction                      25
Print list of codes for words                  10
Generate encoding of text                      10
Tree and encoding in file                       5
Recreate tree from file                        10
Decode and print original text from file       10
Program design                                  4
Program style                                   3
Program documentation                          15
Total                                         110