Computer Science 210
Data Structures
Fall 2017, Siena College
Lab Bonus/10: Huffman Trees
Due: before the final exam
In this optional bonus lab, we will bring together several of the data
structures we have studied and build a program to perform Huffman
compression and decompression. There is a good amount of code to
write here (my solution to the programming project is about 300
lines).
You may work alone or in a group of two or three on this lab. Only
one submission per group is needed.
Getting Set Up
You will receive an email with the
link to follow to set up your GitHub repository
huffman-lab-yourgitname for this lab. One member of the
group should follow the link to set up the repository on GitHub,
then that person should email the instructor with the other group
members' GitHub usernames so they can be granted access. This will
allow all members of the group to clone the repository and commit
and push changes to the origin on GitHub.
Practice Program: Counting Word Frequencies
One of the steps in constructing a Huffman tree is to determine the
frequency of the objects to be encoded (in our case, these will be
words). Your task is to develop two classes: WordCount and
WordCountList to achieve this. The WordCount class is
straightforward: it encapsulates a String (a word from the
input) and an int (the number of times that word has been
encountered on the input). WordCountList maintains a
list of words, and has methods to update the list as words are
processed from the input.
Include a main method in WordCountList that reads in words
from the standard input, converts them to lower case, and adds them to
a WordCountList. After the input has been exhausted, it should
print out all the words encountered and their frequencies.
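One possible shape for these two classes is sketched below, using java.util rather than the course's structure5 library so the sketch stands alone; the method names beyond the two required class names are assumptions, not requirements.

```java
import java.util.ArrayList;
import java.util.Scanner;

// Pairs a word with the number of times it has been seen on the input.
class WordCount {
    private final String word;
    private int count;

    public WordCount(String word, int count) {
        this.word = word;
        this.count = count;
    }

    public String getWord() { return word; }
    public int getCount() { return count; }
    public void increment() { count++; }

    public String toString() { return word + ": " + count; }
}

// Maintains a list of WordCounts, one entry per distinct word seen.
class WordCountList {
    private final ArrayList<WordCount> counts = new ArrayList<>();

    // Record one occurrence of word, adding a new entry if it is new.
    public void add(String word) {
        for (WordCount wc : counts) {
            if (wc.getWord().equals(word)) {
                wc.increment();
                return;
            }
        }
        counts.add(new WordCount(word, 1));
    }

    public ArrayList<WordCount> getCounts() { return counts; }

    // Read words from standard input until EOF (Ctrl-D), then report.
    public static void main(String[] args) {
        WordCountList list = new WordCountList();
        Scanner in = new Scanner(System.in);
        while (in.hasNext()) {
            list.add(in.next().toLowerCase());
        }
        for (WordCount wc : list.getCounts()) {
            System.out.println(wc);
        }
    }
}
```

A linear scan through an ArrayList is fine at this scale; the point of the practice program is the interface, not the lookup speed.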
Note that when typing values in the terminal window using the
keyboard, you can indicate the end of input by typing a Ctrl-D.
Both of these classes will come in handy for the programming
assignment below.
Huffman Trees
The remaining task will be graded as a programming assignment, so pay
attention to style and documentation.
Our discussion in class and the example code in Bailey build
Huffman trees that encode a string character by character using
bit strings that are computed to minimize the length of the encoding
for that string. However, Huffman trees can be used to encode items
of any kind of data. The encoding only pays off when many of the
items are the same, and hence can share common encodings. Here, we
will encode a
piece of text by words.
For example, suppose we are presented with the text
I like the same words over and over and over and over I mean the same
words over and over you see I like the same words over and over and
over again
One possible word-based Huffman encoding (ignoring case) would be:
Encoding of and is 00 (frequency was 6)
Encoding of i is 010 (frequency was 3)
Encoding of you is 01100 (frequency was 1)
Encoding of mean is 01101 (frequency was 1)
Encoding of again is 01110 (frequency was 1)
Encoding of see is 01111 (frequency was 1)
Encoding of over is 10 (frequency was 9)
Encoding of like is 1100 (frequency was 2)
Encoding of words is 1101 (frequency was 3)
Encoding of same is 1110 (frequency was 3)
Encoding of the is 1111 (frequency was 3)
which results in the encoded text:
0101100111111101101100010001000100100110111111110110
110001001100011110101100111111101101100010001001110
Program Requirements
Your program should be run from a main method in a class named
HuffmanStrings. It should have three modes of operation, selected
based on the command-line parameters provided.
- When no command-line parameters are provided, the program reads
words from the standard input (the keyboard, unless input is
redirected from a file), constructs a Huffman tree, and prints out
the encodings it has computed for the words found on the input.
Here is an example run (from the command line) to use this mode to
compute and display the encoding for the text in a file
samewords.txt:
java HuffmanStrings < samewords.txt
- When two command-line parameters are provided, -e followed
by the name of a file, the program reads words from the standard
input (again the keyboard, unless input is redirected from a file),
constructs a Huffman tree for the words found on the input, and then
stores a representation of the tree followed by the actual encoding
of the input in the file given as the second command-line parameter.
Here is an example run (from the command line) to use this mode to
compute the encoding for the text in a file samewords.txt and
store the Huffman tree and the encoded data in the file
samewords.hc:
java HuffmanStrings -e samewords.hc < samewords.txt
The contents of samewords.hc:
6 and
3 i
1 you
1 mean
0
1 again
1 see
0
0
0
0
9 over
2 like
3 words
0
3 same
3 the
0
0
0
0
-1
0101100111111101101100010001000100100110111111110110
110001001100011110101100111111101101100010001001110
(though the code is all on one line, not wrapped as shown above).
The file format is described below.
- When two command-line parameters are provided, -d followed
by a file name, the program reads the encoded file, uses its
contents to recreate the Huffman tree, then uses the tree to decode
the original words, and prints those to the screen.
Here is an example run (from the command line) to use this mode to
decode samewords.hc:
java HuffmanStrings -d samewords.hc
The original text from the start of this section is printed.
File Format
You may design your own file format or you may use the one from the
example above. The format shown above is quite simple, and takes
advantage of the fact that Huffman trees are always structured such
that all internal nodes have both a left and a right child. Of
course, leaf nodes have no children. So there are no nodes with
either just a left or just a right child. This allows us to generate
the above file with a simple post-order traversal, and reconstruct the
tree from the file using a straightforward process.
To generate the file, perform a post-order traversal. For a leaf
node, output a line with the frequency of the word represented by that
leaf and the word. For an internal node, output a line with a 0.
After the traversal, a -1 is placed in the file to separate the tree
from the encoded text.
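The write step can be sketched as follows. The HNode class and its field names are illustrative stand-ins for whatever tree class you build, and the sketch appends to a StringBuilder rather than a file so that it is self-contained.

```java
// Minimal node for illustration: leaves carry a word, internal nodes don't.
class HNode {
    String word;      // null for internal nodes
    int freq;
    HNode left, right;
    HNode(String w, int f) { word = w; freq = f; }
    HNode(HNode l, HNode r) { left = l; right = r; freq = l.freq + r.freq; }
}

class TreeWriter {
    // Post-order traversal: a leaf contributes "freq word",
    // an internal node contributes "0" after both of its children.
    static void write(HNode n, StringBuilder out) {
        if (n.word != null) {
            out.append(n.freq).append(' ').append(n.word).append('\n');
        } else {
            write(n.left, out);
            write(n.right, out);
            out.append("0\n");
        }
    }

    // Produce the tree section of the file, ended by the -1 separator.
    static String fileFor(HNode root) {
        StringBuilder sb = new StringBuilder();
        write(root, sb);
        sb.append("-1\n");
        return sb.toString();
    }
}
```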
To re-create the tree from the file, a process similar to that used to
create the original Huffman tree is used. When a line in the file is
processed, it will either be a 0 or a positive number followed by a
word. When the latter is encountered, a leaf node is created and
remembered in a list. When a 0 is encountered, the two most recently
created tree nodes become the right and left children of a new tree
node. When the -1 is encountered to indicate the end of this part of
the file, there should be a single tree in the list, and that is the
original Huffman tree!
At that point, the tree is ready to be used to decode the pattern of
0's and 1's back to the words they represent.
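The reconstruction described above can be sketched with a stack, since the two most recently created nodes are always the ones combined; the RNode class and names are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal node for illustration (field names are assumptions).
class RNode {
    String word;
    int freq;
    RNode left, right;
    RNode(String w, int f) { word = w; freq = f; }
}

class TreeReader {
    // Rebuild the tree from the file's lines, stopping at the -1 marker.
    static RNode rebuild(String[] lines) {
        Deque<RNode> stack = new ArrayDeque<>();
        for (String line : lines) {
            if (line.equals("-1")) break;   // end of the tree section
            if (line.equals("0")) {         // internal node: combine the
                RNode n = new RNode(null, 0);
                n.right = stack.pop();      // most recent node is the right child
                n.left = stack.pop();       // next most recent is the left
                n.freq = n.left.freq + n.right.freq;
                stack.push(n);
            } else {                        // "freq word" line: make a leaf
                String[] parts = line.split(" ", 2);
                stack.push(new RNode(parts[1], Integer.parseInt(parts[0])));
            }
        }
        return stack.pop();                 // the single remaining tree
    }
}
```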
Hints, Suggestions, Further Details
This section will expand as questions are asked and answered.
- Be sure you understand everything in the topic notes on Huffman
trees and in Sections 12.8 and 13.2 of Bailey. The
sample code in those sections is not ideal as a starting point, but
it can help guide your design, especially the class that implements
the Huffman tree using the structure5.BinaryTree class.
- Develop with small files (here's another:
ferris.txt), but be sure to test on larger
files also.
- Other than the main method in WordCountList, all of
the code you developed for the practice program should be ready to
use for the programming assignment.
- This is a large program, so tackle one piece at a time. Comment before you code! If you can't describe what you're
intending a piece of code to do in English then how can you describe
it in Java? The proper choice of data structures will simplify
your code. Here's one way to approach it in a step-by-step
procedure:
- Write a main method in HuffmanStrings that
computes the word frequencies. Yes, you do already have code
that does this. Use it!
- Set up to build your Huffman tree. You'll need a class (I
call mine HuffmanTree) that represents nodes of the tree.
The example in Bailey is a good one: a
HuffmanTree object encapsulates a binary tree node whose
value is one of your WordCount objects (for leaf nodes,
while you probably want it to be null for interior nodes)
and which also has a field to store the total frequency of the
words represented by all descendants of the node. You'll need
one of these for each WordCount that will become the leaf
nodes of your Huffman tree.
- Recall that the Huffman tree construction procedure involves
finding the two nodes (in this case, HuffmanTree objects)
with the lowest total word frequency that have not yet been
added as the child of some other node. If you make your
HuffmanTree objects Comparable, with "high priority"
nodes being ones with a low total frequency, a priority queue is
exactly what you need. Again, you can use the Bailey
sample code (in the priority queue chapter) as a guide.
- The construction of your entire tree then becomes very
straightforward: keep taking the smallest two values out of the
priority queue, create a new tree node with these as children,
and add that node to the priority queue. When the priority queue
eventually has just a single value remaining, it is the root of
your Huffman tree! By properly structuring your data (you made
HuffmanTree a Comparable) and letting existing data
structures (a priority queue written by someone else) do the
work, a complicated process is achieved with surprisingly simple
code. Since BinaryTrees are not allowed to have null
values, you'll need a placeholder WordCount for the internal
nodes. Something like this would work:
new WordCount("", frequency)
where the variable frequency holds the total frequency at that node.
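The construction loop described above can be sketched as follows, using java.util.PriorityQueue in place of the Bailey priority queue so the sketch stands alone; the class and field names are assumptions.

```java
import java.util.PriorityQueue;

// Minimal Huffman node; Comparable so the priority queue orders
// nodes by total frequency, smallest first.
class PNode implements Comparable<PNode> {
    String word;      // null for internal nodes
    int freq;
    PNode left, right;
    PNode(String w, int f) { word = w; freq = f; }
    PNode(PNode l, PNode r) { left = l; right = r; freq = l.freq + r.freq; }
    public int compareTo(PNode other) { return Integer.compare(freq, other.freq); }
}

class Builder {
    // Repeatedly merge the two lowest-frequency trees until one remains.
    static PNode build(PriorityQueue<PNode> pq) {
        while (pq.size() > 1) {
            PNode a = pq.poll();      // smallest total frequency
            PNode b = pq.poll();      // next smallest
            pq.add(new PNode(a, b));  // merged tree goes back in the queue
        }
        return pq.poll();             // the root of the Huffman tree
    }
}
```

Note how little code the loop itself needs once the ordering lives in compareTo and the queue does the searching.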
- Next, you'll need to be able to generate the Huffman codes
for individual words from the Huffman tree. Recall that the code
for, in our case, a word, is a series of 0's and 1's, where you
add to the code a 0 each time you take a left child and a 1 each
time you take a right child as you traverse the tree from the
root down to the leaf that represents the word you're encoding.
This is inconvenient for real processing, so it's probably a good
idea to create a table that associates words with their codes. A
more elaborate data structure like a binary search tree or hash
table could serve here, but since Huffman codes are intended for
cases where there are not too many distinct items to encode,
something simpler is reasonable for us. My lookup table is an
ArrayList of Association<String,String> where the key
of each association is one of the words we encoded, and the value
is its code. A great way to build this list is recursively, much
like the way the print method in the Bailey
example code does so (it's recursive, excellent!).
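A recursive build of such a table can be sketched like this; it uses java.util's SimpleEntry in place of Bailey's Association, and the node class is an illustrative stand-in for your own.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;

// Minimal node for illustration: leaves carry a word.
class CNode {
    String word;
    CNode left, right;
    CNode(String w) { word = w; }
    CNode(CNode l, CNode r) { left = l; right = r; }
}

class CodeTable {
    // Walk the tree, extending the code with 0 for each left branch and
    // 1 for each right branch; record the code on reaching a leaf.
    static void collect(CNode n, String code,
                        ArrayList<SimpleEntry<String, String>> table) {
        if (n.word != null) {
            table.add(new SimpleEntry<>(n.word, code));
        } else {
            collect(n.left, code + "0", table);
            collect(n.right, code + "1", table);
        }
    }
}
```

Starting the recursion with collect(root, "", table) fills the table with one (word, code) pair per leaf.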
- Now with a handy lookup table ready to map words to codes,
you can again traverse your input and output the code for each
word in the order it appeared in the original input. One wrinkle
here, though. Your input came in from a Scanner, and it's
gone now. So you'll need to remember it somewhere. Think about
an appropriate data structure to store this information. You'll
populate the data structure with the words as you're reading them
in (where you are building your WordCountList), and use
those values when you get to the point of creating the encoding
(i.e., this step). Congratulations! If you got this far
with well-documented code in a good style, you've earned over 3/4
of the credit for this lab. Save a copy of your code somewhere
safe just in case you break something. Maybe even turn it in.
You can always turn in a better version later.
- From here, you'll be adding support for storing the encoding
in a file and reading one of those files back in to recreate the
original text. So first enhance your program so it looks at its
args parameter to main to determine which of the modes
it should use. You already have code for the case when no
parameters are provided. Add conditionals so that your program
behaves as before when no command-line parameters are given, prints
a message like "encode to file" when the -e parameter and a
file name are given, and prints a message like "decode from file"
when the -d parameter and a file name are given. In all other
cases, print an error message and exit.
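The dispatch can be sketched as below; to keep the structure easy to see (and to test), the sketch returns a string describing the selected mode rather than performing it.

```java
// Sketch of the mode dispatch for HuffmanStrings.main;
// the returned strings are placeholders for the real behavior.
class HuffmanStringsSketch {
    static String mode(String[] args) {
        if (args.length == 0) {
            return "interactive";                    // read stdin, print codes
        } else if (args.length == 2 && args[0].equals("-e")) {
            return "encode to file " + args[1];      // write tree + encoding
        } else if (args.length == 2 && args[0].equals("-d")) {
            return "decode from file " + args[1];    // rebuild tree, decode
        } else {
            return "error";                          // bad usage: report and exit
        }
    }
}
```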
- Now, it's down to replacing those two printed messages with
the code that does those things. Since you can't really read and
decode a file until you have the ability to write the encoded
file, do the encoding first. Much of the code for this part is
the same as for the no-command-line-parameter mode. The difference
is that once you have constructed your tree, you need to write
that tree and the encoding to a file (see the "File Format"
section above).
- Last, you'll need to read an encoding file. This involves
re-creating your tree then traversing that tree to decode. In
this case, it is most convenient and efficient to have the actual
tree, and traverse it, going left when you see a 0, right when you
see a 1, and then when you get to a leaf, print the word, and jump
back to the root to decode the next word.
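That decoding loop can be sketched like this; the node class is an illustrative stand-in, and the bit string is assumed to contain only '0' and '1' characters.

```java
// Minimal node for illustration: leaves carry a word.
class DNode {
    String word;
    DNode left, right;
    DNode(String w) { word = w; }
    DNode(DNode l, DNode r) { left = l; right = r; }
}

class Decoder {
    // Walk the tree: 0 goes left, 1 goes right; at a leaf, emit the
    // word and jump back to the root for the next one.
    static String decode(DNode root, String bits) {
        StringBuilder out = new StringBuilder();
        DNode n = root;
        for (char c : bits.toCharArray()) {
            n = (c == '0') ? n.left : n.right;
            if (n.word != null) {          // reached a leaf
                if (out.length() > 0) out.append(' ');
                out.append(n.word);
                n = root;                  // restart for the next code
            }
        }
        return out.toString();
    }
}
```

Because Huffman codes are prefix-free, no code is a prefix of another, so this greedy walk always lands on the right leaf.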
Submitting
Your submission requires that all required deliverables are committed
and pushed to the master branch of your repository's origin on GitHub.
That's it! If you see everything you intend to submit when you visit
your repository's page on GitHub, you're set.
Grading
This assignment is worth 110 points, which are distributed as follows:
Feature | Value | Score |
WordCount class | 5 | |
WordCountList class | 8 | |
WordCountList main method | 5 | |
Huffman tree construction | 25 | |
Print list of codes for words | 10 | |
Generate encoding of text | 10 | |
Tree and encoding in file | 5 | |
Recreate tree from file | 10 | |
Decode and print original text from file | 10 | |
Program design | 4 | |
Program style | 3 | |
Program documentation | 15 | |
Total | 110 | |