Computer Science 210
Data Structures
Fall 2016, Siena College
In this, our last lab, we will bring together several of the data structures we have studied and build a program to perform Huffman compression and decompression. There is a good amount of code to write here (my solution to the programming project is about 300 lines).
You may work alone or in a group of two or three on this lab. Only one submission per group is needed.
Getting Set Up
To get your BlueJ environment set up for this week's lab assignment, start BlueJ and choose "New Project" from the "Project" menu. Navigate to your folder for this course and choose the name "Lab10" (no spaces) for the project.
Create a document where you will record your answers to the lab questions. If you use plain text, call it "lab10.txt". If it's a Word document, you can call it whatever you'd like, but when you submit, be sure you convert it to a PDF document "lab10.pdf" before you submit it.
Practice Program: Counting Word Frequencies
One of the steps in constructing a Huffman tree is to determine the frequency of the objects to be encoded (in our case, these will be words). Your task is to develop two classes: WordCount and WordCountList to achieve this. The WordCount class is straightforward, it is a custom class that encapsulates a String, a word from the input, and an int, the number of times that word has been encountered on the input. WordCountList maintains a list of words, and has methods to update the list as words are processed from the input.
Include a main method in WordCountList that reads in words from the standard input, converts them to lower case, and adds them to a WordCountList. After the input has been exhausted, it should print out all the words encountered and their frequencies.
Note that when typing values in the terminal window using the keyboard, you can indicate the end of input by typing a Ctrl-D.
Both of these classes will come in handy for the programming assignment below.
Huffman Trees
The remaining task will be graded as a programming assignment, so pay attention to style and documentation.
Our discussion in class and the example code in Bailey builds Huffman trees that encode a string character by character using bit strings that are computed to minimize the length of the encoding for that string. However, Huffman trees can be used to encode items of any kind of data. It only makes sense if many of the items are the same, and hence can share common encodings. Here, we will encode a piece of text by words.
For example, if we are presented the text
I like the same words over and over and over and over I mean the same words over and over you see I like the same words over and over and over again
One possible word-based Huffman encoding (ignoring case) would be:
Encoding of and is 00 (frequency was 6) Encoding of i is 010 (frequency was 3) Encoding of you is 01100 (frequency was 1) Encoding of mean is 01101 (frequency was 1) Encoding of again is 01110 (frequency was 1) Encoding of see is 01111 (frequency was 1) Encoding of over is 10 (frequency was 9) Encoding of like is 1100 (frequency was 2) Encoding of words is 1101 (frequency was 3) Encoding of same is 1110 (frequency was 3) Encoding of the is 1111 (frequency was 3)
which results in the encoded text:
0101100111111101101100010001000100100110111111110110 110001001100011110101100111111101101100010001001110
Program Requirements
Your program should be run from a main method in a class named HuffmanStrings. It should have three modes of operation, selected based on the command-line parameters provided.
Here is an example run (from the command line) to use this mode to compute and display the encoding for the text in a file samewords.txt:
java HuffmanStrings < samewords.txt
Here is an example run (from the command line) to use this mode to compute the encoding for the text in a file samewords.txt and store the Huffman tree and the encoded data in the file samewords.hc:
java HuffmanStrings -e samewords.hc < samewords.txt
The contents of samewords.hc:
6 and 3 i 1 you 1 mean 0 1 again 1 see 0 0 0 0 9 over 2 like 3 words 0 3 same 3 the 0 0 0 0 -1 0101100111111101101100010001000100100110111111110110 110001001100011110101100111111101101100010001001110
(though the code is all on one line, not wrapped as shown above).
The file format is described below.
Here is an example run (from the command line) to use this mode to decode samewords.hc:
java HuffmanStrings -d samewords.hc
The original text from the start of this section is printed.
File Format
You may design your own file format or you may use the one from the example above. The format shown above is quite simple, and takes advantage of the fact that Huffman trees are always structured such that all internal nodes have both a left and a right child. Of course, leaf nodes have no children. So there are no nodes with either just a left or just a right child. This allows us to generate the above file with a simple post-order traversal, and reconstruct the tree from the file using a straightforward process.
To generate the file, perform a post-order traversal. For a leaf node, output a line with the frequency of the word represented by that leaf and the word. For an internal node, output a line with a 0. After the traversal, a -1 is placed in the file to separate the tree from the encoded text.
To re-create the tree from the file, a process similar to that used to create the original Huffman tree is used. When a line in the file is processed, it will either be a 0 or a positive number followed by a word. When the latter is encountered, a leaf node is created and remembered in a list. When a 0 is encountered, the two most recently created tree nodes become the right and left children of a new tree node. When the -1 is encountered to indicate the end of this part of the file, there should be a single tree in the list, and that is the original Huffman tree!
At that point, the tree is ready to be used to decode the pattern of 0's and 1's back to the words they represent.
Hints, Suggestions, Further Details
This section will expand as questions are asked and answered.
new WordCount("", frequency)
If the total frequency at that node is represented by the variable frequency.
Submitting
Before 4:30 PM, Monday, December 12, 2016, submit your lab for grading. There are two things to do to complete the submission: (i) Copy your file with the answers to the lab questions into your project directory. Be sure to use the correct file name. If you prepared your answers in Word, export to a PDF file and submit that. and (ii) upload a copy of your lab (a .7z or .zip file containing your project directory) to Blackboard under "Lab10". Don't forget to check your programming assignment programs for compliance with the Style Guide for CSIS 210 Programs
Grading
This assignment is worth 110 points, which are distributed as follows:
> Feature | Value | Score |
WordCount class | 5 | |
WordCountList class | 8 | |
WordCountList main method | 5 | |
Huffman tree construction | 25 | |
Print list of codes for words | 10 | |
Generate encoding of text | 10 | |
Tree and encoding in file | 5 | |
Recreate tree from file | 10 | |
Decode and print original text from file | 10 | |
Program design | 4 | |
Program style | 3 | |
Program documentation | 15 | |
Total | 110 | |