Computer Science 136 |
Lab 3: Word Generator
Due: Monday, October 3, 2005 at 9:00 AM
Take a look at the following questions from the book (SC refers to the Self Check Problems):
SC 3.2, SC 3.3, SC 3.4, 3.1, 4.1, 4.5.
You do not need to submit anything for these practice problems.
Short Answers
Complete the following problems from the book and turn them as a file lab3.txt at the start of lab.
3.6 (don't write the class-- just answer what the advantage would be), 4.2, 4.4, 4.10.
Lab Description
This week, you will be developing an implementation of the word generator program described below. You should come to lab with a written design of the Table and FrequencyList classes.
This week's lab can be thought of as an experiment in Artificial Intelligence. The idea is to write a program which will read in text and then use that text to generate some new text. The method for generating the text uses simple probability - we read the text character by character and keep track of how often each three-character sequence appears. From this, we can compute the probability that a certain character will immediately follow two given characters. For example, if the text is: "the theatre is their thing", e appears after th 3 times, and i appears after th 1 time; no other letters appear after th. So the probability that e follows th is .75; the probability that i follows th is .25; the probability that any other letter follows th is 0.
Once we have the text processed, and stored in a structure that allows us to check probabilities, we then pick two letters (for example, the first two in the input text) to use as a beginning for our new text. Then we choose subsequent characters based on the preceding two characters and the probability information.
During this lab, you will become proficient in using existing structures to build more complex ones, you will learn the importance of careful class design, and you will learn about the Vector and Association classes in the structure package.
You should think about the design of this program carefully before entering the lab. What would constitute a good data structure for this problem? The table of information should support requests of the form:
A three-dimensional array might seem reasonable at first, but its size would be quite large (approximately 27,000 entries if you include blanks and punctuation). Instead, develop a Table class which is implemented as a Vector of Associations. Use generic Vectors, where you declare and construct a Vector that holds a specific type of Association. Each Association would have a 2-character pair (stored as a String) as its key, along with a value which is a frequency list. The frequency list would keep track of which characters appeared after the given 2-character pair, along with a frequency. Again, use a generic Association, which is constructed to have a String as a key and a FrequencyList as a value.
Each frequency list should be an object of another class, FrequencyList, that you will define. There are many ways to implement the frequency list. A good possibility is ...another Vector of Associations. Thus the frequency list's Vector would consist of Associations in which the key is a single character (stored either as a Character or a String) and the value is a count of the number of times that letter occurred after the pair with which the list is associated (stored as an Integer). Think carefully about what methods the frequency list needs to support and any other instance variables that might be useful. Again, use appropriate instantiations of generic Vectors and Associations.
The data structure design built from these two classes has the benefit of having only as many entries as necessary for the given input text.
You will find it helpful to look carefully at the word frequency program on page 48 of Bailey.
Your main method for the program should be written in a third class, WordGen, which reads the input text, builds the table, and prints out a randomly-generated string based on the character sequence probabilities from the input text.
Warning: When you import the package with the classes Random and Scanner, use
import java.util.Random; import java.util.Scanner;
If you write "import java.util.*;", the program will get confused as to which version of the Vector class it should use as there is one in java.util as well as one in structure.
To ensure that your program can find the structure package for the implementations of Vector and Association, you will need to add this to the top of your Java files:
import structure.*;
First debug your program using as input a String constant (e.g., "the theater is their thing") until it works properly. After it works, you can take input from the keyboard using the Scanner class.
To use the Scanner, you will want to build up a string of the entire input line by line, using the methods hasNextLine() to find out if there is another line of input ready, and if there is, to read it with the method nextLine().
End of the input is signalled for Java on the Mac (and, indeed, on any Unix system) by typing "Control-D" on a new line.
You may also read your input from any text file, e.g. "whosonfirst.txt", with this command:
java WordGen < whosonfirst.txt
This is not specific to Java, in fact this "input redirection" can be used with any program running on Unix system.
After the input has been processed you should generate and print out new text using the frequencies in the table. You may start with a fixed pair of letters that appears in the table or choose starting characters randomly. Generate and print randomly-generated text up to about 2000 letters (about a screenful in a terminal window) so that we can see how your program works.
Package your three Java source files into a "tar file" lab3.tar, and submit using the turnin utility.
As in all labs, you will be graded on design, documentation, style, and correctness. Be sure to document your program with appropriate Javadoc comments, including a general description at the top of the file, a description of each method with pre- and post-conditions where appropriate. Also use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it.