Computer Science 211
Data Structures

Mount Holyoke College
Fall 2009

Lab 2: Word Generator
Due: 9:00 AM, Friday, October 2, 2009

Your task is to design and implement a word generator program. You should come to our Friday meeting with a written design of the Table and FrequencyList classes described below. Draw pictures!

Our goals for the program are to become proficient in using existing structures to build more complex ones, to learn the importance of careful class design, and to gain more experience using the Vector and Association classes.

Problem Description

This assignment comes from the world of artificial intelligence. The idea is to read a collection of text and then use some properties of that text to generate some new, "machine generated" text. The results can make interesting reading, but you probably don't want to use this to generate your next term paper.

We "train" our program in how to generate words based on an input text. To do this, we read the text character by character and keep track of how often each three-character sequence appears. From this, we can compute the probability that a certain character will immediately follow two given characters. For example, if the text is: the theater is their thing, e appears after th 3 times, and i appears after th once, and no other letters appear after th. So the probability that e follows th is .75; the probability that i follows th is .25; the probability that any other letter follows th is 0.

To generate our output text, we pick two letters (for example, the first two in the input text, or two random characters) to use as a beginning for our new text. Subsequent characters for our text are chosen based on the preceding two characters and the probability information gathered from the input text.

For example, consider the theater is their thing as the input text, and choosing th as the starting string. With a probability of .75, we will choose e as the next letter. So our text so far is the. Now, what are the possibilities to follow he? Well, we can choose a space, an a, or an i with equal probability. If we choose a, our text is thea and we continue the process, choosing among the possible letters to follow ea.

The program stops when we encounter a two-character sequence for which there is no subsequent character (which will occur only if the last two characters of the input occur only in that location) or when we have generated a total number of characters as specified as a parameter to our program (my program defaults to 2000 characters).

Design

The primary functionality required of your table will be:

to update the probabilities in your table, given a new triple of characters, and
to select a next character, given a pair of characters and the probabilities stored in your table.

A good choice of a data structure is essential to make it reasonably easy and efficient to support these operations. A three-dimensional array might seem reasonable at first, but its size would be quite large (approximately 27,000 entries if you include blanks and punctuation) and many if not most of the entries would be empty.

Instead, develop a Table class which stores mappings between 2-character sequences with possible subsequent characters in a Vector of Associations. Each Association should have a 2-character pair (stored as a String) as its key, and a frequency list as its value.

Each frequency list should be an object of another class, FrequencyList, that you will define. It should keep track of each character that appeared after the given 2-character pair, along with the number of times it appeared. There are many ways to implement the frequency list. A good possibility is ...another Vector of Associations. In this case, the Associations have a key of a single character (stored either as a Character or a String, just be consistent) and a value which is the count of the number of times that letter occurred after the pair with which this list is associated (stored as an Integer). Think carefully about what methods the frequency list needs to support and which other instance variables might be useful.

Your main method for the program should be written in a third class, WordGen, which reads the input text, builds the table, and prints out a randomly-generated string based on the character sequence probabilities from the input text.

Input

You should read your input text from the keyboard using the Scanner class. To use the Scanner, you will want to build up a string of the entire input line by line, using the methods hasNextLine() to find out if there is another line of input ready, and if there is, to read it with the method nextLine(). Separate successive lines in your String with a space, which will allow the characters on one line to be followed by characters on the next.

The end of the input is signalled for Java on the Mac (and, indeed, on any Unix system) by typing "Control-D" on a new line.

You may also read your input from any text file, e.g. whosonfirst.txt, without any need to modify your program, with this command at the Terminal window:

java WordGen < whosonfirst.txt

This is not specific to Java, in fact this input redirection can be used with any program running on Unix system.

whosonfirst.txt is available in the shared area under labs/wordgen.

Another "interesting" input to try out is the Java source code for one of the classes you have implemented.

Output

After the input has been processed you should generate and print out new text using the frequencies in the table. You may start with a fixed pair of letters that appears in the table or choose starting characters randomly. As stated above, you should generate text until there is either no next character available or until you have produced a maximum number of characters that was specified as a command-line parameter to your program.

Notes

You might find it helpful to look carefully at the word frequency program on page 48 of Bailey.
Random numbers can be generated conveniently using java.util.Random. It is inefficient to construct new random number generators each time you need a random number, so try to construct, and reuse, only a few. Perhaps one in each class that needs random numbers.
When you import the classes Random and Scanner, use
```
import java.util.Random;
import java.util.Scanner;
```
If you write "import java.util.*;", the program will get confused as to which version of the Vector class it should use as there is one in java.util as well as one in structure.

To ensure that your program can find the structure package for the implementations of Vector and Association, you will need to add this to the top of your Java files:
```
import structure5.*;
```
When debugging the table construction, keep in mind that you can use the built-in toString methods in Vector and Association very easily, but they may not present the information in a convenient form. Consider writing custom toString methods for your FrequencyList and Table classes.
You may treat spacing and punctuation just like any other characters. You may also make your program case-sensitive.

Submission

Package your three Java source files into a "tar file" lab2.tar, and submit it as an email attachment to jteresco AT mtholyoke.edu.

As in all labs, you will be graded on design, documentation, style, and correctness. Be sure to document your program with appropriate Javadoc comments, including a general description at the top of the file, a description of each method with pre and postconditions where appropriate. Also use comments and descriptive variable names to clarify sections of the code which may not be clear to someone trying to understand it.

Grading Breakdown
Pre-lab design	3 points
Program design	3 points
Program style	3 points
Program documentation	6 points
Program correctess	15 points

Computer Science 211 Data Structures

Mount Holyoke College Fall 2009

Computer Science 211
Data Structures

Mount Holyoke College
Fall 2009