Computer Science 210
Data Structures

Fall 2018, Siena College

Problem Set 6
Due: 4:00 PM, Friday, November 30, 2018

You may work alone or in a group of 2 or 3 on this assignment. However, in order to make sure you learn the material and are well-prepared for the exams, you should work through non-coding problems on your own before discussing them with your partner, should you choose to work with someone. Also, be sure every programming task is a team effort. In particular, the "you do these and I'll do these" approach is sure to leave you unprepared for the exams.

All GitHub repositories must be created with all group members having write access and all group member names specified in the README.md file by 4:00 PM, Monday, November 26, 2018. This applies to those who choose to work alone as well!

There is a significant amount of work to be done here, and you are sure to have questions. It will be difficult if not impossible to complete the assignment if you wait until the last minute. A slow and steady approach will be much more effective.

Getting Set Up

You will receive an email with the link to follow to set up your GitHub repository for this problem set (ps6-yourgitname). One member of the group should follow the link to set up the repository on GitHub. Other group members will receive a subsequent email with a link to follow that will grant them the rights to clone the repository and commit and push changes to the origin on GitHub.

Syracuse Sequence

Practice Program: Write a Java application in the main method of SyracuseSeq.java that generates a Syracuse Sequence for a given value of n. This sequence by repeatedly applying the function:

f(n) = 3n+1 if n is odd, and f(n) = n/2 if n is even.

Your program should take a number as a command-line parameter and print the values in Syracuse sequence starting at that value until it reaches 1, and then print out the length of the sequence. (6 points)

Finding the "Best Of"

For this part of the problem set, you will be implementing an interesting ordered structure and using it in a number of contexts.

In a recent lab, you were asked to find the 10 most frequently occuring words in an input text. At that point, we had already created an ArrayList of objects that represented the words and their frequencies. To find the 10 most frequent, we placed those objects in an array, then sorted that array by decreasing frequency, and finally printed the first 10 values in the sorted array.

Question 1: Assuming the sort uses an efficient sorting algorithm like merge sort, what is the Big O complexity of the sorting and printing described above if the text contains n different words and you are finding the k most frequent words? (2 points)

In general, we are solving the problem where we are presented with n unordered values, and we wish to find the k "best" values, where "best" might mean smallest, largest, longest, any other metric we might define to compare the items. We assume that k << n.

Another option would be to search the values for the best one, then the second best one, then the third, and so on, and finally the kth best one.

Question 2: What is the Big O complexity for that process, again if we have n values and are looking to find the k best? (3 points)

Let's consider a different approach. We will add all of the values to a data structure (to be designed and implemented next) that tracks only the k best values that have been added to it so far. Once all n values have been added, the overall best k remain.

Question 3: Once k+1 values have been added to such a structure, is it reasonable to throw away the "worst" value seen so far? Explain briefly. (2 points)

Question 4: Suppose at least k values have been added to the structure, so it contains the k best so far. A new value is added to the structure, describe in English and specifically, what would need to be done to determine if the new value needs to be stored, and if it does, which old value should be removed. (5 points)

In the starter repository, you will find a class BestOf that includes a framework on which you will implement this data structure.

The structure is very restrictive: you can only construct one, add new values to it, query its size, and create an Iterator over its contents. Its constructor requires a size (the k value above) and a Comparator to use to define "best".

The instance variables include an array of the objects stored by the structure at any given time.

All remaining programming tasks will be graded as a programming assignment.

Your tasks to complete the structure are:

Notes:

Question 5: What is the Big O complexity of one call to your add method? (2 points)

Applying to the Syracuse Sequence

Complete the main method in class Syracuse so that it generates the Syracuse sequences for all starting values from 1 up to n, given as a command-line parameter. To be completely clear, you will be generating n sequences, one for each starting value from 1 to n.

Your program should track and report the following:

  1. The largest k (a second command-line parameter) intermediate values encountered at any point in the sequences. At the end it should report the k values, and for each, print out the value, the n that generated it, and the position in the sequence for n where the large value occurred.
  2. Find the k longest sequences. At the end it should report the length of the k longest sequences and the n that generated each.

You should maintain two BestOfs for this program, one responsible for each above the parts above. Should you wish to use an extra class to keep track of information for the Syracuse sequence generation (and that's a Good Idea), you might want to declare a non-public class inside of Syracuse.java that can be used only within that Java file (like the list nodes and iterator from SimpleLinkedList). You will also need to define "best" for each, either by creating appropriate Comparator objects or make your objects Comparable and using a NaturalComparator.

The intermediate values you will generate can get quite large, so you should store numbers in variables of type long and Long instead of int and Integer as you generate your sequences.

You will be generating some very long sequences, and a whole lot of them. There is no need to store the sequences in their entirety in any form! In particular there should be no ArrayLists or similar structures anywhere in your program. Generate each value in the sequence from the previous one, and add an appropriate entry to hold it to your first BestOf. And when you have finished generating each sequence, add an appropriate entry to hold it in your second BestOf.

Question 6: What are the top 25 intermediate values encountered at any point in the sequences up to n=10,000? Include the n that generated the value and the position in that n's sequence. (2 points)

Question 7: What are the lengths of the top 25 longest sequences for values of n up to 10,000? Include the value of n that generated each. (2 points)

Applying to METAL Graph Data

Next, you will apply this nifty new data structure to the graph data from the METAL project you first saw earlier this semester. Recall that graph files are linked from http://travelmapping.net/graphs/ and that you can see what this data looks like in map form by loading it into the METAL project's Highway Data Examiner, at /metal/hdx/.

Initially, you will work only with waypoint data and ignore the road segments. Start with your Waypoint class from earlier in the semester. Recall that a Waypoint represents the data for a single waypoint. It should include fields for the waypoint name and its latitude and longitude values, a constructor, accessors for the three components, and appropriate equals and toString methods. Also recall that the VertexSearch example has a main method that takes the name of a .tmg file to be loaded as a command-line parameter, and reads all of the waypoints from that file into an array of Waypoint objects. You do not need this program, but you might borrow code from it as you complete the tasks below.

Now let's use your working BestOf structure to compute some information about this mapping data.

What you need to do:

  1. Write a Comparator class or classes that compare Waypoint objects in the following ways: increasing latitude (northernmost), decreasing latitude (southernmost), increasing longitude (easternmost), decreasing longitude (westernmost), alphabetical order by waypoint name, order by length of waypoint name.
  2. Write a program WaypointBest that has a main method that takes 2 command-line parameters: a .tmg file name, and a number n, and reports the best n waypoints in the file using each of the 6 criteria from the previous item.

Note that your program never needs to store all of the Waypoint objects in any structure. Just add each Waypoint to each of your BestOfs.

A point of emphasis in this program is on error checking: if your program expects a command-line parameter that is not provided, your program should not crash with an exception. It should print a meaningful error message.

Question 8: Report the results of your program for one small graph (less than 100 waypoints), one medium graph (around 2000 waypoints), and one large graph (over 100,000 waypoints), giving the "top 10" best by each criterion. (5 points)

Submitting

Please answer questions that require a text response in the README.md file of your repository. Only one submission per group is needed.

Your submission requires that all required code deliverables are committed and pushed to the master for your repository's origin on GitHub. That's it! If you see everything you intend to submit when you visit your repository's page on GitHub, you're set.

Grading

This assignment is worth 100 points, which are distributed as follows:

> FeatureValueScore
SyracuseSeq.java correctness 6
Question 1 2
Question 2 3
Question 3 2
Question 4 5
BestOf add correctness 20
Question 5 2
Syracuse main method correctness 6
Syracuse objects and comparators correctness 7
Question 6 (top 25 numbers) 2
Question 7 (top 25 sequence lengths) 2
Waypoint Comparator(s) for 6 criteria 6
WaypointBest correctness 10
Question 8 5
Program design 4
Program style 3
Program documentation 12
Git commit frequency and message quality 3
Total 100