CS 211 — Homework Seven

Huffman Codes

Due: 05 April 2013, 11:59p

Worth: 90 points

The Task

This week you are going to implement a new data structure and the main Huffman tree. Next week we will finish this off with a full application for compressing and decompressing text files.

The Map

Next, we need a new data structure. We are going to build this in two pieces so that we can reuse the interface later. So, the first thing you should build is an interface called Map<K extends Comparable<K>,V>. The interface should look like this:

void put(K key, V value): This should load the value value into the map at the location associated with the key, replacing any preexisting value already associated with the key.
V get(K key): This should fetch the value V associated with the key key, or null, if no match can be found.
boolean containsKey(K key): This will report whether there is a value associated with the key key in the map.
V remove(K key): This method should remove the entry associated with key and return the associated value. Return null if the key is not found.
Set<K> keySet(): This method should return a Set of all available keys in the map. You may use the java.util.TreeSet class to return these. You can build these on the fly using a traversal, or collect them as new keys are added to the map.
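
As a rough sketch, the five methods listed above could be written down as the interface below. The Javadoc wording is only a paraphrase of the descriptions above, not required text, and this is one possible shape rather than the official starter code.

```java
import java.util.Set;

/**
 * Sketch of the map interface described above.
 * Keys must be Comparable so that a search tree can order them.
 */
interface Map<K extends Comparable<K>, V> {

    /** Stores value under key, replacing any value already stored there. */
    void put(K key, V value);

    /** Returns the value stored under key, or null if there is none. */
    V get(K key);

    /** Reports whether some value is stored under key. */
    boolean containsKey(K key);

    /** Removes the entry for key and returns its value, or null if absent. */
    V remove(K key);

    /** Returns the set of all keys currently in the map. */
    Set<K> keySet();
}
```

Note that declaring your own Map in the same package means you should not also import java.util.Map in files that use it.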

Of course, since this is just the interface, you will need to implement an actual instance so that you can use it. This instance should be called BSTMap<K,V>. Obviously, this will use a binary search tree to actually implement the methods described above. While one perfectly valid implementation choice would be to create a separate binary search tree class and just use it to implement the described methods, it can force you to jump through some hoops to make methods like get and containsKey work properly. So instead, you can build the binary search tree structure directly into your implementation of BSTMap, the way we built the PriorityQueue using an underlying heap data structure.

When constructing the underlying BST, you will need to create some form of node structure that holds both the key and the value (as well as the left, right, and parent pointers). You will also need to adjust the pseudocode that we walked through in class to work directly with the keys. It is important that these nodes are not visible outside of the class.
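
To give a feel for what building the tree directly into the map might look like, here is a partial sketch under a few assumptions. The class name BSTMapSketch and the helper find are made up for illustration, and only put, get, and containsKey are shown. Since remove and keySet are left out, it does not implement the full interface.

```java
/** Partial sketch: put, get, and containsKey only; remove and keySet are omitted. */
class BSTMapSketch<K extends Comparable<K>, V> {

    /** The node type is private, so it is never visible outside the map. */
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> left, right, parent;

        Node(K key, V value, Node<K, V> parent) {
            this.key = key;
            this.value = value;
            this.parent = parent;
        }
    }

    private Node<K, V> root;

    /** Hypothetical helper: walks down from the root and returns the node for key, or null. */
    private Node<K, V> find(K key) {
        Node<K, V> current = root;
        while (current != null) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) {
                return current;
            }
            current = (cmp < 0) ? current.left : current.right;
        }
        return null;
    }

    public void put(K key, V value) {
        if (root == null) {
            root = new Node<>(key, value, null);
            return;
        }
        Node<K, V> current = root;
        while (true) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) {
                current.value = value; // key already present: replace its value
                return;
            } else if (cmp < 0) {
                if (current.left == null) {
                    current.left = new Node<>(key, value, current);
                    return;
                }
                current = current.left;
            } else {
                if (current.right == null) {
                    current.right = new Node<>(key, value, current);
                    return;
                }
                current = current.right;
            }
        }
    }

    public V get(K key) {
        Node<K, V> node = find(key);
        return (node == null) ? null : node.value;
    }

    public boolean containsKey(K key) {
        return find(key) != null;
    }
}
```

In this sketch containsKey checks for the node rather than testing get against null, so a key that was deliberately mapped to null still counts as present.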

The Huffman tree

Once you have built your map, you will have enough tools to actually build the Huffman tree. The tree should be in a class called HuffmanTree. The interface should look like this:

public static HuffmanTree newTreeFromFile(File file) throws FileNotFoundException: This is a static factory method that creates a new Huffman tree. In essence, this just creates the new object and calls buildFromFile on it. This should throw a FileNotFoundException when appropriate.
private void buildFromFile(File file) throws FileNotFoundException: This reads in an unencoded file and builds the Huffman tree based on the character frequencies. This should throw a FileNotFoundException when appropriate.
public String encode(String text) throws ParseException: This method takes in some plain text and returns the encoded version as a String of 1s and 0s. If an encoding cannot be found for a particular character, this should throw a ParseException.
public String decode(String text) throws ParseException: This method takes in a String of 1s and 0s and returns a decoded raw text version. If this encounters a character that is not a 1 or a 0, it should throw a ParseException. If there are spare 1s and 0s at the end (i.e., they do not decode to a character), ignore them and return the decoded result for the rest of the String.

For this assignment I am introducing you to a new Java design pattern called the "factory method". Factory methods are methods (usually static) that return new instances of a class. We typically use them when there is some complex initialization that needs to be performed or we have more than one way in which we want an object to be created. In this case, we are paving the way for future development. The factory method should just create a new HuffmanTree and then call buildFromFile on it.
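
A minimal sketch of the pattern, with some assumptions spelled out: the class name HuffmanTreeSkeleton, the built field, and the isBuilt method are stand-ins for illustration, the private constructor is one common choice rather than a requirement, and buildFromFile here only opens the file instead of doing any real work.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

/** Skeleton showing only the factory-method shape, not a working Huffman tree. */
class HuffmanTreeSkeleton {

    private boolean built; // stand-in for the real tree state

    /** Private constructor (an assumption): callers must use the factory method. */
    private HuffmanTreeSkeleton() {
    }

    /** Static factory method: creates the object, then asks it to build itself. */
    public static HuffmanTreeSkeleton newTreeFromFile(File file) throws FileNotFoundException {
        HuffmanTreeSkeleton tree = new HuffmanTreeSkeleton();
        tree.buildFromFile(file);
        return tree;
    }

    private void buildFromFile(File file) throws FileNotFoundException {
        // Scanner(File) throws FileNotFoundException if the file cannot be opened.
        try (Scanner in = new Scanner(file)) {
            built = true; // the real method would read the file and build the tree here
        }
    }

    boolean isBuilt() {
        return built;
    }
}
```

A caller would then write something like HuffmanTreeSkeleton.newTreeFromFile(new File("seed.txt")) instead of calling a constructor directly (the file name is just an example).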

To actually build the tree, you should start by reading the seed file one character at a time, using the Map to keep track of the frequencies of each character (i.e., each time you see a character, look up how many times you have seen it, add 1 to that number, and put it back in the Map). Once the file has been consumed, you will use a PriorityQueue (use the one from java.util) to build the Huffman tree. Important note: java.util.PriorityQueue is a min queue. The top element has the smallest value.
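
The counting step might look roughly like the sketch below. To keep it self-contained it uses java.util.TreeMap as a stand-in for your own Map, and the names FrequencyCountSketch and countCharacters are invented for the example. It takes a Reader so that it can be tried on a String as easily as on a file.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.TreeMap;

class FrequencyCountSketch {

    /** Reads one character at a time and tallies how often each one appears. */
    static TreeMap<Character, Integer> countCharacters(Reader in) throws IOException {
        TreeMap<Character, Integer> counts = new TreeMap<>();
        int next;
        while ((next = in.read()) != -1) { // read() returns -1 at end of input
            char c = (char) next;
            Integer seen = counts.get(c);  // null the first time we meet c
            counts.put(c, (seen == null) ? 1 : seen + 1);
        }
        return counts;
    }
}
```

For a real file you could pass in a java.io.FileReader, whose constructor throws FileNotFoundException when the file cannot be opened.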

You will want to create a private inner node class that holds a frequency and a String. Since we want to assemble everything into a binary tree structure (structurally it will be a heap, but we won't construct it that way), also include left and right children. Call this class HTree. Write a constructor that sets all of the fields at once.

For each character you read in, create an HTree wrapper for it and its frequency and put it in the PriorityQueue. The general algorithm for building the Huffman tree goes like this:

Remove the two trees with the lowest frequency from the queue
Use these two trees as the left and right children of a new tree with a root node that has the frequency of the two children combined. For the text of the new tree, concatenate the text from the left and right children together.
Return this tree to the queue
Repeat until there is only one tree — this is the Huffman tree
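
The steps above can be sketched as follows. This is one possible reading, not the required solution: it assumes the first tree removed becomes the left child, uses java.util.TreeMap in place of your own Map, and the names HuffmanBuildSketch, ByFrequencyThenText, and build are made up for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;
import java.util.TreeMap;

class HuffmanBuildSketch {

    /** Node holding a frequency, the text it covers, and its two children. */
    static final class HTree {
        final int frequency;
        final String text;
        final HTree left, right;

        HTree(int frequency, String text, HTree left, HTree right) {
            this.frequency = frequency;
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** Lowest frequency comes first; ties are broken alphabetically by text. */
    static final class ByFrequencyThenText implements Comparator<HTree> {
        @Override
        public int compare(HTree a, HTree b) {
            if (a.frequency != b.frequency) {
                return Integer.compare(a.frequency, b.frequency);
            }
            return a.text.compareTo(b.text);
        }
    }

    /** Returns the finished tree, or null if there were no characters at all. */
    static HTree build(TreeMap<Character, Integer> frequencies) {
        PriorityQueue<HTree> queue = new PriorityQueue<>(11, new ByFrequencyThenText());
        for (Character c : frequencies.keySet()) {
            queue.add(new HTree(frequencies.get(c), String.valueOf(c), null, null));
        }
        while (queue.size() > 1) {
            HTree left = queue.poll();   // lowest frequency (assumed to go on the left)
            HTree right = queue.poll();  // next lowest
            queue.add(new HTree(left.frequency + right.frequency,
                                left.text + right.text, left, right));
        }
        return queue.poll();
    }
}
```

With frequencies a=5, b=2, c=1, this sketch first merges c and b into a tree with text "cb" and frequency 3, then merges that with a to give a root with text "cba" and frequency 8.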

To make this work, you will want to write a Comparator that treats the lowest frequency as the highest priority. If two trees have the same frequency, order them alphabetically by their text (so "A" comes before "B"). Once the tree is fully constructed, you should create a second Map that stores the final code for each character. The tree and this map should be the only two data structures that you keep around after the construction process. For encoding, you will use the map to look up each character, and for decoding, you will use the tree to find the decoded characters.
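
Here is a hedged sketch of how the code map and the two directions of translation might fit together. Several things in it are assumptions: a left step is written as 0 and a right step as 1 (the assignment does not say which is which), the tree is assumed to contain at least two distinct characters, java.util.TreeMap stands in for your own Map, and the names HuffmanCodeSketch, Node, and collectCodes are invented for the example.

```java
import java.text.ParseException;
import java.util.TreeMap;

class HuffmanCodeSketch {

    /** Simplified tree node: just the text and the two children. */
    static final class Node {
        final String text;
        final Node left, right;

        Node(String text, Node left, Node right) {
            this.text = text;
            this.left = left;
            this.right = right;
        }

        boolean isLeaf() {
            return left == null && right == null;
        }
    }

    /** Records the path to every leaf: '0' for a left step, '1' for a right step. */
    static void collectCodes(Node node, String path, TreeMap<Character, String> codes) {
        if (node.isLeaf()) {
            codes.put(node.text.charAt(0), path);
            return;
        }
        collectCodes(node.left, path + "0", codes);
        collectCodes(node.right, path + "1", codes);
    }

    /** Encoding uses the map: one lookup per character. */
    static String encode(TreeMap<Character, String> codes, String text) throws ParseException {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String code = codes.get(text.charAt(i));
            if (code == null) {
                throw new ParseException("no code for character: " + text.charAt(i), i);
            }
            out.append(code);
        }
        return out.toString();
    }

    /** Decoding uses the tree: follow bits down to a leaf, emit it, restart at the root. */
    static String decode(Node root, String bits) throws ParseException {
        StringBuilder out = new StringBuilder();
        Node current = root;
        for (int i = 0; i < bits.length(); i++) {
            char bit = bits.charAt(i);
            if (bit == '0') {
                current = current.left;
            } else if (bit == '1') {
                current = current.right;
            } else {
                throw new ParseException("not a 0 or a 1: " + bit, i);
            }
            if (current.isLeaf()) {
                out.append(current.text);
                current = root;
            }
        }
        return out.toString(); // bits left over part-way down the tree are ignored
    }
}
```

The second argument to the ParseException constructor is the position in the input where the problem was found, which is handy when a test fails.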

I want you to notice that the two methods, encode and decode, are really only for testing. They transform the data into Strings of 1s and 0s, which is certainly not a compressed form for the data.

Grading

Points	Item	Comments
Correctness
30 pts	BSTMap
50 pts	HuffmanTree
Misc
10 pts	Comments	All methods and classes should have Javadoc style comments and non-trivial code should have explanations.

Bookkeeping

I would like you to continue keeping track of your time, so please log the amount of time that you spend on:

initial setup (creating the classes, putting in the methods and the javadoc comments)
writing tests
implementation (please break this down for each of the four methods you are writing)
debugging

Try to be as accurate as possible (it would be great if you didn't just get to the end and go "oh right, I need to keep track of my time... well it seemed like five hours..."). I am using this to (a) figure out if you (as a class, and also on a personal level) need more direction on how to direct your energies, and (b) inform the shape of future assignments. The more accurate this information is, the better I can do those two things, so please do not exaggerate in an attempt to get me to trim the assignments down or underestimate because you feel like your classmates didn't need as much time as you. Please include these in a comment at the top of the AssignmentSix class.

Last modified: Fri Mar 29 13:49:13 EDT 2013