CS 211 — Homework Seven
Huffman Codes
Due: 05 April 2013, 11:59p
Worth: 90 points
The Task
This week you are going to implement a new data structure and the main Huffman tree. Next week we will finish this off with a full application for compressing and decompressing text files.
The Map
First, we need a new data structure. We are going to build it in two pieces so that we can reuse the interface later, so the first thing you should build is an interface called Map<K extends Comparable<K>,V>. The interface should look like this:
void put(K key, V value)
- This should load value into the map at the location associated with key, replacing any preexisting value already associated with key.
V get(K key)
- This should fetch the value associated with key, or null if no match can be found.
boolean containsKey(K key)
- This will report whether there is a value associated with key in the map.
V remove(K key)
- This method should remove the entry associated with key and return the associated value. Return null if the key is not found.
Set<K> keySet()
- This method should return a Set of all available keys in the map. You may use the java.util.TreeSet class to return these. You can build the set on the fly using a traversal, or collect the keys as they are added to the map.
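As a rough sketch, the five methods above could be written down in Java as follows. The interface follows the signatures listed here; StandInMap is a hypothetical throwaway class backed by java.util.TreeMap, included only so the interface can be compiled and exercised. It is not the implementation this assignment asks for.

```java
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** The map interface as described in the list above. */
interface Map<K extends Comparable<K>, V> {
    /** Stores value under key, replacing any value already stored there. */
    void put(K key, V value);

    /** Returns the value stored under key, or null if there is none. */
    V get(K key);

    /** Reports whether some value is stored under key. */
    boolean containsKey(K key);

    /** Removes the entry for key and returns its value, or null if absent. */
    V remove(K key);

    /** Returns the set of all keys currently in the map. */
    Set<K> keySet();
}

/**
 * Hypothetical stand-in used only to exercise the interface.
 * ASSUMPTION: it delegates to java.util.TreeMap, which the assignment
 * does not permit for the real BSTMap.
 */
class StandInMap<K extends Comparable<K>, V> implements Map<K, V> {
    private final TreeMap<K, V> backing = new TreeMap<>();

    public void put(K key, V value) { backing.put(key, value); }
    public V get(K key) { return backing.get(key); }
    public boolean containsKey(K key) { return backing.containsKey(key); }
    public V remove(K key) { return backing.remove(key); }
    public Set<K> keySet() { return new TreeSet<>(backing.keySet()); }
}
```

Because this interface shares its simple name with java.util.Map, avoid importing java.util.Map (or java.util.*) in files that use it.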
Of course, since this is just the interface, you will need to write an actual implementation so that you can use it. This implementation should be called BSTMap<K,V>. Obviously, this will use a binary search tree to implement the methods described above. One perfectly valid implementation choice would be to create a separate binary search tree class and use it to implement the described methods, but that can force you to jump through some hoops to make methods like get and containsKey work properly. So instead, you can build the binary search tree structure directly into your implementation of BSTMap, the way we built the PriorityQueue using an underlying heap data structure.
When constructing the underlying BST, you will need to create some form of node structure that holds both the key and the value (as well as the left, right, and parent pointers, of course). You will also need to adjust the pseudocode that we walked through in class to work directly with the key values. It is important that these nodes are not visible outside of the class.
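To make the node idea concrete, here is a minimal sketch of a keyed node hidden inside its owning class, with put and get driven by compareTo. The class name KeyedBstSketch is hypothetical, and this is one possible shape rather than the pseudocode from class. It deliberately leaves out containsKey, remove, and keySet.

```java
/** Hypothetical sketch: a BST whose nodes carry a key and a value. */
class KeyedBstSketch<K extends Comparable<K>, V> {

    /** Private nested node, so it is invisible outside this class. */
    private static final class Node<K, V> {
        final K key;
        V value;
        Node<K, V> left;
        Node<K, V> right;
        Node<K, V> parent;

        Node(K key, V value, Node<K, V> parent) {
            this.key = key;
            this.value = value;
            this.parent = parent;
        }
    }

    private Node<K, V> root;

    /** Inserts a new entry, or overwrites the value if the key exists. */
    public void put(K key, V value) {
        if (root == null) {
            root = new Node<>(key, value, null);
            return;
        }
        Node<K, V> current = root;
        while (true) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) {
                current.value = value;          // replace existing value
                return;
            } else if (cmp < 0) {
                if (current.left == null) {
                    current.left = new Node<>(key, value, current);
                    return;
                }
                current = current.left;
            } else {
                if (current.right == null) {
                    current.right = new Node<>(key, value, current);
                    return;
                }
                current = current.right;
            }
        }
    }

    /** Returns the value stored under key, or null if it is absent. */
    public V get(K key) {
        Node<K, V> current = root;
        while (current != null) {
            int cmp = key.compareTo(current.key);
            if (cmp == 0) {
                return current.value;
            }
            current = (cmp < 0) ? current.left : current.right;
        }
        return null;
    }
}
```

Note that all comparisons go through the key alone; the value just rides along in the node.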
The Huffman tree
Once you have built your map, you will have enough tools to actually build the Huffman tree. The tree should be in a class called HuffmanTree. The interface should look like this:
public static HuffmanTree newTreeFromFile(File file) throws FileNotFoundException
- This is a static factory method that creates a new Huffman tree. In essence, this just creates the new object and calls buildFromFile on it. This should throw a FileNotFoundException when appropriate.
private void buildFromFile(File file) throws FileNotFoundException
- This reads in an unencoded file and builds the Huffman tree based on the character frequencies. This should throw a FileNotFoundException when appropriate.
public String encode(String text) throws ParseException
- This method takes in some plain text and returns the encoded version as a String of 1s and 0s. If an encoding cannot be found for a particular character, this should throw a ParseException.
public String decode(String text) throws ParseException
- This method takes in a String of 1s and 0s and returns a decoded raw text version. If this encounters a character that is not a 1 or a 0, it should throw a ParseException. If there are spare 1s and 0s at the end (i.e., they do not decode to a character), ignore them and return the decoded result for the rest of the String.
For this assignment I am introducing you to a new Java design pattern called the "factory method". Factory methods are methods (usually static) that return new instances of a class. We typically use them when there is some complex initialization that needs to be performed or we have more than one way in which we want an object to be created. In this case, we are paving the way for future development. The factory method should just create a new HuffmanTree and then call buildFromFile on it.
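The shape of that pattern can be sketched as below. The class name HuffmanTreeSkeleton, the built flag, and isBuilt() are hypothetical placeholders. The private constructor is my assumption (the handout does not require it), and the body of buildFromFile only opens the file so that a missing file surfaces as a FileNotFoundException.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

/** Hypothetical skeleton showing only the factory-method shape. */
class HuffmanTreeSkeleton {
    private boolean built;   // placeholder state, not part of the assignment

    /** ASSUMPTION: a private constructor forces callers through the factory. */
    private HuffmanTreeSkeleton() {
    }

    /** Static factory: create the object, then hand it the file. */
    public static HuffmanTreeSkeleton newTreeFromFile(File file)
            throws FileNotFoundException {
        HuffmanTreeSkeleton tree = new HuffmanTreeSkeleton();
        tree.buildFromFile(file);
        return tree;
    }

    /** Placeholder for the real construction work. */
    private void buildFromFile(File file) throws FileNotFoundException {
        // new Scanner(File) throws FileNotFoundException if the file is missing.
        try (Scanner in = new Scanner(file)) {
            // The real method would count characters and assemble the tree here.
            built = true;
        }
    }

    boolean isBuilt() {
        return built;
    }
}
```

A second factory (say, one that builds from a String) could later be added next to newTreeFromFile without changing any calling code, which is the "paving the way" part.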
To actually build the tree, you should start by reading the seed file one character at a time, using the Map to keep track of the frequency of each character (i.e., each time you see a character, look up how many times you have seen it, add 1 to that number, and put it back in the Map). Once the file has been consumed, you will use a PriorityQueue (use the one from java.util) to build the Huffman tree. Important note: java.util.PriorityQueue is a min queue. The top element has the smallest value.
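The look-up, add-one, put-back step and the min-queue behavior can be illustrated with a small sketch. FrequencySketch and its two methods are hypothetical names. It uses java.util.TreeMap as a stand-in for your own Map, and a Reader rather than a File so it can be tried on a String. Both are assumptions made to keep the example self-contained.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.PriorityQueue;
import java.util.TreeMap;

/** Hypothetical sketch of character counting and min-queue ordering. */
class FrequencySketch {

    /** Reads one character at a time and tallies how often each occurs. */
    static TreeMap<Character, Integer> countCharacters(Reader in) {
        // ASSUMPTION: TreeMap stands in for the assignment's own Map.
        TreeMap<Character, Integer> counts = new TreeMap<>();
        try {
            int next;
            while ((next = in.read()) != -1) {      // -1 signals end of input
                char c = (char) next;
                Integer seen = counts.get(c);        // look up the old count
                counts.put(c, seen == null ? 1 : seen + 1);  // add 1, put back
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    /** Shows that java.util.PriorityQueue hands back the smallest value first. */
    static List<Integer> drainInQueueOrder(Collection<Integer> values) {
        PriorityQueue<Integer> queue = new PriorityQueue<>(values);
        List<Integer> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll());
        }
        return order;
    }
}
```

For the input "abracadabra" the counts are a=5, b=2, c=1, d=1, r=2, and draining those counts from the queue yields them smallest first.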
You will want to create a private inner node class that holds a frequency and a String. Since we want to assemble everything into a binary tree structure (its frequencies will obey heap order, with each parent at least as large as its children, but we won't construct it that way), also include left and right children. Call this class HTree. Write a constructor that sets all of the fields at once.
For each distinct character you read in, create an HTree wrapper for it and its frequency and put it in the PriorityQueue. The general algorithm for building the Huffman tree goes like this:
- Remove the two trees with the lowest frequency from the queue
- Use these two trees as the left and right children of a new tree with a root node that has the frequency of the two children combined. For the text of the new tree, concatenate the text from the left and right children together.
- Return this tree to the queue
- Repeat until there is only one tree — this is the Huffman tree
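The four steps above can be sketched as follows. HuffmanMergeSketch, build, and HTreeComparator are hypothetical names. The queue is ordered by frequency and then by text, which is the tie-breaking rule this handout asks for. Making the first tree removed the left child is my assumption, since the handout does not say which goes where. HTree is left package-visible here only so the example can be exercised; in your submission it should be private.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

/** Hypothetical sketch of the merge loop that assembles a Huffman tree. */
class HuffmanMergeSketch {

    /** Node holding a frequency, the text it covers, and two children. */
    static final class HTree {
        final int frequency;
        final String text;
        final HTree left;
        final HTree right;

        HTree(int frequency, String text, HTree left, HTree right) {
            this.frequency = frequency;
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** Lowest frequency first; ties broken alphabetically by text. */
    static final class HTreeComparator implements Comparator<HTree> {
        @Override
        public int compare(HTree a, HTree b) {
            if (a.frequency != b.frequency) {
                return Integer.compare(a.frequency, b.frequency);
            }
            return a.text.compareTo(b.text);
        }
    }

    /** Builds the tree from parallel arrays of symbols and their counts. */
    static HTree build(char[] symbols, int[] counts) {
        PriorityQueue<HTree> queue = new PriorityQueue<>(new HTreeComparator());
        for (int i = 0; i < symbols.length; i++) {
            queue.add(new HTree(counts[i], String.valueOf(symbols[i]), null, null));
        }
        while (queue.size() > 1) {
            HTree first = queue.poll();    // lowest frequency
            HTree second = queue.poll();   // next lowest
            // ASSUMPTION: first removed becomes the left child.
            queue.add(new HTree(first.frequency + second.frequency,
                                first.text + second.text, first, second));
        }
        return queue.poll();               // null if there were no symbols
    }
}
```

With the counts for "abracadabra" (a=5, b=2, r=2, c=1, d=1), the merges produce "cd", then "bcd", then "rbcd", and finally a root with text "arbcd" and frequency 11.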
To make this work, you will want to write a Comparator that uses the lowest frequency as the highest priority. If the frequencies of two trees are the same, order the trees alphabetically by their text (so "A" comes before "B"). Once the tree is fully constructed, you should create a second Map that stores the final codes for each character. The tree and this map should be the only two data structures that you keep around after the construction process. For encoding, you will use the map to look up each character, and for decoding, you will use the tree to find the decoded characters.
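Here is a minimal sketch of how the code map and the two directions might fit together. All names in it (HuffmanCodeSketch, collectCodes, and the static encode and decode helpers) are hypothetical. It assumes 0 means "go left" and 1 means "go right", which the handout does not specify. It uses java.util.TreeMap as a stand-in for your own Map and assumes the tree has at least two leaves.

```java
import java.text.ParseException;
import java.util.TreeMap;

/** Hypothetical sketch: code table from the tree, then encode and decode. */
class HuffmanCodeSketch {

    /** Package-visible here for demonstration; private in a real solution. */
    static final class HTree {
        final int frequency;
        final String text;
        final HTree left;
        final HTree right;

        HTree(int frequency, String text, HTree left, HTree right) {
            this.frequency = frequency;
            this.text = text;
            this.left = left;
            this.right = right;
        }
    }

    /** Records the root-to-leaf path for every leaf (ASSUMPTION: 0 = left, 1 = right). */
    static void collectCodes(HTree node, String path, TreeMap<Character, String> codes) {
        if (node == null) {
            return;
        }
        if (node.left == null && node.right == null) {
            codes.put(node.text.charAt(0), path);   // a leaf holds one character
            return;
        }
        collectCodes(node.left, path + "0", codes);
        collectCodes(node.right, path + "1", codes);
    }

    /** Encoding is a map lookup per character. */
    static String encode(TreeMap<Character, String> codes, String text) throws ParseException {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            String code = codes.get(text.charAt(i));
            if (code == null) {
                throw new ParseException("No code for '" + text.charAt(i) + "'", i);
            }
            out.append(code);
        }
        return out.toString();
    }

    /** Decoding walks the tree bit by bit; leftover bits at the end are ignored. */
    static String decode(HTree root, String bits) throws ParseException {
        StringBuilder out = new StringBuilder();
        HTree current = root;
        for (int i = 0; i < bits.length(); i++) {
            char bit = bits.charAt(i);
            if (bit == '0') {
                current = current.left;
            } else if (bit == '1') {
                current = current.right;
            } else {
                throw new ParseException("Expected 0 or 1 but found '" + bit + "'", i);
            }
            if (current.left == null && current.right == null) {
                out.append(current.text);   // reached a leaf: emit and restart
                current = root;
            }
        }
        return out.toString();
    }
}
```

For a tree with leaf a on the left and a subtree holding b and c on the right, the codes come out as a=0, b=10, c=11, so "abca" encodes to "010110".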
I want you to notice that the two methods, encode and decode, are really only for testing. They transform the data into Strings of 1s and 0s, which is certainly not a compressed form for the data.
Grading
Points | Item | Comments |
---|---|---|
Correctness | ||
30 pts | BSTree | |
50 pts | HuffmanTree | |
Misc | ||
10 pts | Comments | All methods and classes should have Javadoc style comments and non-trivial code should have explanations. |
Book keeping
I would like you to continue to book keep your time. So please log the amount of time that you spend on:
- initial setup (creating the classes, putting in the methods and the javadoc comments)
- writing tests
- implementation (please break this down for each of the four methods you are writing)
- debugging
Try to be as accurate as possible (it would be great if you didn't just get to the end and go "oh right, I need to keep track of my time... well, it seemed like five hours..."). I am using this to (a) figure out if you (as a class, and also on a personal level) need more direction on how to direct your energies, and (b) inform the shape of future assignments. The more accurate this information is, the better I can do those two things, so please do not exaggerate in an attempt to get me to trim the assignments down, or underestimate because you feel like your classmates didn't need as much time as you. Please include these in a comment at the top of the AssignmentSix class.