CS 211 — Homework Eight

Huffman Codes Part II

Due: 16 April 2013, 11:59p

Worth: 90 points

The Task

Okay, we are finally going to finish off the Huffman application. Last week, we got to the point where we could build the tree and do some simple encoding and decoding. The problem with our technique is that we were converting text into Strings of 0s and 1s. Hopefully, the vast majority realized that this is the opposite of actual compression. Our goal was to reduce the number of bits required to encode a character. Instead, we blew it up, replacing one character with a whole String of them! Ouch.

So, we are going to do some bit-level work, and learn how to write binary data out to files and read it back in again. The second issue that we have is that once we have an encoded file, we don't have a good way to decode it without first having the original source file, which rather defeats the purpose of compressing the file in the first place. Thus, we need to store a copy of the tree in the coded file. Of course, we want to store the tree in a format that we can read back out again, and that also doesn't take up too much room...

Background: Canonical form

We would like our Huffman codes to be canonical. The canonical form adds two rules to the way Huffman codes work. First, all codes of a given bit length have lexicographically consecutive values (i.e., not just in order, but immediately consecutive), in the same order as the symbols they represent, and second, shorter codes lexicographically precede longer ones. An example will make this a little clearer...

Suppose we have some piece of text in which the characters A, B, C, and D appear, with the following frequencies: [A:1, B:5, C:6, D:2]. We could put these into a Huffman tree and come out with the following codes:

A  001
B  01
C  1
D  000

This is not in canonical form. The first rule is broken because A comes before D alphabetically, but D's code (000) precedes A's code (001). The second rule is broken because C has a shorter code than B, but B's code (01) precedes C's code (1) lexicographically. Here is an alternate assignment of codes that does satisfy the rules:

A  110
B  10
C  0
D  111

0 precedes 10, which precedes 11x, and 110 and 111 are lexicographically consecutive.

The algorithm for deriving canonical codes is fairly straightforward, provided we start by knowing the lengths of the codes. The trick is to think of the codes as binary numbers. The first code, which is given to the lexicographically smallest symbol with the shortest code, is 0, padded out with 0s to be as long as the symbol's code. In our example, C has the shortest code (length 1), so its code is 0 (if C's code had been length three, the code would have been 000). We then increment the code (counting in binary), which will produce the lexicographically consecutive code. If the next symbol has a longer code, we pad the code with 0s to the right to match the code length (in other words, we perform a left shift, or a multiply by 2, for each extra bit). Thus, the next code would be 1, but B has a longer code length, so it gets the code 10. To move to A, we increment (11), and shift again to get 110. D has the same code length as A, so we only increment, and it gets 111.
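The increment-and-shift rule can be sketched in a few lines of Java. This is only a sketch under some assumptions: the class name CanonicalCodes and the method assign are made up for illustration, the codes are returned as Strings purely so they are easy to inspect, and the symbols are sorted with a plain list sort rather than anything the assignment requires.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the assignment's required API.
class CanonicalCodes {
    /** Maps each symbol to its canonical code, given only the code lengths. */
    static Map<Character, String> assign(Map<Character, Integer> lengths) {
        // Shortest code first; alphabetical order breaks ties.
        List<Map.Entry<Character, Integer>> symbols = new ArrayList<>(lengths.entrySet());
        symbols.sort((a, b) -> {
            int byLength = Integer.compare(a.getValue(), b.getValue());
            return byLength != 0 ? byLength : Character.compare(a.getKey(), b.getKey());
        });

        Map<Character, String> codes = new LinkedHashMap<>();
        int code = 0;             // the first code is all zeros
        int previousLength = -1;  // -1 means "no symbol handled yet"
        for (Map.Entry<Character, Integer> symbol : symbols) {
            int length = symbol.getValue();
            if (previousLength >= 0) {
                // Increment, then shift left once per extra bit of length.
                code = (code + 1) << (length - previousLength);
            }
            codes.put(symbol.getKey(), toBits(code, length));
            previousLength = length;
        }
        return codes;
    }

    /** Writes code as a binary String, left-padded with 0s to the given length. */
    private static String toBits(int code, int length) {
        StringBuilder bits = new StringBuilder(Integer.toBinaryString(code));
        while (bits.length() < length) {
            bits.insert(0, '0');
        }
        return bits.toString();
    }
}
```

Feeding it the lengths from the example (A:3, B:2, C:1, D:3) walks through exactly the steps described above: C, then B, then A, then D.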

Why is this interesting? Well, it means that we can recover the entire tree starting only from knowledge about the length of the codes. So, when we want to store our tree in a file with our data, we just have to record the character followed by the length of its code. Cool, huh?

Part one: Generate canonical codes

We can keep all of the work that we did in the last assignment, but the tree that we produce is no longer the tree that we want to keep. Its sole purpose is to provide us with the initial code lengths for generating the canonical codes. So after building the tree, instead of traversing the tree and loading the reverse lookup into a map, you are going to traverse the tree and collect symbols and their code lengths, dumping them into a new map as you go.

You should then write a new private method called buildTreeFromMap which takes in a Map of characters to code lengths. You will then use a PriorityQueue to implement the above algorithm. The queue should order the characters so that the shortest code comes first, with alphabetical (lexicographical) order breaking ties (this should sound very familiar). You are welcome to write another wrapper class, or reuse the HTree class. As you generate the code for each character, insert the character into the tree in the correct location. The easiest way to do this is essentially to make the same trip through the tree that you are already making for decode. The difference is that if the code tells you to take a path towards a null child, you insert a new HTree into that location in the tree before proceeding. In this way the tree will spool out behind you as you walk down it.
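To make the "spooling out" idea concrete, here is one possible shape for it. Treat everything here as an assumption: Node is a bare-bones stand-in for your HTree class (whose real fields and methods will differ), build and insert are illustrative names, and the queue holds the Map entries directly rather than a wrapper class.

```java
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch; Node stands in for the assignment's HTree class.
class CanonicalTreeSketch {
    static class Node {
        Node zero;         // child followed on a 0 bit
        Node one;          // child followed on a 1 bit
        Character symbol;  // non-null only at a leaf
    }

    /** Builds a code tree from code lengths using the canonical ordering. */
    static Node build(Map<Character, Integer> lengths) {
        // Shortest code first; alphabetical order breaks ties.
        PriorityQueue<Map.Entry<Character, Integer>> queue = new PriorityQueue<>((a, b) -> {
            int byLength = Integer.compare(a.getValue(), b.getValue());
            return byLength != 0 ? byLength : Character.compare(a.getKey(), b.getKey());
        });
        queue.addAll(lengths.entrySet());

        Node root = new Node();
        int code = 0;
        int previousLength = -1;  // -1 means "no symbol handled yet"
        while (!queue.isEmpty()) {
            Map.Entry<Character, Integer> next = queue.poll();
            int length = next.getValue();
            if (previousLength >= 0) {
                code = (code + 1) << (length - previousLength);
            }
            insert(root, code, length, next.getKey());
            previousLength = length;
        }
        return root;
    }

    /** Walks the code from its first bit to its last, creating missing nodes. */
    static void insert(Node root, int code, int length, char symbol) {
        Node current = root;
        for (int i = length - 1; i >= 0; i--) {
            int bit = (code >> i) & 1;
            if (bit == 0) {
                if (current.zero == null) current.zero = new Node();
                current = current.zero;
            } else {
                if (current.one == null) current.one = new Node();
                current = current.one;
            }
        }
        current.symbol = symbol;
    }
}
```

Keeping the code as an int and peeling bits off from the top is a design choice for the sketch; walking a String of 0s and 1s would work just as well.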

Part two: Reading a binary file

We need to be able to read binary data in from a file (as opposed to the textual data that you have always dealt with up until now). To help us, we are going to create a simple class that handles the details for us. The class should be called BufferedBitReader, and it should have the following methods:

public BufferedBitReader(File inputFile) throws IOException, FileNotFoundException
This is the main constructor of the class. This will open the file named inputFile and read it into a buffer.
public boolean hasNext()
This will work like our iterator and will indicate if there is any more data to be read from the file.
public int read()
This will return a single bit from the file. Since Java doesn't have "bits", this will just return an int 1 or 0.
public void skip(int count)
This will skip forward count bits in the input stream (i.e., it will advance the cursor without reading the values and returning them).

The small complexity of implementing this class is that not only does Java not have a native bit type, it doesn't allow us to read a bit at a time either. We have to read data as bytes. We will actually be reading the whole file in the constructor. Start by creating a FileInputStream. This class allows us to read in raw bytes of data from a file. This is actually a pretty straightforward process. Create a byte[] that is big enough to hold the entire file (look for a method on FileInputStream that will give you that information). Pass that to the FileInputStream's read method. Close the stream. The entire contents of the file will now be in your "buffer". Obviously we wouldn't always want to read an entire file into local memory, but for this assignment it will be fine.
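As a rough sketch of the buffer-filling step, something like the following would do. Note the substitutions: this version sizes the buffer with File.length() rather than a method on FileInputStream, and it loops on read because a single call is allowed to return fewer bytes than requested. WholeFileSketch and readAll are made-up names.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

// Hypothetical helper showing one way to pull a whole file into memory.
class WholeFileSketch {
    static byte[] readAll(File inputFile) throws IOException {
        // Assumption: the file is small enough to fit in a single array.
        byte[] buffer = new byte[(int) inputFile.length()];
        // try-with-resources closes the stream even if a read fails.
        try (FileInputStream in = new FileInputStream(inputFile)) {
            int filled = 0;
            while (filled < buffer.length) {
                int count = in.read(buffer, filled, buffer.length - filled);
                if (count < 0) {
                    throw new IOException("file ended after " + filled + " bytes");
                }
                filled += count;
            }
        }
        return buffer;
    }
}
```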

The complexity is all in the read() method since the data is packed into bytes and we need to break those apart to get at the individual bits. The first step is to create a variable that will keep track of the position of the next bit to be returned (just like the iterators we looked at). Obviously, this will help you implement hasNext().

Since there are eight bits in a byte, if you divide this position by 8, you will get the index of the byte in the buffer containing the bit that we want. If you take the position % 8, you will get the index of the bit within the byte. Of course, we can't just grab an individual bit quite so easily. Use the index of the bit to shift the byte to the right until the bit is in the ones position (question: if you want bit 0 in the byte, how many places do you need to shift?). Now, we only want the value of the bit at that position, so we can use bit-wise AND (&) to AND the value with 1, which will erase any data in any other bit position. This number will now be 1 or 0, and that is what you return (after incrementing the position, of course).
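Here is a small in-memory sketch of the divide, shift, and mask steps. It makes one big assumption that you should check against the provided test file: bit 0 is taken to be the most significant (leftmost) bit of each byte. BitCursorSketch is a made-up class that works on a byte[] you hand it, not the BufferedBitReader itself.

```java
// Hypothetical sketch of bit extraction over an in-memory buffer.
class BitCursorSketch {
    private final byte[] buffer;
    private int position = 0;  // index of the next bit to return

    BitCursorSketch(byte[] buffer) {
        this.buffer = buffer;
    }

    boolean hasNext() {
        return position < buffer.length * 8;
    }

    int read() {
        int byteIndex = position / 8;  // which byte holds the bit
        int bitIndex = position % 8;   // which bit inside that byte
        // Assumption: bit 0 is the leftmost bit, so it needs the longest shift.
        // The & 1 also discards the sign bits Java adds when widening a byte.
        int bit = (buffer[byteIndex] >> (7 - bitIndex)) & 1;
        position++;
        return bit;
    }
}
```

For example, a buffer holding the single byte 0xA5 (binary 10100101) yields the bits 1, 0, 1, 0, 0, 1, 0, 1 in that order under this convention.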

To help you out, I am going to give you BufferedBitReaderTest.java, a test file for this class. You will also need the data file assignment8_raw_data_reference for this to work. Make sure you put it in the correct subdirectory. Also, since it is a raw binary file, it won't look like anything if you try to open it in your web browser.

Part three: Writing a binary file

Obviously, we need to have a class that performs writing as well. This one is slightly harder, but we are going to make use of the same basic principle. I would like you to create a class called BufferedBitWriter that looks like this:

public BufferedBitWriter(File outputFile) throws FileNotFoundException
This creates a new BufferedBitWriter and opens up the output stream.
public void write(int bit) throws IOException, UnsupportedDataTypeException
This will write 1 bit to the file. If the input is not a 1 or a 0, it should throw an UnsupportedDataTypeException
public void close() throws IOException
This will close the output stream after flushing any remaining data from the buffer

The principle behind this class is very similar to the principle behind the reader. We will have a buffer that sits between the calls to write() and the actual writing to the file. Unlike the reader, however, we can't know ahead of time how much data we will really have. So, in the constructor, create a byte[] of some reasonably large size. If it becomes full, we will just write the contents to the file and start over at the beginning.

In the constructor, you should create this buffer and open a FileOutputStream, which will allow us to write byte data to the file. Make sure that you use append mode so that we can write the tree out normally and then use this to tack the rest on. For the write() method, we will do the inverse of what we did for read. You should maintain a variable that holds the index of the next bit to be written. Again, dividing by 8 will tell you which byte should be written to, and %8 will tell you which bit. This time you will shift the bit to be aligned with the correct position in the byte. You will then do a bitwise OR (|) to merge that bit into the byte. You will then increment the index variable and, as I mentioned, write the buffer to the file if the index is no longer a valid position in the array.

The close method will close the stream. However, before it can do that, it needs to see if there is anything in the buffer that has not been written to the file. If there is, it should handle that before closing the stream.
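The packing and flushing logic might look roughly like this. Several things here are assumptions or substitutions: the sketch wraps any OutputStream instead of opening a FileOutputStream in append mode (so it can be tried without touching a file), it throws IllegalArgumentException in place of UnsupportedDataTypeException, it uses a 4096-byte buffer, and it fills each byte starting from the most significant bit. BitPackerSketch is a made-up name.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical sketch of the write/flush/close logic.
class BitPackerSketch {
    private final OutputStream out;
    private final byte[] buffer = new byte[4096];  // assumed "reasonably large" size
    private int position = 0;                      // index of the next bit to write

    BitPackerSketch(OutputStream out) {
        this.out = out;
    }

    void write(int bit) throws IOException {
        if (bit != 0 && bit != 1) {
            throw new IllegalArgumentException("bit must be 0 or 1, got " + bit);
        }
        int byteIndex = position / 8;
        int bitIndex = position % 8;
        // Assumption: bit 0 is the leftmost bit of the byte.
        buffer[byteIndex] |= (byte) (bit << (7 - bitIndex));
        position++;
        if (position == buffer.length * 8) {
            flushBuffer();  // buffer is full, so start over at the beginning
        }
    }

    /** Writes every byte that holds at least one bit, then clears the buffer. */
    private void flushBuffer() throws IOException {
        int bytesUsed = (position + 7) / 8;  // round up to a whole byte
        out.write(buffer, 0, bytesUsed);
        Arrays.fill(buffer, 0, bytesUsed, (byte) 0);
        position = 0;
    }

    void close() throws IOException {
        if (position > 0) {
            flushBuffer();  // a partial last byte is padded with 0 bits
        }
        out.close();
    }
}
```

Clearing the buffer after each flush matters because write() only ORs bits in; stale 1s from the previous pass would otherwise leak into the new data.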

To help you out, I am going to give you BufferedBitWriterTest.java, a test file for this class.

Part four: Putting it all together

We can finally get to the actual application! The first step will be to encode the file. Start by making two additional tweaks to buildFromFile. The first is to save the File object in an instance variable. The second is to introduce a new character into your tree. We have a slight problem in that our encoded files will not end on even byte boundaries, which means that we have up to seven bogus bits hanging around. These could easily be mistaken for a valid last character (or more). Our solution will be to add a unique character with a very low frequency (1 occurrence) that we will use to end the file. When you see that character, you will know not to read any further. To make your lives easier we will actually use a String rather than a single character. Use the String "\u0003" (this is the Unicode for the non-printing character 'End of text'). Just add this to your initial Map when you are finding the frequencies.
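Adding the end marker is a one-line change to the frequency count. The sketch below assumes your frequency Map is keyed by single-character Strings and counts over a String already in memory; your buildFromFile reads from a file and may be organized quite differently. FrequencySketch and count are made-up names.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the frequency count with the end marker added.
class FrequencySketch {
    static final String END_OF_TEXT = "\u0003";

    static Map<String, Integer> count(String text) {
        Map<String, Integer> frequencies = new TreeMap<>();
        for (int i = 0; i < text.length(); i++) {
            // merge adds 1 to the existing count, or starts the count at 1.
            frequencies.merge(String.valueOf(text.charAt(i)), 1, Integer::sum);
        }
        // The end marker occurs exactly once, at the very end of the data.
        frequencies.put(END_OF_TEXT, 1);
        return frequencies;
    }
}
```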

Once you have made that small tweak, we can get ready to add a couple of new methods to HuffmanTree:

public void saveCompressedFile(File outputFile) throws IOException
This will take the file used to build the tree, compress it and write it out into outputFile.
public static HuffmanTree newTreeFromCompressedFile(File compressedFile) throws IOException
This is a second factory method. This one builds the tree from the data stored in a compressed file. Like the other factory method, this will just create a new HuffmanTree and then call another function to do the hard work.
private void buildFromCompressedFile(File compressedFile) throws IOException
This would be the function that does the hard work. This will read the code length data stored in the head of the file and build the tree.
public void saveExpandedFile(File expandedFile) throws IOException
This will read the original compressed file and write the uncompressed data out into expandedFile

The saveCompressedFile method has two tasks to complete. First, it needs to write the representation of the tree out to the file. Start by writing an integer that says how many (character, code length) pairs to expect (this tells the reader how many bytes of the file are consumed by the tree). Then write out each character followed by the length of its code. Following the tree, you can start writing the compressed data from the file. This works just like encode, walking character by character, except that you will want to write bits using the BufferedBitWriter.

Building the tree from the compressed file is actually fairly straightforward. Open the file with a FileInputStream, and read the first byte. This will tell you how many bytes you will need to read to get the whole tree. You can read the whole header in with a single read operation. Then just take the (character, code length) pairs and put them in a Map. If you pass this Map to the buildTreeFromMap method you wrote earlier, you will have a tree.
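One header layout that fits the description is sketched below, but the exact byte layout is an assumption on my part here, not a specification: one byte holding the number of pairs, then two bytes per pair (the character, then its code length). That only works if there are at most 255 pairs and every character fits in a single byte. HeaderSketch and its methods are made-up names, and they work on plain streams so they can be tried without a file.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of an assumed header layout: count, then (char, length) pairs.
class HeaderSketch {
    static void writeHeader(OutputStream out, Map<Character, Integer> lengths) throws IOException {
        out.write(lengths.size());  // assumption: at most 255 pairs
        for (Map.Entry<Character, Integer> pair : lengths.entrySet()) {
            out.write((int) pair.getKey().charValue());  // assumption: one-byte characters
            out.write(pair.getValue().intValue());
        }
    }

    static Map<Character, Integer> readHeader(InputStream in) throws IOException {
        int pairs = in.read();  // the first byte is the pair count
        if (pairs < 0) {
            throw new EOFException("missing header");
        }
        byte[] header = new byte[pairs * 2];
        int filled = 0;
        while (filled < header.length) {
            int count = in.read(header, filled, header.length - filled);
            if (count < 0) {
                throw new EOFException("header cut short");
            }
            filled += count;
        }
        Map<Character, Integer> lengths = new TreeMap<>();
        for (int i = 0; i < header.length; i += 2) {
            // & 0xFF turns Java's signed bytes back into values 0..255.
            lengths.put((char) (header[i] & 0xFF), header[i + 1] & 0xFF);
        }
        return lengths;
    }
}
```

With this layout the tree occupies 1 + 2 * pairs bytes, which is also how far the data reader has to skip before the compressed bits begin.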

Saving the expanded version is also pretty straightforward. Open the file with a BufferedBitReader. Skip past the tree information to get to the data. Now just read the bits and follow the decode procedure.
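The decode loop with the end marker can be sketched as follows. The usual caveats apply: Node is a stand-in for HTree, the bits come from an in-memory byte[] rather than a BufferedBitReader, the leftmost bit of each byte is assumed to come first, the marker is held as a char rather than the String the assignment uses, and startBit plays the role of skip(). There is no handling of a malformed file.

```java
// Hypothetical sketch of decoding until the end-of-text marker.
class DecodeSketch {
    static final char END_OF_TEXT = '\u0003';

    static class Node {
        Node zero;         // child followed on a 0 bit
        Node one;          // child followed on a 1 bit
        Character symbol;  // non-null only at a leaf
    }

    /** Decodes data starting at bit startBit, stopping at the end marker. */
    static String decode(Node root, byte[] data, int startBit) {
        StringBuilder text = new StringBuilder();
        Node current = root;
        for (int position = startBit; position < data.length * 8; position++) {
            // Assumption: bit 0 is the leftmost bit of each byte.
            int bit = (data[position / 8] >> (7 - position % 8)) & 1;
            current = (bit == 0) ? current.zero : current.one;
            if (current.symbol != null) {
                if (current.symbol == END_OF_TEXT) {
                    break;  // everything after the marker is padding
                }
                text.append(current.symbol);
                current = root;  // start the next code from the top
            }
        }
        return text.toString();
    }
}
```

With the codes C=0, B=10, A=110, and the marker at 111, the two bytes 0x6B 0x80 decode to "CAB"; the seven trailing 0 bits would otherwise have come out as seven extra Cs.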

Useful tool

Since you will be dealing with binary files, you may find that you want to look inside them, and a conventional text editor will not give you a very good picture. What you need is a hex editor. There are a number of hex editors available, but if you have a Mac and have Xcode installed (or are in the lab), you will find that you already have one on the command line. Type hexdump -C filename, and you will be able to see the actual contents of the file. You can consult the man pages about this, and I will talk more about it in class.

Grading

Points   Item                       Comments
Correctness
25 pts   Building canonical codes
10 pts   BufferedBitReader
10 pts   BufferedBitWriter
10 pts   Saving compressed file
15 pts   Reading compressed file
10 pts   Saving expanded file
Misc
10 pts   Comments                   All methods and classes should have Javadoc-style comments, and non-trivial code should have explanations.

Book keeping

I would like you to continue to book keep your time. So please log the amount of time that you spend on:

Try to be as accurate as possible (it would be great if you didn't just get to the end and go "oh right, I need to keep track of my time... well, it seemed like five hours..."). I am using this to (a) figure out if you (as a class, and also on a personal level) need more direction on how to direct your energies, and (b) inform the shape of future assignments. The more accurate this information is, the better I can do those two things, so please do not exaggerate in an attempt to get me to trim the assignments down or underestimate because you feel like your classmates didn't need as much time as you. Please include these in a comment at the top of the AssignmentSix class.


Last modified: Wed Apr 10 10:38:48 EDT 2013