CS 211 — Homework Ten

Building a Simple Database

Due: 06 May 2013, 11:59p

Worth: 70 points

Introduction

For this assignment we are going to write one more implementation of the Map interface, this time as a hash map. You are then going to use the hash map to create a fast lookup into a simple database that we will build.

Part one: Building the HashMap

This step is relatively straightforward. I would like you to write a new class called HashMap that implements the Map interface. To keep things easy on you, you can implement this using external chaining. So your actual internal data storage will be an array of some kind of Node objects, which will be capable of storing a key and value and a next pointer. The size of this array should be determined by a capacity argument to the constructor.

Unfortunately, Java will not let us make an array of Node objects if they contain generic fields. We can't even get away with the casting trick we used for ArrayLists. So, you will have to create your array as an array of Objects and cast the objects as you refer to them. Sad, but that is just one of those Java things.
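To make the Object array idea concrete, here is a minimal sketch of the storage and the cast. The names ChainTableSketch, bucket, pushFront and firstValue are my own placeholders (they are not in the starter code), and this only shows the storage trick, not a full Map.

```java
// Sketch only: storage for external chaining kept in an Object[] with a cast on access.
// All names here are hypothetical and chosen for illustration.
class ChainTableSketch<K, V> {
    // One link in a bucket's chain: a key, a value, and a next pointer.
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // The table is declared as Object[]; every slot holds a Node or null.
    private final Object[] table;

    ChainTableSketch(int capacity) {
        table = new Object[capacity];
    }

    // The one place where the cast happens, so the warning is suppressed once.
    @SuppressWarnings("unchecked")
    private Node<K, V> bucket(int i) {
        return (Node<K, V>) table[i];
    }

    // Puts a new node at the front of chain i (no duplicate check in this sketch).
    void pushFront(int i, K key, V value) {
        table[i] = new Node<>(key, value, bucket(i));
    }

    // Returns the value at the head of chain i, or null if the chain is empty.
    V firstValue(int i) {
        Node<K, V> head = bucket(i);
        return head == null ? null : head.value;
    }
}
```

Keeping the cast inside a single private helper means the rest of the class can work with properly typed Node references.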

On the flip side, Java makes the actual hashing really easy. To get the hash code for a particular key, you can just call the hashCode() method on it. There is a caveat, however. The hash code will be a 32-bit integer. Hopefully you will have figured out that half of those numbers will be negative. Another decision made in Java was that the modulo operator can return negative numbers. This is unfortunate for us since we can't have negative indices into our array. "Ah hah!", you say, "we can just use abs()...". This is almost correct except for one single value. In two's complement (the way integers are represented in Java), the most negative number does not have a positive counterpart (just as zero doesn't have a counterpart). So the cool kid way to handle the hash codes is to mask out the highest bit to force the number to be non-negative (note that this does not make -1 now be 1... but we don't care either). So, given a table with M entries, the index we are interested in would be (key.hashCode() & 0x7fffffff) % M. The parentheses matter: % binds more tightly than &, so without them the expression would be evaluated as key.hashCode() & (0x7fffffff % M).
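If you want to convince yourself of the abs() problem and of the masking trick, a small experiment along these lines should do it. The class name HashIndexSketch and the method indexFor are placeholders of mine, not part of the assignment's interface.

```java
// Sketch only: why abs() is not enough, and how masking the sign bit behaves.
// HashIndexSketch and indexFor are hypothetical names.
class HashIndexSketch {
    // Maps any key to an index in the range 0 .. m-1.
    static int indexFor(Object key, int m) {
        // Clear the sign bit first, then take the remainder.
        return (key.hashCode() & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int worst = Integer.MIN_VALUE;

        // The most negative int has no positive counterpart, so abs() gives it back unchanged.
        System.out.println(Math.abs(worst) == worst);   // prints "true"

        // The remainder operator keeps the sign of the left operand.
        System.out.println(-1 % 16);                    // prints "-1"

        // Masking always lands inside the table (an Integer's hash code is its value).
        System.out.println(indexFor(worst, 16));        // prints "0"
        System.out.println(indexFor(-1, 16));           // prints "15"
    }
}
```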

Other than that, everything should be very straightforward. The only thing to remember is that since we are implementing a Map you need to check for previous entries for the same key so you can change the value (so you will walk the entire linked list if the key isn't there).
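The replace-or-append behavior can be sketched roughly as follows. This is one possible shape, not the required one: ChainedMapSketch is a made-up name, it does not implement our Map interface, and I am assuming put returns the previous value (or null) the way java.util.Map does.

```java
// Sketch only: put() walks the chain looking for the key before adding a new node.
// ChainedMapSketch is a hypothetical name; it is not the course's Map interface.
class ChainedMapSketch<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    private final Object[] table;
    private int size;

    ChainedMapSketch(int capacity) {
        table = new Object[capacity];
    }

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    @SuppressWarnings("unchecked")
    private Node<K, V> bucket(int i) {
        return (Node<K, V>) table[i];
    }

    // Assumption: returns the previous value for the key, or null if there was none.
    V put(K key, V value) {
        int i = indexFor(key);
        for (Node<K, V> n = bucket(i); n != null; n = n.next) {
            if (n.key.equals(key)) {      // key already present: overwrite in place
                V old = n.value;
                n.value = value;
                return old;
            }
        }
        // Walked the whole chain without finding the key: add a new node at the front.
        table[i] = new Node<>(key, value, bucket(i));
        size++;
        return null;
    }

    V get(K key) {
        for (Node<K, V> n = bucket(indexFor(key)); n != null; n = n.next) {
            if (n.key.equals(key)) {
                return n.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

A capacity of 1 is a handy test case, since every key then collides and the chain walk gets exercised on every call.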

I will add one additional requirement however. For the BSTMap, most of you maintained a Set object to keep track of the keys at all times. This time I would like you to build the set from scratch each time the keySet() method is called. You will create a new Set object and walk the table finding all of the keys and putting them in the Set. You are welcome to use a TreeSet again, or a HashSet if you want to continue the theme.
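Building the key set on demand amounts to a loop over the slots with a loop over each chain inside it. Here is a rough sketch using a TreeSet; KeySetSketch and its Node class are stand-ins of mine, and in your HashMap this would be an instance method working on your own table rather than a static helper.

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch only: collect every key by walking each chain in the table.
// KeySetSketch and Node are hypothetical stand-ins for your own classes.
class KeySetSketch {
    static class Node<K, V> {
        final K key;
        final V value;
        final Node<K, V> next;

        Node(K key, V value, Node<K, V> next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Builds a fresh Set on every call; nothing is cached between calls.
    @SuppressWarnings("unchecked")
    static <K extends Comparable<K>> Set<K> keySet(Object[] table) {
        Set<K> keys = new TreeSet<>();
        for (Object slot : table) {
            // Empty slots are null, so the inner loop is simply skipped for them.
            for (Node<K, ?> n = (Node<K, ?>) slot; n != null; n = n.next) {
                keys.add(n.key);
            }
        }
        return keys;
    }
}
```

The trade-off is that keySet() now costs time proportional to the table size plus the number of entries, in exchange for not having to keep a second structure in sync.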

Part two: Building a database

We are now going to build a very simple database called SimpleDatabase. This database will only have a single table, it will only store String data, and we can only add things to it and look things up. We cannot edit or delete records. The interface for the class looks like this:

public SimpleDatabase(String[] fields): This is the constructor that initially sets up the database. The input here is an array of Strings which provide the names for the columns in the table. This also defines how many columns are available.
public void addRecord(String[] record) throws DataFormatException: This method allows us to add new records to the database. A record is passed in as an array of Strings. The length of the record should exactly match the number of fields in the database (and be in the same order). If the number of fields doesn't match, this should throw a DataFormatException.
public String[] fields(): This function is a simple getter for checking the names of the fields in the table. It essentially returns the same thing that was passed in to the constructor.
public int size(): This returns the number of elements that are currently present in the database.
public Iterator find(String query): This function is for looking data up in the database. The query should be a single word and this will return an Iterator that will walk through all of the records that contain this word in any field. If no records are found, this will return null.
public Iterator find(String query, String field) throws NameNotFoundException: This is the slightly more specific version of the find() method. This will only return records in which query appears in the specified field. If the field name doesn't exist, this throws a NameNotFoundException. If the query returns no results, null is returned.

Implementation

The basic storage for our database is just an ArrayList<String[]>. In other words, we can just take the data that is passed into the addRecord method and add it on to the end of the list. The interesting part of the database is the way we implement find.

In order to have a fast, full-text search of our database, we are going to build something called an inverted index. This is basically a big lookup table that maps from individual words to the records in our database. We are going to implement ours using a HashMap (you knew we were going to use it somewhere in here). Our hash map will have a String for the key values (individual words) and a List of lookup records that tell us where to find the word (yes, the values in our hash map will be Lists).

In order to implement this you will need to create yet another Node class. Call this one IndexRecord. This should have two fields: one that holds the index of the record the word appears in and one that holds the field within that record (so if the word "lemur" appears in the second column of the third record we enter into the database, we would create a new IndexRecord that held 2 and 3). You should also implement the equals method on this class to make it easier to eliminate duplicate records (in case a word shows up multiple times in the same field).
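A value class like this is short. The sketch below is named IndexRecordSketch so that it doesn't collide with your own IndexRecord, and the field names are my guesses. I have also added hashCode, which the assignment doesn't ask for, because overriding equals without it is a common source of surprises if these objects ever end up in a hash-based collection.

```java
// Sketch only: a small value class with equals() (and a matching hashCode()).
// The class name and field names are hypothetical.
class IndexRecordSketch {
    final int recordIndex;  // which record the word appears in
    final int fieldIndex;   // which field of that record

    IndexRecordSketch(int recordIndex, int fieldIndex) {
        this.recordIndex = recordIndex;
        this.fieldIndex = fieldIndex;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof IndexRecordSketch)) {
            return false;   // also covers other == null
        }
        IndexRecordSketch that = (IndexRecordSketch) other;
        return recordIndex == that.recordIndex && fieldIndex == that.fieldIndex;
    }

    // Not required by the assignment; kept consistent with equals().
    @Override
    public int hashCode() {
        return 31 * recordIndex + fieldIndex;
    }
}
```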

In order to build this inverted index, you will have to do a little more work in addRecord. First, you add the record to the main ArrayList. Then, figure out which index that record is stored at (if you just added it in on the end, it will be the length of the list - 1). Next, iterate through the fields of the record. For each field, iterate through each word (using a Scanner, for instance). For each word, start by creating a new IndexRecord item. Then, check if it is in the index yet. If it is, look it up, get the List, and look at the last item. If the last item doesn't match the current record, then add it to the List (if it does match, just move on – we only need one copy of each record). If the index doesn't know about the word yet, create a new List (any kind you like), add the IndexRecord to it, and then put it into the hash map. To support case-insensitive searching, make sure to convert the word to lowercase before putting it in the map.
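The steps above might look something like the following sketch. To keep it self-contained I have made several substitutions that you should not copy blindly: it uses java.util.HashMap in place of your own HashMap, a nested Posting class in place of IndexRecord, and it assumes the DataFormatException meant here is the one in java.util.zip. InvertedIndexSketch and lookup are made-up names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;
import java.util.zip.DataFormatException;

// Sketch only: addRecord() that also maintains an inverted index.
// Uses java.util.HashMap as a stand-in for the HashMap from part one.
class InvertedIndexSketch {
    // Hypothetical stand-in for IndexRecord.
    static class Posting {
        final int record;
        final int field;

        Posting(int record, int field) {
            this.record = record;
            this.field = field;
        }

        @Override
        public boolean equals(Object o) {
            return o instanceof Posting
                    && ((Posting) o).record == record
                    && ((Posting) o).field == field;
        }

        @Override
        public int hashCode() {
            return 31 * record + field;
        }
    }

    private final int fieldCount;
    private final List<String[]> records = new ArrayList<>();
    private final Map<String, List<Posting>> index = new HashMap<>();

    InvertedIndexSketch(int fieldCount) {
        this.fieldCount = fieldCount;
    }

    void addRecord(String[] record) throws DataFormatException {
        if (record.length != fieldCount) {
            throw new DataFormatException("expected " + fieldCount + " fields");
        }
        records.add(record);
        int recordIndex = records.size() - 1;    // where the record just landed

        for (int f = 0; f < record.length; f++) {
            Scanner words = new Scanner(record[f]);   // splits on whitespace
            while (words.hasNext()) {
                String word = words.next().toLowerCase();
                Posting p = new Posting(recordIndex, f);
                List<Posting> postings = index.get(word);
                if (postings == null) {
                    // First time we see this word: start a new list for it.
                    postings = new ArrayList<>();
                    postings.add(p);
                    index.put(word, postings);
                } else if (!postings.get(postings.size() - 1).equals(p)) {
                    // Only the last item can be a duplicate, since we add in order.
                    postings.add(p);
                }
            }
        }
    }

    // Returns the postings for a word, or null if the word was never indexed.
    List<Posting> lookup(String word) {
        return index.get(word.toLowerCase());
    }
}
```

Checking only the last item is enough because all postings for one field of one record are added back to back. Note that a plain Scanner leaves punctuation attached to words, so "lemur," and "lemur" would be indexed separately in this sketch.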

The find functionality is relatively straightforward. The first thing you should do is to create a new Iterator. Call this IndexIterator and it should implement Iterator<String[]> (Iterator is an interface, so it is implemented rather than extended). Give this a constructor that accepts a List<IndexRecord>. You can then maintain an index into the List and walk through it in the usual way.

There is one wrinkle, however. We would rather not return the same record more than once. While we eliminated exact matches, we may still have multiple IndexRecords that point to the same record index if a word appears in multiple fields. So, in the next() function, when you advance your index into the list, continue to advance it until you hit a record that has a different index.
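Here is a rough sketch of an iterator that skips over repeated record indexes. It differs from the description in one respect: because it is a standalone class, its constructor also takes the list of records so that next() has something to return. If you write yours as an inner class of SimpleDatabase, the single-argument constructor described above works fine. IndexIteratorSketch and Posting are placeholder names.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Sketch only: walks a list of postings and returns each matching record once.
// IndexIteratorSketch and Posting are hypothetical names.
class IndexIteratorSketch implements Iterator<String[]> {
    // Stand-in for IndexRecord.
    static class Posting {
        final int record;
        final int field;

        Posting(int record, int field) {
            this.record = record;
            this.field = field;
        }
    }

    private final List<String[]> records;   // assumption: passed in, see the note above
    private final List<Posting> postings;
    private int pos = 0;

    IndexIteratorSketch(List<String[]> records, List<Posting> postings) {
        this.records = records;
        this.postings = postings;
    }

    @Override
    public boolean hasNext() {
        return pos < postings.size();
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int recordIndex = postings.get(pos).record;
        // Advance past every posting that points at the same record.
        while (pos < postings.size() && postings.get(pos).record == recordIndex) {
            pos++;
        }
        return records.get(recordIndex);
    }
}
```

This relies on postings for the same record being adjacent in the list, which holds when they are appended in the order the records were added.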

This makes the simple find method pretty easy to write. Look the query up in the inverted index. If it isn't found, return null, otherwise, create a new IndexIterator with the associated List and return that.

The second find method is very similar. However, we only want the records that reference a particular field. Create a new List object and iterate through the full List that was stored in the index. If the IndexRecord references the field, put it in the new List, otherwise skip it. You can then pass this new List to the constructor of your iterator.
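The filtering step for the two-argument find could be sketched like this. FieldFilterSketch, Posting, indexOfField and onlyField are all names I made up; the sketch returns -1 for an unknown field name, which is the point where your find would throw NameNotFoundException instead.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: narrow a word's postings down to a single field.
// All names here are hypothetical.
class FieldFilterSketch {
    // Stand-in for IndexRecord.
    static class Posting {
        final int record;
        final int field;

        Posting(int record, int field) {
            this.record = record;
            this.field = field;
        }
    }

    // Finds the column number for a field name, or -1 if there is no such field.
    static int indexOfField(String[] fields, String name) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(name)) {
                return i;
            }
        }
        return -1;
    }

    // Copies over only the postings that reference the requested field.
    static List<Posting> onlyField(List<Posting> all, int field) {
        List<Posting> kept = new ArrayList<>();
        for (Posting p : all) {
            if (p.field == field) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

An empty filtered list corresponds to the no-results case, where find is specified to return null rather than an iterator over nothing.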

Part three: Browsing a database

The final part of this assignment is to build a very simple command line browser for your database using the same general strategy that we have used in the past. Create a new class called DatabaseExplorer. This class should have the following interface:

public DatabaseExplorer(String filename) throws FileNotFoundException: This is the constructor of our class, it will take in the file that we will use to load the database.
public void cliHandler(Scanner in, PrintWriter out): This will be our old friend the command line interface. When it starts up, it should print out "Database contains X records" (where X is the number of records present in the database). The user can then issue single word queries. The response should start with a line listing the column names of the fields, separated by tabs (\t). Then it will print out each matching record with the fields also separated by tab characters. If the user writes two words, the first word is interpreted as a column name and only results matching that field should be printed. If there are no matches, this should print "No records found".
public static void main(String[] args) throws FileNotFoundException: This is, of course, main. It will take in the name of the file to use for the database in args. It will use this to create a new DatabaseExplorer and then call cliHandler on it.

Most of this should seem fairly straightforward. The last detail is about the files that we will be getting our data from. The data we will be working with will be stored in CSV files (comma-separated values). CSV files are very common, and conceptually very simple. They are plain text files in which we store tabular data. Each record in the table is stored on a single line, with the fields separated by commas. Typically, the first row is the header, providing the names of the fields. That's really it. Well, sort of. Some CSV files don't actually use commas as the delimiter (some use tabs or semi-colons or something else). There are various ways to handle the problem of wanting to use the chosen delimiter in a record field (a typical approach is to put quotes around the field... which then means you need to escape quotes that you want to include in the field...). There really isn't a standard for this. You can learn more about CSV files on Wikipedia. For our purposes, we will go for the simplest form: commas separate fields, first line provides column headings, and we don't allow commas inside of a field. You can make a CSV file out of any kind of data you find interesting. Here is a short example (some information about lemurs), and a considerably longer example (a sample of some of my music collection) that you can play with.

When you implement the constructor, you will take in each line, split it based on the commas (the split() function would be a good choice), and then add it to the database as a record. The cliHandler will then loop through the Scanner, using hasNext() and nextLine(), handling queries until the user types "quit".
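As a rough illustration of both loops, here is a sketch that reads CSV lines and dispatches commands. It is deliberately not wired to a database: the command loop just prints which kind of search it would run. CsvLoopSketch, readCsv and commandLoop are placeholder names, and I used hasNextLine() here since I am reading whole lines.

```java
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Sketch only: reading simple CSV lines and looping over user commands.
// All names are hypothetical; the real class would call into SimpleDatabase.
class CsvLoopSketch {
    // Returns one String[] per non-empty line; the first one is the header row.
    static List<String[]> readCsv(Scanner in) {
        List<String[]> rows = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.isEmpty()) {
                continue;
            }
            // A limit of -1 keeps trailing empty fields, so "a,b," yields three fields.
            rows.add(line.split(",", -1));
        }
        return rows;
    }

    // Reads queries until "quit" or end of input.
    static void commandLoop(Scanner in, PrintWriter out) {
        while (in.hasNextLine()) {
            String line = in.nextLine().trim();
            if (line.equals("quit")) {
                break;
            }
            if (line.isEmpty()) {
                continue;
            }
            String[] words = line.split("\\s+");
            if (words.length == 1) {
                out.println("search all fields for: " + words[0]);
            } else if (words.length == 2) {
                // Two words: the first is the column name, the second is the query.
                out.println("search field " + words[0] + " for: " + words[1]);
            }
        }
        out.flush();
    }
}
```

The limit argument to split() is worth a second look: with plain split(","), a line ending in a comma loses its last (empty) field, and the record would then fail the length check in addRecord.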

Saving time

At this point I think you have the fundamentals about setting up classes under your belt, and I want you to focus on the important parts of this assignment. So, I am providing you with some starter code that will just get you up and running. This is just the framework, so don't get too excited, but it should save you some typing. The other thing to remember is that you are just reimplementing the Map interface so there is no need to write a whole new battery of tests. Just take the regression tests from your last assignment and copy them into this one (remembering, of course, to change the map to your HashMap).

Bonus round

Okay, last chance for a little bit of extra credit. Please make sure to indicate what you attempted.

[5 points] Implement table resizing: Add a second constructor to the HashMap that allows you to specify a maximum load factor. When an addition would cause you to exceed the load factor, create a new array that is twice the size and rehash all of your entries into it.
[10 points] Implement linear probing: Create an additional class called LinearHashMap that uses linear probing instead of external chaining. You can double dip and implement a target load factor (under 1) and resizing for this as well. For deletions, rehash the entries in the slots that follow the removed entry, up to the next empty slot.
[5 points] Support commas in the CSV fields: Support the use of quoted fields to allow commas to be used. This should also support escaped quotes within the quoted field.
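For the quoted-field bonus, a character-by-character scan with an "inside quotes" flag is one workable approach. The sketch below assumes that a quote inside a quoted field is escaped by doubling it (""), which is a common convention but only one of several; if your data uses backslashes you will need to adjust. QuotedCsvSketch and splitQuoted are placeholder names.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: split one CSV line, honoring double-quoted fields.
// Assumption: a literal quote inside a quoted field is written as two quotes ("").
class QuotedCsvSketch {
    static List<String> splitQuoted(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        current.append('"');   // doubled quote: keep one, skip the other
                        i++;
                    } else {
                        inQuotes = false;      // closing quote
                    }
                } else {
                    current.append(c);         // commas are ordinary characters in here
                }
            } else if (c == '"') {
                inQuotes = true;               // opening quote
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString());        // the last field has no trailing comma
        return fields;
    }
}
```

This version does not report an unterminated quote as an error; it just treats the rest of the line as part of the field.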

Points	Item	Comments
Correctness
15 pts	HashMap
30 pts	SimpleDatabase
15 pts	DatabaseExplorer
Misc
10 pts	Comments	All methods and classes should have Javadoc style comments and non-trivial code should have explanations.

Last modified: Sat Apr 27 01:39:53 EDT 2013