CS 211 — Homework Ten
Building a Simple Database
Due: 06 May 2013, 11:59p
Worth: 70 points
Introduction
For this assignment we are going to implement one more iteration
of the Map interface, this time as a hash map. You are then going
to use the hash map to create a fast lookup into a simple database
that we will build.
Part one: Building the HashMap
This step is relatively straightforward. I would like you to
write a new class called HashMap that implements the Map
interface. To keep things easy on you, you can implement this
using external chaining. So your actual internal data storage
will be an array of some kind of Node objects, which will be
capable of storing a key and value and a next pointer. The size
of this array should be determined by a capacity argument to the
constructor.
Unfortunately, Java will not let us make an array of Node objects
if they contain generic fields. We can't even get away with the
casting trick we used for ArrayLists. So, you will have to create
your array as an array of Objects and cast the objects as you
refer to them. Sad, but that is just one of those Java things.
On the flip side, Java makes the actual hashing really easy. To
get the hash code for a particular key, you can just call the
hashCode() method on it. There is a caveat, however. The hash
code will be a 32-bit integer. Hopefully you will have figured
out that half of those numbers will be negative. Another decision
made in Java was that the modulo operator can return negative
numbers. This is unfortunate for us since we can't have negative
indices into our array. "Ah hah!", you say, "we can just use
abs()...". This is almost correct except for one single value. In
two's complement (the way integers are represented in Java), the
most negative number does not have a positive counterpart (just
as zero doesn't have a counterpart). So the cool kid way to
handle the hash codes is to mask out the highest bit to force the
number to be non-negative (note that this does not make -1 now be
1... but we don't care either). So, given a table with M entries,
the index we are interested in would be
(key.hashCode() & 0x7fffffff) % M. The parentheses are required
because % binds more tightly than & in Java.
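To make the caveat concrete, here is a small sketch comparing the masking approach with the tempting abs() version. The class and method names (HashIndexDemo, indexFor, absIndexFor) are made up for illustration and are not part of the required interface.

```java
/** Illustrative helper; the names here are assumptions, not part of the assignment. */
class HashIndexDemo {
    /** Maps a key to a slot in a table with m entries by clearing the sign bit first. */
    static int indexFor(Object key, int m) {
        // Parentheses matter: without them, % would be applied before &.
        return (key.hashCode() & 0x7fffffff) % m;
    }

    /** The tempting alternative: goes negative when hashCode() is Integer.MIN_VALUE. */
    static int absIndexFor(Object key, int m) {
        // Math.abs(Integer.MIN_VALUE) is still Integer.MIN_VALUE.
        return Math.abs(key.hashCode()) % m;
    }
}
```

An Integer key holding Integer.MIN_VALUE has that same value as its hash code, so absIndexFor returns a negative index for it while indexFor returns 0.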
Other than that, everything should be very straightforward. The
only thing to remember is that since we are implementing a Map
you need to check for previous entries for the same key so you
can change the value (so you will walk the entire linked list if
the key isn't there).
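One possible shape for this, as a rough sketch only: the table is an Object[] whose slots are cast back to Node on each read, and put walks the chain before adding anything. The class name ChainedMapSketch and the helpers bucketAt and indexFor are invented for the example. It also assumes put returns the previous value the way java.util.Map does, and it does not handle null keys; check both against the course's Map interface.

```java
/** Sketch of chained put/get over an Object[] table; helper names are illustrative. */
class ChainedMapSketch<K, V> {
    /** One link in a chain: a key, a value, and a next pointer. */
    private class Node {
        final K key;
        V value;
        Node next;

        Node(K key, V value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

    // Stored as Object[] and cast on access, as described above.
    private final Object[] table;
    private int size;

    ChainedMapSketch(int capacity) {
        table = new Object[capacity];
    }

    @SuppressWarnings("unchecked")
    private Node bucketAt(int index) {
        return (Node) table[index];
    }

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % table.length;
    }

    /** Replaces the value if the key is already chained; otherwise adds a new node. */
    V put(K key, V value) {
        int index = indexFor(key);
        for (Node cur = bucketAt(index); cur != null; cur = cur.next) {
            if (cur.key.equals(key)) {
                V old = cur.value;
                cur.value = value;
                return old;
            }
        }
        // Walked the whole chain without a match, so the key is new.
        table[index] = new Node(key, value, bucketAt(index));
        size++;
        return null;
    }

    /** Returns the value stored for the key, or null if the key is absent. */
    V get(K key) {
        for (Node cur = bucketAt(indexFor(key)); cur != null; cur = cur.next) {
            if (cur.key.equals(key)) {
                return cur.value;
            }
        }
        return null;
    }

    int size() {
        return size;
    }
}
```

New nodes go on the front of the chain here because that avoids tracking the tail; appending at the end would work just as well.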
I will add one additional requirement, however. For the BSTMap,
most of you maintained a Set object to keep track of the keys at
all times. This time I would like you to build the set from
scratch each time the keySet() method is called. You will create
a new Set object and walk the table finding all of the keys and
putting them in the Set. You are welcome to use a TreeSet again,
or a HashSet if you want to continue the theme.
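A minimal sketch of that walk, assuming a TreeSet (so keys must be Comparable) and a stripped-down table that stores only keys. KeySetSketch, addKey, and bucketAt are hypothetical names used to keep the example self-contained; only keySet() corresponds to the method named above.

```java
import java.util.Set;
import java.util.TreeSet;

/** Sketch of keySet() rebuilt on every call; all names except keySet are illustrative. */
class KeySetSketch<K extends Comparable<K>> {
    private static class Node<K> {
        final K key;
        final Node<K> next;

        Node(K key, Node<K> next) {
            this.key = key;
            this.next = next;
        }
    }

    private final Object[] table;

    KeySetSketch(int capacity) {
        table = new Object[capacity];
    }

    @SuppressWarnings("unchecked")
    private Node<K> bucketAt(int index) {
        return (Node<K>) table[index];
    }

    /** Demo helper: chains a key without storing a value or checking for duplicates. */
    void addKey(K key) {
        int index = (key.hashCode() & 0x7fffffff) % table.length;
        table[index] = new Node<>(key, bucketAt(index));
    }

    /** Walks every slot and every chain, collecting the keys into a fresh set. */
    Set<K> keySet() {
        Set<K> keys = new TreeSet<>();
        for (int i = 0; i < table.length; i++) {
            for (Node<K> cur = bucketAt(i); cur != null; cur = cur.next) {
                keys.add(cur.key);
            }
        }
        return keys;
    }
}
```

Because the set is rebuilt each time, a caller who modifies the returned set does not affect the map.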
Part two: Building a database
We are now going to build a very simple database called
SimpleDatabase. This database will only have a single table, it
will only store String data, and we can only add things to it and
look things up. We cannot edit or delete records. The interface
for the class looks like this:
public SimpleDatabase(String[] fields)
- This is the constructor that initially sets up the database. The input here is an array of Strings which provide the names for the columns in the table. This also defines how many columns are available.
public void addRecord(String[] record) throws DataFormatException
- This method allows us to add new records to the database. A record is passed in as an array of Strings. The length of the record should exactly match the number of fields in the database (and be in the same order). If the number of fields doesn't match, this should throw a DataFormatException.
public String[] fields()
- This function is a simple getter for checking the names of the fields in the table. It essentially returns the same thing that was passed in to the constructor.
public int size()
- This returns the number of records that are currently present in the database.
public Iterator<String[]> find(String query)
- This function is for looking data up in the database. The query should be a single word and this will return an Iterator that will walk through all of the records that contain this word in any field. If no records are found, this will return null.
public Iterator<String[]> find(String query, String field) throws NameNotFoundException
- This is the slightly more specific version of the find() method. This will only return records in which query appears in the specified field. If the field name doesn't exist, this throws a NameNotFoundException. If the query returns no results, null is returned.
Implementation
The basic storage for our database is just an
ArrayList<String[]>. In other words, we can just take the data
that is passed into the addRecord method and add it on to the end
of the list. The interesting part of the database is the way we
implement find.
In order to have a fast, full-text search of our database, we are
going to build something called an inverted index. This is
basically a big lookup table that maps from individual words to
the records in our database. We are going to implement ours using
a HashMap (you knew we were going to use it somewhere in here).
Our hash map will have a String for the key values (individual
words) and a List of lookup records that tell us where to find
the word (yes, the values in our hash map will be Lists).
In order to implement this you will need to create yet another
Node class. Call this one IndexRecord. This should have two
fields: one that holds the index of the record the word appears
in and one that holds the field within that record (so if the
word "lemur" appears in the second column of the third record we
enter into the database, we would create a new IndexRecord that
held 2 and 3). You should also implement the equals method on
this class to make it easier to eliminate duplicate records (in
case a word shows up multiple times in the same field).
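As a sketch, the class could look something like this. The field names recordIndex and fieldIndex are assumptions, and the hashCode override is an addition the text does not ask for; it is included only because Java convention pairs it with equals.

```java
/** Sketch of the lookup record; field names are illustrative. */
class IndexRecord {
    final int recordIndex; // which record the word appears in
    final int fieldIndex;  // which field within that record

    IndexRecord(int recordIndex, int fieldIndex) {
        this.recordIndex = recordIndex;
        this.fieldIndex = fieldIndex;
    }

    /** Two records are equal when they point at the same record and the same field. */
    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof IndexRecord)) {
            return false;
        }
        IndexRecord that = (IndexRecord) other;
        return recordIndex == that.recordIndex && fieldIndex == that.fieldIndex;
    }

    /** Not required by the text; kept consistent with equals. */
    @Override
    public int hashCode() {
        return 31 * recordIndex + fieldIndex;
    }
}
```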
In order to build this inverted index, you will have to do a
little more work in addRecord. First, you add the record to the
main ArrayList. Then, figure out which index that record is
stored at (if you just added it in on the end, it will be the
length of the list - 1). Next, iterate through the fields of the
record. For each field, iterate through each word (using a
Scanner, for instance). For each word, start by creating a new
IndexRecord item. Then, check if the word is in the index yet. If
it is, look it up, get the List, and look at the last item. If
the last item doesn't match the current record, then add it to
the List (if it does match, just move on; we only need one copy
of each record). If the index doesn't know about the word yet,
create a new List (any kind you like), add the IndexRecord to it,
and then put it into the hash map. To support case-insensitive
searching, make sure to convert the word to lowercase before
putting it in the map.
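The steps above could be sketched roughly as follows. To keep the example self-contained, java.util.HashMap stands in for the HashMap from part one, the DataFormatException length check is left out, and the names InvertedIndexSketch and lookup are invented for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Scanner;

/** Sketch of addRecord building an inverted index; not the required class layout. */
class InvertedIndexSketch {
    static class IndexRecord {
        final int recordIndex;
        final int fieldIndex;

        IndexRecord(int recordIndex, int fieldIndex) {
            this.recordIndex = recordIndex;
            this.fieldIndex = fieldIndex;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof IndexRecord)) {
                return false;
            }
            IndexRecord that = (IndexRecord) other;
            return recordIndex == that.recordIndex && fieldIndex == that.fieldIndex;
        }

        @Override
        public int hashCode() {
            return 31 * recordIndex + fieldIndex;
        }
    }

    private final List<String[]> records = new ArrayList<>();
    // Assumption: java.util.HashMap used here in place of your own HashMap.
    private final Map<String, List<IndexRecord>> index = new HashMap<>();

    void addRecord(String[] record) {
        records.add(record);
        int recordIndex = records.size() - 1;
        for (int field = 0; field < record.length; field++) {
            Scanner words = new Scanner(record[field]);
            while (words.hasNext()) {
                // Lowercase first so lookups can be case-insensitive.
                String word = words.next().toLowerCase();
                IndexRecord entry = new IndexRecord(recordIndex, field);
                List<IndexRecord> hits = index.get(word);
                if (hits == null) {
                    hits = new ArrayList<>();
                    hits.add(entry);
                    index.put(word, hits);
                } else if (!hits.get(hits.size() - 1).equals(entry)) {
                    // Only the last item needs checking: entries arrive in order.
                    hits.add(entry);
                }
            }
        }
    }

    /** Demo accessor: the raw list of hits for a word, or null if the word is unknown. */
    List<IndexRecord> lookup(String word) {
        return index.get(word.toLowerCase());
    }
}
```

Checking only the last item is enough because records and fields are processed in increasing order, so any duplicate of the current entry can only be the most recent one added.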
The find functionality is relatively straightforward. The first
thing you should do is to create a new Iterator. Call this
IndexIterator and it should implement Iterator<String[]> (Iterator
is an interface, so we implement it rather than extend it). Give
this a constructor that accepts a List<IndexRecord>. You can then
maintain an index into the List and walk through it in the usual
way.
There is one wrinkle, however. We would rather not return the
same record more than once. While we eliminated exact matches, we
may still have multiple IndexRecords that point to the same
record index if a word appears in multiple fields. So, in the
next() function, when you advance your index into the list,
continue to advance it until you hit a record that has a
different index.
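Here is one way the iterator might look, as a sketch. The second constructor argument (the list of records) is an assumption added so the example stands alone; if you write IndexIterator as an inner class of SimpleDatabase it can reach the record list directly and keep the single-argument constructor described above. The field names on IndexRecord are also illustrative.

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Compact stand-in for the IndexRecord class described earlier. */
class IndexRecord {
    final int recordIndex;
    final int fieldIndex;

    IndexRecord(int recordIndex, int fieldIndex) {
        this.recordIndex = recordIndex;
        this.fieldIndex = fieldIndex;
    }
}

/** Sketch of an iterator that returns each matching record once. */
class IndexIterator implements Iterator<String[]> {
    private final List<IndexRecord> hits;
    private final List<String[]> records; // assumption: passed in for the demo
    private int position = 0;

    IndexIterator(List<IndexRecord> hits, List<String[]> records) {
        this.hits = hits;
        this.records = records;
    }

    @Override
    public boolean hasNext() {
        return position < hits.size();
    }

    @Override
    public String[] next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        int recordIndex = hits.get(position).recordIndex;
        // Skip every following hit that points at the same record.
        while (position < hits.size() && hits.get(position).recordIndex == recordIndex) {
            position++;
        }
        return records.get(recordIndex);
    }

    @Override
    public void remove() {
        throw new UnsupportedOperationException("records cannot be deleted");
    }
}
```

Skipping forward works because hits for the same record sit next to each other in the list, which follows from the order in which addRecord appends them.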
This makes the simple find method pretty easy to write. Look the
query up in the inverted index. If it isn't found, return null;
otherwise, create a new IndexIterator with the associated List
and return that.
The second find method is very similar. However, we only want the
records that reference a particular field. Create a new List
object and iterate through the full List that was stored in the
index. If the IndexRecord references the field, put it in the new
List, otherwise skip it. You can then pass this new List to the
constructor of your iterator.
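The filtering step might be sketched like this. FieldFilterSketch, fieldPosition, and filterByField are hypothetical helper names; in your own code this logic would probably live inside find. The sketch returns -1 for an unknown column, which is the point where find would throw its NameNotFoundException.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the field filter used by the two-argument find; names are illustrative. */
class FieldFilterSketch {
    static class IndexRecord {
        final int recordIndex;
        final int fieldIndex;

        IndexRecord(int recordIndex, int fieldIndex) {
            this.recordIndex = recordIndex;
            this.fieldIndex = fieldIndex;
        }
    }

    /** Returns the position of name in fields, or -1 if there is no such column. */
    static int fieldPosition(String[] fields, String name) {
        for (int i = 0; i < fields.length; i++) {
            if (fields[i].equals(name)) {
                return i;
            }
        }
        return -1;
    }

    /** Copies only the hits that reference the requested field, preserving order. */
    static List<IndexRecord> filterByField(List<IndexRecord> hits, int fieldIndex) {
        List<IndexRecord> kept = new ArrayList<>();
        for (IndexRecord hit : hits) {
            if (hit.fieldIndex == fieldIndex) {
                kept.add(hit);
            }
        }
        return kept;
    }
}
```

If the filtered list comes back empty, find should return null rather than an iterator, per the interface description.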
Part three: Browsing a database
The final part of this assignment is to build a very simple command line browser for your database using the same general strategy that we have used in the past. Create a new class called DatabaseExplorer. This class should have the following interface:
public DatabaseExplorer(String filename) throws FileNotFoundException
- This is the constructor of our class, it will take in the file that we will use to load the database.
public void cliHandler(Scanner in, PrintWriter out)
- This will be our old friend the command line interface. When it starts up, it should print out "Database contains X records" (where X is the number of records present in the database). The user can then issue single word queries. The response should start with a line listing the column names of the fields, separated by tabs (\t). Then it will print out each matching record with the fields also separated by tab characters. If the user writes two words, the first word is interpreted as a column name and only results matching that field should be printed. If there are no matches, this should print "No records found".
public static void main(String[] args) throws FileNotFoundException
- This is, of course, main. It will take in the name of the file to use for the database in args. It will use this to create a new DatabaseExplorer and then call cliHandler on it.
Most of this should seem fairly straightforward. The last detail is about the files that we will be getting our data from. The data we will be working with will be stored in CSV files (comma-separated values). CSV files are very common, and conceptually very simple. They are plain text files in which we store tabular data. Each record in the table is stored on a single line, with the fields separated by commas. Typically, the first row is the header, providing the names of the fields. That's really it. Well, sort of. Some CSV files don't actually use commas as the delimiter (some use tabs or semicolons or something else). There are various ways to handle the problem of wanting to use the chosen delimiter in a record field (a typical approach is to put quotes around the field... which then means you need to escape quotes that you want to include in the field...). There really isn't a standard for this. You can learn more about CSV files on Wikipedia. For our purposes, we will go for the simplest form: commas separate fields, the first line provides column headings, and we don't allow commas inside of a field. You can make a CSV file out of any kind of data you find interesting. Here is a short example (some information about lemurs), and a considerably longer example (a sample of some of my music collection) that you can play with.
When you implement the constructor, you will take in each line, split it based on the commas (the split() function would be a good choice), and then add it to the database as a record. The cliHandler will then loop through the Scanner, using hasNext() and nextLine(), handling queries until the user types "quit".
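For the loading step, a sketch of the line splitting is shown below. CsvSketch and readRows are made-up names, and the method takes a Scanner rather than a filename so the example can run without a file on disk. One detail worth knowing: split(",") silently drops trailing empty fields, so the sketch passes a limit of -1 to keep the column count constant. Skipping blank lines is an extra assumption, not something the assignment requires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

/** Sketch of reading the simple CSV form: commas only, no quoting. */
class CsvSketch {
    /** Returns every non-blank line split into fields; the first row is the header. */
    static List<String[]> readRows(Scanner in) {
        List<String[]> rows = new ArrayList<>();
        while (in.hasNextLine()) {
            String line = in.nextLine();
            if (line.trim().isEmpty()) {
                continue; // assumption: ignore blank lines
            }
            // Limit -1 keeps trailing empty fields such as the last column of "a,b,".
            rows.add(line.split(",", -1));
        }
        return rows;
    }
}
```

In the constructor, the first row would go to the SimpleDatabase constructor as the field names and each later row to addRecord.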
Saving time
At this point I think you have the fundamentals about setting up classes under your belt, and I want you to focus on the important parts of this assignment. So, I am providing you with some starter code that will just get you up and running. This is just the framework, so don't get too excited, but it should save you some typing. The other thing to remember is that you are just reimplementing the Map interface so there is no need to write a whole new battery of tests. Just take the regression tests from your last assignment and copy them into this one (remembering, of course, to change the map to your HashMap).
Bonus round
Okay, last chance for a little bit of extra credit. Please make sure to indicate what you attempted.
- [5 points] Implement table resizing
  - Add a second constructor to the HashMap that allows you to specify a maximum load factor. When an addition would cause you to exceed the load factor, create a new array that is twice the size and rehash all of your entries into it.
- [10 points] Implement linear probing
  - Create an additional class called LinearHashMap that uses linear probing instead of external chaining. You can double dip and implement a target load factor (under 1) and resizing for this as well. For deletions, rehash the entries in the slots that follow the deleted entry, up to the next empty slot.
- [5 points] Support commas in the CSV fields
  - Support the use of quoted fields to allow commas to be used. This should also support escaped quotes within the quoted field.
Grading
Points | Item | Comments |
---|---|---|
Correctness | | |
15 pts | HashMap | |
30 pts | SimpleDatabase | |
15 pts | DatabaseExplorer | |
Misc | | |
10 pts | Comments | All methods and classes should have Javadoc style comments and non-trivial code should have explanations. |
Book keeping
I would like you to continue to book keep your time. So please log the amount of time that you spend on:
- initial setup (creating the classes, putting in the methods and the javadoc comments)
- writing tests
- implementation (please break this down for each of the four methods you are writing)
- debugging
Try to be as accurate as possible (it would be great if you didn't just get to the end and go "oh right, I need to keep track of my time... well it seemed like five hours..."). I am using this to (a) figure out if you (as a class, and also on a personal level) need more direction on how to direct your energies, and (b) inform the shape of future assignments. The more accurate this information is, the better I can do those two things, so please do not exaggerate in an attempt to get me to trim the assignments down or underestimate because you feel like your classmates didn't need as much time as you. Please include these in a comment at the top of the AssignmentSix class.