CS 452 Homework 4 - Spam classification using SVMs

Due: Friday 11/3, at noon

Preliminary code due Monday 10/30, at noon

You will work on this assignment in teams of 3, which I created semi-randomly.

In this assignment you will train a support vector machine (SVM) to classify email messages into spam / no spam (aka "ham"). We will use a subset of the pre-processed Enron Spam data set. We will use 8,000 training examples labeled ham/spam, and 2,000 test examples with hidden labels. Each email has been processed to remove all headers except for the subject line. In addition, each file has been "tokenized" so that all words are lower case and all words and punctuation marks are separated by spaces. You can find a small sample here.

We will use the fast, light-weight SVM implementation SVMlight by Thorsten Joachims.

Your job for this homework is to write python code that (1) extracts a useful word list from all email messages, (2) translates each training example into a feature vector and saves it in a format suitable for SVMlight, and (3) tunes the SVM algorithm for best performance. As last time, you must use python3.

Start by creating a hw4 folder and save the following archives in it:

  1. svm_light.tar.gz from the SVMlight webpage
  2. hw4-data-small.zip
  3. hw4-data.zip

Then uncompress the archives and compile SVMlight as follows:

unzip hw4-data-small.zip
unzip hw4-data.zip
rm hw4-data*zip
mkdir svm
cd svm
tar xvfz ../svm_light.tar.gz
make
cd ..
rm svm_light.tar.gz

To test the SVM, try

svm/svm_learn

svm/svm_learn data-small/sample-train.svm model.svm

The first command prints the usage; the second trains the SVM using the sample training file and saves the resulting model. Also try different values for the parameter C, e.g.

svm/svm_learn -c 1.0 data-small/sample-train.svm model.svm

You can use the saved model to classify unseen test examples:

svm/svm_classify

svm/svm_classify data-small/sample-test.svm model.svm predictions.txt

Look at all .svm files and see if you can make sense of them. The file predictions.txt contains one number per test case, and the sign of that number indicates the classification: a value >= 0 is a positive classification (spam), and a value < 0 is a negative classification (ham).
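To make the sign convention concrete, here is a minimal sketch that turns the numbers in a predictions file into +1 / -1 labels. The function name predictions_to_signs is hypothetical and not part of SVMlight or the starter code.

```python
def predictions_to_signs(lines):
    """Map each svm_classify output value to +1 (value >= 0) or -1 (value < 0).

    `lines` is any iterable of strings, e.g. an open predictions.txt file.
    Blank lines are skipped.
    """
    signs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        signs.append(1 if float(line) >= 0 else -1)
    return signs


# Possible usage (assumes predictions.txt exists in the current folder):
#   with open("predictions.txt") as f:
#       signs = predictions_to_signs(f)
#   print(signs.count(1), "positive and", signs.count(-1), "negative predictions")
```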

Writing .svm files

Your first job is to write a program that translates email messages into feature vectors suitable for SVMlight. Here is some code to get you started: spamsvm.py. Save this file in your hw4 folder and run it using

python3 spamsvm.py

You will see it creates files my-train.svm and my-test.svm in data-small/. Your goal is to match the existing files sample-train.svm and sample-test.svm.
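For orientation, each line of an SVMlight input file holds a target value followed by feature:value pairs, with feature numbers starting at 1 and listed in increasing order. The sketch below writes one such line using binary word-presence features. The names to_svmlight_line and word_index are hypothetical, and the choice of binary features with value 1 is an assumption, so the output may not match the sample files exactly; compare against sample-train.svm to check.

```python
def to_svmlight_line(label, tokens, word_index):
    """Format one example as an SVMlight line.

    label      -- +1 or -1
    tokens     -- list of words in the email
    word_index -- dict mapping each word in the word list to a feature
                  number >= 1 (hypothetical helper structure)
    """
    # Collect the feature numbers of all listed words present in the email.
    present = {word_index[t] for t in tokens if t in word_index}
    # SVMlight expects feature numbers in increasing order.
    pairs = " ".join(f"{i}:1" for i in sorted(present))
    return f"{label:+d} {pairs}".rstrip()


# Possible usage with a tiny, made-up word list:
#   index = {"hello": 1, "free": 2, "money": 3}
#   print(to_svmlight_line(1, ["free", "money", "free", "hello"], index))
```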

Once your code creates files identical to my samples, submit your code using the submission page (link below). This part of the homework is due by noon on Monday.

Running on the full data set

Try running your code on the full training set (8,000 emails) in data/training/. Since you don't have access to the labels for the test set, you'll have to write some code to split the data into a smaller training set and a validation set (and create .svm files for each). You might also want to try adding python functions that run the executables svm_learn and svm_classify for you so you can automate the testing better. Hint: look up documentation for os.system(...) or better yet subprocess.run(...).
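As a rough sketch of both ideas, the code below splits a list of examples into training and validation parts and builds the command lines shown earlier so they can be passed to subprocess.run. All function names, the 20% validation fraction, and the fixed random seed are assumptions chosen for illustration; only the svm/svm_learn and svm/svm_classify paths and the -c option come from the commands above.

```python
import random
import subprocess


def split_train_validation(items, validation_fraction=0.2, seed=0):
    """Shuffle items reproducibly and return (training, validation) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_val = int(len(shuffled) * validation_fraction)
    return shuffled[n_val:], shuffled[:n_val]


def svm_learn_command(train_file, model_file, c=None):
    """Build the svm_learn command line, optionally setting the parameter C."""
    cmd = ["svm/svm_learn"]
    if c is not None:
        cmd += ["-c", str(c)]
    return cmd + [train_file, model_file]


def svm_classify_command(test_file, model_file, predictions_file):
    """Build the svm_classify command line."""
    return ["svm/svm_classify", test_file, model_file, predictions_file]


def run(cmd):
    """Run a command, raise an error if it fails, and return what it printed."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Possible usage (file names are placeholders):
#   run(svm_learn_command("my-train.svm", "model.svm", c=1.0))
#   print(run(svm_classify_command("my-valid.svm", "model.svm", "predictions.txt")))
```

Passing the command as a list avoids shell quoting problems, and check=True makes a failed run raise an exception instead of passing silently.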

Better word lists and parameter tuning

Add some code to extract better (and longer!) word lists and build more powerful feature vectors. There are many possibilities. You could try the 1000 most frequent words. Hint: use dictionaries and perhaps consult old CS 150 homeworks. You could also try to find frequent words that appear in ham but not in spam, or vice versa. You could even try all unique words from all 10,000 emails (can SVMlight handle that many features?). You could also try non-binary features, i.e., instead of recording whether a word is present, use its frequency.
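Two of these ideas can be sketched with collections.Counter, which is a dictionary specialized for counting. The function names below are hypothetical, and each email is assumed to be already split into a list of tokens.

```python
from collections import Counter


def most_frequent_words(emails, k=1000):
    """Return the k most frequent words over all emails.

    `emails` is an iterable of token lists, one list per email.
    """
    counts = Counter()
    for tokens in emails:
        counts.update(tokens)
    return [word for word, _ in counts.most_common(k)]


def frequency_features(tokens, word_index):
    """Return {feature number: count} for the listed words in one email.

    `word_index` maps each word in the word list to its feature number.
    """
    counts = Counter(t for t in tokens if t in word_index)
    return {word_index[word]: n for word, n in counts.items()}


# Possible usage with made-up data:
#   emails = [["free", "money", "free"], ["hello", "free"]]
#   words = most_frequent_words(emails, k=2)
#   index = {w: i + 1 for i, w in enumerate(words)}
#   print(frequency_features(emails[0], index))
```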

Another idea is to do more text processing -- correct misspellings, join words that were split by the tokenizer, ... Yet another idea is to extract words from the subject line and message body separately.

Keep in mind Andrew Ng's advice about machine learning system design. For instance, examine misclassified examples and brainstorm what features would allow classifying them correctly.

Also, try different parameter settings. The most important parameter is C (and you'll need a validation set to tune it), but you could also experiment with some of the other parameters. I'm not sure whether it's worth it to try kernels (non-linear SVMs)... Discuss with your group in what situations this might be beneficial.
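One possible way to organize the search over C is sketched below: compute validation accuracy from the signs of the predicted values, then keep the candidate with the highest accuracy. The names accuracy, pick_best_c, and evaluate are hypothetical; evaluate stands for whatever code you write that trains with a given C and returns the accuracy on your validation set.

```python
def accuracy(true_labels, predicted_values):
    """Fraction of examples where the sign of the prediction matches the label.

    true_labels      -- list of +1 / -1
    predicted_values -- list of numbers as produced by svm_classify
    """
    correct = sum(
        1 for y, p in zip(true_labels, predicted_values) if (p >= 0) == (y > 0)
    )
    return correct / len(true_labels)


def pick_best_c(candidates, evaluate):
    """Try each candidate value of C and return (best C, its accuracy).

    `evaluate` is a function C -> validation accuracy (an assumption:
    you supply it, e.g. by training and classifying with SVMlight).
    """
    best_c, best_acc = None, -1.0
    for c in candidates:
        acc = evaluate(c)
        if acc > best_acc:
            best_c, best_acc = c, acc
    return best_c, best_acc


# Possible usage (my_evaluate is a placeholder for your own function):
#   best_c, best_acc = pick_best_c([0.01, 0.1, 1.0, 10.0], my_evaluate)
```

Candidates spaced by factors of 10 are a common starting point; a finer search around the best value can follow.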

Report and predictions on the test set

Your final submission will consist of your python program (add all your code to the same program spamsvm.py), your predictions on the 2000 test cases in a file predictions.txt, and a brief report (1-2 pages) that discusses all the ideas you tried and all the things you implemented. In class on Friday we will have a competition where we'll evaluate your results on the test set and create a public leaderboard as usual.

You can use svm_classify to create your predictions file, which should have 2000 lines. Each line should contain exactly one number whose sign determines the classification of the corresponding test case. That is, if the number is >= 0, the prediction is spam, and if the number is < 0 the prediction is ham. This is exactly the file format produced by svm_classify. In case it's useful, here's a data file with "fake" labels (all zero) for the 2000 test cases, from which you can create a ".svm" file to use with svm_classify: fake-test-labels.txt

Once you have translated this file into an ".svm" file, say fake-test.svm, you can create the prediction file as follows:

svm/svm_classify fake-test.svm model.svm predictions.txt

Of course, since these labels are not correct, you should ignore the accuracy reported by svm_classify.
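Before submitting, it may be worth checking that the predictions file has the required shape. The sketch below reports any problems it finds; the function name check_predictions is hypothetical, and the default of 2000 lines comes from the description above.

```python
def check_predictions(lines, expected_count=2000):
    """Return a list of problems found in a predictions file (empty if none).

    `lines` is an iterable of strings, e.g. an open predictions.txt file.
    """
    lines = list(lines)
    problems = []
    if len(lines) != expected_count:
        problems.append(f"expected {expected_count} lines, found {len(lines)}")
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if len(fields) != 1:
            problems.append(f"line {number}: expected exactly one number")
            continue
        try:
            float(fields[0])
        except ValueError:
            problems.append(f"line {number}: not a number: {fields[0]!r}")
    return problems


# Possible usage:
#   with open("predictions.txt") as f:
#       for problem in check_predictions(f):
#           print(problem)
```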

Submitting

Submit your program spamsvm.py, your test set predictions predictions.txt, and your report report.pdf using the HW 4 submission page. For the final submission, all 3 team members need to fill out this page, which records both the time each of you spent and your estimate of the fraction of the overall work you contributed. Only one person per team should upload the files.

For the preliminary deadline (Monday at noon), only submit the current version of your program spamsvm.py (one submission per group). I will ignore the time and effort estimates for those.