CS 452 Homework 2 - Logistic regression using stochastic gradient descent

Due: Friday 10/6, at 1pm

This assignment was derived from an assignment by Joe Redmon, Middlebury CS major, class of 2011.5. You may work on this assignment in pairs or by yourself.

In this assignment you will be predicting income from census data using logistic regression. Specifically, you will predict the probability that a person earns more than $50k per year.

The assignment is in Python. The code should run under both python2 and python3. We will not use NumPy; instead, we'll simply use lists to represent vectors. There are no matrices used in this assignment.

Start by downloading hw2.zip and unzipping it. You will be modifying the file sgd.py. Tests for your program are in test.py. To test your implementation, run:

python3 test.py

You will see the number of tests your implementation passes and any problems that arise.

The data is in adult-data.csv; it is from the Adult dataset. Open the data file with Excel and take a look.

Logistic Function [1 point]

To perform logistic regression you have to be able to calculate the logistic function:

https://en.wikipedia.org/wiki/Logistic_regression#Definition_of_the_logistic_function

Fill in the logistic function.
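As a starting point, here is a minimal sketch of the logistic function 1 / (1 + e^(-x)). The name logistic and the single-argument signature are assumptions; match whatever sgd.py actually declares. The branch on the sign of x is an optional extra that avoids overflow in math.exp for very negative inputs.

```python
import math

def logistic(x):
    """Return 1 / (1 + e^(-x)) without overflowing for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, exp(x) is small, so this equivalent form is safe.
    e = math.exp(x)
    return e / (1.0 + e)
```

The output is always between 0 and 1, and logistic(0) is exactly 0.5.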

Dot Product [1 point]

The model you are training is just a list of numerical weights. To run your model on a data point you will need to take the dot product of your weights and the features for that data point and run the result through your logistic function.

Fill in the dot function to take the dot product of two vectors.
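One possible sketch of a dot product over plain lists is shown here. It assumes both vectors have the same length; the argument names a and b are illustrative, not necessarily those used in sgd.py.

```python
def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    # zip pairs up a[0] with b[0], a[1] with b[1], and so on.
    return sum(x * y for x, y in zip(a, b))
```

For example, dot([1, 2, 3], [4, 5, 6]) computes 1*4 + 2*5 + 3*6.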

Prediction [1 point]

Now that you can calculate the dot product, predicting new data points should be easy! Fill in the predict function to run your model on a new data point. Take a look at test.py to see what the format for data points is.

Prediction should be straightforward: to predict a new point, multiply your model's weights by the corresponding features, sum up the results, and pass the sum through the logistic function. Your dot product and logistic functions do most of the work.
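The prediction step described above might look like the following sketch. It assumes, purely for illustration, that a data point is a dict whose "features" entry is a list of numbers; check test.py for the real format, which may differ. The helper functions are repeated so the example runs on its own.

```python
import math

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def predict(model, point):
    """Probability that the point belongs to the positive class.

    Assumption: point["features"] is a list aligned with the weights in model.
    """
    return logistic(dot(model, point["features"]))
```

With all weights at zero, every prediction is 0.5, which is a handy sanity check before training.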

Accuracy [2 points]

Once you start training your model you are going to want to know how well you are doing. Modify the accuracy function to calculate your accuracy on a dataset given a list of data points and the associated predictions.
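A minimal sketch of an accuracy computation follows. It assumes each data point is a dict with a "label" entry equal to 0 or 1, that predictions are probabilities, and that a probability of 0.5 or more counts as predicting label 1. Both the data format and the threshold are assumptions; adapt them to what test.py expects.

```python
def accuracy(data, predictions):
    """Fraction of points whose thresholded prediction matches the label."""
    if not data:
        return 0.0
    correct = 0
    for point, p in zip(data, predictions):
        predicted_label = 1 if p >= 0.5 else 0  # assumed 0.5 threshold
        if predicted_label == point["label"]:
            correct += 1
    # float() keeps the division correct under python2 as well.
    return float(correct) / len(data)
```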

Train Your Model [6 points]

Fill in the train and update functions to train your model! You should use logistic regression with L2 regularization where alpha is the learning rate and lambd is the regularization parameter.

The training should run for some number of epochs performing stochastic gradient descent:

https://en.wikipedia.org/wiki/Stochastic_gradient_descent

This means you will randomly select a point from the dataset and run the model on that data point. Then you will calculate the error for that point and adjust your model weights based on the gradient of that error. An epoch refers to a full pass over the dataset. In practice it is easier (and more statistically valid) to sample randomly with replacement. Thus an epoch just means examining m data points where m is the number of points in your training data.
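The epoch structure described above can be sketched as a pair of nested loops. The function run_epochs and its step callback are hypothetical names used only for this illustration; in your code the body of the inner loop would be the weight update.

```python
import random

def run_epochs(data, epochs, step):
    """Call step(point) on m randomly drawn points per epoch."""
    m = len(data)
    for epoch in range(epochs):
        for _ in range(m):
            # random.choice samples with replacement: the same point can be
            # drawn more than once in an epoch, and some may not be drawn.
            point = random.choice(data)
            step(point)
```

Sampling this way avoids having to shuffle or track which points have already been visited.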

This is different from batch gradient descent, where you look at all of the data points before each update. SGD usually makes faster progress per pass over the data, but it can also be less stable because you have a noisy estimate of the gradient instead of the true gradient. In practice it is often much better to use SGD than full batch gradient descent.

It might be a good idea to print the accuracy after each epoch, so you can see if (and how fast) training converges.

When you see a new data point \(x\), your prediction will be the following, where \(g\) is the logistic function:

\(h_\theta(x) = P(\mbox{income} > 50\mbox{k}\; |\; \theta, x) = g(\theta^T x)\)

To adjust the model \(\theta\) you have to calculate the gradient of the cost function at a given point. The gradient will come from two sources, the error and the regularization. The update rule is similar to the one from class, but remember, since we are doing SGD you only look at one point before updating the model:

\(\theta_j := \theta_j - \alpha [ ( h_\theta(x) - y) x_j + \lambda \theta_j ]\)

Remember not to regularize \(\theta_0\).
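The update rule above could be implemented along the following lines. This is a sketch under several assumptions: a data point is a dict with "features" and "label" entries, the first feature is a constant 1 so that model[0] plays the role of \(\theta_0\), and the model is a list updated in place. The real signature of update in sgd.py may differ.

```python
import math

def logistic(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def update(model, point, alpha, lambd):
    """Apply one SGD step for a single data point, modifying model in place."""
    features = point["features"]
    # Compute h(x) - y once, before any weight changes.
    error = logistic(dot(model, features)) - point["label"]
    for j in range(len(model)):
        gradient = error * features[j]
        if j != 0:
            # Assumption: index 0 is the bias weight, which is not regularized.
            gradient += lambd * model[j]
        model[j] -= alpha * gradient
```

Computing the error before the loop matters: every weight should be updated using the prediction made by the old model.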

When you run test.py it will tell you your current accuracy on the training and validation sets. By default these are the same dataset, so the validation accuracy says nothing about how well the model generalizes to unseen data. To get a more realistic evaluation you can modify data.py to use different training and validation sets by splitting your data.
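One way to split a dataset is sketched here. The helper split_data is hypothetical and is not part of data.py; how you wire something like it into data.py depends on how that file loads the data. A fixed seed keeps the split the same from run to run, so accuracy numbers stay comparable.

```python
import random

def split_data(data, validation_fraction=0.2, seed=0):
    """Return (train, validation) lists after a reproducible shuffle."""
    shuffled = list(data)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```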

Extract Better Features [4 points]

Take a look at the feature extracting code in extract_features, and at the raw data in adult-data.csv. Right now your model is only considering age, education, and one possible marital status.

Good feature extraction is often the key to making good machine learning models. Add more feature extraction rules to help improve your model's performance. This is very open-ended: be creative and find features that work well with your model.
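Two common kinds of rules are rescaling a numeric column and one-hot encoding a categorical column. The sketch below illustrates both. The function names, the column keys "hours_per_week" and "sex", and the category strings are all hypothetical; look at extract_features and adult-data.csv for the names and values actually used.

```python
def one_hot(value, categories):
    """Return a list with 1.0 in the slot matching value, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

def extra_features(raw):
    """Build additional features from one raw record (a dict of strings)."""
    features = []
    # Rescale a numeric column so it is roughly between 0 and 1.
    features.append(float(raw["hours_per_week"]) / 100.0)
    # strip() guards against stray spaces around CSV values.
    features.extend(one_hot(raw["sex"].strip(), ["Male", "Female"]))
    return features
```

Keeping features on similar scales tends to make a single learning rate work reasonably for all weights.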

Tune Your Submission [5 points]

Tune your submission function to train your final model. You should change your feature extraction and training code to produce the best model you can. Try different learning rates and regularization parameters. How do they compare? Often it is good to start with a high learning rate and decrease it over time; feel free to add this to your training code.
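A simple way to decrease the learning rate over time is an inverse decay schedule, sketched below. The function decayed_alpha and its decay parameter are illustrative choices, not something the assignment prescribes; other schedules, such as multiplying by a constant factor each epoch, are equally reasonable to try.

```python
def decayed_alpha(alpha0, epoch, decay=0.1):
    """Learning rate for a given epoch: alpha0 / (1 + decay * epoch)."""
    return alpha0 / (1.0 + decay * epoch)
```

Inside a training loop you would compute the rate once per epoch and pass it to each update in that epoch.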

Your final model will be trained on the full training data and run on test data that you don't have access to. Your grade for this section will be based on your performance relative to an untuned baseline and to the other students in class. Good luck!

Submitting

Create a zip archive with only your .py files and submit it using the CS 451 HW 2 submission page. Only one person per team should submit. Be sure to have a comment with both names at the top of all your files if you work in a team.