This assignment was derived from an assignment by Joe Redmon, Middlebury CS major, class of 2011.5. You may work on this assignment in pairs or by yourself.

In this assignment you will be predicting income from census data using logistic regression. Specifically, you will predict the probability that a person earns more than $50k per year.

The assignment is in Python. The code should run under both python2 and python3. We will not use NumPy; instead, we will simply use lists to represent vectors. There are no matrices used in this assignment.

Start by downloading hw2.zip and unzipping it. You will be modifying the file `sgd.py`. Tests for your program are in `test.py`. To test your implementation, run:

`python3 test.py`

You will see the number of tests your implementation passes and any problems that arise.

The data is in `adult-data.csv`; it is from the Adult dataset. Open the data file with Excel and take a look.

To perform logistic regression you have to be able to calculate the logistic function:

https://en.wikipedia.org/wiki/Logistic_regression#Definition_of_the_logistic_function

Fill in the `logistic` function.
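As a point of reference, here is a minimal sketch of the logistic function \(g(z) = 1 / (1 + e^{-z})\). The single-argument signature is an assumption; check the stub in `sgd.py` for the actual one. The branch on the sign of the input is an optional refinement that avoids overflow in `math.exp` for large negative inputs.

```python
import math

def logistic(x):
    """Return 1 / (1 + e^(-x)) without overflowing for large |x|."""
    if x >= 0:
        # exp(-x) is at most 1 here, so it cannot overflow.
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the algebraically equivalent form e^x / (1 + e^x).
    z = math.exp(x)
    return z / (1.0 + z)
```

The output is always between 0 and 1, which is what lets you read it as a probability.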

The model you are training is just a list of numerical weights. To run your model on a data point, you take the dot product of your weights and the features for that data point, then run the result through your logistic function.

Fill in the `dot` function to take the dot product of two vectors.
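Since vectors are plain Python lists here, one possible sketch of the dot product pairs up the elements with `zip` and sums the products. The two-argument signature is an assumption about the stub in `sgd.py`.

```python
def dot(x, y):
    """Dot product of two equal-length lists of numbers."""
    return sum(a * b for a, b in zip(x, y))
```

Note that `zip` silently stops at the shorter list, so a length mismatch between weights and features will not raise an error by itself.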

Now that you can calculate the dot product, predicting new data points should be easy! Fill in the `predict` function to run your model on a new data point. Take a look at `test.py` to see what the format for data points is.

Prediction should be straightforward: to predict a new point, multiply your model's weights by the corresponding features, sum the results, and pass the sum through the logistic function. Your `dot` and `logistic` functions do most of the work.
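The description above can be sketched as follows. The data point format used here (a dict with a `"features"` list) is an assumption made for illustration; the real format is whatever `test.py` uses, so adapt the lookup accordingly.

```python
import math

def logistic(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def predict(model, point):
    """Probability that the point belongs to the positive class.

    Assumption: `point` is a dict whose "features" entry is a list of
    numbers with the same length as `model`.
    """
    return logistic(dot(model, point["features"]))
```

With all weights at zero, every prediction is exactly 0.5, which is a handy sanity check before training.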

Once you start training your model you are going to want to know how well you are doing. Modify the `accuracy` function to calculate your accuracy on a dataset given a list of data points and the associated predictions.
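One way to compute accuracy is sketched below. It makes two assumptions that you should verify against `test.py`: each data point is a dict with a `"label"` entry equal to 0 or 1, and each prediction is a probability that counts as a positive prediction when it is at least 0.5.

```python
def accuracy(data, predictions):
    """Fraction of points whose thresholded prediction matches the label.

    Assumptions: point["label"] is 0 or 1, and predictions are
    probabilities thresholded at 0.5.
    """
    if not data:
        return 0.0
    correct = 0
    for point, p in zip(data, predictions):
        predicted_label = 1 if p >= 0.5 else 0
        if predicted_label == point["label"]:
            correct += 1
    # float() keeps the division exact under python2 as well.
    return float(correct) / len(data)
```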

Fill in the `train` and `update` functions to train your model! You should use logistic regression with L2 regularization where `alpha` is the learning rate and `lambd` is the regularization parameter.

The training should run for some number of `epochs` performing stochastic gradient descent:

https://en.wikipedia.org/wiki/Stochastic_gradient_descent

This means you will randomly select a point from the dataset and run the model on that data point. Then you will calculate the error for that point and adjust your model weights based on the gradient of that error. An epoch refers to a full pass over the dataset. In practice it is easier (and more statistically valid) to sample randomly with replacement. Thus an epoch just means examining `m` data points where `m` is the number of points in your training data.
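Sampling with replacement can be sketched as below. The helper name `sample_epoch` and its optional `rng` parameter are hypothetical, introduced only for illustration; inside your `train` function a plain loop calling `random.choice` does the same job.

```python
import random

def sample_epoch(data, rng=random):
    """Yield len(data) points drawn uniformly at random, with replacement.

    `rng` can be the random module or a random.Random instance, which
    makes runs reproducible when seeded.
    """
    m = len(data)
    for _ in range(m):
        yield rng.choice(data)
```

Because the draws are with replacement, some points may appear several times in one epoch and others not at all.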

This differs from batch gradient descent, where you look at all of the data points before each update. SGD often converges faster but can be less stable, because each step uses a noisy estimate of the gradient instead of the true gradient. In practice SGD often works much better than full batch gradient descent.

It might be a good idea to print the accuracy after each epoch, so you can see if (and how fast) training converges.

When you see a new data point \(x\), your prediction will be the following, where \(g\) is the logistic function:

\(h_\theta(x) = P(\mbox{income} > 50\mbox{k}\; |\; \theta, x) = g(\theta^T x)\)

To adjust the model \(\theta\) you have to calculate the gradient of the cost function at a given point. The gradient has two sources: the prediction error and the regularization term. The update rule is similar to the one from class, but remember that since we are doing SGD, you look at only one point before updating the model:

\(\theta_j := \theta_j - \alpha [ ( h_\theta(x) - y) x_j + \lambda \theta_j ]\)

Remember not to regularize \(\theta_0\).
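The update rule can be sketched in Python as follows. The signature `update(model, features, label, alpha, lambd)` is an assumption, as is the convention that `features[0]` is the constant bias feature so that `model[0]` plays the role of \(\theta_0\); the stub in `sgd.py` may pass the data point differently.

```python
import math

def logistic(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def update(model, features, label, alpha, lambd):
    """Apply one SGD step to `model` in place for a single data point.

    Assumptions: `label` is 0 or 1, and index 0 is the bias weight,
    which is left unregularized.
    """
    # Compute the prediction once, before any weight changes.
    prediction = logistic(sum(w * f for w, f in zip(model, features)))
    error = prediction - label
    for j in range(len(model)):
        gradient = error * features[j]
        if j != 0:
            gradient += lambd * model[j]  # L2 term, skipped for theta_0
        model[j] -= alpha * gradient
```

Computing the prediction before the loop matters: every \(\theta_j\) should be updated using the same \(h_\theta(x)\), not one that reflects weights already changed in this step.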

When you run `python test.py` it will tell you your current accuracy on the training and validation set. By default these are the same dataset. To get a more accurate evaluation you can modify `data.py` to use different training and validation sets by splitting your data.
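A simple way to split the data is to shuffle it and hold out a fraction for validation. The helper below is a hypothetical sketch; its name and parameters are not part of `data.py`, and the 20% default is an arbitrary choice.

```python
import random

def split_data(data, validation_fraction=0.2, seed=0):
    """Return (train, validation) lists from a shuffled copy of `data`.

    A fixed seed keeps the split the same from run to run, so accuracy
    numbers stay comparable while you tune.
    """
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * validation_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Shuffling first guards against any ordering in the CSV file ending up concentrated in one of the two sets.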

Take a look at the feature extracting code in `extract_features`, and at the raw data in `adult-data.csv`. Right now your model is only considering age, education, and one possible marital status.

Good feature extraction is often the key to making good machine learning models. Add more feature extraction rules to help improve your model's performance. This is very open ended, so be creative and find features that work well with your model.
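Two common patterns are scaling numeric columns to a small range and one-hot encoding categorical columns. The sketch below illustrates both. Everything specific in it is an assumption: the raw row is taken to be a dict of strings, and the keys `"hours_per_week"` and `"sex"` and the category values are hypothetical, so check `adult-data.csv` and `extract_features` for the real names.

```python
def one_hot(value, categories):
    """Return a list with 1.0 at the matching category and 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

def extra_features(raw):
    """Build a few additional features from one raw row.

    Assumption: `raw` is a dict of strings keyed by hypothetical
    column names.
    """
    features = []
    # Scale a numeric column so it is comparable in size to 0/1 features.
    features.append(float(raw["hours_per_week"]) / 100.0)
    # One indicator feature per category value.
    features.extend(one_hot(raw["sex"], ["Male", "Female"]))
    return features
```

Keeping features on similar scales tends to make a single learning rate work reasonably for all weights.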

Tune your `submission` function to train your final model. You should change your feature extraction and training code to produce the best model you can. Try different learning rates and regularization parameters. How do they compare? It often helps to start with a high learning rate and decrease it over time; feel free to add this to your training code.
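One possible schedule for decreasing the learning rate is inverse-time decay, sketched below. The function name, the `decay` parameter, and its default value are hypothetical choices for illustration; other schedules, such as multiplying by a constant factor each epoch, are equally reasonable.

```python
def decayed_alpha(alpha0, epoch, decay=0.1):
    """Learning rate for a given epoch: alpha0 / (1 + decay * epoch)."""
    return alpha0 / (1.0 + decay * epoch)
```

In a training loop you would call this once per epoch, for example `alpha = decayed_alpha(0.5, epoch)`, and use the result for every update in that epoch.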

Your final model will be trained on the full training data and run on test data that you don't have access to. Your grade for this section will be based on your performance relative to an untuned baseline and to the other students in class. Good luck!

Create a zip archive with **only** your .py files and submit it using the **CS 451 HW 2 submission page**. Only **one person per team** should submit. Be sure to have a comment with **both** names at the top of all your files if you work in a team.