CS 452 Homework 5 - Clustering using K-means

Due: Wednesday 11/15, at noon

Credit: this assignment was adapted from an assignment by Joe Redmon for the University of Washington's Machine Learning course.

In this assignment you will implement the K-means algorithm in Python and run it on country survey data. You can work on this assignment in groups of two or by yourself.

The data

The data comes from a UN survey on people's political priorities. You can find the original data here. People from 194 countries were asked about 16 priorities, ranging from action taken on climate change to internet access to job opportunities. Download the data file country.csv, which contains the data aggregated by country. Each row lists, for one country, the relative importance of each of the 16 priorities (a number between 0 and 1).
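As a starting point for reading the file, here is a minimal sketch. The layout is an assumption, not something stated above: it supposes a header row, the country name in the first column, and the 16 numeric values in the remaining columns. Check country.csv and adjust if it differs. The function name load_countries and the sample rows are made up for illustration.

```python
import csv
import io

import numpy as np


def load_countries(f):
    """Read (name, value, value, ...) rows from an open CSV file object."""
    reader = csv.reader(f)
    header = next(reader)              # assumed: first row holds column names
    names, points = [], []
    for row in reader:
        if not row:                    # skip blank lines
            continue
        names.append(row[0])           # assumed: first column is the country name
        points.append(np.array([float(v) for v in row[1:]]))
    return header[1:], names, points


# Tiny made-up sample in the assumed layout (not real survey data).
sample = io.StringIO(
    "country,priority_a,priority_b\n"
    "Examplia,0.25,0.75\n"
    "Testland,0.5,0.5\n"
)
features, names, points = load_countries(sample)
print(features, names, points[0])
```

With the real file you would pass an open file instead of the StringIO object, for example `with open("country.csv", newline="") as f: features, names, points = load_countries(f)`.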

You will cluster the data to find which countries are similar based on what the residents of those countries care about.

The algorithm

You should implement the K-means algorithm as follows:

  1. Select K starting centroids that are points from your data set. Your program should support both options: selecting these centroids randomly, or simply using the first K points (countries).

  2. Assign each data point \(x_i\) to the cluster associated with the nearest of the K centroids. Store the index of the nearest centroid in \(c_i\).

  3. Re-calculate the centroids as the mean vector of each cluster from (2).

  4. Repeat steps (2) and (3) until convergence or until an iteration limit is reached.

Convergence means that there was no change in label assignment from the previous iteration.
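The four steps above can be sketched roughly as follows. This is one possible shape under stated assumptions, not a reference solution: the function name kmeans, the max_iter default, and the choice to keep the old centroid when a cluster ends up empty are all assumptions that the assignment does not specify. It uses the "first K points" initialization and runs on four made-up 2-D points.

```python
import numpy as np


def kmeans(points, K, max_iter=100):
    """Plain K-means on a list of 1-D NumPy arrays, seeded with the first K points."""
    centroids = [p.copy() for p in points[:K]]            # step 1
    labels = None
    for _ in range(max_iter):                             # step 4: repeat
        # Step 2: index of the nearest centroid for every point.
        new_labels = [
            min(range(K), key=lambda k: np.sum((p - centroids[k]) ** 2))
            for p in points
        ]
        if new_labels == labels:                          # no label changed: converged
            break
        labels = new_labels
        # Step 3: each centroid becomes the mean vector of its members.
        for k in range(K):
            members = [p for p, c in zip(points, labels) if c == k]
            if members:                                   # assumption: keep old centroid if empty
                centroids[k] = np.mean(members, axis=0)
    return centroids, labels


# Made-up toy data: two obvious groups.
points = [np.array([0.0, 0.0]), np.array([10.0, 10.0]),
          np.array([0.0, 1.0]), np.array([10.0, 11.0])]
centroids, labels = kmeans(points, K=2)
print(labels)
print(centroids)
```

Comparing squared distances is enough for step 2, since taking the square root does not change which centroid is nearest.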

Optimization objective

The goal of clustering can be thought of as minimizing the variation within groups and consequently maximizing the variation between groups.

We have m = 194 data points (countries) and n = 16 features. Using a notation similar to the Coursera course, but adopting Python's indexing from 0, we have

\(x_i\) = n-dimensional data point, \(i\) = 0...m-1

\(\mu_k\) = cluster centroid, \(k\) = 0...K-1

\(c_i\) = index of cluster (0...K-1) to which data point \(x_i\) is currently assigned

\(\mu_{c_i}\) = centroid of cluster to which data point \(x_i\) is currently assigned

The optimization objective (cost) \(J\) is defined as the average squared Euclidean distance between each data point and its nearest centroid:

\[ J = \frac{1}{m} \sum_{i=0}^{m-1} \| x_i - \mu_{c_i} \|^2 \]

A good model has a low sum of squared differences within each group, and therefore minimizes the above objective.
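To make the definition of \(J\) concrete, here is a small sketch that evaluates it on three made-up points with two centroids. The function name cost and the toy values are illustrative assumptions. The squared distances are 1, 1 and 0, so the result should be 2/3.

```python
import numpy as np


def cost(points, centroids, labels):
    """Average squared Euclidean distance from each point to its assigned centroid."""
    m = len(points)
    total = sum(np.sum((x - centroids[c]) ** 2) for x, c in zip(points, labels))
    return total / m


# Made-up toy values: x_0 and x_1 belong to cluster 0, x_2 to cluster 1.
points = [np.array([0.0, 0.0]), np.array([0.0, 2.0]), np.array([4.0, 0.0])]
centroids = [np.array([0.0, 1.0]), np.array([4.0, 0.0])]
labels = [0, 0, 1]
J = cost(points, centroids, labels)
print(J)
```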

Implementation

I'm not providing any starter code for this assignment -- you'll have to write everything from scratch. Create a new program kmeans.py and make sure to use python3 for testing. (If you're really stumped, I can provide a code skeleton for a 10% grade penalty.)

I recommend using NumPy arrays (vectors) for each \(x_i\) and \(\mu_k\), but you can use lists for everything else. We don't have to distinguish row vectors from column vectors, so you can use 1-D NumPy arrays.

I suggest you proceed in the following order:

At this point, you can compare your results with my sample results for K = 1 .. 5. You should get the same numbers, assuming you initialized your clusters to the first K datapoints (i.e., the first five countries: Afghanistan, Albania, ...).

Here are some more sample results with 100 repetitions for K = 1 .. 3. I'm listing the size of the cluster containing the USA to help with debugging. You should get the same numbers. I'm also listing how many times the minimum was found (e.g. 5/100). That number will differ from run to run since it depends on the randomization. Note that for larger K, 100 repetitions may not be enough to find the global minimum reliably.
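The repetition scheme described above (many randomly initialized runs, keeping the minimum cost and counting how often it was reached) might look roughly like this. Everything here is a sketch under assumptions: the names run_kmeans and best_of, the seed parameter, and the use of a single 2-D array X with broadcasting instead of a list of 1-D arrays are choices made for brevity, and the six data points are made up. In place of the USA, the example reports the size of the cluster containing row 0.

```python
import numpy as np


def run_kmeans(X, K, rng, max_iter=100):
    """One K-means run from K randomly chosen rows of X; returns (J, labels)."""
    X = np.asarray(X, dtype=float)
    # Fancy indexing copies the rows, so updating centroids leaves X untouched.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Squared distances from every point to every centroid, shape (m, K).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = d2.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for k in range(K):
            if np.any(labels == k):            # assumption: keep old centroid if empty
                centroids[k] = X[labels == k].mean(axis=0)
    J = ((X - centroids[labels]) ** 2).sum(axis=1).mean()
    return J, labels


def best_of(X, K, repetitions=100, seed=0):
    """Repeat random-start K-means; return (minimum J, times found, labels)."""
    rng = np.random.default_rng(seed)
    results = [run_kmeans(X, K, rng) for _ in range(repetitions)]
    best_J = min(J for J, _ in results)
    hits = sum(1 for J, _ in results if np.isclose(J, best_J))
    best_labels = next(lab for J, lab in results if np.isclose(J, best_J))
    return best_J, hits, best_labels


# Made-up toy data: two groups of three points each.
X = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
best_J, hits, best_labels = best_of(X, K=2, repetitions=20)
size_with_row0 = int(np.sum(best_labels == best_labels[0]))
print(f"min J = {best_J:.4f}, found {hits}/20, cluster of row 0 has {size_with_row0} points")
```

Costs are compared with np.isclose rather than ==, because two runs that reach the same clustering can differ in the last floating-point digits.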

Experiments

Experiment with your program and summarize your findings in a brief report:

  1. Run your algorithm for K = 1...30 and keep track of the minimum costs J obtained. Create a plot of J vs. K, either using matplotlib, or by loading the values into a spreadsheet. (You might want to try multiple runs to see how much your results vary.) Include your plot in your report.

  2. Based on your plot, can you choose a "best" value of K for this dataset? How / why, or why not? If the plot does not suggest a good value for K, what other criteria could you use? In either case, please pick a value for K and explain why you chose this value.

  3. For your optimal value of K, examine the resulting clusters, and also how their cluster centers differ from the average over all countries. What general trends do you see in this data? For example, how well balanced are the clusters? List the countries in each cluster. Do the countries in each cluster appear to be related?

  4. Pick a country you are interested in. It could be the country you are from, somewhere you have visited, or a country you would like to learn more about. What cluster does this country belong to? What sets this cluster apart from other countries? Are the countries in this cluster related somehow (geographically, politically, economically)? Are there any unexpected countries in this cluster?
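For question 3, one possible way to see how a cluster center differs from the average over all countries is to subtract the overall mean from each cluster mean and sort the features by the size of the difference. The sketch below does this on made-up numbers. The function name centroid_offsets and the feature names "a" and "b" are hypothetical, and the labels are assumed to come from your own K-means run.

```python
import numpy as np


def centroid_offsets(X, labels, feature_names):
    """Per cluster: features sorted by |cluster mean - overall mean|, largest first."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall = X.mean(axis=0)                   # average over all rows
    report = {}
    for k in np.unique(labels):
        diff = X[labels == k].mean(axis=0) - overall
        order = np.argsort(-np.abs(diff))      # biggest deviation first
        report[int(k)] = [(feature_names[j], float(diff[j])) for j in order]
    return report


# Made-up toy values: three "countries", two "priorities".
X = [[0.2, 0.8], [0.4, 0.8], [0.9, 0.8]]
labels = [0, 0, 1]
report = centroid_offsets(X, labels, ["a", "b"])
for k, rows in report.items():
    print(k, rows)
```

A positive difference means the cluster rates that priority higher than the average country does, and a negative difference means lower.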

Submitting

Submit your program kmeans.py and your report report.pdf in PDF format using the HW 5 submission page. Make sure both your program and your report have your name(s) at the top. Only one person per team should upload the files.