CS 150 - Assignment 7 - Zipf's Law

Due: Wednesday 4/19 at the beginning of class

George Kingsley Zipf was a linguist who noticed an interesting phenomenon regarding the frequencies of words in a corpus (a collection of documents). For almost any corpus, the frequency of a word (i.e., how many times it occurs) is inversely proportional to the word's frequency rank in the corpus:

frequency ∝ 1 / word_rank

For example, the most frequent word (word_rank = 1) generally occurs twice as many times as the second most frequent word (word_rank = 2), three times as many as the third, and so on. The graph below shows the word frequencies from Moby Dick, sorted by rank:

This phenomenon tends to show up not only in text data, but in a variety of naturally occurring data ranging from search engine queries to connections in social networks. It is so widespread that it has been named Zipf's law. In this lab, we will be playing with one corollary of Zipf's law.
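
To make the inverse relationship between rank and frequency concrete, here is a small sketch. It assumes an idealized version of the law (count = top_count / rank); real corpora only follow it roughly. The function name zipf_expected_counts and the count of 1200 are made up for illustration.

```python
def zipf_expected_counts(top_count, num_ranks):
    """Return the counts an idealized Zipf's law predicts for ranks 1..num_ranks.

    Assumption: count = top_count / rank. Real text only approximates this.
    """
    return [top_count / rank for rank in range(1, num_ranks + 1)]


# Hypothetical corpus whose most frequent word occurs 1200 times.
for rank, count in enumerate(zipf_expected_counts(1200, 5), start=1):
    print(f"rank {rank}: about {count:.0f} occurrences")
```

Under this assumption, the second-ranked word gets half the count of the first, the third gets a third, and so on.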

Zipf's law take two

One of the corollaries of Zipf's law is that if you count the number of words that occur just once, twice, three times, etc., in a corpus and plot them, you see a similar distribution to that above. This is because there are lots of words that occur just once, somewhat fewer that occur twice, even fewer that occur three times, etc. For example, here is a plot of this for one of Dr. Seuss's books:

Here the x-axis is the number of times a word occurs in the corpus (the quantity on the y-axis of the previous plot), and the y-axis is the number of different words that occur that many times. Lots of words occur only once (~135 words), fewer words occur 5 times (~10 words), and even fewer words occur 19 times (in fact, just 1).

The program

For this lab, you will be writing a program that reads text data from a file and generates two things based on the words in the file: a graph of the word occurrence counts and a list of the 10 most frequent words. When the program starts, it should ask the user to enter the name of a file. It should then count the frequencies of the words in this file, generate a graph using matplotlib with appropriate x and y labels and a title, and print out the 10 most frequent words. For example, given the file:
the dog is here
the dog is not here
dog good
dog sometimes not good
i like dogs
The program would produce the following graph:

and print out the following list of words:

The 10 most frequent words are:
dog   4
good  2
is    2
here  2
not   2
the   2
like  1
i     1
sometimes 1
dogs  1
There are 4 words that occur once ("like", "i", "sometimes", "dogs"), 5 words that occur twice ("good", "is", "here", "not", "the"), no words that occur three times, and 1 word that occurs four times ("dog"), so the plot has the values x = [1, 2, 3, 4] vs. y = [4, 5, 0, 1].
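
The two counting steps in this example can be sketched as follows. This is just one possible approach, not the required design: it uses collections.Counter as a shortcut for a dictionary of counts, and the function names count_words and occurrence_counts are made up for illustration. Note that split() with no arguments also splits on newlines, not only on spaces.

```python
from collections import Counter


def count_words(text):
    """Map each whitespace-separated word to the number of times it occurs."""
    return Counter(text.split())


def occurrence_counts(word_counts):
    """Return (xs, ys) where ys[i] is how many words occur exactly xs[i] times."""
    if not word_counts:
        return [], []
    how_many = Counter(word_counts.values())
    xs = list(range(1, max(how_many) + 1))
    ys = [how_many[x] for x in xs]  # a Counter gives 0 for values never seen
    return xs, ys


example = """the dog is here
the dog is not here
dog good
dog sometimes not good
i like dogs"""

xs, ys = occurrence_counts(count_words(example))
print(xs, ys)  # [1, 2, 3, 4] [4, 5, 0, 1]
```

Filling in every x from 1 up to the largest count is what produces the 0 entry for "three times" in the example.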

Your program should always print just 10 words, so if there are ties for position 10, you can pick any of the tied words for the 10th.
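
One way to get a fixed number of words even when there are ties is to sort by count and slice, as in the sketch below. The function name top_words is hypothetical, and a smaller n is used here only so the tie is visible with the example data. Because Python's sort is stable, tied words simply stay in the order they appear in the dictionary, which is one acceptable way of "picking any".

```python
def top_words(word_counts, n=10):
    """Return the n (word, count) pairs with the highest counts.

    Ties are broken by the order of the words in word_counts.
    """
    ranked = sorted(word_counts.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Counts from the example file above.
counts = {"dog": 4, "good": 2, "is": 2, "here": 2, "not": 2, "the": 2,
          "like": 1, "i": 1, "sometimes": 1, "dogs": 1}

for word, count in top_words(counts, n=3):
    print(f"{word:<10}{count}")
```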

For our purposes, you can count anything that is separated by a space as a word, so things with punctuation like "ham." will be counted as words. If you want, for extra points, you can try to remove punctuation.
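
If you try the optional punctuation removal, one simple approach is sketched below. It assumes you only want to strip punctuation from the beginning and end of each word (so "don't" is left alone) and drop anything that ends up empty. The function name clean_words is made up for illustration.

```python
import string


def clean_words(words):
    """Strip leading/trailing punctuation from each word; drop empty results."""
    stripped = (word.strip(string.punctuation) for word in words)
    return [word for word in stripped if word]


print(clean_words(["ham.", '"Sam', "don't", "--"]))  # ['ham', 'Sam', "don't"]
```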

The implementation

For this lab, I am giving you a fair amount of control over how you accomplish the program above. As you design and implement it, think hard about how you can break up your program into smaller functional units (i.e., functions) and how the functions will interact. An important component of your grade for this lab will be based upon how well you do this.

Some hints/suggestions

Data

I've posted a few different data sets on the course page for you to try your program on. "fox.txt" and "sam.txt" are the texts from two Dr. Seuss books. Try out these different files and see what the data looks like.

The other files contain much more text and will therefore have many more different words, which will force your graph to have a fairly large x maximum. If you just look at the graph as is, most of the interesting information is lost because of these large values. To see the picture better, you can use the "zoom in" tool in the matplotlib window to zoom in on the lower left corner. Alternatively, you can modify your script to discard some of the higher x values (if you take this latter approach, don't forget to change it back when you're done).
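
Discarding the higher x values could look something like the sketch below. It assumes your data is in two parallel lists like the x and y lists from the earlier example; the function name keep_small_x and the cutoff are made up for illustration.

```python
def keep_small_x(xs, ys, max_x):
    """Keep only the (x, y) pairs whose x value is at most max_x."""
    kept = [(x, y) for x, y in zip(xs, ys) if x <= max_x]
    return [x for x, _ in kept], [y for _, y in kept]


# Hypothetical data with a long, sparse tail of large x values.
xs = [1, 2, 3, 4, 250, 900]
ys = [135, 60, 31, 20, 1, 1]

small_xs, small_ys = keep_small_x(xs, ys, max_x=50)
print(small_xs, small_ys)  # [1, 2, 3, 4] [135, 60, 31, 20]
```

The filtered lists can then be passed to whatever plotting code your script already uses in place of the full lists.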

Extra points

You may earn up to 2 extra points on this assignment. Below are some ideas, but you may incorporate your own if you'd like. Make sure to document your extra point additions in comments at the top of the file.

When you're done

Make sure that your program is properly commented. In addition, make sure that you've used good style.

Submission procedure

Submit your .py file online using the digital submission link on the course web page. You must have submitted it online before the beginning of class on Wednesday.

Grading

Category                  Points
Word frequencies          3
Word occurrence counts    3
Print top 10 words        3
Plot: looks correct       3
Plot: labels, etc.        2
Code organization         2
Comments, style           4
Design (prelab)           3
Extra points              2
Total                     23 + 2