Lab 8: Zipf’s Law

Due: 08:00:00AM on 2022-11-18


Background

George Kingsley Zipf was a linguist who noticed an interesting phenomenon regarding the frequencies of words in a corpus (collection of documents). For almost any corpus the frequency of the occurrence of a word (i.e., how many times it occurs) is inversely proportional to the word’s frequency rank in the corpus, i.e.

$$\text{word frequency} \propto \frac{1}{\text{word rank}}$$

For example, the most frequent word (rank of 1) generally occurs twice as many times as the second most frequent word (rank of 2), etc.
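As a quick numeric illustration of the proportionality above, the sketch below computes the counts that an idealized Zipf distribution would predict. The function name `ideal_zipf_counts` and the top count of 1200 are made up for illustration; real corpora follow this pattern only approximately.

```python
def ideal_zipf_counts(top_count, num_ranks):
    """Return the counts predicted by an idealized Zipf's law.

    Hypothetical helper for illustration only: rank r gets top_count / r.
    """
    return [top_count / rank for rank in range(1, num_ranks + 1)]


# If the most frequent word occurred 1200 times, ranks 1 through 4 would be:
print(ideal_zipf_counts(1200, 4))  # [1200.0, 600.0, 400.0, 300.0]
```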

This phenomenon tends to show up not only in text data, but in a variety of naturally occurring data ranging from search engine queries to connections in social networks. It is so prevalent that it has been named Zipf’s law.

Zipf’s law is most easily observed by plotting the data on a log-log plot. In this form we would expect a linear relationship of the form:

$$\log{(\text{word frequency})} = -s\log{(\text{word rank})} + c$$

where c is a constant offset and s is a constant that depends on the language, which we should expect to be approximately 1. In this lab we will write a program to generate a log-log plot for a text corpus, like the one shown below for the Grimm’s fairytale “Hansel and Gretel”.

[Figure: log-log plot of word count versus word rank for “Hansel and Gretel”]
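To see why a log-log plot makes the relationship look linear, the following sketch computes the slope between two points after taking logarithms. The helper name `loglog_slope` and the sample counts are hypothetical; for data that follows the 1/rank pattern exactly, the slope comes out as -1, which corresponds to s = 1.

```python
import math


def loglog_slope(rank_a, count_a, rank_b, count_b):
    """Slope of the line through two (rank, count) points on log-log axes.

    Hypothetical helper; assumes positive counts and two different ranks.
    """
    rise = math.log(count_b) - math.log(count_a)
    run = math.log(rank_b) - math.log(rank_a)
    return rise / run


# Rank 1 with 1200 occurrences and rank 10 with 120 occurrences (ideal 1/rank).
print(round(loglog_slope(1, 1200, 10, 120), 6))  # -1.0
```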

Specification

For this lab, you will be writing a program that reads text data from a file and generates the following:

  1. A plot like the one shown above, i.e., a log-log plot of word count versus word rank.
  2. A printed list (i.e., printed using print) of up to the 10 most frequent words in the file in descending order of frequency along with each word’s count in the file. The word and its count should be separated by a tab (“\t”).

Your program must include the following functions:

Your program should take the name of the file as a command line argument. Your program should then count the frequencies of the words in this file, generate a graph using Matplotlib with appropriate x-axis and y-axis labels and an appropriate title, and print out the 10 most frequent words. As in Lab 7, your program should not do anything when imported except defining functions (i.e. it should not print, generate a plot, etc.). Your program should only read the file once (to avoid repeating any computations).
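One possible shape for the plotting step is sketched below. The function name `plot_rank_vs_count`, its parameters, and the label text are assumptions made for illustration, not required names; the only Matplotlib calls used are `plt.subplots`, `Axes.loglog`, and the label and title setters.

```python
import matplotlib.pyplot as plt


def plot_rank_vs_count(counts, title):
    """Draw a log-log plot of word count versus word rank; return the axes.

    Hypothetical helper: `counts` is assumed to be sorted in descending
    order, so counts[0] belongs to the rank 1 word.
    """
    ranks = list(range(1, len(counts) + 1))
    fig, ax = plt.subplots()
    # loglog() puts both axes on a logarithmic scale.
    ax.loglog(ranks, counts, marker=".", linestyle="none")
    ax.set_xlabel("Word rank")
    ax.set_ylabel("Word count")
    ax.set_title(title)
    return ax
```

Because the function only builds the figure, displaying it (for example with `plt.show()`) can be left to the part of the program that runs when the file is executed directly, which keeps imports free of side effects.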

Punctuation, e.g. periods, commas, quotes, etc., can bias our word counts and so needs to be removed prior to counting. For simplicity, hyphens and other punctuation within words, e.g. the “-” in “wood-cutter” or the “?--” in the paragraph below, do not need to be removed. Note that apostrophes in contractions, e.g. “I’ll”, should not be removed (as that would change the word). We want to reuse existing code whenever possible. We might expect that Python already has a source of all possible punctuation, and indeed the string module has a constant, punctuation, that you can use.

Similarly, capitalization can bias word counts. Your program should convert all words to lower case to ensure that “The” and “the” are treated as the same word.
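A minimal sketch of this cleanup, assuming punctuation only needs to come off the two ends of each word, is shown below. The helper name `clean_word` is hypothetical. Note that `string.punctuation` contains only ASCII characters, so typographic (curly) quotes would not be stripped by this approach.

```python
import string


def clean_word(word):
    """Lowercase a word and strip punctuation from both ends only.

    Hypothetical helper. Punctuation inside the word (hyphens, the
    apostrophe in a contraction) is left untouched.
    """
    return word.strip(string.punctuation).lower()


for raw in ["The", "break,", "'What", "wood-cutter", "'I'll", "ourselves?'"]:
    print(raw, "->", clean_word(raw))
```

A token made up entirely of punctuation, such as "--", becomes an empty string after stripping, so a program using this approach would probably want to skip empty results before counting.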

For example, given a file with the first paragraph of Hansel and Gretel:

Hard by a great forest dwelt a poor wood-cutter with his wife and his
two children. The boy was called Hansel and the girl Gretel. He had
little to bite and to break, and once when great dearth fell on the
land, he could no longer procure even daily bread. Now when he thought
over this by night in his bed, and tossed about in his anxiety, he
groaned and said to his wife: 'What is to become of us? How are we
to feed our poor children, when we no longer have anything even for
ourselves?' 'I'll tell you what, husband,' answered the woman, 'early
tomorrow morning we will take the children out into the forest to where
it is the thickest; there we will light a fire for them, and give each
of them one more piece of bread, and then we will go to our work and
leave them alone. They will not find the way home again, and we shall be
rid of them.' 'No, wife,' said the man, 'I will not do that; how can I
bear to leave my children alone in the forest?--the wild animals would
soon come and tear them to pieces.' 'O, you fool!' said she, 'then we
must all four die of hunger, you may as well plane the planks for our
coffins,' and she left him no peace until he consented. 'But I feel very
sorry for the poor children, all the same,' said the man.

The program would produce the following graph:

[Figure: log-log plot of word count versus word rank for the paragraph above]

and the following printed output:

Word	Count
the	14
and	12
to	9
we	7
he	5
children	5
his	5
will	5
them	5
of	5

Your program should always print the 10 top-ranked words (and no more, even if, as above, several words have equal counts). Your program should also handle the unlikely case that the file contains fewer than 10 distinct words.
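The "at most 10" requirement can be handled with a slice, since slicing a list past its end simply returns what is there. The sketch below assumes the words have already been ranked; the helper name `format_top_words` and its `limit` parameter are hypothetical.

```python
def format_top_words(ranked, limit=10):
    """Return tab-separated output lines for at most `limit` words.

    Hypothetical helper: `ranked` is assumed to be a list of
    (word, count) pairs already sorted by descending count.
    """
    lines = ["Word\tCount"]
    for word, count in ranked[:limit]:
        lines.append(f"{word}\t{count}")
    return lines


# Only three distinct words: the slice just returns all of them.
print("\n".join(format_top_words([("the", 4), ("and", 3), ("a", 3)])))
```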

Several words will likely have the same frequency (or count), but should not have the same rank. The ordering among words with identical counts can be determined in any way, e.g. both “RANK1” and “RANK2” below are valid rankings, as long as the global ranking is correct. Thus your output may be different from that shown above (e.g. “his” could be ranked above “children”).

WORD   COUNT   RANK1   RANK2
the    4       1       1
and    3       2       3
a      3       3       2
is     2       4       4
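One way to obtain a valid ranking such as RANK1 is sketched below using only built-ins. The helper name `rank_words` and the sample word list are made up for illustration. Ties keep first-seen order here because dictionaries preserve insertion order and `sorted` is stable, but any other tie order would be equally acceptable.

```python
def rank_words(words):
    """Count words and return (word, count) pairs, most frequent first.

    Hypothetical helper. Each word gets a distinct rank (its position in
    the returned list plus one), even when counts are tied.
    """
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return sorted(counts.items(), key=lambda pair: pair[1], reverse=True)


words = "the and a the is and a the a and is the".split()
for rank, (word, count) in enumerate(rank_words(words), start=1):
    print(f"{word}\t{count}\t{rank}")
```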

If you create a NumPy/datascience implementation, you must also keep your implementation using Python built-in functions (i.e. you would have two implementations). Thus make sure your built-in implementation is working before you tackle NumPy/datascience.

Creativity points

You may earn up to 2 creativity points on this assignment. Below are some ideas, but you may incorporate your own if you’d like. Make sure to document your additions in comments at the top of the file.

Guide

As our labs get more complex, the assignments are less prescriptive, giving you more flexibility in how you implement your solution. As you design and implement your program, think carefully about how you can break up your program into smaller functional units (i.e., functions) and how the functions will interact. In this lab, more of the “Code design and style” evaluation will focus on the design, with full credit for programs that are efficient and easy to understand/maintain.

Some notes and suggestions

Data

Here are several text files you can use for testing, derived from public domain texts at Project Gutenberg:

Per the Project Gutenberg license, the above eBooks are for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org.

The second is much longer than the first. I suggest you start with the smaller files, make sure your program is working correctly and efficiently, and then tackle the larger dataset. You can also pick interesting inputs of your own from Project Gutenberg or other sources.

When you’re done

Make sure that your program is properly documented:

In addition, make sure that you’ve used good code design and style (including meaningful variable names, constants where relevant, vertical white space, etc.).

Submit your program via Gradescope. Your program file must be named lab8_zipf_law.py. You can submit multiple times, with only the most recent submission (before the due date) graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

Gradescope will import your file for testing, so make sure that no code executes on import. That is, when imported, your program should not try to read the file, generate the plot, etc.
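The usual way to keep a file quiet on import is to put the top-level work behind a `__name__` check, as in the sketch below. The `main` function, its return values, and the usage message are assumptions for illustration; the real work is only indicated by a comment.

```python
import sys


def main(argv):
    """Hypothetical entry point: expects argv[1] to be the corpus filename."""
    if len(argv) != 2:
        print(f"Usage: python {argv[0]} <filename>")
        return 1
    filename = argv[1]
    # Reading, counting, printing, and plotting would be called from here.
    print(f"Would process {filename}")
    return 0


if __name__ == "__main__":
    # True only when the file is run as a script, not when it is imported.
    main(sys.argv)
```

With this layout, `import lab8_zipf_law` only defines functions, while `python lab8_zipf_law.py somefile.txt` runs `main`.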

Grading

Features                 Points
read_corpus              4
count_and_rank           4
Print top 10 words       4
Plot: Correct data       4
Plot: Labels, etc.       2
Code design and style    5
Creativity points        2
Total                    25

FAQ Excerpts

What is a fixed line representing Zipf's law

Recall that Zipf’s law postulates the following relationship: