George Kingsley Zipf was a linguist who noticed an interesting phenomenon regarding the frequencies of words in a corpus (a collection of documents). For almost any corpus, the frequency of a word (i.e., how many times it occurs) is inversely proportional to the word’s frequency rank in the corpus, i.e. $\text{frequency} \propto 1 / \text{rank}$.
For example, the most frequent word (rank of 1) generally occurs twice as many times as the second most frequent word (rank of 2), etc.
This phenomenon tends to show up not only in text data, but in a variety of naturally occurring data ranging from search engine queries to connections in social networks. It is so prevalent that it has been named Zipf’s law.
Zipf’s law is most easily observed by plotting the data on a log-log plot. In this form we would expect a linear relationship of the form:

$$\log(\text{frequency}) = \log(c) - s \cdot \log(\text{rank})$$

where $c$ is a constant of proportionality and $s$ is a constant that depends on the language, but that we should expect to be approximately 1. In this lab we will write a program to generate a log-log plot for a text corpus, like the one shown below for the Grimm’s fairy tale “Hansel and Gretel”.
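As a rough illustration of why a log-log plot is convenient, the sketch below builds idealized counts that follow count = c / rank^s exactly and checks that, after taking logarithms, the points lie on a line with slope -s. The helper names `ideal_zipf_counts`, `log_log_slope`, and `plot_rank_frequency`, the constant c = 1000, and the axis labels are all assumptions made for this example; they are not part of the lab's required interface.

```python
import math


def ideal_zipf_counts(num_words, c=1000.0, s=1.0):
    """Return idealized counts c / rank**s for ranks 1..num_words (hypothetical helper)."""
    return [c / rank ** s for rank in range(1, num_words + 1)]


def log_log_slope(counts):
    """Slope between the first and last points after taking logs of rank and count.

    Assumes counts has at least two entries, ordered by rank starting at 1.
    """
    last_rank = len(counts)
    rise = math.log(counts[-1]) - math.log(counts[0])
    run = math.log(last_rank) - math.log(1)
    return rise / run


def plot_rank_frequency(counts):
    """Plot count versus rank on log-log axes (Matplotlib is imported only when called)."""
    import matplotlib.pyplot as plt

    ranks = list(range(1, len(counts) + 1))
    plt.scatter(ranks, counts)
    plt.xscale("log")  # logarithmic axes, so the data itself is left unchanged
    plt.yscale("log")
    plt.xlabel("Frequency rank")
    plt.ylabel("Count")
    plt.title("Idealized Zipf's law")
    plt.show()
```

Calling `plot_rank_frequency(ideal_zipf_counts(100))` opens a plot window. Counts from real text will only approximately follow the straight line.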
For this lab, you will be writing a program that reads text data from a file and generates the following:

- A log-log plot of word frequency versus frequency rank for the words in the file.
- A printed list (using `print`) of up to the 10 most frequent words in the file in descending order of frequency along with each word’s count in the file. The word and its count should be separated by a tab (“\t”).

Your program must include the following functions:
- `read_corpus`, which takes a filename as a parameter and returns a list of cleaned and normalized words (see below for more details).
- `count_and_rank`, which takes a list of words as a parameter and returns a tuple containing a list of words and a list of counts for those words. Both lists should be sorted in decreasing order of count, i.e. the most common word and its count should be the first elements of the lists. For example:

      >>> count_and_rank(["a", "a", "the", "a", "in", "the"])
      (['a', 'the', 'in'], [3, 2, 1])

  `count_and_rank` should not plot any data or print anything to the shell.
Your program should take the name of the file as a command line argument. Your program should then count the frequencies of the words in this file, generate a graph using Matplotlib with appropriate x-axis and y-axis labels and an appropriate title, and print out the 10 most frequent words. As in Lab 7, your program should not do anything when imported except defining functions (i.e. it should not print, generate a plot, etc.). Your program should only read the file once (to avoid repeating any computations).
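One common way to satisfy both requirements (a command line argument, and no side effects on import) is a `main` function behind an `if __name__ == "__main__":` guard. The outline below is only a sketch: the name `main`, its return value, and the usage message are assumptions made for illustration, and the body deliberately leaves out the actual reading, counting, printing, and plotting.

```python
import sys


def main(argv):
    """Return the filename given on the command line, or None if it is missing."""
    if len(argv) != 2:
        print("Usage: python3 lab8_zipf_law.py <filename>")
        return None
    filename = argv[1]
    # A full program would read the corpus once here, then count and rank,
    # print the top words, and finally plot.
    return filename


if __name__ == "__main__":
    # Runs only when the file is executed as a script, never when it is imported.
    main(sys.argv)
```

Because `sys.argv[0]` is the script name, the filename is expected at index 1.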
Punctuation, e.g. periods, commas, quotes, etc., can bias our word counts and so needs to be removed prior to counting. For simplicity, hyphens and other punctuation within words, e.g. the “-” in “wood-cutter” or the “?--” in the paragraph below, do not need to be removed. Note that apostrophes in contractions, e.g. “I’ll”, should not be removed (as doing so would change the word).

We want to reuse existing code whenever possible. We might expect that Python already has a source of all possible punctuation, and indeed the `string` module has a constant, `punctuation`, that you can use.
Similarly capitalization can bias word counts. Your program should convert all words to lower case to ensure that “The” and “the” are treated as the same word.
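The two normalization steps above (removing surrounding punctuation and lowercasing) might be sketched as follows. The helper names `clean_word` and `clean_text` are hypothetical, and dropping tokens that become empty after stripping is an assumption of this sketch rather than a stated requirement.

```python
import string


def clean_word(token):
    """Strip leading/trailing punctuation from one token and lowercase it."""
    # strip only touches the ends, so "wood-cutter" and "I'll" keep their
    # internal hyphen and apostrophe.
    return token.strip(string.punctuation).lower()


def clean_text(text):
    """Split text on whitespace and clean each token (hypothetical helper)."""
    cleaned = [clean_word(token) for token in text.split()]
    # Assumption: tokens made up entirely of punctuation are discarded.
    return [word for word in cleaned if word != ""]
```

For instance, `clean_word("'I'll")` yields `"i'll"`: the leading quote is removed, while the apostrophe inside the contraction is kept.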
For example, given a file with the first paragraph of Hansel and Gretel:
Hard by a great forest dwelt a poor wood-cutter with his wife and his
two children. The boy was called Hansel and the girl Gretel. He had
little to bite and to break, and once when great dearth fell on the
land, he could no longer procure even daily bread. Now when he thought
over this by night in his bed, and tossed about in his anxiety, he
groaned and said to his wife: 'What is to become of us? How are we
to feed our poor children, when we no longer have anything even for
ourselves?' 'I'll tell you what, husband,' answered the woman, 'early
tomorrow morning we will take the children out into the forest to where
it is the thickest; there we will light a fire for them, and give each
of them one more piece of bread, and then we will go to our work and
leave them alone. They will not find the way home again, and we shall be
rid of them.' 'No, wife,' said the man, 'I will not do that; how can I
bear to leave my children alone in the forest?--the wild animals would
soon come and tear them to pieces.' 'O, you fool!' said she, 'then we
must all four die of hunger, you may as well plane the planks for our
coffins,' and she left him no peace until he consented. 'But I feel very
sorry for the poor children, all the same,' said the man.
The program would produce the following graph:
and the following printed output:
Word Count
the 14
and 12
to 9
we 7
he 5
children 5
his 5
will 5
them 5
of 5
Your program should always print the 10 top-ranked words (and no more, even if there are words, like above, with equivalent counts). Your program should also handle the unlikely case that there are fewer than 10 words.
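Slicing is one simple way to honor both limits, since a slice never raises an error when the list is shorter than requested. The sketch below assumes the words and counts are already ranked, in the form returned by `count_and_rank`; the function name `format_top_words` and its `limit` parameter are hypothetical.

```python
def format_top_words(words, counts, limit=10):
    """Return tab-separated "word<TAB>count" lines for at most `limit` words."""
    lines = []
    # words[:limit] yields fewer items when the list is short, so a file with
    # fewer than 10 distinct words is handled without a special case.
    for word, count in zip(words[:limit], counts[:limit]):
        lines.append(word + "\t" + str(count))
    return lines
```

Each returned line can then be passed to `print`.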
Several words will likely have the same frequency (or count), but should not have the same rank. The ordering among words with identical counts can be determined in any way, e.g. both “RANK1” and “RANK2” below are valid rankings, as long as the global ranking is correct. Thus your output may be different from that shown above (e.g. “his” could be ranked above “children”).
WORD | COUNT | RANK1 | RANK2 |
---|---|---|---|
the | 4 | 1 | 1 |
and | 3 | 2 | 3 |
a | 3 | 3 | 2 |
is | 2 | 4 | 4 |
If you create a NumPy/datascience implementation, you must also keep your implementation using Python built-in functions (i.e. you would have two implementations). Thus make sure your built-in implementation is working before you tackle NumPy/datascience.
You may earn up to 2 creativity points on this assignment. Below are some ideas, but you may incorporate your own if you’d like. Make sure to document your additions in comments at the top of the file.
- Use the `group` method of a datascience Table to implement the histogram. You can change the column names for a Table with the `relabeled` method. To select just a subset of rows, check out the `take` method. When you are printing your Table, investigate the `as_text` method. Although that method generates a ‘|’ table by default, you can change the `sep` parameter to use tabs instead.

As our labs get more complex, the assignments are less prescriptive, giving you more flexibility in how you implement your solution. As you design and implement your program, think carefully about how you can break up your program into smaller functional units (i.e., functions) and how the functions will interact. In this lab, more of the “Code design and style” evaluation will focus on the design, with full credit for programs that are efficient and easy to understand/maintain.
    with open(filename, "r", encoding='utf-8') as file:

or

    with open(filename, "r", errors='ignore') as file:
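Either form of `open` can sit at the top of a `with` block that reads the file a single time. The sketch below is a minimal illustration; the helper name `read_text` is hypothetical, and combining both keyword arguments in one call is a choice made for this example, not something the lab prescribes.

```python
def read_text(filename):
    """Read the whole file into one string in a single pass (hypothetical helper)."""
    # errors="ignore" silently drops any bytes that cannot be decoded as UTF-8.
    with open(filename, "r", encoding="utf-8", errors="ignore") as file:
        return file.read()
```

The `with` statement closes the file automatically once the block finishes, even if an error occurs while reading.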
The `split` method can be called on a string and returns a list of the words in that string separated by any whitespace (e.g. a space).

Recall that the `strip` method can take an optional argument, a string of characters to strip from the beginning and end of the string. For example:

    >>> "word.".strip(".,")
    'word'

removes any leading or trailing periods and commas. Your actual program will need to be able to strip more punctuation than just the comma and period shown here.
To generate the ranking you will need to sort the word-count pairs by count. There are several approaches to doing so. Some of the reading describes one approach that is based on inverting the dictionary; another is to sort (with the `sort` method or `sorted` function) a data structure derived from the dictionary; yet another is to write a function that finds and removes, i.e. “pops”, the key-value pair with the maximum value (and then to apply that function until the dictionary is empty).

I think the “middle” approach will be the most straightforward. To do so, you can use the optional `key` parameter for `sorted`, as described in the Python documentation. This parameter expects a function that can be applied to every element to obtain the key that is compared to determine the sort order. For example, the following would sort the list of strings based on the second letter of each string.
    def second_letter(elem):
        return elem[1]

    strings = ["yw", "ac", "fg", "de"]
    strings.sort(key=second_letter)
    # strings is now ['ac', 'de', 'fg', 'yw']
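The same `key` idea carries over to word-count pairs. The sketch below assumes the counts are held in a dictionary mapping each word to its count; the names `get_count` and `rank_pairs` are hypothetical, and this is only one of the several approaches mentioned above.

```python
def get_count(pair):
    """Key function: return the count from a (word, count) tuple."""
    return pair[1]


def rank_pairs(counts_by_word):
    """Return (word, count) tuples sorted from most to least frequent."""
    # reverse=True gives descending order. Python's sort is stable, so words
    # with equal counts keep the order in which they appear in the dictionary.
    return sorted(counts_by_word.items(), key=get_count, reverse=True)
```

Since any ordering among tied words is acceptable, the stable ordering here is a convenience rather than a requirement.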
When you call the `show` function, your program will wait until you close the plot window (and then proceed). This is normal and expected. It may be helpful to print your “top 10” before plotting so that you don’t have to remember to close the plot for the rest of your program to execute.

To create a log-log plot, you can either take the logarithm of the data yourself or call the `xscale` function and `yscale` function with the string 'log' as the argument. The latter approach is easier (and better since you don’t need to modify the data). Note that, depending on your aesthetic choices for points, etc., your plot might look a little different than the examples; that is OK. But the data should be correct and the axis and title labels meaningful.

Here are several text files you can use for testing, derived from public domain texts at Project Gutenberg:
Per the Project Gutenberg license, the above eBooks are for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with this eBook or online at www.gutenberg.org.
The second is much longer than the first. I suggest you start with the smaller files first - make sure your program is working correctly and efficiently - then tackle the larger dataset. You can also pick interesting inputs of your own from Project Gutenberg or other sources.
Make sure that your program is properly documented:
In addition, make sure that you’ve used good code design and style (including meaningful variable names, constants where relevant, vertical white space, etc.).
Submit your program via Gradescope. Your program file must be named lab8_zipf_law.py. You can submit multiple times, with only the most recent submission (before the due date) graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.
Gradescope will import your file for testing, so make sure that no code executes on import. That is, when imported, your program should not try to read the file, generate the plot, etc.
Features | Points |
---|---|
read_corpus | 4 |
count_and_rank | 4 |
Print top 10 words | 4 |
Plot: Correct data | 4 |
Plot: Labels, etc. | 2 |
Code design and style | 5 |
Creativity points | 2 |
Total | 25 |
Recall that Zipf’s law postulates the following relationship: