Lab 11: Zipf’s Law in R
Due: 11:59 PM on 2020-05-11

For this lab we will reimplement the Zipf’s Law lab in R, including both the plotting and printing.

Specification

Download the starter file.

The specifications are similar to those from lab 10. You will write a program that reads text data from a file and generates the following:

  1. A printed list of the 10 most frequent words in the file in descending order of frequency along with each word’s count
  2. A log-log plot of word count versus word rank

Your program must include the following functions:

When the program starts it should prompt the user to enter the name of a file. Your program should then determine the frequencies of the words in this file, print out the 10 most frequent words, and generate a graph with appropriate data, x-axis and y-axis labels, and title. Your program should only read the file once (to avoid repeating any computations).
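As a rough sketch of the prompt-and-read step, `scan()` can pull every whitespace-separated token out of the file in a single pass. The helper name `read_words` and the temporary demo file are assumptions made for illustration; the starter file may organize this differently.

```r
# Hypothetical helper (not from the starter file): read the file once and
# return every whitespace-separated token as a character vector.
read_words <- function(filename) {
  # quote = "" keeps apostrophes and quotation marks as ordinary characters
  scan(filename, what = character(), quote = "", quiet = TRUE)
}

if (interactive()) {
  # readline() only waits for input in an interactive session
  filename <- readline("Enter the name of a file: ")
} else {
  # Fallback for non-interactive runs: a small temporary demo file
  filename <- tempfile(fileext = ".txt")
  writeLines(c("The cat sat.", "The dog ran!"), filename)
}

words <- read_words(filename)
```

Keeping the result in one vector (`words` here) means later steps can reuse it without touching the file again.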

Punctuation, e.g., periods or commas, can bias our word counts and so needs to be removed prior to counting. Hyphens and other punctuation within words do not need to be removed. Note that apostrophes in contractions, e.g. “I’ll”, should not be removed (as it would change the word). The starter file contains a function already defined that will strip out leading and trailing punctuation.

Similarly, capitalization can bias word counts. Your program should convert all words to lower case to ensure that “The” and “the” are treated as the same word.
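The two normalization steps above might look roughly like the following. The function `strip_punctuation` is a hypothetical stand-in written for this sketch, not the function supplied in the starter file, which may behave differently on edge cases.

```r
# Hypothetical stand-in for the starter file's stripping function: remove
# punctuation only at the start and end of each word.
strip_punctuation <- function(words) {
  gsub("^[[:punct:]]+|[[:punct:]]+$", "", words)
}

raw <- c("The", "the,", "I'll", "well-known.", "(Hello)", "--")

# Both calls operate on the whole vector at once, so no loop is needed
cleaned <- tolower(strip_punctuation(raw))

# Tokens that were pure punctuation become empty strings; drop them
cleaned <- cleaned[nzchar(cleaned)]
print(cleaned)
```

Note that the apostrophe in "I'll" and the hyphen in "well-known" survive because they sit inside the word rather than at its edges.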

Your program should print the 10 top-ranked words. It should also handle the unlikely case that the file contains fewer than 10 distinct words, for example by printing all of them.
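One possible way to cover the short-input case without special handling is `head()`, which returns at most the requested number of elements. The sketch below assumes the counts are already sorted in descending order in a named vector; the names `TOP_N`, `counts`, and `top_words` are illustrative only.

```r
TOP_N <- 10  # number of top-ranked words to report

# Toy counts with only three distinct words, assumed already sorted
counts <- c(the = 3L, cat = 2L, dog = 1L)

# head() returns at most TOP_N entries, so short inputs need no special case
top <- head(counts, TOP_N)

# A data frame printed without row names gives a two-column word/count listing
top_words <- data.frame(Word = names(top), Count = as.vector(top))
print(top_words, row.names = FALSE)
```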

The key difference from your Python implementation is that your R program must not have any explicit loops, i.e. you must exclusively use vector operations.
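To illustrate what loop-free counting can look like, here is a minimal sketch built on `table()` and `sort()`. It is one possible approach rather than the required one, and the variable names are assumptions.

```r
words <- c("the", "cat", "the", "dog", "the", "cat")

# table() tallies every distinct word in one vectorized call
counts <- sort(table(words), decreasing = TRUE)

# Rank 1 is the most frequent word; seq_along() builds the ranks without a loop
ranks <- seq_along(counts)

print(counts)
print(ranks)
```

Where a Python version would typically update a dictionary inside a `for` loop, here the whole tally happens inside `table()`.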

Creativity points

You may earn up to 2 creativity points on this assignment. Below are some ideas, but you may incorporate your own if you’d like. Make sure to document your additions in comments at the top of the file.

Example

For the example from “Harry Potter and the Deathly Hallows” (download), your program should produce the following printed output:

 Word Count
  the     8
    a     7
  and     7
   he     7
   to     7
   it     6
  you     6
  but     5
    i     5
   so     5

and the following plots with R base graphics (left) or ggplot2 (right):

[Figure: example log-log plots of word count versus word rank, drawn with R base graphics (left) and ggplot2 (right)]

Note that the counts may differ slightly from the Python version because of differences in how the R version splits lines into words and strips punctuation. Depending on your aesthetic choices for points, etc., your plot might look a little different; that is OK. But the data should be correct and the axis labels and title meaningful.
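For the base graphics variant, a log-log plot can be sketched with the `log = "xy"` argument to `plot()`. The counts below are synthetic values chosen only to make the example self-contained, and the label text is a suggestion rather than a requirement.

```r
# Synthetic, roughly Zipf-like counts already sorted in descending order
counts <- c(120, 60, 40, 30, 24, 20)
ranks <- seq_along(counts)

# log = "xy" puts both axes on a logarithmic scale
plot(ranks, counts, log = "xy",
     xlab = "Word rank", ylab = "Word count",
     main = "Zipf's Law: word count versus rank")
```

When run non-interactively (for example with `Rscript`), R writes the plot to a PDF file instead of opening a window.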

Guide

We have provided you a starter file that includes several string processing functions mimicking familiar Python methods. Fill in the remaining functions with your code.

Some notes and suggestions

Data

Here are several text files you can use for testing, derived from public domain texts at Project Gutenberg:

Per the Project Gutenberg license, an eBook is for the use of anyone anywhere at no cost and with almost no restrictions whatsoever. You may copy it, give it away or re-use it under the terms of the Project Gutenberg License included with the eBook or online at www.gutenberg.org.

Start with a smaller test file first – make sure your program is working correctly and efficiently – then tackle a larger dataset. You can also pick interesting inputs of your own from Project Gutenberg or other sources.

When you’re done

Make sure that your program is properly commented:

In addition, make sure that you’ve used good coding style (including meaningful variable names, constants where relevant, vertical white space, etc.).

Submit your R program via Gradescope. Your program file must be named lab11_zipf_law.R. You can submit multiple times, with only the most recent submission graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

Grading

Features                            Points
Read file and extract words              3
Word occurrence counts and ranking       4
Print top 10 words                       3
Plot: Correct data                       4
Plot: Labels, etc.                       2
Code organization                        2
Comments, style                          4
Creativity points                        2
Total                                   25