Prelab 11

Due: 11:59 PM on 2020-05-11

Getting ready

To install R and RStudio on your own laptop:

  1. First download and install R. [A note for OS X users: the most recent version of R requires OS X 10.11+ (El Capitan). If you are running an older version of OS X, download the previous version of R instead.]
  2. Next download and install RStudio.
  3. Install the packages we are using in class by executing the following in the R Console:
install.packages(c("ggplot2","reshape2","plyr","dplyr","stringr"), repos="https://cloud.r-project.org")

For Mac users: depending on your version of OS X and the versions of the packages, you may get errors when R can’t install binary packages and instead needs to compile them from source code. Compiling requires that you have the Apple developer tools installed. You can force the installation of those tools by opening the Terminal application, running the xcode-select --install command, and following the prompts to install the command line developer tools.

Be sure to resolve any installation issues by consulting with me or one of the ASIs.
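One way to check whether the installation step succeeded is sketched below. It only reports which of the class packages R can currently load; the variable names (pkgs, installed, missing_pkgs) are made up for this illustration.

```r
# Packages named in the install step above.
pkgs <- c("ggplot2", "reshape2", "plyr", "dplyr", "stringr")

# requireNamespace() returns TRUE if a package can be loaded and FALSE
# otherwise, without attaching the package to the search path.
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Named logical vector: one TRUE/FALSE entry per package.
print(installed)

# Names of any packages that still need attention (empty if none).
missing_pkgs <- names(installed)[!installed]
print(missing_pkgs)
```

If missing_pkgs is not empty, rerunning install.packages for just those names is a reasonable next step before asking for help.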

Recall that you can run your entire R program by using the “Source” button in the top right of the editor window (it works similarly to the “green arrow” in Thonny).

Revisiting Zipf’s Law

We will be reimplementing our Zipf’s law lab in R. The program will generate the same two outputs as before:

  1. A printed list of the 10 most frequent words in the file in descending order of frequency along with each word’s frequency (count)
  2. A log-log plot of the word count versus word rank

In this implementation we will not use any loops; instead all of our computations will be vectorized.
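As a small illustration of what “vectorized” means here, the sketch below sorts a vector, builds ranks, and takes logarithms without any explicit loop. The counts vector is hypothetical data invented for this example, not output from the lab.

```r
# Hypothetical counts, used only to illustrate vectorized operations.
counts <- c(5, 120, 42, 7)

# order() returns the index permutation that sorts the vector; indexing
# with it rearranges every element at once.
sorted_counts <- counts[order(counts, decreasing = TRUE)]

# seq_along() builds the ranks 1, 2, ..., n in a single call.
ranks <- seq_along(sorted_counts)

# Math functions such as log10() apply element-wise to whole vectors.
log_counts <- log10(sorted_counts)

print(sorted_counts)  # values 120, 42, 7, 5
print(ranks)          # values 1, 2, 3, 4
print(log_counts)
```

Each line operates on the whole vector, which is the style the rest of this lab expects.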

Counting Words

Write an R function count_words that takes a vector of words as a parameter and returns a data frame of the words and their counts (with column labels “Word” and “Count”). There are many ways to go about this using R built-in functions or the plyr package; the most concise is the plyr count function. Note that in some versions of R/RStudio, the count function in plyr is masked by a function with the same name, but different functionality, from another package. To avoid that problem, use the fully qualified name, i.e., plyr::count.

Here is an example call to count_words and its output:

> count_words(c("a", "a", "the", "a", "in", "the"))
  Word Count
1    a     3
2   in     1
3  the     2

Depending on your approach you may need to change the column names. You can do so by assigning to the result of the colnames function, e.g.,

> frame <- data.frame(a=c(1, 2), b=c(2, 3))
> colnames(frame)
[1] "a" "b"
> colnames(frame) <- c("col1", "col2")
> frame
  col1 col2
1    1    2
2    2    3
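To see why renaming may be needed, here is a minimal sketch of one built-in way to tally values, using the base R table function on made-up data. It is only an illustration of the default column names that such an approach produces, not the required implementation; the names fruit and tally are hypothetical.

```r
# Hypothetical input, unrelated to the lab's text file.
fruit <- c("pear", "apple", "pear", "fig", "pear")

# table() tallies each distinct value; as.data.frame() turns the tally
# into a data frame with one row per distinct value.
tally <- as.data.frame(table(fruit), stringsAsFactors = FALSE)

# The default column names are the variable name and "Freq", which is
# where assigning to colnames() becomes useful.
print(colnames(tally))
print(tally)
```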

Submit your function in an R file named prelab11.R to Gradescope. You can submit multiple times; only the most recent submission is graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

Gradescope does not integrate as tightly with R as it does with Python, so you won’t see the same list of passing and failing tests. Instead, pay attention to the Autograder Output window, as shown below:

> library("testthat"); test_file("prelab11.R");
✔ | OK F W S | Context
✔ |  4       | prelab11 [0.1 s]

══ Results ═════════════════════════════════════════════════════════════════════
Duration: 0.2 s

OK:       4
Failed:   0
Warnings: 0
Skipped:  0

You are looking to see all ✔ marks and for Failed: 0, i.e., there are zero failing tests.