Lab 5: Data for everyone

Due: 11:59 PM on 2020-04-02

Background

For this lab, you will implement some initial data analysis functions and then analyze some real data.

Creating/Editing Text Files

For this lab, we will read data from files. The files must be plain-text files of type “.txt” (no Microsoft Word or Google Docs files, etc.). Also, recall from class that the files must be saved in the same directory as the Python program that will read them. There are many ways to do this, but one easy way is to create them using Thonny: create a new file and put whatever text data you want in it (e.g., a list of numbers). When you save the file, use a “.txt” extension, e.g., “test.txt”, and save it in the same folder as your Python program.

Specifications

Right-click (Ctrl-click on a Mac) on the link below, select “Save as…”, and save the file for this assignment. Then open the saved file with Thonny.

Download the starter file.

Add your code in the places specified by the comments. Do not delete the if __name__ == '__main__': statement; it is needed for Gradescope.

At a minimum, write a function named data_analysis, with no parameters, that prompts the user for the name of a data file containing one number per line and then prints the following statistics about the file:

  1. The number of entries in the file
  2. The largest value in the file
  3. The smallest value in the file
  4. The average of the values in the file (recall we saw this in class)
  5. The median of the values in the file. For an odd number of data elements, the median is the “middle” item when the data are in sorted order. For an even number of data elements, the median is the average of the two middle items when the data are in sorted order.
  6. The corrected sample standard deviation of the values in the file. The corrected sample standard deviation of N sample data points is defined as:

    $$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_{i} - \bar{x})^2}$$

    that is, first sum the squared differences between each data point and the average, then divide that sum by one less than the number of data points, and finally take the square root of the result.
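The definitions of the median and the corrected sample standard deviation above can be sketched in Python as follows. This is a minimal sketch, not the required solution: the function names median and sample_std_dev and the small test lists are assumptions made for illustration, and both functions assume a list with enough values (at least one for the median, at least two for the standard deviation).

```python
def median(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)      # sorted copy; the caller's list is unchanged
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd count: the single middle item
        return ordered[middle]
    # Even count: the average of the two middle items
    return (ordered[middle - 1] + ordered[middle]) / 2


def sample_std_dev(values):
    """Return the corrected sample standard deviation (needs 2+ values)."""
    mean = sum(values) / len(values)
    squared_diffs = 0
    for x in values:
        squared_diffs += (x - mean) ** 2
    # Divide by N - 1 (not N), then take the square root
    return (squared_diffs / (len(values) - 1)) ** 0.5


# Hypothetical test data, small enough to check by hand
print(median([3, 1, 2]))             # middle of 1, 2, 3
print(median([2, 4, 4, 6]))          # average of 4 and 4
print(sample_std_dev([2, 4, 4, 6]))  # square root of 8 / 3
```

For [2, 4, 4, 6] the average is 4, the squared differences are 4, 0, 0 and 4, and their sum divided by N - 1 = 3 is 8/3, so the result is roughly 1.633. Note that a list with a single value would divide by zero in sample_std_dev; the assignment does not say how to treat that case.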

The required output must be formatted as follows:

File contained 10 entries
Max: 47
Min: 21
Average: 31.0
Median: 30.0
Std. dev: 8.259674462242579

Note: Your program should handle a file with zero lines (that is, no data) without an error. In such a case, you should print only:

File contained 0 entries
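One way to get this behavior is to report the entry count first and stop early when the list is empty, since max and min raise a ValueError on an empty list and the average would divide by zero. The block below is a minimal sketch under stated assumptions: build_report is a hypothetical helper that returns the output lines instead of printing them, it assumes the values are already in a list, and it covers only the first three statistics.

```python
def build_report(values):
    """Return the report lines for a list of numbers (count, max, min only)."""
    lines = ["File contained " + str(len(values)) + " entries"]
    if len(values) == 0:
        # No data: stop before any statistic that needs at least one value
        return lines
    lines.append("Max: " + str(max(values)))
    lines.append("Min: " + str(min(values)))
    return lines


# An empty list yields only the count line
for line in build_report([]):
    print(line)
```

Returning the lines rather than printing them is a design choice made here so the helper is easy to check; printing directly inside the guard works just as well.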

At a minimum your program should also include:

Guide

Data Basics: Summary Statistics

Suppose we have a file called “temps.txt” that contains the following:

21
47
33
29
25
38
38
31
21
27

(This data is the recorded high temperatures in Middlebury VT on March 7 in the years 2011 through 2020 from World Weather Online.) Then a run of the program would output:

Enter file to analyze: temps.txt
File contained 10 entries
Max: 47
Min: 21
Average: 31.0
Median: 30.0
Std. dev: 8.259674462242579

Your program should read the data from the file only once. As in the examples from class, store the values in a list as you read them, and then use that list to calculate the various statistics.
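The read-once approach can be sketched as below. This is an illustration rather than the required implementation: read_numbers is a hypothetical function name, the values are converted with int on the assumption that the file holds whole numbers (as in the sample, where the maximum prints as 47 rather than 47.0), and blank lines are skipped as a precaution.

```python
def read_numbers(filename):
    """Return the numbers in filename (one per line) as a list."""
    values = []
    with open(filename, "r") as data_file:
        for line in data_file:
            line = line.strip()        # remove the newline and extra spaces
            if line != "":             # skip blank lines, e.g. at the end
                values.append(int(line))   # assumption: whole numbers
    return values


# Example use (assumes temps.txt is in the same folder as this program):
# values = read_numbers("temps.txt")
# print(len(values), max(values), min(values))
```

Once the list exists, every statistic can be computed from it without opening the file again. If the data may contain decimal values, float would be used in place of int.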

You have a fair bit of flexibility in how you implement this program, but use good coding practice. For example, think about how to break your program into a number of functions instead of writing one “mega” function.

You are encouraged to use any of the functions we have discussed or developed in class, e.g., sum. You are prohibited, however, from using other statistics libraries, such as the Python statistics module, to compute mean, standard deviation, etc.

Frequency

The following is a function that attempts to print out the frequency of each item in the data:

def frequencies(data):
    """
    Attempts to print the frequency of each item in the list data
    
    Args:
    	data: List of "sortable" data items
    """
    data.sort()
    
    count = 0
    previous = data[0]

    print("data\tfrequency") # '\t' is the TAB character

    for d in data:
        if d == previous:
            # Same as the previous, increment the count for the run
            count += 1
        else:
            # We've found a different item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
        
        previous = d

For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5] it should print:

data    frequency
1       2
2       2
3       2
4       1
5       4
6       1

Unfortunately, the function has a bug and doesn’t do this. Copy and paste the above function into your program and then fix it so that it works properly. There are many ways to fix it; however, one straightforward way requires adding or changing only one line of code. Be careful to preserve the indentation when you paste!
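Before changing anything, it can help to run the function exactly as given and compare its output, line by line, with the expected table. The block below repeats the function unchanged (bug included, docstring shortened) and calls it on the example list; the variable name example is an assumption made for illustration.

```python
def frequencies(data):
    """Unmodified copy of the function above (still contains the bug)."""
    data.sort()

    count = 0
    previous = data[0]

    print("data\tfrequency") # '\t' is the TAB character

    for d in data:
        if d == previous:
            # Same as the previous, increment the count for the run
            count += 1
        else:
            # We've found a different item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1

        previous = d


# Run on the example list and compare with the six expected rows
example = [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5]
frequencies(example)
```

Count the rows that are printed and check each one against the expected output. It is also worth thinking about what data[0] does if the list is empty.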

Real Data

We’re providing two real-world data sets:

For each of these data sets, provide an analysis of the data using your program (one analysis for each data set, for a total of two analyses). For example, your two experiments might be:

Provide the output from your experiments in a multi-line comment at the beginning of your program file (after your names, etc.), together with one or two sentences describing your analysis and results. For example at the top of your file there should be a comment block like:

"""
Enter file to analyze: temps.txt
File contained 10 entries
Max: 47
Min: 21
Average: 31.0
Median: 30.0
Std. dev: 8.259674462242579

Analysis:
This data is the maximum temperature in Middlebury VT on March 7
in the years 2011-2020. The max temperature on this date has varied in
the last ten years from 21 to 47, with an average just below freezing.
"""

Your text will be ignored by Python (like any comment), but is readable by the grader. If you analyzed data about the movie Bio-Dome, you might include in your analysis comment something like “the standard deviation of ratings of Bio-Dome is nearly half the total range, suggesting that viewers disagreed on the quality.”

Creativity Points

You may earn up to 2 creativity points on this assignment by adding improvements to your program. If you do, include in your comment at the top of your program what you added. Below are some suggestions, but feel free to add your own:

Note: If you add additional statistics, print them after the required statistics. The testing scripts run by Gradescope automatically check for the required output and will fail if you change the required output lines. Similarly, do not add any additional input statements to data_analysis; the testing scripts would not be able to run.

When you’re done

Make sure that your program is properly commented:

In addition, make sure that you’ve used good coding style (including meaningful variable names, constants where relevant, vertical white space, etc.).

Submit your program via Gradescope. Your program file must be named lab5_data_analysis.py. You can submit multiple times; only the most recent submission (before the due date) will be graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

Grading

Feature                                Points
File reading                           3
Basic statistics (num, max, min, avg)  3
Median                                 2
Standard deviation                     4
Frequency                              3
Data analyses                          5
Comments, style                        3
Creativity points                      2
Total                                  25