Lab 5: Data for everyone
Due: 8:00:00AM on 2022-10-21

Background

For this lab, you are going to implement some initial data analysis functions and then analyze some real data.

You are allowed (and encouraged) to work in pairs on this assignment. If you do, you must do all of the work together. I advocate a Pair Programming approach in which one partner is the driver writing the code and the other is the navigator reviewing the code as the driver types. Only turn in one copy of the assignment, but make sure both of your names are in the comments at the top of the file and you add your partner to your Gradescope submission (as described at the end of the assignment).

Creating/Editing Text Files

For this lab, we will be reading data from files. The files must be of type “.txt” (no Microsoft Word or Google Docs files, etc.). Also, recall from class that the files must be saved in the same directory as the Python program that will be reading them. There are many ways to create such files, but one easy way is to use Thonny: create a new file and then put whatever text data you want in it (e.g., a list of numbers). When you save the file, use a “.txt” extension, e.g., “test.txt”, and save it in the same folder as your Python program.

Specifications

Download the starter file.

Right-click on the above link, select “Save as…”, and save the file for this assignment. Then open the saved file with Thonny. Add your code in the places specified by the comments. Do not delete the if __name__ == '__main__': statement; it is needed for Gradescope.

At a minimum, write a function named data_analysis, with no parameters, that prompts the user for the name of a data file that contains one integer per line and then prints the following statistics about the file:

  1. The number of entries in the file
  2. The largest value in the file
  3. The smallest value in the file
  4. The average of the values in the file (recall we saw this in class)
  5. The median of the values in the file. For an odd number of data elements, the median is the “middle” item if the data were in sorted order. For an even number of data elements, the median is the average of the two middle items if the data were in order.
  6. The corrected sample standard deviation of the values in the file. The corrected sample standard deviation of N data points is defined as:

    $$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_{i} - \bar{x})^2}$$

    that is, the square root of the following quotient: the sum of the squared differences between each data point and the average, divided by one less than the number of data points.
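
As a worked illustration with a small made-up data set (the values 1, 2, 4, 9 are an assumption chosen for easy arithmetic, not data from the lab), N = 4 is even, so the median is the average of the two middle items, and the standard deviation follows the formula above:

$$\text{median} = \frac{2 + 4}{2} = 3, \qquad \bar{x} = \frac{1 + 2 + 4 + 9}{4} = 4$$

$$s = \sqrt{\frac{(1-4)^2 + (2-4)^2 + (4-4)^2 + (9-4)^2}{4-1}} = \sqrt{\frac{38}{3}} \approx 3.559$$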

For full points on code design and style, define functions for significant separate tasks above (e.g., reading from a file, computing median of a list, computing corrected sample standard deviation) rather than including all functionality in the data_analysis function.

The required output must be formatted as follows:

File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962

Note: Your program should handle a file with zero lines (that is, no data) without an error. In such a case you should only print:

File contained 0 entries

Your program should also handle, without an error, files with only one line (i.e., one data point), for which the standard deviation is not defined. In such a case, you should neither compute nor print the standard deviation.
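
One way to picture these two special cases is as a decision about which statistics apply for a given number of entries. The sketch below is only an illustration: the helper name statistics_to_print is hypothetical (it is not part of the starter file), and your own program may organize this logic differently.

```python
def statistics_to_print(count):
    """
    Hypothetical helper: list the statistics that should be printed
    after the "File contained ... entries" line.

    Args:
        count: Number of entries read from the file
    """
    if count == 0:
        # No data at all: only the entry count is printed
        return []

    labels = ["Max", "Min", "Average", "Median"]
    if count > 1:
        # The N - 1 denominator is zero for a single data point,
        # so the standard deviation is only defined for two or more
        labels.append("Std. dev")
    return labels
```

For a one-line file, this sketch lists every statistic except the standard deviation.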

At a minimum your program should also include:

Guide

Data Basics: Summary Statistics

For example, if I have a file called “temps.txt” that contains the following:

34
71
32
34
25
33
34
24
35
41
27

(this data is the recorded high temperatures in Middlebury VT on March 19 in the years 2011 through 2021 from World Weather Online), then a run of the program would output:

Enter file to analyze: temps.txt
File contained 11 entries
Max: 71
Min: 24
Average: 35.45454545454545
Median: 34
Std. dev: 12.769993236988313

Your program should only read data from the file once. Like the examples in class, read the data from the file once, then store the values in a list and use that list to calculate the various statistics.
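
A minimal sketch of the read-once approach might look like the following. The function name read_integers is hypothetical, and skipping blank lines is an assumption made here so that a stray empty line does not cause an error; the version developed in class may differ.

```python
def read_integers(filename):
    """
    Hypothetical helper: read a file containing one integer per line.

    Args:
        filename: Name of the file to read

    Returns:
        List of the integers in the file, in file order
    """
    values = []
    with open(filename, "r") as data_file:
        for line in data_file:
            text = line.strip()
            # Assumption: ignore blank lines instead of failing on int("")
            if text != "":
                values.append(int(text))
    return values
```

Once the values are in a list, every statistic can be computed from that list (e.g., with sum, min, and max) without opening the file again.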

You have a fair bit of flexibility in how you implement this program, but use good coding practice. For example, think about how to break your program into a number of functions instead of writing one “mega” function.

You are welcome, and in fact encouraged, to use any of the functions we have discussed or developed in class, e.g., sum, min, max. You are prohibited, however, from using other statistics libraries, such as the Python statistics module, to compute mean, standard deviation, etc.

Frequency

The following is a function that attempts to print out the frequency of each item in the data:

def frequencies(data):
    """
    Attempts to print the frequency of each item in the list data
    
    Args:
    	data: List of "sortable" data items
    """
    data.sort()
    
    count = 0
    previous = data[0]

    print("data\tfrequency") # '\t' is the TAB character

    for d in data:
        if d == previous:
            # Same as the previous, increment the count for the run
            count += 1
        else:
            # We've found a different item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
        
        previous = d

For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5] frequencies should print:

data    frequency
1       2
2       2
3       2
4       1
5       4
6       1

Unfortunately, the function has a bug and doesn’t do this. Copy and paste the above function into your program and then fix it so that it works properly. There are many ways to fix it; however, one straightforward way requires adding or changing only one line of code. Be careful when you paste to preserve the indentation!

Your program does not need to invoke the frequencies function automatically in the way that data_analysis is invoked automatically when you click the green arrow in Thonny. Gradescope will test it separately. If you want frequencies to run automatically along with data_analysis, one way to do so is to modify your data_analysis function to return a list of the data and use that list as the argument to frequencies in the if __name__ == "__main__" conditional, e.g.,

if __name__ == "__main__":
    data = data_analysis()
    frequencies(data)

Real Data

I’m providing two real-world data sets:

For each of these data sets, provide an analysis of the data using your program (one analysis for each data set, for a total of two analyses). For example, your two experiments might be:

Provide the output from your experiments in a multi-line comment at the beginning of your program file (after your names, etc.), together with one or two sentences describing your analysis and results. For example at the top of your file there should be a comment block like:

"""
Enter file to analyze: test.txt
File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962

Here are a few sentences
about my analyses
on several lines.
"""

Your text will be ignored by Python (like any comment), but is readable by the grader. For example, you might say something like “the standard deviation of ratings of Bio-Dome is nearly half the total range suggesting that viewers disagreed on the quality.”

Creativity Points

You may earn up to 2 creativity points on this assignment by adding improvements to your program. If you do, include in your comment at the top of your program what you added. Below are some suggestions, but feel free to add your own:

Note: If you add additional statistics, print them after the required statistics. Gradescope will automatically test for the required output and will get confused if you change the required output lines. Similarly do not add any additional input statements to data_analysis. Gradescope won’t know to expect those questions and so your program will hang and/or the tests will fail.

When you’re done

Make sure that your program is properly commented:

In addition, make sure that you’ve used good code design and style (including meaningful variable names, constants where relevant, vertical white space, etc.).

Submit your program via Gradescope. Your program file must be named lab5_data_analysis.py. You can submit multiple times, with only the most recent submission (before the due date) graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

If you worked with a partner, only one person needs to submit to Gradescope, but that person does need to add their partner’s name as shown in Gradescope documentation. Make sure both names are included in the comment at the top of the file.

Grading

Features                                Points
File reading                            3
Basic statistics (num, max, min, avg)   3
Median                                  2
Standard deviation                      4
Frequency                               3
Data analyses                           5
Code design and style                   5
Creativity points                       2
Total                                   27

FAQ Excerpts

Use specified formula for standard deviation

There are multiple ways to calculate the standard deviation. The lab explicitly specifies the formula you should implement. If your results don’t match the examples in the lab, make sure you are implementing the specified formula. Specifically, notice that the denominator is not the number of observations, but the number of observations less one.
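
To see how much the choice of denominator matters, the sketch below computes both versions for the temps.txt data from the lab. It is only an illustration: the variable names are made up here, and the data is written directly in a list rather than read from a file.

```python
import math

# High temperatures from the temps.txt example in the lab
temps = [34, 71, 32, 34, 25, 33, 34, 24, 35, 41, 27]

mean = sum(temps) / len(temps)
squared_diffs = sum((t - mean) ** 2 for t in temps)

# Corrected sample standard deviation: divide by N - 1 (what the lab asks for)
corrected = math.sqrt(squared_diffs / (len(temps) - 1))

# Uncorrected version: divide by N (NOT what the lab asks for)
uncorrected = math.sqrt(squared_diffs / len(temps))

print("N - 1 denominator:", corrected)
print("N denominator:", uncorrected)
```

The N - 1 version comes out at approximately 12.77, in line with the example output in the lab, while the N version comes out at approximately 12.18. If your program reports the smaller number, check the denominator.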