Programming Assignment 5: Data for everyone

Initial Due Date: 2024-10-17 8:00AM
Final Due Date: 2024-10-31 4:15PM

Background

For this assignment, you are going to implement some initial data analysis functions and then apply those functions to some real data.

You are allowed (and encouraged) to work in pairs on this assignment. If you do, you must do all of the work together. I advocate a Pair Programming approach in which one partner is the driver writing the code and the other is the navigator reviewing the code as the driver types. Only turn in one copy of the assignment, but make sure both of your names are in the comments at the top of the file and you add your partner to your Gradescope submission (as described at the end of the assignment).

Creating/Editing Text Files

For this assignment, we will be reading data from files. The files must be of type “.txt” (no Microsoft Word or Google Docs files, etc.). Also, recall from class that the files must be saved in the same directory as the Python program that will be reading them. There are many ways to do this, but one easy way is to create them using Thonny. To do this, create a new file and then put whatever text data you want in the file (e.g., a list of numbers). When you save the file, use a “.txt” extension, e.g., “test.txt”, and save it in the same folder as your Python program.

Specifications

Download the starter file.

Right-click on the above link, select “Save as…” or “Download as…”, and save the file for this assignment. Then open the saved file with Thonny. Add your code in the places specified by the comments. Do not delete the if __name__ == '__main__': statement; it is needed for the Gradescope tests to work correctly.

At a minimum, write a function named data_analysis, with no parameters, that prompts the user for the name of a data file that contains one integer per line and then prints the following statistics about the file:

The number of entries in the file
The largest value in the file
The smallest value in the file
The average of the values in the file (recall we saw this in class)
The median of the values in the file. For an odd number of data elements, the median is the “middle” item if the data were in sorted order. For an even number of data elements, the median is the average of the two middle items if the data were in sorted order.
The corrected sample standard deviation of the values in the file. The corrected sample standard deviation of N data points is defined as:

\[s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_{i} - \bar{x})^2}\]

that is, the square root of the sum of the squared differences between the data points and their average divided by one less than the number of data points.
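
As a worked illustration with made-up data (not one of the assignment files), take the eight values 2, 4, 4, 4, 5, 5, 7, 9. Their average is 40/8 = 5, so the sum of squared differences and the resulting corrected sample standard deviation are:

\[\sum_{i=1}^{8} (x_{i} - \bar{x})^2 = 9 + 1 + 1 + 1 + 0 + 0 + 4 + 16 = 32\]

\[s = \sqrt{\frac{32}{8-1}} = \sqrt{\frac{32}{7}} \approx 2.138\]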

For “exemplary” style, define functions for the significant separate tasks above (e.g., reading from a file, computing median of a list, computing corrected sample standard deviation) rather than including all functionality in the data_analysis function.
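
One possible shape for such a helper is sketched below for the median only. The function name median and the assumption that the list is non-empty are illustrative choices, not requirements of the assignment; the caller would need to handle the empty-file case separately.

```python
def median(values):
    """
    Returns the median of a non-empty list of numbers.

    Args:
        values: Non-empty list of numbers (assumption: caller checks for empty)

    Returns:
        The middle item for an odd-length list, or the average of the
        two middle items for an even-length list
    """
    ordered = sorted(values)  # sorted() returns a new list; values is unchanged
    middle = len(ordered) // 2

    if len(ordered) % 2 == 1:
        # Odd number of items: a single middle item
        return ordered[middle]

    # Even number of items: average the two middle items
    return (ordered[middle - 1] + ordered[middle]) / 2
```

With this sketch, median([3, 1, 2]) gives 2 and median([4, 1, 3, 2]) gives 2.5.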

The required output must be formatted as follows:

File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962

Note: Your program should handle a file with zero lines, that is, no data, without an error. In such a case you should only print:

File contained 0 entries

Similarly, your program should handle files with only one line, i.e., one data point, without an error. In such a case, there is no valid standard deviation, so you should neither compute nor print the standard deviation.

At a minimum your program should also include:

A corrected frequencies function (see guide section). Your program does not need to invoke the frequencies function automatically.
Two analyses of the real-world data provided in the guide section (as comments in the same file)

Guide

Data Basics: Summary Statistics

For example, if I have a file named “temps.txt” that contains the following:

(this data is the recorded high temperatures in Middlebury VT on March 19 in the years 2011 through 2021 from World Weather Online), then a run of the program would output:

Enter file to analyze: temps.txt
File contained 11 entries
Max: 71
Min: 24
Average: 35.45454545454545
Median: 34
Std. dev: 12.769993236988313

Your program should only read data from the file once. Like the examples in class, read the data from the file once, then store the values in a list and use that list to calculate the various statistics.
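
A minimal sketch of this read-once pattern follows, assuming a file with one integer per line. The helper name read_integers is hypothetical, and skipping blank lines is an assumption made here for robustness rather than something the assignment requires.

```python
def read_integers(filename):
    """
    Reads a file containing one integer per line into a list.

    Args:
        filename: Name of the text file to read (hypothetical helper)

    Returns:
        List of the integers in the file, in file order
    """
    values = []
    with open(filename, "r") as file:
        for line in file:
            line = line.strip()  # remove the trailing newline and spaces
            if line != "":       # assumption: ignore blank lines
                values.append(int(line))
    return values
```

The file is opened and read exactly once; every statistic can then be computed from the returned list, for example with len, min, max, and sum.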

You have a fair bit of flexibility in how you implement this program, but use good coding practice. For example, think about how to break your program into a number of functions instead of writing one “mega” function.

You are welcome, and in fact encouraged, to use any of the functions we have discussed or developed in class, e.g., sum, min, max. You are prohibited, however, from using other statistics libraries, such as the Python statistics module, to compute mean, standard deviation, etc.

Frequency

The following is a function that attempts to print out the frequency of each item in the data:

def frequencies(data):
    """
    Attempts to print the frequency of each item in the list data
    
    Args:
        data: List of "sortable" data items
    """
    data.sort()
    
    count = 0
    previous = data[0]

    print("data\tfrequency") # '\t' is the TAB character

    for d in data:
        if d == previous:
            # Same as the previous, increment the count for the run
            count += 1
        else:
            # We've found a different item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
        
        previous = d

For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5] frequencies should print:

data    frequency
1       2
2       2
3       2
4       1
5       4
6       1

Unfortunately, the function has a bug and doesn’t do this. Copy and paste the above function into your program and then fix it so that it works correctly. There are many ways to fix it; however, one straightforward way requires adding or changing only one line of code. Be careful when you paste to preserve the indentation!

Your program does not need to invoke the frequencies function automatically in the way that data_analysis is invoked automatically when you click the green arrow in Thonny. Gradescope will test it separately. If you want frequencies to run automatically along with data_analysis, one way to do so is to modify your data_analysis function to return a list of the data and use that list as the argument to frequencies in the if __name__ == "__main__" conditional, e.g.,

if __name__ == "__main__":
    data = data_analysis()
    frequencies(data)

Real Data

I’m providing two real-world data sets:

Data from the Northeast region of the US collected in the 1995 census. There are three files corresponding to the age, number of kids and income of the surveyed participants. You can download any of these by right clicking and selecting “Save link as…” (or something similar depending on the browser).
Ratings for three different movies (if you’re curious, this is a snippet of data from movielens). This data is extracted from the 3movie_reviews.xlsx Excel file. If you want different combinations of this data, you’ll need to open this file in Excel, copy the data you’re interested in, and paste it into a new file in Thonny that you save as a “.txt” file.

For each of these data sets, provide an analysis of the data using your program (one analysis for each data set, for a total of two analyses). For example, your two experiments might be:

Compute the summary statistics for income in 1995
Compare the summary statistics for two different movies

Provide the output from your experiments in a multi-line comment at the beginning of your program file (after your names, etc.), together with one or two sentences describing your analysis and results. For example at the top of your file there should be a comment block like:

"""
Enter file to analyze: test.txt
File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962

Here are a few sentences
about my analyses
on several lines.
"""

Your text will be ignored by Python (like any comment), but is readable by the grader. For example, you might say something like “the standard deviation of ratings of Bio-Dome is nearly half the total range suggesting that viewers disagreed on the quality.”

Creativity suggestions

Here are some possible creativity additions, although you are encouraged to include your own ideas. Make sure to document your additions in the docstring comment at the top of the file.

[1-2 points] Add a calculation of the mode of your data, that is the most frequently occurring element in the data. Notice this should have a similar feel to the frequencies function. To earn the full points, identify all modal values when there is not a unique mode.
[0.5-1 points] Do additional experiments/analysis of the real world data sets. You’ll be scored based on how creative you are. For example, just running an additional census file through the program will only earn minimum points.
[0.5-1 points] Add additional statistics to your analysis function.
[? points] Add your own ideas. Points will be awarded based on difficulty and innovation.

Note: If you add additional statistics, print them after the required statistics. Gradescope will automatically test for the required output and will get confused if you change the required output lines. Similarly, do not add any additional input statements to data_analysis. Gradescope won’t know to expect those prompts, so your program will hang and/or the tests will fail.

When you’re done

Make sure that your program is properly documented:

You should have a comment at the very beginning of the file with your name, section, and a listing of your creativity additions.
Each function should have an appropriate docstring (including arguments and return value if applicable).
Other miscellaneous inline/block comments if the code might otherwise be unclear.

In addition, make sure that you’ve used good code design and style (including helper functions where useful, meaningful variable names, constants where relevant, vertical white space, removing “dead code” that doesn’t do anything, removing testing code, etc.).

Submit your program via Gradescope. Your program file must be named pa5_data_analysis.py. You can submit multiple times, with only the most recent submission (before the due date) graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

If you worked with a partner, only one person needs to submit to Gradescope, but that person does need to add their partner’s name as shown in Gradescope documentation. Make sure both names are included in the comment at the top of the file.

Grading

Assessment	Requirements
Revision needed	Some but not all tests are passing.
Meets Expectations	All tests pass, the required functions are implemented correctly and your implementation uses satisfactory style.
Exemplary	All requirements for Meets Expectations, 2 creativity points, and your implementation is clear, concise, readily understood, and maintainable.

FAQ

Use the specified formula for standard deviation

There are multiple ways to calculate the standard deviation. The assignment explicitly specifies the formula you should implement. If your results don’t match the examples, make sure you are implementing the specified formula. Specifically, notice that the denominator is not the number of observations, but the number of observations less one.
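
To see how much the choice of denominator matters, the sketch below computes both versions for a small made-up list (not one of the assignment files). It assumes math.sqrt is acceptable for the square root; the variable names are illustrative only.

```python
import math

# Made-up illustration data (not one of the assignment files).
data = [2, 4, 4, 4, 5, 5, 7, 9]

average = sum(data) / len(data)

# Sum of the squared differences between each value and the average.
squared_diffs = 0
for x in data:
    squared_diffs += (x - average) ** 2

# Dividing by N gives the uncorrected (population) standard deviation.
uncorrected = math.sqrt(squared_diffs / len(data))

# Dividing by N - 1 gives the corrected sample standard deviation,
# which is the version this assignment specifies.
corrected = math.sqrt(squared_diffs / (len(data) - 1))

print("Dividing by N:    ", uncorrected)
print("Dividing by N - 1:", corrected)
```

For this list the first value is 2.0 and the second is roughly 2.138, so a result that is slightly too small compared with the examples is a sign that the denominator is N rather than N - 1.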