Background

For this lab, you are going to implement some initial data analysis functions and then analyze some real data.

You are allowed (and encouraged) to work in pairs on this assignment. If you do, you must do all of the work together. I advocate a Pair Programming approach in which one partner is the driver writing the code and the other is the navigator reviewing the code as the driver types. Only turn in one copy of the assignment, but make sure both of your names are in the comments at the top of the file and you add your partner to your Gradescope submission (as described at the end of the assignment).

Creating/Editing Text Files

For this lab, we will be reading data from files. The files must be of type “.txt” (no Microsoft Word or Google Docs files, etc.). Also, recall from class that the files must be saved in the same directory as the Python program that will be reading them. There are many ways to do this, but one easy way is to create them using Thonny. To do this, create a new file and then put whatever text data you want in the file (e.g., a list of numbers). When you save the file, use a “.txt” extension, e.g., “test.txt”, and save it in the same folder as your Python program.
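
If you would rather generate a small test file from Python than type it by hand, the following is a minimal sketch. The helper name write_test_file is hypothetical and not part of the starter code; it simply writes one integer per line, which is the format this lab expects.

```python
def write_test_file(filename, values):
    """
    Write each item of values on its own line of the file named filename.

    Args:
        filename: Name of the text file to create (overwritten if it exists)
        values: List of integers to write, one per line
    """
    with open(filename, "w") as out_file:
        for value in values:
            out_file.write(str(value) + "\n")
```

For example, calling write_test_file("test.txt", [3, 1, 4, 1, 5]) would create a five-line file named “test.txt” in the current working directory.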

Specifications

Right-click on the above link, select “Save as…”, and save the file for this assignment. Then open the saved file with Thonny. Add your code in the places specified by the comments. Do not delete the if __name__ == '__main__': statement; it is needed for Gradescope.

At a minimum write a function named data_analysis, with no parameters, that prompts the user for the name of a data file that contains one integer per line and then prints the following statistics about the file:

1. The number of entries in the file
2. The largest value in the file
3. The smallest value in the file
4. The average of the values in the file (recall we saw this in class)
5. The median of the values in the file. For an odd number of data elements, the median is the “middle” item if the data were in sorted order. For an even number of data elements, the median is the average of the two middle items if the data were in order.
6. The corrected sample standard deviation of the values in the file. The corrected sample standard deviation of N data points is defined as:

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_{i} - \bar{x})^2}$$

that is, sum the squared differences between each data point and the average, divide that sum by one less than the number of data points, and take the square root of the result.
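
As a worked illustration with made-up data (not from the lab files), take the four values 1, 3, 5, 7. Their average is 4 and, since the count is even, the median is the average of the two middle values, (3 + 5) / 2 = 4. The corrected sample standard deviation is then:

$$s = \sqrt{\frac{(1-4)^2 + (3-4)^2 + (5-4)^2 + (7-4)^2}{4-1}} = \sqrt{\frac{9 + 1 + 1 + 9}{3}} = \sqrt{\frac{20}{3}} \approx 2.582$$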

For full points on code design and style, define functions for significant separate tasks above (e.g., reading from a file, computing median of a list, computing corrected sample standard deviation) rather than including all functionality in the data_analysis function.

The required output must be formatted as follows:

File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962


Note: Your program should handle a file with zero lines (that is, no data) without an error. In that case, print only:

File contained 0 entries
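
One way to handle the empty case is to print the count first and return before computing anything else, since built-ins such as max and min raise an error on an empty list. The sketch below is only an illustration: the function name print_summary is hypothetical, and it prints just the first few required lines.

```python
def print_summary(data):
    """
    Print the entry count, then max and min only if data is non-empty.

    Args:
        data: List of integers (possibly empty)
    """
    print("File contained " + str(len(data)) + " entries")

    if len(data) == 0:
        # No data, so no other statistic is defined; stop here
        return

    print("Max: " + str(max(data)))
    print("Min: " + str(min(data)))
```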


At a minimum your program should also include:

• A corrected frequencies function (see guide section). Your program does not need to invoke the frequencies function automatically.
• Two analyses of the real-world data provided in the guide section (as comments in the same file)

Guide

Data Basics: Summary Statistics

For example, if I have a file called “temps.txt” that contains the following:

34
71
32
34
25
33
34
24
35
41
27


(this data is the recorded high temperatures in Middlebury VT on March 19 in the years 2011 through 2021 from World Weather Online), then a run of the program would output:

Enter file to analyze: temps.txt
File contained 11 entries
Max: 71
Min: 24
Average: 35.45454545454545
Median: 34
Std. dev: 12.769993236988313


Your program should only read data from the file once. Like the examples in class, read the data from the file once, then store the values in a list and use that list to calculate the various statistics.
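
A minimal sketch of the read-once approach is shown below. The function name read_integers is hypothetical, and skipping blank lines is an assumption about how you might want to treat stray empty lines; the version developed in class may differ.

```python
def read_integers(filename):
    """
    Read a file containing one integer per line.

    Args:
        filename: Name of the file to read

    Returns:
        List of the integers in the file, in file order
    """
    data = []
    with open(filename, "r") as data_file:
        for line in data_file:
            # Assumption: ignore blank lines rather than failing on them
            if line.strip() != "":
                data.append(int(line))
    return data
```

Every statistic can then be computed from the returned list without opening the file again.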

You have a fair bit of flexibility in how you implement this program, but use good coding practice. For example, think about how to break your program into a number of functions instead of writing one “mega” function.

You are welcome, and in fact encouraged, to use any of the functions we have discussed or developed in class, e.g., sum, min, max. You are prohibited, however, from using other statistics libraries, such as the Python statistics module, to compute mean, standard deviation, etc.
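
As a quick illustration of the built-in functions sum, min, max and len, here they are applied to the temps.txt values from the example above, stored directly in a list. The variable names are illustrative only.

```python
temps = [34, 71, 32, 34, 25, 33, 34, 24, 35, 41, 27]

largest = max(temps)               # largest value in the list
smallest = min(temps)              # smallest value in the list
average = sum(temps) / len(temps)  # total divided by the number of entries

print("Max: " + str(largest))
print("Min: " + str(smallest))
print("Average: " + str(average))
```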

Frequency

The following is a function that attempts to print out the frequency of each item in the data:

def frequencies(data):
    """
    Attempts to print the frequency of each item in the list data

    Args:
        data: List of "sortable" data items
    """
    data.sort()

    count = 0
    previous = data[0]

    print("data\tfrequency") # '\t' is the TAB character

    for d in data:
        if d == previous:
            # Same as the previous, increment the count for the run
            count += 1
        else:
            # We've found a different item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1

        previous = d


For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5] frequencies should print:

data    frequency
1       2
2       2
3       2
4       1
5       4
6       1


Unfortunately the function has a bug and doesn’t do this. Copy and paste the above function into your program and then fix it so that it works properly. There are many ways to fix it, however, one straightforward way will only require adding/changing one line of code. Be careful when you paste to preserve the indentation!

Your program does not need to invoke the frequencies function automatically in the way that data_analysis is invoked automatically when you click the green arrow in Thonny. Gradescope will test it separately. If you want frequencies to run automatically along with data_analysis, one way to do so is to modify your data_analysis function to return a list of the data and use that list as the argument to frequencies in the if __name__ == "__main__" conditional, e.g.,

if __name__ == "__main__":
    data = data_analysis()
    frequencies(data)


Real Data

I’m providing two real-world data sets:

• 95census is a folder containing some data from the Northeast region of the US from the 1995 census. There are three files in the folder corresponding to the age, number of kids and income of the surveyed participants. You can download any of these by right clicking and selecting “Save link as…” (or something similar depending on the browser).
• movie_reviews is a folder containing ratings for three different movies (if you’re curious, this is a snippet of data from movielens). This data is extracted from the 3movie_reviews.xlsx Excel file. If you want different combinations of this data, you’ll need to open this file in Excel, copy the data you’re interested in and paste it into a new file in Thonny that you save as a “.txt” file.

For each of these data sets, provide an analysis of the data using your program (one analysis for each data set, for a total of two analyses). For example, your two experiments might be:

• Compute the summary statistics for income in 1995
• Compare the summary statistics for two different movies

Provide the output from your experiments in a multi-line comment at the beginning of your program file (after your names, etc.), together with one or two sentences describing your analysis and results. For example at the top of your file there should be a comment block like:

"""
Enter file to analyze: test.txt
File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962

Here are a few sentences
on several lines.
"""


Your text will be ignored by Python (like any comment), but is readable by the grader. For example, you might say something like “the standard deviation of ratings of Bio-Dome is nearly half the total range suggesting that viewers disagreed on the quality.”

Creativity Points

You may earn up to 2 creativity points on this assignment by adding improvements to your program. If you do, include in your comment at the top of your program what you added. Below are some suggestions, but feel free to add your own:

• [1-2 points] Add a calculation of the mode of your data, that is the most frequently occurring element in the data. Notice this should have a similar feel to the frequencies function. To earn the full points, identify all modal values when there is not a unique mode.
• [0.5-1 points] Do additional experiments/analysis of the real world data sets. You’ll be scored based on how creative you are. For example, just running an additional census file through the program will only earn minimum points.
• [? points] Add your own ideas. Points will be awarded based on difficulty and innovation.

Note: If you add additional statistics, print them after the required statistics. Gradescope will automatically test for the required output and will get confused if you change the required output lines. Similarly, do not add any additional input statements to data_analysis. Gradescope won’t know to expect those questions, so your program will hang and/or the tests will fail.

When you’re done

Make sure that your program is properly commented:

• You should have comments at the very beginning of the file stating your name(s), section and creativity additions.
• Each function should have an appropriate docstring (including arguments and return value if applicable).
• Other miscellaneous comments to make things clear

In addition, make sure that you’ve used good code design and style (including meaningful variable names, constants where relevant, vertical white space, etc.).

Submit your program via Gradescope. Your program file must be named lab5_data_analysis.py. You can submit multiple times, with only the most recent submission (before the due date) graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.

If you worked with a partner, only one person needs to submit to Gradescope, but that person does need to add their partner’s name as shown in Gradescope documentation. Make sure both names are included in the comment at the top of the file.

Features                                 Points
Basic statistics (num, max, min, avg)    3
Median                                   2
Standard deviation                       4
Frequency                                3
Data analyses                            5
Code design and style                    5
Creativity points                        2
Total                                    27