For this lab, you are going to implement some initial data analysis functions and then analyze some real data.
You are allowed (and encouraged) to work in pairs on this assignment. If you do, you must do all of the work together. I advocate a Pair Programming approach in which one partner is the driver writing the code and the other is the navigator reviewing the code as the driver types. Only turn in one copy of the assignment, but make sure both of your names are in the comments at the top of the file and you add your partner to your Gradescope submission (as described at the end of the assignment).
For this lab, we will be reading data from files. The files must be of type “.txt” (no Microsoft Word or Google Docs files, etc.). Also, recall from class that the files must be saved in the same directory as the Python program that will be reading them. There are many ways to do this, but one easy way is to create them using Thonny. To do this, create a new file and then put whatever text data you want in the file (e.g., a list of numbers). When you save the file, use a “.txt” extension, e.g., “test.txt”, and save it in the same folder as your Python program.
Right-click on the above link, select “Save as…”, and save the file for this assignment. Then open the saved file with Thonny. Add your code in the places specified by the comments. Do not delete the if __name__ == '__main__': statement; it is needed for Gradescope.
At a minimum, write a function named data_analysis, with no parameters, that prompts the user for the name of a data file that contains one integer per line and then prints the following statistics about the file:
The corrected sample standard deviation of the values in the file. The corrected sample standard deviation of N data points is defined as:
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
where $x_i$ are the data points and $\bar{x}$ is their average; that is, the square root of the sum of the squared differences between the data points and their average divided by one less than the number of data points.
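As an illustration of this definition, here is a minimal sketch rather than a required implementation. The function name corrected_std_dev and the sample list are hypothetical, and the sketch assumes the list holds at least two values.

```python
def corrected_std_dev(values):
    """Corrected sample standard deviation; assumes len(values) >= 2."""
    n = len(values)
    average = sum(values) / n
    # Sum of the squared differences between each data point and the average
    squared_diffs = 0
    for v in values:
        squared_diffs += (v - average) ** 2
    # Divide by one less than the number of data points, then take the square root
    return (squared_diffs / (n - 1)) ** 0.5


# Hypothetical data: the average is 5 and the squared differences sum to 32,
# so the result is the square root of 32 / 7.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print("Std. dev:", corrected_std_dev(sample))
```

Dividing by n instead of n - 1 would give the uncorrected (population) standard deviation, which will not match the examples in this lab.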
For full points on code design and style, define functions for significant separate tasks above (e.g., reading from a file, computing median of a list, computing corrected sample standard deviation) rather than including all functionality in the data_analysis function.
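As one example of such a helper, the following is a minimal sketch of a median function. The name median_of is hypothetical, and the sketch assumes the usual convention that an even-length list uses the average of its two middle values (consistent with a median of 5.5 for a file of 10 entries).

```python
def median_of(values):
    """Return the median of a non-empty list without modifying the list."""
    ordered = sorted(values)  # sorted() returns a new list
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        # Odd number of items: the single middle value
        return ordered[middle]
    # Even number of items: average of the two middle values
    return (ordered[middle - 1] + ordered[middle]) / 2
```

Sorting a copy is a design choice here so that the caller's list keeps its original order; sorting in place would also work if no later step depends on that order.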
The required output must be formatted as follows:
File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962
Note: Your program should handle a file with zero lines, that is, no data, without an error. In such a case you should only print:
File contained 0 entries
Your program should also handle, without an error, files with only one line, i.e., one data point, which do not have a valid standard deviation. In such a case, you should neither compute nor print the standard deviation.
At a minimum your program should also include a frequencies function (see guide section). Your program does not need to invoke the frequencies function automatically.
For example, if I have a file called “temps.txt” that contains the following:
34
71
32
34
25
33
34
24
35
41
27
(this data is the recorded high temperatures in Middlebury VT on March 19 in the years 2011 through 2021 from World Weather Online), then a run of the program would output:
Enter file to analyze: temps.txt
File contained 11 entries
Max: 71
Min: 24
Average: 35.45454545454545
Median: 34
Std. dev: 12.769993236988313
Your program should only read data from the file once. Like the examples in class, read the data from the file once, then store the values in a list and use that list to calculate the various statistics.
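The read-once pattern can be sketched as follows. This is a minimal illustration, not the required implementation: the function name read_integers is hypothetical, and skipping blank lines is an assumption about how a trailing empty line should be treated.

```python
def read_integers(filename):
    """Read one integer per line from filename and return them as a list."""
    values = []
    with open(filename, "r") as data_file:
        for line in data_file:
            line = line.strip()
            if line != "":  # assumption: ignore blank lines
                values.append(int(line))
    return values
```

With this approach, the file is opened a single time, and every statistic (count, max, min, average, and so on) is computed from the returned list. An empty file yields an empty list, which the caller can check with len before computing anything else.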
You have a fair bit of flexibility in how you implement this program, but use good coding practice. For example, think about how to break your program into a number of functions instead of writing one “mega” function.
You are welcome to, and in fact encouraged to, use any of the functions we have discussed or developed in class, e.g., sum, min, max. You are prohibited, however, from using other statistics libraries, such as the Python statistics module, to compute mean, standard deviation, etc.
The following is a function that attempts to print out the frequency of each item in the data:
def frequencies(data):
    """
    Attempts to print the frequency of each item in the list data
    Args:
        data: List of "sortable" data items
    """
    data.sort()
    count = 0
    previous = data[0]
    print("data\tfrequency") # '\t' is the TAB character
    for d in data:
        if d == previous:
            # Same as the previous, increment the count for the run
            count += 1
        else:
            # We've found a different item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
            previous = d
For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5], frequencies should print:
data frequency
1 2
2 2
3 2
4 1
5 4
6 1
Unfortunately the function has a bug and doesn’t do this. Copy and paste the above function into your program and then fix it so that it works properly. There are many ways to fix it, however, one straightforward way will only require adding/changing one line of code. Be careful when you paste to preserve the indentation!
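When testing your fix, it can help to compare against counts computed a different way. The following sketch is not the intended one-line fix; it swaps in a dictionary-based count purely as an independent cross-check. The function name count_occurrences is hypothetical, and the use of a dictionary is an assumption about what you are comfortable with.

```python
def count_occurrences(data):
    """Return a dictionary mapping each item in data to how often it appears."""
    counts = {}
    for item in data:
        if item in counts:
            counts[item] += 1
        else:
            counts[item] = 1
    return counts


example = [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5]
counts = count_occurrences(example)
# Print in sorted order so the rows line up with the expected table
for item in sorted(counts):
    print(str(item) + "\t" + str(counts[item]))
```

If the rows printed by your fixed frequencies function differ from these counts, in particular for the last item in sorted order, the bug is likely still present.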
Your program does not need to invoke the frequencies function automatically in the way that data_analysis is invoked automatically when you click the green arrow in Thonny. Gradescope will test it separately. If you want frequencies to run automatically along with data_analysis, one way to do so is to modify your data_analysis function to return a list of the data and use that list as the argument to frequencies in the if __name__ == "__main__" conditional, e.g.,
if __name__ == "__main__":
    data = data_analysis()
    frequencies(data)
I’m providing two real-world data sets:
For each of these data sets, provide an analysis of the data using your program (one analysis for each data set, for a total of two analyses). For example, your two experiments might be:
Provide the output from your experiments in a multi-line comment at the beginning of your program file (after your names, etc.), together with one or two sentences describing your analysis and results. For example at the top of your file there should be a comment block like:
"""
Enter file to analyze: test.txt
File contained 10 entries
Max: 10
Min: 1
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962
Here are a few sentences
about my analyses
on several lines.
"""
Your text will be ignored by Python (like any comment), but is readable by the grader. For example, you might say something like “the standard deviation of ratings of Bio-Dome is nearly half the total range suggesting that viewers disagreed on the quality.”
You may earn up to 2 creativity points on this assignment by adding improvements to your program. If you do, include in your comment at the top of your program what you added. Below are some suggestions, but feel free to add your own:
Note: If you add additional statistics, print them after the required statistics. Gradescope will automatically test for the required output and will get confused if you change the required output lines. Similarly, do not add any additional input statements to data_analysis. Gradescope won’t know to expect those questions and so your program will hang and/or the tests will fail.
Make sure that your program is properly commented:
In addition, make sure that you’ve used good code design and style (including meaningful variable names, constants where relevant, vertical white space, etc.).
Submit your program via Gradescope. Your program file must be named lab5_data_analysis.py. You can submit multiple times, with only the most recent submission (before the due date) graded. Note that the tests performed by Gradescope are limited. Passing all of the visible tests does not guarantee that your submission correctly satisfies all of the requirements of the assignment.
If you worked with a partner, only one person needs to submit to Gradescope, but that person does need to add their partner’s name as shown in Gradescope documentation. Make sure both names are included in the comment at the top of the file.
Features | Points |
---|---|
File reading | 3 |
Basic statistics (num, max, min, avg) | 3 |
Median | 2 |
Standard deviation | 4 |
Frequency | 3 |
Data analyses | 5 |
Code design and style | 5 |
Creativity points | 2 |
Total | 27 |
There are multiple ways to calculate the standard deviation. The lab explicitly specifies the formula you should implement. If your results don’t match the examples in the lab, make sure you are implementing the specified formula. Specifically, notice that the denominator is not the number of observations, but the number of observations less one.