CS 150 - Assignment 5 - Data for Everyone
Due: Thursday 3/23 at 9am
For this assignment we're going to implement some initial data
analysis functions and then analyze some real data.
You are allowed (and encouraged) to work in pairs on this
assignment. If you do, you must do all of the work
together. Only turn in one copy of the assignment, but make sure both
of your names are in the comments at the top of the file.
Creating/Editing Text Files
For this assignment, we will be reading data from files. The files
must be of type ".txt" (no Microsoft Word docs, etc.). Also, recall
from class that the files must be saved in the same directory as
your Python program that will be reading them. There are many
ways to do this, but one easy way is to just create them using Spyder.
To do this, just create a new file and then put whatever text data you
want in the file (e.g., a list of numbers). When you save the file,
use "Save as type: Text files (*.txt)" to save the file with a ".txt"
extension, and save it in the same folder as your Python program.
TextWrangler is installed on all the lab machines and can
also be used to edit and save text files.
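If you would rather create a small test file from Python itself, the following is a minimal sketch. The function name `write_numbers` and the file name `sample_data.txt` are hypothetical, chosen only for illustration. A relative file name is resolved against the current working directory, so the file only ends up next to your program if you run the code from that folder.

```python
def write_numbers(filename, values):
    """Write one number per line to a plain-text file (overwrites the file)."""
    with open(filename, "w") as out_file:
        for value in values:
            out_file.write(str(value) + "\n")


# Hypothetical usage: creates sample_data.txt in the current working directory.
# write_numbers("sample_data.txt", [3, 1.5, 7])
```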
Data Basics
Read through this whole section before starting!
Write a program that prompts the user for the name of a data file that
contains one number per line and then prints the following statistics
about the file:
- The number of entries in the file
- The largest value in the file
- The smallest value in the file
- The average of the values in the file (recall we did this one in class)
- The median of the values in the file. For an odd number of data
elements, the median is the "middle" item if the data were in order.
For an even number of data elements, the median is the average of the
two middle items if the data were in order. Note, the median we
examined in class was not complete/correct.
- The sample standard deviation of the values in the file. The sample standard
deviation of N sample data points x_1, ..., x_N with average \bar{x} is defined as:
$$s = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})^2}$$
that is, the square root of the sum of the squared differences between the data
points and their average divided by one less than the number of data points.
For example, if I have a file called "test.txt" that contains the following:
then a run of the program would output:
Enter file to analyze: test.txt
File contained 10 entries
Std. dev: 2.859681411936962
Your program should only read data from the file once. Like
the examples in class, read the data from the file once, then store
the values in a list and use that list to calculate what you need.
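Reading the file a single time into a list, and then computing everything from that list, might look roughly like the sketch below. The names `read_numbers` and `basic_stats` are illustrative assumptions, not required names, and the sketch assumes every non-blank line holds a valid number.

```python
def read_numbers(filename):
    """Read one number per line from filename and return them as a list of floats."""
    numbers = []
    with open(filename) as data_file:
        for line in data_file:
            line = line.strip()
            if line:  # skip blank lines, e.g. an empty line at the end of the file
                numbers.append(float(line))
    return numbers


def basic_stats(numbers):
    """Return (count, largest, smallest, average) for a non-empty list of numbers."""
    count = len(numbers)
    return count, max(numbers), min(numbers), sum(numbers) / count
```

Because the data lives in a list after the first read, every later statistic can reuse it without opening the file again.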
I'm giving you a fair amount of flexibility regarding how you
implement this, but use good style. For example, think about how to
break your program into a number of functions instead of writing one
giant piece of code.
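As one illustration of breaking the work into functions, the median and sample standard deviation definitions from this section could each become a small function. This is only a sketch under stated assumptions: the names `median` and `sample_std_dev` are illustrative, the list is assumed non-empty for the median, and the standard deviation needs at least two values because it divides by one less than the number of data points.

```python
import math


def median(numbers):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(numbers)  # sorted() returns a copy, so the caller's list is unchanged
    middle = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[middle]  # odd count: the single middle item
    return (ordered[middle - 1] + ordered[middle]) / 2  # even count: average the two middle items


def sample_std_dev(numbers):
    """Return the sample standard deviation of a list with at least two numbers."""
    n = len(numbers)
    mean = sum(numbers) / n
    squared_diffs = sum((x - mean) ** 2 for x in numbers)
    return math.sqrt(squared_diffs / (n - 1))


print(median([5, 1, 3]))     # prints 3
print(median([4, 1, 3, 2]))  # prints 2.5
```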
The following is a function that attempts to print out the frequency of each item in the data:
def frequencies(data):
    """Attempts to print the frequency of each item in the list data"""
    count = 0
    previous = data[0]
    print("data\tfrequency") # '\t' is the TAB character
    for d in data:
        if d == previous:
            # same as the previous, so just increment the count
            count += 1
        else:
            # we've found a new item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
        previous = d
For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5] it should print:
Unfortunately, the program has a bug and doesn't do this. Copy and
paste the above function into your program and then fix it so that it
works properly. There are many ways to fix it; however, one
straightforward way only requires adding or changing one line of code.
(Be careful when you paste - make sure you preserve the indentation!)
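To check the output of your fixed function, one option is to compare it against counts computed a different way. The sketch below uses `collections.Counter` from the standard library, which is not the technique used in the function above, and the helper name `expected_frequencies` is an illustrative assumption. It does not show how to fix the bug; it only gives you numbers to compare against.

```python
from collections import Counter


def expected_frequencies(data):
    """Return (item, count) pairs sorted by item, for checking printed output."""
    counts = Counter(data)  # maps each distinct item to how many times it occurs
    return sorted(counts.items())


example = [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5]
for item, count in expected_frequencies(example):
    print(str(item) + "\t" + str(count))
# prints the pairs 1/2, 2/2, 3/2, 4/1, 5/4, 6/1, one per line, separated by a TAB
```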
I'm providing two real-world data sets:
- 95census is a folder containing some data from the
Northeast region of the US from the 1995 census. There are three
files in the folder corresponding to the age, number of kids and
income of the surveyed participants. You can download any of these by
right clicking and selecting "Save link as..." (or something similar
depending on the browser).
- 3movie_reviews.xlsx is an Excel file containing movie
reviews from three different movies. To use this data, you'll need to
open it in Excel and then grab the data you're interested in and copy
and paste it into Spyder and save as a ".txt" file.
(If you're curious, this is just a snippet of data from http://grouplens.org/datasets/movielens/.)
For each of these data sets, provide an analysis of the data using
your program (one analysis for each data set, for a total of two
analyses). For example, your two experiments might be:
- analyze income in 1995
- compare the data analysis numbers for two different movies
Be creative! But don't spend too much time on this part.
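For a comparison-style experiment, it can help to line up the statistics for two data sets side by side. The sketch below is one possible shape for that, not a required design. The names `summarize` and `compare` are illustrative, and it uses Python's standard `statistics` module as a stand-in for the functions you write yourself (`statistics.stdev` computes the sample standard deviation, dividing by N - 1). Each list is assumed to hold at least two numbers.

```python
import statistics


def summarize(numbers):
    """Return a dict of summary statistics for a list of at least two numbers."""
    return {
        "entries": len(numbers),
        "largest": max(numbers),
        "smallest": min(numbers),
        "average": statistics.mean(numbers),
        "median": statistics.median(numbers),
        "std dev": statistics.stdev(numbers),  # sample std. dev. (divides by N - 1)
    }


def compare(label_a, numbers_a, label_b, numbers_b):
    """Print the two summaries side by side, one statistic per line."""
    summary_a = summarize(numbers_a)
    summary_b = summarize(numbers_b)
    print("statistic\t" + label_a + "\t" + label_b)
    for name in summary_a:
        print(name + "\t" + str(summary_a[name]) + "\t" + str(summary_b[name]))
```

You would call `compare` with two labels and the two lists read from your ".txt" files. The standard-library results also give you a way to sanity-check the numbers your own program prints.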
Provide the output from your experiments in a comment at the
beginning of your program file (after your names, etc.), together with
one or two sentences describing your analysis and results. If you
look under the "Edit" menu in Spyder, there are entries "Add block
comment" and "Remove block comment", which allow you to select some
text and comment/uncomment the entire thing.
You may earn up to 2 extra points on this assignment by
adding improvements to your program. If you do, include in your
comments at the top of your program what you added. Below are some
suggestions, but feel free to add your own:
- (1 point) Add a calculation of the mode of your data, that
is the most frequently occurring element in the data. Notice this
should have a similar feel to the frequencies function.
- (? points) Do additional experiments/analysis of the real world
data sets. You'll be scored based on how creative you are. For
example, just running an additional census file through the program
won't be worth much.
- (? points) Add some other statistic to analyze your data with.
- (? points) Add your own ideas. Points will be awarded based on
difficulty and innovativeness.
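If you attempt the mode suggestion, you may want something to check your answer against. A minimal sketch, assuming Python 3.8 or later, where `statistics.multimode` is available: it returns every value tied for most frequent. The wrapper name `modes` is an illustrative assumption, and this is a checking aid rather than the loop-based approach the suggestion hints at.

```python
import statistics


def modes(data):
    """Return all most-frequent values in data, sorted (requires Python 3.8+)."""
    return sorted(statistics.multimode(data))


print(modes([6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5]))  # prints [5]
print(modes([1, 1, 2, 2, 3]))                        # prints [1, 2]
```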
When you're done
Make sure that your program is properly commented:
- You should have comments at the very beginning of the file
stating your name(s), course (including section number), assignment
number and the date.
- Each function should have an appropriate docstring
- Other miscellaneous comments to make things clear
In addition, make sure that you've used good style.
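As a rough illustration of the commenting described here, a file might start like the sketch below. The angle-bracket entries are placeholders to fill in, and the function `largest` is a hypothetical example, not a required part of the assignment.

```python
# Name(s): <your name>, <partner's name>
# Course: CS 150, section <section number>
# Assignment: 5
# Date: <date>


def largest(numbers):
    """Return the largest value in the non-empty list numbers."""
    # Track the biggest value seen so far, starting with the first item.
    biggest = numbers[0]
    for value in numbers:
        if value > biggest:
            biggest = value
    return biggest
```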
What to hand in:
You should have implemented:
- The program described in section "Data Basics"
- Your fixed frequencies function (in the same file)
- Your two analyses of the real-world data (as comments in the same file)
Submit your .py file online using the digital submission link on the
course web page. You must have submitted it online before the
due date listed at the top of this page (Thursday 3/23 at 9am). If you worked with a partner, you
should only submit one copy, but make sure both people's names are at
the top of the submitted file.
| Part | Points |
| --- | --- |
| Data Basics | |
| file reading | 3 |
| # of entries, largest, smallest, average | 3 |
| median | 2 |
| standard deviation | 4 |
| frequencies | 3 |
| Data analysis | 5 |
| Comments, style | 3 |
| Extra points | 2 |
| Total | 23 + 2 |