CS 150 - Assignment 5 - Data for Everyone

Due: Thursday 3/23 at 9am


http://dilbert.com/fast/2008-05-08/

For this assignment we're going to implement some initial data analysis functions and then analyze some real data.

You are allowed (and encouraged) to work in pairs on this assignment. If you do, you must do all of the work together. Only turn in one copy of the assignment, but make sure both of your names are in the comments at the top of the file.

Creating/Editing Text Files

For this assignment, we will be reading data from files. The files must be of type ".txt" (no Microsoft Word docs, etc.). Also, recall from class that the files must be saved in the same directory as your Python program that will be reading them. There are many ways to do this, but one easy way is to just create them using Spyder. To do this, just create a new file and then put whatever text data you want in the file (e.g., a list of numbers). When you save the file, use "Save as type: Text files (*.txt)" to save the file with a ".txt" extension, and save it in the same folder as your Python program. TextWrangler is installed on all the lab machines and can also be used to edit and save text files.
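
If you would rather create a small test file from Python itself, the following is a minimal sketch. The helper name write_numbers and the file name sample_numbers.txt are assumptions made for illustration; they are not part of the assignment.

```python
def write_numbers(filename, numbers):
    """Write one number per line to a plain-text file (hypothetical helper)."""
    # A relative filename is resolved against the current working directory.
    with open(filename, "w") as f:
        for n in numbers:
            f.write(str(n) + "\n")


write_numbers("sample_numbers.txt", [1, 2.0, 10, 5])
```

After running this, check that the file landed in the same folder as your program; if it did not, move it there or change the working directory.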

Data Basics

Read through this whole section before starting!

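Write a program that prompts the user for the name of a data file containing one number per line and then prints the following statistics about the file: the number of entries, the largest value, the smallest value, the average, the median, and the standard deviation.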

For example, if I have a file called "test.txt" that contains the following:
1
2.0
10
5
5
9
8
6
7
5
then a run of the program would output:
Enter file to analyze: test.txt
File contained 10 entries
Max: 10.0
Min: 1.0
Average: 5.8
Median: 5.5
Std. dev: 2.859681411936962
Your program should read the data from the file only once. As in the examples from class, read the values into a list and then use that list to calculate everything you need.
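
The read-once approach could be sketched roughly as follows. The function name read_numbers is hypothetical, and skipping blank lines is an assumption about the input rather than a stated requirement.

```python
def read_numbers(filename):
    """Return the numbers in filename (one per line) as a list of floats."""
    numbers = []
    with open(filename) as f:        # the file is opened and read exactly once
        for line in f:
            line = line.strip()
            if line:                 # assumption: ignore blank lines
                numbers.append(float(line))
    return numbers
```

A call such as data = read_numbers("test.txt") would then give you a single list from which every statistic can be computed.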

I'm giving you a fair amount of flexibility regarding how you implement this, but use good style. For example, think about how to break your program into a number of functions instead of writing one giant piece of code.
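
When you split the work into functions, it helps to check each one against a known result. The sketch below uses Python's standard statistics module as an independent cross-check on the example data. It is meant only as a way to test your own functions, on the assumption that you are expected to write the calculations yourself.

```python
import statistics

data = [1, 2.0, 10, 5, 5, 9, 8, 6, 7, 5]   # contents of the example test.txt

print("Average:", statistics.mean(data))
print("Median:", statistics.median(data))
# stdev() is the *sample* standard deviation (divides by n - 1);
# pstdev() is the population version (divides by n).
print("Sample std. dev:", statistics.stdev(data))
print("Population std. dev:", statistics.pstdev(data))
```

On this data the sample version agrees, to within rounding, with the 2.8596... shown in the example run, while the population version comes out near 2.713. The example output therefore appears to divide by n - 1.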

Frequency

The following is a function that attempts to print out the frequency of each item in the data:
def frequencies(data):
    """Attempts to print the frequency of each item in the list data"""
    data.sort()
    
    count = 0
    previous = data[0]

    print("data\tfrequency") # '\t' is the TAB character

    for d in data:
        if d == previous:
            # same as the previous, so just increment the count
            count += 1
        else:
            # we've found a new item so print out the old and reset the count
            print(str(previous) + "\t" + str(count))
            count = 1
        
        previous = d
For example, given the list [6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5] it should print:
data    frequency
1       2
2       2
3       2
4       1
5       4
6       1
Unfortunately, the function has a bug and doesn't do this. Copy and paste the above function into your program and then fix it so that it works properly. There are many ways to fix it; however, one straightforward way requires adding or changing only one line of code. (Be careful when you paste: make sure you preserve the indentation!)
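
To check your fixed function, you can compare its printout against counts computed independently. The sketch below does this with collections.Counter from the standard library on the example list. The helper name expected_frequencies is hypothetical, and this is a way to verify your fix, not the fix itself.

```python
from collections import Counter


def expected_frequencies(data):
    """Return (item, count) pairs in sorted item order (hypothetical helper)."""
    counts = Counter(data)
    return [(item, counts[item]) for item in sorted(counts)]


for item, count in expected_frequencies([6, 5, 5, 1, 1, 2, 2, 3, 3, 4, 5, 5]):
    print(str(item) + "\t" + str(count))
```

If your corrected frequencies function prints the same rows for several different lists, including a list with a single item, the fix is probably right.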

Real Data


http://xkcd.com/539/

I'm providing two real-world data sets:

For each of these data sets, provide an analysis of the data using your program (one analysis per data set, for a total of two analyses). For example, your two experiments might be:

Be creative! But don't spend too much time on this part.

Provide the output from your experiments in a comment at the beginning of your program file (after your names, etc.), together with one or two sentences describing your analysis and results. If you look under the "Edit" menu in Spyder, there are entries "Add block comment" and "Remove block comment", which allow you to select some text and comment/uncomment the entire thing.

Extra Points

You may earn up to 2 extra points on this assignment by adding improvements to your program. If you do, describe what you added in the comments at the top of your program. Below are some suggestions, but feel free to add your own:

When you're done

Make sure that your program is properly commented:

In addition, make sure that you've used good style.

What to hand in:

You should have implemented:

Submission procedure

Submit your .py file online using the digital submission link on the course web page. You must submit it online before the deadline given at the top of this assignment (Thursday 3/23 at 9am). If you worked with a partner, submit only one copy, but make sure both people's names are at the top of the submitted file.

Grading

                                                  points
Data Basics
      file reading                                3
      # of entries, largest, smallest, average    3
      median                                      2
      standard deviation                          4
frequencies                                       3
Data analysis                                     5
Comments, style                                   3
Extra points                                      2
Total                                             23 + 2