CS101 - Homework 8

Due: 2018-04-25 11:59p

Objectives

Learn how to work with files
Learn how to use dictionaries
Learn how to use APIs

[15 points] Part 1: Analyze a file

This homework question is based on Section 13.3 in the Think Python text. We break the tasks down into smaller parts and give hints on how to complete each part.

Write a function called analyzeFile(filename). This function will analyze a file and return a few statistics in a tuple. The statistics you will return are:

longest word
shortest word
most common word
least common word

Refer to the example code from Tuesday 4/17 and the code below to help get started.

def analyzeFile(filename):
    """
    Reads a file and returns statistics about the contents.
    Input: filename - a string containing the name of the file to read
    Output: a tuple containing the longest word, shortest word,
            most common word and least common word
    """

    f = open(filename, encoding="utf-8")  # open the file

    for line in f:              # iterate over the lines of the file
        words = line.split()    # turn the line into a list of words
        print(words)            # REPLACE THIS LINE

Run this code and make sure that you see a list of words for every line in the input file (do this for short files only :-).

Here are two text files you can work with. Download them to the same folder as your Python file.

alma.txt The Middlebury College Alma Mater
angelou.txt "On the Pulse of Morning" by Maya Angelou

Step one: Read the file

The first thing to do is read the contents of a file and count how many times each word appears in it. We will do this using a dictionary. Within the dictionary, the keys are words that appear in the file, and each associated value is the number of times that word appears in the file.

This sample code opens a file (the filename variable is a string containing the name of the file) and prints out the contents to the console:


f = open(filename, encoding="utf-8") # open the file

for line in f:      # iterate over the file line by line
    print(line)     # print out the line

Rather than printing out each line, we want to extract all of the words. Splitting a string into words is easy:


words = line.split()

However, there is some cleaning that we would like to do. The split() method breaks the line at whitespace only, so stray punctuation stays attached to words, and "garden" and "garden," end up as different strings. We also want "Who" and "who" to count as the same word.

The second of these is pretty easy. Given a string called word, we call word.lower() to get a lowercase version of the string.

To remove punctuation we can use the strip() method. By default it strips whitespace characters off the ends of a string (spaces, tabs, and newline characters). However, we can pass a string in as an argument, and the method will strip any of the characters in the argument from the front and end of the original string.

The string library defines a number of strings that just consist of important sets of characters. The ones that are most interesting to us at this point are string.punctuation (all punctuation characters) and string.whitespace (all whitespace characters). Add import string to the top of your file. Pass string.whitespace+string.punctuation in as an argument to the strip() function.
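As a minimal sketch of this cleaning step, the snippet below strips and lowercases every word on one line. The helper name clean_word and the sample line are made up for illustration and are not part of the assignment.

```python
import string

def clean_word(word):
    """Strip whitespace/punctuation from both ends, then lowercase.

    clean_word is a hypothetical helper name used only in this sketch.
    """
    return word.strip(string.whitespace + string.punctuation).lower()

line = 'Who planted the "garden," and why?'   # made-up sample line

cleaned = []
for word in line.split():       # loop over the raw words
    cleaned.append(clean_word(word))

print(cleaned)   # ['who', 'planted', 'the', 'garden', 'and', 'why']
```

Note that strip() only touches the ends of the string, so an apostrophe inside a word such as "don't" is left alone.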

To remove all punctuation including in the middle of words, we could use the replace function as we did in the example code from Tuesday 4/17.
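The in-class example is not reproduced here, so the following is only a guess at what a replace-based version might look like. The function name remove_punctuation is an assumption.

```python
import string

def remove_punctuation(word):
    """Delete every punctuation character, wherever it appears.

    remove_punctuation is a hypothetical name used only in this sketch.
    """
    for ch in string.punctuation:       # try each punctuation character
        word = word.replace(ch, "")     # replace it with nothing
    return word

print(remove_punctuation("don't"))         # dont
print(remove_punctuation("well-known,"))   # wellknown
```

Unlike strip(), this also changes the inside of words, so "don't" becomes "dont". Which behavior you want depends on how you decide to define a word.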

Use a loop to do this "cleaning" to every word in the list of words.

For testing purposes, you may want to print out each word after you have cleaned it to make sure you are just getting lower case words.

Your function should now follow this outline:

open the file
iterate over each line of the file
- split the line into a list of words
- iterate over the words
  - strip off the bad characters and convert the word to lowercase
  - print out the word

Step two: Count word occurrences

We'll use a dictionary to keep track of how many times each word appears. Create an empty dictionary called counts before the loop reading the file – this will hold our counts.

Then in the loop, instead of the print line in the example above, we'll store the number of times each word appears. For each word read, you'll want to look up how many times you have seen this word before, add one to the result, and put the new value back in the dictionary, i.e., use counts[word] = counts.get(word, 0) + 1. This use of get() will use the default value of 0 if the word hasn't been added to the dictionary yet.
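To see the get() idiom in isolation, here is a small sketch that counts the words in a hard-coded list rather than a file. The list contents are made up.

```python
words = ["the", "garden", "the", "rock", "the", "garden"]  # made-up data

counts = {}                     # word -> number of times seen so far
for word in words:
    # get() returns 0 the first time a word is seen, so no KeyError
    counts[word] = counts.get(word, 0) + 1

print(counts)   # {'the': 3, 'garden': 2, 'rock': 1}
```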

Step three: Most common word

Once the whole file has been read in, counts will have a record of how many times every single word in the file appears. So, it is time to perform some analysis.

Create a new variable called mostCommonWord and initialize it to any word in the dictionary. The variable you created to loop through all of the words should still have the last word in it, so go ahead and use that.

We now want to iterate over the entire dictionary and look at each word. If that word occurs more often than the word stored in mostCommonWord, update mostCommonWord to be that new word.

For now (for testing), print out the most common word to make sure it makes sense. Your function should now follow this outline:

open the file
create empty dictionary called counts
iterate over each line of the file
- split the line into a list of words
- iterate over the words
  - strip off the bad characters and convert the word to lowercase
  - look up how many times you have seen the word before
  - add one to that number and store the result back in the dictionary
create new variable mostCommonWord
iterate over the words in the dictionary
- if the word occurs more frequently than mostCommonWord, set mostCommonWord to it
print(mostCommonWord)

Note that there may be more than one word that is "most common" in your file. It's fine to just return any one of these most common words. Or if you wish, you can store all such most common words in a list.

Reminder: We can iterate over the dictionary using the form


for key, value in dictionary.items():
  # do something, key and value contain the current entry in the dictionary
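Putting the reminder together with step three, a sketch of the search for the most common word might look like this. The dictionary is hard-coded, made-up data standing in for the counts built from a file.

```python
counts = {"rock": 1, "the": 3, "garden": 2}   # made-up counts

mostCommonWord = "rock"         # start with any word in the dictionary
for word, count in counts.items():
    # keep the word with the largest count seen so far
    if count > counts[mostCommonWord]:
        mostCommonWord = word

print(mostCommonWord)   # the
```

Because the comparison is a strict >, a tie leaves the earlier word in place, which is one acceptable answer according to the note above.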

Step four: Other statistics

Gather the other statistics (longest word, shortest word, least common word) in a similar fashion. For the word length metrics, you don't need to look at the value in the dictionary, just look at the length of the word.
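The same pattern carries over to the remaining statistics. This sketch again uses a made-up, hard-coded dictionary; only the comparisons change.

```python
counts = {"the": 3, "garden": 2, "rock": 1, "a": 1}   # made-up counts

# start every statistic at some word that is in the dictionary
leastCommonWord = longestWord = shortestWord = "the"

for word, count in counts.items():
    if count < counts[leastCommonWord]:     # compares the values
        leastCommonWord = word
    if len(word) > len(longestWord):        # compares the keys only
        longestWord = word
    if len(word) < len(shortestWord):
        shortestWord = word

print(leastCommonWord, longestWord, shortestWord)   # rock garden a
```

One thing to watch for: a token made up entirely of punctuation cleans to an empty string, which would then win as the shortest word. Whether to skip empty strings is left to you; the assignment text does not say.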

Step five: Returning the statistics

Return the statistics in a 4-tuple. To match the docstring given at the start of this part, the fields of the tuple you return should contain the longest word, the shortest word, the most common word, and the least common word, in that order. The form of your function should now be:

open the file
create empty dictionary called counts
iterate over each line of the file
- split the line into a list of words
- iterate over the words
  - strip off the bad characters and convert the word to lowercase
  - look up how many times you have seen the word before
  - add one to that number and store the result back in the dictionary
create new variable mostCommonWord
create new variable leastCommonWord
create new variable longestWord
create new variable shortestWord
iterate over the words in the dictionary
- if the word occurs more frequently than mostCommonWord, set mostCommonWord to it
- if the word occurs less frequently than leastCommonWord, set leastCommonWord to it
- if the word is longer than longestWord, set longestWord to it
- if the word is shorter than shortestWord, set shortestWord to it
return tuple containing the four statistics
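As a reminder of how returning and unpacking a tuple works, here is a small sketch with a two-field tuple. The function length_stats is hypothetical, and it uses the built-in max() and min() with key=len as a compact alternative to the loop described above, not as the required approach.

```python
def length_stats(counts):
    """Return (longest word, shortest word) for a non-empty dictionary.

    length_stats is a hypothetical helper used only in this sketch.
    """
    longestWord = max(counts, key=len)    # key with the greatest length
    shortestWord = min(counts, key=len)   # key with the smallest length
    return (longestWord, shortestWord)    # pack both values into a tuple

stats = length_stats({"the": 3, "garden": 2, "a": 1})
print(stats)                    # ('garden', 'a')

longest, shortest = stats       # unpack the tuple into two variables
print(longest)                  # garden
```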

[15 points] Part 2: Weather Report

In this portion of the homework we'll build a function that collects weather data from the web for a given zip code. Specifically, you will write a function called weather(zipcode) that takes a string as an input parameter, treats it as a zip code (you can assume it is a valid 5-digit zip code), looks up the weather for that zip code, and returns the current temperature there. For example:

>>> weather('05753')
69.55

http://xkcd.com/1245/

We've broken the description of this functionality into a few steps. Be sure to read the entire write-up carefully before proceeding.

Step one: Set up an account

API stands for "Application Programming Interface" and it means that a service (such as a weather data server on the web) provides a protocol specifically designed to be used by programs, rather than by humans.

We will use the API by OpenWeatherMap. The API asks you to create an account; you'll then get your own unique ID that will allow you to query their website from a Python program. Visit OpenWeatherMap, follow the link for "current weather data", and then scroll down to "by ZIP code". You will see that you can use a URL like

http://api.openweathermap.org/data/2.5/weather?zip=94040,us&appid=2de143494c0b295cca9337e1e96b00e0

to get the weather conditions for a given zip code. Here is a sample page we retrieved for Middlebury's zip code via the API:

http://www.cs.middlebury.edu/~cs101/homework/hw08-data/weather-05753.txt

To start, you can design your program to use only this page. If you follow this link you'll see it's a text encoding for the weather for Middlebury, with the current temperature being 62.92. Your job is to write the Python code that extracts just the temperature from this data.

Here is one suggested approach:

Write some code that opens the web page above and reads through it a line at a time. Alternatively, the code could read the entire file at once using something like "contents = webpage.read()" instead of "for line in webpage:".
Once you have this working, you need to extract the temperature. An elegant approach is to utilize the structure of the data returned by the API. Given your knowledge of Python, what do you notice about the data? It could be interpreted as a Python dictionary! Python has an "eval" function that will come in handy here to convert the string into a dictionary. See the example code from Tuesday 4/17 where we used the Google Maps API.
Now, extract the temperature, store it in a variable, and return it as the value of the weather function.
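The steps above can be sketched roughly as follows. Several things here are assumptions: the helper names read_page and extract_temperature are made up, the sample string is a heavily shortened stand-in for the real response, and the "main" and "temp" keys are assumed to match what the sample page contains. The sketch also swaps in json.loads() for eval(); it converts the same kind of text into a dictionary without executing it as Python code.

```python
import json
import urllib.request

def read_page(url):
    """Return the whole page at url as one string (hypothetical helper)."""
    with urllib.request.urlopen(url) as webpage:
        return webpage.read().decode("utf-8")

def extract_temperature(contents):
    """Pull the temperature out of the response text.

    Assumes the data has the shape {"main": {"temp": ...}, ...}.
    """
    data = json.loads(contents)     # string -> Python dictionary
    return data["main"]["temp"]     # nested lookup: "main", then "temp"

# Made-up, shortened sample; the real page has many more fields.
sample = '{"main": {"temp": 62.92, "humidity": 40}, "name": "Middlebury"}'
print(extract_temperature(sample))   # 62.92
```

With a working network connection, extract_temperature(read_page(url)) would apply the same lookup to a live page.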

For a given zip code, the url should look as follows:


http://api.openweathermap.org/data/2.5/weather?zip=05753,us&appid=2de143494c0b295cca9337e1e96b00e0&units=imperial

Note that the URL has several "variable definitions" separated with ampersands. For instance, we specify the zip code via "zip=05753,us". At the end we request "imperial" units, i.e., Fahrenheit, since the default is Kelvin, which is not as useful. The "appid" variable is the ID you were assigned when you created an account. (The current value is simply copied from their documentation page, and won't work for arbitrary zip codes.)

Once you have an appid, to use the actual API you need to use a URL like the one above, but with the correct zip code substituted, which is not hard to do in Python.
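One simple way to do the substitution is plain string concatenation, sketched below. The function name build_url and the placeholder "YOUR_APPID" are made up; use the ID from your own account.

```python
BASE_URL = "http://api.openweathermap.org/data/2.5/weather"

def build_url(zipcode, appid):
    """Build the query URL for a zip code (hypothetical helper).

    zipcode and appid are both strings.
    """
    return BASE_URL + "?zip=" + zipcode + ",us&appid=" + appid + "&units=imperial"

print(build_url("05753", "YOUR_APPID"))
# http://api.openweathermap.org/data/2.5/weather?zip=05753,us&appid=YOUR_APPID&units=imperial
```

Keeping the zip code as a string matters here: as an integer, '05753' would lose its leading zero.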

Change your weather function to generate an appropriate url based on the zip code passed in and then use this url to get the temperature. You should now be able to query the current weather based on the zip code entered:

>>> weather('05753')
62.92
>>> weather('33111')
80.8

Turning in your work

Before submitting, be sure to check the grading rubric and make sure you have followed all instructions correctly.

Put all of your functions into a single Python file called username_hw8.py

Be sure to comment your code. The top of the file should include a multiline comment that lists your name, the name of the homework, and your lab section, at a minimum. Each function should also include a docstring comment at the beginning of the body of the function describing what the function does.

Submit your file username_hw8.py using the CS 101 submit script.