Due: 2018-04-25 11:59p
This homework question is based on Section 13.3 in the Think Python text. We break the tasks down into smaller parts and give hints on how to complete each part.
Write a function called analyzeFile(filename). This function will analyze a file and return a few statistics in a tuple.
The statistics you will return are the most common word, the least common word, the longest word, and the shortest word.
Refer to the example code from Tuesday 4/17 and the code below to help get started.
def analyzeFile(filename):
    """
    Reads a file and returns statistics about the contents.
    Input: filename - a string containing the name of the file to read
    Output: a tuple containing the longest word, shortest word,
    most common word and least common word
    """
    f = open(filename, encoding="utf-8") # open the file
    for line in f: # iterate over the lines of the file
        words = line.split() # turn the line into a list of words
        print(words) # REPLACE THIS LINE
Run this code and make sure that you see a list of words for every line in the input file (do this for short files only :-).
Here are two text files you can work with. Download them to the same folder as your Python file.
The first thing to do is read the contents of a file and count how many times each word appears in it. We will do this using a dictionary. Within the dictionary, the keys are words that appear in the file, and each associated value is the number of times that word appears in the file.
This sample code opens a file (the filename variable is a string containing the name of the file) and prints out the contents to the console:
f = open(filename, encoding="utf-8") # open the file
for line in f: # iterate over the file line by line
    print(line) # print out the line
Rather than printing out each line, we want to extract all of the words. Splitting a string into words is easy:
words = line.split()
However, there is some cleaning that we would like to do. The split() method just breaks at whitespace, so we end up with some stray punctuation stuck on the ends of words, and "garden" and "garden," are different strings. We also want "Who" and "who" to count as the same word.
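A quick illustration of both problems, using a made-up sentence (it is not taken from the provided text files):

```python
# A made-up sample line, used only for illustration.
line = "Who is in the garden, and who left the garden?\n"

words = line.split()  # with no arguments, split() breaks on any whitespace
print(words)
# "garden," and "garden?" keep their punctuation, and "Who" differs from "who"
```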
The second of these is pretty easy. Given a string called word, we call word.lower() to get a lowercase version of the string.
To remove punctuation we can use the strip() method. By default it strips whitespace characters off the ends of strings (spaces, tabs, and newline characters). However, we can pass a string in as an argument, and the method will strip any of the characters in the argument from the front and end of the original string.
The string module defines a number of strings that just consist of important sets of characters. The ones that are most interesting to us at this point are string.punctuation (the ASCII punctuation characters) and string.whitespace (all whitespace characters). Add import string to the top of your file. Pass string.whitespace+string.punctuation in as an argument to the strip() method.
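As a minimal sketch, the two cleaning steps might be combined as follows. The helper name cleanWord is hypothetical and not part of the assignment; you could just as well do the same work inline in your loop.

```python
import string

# Characters to strip from both ends of each word.
stripChars = string.whitespace + string.punctuation

def cleanWord(word):
    """Hypothetical helper: lowercase a word and strip punctuation from its ends."""
    return word.lower().strip(stripChars)

print(cleanWord("garden,"))  # garden
print(cleanWord('"Who'))     # who
print(cleanWord("don't"))    # don't (the apostrophe in the middle is kept)
```

Note that strip() only touches the two ends of the string, so punctuation inside a word survives.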
To remove all punctuation, including in the middle of words, we could use the replace() method as we did in the example code from Tuesday 4/17.
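One possible way to do this is sketched below. It is not necessarily how the class example did it, and the helper name removeAllPunctuation is hypothetical.

```python
import string

def removeAllPunctuation(word):
    """Hypothetical helper: delete every ASCII punctuation character from word."""
    for ch in string.punctuation:
        word = word.replace(ch, "")  # replace() returns a new string
    return word

print(removeAllPunctuation("don't"))        # dont
print(removeAllPunctuation("well-known,"))  # wellknown
```

The trade-off is that hyphenated words and contractions get squashed together, which may or may not be what you want.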
Use a loop to do this "cleaning" to every word in the list of words.
For testing purposes, you may want to print out each word after you have cleaned it to make sure you are just getting lower case words.
Your function should now follow this outline:
We'll use a dictionary to keep track of how many times each word appears. Create an empty dictionary called counts before the loop that reads the file; this will hold our counts.
For each cleaned word, update its count with counts[word] = counts.get(word, 0) + 1. This use of get() will use the default value of 0 if the word hasn't been added to the dictionary yet.
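A small sketch of the counting step, assuming a list of made-up lines in place of a real file. The check that skips empty strings is an addition of ours, for tokens that consist only of punctuation.

```python
import string

# Made-up sample lines standing in for the lines of a file.
lines = ["The garden, the gate.", "Who saw the garden?"]

counts = {}  # maps each cleaned word to the number of times it appears
for line in lines:
    for word in line.split():
        word = word.lower().strip(string.whitespace + string.punctuation)
        if word:  # skip tokens that were pure punctuation
            counts[word] = counts.get(word, 0) + 1

print(counts)
```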
Once the whole file has been read in, counts will have a record of how many times every single word in the file appears. So, it is time to perform some analysis.
Create a new variable called mostCommonWord and initialize it to any word in the dictionary. The variable you created to loop through all of the words should still have the last word in it, so go ahead and use that.
We now want to iterate over the entire dictionary and look at each word. If that word occurs more often than the word stored in mostCommonWord, update mostCommonWord to be that new word.
For now (for testing), print out the most common word to make sure it makes sense.
Note that there may be more than one word that is "most common" in your file. It's fine to just return any one of these most common words. Or if you wish, you can store all such most common words in a list.
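The search described above can be sketched on a small made-up dictionary. The list name allMostCommon is hypothetical and only needed if you choose to collect ties.

```python
# A small made-up counts dictionary, used only for illustration.
counts = {"gate": 1, "the": 3, "garden": 2}

mostCommonWord = "gate"  # start with any word that is in the dictionary
for word, count in counts.items():
    if count > counts[mostCommonWord]:
        mostCommonWord = word

print(mostCommonWord)  # the

# Optional: collect every word that ties for the highest count.
highest = counts[mostCommonWord]
allMostCommon = [word for word, count in counts.items() if count == highest]
print(allMostCommon)  # ['the']
```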
Reminder: We can iterate over the dictionary using the form
for key, value in dictionary.items():
    # do something; key and value contain the current entry in the dictionary
Gather the other statistics (longest word, shortest word, least common word) in a similar fashion. For the word length metrics, you don't need to look at the value in the dictionary, just look at the length of the word.
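A sketch of the same pattern for the remaining statistics, again on a made-up dictionary. The variable names longestWord, shortestWord, and leastCommonWord are our own choices, not required by the assignment.

```python
# A small made-up counts dictionary, used only for illustration.
counts = {"the": 3, "garden": 2, "gate": 1, "a": 1}

# Start all three from any word that is in the dictionary.
longestWord = shortestWord = leastCommonWord = "the"
for word, count in counts.items():
    if len(word) > len(longestWord):
        longestWord = word
    if len(word) < len(shortestWord):
        shortestWord = word
    if count < counts[leastCommonWord]:
        leastCommonWord = word

print(longestWord, shortestWord, leastCommonWord)  # garden a gate
```

Because the comparisons are strict, ties are resolved in favor of the word seen first.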
Return the statistics in a 4-tuple. The fields of the tuple you return should contain the most common word, the least common word, the longest word, and shortest word. The form of your function should now be:
>>> weather('05753')
69.55
We've broken the description of this functionality into a few steps. Be sure to read the entire write-up carefully before proceeding.
API stands for "Application Programming Interface" and it means that a service (such as a weather data server on the web) provides a protocol specifically designed to be used by programs, rather than by humans.
We will use the API by OpenWeatherMap. The API asks you to create an account; you'll then get your own unique ID that will allow you to query their website from a Python program. Visit OpenWeatherMap, follow the link for "current weather data", and then scroll down to "by ZIP code". You will see that you can use a URL like
http://api.openweathermap.org/data/2.5/weather?zip=94040,us&appid=2de143494c0b295cca9337e1e96b00e0
to get the weather conditions for a given zip code. Here is a sample page we retrieved for Middlebury's zip code via the API:
http://www.cs.middlebury.edu/~cs101/homework/hw08-data/weather-05753.txt
To start, you can design your program to use only this page. If you follow this link you'll see it's a text encoding for the weather for Middlebury, with the current temperature being 62.92. Your job is to write the Python code that extracts just the temperature from this data.
Here is one suggested approach:
Now, extract the temperature, store it in a variable, and return it as the value of the weather function.
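One way to do the extraction is sketched below. It assumes the page is JSON text with the temperature stored under "main" and then "temp", which is how OpenWeatherMap lays out its current-weather responses. The sample string is an abbreviated, hand-written stand-in for the real page, the helper name extractTemperature is hypothetical, and plain string methods such as find() and slicing would work too.

```python
import json

# Abbreviated, hand-written stand-in for the downloaded text; the real page
# contains many more fields. The "main"/"temp" layout is an assumption.
sampleText = '{"name": "Middlebury", "main": {"temp": 62.92}}'

def extractTemperature(text):
    """Hypothetical helper: pull the temperature out of the response text."""
    data = json.loads(text)      # parse the JSON text into nested dictionaries
    return data["main"]["temp"]  # the temperature lives under "main" -> "temp"

print(extractTemperature(sampleText))  # 62.92
```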
Note that the following URL has several "variable definitions" separated with ampersands. For instance, we specify the zip code via "zip=05753,us". At the end we request "imperial" units, i.e., Fahrenheit, since by default we get Kelvin, which is not as useful. The "appid" variable is the ID you were assigned when you created an account. (The current value is simply copied from their documentation page, and won't work for arbitrary zip codes.)
http://api.openweathermap.org/data/2.5/weather?zip=05753,us&appid=2de143494c0b295cca9337e1e96b00e0&units=imperial
Once you have an appid, to use the actual API you need to use a URL like the one above, but with the correct zip code substituted, which is not hard to do in Python.
Change your weather function to generate an appropriate url based on the zip code passed in and then use this url to get the temperature. You should now be able to query the current weather based on the zip code entered:
>>> weather('05753')
62.92
>>> weather('33111')
80.8
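Building the URL from a zip code might look like the sketch below. The helper name buildWeatherUrl and the placeholder "YOUR_APPID" are hypothetical; substitute the ID from your own account. Fetching the page is not shown here.

```python
BASE_URL = "http://api.openweathermap.org/data/2.5/weather"

def buildWeatherUrl(zipCode, appId):
    """Hypothetical helper: build the query URL for a US zip code (a string)."""
    return BASE_URL + "?zip=" + zipCode + ",us&appid=" + appId + "&units=imperial"

url = buildWeatherUrl("05753", "YOUR_APPID")
print(url)
```

Keeping the zip code as a string matters: as an integer, 05753 would lose its leading zero.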
Before submitting, be sure to check the grading rubric and make sure you have followed all instructions correctly.
Put all of your functions into a single Python file called username_hw8.py.
Be sure to comment your code. The top of the file should include a multiline comment that lists your name, the name of the homework, and your lab section, at a minimum. Each function should also include a docstring comment at the beginning of the body of the function describing what the function does.
Submit your file username_hw8.py using the CS 101 submit script.