Let’s develop some functions to compute statistics about the words in a
sentence. To do so we will first split the sentence into a list of words.
Strings have a method to do exactly that, split.
>>> help(str.split)
Help on method_descriptor:
split(...)
S.split(sep=None, maxsplit=-1) -> list of strings
Return a list of the words in S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are
removed from the result.
>>> "this is a sentence".split()
['this', 'is', 'a', 'sentence']
Specifically, we will write three functions that each take a list of words:
average_word_length(words)
longest_word(words)
shortest_word(words)
Let’s start with average_word_length. What components will we need in that
function? What about in longest_word or shortest_word? For the latter two,
the key insight is often how to initialize the variable in which we keep track
of the current longest or shortest word.
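One possible sketch of these three functions follows. It assumes words is a non-empty list of strings and that ties go to the first word encountered; the lab specification may ask for something different. Note how longest and shortest are initialized with the first word rather than with an empty string or an arbitrary number.

```python
def average_word_length(words):
    """Return the average length of the strings in words (assumed non-empty)."""
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def longest_word(words):
    """Return the longest string in words (assumed non-empty)."""
    # Start with the first word so every comparison is against a real word
    longest = words[0]
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest


def shortest_word(words):
    """Return the shortest string in words (assumed non-empty)."""
    # Same pattern as longest_word, with the comparison reversed
    shortest = words[0]
    for word in words:
        if len(word) < len(shortest):
            shortest = word
    return shortest
```

For the list produced by "this is a sentence".split(), these return 3.75, 'sentence' and 'a' respectively.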
At a minimum your program should run correctly for all the examples included in the lab, say the examples in Lab 5 (Gradescope uses these as tests). But let’s also think more generally about testing (a preview of the upcoming reading). Recall we have previously discussed some of the conditions to think about when designing (and testing) our algorithms. From 15.4 in “Practical Programming”:
... random_equation function work for zero operators?
As we gain experience we will build out our mental checklist of situations to consider.
Furthermore, make sure each individual function obeys the specification given.
This was a source of difficulty for many in Lab 4, e.g., random_equation
could only use numbers in the range 1 to 10, and query_equation expected
a string as a parameter.
Most data analyses of any scale will start and/or end with data stored in a file. How do we read and write files?
The simplest way to open a file is with
open("myfile", "r")
The first argument is the name of the file. Note Python starts looking in the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path.
open("/path/to/some/file", "r")
The second argument is the mode, e.g., ‘r’ for reading, ‘w’ for writing, etc.
For example, to open a file of English words for reading (small-file.txt):
>>> file = open("small-file.txt", "r")
In your programs, a best practice is to use with blocks, which
ensure that the file is always closed for you. For example:
with open("filename", "r") as file:
    # Work with file object
# File is automatically closed when you exit the with block
In this course the expectation is that you will always use with blocks when
reading files.
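As a small illustration of the with block closing the file for us, the sketch below writes a file and then reads it back. The file name example.txt and the temporary directory are assumptions made only so the example can run anywhere.

```python
import os
import tempfile

# Hypothetical file in a temporary directory, used only for this illustration
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "w") as file:
    file.write("hello world\n")

# We have left the with block, so the file has already been closed
was_closed = file.closed

with open(path, "r") as file:
    contents = file.read()  # read the entire file as a single string
```

After this runs, was_closed is True and contents is 'hello world\n'.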
Once we have opened the file we can easily read all of the lines with a for
loop (note there are other ways to read a file, but we will use for loops
most frequently):
for <loop variable> in <file variable>:
    <loop body>
The loop body is executed once for each line of the file, with the loop variable assigned that line, as a string, including any trailing newline (i.e., return).
For example:
>>> for line in file:
...     print(line)
...
hello world

how are you

cat

dog
Notice the empty lines; these result from the newlines in the file itself and
the newline added by print. We typically don’t want the newline from the
file, so we often use the string strip method to remove it.
>>> help(str.strip)
Help on method_descriptor:
strip(...)
S.strip([chars]) -> str
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
>>> a = "string with newline\n"
>>> a
'string with newline\n'
>>> a.strip()
'string with newline'
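Putting the for loop and strip together, a minimal sketch for collecting the lines of a file without their newlines might look like the following. The file created here is a stand-in with the same four lines shown in the output above, written to a temporary directory so the example is self-contained.

```python
import os
import tempfile

# Stand-in for small-file.txt, created here so the example can run anywhere
path = os.path.join(tempfile.mkdtemp(), "small-file.txt")
with open(path, "w") as file:
    file.write("hello world\nhow are you\ncat\ndog\n")

lines = []
with open(path, "r") as file:
    for line in file:
        # strip removes the trailing newline (and any other surrounding whitespace)
        lines.append(line.strip())
```

Here lines ends up as ['hello world', 'how are you', 'cat', 'dog'].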
Let’s put this together with the functions that we wrote earlier to generate basic statistics about english.txt. You can download all the functions in word-stats.py.
>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768098732
Notice that almost all the code is shared between sentence_stats and
file_stats. This is nicely DRY (don’t repeat yourself)! If you ever find
yourself copying and pasting code, make a function instead.
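To make the sharing concrete, here is one way the two functions could be organized. This is a sketch, not the contents of word-stats.py: the helper name stats_report is hypothetical, the built-in max and min with key=len stand in for longest_word and shortest_word, and file_stats assumes the file holds one word per line.

```python
def stats_report(words):
    """Return a multi-line summary for a non-empty list of words."""
    total = 0
    for word in words:
        total += len(word)
    # max/min with key=len are compact stand-ins for longest_word/shortest_word
    return (
        "Number of words: " + str(len(words)) + "\n"
        + "Longest word: " + max(words, key=len) + "\n"
        + "Shortest word: " + min(words, key=len) + "\n"
        + "Avg. word length: " + str(total / len(words))
    )


def sentence_stats(sentence):
    """Print statistics for the words in a sentence."""
    print(stats_report(sentence.split()))


def file_stats(filename):
    """Print statistics for a file assumed to contain one word per line."""
    words = []
    with open(filename, "r") as file:
        for line in file:
            words.append(line.strip())
    print(stats_report(words))
```

The only difference between sentence_stats and file_stats is how the list of words is built; everything after that is a single shared function.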
What are some alternate design choices we could have made? Recall from the
reading that we could have used the readlines method to produce the list.
However, we would still need to remove the trailing newline characters.
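A brief sketch of the readlines alternative is shown below, using a small stand-in file created in a temporary directory (an assumption for illustration). It shows that each element returned by readlines still ends in a newline.

```python
import os
import tempfile

# Stand-in word file, created here so the example is self-contained
path = os.path.join(tempfile.mkdtemp(), "words.txt")
with open(path, "w") as file:
    file.write("cat\ndog\n")

with open(path, "r") as file:
    raw_lines = file.readlines()  # one string per line, newlines included

words = []
for line in raw_lines:
    words.append(line.strip())
```

Here raw_lines is ['cat\n', 'dog\n'] while words is ['cat', 'dog'].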
Recall that lists can store values of any (and potentially different) types. Thus we can have lists of lists. Doing so is often helpful if we have a sequence of values where each value might have multiple attributes or pieces of information.
Imagine we have some data about the probability of detecting different
particles, e.g., there is 0.55 probability of detecting a 'neutron'. How could
we organize this using lists and then find the particle with the highest
probability of detection?
We organize our data as a list of lists:
>>> particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03], ['muon', 0.07], ['neutrino', 0.14]]
To find the particle of the highest probability of detection, we will need to access the elements within the nested lists:
highest = particles[0]
for particle in particles:
    if particle[1] > highest[1]:
        highest = particle
print(highest[0])
Note that highest is a list, not a string or a number, and in the if statement
we are accessing the probability in that list (at index 1).
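The same loop can be wrapped in a function so it is reusable. The sketch below also shows an alternative using the built-in max with a key function, which is a different technique from the loop above; the names most_likely and probability are hypothetical.

```python
particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03],
             ['muon', 0.07], ['neutrino', 0.14]]


def most_likely(particles):
    """Return the name of the particle with the highest probability."""
    highest = particles[0]  # a [name, probability] list
    for particle in particles:
        if particle[1] > highest[1]:
            highest = particle
    return highest[0]


def probability(particle):
    """Return the probability stored at index 1 of a [name, probability] list."""
    return particle[1]


# Alternative: let max compare the nested lists by their probability
most_likely_builtin = max(particles, key=probability)[0]
```

Both most_likely(particles) and most_likely_builtin evaluate to 'neutron'.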
Python strings and lists are both examples of sequences. As such they
both support many of the same operations, e.g., indexing and
slicing. Many functions can take either a string or a list as an argument
(e.g., len). However, there are several key differences between
strings and lists, notably that lists are mutable while strings are immutable:
some list methods, e.g., sort, modify the list on which they are invoked.
This also impacts indexing. If we try to reassign a character of a string we will get an error, but we can reassign an element of a list.
```
>>> a = "abcd"
>>> a[0] = "h"
Traceback (most recent call last):
  File "<pyshell>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> a = list("abcd")
>>> a[0] = "h"
>>> a
['h', 'b', 'c', 'd']
```
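The same distinction shows up with methods. As a short sketch (the variable names are chosen only for illustration), the list sort method changes the list in place and returns None, whereas string methods and the built-in sorted leave the original untouched and return a new value.

```python
letters = list("dcba")
result = letters.sort()  # sorts letters in place; the method itself returns None

word = "dcba"
upper_word = word.upper()       # a new string; word is unchanged
sorted_letters = sorted(word)   # a new list of characters; word is unchanged
rebuilt = "h" + word[1:]        # "changing" a string means building a new one
```

Afterwards letters is ['a', 'b', 'c', 'd'] and result is None, while word is still 'dcba', upper_word is 'DCBA', sorted_letters is ['a', 'b', 'c', 'd'] and rebuilt is 'hcba'.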