with block to encapsulate file operations
Python strings and lists are both examples of sequences. As such, they both
support many of the same operations, e.g. indexing and slicing, and many functions
can take either a string or a list as an argument (e.g. len). However, there are
several key differences between strings and lists, notably:
A string is immutable while a list is mutable. As a result, string methods
do not modify the string on which they are invoked; instead they create a new
string with those modifications applied. In contrast, as we saw last time, some
list methods, e.g. sort, modify the list on which they are invoked.
This also impacts indexing. If we try to reassign a character of a string we will get an error, but we can reassign an element of a list.
>>> a = "abcd"
>>> a[0] = "h"
Traceback (most recent call last):
File "<pyshell>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> a = list("abcd")
>>> a[0] = "h"
>>> a
['h', 'b', 'c', 'd']
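The difference between string methods and list methods described above can be seen in a short sketch. The variable names here are purely illustrative.

```python
# Strings are immutable: a string method returns a new string
# and leaves the original untouched.
word = "banana"
upper_word = word.upper()
print(word)        # banana
print(upper_word)  # BANANA

# Lists are mutable: sort() rearranges the list in place and returns None.
letters = list("banana")
result = letters.sort()
print(letters)  # ['a', 'a', 'a', 'b', 'n', 'n']
print(result)   # None
```

A common mistake is writing letters = letters.sort(), which replaces the list with None.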
Let’s develop some functions to compute statistics about the words in a
sentence. To do so we are first going to split the sentence into a list of words.
Strings have a method to do exactly that, split.
>>> help(str.split)
Help on method_descriptor:
split(...)
S.split(sep=None, maxsplit=-1) -> list of strings
Return a list of the words in S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator and empty strings are
removed from the result.
>>> "this is a sentence".split()
['this', 'is', 'a', 'sentence']
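The help text above notes that split behaves differently depending on whether a separator is given. A small sketch of that difference, using a made-up sentence with repeated spaces:

```python
sentence = "this  is a   sentence"  # note the repeated spaces

# With no separator, any run of whitespace counts as a single delimiter
# and empty strings are removed from the result.
words = sentence.split()
print(words)  # ['this', 'is', 'a', 'sentence']

# With an explicit separator, every single space is a delimiter,
# so repeated spaces produce empty strings.
pieces = sentence.split(" ")
print(pieces)  # ['this', '', 'is', 'a', '', '', 'sentence']
```

For splitting a sentence into words, the no-argument form is usually the one we want.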
Let’s develop some functions for computing statistics about lists of words, specifically:
average_word_length(words)
longest_word(words)
shortest_word(words)
I will type, you drive!
Let’s start with average_word_length. What components will we need in that
function? What about in longest_word or shortest_word? For the latter,
often the key insight is how to initialize the variable in which we keep track
of the current longest or shortest word.
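One possible way to write the three functions is sketched below; it is not necessarily the version developed in class. Returning 0 or None for an empty list is an assumption made here so the functions do not fail on empty input. Note that longest and shortest are initialized to the first word rather than to an empty string.

```python
def average_word_length(words):
    """Return the average length of the strings in words (0 if words is empty)."""
    if len(words) == 0:
        return 0  # assumption: avoid dividing by zero for an empty list
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def longest_word(words):
    """Return the first longest string in words (None if words is empty)."""
    if len(words) == 0:
        return None
    longest = words[0]  # start from a real word, not ""
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest


def shortest_word(words):
    """Return the first shortest string in words (None if words is empty)."""
    if len(words) == 0:
        return None
    shortest = words[0]  # "" would always win, so start from a real word
    for word in words:
        if len(word) < len(shortest):
            shortest = word
    return shortest


words = "this is a sentence".split()
print(average_word_length(words))  # 3.75
print(longest_word(words))         # sentence
print(shortest_word(words))        # a
```

Initializing shortest to an empty string would be a bug, since no word is shorter than "".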
At a minimum your program should run correctly for all the examples included in the lab, say the examples in Lab 5 (Gradescope uses these as tests!). But let’s also think about testing more generally. Recall we have previously discussed some of the conditions to think about when designing (and testing) our algorithms. From 15.4 in “Practical Programming”:
Does the random_equation function work for zero operators?
Does the random_equation function work for both single and two digit numbers?
As we gain experience we will build out our mental checklist of situations to consider.
Recall that lists can store values of any (and potentially different) types. Thus we can have lists of lists. Doing so is often helpful if we have a sequence of values where each value might have multiple attributes or pieces of information.
Imagine we have some data about the probability of detecting different
particles, e.g. there is a 0.55 probability of detecting a 'neutron'. How could
we organize this using lists and then find the particle with the highest
probability of detection?
We organize our data as a list of lists:
>>> particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03], ['muon', 0.07], ['neutrino', 0.14]]
To find the particle with the highest probability of detection, we will need to access the elements within the nested lists:
highest = particles[0]
for particle in particles:
    if particle[1] > highest[1]:
        highest = particle
print(highest[0])
Note that highest is a list (not a string or integer) and in the if statement
we are accessing the probability in that list (at index 1).
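The same loop can be wrapped in a function so it can be reused and tested. This is only a sketch: the name most_likely_particle is hypothetical, and the function assumes the list is non-empty and that every inner list has the form [name, probability].

```python
def most_likely_particle(particles):
    """Return the name from the [name, probability] pair with the largest probability."""
    highest = particles[0]  # highest is itself a two-element list
    for particle in particles:
        if particle[1] > highest[1]:
            highest = particle
    return highest[0]


particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03],
             ['muon', 0.07], ['neutrino', 0.14]]

# Two indices in a row reach inside a nested list:
# the first picks the inner list, the second picks an element of it.
print(particles[1][0])  # proton
print(particles[1][1])  # 0.21

print(most_likely_particle(particles))  # neutron
```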
To date all of our data has been ephemeral, like the sentence examples we just tested. But most data analyses of any scale will start and/or end with data stored in a file. How do we read and write files?
open("/path/to/some/file", "r")
The first argument is the path to the file. Note Python starts looking in the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path.
The second argument is the mode, e.g. ‘r’ for reading, ‘w’ for writing, etc.
Let’s open a file of English words for reading (small-file.txt):
>>> file = open("small-file.txt", "r")
>>> file
<_io.TextIOWrapper name='small-file.txt' mode='r' encoding='US-ASCII'>
>>> type(file)
<class '_io.TextIOWrapper'>
Once we have opened the file we can easily read all of the lines with a for
loop (note there are other ways to read a file, but we will use for loops
most frequently):
for <loop variable> in <file variable>:
    <loop body>
The loop body is executed once for each line of the file, with the loop variable assigned that line, as a string, including any trailing newline (i.e. return).
For example:
>>> for line in file:
...     print(line)
...
hello world

how are you

cat

dog

Notice the empty lines. These result from the newlines in the file itself and
the newline added by print. Note we typically don’t want the newline from the
file, so we often use string’s strip method to remove it.
>>> help(str.strip)
Help on method_descriptor:
strip(...)
S.strip([chars]) -> str
Return a copy of the string S with leading and trailing
whitespace removed.
If chars is given and not None, remove characters in chars instead.
>>> a = "string with newline\n"
>>> a
'string with newline\n'
>>> a.strip()
'string with newline'
If we try to run our loop again (with strip added), e.g.
>>> for line in file:
...     print(line.strip())
...
nothing will be printed. This is not unexpected. The file object maintains
state, specifically a pointer to how much of the file has been read. When we
first open the file, the pointer “points to” the beginning of the file. Once we
have read the file it points to the end. Thus there is nothing more to read.
There are methods that we can use to reset the pointer, or we can close and
then reopen the file.
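The pointer behavior described above can be observed directly. The sketch below first writes a small throwaway file (its name and contents are made up for illustration) so that it runs on its own, then reads it three times. It uses the file object's seek method, which is one of the methods that can reset the pointer.

```python
import os
import tempfile

# Create a small throwaway file so the example is self-contained.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
file = open(path, "w")
file.write("cat\ndog\n")
file.close()

file = open(path, "r")

first_pass = []
for line in file:
    first_pass.append(line.strip())

second_pass = []
for line in file:  # the pointer is at the end, so the body never runs
    second_pass.append(line.strip())

file.seek(0)  # move the pointer back to the beginning of the file
third_pass = []
for line in file:
    third_pass.append(line.strip())

file.close()

print(first_pass)   # ['cat', 'dog']
print(second_pass)  # []
print(third_pass)   # ['cat', 'dog']
```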
All open files need to be closed (with the close method, e.g.
file.close()). This is especially important for writing to files as it will
force the data to actually be written to the disk/file system. You can do so
manually, but it is easy to forget, and there are error situations where you
may not be able to explicitly call close. Best practice is to use with
blocks, which ensure that the file is always closed for you. For example:
with open("filename", "r") as file:
    # Work with file object
# File is automatically closed when you exit the with block
In class the expectation is that you will always use with blocks when
reading files.
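A complete, runnable version of this pattern might look like the sketch below. The file name and contents are made up so the example runs on its own. The closed attribute of the file object lets us confirm that the file was closed on leaving the with block.

```python
import os
import tempfile

# Write a small throwaway file; the with block closes it for us,
# which also makes sure the data actually reaches the disk.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w") as file:
    file.write("hello world\nhow are you\n")

# Read the file back, stripping the trailing newline from each line.
lines = []
with open(path, "r") as file:
    for line in file:
        lines.append(line.strip())

print(lines)        # ['hello world', 'how are you']
print(file.closed)  # True
```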
Let’s put this together with the functions that we wrote earlier to generate basic statistics about english.txt. You can download all the functions in word-stats.py.
>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768098732
Notice that almost all the code is shared between sentence_stats and
file_stats. This is nicely DRY! If you ever find yourself copying and pasting
code, make a function instead.
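The code in word-stats.py is not reproduced here, so the following is only a sketch of how the two functions might share their work. The helpers word_stats and print_stats are hypothetical names. To keep the block short, the built-in max and min with key=len stand in for the longest_word and shortest_word functions. The sketch also assumes the file has one word per line and that the list of words is non-empty.

```python
def word_stats(words):
    """Return (count, longest, shortest, average length) for a non-empty list of words."""
    total = 0
    for word in words:
        total += len(word)
    # max/min with key=len are stand-ins for longest_word/shortest_word
    return len(words), max(words, key=len), min(words, key=len), total / len(words)


def print_stats(words):
    """Print the statistics; all of the shared code lives here."""
    count, longest, shortest, average = word_stats(words)
    print("Number of words:", count)
    print("Longest word:", longest)
    print("Shortest word:", shortest)
    print("Avg. word length:", average)


def sentence_stats(sentence):
    """Print statistics for the words in a sentence."""
    print_stats(sentence.split())


def file_stats(filename):
    """Print statistics for a file with one word per line (an assumption)."""
    words = []
    with open(filename, "r") as file:
        for line in file:
            words.append(line.strip())
    print_stats(words)


sentence_stats("this is a sentence")
```

The only difference between the two entry points is how the list of words is produced; everything after that is shared.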
What are some alternate design choices we could have made? Recall from the
reading that we could have used the readlines method to produce the list.
However, we would still need to remove the trailing newline characters.
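A sketch of the readlines alternative, again using a made-up throwaway file so the example runs on its own:

```python
import os
import tempfile

# Create a small throwaway file with one word per line.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "example.txt")
with open(path, "w") as file:
    file.write("cat\ndog\n")

# readlines returns every line of the file as a list of strings,
# each still carrying its trailing newline.
with open(path, "r") as file:
    raw_lines = file.readlines()
print(raw_lines)  # ['cat\n', 'dog\n']

# We still need a loop to strip the newlines.
words = []
for line in raw_lines:
    words.append(line.strip())
print(words)  # ['cat', 'dog']
```

Since a loop is needed either way, iterating over the file directly avoids building the intermediate list.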
Note that these examples assume that the file you are reading and the program are in the same directory.