Class 14: Lists and Files

Objectives for today

Revisiting strings vs. lists

Python strings and lists are both examples of sequences. As such, they support many of the same operations, e.g. indexing and slicing, and many functions can take either a string or a list as an argument (e.g. len). However, there are several key differences between strings and lists, notably:
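
One difference that comes up constantly is mutability: a list can be changed in place, while a string cannot. The following is a minimal sketch of that difference; the variable names are only for illustration, and the try/except is there solely so the block runs to completion.

```python
# Lists are mutable: an element can be replaced in place.
letters = ["c", "a", "t"]
letters[0] = "b"  # letters is now ['b', 'a', 't']

# Strings are immutable: the same kind of assignment raises a TypeError.
word = "cat"
try:
    word[0] = "b"
    changed = True
except TypeError:
    changed = False  # this is the branch that runs

# To "change" a string we build a new string instead.
new_word = "b" + word[1:]  # 'bat'; word itself is still 'cat'
```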

Writing some functions with lists

Let’s develop some functions to compute statistics about the words in a sentence. To do so we are first going to split the sentence into a list of words. Strings have a method to do exactly that, split.

>>> help(str.split)
Help on method_descriptor:

split(...)
    S.split(sep=None, maxsplit=-1) -> list of strings
    
    Return a list of the words in S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.

>>> "this is a sentence".split()
['this', 'is', 'a', 'sentence']
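
The help text mentions the sep and maxsplit arguments, which are easy to overlook. As a small sketch (the example strings are made up for illustration), compare splitting on any whitespace with splitting on an explicit separator:

```python
sentence = "this  is a   sentence"  # note the repeated spaces

# With no arguments, a run of whitespace counts as a single separator.
words = sentence.split()
# ['this', 'is', 'a', 'sentence']

# With an explicit separator, every occurrence splits, so repeated
# separators produce empty strings in the result.
pieces = sentence.split(" ")
# ['this', '', 'is', 'a', '', '', 'sentence']

# maxsplit limits the number of splits; here we split at most once.
first, rest = "a,b,c".split(",", 1)
# first is 'a', rest is 'b,c'
```

For sentences typed by a person, the no-argument form is usually what we want, since it tolerates extra spaces.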

Let’s develop some functions for computing statistics about lists of words, including average_word_length, longest_word, and shortest_word.

I will type, you drive!

Let’s start with average_word_length. What components will we need in that function? What about in longest_word or shortest_word? For the latter, often the key insight is how to initialize the variable in which we keep track of the current longest or shortest word.
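
One possible version of these three functions is sketched below. It is not necessarily what we will end up with in class. It assumes the list of words is non-empty and initializes the "current best" variable with the first word in the list, which is one common answer to the initialization question.

```python
def average_word_length(words):
    """Return the mean length of the strings in words (a non-empty list)."""
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def longest_word(words):
    """Return the longest string in words (the first one if there is a tie)."""
    longest = words[0]  # start with a real word from the list
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest


def shortest_word(words):
    """Return the shortest string in words (the first one if there is a tie)."""
    shortest = words[0]
    for word in words:
        if len(word) < len(shortest):
            shortest = word
    return shortest


words = "this is a sentence".split()
print(average_word_length(words))  # 3.75
print(longest_word(words))         # sentence
print(shortest_word(words))        # a
```

Starting from words[0] avoids having to invent a placeholder such as an empty string, which would work for longest_word but give the wrong answer for shortest_word.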

A bit about testing in our labs

At a minimum your program should run correctly for all the examples included in the lab, say the examples in Lab 5 (Gradescope uses these as tests!). But let’s also think about testing more generally. Recall we have previously discussed some of the conditions to think about when designing (and testing) our algorithms. From 15.4 in “Practical Programming”:

As we gain experience we will build out our mental checklist of situations to consider.
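
To make such a checklist concrete, here is a hedged sketch of how we might test a longest_word function with assert statements. The function is repeated so the block runs on its own, and this variant returns None for an empty list, which is an assumed design choice rather than a lab requirement. The cases are a few commonly checked situations, not necessarily the exact list from the reading.

```python
def longest_word(words):
    """Return the longest string in words, or None for an empty list."""
    if len(words) == 0:
        return None
    longest = words[0]
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest


def test_longest_word():
    """Exercise longest_word on a typical case and several edge cases."""
    # Typical case
    assert longest_word(["this", "is", "a", "sentence"]) == "sentence"
    # A list with a single element
    assert longest_word(["cat"]) == "cat"
    # The answer is at the very start or the very end of the list
    assert longest_word(["hello", "to", "a"]) == "hello"
    assert longest_word(["a", "to", "hello"]) == "hello"
    # A tie: this version keeps the first of the equally long words
    assert longest_word(["dog", "cat"]) == "dog"
    # The empty list
    assert longest_word([]) is None


test_longest_word()  # runs silently if every assertion holds
```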

Lists of lists

Recall that lists can store values of any (and potentially different) types. Thus we can have lists of lists. Doing so is often helpful if we have a sequence of values where each value might have multiple attributes or pieces of information.

Imagine we have some data about the probability of detecting different particles, e.g. there is a 0.55 probability of detecting a 'neutron'. How could we organize this using lists and then find the particle with the highest probability of detection?

We organize our data as a list of lists:

>>> particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03], ['muon', 0.07], ['neutrino', 0.14]]

To find the particle with the highest probability of detection, we will need to access the elements within the nested lists:

highest = particles[0]
for particle in particles:
    if particle[1] > highest[1]:
        highest = particle
print(highest[0])

Note that highest is a list (not a string or integer) and in the if statement we are accessing the probability in that list (at index 1).
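
The same search can be packaged as a function. The sketch below is one possible variant, not the version from class: the name most_likely is hypothetical, and it unpacks each inner list into two loop variables instead of indexing with [0] and [1].

```python
def most_likely(particles):
    """Return the name with the highest probability.

    particles is a non-empty list of [name, probability] lists.
    """
    best_name, best_prob = particles[0]
    for name, prob in particles:  # unpack each two-element inner list
        if prob > best_prob:
            best_name = name
            best_prob = prob
    return best_name


particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03],
             ['muon', 0.07], ['neutrino', 0.14]]
print(most_likely(particles))  # neutron
```

Unpacking only works here because every inner list has exactly two elements; indexing is the safer choice when the inner lists vary in length.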

Files

To date all of our data has been ephemeral, like the sentence examples we just tested. But most data analyses of any scale will start and/or end with data stored in a file. How do we read and write files?

open("/path/to/some/file", "r")

The first argument is the path to the file. Note Python starts looking in the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path.

The second argument is the mode, e.g. ‘r’ for reading, ‘w’ for writing, etc.
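
As a rough illustration of the modes, the sketch below writes a file, appends to it, and reads it back. The file name example.txt is hypothetical, and the block works inside a temporary directory (via the standard tempfile module) so that it does not touch any real files.

```python
import os
import tempfile

# Work in a temporary directory that is deleted when the block ends.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")  # hypothetical file name

    # "w" creates the file, or overwrites it if it already exists.
    out_file = open(path, "w")
    out_file.write("hello world\n")
    out_file.write("how are you\n")
    out_file.close()

    # "a" appends to the end of the file instead of overwriting it.
    out_file = open(path, "a")
    out_file.write("cat\n")
    out_file.close()

    # "r" opens the file for reading; read() returns the whole file
    # as a single string.
    in_file = open(path, "r")
    contents = in_file.read()
    in_file.close()

print(contents)
```

Note that write does not add a newline for us the way print does, so each "\n" is written explicitly.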

Let’s open a file of English words for reading (small-file.txt):

>>> file = open("small-file.txt", "r")
>>> file
<_io.TextIOWrapper name='small-file.txt' mode='r' encoding='US-ASCII'>
>>> type(file)
<class '_io.TextIOWrapper'>

Once we have opened the file we can easily read all of the lines with a for loop (note there are other ways to read a file, but we will use for loops most frequently):

for <loop variable> in <file variable>:
	<loop body>

The loop body will be executed once for each line of the file, with the loop variable assigned that line as a string, including any trailing newline (i.e. return).

For example:

>>> for line in file:
...     print(line)
... 
hello world

how are you

cat

dog

Notice the empty lines. These result from the combination of the newline in the file itself and the newline added by print. We typically don’t want the newline from the file, so we often use the string strip method to remove it.

>>> help(str.strip)
Help on method_descriptor:

strip(...)
    S.strip([chars]) -> str
    
    Return a copy of the string S with leading and trailing
    whitespace removed.
    If chars is given and not None, remove characters in chars instead.

>>> a = "string with newline\n"
>>> a
'string with newline\n'
>>> a.strip()
'string with newline'
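
Because strip removes all leading and trailing whitespace, it can remove more than just the newline. As a small sketch (the example string is made up), compare it with rstrip, a related string method that only works on the right-hand end:

```python
line = "  indented line \n"

# strip removes whitespace from both ends.
stripped = line.strip()            # 'indented line'

# rstrip removes whitespace from the right-hand end only.
right_only = line.rstrip()         # '  indented line'

# Passing "\n" removes only trailing newline characters.
newline_only = line.rstrip("\n")   # '  indented line '
```

For files with one word per line, strip is normally fine; the distinction matters when leading spaces are meaningful.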

If we try to run our loop again (with strip added), e.g.

>>> for line in file:
...     print(line.strip())
... 

nothing will be printed. This is not unexpected. The file object maintains state, specifically a pointer to how much of the file has been read. When we first open the file, the pointer “points to” the beginning of the file. Once we have read the file it points to the end. Thus there is nothing more to read. There are methods that we can use to reset the pointer, or we can close and then reopen the file.
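
One method for resetting the pointer is seek: calling file.seek(0) moves it back to the beginning of the file. The sketch below first creates its own copy of small-file.txt in a temporary directory (an assumption made so the block is self-contained) and then reads the file three times.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a stand-in for small-file.txt so the example runs anywhere.
    path = os.path.join(folder, "small-file.txt")
    out_file = open(path, "w")
    out_file.write("hello world\nhow are you\ncat\ndog\n")
    out_file.close()

    file = open(path, "r")

    first_pass = []
    for line in file:
        first_pass.append(line.strip())

    # The pointer is now at the end, so this loop body never runs.
    second_pass = []
    for line in file:
        second_pass.append(line.strip())

    # Move the pointer back to the start and read again.
    file.seek(0)
    third_pass = []
    for line in file:
        third_pass.append(line.strip())

    file.close()

print(first_pass)   # ['hello world', 'how are you', 'cat', 'dog']
print(second_pass)  # []
print(third_pass)   # ['hello world', 'how are you', 'cat', 'dog']
```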

All open files need to be closed (with the close method, e.g. file.close()). This is especially important when writing to files, as closing forces the data to actually be written to the disk/file system. You can close files manually, but it is easy to forget, and there are error situations where you may not be able to explicitly call close. Best practice is to use with blocks, which ensure that the file is always closed for you. For example:

with open("filename", "r") as file:
	# Work with file object
	# File is automatically closed when you exit the with block

In class the expectation is that you will always use with blocks when reading files.
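
To see the automatic closing in action, we can check the file object's closed attribute inside and after the with block. This is only a sketch: the file name words.txt is hypothetical, and the file is created in a temporary directory so the block runs on its own.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "words.txt")  # hypothetical file name

    # with blocks work for writing too; the data is flushed on exit.
    with open(path, "w") as out_file:
        out_file.write("cat\ndog\n")

    words = []
    with open(path, "r") as file:
        for line in file:
            words.append(line.strip())
        closed_inside = file.closed  # False: still open inside the block

    closed_after = file.closed       # True: closed once the block exits

print(words)          # ['cat', 'dog']
print(closed_inside)  # False
print(closed_after)   # True
```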

Let’s put this together with the functions that we wrote earlier to generate basic statistics about english.txt. You can download all the functions in word-stats.py.

>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768098732

Notice that almost all the code is shared between sentence_stats and file_stats. This is nicely DRY! If you ever find yourself copying and pasting code, make a function instead.
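
The sharing might look something like the sketch below. This is not the contents of word-stats.py: the helper summarize is a hypothetical name, the built-in max and min (with key=len) stand in for the longest_word and shortest_word functions written in class, and file_stats assumes a file with one word per line.

```python
def average_word_length(words):
    """Return the mean length of the strings in a non-empty list."""
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def summarize(words):
    """Return a multi-line report for a non-empty list of words."""
    # max and min with key=len are built-in stand-ins for the
    # longest_word and shortest_word functions.
    return (
        "Number of words: " + str(len(words)) + "\n"
        + "Longest word: " + max(words, key=len) + "\n"
        + "Shortest word: " + min(words, key=len) + "\n"
        + "Avg. word length: " + str(average_word_length(words))
    )


def sentence_stats(sentence):
    """Print statistics for the words in a sentence."""
    print(summarize(sentence.split()))


def file_stats(filename):
    """Print statistics for a file assumed to hold one word per line."""
    words = []
    with open(filename, "r") as file:
        for line in file:
            words.append(line.strip())
    print(summarize(words))


sentence_stats("this is a sentence")
```

The only difference between the two entry points is how the list of words is produced; everything after that point is shared.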

What are some alternate design choices we could have made? Recall from the reading that we could have used the readlines method to produce the list. However, we would still need to remove the trailing newline characters.
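
A readlines version might look like the following sketch. As before, the block creates its own small file in a temporary directory so that it is self-contained; the file name is reused from the earlier example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a stand-in for small-file.txt so the example runs anywhere.
    path = os.path.join(folder, "small-file.txt")
    with open(path, "w") as out_file:
        out_file.write("hello world\nhow are you\ncat\ndog\n")

    # readlines returns a list with one string per line,
    # each still ending in its newline.
    with open(path, "r") as file:
        lines = file.readlines()

# We still need a loop to strip the trailing newlines.
words = []
for line in lines:
    words.append(line.strip())

print(lines)  # ['hello world\n', 'how are you\n', 'cat\n', 'dog\n']
print(words)  # ['hello world', 'how are you', 'cat', 'dog']
```

Since a loop is needed either way, iterating over the file directly does the same work without first building the intermediate list.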

Note that these examples assume that the file you are reading and the program are in the same directory.