CSCI 150 Spring 2020

Lecture 10: Lists and Files

Objectives for today

Explain and use functions, methods and operators on lists including appending, indexing, slicing and sorting
Use a for loop to iterate through a sequence (string, range, list, etc.)
Open a file for reading and read its contents or iterate through its contents line-by-line

Writing some functions with lists

Let’s develop some functions to compute statistics about the words in a sentence. To do so we will first split the sentence into a list of words. Strings have a method to do exactly that, split.

>>> help(str.split)
Help on method_descriptor:

split(...)
    S.split(sep=None, maxsplit=-1) -> list of strings
    
    Return a list of the words in S, using sep as the
    delimiter string.  If maxsplit is given, at most maxsplit
    splits are done. If sep is not specified or is None, any
    whitespace string is a separator and empty strings are
    removed from the result.

>>> "this is a sentence".split()
['this', 'is', 'a', 'sentence']

Specifically, we will write three functions that each take a list of words as their parameter:

average_word_length(words)
longest_word(words)
shortest_word(words)

Let’s start with average_word_length. What components will we need in that function? What about in longest_word or shortest_word? For the latter, often the key insight is how to initialize the variable in which we keep track of the current longest or shortest word.
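
One possible way to write these three functions is sketched below. This is a minimal sketch, not necessarily the version developed in class: returning 0 or None for an empty list is an assumption made here so the functions do not fail on empty input. Note how longest and shortest are initialized with the first word in the list.

```python
def average_word_length(words):
    """Return the average length of the strings in words."""
    if len(words) == 0:
        return 0  # Assumed convention: avoids dividing by zero
    total = 0
    for word in words:
        total += len(word)
    return total / len(words)


def longest_word(words):
    """Return the longest string in words (the first one if there is a tie)."""
    if len(words) == 0:
        return None  # Assumed convention for an empty list
    longest = words[0]  # Start with the first word as the current longest
    for word in words:
        if len(word) > len(longest):
            longest = word
    return longest


def shortest_word(words):
    """Return the shortest string in words (the first one if there is a tie)."""
    if len(words) == 0:
        return None  # Assumed convention for an empty list
    shortest = words[0]  # Start with the first word as the current shortest
    for word in words:
        if len(word) < len(shortest):
            shortest = word
    return shortest


words = "this is a sentence".split()
print(average_word_length(words))  # 3.75
print(longest_word(words))         # sentence
print(shortest_word(words))        # a
```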

A bit about testing in our labs

At a minimum your program should run correctly for all the examples included in the lab, say the examples in Lab 5 (Gradescope uses these as tests). But let’s also think more generally about testing (a preview of the upcoming reading). Recall we have previously discussed some of the conditions to think about when designing (and testing) our algorithms. From 15.4 in “Practical Programming”:

Think about size, including collections with zero, one, and more than one value. For example does your random_equation function work for zero operators?
Think about dichotomies, e.g., empty/full, even/odd, positive/zero/negative, and alphabetic/nonalphabetic
Think about boundaries
Think about order, e.g., is a collection sorted or not

As we gain experience we will build out our mental checklist of situations to consider.
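
As an illustration of the checklist, here is a small sketch that checks a function against several of those situations. The function count_words is a hypothetical helper invented for this example (it is not part of any lab). Each assert statement raises an error if its condition is False, so the script finishing silently means every check passed.

```python
def count_words(sentence):
    """Hypothetical helper: return the number of words in sentence."""
    return len(sentence.split())


# Size: zero, one, and more than one word
assert count_words("") == 0
assert count_words("hello") == 1
assert count_words("hello world") == 2

# Boundaries: leading, trailing, and repeated whitespace
assert count_words("  hello   world  ") == 2

# Dichotomies: alphabetic and nonalphabetic "words"
assert count_words("route 66") == 2
```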

Furthermore, make sure each individual function obeys the specification given. This was a source of difficulty for many in Lab 4, e.g., random_equation could only use numbers in the range 1 to 10, and query_equation expected a string as a parameter.

Files

Most data analyses of any scale will start and/or end with data stored in a file. How do we read and write files?

The simplest way to open a file is with

open("myfile", "r")

The first argument is the name of the file. Python looks for the file relative to the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path.

open("/path/to/some/file", "r")

The second argument is the mode, e.g., ‘r’ for reading, ‘w’ for writing, etc.

For example, to open a file of English words for reading (small-file.txt):

>>> file = open("small-file.txt", "r")

In your programs, a best practice is to use with blocks, which ensure that the file is always closed for you. For example:

with open("filename", "r") as file:
	# Work with file object
	# File is automatically closed when you exit the with block

In this course the expectation is that you will always use with blocks when reading files.
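
To make the modes concrete, here is a minimal sketch that writes a small file with "w" and then reads it back with "r", using a with block each time. The file name example-output.txt is hypothetical and chosen only for illustration. The read method used here returns the entire file as a single string.

```python
# "example-output.txt" is a hypothetical file name used only for illustration.
# Opening with "w" creates the file, or overwrites it if it already exists.
with open("example-output.txt", "w") as file:
    file.write("hello world\n")  # write does not add a newline for us
    file.write("how are you\n")

print(file.closed)  # True: the with block closed the file for us

with open("example-output.txt", "r") as file:
    contents = file.read()  # The entire file as a single string

print(contents)
```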

Once we have opened the file we can easily read all of the lines with a for loop (note there are other ways to read a file, but we will use for loops most frequently):

for <loop variable> in <file variable>:
	<loop body>

The loop body is executed once for each line of the file, with the loop variable assigned that line as a string, including any trailing newline (i.e., return).

For example:

>>> for line in file:
...     print(line)
... 
hello world

how are you

cat

dog

Notice the empty lines; these result from the newlines in the file itself and the newline added by print. Note we typically don’t want the newline from the file, so we often use string’s strip method to remove it.

>>> help(str.strip)
Help on method_descriptor:

strip(...)
    S.strip([chars]) -> str
    
    Return a copy of the string S with leading and trailing
    whitespace removed.
    If chars is given and not None, remove characters in chars instead.

>>> a = "string with newline\n"
>>> a
'string with newline\n'
>>> a.strip()
'string with newline'
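
Combining the with block, the for loop and strip, one way to read a file into a list of strings is sketched below. The function name read_words is an assumption made for this example and may differ from what appears in word-stats.py.

```python
def read_words(filename):
    """Return a list with one string per line of the file, newlines removed."""
    words = []
    with open(filename, "r") as file:
        for line in file:
            # strip removes the trailing newline (and any other
            # leading or trailing whitespace)
            words.append(line.strip())
    return words
```

For the small-file.txt shown above, read_words("small-file.txt") should produce ['hello world', 'how are you', 'cat', 'dog'].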

Let’s put this together with the functions that we wrote earlier to generate basic statistics about english.txt. You can download all the functions in word-stats.py.

>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768098732

Notice that almost all the code is shared between sentence_stats and file_stats. This is nicely DRY! If you ever find yourself copying and pasting code, make a function instead.

What are some alternate design choices we could have made? Recall from the reading that we could have used the readlines method to produce the list. However, we would still need to remove the trailing newline characters.
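
A sketch of the readlines alternative, under the assumption that we want the same list of stripped strings as before. The function name is hypothetical. The strings returned by readlines keep their trailing newlines, so a second loop is still needed to remove them.

```python
def read_words_with_readlines(filename):
    """Return a list with one string per line of the file, newlines removed."""
    with open(filename, "r") as file:
        lines = file.readlines()  # A list of strings, each ending in "\n"
    words = []
    for line in lines:
        words.append(line.strip())  # Remove the trailing newline
    return words
```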

Lists of lists

Recall that lists can store values of any (and potentially different) types. Thus we can have lists of lists. Doing so is often helpful if we have a sequence of values where each value might have multiple attributes or pieces of information.

Imagine we have some data about the probability of detecting different particles, e.g., there is 0.55 probability of detecting a 'neutron'. How could we organize this using lists and then find the particle with the highest probability of detection?

We organize our data as a list of lists:

>>> particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03], ['muon', 0.07], ['neutrino', 0.14]]

To find the particle with the highest probability of detection, we will need to access the elements within the nested lists:

highest = particles[0]
for particle in particles:
    if particle[1] > highest[1]:
        highest = particle
print(highest[0])

Note that highest is a list (a name and a probability), not a string or a number, and in the if statement we are accessing the probability in that list (at index 1).
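
To make the nested indexing explicit, the sketch below first indexes into the outer list and then into the inner list, and then wraps the search loop in a function. The function name most_likely is a hypothetical choice for this example.

```python
particles = [['neutron', 0.55], ['proton', 0.21], ['meson', 0.03],
             ['muon', 0.07], ['neutrino', 0.14]]

second = particles[1]          # The inner list ['proton', 0.21]
name = particles[1][0]         # 'proton'
probability = particles[1][1]  # 0.21


def most_likely(pairs):
    """Hypothetical helper: return the name with the highest probability."""
    highest = pairs[0]  # Start with the first [name, probability] pair
    for pair in pairs:
        if pair[1] > highest[1]:
            highest = pair
    return highest[0]


print(most_likely(particles))  # neutron
```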

Revisiting strings vs. lists

Python strings and lists are both examples of sequences. As such they both support many of the same operations, e.g., indexing, slicing. Many functions can take both strings and lists as an argument (e.g., len). However there are several key differences between strings and lists, notably:

A string is a sequence of characters, while a list is a sequence of elements of any type (e.g., a list might contain integers, strings, other lists, etc.)
A string is immutable (unchangeable) while a list is mutable (changeable). As a result string methods do not modify the string on which they are invoked, instead they create a new string with those modifications applied. In contrast, as we saw last time, some list methods, e.g., sort, modify the list on which they are invoked.

This also impacts indexing. If we try to reassign a character of a string we will get an error, but we can reassign an element of a list.

```
>>> a = "abcd"
>>> a[0] = "h"
Traceback (most recent call last):
  File "<pyshell>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>> a = list("abcd")
>>> a[0] = "h"
>>> a
['h', 'b', 'c', 'd']
```
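
The same distinction shows up with methods. As a small sketch: the string method upper returns a new string and leaves the original untouched, while the list method sort changes the list itself and returns None.

```python
word = "banana"
upper_word = word.upper()  # Creates a new string
print(word)                # banana (the original is unchanged)
print(upper_word)          # BANANA

letters = list("banana")
result = letters.sort()    # Sorts the list in place
print(letters)             # ['a', 'a', 'a', 'b', 'n', 'n']
print(result)              # None (sort does not return the sorted list)
```

A common mistake that follows from this is writing letters = letters.sort(), which replaces the list with None.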

Summary

Lists
Files
Quiz 5 and Lab 5 on Friday (cancelled). Complete the in-class exercises and Prelab 5 beforehand.

Links in today’s notes

word-stats.py