Use loops to compute statistics for different inputs
Describe the motivations for using data structures as the inputs to our functions
Open a file for reading and read its contents or iterate through its contents line-by-line
Explain the purpose of and use a with block to encapsulate file operations
Use optional and keyword arguments
Create a correctly formatted and documented module
Utilize command line arguments to control program execution
Launch Python programs from the command line
Writing some functions with lists
Let’s develop some functions to compute statistics about the words in a sentence. To do so we are first going to split the sentence into a list of words. Strings have a method to do exactly that, split.
help(str.split)
Help on method_descriptor:
split(self, /, sep=None, maxsplit=-1)
Return a list of the substrings in the string, using sep as the separator string.
sep
The separator used to split the string.
When set to None (the default value), will split on any whitespace
character (including \\n \\r \\t \\f and spaces) and will discard
empty strings from the result.
maxsplit
Maximum number of splits (starting from the left).
-1 (the default value) means no limit.
Note, str.split() is mainly useful for data that has been intentionally
delimited. With natural text that includes punctuation, consider using
the regular expression module.
"this is a sentence".split()
['this', 'is', 'a', 'sentence']
Let’s develop some functions for computing statistics about lists of words, specifically
average_word_length(words)
longest_word(words)
shortest_word(words)
Let’s start with average_word_length. What components will we need in that function? What about in longest_word or shortest_word? For the latter two, can we use the built-in min and max functions we saw previously?
Thinking through these functions
We will need a loop to iterate through the list of words, in this case a for loop since we know the number of iterations (the length of the list) ahead of time. One of the key challenges here is that we are not comparing the strings directly, e.g., “the” vs. “antidisestablishmentarianism”, but the lengths of those strings. That is, in all of these functions we will first need to determine the length of each string, then compute the average, etc. If we just use max(words), we will get the lexicographically largest word, not the longest word. As an example, consider:
words = "this is a sentence".split()
max(words)
'this'
Check out word-stats.py for a possible solution. For longest_word and shortest_word, one of the decisions we need to make is how to correctly initialize the variable we use to track the length of the longest or shortest word seen so far. The question we must ask ourselves is: what is the longest or shortest word we could ever encounter? We can then choose an initial value that is guaranteed to be replaced by the first word we examine.
For the longest word that is easier. The shortest possible word has a length of 0 (the empty string), so we could initialize the length to -1, since we know that is less than the length of every possible word. But for the shortest word, for most any constant we could come up with, we could imagine a longer word, so there is no safe initial value. If we can assume or require that the input list is non-empty (and we should require that, as the min/max of an empty sequence is ill-defined), then a common approach is to use the length of the first word. That ensures the initial value is a valid length. If the first word happens to be the shortest (or longest), we will have the correct answer; if not, we will find the right word as we scan the remainder of the list.
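As a sketch of that first-word initialization (word-stats.py contains the actual solution, which may differ in its details):

def shortest_word(words):
    """Return the shortest word in a non-empty list of words"""
    # Initialize with the first word: a valid candidate that any
    # shorter word will replace as we scan the rest of the list
    shortest = words[0]
    for word in words[1:]:
        if len(word) < len(shortest):
            shortest = word
    return shortest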
Alternate implementations using Python features we haven’t discussed
The example code contains alternate implementations for longest_word and shortest_word using Python features, specifically optional keyword parameters and functions as arguments, that we haven’t yet talked about (but will). I encourage you to check those out.
A bit about testing our programs
At a minimum your program should run correctly for all the examples included in the assignment (I typically configure Gradescope with those examples as tests!). But let’s also think about testing more generally. Recall we have previously discussed some of the conditions to think about when designing (and testing) our algorithms. From 15.4 in “Practical Programming”:
Think about size, including collections with zero, one and more than one value. For example does your random_equation function in programming assignment 4 work for zero operators?
Think about dichotomies, e.g. empty/full, even/odd, positive/zero/negative, and alphabetic/nonalphabetic. For example, does your random_equation function work for both single and two digit numbers?
Think about boundaries
Think about order, e.g. a collection is sorted or not
As we gain experience we will build out our mental checklist of situations to consider.
Why did we use lists as the input?
Why did we use lists as the input to our functions? Could we not just have processed the input string one word at a time and not even created the list in the first place, e.g., something like the following? Yes! For these operations. But what if we wanted to compute the median (i.e., middle) length? No. We can’t compute the median with the loop below because to find the median we need to retain all of the lengths (so we can find the middle). Using a list enables us to store all the words (and their lengths) so that we can compute statistics that require knowing the entire dataset.
for word in sentence.split():
    # Update average length numerator/denominator, etc.
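By contrast, a median computation needs all of the lengths at once. A minimal sketch (median_word_length is a hypothetical helper, not part of word-stats.py):

def median_word_length(words):
    """Return the median word length for a non-empty list of words"""
    # We must retain every length to find the middle value
    lengths = sorted(len(word) for word in words)
    middle = len(lengths) // 2
    if len(lengths) % 2 == 1:
        return lengths[middle]
    return (lengths[middle - 1] + lengths[middle]) / 2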
There is a second motivation. Here we are analyzing a single string, but we can imagine other sources of words we might want to analyze, like a file of words. We can use data structures, like a list, to decouple components within our program. In this case, we separate the source of the input from the calculations. We can use any input, strings, files, etc., that can be read into a list without changing our analysis functions (the average length, etc.).
Files
To date all of our data has been ephemeral, like the sentence examples we just tested. But most data analyses of any scale will start and/or end with data stored in a file. How do we read and write files?
open("filename", "r")
The first argument is the path to the file. Note Python starts looking in the current working directory (typically the same directory as your script). If your file is elsewhere you will need to supply the necessary path. The second argument is the mode, e.g. ‘r’ for reading, ‘w’ for writing, etc.
Let’s open a file of English words for reading (small-file.txt):
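For example (a sketch of what this looks like; the exact repr may differ slightly on your system):

file = open("small-file.txt", "r")
file
<_io.TextIOWrapper name='small-file.txt' mode='r' encoding='UTF-8'>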
We don’t need to know what a TextIOWrapper is. That is part of Python’s internal implementation peeking through. What is important for us to know is that the file object created by open is iterable, i.e., we can read through all the lines in the file with a for loop (see below), and it has methods like read for reading the entire contents of the file.
Once we have opened the file we can easily read all of the lines with a for loop (note there are other ways to read a file, but we will use for loops most frequently). That is we use the file as the loop sequence.
for <loop variable> in <file variable>:
    <loop body>
The loop body will get executed for each line of the file with the loop variable assigned the line, as a string, including any newline (i.e. return). For example:
for line in file:
    print(line)
hello world

how are you

cat

dog
Notice the empty lines; these result from the newlines in the file itself plus the newline added by print. That is, the contents of the file are really equivalent to:
"hello world\nhow are you\ncat\ndog\n"
and when Python reads each line of the file it includes the newline, i.e., the first value for line is "hello world\n". By default print adds its own newline, so the result is to print hello world\n\n (note the two newlines). We typically don’t want the newline from the file, so we often use the string strip method to remove it from each line after reading it from the file.
help(str.strip)
Help on method_descriptor:
strip(self, chars=None, /)
Return a copy of the string with leading and trailing whitespace removed.
If chars is given and not None, remove characters in chars instead.
a = "string with newline\n"
a.strip()
'string with newline'
If we try to run our loop again (with strip added), e.g.
for line in file:
    print(line.strip())
nothing will be printed. This is not unexpected. The file object maintains state, specifically a pointer to how much of the file has been read. When we first open the file, the pointer “points to” the beginning of the file. Once we have read the file it points to the end. Thus, there is nothing more to read. There are methods that we can use to reset the pointer, or we can close and then reopen the file.
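For example, the file object’s seek method moves that pointer; file.seek(0) resets it to the beginning of the file so a subsequent loop would read all the lines again.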
All open files need to be closed (with the close method, e.g. file.close()). This is especially important when writing to files as it forces the data to actually be written to the disk/file system. You can do so manually, but it is easy to forget, and there are error situations where you may not be able to explicitly call close. Best practice is to use with blocks, which ensure that the file is always closed for you. For example:
with open("filename", "r") as file:
    # Work with file object
    # File is automatically closed when you exit the with block
In our class the expectation is that you will always use with blocks when reading files.
Let’s put this together with the functions that we wrote earlier to generate basic statistics about english.txt. You can download all the functions in word-stats.py.
>>> file_stats("english.txt")
Number of words: 47158
Longest word: antidisestablishmentarianism
Shortest word: Hz
Avg. word length: 8.37891768098732
Notice that almost all the code is shared between sentence_stats and file_stats. This is nicely DRY! If you ever find yourself copying and pasting code, make a function instead.
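As a hedged sketch of that shared structure (the actual word-stats.py may differ in its details), file_stats can build the list of words from a file and then reuse the longest_word, shortest_word, and average_word_length functions from earlier; sentence_stats would look the same except it builds the list with sentence.split():

def file_stats(filename):
    """Print word statistics for a file of whitespace-delimited words"""
    words = []
    with open(filename, "r") as file:
        for line in file:
            # split() discards the trailing newline along with other whitespace
            words.extend(line.split())
    print("Number of words:", len(words))
    print("Longest word:", longest_word(words))
    print("Shortest word:", shortest_word(words))
    print("Avg. word length:", average_word_length(words))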
Note that these examples assume that the file you are reading and the program are in the same directory.
Optional Parameters
We have used range extensively, and done so with different numbers of parameters.
>>> help(range)
Help on class range in module builtins:
class range(object)
| range(stop) -> range object
| range(start, stop[, step]) -> range object
This works because Python supports optional arguments, e.g. the optional “step”. How would we implement our own version of range? Consider the following (optional_paramters.py):
def my_range_with_step(start, stop, step):
    """
    Return a range

    Args:
        start: inclusive start index
        stop: exclusive stop index
        step: range increment

    Returns:
        A list of integers
    """
    i = start
    r = []
    while i < stop:
        r.append(i)
        i += step
    return r

def my_range_with_unitstep(start, stop):
    return my_range_with_step(start, stop, 1)
We could condense these two functions into one if we could set a default value for step. We can do so by providing a value in the function header as shown below. We describe parameters with default values as “optional”, i.e., we no longer have to provide a value for that parameter when we call the function. If we don’t provide a value, Python uses the default value specified in the header.
def my_range(start, stop, step=1):
    """
    Return a range

    Args:
        start: inclusive start index
        stop: exclusive stop index
        step: range increment

    Returns:
        A list of integers
    """
    i = start
    r = []
    while i < stop:
        r.append(i)
        i += step
    return r
Now we can use the same function for both use cases. More generally, optional parameters are useful when there is a sensible default value (e.g. stepping by one) but the caller might want/need to change that value sometimes.
Note that you can also specify parameters by name, which is helpful if there are many optional parameters and you only want to change one or two.
All Python function parameters can actually be specified by name (if we wanted to do so). There are some limits, however: keyword arguments must follow positional arguments and you can’t specify the same argument more than once. Our general practice is to provide required arguments (those without default values) by position (without the name) and optional arguments by name.
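For example, with my_range defined above:

>>> my_range(0, 6)
[0, 1, 2, 3, 4, 5]
>>> my_range(0, 6, step=2)
[0, 2, 4]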
>>> help(print)
Help on built-in function print in module builtins:
print(...)
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)
Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file: a file-like object (stream); defaults to the current sys.stdout.
sep: string inserted between values, default a space.
end: string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.
A common place to use keyword arguments is with print, where you will likely only want to modify one of the many optional arguments, e.g. the separator (sep) or the end, but not all.
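For example, overriding just the separator and the end string:

>>> print("cat", "dog", sep=", ", end="!\n")
cat, dog!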
Modules
What is a module? A collection of related functions and variables. Why do we have modules? To organize and distribute code in a way that minimizes naming conflicts.
How many of you have written a module? A trick question… Everyone. Every .py file is a module.
Let’s consider the linked my_module as an example. my_module includes a constant and several functions. After importing my_module we can use those functions like any of those in math or the other modules we have used.
import my_module
my_module.a()
my_module.b(10, 15)
my_module.c("this is a test")
my_module.SOME_CONSTANT
Loaded my_module
Loaded my_module
Importing the module my_module
10
25
'tt'
10
What about help?
help(my_module)
Help on module my_module:
NAME
my_module - Some basic functions to illustrate how modules work
DESCRIPTION
A more detailed description of the module.
FUNCTIONS
a()
Prints out the number 10
b(x, y)
Returns x plus y
c(some_string)
Returns the first and last character of some_string
DATA
SOME_CONSTANT = 10
FILE
/Users/mlinderman/Library/CloudStorage/Dropbox/Courses/CSCI146-Fall2024/site/classes/my_module.py
That multi-line comment at the top of the file is also a docstring:

* The NAME and brief description come from the filename and that docstring
* DESCRIPTION is any subsequent lines in that docstring
* FUNCTIONS enumerates the functions and the description from their docstrings
* DATA enumerates any constants
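For reference, a version of my_module.py consistent with that help output might look like the following sketch (the actual file may differ):

"""Some basic functions to illustrate how modules work

A more detailed description of the module.
"""

SOME_CONSTANT = 10

def a():
    """Prints out the number 10"""
    print(10)

def b(x, y):
    """Returns x plus y"""
    return x + y

def c(some_string):
    """Returns the first and last character of some_string"""
    return some_string[0] + some_string[-1]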
Importing
What happens when I import a module? Python executes the Python file.
Where did the __pycache__ folder come from? When we import a module, Python compiles it to bytecode in a “.pyc” file. This lower-level representation is more efficient to execute. These files aren’t important for this class, but I want you to be aware of where they come from…
So if I add a print statement, e.g. print("Loaded my_module"), to my Python file I should expect that message to print at import.
>>> import my_module
>>>
Why didn’t it print? Python doesn’t re-import modules that are already imported. Why does that behavior make sense? What if multiple modules import that same module, e.g. math? What if two modules import each other?
As a practical matter that means if we change our module we will need to restart the Python console (with the stop sign) or use the explicit reload function:
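For example, using the reload function from the standard importlib module:

from importlib import reload
reload(my_module)  # re-executes my_module.py so changes are picked up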
When we click the green arrow in Thonny we are “running” our Python programs. We could also have been importing them. When would you want to do one or the other?
Think about our Cryptography assignment. We could imagine using our functions in a program that helps people securely communicate with each other, or other programmers might want to use our functions in their own communication systems. For the former we would want to encrypt/decrypt when our module is run; for the latter we would want to make our code “importable” without actually invoking any of the functions.
Python has a special variable __name__ that can be used to determine whether our module is being run or being imported. When a file is run, Python automatically sets that variable to "__main__". If a file is imported, Python sets that variable to the filename as a string (without the “.py” extension).
We typically use this variable in a conditional at the end of the file that changes the behavior depending on the context. For example:
if __name__ == "__main__":
    print("Running the module")
else:
    print("Importing the module")
In most cases, you will only have the “if” branch; that is, you will only do something extra if the program is run.
For example, in our past assignments, when we prompted users for input (say for a file to read data from), we would do so only if the program is being run (not imported). Gradescope imports your files so that it can test functions without necessarily simulating all of the user interactions. In a future assignment, you will specifically write Python code that can be either used as a standalone program to obtain the current weather, or as part of a more complex application.
Command line (and the Terminal)
In many programming assignments and examples we have either “hardcoded” the input, e.g., the “english.txt” filename above, or we have solicited input from the user (via the input function) to control the execution of the program (i.e. what file to read, etc.). I suspect that during the testing process typing those inputs in each time gets a little tedious… And that you get the sense that controlling our programs that way is not really compatible with automation, that is, running a program in an automated way on different inputs. There must be a different way…
What does the “green arrow” actually do? Notice what Thonny prints in the shell: %Run sys_args.py. That is the script plus any command line arguments.
What are command line arguments? Like function arguments/parameters, command line arguments are values passed to a Python program that affect its execution. We use function parameters to change the inputs to our function; command line arguments do the same for the program as a whole. Instead of using the input function to solicit input from the user to control the execution of our programs, say to pick the input data file in our data analysis lab, we can specify those “inputs” on the command line as command line arguments. Doing so facilitates controlling our programs in an automated way.
The why of the command line is a much larger question that we won’t fully experience in class. Speaking from my own experience, being able to efficiently use a command line environment (and write programs to be used in that environment) will make you much more productive and effective at data analysis and other computational tasks.
For example, I was curious about how many lines of code are included in your lecture examples. I wrote the function below to count the non-empty lines in a file, but how can I run this on every file?
def count_lines(filename):
    """
    Count non-empty lines in file

    Args:
        filename: File to examine

    Return:
        Count of non-empty lines
    """
    with open(filename, "r") as file:
        count = 0
        for line in file:
            if line.strip() != "":
                count += 1
    return count
I could manually make a list of all the files, but that is slow and error prone. Instead I would like to solve this problem programmatically. The command line can help us do so. It provides a mechanism for programmatically interacting with your computer, e.g. programmatically accessing directories, files, other programs and more. Specifically counting all the lines in all the example programs can be as simple as the following. Let’s learn how to make that work.
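Assuming the counting code is saved as line_counter.py (the name used in the examples below), that command looks something like:

$ python3 line_counter.py *.py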
The Python module sys (short for “system”) provides a variable argv that is set to a list of the command line arguments. The first element of this list is always the path of the program that is executing.
If we added command line arguments to the Thonny run command, they would be appended to the argv list.
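A minimal sys_args.py consistent with the output below might look like the following sketch (the actual file may differ):

import sys

print("Arguments:", sys.argv)
for i in range(len(sys.argv)):
    print(str(i) + ": " + sys.argv[i])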
>>> %Run sys_args.py these are some arguments
Arguments: ['sys_args.py', 'these', 'are', 'some', 'arguments']
0: sys_args.py
1: these
2: are
3: some
4: arguments
While we can specify command line arguments in Thonny that is not how this functionality is most useful. Instead, we typically use command line arguments within a separate terminal program (Terminal on OSX, or PowerShell on Windows).
Python at the command line
We can invoke Python, specifically python3, from the command line (python on Windows). We can open the terminal from within Thonny via the “Tools -> Open System Shell” menu option. Once we have launched the shell we need to navigate to the folder with our Python program (we will learn how shortly…).
$ python3 sys_args.py these are some arguments
Arguments: ['sys_args.py', 'these', 'are', 'some', 'arguments']
0: sys_args.py
1: these
2: are
3: some
4: arguments
python3 is the Python interpreter (python on Windows), the program that actually runs inside the Thonny shell. If we run python3 (python on Windows) without any arguments we launch the familiar REPL (invoke the quit() function to exit, or on OSX Ctrl+d):
$ python3
Python 3.10.6 (v3.10.6:9c7b4bd164, Aug 1 2022, 17:13:48) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
When we supply a path to a Python script as the first argument, Python runs that script (just like the “green arrow” in Thonny). Any additional arguments after the script become the command line arguments to the script (available in argv).
Thonny largely insulates us from the notion of the working directory, that is, where in the file system we are executing our program. When we invoke Python in the terminal, we will need to navigate within the terminal to the directory containing our program.
The key commands we will use to navigate the terminal are:
Command        Description
ls             List files
cd dir         Change directory to dir
cd ..          Change to parent directory (i.e., go up one level of the hierarchy)
cd             Change to home directory
pwd            Print the path of the current working directory
more <file>    Show contents of file one screen full at a time (hit q to exit)
The Windows equivalent to the terminal is cmd (type cmd into the search bar). The mapping between the commands for navigating within the terminal/shell is:

Linux/OSX                 Windows
ls                        dir
cd                        cd
cd /home/mlinderman/      cd C:\Users\mlinderman
With these commands we are navigating the same file system and directories you see with your graphical browser, but doing so in a text-based programmatic environment.
For example you will likely need to navigate to the directory that contains your Python script. A protocol to do so:
Find the directory containing your Python program (it is displayed in the title bar of the Thonny window). For the file /Users/mlinderman/cs146/sys_args.py, the directory is everything up to the last /, i.e. /Users/mlinderman/cs146.
In the terminal at the command prompt, e.g. at the $, type cd for “change directory” then enter the path. For example:
$ cd /Users/mlinderman/cs146/
cd only works on directories. If you have any spaces in your path, you will need to add quotes around the path so it is interpreted as a single string (you can use the left and right arrows to move within your command to edit it). For example:
$ cd "/Users/mlinderman/cs146/"
Returning to our line counter
In our earlier example usage we used *.py as a wildcard (or glob) that the terminal expands into all files that end in “.py”; that is, it was equivalent to explicitly listing every matching file:
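For instance (the file names here are just illustrative, drawn from other examples in these notes):

$ python3 line_counter.py word-stats.py sys_args.py line_counter.py my_module.py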
We can write our Python code to process any number of files provided on the command line. Here we use a for loop to iterate through all the files provided on the command line and thus in the sys.argv list (recall that the first element, at index 0, is always the name of the program that is executing). With that small amount of code we now have a very useful (and efficient) tool. Check out the complete implementation.
if __name__ == "__main__":
    if len(sys.argv) == 1:
        # Check that at least one file is provided on the command line
        print("Usage: python line_counter.py <1 or more files>")
    else:
        count = 0
        # Process all of the command line arguments (after the name of the
        # program that is always at index 0)
        for filename in sys.argv[1:]:
            count += count_lines(filename)
        print("Total lines:", count)
Could we have accomplished the same task purely within Python, without using the command line environment? Yes, although the resulting approach would be less flexible. For example, we could use the listdir function in the os module to return a list of all the files in the current directory and then filter that list for just those files with names ending in “.py”:
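A sketch of that approach (my reconstruction using os.listdir and str.endswith; the in-class example may differ), reusing the count_lines function from above:

import os

count = 0
for filename in os.listdir():
    # Only look at Python files in the current working directory
    if filename.endswith(".py"):
        count += count_lines(filename)
print("Total lines:", count)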
While this code may seem simpler than the approach above, I would argue the opposite. This approach has several assumptions built in: that we are only interested in files in the current directory and only in files ending in “.py”. If we want to look at files in a different directory or with different/multiple file endings we will need to modify our program. In contrast, our approach using the command line works for all those scenarios without any modification. For example,
$ python line_counter.py *.py *.qmd
counts the lines in both Python files and Markdown files (the format I use to write the lecture notes) by expanding both wildcards. In this respect, the command line environment “augments” the capabilities of our Python programs. The combination (in my opinion) is more than the sum of the parts!