CSCI 150 Spring 2020

Lecture 15: Command line, URLs, Debugging

Objectives

Utilize command line arguments to control program execution
Launch Python programs from the command line
Fetch data from a URL; scrape HTML source; use APIs
Open a file for writing and use the write method to write to that file
Utilize the debugger to examine program state and execution

Command lines (and the Terminal)

Command line arguments are values that are passed to a program at runtime.

Rather than using the input function to get input from the user while a program is running, we can allow the user to specify necessary information as they invoke the program.

A Command Line Example

We will use sys_args.py as our working example:

Within the Python module sys (short for “system”) there is a variable argv that holds a list of the command line arguments. The first element of this list is always the name of the program that is executing (including the path where the file resides if not in the same folder).

>>> %Run sys_args.py
Arguments: ['sys_args.py']
0: sys_args.py

If we added command line arguments to the Thonny run command, they would be appended to the argv list.

>>> %Run sys_args.py these are some arguments
Arguments: ['sys_args.py', 'these', 'are', 'some', 'arguments']
0: sys_args.py
1: these
2: are
3: some
4: arguments
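
The contents of sys_args.py are not reproduced in these notes. A minimal sketch that would produce the output above might look like the following. The helper name describe_arguments is a hypothetical choice for illustration and is not necessarily what the course file uses.

```python
import sys


def describe_arguments(argv):
    """Return the lines to print for a list of command line arguments."""
    # First line shows the whole list, as Python would display it
    lines = ["Arguments: " + str(argv)]
    # Then one line per argument, labeled with its index in the list
    for index, value in enumerate(argv):
        lines.append(str(index) + ": " + value)
    return lines


if __name__ == "__main__":
    for line in describe_arguments(sys.argv):
        print(line)
```

Separating the formatting from the printing is a design choice made here so that the function can be tried in the shell with any list, not just sys.argv.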

While we can specify command line arguments in Thonny, that is not how this functionality is most useful. Instead, we typically use command line arguments at the command line in a terminal window.

Python at the command line in a terminal window

We can invoke Python, specifically python3 (python on Windows), from the command line in a terminal window. We can open a terminal window from within Thonny via the “Tools -> Open System Shell” menu option. Once we have launched the shell, we need to navigate to the folder with our Python program (see below for a brief list of terminal commands).

$ python3 sys_args.py these are some arguments
Arguments: ['sys_args.py', 'these', 'are', 'some', 'arguments']
0: sys_args.py
1: these
2: are
3: some
4: arguments

python3 is the Python interpreter (python on Windows), the program that actually runs inside the Thonny shell. If we run python3 (python on Windows) without any arguments, we launch the familiar REPL (read-eval-print-loop). Invoke the quit() function to exit (or press Ctrl+d on OSX):

$ python3
Python 3.7.3 (default, Mar 27 2019, 09:23:15) 
[Clang 10.0.1 (clang-1001.0.46.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

When we supply a path to a Python script as the first argument, Python runs that script (just like the “green arrow” in Thonny). Any additional arguments after the script become the command line arguments to the script (available in argv).

Thonny largely insulates us from the notion of working directory, that is, where in the file system we are executing our program. When we invoke Python in the terminal, we will need to navigate within the terminal to the directory containing our program.
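
To make the idea of a working directory concrete, the sketch below asks Python where it is currently running and shows how a relative file name is resolved against that location. The function name resolve is hypothetical. os.getcwd and os.path.abspath are standard library calls.

```python
import os


def resolve(filename):
    """Return the full path Python would use for a relative file name."""
    # abspath combines the current working directory with the file name
    return os.path.abspath(filename)


if __name__ == "__main__":
    print("Working directory:", os.getcwd())
    print("sys_args.py resolves to:", resolve("sys_args.py"))
```

Running this from two different directories in the terminal should print two different locations, which is why we need to navigate to the right folder before invoking a script.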

The key commands to navigate the terminal are:

Command	Description
`ls`	List files
`cd <dir>`	Change directory to <dir>
`cd ..`	Change to parent directory
`cd`	Change to home directory
`pwd`	Print current working directory
`more <file>`	Show contents of file one screen full at a time (hit `q` to exit)

The Windows equivalent to the terminal is cmd (type cmd into the search bar). The mapping between commands for navigating within the terminal/shell is:

Linux/OSX	Windows
`ls`	`dir`
`cd`	`cd`
`cd /home/briggs/`	`cd C:\Users\briggs`

With these commands we are navigating the same file system and directories you see with your graphical browser, but doing so in a text-based programmatic environment.

For example you will likely need to navigate to the directory that contains your Python script. A protocol to do so:

Find the directory containing your Python program (it is displayed in the title bar of the Thonny window). For the file /Users/briggs/cs150/sys_args.py, the directory is everything up to the last /, i.e. /Users/briggs/cs150.
In the terminal at the command prompt, e.g. at the $, type cd for “change directory” then enter the path. For example:
```
 $ cd /Users/briggs/cs150/
```
cd only works on directories. If you have any spaces in your path, you will need to add quotes around the path so it is interpreted as a single string (you can use the left and right arrows to move within your command to edit it). For example:
```
 $ cd "/Users/briggs/cs150/"
```

Reading from URLs

Imagine we are trying to study the structure of Middlebury syllabi, and particularly the use of e-mail. To further this study we want to scrape Middlebury course webpages for the e-mail addresses listed by the professor and write them to a file. We will start with our course webpage. Note that web scraping can raise legal issues, so you should always check a site’s terms and conditions and robots.txt file, if available, before doing any scraping.

How could we implement our scraper? Let’s check out the page source and see if we can come up with some ideas… What are some additional tools we might need?

Let’s check out a potential implementation: web_scraper0.py

What is new in this program?

urllib.request.urlopen: As the name suggests, this function opens a URL for reading in much the same way we read from a file. In fact, we can iterate line by line with a for loop in exactly the same way. One key difference is that each line of the response is raw bytes, not a string. To obtain a string we need to use the decode method. Here we decode assuming the encoding is “utf-8” and we want to “ignore” errors. Those are reasonable settings for a webpage. This is also our first nested module, that is, request is a module inside the urllib package, so we import it as urllib.request.
We are writing a file. Specifically, we opened the file with “w” as the second argument to open and then used the write method to write a string to the file. Note that unlike print, write doesn’t automatically append a newline, so we need to add one ourselves.
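
The two ideas above can be sketched together as follows. This is not the course's web_scraper0.py, only an illustration under stated assumptions: the function names get_data and write_list_to_file match those used later in these notes, but their bodies here are guesses, and the regular expression EMAIL_PATTERN and the helper find_emails are hypothetical simplifications of what an e-mail address looks like.

```python
import re
import urllib.request

# Hypothetical pattern: a rough approximation of an e-mail address
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def find_emails(line):
    """Return a list of e-mail-like strings found in one line of text."""
    return EMAIL_PATTERN.findall(line)


def get_data(url):
    """Read a webpage line by line and collect e-mail-like strings."""
    emails = []
    with urllib.request.urlopen(url) as webpage:
        for raw_line in webpage:
            # Each line arrives as bytes, so decode it into a string first
            line = raw_line.decode("utf-8", "ignore")
            emails.extend(find_emails(line))
    return emails


def write_list_to_file(data, filename):
    """Write each item of a list of strings on its own line of a file."""
    with open(filename, "w") as file:
        for item in data:
            # write does not add a newline, so append one explicitly
            file.write(item + "\n")
```

Calling get_data on a page URL and passing the result to write_list_to_file would then produce a file with one address per line.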

How could we adapt this code to take command line arguments instead of the fixed URL and output file? We would add the following code to the bottom of our program.

Inside the if __name__ == "__main__" conditional we check for the expected number of command-line arguments. Recall that the first element of sys.argv is always the name of the file, so if we expect two command-line arguments the length of sys.argv should be 3 (the number of arguments + 1). If we don’t have the correct number of arguments we print a help message for the user showing the expected command line. If we do have the correct number of arguments we proceed to invoke the program code using the command-line arguments we extracted from the sys.argv list. See web_scraper.py for the addition of this code.

def print_usage():
    """Print usage of the program"""
    print("python3 web_scraper.py <URL> <OUT_FILE>")

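if __name__ == "__main__":
    # Assumes "import sys" appears at the top of the file
    if len(sys.argv) != 3:
        # Wrong number of arguments: show the expected command line
        print_usage()
    else:
        url = sys.argv[1]
        outfile = sys.argv[2]
        data = get_data(url)
        write_list_to_file(data, outfile)
        print("Wrote:", outfile)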

An example use of command-line arguments

The why of the command line is a much larger question that we won’t fully experience in this course. Being able to efficiently use a command line environment (and write programs to be used in that environment) will make you much more productive and effective at data analysis and other computational tasks.

For example, suppose we are curious about how many lines of code are included in our lecture examples. The function below counts the non-empty lines in a file.

def count_lines(filename):
    """
    Count non-empty lines in file

    Args:
        filename: File to examine
    
    Return: Count of non-empty lines
    """
    with open(filename, "r") as file:
        count = 0
        for line in file:
            if line.strip() != "":
                count += 1
        return count

How could we run this on every .py file in our cs150 folder?

We could manually make a list of all the files, but that is slow and error prone. Instead we would like to solve this problem programmatically. The command line can help us do so. It provides a mechanism for programmatically interacting with your computer, e.g., programmatically accessing directories, files, other programs and more. Counting all the lines in all the example programs can be as simple as the following. Let’s learn how to make this work:

$ python3 line_counter.py *.py
Total lines: 1074

Here we use *.py as a wildcard that expands into all files that end in “.py”, i.e., this is equivalent to

$ python3 line_counter.py my_module.py sys_args.py ...
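
The expansion of *.py described above is performed by the shell in a Linux or OSX terminal. The Windows cmd shell typically passes *.py through to the program unchanged, so a program meant to work in both settings could expand the pattern itself. The sketch below uses the standard glob module. The function name expand_patterns is hypothetical and is not part of line_counter.py.

```python
import glob


def expand_patterns(arguments):
    """Expand any wildcard patterns in a list of command line arguments."""
    filenames = []
    for argument in arguments:
        # glob.glob returns the list of existing paths matching the pattern
        matches = sorted(glob.glob(argument))
        if matches:
            filenames.extend(matches)
        else:
            # Nothing matched: keep the argument so a later open() reports it
            filenames.append(argument)
    return filenames
```

With this helper, a loop over expand_patterns(sys.argv[1:]) would see the same file names whether or not the shell had already expanded the wildcard.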

To make this work, we can write Python code to make our line counter process any number of files provided on the command line. Here we use a for loop to iterate through all the files provided on the command line and thus in the sys.argv list. With that small amount of code we now have a very useful (and efficient) tool. Check out the complete implementation in line_counter.py.

if __name__ == "__main__":
    # Check that at least one file is provided on the command line
    if len(sys.argv) == 1:
        print("Usage: python3 line_counter.py <1 or more files>")
    else:
        count = 0
        for filename in sys.argv[1:]:
            count += count_lines(filename)
        print("Total lines:", count)

Debugging in Thonny

We can see there are times when it is easier to work at the command line and other times, most times, when it is easier to work within Thonny. In most situations it will be more efficient to develop within Thonny since it nicely integrates all the tools we need.

One tool we have not yet used is the debugger. The debugger allows us to step through our program one statement at a time and inspect the current state of variables. This is an alternative, and sometimes more powerful, approach to debugging compared to inserting print statements as we have been doing to date.

Our workflow is to use Thonny to step through our program one line at a time while inspecting the current value of any variables. We “Step over” lines to get to the area of the program we suspect has a problem and then “Step into” to investigate further. If we want to run until a specific spot in the program, we can click to the right of a line number to place a red dot. Then hitting the debug icon will run to that point. Another option is to first hit the debug icon, then click on the code line where we want to stop, and then use the pull-down “Run -> Run to cursor”. We will try it out on the following example.

More generally, the combination of introspection and control is very powerful for debugging our programs. There is quite a bit the debugger can do, and we hope you will start experimenting with it when you are trying to figure out why your code doesn’t work.

Let’s investigate a simple function with a bug: debug_example.py

We can place a red dot to the right of the line number on the line decibels = 10 * math.log10(ratio) and then click the debug icon. Using the variable display on the lower left, we see that ratio is 0 (because the code used floor division), and 0 is an invalid argument to math.log10.
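
The file debug_example.py is not reproduced in these notes. A hypothetical sketch of a function with this kind of bug, alongside one possible fix, might look like the following. Only the line decibels = 10 * math.log10(ratio) comes from the description above. The function names and parameters are assumptions made for illustration.

```python
import math


def decibels_buggy(signal_power, noise_power):
    """Intended to convert a power ratio to decibels, but contains a bug."""
    # Bug: floor division discards the fraction, so 1 // 10 is 0
    ratio = signal_power // noise_power
    decibels = 10 * math.log10(ratio)
    return decibels


def decibels_fixed(signal_power, noise_power):
    """Convert a power ratio to decibels using true division."""
    # True division keeps the fraction, so 1 / 10 is 0.1
    ratio = signal_power / noise_power
    decibels = 10 * math.log10(ratio)
    return decibels
```

Calling decibels_buggy(1, 10) raises a ValueError at the math.log10 line. Stopping the debugger on that line shows ratio is 0 before the error occurs.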