Class 23: Plotting with Matplotlib

Objectives for today

Matplotlib

Matplotlib is a library for production-ready 2D-plotting (and more experimental 3D-plotting).

Before launching into this library, lets think about the features we would need/want to generate a plot:

Matplotlib supports all of these features.

Imports

Recall that we need to import the scientific computing modules before use (note that the order matters, datascience needs to be imported before matplotlib):

import numpy as np
import datascience as ds
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

Note the two step process for importing matplotlib. This is needed to work around some incompatibilities that have arisen in Thonny and Matplotlib.

Some simple examples

A simple line plot:

import matplotlib.pyplot as plt

def simple_plot():
    """ Plotting a simple line """
    x = [1, 2, 3, 4, 5]
    y = [2, 4, 6, 8, 10]
    
    plt.plot(x, y)
    plt.show() 

Note that after the plotting functions, we call the show() function. Why does the library work this way? So that we can make changes to the plot before rendering. Note that you will need to close the plotting window before you can execute further commands in the shell.

The result is a nice figure with attractive defaults, etc. and importantly the ability to save the figure as an image.

A more complex example here below. Here two different lines are plotted. Note plot can take any number of “x, y vector” pairs, or you can invoke plot multiple times.

def multiple_plot():
    """ Plotting multiple lines """
    x = [1, 2, 3, 4, 5]
    y1 = [1, 2, 3, 4, 5]
    y2 = [2, 4, 6, 8, 10]
    
    plt.plot(x, y1, x, y2)
    # could also be two different plot calls

    plt.show()

Now with additional formatting. Here we add an additional formatting string that controls the color, line and point style. Check out the documentation of this format string.

def fancy_multiple_plot():
    """ Plotting of multiple lines with different attributes """
    x = [1, 2, 3, 4, 5]
    y1 = [1, 2, 3, 4, 5]
    y2 = [2, 4, 6, 8, 10]
    
    plt.plot(x, y1, "ro")
    plt.plot(x, y2, "b+--")
    # Could also have been implemented as:
    # plt.plot(x, y1, "ro", x, y2, "b+--")
    
    plt.show()

In this case:

Revisiting list vs set

We previously ran an experiment comparing the query times for lists vs sets. We plotted the results in Excel. With Matplotlib, we can now generate that plot directly within Python. Check out lists_vs_sets_improved.py.

What would our plot look like:

How could we add this to original program? Previously we printed the results for each collection size as we ran the test. Doing so now makes it difficult to plot. Lets decouple those two operations, that is collect the data in one function and then either print or plot those results in a separate function.

In the former we return a tuple of lists. This is an approximation of a Table (or DataFrame) using built-in types. In the plotting code, we see:

Integrating datascience and Matplotlib

Not surprisingly, datascience and Matplotlib, work well together. The key additions are

perf = ds.Table().with_columns(
    "sizes", sizes,
    "list", list_times,
    "set", set_times
)
perf.plot("sizes")
plt.show()

which creates a Table and then plots the different series.

There are many more powerful integrations between these two libraries that I encourage you to investigate.

Putting it together with a random walk

A random walk is a path composed of a succession of random steps, e.g. that path of molecule as it travels in a liquid of gas. Random walks have many application in a wide variety of fields.

Let’s consider a very simple 1-D random walk that starts at 0 and takes steps of +1 or -1 with equal probability (source).

How could we implement this walk in a “vectorized” way? What are some capabilities we would need?

Some relevant NumPy functions: random.choice and cumsum.

Let’s try to write walk_vector function that returns a 1-D NumPy array of the position at each step. Peek at an implementation.

def walk_vector(steps):
    """ Return array of +1/-1 random walk of length steps """
    # Return array of size steps by randomly choosing each value from [-1, 1]
    delta = np.random.choice([-1, 1], steps)
    walk = np.cumsum(delta)
    return walk
    
my_walk = walk_vector(100)

We could further plot this walk easily with Matplotlib as a line plot with relevant labels. Note we can use range to generate the “steps” based on the length of the walk. Let’s try to write a plot_walk function that plots its walk collection parameter such that we could do the following. Peek at an implementation.

my_walk = walk_vector(100)
plot_walk(my_walk)
def plot_walk(walk):
    """ Plot random walk as a line plot """
    plt.plot(range(0, len(walk)), walk)
    plt.xlabel("Step")
    plt.ylabel("Position")
    plt.title("Random walk with +1/-1 steps")
    plt.show()

How could we translate this code into a function that just uses Python built-in functions, i.e. doesn’t use NumPy? As starting point we should think about what variables are scalars (i.e. a single value) and which are vectors. The latter would be translated into lists. Operations that produce vectors, translate into loops that build-up lists.

PI Questions

Peek at an implementation only using Python built-ins:

import random

def walk_builtin(steps):
    """ Return list of +1/-1 random walk of length steps """
    pos = 0
    walk = []
    for i in range(steps):
        pos += random.choice([-1, 1])
        walk.append(pos)
    return walk

Here in each step we generate a random step and decrement or increment the position respectively. The new position is appended to the list containing our walk.

We might want to calculate some other statistics, e.g. min and max, and also when (i.e the index) the walk reaches its maximum value.

Min and max are quite simple. Note that we use the NumPy min and max functions so that we use the optimized implementations (instead of the built-in min and max functions).

np.min(my_walk)
np.max(my_walk)

For determining the index of the maximum, we can use the argmax function, e.g. np.argmax(my_walk). As we discussed earlier, these are large libraries with 100s of functions. Often there is a function “for that”. But no one can or should just know all of these functions. Instead we want to cultivate an idea of what kind of functions should exist and the background/experience to effectively find the relevant function in the documentation.

How would we implement argmax with “built-in” Python?

def argmax(a_list):
    index = 0
    for i in range(len(a_list)):
        if a_list[i] > a_list[index]:
            index = i
    return index

Recall that we rarely do a simulation once! Last time for instance we performed 5000 simulations to determine the empirical null distribution for birthweight vs. mother’s smoking status. What if we wanted to compute the maximum values across 5000 random walks? Here is where the vectorized approach can be particularly helpful (for both conciseness and performance). We can think of 5000 random walks of length 100 as working with a 2-D array with 5000 rows and 100 columns. All the NumPy functions we used can also be applied to 2-D (or more than 2-D) arrays!

delta = np.random.choice([-1, 1], (5000, 100))
walks = np.cumsum(delta, axis=1)

Notice we only need to make two changes, one to specify the size as a tuple of length two (rows, columns) and the specify that the cumulative sum should be performed along axis 1, that is along the rows. Here are the resulting walks (note the 2-D array), with an overall maximum of 38.

>>> walks
array([[ -1,   0,  -1, ...,  -2,  -1,   0],
       [  1,   0,   1, ...,  14,  15,  16],
       [  1,   2,   1, ...,  12,  11,  10],
       ...,
       [ -1,   0,  -1, ...,  -2,  -1,   0],
       [  1,   0,  -1, ..., -16, -15, -16],
       [  1,   2,   3, ...,   2,   3,   2]])
>>> np.max(walks)
38