Matplotlib is a library for production-ready 2D-plotting (and more experimental 3D-plotting).
Before launching into this library, lets think about the features we would need/want to generate a plot:
Matplotlib supports all of these features.
Recall that we need to import the scientific computing modules before use (note that the order
datascience needs to be imported before
import numpy as np import datascience as ds import matplotlib matplotlib.use('TkAgg') import matplotlib.pyplot as plt
Note the two step process for importing
matplotlib. This is needed to work
around some incompatibilities that have arisen in Thonny and Matplotlib.
A simple line plot:
import matplotlib.pyplot as plt def simple_plot(): """ Plotting a simple line """ x = [1, 2, 3, 4, 5] y = [2, 4, 6, 8, 10] plt.plot(x, y) plt.show()
Note that after the plotting functions, we call the
show() function. Why does
the library work this way? So that we can make changes to the plot before
rendering. Note that you will need to close the plotting window before you can execute further commands in the shell.
The result is a nice figure with attractive defaults, etc. and importantly the ability to save the figure as an image.
A more complex example here below. Here two different lines are plotted. Note
plot can take any number of “x, y vector” pairs, or you can invoke
def multiple_plot(): """ Plotting multiple lines """ x = [1, 2, 3, 4, 5] y1 = [1, 2, 3, 4, 5] y2 = [2, 4, 6, 8, 10] plt.plot(x, y1, x, y2) # could also be two different plot calls plt.show()
Now with additional formatting. Here we add an additional formatting string that controls the color, line and point style. Check out the documentation of this format string.
def fancy_multiple_plot(): """ Plotting of multiple lines with different attributes """ x = [1, 2, 3, 4, 5] y1 = [1, 2, 3, 4, 5] y2 = [2, 4, 6, 8, 10] plt.plot(x, y1, "ro") plt.plot(x, y2, "b+--") # Could also have been implemented as: # plt.plot(x, y1, "ro", x, y2, "b+--") plt.show()
In this case:
We previously ran an experiment comparing the query times for lists vs sets. We plotted the results in Excel. With Matplotlib, we can now generate that plot directly within Python. Check out lists_vs_sets_improved.py.
What would our plot look like:
How could we add this to original program? Previously we printed the results for each collection size as we ran the test. Doing so now makes it difficult to plot. Lets decouple those two operations, that is collect the data in one function and then either print or plot those results in a separate function.
In the former we return a tuple of lists. This is an approximation of a Table (or DataFrame) using built-in types. In the plotting code, we see:
plot: Plot the two lines in different colors
ylabel: Add axis labels
title: Add plot title
legend: Add a legend
Not surprisingly, datascience and Matplotlib, work well together. The key additions are
perf = ds.Table().with_columns( "sizes", sizes, "list", list_times, "set", set_times ) perf.plot("sizes") plt.show()
which creates a Table and then plots the different series.
There are many more powerful integrations between these two libraries that I encourage you to investigate.
A random walk is a path composed of a succession of random steps, e.g. that path of molecule as it travels in a liquid of gas. Random walks have many application in a wide variety of fields.
Let’s consider a very simple 1-D random walk that starts at 0 and takes steps of +1 or -1 with equal probability (source).
How could we implement this walk in a “vectorized” way? What are some capabilities we would need?
Let’s try to write
walk_vector function that returns a 1-D NumPy array of the
position at each step. Peek at an
def walk_vector(steps): """ Return array of +1/-1 random walk of length steps """ # Return array of size steps by randomly choosing each value from [-1, 1] delta = np.random.choice([-1, 1], steps) walk = np.cumsum(delta) return walk my_walk = walk_vector(100)
We could further plot this walk easily with Matplotlib as a line plot with
relevant labels. Note we can use
range to generate the “steps” based on the
length of the walk. Let’s try to write a
plot_walk function that plots its
walk collection parameter such that we could do the following. Peek at an implementation.
my_walk = walk_vector(100) plot_walk(my_walk)
def plot_walk(walk): """ Plot random walk as a line plot """ plt.plot(range(0, len(walk)), walk) plt.xlabel("Step") plt.ylabel("Position") plt.title("Random walk with +1/-1 steps") plt.show()
How could we translate this code into a function that just uses Python built-in functions, i.e. doesn’t use NumPy? As starting point we should think about what variables are scalars (i.e. a single value) and which are vectors. The latter would be translated into lists. Operations that produce vectors, translate into loops that build-up lists.
import random def walk_builtin(steps): """ Return list of +1/-1 random walk of length steps """ pos = 0 walk =  for i in range(steps): pos += random.choice([-1, 1]) walk.append(pos) return walk
Here in each step we generate a random step and decrement or increment the position respectively. The new position is appended to the list containing our walk.
We might want to calculate some other statistics, e.g. min and max, and also when (i.e the index) the walk reaches its maximum value.
Min and max are quite simple. Note that we use the NumPy
functions so that we use the optimized implementations (instead of the built-in
For determining the index of the maximum, we can use the
np.argmax(my_walk). As we discussed earlier, these are large libraries
with 100s of functions. Often there is a function “for that”. But no one can or
should just know all of these functions. Instead we want to cultivate an idea
of what kind of functions should exist and the background/experience to effectively find the
relevant function in the documentation.
How would we implement
argmax with “built-in” Python?
def argmax(a_list): index = 0 for i in range(len(a_list)): if a_list[i] > a_list[index]: index = i return index
Recall that we rarely do a simulation once! Last time for instance we performed 5000 simulations to determine the empirical null distribution for birthweight vs. mother’s smoking status. What if we wanted to compute the maximum values across 5000 random walks? Here is where the vectorized approach can be particularly helpful (for both conciseness and performance). We can think of 5000 random walks of length 100 as working with a 2-D array with 5000 rows and 100 columns. All the NumPy functions we used can also be applied to 2-D (or more than 2-D) arrays!
delta = np.random.choice([-1, 1], (5000, 100)) walks = np.cumsum(delta, axis=1)
Notice we only need to make two changes, one to specify the
size as a tuple
of length two (rows, columns) and the specify that the cumulative sum should be
performed along axis 1, that is along the rows. Here are the resulting
(note the 2-D array), with an overall maximum of 38.
>>> walks array([[ -1, 0, -1, ..., -2, -1, 0], [ 1, 0, 1, ..., 14, 15, 16], [ 1, 2, 1, ..., 12, 11, 10], ..., [ -1, 0, -1, ..., -2, -1, 0], [ 1, 0, -1, ..., -16, -15, -16], [ 1, 2, 3, ..., 2, 3, 2]]) >>> np.max(walks) 38