- Use the numpy, datascience and matplotlib libraries
- Make plots with matplotlib
- Translate iterative code to a vectorized implementation and vice-versa

Matplotlib is a library for production-ready 2D-plotting (and more experimental 3D-plotting).

Before launching into this library, lets think about the features we would need/want to generate a plot:

- Plot data (as points, line, bars, etc.)
- Label axes
- Set plot title
- Add a legend
- Annotate the graph
- …

Matplotlib supports all of these features.

Recall that we need to import the scientific computing modules before use (note that the order
matters, `datascience`

needs to be imported before `matplotlib`

):

```
import numpy as np
import datascience as ds
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
```

Note the two step process for importing `matplotlib`

. This is needed to work
around some incompatibilities that have arisen in Thonny and Matplotlib.

A simple line plot:

```
import matplotlib.pyplot as plt
def simple_plot():
""" Plotting a simple line """
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
plt.plot(x, y)
plt.show()
```

Note that after the plotting functions, we call the `show()`

function. Why does
the library work this way? So that we can make changes to the plot before
rendering. Note that you will need to close the plotting window before you can execute further commands in the shell.

The result is a nice figure with attractive defaults, etc. and importantly the ability to save the figure as an image.

A more complex example here below. Here two different lines are plotted. Note
`plot`

can take any number of “x, y vector” pairs, or you can invoke `plot`

multiple times.

```
def multiple_plot():
""" Plotting multiple lines """
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [2, 4, 6, 8, 10]
plt.plot(x, y1, x, y2)
# could also be two different plot calls
plt.show()
```

Now with additional formatting. Here we add an additional formatting string that controls the color, line and point style. Check out the documentation of this format string.

```
def fancy_multiple_plot():
""" Plotting of multiple lines with different attributes """
x = [1, 2, 3, 4, 5]
y1 = [1, 2, 3, 4, 5]
y2 = [2, 4, 6, 8, 10]
plt.plot(x, y1, "ro")
plt.plot(x, y2, "b+--")
# Could also have been implemented as:
# plt.plot(x, y1, "ro", x, y2, "b+--")
plt.show()
```

In this case:

- “r” : red
- “o” : circle marker
- “b” : blue
- ”+” : plus marker
- ”–” : dashed line style

We previously ran an experiment comparing the query times for lists vs sets. We plotted the results in Excel. With Matplotlib, we can now generate that plot directly within Python. Check out lists_vs_sets_improved.py.

What would our plot look like:

- x-axis is the size of the collection
- y-axis is the execution time
- Plot two lines (one each for list times and set times)

How could we add this to original program? Previously we printed the results for each collection size as we ran the test. Doing so now makes it difficult to plot. Lets decouple those two operations, that is collect the data in one function and then either print or plot those results in a separate function.

In the former we return a tuple of lists. This is an approximation of a Table (or DataFrame) using built-in types. In the plotting code, we see:

`plot`

: Plot the two lines in different colors`xlabel`

and`ylabel`

: Add axis labels`title`

: Add plot title`legend`

: Add a legend

Not surprisingly, datascience and Matplotlib, work well together. The key additions are

```
perf = ds.Table().with_columns(
"sizes", sizes,
"list", list_times,
"set", set_times
)
perf.plot("sizes")
plt.show()
```

which creates a Table and then plots the different series.

There are many more powerful integrations between these two libraries that I encourage you to investigate.

A random walk is a path composed of a succession of random steps, e.g. that path of molecule as it travels in a liquid of gas. Random walks have many application in a wide variety of fields.

Let’s consider a very simple 1-D random walk that starts at 0 and takes steps of +1 or -1 with equal probability (source).

How could we implement this walk in a “vectorized” way? What are some capabilities we would need?

- Randomly select from 1 and -1 (e.g., similar to
`random.choice`

) - Vectorized accumulation

Some relevant NumPy functions:
`random.choice`

and
`cumsum`

.

Let’s try to write `walk_vector`

function that returns a 1-D NumPy array of the
position at each step. Peek at an
implementation.

```
def walk_vector(steps):
""" Return array of +1/-1 random walk of length steps """
# Return array of size steps by randomly choosing each value from [-1, 1]
delta = np.random.choice([-1, 1], steps)
walk = np.cumsum(delta)
return walk
my_walk = walk_vector(100)
```

We could further plot this walk easily with Matplotlib as a line plot with
relevant labels. Note we can use `range`

to generate the “steps” based on the
length of the walk. Let’s try to write a `plot_walk`

function that plots its
`walk`

collection parameter such that we could do the following. Peek at an implementation.

```
my_walk = walk_vector(100)
plot_walk(my_walk)
```

```
def plot_walk(walk):
""" Plot random walk as a line plot """
plt.plot(range(0, len(walk)), walk)
plt.xlabel("Step")
plt.ylabel("Position")
plt.title("Random walk with +1/-1 steps")
plt.show()
```

How could we translate this code into a function that just uses Python built-in functions, i.e. doesn’t use NumPy? As starting point we should think about what variables are scalars (i.e. a single value) and which are vectors. The latter would be translated into lists. Operations that produce vectors, translate into loops that build-up lists.

*PI Questions*

Peek at an implementation only using Python built-ins:

```
import random
def walk_builtin(steps):
""" Return list of +1/-1 random walk of length steps """
pos = 0
walk = []
for i in range(steps):
pos += random.choice([-1, 1])
walk.append(pos)
return walk
```

Here in each step we generate a random step and decrement or increment the position respectively. The new position is appended to the list containing our walk.

We might want to calculate some other statistics, e.g. min and max, and also when (i.e the index) the walk reaches its maximum value.

Min and max are quite simple. Note that we use the NumPy `min`

and `max`

functions so that we use the optimized implementations (instead of the built-in
`min`

and `max`

functions).

```
np.min(my_walk)
np.max(my_walk)
```

For determining the index of the maximum, we can use the `argmax`

function,
e.g. `np.argmax(my_walk)`

. As we discussed earlier, these are large libraries
with 100s of functions. Often there is a function “for that”. But no one can or
should just know all of these functions. Instead we want to cultivate an idea
of what kind of functions should exist and the background/experience to effectively find the
relevant function in the documentation.

How would we implement `argmax`

with “built-in” Python?

```
def argmax(a_list):
index = 0
for i in range(len(a_list)):
if a_list[i] > a_list[index]:
index = i
return index
```

Recall that we rarely do a simulation once! Last time for instance we performed 5000 simulations to determine the empirical null distribution for birthweight vs. mother’s smoking status. What if we wanted to compute the maximum values across 5000 random walks? Here is where the vectorized approach can be particularly helpful (for both conciseness and performance). We can think of 5000 random walks of length 100 as working with a 2-D array with 5000 rows and 100 columns. All the NumPy functions we used can also be applied to 2-D (or more than 2-D) arrays!

```
delta = np.random.choice([-1, 1], (5000, 100))
walks = np.cumsum(delta, axis=1)
```

Notice we only need to make two changes, one to specify the `size`

as a tuple
of length two (rows, columns) and the specify that the cumulative sum should be
performed along axis 1, that is along the rows. Here are the resulting `walks`

(note the 2-D array), with an overall maximum of 38.

```
>>> walks
array([[ -1, 0, -1, ..., -2, -1, 0],
[ 1, 0, 1, ..., 14, 15, 16],
[ 1, 2, 1, ..., 12, 11, 10],
...,
[ -1, 0, -1, ..., -2, -1, 0],
[ 1, 0, -1, ..., -16, -15, -16],
[ 1, 2, 3, ..., 2, 3, 2]])
>>> np.max(walks)
38
```