We have spent the semester developing our understanding of Python (and CS) fundamentals. However it may seem like there is still long way to go between what we have learned so far and the kind of data analysis tasks that may have motivated you take this class. For example, can we use what we have learned so far to replace Excel (and its paltry limit on rows)? Actually yes!
Doing so, or at at least introducing how to do so, is our next topic. Specifically we are going to introduce several tools that make up
the ScipPy stack and the
Unfortunately in the time available we can only scratch the surface of these tools. However, I also want to assure you that you have developed the necessary skills to start using these modules on your own!
There is a rich scientific computing ecosystem in Python. In this course we will be using three core packages in that ecosystem:
And the datascience package, which provides an easier-to-use-interface around Pandas’ data tables.
There are many other packages that make up “scientific python” that are beyond the scope of this course. I encourage you to check them out if you are interested.
Let’s get our imports of these modules out the way (note that the order
datascience needs to be imported before
import numpy as np import datascience as ds import matplotlib matplotlib.use('TkAgg') import matplotlib.pyplot as plt
What if I get a “ModuleNotFoundError”? You will need to install these modules.
You can do so with the Thonny “Tools -> Manage Packages” command. Enter
matplotlib into the search (one at a
time), click “Find package from PyPI” then “Install” (note that you may not
need to install all the packages manually, as some are automatically installed
as dependencies). Alternately open the shell as we did last week, i.e.
“Tools -> Open System Shell”, then execute the following commands on OSX:
python3 -m pip install numpy pandas matplotlib datascience
python -m pip install numpy pandas matplotlib datascience
to perform the same process from the command line.
Presumably you have used Microsoft Excel (or other similar spreadsheet). If so,
you have effectively used a datascience
Table (which is itself a wrapper
around a Pandas DataFrame).
What are the key properties of a spreadsheet table:
There are lot of different ways to think, conceptually, about
including (but not limited to) as a spreadsheet, or a database table, or as a
list/dictionary of 1-D arrays (the columns) with the column label as the key.
The last is how we will often create
Let’s create an example
Table starting from lists (adapted from this
>>> df = ds.Table().with_columns( 'Artist', ['Billie Holiday','Jimi Hendrix', 'Miles Davis', 'SIA'], 'Genre', ['Jazz', 'Rock', 'Jazz', 'Pop'], 'Listeners', [1300000, 2700000, 1500000, 2000000], 'Plays', [27000000, 70000000, 48000000, 74000000] ) >>> df Artist | Genre | Listeners | Plays Billie Holiday | Jazz | 1300000 | 27000000 Jimi Hendrix | Rock | 2700000 | 70000000 Miles Davis | Jazz | 1500000 | 48000000 SIA | Pop | 2000000 | 74000000
Access columns with index operator, e.g.
frame["colname"], and rows via the
.take attribute, e.g.
>>> df['Artist'] array(['Billie Holiday', 'Jimi Hendrix', 'Miles Davis', 'SIA'], dtype='<U14') >>> df.select(["Artist","Genre"]) Artist | Genre Billie Holiday | Jazz Jimi Hendrix | Rock Miles Davis | Jazz SIA | Pop >>> df.take[1:3] Artist | Genre | Listeners | Plays Jimi Hendrix | Rock | 2700000 | 70000000 Miles Davis | Jazz | 1500000 | 48000000
We can use the
with_column method to create a new column as shown below. Note this returns a new Table with new column, so we need to assign the return value to a variable to use the new Table in the future.
>>> df = df.with_column("Albums", [12, 8, 122, 8]) >>> df Artist | Genre | Listeners | Plays | Albums Billie Holiday | Jazz | 1300000 | 27000000 | 12 Jimi Hendrix | Rock | 2700000 | 70000000 | 8 Miles Davis | Jazz | 1500000 | 48000000 | 122 SIA | Pop | 2000000 | 74000000 | 8
We could use the above indexing tools (and the
attributes) to iterate through rows and columns. However, whenever possible we
will try to avoid explicitly iterating, that it is we will try to perform
operations on entire columns or blocks of rows at once. For example let’s
compute the “plays per listener”, creating a new column by assigning to a
>>> df["Average_Plays"] = df["Plays"] / df["Listeners"] >>> df Artist | Genre | Listeners | Plays | Albums | Average_Plays Billie Holiday | Jazz | 1300000 | 27000000 | 12 | 20.7692 Jimi Hendrix | Rock | 2700000 | 70000000 | 8 | 25.9259 Miles Davis | Jazz | 1500000 | 48000000 | 122 | 32 SIA | Pop | 2000000 | 74000000 | 8 | 37
Here we use the
/ operator to perform element-wise division, that is we are
new_column =  for i in range(df.num_rows): new_column.append(df["Plays"][i] / df["Listeners"][i]) df = df.with_column("Average_Plays", new_column)
This is a vector-style of computation and can be much faster than directly iterating (as we have done before) because we use highly-optimized implementations for performing arithmetic and other operations on the columns. In this context, we use “vector” in linear algebra sense of the term, i.e. a 1-D matrix of values. More generally we are aiming for largely “loop-free” implementations, that is all loops are implicit in the vector operations.
This is not new terrain for us. How many of you implemented
def mean(data): return sum(data) / len(data)
def mean(data): result = 0 for val in data: result += val return result / len(data)
In a more complex example, let’s consider the standard deviation computation from our statistics lab. Here is a typical implementation with an explicit loop. How can we vectorize? To do we will use the NumPy module which provides helpful functions for operating on vectors (and is used internally by datascience):
import math def stddev(data): mean = sum(data) / len(data) result = 0.0 for d in data: result += (d - mean) ** 2 return math.sqrt(result / (len(data) - 1))
To get rid of the explicit
>>> data = np.array([1.0, 2.0, 3.0, 4.0]) >>> math.sqrt(np.sum(np.power(data - np.mean(data), 2))/(len(data) - 1))
For our example input, this code performs the following computations:
Element-wise subtraction to compute the difference from the mean. Note that the scalar argument, the mean, is “broadcast” to be the same size as the vector.
Element-wise “squaring” via the power function (which raises its first argument to its second argument)
Sum the vector
Perform the division and square root
A key takeaway is that the NumPy and datascience functions can be applied to single value, vectors and matrices similarly. This is not just a feature of the SciPy stack, many programming languages, such Matlab and R, are designed around this vectorized approach.
We can apply the same “vector” approaches to filtering our data. Let’s subset our rows based on specific criteria.
>>> df.where(df["Genre"] == "Jazz") Artist | Genre | Listeners | Plays | Albums | Average_Plays Billie Holiday | Jazz | 1300000 | 27000000 | 12 | 20.7692 Miles Davis | Jazz | 1500000 | 48000000 | 122 | 32 >>> df.where(df["Listeners"] > 1800000) Artist | Genre | Listeners | Plays | Albums | Average_Plays Jimi Hendrix | Rock | 2700000 | 70000000 | 8 | 25.9259 SIA | Pop | 2000000 | 74000000 | 8 | 37
Conceptually when we filter we are computing a vector of booleans and using those booleans to select rows from the Table.
>>> df["Listeners"] > 1800000 array([False, True, False, True])
One of the most powerful (implicitly looped) iteration approaches on Tables is
This method “groups” identical values in the specified columns, and then
performs computations on the rows belonging to each of those groups as a block (but each group separately from all other groups). For
example let’s first count the observances of different genres and then sum all
the data by genre:
>>> df.group("Genre") Genre | count Jazz | 2 Pop | 1 Rock | 1 >>> df.group("Genre", sum) Genre | Artist sum | Listeners sum | Plays sum | Albums sum | Average_Plays sum Jazz | | 2800000 | 75000000 | 134 | 52.7692 Pop | | 2000000 | 74000000 | 8 | 37 Rock | | 2700000 | 70000000 | 8 | 25.9259
Notice that in that first example we implemented a histogram in a single function!
As you might imagine, using these libraries we can very succinctly re-implement our data analysis lab.
Here it is in its entirety using NumPy alone
data = np.loadtxt("lab5-demo-values.txt") print("File contained", len(data), "entries") if len(data) > 0: print("Max:", np.max(data)) print("Min:", np.min(data)) print("Average:", np.mean(data)) print("Median:", np.median(data)) print("Std. dev:", np.std(data, ddof=1))
or using the datascience module (note that we define our own
std function so that we can set the
ddof optional argument explicitly).
def std(data): """Compute np.std with ddof=1""" return np.std(data, ddof=1) # Data file does not have column names (header=None) so provide explicitly data = ds.Table.read_table("lab5-demo-values.txt", header=None, names=["Data"]) data.stats(ops=(min,max,np.median,np.mean,std))
If those approaches are so much shorter, why didn’t we start with the above? In the absence of an understanding of what is happening “behind the scenes” of those function calls, the code becomes “magic”. And we can’t modify or extend “magic” we don’t understand. These libraries in effect become “walled gardens” no different than a graphical application that provides a (limited) set of possible analyses. Our goal is to have the tools to solve any computational problem, not just those problems someone else might have anticipated we might want to solve.
As you apply the skills you have learned in this class “in the wild”, you will not necessarily “start from scratch” as we have often done so far. Instead you will leverage the many sophisticated libraries that already exist (and that you are ready to do so!). But as you do so, nothing will appear to be magic!
The following data is birthweight data (in ounces) for a cohort of babies along with other data about the baby and mother, including whether the mother smoked. Our hypothesis is that birthweight is negatively associated with whether the mother smoked during pregnancy (i.e., that babies born to mothers who smoked will be smaller on average).
Let’s first load the data, taking advantage of the capability to directly load data from the Internet
>>> baby = ds.Table.read_table('https://www.inferentialthinking.com/data/baby.csv') >>> baby Birth Weight | Gestational Days | Maternal Age | Maternal Height | Maternal Pregnancy Weight | Maternal Smoker 120 | 284 | 27 | 62 | 100 | False 113 | 282 | 33 | 64 | 135 | False 128 | 279 | 28 | 64 | 115 | True 108 | 282 | 23 | 67 | 125 | True 136 | 286 | 25 | 62 | 93 | False 138 | 244 | 33 | 62 | 178 | False 132 | 245 | 23 | 65 | 140 | False 120 | 289 | 25 | 62 | 125 | False 143 | 299 | 30 | 66 | 136 | True 140 | 351 | 27 | 68 | 120 | False ... (1164 rows omitted)
Here we see a sample of the data (1174 observations total). We are most
interested in the “Birth Weight” and “Maternal Smoker” columns. We can use the
group method to quickly figure out how many smokers and non-smokers are in
the data (that is do we have enough data to do a meaningful analysis of that
>>> baby.select(["Maternal Smoker"]).group("Maternal Smoker") Maternal Smoker | count False | 715 True | 459
Let’s check out the means for smokers and non-smokers. Again we can use
>>> means = baby.group("Maternal Smoker", np.mean) >>> means Maternal Smoker | Birth Weight mean | Gestational Days mean | Maternal Age mean | Maternal Height mean | Maternal Pregnancy Weight mean False | 123.085 | 279.874 | 27.5441 | 64.014 | 129.48 True | 113.819 | 277.898 | 26.7364 | 64.1046 | 126.919
Notice the over 9 ounce difference in average birthweight! Specifically:
>>> actual_diff = means.column("Birth Weight mean").item(0) - means.column("Birth Weight mean").item(1) >>> actual_diff 9.266142572024918
In the above code we extract the
Birth Weight mean column, which is an array and use the
item method to extract scalars at index 0 (mean birthweight when
Maternal Smoker is
True) and index 1 (mean birthweight when
Maternal Smoker is
In contrast the other attributes seem similar between the two groups. However, is that difference in average birthweight actually meaningful or did it just arise by chance? That is in reality what if there is no difference between the two groups and what we are observing is just an artifact of how these mothers/babies were selected? We will describe that latter expectation, that there is no difference, as the null hypothesis.
We would like to determine how likely it is that the null hypothessis is true. To do so we can simulate the null hypothesis by randomly shuffling the “Maternal Smoker”, i.e. randomly assigning row to be a smoker or non-smoker, and the recomputing the difference in average birthweight. If the null hypothesis is true, the difference in means will be similar to what we observe in actual data.
We can do so with the following function. The first part of the loop generates a new Table with the original birthweight data but randomly shuffled values for
Maternal Smoker. The second part repeats the operations we just performed to compute the difference in average birthweight between the two groups.
def permutation(data, repetitions): """Compute difference in mean birthweight repetitions times using shuffled labels""" diffs =  for i in range(repetitions): # Create new table with birthweights and shuffled smoking labels shuffled_labels = data.sample(with_replacement=False).column("Maternal Smoker") shuffled_table = data.select("Birth Weight").with_column("Shuffled Label",shuffled_labels) # Compute difference in mean birthweight between smoking and not smoking groups means = shuffled_table.group("Shuffled Label", np.mean) diff = means.column("Birth Weight mean").item(0) - means.column("Birth Weight mean").item(1) diffs.append(diff) return np.array(diffs)
Testing it out with a single repetition we see that the re-sampled mean is much lower than the actual mean. However we won’t know if that is meaningful until we simulate many times.
>>> permutation(baby, 1) array([-0.80381797])
Let’s simulate 5000 times, creating a vector of 5000 differences. If we ask how many of the differences are greater than the observed difference?
>>> sim_diffs = permutation(baby, 5000) >>> num_diffs_greater = (abs(sim_diffs) > abs(actual_diff)).sum() >>> num_diffs_greater 0
None! That would suggest we can reject the null hypothesis (for those who have taken statistics, our empirical p-value is less than 1/5000).
This is not a statistics class, my goal is to show you how could use some of the tools that we are learning about (i.e., the tools in your toolbox) to implement statistical analyses (and not just perform a pre-built analysis). And in particular to demonstrate the use of “vectorized” computations.
This “vectorized” approach is very powerful. And not as unfamiliar as it might seem. If you have ever created a formula in Excel in a single cell and then copied it to an entire column, then you have performed a “vectorized” computation. As we described earlier, the benefits of this approach are performance and programmer efficiency. We just implemented some very sophisticated computations very easily. And we did so with function calls, indexing and slicing just as we have done with strings, lists, etc. The challenge is that our tool box just got much bigger! We can’t possible know all the functions available. Instead what we will seek to cultivate is the knowledge that such a function should exist and where to start looking for it. As a practical matter that means there will be many links in our labs and notes to online documentation that you will need to check out.