Lecture 16: Data analysis with NumPy/pandas

Objectives for today

Python tools for data analysis

Our topic for the next two weeks is an introduction to programming tools we can use for data analysis.

These tools will enable us to work with much larger data sets than we can handle with, say, Excel, which limits a worksheet to about a million rows.

Specifically we will introduce several tools that make up the SciPy system and the R statistics environment.

SciPy

SciPy is an umbrella effort to develop scientific computing tools within Python. From the SciPy website: “SciPy (pronounced “Sigh Pie”) is a Python-based ecosystem of open-source software for mathematics, science, and engineering.” Check out the website for documentation, tutorials and other information.

In this course we will be using three packages from the SciPy ecosystem: NumPy, pandas, and matplotlib.

Here is how we will import these modules:

import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt

What if I get a “ModuleNotFoundError”? You will need to install these modules using the same approach as when you installed Thonny the first time. Open the system shell as we did last week, i.e., “Tools -> Open System Shell”, then execute the following command on macOS:

python3 -m pip install numpy pandas matplotlib

and on Windows:

python -m pip install numpy pandas matplotlib

Series and DataFrames

Pandas is a Python library providing data structures and data analysis tools.

The two basic data structures used in pandas are Series and DataFrame.

A Series is a 1-D array (a list of values of a single type) with indices that can be labels. See the Series documentation.
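
For example, here is a quick sketch of a Series with labeled indices (the values and labels are made up for illustration; exact output formatting may vary slightly by version):

>>> s = pd.Series([4, 7, 2], index=['a', 'b', 'c'])
>>> s
a    4
b    7
c    2
dtype: int64
>>> s['b']
7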

A DataFrame is a 2-D table with labeled rows and columns, similar to a spreadsheet. See the DataFrame documentation.

Some of the key properties of a spreadsheet table: the columns have labels, each column holds values of a single type, and each row represents a single record or observation.

The pandas DataFrame type is designed for this kind of data.

There are a lot of different ways to think, conceptually, about DataFrames, including (but not limited to): as a spreadsheet, as a database table, or as a dictionary of 1-D arrays (the columns) with the column label as the key. The last is how we will often create DataFrames.

In addition to these notes, check out another introduction for pandas basics or the following notes from Software Carpentry.

Indexing

Let’s create an example DataFrame starting from a dictionary (adapted from this example):

>>> d = {
        'Artist': ['Billie Holiday','Jimi Hendrix', 'Miles Davis', 'SIA'],
        'Genre': ['Jazz', 'Rock', 'Jazz', 'Pop'],
        'Listeners': [1300000, 2700000, 1500000, 2000000],
        'Plays': [27000000, 70000000, 48000000, 74000000]
    }
>>> df = pd.DataFrame(d)
>>> df
           Artist Genre  Listeners     Plays
0  Billie Holiday  Jazz    1300000  27000000
1    Jimi Hendrix  Rock    2700000  70000000
2     Miles Davis  Jazz    1500000  48000000
3             SIA   Pop    2000000  74000000

Access columns with the index operator frame['colname'], e.g., df["Artist"], or with a list of column names. Access rows via the .loc attribute. Unfortunately, unlike most of Python, slicing with .loc uses an inclusive end. The .loc attribute can also be used to select rows and columns simultaneously.

>>> df['Artist']
0    Billie Holiday
1      Jimi Hendrix
2       Miles Davis
3               SIA
Name: Artist, dtype: object
>>> df[['Artist','Listeners']]
           Artist  Listeners
0  Billie Holiday    1300000
1    Jimi Hendrix    2700000
2     Miles Davis    1500000
3             SIA    2000000
>>> df.loc[1:3]
         Artist Genre  Listeners     Plays
1  Jimi Hendrix  Rock    2700000  70000000
2   Miles Davis  Jazz    1500000  48000000
3           SIA   Pop    2000000  74000000
>>> df.loc[1:3,['Artist']]
         Artist
1  Jimi Hendrix
2   Miles Davis
3           SIA

A labeled column is analogous to a dictionary entry: the column label is the key, and the column contents (a list of values) are the value associated with that key. Much like with dictionaries, we can create new columns by assigning to an index. For example:

>>> df["Albums"] = [12, 8, 122, 8]
>>> df
           Artist Genre  Listeners     Plays  Albums
0  Billie Holiday  Jazz    1300000  27000000      12
1    Jimi Hendrix  Rock    2700000  70000000       8
2     Miles Davis  Jazz    1500000  48000000     122
3             SIA   Pop    2000000  74000000       8

Vector execution on columns, etc.

We could use the above indexing (and the shape attribute) to iterate through rows and columns. However, whenever possible we will try to avoid explicit iteration; that is, we will try to perform operations on entire columns or blocks of rows at once. For example, let’s compute the “plays per listener”:

>>> df["Average_Plays"] = df["Plays"] / df["Listeners"]
>>> df
           Artist Genre  Listeners     Plays  Albums  Average_Plays
0  Billie Holiday  Jazz    1300000  27000000      12      20.769231
1    Jimi Hendrix  Rock    2700000  70000000       8      25.925926
2     Miles Davis  Jazz    1500000  48000000     122      32.000000
3             SIA   Pop    2000000  74000000       8      37.000000

Here the / operator performs element-wise division; that is, we are effectively computing:

[df.loc[0,"Plays"] / df.loc[0, "Listeners"], df.loc[1,"Plays"] / df.loc[1, "Listeners"], ...]

This is a vector style of computation, and it can be much faster than directly iterating (as we have done before) because we use highly optimized implementations for performing arithmetic and other operations on the columns. In this context, we use “vector” in the linear algebra sense of the term, i.e., a 1-D matrix of values. More generally, we are aiming for largely “loop-free” implementations, that is, implementations in which all loops are implicit in the vector operations.

The idea of a “loop-free” implementation is illustrated with the definition of mean as:

def mean(data):
    return sum(data) / len(data)

instead of:

def mean(data):
    result = 0
    for val in data:
        result += val
    return result / len(data)

In a more complex example, let’s consider the standard deviation computation from our statistics lab. Here is a typical implementation with an explicit loop. How can we vectorize?

import math

def stddev(data):
    mean = sum(data) / len(data)
    result = 0.0
    for d in data:
        result += (d - mean) ** 2
    return math.sqrt(result / (len(data) - 1))

To get rid of the explicit for loop:

>>> data = np.array([1.0, 2.0, 3.0, 4.0])
>>> math.sqrt(np.sum(np.power(data - np.mean(data), 2))/(len(data) - 1))
1.2909944487358056

For our example input, this code performs the following computations:

  1. Element-wise subtraction to compute the difference from the mean

    $$ \begin{bmatrix} 1.0 \\ 2.0 \\ 3.0 \\ 4.0 \end{bmatrix} - \begin{bmatrix} 2.5 \\ 2.5 \\ 2.5 \\ 2.5 \end{bmatrix} $$
  2. Element-wise “squaring” via the power function (which raises its first argument to the power of its second)

    $$ \begin{bmatrix} (-1.5)^2 \\ (-0.5)^2 \\ 0.5^2 \\ 1.5^2 \end{bmatrix} $$
  3. Sum the vector

    $$ 2.25 + 0.25 + 0.25 + 2.25 $$
  4. Perform the division and square root

    $$ \sqrt{\frac{5}{4-1}} $$
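
NumPy also provides this computation directly; the following call should produce the same result (ddof=1 selects the sample standard deviation, i.e., division by n - 1, matching our implementation above):

>>> np.std(data, ddof=1)
1.2909944487358056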

A key takeaway is that NumPy and pandas functions can be applied to single values, vectors, and matrices alike. Note this is not just a feature of the SciPy ecosystem; many programming languages, such as MATLAB and R (which we will discuss next week), are designed around this vectorized approach.
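
For example, np.sqrt works equally well on a single value, a vector, or a matrix (output formatting may differ slightly between NumPy versions):

>>> np.sqrt(4.0)
2.0
>>> np.sqrt(np.array([1.0, 4.0, 9.0]))
array([1., 2., 3.])
>>> np.sqrt(np.array([[1.0, 4.0], [9.0, 16.0]]))
array([[1., 2.],
       [3., 4.]])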

Filtering

We can apply the same “vector” approaches to filtering our data. Let’s subset our rows based on specific criteria.

>>> df[df['Genre'] == "Jazz" ]
           Artist Genre  Listeners     Plays  Albums  Average_Plays
0  Billie Holiday  Jazz    1300000  27000000      12      20.769231
2     Miles Davis  Jazz    1500000  48000000     122      32.000000
>>> df[df['Listeners'] > 1800000 ]
         Artist Genre  Listeners     Plays  Albums  Average_Plays
1  Jimi Hendrix  Rock    2700000  70000000       8      25.925926
3           SIA   Pop    2000000  74000000       8      37.000000

Conceptually we are computing a vector of booleans and using those booleans to select rows from the DataFrame.

>>> df['Listeners'] > 1800000
0    False
1     True
2    False
3     True
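
These boolean vectors can be combined with the element-wise operators & (AND) and | (OR); note that each comparison must be parenthesized because of operator precedence. For example, to select Jazz artists with more than 1.4 million listeners:

>>> df[(df['Genre'] == 'Jazz') & (df['Listeners'] > 1400000)]
        Artist Genre  Listeners     Plays  Albums  Average_Plays
2  Miles Davis  Jazz    1500000  48000000     122           32.0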

Grouping and groupby

One of the most powerful (implicitly looped) iteration approaches in pandas is the groupby method. groupby “groups” identical values in the specified columns and facilitates analyses on each of those groups. For example, let’s sum all the numeric data by genre:

>>> df.groupby('Genre').sum()
       Listeners     Plays  Albums  Average_Plays
Genre                                            
Jazz     2800000  75000000     134      52.769231
Pop      2000000  74000000       8      37.000000
Rock     2700000  70000000       8      25.925926
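
Notice that the non-numeric Artist column was dropped from the sum. Depending on your pandas version you may need (or want) to select the numeric columns explicitly before aggregating, e.g.:

>>> df.groupby('Genre')[['Listeners', 'Plays']].sum()
       Listeners     Plays
Genre                     
Jazz     2800000  75000000
Pop      2000000  74000000
Rock     2700000  70000000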

A histogram can be implemented in a single expression with groupby(...).size(), which counts the number of rows in each group, or with the value_counts method:

>>> df.groupby('Genre').size()
Genre
Jazz    2
Pop     1
Rock    1
dtype: int64
>>> df['Genre'].value_counts()
Jazz    2
Rock    1
Pop     1
Name: Genre, dtype: int64

Reimplementing our data analysis lab

As you might imagine, using NumPy and pandas we can easily and succinctly reimplement our data analysis lab using the many standard statistics functions provided by those libraries.

Here it is in its entirety…

# Load a single column of data:
# header=None specifies that there isn't a header row, and squeezing the
# one-column result converts it from a DataFrame to a Series (a vector).
# (In older pandas this was the squeeze=True argument to read_table;
# that keyword was removed in pandas 2.0.)
data = pd.read_table(filename, header=None).squeeze("columns")

print("File contained", len(data), "entries")
if len(data) > 0:
    print("Max:", np.max(data))
    print("Min:", np.min(data))
    print("Average:", np.mean(data))
    print("Median:", np.median(data))
    print("Std. dev:", np.std(data, ddof=1))

We can implement this even more concisely using the pandas describe method.
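
For example, describe computes many of these summary statistics in one call (here on a small made-up Series):

>>> pd.Series([1.0, 2.0, 3.0, 4.0]).describe()
count    4.000000
mean     2.500000
std      1.290994
min      1.000000
25%      1.750000
50%      2.500000
75%      3.250000
max      4.000000
dtype: float64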

More generally, the message is that when you start applying the skills you have learned in this class “in the wild”, you will not be starting from scratch; instead, you will leverage the many sophisticated libraries that already exist.

Analyzing noise complaints

Let’s look at NYC 311 complaints from 10/1/2016 to 10/7/2016 (based on this example), downloaded from the NYC OpenData portal. We prepared this small data set (one week of complaints) for us to use; download it here.

First load the data and check out the available columns. We will want to perform some operations on the dates, so we will convert “Created Date” to the pandas datetime type (note that doing so can take a moment) and add a “Weekday” column.

complaints = pd.read_csv("311_Service_Requests.csv.gz", low_memory=False)
complaints.info()
complaints["Created Date"] = pd.to_datetime(complaints["Created Date"])
complaints["Weekday"] = complaints["Created Date"].dt.day_name()  # day of the week as a string, e.g., "Monday"

We can use the head method to check out a few examples of the data, e.g., complaints.head().

What are the most common complaints?

complaints['Complaint Type'].value_counts()

No surprise that it is noise complaints. Let’s select just the most common noise complaints using the sophisticated slicing capabilities we saw earlier. Here we are selecting rows using a boolean vector that is the OR of the comparisons against all those different types of noise complaints.

noise_complaints = complaints[
    (complaints["Complaint Type"] == "Noise - Residential") |  
    (complaints["Complaint Type"] == "Noise") |
    (complaints["Complaint Type"] == "Noise - Street/Sidewalk") |
    (complaints["Complaint Type"] == "Noise - Commercial") |
    (complaints["Complaint Type"] == "Noise - Vehicle")
]
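
The same selection can be written more concisely with the isin method, which tests each value for membership in a collection; this should produce an identical result:

noise_complaints = complaints[complaints["Complaint Type"].isin([
    "Noise - Residential", "Noise", "Noise - Street/Sidewalk",
    "Noise - Commercial", "Noise - Vehicle"
])]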

Which borough has the highest fraction of noise complaints (of total complaints)? We can normalize by the total number of complaints (even better would be to normalize by the borough population, but we don’t immediately have that information).

noise_complaint_counts = noise_complaints['Borough'].value_counts()
complaint_counts = complaints['Borough'].value_counts()
noise_complaint_counts / complaint_counts
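
To make the comparison easier to read, we can sort the resulting fractions in descending order (sort_values is a standard Series method):

(noise_complaint_counts / complaint_counts).sort_values(ascending=False)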

Which day of the week and borough has the highest fraction of noise complaints? We can use the groupby functionality to compute the counts for different combinations of day and borough. Note that the result is just a Series (1-D), but with a very powerful MultiIndex, and that the division takes those indices into account.

noise_complaint_counts = noise_complaints.groupby(['Weekday','Borough']).size()
complaint_counts = complaints.groupby(['Weekday','Borough']).size()
noise_complaint_counts / complaint_counts
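
To pull out the single combination with the highest fraction, we could use the idxmax method, which returns the index label of the maximum value (here a (Weekday, Borough) tuple because of the MultiIndex):

(noise_complaint_counts / complaint_counts).idxmax()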

A note about vectorization

This “vectorized” approach is very powerful, and not as unfamiliar as it might seem. If you have ever created a formula in a single Excel cell and then copied it to an entire column, then you have performed a “vectorized” computation. As we described earlier, the benefits of this approach are performance and programmer efficiency: we just implemented some very sophisticated computations very easily, and we did so with function calls, indexing, and slicing, just as we have done with strings, lists, etc. The challenge is that our tool box just got much bigger! We can’t possibly know all the functions available. Instead, what we will seek to cultivate is the knowledge that such a function should exist and where to start looking for it. As a practical matter, that means there will be many links in our labs and notes to online documentation that you will need to check out.

Summary

  1. NumPy and pandas
  2. Vectorized execution
  3. groupby

Be sure to check out the reading and the practice exercises for this week.