CS 465 - Tutorial Two
2025-10-06T23:59
Objectives
- Walk through the process of doing exploratory data analysis with visualizations
- Get some more practice building visualizations
Prerequisites
- Create the git repository for your tutorial by accepting the assignment from GitHub Classroom. This will create a new repository for you with a bare-bones npm package already set up.
- Clone the repository to your computer with git clone (get the name of the repository from GitHub).
- Open the directory with VSCode. You should see all of the files in the Explorer panel on the left.
- In the VSCode terminal, type pnpm install. This will install all of the necessary packages.
- In the terminal, type pnpm dev to start the development server.
Iterative Exploratory Data Analysis
Exploratory data analysis is one of the primary data science activities. Our primary goal is to feel around in a data set and try to understand what is going on. The visualizations we create are in service to our goal of understanding the data. As a result, we aren’t spending a lot of time trying to make the visualizations pretty – once they have communicated what they have to say to us, we move on. If we are going to communicate our findings to someone else, that is when we might spend some time on the design (which isn’t to say that we should be satisfied with bad visualizations – bad visualizations hide data or give us misleading impressions of what is going on).
Our basic process is pretty simple:
- Formulate a question (or questions)
- Construct a visualization to address the question (it won’t necessarily answer it!)
- Inspect the visualization and use it to form new questions
- Repeat
Our ability to intermix Markdown and JS with Observable Framework makes it a useful tool for this kind of exploration. We can write up our questions, make a visualization, write some notes, and then move on to the next visualization, leaving a record behind us of the exploration. It is okay to work for a little while at a visualization, adding and removing encodings until it answers our question, but once we have an answer, we move on. The record we leave should not consist of a collection of graphs that show the same thing – each one should have some purpose in the exploration.
The kinds of questions we can ask vary wildly. We may have a very targeted question going in, but a lot of the time our question is just “what’s in this data set?” As such, our first visualizations may just be looking at the shape of the data, or looking for simple relationships. After that, we are frequently driven by the power of surprise. “Huh, that’s funny” and “I wonder why…” will be our driving mantras.
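Before drawing anything, it can help to get a quick textual summary of each column. The sketch below is one possible way to do that in plain JS. The function name summarizeColumn, the column names, and the sample rows are all invented for illustration; they are not taken from the tutorial's files, and the real data will have different values.

```js
// Hypothetical sample rows. Column names are assumptions for illustration
// and the values are made up, not drawn from the real dataset.
const rows = [
  { department: "Sales", age: 35, education: "Bachelors" },
  { department: "Sales", age: 30, education: null },
  { department: "Technology", age: 41, education: "Masters" },
  { department: "HR", age: 30, education: "Bachelors" },
];

// Summarize one column: row count, missing values, distinct values,
// and the min/max when every present value is a number.
function summarizeColumn(data, column) {
  const values = data.map((d) => d[column]);
  const present = values.filter(
    (v) => v !== null && v !== undefined && v !== ""
  );
  const summary = {
    column,
    count: values.length,
    missing: values.length - present.length,
    distinct: new Set(present).size,
  };
  if (present.length > 0 && present.every((v) => typeof v === "number")) {
    let min = present[0];
    let max = present[0];
    for (const v of present) {
      if (v < min) min = v;
      if (v > max) max = v;
    }
    summary.min = min;
    summary.max = max;
  }
  return summary;
}

// One summary object per column, using the first row to discover the columns.
const shape = Object.keys(rows[0]).map((c) => summarizeColumn(rows, c));
console.table(shape);
```

A summary like this often suggests the first questions: a column with many missing values or only a handful of distinct values is usually worth a closer look in a chart.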
In this tutorial, I’ll guide you through a simple exploration. My curiosity is our primary guide. I make no claim to have plumbed this dataset for its deepest secrets or most interesting tidbits. I want you to get a feel for the process, dead ends and all.
The data set
For this exercise, we are going to use the ‘Employee’s Evaluation for Promotion’ dataset from Kaggle.
Ostensibly, this dataset was created to promote the development of an ML tool that can predict which employees are more likely to be eligible for promotion. As such, the presence of age and gender in this dataset is a bit suspect given Amazon’s experience with a similar problem (not to mention the fact that gender is strictly binary in this set).[1]
Regardless, it is a reasonable data set for us to poke around in, even if our ultimate goal is not one of prediction.
Part 1: Guided exploration
I have written up a guided exploration for you. If you haven’t already, start up the dev server. Follow through the guided exploration in src/index.md.
Part 2: Your turn
Now it is your turn. Switch to independent.md. I’ve imported the data for you, but the rest is up to you. I left a number of threads open that you could follow up on, or you could pick your own direction through the data.
Follow the pattern that I established (though you don’t need to explain how to make the visuals):
- Start with a question written in markdown
- Create a graph to address it
- Follow up with your observations in another markdown block
Create at least two more charts. They can be connected, with one addressing a question raised by the first, or separate. It should be clear, however, that you started with a question, that there is a visualization, and that your observations describe what the visualization has told you.
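Many questions of the form "does X differ across groups?" come down to a grouped summary that a chart then encodes. As a minimal sketch, assuming columns named department and is_promoted (assumed names, with invented values), the helper below computes the share of flagged rows per group. The function rateBy is hypothetical and not part of the tutorial's code.

```js
// Hypothetical rows. The column names `department` and `is_promoted` are
// assumptions, and the values are invented for illustration.
const employees = [
  { department: "Sales", is_promoted: 0 },
  { department: "Sales", is_promoted: 1 },
  { department: "Technology", is_promoted: 1 },
  { department: "Technology", is_promoted: 1 },
  { department: "HR", is_promoted: 0 },
];

// Group rows by `key` and compute the fraction of rows where `flag` equals 1.
function rateBy(data, key, flag) {
  const groups = new Map();
  for (const d of data) {
    const g = groups.get(d[key]) ?? { total: 0, hits: 0 };
    g.total += 1;
    if (Number(d[flag]) === 1) g.hits += 1;
    groups.set(d[key], g);
  }
  // One object per group, in order of first appearance.
  return Array.from(groups, ([group, { total, hits }]) => ({
    group,
    total,
    rate: hits / total,
  }));
}

const promotionRates = rateBy(employees, "department", "is_promoted");
console.log(promotionRates);
```

Keeping total next to rate is deliberate: a striking rate from a very small group is a weak observation, and having the count at hand makes that easy to notice when you write up what the chart shows.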
Reflection
Before you submit, make sure to answer the questions in reflection.md.
Submitting your work
- Commit your work to the repository (see the git basics guide if you are unfamiliar with this process)
- Push the work to GitHub
- Submit the repository on Gradescope (see the Gradescope submission guide)
Footnotes
[1] Amazon briefly used a similar model to decide whether résumés should be read. It tended to discard women’s résumés because the model was built on the profiles of successful employees, which skewed heavily male because it is a “dotcom-era technology company” and that is who they had been hiring…