Lecture 12: Sets and Objects

Objectives for today

Classes and objects

In Object-Oriented Programming, an object is an instance of a class. A class is like a blueprint – it describes what data and methods an object of that class will have. Consider a “people” class. People will have attributes and methods. An instance of the people class will be a specific “person”.

In Python, classes are synonymous with types; that is, each type is a class. When we invoke help with a type, we obtain information about the class, e.g., help(str) or help(int).

When we write d = dict() or s = set() we are creating new objects, namely new instances of the dictionary class and set class.
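As a quick check of this idea, the built-in type and isinstance functions report which class an object belongs to. This is only a minimal sketch; the variable names are for illustration.

```python
# Every value is an object, and every object is an instance of some class.
d = dict()   # a new instance of the dict class
s = set()    # a new instance of the set class

print(type(d))              # <class 'dict'>
print(type(s))              # <class 'set'>
print(isinstance(s, set))   # True
print(isinstance(s, dict))  # False
```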

Sets

Recall that when we introduced lists, we described a list as a “data structure”, and that a data structure is a particular way of organizing data, with different data structures designed for different kinds of computations. Since then, we have also introduced the dictionary and tuple data structures.

Today we introduce a new data structure, the set, and the corresponding set class. In a mathematical context, a set is a collection of distinct values. So too in Python (and CS generally). Specifically, a set data structure is an unordered collection of unique values.

A set differs from a list in these two key properties: (1) a list is ordered, and (2) a list can have duplicates.

The values in a set must be comparable (or “hashable”, so uniqueness can be enforced) and immutable (so prior comparisons are not invalidated by changing an object within the set). A set is a mutable structure, but its values must be immutable.
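The requirement on set values can be seen directly. The sketch below (with made-up example values) shows immutable values being accepted, a list being rejected, and the set itself being modified after creation.

```python
# Immutable values such as numbers, strings, and tuples can be set members.
points = {(0, 0), (1, 2), "origin", 3.5}
print(len(points))  # 4

# A list is mutable (and therefore unhashable), so it cannot be a member.
try:
    bad = {[1, 2], [3, 4]}
except TypeError as err:
    print("TypeError:", err)

# The set itself is mutable: we can add to it after creation.
points.add((5, 5))
print((5, 5) in points)  # True
```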

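Some operations we would want to perform with sets include adding values, testing membership, and combining sets via union, intersection, and difference.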

For a list of the operations provided by Python for the set data structure:

>>> help(set)

Let’s discuss these in turn…

Set Construction

Recall that we could create a list directly using square brackets

>>> l = [1, 2, 3, 4]
>>> type(l)
<class 'list'>

we can do something similar to create a set:

>>> s = {1, 2, 3, 4}
>>> type(s)
<class 'set'>

But we can also create a set using the constructors. Constructors are the functions that create new instances (or objects) of a class, and they have the same name as the type. They define how to “construct” objects from different inputs.

The first part of the help describes the constructors:

>>> help(set)
...
class set(object)
 |  set() -> new empty set object
 |  set(iterable) -> new set object
...

The first creates an empty set (but is not equivalent to {}, which is an empty dictionary). The second creates a new set from an iterable object.
Lists and strings are “iterable” and can be used to construct a set. For example:

>>> set()
set()
>>> set(l)
{1, 2, 3, 4}
>>> set("abcd")
{'c', 'a', 'd', 'b'}

What do we notice about the last set we created? The letters are no longer “in order”. Recall that sets are “unordered”. What about their second property, uniqueness?

>>> set("abcda")
{'c', 'a', 'd', 'b'}
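The following short sketch pulls the construction rules together: empty braces give a dictionary, the constructor gives an empty set, and duplicates in the input iterable are dropped. The variable names are illustrative only.

```python
empty_dict = {}      # curly braces alone create a dictionary
empty_set = set()    # use the constructor to get an empty set
print(type(empty_dict))  # <class 'dict'>
print(type(empty_set))   # <class 'set'>

# Any iterable works, and duplicates are dropped during construction.
letters = set("abcda")
print(len(letters))            # 4
print(letters == set("abcd"))  # True

numbers = set([3, 1, 3, 2, 1])
print(numbers == {1, 2, 3})    # True
```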

Constructors: An Aside

We have used other constructors. Here are some examples:

>>> str(10)
'10'
>>> int("10")
10
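A few more constructors follow the same pattern: the function named after the type builds a new object of that type from the input. The examples below are illustrative choices, not an exhaustive list.

```python
as_list = list("abc")                 # string -> list of characters
as_tuple = tuple([1, 2, 3])           # list -> tuple
as_float = float("2.5")               # string -> float
as_dict = dict([("a", 1), ("b", 2)])  # list of pairs -> dictionary

print(as_list)   # ['a', 'b', 'c']
print(as_tuple)  # (1, 2, 3)
print(as_float)  # 2.5
print(as_dict)   # {'a': 1, 'b': 2}
```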

Set Methods

There are two broad categories of set methods, those that mutate the object and those that don’t.

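Using the help output, we can learn which methods mutate the set object on which they were invoked. For example, add, update, and intersection_update modify the set in place, while union, intersection, and difference return a new set and leave the original unchanged.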

Set methods in action:

>>> s = {1, 2, 3, 4}
>>> s.add(5)
>>> s
{1, 2, 3, 4, 5}
>>> s2 = {4, 5, 6, 7}
>>> s.difference(s2)
{1, 2, 3}
>>> s
{1, 2, 3, 4, 5}
>>> s2
{4, 5, 6, 7}
>>> s.union(s2)
{1, 2, 3, 4, 5, 6, 7}
>>> s.intersection(s2)
{4, 5}
>>> s.intersection_update(s2)
>>> s
{4, 5}
>>> s.update({1,2})
>>> s
{1, 2, 4, 5}
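One practical way to tell the two categories apart is the return value: in the session above, the mutating calls printed nothing because they return None. A small sketch of that distinction, using example sets chosen for illustration:

```python
s = {1, 2, 3}
s2 = {3, 4}

# Non-mutating: returns a new set and leaves s unchanged.
combined = s.union(s2)
print(combined == {1, 2, 3, 4})  # True
print(s == {1, 2, 3})            # True

# Mutating: changes s in place and returns None.
result = s.update(s2)
print(result)                    # None
print(s == {1, 2, 3, 4})         # True
```

A common mistake is writing s = s.update(s2), which replaces s with None.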

Note that some operators are “overloaded” to implement set operations: | for union, & for intersection, - for difference, and ^ for symmetric difference.

Are these operators mutating or not?
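To explore that question for the operators |, &, - and ^, the sketch below applies each to two example sets and then checks whether the operands changed. The augmented forms (such as |=) are included for contrast.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

union = a | b    # same result as a.union(b)
common = a & b   # same result as a.intersection(b)
only_a = a - b   # same result as a.difference(b)
either = a ^ b   # same result as a.symmetric_difference(b)

print(union == {1, 2, 3, 4, 5})  # True
print(common == {3, 4})          # True
print(only_a == {1, 2})          # True
print(either == {1, 2, 5})       # True

# The plain operators build new sets; a and b are unchanged.
print(a == {1, 2, 3, 4}, b == {3, 4, 5})  # True True

# The augmented assignment forms do mutate the left-hand set.
a |= b
print(a == {1, 2, 3, 4, 5})      # True
```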

What about membership? We apply the in operator:

>>> s2
{4, 5, 6, 7}
>>> 1 in s2
False
>>> 5 in s2
True
>>> "abc" in s2
False

What about comparisons? What would make sense in the context of sets? For lists (and strings and other sequences) Python performs ordered comparisons, i.e., compares the first elements then the second elements and so on. But sets are not ordered. Instead, set comparisons are defined in terms of subsets and supersets. For example, a <= b if set a is a subset of set b, while a >= b if set a is a superset of set b.
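A few concrete comparisons make the subset and superset reading clearer. Note in particular that two sets can be unrelated, so both a <= c and a >= c can be False. The sets here are arbitrary examples.

```python
a = {1, 2}
b = {1, 2, 3}
c = {7, 8}

print(a <= b)         # True: every element of a is in b
print(a < b)          # True: a is a proper subset (a <= b and a != b)
print(b >= a)         # True: b is a superset of a
print(a == {2, 1})    # True: equality ignores order
print(a <= c, a >= c) # False False: neither is a subset of the other
print(a.issubset(b))  # True: method form of a <= b
```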

What about the indexing operator?

>>> s2[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object does not support indexing

Recall sets are unordered, so there is no notion of indexing. But we can still iterate, that is, sets are iterable:

>>> for val in s2:
...     print(val)
... 
4
5
6
7
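The order in which a set yields its values during iteration is not something to rely on. When a specific order or indexing is needed, one option is to build a sorted list from the set, as in this small sketch:

```python
s2 = {4, 5, 6, 7}

# sorted() accepts any iterable and returns a new list in ascending order.
ordered = sorted(s2)
print(ordered)     # [4, 5, 6, 7]
print(ordered[0])  # 4 (lists support indexing, sets do not)

# Computations that do not depend on order can iterate over the set directly.
total = 0
for val in s2:
    total += val
print(total)       # 22
```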

Answer questions about sets

Why Sets

Couldn’t we accomplish the same operations with lists? We could. Lists actually support some of the same operations as sets, e.g., the in operator:

>>> l = [1, 2, 3, 4]
>>> 4 in l
True
>>> "abc" in l
False

So why sets? The uniqueness invariant and performance. The set class is implemented in a way that speeds up set operations relative to lists. Consider implementing a function contains that provides the same functionality as in.

def contains(values, item):
    """Return True if item is in the list values, like the in operator."""
    for val in values:
        if val == item:
            return True
    return False

If we double the length of the list, how much longer will it take to execute contains? The time will also double. We would say contains executes in linear time with respect to the length of the input list.

In contrast, a set can perform the same operation in constant time on average with respect to the size of the set, because Python implements sets as hash tables.

Let’s convince ourselves with data. Check out lists_vs_sets.py.

This script includes functions for creating lists and sets of a given size (with random data), and then querying those data structures a given number of times. speed_test times how long it takes to create and query the different data structures.
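The course script is not reproduced here. As a rough, hypothetical stand-in, the sketch below uses the standard library's timeit module to time membership queries against a list and a set holding the same random data. The function name time_queries and the chosen sizes are assumptions for illustration, and the timings will vary from machine to machine.

```python
import random
import timeit

def time_queries(container, queries):
    """Return the seconds taken to test each query for membership in container."""
    return timeit.timeit(lambda: [q in container for q in queries], number=1)

random.seed(0)  # fixed seed so repeated runs use the same data
for size in (1_000, 10_000):
    data = [random.randrange(size * 10) for _ in range(size)]
    as_list = data
    as_set = set(data)
    queries = [random.randrange(size * 10) for _ in range(100)]
    list_time = time_queries(as_list, queries)
    set_time = time_queries(as_set, queries)
    print(f"size={size}: list {list_time:.6f} s, set {set_time:.6f} s")
```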

For smaller input sizes, the absolute differences are small:

>>> speed_test(1000, 100)
List creation took 0.001810 seconds
Set creation took 0.002273 seconds
--
List querying took 0.001780 seconds
Set querying took 0.000172 seconds

As we increase the input sizes or the number of queries, we start to see real differences in performance:

>>> speed_test(10000, 100)
List creation took 0.018895 seconds
Set creation took 0.024366 seconds
--
List querying took 0.016907 seconds
Set querying took 0.000159 seconds
>>> speed_test(10000, 1000)
List creation took 0.018688 seconds
Set creation took 0.018990 seconds
--
List querying took 0.156710 seconds
Set querying took 0.002958 seconds
>>> speed_test(100000, 1000)
List creation took 0.161801 seconds
Set creation took 0.187271 seconds
--
List querying took 1.591348 seconds
Set querying took 0.001975 seconds

Let’s investigate in a more systematic way:

>>> speed_data(1000, 10000, 100000, 5000)
size	list	set
10000	0.157616	0.001780
15000	0.251193	0.001595
20000	0.312362	0.001505
25000	0.392334	0.001448
30000	0.480571	0.001570
35000	0.566010	0.002419
40000	0.634969	0.001507
45000	0.732042	0.001493
50000	0.815638	0.001570
55000	0.942068	0.001659
60000	1.031601	0.001606
65000	1.164648	0.001574
70000	1.288893	0.001734
75000	1.400574	0.001537
80000	1.443860	0.001565
85000	1.633924	0.001546
90000	1.784104	0.001501
95000	1.799544	0.001495

We can plot this in Excel. We will learn how to plot from within Python later in the semester.

Putting it together: When to use set vs list

If order matters, use a list.

If order is unimportant but uniqueness or query performance matters, use a set.

What are some situations when we might want to use a set? Membership testing, removing duplicates from a sequence, …
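As one example of the duplicate-removal use case, the sketch below (with a made-up word list) first counts the unique words with a set, and then uses a set alongside a list when the original order also needs to be preserved.

```python
words = ["to", "be", "or", "not", "to", "be"]

# Converting to a set removes duplicates (but does not preserve order).
unique_words = set(words)
print(len(words), len(unique_words))  # 6 4

# If order matters too, keep a list for the result and a set for fast lookups.
seen = set()
in_order = []
for w in words:
    if w not in seen:   # fast membership test against the set
        seen.add(w)
        in_order.append(w)
print(in_order)  # ['to', 'be', 'or', 'not']
```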

Some thoughts about efficiency and performance

So far this semester we have largely ignored questions about performance or efficiency. Our programs have been so small that efficiency has been a non-issue. But that will not always be the case.

What about in the standard deviation calculation? If you calculate the average in each loop iteration, your program could be very slow on the larger census datasets. We will talk more formally about computational complexity later in the semester, but here is a preview.

We said finding an element in a list takes an amount of time that grows linearly with the size of the list, whereas testing whether an element is in a set takes constant time on average. What about computing the average in each loop iteration of the standard deviation calculation? The time to compute the standard deviation would then grow as the square of the size of the dataset: if the size of the dataset is n, computing the average takes “order” n time (summing all n elements), and we are doing that computation for each data point, i.e., n times. n^2 can grow very quickly, even for “smallish” n.
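The two approaches can be sketched side by side. The function names are hypothetical, and the sketch assumes the population formula (dividing by n); the assignment's exact formula may differ. Both functions return the same answer, but the first repeats order-n work inside a loop that runs n times.

```python
import math

def std_dev_slow(data):
    """Recompute the average inside the loop: roughly n * n steps."""
    total = 0
    for x in data:
        avg = sum(data) / len(data)  # order-n work repeated n times
        total += (x - avg) ** 2
    return math.sqrt(total / len(data))

def std_dev_fast(data):
    """Compute the average once, before the loop: roughly n steps."""
    avg = sum(data) / len(data)      # order-n work done a single time
    total = 0
    for x in data:
        total += (x - avg) ** 2
    return math.sqrt(total / len(data))

values = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev_slow(values))  # 2.0
print(std_dev_fast(values))  # 2.0
```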

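We will continually build up our “efficiency” toolbox throughout the semester. Our first two tools are using a set instead of a list when we need fast membership tests, and avoiding recomputing the same value inside a loop.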

Summary

  1. Sets
  2. Introduction to objects
  3. Introduction to computational complexity

Supplemental Reading (optional)

  1. lists_vs_sets.py
  2. Python documentation for set
  3. Python documentation for ordered comparisons
  4. Python documentation for set comparisons