Class 10

Objects, Sets, Dictionaries

Objectives for today

  • Explain when sets are used
  • Describe sets as unordered unique collections of objects of arbitrary type
  • Create a set
  • Explain and use functions, methods and operators on sets including adding, deleting, membership, etc.
  • Explain the properties of different data structures
  • Explain when dictionaries are used
  • Describe dictionaries as a key-value store
  • Create a dictionary
  • Explain and use functions, methods and operators on dictionaries including adding, indexing, deleting, etc.

A Brief Word about Classes

We previously introduced the notion of objects and Object-Oriented Programming, but didn’t answer the question of how objects are defined. Objects are defined by classes; specifically, objects are instances of classes.

We can think of the class as the blueprint describing what data and methods an object will have. Consider a “People” class. People will have attributes, e.g. name and age, and methods that perform computations on those attributes. An instance of the People class will be a specific “person”.

In Python, classes are synonymous with types; that is, each type is a class. When we invoke help with a type we are obtaining information about the class, e.g. help(str) or help(int). As we continue throughout the semester we will learn more about classes. For today we will just focus on how to create instances of a class.
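As a small illustration of types being classes, the sketch below creates instances by calling the class name and then inspects them. The variable names are made up for this example.

```python
# Calling a class by name creates a new instance (object) of that class.
number = int("42")     # an instance of the int class
text = str(3.5)        # an instance of the str class

# type() reports the class an object is an instance of.
print(type(number))    # <class 'int'>
print(type(text))      # <class 'str'>

# isinstance() checks whether an object is an instance of a given class.
print(isinstance(number, int))   # True
print(isinstance(text, int))     # False
```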

Set: An Introduction

Recall that when we introduced lists, we described a list as a “data structure”, and that data structures are particular ways of organizing data, with different data structures being designed for different kinds of computations.

Today we are going to introduce a new data structure, sets and the set type (class). In a mathematical context, a set is a collection of distinct values. So too in Python (and CS generally). Specifically a set data structure is a:

  • Unordered collection, in which
  • All values are unique, i.e., there are no duplicates

A set differs from a List in these two key properties: recall that a List is ordered and can have duplicates.

These properties facilitate very fast operations like de-duplication and membership, but also introduce some requirements on the kind of values we can store in a set. Specifically all values must be “hashable”. For our purposes, “hashable” implies “comparable” (so uniqueness can be enforced) and immutable so that prior comparisons aren’t invalidated by changing an object within the set.
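To make the “hashable” requirement concrete, here is a minimal sketch. The names ok and rejected are illustrative only.

```python
# Immutable values such as ints, strings and tuples are hashable,
# so they can be stored in a set.
ok = {1, "two", (3, 4)}
print(len(ok))   # 3

# A list is mutable and therefore not hashable, so it cannot be a set element.
rejected = False
try:
    bad = {[1, 2], [3, 4]}
except TypeError as err:
    rejected = True
    print("TypeError:", err)
```

If we need to store a sequence in a set, converting it to a tuple first is one common workaround.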

What are the kinds of operations we would want to perform with sets?

  • Create a new set
  • Add values to a set
  • Remove a value from a set
  • Query if a value is in a set
  • Set intersection (collection of values in two/both sets)
  • Set union (collection of values in one or both sets)
  • Set difference (“subtract” all values in another set)
  • Set symmetric difference (collection of values only in one of two sets)

What operations are provided by Python sets?

help(set)
Help on class set in module builtins:

class set(object)
 |  set() -> new empty set object
 |  set(iterable) -> new set object
 |  
 |  Build an unordered collection of unique elements.
 |  
 |  Methods defined here:
 |  
 |  __and__(self, value, /)
 |      Return self&value.
 |  
 |  __contains__(...)
 |      x.__contains__(y) <==> y in x.
 |  
 |  __eq__(self, value, /)
 |      Return self==value.
 |  
 |  __ge__(self, value, /)
 |      Return self>=value.
 |  
 |  __getattribute__(self, name, /)
 |      Return getattr(self, name).
 |  
 |  __gt__(self, value, /)
 |      Return self>value.
 |  
 |  __iand__(self, value, /)
 |      Return self&=value.
 |  
 |  __init__(self, /, *args, **kwargs)
 |      Initialize self.  See help(type(self)) for accurate signature.
 |  
 |  __ior__(self, value, /)
 |      Return self|=value.
 |  
 |  __isub__(self, value, /)
 |      Return self-=value.
 |  
 |  __iter__(self, /)
 |      Implement iter(self).
 |  
 |  __ixor__(self, value, /)
 |      Return self^=value.
 |  
 |  __le__(self, value, /)
 |      Return self<=value.
 |  
 |  __len__(self, /)
 |      Return len(self).
 |  
 |  __lt__(self, value, /)
 |      Return self<value.
 |  
 |  __ne__(self, value, /)
 |      Return self!=value.
 |  
 |  __or__(self, value, /)
 |      Return self|value.
 |  
 |  __rand__(self, value, /)
 |      Return value&self.
 |  
 |  __reduce__(...)
 |      Return state information for pickling.
 |  
 |  __repr__(self, /)
 |      Return repr(self).
 |  
 |  __ror__(self, value, /)
 |      Return value|self.
 |  
 |  __rsub__(self, value, /)
 |      Return value-self.
 |  
 |  __rxor__(self, value, /)
 |      Return value^self.
 |  
 |  __sizeof__(...)
 |      S.__sizeof__() -> size of S in memory, in bytes
 |  
 |  __sub__(self, value, /)
 |      Return self-value.
 |  
 |  __xor__(self, value, /)
 |      Return self^value.
 |  
 |  add(...)
 |      Add an element to a set.
 |      
 |      This has no effect if the element is already present.
 |  
 |  clear(...)
 |      Remove all elements from this set.
 |  
 |  copy(...)
 |      Return a shallow copy of a set.
 |  
 |  difference(...)
 |      Return the difference of two or more sets as a new set.
 |      
 |      (i.e. all elements that are in this set but not the others.)
 |  
 |  difference_update(...)
 |      Remove all elements of another set from this set.
 |  
 |  discard(...)
 |      Remove an element from a set if it is a member.
 |      
 |      If the element is not a member, do nothing.
 |  
 |  intersection(...)
 |      Return the intersection of two sets as a new set.
 |      
 |      (i.e. all elements that are in both sets.)
 |  
 |  intersection_update(...)
 |      Update a set with the intersection of itself and another.
 |  
 |  isdisjoint(...)
 |      Return True if two sets have a null intersection.
 |  
 |  issubset(...)
 |      Report whether another set contains this set.
 |  
 |  issuperset(...)
 |      Report whether this set contains another set.
 |  
 |  pop(...)
 |      Remove and return an arbitrary set element.
 |      Raises KeyError if the set is empty.
 |  
 |  remove(...)
 |      Remove an element from a set; it must be a member.
 |      
 |      If the element is not a member, raise a KeyError.
 |  
 |  symmetric_difference(...)
 |      Return the symmetric difference of two sets as a new set.
 |      
 |      (i.e. all elements that are in exactly one of the sets.)
 |  
 |  symmetric_difference_update(...)
 |      Update a set with the symmetric difference of itself and another.
 |  
 |  union(...)
 |      Return the union of sets as a new set.
 |      
 |      (i.e. all elements that are in either set.)
 |  
 |  update(...)
 |      Update a set with the union of itself and others.
 |  
 |  ----------------------------------------------------------------------
 |  Class methods defined here:
 |  
 |  __class_getitem__(...) from builtins.type
 |      See PEP 585
 |  
 |  ----------------------------------------------------------------------
 |  Static methods defined here:
 |  
 |  __new__(*args, **kwargs) from builtins.type
 |      Create and return a new object.  See help(type) for accurate signature.
 |  
 |  ----------------------------------------------------------------------
 |  Data and other attributes defined here:
 |  
 |  __hash__ = None

Let’s discuss some of these in turn…

Creating sets

Recall that we could create lists directly using square brackets

l = [1, 2, 3, 4]
type(l)
list

we can do something similar for sets:

s = {1, 2, 3, 4}
type(s)
set

But we can also create sets using the initializers. Initializers (often called constructors in other languages) are the functions that create (and initialize, hence the name) new instances (or objects) of a type (or class). In Python, initializers have the same name as the type. They define how to “construct” objects from different inputs.

We have used other initializers. Can you think of an example? Recall str (to create strings) and int to create integers.

The first part of the help describes the initializers:

>>> help(set)
...
class set(object)
 |  set() -> new empty set object
 |  set(iterable) -> new set object
...

The first creates an empty set (but is not equivalent to {}, for reasons we will learn about soon). The second creates a new set from an iterable object. We will continue to learn more about iterables, but for purposes of today recall that lists and strings are “iterable” and can be used to construct a set (just in the way we could use a string to construct a list). For example:

set()
set(l)
set("abcd")
set()
{1, 2, 3, 4}
{'a', 'b', 'c', 'd'}

What do we notice about the last set we created? The letters may no longer be “in order” (the order in which a set of strings is printed can vary from run to run). Recall that sets are “unordered”. What about their second property, uniqueness?

set("abcda")
{'a', 'b', 'c', 'd'}

Set Methods

We have two broad categories of set methods, those that mutate the object and those that don’t. Using the help output, which of the following methods mutate the set object on which they were invoked?

  • add: Mutating
  • clear: Mutating
  • union: Non-mutating
  • update: Mutating (recall update is effectively a union operation)
  • intersection: Non-mutating
  • intersection_update: Mutating
  • difference: Non-mutating
  • difference_update: Mutating

Set methods in action:

>>> s = {1, 2, 3, 4}
>>> s.add(5)
>>> s
{1, 2, 3, 4, 5}
>>> s2 = {4, 5, 6, 7}
>>> s.difference(s2)
{1, 2, 3}
>>> s
{1, 2, 3, 4, 5}
>>> s2
{4, 5, 6, 7}
>>> s.union(s2)
{1, 2, 3, 4, 5, 6, 7}
>>> s.intersection(s2)
{4, 5}
>>> s.intersection_update(s2)
>>> s
{4, 5}

Some operators are “overloaded” to implement set operations:

  • | : Set union
  • - : Set difference
  • & : Set intersection
  • ^ : Set symmetric difference

Are these operators mutating or not?
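One way to answer that question is to try it. The sketch below, with made-up sets a and b, compares results by equality rather than printing the sets, since the printed order of a set is not guaranteed.

```python
a = {1, 2, 3, 4}
b = {3, 4, 5}

print((a | b) == {1, 2, 3, 4, 5})   # union: True
print((a - b) == {1, 2})            # difference: True
print((a & b) == {3, 4})            # intersection: True
print((a ^ b) == {1, 2, 5})         # symmetric difference: True

# The operands are unchanged: the binary operators return new sets.
print(a == {1, 2, 3, 4}, b == {3, 4, 5})   # True True

# The augmented forms (|=, -=, &=, ^=) do mutate the left-hand set in place.
a |= b
print(a == {1, 2, 3, 4, 5})         # True
```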

What about membership? We can apply the in operator:

>>> s2
{4, 5, 6, 7}
>>> 1 in s2
False
>>> 5 in s2
True
>>> "abc" in s2
False

What about comparisons? What would make sense in the context of sets? For lists (and strings and other sequences) Python performs lexicographic comparisons, i.e. compares the first elements then the second elements and so on. But sets are not ordered. Instead, set comparisons are defined in terms of subsets and supersets. For example, a <= b if set a is a subset of set b, while a >= b if set a is a superset of set b (< and > are proper subset and superset).
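A short sketch of the subset and superset comparisons, using illustrative sets named small, big and other:

```python
small = {1, 2}
big = {1, 2, 3}

print(small <= big)   # True: every element of small is in big
print(small < big)    # True: a proper subset (subset and not equal)
print(big >= small)   # True: big is a superset of small
print(big < big)      # False: a set is not a proper subset of itself
print(big <= big)     # True

# Unlike numbers, two sets can be "incomparable":
# neither is a subset of the other.
other = {3, 4}
print(small <= other, small >= other)   # False False
```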

What about the indexing operator?

>>> s2[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'set' object is not subscriptable

Recall sets are unordered, so there is no notion of indexing. But we can still iterate; that is, sets are iterable:

>>> for val in s2:
...     print(val)
... 
4
5
6
7

Does that mean we could implement indexing if we really wanted to? Yes. But should we? Is there a semantic meaning of indices for a set?

Why Sets

Couldn’t we accomplish the same operations with lists? We could. And lists actually do support some of the same operations as sets, e.g., in

>>> l = [1, 2, 3, 4]
>>> 4 in l
True
>>> "abc" in l
False

So why sets? The uniqueness invariant and performance. The set class is implemented in a way that speeds up those operations relative to lists. Consider implementing a function contains that implements the same functionality as in.

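def contains(values, item):
    # Compare item against each element in turn; stop at the first match.
    for val in values:
        if val == item:
            return True
    # Reached the end without finding a match.
    return False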

If we double the length of the list, how much longer will it take to execute contains? The time will also double. We would say contains executes in “linear” time with respect to the length of the input list.

In contrast, we can perform the same operation with a set in constant time on average with respect to the size of the set (only in rare worst cases, with many hash collisions, does it degrade toward linear time).

Let’s convince ourselves with data. Check out lists_vs_sets.py. This program includes functions for creating lists and sets of a given size (with random data) and then querying those data structures a given number of times. speed_test times how long it takes to create and query the different data structures.

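For smaller input sizes, creation times are similar and both kinds of query are fast in absolute terms, although set querying is already noticeably quicker: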

speed_test(1000, 100)
List creation took 0.0006480216979980469 seconds
Set creation took 0.0007088184356689453 seconds
--
List querying took 0.0009682178497314453 seconds
Set querying took 6.29425048828125e-05 seconds

As we increase the input sizes or the number of queries, we start to see real differences in performance:

speed_test(10000, 1000)
List creation took 0.006696939468383789 seconds
Set creation took 0.007429838180541992 seconds
--
List querying took 0.09478592872619629 seconds
Set querying took 0.0007238388061523438 seconds

Let’s investigate in a more systematic way:

speed_data(1000, 10000, 100000, 5000)
size    list    set
10000   0.09415793418884277 0.0006778240203857422
15000   0.139693021774292   0.0007507801055908203
20000   0.1837480068206787  0.0007188320159912109
25000   0.23158884048461914 0.000823974609375
30000   0.2761962413787842  0.0010371208190917969
35000   0.32499217987060547 0.00074005126953125
40000   0.3725881576538086  0.0007519721984863281
45000   0.41575098037719727 0.0007748603820800781
50000   0.5937619209289551  0.0007617473602294922
55000   0.5039329528808594  0.000782012939453125
60000   0.5555520057678223  0.001074075698852539
65000   0.596034049987793   0.0007562637329101562
70000   0.6440389156341553  0.0007801055908203125
75000   0.6970598697662354  0.001148223876953125
80000   0.7484962940216064  0.0017039775848388672
85000   0.7898149490356445  0.0008890628814697266
90000   0.8315770626068115  0.0009768009185791016
95000   0.888375997543335   0.0007839202880859375

We can plot this in Excel. We will learn how to plot from within Python later in the semester.
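If you want to try a comparison like this yourself, here is a minimal sketch using timeit from the standard library. It is not the course's lists_vs_sets.py; the names size, queries, targets and query_all are assumptions made for this example, and the exact times will differ from machine to machine.

```python
import random
import timeit

size = 10_000
queries = 1_000

# Hypothetical test data: the same random integers in a list and in a set.
as_list = [random.randrange(size * 10) for _ in range(size)]
as_set = set(as_list)
targets = [random.randrange(size * 10) for _ in range(queries)]

def query_all(collection):
    """Count how many targets are found in collection using `in`."""
    return sum(1 for t in targets if t in collection)

# Time one full pass over the queries for each data structure.
list_time = timeit.timeit(lambda: query_all(as_list), number=1)
set_time = timeit.timeit(lambda: query_all(as_set), number=1)

print("List querying took", list_time, "seconds")
print("Set querying took", set_time, "seconds")
```

Both structures give the same answers; only the time taken to produce them differs.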

When to use set vs list

Does order matter? Use a list. Is order unimportant but uniqueness or query performance matters? Use a set.

What are some situations when we might want to use a set? Membership testing, removing duplicates from a sequence, …
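For instance, de-duplication can be sketched in a couple of lines. The list of words here is made-up example data.

```python
words = ["tea", "milk", "tea", "sugar", "milk"]

unique = set(words)
print(len(words), len(unique))   # 5 3

# Converting to a set discards the original order; sort if order matters.
print(sorted(unique))            # ['milk', 'sugar', 'tea']
```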

Some thoughts about efficiency and performance

So far we have largely ignored questions about performance or efficiency. Our programs have been so small that efficiency has been a non-issue. But that is/will not always be the case. Has anyone encountered performance issues already?

One possible situation is in the standard deviation calculation in the data assignment. If you calculate the average in each loop iteration, your program could be very slow on the larger census datasets. We will talk more formally about computational complexity later in the semester, but we can start informally today.

We said finding an element in a list takes an amount of time that grows linearly with the size of the list, whereas finding an element in a set takes constant time on average (only in rare worst cases, with many hash collisions, does it degrade toward linear time). What about the time for computing the average in each loop iteration of the standard deviation calculation? That would increase as the square of the size of the dataset: if the size of the dataset is n, computing the average takes “order” n time (summing all n elements) and we are doing that computation for each data point, i.e. n times. n^2 can grow very quickly, even for “smallish” n.

We will continually be building up our “efficiency” toolbox throughout the semester. Our first two tools are:

  • Choose the “right” data structure for the task at hand
  • “Hoist” unchanging computations out of loops, that is compute the result once and assign to a variable to be used within the loop.
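The second tool can be sketched with the standard deviation example. This is only an illustration, assuming the population form of the standard deviation (dividing by n); the function names are made up and your assignment may define things differently.

```python
import math

def std_dev_slow(data):
    """Recomputes the mean on every iteration: roughly n * n steps."""
    total = 0
    for x in data:
        mean = sum(data) / len(data)   # unchanging, yet recomputed each time
        total += (x - mean) ** 2
    return math.sqrt(total / len(data))

def std_dev_fast(data):
    """Hoists the mean out of the loop: roughly n steps."""
    mean = sum(data) / len(data)       # computed once, before the loop
    total = 0
    for x in data:
        total += (x - mean) ** 2
    return math.sqrt(total / len(data))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev_slow(data), std_dev_fast(data))   # 2.0 2.0
```

Both functions return the same value; the only difference is how many times sum(data) is evaluated.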

Data structure categorization

An opinionated summary of “major” Python types:

Type        Ordered  Mutable  Mutable Values      Typical (but not only) Usage
List        Yes      Yes      Yes                 Ordered collection of variable length (often homogeneous)
Set         No       Yes      No                  Membership/Set operations
Tuple       Yes      No       Yes                 Heterogeneous (ordered) collection of fixed length
Dictionary  Yes-ish  Yes      Yes (but not keys)  Key -> Value lookup

What is the deal with Dictionary ordering? As of Python 3.7, dictionaries are specified to maintain their elements in insertion order. That is, unlike with sets, when you iterate through the elements of a dictionary, the elements will be in a known order.
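A quick sketch of insertion order, assuming Python 3.7 or later. The dictionary name and its contents are made up for this example.

```python
fruit_counts = {}
fruit_counts["banana"] = 3
fruit_counts["apple"] = 1
fruit_counts["cherry"] = 2

# Iteration follows the order in which keys were first inserted,
# not alphabetical or numerical order.
print(list(fruit_counts))   # ['banana', 'apple', 'cherry']

# Overwriting an existing key keeps its original position.
fruit_counts["banana"] = 10
print(list(fruit_counts))   # ['banana', 'apple', 'cherry']
```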

Histograms: A motivating question

A really common tool in data analysis is the histogram, typically implemented as a plot where the x-axis is bins and the y-axis is the count of items in each bin. We have already implemented a histogram analysis in the frequencies function in our programming assignment (or will shortly). But today let’s look at an easier, more generalizable and often faster approach to creating histograms.

On paper determine the histogram for the following data:

[1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]

How did you do it? Probably you kept a tally for each number. Each time you encountered a new number, you initialized its count at 1, and every time you encountered a previously observed number you incremented its count. That is, you were keeping track of two connected pieces of information, the number and its associated count. We could describe these as the “key” and “value”.

Dictionaries

Dictionaries are data structures that store keys and associated values, optimized for efficiently looking up the value by key. In other languages this data structure is called a map, hashmap or associative array. The Python type is dict.

A dictionary literal is created with { ... }, e.g.

>>> {}
{}
>>> d = { 5: 1, 6: 2 }
>>> d
{5: 1, 6: 2}

Note that { ... } is also used for sets. You indicate a dictionary with the <key> : <value> syntax (the colons are what Python uses to distinguish between dictionaries and sets). Note that the {} is an empty dictionary, not an empty set (empty sets can be created with the set initializer, e.g. set()). In the above, the integers 5 and 6 are the keys and the integers 1 and 2 are their respective values.
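We can confirm the distinction between the two uses of braces with type. The variable names below are illustrative.

```python
empty_dict = {}
empty_set = set()

print(type(empty_dict))   # <class 'dict'>
print(type(empty_set))    # <class 'set'>

# With contents, the colons decide which type we get.
print(type({1, 2}))       # <class 'set'>
print(type({1: 2}))       # <class 'dict'>
```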

We can efficiently access values with the indexing operator, e.g.

>>> d[5]
1
>>> d[6]
2
>>> d[1]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: 1

but we will get an error if the key is not in the dictionary. What if we aren’t sure the key is in the dictionary and we have a default value we would want to use instead? We can use get:

>>> d.get(1, 5)
5
>>> d.get(5, 6)
1

We can also use the index operator to add key-value pairs to the dictionary. Assigning a value to a key that does exist will overwrite the previous value; assigning a value to a key that does not exist will create that key (with that value) in the dictionary. Recall that the same is not true for a list. Assigning to an index outside the current “range” of the list is an error.

>>> d[3] = 7
>>> d
{5: 1, 6: 2, 3: 7}
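Returning to the histogram question from earlier, get and index assignment are enough to reproduce the paper tally. This is a minimal sketch; the names data, counts and number are chosen for this example.

```python
data = [1, 2, 3, 2, 3, 2, 1, 1, 5, 4, 4, 5]

counts = {}
for number in data:
    # get returns the current count, or 0 the first time we see this number.
    counts[number] = counts.get(number, 0) + 1

# Keys appear in the order each number was first encountered.
print(counts)   # {1: 3, 2: 3, 3: 2, 5: 2, 4: 2}
```

The number is the key and its tally is the value, exactly as in the paper exercise.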

Dictionary keys can be any “hashable” type (same as sets), and even mixed types. Keys are unique, that is you can’t have duplicate keys with different values. Recall that for our purposes, “hashable” implies “comparable” (so uniqueness of the keys can be enforced) and immutable so that prior comparisons aren’t invalidated by changing a key.

The values can be of any type, including mutable types like lists, etc (that means we can modify values in place). And a dictionary can have duplicate values.

>>> d["string"] = "test"
>>> d
{5: 1, 6: 2, 3: 7, 'string': 'test'}
>>> d["a_list"] = [1, 2, 3]
>>> d
{5: 1, 6: 2, 3: 7, 'string': 'test', 'a_list': [1, 2, 3]}
>>> d[3] += 5
>>> d
{5: 1, 6: 2, 3: 12, 'string': 'test', 'a_list': [1, 2, 3]}
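The difference between what is allowed for keys and for values can be sketched as follows. The dictionary named record and its contents are hypothetical.

```python
record = {"a_list": [1, 2, 3], "count": 7}

# Values may be mutable, so we can modify them in place.
record["a_list"].append(4)
record["count"] += 1
print(record)   # {'a_list': [1, 2, 3, 4], 'count': 8}

# Keys must be hashable: a tuple works as a key, a list does not.
record[(1, 2)] = "a tuple key"
list_key_rejected = False
try:
    record[[1, 2]] = "a list key"
except TypeError as err:
    list_key_rejected = True
    print("TypeError:", err)
```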

Much like lists and other data structures, dictionaries can be the argument to built-in functions like len, support operators like in and are also objects with various methods.

>>> len(d)
5
>>> 'a_list' in d
True
>>> dir(dict)
['__class__', '__contains__', '__delattr__', '__delitem__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', 'clear', 'copy', 'fromkeys', 'get', 'items', 'keys', 'pop', 'popitem', 'setdefault', 'update', 'values']

Some commonly used methods:

>>> d.keys()
dict_keys([5, 6, 3, 'string', 'a_list'])
>>> d.values()
dict_values([1, 2, 12, 'test', [1, 2, 3]])
>>> d.pop("a_list")
[1, 2, 3]
>>> d
{5: 1, 6: 2, 3: 12, 'string': 'test'}

What about iteration? How can we use for loops with dictionaries? There are actually several ways. Using a dictionary as the sequence in a for loop iterates over the keys, so the following two loops are equivalent:

>>> for k in d:
...     print(k)
... 
5
6
3
string
>>> for k in d.keys():
...     print(k)
... 
5
6
3
string

We can then use the keys to access the associated values. We can also iterate over the key-value tuples using the items method and tuple unpacking. items returns a set-like object of the dictionary’s items, which are (key, value) tuples. We can iterate over those tuples directly, as in the first loop below, or unpack the tuples into specific key and value variables, as in the second.

>>> help(dict.items)
Help on method_descriptor:

items(...)
    D.items() -> a set-like object providing a view on D's items

>>> for i in d.items(): 
...     print(i)
... 
(5, 1)
(6, 2)
(3, 12)
('string', 'test')
>>> for k,v in d.items(): 
...     print(k, "=>", v)
... 
5 => 1
6 => 2
3 => 12
string => test

What is going on in that last example? items provides an iterable of tuples. What is a tuple? That is our next topic…