Class 17

Searching and sorting

Objectives for today

  • Describe algorithms for linear and binary search
  • Describe algorithms for selection sort, insertion sort and merge sort
  • Recall the asymptotic runtime complexity of different search and sort algorithms

Searching in a list

Imagine you had a list of n numbers. What would be the worst-case asymptotic time complexity (Big-O) of finding a number in that list, e.g. 5 in a_list? \(O(n)\). What about the average case? Also \(O(n)\). In the worst case you will need to check all numbers. In the best case, just the first. On average, n/2 (or \(O(n)\)). This algorithm is named “linear search”, because it sequentially tests each element of the list. A possible implementation could be:

def linear_search(a_list, item):
    """Return index of item in a_list or None if not found

    Args:
        a_list: A sequence of comparable values
        item: A value to search for in a_list

    Returns:
        Index of item or None if not found
    """
    for i in range(len(a_list)):
        if item == a_list[i]:
            return i
    return None
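
For example, a couple of calls to linear_search (the specific list and values are just illustrative):

linear_search([3, 1, 4, 1, 5], 4)   # returns 2, the index of the first match
linear_search([3, 1, 4, 1, 5], 7)   # returns None, since 7 is not in the list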

That assumes you don’t know anything about the list. What if you knew the list was in sorted order? Think about finding a name in the phone book, or finding a specific page in a book. You don’t sequentially search all the names; you use the ordering to your advantage.

A “binary search” algorithm on a sorted list first compares the item to the middle value of the list; depending on the result of that comparison, the search continues in the upper or lower half of the list (discarding the other half). This process repeats until the value is found (or the remaining list is empty), shrinking the list to be searched by half in each iteration. How could you implement this algorithm?

If we think about the description above, we realize we are applying the same search algorithm to successively smaller portions of the list, i.e., it is a recursive algorithm. We could phrase our recurrence relation as “the index of an item in a sorted list is the middle index, if the middle value is the item, or the index of the item in the relevant half of the list”. What about the base case? The smallest input would be the empty list, which by definition can’t include the item, and so we should return None. Putting it together:

def binary_search(a_list, item, lo, hi):
     """Return index of item in a sorted a_list or None if not found

    Args:
        a_list: A sequence of comparable values sorted in ascending order
        item: A value to search for in a_list
        lo: Inclusive start index to search in a_list
        hi: Inclusive end index to search in a_list

    Returns:
        Index of item or None if not found
    """
    if lo > hi:
        return None
    else:
        middle = (lo + hi) // 2
        middle_elem = a_list[middle]
        if item == middle_elem:
            return middle
        elif item < middle_elem:
            return binary_search(a_list, item, lo, middle-1)
        else:
            return binary_search(a_list, item, middle + 1, hi)
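
Since binary_search takes inclusive lo and hi indices, searching the whole list means supplying the full bounds. For example (with illustrative values):

values = [1, 3, 5, 7, 9, 11]
binary_search(values, 5, 0, len(values) - 1)   # returns 2
binary_search(values, 4, 0, len(values) - 1)   # returns None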

What is the time complexity? Since we reduce the search space by half in each iteration, the worst-case number of recursive calls is \(\lfloor \log_2 n \rfloor + 1\). Each recursive call does a constant amount of work (e.g., \(c\) operations where \(c\) is a constant) to split the input and check the middle for equality. Thus, the total time is \(c(\lfloor \log_2 n \rfloor + 1)\), giving \(O(\log_2 n)\) average-case and worst-case time complexity. For example, a sorted list of 1,000,000 elements requires at most 20 recursive calls, since \(2^{20}\) is just over 1,000,000.

Sorting: One of the most important algorithms

Sorting is one of our key algorithmic building blocks. For example, we have already used sort in our implementation of median and our initial implementation of histogram (via the frequencies function). Also knowing about sorting will help you run for president.

An objective for today is to investigate several different approaches to sorting, connecting our discussion of time complexity and recursion. All our examples today are in sorting.py.

Define Sorting

Input
A list of comparable values, e.g. numbers
Output
The list values ordered such that value[i] <= value[j] for all i < j

Thus we need both the data to sort and a comparison function with which to order the values (this is termed “comparison-based sorting”).

Selection Sort: Our First Sorting Algorithm

What must be true of the 1st value in our output, i.e. value[0]? It is the smallest. And the 2nd value? It must be the 2nd smallest. And so on. Thus we could sort a list by finding the smallest value, moving it to the “front”, and then repeating the process for the remaining elements in the list.

Think back to our discussion of time complexity: what is the asymptotic complexity of this algorithm? How many iterations of the “outer” loop (with index i) are there? n. And the “inner” loop (with index j)? n/2 (on average). So there are \(O(n^2)\) loop iterations. What about operations? What constitutes an “operation” in this context? Here we would count the comparison as an operation. So the overall time complexity for selection sort is \(O(n^2)\).

def selection_sort(a_list):
    """ Sort the list in place using the selection sort algorithm """
    
    # In each iteration, find the next smallest element in the list
    # and swap it into place
    for i in range(len(a_list)):
        # Find the index of the smallest value from i onwards
        min_index = i
        min_value = a_list[i]        
        
        for j in range(i+1, len(a_list)):
            if a_list[j] < min_value:
                min_index = j
                min_value = a_list[j]
                
        # Swap i and min_index
        a_list[i], a_list[min_index] = a_list[min_index], a_list[i]

Are we returning anything here? No, we are modifying the a_list parameter in place.
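
For example (with illustrative values):

values = [3, 1, 2]
selection_sort(values)   # returns None; values is modified in place
print(values)            # prints [1, 2, 3]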

Insertion Sort

Think about how you might sort playing cards in your hand. I suspect you don’t use Selection Sort. Instead, I suspect you treat the first card as sorted, and “slide” each subsequent card to its respective spot in the sorted portion of your hand, i.e., insert it in the correct position in the sorted portion. That algorithm is effectively insertion sort.

def insertion_sort(a_list):
    """
    Sort list in place using the insertion sort algorithm

    Args:
        a_list : List to sort in place
    """
    for i in range(1,len(a_list)):
        # Values at [0,i-1] are sorted already
        # Shift up all values in [0,i-1] greater than a_list[i]
        value = a_list[i]
        index = i
        
        while index > 0 and a_list[index-1] > value:
            a_list[index] = a_list[index-1]
            index -= 1
        # Now insert value (old a_list[i]) in its proper place
        a_list[index] = value
        # Now everything from 0...i is sorted

What is the worst-case asymptotic complexity of this algorithm? How many iterations of the “outer” loop (with index i) are there? n. And the “inner” while loop? n/2 (on average). So there are \(O(n^2)\) loop iterations. Like Selection Sort, Insertion Sort has a worst-case time complexity of \(O(n^2)\).

Note that I purposely specified worst-case. We previously just talked about time complexity, but more precisely we want to think about average-case time complexity, best-case time complexity and worst-case time complexity. For many algorithms, including Selection Sort, all three are the same, but not always. Insertion Sort, for example, has a best-case time complexity of \(O(n)\): if the list is already sorted, the inner while loop never executes.

Merge Sort: It’s Linearithmic!

I think we can do better than Selection Sort or Insertion Sort. A key insight: We can efficiently merge two sorted lists. How so? We just need to compare the two front elements, move the smaller to a new list and repeat.
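
The merge_sort function in sorting.py has its own merge step; the sketch below is one possible way to implement that merging of two sorted lists (the function name and details are illustrative):

def merge(list1, list2):
    """Return a new sorted list containing all values from sorted lists list1 and list2"""
    merged = []
    i = j = 0
    # Repeatedly move the smaller of the two "front" elements into the new list
    while i < len(list1) and j < len(list2):
        if list1[i] <= list2[j]:
            merged.append(list1[i])
            i += 1
        else:
            merged.append(list2[j])
            j += 1
    # One list is exhausted; append whatever remains of the other (already sorted)
    merged.extend(list1[i:])
    merged.extend(list2[j:])
    return merged

Notice that each comparison moves one element into the result, so merging two lists with a total of n elements requires \(O(n)\) comparisons.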

With that insight we can develop a recursive approach to sorting that is more time efficient than \(O(n^2)\): to sort a list, merge the results of sorting the first and second halves of the list. Using our recursive design approach:

  1. Define the function header, including the parameters

    def merge_sort(a_list):
        """ Return a new sorted list """
  2. Define the recursive case

    Split the list in half, sort each half independently, then merge the two halves.

  3. Define the base case

    A list with zero or one elements is by definition sorted.

  4. Put it all together. Check out merge_sort in sorting.py; one possible sketch is shown below.
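
A possible sketch, using the merge function from above (the actual merge_sort in sorting.py may differ in its details):

def merge_sort(a_list):
    """Return a new sorted list"""
    # Base case: a list with zero or one elements is already sorted
    if len(a_list) <= 1:
        return list(a_list)
    # Recursive case: sort each half independently, then merge the two sorted halves
    middle = len(a_list) // 2
    return merge(merge_sort(a_list[:middle]), merge_sort(a_list[middle:]))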

What is the time complexity of merge sort? How many times will we split the list, i.e. how many split levels will we have? \(O(\log_2 n)\). And for each “level” of split, how many comparisons will we do? \(O(n)\). So the total complexity is \(O(n \log_2 n)\). A big improvement!

Merge sort is a canonical example of a “Divide and Conquer” algorithm.

Real World Performance

Much like we did with list vs. set query performance, we can test the performance of our sorting implementations. In sorting.py, there is a test setup for selection sort, merge sort and the Python built-in sorted function.
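
The exact test setup is in sorting.py; a minimal sketch of how such a comparison might be written (function and variable names are illustrative, and it assumes the selection_sort and merge_sort functions above are in scope) is:

import random
import timeit

def time_sorts(n):
    """Print the time to sort a random list of n values with each implementation"""
    data = [random.random() for _ in range(n)]
    for name, sort_fn in [("selection_sort", selection_sort),
                          ("merge_sort", merge_sort),
                          ("sorted", sorted)]:
        # Sort a fresh copy each run so every implementation sees the same unsorted input
        elapsed = timeit.timeit(lambda: sort_fn(list(data)), number=10)
        print(f"{name}: {elapsed:.4f} seconds for 10 runs with n={n}")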

What do you observe? We see a clear difference between selection sort and merge sort, as we would expect based on their different asymptotic complexities. But we also see a substantial difference between the built-in sorted and merge sort.

It would appear that sorted runs in constant time. But that is just an artifact of the small input sizes. The average-case and worst-case time complexity of sorted is actually the same as merge sort’s, \(O(n \log_2 n)\), but the constants are very different. The sorted function (an algorithm called Timsort) is a hybrid of insertion sort and merge sort that is both an efficient algorithm (like insertion sort it has a best-case time complexity of \(O(n)\)) and an efficient implementation. It is implemented in a lower-level language and so it doesn’t incur the overheads of our pure Python implementation of merge sort.

More generally, there is more to thinking about performance than just time complexity. We may also want to consider space complexity, that is, how much memory an algorithm might use. Memory usage can often be the limiting factor; while we can wait longer for a result, our computers have finite memory, which can set a limit on the size of problem we can solve.

Knowing the time complexity of sorting

Sorting is both a valuable tool for learning about recursion, complexity analysis, etc. and a key building block for many programs. Better understanding the time complexity of sorting helps us make better implementation decisions. Consider the approaches to histogram we saw in the frequencies function in PA5 vs. the dictionary-based histogram.py. Knowing what we know now, which approach would you choose?

Recall that frequencies sorts the list before calculating the length of runs. As a result it is an \(O(n \log_2 n + n)\), or \(O(n \log n)\), algorithm. The first term is the time complexity for the sorting operation, the second is the time complexity for scanning the sorted list for the ends of “runs”.
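
The PA5 code is not reproduced here, but a sort-then-scan frequencies function might look roughly like this (names and details are illustrative):

def frequencies(a_list):
    """Return a list of (value, count) pairs for the values in a_list"""
    sorted_list = sorted(a_list)   # O(n log n) sort
    counts = []
    i = 0
    while i < len(sorted_list):    # O(n) scan for the ends of "runs"
        j = i
        # Advance j to the end of the run of equal values starting at i
        while j < len(sorted_list) and sorted_list[j] == sorted_list[i]:
            j += 1
        counts.append((sorted_list[i], j - i))
        i = j
    return counts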

What about the dictionary-based approach? At a minimum we need to traverse the entire list, so it will be at least \(O(n)\). What other information do we need? The time to get and set a key-value pair in the dictionary. In Python, getting and setting a dictionary entry is amortized constant time, i.e., \(O(1)\). So the total complexity is \(O(n)\). Given that, we would want to use the dictionary-based implementation, especially when the input is large!
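
A dictionary-based histogram can be sketched in just a few lines (names are illustrative; the version in histogram.py may differ):

def histogram_counts(a_list):
    """Return a dictionary mapping each value in a_list to the number of times it occurs"""
    counts = {}
    for value in a_list:                           # O(n) iterations
        counts[value] = counts.get(value, 0) + 1   # amortized O(1) dictionary get and set
    return counts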

What is “amortized” constant time? “Amortized” constant time means that, on average over a large number of operations, the time complexity is constant. Any one call, e.g., any one set operation, may take longer. Here, since we are setting lots of values in the dictionary, the average time is what matters.

Other Sorting Algorithms

You will learn a lot more about the many different sorting algorithms in CS201 and other classes.