Class 33: Searching and Sorting

Objectives for today

Searching in a list

Imagine you had a list of n numbers. What would be the worst-case asymptotic time complexity (Big-O) of finding a number in that list, e.g. 5 in a_list? \(O(n)\). What about the average case? Also \(O(n)\). In the worst case you will need to check all the numbers; in the best case, just the first; on average, n/2 (which is still \(O(n)\)). This algorithm is named “linear search”, because it sequentially tests each element of the list.
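As a minimal sketch (the function name and details here are ours, not necessarily what appears in sorting.py), linear search might look like:

def linear_search(a_list, target):
    """ Return the index of target in a_list, or None if not found """
    for i in range(len(a_list)):
        # Check each element in turn until we find the target
        if a_list[i] == target:
            return i
    return None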

That assumes you don’t know anything about the list. What if the list is in sorted order?

Think about finding a name in the phone book, or finding a specific page in a book. You don’t sequentially search all the names, you use the ordering to your advantage.

A “binary search” algorithm on a sorted list first compares the target to the middle value of the list; depending on the result of that comparison, the search continues in the upper or lower half of the list (discarding the other half). This process repeats, shrinking the list to be searched by half in each iteration, until the value is found (or the list is exhausted). Since we reduce the search space by half in each iteration, the worst-case number of iterations is \(\lfloor \log_2 n \rfloor + 1\) (the floor of \(\log_2 n\), plus 1), for \(O(\log_2 n)\) average and worst-case time complexity.
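A sketch of binary search, assuming a list sorted in ascending order (again, this helper is ours and may differ from any version in sorting.py):

def binary_search(sorted_list, target):
    """ Return the index of target in sorted_list, or None if not found """
    low = 0
    high = len(sorted_list) - 1
    while low <= high:
        mid = (low + high) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            # Target can only be in the upper half (if present)
            low = mid + 1
        else:
            # Target can only be in the lower half (if present)
            high = mid - 1
    return None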

Sorting: One of the most important algorithms

Sorting is one of our key algorithmic building blocks. For example, we have already used sort in our implementation of median and our initial implementation of histogram. Also knowing about sorting will help you run for president.

An objective for today is to investigate several different approaches to sorting, connecting them to our discussions of time complexity and recursion. All of our examples today are in sorting.py.

Define Sorting

Input : A list of comparable values, e.g. numbers

Output : The list values ordered such that value[i] <= value[j] for all i < j

Thus we need both the data to sort and a comparison function with which to order the values (this is termed comparison-based sorting).

Selection Sort: Our First Sorting Algorithm

What must be true of the 1st value in our output, i.e. value[0]? It is the smallest. And the 2nd value? It must be the 2nd smallest. And so on. Thus we could sort a list by finding the smallest value and moving it to the “front”, and then repeating the process for the remaining elements of the list.

Think back to our discussion of time complexity: what is the asymptotic complexity of this algorithm? How many iterations of the “i loop” are there? n. And the “j loop”? n/2 (on average). So there are \(O(n^2)\) loop iterations. What about operations? What constitutes an “operation” in this context? Here we count the comparison as an operation, and there is one comparison per iteration of the inner loop. So the overall time complexity for selection sort is \(O(n^2)\).

def selection_sort(a_list):
    """ Sort the list in place using the selection sort algorithm """
    
    # In each iteration, find the next smallest element in the list
    # and swap it into place
    for i in range(len(a_list)):
        # Find the index of the smallest value from i onwards
        min_index = i
        min_value = a_list[i]        
        
        for j in range(i+1, len(a_list)):
            if a_list[j] < min_value:
                min_index = j
                min_value = a_list[j]
                
        # Swap i and min_index
        a_list[i], a_list[min_index] = a_list[min_index], a_list[i]

Are we returning anything here? No, we are modifying the a_list parameter in place.

PI Questions 1

Merge Sort: It's Linearithmic!

I think we can do better than selection sort. A key insight: We can efficiently merge two sorted lists. How so? We just need to compare the two front elements, move the smaller to a new list and repeat.
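For example, a merge helper might look like the following (a sketch; the actual helper in sorting.py may differ):

def merge(left, right):
    """ Merge two sorted lists into a new sorted list """
    result = []
    i = j = 0
    while i < len(left) and j < len(right):
        # Move the smaller front element to the result
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; append the remainder of the other
    result.extend(left[i:])
    result.extend(right[j:])
    return result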

With that insight we can develop a recursive approach to sorting that is more time efficient than \(O(n^2)\). That is, to sort a list, we merge the results of sorting the first and second halves of the list. Using our recursive design approach:

  1. Define the function header, including the parameters

     def merge_sort(a_list):
         """ Return a new sorted list """
    
  2. Define the recursive case

    Split the list in half, sort each half independently, then merge the two halves.

  3. Define the base case

    A single element list is by definition sorted.

  4. Put it all together. Check out merge_sort in sorting.py.
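Putting the pieces together, merge_sort might look like this (a sketch built on the merge helper above; sorting.py's implementation may differ in its details):

def merge_sort(a_list):
    """ Return a new sorted list """
    # Base case: a list of zero or one elements is already sorted
    if len(a_list) <= 1:
        return a_list
    # Recursive case: sort each half, then merge the sorted halves
    middle = len(a_list) // 2
    left = merge_sort(a_list[:middle])
    right = merge_sort(a_list[middle:])
    return merge(left, right)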

What is the time complexity of merge sort? How many times will we split the list, i.e. how many split levels will we have? \(O(\log_2 n)\). And for each “level” of split, how many comparisons will we do? \(O(n)\). So the total complexity is \(O(n \log_2 n)\). A big improvement! As an aside, this is an example of a “Divide and Conquer” algorithm.

PI Questions 1

Real World Performance

Much like we did with list vs. set query performance, we can test the performance of our sorting implementations. In sorting.py, there is a test setup for selection sort, merge sort and the Python built-in sorted function.
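A minimal sketch of how such a timing comparison could be set up (the harness in sorting.py may differ; the input sizes here are just illustrative, and we assume selection_sort and merge_sort are defined as above):

import random
import timeit

# Hypothetical timing harness; the setup in sorting.py may differ
for n in [100, 1000, 10000]:
    data = [random.random() for _ in range(n)]
    # Time each approach on a fresh copy of the same data
    selection_time = timeit.timeit(lambda: selection_sort(list(data)), number=1)
    merge_time = timeit.timeit(lambda: merge_sort(list(data)), number=1)
    builtin_time = timeit.timeit(lambda: sorted(data), number=1)
    print(n, selection_time, merge_time, builtin_time)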

What do you observe? We see a clear difference between selection sort and merge sort, as we would expect based on their different asymptotic complexities. But we also see a substantial difference between the built-in sorted and our merge sort.

It would appear that sorted runs in constant time. But that is just an artifact of the small input sizes. The average time complexity of sorted is actually the same, \(O(n \log_2 n)\), but the constants are very different. The sorted implementation (an algorithm called Timsort) is a hybrid of insertion sort and merge sort that is very efficient (it doesn't incur the overheads of our pure Python implementation of merge sort).

More generally, there is more to thinking about performance than just average asymptotic complexity. We may also need to consider how the best, average, and worst cases differ:

Both selection sort and merge sort have identical best-case, average, and worst-case complexity, i.e., selection sort's best, average, and worst-case complexities are all the same, but that is not true of all sorting algorithms. In fact, Python's built-in sorting algorithm has \(O(n \log_2 n)\) average and worst-case time complexity but \(O(n)\) best-case time complexity. The insertion sort algorithm included in sorting.py (which we didn't talk about) has \(O(n)\) best-case complexity, but \(O(n^2)\) average and worst-case time complexity.
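For reference, a typical insertion sort looks like the following (a sketch; the version in sorting.py may differ). On an already-sorted list the inner while loop never executes, which is why the best case is \(O(n)\):

def insertion_sort(a_list):
    """ Sort the list in place using the insertion sort algorithm """
    for i in range(1, len(a_list)):
        value = a_list[i]
        j = i - 1
        # Shift larger elements to the right to make room for value
        while j >= 0 and a_list[j] > value:
            a_list[j + 1] = a_list[j]
            j -= 1
        a_list[j + 1] = value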

Why should we know the time complexity of sorting?

Sorting is both a valuable tool for learning about recursion, complexity analysis, etc. and a key building block for many programs. Better understanding the time complexity of sorting helps us make better implementation decisions. Consider the approaches to histogram we saw in the frequencies function in lab5 vs. the dictionary-based histogram.py. Knowing what we know now, which approach would you choose?

Recall that frequencies sorts the list before calculating the length of runs. As a result it performs roughly \(n \log_2 n + n\) operations, making it an \(O(n \log_2 n)\) algorithm. What about the dictionary-based approach? At a minimum we need to traverse the entire list, so it will be at least \(O(n)\). In Python, querying and updating a dictionary is generally \(O(1)\), so the total complexity is \(O(n)\). Given that, we generally want to use the dictionary-based implementation, especially when the input is large.
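As a sketch, the dictionary-based approach makes a single \(O(n)\) pass over the data (the actual code in histogram.py may differ):

def histogram(values):
    """ Return a dictionary mapping each value to its count """
    counts = {}
    for value in values:
        # Querying and updating the dictionary is O(1) on average
        counts[value] = counts.get(value, 0) + 1
    return counts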

There Are Many Sorting Algorithms

You will learn a lot more about the many different sorting algorithms in CS201 and other classes.