Imagine you had a list of n numbers. What would be the worst-case asymptotic time complexity (Big-O) of finding a number in that list, e.g. 5 in a_list? \(O(n)\). What about the average case? Also \(O(n)\). In the worst case you will need to check all n numbers. In the best case, just the first. On average, n/2 (which is still \(O(n)\)). This algorithm is named “linear search”, because it sequentially tests each element of the list.
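As a minimal sketch (for illustration; not necessarily the version in any of our class code), linear search might look like:

def linear_search(a_list, target):
    """ Return the index of target in a_list, or None if not found """
    for i, value in enumerate(a_list):
        if value == target:
            return i
    return None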
That assumes you don’t know anything about the list. What if the list is in sorted order?
Think about finding a name in the phone book, or finding a specific page in a book. You don’t sequentially search all the names, you use the ordering to your advantage.
A “binary search” algorithm on a sorted list first compares the target against the middle value of the list; depending on the result of that comparison, the search continues in the upper or lower half of the list (discarding the other half). This process repeats, shrinking the list to be searched by half in each iteration, until the value is found (or the remaining list is empty). Since we reduce the search space by half in each iteration, the worst-case number of iterations is \(\lfloor \log_2 n \rfloor + 1\), for \(O(\log_2 n)\) average and worst case time complexity.
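As a sketch of the idea (an iterative version for illustration; an actual implementation could also be written recursively):

def binary_search(a_list, target):
    """ Return the index of target in sorted a_list, or None if not found """
    low, high = 0, len(a_list) - 1
    while low <= high:
        mid = (low + high) // 2
        if a_list[mid] == target:
            return mid
        elif a_list[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return None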
Sorting is one of our key algorithmic building blocks. For example, we have already used sort in our implementation of median and our initial implementation of histogram. Also, knowing about sorting will help you run for president.
An objective for today is to investigate several different approaches to sorting, connecting our discussion of time complexity and recursion. All our examples today are in sorting.py.
Input: A list of comparable values, e.g. numbers
Output: The list values ordered such that value[i] <= value[j] for all i < j
Thus we need both the data to sort and a comparison function with which to order the values (this is termed comparison-based sorting).
What must be true of the 1st value in our output, i.e. value[0]? It is the smallest. And the 2nd value? It must be the 2nd smallest. And so on. Thus we could sort a list by finding the smallest value and moving it to the “front”, and then repeating the process for the remaining elements of the list.
Think back to our discussion of time complexity: what is the asymptotic complexity of this algorithm? How many iterations of the “i loop” are there? n. And the “j loop”? n/2 (on average). So there are \(O(n^2)\) loop iterations. What about operations? What constitutes an “operation” in this context? Here we would count the comparison as an operation, so the overall time complexity for selection sort is \(O(n^2)\).
def selection_sort(a_list):
    """ Sort the list in place using the selection sort algorithm """
    # In each iteration, find the next smallest element in the list
    # and swap it into place
    for i in range(len(a_list)):
        # Find the index of the smallest value from i onwards
        min_index = i
        min_value = a_list[i]
        for j in range(i+1, len(a_list)):
            if a_list[j] < min_value:
                min_index = j
                min_value = a_list[j]
        # Swap i and min_index
        a_list[i], a_list[min_index] = a_list[min_index], a_list[i]
Are we returning anything here? No, we are modifying the a_list parameter in place.
PI Questions1
I think we can do better than selection sort. A key insight: We can efficiently merge two sorted lists. How so? We just need to compare the two front elements, move the smaller to a new list and repeat.
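For example, a merge helper along these lines (a sketch; the merge in sorting.py may differ in its details) captures that idea:

def merge(left, right):
    """ Merge two sorted lists into a new sorted list """
    result = []
    i = j = 0
    # Repeatedly move the smaller of the two front elements into the result
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            result.append(left[i])
            i += 1
        else:
            result.append(right[j])
            j += 1
    # One list is exhausted; whatever remains of the other is already sorted
    result.extend(left[i:])
    result.extend(right[j:])
    return result

Notice that each element is moved into the result exactly once, so merging two lists with a total of n elements takes \(O(n)\) time.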
With that insight we can develop a recursive approach to sorting that is more time efficient than \(O(n^2)\). That is, to sort a list, merge the results of sorting the first and second halves of the list. Using our recursive design approach:
Define the function header, including the parameters
def merge_sort(a_list):
    """ Return a new sorted list """
Define the recursive case
Split the list in half, sort each half independently, then merge the two halves.
Define the base case
A single element list is by definition sorted.
Put it all together. Check out merge_sort in sorting.py.
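Putting those pieces together, the implementation might look something like this sketch (using the merge helper sketched above; the version in sorting.py may differ in its details):

def merge_sort(a_list):
    """ Return a new sorted list """
    # Base case: a list with zero or one elements is already sorted
    if len(a_list) <= 1:
        return a_list
    # Recursive case: sort each half independently, then merge the sorted halves
    middle = len(a_list) // 2
    return merge(merge_sort(a_list[:middle]), merge_sort(a_list[middle:]))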
What is the time complexity of merge sort? How many times will we split the list, i.e. how many levels of splits will we have? \(O(\log_2 n)\). And for each “level” of splits, how many comparisons will we do? \(O(n)\). So the total complexity is \(O(n \log_2 n)\). A big improvement! As an aside, this is an example of a “Divide and Conquer” algorithm.
PI Questions1
Much like we did with list vs. set query performance, we can test the performance of our sorting implementations. In sorting.py, there is a test setup for selection sort, merge sort and the Python built-in sorted function.
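As a sketch of how such a timing comparison could be set up (the actual test code in sorting.py may be organized differently), we could use the timeit module:

import random
import timeit

def time_sorts(n):
    """ Time each sorting implementation on a random list of n values """
    data = [random.random() for _ in range(n)]
    implementations = [("selection_sort", selection_sort), ("merge_sort", merge_sort), ("sorted", sorted)]
    for name, sort_fn in implementations:
        # Sort a fresh copy each time so every implementation gets the same unsorted input
        seconds = timeit.timeit(lambda: sort_fn(list(data)), number=10)
        print(f"{name} (n={n}): {seconds:.4f} seconds")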
What do you observe? We see a clear difference between selection sort and merge sort, as we would expect based on their different asymptotic complexities. But we also see a substantial difference between the built-in sorted and merge sort.
It would appear that sorted runs in constant time. But that is just an artifact of the small input sizes. The average time complexity of sorted is actually the same, \(O(n \log_2 n)\), but the constants are very different. The sorted implementation (an algorithm called Timsort) is a hybrid of insertion sort and merge sort that is very efficient (it doesn’t incur the overheads of our pure Python implementation of merge sort).
More generally, there is more to thinking about performance than just average asymptotic complexity. We also may need to consider:
Both selection sort and merge sort have identical best case, average case and worst case complexity, i.e., selection sort’s average and worst case complexity are the same, but that is not true of all sorting algorithms. In fact, Python’s built-in sorting algorithm has \(O(n \log_2 n)\) average and worst case time complexity but \(O(n)\) best case time complexity. The insertion sort algorithm included in sorting.py (but which we didn’t talk about) has \(O(n)\) best case complexity, but \(O(n^2)\) average and worst case time complexity.
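To see where insertion sort’s \(O(n)\) best case comes from, here is a sketch of the algorithm (the version in sorting.py may differ in its details). On an already sorted list the inner while loop never executes, so each element requires only a single comparison:

def insertion_sort(a_list):
    """ Sort the list in place using the insertion sort algorithm """
    for i in range(1, len(a_list)):
        value = a_list[i]
        j = i - 1
        # Shift larger elements one position to the right until we find value's spot
        while j >= 0 and a_list[j] > value:
            a_list[j + 1] = a_list[j]
            j -= 1
        a_list[j + 1] = value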
Sorting is both a valuable tool for learning about recursion, complexity analysis, etc. and a key building block for many programs. Better understanding the time complexity of sorting helps us make better implementation decisions. Consider the approaches to histogram we saw in the frequencies function in lab5 vs. the dictionary-based histogram.py. Knowing what we know now, which approach would you choose?
Recall that frequencies sorts the list before calculating the length of runs. As a result it performs on the order of \(n \log_2 n + n\) operations, making it an \(O(n \log n)\) algorithm. What about the dictionary-based approach? At a minimum we need to traverse the entire list, so it will be at least \(O(n)\). In Python, querying and updating a dictionary is generally \(O(1)\), so the total complexity is \(O(n)\). Given that, we generally want to use the dictionary-based implementation, especially when the input is large.
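As a sketch of that dictionary-based approach (assuming a histogram that simply counts occurrences; the actual histogram.py may differ):

def histogram(a_list):
    """ Return a dictionary mapping each value in a_list to its count """
    counts = {}
    for value in a_list:
        # Dictionary lookup and update are both O(1) on average,
        # so this loop is O(n) overall
        counts[value] = counts.get(value, 0) + 1
    return counts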
You will learn a lot more about the many different sorting algorithms in CS201 and other classes.