Imagine you had a list of n numbers. What would be the worst-case
asymptotic time complexity (Big-O) of finding a number in that list, e.g.
in a_list? \(O(n)\). What about the average case? Also
\(O(n)\). In the worst case you will need to check all n numbers. In
the best case, just the first. On average, about n/2, which is still \(O(n)\).
This algorithm is named “linear search”, because it will sequentially test each
element of the list.
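As a minimal sketch of the idea (the name `linear_search` and the convention of returning -1 when the value is missing are assumptions made for illustration, not part of sorting.py):

```python
def linear_search(a_list, target):
    """Return the index of target in a_list, or -1 if it is not present."""
    # Test each element in turn; in the worst case we look at all n elements
    for index, value in enumerate(a_list):
        if value == target:
            return index
    return -1
```

Python's built-in `in` operator on a list performs the same kind of sequential scan.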
That assumes you don’t know anything about the list. What if the list is in sorted order?
Think about finding a name in the phone book, or finding a specific page in a book. You don’t sequentially search all the names; you use the ordering to your advantage.
A “binary search” algorithm on a sorted list first compares the target to the middle value of the list. Depending on the result of that comparison, the search continues in the upper or lower half of the list (discarding the other half). This process repeats until the value is found or no elements remain, halving the list to be searched in each iteration. Since we reduce the search space by half in each iteration, the worst-case number of iterations is \(\left\lfloor \log_2 n + 1 \right\rfloor\) (the floor of \(\log_2 n + 1\)), for \(O(\log_2 n)\) average and worst case time complexity.
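One way binary search might be written is sketched below. The function name, parameter names and the -1 "not found" return value are assumptions for illustration; the list must already be sorted in ascending order.

```python
def binary_search(sorted_list, target):
    """Return an index of target in sorted_list, or -1 if it is not present."""
    low, high = 0, len(sorted_list) - 1
    # The search space is sorted_list[low:high+1]; it halves in each iteration
    while low <= high:
        mid = (low + high) // 2
        if sorted_list[mid] == target:
            return mid
        elif sorted_list[mid] < target:
            # Target can only be in the upper half
            low = mid + 1
        else:
            # Target can only be in the lower half
            high = mid - 1
    return -1
```

For a list of 5 elements the loop runs at most 3 times, matching \(\left\lfloor \log_2 5 + 1 \right\rfloor = 3\).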
Sorting is one of our key algorithmic building blocks. For example, we have
used sort in our implementation of median and our initial
implementation of histogram. Also knowing about sorting will help you run for
An objective for today is to investigate several different approaches to sorting, connecting our discussion of time complexity and recursion. All our examples today are in sorting.py.
Input: A list of comparable values, e.g. numbers
Output: The list values ordered such that
value[i] <= value[j] for all i < j
Thus we need both the data to sort and a comparison function with which to order the values (this is termed comparison-based sorting).
What must be true of the 1st value in our output, i.e.
value[0]? It is the
smallest. And the 2nd value? It must be the 2nd smallest. And so on. Thus we
could sort a list by finding the smallest value, moving it to the “front”,
and then repeating the process for the remaining elements in the list. This approach is called selection sort.
Think back to our discussion of time complexity. What is the asymptotic complexity of this algorithm? How many iterations of the “i loop” are there? n. And the “j loop”? n/2 (on average). So there are \(O(n^2)\) loop iterations. What about operations? What constitutes an “operation” in this context? Here we would count the comparison as an operation. So the overall time complexity for selection sort is \(O(n^2)\).
```python
def selection_sort(a_list):
    """
    Sort the list in place using the selection sort algorithm
    """
    # In each iteration, find the next smallest element in the list
    # and swap it into place
    for i in range(len(a_list)):
        # Find the index of the smallest value from i onwards
        min_index = i
        min_value = a_list[i]
        for j in range(i+1, len(a_list)):
            if a_list[j] < min_value:
                min_index = j
                min_value = a_list[j]
        # Swap i and min_index
        a_list[i], a_list[min_index] = a_list[min_index], a_list[i]
```
Are we returning anything here? No, we are modifying the
a_list parameter in place.
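To make the \(O(n^2)\) claim concrete, we can count the comparisons directly. The inner loop runs \((n-1) + (n-2) + \dots + 0 = n(n-1)/2\) times in total. The sketch below uses a hypothetical helper, `count_selection_sort_comparisons`, that mirrors the two loops of selection sort without touching any data:

```python
def count_selection_sort_comparisons(n):
    """Count the comparisons selection sort performs on a list of length n."""
    comparisons = 0
    for i in range(n):
        for j in range(i + 1, n):
            # One a_list[j] < min_value comparison per inner iteration
            comparisons += 1
    return comparisons


# Compare the measured count against the closed form n(n-1)/2
for n in [10, 100, 1000]:
    print(n, count_selection_sort_comparisons(n), n * (n - 1) // 2)
```

The count does not depend on the contents of the list, which is why selection sort's best, average and worst cases are all quadratic.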
I think we can do better than selection sort. A key insight: We can efficiently merge two sorted lists. How so? We just need to compare the two front elements, move the smaller to a new list and repeat.
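The merge step described above could be sketched as follows. The name `merge` and its parameters are assumptions for illustration, and the version in sorting.py may differ; both inputs must already be sorted.

```python
def merge(left, right):
    """Return a new sorted list containing all values of two sorted lists."""
    merged = []
    i, j = 0, 0
    # Repeatedly compare the two front elements and move the smaller one
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    # One list is exhausted; the rest of the other is already in order
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Each element is appended exactly once, so merging lists with n elements in total takes \(O(n)\) time.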
With that insight we can develop a recursive approach to sorting that is more time efficient than \(O(n^2)\). That is, to sort a list, we merge the results of sorting its first and second halves. Using our recursive design approach:
Define the function header, including the parameters
```python
def merge_sort(a_list):
    """ Return a new sorted list """
```
Define the recursive case
Split the list in half, sort each half independently, then merge the two halves.
Define the base case
A single element list is by definition sorted.
Put it all together. Check out
merge_sort in sorting.py.
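Putting the steps above together might look roughly like the sketch below. It is not necessarily the code in sorting.py: the merge is written inline to keep the example self-contained, and the base case also covers the empty list (an assumption beyond the single-element case stated above).

```python
def merge_sort(a_list):
    """ Return a new sorted list """
    # Base case: a list of zero or one elements is already sorted
    if len(a_list) <= 1:
        return list(a_list)

    # Recursive case: split in half and sort each half independently
    mid = len(a_list) // 2
    left = merge_sort(a_list[:mid])
    right = merge_sort(a_list[mid:])

    # Merge the two sorted halves by repeatedly taking the smaller front element
    merged = []
    i, j = 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Unlike selection_sort, this version leaves its argument unchanged and returns a new list.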
What is the time complexity of merge sort? How many times will we split the list, i.e. how many split levels will we have? \(O(\log_2 n)\). And for each “level” of split, how many comparisons will we do? \(O(n)\). So the total complexity is \(O(n \log_2 n)\). A big improvement! As an aside, this is an example of a “Divide and Conquer” algorithm.
Much like we did with list vs. set query performance, we can test the
performance of our sorting implementations. In sorting.py, there
is a test setup for selection sort, merge sort and the Python built-in sorted function.
What do you observe? We see a clear difference between selection sort and merge sort, as we would expect based on their different asymptotic complexities. But we also see a substantial difference between the built-in sorted and merge sort.
It would appear that
sorted runs in constant time. But that is just an
artifact of the small input sizes. The average time complexity of
sorted is actually the same, \(O(n \log_2 n)\), but the constants are very
different. The sorted implementation (an algorithm called Timsort) is a
hybrid of insertion sort and merge sort that is very efficient (it doesn’t
incur the overheads of our pure Python implementation of merge sort).
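A timing comparison of this kind could be set up roughly as follows. This is only a sketch, not the test setup in sorting.py: the helper `time_sort` is hypothetical, the input sizes are arbitrary, and to stay short it compares only selection sort with the built-in sorted.

```python
import random
import timeit


def selection_sort(a_list):
    """Sort the list in place using the selection sort algorithm"""
    for i in range(len(a_list)):
        min_index = i
        for j in range(i + 1, len(a_list)):
            if a_list[j] < a_list[min_index]:
                min_index = j
        a_list[i], a_list[min_index] = a_list[min_index], a_list[i]


def time_sort(sort_function, size, repeats=3):
    """Return the best time (in seconds) to sort a random list of the given size."""
    data = [random.random() for _ in range(size)]
    # Sort a fresh copy each run so every run starts from unsorted data;
    # the (small) cost of the copy is included in the measurement
    times = timeit.repeat(lambda: sort_function(list(data)), number=1, repeat=repeats)
    return min(times)


# Doubling the size should roughly quadruple the selection sort time
for size in [100, 200, 400, 800]:
    print(size, time_sort(selection_sort, size), time_sort(sorted, size))
```

Taking the minimum over several repeats reduces noise from other activity on the machine.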
More generally, there is more to thinking about performance than just average asymptotic complexity. We also may need to consider:
Selection sort and merge sort each have identical best, average and worst case complexity (e.g., selection sort is \(O(n^2)\) in all three cases), but that is not true of all sorting algorithms. In fact, Python’s built-in sorting algorithm has \(O(n \log_2 n)\) average and worst case time complexity but \(O(n)\) best case time complexity. The insertion sort algorithm included in sorting.py (but which we didn’t talk about) has \(O(n)\) best case complexity, but \(O(n^2)\) average and worst case time complexity.
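To see how best and worst cases can differ, here is a sketch of insertion sort instrumented to count comparisons. It is not the version in sorting.py; returning the comparison count is an addition made purely for illustration.

```python
def insertion_sort(a_list):
    """Sort a_list in place and return the number of comparisons performed."""
    comparisons = 0
    for i in range(1, len(a_list)):
        value = a_list[i]
        j = i - 1
        # Shift larger values right until we find where value belongs
        while j >= 0:
            comparisons += 1
            if a_list[j] <= value:
                break
            a_list[j + 1] = a_list[j]
            j -= 1
        a_list[j + 1] = value
    return comparisons


print(insertion_sort([1, 2, 3, 4, 5]))  # already sorted: n - 1 comparisons
print(insertion_sort([5, 4, 3, 2, 1]))  # reversed: n(n-1)/2 comparisons
```

On already sorted input each element needs a single comparison, giving the \(O(n)\) best case; on reversed input every element is compared against all of its predecessors.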
Sorting is both a valuable tool for learning about recursion, complexity
analysis, etc. and a key building block for many programs. Better
understanding the time complexity of sorting helps us make better implementation
decisions. Consider the approaches to histogram we saw in the
frequencies function in lab5 vs. the
dictionary-based histogram.py. Knowing what we know now, which
approach would you choose? Recall that
frequencies sorts the list before
calculating the length of runs. As a result it performs on the order of
\(n \log_2 n + n\) operations, making it an \(O(n \log n)\) algorithm vs.
\(O(n)\) for the dictionary-based approach (as querying and
updating a dictionary by key is generally \(O(1)\)).
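The two approaches might be sketched as follows. Both function names are hypothetical and neither is the actual lab5 or histogram.py code; they only illustrate where the sort and the dictionary lookups appear.

```python
def histogram_by_sorting(values):
    """Return (value, count) pairs by sorting, then measuring the length of runs."""
    result = []
    # Sorting costs O(n log n); the single pass over the runs is O(n)
    for value in sorted(values):
        if result and result[-1][0] == value:
            # Same value as the previous element: extend the current run
            result[-1] = (value, result[-1][1] + 1)
        else:
            # A new value starts a new run
            result.append((value, 1))
    return result


def histogram_by_dict(values):
    """Return a dictionary mapping each value to its count."""
    counts = {}
    # One pass with O(1) dictionary lookups and updates: O(n) overall
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    return counts
```

The sorting version produces its output in sorted order as a side effect, which may matter if the histogram needs to be printed in order.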
You will learn a lot more about the many different sorting algorithms in CS201 and other classes.