Several lectures ago we plotted the execution time of membership queries on lists and sets of different sizes. Check out lists_vs_sets_improved.py for a refresher. Recall that we observed that the query time grew linearly with the size of the list, but did not grow at all as we increased the size of the set. Today we are going to talk more formally about the efficiency of these and other algorithms.
We can always talk about absolute running time, but that requires lots of experimentation and is easily confounded by different inputs, the computer, etc. Instead we want a tool that allows us to talk about and compare different algorithms, data structures, etc. before we do any implementation, while eliding unnecessary details.
That tool is asymptotic analysis: an estimate of how the execution time (termed time complexity), memory usage or other properties of an algorithm grow as the input grows arbitrarily large.
The key idea: How does the run-time grow as we increase the input size?
For example, in the case of querying lists, if we double the size of the list, how will the execution time (time complexity) grow? Will it be unchanged, double, triple, quadruple? Double! Similarly, how would the execution time of querying sets grow if we double the size of the set? Unchanged!
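To make the doubling concrete without the noise of real timings, here is a minimal sketch that counts comparisons instead of measuring seconds. The helper `count_list_comparisons` is hypothetical (it is not part of lists_vs_sets_improved.py); it scans the list one element at a time, which is roughly what a membership query on a list has to do.

```python
def count_list_comparisons(items, target):
    """Linear search that returns how many elements were compared to target."""
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            break
    return comparisons


# Worst case: the target is absent, so every element is examined.
for n in [1000, 2000, 4000]:
    print(n, count_list_comparisons(list(range(n)), -1))
```

Each time n doubles, the worst-case number of comparisons doubles too.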
Big-O notation is a way of describing an upper bound on the growth rate as a function of the size of the input, typically abbreviated n. That is, it is really about growth rates. Big-O is a simplification of \(f(n)\), the actual functional relationship, which follows these two rules: keep only the fastest-growing term, and drop any constant coefficients.
Thus \(f(n)=3n^3 + 6n^2 + n + 5\) would be \(O(n^3)\).
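One way to see why the lower-order terms can be ignored is to compare \(f(n)\) with \(n^3\) as n grows. The short sketch below is only an illustration; the name `f` simply mirrors the formula above.

```python
def f(n):
    """The example function f(n) = 3n^3 + 6n^2 + n + 5."""
    return 3 * n**3 + 6 * n**2 + n + 5


# The ratio f(n) / n^3 settles toward the leading coefficient, 3.
for n in [10, 100, 1000, 10000]:
    print(n, f(n) / n**3)
```

As n grows, the ratio approaches 3: the \(n^3\) term dominates, and the remaining constant factor is exactly what big-O discards.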
Using big-O, we can describe groups of algorithms that have similar asymptotic behavior:
Returning to our “list vs. sets” example, querying a list is \(O(n)\) or linear time and querying a Python set is \(O(1)\), or constant time.
Suppose that we had an algorithm with \(O(n^2)\) complexity that takes 5 seconds to run on our computer when n is 1000. If n increased to 3000, about how long would it take to run?
Since we are tripling the input, and the algorithm has quadratic complexity, the time will grow as \(3^2\), or by a factor of 9. Thus we would expect it to take 45 seconds with the larger input.
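The same back-of-the-envelope estimate can be written as a small function. This is a sketch that assumes the running time is proportional to n raised to some exponent; `predict_time` and its parameter names are made up for illustration.

```python
def predict_time(known_time, known_n, new_n, exponent):
    """Scale a measured time, assuming time is proportional to n**exponent."""
    return known_time * (new_n / known_n) ** exponent


# Quadratic algorithm: 5 seconds at n = 1000, estimate for n = 3000.
print(predict_time(5, 1000, 3000, 2))  # 45.0
```

With an exponent of 1 (linear time) the same call would predict 15 seconds, since tripling the input would only triple the time.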
What about the standard deviation computation in our statistics lab? Here is a Python module with two possible implementations. What is the complexity of these two implementations?
In the first, we compute the mean before the loop. Thus inside the loop we are only performing a constant number of operations (a subtraction, multiplication and addition). Thus it is a linear time implementation. In the latter, in each loop iteration, i.e. n times, we are computing the average, an \(O(n)\) operation. Thus the overall complexity is \(O(n^2)\), or quadratic! We should choose the first approach!
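For reference, the two approaches might look like the sketch below. This is an assumed reconstruction, not the actual module from the lab: the function names are hypothetical, and it computes the population standard deviation (dividing by n), which may differ from the lab's definition.

```python
import math


def mean(data):
    """Average of the values: one pass over the data, so O(n)."""
    return sum(data) / len(data)


def std_dev_linear(data):
    """Compute the mean once, before the loop: O(n) overall."""
    avg = mean(data)
    total = 0
    for value in data:
        diff = value - avg  # constant work per iteration
        total += diff * diff
    return math.sqrt(total / len(data))


def std_dev_quadratic(data):
    """Recompute the mean in every iteration: O(n^2) overall."""
    total = 0
    for value in data:
        diff = value - mean(data)  # O(n) work hidden inside the loop
        total += diff * diff
    return math.sqrt(total / len(data))


data = [2, 4, 4, 4, 5, 5, 7, 9]
print(std_dev_linear(data), std_dev_quadratic(data))
```

Both functions return the same value; only the amount of work differs, because the second hides an \(O(n)\) call inside a loop that itself runs n times.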
So thinking back to the choice between a list and a set for querying, should we always choose the set because its time complexity is better? It depends. Keep in mind that big-O approximates the growth rate in the limit where the input size is very large. So if the inputs are large then we would almost certainly want to choose the set. If that assumption does not hold for our input, e.g., we are querying just a few values, then the set may not be any faster.
Also keep in mind that big-O drops the constants, but in reality the constant factors may be quite large. Thus think of big-O as a useful tool for thinking about efficiency (albeit with caveats) that complements experiments and other approaches.
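As a toy illustration of how constants can matter, suppose one algorithm takes \(100n\) steps and another takes \(n^2\) steps. These step counts are invented for the example and do not describe any real algorithm.

```python
def steps_linear(n):
    """Hypothetical O(n) algorithm with a large constant factor."""
    return 100 * n


def steps_quadratic(n):
    """Hypothetical O(n^2) algorithm with a small constant factor."""
    return n * n


def first_n_where_linear_wins():
    """Smallest n at which the linear algorithm takes strictly fewer steps."""
    n = 1
    while steps_linear(n) >= steps_quadratic(n):
        n += 1
    return n


print(first_n_where_linear_wins())  # 101
```

For inputs of 100 or fewer, the "worse" quadratic algorithm does no more work than the linear one; the asymptotic advantage only shows up beyond that point.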
In addition to the classes we describe above, we can talk about some broader classes of time complexity. For example, polynomial time problems are those that can be solved in \(O(n^k)\) time, where k is some constant. Most of the algorithms we have encountered fall into this category.
We often focus on a specific subset of these problems, decision problems (those problems that have a boolean answer). The decision problems that can be solved in polynomial time using a deterministic Turing machine make up the class P. Another important class is NP, or “non-deterministic polynomial time”: the decision problems for which a proposed solution can be verified in polynomial time. Every problem in P is also in NP, but for the hardest problems in NP there are, at present, no known solutions that run in polynomial time. An example of an NP problem is the decision version of the traveling salesman problem (TSP):
“Given a matrix of distances between n cities, determine if there is a route visiting all cities exactly once with total distance less than k.”
We can verify whether a proposed route is valid in polynomial time, but finding such a route generally, at present, takes exponential time in the worst case. Proving whether P=NP, or not, is one of the key open problems in Computer Science (it is one of the Clay Mathematics Institute's Millennium Prize Problems, each of which carries a $1 million prize). Finding polynomial time solutions to the hardest NP problems would make many otherwise intractable, or at least very difficult, problems (like TSP) tractable, even in the worst case.
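The gap between verifying and finding can be sketched in code. The functions below are illustrative only: the names are hypothetical, cities are numbered 0 to n-1, and the route is assumed to be a closed tour that returns to its starting city (the problem statement above does not say whether it must).

```python
from itertools import permutations


def route_length(dist, route):
    """Total distance of a closed tour that returns to its first city."""
    n = len(route)
    return sum(dist[route[i]][route[(i + 1) % n]] for i in range(n))


def verify_route(dist, route, k):
    """Check a proposed route: a sort plus one pass, so polynomial time."""
    visits_each_city_once = sorted(route) == list(range(len(dist)))
    return visits_each_city_once and route_length(dist, route) < k


def exists_route(dist, k):
    """Brute-force search: tries up to (n-1)! orderings starting at city 0."""
    for rest in permutations(range(1, len(dist))):
        if verify_route(dist, (0,) + rest, k):
            return True
    return False


dist = [
    [0, 1, 4, 3],
    [1, 0, 2, 5],
    [4, 2, 0, 1],
    [3, 5, 1, 0],
]
print(verify_route(dist, [0, 1, 2, 3], 8))  # True: this tour has length 7
print(exists_route(dist, 7))  # False: no tour is shorter than 7
```

Checking one route takes a single pass over it, while the brute-force search may have to try every ordering of the cities, which quickly becomes infeasible as n grows.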
TSP is one of a sub-class of NP problems termed NP-complete. Any problem in NP, including every other NP-complete problem, can be translated to an NP-complete problem in polynomial time. Thus a polynomial time solution to any one NP-complete problem would yield a polynomial time solution to all other NP-complete problems!
Are there problems that no algorithm can solve at all? Yes! One of the most famous undecidable problems is the halting problem: given an arbitrary program and its input, determine whether the program will finish or run forever.
Alan Turing proved that a general algorithm to solve the halting problem for any program does not exist. To do so, he devised the Turing machine, a theoretical model of a computer that is used to model and reason about theoretical aspects of computing.