CS 202 - Notes 2016-05-16

Performance

We now have some tools to understand how to affect performance beyond picking the right data structure and algorithm. In real programs, we sometimes have to worry about those constant factors. Which is better, 500n or n²? It depends on n.

Big caveat: It is easier to make working code fast than it is to make fast code work. Don’t optimize until you know you have a problem. You are likely to make your code harder to reason about and debug in the process.

Tools

This is not a comprehensive introduction to these tools by any stretch…

time

This is good for quickly testing tradeoffs to see how long your program runs. On the command line, just type time before invoking your program (e.g., time ./a.out), and it will report how long your program was running.

gprof

Compile the program with -pg (e.g., gcc -pg -o myprog myprog.c) to add profiling code to the executable. When you run the program, it will interrupt your code every 1/100th of a second and look at which function is running. When the program exits, it writes the raw profiling data into a file called gmon.out.

Call gprof -b executable (where executable is your program) in the directory containing gmon.out, and it will report the amount of time spent in each function.

This will give you an idea about where to start looking for problems.

It only works at the level of functions, however, and it has a hard time with short functions. It also struggles with I/O-bound programs: it will assure you that the program is running very quickly, because by and large your program isn't actually executing instructions while it waits on I/O.

valgrind (callgrind)

Valgrind's callgrind tool takes a different approach. Instead of periodically sampling like gprof, it runs the program on a simulated CPU and observes every instruction the program executes. Rather than timing execution, it gives us exact instruction counts.

Profile code with valgrind --tool=callgrind executable. This will produce a file callgrind.out.pid, where pid is the process id of the program. You can look at this file with callgrind_annotate callgrind.out.pid.

Examples

strlen

I have two variations of a program, optimize_1A.c and optimize_1B.c. Both of them create a large array of capital 'A's (length determined by a command-line argument). They then call a function called lower(), which iterates over the array, converting each 'A' to lowercase. The only difference between the two is that the first calls strlen in the loop condition, while the second stores the length in a variable before the loop.
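A minimal sketch of the two variants (the function bodies here are assumed for illustration, not copied from the actual files):

    #include <ctype.h>
    #include <stddef.h>
    #include <string.h>

    /* Variant A (optimize_1A.c): strlen() is re-evaluated on every pass
     * through the loop, so lowering an n-character string costs O(n^2). */
    void lower_a(char *s) {
        for (size_t i = 0; i < strlen(s); i++)
            s[i] = (char)tolower((unsigned char)s[i]);
    }

    /* Variant B (optimize_1B.c): the length is computed once before the
     * loop, so the same work costs O(n). */
    void lower_b(char *s) {
        size_t len = strlen(s);
        for (size_t i = 0; i < len; i++)
            s[i] = (char)tolower((unsigned char)s[i]);
    }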

We can run time and see that the first is much slower than the second. Even if we tell the compiler to optimize, it doesn’t improve the performance of the first.

If we look at the first with valgrind, we can see it is spending all of its time in strlen.

The compiler can't hoist the strlen call out of the loop because we are changing the values in the array as we work. We could end up shortening the string (for example, by writing a null byte into it), so the compiler can't assume that the value of strlen() won't change.

Looping over a 2D array

In optimize_mem.c, we have three functions that all perform the same task: summing each row of a 2D array into a 1D array. The 2D array is stored as one large contiguous chunk of memory, so we use the formula i*SIZE + j to compute the location of element a[i][j].

If we run the program through gprof, we find that sum_rows1 is the slowest, followed by sum_rows2 and then sum_rows3.

In sum_rows2, rather than adding directly into the array, we accumulate into a local variable and only write into the array at the end, once we have the final sum. This change allows the compiler to keep the running sum in a register rather than touching memory on every iteration, which is much faster.

The third version of the function factors out the computation i*SIZE, which produces the same result on every pass through the inner loop anyway.
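A sketch of the three versions (assuming a square SIZE-by-SIZE array of doubles; the real file may differ in details):

    #define SIZE 1000

    /* Version 1: accumulates directly into b[i] in memory on every
     * iteration of the inner loop. */
    void sum_rows1(const double *a, double *b) {
        for (int i = 0; i < SIZE; i++) {
            b[i] = 0;
            for (int j = 0; j < SIZE; j++)
                b[i] += a[i*SIZE + j];
        }
    }

    /* Version 2: accumulates into a local variable, which the compiler
     * can keep in a register; memory is written only once per row. */
    void sum_rows2(const double *a, double *b) {
        for (int i = 0; i < SIZE; i++) {
            double sum = 0;
            for (int j = 0; j < SIZE; j++)
                sum += a[i*SIZE + j];
            b[i] = sum;
        }
    }

    /* Version 3: additionally hoists i*SIZE, which is constant for the
     * whole inner loop, out of that loop. */
    void sum_rows3(const double *a, double *b) {
        for (int i = 0; i < SIZE; i++) {
            double sum = 0;
            const double *row = a + i*SIZE;
            for (int j = 0; j < SIZE; j++)
                sum += row[j];
            b[i] = sum;
        }
    }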

If we compile with one level of optimization (-O1), the compiler performs the factoring-out optimization automatically, and version two speeds up to the same speed as version three. Adding -O2 optimization makes all three equivalent.

Cache access patterns

In optimize_cache.c, we perform a very similar operation. This time, the functions are simplified down to just adding up all of the values in a 2D array.

The only difference between the two is that in the second function, I switched the inner and outer for loops. Running this through gprof, we can see a big difference in the amount of time taken by the two functions.

This comes down to the cache access pattern. In the first, the values are accessed in the order they are laid out in memory. In the second, we visit the first element of every row, then the second element of every row, and so on. Because the rows are contiguous in memory, each access jumps over an entire row. We are not exploiting spatial locality, and the result is many more cache misses.
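A sketch of the two access patterns (function names and details assumed, as before):

    /* Version 1: row-major traversal. Consecutive accesses touch
     * adjacent memory, so most of them hit in the cache. */
    double sum_all1(const double *a) {
        double sum = 0;
        for (int i = 0; i < SIZE; i++)
            for (int j = 0; j < SIZE; j++)
                sum += a[i*SIZE + j];
        return sum;
    }

    /* Version 2: the same loops, swapped. Each access now jumps a full
     * row (SIZE doubles) ahead, defeating spatial locality. */
    double sum_all2(const double *a) {
        double sum = 0;
        for (int j = 0; j < SIZE; j++)
            for (int i = 0; i < SIZE; i++)
                sum += a[i*SIZE + j];
        return sum;
    }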

Amdahl’s law

If we can speed up some fraction p of a program's run time by a factor f, the overall speedup will be 1 / ((1 - p) + p/f).

Example: We identify a portion of the code that currently takes up 10% of the run time, and we come up with a way to make it 5 times faster. The total speedup is 1/((1 - 0.1) + 0.1/5) = 1/(0.9 + 0.02) ≈ 1.09.

If we can make this section of code 10 times faster instead, the total speedup is 1/(0.9 + 0.01) ≈ 1.10.

And if we managed to eliminate this section of code entirely, the best we could do is 1/0.9 ≈ 1.11 times faster.
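These numbers are easy to check with a few lines of C (a quick sketch, not part of the course materials):

    #include <stdio.h>

    /* Amdahl's law: overall speedup when a fraction p of the run time
     * is made f times faster. */
    double amdahl(double p, double f) {
        return 1.0 / ((1.0 - p) + p / f);
    }

    int main(void) {
        printf("%.2f\n", amdahl(0.1, 5));    /* 1.09 */
        printf("%.2f\n", amdahl(0.1, 10));   /* 1.10 */
        printf("%.2f\n", amdahl(0.1, 1e12)); /* approaches 1/0.9 = 1.11 */
        return 0;
    }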