CS 202 - Notes 2016-05-13

Caches

Caches are based on locality of reference – the idea that memory references cluster together. We call these clusters “working sets”.

A cache is a smaller, faster memory that sits between the processor and main memory. Access is faster, but it won’t always have what we need (a miss). When we have a miss, we go up the hierarchy until we find the value, propagating it back down as we return to the processor.

Data is stored in blocks bigger than a single word to take advantage of spatial locality.

Cache Organization

Direct Mapped

Given a memory address, there is one unique location in the cache where that address could be found.

We think of the bits of the address as being broken into three pieces: tag, line index, offset

Example: 6-bit address space, byte level addressing, four line cache, four bytes per block

The address then has two bits for each of the tag, the line index, and the offset.

The address 101101 has tag 10, line index 11, and offset 01, so it would be found in byte 1 of line 3 of the cache (counting from zero), provided that line is currently tagged 10.
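
A quick sketch in C of pulling those fields out of an address with shifts and masks, using the field widths from this example (the constants and variable names here are just for illustration):

    #include <stdio.h>

    /* 6-bit addresses from the example: 2-bit tag | 2-bit line index | 2-bit offset */
    #define OFFSET_BITS 2
    #define INDEX_BITS  2

    int main(void) {
        unsigned addr   = 0x2D;                            /* 101101 in binary */
        unsigned offset = addr & ((1 << OFFSET_BITS) - 1);
        unsigned index  = (addr >> OFFSET_BITS) & ((1 << INDEX_BITS) - 1);
        unsigned tag    = addr >> (OFFSET_BITS + INDEX_BITS);
        printf("tag=%u line=%u offset=%u\n", tag, index, offset);   /* tag=2 line=3 offset=1 */
        return 0;
    }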

The downside of the direct mapped cache is that it is prone to thrashing. If we are alternating between two values that map to the same line in the cache, we will keep evicting one to load the other, even if there is otherwise plenty of room in the cache.

Associative mapping

This is basically the opposite of direct mapped: a block can be placed in any line of the cache. Addresses are split into just a tag and an offset, with the offset width determined by the size of the block.

This solves the thrashing problem, but now we have to look at the tags on every single line every time we want to find an address.

We also need to figure out which line to replace when bringing in a new block if all of the lines are already in use.
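
A rough sketch of the lookup a fully associative cache has to do. In hardware all of the tag comparisons happen in parallel; the loop here is just to show the logic, and the struct layout is made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NUM_LINES 4

    struct line {
        bool     valid;
        uint32_t tag;
        uint8_t  data[4];
    };

    /* Compare the tag against every line in the cache. */
    int lookup(const struct line cache[], uint32_t tag) {
        for (int i = 0; i < NUM_LINES; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return i;               /* hit: return the matching line */
        return -1;                      /* miss */
    }

    int main(void) {
        struct line cache[NUM_LINES] = {0};
        cache[2].valid = true;
        cache[2].tag   = 7;
        printf("tag 7 -> line %d\n", lookup(cache, 7));   /* hit in line 2 */
        printf("tag 9 -> line %d\n", lookup(cache, 9));   /* miss: -1      */
        return 0;
    }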

Set-associative mapping

The compromise between the other two solutions. The cache is divided into sets, each of which uses associative mapping.

The address is broken up into three components again, but the middle one is now the set index, uniquely identifying which set the block belongs to.

This is the strategy used in most caches.
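
A sketch of how the set index falls out of the address, using a made-up geometry of a 4 KB cache with 64-byte blocks and 4 lines per set (so 16 sets); none of these numbers come from the lecture, they are just for illustration:

    #include <stdio.h>

    /* Hypothetical geometry: 4 KB cache, 64-byte blocks, 4 ways -> 16 sets. */
    #define CACHE_BYTES 4096
    #define BLOCK_BYTES 64
    #define WAYS        4
    #define NUM_SETS    (CACHE_BYTES / (BLOCK_BYTES * WAYS))

    int main(void) {
        unsigned addr  = 0x1234;
        unsigned block = addr / BLOCK_BYTES;   /* drop the byte offset        */
        unsigned set   = block % NUM_SETS;     /* middle bits pick the set    */
        unsigned tag   = block / NUM_SETS;     /* the rest becomes the tag    */
        printf("address 0x%x -> set %u, tag 0x%x\n", addr, set, tag);
        return 0;
    }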

Write policy

We have to consider what happens when we write a value into memory. We can write it back into the closest cache, but at some point that value needs to make its way into main memory.

write through

When we write a value, it propagates back through all of the caches and gets stored immediately in memory.

This has fairly poor performance. It can take 100 cycles to write a value back into memory. If our CPI is 1 without cache misses, and 10% of our instructions are stores, our actual CPI would be 0.9 * 1 + 0.1 * 100 = 10.9, reducing the speed by more than a factor of 10.
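
The same arithmetic as a tiny C sketch, so the assumptions (base CPI of 1, 10% stores, 100 cycles per memory write) are easy to change:

    #include <stdio.h>

    int main(void) {
        double base_cpi   = 1.0;     /* CPI with no cache misses                 */
        double store_frac = 0.10;    /* fraction of instructions that are stores */
        double write_cyc  = 100.0;   /* cycles for a store to reach main memory  */

        /* 90% of instructions run at the base CPI, 10% stall for the full write */
        double cpi = (1.0 - store_frac) * base_cpi + store_frac * write_cyc;
        printf("effective CPI = %.1f\n", cpi);   /* 10.9 */
        return 0;
    }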

This doesn’t take into account that we may not need to push every store back into main memory. We may make a number of changes to a value before it stabilizes.

write buffer

We follow the same strategy as with write through, except we don’t wait for the store to complete. We add a buffer between the cache and memory that holds the pending stores. We can even overwrite a buffered value if we come up with a new one before the old value has made it to memory.

But, the buffer will eventually fill, and then we are back where we started.
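
A rough sketch of a write buffer that coalesces stores to the same address; the buffer size and struct are made up for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define BUF_SLOTS 8   /* made-up size */

    struct pending { bool valid; uint32_t addr; uint32_t value; };
    static struct pending buf[BUF_SLOTS];

    /* Returns false if the buffer is full and the processor has to stall. */
    bool buffer_store(uint32_t addr, uint32_t value) {
        for (int i = 0; i < BUF_SLOTS; i++)   /* new value for a pending store? replace it */
            if (buf[i].valid && buf[i].addr == addr) {
                buf[i].value = value;
                return true;
            }
        for (int i = 0; i < BUF_SLOTS; i++)   /* otherwise grab a free slot */
            if (!buf[i].valid) {
                buf[i].valid = true;
                buf[i].addr  = addr;
                buf[i].value = value;
                return true;
            }
        return false;                         /* full: back where we started */
    }

    int main(void) {
        buffer_store(0x100, 1);
        buffer_store(0x100, 2);   /* same address: the pending value is replaced */
        return 0;
    }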

write back

This is the opposite of write through. We just write into the nearest cache. We only worry about pushing it upstream in the memory hierarchy when the cache line gets replaced.

This makes cache misses a bit more complicated, since we now need to check whether the line being replaced has been modified and, if it has, wait for it to be written back.
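
A sketch of the dirty-bit bookkeeping a write-back cache needs; the helper names here (write_block_to_memory, store_hit, evict) are placeholders, not a real API:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct line { bool valid; bool dirty; uint32_t tag; uint8_t data[64]; };

    /* Stand-in for the slow write out to main memory. */
    static void write_block_to_memory(uint32_t tag) {
        printf("writing back dirty block with tag 0x%x\n", tag);
    }

    /* A store hit only touches the cache and marks the line dirty. */
    void store_hit(struct line *l, unsigned offset, uint8_t value) {
        l->data[offset] = value;
        l->dirty = true;
    }

    /* On replacement, a dirty line has to be flushed before its slot is reused. */
    void evict(struct line *l) {
        if (l->valid && l->dirty)
            write_block_to_memory(l->tag);
        l->valid = false;
        l->dirty = false;
    }

    int main(void) {
        struct line l = { .valid = true, .dirty = false, .tag = 0x2a };
        store_hit(&l, 0, 42);
        evict(&l);   /* triggers the write-back */
        return 0;
    }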

Replacement algorithms

Once the cache is full, we need to have a strategy for picking which line to replace when we want to load a new line into the cache.

least recently used

We are basically looking for the “stalest” value. We need to add some extra logic to the cache to track how long it has been since each line was read or written. With a 2-way set associative cache, we could add a single bit to every cache line: when we access a line, we set its “use” bit to 1 and clear the other line’s bit to 0. This gets more complicated with more than two lines…

This is probably the best solution.
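
A sketch of the single-bit scheme for a 2-way set described above: touching a line marks it as most recently used and the other line as the one to evict next (the names are made up for illustration):

    #include <stdint.h>
    #include <stdio.h>

    struct way { uint8_t use; /* a real line would also have valid, tag, data */ };

    /* On every access to way w of a 2-way set, mark it used and the other not. */
    void touch(struct way set[2], int w) {
        set[w].use     = 1;
        set[1 - w].use = 0;
    }

    /* On a miss, replace whichever way was not touched most recently. */
    int victim(const struct way set[2]) {
        return set[0].use ? 1 : 0;
    }

    int main(void) {
        struct way set[2] = {{0}, {0}};
        touch(set, 0);                             /* way 0 was just used...   */
        printf("evict way %d\n", victim(set));     /* ...so way 1 is the victim */
        return 0;
    }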

first in, first out

This is a little easier to implement. We just count up through the lines in the set. Every time we make a replacement, we increment the counter. When we hit the end, we start over at the beginning.
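
A sketch of that round-robin counter; hardware would keep it in a couple of flip-flops per set, and the modulo here is just the same wrap-around:

    #include <stdio.h>

    #define WAYS 4   /* lines per set; made up for the example */

    /* One counter per set: evict the line it points at, then move on. */
    int fifo_victim(unsigned *next) {
        int victim = *next;
        *next = (*next + 1) % WAYS;   /* wrap around at the end of the set */
        return victim;
    }

    int main(void) {
        unsigned next = 0;
        for (int i = 0; i < 6; i++)
            printf("%d ", fifo_victim(&next));   /* prints 0 1 2 3 0 1 */
        printf("\n");
        return 0;
    }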

least frequently used

We keep track of how often each value is used and evict the one we have touched the least. Unfortunately, this is biased against newly loaded lines, which have not had time to build up a count.

random

We could just pick one at random. In truth, this works about as well as the others on average. But random is not actually that easy to accomplish with logic circuits.

Design decisions

cache size: larger caches increase the hit rate, but they are also slower to access.

block size: the more data in a block, the better we exploit spatial locality. But more data per block means fewer cache lines for a given cache size, so poorer temporal locality. Larger blocks also increase the possibility that we move in data we will never need or want.

associativity: more lines per set reduces the chance of thrashing, but more lines are harder to implement and slow down the access time.

Real world example

Our lab machines have three levels of cache, each with 64-byte lines.

L1 cache - closest to the processor. Split between data and instructions, 32 KB apiece. Each one is 8-way associative, with 64 sets. There is an L1 cache per core.

L2 cache - 256 KB, 8-way associative with 512 sets. This is also per core.

L3 cache - 8 MB, 16-way associative with 8192 sets. There is only one of these, and it is shared by all of the cores on the chip.
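
Those set counts all follow from sets = cache size / (line size * ways); a quick sketch that checks the three levels using the sizes listed above:

    #include <stdio.h>

    /* number of sets = cache size / (line size * ways) */
    static unsigned sets(unsigned cache_bytes, unsigned line_bytes, unsigned ways) {
        return cache_bytes / (line_bytes * ways);
    }

    int main(void) {
        printf("L1: %u sets\n", sets(32u * 1024, 64, 8));           /* 64   */
        printf("L2: %u sets\n", sets(256u * 1024, 64, 8));          /* 512  */
        printf("L3: %u sets\n", sets(8u * 1024 * 1024, 64, 16));    /* 8192 */
        return 0;
    }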