Problem

Out of speed, density (a proxy for cost), and capacity, we can get only two. Under this limitation, the Memory Wall and Bandwidth Wall problems persist.
Memory Wall: memory latency is the bottleneck, masking the processor's speed
Bandwidth Wall: memory bandwidth is the bottleneck, masking the processor's speed
Insight
The principle of locality lets us predict which data will be accessed next; caching that data reduces access time. This gives the illusion of SRAM access time with much larger capacity.
Temporal Locality: a location accessed now is likely to be accessed again in the near future
Spatial Locality: locations near a recent access are likely to be accessed soon
* Value Locality: nearby locations are likely to have the same values (used heavily in approximate computing)
Cache
Organization (size, associativity, block size)

Associativity is the freedom to place a block within the cache: each set offers some number of ways (the associativity) in which a block may reside. The least associative design (one way per set) is called direct mapped; the most associative (a single set holding every way) is called fully associative.
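A minimal lookup-address sketch under assumed parameters (32 KiB cache, 64 B blocks, 4 ways; none of these numbers come from the notes):

```c
/* How an address splits into tag / set index / block offset.
   All parameters are illustrative assumptions. */
#include <stdint.h>
#include <stdio.h>

#define CACHE_SIZE (32 * 1024)                        /* 32 KiB */
#define BLOCK_SIZE 64                                 /* bytes per block */
#define WAYS       4                                  /* associativity */
#define NUM_SETS   (CACHE_SIZE / (BLOCK_SIZE * WAYS)) /* = 128 sets */

int main(void) {
    uint64_t addr   = 0x12345678;                     /* arbitrary example */
    uint64_t offset = addr % BLOCK_SIZE;              /* byte within block */
    uint64_t index  = (addr / BLOCK_SIZE) % NUM_SETS; /* which set */
    uint64_t tag    = addr / ((uint64_t)BLOCK_SIZE * NUM_SETS);
    printf("tag=%#llx index=%llu offset=%llu\n",
           (unsigned long long)tag, (unsigned long long)index,
           (unsigned long long)offset);
    return 0;
}
```

With WAYS = 1 this degenerates to direct mapped; with a single set holding all blocks it becomes fully associative.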

Replacement policy
Blocks least likely to be used in the near future should be evicted to lower-level memory to make room for the block needed now.
Longest until next use (Belady's OPT): ideal but unimplementable, since it requires future knowledge
Least Recently Used (LRU): best practical approximation (sketched after this list)
Pseudo-LRU
Random
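A minimal LRU sketch for one set, using timestamps; the structure and names are illustrative assumptions (real hardware tracks recency with a few pseudo-LRU bits instead of full timestamps):

```c
/* LRU bookkeeping for a single 4-way set (illustrative sketch). */
#include <stdint.h>

#define WAYS 4

typedef struct { uint64_t tag; int valid; uint64_t last_used; } Line;

static Line set[WAYS];
static uint64_t now;                      /* logical access counter */

/* Pick the way to evict: a free way if any, else the least recently used. */
int lru_victim(void) {
    int victim = 0;
    for (int w = 0; w < WAYS; w++) {
        if (!set[w].valid) return w;      /* free way: no eviction needed */
        if (set[w].last_used < set[victim].last_used)
            victim = w;
    }
    return victim;
}

/* Update recency on every hit or fill. */
void touch(int way, uint64_t tag) {
    set[way].tag = tag;
    set[way].valid = 1;
    set[way].last_used = ++now;
}
```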
Write policy
Write through: the information is written both to the block in the cache and to the block in lower-level memory (a read miss never triggers a write-back to DRAM, because cache and memory always stay in sync)
* Write through is always combined with write buffers to reduce delay
Write back: the information is written only to the block in the cache and is written to DRAM only when the block is replaced (good: repeated writes to the same block cost a single DRAM write)
Write-allocate: make room for the block in the cache, fetch the rest of the block from DRAM, then write into the cache
No-write-allocate (write-around): write directly to the lower level of the memory hierarchy, bypassing the cache (combined sketch after this list)
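A toy decision-tree sketch combining the four policies above; the names and the strictly sequential flow are assumptions (real controllers overlap these steps):

```c
/* Write handling under each policy combination (illustrative only). */
#include <stdio.h>

typedef enum { WRITE_THROUGH, WRITE_BACK } WritePolicy;
typedef enum { WRITE_ALLOCATE, NO_WRITE_ALLOCATE } AllocPolicy;

void handle_write(int hit, WritePolicy wp, AllocPolicy ap) {
    if (hit) {
        puts("update the block in the cache");
        if (wp == WRITE_THROUGH)
            puts("also write the word to DRAM (write buffer hides the delay)");
        else
            puts("set the dirty bit; DRAM is written only on eviction");
    } else if (ap == WRITE_ALLOCATE) {
        puts("fetch the block from DRAM, then treat as a write hit");
        handle_write(1, wp, ap);
    } else {
        puts("write-around: send the write to DRAM, bypass the cache");
    }
}
```

Write-through commonly pairs with no-write-allocate, and write-back with write-allocate, though other combinations are possible.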
Improving cache performance
Reduce miss rate
Reduce miss penalty
Reduce hit time
Reducing miss rate
Types of miss
Compulsory: the first access to a block can never hit, so the block must be brought into the cache (cold-start miss, first-reference miss) = a miss even in an infinite cache
Capacity: if C is the size of the cache (in blocks), more than C unique blocks have been accessed since this block was last accessed = non-compulsory miss in a fully associative cache of the same size
Conflict: occurs because too many active blocks map to the same cache set = miss that is neither compulsory nor capacity
Coherence: miss caused by another processor's write invalidating the block to keep caches coherent (multiprocessors only)
Compulsory miss: increase block size or prefetch
* too small a block under-exploits spatial locality; too large a block leaves fewer blocks in the cache, hurting temporal locality
Capacity miss: increase size of the cache
Conflict miss: increase associativity, or add a victim cache (worked example below)
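A worked conflict example with assumed numbers: a 32 KiB direct-mapped cache with 64 B blocks has 512 sets, so any two addresses 32 KiB apart (e.g., 0x0000 and 0x8000) map to the same set and evict each other on alternating accesses even while the rest of the cache sits idle; one extra way, or a victim cache, absorbs this ping-ponging.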
Prefetching
Prefetching relies on extra memory bandwidth that can be used without penalty
(HW) Instruction prefetching can be done by placing a stream buffer that prefetches the next block(s).
(SW) Data prefetching can be done with special prefetch instructions, or by running a “helper thread” that prefetches data for the main thread (sketched below).
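A minimal software-prefetch sketch using GCC/Clang's __builtin_prefetch; the prefetch distance is a tuning assumption, not a given:

```c
/* Prefetch array elements a fixed distance ahead of their use. */
#define PREFETCH_DISTANCE 16   /* tuning assumption, workload-dependent */

long sum(const long *a, int n) {
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n)
            /* args: address, rw (0 = read), temporal-locality hint (0-3) */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
        s += a[i];
    }
    return s;
}
```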
Compiler optimization
Reordering procedures in memory to reduce misses (partly a link-time optimization)
Profiling to look at conflicts
Merging arrays: improve spatial locality by using a single array of compound elements instead of two separate arrays
Loop interchange: change the nesting of loops to access data in the order it is stored in memory (see the sketch after this list)
Loop fusion: combine two independent loops whose accessed data overlap
Blocking: improve temporal locality by working on a small block of data repeatedly instead of streaming down whole rows or columns
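A loop-interchange sketch: C stores 2-D arrays row-major, so making j the inner loop turns a stride-N walk into a unit-stride one (N and the array are illustrative assumptions):

```c
/* Same work, different traversal order (illustrative sketch). */
#define N 1024
static double x[N][N];

void column_walk(void) {            /* before: stride of N doubles,   */
    for (int j = 0; j < N; j++)     /* roughly one miss per access    */
        for (int i = 0; i < N; i++)
            x[i][j] *= 2.0;
}

void row_walk(void) {               /* after: unit stride, roughly    */
    for (int i = 0; i < N; i++)     /* one miss per 64 B block        */
        for (int j = 0; j < N; j++)
            x[i][j] *= 2.0;
}
```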
Victim cache (Norman Jouppi, 1990)
Add a small buffer that holds blocks recently evicted from the cache; check it on a miss before going to the lower level.
Reducing miss penalty
Write buffer (for write-back caches)
A read miss may force a dirty block to be written back to DRAM before the read can be serviced; instead, copy the dirty block into a write buffer, service the read first, then drain the buffer to DRAM (sketched below).
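A toy sketch of that ordering with a one-entry write buffer; the flat dram array, block size, and names are illustrative assumptions:

```c
/* Read-first servicing of a miss whose victim is dirty (illustrative). */
#include <stdint.h>
#include <string.h>

#define BLOCK 64
static uint8_t dram[1 << 20];   /* toy 1 MiB "DRAM"; addresses assumed
                                   block-aligned and in range */
typedef struct { uint64_t addr; uint8_t data[BLOCK]; int full; } WriteBuffer;
static WriteBuffer wb;

void service_read_miss(uint64_t miss_addr, uint8_t *fill,
                       uint64_t victim_addr, const uint8_t *victim) {
    memcpy(wb.data, victim, BLOCK);         /* 1. dirty victim -> buffer */
    wb.addr = victim_addr;
    wb.full = 1;
    memcpy(fill, &dram[miss_addr], BLOCK);  /* 2. read serviced first */
    if (wb.full) {                          /* 3. drain buffer afterwards */
        memcpy(&dram[wb.addr], wb.data, BLOCK);
        wb.full = 0;
    }
}
```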
Early restart and Critical word first
Early restart: as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
Critical word first: request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block.
* Not always helpful: with strong spatial locality, the CPU usually needs the rest of the block soon anyway
Build a multi-level cache
The upper level (closer to the CPU) can be less associative for faster hit time; the lower level can be more associative to reduce the global miss rate.
Local miss rate: misses in this cache divided by the total number of memory accesses that reach this cache
Global miss rate: misses in this cache divided by all CPU memory accesses = the product of the local miss rates along the hierarchy (worked example below)
The inclusion property means that a block present in the upper-level cache is also present in the lower-level cache (exclusion can beat inclusion when L2 is only slightly larger than L1).
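A worked example with assumed numbers: L1 hit time 1 cycle, L1 local miss rate 5%, L2 hit time 10 cycles, L2 local miss rate 20%, DRAM penalty 100 cycles. Then AMAT = 1 + 0.05 × (10 + 0.20 × 100) = 2.5 cycles, and the global L2 miss rate is 0.05 × 0.20 = 1% of all accesses.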
Reduce hit time
Way prediction
Add predictor bits to each cache set to predict which way will hit, and probe that way first
Gives direct-mapped hit time with set-associative miss rate (when the prediction is correct)
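A minimal way-prediction lookup sketch; the structures and the retrain-on-mispredict policy are illustrative assumptions:

```c
/* Probe the predicted way first; fall back to a full set search. */
#include <stdint.h>

#define SETS 128
#define WAYS 4

typedef struct { uint64_t tag; int valid; } Line;
static Line cache[SETS][WAYS];
static uint8_t predicted_way[SETS];       /* log2(WAYS) bits per set */

/* Returns the hit way, or -1 on a miss. */
int lookup(int set, uint64_t tag) {
    int p = predicted_way[set];
    if (cache[set][p].valid && cache[set][p].tag == tag)
        return p;                         /* fast hit: ~direct-mapped time */
    for (int w = 0; w < WAYS; w++)        /* slow path: search every way */
        if (cache[set][w].valid && cache[set][w].tag == tag) {
            predicted_way[set] = (uint8_t)w;  /* retrain the predictor */
            return w;
        }
    return -1;                            /* miss */
}
```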