Phase 2 due today - I will run some stats this weekend
Jump right in on phase 3!
Questions and answers?
Register Windows (notes from last time)
Cache Memory
CPU performance doubles every 18-24 months
Address space increases by about 1.5 bits per year -- roughly a
factor of four every 18-24 months
Improvements in memory technology are mostly focused on
increased density, not increased speed
Caches help keep the needed data close to the CPU
A small, fast memory is placed between the CPU and main
memory
Whenever an address/value pair is read from main memory, it
is written to the cache
Whenever a value is needed for an address, we first look
for it in the cache, and only if it is not found there do we look
to main memory
When we write a value, we write it to cache and continue -
we update it in memory later, concurrently with other
computations
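In C, that read protocol might look like the sketch below (every helper
name here is made up for illustration; none of them come from the notes):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two levels of the hierarchy -
 * these names are invented for this sketch. */
bool     cache_lookup(uint32_t addr, uint32_t *value);   /* fast, may miss */
void     cache_fill(uint32_t addr, uint32_t value);
uint32_t memory_read(uint32_t addr);                     /* slow, never misses */

/* The read protocol above: look in the cache first; only on a miss
 * go to main memory, filling the cache on the way back. */
uint32_t read_word(uint32_t addr) {
    uint32_t value;
    if (cache_lookup(addr, &value))
        return value;              /* hit: the data was close to the CPU */
    value = memory_read(addr);     /* miss: the slow path */
    cache_fill(addr, value);       /* the next access to addr will hit */
    return value;
}

/* Trivial stubs so the sketch compiles; later examples flesh them out. */
bool cache_lookup(uint32_t addr, uint32_t *value) { (void)addr; (void)value; return false; }
void cache_fill(uint32_t addr, uint32_t value)    { (void)addr; (void)value; }
uint32_t memory_read(uint32_t addr)               { return addr; /* pretend */ }
```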
The cache is an associative memory, where we can look up
a memory value by its address in a table
key=address, data=value at that address
Ideally, we can look up the key instantly - search all at
once - think parallel circuitry
Realistically, there is a hash function that maps addresses to
possible cache entries
Depending on the hash function, we may or may not need to
store the key itself in the cache
The hash function depends on how we organize the cache
Generally there is an organization into lines of cache
- groups of addresses that are stored (and read/written) in the
cache together (Typically 16 or 32 bytes)
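As a concrete sketch (in C, with illustrative names and the 16-byte lines
mentioned above), one line of cache carries its key plus some bookkeeping
bits:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 16   /* bytes per line, one of the typical sizes above */

/* One line of cache: tag is the stored key identifying which group of
 * addresses lives here; valid says whether the line holds anything at
 * all; dirty (used later, for writes) says it differs from memory. */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;
```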
Direct caches:
Here, we allow for direct access to a cache line from memory
by breaking the address down into three parts:
101 | 10111 | 1011
(key/tag | line index | offset within line)
Use the middle chunk as the cache line number - the cache address, or
index.
In this case, we would have 16 addresses per line in the
cache, and 32 lines of cache (4 offset bits, 5 index bits)
Only 1/8 of our memory actually fits in cache (32 lines x 16 bytes =
512 of the 4096 possible addresses), but each address,
if it is in the cache, is in one specific location
We need only check whether the entry at the desired line holds the
actual address we want: if so, use it; if not, go get it
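A sketch of that check in C, using the 3/5/4-bit split from the example
above (so a 12-bit address space; all names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_LINES 32                       /* 5 index bits */
#define LINE_SIZE 16                       /* 4 offset bits */

typedef struct {
    bool     valid;
    uint32_t tag;                          /* the top 3 address bits */
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_LINES];

/* Split a 12-bit address into tag | index | offset and test for a hit. */
bool lookup(uint32_t addr, uint8_t *value) {
    uint32_t offset = addr & 0xF;          /* low 4 bits: byte within line */
    uint32_t index  = (addr >> 4) & 0x1F;  /* middle 5 bits: which line */
    uint32_t tag    = addr >> 9;           /* top 3 bits: which 1/8 of memory */

    if (cache[index].valid && cache[index].tag == tag) {
        *value = cache[index].data[offset];
        return true;                       /* hit: only one place to check */
    }
    return false;                          /* miss: go get it from memory */
}

int main(void) {
    uint8_t v;
    /* 101 10111 1011 from the example above is 0xB7B */
    printf("hit? %d\n", lookup(0xB7B, &v));
    return 0;
}
```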
Fully Associative caches
Here, a cache entry is not restricted to a single location
- in fact, it can be in any line
The key is all of the address bits above those that specify an
entry within a line
When looking up, we have to check every entry in the
cache to see if it's in there (sounds like a nasty search, but
maybe we can build a circuit for this)
If we find it, use it, if we don't, we need to go get it
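That search, sketched sequentially in C (the parallel circuitry mentioned
above would do all of these comparisons at once; names and sizes are
illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 32
#define LINE_SIZE 16               /* 4 offset bits, as before */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_LINES];

/* Fully associative: the line could be anywhere, so check every entry. */
bool lookup(uint32_t addr, uint8_t *value) {
    uint32_t offset = addr & 0xF;
    uint32_t tag    = addr >> 4;   /* the key: all bits above the offset */

    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            *value = cache[i].data[offset];
            return true;           /* found it, wherever it was */
        }
    }
    return false;                  /* miss: fetch the line - and pick a victim */
}
```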
But then which line do we replace? We could imagine lots of
approaches to "select a victim" -
LRU - least-recently used
LRU approximations
random
saturating counters - keep track of frequency of recent access
replace non-dirty (unmodified) lines
The decision must be fast but a poor decision will lead to
another cache miss in the near future - also expensive
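For concreteness, exact LRU sketched in C with a per-line timestamp
(illustrative; real hardware usually settles for the approximations above,
because keeping and comparing full timestamps is expensive):

```c
#include <stdint.h>

#define NUM_LINES 32

static uint64_t last_used[NUM_LINES];  /* timestamp of each line's last use */
static uint64_t now;                   /* bumped on every access */

/* Record a use: called on every hit to line i. */
void touch(int i) {
    last_used[i] = ++now;
}

/* Exact LRU: the victim is the line whose last use is oldest. */
int select_victim(void) {
    int victim = 0;
    for (int i = 1; i < NUM_LINES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}
```

Scanning every timestamp is exactly the kind of slow decision warned about
above - hence the approximations.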
Set Associative Mapping
In between the other approaches - each address maps to a
small set of possible lines
For 2-way, each address could be in one of two places
Easier to check than in the fully associative case, but if
we are using 3 or 4 lines repeatedly, we will end up with lots
of cache misses
But still better off than the direct mapped
Lookups within a set are almost certainly done in parallel - the
hardware compares the tags of all the ways in a set at once
There are tradeoffs and all three approaches (and various
levels of associativity within set associative) are really used
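A 2-way sketch in C: the index now selects a set of two lines, and only
those two tags need checking (sizes are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS  16                       /* 4 index bits */
#define WAYS      2                        /* 2-way set associative */
#define LINE_SIZE 16                       /* 4 offset bits */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_SETS][WAYS];

/* Each address maps to one set but may live in either way of that set. */
bool lookup(uint32_t addr, uint8_t *value) {
    uint32_t offset = addr & 0xF;
    uint32_t set    = (addr >> 4) & 0xF;   /* which set of two lines */
    uint32_t tag    = addr >> 8;           /* everything above set + offset */

    for (int way = 0; way < WAYS; way++) { /* hardware checks both at once */
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *value = cache[set][way].data[offset];
            return true;
        }
    }
    return false;  /* miss: pick a victim among just these two ways */
}
```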
There are three ways that a memory value will fail to be found
in cache:
a compulsory miss--the first time a value is
needed, it must be retrieved from memory
a conflict miss--since our last access another
value that maps to the same cache location has evicted the
value we want
a capacity miss--the value we desire has been
evicted not because of a direct conflict, but simply because the
cache is too small to hold everything in use (typically it was the
least recently used value)
Things to consider:
Data and instructions are often separated (a Harvard
cache); combining them is called a unified cache
Most primary (Level 1) caches hold a small amount of
data--8K to 32K bytes (and this value has been fairly constant
for 40 years!)
A good primary cache delivers better than 95% of requests
- Why? Locality, locality, locality
Spatial locality and temporal locality - it's why caches
work
In practice, 2-way yields hit rates around 90%, 4-way around 95%,
and 8-way close to 100%
Missing 2% of requests to a memory that is 50 times slower
means the average access time is 0.98 + 0.02*50 = 1.98 cycles, or
about half of the desired speed - and memory is often much slower
than that!
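The same arithmetic, generalized in C so the hit rates above can be
plugged in (the 1-cycle hit and 50-times-slower memory are the assumptions
from the example):

```c
#include <stdio.h>

/* Average access time: hits cost hit_time, misses cost miss_time. */
double amat(double hit_rate, double hit_time, double miss_time) {
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time;
}

int main(void) {
    double rates[] = { 0.90, 0.95, 0.98, 0.99 };
    for (int i = 0; i < 4; i++)
        printf("hit rate %.2f -> %.2f cycles\n",
               rates[i], amat(rates[i], 1.0, 50.0));
    /* hit rate 0.98 -> 1.98 cycles, matching the example above */
    return 0;
}
```

Note how fast the miss penalty dominates: going from 95% to 98% hits cuts
the average from 3.45 to 1.98 cycles.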
Secondary (and tertiary) caches can be effective at reducing
miss rate, but they must be much larger
Secondary caches might be 256K bytes or larger
Tertiary caches are measured in megabytes
Read policies:
Reading instructions from memory - I-cache
Possibility: store partially-decoded instructions in a
trace cache - cache the equivalent of micro-ops (this is
what the Intel Pentium 4 does to cache the RISC-equivalent of its
actual CISC instructions)
Reading data - D-cache
Policies for writing to cache
instructions are read-only - motivation for Harvard
cache
data is mutable (usually) - keep track of dirty lines
if a value has changed, it must be written back to memory
at some point (at least before it gets evicted)
if the line never changed, there is no need to write it back -
this avoids unnecessary writebacks
Write policies:
Write-through - memory is always updated on a write, but
usually only when the MMU is not otherwise busy
Write-back - memory is updated only when the line is
evicted
Probably need some sort of "write queue" in either case
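A write-back sketch in C (write_to_memory is a made-up stand-in for the
write queue; field names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 16

typedef struct {
    bool     valid;
    bool     dirty;                /* line differs from main memory */
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

/* Stand-in for the write queue / memory interface. */
void write_to_memory(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;         /* a real system would queue this write */
}

/* Write-back: on a write hit, update only the cache and mark it dirty. */
void write_hit(cache_line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;            /* we now owe memory an update */
}

/* On eviction, only dirty lines need to be written back. */
void evict(cache_line *line) {
    if (line->valid && line->dirty)
        write_to_memory(line->tag, line->data);  /* queue the writeback */
    line->valid = false;
    line->dirty = false;
}
```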
Another thing to worry about - what if you write before you
read? (initialization) Read first? Perhaps "write-around" for
writes to lines not in cache? But will we just be reading them soon
anyway?
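One way to picture that choice, sketched in C (the helper names and the
compile-time switch are invented for illustration; the notes leave this
as an open question):

```c
#include <stdint.h>
#include <stdbool.h>

bool cache_try_write(uint32_t addr, uint8_t value); /* true on write hit */
void cache_allocate(uint32_t addr);                 /* fetch line into cache */
void memory_write(uint32_t addr, uint8_t value);    /* bypass the cache */

#define WRITE_ALLOCATE 1  /* flip to 0 for write-around */

/* On a write miss: either read the line in first and then write to it
 * (write-allocate, betting we will read it soon anyway), or send the
 * write straight to memory and skip the cache (write-around). */
void write_byte(uint32_t addr, uint8_t value) {
    if (cache_try_write(addr, value))
        return;                     /* line present: an ordinary write hit */
#if WRITE_ALLOCATE
    cache_allocate(addr);           /* read first... */
    cache_try_write(addr, value);   /* ...then the write hits */
#else
    memory_write(addr, value);      /* don't pollute the cache */
#endif
}

/* Trivial stubs so the sketch compiles. */
bool cache_try_write(uint32_t addr, uint8_t value) { (void)addr; (void)value; return false; }
void cache_allocate(uint32_t addr)                 { (void)addr; }
void memory_write(uint32_t addr, uint8_t value)    { (void)addr; (void)value; }
```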
Another technique: victim cache
place to keep recently evicted lines
likely candidates for reinsertion
at least keep them here until written back
Especially useful for direct-mapped caches; it can also benefit
set-associative caches
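A sketch of the idea in C - a tiny, fully associative buffer that evicted
lines pass through before being dropped or written back (size and names
are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define VICTIM_SLOTS 4             /* victim caches are tiny */
#define LINE_SIZE    16

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line victims[VICTIM_SLOTS];
static int next_slot;              /* simple round-robin replacement */

/* On eviction from the main cache, park the line here instead of
 * dropping it - it is a likely candidate for reinsertion. */
void victim_insert(const cache_line *line) {
    victims[next_slot] = *line;
    next_slot = (next_slot + 1) % VICTIM_SLOTS;
}

/* On a main-cache miss, check here before going all the way to memory. */
cache_line *victim_lookup(uint32_t tag) {
    for (int i = 0; i < VICTIM_SLOTS; i++)
        if (victims[i].valid && victims[i].tag == tag)
            return &victims[i];    /* hit: reinsert into the main cache */
    return 0;                      /* truly gone: go to memory */
}
```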
Multiprocessor issues - cache coherency
Example: the POWER4 architecture has 4 cores, each with
its own L1 cache, but the 4 share an L2 cache
If a line is in 2 L1 caches, how do we maintain consistency?