Phase 2 due today - I will run some stats this weekend
Jump right in on phase 3!
Questions and answers?
Register Windows (notes from last time)
Cache Memory
CPU performance doubles every 18-24 months
Address space increases by about 1.5 bits per year -- roughly a
factor of four every 18-24 months
Improvements in memory technology are mostly focused on
increased density, not increased speed
Caches help keep the needed data close to the CPU
A small, fast memory is placed between the CPU and main
memory
Whenever an address/value pair is read from main memory, it
is written to the cache
Whenever a value is needed for an address, we first look
for it in the cache, and only if it is not found there do we look
to main memory
When we write a value, we write it to cache and continue -
we update it in memory later, concurrently with other
computations
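In C, that read protocol might look like the sketch below (every helper
name here is made up for illustration; none of them come from the notes):

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the two levels of the hierarchy -
 * these names are invented for this sketch. */
bool     cache_lookup(uint32_t addr, uint32_t *value);   /* fast, may miss */
void     cache_fill(uint32_t addr, uint32_t value);
uint32_t memory_read(uint32_t addr);                     /* slow, never misses */

/* The read protocol above: look in the cache first; only on a miss
 * go to main memory, filling the cache on the way back. */
uint32_t read_word(uint32_t addr) {
    uint32_t value;
    if (cache_lookup(addr, &value))
        return value;              /* hit: the data was close to the CPU */
    value = memory_read(addr);     /* miss: the slow path */
    cache_fill(addr, value);       /* the next access to addr will hit */
    return value;
}

/* Trivial stubs so the sketch compiles; later examples flesh them out. */
bool cache_lookup(uint32_t addr, uint32_t *value) { (void)addr; (void)value; return false; }
void cache_fill(uint32_t addr, uint32_t value)    { (void)addr; (void)value; }
uint32_t memory_read(uint32_t addr)               { return addr; /* pretend */ }
```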
The cache is an associative memory, where we can look up
a memory value by its address in a table
key=address, data=value at that address
Ideally, we can look up the key instantly - search all at
once - think parallel circuitry
Realistically, there is a hash function that maps addresses to
possible cache entries
Depending on the hash function, we may or may not need to
store the key itself in the cache
The hash function depends on how we organize the cache
Generally there is an organization into lines of cache
- groups of addresses that are stored (and read/written) in the
cache together (Typically 16 or 32 bytes)
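As a concrete sketch (in C, with illustrative names and the 16-byte lines
mentioned above), one line of cache carries its key plus some bookkeeping
bits:

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 16   /* bytes per line, one of the typical sizes above */

/* One line of cache: tag is the stored key identifying which group of
 * addresses lives here; valid says whether the line holds anything at
 * all; dirty (used later, for writes) says it differs from memory. */
typedef struct {
    bool     valid;
    bool     dirty;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;
```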
Direct caches:
Here, we allow for direct access to a cache line from memory
by breaking the address down into three parts:
101 | 10111 | 1011
(key/tag | line index | offset within line)
Use the middle chunk as the cache line number - the cache address, or
index.
In this case, we would have 16 addresses per line in the
cache, and 32 lines of cache (4 offset bits, 5 index bits)
Only 1/8 of our memory actually fits in cache (32 lines x 16 bytes =
512 of the 4096 possible addresses), but each address,
if it is in the cache, is in one specific location
We need only check whether the entry at the desired line holds the
actual address we want: if so, use it; if not, go get it
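A sketch of that check in C, using the 3/5/4-bit split from the example
above (so a 12-bit address space; all names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdio.h>

#define NUM_LINES 32                       /* 5 index bits */
#define LINE_SIZE 16                       /* 4 offset bits */

typedef struct {
    bool     valid;
    uint32_t tag;                          /* the top 3 address bits */
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_LINES];

/* Split a 12-bit address into tag | index | offset and test for a hit. */
bool lookup(uint32_t addr, uint8_t *value) {
    uint32_t offset = addr & 0xF;          /* low 4 bits: byte within line */
    uint32_t index  = (addr >> 4) & 0x1F;  /* middle 5 bits: which line */
    uint32_t tag    = addr >> 9;           /* top 3 bits: which 1/8 of memory */

    if (cache[index].valid && cache[index].tag == tag) {
        *value = cache[index].data[offset];
        return true;                       /* hit: only one place to check */
    }
    return false;                          /* miss: go get it from memory */
}

int main(void) {
    uint8_t v;
    /* 101 10111 1011 from the example above is 0xB7B */
    printf("hit? %d\n", lookup(0xB7B, &v));
    return 0;
}
```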
Fully Associative caches
Here, a cache entry is not restricted to a single location
- in fact, it can be in any line
The key is all of the address bits above those that specify an
entry within a line
When looking up, we have to check every entry in the
cache to see if it's in there (sounds like a nasty search, but
maybe we can build a circuit for this)
If we find it, use it, if we don't, we need to go get it
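That search, sketched sequentially in C (the parallel circuitry mentioned
above would do all of these comparisons at once; names and sizes are
illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_LINES 32
#define LINE_SIZE 16               /* 4 offset bits, as before */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_LINES];

/* Fully associative: the line could be anywhere, so check every entry. */
bool lookup(uint32_t addr, uint8_t *value) {
    uint32_t offset = addr & 0xF;
    uint32_t tag    = addr >> 4;   /* the key: all bits above the offset */

    for (int i = 0; i < NUM_LINES; i++) {
        if (cache[i].valid && cache[i].tag == tag) {
            *value = cache[i].data[offset];
            return true;           /* found it, wherever it was */
        }
    }
    return false;                  /* miss: fetch the line - and pick a victim */
}
```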
But then which line do we replace? We could imagine lots of
approaches to "select a victim" -
LRU - least-recently used
LRU approximations
random
saturating counters - keep track of frequency of recent access
replace non-dirty (unmodified) lines
The decision must be fast but a poor decision will lead to
another cache miss in the near future - also expensive
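For concreteness, exact LRU sketched in C with a per-line timestamp
(illustrative; real hardware usually settles for the approximations above,
because keeping and comparing full timestamps is expensive):

```c
#include <stdint.h>

#define NUM_LINES 32

static uint64_t last_used[NUM_LINES];  /* timestamp of each line's last use */
static uint64_t now;                   /* bumped on every access */

/* Record a use: called on every hit to line i. */
void touch(int i) {
    last_used[i] = ++now;
}

/* Exact LRU: the victim is the line whose last use is oldest. */
int select_victim(void) {
    int victim = 0;
    for (int i = 1; i < NUM_LINES; i++)
        if (last_used[i] < last_used[victim])
            victim = i;
    return victim;
}
```

Scanning every timestamp is exactly the kind of slow decision warned about
above - hence the approximations.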
Set Associative Mapping
In between the other approaches - each address maps to a
small set of possible lines
For 2-way, each address could be in one of two places
Easier to check than in the fully associative case, but if
we are using 3 or 4 lines repeatedly, we will end up with lots
of cache misses
But still better off than the direct mapped
Lookups within a set are almost certainly done in parallel - the
hardware compares the tags of all the ways in a set at once
There are tradeoffs and all three approaches (and various
levels of associativity within set associative) are really used
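A 2-way sketch in C: the index now selects a set of two lines, and only
those two tags need checking (sizes are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS  16                       /* 4 index bits */
#define WAYS      2                        /* 2-way set associative */
#define LINE_SIZE 16                       /* 4 offset bits */

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line cache[NUM_SETS][WAYS];

/* Each address maps to one set but may live in either way of that set. */
bool lookup(uint32_t addr, uint8_t *value) {
    uint32_t offset = addr & 0xF;
    uint32_t set    = (addr >> 4) & 0xF;   /* which set of two lines */
    uint32_t tag    = addr >> 8;           /* everything above set + offset */

    for (int way = 0; way < WAYS; way++) { /* hardware checks both at once */
        if (cache[set][way].valid && cache[set][way].tag == tag) {
            *value = cache[set][way].data[offset];
            return true;
        }
    }
    return false;  /* miss: pick a victim among just these two ways */
}
```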
There are three ways that a memory value will fail to be found
in cache:
a compulsory miss--the first time a value is
needed, it must be retrieved from memory
a conflict miss--since our last access another
value that maps to the same cache location has evicted the
value we want
a capacity miss--the value we desire has been
evicted not because of a direct conflict, but simply because the
cache is too small to hold everything in use (typically it was the
least recently used value)
Things to consider:
Data and instructions are often separated (a Harvard
cache); combining them is called a unified cache
Most primary (Level 1) caches hold a small amount of
data--8K to 32K bytes (and this value has been fairly constant
for 40 years!)
A good primary cache delivers better than 95% of requests
- Why? Locality, locality, locality
Spatial locality and temporal locality - it's why caches
work
In practice, 2-way yields hit rates around 90%, 4-way around 95%,
and 8-way close to 100%
Missing 2% of requests to a memory that is 50 times slower
means the average access time is 0.98 + 0.02*50 = 1.98 cycles, or
about half of the desired speed - and memory is often much slower
than that!
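The same arithmetic, generalized in C so the hit rates above can be
plugged in (the 1-cycle hit and 50-times-slower memory are the assumptions
from the example):

```c
#include <stdio.h>

/* Average access time: hits cost hit_time, misses cost miss_time. */
double amat(double hit_rate, double hit_time, double miss_time) {
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time;
}

int main(void) {
    double rates[] = { 0.90, 0.95, 0.98, 0.99 };
    for (int i = 0; i < 4; i++)
        printf("hit rate %.2f -> %.2f cycles\n",
               rates[i], amat(rates[i], 1.0, 50.0));
    /* hit rate 0.98 -> 1.98 cycles, matching the example above */
    return 0;
}
```

Note how fast the miss penalty dominates: going from 95% to 98% hits cuts
the average from 3.45 to 1.98 cycles.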
Secondary (and tertiary) caches can be effective at reducing
miss rate, but they must be much larger
Secondary caches might be 256K bytes or larger
Tertiary caches are measured in megabytes
Read policies:
Reading instructions from memory - I-cache
Possibility: store partially-decoded instructions in a
trace cache - cache the equivalent of micro-ops (this is
what the Intel Pentium 4 does to cache the RISC-equivalent of its
actual CISC instructions)
Reading data - D-cache
Policies for writing to cache
instructions are read-only - motivation for Harvard
cache
data is mutable (usually) - keep track of dirty lines
if a value has changed, it must be written back to memory
at some point (at least before it gets evicted)
if the line never changed, there is no need to write it back -
this avoids unnecessary writebacks
Write policies:
Write-through - memory is always updated on a write, but
usually only when the MMU is not otherwise busy
Write-back - memory is updated only when the line is
evicted
Probably need some sort of "write queue" in either case
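A write-back sketch in C (write_to_memory is a made-up stand-in for the
write queue; field names are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_SIZE 16

typedef struct {
    bool     valid;
    bool     dirty;                /* line differs from main memory */
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

/* Stand-in for the write queue / memory interface. */
void write_to_memory(uint32_t tag, const uint8_t *data) {
    (void)tag; (void)data;         /* a real system would queue this write */
}

/* Write-back: on a write hit, update only the cache and mark it dirty. */
void write_hit(cache_line *line, uint32_t offset, uint8_t value) {
    line->data[offset] = value;
    line->dirty = true;            /* we now owe memory an update */
}

/* On eviction, only dirty lines need to be written back. */
void evict(cache_line *line) {
    if (line->valid && line->dirty)
        write_to_memory(line->tag, line->data);  /* queue the writeback */
    line->valid = false;
    line->dirty = false;
}
```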
Another thing to worry about - what if you write before you
read? (initialization) Read first? Perhaps "write-around" for
writes to lines not in cache? But will we just be reading them soon
anyway?
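One way to picture that choice, sketched in C (the helper names and the
compile-time switch are invented for illustration; the notes leave this
as an open question):

```c
#include <stdint.h>
#include <stdbool.h>

bool cache_try_write(uint32_t addr, uint8_t value); /* true on write hit */
void cache_allocate(uint32_t addr);                 /* fetch line into cache */
void memory_write(uint32_t addr, uint8_t value);    /* bypass the cache */

#define WRITE_ALLOCATE 1  /* flip to 0 for write-around */

/* On a write miss: either read the line in first and then write to it
 * (write-allocate, betting we will read it soon anyway), or send the
 * write straight to memory and skip the cache (write-around). */
void write_byte(uint32_t addr, uint8_t value) {
    if (cache_try_write(addr, value))
        return;                     /* line present: an ordinary write hit */
#if WRITE_ALLOCATE
    cache_allocate(addr);           /* read first... */
    cache_try_write(addr, value);   /* ...then the write hits */
#else
    memory_write(addr, value);      /* don't pollute the cache */
#endif
}

/* Trivial stubs so the sketch compiles. */
bool cache_try_write(uint32_t addr, uint8_t value) { (void)addr; (void)value; return false; }
void cache_allocate(uint32_t addr)                 { (void)addr; }
void memory_write(uint32_t addr, uint8_t value)    { (void)addr; (void)value; }
```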
Another technique: victim cache
place to keep recently evicted lines
likely candidates for reinsertion
at least keep them here until written back
Especially useful for direct-mapped caches; it can also benefit
set-associative caches
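A sketch of the idea in C - a tiny, fully associative buffer that evicted
lines pass through before being dropped or written back (size and names
are illustrative):

```c
#include <stdint.h>
#include <stdbool.h>

#define VICTIM_SLOTS 4             /* victim caches are tiny */
#define LINE_SIZE    16

typedef struct {
    bool     valid;
    uint32_t tag;
    uint8_t  data[LINE_SIZE];
} cache_line;

static cache_line victims[VICTIM_SLOTS];
static int next_slot;              /* simple round-robin replacement */

/* On eviction from the main cache, park the line here instead of
 * dropping it - it is a likely candidate for reinsertion. */
void victim_insert(const cache_line *line) {
    victims[next_slot] = *line;
    next_slot = (next_slot + 1) % VICTIM_SLOTS;
}

/* On a main-cache miss, check here before going all the way to memory. */
cache_line *victim_lookup(uint32_t tag) {
    for (int i = 0; i < VICTIM_SLOTS; i++)
        if (victims[i].valid && victims[i].tag == tag)
            return &victims[i];    /* hit: reinsert into the main cache */
    return 0;                      /* truly gone: go to memory */
}
```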
Multiprocessor issues - cache coherency
Example: the POWER4 architecture has 4 cores, each with
its own L1 cache, but the 4 share an L2 cache
If a line is in 2 L1 caches, how do we maintain consistency?