Memory
Systems
CH1 Cache Principles
Prof. Ren-Shuo Liu
Outline
• Cache objectives and the classes of caches
• Locality principles and the classes of locality
• Design choices of caches
2
Cache Objectives
• Basic objectives
• Speed up accesses to storage devices, including
• Tape drives
• Disk drives
• Main memory
• Networked servers
• Other caches
• …
• Previously known as
• Look-aside [1962]
• Slave memory [1965]
3
Kitchen Analogy
[Link] 4
Implementation Technologies
• Caches can be built of anything that holds state
• A cache fronting a given storage device would be
built from a technology faster than the storage
device
• Examples (cache technology as a cache for a backing store):
• SRAM as a cache for DRAM
• DRAM as a cache for disk
• Disk as a cache for tape
• Fast SRAM as a cache for slow SRAM
• Local disk as a cache for a remote disk array
• Local server as a cache for a remote server
• …
5
Implementation Technologies
• The cache technology typically costs more on a per-
bit basis
• A cache only needs to be large enough to hold the
application’s working set to be effective
• Working set: the instructions and data the application is
currently using
• Most of the application’s accesses will be satisfied out of
the working set (due to locality)
• Most of the time, the access characteristics will therefore
be those of the cache
6
Cache Objectives
• In addition to performance, can a cache improve
the following metrics?
• Power
• Energy
• Energy delay product
• Cost
• Predictable behaviors
• Reliability
7
Software vs. Hardware Caches
• [Figure: examples spanning hardware and software caches: DSP and CPU caches, router IP-translation caches, storage caches, and web-page caches]
Transparent vs. Explicitly Managed
• Two types of addressing styles
• Transparent cache
• Uses the same namespace (address space) as the backing
store
• Cached data is a related, dependent copy of the original data
• Invisible to clients
• Works independently of the client making the request
• Explicitly managed / non-transparent cache
• Uses a separate namespace (address space)
• Cached data are independent of the original data
• Requires the client’s direct control to move data into and out
of the cache
Namespace / Address Space
• Two commonly seen examples
• CPU-style transparent caches
• DSP-style non-transparent caches (aka scratch-pad memories)
• [Figure: both styles backed by DRAM]
Outline
• Cache objectives and the classes of caches
• Locality principles and the classes of locality
• Design choices of caches
11
Cache Principle
• Caches work because of locality
• Locality
• The tendency of applications to access a predictably
small amount of data within a given window of time
• Caches work poorly for applications with little
locality
• E.g., synthetic applications that intentionally generate
random accesses
12
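The contrast between locality-rich and locality-poor applications can be made concrete with a toy simulation. The sketch below, a minimal fully associative LRU cache with made-up access patterns (the cache size, working-set size, and address range are all illustrative assumptions), shows a looping workload enjoying a near-perfect hit rate while a synthetic random-access workload gets almost none.

```python
import random

def hit_rate(addresses, capacity):
    """Simulate a tiny fully associative LRU cache; return its hit rate."""
    cache = []                      # least recently used at the front
    hits = 0
    for addr in addresses:
        if addr in cache:
            hits += 1
            cache.remove(addr)      # refresh this address's LRU position
        elif len(cache) >= capacity:
            cache.pop(0)            # evict the least recently used address
        cache.append(addr)
    return hits / len(addresses)

# A loop over a small working set exhibits strong locality...
looping = [a for _ in range(100) for a in range(8)]
# ...while synthetic random accesses exhibit almost none.
random.seed(0)
scattered = [random.randrange(10_000) for _ in range(800)]

print(hit_rate(looping, capacity=16))    # close to 1.0
print(hit_rate(scattered, capacity=16))  # close to 0.0
```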
Observations about Locality
• Memory access patterns of programs are not
random
• Accesses tend to repeat themselves in time
• Accesses tend to be near each other in the memory
address space
• Notes
• These are merely observations
• Not guarantees nor provably inherent behaviors
• Computer architects continue observing programs to
discover new behaviors
13
Classes of Locality
• Temporal locality
• If the program references a datum once, it is likely in the
near future to reference that datum again
• Spatial locality
• If the program references a datum once, it is likely in the
near future to reference nearby data as well
• Algorithmic locality
• The program repeatedly accesses data items or executes
instructions that are distributed widely across the memory
space
• Value locality
• The likelihood that a previously seen value recurs
within a storage location
14
Temporal Locality
• Reasons
• Loops
• Shared data
• Among expressions
• Among threads
• Among processes
• Basic policy to exploit it: demand-fetch
• When the program demands (references) an instruction
or data item, the cache hardware or software fetches
the item from the memory and retains it in the cache
• Before looking to memory for a particular data item, the
cache is searched for a local copy
15
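The demand-fetch policy above can be sketched in a few lines. The class and its dict-based backing store are illustrative assumptions, not a real cache design; the point is the order of operations: search the cache first, and fetch from memory only on a miss.

```python
class DemandFetchCache:
    """Sketch of demand fetch: fetch an item only when it is referenced."""
    def __init__(self, backing_store):
        self.backing_store = backing_store   # dict modeling slow memory
        self.cache = {}                      # unbounded for simplicity
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # Before looking to memory, search the cache for a local copy.
        if address in self.cache:
            self.hits += 1
        else:
            self.misses += 1
            # Demand fetch: bring the item in and retain it in the cache.
            self.cache[address] = self.backing_store[address]
        return self.cache[address]

memory = {addr: addr * 10 for addr in range(100)}
cache = DemandFetchCache(memory)
for _ in range(3):       # temporal locality: address 5 is referenced repeatedly
    cache.read(5)
print(cache.hits, cache.misses)   # 2 1
```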
Temporal Locality
• The only real limitation in exploiting temporal
locality is cache capacity
• If the cache is large enough, no data items ever need be
fetched from the backing store or written to the backing
store more than once
• If the cache is not large enough, a policy is required to
deal with eviction/replacement
16
Spatial Locality
• Reasons
• Programmers and compilers tend to cluster related
objects (data and instructions) together in the memory
space
• Programs, like humans, tend to work on related
objects one right after the other
• The classic example: array and array processing
• Basic policy to exploit it: lookahead
• When object n is brought into the processor, object n+1 should be
brought in as well
17
Spatial Locality
• Two types of mechanisms for lookahead
• Passive mechanism: Build caches in units called cache
blocks larger than a single object
• Active mechanism: prefetching
• Prefetching
• Speculate on which cache block(s) might be used next
• Example prefetching mechanisms
• One-block lookahead
• Simple extension of the cache block granularity of access
• Prefetching instructions
18
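One-block lookahead can be sketched as follows. The 64-byte block size and the unbounded capacity are simplifying assumptions; the essential behavior is that a miss on block n also brings in block n+1, so a sequential walk hits on every other block.

```python
class LookaheadCache:
    """Sketch of one-block lookahead: a miss on block n also fetches block n+1."""
    def __init__(self, block_size=64):
        self.block_size = block_size
        self.blocks = set()          # resident block numbers (capacity ignored)
        self.hits = 0
        self.misses = 0

    def read(self, address):
        block = address // self.block_size
        if block in self.blocks:
            self.hits += 1
        else:
            self.misses += 1
            self.blocks.add(block)
            self.blocks.add(block + 1)   # speculate the next block is needed soon

cache = LookaheadCache()
for addr in range(0, 512, 64):   # sequential walk: one access per 64-byte block
    cache.read(addr)
print(cache.hits, cache.misses)  # 4 4: every other block hits thanks to lookahead
```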
Spatial Locality
• Limitation of exploiting spatial locality using
prefetching
• Memory access behaviors differ among programs
• Memory access behaviors change over time
• Accuracy of prefetching mechanisms’ prediction
19
Algorithmic Locality
• Algorithm-level behaviors that are predictable but
not capturable by cache blocks or simple lookahead
• Examples
• Graphics processing, which walks through a list of
polygons in the 3D scene over and over and paints a 2D
image of the scene
• The polygon list is a large linked data structure, so temporal
and spatial locality are nonobvious
• Electronic design automation
• Computer simulation, circuit simulation
• Synthesis
• Validation, verification, rule checking
20
Value Locality
• A significant percentage of the time, each static load
instruction retrieves a value from memory that
matches a previously seen value for that load
• [Figure: measured value locality with a history depth of one and with a history depth of 16 (the latter for analyzing the potential of value locality)]
21
Exploiting Value Locality
• Load value prediction unit
• Load value prediction table (LVPT)
• Input: lower-order bits of the PC (program counter) of a load
• Output: the previously loaded value for the same PC
• Load classification table (LCT)
• Input: same as the LVPT
• Output: whether the load tends to be unpredictable,
predictable, or constant
• Constant verification unit (CVU)
• Input: load/store addresses and the LVPT index
• Output: whether the load needs verification or not
22
Load Value Prediction Table (LVPT)
• Example code (actual value loaded in parentheses):
for loop:
103 Load R2, 0(R1) (1)
104 Load R3, 4(R1) (7 or 9)
... Bnez R1, 103
203 Load R4, 8(R1) (1)
302 Load R4, 0(R1) (1)
303 Load R5, 0(R6) (8)
• The LVPT here is direct-mapped with N = 10 entries,
indexed by PC mod 10 (N is chosen to be ~1024 in a real design)
• Walkthrough: each executed load stores its actual value
into its entry, and the next execution of a load predicts
from that entry
• PC 103: entry 3 stores 1
• PC 104: entry 4 stores 7 (later 9)
• PC 203: entry 3 stores 1, constructive interference (it
shares entry 3 with PC 103 but loads the same value)
• PC 302: entry 2 stores 1, showing that duplicated entries
may exist (the same value as entry 3)
• PC 303: entry 3 stores 8, destructive interference (it
overwrites the value that PCs 103 and 203 rely on)
23
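The walkthrough above can be sketched as a small array indexed by PC mod 10, using the slides' example PCs and loaded values (a real design uses ~1024 entries):

```python
# Direct-mapped LVPT with 10 entries, indexed by PC mod 10
# (a real design uses ~1024 entries).
N = 10
lvpt = [0] * N

def predict(pc):
    return lvpt[pc % N]

def update(pc, actual_value):
    lvpt[pc % N] = actual_value

update(103, 1)       # entry 3 <- 1
update(104, 7)       # entry 4 <- 7
update(203, 1)       # entry 3 <- 1: constructive interference with PC 103
update(302, 1)       # entry 2 <- 1: a duplicated entry holding the same value
print(predict(103))  # 1: PCs 103 and 203 share entry 3 harmlessly
update(303, 8)       # entry 3 <- 8: destructive interference
print(predict(103))  # 8: PC 103's prediction was clobbered by PC 303
```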
Load Classification Table (LCT)
• Each entry is a 1-bit or 2-bit saturating counter
• Direct-mapped with N = 10 entries, indexed like the LVPT
(N is chosen to be ~1024 in a real design)
• [Figure: the example loads at PCs 103, 104, 203, 302, and 303 each index a counter entry holding values such as 11, 10, 01, and 00]
31
Load Classification Table (LCT)
• 2-bit LCT counter states and the resulting classification of the load:
• 11: constant load (highly predictable)
• 10: predictable
• 01 and 00: unpredictable
• A successful prediction moves the counter up; a
misprediction moves it down
32
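The behavior of one 2-bit counter can be sketched as follows; the training policy (count up on a successful prediction, down on a misprediction) follows the slide, and the state-to-class mapping uses the table above.

```python
# One 2-bit saturating LCT counter: 3 (0b11) = constant, 2 (0b10) = predictable,
# 1 or 0 = unpredictable. Successful predictions count up, mispredictions down.
def train(counter, prediction_correct):
    if prediction_correct:
        return min(counter + 1, 3)   # saturate at 0b11
    return max(counter - 1, 0)       # saturate at 0b00

def classify(counter):
    return {3: "constant", 2: "predictable"}.get(counter, "unpredictable")

counter = 0
for correct in [True, True, True, True, False]:
    counter = train(counter, correct)
print(classify(counter))   # "predictable": saturated at 3, then one misprediction
```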
Constant Verification Unit (CVU)
• Basically, we still need to verify the actual value for
predictable and even constant loads
• Memory accesses are not eliminated
• There do exist constant loads that rarely or never
change their loaded values
• For such loads, verification becomes a waste of time,
energy, and memory bandwidth
• The CVU aims to eliminate these memory accesses
33
Constant Verification Unit (CVU)
• The CVU is a fully associative cache
• i.e., a content-addressable memory (CAM)
• Indexed by the combination of the data address and the
LVPT index (lower bits of the load PC)
• Verification is safely omitted for loads that hit in the CVU
• The CVU caches highly predictable (constant) loads
• The CVU monitors all stores and invalidates entries
whose address matches a store
34
Exploiting Value Locality
35
Outline
• Cache objectives and the classes of caches
• Locality principles and the classes of locality
• Design choices of caches
36
Cache Design and Optimization
• An iterative cycle:
• Identify classes of application / algorithm behaviors
• Identify data types that exhibit recognizable behaviors
• Exploit the predictable behaviors that are exhibited
• New applications / algorithms emerge, and the cycle repeats
37
Key Design Choices of Caches
• Where to put it: logical organization (CH2)
• Arrangement of data stored within its context
• What to cache: content management (CH3)
• The decision to cache or not to cache a particular item
at a particular time during execution
• How to maintain it: consistency management (CH4)
• Ensure that applications receive the correct cached data
38
Logical Organization: Tags, Sets, Ways
• Cache uses tags because
• Cache size is typically smaller than the backing store
• There is a good possibility that any particular requested
datum is not in the cache
• Some mechanism must indicate whether any particular
datum is present in the cache
• Cache is divided into sets and ways because
• A cache can comprise multiple blocks
• Some mechanism must indicate where an incoming
block should be placed
39
Logical Organization: Tags, Sets, Ways
• Two-way set associative example
• The address is split into a tag, a set index, and a
byte-in-block offset
• The set index selects one set; the tag (together with the
valid flag) is compared against the tag stored in each way
of that set
• [Figure: a two-way set associative cache, with each way holding a tag, flags, and the cache data (aka cache block or cache line), and memory ways implementing the cache]
40
Logical Organization: Tags, Sets, Ways
• Set associative caches strike a balance between
direct-mapped and fully associative caches
• [Figure: the same cache blocks organized as direct-mapped, two-way set associative, four-way set associative, and fully associative]
41
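The tag / set-index / offset split can be sketched with simple bit arithmetic. The parameters here (a 32-bit address, 64-byte blocks, 128 sets) are assumptions for illustration, not values from the slides.

```python
# Splitting a 32-bit address into tag, set index, and byte-in-block offset.
# Parameters are assumed for illustration: 64-byte blocks, 128 sets.
BLOCK_BITS = 6   # log2 of the 64-byte block size -> byte-in-block offset
SET_BITS = 7     # log2 of the 128 sets           -> set index

def decompose(address):
    offset = address & ((1 << BLOCK_BITS) - 1)
    set_index = (address >> BLOCK_BITS) & ((1 << SET_BITS) - 1)
    tag = address >> (BLOCK_BITS + SET_BITS)
    return tag, set_index, offset

tag, set_index, offset = decompose(0x12345678)
print(hex(tag), set_index, offset)
```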
Content Management
• A question of "to cache or not to cache"
• Determine which memory references are the best
candidates for caching
• Keep data in the cache at the time the data are
requested
• Decide which previously cached data should be evicted
to make room for more important data not yet cached
42
Content Management
• Oracle vs. heuristics (rules of thumb)
• Oracle content management
• Sees into the future
• Among the cached data and the data to be cached,
determines which item’s next reuse is furthest in the future
• Evicts that item to make room (if necessary)
• An oracle is optimal but usually unavailable
43
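Oracle replacement (often called Belady's MIN) can be sketched as follows. This is a simplified version, an assumption of this sketch rather than the slides: it always caches the incoming item and uses future knowledge only to choose the victim.

```python
def oracle_misses(trace, capacity):
    """Count misses under oracle (Belady's MIN) replacement: evict the cached
    item whose next reuse is furthest in the future."""
    cache, misses = set(), 0
    for i, item in enumerate(trace):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            future = trace[i + 1:]
            # Next-use distance per cached item; never reused counts as infinity.
            victim = max(cache,
                         key=lambda x: future.index(x) if x in future else float("inf"))
            cache.remove(victim)
        cache.add(item)
    return misses

print(oracle_misses(["a", "b", "c", "a", "b", "d", "a", "b"], capacity=2))  # 6
```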
Content Management
• Heuristics approximate the oracle
• Based on imperfect and/or incomplete information that
describes the data item
• Who is using it
• How it is being used
• How important it is
• At what point in the application's execution is the item being
considered
• Based on the information, the heuristic makes "yes/no"
decision to cache the item
44
Content Management
• Heuristics' decisions
• Can be made at any of several different times
• When the programmer writes code
• When the code is compiled
• When the application code is executed
• By any of several different entities
• Programmers
• Compilers
• Operating systems
• Application code itself
• Cache itself
45
Content Management
• Heuristics' decisions
• Can be predictive and proactive
• E.g., the compiler determines that a particular memory access
should hit the cache, so steps such as prefetch instructions
are added to the code to bring in the corresponding data
• Can be reactive
• E.g., the cache determines that a requested item is valuable
and retains it in the cache
46
Consistency Management
• Three main charges
• Keep the cache consistent with itself
• Keep the cache consistent with the backing store
• Keep the cache consistent with other caches
47
Consistent with Itself
• The cache would be inconsistent if there were two copies
of a single item in different places in the cache
• Different sets
• Different ways of a set
• Cause of such situations: synonyms
• In a virtually indexed cache, the OS may allow two unrelated
virtual addresses to map to the same physical address
• The same issue arises with a virtually tagged cache and
with a set-associative cache
48
Consistent with Backing Store
• The backing store should be kept up-to-date
with any changes made to the version stored in the
cache
• Two typical mechanisms
• Write through
• Write back
49
Consistent with Backing Store
• Write through
• Any change is immediately propagated to the backing store
• Pros: the smallest possible window in which the backing
store holds an outdated copy of a datum
• Cons: the backing store typically cannot provide the
same write throughput as the cache
• Improvement: an additional write buffer that is as fast as
the cache but logically part of the backing store
• [Figure: L1 caches writing through to the backing store]
50
Consistent with Backing Store
• Write back
• Changes are made to the cache and later propagated to
the backing store
• Pros: the locality of writes results in fewer writes to
the backing store (write coalescing)
• Cons: other processes (e.g., an editor and a browser)
and other processors may see outdated data
51
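The write-coalescing benefit of write back can be illustrated with a toy counter of writes that reach the backing store. This sketch models no evictions, so dirty blocks flush only once at the end; both simplifications are assumptions of the example.

```python
def backing_store_writes(write_addresses, policy):
    """Count writes that reach the backing store under each write policy.
    Simplified: no evictions, so write-back dirty blocks flush only at the end."""
    dirty, writes = set(), 0
    for addr in write_addresses:
        if policy == "write-through":
            writes += 1        # every write is immediately propagated
        else:                  # write-back
            dirty.add(addr)    # mark the block dirty; defer the write
    if policy == "write-back":
        writes += len(dirty)   # write coalescing: one flush per dirty block
    return writes

stream = [0x40] * 100          # 100 writes to the same address
print(backing_store_writes(stream, "write-through"))  # 100
print(backing_store_writes(stream, "write-back"))     # 1
```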
Consistent with Other Caches
• Other caches in the system should all be treated
much like the backing store in terms of keeping the
various copies of a datum up-to-date
• [Figure: two caches holding different values (y = 0 and y = 1) of a shared variable y]
52
Inclusion and Exclusion
• A cache system is vertically and horizontally partitioned
• Vertical partitioning: move more frequently accessed data into
cache levels that are closer to the processors
• Horizontal partitioning: each partition consumes a fraction of the
whole cache’s energy and is only driven when a reference targets it
• [Figure: a cache hierarchy partitioned vertically and horizontally]
53
Inclusion and Exclusion
• Principles of inclusion and exclusion define a
particular class of relationship that exists between
any two partitions in a cache system
• Inclusive relationship
• Every cached item found in one of the units has a copy
found in the other
• Exclusive relationship
• Any given item is either in A, in B, or in neither; it
absolutely should not be in both
54
Inclusive vs. Exclusive
• Consider inclusive L1, L2, and LLC (last-level
cache) caches
• Every item in L1 is also held in L2 and the LLC
• Pros
• Clean contents of L1 can be discarded at any time
without affecting the correctness of the system
• Detecting whether a datum is present can be
accomplished by checking the LLC only (without checking L1)
• Cons
• Keeping all the copies spread around the system
consistent is relatively complex
• The existence of multiple copies lowers the effective cache
capacity
55
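The two relationships can be checked directly when cache levels are modeled as sets of resident block addresses; the addresses below are made up for illustration.

```python
# Cache levels modeled as sets of resident block addresses (made-up values).
def is_inclusive(upper, lower):
    return upper <= lower        # every block in the upper level is also below

def is_exclusive(upper, lower):
    return not (upper & lower)   # no block resides in both levels

l1 = {0x100, 0x140}
l2 = {0x100, 0x140, 0x180, 0x1C0}
print(is_inclusive(l1, l2), is_exclusive(l1, l2))  # True False
```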