Understanding Cache Principles and Locality

The document discusses cache principles, objectives, and design choices, emphasizing the importance of locality in cache performance. It outlines different types of locality, including temporal, spatial, algorithmic, and value locality, and how they can be exploited to improve cache efficiency. Additionally, it covers various cache implementation technologies and the distinction between transparent and explicitly managed caches.


Memory Systems
CH1 Cache Principles
Prof. Ren-Shuo Liu
Outline
• Cache objectives and the classes of caches
• Locality principles and the classes of locality
• Design choices of caches

2
Cache Objectives
• Basic objectives
• Speed up accesses to storage devices, including
• Tape drives
• Disk drives
• Main memory
• Networked servers
• Other caches
• …
• Previously called
• Look-aside [1962]
• Slave memory [1965]

3
Kitchen Analogy

4
Implementation Technologies
• Caches can be built of anything that holds state
• A cache fronting a given storage device would be
built from a technology faster than the storage
device
• SRAM as a cache for DRAM
• DRAM as a cache for disk
• Disk as a cache for tape
• Fast SRAM as a cache for slow SRAM
• Local disk as a cache for a remote disk array
• Local server as a cache for a remote server
• …

5
Implementation Technologies
• The cache technology typically costs more on a per-bit basis
• A cache only needs to be large enough to hold the
application’s working set to be effective
• Working set: instructions and data the application is
currently using
• Most of the application’s accesses will be satisfied out of
the working set (due to locality)
• Most of the time, the access characteristics will be that
of the cache

6
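The working-set idea above can be made concrete with a small sketch (illustrative Python; the function and trace are hypothetical, not from the slides):

```python
# Working-set sketch (illustrative): the distinct addresses an application
# touches within a recent window of accesses.

def working_set(trace, window):
    """Distinct addresses among the last `window` accesses."""
    return set(trace[-window:])

# A loop touching the same three addresses has a working set of size 3,
# so even a tiny cache can satisfy most of its accesses.
trace = [0x100, 0x104, 0x108, 0x100, 0x104, 0x108, 0x100, 0x104]
print(len(working_set(trace, window=6)))  # 3
```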
Cache Objectives
• In addition to performance, can a cache improve
the following metrics?
• Power
• Energy
• Energy delay product
• Cost
• Predictable behaviors
• Reliability

7
Software vs. Hardware Caches

[Figure: examples of hardware caches (DSP, CPU) and software caches (router IP translation, storage, web pages)]

Transparent vs. Explicitly Managed
• Two types of addressing styles
• Transparent cache
• Uses the same namespace (address space), overlapped with the
backing store
• Cached data is a related, dependent copy of the original data
• Invisible to clients
• Works independently of the client making the request
• Explicitly managed / non-transparent cache
• Uses a separate namespace (address space)
• Cached data are independent of the original
• Requires direct control by the client to move data into and out of
the cache
Namespace / Address Space
• Two commonly seen examples
• CPU-style transparent caches
• DSP-style non-transparent caches (aka scratchpad memories)
• Both are backed by DRAM
Outline
• Cache objectives and the classes of caches
• Locality principles and the classes of locality
• Design choices of caches

11
Cache Principle
• Caches work because of locality

• Locality
• The tendency of applications to access a predictably
small amount of data within a given window of time

• Caches work poorly for applications with little locality
• E.g., synthetic applications that intentionally generate
random accesses

12
Observations about Locality
• Memory access patterns of programs are not
random
• Accesses tend to repeat themselves in time
• Accesses tend to be near each other in the memory
address space
• Notes
• These are merely observations
• Not guarantees nor provably inherent behaviors
• Computer architects continue observing programs to
discover new behaviors

13
Classes of Locality
• Temporal locality
• If the program references a datum once, it is likely in the
near future to reference that datum again
• Spatial locality
• If the program references a datum once, it is likely in the
near future to reference nearby data as well
• Algorithmic locality
• The program repeatedly accesses data items or executes
instructions that are distributed widely across the memory
space
• Value locality
• Likelihood of the recurrence of a previously-seen value
within a storage location
14
Temporal Locality
• Reasons
• Loop
• Shared data
• Among expressions
• Among threads
• Among processes
• Basic policy to exploit it: demand-fetch
• When the program demands (references) an instruction
or data item, the cache hardware or software fetches
the item from the memory and retains it in the cache
• Before looking to memory for a particular data item, the
cache is searched for a local copy

15
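The demand-fetch policy above can be sketched as follows (illustrative Python, with hypothetical names):

```python
# Demand-fetch sketch: search the cache first; on a miss, fetch from the
# backing store and retain the item. All names here are illustrative.

def read(cache, backing_store, addr, stats):
    if addr in cache:                  # before looking to memory, search the cache
        stats["hits"] += 1
    else:                              # demand miss: fetch and retain
        stats["misses"] += 1
        cache[addr] = backing_store[addr]
    return cache[addr]

backing_store = {addr: addr * 10 for addr in range(8)}
cache, stats = {}, {"hits": 0, "misses": 0}
for addr in [1, 2, 1, 1, 2, 3]:        # repeated references hit the cache
    read(cache, backing_store, addr, stats)
print(stats)  # {'hits': 3, 'misses': 3}
```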
Temporal Locality
• The only real limitation in exploiting temporal
locality is cache capacity
• If the cache is large enough, no data items ever need be
fetched from the backing store or written to the backing
store more than once
• If the cache is not large enough, a policy is required to
deal with eviction/replacement

16
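A minimal sketch of eviction under limited capacity, assuming an LRU replacement policy (one common heuristic; the slides do not prescribe a specific policy here):

```python
from collections import OrderedDict

# LRU replacement sketch (illustrative): when the cache cannot hold the whole
# working set, evict the entry whose last use is furthest in the past.

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()     # insertion order tracks recency

    def access(self, addr):
        hit = addr in self.data
        if hit:
            self.data.move_to_end(addr)        # refresh recency on a hit
        else:
            if len(self.data) >= self.capacity:
                self.data.popitem(last=False)  # evict least recently used
            self.data[addr] = True
        return hit

cache = LRUCache(capacity=2)
results = [cache.access(a) for a in [1, 2, 1, 3, 2]]
print(results)  # [False, False, True, False, False]
```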
Spatial Locality
• Reasons
• Programmers and compilers tend to cluster related
objects (data and instructions) together in the memory
space
• Human-like way that programs tend to work on related
objects one right after the other
• The classic example: array and array processing
• Basic policy to exploit it: lookahead
• When object n is brought into the processor, object n+1 should be
brought in as well

17
Spatial Locality
• Two types of mechanisms for lookahead
• Passive mechanism: Build caches in units called cache
blocks larger than a single object
• Active mechanism: prefetching
• Prefetching
• Speculate on which cache block(s) might be used next
• Example prefetching schemes
• One-block lookahead
• Simple extension of the cache block granularity of access
• Prefetching instructions

18
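One-block lookahead can be sketched as follows (illustrative Python; the block size and access trace are hypothetical):

```python
# One-block lookahead sketch (illustrative): on a demand miss for block n,
# block n+1 is speculatively fetched as well. The cache is an unbounded set
# of block numbers to keep the focus on the fetch policy.

BLOCK_SIZE = 64  # bytes per block (hypothetical)

def access(cache, addr, stats):
    n = addr // BLOCK_SIZE
    if n in cache:
        stats["hits"] += 1
    else:
        stats["misses"] += 1
        cache.add(n)
    cache.add(n + 1)  # lookahead: prefetch the next sequential block

cache, stats = set(), {"hits": 0, "misses": 0}
for addr in range(0, 256, 8):  # sequential scan with strong spatial locality
    access(cache, addr, stats)
print(stats)  # {'hits': 31, 'misses': 1}
```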
Spatial Locality
• Limitation of exploiting spatial locality using
prefetching
• Memory access behaviors differ among programs
• Memory access behaviors change over time
• Accuracy of prefetching mechanisms’ prediction

19
Algorithmic Locality
• Algorithm-level behaviors that are predictable but
not capturable by cache blocks or simple lookahead
• Examples
• Graphics processing, which walks through a list of
polygons in the 3D scene over and over and paints a
2D image of the scene
• The polygon list is a large linked data structure, so temporal
and spatial locality are nonobvious
• Electronic design automation
• Computer simulation, circuit simulation
• Synthesis
• Validation, verification, rule checking

20
Value Locality
[Figure annotations: history depth of one; history depth of 16 (for analyzing the potential of value locality)]

• A significant percentage of
times each static load
instruction retrieves a
value from memory that
matches a previously-
seen value for that load

21
Exploiting Value Locality
• Load value prediction unit
• Load value prediction table
(LVPT)
• Input: lower-order bits of the
PC (program counter) of a load
• Output: previously loaded value
for the same PC
• Load classification table (LCT)
• Input: same as LVPT
• Output: whether the load
tends to be unpredictable,
predictable, or constant
• Constant verification unit (CVU)
• Input: load/store addresses
and LVPT index
• Output: whether the load
needs verification or not
22
Load Value Prediction Table (LVPT)
• Example code (value loaded in parentheses):
103 Load R2, 0(R1) (1)
104 Load R3, 4(R1) (7 or 9)
... Bnez R1, 103
203 Load R4, 8(R1) (1)
302 Load R4, 0(R1) (1)
303 Load R5, 0(R6) (8)
• LVPT: N = 10 entries, direct-mapped, one word per entry, all initially 0
• N is chosen to be ~1024 in a real design
23
Load Value Prediction Table (LVPT)
• Walkthrough (each load indexes entry PC % 10; the stored word is the predicted value, then updated with the actual loaded value)
• PC 103 → entry 3: predicts 0 (wrong); entry updated 0 → 1
• PC 104 → entry 4: predicts 0 (wrong); entry updated 0 → 7
• PC 103 → entry 3: predicts 1 (correct)
• PC 104 → entry 4: predicts 7 (correct when the loaded value is again 7)
• PC 203 → entry 3: predicts 1 (correct); constructive interference, since a different load maps to the same entry but loads the same value
• PC 302 → entry 2: entry updated 0 → 1; duplicated entries may exist, since the same value is now stored in two entries
• PC 303 → entry 3: predicts 1 but loads 8 (wrong); entry updated 1 → 8; destructive interference
• N is chosen to be ~1024 in a real design
24-30
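The walkthrough above can be reproduced with a short simulation of the direct-mapped LVPT (illustrative Python; only the indexing and update rule come from the slides):

```python
# LVPT simulation matching the walkthrough (N = 10 here; ~1024 in a real
# design). The table is indexed by PC % N; the stored word is the prediction
# and is updated with the actual loaded value afterwards.

N = 10
lvpt = [0] * N  # all entries initially 0

def predict_and_update(pc, actual_value):
    entry = pc % N
    correct = lvpt[entry] == actual_value
    lvpt[entry] = actual_value   # train with the actual loaded value
    return correct

trace = [(103, 1), (104, 7),   # first iteration: cold mispredictions
         (103, 1), (104, 7),   # second iteration: both correct
         (203, 1),             # entry 3 again, same value: constructive interference
         (302, 1),             # entry 2: the value 1 is now duplicated
         (303, 8)]             # entry 3 overwritten 1 -> 8: destructive interference
print([predict_and_update(pc, v) for pc, v in trace])
# [False, False, True, True, True, False, False]
```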
Load Classification Table (LCT)
• One 1-bit or 2-bit saturating counter per entry
• N = 10 entries, direct-mapped, indexed like the LVPT by the lower bits of the load PC
• Example counter states for the loads above: 11, 10, 01, 00
• N is chosen to be ~1024 in a real design
31
Load Classification Table (LCT)
• 2-bit LCT: classification of the load
• 11: constant load (highly predictable)
• 10: predictable
• 01, 00: unpredictable
• A successful prediction moves the counter up; a misprediction moves it down
32
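A sketch of the 2-bit saturating-counter behavior above (illustrative Python; the state encoding follows the slide):

```python
# 2-bit saturating-counter sketch for the LCT (encoding from the slide:
# 00/01 unpredictable, 10 predictable, 11 constant).

def lct_update(counter, correct):
    if correct:
        return min(counter + 1, 3)   # saturate at 11 (binary 3)
    return max(counter - 1, 0)       # saturate at 00

def classify(counter):
    return {3: "constant", 2: "predictable"}.get(counter, "unpredictable")

c = 0
for correct in [True, True, True, True, False]:
    c = lct_update(c, correct)
print(classify(c))  # predictable
```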
Constant Verification Unit (CVU)
• Basically, we still need to verify the actual value for
predictable and even constant loads
• Memory accesses are not eliminated
• There do exist constant loads that rarely or never
change their loaded values
• Verification becomes a waste of time, energy, and
memory bandwidth
• The CVU targets eliminating these memory accesses

33
Constant Verification Unit (CVU)
• The CVU is a fully associative cache
• i.e., a content-addressable memory (CAM)
• Indexed by the combination of data address and LVPT
index (lower bits of the load PC)
• Verification is safely omitted for loads that hit in the CVU
• The CVU caches highly-predictable loads
• The CVU monitors all stores
• The CVU invalidates entries whose address matches a store
[Figure: CAM entries, each comparing the incoming data address and LVPT index for a match]
34
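The CVU bookkeeping can be sketched as follows (illustrative Python; a set stands in for the fully associative CAM):

```python
# CVU bookkeeping sketch (illustrative): entries pair a data address with an
# LVPT index; a hit means verification can safely be omitted, and any store
# to a matching address invalidates the entry.

cvu = set()  # stand-in for the fully associative CAM

def cvu_insert(addr, lvpt_index):
    cvu.add((addr, lvpt_index))

def cvu_hit(addr, lvpt_index):
    return (addr, lvpt_index) in cvu

def cvu_store(addr):
    # the CVU monitors all stores and invalidates matching addresses
    cvu.difference_update({e for e in cvu if e[0] == addr})

cvu_insert(0x1000, 3)
print(cvu_hit(0x1000, 3))  # True: constant load, skip verification
cvu_store(0x1000)          # a store to 0x1000 invalidates the entry
print(cvu_hit(0x1000, 3))  # False: the load must be verified again
```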
Exploiting Value Locality

35
Outline
• Cache objectives and the classes of caches
• Locality principles and the classes of locality
• Design choices of caches

36
Cache Design and Optimization
• Identify classes of application / algorithm behaviors
• Identify data types that exhibit recognizable behaviors
• New applications / algorithms emerge
• Exploit the predictable behaviors that are exhibited
37
Key Design Choices of Caches
• Where to put it: logical organization (CH2)
• Arrangement of data stored within its context
• What to cache: content management (CH3)
• The decision to cache or not to cache a particular item
at a particular time during execution
• How to maintain it: consistency management (CH4)
• Ensure that applications receive the correct cached data
38
Logical Organization: Tags, Sets, Ways
• Cache uses tags because
• Cache size is typically smaller than the backing store
• There is a good possibility that any particular requested
datum is not in the cache
• Some mechanism must indicate whether any particular
datum is present in the cache
• Cache is divided into sets and ways because
• A cache can comprise multiple blocks
• Some mechanism must indicate where an incoming
block should be placed

39
Logical Organization: Tags, Sets, Ways
• Two-way set associative
[Figure: the address is split into a tag, a set index, and a byte-in-block offset; the cache is organized as sets, each containing two ways; each way holds a tag, flags (e.g., valid), and the cache data (aka cache block or cache line); a valid tag match in either way signals a hit]
40
Logical Organization: Tags, Sets, Ways
• Set associative caches strike a balance between
direct-mapped and fully associative caches
[Figure: direct-mapped, two-way set associative, four-way set associative, and fully associative organizations]
41
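The tag / set-index / byte-in-block split can be sketched as follows (illustrative Python; the block size and set count are hypothetical parameters):

```python
# Address split sketch (hypothetical parameters): byte-in-block offset,
# set index, and tag for a set-associative cache.

BLOCK_SIZE = 64   # bytes per cache block
NUM_SETS = 128    # number of sets

def decompose(addr):
    offset = addr % BLOCK_SIZE                   # byte in block
    set_index = (addr // BLOCK_SIZE) % NUM_SETS  # which set to search
    tag = addr // (BLOCK_SIZE * NUM_SETS)        # compared against stored tags
    return tag, set_index, offset

print(decompose(0x12345))  # (9, 13, 5)
```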
Content Management
• A question of "to cache or not to cache"
• Determine which memory references are the best
candidates for caching
• Keep data in the cache at the time the data are
requested
• Decide which previously cached data should be evicted
to make room for more important data not yet cached

42
Content Management
• Oracle vs. heuristics
• Oracle content management
• Sees into the future
• Among the cached data and the data to be cached,
identifies the one whose next reuse is furthest in the future
• Evicts that furthest one to make room (if necessary)
• The oracle is optimal but usually unavailable

43
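The oracle policy above is Belady's optimal replacement; here is a sketch under the assumption that the full future trace is known (illustrative Python):

```python
# Belady's optimal ("oracle") replacement sketch: with the full future trace
# known, evict the cached item whose next reuse is furthest away.

def next_use(trace, item, start):
    try:
        return trace.index(item, start)
    except ValueError:
        return float("inf")   # never reused: the best possible victim

def belady_misses(trace, capacity):
    cache, misses = set(), 0
    for i, item in enumerate(trace):
        if item in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = max(cache, key=lambda c: next_use(trace, c, i + 1))
            cache.remove(victim)
        cache.add(item)
    return misses

print(belady_misses([1, 2, 3, 1, 2, 4, 1, 2], capacity=2))  # 6
```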
Content Management
• Heuristics approximate the oracle
• Based on imperfect and/or incomplete information that
describes the data item
• Who is using it
• How it is being used
• How important it is
• At what point in the application's execution is the item being
considered
• Based on the information, the heuristic makes "yes/no"
decision to cache the item

44
Content Management
• Heuristics' decisions
• Can be made at any of several different times
• When the programmer writes code
• When the code is compiled
• When the application code is executed
• By any of several different entities
• Programmers
• Compilers
• Operating systems
• Application code itself
• Cache itself

45
Content Management
• Heuristics' decisions
• Can be predictive and proactive
• E.g., the compiler determines that a particular memory access
should hit the cache, so steps such as prefetch instructions
are added to the code to bring in the corresponding data
• Can be reactive
• E.g., the cache determines that a requested item is valuable
and retains it in the cache

46
Consistency Management
• Three main charges
• Keep the cache consistent with itself
• Keep the cache consistent with the backing store
• Keep the cache consistent with other caches

47
Consistent with Itself
• It would be inconsistent if there are two copies of a
single item in different places of the cache
• Different sets
• Different ways in a set
• Cause of such situations: synonyms
• Virtual-address-indexed caches
• The OS allows two unrelated virtual addresses to map to the
same physical address
• Virtual-address-tagged caches and set-associative caches

48
Consistent with Backing Store
• The backing store should be kept up-to-date with
any changes made to the version stored in the
cache
• Two typical mechanisms
• Write through
• Write back

49
Consistent with Backing Store
• Write through
• Any change is immediately updated to the backing store
• Pros: smallest possible window in which the backing
store holds an outdated copy of a datum
• Cons: the backing store typically cannot provide the
same write throughput as the cache
• Improvement: an additional write buffer that is as fast as
the cache but logically part of the backing store
[Figure: L1 caches writing through to the backing store]
50
Consistent with Backing Store
• Write back
• Changes are made to the cache and only later updated
to the backing store
• Pros: locality of writes results in reduced writes to
the backing store (write coalescing)
• Cons: other processes and processors may see
outdated data
[Figure: processes, e.g., an editor and a browser, sharing the backing store]
51
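The write-traffic difference between the two mechanisms can be sketched as follows (illustrative Python; the write-back model assumes dirty blocks stay cached until a single final flush, so it shows the best-case coalescing):

```python
# Write-traffic sketch (illustrative): write-through sends every store to the
# backing store; write-back marks blocks dirty and, in this best-case model,
# writes each dirty block out once at a final flush (write coalescing).

def write_through_traffic(stores):
    return len(stores)        # one backing-store write per store

def write_back_traffic(stores):
    dirty = set(stores)       # each dirty block written back once
    return len(dirty)

stores = [0x40, 0x40, 0x40, 0x80, 0x40, 0x80]  # block addresses being written
print(write_through_traffic(stores), write_back_traffic(stores))  # 6 2
```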
Consistent with Other Caches
• Other caches in the system should all be treated
much like the backing store in terms of keeping the
various copies of a datum up-to-date

[Figure: two caches holding inconsistent copies of a shared variable y (y=0 in one, y=1 in the other); y is shared]

52
Inclusion and Exclusion
• A cache system is vertically and horizontally partitioned
• Vertical partitioning: move more frequently accessed data into
cache levels that are closer to the processors
• Horizontal partitioning: each partition consumes a fraction of the
whole cache's energy and is only driven when a reference targets it

[Figure: horizontally and vertically partitioned cache systems]

53
Inclusion and Exclusion
• Principles of inclusion and exclusion define a
particular class of relationship that exists between
any two partitions in a cache system
• Inclusive relationship
• Every cached item found in one of the units has a copy
found in the other
• Exclusive relationship
• Any given item is in one of the two units or in
neither, and it absolutely should not be in both

54
Inclusive vs. Exclusive
• Consider inclusive L1, L2, and LLC (last-level
cache) caches
• Every item in L1 is also held in L2 and the LLC
• Pros
• Clean contents of L1 can be discarded at any time
without affecting the correctness of the system
• Detecting whether a datum is present can be
accomplished by checking LLC only (without L1)
• Cons
• Keeping all the copies spread around the system
consistent is relatively complex
• The existence of multiple copies lowers the effective
cache capacity
55
56
