Memory Hierarchy and Locality Concepts
MEMORY HIERARCHY (SECTIONS 6.1-6.3.1)
Deepak Gangadharan
Computer Systems Group (CSG), IIIT Hyderabad
Temporal locality:
◦ Recently referenced items are likely to be referenced again in the near future.
Spatial locality:
◦ Items with nearby addresses tend to be referenced close together in time.
Data references
◦ Reference array elements in succession (stride-1 reference pattern): spatial locality.
◦ Reference variable sum each iteration: temporal locality.
Instruction references
◦ Reference instructions in sequence: spatial locality.
◦ Cycle through loop repeatedly: temporal locality.
Question: Does this function have good locality with respect to array a?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;

    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
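For contrast, a hypothetical companion function (not from the original slides) that sums the same array column by column accesses a with stride N, so successive accesses land in different cache blocks and spatial locality is poor when N is large:

```c
#define M 4
#define N 4

/* Same computation, but the inner loop walks down a column: the
 * stride between successive accesses is N elements, so each access
 * may touch a different cache block (poor spatial locality). */
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;

    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
```

Both functions compute the same result; only the order of memory accesses differs.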
These properties of hardware and software suggest an approach for organizing memory and storage systems known as a memory hierarchy.
Smaller, faster, and costlier per byte toward the top; larger, slower, and cheaper per byte toward the bottom:
L1: L1 cache (SRAM) holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM) holds cache lines retrieved from main memory.
L3: Main memory (DRAM) holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks) holds files retrieved from disks on remote network servers.
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap
storage near the bottom, but that serves data to programs at the rate of the fast storage near the
top.
(Figure: memory at level k+1 holds blocks 0-15; the smaller cache at level k holds a subset of them.)
Conflict miss
◦ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the block positions at
level k.
◦ E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
◦ Conflict misses occur when the level k cache is large enough, but multiple data objects all map to the
same level k block.
◦ E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
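The 0, 8, 0, 8, ... example can be checked with a toy simulation. The sketch below (illustrative code, not from the slides) models a 4-set direct-mapped cache holding one block per set, with block i mapping to set i mod 4:

```c
#define NSETS 4
#define EMPTY (-1)

/* Toy direct-mapped cache: one block per set, block i maps to set
 * i % NSETS. Returns 1 on miss, 0 on hit, updating the cache in place. */
int access_block(int cache[NSETS], int block)
{
    int set = block % NSETS;
    if (cache[set] == block)
        return 0;           /* hit */
    cache[set] = block;     /* conflict: evict whatever was there */
    return 1;               /* miss */
}

int count_misses(const int *refs, int n)
{
    int cache[NSETS] = { EMPTY, EMPTY, EMPTY, EMPTY };
    int misses = 0;
    for (int i = 0; i < n; i++)
        misses += access_block(cache, refs[i]);
    return misses;
}
```

Blocks 0 and 8 both map to set 0, so the reference string 0, 8, 0, 8, 0, 8 misses all six times even though three of the four sets stay empty.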
Capacity miss
◦ Occurs when the set of active cache blocks (working set) is larger than the cache.
CPU looks first for data in caches (e.g., L1, L2, and L3), then in main memory.
Typical system structure: the CPU chip contains the register file, the cache memories, and the ALU; a bus interface connects the CPU chip over the system bus to an I/O bridge, which connects over the memory bus to main memory.
Cache organization: S = 2^s sets, E lines per set, and B = 2^b bytes per cache block (the data); each line also holds a valid bit v and a tag.
Cache size:
C = S x E x B data bytes
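The capacity formula can be sketched as a small helper (function name is illustrative) that computes C from the parameters s, E, and b:

```c
/* Cache geometry: S = 2^s sets, E lines per set, B = 2^b bytes per block.
 * Total data capacity C = S * E * B bytes (valid bits and tags excluded). */
unsigned long cache_size(unsigned s, unsigned E, unsigned b)
{
    unsigned long S = 1UL << s;   /* number of sets   */
    unsigned long B = 1UL << b;   /* bytes per block  */
    return S * E * B;
}
```

For example, s = 6, E = 8, b = 6 gives 64 sets of 8 lines of 64 bytes, i.e. a 32 KB cache.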
Direct-mapped cache simulation (E = 1 line per set, B = 8 bytes per block):
◦ Address of int: t tag bits | set index 0…01 | block offset 100.
◦ Use the set index to find the set (one of S = 2^s sets).
◦ Check the valid bit and compare the tag: valid? + match: assume yes = hit.
◦ Use the block offset (100 = byte 4) to select the int within the block.
E-way set associative cache (here: E = 2 lines per set):
◦ Use the set index to find the set, then compare the tags of both lines in the set in parallel: valid? + match = hit.
◦ On a hit, the block offset selects the requested word; the short int (2 bytes) is here.
No match:
• One line in set is selected for eviction and replacement
• Replacement policies: random, least recently used (LRU), …
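The tag/set index/block offset split can be sketched as a small helper (hypothetical function; s and b are the set-index and block-offset widths in bits):

```c
#include <stdint.h>

/* Split a 64-bit address into (tag, set index, block offset) for a
 * cache with 2^s sets and 2^b-byte blocks. */
typedef struct { uint64_t tag, set, offset; } addr_parts;

addr_parts split_address(uint64_t addr, unsigned s, unsigned b)
{
    addr_parts p;
    p.offset = addr & ((1ULL << b) - 1);          /* low b bits      */
    p.set    = (addr >> b) & ((1ULL << s) - 1);   /* next s bits     */
    p.tag    = addr >> (s + b);                   /* remaining bits  */
    return p;
}
```

With b = 3 (8-byte blocks) and s = 3, the address 0b1100 from the example splits into offset 100 (byte 4) and set index 001, matching the figure.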
What to do on a write-miss?
◦ Write-allocate (load the block into the cache, then update the line in the cache)
◦ Good if more writes to the location follow
◦ No-write-allocate (write immediately to memory, bypassing the cache)
Typical combinations:
◦ Write-through (write to memory on every store) + No-write-allocate
◦ Write-back (defer the memory write until the line is evicted) + Write-allocate
Example multicore cache hierarchy: each core has its own L1 d-cache and L1 i-cache and its own L2 unified cache; all cores share a single L3 unified cache, backed by main memory.
◦ L2 unified cache: 256 KB, 8-way, access: 10 cycles
◦ L3 unified cache: 8 MB, 16-way, access: 40-75 cycles (shared by all cores)
◦ Block size: 64 bytes for all caches.
Hit Time
◦ Time to deliver a line in the cache to the processor
◦ includes time to determine whether the line is in the cache
◦ Typical numbers:
◦ 1-2 clock cycles for L1
◦ 5-20 clock cycles for L2
Miss Penalty
◦ Additional time required because of a miss
◦ typically 50-200 cycles for main memory (Trend: increasing!)
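A standard way to combine hit time and miss penalty, though the formula is not spelled out above, is the average memory access time (AMAT); a minimal sketch, with illustrative numbers drawn from the typical ranges above:

```c
/* Average memory access time in cycles:
 * AMAT = hit_time + miss_rate * miss_penalty.
 * Every access pays the hit time; only misses pay the penalty. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

For example, a 2-cycle L1 hit time, a 5% miss rate, and a 100-cycle penalty give about 7 cycles per access on average, which is why even small miss-rate improvements matter.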
Analysis Method:
◦ Look at access pattern of inner loop
(Figure: access patterns of the inner loop through matrices A, B, and C for each loop ordering, and a plot of cycles per inner-loop iteration versus array size n from 50 to 750: the jki/kji orderings are slowest, ijk/jik intermediate, and kij/ikj fastest.)
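The loop orderings compared above can be sketched as follows (illustrative code assuming row-major n x n matrices of doubles; only one ordering from the slowest-scanning group and one from the fastest is shown):

```c
#include <stddef.h>

/* c += a * b, ijk order: the inner loop walks down a column of b
 * (stride n), so spatial locality on b is poor. */
void matmul_ijk(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];
            c[i*n + j] += sum;
        }
}

/* c += a * b, kij order: the inner loop walks along rows of b and c
 * (stride 1), giving good spatial locality on both. */
void matmul_kij(size_t n, const double *a, const double *b, double *c)
{
    for (size_t k = 0; k < n; k++)
        for (size_t i = 0; i < n; i++) {
            double r = a[i*n + k];
            for (size_t j = 0; j < n; j++)
                c[i*n + j] += r * b[k*n + j];
        }
}
```

Both orderings perform exactly the same arithmetic; only the order of memory references, and hence the miss rate, differs.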
COMPUTER SYSTEMS ORGANIZATION (SPRING 2024) 47
Topics
Cache organization and operation
Performance impact of caches
◦ Rearranging loops to improve spatial locality
◦ Using blocking to improve temporal locality
Cache Miss Analysis (c = a * b, standard ijk order)
Assume: matrix elements are doubles, the cache block holds 8 doubles, and the cache size C is much smaller than n.
First iteration of the inner loop:
◦ n/8 misses for the row of a (one miss per 8-double block) + n misses for the column of b (stride-n accesses, one miss each) = 9n/8 misses.
◦ Afterwards in cache (schematic): the first 8-wide sliver of a.
Second iteration:
◦ Again: n/8 + n = 9n/8 misses.
Total misses:
◦ 9n/8 * n^2 = (9/8) * n^3
Cache Miss Analysis (blocked version)
Assume:
◦ Cache block = 8 doubles
◦ Cache size C << n (much smaller than n)
◦ Three B x B blocks fit into cache: 3B^2 < C
Each n x n matrix is tiled into (n/B) x (n/B) blocks of size B x B.
Misses per block iteration:
◦ Loading one B x B block costs B^2/8 misses; computing one block of c touches 2n/B blocks of a and b, for (2n/B) * (B^2/8) = nB/4 misses.
Total misses:
◦ nB/4 * (n/B)^2 = n^3/(4B)
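A blocked multiply under these assumptions might look like the following sketch (BSIZE and the helper min_sz are illustrative names; the min_sz bounds make it correct even when n is not a multiple of BSIZE):

```c
#include <stddef.h>

#define BSIZE 8  /* block edge; chosen so that 3*BSIZE^2 doubles fit in cache */

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Blocked c += a * b for row-major n x n matrices of doubles: the three
 * outer loops step over B x B blocks so the active blocks of a, b, and c
 * stay resident in the cache (3B^2 < C). */
void matmul_blocked(size_t n, const double *a, const double *b, double *c)
{
    for (size_t ii = 0; ii < n; ii += BSIZE)
        for (size_t jj = 0; jj < n; jj += BSIZE)
            for (size_t kk = 0; kk < n; kk += BSIZE)
                for (size_t i = ii; i < min_sz(ii + BSIZE, n); i++)
                    for (size_t k = kk; k < min_sz(kk + BSIZE, n); k++) {
                        double r = a[i*n + k];
                        for (size_t j = jj; j < min_sz(jj + BSIZE, n); j++)
                            c[i*n + j] += r * b[k*n + j];
                    }
}
```

The arithmetic is identical to the unblocked version; only the visit order changes, cutting total misses from (9/8)n^3 to n^3/(4B).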