CS 153
Design of Operating Systems
Spring 25
Lecture 11: Locality, Cache, and Memory
Hierarchy
Instructor: Chengyu Song
Some slides modified from originals by Dave O’Hallaron
Efficient Translations
Recall that our original page table scheme doubled the latency of doing
memory lookups
◆ One lookup into the page table, another to fetch the data
Now two-level page tables triple the latency!
◆ Two lookups into the page tables, a third to fetch the data
◆ And this assumes the page table is in memory
How can we use paging but also have lookups cost about the same as
fetching from memory?
◆ Cache (remember) translations in hardware
◆ Translation Lookaside Buffer (TLB)
◆ TLB managed by Memory Management Unit (MMU)
(Theme of this lecture: why memory access is slow and what to do about it)
The CPU-Memory Gap
The gap widens between DRAM, disk, and CPU speeds.
[Chart: access time in ns (log scale, 0.1 to 100,000,000) vs. year (1980–2010), plotting disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time. Disk and DRAM improve far more slowly than the CPU.]
The Price-Speed Gap
Question: why don’t we just use fast memory to do everything?
SRAM
◆ Latency: 0.5-2.5 ns, cost: ~$5000 per GB
DRAM
◆ Latency: 50-70 ns, cost: ~$20 - $50 per GB
SSD/NVM
◆ Latency: 70-150 ns, cost: ~$4 - $12 per GB
Magnetic disk
◆ Latency: 5-20 ms, cost: ~$0.02 - $2 per GB
Locality to the Rescue!
The key to bridging this CPU-Memory gap is a fundamental property of
computer programs known as locality
Locality
Principle of Locality: Programs tend to use data and instructions with
addresses near or equal to those they have used recently
Temporal locality:
◆ Recently referenced items are likely
to be referenced again in the near future
Spatial locality:
◆ Items with nearby addresses tend
to be referenced close together in time
Q: What does locality enable us to do?
Locality Example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data references
◆ Reference array elements in succession (stride-1 reference pattern). → Spatial locality
◆ Reference variable sum each iteration. → Temporal locality
Instruction references
◆ Reference instructions in sequence. → Spatial locality
◆ Cycle through loop repeatedly. → Temporal locality
Qualitative Estimates of Locality
Claim: Being able to look at code and get a qualitative sense of its locality
is a key skill for a professional programmer.
Question: Does this function have good locality with respect to array a?
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Locality Example
Question: Does this function have good locality with respect to array a?
int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Locality Example
Question: Can you permute the loops so that the function scans the 3-d
array a with a stride-1 reference pattern (and thus has good spatial
locality)?
int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
Cache
Cache: a smaller, faster storage that acts as a staging area for a subset
of the data in a larger, slower storage.
◆ The storage could be a software data structure or a hardware device → memory
hierarchy
Why does a cache work?
◆ Because of locality!
» We hit the fast storage much more frequently even though it is smaller
General Cache Concepts
[Diagram: a small cache (holding blocks 8, 9, 14, 3) sits above a larger memory partitioned into blocks 0–15; blocks 10 and 4 are shown being copied between the levels.]
◆ Smaller, faster, more expensive storage caches a subset of the blocks
◆ Data is copied between levels in block-sized transfer units
◆ Larger, slower, cheaper storage is viewed as partitioned into “blocks”
General Cache Concepts: Hit
[Diagram: request for block 14; the cache holds blocks 8, 9, 14, 3.]
Data in block b is needed
Block b is in cache: Hit!
General Cache Concepts: Miss
[Diagram: request for block 12; the cache holds blocks 8, 9, 14, 3, so block 12 is fetched from memory and replaces block 9.]
Data in block b is needed
Block b is not in cache: Miss!
Block b is fetched from memory and stored in the cache
• Placement policy: determines where b goes
• Replacement policy: determines which block gets evicted (victim)
Types of Cache Misses
Cold (compulsory) miss
◆ Cold misses occur because the cache is empty.
Conflict miss
◆ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the
block positions at level k.
» E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
◆ Conflict misses occur when the level k cache is large enough, but multiple data
objects all map to the same level k block.
» E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
◆ Occurs when the set of active cache blocks (working set) is larger than the cache.
Cache Replacement Policy
Cache replacement policy: determines which data to remove when we
need a victim
Does it matter?
◆ Yes! Cache filling is expensive
◆ Driving the miss count down can improve system performance significantly
Considerations
Cache replacement support has to be simple
◆ Lookups happen on every access; we cannot make the common case slow
But it can be complicated/expensive when a miss occurs – why?
◆ Reason 1: if we are successful, misses will be rare
◆ Reason 2: when a miss happens, we are already paying the cost of loading
» Loading from the lower layer is relatively slow: we can afford some extra computation
» Worth it if we can save some future misses
What makes a good cache replacement policy?
Evicting the Best Data
Goal is to reduce the cache miss rate
The best data to evict is data that will never be touched again
◆ We will never have a cache miss on it
Never is a long time, so picking the data closest to “never” is the next
best thing
◆ Evicting the data that won’t be used for the longest period of time minimizes the
number of cache misses
◆ Proved by Belady
We’ll survey various replacement algorithms: Belady’s, FIFO, LRU (least
recently used)
Belady’s Algorithm
Belady’s algorithm
◆ Idea: Replace the data that will not be used for the longest time in the future
◆ Optimal? How would you show?
◆ Problem: Have to predict the future
Why is Belady’s useful then?
◆ Use it as a yardstick/upper bound
◆ Compare implementations of page replacement algorithms with the optimal to gauge
room for improvement
» If optimal is not much better, then algorithm is pretty good
◆ What’s a good lower bound?
» Random replacement is often the lower bound
First-In First-Out (FIFO)
FIFO is an obvious algorithm and simple to implement
◆ Maintain a list of pages in order in which they were paged in
◆ On replacement, evict the one brought in longest time ago
Why might this be good?
◆ Maybe the one brought in the longest ago is not being used
Why might this be bad?
◆ Then again, maybe it’s not
◆ We don’t have any info to say one way or the other
FIFO suffers from “Belady’s Anomaly”
◆ The miss rate might actually increase when the cache size grows (very bad)
Least Recently Used (LRU)
LRU uses reference information to make a more informed replacement
decision
◆ Idea: We can’t predict the future, but we can make a guess based upon past
experience
◆ On replacement, evict the page that has not been used for the longest time in the past
(Belady’s: future)
◆ When does LRU do well? When does LRU do poorly?
Implementation
◆ To be perfect, need to time stamp every reference (or maintain a stack) – much too
costly
◆ So we need to approximate it
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
◆ Fast storage technologies cost more per byte, have less capacity, and require more
power (heat!).
◆ The gap between CPU and main memory speed is widening.
◆ Well-written programs tend to exhibit good locality.
These fundamental properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems
known as a memory hierarchy.
An Example of Memory Hierarchy
L0: Registers. CPU registers hold words retrieved from the L1 cache.
L1: L1 cache (SRAM). Holds cache lines retrieved from the L2 cache.
L2: L2 cache (SRAM). Holds cache lines retrieved from main memory.
L3: Main memory (DRAM). Holds disk blocks retrieved from local disks.
L4: Local secondary storage (local disks). Local disks hold files retrieved from disks on remote network servers.
L5: Remote secondary storage (tapes, distributed file systems, Web servers).
Moving up the hierarchy: smaller, faster, costlier per byte. Moving down: larger, slower, cheaper per byte.
Memory hierarchy
Fundamental idea of a memory hierarchy:
◆ At each layer, a faster, smaller device caches a larger, slower device
Why do memory hierarchies work?
◆ Because of locality!
» We hit fast memory much more frequently even though it is smaller
◆ Thus, the storage at level k+1 can be slower (but larger and cheaper!)
Big Idea: The memory hierarchy creates a large pool of storage that costs
as much as the cheap storage near the bottom, but that serves data to
programs at the rate of the fast storage near the top.
Examples of Caching in the Hierarchy
Cache Type            What is Cached?       Where is it Cached?   Latency (cycles)  Managed By
Registers             4-8 byte words        CPU core              0                 Compiler
TLB                   Address translations  On-Chip TLB           0                 Hardware
L1 cache              64-byte blocks        On-Chip L1            1                 Hardware
L2 cache              64-byte blocks        On/Off-Chip L2        10                Hardware
Virtual Memory        4-KB pages            Main memory           100               Hardware + OS
Buffer cache          Parts of files        Main memory           100               OS
Disk cache            Disk sectors          Disk controller       100,000           Disk firmware
Network buffer cache  Parts of files        Local disk            10,000,000        AFS/NFS client
Browser cache         Web pages             Local disk            10,000,000        Web browser
Web cache             Web pages             Remote server disks   1,000,000,000     Web proxy server
Intel Core i7 Memory System
[Diagram: processor package with 4 cores, a shared L3 cache, the QuickPath interconnect, and a DDR3 memory controller.]
Per core (×4):
◆ Registers, instruction fetch, MMU (address translation)
◆ L1 d-cache: 32 KB, 8-way; L1 i-cache: 32 KB, 8-way
◆ L1 d-TLB: 64 entries, 4-way; L1 i-TLB: 128 entries, 4-way
◆ L2 unified cache: 256 KB, 8-way; L2 unified TLB: 512 entries, 4-way
Shared by all cores:
◆ L3 unified cache: 8 MB, 16-way
◆ QuickPath interconnect: 4 links @ 25.6 GB/s each (to other cores and the I/O bridge)
◆ DDR3 memory controller to main memory: 3 x 64 bit @ 10.66 GB/s, 32 GB/s total
End-to-end Core i7 Address Translation
[Diagram: end-to-end translation. The CPU issues a 32/64-bit virtual address; an L1 miss goes on to L2, L3, and main memory.]
◆ Virtual address (VA): VPN (36 bits) | VPO (12 bits)
◆ The VPN is split into TLBT (32 bits) | TLBI (4 bits) to index the L1 TLB (16 sets, 4 entries/set)
◆ On a TLB miss, CR3 starts a 4-level page-table walk: VPN1 | VPN2 | VPN3 | VPN4 (9 bits each), one PTE fetch per level, yielding a 40-bit PPN
◆ Physical address (PA): PPN (40 bits) | PPO (12 bits)
◆ The L1 d-cache (64 sets, 8 lines/set) interprets the PA as CT (40 bits) | CI (6 bits) | CO (6 bits)
Simple Memory System Example
Addressing
◆ 14-bit virtual addresses
◆ 12-bit physical addresses
◆ Page size = 64 bytes
Virtual address: bits 13–6 = VPN (Virtual Page Number), bits 5–0 = VPO (Virtual Page Offset)
Physical address: bits 11–6 = PPN (Physical Page Number), bits 5–0 = PPO (Physical Page Offset)
Simple Memory System Page Table
Only show first 16 entries (out of 256)
VPN PPN Valid VPN PPN Valid
00 28 1 08 13 1
01 – 0 09 17 1
02 33 1 0A 09 1
03 02 1 0B – 0
04 – 0 0C – 0
05 16 1 0D 2D 1
06 – 0 0E 11 1
07 – 0 0F 0D 1
Simple Memory System TLB
16 entries
4-way associative* (what is this?!)
The VPN is split for the TLB: TLBT (tag) = bits 13–8 of the virtual address, TLBI (set index) = bits 7–6 (the low 2 bits of the VPN, selecting one of 4 sets)
Set Tag PPN Valid Tag PPN Valid Tag PPN Valid Tag PPN Valid
0 03 – 0 09 0D 1 00 – 0 07 02 1
1 03 2D 1 02 – 0 04 – 0 0A – 0
2 02 – 0 08 – 0 06 – 0 03 – 0
3 07 – 0 03 0D 1 0A 34 1 02 – 0
Simple Memory System Cache
16 lines, 4-byte block size
Physically addressed
Direct mapped
Physical address split: CT (tag) = bits 11–6, CI (index) = bits 5–2, CO (block offset) = bits 1–0
Idx Tag Valid B0 B1 B2 B3 Idx Tag Valid B0 B1 B2 B3
0 19 1 99 11 23 11 8 24 1 3A 00 51 89
1 15 0 – – – – 9 2D 0 – – – –
2 1B 1 00 02 04 08 A 2D 1 93 15 DA 3B
3 36 0 – – – – B 0B 0 – – – –
4 32 1 43 6D 8F 09 C 12 0 – – – –
5 0D 1 36 72 F0 1D D 16 1 04 96 34 15
6 31 0 – – – – E 13 1 83 77 1B D3
7 16 1 11 C2 DF 03 F 14 0 – – – –
Address Translation Example #1
Virtual Address: 0x03D4
Bit pattern (TLBT | TLBI | VPO): 0 0 0 0 1 1 1 1 0 1 0 1 0 0
VPN: 0x0F   TLBI: 0x3   TLBT: 0x03   TLB Hit? Y   Page Fault? N   PPN: 0x0D
Physical Address: 0x354
Bit pattern (CT | CI | CO): 0 0 1 1 0 1 0 1 0 1 0 0
CO: 0x0   CI: 0x5   CT: 0x0D   Cache Hit? Y   Byte: 0x36
Address Translation Example #2
Virtual Address: 0x0B8F
Bit pattern (TLBT | TLBI | VPO): 0 0 1 0 1 1 1 0 0 0 1 1 1 1
VPN: 0x2E   TLBI: 0x2   TLBT: 0x0B   TLB Hit? N   Page Fault? Y   PPN: TBD
Physical Address: unknown until the page fault is handled, so CO, CI, CT, Hit, and Byte cannot be determined
Address Translation Example #3
Virtual Address: 0x0020
Bit pattern (TLBT | TLBI | VPO): 0 0 0 0 0 0 0 0 1 0 0 0 0 0
VPN: 0x00   TLBI: 0x0   TLBT: 0x00   TLB Hit? N   Page Fault? N   PPN: 0x28
Physical Address: 0xA20
Bit pattern (CT | CI | CO): 1 0 1 0 0 0 1 0 0 0 0 0
CO: 0x0   CI: 0x8   CT: 0x28   Cache Hit? N   Byte: fetched from memory