Bridging the CPU-Memory Gap

The lecture discusses the design of operating systems with a focus on memory hierarchy, locality, and cache management. It highlights the challenges of memory access speed and the importance of locality in improving performance through caching techniques. Various cache replacement policies, such as FIFO and LRU, are examined to optimize memory usage and reduce cache misses.


CS 153

Design of Operating Systems

Spring 25

Lecture 11: Locality, Cache, and Memory Hierarchy
Instructor: Chengyu Song

Some slides modified from originals by Dave O'Hallaron


Efficient Translations
Recall that our original page table scheme doubled the latency of doing
memory lookups
◆ One lookup into the page table, another to fetch the data
Now two-level page tables triple the latency!
◆ Two lookups into the page tables, a third to fetch the data
◆ And this assumes the page table is in memory
How can we use paging but also have lookups cost about the same as
fetching from memory?
◆ Cache (remember) translations in hardware
◆ Translation Lookaside Buffer (TLB)
◆ TLB managed by Memory Management Unit (MMU)
(Slide note: why memory access is slow and what to do about it)
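The TLB idea can be sketched in software. The following is a minimal illustrative model, not the MMU's real interface: the entry count, page size, and function names are assumptions. It models a direct-mapped table of VPN-to-PPN translations consulted before any page-table walk.

```c
#include <stdint.h>
#include <stdbool.h>

/* Illustrative direct-mapped TLB: 16 entries, 4 KB pages.
   Entry count, page size, and names are assumptions for this sketch. */
#define TLB_ENTRIES 16
#define PAGE_SHIFT  12

typedef struct { bool valid; uint64_t vpn, ppn; } tlb_entry;
static tlb_entry tlb[TLB_ENTRIES];

/* On a hit, fill *pa and return true; on a miss the caller must
   walk the page table and then install the translation. */
bool tlb_lookup(uint64_t va, uint64_t *pa) {
    uint64_t vpn = va >> PAGE_SHIFT;
    tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn) {
        *pa = (e->ppn << PAGE_SHIFT) | (va & ((1ull << PAGE_SHIFT) - 1));
        return true;  /* one fast lookup, no page-table walk */
    }
    return false;
}

void tlb_install(uint64_t vpn, uint64_t ppn) {
    tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = true; e->vpn = vpn; e->ppn = ppn;
}
```

On a hit, the translation costs one table probe instead of one (or more) extra memory lookups, which is exactly what makes paging affordable.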
The CPU-Memory Gap
The gap widens between DRAM, disk, and CPU speeds.
[Figure: access time in ns (log scale, 0.1 to 100,000,000) vs. year (1980-2010), plotting disk seek time, flash SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time]
The Price-Speed Gap
Question: why don’t we just use fast memory to do everything?
SRAM
◆ Latency: 0.5-2.5 ns, cost: ~$5000 per GB
DRAM
◆ Latency: 50-70 ns, cost: ~$20 - $50 per GB
SSD/NVM
◆ Latency: 70-150 ns, cost: ~$4 - $12 per GB
Magnetic disk
◆ Latency: 5-20 ms, cost: ~$0.02 - $2 per GB
Locality to the Rescue!
The key to bridging this CPU-Memory gap is a fundamental property of
computer programs known as locality
Locality
Principle of Locality: Programs tend to use data and instructions with
addresses near or equal to those they have used recently

Temporal locality:
◆ Recently referenced items are likely
to be referenced again in the near future

Spatial locality:
◆ Items with nearby addresses tend
to be referenced close together in time

Q: What does locality enable us to do?


Locality Example
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;

Data references
◆ Reference array elements in succession (stride-1 reference pattern). → Spatial locality
◆ Reference variable sum each iteration. → Temporal locality
Instruction references
◆ Reference instructions in sequence. → Spatial locality
◆ Cycle through loop repeatedly. → Temporal locality
Qualitative Estimates of Locality
Claim: Being able to look at code and get a qualitative sense of its locality
is a key skill for a professional programmer.
Question: Does this function have good locality with respect to array a?

int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
Locality Example
Question: Does this function have good locality with respect to array a?

int sum_array_cols(int a[M][N])
{
    int i, j, sum = 0;
    for (j = 0; j < N; j++)
        for (i = 0; i < M; i++)
            sum += a[i][j];
    return sum;
}
Locality Example
Question: Can you permute the loops so that the function scans the 3-d
array a with a stride-1 reference pattern (and thus has good spatial
locality)?

int sum_array_3d(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < M; k++)
                sum += a[k][i][j];
    return sum;
}
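One answer sketch: make the loop over the last subscript (j) innermost and order the outer loops to match the subscripts of a[k][i][j]. The values of M and N below are illustrative so the sketch is self-contained.

```c
/* Illustrative sizes so the sketch compiles on its own. */
#define M 2
#define N 3

/* One permutation with a stride-1 reference pattern: the loop over
   the last subscript (j) is innermost, and the outer loops follow
   the order of the subscripts in a[k][i][j]. */
int sum_array_3d_stride1(int a[M][N][N])
{
    int i, j, k, sum = 0;
    for (k = 0; k < M; k++)
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                sum += a[k][i][j];
    return sum;
}
```

With this ordering, consecutive iterations touch adjacent memory locations, so each cache block loaded is used in full before it is evicted.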
Cache
Cache: A smaller, faster storage that acts as a staging area for a subset of the data in a larger, slower storage.
◆ The storage could be a software data structure or a hardware device → memory hierarchy
Why does cache work?
◆ Because of locality!
» Hit the fast storage much more frequently even though it's smaller
General Cache Concepts
Cache: smaller, faster, more expensive storage; caches a subset of the blocks (e.g., blocks 8, 9, 14, 3)
Data is copied between cache and memory in block-sized transfer units
Memory: larger, slower, cheaper storage viewed as partitioned into "blocks" (numbered 0-15)
General Cache Concepts: Hit
Request: 14 — data in block 14 is needed
Block 14 is in the cache (8, 9, 14, 3): Hit!
General Cache Concepts: Miss
Request: 12 — data in block 12 is needed
Block 12 is not in the cache (8, 9, 14, 3): Miss!
Block 12 is fetched from memory (Request: 12) and stored in the cache, here replacing block 9
• Placement policy: determines where the fetched block goes
• Replacement policy: determines which block gets evicted (victim)
Types of Cache Misses
Cold (compulsory) miss
◆ Cold misses occur because the cache is empty.
Conflict miss
◆ Most caches limit blocks at level k+1 to a small subset (sometimes a singleton) of the
block positions at level k.
» E.g. Block i at level k+1 must be placed in block (i mod 4) at level k.
◆ Conflict misses occur when the level k cache is large enough, but multiple data
objects all map to the same level k block.
» E.g. Referencing blocks 0, 8, 0, 8, 0, 8, ... would miss every time.
Capacity miss
◆ Occurs when the set of active cache blocks (working set) is larger than the cache.
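The conflict-miss example above can be reproduced with a few lines of C. This is an illustrative simulator, not real cache hardware: a 4-slot direct-mapped cache where block i maps to slot (i mod 4), counting misses over a reference string.

```c
/* Illustrative direct-mapped cache with 4 slots: block i maps to
   slot (i mod 4). Counts misses for a reference string of block
   numbers; a model of the mapping described above, not hardware. */
#define SLOTS 4

int count_misses(const int *blocks, int n) {
    int cache[SLOTS], misses = 0;
    for (int i = 0; i < SLOTS; i++) cache[i] = -1;   /* start empty */
    for (int i = 0; i < n; i++) {
        int slot = blocks[i] % SLOTS;
        if (cache[slot] != blocks[i]) {   /* cold or conflict miss */
            cache[slot] = blocks[i];
            misses++;
        }
    }
    return misses;
}
```

The pattern 0, 8, 0, 8, 0, 8 misses on every reference because 0 and 8 both map to slot 0, even though the other three slots sit empty — the cache is big enough, but the placement causes conflicts.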
Cache Replacement Policy
Cache replacement policy: determine which data to remove when we
need a victim

Does it matter?
◆ Yes! Cache filling is expensive
◆ Getting the miss count down can improve the performance of the system significantly
Considerations
Cache replacement support has to be simple
◆ Lookups happen all the time; we cannot make that part slow
But it can be complicated/expensive when a miss occurs – why?
◆ Reason 1: if we are successful, misses will be rare
◆ Reason 2: when a miss happens, we are already paying the cost of loading
» Loading from the lower layer is relatively slow: we can afford some extra computation
» Worth it if we can save some future misses
What makes a good cache replacement policy?
Evicting the Best Data
Goal is to reduce the cache miss rate
The best data to evict is the one never touched again
◆ Will never have a cache miss on it
Never is a long time, so picking the data closest to “never” is the next
best thing
◆ Evicting the data that won’t be used for the longest period of time minimizes the
number of cache misses
◆ Proved by Belady
We’ll survey various replacement algorithms: Belady’s, FIFO, LRU (least
recently used)
Belady’s Algorithm
Belady’s algorithm
◆ Idea: Replace the data that will not be used for the longest time in the future
◆ Optimal? How would you show?
◆ Problem: Have to predict the future
Why is Belady’s useful then?
◆ Use it as a yardstick/upper bound
◆ Compare implementations of page replacement algorithms with the optimal to gauge
room for improvement
» If optimal is not much better, then algorithm is pretty good
◆ What’s a good lower bound?
» Random replacement is often the lower bound
First-In First-Out (FIFO)
FIFO is an obvious algorithm and simple to implement
◆ Maintain a list of pages in the order in which they were paged in
◆ On replacement, evict the one brought in longest time ago
Why might this be good?
◆ Maybe the one brought in the longest ago is not being used
Why might this be bad?
◆ Then again, maybe it’s not
◆ We don’t have any info to say one way or the other
FIFO suffers from “Belady’s Anomaly”
◆ The miss rate might actually increase when the cache size grows (very bad)
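Belady's Anomaly is easy to demonstrate with a small simulator. The sketch below is illustrative (it caps the frame count at 16): it counts FIFO misses for a reference string, and the classic string 1,2,3,4,1,2,5,1,2,3,4,5 produces more misses with 4 frames than with 3.

```c
/* Illustrative FIFO replacement simulator: counts misses for a
   reference string given `frames` slots (at most 16 here). */
int fifo_misses(const int *refs, int n, int frames) {
    int slots[16];
    int used = 0, next = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int hit = 0;
        for (int j = 0; j < used; j++)
            if (slots[j] == refs[i]) hit = 1;
        if (!hit) {
            misses++;
            if (used < frames)
                slots[used++] = refs[i];     /* fill an empty slot */
            else {
                slots[next] = refs[i];       /* evict the oldest */
                next = (next + 1) % frames;
            }
        }
    }
    return misses;
}
```

Running this on 1,2,3,4,1,2,5,1,2,3,4,5 gives 9 misses with 3 frames but 10 misses with 4 frames: a larger cache performing worse, which is exactly Belady's Anomaly.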
Least Recently Used (LRU)
LRU uses reference information to make a more informed replacement
decision
◆ Idea: We can’t predict the future, but we can make a guess based upon past
experience
◆ On replacement, evict the page that has not been used for the longest time in the past
(Belady’s: future)
◆ When does LRU do well? When does LRU do poorly?
Implementation
◆ To be perfect, need to time stamp every reference (or maintain a stack) – much too
costly
◆ So we need to approximate it
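Exact LRU can be written directly with per-reference timestamps, which is precisely the bookkeeping that is too costly to do in hardware on every access. The sketch below is an illustrative fully associative simulator with a small fixed capacity, not a real implementation.

```c
/* Exact LRU over an illustrative fully associative cache of CAP
   blocks, using per-entry timestamps. The per-reference update is
   what makes exact LRU too costly in hardware, so real systems
   approximate it. */
#define CAP 3

typedef struct { int block; int last_used; } lru_entry;

int lru_misses(const int *refs, int n) {
    lru_entry cache[CAP];
    int used = 0, misses = 0;
    for (int i = 0; i < n; i++) {
        int hit = -1;
        for (int j = 0; j < used; j++)
            if (cache[j].block == refs[i]) hit = j;
        if (hit >= 0) {
            cache[hit].last_used = i;        /* refresh recency */
        } else {
            misses++;
            int slot = used;
            if (used < CAP) {
                used++;                      /* fill an empty slot */
            } else {
                slot = 0;                    /* evict least recently used */
                for (int j = 1; j < CAP; j++)
                    if (cache[j].last_used < cache[slot].last_used)
                        slot = j;
            }
            cache[slot].block = refs[i];
            cache[slot].last_used = i;
        }
    }
    return misses;
}
```

A looping reference string that fits in the cache (1,2,3,1,2,3 with CAP = 3) incurs only the three cold misses; one block too many (1,2,3,4 repeated) makes LRU evict exactly the block needed next, which is when LRU does poorly.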
Memory Hierarchies
Some fundamental and enduring properties of hardware and software:
◆ Fast storage technologies cost more per byte, have less capacity, and require more
power (heat!).
◆ The gap between CPU and main memory speed is widening.
◆ Well-written programs tend to exhibit good locality.
These fundamental properties complement each other beautifully.
They suggest an approach for organizing memory and storage systems
known as a memory hierarchy.
An Example of Memory Hierarchy
L0: Registers – CPU registers hold words retrieved from the L1 cache
L1: L1 cache (SRAM) – holds cache lines retrieved from the L2 cache
L2: L2 cache (SRAM) – holds cache lines retrieved from main memory
L3: Main memory (DRAM) – holds disk blocks retrieved from local disks
L4: Local secondary storage (local disks) – holds files retrieved from disks on remote network servers
L5: Remote secondary storage (tapes, distributed file systems, Web servers)
Moving up the hierarchy: smaller, faster, costlier per byte; moving down: larger, slower, cheaper per byte
Another Example
Memory hierarchy
Fundamental idea of a memory hierarchy:
◆ At each layer, a faster, smaller device caches a larger, slower device
Why do memory hierarchies work?
◆ Because of locality!
» Hit the fast memory much more frequently even though it's smaller
◆ Thus, the storage at level k+1 can be slower (but larger and cheaper!)
Big Idea: The memory hierarchy creates a large pool of storage that costs as much as the cheap storage near the bottom, but that serves data to programs at the rate of the fast storage near the top.
Examples of Caching in the Hierarchy

Cache Type            What is Cached?        Where is it Cached?   Latency (cycles)   Managed By
Registers             4-8 byte words         CPU core              0                  Compiler
TLB                   Address translations   On-chip TLB           0                  Hardware
L1 cache              64-byte blocks         On-chip L1            1                  Hardware
L2 cache              64-byte blocks         On/off-chip L2        10                 Hardware
Virtual memory        4-KB pages             Main memory           100                Hardware + OS
Buffer cache          Parts of files         Main memory           100                OS
Disk cache            Disk sectors           Disk controller       100,000            Disk firmware
Network buffer cache  Parts of files         Local disk            10,000,000         AFS/NFS client
Browser cache         Web pages              Local disk            10,000,000         Web browser
Web cache             Web pages              Remote server disks   1,000,000,000      Web proxy server
Intel Core i7 Memory System
Processor package with 4 cores. Each core has:
◆ Registers, instruction fetch, and MMU (address translation)
◆ L1 d-cache: 32 KB, 8-way; L1 i-cache: 32 KB, 8-way
◆ L1 d-TLB: 64 entries, 4-way; L1 i-TLB: 128 entries, 4-way
◆ L2 unified cache: 256 KB, 8-way; L2 unified TLB: 512 entries, 4-way
Shared by all cores:
◆ L3 unified cache: 8 MB, 16-way
◆ DDR3 memory controller: 3 x 64 bit @ 10.66 GB/s (32 GB/s total) to main memory
◆ QuickPath interconnect: 4 links @ 25.6 GB/s each, to other cores and the I/O bridge
End-to-end Core i7 Address Translation
Virtual address (VA): 36-bit VPN + 12-bit VPO
◆ For the TLB lookup, the VPN is split into a 32-bit TLBT (tag) and a 4-bit TLBI (set index)
◆ L1 TLB: 16 sets, 4 entries/set
◆ On a TLB miss, the MMU walks the four-level page tables (CR3 points to the first level): the VPN splits into VPN1-VPN4, 9 bits each, one PTE index per level
Physical address (PA): 40-bit PPN + 12-bit PPO
◆ For the L1 cache lookup, the PA is split into a 40-bit CT (tag), 6-bit CI (set index), and 6-bit CO (block offset)
◆ L1 d-cache: 64 sets, 8 lines/set
◆ On an L1 miss, the access proceeds to L2, L3, and main memory; the result returns to the CPU as a 32/64-bit word
Simple Memory System Example
Addressing
◆ 14-bit virtual addresses
◆ 12-bit physical addresses
◆ Page size = 64 bytes

Virtual address: VPN (Virtual Page Number) = bits 13-6, VPO (Virtual Page Offset) = bits 5-0
Physical address: PPN (Physical Page Number) = bits 11-6, PPO (Physical Page Offset) = bits 5-0
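These bit splits can be checked with a few helper functions. The names below (vpn, vpo, tlbi, tlbt) are illustrative; the TLBI/TLBT split assumes the 16-entry, 4-way (hence 4-set) TLB this example system uses.

```c
/* Helpers for the example system's bit splits: 14-bit virtual
   addresses and 64-byte pages, so the page offset is the low 6
   bits. Names are illustrative, not a real API. */
#define VPO_BITS 6

unsigned vpn(unsigned va)  { return va >> VPO_BITS; }               /* bits 13-6 */
unsigned vpo(unsigned va)  { return va & ((1u << VPO_BITS) - 1); }  /* bits 5-0  */

/* 16-entry, 4-way TLB => 4 sets: index is the low 2 bits of the
   VPN, tag is the remaining high bits. */
unsigned tlbi(unsigned va) { return vpn(va) & 0x3; }
unsigned tlbt(unsigned va) { return vpn(va) >> 2; }
```

For example, VA 0x03D4 gives VPN 0x0F, TLBI 0x3, and TLBT 0x03, matching Address Translation Example #1 below.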
Simple Memory System Page Table
Only show first 16 entries (out of 256)

VPN PPN Valid VPN PPN Valid


00 28 1 08 13 1
01 – 0 09 17 1
02 33 1 0A 09 1
03 02 1 0B – 0
04 – 0 0C – 0
05 16 1 0D 2D 1
06 – 0 0E 11 1
07 – 0 0F 0D 1
Simple Memory System TLB
16 entries
4-way associative* (what is this?!)

TLB split of the VPN: TLBT (tag) = bits 13-8, TLBI (set index) = bits 7-6; VPO = bits 5-0

Set Tag PPN Valid Tag PPN Valid Tag PPN Valid Tag PPN Valid
0 03 – 0 09 0D 1 00 – 0 07 02 1
1 03 2D 1 02 – 0 04 – 0 0A – 0
2 02 – 0 08 – 0 06 – 0 03 – 0
3 07 – 0 03 0D 1 0A 34 1 02 – 0
Simple Memory System Cache
16 lines, 4-byte block size
Physically addressed
Direct mapped
Physical address split for the cache: CT (tag) = bits 11-6, CI (set index) = bits 5-2, CO (block offset) = bits 1-0
Idx Tag Valid B0 B1 B2 B3 Idx Tag Valid B0 B1 B2 B3
0 19 1 99 11 23 11 8 24 1 3A 00 51 89
1 15 0 – – – – 9 2D 0 – – – –
2 1B 1 00 02 04 08 A 2D 1 93 15 DA 3B
3 36 0 – – – – B 0B 0 – – – –
4 32 1 43 6D 8F 09 C 12 0 – – – –

5 0D 1 36 72 F0 1D D 16 1 04 96 34 15
6 31 0 – – – – E 13 1 83 77 1B D3
7 16 1 11 C2 DF 03 F 14 0 – – – –
Address Translation Example #1
Virtual Address: 0x03D4
Bits 13-0: 0 0 0 0 1 1 1 1 0 1 0 1 0 0
VPN: 0x0F   TLBI: 0x3   TLBT: 0x03   TLB Hit? Y   Page Fault? N   PPN: 0x0D

Physical Address
Bits 11-0: 0 0 1 1 0 1 0 1 0 1 0 0
CO: 0x0   CI: 0x5   CT: 0x0D   Cache Hit? Y   Byte: 0x36
Address Translation Example #2
Virtual Address: 0x0B8F
Bits 13-0: 0 0 1 0 1 1 1 0 0 0 1 1 1 1
VPN: 0x2E   TLBI: 0x2   TLBT: 0x0B   TLB Hit? N   Page Fault? Y   PPN: TBD

Physical Address
CO: ___   CI: ___   CT: ___   Cache Hit? ___   Byte: ___ (unknown until the page fault is handled)
Address Translation Example #3
Virtual Address: 0x0020
Bits 13-0: 0 0 0 0 0 0 0 0 1 0 0 0 0 0
VPN: 0x00   TLBI: 0x0   TLBT: 0x00   TLB Hit? N   Page Fault? N   PPN: 0x28

Physical Address
Bits 11-0: 1 0 1 0 0 0 1 0 0 0 0 0
CO: 0x0   CI: 0x8   CT: 0x28   Cache Hit? N   Byte: Mem (must be fetched from memory)
