Parallel Processing and Multi-Core Architecture
[Figure: three data streams (A, B, C), each with its own data input and data output]
Vector Unit (SIMD) Working
• Vector arithmetic instructions usually only
allow element N of one vector register to take
part in operations with element N from other
vector registers.
• This dramatically simplifies the construction of
a highly parallel vector unit, which can be
structured as multiple parallel vector lanes.
Example: a single vector add instruction, C = A + B
Structure of a vector unit containing
four lanes
• each lane holding every fourth element of each
vector register.
• The figure shows three vector functional units: an
FP add, an FP multiply, and a load-store unit.
• Each of the vector arithmetic units contains four
execution pipelines, one per lane, which act in
concert to complete a single vector instruction.
• Note how each section of the vector-register file
only needs to provide enough read and write
ports for the pipelines local to its lane.
Why are vector architectures NOT more popular
outside high-performance computing?
• Concerns about the larger state of vector
registers increasing context switch time, and
the difficulty of handling page faults in vector
loads and stores.
• An advantage of vector and multimedia extensions
is that it is relatively easy to extend a scalar
instruction set architecture with these
instructions to improve the performance of
data-parallel operations.
[Figure: MIMD - three instruction streams (A, B, C), each paired with its own data stream with separate data input and data output]
No Multithreading
Hardware Multithreading/MIMD
[Figure: single-threaded processor pipeline, showing integer units, the Branch Target Buffer (BTB), uCode ROM, and Translation Lookaside Buffer (TLB)]
Without SMT, only a single thread can
run at any given time
[Figure: SMT pipeline with floating-point and integer units, BTB and uCode ROM; Thread 2 issues an integer operation alongside the other thread's work]
SMT processor: both threads can run
concurrently
Instruction Level Parallelism (ILP): executing multiple instructions from a single thread at the same time (pipelining, superscalar issue).
Thread Level Parallelism (TLP): executing instructions from several independent threads at the same time.
Multicore Philosophy
- Two or more cores within a single die
- Each core has its own instruction stream and
architectural resources
A multi-core processor (or chip-level
multiprocessor, CMP) combines two or more
independent cores (normally CPUs) into a single
package composed of a single integrated circuit
(IC), called a die, or more dies packaged together. A
dual-core processor contains two cores, and a
quad-core processor contains four cores.
Multi-core architectures
[Figure: cores on a multi-core chip connected through a shared bus interface]
Multi-Core Architecture
shared memory multiprocessor (SMP)
Lock: only one processor at a time can acquire the lock; other processors interested in shared data must wait until the original processor unlocks the synchronization variable.
Nonuniform Memory Access (NUMA): main memory is divided and attached to different microprocessors or to different memory controllers on the same chip.
As processors operating in parallel will normally share data, they also need to coordinate when operating on shared data; otherwise, one processor could start working on data before another is finished with it.
Multi-core CPU chip
[Figure: multi-core CPU chip containing Core 1, Core 2, Core 3, and Core 4]
The cores run in parallel
[Figure: threads 1-4 running in parallel, one per core (Core 1 - Core 4)]
Within each core, threads are time-sliced (just
like on a uniprocessor)
[Figure: several threads time-sharing each of Core 1 - Core 4]
SMT not a "true" parallel processor
Multi-core:
threads can run on separate cores
[Figure: Thread 1 and Thread 2 running on two separate cores]
Multi-core:
threads can run on separate cores
[Figure: Thread 3 and Thread 4 running on two further cores]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
- Single-core, non-SMT: standard uniprocessor
- Single-core, with SMT
- Multi-core, non-SMT
- Multi-core, with SMT
- The number of SMT threads:
2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"
SMT Dual-core: all four threads can
run concurrently
[Figure: SMT dual-core chip - two SMT-enabled cores, each running two threads]
Comparison: multi-core vs SMT
• Multi-core:
- Since there are several cores,
each is smaller and not as powerful
(but also easier to design and manufacture)
- However, great with thread-level parallelism
• SMT
- Can have one large and fast superscalar core
- Great performance on a single thread
- Mostly still only exploits instruction-level
parallelism
Pentium D
Core Duo
Core 2 Duo & Core 2 Quad
Some Intel Processors
Hyper-Threading:
- Parts of a single processor are
shared between threads
- The execution engine is shared
- No OS task switch is needed to
interleave the threads
- The processor is kept as busy as
possible
A Simple Parallel Processing Program for a
Shared Address Space
Suppose we want to sum 64,000 numbers on a shared memory multiprocessor
computer with uniform memory access time. Let's assume we have 64 processors. The
first step is to ensure a balanced load per processor, so we split the set of numbers into
subsets of the same size. We do not allocate the subsets to a different memory space,
since there is a single memory space for this machine; we just give different starting
addresses to each processor. Pn is the number that identifies the processor, between 0
and 63. All processors start the program by running a loop that sums their subset of
numbers:
sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i]; /* sum the assigned areas */
The last four levels of a reduction
The cache coherence problem
Suppose variable x initially contains 15213
[Figure: multi-core chip; main memory holds x = 15213]
The cache coherence problem
Core 1 reads x
[Figure: Core 1's cache now holds a copy of x = 15213]
The cache coherence problem
Core 2 reads x
[Figure: Core 2's cache also holds a copy of x = 15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Figure: Core 1's cache and main memory now hold x = 21660 (assuming write-through caches); Core 2's cache still holds the old value]
The cache coherence problem
Core 2 attempts to read x... gets a stale copy
[Figure: Core 2 still reads x = 15213 from its cache while main memory holds x = 21660]
Solutions for cache coherence
• This is a general problem with
multiprocessors, not limited just to multi-core
• There exist many solution algorithms,
coherence protocols, etc.
• A simple solution:
invalidation-based protocol with snooping
Inter-core bus
[Figure: multi-core chip with the cores' caches connected to each other and to main memory by an inter-core bus]
Invalidation protocol with snooping
• Invalidation:
If a core writes to a data item, all other
copies of this data item in other caches
are invalidated
• Snooping:
All cores continuously "snoop" (monitor)
the bus connecting the cores.
The cache coherence problem
Revisited: Cores 1 and 2 have both read x
[Figure: both cores' caches hold x = 15213]
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Figure: Core 1 writes x = 21660 and sends an invalidation request over the inter-core bus; main memory is updated (assuming write-through caches)]
The cache coherence problem
After invalidation:
[Figure: Core 2's copy of x is invalidated; main memory holds x = 21660]
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new copy.
[Figure: Core 2's read misses in its cache and fetches x = 21660]
Alternative to invalidate protocol:
update protocol
Core 1 writes x=21660:
[Figure: Core 1 broadcasts the updated value x = 21660 over the inter-core bus; the other caches and main memory are updated (assuming write-through caches)]
Invalidation vs update
• Multiple writes to the same location
- invalidation: only the first time
- update: must broadcast each write
(which includes new variable value)
HOST
• GPU:
  ➢ built from streaming multiprocessors (SMs), each containing streaming processor (SP) cores and a shared memory
    o 30 x SM on GT200
    o 14 x SM on Fermi
  ➢ For example, GTX 480:
    14 SMs x 32 cores = 448 cores on a GPU
[Figure: HOST connected to a GPU made of SMs; each SM contains SP cores and a SHARED MEMORY]
Datapath of a multithreaded SIMD
Processor
More Detailed GPU Architecture View
Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming
multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800.
The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP
cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared
memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
NVIDIA's CUDA (Compute Unified
Device Architecture)
◼ Threads are grouped in thread blocks
◼ Typically 128, 192, or 256 threads per block
[Figure: the application on the HOST launches thread blocks; BLOCK 1 contains threads (0,0), (0,1), (0,2), ...]
• We call the off chip DRAM shared by the whole GPU and all thread
blocks GPU Memory.
• Rather than rely on large caches to contain the whole working sets of
an application, GPUs traditionally use smaller streaming caches and
rely on extensive multithreading of threads of SIMD instructions to
hide the long latency to DRAM.
Similarities and differences between multicore with Multimedia SIMD
extensions and recent GPUs.
NVIDIA GeForce 8800 GPU
Texture/Processor Cluster (TPC)
• Each TPC contains a geometry controller, an SMC,
two SMs, and a texture unit, as shown in the figure.
• The geometry controller maps the logical
graphics vertex pipeline into recirculation on the
physical SMs by directing all primitive and vertex
attribute and topology flow in the TPC.
• The SMC controls multiple SMs, arbitrating the
shared texture unit, load/store path, and I/O
path.
• The SMC serves three graphics workloads
simultaneously: vertex, geometry, and pixel.
Streaming Multiprocessor (SM)
• The SM is a unified graphics and computing
multiprocessor that executes vertex, geometry,
and pixel-fragment shader programs and parallel
computing programs.
• The SM consists of eight SP thread processor
cores, two SFUs, a multithreaded instruction
fetch and issue unit (MT issue), an instruction
cache, a read-only constant cache, and a 16 KB
read/write shared memory. It executes scalar
instructions for individual threads.
SP core
• The multithreaded SP core is the primary
thread processor.
• Its register file provides 1024 scalar 32-bit
registers for up to 96 threads (more threads
than in the example SP of Section C.4).
• Its floating-point add and multiply operations
are compatible with the IEEE 754 standard for
single precision FP numbers, including not-a-
number (NaN) and infinity.
Special Function Unit (SFU)
• The SFU supports computation of both transcendental
functions and planar attribute interpolation.
crossbar network: a network that allows any node to communicate with any other node in one pass through the network.
network bandwidth: informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.
Popular multistage network topologies
for eight nodes
Intel® Xeon® processor Scalable family
Microarchitecture Overview
NVIDIA GPU AI Workload
Processing Flow Architecture
Diagram
Data Loaded into GPU Memory: Input data (images, videos,
sensor data) is loaded into high-speed GPU memory (HBM or
GDDR6X) for fast access during computations.
Parallel Processing Across CUDA Cores: The data is split into
smaller tasks, processed simultaneously by CUDA cores,
performing operations like matrix multiplications and
convolutions for training and inference.
• NVIDIA GPUs use HBM and GDDR6X memory to enable fast data transfer, ensuring
large datasets are handled efficiently. With high data throughput, these memory
types prevent bottlenecks and accelerate training and inference by quickly fetching
data for processing.
• NVIDIA GPUs feature compute pipelines optimized for AI tasks, including matrix
multiplications, activation functions, and gradient computations. These
optimizations reduce latency and boost throughput, enabling efficient real-time AI
inference and supporting both training and inference phases.
Total Cores: 6
Total Threads: 12
Max Turbo Frequency: 4.3 GHz
Processor Base Frequency: 2.9 GHz
Cache: 12 MB Intel® Smart Cache
Bus Speed: 8 GT/s
TDP: 65 W

Supplemental Information
Marketing Status: Launched
Launch Date: Q3'25
Embedded Options Available: No
Use Conditions: PC/Client/Tablet
Intel® Core 7 Processor 251E
36M Cache, up to 5.60 GHz

CPU Specifications
Total Cores: 24
# of Performance-cores: 8
# of Efficient-cores: 16
Total Threads: 32
Max Turbo Frequency: 5.6 GHz
Performance-core Max Turbo Frequency: 5.6 GHz
All Core Turbo Frequency: 4.9 GHz
Efficient-core Max Turbo Frequency: 4.4 GHz
Performance-core Base Frequency: 2.1 GHz
Efficient-core Base Frequency: 1.6 GHz
Cache: 36 MB
Bus Speed: 0.000 GT/s
TDP: 65 W

Supplemental Information
Marketing Status: Launched
Launch Date: Q1'25
Embedded Options Available: Yes
Use Conditions: PC/Client/Tablet
Product Tuning (Embedded Uses): Yes

Memory Specifications
Max Memory Size (dependent on memory type): 192 GB
Memory Types: Up to DDR5 5600 MT/s, Up to DDR4 3200 MT/s
Max # of Memory Channels: 2
ECC Memory Supported: Yes

GPU Specifications
GPU Name: Intel® UHD Graphics 770
Graphics Base Frequency: 300 MHz
Graphics Max Dynamic Frequency: 1.66 GHz
Graphics Output: eDP 1.4b, DP 1.4a, HDMI 2.1
Execution Units:
Max Resolution (HDMI)‡: 4096 x 2160 @ 60Hz
Max Resolution (DP)‡: 7680 x 4320 @ 60Hz
Max Resolution (eDP - Integrated Flat Panel)‡: 5120 x 3200 @ 120Hz
DirectX* Support: 12
OpenGL* Support: 4.5
OpenCL* Support: 3
Multi-Format Codec Engines: 2
Intel® Quick Sync Video: Yes
Intel® Clear Video HD Technology: Yes
# of Displays Supported: 4
Device ID: 0xA780
Intel® Data Center GPU Flex 170V

Essentials
Product Collection: Intel® Data Center GPU Flex Series
Microarchitecture: Xe-HPG
Code Name: Products formerly Arctic Sound
Marketing Status: Launched
Launch Date: Q3'24
Warranty Period: 3 yrs
Embedded Options Available: No
Use Conditions: Server/Enterprise
Usecase: Cloud Computing
Device ID: 0x56C2

GPU Specifications
Xe-cores: 32
Render Slices: 8
Ray Tracing Units: 32
Execution Units: 512
Graphics Max Dynamic Clock: 2050 MHz
Intel® Xe Matrix Extensions (Intel® XMX) Max Dynamic Clock: 1500 MHz
Interface: Up to PCI Express 4.0 x16

Features
Matrix Extensions (Intel® XMX): Yes
H.264 Hardware Encode/Decode: Yes
H.265 (HEVC) Hardware Encode/Decode: Yes
AV1 Encode/Decode: Yes
VP9 Bitstream & Decoding: Yes

Memory Specifications
Memory Size: 16 GB
Memory Type: GDDR6
Graphics Memory Interface: 256 bit
Graphics Memory Bandwidth: 576 GB/s