Parallel Processing and Multi-Core Architecture

The document discusses parallel processing, GPU, and multi-core architecture, highlighting the increasing computational demands and the limitations of sequential architectures. It covers the challenges of writing software for multi-processor systems, the categorization of parallel hardware, and the benefits of multi-core processors in handling multithreaded applications. Additionally, it addresses cache coherence issues in multi-core systems and potential solutions for maintaining data consistency across processors.


CH3
Parallel Processing, GPU, and Multi-Core Architecture
Introduction to Parallel Processing, Types of Parallelism, Multi-Core Architecture, GPU Architecture, Case study on: Parallel Processing, GPUs, and Multi-Core processors
Parallel Processing
• Computation requirements are ever increasing
- visualization, distributed databases, simulations, scientific prediction (earthquake), etc.
• Sequential architectures are reaching physical limits in processing speed
Parallel Processing
• Significant development in networking technology is paving the way for network-based, cost-effective parallel computing.
• Parallel processing technology is mature and is being exploited commercially.
Goals and Challenges: PP
• It is difficult to write software that uses multiple processors to complete one task faster, and the problem gets worse as the number of processors increases.
(analogy: eight reporters trying to write a single story in hopes of doing the work eight times faster)
Why has this been so?
• You must get better performance or better energy efficiency from a parallel processing program on a multiprocessor; otherwise, you would just use a sequential program on a uniprocessor, as sequential programming is simpler.
• In fact, uniprocessor design techniques such as superscalar and out-of-order execution take advantage of instruction-level parallelism.
• Such innovations reduced the demand for rewriting programs for multiprocessors, since programmers could do nothing and yet their sequential programs would run faster on new computers.
Categorization of parallel hardware
(Figure: SIMD/Vector — a single instruction stream operating on data input streams A, B, and C, producing data output streams A, B, and C)
Vector Unit (SIMD) Working
• Vector arithmetic instructions usually only allow element N of one vector register to take part in operations with element N from other vector registers.
• This dramatically simplifies the construction of a highly parallel vector unit, which can be structured as multiple parallel vector lanes.
• Example: a single vector add instruction, C = A + B
Structure of a vector unit containing four lanes
• Each lane holds every fourth element of each vector register.
• The figure shows three vector functional units: an FP add, an FP multiply, and a load-store unit.
• Each of the vector arithmetic units contains four execution pipelines, one per lane, which act in concert to complete a single vector instruction.
• Note how each section of the vector-register file only needs to provide enough read and write ports.
Why NOT more popular outside high-performance computing?
• Concerns about the larger state for vector registers increasing context switch time, and the difficulty of handling page faults in vector loads and stores.
• An advantage of vector and multimedia extensions is that it is relatively easy to extend a scalar instruction set architecture with these instructions to improve performance of data-parallel operations.
(Figure: MIMD — instruction streams A, B, and C, each operating on its own data input stream and producing its own data output stream)
No Multithreading
Hardware Multithreading/MIMD
Hardware multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion to try to utilize the hardware resources efficiently.

Fine-grained multithreading: switches between threads on each instruction, resulting in interleaved execution of multiple threads. This interleaving is often done in a round-robin fashion, skipping any threads that are stalled at that clock cycle.

Coarse-grained multithreading: switches threads only on costly stalls, such as last-level cache misses. This change relieves the need to have thread switching be extremely fast and is much less likely to slow down the execution of an individual thread.
Simultaneous multithreading (SMT) is a variation on hardware multithreading that uses the resources of a multiple-issue, dynamically scheduled pipelined processor to exploit thread-level parallelism at the same time it exploits instruction-level parallelism.
With Multithreading
Multithreading
Without SMT, only a single thread can run at any given time
(Figure: execution units occupied by Thread 1, a floating-point workload; integer units idle)
(Figure: execution units occupied by Thread 2, an integer workload; floating-point units idle)
SMT processor: both threads can run concurrently
(Figure: Thread 1 (floating point) and Thread 2 (integer operation) share the execution units, caches, BTB, and uop queues simultaneously)
Instruction-Level Parallelism (ILP) vs. Thread-Level Parallelism (TLP):
• ILP focuses on executing multiple instructions from the same program concurrently within a single processor,
• while TLP involves running multiple threads of execution, potentially on different processors, simultaneously.
Thread-Level Parallelism (TLP) Examples:
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• "Anything that can be threaded today will map efficiently to multi-core"
• BUT: some applications are difficult to parallelize
Why multi-core?
• Difficult to make single-core clock frequencies even higher
• Deeply pipelined circuits:
- heat problems
- speed-of-light problems
- difficult design and verification
- large design teams necessary
- server farms need expensive air-conditioning
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
Multicore Philosophy
- Two or more cores within a single die
- Each core has its own set of instructions and architectural resources
A multi-core processor (or chip-level multiprocessor, CMP) combines two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or more dies packaged together. A dual-core processor contains two cores, and a quad-core processor contains four cores.
Multi-core architectures
Replicate multiple processor cores on a single die.
(Figure: Core 1, Core 2, Core 3, and Core 4 sharing a bus interface on a multi-core CPU chip)
Multi-Core Architecture
Shared memory multiprocessor (SMP)
• A single physical address space across all processors
• Processors communicate through shared variables in memory, with all processors capable of accessing any memory location via loads and stores.
Uniform Memory Access (UMA): latency to a word in memory does not depend on which processor asks for it.
Nonuniform Memory Access (NUMA): main memory is divided and attached to different microprocessors or to different memory controllers on the same chip.
Lock: only one processor at a time can acquire the lock, and other processors interested in shared data must wait until the original processor unlocks the synchronization variable.
As processors operating in parallel will
normally share data, they also need to
coordinate when operating on shared data;
otherwise, one processor could start working
on data before another is finished with it
Multi-core CPU chip
• The cores fit on a single processor socket
• Also called CMP (Chip Multi-Processor)
(Figure: Cores 1-4 side by side on one chip)
The cores run in parallel
(Figure: thread 1 on Core 1, thread 2 on Core 2, thread 3 on Core 3, thread 4 on Core 4)
Within each core, threads are time-sliced (just like on a uniprocessor)
(Figure: several threads multiplexed on each of Cores 1-4)
SMT: not a "true" parallel processor
• Enables better threading (e.g. up to 30%)
• OS and applications perceive each simultaneous thread as a separate "virtual processor"
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
Multi-core: threads can run on separate cores
(Figure: Thread 1 and Thread 2 each on their own core, each core with its own caches, BTB, and I-TLB)
Multi-core: threads can run on separate cores
(Figure: Thread 3 and Thread 4 each on their own core, each with its own integer units, decoder, trace cache, rename logic, uop queues, BTB and TLB)
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
- Single-core, non-SMT: standard uniprocessor
- Single-core, with SMT
- Multi-core, non-SMT
- Multi-core, with SMT
- The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them "hyper-threads"
SMT Dual-core: all four threads can run concurrently
(Figure: Threads 1 and 3 share one core via SMT while Threads 2 and 4 share the other)
Comparison: multi-core vs SMT
• Multi-core:
- Since there are several cores,
each is smaller and not as powerful
(but also easier to design and manufacture)
- However, great with thread-level parallelism
• SMT
- Can have one large and fast superscalar core
- Great performance on a single thread
- Mostly still only exploits instruction-level
parallelism

Pentium D
Core Duo
Core 2 Duo & Core 2 Quad
Some Intel Processors
Hyper-Threading:
- Parts of a single processor are shared between threads
- Execution engine is shared
- OS task switching does not happen in Hyper-Threading
- Processor is kept as busy as possible
A Simple Parallel Processing Program for a Shared Address Space
Suppose we want to sum 64,000 numbers on a shared memory multiprocessor computer with uniform memory access time. Let's assume we have 64 processors. The first step is to ensure a balanced load per processor, so we split the set of numbers into subsets of the same size. We do not allocate the subsets to a different memory space, since there is a single memory space for this machine; we just give different starting addresses to each processor. Pn is the number that identifies the processor, between 0 and 63. All processors start the program by running a loop that sums their subset of numbers:

sum[Pn] = 0;
for (i = 1000*Pn; i < 1000*(Pn+1); i += 1)
    sum[Pn] += A[i];  /* sum the assigned areas */
The last four levels of a reduction
The next step is to add these 64 partial sums. This step is called a reduction, where we divide and conquer.
Multicore SMP
The cache coherence problem
• Since we have private caches: how to keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
(Figure: multi-core chip with Cores 1-4, each with one or more levels of private cache, all connected to main memory)
The cache coherence problem
Suppose variable x initially contains 15213
(Figure: all caches empty; main memory: x=15213)
The cache coherence problem
Core 1 reads x
(Figure: Core 1 cache: x=15213; main memory: x=15213)
The cache coherence problem
Core 2 reads x
(Figure: Core 1 cache: x=15213; Core 2 cache: x=15213; main memory: x=15213)
The cache coherence problem
Core 1 writes to x, setting it to 21660
(Figure: Core 1 cache: x=21660; Core 2 cache: x=15213; main memory: x=21660, assuming write-through caches)
The cache coherence problem
Core 2 attempts to read x... gets a stale copy
(Figure: Core 1 cache: x=21660; Core 2 cache: x=15213 (stale); main memory: x=21660)
Solutions for cache coherence
• This is a general problem with multiprocessors, not limited just to multi-core
• There exist many solution algorithms, coherence protocols, etc.
• A simple solution: invalidation-based protocol with snooping
Inter-core bus
(Figure: Cores 1-4 with private caches connected to main memory via an inter-core bus)
Invalidation protocol with snooping

• Invalidation:
If a core writes to a data item, all other
copies of this data item in other caches
are invalidated
• Snooping:
All cores continuously "snoop" (monitor)
the bus connecting the cores.
The cache coherence problem
Revisited: Cores 1 and 2 have both read x
(Figure: Core 1 cache: x=15213; Core 2 cache: x=15213; main memory: x=15213)
The cache coherence problem
Core 1 writes to x, setting it to 21660
(Figure: Core 1 sends an invalidation request on the inter-core bus; main memory: x=21660, assuming write-through caches)
The cache coherence problem
After invalidation:
(Figure: Core 1 cache: x=21660; Core 2's copy invalidated; main memory: x=21660)
The cache coherence problem
Core 2 reads x. Cache misses, and loads the new copy.
(Figure: Core 1 cache: x=21660; Core 2 cache: x=21660; main memory: x=21660)
Alternative to invalidate protocol: update protocol
Core 1 writes x=21660:
(Figure: the updated value is broadcast on the inter-core bus, so Core 2's cache is updated to x=21660; main memory: x=21660, assuming write-through caches)
Invalidation vs update
• Multiple writes to the same location
- invalidation: only the first time
- update: must broadcast each write
(which includes new variable value)

• Invalidation generally performs better:


it generates less bus traffic
GPU Evolution
Highly parallel, highly multithreaded multiprocessor optimized for graphics computing and other applications
• New GPUs are being developed every 12 to 18 months
• Number crunching: 1 card ~= 1 teraflop ~= small cluster
• 1980s – No GPU; the PC used a VGA controller
• 1990s – More functions added into the VGA controller
• 1997 – 3D acceleration functions: hardware for triangle setup and rasterization, texture mapping, shading
• 2000 – A single-chip graphics processor (beginning of the GPU term)
• 2005 – Massively parallel programmable processors
Key characteristics as to how GPUs vary from CPUs:
■ GPUs are accelerators that supplement a CPU, so they do not need to be able to perform all the tasks of a CPU. This role allows them to dedicate all their resources to graphics. It's fine for GPUs to perform some tasks poorly or not at all, given that in a system with both a CPU and a GPU, the CPU can do them if needed.
■ GPU problem sizes are typically hundreds of megabytes to gigabytes, but not hundreds of gigabytes to terabytes. These differences led to different styles of architecture.
■ Perhaps the biggest difference is that GPUs do not rely on multilevel caches to overcome the long latency to memory, as do CPUs. Instead, GPUs rely on hardware multithreading to hide the latency to memory.
Key characteristics as to how GPUs vary from CPUs:
■ GPU memory is thus oriented toward bandwidth rather than latency. There are even special graphics DRAM chips for GPUs that are wider and have higher bandwidth than DRAM chips for CPUs. In addition, GPUs have traditionally had smaller main memories than conventional microprocessors. In 2013, GPUs typically had 4 to 6 GiB or less, while CPUs had 32 to 256 GiB.
• The GPU is a coprocessor.
■ Given the reliance on many threads to deliver good memory bandwidth, GPUs can accommodate many parallel processors (MIMD) as well as many threads. Hence, each GPU processor is more highly multithreaded than a typical CPU, plus they have more processors.
Thus, GPU hardware has two levels of hardware schedulers:
1. The Thread Block Scheduler, which assigns blocks of threads to multithreaded SIMD processors, and
2. The SIMD Thread Scheduler within a SIMD processor, which schedules when SIMD threads should run.
The SIMD instructions of these threads are 32 wide, so each thread of SIMD instructions computes 32 of the elements of the computation.
Data path of a multithreaded SIMD Processor
• These SIMD threads have their own program counters and they run on a multithreaded SIMD processor.
• The SIMD Thread Scheduler includes a controller that lets it know which threads of SIMD instructions are ready to run, and then it sends them off to a dispatch unit to be run on the multithreaded SIMD processor.
• It is identical to a hardware thread scheduler in a traditional multithreaded processor, except that it is scheduling threads of SIMD instructions.
Intel/AMD CPU with GPU
GPU Architecture
• SP: scalar processor ('CUDA core'); executes one thread
• SM: streaming multiprocessor; 32 SPs (or 16, 48, or more); fast local 'shared memory' (16 KiB or 64 KiB) shared between the SPs
(Figure: an SM's grid of SPs with its shared memory, connected to global memory on the device and to the host)
• GPU:
➢ SMs
o 30 SMs on GT200
o 14 SMs on Fermi
➢ For example, GTX 480: 14 SMs x 32 cores = 448 cores on a GPU
• GDDR memory: global memory (on device), 512 MiB - 6 GiB
(Figure: SMs with their SP arrays and shared memory, global memory on the device, and the host)
Datapath of a multithreaded SIMD
Processor
More Detailed GPU Architecture View

Basic unified GPU architecture. Example GPU with 112 streaming processor (SP) cores organized in 14 streaming
multiprocessors (SMs); the cores are highly multithreaded. It has the basic Tesla architecture of an NVIDIA GeForce 8800.
The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP
cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared
memory. Copyright © 2009 Elsevier, Inc. All rights reserved.
NVIDIA's CUDA (Compute Unified Device Architecture)
• a C-inspired programming language that allows programmers to write programs directly for the GPU
• enables the programmer to write C programs to execute on GPUs
GPU Programming API
• CUDA (Compute Unified Device Architecture): parallel GPU programming API created by NVIDIA
– Hardware and software architecture for issuing and managing computations on the GPU
• Massively parallel architecture: over 8000 threads is common
• API libraries with C/C++/Fortran language bindings
• Numerical libraries: cuBLAS, cuFFT
• OpenGL – an open standard for GPU programming
• DirectX – a series of Microsoft multimedia programming interfaces
How To Program For GPUs
◼ Parallelization
◼ Decomposition to threads
◼ Memory: shared memory, global memory
◼ Enormous processing power
◼ Thread communication
◼ Synchronization, no interdependencies
(Figure: SM with SPs and shared memory; global memory on the device; host)
Application Thread blocks
◼ Threads grouped in thread blocks
◼ 128, 192 or 256 threads in a block
• One thread block executes on one SM
– All threads share the 'shared memory'
– 32 threads are executed simultaneously (a 'warp')
(Figure: Block 1 containing threads (0,0) through (1,2))
Application Thread blocks
◼ Blocks execute on SMs
◼ - execute in parallel
◼ - execute independently!
• Blocks form a GRID
• Thread ID: unique within a block
• Block ID: unique within a grid
(Figure: a grid of Blocks 0-8; Block 1 containing threads (0,0) through (1,2))
Thread Batching: Grids and Blocks
• A kernel is executed as a grid of thread blocks
▪ All threads share data memory space
• A thread block is a batch of threads that can cooperate with each other by:
▪ Synchronizing their execution
– For hazard-free shared memory accesses
▪ Efficiently sharing data through a low-latency shared memory
• Two threads from two different blocks cannot cooperate
(Figure: the host launches Kernel 1 on Grid 1, blocks (0,0) through (2,1), and Kernel 2 on Grid 2; Block (1,1) contains threads (0,0) through (4,2). Courtesy: NVIDIA)
GPU Memory structures
• We call the on-chip memory that is local to each multithreaded SIMD processor Local Memory. It is shared by the SIMD Lanes within a multithreaded SIMD processor, but this memory is not shared between multithreaded SIMD processors.
• We call the off-chip DRAM shared by the whole GPU and all thread blocks GPU Memory.
• Rather than rely on large caches to contain the whole working sets of an application, GPUs traditionally use smaller streaming caches and rely on extensive multithreading of threads of SIMD instructions to hide the long latency to DRAM.
Similarities and differences between multicore with Multimedia SIMD
extensions and recent GPUs.
NVIDIA GeForce 8800 GPU

Texture/Processor Cluster (TPC)
• Each TPC contains a geometry controller, an SMC, two SMs, and a texture unit, as shown in the figure.
• The geometry controller maps the logical graphics vertex pipeline into recirculation on the physical SMs by directing all primitive and vertex attribute and topology flow in the TPC.
• The SMC controls multiple SMs, arbitrating the shared texture unit, load/store path, and I/O path.
• The SMC serves three graphics workloads simultaneously: vertex, geometry, and pixel.
Streaming Multiprocessor (SM)
• The SM is a unified graphics and computing multiprocessor that executes vertex, geometry, and pixel-fragment shader programs and parallel computing programs.
• The SM consists of eight SP thread processor cores, two SFUs, a multithreaded instruction fetch and issue unit (MT issue), an instruction cache, a read-only constant cache, and a 16 KB read/write shared memory. It executes scalar instructions for individual threads.
SP core
• The multithreaded SP core is the primary thread processor.
• Its register file provides 1024 scalar 32-bit registers for up to 96 threads (more threads than in the example SP of Section C.4).
• Its floating-point add and multiply operations are compatible with the IEEE 754 standard for single-precision FP numbers, including not-a-number (NaN) and infinity.
Special Function Unit (SFU)
• The SFU supports computation of both transcendental functions and planar attribute interpolation.
• It uses quadratic interpolation based on enhanced minimax approximations to approximate the reciprocal, reciprocal square root, log2 x, 2^x, and sin/cos functions at one result per cycle.
• The SFU also supports pixel attribute interpolation such as color, depth, and texture coordinates at four samples per cycle.
Texture Memory
• Texture memory holds large read-only arrays of data.
• Textures for computing have the same attributes and capabilities as textures used with 3D graphics.
• Although textures are commonly two-dimensional images (2D arrays of pixel values), 1D (linear) and 3D (volume) textures are also available.
Raster Operation Processors (ROPs)
• The GPU comprises a scalable streaming processor array (SPA), which performs all of the GPU's programmable calculations, and a scalable memory system, which comprises external DRAM control and fixed-function Raster Operation Processors (ROPs).
• ROPs perform color and depth frame buffer operations directly on memory.
• Each ROP unit is paired with a specific memory partition. ROP partitions are fed from the SMs via an interconnection network.
• Each ROP is responsible for depth and stencil tests and updates, as well as color blending. The ROP and memory controllers cooperate to implement lossless color and depth compression (up to 8:1) to reduce external bandwidth demand.
• ROP units also perform atomic operations on memory.
Amdahl's Law
• Amdahl's Law: the speedup that can be achieved by parallelizing a program is limited by the sequential fraction of the program.
▪ Example: if one-third of the program must be performed sequentially, no matter how many processors you use, you can only get about a 3x speedup.
▪ Therefore, part of the trick becomes learning how to minimize the portion of the program that must be performed sequentially.
• Making better parallel algorithms.
core
A processor-within-a-processor.
• A "multi-core" processor is one with several cores
inside.
Map/Reduce
• map/reduce: A strategy for implementing parallel algorithms.
– map: A master worker takes the problem input, divides it into smaller sub-problems, and distributes the sub-problems to workers (threads).
– reduce: The master worker collects sub-solutions from the workers and combines them in some way to produce the overall answer.
• Our multi-threaded merge sort is an example of such an algorithm.
Frameworks and tools have been written to perform map/reduce:
– MapReduce framework by Google
– Hadoop framework by Yahoo!
– related to the ideas of Big Data and Cloud Computing
– also related to functional programming
Message-Passing Multiprocessors
• message passing: Communicating between multiple processors by explicitly sending and receiving information.
• send message routine: A routine used by a processor in machines with private memories to pass a message to another processor.
• receive message routine: A routine used by a processor in machines with private memories to accept a message from another processor.
Clusters
Collections of computers connected via I/O over standard network switches to form a message-passing multiprocessor.
• The most widespread example today of the message-passing parallel computer.
• Applications with little communication, like Web search, mail servers, and file servers.
• Since a cluster consists of independent computers connected through a local area network, it is much easier to replace a computer without bringing down the system in a cluster than in a shared memory multiprocessor.
Clusters Applications
• Their lower cost, higher availability, and rapid, incremental expandability make clusters attractive to Internet service providers, despite their poorer communication performance when compared to large-scale shared memory multiprocessors.
• The search engines that hundreds of millions of us use every day depend upon this technology.
• Amazon, Facebook, Google, Microsoft, and others all have multiple datacenters, each with clusters of tens of thousands of servers.
• Clearly, the use of multiple processors in Internet service companies has been hugely successful.
Warehouse-Scale Computers(WSC)
• Internet services, such as those described above,
necessitated the construction of new buildings to house,
power, and cool 100,000 servers.
• Although they may be classified as just large clusters, their
architecture and operation are more sophisticated.
• They act as one giant computer and cost on the order of
$150M for the building, the electrical and cooling
infrastructure, the servers, and the networking equipment
that connects and houses 50,000 to 100,000 servers.
• We consider them a new class of computer, called
Warehouse-Scale Computers (WSC).
WSC
• The most popular framework for batch processing in a WSC is MapReduce [Dean, 2008] and its open-source twin Hadoop.
(Figure: parallelism from MIMD, SIMD, and both MIMD and SIMD over time for x86 computers)
Multiprocessor Network Topologies
• Ring topology
• bisection bandwidth: The bandwidth between two equal parts of a multiprocessor.
• network bandwidth: Informally, the peak transfer rate of a network; can refer to the speed of a single link or the collective transfer rate of all links in the network.
• crossbar network: A network that allows any node to communicate with any other node in one pass through the network.
Popular multistage network topologies for eight nodes
Intel® Xeon® processor Scalable family
Microarchitecture Overview
NVIDIA GPU AI Workload Processing Flow Architecture Diagram
1. Data Loaded into GPU Memory: input data (images, videos, sensor data) is loaded into high-speed GPU memory (HBM or GDDR6X) for fast access during computations.
2. Parallel Processing Across CUDA Cores: the data is split into smaller tasks, processed simultaneously by CUDA cores, performing operations like matrix multiplications and convolutions for training and inference.
3. Tensor Cores for Matrix Operations: Tensor Cores accelerate deep learning tasks by performing high-speed matrix multiplications using mixed-precision arithmetic, boosting performance and reducing latency.
4. Output Transferred to CPU for Deployment: once computations are complete, the results are sent back to the CPU for further processing or deployment, such as in real-time applications like autonomous systems or medical diagnostics.
How Parallel Processing is Optimized
in NVIDIA Architecture
• High-Speed Memory Transfer (HBM, GDDR6X)
• NVIDIA GPUs use HBM and GDDR6X memory to enable fast data transfer, ensuring
large datasets are handled efficiently. With high data throughput, these memory
types prevent bottlenecks and accelerate training and inference by quickly fetching
data for processing.
• Optimized Compute Pipelines for AI Inference
• NVIDIA GPUs feature compute pipelines optimized for AI tasks, including matrix
multiplications, activation functions, and gradient computations. These
optimizations reduce latency and boost throughput, enabling efficient real-time AI
inference and supporting both training and inference phases.
• NVIDIA GPUs, with their parallel computing power, accelerate AI workloads through specialized hardware like Tensor Cores and high-bandwidth memory. These GPUs help speed up complex tasks in healthcare, autonomous driving, and industrial automation, pushing AI capabilities to new limits.
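As a rough illustration of the mixed-precision idea mentioned above (a simplified model of our own, not NVIDIA's actual Tensor Core datapath), the sketch below stores operands in IEEE half precision (fp16) but accumulates the dot product at full precision, which is the essence of mixed-precision matrix arithmetic.

```python
import struct

def to_fp16(x):
    # Round a float to IEEE half precision (a typical Tensor Core input
    # format) and back, exposing the storage-precision loss.
    return struct.unpack('e', struct.pack('e', x))[0]

def mixed_precision_dot(a, b):
    # Operands are read at fp16 precision, but the accumulator stays at
    # full precision, so rounding error does not compound across terms.
    acc = 0.0
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return acc

a = [0.1] * 1000
b = [0.1] * 1000
print(mixed_precision_dot(a, b))  # close to 10.0, small fp16 rounding error
```

Halving the operand width doubles the number of values moved per memory transaction and lets the multiplier arrays be packed more densely, which is where the throughput gain comes from.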
Real-World Applications Using NVIDIA
GPUs /
Case Studies
• Tesla’s Autopilot: NVIDIA GPUs process data from
Tesla’s sensors in real-time, enabling fast object
detection and decision-making for autonomous
driving.
• Healthcare AI (NVIDIA Clara): Clara uses GPUs for fast
medical image processing, helping detect conditions
like cancer and heart disease more efficiently.
• Industrial Automation: GPUs power AI systems for
defect detection and quality control in manufacturing,
improving efficiency and reducing manual inspection.
Healthcare AI (NVIDIA Clara): NVIDIA Clara is a suite of computing platforms, software, and services that powers AI solutions for healthcare and life sciences, from imaging and instruments to genomics and drug discovery.
Intel® Core i5-110 Processor
12M Cache, up to 4.30 GHz
CPU Specifications
• Total Cores: 6
• Total Threads: 12
• Max Turbo Frequency: 4.3 GHz
• Processor Base Frequency: 2.9 GHz
• Cache: 12 MB Intel® Smart Cache
• Bus Speed: 8 GT/s
• TDP: 65 W

Supplemental Information
• Marketing Status: Launched
• Launch Date: Q3'25
• Embedded Options Available: No
• Use Conditions: PC/Client/Tablet
Intel® Core 7 Processor 251E
36M Cache, up to 5.60 GHz

CPU Specifications
• Total Cores: 24
• # of Performance-cores: 8
• # of Efficient-cores: 16
• Total Threads: 32
• Max Turbo Frequency: 5.6 GHz
• Performance-core Max Turbo Frequency: 5.6 GHz
• All Core Turbo Frequency: 4.9 GHz
• Efficient-core Max Turbo Frequency: 4.4 GHz
• Performance-core Base Frequency: 2.1 GHz
• Efficient-core Base Frequency: 1.6 GHz
• Cache: 36 MB
• TDP: 65 W

Supplemental Information
• Marketing Status: Launched
• Launch Date: Q1'25
• Embedded Options Available: Yes
• Use Conditions: PC/Client/Tablet
• Product Tuning (Embedded Uses): Yes

Memory Specifications
• Max Memory Size (dependent on memory type): 192 GB
• Memory Types: Up to DDR5 5600 MT/s; Up to DDR4 3200 MT/s
• Max # of Memory Channels: 2
• ECC Memory Supported: Yes

GPU Specifications
• GPU Name: Intel® UHD Graphics 770
• Graphics Base Frequency: 300 MHz
• Graphics Max Dynamic Frequency: 1.66 GHz
• Graphics Output: eDP 1.4b, DP 1.4a, HDMI 2.1
• Max Resolution (HDMI)‡: 4096 x 2160 @ 60Hz
• Max Resolution (DP)‡: 7680 x 4320 @ 60Hz
• Max Resolution (eDP - Integrated Flat Panel)‡: 5120 x 3200 @ 120Hz
• DirectX* Support: 12
• OpenGL* Support: 4.5
• OpenCL* Support: 3
• Multi-Format Codec Engines: 2
• Intel® Quick Sync Video: Yes
• Intel® Clear Video HD Technology: Yes
• # of Displays Supported: 4
• Device ID: 0xA780
Intel® Data Center GPU Flex 170V

Essentials
• Product Collection: Intel® Data Center GPU Flex Series
• Microarchitecture: Xe-HPG
• Code Name: Products formerly Arctic Sound
• Marketing Status: Launched
• Launch Date: Q3'24
• Warranty Period: 3 yrs
• Embedded Options Available: No
• Use Conditions: Server/Enterprise
• Usecase: Cloud Computing
• Device ID: 0x56C2

GPU Specifications
• Xe-cores: 32
• Render Slices: 8
• Ray Tracing Units: 32
• Matrix Extensions (Intel® XMX) Engines: 512
• Execution Units: 512
• Graphics Max Dynamic Clock: 2050 MHz
• Intel® XMX Max Dynamic Clock: 1500 MHz
• Interface: Up to PCI Express 4.0 x16

Features
• H.264 Hardware Encode/Decode: Yes
• H.265 (HEVC) Hardware Encode/Decode: Yes
• AV1 Encode/Decode: Yes
• VP9 Bitstream & Decoding: Yes

Memory Specifications
• Memory Size: 16 GB
• Memory Type: GDDR6
• Graphics Memory Interface: 256 bit
• Graphics Memory Bandwidth: 576 GB/s