Parallel Computing Communication & Synchronization

The document discusses key concepts in parallel computing, including communication methods (shared memory and message passing), synchronization techniques, and granularity types (coarse and fine). It highlights the importance of observed speedup, efficiency, and scalability in evaluating parallel program performance. Additionally, it covers manual and automatic parallelization methods, emphasizing the trade-offs between control and ease of use.


🔁 1. Communications in Parallel Computing

📌 What is it?
In parallel programs, tasks (or threads/processes) often need to share or exchange data with each other. This exchange is called communication, no matter how it's done.

📦 Two Main Methods:

1. Shared Memory – All tasks access a common memory space (like multiple people using the same whiteboard).

2. Message Passing (Network) – Tasks send messages to each other over a network (like texting between phones).

✅ Key Point: No matter if it's shared memory or network — data exchange is called communication.
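Both styles can be sketched in plain Python, with threads standing in for parallel tasks (the names `shared`, `channel`, `sender`, etc. are illustrative, not a specific library's API):

```python
import threading
import queue

# Shared memory: both tasks access the same list (a common address space).
shared = [0, 0]

def shared_writer(i, value):
    shared[i] = value              # direct write into common memory

# Message passing: tasks exchange data through an explicit channel.
channel = queue.Queue()

def sender():
    channel.put("partial result")  # send a message to another task

def receiver(out):
    out.append(channel.get())      # block until the message arrives

received = []
threads = [
    threading.Thread(target=shared_writer, args=(0, 42)),
    threading.Thread(target=sender),
    threading.Thread(target=receiver, args=(received,)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared[0])    # communicated via shared memory
print(received[0])  # communicated via a message
```

Either way, data moved from one task to another — that movement is communication.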

🕒 2. Synchronization
📌 What is it?
Synchronization means making sure all tasks are in step with each other, especially
when they need to wait for others to reach a certain point before moving on.

🧠 Think of it like:
A group of runners who are only allowed to move to the next lap when all
members finish the current lap — they must sync up.

🛠️ How it’s implemented:


●​ Barriers: A point where all tasks must wait until every other task has reached it.​

●​ Locks / Semaphores: Used to control access to shared resources.​
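A minimal sketch of both mechanisms in Python's `threading` module (a barrier that every task must reach, plus a lock guarding shared state; the task names are illustrative):

```python
import threading

NUM_TASKS = 3
barrier = threading.Barrier(NUM_TASKS)  # all tasks must reach it before any continues
lock = threading.Lock()                 # controls access to the shared list
finished = []

def task(name):
    # ... each task does its share of work here ...
    barrier.wait()                      # wait until every task reaches this point
    with lock:                          # only one task updates shared state at a time
        finished.append(name)

threads = [threading.Thread(target=task, args=(f"T{i}",)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(finished))  # 3 — every task passed the barrier and updated safely
```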

⚠️ Why They Matter:

| Term | Purpose | Problem If Ignored |
|---|---|---|
| Communication | Share data between tasks | Incomplete or incorrect results |
| Synchronization | Ensure tasks proceed together logically | Race conditions, deadlocks, inconsistent data |

✍️ Scenario-Based Question (Exam Style):

Q: In a parallel matrix multiplication program, each thread calculates a part of the matrix. Once all threads finish, they combine the results. Which concepts are being used here?

A:
✅ Communication – Threads need to send their computed parts to a shared space.
✅ Synchronization – Threads must wait until all are done before combining results.

⚙️ Granularity in Parallel Computing


📌 What is it?
Granularity is about how much computation happens before a task needs to communicate
with others.

It's the ratio:

Granularity = Computation Time / Communication Time

🧱 Two Types of Granularity:

| Type | Meaning | Example |
|---|---|---|
| Coarse-Grained | Lots of computation happens before communication (less communication) | A thread works on a full image block before sending data |
| Fine-Grained | Very frequent communication (small tasks keep sharing data often) | Threads share results after every few calculations |

✅ Key Points for Exam:


● Coarse granularity → better performance (less communication overhead)

● Fine granularity → more communication, more overhead → can slow down performance

● Ideal granularity depends on your task and system architecture

⚡ Observed Speedup
📌 What is it?
Observed speedup tells us how much faster a parallel program runs compared to the serial
(single processor) version.

Observed Speedup (S) = Wall-clock time (Serial) / Wall-clock time (Parallel)

🧠 Simple Example:
●​ Serial time = 20 seconds​

●​ Parallel time (on 4 processors) = 6 seconds​

S = 20 / 6 ≈ 3.33

✅ So, the parallel program is about 3.33 times faster than the serial one.
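The calculation above is just a ratio of wall-clock times, so it is easy to wrap in a small helper (function name is ours, for illustration):

```python
def observed_speedup(serial_time, parallel_time):
    """Observed speedup S = serial wall-clock time / parallel wall-clock time."""
    return serial_time / parallel_time

# The example above: 20 s serial, 6 s on 4 processors.
s = observed_speedup(20, 6)
print(round(s, 2))  # 3.33
```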

💡 Key Insights:
●​ Ideal speedup = number of processors​

○​ If using 4 processors → ideal speedup = 4​

●​ But in real life, speedup is usually less due to communication, synchronization, and
sequential parts (Amdahl’s Law!)​

●​ Still, observed speedup helps measure how good your parallelization is​
📝 Scenario-Based Question:
Q: You run a parallel sorting algorithm. The serial version takes 40 seconds, and the parallel
version (with 8 processors) takes 10 seconds. What is the observed speedup? What type of
granularity would be ideal here?

A:

●​ Observed Speedup = 40 / 10 = 4​

● For sorting (a compute-heavy task), coarse granularity is ideal to reduce communication overhead.

🧵 Fine-Grain Parallelism – Simple Explanation


📌 What it means:
The program breaks tasks into very tiny parts, so processors need to
communicate very often.

🧠 In Simple Words:
Imagine doing group homework, but after every single sentence, you stop and show your
friend to approve before writing the next one.

● That's fine-grained parallelism — lots of checking (communication), very little writing (computation).

📊 Characteristics of Fine-Grain Parallelism:

| Feature | Explanation |
|---|---|
| Small computation | Each task does a tiny bit of work before syncing |
| High communication | Tasks need to talk frequently |
| Low compute-to-comm ratio | Communication happens almost as much (or more) than computation |
| High overhead | Communication and synchronization slow things down |
| Low performance gain | Speedup is limited due to frequent "talking" |

⚠️ Drawback:
If the tasks are too small, processors spend more time talking and waiting
than actually working → leading to slower performance.

📝 Quick Example:
You divide a big matrix into very tiny 2x2 blocks for multiple processors to work on.

●​ After every small operation, processors sync up and exchange data.​

●​ This constant stopping kills performance.​

✅ Instead, it would be better to use coarser granularity — bigger blocks, less communication.

🧱 Coarse-Grain Parallelism – Simple Explanation


📌 What it means:
Each processor does a big chunk of work before needing to communicate with
others.

🧠 In Simple Words:
Imagine you're cleaning your room with your siblings.

●​ You clean your entire area first, and only talk when you're done.​

●​ That’s coarse-grain parallelism — more working, less talking ✅​


📊 Characteristics of Coarse-Grain Parallelism:

| Feature | Explanation |
|---|---|
| Large computation blocks | Each processor works longer before needing to sync |
| Less frequent communication | Communication happens only occasionally |
| High compute-to-comm ratio | More time spent working, less time talking |
| Low overhead | Not much time wasted on syncing |
| Better performance | Because less time is wasted on communication |
| ❌ Harder load balancing | Some processors might get more work than others |

⚠️ Drawback:
If one processor finishes early, it might wait for others — this makes load
balancing a bit tricky.

✅ When to Use:
●​ Tasks that can be split into independent large pieces​

●​ Sorting, image processing, simulations​

📝 Quick Example:
●​ Dividing a 1000x1000 image into 4 large chunks for 4 processors.​

●​ Each works on its own part without talking.​

●​ Only once at the end, they combine the results.​

This is coarse-grain and generally faster than fine-grain because less time is spent on
communication.
Problems whose parallel portion grows with problem size are more scalable than problems with a fixed percentage of parallel time.
✅ Why Use Parallel Processing? (Easy Version)
1. 🕒 Save Time (Reduce Wall-Clock Time)
●​ Parallel programs run tasks at the same time.​

●​ This finishes the job faster than doing everything one by one (serially).​

●​ Example: Sorting a big dataset using 8 CPUs instead of 1.​

2. 📈 Solve Bigger Problems


●​ Some problems (like weather modeling or simulations) are too large for one
computer.​

●​ Parallel processing splits the big problem across many processors to make it
manageable.​

3. 💾 Overcome Memory Limits


●​ One machine might not have enough RAM to load huge data.​

●​ Using multiple nodes shares memory resources → larger memory pool!​

4. 💰 Cost Savings
●​ Parallel systems (like GPU clusters or cloud compute) can be more cost-effective
than supercomputers.​

●​ Also allows flexibility — pay for only what you use (cloud model).​

5. 🧯 Better Fault Tolerance


●​ If one processor fails, others can continue the task or take over.​

●​ Makes systems more robust — good for critical applications (like servers or medical
simulations).​
6. 🧪 Scientific Curiosity / Research
●​ Many scientific fields (AI, physics, genomics) require massive computations.​

●​ Parallel processing makes these computations possible and faster!​

✍️ One-Line Exam Answer:


"Parallel processing is used to save time, solve larger problems, utilize
memory better, save cost, handle failures, and enable advanced scientific
computing."

🌟 Title: Other Metrics for Performance Evaluation


This means:​
When you run a program (especially on multiple processors), how do you measure how
good or fast it is?

🔹 Run-time is the dominant metric

👉 That means the most important thing is: how much time does it take to finish? This is called Execution Time or Run-Time.

🚀 Metrics Explained:
1. Run-Time (Execution Time)

👉 How long the program takes to finish.​


Shorter time = Better performance.

2. Speed (mflops, mips)

● mflops = Million floating-point operations per second.
● mips = Million instructions per second.

These show how fast your system is working, like a speedometer 🚗.

3. Speedup

👉 How much faster your program runs when you use multiple processors.​
It’s calculated like this:

```
Speedup = Time using 1 processor / Time using multiple processors
```

Example:​
If 1 processor takes 10 seconds, and 4 processors take 2.5 seconds,​
Speedup = 10 / 2.5 = 4×

4. Efficiency

This tells you how well you're using your processors.

🧠 Formula:

E = Speedup / Number of Processors

If you have 4 processors and get a speedup of 4:
Efficiency = 4 / 4 = 1, or 100% (perfect)

If speedup is 2 with 4 processors:
Efficiency = 2 / 4 = 0.5, or 50%

So, higher efficiency = better use of processors!
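Both worked examples follow directly from the formula, which a one-line helper makes concrete (function name is ours, for illustration):

```python
def efficiency(speedup, num_processors):
    """E = Speedup / Number of Processors (1.0 means perfect use of all CPUs)."""
    return speedup / num_processors

print(efficiency(4, 4))  # 1.0 -> 100% (perfect)
print(efficiency(2, 4))  # 0.5 -> 50%
```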

5. Scalability

👉 How well your program performs as you add more processors.​


Good scalability = Performance keeps improving with more processors.
✅ Summary (Quick Version)
●​ Run-Time: Total time to finish.​

●​ Speed: Operations per second.​

●​ Speedup: How much faster with multiple processors.​

●​ Efficiency: How well processors are used.​

●​ Scalability: How well it handles more processors.

🧠 General Parallel Terminologies (Simplified)


1. 🔁 Parallel Overhead
This is the extra time spent on managing parallel tasks instead of doing the actual work.

⚙️ It includes:
●​ Task startup time​

●​ Data communication between tasks​

●​ Synchronization (waiting for other tasks)​

●​ Software delays (caused by compilers, OS, etc.)​

●​ Task termination time​

🧠 Think of it like a group project: If you spend more time planning, calling, and emailing
each other than doing the actual work — that's overhead.

2. 💥 Massively Parallel
Refers to a computer system that has hundreds or thousands of processors working
together.

✅ Example: Supercomputers like those used for climate simulation or deep learning.
3. 📈 Scalability
How well a parallel system can handle more work when more processors are added.

🧩 Depends on:
●​ CPU-memory connection speed (bandwidth)​

●​ Network communication speed​

●​ Your algorithm (is it parallel-friendly?)​

●​ The overhead of managing multiple tasks​

📌 If performance keeps improving as you add more CPUs = system is scalable.

✍️ Exam Tip:
"Parallel overhead is the time spent coordinating parallel tasks. Massively
parallel systems use 100s or 1000s of processors. Scalability is how well a
system improves as we add more processors."

✅ Factors That Affect Scalability (Easy Explanation)


Scalability = How well a system performs when you add more processors. These are the
key things that affect it:

🔧 1. Hardware
●​ Especially the bandwidth between CPU and Memory.​

●​ Also includes network speed between processors (if they're on different machines).​

🧠 Think of it like highways: More cars (processors) only help if roads (bandwidth) are
wide enough.

📊 2. Application Algorithm
●​ Some algorithms are easy to split into tasks (parallel-friendly) → great scalability.​
●​ Others are mostly serial → poor scalability.​

💡 Example:
●​ Sorting large lists = good​

●​ Recursive dependency problems = not so good​

⏱️ 3. Parallel Overhead
●​ Extra time spent on managing tasks, synchronizing, communicating instead of
doing the real work.​

●​ More overhead = less scalability​

🧑‍💻 4. Your Specific Code & Application


●​ How you write and structure your parallel code.​

●​ Smart coding = better use of hardware = better scalability.​

💬 Summary:
Scalability depends on how fast the hardware is, how parallel your algorithm is,
how much time is wasted in communication, and how well you write your code.
🛠️ Manual Parallelization (Detailed)

●​ You decide what part of the code should run in parallel.​

●​ You use tools/libraries like:​

○​ Threads (e.g., POSIX, Java threads)​

○​ OpenMP, MPI, etc.​

●​ 🔧 Example: You manually create 4 threads to sort different parts of an array.​


Pros: Full control​
Cons: Hard to manage, easy to make mistakes
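The thread example above can be sketched with Python's standard library: you manually split the array, hand each chunk to its own thread, and merge afterwards. (This shows the structure of manual parallelization; in CPython the GIL means a CPU-bound sort like this gains no real speedup — the point is who decides what runs in parallel, namely you.)

```python
import threading
from heapq import merge

def parallel_sort(data, num_threads=4):
    """Manually partition the array, sort each chunk in its own thread, then merge."""
    chunk = (len(data) + num_threads - 1) // num_threads
    parts = [data[i:i + chunk] for i in range(0, len(data), chunk)]

    def work(part):
        part.sort()                       # each thread sorts only its own slice

    threads = [threading.Thread(target=work, args=(p,)) for p in parts]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # synchronize: wait for all sorters
    return list(merge(*parts))            # combine the sorted chunks

print(parallel_sort([5, 3, 8, 1, 9, 2, 7, 4]))  # [1, 2, 3, 4, 5, 7, 8, 9]
```

Notice how much you manage yourself — splitting, thread creation, joining, merging. That is the "full control, more work" trade-off.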

🤖 Automatic Parallelization
●​ A parallelizing compiler or pre-processor analyzes your code.​

●​ It detects loops or parts of the code that can safely run in parallel.​

●​ Tools:​

○​ Intel Compilers (ICC)​

○​ LLVM/Clang with auto-parallelization flags​

○​ Some Python libraries or MATLAB backends​


Pros: Fast and easy​
Cons: Not always smart enough to parallelize complex logic

📝 Summary:
Manual = More control, more work.​
Automatic = Easier, but not always efficient.

🧠 How a Parallelizing Compiler Works


A parallelizing compiler helps convert your normal (serial) code into parallel code — either
automatically or with your help. There are two main modes:

1. 🔄 Fully Automatic Parallelization


●​ What happens?​
The compiler automatically scans your code and finds parts (usually loops) that
can run in parallel.​

●​ Target areas:​

○​ Loops (for, do, etc.)​

○​ Independent computations (no data dependency)​

●​ Examples of tools:​

○​ Paralax compiler​

○​ Insieme compiler​

●​ Pros:​

○​ No need to modify your code​

○​ Great for beginners or simple programs​

●​ Cons:​
○​ May miss some opportunities​

○​ Only works well when the code is structured clearly​

2. 🧾 Programmer-Directed Parallelization
●​ What happens?​
The programmer gives hints or instructions to the compiler using directives
(special comments or flags).​

●​ How?​

○​ You add compiler directives like #pragma to tell the compiler how to
parallelize.​

○​ Examples:​

■ #pragma omp parallel for → tells the compiler to run a loop in parallel (OpenMP)

■ #pragma acc parallel loop → for OpenACC (used with GPUs)

●​ Examples of tools:​

○​ OpenMP​

○​ OpenACC​

●​ Pros:​

○​ More control over what gets parallelized​

○​ Works even with more complex code​

●​ Cons:​

○​ Requires programmer knowledge​

○​ May lead to bugs if used incorrectly​

📝 Summary Table

| Feature | Fully Automatic | Programmer-Directed |
|---|---|---|
| Who identifies parallelism? | Compiler | You (with hints) |
| Code changes needed? | No | Yes (add directives) |
| Flexibility | Low | High |
| Examples | Paralax, Insieme | OpenMP, OpenACC |

🔧 Why Shift to Manual Parallelization?


Since automatic tools have these drawbacks, for real-world, high-performance
applications, developers usually prefer the manual approach.

✅ Manual Parallelization:
●​ Gives full control over how the work is split and synchronized.​

●​ Lets you optimize communication and memory use.​

●​ Works better for complex or irregular code structures.​


●​ Can result in higher speedup and efficiency when done properly.​

📝 Final Transition Note:


From this point on, the lecture focuses on manual parallelization, where
you—the programmer—take control over designing and implementing parallel
tasks.
🔑 Step 1: Understanding the Problem (Before
Parallelizing)
Before jumping into writing parallel code, the first and most critical step is to fully
understand the problem and the existing serial program. Here's why and how:

✅ Why It’s Important:


1.​ Not all problems benefit from parallelization.​

○​ Some problems have too much sequential dependency.​

○​ Others might not be worth the overhead of parallel processing.​

2.​ You can’t optimize what you don’t understand.​

○​ Without knowing the program flow, data dependencies, and bottlenecks, any
attempt at parallelization could lead to:​

■​ Incorrect results ❌​
■​ Wasted time ⌛​
■​ Poor performance 🐌​

📌 What You Should Understand:


●​ What the program does (its logic, input, output).​

●​ Which parts take the most time (profiling helps here).​

●​ Where data dependencies exist.​

●​ Which sections can run independently (ideal candidates for parallelism).​

●​ Memory usage and potential for data sharing or race conditions.​

🤔 Is Your Problem Suitable for Parallelization?


Ask these questions:

●​ Is the problem compute-intensive?​

●​ Can parts of the computation be done independently?​

●​ Is the problem data-parallel? (same operation on different data chunks)​

●​ Is the amount of work large enough to justify parallel overhead?​

Example:

Let’s say you're working on image processing (e.g., applying a filter to every pixel).

✅ Each pixel can be processed independently → good for parallelization.
❌ But if you're working on a recursive depth-first search, parallelization is more challenging due to sequential dependencies.

🎯 Identify the Program's Hot-Spots


Once you understand the problem and the serial code, the next key step is to identify the
hot-spots—the parts of the code that do the most computational work or take the most
time.

🔍 What Are Hot-Spots?


Hot-spots are the sections of code where:

●​ The CPU spends most of its time.​

●​ Most computations or data operations occur.​

●​ Optimizing them would have the biggest impact on performance.​

🛠️ How to Find Them?


You can’t just guess—use profiling tools:
✅ Tools for Profiling:
●​ gprof (Linux)​

●​ Valgrind with callgrind​

●​ perf​

●​ Intel VTune​

●​ Visual Studio Profiler (Windows)​

●​ Python: cProfile or line_profiler​

●​ MATLAB: built-in profiler​

●​ Jupyter Notebooks: %timeit, %prun​
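As a small sketch of profiling in practice, Python's built-in `cProfile`/`pstats` can be driven from code (the `hot_loop`/`setup` functions are made-up stand-ins for a real program's hot and cold parts):

```python
import cProfile
import io
import pstats

def hot_loop():
    # Deliberately heavy function: the hot-spot we expect the profiler to surface.
    return sum(i * i for i in range(200_000))

def setup():
    # Cheap setup code: should barely register in the profile.
    return list(range(10))

profiler = cProfile.Profile()
profiler.enable()
setup()
hot_loop()
profiler.disable()

# Sort by cumulative time so the most expensive functions appear first.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats()
report = buf.getvalue()
print("hot_loop" in report)  # the profiler attributes most of the time to it
```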

📈 What to Look For:


●​ Functions with highest CPU usage​

●​ Loops with long execution time​

●​ I/O bottlenecks (if any)​

●​ Memory access patterns (e.g., cache misses)​

🎯 Focus Efforts Wisely


“Don’t waste time parallelizing code that’s only used 1% of the time.”

✅ Instead:
●​ Prioritize the parts of the code where optimization can produce the most speedup.​

●​ Apply Amdahl’s Law: If 90% of your program is parallelizable, you can get a 10x
potential speedup in theory.​
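Amdahl's Law makes the 10x cap concrete: with parallel fraction p and N processors, S = 1 / ((1 - p) + p / N). A small helper (function name is ours) shows why 90% parallelizable means at most 10x:

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Amdahl's Law: S = 1 / ((1 - p) + p / N)."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)

# 90% parallelizable: even with a huge processor count, speedup approaches 10x.
print(round(amdahl_speedup(0.9, 1_000_000), 2))  # ~10.0
print(round(amdahl_speedup(0.9, 4), 2))          # 3.08 on 4 processors
```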
🧠 Example:
If a program has:

●​ 3 nested loops for matrix multiplication that take 80% of the total runtime​

●​ And a few print statements and setup code taking the rest​

👉 Focus only on parallelizing the matrix multiplication section.

🔧 Identify Bottlenecks in the Program


After spotting the hot-spots, the next step is to identify bottlenecks—the parts of your
program that slow everything else down or block parallel progress.

🔍 What Are Bottlenecks?


A bottleneck is:

A section of code that limits the performance of the entire program because it
takes too long or blocks other tasks from running.

❗ Examples of Bottlenecks:

| Bottleneck Type | Description |
|---|---|
| 🧾 I/O Operations | Reading/writing to disk is very slow compared to computation. |
| 🕒 Serial Sections | Parts that can't be parallelized (e.g., setting up data structures). |
| 🔁 Poorly optimized loops | Nested loops or unoptimized algorithms that slow down computation. |
| 🌐 Communication delays | In parallel programs, time spent waiting for data from other processes. |
| 🔄 Synchronization wait | One thread waits for another at a barrier or lock. |

🛠️ How to Identify Bottlenecks?
Use:

●​ 🔬 Profilers (like gprof, perf, VTune) to find slow code.​


● 📊 Timing functions to measure each part (e.g., time.perf_counter() in Python).
●​ 📦 Memory + CPU usage tools to monitor resources.​
●​ 🧠 Logic review: Are you doing extra unnecessary work?​

💡 How to Reduce or Eliminate Bottlenecks?

| Problem | Solution |
|---|---|
| 🐢 Slow I/O | Buffer I/O, use parallel file reading, or reduce file operations |
| 🧱 Sequential code | Refactor or break into smaller parallel pieces if possible |
| 🔁 Slow algorithms | Use better data structures or faster algorithms |
| 🌐 Comm overhead | Try to overlap communication with computation |
| 🔄 Sync delays | Reduce the need for frequent synchronization/barriers |

🔄 Overlapping Communication with Computation


This is key in parallel computing:

Start computing with the available data, while other data is still being
received/transferred in the background.

Example: While one thread is sending data to another, let it process already received data
at the same time.
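A minimal producer/consumer sketch of this overlap, using a thread and a queue (the queue stands in for an incoming data transfer; names are illustrative):

```python
import queue
import threading

channel = queue.Queue()
results = []

def producer():
    # Simulates data still being received/transferred in the background.
    for block in range(5):
        channel.put(block)
    channel.put(None)                 # sentinel: transfer finished

def consumer():
    # Starts computing on each block as soon as it arrives, instead of
    # waiting for the whole transfer to complete first.
    while True:
        block = channel.get()
        if block is None:
            break
        results.append(block * block)  # "compute" on the received block

t_recv = threading.Thread(target=producer)
t_work = threading.Thread(target=consumer)
t_recv.start(); t_work.start()
t_recv.join(); t_work.join()

print(results)  # [0, 1, 4, 9, 16]
```

The compute thread never sits idle waiting for the full transfer — communication latency is hidden behind useful work.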

🧠 Summary:
●​ 🕵️ Identify slowdowns → are they unavoidable or fixable?​
●​ 🎯 Focus on fixing the most impactful ones.​
●​ 🔄 Try to hide latency using smart scheduling or overlap techniques.
🔍 Other Key Considerations in Parallel Programming
🧱 1. Identify Blockages to Parallelism
Even if a task seems parallelizable, some hidden issues can block it. One of the biggest
culprits is:

🔁 Data Dependence
This happens when:

One part of your code depends on the result of another part before it can
continue.

⚠️ Types of Data Dependencies

| Type | Meaning |
|---|---|
| 📥 Read-after-Write (RAW) | A task needs data produced by another task. |
| 📝 Write-after-Read (WAR) | A task tries to overwrite data before another task has finished reading it. |
| ✍️ Write-after-Write (WAW) | Two tasks try to write to the same variable at the same time. |

🧠 These prevent tasks from running in parallel because order matters.

🛠️ How to Handle It?


●​ Restructure the code to remove or reduce dependencies.​

●​ Use local copies of variables (avoid shared states).​

●​ Apply synchronization carefully (mutexes, barriers).​

●​ Explore task scheduling strategies to reorder execution safely.​


🔄 2. Investigate Alternative Algorithms
Sometimes the current algorithm is the bottleneck. Instead of trying to force it into a
parallel model:

💡 Try a different algorithm that's naturally parallel.

🧠 Why Is This Important?


Because:

●​ Some algorithms are easier to split across processors.​

●​ They may offer better scalability and less communication overhead.​

●​ A simple algorithm switch can give huge performance improvements.​

✅ Examples:

| Task | Traditional Algorithm | Better Parallel Alternative |
|---|---|---|
| Matrix multiplication | Naive nested loops | Blocked/Strassen's multiplication |
| Sorting | QuickSort (recursive) | MergeSort / Parallel QuickSort |
| Searching large datasets | Linear search | Divide-and-conquer, MapReduce style |
| Graph traversal | DFS/BFS | Parallel BFS with level synchronization |

🔚 Summary:
●​ 🔎 Look for data dependencies that block parallelism.​
●​ 🧠 Consider algorithm changes — don’t stick to serial logic!​
●​ 🎯 A well-chosen algorithm can unlock massive parallel performance.
🧩 Partitioning in Parallel Programming
Before you can run a program in parallel, you need to divide the work so multiple
processors can work on different parts simultaneously. This division is called
partitioning or decomposition.

Why Partition?

Because...

🚀 You can’t parallelize what you haven’t split up!

🛠️ Two Main Types of Partitioning:


1. Domain Decomposition (Data-Based)

Also called Data Parallelism

You divide the data into chunks, and each processor works on a subset of the data.

📌 Example:
●​ Suppose you're processing a big image (e.g., applying a filter).​

●​ You can divide the image into sections and assign each section to a different
processor.​

✅ Best for:
●​ Tasks where the same operation is performed on different data (like matrix
operations, simulations, etc.)​

2. Functional Decomposition (Task-Based)

Also called Task Parallelism

You divide the work based on functions or tasks, not data. Each processor performs a
different function or step in the process.
📌 Example:
●​ Suppose you're simulating weather:​

○​ One task handles temperature.​

○​ Another handles pressure.​

○​ Another handles wind speed.​

○​ All tasks work in parallel on the same data.​

✅ Best for:
●​ Workflows with distinct stages or responsibilities.​

●​ Pipelines and heterogeneous tasks.​

🔁 Summary:

| Type | Based On | Example | Use When... |
|---|---|---|---|
| Domain Decomposition | Data | Split matrix/image/data array | Same operation on lots of data |
| Functional Decomposition | Functions | One task computes, one writes | Different tasks on shared or same data |

🗂️ Domain Decomposition (Data-Based Partitioning)


💡 Concept:
You split the data into parts, and each parallel task (processor/thread) works on its own
part of the data.

✅ Key Points:
●​ All tasks perform the same operation.​
●​ Each task only works on a specific chunk of the overall dataset.​

●​ Ideal when your data can be evenly split and worked on independently.​

📊 Example: Matrix Addition


Imagine you have two 4×4 matrices A and B, and you want to compute C = A + B.

🧠 Serial:
One processor adds each element one by one.

⚡ Parallel (Domain Decomposition):


Split the matrix into 4 rows. Assign 1 row to each of 4 processors:

| Processor | Data Worked On (Rows) |
|---|---|
| P0 | Row 0 |
| P1 | Row 1 |
| P2 | Row 2 |
| P3 | Row 3 |

Each processor does:

```cpp
C[i][j] = A[i][j] + B[i][j];  // for its assigned rows
```
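The same row-per-processor scheme can be sketched in Python, with a thread pool standing in for the four processors (a 4×2 matrix here to keep it short; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[10, 20], [30, 40], [50, 60], [70, 80]]
C = [[0, 0] for _ in range(4)]

def add_row(i):
    # Domain decomposition: each task owns exactly one row of the data.
    for j in range(len(A[i])):
        C[i][j] = A[i][j] + B[i][j]

with ThreadPoolExecutor(max_workers=4) as pool:  # P0..P3, one row each
    list(pool.map(add_row, range(4)))

print(C)  # [[11, 22], [33, 44], [55, 66], [77, 88]]
```

Every worker runs the same operation (`add_row`), just on its own slice of the data — the hallmark of domain decomposition.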

🧱 Real-Life Analogy:
Imagine painting a wall — you split the wall into 4 sections and 4 painters work in parallel,
each on their own section.

💻 Use Cases:
●​ Image processing​
●​ Scientific simulations​

●​ Weather modeling​

●​ Matrix operations​

🧩 Partitioning Data in Parallel Computing


When we talk about partitioning data, we're basically figuring out how to split the dataset
so multiple processors can work on different parts at the same time.

📐 Common Ways to Partition Data:


1. 1D Partitioning (One-Dimensional)

You divide the data along one dimension only.

📊 Example: Vectors or Rows of a Matrix


Suppose you have a 4×4 matrix:

```
A = [[ 1,  2,  3,  4],
     [ 5,  6,  7,  8],
     [ 9, 10, 11, 12],
     [13, 14, 15, 16]]
```

Using 1D partitioning, you could assign:

| Processor | Rows Assigned |
|---|---|
| P0 | Row 0 |
| P1 | Row 1 |
| P2 | Row 2 |
| P3 | Row 3 |

Each processor works on a full row.

2. 2D Partitioning (Two-Dimensional)

You divide the data in both rows and columns — a grid-style split.

📊 Same 4×4 Matrix — but split into blocks:

| Processor | Block Assigned |
|---|---|
| P0 | Top-left (2×2) |
| P1 | Top-right (2×2) |
| P2 | Bottom-left (2×2) |
| P3 | Bottom-right (2×2) |

Each processor works on a sub-matrix.
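A small helper makes the grid-style split concrete for the 4×4 matrix above (function name is ours, for illustration):

```python
def block_2d(matrix, block_rows, block_cols):
    """Split a matrix into 2D blocks (grid-style partitioning), row-major order."""
    blocks = []
    for r in range(0, len(matrix), block_rows):
        for c in range(0, len(matrix[0]), block_cols):
            blocks.append([row[c:c + block_cols]
                           for row in matrix[r:r + block_rows]])
    return blocks

A = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]

blocks = block_2d(A, 2, 2)  # four 2x2 blocks for P0..P3
print(blocks[0])            # [[1, 2], [5, 6]]      -> top-left block for P0
print(blocks[3])            # [[11, 12], [15, 16]]  -> bottom-right block for P3
```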

📌 Summary:

| Type | How It Works | Good For |
|---|---|---|
| 1D | Split rows or columns | Simple, easier to manage |
| 2D | Split into blocks (grid) | Better load balance for big 2D data like matrices or images |

⚙️ Functional Decomposition in Parallel Computing


Instead of breaking up the data, functional decomposition breaks up the tasks (functions or
operations) that need to be performed.

💡 Simple Definition:
"Split the work based on what needs to be done, not on what data needs
to be handled."
🔍 How it works:
Each processor (or thread) is given a different function or part of the algorithm to
execute.

🛠️ Example:
Let's say you're modeling an ecosystem simulation:

●​ 🐦 Processor 1 handles bird population growth​


●​ 🌳 Processor 2 handles plant growth​
●​ 🐺 Processor 3 handles predator-prey dynamics​
Each one does a different task, but together they simulate the entire ecosystem.

✅ When to Use Functional Decomposition:


●​ When a problem involves distinct operations or stages.​

●​ When those operations can run independently or in parallel.​

🧠 Common Use Cases:

| Application | Tasks (Functions) |
|---|---|
| 🌱 Ecosystem Model | Animal growth, plant cycles, weather |
| 🎧 Signal Processing | Noise filtering, amplification, encoding |
| 🌍 Climate Modeling | Wind simulation, ocean currents, radiation |
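A toy version of the ecosystem example: each worker runs a different function on the same state (contrast with domain decomposition, where every worker ran the same function). The functions and update rules here are made up purely for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-task update rules for one simulation step.
def birds(state):
    return state["birds"] * 2        # bird population growth

def plants(state):
    return state["plants"] + 5       # plant growth

def predators(state):
    return state["predators"] - 1    # predator-prey dynamics

state = {"birds": 10, "plants": 100, "predators": 4}

# Functional decomposition: three different tasks run in parallel on shared data.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(f, state) for f in (birds, plants, predators)]
    new_birds, new_plants, new_predators = [f.result() for f in futures]

print(new_birds, new_plants, new_predators)  # 20 105 3
```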


🔹 Who Needs Communications?
You don’t need communication when:

●​ The tasks in a parallel program are independent of each other.​

●​ There's no data dependency between them.​

●​ Each task can complete its part without needing to know what the others are doing.​

Example – Image Processing:

●​ Imagine a black and white image where you want to invert each pixel.​

●​ Each pixel operation is independent (e.g., new_value = 255 - old_value).​

●​ You can assign chunks of pixels to different processors—no need to talk to each
other.​

Such problems are called 👉 Embarrassingly Parallel Problems – because they're so easy to parallelize!
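The pixel-inversion example is easy to sketch: split the image into chunks, invert each chunk independently, and never communicate between workers (a tiny flat list stands in for the image; names are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def invert_chunk(pixels):
    # Each pixel is independent: new_value = 255 - old_value,
    # so no chunk ever needs data from another chunk.
    return [255 - p for p in pixels]

image = [0, 64, 128, 192, 255, 10, 20, 30]  # toy grayscale "image"
chunks = [image[i:i + 2] for i in range(0, len(image), 2)]

with ThreadPoolExecutor(max_workers=4) as pool:
    inverted = [p for chunk in pool.map(invert_chunk, chunks) for p in chunk]

print(inverted)  # [255, 191, 127, 63, 0, 245, 235, 225]
```

No barriers, no locks, no messages — that total independence is what makes the problem embarrassingly parallel.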

🔹 Factors to Consider: Cost of Communication


Whenever parallel tasks communicate, there's a performance price to pay:

💸 1. Overhead
●​ Communication uses CPU cycles, memory, and network bandwidth.​

●​ These resources could otherwise be used for actual computation.​

⏱ 2. Waiting Time (Synchronization Delays)

●​ Tasks often need to wait for others before moving forward.​


●​ This happens during data exchange, creating idle time (which is wasteful).​

🌐 3. Network Bandwidth Saturation


●​ When too many tasks communicate at once, it can clog the network.​

●​ Like rush-hour traffic — everyone slows down.​

●​ Result? Even well-optimized programs lose performance due to congestion.​

🔁 Real-World Analogy
Think of a group of workers trying to build a wall:

●​ If they talk constantly about each brick placement, they'll build slowly.​

●​ If they only coordinate occasionally (less communication), they work faster.​

●​ But if they never communicate, mistakes happen.​

📝 Scenario-Based Question
Q:​
You're running a parallel program with 16 processors. Each processor frequently needs to
exchange boundary values with its neighbors. However, as the number of processors
increases, your program's speedup starts decreasing. Why is this happening?

A:

●​ The increase in inter-processor communication introduces more overhead.​

●​ Synchronization delays cause processors to wait for others to exchange data.​

●​ Network bandwidth may be saturated, reducing the overall efficiency.​

●​ The program is spending more time communicating than computing.
