
Programming Shared Address Space Platforms

UNIT - 2
Introduction

 Explicit parallel programming requires specification of parallel tasks along with their interactions.
 In shared-address-space architectures, communication is specified implicitly, since some or all of the memory is accessible to all processors.
 Process-based models assume that all data associated with a process is private by default, unless otherwise specified.
 Lightweight processes and threads assume that all memory is global.
 Directive-based programming models extend the threaded model by facilitating the creation and synchronization of threads.
Thread Basics

 A thread is a single stream of control in the flow of a program.


 Example: the product of two dense n × n matrices
for (row = 0; row < n; row++)
for (column = 0; column < n; column++)
c[row][column] = dot_product(get_row(a, row),
get_col(b, column));
 The previous code can be transformed to create one thread per dot product:
for (row = 0; row < n; row++)
for (column = 0; column < n; column++)
c[row][column] =
create_thread(dot_product(get_row(a, row), get_col(b, column)));
 Logical memory model of a thread:
 All memory in the logical machine model of a thread is globally accessible to every thread.
 The stack corresponding to a function call is generally treated as local to the thread for liveness reasons.
 This implies that the logical machine model also holds thread-local data.
 While the logical machine model gives the view of an equally accessible address space, physical realizations of the model deviate from this assumption.
 On a distributed shared-memory machine, the cost of accessing physically local memory may be an order of magnitude less than that of accessing remote memory.
Why Threads?

 Threaded programming models offer significant advantages over message-passing programming models:
i. Software Portability:
 Threaded applications can be developed on serial machines and run on parallel machines without any changes.
ii. Latency Hiding:
 A major overhead in programs is the access latency of memory, I/O, and communication; while one thread waits on such an access, others can make progress, masking the latency.
iii. Scheduling and Load Balancing:
 The runtime can map threads to processors dynamically, minimizing the overheads of remote interaction and idling.
iv. Ease of programming and widespread use.
The POSIX Thread API

 A number of vendors provide vendor-specific thread APIs.
 The IEEE specifies a standard, 1003.1c-1995, the POSIX threads API, also referred to as Pthreads.
 POSIX has emerged as the standard threads API, supported by most vendors.
 The Pthreads API is used here to introduce multithreading concepts.
 The concepts themselves are largely independent of the API and can be used for programming with other thread APIs as well.
Thread Basics: Creation and
Termination

 Example: a threaded program for computing π.


 The method used here generates random points in a unit-length square and counts the number of points that fall within the largest circle inscribed in the square; the ratio of hits to points approximates π/4.
 Threads can be created using Pthread API using the function
pthread_create.
 Program for computing the value of π.
1. Read the desired number of threads into num_threads.
2. Read the desired number of sample points into sample_points.
3. These points are divided equally among the threads.
4. The program uses an array hits to assign an integer id to each thread.
5. The same array is used, upon return, to keep track of the number of hits (points inside the circle) encountered by each thread.
6. The program creates num_threads threads each invoking compute_pi,
using the pthread_create function.
7. Once the compute_pi threads have generated their assigned number of random points and computed their hit counts, the results must be combined to determine π.
 The main program must wait for the threads to run to completion using pthread_join:
int pthread_join(pthread_t thread, void **ptr);
 Once all threads have joined, the value of π is computed by multiplying the combined hit ratio by 4.0.

Programming Notes
 The program uses the function rand_r (instead of superior random-number generators such as drand48).
 The reason is that rand_r is reentrant: a reentrant function can be safely called even when another instance has been suspended in the middle of its invocation, which is essential when multiple threads generate random numbers concurrently.
 Performance Notes:
 We execute this program on a four-processor SGI Origin 2000.
 We can also modify the program slightly to observe the effect of false sharing.
 The program can also be used to assess the secondary cache line size.
Synchronization Primitives in
Pthreads

 While communication is implicit in shared-address-space programming, much of the effort associated with writing correct threaded programs is spent on synchronizing concurrent threads with respect to their data accesses or scheduling.
Mutual Exclusion for Shared Variables:
 Threads work together to manipulate data and accomplish a given task.
 When multiple threads attempt to manipulate the same data item, the results can be incoherent if proper care is not taken to synchronize them.
 Consider the following code fragment being executed by multiple
threads.
 The variable my_cost is thread-local and best_cost is a global variable
shared by all threads.
/* each thread tries to update variable best_cost as follows */
if(my_cost < best_cost)
best_cost = my_cost;
 Assume there are two threads t1 and t2: the initial value of best_cost is 100, and the values of my_cost are 50 and 75 at t1 and t2, respectively.
 If both threads execute the condition inside the if statement
concurrently, then both threads enter the then part of the statement.
 Depending on which thread executes first, the value of best_cost at the
end could be either 50 or 75.
 There are two problems here:
1. non-deterministic nature of the result;
2. more importantly, the value 75 of best_cost is inconsistent in the sense
that no serialization of the two threads can possibly yield this result.

Reason:
 The test-and-update operation should not be broken into sub-operations; it must execute atomically.
 The critical section must be executed by only one thread at any time.
 Threaded APIs provide support for implementing critical sections and atomic operations using mutex locks.
 Critical segments in Pthreads are implemented using mutex locks.
 Mutex-locks have two states:
 locked and unlocked
 At any point in time, only one thread can hold a given mutex lock. Locking is an atomic operation.
 A thread entering a critical segment first tries to get a lock. It goes
ahead when the lock is granted.
 The Pthread API provides a number of functions for handling mutex
locks.
 The function pthread_mutex_lock can be used to attempt a lock on a
mutex-lock.
 Prototype:
int pthread_mutex_lock ( pthread_mutex_t *mutex_lock);
 The Pthread function pthread_mutex_unlock is used to unlock a
mutex_lock
int pthread_mutex_unlock ( pthread_mutex_t *mutex_lock);
 Function to initialize a mutex-lock to its unlocked state.
int pthread_mutex_init (
pthread_mutex_t *mutex_lock,
const pthread_mutexattr_t *lock_attr);
 Example: Computing the minimum entry in a list of integers.
 Write a simple threaded program to compute the minimum of a list of
integers.

#include <pthread.h>
void *find_min(void *list_ptr);
pthread_mutex_t minimum_value_lock;
int minimum_value, partial_list_size;

main() {
/* declare and initialize data structures and list */
minimum_value = MAX_INT;
pthread_init();
pthread_mutex_init(&minimum_value_lock, NULL);

/* initialize lists, list_ptr and partial_list_size */

/* create and join threads here */
}

void *find_min(void *list_ptr) {
int *partial_list_pointer, my_min, i;
my_min = MAX_INT;
partial_list_pointer = (int *) list_ptr;
for (i = 0; i < partial_list_size; i++)
if (partial_list_pointer[i] < my_min)
my_min = partial_list_pointer[i];
/* lock the mutex associated with minimum_value and update the variable as required */
pthread_mutex_lock(&minimum_value_lock);
if (my_min < minimum_value)
minimum_value = my_min;
/* and unlock the mutex */
pthread_mutex_unlock(&minimum_value_lock);
pthread_exit(0);
}
Producer-consumer work queues
 The producer-consumer scenario imposes the following constraints:
 The producer thread must not overwrite the shared buffer when the previous
task has not been picked up by a consumer thread.
 The consumer threads must not pick up tasks until there is something
present in the shared data structure.
 Individual consumer threads should pick up tasks one at a time.
 The threaded version of this program is as follows:

pthread_mutex_t task_queue_lock;
int task_available;
/* other shared data structures here */
main()
{ /* declarations and initializations */
task_available = 0;
pthread_init();
pthread_mutex_init(&task_queue_lock, NULL);
/* create and join producer and consumer threads */
}
void *producer(void *producer_thread_data)
{ int inserted;
struct task my_task;
while(!done())
{ inserted = 0;
create_task(&my_task);
while(inserted == 0)
{ pthread_mutex_lock(&task_queue_lock);
if(task_available == 0)
{ insert_into_queue(my_task);
task_available = 1;
inserted = 1;
}
pthread_mutex_unlock(&task_queue_lock);
}
}
}
void *consumer(void *consumer_thread_data)
{ int extracted;
struct task my_task;
/* local data structure declarations */
while (!done())
{ extracted = 0;
while (extracted == 0)
{ pthread_mutex_lock(&task_queue_lock);
if (task_available == 1)
{ extract_from_queue(&my_task);
task_available = 0;
extracted = 1;
}
pthread_mutex_unlock(&task_queue_lock);
}
process_task(my_task);
}
}
Overheads of Locking
 Locks represent serialization points since critical sections must be
executed by threads one after the other.
 Encapsulating large segments of the program within locks can,
therefore, lead to significant performance degradation.
 It is important to minimize the critical sections.
Alleviating Locking Overheads:
 It is often possible to reduce the idling overhead associated with locks
using an alternate function, pthread_mutex_trylock.
int pthread_mutex_trylock (pthread_mutex_t *mutex_lock);

 pthread_mutex_trylock is typically much faster than pthread_mutex_lock on typical systems.
 This is because it does not have to deal with the queues associated with locks on which multiple threads are waiting.
 It enables a thread to do something else if the lock is unavailable.
 Finding k matches in a list
void *find_entries(void *start_pointer)
{ /* this is the thread function */
struct database_record *next_record, *current_pointer;
int count;
current_pointer = start_pointer;
do
{ next_record = find_next_entry(current_pointer);
count = output_record(next_record);
}
while (count < requested_number_of_records);
}
int output_record(struct database_record *record_ptr)
{ int count;
pthread_mutex_lock(&output_count_lock);
output_count ++;
count = output_count;
pthread_mutex_unlock(&output_count_lock);
if (count <= requested_number_of_records)
print_record(record_ptr);
return (count);
}
 The locking overhead can be alleviated by using the function
pthread_mutex_trylock.
 Each thread now finds the next entries and tries to acquire the lock and
update count.
 If another thread already has the lock, the record is inserted into a local
list and the thread proceeds to find other matches.
/* rewritten output_record function */
int output_record(struct database_record *record_ptr)
{ int count;
int lock_status;
lock_status=pthread_mutex_trylock(&output_count_lock);
if (lock_status == EBUSY)
{ insert_into_local_list(record_ptr);
return(0);
}
else
{ count = output_count;
output_count += number_on_local_list + 1;
pthread_mutex_unlock(&output_count_lock);
print_records(record_ptr, local_list,
requested_number_of_records - count);
return(count + number_on_local_list + 1);
}
}
Performance Notes:
 The time for execution of this version is less than the time for the first
one on two counts:
 First, the time for executing a pthread_mutex_trylock is typically much
smaller than that for a pthread_mutex_lock.
 Second, since the multiple records may be inserted on each lock, the number
of locking operations is also reduced.
 The number of records actually searched may be slightly larger than
the number of records actually desired.
Condition Variables for Synchronization:
 Indiscriminate use of locks can result in idling overhead from blocked
threads.
 The function pthread_mutex_trylock introduces polling for the availability of locks, which consumes CPU cycles.
 Solution: suspend the execution of the producer until space becomes
available.
 The availability of space is signaled by the consumer thread that
consumes the task. The function to accomplish this is provided by a
condition variable.
 A condition variable is a data object used for synchronizing threads.
This variable allows a thread to block itself until specified data reaches
a predefined state.
 Always use condition variables together with a mutex lock.
 The shared variable task_available must become 1 before the consumer
threads can be signaled.
 The boolean condition task_available == 1 is referred to as a predicate.
 A condition variable is associated with a predicate. When the predicate becomes true, the condition variable is used to signal one or more threads waiting on the condition.
 If the predicate is not true, the thread waits on the condition variable
associated with the predicate using the function pthread_cond_wait.

int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);

 A call to this function blocks the execution of the calling thread until it receives a signal from another thread.
 In addition to blocking the thread, the pthread_cond_wait function
releases the lock on mutex.
 When the condition is signaled (using pthread_cond_signal), one of
these threads in the queue is unblocked.
int pthread_cond_signal(pthread_cond_t *cond);
 The producer then relinquishes its lock on mutex by explicitly calling pthread_mutex_unlock, allowing one of the blocked consumer threads to consume the task.
 Initializes a condition variable (pointed to by cond) whose attributes are defined in the attributes object attr:
int pthread_cond_init (pthread_cond_t *cond, const pthread_condattr_t *attr);

 If at some point in a program a condition variable is no longer required, it can be destroyed using:
int pthread_cond_destroy(pthread_cond_t *cond);
 Producer-Consumer Using Condition Variables
pthread_cond_t cond_queue_empty, cond_queue_full;
pthread_mutex_t task_queue_cond_lock;
int task_available;

/* other data structures here */

main() {
/* declarations and initializations */
task_available = 0;
pthread_init();
pthread_cond_init(&cond_queue_empty, NULL);
pthread_cond_init(&cond_queue_full, NULL);
pthread_mutex_init(&task_queue_cond_lock, NULL);
/* create and join producer and consumer threads */
}

void *producer(void *producer_thread_data) {
struct task my_task;
while (!done()) {
create_task(&my_task);
pthread_mutex_lock(&task_queue_cond_lock);
while (task_available == 1)
pthread_cond_wait(&cond_queue_empty, &task_queue_cond_lock);
insert_into_queue(my_task);
task_available = 1;
pthread_cond_signal(&cond_queue_full);
pthread_mutex_unlock(&task_queue_cond_lock);
}
}
void *consumer(void *consumer_thread_data) {
struct task my_task;
while (!done()) {
pthread_mutex_lock(&task_queue_cond_lock);
while (task_available == 0)
pthread_cond_wait(&cond_queue_full, &task_queue_cond_lock);
my_task = extract_from_queue();
task_available = 0;
pthread_cond_signal(&cond_queue_empty);
pthread_mutex_unlock(&task_queue_cond_lock);
process_task(my_task);
}
}
Controlling Thread and Synchronization Attributes
 Entities such as threads and synchronization variables can have several attributes associated with them.
 The Pthreads API allows a programmer to change the default attributes of entities using attributes objects.
 An attributes object is a data structure that describes entity (thread, mutex, condition variable) properties.
 There are several advantages of using attributes objects:
1. It separates the issues of program semantics and
implementation.
2. Using attribute objects improves modularity and readability
of the programs.
3. It allows the user to modify the program easily.
Attributes objects for threads
 The function pthread_attr_init is used to create an attributes object for threads.
int pthread_attr_init(pthread_attr_t *attr);

 The attributes object may be destroyed using the function pthread_attr_destroy.
int pthread_attr_destroy(pthread_attr_t *attr);
 Individual properties associated with the attributes object can be
changed using the following functions:
 pthread_attr_setdetachstate
 pthread_attr_setguardsize_np
 pthread_attr_setstacksize
 pthread_attr_setinheritsched
 pthread_attr_setschedpolicy
 pthread_attr_setschedparam
Attributes objects for Mutexes
 The Pthread API supports three different kinds of locks. All of these
locks use the same functions for locking and unlocking.
1. Normal Mutex: the default lock; only a single thread is allowed to hold a normal mutex at any point in time.
2. Recursive Mutex: it allows a single thread to lock a mutex multiple times.
Each time a thread locks the mutex, a lock counter is incremented. Unlock
decrements the counter.
3. Errorcheck Mutex: when a thread attempts a lock on a mutex it has
already locked, instead of deadlocking it returns an error. It is useful for
debugging purposes.
 Example of a thread searching for an element in a binary tree.
search_tree(void *tree_ptr)
{ struct node *node_pointer;
node_pointer = (struct node *) tree_ptr;
pthread_mutex_lock(&tree_lock);   /* tree_lock must be a recursive mutex */
if (is_search_node(node_pointer) == 1)
{ /* solution is found here */
print_node(node_pointer);
pthread_mutex_unlock(&tree_lock);
return(1);
}
else
{ if (node_pointer -> left != NULL)
search_tree((void *) node_pointer -> left);
if (node_pointer -> right != NULL)
search_tree((void *) node_pointer -> right);
}
printf("Search unsuccessful\n");
pthread_mutex_unlock(&tree_lock);
}
 To create and initialize a mutex attribute object to default values,
Pthreads provides the function pthread_mutexattr_init.
int pthread_mutexattr_init(pthread_mutexattr_t *attr);

 Pthreads provides the function pthread_mutexattr_settype_np for setting the type of mutex specified by the mutex attributes object.
int pthread_mutexattr_settype_np(pthread_mutexattr_t *attr, int type);
 type specifies the type of the mutex and can take one of the following values, corresponding to the three mutex types – normal, recursive, or errorcheck:
 PTHREAD_MUTEX_NORMAL_NP
 PTHREAD_MUTEX_RECURSIVE_NP
 PTHREAD_MUTEX_ERRORCHECK_NP
Thread Cancellation

 POSIX threads provide a cancellation feature through the function pthread_cancel.
int pthread_cancel(pthread_t thread);
 A thread may cancel itself or other threads.
 Threads can protect themselves against cancellation.


Composite Synchronization Constructs
 While the Pthreads API provides a basic set of synchronization constructs, there is often a need for higher-level constructs.
 These can be built from the basic synchronization constructs.
 We consider two such constructs:
 read-write locks and
 barriers.
Read-Write Locks:
 In many applications, a data structure is read frequently but written infrequently. For
such applications, we should use read-write locks.
 Multiple reads can proceed without any coherence problems. However, writes must be
serialized.
 A read lock is granted even when other threads already hold read locks.
 If there is a write lock on the data (or if there are queued write locks), the thread
performs a condition wait.
 If there are multiple threads requesting a write lock, they must perform a condition
wait.
 With this description, we can design functions for read locks mylib_rwlock_rlock, write
locks mylib_rwlock_wlock, and unlocking mylib_rwlock_unlock.
 The lock data type mylib_rwlock_t holds the following:
 A count of the number of readers,
 The writer (a 0/1 integer specifying whether a writer is present),
 A condition variable readers_proceed that is signaled when readers can
proceed,
 A condition variable writer_proceed that is signaled when one of the writers
can proceed,
 A count pending_writers of pending writers, and
 A mutex read_write_lock associated with the shared data structure
 The code for initializing and locking/unlocking :
typedef struct
{ int readers;
int writer;
pthread_cond_t readers_proceed;
pthread_cond_t writer_proceed;
int pending_writers;
pthread_mutex_t read_write_lock;
} mylib_rwlock_t;
void mylib_rwlock_init (mylib_rwlock_t *l)
{ l -> readers = l -> writer = l -> pending_writers = 0;
pthread_mutex_init(&(l -> read_write_lock), NULL);
pthread_cond_init(&(l -> readers_proceed), NULL);
pthread_cond_init(&(l -> writer_proceed), NULL);
}
void mylib_rwlock_rlock(mylib_rwlock_t *l)
{ /* if there is a write lock or pending writers, perform condition wait..
Else increment count of readers and grant read lock */
pthread_mutex_lock(&(l -> read_write_lock));
while ((l -> pending_writers > 0) || (l -> writer > 0))
pthread_cond_wait(&(l -> readers_proceed), &(l -> read_write_lock));
l -> readers ++;
pthread_mutex_unlock(&(l -> read_write_lock));
}
void mylib_rwlock_wlock(mylib_rwlock_t *l)
{ /* if there are readers or writers, increment the pending writers count
and wait. On being woken, decrement the pending writers count and
increment the writer count */
pthread_mutex_lock(&(l -> read_write_lock));
l -> pending_writers ++;
while ((l -> writer > 0) || (l -> readers > 0))
pthread_cond_wait(&(l -> writer_proceed), &(l -> read_write_lock));
l -> pending_writers --;
l -> writer ++;
pthread_mutex_unlock(&(l -> read_write_lock));
}
void mylib_rwlock_unlock(mylib_rwlock_t *l)
{ /* if there is a write lock then unlock, else if there are read locks,
decrement count of read locks. If the count is 0 and there is a pending writer, let
it through, else if there are pending readers, let them all go through */
pthread_mutex_lock(&(l -> read_write_lock));
if (l -> writer > 0)
l -> writer = 0;
else if (l -> readers > 0)
l -> readers --;
pthread_mutex_unlock(&(l -> read_write_lock));
if ((l -> readers == 0) && (l -> pending_writers > 0))
pthread_cond_signal(&(l -> writer_proceed));
else if (l -> readers > 0)
pthread_cond_broadcast(&(l -> readers_proceed));
}
Using read-write locks for implementing hash tables:
 A commonly used operation in applications ranging from database
query to state space search is the search of a key in a database.
 The database is organized as a hash table.
 We consider two versions of this program:
 one using mutex locks and
 one using read-write locks
The mutex lock version of the program hashes the key into the table, locks the mutex
associated with the table index, and proceeds to search/update within the linked list.
The thread function for doing this is as follows:
manipulate_hash_table(int entry)
{ int table_index, found;
struct list_entry *node, *new_node;
table_index = hash(entry);
pthread_mutex_lock(&hash_table[table_index].list_lock);
found = 0;
node = hash_table[table_index].next;
while ((node != NULL) && (!found))
{ if (node -> value == entry)
found = 1;
else
node = node -> next;
}
pthread_mutex_unlock(&hash_table[table_index].list_lock);
if (found)
return(1);
else
insert_into_hash_table(entry);
}
 Here, the function insert_into_hash_table must lock
hash_table[table_index].list_lock before performing the actual insertion.
 When a large fraction of the queries are found in the hash table, these
searches are serialized.
 It is easy to see that multiple threads can be safely allowed to search
the hash table and only updates to the table must be serialized. This
can be accomplished using read-write locks.
 We can rewrite the manipulate_hash table function as follows:
manipulate_hash_table(int entry)
{ int table_index, found;
struct list_entry *node, *new_node;
table_index = hash(entry);
mylib_rwlock_rlock(&hash_table[table_index].list_lock);
found = 0;
node = hash_table[table_index].next;
while ((node != NULL) && (!found))
{ if (node -> value == entry)
found = 1;
else
node = node -> next;
}
mylib_rwlock_unlock(&hash_table[table_index].list_lock);
if (found)
return(1);
else
insert_into_hash_table(entry);
}
Barriers:
 As in MPI, a barrier holds a thread until all threads participating in the
barrier have reached it.
 Barriers can be implemented using a counter, a mutex and a condition
variable.
 A single integer is used to keep track of the number of threads that
have reached the barrier.
 If the count is less than the total number of threads, the threads
execute a condition wait.
 The last thread entering (and setting the count to the number of
threads) wakes up all the threads using a condition broadcast.
 Code:
typedef struct {
pthread_mutex_t count_lock;
pthread_cond_t ok_to_proceed;
int count;
} mylib_barrier_t;
void mylib_init_barrier(mylib_barrier_t *b) {
b -> count = 0;
pthread_mutex_init(&(b -> count_lock), NULL);
pthread_cond_init(&(b -> ok_to_proceed), NULL);
}
void mylib_barrier (mylib_barrier_t *b, int num_threads)
{ pthread_mutex_lock(&(b -> count_lock));
b -> count ++;
if (b -> count == num_threads)
{ b -> count = 0;
pthread_cond_broadcast(&(b -> ok_to_proceed));
}
else
while (pthread_cond_wait(&(b -> ok_to_proceed),&(b -> count_lock)) != 0);
pthread_mutex_unlock(&(b -> count_lock));
}
Tips for designing Asynchronous
programs
 It is important to remember that one cannot assume any order of
execution with respect to other threads.
 In many thread libraries, threads are switched at semi-deterministic
intervals. Such libraries are more forgiving of synchronization errors in
programs. These libraries are called slightly asynchronous libraries.
 The programmer must not make any assumptions regarding the level of
asynchrony in the threads library.
 Some common errors that arise from incorrect assumptions on relative
execution times of threads:
Assumption 1:
 Thread T1 creates another thread T2.
 T2 requires some data from thread T1; this data is transferred using a global memory location.
 However, thread T1 places the data in that location after creating thread T2.
 The implicit assumption here is that T1 will not be switched until it
blocks or that T2 will get to the point at which it uses the data only
after T1 has stored it there.
Assumption 2
 Thread T1 creates T2 and that it needs to pass data to thread T2 which
resides on its stack.
 It passes this data by passing a pointer to the stack location to thread
T2.
 Suppose T1 runs to completion before T2 gets scheduled.
 In this case, the stack frame is released, and some other thread may overwrite the space formerly occupied by the stack frame.
 T2 may then read invalid data from that location.
Assumption 3:
 We strongly discourage relying on scheduling techniques (such as priorities or timing) as a means of synchronization.
The following rules of thumb help minimize errors in threaded programs.
 Set up all the requirements for a thread before actually creating the
thread.
 When there is a producer-consumer relation between two threads for
certain data items, make sure the producer thread places the data before
it is consumed and that intermediate buffers are guaranteed to not
overflow.
 At the consumer end, make sure that the data lasts at least until all
potential consumers have consumed the data. This is particularly
relevant for stack variables.
 Where possible, define and use group synchronizations and data
replication. This can improve program performance significantly.
OpenMP: a standard for directive-based parallel programming

 Threaded APIs have come a long way, but their use is still predominantly restricted to system programmers as opposed to application programmers.
 Pthreads are considered low-level primitives.
 OpenMP is an API that can be used with FORTRAN, C, and C++ for
programming shared address space machines.
 OpenMP directives provide support for concurrency, synchronization
and data handling.
The OpenMP Programming Model
 OpenMP directives in C and C++ are based on the #pragma compiler
directives.
#pragma omp directive [clause list]
 OpenMP programs execute serially until they encounter the parallel directive, which creates a group of threads.
 The main thread that encounters the parallel directive becomes the master of this group of threads and is assigned thread id 0 within the group.
#pragma omp parallel [clause list]
/* structured block */
 Each thread created by this directive executes the structured block
specified by the parallel directive.
 The clause list is used to specify conditional parallelization, number of
threads and data handling.
 Conditional Parallelization: the clause if(scalar expression) determines
whether the parallel construct results in creation of threads. Only one if
clause can be used with the parallel directive.
 Degree of Concurrency: The clause num_threads (integer expression)
specifies the number of threads that are created by the parallel directive.
 Data handling:
The clause private (variable list) indicates that the set of variables
specified is local to each thread. Each thread has its own copy of each variable
in the list.
The clause firstprivate(variable list) is similar to the private clause, except that each thread's copy is initialized to the variable's value just before the parallel directive.
The shared(variable list) indicates that all variables in the list are
shared across all the threads.
 A sample OpenMP program along with its Pthreads translation that
might be performed by an OpenMP compiler
Example:
#pragma omp parallel if(is_parallel == 1)
num_threads(8) private(a) shared (b) firstprivate(c)
{ /* structured block */
}
The default state of a variable is specified by the clause default(shared) or default(none).
default(shared) implies that, by default, a variable is shared by all the threads.
default(none) means that the state of each variable used by a thread must be explicitly specified.
 The reduction clause specifies how multiple local copies of a variable at different threads are combined into a single copy at the master when the threads exit.
 The usage of the reduction clause is reduction(operator: variable list).
 This clause performs a reduction on the scalar variables specified in the list using the operator.
 The variables in the list are implicitly specified as being private to threads.
Example: using the reduction clause
#pragma omp parallel reduction(+: sum) num_threads(8)
{ /* compute local sum here */
}
/* sum here contains sum of all local instances of sums */
 First OpenMP program:
omp_get_num_threads() returns the number of threads in the parallel region.
omp_get_thread_num() returns the integer id of the calling thread.
 Example: Computing π using OpenMP directives
#pragma omp parallel default(private) shared(npoints) reduction(+: sum) num_threads(8)
{ num_threads = omp_get_num_threads();
sample_points_per_thread = npoints / num_threads;
sum = 0;
for (i = 0; i < sample_points_per_thread; i++)
{ rand_no_x = (double) (rand_r(&seed)) / (double) ((2<<14) - 1);
rand_no_y = (double) (rand_r(&seed)) / (double) ((2<<14) - 1);
if (((rand_no_x - 0.5) * (rand_no_x - 0.5) + (rand_no_y - 0.5) * (rand_no_y - 0.5)) < 0.25)
sum++;
}
}
Special Concurrent Tasks in OpenMP
 The parallel directive can be used in conjunction with other directives to
specify concurrency across iterations and tasks.
 OpenMP provides two directives – for and sections – to specify concurrent iterations and tasks.
The for Directive
 The for directive is used to split parallel iteration spaces across threads.
 The general form of a for directive is as follows:
#pragma omp for [clause list]
/* for loop */

The clauses that can be used in this context are: private, firstprivate, lastprivate, reduction, schedule, nowait, and ordered.
 Using the for directive for computing π.
#pragma omp parallel default (private) shared(npoints) reduction(+:sum) num_threads(8)
{ sum =0;
#pragma omp for
for(i = 0; i<npoints; i++)
{ rand_no_x = (double)(rand_r(&seed)) / (double)((2<<14) - 1);
rand_no_y = (double)(rand_r(&seed)) / (double)((2<<14) - 1);
if(((rand_no_x - 0.5) * (rand_no_x - 0.5) + (rand_no_y - 0.5) * (rand_no_y - 0.5)) < 0.25)
sum++;
}
}
Assigning iterations to threads
 The schedule clause of the for directive deals with the assignment of iterations to
threads.
 The general form of the schedule clause is schedule(scheduling_class[, parameter]).
 OpenMP supports four scheduling classes: static, dynamic, guided and runtime.
 Scheduling classes in OpenMP – matrix multiplication.
for(i=0; i < dim; i++)
{ for(j=0; j < dim; j++)
{ c(i, j) = 0;
for(k=0; k<dim; k++)
{ c(i, j) += a(i, k) * b(k, j);
}
}
}
 Static:
The general form of the static scheduling class is schedule(static[, chunk-
size]).
 This technique splits the iteration space into equal chunks of chunk-size and
assigns them to threads in a round-robin fashion.
 When no chunk-size is specified, the iteration space is split into as many
chunks as there are threads and one chunk is assigned to each thread.
 Static scheduling of loops in matrix multiplication
#pragma omp parallel default(private) shared (a, b, c, dim) num_threads(4)
#pragma omp for schedule(static)
for(i=0; i < dim; i++)
{ for(j=0; j < dim; j++)
{ c(i, j) = 0;
for(k=0; k<dim; k++)
{ c(i, j) += a(i, k) * b(k, j);
}
}
}
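The static-scheduling example above can be made self-contained as follows (DIM and the matmul name are illustrative; compiled without OpenMP, the pragma is ignored and the loops run serially with identical results):

```c
#include <assert.h>
#define DIM 3

/* Multiply two DIM x DIM matrices. With schedule(static), the rows
 * of the result are split into equal chunks and assigned to threads
 * round-robin; each thread writes disjoint rows of c, so no
 * synchronization is needed. */
void matmul(double a[DIM][DIM], double b[DIM][DIM], double c[DIM][DIM])
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            c[i][j] = 0.0;
            for (int k = 0; k < DIM; k++)
                c[i][j] += a[i][k] * b[k][j];
        }
}
```

Multiplying the identity matrix by any matrix b returns b, which makes the routine easy to check.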
 Dynamic
Often, for a number of reasons ranging from heterogeneous computing resources
to non-uniform processor loads, equally partitioned workloads take widely varying
execution times.
 For this reason, OpenMP has a dynamic scheduling class.
 The general form of this class is schedule(dynamic[, chunk-size]).
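Switching the policy is a one-line change. A sketch with a deliberately unbalanced loop (iteration i does O(i) work, so equal static chunks would be lopsided; the function name triangle_work and chunk size 4 are illustrative):

```c
#include <assert.h>

/* Iteration i performs i+1 additions, so later iterations cost more.
 * schedule(dynamic, 4) hands out chunks of 4 iterations on demand,
 * and the reduction keeps the result independent of which thread
 * ran which chunk. */
long triangle_work(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+: total)
    for (int i = 0; i < n; i++)
        for (int j = 0; j <= i; j++)
            total += 1;
    return total;   /* equals n*(n+1)/2 */
}
```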
 Guided:
With equal-sized chunks, the last chunks may finish well after the rest, leaving
threads idling toward the end of the loop.
 The solution to this problem (the edge effect) is to reduce the chunk size as we proceed
through the computation. This is the principle of guided scheduling.
 The general form of this class is schedule(guided[, chunk-size]).
 In this class, the chunk size is reduced exponentially as each chunk is dispatched to a thread.
 Runtime:
If one would like to compare the impact of various scheduling strategies in order to select
the best one, the scheduling class can be set to runtime.
 In this case the environment variable OMP_SCHEDULE determines the scheduling class
and the chunk size.
 Synchronization across multiple for directives:
 OpenMP provides a clause – nowait, which can be used with a for directive to indicate
that the threads can proceed to the next statement without waiting for all the other
threads to complete the for loop execution.
 Using the nowait clause
#pragma omp parallel
{ #pragma omp for nowait
for(i = 0; i < nmax; i++)
if(isEqual(name, current_list[i]))
processCurrentName(name);
#pragma omp for
for(i = 0; i < nmax; i++)
if(isEqual(name, past_list[i]))
processPastName(name);
}
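The nowait idea can be captured in a compilable sketch with two independent loops (the function and array names are illustrative; the loops write disjoint arrays, so skipping the first loop's barrier is safe):

```c
#include <assert.h>

/* Two independent loops inside one parallel region. The nowait
 * clause lets a thread that finishes its share of the first loop
 * start on the second loop without waiting at the implicit barrier. */
void fill_squares_and_cubes(int n, long sq[], long cu[])
{
    #pragma omp parallel
    {
        #pragma omp for nowait
        for (int i = 0; i < n; i++)
            sq[i] = (long)i * i;
        #pragma omp for
        for (int i = 0; i < n; i++)
            cu[i] = (long)i * i * i;
    }
}
```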
 The sections directive
 The for directive is suited to partitioning iteration spaces across threads.
 Consider three tasks (taskA, taskB and taskC) that need to be executed. Assume
that these tasks are independent of each other and can therefore be assigned to different
threads.
 OpenMP supports such non-iterative parallel task assignment using the sections directive.
Example:
#pragma omp sections [clause list]
{ [#pragma omp section
/* structured block */
]
[#pragma omp section
/* structured block */
]
...
}
 For executing the three concurrent tasks taskA, taskB and taskC, the corresponding sections directive is
as follows:
#pragma omp parallel
{ #pragma omp sections
{ #pragma omp section
{ taskA;
}
#pragma omp section
{ taskB;
}
#pragma omp section
{ taskC;
}
}
}
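A runnable sketch of three independent tasks under sections (the tasks are stand-ins that each write a distinct slot of an output array, so no synchronization is needed; serially, the pragmas are ignored and the three statements simply run in order):

```c
#include <assert.h>

/* Three independent "tasks" assigned to (up to) three threads via
 * the merged parallel sections directive; each section writes its
 * own slot of out. */
void run_tasks(int out[3])
{
    #pragma omp parallel sections
    {
        #pragma omp section
        out[0] = 10;    /* taskA */
        #pragma omp section
        out[1] = 20;    /* taskB */
        #pragma omp section
        out[2] = 30;    /* taskC */
    }
}
```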
 Merging Directives
OpenMP allows the programmer to merge the parallel directive with the for and
sections directives into parallel for and parallel sections, respectively.
 The clause list for the merged directive can be from the clause lists of either the
parallel or the for/sections directives.
Example:
#pragma omp parallel default(private) shared(n)
{ #pragma omp for
for(i = 0; i < n; i++)
{ /* body of parallel for loop */
}
}
 Is identical to:
#pragma omp parallel for default(private) shared(n)
for(i = 0; i < n; i++)
{ /* body of parallel for loop */
}
#pragma omp parallel  Is identical to:
{ #pragma omp sections #pragma omp parallel sections
{ #pragma omp sections { #pragma omp sections
{ taskA(); { taskA();
} }
#pragma omp sections #pragma omp sections
{ taskB(); { taskB();
} }
/* other section here */ /* other section here */
} }
}
 Nesting parallel directives
#pragma omp parallel for default(private) shared(a, b, c, dim) num_threads(2)
for(i=0; i < dim; i++)
{ #pragma omp parallel for default(private) shared(a, b, c, dim) num_threads(2)
for(j=0; j < dim; j++)
{ c(i, j) = 0;
#pragma omp parallel for default(private) shared(a, b, c, dim) num_threads(2)
for(k=0; k<dim; k++)
{ c(i, j) += a(i, k) * b(k, j);
}
}
}
Synchronization constructs in OpenMP
 The OpenMP standard provides high level functionality in an easy-to-
use API.
 Directives:
 Synchronization Point: the barrier directive
 Single thread Executions: the single and master directives
 Critical section: the critical and atomic directives
 In-order execution: the ordered directive
 Memory consistency: the flush directive
 Synchronization Point: the barrier directive
 Barrier is one of the most frequently used synchronization primitives.
 Syntax:
#pragma omp barrier
 On encountering this directive, all threads in a team wait until the others have caught up,
and then release.
 Note that if a barrier directive is placed inside a conditionally executed compound
statement, the condition must evaluate identically at every thread; otherwise some
threads will wait at the barrier forever.
 Single thread Executions: the single and master directives
 The single directive specifies a structured block that is executed by a single (arbitrary)
thread.
 Syntax:
#pragma omp single [clause list]
structured block
 The master directive is a specialization of the single directive in which only the
master thread executes the structured block
#pragma omp master
structured block
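A small sketch of the single directive (run_region and init_calls are illustrative names; serially, the pragmas are ignored and the increment still happens exactly once):

```c
#include <assert.h>

int init_calls;   /* counts how often the single block ran */

/* All threads enter the parallel region, but only one (arbitrary)
 * thread executes the single block, so the initialization runs
 * exactly once per region. */
int run_region(void)
{
    init_calls = 0;
    #pragma omp parallel
    {
        #pragma omp single
        init_calls++;
    }
    return init_calls;
}
```

The implicit barrier at the end of the single block ensures every thread sees the completed initialization before continuing.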
 Critical section: the critical and atomic directives
 In addition to explicit lock management, OpenMP provides a critical directive for
implementing critical regions.
 Syntax of critical directive is
#pragma omp critical [(name)]
structured block
 Example: using the critical directive for producer – consumer threads.
#pragma omp parallel sections
{ #pragma omp section
{ /* producer thread */
task = produce_task();
#pragma omp critical (task_queue)
{ insert_into_queue(task);
}
}
#pragma omp section
{ /* consumer thread */
#pragma omp critical (task_queue)
{ task = extract_from_queue(task);
}
consumer_task(task);
}
}
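A simpler compilable demonstration of the same mechanism: many threads bump a shared counter, with the named critical section serializing the read-modify-write (count_up is an illustrative name; without the critical directive the increments would race):

```c
#include <assert.h>

/* Increment a shared counter n times across all threads; the named
 * critical section ensures each counter++ executes atomically with
 * respect to the other threads. */
long count_up(int n)
{
    long counter = 0;
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        #pragma omp critical (counter_update)
        counter++;
    }
    return counter;
}
```

For a single increment like this, the atomic directive would be the lighter-weight choice; critical generalizes to arbitrary structured blocks such as the queue operations above.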
 In-order execution: the ordered directive:
 In many cases, it is necessary to execute a segment of a parallel loop in the order in
which the serial version would execute it.
 The Syntax of the ordered directive is as follows:
#pragma omp ordered
structured block
 Example: computing the cumulative sum of a list using the ordered
directive.
cumul_sum[0] = list[0];
#pragma omp parallel for private(i) shared(cumul_sum, list, n) ordered
for(i=1; i<n; i++)
{ /* other processing on list[i] if needed */
#pragma omp ordered
{ cumul_sum[i] = cumul_sum[i-1] + list[i];
}
}
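The cumulative-sum slide can be turned into a self-contained routine (cumulative_sum is an illustrative name; note the ordered clause on the for directive, which the ordered region requires):

```c
#include <assert.h>

/* Prefix sum: any pre-processing of list[i] could run in parallel,
 * but the ordered block executes in serial loop order, so
 * cumul[i-1] is always complete before iteration i reads it. */
void cumulative_sum(const long list[], long cumul[], int n)
{
    cumul[0] = list[0];
    #pragma omp parallel for ordered
    for (int i = 1; i < n; i++) {
        #pragma omp ordered
        cumul[i] = cumul[i - 1] + list[i];
    }
}
```

Since the only work here is inside the ordered block, this version gains no speedup; it pays off when each iteration also does independent work outside the block.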
 Memory consistency: the flush directive
 The flush directive provides a memory fence by forcing a variable to be
written to or read from the memory system.
 The syntax of the flush directive is as follows:
#pragma omp flush[(list)]
 Data Handling in OpenMP
 If a thread initializes and uses a variable and no other thread accesses the data, then a
local copy of the variable should be made for the thread. Such data should be specified as
private.
 If a thread repeatedly reads a variable that has been initialized earlier in the program, it is
beneficial to make a copy of the variable and inherit the value at the time of thread
creation. Such data should be specified as firstprivate.
 If multiple threads manipulate a single piece of data, one must explore ways of breaking
these manipulations into local operations followed by a single global operation.
 If multiple threads manipulate different parts of a large data structure, the programmer
should explore ways of breaking it into smaller data structures and making them private to
the thread manipulating them.
 After all the above techniques have been explored and exhausted, remaining data items
may be shared among various threads using the clause shared.
 The threadprivate and copyin directives:
 It is useful to make a set of objects locally available to a thread in such a way
that these objects persist across parallel and serial blocks, provided the
number of threads remains the same.
 This class of variables is supported in OpenMP using the threadprivate
directive.
 The syntax of the directive is as follows:
#pragma omp threadprivate(variable_list)
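A sketch of threadprivate in action (my_count and count_iterations are illustrative names; serially, the pragmas are ignored and the single copy of my_count counts all iterations):

```c
#include <assert.h>

/* A file-scope counter made per-thread by threadprivate; each
 * thread's copy persists across parallel regions as long as the
 * thread count is unchanged. */
static long my_count = 0;
#pragma omp threadprivate(my_count)

/* Each thread counts its own share of iterations in its private
 * my_count (no race), then the per-thread counts are combined with
 * a reduction. */
long count_iterations(int n)
{
    long total = 0;
    #pragma omp parallel reduction(+: total)
    {
        my_count = 0;
        #pragma omp for
        for (int i = 0; i < n; i++)
            my_count++;
        total = my_count;
    }
    return total;
}
```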
OpenMP Library functions:
 Controlling the number of threads and processors:
omp_set_num_threads(int num_threads);  this function sets the default
number of threads that will be created on encountering the next parallel directive
provided the num_threads clause is not used in the parallel directive.
omp_get_num_threads()  returns the number of threads participating in a
team.
omp_get_max_threads() returns the maximum number of threads.
omp_get_thread_num()  returns a unique thread id.
omp_get_num_procs()  returns the number of processors available.
omp_in_parallel() returns a non-zero value if called from within the scope of a
parallel region and zero otherwise.
 Controlling and monitoring thread creation:
omp_set_dynamic(int dynamic_threads)  this function allows the programmer to
dynamically alter the number of threads created on encountering a parallel region.
if the value dynamic_threads evaluates to zero, dynamic adjustment is disabled,
otherwise it is enabled.
the function must be called outside the scope of a parallel region.
omp_get_dynamic()  by using this function we can query whether dynamic adjustment
is enabled or disabled.
omp_set_nested()  this function enables nested parallelism if the value of its
argument, nested, is non-zero, and disabled it otherwise.
omp_get_nested()  returns non zero value if nested parallelism is enabled and zero
otherwise.
 Mutual exclusion
 There are situations where it is convenient to use an explicit lock.
 For this, OpenMP provides functions for initializing, locking, unlocking and discarding
locks.
 The lock data structure in OpenMP is of type omp_lock_t.
 Functions:
omp_init_lock  initializing lock
omp_destroy_lock  destroying lock
omp_set_lock  once the lock has been initialized, it can be locked and unlocked using this
function.
omp_test_lock  used to attempt to set a lock. If it returns a non zero value, lock has been
successfully locked otherwise the lock is currently owned by another thread.
 OpenMP also supports nestable locks that can be locked multiple times
by the same thread.
 void omp_init_nest_lock(omp_nest_lock_t *lock);
 void omp_destroy_nest_lock(omp_nest_lock_t *lock);
 void omp_set_nest_lock(omp_nest_lock_t *lock);
 void omp_unset_nest_lock(omp_nest_lock_t *lock);
 void omp_test_nest_lock(omp_nest_lock_t *lock);
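The explicit-lock API can be sketched as follows (locked_sum is an illustrative name; the #ifdef _OPENMP guards let the same file build and run serially, where no locking is needed, so this is a portability sketch rather than the canonical usage):

```c
#include <assert.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Sum 1..n with an explicit lock protecting the shared accumulator.
 * In real code a reduction would be preferable; the point here is
 * the init/set/unset/destroy lifecycle of omp_lock_t. */
long locked_sum(int n)
{
#ifdef _OPENMP
    omp_lock_t lock;
    omp_init_lock(&lock);           /* initialize before first use */
#endif
    long sum = 0;
    #pragma omp parallel for
    for (int i = 1; i <= n; i++) {
#ifdef _OPENMP
        omp_set_lock(&lock);        /* enter critical region */
#endif
        sum += i;
#ifdef _OPENMP
        omp_unset_lock(&lock);      /* leave critical region */
#endif
    }
#ifdef _OPENMP
    omp_destroy_lock(&lock);        /* discard the lock */
#endif
    return sum;
}
```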
 Environment variables in OpenMP
 OMP_NUM_THREADS  specifies default number of threads.
 OMP_DYNAMIC  this variable, when set toTRUE, allows the number of
threads to be controlled at runtime.
 OMP_NESTED  enables nested parallelism
 OMP_schedule  this environment variable controls the assignment of
iteration spaces associated with for directives that use the runtime
scheduling class.
END – Unit 2