Shared Address Space Programming Concepts
Programming Notes
The function rand_r is used (instead of superior random number
generators such as drand48) because it is reentrant: a reentrant function
can be safely called even when another instance has been suspended in
the middle of its invocation.
Performance Notes:
We execute this program on a four-processor SGI Origin 2000.
We can also modify the program slightly to observe the effect of
false sharing.
The program can also be used to assess the secondary cache line size.
Synchronization Primitives in Pthreads
Reason:
A test-and-update operation must not be broken into sub-operations; it
must execute atomically.
A critical section must be executed by only one thread at any time.
Threaded APIs provide support for implementing critical sections and
atomic operations using mutex-locks.
Critical segments in Pthreads are implemented using mutex locks.
Mutex-locks have two states:
locked and unlocked
At any point of time, only one thread can lock a mutex lock. A lock is an
atomic operation.
A thread entering a critical segment first tries to get a lock. It goes
ahead when the lock is granted.
The Pthread API provides a number of functions for handling mutex
locks.
The function pthread_mutex_lock can be used to attempt a lock on a
mutex-lock.
Prototype:
int pthread_mutex_lock ( pthread_mutex_t *mutex_lock);
The Pthread function pthread_mutex_unlock is used to unlock a
mutex_lock
int pthread_mutex_unlock ( pthread_mutex_t *mutex_lock);
Function to initialize a mutex-lock to its unlocked state.
int pthread_mutex_init (
pthread_mutex_t *mutex_lock,
const pthread_mutexattr_t *lock_attr);
Example: Computing the minimum entry in a list of integers.
Write a simple threaded program to compute the minimum of a list of
integers.
#include <pthread.h>
#include <limits.h>
void *find_min(void *list_ptr);
pthread_mutex_t minimum_value_lock;
int minimum_value, partial_list_size;
main() {
/* declare and initialize data structures and list */
minimum_value = INT_MAX; /* so that any list entry is smaller */
pthread_mutex_init(&minimum_value_lock, NULL);
/* create and join threads; each thread runs find_min, which locks
   minimum_value_lock before comparing its local minimum with
   minimum_value and updating it */
}
Example: a producer-consumer work queue protected by a mutex-lock.
pthread_mutex_t task_queue_lock;
int task_available;
/* other shared data structures here */
main()
{ /* declarations and initializations */
task_available = 0;
pthread_mutex_init(&task_queue_lock, NULL);
/* create and join producer and consumer threads */
}
void *producer(void *producer_thread_data)
{ int inserted;
struct task my_task;
while(!done())
{ inserted = 0;
create_task(&my_task);
while(inserted == 0)
{ pthread_mutex_lock(&task_queue_lock);
if(task_available == 0)
{ insert_into_queue(my_task);
task_available = 1;
inserted = 1;
}
pthread_mutex_unlock(&task_queue_lock);
}
}
}
void *consumer(void *consumer_thread_data)
{ int extracted;
struct task my_task;
/* local data structure declarations */
while(!done())
{ extracted = 0;
while(extracted == 0)
{ pthread_mutex_lock(&task_queue_lock);
if(task_available == 1)
{ extract_from_queue(&my_task);
task_available = 0;
extracted = 1;
}
pthread_mutex_unlock(&task_queue_lock);
}
process_task(my_task);
}
}
Overheads of Locking
Locks represent serialization points, since critical sections must be
executed by threads one after the other.
Encapsulating large segments of the program within locks can
therefore lead to significant performance degradation.
It is important to minimize the size of critical sections.
Alleviating Locking Overheads:
It is often possible to reduce the idling overhead associated with locks
using an alternate function, pthread_mutex_trylock.
int pthread_mutex_trylock (pthread_mutex_t *mutex_lock);
Unlike pthread_mutex_lock, this function attempts the lock but returns
immediately (with the value EBUSY) if the mutex is already locked, so
the thread can do other work instead of blocking.
Condition variables for synchronization:
A call to pthread_cond_wait blocks the execution of the calling thread
until it receives a signal from another thread.
int pthread_cond_wait(pthread_cond_t *cond, pthread_mutex_t *mutex);
In addition to blocking the thread, the pthread_cond_wait function
releases the lock on mutex.
When the condition is signaled (using pthread_cond_signal), one of
the threads waiting in the queue is unblocked.
int pthread_cond_signal(pthread_cond_t *cond);
The producer then relinquishes its lock on mutex by explicitly calling
pthread_mutex_unlock, allowing one of the blocked consumer threads to
consume the task.
pthread_cond_init initializes a condition variable (pointed to by cond)
whose attributes are defined in the attribute object attr.
int pthread_cond_init (pthread_cond_t *cond, const pthread_condattr_t *attr);
main() {
/* declarations and initializations */
task_available = 0;
pthread_init();
pthread_cond_init(&cond_queue_empty, NULL);
pthread_cond_init(&cond_queue_full, NULL);
pthread_mutex_init(&task_queue_cond_lock, NULL);
/* create and join producer and consumer threads */
}
Although threaded APIs have come a long way, their use is still
predominantly restricted to system programmers as opposed to
application programmers.
Pthreads are considered to be low-level primitives.
OpenMP: a Standard for Directive-Based Parallel Programming
OpenMP is an API that can be used with FORTRAN, C, and C++ for
programming shared address space machines.
OpenMP directives provide support for concurrency, synchronization
and data handling.
The OpenMP Programming Model
OpenMP directives in C and C++ are based on the #pragma compiler
directives.
#pragma omp directive [clause list]
OpenMP programs execute serially until they encounter the parallel
directive.
The main thread that encounters the parallel directive becomes the
master of this group of threads and is assigned the thread id 0 within
the group.
#pragma omp parallel [clause list]
/* structured block */
Each thread created by this directive executes the structured block
specified by the parallel directive.
The clause list is used to specify conditional parallelization, number of
threads and data handling.
Conditional Parallelization: the clause if(scalar expression) determines
whether the parallel construct results in creation of threads. Only one if
clause can be used with the parallel directive.
Degree of Concurrency: The clause num_threads (integer expression)
specifies the number of threads that are created by the parallel directive.
Data handling:
The clause private (variable list) indicates that the set of variables
specified is local to each thread. Each thread has its own copy of each variable
in the list.
The clause firstprivate(variable list) is similar to the private
clause, except that the variables are initialized, on entry to the threads, to
the values they had just before the parallel directive.
The clause shared(variable list) indicates that all variables in the list are
shared across all the threads.
A sample OpenMP program along with its Pthreads translation that
might be performed by an OpenMP compiler
Example:
#pragma omp parallel if(is_parallel == 1)
num_threads(8) private(a) shared (b) firstprivate(c)
{ /* structured block */
}
The default state of a variable is specified by the clause default(shared) or
default(none).
default(shared) implies that, by default, variables are shared by all the threads.
default(none) implies that the state of each variable used by a thread must be
explicitly specified.
The reduction clause specifies how the multiple local copies of a variable at
different threads are combined into a single copy at the master when the
threads exit.
The usage of the reduction clause is reduction(operator: variable list).
This clause performs a reduction on the scalar variables specified in the
list using the operator.
The variables in the list are implicitly specified as being private to
threads.
Example: using the reduction clause
#pragma omp parallel reduction(+: sum) num_threads(8)
{ /* compute local sum here */
}
/* sum here contains sum of all local instances of sums */
First OpenMP program:
omp_get_num_threads() returns the number of threads in the
parallel region.
omp_get_thread_num() returns the integer ID of each thread
Example: Computing PI using OpenMP directives
#pragma omp parallel default(private) shared(npoints) reduction(+: sum)
num_threads(8)
{ num_threads = omp_get_num_threads();
sample_points_per_thread = npoints / num_threads;
sum = 0;
for(i = 0; i < sample_points_per_thread; i++)
{ rand_no_x = (double)(rand_r(&seed)) / (double)((2<<14) - 1);
rand_no_y = (double)(rand_r(&seed)) / (double)((2<<14) - 1);
if(((rand_no_x - 0.5) * (rand_no_x - 0.5) + (rand_no_y - 0.5) * (rand_no_y
- 0.5)) < 0.25)
sum++;
}
}
Special Concurrent Tasks in OpenMP
The parallel directive can be used in conjunction with other directives to
specify concurrency across iterations and tasks.
OpenMp provides two directives – for and sections to specify concurrent
iterations and tasks.
The for Directive
The for directive is used to split parallel iteration spaces across threads.
The general form of a for directive is as follows:
#pragma omp for [clause list]
/* for loop */
The clauses that can be used in context are: private, firstprivate, lastprivate,
reduction, schedule, nowait and ordered.
Using the for directive for computing π.
#pragma omp parallel default(private) shared(npoints) reduction(+: sum) num_threads(8)
{ sum = 0;
#pragma omp for
for(i = 0; i < npoints; i++)
{ rand_no_x = (double)(rand_r(&seed)) / (double)((2<<14) - 1);
rand_no_y = (double)(rand_r(&seed)) / (double)((2<<14) - 1);
if(((rand_no_x - 0.5) * (rand_no_x - 0.5) + (rand_no_y - 0.5) * (rand_no_y - 0.5))
< 0.25)
sum++;
}
}
Assigning iterations to threads
The schedule clause of the for directive deals with the assignment of iterations to
threads.
The general form of the schedule clause is schedule(scheduling_class[, parameter]).
OpenMP supports four scheduling classes: static, dynamic, guided and runtime.
Scheduling classes in OpenMP – matrix multiplication.
for(i=0; i < dim; i++)
{ for(j=0; j < dim; j++)
{ c(i, j) = 0;
for(k=0; k<dim; k++)
{ c(i, j) += a(i, k) * b(k, j);
}
}
}
Static:
The general form of the static scheduling class is schedule(static[, chunk-
size]).
This technique splits the iteration space into equal chunks of chunk-size and
assigns them to threads in a round-robin fashion.
When no chunk-size is specified, the iteration space is split into as many
chunks as there are threads and one chunk is assigned to each thread.
Static scheduling of loops in matrix multiplication
#pragma omp parallel default(private) shared (a, b, c, dim) num_threads(4)
#pragma omp for schedule(static)
for(i=0; i < dim; i++)
{ for(j=0; j < dim; j++)
{ c(i, j) = 0;
for(k=0; k<dim; k++)
{ c(i, j) += a(i, k) * b(k, j);
}
}
}
Dynamic
Often, for a number of reasons ranging from heterogeneous computing resources
to non-uniform processor loads, equally partitioned workloads take widely varying
execution times.
For this reason, OpenMP has a dynamic scheduling class.
The general form of this class is schedule(dynamic[, chunk-size]).
Guided:
With dynamic scheduling, large chunks dispatched near the end of the computation
can take considerably longer than the rest; if there are as many processors as
threads, this results in considerable idling.
The solution to this problem (edge effect) is to reduce the chunk size as we proceed through
the computation. This is the principle of guided scheduling.
The general form of this class is schedule(guided[, chunk-size]).
In this class, the chunk size is reduced exponentially as each chunk is dispatched to a thread.
Runtime:
If one would like to see the impact of various scheduling strategies to select the best one,
the scheduling can be set to runtime.
In this case the environment variable OMP_SCHEDULE determines the scheduling class
and the chunk size.
Synchronization across multiple for directives:
OpenMP provides a clause – nowait, which can be used with a for directive to indicate
that the threads can proceed to the next statement without waiting for all the other
threads to complete the for loop execution.
Using the nowait clause
#pragma omp parallel
{ #pragma omp for nowait
for(i = 0; i < nmax; i++)
if(isEqual(name, current_list[i]))
processCurrentName(name);
#pragma omp for
for(i = 0; i < nmax; i++)
if(isEqual(name, past_list[i]))
processPastName(name);
}
The sections directive
The for directive is suited to partitioning iteration spaces across threads.
Consider three tasks (taskA, taskB and taskC) that need to be executed. Assume
that these tasks are independent of each other and can therefore be assigned to
different threads.
OpenMP supports such non-iterative parallel task assignment using the sections directive.
Example:
#pragma omp sections [clause list]
{ [#pragma omp section
/* structured block */
]
[#pragma omp section
/* structured block */
]
…
}
For executing the three concurrent tasks taskA, taskB and taskC, the corresponding sections directive is
as follows:
#pragma omp parallel
{ #pragma omp sections
{ #pragma omp section
{ taskA();
}
#pragma omp section
{ taskB();
}
#pragma omp section
{ taskC();
}
}
}
Merging Directives
OpenMP allows the programmer to merge the parallel directive with the for and
sections directives into the composite directives parallel for and parallel
sections, respectively.
The clause list for the merged directive can contain clauses from the clause lists of
either the parallel or the for/sections directives.
Example:
#pragma omp parallel default(private) shared(n)
{ #pragma omp for
for(i = 0; i < n; i++)
{ /* body of parallel for loop */
}
}
is identical to:
#pragma omp parallel for default(private) shared(n)
for(i = 0; i < n; i++)
{ /* body of parallel for loop */
}
#pragma omp parallel
{ #pragma omp sections
{ #pragma omp section
{ taskA();
}
#pragma omp section
{ taskB();
}
/* other sections here */
}
}
is identical to:
#pragma omp parallel sections
{ #pragma omp section
{ taskA();
}
#pragma omp section
{ taskB();
}
/* other sections here */
}
Nesting parallel directives
#pragma omp parallel for default(private) shared(a, b, c, dim) num_threads(2)
for(i=0; i < dim; i++)
{ #pragma omp parallel for default(private) shared(a, b, c, dim) num_threads(2)
for(j=0; j < dim; j++)
{ c(i, j) = 0;
#pragma omp parallel for default(private) shared(a, b, c, dim) num_threads(2)
for(k=0; k < dim; k++)
{ c(i, j) += a(i, k) * b(k, j);
}
}
}
Note that nested parallel directives are serialized by default; nested parallelism
must be enabled with omp_set_nested or the OMP_NESTED environment variable.
Synchronization constructs in OpenMP
The OpenMP standard provides high level functionality in an easy-to-
use API.
Directives:
Synchronization Point: the barrier directive
Single thread Executions: the single and master directives
Critical section: the critical and atomic directives
In-order execution: the ordered directive
Memory consistency: the flush directive
Synchronization Point: the barrier directive
Barrier is one of the most frequently used synchronization primitives.
Syntax:
#pragma omp barrier
On encountering this directive, all threads in a team wait until the others have caught up,
and then all are released.
Note that if a barrier directive is placed inside a conditionally executed compound
statement, the condition must evaluate identically on all threads; a thread that skips the
barrier would leave the others waiting indefinitely.
Single thread Executions: the single and master directives
The single directive specifies a structured block that is executed by a single (arbitrary)
thread.
Syntax:
#pragma omp single [clause list]
structured block
The master directive is a specialization of the single directive in which only the
master thread executes the structured block
#pragma omp master
structured block
Critical section: the critical and atomic directives
In addition to explicit lock management, OpenMP provides a critical directive for
implementing critical regions.
The syntax of the critical directive is
#pragma omp critical [(name)]
structured block
Example: using the critical directive for producer-consumer threads.
#pragma omp parallel sections
{ #pragma omp section
{ /* producer thread */
task = produce_task();
#pragma omp critical (task_queue)
{ insert_into_queue(task);
}
}
#pragma omp section
{ /* consumer thread */
#pragma omp critical (task_queue)
{ task = extract_from_queue(task);
}
consumer_task(task);
}
}
In-order execution: the ordered directive:
In many cases, it is necessary to execute a segment of a parallel loop in the order in
which the serial version would execute it.
The Syntax of the ordered directive is as follows:
#pragma omp ordered
structured block
Example: computing the cumulative sum of a list using the ordered
directive.
cumul_sum[0] = list[0];
#pragma omp parallel for private(i) shared(cumul_sum, list, n) ordered
for(i=1; i<n; i++)
{ /* other processing on list[i] if needed */
#pragma omp ordered
{ cumul_sum[i] = cumul_sum[i-1] + list[i];
}
}
Memory consistency: the flush directive
The flush directive provides a memory fence by forcing a variable to be
written to or read from the memory system.
The syntax of the flush directive is as follows:
#pragma omp flush[(list)]
Data Handling in OpenMP
If a thread initializes and uses a variable and no other thread accesses the data, then a
local copy of the variable should be made for the thread. Such data should be specified as
private.
If a thread repeatedly reads a variable that has been initialized earlier in the program, it is
beneficial to make a copy of the variable and inherit the value at the time of thread
creation. Such data should be specified as firstprivate.
If multiple threads manipulate a single piece of data, one must explore ways of breaking
these manipulations into local operations followed by a single global operation.
If multiple threads manipulate different parts of a large data structure, the programmer
should explore ways of breaking it into smaller data structures and making them private to
the thread manipulating them.
After all the above techniques have been explored and exhausted, remaining data items
may be shared among various threads using the clause shared.
The threadprivate and copyin directives:
It is useful to make a set of objects locally available to a thread in such a way
that these objects persist through parallel and serial blocks, provided the
number of threads remains the same.
This class of variables is supported in OpenMP using the threadprivate
directive.
The syntax of the directive is as follows:
#pragma omp threadprivate(variable_list)
OpenMP Library Functions:
Controlling the number of threads and processors:
omp_set_num_threads(int num_threads) sets the default number of threads
that will be created on encountering the next parallel directive, provided the
num_threads clause is not used in that directive.
omp_get_num_threads() returns the number of threads participating in a
team.
omp_get_max_threads() returns the maximum number of threads that could
be created.
omp_get_thread_num() returns a unique thread id for each thread in a team.
omp_get_num_procs() returns the number of processors available.
omp_in_parallel() returns a non-zero value if called from within the scope of a
parallel region, and zero otherwise.
Controlling and monitoring thread creation:
omp_set_dynamic(int dynamic_threads) allows the programmer to
dynamically alter the number of threads created on encountering a parallel region;
if dynamic_threads evaluates to zero, dynamic adjustment is disabled,
otherwise it is enabled.
The function must be called outside the scope of a parallel region.
omp_get_dynamic() queries whether dynamic adjustment is enabled or
disabled.
omp_set_nested(int nested) enables nested parallelism if the value of its
argument, nested, is non-zero, and disables it otherwise.
omp_get_nested() returns a non-zero value if nested parallelism is enabled, and
zero otherwise.
Mutual exclusion
There are situations where it is convenient to use an explicit lock.
For this, OpenMP provides functions for initializing, locking, unlocking and discarding
locks.
The lock data structure in OpenMP is of type omp_lock_t.
Functions:
void omp_init_lock(omp_lock_t *lock) initializes a lock.
void omp_destroy_lock(omp_lock_t *lock) discards a lock.
void omp_set_lock(omp_lock_t *lock) and void omp_unset_lock(omp_lock_t *lock)
lock and unlock an initialized lock.
int omp_test_lock(omp_lock_t *lock) attempts to set a lock; if it returns a non-zero
value, the lock has been successfully acquired, otherwise the lock is currently owned
by another thread.
OpenMP also supports nestable locks that can be locked multiple times
by the same thread.
void omp_init_nest_lock(omp_nest_lock_t *lock);
void omp_destroy_nest_lock(omp_nest_lock_t *lock);
void omp_set_nest_lock(omp_nest_lock_t *lock);
void omp_unset_nest_lock(omp_nest_lock_t *lock);
int omp_test_nest_lock(omp_nest_lock_t *lock);
Environment Variables in OpenMP
OMP_NUM_THREADS specifies the default number of threads.
OMP_DYNAMIC, when set to TRUE, allows the number of threads to be
controlled at runtime.
OMP_NESTED enables nested parallelism.
OMP_SCHEDULE controls the assignment of iteration spaces for for
directives that use the runtime scheduling class.
END – Unit 2