
Cloud Programming Models: MapReduce Explained

Unit 4 covers cloud programming models, focusing on thread, task, and MapReduce programming, highlighting their differences and advantages. It explains the MapReduce framework for distributed and parallel processing of large datasets, detailing its components, architecture, and benefits over traditional approaches. The document also outlines the processing steps in MapReduce, including map, shuffle, and reduce operations, along with the roles of job and task trackers in managing execution.

Uploaded by

xxxeroxxx10

Unit 4: Cloud Programming Models (12 Hrs.)
Thread programming, Task programming, Map-reduce programming,
Parallel efficiency of MapReduce, Enterprise batch processing using
Map-Reduce, Comparisons between Thread, Task and Map reduce
• What is a Task?
• A task is something you want done. It is a set of program instructions that are loaded into memory. When program instructions are loaded into memory, people call it a process or a task; task and process are used as synonyms nowadays. A task will by default use the thread pool, which saves resources, as creating threads can be expensive. A task can tell you whether the work is completed and whether the operation returns a result. You can also see a task as a higher-level abstraction over threads.
• What is a Thread?
• A thread is a basic unit of CPU utilization, consisting of a program counter, a stack, and a set of registers. A thread has its own program area and memory area. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler.
Differences Between Task And Thread
1. Task is more abstract than threads. It is always advised to use tasks instead of threads, as tasks are created on the thread pool, which reuses system-created threads to improve performance.
2. A task can return a result. There is no direct mechanism to return a result from a thread.
3. Task supports cancellation through the use of cancellation tokens, but Thread doesn't.
4. A task can have multiple operations happening at the same time. A thread can only run one task at a time.
5. You can attach a task to a parent task, and thus decide whether the parent or the child finishes first.
6. When using a thread, an exception thrown in a long-running method cannot be caught in the parent function, but the same can easily be caught if we are using tasks.
7. You can easily build chains of tasks. You can specify when a task should start after the previous task, and whether there should be a synchronization context switch. That gives you the opportunity to run a long-running task in the background and, after that, a UI-refreshing task on the UI thread.
8. A task is by default a background task; you cannot have a foreground task. A thread, on the other hand, can be background or foreground.
9. The default TaskScheduler uses thread pooling, so some tasks may not start until other pending tasks have completed. If you use Thread directly, every use starts a new thread.
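The contrast above is phrased in .NET terms (Task, ThreadPool, cancellation tokens), but the core idea, that a task is a pooled unit of work that can report completion and return a result while a raw thread cannot, can be sketched in Python (the function `square` and the pool size are illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Thread: runs the work, but gives no direct way to get the return value.
t = threading.Thread(target=square, args=(6,))
t.start()
t.join()  # square's result (36) is simply lost

# Task-style abstraction: submitted to a pool of reusable threads,
# and the returned future can report completion and carry the result.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(square, 6)
    print(future.result())  # -> 36
    print(future.done())    # -> True
```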
Before MapReduce

The traditional approach splits the data into smaller parts or blocks and stores them on different machines. Then it finds the highest value in each part on the corresponding machine. Finally, it combines the results received from each of the machines to produce the final output.
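As a rough sketch of this approach (plain Python with made-up data; each list slice stands in for one machine):

```python
# Traditional approach: split the data, find the highest value in
# each part, then combine the partial results into the final output.
def find_max_traditional(data, num_machines=3):
    # Split the data into roughly equal blocks, one per "machine".
    size = (len(data) + num_machines - 1) // num_machines
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each machine finds the highest value in its own block.
    partial_maxima = [max(block) for block in blocks]
    # Combine the results received from each machine.
    return max(partial_maxima)

print(find_max_traditional([12, 7, 99, 3, 45, 61]))  # -> 99
```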
Challenges associated with traditional approach:

1. Critical path problem: This is the amount of time taken to finish the job without delaying the next milestone or the actual completion date. So if any of the machines delays the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How do I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how do we divide the data equally so that no individual machine is overloaded or underutilized?
4. A single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the result. So there should be a mechanism to ensure the fault-tolerance capability of the system.
5. Aggregation of the result: There should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance. MapReduce therefore gives you the flexibility to write your code logic without caring about the design issues of the system.
MapReduce
MapReduce is a programming framework that allows us to perform distributed and parallel
processing on large data sets in a distributed environment.
• MapReduce consists of two distinct tasks – Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place after the
mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to
produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
• The reducer receives the key-value pairs from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
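To make the key-value flow concrete, here is a hypothetical single-machine sketch (plain Python, not Hadoop code) that recasts the earlier "find the highest value" problem as map and reduce jobs:

```python
# Map job: read a block of data and emit an intermediate key-value pair.
def mapper(block):
    return ("max", max(block))

# Reduce job: aggregate the intermediate pairs into the final output.
def reducer(key, values):
    return (key, max(values))

blocks = [[12, 7], [99, 3], [45, 61]]             # one block per machine
intermediate = [mapper(b) for b in blocks]         # [("max", 12), ("max", 99), ("max", 61)]
key = intermediate[0][0]
result = reducer(key, [v for _, v in intermediate])
print(result)  # -> ('max', 99)
```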
• Hadoop MapReduce is a software framework and programming model suitable for processing huge amounts of data (multi-terabyte) in parallel, on large clusters of commodity hardware, in a reliable manner.
• MapReduce programs are very useful for performing large-scale data analysis using multiple machines
in the cluster.
• Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python,
and C++.
• The MapReduce algorithm includes two significant tasks, namely Map and Reduce.
- The Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples.
- The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.
• The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers and
reducers.
• Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
• This simple scalability is what has attracted many programmers to use the MapReduce model.
• MapReduce programming offers several benefits to help you gain valuable
insights from your big data:
- Scalability: Businesses can process petabytes of data stored in the Hadoop
Distributed File System (HDFS).
- Flexibility: Hadoop enables easier access to multiple sources of data and multiple
types of data.
- Speed: With parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
- Simple: Developers can write code in a choice of languages, including Java,
C++ and Python.
Algorithm of MapReduce
• Hadoop divides the job into tasks. There are two types of tasks:
i. Map tasks (Splits and Mapping)
- Input Splits: This is the very first phase in the execution of a map-reduce program. The input to a MapReduce job is divided into fixed-size pieces called input splits.
- Mapping: In this phase, the data in each split is passed to a mapping function to produce output values.
ii. Reduce tasks (Shuffling and Reducing)
- Shuffling: This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase's output.
- Reducing: In this phase, the output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
An Example for MapReduce:
• Consider that you have the following input data for your MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
• The task is to count the occurrences of each word, so the output should be:
bad:1
Class:1
good:1
Hadoop:3
is:2
to:1
Welcome:1
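The phases above can be sketched in plain Python (a single-process illustration, not Hadoop code; the function names are made up):

```python
from itertools import groupby

# Mapping: emit a (word, 1) pair for every word in a split.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffling: group all values belonging to the same key together.
def shuffle_phase(pairs):
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(ordered, key=lambda kv: kv[0])}

# Reducing: aggregate the grouped values into a single output value.
def reduce_phase(key, values):
    return key, sum(values)

lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]
intermediate = [pair for line in lines for pair in map_phase(line)]  # map
grouped = shuffle_phase(intermediate)                                # shuffle
counts = dict(reduce_phase(k, v) for k, v in grouped.items())        # reduce
print(counts)  # e.g. counts["Hadoop"] == 3, counts["is"] == 2
```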
MapReduce Architecture
Comparison of MapReduce with the traditional approach:
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. MapReduce is thus based on the Divide and Conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process the data is reduced tremendously.
2. Data Locality:
Instead of moving data to the processing unit, the MapReduce framework moves the processing unit to the data. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew very large, bringing this huge amount of data to the processing unit posed the following issues:
• Moving huge data to the processing unit is costly and deteriorates network performance.
• Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get over-burdened and may fail.
MapReduce overcomes these issues by bringing the processing unit to the data. The data is distributed among multiple nodes, where each node processes the part of the data residing on it. This gives us the following advantages:
• It is very cost-effective to move the processing unit to the data.
• Processing time is reduced, as all the nodes work on their part of the data in parallel.
• Every node gets a part of the data to process, so there is no chance of a node getting overburdened.
Map Reduce
• MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.
• A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
• It is a framework for parallel computing.
• MapReduce provides programmers with a simple API.
• MapReduce allows for the distributed processing of the map and reduce operations.

6/2/2022
Map Reduce Framework
• A programming model for parallel data processing.
• Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, and Ruby.
• It is usually composed of three operations:
- Map
- Shuffle
- Reduce
• Map: Each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of the redundant input data is processed.
• Shuffle: Worker nodes redistribute data based on the output keys, so that all data belonging to one key is located on the same worker node.
• Reduce: Worker nodes then process each group of output data, per key, in parallel.
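The shuffle operation above can be sketched as hash-partitioning (a simplified Python illustration; the worker count and sample pairs are made up):

```python
# Shuffle sketch: redistribute (key, value) pairs so that all pairs
# with the same key land on the same worker node.
def shuffle(pairs, num_workers):
    partitions = [[] for _ in range(num_workers)]
    for key, value in pairs:
        # Hash-partitioning: the same key always maps to the same worker.
        worker = hash(key) % num_workers
        partitions[worker].append((key, value))
    return partitions

pairs = [("Hadoop", 1), ("is", 1), ("Hadoop", 1)]
parts = shuffle(pairs, 2)
# Both ("Hadoop", 1) pairs end up in the same partition.
```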

Processing in the Map Reduce Framework
• A user runs a program on a client computer.
• The program submits a job to HDFS.
• The job contains:
- Input data
- The Map/Reduce program
- Configuration information
• Two types of daemons control job execution:
- Job Tracker (master node)
- Task Tracker (slave nodes)
• The Job Tracker communicates with the Name Node and assigns parts of the job to Task Trackers.
• Task processes send heartbeats to their Task Tracker.
• Task Trackers send heartbeats to the Job Tracker.
• Any task that does not report within a certain time is assumed to have failed; its JVM is killed by the Task Tracker, and the failure is reported to the Job Tracker.
• The Job Tracker reschedules any failed tasks.
• If the same task fails 4 times, the whole job fails.
• Any Task Tracker reporting a high number of failed tasks on a particular node causes that node to be blacklisted.
• The Job Tracker maintains and manages the status of each job. Results from failed tasks are ignored.
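The failure-handling policy above can be sketched as a retry loop (an illustrative Python simulation, not Hadoop code; `MAX_ATTEMPTS` mirrors the 4-failure limit, and `run_task` is a stand-in for launching a task attempt):

```python
# Retry policy sketch: a failed task is rescheduled, but if the
# same task fails 4 times, the whole job is marked as failed.
MAX_ATTEMPTS = 4

def run_job(tasks, run_task):
    for task in tasks:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if run_task(task, attempt):
                break  # task succeeded; move on to the next one
        else:
            return "JOB FAILED"  # same task failed 4 times
    return "JOB SUCCEEDED"

# A flaky task that succeeds only on its third attempt.
flaky = lambda task, attempt: attempt >= 3
print(run_job(["t1", "t2"], flaky))  # -> JOB SUCCEEDED
```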
