
Cloud Programming Models: MapReduce Explained

Unit 4 covers cloud programming models, focusing on thread, task, and MapReduce programming, highlighting their differences and advantages. It explains the MapReduce framework for distributed and parallel processing of large datasets, detailing its components, architecture, and benefits over traditional approaches. The document also outlines the processing steps in MapReduce, including map, shuffle, and reduce operations, along with the roles of job and task trackers in managing execution.

Uploaded by

xxxeroxxx10

Unit 4: Cloud Programming Models (12 Hrs.)
Thread programming, Task programming, Map-reduce programming,
Parallel efficiency of MapReduce, Enterprise batch processing using
Map-Reduce, Comparisons between Thread, Task and Map reduce
• What is a Task?
• A task is something you want done. It is a set of program instructions that are loaded into memory. When program instructions are loaded into memory, people call it a process or a task; task and process are used as synonyms nowadays. A task will by default use the thread pool, which saves resources, as creating threads can be expensive. A task can tell you whether the work is completed and whether the operation returns a result. You can also see a task as a higher-level abstraction over threads.
• What is a Thread?
• A thread is a basic unit of CPU utilization, consisting of a program counter, a stack, and a set of registers. A thread has its own program area and memory area. A thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler.
Differences Between Task And Thread
1. Task is more abstract than threads. It is always advised to use tasks instead of threads, as tasks are created on the thread pool, which reuses system-created threads to improve performance.
2. A task can return a result. There is no direct mechanism to return a result from a thread.
3. Task supports cancellation through the use of cancellation tokens, but Thread doesn't.
4. A task can have multiple operations happening at the same time. A thread can only run one task at a time.
5. You can attach a task to a parent task, and thus decide whether the parent or the child finishes first.
6. When using a thread, an exception thrown in a long-running method cannot be caught in the parent function, but the same can easily be caught if we are using tasks.
7. You can easily build chains of tasks. You can specify when a task should start after the previous task, and whether there should be a synchronization context switch. That gives you the opportunity to run a long-running task in the background and, after that, a UI-refreshing task on the UI thread.
8. A task is by default a background task; you cannot have a foreground task. A thread, on the other hand, can be background or foreground.
9. The default TaskScheduler uses thread pooling, so some tasks may not start until other pending tasks have completed. If you use Thread directly, every use starts a new thread.
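The contrast above is phrased in .NET terms (Task, ThreadPool, cancellation tokens), but the core idea, that a task is a pooled unit of work that can report completion and return a result while a raw thread cannot, can be sketched in Python (the function `square` and the pool size are illustrative):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Thread: runs the work, but gives no direct way to get the return value.
t = threading.Thread(target=square, args=(6,))
t.start()
t.join()  # square's result (36) is simply lost

# Task-style abstraction: submitted to a pool of reusable threads,
# and the returned future can report completion and carry the result.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(square, 6)
    print(future.result())  # -> 36
    print(future.done())    # -> True
```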
Before MapReduce

The traditional approach splits the data into smaller parts or blocks and stores them on different machines. Then it finds the highest value in each part on the corresponding machine. Finally, it combines the results received from each of the machines to produce the final output.
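As a rough sketch of this approach (plain Python with made-up data; each list slice stands in for one machine):

```python
# Traditional approach: split the data, find the highest value in
# each part, then combine the partial results into the final output.
def find_max_traditional(data, num_machines=3):
    # Split the data into roughly equal blocks, one per "machine".
    size = (len(data) + num_machines - 1) // num_machines
    blocks = [data[i:i + size] for i in range(0, len(data), size)]
    # Each machine finds the highest value in its own block.
    partial_maxima = [max(block) for block in blocks]
    # Combine the results received from each machine.
    return max(partial_maxima)

print(find_max_traditional([12, 7, 99, 3, 45, 61]))  # -> 99
```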
Challenges associated with traditional approach:

1. Critical path problem: This is the amount of time taken to finish the job without delaying the next milestone or the actual completion date. So if any of the machines delays the job, the whole work gets delayed.
2. Reliability problem: What if any of the machines working with a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: How do I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how do we divide the data equally so that no individual machine is overloaded or underutilized?
4. A single split may fail: If any of the machines fails to provide its output, I will not be able to calculate the result. So there should be a mechanism to ensure the fault-tolerance capability of the system.
5. Aggregation of the result: There should be a mechanism to aggregate the results generated by each of the machines to produce the final output.
To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without worrying about issues like reliability and fault tolerance. MapReduce therefore gives you the flexibility to write your code logic without caring about the design issues of the system.
MapReduce
MapReduce is a programming framework that allows us to perform distributed and parallel
processing on large data sets in a distributed environment.
• MapReduce consists of two distinct tasks – Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place after the
mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to
produce key-value pairs as intermediate outputs.
• The output of a Mapper or map job (key-value pairs) is the input to the Reducer.
• The reducer receives the key-value pairs from multiple map jobs.
• Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.
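To make the key-value flow concrete, here is a hypothetical single-machine sketch (plain Python, not Hadoop code) that recasts the earlier "find the highest value" problem as map and reduce jobs:

```python
# Map job: read a block of data and emit an intermediate key-value pair.
def mapper(block):
    return ("max", max(block))

# Reduce job: aggregate the intermediate pairs into the final output.
def reducer(key, values):
    return (key, max(values))

blocks = [[12, 7], [99, 3], [45, 61]]             # one block per machine
intermediate = [mapper(b) for b in blocks]         # [("max", 12), ("max", 99), ("max", 61)]
key = intermediate[0][0]
result = reducer(key, [v for _, v in intermediate])
print(result)  # -> ('max', 99)
```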
• Hadoop MapReduce is a software framework and programming model suitable for processing huge amounts of data (multi-terabyte) in parallel, on large clusters of commodity hardware, in a reliable manner.
• MapReduce programs are very useful for performing large-scale data analysis using multiple machines
in the cluster.
• Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python,
and C++.
• The MapReduce algorithm includes two significant tasks, namely Map and Reduce.
- The Map task takes a set of data and converts it into another set of data, in which individual elements are broken down into tuples.
- The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
• As the sequence of the name MapReduce implies, the reduce task is always performed after the map task.
• The major advantage of MapReduce is that it is easy to scale data processing over multiple
computing nodes. Under the MapReduce model, the data processing primitives are called mappers and
reducers.
• Decomposing a data processing application into mappers and reducers is sometimes nontrivial. But,
once we write an application in the MapReduce form, scaling the application to run over hundreds,
thousands, or even tens of thousands of machines in a cluster is merely a configuration change.
• This simple scalability is what has attracted many programmers to use the MapReduce model.
• MapReduce programming offers several benefits to help you gain valuable
insights from your big data:
- Scalability: Businesses can process petabytes of data stored in the Hadoop
Distributed File System (HDFS).
- Flexibility: Hadoop enables easier access to multiple sources of data and multiple
types of data.
- Speed: With parallel processing and minimal data movement, Hadoop offers
fast processing of massive amounts of data.
- Simple: Developers can write code in a choice of languages, including Java,
C++ and Python.
Algorithm of MapReduce
• Hadoop divides the job into tasks. There are two types of tasks:
i. Map tasks (Splits and Mapping)
- Input Splits: This is the very first phase in the execution of a map-reduce program. The input to a MapReduce job is divided into fixed-size pieces called input splits.
- Mapping: In this phase, the data in each split is passed to a mapping function to produce output values.
ii. Reduce tasks (Shuffling and Reducing)
- Shuffling: This phase consumes the output of the mapping phase. Its task is to consolidate the relevant records from the mapping phase's output.
- Reducing: In this phase, the output values from the shuffling phase are aggregated. This phase combines the values from the shuffling phase and returns a single output value. In short, this phase summarizes the complete dataset.
An Example for MapReduce:
• Consider that you have the following input data for your MapReduce program:
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
• The task is to count the occurrences of each word, so the output should be:
bad:1
Class:1
good:1
Hadoop:3
is:2
to:1
Welcome:1
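The phases above can be sketched in plain Python (a single-process illustration, not Hadoop code; the function names are made up):

```python
from itertools import groupby

# Mapping: emit a (word, 1) pair for every word in a split.
def map_phase(line):
    return [(word, 1) for word in line.split()]

# Shuffling: group all values belonging to the same key together.
def shuffle_phase(pairs):
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return {key: [v for _, v in group]
            for key, group in groupby(ordered, key=lambda kv: kv[0])}

# Reducing: aggregate the grouped values into a single output value.
def reduce_phase(key, values):
    return key, sum(values)

lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]
intermediate = [pair for line in lines for pair in map_phase(line)]  # map
grouped = shuffle_phase(intermediate)                                # shuffle
counts = dict(reduce_phase(k, v) for k, v in grouped.items())        # reduce
print(counts)  # e.g. counts["Hadoop"] == 3, counts["is"] == 2
```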
MapReduce Architecture
Comparison of MapReduce with the traditional approach:
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the job simultaneously. MapReduce is thus based on the Divide and Conquer paradigm, which helps us process the data using different machines. As the data is processed by multiple machines in parallel instead of by a single machine, the time taken to process the data is reduced tremendously.
2. Data Locality:
Instead of moving data to the processing unit, the MapReduce framework moves the processing unit to the data. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew very large, bringing this huge amount of data to the processing unit posed the following issues:
• Moving huge data to the processing unit is costly and deteriorates network performance.
• Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get over-burdened and may fail.
MapReduce overcomes these issues by bringing the processing unit to the data. The data is distributed among multiple nodes, where each node processes the part of the data residing on it. This gives us the following advantages:
• It is very cost-effective to move the processing unit to the data.
• Processing time is reduced, as all the nodes work on their part of the data in parallel.
• Every node gets a part of the data to process, so there is no chance of a node getting overburdened.
Map Reduce
• MapReduce is a programming paradigm that enables massive scalability across hundreds or thousands of servers in a Hadoop cluster.
• A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
• It is a framework for parallel computing.
• MapReduce provides programmers with a simple API.
• MapReduce allows for the distributed processing of the map and reduce operations.

6/2/2022
Map Reduce Framework
• A programming model for parallel data processing.
• Hadoop can run MapReduce programs written in multiple languages, such as Java, Python, and Ruby.
• It is usually composed of three operations:
- Map
- Shuffle
- Reduce
• Map: Each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of the redundant input data is processed.
• Shuffle: Worker nodes redistribute data based on the output keys, so that all data belonging to one key is located on the same worker node.
• Reduce: Worker nodes then process each group of output data, per key, in parallel.
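The shuffle operation above can be sketched as hash-partitioning (a simplified Python illustration; the worker count and sample pairs are made up):

```python
# Shuffle sketch: redistribute (key, value) pairs so that all pairs
# with the same key land on the same worker node.
def shuffle(pairs, num_workers):
    partitions = [[] for _ in range(num_workers)]
    for key, value in pairs:
        # Hash-partitioning: the same key always maps to the same worker.
        worker = hash(key) % num_workers
        partitions[worker].append((key, value))
    return partitions

pairs = [("Hadoop", 1), ("is", 1), ("Hadoop", 1)]
parts = shuffle(pairs, 2)
# Both ("Hadoop", 1) pairs end up in the same partition.
```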

Processing in the Map Reduce Framework
• A user runs a program on a client computer.
• The program submits a job to HDFS.
• The job contains:
- Input data
- The Map/Reduce program
- Configuration information
• Two types of daemons control job execution:
- Job Tracker (master node)
- Task Tracker (slave nodes)
• The Job Tracker communicates with the Name Node and assigns parts of the job to Task Trackers.
• Task processes send heartbeats to their Task Tracker.
• Task Trackers send heartbeats to the Job Tracker.
• Any task that does not report within a certain time is assumed to have failed; its JVM is killed by the Task Tracker, and the failure is reported to the Job Tracker.
• The Job Tracker reschedules any failed tasks.
• If the same task fails 4 times, the whole job fails.
• Any Task Tracker reporting a high number of failed tasks on a particular node causes that node to be blacklisted.
• The Job Tracker maintains and manages the status of each job. Results from failed tasks are ignored.
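The failure-handling policy above can be sketched as a retry loop (an illustrative Python simulation, not Hadoop code; `MAX_ATTEMPTS` mirrors the 4-failure limit, and `run_task` is a stand-in for launching a task attempt):

```python
# Retry policy sketch: a failed task is rescheduled, but if the
# same task fails 4 times, the whole job is marked as failed.
MAX_ATTEMPTS = 4

def run_job(tasks, run_task):
    for task in tasks:
        for attempt in range(1, MAX_ATTEMPTS + 1):
            if run_task(task, attempt):
                break  # task succeeded; move on to the next one
        else:
            return "JOB FAILED"  # same task failed 4 times
    return "JOB SUCCEEDED"

# A flaky task that succeeds only on its third attempt.
flaky = lambda task, attempt: attempt >= 3
print(run_job(["t1", "t2"], flaky))  # -> JOB SUCCEEDED
```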
