Data Analytics in IoT: Hadoop & MapReduce

The document outlines the Internet of Things course module on Data Analytics for IoT, focusing on the Hadoop ecosystem, including its architecture, components, and job execution workflow. It details the MapReduce process, YARN architecture, and various scheduling algorithms such as FIFO, Fair, and Capacity Scheduler. Additionally, it provides links for further reading on related Apache projects.


VII Semester
Course: Internet of Things
Course Code: BCS701
Credits: 04, 2022 Scheme

Module 5: Data Analytics for IoT
Soumya Jolad
Assistant Professor, Dept. of CSE, EPCET


Outline

• Overview of Hadoop ecosystem
• MapReduce architecture
• MapReduce job execution flow
• MapReduce schedulers


Hadoop Ecosystem

• Apache Hadoop is an open-source framework for distributed batch processing of big data.
• The Hadoop ecosystem includes:
  • Hadoop MapReduce
  • HDFS
  • YARN
  • HBase
  • ZooKeeper
  • Pig
  • Hive
  • Mahout
  • Chukwa
  • Cassandra
  • Avro
  • Oozie
  • Flume
  • Sqoop


Apache Hadoop

• A Hadoop cluster comprises a master node, a backup node, and a number of slave nodes.
• The master node runs the NameNode and JobTracker processes; the slave nodes run the DataNode and TaskTracker components of Hadoop.
• The backup node runs the Secondary NameNode process.
• NameNode
  • The NameNode keeps the directory tree of all files in the file system and tracks where across the cluster the file data is kept. It does not store the data of these files itself. Client applications talk to the NameNode whenever they wish to locate a file, or when they want to add, copy, move, or delete a file.
• Secondary NameNode
  • The NameNode is a single point of failure for the HDFS cluster. An optional Secondary NameNode, hosted on a separate machine, creates checkpoints of the namespace.
• JobTracker
  • The JobTracker is the service within Hadoop that distributes MapReduce tasks to specific nodes in the cluster, ideally the nodes that hold the data, or at least nodes in the same rack.


Apache Hadoop

• TaskTracker
  • A TaskTracker is a node in a Hadoop cluster that accepts Map, Reduce, and Shuffle tasks from the JobTracker.
  • Each TaskTracker has a defined number of slots, which indicate the number of tasks it can accept.
• DataNode
  • A DataNode stores data in an HDFS file system.
  • A functional HDFS file system has more than one DataNode, with data replicated across them.
  • DataNodes respond to requests from the NameNode for file system operations.
  • Client applications can talk directly to a DataNode once the NameNode has provided the location of the data.
  • Similarly, MapReduce operations assigned to TaskTracker instances near a DataNode talk directly to the DataNode to access the files.
  • TaskTracker instances can be deployed on the same servers that host DataNode instances, so that MapReduce operations are performed close to the data.


MapReduce

• A MapReduce job consists of two phases:
  • Map: In the Map phase, data is read from a distributed file system and partitioned among a set of computing nodes in the cluster. The data is sent to the nodes as a set of key-value pairs. The Map tasks process the input records independently of each other and produce intermediate results as key-value pairs. The intermediate results are stored on the local disk of the node running the Map task.
  • Reduce: When all the Map tasks are completed, the Reduce phase begins, in which the intermediate data with the same key is aggregated.
• Optional Combine task
  • An optional Combine task can be used to perform data aggregation on the intermediate data of the same key for the output of the mapper, before transferring the output to the Reduce task.
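The Map, Combine, and Reduce phases can be sketched in plain Python as a single-process word-count simulation (illustrative only; this is not the Hadoop API):

```python
from collections import defaultdict

def map_phase(record):
    # Emit (word, 1) key-value pairs for each word in the input record.
    for word in record.split():
        yield (word.lower(), 1)

def combine_phase(pairs):
    # Optional local aggregation on a mapper's intermediate output.
    partial = defaultdict(int)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_phase(intermediate):
    # Aggregate all intermediate values that share the same key.
    totals = defaultdict(int)
    for key, value in intermediate:
        totals[key] += value
    return dict(totals)

records = ["the quick brown fox", "the lazy dog", "the fox"]
intermediate = []
for record in records:                      # one Map task per input split
    intermediate.extend(combine_phase(map_phase(record)))
result = reduce_phase(intermediate)         # Reduce over grouped keys
print(result["the"], result["fox"])  # → 3 2
```

In a real cluster the intermediate pairs would be written to the mapper's local disk and shuffled to the reducers by key; here the list plays that role.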


MapReduce Job Execution Workflow
• MapReduce job execution starts when the client applications submit jobs to the JobTracker.
• The JobTracker returns a JobID to the client application. The JobTracker talks to the NameNode to determine the location of the data.
• The JobTracker locates TaskTracker nodes with available slots at or near the data.
• The TaskTrackers send out heartbeat messages to the JobTracker, usually every few minutes, to reassure the JobTracker that they are still alive. These messages also inform the JobTracker of the number of available slots, so the JobTracker can stay up to date with where in the cluster new work can be delegated.


MapReduce Job Execution Workflow
• The JobTracker submits the work to the TaskTracker nodes when they poll for tasks. To choose a task for a TaskTracker, the JobTracker uses various scheduling algorithms (the default is FIFO).
• The TaskTracker nodes are monitored using the heartbeat signals that the TaskTrackers send to the JobTracker.
• The TaskTracker spawns a separate JVM process for each task so that any task failure does not bring down the TaskTracker.
• The TaskTracker monitors these spawned processes while capturing the output and exit codes. When a process finishes, successfully or not, the TaskTracker notifies the JobTracker. When the job is completed, the JobTracker updates its status.
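The poll-and-assign loop can be sketched as a toy model (class and method names are illustrative, not Hadoop's actual API):

```python
from collections import deque

class TaskTracker:
    def __init__(self, name, slots):
        self.name = name
        self.free_slots = slots   # reported to the JobTracker via heartbeats

class JobTracker:
    def __init__(self):
        self.queue = deque()      # FIFO work queue (the default scheduler)

    def submit(self, task):
        self.queue.append(task)

    def heartbeat(self, tracker):
        # On each heartbeat, hand out queued tasks while slots are free.
        assigned = []
        while tracker.free_slots > 0 and self.queue:
            assigned.append(self.queue.popleft())
            tracker.free_slots -= 1
        return assigned

jt = JobTracker()
for t in ["map-0", "map-1", "reduce-0"]:
    jt.submit(t)
tt = TaskTracker("node-1", slots=2)
assigned = jt.heartbeat(tt)
print(assigned)   # → ['map-0', 'map-1']
```

The remaining task stays queued until another heartbeat reports a free slot, which mirrors how the JobTracker delegates work only when a TaskTracker polls.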


MapReduce 2.0 - YARN

• In Hadoop 2.0 the original processing engine of Hadoop (MapReduce) has been separated from resource management (which is now part of YARN).
• This makes YARN effectively an operating system for Hadoop that supports different processing engines on a Hadoop cluster, such as MapReduce for batch processing, Apache Tez for interactive queries, Apache Storm for stream processing, etc.
• The YARN architecture divides the two major functions of the JobTracker - resource management and job life-cycle management - into separate components:
  • ResourceManager
  • ApplicationMaster


YARN Components

• Resource Manager (RM): The RM manages the global assignment of compute resources to applications. The RM consists of two main services:
  • Scheduler: The Scheduler is a pluggable service that manages and enforces the resource scheduling policy in the cluster.
  • Applications Manager (AsM): The AsM manages the running Application Masters in the cluster. It is responsible for starting Application Masters and for monitoring and restarting them on different nodes in case of failures.
• Application Master (AM): A per-application AM manages the application's life cycle. The AM is responsible for negotiating resources from the RM and working with the NMs to execute and monitor the tasks.
• Node Manager (NM): A per-machine NM manages the user processes on that machine.
• Containers: A container is a bundle of resources allocated by the RM (memory, CPU, network, etc.). It is a conceptual entity that grants an application the privilege to use a certain amount of resources on a given machine to run a component task.
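Container allocation can be sketched as simple resource accounting on a node (a toy model; the class and method names are illustrative, not YARN's API):

```python
class NodeManager:
    def __init__(self, memory_mb, vcores):
        # Resources this node can still hand out as containers.
        self.memory_mb = memory_mb
        self.vcores = vcores

    def allocate(self, memory_mb, vcores):
        # Grant a container only if the node still has enough resources.
        if memory_mb <= self.memory_mb and vcores <= self.vcores:
            self.memory_mb -= memory_mb
            self.vcores -= vcores
            return {"memory_mb": memory_mb, "vcores": vcores}
        return None  # request denied; the RM would try another node

nm = NodeManager(memory_mb=8192, vcores=4)
c1 = nm.allocate(memory_mb=2048, vcores=1)   # granted
c2 = nm.allocate(memory_mb=8192, vcores=1)   # denied: only 6144 MB left
print(c1 is not None, c2 is None)  # → True True
```

This captures the key idea that a container is a grant of a bounded amount of a machine's resources, not a fixed map/reduce slot.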


Hadoop Schedulers

• The Hadoop scheduler is a pluggable component, which makes it open to support different scheduling algorithms.
• The default scheduler in Hadoop is FIFO.
• Two advanced schedulers are also available - the Fair Scheduler, developed at Facebook, and the Capacity Scheduler, developed at Yahoo.
• The pluggable scheduler framework provides the flexibility to support a variety of workloads with varying priority and performance constraints.
• Efficient job scheduling makes Hadoop a multi-tasking system that can process multiple data sets for multiple jobs for multiple users simultaneously.


FIFO Scheduler

• FIFO is the default scheduler in Hadoop. It maintains a work queue in which the jobs are queued.
• The scheduler pulls jobs in first-in, first-out order (oldest job first) for scheduling.
• There is no concept of priority or job size in the FIFO scheduler.


Fair Scheduler

• The Fair Scheduler allocates resources evenly between multiple jobs and also provides capacity guarantees.
• The Fair Scheduler assigns resources to jobs such that each job gets an equal share of the available resources, on average, over time.
• Task slots that are free are assigned to the new jobs, so that each job gets roughly the same amount of CPU time.
• Job Pools
  • The Fair Scheduler maintains a set of pools into which jobs are placed. Each pool has a guaranteed capacity.
  • When there is a single job running, all the resources are assigned to that job. When there are multiple jobs in the pools, each pool gets at least as many task slots as guaranteed.
  • Each pool receives at least its minimum share.
  • When a pool does not require its guaranteed share, the excess capacity is split between the other jobs.
• Fairness
  • The scheduler periodically computes the difference between the computing time each job has received and the time it would have received under ideal scheduling.
  • The job with the highest deficit of compute time received is scheduled next.
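The deficit rule can be sketched as a small fairness computation (a toy model under the assumption that every running job deserves an equal slice of cluster time; the function and field names are illustrative):

```python
def pick_next_job(received, elapsed, total_slots):
    # Ideal share: equal split of total slot-seconds across running jobs.
    ideal = elapsed * total_slots / len(received)
    # Deficit = time the job should have received minus time it actually got.
    deficits = {name: ideal - got for name, got in received.items()}
    # Schedule the job with the highest deficit next.
    return max(deficits, key=deficits.get)

# Three jobs have been running for 60 s on a 6-slot cluster;
# values are slot-seconds of compute time each job actually received.
received = {"jobA": 180.0, "jobB": 90.0, "jobC": 30.0}
next_job = pick_next_job(received, elapsed=60.0, total_slots=6)
print(next_job)  # → jobC
```

Here the ideal share is 60 × 6 / 3 = 120 slot-seconds per job, so jobC (deficit 90) is scheduled before jobB (deficit 30) and jobA (already over its share).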


Capacity Scheduler

• The Capacity Scheduler has similar functionality to the Fair Scheduler but adopts a different scheduling philosophy.
• Queues
  • In the Capacity Scheduler, you define a number of named queues, each with a configurable number of map and reduce slots.
  • Each queue is also assigned a guaranteed capacity.
  • The Capacity Scheduler gives each queue its capacity when it contains jobs, and shares any unused capacity between the queues. Within each queue, FIFO scheduling with priority is used.
• Fairness
  • For fairness, it is possible to place a limit on the percentage of running tasks per user, so that users share a cluster equally.
  • A wait time can be configured for each queue. When a queue is not scheduled for more than the wait time, it can preempt tasks of other queues to get its fair share.
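The sharing of unused capacity between queues can be sketched as follows (a toy model; queue names and the even-split policy are illustrative assumptions, not the scheduler's exact redistribution rule):

```python
def effective_capacity(queues, total_slots):
    # queues maps name -> (guaranteed_slots, has_jobs).
    # Queues with no jobs lend their guaranteed slots to the others.
    active = {q: cap for q, (cap, has_jobs) in queues.items() if has_jobs}
    idle_slots = total_slots - sum(active.values())
    # Split the unused capacity evenly between the active queues.
    bonus = idle_slots // len(active)
    return {q: cap + bonus for q, cap in active.items()}

# 100 slots total; the "adhoc" queue is guaranteed 10 but has no jobs,
# so its capacity is shared between the two busy queues.
queues = {"prod": (60, True), "research": (30, True), "adhoc": (10, False)}
caps = effective_capacity(queues, total_slots=100)
print(caps)  # → {'prod': 65, 'research': 35}
```

As soon as a job arrives in "adhoc", its guaranteed 10 slots are restored (after the configured wait time, by preemption if necessary).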


Further Reading

• Apache Hadoop, [Link]
• Apache Hive, [Link]
• Apache HBase, [Link]
• Apache Chukwa, [Link]
• Apache Flume, [Link]
• Apache Zookeeper, [Link]
• Apache Avro, [Link]
• Apache Oozie, [Link]
• Apache Storm, [Link]
• Apache Tez, [Link]
• Apache Cassandra, [Link]
• Apache Mahout, [Link]
• Apache Pig, [Link]
• Apache Sqoop, [Link]