Big Data Analytics with Hadoop Overview

The document provides an overview of Hadoop, an open-source framework for storing and processing large datasets in a distributed computing environment. It covers the core components of Hadoop, including HDFS, MapReduce, and YARN, and explains their roles in data storage, processing, and resource management. Additionally, it discusses the concept of scaling out in Hadoop to handle increasing data loads efficiently.


Big Data Analytics -R20 (21NM-IT)

Mr NETAJI GANDI
Introduction to Hadoop
 Hadoop: History of Hadoop
 The Hadoop Distributed File System
 Components of Hadoop
 Analysing the Data with Hadoop
 Scaling Out
 Hadoop Streaming, Design of HDFS
 Java interfaces to HDFS Basics
 Developing a Map Reduce Application
 How Map Reduce Works
 Anatomy of a Map Reduce Job run
 Failures & Job Scheduling
 Shuffle and Sort
 Task execution
 Map Reduce Types and Formats
 Map Reduce Features
 Hadoop environment

 Hadoop is an open-source software framework for storing and processing large amounts
of data in a distributed computing environment. It is designed to handle big data and is
based on the MapReduce programming model, which allows large datasets to be processed
in parallel. The framework is written mainly in Java, with some native code in C and
shell scripts.
 Hadoop also works with several additional modules that extend its functionality,
such as Hive (a data warehouse layer with a SQL-like query language), Pig (a high-level
platform for creating MapReduce programs), and HBase (a non-relational, distributed database).
 Hadoop is commonly used in big data scenarios such as data warehousing, business
intelligence, and machine learning. It is also used for data processing, data analysis, and
data mining.
 The goal behind Hadoop's design is an inexpensive, reliable, and scalable
framework that stores and analyzes ever-growing volumes of data.
 Hadoop follows a master-slave architecture for storing and processing vast
amounts of data. The master nodes assign tasks to the slave nodes.
 The slave nodes store the actual data and perform the actual
computation/processing. The master nodes store the metadata
and manage the resources across the cluster.

Slave nodes store the actual business data, whereas the master stores the
metadata.

The Hadoop architecture comprises three layers. They are:

1. Storage layer (HDFS)
2. Resource Management layer (YARN)
3. Processing layer (MapReduce)


HDFS, YARN, and MapReduce are the core components of the Hadoop framework.
1. HDFS (Hadoop Distributed File System)


 HDFS is the Hadoop Distributed File System, which runs on inexpensive commodity
hardware. It is the storage layer of Hadoop. Files in HDFS are broken into block-size
chunks called data blocks.
 These blocks are then stored on the slave nodes in the cluster. The block size is 128 MB by
default and can be configured as required.
 Like Hadoop, HDFS follows the master-slave architecture. It comprises two daemons:
NameNode and DataNode. The NameNode is the master daemon and runs on the master
node. The DataNodes are the slave daemons and run on the slave nodes.

NameNode
The NameNode stores the filesystem metadata, that is, file names, information about the blocks of a
file, block locations, permissions, etc. It manages the DataNodes.

DataNode
DataNodes are the slave nodes that store the actual business data. They serve client
read/write requests based on the NameNode's instructions.

In short, DataNodes store the blocks of the files, and the NameNode stores metadata such as block
locations and permissions.
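
The block splitting described above can be illustrated with a small sketch. It is not HDFS code: the function names, the file name, and the DataNode names are hypothetical, and the round-robin placement is a simplification chosen for illustration (a real NameNode also replicates blocks and applies its own placement policy). Only the 128 MB default comes from the text.

```python
BLOCK_SIZE_MB = 128  # default HDFS block size mentioned in the text


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size is split into."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block holds only the leftover data
    return sizes


def place_blocks(file_name, block_sizes, data_nodes):
    """Hypothetical NameNode-style metadata: block id -> DataNode (round-robin)."""
    return {
        f"{file_name}#blk{i}": data_nodes[i % len(data_nodes)]
        for i in range(len(block_sizes))
    }


blocks = split_into_blocks(500)
print(blocks)  # [128, 128, 128, 116]
print(place_blocks("sales.csv", blocks, ["datanode1", "datanode2", "datanode3"]))
```

The dictionary returned by place_blocks plays the role of the metadata kept on the master, while the block contents themselves would live on the DataNodes.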


2. MapReduce

 It is the data processing layer of Hadoop.

 It is a software framework for writing applications that process vast amounts of data
(terabytes to petabytes) in parallel on a cluster of commodity hardware.

 The MapReduce framework works on <key, value> pairs.

 A MapReduce job is the unit of work the client wants to perform. It
mainly consists of the input data, the MapReduce program, and the configuration
information.

 Hadoop runs a MapReduce job by dividing it into two types of tasks:
map tasks and reduce tasks.

 If a task fails, for example because of a problem on its node, it is automatically
rescheduled on a different node.

 The function of the map tasks is to load, parse, filter, and transform the data. The
output of the map tasks is the input to the reduce tasks. The reduce tasks then perform
grouping and aggregation on that output.
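
The map and reduce steps above can be sketched as a word count in plain Python. This is a single-process illustration of the <key, value> flow, not Hadoop's API: the function names are hypothetical, and the shuffle function stands in for the grouping the framework performs between the two phases.

```python
from collections import defaultdict


def map_task(line):
    """Map: parse one input line and emit a <word, 1> pair for each word."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_task(key, values):
    """Reduce: aggregate the values collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map over every line, group by key, then reduce each group."""
    mapped = [pair for line in lines for pair in map_task(line)]
    grouped = shuffle(mapped)
    return dict(reduce_task(key, values) for key, values in sorted(grouped.items()))


print(run_job(["big data big cluster", "data node"]))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

On a real cluster each call to map_task or reduce_task would run as a separate task, possibly on a different node, which is what makes the processing parallel.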

3. YARN (Yet Another Resource Negotiator)

It is the resource management layer of Hadoop. It was introduced in Hadoop 2.

YARN is designed with the idea of splitting up the functionalities of job scheduling and
resource management into separate daemons.

YARN consists of ResourceManager, NodeManager, and per-application ApplicationMaster.

1. ResourceManager

 It arbitrates resources among all the applications in the cluster.

 It has two main components: the Scheduler and the ApplicationsManager.

a. Scheduler
 The Scheduler allocates resources to the various applications running in the cluster,
subject to constraints such as capacities and queues.
 It is a pure scheduler: it does not monitor or track the status of the applications.

b. ApplicationsManager

 It is responsible for accepting job submissions.

 It negotiates the first container for executing the application-specific
ApplicationMaster.

2. NodeManager:
The NodeManager runs on the slave nodes. It is responsible for containers, monitoring their
resource usage (CPU, memory, disk, network), and reporting it to the
ResourceManager/Scheduler.

3. ApplicationMaster:

The per-application ApplicationMaster is a framework-specific library. It is
responsible for negotiating resources from the ResourceManager, and it works with the
NodeManager(s) to execute and monitor the tasks.
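
The interaction between the three YARN roles can be pictured with a toy model. This is only a sketch under strong assumptions: the classes borrow the YARN role names but are not the YARN API, memory is the only resource tracked, and the first-fit allocation rule is invented for illustration (real schedulers also consider capacities and queues).

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """One worker node: tracks its free memory and the containers it hosts."""
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)


class ResourceManager:
    """Toy scheduler: grants a container on the first node with enough free memory."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.free_memory_mb -= memory_mb
                node.containers.append((app_id, memory_mb))
                return node.name
        return None  # no capacity left; a real scheduler would queue the request


class ApplicationMaster:
    """Per-application component that negotiates containers from the ResourceManager."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.resource_manager = resource_manager

    def request_containers(self, count, memory_mb):
        return [self.resource_manager.allocate(self.app_id, memory_mb) for _ in range(count)]


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 2048)])
am = ApplicationMaster("app_1", rm)
print(am.request_containers(4, 2048))  # ['node1', 'node1', 'node2', None]
```

The fourth request returns None because both nodes are full, which is the situation a real Scheduler resolves by making the application wait for resources.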


How to Solve a Big data Challenge?


Scaling Out

Scaling out in Hadoop refers to expanding a Hadoop system by adding more
nodes to the cluster in order to handle increasing amounts of data and workload.

Key points about scaling out in Hadoop:

1. Adding Nodes: New machines (nodes) are added to the Hadoop cluster to enhance
its processing power and storage capacity.
2. Distributed Computing: Hadoop's distributed nature allows it to efficiently distribute
and parallelize data processing tasks across the added nodes.
3. Improved Performance: Scaling out improves the overall performance of the
Hadoop cluster, enabling it to handle larger datasets and perform computations more
quickly.
4. Fault Tolerance: Hadoop's architecture includes mechanisms for fault tolerance,
ensuring that data and processing can continue even if some nodes experience
failures.
5. Elasticity: Scaling out provides elasticity to the Hadoop cluster, allowing it to adapt
to changing workloads by adding or removing nodes as needed.
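
The effect of adding nodes can be estimated with a rough back-of-the-envelope sketch. The model below is an idealization under stated assumptions: blocks are spread evenly, every node processes its blocks one after another, and shuffle, network, and scheduling overheads are ignored. The figures (800 blocks, one minute per block) are hypothetical.

```python
import math


def blocks_per_node(total_blocks, nodes):
    """Largest number of blocks any node handles when blocks are spread evenly."""
    return math.ceil(total_blocks / nodes)


def estimated_minutes(total_blocks, nodes, minutes_per_block=1.0):
    """Idealized job time: nodes work in parallel, each on its own share of blocks."""
    return blocks_per_node(total_blocks, nodes) * minutes_per_block


for nodes in (4, 8, 16):
    print(nodes, estimated_minutes(800, nodes))
# 4 200.0
# 8 100.0
# 16 50.0
```

In this simplified model, doubling the number of nodes halves the estimated time. Real clusters fall short of this because of the overheads left out here.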

Apache Hadoop is a big data technology for handling data that is too large and complex
to be stored in traditional databases.
Hadoop is a framework that lets you store large volumes of data across several node
machines and process that data in parallel.
Hadoop has three main components: HDFS, MapReduce, and YARN.

Hadoop Distributed File System (HDFS) is the storage unit of Hadoop. It stores data
across multiple data servers by dividing large files into blocks, where each block has
a default size of 128 MB.

MapReduce is a framework that performs distributed and parallel processing of large
volumes of data. It has a map phase and a reduce phase.
[Figure: generalized flow of MapReduce]


YARN stands for Yet Another Resource Negotiator. It is responsible for managing the cluster
resources required for running applications. These resources include memory, CPU,
disk, network, etc.

Introduction to Hadoop Streaming
Hadoop Streaming is a utility that lets us create and run MapReduce jobs with any
script or executable, written in Java or any other language, as the mapper or the reducer.

In this way, Hadoop provides an API for writing MapReduce programs
in languages other than Java.
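
As an illustration of a non-Java mapper and reducer, here is a minimal word-count sketch in Python. It relies on the usual Streaming convention that scripts exchange text lines with a tab between key and value, and that the reducer receives its input sorted by key. The function names and sample input are hypothetical, and the last lines only simulate the pipeline locally rather than running on a cluster.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for each word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; assumes the lines arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"


# Local simulation of the pipeline: input | mapper | sort | reducer
sample = ["big data big cluster", "data node"]
for result in reducer(sorted(mapper(sample))):
    print(result)
```

In an actual Streaming job the mapper and reducer would be two separate scripts that read from standard input and print to standard output. They would be passed to the Hadoop Streaming jar through its -mapper and -reducer options, together with -input and -output paths; the location of the jar depends on the installation.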

How Hadoop Streaming Works


Streaming Command Options
