
Big Data and Hadoop Overview Guide


Unit I : Big Data and Hadoop

• Topics covered
Theory
– Introduction to BIG DATA
– Distributed computing
– Distributed File System
– Hadoop Eco System
– Hadoop Distributed File System (HDFS)
Practical
– Hadoop installation
– HDFS commands
– Using HDFS commands in Java programs
Big Data Analytics
Mark Distribution
• Mid 1 20 marks
• CLA1 10 marks
• CLA2 10 marks
• CLA3 10 marks
• Final exam 50 marks
Big Data Characteristics
• Big Data Characteristics
– Volume – How much data?
– Velocity – How fast the data is generated/processed?
– Variety - The various types of data.
– Veracity – Quality of data
– Value -- The usefulness of the data

• Example: Mobile phone data, sensor data, credit card data, weather data, data generated on social media sites (e.g., Facebook, Twitter), video surveillance data, medical data, data used for scientific and research experiments, online transactions, etc.
Big Data Characteristics
• Volume: High data volumes impose
– Distinct storage and processing demands
– Additional data preparation, curation and management
processes,
• Velocity: Data can arrive at high speed, so large amounts of data accumulate in a short period of time
– Highly elastic storage and processing capabilities are needed.
– e.g., per minute: 350,000 tweets, 300 hours of YouTube video, 171 million emails, 330 GB of sensor data from a jet engine
• Variety: Structured data (DBMS), Unstructured data
(Text, Image, Audio and Video) , Semi structured
data (XML)
Big Data Characteristics

• Veracity: Quality of data
– Invalid data and noise have to be removed (noise is data that cannot be converted into information and thus has no value)
– Data collected from controlled sources (e.g., online transactions) has less noise
– Data collected from uncontrolled sources has more noise
Big Data Characteristics
• Value: Usefulness of the data for the enterprise.
– Quality impacts value
– Time taken to process the data matters, e.g., stock market data
– Value is affected by
• whether useful attributes are removed during the cleansing process
• whether the right types of questions are asked during data analysis
Data Analysis Vs Data Analytics
• Data Analysis is the process of examining data to find
facts, relationships, patterns, insights and trends
among the data.
• Data Analytics includes development of analysis
methods, scientific methods and automated tools used
to manage the complete data life cycle (collection,
cleansing, organizing, storing, analysing and
governing data)
Data Analysis Vs Data Analytics
• Goal of Data Analysis: To conduct the analysis of the data so that high-quality results are delivered in a timely manner, providing optimal value to the enterprise.
– The overall goal of data analysis is to support better decision making.
Big Data
• Overall goal of data analysis is to support better decision
making
• Uses of large data
– Knowledge (meaningful pattern) obtained by analyzing
data, can improve the efficiency of the business
– Analyzing super market data
• What kind of items sold?
– Analyzing weather data
• Precautionary measures can be taken against floods,
Tsunami , etc.
– Analyzing Medical data
• Reasons for epidemic/pandemic diseases (Covid-19)
and cure can be found, Cancer cells identification, etc.
Different Types of Data Analytics
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Different Types of Data Analytics

• Descriptive Analytics
– are carried out to answer questions about events that have already occurred
– Example questions
• Sales volume over past 12 months
• Monthly commission earned by sales agent
• Number of calls received based on severity and
geographic location.
Different Types of Data Analytics
• Diagnostic Analytics
– aims to determine the cause of a phenomenon that
occurred in the past using questions that focus on the
reason behind the event
– Example questions
• Why are sales of Item 2 less than those of Item 3?
• Why are there more service-request calls from the western region?
• Why are patient re-admission rates increasing?
Different Types of Data Analytics
• Predictive Analytics
– tries to predict the outcome of future events; predictions are made based on patterns, trends and exceptions found in historical and current data.
– Example questions
• The chance that a customer defaults on a loan if one monthly payment is missed
• The patient survival rate if Drug B is administered instead of Drug A
• If a customer purchases products A and B, the chance of also purchasing C
Different Types of Data Analytics
• Prescriptive Analytics
– builds upon the results of predictive analytics by prescribing the actions that should be taken
– incorporates internal data (current and historical data, customer information, product data and business rules) and external data (social media data, weather forecasts and government-produced demographic data)
– Example questions:
• Among three drugs, which one provides the best results, and why?
• When is the best time to trade a particular stock?
Big Data
• How to store and process large data?
– Hadoop provides a storage and computing framework for solving Big Data problems.
– Researchers continue to work on more efficient solutions.
• Limitations of centralized systems
– Storage is limited to terabytes
– Uni-processor systems: multiprogramming, time sharing and threads improve throughput
– Multiprocessor systems: parallel processing using multiple CPUs improves throughput
• Disadvantage: limited scalability – resources cannot be added to handle increases in load.
Big Data
• Centralized systems
– Data has to be transferred to the system where application
program is getting executed
– Do not have
• Enough storage
• Required computing power to process large data
Big Data
• Multi computer systems
– Scalable: Storage, Computing power can be increased
according to the requirement
– Data intensive computing : Type of parallel computing –
Tasks are executed by transferring them to the systems
where the data is available and executing these tasks in
parallel.
Fastest Super Computer
• Top two IBM-built supercomputers, Summit and
Sierra
– Installed at the Department of Energy’s Oak Ridge
National Laboratory (ORNL) in Tennessee and Lawrence
Livermore National Laboratory in California
• Both derive their computational power from Power 9
CPUs and NVIDIA V100 GPUs.
• The Summit system delivers a record 148.6 petaflops, while Sierra, at number two, delivers 94.6 petaflops.
Fastest Super Computer
• Sunway TaihuLight, a system developed by China’s National Research Center of Parallel Computer Engineering & Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, held the top position before Summit and Sierra.
• With a Linpack performance of 93 petaflops, TaihuLight was for several years the most powerful number-cruncher on the planet.
• The Sunway TaihuLight uses a total of 40,960 Chinese-designed SW26010 many-core 64-bit RISC processors.
– Each processor chip contains 260 processing cores for a total of
10,649,600 CPU cores across the entire system
Distributed System (DS)
• A distributed system is a collection of independent
computers that appears to its users as a single
coherent system (Single System Image) - Loosely
coupled systems
• Computing power and storage are shared among the
users.
Advantages of DS
• Scalability
• Reliability
• Availability
• Communication
Cluster Systems
• Collection of similar computers (nodes) connected by
high speed LAN
– Master node, Computational nodes
• Nodes run similar OS
• Master node controls storage allocation and job
scheduling
• Provides single system image
• Computing power and storage are shared among the
users.
Cluster Computing .. Contd
Cloud Computing
• Cloud is a pool of virtualized computer resources and
provides various services (XaaS) in a scalable
manner on subscription basis (Pay as you go model )
• Services
– IaaS (Infrastructure as a service) – Amazon S3, EC2
– PaaS (Platform as a service) – Google App Engine, MS Azure
– SaaS (Software as a service) - Google Docs, Facebook
Essential Cloud Characteristics
Different Types of Cloud
• Many types
– Public cloud – available via Internet to anyone willing to
pay
– Private cloud – run by a company for the company’s own
use
– Hybrid cloud – includes both public and private cloud
components
Mainframe and Cloud Comparison
• Mainframe: Multiprocessor system
• Cloud: Multicomputer system
• Both are available for rental
• Mainframe: IaaS
• Cloud: XaaS
• Mainframe: Resources (physical) are fixed
• Cloud: Resources (virtual) can be added dynamically
Virtual Machine
• It is a software implementation of a computing
environment in which an operating system or
program can be installed and run.
– VM mimics the behavior of the hardware
• It is widely used to
– to build elastically scalable systems
– to deliver customizable computing environment on
demand
Virtual Machine
• Advantages
– Efficient resource utilization
– Loading different operating systems on single physical
machine
Virtualization
• It is the creation of virtual version of something such
as a processor or a storage device or network
resources, etc.
Hosted VM
• The VM runs on top of a host operating system, e.g., VMware for Windows

Bare-metal VM
• The hypervisor runs directly on the hardware
VM cloning
• A clone is a copy of an existing VM
– The existing VM is called the parent of the clone.
– Changes made to a clone do not affect the parent VM.
Changes made to the parent VM do not appear in a clone.
• Clones are useful when you must deploy many
identical VMs to a group
VM cloning
• Installing a guest operating system and applications
can be time consuming. With clones, we can make
many copies of a VM from a single installation and
configuration process.
File System
• A File is named collection of related information that
is recorded on secondary storage
• A file consists of data or program.
• The file system is one of the important components of the operating system; it provides mechanisms for storing and accessing file contents
File System
• File system has two parts
– (i) Collection of files (for storing data and programs)
– (ii) Directory Structure - provides information about all
files in the system.
• File systems are responsible for the organization,
storage, retrieval, naming, sharing and protection of
files.
Centralized File System
• File System which stores (maintains) files and
directories in a single computer system is known as
Centralized File System.
• Disadvantage
– Storage is limited
Distributed File System (DFS)
• File system that manages the storage across a network
of machines is called distributed file system (DFS).
– Keeps files and directories in multiple computer systems
• A DFS is a client/server-based application that allows
clients to access and process data stored on the
server(s) as if it were on their own computer.
• A DFS has to provide single system image to the
clients even though data is stored in multiple
computer systems.
Distributed File System
• Advantages of DFS
– Network Transparency: Client does not know about
location of files . Clients can access the files from any
computer connected in the network
– Reliability: By keeping multiple copies for files
(Replication) reliability can be achieved.
– Performance : If blocks of the file are distributed and
replication is applied, then parallel access of data is possible
and so the performance can be improved.
– Scalable: As the storage requirement increases, it is
possible to provide required storage by adding additional
computer systems in the network.
Distributed File System

• Name Server
– Maintains the global directory, where the metadata of files is stored
• Data Servers
– Data files (objects) are stored
– User programs are executed
Hadoop
• Google published papers on GFS (2003) and MapReduce (2004).
• In 2005, Doug Cutting created NDFS (the Nutch Distributed File System) and a MapReduce implementation, based on Google’s designs.
• In 2006, the project was renamed Hadoop and developed at Yahoo!
• In 2008, it became a top-level open-source project under Apache.
Hadoop Eco System
• Apache HBase, a table-oriented database built on top of
Hadoop.
• Apache Hive, a data warehouse built on top of Hadoop
that makes data accessible through an SQL-like
language.
• Apache Sqoop, a tool for transferring data between
Hadoop and other data stores.
• Apache Pig, a platform for creating programs that run on
Hadoop in parallel. Pig Latin is the language used here.
• ZooKeeper, a tool for configuring and synchronizing
Hadoop clusters.
Hadoop Distributed File System
(HDFS)
• HDFS is open‐source DFS developed under Apache
• Designed to store very large datasets reliably.
• Master-Slave Architecture
– Single Name node acts as the master
– Multiple Data nodes act as workers (slaves)
• 64 MB block size (can be increased)
– A large block size reduces seek-time overhead
– Data is then transferred at close to the disk transfer rate (~100 MB/s)
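To see why a large block size matters, here is a small plain-Java sketch (not Hadoop code; the 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions) that computes how many blocks a file occupies and what fraction of read time is lost to seeking:

```java
public class BlockSizing {
    static final long MB = 1024L * 1024L;

    // Number of HDFS blocks needed to store a file of the given size.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    // Fraction of total read time spent seeking, assuming one seek per block.
    static double seekOverhead(long blockSizeBytes, double seekMs, double transferMBps) {
        double transferMs = (blockSizeBytes / (double) MB) / transferMBps * 1000.0;
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        // A 1 GB file needs 16 blocks of 64 MB.
        System.out.println(numBlocks(1024 * MB, 64 * MB));           // 16
        // With 64 MB blocks, a 10 ms seek is ~1.5% of the 640 ms it
        // takes to transfer one block at 100 MB/s.
        System.out.printf("%.3f%n", seekOverhead(64 * MB, 10.0, 100.0));
    }
}
```

With a small block size (say 4 KB), the same arithmetic makes seek time dominate, which is exactly what the large HDFS block avoids.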
HDFS - Architecture
HDFS Architecture - Continued

• Name node
– The name node manages the file system’s metadata (location of the file, size of the file, etc.) and the namespace
– Name space is a hierarchy of files and directories (name
space tree) and it is kept in the main memory.
– The mapping of blocks to data nodes is determined by the
name node
– Runs Job Tracker Program
HDFS Architecture - Continued
• File attributes like permissions, modification and access times, etc. are recorded in the inode structures.
• The inode data and the list of blocks belonging to each file (metadata) are called the image.
• The persistent record of the image stored in the local
host’s native file system is called a checkpoint.
• The name node also stores the modification log of the
image called the journal in the local host’s native file
system.
HDFS Architecture - Continued
• For safety purpose, redundant copies of the
checkpoint and journal can be stored in other server
systems connected in the network.
• During restarts, the name node restores the
namespace image from the persistent checkpoint and
then replays the changes from the journal until the
image is up-to-date.
• The locations of block replicas may change over time
and are not part of the persistent checkpoint.
HDFS Architecture - Continued
• Data node
– Runs HDFS client program
– Manages the storage attached to the node; responsible for
storing and retrieving file blocks from the local file system
and from the remote data node
– Executes the user application programs. These programs
can access the HDFS using the HDFS client
– Executes Map & Reduce tasks
– Runs Task Tracker program
Rack organization in Hadoop
HDFS - Contd
• Characteristics of HDFS
– The block replication factor is set by the user (3 by default)
– Stores one replica on the node where the write operation is requested, the second on a node in a different rack, and the third on a different node in the same rack as the second (Why?)
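Rack-aware placement of the three replicas (first on the writer’s node, the other two on two different nodes in another rack, per HDFS’s default policy) can be sketched in plain Java. This is a simplified model; the `Node` record and `place` method are illustrative, not Hadoop’s actual classes:

```java
import java.util.*;

public class ReplicaPlacement {
    // A node identified by rack and host name (illustrative model).
    record Node(String rack, String host) {}

    // Pick 3 replica locations: the writer's node, then a node on a
    // different rack, then another node on that same remote rack.
    static List<Node> place(Node writer, List<Node> cluster) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer);
        Node second = cluster.stream()
                .filter(n -> !n.rack().equals(writer.rack()))
                .findFirst().orElseThrow();
        chosen.add(second);
        Node third = cluster.stream()
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .findFirst().orElseThrow();
        chosen.add(third);
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("r1", "h1"), new Node("r1", "h2"),
                new Node("r2", "h3"), new Node("r2", "h4"));
        // Writing from h1 on rack r1 places replicas on h1, h3 and h4.
        System.out.println(place(cluster.get(0), cluster));
    }
}
```

Keeping two replicas on one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.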
• Heartbeat Message (HM)
– Data node sends periodic HM to the name node
– Receipt of a HM indicates that the data node is functioning
properly
HDFS - Contd
• Heartbeat Message (HM)
– An HM is generated every 3 seconds
– If the name node does not receive an HM from a DN within 10 minutes, the DN is considered to be not alive
– The HM carries information about total storage capacity, the fraction of storage in use, and the number of data transfers in progress (used for the name node’s space allocation and load balancing decisions)
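The liveness rule above (heartbeats every 3 seconds, declared dead after 10 minutes of silence) can be sketched as a simple timestamp check. This is a simplified model of the name node’s check, not Hadoop’s actual implementation:

```java
public class HeartbeatMonitor {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;       // HM every 3 seconds
    static final long DEAD_TIMEOUT_MS = 10 * 60 * 1_000;   // 10 minutes

    // A data node is considered alive if its last heartbeat arrived
    // within the timeout window.
    static boolean isAlive(long lastHeartbeatMs, long nowMs) {
        return (nowMs - lastHeartbeatMs) < DEAD_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long now = 1_000_000_000L;
        System.out.println(isAlive(now - 5_000, now));            // true: last HM 5 s ago
        System.out.println(isAlive(now - 11 * 60 * 1_000, now));  // false: silent for 11 min
    }
}
```

The long timeout (200 missed heartbeats) avoids declaring nodes dead during transient network hiccups, at the cost of slower failure detection.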
HDFS - Contd
• The name node does not directly call data nodes. It uses replies to heartbeats to send instructions to the data nodes. (The name node can send the following commands)
– Replicate blocks to other data nodes
– Send an immediate block report
– Remove local block replicas
• The name node is a multithreaded system and can process thousands of heartbeats per second without affecting other name node operations.
HDFS - Contd
• Block Report Message (BRM)
– Data node sends periodic BRM to the Name node
– Each BRM contains a list of all blocks (list contains the
block ID, generation stamp, and length of each block) on
the data node
– The first BR is sent immediately after data node registration.
– Subsequent BRs are sent every hour, which provides an up-to-date view of block replica locations in the cluster
– The name node decides whether to create a replica for a data block based on the BRM (How?)
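One way the name node could answer that question is to merge the block reports from all data nodes, count live replicas per block, and flag blocks below the target replication factor. The sketch below is illustrative logic only; the method and names are not Hadoop’s:

```java
import java.util.*;

public class BlockReportCheck {
    // Merge per-datanode block reports and find blocks with fewer live
    // replicas than the target factor (simplified name-node logic).
    static Set<String> underReplicated(Map<String, List<String>> reports, int factor) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> blocks : reports.values())
            for (String b : blocks)
                counts.merge(b, 1, Integer::sum);   // count replicas per block
        Set<String> result = new TreeSet<>();
        for (var e : counts.entrySet())
            if (e.getValue() < factor) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        // blk_1 has 3 replicas, blk_2 has only 2.
        Map<String, List<String>> reports = Map.of(
                "dn1", List.of("blk_1", "blk_2"),
                "dn2", List.of("blk_1", "blk_2"),
                "dn3", List.of("blk_1"));
        System.out.println(underReplicated(reports, 3)); // [blk_2]
    }
}
```

Blocks flagged this way would then be scheduled for re-replication via the commands piggybacked on heartbeat replies.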
HDFS Continued
• User applications access the file system using HDFS
client.
• HDFS supports operations to read, write and delete
files, and operations to create and delete directories.
Reading from a File
• The HDFS client sends a read request to the name node
• The name node returns the addresses of the set of data nodes containing replicas of the requested file’s blocks
• The first block of the file is read by calling the read function with the address of the closest data node
• The same process is repeated for all blocks of the requested file
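The read flow above can be simulated in plain Java: for each block the name node returns replica locations, and the client picks the closest one. This is an illustrative model; real applications use the Hadoop FileSystem API rather than these hypothetical classes:

```java
import java.util.*;

public class HdfsReadSketch {
    // A replica location with a network "distance" from the client
    // (illustrative model, not Hadoop's classes).
    record Replica(String dataNode, int distance) {}

    // For each block, choose the closest data node holding a replica.
    static List<String> chooseDataNodes(List<List<Replica>> blockLocations) {
        List<String> chosen = new ArrayList<>();
        for (List<Replica> replicas : blockLocations) {
            Replica closest = replicas.stream()
                    .min(Comparator.comparingInt(Replica::distance))
                    .orElseThrow();
            chosen.add(closest.dataNode());
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<List<Replica>> locations = List.of(
                List.of(new Replica("dn1", 2), new Replica("dn3", 0)), // block 0
                List.of(new Replica("dn2", 4), new Replica("dn1", 2))); // block 1
        System.out.println(chooseDataNodes(locations)); // [dn3, dn1]
    }
}
```

Reading each block from the nearest replica is what lets a DFS turn replication into a performance win, not just a reliability feature.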
Writing to a File
• Default replication factor is 3 for HDFS. Applications
can reset the replication factor.
• Write function is used for writing into a file. Assume
that ‘n’ blocks have to be written.
Writing to a File

• The (first) block is written to the local data node.
• The name node is contacted to get a list of suitable data nodes (G) to store replicas of the (first) block. The block is then copied to one data node from G, and that data node forwards the block to another data node specified in G.
• This process is repeated for the remaining (n-1) blocks of the file.
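The pipelined write above can be simulated in plain Java: each block lands on the local node and is then forwarded node-to-node along the pipeline chosen by the name node. This is illustrative only; the method names are hypothetical, and real clients go through the HDFS client library:

```java
import java.util.*;

public class WritePipelineSketch {
    // Simulate writing n blocks: each block is written to the local node,
    // then forwarded along the pipeline of data nodes (G).
    static Map<String, List<Integer>> writeBlocks(int n, String local, List<String> pipeline) {
        Map<String, List<Integer>> stored = new LinkedHashMap<>();
        for (int block = 0; block < n; block++) {
            stored.computeIfAbsent(local, k -> new ArrayList<>()).add(block);
            for (String dn : pipeline)  // each node forwards the block to the next
                stored.computeIfAbsent(dn, k -> new ArrayList<>()).add(block);
        }
        return stored;
    }

    public static void main(String[] args) {
        var stored = writeBlocks(2, "dn1", List.of("dn2", "dn3"));
        System.out.println(stored); // {dn1=[0, 1], dn2=[0, 1], dn3=[0, 1]}
    }
}
```

Forwarding the block along a chain, rather than having the client send three separate copies, means the client’s outbound bandwidth is used only once per block.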
Concurrent File Access
• HDFS supports write once read many – concept
• Multiple reads are supported
• Concurrent writes are not allowed in HDFS
• Only append operation is allowed in HDFS and
modifying existing information is not allowed.
• [Link] 11ENeGTyshyqRqNKFKnbjsu6tuuTbnxr4QPBnZRNAH1s/edit
THANK YOU
