
Big Data and Hadoop Overview Guide


Unit I : Big Data and Hadoop

• Topics covered
Theory
– Introduction to BIG DATA
– Distributed computing
– Distributed File System
– Hadoop Eco System
– Hadoop Distributed File System (HDFS)
Practical
– Hadoop installation
– HDFS commands
– Using HDFS commands in Java programs
Big Data Analytics
Mark Distribution
• Mid 1 20 marks
• CLA1 10 marks
• CLA2 10 marks
• CLA3 10 marks
• Final exam 50 marks
Big Data Characteristics
• Big Data Characteristics
– Volume – How much data?
– Velocity – How fast the data is generated/processed?
– Variety - The various types of data.
– Veracity – Quality of data
– Value -- The usefulness of the data

• Example: Mobile phone data, sensor data, credit card data, weather data, data generated on social media sites (e.g., Facebook, Twitter), video surveillance data, medical data, data used for scientific and research experiments, online transactions, etc.
Big Data Characteristics
• Volume: High data volumes impose
– Distinct storage and processing demands
– Additional data preparation, curation and management
processes,
• Velocity: Data can arrive at high speed, so large amounts of data accumulate in a short period of time
– Highly elastic storage and processing capabilities are needed.
– e.g., per minute: 350,000 tweets, 300 hours of YouTube video, 171 million emails, 330 GB of sensor data from a jet engine
• Variety: Structured data (DBMS), Unstructured data
(Text, Image, Audio and Video) , Semi structured
data (XML)
Big Data Characteristics

• Veracity: Quality of data
– Invalid data and noise have to be removed (noise is data that cannot be converted into information and thus has no value)
– Data collected from controlled sources (e.g., online transactions) has less noise
– Data collected from uncontrolled sources has more noise
Big Data Characteristics
• Value: Usefulness of the data for the enterprise.
– Quality impacts value
– Time taken to process the data matters, e.g., stock market data
– Value is affected by
• whether useful attributes are removed during the cleansing process
• whether the right types of questions are asked during data analysis
Data Analysis Vs Data Analytics
• Data Analysis is the process of examining data to find
facts, relationships, patterns, insights and trends
among the data.
• Data Analytics includes development of analysis
methods, scientific methods and automated tools used
to manage the complete data life cycle (collection,
cleansing, organizing, storing, analysing and
governing data)
Data Analysis Vs Data Analytics
• Goal of Data Analysis: To conduct the analysis of the data so that high-quality results are delivered in a timely manner, providing optimal value to the enterprise.
– The overall goal of data analysis is to support better decision making.
Big Data
• Overall goal of data analysis is to support better decision
making
• Uses of large data
– Knowledge (meaningful pattern) obtained by analyzing
data, can improve the efficiency of the business
– Analyzing super market data
• What kind of items sold?
– Analyzing weather data
• Precautionary measures can be taken against floods,
Tsunami , etc.
– Analyzing Medical data
• Reasons for epidemic/pandemic diseases (Covid-19)
and cure can be found, Cancer cells identification, etc.
Different Types of Data Analytics
• Descriptive Analytics
• Diagnostic Analytics
• Predictive Analytics
• Prescriptive Analytics
Different Types of Data Analytics

• Descriptive Analytics
– are carried out to answer questions about events that have already occurred
– Example questions
• Sales volume over past 12 months
• Monthly commission earned by sales agent
• Number of calls received based on severity and
geographic location.
Different Types of Data Analytics
• Diagnostic Analytics
– aims to determine the cause of a phenomenon that
occurred in the past using questions that focus on the
reason behind the event
– Example questions
• Why are sales of Item 2 less than those of Item 3?
• Why are there more service-request calls from the western region?
• Why are patient re-admission rates increasing?
Different Types of Data Analytics
• Predictive Analytics
– tries to predict the outcome of future events; predictions are made based on patterns, trends and exceptions found in historical and current data.
– Example questions
• The chance that a customer defaults on a loan if one monthly payment is missed
• The patient survival rate if Drug B is administered instead of Drug A
• If a customer purchases products A and B, the chance of also purchasing C
Different Types of Data Analytics
• Prescriptive Analytics
– builds upon the results of predictive analytics by prescribing the actions that should be taken
– incorporates internal data (current and historical data, customer information, product data and business rules) and external data (social media data, weather forecasts and government-produced demographic data)
– Example questions:
• Among three drugs, which one provides the best results, and why?
• When is the best time to trade a particular stock?
Big Data
• How to store and process large data?
– Hadoop provides a storage and computing framework for solving Big Data problems.
– Researchers continue to work on more efficient solutions.
• Limitations of centralized systems
– Storage is limited to terabytes
– Uni-processor systems: multiprogramming, time sharing and threads improve throughput
– Multiprocessor systems: parallel processing using multiple CPUs improves throughput
• Disadvantage: limited scalability – resources cannot be added to handle increases in load.
Big Data
• Centralized systems
– Data has to be transferred to the system where application
program is getting executed
– Do not have
• Enough storage
• Required computing power to process large data
Big Data
• Multi computer systems
– Scalable: Storage, Computing power can be increased
according to the requirement
– Data intensive computing : Type of parallel computing –
Tasks are executed by transferring them to the systems
where the data is available and executing these tasks in
parallel.
Fastest Super Computer
• Top two IBM-built supercomputers, Summit and
Sierra
– Installed at the Department of Energy’s Oak Ridge
National Laboratory (ORNL) in Tennessee and Lawrence
Livermore National Laboratory in California
• Both derive their computational power from Power 9
CPUs and NVIDIA V100 GPUs.
• The Summit system delivers a record 148.6 petaflops, while Sierra, at number two, delivers 94.6 petaflops.
Fastest Super Computer
• Sunway TaihuLight, a system developed by China’s National Research Center of Parallel Computer Engineering & Technology (NRCPC) and installed at the National Supercomputing Center in Wuxi, held the top position before Summit and Sierra.
• With a Linpack performance of 93 petaflops, TaihuLight was for several years the most powerful number-cruncher on the planet.
• The Sunway TaihuLight uses a total of 40,960 Chinese-designed SW26010 many-core 64-bit RISC processors.
– Each processor chip contains 260 processing cores for a total of
10,649,600 CPU cores across the entire system
Distributed System (DS)
• A distributed system is a collection of independent
computers that appears to its users as a single
coherent system (Single System Image) - Loosely
coupled systems
• Computing power and storage are shared among the
users.
Advantages of DS
• Scalability
• Reliability
• Availability
• Communication
Cluster Systems
• Collection of similar computers (nodes) connected by
high speed LAN
– Master node, Computational nodes
• Nodes run similar OS
• Master node controls storage allocation and job
scheduling
• Provides single system image
• Computing power and storage are shared among the
users.
Cluster Computing .. Contd
Cloud Computing
• Cloud is a pool of virtualized computer resources and
provides various services (XaaS) in a scalable
manner on subscription basis (Pay as you go model )
• Services
– IaaS (Infrastructure as a service) – Amazon S3, EC2
– PaaS (Platform as a service) – Google App Engine, MS Azure
– SaaS (Software as a service) - Google Docs, Facebook
Essential Cloud Characteristics
Different Types of Cloud
• Many types
– Public cloud – available via Internet to anyone willing to
pay
– Private cloud – run by a company for the company’s own
use
– Hybrid cloud – includes both public and private cloud
components
Mainframe and Cloud Comparison
• Mainframe: Multiprocessor system
• Cloud: Multicomputer system
• Both are available for rental
• Mainframe: IaaS
• Cloud: XaaS
• Mainframe: Resources (physical) are fixed
• Cloud: Resources (virtual) can be added dynamically
Virtual Machine
• It is a software implementation of a computing
environment in which an operating system or
program can be installed and run.
– VM mimics the behavior of the hardware
• It is widely used to
– to build elastically scalable systems
– to deliver customizable computing environment on
demand
Virtual Machine
• Advantages
– Efficient resource utilization
– Loading different operating systems on single physical
machine
Virtualization
• It is the creation of virtual version of something such
as a processor or a storage device or network
resources, etc.
Hosted VM
• The VM runs on top of a host operating system, e.g., VMware for Windows

Bare-metal VM
• The hypervisor runs directly on the hardware
VM cloning
• A clone is a copy of an existing VM
– The existing VM is called the parent of the clone.
– Changes made to a clone do not affect the parent VM.
Changes made to the parent VM do not appear in a clone.
• Clones are useful when you must deploy many
identical VMs to a group
VM cloning
• Installing a guest operating system and applications
can be time consuming. With clones, we can make
many copies of a VM from a single installation and
configuration process.
File System
• A File is named collection of related information that
is recorded on secondary storage
• A file consists of data or program.
• The file system is one of the important components of the operating system; it provides mechanisms for storing and accessing file contents
File System
• File system has two parts
– (i) Collection of files (for storing data and programs)
– (ii) Directory Structure - provides information about all
files in the system.
• File systems are responsible for the organization,
storage, retrieval, naming, sharing and protection of
files.
Centralized File System
• File System which stores (maintains) files and
directories in a single computer system is known as
Centralized File System.
• Disadvantage
– Storage is limited
Distributed File System (DFS)
• File system that manages the storage across a network
of machines is called distributed file system (DFS).
– Keeps files and directories in multiple computer systems
• A DFS is a client/server-based application that allows
clients to access and process data stored on the
server(s) as if it were on their own computer.
• A DFS has to provide single system image to the
clients even though data is stored in multiple
computer systems.
Distributed File System
• Advantages of DFS
– Network Transparency: Client does not know about
location of files . Clients can access the files from any
computer connected in the network
– Reliability: By keeping multiple copies for files
(Replication) reliability can be achieved.
– Performance : If blocks of the file are distributed and
replication is applied, then parallel access of data is possible
and so the performance can be improved.
– Scalable: As the storage requirement increases, it is
possible to provide required storage by adding additional
computer systems in the network.
Distributed File System

• Name Server
– Maintains the global directory, where the metadata of files is stored
• Data Servers
– Data files (objects) are stored
– User programs are executed
Hadoop
• Google published papers on GFS (2003) and MapReduce (2004).
• In 2005, Doug Cutting created NDFS (the Nutch Distributed File System) and a MapReduce implementation, based on Google’s designs.
• In 2006, the project was renamed Hadoop and developed at Yahoo!
• In 2008, it became a top-level open-source project under Apache.
Hadoop Eco System
• Apache HBase, a table-oriented database built on top of
Hadoop.
• Apache Hive, a data warehouse built on top of Hadoop
that makes data accessible through an SQL-like
language.
• Apache Sqoop, a tool for transferring data between
Hadoop and other data stores.
• Apache Pig, a platform for creating programs that run on
Hadoop in parallel. Pig Latin is the language used here.
• ZooKeeper, a tool for configuring and synchronizing
Hadoop clusters.
Hadoop Distributed File System
(HDFS)
• HDFS is open‐source DFS developed under Apache
• Designed to store very large datasets reliably.
• Master-Slave Architecture
– Single Name node acts as the master
– Multiple Data nodes act as workers (slaves)
• 64 MB block size (can be increased)
– A large block size reduces seek-time overhead
– Data is then transferred at close to the disk transfer rate (~100 MB/s)
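To see why a large block size matters, here is a small plain-Java sketch (not Hadoop code; the 10 ms seek time and 100 MB/s transfer rate are illustrative assumptions) that computes how many blocks a file occupies and what fraction of read time is lost to seeking:

```java
public class BlockSizing {
    static final long MB = 1024L * 1024L;

    // Number of HDFS blocks needed to store a file of the given size.
    static long numBlocks(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes; // ceiling division
    }

    // Fraction of total read time spent seeking, assuming one seek per block.
    static double seekOverhead(long blockSizeBytes, double seekMs, double transferMBps) {
        double transferMs = (blockSizeBytes / (double) MB) / transferMBps * 1000.0;
        return seekMs / (seekMs + transferMs);
    }

    public static void main(String[] args) {
        // A 1 GB file needs 16 blocks of 64 MB.
        System.out.println(numBlocks(1024 * MB, 64 * MB));           // 16
        // With 64 MB blocks, a 10 ms seek is ~1.5% of the 640 ms it
        // takes to transfer one block at 100 MB/s.
        System.out.printf("%.3f%n", seekOverhead(64 * MB, 10.0, 100.0));
    }
}
```

With a small block size (say 4 KB), the same arithmetic makes seek time dominate, which is exactly what the large HDFS block avoids.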
HDFS - Architecture
HDFS Architecture - Continued

• Name node
– The name node manages the file system’s metadata (location of the file, size of the file, etc.) and the namespace
– Name space is a hierarchy of files and directories (name
space tree) and it is kept in the main memory.
– The mapping of blocks to data nodes is determined by the
name node
– Runs Job Tracker Program
HDFS Architecture - Continued
• File attributes like permissions, modification and access times, etc. are recorded in the inode structures.
• The inode data and the list of blocks belonging to each file (metadata) are called the image.
• The persistent record of the image stored in the local
host’s native file system is called a checkpoint.
• The name node also stores the modification log of the
image called the journal in the local host’s native file
system.
HDFS Architecture - Continued
• For safety purpose, redundant copies of the
checkpoint and journal can be stored in other server
systems connected in the network.
• During restarts, the name node restores the
namespace image from the persistent checkpoint and
then replays the changes from the journal until the
image is up-to-date.
• The locations of block replicas may change over time
and are not part of the persistent checkpoint.
HDFS Architecture - Continued
• Data node
– Runs HDFS client program
– Manages the storage attached to the node; responsible for
storing and retrieving file blocks from the local file system
and from the remote data node
– Executes the user application programs. These programs
can access the HDFS using the HDFS client
– Executes Map & Reduce tasks
– Runs Task Tracker program
Rack organization in Hadoop
HDFS - Contd
• Characteristics of HDFS
– The block replication factor is set by the user (3 by default)
– Stores one replica on the node where the write operation is requested, the second on a node in a different rack, and the third on a different node in the same rack as the second (Why?)
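Rack-aware placement of the three replicas (first on the writer’s node, the other two on two different nodes in another rack, per HDFS’s default policy) can be sketched in plain Java. This is a simplified model; the `Node` record and `place` method are illustrative, not Hadoop’s actual classes:

```java
import java.util.*;

public class ReplicaPlacement {
    // A node identified by rack and host name (illustrative model).
    record Node(String rack, String host) {}

    // Pick 3 replica locations: the writer's node, then a node on a
    // different rack, then another node on that same remote rack.
    static List<Node> place(Node writer, List<Node> cluster) {
        List<Node> chosen = new ArrayList<>();
        chosen.add(writer);
        Node second = cluster.stream()
                .filter(n -> !n.rack().equals(writer.rack()))
                .findFirst().orElseThrow();
        chosen.add(second);
        Node third = cluster.stream()
                .filter(n -> n.rack().equals(second.rack()) && !n.equals(second))
                .findFirst().orElseThrow();
        chosen.add(third);
        return chosen;
    }

    public static void main(String[] args) {
        List<Node> cluster = List.of(
                new Node("r1", "h1"), new Node("r1", "h2"),
                new Node("r2", "h3"), new Node("r2", "h4"));
        // Writing from h1 on rack r1 places replicas on h1, h3 and h4.
        System.out.println(place(cluster.get(0), cluster));
    }
}
```

Keeping two replicas on one remote rack limits cross-rack write traffic while still surviving the loss of an entire rack.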
• Heartbeat Message (HM)
– Data node sends periodic HM to the name node
– Receipt of a HM indicates that the data node is functioning
properly
HDFS - Contd
• Heartbeat Message (HM)
– An HM is generated every 3 seconds
– If the name node does not receive an HM from a DN within 10 minutes, the DN is considered to be not alive
– The HM carries information about total storage capacity, the fraction of storage in use, and the number of data transfers in progress (used for the name node’s space allocation and load balancing decisions)
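The liveness rule above (heartbeats every 3 seconds, declared dead after 10 minutes of silence) can be sketched as a simple timestamp check. This is a simplified model of the name node’s check, not Hadoop’s actual implementation:

```java
public class HeartbeatMonitor {
    static final long HEARTBEAT_INTERVAL_MS = 3_000;       // HM every 3 seconds
    static final long DEAD_TIMEOUT_MS = 10 * 60 * 1_000;   // 10 minutes

    // A data node is considered alive if its last heartbeat arrived
    // within the timeout window.
    static boolean isAlive(long lastHeartbeatMs, long nowMs) {
        return (nowMs - lastHeartbeatMs) < DEAD_TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long now = 1_000_000_000L;
        System.out.println(isAlive(now - 5_000, now));            // true: last HM 5 s ago
        System.out.println(isAlive(now - 11 * 60 * 1_000, now));  // false: silent for 11 min
    }
}
```

The long timeout (200 missed heartbeats) avoids declaring nodes dead during transient network hiccups, at the cost of slower failure detection.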
HDFS - Contd
• The name node does not directly call data nodes. It uses replies to heartbeats to send instructions to the data nodes. (The name node can send the following commands)
– Replicate blocks to other data nodes
– Send an immediate block report
– Remove local block replicas
• The name node is a multithreaded system and can process thousands of heartbeats per second without affecting other name node operations.
HDFS - Contd
• Block Report Message (BRM)
– Data node sends periodic BRM to the Name node
– Each BRM contains a list of all blocks (list contains the
block ID, generation stamp, and length of each block) on
the data node
– The first BR is sent immediately after data node registration.
– Subsequent BRs are sent every hour, which provides an up-to-date view of block replica locations in the cluster
– The name node decides whether to create a replica for a data block based on the BRM (How?)
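One way the name node could answer that question is to merge the block reports from all data nodes, count live replicas per block, and flag blocks below the target replication factor. The sketch below is illustrative logic only; the method and names are not Hadoop’s:

```java
import java.util.*;

public class BlockReportCheck {
    // Merge per-datanode block reports and find blocks with fewer live
    // replicas than the target factor (simplified name-node logic).
    static Set<String> underReplicated(Map<String, List<String>> reports, int factor) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> blocks : reports.values())
            for (String b : blocks)
                counts.merge(b, 1, Integer::sum);   // count replicas per block
        Set<String> result = new TreeSet<>();
        for (var e : counts.entrySet())
            if (e.getValue() < factor) result.add(e.getKey());
        return result;
    }

    public static void main(String[] args) {
        // blk_1 has 3 replicas, blk_2 has only 2.
        Map<String, List<String>> reports = Map.of(
                "dn1", List.of("blk_1", "blk_2"),
                "dn2", List.of("blk_1", "blk_2"),
                "dn3", List.of("blk_1"));
        System.out.println(underReplicated(reports, 3)); // [blk_2]
    }
}
```

Blocks flagged this way would then be scheduled for re-replication via the commands piggybacked on heartbeat replies.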
HDFS Continued
• User applications access the file system using HDFS
client.
• HDFS supports operations to read, write and delete
files, and operations to create and delete directories.
Reading from a File
• The HDFS client sends a read request to the name node
• The name node returns the addresses of the set of data nodes containing replicas of the requested file’s blocks
• The first block of the file is read by calling the read function with the address of the closest data node
• The same process is repeated for all blocks of the requested file
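The read flow above can be simulated in plain Java: for each block the name node returns replica locations, and the client picks the closest one. This is an illustrative model; real applications use the Hadoop FileSystem API rather than these hypothetical classes:

```java
import java.util.*;

public class HdfsReadSketch {
    // A replica location with a network "distance" from the client
    // (illustrative model, not Hadoop's classes).
    record Replica(String dataNode, int distance) {}

    // For each block, choose the closest data node holding a replica.
    static List<String> chooseDataNodes(List<List<Replica>> blockLocations) {
        List<String> chosen = new ArrayList<>();
        for (List<Replica> replicas : blockLocations) {
            Replica closest = replicas.stream()
                    .min(Comparator.comparingInt(Replica::distance))
                    .orElseThrow();
            chosen.add(closest.dataNode());
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<List<Replica>> locations = List.of(
                List.of(new Replica("dn1", 2), new Replica("dn3", 0)), // block 0
                List.of(new Replica("dn2", 4), new Replica("dn1", 2))); // block 1
        System.out.println(chooseDataNodes(locations)); // [dn3, dn1]
    }
}
```

Reading each block from the nearest replica is what lets a DFS turn replication into a performance win, not just a reliability feature.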
Writing to a File
• Default replication factor is 3 for HDFS. Applications
can reset the replication factor.
• Write function is used for writing into a file. Assume
that ‘n’ blocks have to be written.
Writing to a File

• The (first) block is written to the local data node.
• The name node is contacted to get a list of suitable data nodes (G) to store replicas of the (first) block. The block is then copied to one data node from G, and that data node forwards the block to another data node specified in G.
• This process is repeated for the remaining (n-1) blocks of the file.
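The pipelined write above can be simulated in plain Java: each block lands on the local node and is then forwarded node-to-node along the pipeline chosen by the name node. This is illustrative only; the method names are hypothetical, and real clients go through the HDFS client library:

```java
import java.util.*;

public class WritePipelineSketch {
    // Simulate writing n blocks: each block is written to the local node,
    // then forwarded along the pipeline of data nodes (G).
    static Map<String, List<Integer>> writeBlocks(int n, String local, List<String> pipeline) {
        Map<String, List<Integer>> stored = new LinkedHashMap<>();
        for (int block = 0; block < n; block++) {
            stored.computeIfAbsent(local, k -> new ArrayList<>()).add(block);
            for (String dn : pipeline)  // each node forwards the block to the next
                stored.computeIfAbsent(dn, k -> new ArrayList<>()).add(block);
        }
        return stored;
    }

    public static void main(String[] args) {
        var stored = writeBlocks(2, "dn1", List.of("dn2", "dn3"));
        System.out.println(stored); // {dn1=[0, 1], dn2=[0, 1], dn3=[0, 1]}
    }
}
```

Forwarding the block along a chain, rather than having the client send three separate copies, means the client’s outbound bandwidth is used only once per block.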
Concurrent File Access
• HDFS supports write once read many – concept
• Multiple reads are supported
• Concurrent writes are not allowed in HDFS
• Only append operation is allowed in HDFS and
modifying existing information is not allowed.
• [Link] 11ENeGTyshyqRqNKFKnbjsu6tuuTbnxr4QPBnZRNAH1s/edit
THANK YOU
