UNIT-2
1. Define Hadoop and explain its architecture in detail. Discuss the roles of NameNode,
DataNode, Secondary NameNode, and the functioning of HDFS.
2. Trace the evolution of Hadoop and explain the key features and advantages that make
Hadoop suitable for Big Data processing. Include examples of industries using Hadoop.
3. Apply the concept of distributed computing in Hadoop to explain how Hadoop handles
large datasets efficiently. Illustrate your answer with a real-time example of distributed
storage and processing.
4. Analyze the differences between RDBMS and Hadoop with respect to storage, processing,
data type handling, scalability, and fault tolerance. Provide a comparison table and justify
why Hadoop is preferred for Big Data.
5. Explain the major components of the Hadoop ecosystem (HDFS, YARN, MapReduce,
Hive, Pig, HBase). Discuss how each component contributes to solving Big Data challenges.
Critically assess their advantages and limitations.
6. Design a complete Hadoop-based Big Data solution for a large e-commerce company.
Include data ingestion, storage, processing, workflow design, and analytics output. Justify
why Hadoop is the ideal choice for this scenario.
7. Compare MapReduce processing with traditional batch processing systems. Analyze the
architecture of MapReduce, explaining the roles of Mapper, Reducer, Combiner, and
Partitioner with examples.
8. Critically evaluate the challenges in distributed computing and explain how Hadoop
overcomes them. Discuss issues such as hardware failures, data replication, cluster
management, and parallel processing. Provide suitable illustrations.
1. Define Hadoop and explain its architecture in detail. Discuss the roles of NameNode,
DataNode, Secondary NameNode, and the functioning of HDFS.
1. Definition of Hadoop
Hadoop is an open-source, Java-based framework developed by Apache to store and process
large-scale data across clusters of low-cost commodity hardware.
It follows the distributed computing model, where data is split into chunks and processed in
parallel across multiple machines.
Key goals of Hadoop:
Distributed storage
Fault tolerance
High scalability
Cost-effective data processing
Handling structured, semi-structured, and unstructured data
Hadoop is mainly composed of:
1. HDFS (Hadoop Distributed File System) – for storage
2. YARN (Yet Another Resource Negotiator) – for resource management
3. MapReduce – for processing
2. Hadoop Architecture (Detailed Explanation)
Hadoop follows a Master–Slave architecture consisting of:
A. Master Nodes
NameNode (manages file system namespace)
ResourceManager (manages cluster resources)
B. Slave Nodes
DataNodes (store data blocks)
NodeManagers (execute tasks)
The architecture is divided mainly into two layers:
(a) Storage Layer: HDFS Architecture
HDFS consists of:
1. NameNode (Master)
Stores metadata such as:
o File names
o Directory structure
o Block locations
o Permissions
Manages the entire file system namespace.
Does NOT store actual data—only metadata.
Keeps track of which DataNode stores which block.
2. DataNodes (Slaves)
Store actual data blocks.
Perform read and write operations on request from clients.
Send Heartbeats to NameNode every 3 seconds to confirm availability.
Send Block Reports to share block information.
3. Secondary NameNode
Not a backup to NameNode.
Its main task is checkpointing:
o Merges NameNode’s edit logs and FSImage
o Creates a new checkpoint periodically
Reduces the load on NameNode and prevents edit log overflow.
Important:
If NameNode fails, Secondary NameNode cannot automatically replace it. Manual restoration
is needed.
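The checkpointing merge can be sketched as a toy simulation (this is illustrative Python, not Hadoop source; the FSImage is modeled as a plain dict and the edit-log entries as small dicts):

```python
# Illustrative sketch of Secondary NameNode checkpointing: replay the
# edit log onto the FSImage, producing a new checkpoint and an empty log.
# Names (fsimage, edit_log) are hypothetical stand-ins for HDFS internals.

def apply_edit(fsimage, edit):
    """Replay one edit-log entry onto the in-memory namespace image."""
    op, path = edit["op"], edit["path"]
    if op == "create":
        fsimage[path] = {"blocks": edit.get("blocks", [])}
    elif op == "delete":
        fsimage.pop(path, None)
    return fsimage

def checkpoint(fsimage, edit_log):
    """Secondary NameNode's task: merge all edits into a new FSImage
    and return it together with a truncated (empty) edit log."""
    for edit in edit_log:
        apply_edit(fsimage, edit)
    return fsimage, []

fsimage = {"/data/a.txt": {"blocks": ["blk_1"]}}
edits = [
    {"op": "create", "path": "/data/b.txt", "blocks": ["blk_2", "blk_3"]},
    {"op": "delete", "path": "/data/a.txt"},
]
new_image, new_log = checkpoint(fsimage, edits)
print(sorted(new_image))   # ['/data/b.txt']
```

Because merging happens off the NameNode, the active NameNode keeps serving requests while its edit log stays small.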
(b) Processing Layer: MapReduce + YARN
1. YARN Architecture
YARN includes:
ResourceManager – Allocates resources to applications
NodeManagers – Execute tasks on each node
ApplicationMaster – Coordinates a specific job
2. MapReduce
A distributed processing framework
Composed of Mapper (data split processing) and Reducer (aggregation)
3. Roles of NameNode, DataNode, Secondary NameNode
A. NameNode (Master Node)
Responsibilities:
Maintains metadata (directory structure, permissions, file-block mapping)
Manages file system namespace
Coordinates with DataNodes
Handles client read/write requests
Coordinates block replication and recovery
Why NameNode is critical?
If NameNode fails, the entire HDFS becomes inaccessible.
It is the “brain” of the Hadoop architecture.
B. DataNode (Slave Node)
Responsibilities:
Stores data in the form of blocks
Executes read/write operations
Regularly reports to NameNode
Replicates blocks based on NameNode instructions
Responsible for actual storage and retrieval
Replication
Data is automatically replicated (default: 3 copies) across DataNodes to ensure fault
tolerance.
C. Secondary NameNode
Responsibilities:
Performs checkpointing to avoid log file overflow
Downloads NameNode’s metadata and edit logs
Merges FSImage and edit logs
Uploads a new checkpointed FSImage
Misconception Clarified
It is not a backup NameNode.
It only reduces the NameNode’s workload.
4. Functioning of HDFS (Step-by-Step)
A. Write Operation (When storing a file)
1. Client contacts NameNode to request file creation.
2. NameNode checks permissions and allocates DataNodes.
3. File is split into blocks (default: 128 MB; configurable, e.g. 256 MB).
4. Blocks are written to DataNodes and replicated.
5. DataNodes send confirmation to NameNode.
6. NameNode updates metadata.
Result:
File stored reliably across multiple nodes.
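The write path above can be sketched as a small simulation (illustrative only; block placement here is simple round-robin, while real HDFS placement is rack-aware):

```python
# Toy simulation of the HDFS write path: split a file into 128 MB blocks
# and place 3 replicas of each block on distinct DataNodes.
import math

BLOCK_SIZE = 128 * 1024 * 1024   # default 128 MB block
REPLICATION = 3                  # default replication factor

def split_into_blocks(file_size):
    """Number of blocks needed for a file of the given size in bytes."""
    return math.ceil(file_size / BLOCK_SIZE)

def place_replicas(num_blocks, datanodes):
    """Assign each block to REPLICATION distinct DataNodes (round-robin)."""
    placement = {}
    for b in range(num_blocks):
        placement[f"blk_{b}"] = [
            datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)
        ]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]
blocks = split_into_blocks(400 * 1024 * 1024)   # a 400 MB file
print(blocks)                                   # 4 (three full blocks + one partial)
print(place_replicas(blocks, nodes)["blk_0"])   # ['dn1', 'dn2', 'dn3']
```

The NameNode would record only this placement map (metadata); the block bytes themselves travel directly to the DataNodes.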
B. Read Operation (When accessing a file)
1. Client requests file from NameNode.
2. NameNode returns block locations.
3. Client reads blocks directly from DataNodes.
4. Data is combined and returned to the application.
Result:
Fast, parallel data access.
5. Features of HDFS
Fault tolerance: Data replication, automatic recovery
High throughput: Parallel data access
Scalability: Add nodes easily
Distributed storage: Low-cost hardware
Write-once-read-many model: Suitable for Big Data workloads
Hadoop is a powerful distributed framework that enables organizations to store and process
massive datasets efficiently.
Its architecture is built around HDFS, NameNode, DataNode, and checkpointing by
Secondary NameNode to ensure scalability and fault tolerance. With HDFS handling storage
and MapReduce handling processing, Hadoop becomes the foundation of most Big Data
environments.
2. Trace the evolution of Hadoop and explain the key features and advantages that
make Hadoop suitable for Big Data processing. Include examples of industries using
Hadoop.
Evolution of Hadoop
Hadoop originated from the need to store and process massive web data efficiently.
1. Early 2000s – Google Papers
2003: Google File System (GFS) – Introduced distributed storage.
2004: Google MapReduce – Introduced parallel data processing.
These laid the foundation for Hadoop.
2. 2005 – Hadoop Project Launched
Doug Cutting and Mike Cafarella created Hadoop as part of the Nutch project.
Adopted concepts from Google’s papers.
3. 2008 – Apache Hadoop
Hadoop became a top-level Apache project.
Organizations began adopting HDFS + MapReduce.
4. 2011 – Hadoop 1.x
Introduced HDFS + MapReduce.
MapReduce acted as both resource manager AND processing engine.
Limitations: Single JobTracker, scalability issues.
5. 2013 – Hadoop 2.x (YARN)
YARN (Yet Another Resource Negotiator) separated:
o Resource Management
o Data Processing
Allowed multiple processing engines:
o Spark
o Hive
o Tez
6. Hadoop Ecosystem Growth
Included tools like:
Hive, Pig, Sqoop, HBase, ZooKeeper, Oozie
Improved usability, SQL processing, NoSQL storage.
7. Hadoop 3.x
Better fault tolerance (3-replica → erasure coding)
More efficient resource usage
Support for heterogeneous storage
Diagrammatic Representation of Hadoop Evolution
Key Features of Hadoop
1. Distributed Storage (HDFS)
Data stored across clusters of inexpensive machines.
Automatic replication ensures reliability.
2. Scalability
Can scale from 1 node to thousands.
Add nodes without system downtime.
3. Fault Tolerance
If a node fails, processing continues on replicated blocks.
4. Parallel Processing (MapReduce/Spark)
Data processed simultaneously across many nodes.
5. Cost-Effective
Uses commodity hardware instead of expensive servers.
6. Flexible Data Handling
Supports structured, semi-structured, and unstructured data:
o Text
o Images
o Logs
o Videos
7. Vast Ecosystem
Tools for:
Ingestion (Sqoop, Flume)
Processing (Hive, Pig, Spark)
Coordination (ZooKeeper)
Scheduling (Oozie)
Advantages of Hadoop for Big Data Processing
1. Handles Huge Volumes of Data
Suitable for terabytes to petabytes of data.
2. High Speed
Parallel processing significantly reduces computation time.
3. Cost-Effective Storage
HDFS stores large data cheaply.
4. Flexible Processing Models
Works with:
Batch processing
Real-time (Spark Streaming)
Interactive queries (Hive)
5. Enhanced Security & Governance
Kerberos authentication
Ranger & Sentry for authorization
Industries Using Hadoop (with Examples)
1. Banking & Finance
Fraud detection, transaction analysis, risk modeling.
Example: Banks use Hadoop + Spark for analyzing customer spending patterns.
2. E-commerce
Customer behavior analytics
Recommendation systems
Inventory prediction
Example: Amazon/Flipkart analyze clickstream data using Hadoop.
3. Telecom
Network optimization
Dropped call analysis
Example: AT&T processes billions of call records with Hadoop.
4. Healthcare
Patient data analysis
Genomics processing
Example: Hospitals use Hadoop for analyzing MRI & genome datasets.
5. Social Media
Sentiment analysis
Trend prediction
Example: Facebook initially used HDFS-like infrastructure for data warehousing.
Diagram: Hadoop in Industry Workflow
3. Apply the concept of distributed computing in Hadoop to explain how Hadoop
handles large datasets efficiently. Illustrate your answer with a real-time example of
distributed storage and processing.
Hadoop is designed on the principle of distributed computing, where large datasets are
divided into smaller chunks and processed across many machines simultaneously. This
enables Hadoop to store, manage, and analyze massive volumes of data efficiently.
1. Concept of Distributed Computing in Hadoop
Distributed computing means:
Data is divided into blocks
Blocks are stored across multiple machines in a cluster
Processing happens locally on each machine
Results are later combined
Hadoop achieves this using:
A. Distributed Storage – HDFS (Hadoop Distributed File System)
How HDFS stores data
A large file is split into fixed-size blocks (e.g., 128 MB each).
These blocks are stored on different DataNodes in the cluster.
Each block is automatically replicated (default 3 copies) for fault tolerance.
Benefits
Manage petabytes of data.
High availability even if machines fail.
B. Distributed Processing – MapReduce / YARN
How MapReduce processes data
Map tasks run on the same node where the block resides
➝ This is called data locality.
Each node processes its own block and outputs intermediate results.
Reduce tasks merge and aggregate the output.
Benefits
Massive parallel processing
Reduces network traffic
Faster execution on big datasets
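The data-locality idea can be shown as a small scheduling sketch (illustrative; `block_locations` is a hypothetical stand-in for the NameNode's block metadata, and the real scheduler also considers rack locality and load):

```python
# Sketch of locality-aware map-task scheduling: prefer a node that
# already stores a replica of the block, fall back to any free node.

def schedule_map_tasks(block_locations, free_nodes):
    """Return {block: node}, preferring nodes that hold the block locally."""
    assignment = {}
    for block, replicas in block_locations.items():
        local = [n for n in replicas if n in free_nodes]
        # Local replica available -> no network transfer needed for input.
        assignment[block] = local[0] if local else sorted(free_nodes)[0]
    return assignment

block_locations = {
    "blk_0": ["dn1", "dn2", "dn3"],
    "blk_1": ["dn2", "dn4", "dn5"],
}
print(schedule_map_tasks(block_locations, {"dn2", "dn5"}))
# Both blocks are scheduled on nodes that already hold a replica.
```

When no replica holder is free, the task still runs, but the block must be read over the network, which is exactly the traffic data locality tries to avoid.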
2. How Hadoop Efficiently Handles Large Datasets
Step-by-step explanation
Step 1: Input Data Splitting
Large dataset → divided into small blocks
Example: 1 TB file → 8192 blocks (128 MB each)
Step 2: Distributed Storage
Blocks stored across 50–1000+ DataNodes.
Step 3: Parallel Processing
MapReduce assigns tasks to nodes storing those blocks.
Step 4: Fault Tolerance
If a node fails, another copy from a different node processes the data.
Step 5: Result Aggregation
All intermediate results combined to produce final output.
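The arithmetic in Step 1 can be checked directly:

```python
# Verifying the block count from Step 1: a 1 TB file split into
# 128 MB blocks yields exactly 8192 blocks.
import math

def num_blocks(file_bytes, block_bytes=128 * 1024 * 1024):
    """Blocks needed to store a file, rounding the last partial block up."""
    return math.ceil(file_bytes / block_bytes)

one_tb = 1024 ** 4                 # 1 TB in bytes
print(num_blocks(one_tb))          # 8192
print(num_blocks(one_tb) * 3)      # 24576 block replicas with 3x replication
```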
3. Diagrammatic Representation of Distributed Storage & Processing
A. Distributed Storage (HDFS)
HDFS ARCHITECTURE
B. Distributed Processing (MapReduce)
4. Real-Time Example: Log Analytics in an E-commerce Company
Scenario
Flipkart / Amazon generates terabytes of clickstream logs daily.
How Hadoop handles this using distributed computing
Step-by-Step Processing:
Step 1: Logs Stored in HDFS
Clickstream logs → 24 hours → 5 TB
HDFS divides these into ≈40,960 blocks (128 MB each).
Day 1 Logs
→ Block 1 → DataNode 1
→ Block 2 → DataNode 4
→ Block 3 → DataNode 7
... replicated across 3 DataNodes each
Step 2: MapReduce Processing
Map tasks extract:
o Product IDs clicked
o Time spent
o User behavior
Each DataNode processes its own block.
Step 3: Reduce Phase
Combines user activities
Generates:
o Most viewed products
o High-traffic time intervals
o Abandoned carts analysis
Step 4: Output
The e-commerce company uses insights for:
Better recommendations
Targeted ads
Re-pricing strategies
Real-Time Example – Diagram
Hadoop efficiently handles large datasets using distributed computing through:
Distributed storage (HDFS)
Distributed processing (MapReduce)
Fault tolerance
Parallel execution
Data locality to reduce network load
Real-time use cases like e-commerce clickstream analysis prove Hadoop’s power in
managing huge data volumes.
4. Analyze the differences between RDBMS and Hadoop with respect to storage,
processing, data type handling, scalability, and fault tolerance. Provide a comparison
table and justify why Hadoop is preferred for Big Data.
Differences Between RDBMS and Hadoop
Traditional RDBMS and Hadoop are built for very different purposes.
RDBMS is optimized for structured data and transactional processing, whereas Hadoop is
designed for large-scale distributed storage and massive parallel processing of big data.
1. Comparison Table: RDBMS vs Hadoop
Feature | RDBMS (Relational DBMS) | Hadoop (HDFS + MapReduce/YARN)
Storage Model | Centralized storage; data in tables (rows & columns) | Distributed storage across clusters using HDFS; data stored as blocks
Processing Model | ACID transactions; OLTP operations; query processing using SQL | Parallel processing (MapReduce/Spark) executed across multiple nodes
Data Types Handled | Mostly structured data (numeric, text) with fixed schema | Structured, semi-structured, and unstructured data (text, logs, images, videos)
Scalability | Vertical scaling (add more CPU/RAM to one server); limited and expensive | Horizontal scaling (add more machines); highly scalable from GB to petabytes
Fault Tolerance | Limited; if the server fails, recovery is manual or requires replication | High; HDFS replicates data blocks (default 3 replicas) with automatic failure recovery
Cost | Expensive hardware & licenses | Commodity hardware, open-source; very cost-effective
Performance | Excellent for small–medium structured datasets | Excellent for massive datasets using parallel computation
Schema | Schema-on-write (must define schema first) | Schema-on-read (flexible; process data later)
Use Cases | Banking OLTP, inventory, payroll systems | Big Data analytics, log processing, machine learning, IoT
2. Why Hadoop Is Preferred for Big Data
Hadoop is popular for Big Data because:
A. Handles All Types of Data
RDBMS cannot store images/videos/log files efficiently.
Hadoop stores any format: JSON, XML, audio, logs, social media data, clickstream
logs.
B. Extremely Scalable
Hadoop clusters can expand from 10 nodes → 1000+ nodes easily.
RDBMS scalability is limited and costly.
C. Distributed Storage (HDFS)
Data is split into blocks and stored across multiple machines.
Allows storage of petabytes of data.
D. Distributed Parallel Processing
MapReduce/Spark process data simultaneously on different nodes.
Processing becomes much faster for huge datasets.
E. High Fault Tolerance
Automatic replication ensures no data loss even if nodes fail.
RDBMS requires expensive backup and replication solutions.
F. Cost-Effective
Hadoop runs on inexpensive commodity hardware.
RDBMS solutions (Oracle, SQL Server) require costly enterprise servers.
3. Diagrammatic Representation
A. RDBMS Architecture (Centralized & Structured)
B. Hadoop Architecture (Distributed & Scalable)
C. Distributed Data Processing (MapReduce)
4. Final Justification: Why Hadoop for Big Data?
1. Built for scalability (horizontal scaling)
Handles petabytes of data seamlessly.
2. Designed for distributed processing
Improves performance using parallel execution.
3. Handles diverse data formats
Ideal for real-world big datasets (logs, sensor data, social media).
4. High fault tolerance
Replication ensures no data loss.
5. Very cost-effective
Uses low-cost commodity servers.
6. Integrated Ecosystem
Hive, Pig, Spark, HBase, Sqoop, Oozie → complete analytics platform.
RDBMS is suitable for transactional structured data, but Hadoop’s distributed storage,
parallel processing, scalability, fault tolerance, and cost-efficiency make it ideal for Big Data
environments, especially when data volume, velocity, and variety are extremely high.
5. Explain the major components of the Hadoop ecosystem (HDFS, YARN, MapReduce,
Hive, Pig, HBase). Discuss how each component contributes to solving Big Data
challenges. Critically assess their advantages and limitations.
The Hadoop ecosystem provides a set of tools that together solve the challenges of Big Data
storage, processing, management, and analysis.
Key ecosystem components include HDFS, YARN, MapReduce, Hive, Pig, and HBase.
1. HDFS (Hadoop Distributed File System)
Role in Big Data
Stores huge datasets across many commodity machines.
Splits files into blocks and distributes them across DataNodes.
How HDFS Solves Big Data Challenges
Stores petabytes of data
Fault tolerance via block replication
Supports unstructured & semi-structured data
Diagrammatic Representation: HDFS
Advantages
Highly scalable
High availability
Fault tolerant
Limitations
Not optimized for small files
No random read/write updates
2. YARN (Yet Another Resource Negotiator)
Role in Big Data
Manages cluster resources and schedules jobs.
Allows multiple processing engines (MapReduce, Spark, Hive).
How YARN Solves Big Data Challenges
Improves cluster utilization
Enables multiple workloads on same cluster
Supports real-time, interactive & batch processing
Diagram: YARN Architecture
Advantages
Flexible resource sharing
Supports multiple data engines
Limitations
Complex architecture
Potential bottleneck if Resource Manager overloads
3. MapReduce
Role in Big Data
Programming model for distributed parallel processing.
How MapReduce Solves Big Data Challenges
Parallel execution across nodes
Handles huge, unstructured datasets
Local data processing reduces network load
Diagram: MapReduce Workflow
Advantages
Highly scalable
Fault tolerant
Simple programming model
Limitations
Slow for iterative workloads
High disk I/O overhead
Not ideal for real-time processing
4. Hive
Role in Big Data
A data warehouse tool that converts SQL-like queries (HiveQL) into
MapReduce/Spark jobs.
How Hive Solves Big Data Challenges
Makes big data querying easier using SQL
Ideal for summarization, ad hoc analysis
Handles huge datasets stored in HDFS
Diagram: Hive Architecture
Advantages
SQL-like familiarity
Good for batch analytics
Highly scalable
Limitations
Not suitable for real-time queries
High latency due to MapReduce jobs
5. Pig
Role in Big Data
High-level scripting language (Pig Latin) for data transformation.
How Pig Solves Big Data Challenges
Simplifies complex ETL pipelines
Automatically optimizes execution
Reduces code compared to Java MapReduce
Advantages
Easy to write ETL workflows
Extensible with custom functions
Limitations
Not ideal for complex analytical queries
Less popular after Spark's rise
6. HBase
Role in Big Data
NoSQL column-oriented database built on top of HDFS.
How HBase Solves Big Data Challenges
Supports random read/write
Real-time access to large datasets
Handles high write throughput
Diagram: HBase Structure
Advantages
Real-time access
Scales horizontally
Efficient for sparse datasets
Limitations
Complex setup
Requires tuning for performance
Not suitable for analytical queries
Major Components of Hadoop Ecosystem
Component | Role | Advantages | Limitations
HDFS | Distributed storage | Fault tolerant, scalable | Inefficient for small files
YARN | Resource management | Flexible, multi-engine | Complexity
MapReduce | Batch processing | Scalable, parallel | Slow for iterative tasks
Hive | SQL-based analysis | Easy querying | High latency
Pig | ETL & scripting | Easy pipelines | Not for interactive queries
HBase | NoSQL storage | Real-time read/write | Complex tuning
Hadoop ecosystem effectively addresses Big Data challenges:
Volume: HDFS stores petabytes
Velocity: YARN enables parallel processing
Variety: Hive, Pig, and HBase support diverse formats
Veracity: Replication ensures reliability
Limitations:
MapReduce is slow for real-time analytics
Hive & Pig have high latency
HBase requires expertise
Classic MapReduce-centric stacks are increasingly being replaced by Spark-based architectures
6. Design a complete Hadoop-based Big Data solution for a large e-commerce company.
Include data ingestion, storage, processing, workflow design, and analytics output.
Justify why Hadoop is the ideal choice for this scenario.
High-level goals
Ingest high-volume, high-velocity data (clickstream, transactions, mobile events, logs,
product catalog, images)
Store cost-effectively at scale (TB → PB) with durability and fast access for analytics
Support both batch and near-real-time processing (analytics, recommendations, fraud
detection)
Enable ML model training and serving, dashboarding, ad-hoc SQL analytics
Provide fault tolerance, monitoring, and secure access
1. Architecture (text + diagram)
2. Component-by-component design & roles
Data ingestion
Apache Kafka (primary stream bus) — durable, partitioned, high throughput.
Producers: web/mobile SDKs, API gateway, microservices. Topics: clickstream,
orders, payments, inventory_events.
Apache Flume / Filebeat — ingest application logs, server logs into Kafka or HDFS.
Sqoop — bulk import/export relational DB data (legacy systems, product catalog
snapshots) into HDFS/Hive.
HTTP ingestion / APIs — small events to Kafka via lightweight gateways.
Why: Kafka decouples producers & consumers, supports replay and backpressure.
Storage
HDFS (Raw / Curated zones)
o Raw zone: immutable daily partitions (landing area).
o Curated zone: cleaned, partitioned Parquet/ORC files for analytics.
HBase (for real-time, low-latency reads/writes)
o Use-cases: user session store, shopping-cart state, product catalog lookups,
recent events.
Object Store (S3 / MinIO / HDFS Archive)
o Cold data, snapshots, backups, model artifacts.
Metadata & Catalog: Hive Metastore (tables on top of Parquet/ORC), Apache Atlas
(data lineage).
Processing & compute
YARN for cluster resource management.
Apache Spark (primary compute):
o Batch ETL (Spark SQL), ML training, distributed joins, feature engineering.
o Spark Structured Streaming for micro-batch near-real-time jobs.
Flink / Kafka Streams (optional) for true low-latency streaming (fraud detection, cart-
abandon alerts).
MapReduce retained for legacy batches if needed, but Spark preferred.
Workflow Orchestration & ML lifecycle
Apache Airflow / Oozie for job scheduling, DAG orchestration, data dependencies.
MLflow (or Kubeflow) for model training tracking, registration, deployment.
Kafka Connect for connectors (sink to HDFS, HBase) and CDC.
Querying & Serving
Hive / Presto / Trino for interactive SQL over HDFS (Parquet/ORC).
Druid / Pinot for OLAP, low-latency analytics (dashboarding on clickstream).
HBase / Phoenix for random reads; Phoenix adds SQL layer over HBase.
REST API / Model Server (TF-Serving or custom microservices) to serve ML
recommendations.
Security, monitoring, ops
Kerberos for authentication.
Ranger / Sentry for fine-grained authorization and audit.
SSL/TLS for Kafka/HTTP.
Ambari / Cloudera Manager or Kubernetes operators for cluster provisioning and
monitoring.
Prometheus + Grafana for metrics, Elasticsearch + Kibana for logs.
Backup and replication to object store.
3. Example pipelines (concrete)
A. Batch analytics pipeline (daily)
1. Nightly snapshots: Sqoop brings RDBMS product & user tables to HDFS raw zone.
2. Spark job (Airflow DAG) transforms raw JSON logs → cleans → enriches with user
profile → writes Parquet to curated zone (partitioned by date/country).
3. Hive/Presto ad-hoc queries and BI dashboards read curated Parquet (fast columnar
scans).
4. ML training job (Spark) runs on curated feature tables; model registered in MLflow.
B. Near-real-time personalization pipeline
1. Click events → Kafka topic clickstream (producer from web/mobile).
2. Spark Structured Streaming or Flink reads Kafka, does sessionization, aggregates
rolling features (last-5-min click counts).
3. Aggregates written to HBase (fast lookups) and to Druid (OLAP).
4. Recommendation service queries HBase to serve personalized product
recommendations with sub-100ms latency.
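The "rolling features" in Step 2 can be sketched in pure Python (illustrative only; a streaming engine like Spark Structured Streaming or Flink computes this at scale and in parallel, but the windowing logic looks like this):

```python
# Sketch of a rolling "last-5-min click count" feature, the kind of
# sessionized aggregate the streaming job would write to HBase.
from collections import deque

WINDOW_SECONDS = 5 * 60

class RollingClickCounter:
    def __init__(self):
        self.events = deque()   # (timestamp, user_id), oldest first

    def record(self, ts, user_id):
        self.events.append((ts, user_id))

    def count(self, user_id, now):
        # Evict events that fell out of the window, then count the user's.
        while self.events and self.events[0][0] <= now - WINDOW_SECONDS:
            self.events.popleft()
        return sum(1 for _, u in self.events if u == user_id)

c = RollingClickCounter()
c.record(0, "u1"); c.record(100, "u1"); c.record(200, "u2")
print(c.count("u1", now=250))   # 2 (both u1 clicks are inside the window)
print(c.count("u1", now=350))   # 1 (the click at t=0 has expired)
```

The recommendation service then reads these pre-computed counts from HBase instead of scanning raw logs, which is what makes sub-100ms serving feasible.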
C. Real-time fraud detection
1. Payment events → Kafka.
2. Flink job applies rules + ML online model for scoring → if score > threshold, write to
alerts topic and trigger investigation workflow + block transaction.
4. Diagram — Dataflow (end-to-end)
5. Why Hadoop is an ideal choice for this e-commerce scenario (justification)
1. Scalability (horizontal): HDFS + commodity nodes handle TBs→PBs easily as
business grows.
2. Cost-effectiveness: Stores massive historical data cheaply vs high-cost transactional
DBs.
3. Flexibility of data types: Can store clickstreams, images, videos, logs, JSON events
— vital for e-commerce variety.
4. Rich ecosystem: Hive/Presto (SQL), Spark (ML), HBase (low-latency lookups),
Kafka integration — everything needed in one stack.
5. Fault tolerance & durability: HDFS replication + Kafka durability provide robust
guarantees for mission-critical data.
6. Separation of concerns: Ingestion (Kafka), storage (HDFS), compute (Spark/YARN),
serving (HBase/Druid) — components scale & evolve independently.
7. Mature tooling for analytics & governance: Hive Metastore, Atlas, Ranger, MLflow
enable enterprise workflows, lineage, security.
6. Advantages vs Limitations (critical assessment)
Advantages
Handles volume, velocity, variety (3Vs).
Good for both historical (deep analytics) and near-real-time use cases (with
Spark/Flink).
Mature ecosystem with a wide community and connectors.
Cost-effective long-term storage and batch processing.
Limitations / tradeoffs
Latency: HDFS + batch jobs not for strict sub-second workloads — must use
HBase/Druid and streaming engines for that.
Operational complexity: Running & tuning a Hadoop cluster (HDFS, YARN, HBase,
Kafka) needs platform engineering expertise.
Small files problem: HDFS inefficient for massive numbers of tiny files — use
compacting/sequence files or object store.
Resource contention: Need careful YARN scheduling or move to Kubernetes for
multi-tenant isolation.
Ecosystem fragmentation: Many choices (Spark vs Flink, Druid vs Pinot) — needs
clear standards to avoid sprawl.
7. Non-functional / ops considerations
Use tiered storage: hot (HBase/Druid), warm (Parquet in HDFS), cold (S3/MinIO).
Establish data retention policies and lifecycle rules (ingest → curate → archive).
Implement CI/CD for data pipelines (Airflow + GitOps).
Capacity planning: monitor partitions, Kafka lag, HDFS disk usage.
Disaster recovery: cross-region replication for critical topics/data.
8. Quick checklist / recommended technologies
Ingest: Kafka, Flume, Sqoop
Storage: HDFS (Parquet/ORC), HBase, S3 for archive
Compute: Spark (batch & structured streaming), Flink for low-latency
Resource mgmt: YARN or Kubernetes (via Spark-on-K8s)
Query: Hive, Presto/Trino, Druid/Pinot
Orchestration: Airflow (or Oozie)
ML lifecycle: MLflow / Kubeflow
Security: Kerberos, Ranger, TLS
Monitoring: Prometheus + Grafana, ELK
This Hadoop-based architecture gives the e-commerce company a scalable, flexible, and
enterprise-ready platform to ingest and store massive volumes of varied data, run large-scale
analytics and ML, and serve both batch and near-real-time use cases. Use HBase/Druid/Flink
where low latency is required, Spark for heavy analytics and ML, and HDFS/Parquet for
cost-efficient storage and historical analysis.
7. Compare MapReduce processing with traditional batch processing systems. Analyze
the architecture of MapReduce, explaining the roles of Mapper, Reducer, Combiner,
and Partitioner with examples.
MapReduce is a distributed data-processing framework designed to handle massive datasets
by dividing work across clusters of commodity machines. Traditional batch processing
systems (like mainframes or standalone servers) normally process data sequentially or on
limited hardware.
1. Comparison: MapReduce vs. Traditional Batch Processing
Feature | MapReduce | Traditional Batch Processing
Processing Model | Distributed, parallel processing | Sequential or limited parallelism
Hardware | Commodity clusters | High-end proprietary systems
Scalability | Horizontal scaling (add more nodes) | Vertical scaling (upgrade a single machine)
Fault Tolerance | Automatic via replication & retries | Limited; failures often halt the entire batch
Data Location | "Move computation to data" (local processing) | "Move data to computation" (high network I/O)
Data Size | Suitable for Big Data (TB–PB range) | Moderate-sized data
Cost | Low (commodity hardware) | High (mainframes/high-end servers)
MapReduce is ideal for Big Data because it distributes workload, minimizes network traffic,
and offers built-in fault tolerance.
2. Architecture of MapReduce
MapReduce consists of four major components:
1. Mapper
2. Reducer
3. Combiner (optional)
4. Partitioner
Below is the architecture with explanation and diagrams.
(A) Mapper
Role:
Processes input data (split into chunks).
Transforms each record into key–value pairs.
Example (Word Count):
Input:
Big data is big data
Mapper output:
(big,1)
(data,1)
(is,1)
(big,1)
(data,1)
Diagram: Mapper Phase
(B) Combiner (Local Aggregator)
Role:
Minimizes data transferred during shuffle.
Performs local reduce on mapper output.
Example (Word Count):
Mapper output:
(big,1), (big,1), (data,1)
Combiner output:
(big,2), (data,1)
Diagram: Combiner
Mapper Output → Combiner → Compressed Output
(C) Partitioner
Role:
Decides which reducer receives which key.
Ensures same keys go to the same reducer.
Default Function:
partition = hash(key) % number_of_reducers
Example
If reducers = 2:
Keys "a, c, e" → Reducer 1
Keys "b, d, f" → Reducer 2
Diagram: Partitioning
Mapper Output → Partitioner → Reducer Buckets
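The default partition function above can be demonstrated directly (one substitution: Hadoop's HashPartitioner uses the Java key's hashCode(), while this sketch uses CRC32 so the result is deterministic across Python runs):

```python
# Demonstrating partition = hash(key) % number_of_reducers.
# CRC32 stands in for Java's hashCode() to keep the demo deterministic.
import zlib

def partition(key, num_reducers):
    """Reducer index for a given key; same key always maps to same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

for k in ["apple", "banana", "orange"]:
    print(k, "-> reducer", partition(k, 2))
# Every occurrence of the same key lands on the same reducer, which is
# what lets the reducer see ALL values for that key grouped together.
```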
(D) Reducer
Role:
Receives grouped values for each key.
Produces final aggregated output.
Example (Word Count):
Input to reducer:
(big, [1,1])
(data, [1,1])
(is, [1])
Reducer Output:
(big,2)
(data,2)
(is,1)
Diagram: Reducer Phase
3. Full MapReduce Architecture (End-to-End Diagram)
4. Example Use Case: Word Count
Input
apple banana apple orange
banana apple
Mapper Output
(apple,1) (banana,1) (apple,1) (orange,1)
(banana,1) (apple,1)
Combiner
Mapper 1: (apple,2) (banana,1) (orange,1)
Mapper 2: (banana,1) (apple,1)
Partitioner
apple → Reducer 1
banana → Reducer 2
orange → Reducer 1
Reducer Output
(apple,3)
(banana,2)
(orange,1)
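The four phases of this word-count example can be run as one script (an illustrative single-process simulation; in real MapReduce each phase runs distributed across nodes, and the partition function here is a deterministic stand-in for Java's hashCode()):

```python
# Single-process simulation of the word-count pipeline above:
# map each split -> combine locally -> partition by key -> reduce.
from collections import Counter, defaultdict

def map_phase(split):
    """Mapper: emit (word, 1) for every word in the input split."""
    return [(w, 1) for w in split.split()]

def combine(pairs):
    """Combiner: local aggregation per mapper, before the shuffle."""
    return list(Counter(w for w, _ in pairs).items())

def hash_partition(key, n):
    return sum(key.encode()) % n   # deterministic stand-in for hashCode()

def shuffle(all_pairs, num_reducers):
    """Partitioner + shuffle: route each key to its reducer's bucket."""
    buckets = defaultdict(list)
    for w, n in all_pairs:
        buckets[hash_partition(w, num_reducers)].append((w, n))
    return buckets

def reduce_phase(pairs):
    """Reducer: sum the counts for each key in this bucket."""
    out = defaultdict(int)
    for w, n in pairs:
        out[w] += n
    return dict(out)

splits = ["apple banana apple orange", "banana apple"]
combined = [p for s in splits for p in combine(map_phase(s))]
buckets = shuffle(combined, num_reducers=2)
result = {}
for b in buckets.values():
    result.update(reduce_phase(b))
print(sorted(result.items()))   # [('apple', 3), ('banana', 2), ('orange', 1)]
```

Note how the combiner shrinks (apple,1)(apple,1) to (apple,2) before the shuffle; that is exactly the network-traffic saving described above.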
5. Why MapReduce Is Better Than Traditional Batch Processing
Handles Huge Datasets Efficiently
Distributed processing allows scaling to terabytes/petabytes.
Fault Tolerant
Tasks are re-executed automatically if a node fails.
High Throughput
Many mappers and reducers work simultaneously.
Cost-Effective
Runs on commodity hardware clusters.
Processes Data Locally
Reduces network overhead by processing data where it is stored.
Automatic Load Balancing
System distributes tasks based on cluster capacity.
MapReduce is a powerful distributed batch processing model built for Big Data. Its
components—Mapper, Combiner, Partitioner, and Reducer—work together to achieve
parallelism, fault tolerance, and high scalability, making it superior to traditional systems.
8. Critically evaluate the challenges in distributed computing and explain how Hadoop
overcomes them. Discuss issues such as hardware failures, data replication, cluster
management, and parallel processing. Provide suitable illustrations.
Distributed computing involves processing data across multiple connected machines. While it
enables scalability and parallelism, it also introduces several challenges. Hadoop was
specifically designed to address these challenges through its core components: HDFS, YARN,
MapReduce, and Hadoop ecosystem tools.
1. Challenge: Hardware Failures
Problem in Traditional Distributed Systems
Commodity hardware is prone to frequent failures.
Failure of a single node can halt the entire job.
Manual recovery is time-consuming and risky.
How Hadoop Overcomes It (HDFS + MapReduce)
Automatic failure detection
Data replication ensures data availability
Failed tasks are re-executed automatically
The NameNode tracks block metadata, while the Secondary NameNode periodically checkpoints it (it is not a hot standby)
Diagram: Replication in HDFS
Critical Evaluation
Strong fault tolerance
Data loss is extremely unlikely thanks to replication
But replication multiplies storage cost (typically 3×)
2. Challenge: Data Replication & Consistency
Problem in Distributed Systems
Ensuring consistent data across multiple nodes is difficult.
Replication mechanisms require coordination overhead.
How Hadoop Overcomes It
HDFS replicates each block (default: 3 copies)
Metadata stored in NameNode ensures consistent mapping
Heartbeats ensure replicas are always alive
Critical Evaluation
Strong consistency model for large files
But not ideal for frequent updates (HDFS is write-once-read-many)
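The default replication factor discussed above is a cluster-wide HDFS setting. A minimal hdfs-site.xml sketch using the standard dfs.replication property (3 is already the stock default; this just makes it explicit):

```xml
<!-- hdfs-site.xml: default block replication factor.
     Each block is stored on this many DataNodes. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```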
3. Challenge: Cluster Management
Problem in Distributed Systems
Managing hundreds or thousands of nodes is complex.
Resource allocation, job scheduling, and node coordination are difficult.
Requires monitoring of CPU, memory, and I/O across machines.
How Hadoop Overcomes It (YARN)
YARN provides:
Centralized resource manager
Distributed node managers
Dynamic job scheduling
Automatic reallocation of resources
Diagram: YARN Cluster Management
Critical Evaluation
Highly scalable
Supports multiple engines (MR, Spark, Tez)
But Resource Manager is a single point of failure unless made highly available
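The ResourceManager single point of failure noted above is addressed by YARN's high-availability mode. A minimal yarn-site.xml sketch (illustrative only; a complete HA setup also requires ZooKeeper coordination and per-RM hostname settings, which are omitted here):

```xml
<!-- yarn-site.xml: enable ResourceManager high availability with
     two RM instances so the RM is no longer a single point of failure. -->
<property>
  <name>yarn.resourcemanager.ha.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.resourcemanager.ha.rm-ids</name>
  <value>rm1,rm2</value>
</property>
```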
4. Challenge: Parallel Processing & Load Balancing
Problem in Traditional Distributed Systems
Difficult to divide data evenly across nodes.
Parallelizing tasks requires manual configuration and tuning.
Slow nodes (stragglers) delay overall job completion.
How Hadoop Overcomes It (MapReduce)
Splits datasets into blocks automatically
Assigns map tasks near the data
Automatically retries slow/failing tasks
Shuffles and sorts data for reducers
Diagram: Parallel MapReduce Execution
Critical Evaluation
Enables massive parallelism
Automatic handling of failures
But MapReduce has high disk I/O → slow for iterative algorithms
5. Additional Challenges & How Hadoop Solves Them
A. Scalability Issues
Traditional systems scale vertically (upgrade hardware).
Hadoop scales horizontally (add more nodes).
Linearly scalable
Commodity hardware reduces cost
B. Data Diversity (Structured, Semi-structured, Unstructured)
Traditional RDBMS cannot handle flexible schemas.
Hadoop handles all types of data (logs, text, images, JSON, videos).
Schema-on-read
Flexible data handling
C. Network Bottlenecks
Traditional systems move data to computation.
Hadoop moves computation to data (local processing).
Reduces network load
Improves speed
6. Consolidated Diagram: How Hadoop Overcomes Distributed Computing Challenges
Final Critical Evaluation Summary
Challenge           | How Hadoop Solves It                | Remaining Limitations
Hardware failures   | Replication, task re-execution      | Storage overhead
Data replication    | 3× block copies; NameNode metadata  | Not suited for small files
Cluster management  | YARN controls resources             | Single point of failure in RM
Parallel processing | MapReduce automation                | High I/O latency
Scalability         | Horizontal scaling                  | Requires large cluster setup
Data variety        | Supports all formats                | Needs tools (Hive/Pig) for querying
Hadoop provides a robust, fault-tolerant, highly scalable, parallel processing environment
that solves the inherent challenges of distributed computing. Through HDFS, YARN, and
MapReduce, Hadoop ensures reliable data handling, efficient cluster management, and
powerful parallel processing — making it an ideal platform for Big Data environments.