Define Big Data
Big Data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that grow exponentially over time, characterized by the three Vs: volume, velocity, and variety. It encompasses datasets too complex for traditional processing tools, enabling advanced analytics for insights in business and science.
Data Classification
Data is classified into three categories based on its degree of organization and how easily it can be analyzed.
Structured data: Highly organized with predefined schemas, stored in relational databases
like SQL tables with rows and columns; examples include inventory databases, financial
transactions, and Excel spreadsheets for easy querying. [1] [2]
Unstructured data: Lacks predefined format, comprising 80-90% of Big Data; examples
include social media posts, images, videos, emails, and audio files stored in data lakes or
NoSQL systems. [1] [3]
Semi-structured data: Hybrid with some organization, such as tags or markers, but no rigid schema; examples include JSON/XML files, emails (with headers and body), and geo-tagged images for flexible analysis. [1] [4]
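The classification above can be illustrated in a few lines of Python (the sample records are invented): semi-structured JSON tolerates missing or extra fields per record, whereas a relational table would force every row into the same columns.

```python
import json

# Two semi-structured records in the same collection: no rigid schema
# forces them to share identical fields.
raw = """
[
  {"id": 1, "name": "Alice", "tags": ["admin", "beta"]},
  {"id": 2, "name": "Bob", "email": "bob@example.com"}
]
"""

records = json.loads(raw)
for rec in records:
    # Schema-on-read: each consumer decides which fields it needs,
    # supplying defaults for fields a record may lack.
    tags = rec.get("tags", [])
    email = rec.get("email", "unknown")
    print(rec["id"], rec["name"], tags, email)
```

This "schema-on-read" access pattern is exactly what data lakes and NoSQL stores rely on for unstructured and semi-structured data.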
Differences Between Traditional BI and Big Data Analytics
Traditional Business Intelligence (BI) focuses on historical, structured data from internal sources
using SQL for reporting, while Big Data Analytics handles massive, varied real-time data from
diverse sources like IoT and social media with tools like Hadoop for predictive insights.
| Aspect | Traditional BI | Big Data Analytics |
| --- | --- | --- |
| Data Sources | Internal, structured (e.g., ERP systems) | Diverse, including external unstructured (e.g., social media, sensors) [5] |
| Data Volume | Small to medium | Extremely large (petabytes) [5] |
| Processing | Batch, SQL-based | Real-time, distributed (e.g., MapReduce) [5] |
| Scale | Centralized, vertical | Distributed, horizontal [5] |
| Focus | Descriptive/historical analysis | Predictive/real-time [5] |
| Tools | Dashboards, OLAP cubes | Hadoop, Spark, NoSQL [6] |
What is NoSQL
NoSQL databases are non-relational systems designed for scalability and flexibility in handling
unstructured or semi-structured data, avoiding fixed schemas unlike traditional SQL databases.
[7]
Types of NoSQL Databases
Key-value stores: Simple pairs for fast lookups, e.g., Redis, Amazon DynamoDB; ideal for
caching. [7] [8]
Document stores: JSON/BSON documents with nested structures, e.g., MongoDB,
Couchbase; suits semi-structured data. [7] [8]
Wide-column stores: Flexible columns per row, e.g., Cassandra, HBase; for sparse data like
logs. [7] [8]
Graph stores: Nodes and edges for relationships, e.g., Neo4j; for social networks. [7] [8]
Multi-model databases combine types for broader use. [7]
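The types above differ mainly in how values are modeled. As a minimal sketch, a toy in-memory key-value store (a stand-in for Redis-style lookups, not a real client) shows the simplest of the four models:

```python
class KeyValueStore:
    """Toy key-value store: O(1) get/put, the access pattern key-value
    databases such as Redis optimize for."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

# Typical caching use: the key is a lookup string, the value is opaque
# to the store itself (here, a dict acting as a session record).
cache = KeyValueStore()
cache.put("session:42", {"user": "alice", "ttl": 300})
print(cache.get("session:42"))
```

Document, wide-column, and graph stores trade this simplicity for richer value structure (nested documents, sparse columns, or traversable edges).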
Challenges Associated with Big Data
Big Data challenges include managing massive volumes requiring scalable storage like cloud
solutions, handling diverse data types via integration tools, and ensuring real-time velocity with
stream processing. Additional issues involve data quality (veracity) through governance,
security/privacy via encryption, and integration across silos using ETL pipelines. [9] [10]
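The integration challenge above is typically met with ETL pipelines. A minimal sketch of the three stages (field names and silo contents are invented):

```python
# Extract: pull raw rows from two hypothetical silos (inlined here).
crm_rows = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "BOB"}]
web_rows = [{"id": 2, "clicks": 17}, {"id": 3, "clicks": 5}]

# Transform: clean values (a veracity/data-quality fix) and join the
# silos on a shared key.
def transform(crm, web):
    clicks_by_id = {r["id"]: r["clicks"] for r in web}
    for row in crm:
        yield {
            "id": row["id"],
            "name": row["name"].strip().title(),
            "clicks": clicks_by_id.get(row["id"], 0),
        }

# Load: write the unified records to the target store (here, a list).
warehouse = list(transform(crm_rows, web_rows))
print(warehouse)
```

Production pipelines add schema validation, incremental loads, and error handling, but the extract → transform → load shape is the same.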
What is Hadoop
Hadoop is an open-source framework for distributed storage and processing of large datasets
across commodity hardware clusters, emphasizing fault tolerance and scalability. [11]
Hadoop Components
HDFS (Hadoop Distributed File System): Handles large-scale data storage by distributing
files into blocks across nodes. [11]
YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules jobs.
[11]
MapReduce: Processes data in parallel via map and reduce phases. [11]
Hadoop Common: Provides shared libraries and utilities. [11]
MapReduce Concept and High-Level Architecture
MapReduce is a programming model for parallel processing: the Map phase breaks input data
into key-value pairs, while the Reduce phase aggregates them for output, enabling efficient
handling of large datasets. [12]
High-level architecture involves a Client submitting jobs to the JobTracker (master), which splits tasks and assigns them to TaskTrackers (slaves) on the nodes. Data flows: Input splits → Map tasks (process locally) → Shuffle/Sort (group by key) → Reduce tasks → Output to HDFS. The diagram shows Client → JobTracker → TaskTrackers (Map/Reduce), with data locality minimizing network transfer. (In Hadoop 2 and later, YARN's ResourceManager and NodeManagers take over the JobTracker/TaskTracker roles.) [12]
Anatomy of File Write in HDFS
HDFS file write involves the client requesting block allocation from NameNode, which provides
DataNode addresses, then pipelining data to nodes for replication. [13]
Process: Client calls create() → NameNode checks permissions and allocates blocks → Client
writes to first DataNode, which replicates to others (e.g., replication factor 3) via
DFSOutputStream queues (data and ack). Acknowledgments flow back; on failure, pipeline
adjusts. Diagram: Client → NameNode (metadata) → DataNode1 → DataNode2 → DataNode3,
with blocks replicated sequentially. [13] [14]
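The write path above can be mimicked in a few lines: split a file into fixed-size blocks (the NameNode's view of a file) and assign each block a replication pipeline of DataNodes. Block size, node names, and the round-robin placement are illustrative only; real HDFS defaults to 128 MB blocks and uses rack-aware placement.

```python
BLOCK_SIZE = 4          # bytes, tiny for illustration; HDFS default is 128 MB
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """NameNode-side view: a file is a sequence of fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def allocate_pipeline(block_index, nodes=DATANODES, replication=REPLICATION):
    """Pick `replication` DataNodes for one block (toy round-robin;
    real HDFS placement is rack-aware)."""
    start = block_index % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]

data = b"hello hdfs!"
blocks = split_into_blocks(data)
for i, block in enumerate(blocks):
    # The client writes each block to the first node in its pipeline,
    # which forwards it to the next, and so on; acks flow back.
    print(i, block, "->", allocate_pipeline(i))
```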
Hadoop MapReduce Flow for Word Count Program
In MapReduce word count, input text splits into lines; Mapper tokenizes words emitting (word, 1)
pairs, Shuffler sorts/groups by word, Reducer sums counts outputting (word, total). [15]
Flowchart: Input files → Splitter (lines) → Mapper (word → (word,1)) → Partitioner/Sort (group keys) → Combiner (optional local reduce) → Reducer (sum values) → Output files. For example, "hello world hello" → Mapper: (hello,1), (world,1), (hello,1); after shuffling, the Reducer sums: (hello,2), (world,1). [15] [16]
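The flow above maps directly onto plain Python: the mapper emits (word, 1) pairs, the shuffle groups them by key, and the reducer sums each group. A single-process sketch of the same stages:

```python
from collections import defaultdict

def mapper(line):
    # Map: tokenize a line and emit (word, 1) for each token.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/Sort: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Reduce: aggregate each key's values into a single total.
    return (word, sum(counts))

lines = ["hello world", "hello hadoop"]
pairs = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(pairs)
result = dict(reducer(w, c) for w, c in grouped.items())
print(result)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

In real Hadoop the same three functions run distributed: mappers on the nodes holding the input splits, the shuffle over the network, and reducers writing their output to HDFS.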
Key Terminologies in Big Data Environment
Key terms include Hadoop (distributed framework), MapReduce (parallel processing model),
NoSQL (non-relational databases), Data Lake (raw data repository), and ETL (Extract-
Transform-Load for pipelines). [17] [18]
Components of Hadoop Ecosystem for Data Analysis
Hadoop ecosystem includes HDFS for storage, YARN for resource management,
MapReduce/Spark for processing, Hive/Pig for querying (SQL-like), and HBase for real-time
access. [11] [19]
Difference Between HDFS and RDBMS
HDFS is a distributed file system for large-scale, fault-tolerant storage of any data type on
commodity hardware, while RDBMS uses structured tables with SQL for ACID-compliant
transactions on smaller datasets. [20] [21]
| Aspect | HDFS | RDBMS |
| --- | --- | --- |
| Data Type | Structured/unstructured/semi-structured | Primarily structured [20] |
| Scalability | Horizontal (add nodes) | Vertical (upgrade hardware) [21] |
| Cost | Low (commodity hardware) | High (licenses/hardware) [21] |
| Access Pattern | Batch read/write, high throughput | Random access, transactions [20] |
| Schema | Schema-on-read | Fixed schema-on-write [21] |
Differences Between Traditional DBMS and NoSQL
Traditional DBMS (RDBMS) enforces fixed schemas and ACID for consistent, relational data,
whereas NoSQL offers flexible schemas and BASE for scalable, distributed handling of varied
data. [22] [23]
| Aspect | Traditional DBMS (RDBMS) | NoSQL |
| --- | --- | --- |
| Schema | Fixed, predefined | Dynamic, flexible [22] |
| Scalability | Vertical | Horizontal (clusters) [23] |
| Consistency | ACID (strong) | BASE (eventual) [22] |
| Query Language | SQL | Varied (e.g., API-based) [23] |
| Data Model | Tables with relations | Documents, key-value, graphs [23] |
MongoDB Overview
MongoDB is a document-oriented NoSQL database storing data in flexible JSON-like BSON
documents, supporting dynamic schemas for scalable applications like content management.
[24]
Data Types in MongoDB
String: UTF-8 text, e.g., names. [24] [25]
Integer: 32/64-bit whole numbers, e.g., ages. [24] [25]
Double: Floating-point, e.g., prices. [25]
Boolean: True/false values. [25]
Array: Lists, e.g., tags. [25]
Object: Embedded documents. [25]
Date/Timestamp: Time records. [25]
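Since BSON documents map naturally onto Python dicts, the types above can be shown in a single hypothetical document (field names are invented; with a real driver such as pymongo, it would be inserted via `collection.insert_one(product)`):

```python
from datetime import datetime, timezone

# One document exercising the common BSON types listed above.
product = {
    "name": "Laptop",                      # String
    "stock": 25,                           # Integer (32/64-bit in BSON)
    "price": 999.99,                       # Double
    "in_stock": True,                      # Boolean
    "tags": ["electronics", "portable"],   # Array
    "specs": {"ram_gb": 16, "cpu": "M3"},  # Object (embedded document)
    "added": datetime.now(timezone.utc),   # Date
}

for field, value in product.items():
    print(field, type(value).__name__)
```

Because the schema is dynamic, a second document in the same collection could omit `specs` or add new fields without any migration.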
Components (Daemons) of MapReduce Framework
MapReduce daemons include JobTracker (master: job scheduling, resource allocation) and
TaskTrackers (slaves: execute map/reduce tasks on nodes). [26] [12]
Diagram: Client submits to JobTracker → Splits job → Assigns to TaskTrackers (one per node)
for Map/Reduce execution, monitoring health and reassigning failures; shows master-slave
hierarchy with data flow. [26]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
7. [Link]
8. [Link]
9. [Link]
10. [Link]
11. [Link]
12. [Link]
13. [Link]
14. [Link]
15. [Link]
16. [Link]
17. [Link]
18. [Link]
19. [Link]
20. [Link]
21. [Link]
22. [Link]
23. [Link]
24. [Link]
25. [Link]
26. [Link]
27. [Link]
28. [Link]
29. [Link]
30. [Link]
31. [Link]
32. [Link]
33. [Link]
34. [Link]
35. [Link]
36. [Link]
37. [Link]
38. [Link]
39. [Link]
40. [Link]
41. [Link]
42. [Link]
43. [Link]
44. [Link]
45. [Link]
46. [Link]
47. [Link]
48. [Link]
49. [Link]
50. [Link]
51. [Link]
52. [Link]
53. [Link]
54. [Link]
55. [Link]
56. [Link]
57. [Link]
58. [Link]
59. [Link]
60. [Link]
61. [Link]
62. [Link]