Understanding Big Data Analytics

Big Data refers to vast collections of structured, semi-structured, and unstructured data characterized by volume, velocity, and variety, requiring advanced analytics to extract insights. It differs from traditional Business Intelligence in its data sources, volume, and processing methods. NoSQL databases, designed for scalability and flexibility, handle these diverse data types, while Hadoop provides a framework for distributed storage and processing of large datasets.


Define Big Data

Big Data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that grow exponentially over time, characterized by the three Vs: volume, velocity, and variety. It encompasses datasets too complex for traditional processing tools, enabling advanced analytics for insights in business and science.

Data Classification
Data is classified into three categories based on organization and analysis ease.
Structured data: Highly organized with predefined schemas, stored in relational databases such as SQL tables with rows and columns; examples include inventory databases, financial transactions, and Excel spreadsheets, all of which are easy to query. [1] [2]
Unstructured data: Lacks a predefined format and makes up an estimated 80-90% of Big Data; examples include social media posts, images, videos, emails, and audio files, typically stored in data lakes or NoSQL systems. [1] [3]
Semi-structured data: A hybrid with some organization, such as tags or markers, but no rigid schema; examples include JSON/XML files, emails (structured headers plus free-text body), and geo-tagged images. [1] [4]
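The semi-structured case is easiest to see with a small JSON record: tags and keys give partial organization, but a second record could add or omit fields without breaking anything. A minimal Python sketch (the field names here are invented for illustration):

```python
import json

# A semi-structured record: tagged fields, but no rigid schema.
record = json.loads("""
{
  "user": "alice",
  "posted": "2024-05-01",
  "tags": ["big-data", "nosql"],
  "geo": {"lat": 48.85, "lon": 2.35}
}
""")

# Fields are addressed by key, not by a fixed column position,
# so structure is interpreted at read time rather than enforced at write time.
print(record["user"], len(record["tags"]))
```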

Differences Between Traditional BI and Big Data Analytics


Traditional Business Intelligence (BI) focuses on historical, structured data from internal sources
using SQL for reporting, while Big Data Analytics handles massive, varied real-time data from
diverse sources like IoT and social media with tools like Hadoop for predictive insights.

Aspect | Traditional BI | Big Data Analytics
------ | -------------- | ------------------
Data Sources | Internal, structured (e.g., ERP systems) | Diverse, including external unstructured (e.g., social media, sensors) [5]
Data Volume | Small to medium | Extremely large (petabytes) [5]
Processing | Batch, SQL-based | Real-time, distributed (e.g., MapReduce) [5]
Scale | Centralized, vertical | Distributed, horizontal [5]
Focus | Descriptive/historical analysis | Predictive/real-time [5]
Tools | Dashboards, OLAP cubes | Hadoop, Spark, NoSQL [6]


What is NoSQL
NoSQL databases are non-relational systems designed for scalability and flexibility in handling unstructured or semi-structured data, avoiding the fixed schemas of traditional SQL databases. [7]

Types of NoSQL Databases


Key-value stores: Simple pairs for fast lookups, e.g., Redis, Amazon DynamoDB; ideal for
caching. [7] [8]
Document stores: JSON/BSON documents with nested structures, e.g., MongoDB,
Couchbase; suits semi-structured data. [7] [8]
Wide-column stores: Flexible columns per row, e.g., Cassandra, HBase; for sparse data like
logs. [7] [8]
Graph stores: Nodes and edges for relationships, e.g., Neo4j; for social networks. [7] [8]
Multi-model databases combine types for broader use. [7]
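The key-value model above can be sketched with an in-memory class acting as a toy store. This is only an illustration of the interface shape, not how Redis or DynamoDB are implemented; real systems add persistence, distribution, and expiry:

```python
class ToyKeyValueStore:
    """Minimal in-memory key-value store: O(1) get/put by key,
    no schema -- values can be any Python object."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

# Typical caching use: look up a session by an opaque string key.
cache = ToyKeyValueStore()
cache.put("session:42", {"user": "bhargav", "ttl": 300})
print(cache.get("session:42")["user"])
```

The appeal of this model is exactly its simplicity: with no schema or joins, each key can live on any node, which is what makes horizontal scaling straightforward.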

Challenges Associated with Big Data


Big Data challenges include managing massive volumes requiring scalable storage like cloud
solutions, handling diverse data types via integration tools, and ensuring real-time velocity with
stream processing. Additional issues involve data quality (veracity) through governance,
security/privacy via encryption, and integration across silos using ETL pipelines. [9] [10]

What is Hadoop
Hadoop is an open-source framework for distributed storage and processing of large datasets
across commodity hardware clusters, emphasizing fault tolerance and scalability. [11]

Hadoop Components
HDFS (Hadoop Distributed File System): Handles large-scale data storage by distributing
files into blocks across nodes. [11]
YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules jobs. [11]

MapReduce: Processes data in parallel via map and reduce phases. [11]
Hadoop Common: Provides shared libraries and utilities. [11]

MapReduce Concept and High-Level Architecture


MapReduce is a programming model for parallel processing: the Map phase breaks input data
into key-value pairs, while the Reduce phase aggregates them for output, enabling efficient
handling of large datasets. [12]
High-level architecture involves a Client submitting jobs to the JobTracker (master), which splits
tasks and assigns to TaskTrackers (slaves) on nodes. Data flows: Input splits → Map tasks
(process locally) → Shuffle/Sort (group by key) → Reduce tasks → Output to HDFS. The
diagram shows Client → JobTracker → TaskTrackers (Map/Reduce), with data locality
minimizing network transfer. [12]

Anatomy of File Write in HDFS


HDFS file write involves the client requesting block allocation from NameNode, which provides
DataNode addresses, then pipelining data to nodes for replication. [13]
Process: Client calls create() → NameNode checks permissions and allocates blocks → Client
writes to first DataNode, which replicates to others (e.g., replication factor 3) via
DFSOutputStream queues (data and ack). Acknowledgments flow back; on failure, pipeline
adjusts. Diagram: Client → NameNode (metadata) → DataNode1 → DataNode2 → DataNode3,
with blocks replicated sequentially. [13] [14]
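The pipelined replication described above can be simulated in a few lines. This is a deliberately simplified sketch with invented names; real HDFS streams packets through `DFSOutputStream` with separate data and ack queues, which this omits:

```python
def write_block(block, datanodes, replication=3):
    """Simulate an HDFS pipelined write: the client sends the block to the
    first DataNode, which forwards it down the pipeline until the
    replication factor is met. Returns the names of nodes holding a replica."""
    pipeline = datanodes[:replication]   # NameNode would choose these nodes
    stored = []
    for node in pipeline:                # each node stores, then forwards
        node["blocks"].append(block)
        stored.append(node["name"])
    return stored

# Four DataNodes; only the first three end up in the write pipeline.
nodes = [{"name": f"dn{i}", "blocks": []} for i in range(1, 5)]
replicas = write_block("blk_0001", nodes)
print(replicas)  # ['dn1', 'dn2', 'dn3']
```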

Hadoop MapReduce Flow for Word Count Program


In MapReduce word count, input text is split into lines; the Mapper tokenizes each line and emits (word, 1) pairs, the shuffle phase sorts and groups the pairs by word, and the Reducer sums the counts, outputting (word, total). [15]
Flowchart: Input files → Splitter (lines) → Mapper (word → (word,1)) → Partitioner/Sort (group
keys) → Combiner (optional local reduce) → Reducer (sum values) → Output files. For example,
"hello world" → Mapper: (hello,1), (world,1); Reduce: (hello,1), (world,1). [15] [16]
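The flow above can be reproduced in plain Python, with each phase kept as a separate function to mirror the MapReduce stages. Hadoop would run mappers and reducers in parallel on different nodes; here everything runs locally as a sketch:

```python
from collections import defaultdict

def mapper(line):
    # Map phase: tokenize the line and emit (word, 1) pairs.
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    # Shuffle/Sort phase: group all values by key (word).
    groups = defaultdict(list)
    for word, count in pairs:
        groups[word].append(count)
    return groups

def reducer(word, counts):
    # Reduce phase: sum the counts for each word.
    return word, sum(counts)

lines = ["hello world", "hello big data"]
pairs = [p for line in lines for p in mapper(line)]
result = dict(reducer(w, c) for w, c in shuffle(pairs).items())
print(result)  # {'hello': 2, 'world': 1, 'big': 1, 'data': 1}
```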

Key Terminologies in Big Data Environment


Key terms include Hadoop (distributed framework), MapReduce (parallel processing model), NoSQL (non-relational databases), Data Lake (raw data repository), and ETL (Extract-Transform-Load pipelines). [17] [18]

Components of Hadoop Ecosystem for Data Analysis


Hadoop ecosystem includes HDFS for storage, YARN for resource management,
MapReduce/Spark for processing, Hive/Pig for querying (SQL-like), and HBase for real-time
access. [11] [19]

Difference Between HDFS and RDBMS


HDFS is a distributed file system for large-scale, fault-tolerant storage of any data type on
commodity hardware, while RDBMS uses structured tables with SQL for ACID-compliant
transactions on smaller datasets. [20] [21]

Aspect | HDFS | RDBMS
------ | ---- | -----
Data Type | Structured/unstructured/semi-structured | Primarily structured [20]
Scalability | Horizontal (add nodes) | Vertical (upgrade hardware) [21]
Cost | Low (commodity hardware) | High (licenses/hardware) [21]
Access Pattern | Batch read/write, high throughput | Random access, transactions [20]
Schema | Schema-on-read | Fixed schema-on-write [21]
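The schema-on-read vs schema-on-write distinction can be demonstrated without any actual Hadoop or database: HDFS-style storage accepts raw bytes and applies structure only when reading, whereas an RDBMS validates data before it is stored. A simulated sketch with invented helper names:

```python
# Schema-on-read: raw text is accepted at write time, even if malformed.
raw_rows = ["alice,30", "bob,notanumber"]

def read_with_schema(row):
    """Apply a (name: str, age: int) schema at read time;
    malformed rows surface as errors only when read."""
    name, age = row.split(",")
    return {"name": name, "age": int(age)}

parsed = read_with_schema(raw_rows[0])
print(parsed)  # {'name': 'alice', 'age': 30}

# Schema-on-write: validation happens before storage, as an RDBMS would do.
def write_with_schema(table, name, age):
    if not isinstance(age, int):
        raise TypeError("age must be an integer")  # rejected up front
    table.append({"name": name, "age": age})

table = []
write_with_schema(table, "alice", 30)
```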

Differences Between Traditional DBMS and NoSQL


Traditional DBMS (RDBMS) enforces fixed schemas and ACID for consistent, relational data,
whereas NoSQL offers flexible schemas and BASE for scalable, distributed handling of varied
data. [22] [23]

Aspect | Traditional DBMS (RDBMS) | NoSQL
------ | ------------------------ | -----
Schema | Fixed, predefined | Dynamic, flexible [22]
Scalability | Vertical | Horizontal (clusters) [23]
Consistency | ACID (strong) | BASE (eventual) [22]
Query Language | SQL | Varied (e.g., API-based) [23]
Data Model | Tables with relations | Documents, key-value, graphs [23]

MongoDB Overview
MongoDB is a document-oriented NoSQL database storing data in flexible JSON-like BSON documents, supporting dynamic schemas for scalable applications such as content management. [24]

Data Types in MongoDB


String: UTF-8 text, e.g., names. [24] [25]
Integer: 32/64-bit whole numbers, e.g., ages. [24] [25]
Double: Floating-point, e.g., prices. [25]
Boolean: True/false values. [25]
Array: Lists, e.g., tags. [25]
Object: Embedded documents. [25]
Date/Timestamp: Time records. [25]
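A single document exercising these types might look as follows. This is a plain Python dict in the shape pymongo would accept for insertion; the collection and field names are invented for illustration:

```python
import datetime

# A BSON-like document mixing the MongoDB data types listed above.
product = {
    "name": "widget",                        # String
    "stock": 42,                             # Integer (32/64-bit)
    "price": 9.99,                           # Double
    "active": True,                          # Boolean
    "tags": ["sale", "hardware"],            # Array
    "dimensions": {"w": 3, "h": 5},          # Object (embedded document)
    "added": datetime.datetime(2024, 5, 1),  # Date
}
print(type(product["price"]).__name__, product["tags"][0])
```

Because the schema is dynamic, another document in the same collection could carry extra fields (say, a "discount" Double) or omit "dimensions" entirely.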

Components (Daemons) of MapReduce Framework


MapReduce daemons include JobTracker (master: job scheduling, resource allocation) and
TaskTrackers (slaves: execute map/reduce tasks on nodes). [26] [12]
Diagram: Client submits to JobTracker → Splits job → Assigns to TaskTrackers (one per node)
for Map/Reduce execution, monitoring health and reassigning failures; shows master-slave
hierarchy with data flow. [26]

1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
7. [Link]
8. [Link]
9. [Link]
10. [Link]
11. [Link]
12. [Link]
13. [Link]
14. [Link]
15. [Link]
16. [Link]
17. [Link]
18. [Link]
19. [Link]
20. [Link]
21. [Link]
22. [Link]
23. [Link]
24. [Link]
25. [Link]
26. [Link]
