Define Big Data
Big Data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that grow exponentially over time, characterized by the three Vs: volume, velocity, and variety. It encompasses datasets too complex for traditional processing tools, enabling advanced analytics for insights in business and science.
Data Classification
Data is classified into three categories based on its degree of organization and how easily it can be analyzed.
Structured data: Highly organized with predefined schemas, stored in relational databases
like SQL tables with rows and columns; examples include inventory databases, financial
transactions, and Excel spreadsheets for easy querying. [1] [2]
Unstructured data: Lacks predefined format, comprising 80-90% of Big Data; examples
include social media posts, images, videos, emails, and audio files stored in data lakes or
NoSQL systems. [1] [3]
Semi-structured data: Hybrid with some organization, such as tags or markers, but no rigid schema; examples include JSON/XML files, emails (with headers and body), and geo-tagged images for flexible analysis. [1] [4]
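The classification above can be illustrated in a few lines of Python (the sample records are invented): semi-structured JSON tolerates missing or extra fields per record, whereas a relational table would force every row into the same columns.

```python
import json

# Two semi-structured records in the same collection: no rigid schema
# forces them to share identical fields.
raw = """
[
  {"id": 1, "name": "Alice", "tags": ["admin", "beta"]},
  {"id": 2, "name": "Bob", "email": "bob@example.com"}
]
"""

records = json.loads(raw)
for rec in records:
    # Schema-on-read: each consumer decides which fields it needs,
    # supplying defaults for fields a record may lack.
    tags = rec.get("tags", [])
    email = rec.get("email", "unknown")
    print(rec["id"], rec["name"], tags, email)
```

This "schema-on-read" access pattern is exactly what data lakes and NoSQL stores rely on for unstructured and semi-structured data.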
Differences Between Traditional BI and Big Data Analytics
Traditional Business Intelligence (BI) focuses on historical, structured data from internal sources
using SQL for reporting, while Big Data Analytics handles massive, varied real-time data from
diverse sources like IoT and social media with tools like Hadoop for predictive insights.
| Aspect | Traditional BI | Big Data Analytics |
| --- | --- | --- |
| Data Sources | Internal, structured (e.g., ERP systems) | Diverse, including external unstructured (e.g., social media, sensors) [5] |
| Data Volume | Small to medium | Extremely large (petabytes) [5] |
| Processing | Batch, SQL-based | Real-time, distributed (e.g., MapReduce) [5] |
| Scale | Centralized, vertical | Distributed, horizontal [5] |
| Focus | Descriptive/historical analysis | Predictive/real-time [5] |
| Tools | Dashboards, OLAP cubes | Hadoop, Spark, NoSQL [6] |
What is NoSQL
NoSQL databases are non-relational systems designed for scalability and flexibility in handling
unstructured or semi-structured data, avoiding fixed schemas unlike traditional SQL databases.
[7]
Types of NoSQL Databases
Key-value stores: Simple pairs for fast lookups, e.g., Redis, Amazon DynamoDB; ideal for
caching. [7] [8]
Document stores: JSON/BSON documents with nested structures, e.g., MongoDB,
Couchbase; suits semi-structured data. [7] [8]
Wide-column stores: Flexible columns per row, e.g., Cassandra, HBase; for sparse data like
logs. [7] [8]
Graph stores: Nodes and edges for relationships, e.g., Neo4j; for social networks. [7] [8]
Multi-model databases combine types for broader use. [7]
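The types above differ mainly in how values are modeled. As a minimal sketch, a toy in-memory key-value store (a stand-in for Redis-style lookups, not a real client) shows the simplest of the four models:

```python
class KeyValueStore:
    """Toy key-value store: O(1) get/put, the access pattern key-value
    databases such as Redis optimize for."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

# Typical caching use: the key is a lookup string, the value is opaque
# to the store itself (here, a dict acting as a session record).
cache = KeyValueStore()
cache.put("session:42", {"user": "alice", "ttl": 300})
print(cache.get("session:42"))
```

Document, wide-column, and graph stores trade this simplicity for richer value structure (nested documents, sparse columns, or traversable edges).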
Challenges Associated with Big Data
Big Data challenges include managing massive volumes requiring scalable storage like cloud
solutions, handling diverse data types via integration tools, and ensuring real-time velocity with
stream processing. Additional issues involve data quality (veracity) through governance,
security/privacy via encryption, and integration across silos using ETL pipelines. [9] [10]
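The integration challenge above is typically met with ETL pipelines. A minimal sketch of the three stages (field names and silo contents are invented):

```python
# Extract: pull raw rows from two hypothetical silos (inlined here).
crm_rows = [{"id": 1, "name": " Alice "}, {"id": 2, "name": "BOB"}]
web_rows = [{"id": 2, "clicks": 17}, {"id": 3, "clicks": 5}]

# Transform: clean values (a veracity/data-quality fix) and join the
# silos on a shared key.
def transform(crm, web):
    clicks_by_id = {r["id"]: r["clicks"] for r in web}
    for row in crm:
        yield {
            "id": row["id"],
            "name": row["name"].strip().title(),
            "clicks": clicks_by_id.get(row["id"], 0),
        }

# Load: write the unified records to the target store (here, a list).
warehouse = list(transform(crm_rows, web_rows))
print(warehouse)
```

Production pipelines add schema validation, incremental loads, and error handling, but the extract → transform → load shape is the same.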
What is Hadoop
Hadoop is an open-source framework for distributed storage and processing of large datasets
across commodity hardware clusters, emphasizing fault tolerance and scalability. [11]
Hadoop Components
HDFS (Hadoop Distributed File System): Handles large-scale data storage by distributing
files into blocks across nodes. [11]
YARN (Yet Another Resource Negotiator): Manages cluster resources and schedules jobs.
[11]
MapReduce: Processes data in parallel via map and reduce phases. [11]
Hadoop Common: Provides shared libraries and utilities. [11]
MapReduce Concept and High-Level Architecture
MapReduce is a programming model for parallel processing: the Map phase breaks input data
into key-value pairs, while the Reduce phase aggregates them for output, enabling efficient
handling of large datasets. [12]
High-level architecture involves a Client submitting jobs to the JobTracker (master), which splits tasks and assigns them to TaskTrackers (slaves) on the nodes. Data flows: Input splits → Map tasks (process locally) → Shuffle/Sort (group by key) → Reduce tasks → Output to HDFS. The diagram shows Client → JobTracker → TaskTrackers (Map/Reduce), with data locality minimizing network transfer. (In Hadoop 2 and later, YARN's ResourceManager and NodeManagers take over the JobTracker/TaskTracker roles.) [12]
Anatomy of File Write in HDFS
HDFS file write involves the client requesting block allocation from NameNode, which provides
DataNode addresses, then pipelining data to nodes for replication. [13]
Process: Client calls create() → NameNode checks permissions and allocates blocks → Client
writes to first DataNode, which replicates to others (e.g., replication factor 3) via
DFSOutputStream queues (data and ack). Acknowledgments flow back; on failure, pipeline
adjusts. Diagram: Client → NameNode (metadata) → DataNode1 → DataNode2 → DataNode3,
with blocks replicated sequentially. [13] [14]
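The write path above can be mimicked in a few lines: split a file into fixed-size blocks (the NameNode's view of a file) and assign each block a replication pipeline of DataNodes. Block size, node names, and the round-robin placement are illustrative only; real HDFS defaults to 128 MB blocks and uses rack-aware placement.

```python
BLOCK_SIZE = 4          # bytes, tiny for illustration; HDFS default is 128 MB
REPLICATION = 3
DATANODES = ["dn1", "dn2", "dn3", "dn4"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    """NameNode-side view: a file is a sequence of fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def allocate_pipeline(block_index, nodes=DATANODES, replication=REPLICATION):
    """Pick `replication` DataNodes for one block (toy round-robin;
    real HDFS placement is rack-aware)."""
    start = block_index % len(nodes)
    return [nodes[(start + k) % len(nodes)] for k in range(replication)]

data = b"hello hdfs!"
blocks = split_into_blocks(data)
for i, block in enumerate(blocks):
    # The client writes each block to the first node in its pipeline,
    # which forwards it to the next, and so on; acks flow back.
    print(i, block, "->", allocate_pipeline(i))
```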
Hadoop MapReduce Flow for Word Count Program
In MapReduce word count, input text splits into lines; Mapper tokenizes words emitting (word, 1)
pairs, Shuffler sorts/groups by word, Reducer sums counts outputting (word, total). [15]
Flowchart: Input files → Splitter (lines) → Mapper (word → (word,1)) → Partitioner/Sort (group keys) → Combiner (optional local reduce) → Reducer (sum values) → Output files. For example, "hello world hello" → Mapper: (hello,1), (world,1), (hello,1); after shuffling, the Reducer sums: (hello,2), (world,1). [15] [16]
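The flow above maps directly onto plain Python: the mapper emits (word, 1) pairs, the shuffle groups them by key, and the reducer sums each group. A single-process sketch of the same stages:

```python
from collections import defaultdict

def mapper(line):
    # Map: tokenize a line and emit (word, 1) for each token.
    for word in line.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle/Sort: group all values by key, as the framework does
    # between the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(word, counts):
    # Reduce: aggregate each key's values into a single total.
    return (word, sum(counts))

lines = ["hello world", "hello hadoop"]
pairs = [pair for line in lines for pair in mapper(line)]
grouped = shuffle(pairs)
result = dict(reducer(w, c) for w, c in grouped.items())
print(result)  # {'hello': 2, 'world': 1, 'hadoop': 1}
```

In real Hadoop the same three functions run distributed: mappers on the nodes holding the input splits, the shuffle over the network, and reducers writing their output to HDFS.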
Key Terminologies in Big Data Environment
Key terms include Hadoop (distributed framework), MapReduce (parallel processing model),
NoSQL (non-relational databases), Data Lake (raw data repository), and ETL (Extract-
Transform-Load for pipelines). [17] [18]
Components of Hadoop Ecosystem for Data Analysis
Hadoop ecosystem includes HDFS for storage, YARN for resource management,
MapReduce/Spark for processing, Hive/Pig for querying (SQL-like), and HBase for real-time
access. [11] [19]
Difference Between HDFS and RDBMS
HDFS is a distributed file system for large-scale, fault-tolerant storage of any data type on
commodity hardware, while RDBMS uses structured tables with SQL for ACID-compliant
transactions on smaller datasets. [20] [21]
| Aspect | HDFS | RDBMS |
| --- | --- | --- |
| Data Type | Structured/unstructured/semi-structured | Primarily structured [20] |
| Scalability | Horizontal (add nodes) | Vertical (upgrade hardware) [21] |
| Cost | Low (commodity hardware) | High (licenses/hardware) [21] |
| Access Pattern | Batch read/write, high throughput | Random access, transactions [20] |
| Schema | Schema-on-read | Fixed schema-on-write [21] |
Differences Between Traditional DBMS and NoSQL
Traditional DBMS (RDBMS) enforces fixed schemas and ACID for consistent, relational data,
whereas NoSQL offers flexible schemas and BASE for scalable, distributed handling of varied
data. [22] [23]
| Aspect | Traditional DBMS (RDBMS) | NoSQL |
| --- | --- | --- |
| Schema | Fixed, predefined | Dynamic, flexible [22] |
| Scalability | Vertical | Horizontal (clusters) [23] |
| Consistency | ACID (strong) | BASE (eventual) [22] |
| Query Language | SQL | Varied (e.g., API-based) [23] |
| Data Model | Tables with relations | Documents, key-value, graphs [23] |
MongoDB Overview
MongoDB is a document-oriented NoSQL database storing data in flexible JSON-like BSON
documents, supporting dynamic schemas for scalable applications like content management.
[24]
Data Types in MongoDB
String: UTF-8 text, e.g., names. [24] [25]
Integer: 32/64-bit whole numbers, e.g., ages. [24] [25]
Double: Floating-point, e.g., prices. [25]
Boolean: True/false values. [25]
Array: Lists, e.g., tags. [25]
Object: Embedded documents. [25]
Date/Timestamp: Time records. [25]
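Since BSON documents map naturally onto Python dicts, the types above can be shown in a single hypothetical document (field names are invented; with a real driver such as pymongo, it would be inserted via `collection.insert_one(product)`):

```python
from datetime import datetime, timezone

# One document exercising the common BSON types listed above.
product = {
    "name": "Laptop",                      # String
    "stock": 25,                           # Integer (32/64-bit in BSON)
    "price": 999.99,                       # Double
    "in_stock": True,                      # Boolean
    "tags": ["electronics", "portable"],   # Array
    "specs": {"ram_gb": 16, "cpu": "M3"},  # Object (embedded document)
    "added": datetime.now(timezone.utc),   # Date
}

for field, value in product.items():
    print(field, type(value).__name__)
```

Because the schema is dynamic, a second document in the same collection could omit `specs` or add new fields without any migration.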
Components (Daemons) of MapReduce Framework
MapReduce daemons include JobTracker (master: job scheduling, resource allocation) and
TaskTrackers (slaves: execute map/reduce tasks on nodes). [26] [12]
Diagram: Client submits to JobTracker → Splits job → Assigns to TaskTrackers (one per node)
for Map/Reduce execution, monitoring health and reassigning failures; shows master-slave
hierarchy with data flow. [26]
⁂
1. [Link]
2. [Link]
3. [Link]
4. [Link]
5. [Link]
6. [Link]
7. [Link]
8. [Link]
9. [Link]
10. [Link]
11. [Link]
12. [Link]
13. [Link]
14. [Link]
15. [Link]
16. [Link]
17. [Link]
18. [Link]
19. [Link]
20. [Link]
21. [Link]
22. [Link]
23. [Link]
24. [Link]
25. [Link]
26. [Link]
27. [Link]
28. [Link]
29. [Link]
30. [Link]
31. [Link]
32. [Link]
33. [Link]
34. [Link]
35. [Link]
36. [Link]
37. [Link]
38. [Link]
39. [Link]
40. [Link]
41. [Link]
42. [Link]
43. [Link]
44. [Link]
45. [Link]
46. [Link]
47. [Link]
48. [Link]
49. [Link]
50. [Link]
51. [Link]
52. [Link]
53. [Link]
54. [Link]
55. [Link]
56. [Link]
57. [Link]
58. [Link]
59. [Link]
60. [Link]
61. [Link]
62. [Link]