19CSE357 BIG DATA ANALYTICS
L-T-P-C: 3-0-0-3
What is DATA?
• Data is raw information—facts, figures, observations—that can be
collected, stored, and analyzed.
What is Big Data?
Big Data is a collection of data that is huge in volume and growing exponentially
with time. Its size and complexity are so great that no traditional data
management tool can store or process it efficiently.
• Big Data is data that is too large, fast, or complex for traditional tools to handle.
• Analytics is the process of examining data to find patterns, trends, and insights.
Sources of Big Data
These data come from many sources, such as:
• Social networking sites: Facebook, Google, and LinkedIn all generate
huge amounts of data on a day-to-day basis, as they have billions of users
worldwide.
• E-commerce sites: Sites like Amazon, Flipkart, and Alibaba generate huge volumes
of logs from which users' buying trends can be traced.
• Weather stations: Weather stations and satellites produce very large volumes of
data, which are stored and processed to forecast the weather.
• Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly; for this they store the data of millions of users.
• Share market: Stock exchanges across the world generate huge amounts of data
through their daily transactions.
Characteristics of Big Data
Volume: Scale of Data
• The name ‘Big Data’ itself refers to an enormous size.
• Refers to the massive amount of data generated every second from sources like social
media, sensors, transactions, etc.
• Whether a particular dataset can actually be considered Big Data depends
largely on its volume.
• Ex: Social media platforms like Facebook generate petabytes of data daily through
posts, comments, and multimedia uploads.
Velocity: Speed of Data Generation
• The speed at which data is created, collected, and processed.
• In Big Data, data flows in continuously from sources like machines, networks,
social media, mobile phones, etc.
• There is a massive and continuous flow of data; velocity determines how fast
data is generated and how quickly it must be processed to meet demand.
• Ex: More than 3.5 billion searches are made on Google every day, and
Facebook's user base grows by roughly 22% year over year.
Variety: Different Types of Data
• Variety refers to the different forms and types of data that are generated.
• Data arrives from new sources both inside and outside the enterprise.
• It can be structured, semi-structured, or unstructured.
• Ex: Emails, social media posts, videos, and sensor data all represent different
forms of data collected by companies.
Veracity: The Trustworthiness of Data
• Veracity refers to the quality and reliability of the data.
• It covers the inconsistencies and uncertainty in data: available data can be
messy, and its quality and accuracy are difficult to control.
• Big Data is also variable because of the multitude of data dimensions
resulting from multiple disparate data types and sources.
• Ex: Customer feedback data may be filled with inconsistencies, biases, or
inaccuracies, requiring validation and cleansing to ensure reliability.
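As a sketch of that validation-and-cleansing step, the snippet below filters illustrative feedback records. The field names and the rules (require a customer id, ratings in 1–5, trimmed comments) are assumptions made for the example, not from any specific system:

```python
# A minimal sketch of validating and cleansing messy customer feedback
# records before analysis. Field names and rules are illustrative.

raw_feedback = [
    {"customer_id": "101", "rating": "5", "comment": "Great service"},
    {"customer_id": "102", "rating": "11", "comment": "ok"},   # out of range
    {"customer_id": None,  "rating": "4", "comment": "fine"},  # missing id
    {"customer_id": "103", "rating": "3", "comment": "  Slow  "},
]

def cleanse(records):
    """Keep only records with a customer id and a rating in 1..5."""
    clean = []
    for rec in records:
        if rec["customer_id"] is None:
            continue                      # drop records missing an identifier
        try:
            rating = int(rec["rating"])
        except ValueError:
            continue                      # drop non-numeric ratings
        if not 1 <= rating <= 5:
            continue                      # drop out-of-range ratings
        clean.append({
            "customer_id": rec["customer_id"],
            "rating": rating,
            "comment": rec["comment"].strip(),  # normalize whitespace
        })
    return clean

cleaned = cleanse(raw_feedback)
print(len(cleaned))  # 2 records survive validation
```

Only two of the four records pass; the rest illustrate the kinds of inconsistencies veracity is concerned with.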
Value: The Worth of Data
• After taking the four V's above into account, there comes one more V, which
stands for Value.
• Value refers to the usefulness or business impact that can be derived from data.
• Bulk data with no value is of no good to a company unless it is turned into
something useful.
• Data in itself is of no use or importance; it must be converted into something
valuable from which information can be extracted. Hence, Value can be considered
the most important of all the 5 V's.
The five V's, what they mean, and why they matter:
• Volume: amount of data (terabytes, petabytes, etc.). Need: helps us understand
storage needs and scalable processing strategies (e.g., Hadoop, Spark).
• Velocity: speed of data generation and processing. Need: enables real-time or
near-real-time decision making (e.g., fraud detection, social media analytics).
• Variety: different forms of data (text, images, video, logs). Need: drives the
need for flexible data models (NoSQL, document stores) and diverse processing tools.
• Veracity: accuracy and trustworthiness of data. Need: critical for making
reliable, evidence-based decisions; affects data cleaning and preprocessing stages.
• Value: usefulness and business value of the data. Need: justifies the entire
analytics process; focuses on extracting actionable insights that impact outcomes.
Non-Definitional traits of Big Data
• Volatility: deals with how long the data is valid.
• Validity: refers to the accuracy and correctness of data.
Data authenticity: information extracted from the data should be accurate and
correct.
• Variability: data flows can be highly inconsistent, with periodic peaks.
• Visualization: a method of representing intangible ideas using graphs, charts, etc.
Challenges of Big Data
• Incomplete Understanding of Big Data:
Organizations often lack a proper understanding of Big Data, and skilled people
are required to work with it.
• Exponential Data Growth:
Since data is growing exponentially, it is becoming difficult to store.
• Security of Data:
With many resources devoted to understanding, storing, and analyzing data,
organizations often forget to prioritize security.
• Data Integration:
Data can be in any form, from structured data like phone numbers to unstructured
data like videos, so integrating this data is difficult.
• Lack of data professionals
Types of Big Data
Big data can be broadly classified into three main types:
• Structured data
• Semi-structured data
• Unstructured data
Structured data
• The data that can be stored and processed in a fixed
format is called structured data.
• Typically represented in tables, rows, and columns.
• Data stored in a relational database management
system (RDBMS) is one example of ‘structured’
data.
• It is easy to process structured data as it has a fixed
schema.
• Structured Query Language (SQL) is often used to
manage such data.
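As a minimal illustration of structured data under a fixed schema, the snippet below uses Python's built-in sqlite3 module; the employee table and its columns are invented for the example:

```python
# A small sketch of structured data: a fixed schema in a relational
# database, queried with SQL. The table and its rows are illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, name TEXT, dept TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [(1, "Asha", "Sales", 52000.0),
     (2, "Ravi", "IT", 61000.0),
     (3, "Meena", "Sales", 58000.0)],
)
# Because the schema is fixed, SQL can aggregate directly over columns.
rows = conn.execute(
    "SELECT dept, COUNT(*), AVG(salary) FROM employees "
    "GROUP BY dept ORDER BY dept"
).fetchall()
print(rows)  # [('IT', 1, 61000.0), ('Sales', 2, 55000.0)]
```

The fixed rows-and-columns layout is what makes aggregation queries like this straightforward.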
Semi-structured data
• Semi-structured data is one of the types of big data that
represents a middle ground between the structured and
unstructured data categories.
• It combines elements of organization and flexibility,
allowing for data to be partially structured while
accommodating variations in format and content.
• This type of data is often represented with tags, labels,
or hierarchies, which provide a level of organization
without strict constraints.
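The point about partial structure can be sketched with JSON, a common semi-structured format. The event records below are invented: they share some keys but vary in shape, so code reading them must tolerate missing or nested fields:

```python
# A sketch of semi-structured data: JSON records share some fields but
# vary in structure. The records are illustrative.
import json

raw = """[
  {"user": "a1", "action": "view",
   "device": {"type": "mobile", "os": "android"}},
  {"user": "a2", "action": "purchase", "amount": 499.0},
  {"user": "a3", "action": "view"}
]"""

events = json.loads(raw)
for e in events:
    # Keys give partial structure; fall back to a default where absent.
    device = e.get("device", {}).get("type", "unknown")
    print(e["user"], e["action"], device)
```

Tags and nesting provide organization, but unlike a relational table, nothing forces every record to carry the same fields.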
Unstructured data
• Data that has no predefined form and cannot be analyzed until it is
transformed into a structured format is called unstructured data.
• It lacks a consistent structure, making it more challenging to organize and
analyze.
• Text Files and multimedia contents like images, audios, videos are example of
unstructured data.
• Unstructured data is growing faster than other types; experts estimate that
80 percent of the data in an organization is unstructured.
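The "transform into a structured format" step can be sketched very simply: the snippet below turns free text (an invented sample sentence) into a word-frequency table, a structured form that can then be analyzed:

```python
# A sketch of turning unstructured text into a structured form (a word
# frequency table) so it can be analyzed. The sample text is illustrative.
from collections import Counter
import re

document = ("Big data grows fast. Unstructured data grows faster "
            "than structured data.")

# Tokenize: lowercase the text and keep only alphabetic runs.
tokens = re.findall(r"[a-z]+", document.lower())
freq = Counter(tokens)
print(freq.most_common(2))  # [('data', 3), ('grows', 2)]
```

Real pipelines apply the same idea at scale, extracting features from text, images, audio, or video before analysis.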
Traditional Business Intelligence versus Big Data

Traditional BI environment:
• Data is stored on a central server.
• Analyzes offline or historical data.
• Supports structured data only.

Big Data environment:
• Data is stored in a distributed file system.
• Analyzes both offline and real-time/streaming data.
• Supports a variety of data, i.e., structured, semi-structured, and unstructured.
Data warehouse
• A Data Warehouse is a central repository
that stores huge amounts of structured,
historical data from various sources within an
organization.
• An ordinary database can store MBs to GBs
of data, typically for a specific purpose. For
storing data at TB scale, storage shifts to a
Data Warehouse.
• To effectively perform analytics, an
organization keeps a central Data Warehouse
to closely study its business by organizing,
understanding, and using its historic data for
taking strategic decisions and analyzing
trends.
Key aspects of a data warehouse:
Structure: Data warehouses use a relational database management system (RDBMS)
to store data in a structured format.
Data is organized into tables, rows, and columns, following a predefined schema
called a dimensional model.
ETL Processes:
Extract, Transform, Load (ETL) processes are used to populate the data warehouse.
Data is extracted from different sources, transformed to adhere to the data
warehouse schema, and then loaded into the warehouse.
ETL processes often involve data cleansing, integration, and aggregation to ensure
data quality and consistency.
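The extract-transform-load sequence can be sketched as below. The CSV source, field names, and cleansing rules are all invented for illustration; the "load" target is just an in-memory list standing in for the warehouse tables:

```python
# A minimal ETL sketch: extract rows from a CSV-like source, transform
# them to match a warehouse schema (cleansing + type conversion), and
# load them into an in-memory "warehouse". Names are illustrative.
import csv
import io

source = io.StringIO(
    "order_id,amount,region\n"
    "1, 100.5 ,south\n"
    "2,,north\n"        # missing amount: will be dropped in cleansing
    "3,250,SOUTH\n"
)

# Extract: read raw rows from the source.
rows = list(csv.DictReader(source))

# Transform: drop rows with missing amounts, convert types,
# and normalize region casing so values are consistent.
def transform(row):
    if not row["amount"].strip():
        return None
    return {"order_id": int(row["order_id"]),
            "amount": float(row["amount"]),
            "region": row["region"].strip().upper()}

# Load: here just an append; in practice, a bulk insert into warehouse tables.
warehouse = [t for t in (transform(r) for r in rows) if t is not None]
print(warehouse)
```

The transform step is where cleansing, integration, and aggregation typically happen before data reaches the warehouse schema.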
Historical Data:
• Data warehouses primarily store historical data, providing a long-term view of the
organization's operations.
• Data is typically collected at regular intervals, such as daily, weekly, or monthly,
allowing for trend analysis and historical reporting.
Business Intelligence and Reporting:
• Data warehouses serve as a foundation for business intelligence activities.
• They support ad hoc querying, reporting, and analysis by providing a consolidated
view of data across different business areas.
• Data is often pre-aggregated and optimized for query performance to facilitate fast
and interactive reporting.
Hadoop Environment:
• Hadoop is an open-source framework that enables distributed storage and
processing of large datasets across clusters of commodity hardware.
• It provides a scalable and cost-effective solution for managing Big Data.
• Distributed Storage:
Hadoop uses the Hadoop Distributed File System (HDFS) to store data across
multiple nodes in a cluster.
Data is split into blocks and distributed across the cluster, ensuring high availability
and fault tolerance.
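The block-splitting and replication idea can be sketched on one machine. The block size and replication factor below are deliberately tiny toy values (real HDFS defaults are on the order of 128 MB blocks with replication 3), and the round-robin placement is a simplification of HDFS's actual placement policy:

```python
# A toy sketch of HDFS-style storage: split a file into fixed-size
# blocks and copy each block onto several nodes for fault tolerance.
# Block size, replication factor, and placement are simplified.
BLOCK_SIZE = 4        # bytes; tiny for illustration
REPLICATION = 2
nodes = ["node1", "node2", "node3"]

data = b"abcdefghij"  # a 10-byte "file"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Round-robin placement: each block lands on REPLICATION distinct nodes,
# so losing any single node leaves every block still readable.
placement = {i: [nodes[(i + r) % len(nodes)] for r in range(REPLICATION)]
             for i in range(len(blocks))}
print(blocks)     # [b'abcd', b'efgh', b'ij']
print(placement)  # e.g. block 0 on node1 and node2
```

Because every block exists on more than one node, the failure of a single node does not make any part of the file unavailable.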
• Distributed Processing:
Hadoop leverages the MapReduce framework to process data in parallel across the
cluster.
MapReduce divides data processing tasks into smaller subtasks and distributes them
to different nodes, allowing for efficient parallel processing.
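The map-shuffle-reduce pattern can be illustrated with a single-machine word count; real Hadoop runs each phase across cluster nodes, but the data flow is the same. The input lines are invented for the example:

```python
# A toy illustration of the MapReduce pattern on one machine: map each
# line to (word, 1) pairs, shuffle by key, then reduce by summing.
# Real Hadoop distributes these phases across cluster nodes.
from collections import defaultdict

lines = ["big data big", "data analytics", "big analytics"]

# Map phase: each line independently emits (key, value) pairs.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle phase: group all values belonging to the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: aggregate each key's values; keys are independent,
# so reducers can run in parallel on different nodes.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 3, 'data': 2, 'analytics': 2}
```

Because map tasks touch only their own line and reduce tasks touch only their own key, both phases parallelize naturally across a cluster.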
• Scalability:
Hadoop is designed to scale horizontally by adding more nodes to the cluster. This
enables organizations to store and process large volumes of data without relying on
expensive and specialized hardware.
• Flexibility and Variety:
Hadoop can handle various types of data, including structured, unstructured, and
semi-structured data.
It allows organizations to store and process diverse data formats, such as text, log
files, images, videos, and more.
• Data Processing Frameworks:
The Hadoop ecosystem includes several data processing frameworks and tools built
on top of Hadoop, such as Apache Spark, Apache Hive, and Apache Pig.
These frameworks provide higher-level abstractions and APIs for data manipulation,
querying, and analysis.
• Batch and Real-time Processing:
Hadoop supports both batch processing and real-time/streaming processing.
While its MapReduce framework is suitable for batch processing, tools like Apache Spark
enable real-time and near-real-time analytics on streaming data.