BIG DATA ANALYTICS
(BCS714D)
MODULE 1:
1. Introduction to Big Data
2. Big Data Analytics.
MODULE 2:
1. Introduction to Hadoop
2. Introduction to Map Reduce Programming
MODULE 3:
1. Introduction to Mongo DB
MODULE 4:
1. Introduction to Hive
2. Introduction to Pig
MODULE 5:
1. Spark and Big Data Analytics
2. Text, Web Content and Link Analytics
Text Books:
1. Seema Acharya and Subhashini Chellappan, “Big
Data and Analytics”, Wiley India Publishers, 2nd
Edition, 2019.
2. Rajkamal and Preeti Saxena, “Big Data
Analytics: Introduction to Hadoop, Spark and
Machine Learning”, McGraw Hill Publication,
2019.
MODULE 1
Syllabus
Introduction to Big Data: Classification of data, Characteristics,
Evolution and definition of Big data, What is Big data, Why Big data,
Traditional Business Intelligence vs Big Data, Typical data warehouse and
Hadoop environment.
Big Data Analytics: What is Big data Analytics, Classification of
Analytics, Importance of Big Data Analytics, Technologies used in Big
data Environments, Few Top Analytical Tools, NoSQL, Hadoop.
TB1: Ch 1: 1.1, Ch 2: 2.1-2.5, 2.7, 2.9-2.11, Ch 3: 3.2, 3.5, 3.8, 3.12, Ch 4: 4.1, 4.2
INTRODUCTION:
Today, data is very important for all kinds of businesses, big or
small. It is found both inside and outside the company, and it comes
in different forms from many sources. To make good decisions, we
need to:
Collect the data
Understand it
Use it properly to get useful information
Data is a set of values that represent a concept or concepts. It can be
raw information, such as numbers or text, or it can be more
complex, such as images, graphics, or videos.
1. CLASSIFICATION OF DATA
Digital data can be broadly classified into structured , semi-structured, and
unstructured data.
1. Unstructured data:
This type of data has no specific format or structure.
Computers can't easily understand or process it directly.
Examples: Word files, emails, images, videos, PDFs, research papers,
presentations.
Around 80–90% of data in companies is unstructured.
2. Semi-Structured Data
This data has some structure, but it is not as organized as structured data.
It is not very easy for computers to use directly.
Examples: XML files, HTML pages.
3. Structured Data
This data is well-organized in tables (like rows and columns).
Computers can easily read and use it.
Example: Data stored in databases like student records, sales data, etc.
Relationships between data are also defined (e.g., student and their marks).
In fact, a company called Gartner says that today, about 80% of the data in
companies is unstructured, and only about 10% is structured or semi-
structured.
1. Structured Data
Most structured data is stored in databases called RDBMS (Relational Database
Management Systems), such as Oracle, MySQL, and SQL Server.
Data in these databases is kept in tables. Each table has rows (each row is one record
or entry) and columns (each column stores a particular type of information, such as
employee name or number).
Each table represents a type of object or entity, for example, “Employee”.
Every column has a specific meaning and data type (like a number or text)
and can have rules, such as “cannot be empty” or “must be unique”.
Tables can be linked to each other. For example, each Employee record has a
DeptNo, which connects to the Department table. This shows which department
the employee works in. This relationship keeps data organized and avoids
duplication.
Structured data is stored in:
Databases (like Oracle, MySQL)
Spreadsheets (like Excel)
OLTP (Online Transaction Processing) systems
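As a minimal sketch (using Python's built-in sqlite3 module and a hypothetical Employee/Department schema), this shows how structured data lives in tables with typed columns, rules such as "cannot be empty", and a relationship via DeptNo:

```python
import sqlite3

# In-memory database with two related tables (hypothetical example schema).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Department (DeptNo INTEGER PRIMARY KEY, DeptName TEXT NOT NULL)")
cur.execute("""CREATE TABLE Employee (
    EmpNo INTEGER PRIMARY KEY,
    EmpName TEXT NOT NULL,
    DeptNo INTEGER REFERENCES Department(DeptNo))""")
cur.execute("INSERT INTO Department VALUES (10, 'Sales')")
cur.execute("INSERT INTO Employee VALUES (1, 'Asha', 10)")

# The DeptNo relationship lets us join the two tables, avoiding duplication.
row = cur.execute("""SELECT e.EmpName, d.DeptName
                     FROM Employee e JOIN Department d ON e.DeptNo = d.DeptNo""").fetchone()
print(row)  # ('Asha', 'Sales')
```

Because the structure is fixed and known in advance, queries like the join above are easy for the computer to run.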
1.1.1 Sources of Structured Data
Here’s where it usually comes from:
1. Databases (RDBMS such as Oracle, MySQL, SQL Server)
2. Spreadsheets. In a spreadsheet, information is put into rows and columns, just
like in a table.
3. OLTP Systems (Online Transaction Processing)
1.1.2 Ease of working with Structured data
1.2 Semi-Structured Data
Semi-structured data is between structured and unstructured. It’s more
flexible and doesn’t follow strict table rules.
Key Features:
1. No Fixed Tables: It does not follow a regular database format.
2. Uses Tags: Data is stored with tags (like in XML or JSON), making it easier to
understand.
3. Tags and Hierarchies
Tags (like <name>) are used to show the structure and hierarchy of data.
4. Schema Mixed with Data
Information about the structure (schema) is mixed along with the actual data
values.
5. Unknown or Varying Attributes
Data may have different attributes, and we might not know these in advance.
Items in the same group don't need to have the same properties.
1.2.1 Sources of Semi-structured Data
The most common formats for semi-structured data are:
1. XML (eXtensible Markup Language)
Used by web services (especially with SOAP).
Stores data with opening and closing tags.
Example (XML):
<name>John</name>
2. JSON (JavaScript Object Notation)
Used to transfer data between a web server and a browser.
Common in modern web applications (using REST).
Also used in NoSQL databases like MongoDB and Couchbase.
Example (JSON):
{ "name": "John" }
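A small Python sketch of how the two formats above are parsed; the extra "city" key is a hypothetical attribute added to illustrate that semi-structured records need not share identical fields:

```python
import json
import xml.etree.ElementTree as ET

# The same fact expressed as XML and as JSON; tags/keys carry the structure,
# so the schema travels with the data itself.
xml_doc = "<person><name>John</name></person>"
json_doc = '{ "name": "John", "city": "Pune" }'  # "city" is an extra, unplanned attribute

name_from_xml = ET.fromstring(xml_doc).find("name").text
record = json.loads(json_doc)

print(name_from_xml)   # John
print(record["name"])  # John
print(sorted(record))  # ['city', 'name']
```

No table definition was needed in advance, which is exactly the flexibility that makes these formats semi-structured.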
AN EXAMPLE OF HTML IS AS FOLLOWS:
<HTML>: Encloses the entire HTML document. This indicates the
start and end of the HTML code.
<HEAD>: Contains meta-information about the document, such as its title
and links to stylesheets or scripts.
<TITLE>Place your title here</TITLE>: Sets the page title, which appears in
the browser’s title bar or tab.
<BODY BGCOLOR="FFFFFF">: Defines the main content
area of the webpage. The BGCOLOR="FFFFFF" attribute sets the background
colour of the page to white using the hexadecimal colour code FFFFFF.
<CENTER><IMG SRC="[Link]" ALIGN="BOTTOM"></CENTER>
<HR>
<a href="[Link]">Link Name</a>
<H1>this is a Header</H1>
<H2>this is a sub Header</H2>
Send me mail at <a
href="[Link]">[Link]</a>.
<P>a new paragraph!
<P><B>a new paragraph!</B>
<BR><B><I>this is a new sentence without a paragraph break, in bold
italics.</I></B>
<HR>
</BODY>
</HTML>
SAMPLE JSON DOCUMENT
{
"_id": 9,
"BookTitle": "Fundamentals of Business Analytics",
"AuthorName": "Seema Acharya",
"Publisher": "Wiley India",
"YearofPublication": "2011"
}
1.3 Unstructured Data
Unstructured data refers to information that does not conform to a predefined
model or structure.
It’s unpredictable, free-form, and often varies widely from one instance to
another.
Examples include social media posts, emails, and logs.
Sometimes, patterns exist in unstructured data, leading to debates about whether
some of it is actually "semi-structured."
Examples:
Twitter message: "Feeling miffed. Victim of twishing."
Facebook post: "LOL. C ya. BFN"
Log file entry: [Link] - frank [10/Oct/[Link] -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 ...
Email: "Hey Joan, possible to send across the first cut on the Hadoop chapter by Friday
EOD or maybe we can meet up over a cup of coffee. Best regards, Tom"
1.3.1 Sources of Unstructured Data
Web pages: The actual content of web pages, which is often complex and not
neatly organized.
Images: Photographs, diagrams, and pictures.
Free form text: Any text that isn’t organized into records or tables, such as essays
or reports.
Audios: Recordings and voice files.
Videos: Multimedia files combining images and sound.
Body of Email: The main content area of emails—not the sender/recipient or time
fields, but what people actually write.
Text messages: SMS or instant messaging text.
Chats: Conversations from online chat applications.
Social media data: Posts, comments, reactions on platforms like Facebook,
Twitter, Instagram, etc.
Word documents
Why data with some apparent structure may still be treated as unstructured:
1. Implied Structure:
Sometimes, there's structure present (e.g., date at the start of a log entry) even if it
wasn't pre-defined.
2. Structure Not Helpful:
Data might have some internal structure, but if that structure isn't useful for a given
task, it's still treated as unstructured.
3. Unexpected/Unstated Structure:
Data may be more structured than we realize, but if that structure is not anticipated
or announced, it is called unstructured.
This image shows a seesaw with “Unstructured data” on one side and “Structured
data” on the other, tilting heavily towards unstructured.
Meaning: Unstructured data makes up the majority of enterprise
(business/organizational) information, outweighing structured data.
Techniques to Find Patterns in Unstructured Data
1. Data Mining:
Data mining is the analysis of large data sets to identify consistent patterns
or relationships between variables. It draws upon artificial intelligence,
machine learning, statistics, and database systems.
Think of it as the "analysis step" in the process called "knowledge discovery
in databases."
Popular Data Mining Algorithms:
Association Rule Mining:
Also called: Market basket analysis or affinity analysis
Purpose: Determines "What goes with what?"
For example, if someone buys bread, do they also tend to buy eggs or cheese?
Use: Helps stores recommend or place products together based on previous
purchases.
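A toy sketch of the "what goes with what?" idea in pure Python; the baskets are hypothetical, and real association rule miners (e.g. Apriori) add support thresholds and pruning on top of this counting:

```python
from itertools import combinations
from collections import Counter

# Hypothetical shopping baskets: which items are bought together?
baskets = [
    {"bread", "eggs", "cheese"},
    {"bread", "eggs"},
    {"bread", "milk"},
    {"milk", "eggs"},
]

# Count how often every pair of items appears in the same basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Support of {bread, eggs} = fraction of all baskets containing both.
support = pair_counts[("bread", "eggs")] / len(baskets)
# Confidence of "bread -> eggs" = of the baskets with bread, how many also had eggs.
bread_baskets = sum(1 for b in baskets if "bread" in b)
confidence = pair_counts[("bread", "eggs")] / bread_baskets
print(support, round(confidence, 2))  # 0.5 0.67
```

High-support, high-confidence pairs are the ones a store might shelve or recommend together.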
Regression Analysis:
Purpose: Predicts the relationship between two variables.
How: One variable (dependent variable) is predicted using other variables
(independent variables).
Use: Estimate outcomes, trends, or values based on related data.
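A minimal ordinary least squares sketch in pure Python; the x and y values are hypothetical toy numbers chosen to fall on a perfect line so the fitted slope and intercept are easy to check:

```python
# Predict y (dependent variable) from x (independent variable) with a line y = a + b*x.
xs = [1, 2, 3, 4, 5]    # e.g. advertising spend (hypothetical)
ys = [2, 4, 6, 8, 10]   # e.g. sales (hypothetical, perfectly linear)

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = mean_y - b * mean_x

print(a, b)       # 0.0 2.0
print(a + b * 6)  # predicted value at x = 6 -> 12.0
```

The fitted line can then estimate outcomes for x values that were never observed, which is the "predict trends or values" use mentioned above.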
Text Analytics/Text Mining:
Extracts meaningful information from unstructured text (like social media or
emails). Tasks include categorization, clustering, sentiment, or entity extraction.
Natural Language Processing (NLP):
Enables computers to understand and interpret human language.
Noisy Text Analytics:
Deals with messy data (chats, messages) that may have errors or informal
language.
Manual Tagging with Metadata:
Attaching manual tags/labels to data to add meaning/structure.
Part-of-Speech Tagging:
Tagging text with its grammatical parts (noun, verb, adjective, etc.)
Unstructured Information Management Architecture (UIMA):
A platform to process unstructured content (text, audio) in real time for extracting
relevant meaning.
2. CHARACTERISTICS OF DATA
Data has three key characteristics.
1. Composition
This refers to the structure of data: its sources, granularity, types, and
whether it is static or involves real-time streaming.
It answers questions like: What is the origin of the data? Is it organized as
batches or streams? Is it highly granular or summarized?
2. Condition
This addresses the state or quality of the data.
It focuses on whether the data is ready for analysis or if it needs to be
cleansed or improved through enrichment. Typical questions include: Is the
data suitable for immediate use? Does it need preprocessing?
3. Context
Context covers the background in which data was generated or is being
used.
It helps answer where, why, and how the data came about, its sensitivity,
and associated events. For example: Where did this data originate? Why
was it created? What events are linked to it?
3. EVOLUTION OF BIG DATA
Before 1980: Simple, structured data stored in mainframes,
limited usage.
1980s-1990s: Relational databases enable more sophisticated,
relational data storage and some data analysis.
2000s-now: Boom of the internet and new technologies generates
vast, varied data (structured, unstructured, multimedia); data is
now a strategic asset driving decisions and innovations.
4. DEFINITION OF BIG DATA
Flexible Definitions:
Beyond Human/Technical Limits: Anything exceeding current human or
technical infrastructure for storage, processing, and analysis.
Relativity: What is considered "big" today may be normal tomorrow, showing
how fast the landscape evolves.
Massive Scale: Sometimes simply defined as terabytes, petabytes, or even
zettabytes of data.
Three Vs: Most commonly, Big Data is described using the "3 Vs": Volume,
Velocity, and Variety.
Big data is high-volume, high-velocity, and high-variety information assets
that demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.
— Gartner IT Glossary
Volume: Enormous amounts of data, both structured and unstructured.
Velocity: Speed at which new data is generated and must be processed.
Variety: Diversity in data types: text, images, videos, logs, streams, etc.
Gartner's definition of big data through a simple flow diagram.
High-volume, high-velocity, high-variety data
→ Need for cost-effective, innovative information processing
→ Leads to enhanced insight and better decision making
Cost-effective and Innovative Processing:
Big Data requires new technologies and approaches to ingest, store, and
analyze huge, fast-flowing, and diverse data sets.
Enhanced Insight and Decision-Making:
The ultimate goal is to derive deeper, richer, and more actionable insights—
turning data into information, then actionable intelligence, leading to better
decisions and greater business value.
This chain is summarized as:
Data → Information → Actionable intelligence → Better decisions
→ Enhanced business value
Big Data isn't just about size; it's about complexity, speed, diversity,
and the ability to draw deeper insights to achieve a competitive edge
in decision making.
Big Data helps turn raw facts into smart
insights, which lead to better decisions
and improved business results.
5. CHALLENGES WITH BIG DATA
1. Explosive Data Growth
Data is growing at an exponential rate, with most existing data generated in just the last
few years.
The challenge is: Will all this data be useful? Should we analyze all data or just a
subset? How do we distinguish valuable insights from noise?
2. Cloud Computing and Virtualization
Managing big data infrastructure often involves cloud computing, which provides cost-
efficiency and flexibility.
However, deciding whether to host data solutions inside or outside the enterprise adds
complexity due to concerns about control and security.
3. Data Retention Decisions
Determining how long to keep data is challenging. Some data may only be relevant for
a short period, while other data could have long-term value.
4. Shortage of Skilled Professionals
There is a shortage of experts in data science, which is crucial for
implementing effective big data solutions.
5. Scale Issues
Big data involves datasets too large for traditional databases.
No clear limit defines when data becomes "big", and new methods are needed
as data changes rapidly and in unpredictable ways.
Challenges include not only storage, but also capturing, preparing,
transferring, securing, and visualizing the data.
6. Need for Data Visualization
Clear and effective visualization is essential to making sense of vast datasets.
There aren’t enough specialists in data or business visualization to meet
demand.
CHALLENGES WITH BIG DATA
The diagram highlights the following core challenges in handling big data:
Capture (gathering data from multiple sources)
Storage (handling massive volumes of information)
Curation (organizing and maintaining data quality)
Search (efficiently finding relevant information)
Analysis (extracting insights)
Transfer (moving data across locations or systems)
Visualization (presenting data in understandable formats)
Privacy Violations (ensuring data security and privacy)
6. WHAT IS BIG DATA?
Big data refers to data that is extremely large in volume, moves at a high velocity, and comes in
a wide variety of forms. The concept of big data is usually captured by the "3 Vs":
6.1 Volume:
Massive amounts of data, ranging from terabytes to yottabytes.
Growth of Data (Volume)
Data grows from small units (bits, bytes) to massive scales, as shown in the
growth path:
Bits → Bytes → Kilobytes → Megabytes → Gigabytes → Terabytes →
Petabytes → Exabytes → Zettabytes → Yottabytes
Velocity: The speed at which data is generated and processed, from batch to
real-time streams.
Variety: Diversity in data sources and formats (structured, unstructured—like
text, video, databases, etc.).
Unit sizes:
Bit: 0 or 1
Byte: 8 bits
Kilobyte: 1,024 bytes
Megabyte: 1,024² bytes
Gigabyte: 1,024³ bytes
Terabyte: 1,024⁴ bytes
Petabyte: 1,024⁵ bytes
Exabyte: 1,024⁶ bytes
Zettabyte: 1,024⁷ bytes
Yottabyte: 1,024⁸ bytes
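The scale above can be walked programmatically; a small Python sketch that repeatedly divides by 1,024 until the value fits the next unit (the abbreviations are this sketch's own labels, not from the text):

```python
# Each step up the scale is 1,024 times the previous unit.
UNITS = ["bytes", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Render a byte count using the largest unit that keeps the value < 1,024."""
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

print(human_readable(1024 ** 4))      # 1.00 TB
print(human_readable(5 * 1024 ** 7))  # 5.00 ZB
```

Expressing raw byte counts this way makes the jump from gigabytes to zettabytes easy to grasp.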
6.1.1 Where is Big Data Generated?
Big data can be generated from a multitude of sources, both internal and
external:
Files and Documents: XLS, DOC, PDF files (often unstructured).
Multimedia: Video (YouTube), audio, social media.
Communication: Chat messages, customer feedback forms.
Other Examples: CCTV footage, weather forecasts, mobile data.
Internal Data Sources in Organizations
Data Storage: File systems, relational databases (RDBMS like Oracle, MS SQL
Server, MySQL, PostgreSQL), NoSQL databases (MongoDB, Cassandra).
Archives: Scanned documents, customer records, patient health records, student
records, etc.
SOURCES OF BIG DATA
Big data is therefore characterized by its large volume, high velocity, and multiple
varieties, and is generated from numerous sources—ranging from structured databases to
unstructured social media, videos, and organizational records.
6.2 Velocity
Velocity describes the speed at which data is generated, collected, and needs to be
processed. In the past, data used to be processed in batches—meaning all data was
collected over a period and then analyzed together (for example, payroll
calculations). Today, data increasingly needs to be processed in real time, or near
real time, as it arrives. This evolution is summarized as:
Batch → Periodic → Near real time → Real-time processing
Modern organizations now expect their systems to process and respond to data
instantly or within seconds, rather than waiting for slow, scheduled processing.
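The batch-to-real-time shift above can be contrasted in a few lines of Python; the readings are hypothetical, and a real streaming system would consume an unbounded feed rather than a fixed list:

```python
# Batch: collect all records first, then process them in one pass.
readings = [12, 15, 11, 14, 13]            # hypothetical sensor values
batch_total = sum(readings)                # runs only after all data has arrived

# Streaming: update results record by record, as each value arrives.
running_total = 0
alerts = []
for i, reading in enumerate(readings, 1):  # imagine one reading per second
    running_total += reading
    if reading > 14:                       # react immediately to this record
        alerts.append((i, reading))

print(batch_total, running_total, alerts)  # 65 65 [(2, 15)]
```

Both paths reach the same total, but only the streaming loop could raise the alert the moment reading 2 arrived, which is the point of real-time processing.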
6.3 Variety
Variety refers to the diversity of data types and sources that organizations must handle. It
is categorized into three types:
Structured Data:
Highly organized and easily searchable (e.g., data stored in relational databases like RDBMS,
traditional transaction processing systems).
Semi-Structured Data:
Not as rigidly organized, but contains tags or markers to separate elements (e.g., HTML, XML).
Unstructured Data:
No predefined structure. Examples include text documents, emails, audios, videos, social media
posts, PDFs, and photos. Unstructured data is the most challenging but also the biggest source of
insights.
Variety means organizations must be able to manage everything from traditional database
records and spreadsheets to social media posts, sensor logs, images, and more—all of
which may require specialized processing techniques.
7. WHY BIG DATA?
Big data is important because the more data we have for analysis, the more
accurate our analytical results become. This increased accuracy boosts our
confidence in the decisions we make based on these analyses. With this greater
confidence, organizations can realize significant positive outcomes, namely:
Enhanced operational efficiency
Reduced costs
Less time spent on processes
Increased innovation in developing new products and services
Optimization of existing offerings
The process can be visualized as a sequence:
More data → More accurate analysis → Greater confidence in decision
making → Greater operational efficiencies, cost reduction, time reduction,
new product development, and optimized offerings
8. TRADITIONAL BUSINESS INTELLIGENCE VS BIG
DATA
1. Data Storage and Architecture
Traditional BI:
All enterprise data is stored on a central server (usually on a single or a few large
database servers).
Big Data:
Data is stored in a distributed file system (spread across many servers or nodes).
Distributed systems can scale “horizontally” by adding more servers (nodes),
rather than making a single server bigger (“vertical” scaling).
2. Data Analysis Mode
Traditional BI:
Data analysis usually happens in offline mode, meaning data is collected and then
analyzed at a later time (batch processing).
Big Data:
Analysis can happen both in real time and in offline (batch) mode.
3. Data Type and Processing Method
Traditional BI:
Deals mostly with structured data (data that fits neatly into tables, like
databases). The typical approach is to move data to the processing
function (“move data to code”).
Big Data:
Handles all types of data: structured, semi-structured, and unstructured (such
as logs, images, social media text, etc.). In Big Data systems, it is more common
to move the processing function to where the data is (“move code to data”).
9. TYPICAL DATA WAREHOUSE ENVIRONMENT
Step 1: Data Collection (Sources)
Data comes from different systems inside and outside the company,
such as:
ERP systems (finance, HR, inventory)
CRM systems (customer details, sales info)
Old legacy systems (still in use)
Third-party apps (external software)
This data can be in many formats:
Databases (Oracle, SQL Server, MySQL)
Excel sheets
Text/CSV files
Step 2: Data Integration (ETL Process)
Since data comes in different formats, it must be:
Extracted (taken out from sources)
Transformed (cleaned and converted into a common format)
Loaded (sent into the warehouse)
This process is called ETL.
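The three ETL steps can be sketched end to end in Python, with hypothetical sources and an in-memory SQLite table standing in for the warehouse:

```python
import csv
import io
import sqlite3

# Extract: two sources in different formats (hypothetical sample data).
csv_source = "name,amount\nAsha,1200\nRavi,950\n"
dict_source = [{"name": "Meena", "amount": "780"}]
rows = list(csv.DictReader(io.StringIO(csv_source))) + dict_source

# Transform: clean into one common format (trimmed names, integer amounts).
clean = [(r["name"].strip(), int(r["amount"])) for r in rows]

# Load: insert the cleaned rows into the warehouse table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE sales (name TEXT, amount INTEGER)")
warehouse.executemany("INSERT INTO sales VALUES (?, ?)", clean)

total = warehouse.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)  # 2930
```

Once loaded, the warehouse can answer queries across all sources at once, which no single source could do on its own.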
Step 3: Loading Data
After ETL, the cleaned data goes into the Data Warehouse.
It’s stored at the enterprise level (for the whole company).
Sometimes, smaller warehouses called Data Marts are made for
specific teams (like sales, HR).
Step 4: Business Intelligence & Analytics
Once data is ready in the warehouse, companies can use tools to:
Run quick queries (questions to the database)
Create dashboards (visual summaries)
Do data mining (find patterns and trends)
Generate reports
This helps managers make smarter, faster, data-driven decisions.
A Data Warehouse is like a super-organized library of business
data.
Collect from different sources → Clean & combine (ETL) →
Store in warehouse → Analyze using BI tools.
10. TYPICAL HADOOP ENVIRONMENT
Differences Between Hadoop and Data Warehousing
1. Sources and Type of Data
Hadoop:
Collects data from a wide and diverse set of sources—web logs, images, videos,
social media content (Twitter, Facebook, etc.), documents, PDFs, and more. It is
designed to handle not just structured data, but also semi-structured and
unstructured data. This includes data both within and outside the company's
firewall.
Data Warehouse:
Traditionally focuses on structured data from well-defined business applications
like ERP, CRM, or legacy systems, typically within the organization's boundaries.
2. Storage Mechanism
Hadoop:
Uses the Hadoop Distributed File System (HDFS) to store data reliably across
many servers. Data of various types and sizes is kept in this distributed file
system, which is highly scalable and fault tolerant.
Data Warehouse:
Uses relational databases or similar systems where data is stored in tables with
fixed schemas (rows and columns).
3. Processing and Output
Hadoop:
Processing is done via MapReduce, a programming model that allows massive
scalability by dividing tasks across multiple nodes. After processing, data can be
sent to different destinations: back to operational systems, to data warehouses,
data marts, or operational data stores (ODS) for further analysis.
Data Warehouse:
Processing is done using SQL queries, and data is mostly kept in place for
analytics and reporting.
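The MapReduce model mentioned above can be sketched as a single-process word count in Python; in a real Hadoop cluster the map calls run in parallel on many nodes, and a shuffle phase groups keys between map and reduce:

```python
from collections import defaultdict

# Map: each "node" turns its chunk of text into (word, 1) pairs.
def map_phase(chunk):
    return [(word.lower(), 1) for word in chunk.split()]

# Reduce: group the pairs by key and sum each group's counts.
def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    return dict(counts)

chunks = ["big data needs big tools", "hadoop stores big data"]
mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
print(reduce_phase(mapped))  # {'big': 3, 'data': 2, ...}
```

Because each chunk is mapped independently, the work scales out horizontally: more chunks simply mean more nodes running `map_phase` at the same time.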
4. Integration and Use
In Hadoop environments, data can flow from many types of source
systems (logs, media, social platforms, documents) into Hadoop, where it is
processed and then routed to the relevant business destinations (operational
systems, warehouses, marts, ODS) for final use or reporting.
Hadoop is built for handling massive volumes and varieties of data (structured
and unstructured, internal and external), storing it in a distributed fashion with
flexible processing pipelines.
Traditional Data Warehousing excels at managing structured, business-
critical data in centralized, organized storage.