1.
(a) With the help of a diagram, explain the real-time
data warehouse architecture. Also, discuss its trade-off.
A Real-Time Data Warehouse (RTDW) is a data warehousing system that updates data
continuously (or in near real time) as changes occur in the source systems. Unlike traditional
data warehouses, which update data in batches (e.g., nightly or weekly), an RTDW provides
up-to-the-minute analytics.
Real-time data warehousing is used in scenarios such as:
Fraud detection
Stock trading
E-commerce recommendations
Real-time dashboards
Key Components Explained:
1. Operational Source Systems:
These are real-time transactional systems like:
CRM
Banking systems
Online shopping platforms
They generate constant data streams that must be processed instantly.
2. Change Data Capture (CDC):
CDC detects changes in source systems (insert/update/delete) and captures them as events.
Techniques include:
Log-based capture
Trigger-based methods
Timestamp comparison
3. Real-Time ETL Engine:
Unlike batch ETL, real-time ETL runs continuously:
Extracts and transforms small volumes of data on-the-fly
Maintains data quality and consistency in real-time
Manages low-latency transformations
4. Staging Area (Optional):
Acts as a buffer:
Holds transient data temporarily
Cleanses and filters before loading into DW
Can help in load balancing and handling data bursts
5. Real-Time Data Warehouse:
The central repository that supports:
Low-latency write operations
High-speed querying
Partitioning and indexing for fast access
Technologies: Snowflake, Google BigQuery, Amazon Redshift, Apache Druid.
6. Metadata Repository:
Stores data definitions, transformation rules, data lineage, and operational metadata. Helps in
synchronization and auditing.
7. Business Intelligence Layer:
Provides:
Dashboards (Power BI, Tableau)
Real-time reports
Alerts (e.g., for fraud detection)
Decision-making support
Trade-offs in Real-Time Data Warehousing
| Aspect | Real-Time DW | Traditional DW | Trade-Offs |
|---|---|---|---|
| Latency | Seconds to minutes | Hours to days | Low latency gives quick insights but may impact performance. |
| ETL Complexity | High | Moderate | Continuous ETL is harder to manage and debug. |
| System Cost | High (due to streaming, high IOPS) | Lower | Real-time systems need high-performance hardware/software. |
| Data Quality | May be compromised due to speed | Higher with batch cleansing | Need for real-time validation and cleansing. |
| Historical Analysis | Limited unless integrated with batch warehouse | Strong | Needs hybrid architecture for deep historical + real-time analysis. |
| Error Recovery | Difficult in real-time | Easier in batch | In real-time, errors must be handled instantly. |
| Architecture Complexity | Very High | Simpler | More moving parts (CDC, streaming, micro-batching, alerts). |
(b) Describe the fact constellation schema. How is it different
from the snowflake schema? With the help of an example,
explain the fact constellation schema. Mention its advantages and
limitations.
A Fact Constellation Schema (also called Galaxy Schema) is a multifaceted data warehouse
schema that contains two or more fact tables sharing common dimension tables.
It is used to represent complex data warehouse systems involving multiple business
processes.
🔸 Key Features:
Involves multiple fact tables
Shared dimension tables (e.g., time, customer)
Represents a collection of star schemas
Used for modular and complex data models
Explanation of the Example:
Two fact tables: Sales Fact and Shipping Fact
Both share common dimensions like:
o Time Dimension
o Product Dimension
Each has its own unique dimensions:
o Sales Fact uses Store Dimension
o Shipping Fact uses Customer Dimension
This shows a constellation of facts surrounded by a galaxy of shared and unique dimensions.
Difference Between Fact Constellation and Snowflake Schema:
| Aspect | Fact Constellation Schema | Snowflake Schema |
|---|---|---|
| Definition | Multiple fact tables with shared dimensions | Single fact table with normalized dimensions |
| Complexity | Higher due to multiple facts | Moderate due to normalization |
| Use Case | Multiple related business processes | Organized dimension modeling |
| Dimension Tables | Shared across fact tables | Split into sub-dimensions |
| Schema Type | Collection of star schemas | Variant of star schema |
| Query Performance | May be slower due to joins | May also be slower than star due to normalized dims |
Advantages of Fact Constellation Schema
1. Reusability: Common dimensions are reused across multiple facts.
2. Supports multiple business processes: E.g., sales, shipping, returns, etc.
3. Scalability: Can model complex data warehouse scenarios.
4. Modular design: New fact tables can be added with minimal effort.
Limitations of Fact Constellation Schema
1. Complex queries: Requires handling joins across multiple fact and dimension tables.
2. Difficult to design: Schema becomes complex as number of facts increases.
3. Performance issues: Due to large number of joins and complex relationships.
4. Maintenance overhead: More tables to manage and optimize.
(c) Write and explain K-NN classification algorithm in
Data-Mining. Discuss its advantages and disadvantages.
K-Nearest Neighbors (K-NN) is a supervised learning algorithm used for both classification
and regression, but more commonly for classification.
It is based on the principle that data points with similar features exist in close proximity.
In simple terms, K-NN classifies a new data point based on the majority class among its ‘K’
nearest neighbors.
Working of K-NN Classification Algorithm:
Step-by-step Explanation:
1. Choose the value of K:
o This is the number of nearest neighbors to consider (e.g., K = 3, 5, etc.)
2. Compute distance between the new data point and all other points in the training dataset.
o Common distance metrics:
Euclidean distance
Manhattan distance
Minkowski distance
3. Sort the distances in ascending order and find the K closest neighbors.
4. Count the number of occurrences of each class among the K neighbors.
5. Assign the class label to the new data point that is most frequent among its K neighbors.
🔹 Euclidean Distance Formula
For two points X = (x₁, x₂, ..., xₙ) and Y = (y₁, y₂, ..., yₙ):

d(X, Y) = √[(x₁ − y₁)² + (x₂ − y₂)² + ... + (xₙ − yₙ)²]
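The classification steps above can be sketched in Python. This is a minimal illustration on a made-up 2-D dataset (the points and labels are invented for demonstration):

```python
import math
from collections import Counter

def euclidean(x, y):
    # d(X, Y) = sqrt(sum of squared coordinate differences)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knn_classify(train, query, k=3):
    # train: list of (point, label) pairs
    # Steps 2-3: compute and sort distances to all training points
    dists = sorted((euclidean(p, query), label) for p, label in train)
    # Step 4: count class occurrences among the K nearest neighbours
    votes = Counter(label for _, label in dists[:k])
    # Step 5: assign the majority class
    return votes.most_common(1)[0][0]

train = [((1, 1), 'A'), ((1, 2), 'A'), ((2, 1), 'A'),
         ((5, 5), 'B'), ((6, 5), 'B')]
print(knn_classify(train, (1.5, 1.5), k=3))  # 'A' (all 3 nearest points are class A)
```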
Advantages of K-NN:

| Advantage | Description |
|---|---|
| Simple & Easy to Implement | No training phase; easy to code and understand |
| Non-parametric | No assumptions about data distribution |
| Versatile | Can be used for classification and regression |
| Real-time Learning | K-NN can adapt as new data is added (lazy learning) |
| Effective in multi-class problems | Handles multiple class labels well |
Disadvantages of K-NN:

| Disadvantage | Description |
|---|---|
| Computationally Expensive | Must compute distance to all training data for each query |
| Slow with large datasets | Memory and time complexity increase with data size |
| Sensitive to Irrelevant Features | No feature selection; all features are equally weighted |
| Requires Feature Scaling | Distance metrics are affected by scale; normalization is needed |
| Affected by Imbalanced Data | May bias toward frequent classes in imbalanced datasets |
(d) Discuss vector space modeling for representing text
documents. With reference to this modeling, explain TF-
IDF
and Inverse Document Frequency (IDF).
The Vector Space Model (VSM) is a mathematical model used to represent text documents as
vectors in a multi-dimensional space.
Each dimension corresponds to a term (word) from the document corpus, and the value along
that dimension indicates the importance of the term in that document.
It is used in:
Information Retrieval (IR)
Text Mining
Document Clustering
Text Classification
How Vector Space Model Works
🔸 Key Idea:
Treat each document as a vector of term weights.
Terms are typically stemmed, stop words removed, and frequency counted.
Documents are then compared using cosine similarity or other distance measures.
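Cosine similarity, the usual comparison measure in the VSM, can be computed directly from two term-weight vectors. The vectors below are invented for illustration:

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (|u| * |v|)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two toy document vectors over the term axes ("data", "mining", "warehouse")
d1 = [3, 1, 0]
d2 = [2, 0, 1]
print(round(cosine_similarity(d1, d2), 3))  # high value: the documents share terms
```

A similarity of 1.0 means identical direction (same term proportions); 0 means no terms in common.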
TF-IDF: Term Frequency–Inverse Document Frequency
TF-IDF is a statistical measure used in VSM to evaluate how important a word is to a
document in a collection.
It balances:
Term Frequency (TF): How often a term appears in a document
Inverse Document Frequency (IDF): How rare the term is across all documents
🔹 Formula:
TF-IDF(t, d) = TF(t, d) × IDF(t)
Term Frequency (TF)
TF measures how frequently a term t occurs in a document d.
Number of times term t appears in d
TF(t,d)= ----------------------------------------------------
Total number of terms in d
👉 Example:
If the word "data" appears 3 times in a document with 100 words:
TF(data) = 3 / 100 = 0.03
Inverse Document Frequency (IDF)
IDF measures how important (i.e., rare) a term is across all documents.
Rare terms get higher IDF scores.

IDF(t) = log₁₀(N / df(t))

Where:
N = Total number of documents
df(t) = Number of documents containing term t
👉 Example:
If "data" appears in 1000 out of 5000 documents:
Putting It All Together (TF-IDF Example):
Let’s say:
TF(data, D1) = 0.1
IDF(data) = 0.699
Then,
TF-IDF(data, D1) = 0.1 × 0.699 = 0.0699
This means the term "data" is moderately important in document D1.
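The worked example can be reproduced with a short Python sketch (log base 10, matching the numbers above):

```python
import math

def tf(term_count, total_terms):
    # TF(t, d) = occurrences of t in d / total terms in d
    return term_count / total_terms

def idf(n_docs, docs_with_term):
    # IDF(t) = log10(N / df(t))
    return math.log10(n_docs / docs_with_term)

def tf_idf(term_count, total_terms, n_docs, docs_with_term):
    return tf(term_count, total_terms) * idf(n_docs, docs_with_term)

# "data" appears in 1000 of 5000 documents; TF in D1 is 0.1 (e.g., 10 of 100 terms)
print(round(idf(5000, 1000), 3))               # 0.699
print(round(tf_idf(10, 100, 5000, 1000), 4))   # 0.0699
```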
Advantages of Vector Space Model & TF-IDF:
| Advantage | Description |
|---|---|
| Simple & Effective | Easy to implement and widely used |
| Quantitative Comparison | Allows similarity computation (e.g., cosine similarity) |
| Relevance-based Ranking | TF-IDF boosts rare but meaningful terms |
| No Need for Linguistic Knowledge | Works purely based on statistics |
2. (a) Discuss various factors to be considered to improve the performance of
ETL.
ETL (Extract, Transform, Load) is the critical process in data warehousing used to:
Extract data from various sources,
Transform it into the desired format, and
Load it into the target Data Warehouse.
Due to the large volumes of data, the performance of ETL processes directly affects the
efficiency, freshness, and usability of data.
Improving ETL performance involves optimizing each stage of the process.
Key Factors to Improve ETL Performance:
🔹 1. Source System Optimization
Minimize the load on the source system during extraction.
Use incremental data extraction (e.g., Change Data Capture - CDC) rather than full
extraction.
Avoid complex queries and joins on the source.
Use indexes in source databases to speed up data retrieval.
🔹 2. Use of Parallelism and Partitioning
Parallel Processing: Split data into chunks and process them simultaneously.
Pipeline Parallelism: Run multiple ETL stages (extract, transform, load) concurrently.
Data Partitioning: Divide data logically (e.g., by region, date) to improve throughput.
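Partitioning plus parallel transformation can be sketched with Python's standard library. The chunk size and the uppercase "transformation" are placeholders for real ETL logic:

```python
from concurrent.futures import ThreadPoolExecutor

def transform(chunk):
    # Stand-in transformation step (real logic would clean/convert each row)
    return [row.upper() for row in chunk]

rows = ['a', 'b', 'c', 'd', 'e', 'f']
# Data partitioning: split the input into fixed-size chunks
chunks = [rows[i:i + 2] for i in range(0, len(rows), 2)]
# Parallel processing: transform the chunks concurrently
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(transform, chunks))
# Reassemble the output in order (pool.map preserves chunk order)
flat = [row for chunk in results for row in chunk]
print(flat)  # ['A', 'B', 'C', 'D', 'E', 'F']
```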
🔹 3. Efficient Transformation Logic
Minimize complex transformation operations such as nested IF, JOIN, and CASE
statements.
Push simple transformations to the database layer using SQL.
Use optimized data types and avoid row-by-row operations (prefer bulk operations).
Leverage lookup caches instead of repetitive lookups.
🔹 4. Staging Area Usage
Use a staging database to temporarily store extracted raw data before transformation.
This allows for:
o Isolation from source systems
o Easy recovery from failures
o Intermediate data validation
🔹 5. Optimized Loading Strategies
Use bulk/batch loading instead of row-by-row inserts.
Disable indexes and constraints during load and re-enable after.
Use truncate-and-load where applicable instead of slow update/delete/insert logic.
Use load balancing if loading into distributed data warehouses.
🔹 6. Data Volume and Scheduling Management
Filter unnecessary data early in the process.
Use incremental loads wherever possible.
Schedule ETL jobs during off-peak hours to reduce contention.
Monitor and limit resource usage (CPU, I/O, memory).
🔹 7. Error Handling and Logging
Use error logs and recovery mechanisms to handle failures efficiently.
Avoid reprocessing the entire dataset for minor errors.
Design the ETL to resume from the point of failure.
🔹 8. Metadata and Dependency Management
Use metadata to manage mappings, data lineage, and schema changes.
Automate dependency tracking for data pipelines to avoid unnecessary reprocessing.
🔹 9. Hardware and Infrastructure
Deploy ETL on high-performance servers with sufficient memory, I/O, and CPU.
Use distributed or cloud-based ETL platforms like Apache Spark, Talend, or AWS
Glue for scalability.
(b) Define a Data Mart. How is it different from a centralized data warehouse ?
Discuss the structure of Data Mart with the help of an example.
A Data Mart is a subset of a data warehouse that is focused on a specific business line,
department, or team within an organization.
It is designed to meet the particular needs of a specific group of users, such as:
Sales
Marketing
Finance
HR
🔹 Definition:
“A Data Mart is a subject-oriented, department-specific data repository that stores summarized
or detailed data to serve a particular community of knowledge workers.”
Difference Between Data Mart and Centralized Data Warehouse:
| Aspect | Data Mart | Data Warehouse |
|---|---|---|
| Scope | Department-level (e.g., sales, finance) | Enterprise-wide |
| Data Volume | Smaller | Very large |
| Data Source | Usually derived from a data warehouse | Consolidated from multiple heterogeneous sources |
| Complexity | Low | High |
| Implementation Time | Shorter (weeks) | Longer (months to years) |
| Maintenance | Easier | Complex |
| Cost | Less expensive | High |
| Speed | Faster for specific queries | May be slower for broad queries |
Types of Data Marts
1. Dependent Data Mart:
o Sourced from the centralized data warehouse
o Ensures consistency and central control
2. Independent Data Mart:
o Sourced directly from operational systems or external data
o No dependency on a central warehouse
3. Hybrid Data Mart:
o Combines data from both warehouse and external sources
Structure of a Data Mart
🔸 Basic Architecture of a Data Mart:
Example: Sales Data Mart
📌 Use Case:
The Sales Department wants to analyze sales performance by region, product, and time.
🔸 Components:
Fact Table: Sales_Fact
o Attributes: Sales_ID, Date_ID, Product_ID, Region_ID, Revenue,
Quantity_Sold
Dimension Tables:
o Time_Dim (Date_ID, Day, Month, Year)
o Product_Dim (Product_ID, Name, Category, Price)
o Region_Dim (Region_ID, State, Country)
Advantages of Data Marts:
| Advantage | Description |
|---|---|
| Faster Access | Optimized for specific departmental queries |
| Simpler Design | Easier to design, implement, and manage |
| Lower Cost | Less hardware and development cost |
| Custom Solutions | Tailored for department needs |
| Security | Limits access to relevant data only |
(c) Discuss the following OLAP data cube operations :
(i) Roll-up (ii) Drill-down.
OLAP (Online Analytical Processing) enables analysts and decision-makers to interactively
analyze multidimensional data stored in a data cube format.
A data cube organizes data into dimensions (e.g., time, location, product) and facts (e.g., sales,
profit).
✅ (i) ROLL-UP (2.5 Marks)
🔹 Definition:
Roll-up is an OLAP operation that performs data aggregation by climbing up a concept
hierarchy or by reducing dimensions.
It allows viewing data at a more summarized or abstract level.
🔸 Example:
Assume a data cube with dimensions:
Product → Brand → Category
Location → City → State → Country
Time → Day → Month → Quarter → Year
If we roll-up on the Location dimension from City to State, we are aggregating sales data from
individual cities to the state level.
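The roll-up along a concept hierarchy can be sketched in plain Python. The city sales figures and the City → State mapping below are illustrative:

```python
from collections import defaultdict

# City-level sales measures (made-up numbers) and a City -> State hierarchy
city_sales = {'Mumbai': 120, 'Pune': 80, 'Delhi': 150, 'Noida': 60}
city_to_state = {'Mumbai': 'Maharashtra', 'Pune': 'Maharashtra',
                 'Delhi': 'Delhi NCR', 'Noida': 'Delhi NCR'}

# Roll-up: aggregate the City-level measures up to the State level
state_sales = defaultdict(int)
for city, sales in city_sales.items():
    state_sales[city_to_state[city]] += sales
print(dict(state_sales))  # {'Maharashtra': 200, 'Delhi NCR': 210}
```

Drill-down is simply the inverse: returning to the stored city-level figures from the state-level summary.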
✅ (ii) DRILL-DOWN (2.5 Marks)
🔹 Definition:
Drill-down is the opposite of roll-up. It allows users to view data at a more detailed or
granular level, by:
Descending the hierarchy
Adding a new dimension
🔸 Example:
Starting from Year-level sales data, if we drill down to Quarter, then Month, we get more
detailed insights.
3. (a) Discuss the following Data Integration issues :
(i) Schema Integration and object modeling
(ii) Redundancy
(iii) Detection and Resolution of Data Value Conflicts.
Data Integration is the process of combining data from multiple heterogeneous sources into a
single, unified view. It is a critical step in building a data warehouse or data mart.
However, integration poses several challenges due to differences in:
Data formats
Semantics
Naming conventions
Schema structures
Value inconsistencies
✅ (i) Schema Integration and Object Modeling
🔹 Schema Integration:
Schema integration refers to the process of merging schemas (i.e., structures) from multiple
data sources into a coherent global schema.
Challenges:
Different naming conventions (e.g., Cust_ID vs. CustomerID)
Different data types for the same field (e.g., Date vs. String)
Different levels of normalization
Missing or extra attributes
Example:
One source has: Employee(Name, Dept, Salary)
Another source has: Emp(Name, DepartmentName, MonthlySalary)
Schema integration must map Dept ↔ DepartmentName and Salary ↔ MonthlySalary.
🔹 Object Modeling:
Object modeling involves using data models (like ER diagrams, UML models) to:
Define entities, attributes, and relationships
Capture data semantics
Resolve structural differences between source schemas
It ensures that data from different sources conform to a common structure and meaning.
✅ Benefits of Schema Integration & Object Modeling:
Provides a unified view of enterprise data
Simplifies querying and analysis
Enables automated or semi-automated ETL pipelines
✅ (ii) Redundancy
🔹 Definition:
Redundancy in data integration occurs when the same data appears in multiple source
systems.
Redundancy can be:
Syntactic (exact same data values)
Semantic (same real-world meaning, different representation)
🔹 Problems Caused by Redundancy:
Inconsistent values (e.g., one source says salary is ₹50,000, another says ₹55,000)
Increased storage costs
Incorrect aggregations (e.g., duplicate rows inflating totals)
Conflicting updates when integrated systems allow changes
🔹 Solutions:
Use primary keys and matching logic to detect duplicates
Employ data deduplication tools
Define master data sources and discard redundant versions
Use record linkage and entity resolution techniques
✅ (iii) Detection and Resolution of Data Value Conflicts
🔹 Definition:
Data value conflict arises when the same real-world entity has different values for the same
attribute across different sources.
🔸 Examples of Conflicts:
| Attribute | Source A | Source B | Conflict Type |
|---|---|---|---|
| Salary | ₹50,000 | ₹55,000 | Numeric discrepancy |
| Gender | "M" | "Male" | Format inconsistency |
| Date | 01/02/2023 | 02/01/2023 | Locale format conflict |
| Address | "123 MG Road" | "123 Mahatma Gandhi Rd" | Semantic difference |
🔹 Types of Conflicts:
1. Representation conflicts (e.g., different date formats)
2. Scaling conflicts (e.g., salary in INR vs. USD)
3. Missing or default values
4. Measurement unit differences
5. Semantically equivalent but differently labeled values
🔹 Detection Techniques:
Use data profiling to understand patterns and outliers
Apply match rules, lookup tables, and pattern recognition
Apply statistical analysis to detect value ranges
🔹 Resolution Strategies:
Use data standardization rules (e.g., unify date formats)
Apply domain-specific rules (e.g., prefer most recent value)
Use confidence scores or source trust levels
Consult business rules or data stewards
Use ETL scripts to transform and reconcile values
(b) Write and explain the Decision Tree Classifier algorithm with the help of an
example.
A Decision Tree is a supervised learning algorithm used for classification and prediction
tasks.
It models decisions and their possible consequences in a tree-like structure.
Each internal node represents a test on an attribute,
each branch represents an outcome of the test,
and each leaf node represents a class label (decision).
Steps in Building a Decision Tree:-
The algorithm follows a top-down, recursive approach known as “divide and conquer”.
Common algorithms include:
ID3 (Iterative Dichotomiser 3)
C4.5 (Successor to ID3)
CART (Classification and Regression Trees)
🔹 General Steps:
1. Select the best attribute to split the data based on some criterion (e.g., Information
Gain, Gini Index).
2. Partition the dataset into subsets based on the attribute's values.
3. Repeat the process recursively for each subset until:
o All records in a subset belong to the same class, or
o No attributes are left to split on.
Splitting Criteria
🔸 (a) Information Gain (ID3 Algorithm)
Information Gain = Entropy(Parent) − Σ Weighted Entropy(Children)
Entropy quantifies impurity in the data.
🔸 (b) Gini Index (CART Algorithm)
Gini(D) = 1 − Σ pᵢ²
Lower Gini means a purer node.
Example of Decision Tree (ID3 Algorithm)
🧠 Problem: Predict whether a person will buy a computer based on Age and
Income.
| Age | Income | Buys_Computer |
|---|---|---|
| <30 | High | No |
| <30 | Medium | Yes |
| 30–40 | High | Yes |
| >40 | Low | Yes |
| >40 | Medium | Yes |
| >40 | High | No |
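The ID3 attribute selection for this dataset can be carried out with a short sketch that computes entropy and information gain (the dictionary keys are just illustrative column names):

```python
import math
from collections import Counter

def entropy(labels):
    # Entropy(D) = -sum(p_i * log2(p_i)) over the class proportions
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(rows, attr, target):
    # Information Gain = Entropy(parent) - sum of weighted child entropies
    parent = entropy([r[target] for r in rows])
    weighted = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return parent - weighted

data = [
    {'Age': '<30',   'Income': 'High',   'Buys': 'No'},
    {'Age': '<30',   'Income': 'Medium', 'Buys': 'Yes'},
    {'Age': '30-40', 'Income': 'High',   'Buys': 'Yes'},
    {'Age': '>40',   'Income': 'Low',    'Buys': 'Yes'},
    {'Age': '>40',   'Income': 'Medium', 'Buys': 'Yes'},
    {'Age': '>40',   'Income': 'High',   'Buys': 'No'},
]
for attr in ('Age', 'Income'):
    print(attr, round(info_gain(data, attr, 'Buys'), 3))
# Income has the higher gain, so ID3 would split on Income first
```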
Advantages of Decision Trees:
| Advantage | Description |
|---|---|
| Easy to understand | Visual and interpretable |
| No need for data normalization | Can handle numeric and categorical attributes |
| Works well with missing values | Can split on available attributes |
| Performs feature selection | Automatically selects important variables |
Disadvantages of Decision Trees:
| Disadvantage | Description |
|---|---|
| Overfitting | May create overly complex trees |
| Instability | Small changes in data can cause large changes in the tree |
| Bias | Towards attributes with more levels |
| Lower accuracy | Compared to ensemble methods like Random Forests |
4.(a) Discuss Linear Discriminant Analysis (LDA) and Principal Component
Analysis (PCA) feature extraction techniques.
Feature extraction is the process of transforming raw data into a reduced set of informative
features while preserving essential information.
This helps in:
Reducing dimensionality
Improving model performance
Avoiding overfitting
Visualizing data
Two popular feature extraction techniques are:
LDA (Linear Discriminant Analysis) – supervised
PCA (Principal Component Analysis) – unsupervised
Principal Component Analysis (PCA)
🔹 Definition:
PCA is an unsupervised statistical technique that transforms the original set of variables into a
new set of orthogonal (uncorrelated) variables called principal components, ordered by the
amount of variance they explain.
🔹 Key Characteristics:
Captures maximum variance in the data
Does not use class labels
Helps in dimensionality reduction by keeping top-k components
🔹 Steps in PCA:
1. Standardize the data
2. Compute the covariance matrix
3. Calculate eigenvalues and eigenvectors
4. Select top k eigenvectors (principal components)
5. Transform the original data using these components
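The five steps above can be sketched with NumPy on a small made-up 2-D dataset (mean-centring stands in for full standardization here):

```python
import numpy as np

# Toy 2-D data (hypothetical measurements, e.g. height/weight pairs)
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2],
              [3.1, 3.0], [2.3, 2.7], [2.0, 1.6], [1.0, 1.1]])

# 1. Standardize (here: mean-centre) the data
Xc = X - X.mean(axis=0)
# 2. Compute the covariance matrix
cov = np.cov(Xc, rowvar=False)
# 3. Eigenvalues and eigenvectors (eigh: covariance matrices are symmetric)
eigvals, eigvecs = np.linalg.eigh(cov)
# 4. Select the top-k eigenvectors (largest eigenvalue first)
order = np.argsort(eigvals)[::-1]
top_k = eigvecs[:, order[:1]]
# 5. Project the original data onto the principal component
projected = Xc @ top_k
print(projected.shape)  # (8, 1): 2-D data reduced to 1-D
```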
🔹 Example:
Given 2D data (Height, Weight), PCA might find a new axis (PC1) that best explains the
variance between individuals.
🔹 Applications:
Image compression
Noise reduction
Data visualization (e.g., projecting to 2D)
🔹 Advantages of PCA:
Reduces dimensionality while preserving variance
Improves speed and accuracy of models
Handles correlated features effectively
🔹 Limitations:
Does not consider class labels
Components are difficult to interpret
Assumes linear relationships
Linear Discriminant Analysis (LDA)
🔹 Definition:
LDA is a supervised feature extraction technique that projects data onto a lower-dimensional
space such that the class separability is maximized.
It finds a linear combination of features that best separates two or more classes.
🔹 Key Characteristics:
Uses class labels
Maximizes the between-class variance and minimizes the within-class variance
Often used in classification tasks
🔹 Steps in LDA:
1. Compute the mean vectors for each class
2. Compute within-class and between-class scatter matrices
3. Calculate eigenvectors and eigenvalues for scatter matrices
4. Select the top linear discriminants
5. Project the data onto the new subspace
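For the two-class case, Fisher's discriminant has a closed form, w = S_W⁻¹(m₁ − m₂), which the steps above reduce to. A sketch with invented 2-D data:

```python
import numpy as np

# Toy two-class data (hypothetical 2-D feature vectors)
class_a = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2]])
class_b = np.array([[4.0, 4.5], [4.2, 4.0], [3.8, 4.8]])

# 1. Mean vector of each class
m_a, m_b = class_a.mean(axis=0), class_b.mean(axis=0)

# 2. Within-class scatter matrix: sum over classes of (x - m)(x - m)^T
def scatter(X, m):
    d = X - m
    return d.T @ d

S_W = scatter(class_a, m_a) + scatter(class_b, m_b)
# 3-4. For two classes, the best discriminant direction is w = S_W^-1 (m_a - m_b)
w = np.linalg.solve(S_W, m_a - m_b)
# 5. Project the data onto w; the classes separate along this axis
proj_a, proj_b = class_a @ w, class_b @ w
print(proj_b.max() < proj_a.min() or proj_a.max() < proj_b.min())  # True: no overlap
```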
🔹 Example:
In a dataset with two classes (e.g., spam vs. non-spam emails), LDA will find the best line or
plane to separate the two classes based on feature distributions.
🔹 Applications:
Face recognition
Document classification
Bioinformatics (e.g., gene expression analysis)
🔹 Advantages of LDA:
Improves class separability
Better for classification tasks than PCA
Provides interpretable features
🔹 Limitations:
Assumes data is normally distributed
Assumes equal covariance among classes
Not suitable for non-linear boundaries
Comparison: PCA vs. LDA:
| Feature | PCA | LDA |
|---|---|---|
| Type | Unsupervised | Supervised |
| Goal | Maximize data variance | Maximize class separability |
| Uses Class Labels | ❌ No | ✅ Yes |
| Output | Principal components | Linear discriminants |
| Application | Data compression, visualization | Classification, pattern recognition |
(b) Explain Association Rule Mining (ARM). Also, explain the two phases
involved in ARM process with the help of an example.
What is Association Rule Mining (ARM)?
Association Rule Mining (ARM) is a data mining technique used to discover interesting
relationships, patterns, or correlations among items in large datasets.
ARM is commonly used in market basket analysis, where it finds associations like:
If a customer buys bread and butter, they are likely to buy milk.
🔹 Formal Representation of an Association Rule:
A ⇒ B
Where:
A (Antecedent) = If this occurs...
B (Consequent) = Then this is likely to occur
🔹 Key Measures:
Support: Frequency of the itemset in the database
Confidence: Likelihood of B occurring when A has occurred
Lift: Measures correlation between A and B
Two Phases of the ARM Process
Association Rule Mining is typically done in two major phases:
🔸 Phase 1: Frequent Itemset Generation
Identify all frequent itemsets (sets of items that appear together frequently in the
dataset) based on a minimum support threshold.
Algorithms used:
o Apriori (uses candidate generation and pruning)
o FP-Growth (uses tree structure to avoid candidate generation)
🔸 Phase 2: Rule Generation from Frequent Itemsets
Generate strong rules from the frequent itemsets using a minimum confidence
threshold.
Rules of the form A ⇒ B are created by dividing each frequent itemset into antecedent
and consequent.
Only rules that satisfy both support and confidence thresholds are retained.
Example: Market Basket Dataset:
| Transaction ID | Items Purchased |
|---|---|
| T1 | Bread, Butter, Milk |
| T2 | Bread, Butter |
| T3 | Bread, Milk |
| T4 | Butter, Milk |
| T5 | Bread, Butter, Milk, Eggs |
🔸 Phase 1: Frequent Itemset Generation (min support = 60%)
Itemsets and supports:
o {Bread} → 4/5 = 80% ✔️
o {Butter} → 4/5 = 80% ✔️
o {Milk} → 4/5 = 80% ✔️
o {Bread, Butter} → 3/5 = 60% ✔️
o {Bread, Milk} → 3/5 = 60% ✔️
o {Butter, Milk} → 3/5 = 60% ✔️
o {Bread, Butter, Milk} → 3/5 = 60% ✔️
🔸 Phase 2: Rule Generation (min confidence = 70%)
From {Bread, Butter}:
Rule: Bread ⇒ Butter
o Confidence = Support(Bread ∩ Butter) / Support(Bread) = 60% / 80% = 75% ✔️
From {Bread, Butter, Milk}:
Rule: Bread & Butter ⇒ Milk
o Confidence = 60% / 60% = 100% ✔️
From {Bread, Milk}:
Rule: Milk ⇒ Bread
o Confidence = Support(Bread ∩ Milk) / Support(Milk) = 60% / 80% = 75% ✔️
5. Write short notes on the following :
(a) ELT and its need
ELT (Extract, Load, Transform) is a modern data integration approach where:
Data is Extracted from multiple sources
Loaded into a centralized system (typically a cloud data lake or warehouse)
Then Transformed within the target system (e.g., using SQL)
This is a shift from traditional ETL (Extract-Transform-Load), where transformation happens
before loading.
🔹 Why ELT is Needed:
1. Scalability:
o Cloud data warehouses (like BigQuery, Snowflake) can process transformations
on massive datasets.
2. Performance:
o Reduces the time between data extraction and availability for use.
o Parallel processing in data warehouses speeds up transformation.
3. Flexibility:
o Raw data is available for various transformations and analytics.
o Allows schema-on-read rather than schema-on-write.
4. Real-Time Processing:
o Better suited for streaming data pipelines and modern analytics.
🔹 Use Cases:
Real-time business intelligence dashboards
Machine learning pipelines needing raw data
Agile data engineering environments
🔹 Tools That Support ELT:
Apache Spark, dbt (Data Build Tool), Snowflake, Azure Synapse
(b) Metadata and Data Warehousing
🔹 What is Metadata?
Metadata is “data about data.” In data warehousing, it describes:
Source, structure, format, and semantics of data
Data lineage (origin and transformation)
Access rights and refresh frequency
🔹 Types of Metadata in Data Warehousing:
1. Technical Metadata:
o Table names, column types, data types, ETL rules
2. Business Metadata:
o Business definitions, descriptions, ownership
3. Operational Metadata:
o Load status, error logs, job execution time
🔹 Importance of Metadata:
Ensures data quality and consistency
Aids ETL developers and analysts
Supports data governance and compliance
Helps in impact analysis when changing schemas
🔹 Example:
If a data field “Revenue” is stored, metadata might include:
Definition: Total amount after discounts
Source: Sales DB → Fact_Sales
Type: Decimal(10,2)
Last updated: Daily at 2 AM
(c) Rule-Based Classification
🔹 What is Rule-Based Classification?
It is a classification method where IF-THEN rules are used to assign class labels to data
instances.
Each rule has:
Antecedent (condition) – e.g., IF Age > 30 AND Income = High
Consequent (class) – THEN Class = Yes
🔹 Example Rule:
IF (Age > 40 AND Income = High) THEN Class = “Yes”
🔹 How Rules are Generated:
From decision trees (e.g., ID3, C4.5)
From algorithms like RIPPER, CN2
Using association rules adapted for classification
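A rule-based classifier is essentially an ordered rule list with a default class. A minimal sketch using the example rule above (the second rule and the default are invented for illustration):

```python
# Each rule: (condition, class label). Rules are tried in order;
# a default class applies if none fires.
rules = [
    (lambda r: r['age'] > 40 and r['income'] == 'High', 'Yes'),
    (lambda r: r['age'] <= 30 and r['income'] == 'Low', 'No'),   # illustrative
]

def classify(record, default='No'):
    for condition, label in rules:
        if condition(record):
            return label  # first matching rule wins
    return default

print(classify({'age': 45, 'income': 'High'}))    # 'Yes' (first rule fires)
print(classify({'age': 35, 'income': 'Medium'}))  # 'No' (falls through to default)
```

Conflict resolution here is simply rule order; other schemes weight rules or vote.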
🔹 Advantages:
Easy to understand and interpret
Flexible – new rules can be added or modified easily
Handles both categorical and numerical data
🔹 Disadvantages:
Conflicting or redundant rules may arise
Performance degrades with too many rules
May not handle noise well
🔹 Applications:
Spam filtering
Medical diagnosis
Fraud detection
(d) Naïve Bayes Classifier
🔹 What is Naïve Bayes?
Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem with the naïve assumption
that features are conditionally independent given the class.
🔹 Bayes' Theorem:

P(C|X) = [P(X|C) × P(C)] / P(X)

Where:
P(C|X): Posterior probability of class C given data X
P(X|C): Likelihood of data X given class C
P(C): Prior probability of class C
P(X): Probability of data X
🔹 Working Steps:
1. Calculate prior probabilities for each class
2. Compute likelihood for each attribute given the class
3. Use Bayes' Theorem to compute posterior for each class
4. Choose class with highest posterior probability
🔹 Example:
Given data about emails:
Features: Contains “offer”, Contains “win”
Class: Spam / Not Spam
Rule:
IF email contains “win” and “offer” THEN classify as “Spam” with high probability.
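The spam example can be worked through with a tiny Naïve Bayes sketch (the four training emails are made up; Laplace smoothing is added so unseen words do not zero out the posterior):

```python
from collections import Counter, defaultdict

# Toy training set: (words in email, class); data is illustrative only
emails = [
    ({'win', 'offer'}, 'Spam'),
    ({'offer', 'deal'}, 'Spam'),
    ({'meeting', 'notes'}, 'NotSpam'),
    ({'project', 'notes'}, 'NotSpam'),
]
classes = Counter(label for _, label in emails)
word_counts = defaultdict(Counter)
for words, label in emails:
    word_counts[label].update(words)
vocab = {w for ws, _ in emails for w in ws}

def posterior(words, label, alpha=1.0):
    # P(C) * product of P(word | C), with Laplace smoothing
    p = classes[label] / len(emails)              # prior P(C)
    total = sum(word_counts[label].values())
    for w in words:                                # likelihoods P(w | C)
        p *= (word_counts[label][w] + alpha) / (total + alpha * len(vocab))
    return p

def classify(words):
    # Choose the class with the highest posterior
    return max(classes, key=lambda c: posterior(words, c))

print(classify({'win', 'offer'}))  # 'Spam'
```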
🔹 Advantages:
Simple and fast
Works well with high-dimensional data
Performs well on text classification (e.g., spam filtering)
🔹 Disadvantages:
Assumes independent features, which is often unrealistic
May perform poorly if this assumption is violated
🔹 Applications:
Email spam detection
Sentiment analysis
Document classification