Real-Time Data Warehouse Architecture Explained

1. (a) With the help of a diagram, explain the real-time data warehouse architecture. Also, discuss its trade-offs.

A Real-Time Data Warehouse (RTDW) is a data warehousing system that updates data
continuously (or near real-time) as changes occur in the source systems. Unlike traditional data
warehouses which update data in batches (e.g., nightly or weekly), RTDW provides up-to-the-
minute or real-time analytics.

Real-time data warehousing is used in scenarios such as:

 Fraud detection
 Stock trading
 E-commerce recommendations
 Real-time dashboards
Key Components Explained:

1. Operational Source Systems:

These are real-time transactional systems like:

 CRM
 Banking systems
 Online shopping platforms

They generate constant data streams that must be processed instantly.

2. Change Data Capture (CDC):

CDC detects changes in source systems (insert/update/delete) and captures them as events.
Techniques include:

 Log-based capture
 Trigger-based methods
 Timestamp comparison
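Timestamp comparison, the simplest of these techniques, can be sketched in a few lines. The "source table" below is an in-memory stand-in; in a real system this would be a SQL query filtered on a last-modified column (the table layout and column names here are illustrative only):

```python
from datetime import datetime

# Hypothetical in-memory "source table"; a real CDC job would query the
# source database with a WHERE clause on the updated_at column.
source_rows = [
    {"id": 1, "name": "Alice", "updated_at": datetime(2024, 1, 1)},
    {"id": 2, "name": "Bob",   "updated_at": datetime(2024, 1, 5)},
    {"id": 3, "name": "Carol", "updated_at": datetime(2024, 1, 9)},
]

def capture_changes(rows, last_sync):
    """Timestamp-comparison CDC: pick up only rows modified after the last sync."""
    return [r for r in rows if r["updated_at"] > last_sync]

changed = capture_changes(source_rows, last_sync=datetime(2024, 1, 4))
print([r["id"] for r in changed])  # [2, 3] - rows changed since the last sync
```

Note that timestamp comparison cannot detect deletes (a deleted row simply disappears), which is one reason log-based capture is preferred in production systems.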

3. Real-Time ETL Engine:

Unlike batch ETL, real-time ETL runs continuously:

 Extracts and transforms small volumes of data on-the-fly


 Maintains data quality and consistency in real-time
 Manages low-latency transformations

4. Staging Area (Optional):

Acts as a buffer:

 Holds transient data temporarily


 Cleanses and filters before loading into DW
 Can help in load balancing and handling data bursts
5. Real-Time Data Warehouse:

The central repository that supports:

 Low-latency write operations


 High-speed querying
 Partitioning and indexing for fast access

Technologies: Snowflake, Google BigQuery, Amazon Redshift, Apache Druid.

6. Metadata Repository:

Stores data definitions, transformation rules, data lineage, and operational metadata. Helps in
synchronization and auditing.

7. Business Intelligence Layer:

Provides:

 Dashboards (Power BI, Tableau)


 Real-time reports
 Alerts (e.g., for fraud detection)
 Decision-making support

Trade-offs in Real-Time Data Warehousing

| Aspect | Real-Time DW | Traditional DW | Trade-Offs |
| Latency | Seconds to minutes | Hours to days | Low latency gives quick insights but may impact performance. |
| ETL Complexity | High | Moderate | Continuous ETL is harder to manage and debug. |
| System Cost | High (due to streaming, high IOPS) | Lower | Real-time systems need high-performance hardware/software. |
| Data Quality | May be compromised due to speed | Higher with batch cleansing | Need for real-time validation and cleansing. |
| Historical Analysis | Limited unless integrated with batch warehouse | Strong | Needs hybrid architecture for deep historical + real-time analysis. |
| Error Recovery | Difficult in real-time | Easier in batch | In real-time, errors must be handled instantly. |
| Architecture Complexity | Very high | Simpler | More moving parts (CDC, streaming, micro-batching, alerts). |
(b) Describe the fact constellation schema. How is it different from the snowflake schema? With the help of an example, explain the fact constellation schema. Mention its advantages and limitations.

A Fact Constellation Schema (also called Galaxy Schema) is a multi-fact data warehouse
schema that contains two or more fact tables sharing common dimension tables.
It is used to represent complex data warehouse systems involving multiple business
processes.

🔸 Key Features:

 Involves multiple fact tables


 Shared dimension tables (e.g., time, customer)
 Represents a collection of star schemas
 Used for modular and complex data models

Explanation of the Example:

 Two fact tables: Sales Fact and Shipping Fact


 Both share common dimensions like:
o Time Dimension
o Product Dimension
 Each has its own unique dimensions:
o Sales Fact uses Store Dimension
o Shipping Fact uses Customer Dimension

This shows a constellation of facts surrounded by a galaxy of shared and unique dimensions.

Difference Between Fact Constellation and Snowflake Schema:

| Aspect | Fact Constellation Schema | Snowflake Schema |
| Definition | Multiple fact tables with shared dimensions | Single fact table with normalized dimensions |
| Complexity | Higher due to multiple facts | Moderate due to normalization |
| Use Case | Multiple related business processes | Organized dimension modeling |
| Dimension Tables | Shared across fact tables | Split into sub-dimensions |
| Schema Type | Collection of star schemas | Variant of star schema |
| Query Performance | May be slower due to joins across facts | May also be slower than star due to normalized dimensions |
Advantages of Fact Constellation Schema

1. Reusability: Common dimensions are reused across multiple facts.


2. Supports multiple business processes: E.g., sales, shipping, returns, etc.
3. Scalability: Can model complex data warehouse scenarios.
4. Modular design: New fact tables can be added with minimal effort.

Limitations of Fact Constellation Schema

1. Complex queries: Requires handling joins across multiple fact and dimension tables.
2. Difficult to design: Schema becomes complex as number of facts increases.
3. Performance issues: Due to large number of joins and complex relationships.
4. Maintenance overhead: More tables to manage and optimize.

(c) Write and explain the K-NN classification algorithm in Data Mining. Discuss its advantages and disadvantages.

K-Nearest Neighbors (K-NN) is a supervised learning algorithm used for both classification
and regression, but more commonly for classification.
It is based on the principle that data points with similar features exist in close proximity.

In simple terms, K-NN classifies a new data point based on the majority class among its ‘K’
nearest neighbors.

Working of K-NN Classification Algorithm:

Step-by-step Explanation:

1. Choose the value of K:


o This is the number of nearest neighbors to consider (e.g., K = 3, 5, etc.)
2. Compute distance between the new data point and all other points in the training dataset.
o Common distance metrics:
 Euclidean distance
 Manhattan distance
 Minkowski distance
3. Sort the distances in ascending order and find the K closest neighbors.
4. Count the number of occurrences of each class among the K neighbors.
5. Assign the class label to the new data point that is most frequent among its K neighbors.

🔹 Euclidean Distance Formula

For two points X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn):

d(X, Y) = √( (x1 − y1)² + (x2 − y2)² + ... + (xn − yn)² )
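The steps above can be sketched as a short program. The 2-D training points and labels below are invented purely for illustration:

```python
import math
from collections import Counter

def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label). Majority vote among the k nearest."""
    neighbors = sorted(train, key=lambda p: euclidean(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy dataset: class "A" clusters near (1, 1), class "B" near (5, 5).
train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"), ((5, 5), "B"), ((6, 5), "B")]
print(knn_classify(train, (1.5, 1.5), k=3))  # "A": the 3 nearest points are all class A
```

Because there is no training phase, all the work happens at query time, which is exactly why K-NN is called a lazy learner.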

Advantages of K-NN:

| Advantage | Description |
| Simple & easy to implement | No training phase; easy to code and understand |
| Non-parametric | No assumptions about data distribution |
| Versatile | Can be used for classification and regression |
| Real-time learning | K-NN can adapt as new data is added (lazy learning) |
| Effective in multi-class problems | Handles multiple class labels well |

Disadvantages of K-NN:

| Disadvantage | Description |
| Computationally expensive | Must compute distance to all training data for each query |
| Slow with large datasets | Memory and time complexity increase with data size |
| Sensitive to irrelevant features | No feature selection; all features are equally weighted |
| Requires feature scaling | Distance metrics are affected by scale; normalization is needed |
| Affected by imbalanced data | May bias toward frequent classes in imbalanced datasets |

(d) Discuss vector space modeling for representing text documents. With reference to this modeling, explain TF-IDF and Inverse Document Frequency (IDF).

Vector Space Model (VSM) is a mathematical model used to represent text documents as
vectors in a multi-dimensional space.
Each dimension corresponds to a term (word) from the document corpus, and the value along
that dimension indicates the importance of the term in that document.

It is used in:

 Information Retrieval (IR)


 Text Mining
 Document Clustering
 Text Classification

How Vector Space Model Works

🔸 Key Idea:

 Treat each document as a vector of term weights.


 Terms are typically stemmed, stop words removed, and frequency counted.
 Documents are then compared using cosine similarity or other distance measures.

TF-IDF: Term Frequency–Inverse Document Frequency

TF-IDF is a statistical measure used in VSM to evaluate how important a word is to a


document in a collection.

It balances:

 Term Frequency (TF): How often a term appears in a document


 Inverse Document Frequency (IDF): How rare the term is across all documents

🔹 Formula:

TF-IDF(t,d)=TF(t,d)×IDF(t)

Term Frequency (TF)

TF measures how frequently a term t occurs in a document d.

TF(t, d) = (Number of times term t appears in d) / (Total number of terms in d)

👉 Example:
If the word "data" appears 3 times in a document with 100 words:
TF(data) = 3 / 100 = 0.03

Inverse Document Frequency (IDF)

IDF measures how important (i.e., rare) a term is across all documents.
Rare terms get higher IDF scores.

IDF(t) = log10( N / df(t) )

Where:

 N = Total number of documents
 df(t) = Number of documents containing term t

👉 Example:
If "data" appears in 1000 out of 5000 documents:
IDF(data) = log10(5000 / 1000) = log10(5) ≈ 0.699
Putting It All Together (TF-IDF Example):

Let’s say:

 TF(data, D1) = 0.1


 IDF(data) = 0.699
Then,

TF-IDF(data, D1)=0.1×0.699=0.0699

This means the term "data" is moderately important in document D1.
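A small sketch reproducing the numbers from the examples above; the corpus is simulated so the document frequencies match (1000 of 5000 documents contain "data"):

```python
import math

def tf(term, doc_terms):
    """Term frequency: occurrences of term divided by document length."""
    return doc_terms.count(term) / len(doc_terms)

def idf(term, docs):
    """Inverse document frequency with base-10 log, as in the example above."""
    df = sum(1 for d in docs if term in d)
    return math.log10(len(docs) / df)

# "data" appears 3 times in a 100-word document, and in 1000 of 5000 documents.
doc = ["data"] * 3 + ["word"] * 97
docs = [{"data"}] * 1000 + [{"other"}] * 4000

tf_val = tf("data", doc)
idf_val = idf("data", docs)
print(round(tf_val, 2))         # 0.03
print(round(idf_val, 3))        # 0.699
print(round(0.1 * idf_val, 4))  # 0.0699 (the TF-IDF value from the example above)
```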

Advantages of Vector Space Model & TF-IDF:


Advantage Description
Simple & Effective Easy to implement and widely used
Quantitative Comparison Allows similarity computation (e.g., cosine similarity)
Relevance-based Ranking TF-IDF boosts rare but meaningful terms
No Need for Linguistic Knowledge Works purely based on statistics
2. (a) Discuss various factors to be considered to improve the performance of
ETL.
ETL (Extract, Transform, Load) is the critical process in data warehousing used to:

 Extract data from various sources,


 Transform it into the desired format, and
 Load it into the target Data Warehouse.

Due to the large volumes of data, the performance of ETL processes directly affects the
efficiency, freshness, and usability of data.

Improving ETL performance involves optimizing each stage of the process.

Key Factors to Improve ETL Performance:

🔹 1. Source System Optimization

 Minimize the load on the source system during extraction.


 Use incremental data extraction (e.g., Change Data Capture - CDC) rather than full
extraction.
 Avoid complex queries and joins on the source.
 Use indexes in source databases to speed up data retrieval.

🔹 2. Use of Parallelism and Partitioning

 Parallel Processing: Split data into chunks and process them simultaneously.
 Pipeline Parallelism: Run multiple ETL stages (extract, transform, load) concurrently.
 Data Partitioning: Divide data logically (e.g., by region, date) to improve throughput.
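A minimal illustration of partitioning plus parallel processing using Python's standard thread pool; the partition key ("region") and the per-partition transformation are placeholders for real ETL logic:

```python
from concurrent.futures import ThreadPoolExecutor

records = [{"region": "N", "amt": 10}, {"region": "S", "amt": 20},
           {"region": "N", "amt": 5},  {"region": "S", "amt": 15}]

def partition(rows, key):
    """Divide data logically by a key (here: region)."""
    parts = {}
    for r in rows:
        parts.setdefault(r[key], []).append(r)
    return parts

def transform(part):
    """Stand-in for real transformation work on one partition."""
    return sum(r["amt"] for r in part)

parts = partition(records, "region")
with ThreadPoolExecutor() as pool:  # each partition is processed concurrently
    totals = dict(zip(parts, pool.map(transform, parts.values())))
print(totals)  # {'N': 15, 'S': 35}
```

In a production pipeline the same pattern scales out across processes or worker nodes; the key idea is that independent partitions impose no ordering on each other.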

🔹 3. Efficient Transformation Logic

 Minimize complex transformation operations such as nested IF, JOIN, and CASE
statements.
 Push simple transformations to the database layer using SQL.
 Use optimized data types and avoid row-by-row operations (prefer bulk operations).
 Leverage lookup caches instead of repetitive lookups.

🔹 4. Staging Area Usage

 Use a staging database to temporarily store extracted raw data before transformation.
 This allows for:
o Isolation from source systems
o Easy recovery from failures
o Intermediate data validation

🔹 5. Optimized Loading Strategies

 Use bulk/batch loading instead of row-by-row inserts.


 Disable indexes and constraints during load and re-enable after.
 Use truncate-and-load where applicable instead of slow update/delete/insert logic.
 Use load balancing if loading into distributed data warehouses.
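Bulk loading versus row-by-row inserts can be illustrated with SQLite's batch API; an in-memory database stands in for the warehouse, and the table name is illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_fact (sale_id INTEGER, revenue REAL)")

rows = [(i, i * 10.0) for i in range(1000)]

# One batched statement instead of 1000 single-row round trips.
conn.executemany("INSERT INTO sales_fact VALUES (?, ?)", rows)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM sales_fact").fetchone()[0])  # 1000
```

Real warehouses expose the same idea through dedicated bulk paths (e.g. COPY-style commands), which additionally bypass per-row logging and constraint checks.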

🔹 6. Data Volume and Scheduling Management

 Filter unnecessary data early in the process.


 Use incremental loads wherever possible.
 Schedule ETL jobs during off-peak hours to reduce contention.
 Monitor and limit resource usage (CPU, I/O, memory).

🔹 7. Error Handling and Logging

 Use error logs and recovery mechanisms to handle failures efficiently.


 Avoid reprocessing the entire dataset for minor errors.
 Design the ETL to resume from the point of failure.

🔹 8. Metadata and Dependency Management

 Use metadata to manage mappings, data lineage, and schema changes.


 Automate dependency tracking for data pipelines to avoid unnecessary reprocessing.

🔹 9. Hardware and Infrastructure

 Deploy ETL on high-performance servers with sufficient memory, I/O, and CPU.
 Use distributed or cloud-based ETL platforms like Apache Spark, Talend, or AWS
Glue for scalability.

(b) Define a Data Mart. How is it different from a centralized data warehouse ?
Discuss the structure of Data Mart with the help of an example.
A Data Mart is a subset of a data warehouse that is focused on a specific business line,
department, or team within an organization.
It is designed to meet the particular needs of a specific group of users, such as:

 Sales
 Marketing
 Finance
 HR

🔹 Definition:

“A Data Mart is a subject-oriented, department-specific data repository that stores summarized


or detailed data to serve a particular community of knowledge workers.”

Difference Between Data Mart and Centralized Data Warehouse:

| Aspect | Data Mart | Data Warehouse |
| Scope | Department-level (e.g., sales, finance) | Enterprise-wide |
| Data Volume | Smaller | Very large |
| Data Source | Usually derived from a data warehouse | Consolidated from multiple heterogeneous sources |
| Complexity | Low | High |
| Implementation Time | Shorter (weeks) | Longer (months to years) |
| Maintenance | Easier | Complex |
| Cost | Less expensive | High |
| Speed | Faster for specific queries | May be slower for broad queries |
Types of Data Marts

1. Dependent Data Mart:


o Sourced from the centralized data warehouse
o Ensures consistency and central control
2. Independent Data Mart:
o Sourced directly from operational systems or external data
o No dependency on a central warehouse
3. Hybrid Data Mart:
o Combines data from both warehouse and external sources

Structure of a Data Mart

🔸 Basic Architecture of a Data Mart:


Example: Sales Data Mart

📌 Use Case:

The Sales Department wants to analyze sales performance by region, product, and time.

🔸 Components:

 Fact Table: Sales_Fact


o Attributes: Sales_ID, Date_ID, Product_ID, Region_ID, Revenue,
Quantity_Sold
 Dimension Tables:
o Time_Dim (Date_ID, Day, Month, Year)
o Product_Dim (Product_ID, Name, Category, Price)
o Region_Dim (Region_ID, State, Country)
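The star structure above can be sketched with SQLite standing in for the data mart. Only the Region dimension is populated here, for brevity, and the sample values are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Time_Dim    (Date_ID INTEGER PRIMARY KEY, Day INTEGER, Month TEXT, Year INTEGER);
CREATE TABLE Product_Dim (Product_ID INTEGER PRIMARY KEY, Name TEXT, Category TEXT, Price REAL);
CREATE TABLE Region_Dim  (Region_ID INTEGER PRIMARY KEY, State TEXT, Country TEXT);
CREATE TABLE Sales_Fact  (Sales_ID INTEGER, Date_ID INTEGER, Product_ID INTEGER,
                          Region_ID INTEGER, Revenue REAL, Quantity_Sold INTEGER);
""")
conn.execute("INSERT INTO Region_Dim VALUES (1, 'Karnataka', 'India')")
conn.execute("INSERT INTO Sales_Fact VALUES (1, 101, 7, 1, 5000.0, 3)")

# Typical data-mart query: revenue by region via a fact-to-dimension join.
row = conn.execute("""
    SELECT r.State, SUM(f.Revenue)
    FROM Sales_Fact f JOIN Region_Dim r ON f.Region_ID = r.Region_ID
    GROUP BY r.State
""").fetchone()
print(row)  # ('Karnataka', 5000.0)
```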

Advantages of Data Marts:

| Advantage | Description |
| Faster Access | Optimized for specific departmental queries |
| Simpler Design | Easier to design, implement, and manage |
| Lower Cost | Less hardware and development cost |
| Custom Solutions | Tailored for department needs |
| Security | Limits access to relevant data only |

(c) Discuss the following OLAP data cube operations :

(i) Roll-up (ii) Drill-down.

OLAP (Online Analytical Processing) enables analysts and decision-makers to interactively


analyze multidimensional data stored in a data cube format.
A data cube organizes data into dimensions (e.g., time, location, product) and facts (e.g., sales,
profit).

✅ (i) ROLL-UP (2.5 Marks)

🔹 Definition:

Roll-up is an OLAP operation that performs data aggregation by climbing up a concept


hierarchy or by reducing dimensions.
It allows viewing data at a more summarized or abstract level.

🔸 Example:

Assume a data cube with dimensions:

 Product → Brand → Category


 Location → City → State → Country
 Time → Day → Month → Quarter → Year

If we roll-up on the Location dimension from City to State, we are aggregating sales data from
individual cities to the state level.

✅ (ii) DRILL-DOWN (2.5 Marks)

🔹 Definition:
Drill-down is the opposite of roll-up. It allows users to view data at a more detailed or
granular level, by:

 Descending the hierarchy


 Adding a new dimension

🔸 Example:

Starting from Year-level sales data, if we drill down to Quarter, then Month, we get more
detailed insights.

3. (a) Discuss the following Data Integration issues :

(i) Schema Integration and object modeling

(ii) Redundancy

(iii) Detection and Resolution of Data Value Conflicts.

Data Integration is the process of combining data from multiple heterogeneous sources into a
single, unified view. It is a critical step in building a data warehouse or data mart.

However, integration poses several challenges due to differences in:

 Data formats
 Semantics
 Naming conventions
 Schema structures
 Value inconsistencies

✅ (i) Schema Integration and Object Modeling

🔹 Schema Integration:
Schema integration refers to the process of merging schemas (i.e., structures) from multiple
data sources into a coherent global schema.

Challenges:

 Different naming conventions (e.g., Cust_ID vs. CustomerID)


 Different data types for the same field (e.g., Date vs. String)
 Different levels of normalization
 Missing or extra attributes

Example:

 One source has: Employee(Name, Dept, Salary)


 Another source has: Emp(Name, DepartmentName, MonthlySalary)
Schema integration must map Dept ↔ DepartmentName and Salary ↔ MonthlySalary.
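A minimal sketch of that mapping step, assuming a hand-written attribute map per source; the global attribute names are illustrative, not a standard:

```python
# Hypothetical mapping from each source schema to the integrated global schema.
GLOBAL_MAP = {
    "source_a": {"Name": "name", "Dept": "department", "Salary": "monthly_salary"},
    "source_b": {"Name": "name", "DepartmentName": "department",
                 "MonthlySalary": "monthly_salary"},
}

def to_global(record, source):
    """Rename source-specific attributes to the global schema."""
    return {GLOBAL_MAP[source][k]: v for k, v in record.items()}

a = to_global({"Name": "Asha", "Dept": "IT", "Salary": 50000}, "source_a")
b = to_global({"Name": "Ravi", "DepartmentName": "IT", "MonthlySalary": 52000}, "source_b")
print(a["department"], b["department"])  # both sources now use the same attribute name
```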

🔹 Object Modeling:

Object modeling involves using data models (like ER diagrams, UML models) to:

 Define entities, attributes, and relationships


 Capture data semantics
 Resolve structural differences between source schemas

It ensures that data from different sources conform to a common structure and meaning.

✅ Benefits of Schema Integration & Object Modeling:

 Provides a unified view of enterprise data


 Simplifies querying and analysis
 Enables automated or semi-automated ETL pipelines

✅ (ii) Redundancy

🔹 Definition:
Redundancy in data integration occurs when the same data appears in multiple source
systems.

Redundancy can be:

 Syntactic (exact same data values)


 Semantic (same real-world meaning, different representation)

🔹 Problems Caused by Redundancy:

 Inconsistent values (e.g., one source says salary is ₹50,000, another says ₹55,000)
 Increased storage costs
 Incorrect aggregations (e.g., duplicate rows inflating totals)
 Conflicting updates when integrated systems allow changes

🔹 Solutions:

 Use primary keys and matching logic to detect duplicates


 Employ data deduplication tools
 Define master data sources and discard redundant versions
 Use record linkage and entity resolution techniques

✅ (iii) Detection and Resolution of Data Value Conflicts

🔹 Definition:

Data value conflict arises when the same real-world entity has different values for the same
attribute across different sources.

🔸 Examples of Conflicts:

| Attribute | Source A | Source B | Conflict Type |
| Salary | ₹50,000 | ₹55,000 | Numeric discrepancy |
| Gender | "M" | "Male" | Format inconsistency |
| Date | 01/02/2023 | 02/01/2023 | Locale format conflict |
| Address | "123 MG Road" | "123 Mahatma Gandhi Rd" | Semantic difference |
🔹 Types of Conflicts:

1. Representation conflicts (e.g., different date formats)


2. Scaling conflicts (e.g., salary in INR vs. USD)
3. Missing or default values
4. Measurement unit differences
5. Semantically equivalent but differently labeled values

🔹 Detection Techniques:

 Use data profiling to understand patterns and outliers


 Apply match rules, lookup tables, and pattern recognition
 Apply statistical analysis to detect value ranges

🔹 Resolution Strategies:

 Use data standardization rules (e.g., unify date formats)


 Apply domain-specific rules (e.g., prefer most recent value)
 Use confidence scores or source trust levels
 Consult business rules or data stewards
 Use ETL scripts to transform and reconcile values
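A sketch of one such standardization rule, resolving the locale-format date conflict from the table above; the per-source format strings are assumptions about each source's convention:

```python
from datetime import datetime

# Assumed conventions: source A is day-first, source B is month-first.
SOURCE_FORMATS = {"A": "%d/%m/%Y", "B": "%m/%d/%Y"}

def standardize_date(value, source):
    """Resolve locale-format conflicts by normalizing to ISO 8601."""
    return datetime.strptime(value, SOURCE_FORMATS[source]).date().isoformat()

print(standardize_date("01/02/2023", "A"))  # 2023-02-01
print(standardize_date("02/01/2023", "B"))  # 2023-02-01, the same real-world date
```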

(b) Write and explain the Decision Tree Classifier algorithm with the help of an example.

A Decision Tree is a supervised learning algorithm used for classification and prediction
tasks.
It models decisions and their possible consequences in a tree-like structure.

Each internal node represents a test on an attribute,


each branch represents an outcome of the test,
and each leaf node represents a class label (decision).
Steps in Building a Decision Tree:

The algorithm follows a top-down, recursive approach known as “divide and conquer”.
Common algorithms include:

 ID3 (Iterative Dichotomiser 3)


 C4.5 (Successor to ID3)
 CART (Classification and Regression Trees)

🔹 General Steps:

1. Select the best attribute to split the data based on some criterion (e.g., Information
Gain, Gini Index).
2. Partition the dataset into subsets based on the attribute's values.
3. Repeat the process recursively for each subset until:
o All records in a subset belong to the same class, or
o No attributes are left to split on.

Splitting Criteria

🔸 (a) Information Gain (ID3 Algorithm)

Information Gain = Entropy(Parent) − Σ Weighted Entropy(Children)

Entropy quantifies impurity in the data.

🔸 (b) Gini Index (CART Algorithm)

Gini(D) = 1 − Σ pᵢ²

Lower Gini means a purer node.

Example of Decision Tree (ID3 Algorithm)

🧠 Problem: Predict whether a person will buy a computer based on Age and
Income.

| Age | Income | Buys_Computer |
| <30 | High | No |
| <30 | Medium | Yes |
| 30–40 | High | Yes |
| >40 | Low | Yes |
| >40 | Medium | Yes |
| >40 | High | No |
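Using the six rows above, the entropy and the information gain for splitting on Age can be computed directly; this is a minimal ID3-style sketch, not a full tree builder:

```python
import math
from collections import Counter

data = [  # (Age, Income, Buys_Computer), from the table above
    ("<30",   "High",   "No"),
    ("<30",   "Medium", "Yes"),
    ("30-40", "High",   "Yes"),
    (">40",   "Low",    "Yes"),
    (">40",   "Medium", "Yes"),
    (">40",   "High",   "No"),
]

def entropy(labels):
    """Impurity of a set of class labels, in bits."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def info_gain(rows, attr_index):
    """Entropy(parent) minus the weighted entropy of the children after the split."""
    parent = entropy([r[-1] for r in rows])
    splits = {}
    for r in rows:
        splits.setdefault(r[attr_index], []).append(r[-1])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in splits.values())
    return parent - weighted

print(round(info_gain(data, 0), 3))  # 0.126, the gain of splitting on Age
```

ID3 would compute this gain for every candidate attribute and split on the one with the highest value, then recurse on each subset.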

Advantages of Decision Trees:

Advantage Description
Easy to understand Visual and interpretable
No need for data normalization Can handle numeric and categorical attributes
Works well with missing values Can split on available attributes
Performs feature selection Automatically selects important variables

Disadvantages of Decision Trees:

| Disadvantage | Description |
| Overfitting | May create overly complex trees |
| Instability | Small changes in data can cause large changes in the tree |
| Bias | Biased towards attributes with more levels |
| Lower accuracy | Compared to ensemble methods like Random Forests |

4. (a) Discuss Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA) feature extraction techniques.

Feature extraction is the process of transforming raw data into a reduced set of informative
features while preserving essential information.
This helps in:

 Reducing dimensionality
 Improving model performance
 Avoiding overfitting
 Visualizing data
Two popular feature extraction techniques are:

 LDA (Linear Discriminant Analysis) – supervised


 PCA (Principal Component Analysis) – unsupervised

Principal Component Analysis (PCA)

🔹 Definition:

PCA is an unsupervised statistical technique that transforms the original set of variables into a
new set of orthogonal (uncorrelated) variables called principal components, ordered by the
amount of variance they explain.

🔹 Key Characteristics:

 Captures maximum variance in the data


 Does not use class labels
 Helps in dimensionality reduction by keeping top-k components

🔹 Steps in PCA:

1. Standardize the data


2. Compute the covariance matrix
3. Calculate eigenvalues and eigenvectors
4. Select top k eigenvectors (principal components)
5. Transform the original data using these components
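The five steps map almost line-for-line to code, assuming NumPy is available; the height/weight values are invented for illustration:

```python
import numpy as np

def pca(X, k):
    """Minimal PCA sketch: center, eigendecompose covariance, project onto top-k."""
    Xc = X - X.mean(axis=0)                    # 1. center the data
    cov = np.cov(Xc, rowvar=False)             # 2. covariance matrix
    vals, vecs = np.linalg.eigh(cov)           # 3. eigenvalues and eigenvectors
    top = vecs[:, np.argsort(vals)[::-1][:k]]  # 4. top-k principal components
    return Xc @ top                            # 5. transform the data

# 2-D data (height, weight) reduced to one principal component.
X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0], [175.0, 72.0]])
Z = pca(X, k=1)
print(Z.shape)  # (4, 1)
```

`eigh` is used because a covariance matrix is symmetric; production code typically uses the SVD of the centered data instead, which is numerically more stable.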

🔹 Example:

Given 2D data (Height, Weight), PCA might find a new axis (PC1) that best explains the
variance between individuals.

🔹 Applications:
 Image compression
 Noise reduction
 Data visualization (e.g., projecting to 2D)

🔹 Advantages of PCA:

 Reduces dimensionality while preserving variance


 Improves speed and accuracy of models
 Handles correlated features effectively

🔹 Limitations:

 Does not consider class labels


 Components are difficult to interpret
 Assumes linear relationships

Linear Discriminant Analysis (LDA)

🔹 Definition:

LDA is a supervised feature extraction technique that projects data onto a lower-dimensional
space such that the class separability is maximized.

It finds a linear combination of features that best separates two or more classes.

🔹 Key Characteristics:

 Uses class labels


 Maximizes the between-class variance and minimizes the within-class variance
 Often used in classification tasks

🔹 Steps in LDA:

1. Compute the mean vectors for each class


2. Compute within-class and between-class scatter matrices
3. Calculate eigenvectors and eigenvalues for scatter matrices
4. Select the top linear discriminants
5. Project the data onto the new subspace

🔹 Example:

In a dataset with two classes (e.g., spam vs. non-spam emails), LDA will find the best line or
plane to separate the two classes based on feature distributions.

🔹 Applications:

 Face recognition
 Document classification
 Bioinformatics (e.g., gene expression analysis)

🔹 Advantages of LDA:

 Improves class separability


 Better for classification tasks than PCA
 Provides interpretable features

🔹 Limitations:

 Assumes data is normally distributed


 Assumes equal covariance among classes
 Not suitable for non-linear boundaries

Comparison: PCA vs. LDA:

Feature PCA LDA


Type Unsupervised Supervised
Goal Maximize data variance Maximize class separability
Uses Class Labels ❌ No ✅ Yes
Output Principal components Linear discriminants
Application Data compression, visualization Classification, pattern recognition
(b) Explain Association Rule Mining (ARM). Also, explain the two phases
involved in ARM process with the help of an example.

What is Association Rule Mining (ARM)?

Association Rule Mining (ARM) is a data mining technique used to discover interesting
relationships, patterns, or correlations among items in large datasets.

ARM is commonly used in market basket analysis, where it finds associations like:

If a customer buys bread and butter, they are likely to buy milk.

🔹 Formal Representation of an Association Rule:

A ⇒ B

Where:

 A (Antecedent) = If this occurs...


 B (Consequent) = Then this is likely to occur

🔹 Key Measures:

 Support: Frequency of the itemset in the database

 Confidence: Likelihood of B occurring when A has occurred


 Lift: Measures correlation between A and B

Two Phases of the ARM Process

Association Rule Mining is typically done in two major phases:

🔸 Phase 1: Frequent Itemset Generation

 Identify all frequent itemsets (sets of items that appear together frequently in the
dataset) based on a minimum support threshold.
 Algorithms used:
o Apriori (uses candidate generation and pruning)
o FP-Growth (uses tree structure to avoid candidate generation)

🔸 Phase 2: Rule Generation from Frequent Itemsets

 Generate strong rules from the frequent itemsets using a minimum confidence
threshold.
 Rules of the form A ⇒ B are created by dividing each frequent itemset into antecedent
and consequent.
 Only rules that satisfy both support and confidence thresholds are retained.

Example: Market Basket Dataset:

Transaction ID Items Purchased


T1 Bread, Butter, Milk
T2 Bread, Butter
T3 Bread, Milk
T4 Butter, Milk
T5 Bread, Butter, Milk, Eggs

🔸 Phase 1: Frequent Itemset Generation (min support = 60%)

 Itemsets and supports:


o {Bread} → 4/5 = 80% ✔️
o {Butter} → 4/5 = 80% ✔️
o {Milk} → 4/5 = 80% ✔️
o {Bread, Butter} → 3/5 = 60% ✔️
o {Bread, Milk} → 3/5 = 60% ✔️
o {Butter, Milk} → 3/5 = 60% ✔️
o {Bread, Butter, Milk} → 3/5 = 60% ✔️

🔸 Phase 2: Rule Generation (min confidence = 70%)

From {Bread, Butter}:

 Rule: Bread ⇒ Butter


o Confidence = Support(Bread ∩ Butter) / Support(Bread) = 60% / 80% = 75% ✔️

From {Bread, Butter, Milk}:

 Rule: Bread & Butter ⇒ Milk


o Confidence = 60% / 60% = 100% ✔️

From {Bread, Milk}:

 Rule: Milk ⇒ Bread
o Confidence = Support(Bread ∩ Milk) / Support(Milk) = 60% / 80% = 75% ✔️

5. Write short notes on the following :

(a) ELT and its need

ELT (Extract, Load, Transform) is a modern data integration approach where:

 Data is Extracted from multiple sources


 Loaded into a centralized system (typically a cloud data lake or warehouse)
 Then Transformed within the target system (e.g., using SQL)

This is a shift from traditional ETL (Extract-Transform-Load), where transformation happens


before loading.
🔹 Why ELT is Needed:

1. Scalability:
o Cloud data warehouses (like BigQuery, Snowflake) can process transformations
on massive datasets.
2. Performance:
o Reduces the time between data extraction and availability for use.
o Parallel processing in data warehouses speeds up transformation.
3. Flexibility:
o Raw data is available for various transformations and analytics.
o Allows schema-on-read rather than schema-on-write.
4. Real-Time Processing:
o Better suited for streaming data pipelines and modern analytics.

🔹 Use Cases:

 Real-time business intelligence dashboards


 Machine learning pipelines needing raw data
 Agile data engineering environments

🔹 Tools That Support ELT:

 Apache Spark, dbt (Data Build Tool), Snowflake, Azure Synapse

(b) Metadata and Data Warehousing

🔹 What is Metadata?

Metadata is “data about data.” In data warehousing, it describes:

 Source, structure, format, and semantics of data


 Data lineage (origin and transformation)
 Access rights and refresh frequency

🔹 Types of Metadata in Data Warehousing:

1. Technical Metadata:
o Table names, column types, data types, ETL rules
2. Business Metadata:
o Business definitions, descriptions, ownership
3. Operational Metadata:
o Load status, error logs, job execution time

🔹 Importance of Metadata:

 Ensures data quality and consistency


 Aids ETL developers and analysts
 Supports data governance and compliance
 Helps in impact analysis when changing schemas

🔹 Example:

If a data field “Revenue” is stored, metadata might include:

 Definition: Total amount after discounts


 Source: Sales DB → Fact_Sales
 Type: Decimal(10,2)
 Last updated: Daily at 2 AM

(c) Rule-Based Classification

🔹 What is Rule-Based Classification?

It is a classification method where IF-THEN rules are used to assign class labels to data
instances.

Each rule has:

 Antecedent (condition) – e.g., IF Age > 30 AND Income = High


 Consequent (class) – THEN Class = Yes

🔹 Example Rule:

IF (Age > 40 AND Income = High) THEN Class = “Yes”
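A minimal sketch of an ordered rule list with a default class; the attribute names and rules are illustrative, matching the example rule above:

```python
# Ordered rule list: each rule is (condition, class); the first match wins.
rules = [
    (lambda r: r["age"] > 40 and r["income"] == "High", "Yes"),
    (lambda r: r["age"] <= 25,                          "No"),
]

def classify(record, default="Unknown"):
    """Apply rules in order; fall back to a default class if none fires."""
    for condition, label in rules:
        if condition(record):
            return label
    return default

print(classify({"age": 45, "income": "High"}))  # Yes (first rule fires)
print(classify({"age": 30, "income": "Low"}))   # Unknown (no rule matches)
```

Ordering the rules is one common way to resolve the conflicting-rule problem mentioned below; alternatives include rule weighting and majority voting.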

🔹 How Rules are Generated:


 From decision trees (e.g., ID3, C4.5)
 From algorithms like RIPPER, CN2
 Using association rules adapted for classification

🔹 Advantages:

 Easy to understand and interpret


 Flexible – new rules can be added or modified easily
 Handles both categorical and numerical data

🔹 Disadvantages:

 Conflicting or redundant rules may arise


 Performance degrades with too many rules
 May not handle noise well

🔹 Applications:

 Spam filtering
 Medical diagnosis
 Fraud detection

(d) Naïve Bayes Classifier

🔹 What is Naïve Bayes?

Naïve Bayes is a probabilistic classifier based on Bayes’ Theorem with the naïve assumption
that features are conditionally independent given the class.

🔹 Bayes' Theorem:

P(C|X) = P(X|C) × P(C) / P(X)

Where:

 P(C|X): Posterior probability of class C given data X
 P(X|C): Likelihood of data X given class C
 P(C): Prior probability of class C
 P(X): Probability (evidence) of data X

🔹 Working Steps:

1. Calculate prior probabilities for each class


2. Compute likelihood for each attribute given the class
3. Use Bayes' Theorem to compute posterior for each class
4. Choose class with highest posterior probability

🔹 Example:

Given data about emails:

 Features: Contains “offer”, Contains “win”


 Class: Spam / Not Spam

Rule:
IF email contains “win” and “offer” THEN classify as “Spam” with high probability.
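The working steps can be sketched on a made-up email dataset; Laplace smoothing is added (an assumption beyond the text above) so that unseen words do not zero out a class:

```python
from collections import Counter, defaultdict

# Tiny illustrative training set: (set of words in the email, class).
emails = [
    ({"win", "offer"},      "Spam"),
    ({"offer", "deal"},     "Spam"),
    ({"meeting"},           "NotSpam"),
    ({"report", "meeting"}, "NotSpam"),
]
VOCAB = {"win", "offer", "deal", "meeting", "report"}

def train(data):
    """Step 1-2: priors per class and smoothed word likelihoods per class."""
    priors = Counter(c for _, c in data)
    likelihood = defaultdict(dict)
    for cls in priors:
        docs = [words for words, c in data if c == cls]
        for w in VOCAB:  # Laplace smoothing: add 1 to counts, 2 to the denominator
            likelihood[cls][w] = (sum(w in d for d in docs) + 1) / (len(docs) + 2)
    return priors, likelihood

def classify(words, priors, likelihood, n):
    """Step 3-4: posterior score per class; return the class with the highest."""
    scores = {}
    for cls, count in priors.items():
        p = count / n
        for w in words:
            p *= likelihood[cls][w]
        scores[cls] = p
    return max(scores, key=scores.get)

priors, lik = train(emails)
print(classify({"win", "offer"}, priors, lik, len(emails)))  # Spam
```

P(X) is omitted because it is the same for every class and cannot change which posterior is largest.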

🔹 Advantages:

 Simple and fast


 Works well with high-dimensional data
 Performs well on text classification (e.g., spam filtering)

🔹 Disadvantages:

 Assumes independent features, which is often unrealistic


 May perform poorly if this assumption is violated

🔹 Applications:

 Email spam detection


 Sentiment analysis
 Document classification
