Data Warehouse and ETL Concepts Explained

A Data Warehouse (DW) is a centralized repository for historical data from various sources, primarily used for business intelligence and decision-making. The ETL process involves extracting, transforming, and loading data into the DW, while Data Marts serve specific business functions. Data modeling techniques, such as Star and Snowflake schemas, are used to optimize data for query performance, and Slowly Changing Dimensions (SCD) manage changes in dimension data over time.

Uploaded by

maheshkaikala
Data Warehouse Concepts

A Data Warehouse (DW) is a centralized repository for storing integrated, historical data from multiple disparate sources, primarily used for reporting and analysis.
Purpose of Data Warehouse

The main purpose is to support business intelligence (BI) and decision-making. It provides a single, consistent, historical view of the business to help analysts and executives identify trends, patterns, and insights.

ETL (Extract, Transform, Load)

ETL is a process that involves:
1. Extracting data from source systems.
2. Transforming the data into a clean, consistent format suitable for the data warehouse.
3. Loading the transformed data into the data warehouse.
Overview of ETL Operations

The operations include:
● Extraction: Reading data from various sources (databases, files, etc.).
● Transformation: Cleaning, filtering, validating, aggregating, and applying business
rules to the data.
● Loading: Writing the processed data into the target data warehouse or data mart.
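The three steps above can be sketched in plain Python. This is a minimal illustration, not a real ETL tool: the source data, field names, and in-memory "warehouse" are all hypothetical.

```python
# Minimal ETL sketch. The source rows, field names, and list-based
# "warehouse" are illustrative assumptions, not a real system.

def extract(source_rows):
    """Extraction: read raw records from a source (here, a list of dicts)."""
    return list(source_rows)

def transform(rows):
    """Transformation: clean, validate, and apply a simple business rule."""
    cleaned = []
    for row in rows:
        if row.get("amount") is None:        # validate: drop incomplete rows
            continue
        cleaned.append({
            "customer": row["customer"].strip().title(),  # clean names
            "amount": round(float(row["amount"]), 2),     # standardize type
        })
    return cleaned

def load(rows, warehouse):
    """Loading: write processed rows into the target store (here, a list)."""
    warehouse.extend(rows)
    return warehouse

source = [{"customer": "  alice ", "amount": "20.00"},
          {"customer": "bob", "amount": None}]     # invalid row is dropped
dw = load(transform(extract(source)), [])
# dw == [{"customer": "Alice", "amount": 20.0}]
```

In a real pipeline each function would read from and write to databases or files, but the extract → transform → load sequence is the same.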
ETL Architecture

ETL architecture typically involves a staging area (a temporary storage area for extracted
data), the ETL tool/server (where transformation logic is applied), and the target data
warehouse.

Data Mart Concepts


A Data Mart is a subset of the Data Warehouse that focuses on a specific business
function or department (e.g., Sales, Marketing, Finance). It's smaller, more specialized, and
easier to manage than a full DW.
Types of Data Mart

1. Dependent Data Mart: Created directly from the enterprise Data Warehouse.
2. Independent Data Mart: Created separately from operational data sources without
relying on an existing central DW.
Data Mart vs Data Warehouse

Feature        Data Mart                            Data Warehouse
Scope          Departmental / specific subject      Enterprise-wide / multiple subjects
Size           Smaller                              Larger
Focus          Specific analysis needs              Consolidated view for BI
Data Sources   DW or specific operational sources   Multiple operational sources

Dimensional Modeling Components


What is Dimension Table

A Dimension Table describes the "who, what, where, when, and how" of a business event.
It contains descriptive attributes used for filtering, grouping, and labeling (e.g., Customer
Name, Product Category, Date).
Types of Dimension Table

1. Conformed Dimension: Identical dimensions shared across multiple fact tables, ensuring consistent reporting.
2. Junk Dimension: A table grouping small, unrelated flags and attributes to avoid adding
many columns to the fact table.
3. Role-Playing Dimension: A single dimension used multiple times in a fact table, where
each usage has a different meaning (e.g., using a single Date dimension for Order Date,
Ship Date, and Delivery Date).
4. Degenerate Dimension: A dimension attribute (like an order number) stored within the
fact table itself because there are no other associated attributes.
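A role-playing dimension (type 3 above) is easiest to see in code: one date dimension is looked up three times from the same fact row, once per role. The keys and columns below are illustrative assumptions.

```python
# Sketch of a role-playing dimension: a single date dimension referenced
# under three different roles by one fact row. Data is illustrative.

dim_date = {20240101: "2024-01-01",
            20240105: "2024-01-05",
            20240107: "2024-01-07"}

order = {"order_date_key": 20240101,     # role 1: when it was ordered
         "ship_date_key": 20240105,      # role 2: when it shipped
         "delivery_date_key": 20240107,  # role 3: when it arrived
         "amount": 99.0}

# Same dimension table, three lookups — one per foreign key / role.
ordered   = dim_date[order["order_date_key"]]     # "2024-01-01"
shipped   = dim_date[order["ship_date_key"]]      # "2024-01-05"
delivered = dim_date[order["delivery_date_key"]]  # "2024-01-07"
```

In SQL this would be three joins to the same dimension table under different aliases.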
What is Fact Table

A Fact Table stores the "measurements" or "metrics" of a business process (e.g., Sales
Amount, Quantity Sold, Profit). It contains foreign keys to related dimension tables and
numerical facts.
Types of Fact Table

1. Additive Fact Table: Measures can be summed up across all dimensions (e.g., Sales
Amount).
2. Semi-Additive Fact Table: Measures can be summed up across some dimensions but
not others (e.g., inventory level or account balance, which can be summed across
locations or accounts but not across time).
3. Non-Additive Fact Table: Measures that cannot be summed up at all (e.g.,
percentages, ratios, or unit prices).
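The semi-additive case is the subtle one, so here is a small sketch with hypothetical account-balance data: summing across accounts on one date is valid, but summing one account across dates is not.

```python
# Sketch: why account balance is semi-additive. Values are illustrative.

balances = [  # (date, account, balance)
    ("2024-01-31", "A", 100), ("2024-01-31", "B", 50),
    ("2024-02-29", "A", 120), ("2024-02-29", "B", 70),
]

# Valid: sum across the account dimension at one point in time.
jan_total = sum(b for d, a, b in balances if d == "2024-01-31")  # 150

# Invalid: summing account A across dates (100 + 120) does not mean the
# account ever held 220. Take the latest balance instead.
a_latest = max((d, b) for d, a, b in balances if a == "A")[1]    # 120
```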
OLAP and OLTP

What is OLAP

Online Analytical Processing (OLAP) is a category of software technology that enables analysts to quickly and interactively analyze vast amounts of data from a data warehouse, often involving complex queries, aggregation, and multi-dimensional views.
What is OLTP

Online Transaction Processing (OLTP) is a class of software that supports transaction-oriented applications. Its focus is on the fast, real-time processing of day-to-day operational transactions (e.g., bank withdrawals, online purchases, order entry).
OLAP vs OLTP

Feature           OLAP                                   OLTP
Purpose           Analysis and decision support          Day-to-day transaction processing
Data              Historical, consolidated               Current, operational
Database Design   Denormalized (Star/Snowflake schema)   Highly normalized (relational)
Operations        Read-heavy; complex queries and        Insert/update/delete-heavy;
                  aggregations                           simple transactions
Response Time     Higher latency (seconds to minutes)    Low latency (milliseconds)

Data Modeling and Schemas


What is Data Modeling

Data Modeling is the process of creating a visual representation of either a whole information system or parts of it to communicate connections between data points and structures. In a DW, it focuses on optimizing data for query performance.
Types of Data Modeling

1. Conceptual Data Model: High-level, business-focused model (independent of technology).
2. Logical Data Model: Detailed model of data elements and relationships (independent
of a specific DBMS).
3. Physical Data Model: Specifies the database schema, including tables, columns, data
types, and constraints (DBMS-specific).
4. Dimensional Model: A logical design structure that uses Fact and Dimension tables,
popular for data warehousing.
Star Schema

A Star Schema is the simplest dimensional model where a central fact table directly links to
multiple surrounding dimension tables. It looks like a star, and the dimensions are not joined
to each other. It favors query simplicity and speed.
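The star shape can be sketched with dictionaries standing in for tables: one fact table holds foreign keys and measures, and a query joins out to each dimension directly. The table contents and key names below are illustrative assumptions.

```python
# Minimal star-schema sketch: a central fact table with foreign keys into
# two surrounding dimension tables. Data and keys are illustrative.

dim_product = {1: {"name": "Widget", "category": "Tools"}}   # dimension
dim_date    = {20240101: {"year": 2024, "month": 1}}         # dimension

fact_sales = [  # fact table: foreign keys + numeric measures only
    {"product_key": 1, "date_key": 20240101, "amount": 19.99, "qty": 2},
]

# A typical star query: join the fact row to each dimension directly
# (one hop per dimension — no dimension-to-dimension joins).
report = [
    (dim_product[f["product_key"]]["category"],
     dim_date[f["date_key"]]["year"],
     f["amount"])
    for f in fact_sales
]
# report == [("Tools", 2024, 19.99)]
```

Each dimension is reached in a single join from the fact table, which is exactly what makes star-schema queries simple and fast.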
Snowflake Schema

A Snowflake Schema is an extension of the Star Schema in which dimension tables are
further normalized (broken down) into multiple related tables. It saves storage space but
makes queries more complex because more joins are required.

Normalization and De-Normalization


Normalization is the process of organizing the columns and tables in a relational
database to minimize data redundancy and dependencies. It is typically used in OLTP
systems.
De-Normalization is the strategy of intentionally introducing redundancy to a table to
improve query performance by reducing the number of joins required. It is commonly used in
OLAP systems (Data Warehouses).
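The trade-off can be shown in a few lines: the normalized form stores each category name once and references it by key, while the de-normalized form copies the name onto every row so queries need no join. The data below is illustrative.

```python
# Sketch of normalization vs. de-normalization. Data is illustrative.

# Normalized (OLTP-style): category stored once, referenced by key.
categories = {10: "Tools"}
products_norm = [{"product_id": 1, "name": "Widget", "category_id": 10}]

# De-normalized (DW-style): category name duplicated onto each product
# row, trading redundancy for join-free queries.
products_denorm = [
    {**p, "category": categories[p["category_id"]]} for p in products_norm
]
# products_denorm[0]["category"] == "Tools" — no lookup needed at query time.
```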
Slowly Changing Dimension (SCD) and Types

Slowly Changing Dimension (SCD) is a set of techniques for managing and tracking
changes in dimension table data over time.
Common Types of SCD:
● SCD Type 1 (Overwrite): The old data is simply overwritten with the new data. No
history is preserved.
● SCD Type 2 (Add New Row): A new row is added to the dimension table to store the
new version of the data, while the old version is preserved, typically using flags,
start/end dates, or version numbers. Full history is preserved.
● SCD Type 3 (Add New Column): A new column is added to the dimension table to
store a specific previous value (e.g., the previous region). Partial history is preserved.
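SCD Type 2 is the most involved of the three, so here is a minimal sketch of its update logic: expire the current row with an end date, then append a new current row. Column names and dates are illustrative assumptions.

```python
# SCD Type 2 sketch: on an attribute change, expire the current row and
# append a new versioned row. Column names and dates are illustrative.

def apply_scd2(dim_rows, business_key, new_attrs, change_date):
    """Expire the current row for business_key and append a new version."""
    for row in dim_rows:
        if row["business_key"] == business_key and row["is_current"]:
            row["is_current"] = False        # flag the old version
            row["end_date"] = change_date    # close its validity window
    dim_rows.append({
        "business_key": business_key, **new_attrs,
        "start_date": change_date, "end_date": None, "is_current": True,
    })
    return dim_rows

customers = [{"business_key": "C1", "region": "East",
              "start_date": "2023-01-01", "end_date": None,
              "is_current": True}]
apply_scd2(customers, "C1", {"region": "West"}, "2024-06-01")
# customers now holds two rows: the expired "East" version (with an end
# date) and the current "West" version — full history is preserved.
```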
Uses Of SCD

SCDs are used to ensure that historical facts in the fact table are correctly linked to the
correct dimension attributes at the time the event occurred. This allows for accurate
historical reporting and trend analysis.
Surrogate Key
A Surrogate Key is a system-generated primary key in a dimension or fact table that is
not derived from the source data (i.e., it's a non-business key). It's typically a simple integer
sequence, used to ensure unique identification of rows and to maintain referential integrity
with the fact table, especially crucial for implementing SCD Type 2.
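A sketch of surrogate key assignment, using a plain integer counter in place of a database identity column: the same business key ("C1") gets a distinct surrogate key for each row, which is what lets SCD Type 2 store multiple versions of one customer. Names and data are illustrative.

```python
# Sketch: surrogate keys assigned independently of the business key.
# The counter stands in for a database identity/sequence; data is illustrative.

from itertools import count

next_key = count(1)   # simple integer sequence, starting at 1

dim_customer = []
for business_key, name in [("C1", "Alice"), ("C2", "Bob"),
                           ("C1", "Alice")]:   # C1 appears twice (SCD 2)
    dim_customer.append({
        "customer_sk": next(next_key),   # surrogate key: system-generated
        "customer_id": business_key,     # natural/business key from source
        "name": name,
    })

keys = [r["customer_sk"] for r in dim_customer]   # [1, 2, 3] — unique per row
```

Fact tables reference `customer_sk`, not `customer_id`, so each fact row points at the dimension version that was current when the event occurred.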
