Data Warehouse Concepts
A Data Warehouse (DW) is a centralized repository for storing integrated, historical
data from multiple disparate sources, primarily used for reporting and analysis.
Purpose of Data Warehouse
The main purpose is to support business intelligence (BI) and decision-making. It
provides a single, consistent, historical view of the business to help analysts and executives
identify trends, patterns, and insights.
ETL (Extract, Transform, Load)
ETL is a process that involves:
1. Extracting data from source systems.
2. Transforming the data into a clean, consistent format suitable for the data warehouse.
3. Loading the transformed data into the data warehouse.
Overview of ETL Operations
The operations include:
● Extraction: Reading data from various sources (databases, files, etc.).
● Transformation: Cleaning, filtering, validating, aggregating, and applying business
rules to the data.
● Loading: Writing the processed data into the target data warehouse or data mart.
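The three operations above can be sketched as a minimal Python pipeline. The in-memory CSV source, the cleaning rules, and the sqlite3 target table are illustrative assumptions, not prescribed by the notes:

```python
import csv, io, sqlite3

# --- Extract: read raw rows from a source (here, an in-memory CSV) ---
raw_csv = "order_id,amount,region\n1, 100 ,north\n2,250,SOUTH\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# --- Transform: clean and validate (trim whitespace, cast types, normalize case) ---
clean = [
    {"order_id": int(r["order_id"]),
     "amount": float(r["amount"].strip()),
     "region": r["region"].strip().lower()}
    for r in rows
]

# --- Load: write the cleaned rows into the target warehouse table ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL, region TEXT)")
conn.executemany("INSERT INTO sales VALUES (:order_id, :amount, :region)", clean)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('north', 100.0), ('south', 250.0)]
```

Real ETL tools add error handling, incremental loads, and scheduling around the same three stages.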
ETL Architecture
ETL architecture typically involves a staging area (a temporary storage area for extracted
data), the ETL tool/server (where transformation logic is applied), and the target data
warehouse.
Data Mart Concepts
A Data Mart is a subset of the Data Warehouse that focuses on a specific business
function or department (e.g., Sales, Marketing, Finance). It's smaller, more specialized, and
easier to manage than a full DW.
Types of Data Mart
1. Dependent Data Mart: Created directly from the enterprise Data Warehouse.
2. Independent Data Mart: Created separately from operational data sources without
relying on an existing central DW.
Data Mart vs Data Warehouse
Feature        Data Mart                            Data Warehouse
Scope          Departmental / specific subject      Enterprise-wide / multiple subjects
Size           Smaller                              Larger
Focus          Specific analysis needs              Consolidated view for BI
Data Sources   DW or specific operational sources   Multiple operational sources
Dimensional Modeling Components
What is Dimension Table
A Dimension Table describes the "who, what, where, when, and how" of a business event.
It contains descriptive attributes used for filtering, grouping, and labeling (e.g., Customer
Name, Product Category, Date).
Types of Dimension Table
1. Conformed Dimension: Identical dimensions shared across multiple fact tables,
ensuring consistent reporting.
2. Junk Dimension: A table grouping small, unrelated flags and attributes to avoid adding
many columns to the fact table.
3. Role-Playing Dimension: A single dimension used multiple times in a fact table, where
each usage has a different meaning (e.g., using a single Date dimension for Order Date,
Ship Date, and Delivery Date).
4. Degenerate Dimension: A dimension attribute (like an order number) stored within the
fact table itself because there are no other associated attributes.
What is Fact Table
A Fact Table stores the "measurements" or "metrics" of a business process (e.g., Sales
Amount, Quantity Sold, Profit). It contains foreign keys to related dimension tables and
numerical facts.
Types of Fact Table
1. Additive Fact Table: Measures can be summed up across all dimensions (e.g., Sales
Amount).
2. Semi-Additive Fact Table: Measures can be summed up across some dimensions but
not others (e.g., Inventory level or Account Balance, which can be summed across
location but not time).
3. Non-Additive Fact Table: Measures that cannot be summed up at all (e.g.,
percentages, ratios, or unit prices).
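The semi-additive case can be made concrete with account balances. This is a sketch with made-up figures; the point is which direction of summation is meaningful:

```python
# Daily closing balances per account: a classic semi-additive measure.
balances = [
    {"date": "2024-01-01", "account": "A", "balance": 100},
    {"date": "2024-01-01", "account": "B", "balance": 50},
    {"date": "2024-01-02", "account": "A", "balance": 120},
    {"date": "2024-01-02", "account": "B", "balance": 50},
]

# Summing across accounts for one date is meaningful (total held that day):
total_jan1 = sum(r["balance"] for r in balances if r["date"] == "2024-01-01")
print(total_jan1)  # 150

# Summing across dates is NOT: 100 + 120 = 220 is not account A's money.
# Across the time dimension, use an average (or the last value) instead:
a_balances = [r["balance"] for r in balances if r["account"] == "A"]
avg_a = sum(a_balances) / len(a_balances)
print(avg_a)  # 110.0
```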
OLAP and OLTP
What is OLAP
Online Analytical Processing (OLAP) is a category of software technology that enables
analysts to quickly and interactively analyze vast amounts of data from a data
warehouse, often involving complex queries, aggregation, and multi-dimensional views.
What is OLTP
Online Transaction Processing (OLTP) is a class of software that supports transaction-
oriented applications. Its focus is on the fast, real-time processing of day-to-day
operational transactions (e.g., bank withdrawals, online purchases, order entry).
OLAP vs OLTP
Feature           OLAP                                        OLTP
Purpose           Analysis and decision support               Day-to-day transaction processing
Data              Historical, consolidated                    Current, operational
Database Design   Denormalized (star/snowflake schema)        Highly normalized (relational)
Operations        Read-heavy; complex queries, aggregations   Insert/update/delete-heavy; simple transactions
Response Time     High latency (seconds to minutes)           Low latency (milliseconds)
Data Modeling and Schemas
What is Data Modeling
Data Modeling is the process of creating a visual representation of either a whole
information system or parts of it to communicate connections between data points and
structures. In DW, it focuses on optimizing data for query performance.
Types of Data Modeling
1. Conceptual Data Model: High-level, business-focused model (independent of
technology).
2. Logical Data Model: Detailed model of data elements and relationships (independent
of a specific DBMS).
3. Physical Data Model: Specifies the database schema, including tables, columns, data
types, and constraints (DBMS-specific).
4. Dimensional Model: A logical design structure that uses Fact and Dimension tables,
popular for data warehousing.
Star Schema
A Star Schema is the simplest dimensional model where a central fact table directly links to
multiple surrounding dimension tables. It looks like a star, and the dimensions are not joined
to each other. It favors query simplicity and speed.
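A tiny star schema can be sketched with sqlite3; the table and column names here are illustrative. Note the query shape: one join per dimension, then aggregation over the facts:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Central fact table with foreign keys to the surrounding dimensions.
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE customer_dim (customer_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE sales_fact   (product_key INTEGER, customer_key INTEGER, amount REAL);
INSERT INTO product_dim  VALUES (1, 'books'), (2, 'games');
INSERT INTO customer_dim VALUES (10, 'north'), (11, 'south');
INSERT INTO sales_fact   VALUES (1, 10, 30.0), (1, 11, 20.0), (2, 10, 45.0);
""")

# A typical star query: join each dimension directly, aggregate the measures.
result = conn.execute("""
SELECT p.category, SUM(f.amount)
FROM sales_fact f
JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.category ORDER BY p.category
""").fetchall()
print(result)  # [('books', 50.0), ('games', 45.0)]
```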
Snowflake Schema
A Snowflake Schema is an extension of the Star Schema in which dimension tables are
further normalized (broken down) into multiple related tables. It saves storage space but
makes queries more complex because more joins are required.
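Snowflaking the product dimension from the star example would split it into product and category tables, adding a join. A sketch (table names assumed):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized into product and category tables.
CREATE TABLE category_dim (category_key INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE product_dim  (product_key INTEGER PRIMARY KEY, category_key INTEGER);
CREATE TABLE sales_fact   (product_key INTEGER, amount REAL);
INSERT INTO category_dim VALUES (1, 'books');
INSERT INTO product_dim  VALUES (10, 1), (11, 1);
INSERT INTO sales_fact   VALUES (10, 30.0), (11, 20.0);
""")

# Extra hop compared with a star schema: fact -> product -> category.
result = conn.execute("""
SELECT c.category_name, SUM(f.amount)
FROM sales_fact f
JOIN product_dim p  ON f.product_key  = p.product_key
JOIN category_dim c ON p.category_key = c.category_key
GROUP BY c.category_name
""").fetchall()
print(result)  # [('books', 50.0)]
```

Category names are stored once (less redundancy), at the cost of the extra join on every category-level query.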
Normalization and De-Normalization
Normalization is the process of organizing the columns and tables in a relational
database to minimize data redundancy and dependencies. It is typically used in OLTP
systems.
De-Normalization is the strategy of intentionally introducing redundancy to a table to
improve query performance by reducing the number of joins required. It is commonly used in
OLAP systems (Data Warehouses).
Slowly Changing Dimension (SCD) and Types
Slowly Changing Dimension (SCD) is a set of techniques for managing and tracking
changes in dimension table data over time.
Common Types of SCD:
● SCD Type 1 (Overwrite): The old data is simply overwritten with the new data. No
history is preserved.
● SCD Type 2 (Add New Row): A new row is added to the dimension table to store the
new version of the data, while the old version is preserved, typically using flags,
start/end dates, or version numbers. Full history is preserved.
● SCD Type 3 (Add New Column): A new column is added to the dimension table to
store a specific previous value (e.g., the previous region). Partial history is preserved.
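SCD Type 2 can be sketched as a small update routine over an in-memory dimension. The column names (customer_sk, is_current, start/end dates) are one common convention, not the only one:

```python
from datetime import date

def scd2_update(dim_rows, natural_key, new_attrs, today):
    """Expire the current row for natural_key and append a new current row."""
    for row in dim_rows:
        if row["customer_id"] == natural_key and row["is_current"]:
            if all(row.get(k) == v for k, v in new_attrs.items()):
                return  # nothing changed; keep the current row
            row["is_current"] = False
            row["end_date"] = today
    next_sk = max((r["customer_sk"] for r in dim_rows), default=0) + 1
    dim_rows.append({"customer_sk": next_sk, "customer_id": natural_key,
                     **new_attrs, "start_date": today, "end_date": None,
                     "is_current": True})

dim = [{"customer_sk": 1, "customer_id": "C42", "region": "north",
        "start_date": date(2023, 1, 1), "end_date": None, "is_current": True}]
scd2_update(dim, "C42", {"region": "south"}, date(2024, 6, 1))

# Both versions survive: the expired 'north' row and the current 'south' row.
print([(r["customer_sk"], r["region"], r["is_current"]) for r in dim])
# [(1, 'north', False), (2, 'south', True)]
```

Each version gets its own surrogate key, so old fact rows keep pointing at the attribute values that were current when the fact occurred.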
Uses Of SCD
SCDs are used to ensure that historical facts in the fact table are correctly linked to the
correct dimension attributes at the time the event occurred. This allows for accurate
historical reporting and trend analysis.
Surrogate Key
A Surrogate Key is a system-generated primary key in a dimension or fact table that is
not derived from the source data (i.e., it's a non-business key). It's typically a simple integer
sequence, used to ensure unique identification of rows and to maintain referential integrity
with the fact table, especially crucial for implementing SCD Type 2.
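Surrogate key assignment amounts to mapping each business key to a warehouse-generated integer, reusing the same surrogate when the key reappears. A minimal sketch (function and key names are illustrative):

```python
import itertools

# Warehouse-side generator: business keys map to stable integer surrogates.
_next_sk = itertools.count(1)
_key_map = {}

def surrogate_key(business_key):
    """Return the existing surrogate for a business key, or mint a new one."""
    if business_key not in _key_map:
        _key_map[business_key] = next(_next_sk)
    return _key_map[business_key]

print(surrogate_key("CUST-0042"))  # 1
print(surrogate_key("CUST-0099"))  # 2
print(surrogate_key("CUST-0042"))  # 1  (same business key, same surrogate)
```

In practice this is handled by an identity/sequence column plus a lookup during the load, but the idea is the same: the key is meaningless outside the warehouse, cheap to join on, and independent of source-system key changes.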