📘 Module 8: Joins, Sets & Conditional Logic
8.1 Basics of Data relationships.
Why joins are required — the “before” picture
Imagine you are an analyst. You need “customer name + the product they bought + order
amount”. If the database stored everything in one giant table (customer & order & product
columns repeated), you’d get:
• Redundant data (customer repeated on every order),
• Huge table size → slow to update and read,
• Update anomalies → change customer address, must update many rows,
• Hard to model relationships (one order has many products).
So relational databases split data into normalized tables (Customers, Orders, Products)
and store relationships using IDs. To answer a question that spans tables you have to
recombine them — that’s what joins do.
Excel analogy: You have Customers in Sheet1 and Orders in Sheet2. You’d use
VLOOKUP/INDEX-MATCH to pull them together. A SQL join does the same job, but is safer,
more scalable, and faster.
Primary Key (PK) and Foreign Key (FK)
Primary Key (PK) -
• A PK is a column (or set of columns) that uniquely identifies each row in a table.
• Properties:
o Unique
o NOT NULL (no missing values)
o Usually indexed (clustered index in many RDBMS)
• Example: Customers.customer_id, Products.product_id, Orders.order_id.
Foreign Key (FK) -
• An FK in one table is a column (or set) that refers to the PK of another table,
defining a relationship.
• Properties:
o Values typically must exist in the referenced PK column (unless FK
constraints are disabled or deferred).
o Enforces referential integrity (depending on DB settings).
• Example: Orders.customer_id → references Customers.customer_id.
How to identify PK / FK
• PK: Look for unique identifier column (id, code) declared PRIMARY KEY. If you’re
given raw CSVs, identify the column with unique values and no NULLs.
• FK: Look for columns named like <other_table>_id that match the PK values in
another table.
When to use PK / FK
• PK: every table should have a PK (either natural or surrogate).
• FK: use when you want RDBMS-level referential integrity. Use cascading options (ON
UPDATE / ON DELETE) carefully.
Features, roles, conditions (short)
• PK enforces uniqueness and speeds up lookups.
• FK enforces referential integrity and enables joins to link tables.
• FK constraints may cause insert/update/delete to fail if integrity violated — good for
correctness, but during large data loads you may temporarily disable or defer FKs and
validate later.
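A minimal sketch of PK and FK enforcement, using Python's built-in sqlite3 module as a stand-in engine (an assumption; this module does not prescribe a database). SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, and the table names customers and orders are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks FKs only when this is on
conn.execute("CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT)")
conn.execute(
    "CREATE TABLE orders ("
    " order_id INTEGER PRIMARY KEY,"
    " customer_id INTEGER REFERENCES customers(customer_id))"
)
conn.execute("INSERT INTO customers VALUES (1, 'John')")
conn.execute("INSERT INTO orders VALUES (201, 1)")  # parent row exists: accepted

try:
    conn.execute("INSERT INTO orders VALUES (203, 4)")  # no customer 4: FK violation
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

try:
    conn.execute("INSERT INTO customers VALUES (1, 'Duplicate')")  # PK must be unique
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(fk_rejected, pk_rejected, order_count)
```

Both bad inserts fail, so only the valid order is stored. Without the FK declaration, order 203 would be accepted and become an orphan row.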
The ONE core logic behind all joins
Core rule (matching keys):
1. For each pair of a row in the Left table (L) and a row in the Right table (R), evaluate
the join condition, e.g. L.key = R.key.
2. If the condition evaluates to TRUE → output a combined row (L columns + R columns).
3. If the condition is FALSE or NULL, the pair is not combined. What happens to a row that
ends up with no match at all depends on the join type:
o INNER: drop it (no output).
o LEFT: keep the unmatched L row + fill R columns with NULL.
o RIGHT: keep the unmatched R row + fill L columns with NULL.
o FULL: keep unmatched rows from both sides, with NULLs on the missing side.
o CROSS: output every pairing regardless of keys (no matching involved).
o SELF: any of the above, but the table is joined to itself under two different aliases.
This rule applies at row level. Column selection is separate — after you build the combined
row, SQL projects only the columns in the SELECT list.
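The core rule can be sketched in a few lines of Python. This is a teaching model under simple assumptions (rows are tuples, None stands for NULL, the function name join is illustrative), not how an engine is implemented.

```python
def join(left, right, l_key, r_key, how="inner"):
    """Return (left_row, right_row) pairs; None stands in for a NULL-filled side."""
    out = []
    matched_right = set()
    for l in left:
        found = False
        for i, r in enumerate(right):
            # In SQL, NULL = anything is never TRUE, so None keys never match.
            if l[l_key] is not None and l[l_key] == r[r_key]:
                out.append((l, r))
                matched_right.add(i)
                found = True
        if not found and how in ("left", "full"):
            out.append((l, None))  # unmatched left row kept, right side NULL
    if how in ("right", "full"):
        # unmatched right rows kept, left side NULL
        out += [(None, r) for i, r in enumerate(right) if i not in matched_right]
    return out

customers = [(1, "John"), (2, "Neha"), (3, "Priya")]
orders = [(201, 1, "Laptop"), (202, 2, "Phone"), (203, 4, "Tablet")]

for how in ("inner", "left", "right", "full"):
    print(how, len(join(customers, orders, 0, 1, how)))
```

The four join types share the same matching loop and differ only in what they do with unmatched rows.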
How joins perform at column level and row level
Column-level (projection) steps:
• SQL execution uses the SELECT list to decide which columns to transport to client.
• If the query only needs c.customer_id, o.total_amount, the engine may be able
to read from indexes only (covering index) and skip fetching the full row — less IO.
• Column projection reduces:
o Disk I/O (fewer bytes read),
o Network transfer,
o Memory usage in intermediate results.
Row-level (matching) steps:
• The engine must produce each matching pair of row(s) from L and R that satisfy the
join condition.
• Join algorithms control how the engine finds matching rows:
o Nested loops — for each row in outer table, scan inner (or index probe) to
find matches. Simple & works well if one side is small or inner is indexed.
o Hash join — build a hash table on the smaller input (on join key), then probe
with each row from the larger input. Good for large unsorted sets.
o Merge (sort-merge) join — both inputs are sorted on join key and merged
like a two-pointer sweep. Efficient when inputs are already sorted or when
using ordered indexes.
• The optimizer picks algorithm based on table sizes, indexes, available memory, and
statistics.
Join execution algorithms — how they work
Nested loops (NLJ)
• When used: small outer table, inner has index on join key, or when one side is tiny.
• How it works (pseudocode):
    for each row r1 in Outer:
        find matching rows in Inner using index (or scan) where Inner.key = r1.key
        for each match r2:
            emit combined row (r1 + r2)
• Cost model: O(N_outer × cost(find matches)). If index exists, find is O(log M) or
O(1) for hash index.
• Good when: outer small & inner indexed.
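As a rough Python sketch of the two nested-loops variants: a plain inner scan, and an index probe. The dict here only stands in for an index that already exists on the inner table; function names are illustrative.

```python
from collections import defaultdict

def nested_loop_join(outer, inner, outer_key, inner_key):
    """Plain nested loops: scan the whole inner input for every outer row."""
    return [(o, i) for o in outer for i in inner if o[outer_key] == i[inner_key]]

def build_index(rows, key):
    """Stand-in for a pre-existing index on the join key."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index

def index_nested_loop_join(outer, inner_index, outer_key):
    """Nested loops with an index probe instead of an inner scan."""
    return [(o, i) for o in outer for i in inner_index.get(o[outer_key], [])]

customers = [(1, "John"), (2, "Neha"), (3, "Priya")]
orders = [(201, 1, "Laptop"), (202, 2, "Phone"), (203, 4, "Tablet")]

scan_result = nested_loop_join(customers, orders, 0, 1)
probe_result = index_nested_loop_join(customers, build_index(orders, 1), 0)
print(scan_result == probe_result, len(probe_result))
```

Both variants return the same pairs; the probe version replaces the inner scan with one lookup per outer row, which is why a small outer table plus an indexed inner table is the favourable case.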
Hash join
When used: large unsorted sets, no useful indexes, memory available.
Steps:
1. Build phase: read the smaller input and build an in-memory hash table keyed
by join key.
2. Probe phase: stream the larger table and probe the hash for matches; emit
combined row on match.
Pseudocode:
hash_table = {}
for row s in SmallTable:
    add s to hash_table[s.key]
for row l in LargeTable:
    if l.key in hash_table:
        for s in hash_table[l.key]:
            emit combined row (s + l)
Cost: roughly O(N_small + N_large).
Good when: no indexes, both large but one fits in memory.
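The build and probe phases can be written as runnable Python. This is a sketch assuming both inputs fit in memory as lists of tuples; the name hash_join is illustrative.

```python
from collections import defaultdict

def hash_join(small, large, small_key, large_key):
    # Build phase: hash the smaller input on its join key.
    table = defaultdict(list)
    for s in small:
        table[s[small_key]].append(s)
    # Probe phase: stream the larger input and look up each key once.
    out = []
    for l in large:
        for s in table.get(l[large_key], []):
            out.append((s, l))
    return out

customers = [(1, "John"), (2, "Neha"), (3, "Priya")]
orders = [(201, 1, "Laptop"), (202, 2, "Phone"), (203, 4, "Tablet"), (204, 1, "Mouse")]

joined = hash_join(customers, orders, 0, 1)
print(joined)
```

Each input is read once, which is where the roughly O(N_small + N_large) cost comes from. The lookup only works for equality conditions.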
Merge (Sort-Merge) join
When used: both inputs sorted on join key (or large indexes), or when DB can sort
cheaply.
Steps:
1. Sort both by key (if not already sorted).
2. Maintain pointers p1, p2; compare keys, advance the lesser pointer; when
equal, combine duplicates accordingly.
Pseudocode (simplified):
p1 = first of A; p2 = first of B
while p1 and p2:
    if p1.key < p2.key: p1 = next(p1)
    else if p1.key > p2.key: p2 = next(p2)
    else:
        output all combinations of rows having same key on both sides
        advance pointers past this key
Good when: large sorted data, or when merge avoids random IO.
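A runnable version of the two-pointer sweep, as a sketch under the assumption that join keys are non-NULL and comparable (the name merge_join is illustrative). The sort calls are only needed when the inputs do not already arrive in key order.

```python
def merge_join(a, b, a_key, b_key):
    a = sorted(a, key=lambda r: r[a_key])
    b = sorted(b, key=lambda r: r[b_key])
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ka, kb = a[i][a_key], b[j][b_key]
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(a) and a[i_end][a_key] == ka:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][b_key] == ka:
                j_end += 1
            # Emit every combination within the two runs, then skip past them.
            out += [(ra, rb) for ra in a[i:i_end] for rb in b[j:j_end]]
            i, j = i_end, j_end
    return out

customers = [(3, "Priya"), (1, "John"), (2, "Neha")]
orders = [(202, 2, "Phone"), (201, 1, "Laptop"), (203, 4, "Tablet"), (204, 1, "Mouse")]

merged = merge_join(customers, orders, 0, 1)
print(merged)
```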
📘 Joins Explained with Small Data
Customers
customer_id customer_name
1 John
2 Neha
3 Priya
Orders
order_id customer_id product
201 1 Laptop
202 2 Phone
203 4 Tablet
1) INNER JOIN
SQL
SELECT c.customer_id, c.customer_name, o.order_id, o.product
FROM Customers c
INNER JOIN Orders o
ON c.customer_id = o.customer_id;
Result Grid
customer_id customer_name order_id product
1 John 201 Laptop
2 Neha 202 Phone
Visual (Venn)
Customers ●∩● Orders
Only the overlap part is returned.
2) LEFT JOIN
SQL
SELECT c.customer_id, c.customer_name, o.order_id, o.product
FROM Customers c
LEFT JOIN Orders o
ON c.customer_id = o.customer_id;
Result Grid
customer_id customer_name order_id product
1 John 201 Laptop
2 Neha 202 Phone
3 Priya NULL NULL
Visual (Venn)
● Customers (all kept)
+ overlap with Orders
Priya kept with NULLs.
3) RIGHT JOIN
SQL
SELECT c.customer_id, c.customer_name, o.order_id, o.product
FROM Customers c
RIGHT JOIN Orders o
ON c.customer_id = o.customer_id;
Result Grid
customer_id customer_name order_id product
1 John 201 Laptop
2 Neha 202 Phone
NULL NULL 203 Tablet
Visual (Venn)
Orders ● (all kept)
+ overlap with Customers
Orphan Order 203 kept.
4) FULL OUTER JOIN
SQL
SELECT c.customer_id, c.customer_name, o.order_id, o.product
FROM Customers c
FULL OUTER JOIN Orders o
ON c.customer_id = o.customer_id;
Result Grid
customer_id customer_name order_id product
1 John 201 Laptop
2 Neha 202 Phone
3 Priya NULL NULL
NULL NULL 203 Tablet
Visual (Venn)
Customers ●∪● Orders
Both sides fully included.
Priya + orphan Order shown.
5) CROSS JOIN (Cartesian)
SQL
SELECT c.customer_name, o.order_id, o.product
FROM Customers c
CROSS JOIN Orders o;
Result Grid (3 × 3 = 9 rows)
customer_name order_id product
John 201 Laptop
John 202 Phone
John 203 Tablet
Neha 201 Laptop
Neha 202 Phone
Neha 203 Tablet
Priya 201 Laptop
Priya 202 Phone
Priya 203 Tablet
Visual (Venn)
Every row of Customers × every row of Orders.
Rectangle grid (no overlap logic).
6) SELF JOIN (Employee hierarchy)
SQL
SELECT e.emp_name AS employee, m.emp_name AS manager
FROM Employees e
LEFT JOIN Employees m
ON e.manager_id = m.emp_id;
Result Grid
employee manager
Raj NULL
Anita Raj
Mohit Anita
Visual (Venn)
Table joined with itself using aliases.
Shows employee–manager relation.
SELECT * vs selecting specific columns: what happens and why you should avoid * in joins
What SELECT * does
• Returns all columns from each table in the query. In a join, if tables have n and m
columns you get n + m columns (or more with duplicated names).
• When column names clash (e.g., id in both tables) the result may contain ambiguous
names in some clients. Many clients rename such columns to table.column, but
that’s not always convenient.
Why SELECT * is bad with joins
1. Performance / Network / Memory — transfers more bytes than needed. If you only
need customer_name and total_amount, sending 30 columns per row wastes time
and memory.
2. Ambiguity — same column names from different tables cause confusion. You should
always qualify columns: c.customer_name, o.total_amount.
3. Breaking changes — if schema adds a column later, your code may silently break or
pick up the new column unexpectedly.
4. Index usage — when selecting only indexed columns (covering index), DBs can use
index-only scan; SELECT * kills that optimization.
Example (practical)
• SELECT * FROM Customers c JOIN Orders o ON c.customer_id =
o.customer_id; → large row width.
• SELECT c.customer_id, c.customer_name, o.order_id, o.total_amount
FROM Customers c JOIN Orders o ON c.customer_id = o.customer_id; →
smaller, clearer, more index-friendly.
Rule of thumb: Always select only the columns you need in production queries.
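The ambiguity problem is easy to reproduce. The sketch below uses Python's sqlite3 module as a convenient test engine (an assumption; other clients may label clashing columns differently) and two cut-down illustrative tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INT, customer_name TEXT);
    CREATE TABLE orders (order_id INT, customer_id INT, total_amount REAL);
    INSERT INTO customers VALUES (1, 'John');
    INSERT INTO orders VALUES (201, 1, 80000);
""")

star = conn.execute(
    "SELECT * FROM customers c JOIN orders o ON c.customer_id = o.customer_id"
)
star_cols = [d[0] for d in star.description]

narrow = conn.execute(
    "SELECT c.customer_name, o.total_amount "
    "FROM customers c JOIN orders o ON c.customer_id = o.customer_id"
)
narrow_cols = [d[0] for d in narrow.description]

print(star_cols)    # customer_id is listed twice
print(narrow_cols)
```

With SELECT * the client receives two columns both called customer_id and cannot tell them apart by name; the explicit column list returns only the two columns the report needs.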
📊 Table of All Joins
Join Type | Logic (in words) | SQL Syntax | Industry Example
INNER JOIN | Only rows that exist in both tables (intersection). | FROM A INNER JOIN B ON A.key = B.key | Customers with actual orders.
LEFT JOIN | All rows from left table, matched rows from right, NULL otherwise. | FROM A LEFT JOIN B ON A.key = B.key | All employees + their projects (NULL if none).
RIGHT JOIN | All rows from right table, matched rows from left, NULL otherwise. | FROM A RIGHT JOIN B ON A.key = B.key | All orders, even if customer data is missing.
FULL OUTER JOIN | All rows from both tables, matched where possible, NULL otherwise. | FROM A FULL OUTER JOIN B ON A.key = B.key | Bank accounts vs transactions reconciliation.
CROSS JOIN | Every row from left paired with every row from right (Cartesian product). | FROM A CROSS JOIN B | Generate all customer × product promo combinations.
SELF JOIN | Table joined with itself using alias. | FROM A x JOIN A y ON ... | Employee-manager hierarchy in HR.
Reference dataset :
Note: For demonstration we deliberately do not declare FK constraints on the demo tables so
we can show unmatched rows and 'orphan' data (common real-world situation after data
imports). In real OLTP systems you typically have FKs.
-- Customers demo
CREATE TABLE customers_demo (
customer_id INT,
customer_name VARCHAR(50),
region VARCHAR(20)
);
INSERT INTO customers_demo (customer_id, customer_name, region) VALUES
(1, 'John Doe', 'North'),
(2, 'Neha Gupta', 'South'),
(3, 'Priya Sharma', 'West'),
(4, 'Amit Patel', 'North'),
(5, 'Sara Khan', 'South'),
(6, 'Ravi Iyer', 'East'),
(7, 'Emily Brown', 'West'),
(8, 'Chris Martin', 'North'),
(9, 'Zara Khan', 'East'); -- 9th customer with no orders in some test cases
-- Products demo
CREATE TABLE products_demo (
product_id INT,
product_name VARCHAR(50),
category VARCHAR(20),
price DECIMAL(10,2)
);
INSERT INTO products_demo (product_id, product_name, category, price)
VALUES
(101, 'Laptop', 'Electronics', 80000),
(102, 'Headphones', 'Electronics', 3000),
(103, 'Smartphone', 'Electronics', 45000),
(104, 'Desk Chair', 'Furniture', 7000),
(105, 'Office Desk', 'Furniture', 15000);
-- Orders demo (includes some rows that reference missing product or
-- missing customer to illustrate unmatched behaviour)
CREATE TABLE orders_demo (
order_id INT,
customer_id INT,
product_id INT,
order_date DATE,
total_amount DECIMAL(10,2)
);
INSERT INTO orders_demo (order_id, customer_id, product_id, order_date,
total_amount) VALUES
(201, 1, 101, '2025-01-10', 80000),
(202, 2, 103, '2025-01-11', 45000),
(203, 3, 102, '2025-01-12', 3000),
(204, 4, 104, '2025-01-13', 7000),
(205, 5, 105, '2025-01-14', 15000),
(206, 6, 999, '2025-01-15', 100), -- product_id 999 DOES NOT EXIST in products_demo (orphan product)
(207, 7, 101, '2025-01-16', 80000),
(208, 8, 103, '2025-01-17', 45000),
(209, 2, 105, '2025-01-18', 8500),
(210, 10, 102, '2025-01-19', 3000), -- customer_id 10 DOES NOT EXIST in customers_demo (orphan customer)
(211, 1, 103, '2025-01-20', 45000);
Dataset facts:
• 9 customers (IDs 1..9).
• 5 known products + 1 orphan product (999).
• Orders: 11 rows with one order referencing non-existent product (206) and one order
referencing non-existent customer (210).
• This mix lets us show matched and unmatched behaviour.
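The dataset facts can be checked mechanically. The sketch below loads a trimmed copy of the reference data (only the key and name columns, an intentional simplification) into an in-memory SQLite database through Python's sqlite3 module; the helper ids is illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_demo (customer_id INT, customer_name TEXT)")
conn.execute("CREATE TABLE products_demo (product_id INT, product_name TEXT)")
conn.execute("CREATE TABLE orders_demo (order_id INT, customer_id INT, product_id INT)")

names = ["John Doe", "Neha Gupta", "Priya Sharma", "Amit Patel", "Sara Khan",
         "Ravi Iyer", "Emily Brown", "Chris Martin", "Zara Khan"]
conn.executemany("INSERT INTO customers_demo VALUES (?, ?)",
                 list(enumerate(names, start=1)))  # customer_id 1..9
conn.executemany("INSERT INTO products_demo VALUES (?, ?)", [
    (101, "Laptop"), (102, "Headphones"), (103, "Smartphone"),
    (104, "Desk Chair"), (105, "Office Desk")])
conn.executemany("INSERT INTO orders_demo VALUES (?, ?, ?)", [
    (201, 1, 101), (202, 2, 103), (203, 3, 102), (204, 4, 104),
    (205, 5, 105), (206, 6, 999), (207, 7, 101), (208, 8, 103),
    (209, 2, 105), (210, 10, 102), (211, 1, 103)])

def ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

counts = [ids(f"SELECT COUNT(*) FROM {t}")[0]
          for t in ("customers_demo", "products_demo", "orders_demo")]
orphan_product_orders = ids("""
    SELECT o.order_id FROM orders_demo o
    LEFT JOIN products_demo p ON o.product_id = p.product_id
    WHERE p.product_id IS NULL""")
orphan_customer_orders = ids("""
    SELECT o.order_id FROM orders_demo o
    LEFT JOIN customers_demo c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL""")
print(counts, orphan_product_orders, orphan_customer_orders)
```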
8.2 Basics of Data relationships.
INNER JOIN (intersection)
Concept: Return only rows where the join condition is true in both tables.
SQL :
SELECT c.customer_id, c.customer_name, o.order_id, o.product_id,
o.total_amount
FROM customers_demo c
INNER JOIN orders_demo o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;
Result grid :
customer_id customer_name order_id product_id total_amount
1 John Doe 201 101 80000.00
1 John Doe 211 103 45000.00
2 Neha Gupta 202 103 45000.00
2 Neha Gupta 209 105 8500.00
3 Priya Sharma 203 102 3000.00
4 Amit Patel 204 104 7000.00
5 Sara Khan 205 105 15000.00
6 Ravi Iyer 206 999 100.00
7 Emily Brown 207 101 80000.00
8 Chris Martin 208 103 45000.00
Row-by-row explanation (how this result formed):
• The engine scanned orders_demo and for each order checked whether
orders_demo.customer_id exists in customers_demo.customer_id.
• order 210 had customer_id = 10 which does not exist in customers_demo → it is
excluded because INNER JOIN keeps only matching rows.
• order 206 has product_id = 999 which is not in products_demo — but that does
NOT affect this join because we only joined Customers↔Orders in this query; order
206 has a customer_id = 6 which exists, so it's included.
Logical mapping example:
• o=201 → o.customer_id=1 → there exists c.customer_id=1 → output row (201 +
John Doe).
• o=210 → o.customer_id=10 → no c.customer_id=10 → not in output.
LEFT OUTER JOIN (keep all left rows)
Concept: Keep all rows from the left table (customers). For each left row, attach matching
right rows if found; otherwise fill right columns with NULL.
SQL:
SELECT c.customer_id, c.customer_name, o.order_id, o.total_amount
FROM customers_demo c
LEFT JOIN orders_demo o
ON c.customer_id = o.customer_id
ORDER BY c.customer_id, o.order_id;
Result grid :
customer_id customer_name order_id total_amount
1 John Doe 201 80000.00
1 John Doe 211 45000.00
2 Neha Gupta 202 45000.00
2 Neha Gupta 209 8500.00
3 Priya Sharma 203 3000.00
4 Amit Patel 204 7000.00
5 Sara Khan 205 15000.00
6 Ravi Iyer 206 100.00
7 Emily Brown 207 80000.00
8 Chris Martin 208 45000.00
9 Zara Khan NULL NULL
Row-by-row explanation:
• Every customer row appears at least once.
• Customers 1..8 have one or more matching orders → output one row per matching
order (thus some customers have multiple rows).
• Customer 9 (Zara) has no orders → appears with order_id = NULL, total_amount
= NULL.
• Orders referencing non-existent customers (order 210: customer_id=10) do not
generate rows because they're on the right side; LEFT JOIN keeps left rows
regardless.
Use case & reasoning: Use LEFT JOIN when you want to list all customers and include
orders where they exist (example: “list customers and their latest order if any”).
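A common follow-up to the LEFT JOIN is the anti-join: keep only the left rows whose right side came back NULL. The sketch below runs both queries on a trimmed copy of the reference data with Python's sqlite3 module (the engine choice and the trimmed columns are assumptions made for brevity).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_demo (customer_id INT, customer_name TEXT)")
conn.execute("CREATE TABLE orders_demo (order_id INT, customer_id INT)")
names = ["John Doe", "Neha Gupta", "Priya Sharma", "Amit Patel", "Sara Khan",
         "Ravi Iyer", "Emily Brown", "Chris Martin", "Zara Khan"]
conn.executemany("INSERT INTO customers_demo VALUES (?, ?)",
                 list(enumerate(names, start=1)))
order_customers = [1, 2, 3, 4, 5, 6, 7, 8, 2, 10, 1]  # customer_id of orders 201..211
conn.executemany("INSERT INTO orders_demo VALUES (?, ?)",
                 list(enumerate(order_customers, start=201)))

left_rows = conn.execute("""
    SELECT c.customer_id, c.customer_name, o.order_id
    FROM customers_demo c
    LEFT JOIN orders_demo o ON c.customer_id = o.customer_id
    ORDER BY c.customer_id, o.order_id""").fetchall()

# Anti-join: customers for whom the LEFT JOIN found no order at all.
no_orders = conn.execute("""
    SELECT c.customer_id, c.customer_name
    FROM customers_demo c
    LEFT JOIN orders_demo o ON c.customer_id = o.customer_id
    WHERE o.order_id IS NULL""").fetchall()

print(len(left_rows), no_orders)
```

The filter must test a right-side column that is never NULL in a real match (here order_id), otherwise genuine matches with NULL values would be mistaken for missing rows.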
RIGHT OUTER JOIN (keep all right rows)
Concept: Mirror of LEFT JOIN — keep all rows from the right table (Orders). (Note:
MySQL supports RIGHT JOIN; some style guides prefer using LEFT JOIN by flipping table
order.)
SQL (RIGHT JOIN):
SELECT c.customer_id, c.customer_name, o.order_id, o.total_amount
FROM customers_demo c
RIGHT JOIN orders_demo o
ON c.customer_id = o.customer_id
ORDER BY o.order_id;
Result grid :
customer_id customer_name order_id total_amount
1 John Doe 201 80000.00
2 Neha Gupta 202 45000.00
3 Priya Sharma 203 3000.00
4 Amit Patel 204 7000.00
5 Sara Khan 205 15000.00
6 Ravi Iyer 206 100.00
7 Emily Brown 207 80000.00
8 Chris Martin 208 45000.00
2 Neha Gupta 209 8500.00
NULL NULL 210 3000.00
1 John Doe 211 45000.00
Row-by-row explanation:
• Order 210 has customer_id = 10 which is not present in customers_demo →
RIGHT JOIN keeps the order row and fills customer columns with NULL. This shows
orders that don’t have a valid customer record (orphan orders).
• Useful for identifying data quality issues on the right table.
Note: RIGHT JOIN can always be expressed as LEFT JOIN by swapping tables; many teams
standardize on LEFT JOIN for consistency.
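The table swap can be tried directly. The sketch below runs only the LEFT JOIN form, with orders_demo moved to the left, on a trimmed copy of the reference data via Python's sqlite3 module. The engine is an assumption, and older SQLite builds do not accept RIGHT JOIN at all, which is one more reason to prefer the LEFT form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_demo (customer_id INT, customer_name TEXT)")
conn.execute("CREATE TABLE orders_demo (order_id INT, customer_id INT)")
names = ["John Doe", "Neha Gupta", "Priya Sharma", "Amit Patel", "Sara Khan",
         "Ravi Iyer", "Emily Brown", "Chris Martin", "Zara Khan"]
conn.executemany("INSERT INTO customers_demo VALUES (?, ?)",
                 list(enumerate(names, start=1)))
order_customers = [1, 2, 3, 4, 5, 6, 7, 8, 2, 10, 1]  # customer_id of orders 201..211
conn.executemany("INSERT INTO orders_demo VALUES (?, ?)",
                 list(enumerate(order_customers, start=201)))

# Same result as customers_demo RIGHT JOIN orders_demo: every order is kept.
rows = conn.execute("""
    SELECT o.order_id, c.customer_name
    FROM orders_demo o
    LEFT JOIN customers_demo c ON c.customer_id = o.customer_id
    ORDER BY o.order_id""").fetchall()

name_by_order = dict(rows)
print(len(rows), name_by_order[210])
```

All 11 orders appear, and order 210 carries a NULL customer name, matching the RIGHT JOIN result grid.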
FULL OUTER JOIN (keep everything from both sides)
Concept: Keep all rows from both tables. Where matches exist, combine; where not exist, fill
NULLs on the missing side.
SQL (supported in PostgreSQL and SQL Server; not supported in MySQL):
SELECT c.customer_id, c.customer_name, o.order_id, o.total_amount
FROM customers_demo c
FULL OUTER JOIN orders_demo o
ON c.customer_id = o.customer_id
ORDER BY COALESCE(c.customer_id, o.customer_id), o.order_id;
Result grid :
customer_id customer_name order_id total_amount
1 John Doe 201 80000.00
1 John Doe 211 45000.00
2 Neha Gupta 202 45000.00
2 Neha Gupta 209 8500.00
3 Priya Sharma 203 3000.00
4 Amit Patel 204 7000.00
5 Sara Khan 205 15000.00
6 Ravi Iyer 206 100.00
7 Emily Brown 207 80000.00
8 Chris Martin 208 45000.00
9 Zara Khan NULL NULL
NULL NULL 210 3000.00
Row-by-row explanation:
• Shows both customers without orders (Zara) and orders without customers (order
210).
• FULL OUTER JOIN is the union of LEFT and RIGHT behavior.
When to use: Rare in OLTP queries but helpful for complete reconciliation between two
datasets (e.g., reconciliation of shipments vs orders).
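Where FULL OUTER JOIN is unavailable, one possible emulation is a LEFT JOIN plus the right-only rows appended with UNION ALL. This is a variant chosen here, not the only workaround: a plain UNION of LEFT and RIGHT joins also works but removes genuinely duplicate rows. The sketch runs it on a trimmed copy of the reference data through Python's sqlite3 module (engine and trimmed columns are assumptions).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_demo (customer_id INT, customer_name TEXT)")
conn.execute("CREATE TABLE orders_demo (order_id INT, customer_id INT)")
names = ["John Doe", "Neha Gupta", "Priya Sharma", "Amit Patel", "Sara Khan",
         "Ravi Iyer", "Emily Brown", "Chris Martin", "Zara Khan"]
conn.executemany("INSERT INTO customers_demo VALUES (?, ?)",
                 list(enumerate(names, start=1)))
order_customers = [1, 2, 3, 4, 5, 6, 7, 8, 2, 10, 1]  # customer_id of orders 201..211
conn.executemany("INSERT INTO orders_demo VALUES (?, ?)",
                 list(enumerate(order_customers, start=201)))

full_rows = conn.execute("""
    -- Part 1: all customers, with their orders where they exist
    SELECT c.customer_id, c.customer_name, o.order_id
    FROM customers_demo c
    LEFT JOIN orders_demo o ON c.customer_id = o.customer_id
    UNION ALL
    -- Part 2: only the orders that matched no customer
    SELECT NULL, NULL, o.order_id
    FROM orders_demo o
    LEFT JOIN customers_demo c ON c.customer_id = o.customer_id
    WHERE c.customer_id IS NULL""").fetchall()

print(len(full_rows))
```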
CROSS JOIN (Cartesian product)
Concept: Every row from left is paired with every row from right — no matching needed.
SQL:
SELECT c.customer_id, c.customer_name, p.product_id, p.product_name
FROM customers_demo c
CROSS JOIN products_demo p
ORDER BY c.customer_id, p.product_id;
Result facts:
• customers_demo has 9 rows, products_demo has 5 rows → result = 9 × 5 = 45 rows.
Sample output (first few rows):
customer_id customer_name product_id product_name
1 John Doe 101 Laptop
1 John Doe 102 Headphones
1 John Doe 103 Smartphone
1 John Doe 104 Desk Chair
1 John Doe 105 Office Desk
2 Neha Gupta 101 Laptop
... ... ... ...
Use & caution:
• Useful to generate combinations (e.g., all customers × all promos), but dangerous if
either table is large → explosive size.
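• A Cartesian join occurs accidentally when you omit the join condition. MySQL accepts
JOIN without ON and returns the cross product (PostgreSQL and SQL Server reject it as
a syntax error). Always include ON (or USING) unless you explicitly want a cross product.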
SELF JOIN
Concept: Join a table to itself for relationships contained within the same table (e.g.,
employee-manager).
Example dataset for employees:
CREATE TABLE employees_demo (
emp_id INT,
emp_name VARCHAR(50),
manager_emp_id INT
);
INSERT INTO employees_demo (emp_id, emp_name, manager_emp_id) VALUES
(301, 'Raj Verma', NULL), -- top manager
(302, 'Anita Desai', 301),
(303, 'Mohit Kumar', 301),
(304, 'Rohit Nair', 302),
(305, 'Sonal Gupta', 302);
Query: employees with their manager names
SELECT e.emp_id, e.emp_name, m.emp_id AS manager_id, m.emp_name AS
manager_name
FROM employees_demo e
LEFT JOIN employees_demo m
ON e.manager_emp_id = m.emp_id
ORDER BY e.emp_id;
Output:
emp_id emp_name manager_id manager_name
301 Raj Verma NULL NULL
302 Anita Desai 301 Raj Verma
303 Mohit Kumar 301 Raj Verma
304 Rohit Nair 302 Anita Desai
305 Sonal Gupta 302 Anita Desai
Row-by-row logic: For each employee e, find row m whose m.emp_id =
e.manager_emp_id. If none, manager columns are NULL (top-level).
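The choice of LEFT JOIN matters in a self join. The sketch below (Python's sqlite3 module as an assumed test engine) runs the same query with LEFT JOIN and with INNER JOIN on the employees_demo rows to show which employee disappears.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees_demo (emp_id INT, emp_name TEXT, manager_emp_id INT)"
)
conn.executemany("INSERT INTO employees_demo VALUES (?, ?, ?)", [
    (301, "Raj Verma", None),  # top manager, no manager row to match
    (302, "Anita Desai", 301),
    (303, "Mohit Kumar", 301),
    (304, "Rohit Nair", 302),
    (305, "Sonal Gupta", 302)])

query = """
    SELECT e.emp_name, m.emp_name
    FROM employees_demo e
    {join} employees_demo m ON e.manager_emp_id = m.emp_id
    ORDER BY e.emp_id"""

with_top = conn.execute(query.format(join="LEFT JOIN")).fetchall()
without_top = conn.execute(query.format(join="INNER JOIN")).fetchall()
print(len(with_top), len(without_top))
```

The INNER JOIN version silently drops Raj Verma, because a NULL manager_emp_id never equals any emp_id.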
Cartesian join — what it is and why accidental Cartesian
joins happen
Definition: A Cartesian join produces the Cartesian product of two tables: every row in A
combined with every row in B.
Accidental cause: forgetting the ON clause in JOIN, e.g. SELECT * FROM A JOIN B; (no
ON) — or using commas FROM A, B without WHERE.
Example (danger demonstration):
SELECT COUNT(*) FROM customers_demo c CROSS JOIN orders_demo o;
/* returns 9 * 11 = 99 */
• Often accidental and leads to massive result sets and slow queries.
How to spot accidental Cartesian:
• Unusually high row counts (product of row counts).
• EXPLAIN plan showing Nested Loop without filter.
• Query returns unintended combinations.
How to avoid:
• Always write explicit JOIN ... ON ....
• Use aliases and column qualification.
• Run SELECT COUNT(*) for a quick sanity check during development.
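The COUNT(*) sanity check can be sketched as follows, using Python's sqlite3 module as an assumed test engine and only the key columns of the demo tables. If a join returns exactly rows(A) × rows(B), the join condition is probably missing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers_demo (customer_id INT)")
conn.execute("CREATE TABLE orders_demo (order_id INT, customer_id INT)")
conn.executemany("INSERT INTO customers_demo VALUES (?)",
                 [(i,) for i in range(1, 10)])  # customer_id 1..9
order_customers = [1, 2, 3, 4, 5, 6, 7, 8, 2, 10, 1]  # customer_id of orders 201..211
conn.executemany("INSERT INTO orders_demo VALUES (?, ?)",
                 list(enumerate(order_customers, start=201)))

def count(sql):
    return conn.execute(sql).fetchone()[0]

cross_count = count("SELECT COUNT(*) FROM customers_demo c CROSS JOIN orders_demo o")
comma_count = count("SELECT COUNT(*) FROM customers_demo c, orders_demo o")
keyed_count = count("""
    SELECT COUNT(*) FROM customers_demo c
    JOIN orders_demo o ON c.customer_id = o.customer_id""")

product = count("SELECT COUNT(*) FROM customers_demo") * count(
    "SELECT COUNT(*) FROM orders_demo")
print(cross_count, comma_count, keyed_count, product)
```

The comma join without WHERE returns the same 99 rows as the explicit CROSS JOIN, while the keyed join returns 10.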
How joins behave with small vs large data — tiny
examples and big-data behavior
Tiny dataset (teaching): the ones we used above — engine likely uses nested
loops or index lookup.
Medium / larger dataset example:
Suppose:
• customers has 1,000 rows,
• orders has 2,000,000 rows,
• there's an index on orders.customer_id.
Likely plans:
• If you do customers JOIN orders ON customers.customer_id =
orders.customer_id:
o Engine may pick customers as outer, orders as inner and do nested-loops
with index lookups for each customer (1,000 × index probe) — efficient.
• If neither side has good index and tables are both large (millions), optimizer prefers
hash join if memory allows.
Very large / Industry-scale scenario:
• customers 5 million, orders 500 million.
• You need to:
o Ensure statistics are up-to-date (optimizer needs them).
o Partition orders by date or region to avoid scanning whole table.
o Use appropriate indexes (e.g., clustered index on orders(customer_id,
order_date)).
o Consider pushing filters and aggregations down (e.g., filter by date range first, then join).
o Consider Pre-aggregating (CTAS) or using materialized views for repeated
analytic queries.
Performance Tips at scale:
• Prefer hash join for large unsorted datasets (DB picks automatically).
• For time-series joins, use partition pruning on date columns.
• Avoid functions on join keys (e.g., ON TO_CHAR(c.id) = o.cust_id): they prevent
index usage on the wrapped column (non-SARGable).
• Use covering indexes if possible to perform index-only scans.
Complex / industry example (E-commerce reconciliation)
— full reasoning
Problem statement: For the monthly dashboard, show all orders in January 2025 with
order_id, order_date, customer_name, product_name, and shipment_status. There may
be orders without shipments yet and shipments without match to orders (data quality).
Tables: orders_demo, customers_demo, products_demo, shipments (imagine shipments
table).
Approach:
• Use orders LEFT JOIN customers (to include orders even with bad customers)
• LEFT JOIN products (so orders with unknown product show NULL)
• LEFT JOIN shipments (so orders without shipments show NULL shipments)
• Use a FULL OUTER JOIN between shipments and orders only for reconciliation if
needed.
SQL (simplified):
SELECT o.order_id, o.order_date,
c.customer_name,
p.product_name,
s.shipment_status
FROM orders_demo o
LEFT JOIN customers_demo c ON o.customer_id = c.customer_id
LEFT JOIN products_demo p ON o.product_id = p.product_id
LEFT JOIN shipments s ON o.order_id = s.order_id
WHERE o.order_date BETWEEN '2025-01-01' AND '2025-01-31'
ORDER BY o.order_id;
Explanation of join choices:
• LEFT JOIN customers_demo because even if the customer record is missing (data
quality), we want to see the order (to troubleshoot).
• Similarly for products and shipments.
What to expect (row-level):
• Order 206 → customer = Ravi Iyer, product = NULL (product_id=999 missing),
shipment maybe NULL.
• Order 210 → customer = NULL (customer_id=10 missing), product = Headphones,
shipment maybe present.
Scaling nuance: If this report is run nightly against 500M orders table:
• Partition orders by month and use partition pruning on order_date.
• Ensure indexes on orders.order_date, orders.customer_id,
orders.product_id.
• Consider building a nightly pre-joined reporting table (CTAS) or materialized view
for dashboard.
Sub-concepts: duplicate keys, composite keys, non-equi joins
Duplicate keys
• If join key is not unique on either side, you may get multiplicative output.
• Example: Customer A placed 3 orders; product X appears in 4 rows; joining
customers × products × orders yields 3 × 4 rows for that key if join logic matches both
dimensions — be careful.
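The fan-out from duplicate keys can be predicted before running a join: for each key value, the output has (rows on the left) × (rows on the right). A small Python sketch, with the illustrative helper name join_row_count:

```python
from collections import Counter

def join_row_count(left_keys, right_keys):
    """Rows an inner equi-join would emit, given the key column of each side."""
    left_counts, right_counts = Counter(left_keys), Counter(right_keys)
    return sum(n * right_counts[k] for k, n in left_counts.items())

three_by_four = join_row_count(["A"] * 3, ["A"] * 4)  # key duplicated on both sides
one_by_four = join_row_count(["A"], ["A"] * 4)        # key unique on the left side
print(three_by_four, one_by_four)
```

When the key is unique on one side, the join can never return more rows than the other side has, which is why joining on a PK is safe against this kind of multiplication.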
Composite keys
• Join on multiple columns: ON a.key1 = b.key1 AND a.key2 = b.key2. Use when
uniqueness requires more than one column.
Non-equi joins
• Join conditions other than equality (range joins): ON a.date BETWEEN
b.start_date AND b.end_date. Hash joins need an equality condition, so they do not
apply here; the optimizer may use nested loops or a merge join.
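A range join can be sketched as a nested loop whose test is an interval check rather than equality. The orders and promotion periods below are hypothetical data made up for illustration.

```python
from datetime import date

orders = [(201, date(2025, 1, 10)), (202, date(2025, 1, 20)), (203, date(2025, 2, 5))]
promos = [
    ("New Year Sale", date(2025, 1, 1), date(2025, 1, 15)),
    ("Republic Day Sale", date(2025, 1, 18), date(2025, 1, 26)),
]

# Equivalent of: ON o.order_date BETWEEN p.start_date AND p.end_date
matches = [
    (order_id, promo_name)
    for order_id, order_date in orders
    for promo_name, start, end in promos
    if start <= order_date <= end
]
print(matches)
```

There is no single key value to hash on, so every order has to be compared against candidate periods; order 203 falls outside both periods and produces no row.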
Best practices & checklist (practical quick reference)
• Always use explicit JOIN ... ON ... syntax (no comma joins).
• Qualify columns with table aliases (c.customer_name) to avoid ambiguity.
• Select only columns you need (avoid SELECT *).
• Ensure join keys are indexed — index FK columns in child tables and PK columns are
already indexed.
• Keep joins simple in stepwise fashion: build up complex queries using CTEs
(readability + debug).
• Use EXPLAIN / execution plan to see chosen join algorithm.
• Avoid functions on join keys (ON UPPER(a.name) = UPPER(b.name) breaks index
use).
• When joining very large tables, consider pre-aggregating or creating materialized
views.
• Monitor #rows output and beware Cartesian join explosion.
Final best practices checklist (short)
• Always use JOIN .. ON .. (never comma joins).
• Qualify column names (c.customer_id).
• Avoid SELECT *, especially in joins.
• Index PKs and FKs; consider composite indexes for multi-column joins.
• Use EXPLAIN and examine actual plan and row estimates.
• Use CTEs to break complex joins into steps for readability and debugging.
• For very large joins: partitioning, pre-aggregation (CTAS/materialized views), batch
processing.
🔄 Cross-SQL Reference Table
Join Type | MySQL 8+ | PostgreSQL | SQL Server | Notes / Alternatives
INNER JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Same syntax across DBs.
LEFT JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Alias: LEFT OUTER JOIN also valid.
RIGHT JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Works in all three. Often less used, can rewrite as LEFT JOIN with swapped tables.
FULL OUTER JOIN | ❌ No | ✅ Yes | ✅ Yes | MySQL doesn’t support directly. Use UNION of LEFT + RIGHT JOIN.
CROSS JOIN | ✅ Yes | ✅ Yes | ✅ Yes | In MySQL, omitting the ON clause has the same effect (danger: accidental Cartesian).
SELF JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Same as normal JOIN with table aliases.
⚠ Key portability note:
• MySQL requires workarounds for FULL OUTER JOIN.
• SQL Server + PostgreSQL support all join types directly.
• All engines handle INNER/LEFT/RIGHT consistently.
Join Types (Upgraded with Bigger Dataset + Before/After Views)
🔹 Step 1 — Our Dataset
We now define two larger, more realistic tables:
Customers
customer_id customer_name region loyalty_points
1 John Doe North 1200
2 Neha Gupta South 2000
3 Priya Sharma West 1500
4 Amit Patel East 800
5 Sara Khan North 2500
6 David Miller South 300
7 Emily Brown West 1000
Orders
order_id customer_id product amount
101 1 Laptop 75,000
102 2 Phone 30,000
103 2 Tablet 20,000
104 5 Headphones 5,000
105 8 Camera 15,000
106 3 Monitor 12,000
107 9 Keyboard 2,500
Dataset Explanation
• Customers table: holds who our buyers are (id, name, region, loyalty points).
• Orders table: holds what they bought (id, customer_id FK, product, amount).
• Some orders belong to customers in our table (customer_id = 1,2,3,5).
• Some orders reference non-existent customers (customer_id = 8,9) → to test orphan rows.
• Some customers never placed orders (customer_id = 4,6,7) → to test missing matches.
This setup lets us see all possible join cases in action.
🔹 Step 2 — INNER JOIN
SQL
SELECT c.customer_id, c.customer_name, c.region, o.order_id, o.product,
o.amount
FROM Customers c
INNER JOIN Orders o
ON c.customer_id = o.customer_id;
Before & After Logic
• Take Customers (7 rows).
• Take Orders (7 rows).
• Match where c.customer_id = o.customer_id.
• Drop unmatched customers (4,6,7) and orphan orders (105,107).
Result Grid (INNER JOIN)
customer_id customer_name region order_id product amount
1 John Doe North 101 Laptop 75000
2 Neha Gupta South 102 Phone 30000
2 Neha Gupta South 103 Tablet 20000
3 Priya Sharma West 106 Monitor 12000
5 Sara Khan North 104 Headphones 5000
Venn (INNER JOIN)
Customers ●∩● Orders → only overlapping ids (1,2,3,5)
🔹 Step 3 — LEFT JOIN
SQL
SELECT c.customer_id, c.customer_name, c.region, o.order_id, o.product,
o.amount
FROM Customers c
LEFT JOIN Orders o
ON c.customer_id = o.customer_id;
Before & After Logic
• Keep all Customers (7 rows).
• Attach matching orders if any.
• Where no orders → fill with NULLs.
Result Grid (LEFT JOIN)
customer_id customer_name region order_id product amount
1 John Doe North 101 Laptop 75000
2 Neha Gupta South 102 Phone 30000
2 Neha Gupta South 103 Tablet 20000
3 Priya Sharma West 106 Monitor 12000
4 Amit Patel East NULL NULL NULL
5 Sara Khan North 104 Headphones 5000
6 David Miller South NULL NULL NULL
7 Emily Brown West NULL NULL NULL
Venn (LEFT JOIN)
All Customers kept.
Orphans in Orders (105,107) ignored.
Missing orders → NULL.
🔹 Step 4 — RIGHT JOIN
SQL
SELECT c.customer_id, c.customer_name, c.region, o.order_id, o.product,
o.amount
FROM Customers c
RIGHT JOIN Orders o
ON c.customer_id = o.customer_id;
Before & After Logic
• Keep all Orders (7 rows).
• Attach customer info if available.
• Orphan orders get NULLs for customer columns.
Result Grid (RIGHT JOIN)
customer_id customer_name region order_id product amount
1 John Doe North 101 Laptop 75000
2 Neha Gupta South 102 Phone 30000
2 Neha Gupta South 103 Tablet 20000
3 Priya Sharma West 106 Monitor 12000
5 Sara Khan North 104 Headphones 5000
NULL NULL NULL 105 Camera 15000
NULL NULL NULL 107 Keyboard 2500
Venn (RIGHT JOIN)
All Orders kept.
Missing customers → NULL.
🔹 Step 5 — FULL OUTER JOIN
SQL (SQL Server / Postgres)
SELECT c.customer_id, c.customer_name, c.region, o.order_id, o.product,
o.amount
FROM Customers c
FULL OUTER JOIN Orders o
ON c.customer_id = o.customer_id;
(MySQL workaround = LEFT JOIN UNION RIGHT JOIN)
Result Grid (FULL OUTER JOIN)
customer_id customer_name region order_id product amount
1 John Doe North 101 Laptop 75000
2 Neha Gupta South 102 Phone 30000
2 Neha Gupta South 103 Tablet 20000
3 Priya Sharma West 106 Monitor 12000
4 Amit Patel East NULL NULL NULL
5 Sara Khan North 104 Headphones 5000
6 David Miller South NULL NULL NULL
7 Emily Brown West NULL NULL NULL
NULL NULL NULL 105 Camera 15000
NULL NULL NULL 107 Keyboard 2500
Venn (FULL OUTER JOIN)
All Customers ∪ All Orders
Both sides fully kept.
🔹 Step 6 — CROSS JOIN (Cartesian)
SQL
SELECT c.customer_name, o.order_id, o.product
FROM Customers c
CROSS JOIN Orders o;
Result Grid (7 × 7 = 49 rows)
(First few shown)
customer_name order_id product
John Doe 101 Laptop
John Doe 102 Phone
John Doe 103 Tablet
… … …
Logic: Every Customer paired with every Order. Dangerous if tables are huge.
Venn (CROSS JOIN)
Full rectangle grid (no matching condition).
🔹 Step 7 — SELF JOIN
Employees (new small table for example):
emp_id emp_name manager_id
1 Raj NULL
2 Anita 1
3 Mohit 2
4 Sonal 1
SQL
SELECT e.emp_name AS employee, m.emp_name AS manager
FROM Employees e
LEFT JOIN Employees m
ON e.manager_id = m.emp_id;
Result Grid
employee manager
Raj NULL
Anita Raj
Mohit Anita
Sonal Raj
Visual (SELF JOIN)
Same table joined to itself.
Shows reporting chain.
📘 Row-by-Row Join Demonstration
Dataset Reminder (Before Joins)
Customers
customer_id customer_name region loyalty_points
1 John Doe North 1200
2 Neha Gupta South 1500
3 Arjun Mehta East 900
4 Priya Sharma West 1800
5 Amit Patel North 1100
6 Sara Khan South 2000
7 Emily Brown West 1700
Orders
order_id customer_id product total_amount
101 1 Laptop 80000
102 2 Smartphone 45000
103 3 Headphones 3000
104 5 Desk 15000
105 8 Printer 12000
106 2 Tablet 25000
107 9 Chair 7000
📌 Notice:
• Customers 4, 6, 7 → have no orders.
• Orders 105, 107 → have no valid customers (customer_id 8, 9 not in Customers).
1⃣ INNER JOIN (Only matches)
Before Join Logic:
• DB checks each customer_id in Customers against Orders.
• Keeps rows only if match found.
After Join (Result):
customer_id customer_name order_id product total_amount
1 John Doe 101 Laptop 80000
2 Neha Gupta 102 Smartphone 45000
3 Arjun Mehta 103 Headphones 3000
5 Amit Patel 104 Desk 15000
2 Neha Gupta 106 Tablet 25000
Row-by-row filtering:
• ✅ Customer 1 → Match found (order 101).
• ✅ Customer 2 → Two matches (orders 102, 106).
• ✅ Customer 3 → Match (103).
• ❌ Customer 4 → No match → dropped.
• ✅ Customer 5 → Match (104).
• ❌ Customers 6 & 7 → dropped.
• ❌ Orders 105 & 107 → dropped.
2⃣ LEFT JOIN (All Customers, Orders if any)
Before Join Logic:
• Keep all customers.
• If order exists → attach.
• If not → NULLs in order columns.
After Join:
customer_id customer_name order_id product total_amount
1 John Doe 101 Laptop 80000
2 Neha Gupta 102 Smartphone 45000
2 Neha Gupta 106 Tablet 25000
3 Arjun Mehta 103 Headphones 3000
4 Priya Sharma NULL NULL NULL
5 Amit Patel 104 Desk 15000
6 Sara Khan NULL NULL NULL
7 Emily Brown NULL NULL NULL
Row-by-row filtering:
• Customer 4 → gets NULL row.
• Customers 6 & 7 → NULL rows.
• Orders 105 & 107 → dropped (since unmatched).
3⃣ RIGHT JOIN (All Orders, Customers if any)
Before Join Logic:
• Keep all orders.
• If customer exists → attach.
• If not → NULLs in customer columns.
After Join:
order_id customer_id customer_name product total_amount
101 1 John Doe Laptop 80000
102 2 Neha Gupta Smartphone 45000
103 3 Arjun Mehta Headphones 3000
104 5 Amit Patel Desk 15000
105 8 NULL Printer 12000
106 2 Neha Gupta Tablet 25000
107 9 NULL Chair 7000
Row-by-row filtering:
• Orders 105 & 107 → keep order, but NULL customer.
• Customers 4, 6, 7 → dropped.
4⃣ FULL OUTER JOIN (All Customers + All Orders)
Before Join Logic:
• Combination of LEFT + RIGHT.
• Everyone included, matched if possible.
After Join:
customer_id customer_name order_id product total_amount
1 John Doe 101 Laptop 80000
2 Neha Gupta 102 Smartphone 45000
2 Neha Gupta 106 Tablet 25000
3 Arjun Mehta 103 Headphones 3000
4 Priya Sharma NULL NULL NULL
5 Amit Patel 104 Desk 15000
6 Sara Khan NULL NULL NULL
7 Emily Brown NULL NULL NULL
NULL NULL 105 Printer 12000
NULL NULL 107 Chair 7000
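On engines without FULL OUTER JOIN, the same 10 rows can be produced by combining a LEFT JOIN with the unmatched orders. The sketch below is one possible emulation, not the only one: it uses UNION ALL plus an anti-join for the orphan orders (a variant of the LEFT JOIN UNION RIGHT JOIN workaround), and it assumes Python's built-in sqlite3 module as a stand-in engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Customers (customer_id INT PRIMARY KEY, customer_name TEXT);
CREATE TABLE Orders (order_id INT PRIMARY KEY, customer_id INT, product TEXT, total_amount INT);
INSERT INTO Customers VALUES (1,'John Doe'),(2,'Neha Gupta'),(3,'Arjun Mehta'),
  (4,'Priya Sharma'),(5,'Amit Patel'),(6,'Sara Khan'),(7,'Emily Brown');
INSERT INTO Orders VALUES (101,1,'Laptop',80000),(102,2,'Smartphone',45000),
  (103,3,'Headphones',3000),(104,5,'Desk',15000),(105,8,'Printer',12000),
  (106,2,'Tablet',25000),(107,9,'Chair',7000);
""")

full_outer = conn.execute("""
-- Part 1: every customer, with orders where they exist (LEFT JOIN)
SELECT c.customer_id, c.customer_name, o.order_id, o.product
FROM Customers c
LEFT JOIN Orders o ON c.customer_id = o.customer_id
UNION ALL
-- Part 2: only the orders that matched no customer (orphans)
SELECT NULL, NULL, o.order_id, o.product
FROM Orders o
LEFT JOIN Customers c ON o.customer_id = c.customer_id
WHERE c.customer_id IS NULL;
""").fetchall()

print(len(full_outer))  # 8 rows from part 1 + 2 orphan orders
```

UNION ALL is safe here because the two parts cannot overlap: part 2 contains only rows that part 1 never produces.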
5⃣ CROSS JOIN (Cartesian Product)
Before Join Logic:
• Every row in Customers × every row in Orders.
• 7 × 7 = 49 rows.
After Join (sample only):
customer_id customer_name order_id product
1 John Doe 101 Laptop
1 John Doe 102 Smartphone
1 John Doe 103 Headphones
… … … …
7 Emily Brown 107 Chair
📌 Used rarely except when you intentionally need combinations (e.g., campaigns × regions).
6⃣ ANTI-JOIN (Only Customers without Orders)
Before Join Logic:
• LEFT JOIN + filter NULLs.
After Join:
customer_id customer_name
4 Priya Sharma
6 Sara Khan
7 Emily Brown
7⃣ SEMI-JOIN (Customers who have ≥1 Order)
Before Join Logic:
• EXISTS subquery or IN.
• Only customer rows kept, not repeated for each order.
After Join:
customer_id customer_name
1 John Doe
2 Neha Gupta
3 Arjun Mehta
5 Amit Patel
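The anti-join and semi-join results above can be reproduced with the two patterns named in the text (LEFT JOIN + IS NULL, and EXISTS). This is a minimal sketch that assumes Python's built-in sqlite3 module as a stand-in engine; the data is the Customers / Orders dataset from this demonstration, reduced to the columns the queries need.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Customers (customer_id INT PRIMARY KEY, customer_name TEXT);
CREATE TABLE Orders (order_id INT PRIMARY KEY, customer_id INT);
INSERT INTO Customers VALUES (1,'John Doe'),(2,'Neha Gupta'),(3,'Arjun Mehta'),
  (4,'Priya Sharma'),(5,'Amit Patel'),(6,'Sara Khan'),(7,'Emily Brown');
INSERT INTO Orders VALUES (101,1),(102,2),(103,3),(104,5),(105,8),(106,2),(107,9);
""")

# Anti-join: customers with no order at all.
anti = conn.execute("""
SELECT c.customer_id, c.customer_name
FROM Customers c
LEFT JOIN Orders o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL
ORDER BY c.customer_id;
""").fetchall()

# Semi-join: customers with at least one order, each listed once.
semi = conn.execute("""
SELECT c.customer_id, c.customer_name
FROM Customers c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.customer_id)
ORDER BY c.customer_id;
""").fetchall()

print(anti)
print(semi)  # Neha Gupta appears once even though she has two orders
```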
Master Table
Customer_ID | Customer_Name | Order_ID | Product | INNER JOIN | LEFT JOIN | RIGHT JOIN | FULL OUTER | ANTI-JOIN | SEMI-JOIN
1 | John Doe | 101 | Laptop | ✅ | ✅ | ✅ | ✅ | ❌ | ✅
2 | Neha Gupta | 102 | Smartphone | ✅ | ✅ | ✅ | ✅ | ❌ | ✅
2 | Neha Gupta | 106 | Tablet | ✅ | ✅ | ✅ | ✅ | ❌ | ✅
3 | Arjun Mehta | 103 | Headphones | ✅ | ✅ | ✅ | ✅ | ❌ | ✅
4 | Priya Sharma | NULL | NULL | ❌ | ✅ (NULL) | ❌ | ✅ (NULL) | ✅ | ❌
5 | Amit Patel | 104 | Desk | ✅ | ✅ | ✅ | ✅ | ❌ | ✅
6 | Sara Khan | NULL | NULL | ❌ | ✅ (NULL) | ❌ | ✅ (NULL) | ✅ | ❌
7 | Emily Brown | NULL | NULL | ❌ | ✅ (NULL) | ❌ | ✅ (NULL) | ✅ | ❌
NULL | NULL | 105 | Printer | ❌ | ❌ | ✅ (NULL) | ✅ (NULL) | ❌ | ❌
NULL | NULL | 107 | Chair | ❌ | ❌ | ✅ (NULL) | ✅ (NULL) | ❌ | ❌
Explanation by Join Type
• INNER JOIN: Only rows with matches (C1-O101, C2-O102/O106, C3-O103, C5-O104).
• LEFT JOIN: All customers retained → unmatched customers (C4, C6, C7) padded with NULL.
• RIGHT JOIN: All orders retained → orphan orders (O105, O107) padded with NULL customers.
• FULL OUTER JOIN: Union of LEFT + RIGHT → keeps all customers + all orders, filling NULLs.
• ANTI-JOIN: Only customers with no matching orders (C4, C6, C7).
• SEMI-JOIN: Only customers with at least one order (C1, C2, C3, C5).
📊 Visual Summary Grid (Conceptual)
Think of Customers as rows on the left, Orders as rows on the right.
• INNER JOIN → overlap only.
• LEFT JOIN → all left + overlap, pad right with NULLs.
• RIGHT JOIN → all right + overlap, pad left with NULLs.
• FULL OUTER → all rows from both sides, pad missing with NULLs.
• ANTI → left-only rows with no match.
• SEMI → left-only rows with ≥1 match, no repetition for multiple orders.
Practice Set
Q-Set (MCQs)
MCQs
1. Which join returns only rows present in both tables?
a) LEFT JOIN b) RIGHT JOIN c) INNER JOIN d) FULL OUTER JOIN
2. What happens when you forget the ON clause in a JOIN?
a) Query fails b) Cartesian product c) Optimizer fixes it d) Only left rows kept
3. Which algorithm builds a hash on the smaller input first?
a) Nested loops b) Merge join c) Hash join d) Index scan
4. SELECT * in a join may cause which of the following issues? (choose best)
a) Larger data transfer b) Ambiguous column names c) Possible loss of index-only
scan d) All of the above
5. To show customers with no orders, which construct is typical?
a) INNER JOIN b) RIGHT JOIN c) LEFT JOIN with WHERE o.order_id IS NULL
d) CROSS JOIN
Tasks
1. Using the demo tables, list customers who have never ordered (SQL + expected
result).
2. Show orders that have no matching product (orphan product_id).
3. For employee table, show employee + manager name using self join.
4. Show all customers with order counts (0 if none) — use LEFT JOIN + GROUP BY.
5. Demonstrate an accidental Cartesian join by mistake and then fix it.
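📘 Summary Table of SQL Joins
INNER JOIN
• Concept / Logic: Only rows with matching keys in both tables (intersection).
• SQL Syntax (Generic): FROM A INNER JOIN B ON A.key = B.key
• Output (on our dataset): Customers with valid orders only (1,2,3,5) → 5 rows.
• Output Size Expectation: One row per matching pair. For a one-to-many key join, ≤ rows(B) (the "many" side); with duplicate keys on both sides it can exceed either table.
• Real-World Analogy / Industry Example: Retail: show only customers who actually bought something.
LEFT JOIN
• Concept / Logic: All rows from left table (A) + matching rows from right. Unmatched → NULL.
• SQL Syntax (Generic): FROM A LEFT JOIN B ON A.key = B.key
• Output (on our dataset): All 7 customers; unmatched (4,6,7) → NULL orders. 8 rows in total, because customer 2 has two orders.
• Output Size Expectation: ≥ rows(A) (more when a left row has multiple matches).
• Real-World Analogy / Industry Example: CRM: list all customers, even if they never placed orders.
RIGHT JOIN
• Concept / Logic: All rows from right table (B) + matching rows from left. Unmatched → NULL.
• SQL Syntax (Generic): FROM A RIGHT JOIN B ON A.key = B.key
• Output (on our dataset): All 7 orders; orphan orders (105,107) get NULL customer info.
• Output Size Expectation: ≥ rows(B) (more when a right row has multiple matches).
• Real-World Analogy / Industry Example: Logistics: show all orders, even if customer details are missing.
FULL OUTER JOIN
• Concept / Logic: Union of LEFT + RIGHT → all rows from both tables.
• SQL Syntax (Generic): FROM A FULL OUTER JOIN B ON A.key = B.key
• Output (on our dataset): 7 customers + 7 orders → 10 rows (5 matched pairs, 3 customers without orders, 2 orphan orders).
• Output Size Expectation: ≤ rows(A) + rows(B) for a one-to-many key join.
• Real-World Analogy / Industry Example: Banking: combine accounts + transactions (even mismatched records).
CROSS JOIN
• Concept / Logic: Cartesian product: every row in A × every row in B.
• SQL Syntax (Generic): FROM A CROSS JOIN B
• Output (on our dataset): 7 × 7 = 49 rows (huge expansion).
• Output Size Expectation: rows(A) × rows(B).
• Real-World Analogy / Industry Example: Marketing: create all combinations of campaigns × products.
SELF JOIN
• Concept / Logic: Join a table with itself (alias needed).
• SQL Syntax (Generic): FROM Employees e JOIN Employees m ON e.manager_id = m.emp_id
• Output (on our dataset): Employee ↔ Manager mapping (Raj manages Anita & Sonal; Anita manages Mohit).
• Output Size Expectation: One row per employee/manager pair (with LEFT JOIN, one row per employee).
• Real-World Analogy / Industry Example: HR: org chart or reporting structure.
ANTI-JOIN
• Concept / Logic: Rows in A without matches in B.
• SQL Syntax (Generic): FROM A LEFT JOIN B ON A.key = B.key WHERE B.key IS NULL, or WHERE NOT EXISTS (…)
• Output (on our dataset): Customers without orders (4,6,7).
• Output Size Expectation: ≤ rows(A).
• Real-World Analogy / Industry Example: Fraud detection: inactive accounts with no transactions.
SEMI-JOIN
• Concept / Logic: Rows in A where at least one match exists in B (like EXISTS or IN).
• SQL Syntax (Generic): FROM A WHERE EXISTS (SELECT 1 FROM B WHERE B.key = A.key)
• Output (on our dataset): Customers who placed ≥1 order (1,2,3,5).
• Output Size Expectation: ≤ rows(A).
• Real-World Analogy / Industry Example: Retail: find loyal customers (any purchase).
CARTESIAN JOIN
• Concept / Logic: Special case of CROSS JOIN: a join written without any join condition (often accidental, bad practice).
• SQL Syntax (Generic): FROM A, B (with no join condition)
• Output (on our dataset): Explodes → 49 rows even if you didn’t want it.
• Output Size Expectation: rows(A) × rows(B).
• Real-World Analogy / Industry Example: Common bug: forgot join condition → system crash on large data.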
🔹 Quick Visuals (Venn-Style Overlaps)
• INNER JOIN = Overlap only.
• LEFT JOIN = All left + overlap.
• RIGHT JOIN = All right + overlap.
• FULL OUTER = All left + all right.
• CROSS JOIN = Every possible combination (rectangle).
• SELF JOIN = Matching rows within same set.
• ANTI-JOIN = Left side minus overlap.
• SEMI-JOIN = Left side if overlap exists (ignore duplicates).
🔹 DBMS Syntax Variations
Join Type | SQL Server | PostgreSQL | MySQL 8+ | Notes
INNER JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Standard.
LEFT JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Synonym: LEFT OUTER JOIN.
RIGHT JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Same syntax across all.
FULL OUTER JOIN | ✅ Yes | ✅ Yes | ❌ No | MySQL workaround = LEFT JOIN UNION RIGHT JOIN.
CROSS JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Or use FROM A, B (but risky).
SELF JOIN | ✅ Yes | ✅ Yes | ✅ Yes | Just alias table.
ANTI-JOIN | Use NOT EXISTS or LEFT JOIN IS NULL | Same | Same | No direct keyword in any SQL.
SEMI-JOIN | EXISTS or IN | Same | Same | No direct keyword.
Answers (A-set)
MCQs:
1 → c (INNER JOIN)
2 → b (Cartesian product)
3 → c (Hash join)
4 → d (All of the above)
5 → c (LEFT JOIN ... WHERE o.order_id IS NULL)
Tasks — SQL + Output
Task 1 — Customers who have never ordered:
SELECT c.customer_id, c.customer_name
FROM customers_demo c
LEFT JOIN orders_demo o ON c.customer_id = o.customer_id
WHERE o.order_id IS NULL;
Output:
customer_id customer_name
9 Zara Khan
Task 2 — Orders with no matching product:
SELECT o.order_id, o.product_id, o.total_amount
FROM orders_demo o
LEFT JOIN products_demo p ON o.product_id = p.product_id
WHERE p.product_id IS NULL;
Output:
order_id product_id total_amount
206 999 100.00
Task 3 — Employee + manager:
(Using employees_demo created earlier.)
SELECT e.emp_id, e.emp_name, m.emp_id AS manager_id, m.emp_name AS
manager_name
FROM employees_demo e
LEFT JOIN employees_demo m
ON e.manager_emp_id = m.emp_id
ORDER BY e.emp_id;
Output:
emp_id emp_name manager_id manager_name
301 Raj Verma NULL NULL
302 Anita Desai 301 Raj Verma
303 Mohit Kumar 301 Raj Verma
304 Rohit Nair 302 Anita Desai
305 Sonal Gupta 302 Anita Desai
Task 4 — All customers with order counts (0 if none):
SELECT c.customer_id, c.customer_name, COUNT(o.order_id) AS order_count
FROM customers_demo c
LEFT JOIN orders_demo o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
ORDER BY order_count DESC;
Output:
customer_id customer_name order_count
1 John Doe 2
2 Neha Gupta 2
3 Priya Sharma 1
4 Amit Patel 1
5 Sara Khan 1
6 Ravi Iyer 1
7 Emily Brown 1
8 Chris Martin 1
9 Zara Khan 0
Task 5 — accidental Cartesian fix:
❌ Wrong (accidental Cartesian)
SELECT c.customer_id, o.order_id
FROM customers_demo c, orders_demo o;
-- This returns 9 * 11 = 99 rows
✔ Fix: Add explicit JOIN with ON
SELECT c.customer_id, o.order_id
FROM customers_demo c
JOIN orders_demo o ON c.customer_id = o.customer_id;
-- Correct: only matched rows
8.3 — Advanced Join Topics
Reference Dataset —
We’ll use this dataset for all examples in this module. Run these in your SQL IDE (Postgres /
MySQL / SQL Server). Use a schema like sportsdb if you wish.
CREATE and INSERT (standard SQL)
-- Players table (master roster)
CREATE TABLE Players (
player_id INT PRIMARY KEY,
player_name VARCHAR(50),
team VARCHAR(30),
position VARCHAR(20),
age INT
);
INSERT INTO Players (player_id, player_name, team, position, age) VALUES
(1, 'Alex Morgan', 'City FC', 'Forward', 29),
(2, 'Ben Carter', 'United FC', 'Midfield', 25),
(3, 'Chris Shaw', 'City FC', 'Defender', 30),
(4, 'David Lee', 'Rovers', 'Goalkeeper',28),
(5, 'Evan Price', 'United FC', 'Forward', 22),
(6, 'Frank Turner', 'Rovers', 'Midfield', 27),
(7, 'George Hill', 'City FC', 'Forward', 21),
(8, 'Hector Alvarez', 'Rovers', 'Defender', 26),
(9, 'Ivan Petrov', 'Wanderers', 'Midfield', 24),
(10,'Jack O''Neill', 'Wanderers', 'Forward', 31);
-- Appearances table: each row = one player appearance in a match
CREATE TABLE Appearances (
appearance_id INT PRIMARY KEY,
match_id INT,
player_id INT,
minutes_played INT,
goals INT
);
INSERT INTO Appearances (appearance_id, match_id, player_id,
minutes_played, goals) VALUES
(1001, 501, 1, 90, 1),
(1002, 501, 3, 90, 0),
(1003, 502, 2, 75, 0),
(1004, 502, 5, 60, 1),
(1005, 503, 8, 90, 0),
(1006, 504, 11, 45, 0), -- orphan: player_id 11 not in Players
(1007, 505, 2, 90, 2),
(1008, 506, 9, 30, 0);
Notes about dataset:
• Players: 10 players in roster (IDs 1..10).
• Appearances: 8 rows; includes:
o valid appearances referencing players in Players (e.g., player_id 1,2,3,...),
o one orphan (appearance 1006 references player_id = 11 which is NOT in
Players), useful for illustrating orphan detection.
• Players 4, 6, 7 and 10 have no appearances (useful for the anti-join).
Anti-Join: Players who never played
Concept & Meaning: Anti-join returns rows from the left table that do not have a matching
row in the right table. In sports: find squad players who never made an appearance.
Two canonical implementations:
A. LEFT JOIN ... WHERE right IS NULL
SELECT p.player_id, p.player_name, p.team
FROM Players p
LEFT JOIN Appearances a ON p.player_id = a.player_id
WHERE a.appearance_id IS NULL
ORDER BY p.player_id;
Result grid (what the student will see):
player_id player_name team
4 David Lee Rovers
6 Frank Turner Rovers
7 George Hill City FC
10 Jack O'Neill Wanderers
Explanation: players 4,6,7,10 have no matching rows in Appearances.
B. NOT EXISTS (often preferred for readability/performance)
SELECT p.player_id, p.player_name, p.team
FROM Players p
WHERE NOT EXISTS (
SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id
)
ORDER BY p.player_id;
Same result.
Row-by-row thought process (example for player 4):
• Check Appearances where player_id = 4 → none found → include player 4.
Performance note: For large datasets, NOT EXISTS with proper indexes is usually efficient.
Avoid NOT IN if the subquery can return NULL values.
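The NOT IN warning is easiest to see with a NULL in the subquery. The sketch below assumes Python's built-in sqlite3 module as a stand-in engine and uses a cut-down version of the tables; appearance 1999 with a NULL player_id is a hypothetical row added only for this illustration and is not part of the course dataset.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Players (player_id INT PRIMARY KEY, player_name TEXT);
CREATE TABLE Appearances (appearance_id INT PRIMARY KEY, player_id INT);
INSERT INTO Players VALUES (1,'Alex Morgan'),(4,'David Lee'),(6,'Frank Turner');
-- 1999 is a hypothetical appearance whose player_id was never recorded
INSERT INTO Appearances VALUES (1001,1),(1999,NULL);
""")

# NOT IN: "4 NOT IN (1, NULL)" evaluates to UNKNOWN, so the row is filtered out.
not_in = conn.execute("""
SELECT player_id FROM Players
WHERE player_id NOT IN (SELECT player_id FROM Appearances)
ORDER BY player_id;
""").fetchall()

# NOT EXISTS: only asks whether a matching row exists, so NULLs do no harm.
not_exists = conn.execute("""
SELECT p.player_id FROM Players p
WHERE NOT EXISTS (SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id)
ORDER BY p.player_id;
""").fetchall()

print(not_in)      # empty list: the single NULL wipes out every candidate
print(not_exists)  # players 4 and 6, as intended
```

If NOT IN must be used, adding WHERE player_id IS NOT NULL inside the subquery restores the expected result.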
Semi-Join: Players who played at least once (no duplication)
Concept & Meaning: Semi-join returns rows from the left table that have at least one
matching row in the right table — but does not duplicate the left row for each match. Use
when you want just the list of participating players.
Using EXISTS (recommended)
SELECT p.player_id, p.player_name, p.team
FROM Players p
WHERE EXISTS (
SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id
)
ORDER BY p.player_id;
Result:
player_id player_name team
1 Alex Morgan City FC
2 Ben Carter United FC
3 Chris Shaw City FC
5 Evan Price United FC
8 Hector Alvarez Rovers
9 Ivan Petrov Wanderers
Using IN (works but watch NULLs)
SELECT p.player_id, p.player_name
FROM Players p
WHERE p.player_id IN (SELECT player_id FROM Appearances)
ORDER BY p.player_id;
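Same result on this dataset. A positive IN returns the same rows as EXISTS even if the
subquery contains NULLs; the real trap is NOT IN, which returns no rows at all once the
subquery yields a NULL. Using EXISTS / NOT EXISTS consistently avoids the problem.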
SQL-level logic / building thinking:
• Semi-join is a filter: “Is there at least one row in Appearances for this player?” Yes →
include; No → exclude.
• Efficient solutions (EXISTS) allow the DB to stop scanning once it finds a match.
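The "filter, not a join" idea can be seen by comparing a plain INNER JOIN with EXISTS on the module's data. This is a minimal sketch assuming Python's built-in sqlite3 module as a stand-in engine; only the columns needed for the comparison are created.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Players (player_id INT PRIMARY KEY, player_name TEXT);
CREATE TABLE Appearances (appearance_id INT PRIMARY KEY, match_id INT, player_id INT);
INSERT INTO Players VALUES (1,'Alex Morgan'),(2,'Ben Carter'),(3,'Chris Shaw'),
  (4,'David Lee'),(5,'Evan Price'),(6,'Frank Turner'),(7,'George Hill'),
  (8,'Hector Alvarez'),(9,'Ivan Petrov'),(10,'Jack O''Neill');
INSERT INTO Appearances VALUES (1001,501,1),(1002,501,3),(1003,502,2),(1004,502,5),
  (1005,503,8),(1006,504,11),(1007,505,2),(1008,506,9);
""")

# INNER JOIN: one output row per matching appearance.
joined_ids = [r[0] for r in conn.execute("""
SELECT p.player_id
FROM Players p
JOIN Appearances a ON a.player_id = p.player_id
ORDER BY p.player_id;
""")]

# Semi-join via EXISTS: one output row per player, however many matches exist.
semi_ids = [r[0] for r in conn.execute("""
SELECT p.player_id
FROM Players p
WHERE EXISTS (SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id)
ORDER BY p.player_id;
""")]

print(joined_ids)  # player 2 appears twice (appearances 1003 and 1007)
print(semi_ids)    # player 2 appears once
```

Adding DISTINCT to the join would give the same list, but EXISTS states the intent directly.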
Detect orphan appearances (appearances with no matching player)
Concept: Identify data quality issues — appearances that reference players not in Players.
This is the mirror of anti-join.
SELECT a.appearance_id, a.match_id, a.player_id, a.minutes_played, a.goals
FROM Appearances a
LEFT JOIN Players p ON a.player_id = p.player_id
WHERE p.player_id IS NULL;
Result:
appearance_id match_id player_id minutes_played goals
1006 504 11 45 0
Fixing approach: Investigate player_id 11 — either INSERT into Players or correct the
appearance record.
Cartesian join (cross join) demonstration / caution
Concept & Meaning: Cartesian product pairs every row in left with every row in right.
Usually accidental (missing ON clause) and catastrophic on large tables.
Demonstration (intentional, but dangerous):
SELECT p.player_name, a.appearance_id, a.match_id
FROM Players p
CROSS JOIN Appearances a
LIMIT 20; -- preview only (SQL Server: use SELECT TOP (20) instead of LIMIT)
Result (first 10 rows shown):
player_name appearance_id match_id
Alex Morgan 1001 501
Alex Morgan 1002 501
Alex Morgan 1003 502
Alex Morgan 1004 502
Alex Morgan 1005 503
Alex Morgan 1006 504
Alex Morgan 1007 505
Alex Morgan 1008 506
Ben Carter 1001 501
Ben Carter 1002 501
Result size: 10 players × 8 appearances = 80 rows (full). That's small for demonstration, but
imagine 10k × 100k → 1 billion rows.
Accidental Cartesian join example (common mistake):
-- OOPS: missing ON clause
SELECT p.player_name, a.match_id
FROM Players p, Appearances a
WHERE p.team = 'City FC';
The WHERE clause filters only Players. With no join condition between the two tables, every
City FC player is paired with every appearance row, which is wrong.
Best practice: Always include explicit ON join conditions; prefer explicit JOIN syntax.
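The size of the mistake can be measured by counting rows for the accidental query and for its corrected form. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine and only the columns needed from the module's dataset:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Players (player_id INT PRIMARY KEY, team TEXT);
CREATE TABLE Appearances (appearance_id INT PRIMARY KEY, match_id INT, player_id INT);
INSERT INTO Players VALUES (1,'City FC'),(2,'United FC'),(3,'City FC'),(4,'Rovers'),
  (5,'United FC'),(6,'Rovers'),(7,'City FC'),(8,'Rovers'),(9,'Wanderers'),(10,'Wanderers');
INSERT INTO Appearances VALUES (1001,501,1),(1002,501,3),(1003,502,2),(1004,502,5),
  (1005,503,8),(1006,504,11),(1007,505,2),(1008,506,9);
""")

def count(sql):
    """Run a COUNT(*) query and return the single number."""
    return conn.execute(sql).fetchone()[0]

full_cross = count("SELECT COUNT(*) FROM Players p CROSS JOIN Appearances a")

# Accidental: comma join with a filter but no join condition.
accidental = count("""
SELECT COUNT(*) FROM Players p, Appearances a
WHERE p.team = 'City FC'
""")

# Fixed: explicit JOIN ... ON ties each appearance to its own player.
fixed = count("""
SELECT COUNT(*) FROM Players p
JOIN Appearances a ON a.player_id = p.player_id
WHERE p.team = 'City FC'
""")

print(full_cross, accidental, fixed)  # 80, then 3 players x 8 rows = 24, then 2
```

Comparing a query's row count against the expected cardinality is a cheap first check for a missing join condition.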
Performance considerations & optimization tips
• Indexing: Ensure Appearances.player_id is indexed (and Players.player_id
primary key). This speeds up EXISTS/NOT EXISTS and JOINs.
• NOT EXISTS vs LEFT JOIN IS NULL: Both can be optimized similarly by
modern DBs. NOT EXISTS is often less error-prone (NULL handling).
• IN vs EXISTS: For large subquery results, EXISTS typically better. IN may be OK
when the subquery result is small and non-nullable.
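• Watch correlated subqueries: a correlated subquery may execute once per outer row
unless the optimizer rewrites it; EXISTS / NOT EXISTS are commonly rewritten into
semi/anti-joins, so check the plan rather than assume.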
• EXPLAIN / Query Plan: Always check the plan — look for seq scans vs index
lookups.
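The indexing and plan-checking tips can be tried end to end with a small sketch. It assumes Python's built-in sqlite3 module as a stand-in engine, so the plan syntax (EXPLAIN QUERY PLAN) and its output are SQLite-specific; Postgres and MySQL use EXPLAIN, and SQL Server exposes execution plans through its own tooling. The index name idx_appearances_player is a hypothetical name chosen for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Players (player_id INT PRIMARY KEY, player_name TEXT);
CREATE TABLE Appearances (appearance_id INT PRIMARY KEY, player_id INT);
-- Hypothetical index name; it covers the column used in the join/EXISTS predicate
CREATE INDEX idx_appearances_player ON Appearances(player_id);
""")

plan = conn.execute("""
EXPLAIN QUERY PLAN
SELECT p.player_id FROM Players p
WHERE NOT EXISTS (SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id);
""").fetchall()

# The last column of each plan row is the human-readable step description.
detail_lines = [row[-1] for row in plan]
for line in detail_lines:
    print(line)  # look for SEARCH ... USING INDEX versus a full SCAN
```

The exact wording of the plan varies by engine and version, so read it for the access method (index lookup versus full scan) rather than for specific text.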
Common mistakes & gotchas
• Using NOT IN when the subquery can return NULL (unexpected empty result).
• Missing ON clause → accidental Cartesian join — results blow up.
• Expecting IN to be faster than EXISTS blindly — depends on data distribution and
engine.
• Forgetting to deduplicate when you want distinct players (use DISTINCT or semi-join
semantics).
Q-Set (Questions)
MCQs (Basic–Intermediate)
1. Which SQL pattern returns rows from left table that have no match on right?
a) EXISTS b) LEFT JOIN IS NULL c) INNER JOIN d) CROSS JOIN
2. Which is safer for NULLs: EXISTS or IN?
3. What happens if you forget the ON clause?
a) Error b) Cartesian join c) No rows d) NULLs only
4. Which statement finds orphan records in Appearances?
a) NOT EXISTS b) LEFT JOIN ... IS NULL c) INNER JOIN d) a & b
5. Which approach will not produce duplicate left rows when there are multiple matches
in right?
a) LEFT JOIN b) SEMI-JOIN (EXISTS) c) CROSS JOIN d) INNER JOIN
Tasks
1. List players who have never played for their current team (i.e., appear in Appearances
but with different team) — hint: join Appearances→Players check team mismatch.
2. List players who have at least one match with goals > 0 (semi-join).
3. Find players who played only in matches where they played < 30 minutes (i.e., all
their appearances < 30) — use aggregation + anti/semi logic.
4. Find players present in Players but whose appearances reference an unknown team
(simulate by left joining to Teams table if provided) — data quality check.
5. Convert a correlated anti-join into an equivalent NOT EXISTS or LEFT JOIN and
compare EXPLAIN plans.
Concept Summary — Advanced Joins
Concept | SQL Pattern | Use case (football) | Key tip
Anti-join | LEFT JOIN ... WHERE right IS NULL or NOT EXISTS | Players with zero appearances | Use NOT EXISTS with index on Appearances.player_id
Semi-join | EXISTS or IN | Players who played at least once | Use EXISTS to avoid NULL pitfalls
Orphan detection | LEFT JOIN WHERE right IS NULL (but roles swapped) | Appearances pointing to unknown players | Fix master data or delete orphan records
Cartesian | CROSS JOIN or missing ON | Campaign combinations (rare) or accidental bugs | Check for missing ON clauses to avoid explosion
Cross-DB differences
Feature | Postgres | SQL Server | MySQL 8+ | Note
LEFT JOIN IS NULL | ✅ | ✅ | ✅ | Same across DBs
NOT EXISTS | ✅ | ✅ | ✅ | Same
EXISTS | ✅ | ✅ | ✅ | Same
CROSS JOIN | ✅ | ✅ | ✅ | Same
Correlated subquery behavior | optimizer-dependent | optimizer-dependent | optimizer-dependent | Check EXPLAIN for each DB
A-Set (Answers + SQL + outputs)
MCQ answers
1 → b (LEFT JOIN IS NULL or NOT EXISTS)
2 → EXISTS is safer (so answer: EXISTS)
3 → b (Cartesian join)
4 → d (both NOT EXISTS and LEFT JOIN ... IS NULL can find orphan rows)
5 → b (SEMI-JOIN / EXISTS — since it returns one left row even if many right matches)
Task 1 — players who have never played for their current team
Explanation: We want players whose appearances (if any) show a team different from their
current [Link]. This requires joining Appearances→Players by player_id and
comparing team info stored in a hypothetical [Link] (if we had it). Since our
Appearances has no team field, we simulate: assume match_id implies team? To keep it
realistic, let's show a different advanced task which uses available data:
Alternate Task 1 (practical): Players who have played but only as substitutes (at most 30
minutes in every appearance).
SQL:
SELECT p.player_id, p.player_name
FROM Players p
WHERE EXISTS (SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id)
AND NOT EXISTS (
SELECT 1 FROM Appearances a2 WHERE a2.player_id = p.player_id AND
a2.minutes_played > 30
)
ORDER BY p.player_id;
Output:
player_id player_name
9 Ivan Petrov
(Explanation: player 9's only appearance, 1008, lasted exactly 30 minutes, so the threshold
has to be inclusive. With a strict limit, i.e. a2.minutes_played >= 30 inside the NOT EXISTS,
the result is empty on this dataset.)
Task 2 — players with at least one match with goals > 0
SQL:
SELECT p.player_id, p.player_name
FROM Players p
WHERE EXISTS (
SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id AND a.goals > 0
)
ORDER BY p.player_id;
Output:
player_id player_name
1 Alex Morgan
2 Ben Carter
5 Evan Price
(Players 1,2,5 scored goals in appearances.)
Task 3 — players who played only in matches where they played < 30 minutes
(all appearances < 30)
SQL:
SELECT p.player_id, p.player_name
FROM Players p
JOIN Appearances a ON a.player_id = p.player_id
GROUP BY p.player_id, p.player_name
HAVING MAX(a.minutes_played) < 30;
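Output: no rows on this dataset. Ivan Petrov (player 9) is the closest case, but his only
appearance (1008) lasted exactly 30 minutes, which fails the strict < 30 test. Changing the
condition to HAVING MAX(a.minutes_played) <= 30 returns:
player_id player_name
9 Ivan Petrov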
Task 4 — data quality / orphan appearances (we already showed):
appearance_id 1006 with player_id 11.
Task 5 — correlated anti-join vs NOT EXISTS
Correlated NOT EXISTS variant:
SELECT p.player_id, p.player_name
FROM Players p
WHERE NOT EXISTS (
SELECT 1 FROM Appearances a WHERE a.player_id = p.player_id
);
LEFT JOIN variant:
SELECT p.player_id, p.player_name
FROM Players p
LEFT JOIN Appearances a ON p.player_id = a.player_id
WHERE a.appearance_id IS NULL;
Run EXPLAIN for both — modern optimizers will often produce similar plans if indexes exist.
8.4 — Set Operations in SQL
For set operations we use a different dataset: seasonal squad lists (Squad_2023 and
Squad_2024). The examples show UNION, UNION ALL, INTERSECT, EXCEPT / MINUS across seasons
(players retained, new signings, departures).
8.4.0 Introduction
What this module covers: UNION / UNION ALL, INTERSECT, EXCEPT (or MINUS)
semantics and performance; real world sports use cases (player movement across seasons),
cross-DB differences (EXCEPT vs MINUS), best practices.
Why it matters: Analysts often need to compare season-to-season squads, dedupe combined
lists, find new joiners, find players who left, or shared players between competitions.
Prerequisites: Basic SELECT, sorting, simple joins helpful.
Dataset — Season squads
CREATE and INSERT
-- Season2023 squads
CREATE TABLE Squad_2023 (
player_name VARCHAR(60) PRIMARY KEY,
team VARCHAR(40)
);
INSERT INTO Squad_2023 (player_name, team) VALUES
('Alex Morgan','City FC'),
('Ben Carter','United FC'),
('Chris Shaw','City FC'),
('David Lee','Rovers'),
('Evan Price','United FC'),
('Frank Turner','Rovers'),
('George Hill','City FC'),
('Helen Stone','City FC');
-- Season2024 squads (some players retained, some new, some departed)
CREATE TABLE Squad_2024 (
player_name VARCHAR(60) PRIMARY KEY,
team VARCHAR(40)
);
INSERT INTO Squad_2024 (player_name, team) VALUES
('Alex Morgan','City FC'),
('Ben Carter','United FC'),
('Ivan Petrov','Wanderers'),
('Jack O''Neill','Wanderers'),
('Evan Price','United FC'),
('Liam King','Rovers'),
('George Hill','City FC'),
('Maya Cruz','United FC');
Notes:
• Common (retained): Alex Morgan, Ben Carter, Evan Price, George Hill.
• Departed from 2023 (not in 2024): Chris Shaw, David Lee, Frank Turner, Helen
Stone.
• New in 2024: Ivan Petrov, Jack O'Neill, Liam King, Maya Cruz.
UNION vs UNION ALL
Objective: Create combined squad list across both seasons.
UNION (deduplicated)
SELECT player_name, team, '2023' AS season FROM Squad_2023
UNION
SELECT player_name, team, '2024' AS season FROM Squad_2024
ORDER BY player_name;
Result: (UNION removes only fully identical rows; here the season tag makes every row
distinct, so all 16 rows remain)
player_name team season
Alex Morgan City FC 2023
Alex Morgan City FC 2024
Ben Carter United FC 2023
Ben Carter United FC 2024
Chris Shaw City FC 2023
David Lee Rovers 2023
Evan Price United FC 2023
Evan Price United FC 2024
Frank Turner Rovers 2023
George Hill City FC 2023
George Hill City FC 2024
Helen Stone City FC 2023
Ivan Petrov Wanderers 2024
Jack O'Neill Wanderers 2024
Liam King Rovers 2024
Maya Cruz United FC 2024
Note: In this query UNION deduplicates identical rows. Because we included season column,
rows differ (2023 vs 2024) so all appear. If you instead want a deduped player list ignoring
season, use UNION without season tag (or UNION ALL with DISTINCT later).
UNION ALL (keep duplicates)
SELECT player_name, team FROM Squad_2023
UNION ALL
SELECT player_name, team FROM Squad_2024
ORDER BY player_name;
Result: Same columns but duplicates for retained players will appear twice (once per
season). UNION ALL is faster (no dedup cost).
When to use which:
• UNION ALL if you want to keep duplicates / preserve source rows or performance
matters.
• UNION if you need deduplication.
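The difference between the two operators shows up directly in the row counts. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine and the Squad_2023 / Squad_2024 data from this module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Squad_2023 (player_name TEXT PRIMARY KEY, team TEXT);
CREATE TABLE Squad_2024 (player_name TEXT PRIMARY KEY, team TEXT);
INSERT INTO Squad_2023 VALUES ('Alex Morgan','City FC'),('Ben Carter','United FC'),
  ('Chris Shaw','City FC'),('David Lee','Rovers'),('Evan Price','United FC'),
  ('Frank Turner','Rovers'),('George Hill','City FC'),('Helen Stone','City FC');
INSERT INTO Squad_2024 VALUES ('Alex Morgan','City FC'),('Ben Carter','United FC'),
  ('Ivan Petrov','Wanderers'),('Jack O''Neill','Wanderers'),('Evan Price','United FC'),
  ('Liam King','Rovers'),('George Hill','City FC'),('Maya Cruz','United FC');
""")

def count_rows(set_operator):
    """Count rows produced by combining both squads with the given operator."""
    sql = f"""
    SELECT COUNT(*) FROM (
      SELECT player_name, team FROM Squad_2023
      {set_operator}
      SELECT player_name, team FROM Squad_2024
    ) t
    """
    return conn.execute(sql).fetchone()[0]

union_all_count = count_rows("UNION ALL")  # every source row is kept
union_count = count_rows("UNION")          # identical (name, team) rows collapse

print(union_all_count, union_count)  # 16 and 12: four retained players collapse
```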
INTERSECT — players present in both seasons (retained
players)
Postgres / SQL Server:
SELECT player_name FROM Squad_2023
INTERSECT
SELECT player_name FROM Squad_2024
ORDER BY player_name;
Result:
player_name
Alex Morgan
Ben Carter
Evan Price
George Hill
MySQL note: MySQL supports INTERSECT only from version 8.0.31 onward. On older versions,
emulate it via INNER JOIN or IN:
SELECT s.player_name
FROM Squad_2023 s
WHERE s.player_name IN (SELECT player_name FROM Squad_2024);
EXCEPT / MINUS — players from 2023 who are NOT in
2024 (departures)
Postgres / SQL Server:
SELECT player_name FROM Squad_2023
EXCEPT
SELECT player_name FROM Squad_2024
ORDER BY player_name;
Result:
player_name
Chris Shaw
David Lee
Frank Turner
Helen Stone
Oracle uses MINUS:
SELECT player_name FROM Squad_2023
MINUS
SELECT player_name FROM Squad_2024;
MySQL workaround (needed before 8.0.31, which added EXCEPT):
SELECT s.player_name
FROM Squad_2023 s
LEFT JOIN Squad_2024 t ON s.player_name = t.player_name
WHERE t.player_name IS NULL
ORDER BY s.player_name;
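That the emulations return the same rows as the native operators can be checked side by side. A minimal sketch, assuming Python's built-in sqlite3 module as a stand-in engine (SQLite supports INTERSECT and EXCEPT natively) and the squad data from this module:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in engine (assumption)
conn.executescript("""
CREATE TABLE Squad_2023 (player_name TEXT PRIMARY KEY, team TEXT);
CREATE TABLE Squad_2024 (player_name TEXT PRIMARY KEY, team TEXT);
INSERT INTO Squad_2023 VALUES ('Alex Morgan','City FC'),('Ben Carter','United FC'),
  ('Chris Shaw','City FC'),('David Lee','Rovers'),('Evan Price','United FC'),
  ('Frank Turner','Rovers'),('George Hill','City FC'),('Helen Stone','City FC');
INSERT INTO Squad_2024 VALUES ('Alex Morgan','City FC'),('Ben Carter','United FC'),
  ('Ivan Petrov','Wanderers'),('Jack O''Neill','Wanderers'),('Evan Price','United FC'),
  ('Liam King','Rovers'),('George Hill','City FC'),('Maya Cruz','United FC');
""")

def names(sql):
    """Return the first column of the query result as a list."""
    return [row[0] for row in conn.execute(sql)]

retained_native = names("""
SELECT player_name FROM Squad_2023
INTERSECT
SELECT player_name FROM Squad_2024
ORDER BY player_name""")
retained_emulated = names("""
SELECT s.player_name FROM Squad_2023 s
WHERE s.player_name IN (SELECT player_name FROM Squad_2024)
ORDER BY s.player_name""")

departed_native = names("""
SELECT player_name FROM Squad_2023
EXCEPT
SELECT player_name FROM Squad_2024
ORDER BY player_name""")
departed_emulated = names("""
SELECT s.player_name FROM Squad_2023 s
LEFT JOIN Squad_2024 t ON s.player_name = t.player_name
WHERE t.player_name IS NULL
ORDER BY s.player_name""")

print(retained_native == retained_emulated, departed_native == departed_emulated)
```

The results agree here because player_name is a primary key in both tables. On columns with duplicates or NULLs, the native operators deduplicate and treat NULLs as equal, so the emulations would need DISTINCT and explicit NULL handling.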
Practical combined examples
a) Players who changed team between seasons (name present in both seasons
but team differs)
SELECT s23.player_name, s23.team AS team_2023, s24.team AS team_2024
FROM Squad_2023 s23
JOIN Squad_2024 s24 ON s23.player_name = s24.player_name
WHERE s23.team <> s24.team;
Result: (none in our sample — teams match for retained players). If there were transfers,
they'd show here.
b) Full roster across both seasons deduped by player_name (unique player
list)
SELECT DISTINCT player_name FROM (
SELECT player_name FROM Squad_2023
UNION ALL
SELECT player_name FROM Squad_2024
) t
ORDER BY player_name;
Result (unique player list):
player_name
Alex Morgan
Ben Carter
Chris Shaw
David Lee
Evan Price
Frank Turner
George Hill
Helen Stone
Ivan Petrov
Jack O'Neill
Liam King
Maya Cruz
Performance considerations & tips
• UNION requires deduplication — often sorting or hashing — heavier cost. Use UNION
ALL if duplication is acceptable.
• INTERSECT and EXCEPT often implemented as sort/merge operations — on huge tables
they can be expensive. Consider indexed joins or temporary tables.
• For large datasets, create temp tables with appropriate indexes, run set operations on
indexed columns, then drop temp tables.
• MySQL versions before 8.0.31 lack INTERSECT / EXCEPT; use semi/anti-joins (EXISTS /
LEFT JOIN + IS NULL) there, which also work well on large data.
Common mistakes & gotchas
• Column mismatch: both SELECTs in set operations must have same number of
columns and compatible types.
• Implicit deduplication: UNION removes duplicates — may surprise analysts who
expected counts to sum.
• Relying on ordering of UNION results — always apply ORDER BY on final result.
Q-Set (Questions)
MCQs
1. Which operator removes duplicates: UNION or UNION ALL?
2. Which SQL flavors support INTERSECT natively?
a) MySQL (before 8.0.31) b) Postgres c) SQL Server d) Oracle (pick all that apply)
3. How do you emulate EXCEPT in MySQL?
4. To combine lists and keep duplicates, use: a) UNION b) UNION ALL c)
INTERSECT d) EXCEPT
5. If you need players present in both squads, which is best: INTERSECT or UNION?
Tasks
1. Using Squad_2023 and Squad_2024, produce list of new signings in 2024 (players in
2024 but not in 2023).
2. Produce a combined, deduplicated list of player-team pairs across both years.
3. Find players that appear in both lists but with different team value (changed team).
4. Using UNION ALL, compute total rows count and then the distinct player count —
explain difference.
5. For a very large dataset (millions of rows), propose a scalable strategy to compute
INTERSECT.
Concept Summary — Set Operations
Operation | Meaning | Use Case | DB support | Cost Consideration
UNION | Merge, remove duplicates | Combined unique list | Postgres/SQL Server/MySQL | dedup cost (sort/hash)
UNION ALL | Merge, keep duplicates | Audit log merges | All DBs | fast, no dedup
INTERSECT | Rows in both sets | Retained players | Postgres/SQL Server/Oracle (MySQL from 8.0.31) | sort/merge cost
EXCEPT / MINUS | Rows in left but not right | Departed players | Postgres/SQL Server (EXCEPT; MySQL from 8.0.31) / Oracle (MINUS) | sort/merge cost
Emulation | Use JOIN / NOT EXISTS | For DBs lacking operators | MySQL before 8.0.31 | may be faster with proper indexes
Cross-DB syntax / availability (Module 8.4)
Feature | Postgres | SQL Server | MySQL 8+ | Oracle
UNION | ✅ | ✅ | ✅ | ✅
UNION ALL | ✅ | ✅ | ✅ | ✅
INTERSECT | ✅ | ✅ | ✅ from 8.0.31 (❌ earlier) | ✅
EXCEPT / MINUS | EXCEPT ✅ | EXCEPT ✅ | EXCEPT ✅ from 8.0.31 (❌ earlier: use LEFT JOIN IS NULL) | MINUS ✅
DISTINCT behavior | Standard | Standard | Standard | Standard
Final Best Practices
• Index join keys (player_id, player_name) before running large semi/anti joins or
set operations.
• Use EXISTS/NOT EXISTS for semi/anti-join semantics; avoid NOT IN when NULLs
possible.
• Prefer UNION ALL if duplicates are acceptable; use UNION only when dedup necessary
and data volume manageable.
• Avoid accidental Cartesian joins: always include ON when using JOIN. Use EXPLAIN
to verify.
• For portability, write set-operation emulations for MySQL versions before 8.0.31
(LEFT JOIN + IS NULL / EXISTS).
• Materialize intermediate results into temp tables with indexes for very large
workloads.
A-Set
MCQ answers
1 → UNION removes duplicates; UNION ALL keeps duplicates.
2 → Postgres, SQL Server, Oracle support INTERSECT natively. So choose b, c, d; MySQL
qualifies only on 8.0.31 or later.
3 → Use LEFT JOIN ... WHERE right IS NULL or use NOT EXISTS.
4 → b (UNION ALL).
5 → INTERSECT.
Task Answers
Task 1 — new signings in 2024
SELECT player_name FROM Squad_2024
EXCEPT
SELECT player_name FROM Squad_2023
ORDER BY player_name;
Result (Postgres/SQL Server):
player_name
Ivan Petrov
Jack O'Neill
Liam King
Maya Cruz
(For MySQL before 8.0.31, apply the LEFT JOIN ... IS NULL pattern shown earlier with the
two tables swapped.)
Task 2 — combined deduped player-team pairs
SELECT DISTINCT player_name, team FROM (
SELECT player_name, team FROM Squad_2023
UNION ALL
SELECT player_name, team FROM Squad_2024
) t
ORDER BY player_name;
Result: shows unique player-team pairs (list omitted for brevity; same as unique players with
team).
Task 3 — players appearing in both lists with different teams
SELECT s23.player_name, s23.team AS team_2023, s24.team AS team_2024
FROM Squad_2023 s23
JOIN Squad_2024 s24 ON s23.player_name = s24.player_name
WHERE s23.team <> s24.team;
Result: In our sample, none (empty result). If existed, it would show player_name with
teams.
Task 4 — UNION ALL counts vs DISTINCT
SELECT COUNT(*) AS total_rows FROM (
SELECT player_name FROM Squad_2023
UNION ALL
SELECT player_name FROM Squad_2024
) t;
-- total_rows = number of rows from both tables (8 + 8 = 16)
SELECT COUNT(DISTINCT player_name) AS distinct_players FROM (
SELECT player_name FROM Squad_2023
UNION ALL
SELECT player_name FROM Squad_2024
) t;
-- distinct_players = 12 (as shown earlier)
Task 5 — scalable INTERSECT strategy for huge data
• Load the two datasets into indexed temporary or staging tables (partitioned if
possible).
• Create index on player_name (or join key).
• Use an indexed join (DISTINCT reproduces the de-duplication that INTERSECT performs):
SELECT DISTINCT s1.player_name
FROM Squad_2023 s1
INNER JOIN Squad_2024 s2 ON s1.player_name = s2.player_name;
• Or, if memory is sufficient, use hashing or map-reduce approaches.
• In a distributed engine (e.g., Spark), use the built-in set intersection methods with
partitioning.
8.5 — Conditional Logic (CASE, Nested CASE, COALESCE)
🔹 Introduction
SQL is not just about filtering and joining data — analysts often need to derive new values
or handle missing data.
That’s where CASE expressions and COALESCE come in.
• CASE = conditional branching in SQL (like IF-ELSE).
• COALESCE = “first non-null value” function — quick way to replace nulls.
Industry relevance (football example):
• Categorize players into “Senior” or “Junior” based on age.
• Derive performance rating bands (“Star”, “Contributor”, “Bench”).
• Replace missing goal counts with 0.
🔹 Dataset — Player Performance (Football League)
CREATE TABLE PlayerStats (
player_id INT PRIMARY KEY,
player_name VARCHAR(50),
team VARCHAR(30),
age INT,
goals INT,
assists INT,
minutes_played INT,
yellow_cards INT,
red_cards INT
);
INSERT INTO PlayerStats VALUES
(1, 'Alex Morgan', 'City FC', 29, 12, 5, 2100, 3, 0),
(2, 'Ben Carter', 'United FC', 25, 2, 7, 1800, 5, 1),
(3, 'Chris Shaw', 'City FC', 30, 0, 3, 1500, 2, 0),
(4, 'David Lee', 'Rovers', 28, NULL, 1, 500, 0, 0),
(5, 'Evan Price', 'United FC', 22, 5, 2, 900, 1, 0),
(6, 'George Hill', 'City FC', 21, 0, 0, 300, 0, 0);
🔹 Simple CASE — Age category
Concept: Simple CASE checks equality.
SELECT player_name, age,
CASE age
WHEN 21 THEN 'Youngster'
WHEN 22 THEN 'Promising'
WHEN 25 THEN 'Prime'
ELSE 'Experienced'
END AS age_category
FROM PlayerStats;
Result:
player_name age age_category
Alex Morgan 29 Experienced
Ben Carter 25 Prime
Chris Shaw 30 Experienced
David Lee 28 Experienced
Evan Price 22 Promising
George Hill 21 Youngster
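Because simple CASE compares with equality, it cannot detect NULL: a comparison with NULL is never true. The sketch below runs against the PlayerStats table above and puts the two forms side by side; the band labels are made up for illustration.

```sql
SELECT player_name, goals,
       -- Simple CASE: "goals = NULL" is never true, so this branch never fires
       CASE goals
            WHEN NULL THEN 'Data Missing'
            WHEN 0    THEN 'No goals'
            ELSE 'Scored'
       END AS simple_case_band,
       -- Searched CASE: IS NULL is the correct NULL test
       CASE
            WHEN goals IS NULL THEN 'Data Missing'
            WHEN goals = 0     THEN 'No goals'
            ELSE 'Scored'
       END AS searched_case_band
FROM PlayerStats;
```

For David Lee (goals is NULL) the first column falls through to 'Scored' while the second returns 'Data Missing'; every other player gets the same label in both columns.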
🔹 Searched CASE — Performance rating
Concept: Searched CASE uses boolean conditions.
SELECT player_name, goals, assists,
CASE
WHEN goals >= 10 THEN 'Star'
WHEN goals BETWEEN 3 AND 9 THEN 'Contributor'
WHEN goals IS NULL THEN 'Data Missing'
ELSE 'Bench'
END AS performance_band
FROM PlayerStats;
Result:
player_name goals assists performance_band
Alex Morgan 12 5 Star
Ben Carter 2 7 Bench
Chris Shaw 0 3 Bench
David Lee NULL 1 Data Missing
Evan Price 5 2 Contributor
George Hill 0 0 Bench
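A CASE expression returns one value per row, so it can also sit inside an aggregate to count rows per band. The following is a sketch of that conditional-aggregation pattern on the same PlayerStats data; the output column names are assumptions chosen for readability.

```sql
SELECT team,
       COUNT(*) AS players,
       SUM(CASE WHEN goals >= 10 THEN 1 ELSE 0 END)           AS stars,
       SUM(CASE WHEN goals BETWEEN 3 AND 9 THEN 1 ELSE 0 END) AS contributors,
       SUM(CASE WHEN goals IS NULL THEN 1 ELSE 0 END)         AS missing_goal_data
FROM PlayerStats
GROUP BY team
ORDER BY team;
-- City FC:   3 players, 1 star, 0 contributors, 0 missing
-- Rovers:    1 player,  0 stars, 0 contributors, 1 missing
-- United FC: 2 players, 0 stars, 1 contributor,  0 missing
```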
🔹 Nested CASE — Disciplinary flag
Concept: CASE inside CASE.
SELECT player_name, yellow_cards, red_cards,
CASE
WHEN red_cards > 0 THEN 'High Risk'
ELSE
CASE
WHEN yellow_cards >= 5 THEN 'Warning'
ELSE 'Clean'
END
END AS discipline_status
FROM PlayerStats;
Result:
player_name yellow_cards red_cards discipline_status
Alex Morgan 3 0 Clean
Ben Carter 5 1 High Risk
Chris Shaw 2 0 Clean
David Lee 0 0 Clean
Evan Price 1 0 Clean
George Hill 0 0 Clean
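CASE evaluates its WHEN branches top to bottom and stops at the first match, so this particular nested CASE can also be written as one flat searched CASE. A sketch of the equivalent form:

```sql
SELECT player_name, yellow_cards, red_cards,
       CASE
            WHEN red_cards > 0     THEN 'High Risk'  -- checked first, wins over yellow cards
            WHEN yellow_cards >= 5 THEN 'Warning'
            ELSE 'Clean'
       END AS discipline_status
FROM PlayerStats;
```

It returns the same six rows as the nested version. Nesting tends to earn its place only when an inner branch needs conditions that would be awkward to repeat in every outer WHEN.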
🔹 COALESCE vs CASE — Handling NULL goals
SELECT player_name,
COALESCE(goals,0) AS goals_with_coalesce,
CASE WHEN goals IS NULL THEN 0 ELSE goals END AS goals_with_case
FROM PlayerStats;
Result:
player_name goals_with_coalesce goals_with_case
Alex Morgan 12 12
Ben Carter 2 2
Chris Shaw 0 0
David Lee 0 0
Evan Price 5 5
George Hill 0 0
Takeaway: Both work. Use COALESCE for readability; use CASE for complex branching.
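Whether to replace NULL with 0 is an analytical decision, because it changes arithmetic and averages. The sketch below shows the difference on PlayerStats; the column aliases are assumptions, and the `* 1.0` is only there because some databases (SQL Server, for example) would otherwise return an integer average.

```sql
-- Row level: NULL + anything is NULL unless it is replaced first
SELECT player_name,
       goals + assists                           AS raw_contributions,   -- NULL for David Lee
       COALESCE(goals, 0) + COALESCE(assists, 0) AS safe_contributions   -- 1 for David Lee
FROM PlayerStats;

-- Aggregate level: AVG skips NULLs, so the two averages differ
SELECT AVG(goals * 1.0)              AS avg_goals_recorded,     -- 19 / 5 = 3.8
       AVG(COALESCE(goals, 0) * 1.0) AS avg_goals_null_as_zero  -- 19 / 6, about 3.17
FROM PlayerStats;
```

Treating a missing goal count as 0 assumes the player scored nothing, which may or may not be what the data actually means.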
🔹 Summary Table (CASE/COALESCE)
Feature | Syntax | Use Case | Industry Example
Simple CASE | CASE col WHEN val THEN ... END | Exact value match | Categorize by age group
Searched CASE | CASE WHEN condition THEN ... END | Range/condition checks | Band players by goals
Nested CASE | CASE inside CASE | Multi-level logic | Discipline flags
COALESCE | COALESCE(expr1, expr2, …) | Null handling | Replace NULL goals with 0
CASE vs COALESCE | CASE = flexible, COALESCE = concise | Choose based on complexity | -
Cross-DB support: CASE and COALESCE are supported with the same standard syntax in
Postgres, SQL Server, MySQL, and Oracle.
📘 8.6 — Joins & Set Case Studies
🔹 8.6.0 Introduction
Now we combine Joins + Set operations + Conditional logic into real business scenarios.
Here we use a Retail / Customer Orders dataset and an Org Hierarchy dataset.
🔹 Dataset A — Customers & Orders
CREATE TABLE Customers (
customer_id INT PRIMARY KEY,
customer_name VARCHAR(50),
region VARCHAR(30)
);
INSERT INTO Customers VALUES
(1, 'Alice', 'North'),
(2, 'Bob', 'South'),
(3, 'Charlie', 'East'),
(4, 'Diana', 'West'),
(5, 'Ethan', 'South');
CREATE TABLE Orders (
order_id INT PRIMARY KEY,
customer_id INT,
product VARCHAR(50),
amount DECIMAL(10,2)
);
INSERT INTO Orders VALUES
(101, 1, 'Laptop', 1200.00),
(102, 1, 'Mouse', 25.00),
(103, 2, 'Keyboard', 75.00),
(104, 3, 'Monitor', 250.00),
(105, 6, 'Chair', 150.00); -- Orphan (customer_id 6 not in Customers)
🔹 Case Study 1 — Customer orders with product details
Goal: List all customers with their orders, including those with no orders. Flag orphans.
SELECT c.customer_name, c.region, o.order_id, o.product, o.amount,
CASE
WHEN o.customer_id IS NULL THEN 'No Order'
WHEN c.customer_id IS NULL THEN 'Orphan Order' -- never true here: a LEFT JOIN from Customers always has a customer
ELSE 'Valid'
END AS order_status
FROM Customers c
LEFT JOIN Orders o ON c.customer_id = o.customer_id
UNION ALL
-- Orphan orders are added by this second branch
SELECT NULL, NULL, o.order_id, o.product, o.amount, 'Orphan Order'
FROM Orders o
WHERE NOT EXISTS (SELECT 1 FROM Customers c WHERE c.customer_id = o.customer_id)
ORDER BY customer_name; -- NULL names sort last in Postgres/Oracle, first in SQL Server/MySQL
Result:
customer_name region order_id product amount order_status
Alice North 101 Laptop 1200.00 Valid
Alice North 102 Mouse 25.00 Valid
Bob South 103 Keyboard 75.00 Valid
Charlie East 104 Monitor 250.00 Valid
Diana West NULL NULL NULL No Order
Ethan South NULL NULL NULL No Order
NULL NULL 105 Chair 150.00 Orphan Order
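Where FULL OUTER JOIN is available (Postgres and SQL Server support it; MySQL does not), the same report can probably be produced in a single pass. This is a sketch of that alternative, not a replacement for the portable LEFT JOIN + UNION ALL form:

```sql
SELECT c.customer_name, c.region, o.order_id, o.product, o.amount,
       CASE
            WHEN c.customer_id IS NULL THEN 'Orphan Order'  -- order without a customer
            WHEN o.order_id   IS NULL THEN 'No Order'       -- customer without an order
            ELSE 'Valid'
       END AS order_status
FROM Customers c
FULL OUTER JOIN Orders o ON c.customer_id = o.customer_id
ORDER BY c.customer_name;
```

With the sample data this yields the same seven rows: four 'Valid', two 'No Order' (Diana, Ethan), and one 'Orphan Order' (order 105). Here both CASE branches are reachable, because either side of the join can be missing.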
🔹 Dataset B — Employee Hierarchy (Org Reporting)
CREATE TABLE Employees (
emp_id INT PRIMARY KEY,
emp_name VARCHAR(50),
manager_id INT,
dept VARCHAR(30)
);
INSERT INTO Employees VALUES
(1, 'CEO', NULL, 'Executive'),
(2, 'VP Sales', 1, 'Sales'),
(3, 'VP Tech', 1, 'Tech'),
(4, 'Manager A', 2, 'Sales'),
(5, 'Manager B', 2, 'Sales'),
(6, 'Lead Dev', 3, 'Tech'),
(7, 'Engineer X', 6, 'Tech');
🔹 Case Study 2 — Hierarchical reporting (recursive CTE)
Goal: Walk the reporting hierarchy from the CEO down and return every employee with their level.
WITH RECURSIVE OrgChart AS (
SELECT emp_id, emp_name, manager_id, dept, 1 AS level
FROM Employees
WHERE manager_id IS NULL
UNION ALL
SELECT e.emp_id, e.emp_name, e.manager_id, e.dept, oc.level + 1
FROM Employees e
JOIN OrgChart oc ON e.manager_id = oc.emp_id
)
SELECT emp_id, emp_name, dept, level FROM OrgChart ORDER BY level, emp_id;
Result:
emp_id emp_name dept level
1 CEO Executive 1
2 VP Sales Sales 2
3 VP Tech Tech 2
4 Manager A Sales 3
5 Manager B Sales 3
6 Lead Dev Tech 3
7 Engineer X Tech 4
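The query above returns each employee's level. To spell out the full chain of managers, the recursive step can carry a text column that grows by one name per level. The sketch below uses Postgres syntax (`||` for concatenation, CAST to TEXT so both halves of the CTE agree on the column type); the `reporting_line` column name is an assumption, and MySQL would need CONCAT() and a CHAR cast instead.

```sql
WITH RECURSIVE OrgPath AS (
    -- Anchor: the top of the hierarchy starts the path with its own name
    SELECT emp_id, emp_name, CAST(emp_name AS TEXT) AS reporting_line
    FROM Employees
    WHERE manager_id IS NULL
    UNION ALL
    -- Recursive step: append each employee to their manager's path
    SELECT e.emp_id, e.emp_name, op.reporting_line || ' > ' || e.emp_name
    FROM Employees e
    JOIN OrgPath op ON e.manager_id = op.emp_id
)
SELECT emp_id, reporting_line
FROM OrgPath
ORDER BY emp_id;
-- 1  CEO
-- 2  CEO > VP Sales
-- 3  CEO > VP Tech
-- 4  CEO > VP Sales > Manager A
-- 5  CEO > VP Sales > Manager B
-- 6  CEO > VP Tech > Lead Dev
-- 7  CEO > VP Tech > Lead Dev > Engineer X
```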
🔹 Industry relevance
• Customer Orders Case:
Retail companies need to track orders, identify customers with no orders, and detect
orphan orders caused by ETL errors.
• Hierarchy Case:
HR analytics and org design use recursive CTEs to map reporting lines. Tech
companies need to compute spans of control and headcounts.
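Span of control at the direct-report level does not need recursion; a self-join on the Employees table is enough. A minimal sketch, with alias and column names chosen for illustration:

```sql
SELECT m.emp_name        AS manager,
       COUNT(e.emp_id)   AS direct_reports   -- counts only matched rows, so 0 for non-managers
FROM Employees m
LEFT JOIN Employees e ON e.manager_id = m.emp_id
GROUP BY m.emp_id, m.emp_name
ORDER BY m.emp_id;
-- CEO 2, VP Sales 2, VP Tech 1, Manager A 0, Manager B 0, Lead Dev 1, Engineer X 0
```

Counting everyone below a manager, not just direct reports, would bring the recursive CTE back in.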
🔹 Summary Table (Joins & Sets Case Studies)
Case Study | Concepts Used | Industry Example | SQL Feature
Customer Orders + Products | LEFT JOIN + UNION + CASE | Retail order tracking | Detect orphans, flag missing data
Hierarchical Reporting | Recursive CTE | HR / Tech org charts | Employee reporting lines
Cross-DB Support:
• Customer orders query: works in Postgres, SQL Server, MySQL 8+.
• Recursive CTE: supported in Postgres, SQL Server, MySQL 8+. SQL Server writes WITH
without the RECURSIVE keyword. MySQL versions before 8.0 do not support recursive CTEs.
Additional Practice
Complex Join Lab (Multi-Table)
🗂 Dataset: Sports Ticketing
CREATE TABLE Teams (
team_id INT PRIMARY KEY,
team_name VARCHAR(50),
city VARCHAR(30)
);
INSERT INTO Teams VALUES
(1, 'City Lions', 'New York'),
(2, 'United Eagles', 'Boston'),
(3, 'Rovers FC', 'Chicago');
CREATE TABLE Players (
player_id INT PRIMARY KEY,
player_name VARCHAR(50),
team_id INT,
position VARCHAR(20),
FOREIGN KEY (team_id) REFERENCES Teams(team_id)
);
INSERT INTO Players VALUES
(101, 'Alex Morgan', 1, 'Forward'),
(102, 'Ben Carter', 1, 'Defender'),
(103, 'Chris Shaw', 2, 'Midfielder'),
(104, 'David Lee', 2, 'Forward'),
(105, 'Evan Price', 3, 'Goalkeeper');
CREATE TABLE Matches (
match_id INT PRIMARY KEY,
home_team_id INT,
away_team_id INT,
match_date DATE,
FOREIGN KEY (home_team_id) REFERENCES Teams(team_id),
FOREIGN KEY (away_team_id) REFERENCES Teams(team_id)
);
INSERT INTO Matches VALUES
(201, 1, 2, '2025-01-10'),
(202, 2, 3, '2025-01-15'),
(203, 3, 1, '2025-01-20');
CREATE TABLE Tickets (
ticket_id INT PRIMARY KEY,
match_id INT,
buyer_name VARCHAR(50),
price DECIMAL(10,2),
FOREIGN KEY (match_id) REFERENCES Matches(match_id)
);
INSERT INTO Tickets VALUES
(301, 201, 'Alice', 50.00),
(302, 201, 'Bob', 75.00),
(303, 202, 'Charlie', 60.00),
(304, 203, 'Diana', 55.00),
(305, 203, 'Ethan', 70.00);
📝 Problem:
Find tickets sold with full match context (home team, away team, players involved).
SELECT t.ticket_id, t.buyer_name, t.price,
m.match_date,
th.team_name AS home_team, ta.team_name AS away_team,
COUNT(p.player_id) AS total_players
FROM Tickets t
JOIN Matches m ON t.match_id = m.match_id
JOIN Teams th ON m.home_team_id = th.team_id
JOIN Teams ta ON m.away_team_id = ta.team_id
JOIN Players p ON p.team_id IN (m.home_team_id, m.away_team_id)
GROUP BY t.ticket_id, t.buyer_name, t.price, m.match_date, th.team_name, ta.team_name
ORDER BY t.ticket_id;
✅ Output Grid
ticket_id buyer_name price match_date home_team away_team total_players
301 Alice 50.00 2025-01-10 City Lions United Eagles 4
302 Bob 75.00 2025-01-10 City Lions United Eagles 4
303 Charlie 60.00 2025-01-15 United Eagles Rovers FC 3
304 Diana 55.00 2025-01-20 Rovers FC City Lions 3
305 Ethan 70.00 2025-01-20 Rovers FC City Lions 3
🔎 Explanation
• Start from Tickets → Matches → Teams → Players.
• Each ticket links to a match → each match links to home & away teams → each team
links to players.
• COUNT(p.player_id) gives the combined squad size of both teams in each ticket's match.
Concepts touched: Multi-table joins, aggregation, group-by.
⚠ Sub-concepts & Mistakes
• ❌ Forgetting ON condition → Cartesian product.
• ❌ Not grouping → duplicate player rows per ticket.
• ✅ Best practice: Start from the table whose rows you want in the output (Tickets) and expand outward.
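Joining Players directly multiplies every ticket row by the number of players, which is why the query needs GROUP BY (and why a SUM(price) on that join would be inflated). One way to avoid the fan-out is to aggregate Players per team first and then join the totals. The sketch below assumes the Sports Ticketing tables above; the CTE name `squad` and its `squad_size` column are made up for illustration.

```sql
WITH squad AS (
    -- One row per team, so later joins cannot multiply ticket rows
    SELECT team_id, COUNT(*) AS squad_size
    FROM Players
    GROUP BY team_id
)
SELECT t.ticket_id, t.buyer_name, t.price, m.match_date,
       th.team_name AS home_team, ta.team_name AS away_team,
       COALESCE(sh.squad_size, 0) + COALESCE(sa.squad_size, 0) AS total_players
FROM Tickets t
JOIN Matches m ON t.match_id = m.match_id
JOIN Teams th ON m.home_team_id = th.team_id
JOIN Teams ta ON m.away_team_id = ta.team_id
LEFT JOIN squad sh ON sh.team_id = m.home_team_id
LEFT JOIN squad sa ON sa.team_id = m.away_team_id
ORDER BY t.ticket_id;
```

On the sample data this returns the same total_players values as the output grid (4, 4, 3, 3, 3) with one row per ticket and no GROUP BY on the outer query.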
🔹 Anti-Join Patterns
🗂 Dataset: Football Transfers
CREATE TABLE PlayersRoster (
player_id INT PRIMARY KEY,
player_name VARCHAR(50),
current_team VARCHAR(50)
);
INSERT INTO PlayersRoster VALUES
(1, 'Alex Morgan', 'City Lions'),
(2, 'Ben Carter', 'City Lions'),
(3, 'Chris Shaw', 'United Eagles'),
(4, 'David Lee', 'Rovers FC'),
(5, 'Evan Price', 'United Eagles');
CREATE TABLE Transfers (
transfer_id INT PRIMARY KEY,
player_id INT,
new_team VARCHAR(50)
);
INSERT INTO Transfers VALUES
(101, 1, 'United Eagles'),
(102, 4, 'City Lions');
📝 Problem: Find players with no transfer records (anti-join).
Method 1: LEFT JOIN IS NULL
SELECT p.player_name, p.current_team
FROM PlayersRoster p
LEFT JOIN Transfers t ON p.player_id = t.player_id
WHERE t.player_id IS NULL;
Method 2: NOT EXISTS
SELECT p.player_name, p.current_team
FROM PlayersRoster p
WHERE NOT EXISTS (
SELECT 1 FROM Transfers t WHERE t.player_id = p.player_id
);
✅ Output Grid
player_name current_team
Ben Carter City Lions
Chris Shaw United Eagles
Evan Price United Eagles
🔎 Explanation
• Anti-join = find rows in one table with no match in another.
• Both LEFT JOIN IS NULL and NOT EXISTS achieve it.
Industry use cases:
• Find customers with no orders.
• Detect players without transfers.
• Spot employees without payroll records.
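NOT IN looks like a third way to write the anti-join, but it breaks as soon as the subquery returns a NULL. The sketch below adds a hypothetical transfer row with no player_id (the row is invented purely for this demonstration) and compares the two forms.

```sql
-- Hypothetical bad row: a transfer whose player_id was never filled in
INSERT INTO Transfers VALUES (103, NULL, 'Rovers FC');

-- NOT IN: "player_id <> NULL" is unknown for every row, so nothing qualifies
SELECT p.player_name
FROM PlayersRoster p
WHERE p.player_id NOT IN (SELECT t.player_id FROM Transfers t);
-- returns 0 rows

-- NOT EXISTS: unaffected by the NULL, still returns the three players
SELECT p.player_name
FROM PlayersRoster p
WHERE NOT EXISTS (
    SELECT 1 FROM Transfers t WHERE t.player_id = p.player_id
);
-- Ben Carter, Chris Shaw, Evan Price
```

If NOT IN has to be used, filtering the subquery with WHERE t.player_id IS NOT NULL restores the expected result.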
🔹 Set Operation Puzzles
🗂 Dataset: Fan Memberships
CREATE TABLE ClubFans (
fan_name VARCHAR(50),
club VARCHAR(50)
);
INSERT INTO ClubFans VALUES
('Alice', 'City Lions'),
('Bob', 'City Lions'),
('Charlie', 'United Eagles'),
('Diana', 'Rovers FC'),
('Ethan', 'United Eagles');
CREATE TABLE StadiumFans (
fan_name VARCHAR(50),
stadium VARCHAR(50)
);
INSERT INTO StadiumFans VALUES
('Alice', 'New York Arena'),
('Charlie', 'Boston Stadium'),
('Frank', 'Chicago Dome'),
('George', 'New York Arena');
📝 Puzzle 1: UNION vs UNION ALL
SELECT fan_name FROM ClubFans
UNION
SELECT fan_name FROM StadiumFans;
✅ Output (no duplicates):
Alice, Bob, Charlie, Diana, Ethan, Frank, George
SELECT fan_name FROM ClubFans
UNION ALL
SELECT fan_name FROM StadiumFans;
✅ Output (with duplicates):
Alice, Bob, Charlie, Diana, Ethan, Alice, Charlie, Frank, George
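The difference between the two outputs can be confirmed by counting rows. A small sketch on the same ClubFans and StadiumFans tables; the derived-table aliases are arbitrary.

```sql
SELECT COUNT(*) AS union_rows
FROM (
    SELECT fan_name FROM ClubFans
    UNION
    SELECT fan_name FROM StadiumFans
) u;
-- 7

SELECT COUNT(*) AS union_all_rows
FROM (
    SELECT fan_name FROM ClubFans
    UNION ALL
    SELECT fan_name FROM StadiumFans
) ua;
-- 9 (5 + 4); the gap of 2 is the repeated Alice and Charlie
```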
📝 Puzzle 2: INTERSECT — Fans in both club & stadium
(Postgres / SQL Server syntax; MySQL before 8.0.31 → INNER JOIN with DISTINCT)
SELECT fan_name FROM ClubFans
INTERSECT
SELECT fan_name FROM StadiumFans;
✅ Output:
Alice, Charlie
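On a database without INTERSECT, the same answer can be reached with a semi-join. A sketch of the emulation, assuming fan_name contains no NULLs (INTERSECT treats two NULLs as equal, while `=` does not):

```sql
SELECT DISTINCT c.fan_name          -- DISTINCT mirrors INTERSECT's de-duplication
FROM ClubFans c
WHERE EXISTS (
    SELECT 1
    FROM StadiumFans s
    WHERE s.fan_name = c.fan_name
);
-- Alice, Charlie
```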
📝 Puzzle 3: EXCEPT (MINUS in Oracle)
Find fans registered in clubs but not attending stadiums.
SELECT fan_name FROM ClubFans
EXCEPT
SELECT fan_name FROM StadiumFans;
✅ Output:
Bob, Diana, Ethan
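EXCEPT can be emulated in the same spirit with an anti-join. A sketch under the same assumption that fan_name is never NULL:

```sql
SELECT DISTINCT c.fan_name          -- DISTINCT mirrors EXCEPT's de-duplication
FROM ClubFans c
WHERE NOT EXISTS (
    SELECT 1
    FROM StadiumFans s
    WHERE s.fan_name = c.fan_name
);
-- Bob, Diana, Ethan
```

Swapping the two tables gives the opposite difference (stadium visitors who are not club members), which on this data would be Frank and George.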