Understanding Relational Schema Basics
Ryan Marcus
Outline
●
Course logistics
●
Brief review
●
Relational schema
●
Understanding schema with ER diagrams
●
Understanding schema with normal forms
Upcoming Deadlines
●
HW1 due Feb 15th @ 10pm (11 days)
●
Next time: guest lecture from Jeff Tao
Combining Data
●
Combining data that was intended to be combined: Joining
– Pure computation
Course | Classroom
CIS 5450 | Meyerson B1
CIS 5500 | Towne 100
Course | Instructor
CIS 5450 | Mar & Gar
CIS 5500 | Dav & Nai
●
Combining data that was not intended to be combined: Data Integration
– More complex
– Requires knowledge
Company | Founded
Apple | 1976
Microsoft | 1975
Stock | Value
AAPL | 188.64
MSFT | 410.37
Single Source of Truth
●
Joins may seem inconvenient, but can greatly increase data quality.
Professor | Dept | Admin
Marcus | CIS | Mari
Gardner | CIS | Mari
Gerbner | COM | Andy
Redundant data! (The CIS / Mari pairing is stored twice.)
Joins in Pandas
●
How many movies starring Angelina Jolie were produced by Sony?
[Schema diagram]
Entity tables: three tables with columns (id, name), (id, title), and (id, name)
Relationship tables:
– movie_companies (company_id, movie_id)
– cast_info (movie_id, person_id)
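The question on this slide can be sketched in pandas roughly as follows. This is a minimal sketch under stated assumptions: the entity table names (`company`, `movie`, `person`) and every row of data are made up for illustration; only the column names and the two relationship table names come from the slide.

```python
import pandas as pd

# Toy data: all names, titles, and ids below are invented for illustration.
company = pd.DataFrame({"id": [1, 2], "name": ["Sony", "Other Studio"]})
movie = pd.DataFrame({"id": [10, 11, 12], "title": ["Movie A", "Movie B", "Movie C"]})
person = pd.DataFrame({"id": [100, 101], "name": ["Angelina Jolie", "Someone Else"]})
movie_companies = pd.DataFrame({"company_id": [1, 1, 2], "movie_id": [10, 11, 12]})
cast_info = pd.DataFrame({"movie_id": [10, 12, 11], "person_id": [100, 100, 101]})

# Movies produced by Sony: join the company entity to its relationship table.
sony_movies = company[company["name"] == "Sony"].merge(
    movie_companies, left_on="id", right_on="company_id"
)["movie_id"]

# Movies starring Angelina Jolie: join the person entity to cast_info.
jolie_movies = person[person["name"] == "Angelina Jolie"].merge(
    cast_info, left_on="id", right_on="person_id"
)["movie_id"]

# Keep movies that appear in both sets, then count distinct movie ids.
both = movie[movie["id"].isin(sony_movies) & movie["id"].isin(jolie_movies)]
n_movies = both["id"].nunique()
print(n_movies)
```

Counting distinct movie ids (rather than joined rows) avoids double counting when a person has several credits in the same movie.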
Relational schema
●
“Relational” = tabular = tables = dataframes
●
Relational schema is what the tables and
columns represent.
Employees (E_id, E_join_date, E_name, D_id)
Department (D_id, D_name, D_boss_id)
Payroll (E_id, P_salary)
Employees (e_id, e_join_date, e_name, d_id)
– Columns / attributes
– Data types
Datatypes
– SQLite
– DuckDB
Primary and foreign keys
Primary key: a unique identifier for each row of the table.
Foreign key: a non-unique reference to a unique column of another table.
Employees (e_id, e_join_date, e_name, d_id)
Relational schema in SQL
●
Unique key groups
●
Check constraints
●
Custom data types
●
External tables
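As a rough illustration of the first two bullets together with primary and foreign keys, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names follow the Employees / Department example; the specific constraints (the unique key group on name and join date, the non-empty-name check) and all row values are assumptions made for the example. Custom data types and external tables are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.executescript("""
CREATE TABLE department (
    d_id   INTEGER PRIMARY KEY,
    d_name TEXT NOT NULL UNIQUE
);
CREATE TABLE employees (
    e_id        INTEGER PRIMARY KEY,                 -- primary key
    e_name      TEXT NOT NULL,
    e_join_date TEXT NOT NULL,
    d_id        INTEGER NOT NULL REFERENCES department(d_id),  -- foreign key
    UNIQUE (e_name, e_join_date),                    -- unique key group (assumed rule)
    CHECK (length(e_name) > 0)                       -- check constraint (assumed rule)
);
""")

# Made-up rows that satisfy every constraint.
conn.execute("INSERT INTO department VALUES (1, 'CIS')")
conn.execute("INSERT INTO employees VALUES (1, 'Marcus', '2023-01-01', 1)")

def rejected(sql):
    """Return True if the database refuses the statement."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

fk_rejected = rejected("INSERT INTO employees VALUES (2, 'Gardner', '2023-01-01', 99)")
pk_rejected = rejected("INSERT INTO employees VALUES (1, 'Gerbner', '2023-01-01', 1)")
check_rejected = rejected("INSERT INTO employees VALUES (3, '', '2023-01-01', 1)")
print(fk_rejected, pk_rejected, check_rejected)
```

Each rejected insert breaks exactly one rule: a department that does not exist, a repeated primary key, and an empty name.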
Why do we have schemas?
●
Schemas ensure all of our data follows the
same rules
– Benefit 1: helps keep data clean (integrity)
– Benefit 2: makes computation easier
– Benefit 3: provides a “map” of the data that is
guaranteed to be true
ER Diagrams
●
ER (entity-relationship) diagrams are a common
tool to design and document databases.
●
Visual representation of the “nouns” and “verbs”
inside of a schema.
ER: Entities
●
Entities are the objects of the schema (rectangles)
●
Attributes are things that describe entities (ovals)
●
Compound attributes are made up of other attributes (double oval)
●
Computed attributes are attributes that can be computed from other attributes (dashed oval)
●
Relationships are the verbs of the schema (diamonds)
●
Relationships can have attributes too!
●
Relationships have a degree
[Diagram, built up across slides: entities Actor and Movie, each with an id; attribute ovals name, note, dob (compound: year, month, day), age (computed), title, year, region, language; relationship Credit between Actor and Movie (degree n:m, attribute role); relationship Born linking Actor to Country (degree label m), where Country has name and continent]
ER Diagrams
[Legend]
– Actor (rectangle): Entities; the objects or nouns of the schema
– x with parts x1 and x2: Compound attributes are composed of other attributes
ER Diagrams
●
Advantages:
– Visual
– Easy to understand
– Clear correspondence to the real world
●
Disadvantages:
– Imprecise: how do we know we have the right # of tables?
– Subjective: difficult to decide if an ER diagram is “right” or
“wrong”
●
Is Country related to Continent or is Continent an attribute?
First Normal Form - 1NF
●
Today we will learn 1NF, 2NF, 3NF.
– Each applies to a table, such that a table is either “in” a normal form or “violating” a normal form.
– Each is defined in terms of the previous form.
●
1NF is the simplest: all values are themselves not tables
[Photo: Edgar F. Codd]
Example table (every value is atomic):
Cust | CustID | T_ID | Date | Amt
Abraham | 1 | 12890 | 2003-10-14 | -87
Abraham | 1 | 12904 | 2003-10-15 | -50
Isaac | 2 | 12898 | 2003-10-14 | -21
Jacob | 3 | 12907 | 2003-10-15 | -18
Jacob | 3 | 14920 | 2003-11-20 | -70
Jacob | 3 | 15003 | 2003-11-27 | -60
Functional Dependencies
●
Recall: the “vertical line test”
[Figure: two plots of y against x; the left one is not a function, the right one is a function]
Functional Dependencies
●
More precisely:
– Say f(x) is a function iff for all x, f(x) has only 1 value
– Or:
Given f(x1) = y1 and f(x2) = y2,
And given y1 != y2,
Then: x1 must not equal x2
– Ex: If f(x1) = 5 and f(x2) = 7, then EITHER: x1 != x2 OR f(x) is not a function
●
We say a table has a functional dependency between attributes A and B, written A → B, iff B is a function of A. A → B read aloud is “A determines B”
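One way to make the definition concrete is to test whether a functional dependency holds in a given set of rows. The sketch below is an illustration only: `holds` is a hypothetical helper name, and the rows are the Professor / Dept / Admin table from the "Single Source of Truth" slide. Note that passing this test on one table instance does not prove the dependency is a rule of the schema; it only shows the current data does not contradict it.

```python
def holds(rows, lhs, rhs):
    """Return True if lhs -> rhs holds in rows (a list of dicts).

    The dependency fails as soon as two rows agree on every lhs
    attribute but disagree on some rhs attribute.
    """
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[b] for b in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True

rows = [
    {"Professor": "Marcus", "Dept": "CIS", "Admin": "Mari"},
    {"Professor": "Gardner", "Dept": "CIS", "Admin": "Mari"},
    {"Professor": "Gerbner", "Dept": "COM", "Admin": "Andy"},
]

dept_determines_admin = holds(rows, ["Dept"], ["Admin"])
dept_determines_prof = holds(rows, ["Dept"], ["Professor"])
print(dept_determines_admin, dept_determines_prof)
```

Here Dept → Admin holds, while Dept → Professor fails because CIS maps to both Marcus and Gardner.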
Functional Dependencies
if f(x1) = y1 and f(x2) = y2, then if y1 != y2, x1 must not equal x2
Reflex: if Y ⊆ X, then X → Y
Augm: if X → Y, then for any Z, XZ → YZ
Trans: if X → Y and Y → Z, then X → Z
Decomposition: if X → YZ, then X → Y
Proof:
Step 1: X → YZ (given)
Step 2: YZ → Y (reflex)
Step 3: X → Y (trans of 1 + 2)
Proving more properties
●
Union:
– if X→Y and X→Z, then X→YZ
Reflex: if Y ⊆ X, then X → Y
Augm: if X → Y, then for any Z, XZ → YZ
Trans: if X → Y and Y → Z, then X → Z
Proof:
Step 1: X → Y (given)
Step 2: X → Z (given)
Step 3: X → XZ (augm 2 with X)
Step 4: XZ → YZ (augm 1 with Z)
Step 5: X → YZ (trans 3 + 4)
Proving more properties
●
Prove: if X→Y and YZ→W, then XZ→W
Reflex: if Y ⊆ X, then X → Y
Augm: if X → Y, then for any Z, XZ → YZ
Trans: if X → Y and Y → Z, then X → Z
Proof:
Step 1: X → Y (given)
Step 2: YZ → W (given)
Step 3: XZ → YZ (augm 1 with Z)
Step 4: XZ → W (trans 3 and 2)
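Proofs like the one above can be cross-checked mechanically. The sketch below uses attribute closure, a standard technique that is not covered on these slides: the closure of a set of attributes is everything it determines under the given FDs, and X → Y follows from the FDs exactly when Y is inside the closure of X. The function names are hypothetical, and attributes are assumed to be single letters so that a string like "YZ" stands for the set {Y, Z}.

```python
def closure(attrs, fds):
    """Return every attribute determined by attrs under fds.

    fds is a list of (lhs, rhs) pairs; each side is an iterable of
    attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def implies(fds, lhs, rhs):
    """Return True if lhs -> rhs follows from fds."""
    return set(rhs) <= closure(lhs, fds)

fds = [("X", "Y"), ("YZ", "W")]
xz_determines_w = implies(fds, "XZ", "W")
x_determines_w = implies(fds, "X", "W")
print(xz_determines_w, x_determines_w)
```

The first check agrees with the proof (XZ → W follows), and the second shows that X alone is not enough, since Z is never derived from X.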
Lossless Joins
Original table:
Professor | Dept | Admin
Marcus | CIS | Mari
Gardner | CIS | Mari
Gerbner | COM | Andy
Two candidate decompositions are shown; the slide labels one “Good!” and the other “… bad!”:
– (Professor, Dept) and (Dept, Admin)
– (Professor, Admin) and (Dept, Admin)
Professor | Dept
Marcus | CIS
Gardner | CIS
Gerbner | COM
Professor | Admin
Marcus | Mari
Gardner | Mari
Gerbner | Andy
Dept | Admin
CIS | Mari
COM | Andy
Lossless Joins
●
When can we turn one relation R into R1 and R2?
– When the join between them is lossless (reconstructs the original table)
●
Theorem (lossless join decomposition):
– (R1, R2) is a lossless decomposition of R iff
– R1 ∩ R2 → R2 (or, symmetrically, R1 ∩ R2 → R1)
Lossless Joins
Theorem (lossless join decomposition):
– Given R1 ∪ R2 = R, when does R1 ⋈ R2 = R?
– (R1, R2) is a lossless decomposition of R iff
– R1 ∩ R2 → R2 (or, symmetrically, R1 ∩ R2 → R1)
R = (Prof, Dept, Admin)
R1 = (Prof, Dept)
R2 = (Dept, Admin)
R1 ∩ R2 = (Dept)
Does Dept → Dept, Admin?
Yes: Dept → Dept by reflexivity and Dept → Admin holds for this table, so Dept → Dept, Admin by union
Lossless Joins
Theorem (lossless join decomposition):
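– Given R1 ∪ R2 = R, when does R1 ⋈ R2 = R?
– (R1, R2) is a lossless decomposition of R iff
– R1 ∩ R2 → R2 (or, symmetrically, R1 ∩ R2 → R1)
R = (Prof, Prof_Sal, Dept, Admin)
R1 = (Prof, Dept)
R2 = (Dept, Admin, Prof_Sal)
R1 ∩ R2 = (Dept)
Does Dept → Dept, Admin, Prof_Sal?
No: Dept does not determine Prof_Sal. Dept → Prof, Dept fails too, so this decomposition is lossy.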
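The two examples above can be replayed on data. The sketch below projects a table onto R1 and R2, natural-joins the projections, and compares the result with the original. It is only an instance-level check on one table, not a proof of the theorem, and the helper names are hypothetical. The Prof_Sal values (1, 2, 3) are made-up placeholders; the other values come from the Professor / Dept / Admin table.

```python
def project(rows, attrs):
    """Project rows (dicts) onto attrs, dropping duplicate rows."""
    return {tuple((a, row[a]) for a in attrs) for row in rows}

def natural_join(r1, r2):
    """Natural join of two projected relations (sets of (attr, value) tuples)."""
    out = set()
    for t1 in r1:
        for t2 in r2:
            d1, d2 = dict(t1), dict(t2)
            # Rows combine only if they agree on every shared attribute.
            if all(d1[a] == d2[a] for a in d1.keys() & d2.keys()):
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

def is_lossless(rows, attrs1, attrs2):
    """True if joining the two projections gives back exactly the original rows."""
    original = {tuple(sorted(row.items())) for row in rows}
    return natural_join(project(rows, attrs1), project(rows, attrs2)) == original

rows = [
    {"Prof": "Marcus", "Prof_Sal": 1, "Dept": "CIS", "Admin": "Mari"},
    {"Prof": "Gardner", "Prof_Sal": 2, "Dept": "CIS", "Admin": "Mari"},
    {"Prof": "Gerbner", "Prof_Sal": 3, "Dept": "COM", "Admin": "Andy"},
]
# First example: R = (Prof, Dept, Admin), so drop the salary column.
small = [{k: v for k, v in r.items() if k != "Prof_Sal"} for r in rows]

good = is_lossless(small, ["Prof", "Dept"], ["Dept", "Admin"])
bad = is_lossless(rows, ["Prof", "Dept"], ["Dept", "Admin", "Prof_Sal"])
print(good, bad)
```

In the second case the join pairs each CIS professor with both CIS salaries, producing rows that were never in the original table.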
Keys
●
Now that we know when a decomposition is lossless, we need to figure out which decomposition to pick!
●
Next, we need to understand keys.
Keys
●
Suppose R = {A, B, C, D, E}
– AB → CE, CD → BE
●
Any set of attributes that determines R is a superkey
●
If a superkey is minimal, we call it a candidate key
●
We pick one candidate key to be our primary key
●
An attribute that is part of any candidate key is a prime attribute
Example:
– {ABCDE} and {ABCD} are super keys
– {ABD} and {ACD} are candidate keys
– {ABD} can be picked as primary key
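The candidate keys in this example can be found by brute force. The sketch below tries attribute sets from smallest to largest and keeps those whose closure is all of R, skipping any set that already contains a smaller key. Attribute closure and the function names are assumptions introduced for illustration (the slides do not define them), and the search is exponential, so it only suits small examples like this one.

```python
from itertools import combinations

def closure(attrs, fds):
    """Return every attribute determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(relation, fds):
    """Return all minimal attribute sets that determine the whole relation."""
    keys = []
    for size in range(1, len(relation) + 1):
        for combo in combinations(sorted(relation), size):
            candidate = set(combo)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key, so it is not minimal
            if closure(candidate, fds) == set(relation):
                keys.append(candidate)
    return keys

R = "ABCDE"
fds = [("AB", "CE"), ("CD", "BE")]

keys = candidate_keys(R, fds)
prime = set().union(*keys)     # attributes that appear in some candidate key
non_prime = set(R) - prime
print(keys, prime, non_prime)
```

For this R the search finds {A, B, D} and {A, C, D}, so A, B, C, and D are prime and E is the only non-prime attribute.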
Second Normal Form
●
R is in 2NF iff: R is in 1NF and no non-prime attribute is determined by any proper subset of a candidate key.
R = {A, B, C, D, E}
AB → CE, CD → BE
{ABD} and {ACD} are candidate keys
Violates 2NF! AB → E
Decompose on the violating FD:
R1 = R – RHS of FD
R1 = {A, B, C, D}
AB → C, CD → B
CKs: {ABD} {ACD}
R2 = LHS + RHS of FD
R2 = {A, B, E}
AB → E
CKs: {AB}
Lossless, since R1 ∩ R2 → R2
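The decomposition rule used here (R1 = R minus the RHS of the FD, R2 = LHS plus RHS of the FD) is small enough to write down directly. This is a minimal sketch; `decompose` is a hypothetical helper name, and single-letter attributes are written as strings so that "AB" stands for {A, B}.

```python
def decompose(relation, lhs, rhs):
    """Split a relation on the FD lhs -> rhs.

    R1 keeps everything except the right-hand side of the FD,
    R2 holds the left-hand side together with the right-hand side.
    """
    rhs_only = set(rhs) - set(lhs)   # never drop attributes that are also in lhs
    r1 = set(relation) - rhs_only
    r2 = set(lhs) | set(rhs)
    return r1, r2

# The 2NF example: R = {A, B, C, D, E}, violating FD AB -> E.
r1, r2 = decompose("ABCDE", "AB", "E")
shared = r1 & r2
print(sorted(r1), sorted(r2), sorted(shared))
```

The shared attributes are exactly the LHS of the FD, and the LHS determines all of R2, which is why this style of decomposition passes the lossless join test.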
Third Normal Form
●
R is in 1NF if no value is itself a table
●
R is in 2NF if:
– R is in 1NF AND
– No non-prime attribute is determined by any proper subset of a candidate key.
●
R is in 3NF if:
– R is in 2NF AND
– No non-prime attribute determines a different non-prime
attribute
Third Normal Form
No non-prime attribute determines a different non-prime attribute
Professor | Dept | Admin
Marcus | CIS | Mari
Gardner | CIS | Mari
Gerbner | COM | Andy
P → D, D → A
Candidate keys: {P}
Decompose on D → A:
R1 = R – RHS of FD
R1 = {Prof, Dept}
R2 = LHS + RHS of FD
R2 = {Dept, Admin}
Third Normal Form
Every non-prime attribute is determined by…
●
The keys (by definition)
●
The whole keys (2NF)
●
… and nothing but the keys (3NF)
Edgar F. Codd
The 3NF Algorithm
Repeat until all relations are in 3NF:
1) Pick a relation we haven’t checked yet.
2) Ensure the relation is in 1NF.
3) Write down the FDs
4) Identify the CKs
5) Check for 2NF violations
If any, decompose on a violating FD, go to (1)
6) Check for 3NF violations
If any, decompose on a violating FD, go to (1)
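Steps 5 and 6 can be sketched as a single check over the FDs written down in step 3. The sketch follows the definitions used on these slides: a 2NF violation is a non-prime attribute determined by a proper subset of a candidate key, and a 3NF violation is a non-prime attribute determined by other non-prime attributes. It is a simplification in that it only inspects the FDs it is given, not every FD they imply, and `violations` is a hypothetical helper name.

```python
def violations(fds, candidate_keys):
    """List (form, lhs, attribute) for each 2NF or 3NF violation among fds."""
    prime = set().union(*map(set, candidate_keys))
    found = []
    for lhs, rhs in fds:
        lhs = set(lhs)
        # Only non-prime attributes outside lhs can cause a violation.
        for attr in sorted(set(rhs) - lhs - prime):
            if any(lhs < set(key) for key in candidate_keys):
                found.append(("2NF", tuple(sorted(lhs)), attr))
            elif not (lhs & prime):
                found.append(("3NF", tuple(sorted(lhs)), attr))
    return found

# R = {A, B, C, D, E} with AB -> CE, CD -> BE and candidate keys ABD, ACD.
abstract = violations([("AB", "CE"), ("CD", "BE")], ["ABD", "ACD"])

# Professor table with P -> D, D -> A and candidate key {P}.
professor = violations([("P", "D"), ("D", "A")], ["P"])
print(abstract)
print(professor)
```

On the first relation both FDs report E as determined by part of a key (2NF violations); on the Professor table only D → A is reported, as a 3NF violation.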
ActorID | ActorName | ActorDOB | CountryName | CountryContinent | Movies
1 | Jolie, Angeli | June 4th, 1975 | US | NA | [{“role”: “Lara Croft”, “movieID”: 1, “movieName”: “Tomb Raider”}, …]
2 | Li, Jet | April 26th, 1963 | CN | AS | [{“role”: “Huo”, “movieID”: 2, “movieName”: “Fearless”}, {“role”: “Liu”, “movieID”: 3, “movieName”: “Kiss of the Dragon”}, …]
Step 1: Not in 1NF! Table-in-a-table.
ActorID | ActorName | ActorDOB | CountryName | CountryCont | Role | MovieID | MovieName
1 | Jolie, Angeli | June 4th, 1975 | US | NA | Lara | 1 | Tomb Raider
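Getting from the table-in-a-table to the flat table above amounts to repeating the outer columns once per nested row. Here is a minimal sketch in plain Python; `flatten` is a hypothetical helper, the lowercase keys inside the nested lists are kept as they appear in the slide, and only the movies actually listed on the slide are included (the elided "…" entries are left out).

```python
nested = [
    {"ActorID": 1, "ActorName": "Jolie, Angeli", "ActorDOB": "June 4th, 1975",
     "CountryName": "US", "CountryCont": "NA",
     "Movies": [{"role": "Lara Croft", "movieID": 1, "movieName": "Tomb Raider"}]},
    {"ActorID": 2, "ActorName": "Li, Jet", "ActorDOB": "April 26th, 1963",
     "CountryName": "CN", "CountryCont": "AS",
     "Movies": [{"role": "Huo", "movieID": 2, "movieName": "Fearless"},
                {"role": "Liu", "movieID": 3, "movieName": "Kiss of the Dragon"}]},
]

def flatten(rows, nested_col):
    """Emit one flat row per nested entry, copying the outer columns each time."""
    flat = []
    for row in rows:
        outer = {k: v for k, v in row.items() if k != nested_col}
        for inner in row[nested_col]:
            flat.append({**outer, **inner})
    return flat

flat = flatten(nested, "Movies")
for row in flat:
    print(row)
```

The result is in 1NF because every value is atomic, but the actor and country columns are now repeated once per movie, which is the redundancy the later decomposition steps remove.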
Decompose the flattened table:
R2 = LHS + RHS of FD
R2 = {ActorID, ActorName, ActorDOB, CountryName, CountryCont}
R1 = {ActorID, Role, MovieID, MovieName}

Next, R1:
Step 1: Check for 1NF … generally always good
Decompose on MovieID → MovieName:
R4 = LHS + RHS of FD
R4 = {MovieID, MovieName}
R3 = {ActorID, Role, MovieID}

Next, R2:
Step 1: 1NF … generally good
Step 2: Write down our FDs:
ActorID → ActorName, ActorDOB, CountryName, CountryCont
CountryName → CountryCont
Step 3: Find candidate keys (minimal keys)
CKs: {ActorID}
Step 4: Check for 2NF
Are any non-prime attributes determined by a subset of a candidate key?
None found!
Step 5: Decompose with the violating FD (none)!
Step 6: Check for 3NF
Are any non-prime attributes determined by another non-prime attribute?
Violation! CountryName → CountryCont
Step 7: Decompose
R5 = R2 – RHS of FD
R5 = {ActorID, ActorName, ActorDOB, CountryName}
R6 = LHS + RHS of FD
R6 = {CountryName, CountryCont}

Remaining relations:
R3 = {ActorID, Role, MovieID} (1NF + no non-trivial FDs, must be in 3NF already)
R4 = {MovieID, MovieName}
R5 = {ActorID, ActorName, ActorDOB, CountryName}
R6 = {CountryName, CountryCont}

Final schema:
{ActorID, Role, MovieID}
{MovieID, MovieName}
{ActorID, ActorName, ActorDOB, CountryName}
{CountryName, CountryCont}
[Diagram: the final relations shown alongside the ER diagram (relationship with attribute role; Born, with degree labels m and 1; Country with name and continent)]
Textbook chapters with
alternative treatments of FDs
and ER diagrams
Zack’s lecture on ER
diagrams (extended)