Understanding Functional Dependency in DBMS
Functional Dependency:
They improve data integrity by ensuring that data is consistent and accurate
across the database.
They facilitate database maintenance by making it easier to modify, update, and
delete data.
Armstrong’s axioms/properties of functional dependencies:
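Armstrong's axioms are reflexivity (if Y is a subset of X, then X -> Y), augmentation (if X -> Y, then XZ -> YZ), and transitivity (if X -> Y and Y -> Z, then X -> Z). A minimal sketch of how they are applied in practice is the attribute closure below, which collects everything a set of attributes determines. The two dependencies on the STUDENT columns are assumed purely for illustration; they are not stated in these notes.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

# A functional dependency X -> Y, stored as the pair (X, Y).
FD = Tuple[FrozenSet[str], FrozenSet[str]]


def closure(attrs: Iterable[str], fds: List[FD]) -> Set[str]:
    """Return every attribute determined by `attrs` under `fds`."""
    result = set(attrs)  # reflexivity: the attributes determine themselves
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Once the whole left side is determined, transitivity lets us
            # add the right side as well.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# Hypothetical dependencies, assumed for illustration only.
student_fds: List[FD] = [
    (frozenset({"STUD_NO"}), frozenset({"STUD_NAME", "STUD_STATE"})),
    (frozenset({"STUD_STATE"}), frozenset({"STUD_COUNTRY"})),
]

stud_no_closure = closure({"STUD_NO"}, student_fds)
print(sorted(stud_no_closure))
```

Under these assumed dependencies, STUD_NO reaches STUD_COUNTRY only through STUD_STATE, which is the transitivity axiom at work.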
STUDENT Table
STUD_NO  STUD_NAME  STUD_PHONE  STUD_STATE  STUD_COUNTRY  STUD_AGE
1 C1 DBMS
2 C2 Computer Networks
1 C2 Computer Networks
Table 2
Insertion Anomaly: If a tuple is inserted into the referencing relation and its
referencing attribute value is not present in the referenced attribute, the
insertion is rejected.
Example: If we try to insert a record into STUDENT_COURSE with STUD_NO = 7, the
insertion is not allowed.
Deletion and Update Anomaly: If a tuple is deleted from or updated in the referenced
relation while its referenced attribute value is still used by the referencing
relation, the deletion or update is rejected.
Example: If we want to update a record in STUDENT_COURSE with STUD_NO = 1, we
have to update it in both rows of the table. If we try to delete the record with
STUD_NO = 1 from STUDENT, the deletion is not allowed.
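The two rejections described above can be reproduced with a small sketch using Python's built-in sqlite3 module. The column names COURSE_NO and COURSE_NAME and the student names are assumptions made for illustration; only STUD_NO and the three course rows come from the notes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE STUDENT (STUD_NO INTEGER PRIMARY KEY, STUD_NAME TEXT)")
conn.execute(
    """CREATE TABLE STUDENT_COURSE (
           STUD_NO     INTEGER REFERENCES STUDENT(STUD_NO),
           COURSE_NO   TEXT,
           COURSE_NAME TEXT)"""
)
# Hypothetical students; the course rows follow Table 2.
conn.executemany("INSERT INTO STUDENT VALUES (?, ?)", [(1, "Student A"), (2, "Student B")])
conn.executemany(
    "INSERT INTO STUDENT_COURSE VALUES (?, ?, ?)",
    [(1, "C1", "DBMS"), (2, "C2", "Computer Networks"), (1, "C2", "Computer Networks")],
)


def try_sql(statement: str) -> str:
    """Run a statement and report whether the foreign key allowed it."""
    try:
        conn.execute(statement)
        return "ok"
    except sqlite3.IntegrityError:
        return "rejected"


# STUD_NO 7 does not exist in STUDENT, so the insert is refused.
insert_result = try_sql("INSERT INTO STUDENT_COURSE VALUES (7, 'C1', 'DBMS')")
# STUD_NO 1 is still referenced by STUDENT_COURSE, so the delete is refused.
delete_result = try_sql("DELETE FROM STUDENT WHERE STUD_NO = 1")
print(insert_result, delete_result)
```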
To avoid this, the following options can be declared on the foreign key:
ON DELETE/UPDATE SET NULL: If a tuple is deleted or updated in the referenced
relation and its referenced attribute value is used by the referencing attribute
in the referencing relation, the tuple is deleted/updated in the referenced relation
and the value of the referencing attribute is set to NULL.
ON DELETE/UPDATE CASCADE: If a tuple is deleted or updated in the referenced
relation and its referenced attribute value is used by the referencing attribute
in the referencing relation, the tuple is deleted/updated in the referenced relation
and in the referencing relation as well.
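As a rough sketch of the difference between the two options, the helper below deletes student 1 under each ON DELETE action and returns what remains in STUDENT_COURSE. It uses Python's sqlite3 module; the COURSE_NO column name and the helper name rows_after_delete are assumptions for illustration.

```python
import sqlite3


def rows_after_delete(action: str) -> list:
    """Delete student 1 under the given ON DELETE action and return
    the rows left in STUDENT_COURSE."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce the rule
    conn.execute("CREATE TABLE STUDENT (STUD_NO INTEGER PRIMARY KEY)")
    # `action` is a fixed keyword chosen below, never user input.
    conn.execute(
        "CREATE TABLE STUDENT_COURSE ("
        " STUD_NO INTEGER REFERENCES STUDENT(STUD_NO) ON DELETE " + action + ","
        " COURSE_NO TEXT)"
    )
    conn.executemany("INSERT INTO STUDENT VALUES (?)", [(1,), (2,)])
    conn.executemany(
        "INSERT INTO STUDENT_COURSE VALUES (?, ?)",
        [(1, "C1"), (2, "C2"), (1, "C2")],
    )
    conn.execute("DELETE FROM STUDENT WHERE STUD_NO = 1")
    rows = conn.execute("SELECT STUD_NO, COURSE_NO FROM STUDENT_COURSE").fetchall()
    conn.close()
    return rows


cascade_rows = rows_after_delete("CASCADE")    # referencing rows are deleted too
set_null_rows = rows_after_delete("SET NULL")  # referencing rows stay, STUD_NO becomes NULL
print(cascade_rows)
print(set_null_rows)
```

With CASCADE only student 2's row survives; with SET NULL all three rows survive, but the two that pointed at student 1 now hold NULL in STUD_NO.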
How Do These Anomalies Occur?
Insertion Anomalies: These anomalies occur when it is not possible to insert data
into a database because the required fields are missing or because the data is
incomplete. For example, if a database requires that every record has a primary key,
but no value is provided for a particular record, it cannot be inserted into the database.
Deletion anomalies: These anomalies occur when deleting a record from a database
and can result in the unintentional loss of data. For example, if a database contains
information about customers and orders, deleting a customer record may also delete
all the orders associated with that customer.
Update anomalies: These anomalies occur when modifying data in a database and
can result in inconsistencies or errors. For example, if a database contains information
about employees and their salaries, updating an employee’s salary in one record but
not in all related records could lead to incorrect calculations and reporting.
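The update anomaly in particular is easy to reproduce. The sketch below uses a hypothetical denormalized list in which an employee's salary is repeated on every row; the employee and project identifiers and the salary figures are invented for illustration.

```python
# Hypothetical denormalized data: the salary is repeated on every project row.
assignments = [
    {"employee": "E1", "project": "P1", "salary": 50000},
    {"employee": "E1", "project": "P2", "salary": 50000},
    {"employee": "E2", "project": "P1", "salary": 60000},
]

# Update anomaly: the raise is applied to only one of E1's two rows.
assignments[0]["salary"] = 55000


def inconsistent_employees(rows):
    """Return the employees recorded with more than one distinct salary."""
    salaries = {}
    for row in rows:
        salaries.setdefault(row["employee"], set()).add(row["salary"])
    return sorted(emp for emp, values in salaries.items() if len(values) > 1)


conflicts = inconsistent_employees(assignments)
print(conflicts)
```

Storing each salary once, in a table keyed by employee, would leave only one place to update and make this kind of conflict impossible.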
Removal of Anomalies
According to E. F. Codd, the inventor of the relational database model, the goals of
normalization include:
It helps in removing all the repeated data from the database.
It helps in removing undesirable deletion, insertion, and update anomalies.
It helps in making a proper and useful relationship between tables.
If a table is not properly normalized and has data redundancy, it will not only take up
extra data storage space but also make it difficult to handle and update the database.
There are several factors that drive the need for normalization, from data
redundancy (as covered above) to difficulty managing relationships. Let’s get right
into it:
Insertion, deletion, and update anomalies: Any form of change in a table can lead to
errors or inconsistencies in other tables if not handled carefully. These changes can either
be adding new data to a database, updating the data, or deleting records, which can lead to
unintended loss of data.
Difficulty in managing relationships: It becomes more challenging to maintain complex
relationships in an unnormalized structure.
Other factors that drive the need for normalization are partial
dependencies and transitive dependencies, in which partial dependencies can lead to
data redundancy and update anomalies, and transitive dependencies can lead to data
anomalies. We will be looking at how these dependencies can be dealt with to ensure
database normalization in the coming sections.
Different Types of Database Normalization
Database normalization comes in different forms, each with increasing levels of data
organization.
First Normal Form (1NF)
This normalization level ensures that each column in your data contains only atomic
values. Atomic values in this context means that each entry in a column is indivisible.
It is like saying that each cell in a spreadsheet should hold just one piece of
information. 1NF ensures atomicity of data, with each column cell containing only a
single value and each column having unique names.
Second Normal Form (2NF)
Eliminates partial dependencies by ensuring that every non-key attribute depends on
the whole primary key rather than on only part of it. What this means, in essence, is
that each non-key column should relate directly to the entire primary key, and not
to a subset of it.
Third Normal Form (3NF)
Removes transitive dependencies by ensuring that non-key attributes depend only on
the primary key and not on other non-key attributes. This level of normalization
builds on 2NF.
Boyce-Codd Normal Form (BCNF)
This is a stricter version of 3NF that addresses additional anomalies. At this
normalization level, every determinant is a candidate key.
Fourth Normal Form (4NF)
This is a normalization level that builds on BCNF by dealing with multi-valued
dependencies.
Fifth Normal Form (5NF)
5NF addresses join dependencies and is the highest normalization level covered here.
It is used in specific scenarios to further minimize redundancy by breaking a table
into smaller tables.
We have already highlighted all the data normalization levels. Let’s further explore
each of them in more depth with examples and explanations.
First Normal Form (1NF) Normalization
1NF ensures that each column cell contains only atomic values. Imagine a library
database with a table storing book information (title, author, genre, and borrowed_by).
If the table is not normalized, borrowed_by could contain a list of borrower names
separated by commas. This violates 1NF, as a single cell holds multiple values. The
table below is a good representation of a table that violates 1NF, as described earlier.
title                  author            genre    borrowed_by
The Lord of the Rings  J. R. R. Tolkien  Fantasy  Emily Garcia, David Lee
The solution?
In 1NF, we create a separate table for borrowers and link them to the book table.
These tables can either be linked using the foreign key in the borrower table or a
separate linking table. The foreign key in the borrowers table approach involves
adding a foreign key column to the borrowers table that references the primary key of
the books table. This will enforce a relationship between the tables, ensuring data
consistency.
You can find a representation of this below:
Books table
book_id (PK)  title                  author            genre
2             The Lord of the Rings  J. R. R. Tolkien  Fantasy
Borrowers table
1 John Doe 1
2 Jane Doe 1
3 James Brown 1
4 Emily Garcia 2
5 David Lee 2
6 Michael Chen 3
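The move from a comma-separated borrowed_by cell to one borrower per row can be sketched in a few lines of Python. The function name to_atomic_rows is hypothetical; the sample row is the one that violates 1NF in the example above.

```python
# One row that violates 1NF: borrowed_by holds a comma-separated list.
unnormalized = [
    {
        "title": "The Lord of the Rings",
        "author": "J. R. R. Tolkien",
        "genre": "Fantasy",
        "borrowed_by": "Emily Garcia, David Lee",
    },
]


def to_atomic_rows(rows):
    """Emit one (title, borrower) pair per borrower so every value is atomic."""
    pairs = []
    for row in rows:
        for name in row["borrowed_by"].split(","):
            name = name.strip()
            if name:  # skip empty fragments such as a trailing comma
                pairs.append((row["title"], name))
    return pairs


borrower_rows = to_atomic_rows(unnormalized)
print(borrower_rows)
```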
book_id (PK)  title                  author            genre    borrower_id (FK)
1             To Kill a Mockingbird  Harper Lee        Fiction  1
2             The Lord of the Rings  J. R. R. Tolkien  Fantasy  NULL
This might look like a solution, but it violates 2NF simply because the borrower_id
only partially depends on the book_id. A book can have multiple borrowers, but each
book row can hold only one borrower_id in this structure. This creates a
partial dependency.
The solution?
We need to achieve the many-to-many relationship between books and borrowers to
achieve 2NF. This can be done by introducing a separate table:
Book_borrowings table
borrowing_id (PK)  book_id (FK)  borrower_id (FK)  borrowed_date
1                  1             1                 2024-05-04
2                  2             4                 2024-05-04
3                  3             6                 2024-05-04
This table establishes a clear relationship between books and borrowers. The book_id
and borrower_id act as foreign keys, referencing the primary keys in their respective
tables. This approach ensures that borrower_id depends on the entire primary key
(book_id) of the books table, complying with 2NF.
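A minimal sketch of this linking-table design, using Python's sqlite3 module, is shown below. The name column in borrowers is an assumption, and the third borrowing row is a hypothetical addition so that one book has two borrowers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE books (book_id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE borrowers (borrower_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    """CREATE TABLE book_borrowings (
           borrowing_id  INTEGER PRIMARY KEY,
           book_id       INTEGER NOT NULL REFERENCES books(book_id),
           borrower_id   INTEGER NOT NULL REFERENCES borrowers(borrower_id),
           borrowed_date TEXT)"""
)
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [(1, "To Kill a Mockingbird"), (2, "The Lord of the Rings")],
)
conn.executemany(
    "INSERT INTO borrowers VALUES (?, ?)",
    [(1, "John Doe"), (4, "Emily Garcia"), (5, "David Lee")],
)
conn.executemany(
    "INSERT INTO book_borrowings VALUES (?, ?, ?, ?)",
    [
        (1, 1, 1, "2024-05-04"),
        (2, 2, 4, "2024-05-04"),
        (3, 2, 5, "2024-05-04"),  # hypothetical row: a second borrower for book 2
    ],
)

# Every borrower of book 2, resolved through the linking table.
lotr_borrowers = [
    name
    for (name,) in conn.execute(
        """SELECT br.name
           FROM book_borrowings AS bb
           JOIN borrowers AS br ON br.borrower_id = bb.borrower_id
           WHERE bb.book_id = 2
           ORDER BY br.name"""
    )
]
print(lotr_borrowers)
```

Because each borrowing is its own row, a book can appear with any number of borrowers and a borrower with any number of books.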
Third Normal Form (3NF)
3NF builds on 2NF by eliminating transitive dependencies. A transitive dependency
occurs when a non-key attribute depends on another non-key attribute, which in turn
depends on the primary key. The name comes from the transitive law: if A determines
B and B determines C, then A determines C.
From the 2NF we already implemented, there are three tables in our library database:
Books table
book_id (PK)  title                                  author            genre
2             The Lord of the Rings                  J. R. R. Tolkien  Fantasy
3             Harry Potter and the Sorcerer’s Stone  J.K. Rowling      Fantasy
Borrowers table
1 John Doe 1
2 Jane Doe 1
3 James Brown 1
4 Emily Garcia 2
5 David Lee 2
6 Michael Chen 3
Book_borrowings table
borrowing_id (PK)  book_id (FK)  borrower_id (FK)  borrowed_date
1                  1             1                 2024-05-04
2                  2             4                 2024-05-04
3                  3             6                 2024-05-04
The 2NF structure looks efficient, but there might be a hidden dependency. Imagine
we add a due_date column to the books table. This might seem logical at first sight,
but it’s going to create a transitive dependency where:
The due_date column depends on the borrowing_id (a non-key attribute) from the
book_borrowings table.
The borrowing_id in turn depends on book_id (the primary key) of the books
table.
The implication of this is that due_date relies on an intermediate non-key attribute
(borrowing_id) instead of directly depending on the primary key (book_id). This
violates 3NF.
The solution?
We can move the due_date column to the most appropriate table by updating the
book_borrowings table to include the due_date and returned_date columns.
Below is the updated table:
borrowing_id (PK)  book_id (FK)  borrower_id (FK)  borrowed_date  due_date
1                  1             1                 2024-05-04     2024-05-20
2                  2             4                 2024-05-04     2024-05-18
3                  3             6                 2024-05-04     2024-05-10
Books table
book_id (PK)  title                  author            genre
1             To Kill a Mockingbird  Harper Lee        Fiction
2             The Lord of the Rings  J. R. R. Tolkien  Fantasy
Borrowers table
1 John Doe 1
2 Jane Doe 1
3 James Brown 1
4 Emily Garcia 2
5 David Lee 2
6 Michael Chen 3
Book_borrowings table
borrowing_id (PK)  book_id (FK)  borrower_id (FK)  borrowed_date  due_date
1                  1             1                 2024-05-04     2024-05-20
2                  2             4                 2024-05-04     2024-05-18
3                  3             6                 2024-05-04     2024-05-10
While the 3NF structure is good, there might be a hidden determinant in the
book_borrowings table. Assuming one borrower cannot borrow the same book twice
simultaneously, the combination of book_id and borrower_id together uniquely
identifies a borrowing record.
This structure violates BCNF since the combined set (book_id and borrower_id) is not
the primary key of the table (which is just borrowing_id).
The solution?
To achieve BCNF, we can either decompose the book_borrowings table into two
separate tables or make the combined attribute set the primary key.
Approach 1 (decompose the table): In this approach, we will be decomposing the
book_borrowings table into separate tables:
A table with borrowing_id as the primary key, borrowed_date, due_date, and
returned_date.
Another separate table to link books and borrowers, with book_id as a foreign key,
borrower_id as a foreign key, and potentially additional attributes specific to the
borrowing event.
Approach 2 (make the combined attribute set the primary key): We can consider making
book_id and borrower_id a composite primary key for uniquely identifying borrowing
records. The problem with this approach is that it won’t serve its purpose if a borrower
can borrow the same book multiple times.
In the end, your choice between these options depends on your specific data needs and
how you want to model borrowing relationships.
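The limitation of Approach 2 can be seen in a short sketch with Python's sqlite3 module. The table below uses (book_id, borrower_id) as a composite primary key; the second insert is a hypothetical repeat loan of the same book to the same borrower, with invented dates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE book_borrowings (
           book_id       INTEGER NOT NULL,
           borrower_id   INTEGER NOT NULL,
           borrowed_date TEXT,
           due_date      TEXT,
           returned_date TEXT,
           PRIMARY KEY (book_id, borrower_id))"""
)
conn.execute(
    "INSERT INTO book_borrowings VALUES (2, 4, '2024-05-04', '2024-05-18', NULL)"
)

try:
    # Hypothetical repeat loan: same book, same borrower, later date.
    conn.execute(
        "INSERT INTO book_borrowings VALUES (2, 4, '2024-06-01', '2024-06-15', NULL)"
    )
    second_loan = "accepted"
except sqlite3.IntegrityError:
    second_loan = "rejected"

print(second_loan)
```

The composite key treats the repeat loan as a duplicate, so this design only fits a library where a borrower never takes out the same book twice.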
Fourth Normal Form (4NF)
4NF deals with multi-valued dependencies. A multi-valued dependency exists when
one attribute determines a set of values of another attribute, and that set is
independent of the table's remaining attributes. It’s quite complex, but we will be
exploring it deeper using an example.
The library example we’ve been using throughout these explanations is not applicable
at this normalization level. 4NF typically applies to situations where a single attribute
might have multiple dependent attributes that don’t directly relate to the primary key.
Let’s use another scenario. Imagine a database that stores information about
publications. We will be considering a “Publications” table with the columns title,
author, publication_year, and keywords.
publication_id (PK)  title                  author            publication_year  keywords
2                    The Lord of the Rings  J. R. R. Tolkien  1954              Fantasy, Epic, Adventure
1 Coming-of-Age
1 Legal
2 Fantasy
2 Epic
2 Adventure
3 Romance
3 Social Commentary
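With one keyword per row, as in the two-column listing above, lookups in either direction become plain equality filters instead of string searches inside a comma-separated cell. The sketch below is illustrative only; the helper names keywords_for and publications_with are hypothetical.

```python
# (publication_id, keyword) pairs, taken from the rows listed above.
publication_keywords = [
    (1, "Coming-of-Age"), (1, "Legal"),
    (2, "Fantasy"), (2, "Epic"), (2, "Adventure"),
    (3, "Romance"), (3, "Social Commentary"),
]


def keywords_for(publication_id):
    """Return the keywords recorded for one publication, in stored order."""
    return [kw for pid, kw in publication_keywords if pid == publication_id]


def publications_with(keyword):
    """Return the ids of all publications tagged with exactly this keyword."""
    return sorted({pid for pid, kw in publication_keywords if kw == keyword})


print(keywords_for(2))
print(publications_with("Epic"))
```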
Courses table
course_id (PK)  course_name                  department
101             Introduction to Programming  Computer Science
301             Web Development I            Computer Science
401             Artificial Intelligence      Computer Science
Enrollments table
1 12345 101 A
2 12345 202 B
3 56789 301 A-
4 56789 401 B+
Assuming these tables are already in 3NF or 4NF, a join dependency might exist
depending on how data is stored. For instance, a course has a prerequisite requirement
stored within the “Courses” table as the “prerequisite_course_id” column.
This might seem efficient at first glance. However, consider a query that needs to
retrieve a student’s enrolled courses and their respective prerequisites. In this scenario,
you would need to join the “Courses” and “Enrollments” tables, then potentially join
the “Courses” table to retrieve prerequisite information.
The Solution?
To potentially eliminate the join dependency and achieve 5NF, we could introduce a
separate “Course Prerequisites” table:
Course_prerequisite table
course_id  prerequisite_course_id
202        101
301        NULL
401        202
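The query discussed above, a student's enrolled courses together with their prerequisites, can be sketched against the separate prerequisite table using Python's sqlite3 module. The Enrollments column names (enrollment_id, student_id, grade) are assumptions, since the notes show only the row values.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Column names for enrollments are assumed; the rows follow the Enrollments table.
conn.execute(
    """CREATE TABLE enrollments (
           enrollment_id INTEGER PRIMARY KEY,
           student_id    INTEGER,
           course_id     INTEGER,
           grade         TEXT)"""
)
conn.execute(
    "CREATE TABLE course_prerequisite (course_id INTEGER, prerequisite_course_id INTEGER)"
)
conn.executemany(
    "INSERT INTO enrollments VALUES (?, ?, ?, ?)",
    [(1, 12345, 101, "A"), (2, 12345, 202, "B"), (3, 56789, 301, "A-"), (4, 56789, 401, "B+")],
)
conn.executemany(
    "INSERT INTO course_prerequisite VALUES (?, ?)",
    [(202, 101), (301, None), (401, 202)],
)


def courses_with_prerequisites(student_id: int) -> list:
    """Return (course_id, prerequisite_course_id) for each enrolled course.
    LEFT JOIN keeps courses that have no row in course_prerequisite."""
    return conn.execute(
        """SELECT e.course_id, p.prerequisite_course_id
           FROM enrollments AS e
           LEFT JOIN course_prerequisite AS p ON p.course_id = e.course_id
           WHERE e.student_id = ?
           ORDER BY e.course_id""",
        (student_id,),
    ).fetchall()


first_student = courses_with_prerequisites(12345)
second_student = courses_with_prerequisites(56789)
print(first_student)
print(second_student)
```

Course 101 has no row in the prerequisite table, so it comes back with a NULL prerequisite, just as course 301 does with its explicit NULL.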
Simple structure: Decomposing a relation yields smaller, simpler sub-tables, which
can reduce complexity in some cases.
Less redundancy: Where some loss of information is acceptable, decomposition
can reduce redundancy.
Loss of data: When the tables are joined back together, some information may be
lost permanently, which can cause problems.
Inconsistency: As discussed above, information loss leads to data integrity
problems.
Hard to manage: Because some information may be lost, maintaining data consistency
becomes difficult, which makes the database harder to manage.