Understanding ACID Transactions in DBMS
Transaction Concept
A transaction is a set of operations that performs a logical unit of work. Transactions are used to change data in the database by inserting new data, modifying existing data, or deleting existing data. Every transaction must follow the ACID properties (Atomicity, Consistency, Isolation, Durability) to ensure data integrity.
Transaction States
The different stages a transaction goes through during its lifecycle are known as the transaction states. The stages are discussed below.
Active state: This is the very first state of a transaction. While the read-write operations
of the transaction are executing, the transaction is in the active state. If any operation fails,
the transaction goes to the failed state. If all operations succeed, the transaction
moves to the partially committed state. All the changes that are carried out in this stage
are stored in the buffer memory.
Syntax:
START TRANSACTION;
UPDATE accounts SET balance = balance - 500 WHERE account_no = 101;
UPDATE accounts SET balance = balance + 500 WHERE account_no = 102;
After executing these statements, the transaction moves to the Partially Committed state.
Committed state: Once all the operations are successfully executed and the transaction
is out of the partially committed state, all the changes become permanent in the database.
That is the Committed state. There’s no going back! The changes cannot be rolled back
and the transaction goes to the terminated state.
Syntax:
COMMIT;
Failed state: In case there is any failure in carrying out the instructions while the
transaction is in the active state, or there are any issues while saving the changes
permanently into the database (i.e. in the partially committed stage) then the transaction
enters the failed state.
Syntax:
UPDATE accounts SET balance = balance - 500 WHERE account_no = 101;
-- a failure (e.g., a system crash) occurs before the transaction can complete
Aborted state: If any of the checks fail and the transaction reaches the failed state, the
database recovery system rolls the transaction back, undoing all of its changes and
restoring the database to the consistent state it was in before the transaction started. The
transaction is then said to be aborted.
Syntax:
ROLLBACK;
Terminated state: If a transaction is aborted, there are two ways of recovering: one is
restarting the transaction, and the other is killing the transaction and freeing the system for
other transactions. The latter, where the transaction ends and releases its resources, is
known as the terminated state.
ACID properties
ACID properties are a set of properties that guarantee reliable processing of transactions in a
database management system (DBMS). Transactions are a sequence of database operations that
are executed as a single unit of work, and the ACID properties ensure that transactions are
processed reliably and consistently in a DBMS.
Atomicity
If a transaction has multiple steps, either all steps execute successfully or none do.
If a failure occurs at any step, all previous changes made by the transaction are undone
(rolled back).
No partial execution is allowed.
Example of Atomicity:
Transaction T1: Transferring $500 from Account A to Account B
Problem:
The money is deducted from Account A but not added to Account B due to failure.
Solution:
The DBMS will roll back the transaction and restore Account A's balance.
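As a concrete sketch using Python's built-in sqlite3 module (the accounts table and the simulated failure are illustrative, not part of any particular DBMS):

```python
import sqlite3

# In-memory database with an illustrative accounts table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (account_no INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (101, 1000), (102, 500)")
conn.commit()

try:
    # Transaction T1: transfer $500 from account 101 to account 102.
    conn.execute("UPDATE accounts SET balance = balance - 500 WHERE account_no = 101")
    # Simulate a failure after the debit but before the credit.
    raise RuntimeError("simulated crash before crediting account 102")
except RuntimeError:
    conn.rollback()  # atomicity: the partial debit is undone

balance_a = conn.execute(
    "SELECT balance FROM accounts WHERE account_no = 101").fetchone()[0]
print(balance_a)  # 1000 -- Account A's balance is restored
```

Because the transaction never committed, the rollback leaves no trace of the partial debit: no partial execution remains visible.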
Consistency
The database must always remain in a valid state before and after a transaction.
Data must follow all integrity constraints and business rules before and after execution.
No transaction should violate referential integrity (foreign keys, unique constraints, etc.).
A transaction should transform the database from one consistent state to another.
Example of Consistency:
Before Transaction:
o Account A = $1000
o Account B = $500
After Transaction:
o Account A = $800
o Account B = $700
If Account A is deducted but B is not credited, the total sum of money ($1300 instead of
$1500) becomes inconsistent.
Solution:
DBMS ensures that data follows integrity constraints and remains consistent before and
after transaction execution.
Isolation
Multiple transactions running concurrently should not interfere with each other.
One transaction should not read or modify uncommitted changes made by another
transaction.
Prevents issues like dirty reads, lost updates, and unrepeatable reads.
Example of Isolation:
Two transactions, T3 and T4, run concurrently: T3 updates a value and T4 reads it.
If Transaction T4 reads before T3 commits, it may see an inconsistent value (the Dirty Read
problem).
Solution:
The DBMS isolates the transactions (for example, through locking or isolation levels) so that
T4 reads only values that T3 has committed.
Durability
Once a transaction is committed, its changes must be permanent, even if the system
crashes.
The database system must store committed changes in non-volatile memory (like a hard
disk).
Ensures that data remains safe even in case of system failure.
Example of Durability:
Transaction T5 books a train ticket and commits.
If the committed change is not saved permanently, the ticket booking may be lost after a system
crash.
Solution:
The DBMS writes committed changes to stable storage (for example, a transaction log on disk)
so that the booking survives a crash.
Concurrent Execution of Transactions
Concurrent execution in a multi-user database system enables multiple users to access and
perform operations on the database at the same time. This improves efficiency but introduces
challenges such as lost updates, dirty reads, and unrepeatable reads. To maintain consistency,
the DBMS must ensure that even when transactions execute in an interleaved manner, no
operation affects another incorrectly. Effective concurrency control mechanisms, such as locking
protocols, timestamp ordering, and validation techniques, help manage these challenges and
preserve data integrity.
Several problems that arise when numerous transactions execute simultaneously in a random
manner are referred to as Concurrency Control Problems.
The dirty read problem in DBMS occurs when a transaction reads the data that has been updated
by another transaction that is still uncommitted. It arises due to multiple uncommitted
transactions executing simultaneously.
1. t1: T1 reads the value of a data item (1000).
2. t2: T1 updates the value to 1500 but does not yet commit.
3. t3: T2 reads the uncommitted value (1500) instead of 1000 → Dirty Read occurs.
4. t4: T1 fails and rolls back, restoring the value to 1000.
5. t5: T2 continues processing with the value it read.
6. t6: T2 has already used the incorrect value (1500) for further operations, leading to data
inconsistency.
The unrepeatable read problem occurs when two or more different values of the same data are
read during the read operations in the same transaction.
Transaction A and B initially read the value of DT as 1000. Transaction A modifies the value of
DT from 1000 to 1500 and then again transaction B reads the value and finds it to be 1500.
Transaction B finds two different values of DT in its two different read operations.
In the phantom read problem, the same data is read through two different read operations in the
same transaction. The first read operation finds the data, but the second read finds that the data
no longer exists (or finds rows that were not there before), because another transaction deleted
or inserted rows in between.
The Lost Update problem arises when two different transactions update the same data item and
one update overwrites the other, so one of the updates is lost.
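The lost update anomaly can be reproduced with a short, deterministic simulation (plain Python standing in for two interleaved transactions; the amounts are illustrative):

```python
# A minimal simulation of the lost update problem (illustrative, not a real DBMS).
# Two transactions each read a shared balance, compute a new value, then write it back.

balance = 1000  # shared data item

# Interleaved schedule: both transactions read BEFORE either writes.
t1_read = balance          # T1 reads 1000
t2_read = balance          # T2 also reads 1000 (T1 has not written yet)
balance = t1_read - 100    # T1 writes 900 (withdraw 100)
balance = t2_read + 200    # T2 writes 1200 -- T1's update is overwritten!

print(balance)  # 1200: T1's withdrawal is lost (a serial execution would give 1100)
```

Under any serial schedule the final balance would be 1100; the interleaving silently discards T1's write.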
Concurrency Control
Concurrency Control refers to the techniques used to manage concurrent transactions while
ensuring correctness, isolation, and serializability.
It ensures that transactions execute safely without interfering with each other.
Key goals of concurrency control:
1. Preserving isolation between concurrent transactions.
2. Ensuring serializability:
o Conflict Serializability: Conflicting operations must appear in the same order as in
some serial execution.
o View Serializability: The final result of concurrent execution should match that
of some serial execution.
Serializability
1. Serial Schedule
A schedule in which only one transaction is executed at a time, i.e., one transaction is executed
completely before starting another transaction.
Example:
Transaction-2 starts its execution only after the completion of Transaction-1.
Serial schedules are always serializable because the transactions only work one after the other.
Also, for n transactions, there are n! possible serial schedules.
2. Non-serial Schedule
A schedule in which the transactions are interleaving or interchanging. There are several
transactions executing simultaneously as they are being used in performing real-world database
operations. These transactions may be working on the same piece of data. Hence, the
serializability of non-serial schedules is a major concern so that our database is consistent before
and after the execution of the transactions.
Example:
We can see that Transaction-2 starts its execution before the completion of Transaction-1, and
the two work interchangeably on the same data items, "a" and "b". Such a schedule is
serializable if it can be converted into an equivalent serial schedule (a Serializable Schedule).
Serializability of any non-serial schedule can be verified using two types mainly: Conflict
Serializability and View Serializability.
One more way to check serializability is to form an equivalent serial schedule that produces the
same result as the original non-serial schedule. Since this approach only compares the final
output rather than the operations taking place between transaction switches, it is not used in
practice. Now let's discuss Conflict and View Serializability in detail.
A non-serial schedule is conflict serializable if it can be converted into a serial schedule by
swapping its adjacent non-conflicting operations. The check compares the non-serial schedule
with an equivalent serial schedule; this process of checking is called Conflict Serializability in
DBMS.
It is tedious to use if we have many operations and transactions as it requires a lot of swapping.
For checking, we can use the Precedence Graph technique (covered under Testing of
Serializability below). First, find the conflicting pairs of operations (read-write, write-read, and
write-write) and form directed edges between the transactions involved in each conflicting pair.
If the graph contains a cycle, the schedule is not conflict serializable; otherwise, it is conflict
serializable.
Conflicting Operations
Operations on different data items never conflict. Operations on the same data item may or may
not conflict, according to these four cases:
1. Read(A) in T1, Read(A) in T2 — non-conflicting (swapping allowed).
2. Read(A) in T1, Write(A) in T2 — conflicting.
3. Write(A) in T1, Read(A) in T2 — conflicting.
4. Write(A) in T1, Write(A) in T2 — conflicting.
A step-by-step swapping process over the non-conflicting pairs converts a non-serial schedule
(S1) into an equivalent serial schedule (S2), demonstrating conflict serializability.
If a non-serial schedule is view equivalent to some other serial schedule then the schedule is
called View Serializable Schedule. It is needed to ensure the consistency of a schedule.
The conditions needed for two schedules (S1 and S2) to be view equivalent are:
1. Initial read: If transaction t1 reads "A" from the database in schedule S1, then in schedule
S2, t1 must also perform the initial read of A.
2. Updated read: If t1 reads a value of A that was written by t2 in S1, then t1 must read the
value written by t2 in S2 as well.
3. Final write: If t1 performed the final write on A in S1, then in S2, t1 should perform the
final write as well.
Example 1: View equivalent schedules
Schedule S1:
DataItem A — Initial read: T1, Updated read: T2, Final write: T2
DataItem B — Initial read: T1, Updated read: T2, Final write: T2
Schedule S2:
DataItem A — Initial read: T1, Updated read: T2, Final write: T2
DataItem B — Initial read: T1, Updated read: T2, Final write: T2
Since the initial reads, updated reads, and final writes of every data item match, S1 and S2 are
view equivalent.

Example 2: Schedules that are not view equivalent
Schedule S1: DataItem Q — Initial read: T1, Final write: T1
Schedule S2: DataItem Q — Initial read: T1, Final write: T2
The final write on Q is performed by T1 in S1 but by T2 in S2, so these schedules are not view
equivalent.
Testing of Serializability
To test the serializability of a schedule, we can use a Serialization Graph (also called a
Precedence Graph). A serialization graph is a directed graph over all the transactions of a
schedule.
It can be defined as a graph G(V, E) with a set of vertices V = {T1, T2, ..., Tn}, one per
transaction, and a set of directed edges E, where an edge Ti → Tj is added whenever an operation
of Ti conflicts with, and occurs before, an operation of Tj.
If there is a cycle in the serialization graph, the schedule is non-serializable, because the cycle
indicates that one transaction is dependent on another and vice versa. It also means that there
are one or more conflicting pairs of operations between those transactions. On the other hand,
if there is no cycle, the non-serial schedule is serializable.
Two operations inside a schedule are called conflicting if they meet these three conditions:
1. They belong to different transactions.
2. They operate on the same data item.
3. At least one of them is a write operation.
To conclude, let’s take two operations on data item "a". The conflicting pairs are:
1. READ(a) - WRITE(a)
2. WRITE(a) - WRITE(a)
3. WRITE(a) - READ(a)
Example 1:
The precedence graph for schedule S1 contains a cycle, that's why Schedule S1 is non-
serializable.
Example 2:
The precedence graph for schedule S2 contains no cycle, that's why Schedule S2 is serializable.
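The precedence-graph check described above can be sketched in Python (the schedule format — a list of (transaction, operation, data item) tuples in time order — is an invented representation for illustration):

```python
# Hypothetical schedule: a list of (transaction, operation, data_item) tuples in time order.
def is_conflict_serializable(schedule):
    """Build the precedence graph and report True iff it has no cycle."""
    txns = {t for t, _, _ in schedule}
    edges = {t: set() for t in txns}
    # Conflicting pairs: R-W, W-R, W-W on the same data item, different transactions.
    for i, (ti, op_i, x) in enumerate(schedule):
        for tj, op_j, y in schedule[i + 1:]:
            if ti != tj and x == y and (op_i == "W" or op_j == "W"):
                edges[ti].add(tj)  # ti's operation precedes tj's conflicting one
    # Detect a cycle with depth-first search (3-colour marking).
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {t: WHITE for t in txns}
    def has_cycle(t):
        colour[t] = GREY
        for u in edges[t]:
            if colour[u] == GREY or (colour[u] == WHITE and has_cycle(u)):
                return True
        colour[t] = BLACK
        return False
    return not any(colour[t] == WHITE and has_cycle(t) for t in txns)

# S1: T1 reads A, T2 writes A, then T2 reads B and T1 writes B -> cycle -> not serializable.
s1 = [("T1", "R", "A"), ("T2", "W", "A"), ("T2", "R", "B"), ("T1", "W", "B")]
# S2: all of T1's conflicting operations precede T2's -> acyclic -> serializable.
s2 = [("T1", "R", "A"), ("T1", "W", "A"), ("T2", "R", "A"), ("T2", "W", "A")]
print(is_conflict_serializable(s1), is_conflict_serializable(s2))  # False True
```

In s1 the edges T1 → T2 (on A) and T2 → T1 (on B) form a cycle; in s2 every edge points from T1 to T2, so the graph is acyclic.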
Recoverability
Irrecoverable schedules/ non-recoverable schedules:
If a transaction performs a dirty read from an uncommitted transaction and commits before the
transaction from which it read the value, such a schedule is called an irrecoverable
schedule.
Recoverable Schedules
If a transaction performs a dirty read from an uncommitted transaction, but its commit is
delayed until the uncommitted transaction has either committed or rolled back, such a schedule
is called a Recoverable Schedule.
Example:
In the above schedule, transaction T2 is not allowed to commit while T1 is not yet
committed.
In this case, if transaction T1 fails, transaction T2 still has a chance to recover by rolling
back.
However, not all recoverable schedules are the same. There are three types based on how they
handle rollback situations:
1. Cascading Schedule
2. Cascadeless Schedule
3. Strict Schedule
1. Cascading Schedule
A cascading schedule allows a transaction to read uncommitted data from another transaction. If
the first transaction fails and rolls back, all dependent transactions must also rollback, causing a
cascading effect.
2. Cascadeless Schedule
A cascadeless schedule ensures that a transaction only reads committed data, preventing
cascading rollbacks.
Advantage:
No rollback propagation
Better performance
Example
T2 waits until T1 commits before reading A.
No cascading rollback possible.
Faster and more efficient.
3. Strict Schedule
A strict schedule prevents any transaction from reading OR writing an uncommitted value from
another transaction.
Strict schedules are stricter than cascadeless schedules: transactions must wait until the writing
transaction commits or rolls back.
Advantage:
No cascading rollback.
Example
Strict schedules allow concurrency but make transactions wait only when they try to read/write
uncommitted data.
They are NOT the same as sequential execution because independent transactions can still
execute in parallel.
Implementation of Isolation
Levels of Isolation:
Isolation is divided into four levels. Higher isolation levels constrain the ability of users to
access the same data concurrently: the greater the isolation level, the more system resources are
required, and the greater the likelihood that database transactions will block one another.
o “Serializable” ensures the final result is the same as if the transactions executed one by
one (serially), but they can still run concurrently.
o Repeatable Read ensures that if a transaction reads the same row twice, the value
will always be the same (until the transaction ends). It prevents dirty reads and non-
repeatable reads, but does NOT prevent phantom reads.
o Read Committed Only allows reading data that has been committed. A transaction cannot
read uncommitted changes from another transaction. Eliminates dirty reads, but non-
repeatable reads and phantom reads are still possible.
o Read Uncommitted is the lowest level of isolation, allowing a transaction to read data
modified by other transactions before those modifications are committed.
The lower the isolation level, the more prone users are to read phenomena such as uncommitted
dependencies, often known as dirty reads, where data is read from a row that has been modified
by another user but has not yet been committed to the database.
Concurrency control techniques include:
1. Two-Phase Locking (2PL): Locks are acquired in a growing phase and released in a
shrinking phase.
2. Timestamp-Based Protocols: Operations are validated against transaction timestamps so
that conflicting operations execute in timestamp order.
3. Multiversion Concurrency Control (MVCC): Readers see a consistent snapshot while
writers create new versions of the data.
Each technique has its own advantages. Strict 2PL ensures isolation but can cause deadlocks,
while MVCC improves performance by allowing concurrent access.
Multiple users can access and use the same database at one time, which is known as
concurrent execution of the database.
Concurrency control ensures that database transactions are performed concurrently and
accurately.
It confirms that transactions produce correct results without violating the data integrity of
the respective database.
Concurrency control is the working concept required for controlling and managing the
concurrent execution of database operations.
It avoids inconsistencies in the database.
Concurrency control protocols ensure the atomicity, consistency, isolation, durability,
and serializability of the concurrent execution of database transactions.
Lock-Based Protocols
A lock is essential in concurrency control because it controls concurrent access to a data item.
It ensures that one transaction does not read or write a record while another transaction
is performing a write operation on it.
A lock signifies which operations may be performed on the data item.
Example:
✔ In a traffic light signal that indicates stop and go, only one direction is allowed to pass at a
time while the other signals are locked.
✔ In the same way, in a database, only one transaction operates on a locked data item at a time
while other transactions are made to wait.
Shared Lock
A transaction holding a shared lock can only read the data item, without performing any
changes to it.
Other transactions can also read the same data item at the same time, but none can
update it until all shared locks are released.
Shared Locks are represented by S.
Exclusive Lock
The data item can be both read and written by the transaction holding the lock.
No other transaction can read or modify the data item until the exclusive lock is
released, so multiple transactions cannot modify the same data simultaneously.
Exclusive Locks are represented by X.
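As a rough illustration, shared/exclusive lock compatibility can be sketched with a hypothetical, non-blocking lock table (a real DBMS queues waiting transactions rather than failing immediately; the class and method names are invented):

```python
# A minimal, non-blocking lock table sketch (hypothetical; real DBMSs queue waiters).
class LockTable:
    def __init__(self):
        self.locks = {}  # data item -> (mode, set of holding transactions)

    def acquire(self, txn, item, mode):
        """Try to lock `item` for `txn` in mode "S" (shared) or "X" (exclusive)."""
        held = self.locks.get(item)
        if held is None:
            self.locks[item] = (mode, {txn})
            return True
        held_mode, holders = held
        # Only shared locks are compatible with each other.
        if mode == "S" and held_mode == "S":
            holders.add(txn)
            return True
        return False  # conflict: the caller must wait or abort

    def release(self, txn, item):
        held = self.locks.get(item)
        if held and txn in held[1]:
            held[1].discard(txn)
            if not held[1]:
                del self.locks[item]

table = LockTable()
print(table.acquire("T1", "A", "S"))  # True  - first shared lock
print(table.acquire("T2", "A", "S"))  # True  - shared locks coexist
print(table.acquire("T3", "A", "X"))  # False - exclusive conflicts with shared
table.release("T1", "A"); table.release("T2", "A")
print(table.acquire("T3", "A", "X"))  # True  - item is free now
```

The compatibility rule is the whole story: S is compatible with S, and X is compatible with nothing.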
Timestamp Ordering Protocol
The Timestamp Ordering Protocol is used to order the transactions based on their
Timestamps.
The order of the transactions is the ascending order of their creation timestamps.
The older transaction has higher priority, which is why it executes first.
To determine the timestamp of the transaction, this protocol uses system time, logical
counter or unique value.
The timestamp ordering protocol also maintains the timestamp of last 'read' and 'write'
operation on a data.
1. TS(Ti): The timestamp assigned to transaction Ti when it is created.
2. R_TS(X): The largest timestamp among all transactions that have successfully read data
item X. Example: R_TS(A) = 30
3. W_TS(X): The largest timestamp among all transactions that have successfully written
data item X. Example: W_TS(A) = 20
Timestamp Ordering Rules: Read(A)
Whenever a transaction Ti issues a Read(A) operation, check the following:
a. If W_TS(A) > TS(Ti), the operation is rejected and Ti is rolled back. (Not allowed)
b. Otherwise, execute R(A) and set R_TS(A) = max(R_TS(A), TS(Ti)). (Allowed)
Timestamp Ordering Rules: Write(A)
Whenever a transaction Ti issues a Write(A) operation, check the following:
a. If R_TS(A) > TS(Ti), the operation is rejected and Ti is rolled back. (Not allowed)
b. If W_TS(A) > TS(Ti), the operation is rejected and Ti is rolled back. (Not allowed)
c. Otherwise, execute W(A) and set W_TS(A) = TS(Ti). (Allowed)
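The read and write rules can be sketched as a minimal simulation (the class and function names are invented for illustration; a single data item stands in for the database):

```python
# Sketch of the timestamp ordering rules (single data item, hypothetical API).
class DataItem:
    def __init__(self):
        self.r_ts = 0  # largest timestamp that has read this item
        self.w_ts = 0  # largest timestamp that has written this item

def read(item, ts):
    """Apply the Read(X) rule for a transaction with timestamp ts."""
    if item.w_ts > ts:
        return "rollback"           # a younger transaction already wrote X
    item.r_ts = max(item.r_ts, ts)  # R_TS(X) = max(R_TS(X), TS(Ti))
    return "allowed"

def write(item, ts):
    """Apply the Write(X) rule for a transaction with timestamp ts."""
    if item.r_ts > ts or item.w_ts > ts:
        return "rollback"           # a younger transaction already read or wrote X
    item.w_ts = ts                  # W_TS(X) = TS(Ti)
    return "allowed"

a = DataItem()
print(read(a, 30))   # allowed  -> R_TS(A) becomes 30
print(write(a, 20))  # rollback -> R_TS(A) = 30 > 20
print(write(a, 40))  # allowed  -> W_TS(A) becomes 40
print(read(a, 35))   # rollback -> W_TS(A) = 40 > 35
```

The older transaction (smaller timestamp) is the one that gets rolled back whenever it arrives too late, which is exactly the ordering the protocol enforces.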
Validation-Based Protocol (Optimistic Concurrency Control)
1. Read Phase
The transaction begins and reads the required data from the database.
No locks are applied to the data, so other transactions can freely access and modify the
same data.
Example:
Aravind and Ashish open the booking website.
Both see Seat No. 10 is available (because the database still shows it as free).
At this point, no actual booking has been made. The system allows both users to proceed without
locking the seat.
2. Validation Phase (Before Commit Phase)
The system checks whether any other transactions have modified the data that this
transaction has read.
If a conflict is detected (i.e., another transaction modified the same data), the transaction
is aborted and must restart.
Example:
Aravind and Ashish now try to confirm their bookings at the same time. The system checks:
o Has Seat No. 10 been booked by someone else since Aravind or Ashish last
checked?
Here, suppose Aravind confirms his booking first. The system now marks Seat No. 10 as booked.
3. Write Phase
Since changes are applied to the database only in this phase, the database remains consistent.
Example:
Since Aravind’s transaction passes validation, the system confirms his booking.
Ashish’s transaction now reaches validation and sees Seat No. 10 is no longer available.
Ashish receives a message: "Sorry, the seat is no longer available. Please select another seat."
Deadlock
Deadlock occurs when two or more transactions cannot proceed because each is waiting for the
other to release resources. This results in a cycle of dependencies where no transaction can be
completed. Deadlocks typically happen when transactions hold some resources and wait for
others, leading to a halt. DBMS uses deadlock detection and resolution techniques, like timeouts
or transaction aborts, to break the cycle and restore progress.
There are two types of deadlocks in Database management system such as:
Resource Deadlocks
Communication Deadlocks
1. Resource Deadlocks
Resource Deadlocks occur when multiple processes require access to resources that are held by
other processes, leading to a cycle of waiting.
For example, if Process A holds Resource 1 and waits for Resource 2, while Process B holds
Resource 2 and waits for Resource 1, a deadlock situation arises.
2. Communication Deadlocks
Communication Deadlocks are less common but can occur in distributed systems where
processes communicate through messages.
For example, process A waits for a signal from B, B waits for a signal from C, and C waits for a
signal from A. As each process depends on another to proceed, none can move forward, creating
a deadlock where all processes are stuck, and unable to progress.
Necessary Conditions for Deadlock
1. Mutual Exclusion
The Mutual Exclusion condition states that only one process can use a resource at a time. If
multiple processes are attempting to access the same resource, the resource must be locked by
one process, preventing others from accessing it at the same time.
2. Hold and Wait
The Hold and Wait condition occurs when a process holds one resource while waiting for
additional resources held by other processes. This creates a cycle where each process is waiting
for resources that are locked by others.
3. No Preemption
The no preemption is the condition where resources cannot be taken from a process by force. A
process can only release a resource voluntarily after it has completed its task.
4. Circular Wait
It is the condition where a set of processes are waiting for each other in a circular chain. For
example, Process A waits for Resource 1, Process B waits for Resource 2, and Process C waits
for Resource 3, but ultimately, they all depend on each other, causing a circular wait.
Deadlock Handling
In DBMS, deadlock handling is the process of managing and understanding deadlocks to prevent
them from impacting the system's performance and reliability. There are several strategies for
handling deadlocks effectively:
1. Deadlock Avoidance
This is a technique in a database management system (DBMS) that prevents deadlocks from
occurring by monitoring the system's state and making decisions to keep processes from getting
stuck. It's a proactive strategy that's often better than recovering from a deadlock, which can
waste time and resources.
2. Deadlock Detection
Deadlock detection is a process that identifies if any processes in a system are stuck waiting for
each other, preventing them from moving forward. This can be done by using a Wait-for Graph,
where the system monitors the relationships between processes and resources. If the graph
contains a cycle, a deadlock is detected, and necessary actions are taken.
3. Deadlock Prevention
Deadlock prevention is a method to ensure that processes do not get stuck waiting for each other
and cannot move forward. It involves establishing rules to manage resource usage so that
processes do not get into a deadlock situation.
4. Deadlock Recovery
This is the process of breaking a deadlock, which is when two or more transactions cannot
proceed because they are waiting for resources held by other transactions.
Here are some applications of deadlock handling in database management systems:
Timeouts prevent transactions from waiting indefinitely for locks. When a transaction
times out, it can be forced to release its current resources and try again later.
Here are some drawbacks of deadlocks in database management systems such as:
Deadlocks can cause the system to stop working, which can result in a loss of revenue
and productivity for businesses that use the DBMS.
When transactions are blocked, the resources they require remain unused, resulting in a
drop in system efficiency and wasted resources.
Deadlocks can lead to a decrease in system concurrency, which can result in slower
transaction processing and reduced throughput.
Resolving a deadlock can be a complex and time-consuming process that requires system
administrators to manually get involved.
In some cases, recovery algorithms may require rolling back the state of one or more
processes, which can lead to data loss or corruption.
Wait-for Graph in Deadlock Detection
A Wait-for Graph (WFG) is a directed graph used for deadlock detection in operating systems
and relational database systems. It represents the dependencies among processes and resources
in a system, helping to identify potential deadlocks.
In a WFG, processes are represented as nodes, and edges indicate the waiting relationship
between processes. An edge from process Pj to Pk represents that Pj is waiting for Pk to release
a lock on a resource. If a process is waiting for more than one resource to become available,
multiple edges may represent a conjunctive (and) or disjunctive (or) set of different resources.
Detection of Deadlocks
The possibility of a deadlock is implied by graph cycles in the conjunctive case, and by knots in
the disjunctive case. A cycle in the WFG indicates that a process is waiting for another process
to release a resource, which in turn is waiting for a third process to release a resource, and so on.
This creates a circular dependency, leading to a deadlock.
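Cycle detection on a wait-for graph can be sketched as follows (process names are hypothetical; this uses the conjunctive "and" model, where any cycle implies deadlock):

```python
# Wait-for graph deadlock check (conjunctive model: a cycle implies deadlock).
def has_deadlock(wait_for):
    """wait_for: dict mapping each process to the set of processes it waits on."""
    visiting, done = set(), set()
    def dfs(p):
        visiting.add(p)
        for q in wait_for.get(p, ()):
            if q in visiting or (q not in done and dfs(q)):
                return True  # back edge found -> cycle -> deadlock
        visiting.discard(p)
        done.add(p)
        return False
    return any(p not in done and dfs(p) for p in wait_for)

# P1 waits for P2, P2 waits for P3, P3 waits for P1 -> cycle -> deadlock.
print(has_deadlock({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))  # True
# A simple chain with no cycle -> no deadlock.
print(has_deadlock({"P1": {"P2"}, "P2": {"P3"}, "P3": set()}))   # False
```

A DBMS running such a check periodically would pick a victim transaction on any detected cycle and abort it to break the deadlock.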
Failure Classification
In a DBMS, several transactions run according to a specified schedule; however, these
transactions sometimes fail for several reasons.
1. Transaction Failure
Logical Error: If the logic used in the statement itself is wrong, the transaction can fail.
System Error: When the transaction is executing but due to a fault in system, the transaction
fails abruptly.
2. System Crash
The system on which the transactions are running can crash, which results in the failure of
currently running transactions. Causes include hardware issues, power failures, and
operating-system faults.
3. Hard-disk Failure
Hard-disk failure can also cause transaction failure. When transactions are reading and writing
data to disk, a failure of the underlying disk can cause currently running transactions to fail,
because they can no longer read or write data properly. This can also result in loss of data.
There can be several reasons for a disk failure, such as the formation of bad sectors, disk
corruption, viruses, or insufficient resources available on the disk.
Introduction to Indexing Techniques
B+ Trees
B+ Tree is a type of self-balancing tree structure commonly used in databases and file systems to
maintain sorted data in a way that allows for efficient insertion, deletion, and search operations.
Unlike binary trees, B+ trees maintain balance by keeping all leaf nodes at the same level.
The data pointers are present only at the leaf nodes on a B+ tree whereas the data pointers are
present in the internal, leaf or root nodes on a B-tree.
The leaves are not connected with each other on a B-tree whereas they are connected on a B+
tree.
Properties of a B+ Tree of order m
1. All leaf nodes are at the same level.
2. The root has at least two children if it is not a leaf node.
3. The keys within each node are stored in sorted order.
4. Each node except the root can have a maximum of m children and at least ⌈m/2⌉ children.
5. Each node can contain a maximum of m - 1 keys and a minimum of ⌈m/2⌉ - 1 keys.
Operations on B+ Trees
1. Search: Starts at the root and traverses down the tree, guided by the key values in each
node, until it reaches the appropriate leaf node.
2. Insert: Inserts a new key-value pair and then reorganizes the tree as needed to maintain
its properties.
3. Delete: Removes a key-value pair and then reorganizes the tree, again to maintain its
properties.
Insertion on a B+ Tree
Inserting an element into a B+ tree consists of three main events: searching the appropriate
leaf, inserting the element and balancing/splitting the tree.
Insertion Operation
Before inserting an element into a B+ tree, these properties must be kept in mind.
Each node except root can have a maximum of m children and at least ⌈m/2⌉ children.
Each node can contain a maximum of m - 1 keys and a minimum of ⌈m/2⌉ - 1 keys.
1. Since every element is inserted into the leaf node, go to the appropriate leaf node.
Case I
1. If the leaf is not full, insert the key into the leaf node in increasing order.
Case II
1. If the leaf is full, insert the key into the leaf node in increasing order and balance the tree
in the following way.
Insertion Example
1. Insert 5
2. Insert 15
3. Insert 25
4. Insert 35
5. Insert 45
Deletion from a B+ Tree
Deleting an element from a B+ tree consists of three main events: searching the node where the
key to be deleted exists, deleting the key, and balancing the tree if required. Underflow is a
situation in which a node contains fewer keys than the minimum number it should hold.
Deletion Operation
Before going through the steps below, one must know this fact about a B+ tree of degree m:
A node (except the root node) should contain a minimum of ⌈m/2⌉ - 1 keys (for m = 3, that is
1 key).
While deleting a key, we have to take care of the keys present in the internal nodes (i.e., the
indexes) as well, because key values are duplicated in a B+ tree. Search for the key to be deleted,
then follow the steps below.
Case I
The key to be deleted is present only at the leaf node not in the indexes (or internal nodes). There
are two cases for it:
1. There is more than the minimum number of keys in the node. Simply delete the key.
2. There is an exact minimum number of keys in the node. Delete the key and borrow a key
from the immediate sibling. Add the median key of the sibling node to the parent.
Case II
The key to be deleted is present in the internal nodes as well. Then we have to remove them from
the internal nodes as well. There are the following cases for this situation.
1. If there is more than the minimum number of keys in the node, simply delete the key
from the leaf node and delete the key from the internal node as well. Fill the empty space
in the internal node with the inorder successor.
2. If there is an exact minimum number of keys in the node, then delete the key and borrow
a key from its immediate sibling (through the parent). Fill the empty space created in the
index (internal node) with the borrowed key.
3. This case is similar to Case II(1), but here the empty space is generated above the
immediate parent node. After deleting the key, merge the empty space with its sibling.
Fill the empty space in the grandparent node with the inorder successor.
Case III
In this case, the height of the tree shrinks, which makes it a little more complicated. Deleting
55 from a small example tree leads to this condition.
Searching on a B+ Tree
The following steps are followed to search for data in a B+ Tree of order m. Let the data to be
searched be k.
1. Start from the root node. Compare k with the keys at the root node [k1, k2, ..., km-1].
2. If k < k1, search in the left child of k1.
3. Else if k == k1, compare with k2. If k < k2, k lies between k1 and k2, so search in the left
child of k2.
4. If k >= k2, continue comparing with k3, k4, ..., km-1 in the same way until the correct
child is found.
5. Repeat the steps above until a leaf node is reached. If k exists in that leaf node, k is found;
otherwise, it is not present.
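The search procedure can be sketched on a small, hand-built tree (the Node class is a simplified stand-in; real B+ tree implementations also link the leaves together for range scans):

```python
from bisect import bisect_left

# Minimal B+ tree node sketch (hypothetical structure; leaves hold the data).
class Node:
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children  # None for leaf nodes
    @property
    def is_leaf(self):
        return self.children is None

def search(node, k):
    """Descend from the root, guided by the keys, until a leaf is reached."""
    while not node.is_leaf:
        # Child i covers keys < keys[i]; the last child covers the rest.
        i = bisect_left(node.keys, k)
        if i < len(node.keys) and k == node.keys[i]:
            i += 1  # equal keys live in the leaves of the right subtree
        node = node.children[i]
    return k in node.keys

# A small order-3 tree holding 5, 15, 25, 35, 45 (all leaves at the same level).
leaf1, leaf2, leaf3 = Node([5, 15]), Node([25, 35]), Node([45])
root = Node([25, 45], [leaf1, leaf2, leaf3])
print(search(root, 35), search(root, 40))  # True False
```

Every lookup takes the same number of steps (the height of the tree), which is what makes B+ trees attractive for disk-based indexes.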
Hash-Based Indexing is a technique used to locate data in a database efficiently using a hash
function.
Instead of searching sequentially, it computes a hash value from a key and places or finds
the record at a position corresponding to that hash value.
A hash function takes an input (usually a key) and returns the address of a data block where the
corresponding record is stored.
Important Terminologies
1. Data Bucket:
Memory location where actual data records are stored.
2. Hash Function:
A mathematical function used to compute the address of a data bucket using the record's
key (usually the primary key).
Example: h(x) = x mod 7
3. Hash Index:
The result (address) generated by the hash function which points to the bucket.
4. Linear Probing:
If the computed bucket is full (collision occurs), linear probing checks the next
available bucket sequentially.
5. Quadratic Probing:
Instead of searching linearly, it uses a quadratic formula like i^2 to find the next
available bucket.
6. Bucket Overflow:
Occurs when a hash function maps multiple records to the same bucket (collision),
causing the bucket to exceed its capacity.
Hashing in DBMS is of two types:
1. Static Hashing
2. Dynamic Hashing
In Static Hashing, the number of buckets is fixed in advance. The hash function always returns
the same bucket address for the same key, and the number of buckets does not change even if
the number of records increases or decreases.
Example:
h(x) = x % 5
Here, the modulus operator (% 5) ensures all keys are mapped to buckets 0 to 4.
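A minimal sketch of static hashing with five fixed buckets, using the h(x) = x % 5 function above (the inserted keys are illustrative, not from the text):

```python
# Static hashing sketch: 5 fixed buckets, h(x) = x % 5; keys are illustrative.
NUM_BUCKETS = 5
buckets = [[] for _ in range(NUM_BUCKETS)]

def h(x):
    return x % NUM_BUCKETS  # the same key always maps to the same bucket

for key in [12, 7, 23, 5, 18]:
    buckets[h(key)].append(key)  # 12->2, 7->2, 23->3, 5->0, 18->3

print(buckets)  # [[5], [], [12, 7], [23, 18], []]
```

Because the bucket count never changes, a growing record set eventually overflows these fixed buckets, which is the motivation for the collision-handling schemes below.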
Static Hashing is mainly divided into two types based on how it handles collisions:
i) Open Addressing
ii) Closed Addressing (Chaining)
i) Open Addressing
In Open Addressing, if the target bucket (calculated by the hash function) is already occupied,
the system searches for the next available bucket inside the hash table itself.
a) Linear Probing
b) Quadratic Probing
c) Double Hashing
a) Linear Probing
Steps:
1. Compute the home bucket using the hash function.
2. If that bucket is occupied (collision), move to the next bucket.
3. Keep moving linearly (one step at a time) until you find an empty bucket.
4. Wrap around to the start if you reach the end of the table (circular).
Hash Function:
h(k) = k % 7
Table size = 7
Keys to Insert:
Step-by-step insertion:
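The steps above can be sketched as follows. Since the text's own key list is not shown, the keys here are illustrative, chosen so that they all collide at index 3:

```python
# Linear probing sketch: table size 7, h(k) = k % 7; keys are illustrative.
TABLE_SIZE = 7
table = [None] * TABLE_SIZE

def insert_linear(k):
    home = k % TABLE_SIZE
    for step in range(TABLE_SIZE):
        probe = (home + step) % TABLE_SIZE  # wrap around circularly
        if table[probe] is None:
            table[probe] = k
            return probe
    raise RuntimeError("hash table is full")

for k in [10, 17, 24]:  # all hash to 10 % 7 = 3, forcing collisions
    insert_linear(k)

print(table)  # [None, None, None, 10, 17, 24, None]
```

The colliding keys pile up in consecutive slots 3, 4, 5 — the "primary clustering" that quadratic probing is designed to break up.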
Quadratic Probing is another collision resolution technique under Open Addressing, where
instead of moving linearly (like linear probing), we move in quadratic steps (i.e., 1², 2², 3², …)
to find the next empty slot.
Quadratic Probing Formula:
General formula:
index = (h(k) + i²) % table_size
Where i = 0, 1, 2, 3...
Hash Function:
h(k) = k % 7
Table size = 7
Keys to Insert:
Step-by-step insertion:
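A sketch of quadratic probing with the same table size and hash function; the keys are again illustrative (the text's key list is not shown), all colliding at index 3:

```python
# Quadratic probing sketch: probe sequence (h(k) + i*i) % TABLE_SIZE, i = 0, 1, 2, ...
TABLE_SIZE = 7
table = [None] * TABLE_SIZE

def insert_quadratic(k):
    home = k % TABLE_SIZE
    for i in range(TABLE_SIZE):
        probe = (home + i * i) % TABLE_SIZE  # jump 0, 1, 4, 9, ... slots ahead
        if table[probe] is None:
            table[probe] = k
            return probe
    raise RuntimeError("no free slot found")

for k in [10, 17, 24]:  # all hash to 3
    insert_quadratic(k)

print(table)  # [24, None, None, 10, 17, None, None]
```

Unlike linear probing, the third colliding key (24) lands at slot (3 + 2²) % 7 = 0 rather than in the slot adjacent to the cluster.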
Double Hashing is a collision resolution technique where two different hash functions are used:
The first for calculating the initial bucket index.
The second for calculating the step size (gap) to jump on collision.
General formula:
index = (h₁(k) + i × h₂(k)) % table_size
Where:
h₂(k) = R - (k % R) ➡️ Step hash function (R < table size and usually a prime)
i = 0, 1, 2, 3...
Table Size = 7
Hash Functions:
h₁(k) = k % 7
h₂(k) = 5 - (k % 5)
Keys to Insert:
Step-by-step Insertion
Final Hash Table
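A sketch of double hashing using the two hash functions given above, h₁(k) = k % 7 and h₂(k) = 5 - (k % 5); the inserted keys are illustrative since the text's key list is not shown:

```python
# Double hashing sketch: probe sequence (h1(k) + i * h2(k)) % TABLE_SIZE.
TABLE_SIZE = 7

def h1(k):
    return k % 7          # primary hash: initial bucket index

def h2(k):
    return 5 - (k % 5)    # step hash: R = 5 is prime and < table size

table = [None] * TABLE_SIZE

def insert_double(k):
    for i in range(TABLE_SIZE):
        probe = (h1(k) + i * h2(k)) % TABLE_SIZE
        if table[probe] is None:
            table[probe] = k
            return probe
    raise RuntimeError("no free slot found")

for k in [10, 17]:  # both hash to h1 = 3
    insert_double(k)

print(table)  # [None, None, None, 10, None, None, 17]
```

Key 17 collides at slot 3 but jumps by h₂(17) = 3 to slot 6, so colliding keys take key-dependent step sizes instead of fixed ones.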
Closed Addressing means that all elements that hash to the same index are stored together in a
list or bucket at that index (instead of probing to find another empty slot like in open addressing).
If two keys hash to the same index, just "chain" them together using a linked list or another
structure!
How it works:
Hash Table is an array of linked lists (or other dynamic structures like arrays or trees).
When collision occurs, we append the item to the linked list at that slot.
Steps:
1. Compute the slot using the hash function.
2. If the slot is empty ➡️ insert the key there.
3. If the slot has other keys (collision) ➡️ chain the new key at the end of the linked list.
Example of Chaining
Table Size = 7
Hash Function:
h(k) = k % 7
Keys to Insert:
Step-by-step Insertion
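A sketch of chaining using Python lists in place of linked lists (the inserted keys are illustrative, since the text's key list is not shown):

```python
# Separate chaining sketch: each slot holds a chain of keys that hash to it.
TABLE_SIZE = 7
table = [[] for _ in range(TABLE_SIZE)]

def insert_chained(k):
    idx = k % TABLE_SIZE
    table[idx].append(k)  # a collision simply extends the chain at that slot

for k in [10, 17, 24, 5]:  # 10, 17, 24 all chain at index 3
    insert_chained(k)

print(table[3])  # [10, 17, 24]
print(table[5])  # [5]
```

Unlike open addressing, insertions never fail because the table is "full"; chains just get longer, degrading lookups toward a linear scan.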
Dynamic Hashing
o The dynamic hashing method is used to overcome the problems of static hashing like
bucket overflow.
o In this method, data buckets grow or shrink as the number of records increases or
decreases. This method is also known as the Extendable hashing method.
o This method makes hashing dynamic, i.e., it allows insertion or deletion without resulting
in poor performance.
Terminology:
Directory: An array of bucket addresses, indexed by the low-order bits of the hash address.
Global Depth (GD): Number of bits used to index into the directory.
Local Depth (LD): Number of bits used to index into a specific bucket.
Bucket: Where the actual records (keys) are stored. Each bucket has a fixed capacity.
Retrieval:
o Check how many bits are used in the directory; this number of bits is called i.
o Take the least significant i bits of the hash address. This gives an index into the directory.
o Using the index, go to the directory and find the bucket address where the record might
be.
Insertion:
o First, follow the same procedure as retrieval, ending up in some bucket.
o If there is still space in that bucket, place the record in it.
o If the bucket is full, split the bucket and redistribute the records.
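The directory-lookup step above is just a bit mask. A minimal sketch (function name is hypothetical):

```python
# Extendible hashing lookup sketch: the least significant i bits of the
# hash address select the directory entry.
def directory_index(hash_address, i):
    return hash_address & ((1 << i) - 1)  # mask keeps only the low i bits

# With i = 2 bits, hash address 10001 (binary) has low bits 01,
# so it maps to directory entry 1 (bucket B1 in the example below).
print(directory_index(0b10001, 2))  # 1
print(directory_index(0b10100, 2))  # 0
```

When a bucket splits, i grows by one bit and the directory doubles, but only the entries for the split bucket point anywhere new.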
For example:
Consider the following grouping of keys into buckets, depending on the prefix of their hash
address:
The last two bits of 2 and 4 are 00. So it will go into bucket B0. The last two bits of 5 and 6 are
01, so it will go into bucket B1. The last two bits of 1 and 3 are 10, so it will go into bucket B2.
The last two bits of 7 are 11, so it will go into B3.
Insert key 9 with hash address 10001 into the above structure:
o Since key 9 has hash address 10001, whose last two bits are 01, it must go into bucket
B1. But bucket B1 is full, so it will be split.
o The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go
into bucket B1, and the last three bits of 6 are 101, so it goes into bucket B5.
o Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100
entries because the last two bits of both entries are 00.
o Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110
entries because the last two bits of both entries are 10.
o Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 entries because
the last two bits of both entries are 11.
Advantages of dynamic hashing
o In this method, performance does not decrease as the data in the system grows. The
structure simply increases its memory size to accommodate the data.
o In this method, memory is well utilized as it grows and shrinks with the data. There will
not be any unused memory lying idle.
o This method is good for dynamic databases where data grows and shrinks frequently.
Disadvantages of dynamic hashing
o In this method, if the data size increases then the bucket size also increases. The
addresses of the data are maintained in the bucket address table because the data
addresses keep changing as buckets grow and shrink. If there is a huge increase in data,
maintaining the bucket address table becomes tedious.
o In this case, the bucket overflow situation can also occur, but it takes longer to reach
this situation than in static hashing.