Brain Storming
1. Differentiate the terms:
• Database,
• Database Management System (DBMS), and
• Database System.
Definition of Terms
Database – an organized collection of related data held in
a computer or a data bank, designed to be
accessible in various ways.
DBMS – the technology of storing and retrieving users’ data
with utmost efficiency along with appropriate
security measures.
– A software package/system that facilitates the creation
and maintenance of a computerized database.
Database System – the DBMS software together with the data itself.
Brain Storming
1. How has the database technology evolved?
Types of DBMS Models
• In the hierarchical database model, data is
organized in a tree-like structure.
• Data is stored in a hierarchical (top-down or
bottom-up) format.
• Data is represented using parent-child
relationships, which are one-to-one or one-to-many.
• In a hierarchical DBMS a parent may have many
children, but each child has only one parent.
• The network database model allows each child to have multiple
parents.
• It supports more complex relationships, such as many-to-many
(e.g., orders/parts).
• The entities are organized in a graph which can be accessed
through several paths.
• The relational model is the simplest and most widely used DBMS model.
• Based on normalizing data into the rows and columns
of tables.
• Data is stored in fixed structures and manipulated
using SQL.
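The relational idea can be sketched with Python's built-in sqlite3 module; the table name and sample rows below are invented for illustration:

```python
import sqlite3

# A relation is a table: data lives in rows and columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (sId INTEGER PRIMARY KEY, sName TEXT, dept TEXT)")
conn.executemany("INSERT INTO student VALUES (?, ?, ?)",
                 [(1, "Abebe", "CS"), (2, "Sara", "IS"), (3, "Lily", "CS")])

# Data in the fixed structure is manipulated declaratively with SQL.
cs_students = [row[0] for row in
               conn.execute("SELECT sName FROM student WHERE dept = 'CS' ORDER BY sId")]
print(cs_students)  # -> ['Abebe', 'Lily']
```

Note that the SELECT states *what* is wanted; the engine decides *how* to retrieve it.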
Characteristics of the Relational Model (1970s)
Clean and simple.
Great for administrative and transactional data.
Not as good for other kinds of complex data (e.g., multimedia,
networks, CAD).
Relations are the key concept; everything else is built around
relations.
Primitive data types only, e.g., strings, integers, dates.
Strong normalization, query optimization, and theory.
What is missing?
Handling of complex objects
– Could not store complex data types
such as images or sound.
Handling of complex data types
– RDBMSs have provided limited data types.
Code is not coupled with data
– SQL is declarative but programming
languages are procedural.
• In the object-oriented model, data is stored in the form of objects.
• The structure, called a class, defines the data held within it.
• It defines a database as a collection of objects that store both
data member values and operations.
• Properties
• Name
• Height
• Weight……..
• Behaviors
• Eat
• Pray
• Walk …..
Object-Oriented models (80’s):
▪ Complicated, but some influential ideas from Object Oriented
programming
▪ Handles Complex data types.
▪ Idea: Build DBMS based on OO model.
▪ Programming languages have evolved from Procedural to Object
Oriented. So why not DBMSs ???
Evolution of Database Models
Introduction to OODBMS
OO Concepts (Mandatory Concepts)
The Golden Rules/1 – OO concepts (mandatory):
Complex objects
Object identity
Encapsulation
Types and/or classes
Class or type hierarchies
Overriding, overloading
Computational completeness
Extensibility – user-defined types can be used in the same way as
system-defined types
The Golden Rules/2 – DBMS concepts (mandatory):
Persistence
Secondary storage management
Concurrency
Recovery
Ad hoc query facility
The Goodies (Optional concepts that may be implemented)
Multiple inheritance
Type checking and type inferencing
Distribution
Design transactions
Versions
Discussion …
1. What are the main features of OOP?
2. What are the main capabilities of Database?
3. What is an Object-Oriented Database (OODBMS)?
Object-Oriented Database
• An OODBMS is a type of database
management system (DBMS) that utilizes the
principles of object-oriented programming
(OOP).
• Data is stored in objects, which encapsulate
both data (attributes) and behavior (methods).
• Objects can interact with each other through
methods, promoting a more natural
representation of complex relationships
Why Object Oriented Databases?
• There are three reasons for need of OODBMS:
1. Limitation of RDBMS
2. Need for Advanced Applications
3. Popularity of Object Oriented Programming Paradigm
OODB Advantages and Disadvantages
▪ Advantages of OODBMS:
▪ Natural Data Modeling: Object-based modeling aligns well with real-
world entities, simplifying data representation.
▪ Reduced Development Time: Inheritance and built-in functionalities
can expedite development.
▪ Improved Code Maintainability: Encapsulation promotes modularity
and reduces code complexity.
▪ Complex Data Handling: OODBMS excel at managing intricate data
structures and relationships.
OODB Advantages and Disadvantages
• Disadvantages of OODBMS:
• Performance: OODBMS might have slower query performance
compared to relational databases for simple queries.
• Complexity: OOP concepts like inheritance and complex object
structures can add complexity.
• Limited Adoption: OODBMS have a smaller market share
compared to relational databases.
Approaches for OODBMS
❑ There are two approaches to object-oriented
databases:
❑ Object-Oriented Model (OODBMS)
– Pure OO concepts
– Examples: ORION, Iris
❑ Object-Relational Model (ORDBMS)
– Extended relational model with OO concepts
– Examples: Oracle 8i, SQL Server 2000
Object Data Management Group (ODMG)
❑ The ODMG was formed to define standards for OODBMSs.
❑ Its most widely adopted (and final) release is ODMG 3.0.
– provide a standard where previously there was none
– support portability between products
– standardize model, querying and programming issues
❑ The major components of ODMG architecture for an OODBMS are:
– Object Model (OM),
– Object Definition Language (ODL),
– Object Query Language (OQL), and C++, Java, and Smalltalk language
bindings.
ODMG Objects and Literals
The basic building blocks of the object model are:
1. Objects
2. Literals
▪ Objects - represent real-world entities with attributes (data) and
methods (behavior).
▪ Objects are described by four characteristics:
1. Identifier (OID): a unique, system-wide identifier.
• The OID of an object is independent of the values of its
attributes.
2. Name: used to refer to the object; it is optional.
ODMG Objects
Objects are described by four characteristics
3. Lifetime
▪ Transient – a temporary object that exists only while the
program runs; it is not saved in the database.
▪ Persistent – a permanent object stored in the object database.
Persistence is achieved through:
▪ Naming – giving the object an explicit name.
▪ Reachability – collecting similar objects under one name
(extents).
4. Structure: specifies whether the object is atomic or a collection
type.
Object Identity
▪ Identity: every object has a unique identity
▪ An object’s identity must be unique and non-volatile (immutable):
In time: it cannot change and cannot be reassigned to another
object when the original object is deleted.
In space: it must be unique within and across database
boundaries.
Object Factory
▪ An object factory is an object that can generate many other objects
through its operations.
Example:
▪ Date object – can generate many calendar dates
▪ Ethiopian calendar
▪ European calendar
▪ Arabic calendar
▪ Indian calendar
ODMG Literals
▪ ODMG Literals- are special values used in Object Database
Management Systems (ODBMS) that follow the ODMG (Object Data
Management Group) standard.
Object types
• An object type is a blueprint or template that defines the structure
and behavior of its objects.
• It acts as a category that groups similar objects.
• The object type specifies what attributes (data) objects of that type
will have and what methods (actions) they can perform.
• For example, if you have an object type named “Car”, all Car objects
would share certain attributes like model, color, and number of
doors.
• All Car objects also share methods like accelerate(), brake(), and
turn().
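As a rough sketch, the Car object type can be expressed as a class in an object-oriented language; Python is used here, the attribute and method names follow the slide, and the `speed` attribute is invented so the methods have state to act on:

```python
class Car:
    """Object type: a template defining attributes (state) and methods (behavior)."""

    def __init__(self, model, color, doors):
        # Attributes shared by every Car object
        self.model = model
        self.color = color
        self.doors = doors
        self.speed = 0  # invented attribute for illustration

    # Methods shared by every Car object
    def accelerate(self, delta):
        self.speed += delta
        return self.speed

    def brake(self, delta):
        self.speed = max(0, self.speed - delta)
        return self.speed

car = Car("Corolla", "blue", 4)   # an object (instance) of the Car type
car.accelerate(50)
print(car.brake(20))  # -> 30
```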
Object types
▪ An object is made of two things
▪ State
▪ Behaviour
State
– is defined by the values the object carries for a set of properties,
each of which may be either an attribute of the object or a relationship
between the object and one or more other objects.
– Example- Attributes (name, address, birthDate of a person)
Relationships
Relationships are defined between types.
Only binary relationships are supported, with cardinality 1:1, 1:*, or *:*.
A relationship has a name and is not a ‘first-class’ object.
Traversal paths are defined for each direction of traversal.
Example: a Branch Has a set of Staff, and a member of Staff WorksAt a
Branch.
Behaviour
▪ Behaviour – is defined by a set of operations that can be performed
on or by the object.
▪ E.g., operations(age of person is computed from birthDate & current date)
▪ Operations implement the object’s behavior
▪ Types of operations:
▪ Constructor: creates a new instance of a class
▪ Query: accesses the state of an object but does not alter its state
▪ Update: alters the state of an object
▪ Scope: operation applying to the class instead of an instance
Type constructors
▪ Type constructors are special keywords used to define the
structure of complex objects.
▪ They act like building blocks, allowing you to create
objects with various data types and relationships between them.
▪ The three most basic constructors are
▪ Atom,
▪ Struct (or tuple), and
▪ Collection.
Atomic constructors
Includes the basic built-in data types of the object model,
which are similar to the basic types in many programming
languages: integers, strings, floating point numbers,
enumerated types, Booleans.
They are called single-valued or atomic types, since
each value of the type is considered an atomic (indivisible)
single value.
struct (or tuple) constructor
Creates standard structured types, such as the tuples (record
types) of the basic relational model.
Referred to as a compound or composite type.
Example
struct Name<FirstName: string, MiddleInitial: char, LastName:
string>,
struct CollegeDegree<Major: string, Degree: string,Year: date>.
Collection (or multivalued)
Collection is used to create complex nested type structures in the
object model.
Collection type constructors include:
1. Set(T) – an unordered collection that does not allow duplicates.
2. Bag(T) – like a set, but duplicate elements are allowed in the
collection.
3. List(T) – an ordered collection, used where the order of the
elements is important.
4. Array(T) – an ordered collection whose elements are referenced
by index position.
5. Dictionary(K,V) – a collection of association pairs <K,V>, where
each key K is unique.
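Rough Python analogues of these constructors can make the distinctions concrete (illustrative only; Python has no built-in bag type, so a list stands in, and the sample values are invented):

```python
s = {1, 2, 3}                        # Set(T): unordered, no duplicates
bag = [1, 1, 2, 3]                   # Bag(T): duplicates allowed
lst = ["a", "b", "c"]                # List(T): order of elements matters
arr = ["x", "y", "z"]                # Array(T): elements referenced by index position
d = {"E01": "Abebe", "E02": "Sara"}  # Dictionary(K,V): association pairs, unique keys

assert 2 in s and len(s) == 3        # duplicates collapse in a set
assert bag.count(1) == 2             # duplicates preserved in a bag
assert arr[1] == "y"                 # index-based access
assert d["E02"] == "Sara"            # lookup by unique key
print("collection constructors illustrated")
```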
Core Concepts of OODBMS
(Objects, classes, interfaces, inheritance , encapsulation, polymorphism)
In the ODMG Object Model there are two ways to specify object types:
§ Interfaces, and
§ classes.
Interface --- defines only the abstract behavior of an object type, using
operation signatures.
Allows behavior to be inherited by other interfaces and classes using
the ‘:’ symbol.
Properties (attributes and relationships) cannot be inherited.
Interfaces are noninstantiable.
Classes
Class defines both the abstract state and behavior of an object type.
Class is instantiable (thus, interface is an abstract concept and class is
an implementation concept).
Use the extends keyword to specify single inheritance between classes.
Multiple inheritance is not allowed.
Classes encapsulate data + methods + relationships
In OODBMSs objects are persistent (unlike OOP languages)
The interface part of an operation is sometimes called the signature,
and the implementation is sometimes called the method.
Classes
• Classes are blueprints or templates for creating objects.
• A class defines the common properties and behaviors that
objects of the same type possess.
• It specifies the attributes and methods.
• A class has:
– A name
– A set of attributes
– A set of methods
– A set of constraints
Extents and keys
Extents and keys are specified during class definition:
Extent is the set of all instances of a given type within a particular ODMS.
Deleting an object removes the object from the extent of the type.
Key uniquely identifies the instances of a type (similar to the concept of a
candidate key).
A type must have an extent to have a key.
A key is different from an object name; key is composed of properties
specified in an object type’s interface whereas an object name is defined
within the database type.
Abstraction, Encapsulation, and Information Hiding …
Abstraction
▪ Abstraction focuses on hiding unnecessary details and exposing only
relevant information.
▪ It provides a simplified and conceptual view of objects and their
interactions.
▪ Abstraction helps in managing complexity and improving code
maintainability.
Encapsulation ….
• Encapsulation is the concept of bundling
data and methods together within a class.
• Encapsulation helps achieve data abstraction, security, and code
reusability.
• Information hiding - separate the external aspects of an object from its
internal details, which are hidden from the outside world.
• Focus more on data security
• To encourage encapsulation, an operation is defined in two parts.
• Signature or interface of the operation specifies the operation name and
arguments (or parameters).
• Method or body specifies the implementation of the operation
• Operations can be invoked by passing a message.
Inheritance …
A class can be defined in terms of another one.
Allows the definition of new types based on other predefined types,
leading to a type (or class) hierarchy.
It enables the reuse of attributes and methods from a base class
(superclass) in derived classes (subclasses).
Subclasses inherit the properties of the superclass and can extend or
modify them as needed.
Example
Person is the superclass and Student is the subclass.
The Student class inherits the attributes and operations of Person.
(Diagram:
Person – name: {firstName: string, middleName: string, lastName: string},
address: string, birthDate: date; age(): Integer,
changeAddress(newAdd: string).
Student – regNum: string {PK}, major: string; register(C: Course): boolean.)
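The Person/Student example can be sketched in Python; the attribute and method names follow the diagram, and the age computation is one reasonable implementation, not the only one:

```python
from datetime import date

class Person:
    def __init__(self, name, address, birth_date):
        self.name = name
        self.address = address
        self.birth_date = birth_date

    def age(self, today):
        # Behavior: age is computed from birthDate and the given date
        before_birthday = (today.month, today.day) < (self.birth_date.month,
                                                      self.birth_date.day)
        return today.year - self.birth_date.year - before_birthday

    def change_address(self, new_addr):
        self.address = new_addr

class Student(Person):  # Student inherits the attributes and operations of Person
    def __init__(self, name, address, birth_date, reg_num, major):
        super().__init__(name, address, birth_date)
        self.reg_num = reg_num  # ...and extends them with its own attributes
        self.major = major

s = Student("Hana", "Addis Ababa", date(2000, 5, 1), "R123", "CS")
print(s.age(date(2024, 4, 30)))  # -> 23 (the 2024 birthday not yet reached)
```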
Types of inheritance(Multiple and Selective Inheritance)
Multiple Inheritance – a class inherits features from more
than one superclass.
Example: Engineer_Manager, which is a subtype of both Engineer and Manager.
This leads to the creation of a type lattice rather than a type hierarchy.
Selective Inheritance - occurs when a subtype inherits only
some of the functions of a super type.
The mechanism of selective inheritance is not typically
provided in ODBs,
It is used more frequently in artificial intelligence applications.
Polymorphism
Polymorphism - the ability to appear in many forms.
Meaning
o Ability to process objects differently depending on their data type
or class.
o It is the ability to redefine methods for derived classes.
▪ Overloading –
▪ allows the name of a method to be reused within a class definition.
▪ Overriding –
▪allows the name of a property to be redefined in a subclass.
▪ Dynamic binding - allows the determination of an object’s type and methods
to be deferred until runtime.
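A minimal Python sketch of overriding and dynamic binding (the Shape/Square/Circle names are invented for illustration):

```python
class Shape:
    def area(self):
        raise NotImplementedError  # to be overridden by subclasses

class Square(Shape):
    def __init__(self, side):
        self.side = side
    def area(self):  # overriding: area() is redefined in the subclass
        return self.side ** 2

class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius
    def area(self):
        return 3.14159 * self.radius ** 2

# Dynamic binding: which area() runs is decided at runtime
# from each object's actual class, not the declared type.
shapes = [Square(3), Circle(1)]
areas = [round(shape.area(), 2) for shape in shapes]
print(areas)  # -> [9, 3.14]
```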
(Figure: an example contrasting overriding and overloading.)
Extensibility
Extensibility allows the creation of new data types, i.e. user-defined
types, and operations from built-in atomic data types and user
defined data types using the type constructor.
A type constructor is a mechanism for building new domains.
A complex object is built using type constructors such as sets,
tuples, lists and nested combinations.
A combination of a user-defined type and its associated
methods is called an abstract data type (ADT).
Versioning
An object version represents an identifiable state of an object.
A version history represents the evolution of an object.
The process of maintaining the evolution of objects is known as
version management.
Overview of ODL & OQL
The Object Definition Language (ODL) is a language for defining
the specifications of object types for ODMG-compliant systems.
The ODL defines the attributes, relationships and signature of
the operations, but it does not address the implementation of
signatures.
Overview of ODL & OQL
Object Query Language (OQL) provides declarative access to
the object database using an SQL-like syntax.
It does not provide explicit update operators, but leaves this to the
operations defined on object types.
An OQL query is a function that delivers an object whose type may
be inferred from the operator contributing to the query expression.
OQL can be used for both associative and navigational access.
Querying object-relational database
Most relational operators work on the object-relational tables
E.g., selection, projection, aggregation, set operations
Several major software companies including IBM, Informix,
Microsoft, Oracle, and Sybase have all released object-relational
versions of their products.
SQL-99 (SQL3): Extended SQL to operate on object-relational
databases
Brain Storming
1. Can you provide an overview of what SQL is and the basic
commands for querying and manipulating data?
2. What are the categories of SQL?
3. List the commands of DDL, DML, DCL and TCL.
Assignment-1(Individual Ass 1 : 6%)
1. Develop a schema using ODL for the following object types:
• Project, document, project-leader and research-paper. Use
the following relations between the object types: project
has a set of documents and a project leader. Project leaders
publish research-papers. Make appropriate assumptions on
the cardinalities of the relations. Use document as a supertype,
and research-paper as a subtype of document.
Lab Assignment-1(Group ass1: 7%)
Chapter Two
QUERY PROCESSING & OPTIMIZATION
Query Processing and Optimization: Outline
▪ Query processing
▪ Operator Evaluation Strategies
▪ Selection
▪ Join
▪ Query Optimization
▪ Heuristic query optimization
▪ Cost-based query optimization
▪ Measures of Query Cost
▪ Query Tuning
Overview of Query Processing
❖ Query processing – the activities involved in parsing,
validating, optimizing, and executing a query.
❖ Aims
❖ To transform a query written in a high-level language,
typically SQL, into a correct and efficient execution strategy
expressed in a low-level language (implementing the relational
algebra), and
❖ To execute the strategy to retrieve the required data.
Query Processing
❖ Example – SELECT sName FROM Student;
Scanning – scans the query’s keywords, symbols, attributes, and
table names, checking it word by word:
SELECT (keyword) sName (attribute) FROM (keyword) Student (table name) ; (symbol)
Parsing – checks the validity, syntax, and order/structure of a query,
and indicates when a parse error is encountered, e.g.:
SELECT Student FROM sName ; (attribute and table name swapped)
SELECT eName FROM Student ;
Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
Figure: the query-processing pipeline. A query such as
SELECT sName FROM Student ; is passed to the scanner and parser,
translated into a relational algebra form and then an intermediate
form (RA/RC) represented as a query tree or query graph; the query
optimizer chooses an execution strategy so that the query will be
executed in the shortest time; the code generator converts the plan
to machine code, which the runtime DB processor executes.
Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
§ DBMS has algorithms to implement relational algebra expressions
§ SQL is a high-level declarative language: you specify what is wanted,
not how it is obtained
Query optimization:
❖ The activity of choosing an efficient execution strategy for
processing a query.
❖ Task: Find an efficient physical query plan (aka execution plan) for an
SQL query
Goal: Minimize the evaluation time for the query, i.e., compute
query result as fast as possible
Cost Factors: Disk accesses, read/write operations, [I/O, page
transfer] (CPU time is typically ignored)
Optimization: find the most efficient evaluation plan for a query because
there can be more than one way.
Examples:
❖ Find all Managers who work at a London branch.
SELECT * FROM Staff s, Branch b WHERE s.branchNo = b.branchNo
AND (s.position = ‘Manager’ AND b.city = ‘London’);
The equivalent relational algebra queries (the different strategies)
corresponding to this SQL statement are:
(1) σ(position=‘Manager’) ∧ (city=‘London’) ∧ (Staff.branchNo=Branch.branchNo)(Staff × Branch)
(2) σ(position=‘Manager’) ∧ (city=‘London’)(Staff ⋈branchNo Branch)
(3) (σposition=‘Manager’(Staff)) ⋈branchNo (σcity=‘London’(Branch))
Cost Comparison
❖ Cost (in disk accesses) are:
(1) (1000 + 50) + 2*(1000 * 50) = 101 050
(2) 2*1000 + (1000 + 50) = 3 050
(3) 1000 + 2*50 + 5 + (50 + 5) = 1 160
❖ The third option significantly reduces size of relations being joined together.
❖ Cartesian product and join operations are much more expensive than
selection.
We will see shortly that one of the fundamental strategies in query
processing is to perform the unary operations, Selection and Projection,
as early as possible, thereby reducing the operands of any subsequent
binary operations.
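The slide's arithmetic can be reproduced directly; the read/write breakdown in the comments follows the usual textbook accounting (1000 Staff tuples, 50 Branch tuples, 50 Managers, 5 London branches):

```python
n_staff, n_branch, n_mgr, n_london = 1000, 50, 50, 5

# (1) Cartesian product: read both relations, then write and re-read the product.
cost1 = (n_staff + n_branch) + 2 * (n_staff * n_branch)
# (2) Join first: read both relations, then write and re-read the 1000-tuple result.
cost2 = (n_staff + n_branch) + 2 * n_staff
# (3) Select each side first (read + write each), then read both small
#     intermediates for the join.
cost3 = (n_staff + n_mgr) + (n_branch + n_london) + (n_mgr + n_london)

print(cost1, cost2, cost3)  # -> 101050 3050 1160
```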
Phases of query processing
▪ Query Decomposition
Transform high-level query into RA query.
Check that query is syntactically and semantically correct.
▪ Typical stages are:
▪ Analysis,
▪ Normalization,
▪ Semantic analysis,
▪ Simplification,
▪ Query restructuring.
▪ Analysis
▪ Analyze query lexically and syntactically using compiler techniques.
▪ Verify relations and attributes exist.
▪ Verify operations are appropriate for object type.
Analysis
▪ Finally, the query is transformed into a query tree, constructed as follows:
Leaf node for each base relation.
Non-leaf node for each intermediate relation produced by RA operation.
Root of tree represents query result.
Sequence is directed from leaves to root.
Normalization
Query normalization is the process of converting a user’s query
into a standardized format so that it can be matched with relevant
documents or data.
A predicate can be converted into one of two forms:
Conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')
Disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
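That the two normal forms are equivalent to the original predicate can be checked exhaustively over the three boolean atoms (m: position = 'Manager', s: salary > 20000, b: branchNo = 'B003'):

```python
from itertools import product

orig = lambda m, s, b: (m or s) and b            # original predicate
cnf  = lambda m, s, b: (m or s) and b            # (m OR s) AND b
dnf  = lambda m, s, b: (m and b) or (s and b)    # (m AND b) OR (s AND b)

# Check all 8 truth assignments: the three forms always agree.
equivalent = all(orig(m, s, b) == cnf(m, s, b) == dnf(m, s, b)
                 for m, s, b in product([False, True], repeat=3))
print(equivalent)  # -> True
```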
Semantic Analysis
▪ Rejects normalized queries that are incorrectly formulated or
contradictory.
▪ Query is incorrectly formulated if components do not contribute to
generation of result.
▪ Query is contradictory if its predicate cannot be satisfied by any tuple.
▪ Algorithms to determine correctness exist only for queries that do not
contain disjunction and negation.
Semantic Analysis
To detect
➠ connection graph (query graph)
➠ join graph
Relation connection graph
▪ If the relation connection graph is not fully
connected, the query is not correctly
formulated.
▪ Here the join condition
([Link] = [Link]) has been omitted.
Example 2
SELECT Ename, Resp FROM Emp, Works, Project WHERE Emp.Eno
= Works.Eno AND Works.Pno = Project.Pno AND Pname =
‘CAD/CAM’ AND Dur > 36 AND Title = ‘Programmer’
If the query graph is connected, the query is semantically correct.
Simplification- Removing unnecessary predicates that don't affect the
query outcome.
• Detects redundant qualifications,
• Eliminates common sub-expressions,
• Transforms query to semantically equivalent but
more easily and efficiently computed form.
➢ Apply well-known transformation rules of Boolean algebra.
Example
SELECT TITLE FROM Emp E WHERE (NOT (TITLE = “Programmer”) AND
TITLE = “Programmer”) OR (TITLE = “Electrical Eng.” AND NOT
(TITLE = “Electrical Eng.”)) OR ENAME = “J. Doe”;
is equivalent to
SELECT TITLE FROM Emp E WHERE ENAME = “J. Doe”;
Restructuring – transforming the query into a formal relational
algebra expression suitable for optimization.
Convert SQL to relational algebra.
Make use of query trees.
Example: SELECT Ename FROM Emp,
Works, Project WHERE Emp.Eno =
Works.Eno AND Works.Pno = Project.Pno
AND Ename <> ‘J. Doe’ AND Pname =
‘CAD/CAM’ AND (Dur = 12 OR Dur = 24)
Query tree
Query tree is a data structure that corresponds to a relational algebra
expression
Input relations of the query as leaf nodes
Relational algebra operations as internal nodes
An execution of the query tree consists of executing internal node
operations
Query graph
Query graph is a graph data structure that corresponds to a
relational calculus expression.
It does not indicate an order on which operations to perform first.
There is only a single graph corresponding to each query.
Transformation Rules for RA Operations
1. Cascade of Selection:
Conjunctive Selection operations can cascade into individual Selection
operations (and vice versa).
2. Commutativity of Selection
3. In a sequence of Projection operations, only the last in the sequence is
required.
∏Col_list1(∏Col_list2(…(∏Col_listN(T))…)) = ∏Col_list1(T)
∏Std_id, Std_name(∏Std_id, Std_name, Age, Address(∏Std_id, Std_name,
Age, Address, Class_id, Skills(Student))) = ∏Std_id, Std_name(Student)
Con…
4. Commutativity of Selection and Projection.
If predicate p involves only attributes in projection list, Selection and
Projection operations commute:
Con…
5. Commutativity of Theta join (and Cartesian product).
Rule also applies to Equijoin and Natural join.
Example:
6. Commutativity of Selection and Theta join (or Cartesian product)
If selection predicate involves only attributes of one of join
relations, Selection and Join (or Cartesian product) operations
commute:
If the selection predicate is a conjunctive predicate of the form (p ∧ q),
where p involves only attributes of R and q only attributes of S,
the Selection and Theta join operations commute as:
7. Commutativity of Projection &Theta join (or Cartesian product)
8. Commutativity of Union and Intersection (but not Set difference)
R ∪ S = S ∪ R
R ∩ S = S ∩ R
9. Commutativity of Selection and set operations (Union,
Intersection, and Set difference)
σp(R ∪ S) = σp(S) ∪ σp(R)
σp(R ∩ S) = σp(S) ∩ σp(R)
σp(R − S) = σp(R) − σp(S)
10. Commutativity of Projection and Union
∏L(R ∪ S) = ∏L(S) ∪ ∏L(R)
11. Associativity of Union and Intersection (but not Set difference)
(R ∪ S) ∪ T = S ∪ (R ∪ T), (R ∩ S) ∩ T = S ∩ (R ∩ T)
12. Associativity of Theta join (and Cartesian product)
▪ Cartesian product and Natural join are always associative.
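Several of these rules can be spot-checked in Python by modeling relations as sets of tuples and a selection as a filter (the sample tuples are invented):

```python
R = {("Abebe", "Manager"), ("Sara", "Clerk")}
S = {("Lily", "Manager"), ("Sara", "Clerk")}
select = lambda rel: {t for t in rel if t[1] == "Manager"}  # selection on position

# Rule 9: selection distributes over the set operations.
assert select(R | S) == select(R) | select(S)
assert select(R & S) == select(R) & select(S)
assert select(R - S) == select(R) - select(S)  # operand order is preserved

# Rule 8: union and intersection commute (set difference does not).
assert (R | S) == (S | R) and (R & S) == (S & R)
print("rules hold on this instance")
```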
2. Query Optimization
❖ Query optimization is the process of improving the performance
of database queries by minimizing the time and resources required
to execute them.
❖ Optimization – not necessarily “optimal”, but reasonably efficient
❖ Techniques:
Heuristic rules
▪ Query tree (relational algebra) optimization
▪ Query graph optimization
Cost-based (physical) optimization
▪ Cost estimation(Comparing costs of different plans)
a. Heuristic based Processing Strategies
► Perform Selection operations as early as possible.
► Keep predicates on same relation together.
► Combine Cartesian product with subsequent Selection whose predicate
represents join condition into a Join operation.
► Use associativity of binary operations to rearrange leaf nodes so leaf nodes
with most restrictive Selection operations executed first.
► Perform Projection as early as possible.
► Keep projection attributes on same relation together.
► Compute common expressions once.
► If common expression appears more than once, and result not too large,
store result and reuse it when required.
Examples
What are the names of customers living on Elm Street who have
checked out “Terminator”?
SQL query:
SELECT Name FROM Customer CU, CheckedOut CH, Film F WHERE Title =
’Terminator’ AND [Link] = [Link] AND [Link] = [Link]
AND [Link] = ‘Elm’
Apply Selections Early
Apply More Restrictive Selections Early
Form Joins
Apply Projections Early
Cost- Based Optimization
Cost-based optimization is a technique used in query optimization that
involves analyzing the cost of different execution plans and selecting the most
efficient one.
This is typically done by estimating the cost of each possible execution plan
based on factors such as the number of rows to be processed, the complexity
of the query, and the available resources.
Cost can be CPU time, I/O time, communication time, main memory
usage, or a combination.
The candidate query tree with the least total cost is selected for execution.
Measures of Query Cost
▪ There are many possible ways to estimate cost, e.g., based on
disk accesses, CPU time, or communication overhead.
▪ Disk access is the cost of block transfers from/to disks.
▪ Simplifying assumption: each block transfer has the same cost
▪ Cost of algorithm (e.g. join or selection) depends on database buffer size;
▪ More memory for DB buffer reduces disk accesses.
▪ Thus DB buffer size is a parameter for estimating cost.
▪ We refer to the cost estimate of algorithm S as cost(S).
▪ We do not consider cost of writing output to disk.
Selectivity and Cost Estimates in Query
Optimization
Database Statistics
– For each base relation R
– nTuples(R) – the number of tuples (records) in relation R (its cardinality).
– bFactor(R) – the blocking factor of R (that is, the number of tuples of R that fit
into one block).
– nBlocks(R) – the number of blocks required to store R. If the tuples of R are
stored physically together, then:
– nBlocks(R) = [nTuples(R)/bFactor(R)]
– We use [x] to indicate that the result of the calculation is rounded to the
smallest integer that is greater than or equal to x.
For each attribute A of base relation R
nDistinctA(R) – the number of distinct values that appear for
attribute A in relation R.
minA(R),maxA(R) – the minimum and maximum possible
values for the attribute A in relation R.
SCA(R) – the selection cardinality of attribute A in relation R.
This is the average number of tuples that satisfy an equality
condition on attribute A. Assuming attribute values are
uniformly distributed, SCA(R) is calculated as:
SCA(R) = nTuples(R) / nDistinctA(R)
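Assuming uniformly distributed values, SCA(R) = nTuples(R) / nDistinctA(R). These statistics compose directly; a quick sketch with invented figures:

```python
from math import ceil

n_tuples = 10000     # nTuples(R)
b_factor = 20        # bFactor(R): tuples per block
n_distinct_A = 50    # nDistinctA(R)

n_blocks = ceil(n_tuples / b_factor)   # nBlocks(R) = ceil(nTuples(R) / bFactor(R))
sc_A = n_tuples / n_distinct_A         # SCA(R): average tuples matching A = constant

print(n_blocks, sc_A)  # -> 500 200.0
```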
Selection Operation
Cost of Operations
▪ Cost = I/O cost + CPU cost
▪ I/O cost: # pages (reads & writes) or # operations (multiple pages)
▪ CPU cost: # comparisons or # tuples processed
▪ I/O cost dominates (for large databases)
▪ Cost depends on
▪ Types of query conditions
▪ Availability of fast access paths
▪ DBMSs keep statistics for cost estimation
Simple Selection
Simple selection: σA op a(R)
A is a single attribute, a is a constant, and op is one of =, ≠, <, ≤, >, ≥.
(The ≠ case is not discussed further, since it generally requires a
sequential scan of the table.)
How many tuples will be selected?
Selectivity factor SFA op a(R): the fraction of tuples of R satisfying
“A op a”
0 ≤ SFA op a(R) ≤ 1
Number of tuples selected: NS = nR × SFA op a(R)
Options of Simple Selection
Sequential (linear) Scan
General condition: cost = bR
Equality on key: average cost = bR / 2
Binary Search
Records are stored in sorted order.
Equality on key: cost = log2(bR)
Equality on non-key (duplicates allowed):
cost = log2(bR) + ⌈NS/bfR⌉ − 1
(binary search to locate the first match, plus the blocks holding
the remaining matching tuples)
Example: Cost of Selection
• Relation: R(A, B, C)
• nR = 10000 tuples
• bfR = 20 tuples/page
• dist(A) = 50, dist(B) = 500
• B+ tree clustering index on A with order 25 (p=25)
• B+ tree secondary index on B w/ order 25
• Query:
• select * from R where A = a1 and B = b1
• Relational Algebra: A=a1 B=b1 (R)
Example: Cost of Selection (cont.)
• Option 1: Sequential Scan
• Must read through the entire relation
• Cost = bR = 10000/20 = 500
• Option 2: Binary Search using A = a
• It is sorted on A (why?)
• NS = 10000/50 = 200
• assuming equal distribution
• Cost = log2(bR) + NS/bfR - 1
• = log2(500) + 200/20 - 1 = 18
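The two options can be recomputed in Python, with explicit ceilings where the slide rounds (same figures as above):

```python
from math import ceil, log2

nR, bfR, dist_A = 10000, 20, 50
bR = nR // bfR                 # 500 blocks

# Option 1: a sequential scan reads every block.
cost_seq = bR

# Option 2: binary search on the sort attribute A, equality on a non-key.
NS = nR // dist_A              # 200 matching tuples, assuming equal distribution
cost_bin = ceil(log2(bR)) + ceil(NS / bfR) - 1

print(cost_seq, cost_bin)  # -> 500 18
```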
Cost of Join
Cost = # I/O reading R & S + # I/O writing result
Additional notation:
M: # buffer pages available to join operation
LB: # leaf blocks in B+ tree index
Limitation of cost estimation
Ignoring CPU costs
Ignoring timing
Ignoring double buffering requirements
Estimate Size of Join Result
How many tuples in join result?
Cross product (special case of join):
NJ = nR × nS
R.A is a foreign key referencing S.B:
NJ = nR (assuming no null values)
S.B is a foreign key referencing R.A:
NJ = nS (assuming no null values)
Both R.A and S.B are non-key:
NJ = min( nR × nS / dist(R.A), nR × nS / dist(S.B) )
Estimate Size of Join Result (cont.)
How wide is a tuple in the join result?
Natural join: W = W(R) + W(S) − W(R ∩ S) (common attributes counted once)
Theta join: W = W(R) + W(S)
What is the blocking factor of the join result?
bfJoin = ⌊block size / W⌋
How many blocks does the join result have?
bJoin = ⌈NJ / bfJoin⌉
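These estimates compose as below; the relation sizes, attribute widths, and block size are invented for illustration:

```python
import math

nR, nS = 10000, 5000
dist_RA, dist_SB = 50, 100

# Both join attributes non-key: NJ = min(nR*nS/dist(R.A), nR*nS/dist(S.B))
NJ = min(nR * nS / dist_RA, nR * nS / dist_SB)

# Natural join width: the common attributes are counted once.
W_R, W_S, W_common = 100, 80, 20   # widths in bytes
W = W_R + W_S - W_common

block_size = 4000
bf_join = block_size // W          # blocking factor of the result
b_join = math.ceil(NJ / bf_join)   # blocks occupied by the result

print(NJ, W, bf_join, b_join)  # -> 500000.0 160 25 20000
```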
Query Execution Plans
A query execution plan is a blueprint that outlines the specific
steps the DBMS will take to retrieve data for a given SQL query.
It essentially acts like a roadmap, detailing the most efficient way
to access and process data based on the structure of the database
and the query itself.
Materialized evaluation - the result of an operation is stored as a
temporary relation.
Pipelined evaluation - takes a more streamlined approach.
The results of each operation are passed directly to the next operation in
the pipeline, without creating intermediate temporary files.
Query Tuning
Query tuning is the process of optimizing SQL queries to improve
their performance and efficiency.
The goal is to retrieve the desired data as quickly and with as few
resources as possible.
Tasks include:
Proper Table Indexing
Denormalization
Avoiding Unnecessary Operations
Optimize WHERE Clause Conditions
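Proper indexing is easy to observe in practice. A sketch using SQLite's EXPLAIN QUERY PLAN (the emp table and idx_ssn index are hypothetical, and the exact plan text varies between SQLite versions):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ssn INTEGER, lname TEXT)")
con.executemany("INSERT INTO emp VALUES (?, ?)",
                [(i, "name%d" % i) for i in range(1000)])

def plan(sql):
    # the last column of each EXPLAIN QUERY PLAN row is a textual description
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

q = "SELECT lname FROM emp WHERE ssn = 42"
before = plan(q)    # without an index: a full table scan ("SCAN ...")

con.execute("CREATE INDEX idx_ssn ON emp (ssn)")
after = plan(q)     # with the index: "... USING INDEX idx_ssn ..."
```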
Wrap up
The goal of query processing and optimization is to find the data you need
as quickly and efficiently as possible.
The process of transforming a user's query into an actual result involves
several steps.
— Parsing and translation
— Optimization
— Evaluation
Query optimization is the process of fine-tuning SQL queries to retrieve the
desired data as quickly as possible with minimal resource consumption.
— Heuristic rule
— Physical cost estimations
Assignment -2 (Individual Ass2: 7%)
1. Using the heuristic algorithm, optimize the following SQL query.
• SQL query: SELECT LNAME FROM EMPLOYEE, WORKS_ON,
PROJECT WHERE PNAME = ‘AQUARIUS’ AND
PNUMBER=PNO AND ESSN=SSN
AND BDATE > ‘1957-12-31’;
2. Work individually on the following cases.
Advanced Database System(CoSc2042)
Chapter – 3
TRANSACTION PROCESSING
Chapter Outline
01: Introduction to Transaction Processing
02: Transaction and System Concepts
03: Desirable Properties of Transactions
04: Characterizing Schedules based on Recoverability & Serializability
05: Transaction Support in SQL
Definition of transactions
— A transaction is a unit of a program execution that accesses and
possibly modifies various data objects (tuples, relations).
— Transactions are units or sequences of work accomplished in a
logical order, whether manually by a user or automatically by
some sort of database program.
— A transaction (set of operations) may be specified in SQL, or
may be embedded within a program.
Transaction and System Concepts
Single-User System: At most one
user at a time can use the system.
Multiuser System: Many users can
access the system concurrently.
Concurrency: means allowing more
than one transaction to run
simultaneously on the same database.
Interleaved processing: Concurrent
execution of processes is interleaved
in a single CPU.
(Figure shows interleaved processing versus
parallel processing of concurrent transactions.)
Parallel processing:
– Processes are concurrently
executed in multiple CPUs.
Transaction boundaries:
▪ Begin and End transaction.
▪ An application program may contain several transactions, separated by
Begin and End transaction boundaries.
▪ Suppose a bank employee transfers $500 from A‘s account to B's account.
▪ This very simple and small transaction involves several low-level tasks.
Simple Model of a Database
▪ A database - collection of named data items
▪ Granularity of data - a field, a record, or a whole disk block
▪ Basic operations are read and write
▪ read(A, x): assign value of database object A to variable x
▪ write(x, A): write value of variable x to database object A
▪ Example: Let T1 be a transaction that transfers $500 from account A to account B.
This transaction can be defined as :
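The read/write model can be sketched in a few lines; here the database is just a dictionary and the starting balances are hypothetical:

```python
db = {"A": 1000, "B": 500}   # database objects (hypothetical balances)

def read_item(name):
    return db[name]          # read(A, x): copy a db object into a local variable

def write_item(name, value):
    db[name] = value         # write(x, A): copy a local variable back to the db

# T1: transfer $500 from account A to account B
x = read_item("A")
x = x - 500
write_item("A", x)
y = read_item("B")
y = y + 500
write_item("B", y)
```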
READ/ WRITE OPERATIONS
PROPERTIES OF TRANSACTIONS
Properties of a transaction generally called ACID
properties.
Those are;
Atomicity
Consistency preservation
Isolation
Durability (permanency)
Atomic transactions
• Atomicity: A transaction is an atomic unit of processing; it is either
performed in its entirety or not performed at all.
• Example: John wants to move $200 from his savings account to his
checking account.
1) Money must be subtracted from savings account.
2) Money must be added to checking account.
• If both happen, John and the bank are both happy.
If neither happens, John and the bank are both happy.
If only one happens, either John or the bank will be unhappy.
John’s transfer must be all or nothing.
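The all-or-nothing behaviour can be demonstrated with SQLite, whose connections roll back a partial transaction on failure. A sketch (the account names and amounts are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
con.executemany("INSERT INTO account VALUES (?, ?)",
                [("savings", 500), ("checking", 100)])
con.commit()

def transfer(amount, fail=False):
    try:
        con.execute("UPDATE account SET balance = balance - ? "
                    "WHERE name = 'savings'", (amount,))
        if fail:
            raise RuntimeError("crash between the two updates")
        con.execute("UPDATE account SET balance = balance + ? "
                    "WHERE name = 'checking'", (amount,))
        con.commit()                  # both updates become permanent together
    except RuntimeError:
        con.rollback()                # undo the half-done transfer

transfer(200, fail=True)   # simulated failure: nothing changes
transfer(200)              # success: both updates applied
balances = dict(con.execute("SELECT name, balance FROM account"))
```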
Consistency
• A correct execution of the transaction must take the database from one
consistent state to another.
• Example: Wilma tries to withdraw $1000 from account 387.
Transactions are consistent
▪ A transaction must leave the database in valid state.
▪ valid state == no constraint violations
▪ Constraint is a declared rule defining /specifying database states
▪ Constraints may be violated temporarily …
but must be corrected before the transaction completes.
Isolation
• The execution of a transaction should not be interfered with by other
transactions executing concurrently; its intermediate results must be
hidden from them.
• Example: while John’s transfer is in progress, no other transaction should
see the state where the money has left savings but not yet reached checking.
Durability
• Once a transaction changes the database and the changes are committed,
these changes must never be lost because of subsequent failure.
Concurrency Control
Isolation (+ Consistency) => Concurrency Control
▪ Concurrency means allowing more than one transaction to run simultaneously
on the same database.
▪ When several transactions run concurrently database consistency can be
destroyed.
▪ It is meant to coordinate simultaneous transactions while preserving data
integrity.
▪ It controls multi-user access to the database.
WHY CONCURRENCY CONTROL IS NEEDED?
▪ Several problems can occur when transactions run in an uncontrolled manner
The Lost Update Problem.
▪ This occurs when two transactions that access the same database items have
their operations interleaved in a way that makes the value of some database
item incorrect.
The update performed by T1 gets
lost;
possible solution:
T1 locks/unlocks database object A
⇒ T2 cannot read A while A is
modified by T1
Example
▪ The Lost Update Problem
T1                    T2                    State of X
read_item(X);                               20
X := X+10;            read_item(X);         20
                      X := X+20;
                      write_item(X);        40
                      commit;
write_item(X);                              30   (lost update)
commit;
✓ Changes of T2 are lost.
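The same interleaving can be replayed deterministically in code; each transaction keeps a local copy between its read and its write, exactly as in the schedule above:

```python
X = {"value": 20}      # the shared database item

x1 = X["value"]        # T1: read_item(X)  -> 20
x2 = X["value"]        # T2: read_item(X)  -> 20 (before T1 writes!)
x2 = x2 + 20           # T2: X := X + 20
X["value"] = x2        # T2: write_item(X) -> 40
x1 = x1 + 10           # T1: X := X + 10
X["value"] = x1        # T1: write_item(X) -> 30, overwriting T2's update
# X ends at 30; either serial order would give 50
```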
▪ The Temporary Update (or Dirty Read) Problem.
▪ This occurs when one transaction updates a database item
and then the transaction fails for some reason, and the
updated item is accessed by another transaction before it is
changed back to its original value.
T1 modifies a db object, and then the
transaction T1 fails for some reason.
Meanwhile the modified db object,
however, has been accessed by another
transaction T2. Thus T2 has read data
that “never existed”.
▪ Example
▪The Temporary Update/ Dirty Read Problem
T1                    T2                    State of X    sum
read_item(X);                               20            0
X := X+10;
write_item(X);        (dirty update)        30
                      read_item(X);         30
                      sum := sum+X;
                      write_item(sum);                    30
X := X+10;
write_item(X);                              40
rollback;
                      commit;
✓ T2 sees dirty data of T1.
▪ The Incorrect Summary Problem
▪ If one transaction is calculating an aggregate summary function on a
number of records while other transactions are updating some of these
records, the aggregate function may calculate some values before they are
updated and others after they are updated.
In this schedule, the total computed by T1 is wrong (off by 100).
⇒T1 must lock/unlock several db objects
▪ Example
▪The Incorrect Summary Problem
Let A=100, X=30, Y=10, sum=0
T1                    T2                    X     Y     sum
                      read_item(A);                     0
                      sum := sum+A;                     100
read_item(X);                               30
X := X-10;
write_item(X);                              20
                      read_item(X);         20
                      sum := sum+X;                     120
                      read_item(Y);               10
                      sum := sum+Y;                     130
read_item(Y);                                     10
Y := Y+10;
write_item(Y);                                    20
commit;               commit;
Incorrect summary
✓ T2 reads X after 10 is subtracted and reads Y before 10 is added, hence incorrect
summary.
▪ Unrepeatable read problem
▪ Here a transaction T1 reads the same item twice and the item is
changed by another transaction T2 between the reads, so T1 receives
different values for its two reads of the same item.
Q. Consider the schedule given below, in which, transaction T1 transfers
money from account A to account B and in the meantime, transaction T2
calculates the sum of 3 accounts namely, A, B, and C. The third column shows
the account balances and calculated values after every instruction is executed.
Discuss what problem is found in the schedule and what will be the correct
value of Accounts A, B & C averages?
➢ WHY RECOVERY IS NEEDED: (WHAT CAUSES A
TRANSACTION TO FAIL?)
➢ A computer failure (system crash)
➢ A transaction or system error
➢ Local errors or exception conditions detected by the
transaction:
➢ Concurrency control enforcement
➢ Disk failure
➢ Physical problems and catastrophes
Operations
▪ Recovery manager keeps track of the following operations:
▪ begin_transaction
▪ read or write
▪ end_transaction
▪ commit_transaction
▪ rollback (or abort)
▪ Recovery techniques use the following operators:
▪ Undo
▪ Redo
THE SYSTEM LOG
✓ Log or Journal: The log keeps track of all transaction
operations that affect the values of database items.
✓ Needed to permit recovery from transaction failures.
✓ The log is kept on disk, so it is not affected by any type of
failure except for disk or catastrophic failure.
✓ Log is periodically backed up to archival storage (tape) to
guard against such catastrophic failures.
✓ Each transaction has a unique transaction-id, generated
automatically by the system, that is recorded in the log.
Types of log record
– [start_transaction, T] - Indicates that transaction T has started execution.
– [write_item, T, X, old_value, new_value] - Indicates that transaction T
has changed the value of database item X from old_value to new_value.
– [read_item, T, X] - Indicates that transaction T has read the value of
database item X.
– [commit, T] - Indicates that transaction T has completed successfully, and
affirms that its effect can be committed (recorded permanently) to the database.
– [abort, T] - Indicates that transaction T has been aborted.
Commit Point of a Transaction
▪ Definition: It refers to the completion of the transaction.
▪ Transaction T reaches its commit point when,
▪ all its operations accessing DB are executed successfully and
changes are recorded in the log.
▪ Beyond the commit point, the transaction is said to be
committed, and its effect is assumed to be permanently
recorded in the database.
▪ The transaction then writes an entry [commit, T] into the log.
Con..
✓ Undo (Roll Back) of transactions:
▪ Needed for transactions that have a [start_transaction,T] entry in the
log but no [commit,T] entry in the log.
✓ Redoing transactions:
✓ Transactions that have commit entry in the log
✓ write entries are redone from log
✓ Force writing a log:
✓ Before a transaction reaches its commit point,
✓ Write log to disk
✓ This process is called force-writing the log file before committing a
transaction.
SCHEDULES
➢ When transactions are executing concurrently in an interleaved
fashion, the order of execution of operations from the various
transactions is known as a transaction schedule (or history).
• Transaction Schedule reflects
chronological order of operations
Characterizing schedules based on
• Recoverability - How good is the system at recovering from errors?
• Serializability - How easily can the system find schedules that allow
transactions to execute concurrently without interfering with one another?
Schedules classified on recoverability
▪ Recoverable schedule - once a transaction commits, it never needs to be rolled back.
Recoverable schedule
• Strict schedules are more restrictive than cascadeless schedules.
• All strict schedules are cascadeless schedules.
• Not all cascadeless schedules are strict schedules.
Characterizing schedules based on Serializability
▪ Serial schedule
• Transactions are ordered one after the other. Otherwise, the schedule is
called nonserial schedule.
▪ Serializable schedule
• A schedule is equivalent to some serial schedule of the same n transactions.
▪ Result equivalent
• Two schedules are producing the same final state of the database.
▪ Conflict equivalent
• The order of any two conflicting operations is the same in both schedules.
Figure 3.2 Examples of serial and nonserial schedules involving transactions
T1 and T2. (a) Serial schedule A: T1 followed by T2. (b) Serial schedule B: T2
followed by T1. (c) Two nonserial schedules C and D with interleaving of
operations.
Schedule Notation
• A more compact notation for schedules:
Transaction T3 (begin; read(Y); Y = Y+1; write(Y); end; commit)
is written: b3, r3(Y), w3(Y), e3, c3
In r3(Y), ‘r’ names the operation, ‘3’ the transaction, and ‘Y’ the data item.
note: we ignore the computations on the local copies of the data when
considering schedules (they're not interesting)
Examples
A serial schedule is one in which the transactions do not overlap (in
time).
These are all serial schedules for the three example transactions:

b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, b2,r2(X),w2(X),e2,c2, b3,r3(Y),w3(Y),e3,c3

b2,r2(X),w2(X),e2,c2, b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, b3,r3(Y),w3(Y),e3,c3

b2,r2(X),w2(X),e2,c2, b3,r3(Y),w3(Y),e3,c3, b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1

There are six possible serial schedules for three transactions
(n! possible serial schedules for n transactions).
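The count can be checked mechanically; a small sketch that builds every serial schedule for the three example transactions by concatenating them in each possible order:

```python
from itertools import permutations

transactions = {
    "T1": ["b1", "r1(X)", "w1(X)", "r1(Y)", "w1(Y)", "e1", "c1"],
    "T2": ["b2", "r2(X)", "w2(X)", "e2", "c2"],
    "T3": ["b3", "r3(Y)", "w3(Y)", "e3", "c3"],
}

# a serial schedule runs complete transactions one after another,
# so there is one schedule per ordering of the transactions: n! in total
serial_schedules = [
    sum((transactions[t] for t in order), [])
    for order in permutations(transactions)
]
print(len(serial_schedules))   # 3! = 6
```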
• Types of Serializability
– Conflict Serializability
– View Serializability:
▪ Conflict serializable:
▪ A schedule S is said to be conflict serializable if it is conflict equivalent
to some serial schedule S’.
• Being serializable is not the same as being serial
• Being serializable implies that the schedule is a correct schedule.
• It will leave the database in a consistent state.
• Serializability is hard to check.
• View serializability: definition of serializability based on view equivalence.
– A schedule is view serializable if it is view equivalent to a serial schedule.
• Interleaving of operations occurs in an operating system through some
scheduler
• Difficult to determine beforehand how the operations in a schedule will be
interleaved.
Fig 3.3. Conflicts between operations of two transactions:
Conflict Equivalence
• Two schedules are conflict equivalent if the order of any two conflicting
operations is the same in both schedules.
• Two operations conflict if:
– they access the same data item (read or write)
– they belong to different transactions
– at least one of them is a write
T1: b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1
T2: b2,r2(X),w2(X),e2,c2
conflicting operations: r1(X),w2(X); w1(X),r2(X); w1(X),w2(X)
– Find the conflicting operations.
Two operations are conflicting if changing their order can result in a
different outcome.
Example: Conflict Equivalence
schedule 1:
b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1,
b2,r2(X),w2(X),e2,c2
schedule 2: r1(X) < w2(X), w1(X) < r2(X), w1(X) < w2(X)
b2,r2(X),w2(X),
b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1, e2,c2
w2(X) < r1(X), r2(X) < w1(X), w2(X) < w1(X)
schedule 3:
b1,r1(X),w1(X),
b2,r2(X),w2(X),e2,c2, r1(Y),w1(Y),e1,c1,
r1(X) < w2(X), w1(X) < r2(X), w1(X) < w2(X)
Schedule 1 and schedule 3 are conflict equivalent; schedule 2 is not
conflict equivalent to either schedule 1 or 3.
Testing for Conflict Serializability
• Precedence graphs are a more efficient test
– graph indicates a partial order on the transactions required
by the order of the conflicting operations.
– the partial order must hold in any conflict equivalent serial
schedule
– if there is a loop in the graph, the partial order is not
possible in any serial schedule
– if the graph has no loops, the schedule is conflict serializable
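The precedence-graph test is easy to mechanize for schedules written in the compact notation (this sketch assumes single-character transaction ids and data items):

```python
def conflicts(op1, op2):
    """Two operations conflict: different transactions, same item, >= 1 write."""
    a1, t1, i1 = op1[0], op1[1], op1[3]   # e.g. "w1(X)" -> ('w', '1', 'X')
    a2, t2, i2 = op2[0], op2[1], op2[3]
    return t1 != t2 and i1 == i2 and "w" in (a1, a2)

def precedence_edges(schedule):
    ops = [op for op in schedule if op[0] in "rw"]   # ignore b/e/c entries
    return {(o1[1], o2[1])
            for i, o1 in enumerate(ops)
            for o2 in ops[i + 1:] if conflicts(o1, o2)}

def has_cycle(edges):
    # tiny DFS: the schedule is conflict serializable iff there is no cycle
    nodes = {n for edge in edges for n in edge}
    def reachable(a, b, seen=()):
        return any(y == b or (y not in seen and reachable(y, b, seen + (y,)))
                   for x, y in edges if x == a)
    return any(reachable(n, n) for n in nodes)

s3 = ["b1","r1(X)","w1(X)","b2","r2(X)","w2(X)","e2","c2","r1(Y)","w1(Y)","e1","c1"]
s4 = ["b2","r2(X)","b1","r1(X)","w1(X)","r1(Y)","w1(Y)","w2(X)","e1","c1","e2","c2"]
# schedule 3: all edges point T1 -> T2, no cycle -> conflict serializable
# schedule 4: edges in both directions, cycle    -> not conflict serializable
```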
Precedence Graph Examples: draw the precedence graph from the
conflicting operations between the transactions.
schedule 3:
b1,r1(X),w1(X),
b2,r2(X),w2(X),e2,c2, r1(Y),w1(Y),e1,c1,
Find the conflict operations ?
r1(X) < w2(X), w1(X) < r2(X), w1(X) < w2(X)
r1(X) < w2(X)
w1(X) < r2(X)     ⇒ edge T1 → T2
w1(X) < w2(X)     (arrows indicate that T1 precedes T2)
schedule 3 is conflict serializable
it is conflict equivalent to some serial schedule
in which T1 precedes T2
Precedence Graph Examples
schedule 2:
b2,r2(X),w2(X), b1,r1(X),w1(X),r1(Y),w1(Y),e1,c1,e2,c2
w2(X) < r1(X), r2(X) < w1(X), w2(X) < w1(X)
w2(X) < r1(X)
r2(X) < w1(X)     ⇒ edge T2 → T1
w2(X) < w1(X)
schedule 2 is conflict serializable
it is conflict equivalent to some serial schedule
in which T2 precedes T1.
Precedence Graph Examples
schedule 4:
b2,r2(X), b1,r1(X),w1(X),r1(Y),w1(Y),w2(X),e1,c1,e2,c2
r1(X) < w2(X), r2(X) < w1(X), w1(X) < w2(X)
r1(X) < w2(X)     ⇒ edge T1 → T2
r2(X) < w1(X)     ⇒ edge T2 → T1
w1(X) < w2(X)     ⇒ edge T1 → T2 (the graph contains a cycle)
schedule 4 is not conflict serializable
there is no serial schedule
in which T2 precedes T1 and T1 precedes T2
Transaction Support in SQL
• SQL Commands used to control transactions:
– COMMIT: to save the changes.
– ROLLBACK: to rollback the changes.
– SAVEPOINT: creates a point within a transaction to
which you can later ROLLBACK
– SET TRANSACTION: Places a name on a
transaction.
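The first three commands can be tried directly in SQLite (note that SQLite does not support SET TRANSACTION for naming transactions; the table t and savepoint sp1 here are made up). The connection is opened in autocommit mode so the transaction statements can be issued by hand:

```python
import sqlite3

con = sqlite3.connect(":memory:", isolation_level=None)  # autocommit mode
cur = con.cursor()
cur.execute("CREATE TABLE t (v INTEGER)")

cur.execute("BEGIN")                 # start the transaction
cur.execute("INSERT INTO t VALUES (1)")
cur.execute("SAVEPOINT sp1")         # mark a point inside the transaction
cur.execute("INSERT INTO t VALUES (2)")
cur.execute("ROLLBACK TO sp1")       # undo only the work done after sp1
cur.execute("COMMIT")                # save the surviving changes

rows = [v for (v,) in cur.execute("SELECT v FROM t")]
# rows == [1]: the first insert was committed, the second rolled back
```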
Assignment-3 (Individual Ass3: 5%)
Q1. Using the precedence graph as a method of checking serializability,
answer the following questions about this schedule:
S: r1(x) r2(z) r3(x) r1(z) r2(y) r3(y) w1(x) w2(z) w3(y) w2(y)
e1,c1,e2,c2,e3,c3
A. Find the ordering of the conflicting operations.
B. Is this schedule serializable?
C. Is the schedule correct?
Advanced Database system(CoSc2042)
Chapter -4
Concurrency Control
What is Concurrency Control?
▪ Concurrency control — ensuring that each user
appears to execute in isolation.
▪ It is the procedure in DBMS for managing simultaneous
operations without conflicting with each other.
▪ Why?
Lost Updates
Temporary update (dirty read)
Non-Repeatable Read
Incorrect Summary issue
Purpose of Concurrency Control
- To force isolation (through mutual exclusion) among
conflicting Transactions
- To preserve database consistency through consistency
preserving execution of transactions
- To resolve read-write and write-write conflicts
Concurrency Control Techniques
Various concurrency control techniques are:
• Two-Phase Locking Protocols
• Timestamp-Based Protocols
• Validation-Based Protocols
• Multi version concurrency control
Locking Techniques
– Locking is an operation which secures permission to read or
permission to write a data item.
– Two phase locking is a process used to gain ownership of shared
resources without creating the possibility of deadlock.
– The 3 activities taking place in the two phase update algorithm are:
(i). Lock Acquisition
(ii). Modification of Data
(iii). Release Lock
Rules in locking technique:
▪ LOCK(): must be issued by a transaction before any read() or
write() operation on the data item.
▪ LOCK(): can’t be issued by a transaction that already holds a
LOCK() on the data item.
▪ UNLOCK(): must be issued after all read() and write()
operations of the transaction are completed.
▪ UNLOCK(): can’t be issued by a transaction unless it already holds
the lock on the data item.
Example for binary lock
T1: LOCK(A)
T1: READ(A)       A=100
T1: A := A+200    A=300
T1: WRITE(A)      A=300
T1: UNLOCK(A)
T2: LOCK(A)
T2: READ(A)       A=300
T2: A := A+300    A=600
T2: WRITE(A)      A=600
T2: UNLOCK(A)
▪ This example is a serializable schedule.
▪ Therefore, in the case of a binary locking mechanism, at most one
transaction can hold the lock on a particular data item.
▪ Thus no transaction can access the same item concurrently.
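The at-most-one-holder rule can be sketched as a tiny lock table (a real lock manager would queue the refused request rather than simply deny it):

```python
class BinaryLockTable:
    """At most one transaction may hold the lock on a given data item."""
    def __init__(self):
        self.owner = {}                  # item -> transaction id

    def lock(self, txn, item):
        if item in self.owner:           # already locked by some transaction
            return False                 # requester must wait (here: refused)
        self.owner[item] = txn
        return True

    def unlock(self, txn, item):
        assert self.owner.get(item) == txn, "only the holder may unlock"
        del self.owner[item]

locks = BinaryLockTable()
granted1 = locks.lock("T1", "A")   # True: A was free
denied = locks.lock("T2", "A")     # False: T1 still holds A
locks.unlock("T1", "A")
granted2 = locks.lock("T2", "A")   # True: A is free again
```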
Type of Lock modes
1) shared (S) mode
If a transaction T1 has obtained a shared-mode lock on item
X, then other transactions can read but can’t write data item X.
It is also known as a read-only lock.
2) Exclusive (X) mode –
If a transaction holds an exclusive lock on a data item, then no
other transaction can access that item, even to read it, until the
lock is released by the transaction.
It is also called a write lock
Lock-compatibility matrix: S is compatible with S; X is compatible with neither S nor X.
Example of a transaction performing locking:
T1 T2
LOCKX(A) LOCKX(SUM)
READ(A) A=1000 SUM:=0
A:=A-200 A=800 LOCKS(A)
WRITE(A) A=800 READ(A) A=800
UNLOCK(A) SUM:=SUM+A SUM=800
LOCKX(B) UNLOCK(A)
READ(B) B=900 LOCKS(B)
B:=B+200 B=1100 READ(B) B=1100
WRITE(B) B=1100 SUM:=SUM+B SUM=1900
UNLOCK(B) WRITE(SUM) SUM=1900
UNLOCK(B)
UNLOCK(SUM)
IF EXECUTED SERIALLY THE OUTPUT
WILL BE 1900
Consider another example
T1:                    T2:
LOCKX(B)               LOCKS(A)
READ(B)                READ(A)     A=150
B := B-50              UNLOCK(A)
WRITE(B)               LOCKS(B)
UNLOCK(B)              READ(B)     B=150
LOCKX(A)               UNLOCK(B)
READ(A)                DISPLAY(A+B)
A := A+50
WRITE(A)
UNLOCK(A)
▪ It is clear that if they run sequentially, the output will be 300.
▪ Phase 1: Growing Phase
Transaction may obtain locks
Transaction may not release locks
▪ Phase 2: Shrinking Phase
Transaction may release locks
Transaction may not obtain locks
Example:
not two-phase:          two-phase:
T1:                     T1:
LOCKX(B)                LOCKX(B)
READ(B)                 READ(B)
B := B-50               B := B-50
WRITE(B)                WRITE(B)
UNLOCK(B)               LOCKX(A)
LOCKX(A)                READ(A)
READ(A)                 A := A+50
A := A+50               WRITE(A)
WRITE(A)                UNLOCK(B)
UNLOCK(A)               UNLOCK(A)
The left schedule is not two-phase because UNLOCK(B) appears before
LOCKX(A); the right one is two-phase because all unlocks appear after
all lock operations.
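Checking the two-phase property is a one-pass scan: once the first UNLOCK has been seen, no further LOCK may appear. A sketch over the two schedules above:

```python
def is_two_phase(ops):
    """True if no LOCK appears after the first UNLOCK (grow, then shrink)."""
    unlocked = False
    for op in ops:
        if op.startswith("UNLOCK"):
            unlocked = True
        elif op.startswith("LOCK") and unlocked:
            return False          # tried to grow again after shrinking
    return True

not_2pl = ["LOCKX(B)", "READ(B)", "WRITE(B)", "UNLOCK(B)",
           "LOCKX(A)", "READ(A)", "WRITE(A)", "UNLOCK(A)"]
two_pl  = ["LOCKX(B)", "READ(B)", "WRITE(B)", "LOCKX(A)",
           "READ(A)", "WRITE(A)", "UNLOCK(B)", "UNLOCK(A)"]
# is_two_phase(not_2pl) -> False; is_two_phase(two_pl) -> True
```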
▪ Two-phase locking does not ensure freedom from deadlocks
▪ To avoid this, follow a modified protocol called strict two-phase
locking.
▪ Strict Two-Phase Locking:
▪ A transaction must hold all its exclusive locks until it commits or aborts.
▪ Ensures that any data written by an uncommitted transaction is locked in
exclusive mode until the transaction commits.
Conversion of locks
▪ Conversion from shared to exclusive modes is denoted by upgrade
▪ Conversion from exclusive to shared mode by downgrade.
▪ Lock conversion is not allowed to occur arbitrarily.
▪ Upgrading takes place only in the growing phase, whereas
▪ Downgrading takes place only in the shrinking phase.
Implementation of Locking
▪ A Lock manager can be implemented as a separate process to
which transactions send lock and unlock requests.
▪ The lock manager replies to a lock request by sending a lock
grant messages (or a message asking the transaction to roll
back, in case of a deadlock).
▪ The requesting transaction waits until its request is answered
▪ The lock manager maintains a data structure called a lock table
to record granted locks and pending requests.
Lock Table
▪ New request is added to the end of
the queue of requests for the data
item, and granted if it is compatible
with all earlier locks.
▪ Unlock requests result in the request
being deleted
▪ If a transaction aborts, all waiting or granted requests of the
transaction are deleted.
▪ Black rectangles indicate granted locks; white ones indicate waiting
requests.
▪ The lock table also records the type of lock granted or requested.
▪ The lock manager may keep a list of locks held by each transaction,
to implement this efficiently.
Pitfalls of Lock-Based Protocols
1. Deadlocks
2. Starvations
Deadlock occurs when two or more transactions are waiting on a
condition that cannot be satisfied.
Deadlock can arise if the following 4 conditions hold simultaneously in a system;
Mutual exclusion: At least one resource is held in a non-sharable mode
(e.g., an exclusive lock request).
Hold and wait:There is a transaction which acquired and held lock on a
data item, and waits for other data item.
No preemption: situation where a transaction releases the locks on data
items which it holds only after the successful completion of the
transaction.
Circular wait: A situation where a transaction T1 is waiting for another
transaction T2 to release lock on some data items, in turn T2 is waiting for
another transaction T3 to release lock, and so on.
Starvation
Starvation is the situation when a transaction has to wait for an
indefinite period of time to acquire a lock.
Reasons of Starvation –
If waiting scheme for locked items is unfair. ( priority queue )
Victim selection. ( same transaction is selected as a victim repeatedly)
Resource leak.
Via denial-of-service attack.
What are the solutions to starvation –
Increasing Priority
Modification in Victim Selection algorithm
First Come First Serve approach
Wait die and wound wait scheme
Example
Consider the partial schedule
▪ Neither T3 nor T4 can make progress —
executing lock-S(B) causes T4 to wait for T3
to release its lock on B, while executing
lock-X(A) causes T3 to wait for T4 to release
its lock on A.
▪ Such a situation is called a deadlock.
▪ To handle a deadlock one of T3 or T4 must
be rolled back and its locks released.
Timestamp-based Protocols
This protocol ensures that all conflicting read and write operations are
executed in timestamp order.
The protocol uses the System Time or Logical Count as a Timestamp.
The older transaction is always given priority in this method.
Timestamp-based Protocols
Advantages:
Schedules are serializable just like 2PL protocols
No waiting for the transaction, which eliminates the possibility
of deadlocks!
Disadvantages:
Starvation is possible if the same transaction is restarted and
continually aborted
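The basic timestamp-ordering rules for a single data item can be sketched as follows (this ignores the Thomas write rule and simply reports an abort as False):

```python
class TimestampOrdering:
    """Basic TO rules for one data item."""
    def __init__(self):
        self.read_ts = 0      # largest timestamp of a successful read
        self.write_ts = 0     # timestamp of the last successful write

    def read(self, ts):
        if ts < self.write_ts:            # item was written by a younger txn
            return False                  # -> abort and restart the reader
        self.read_ts = max(self.read_ts, ts)
        return True

    def write(self, ts):
        if ts < self.read_ts or ts < self.write_ts:
            return False                  # too late to write -> abort
        self.write_ts = ts
        return True

x = TimestampOrdering()
x.read(5)         # a transaction with TS=5 reads X: allowed
x.write(7)        # a transaction with TS=7 writes X: allowed
ok = x.read(6)    # an older transaction (TS=6) reads after the TS=7 write
# ok is False: the older transaction is aborted and restarted
```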
Validation Based Protocol
Validation Based Protocol is also called Optimistic
Concurrency Control Technique.
Validation based protocol –
works optimistically: transactions execute without restrictions, on the
assumption that interference between concurrently running transactions
is rare, and serializability is checked only at validation time.
Three phases of Validation based Protocol
‒ Read Phase ‒ Values of committed data items from the database can be read
by a transaction. Updates are only applied to local data versions.
‒ Validation Phase ‒ Checking is performed to make sure that there is no
violation of serializability when the transaction updates are applied to
database.
‒ Write Phase ‒ On success of the validation phase, the transaction updates
are applied to the database; otherwise, the updates are discarded and the
transaction is restarted.
Insert and Delete Operations
▪ If two-phase locking is used :
▪ A delete operation may be performed only if the transaction
deleting the tuple has an exclusive lock on the tuple to be
deleted.
▪ A transaction that inserts a new tuple into the database is given
an X-mode lock on the tuple.
▪ Insertions and deletions can lead to the phantom phenomenon.
The phantom problem can occur when a new record being inserted by
some transaction T satisfies a condition that a set of records accessed by
another transaction T′ must satisfy.
DATABASE RECOVERY
TECHNIQUES
Chapter - 5
Introduction
Database recovery is the process of restoring the
database to the most recent consistent state that
existed just before the failure.
Three states of database recovery:
Pre-condition: At any given point in time the database is in
a consistent state.
Condition: Some kind of system failure occurs.
Post-condition: Restore the database to the consistent state
that existed before the failure
Types of failures
1. Transaction failures.
◼ Erroneous parameter values
◼ Logical programming errors
◼ System errors like integer overflow, division by zero
◼ Local errors like “data not found”
◼ User interruption.
◼ Concurrency control enforcement
2. Malicious transactions.
3. System crash.
◼ A hardware, software, or network error (also called
media failure)
4. Disk crash.
Basic Properties of Every Recovery
Algorithm:
Before looking at less ideal but more effective strategies, it is useful to
identify some key points which must be kept in mind, regardless of
approach.
Commit point:
Every transaction has a commit point.
This is the point at which transaction is finished, and all of the database
modifications are made a permanent part of the database.
Recovery approaches
Steal approach-cache page updated by a transaction can be
written to disk before the transaction commits.
No-steal approach -cache page updated by a transaction cannot
be written to disk before the transaction commits.
Force approach- when a transaction commits, all pages updated
by the transaction are immediately written to disk.
No-force approach-when a transaction commits, all pages updated
by the transaction are not immediately written to disk.
Basic Update Strategies
Update strategies may be placed into two basic categories.
Most practical strategies are a combination of these two:
Deferred Update Immediate Update
Cont.
1. Deferred Update (No-Undo/Redo Algorithm)
These techniques do not physically update the DB on disk
until a transaction reaches its commit point.
In case of failure, these techniques need only redo the committed
transactions; no undo is needed.
Cont.
While a transaction runs:
▪ Changes made by that transaction are not recorded in the
database.
On a commit:
▪ The new data is recorded in a log file and flushed to disk
▪ The new data is then recorded in the database itself.
▪ On an abort, do nothing (the database has not been
changed).
▪ On a system restart after a failure, REDO the log.
Cont.
2. Immediate Update (Undo/Redo Algorithm)
The DB may be updated by some operations of a
transaction before the transaction reaches its commit point.
The updates recorded in the log must contain both the old
values and the new values.
These techniques need to undo the operations of the
uncommitted transactions and redo the operations of the
committed transactions .
Cont.
While a transaction runs:
▪ Changes made by the transaction can be written to the database at any
time. Both original and the new data being written, must be stored in the
log before storing it on the disk.
On a commit:
▪ All the updates which have not yet been recorded on the disk are first
stored in the log file and then flushed to disk.
▪ The new data is then recorded in the database itself.
▪ On an abort, undo all the changes which that transaction has made to the
database disk, using the old values in the log entries.
▪ On a system restart after a failure, redo committed changes from log.
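The undo/redo idea can be sketched with log records in the format introduced earlier (the transactions, items, and values here are made up; redo runs forward over committed transactions, undo runs backward over uncommitted ones):

```python
# log records follow the earlier format:
# ("start", T) | ("write", T, X, old_value, new_value) | ("commit", T)
log = [
    ("start", "T1"),
    ("write", "T1", "A", 1000, 800),
    ("commit", "T1"),
    ("start", "T2"),
    ("write", "T2", "B", 500, 700),
    # crash: T2 never committed
]

db = {"A": 800, "B": 700}    # state on disk at the time of the crash

committed = {rec[1] for rec in log if rec[0] == "commit"}

# REDO committed transactions forward, using the new values
for rec in log:
    if rec[0] == "write" and rec[1] in committed:
        db[rec[2]] = rec[4]

# UNDO uncommitted transactions backward, using the old values
for rec in reversed(log):
    if rec[0] == "write" and rec[1] not in committed:
        db[rec[2]] = rec[3]

# db == {"A": 800, "B": 500}: T1's update survives, T2's is rolled back
```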
Shadow Paging
In this technique, the database is considered to be made up of
fixed-size disk blocks or pages for recovery purposes.
Maintains two tables during the lifetime of a transaction
-current page table and shadow page table.
Store the shadow page table in nonvolatile storage, to recover the
state of the database prior to transaction execution
This is a technique for providing atomicity and durability.
When a transaction begins executing, the current page table is copied
into a shadow page table; the shadow page table is then saved and never
modified during the transaction.
Cont.
To recover from a failure, discard the current page table and reinstate
the shadow page table, returning the database to its state before the
transaction.
Advantages
•No-redo/no-undo
Disadvantages
•Creating shadow directory may take a long time.
•Updated database pages change locations.
•Garbage collection is needed
“ARIES” Recovery algorithm.
Recovery algorithms are techniques to ensure database
consistency, transaction atomicity, and durability despite
failures.
Recovery algorithms have two parts
1. Actions taken during normal transaction processing to ensure
enough information exists to recover from failures.
2. Actions taken after a failure to recover the database contents to
a state that ensures atomicity, consistency and durability.
Cont.
ARIES (Algorithms for Recovery and Isolation
Exploiting Semantics)
The ARIES recovery algorithm consist of three steps
• Analysis
• Redo
• Undo
Cont.
Analysis - Identify the dirty pages(updated pages) in the
buffer and set of active transactions at the time of failure.
Redo - Re-apply updates from the log to the database. It will be
done for the committed transactions.
Undo - Scan the log backward and undo the actions of the
active transactions in the reverse order.
Recovery from disk crashes.
Recovery from disk crashes is much more difficult than recovery
from transaction failures or machines crashes.
Loss from such crashes is much less common today than it was
previously, because of the wide use of redundancy in secondary
storage (RAID, Redundant Array of Independent Disks:
a method of combining several hard disk drives into one
logical unit).
Typical methods are;
The log for the database system is usually written on a separate
physical disk from the database.
or,
Periodically, the database is also backed up to tape or other
archival storage.
Conclusion.
✓ Types of failures.
✓ Steal/no steal, Force/no force approaches.
✓ Deferred and immediate update strategies.
✓ Shadow paging technique.
✓ ARIES recovery algorithm.
✓ Recovery from disk crashes.
DATABASE SECURITY AND
AUTHORIZATION
Chapter - 6 Introduction to Database Security Issues
Contents
Security types
Threats of database
Security mechanism
❑ Here we discuss the techniques used for protecting the
database against persons who are not authorized to access
either certain parts of a database or the whole database.
Introduction to Database Security Issues
• Authentication means confirming your own identity,
− It is the process of verifying who you are.
− There are three common factors used for authentication:
− Something you know (such as a password)
− Something you have (such as a smart card)
− Something you are (such as a fingerprint or other Biometric method)
• Authorization means granting access to the system.
• In simple terms, it is the process of verifying what you have access
to.
Types of Security
• Legal and ethical issues- Various legal and ethical issues
regarding the right to access certain information.
• Who has the right to read what information?
• Policy issues - At the governmental, institutional, or corporate
level as to what kinds of information should not be made
publicly available.
• Who should enforce security (government, corporations) ?
• System-related issues- whether a security function should be
handled at the physical hardware, the operating system, or the
DBMS level.
Threats to databases
o Confidentiality, integrity and availability, also known as
the CIA triad, is a model designed to guide policies for
information security within an organization.
• Loss of integrity: users are able to modify data they are not
supposed to modify.
E.g., a student can change grades.
• Loss of confidentiality (secrecy): users are able to see data they are
not supposed to see.
E.g., a student can see other students’ grades.
• Loss of availability: data or a system is not available when needed by a user.
Continued..
Data integrity in the database is the correctness, consistency and
completeness of data.
Data integrity is enforced using the following three integrity
constraints:
− Entity Integrity – every table must have a primary key.
− Referential Integrity – ensures that only the required alterations,
additions, or removals happen, via rules embedded in the database’s
structure about how foreign keys are used.
− Domain Integrity – every column in a relational database must be declared
over a defined domain.
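The three constraints can be seen directly in SQL. A minimal sketch using SQLite (table and column names are invented for the demo):

```python
# Entity, referential, and domain integrity enforced by SQLite constraints.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled

# Entity integrity: every table has a primary key.
con.execute("CREATE TABLE department (dept_id INTEGER PRIMARY KEY)")
con.execute("""CREATE TABLE student (
    stud_id INTEGER PRIMARY KEY,
    age     INTEGER CHECK (age >= 0),           -- domain integrity
    dept_id INTEGER REFERENCES department)""")  # referential integrity

con.execute("INSERT INTO department VALUES (10)")
con.execute("INSERT INTO student VALUES (1, 21, 10)")   # valid row

try:
    con.execute("INSERT INTO student VALUES (2, 20, 99)")  # no department 99
except sqlite3.IntegrityError as e:
    print("referential integrity violation:", e)

try:
    con.execute("INSERT INTO student VALUES (3, -5, 10)")  # age out of domain
except sqlite3.IntegrityError as e:
    print("domain integrity violation:", e)
```

Only the first insert survives; the other two are rejected by the DBMS itself, not by application code.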
Continued..
… To protect databases against these types of threats four kinds of
countermeasures can be implemented :
• Access control,
• Inference control,
• Flow control and
• Encryption
A DBMS typically includes a database security and authorization subsystem
…
Continued..
Access control - handled by creating user accounts and passwords
to control the login process.
Controlling access to a statistical database - used to provide
statistical information based on criteria.
The countermeasures to the statistical database security problem
are called inference control measures.
…Flow control - prevents information from flowing to unauthorized users.
− Channels that are pathways for information to flow implicitly in ways that
violate the security policy of an organization are called covert channels.
Continued..
A final countermeasure is data encryption,
used to protect sensitive data (such as credit card
numbers) transmitted through a communication network.
The data is encoded using some coding algorithm.
Authorized users are given the decoding (decryption)
algorithms or keys needed to decipher the data.
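A toy illustration of the encode/decode idea with a shared key. This XOR scheme is NOT a secure cipher and is only a sketch of the concept; real systems use vetted algorithms such as AES, typically via TLS for data in transit.

```python
# Toy shared-key cipher: XOR the data with a repeating key.
# Illustrative only -- do not use XOR for real encryption.
import itertools

def xor_cipher(data: bytes, key: bytes) -> bytes:
    # XOR is its own inverse, so the same function encrypts and decrypts.
    return bytes(b ^ k for b, k in zip(data, itertools.cycle(key)))

key = b"shared-secret-key"                 # known only to authorized parties
plaintext = b"4111 1111 1111 1111"         # e.g., a credit card number
ciphertext = xor_cipher(plaintext, key)    # what travels over the network

print(ciphertext != plaintext)             # True: unreadable without the key
print(xor_cipher(ciphertext, key) == plaintext)   # True: key holders decode it
```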
Database Security and the DBA
… The database administrator (DBA) -
central authority for managing a database system.
responsible for the overall security of the database system
…The DBA has a DBA account in the DBMS - called system or superuser account,
…Following are the major responsibilities of a DBA:
Account creation
Privilege granting
Privilege revocation
Security level assignment
Access Protection, User Accounts,
and Database Audits
… To use the DB, a user needs an account.
… The DBA will create a new account number and password.
… The user must log in to the DBMS using the account number and password.
… The database system
keeps track of all operations on the database that are applied by a certain user
in each login session
in the system log.
…If any tampering with the database is suspected, a database audit is performed,
This consists of
reviewing the log-
to examine all accesses and operations applied to the database
during a certain time period.
• …
A database log that is used mainly for security purposes is sometimes called an
audit trail.
Types of database security mechanisms:
• Two types of database security mechanisms:
▪ Discretionary security mechanisms
• The typical method of enforcing discretionary access control in a database
system is based on the granting and revoking privileges.
▪ Mandatory security mechanisms
• Classify data and users into various security classes
• Implement security policy
Discretionary Access Control Based on
Granting and Revoking Privileges
… Types of Discretionary Privileges
• The account level:
• At this level, the DBA specifies the particular privileges that each
account holds independently of the relations in the database.
• … The privileges at the account level apply to the capabilities provided to the account
itself and can include the following:
The CREATE SCHEMA, CREATE TABLE, or CREATE VIEW privilege;
The ALTER privilege;
The DROP privilege;
The MODIFY privilege;
The SELECT privilege.
Continued..
Relation level:
…
The relation (or table level): At this level, the DBA can control the privilege
to access each individual relation or view in the database.
… The granting and revoking of privileges generally follow an authorization
model for discretionary privileges known as the access matrix model,
where the rows of a matrix M represent subjects (users, accounts, programs) and
the columns represent objects (relations, records, columns, views, operations).
Each position M(i, j) in the matrix represents the types of privileges (read, write,
update) that subject i holds on object j.
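The access matrix model maps naturally onto a nested lookup table. A minimal sketch (the account names and privilege sets are invented for the demo):

```python
# Access matrix M: rows are subjects, columns are objects,
# each cell M[i][j] is the set of privileges subject i holds on object j.
matrix = {
    "A1": {"EMPLOYEE": {"read", "write", "update"}, "DEPARTMENT": {"read"}},
    "A2": {"EMPLOYEE": {"read"}},
}

def allowed(subject: str, obj: str, privilege: str) -> bool:
    # Missing subject/object entries mean no privileges at all.
    return privilege in matrix.get(subject, {}).get(obj, set())

print(allowed("A1", "EMPLOYEE", "write"))   # True
print(allowed("A2", "EMPLOYEE", "write"))   # False
print(allowed("A3", "EMPLOYEE", "read"))    # False (unknown subject)
```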
… To control the granting and revoking of relation privileges, each relation R
in a database is assigned and owner account (created first)
The owner of a relation is given all privileges on that relation.
The owner account holder can pass privileges on any of the owned relation to
other users by granting privileges to their accounts.
In SQL the following types of privileges can be granted on each individual
relation R:
SELECT (retrieval or read) privilege on R: Gives the account retrieval privilege.
In SQL this gives the account the privilege to use the SELECT statement to
retrieve tuples from R.
MODIFY privileges on R: Gives the account the capability to modify tuples of R.
▪ In SQL this privilege is further divided into UPDATE, DELETE, and INSERT
privileges to apply the corresponding SQL command to R.
▪ In addition, both the INSERT and UPDATE privileges can specify that only
certain attributes can be updated by the account.
REFERENCES privilege on R: This gives the account the
capability to reference relation R when specifying integrity
constraints.
The privilege can also be restricted to specific attributes of R.
Notice that to create a view, the account must have SELECT
privilege on all relations involved in the view definition.
Specifying Privileges Using Views
The mechanism of views is an important discretionary
…
authorization mechanism in its own right.
…Example:
if the owner A of a relation R wants another account B to
be able to retrieve only some fields of R, then A can create
a view V of R that includes only those attributes and then
grant SELECT on V to B.
the same applies to limiting B to retrieving only certain
tuples of R;
a view V’ can be created by defining the view by means of
a query that selects only those tuples from R that A wants to
allow B to access.
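The column- and tuple-restriction idea above can be shown concretely. A sketch in SQLite (relation and attribute names are invented; SQLite itself has no GRANT, so only the view mechanics are demonstrated):

```python
# A view V that hides one column of R and filters its tuples,
# so granting SELECT on V (in a full DBMS) exposes only this slice.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (name TEXT, salary INTEGER, dno INTEGER)")
con.executemany("INSERT INTO R VALUES (?, ?, ?)",
                [("Alice", 50000, 5), ("Bob", 60000, 4)])

# Column restriction: SALARY is omitted. Tuple restriction: only DNO = 5.
con.execute("CREATE VIEW V AS SELECT name, dno FROM R WHERE dno = 5")

print(con.execute("SELECT * FROM V").fetchall())   # [('Alice', 5)]
```

Account B querying V can never see Bob's row or anyone's salary, even though both live in R.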
Revoking Privileges
• … In some cases it is desirable to grant a privilege to
a user temporarily.
• For example, the owner of a relation may want to
…
grant the SELECT privilege to a user for a specific
task and then revoke that privilege once the task is
completed.
• Hence, a mechanism for revoking privileges is needed.
… In SQL, a REVOKE command is included for the
purpose of canceling privileges.
Propagation of Privileges using the
GRANT OPTION
… Whenever the owner A of a relation R grants a privilege on
R to another account B, privilege can be given to B with or
without the GRANT OPTION.
… If the GRANT OPTION is given, this means that B can also
grant that privilege on R to other accounts.
Suppose that B is given the GRANT OPTION by A and that
…
B then grants the privilege on R to a third account C, also
with GRANT OPTION.
… In this way, privileges on R can propagate to other accounts
without the knowledge of the owner of R.
… If the owner account A now revokes the privilege granted to
B, all the privileges that B propagated based on that
privilege should automatically be revoked by the system.
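The cascading revocation just described can be sketched as a recursive walk over the grant graph. This assumes, for simplicity, that each account received the privilege from a single grantor (real DBMSs must track multiple grant paths):

```python
# Grant graph for one privilege: grants[x] lists the accounts x granted it to.
grants = {"A": ["B"], "B": ["C"], "C": []}   # A -> B -> C, all WITH GRANT OPTION

def revoke(grantor: str, grantee: str) -> None:
    """Revoke grantor's grant to grantee, cascading to re-grants."""
    grants[grantor].remove(grantee)
    # Everything the grantee propagated was based on the revoked privilege,
    # so it must be revoked too (single-grantor assumption).
    for g in list(grants[grantee]):
        revoke(grantee, g)

revoke("A", "B")
print(grants)   # {'A': [], 'B': [], 'C': []} -- B's grant to C went with it
```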
Example(1)
…
• …
Suppose that the DBA creates four accounts A1, A2, A3, and A4 and wants only A1 to
be able to create base relations; then the DBA must issue the following GRANT
command in SQL:
…
GRANT CREATE TABLE TO A1;
• …
In SQL the same effect can be accomplished by having the DBA issue
a CREATE SCHEMA command as follows:
…
CREATE SCHEMA EXAMPLE AUTHORIZATION A1;
… User account A1 can create tables under the schema called EXAMPLE.
• …Suppose that A1 creates the two base relations EMPLOYEE and DEPARTMENT; A1
is then the owner of these two relations and hence holds all the relation privileges on
each of them.
• Suppose that A1 wants to grant A2 the privilege to insert and delete tuples in both of these
…
relations, but A1 does not want A2 to be able to propagate these privileges to additional
accounts:
… GRANT INSERT, DELETE ON EMPLOYEE, DEPARTMENT TO A2;
Example(2)
Suppose that A1 wants to allow A3 to retrieve information from either of the
two tables and also to be able to propagate the SELECT privilege to other
accounts.
A1 can issue the command:
GRANT SELECT ON EMPLOYEE, DEPARTMENT
TO A3 WITH GRANT OPTION;
A3 can grant the SELECT privilege on the EMPLOYEE relation to A4 by
issuing:
GRANT SELECT ON EMPLOYEE TO A4;
Notice that A4 can’t propagate the SELECT privilege because the GRANT OPTION was
not given to A4.
Example(3)
Suppose that A1 decides to revoke the SELECT
privilege on the EMPLOYEE relation from A3; A1
can issue:
REVOKE SELECT ON EMPLOYEE FROM A3;
The DBMS must now automatically revoke the
SELECT privilege on EMPLOYEE from A4, too,
because A3 granted that privilege to A4 and A3 does
not have the privilege any more.
Example(4)
Suppose that A1 wants to give back to A3 a limited capability to SELECT from
the EMPLOYEE relation and wants to allow A3 to be able to propagate the
privilege.
The limitation is to retrieve only the NAME, BDATE, and ADDRESS
attributes and only for the tuples with DNO=5.
A1 then creates the view:
CREATE VIEW A3EMPLOYEE AS SELECT NAME, BDATE, ADDRESS FROM
EMPLOYEE WHERE DNO = 5;
After the view is created, A1 can grant SELECT on the view A3EMPLOYEE
to A3 as follows:
GRANT SELECT ON A3EMPLOYEE TO A3 WITH GRANT OPTION;
Example(5)
Finally, suppose that A1 wants to allow A4 to update
only the SALARY attribute of EMPLOYEE;
A1 can issue:
GRANT UPDATE (SALARY) ON EMPLOYEE TO
A4;
The UPDATE or INSERT privilege can specify particular
attributes that may be updated or inserted in a relation.
Other privileges (SELECT, DELETE) are not attribute specific.
Mandatory Access Control
Based on system-wide policies that cannot be changed by individual users.
Each DB object is assigned a security class.
− Bell-LaPadula Model
• Objects (e.g., tables, views, tuples)
• Subjects (e.g., users, user programs)
− Security classes:
− Top secret (TS), secret (S), confidential (C), unclassified (U): TS > S > C > U
• Each object and subject is assigned a class.
• Subject S can read object O only if class(S) >= class(O) (Simple Security
Property)
• Subject S can write object O only if class(S) <= class(O) (*-Property)
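The two Bell-LaPadula properties ("no read up, no write down") reduce to simple rank comparisons. A minimal sketch:

```python
# Bell-LaPadula checks with the ordering TS > S > C > U encoded as ranks.
RANK = {"TS": 3, "S": 2, "C": 1, "U": 0}

def can_read(subject_class: str, object_class: str) -> bool:
    # Simple Security Property: read only at or below your own level.
    return RANK[subject_class] >= RANK[object_class]

def can_write(subject_class: str, object_class: str) -> bool:
    # *-Property: write only at or above your own level (no leaking downward).
    return RANK[subject_class] <= RANK[object_class]

print(can_read("S", "C"), can_write("S", "C"))   # True False
print(can_read("C", "S"), can_write("C", "S"))   # False True
```

A Secret-level user may read Confidential data but not write into it, which prevents copying secrets into a less protected object.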
Question
Chapter - 7 Distributed Database System
In this chapter you will learn:
• The need for distributed databases.
• The differences between distributed database systems,
distributed processing, and parallel database systems.
• The advantages and disadvantages of distributed
DBMSs.
• The functions that should be provided by a distributed
DBMS.
• An architecture for a distributed DBMS.
• The main issues associated with distributed database
design, namely fragmentation, replication, and
allocation.
Distributed Database Concepts
– Distributed database –
– logically interrelated collection of shared
data (and a description of this data)
physically distributed over a computer
network.
– DDBMS –
– is a software system that manages a
distributed database while making the
distribution transparent to the user.
Characteristics of DDBMS:
– A collection of logically related shared data;
– The data is split into a number of fragments;
– Fragments may be replicated;
– The sites are linked by a communications
network;
– The data at each site is under the control of a
DBMS;
– The DBMS at each site can handle local
applications, autonomously;
– Each DBMS participates in at least one global
application.
Advantages DDS
1. Management of distributed data with different levels of
transparency:
▪ Distribution transparency
– The physical placement of data (files, relations, etc.) is not
known to the user.
▪ Network transparency
– Users do not have to worry about operational details of the network.
▪ Location transparency
– Refers to the freedom to issue commands from any location without
affecting how they work.
Advantages DDS…
▪ Naming transparency
– Allows access to any named object (files, relations, etc.) from any
location.
▪ Replication transparency
− Allows copies of data to be stored at multiple sites.
− This is done to minimize access time to the required data.
▪ Fragmentation transparency
− Allows a relation to be segmented horizontally (creating a subset of the
tuples of a relation) or vertically (creating a subset of the columns of a relation).
Advantages of DDS
2. Increase reliability and availability:
− Reliability refers to system uptime, that is, the system is running efficiently most
of the time.
− Availability is the probability that the system is continuously available (usable
or accessible) during a time interval.
− A distributed database system has multiple nodes (computers) and if one fails
then others are available to do the job.
3. Improved performance:
− DDBMS fragments the database to keep data closer to where it is needed most.
− This reduces data management (access and modification) time significantly.
4. Scalability - easier expansion
− Allows new nodes (computers) to be added anytime without changing the entire
configuration.
Disadvantages of DDS
– Complexity
– Cost
– Security
– Integrity control more difficult
– Lack of standards
– Lack of experience
– Database design more complex
Database system architectures
▪ A Database Architecture is a representation of DBMS design.
▪ It helps to design, develop, implement, and maintain the database
management system.
▪ There are three database system architectures:
1. Centralized Database Architecture
2. Parallel Database Architectures
3. Distributed Database Architecture
Centralized database
• A centralized database is basically a type of database that is
stored, located and maintained at a single location only.
• This type of database is modified and managed from that
location itself.
Parallel database architectures
▪ Parallel DBMSs link multiple, smaller machines to achieve the
same throughput as a single, larger machine, often with greater
scalability and reliability.
▪ The three main architectures for parallel DBMSs:
▪ Shared memory(tightly coupled)
▪ Shared disk (loosely coupled architecture)
▪ Shared nothing (massively parallel processing (MPP))
architecture
The three main architectures for parallel DBMSs:
■ Shared memory - tightly coupled architecture in which multiple processors
share secondary (disk) storage and primary memory.
The three main architectures for parallel DBMSs:
▪ Shared disk -loosely coupled architecture where multiple processors
share secondary (disk) storage but each has their own primary memory.
The three main architectures for parallel DBMSs:
▪ Shared nothing (massively parallel processing (MPP)) architecture.
• Multiple processor architecture in which each processor is part of a
complete system, with its own memory and disk storage.
Distributed database
• A distributed database system allows applications to access data
from local and remote databases.
Types of Distributed Database System
• There are two types of distributed database system:
• Homogeneous Distributed Database.
• Heterogeneous Distributed Database.
Homogeneous
• All sites of the database system have identical setup, i.e., same database
system software.
• The underlying operating systems can be a mixture of Linux, Window,
Unix, etc.
• For example, all sites run Oracle or DB2, or Sybase or some other database
system.
[Figure: five Oracle sites on Windows, Unix, and Linux platforms, linked by a communications network.]
Advantages
✓ Easy to use
✓ Easy to manage
✓ Easy to design
Disadvantages
✓ Difficult for most organizations to
force a homogeneous environment
Homogeneous Distributed Database Systems
▪ Autonomy determines the extent to which individual nodes or
DBs in a connected DDB can operate independently.
• Design autonomy refers to independence of data model usage and
transaction management techniques among nodes.
• Communication autonomy determines the extent to which each node
can decide on sharing of information with other nodes.
• Execution autonomy refers to independence of users to act as they
please.
▪ Non-autonomous − Data is distributed across the homogeneous nodes
and a central or master DBMS co-ordinates data updates across the sites.
Heterogeneous
✓ Different data centers may run different DBMS products, with possibly different
underlying data models.
[Figure: five sites running object-oriented, hierarchical, network, and relational DBMSs on Unix, Windows, and Linux, linked by a communications network.]
✓ Translations are required to allow for:
▪ Different hardware.
▪ Changes of codes and word lengths.
▪ Different DBMS products.
▪ Mapping of data structures in one data model to the equivalent data
structures in another data model.
▪ Translating the query language used (for example, relational model SQL SELECT
statements are mapped to the network FIND and GET statements).
▪ Different hardware and different DBMS products: if both the hardware and
software are different, then both types of translation are required, which
makes the processing extremely complex.
Heterogeneous
⚫ Advantages
✓ Huge data can be stored in one Global center from different data center
✓ Remote access is done using the global schema.
✓ Different DBMSs may be used at each node
⚫ Disadvantages
✓ Difficult to manage.
✓ Difficult to design.
Multidatabase system (MDBS)
• Multidatabase system (MDBS)- a distributed DBMS in which each
site maintains complete autonomy.
• MDBSs logically integrate a number of independent DDBMSs while
allowing the local DBMSs to maintain complete control of their operations.
• MDBS allows users to access and share data without requiring full database
schema integration.
• Federated database system - collection of cooperating database systems
that are autonomous and possibly heterogeneous.
❑ Differences in data models
❑ Differences in constraints
❑ Differences in query language
Distributed Processing and Distributed Database
DDBMS Components
DDBMS protocol
Computer workstations
To form the network system.
Network hardware and software
Components that reside in each workstation.
Communications media
Carry the data from one workstation to another.
Transaction processor (TP)
Receives and Processes the application’s data requests.
Data processor (DP)
Stores and Retrieves data located at the site.
Also Known as data manager (DM).
DDBMS protocol
• DDBMS protocol determines how the DDBMS will:
– Interface with the network to transport data and commands
between DPs and TPs.
– Synchronize all data received from DPs (TP side) and route
retrieved data to the appropriate TPs (DP side).
– Ensure common database functions in a distributed system --
security, concurrency control, backup, and recovery.
Distributed Database Design
• The design of a distributed database introduces three new
issues:
– How to partition the database into fragments?
– Which fragments to replicate?
– Where to locate those fragments and replicas?
Data Fragmentation
▪ Data fragmentation allows us to break a single object
into two or more segments or fragments.
▪ There are three Types of Fragmentation Strategies:
▪ Horizontal Fragmentation
▪ Vertical Fragmentation
▪ Mixed Fragmentation
Horizontal Fragmentation
▪ Horizontal Fragmentation - Consists of a subset of the
tuples of a relation.
▪ Fragment represents the equivalent of a SELECT statement, with
the WHERE clause on a single attribute.
Vertical fragment
▪ Vertical fragment Consists of a subset of the attributes of a
relation.
▪ Equivalent to the PROJECT statement.
Mixed fragment
▪ Mixed fragment - Consists of a horizontal
fragment that is subsequently vertically
fragmented, or a vertical fragment that is
then horizontally fragmented.
▪ A mixed fragment is defined using the
Selection and Projection operations of the
relational algebra.
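The three strategies map directly onto selection and projection. A minimal sketch over an in-memory relation (attribute names are invented for the demo):

```python
# Horizontal, vertical, and mixed fragmentation of a small relation.
employees = [
    {"name": "Alice", "salary": 50000, "dno": 5},
    {"name": "Bob",   "salary": 60000, "dno": 4},
]

# Horizontal: subset of tuples (a SELECT with a WHERE on one attribute).
horizontal = [t for t in employees if t["dno"] == 5]

# Vertical: subset of attributes (a PROJECT).
vertical = [{"name": t["name"], "dno": t["dno"]} for t in employees]

# Mixed: selection then projection (or vice versa).
mixed = [{"name": t["name"]} for t in employees if t["dno"] == 5]

print(horizontal)   # [{'name': 'Alice', 'salary': 50000, 'dno': 5}]
print(vertical)     # names and dnos of everyone, salaries dropped
print(mixed)        # [{'name': 'Alice'}]
```

Each fragment could then be allocated to the site where its rows or columns are used most.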
Data Replication
⚫ Data replication refers to the storage of data copies at multiple
sites served by a computer network.
– Enhance data availability and response time, reducing
communication and total query costs.
Data Replication
• Mutual Consistency Rule
• Requires that all copies of data fragments be identical.
• DDBMS must ensure that a database update is performed at
all sites where replicas exist.
• Replication Conditions
• Fully Replicated database stores multiple copies of all
database fragments at multiple sites.
• Partially Replicated database stores multiple copies of some
database fragments at multiple sites.
• Factors for Data Replication Decision
– Database Size
– Usage Frequency
Data Allocation
⚫ Data allocation describes the process of deciding where to
locate data.
⚫ Data Allocation Strategies
– Centralized
The entire database is stored at one site.
– Partitioned
The database is divided into several disjoint parts (fragments)
and stored at several sites.
– Replicated
Copies of one or more database fragments are stored at
several sites.
Data allocation algorithms
• Data allocation algorithms take into consideration a variety of
factors:
– Performance and data availability goals
– Size, number of rows, the number of relations that an entity
maintains with other entities.
– Types of transactions to be applied to the database, the
attributes accessed by each of those transactions.
Transparencies in a DDBMS
▪ Transparency hides implementation details from the
user.
‒ Distribution transparency
• Transaction transparency
• Failure transparency
• Performance transparency
Distribution Transparency
• Distribution transparency allows the user to perceive the database as a
single, logical entity.
• Allows us to manage a physically dispersed database as though it were
a centralized database.
• Three Levels of Distribution Transparency
– Fragmentation transparency
– Location transparency
– Local mapping transparency
Distribution Transparency
• Example :
• Employee data (EMPLOYEE) are distributed over three locations: New
York, Atlanta, and Miami.
• Depending on the level of distribution transparency support, three different
cases of queries are possible:
Distribution Transparency
• Case 1: DB Supports Fragmentation Transparency
SELECT * FROM EMPLOYEE WHERE EMP_DOB < '01-JAN-1940';
• Case 2: DB Supports Location Transparency
SELECT * FROM E1 WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E2 WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E3 WHERE EMP_DOB < '01-JAN-1940';
• Case 3: DB Supports Local Mapping Transparency
SELECT * FROM E1 NODE NY WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E2 NODE ATL WHERE EMP_DOB < '01-JAN-1940';
UNION
SELECT * FROM E3 NODE MIA WHERE EMP_DOB < '01-JAN-1940';
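The Case 2 query can be simulated locally: three named fragments E1, E2, E3 stand in for the New York, Atlanta, and Miami sites, and the user unions them by hand. With full fragmentation transparency (Case 1) the DDBMS would build this union behind a single EMPLOYEE query. SQLite plays all three sites here, and the column names and data are invented for the demo:

```python
# Simulating location transparency: query three fragments and UNION the results.
import sqlite3

con = sqlite3.connect(":memory:")
for frag, row in [("E1", ("NY emp",  "1935-05-01")),
                  ("E2", ("ATL emp", "1950-06-02")),
                  ("E3", ("MIA emp", "1938-07-03"))]:
    con.execute(f"CREATE TABLE {frag} (emp_name TEXT, emp_dob TEXT)")
    con.execute(f"INSERT INTO {frag} VALUES (?, ?)", row)

# ISO date strings compare correctly as text.
rows = con.execute("""
    SELECT * FROM E1 WHERE emp_dob < '1940-01-01'
    UNION SELECT * FROM E2 WHERE emp_dob < '1940-01-01'
    UNION SELECT * FROM E3 WHERE emp_dob < '1940-01-01'
""").fetchall()
print(rows)   # the two employees born before 1940
```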
Transaction Transparency
• Transaction transparency - ensures that database transactions will
maintain the database’s integrity and consistency.
• Transaction transparency consists:
– Remote Requests
– Remote Transactions
– Distributed Transactions
– Distributed Requests
A Remote Request
▪ Allows us to access data to be processed by a single remote database
processor.
A Remote Transaction
▪ Composed of several requests, may access data at only a single
remote site.
A Distributed Transaction
▪ Allows a transaction to reference several (local or remote) DP sites.
A Distributed Request
▪ Reference data from several remote DP sites.
▪ Allows a single request to reference a physically partitioned table.
Example 2: Distributed Request
Distributed Transactions and 2 Phase Commit
▪ Transaction transparency in a DDBMS environment ensures that all distributed
transactions maintain the distributed database’s integrity and consistency.
▪ Transaction may access data at several sites.
▪ Each site has a local transaction manager responsible for:
– Maintaining a log for recovery purposes
– Participating in coordinating the concurrent execution of the transactions
executing at that site.
▪ Each site has a transaction coordinator, which is responsible for:
– Starting the execution of transactions that originate at the site.
– Distributing sub transactions at appropriate sites for execution.
– Coordinating the termination of each transaction that originates at the site.
Two-Phase Commit Protocol
DO performs the operation and records the “before” and “after” values in the
transaction log.
UNDO reverses an operation, using the log entries written by the DO portion
of the sequence.
REDO redoes an operation, using the log entries written by DO portion of the
sequence.
– The write-ahead protocol forces the log entry to be written to permanent
storage before the actual operation takes place.
• Two-phase commit protocol defines the operations between two nodes;
• Coordinator and
• Subordinates or cohorts - one or more
Two-Phase Commit Protocol
• The protocol is implemented in two phases:
• Phase 1: Preparation
• The coordinator sends a PREPARE TO COMMIT message
to all subordinates.
• The subordinates receive the message, write the transaction
log using the write-ahead protocol, and send an
acknowledgement message to the coordinator.
• The coordinator makes sure that all nodes are ready to
commit, or it aborts the transaction.
Two-Phase Commit Protocol
⚫ Phase 2: The Final Commit
– The coordinator broadcasts a COMMIT message to all
subordinates and waits for the replies.
– Each subordinate receives the COMMIT message then
updates the database, using the DO protocol.
– The subordinates reply with a COMMITTED or NOT
COMMITTED message to the coordinator.
– If one or more subordinates did not commit, the coordinator sends
an ABORT message, thereby forcing them to UNDO all
changes.
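The two phases above can be sketched as a single coordinator function; the subordinates are represented by flags saying whether each is ready to commit (the message exchange and logging are abstracted away):

```python
# Minimal two-phase commit sketch: commit only if every subordinate is ready.
def two_phase_commit(subordinates: dict) -> str:
    """subordinates maps site name -> True if it acknowledges PREPARE."""
    # Phase 1: preparation -- every subordinate must write its log
    # (write-ahead protocol) and acknowledge, or the coordinator aborts.
    if not all(subordinates.values()):
        return "ABORT"
    # Phase 2: final commit -- broadcast COMMIT and collect replies.
    replies = {name: "COMMITTED" for name in subordinates}
    if all(r == "COMMITTED" for r in replies.values()):
        return "COMMIT"
    return "ABORT"   # any NOT COMMITTED reply forces UNDO everywhere

print(two_phase_commit({"site1": True, "site2": True}))    # COMMIT
print(two_phase_commit({"site1": True, "site2": False}))   # ABORT
```

A single unprepared site is enough to abort the whole distributed transaction, which is exactly what keeps all replicas mutually consistent.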
Performance Transparency and
Query Optimization
• Query optimization must provide distribution transparency as
well as replica transparency.
• Replica transparency refers to the DDBMS’s ability to hide the
existence of multiple copies of data from the user.
• Query optimization algorithms are based on two principles:
• Selection of the optimum execution order
• Selection of sites to be accessed to minimize communication
costs
Operation Modes of Query Optimization
⚫ Automatic query optimization
– DDBMS finds the most cost-effective access path without user intervention.
⚫ Manual query optimization
– Optimization is selected and scheduled by the end user or programmer.
Timing of Query Optimization
– Static query optimization takes place at compilation time.
– Dynamic query optimization takes place at execution time.
• Optimization Techniques -
– Statistically based query optimization - uses statistical information about
the database.
– Rule-based query optimization algorithm - based on a set of user-defined
rules to determine the best query access strategy.
Date’s Twelve Rules for a DDBMS
• In this final section, we list Date’s twelve rules (or objectives) for
DDBMSs (Date, 1987b).
• Fundamental principle
• To the user, a distributed system should look exactly like a non-
distributed system.
1) Local autonomy
2) No reliance on a central site
3) Continuous operation
4) Location independence
Date’s Twelve Rules for a DDBMS
5) Fragmentation independence
6) Replication independence
7) Distributed query processing
8) Distributed transaction processing
9) Hardware independence
10) Operating system independence
11) Network independence
12) Database independence
Questions ?
1. Explain what is meant by a DDBMS and discuss the motivation in
providing such a system.
2. Compare and contrast a DDBMS with a parallel DBMS. Under what
circumstances would you choose a DDBMS over a parallel DBMS?
3. Discuss the advantages and disadvantages of a DDBMS.
4. What is the difference between a homogeneous and a heterogeneous
DDBMS? Under what circumstances would such systems generally
arise?