0% found this document useful (0 votes)

173 views42 pages

RAID and File Organization Techniques

Uploaded by

anithaselvi92

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

173 views42 pages

RAID and File Organization Techniques

Uploaded by

anithaselvi92

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as DOCX, PDF, TXT or read online on Scribd

Unit: IV: Implementation Techniques

Syllabus
RAID-File Organization - Organization of Records in Files - Data dictionary
Storage - Column Oriented Storage - Indexing and Hashing - Ordered Indices -
B+ tree Index Files - B tree Index Files - Static Hashing - Dynamic Hashing -
Query Processing Overview - Algorithms for Selection, Sorting and join
operations - Query optimization using Heuristics - Cost Estimation.

RAID
• RAID stands for Redundant Array of Independent Disks. This is a technology in
which multiple secondary disks are connected together to increase the
performance, data redundancy or both.
• For achieving the data redundancy - in case of disk failure, if the same data is
also backed up onto another disk, we can retrieve the data and go on with the
operation.
• It consists of an array of disks in which multiple disks are connected to achieve
different goals.
• The main advantage of RAID, is the fact that, to the operating system the array of
disks can be presented as a single disk.
Need for RAID
• RAID is a technology that is used to increase the performance.
• It is used for increased reliability of data storage.
• An array of multiple disks accessed in parallel will give greater throughput than a
single disk.
• With multiple disks and a suitable redundancy scheme, your system can stay up
and running when a disk fails, and even while the replacement disk is being
installed and its data restored.
Features
(1) RAID is a technology that contains the set of physical disk drives.
(2) In this technology, the operating system views the separate disks as a single
logical disk.
(3) The data is distributed across the physical drives of the array.
(4) In case of disk failure, the parity information can be helped to recover the data.

RAID Levels
Level: RAID 0
• In this level, data is broken down into blocks and these blocks are stored across
all the disks.
• Thus striped array of disks is implemented in this level. For instance in the
following figure blocks "A B" form a stripe.
• There is no duplication of data in this level so once a block is lost then there is no
int lovol diri way recover it.
• The main priority of this level is performance and not the reliability.

Level: RAID 1
• This level makes use of mirroring. That means all data in the drive is duplicated
to another drive.
• This level provides 100% redundancy in case of failure.
• Only half space of the drive is used to store the data. The other half of drive is
just a mirror to the already stored data.
• The main advantage of this level is fault tolerance. If some disk fails then the
other automatically takes care of lost data.
Level: RAID 2
• This level makes use of mirroring as well as stores Error Correcting Codes (ECC)
for its data striped on different disks.
• The data is stored in separate set of disks and ECC is stored another set of disks.
• This level has a complex structure and high cost. Hence it is not used
commercially.
Level: RAID 3
• This level consists of byte-level stripping with dedicated parity. In this level, the
parity information is stored for each disk section and written to a dedicated. parity
drive.
• We can detect single errors with a parity bit. Parity is a technique that checks
whether data has been lost or written over when it is moved from one place in
storage to another.
• In case of disk failure, the parity disk is accessed and data is reconstructed from
the remaining devices. Once the failed disk is replaced, the missing data can be
restored on the new disk.
Level: RAID 4
• RAID 4 consists of block-level stripping with a parity disk.
• Note that level 3 uses byte-level striping, whereas level 4 uses block-
level striping.
Level: RAID 5
• RAID 5 is a modification of RAID 4.
• RAID 5 writes whole data blocks onto different disks, but the parity
bits generated for data block stripe are distributed among all the data disks rather
than storing them on a different dedicated disk.

Level: RAID 6
• RAID 6 is a extension of Level 5
• RAID 6 writes whole data blocks onto different disks, but the two independent
parity bits generated for data block stripe are distributed among all the data disks
rather than storing them on a different dedicated disk.
• Two parities provide additional fault tolerance.
• This level requires at least four disks to implement RAID.
The factors to be taken into account in choosing a RAID level are :
Monetary cost of extra disk-storage requirements.
1. Performance requirements in terms of number of I/O operations.
2. Performance when a disk has failed.
3. Performance during rebuild

File Organization
• A file organization is a method of arranging records in a file when the file is
stored on disk.
• A file is organized logically as a sequence of records.
• Record is a sequence of fields.
• There are two types of records used in file organization (1) Fixed Length Record
(2) Variable Length Record.
(1) Fixed length record
• A file where each records is of the same length is said to have fixed
length records.
• Some fields are always the same length (e.g. Phone Number is always 10
characters).
• Some fields may need to be 'padded out' so they are the correct length.
• For example -
type Employee = record
EmpNo varchar(4)
Ename varchar(10)
Salary integer(5)
Phone varchar(10)
End

For instance the first record of example file can be stored as

Thus total 29 bytes are required to store.

Advantage:
• Access is fast because the computer knows where each record starts.
Disadvantage:
(1) Due to fixed size, some larger sized record may cross the block boundaries.
That of means part of record will be stored in one block and other part of the
record may be stored in some another block. Thus we may require two block
access for each read or write.
(2) It is difficult to delete the record from this structure. If some intermediate
record is deleted from the file then the vacant space must be occupied by next
subsequent records.
• When a record is deleted,we could move the record that came after it into the
space formerly occupied by the deleted record. But this may require moving of
multiple records to occupy the vacant space. This is an undesirable solution to fill
up vacant space of deleted records.
• Another approach is to use a file header. At the beginning of the file, we allocate
a certain number of bytes as a file header. The header will contain information
such as address of the first record whose contents are deleted. We use this first
record to store the address of the second available record and so on. Thus the
stored record addresses are referred as pointer while the deleted records thus form
a linked list which is called as free list.
For example - Consider the employee record in a file is-

The representation of records maintaining free list after deletion of record 1,3 and
5

• On insertion of a new record, we use the record pointed to by the header. We

change the header pointer to point to the next available record. If no space is
available, we add the new record to the end of the file.
(2) Variable length record
• Variable-length records arise in a database in several ways:
(i) Storage of multiple record types in a file.
(ii) Record types that allow variable lengths for one or more fields.
(iii) Record types that allow repeating fields such as arrays or multisets.
• The representation of a record with variable-length attributes typically has two
parts:
a) an initial part with fixed length attributes. This initial part of the record is
represented by a pair (offset, length). The offset denotes the starting address of the
record while the length represents the actual length of the length.
b) followed by data for variable length attributes.
• For example - Consider the employee records stored in a file as

• The variable length representation of first record is,

• The figure also illustrates the use of a null bitmap, which indicates which
attributes of the record have a null value.
• The variable length record can be stored in blocks. A specialized structure called
slotted page structure is commonly used for organizing the records within a block.
This structure is as shown by following Fig. 4.2.1.
• This structure can be described as follows -
• At the beginning of each block there is a block header which contains -
[Link] number of entries (i.e. records).
[Link] to end of free list.
[Link] by an array of entries that contain size and location of each record.
• The actual records are stored in contiguous block. Similarly the free list is also
maintained as continuous block.

Data Dictionary Storage

• Definition:Data dictionary is mini database management system that manages
the metadata.
• Data dictionaries are helpful to the database administrators in management of
database.
• The general structure of data dictionary is as shown in the Fig. 4.4.1.
• The data in the database dictionary is maintained by several programs and
generates the reports if required.
• The data dictionary is integrated with the database systems in which the data is
controlled by the data dictionaries and is made available to the DBMS software.
• Following type of information is maintained by data dictionary -
a) Description of the schema of the database.
b) Detailed information about the physical database design.
c) Description of database users, their roles and their access rights.
d) Description of database transactions.
e) The description of the relationship between database transactions and data items
referenced by them.
f) Information about the usage statistics. That means, how many times the queries
are raised to the database, how many transactions are made by the DBMS.
• For example - Consider a Student database, in which various fields are RollNo,
FirstName, LastName, and CourseID. The data dictionary for this database ban
maintains the information about this database. The data dictionary contains the dog
column names, Data type of each field and the description of each column of the
database.

Active and Passive Data Dictionaries

Active data dictionary
• Active Data dictionary is managed automatically by the database management
system.
• They are consistent with current structure.
• In the active data dictionary, when any modification or changes is executed by
the DBMS, then this dictionary it also gets modified by the DBMS automatically.
• Most of the active data dictionaries are derived from system catalog.
Passive data dictionary
• Passive data dictionary is used only for documentation purpose.
• Passive dictionary is a self-contained application and set of files used for
documenting the data processing environment.
• The process of maintaining or modification of the database is manual.
• It is managed by the users of the database systems.
Difference between active and passive data dictionary
Indexing and Hashing
AU: Dec.-11, Marks 6
• An index is a data structure that organizes data records on the disk to make the
retrieval of data efficient.
• The search key for an index is collection of one or more fields of records using
which we can efficiently retrieve the data that satisfy the search conditions.
• The indexes are required to speed up the search operations on file of records.
• There are two types of indices -
• Ordered Indices: This type of indexing is based on sorted ordering values.
• Hash Indices: This type of indexing is based on uniform distribution of values
across range of buckets. The address of bucket is obtained using the hash function.
• There are several techniques of for using indexing and hashing. These techniques
are evaluated based on following factors -
• Access Types: It supports various types of access that are supported
efficiently.
• Access Time: It denotes the time it takes to find a particular data item or set
items.
• Insertion Time: It represents the time required to insert new data item.
• Deletion Time: It represents the time required to delete the desired data item.
• Space overhead: The space is required to occupy the index structure. But
allocating such extra space is worth to achieve improved performance.
Example 4.4.1 Since indices speed query processing. Why might they not be kept
on several search keys? List as many reasons as possible. AU: Dec.-11, Marks 6
Solution: Reasons for not keeping several search indices include:
a. Every index requires additional CPU time and disk I/O overhead during inserts
and deletions.
b. Indices on non-primary keys might have to be changed on updates, although an
index on the primary key might not (as updates typically do not modify the
primary key attributes).
c. Each extra index requires additional storage space.
d. For queries which involve conditions on several search keys, efficiency might
not be bad even if only some of the keys have indices on them.
Therefore database performance is improved less by adding indices when many
indices already exist.

Ordered Indices

Primary and Clustered Indices

Primary index :
• An index on a set of fields that includes the primary key is called a primary
index. The primary index file should be always in sorted order.
• The primary indexing is always done when the data file is arranged in sorted
order and primary indexing contains the primary key as its search key.
• Consider following scenario in which the primary index consists of few entries as
compared to actual data file.

• Once if you are able to locate the first entry of the record containing block, other
entries are stored continuously. For example if we want to search a record for Reg
No 11AS32 we need not have to search for the entire data file. With the help of
primary index structure we come to know the location of the record containing the
RegNo 11AS30, now when the first entry of block 30 is located, then we can easily
no rig locate the entry for 11AS32.
• We can apply binary search technique. Suppose there are n = 300 blocks in a
main data file then the number of accesses required to search the data file will be
log2n+ 1 = (log2 300) + 1≈9
• If we use primary index file which contains at the most n = 3 blocks then using
binary search technique, the number of accesses required to search using the
primary index file will be log2 n+1= (log2 3)+1=3
• This shows that using primary index the access time can be deduced to great
extent.
Clustered index:
• In some cases, the index is created on non-primary key columns which may not
be unique for each record. In such cases, in order to identify the records faster, we
will group two or more columns together to get the unique values and create index
out of them. This method is known as clustering index.
• When a file is organized so that the ordering of data records is the same as the
ordering of data entries in some index then say that index is clustered, otherwise it
is an unclustered index.
• Note that, the data file need to be in sorted order.
• Basically, records with similar characteristics are grouped together and indexes
are created for these groups.
• For example, students studying in each semester are grouped together. i.e.; 1st
semester students, 2nd semester students, 3rd semester students etc. are grouped.
Dense and Sparse Indices
There are two types of ordered indices :
1) Dense index:
• An index record appears for every search key value in file.
• This record contains search key value and a pointer to the actual record.
• For example:

2) Sparse index:
• Index records are created only for some of the records.
• To locate a record, we find the index record with the largest search key value less
than or equal to the search key value we are looking for.
• We start at that record pointed to by the index record, and proceed along the
pointers in the file (that is, sequentially) until we find the desired record.
• For example -

Single and Multilevel Indices

Single level indexing:
• A single-level index is an auxiliary file that makes it more efficient to search for a
record in the data file.
• The index is usually specified on one field of the file (although it could be
specified on several fields).
• Each index can be in the following form.

• The index file usually occupies considerably less disk blocks than the data file
because its entries are much smaller.
• A binary search on the index yields a pointer to the file record.
• The types of single level indexing can be primary indexing, clustering index or
secondary indexing.
• Example: Following Fig. 4.7.5 represents the single level indexing -
Multilevel indexing:
• There is an immense need to keep the index records in the main memory so as to
speed up the search operations. If single-level index is used, then a large size index
cannot be kept in memory which leads to multiple disk accesses.
• Multi-level Index helps in breaking down the index into several smaller indices in
order to make the outermost level so small that it can be saved in a single disk
block, which can easily be accommodated anywhere in the main memory.
• The multilevel indexing can be represented by following Fig. 4.7.6.
Secondary Indices
• In this technique two levels of indexing are used in order to reduce the mapping
size of the first level and in general.
• Initially, for the first level, a large range of numbers is selected so that the
mapping size is small. Further, each range is divided into further sub ranges.
• It is used to optimize the query. processing and access records in a database with
some information other than the usual search key.
For example -
B+ Tree Index Files
AU: May-03,06,16,19, Dec.-17, Marks 16
• The B+ tree is similar to binary search tree. It is a balanced tree in which the
internal nodes direct the search.
• The leaf nodes of B+ trees contain the data entries.
Structure of B+ Tree
• The typical node structure of B+ node is as follows –

• It contains up to n – 1 search-key values k1, k2, ……, kn-1 and n pointers

P1, P2,..., Pn
• The search-key values within a node are kept in sorted order; thus, if i < j, then
Ki<Kj.
• To retrieve all the leaf pages efficiently we have to link them using page pointers.
The sequence of leaf pages is also called as sequence set.
• Following Fig. 4.8.1 represents the example of B+ tree.
• The B+ tree is called dynamic tree because the tree structure can grow on
insertion of records and shrink on deletion of records.
Characteristics of B+ Tree
Following are the characteristics of B+ tree.
1) The B+ tree is a balanced tree and the operations insertions and deletion keeps
the tree balanced.
2) A minimum occupancy of 50 percent is guaranteed for each node except the
root.
3) Searching for a record requires just traversal from the root to appropriate leaf.

Insertion Operation
Algorithm for insertion :
Step 1: Find correct leaf L.
Step 2: Put data entry onto L.
i) If L has enough space, done!
ii) Else, must split L (into L and a new node L2)
• Allocate new node
• Redistribute entries evenly
• Copy up middle key.
• Insert index entry pointing to L2 into parent of L.
Step 3: This can happen recursively
i) To split index node, redistribute entries evenly, but push up middle key.
(Contrast with leaf splits.)
Step 4: Splits "grow" tree; root split increases height.
i) Tree growth: gets wider or one level taller at top.
Example 4.8.1 Construct B+ tree for following data. 30,31,23,32,22,28,24,29,
where number of pointers that fit in one node are 5.
Solution: In B+ tree each node is allowed to have the number of pointers to be 5.
That means at the most 4 key values are allowed in each node.
Step 1: Insert 30,31,23,32. We insert the key values in ascending order.

Step 2: Now if we insert 22, the sequence will be 22, 23, 30, 31, 32. The middle
key 30, will go up.

Step 3: Insert 28,24. The insertion is in ascending order.

Step 4: Insert 29. The sequence becomes 22, 23, 24, 28, 29. The middle key 24
will go up. Thus we get the B+ tree.

Example 4.8.2 Construct B+ tree to insert the following (order of the tree is 3)
26,27,28,3,4,7,9,46,48,51,2,6
Solution:
Order means maximum number of children allowed by each node. Hence order 3
means at the most 2 key values are allowed in each node.
Step 1: Insert 26, 27 in ascending order
Step 2: Now insert 28. The sequence becomes 26,27,28. As the capacity of the
node is full, 27 will go up. The B+ tree will be,

Step 3: Insert 3. The partial B+ Tree will be,

Step 4: Insert 4. The sequence becomes 3,4, 26. The 4 will go up. The partial B+
tree will be –

Step 5: Insert 7. The sequence becomes 4,7,26. The 7 will go up. Again from
4,7,27. the 7 will go up. The partial B+ Tree will be,

Step 6: Insert 9. By inserting 7,9, 26 will be the sequence. The 9 will go up. The
partial B+ tree will be,
Step 7: Insert 46. The sequence becomes 27,28,46. The 28 will go up. Now the
sequence becomes 9, 27, 28. The 27 will go up and join 7. The B+ Tree will be,

Step 8: Insert 48. The sequence becomes 28,46,48. The 46 will go up. The B+
Tree will become,

Step 9: Insert 51. The sequence becomes 46,48,51. The 48 will go up. Then the
sequence becomes 28, 46, 48. Again the 46 will go up. Now the sequence becomes
7,27, 46. Now the 27 will go up. Thus the B+ tree will be
Step 10: Insert 2. The insertion is simple. The B+ tree will be,

Step 11: Insert 6. The insertion can be made in a vacant node of 7(the leaf node).
The final B+ tree will be,
Deletion Operation
Algorithm for deletion:
Step 1: Start at root, find leaf L with entry, if it exists.
Step 2: Remove the entry.
i) If L is at least half-full, done!
ii) If L has only d-1 entries,
• Try to re-distribute, borrowing keys from sibling.
(adjacent node with same parent as L).
• If redistribution fails, merge L and sibling.
Step 3: If merge occurred, must delete entry (pointing to L or sibling) from parent
of L.
Step 4: Merge could propagate to root, decreasing height.
Example 4.8.3 Construct B+ Tree for the following set of key values
(2,3,5,7,11,17,19,23,29,31) Assume that the tree is initially empty and values are
added in ascending order. Construct B+ tree for the cases where the number of
pointers that fit one node is four. After creation of B+ tree perform following
series of operations:
(a) Insert 9. (b) Insert 10. (c) Insert 8. (d) Delete 23. (e) Delete 19.
Solution: The number of pointers fitting in one node is four. That means each node
contains at the most three key values.
Step 1: Insert 2, 3, 5.

Step 2: If we insert 7, the sequence becomes 2, 3, 5, 7. Since each node can

accommodate at the most three key, the 5 will go up, from the sequence 2, 3, 5, 7.

Step 3: Insert 11. The partial B+ tree will be,

Step 4: Insert 17. The sequence becomes 5,7, 11,17. The element 11 will go up.
Then the partial B+ tree becomes,

Step 5: Insert 19.

Step 6: Insert 23. The sequence becomes 11,17,19,23. The 19 will go up.

Step 7: Insert 29. The partial B+ tree will be,

Step 8: Insert 31. The sequence becomes 19,23,29, 31. The 29 will go up. Then at
the upper level the sequence becomes 5,11,19,29. Hence again 19 will go up to
maintain the capacity of node (it is four pointers three key values at the most).
Hence the complete B+ tree will be,

(a) Insertion of 9: It is very simple operation as the node containing 5,7 has one
space vacant to accommodate. The B+ tree will be,

(b) Insert 10: If we try to insert 10 then the sequence becomes 5,7,9,10. The 9 will
go up. The B+ tree will then become –
(c) Insert 8: Again insertion of 8 is simple. We have a vacant space at node 5,7. So
we just insert the value over there. The B+ tree will be-

(d) Delete 23: Just remove the key entry of 23 from the node 19,23. Then merge
the sibling node to form a node 19,29,31. Get down the entry of 11 to the leaf
node. Attach the node of 11,17 as a left child of 19.

(e) Delete 19: Just delete the entry of 19 from the node 19,29,31. Delete the
internal node key 19. Copy the 29 up as an internal node as it is an inorder
successor node.
Search Operation
1. Perform a binary search on the records in the current node.
2. If a record with the search key is found, then return that record.
3. If the current node is a leaf node and the key is not found, then report an
unsuccessful search.
4. Otherwise, follow the proper branch and repeat the process.
For example-

Consider the B+ tree as shown in above Fig. 4.8.2.

For searching a node 25, we start from the root node -
(1) Compare 20 with key value 25. As 25>20, move on to right branch.
(2) Compare 25 with key value 25. As the match is found we declare, that the
given node is present in the B+ tree.
For searching a node 10, we start form the root node -
(1) Compare 20 with key value 10, as 10<20, we follow left branch
(2) Compare 8 with 10, 10>8, then we compare 10 with the next adjacent value of
the same node. It is 11, as 10<11, we follow left branch of 11.
(3) We compare 10, with all the values in that node, as match is not found we
report unsuccessful search or node is not present in given B+ tree.
Merits of B+ Index Tree Structure
1. In B+ tree the data is stored in leaf node so searching of any data requires
scanning only of leaf node alone.
2. Data is ordered in linked list.
3. Any record can be fetched in equal number of disk accesses.
4. Range queries can be performed easily as leaves are linked up.
5. Height of the tree is less as only keys are used for indexing.
6. Supports both random and sequential access.
Demerits of B+ Index Tree Structure
1. Extra insertion of non leaf nodes.
2. There is space overhead.

Static Hashing
• In this method of hashing, the resultant data bucket address will be always same.
• That means, if we want to generate address for Stud_RollNo = 34789. Here if we
use mod 10 hash function, it always result in the same bucket address 9. There will
not be any changes to the bucket address here.
• Hence number of data buckets in the memory for this static hashing remains
constant throughout. In our example, we will have ten data buckets in the memory
used to store the data.

• If there is no space for some data entry then we can allocate new overflow page,
put the data record onto that page and add the page to overflow chain of the bucket.
For example if we want to add the Stud_RollNo= 35111 in above hash table then
as there is no space for this entry and the hash address indicate to place this record
at index 1, we create overflow chain as shown in Table 4.11.1.
Example 4.11.1 Why is hash structure not the best choice for a search key on
which range of queries are likely ? AU: May-06, Marks 8
Solution :
• A range query cannot be answered efficiently using a hash index, we will have to
read all the buckets.
• This is because key values in the range do not occupy consecutive locations in
the buckets, they are distributed uniformly and randomly throughout all the
buckets.
Advantages of Static Hashing
(1) It is simple to [Link]
(2) It allows speedy data storage.
Disadvantages of Static Hashing
There are two major disadvantages of static hashing:
1) In static hashing, there are fixed number of buckets. This will create a
problematic situation if the number of records grow or shrink.
2) The ordered access on hash key makes it inefficient.

Dynamic Hashing
AU: May-04,07,18, Dec.-08,17, Marks 13
• The problem with static hashing is that it does not expand or shrink dynamically
as the size of the database grows or shrinks.
• Dynamic hashing provides a mechanism in which data buckets are added and
removed dynamically and on-demand.
• The most commonly used technique of dynamic hashing is extendible hashing.

Extendible Hashing
The extendible hashing is a dynamic hashing technique in which, if the bucket is
overflow, then the number of buckets are doubled and data entries in buckets are
re- distributed.
Example of extendible hashing:
In extendible hashing technique the directory of pointers to bucket is used. Refer
following Fig. 4.12.1
To locate a data entry, we apply a hash function to search the data we us last two
digits of binary representation of number. For instance binary representation of 32*
= 10000000. The last two bits are 00. Hence we store 32* accordingly.
Insertion operation :
• Suppose we want to insert 20* (binary 10100). But with 00, the bucket A is full.
So we must split the bucket by allocating new bucket and redistributing the
contents, bellsp across the old bucket and its split image.
• For splitting, we consider last three bits of h(r).
• The redistribution while insertion of 20* is as shown in following Fig. 4.12.2.
The split image of bucket A i.e. A2 and old bucket A are based on last two bits i.e.
00. Here we need two data pages, to adjacent additional data record. Therefore
here it is necessary to double the directory using three bits instead of two bits.
Hence,
• There will be binary versions for buckets A and A2 as 000 and 100.
• In extendible hashing, last bits d is called global depth for directory and d is
called local depth for data pages or buckets. After insetion of 20*, the global depth
becomes 3 as we consider last three bits and local depth of A and A2 buckets
become 3 as we are considering last three bits for placing the data records. Refer
Fig. 4.12.3.
(Note: Student should refer binary values given in Fig. 4.12.2, for understanding
insertion operation)
• Suppose if we want to insert 11*, it belongs to bucket B, which is already full.
Hence let us split bucket B into old bucket B and split image of B as B2.
• The local depth of B and B2 now becomes 3.
• Now for bucket B, we get and 1=001
11 100011
• For bucket B2, we get
5=101
29 = 11101
and 21 =10101
After insertion of 11* we get the scenario as follows,
Query Processing Overview
AU: May-14,16,18, Dec.-19, Marks 16
• Query processing is a collection of activities that are involved in extracting data
from database.
• During query processing there is translation high level database language queries
into the expressions that can be used at the physical level of filesystem.
• There are three basic steps involved in query processing and those are -
1. Parsing and Translation
• In this step the query is translated into its internal form and then into relational
algebra.
• Parser checks syntax and verifies relations.
• For instance - If we submit the query as,
SELECT RollNo, name
FROM Student
HAVING RollNo=10
Then it will issue a syntactical error message as the correct query should be
SELECT RollNo, name
FROM Student
HAVING RollNo=10
Thus during this step the syntax of the query is checked so that only correct and
verified query can be submitted for further processing.
2. Optimization
• During this process thequery evaluation plan is prepared from all the relational
algebraic expressions. bud off
• The query cost for all the evaluation plans is calculated.
• Amongst all equivalent evaluation plans the one with lowest cost is chosen.
• Cost is estimated using statistical information from the database catalog, such
asthe number of tuples in each relation, size of tuples, etc.
3. Evaluation
• The query-execution engine takes a query-evaluation plan, executes that plan, and
returns the answers to the query.
For example - If the SQL query is,
SELECT balance
FROM account
WHERE balance<1000
Step 1: This query is first verified by the parser and translator unit for correct
syntax. If so then the relational algebra expressions can be obtained. For the above
given queries there are two possible relational algebra
(1) σbalance<1000(Πbalance (account))
(2) Πbalance ( σbalance<1000 (account))
Step 2: Query Evaluation Plan: To specify fully how to evaluate a query, we need
not only to provide the relational-algebra expression, but also to annotate it with
instructions specifying how to evaluate each operation. For that purpose, using the
order of evaluation of queries, two query evaluation plans are prepared. These are
as follows
Associated with each query evaluation plan there is a query cost. The query
optimization selects the query evaluation plan having minimum query cost.
Once the query plan is chosen, the query is evaluated with that plan and the result
of the query is output.

Query Optimization using Heuristics - Cost Estimation

AU: Dec,-13,16, May-15, Marks 16

Heuristic Estimation
• Heuristic is a rule that leads to least cost in most of cases.
• Systems may use heuristics to reduce the number of choices that must be made in
a cost-based fashion.
• Heuristic optimization transforms the query-tree by using a set of rules that
typically t improve execution performance. These rules are
1. Perform selection early (reduces the number of tuples)
2. Perform projection early (reduces the number of attributes)
3. Perform most restrictive selection and join operations before other similar on she
is ont ni bold do operations (such as cartesian product).
• Some systems use only heuristics, others combine heuristics with partial cost-
based optimization.
Steps in Heuristic Estimation
Step 1: Scanner and parser generate initial query representation
Step 2: Representation is optimized according to heuristic rules
Step 3: Query execution plan is developed
For example: Suppose there are two relational algebra -
(1) σcity= "Pune" (Tcname Branch) Account Customer)
(2) Πcname(σcity="Pune (Branch Account Customer))
The query evaluation plan can be drawn using the query trees as follows-

Out of the above given query evaluation plans, the Fig. 4.16.1 (b) is much faster
than Fig. 4.16.1 (a) because - in Fig. 4.16.1 (a) the join operation is among Branch,
Account and Customer, whereas in Fig. 4.16.1 (b) the join of (Account and
Customer) is made with the selected tuple for City="Pune". Thus the output of
entire table for join operation is much more than the join for some selected tuples.
Thus we get choose the optimized query.

Cost based Estimation

• A cost based optimizer will look at all of the possible ways or scenarios in which
a query can be executed.
• Each scenario will be assigned a 'cost', which indicates how efficiently that query
can be run.
• Then, the cost based optimizer will pick the scenario that has the least cost and
execute the query using that scenario, because that is the most efficient way to run
the query.
• Scope of query optimization is a query block. Global query optimization involves
multiple query blocks.
• Cost components for query execution
• Access cost to secondary storage
• Disk storage cost
• Computation cost
• Memory usage cost
• Communication cost
• Following information stored in DBMS catalog and used by optimizer
• File size
• Organization
• Number of levels of each multilevel index
• Number of distinct values of an attribute
• Attribute selectivity
• RDBMS stores histograms for most important attributes

RAID Techniques in DBMS for CS3492
No ratings yet
RAID Techniques in DBMS for CS3492
24 pages
Overview of Apache Hive Data Formats
100% (1)
Overview of Apache Hive Data Formats
47 pages
Anna University DBMS Syllabus R21
100% (1)
Anna University DBMS Syllabus R21
3 pages
CCS341 Data Warehousing Question Bank
No ratings yet
CCS341 Data Warehousing Question Bank
21 pages
Static vs Dynamic Hashing in DBMS
100% (1)
Static vs Dynamic Hashing in DBMS
8 pages
Banking System Design in C++ OOP
No ratings yet
Banking System Design in C++ OOP
6 pages
Disk Scheduling Methods Explained
No ratings yet
Disk Scheduling Methods Explained
15 pages
CW3551 Data Security Lesson Plan
No ratings yet
CW3551 Data Security Lesson Plan
5 pages
Data Science Fundamentals Answer Key
No ratings yet
Data Science Fundamentals Answer Key
8 pages
CCS341 Data Warehousing Syllabus
No ratings yet
CCS341 Data Warehousing Syllabus
2 pages
CS3391 Java Concepts Question Bank
No ratings yet
CS3391 Java Concepts Question Bank
40 pages
MCA Syllabus Overview 2025
No ratings yet
MCA Syllabus Overview 2025
36 pages
Understanding Cyclic Redundancy Check (CRC)
No ratings yet
Understanding Cyclic Redundancy Check (CRC)
4 pages
CS3492 DBMS Syllabus - Anna University
No ratings yet
CS3492 DBMS Syllabus - Anna University
2 pages
Classical Ciphers Overview
No ratings yet
Classical Ciphers Overview
44 pages
CS8492-Database Management Systems
No ratings yet
CS8492-Database Management Systems
15 pages
Routing Protocols and Techniques Explained
No ratings yet
Routing Protocols and Techniques Explained
19 pages
CS 3452 Theory of Computation Overview
No ratings yet
CS 3452 Theory of Computation Overview
41 pages
Data Warehousing Insights: Snowflake vs Oracle
No ratings yet
Data Warehousing Insights: Snowflake vs Oracle
30 pages
VCET SIS Login Instructions
No ratings yet
VCET SIS Login Instructions
135 pages
CCS372 Virtualization Course Overview
No ratings yet
CCS372 Virtualization Course Overview
53 pages
M.Tech CSE AI & ML Question Bank
No ratings yet
M.Tech CSE AI & ML Question Bank
15 pages
Windows OS Installation Guide
No ratings yet
Windows OS Installation Guide
64 pages
Implementing Symmetric Key Algorithms
No ratings yet
Implementing Symmetric Key Algorithms
87 pages
Fundamental Data Structure
No ratings yet
Fundamental Data Structure
15 pages
CD3291 Data Structures Lesson Plan
No ratings yet
CD3291 Data Structures Lesson Plan
4 pages
ACID Properties and Two-Phase Locking
No ratings yet
ACID Properties and Two-Phase Locking
33 pages
CCS341 Data Warehousing Assessment Key
No ratings yet
CCS341 Data Warehousing Assessment Key
18 pages
M.E. Cse (Ai&ml)
No ratings yet
M.E. Cse (Ai&ml)
63 pages
Depth First Search on Graph G
No ratings yet
Depth First Search on Graph G
20 pages
Understanding Linear Discriminant Analysis
No ratings yet
Understanding Linear Discriminant Analysis
21 pages
BI and DW: Key Concepts Explained
No ratings yet
BI and DW: Key Concepts Explained
11 pages
M.Tech R25 CSE - CC & Syllabus Updated
No ratings yet
M.Tech R25 CSE - CC & Syllabus Updated
68 pages
Similarity-Based Learning in ML
No ratings yet
Similarity-Based Learning in ML
51 pages
CS3391 OOP Notes and Key Concepts
No ratings yet
CS3391 OOP Notes and Key Concepts
12 pages
CS3301 Data Structures Overview
No ratings yet
CS3301 Data Structures Overview
13 pages
BCA Database Management Lab Manual
No ratings yet
BCA Database Management Lab Manual
80 pages
Understanding BIP 36 and Distributed Systems
No ratings yet
Understanding BIP 36 and Distributed Systems
11 pages
CCS354 Network Security Lab Exam Guide
No ratings yet
CCS354 Network Security Lab Exam Guide
2 pages
Deterministic vs Non-Deterministic Algorithms
No ratings yet
Deterministic vs Non-Deterministic Algorithms
5 pages
MongoDB in Big Data Analytics
No ratings yet
MongoDB in Big Data Analytics
14 pages
Market-Oriented Cloud Architecture Overview
No ratings yet
Market-Oriented Cloud Architecture Overview
32 pages
CS3481 Database Management Lab Guide
No ratings yet
CS3481 Database Management Lab Guide
85 pages
Overview of Intel 80386 Microprocessor
No ratings yet
Overview of Intel 80386 Microprocessor
3 pages
Data Warehousing & Mining Overview
100% (1)
Data Warehousing & Mining Overview
143 pages
CW3551 Data and Info Security Exam Guide
No ratings yet
CW3551 Data and Info Security Exam Guide
32 pages
Advanced Database Systems Overview
No ratings yet
Advanced Database Systems Overview
29 pages
Data Cleansing in Data Warehousing
No ratings yet
Data Cleansing in Data Warehousing
103 pages
UHV-II Assignment 3 Reflections
No ratings yet
UHV-II Assignment 3 Reflections
2 pages
Data Stream Mining Techniques
No ratings yet
Data Stream Mining Techniques
16 pages
Algorithm Lab Manual for Experiments
100% (1)
Algorithm Lab Manual for Experiments
37 pages
Energy-Efficient ML for IoT Devices
No ratings yet
Energy-Efficient ML for IoT Devices
1 page
DBMS Course Syllabus Overview
No ratings yet
DBMS Course Syllabus Overview
2 pages
Mathematical Induction in Computation
No ratings yet
Mathematical Induction in Computation
70 pages
Brute Force, Divide & Decrease Conquer
No ratings yet
Brute Force, Divide & Decrease Conquer
132 pages
Data Exploration with WEKA and ARFF
No ratings yet
Data Exploration with WEKA and ARFF
19 pages
CCS341 Data Warehousing Overview
No ratings yet
CCS341 Data Warehousing Overview
4 pages
Big Data Analytics Internal Assessment 2024
No ratings yet
Big Data Analytics Internal Assessment 2024
2 pages
Understanding Ad Hoc Networks
No ratings yet
Understanding Ad Hoc Networks
58 pages
File Organization and RAID Indexing
No ratings yet
File Organization and RAID Indexing
64 pages
5.0L PCM Pinout Specifications
No ratings yet
5.0L PCM Pinout Specifications
6 pages
OpenGL vs. OpenCL: Key Differences
No ratings yet
OpenGL vs. OpenCL: Key Differences
9 pages
AU Small Finance Bank Vehicle Loans Study
No ratings yet
AU Small Finance Bank Vehicle Loans Study
71 pages
Fodder Production in Livestock Management
No ratings yet
Fodder Production in Livestock Management
133 pages
NingXia Red®: Energy & Wellness Drink
No ratings yet
NingXia Red®: Energy & Wellness Drink
2 pages
Taguchi Method for Scandium Extraction
No ratings yet
Taguchi Method for Scandium Extraction
16 pages
Calorie Detection and Food Recommendations
No ratings yet
Calorie Detection and Food Recommendations
12 pages
Automatic Road Barriers Installation Guide
No ratings yet
Automatic Road Barriers Installation Guide
104 pages
GIMP Basics: A Student's Guide
No ratings yet
GIMP Basics: A Student's Guide
3 pages
Intelligent Solar Street Light Controller
No ratings yet
Intelligent Solar Street Light Controller
4 pages
Financial Management Course Module Overview
No ratings yet
Financial Management Course Module Overview
271 pages
Ecofit Type DS Square D, Westinghouse
No ratings yet
Ecofit Type DS Square D, Westinghouse
2 pages
Technology and Livelihood Education: Computer System Servicing Quarter 1-Module 2 - Part 2
No ratings yet
Technology and Livelihood Education: Computer System Servicing Quarter 1-Module 2 - Part 2
29 pages
Entry-Level IT Professional Profile
No ratings yet
Entry-Level IT Professional Profile
2 pages
74th PICPA ANC Invitation 2019
No ratings yet
74th PICPA ANC Invitation 2019
1 page
Business Studies MCQ Exam Paper
No ratings yet
Business Studies MCQ Exam Paper
7 pages
SELCO's Strategic Expansion Insights
No ratings yet
SELCO's Strategic Expansion Insights
11 pages
How To Build Csars: For Opentosca
No ratings yet
How To Build Csars: For Opentosca
25 pages
Technical Audit Engineer Job Overview
No ratings yet
Technical Audit Engineer Job Overview
2 pages
Business Plan Assessment Criteria
No ratings yet
Business Plan Assessment Criteria
9 pages
102.05.06 Moduflex Sheet Data H400
100% (1)
102.05.06 Moduflex Sheet Data H400
2 pages
ASML EUV Product Roadmap Overview
No ratings yet
ASML EUV Product Roadmap Overview
27 pages
Capital Structure Policy Explained
No ratings yet
Capital Structure Policy Explained
7 pages
Term 3 Welcome and Grade 12 Update
No ratings yet
Term 3 Welcome and Grade 12 Update
2 pages
CAEHS Fee Structure 2024-25
No ratings yet
CAEHS Fee Structure 2024-25
1 page
Hitachi CP-WX4042WN
No ratings yet
Hitachi CP-WX4042WN
4 pages
Sahal Travel Agency Online System
No ratings yet
Sahal Travel Agency Online System
34 pages
Annotation-Free Music Transcription Model
No ratings yet
Annotation-Free Music Transcription Model
7 pages
Sushi Box Quotation Details
No ratings yet
Sushi Box Quotation Details
1 page
Screenshot 2024-11-28 at 3.18.06 AM
No ratings yet
Screenshot 2024-11-28 at 3.18.06 AM
1 page

RAID and File Organization Techniques

Uploaded by

RAID and File Organization Techniques

Uploaded by

Unit: IV: Implementation Techniques

For instance the first record of example file can be stored as

Thus total 29 bytes are required to store.

• On insertion of a new record, we use the record pointed to by the header. We

• The variable length representation of first record is,

Data Dictionary Storage

Active and Passive Data Dictionaries

Primary and Clustered Indices

Single and Multilevel Indices

• It contains up to n – 1 search-key values k1, k2, ……, kn-1 and n pointers

Step 3: Insert 28,24. The insertion is in ascending order.

Step 3: Insert 3. The partial B+ Tree will be,

Step 2: If we insert 7, the sequence becomes 2, 3, 5, 7. Since each node can

Step 3: Insert 11. The partial B+ tree will be,

Step 5: Insert 19.

Step 7: Insert 29. The partial B+ tree will be,

Consider the B+ tree as shown in above Fig. 4.8.2.

Query Optimization using Heuristics - Cost Estimation

Cost based Estimation

You might also like