Multi-Tenant Database Architectures
Chapter
Multi-Tenant Software
The term "software multitenancy" refers to a software architecture in which a single instance
of software runs on a server and serves multiple tenants. A tenant is a group of users who share
common access with specific privileges to the software instance.
In a single-tenant cloud, only one customer is hosted on a server and is granted access to it.
Because multi-tenant architectures host multiple customers on the same servers, it is
important to fully understand the security and performance the provider is offering. Single-
tenant clouds give customers more control over the management of data, storage, security
and performance.
Multi-Entity Support:
UNIT-III
Figure shown below depicts the changes that need to be made in an application to support basic multi-entity
features, so that users only access data belonging to their own units. Each database table is appended with a
column (OU_ID) which marks the organizational unit each data record belongs to.
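The OU_ID mechanism can be sketched with an in-memory SQLite database. The table and unit names below are invented for illustration; the point is that every query the application issues carries an OU_ID predicate, so users only see rows belonging to their own organizational unit.

```python
import sqlite3

# Shared single-schema table: every row is tagged with an OU_ID column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ledger (id INTEGER, amount REAL, OU_ID TEXT)")
conn.executemany("INSERT INTO ledger VALUES (?, ?, ?)",
                 [(1, 100.0, "branch_A"), (2, 250.0, "branch_B"),
                  (3, 75.0, "branch_A")])

def records_for_unit(ou_id):
    # The application layer appends the OU_ID predicate to every query,
    # so a user only accesses data belonging to their own unit.
    cur = conn.execute(
        "SELECT id, amount FROM ledger WHERE OU_ID = ? ORDER BY id", (ou_id,))
    return cur.fetchall()

print(records_for_unit("branch_A"))  # [(1, 100.0), (3, 75.0)]
```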
In the single schema model of Figure 9.2, a Custom Fields table stores meta information and data values for all
tables in the application. Mechanisms for handling custom fields in a single schema architecture are usually
variants of this scheme.
Multi-Schema approach
Instead of insisting on a single schema, it is sometimes easier to modify even an existing application to use
multiple schemas, as are supported by most relational databases. In this model, the application computes which
OU the logged in user belongs to, and then connects to the appropriate database schema. Such an architecture is
shown in Figure 9.3.
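The schema-selection step of the multi-schema approach can be sketched as follows; the user-to-OU mapping and the schema naming convention are assumptions made for illustration:

```python
# Multi-schema sketch: the application computes which organizational unit
# (OU) the logged-in user belongs to, then derives the database schema
# name it should connect to. All names here are invented.
USER_TO_OU = {"alice": "branch_A", "bob": "branch_B"}

def schema_for_user(username):
    ou = USER_TO_OU[username]
    # e.g. each OU gets its own schema, such as "app_branch_A"
    return f"app_{ou}"

print(schema_for_user("alice"))  # app_branch_A
```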
For the most part, multi-tenancy as discussed above appears to be of use primarily in a software as a service
model. There are also certain cases where multi-tenancy can be useful within the enterprise as well. We have
already seen that supporting multiple entities, such as bank branches, is essentially a multi-tenancy requirement.
Similar needs can arise if a workgroup level application needs to be rolled out to many independent teams, who
usually do not need to share data.
In these cases access to data may need to be controlled based on the values of any field of a table, such as high-
value transactions being visible only to some users, or special customer names being invisible without explicit
permission. Such requirements are referred to as Data Access Control (DAC) needs.
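A minimal sketch of such a field-value-based Data Access Control rule follows; the threshold value and role names are assumptions, not taken from any specific product:

```python
# Illustrative DAC check: visibility of a record depends on field values,
# e.g. high-value transactions are visible only to privileged users.
HIGH_VALUE_THRESHOLD = 10_000  # assumed policy threshold

def visible(record, user_roles):
    # Hide high-value transactions from users lacking the auditor role.
    if record["amount"] > HIGH_VALUE_THRESHOLD and "auditor" not in user_roles:
        return False
    return True

txns = [{"id": 1, "amount": 500}, {"id": 2, "amount": 50_000}]
print([t["id"] for t in txns if visible(t, {"teller"})])   # [1]
print([t["id"] for t in txns if visible(t, {"auditor"})])  # [1, 2]
```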
Chapter
Data in the Cloud
Since the 80s relational database technology has been the ‘default’ data storage and retrieval mechanism
used in the vast majority of enterprise applications. In the process of creating a planetary scale web search
service, Google in particular has developed a massively parallel and fault tolerant distributed file system (GFS)
along with a data organization (BigTable) and programming paradigm (MapReduce) that is markedly different
from the traditional relational model. Such ‘cloud data strategies’ are particularly well suited for large-volume
massively parallel text processing, as well as possibly other tasks, such as enterprise analytics. At the same time
there have been new advances in building specialized database organizations optimized for analytical data
processing, in particular column-oriented databases such as Vertica.
Relational databases:
Before we delve into cloud data structures we first review traditional relational database systems and how they
store data. Users (including application programs) interact with an RDBMS via SQL; the database ‘front-end’ or
parser transforms queries into memory and disk level operations to optimize execution time. Data records are
stored on pages of contiguous disk blocks, which are managed by the disk-space-management layer.
• Database systems usually do not rely on the file system layer of the OS and instead manage disk space
themselves
• Rows are stored on pages contiguously, also called a ‘row-store’, and indexed using B+-trees
• The database needs to be able to adjust its page replacement policy when needed and pre-fetch pages from
disk based on expected access patterns, which can be very different from file operations.
• Relational records (tabular rows) are stored on disk pages and accessed through indexes on specified
columns, which can be B+-tree indexes, hash indexes, or bitmap indexes.
• B+-tree indexes suit transaction-processing workloads where writes and updates dominate; for
read-dominated analytical workloads, bitmap indexes, cross-table indexes and materialized views
provide more efficient access to records and their attributes.
• Recently column-oriented storage [61] has been proposed as a more efficient mechanism suited for
analytical workloads
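The row-store versus column-store distinction can be illustrated with a toy layout; the table contents are invented. In the column layout, an analytical scan of one attribute touches only that column's contiguous values rather than every full row.

```python
# Row-store: each record's fields are stored together.
rows = [(1, "pen", 2.5), (2, "book", 12.0), (3, "ink", 5.0)]

# Column-store: one contiguous sequence per attribute.
columns = {
    "id":    [r[0] for r in rows],
    "name":  [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

# An analytical aggregate (sum of prices) reads a single column only.
total = sum(columns["price"])
print(total)  # 19.5
```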
Over the years database systems have evolved towards exploiting the parallel computing capabilities of multi-
processor servers as well as harnessing the aggregate computing power of clusters of servers connected by a
high-speed network.
The Google File System (GFS) [26] is designed to manage relatively large files using a very large distributed
cluster of commodity servers connected by a high-speed network.
It is therefore designed to
(a) expect and tolerate hardware failures, even during the reading or writing of an individual file (since files are
expected to be very large)
(b) support parallel reads, writes and appends by multiple client programs.
The Hadoop Distributed File System (HDFS) is an open source implementation of the GFS architecture that is
also available on the Amazon EC2 cloud platform; we refer to both GFS and HDFS as ‘cloud file systems.’
The architecture of cloud file systems is illustrated in Figure 10.3. Large files are broken up into ‘chunks’ (GFS)
or ‘blocks’ (HDFS), which are themselves large (64MB being typical). These chunks are stored on commodity
(Linux) servers called Chunk Servers (GFS) or Data Nodes (HDFS); further each chunk is replicated at least
three times, both on a different physical rack as well as a different network segment in anticipation of possible
failures of these components apart from server failures.
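The rack-aware replica placement idea can be sketched as follows. The rack and node names are invented, and real GFS/HDFS placement also weighs server load and network topology; the sketch only shows the invariant that the three replicas of a chunk land on distinct racks.

```python
# Invented cluster layout: racks containing data nodes / chunk servers.
RACKS = {
    "rack1": ["dn1", "dn2"],
    "rack2": ["dn3", "dn4"],
    "rack3": ["dn5", "dn6"],
}

def place_replicas(chunk_id, n=3):
    # Pick one server from each of n distinct racks, so no two replicas
    # of a chunk share a rack (surviving a whole-rack failure).
    replicas = []
    for rack, nodes in list(RACKS.items())[:n]:
        replicas.append((rack, nodes[chunk_id % len(nodes)]))
    return replicas

placement = place_replicas(7)
print(placement)  # three (rack, node) pairs on three different racks
```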
• Data in a BigTable is accessed by a row key, a column key and a timestamp. Each column can store
arbitrary name–value pairs of the form (column-family:label, string).
• Each BigTable cell (row, column) can contain multiple versions of the data, stored in decreasing
timestamp order.
Since data in each column family is stored together, using this data organization results in efficient data access
patterns depending on the nature of analysis: For example, only the location column family may be read for
traditional data-cube based analysis of sales, whereas only the product column family is needed for say, market-
basket analysis. Thus, the BigTable structure can be used in a manner similar to a column-oriented database.
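The (row key, column key, timestamp) access model can be mimicked with a plain dictionary; the row and column names below are invented. Each cell keeps multiple timestamped versions, newest first, exactly as the description above states.

```python
# Minimal model of BigTable cells: data addressed by (row key, column key),
# where the column key has the form "family:label"; each cell stores
# multiple versions in decreasing timestamp order.
table = {}

def put(row, column, value, ts):
    versions = table.setdefault((row, column), [])
    versions.append((ts, value))
    versions.sort(key=lambda v: v[0], reverse=True)  # newest version first

put("cust#42", "location:city", "Pune", ts=100)
put("cust#42", "location:city", "Mumbai", ts=200)

latest_ts, latest_value = table[("cust#42", "location:city")][0]
print(latest_value)  # Mumbai -- the most recent version
```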
Figure 10.5 illustrates how BigTable tables are stored on a distributed file system such as GFS or HDFS.
BigTable and HBase rely on the underlying distributed file systems GFS and HDFS respectively and therefore
also inherit some of the properties of these systems.
In particular large parallel reads and inserts are efficiently supported, even simultaneously on the same table,
unlike a traditional relational database.
Dynamo
• Unlike BigTable, Dynamo was designed specifically for supporting a large volume of concurrent
updates, each of which could be small in size, rather than bulk reads and appends as in the case of
BigTable and GFS
• Dynamo also replicates data for fault tolerance, but uses distributed object versioning and quorum-
consistency to enable writes to succeed without waiting for all replicas to be successfully updated,
unlike in the case of GFS.
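The quorum rule used by Dynamo-style stores can be stated in a few lines: with N replicas, a write succeeds after W acknowledgements and a read consults R replicas. Choosing R + W > N guarantees that every read set overlaps the most recent write set, so stale reads can be detected; the specific values below are illustrative.

```python
# Quorum consistency: with N replicas, writes wait for W acks and reads
# consult R replicas. R + W > N ensures a read overlaps the latest write.
def quorum_is_consistent(n, r, w):
    return r + w > n

# A commonly cited configuration: N=3, R=2, W=2 -> consistent.
print(quorum_is_consistent(n=3, r=2, w=2))  # True
# Faster but weaker: N=3, R=1, W=1 -> reads may miss the latest write.
print(quorum_is_consistent(n=3, r=1, w=1))  # False
```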
The Google and Amazon cloud services do not directly offer BigTable and Dynamo to cloud users.
Google and Amazon both offer simple key-value pair database stores, viz. Google App Engine’s Datastore and
Amazon’s SimpleDB.
Chapter
Database Technology
• The first one is the general database solution, implemented by installing some
database solution on IaaS (a virtual machine delivered as IaaS).
- Users can deploy database applications on cloud virtual machines like any other
application software.
- Apart from this, ready-made machine images supplied by vendors are also
available with pre-installed and pre-configured databases.
- Example: Amazon provides a ready-made EC2 machine image with pre-installed Oracle
Database.
• The other one is delivered by service providers as Database-as-a-Service (DBaaS), where the
vendor fully manages backend administration jobs like installation, security management
and resource assignment.
- The operational burden of provisioning, configuration and backup is managed by the
service operators.
Data Models
Relational data model:
- It is not designed for distributed data storage, which makes scaling such a database difficult.
- Oracle Database, Microsoft SQL Server, open-source MySQL and open-source PostgreSQL come under
this category.
Database-as-a-service
• It is offered on a pay-per-usage basis and provides on-demand access to a database for the storage of data.
• Database-as-a-Service (DBaaS) is a cloud service offering which is managed by cloud service providers.
• DBaaS has all of the characteristics of cloud services, like scalability, metered billing and so on.
• Examples of DBaaS for unstructured data include Amazon SimpleDB, Google Datastore and Apache Cassandra.
Deploying a relational database on a cloud server is the ideal choice for users who require absolute control over the
management of the database.
Relational Database-as-a-service
Many cloud service providers offer the customary relational database systems as fully-managed services which
provide functionalities similar to what is found in Oracle Server, SQL Server or MySQL Servers.
Amazon RDS:
Amazon Relational Database Service or Amazon RDS is a relational database service available with AWS.
It supports the capabilities of Oracle Server, Microsoft SQL server, open source PostgreSQL and MySQL.
- Reserved DB Instances : a one-time payment; offers three different DB Instance types (for light,
medium and heavy utilization).
- On-Demand DB Instances : provide hourly payments with no long-term commitments.
Google Cloud SQL:
• Google Cloud SQL is a MySQL database that lives in Google’s cloud and is fully managed by Google.
• It is very simple to use and integrates very well with Google App Engine applications written in Java, Python,
PHP and Go.
• Google Cloud SQL is also accessible through the MySQL client and other tools that work with MySQL
databases.
• Google Cloud SQL offers updated releases of MySQL.
- The Packages option is suitable for users who use the database extensively each month.
- Per-use (hourly) billing is also available and is preferable for lighter usage.
Amazon RDS, Google Cloud SQL and Azure SQL Databases deliver RDBMS as-a-Service.
Non-relational database systems are another unique offering in the field of data-intensive computing.
Big Data:
Big data is used to describe both structured and unstructured data that is massive in volume.
Volume: A typical PC had about 10 gigabytes of storage in 2000. At that time, excessive data
volume was a storage issue, as storage was not as cheap as it is today. Today, social networking sites generate
several thousand terabytes of data every day.
Velocity: Data now streams in at an unprecedented rate and speed, so it must be dealt with in a timely
manner. Responding quickly to customers’ actions is a business challenge for any organization.
Variety: Data of all formats are important today. Structured or unstructured texts, audio, video, image, 3D data and
others are all being produced every day.
NoSQL DBMS:
NoSQL is a class of database management system that does not follow all of the rules of a relational DBMS.
The term NoSQL can be interpreted as ‘Not Only SQL’: it is not a replacement for, but rather a complementary
addition to, RDBMS.
NoSQL is not against SQL; it was developed to handle unstructured big data efficiently and provide
maximum business value.
CAP Theorem:
The abbreviation CAP stands for Consistency, Availability and Partition tolerance of data.
CAP theorem (also known as Brewer’s theorem) says that it is impossible for a distributed computer system to meet
all of three aspects of CAP simultaneously.
█ Consistency: Data in the database remains consistent after the execution of an operation. For
example, once data is written or updated, all future read requests will see that data.
█ Availability: The database always remains available, without any downtime.
█ Partition tolerance: The system continues to function even if one part of the database becomes
unreachable due to a network partition; the other parts remain unaffected and operate properly.
This ensures availability of information.
CA: This is suitable for systems designed to run on a cluster at a single site, so that all nodes always remain
in contact and the worry of the network-partitioning problem almost disappears. But if a partition does occur,
the system fails.
CP: This model tolerates the network-partitioning problem but is suitable for systems where 24 × 7 availability
is not a critical issue. Some data may become inaccessible for a while, but the rest remains consistent and accurate.
AP: This model also tolerates the network-partitioning problem, as partitions are designed to work independently.
24 × 7 availability of data is assured, but some of the data returned may occasionally be inaccurate (stale).
BASE Model
Relational database system treats consistency and availability issues as essential criteria. Fulfillments of these
criteria are ensured by following the ACID (Atomicity, Consistency, Isolation and Durability) properties in RDBMS.
NoSQL databases tackle the consistency issue in a different way. They are not so stringent about consistency;
rather, they focus on partition tolerance and availability. Hence, NoSQL databases need not follow the ACID rules.
A NoSQL database should be much easier to scale out (horizontal scaling) and capable of handling large volumes
of unstructured data. To achieve this, NoSQL databases usually follow the BASE principle, which stands for
‘Basically Available, Soft state, Eventual consistency’.
█ Eventual Consistency: This principle states that, immediately after an operation, data may appear
inconsistent, but it should ultimately converge to a consistent state. For example, two users querying the same
data immediately after a transaction (on that data) may get different values. But finally, consistency will be regained.
█ Soft State: The eventual consistency model allows the database to be inconsistent for some time. To bring it
back to a consistent state, the system should allow its state to change over time even without any input. This is
known as the soft state of the system.
BASE does not address the consistency issue. The idea behind this is that data consistency is application developer’s
problem and should be handled by developer through appropriate programming techniques.
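Eventual consistency can be simulated with two in-memory "replicas"; the explicit replication step below is a stand-in for the background anti-entropy mechanisms real stores use. A reader hitting the lagging replica between the write and the sync sees stale data, after which both replicas converge.

```python
# Two replicas of the same key-value data.
replicas = {"r1": {"x": 1}, "r2": {"x": 1}}

def write(key, value):
    # The write is acknowledged after reaching one replica only.
    replicas["r1"][key] = value

def replicate():
    # Background sync step bringing the other replica up to date.
    replicas["r2"].update(replicas["r1"])

write("x", 2)
stale = replicas["r2"]["x"]   # a read on r2 before sync still sees 1
replicate()
fresh = replicas["r2"]["x"]   # after convergence, both replicas agree
print(stale, fresh)  # 1 2
```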
Despite many benefits, NoSQL fails to provide the rich analytical functionality in specific cases as RDBMS serves.
3. Column-Family Database
• A Column-Family Database (or Wide-Column Data Store/Column Store) stores data grouped in columns.
• Each column consists of three elements: a name, a value and a timestamp.
• Columns of a similar type together form a column family; the columns in a family are often accessed together.
• A column family can contain a virtually unlimited number of columns.
• The difference between column stores and key-value stores is that column stores are optimized to handle
data along columns.
• Column stores show better analytical power and provide improved performance by imposing a certain
amount of rigidity on the database schema.
• Examples: Apache’s HBase (part of the Hadoop ecosystem) and Apache’s Cassandra.
4. Graph Database
• A graph database stores data as nodes and the relationships (edges) between them, making it suitable
for highly connected data.
• Example: Neo4j.
Other prominent NoSQL databases include key-value stores such as Amazon’s DynamoDB and Amazon’s
SimpleDB, and document stores such as MongoDB, Apache’s CouchDB and Google Cloud Datastore.
Content is any kind of data ranging from text to audio, image, video and so on. Delivering this content to any location at any
time is a critical issue for the success of cloud services.
The Problem
The problem of delivering content in cloud exists due to the distance between the source of the content and the
locations of content consumers.
To meet the business needs and to fulfill application demands, the cloud-based services require a real-time information
delivery system (like live telecasting of events) to respond instantaneously. This is only possible when LAN-like
performance can be achieved in network communication for content delivery.
Cloud computing is basically built upon the Internet, but cloud based services require LAN like performance in
network communication.
The Solution
• Rather than remotely accessing content from central data centers, providers started treating content
management as a set of cached services located on servers near consumers.
• The basic idea is that, instead of centrally storing data content in a few cloud data centers, it is better to
replicate instances of the data at different locations.
• A network of such cache servers, built for the faster and more efficient delivery of content, is called a
Content Delivery Network (CDN).
The CDN is a network of distributed servers containing content replicas; it serves each request for web content
from a suitable content server, chosen based on the geographic location of the request’s origin and the locations
of the content servers.
The actual or original source of any content is known as content source or origin server.
The additional content servers, placed at different geographic locations, are known as edge servers.
CDN enables faster delivery of content by caching and replicating content from ‘content source’ to multiple ‘edge servers’ or
‘cache servers’ which are strategically placed around the globe.
Content Types:
- static content
- Live media or stream
Delivering live streaming media to users around the world is more challenging than the delivery of static content.
a. Placement of the edge servers in the network : The placement locations of edge servers are often
determined with the help of heuristic (or exploratory) techniques
b. Content management policy which decides replication management :
Two policies of content management: full-site replication and partial-site replication.
Full-site replication is done when a site is static.
c. Content delivery policy which mainly depends on the caching technique : the cache update policies
along with the cache maintenance are important in managing the content over CDN.
Content updates policies: on-demand updates and periodic updates.
d. Request routing policy : user requests are directed to the closest edge server over the CDN
that can best serve the request.
Policy decisions play a major role in the performance of any CDN service.
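A distance-based request-routing policy can be sketched as below. The edge-server locations are invented for illustration, and real CDNs also weigh server load and network conditions, not just geographic distance.

```python
import math

# Invented edge-server locations as (latitude, longitude).
EDGE_SERVERS = {"frankfurt": (50.1, 8.7), "singapore": (1.3, 103.8),
                "virginia": (38.9, -77.0)}

def route(request_lat, request_lon):
    # Direct the request to the geographically closest edge server.
    def dist(loc):
        lat, lon = EDGE_SERVERS[loc]
        return math.hypot(lat - request_lat, lon - request_lon)
    return min(EDGE_SERVERS, key=dist)

print(route(48.8, 2.3))  # a request from Paris -> frankfurt
```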
CDN services are offered by many vendors, which any content provider can use to deliver content to
customers worldwide.
Cloud service providers sometimes build their own CDN infrastructure; otherwise they outsource
the content delivery task to CDN service providers.
CDN service providers are specialists in content delivery and can deliver the highest possible performance
and quality irrespective of delivery location.
CDN Providers:
Akamai
Limelight
A provider of content delivery network services with extensive point-of-presence (PoP)
coverage worldwide.
Amazon’s CloudFront
Microsoft Azure CDN
Access to Azure blobs through the CDN is preferable to accessing them directly from source
containers.
CDN delivery of blobs stored in containers is enabled through the Microsoft Azure Developer
Portal.
When a request for data is made using the Azure Blob service URL, the data is accessed directly
from the Microsoft Azure Blob service.
But if the request is made using the Azure CDN URL, it is redirected to the CDN endpoint closest
to the request’s source location, and delivery of the data becomes faster.
CDNetworks
Originally founded in Korea in 2000, CDNetworks currently has offices in Korea, the US, China, the UK
and Japan.
CDNetworks has developed a massive network infrastructure with strong POP (point-of-presence)
coverage on all continents.
Currently it has more than 140 POPs on 6 continents, including around 20 in China.
Security Issues
Cloud-based security systems need to address all the basic needs of an information system like
confidentiality, integrity, availability of information, identity management, authentication and
authorization.
Cloud Security
Cloud computing demands shared responsibility for security issues. Security should not be left
solely under the purview of the cloud provider; consumers also have major roles to play.
Service-level agreements (SLAs) are used in different industries to establish a trust relationship
between service providers and consumers.
The SLA details the service-level capabilities the providers promise to deliver and the
requirements/expectations stated by consumers.
Organizations should engage legal experts to review the SLA document during contract negotiation
and before making the final agreement.
The SLAs between cloud service providers (CSPs) and consumers should mention in detail
the security capabilities of the solutions and the security standards to be maintained by the service
providers. Consumers, on the other hand, should give the service providers clear-cut information
about what they consider a breach in security.
The SLA document plays an important role in security management for consumers moving towards
cloud solutions.
A threat is an event that can cause harm to a system. It can damage the system’s reliability and
compromise the confidentiality, availability or integrity of information stored in the system.
A vulnerability refers to a weakness or flaw in a system (hardware, software or process) that a threat
may exploit to damage the system.
Risk is the potential for a threat to exploit vulnerabilities and thereby cause harm to the system. Risk
arises where threat and vulnerability overlap.
■ Eavesdropping: This attack captures the data packets during network transmission and looks for
sensitive information to create foundation for an attack.
■ Fraud: It is materialized through fallacious transactions and misleading alteration of data to make
some illegitimate gains.
■ Theft: In computing system, theft basically means the stealing of trade secrets or data for gain. It
also means unlawful disclosure of information to cause harm.
■ Sabotage: This can be performed through various means, like disrupting data integrity (referred to as
data sabotage), delaying production, denial-of-service (DoS) attacks and so on.
■ External attack: Insertion of a malicious code or virus to an application or system falls under this
category of threat.
- Threats to Infrastructure
- Threats to Information
- Threats to Access Control
Public cloud deployment is the most critical case study to understand security concerns of cloud
computing. It covers all possible security threats to the cloud.
Infrastructure Security:
Infrastructure security describes the issues related with controlling access to physical resources which support
the cloud infrastructure.
Infrastructure security can be classified into three categories: network level, host level and service level.
The network-level security risks exist for all the cloud computing services (e.g., SaaS, PaaS or IaaS).
It is actually not the service being used, but rather the cloud deployment type (public, private or hybrid),
that determines the level of risk.
In case of public cloud services, use of appropriate network topology is required to satisfy security
requirements.
Ensuring data confidentiality, integrity and availability is the responsibility of network-level
infrastructure security arrangements.
Most network-level security challenges are not new to the cloud; rather, they have existed since the early
days of the Internet. Advanced techniques are continually evolving to tackle these issues.