Understanding Distributed Systems Basics
Introduction
Motivation
Before the mid-80s, computers were
Very expensive (hundreds of thousands or even millions of dollars)
Very slow (a few thousand instructions per second)
Not connected among themselves
After the mid-80s: two major developments
Cheap and powerful microprocessor-based computers appeared
(including smartphones)
Computer networks
LANs at speeds ranging from 10 to 1000 Mbps (now even 10,
40, and 100 Gbps)
WANs at speeds ranging from 64 Kbps to gigabits/sec
Consequence
Feasibility of using a large network of computers to work for the
same application; this is in contrast to the old centralized
systems where there was a single computer with its peripherals
Objectives of the Chapter
We will discuss
Alternative definitions of distributed systems
Why we need distribution and problems in distribution
Goals of a distributed system: transparency, openness, and
scalability
Three types of distributed systems
Distributed computing systems
Distributed information systems
Distributed pervasive systems
1.1 Introduction and Definition
A distributed system is a collection of autonomous computing
elements that appears to its users as a single coherent system
(Tanenbaum & Van Steen)
This definition has two aspects:
Hardware: Autonomous computing elements, also referred to as
nodes, be they hardware devices or software processes
We have to manage which nodes are part of the system
through group management mechanisms
A group can be open (any device can send messages to any
other) or closed (only members of the group can
communicate with each other)
A distributed system is usually organized as an overlay
network (a network which is built on top of another network,
e.g., P2P networks)
Software: Single coherent system; users or applications
perceive a single system ⇒ nodes need to collaborate
Introduction and Definition (2)
Other Definitions
A distributed system is a system designed to support the
development of applications and services which can exploit a
physical architecture consisting of multiple, autonomous
processing elements that do not share primary memory but
cooperate by sending asynchronous messages over a
communication network (Blair & Stefani)
A distributed system is one that stops you getting any work done
when a machine you have never even heard of crashes (attributed to
Leslie Lamport)
A distributed system is one in which the failure of a computer you
didn’t even know existed can render your own computer unusable
(Lamport)
Introduction and Definition (3)
In summary
The computers that make up a distributed system can be
large or small, but they are geographically dispersed, which is
why it is called a distributed system
The size may vary from a handful of devices to millions of
computers
The interconnection network may be wired, wireless, or a
combination of both
Distributed systems are often dynamic in the sense that
computers can join or leave with the topology and performance
continuously changing
1.1.1 Why Distributed?
Resource and Data Sharing
Printers, databases, multimedia servers, ...
Availability, Reliability
The loss of some instances can be hidden
Scalability, Extensibility
The system grows with demand (e.g., extra servers)
Performance
Huge power (CPU, memory, ...) available
Inherent distribution, communication
Organizational distribution, e-mail, video
1.1.2 Problems of Distribution
Concurrency, Security
Clients must not disturb each other
Privacy
e.g., when building a preference profile such as using cookies
Unwanted communication such as spam
Partial failure
We often do not know where the error is (e.g., RPC)
Location, Migration, Relocation, Replication
Clients must be able to find their servers
Heterogeneity
Hardware, platforms, languages, management
1.1.3 Characteristics of Distributed Systems
Differences between the computers and the ways they
communicate are hidden from users
Users and applications can interact with a distributed system in a
consistent and uniform way regardless of location
Distributed systems should be easy to expand and scale
A distributed system is normally continuously available, even if
there may be partial failures
1.1.4 Middleware and Distributed Systems
To support heterogeneous computers and networks and to provide a
single-system view, a distributed system is often organized by means
of a layer of software called middleware that extends over multiple
machines
Ack: most diagrams in all slides are taken from the textbook
Middleware and Distributed Systems (2)
In a sense, middleware is to a distributed system what an
operating system is to a computer: a manager of
resources
In addition, it offers services such as
Facilities for interapplication communication
Security services
Accounting services
Masking of and recovery from failures
1.2 Goals of a Distributed System
A distributed system should easily connect users with resources
(printers, computers, storage facilities, data, files, Web pages, ...)
Some of the reasons
Economics: sharing resources such as printers and high-
speed computers
To collaborate and exchange information, e.g., BitTorrent
e-commerce: buying and selling goods
etc.
A distributed system should be transparent (invisible): hide the fact
that the resources and processes are distributed across multiple
computers
A distributed system should be open
A distributed system should be scalable
Goals of a Distributed System (2)
Transparency in a Distributed System
A distributed system that is able to present itself to users and
applications as if it were only a single computer system is said
to be transparent
Different forms of transparency in a distributed system
Transparency  Description
Access        Hide differences in machine architecture, data representation (endianness, file naming, ...) and how an object is accessed
Location      Hide where an object is physically located; where is [Link] (naming)
Migration     Hide that an object may move to another location
Relocation    Hide that an object may be moved to another location while in use; e.g., mobile users using their wireless laptops and moving from place to place
Note: an object refers to a resource or a process
Goals of a Distributed System (3)
Transparency  Description
Replication   Hide that an object is replicated (for availability and performance); all replicas have the same name
Concurrency   Hide that an object may be shared by several competing users; an object must be left in a consistent state (through locking)
Failure       Hide the failure and recovery of an object
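Location and migration transparency can be illustrated with a small sketch. It assumes a hypothetical in-memory name service (the class `NameService`, the name `printer/floor2`, and both addresses are invented for illustration): the client refers to an object only by its logical name, so the object can move without the client code changing.

```python
class NameService:
    """Hypothetical registry mapping logical names to current addresses."""

    def __init__(self):
        self._table = {}

    def register(self, name, address):
        # Re-registering an existing name models migration: the object
        # moves, but its logical name stays the same.
        self._table[name] = address

    def resolve(self, name):
        return self._table[name]


def fetch(names, name):
    """Client code: uses only the logical name, never a machine address."""
    address = names.resolve(name)
    return f"contacting {address} for {name}"


names = NameService()
names.register("printer/floor2", "10.0.0.7:9100")
before = fetch(names, "printer/floor2")

names.register("printer/floor2", "10.0.3.12:9100")  # the object migrates
after = fetch(names, "printer/floor2")              # same call, new location
```

The client call is identical before and after the move; only the name service's table changed.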
Goals of a Distributed System (4)
Openness in a Distributed System
A distributed system should be open
We need well-defined interfaces
Interoperability
Components of different origin can communicate
Portability
Components work on different platforms
Another goal of an open distributed system is that it should be
flexible and extensible; easy to configure the system out of
different components; easy to add new components, to replace
existing ones; easier said than done
An Open Distributed System is a system that offers services
according to standard rules that describe the syntax and
semantics of those services; e.g., protocols in networks
Standards - a necessity
Goals of a Distributed System (5)
In distributed systems, such services are often specified through
interfaces often described using an Interface Definition
Language (IDL)
Specify only syntax: the names of the functions, types of
parameters, return values, possible exceptions, ...
Semantics is given in an informal way by means of natural
languages
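The split between syntax and semantics can be sketched in Python, which is used here only as a stand-in for an IDL. The names `Account`, `InMemoryAccount`, `InsufficientFunds`, and `pay` are hypothetical. The `Protocol` class fixes names, parameter types, and return types; what `withdraw` does on an overdraft is not expressible there and is stated informally in a comment.

```python
from typing import Protocol


class InsufficientFunds(Exception):
    """A possible exception named alongside the interface."""


class Account(Protocol):
    """Interface only: function names, parameter types, return types."""

    def balance(self) -> int: ...

    def withdraw(self, amount: int) -> int: ...


class InMemoryAccount:
    """One possible implementation; clients depend only on Account."""

    def __init__(self, opening: int) -> None:
        self._amount = opening

    def balance(self) -> int:
        return self._amount

    def withdraw(self, amount: int) -> int:
        # Semantics (informal): an overdraft is refused and the balance
        # is left unchanged. The interface's syntax cannot say this.
        if amount > self._amount:
            raise InsufficientFunds(amount)
        self._amount -= amount
        return self._amount


def pay(account: Account, amount: int) -> int:
    """Client written against the interface, not the implementation."""
    return account.withdraw(amount)


remaining = pay(InMemoryAccount(500), 120)
```

Any other class with the same method signatures could be passed to `pay`, which is the interoperability that well-defined interfaces are meant to enable.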
Goals of a Distributed System (6)
Scalability in Distributed Systems
A distributed system should be scalable; there are three
dimensions
Size: adding more users and resources to the system without
any noticeable loss of performance
Geographically: users and resources may be far apart but the
fact that communication delays may be significant is hardly
noticed
Administratively: should be easy to manage even if it spans
many independent administrative organizations
But a scalable system may exhibit performance and other
problems
Goals of a Distributed System (7)
Size scalability problems leading to low performance
Concept                 Example
Centralized services    Single server (or a cluster of servers) for all users - mostly for security reasons
Centralized data        A single on-line telephone book
Centralized algorithms  Doing routing based on complete information
In these cases there are essentially three root causes for a server
becoming a bottleneck:
The computational capacity, limited by the CPUs
The storage capacity, including the I/O transfer rate
The network between the user and the centralized service
Goals of a Distributed System (8)
A major problem in administrative scalability is that of conflicting
policies with respect to resource usage (and payment),
management, and security
Goals of a Distributed System (9)
Hide Communication Latencies (for geographical scalability)
Try to avoid waiting for responses to remote service requests
Let the requester do other useful work
i.e., construct requesting applications that use only
asynchronous communication instead of synchronous
communication; when a reply arrives the application is
interrupted
Good for batch processing and parallel applications since
independent tasks can be scheduled while another task is
waiting for communication to complete or use multithreading for
non-parallel programs
Hiding communication latencies is not in general applicable for
interactive applications
For interactive applications, try to reduce communication; move
part of the job to the client to reduce communication; e.g., filling
a form to access a database and checking the entries
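Asynchronous communication as a way to hide latency can be sketched with Python's `asyncio`. This is a minimal illustration, not a real network client: `remote_call` is a hypothetical stand-in whose `asyncio.sleep` models the round-trip delay, and the service names are invented.

```python
import asyncio


async def remote_call(name, delay):
    """Stand-in for a remote request; the sleep models network latency."""
    await asyncio.sleep(delay)
    return f"{name}: done"


async def main():
    # Issue both requests without waiting for the first reply.
    pending = [
        asyncio.create_task(remote_call("service-a", 0.2)),
        asyncio.create_task(remote_call("service-b", 0.2)),
    ]
    await asyncio.sleep(0)  # yield once so both requests actually start

    # The requester does other useful work while the replies are outstanding.
    local_result = sum(range(1000))

    # Collect the replies once they are needed.
    replies = await asyncio.gather(*pending)
    return replies, local_result


replies, local_result = asyncio.run(main())
```

Because the two simulated requests overlap, the total waiting time is roughly that of one request rather than the sum of both.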
Goals of a Distributed System (10)
e.g., checking the completeness of mandatory fields
[Figure: a form with the fields First Name, Last Name, and E-Mail exchanged between client and server; (a) a server checking the correctness of field entries, (b) a client doing the job]
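Case (b), where the client does the checking, might look like the sketch below. The field names, the `submit` helper, and the `fake_send` stand-in for the network call are all hypothetical; the point is only that an incomplete form never causes a round trip to the server.

```python
MANDATORY = ("first_name", "last_name", "email")


def missing_fields(form):
    """Return the mandatory fields that are absent or blank."""
    return [field for field in MANDATORY if not form.get(field, "").strip()]


def submit(form, send):
    """Check locally first; contact the server only for complete forms."""
    missing = missing_fields(form)
    if missing:
        return {"sent": False, "missing": missing}
    return {"sent": True, "reply": send(form)}


calls = []


def fake_send(form):
    """Stand-in for the network request; records each call it receives."""
    calls.append(form)
    return "stored"


incomplete = submit({"first_name": "MULUNEH", "last_name": ""}, fake_send)
complete = submit(
    {"first_name": "MULUNEH", "last_name": "KEBEDE", "email": "user@example.org"},
    fake_send,
)
```

Only the complete form reaches `fake_send`. A real server would normally still validate what it receives, since client-side checks reduce communication but cannot be trusted on their own.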
Goals of a Distributed System (12)
Replication
Replicate components across a distributed system to increase
availability and for load balancing, leading to better performance
Also, in geographically widely dispersed systems, having a copy
nearby can hide much of the communication latency problems
Replication is decided by the owner of a resource
Caching (a special form of replication) also reduces
communication latency; decided by the user
But, caching and replication may lead to consistency problems
(see Chapter 7 - Consistency and Replication)
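The consistency problem introduced by caching can be shown with a toy example. The classes `Origin` and `CachingClient` are hypothetical, and explicit invalidation is just one of several possible remedies, chosen here for brevity.

```python
class Origin:
    """The owner's copy of the data; counts how often it is contacted."""

    def __init__(self):
        self.data = {"price": 10}
        self.reads = 0

    def read(self, key):
        self.reads += 1
        return self.data[key]


class CachingClient:
    """Client that keeps a local copy to avoid repeated communication."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            self.cache[key] = self.origin.read(key)  # miss: go to the origin
        return self.cache[key]                       # hit: answered locally

    def invalidate(self, key):
        self.cache.pop(key, None)


origin = Origin()
client = CachingClient(origin)

first = client.read("price")    # miss, fetched from the origin
second = client.read("price")   # hit, no communication
origin.data["price"] = 12       # the owner updates the data
stale = client.read("price")    # the cached copy is now out of date
client.invalidate("price")
fresh = client.read("price")    # fetched again after invalidation
```

The third read returns the old value even though the origin has changed, which is exactly the inconsistency the slide warns about.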
Pitfalls when Developing Distributed Systems
Pitfalls arise from false assumptions made by first-time developers
of distributed systems; the assumptions concern properties that are
specific to distributed systems and do not arise in nondistributed
applications
The network is reliable (making it difficult to achieve failure
transparency)
The network is secure
The network is homogeneous
The topology does not change
Latency is zero
Bandwidth is infinite
Transport cost is zero
There is one administrator
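As an illustration of dropping the first assumption (a reliable network), the sketch below retries a call that may fail. Everything here is hypothetical: `TransientNetworkError`, `call_with_retries`, and the deterministic `make_flaky` helper that simulates lost messages. Retrying like this is only safe when the operation can be repeated without harm.

```python
class TransientNetworkError(Exception):
    """Hypothetical error standing in for a lost message or timeout."""


def call_with_retries(operation, attempts=3):
    """Try the operation up to `attempts` times (at least 1)."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except TransientNetworkError as error:
            last_error = error  # remember the failure and try again
    raise last_error


def make_flaky(failures):
    """Build an operation that fails `failures` times, then succeeds."""
    state = {"left": failures, "calls": 0}

    def operation():
        state["calls"] += 1
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientNetworkError("message lost")
        return "reply"

    return operation, state


operation, state = make_flaky(2)
result = call_with_retries(operation)
```

A caller that assumed a reliable network would have given up, or crashed, on the first lost message.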
1.3 Types of Distributed Systems
Three types: distributed computing systems, distributed information
systems, and distributed pervasive systems
Distributed Computing Systems
Used for high-performance computing tasks
Two groups: cluster computing and grid/cloud computing
Cluster Computing
A collection of similar workstations or PCs (homogeneous,
running the same operating system), closely connected by
means of a high-speed LAN
Used for parallel programming in which a single compute
intensive program is run in parallel on multiple machines
Each compute node runs the same operating system extended
with typical middleware functions for communication, storage,
fault tolerance, etc.
Types of Distributed Systems (4)
Cloud Computing
A recent and more general approach to grid computing
It is an alternative to maintaining huge local infrastructures, i.e.,
one of the important reasons to migrate to a cloud environment
is that it may be much cheaper compared to maintaining a local
computing infrastructure
Cloud computing is characterized by an easily usable and
accessible pool of virtualized resources
Which and how resources are used can be configured
dynamically, providing the basis for scalability: if more work
needs to be done, a customer can simply acquire more
resources
It is generally based on a pay-per-use model in which
guarantees are offered by means of customized service level
agreements (SLAs)
Note: Details on how specific cloud computations are actually
carried out are generally hidden
Types of Distributed Systems (7)
d. Application: Actual applications run in this layer and are offered to
users for further customization
Well-known examples include those found in office suites (text
processors, spreadsheet applications, presentation applications,
and so on)
They are executed in the vendor’s cloud
They can be compared to the traditional suite of applications
that are shipped when installing an operating system
Types of Distributed Systems (8)
Types of services
Cloud-computing providers offer these layers to their customers
through various interfaces (including command-line tools,
programming interfaces, and Web interfaces), leading to three
different types of services
Infrastructure as a Service (IaaS): covering the hardware and
infrastructure layer (basic infrastructure)
Platform as a Service (PaaS): covering the platform layer
(system-level services)
Software as a Service (SaaS): contains actual applications
Some obstacles in cloud computing
Provider/vendor lock-in
Security and privacy issues
Dependency on the availability of services
Types of Distributed Systems (9)
Distributed Information Systems
An organization may have many networked applications
Problem: interoperability
The issue is how to integrate applications into an enterprise-
wide information system
A networked application simply consists of a server running that
application (often including a database) and making it available
to remote programs, called clients
Such clients send a request to the server for executing a specific
operation, after which a response is sent back
Integration at the lowest level: wrap a number of requests into a
single larger request and have it executed as a distributed
transaction; all or none of the requests would be executed
Integration should also take place by letting applications
communicate directly with each other, i.e., Enterprise Application
Integration (EAI)
Types of Distributed Systems (10)
a. Transaction Processing Systems
Consider database applications
Special primitives are required to program transactions, supplied
either by the underlying distributed system or by the language
runtime system
Exact list of primitives depends on the type of application;
procedure calls, ordinary statements, etc. can also be included
Primitive Description
BEGIN_TRANSACTION Mark the start of a transaction
END_TRANSACTION Terminate the transaction and try to commit
Types of Distributed Systems (11)
The Transaction Model
The model for transactions comes from the world of business
A supplier and a retailer negotiate on
Price
Delivery date
Quality
etc.
Until the deal is concluded they can continue negotiating or one
of them can terminate
But once they have reached an agreement they are bound by
law to carry out their part of the deal
Transactions between processes are similar to this scenario
Types of Distributed Systems (12)
e.g., assume the following banking operation
Withdraw an amount x from account 1
Deposit the amount x to account 2
What happens if there is a problem after the first activity is
carried out?
Group the two operations into one transaction; either both are
carried out or neither
We need a way to roll back when a transaction is not completed
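The withdraw/deposit example can be sketched with Python's standard `sqlite3` module on a single in-memory database. This is a local, non-distributed illustration of the all-or-nothing idea; the table layout and the `transfer` and `balances` helpers are assumptions made for the example.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM account WHERE id = ?", (src,)
            ).fetchone()
            if balance < 0:
                # A problem after the first activity: abort the transaction.
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {account id: balance} for inspection."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

ok = transfer(conn, 1, 2, 30)       # both updates are applied
failed = transfer(conn, 1, 2, 500)  # the withdrawal is rolled back
```

After the failed transfer the withdrawal from account 1 is undone, so the balances are the same as after the first, successful transfer.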
Types of Distributed Systems (13)
e.g. reserving a seat from Manchester to Lalibella through
Heathrow and AA Bole airports
(a) Transaction to reserve three flights commits
BEGIN_TRANSACTION
    reserve Man → Heathrow;
    reserve Heathrow → Bole;
    reserve Bole → Lalibella;
END_TRANSACTION
(b) Transaction aborts when the third flight is unavailable
BEGIN_TRANSACTION
    reserve Man → Heathrow;
    reserve Heathrow → Bole;
    reserve Bole → Lalibella full ⇒
ABORT_TRANSACTION
Types of Distributed Systems (14)
Properties of transactions, often referred to as ACID
Atomic: to the outside world, the transaction happens
indivisibly; a transaction either happens completely or not at all;
intermediate states are not seen by other processes
Consistent: the transaction does not violate system invariants;
e.g., in an internal transfer in a bank, the amount of money in
the bank must be the same as it was before the transfer (the
law of conservation of money); this may be violated for a brief
period of time, but the violation is not visible to other processes
Isolated or Serializable: concurrent transactions do not interfere
with each other; if two or more transactions are running at the
same time, the final result must look as though all transactions
run sequentially in some order
Durable: once a transaction commits, the changes are
permanent; see later in Chapter 8 - Fault Tolerance
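Isolation can be illustrated with locking, one common technique among several. In this sketch the hypothetical `Account` class serializes its read-modify-write with a `threading.Lock`, so concurrent deposits produce the same result as some sequential order would.

```python
import threading


class Account:
    """Toy account whose deposits are serialized by a lock."""

    def __init__(self):
        self.balance = 0
        self._lock = threading.Lock()

    def deposit(self, amount):
        # The read-modify-write below is the critical section; holding the
        # lock means no other deposit can interleave with it.
        with self._lock:
            current = self.balance
            self.balance = current + amount


def run_concurrent_deposits(account, threads=4, deposits_each=1000):
    """Run several depositing threads and return the final balance."""

    def work():
        for _ in range(deposits_each):
            account.deposit(1)

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return account.balance


final_balance = run_concurrent_deposits(Account())
```

With the lock in place the final balance equals the number of deposits made; without it, interleaved updates could overwrite each other.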
Types of Distributed Systems (15)
Classification of Transactions
A transaction could be flat or nested
Flat Transaction
Consists of a series of operations that satisfy the ACID
properties
Simple and widely used but with some limitations
Does not allow partial results to be committed or aborted
i.e., atomicity is also partly a weakness
In our airline reservation example, we may want to
accept the first two reservations and find an alternative
one for the last
Some transactions may take too much time
Types of Distributed Systems (16)
Nested Transaction
Constructed from a number of subtransactions; it is logically
decomposed into a hierarchy of subtransactions; the flight
reservation can be split into three transactions, each accessing
a different database
The top-level transaction forks off children that run in parallel, on
different machines; to gain performance or for programming
simplicity
Each may also execute one or more subtransactions
Permanence (durability) applies only to the top-level transaction;
commits by children must be undone if the top-level transaction aborts
[Figure: a nested transaction with two subtransactions, one on an airline database and one on a hotel database; two different (independent) databases]
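One way to picture why a child's commit is not yet permanent is the private-workspace sketch below. It is a toy, single-process model over an in-memory dict, and the `Transaction` class and its method names are hypothetical: a subtransaction's commit only hands its updates to the parent, and only the top-level commit writes to the store.

```python
class Transaction:
    """Toy nested transaction over an in-memory dict (not durable)."""

    def __init__(self, store, parent=None):
        self.store = store
        self.parent = parent
        self.writes = {}  # private workspace, invisible to the store

    def begin_sub(self):
        return Transaction(self.store, parent=self)

    def read(self, key):
        # Look in this workspace, then in the ancestors', then in the store.
        tx = self
        while tx is not None:
            if key in tx.writes:
                return tx.writes[key]
            tx = tx.parent
        return self.store.get(key)

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        if self.parent is not None:
            # A child's commit only hands its updates to the parent.
            self.parent.writes.update(self.writes)
        else:
            # Only the top-level commit makes the changes permanent.
            self.store.update(self.writes)
        self.writes = {}

    def abort(self):
        # Discards this workspace, including updates committed by children.
        self.writes = {}


store = {"flight_seats": 1, "hotel_rooms": 1}

top = Transaction(store)
flight = top.begin_sub()
flight.write("flight_seats", flight.read("flight_seats") - 1)
flight.commit()
seen_by_parent = top.read("flight_seats")
top.abort()                      # the child's commit is undone as well
after_abort = dict(store)

top2 = Transaction(store)
hotel = top2.begin_sub()
hotel.write("hotel_rooms", 0)
hotel.commit()
top2.commit()                    # now the change reaches the store
after_commit = dict(store)
```

In a real system the subtransactions would run on different machines against independent databases, which makes undoing a child's commit considerably harder than clearing a dictionary.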
Types of Distributed Systems (17)
b. Enterprise Application Integration
How to integrate applications independent from their databases
Application components should be able to communicate directly
with each other and not merely by means of the request/reply
behavior that was supported by transaction processing systems
How can applications communicate with each other? By means
of middleware
There are different communication models
RPC (Remote Procedure Call)
MOM (Message-Oriented Middleware)
Multicast Communication
See later in Chapter 4 - Communication
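The difference between the first two models can be sketched in a single process. In this illustration an ordinary function call stands in for RPC and a `queue.Queue` stands in for message-oriented middleware; the function names and messages are hypothetical and no real middleware is involved.

```python
import queue
import threading


def rpc_style(server_function, argument):
    """Request/reply: the caller blocks until the result comes back."""
    return server_function(argument)


def consumer(inbox, results):
    """Receiver drains the queue at its own pace; None is a stop marker."""
    while True:
        message = inbox.get()
        if message is None:
            break
        results.append(message.upper())


inbox = queue.Queue()
results = []
worker = threading.Thread(target=consumer, args=(inbox, results))

# Message-oriented style: the sender enqueues and moves on; the receiver
# is not even running yet when the messages are sent.
inbox.put("order placed")
inbox.put("order shipped")
inbox.put(None)

worker.start()
worker.join()

reply = rpc_style(str.upper, "ping")  # caller waits for the reply
```

The queue decouples sender and receiver in time, whereas the RPC-style call requires the callee to be available at the moment of the call.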
Types of Distributed Systems (19)
Distributed Pervasive Systems
The distributed systems discussed so far are characterized by
their stability; fixed nodes having high-quality connection to a
network
There are also mobile and embedded computing devices which
are small, battery-powered, mobile, and with a wireless
connection leading to what are generally referred to as
pervasive systems
[ Diversion
Different approaches to distribution - Lost in the forest of
distribution
Distributed System
N autonomous computers (sites): n administrators, n
data/control flows
An interconnection network
User view: one single (virtual) system
(traditional) programmer view: client-server
Parallel System
1 computer, n nodes: one administrator, one scheduler, one
power source
Memory: it depends (shared or separate)
Programmer view: one single machine executing parallel
codes; various programming models (message passing,
distributed shared memory, …)
Diversion (2)
Cluster Computing
Use of PCs interconnected by a (high performance) network as
a parallel (cheap) machine
Network Computing
From LAN (cluster) computing to WAN computing
Set of machines distributed over a MAN/WAN that are used to
execute parallel loosely coupled codes
Depending on the infrastructure, network computing comes in
many flavours: grid/cloud computing, P2P, Internet computing,
etc.
a. Grid/Cloud Computing
Grid Computing: “Resource sharing and coordinated problem
solving in dynamic, multi-institutional virtual organizations”
(Ian Foster)
Cloud Computing: A general term for anything that involves
delivering hosted services over the Internet
Diversion (3)
b. Peer-to-Peer Computing
A site is both client and server
Application: mostly file sharing, but also others like Internet
Telephony (Skype)
2 approaches:
Centralized management: Napster
Distributed management: Gnutella, Kazaa
c. Internet Computing
Use of (idle) computers interconnected by Internet for
processing large throughput applications
Programmer view: a single master, n servants
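The master/servants view can be sketched locally with a thread pool standing in for the remote machines. This is only an analogy under that assumption: `count_primes`, `master`, and the chunk sizes are hypothetical, and a real Internet-computing system would ship work units over the network to idle computers.

```python
from concurrent.futures import ThreadPoolExecutor


def is_prime(n):
    """Trial division; adequate for the small numbers used here."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def count_primes(bounds):
    """Work unit: a servant computes it independently of all others."""
    low, high = bounds
    return sum(1 for n in range(low, high) if is_prime(n))


def master(limit, chunk, servants=4):
    """Split [0, limit) into work units, hand them out, combine results."""
    units = [
        (start, min(start + chunk, limit)) for start in range(0, limit, chunk)
    ]
    with ThreadPoolExecutor(max_workers=servants) as pool:
        return sum(pool.map(count_primes, units))


total = master(100, 25)
```

The work units are independent, so the result does not depend on how many servants there are or in which order they finish.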
]