Subject: DC Semester:VIII
File Models:
1. Unstructured and structured files
In the unstructured model, a file is an unstructured sequence of bytes. The interpretation
of the meaning and structure of the data stored in the files is up to the application (e.g.,
UNIX and MS-DOS). Most modern operating systems use the unstructured file model.
In the structured model (rarely used now), a file appears to the file server as an ordered
sequence of records. Records of different files of the same file system can be of different
sizes.
2. Mutable and immutable files
Based on the modifiability criteria, files are of two types, mutable and immutable. Most
existing operating systems use the mutable file model. An update performed on a file
overwrites its old contents to produce the new contents.
In the immutable model, rather than updating the same file, a new version of the file is
created each time a change is made to the file contents and the old version is retained
unchanged. The problems in this model are increased use of disk space and increased
disk activity.
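The immutable model described above can be sketched as a toy versioned store. This is an illustrative example, not code from any real file system; the class and method names are invented:

```python
# Toy sketch of the immutable file model: every update creates a new
# version and all earlier versions are retained unchanged.
# ImmutableStore and its methods are hypothetical names.
class ImmutableStore:
    def __init__(self):
        self._versions = {}   # filename -> list of version contents

    def write(self, name, data):
        # Instead of overwriting, append a new version of the file.
        self._versions.setdefault(name, []).append(data)
        return len(self._versions[name]) - 1   # new version number

    def read(self, name, version=-1):
        # Default: read the latest version; older versions stay readable.
        return self._versions[name][version]

store = ImmutableStore()
store.write("notes.txt", b"v1")
store.write("notes.txt", b"v2")   # the old version is still retained
```

The retained versions are exactly the source of the model's two drawbacks noted above: they consume extra disk space and generate extra disk activity.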
File Accessing Models
This depends on the method used for accessing remote files and the unit of data access.
1. Accessing remote files
A distributed file system may use one of the following models to service a client’s file
access request when the accessed file is remote:
a. Remote service model
Processing of a client's request is performed at the server's node. The client's request
for file access is delivered across the network as a message to the server, the server
machine performs the access request, and the result is sent back to the client. The design
needs to minimize the number of messages sent and the overhead per message.
b. Data-caching model
This model attempts to reduce the network traffic of the previous model by caching the
data obtained from the server node. It takes advantage of the locality found in file
access patterns. A replacement policy such as LRU (least recently used) is used to keep
the cache size bounded.
While this model reduces network traffic, it has to deal with the cache coherency
problem during writes: the local cached copy of the data needs to be updated, the
original file at the server node needs to be updated, and copies in any other caches need
to be updated.
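The data-caching model with an LRU replacement policy can be sketched as below. This is an illustrative example: `ClientCache` is an invented name, and `fetch_remote` is a hypothetical stand-in for the actual network request to the file server.

```python
from collections import OrderedDict

# Sketch of a client-side cache for the data-caching model, with LRU
# replacement to keep the cache size bounded. fetch_remote stands in
# for the remote-service request to the file server.
class ClientCache:
    def __init__(self, capacity, fetch_remote):
        self.capacity = capacity
        self.fetch_remote = fetch_remote
        self._cache = OrderedDict()          # key -> data, in LRU order

    def read(self, key):
        if key in self._cache:               # cache hit: no network access
            self._cache.move_to_end(key)
            return self._cache[key]
        data = self.fetch_remote(key)        # cache miss: contact the server
        self._cache[key] = data
        if len(self._cache) > self.capacity: # evict the least recently used
            self._cache.popitem(last=False)
        return data

calls = []
cache = ClientCache(2, lambda k: calls.append(k) or f"data:{k}")
cache.read("a")
cache.read("a")   # second read is served locally, no server contact
```

The locality assumption is visible in the example: the second `read("a")` causes no call to `fetch_remote`, which is where the model's network-traffic savings come from.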
Comparison of the data-caching model with the remote service model:
Prof.S.S. Aloni [Link] Computer Engineering
File Caching Schemes
Every distributed file system uses some form of caching. The reasons are:
1. Better performance, since repeated accesses to the same information can be handled locally,
avoiding additional network accesses and disk transfers. This is due to locality in file access patterns.
2. It contributes to the scalability and reliability of the distributed file system, since data can be
cached at the client node and served without contacting the server.
Key decisions to be made in file-caching scheme for distributed systems:
1. Cache location
2. Modification Propagation
3. Cache Validation
Cache Location
This refers to the place where the cached data is stored. Assuming that the original location of a
file is on its server's disk, there are three possible cache locations in a distributed file system:
1. Server's main memory
In this case a cache hit costs one network access.
It does not contribute to the scalability and reliability of the distributed file system,
since every cache hit requires accessing the server.
Advantages:
a. Easy to implement
b. Totally transparent to clients
c. Easy to keep the original file and the cached data consistent.
2. Client's disk
In this case a cache hit costs one disk access. This is somewhat slower than having the
cache in the server's main memory, which is also simpler to implement.
Advantages:
a. Provides reliability against crashes, since modifications to cached data survive a
client crash; they would be lost if the cache were kept in main memory.
b. Large storage capacity.
c. Contributes to scalability and reliability because on a cache hit the access request
can be serviced locally without the need to contact the server.
3. Client's main memory
Eliminates both the network access cost and the disk access cost. This technique is not preferred
to a client's disk cache when a large cache size and increased reliability of cached data are
desired.
Advantages:
a. Maximum performance gain.
b. Permits workstations to be diskless.
c. Contributes to reliability and scalability.
Modification Propagation
When the cache is located on client nodes, a file's data may simultaneously be cached on
multiple nodes. The caches can become inconsistent when the file data is changed by
one of the clients and the corresponding data cached at other nodes is not changed or discarded.
There are two design issues involved:
1. When to propagate modifications made to cached data to the corresponding file server.
2. How to verify the validity of cached data.
The modification propagation scheme used has a critical effect on the system's performance and
reliability. Techniques used include:
a. Write-through scheme.
When a cache entry is modified, the new value is immediately sent to the server for updating the
master copy of the file.
Advantage:
High degree of reliability and suitability for UNIX-like semantics.
This is due to the fact that the risk of updated data getting lost in the event of a client crash is
very low since every modification is immediately propagated to the server having the master
copy.
Disadvantage:
This scheme is suitable only where the ratio of read-to-write accesses is fairly large; it does
not reduce network traffic for writes.
This is because every write access has to wait until the data is written to the master
copy on the server. Hence the advantages of data caching apply only to read accesses, because the
server is involved in all write accesses.
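A minimal sketch of the write-through scheme, under the assumption that a plain `server` dictionary stands in for the remote master copy (all names here are illustrative):

```python
# Write-through: each write updates the local cache entry and is
# immediately propagated to the master copy at the server. The
# 'server' dict stands in for the remote master copy.
server = {}   # master copies held at the server
cache = {}    # client-side cache

def write_through(key, value):
    cache[key] = value    # update the cached entry
    server[key] = value   # immediately update the master copy

write_through("f1", b"new")
```

Because the server copy is updated on every call, a client crash loses almost nothing, which is the reliability property claimed above; the price is that every write still crosses the network.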
b. Delayed-write scheme.
To reduce network traffic for writes the delayed-write scheme is used. In this case, the new data
value is only written to the cache and all updated cache entries are sent to the server at a later
time.
There are three commonly used delayed-write approaches:
i. Write on ejection from cache
Modified data in the cache is sent to the server only when the cache-replacement policy has
decided to eject it from the client's cache. This can result in good performance, but there can
be a reliability problem since some server data may be outdated for a long time.
ii. Periodic write
The cache is scanned periodically and any cached data that has been modified since the
last scan is sent to the server.
iii. Write on close
Modification to cached data is sent to the server when the client closes the file. This does
not help much in reducing network traffic for those files that are open for very short
periods or are rarely modified.
Advantages of delayed-write scheme:
1. Write accesses complete more quickly because the new value is written only to the client
cache. This results in a performance gain.
2. Modified data may be deleted before it is time to send it to the server (e.g.,
temporary data). Since such modifications need not be propagated to the server, this results in a
major performance gain.
3. Gathering of all file updates and sending them together to the server is more efficient
than sending each update separately.
Disadvantage of delayed-write scheme:
Reliability can be a problem since modifications not yet sent to the server from a client’s cache
will be lost if the client crashes.
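The delayed-write scheme, in its periodic-write form, can be sketched as follows. The `server`, `cache`, and `dirty` names are illustrative, with a plain dictionary standing in for the remote master copies:

```python
# Delayed write (periodic-write variant): writes go only to the local
# cache; dirty entries are flushed to the server in a batch later.
server = {}     # master copies at the server
cache = {}      # client-side cache
dirty = set()   # keys modified since the last flush

def delayed_write(key, value):
    cache[key] = value   # fast: touches only the local cache
    dirty.add(key)

def flush():             # run periodically (or on close / on ejection)
    for key in dirty:
        server[key] = cache[key]   # gather and send updates together
    dirty.clear()

delayed_write("f1", b"v1")
delayed_write("f1", b"v2")   # overwritten before the flush: only v2 is ever sent
flush()
```

The example shows both sides of the trade-off discussed above: the intermediate value `b"v1"` never reaches the server (a performance gain), but everything in `dirty` would be lost if the client crashed before `flush()` ran.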
Cache Validation schemes
The modification propagation policy only specifies when the master copy of a file on the server
node is updated upon modification of a cache entry. It does not tell anything about when the file
data residing in the cache of other nodes is updated.
File data may simultaneously reside in the caches of multiple nodes. A client's cache entry
becomes stale as soon as some other client modifies the data corresponding to that cache entry in
the master copy of the file on the server.
It becomes necessary to verify if the data cached at a client node is consistent with the master
copy. If not, the cached data must be invalidated and the updated version of the data must be
fetched again from the server.
There are two approaches to verify the validity of cached data: the client-initiated approach and
the server-initiated approach.
Client-initiated approach
The client contacts the server and checks whether its locally cached data is consistent with the
master copy. Two approaches may be used:
1. Checking before every access.
This defeats the purpose of caching because the server needs to be contacted on every
access.
2. Periodic checking.
A check is initiated every fixed interval of time.
Disadvantage of the client-initiated approach: if the frequency of the validity check is high, the
cache validation approach generates a large amount of network traffic and consumes precious server
CPU cycles.
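One common way to implement the client-initiated check is to compare version numbers; the sketch below assumes that scheme, and all names in it (`server_versions`, `validate`, etc.) are invented for illustration:

```python
# Client-initiated cache validation via version numbers: at each check
# the client compares its cached version against the server's version
# and refetches the data when the cached copy is stale.
server_versions = {"f1": 3}          # server's current version per file
server_data = {"f1": b"latest"}      # server's master-copy data
cache = {"f1": (2, b"stale")}        # client cache: (cached version, data)

def validate(key):
    cached_version, data = cache[key]
    if cached_version != server_versions[key]:   # stale: invalidate
        data = server_data[key]                  # refetch updated data
        cache[key] = (server_versions[key], data)
    return data

validate("f1")   # detects version 2 < 3, refetches from the server
```

Note that even when the cached copy turns out to be valid, the version comparison itself requires contacting the server, which is exactly why frequent checks generate the network traffic described above.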
Server-initiated approach
A client informs the file server when opening a file, indicating whether a file is being opened for
reading, writing, or both. The file server keeps a record of which client has which file open and
in what mode.
The server monitors the file usage modes of the different clients and reacts whenever it
detects a potential for inconsistency. For example, if a file is open for reading, other clients may be
allowed to open it for reading, but opening it for writing cannot be allowed. Likewise, a new client
cannot open a file in any mode if the file is already open for writing.
When a client closes a file, it notifies the server, sending along any modifications made to
the file. The server then updates its record of which client has which file open in which mode.
When a new client makes a request to open an already open file and if the server finds that the
new open mode conflicts with the already open mode, the server can deny the request, queue the
request, or disable caching by asking all clients having the file open to remove that file from their
caches.
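The server-side bookkeeping described above can be sketched as a simple open-mode table. This is an illustrative example only; `open_table` and `request_open` are invented names, and of the three possible reactions to a conflict (deny, queue, disable caching) the sketch implements only denial:

```python
# Server-initiated approach: the (stateful) server records which client
# has which file open in what mode, and denies an open request that
# would create a read/write inconsistency.
open_table = {}   # filename -> list of (client, mode) entries

def request_open(client, name, mode):   # mode: "r" or "w"
    entries = open_table.setdefault(name, [])
    # A writer excludes everyone else; any open file excludes a new writer.
    if any(m == "w" for _, m in entries) or (mode == "w" and entries):
        return False   # deny (a real server could also queue the request)
    entries.append((client, mode))
    return True

request_open("c1", "f", "r")        # granted
request_open("c2", "f", "r")        # concurrent readers are allowed
ok = request_open("c3", "f", "w")   # denied: file already open for reading
```

The `open_table` is precisely the per-client state that makes this approach require stateful file servers, the disadvantage noted below.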
Note: On the web, the cache is used in read-only mode so cache validation is not an issue.
Disadvantage: It requires that file servers be stateful. Stateful file servers have a distinct
disadvantage over stateless file servers in the event of a failure.
File Replication
High availability is a desirable feature of a good distributed file system and file replication is the
primary mechanism for improving file availability.
A replicated file is a file that has multiple copies, with each copy stored on a separate file server.
Difference Between Replication and Caching
1. A replica of a file is associated with a server, whereas a cached copy is normally
associated with a client.
2. The existence of a cached copy is primarily dependent on the locality in file access
patterns, whereas the existence of a replica normally depends on availability and
performance requirements.
3. As compared to a cached copy, a replica is more persistent, widely known, secure,
available, complete, and accurate.
4. A cached copy is contingent upon a replica. Only by periodic revalidation with respect
to a replica can a cached copy be useful.
Advantages of Replication
1. Increased Availability:
Alternate copies of replicated data can be used when the primary copy is unavailable.
2. Increased Reliability:
Due to the presence of redundant data files in the system, recovery from catastrophic
failures (e.g., hard drive crash) becomes possible.
3. Improved response time:
It enables data to be accessed either locally or from a node whose access time is lower
than that of the primary copy.
4. Reduced network traffic:
If a file's replica is available on a file server that resides on the client's node, the client's
access request can be serviced locally, resulting in reduced network traffic.
5. Improved system throughput:
Several clients' requests to access a file can be serviced in parallel by different servers,
resulting in improved system throughput.
6. Better scalability:
Multiple file servers are available to service client requests due to file replication.
This improves scalability.
Replication Transparency
Replication of files should be transparent to the users so that multiple copies of a replicated file
appear as a single logical file to its users. This calls for the assignment of a single identifier/name
to all replicas of a file.
In addition, replication control should be transparent, i.e., the number and locations of replicas of
a replicated file should be hidden from the user. Thus, replication control must be handled
automatically in a user-transparent manner.
Multicopy Update Problem
Maintaining consistency among copies when a replicated file is updated is a major design issue
of a distributed file system that supports file replication.
1. Read-only replication
In this case the update problem does not arise. This method is too restrictive.
2. Read-Any-Write-All Protocol
A read operation on a replicated file is performed by reading any copy of the file and a
write operation by writing to all copies of the file. Before updating any copy, all copies
need to be locked, then they are updated, and finally the locks are released to complete
the write.
Disadvantage: A write operation cannot be performed if any of the servers having a copy
of the replicated file is down at the time of the write operation.
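The read-any-write-all protocol can be sketched over a small set of replicas. This is an illustrative example (the locking step is elided, and `replicas`, `write_all`, etc. are invented names):

```python
# Read-any-write-all: a read uses any one copy; a write must update
# every copy, so it fails if any replica server is down.
replicas = [{"up": True, "data": {}} for _ in range(3)]

def read(key):
    for r in replicas:
        if r["up"]:                # read any available copy
            return r["data"].get(key)

def write_all(key, value):
    if not all(r["up"] for r in replicas):   # any server down -> write fails
        return False
    for r in replicas:             # (locking elided) update all the copies
        r["data"][key] = value
    return True

write_all("f", b"v1")
replicas[1]["up"] = False
ok = write_all("f", b"v2")         # fails while one replica server is down
```

The failed second write demonstrates the stated disadvantage: availability of reads improves with more replicas, but availability of writes gets worse, since every copy must be reachable.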
3. Available-Copies Protocol
A read operation on a replicated file is performed by reading any copy of the file and a
write operation by writing to all available copies of the file. Thus if a file server with a
replica is down, its copy is not updated. When the server recovers after a failure, it brings
itself up to date by copying from other servers before accepting any user request.
4. Primary-Copy Protocol
For each replicated file, one copy is designated as the primary copy and all the others are
secondary copies. Read operations can be performed using any copy, primary or
secondary. But write operations are performed only on the primary copy. Each server
having a secondary copy updates its copy either by receiving notification of changes from
the server having the primary copy or by requesting the updated copy from it.
E.g., for UNIX-like semantics, when the primary-copy server receives an update request,
it immediately orders all the secondary-copy servers to update their copies. Some form of
locking is used and the write operation completes only when all the copies have been
updated. In this case, the primary-copy protocol is simply another method of
implementing the read-any-write-all protocol.
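The primary-copy protocol under UNIX-like semantics can be sketched as below, with plain dictionaries standing in for the primary and secondary servers (all names are illustrative, and locking is again elided):

```python
# Primary-copy protocol: writes go only to the primary copy, which then
# orders every secondary to update; reads may use any copy.
primary = {}
secondaries = [{}, {}]

def write(key, value):
    primary[key] = value       # the write is performed only on the primary
    for s in secondaries:      # primary immediately orders the secondaries
        s[key] = value         # to update (UNIX-like semantics)

def read(key, copy=None):
    # Any copy, primary or secondary, may serve a read.
    return (copy if copy is not None else primary)[key]

write("f", b"v1")
```

With immediate propagation as shown, the effect is the same as read-any-write-all; a lazier variant would let secondaries pull the updated copy later, trading consistency for write latency.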