Distributed Systems
Chapter One
Introduction
1
Computer systems are undergoing a revolution. From 1945, when the modern
computer era began, until about 1985, computers were large and expensive. Even
minicomputers normally cost tens of thousands of dollars each.
As a result, most organizations had only a handful of computers, and for lack of a
way to connect them, these operated independently from one another.
2
The amount of improvement that has occurred in computer technology in the past half
century is truly staggering and totally unprecedented in other industries; the development
of powerful microprocessors was the first of two key advances. The second
development was the invention of high-speed computer networks.
The local area networks or LANs allow dozens, or even hundreds, of machines within a
building to be connected in such a way that small amounts of information can be
transferred between machines in a millisecond or so.
3
What is a Distributed System?
A distributed system is a collection of independent computers that appear to the users of
the system as a single computer. This definition has two aspects.
The first one deals with hardware: the machines are autonomous.
The second one deals with software: the users think of the system as a single computer.
Item: Description
Device sharing: allow many users to share expensive peripherals like color printers
Flexibility: spread the workload over the available machines in the most cost-effective way
5
Advantages of distributed systems over centralized systems.
Item: Description
Economics: microprocessors offer a better price/performance than mainframes
Speed: a distributed system may have more total computing power than a mainframe
Inherent distribution: some applications involve spatially separated machines
Reliability: if one machine crashes, the system as a whole can still survive
6
Disadvantages of Distributed Systems
Item Description
7
Characteristics of Distributed Systems
Differences between the computers and the ways they communicate are
hidden from users
Users and applications can interact with a distributed system in a
consistent and uniform way regardless of location
Distributed systems should be easy to expand and scale
A distributed system is normally continuously available, even if there
may be partial failures
o Users and applications should not notice that parts are being replaced
or fixed, or that new parts are added to serve more users or
applications
8
Goals of a Distributed System
To support heterogeneous computers and networks
To provide a single-system view, a distributed system is often organized by
means of a layer of software called middleware that extends over multiple
machines
9
A distributed system should
Make resources accessible (printers, computers, storage facilities, data,
files, Web pages, ...)
o reasons: economics, to collaborate and exchange information
Be transparent: hide the fact that the resources and processes are
distributed across multiple computers.
Be open
Be scalable
Transparency in a Distributed System
A distributed system that is able to present itself to users
and applications as if it were only a single computer
system is said to be transparent.
It appears as a single entity, so users can obtain services
without worrying about the location at which they are provided.
10
Different forms of transparency in a distributed system
Transparency: Description
Access: hide differences in data representation and how a resource is accessed
Location: hide where a resource is physically located; where is http://www.prenhall.com/index.html? (naming)
Migration: hide that a resource may move to another location
Relocation: hide that a resource may be moved to another location while in use; e.g., mobile users using their wireless laptops
Replication: hide that a resource is replicated
Concurrency: hide that a resource may be shared by several competitive users; a resource must be left in a consistent state
Failure: hide the failure and recovery of a resource
Persistence: hide whether a (software) resource is in memory or on disk
11
Hardware Concepts
Since distributed systems consist of multiple CPUs, there are several different ways the hardware
can be organized, especially in terms of how the CPUs are interconnected and how they
communicate.
Flynn (1972) picked two characteristics for classifying multiple-CPU systems: the number of
instruction streams and the number of data streams.
A computer with a single instruction stream and a single data stream is called SISD.
SIMD, single instruction stream, multiple data stream. This type refers to array
processors with one instruction unit that fetches an instruction, and then commands many
data units to carry it out in parallel, each with its own data. These machines are useful for
computations that repeat the same calculation on many sets of data, for example, adding
up all the elements of 64 independent vectors. Some supercomputers are SIMD.
MISD, multiple instruction stream, single data stream: no known computers fit this
model.
MIMD means a group of independent computers, each with its own program counter,
program, and data. All distributed systems are MIMD.
12
13
In some systems the machines are tightly coupled, while in others they are loosely coupled.
In a tightly-coupled system, the delay experienced when a message is sent from one
computer to another is short, and the data rate is high; that is, the number of bits per second
that can be transferred is large.
In a loosely-coupled system, the opposite is true: the inter-machine message delay is large
and the data rate is low.
For example, two CPU chips on the same printed circuit board and connected by wires
etched onto the board are likely to be tightly coupled, whereas two computers connected by a
2400 bit/sec modem over the telephone system are certain to be loosely coupled.
Tightly-coupled systems tend to be used more as parallel systems (working on a single
problem) and loosely-coupled ones tend to be used as distributed systems (working on
many unrelated problems), although this is not always true.
One famous counterexample is a project in which hundreds of computers all over the world
worked together trying to factor a huge number (about 100 digits). Each computer was
assigned a different range of divisors to try, and they all worked on the problem in their
spare time, reporting the results back by electronic mail when they finished.
14
Multiprocessors tend to be more tightly coupled than multicomputers, because they can
exchange data at memory speeds, but some fiber-optic-based multicomputers can also work at
memory speeds.
This leads to four categories:
Bus multiprocessors,
Switched multiprocessors,
Bus multicomputers, and
Switched multicomputers.
15
Software Concepts
Although the hardware is important, the software is even more important. The image that a
system presents to its users, and how they think about the system, is largely determined by
the operating system software, not the hardware.
Operating systems cannot be put into nice, neat pigeonholes like hardware.
By nature software is vague and amorphous.
It is more-or-less possible to distinguish two kinds of operating systems for multiple CPU
systems: loosely coupled and tightly coupled.
o Loosely-coupled software allows machines and users of a distributed system to be
fundamentally independent of one another, but still to interact to a limited degree where
that is necessary.
o Consider a group of personal computers, each of which has its own CPU, its own
memory, its own hard disk, and its own operating system, but which share some
resources, such as laser printers and databases, over a LAN.
16
To show how difficult it is to make definitions in this area, now consider the same system as
above, but without the network.
To print a file, the user writes the file on a floppy disk, carries it to the machine with the
printer, reads it in, and then prints it. Is this still a distributed system, only now even more
loosely coupled? It's hard to say.
From a fundamental point of view, there is not really any theoretical difference between
communicating over a LAN and communicating by carrying floppy disks around.
At most one can say that the delay and data rate are worse in the second example.
At the other extreme we might find a multiprocessor dedicated to running a single chess
program in parallel. Each CPU is assigned a board to evaluate, and it spends its time
examining that board and all the boards that can be generated from it. When the evaluation
is finished, the CPU reports back the results and is given a new board to work on. The
software for this system, both the application program and the operating system required to
support it, is clearly much more tightly coupled than in our previous example.
17
Network Operating Systems
Let us start with loosely-coupled software on loosely-coupled hardware, since this is
probably the most common combination at many organizations.
A typical example is a network of workstations connected by a LAN. In this model, each
user has a workstation for his exclusive use. It may or may not have a hard disk. It definitely
has its own operating system. All commands are normally run locally, right on the
workstation.
However, it is sometimes possible for a user to log into another workstation remotely by
using a command such as rlogin machine.
The effect of this command is to turn the user's own workstation into a remote terminal
logged into the remote machine. Commands typed on the keyboard are sent to the remote
machine, and output from the remote machine is displayed on the screen. To switch to a
different remote machine, it is necessary first to log out, then to use the rlogin command to
connect to another machine. At any instant, only one machine can be used, and the selection
of the machine is entirely manual.
18
Networks of workstations often also have a remote copy command to copy files from one
machine to another.
For example, a command like rcp machine1:file1 machine2:file2
might copy the file file1 from machine1 to machine2 and give it the name file2 there.
Again here, the movement of files is explicit and requires the user to be completely aware of
where all files are located and where all commands are being executed.
While better than nothing, this form of communication is extremely primitive and has led
system designers to search for more convenient forms of communication and
information sharing.
One approach is to provide a shared, global file system accessible from all the
workstations.
The file system is supported by one or more machines called file servers.
The file servers accept requests from user programs running on the other (non-server)
machines, called clients, to read and write files.
Each incoming request is examined and executed, and the reply is sent back, as shown
in the figure below.
19
20
File servers generally maintain hierarchical file systems, each with a root directory
containing subdirectories and files.
Workstations can import or mount these file systems, augmenting their local file systems
with those located on the servers.
In the figure, one server has a directory called games, while the other has a directory called work.
These directories each contain several files. Both of the clients shown have mounted
both of the servers, but they have mounted them in different places in their respective
file systems.
Client 1 has mounted them in its root directory, and can access them as /games and
/work, respectively.
Client 2, like client 1, has mounted games in its root directory, but regarding the
reading of mail and news as a kind of game, has created a directory /games/work and
mounted work there.
Consequently, it can access news using the path /games/work/news rather than
/work/news.
21
Types Of Distributed Systems
22
Cluster Computing
A form of distributed computing system: essentially a group of systems connected through a LAN.
Homogeneous: same OS, near-identical hardware
Single managing node
Tightly coupled systems
Centralized job management & scheduling system
23
Grid Computing
Lots of nodes (including clusters across multiple subnets) from everywhere.
Heterogeneous
Diversity and dynamism (it can handle nodes dropping in and out at any point of
time)
Dispersed across several organizations
Can easily span a wide-area network
To allow for collaborations, grids generally use virtual organizations (grouping of
users that will allow for authorization on resource allocation).
Loosely coupled (decentralization)
Distributed job management & scheduling
24
Cloud Computing
Web-based tools or applications that users can access and use through a
web browser as if they were programs installed locally on their own
computers.
Internet-based computing
offers dynamically scalable and virtualized resources that make up
services for users to use over the internet
The only thing the user's computer needs to be able to run is the cloud
computing system's interface software
25
Distributed Pervasive Systems
A next generation of distributed systems is emerging in which the nodes are small, wireless,
battery-powered, mobile (e.g. PDAs, smart phones, wireless surveillance cameras, portable
ECG monitors, etc.), and often embedded as part of a larger system.
Some requirements:
Contextual change: The system is part of an environment in which changes should be
immediately accounted for.
Ad hoc composition: Each node may be used in very different ways by different users.
- Requires ease of configuration.
Sharing is the default: Nodes come and go, providing sharable services and
information.
- Calls again for simplicity.
26
If you have questions, comments,
or suggestions, you are welcome!
27
Introduction to
Distributed Systems
Chapter Two
Architecture
28
Introduction
Distributed systems
are often complex pieces of software whose components are dispersed across multiple
machines.
To master their complexity, it is crucial that these systems are properly organized.
Their main organization is mostly about the software components that constitute the system.
These software architectures tell us how the various software components are to be
organized and how they should interact.
There are many different choices for the actual realization of a distributed system; this
requires that we instantiate and place software components on real machines.
The final instantiation of a software architecture is also referred to as a system
architecture
The main goal of a distributed system is to separate applications from underlying
platforms by providing a middleware layer that provides distribution transparency.
Distributed systems are frequently organized in the form of feedback control
loops, which form an important architectural element during a system's design.
29
Architectural Styles
An architectural style is formulated
in terms of components
the way that components are connected to each other
the data exchanged between components
finally how these elements are jointly configured into a system
• A component is a modular unit with well-defined required and provided interfaces
that is replaceable within its environment.
• Connector can be formed by the facilities for (remote) procedure calls, message
passing, or streaming data.
Using components and connectors, we can come to various configurations, which, in turn,
have been classified into architectural styles.
30
The most important ones for distributed systems are:
Layered architectures
Object-based architectures
Data-centered architectures
Event-based architectures
31
Layered and Object-based Architectures
components are organized in a layered fashion
a component at layer Li is allowed to call components at the underlying layer Li-1
widely adopted by the networking community
A key observation is that control generally flows from layer to layer
requests go down the hierarchy whereas the results flow upward
A far looser organization is followed in object-based architectures, which
are illustrated in Fig
each object corresponds to what we have defined as a component, and these
components are connected through a (remote) procedure call mechanism
this software architecture matches the client-server system architecture
The layered and object based architectures still form the most important styles
for large software systems
32
The (a) layered and (b) object-based architectural style
33
Data-centered architectures
Processes communicate through a common (passive or active) repository
Are as important as the layered and object based architectures
A wealth of networked applications have been developed that rely on a
shared distributed file system in which virtually all communication takes
place through files
Likewise, Web-based distributed systems are largely data-centric: processes
communicate through the use of shared Web-based data services
34
Event-based architectures
35
The (a) event-based and (b) shared data-space architectural style
It can be combined with data-centered architectures, yielding what is also known as
shared data spaces.
many shared data spaces use a SQL-like interface to the shared repository, in the sense
that data can be accessed using a description rather than an explicit reference, as is the
case with files
36
System Architectures
Centralized Architectures
In the basic client-server model, processes in a distributed system are divided into two
(possibly overlapping) groups
A server is a process implementing a specific service. Eg: a file system service or a
database service.
A client is a process that requests a service from a server by sending it a request and
subsequently waiting for the server's reply.
This client-server interaction is also known as request-reply behavior.
37
Communication between a client and a server can be implemented by means of a
simple connectionless protocol.
A connectionless protocol is efficient: as long as no message is lost or corrupted, the
request/reply protocol just sketched works fine.
The problem, however, is that the client cannot detect whether the original request
message was lost, or that transmission of the reply failed.
As an alternative, many client-server systems use a reliable connection-oriented
protocol.
Although this solution is not entirely appropriate in a local-area network due to
relatively low performance, it works perfectly fine in wide-area systems in which
communication is inherently unreliable. A minimal sketch of a connectionless request/reply exchange follows.
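To make the request/reply idea concrete, here is a minimal sketch (not taken from the text) of a connectionless exchange using UDP sockets on the local host; the port number, message contents, and timeout are arbitrary choices for this illustration.

```python
# Minimal sketch of a connectionless request/reply exchange over UDP.
# The address, message format, and timeout are illustrative assumptions.
import socket
import threading
import time

ADDR = ("127.0.0.1", 9999)   # assumed-free local port for this sketch

def server():
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.bind(ADDR)
        data, client = s.recvfrom(1024)      # wait for one request
        s.sendto(b"reply: " + data, client)  # send the reply back

threading.Thread(target=server, daemon=True).start()
time.sleep(0.2)                              # give the server time to bind

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as c:
    c.settimeout(2.0)                        # a timeout is the only defence against loss
    c.sendto(b"read /etc/motd", ADDR)
    try:
        reply, _ = c.recvfrom(1024)
        print(reply.decode())
    except socket.timeout:
        # The client cannot tell whether the request or the reply was lost.
        print("no reply: request or reply may have been lost")
```

With UDP, a timeout is the client's only way of reacting to loss, and it cannot tell whether the request or the reply went missing; a connection-oriented protocol such as TCP would handle retransmission itself.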
38
Application Layering
considering that many client-server applications are targeted toward supporting
user access to databases, many people have advocated a distinction between the
following three levels
The user-interface level: contains all that is necessary to directly interface with the
user, such as display management.
o Clients typically implement the user-interface level
o allow end users to interact with applications
o Many client-server applications can be constructed from roughly three different pieces: a
part that handles interaction with a user, a part that operates on a database or file
system, and a middle part that generally contains the core functionality of an application
The processing level: typically contains the applications
o middle part is logically placed at the processing level
The data level: manages the actual data that is being acted on
39
The simplified organization of an Internet search engine into three different
layers
40
Alternative client-server organizations (a)-(e)
41
Cases:
a) only the terminal-dependent part of the user interface is on the client machine
b) place the entire user-interface software on the client side
o Divide the application into a graphical front end, which communicates with the rest of the application
(residing at the server) through an application-specific protocol.
o the front end (the client software) does no processing other than necessary for presenting the
application's interface
c) move part of the application to the front end
o e.g. the application makes use of a form that needs to be filled in entirely before it can be processed
o the front end can then check the correctness and consistency of the form, and where necessary interact with
the user
d) used where the client machine is a PC or workstation, connected through a network to a distributed
file system or database
o most of the application is running on the client machine, but all operations on files or database entries
go to the server
o e.g. many banking applications run on an end-user's machine where the user prepares transactions and
such. Once finished, the application contacts the database on the bank's server and uploads the
transactions for further processing
e) used where the client machine is a PC or workstation, connected through a network to a distributed
file system or database
o the situation where the client's local disk contains part of the data
42
An example of a server acting as client
43
Decentralized Architectures
Peer-to-peer systems
o support horizontal distribution
o functions that need to be carried out are represented by every process that
constitutes the distributed system
o interaction between processes is symmetric.
• each process will act as a client and a server at the same time (which is also
referred to as acting as a servant)
A two-layered approach for constructing and maintaining specific overlay topologies using techniques from
unstructured peer-to-peer systems
46
Super-peers
Super-peers are often also organized in a peer-to-peer network, leading to a
hierarchical organization.
every regular peer is connected as a client to a super-peer
All communication from and to a regular peer proceeds through that peer's
associated super-peer
50
Introduction to
Distributed Systems
Chapter Three
Processes
51
Introduction to Threads
The concept of a process originates from the field of operating systems where it is
generally defined as a program in execution.
o to efficiently organize client-server systems, it is often convenient to make use
of multithreading techniques.
o the main contribution of threads in distributed systems is that they allow clients
and servers to be constructed such that communication and local processing can
overlap, resulting in a high level of performance.
o Like a process, a thread executes its own piece of code, independently from
other threads
o in contrast to processes, no attempt is made to achieve a high degree of
concurrency transparency if this would result in performance degradation
o a thread system generally maintains only the minimum information to allow a
CPU to be shared by several threads
52
There are two important implications
o First of all, the performance of a multithreaded application need hardly
ever be worse than that of its single-threaded counterpart.
o In fact, in many cases, multithreading leads to a performance gain
o Second, because threads are not automatically protected against each other
the way processes are, development of multithreaded applications requires
additional intellectual effort.
o Proper design and keeping things simple, as usual, help a lot. Unfortunately,
current practice does not demonstrate that this principle is equally well
understood.
53
Thread Usage in Non-distributed Systems
o Consider, for example, maintaining a spreadsheet program (a minimal sketch follows this list):
o whenever a cell is modified, all dependent cells are automatically updated
o at least two threads of control:
• one for handling interaction with the user and
• one for updating the spreadsheet
• third thread could be used for backing up the spreadsheet to disk while
the other two are doing their work
o to exploit parallelism when executing the program on a multiprocessor system
o multithreading is also useful in the context of large applications, such as those developed
as a collection of cooperating programs, each to be executed by a separate
process
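As a rough illustration of the spreadsheet example above, the sketch below uses two Python threads: one stands in for handling user interaction (it produces cell edits) and one updates a dependent cell. The cell names, the dependency rule, and the queue-based hand-off are illustrative assumptions, not part of the original example.

```python
# Two threads of control for a toy "spreadsheet": one produces user edits,
# the other recomputes a dependent cell. Names and the dependency are made up.
import queue
import threading

edits = queue.Queue()            # edits handed from the interaction thread to the update thread
cells = {"A1": 1, "B1": 2}       # in this toy sheet, B1 is defined as A1 * 2

def interaction_thread():
    for value in (5, 7, 9):      # simulated user input
        edits.put(("A1", value))
    edits.put(None)              # signal: no more edits

def update_thread():
    while (edit := edits.get()) is not None:
        cell, value = edit
        cells[cell] = value
        cells["B1"] = cells["A1"] * 2          # recompute the dependent cell
        print(f"{cell}={value} -> B1={cells['B1']}")

t1 = threading.Thread(target=interaction_thread)
t2 = threading.Thread(target=update_thread)
t1.start(); t2.start()
t1.join(); t2.join()
```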
54
Thread Implementation
Threads are often provided in the form of a thread package which contains operations
to create and destroy threads.
There are two approaches to implement a thread package:
o The first is to construct a thread library that is executed entirely in user mode.
o The second approach is to have the kernel be aware of threads and schedule
them.
A user-level thread library
o makes it cheap to create and destroy threads, because all thread administration is kept in
the user's address space
o allows switching thread context in just a few instructions
o has the drawback that invocation of a blocking system call will
immediately block the entire process
o These problems can be mostly circumvented by implementing threads in the
operating system's kernel, but then a high price is paid: every thread operation
(creation, deletion, synchronization, etc.) has to be carried out by the kernel.
55
Fig. Combining kernel-level lightweight processes and user-level threads
56
The combination of (user-level) threads and LWPs works as follows
The thread package has a single routine to schedule the next thread.
When creating an LWP (which is done by means of a system call), the LWP is
given its own stack, and is instructed to execute the scheduling routine in search
of a thread to execute.
If there are several LWPs, then each of them executes the scheduler.
The thread table, which is used to keep track of the current set of threads, is
shared by the LWPs.
Protecting this table to guarantee mutually exclusive access is done by means of
mutexes that are implemented entirely in user space.
In other words, synchronization between LWPs does not require any kernel
support.
When an LWP finds a runnable thread, it switches context to that thread.
Meanwhile, other LWPs may be looking for other runnable threads as well
57
Threads in Distributed Systems
58
If a thread needs to block on a mutex or condition variable, it does the necessary
administration and eventually calls the scheduling routine.
When another runnable thread has been found, a context switch is made to that thread.
The beauty of all this is that the LWP executing the thread need not be informed: the
context switch is implemented completely in user space and appears to the LWP as
normal program code.
Now consider what happens when a thread does a blocking system call.
In that case, execution changes from user mode to kernel mode, but still continues in
the context of the current LWP.
At the point where the current LWP can no longer continue, the operating system may
decide to switch context to another LWP, which also implies that a context switch is
made back to user mode.
The selected LWP will simply continue where it had previously left off
59
advantages of using LWPs in combination with a user-level thread package
First, creating, destroying, and synchronizing threads is relatively cheap and involves no
kernel intervention at all.
Second, provided that a process has enough LWPs, a blocking system call will not
suspend the entire process.
Third, there is no need for an application to know about the LWPs.
o All it sees are user-level threads.
Fourth, LWPs can be easily used in multiprocessing environments, by executing different
LWPs on different CPUs.
o This multiprocessing can be hidden entirely from the application.
The only drawback of lightweight processes in combination with user-level threads is
that we still need to create and destroy LWPs, which is just as expensive as with kernel-
level threads.
However, creating and destroying LWPs needs to be done only occasionally, and is often
fully controlled by the operating system
60
Multithreaded Clients
61
Although there are important benefits to multithreaded clients,
the main use of multithreading in distributed systems is found at the server side.
Practice shows that multithreading not only simplifies server code considerably,
but also makes it much easier to develop servers that exploit parallelism to attain
high performance, even on uniprocessor systems.
However, now that multiprocessor computers are widely available as general-
purpose workstations, multithreading for parallelism is even more useful.
62
To understand the benefits of threads for writing server code, consider a file server that
occasionally has to block waiting for the disk (a minimal sketch of this organization follows below).
The dispatcher thread reads incoming requests for a file operation.
The requests are sent by clients to a well-known end point for this server.
After examining the request, the server chooses an idle (i.e., blocked) worker thread
and hands it the request.
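The sketch below is a minimal, in-process approximation of this dispatcher/worker organization using Python threads; a queue stands in for the server's well-known end point, and the file names, worker count, and simulated disk delay are illustrative assumptions.

```python
# Dispatcher/worker organization: the dispatcher hands incoming requests to
# idle worker threads, which block (here, sleep) as if waiting for the disk.
import queue
import threading
import time

requests = queue.Queue()     # stands in for the server's well-known end point
NUM_WORKERS = 4

def worker(worker_id):
    while True:
        filename = requests.get()     # idle (blocked) until handed a request
        if filename is None:          # shutdown signal
            break
        time.sleep(0.1)               # simulate blocking while waiting for the disk
        print(f"worker {worker_id} served read({filename!r})")
        requests.task_done()

def dispatcher(incoming):
    for filename in incoming:         # read incoming requests and hand each to a worker
        requests.put(filename)

workers = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for w in workers:
    w.start()

dispatcher(["/etc/motd", "/var/log/syslog", "/tmp/data.bin"])
requests.join()                       # wait until every request has been served
for _ in workers:
    requests.put(None)                # tell each worker to exit
for w in workers:
    w.join()
```

Because workers block independently, a request that waits for the disk does not prevent the dispatcher from accepting and handing out further requests.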
63
Three ways to construct a server
64
Clients
Networked User Interfaces
A major task of client machines is to provide the means for users to interact with
remote servers.
There are roughly two ways in which this interaction can be supported.
o First, for each remote service the client machine will have a separate counterpart
that can contact the service over the network.
o A second solution is to provide direct access to remote services by only offering a
convenient user interface. Effectively, this means that the client machine is used
only as a terminal with no need for local storage, leading to an application neutral
solution
65
Fig. (a) A networked application with its own protocol. (b) A general solution to allow
access to remote applications
A special class is formed by embedded client software, such as for automatic teller
machines (ATMs), cash registers, barcode readers, TV set-top boxes, etc.
69
Fig. Route optimization in a distributed server
70
Code Migration
Traditionally, code migration took the form of process migration, in which an entire process was moved from one machine to another.
Moving a running process to a different machine is a costly and intricate task, and
there had better be a good reason for doing so.
If a client application needs to perform many database operations involving large
quantities of data, it may be better to ship part of the client application to the server
and send only the results across the network
71
Models for Code Migration
To get a better understanding of the different models for code migration, a process can be viewed as consisting of three segments:
The code segment is the part that contains the set of instructions that make up
the program that is being executed.
The resource segment contains references to external resources needed by the
process, such as files, printers, devices, other processes, and so on.
Finally, an execution segment is used to store the current execution state of a
process, consisting of private data, the stack, and, of course, the program
counter
72
weak mobility: only the code segment is transferred; it is very simple to implement
o Example: Java applets, which always start execution from the beginning (a minimal sketch follows this list)
strong mobility: execution segment can be transferred.
o is that a running process can be stopped, subsequently moved to another
machine, and then resume execution where it left off.
o It is much more general than weak mobility, but also much harder to implement
sender-initiated migration: migration is initiated at the machine where the code currently
resides or is being executed.
o Examples are uploading programs to a compute server and sending a search
program across the Internet to a Web database server to perform the queries at that
server.
receiver-initiated migration: the initiative for code migration is taken by the target
machine.
o Java applets are an example of this approach.
o is simpler than sender-initiated migration.
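The sketch below illustrates weak mobility only: a code segment (Python source held in a string) is "shipped" and then executed from the beginning in a fresh namespace on the receiving side. The function name and the in-process hand-off are illustrative assumptions; a real system would transfer the code over the network and would have to address security.

```python
# Weak mobility sketch: only the code segment travels; execution always
# starts from the beginning, with no execution state carried along.
code_segment = """
def task():
    # the migrated code starts from its initial state
    return sum(range(10))

result = task()
"""

def receive_and_run(code):
    namespace = {}            # fresh execution environment on the target machine
    exec(code, namespace)     # run the shipped code (trusted code only!)
    return namespace

ns = receive_and_run(code_segment)
print(ns["result"])           # -> 45
```

Strong mobility would additionally require capturing and restoring the execution segment (stack, program counter), which is why it is much harder to implement.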
73
Fig. Alternatives for code migration
74
If you have questions, comments,
or suggestions, you are welcome!
75
Introduction to
Distributed Systems
Chapter Four
Communication
76
Overview of the Chapter
Layered Protocols
Types of Communication
Remote Procedure Call
Remote Object invocation
77
Layered Protocols
78
Fig. Layers, Interfaces and protocols in the OSI model
79
Fig. A typical message as it appears on the network
The information in the layer n header is used for the layer n protocol
Lower-Level Protocols
physical layer: deals with standardizing the electrical, mechanical, and signaling
interfaces so that, for example, a 0 bit sent by one side is recognized as a 0 by the other side
data link layer: groups bits into frames and detects transmission errors, for example by
computing a checksum over all the bytes in the frame
network layer: chooses the best path for each packet, a task called routing
session layer: enhanced version of the transport layer that provides dialog control, to
keep track of which party is currently talking, and it provides synchronization facilities.
o This layer allows users on different machines to establish active communications
sessions between them. It is responsible for establishing, maintaining, synchronizing,
terminating sessions between end-user applications
presentation layer: is concerned with the meaning of the bits; one can define records
containing typed fields and then have the sender notify the receiver that a message
contains a particular record in a certain format
application layer: contains a collection of standard network applications, such as FTP and
HTTP
82
Middleware Protocols
logically lives (mostly) in the application layer
contains many general-purpose protocols that warrant their own layers,
independent of other, more specific applications.
a distinction can thus be made between high-level communication protocols and protocols for
establishing various middleware services
support high-level communication services
offer reliable multicast services that scale to thousands of receivers spread
across a WAN
reliable multicasting services that guarantee scalability can be implemented
only if application requirements are taken into account
middleware communication services may include message-passing services
comparable to those offered by the transport layer
83
Types of Communication
persistent communication: a message that has been submitted for
transmission is stored by the communication middleware as long as it takes to
deliver it to the receiver.
o In this case, the middleware will store the message at one or several of the
storage facilities.
o It is not necessary for the sending application to continue execution after
submitting the message
transient communication: a message is stored by the communication system
only as long as the sending and receiving application are executing
o if the middleware cannot deliver a message, due to a transmission interrupt or
because the recipient is currently not active, the message will simply be discarded.
o all transport-level communication services offer only transient communication
o a typical example is a communication system consisting of traditional store-and-forward routers
84
asynchronous communication: sender continues immediately after it has
submitted its message for transmission
o the message is (temporarily) stored by the middleware immediately upon
submission (a small sketch combining persistent storage with asynchronous submission follows this list)
synchronous communication: the sender is blocked until its request is
known to be accepted
o First, the sender may be blocked until the middleware notifies that it will take
over transmission of the request.
o Second, the sender may synchronize until its request has been delivered to the
intended recipient.
o Third, synchronization may take place by letting the sender wait until its
request has been fully processed, that is, up to the time that the recipient returns
a response.
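A minimal, single-process sketch of persistence combined with asynchronous submission: an in-memory queue stands in for the middleware's storage facility, so the sender continues immediately and the receiver may pick the message up later. All names and the delay are illustrative assumptions.

```python
# Persistent, asynchronous communication in miniature: the "middleware store"
# keeps the message until the (late-starting) receiver takes delivery.
import queue
import threading
import time

middleware_store = queue.Queue()   # stands in for the middleware's message store

def sender():
    middleware_store.put("hello, receiver")
    print("sender: submitted the message and continues immediately")

def receiver():
    time.sleep(1.0)                          # receiver is not running when the message is sent
    msg = middleware_store.get()             # the message was stored until delivery
    print(f"receiver: got {msg!r}")

threading.Thread(target=sender).start()
threading.Thread(target=receiver).start()
```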
85
Fig. Viewing middleware as an intermediate (distributed) service in application-level
communication
86
Remote Procedure Call (RPC)
87
Conventional Procedure Call
88
After the read procedure has finished running,
o it puts the return value in a register,
o removes the return address, and
o transfers control back to the caller.
The caller then removes the parameters from the stack, returning the stack
to the original state it had before the call.
89
Fig. (a) Parameter passing in a local procedure call: the stack before the call to read.
(b) The stack while the called procedure is active
90
The distinction between call-by-value and call-by-reference is quite important for RPC.
There is also another parameter-passing mechanism, known as call-by-copy/restore,
which is not used in many languages.
The decision of which parameter passing mechanism to use is normally
made by the language designers and is a fixed property of the language.
Sometimes it depends on the data type being passed (a small illustration follows this list).
o for example in C, integers and other scalar types are always passed by value,
o whereas arrays are always passed by reference,
o Some Ada compilers use copy/restore for in out parameters, but others use
call-by-reference.
o The language definition permits either choice, which makes the semantics a
bit fuzzy
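A small Python illustration of the last point (passing behavior depending on the data type): rebinding an integer parameter is invisible to the caller, much like call-by-value, while mutating a list parameter is visible, much like call-by-reference. This is only an analogy, not the C or Ada semantics discussed above.

```python
# Effect of the data type on what the caller observes after a call.
def increment(n):
    n = n + 1           # rebinding affects only the local name (value-like)

def append_item(items):
    items.append(42)    # mutation is visible to the caller (reference-like)

x = 10
increment(x)
print(x)      # -> 10 (unchanged)

v = [1, 2, 3]
append_item(v)
print(v)      # -> [1, 2, 3, 42]
```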
91
Client and Server Stubs
The idea behind RPC is to make a remote procedure call look as much as possible like a local one.
In other words, we want RPC to be transparent: the calling procedure should not be
aware that the called procedure is executing on a different machine, or vice versa.
o Suppose that a program needs to read some data from a file.
• The programmer puts a call to read in the code to get the data.
• In a traditional (single-processor) system, the read routine is extracted from the
library
by the linker and inserted into the object program.
• It is a short procedure, which is generally implemented by calling an equivalent read
system call.
• In other words, the read procedure is a kind of interface between the user code and
the local operating system.
92
• RPC achieves its transparency in an analogous way.
• When read is actually a remote procedure (e.g., one that will run on the file server's
machine), a different version of read, called a client stub, is put into the library.
• Like the original one, it, too, is called using the calling sequence.
• Also like the original one, it too, does a call to the local operating system.
• Only unlike the original one, it does not ask the operating system to give it data.
• Instead, it packs the parameters into a message and requests that message to be sent
to the server as illustrated.
• Following the call to send, the client stub calls receive, blocking itself until the reply
comes back.
93
Fig. Principle of RPC between a client and server program
94
When the message arrives at the server, the server's operating system passes
it up to a server stub.
A server stub is the server-side equivalent of a client stub:
o it is a piece of code that transforms requests coming in over the network into local
procedure calls.
o it will have called receive and be blocked waiting for incoming messages.
o The server stub unpacks the parameters from the message and then calls the server
procedure in the usual way
o From the server's point of view, it is as though it is being called directly by the
client-the parameters and return address are all on the stack where they belong
and nothing seems unusual.
o The server performs its work and then returns the result to the caller in the usual way.
For example, in the case of read, the server will fill the buffer, pointed to by the second
parameter, with the data. This buffer will be internal to the server stub.
95
When the server stub gets control back after the call has completed, it packs
the result (the buffer) in a message and calls send to return it to the client.
After that, the server stub usually does a call to receive again, to wait for the next
incoming request.
When the message gets back to the client machine, the client's operating system sees
that it is addressed to the client process (or actually the client stub, but the operating
system cannot see the difference).
The message is copied to the waiting buffer and the client process unblocked.
The client stub inspects the message, unpacks the result, copies it to its caller, and
returns in the usual way.
When the caller gets control following the call to read, all it knows is that its data are
available.
It has no idea that the work was done remotely instead of by the local operating
system.
96
A remote procedure call occurs in the following steps:
1. The client procedure calls the client stub in the normal way.
2. The client stub builds a message and calls the local operating system.
3. The client's OS sends the message to the remote OS.
4. The remote OS gives the message to the server stub.
5. The server stub unpacks the parameters and calls the server.
6. The server does the work and returns the result to the stub.
7. The server stub packs it in a message and calls its local OS.
8. The server's OS sends the message to the client's OS.
9. The client's OS gives the message to the client stub.
10. The stub unpacks the result and returns to the client.
97
parameter passing in RPC systems
Passing Value Parameters
o Packing parameters into a message is called parameter marshaling.
o As a very simple example, consider a remote procedure, add(i, j), that takes
two integer parameters i and j and returns their arithmetic sum as a result.
o (As a practical matter, one would not normally make such a simple
procedure remote due to the overhead, but as an example it will do.)
o The call to add, is shown in the left-hand portion (in the client process) in
Fig. below.
o The client stub takes its two parameters and puts them in a message,
o It also puts the name or number of the procedure to be called in the
message because the server might support several different calls, and it has
to be told which one is required.
98
Fig. The steps involved in doing a remote computation through RPC
99
When the message arrives at the server, the stub examines the message to see
which procedure is needed and then makes the appropriate call.
If the server also supports other remote procedures, the server stub might have a switch
statement in it to select the procedure to be called, depending on the first field of the
message.
The actual call from the stub to the server looks like the original client call, except that
the parameters are variables initialized from the incoming message.
When the server has finished, the server stub gains control again.
It takes the result sent back by the server and packs it into a message.
This message is sent back to the client stub, which unpacks it to extract the result and
returns the value to the waiting client procedure.
100
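The add(i, j) example can be tried with Python's standard xmlrpc modules, whose generated proxies play the roles of the client and server stubs (marshaling the procedure name and parameters into a message and unpacking the reply). This is a hedged sketch, not the book's implementation; the local port number is an arbitrary choice.

```python
# RPC in miniature: a client stub (ServerProxy) marshals add(3, 4) into a
# request message; the server stub dispatches on the procedure name.
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy
import threading

def add(i, j):
    return i + j                        # the remote procedure itself

server = SimpleXMLRPCServer(("127.0.0.1", 8000), logRequests=False)
server.register_function(add, "add")    # the server stub will dispatch on "add"
threading.Thread(target=server.serve_forever, daemon=True).start()

client = ServerProxy("http://127.0.0.1:8000")   # acts as the client stub
print(client.add(3, 4))                         # blocks until the reply comes back -> 7
server.shutdown()
```

From the caller's point of view the invocation looks like a local call; all marshaling, sending, and blocking on the reply happen inside the stubs.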
Passing Reference Parameters
We now come to a difficult problem: How are pointers, or in general, references passed?
102
Remote Object (Method) Invocation (RMI)
• resulted from object-based technology that has proven its value in developing non-
distributed applications
it is an expansion of the RPC mechanisms
• it enhances distribution transparency as a consequence of an object that hides its
internals from the outside world by means of a well-defined interface
• Distributed Objects
– an object encapsulates data, called the state, and the operations on those data, called
methods/behaviors
– methods are made available through an interface
– the state of an object can be manipulated only by invoking methods
– this allows an interface to be placed on one machine while the object itself resides on
another machine;
– such an organization is referred to as a distributed object
– if the state of an object is not distributed, but only the interfaces are, then such an
object is referred to as a remote object.
103
– the implementation of an object’s interface is called a proxy (analogous
to a client stub in RPC systems)
• it is loaded into the client’s address space when a client binds to a
distributed object
• tasks: a proxy marshals method invocation into messages and
unmarshals reply messages to return the result of the method invocation to
the client
– a server stub, called a skeleton, unmarshals request messages and marshals
replies (a minimal sketch of this organization follows the figure below).
104
Fig. Common organization of a remote object with client-side proxy
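The sketch below is a minimal, purely illustrative version of the proxy/skeleton organization in the figure above: method invocations on the proxy are marshaled into request messages, handed to a skeleton, unmarshaled, and invoked on the real object. The Account class, the JSON message format, and the direct in-process hand-off (instead of a network connection) are assumptions made for the sketch.

```python
# Client-side proxy and server-side skeleton, with JSON standing in for the
# wire format and a direct call standing in for the network.
import json

class Skeleton:
    """Server-side stub: unmarshals requests and invokes the real object."""
    def __init__(self, obj):
        self.obj = obj
    def handle(self, request):
        msg = json.loads(request)                        # unmarshal the invocation
        result = getattr(self.obj, msg["method"])(*msg["args"])
        return json.dumps({"result": result}).encode()   # marshal the reply

class Proxy:
    """Client-side stub: marshals invocations and unmarshals replies."""
    def __init__(self, skeleton):
        self.skeleton = skeleton                         # stands in for a network connection
    def __getattr__(self, method):
        def invoke(*args):
            request = json.dumps({"method": method, "args": list(args)}).encode()
            reply = self.skeleton.handle(request)        # "send" the message, wait for the reply
            return json.loads(reply)["result"]
        return invoke

class Account:                                           # the remote object's implementation
    def __init__(self):
        self.balance = 0
    def deposit(self, amount):
        self.balance += amount
        return self.balance

proxy = Proxy(Skeleton(Account()))
print(proxy.deposit(50))   # the client manipulates the object only through its interface -> 50
```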
105
106
If you have questions, comments,
or suggestions, you are welcome!
107
Introduction to
Distributed Systems
Chapter Five
Naming
108
names play an important role to:
share resources
uniquely identify entities
refer to locations
etc.
an important issue is that a name can be resolved to the entity it refers
to
to resolve names, it is necessary to implement a naming system
in a distributed system, the implementation of a naming system is itself
often distributed, unlike in non-distributed systems
efficiency and scalability of the naming system are the main issues
Names facilitate communication and resource sharing
109
Naming Entities
111
an address is a special kind of name
• it refers to at most one entity
• each entity is referred to by at most one address; even when replicated, such as in
Web pages
• an entity may change an access point, or an access point may be reassigned to a
different entity (like telephone numbers in offices)
• separating the name of an entity and its address makes it easier and more
flexible; such a name is called location independent
there are also other types of names that uniquely identify an entity; in any case
an identifier is a name with the following properties
• it refers to at most one entity
• each entity is referred to by at most one identifier
• it always refers to the same entity (never reused)
• identifiers allow us to unambiguously refer to an entity
112
Descriptive attributes are another technique used in distributed systems for
identification
• Sometimes clients do not know the name of the particular entity that they seek,
but they do have some information that describes it.
• Or they may require a service and know some of its characteristics but not what
entity implements it.
• Any process that requires access to a specific resource must possess a name or
an identifier for it.
–Examples of human-readable names are
o Filenames such as /etc/passwd
o URLs such as http://www.slu.edu.et/ and
o Internet domain names such as www.cdk5.net.
113
– Names well-understood by programs (a.k.a identifiers) – typical examples
• Remote object references and
• NFS file handles
114
• The above Fig. shows the domain name portion of a URL
resolved first via the DNS into an IP address and then, at the
final hop of Internet routing, via ARP to an Ethernet address
for the web server.
• The last part of the URL is resolved by the file system on the
web server to locate the relevant file.
115
Names and services
– Many of the names used in a distributed system are specific to
some particular service. E.g. twitter.com
– So, a client may use a service-specific name when requesting a service to
perform an operation upon a named object or resource that it manages.
– Names are also sometimes needed to refer to entities in a
distributed system that are beyond the scope of any single service.
• The major examples of these entities are users (with proper names and
email addresses), computers (with hostnames such as www.cdk5.net) and
services themselves (such as file service or printer service).
116
– In object-based middleware, names refer to remote objects that provide
services or applications.
• Note that many of these names must be readable by and meaningful to
humans, since users and system administrators need to refer to the major
components and configuration of distributed systems, programmers need to
refer to services in programs, and users need to communicate with each other
via the distributed system and discuss what services are available in different
parts of it.
• Given the connectivity provided by the Internet, these naming requirements
are potentially world-wide in scope.
117
– Common names
• Uniform Resource Identifiers (URIs)
– came about from the need to identify resources on the Web, and other Internet
resources such as electronic mailboxes.
– An important goal was to identify resources in a coherent way, so that they
could all be processed by common software such as browsers.
– are ‘uniform’ in that their syntax incorporates that of indefinitely many
individual types of resource identifiers (that is, URI schemes), and there are
procedures for managing the global namespace of schemes.
• Uniform Resource Locators (URLs)
– often used for URIs that provide location information and specify the method
for accessing the resource
» http://www.cdk5.net/ identifies a web page at the given path (‘/’) on the host
www.cdk5.net, and specifies that the HTTP protocol be used to access it.
» a ‘mailto’ URL, such as mailto:[email protected], which identifies the
mailbox at the given address.
118
• Uniform Resource Names (URNs)
– are URIs that are used as pure resource names rather than locators.
» URI: mid:[email protected] is a
URN that identifies the email message containing its ‘Message-Id’ field.
» The URI distinguishes that message from any other email messages.
» But it does not provide the message’s address in any store, so a lookup
operation is needed to find it.
119
Name Services and DNS
• A name service stores information about a collection of textual names, in the
form of bindings between the names and the attributes of the entities they
denote, such as users, computers, services and objects.
• This aggregation is often subdivided into one or more naming contexts:
individual subsets of the bindings that are managed as a unit.
• The major operation that a name service supports is to resolve a name – that is,
to look up attributes from a given name.
• Operations are also required for creating new bindings, deleting bindings and
listing bound names, and adding and deleting contexts.
120
• General name service requirements
• Name services were originally quite simple, since they were designed only to
meet the need to bind names to addresses in a single management domain,
corresponding to a single LAN or WAN.
• The interconnection of networks and the increased scale of distributed systems
have produced a much larger name-mapping problem.
– Grapevine was one of the earliest extensible, multi-domain name services.
• It was designed to be scalable in the number of names and the load of requests
that it could handle.
– The Global Name Service(GNS), developed at the Digital Equipment Corporation
Systems Research Center, is a descendant of Grapevine with ambitious goals,
including:
• To handle an essentially arbitrary number of names and to serve an arbitrary
number of administrative organizations.
• A long lifetime and high availability.
• Fault isolation and tolerance of mistrust.
121
– Provides the main name service requirements
• The most popular name services that have concentrated on the goal of
scalability to large numbers of objects such as documents are
• the Globe Name Service,
• the Handle System and
• the Internet Domain Name System (DNS) - names computers (and other
entities) across the Internet.
a) Name spaces
– A name space is the collection of all valid names recognized by a particular
service.
– The service will attempt to look up a valid name, even though that name
may prove not to correspond to any object – i.e., to be unbound.
– require a syntactic definition to separate valid names from invalid names.
122
• For example, ‘...’ is not acceptable as the DNS name of a computer, whereas
www.cdk99.net is valid (even though it is unbound).
– Names may have an internal structure that represents their position in a
hierarchic name space such as pathnames in a file system, or in an
organizational hierarchy such as Internet domain names; or they may be chosen
from a flat set of numeric or symbolic identifiers.
• Hierarchic vs. flat name spaces
– are potentially infinite, so they enable a system to grow indefinitely.
– By contrast, flat name spaces are usually finite; their size is
determined by fixing a maximum permissible length for names.
– Another potential advantage of a hierarchic name space is that
different contexts can be managed by different people or
organizations.
123
• The URL name space
– includes relative names such as ../images/figure1.jpg.
– When a browser or other web client encounters such a relative name, it uses
the resource in which the relative name is embedded to determine the server
host name and the directory to which this pathname refers.
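As a small illustration of how a relative URL is resolved against the resource in which it is embedded, the sketch below uses Python's standard urllib; the page URL is an assumed example.

```python
# Resolving a relative name against the URL of the embedding resource.
from urllib.parse import urljoin

page = "http://www.cdk5.net/additional/index.html"   # resource containing the relative name
print(urljoin(page, "../images/figure1.jpg"))
# -> http://www.cdk5.net/images/figure1.jpg
```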
• DNS names
– are strings called domain names.
– Some examples are www.cdk5.net (a computer), net, com and ac.uk (the latter
three are domains).
– The DNS name space has a hierarchic structure: a domain name consists of
one or more strings called name components or labels, separated by the
delimiter ‘.’.
– DNS names are not case-sensitive, so www.cdk5.net and WWW.CDK5.NET
have the same meaning.
124
• DNS servers do not recognize relative names: all names are referred to the
global root.
– However, in practical implementations, client software keeps a list of
domain names that are appended automatically to any single-component name
before resolution.
– Names with more than one component, however, are normally presented
intact to the DNS, as absolute names.
i) Aliases
– An alias is a name defined to denote the same information as another
name, similar to a symbolic link between file path names.
– allow more convenient names to be substituted for relatively complicated
ones, and allow alternative names to be used by different people for the
same entity.
– are often used to specify the names of machines that run a web server or an FTP server.
125
– An example is the common use of URL shorteners, often used in Twitter
posts and other situations where space is at a premium.
• For example, using web redirection, http://bit.ly/ctqjvH refers to
http://cdk5.net/additional/rmi/programCode/ShapeListClient.java.
• As another example, the DNS allows aliases in which one domain name is
defined to stand for another.
• Yet another example, the name www.cdk5.net is an alias for cdk5.net.
– This has the advantage that clients can use either name for the web
server, and if the web server is moved to another computer, only the entry for
cdk5.net needs to be updated in the DNS database.
126
ii) Naming domains
– A naming domain is a name space for which there exists a single overall
administrative authority responsible for assigning names within it.
– This authority is in overall control of which names may be bound within the
domain, but it is free to delegate this task.
• Domains in DNS are collections of domain names;
– syntactically, a domain’s name is the common suffix of the domain names
within it, but otherwise it cannot be distinguished from, for example, a computer
name.
• For example, net is a domain that contains cdk5.net.
• Note that the term ‘domain name’ is potentially confusing, since only some
domain names identify domains (others identify computers).
• The administration of domains may be devolved to subdomains.
127
– For example
• The domain dcs.qmul.ac.uk – the Department of Computer Science at Queen
Mary, University of London in the UK – can contain any name the department
wishes.
• But the domain name dcs.qmul.ac.uk itself had to be agreed with the college
authorities, who manage the domain qmul.ac.uk.
• Similarly, qmul.ac.uk had to be agreed with the registered authority for ac.uk,
and so on.
• Responsibility for a naming domain normally goes hand in hand with
responsibility for managing and keeping up-to-date the corresponding part of the
database stored in an authoritative name server and used by the name service.
128
iii) Combining and customizing name spaces
• The DNS provides a global and homogeneous name space in which a
given name refers to the same entity, no matter which process on which computer
looks up the name.
• By contrast, some name services allow distinct name spaces – sometimes
heterogeneous name spaces – to be embedded into them; and some name services
allow the name space to be customized to suit the needs of individual groups,
users or even processes.
– Merging
• The practice of mounting file systems in UNIX and NFS provides an example
in which a part of one name space is conveniently embedded in another.
– But consider how to merge the entire UNIX file systems of two (or more)
computers.
• Each computer has its own root, with overlapping file names.
• Generally, we can always merge name spaces by creating a higher-level root.
129
131
i) Name servers and navigation
– Any name service, such as DNS, that stores a very large database and is used
by a large population will not store all of its naming information on a single
server computer.
• Such a server would be a bottleneck and a critical point of failure.
– Any heavily used name services should use replication (at least two failure
independent servers) to achieve high availability.
– in some cases, a name server may store data for more than one domain,
– Generally, data is partitioned into servers according to its domain.
• In DNS, most of the entries are for local computers.
• But there are also name servers for the higher domains, such as yahoo.com
and ac.uk, and for the root.
132
ii) Caching
– In DNS and other name services, client name resolution software and servers maintain
a cache of the results of previous name resolutions.
– When a client requests a name lookup, the name resolution software consults its cache.
– If it holds a recent result from a previous lookup for the name, it returns it to the client;
otherwise, it sets about finding it from a server.
– That server, in turn, may return data cached from other
servers.
– Caching is key to a name service’s performance and assists in maintaining the
availability of both the name service and other services in spite of name server crashes.
– Its role in enhancing response times by saving communication with name servers is
clear.
– Caching can be used to eliminate high-level name servers (the root server, in
particular) from the navigation path, allowing resolution to proceed despite some server failures.
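The sketch below is a minimal illustration of client-side caching of name resolutions: results of previous lookups are kept with a time-to-live so that repeated lookups avoid contacting the name servers again. The TTL value and host name are illustrative assumptions, and running it requires network access.

```python
# Caching wrapper around host name resolution.
import socket
import time

_cache = {}     # name -> (address, time of lookup)
TTL = 300.0     # seconds a cached entry stays valid (illustrative value)

def resolve(name):
    entry = _cache.get(name)
    if entry and time.time() - entry[1] < TTL:
        return entry[0]                    # recent result from a previous lookup
    address = socket.gethostbyname(name)   # otherwise ask the resolver / name servers
    _cache[name] = (address, time.time())
    return address

print(resolve("www.iana.org"))   # first call goes out to the DNS
print(resolve("www.iana.org"))   # second call is answered from the cache
```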
133
c) The Domain Name System
– is a name service design whose main naming database is used across the Internet.
– devised principally by Mockapetris and specified in RFC 1034 and RFC 1035.
– DNS replaced the original Internet naming scheme, in which all host names and addresses
were held in a single central master file and downloaded by FTP to all computers that
required them.
– The objects named by the DNS are primarily computers
– for which mainly IP addresses are stored as attributes
– referred to simply as domains in the DNS.
– In principle, however, any type of object can be named, and its architecture gives scope
for a variety of implementations.
– Organizations and departments within them can manage their own naming data.
– Millions of names are bound by the Internet DNS, and lookups are made against it from
around the world.
– Any name can be resolved by any client. This is achieved by hierarchical partitioning of
the name database, by replication of the naming data, and by caching.
134
i) Domain names
– The Internet DNS namespace is partitioned both organizationally and
according to geography.
– The names are written with the highest-level domain on the right.
– The original top-level organizational domains (also called generic domains) in
use across the Internet were:
• com – Commercial organizations
• edu – Universities and other educational institutions
• gov – US governmental agencies
• mil – US military organizations
• net – Major network support centers
• org – Organizations not mentioned above
• int – International organizations
135
– New top-level domains such as biz and mobi have been added since the early 2000s.
– A full list of current generic domain names is available from the
Internet Assigned Numbers Authority [www.iana.org ].
– In addition, every country has its own domains:
• us – United States
• uk – United Kingdom
• fr – France
• et – Ethiopia
– Countries, particularly those other than the US, often use their own subdomains to distinguish their organizations.
• The UK, for example, has domains co.uk and ac.uk, which correspond to com and edu
respectively (ac stands for ‘academic community’).
– Note that, despite its geographic-sounding uk suffix, a domain such as doit.co.uk could
have data referring to computers in the Spanish office of Doit Ltd., a notional British
company.
• In other words, even geographic-sounding domain names are conventional and are completely independent of their physical locations.
136
ii) DNS queries
• The Internet DNS is primarily used for simple host name resolution and for looking up
electronic mail hosts, as follows:
– Host name resolution: In general, applications use the DNS to resolve host names into IP
addresses.
– URL → DNS enquiry → IP address → HTTP request → web server port number
– Mail host location: Electronic mail software uses the DNS to resolve domain names into
the IP addresses of mail hosts – i.e., computers that will accept mail for those domains.
– Email address → email software → IP addresses of mail hosts
• Some other types of query that are implemented in some installations but are less
frequently used than those just given are:
– Reverse resolution - Some software requires a domain name to be returned given an IP
address.
– Host information - The DNS can store the machine architecture type and operating
system with the domain names of hosts.
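• As a concrete illustration of these two query types, the short Python sketch below uses only the standard socket module; the host name is a placeholder, and a reverse lookup succeeds only if a PTR record exists for the address:

    import socket

    # Host name resolution: domain name -> IP address (A record lookup).
    ip = socket.gethostbyname("www.example.org")     # placeholder name
    print("address:", ip)

    # Reverse resolution: IP address -> domain name (PTR record lookup).
    # gethostbyaddr() raises an error if no PTR record is registered.
    name, aliases, addresses = socket.gethostbyaddr(ip)
    print("name:", name)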
137
iii) DNS name servers
• The problems of scale are treated by a combination of partitioning the naming database
and replicating and caching parts of it close to the points of need.
• The DNS database is distributed across a logical network of servers.
• Each server holds part of the naming database – primarily data for the local domain.
• Queries concerning computers in the local domain are satisfied by servers within that
domain.
• However, each server records the domain names and addresses of other name servers, so
that queries pertaining to objects outside the domain can be satisfied.
• The DNS naming data are divided into zones. A zone contains the following data:
– Attribute data for names in a domain, less any subdomains administered by lower level
authorities. For example, a zone could contain data for Queen Mary, University of London –
qmul.ac.uk
– less the data held by departments (for example the Department of Computer Science –
dcs.qmul.ac.uk).
– The names and addresses of at least two name servers that provide authoritative data for the zone. These are versions of the zone data that can be relied upon as being reasonably up-to-date.
138
– The names of name servers that hold authoritative data for delegated
subdomains; and ‘glue’ data giving the IP addresses of these servers.
– Zone-management parameters, such as those governing the caching and
replication of zone data.
• A server may hold authoritative data for zero or more zones.
• So that naming data are available even when a single server fails, the DNS
architecture specifies that each zone must be replicated authoritatively in at least
two servers.
• System administrators enter the data for a zone into a master file, which is the
source of authoritative data for the zone.
• There are two types of servers that are considered to provide authoritative data.
– A primary or master server reads zone data directly from a local master file.
– Secondary servers download zone data from a primary server
139
• They communicate periodically with the primary server to check whether
their stored version matches that held by the primary server.
• If a secondary’s copy is out of date, the primary sends it the latest version.
• The frequency of the secondary’s check is set by administrators as a zone
parameter, and its value is typically once or twice a day.
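• The secondary's periodic check can be sketched roughly as follows; query_soa_serial() and transfer_zone() are hypothetical stand-ins for the real DNS operations, and the 12-hour interval is just one plausible setting:

    import time

    def query_soa_serial(primary, zone):
        # Hypothetical: would ask the primary for the zone's current SOA serial number.
        raise NotImplementedError

    def transfer_zone(primary, zone):
        # Hypothetical: would download the latest zone data from the primary.
        raise NotImplementedError

    def secondary_refresh_loop(primary, zone, check_interval=12 * 3600):
        local_serial, local_data = None, None
        while True:
            serial = query_soa_serial(primary, zone)
            if local_serial is None or serial > local_serial:   # our copy is out of date
                local_data = transfer_zone(primary, zone)       # primary sends latest version
                local_serial = serial
            time.sleep(check_interval)   # administrators typically set this to once or twice a day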
140
Fig. DNS name servers
141
iv) Navigation and query processing
• A DNS client is called a resolver.
– It is normally implemented as library software.
– It accepts queries, formats them into messages in the form expected under the DNS protocol and communicates with one or more name servers in order to satisfy the queries.
– A simple request-reply protocol is used, typically using UDP packets on the Internet (DNS servers use a well-known port number).
– The resolver times out and resends its query if necessary.
– The resolver can be configured to contact a list of initial name servers in order of
preference in case one or more are unavailable.
• The DNS architecture allows for recursive navigation as well as iterative navigation.
– The resolver specifies which type of navigation is required when contacting a name
server.
– Recursive navigation may tie up server threads, meaning that other requests might be
delayed.
– In order to save on network communication, the DNS protocol allows for multiple queries to be packed into the same request message and for name servers correspondingly to send multiple replies in their response messages.
142
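• Iterative navigation can be pictured with the sketch below; query() is a hypothetical helper standing in for one UDP request-reply exchange, returning either a final answer or a referral to name servers closer to the name:

    def query(server, name):
        # Hypothetical: one request-reply exchange; returns (answer, referral_servers).
        raise NotImplementedError

    def resolve_iteratively(name, initial_servers, max_steps=10):
        servers = list(initial_servers)        # e.g. the locally configured name servers
        for _ in range(max_steps):             # bound the walk down the naming hierarchy
            answer, referral = query(servers[0], name)
            if answer is not None:
                return answer                  # authoritative or cached result
            servers = referral                 # follow the referral towards the name's domain
        raise RuntimeError("resolution did not terminate")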
v) Resource records
• Zone data are stored by name servers in files in one of several fixed types of resource
record.
• For the Internet database, these include the types given in Fig.
– Each record refers to a domain name, which is not shown.
– The entries in the table refer to items already mentioned, except that AAAA records store
IPv6 addresses whereas A records store IPv4 addresses, and TXT entries are included to
allow arbitrary other information to be stored along with domain names.
• The data for a zone starts with an SOA-type record, which contains the zone parameters that specify, for example, the version number and how often secondaries should refresh their copies.
– This is followed by a list of records of type NS specifying the name servers for the
domain and a list of records of type MX giving the domain names of mail hosts, each
prefixed by a number expressing its preference.
• For example, part of the database for the domain dcs.qmul.ac.uk at one point is shown in
Figure 13.6, where the time to live 1D means 1 day.
143
– Further records of type A later in the database give the IP addresses for the two name
servers dns0 and dns1.
– The IP addresses of the mail hosts and the third name server are given in the databases
corresponding to their domains.
144
Fig. DNS resource records
• The majority of the remainder of the records in a lower-level zone like
dcs.qmul.ac.uk will be of type A and map the domain name of a computer onto
its IP address.
• They may also contain some aliases for well-known services.
• If the domain has any subdomains, there will be further records of type NS specifying their name servers, which will also have individual A entries.
• For example, at one point the database for qmul.ac.uk contained the following
records for the name servers in its subdomain dcs.qmul.ac.uk:
145
146
Directory services
• Sometimes users wish to find a particular person or resource, but they do not know its
name, only some of its other attributes.
– For example, a user may ask: ‘What is the name of the user with telephone number 020
5559980?’
• Likewise, sometimes users require a service, but they are not concerned with what
system entity supplies that service, as long as the service is conveniently accessible.
– For example, a user might ask, ‘Which computers in this building are Macintoshes
running the Mac OS X operating system?’ or ‘Where can I print a high-resolution color
image?’
• A service that stores collections of bindings between names and attributes and that looks
up entries that match attribute-based specifications is called a directory service.
– Examples are Microsoft’s Active Directory Services, X.500 and its cousin LDAP,
Univers and Profile.
• Directory services are sometimes called
– yellow pages services (conventional name services are correspondingly called white pages services, in an analogy with the traditional types of telephone directory), or
– attribute-based name services.
147
• A directory service returns the sets of attributes of any objects found to match some
specified attributes.
– So, for example, the request ‘TelephoneNumber = 020 555 9980’ might return {‘Name =
John Smith’, ‘TelephoneNumber = 020 555 9980’, ‘emailAddress =
[email protected]’, ...}.
– The client may specify that only a subset of the attributes is of interest
– for example, just the email addresses of matching objects. X.500 and some other directory
services also allow objects to be looked up by conventional hierarchic textual names.
• The Universal Description, Discovery and Integration service, UDDI, provides both white pages and yellow pages services to supply information about organizations and the web services they offer.
148
Universal Description, Discovery and Integration – UDDI
• UDDI aside, the term discovery service normally denotes the special case of a
directory service for services provided by devices in a spontaneous networking
environment.
• Devices in spontaneous networks are liable to connect and disconnect
unpredictably.
• One core difference between a discovery service and other directory services
is that the address of a directory service is normally well known and
preconfigured in clients, whereas a device entering a spontaneous networking
environment has to resort to multicast navigation, at least the first time it accesses the local discovery service.
149
If you have questions, comments, or suggestions, you are welcome!
150
Introduction to
Distributed Systems
Chapter Six
Time and Clocks – Synchronization
151
• Here we introduce fundamental concepts and algorithms related to
– monitoring distributed systems as their execution unfolds, and
– timing the events that occur in their executions.
• Time is an important and interesting issue in distributed systems, for several
reasons.
– First, time is a quantity we often want to measure accurately.
– In order to know at what time of day a particular event occurred at a particular computer it
is necessary to synchronize its clock with an authoritative, external source of time.
• For example, an eCommerce transaction involves events at a merchant’s computer and at a
bank’s computer. It is important, for auditing purposes, that those events are timestamped
accurately.
– Second, algorithms that depend upon clock synchronization have been developed for several problems in distributed systems.
• These include maintaining the consistency of distributed data, checking the authenticity of a request sent to a server (for example, a version of the Kerberos authentication protocol depends on loosely synchronized clocks), and eliminating the processing of duplicate updates.
152
– Measuring time can be problematic due to the existence of multiple frames of
reference.
• Einstein demonstrated, in his Special Theory of Relativity, the intriguing
consequences that follow from the observation that the speed of light is constant
for all observers, regardless of their relative velocity.
– The relative order of two events can even be reversed for two different
observers!
• But this cannot happen if one event causes the other to occur.
• The notion of physical time is also problematic in a distributed system.
• This is not due to the effects of special relativity, which are negligible or
nonexistent for normal computers (unless one counts computers travelling in
spaceships!).
• Rather, the problem is based on a similar limitation in our ability to timestamp events at different nodes sufficiently accurately to know the order in which any pair of events occurred, or whether they occurred simultaneously.
153
• There is no absolute, global time to which we can appeal.
• And yet we sometimes need to observe distributed systems and establish whether
certain states of affairs occurred at the same time.
– For example, in object oriented systems we need to be able to establish whether
references to a particular object no longer exist – whether the object has become garbage
(in which case we can free its memory).
– Establishing this requires observations of the states of processes (to find out whether
they contain references) and of the communication channels between processes (in case
messages containing references are in transit).
• So, here we’ll, first, examine methods whereby computer clocks can be approximately
synchronized, using message passing.
• Then we go on to introduce logical clocks, including vector clocks, which are used to
define an order of events without measuring the physical time at which they
occurred.
– we also describe algorithms whose purpose is to capture global states of distributed
systems as they execute.
154
Clocks, events and process states
• Let’s consider how to order and timestamp the events that occur at a single
process.
• We take a distributed system to consist of a collection P of N processes pi, i = 1, 2, ..., N.
• Each process executes on a single processor, and the processors do not share
memory.
• Each process pi in P has a state si that, in general, it transforms as it executes.
• The process’s state includes the values of all the variables within it.
• Its state may also include the values of any objects in its local operating system
environment that it affects, such as files.
• We assume that processes cannot communicate with one another in any way
except by sending messages through the network.
– So, for example, if the processes operate robot arms connected to their respective nodes in the system, then they are not allowed to communicate by shaking one another's robot hands!
155
• As each process pi executes it takes a series of actions, each of which is either a message
send or receive operation, or an operation that transforms pi ’s state – one that changes one
or more of the values in si.
• In practice, we may choose to use a high-level description of the actions, according to the
application.
– We define an event to be the occurrence of a single action that a process carries out as it
executes
– a communication action or a state-transforming action.
– The sequence of events within a single process pi can be placed in a single, total ordering, which we denote by the relation →i between the events. That is, e →i e' if and only if the event e occurs before e' at pi.
– This ordering is well defined, whether or not the process is multithreaded, since we have assumed that the process executes on a single processor.
– Now we can define the history of process pi to be the series of events that take place within it, ordered as we have described by the relation →i:
history(pi) = hi = <e_i^0, e_i^1, e_i^2, ...>
156
i) Clocks
• We’ve seen how to order the events at a process, but not how to timestamp them
– i.e., to assign to them a date and time of day.
– Each computer contains its own physical clock
– an electronic device that counts oscillations occurring in a crystal at a definite frequency, and typically divides this count and stores the result in a counter register.
• Clock devices can be programmed to generate interrupts at regular intervals in order
that, for example, time slicing can be implemented; however, we shall not concern
ourselves with this aspect of clock operation.
• The operating system reads the node's hardware clock value, Hi(t), scales it and adds an offset so as to produce a software clock
Ci(t) = αHi(t) + β
that approximately measures real, physical time t for process pi.
157
• In other words, when the real time in an absolute frame of reference is t, Ci(t) is the
reading on the software clock.
– For example, Ci(t) could be the 64-bit value of the number of nanoseconds that have
elapsed at time t since a convenient reference time.
– In general, the clock is not completely accurate, so Ci(t) will
differ from t.
– Nonetheless, if Ci behaves sufficiently well, we can use its value to timestamp any event at pi.
– Note that successive events will correspond to different timestamps only if the clock
resolution
– the period between updates of the clock value
– is smaller than the time interval between successive events.
– The rate at which events occur depends on such factors as the length of the processor
instruction cycle.
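• As a small aside (nothing specific to the slides above), Python's standard time module can show both a software clock reading and that clock's reported resolution; events closer together than the resolution would receive the same timestamp:

    import time

    t1 = time.time_ns()                     # software clock reading, in nanoseconds
    t2 = time.time_ns()
    info = time.get_clock_info("time")      # properties of the time.time() clock

    print("two successive readings:", t1, t2)
    print("reported resolution in seconds:", info.resolution)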
158
ii) Clock skew and clock drift
• Computer clocks, like any others, tend not to be in perfect agreement.
• The instantaneous difference between the readings of any two clocks is called their skew.
• clock drift – clocks count time at different rates, and so diverge.
• The underlying oscillators are subject to physical variations, with the consequence that
their frequencies of oscillation differ.
• Moreover, even the same clock’s frequency varies with temperature.
• Designs exist that attempt to compensate for this variation, but they cannot eliminate it.
• The difference in the oscillation period between two clocks might be extremely small, but
the difference accumulated over many oscillations leads to an observable difference in the
counters registered by two clocks, no matter how accurately they were initialized to the
same value.
• A clock’s drift rate is the change in the offset (difference in reading) between the clock
and a nominal perfect reference clock per unit of time measured by the reference clock.
• For ordinary clocks based on a quartz crystal this is about 10^-6 seconds/second, giving a difference of 1 second every 1,000,000 seconds, or roughly every 11.6 days.
• The drift rate of ‘high-precision’ quartz clocks is about 10^-7 or 10^-8.
159
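• A quick back-of-the-envelope check of the ordinary-quartz figure in Python:

    # Drift rate of about 1e-6 seconds per second: how long until the clock is 1 s off?
    drift_rate = 1e-6
    seconds_until_one_second_off = 1.0 / drift_rate          # 1,000,000 s
    print(seconds_until_one_second_off / 86_400)             # about 11.6 days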
iii) Coordinated Universal Time
• Computer clocks can be synchronized to external sources of highly accurate time.
– The most accurate physical clocks use atomic oscillators, whose drift rate is about one part in 10^13.
– The output of these atomic clocks is used as the standard for elapsed real time, known as
International Atomic Time.
– Since 1967, the standard second has been defined as 9,192,631,770 periods of radiation
between the two hyperfine levels of the ground state of Caesium-133 (Cs133).
• Seconds and years and other time units that we use are rooted in astronomical time.
– They were originally defined in terms of the rotation of the Earth on its axis and its
rotation about the Sun.
– However, the period of the Earth’s rotation about its axis is gradually getting longer,
primarily because of tidal friction; atmospheric effects and convection currents within the
Earth’s core also cause short-term increases and decreases in the period.
• So astronomical time and atomic time have a tendency to get out of step.
160
• Coordinated Universal Time – abbreviated as UTC (from the French equivalent) – is an
international standard for timekeeping.
– It is based on atomic time, but a so-called ‘leap second’ is inserted
– or, more rarely, deleted
– occasionally to keep it in step with astronomical time.
– UTC signals are synchronized and broadcast regularly from land-based radio stations
and satellites covering many parts of the world.
• For example, in the USA, the radio station WWV broadcasts time signals on several
shortwave frequencies.
• Satellite sources include the Global Positioning System (GPS).
– Receivers are available commercially. Compared with ‘perfect’ UTC, the signals received
from land-based stations have an accuracy on the order of 0.1–10 milliseconds, depending
on the station used.
– Signals received from GPS satellites are accurate to about 1 microsecond. Computers
with receivers attached can synchronize their clocks with these timing signals.
161
Synchronizing physical clocks
• In order to know at what time of day events occur at the processes in our distributed
system
– it is necessary to synchronize the processes’ clocks, Ci , with an authoritative, external
source of time.
– This is external synchronization.
• And if the clocks C i are synchronized with one another to a known degree of accuracy,
then we can measure the interval between two events occurring at different computers by
appealing to their local clocks, even though they are not necessarily synchronized to an
external source of time.
– This is internal synchronization.
• We define these two modes of synchronization more closely and formally as follows, over an interval of real time I:
– External synchronization: For a synchronization bound D > 0, and for a source S of UTC time, |S(t) – Ci(t)| < D, for i = 1, 2, ..., N and for all real times t in I.
• Another way of saying this is that the clocks Ci are accurate to within the bound D.
162
– Internal synchronization: For a synchronization bound D > 0, |Ci(t) – Cj(t)| < D for i, j = 1, 2, ..., N, and for all real times t in I.
• Another way of saying this is that the clocks Ci agree within the bound D.
• A clock that does not keep to whatever correctness conditions apply is defined to be
faulty.
• A clock’s crash failure is said to occur when the clock stops ticking altogether; any other
clock failure is an arbitrary failure.
– A historical example of an arbitrary failure is that of a clock with the ‘Y2K bug’, which
broke the monotonicity condition by registering the date after 31 December 1999 as 1
January 1900 instead of 2000;
– Another example is a clock whose batteries are very low and whose drift rate suddenly
becomes very large.
• Note that clocks do not have to be accurate to be correct, according to the definitions.
163
• Since the goal may be internal rather than external synchronization, the criteria for
correctness are only concerned with the proper functioning of the clock’s ‘mechanism’,
not its absolute setting.
• We now describe algorithms for external synchronization and for internal
synchronization.
i) Synchronization in a synchronous system
• The simplest possible case is internal synchronization between two processes in a
synchronous distributed system.
– Here, bounds are known for the drift rate of clocks, the maximum message transmission
delay, and the time required to execute each step of a process.
• One process sends the time t on its local clock to the other in a message m.
• In principle, the receiving process could set its clock to the time t + Ttrans , where
Ttrans is the time taken to transmit m between them.
– The two clocks would then agree (since the aim is internal synchronization, it does not
matter whether the sending process’s clock is accurate).
164
• In a synchronous system, by definition, there is also an upper bound max on the time taken to transmit any message, as well as a minimum transmission time min.
– Let the uncertainty in the message transmission time be u, so that u = (max – min).
– If the receiver sets its clock to be t + min , then the clock skew may be as much as u,
since the message may in fact have taken time max to arrive.
– Similarly, if it sets its clock to t + max , the skew may again be as large as u. If,
however, it sets its clock to the halfway point, t + (max + min)/ 2 , then the skew is at most
u / 2.
• In general, for a synchronous system, the optimum bound that can be achieved on clock
skew when synchronizing N clocks is u(1 – 1/N).
– Most distributed systems found in practice are asynchronous: the factors leading to
message delays are not bounded in their effect, and there is no upper bound max on
message transmission delays. E.g. the Internet.
• For an asynchronous system, we may say only that Ttrans = min + x, where x ≥ 0.
• The value of x is not known in a particular case, although a distribution of values may be measurable for a particular installation.
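• Returning to the synchronous case above, here is a minimal sketch of the halfway-point rule, assuming the receiver knows the sender's timestamp t and the bounds min and max (the numbers are purely illustrative):

    def receiver_clock_setting(t, t_min, t_max):
        # Setting the clock to the halfway point bounds the skew by u/2, u = max - min.
        return t + (t_max + t_min) / 2.0

    def worst_case_skew(t_min, t_max):
        return (t_max - t_min) / 2.0

    # Example: t = 100 s, min = 1 ms, max = 11 ms.
    print(receiver_clock_setting(100.0, 0.001, 0.011))   # set the clock to 100.006
    print(worst_case_skew(0.001, 0.011))                 # skew at most 0.005 s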
165
ii) Cristian’s method for synchronizing clocks
• Cristian suggested the use of a time server, connected to a device that receives signals
from a source of UTC, to synchronize computers externally.
• Upon request, the server process S supplies the time according to its clock, as shown in the figure below.
• Cristian observed that while there is no upper bound on message transmission delays in
an asynchronous system, the round-trip times for messages exchanged between pairs of
processes are often reasonably short
– a small fraction of a second.
• He describes the algorithm as probabilistic: the method achieves synchronization only if
the observed round-trip times between client and server are sufficiently short compared
with the required accuracy.
166
Figure Clock synchronization using a time server
167
• A process p requests the time in a message mr , and receives the time value t in a message
mt (t is inserted in mt at the last possible point before transmission from S’s computer).
• Process p records the total round-trip time Tround taken to send the request mr and
receive the reply mt .
• It can measure this time with reasonable accuracy if its rate of clock drift is small.
– For example, the round-trip time should be on the order of 1–10 milliseconds on a LAN, over which time a clock with a drift rate of 10^-6 seconds/second varies by at most 10^-5 milliseconds.
• A simple estimate of the time to which p should set its clock is t + Tround/2, which
assumes that the elapsed time is split equally before and after S placed t in mt.
– This is normally a reasonably accurate assumption, unless the two messages are
transmitted over different networks.
• If the value of the minimum transmission time min is known or can be conservatively
estimated, then we can determine the accuracy of this result as follows.
– The earliest point at which S could have placed the time in mt was min after p dispatched
mr .
168
– The latest point at which it could have done this was min before mt arrived at p.
– The time by S’s clock when the reply message arrives is therefore in the range [t + min, t + Tround – min].
– The width of this range is Tround – 2min, so the accuracy is ±(Tround/2 – min).
• Variability can be dealt with to some extent by making several requests to S (spacing the
requests so that transitory congestion can clear) and taking the minimum value of Tround
to give the most accurate estimate.
• The greater the accuracy required, the smaller the probability of achieving it.
– This is because the most accurate results are those in which both messages are
transmitted in a time close to min – an unlikely event in a busy network.
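• The clock-setting and accuracy calculations in Cristian’s method can be sketched as follows, assuming the client already knows t, Tround and a conservative estimate of min (the numbers are purely illustrative):

    def cristian_estimate(t, t_round):
        # Estimate of the server's clock at the moment the reply arrives.
        return t + t_round / 2.0

    def cristian_accuracy(t_round, t_min):
        # Half-width of the interval within which the server's time must lie.
        return t_round / 2.0 - t_min

    t, t_round, t_min = 1000.000, 0.008, 0.001    # seconds
    print("set clock to:", cristian_estimate(t, t_round))      # 1000.004
    print("accuracy: +/-", cristian_accuracy(t_round, t_min))  # 0.003 s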
169
iii) The Berkeley algorithm
• Gusella and Zatti describe an algorithm for internal synchronization that they developed
for collections of computers running Berkeley UNIX.
• In it, a coordinator computer is chosen to act as the master.
• Unlike in Cristian’s protocol, this computer periodically polls the other computers
whose clocks are to be synchronized, called slaves.
• The slaves send back their clock values to it.
• The master estimates their local clock times by observing the round-trip times (similarly
to Cristian’s technique), and it averages the values obtained (including its own clock’s
reading).
• The balance of probabilities is that this average cancels out the individual clocks’
tendencies to run fast or slow.
• The accuracy of the protocol depends upon a nominal maximum round-trip time
between the master and the slaves.
• The master eliminates any occasional readings associated with larger times than this
maximum.
170
• Instead of sending the updated current time back to the other computers
– which would introduce further uncertainty due to the message transmission time
– the master sends the amount by which each individual slave’s clock requires
adjustment.
– This can be a positive or negative value
• The Berkeley algorithm eliminates readings from faulty clocks.
– Such clocks could have a significant adverse effect if an ordinary average was taken so
instead the master takes a fault-tolerant average.
– That is, a subset is chosen of clocks that do not differ from one another by more than a
specified amount, and the average is taken of readings from only these clocks.
• Gusella and Zatti describe an experiment involving 15 computers whose clocks were
synchronized to within about 20–25 milliseconds using their protocol.
• The local clocks’ drift rates were measured to be less than 2×10^-5, and the maximum round-trip time was taken to be 10 milliseconds.
• Should the master fail, then another can be elected to take over and function exactly as its predecessor.
171
– Note that election algorithms are not guaranteed to elect a new master in bounded time, so the difference between two clocks would be unbounded if they were used.
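• The master’s averaging step can be sketched as below; choosing the agreeing subset by distance from the median is just one simple way to realize the fault-tolerant average, and the threshold value is an arbitrary illustration:

    def fault_tolerant_average(readings, threshold):
        # Keep only readings within `threshold` of the median, so a few faulty
        # clocks cannot skew the result, then average the survivors.
        ordered = sorted(readings)
        median = ordered[len(ordered) // 2]
        good = [r for r in readings if abs(r - median) <= threshold]
        return sum(good) / len(good)

    def berkeley_adjustments(estimated_clocks, threshold=0.030):
        # estimated_clocks: {machine: estimated clock reading, in seconds}.
        # Returns the signed correction each machine should apply to its clock.
        avg = fault_tolerant_average(list(estimated_clocks.values()), threshold)
        return {machine: avg - reading for machine, reading in estimated_clocks.items()}

    clocks = {"master": 100.000, "s1": 100.010, "s2": 99.995, "s3": 103.200}  # s3 faulty
    print(berkeley_adjustments(clocks))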
iv) The Network Time Protocol (NTP)
• defines an architecture for a time service and a protocol to distribute time information
over the Internet.
• NTP’s chief design aims and features are as follows:
– To provide a service enabling clients across the Internet to be synchronized accurately
to UTC: Although large and variable message delays are encountered in Internet
communication, NTP employs statistical techniques for the filtering of timing data and it
discriminates between the quality of timing data from different servers.
– To provide a reliable service that can survive lengthy losses of connectivity: There are
redundant servers and redundant paths between the servers. The servers can reconfigure
so as to continue to provide the service if one of them becomes unreachable.
– To enable clients to resynchronize sufficiently frequently to offset the rates of drift
found in most computers: The service is designed to scale to large numbers of clients and
servers.
172
– To provide protection against interference with the time service, whether malicious or
accidental: The time service uses authentication techniques to check that timing data
originate from the claimed trusted sources. It also validates the return addresses of
messages sent to it.
• The NTP service is provided by a network of servers located across the Internet.
– Primary servers are connected directly to a time source such as a radio clock receiving
UTC;
– Secondary servers are synchronized, ultimately, with primary servers.
– The servers are connected in a logical hierarchy called a synchronization subnet (see
Figure 14.3), whose levels are called strata.
– Primary servers occupy stratum 1: they are at the root.
– Stratum 2 servers are secondary servers that are synchronized directly with the primary
servers;
– Stratum 3 servers are synchronized with stratum 2 servers, and so on.
– The lowest-level (leaf) servers execute in users’ workstations.
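• To make the client side concrete, here is a rough SNTP-style sketch in Python (a bare-bones query, not a full NTP implementation: real NTP uses all four timestamps in the packet to estimate offset and delay, and applies its statistical filtering on top). It needs network access to a reachable NTP server, for example one from the public pool.ntp.org:

    import socket
    import struct
    import time

    NTP_EPOCH_OFFSET = 2_208_988_800      # seconds between 1900 (NTP) and 1970 (Unix)

    def sntp_time(server="pool.ntp.org", port=123, timeout=2.0):
        packet = b"\x1b" + 47 * b"\0"     # LI = 0, version 3, mode 3 (client request)
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(packet, (server, port))
            data, _ = s.recvfrom(512)
        transmit_secs = struct.unpack("!I", data[40:44])[0]   # server's transmit timestamp
        return transmit_secs - NTP_EPOCH_OFFSET               # as a Unix timestamp

    print(time.ctime(sntp_time()))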
173
174
175