Distributed operating systems have many aspects in common with centralized ones, but
they also differ in certain ways. This paper is intended as an introduction to distributed
operating systems, and especially to current university research about them. After a
discussion of what constitutes a distributed operating system and how it is distinguished
from a computer network, various key design issues are discussed. Then several examples
of current research projects are examined in some detail, namely, the Cambridge
Distributed Computing System, Amoeba, V, and Eden.
subsequent writes to that file go to the file server where the file was created. With mailboxes, arranging for this is straightforward. The client simply addresses the WRITE messages to the same mailbox that the CREATE message was sent to. Since each file server has its own mailbox, there is no ambiguity.

When RPC is used, the situation is more complicated, since all the client does is put a procedure call such as

write(FileDescriptor, BufferAddress, ByteCount);

in his program. RPC intentionally hides all the details of locating servers from the client, but sometimes, as in this example, the details are important.

In some applications, broadcasting and multicasting (sending to a set of destinations, rather than just one) is useful. For example, when trying to locate a certain person, process, or service, sometimes the only approach is to broadcast an inquiry message and wait for the replies to come back. RPC does not lend itself well to sending messages to sets of processes and getting answers back from some or all of them. The semantics are completely different.

Despite all these disadvantages, RPC remains an interesting form of communication, and much current research is being addressed toward improving it and solving the various problems discussed above.

2.1.3 Error Handling

Error handling in distributed systems is radically different from that of centralized systems. In a centralized system, a system crash means that the client, server, and communication channel are all completely destroyed, and no attempt is made to revive them. In a distributed system, matters are more complex. If a client has initiated a remote procedure call with a server that has crashed, the client may just be left hanging forever unless a time-out is built in. However, such a time-out introduces race conditions in the form of clients that time out too quickly, thinking that the server is down, when in fact, it is merely very slow.

Client crashes can also cause trouble for servers. Consider, for example, the case of processes A and B communicating via the UNIX pipe model A | B with A the server and B the client. B asks A for data and gets a reply, but unless that reply is acknowledged somehow, A does not know when it can safely discard data that it may not be able to reproduce. If B crashes, how long should A hold onto the data? (Hint: If the answer is less than infinity, problems will be introduced whenever B is slow in sending an acknowledgment.)

Closely related to this is the problem of what happens if a client cannot tell whether or not a server has crashed. Simply waiting until the server is rebooted and trying again sometimes works and sometimes does not. This is a case in which it works: Client asks to read block 7 of some file. This is a case in which it does not work: Client says transfer a million dollars from one bank account to another. In the former case, it does not matter whether or not the server carried out the request before crashing; carrying it out a second time does no harm. In the latter case, one would definitely prefer the call to be carried out exactly once, no more and no less. Calls that may be repeated without harm (like the first example) are said to be idempotent. Unfortunately, it is not always possible to arrange for all calls to have this property. Any call that causes action to occur in the outside world, such as transferring money, printing lines, or opening a valve in an automated chocolate factory just long enough to fill exactly one vat, is likely to cause trouble if performed twice.

Spector [1982] and Nelson [1981] have looked at the problem of trying to make sure that remote procedure calls are executed exactly once, and they have developed taxonomies for classifying the semantics of different systems. These vary from systems that offer no guarantee at all (zero or more executions), to those that guarantee at most one execution (zero or one), to those that guarantee at least one execution (one or more).
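To make the distinction concrete, the toy C program below (our illustration, not code from any of the systems discussed) simulates a non-idempotent call, a money transfer, over a transport that loses replies. The client's retry loop alone gives at-least-once semantics; the sequence-number check in the server turns this into at-most-once.

#include <stdio.h>
#include <stdint.h>

struct request { uint32_t client; uint32_t seq; int32_t amount; };

static uint32_t last_seq_seen = 0;   /* per-client table in a real server */
static int32_t  balance = 0;

/* Server: the seq check filters duplicates, giving at-most-once
   execution of the non-idempotent action. */
static void server_execute(const struct request *r)
{
    if (r->seq == last_seq_seen)
        return;                       /* retransmission: ignore */
    last_seq_seen = r->seq;
    balance += r->amount;             /* the action in the outside world */
}

/* Unreliable transport: the request always arrives, but the first
   reply is "lost", so the client times out and retransmits. */
static int send_receive(const struct request *r)
{
    static int reply_arrives = 1;
    server_execute(r);
    reply_arrives = !reply_arrives;
    return reply_arrives;
}

int main(void)
{
    struct request r = { 1, 42, 1000000 };
    int tries = 0;
    while (!send_receive(&r) && ++tries < 5)
        ;                             /* time-out: retry with the same seq */
    printf("balance = %d after %d retries\n", balance, tries);
    return 0;
}

Without the sequence-number check, the retransmission would transfer the money a second time, which is precisely the failure described above.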
Getting it right (exactly one) is probably impossible, because even if the remote execution can be reduced to one instruction
Figure 5. Distributing the name server.
global naming tree, with files and other objects having names of the form: /country/city/network/pathname. When such a name is presented to any name server, it can immediately route the request to some name server in the designated country, which then sends it to a name server in the designated city, and so on until it reaches the name server in the network where the object is located, where the mapping can be done. Telephone numbers use such a hierarchy, composed of country code, area code, exchange code (first three digits of telephone number in North America), and subscriber line number.

Having multiple name servers does not necessarily require having a single, global naming hierarchy. Another way to organize the name servers is to have each one effectively maintain a table of, for example, (ASCII string, pointer) pairs, where the pointer is really a kind of capability for any object or domain in the system. When a name, say a/b/c, is looked up by the local name server, it may well yield a pointer to another domain (name server), to which the rest of the name, b/c, is sent for further processing (see Figure 5). This facility can be used to provide links (in the UNIX sense) to files or objects whose precise whereabouts is managed by a remote name server. Thus if a file foobar is located in another local network, n, with name server n.s, one can make an entry in the local name server's table for the pair (x, n.s) and then access x/foobar as though it were a local object. Any appropriately authorized user or process knowing the name x/foobar could make its own synonym s and then perform accesses using s/x/foobar. Each name server parsing a name that involves multiple name servers just strips off the first component and passes the rest of the name to the name server found by looking up the first component locally.

A more extreme way of distributing the name server is to have each machine manage its own names. To look up a name, one broadcasts it on the network. At each machine, the incoming request is passed to the local name server, which replies only if it finds a match. Although broadcasting is easiest over a local network such as a ring net or CSMA net (e.g., Ethernet), it is also possible over store-and-forward packet switching networks such as the ARPANET [Dalal 1977].

Although the normal use of a name server is to map an ASCII string onto a binary number used internally to the system, such as a process identifier or machine number, once in a while the inverse mapping is also useful. For example, if a machine crashes, upon rebooting it could present its (hardwired) node number to the name server to ask what it was doing before the crash, that is, ask for the ASCII string corresponding to the service that it is supposed to be offering so that it can figure out what program to reboot.
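A minimal sketch of this component-stripping rule in C (the table contents and the printed output are illustrative only; a real name server would forward a message rather than print one):

#include <stdio.h>
#include <string.h>

struct entry { const char *component; const char *server; };

/* Local table: "x" is a synonym for the remote name server n.s. */
static struct entry table[] = { { "x", "n.s" }, { 0, 0 } };

static void lookup(const char *name)
{
    char first[64];
    const char *rest = strchr(name, '/');
    size_t len = rest ? (size_t)(rest - name) : strlen(name);

    memcpy(first, name, len);         /* strip off the first component */
    first[len] = '\0';

    for (struct entry *e = table; e->component; e++)
        if (strcmp(e->component, first) == 0) {
            if (rest)
                printf("pass \"%s\" on to server %s\n", rest + 1, e->server);
            else
                printf("\"%s\" resolves locally\n", first);
            return;
        }
    printf("\"%s\": no local entry\n", first);
}

int main(void)
{
    lookup("x/foobar");   /* forwarded: n.s resolves foobar in network n */
    return 0;
}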
2.3 Resource Management

Resource management in a distributed system differs from that in a centralized system in a fundamental way. Centralized
people making real systems who care less about optimality than about devising algorithms that can actually be used. Let us now briefly look at each of these approaches.

Graph-Theoretic Models. If the system consists of a fixed number of processes, each with known CPU and memory requirements, and a known matrix giving the average amount of traffic between each pair of processes, scheduling can be attacked as a graph-theoretic problem. The system can be represented as a graph, with each process a node and each pair of communicating processes connected by an arc labeled with the data rate between them.

The problem of allocating all the processes to k processors then reduces to the problem of partitioning the graph into k disjoint subgraphs, such that each subgraph meets certain constraints (e.g., total CPU and memory requirements below some limit). Arcs that are entirely within one subgraph represent internal communication within a single processor (= fast), whereas arcs that cut across subgraph boundaries represent communication between two processors (= slow). The idea is to find a partitioning of the graph that meets the constraints and minimizes the network traffic, or some variation of this idea. Figure 7a depicts a graph of interacting processes with one possible partitioning of the processes between two machines. Figure 7b shows a better partitioning, with less intermachine traffic, assuming that all the arcs are equally weighted. Many papers have been written on this subject, for example, Chow and Abraham [1982], Chow and Kohler [1979], Stone [1977, 1978], Stone and Bokhari [1978], and Lo [1984]. The results are somewhat academic, since in real systems virtually none of the assumptions (fixed number of processes with static requirements, known traffic matrix, error-free processors and communication) are ever met.

Heuristic Load Balancing. When the goal of the scheduling algorithm is dynamic, heuristic load balancing, rather than finding related clusters, a different approach is taken. Here the idea is for each processor to estimate its own load continually, for processors to exchange load information, and for process creation and migration to utilize this information.

Various methods of load estimation are possible. One way is just to measure the number of runnable processes on each CPU periodically and take the average of the last n measurements as the load. Another way [Bryant and Finkel 1981] is to estimate the residual running times of all the processes and define the load on a processor as the number of CPU seconds that all its processes will need to finish. The residual time can be estimated most simply by assuming it is equal to the CPU time already consumed. Bryant and Finkel also discuss other estimation techniques in which both the number of processes and length of remaining time are important. When round-robin scheduling is used, it is better to be competing against one process that needs 100 seconds than against 100 processes that each need 1 second.
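The first of these estimators is simple enough to state in a few lines of C. This is a generic sketch (the names and the window size N are ours, not from any of the cited papers):

#define N 8                       /* window size: last N samples */

static int samples[N];
static int next_slot = 0;

void record_sample(int runnable)  /* call once per sampling period */
{
    samples[next_slot] = runnable;
    next_slot = (next_slot + 1) % N;
}

double current_load(void)         /* load = mean of the last N samples */
{
    int i, sum = 0;
    for (i = 0; i < N; i++)
        sum += samples[i];
    return (double)sum / N;
}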
Once each processor has computed its load, a way is needed for each processor to find out how everyone else is doing. One way is for each processor to just broadcast its load periodically. After receiving a broadcast from a lightly loaded machine, a processor should shed some of its load by giving it to the lightly loaded processor. This algorithm has several problems. First, it requires a broadcast facility, which may not be available. Second, it consumes
considerable bandwidth for all the "here is my load" messages. Third, there is a great danger that many processors will try to shed load to the same (previously) lightly loaded processor at once.

A different strategy [Barak and Shiloh 1985; Smith 1979] is for each processor periodically to pick another processor (possibly a neighbor, possibly at random) and exchange load information with it. After the exchange, the more heavily loaded processor can send processes to the other one until they are equally loaded. In this model, if 100 processes are suddenly created in an otherwise empty system, after one exchange we will have two machines with 50 processes, and after two exchanges most probably four machines with 25 processes. Processes diffuse around the network like a cloud of gas.

Actually migrating running processes is trivial in theory, but close to impossible in practice. The hard part is not moving the code, data, and registers, but moving the environment, such as the current position within all the open files, the current values of any running timers, pointers or file descriptors for communicating with tape drives or other I/O devices, etc. All of these problems relate to moving variables and data structures related to the process that are scattered about inside the operating system. What is feasible in practice is to use the load information to create new processes on lightly loaded machines, instead of trying to move running processes.

If one has adopted the idea of creating new processes only on lightly loaded machines, another approach, called bidding, is possible [Farber and Larson 1972; Stankovic and Sidhu 1984]. When a process wants some work done, it broadcasts a request for bids, telling what it needs (e.g., a 68000 CPU, 512K memory, floating point, and a tape drive).

Other processors can then bid for the work, telling what their workload is, how much memory they have available, etc. The process making the request then chooses the most suitable machine and creates the process there. If multiple request-for-bid messages are outstanding at the same time, a processor accepting a bid may discover that the workload on the bidding machine is not what it expected because that processor has bid for and won other work in the meantime.

2.3.4 Distributed Deadlock Detection

Some theoretical work has been done in the area of detection of deadlocks in distributed systems. How applicable this work may be in practice remains to be seen. Two kinds of potential deadlocks are resource deadlocks and communication deadlocks. Resource deadlocks are traditional deadlocks, in which all of some set of processes are blocked waiting for resources held by other blocked processes. For example, if A holds X and B holds Y, and A wants Y and B wants X, a deadlock will result.

In principle, this problem is the same in centralized and distributed systems, but it is harder to detect in the latter because there are no centralized tables giving the status of all resources. The problem has mostly been studied in the context of database systems [Gligor and Shattuck 1980; Isloor and Marsland 1978; Menasce and Muntz 1979; Obermarck 1982].

The other kind of deadlock that can occur in a distributed system is a communication deadlock. Suppose A is waiting for a message from B and B is waiting for C and C is waiting for A. Then we have a deadlock. Chandy et al. [1983] present an algorithm for detecting (but not preventing) communication deadlocks. Very crudely summarized, they assume that each process that is blocked waiting for a message knows which process or processes might send the message. When a process logically blocks, they assume that it does not really block but instead sends a query message to each of the processes that might send it a real (data) message. If one of these processes is blocked, it sends query messages to the processes it is waiting for. If certain messages eventually come back to the original process, it can conclude that a deadlock exists. In effect, the algorithm is looking for a knot in a directed graph.
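The following self-contained C sketch captures the flavor of such probe-based detection, with the "network" reduced to an in-memory wait-for table. It is our condensation, not the algorithm of Chandy et al. in detail: a blocked process's query is forwarded along the edges, and deadlock is reported if the query comes back to its originator.

#include <stdio.h>
#include <stdbool.h>

#define NPROC 4

/* waits_for[i][j] is true if blocked process i might get its message
   from process j. Here: A waits for B, B for C, C for A. */
static bool waits_for[NPROC][NPROC] = {
    /* A */ { false, true,  false, false },
    /* B */ { false, false, true,  false },
    /* C */ { true,  false, false, false },
    /* D */ { false, false, false, false },
};
static bool blocked[NPROC] = { true, true, true, false };

static bool query(int origin, int current, bool visited[])
{
    if (visited[current]) return false;
    visited[current] = true;
    for (int next = 0; next < NPROC; next++) {
        if (!waits_for[current][next]) continue;
        if (next == origin) return true;        /* query came back: deadlock */
        if (blocked[next] && query(origin, next, visited))
            return true;                        /* blocked processes forward it */
    }
    return false;
}

int main(void)
{
    bool visited[NPROC] = { false };
    printf("deadlock involving process A: %s\n",
           query(0, 0, visited) ? "yes" : "no");
    return 0;
}

In this example the query from A traverses B and C and returns to A, so a deadlock is reported.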
2.4 Fault Tolerance

Proponents of distributed systems often claim that such systems can be more reliable than centralized systems. Actually,
Figure 10. The dispatcher task waits for requests and passes them on to the worker tasks. (In the figure, a message arrives at the dispatcher, which passes the request on to a worker; the tasks share data.)
also be started, and possibly completed, before the reply for the first request comes in. In this way, the server is never idle if there is any work to be done.

Although this organization makes better use of the server's CPU, it makes the software much more complicated. Instead of doing nicely nested remote procedure calls to other machines whose services it needs, the server is back to using separate SEND and RECEIVE primitives, which are less structured.

One way of achieving both good performance and clean structure is to program the server as a collection of miniprocesses, which we call a cluster of tasks. Tasks share the same code and global data, but each task has its own stack for local variables and registers and, most important, its own program counter. In other words, each task has its own thread of control. Multiprogramming of the tasks can be done either by the operating system kernel or by a run-time library within each process.

There are two ways of organizing the tasks. The first way is to assign one task the job of "dispatcher," as shown in Figure 10. The dispatcher is the only task that accepts new requests for work. After inspecting an incoming request, it determines if the work can be done without blocking (e.g., if a block to be read is present in the cache). If it can, the dispatcher just carries out the work and sends the reply. If the work requires blocking, the dispatcher passes the work to some other task in the cluster, which can start work on it. When that task blocks, task switching occurs, and the dispatcher or some other previously blocked task can now run. Thus waiting for a remote procedure call to finish only blocks one task, not the whole server.

The other way of organizing the server is to have each task capable of accepting new requests for work. When a message arrives, the kernel gives it at random to one of the tasks listening to the address or port to which the message was addressed. That task carries the work out by itself, and no dispatcher is needed.

Both of these schemes require some method of locking the shared data to prevent races. This locking can be achieved explicitly by some kind of LOCK and UNLOCK primitives, or implicitly by having the scheduler not stop any task while it is running. For example, task switching only occurs when a task blocks. With ordinary user programs, such a strategy is undesirable, but with a server whose behavior is well understood, it is not unreasonable.
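The dispatcher organization can be summarized in a few lines of C. All the primitives here (receive_request, in_cache, and so on) are hypothetical stand-ins for a system's real IPC and cache calls, not any particular server's interface:

struct req { int block_no; };

extern struct req *receive_request(void);
extern int  in_cache(int block_no);
extern void reply_from_cache(struct req *r);
extern void hand_to_worker(struct req *r);   /* worker may block on disk I/O */

void dispatcher(void)
{
    for (;;) {
        struct req *r = receive_request();
        if (in_cache(r->block_no))
            reply_from_cache(r);    /* no blocking needed: do it directly */
        else
            hand_to_worker(r);      /* blocking work: keep dispatcher free */
    }
}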
2.5.2 File Service

There is little doubt that the most important service in any distributed system is the file service. Many file services and file servers have been designed and implemented, so a certain amount of experience is available (e.g., Birrell and Needham [1980], Dellar [1982], Dion [1980], Fridrich and Older [1981], Fridrich and Older [1984], Mitchell and Dion [1982], Mullender and Tanenbaum [1985], Reed and Svobodova [1981], Satyanarayanan et al. [1985], Schroeder et
END TRANSACTION, and that a crash or abort midway leaves the file in its original state. Robust files are written on stable storage and contain sufficient redundancy to survive disk crashes (generally two disks are used). Finally, multiversion files consist of a sequence of versions, each of which is immutable. Changes are made to a file by creating a new version. Different file servers support various combinations of these.

All robust file servers need some mechanism for handling concurrent updates to a file or group of files. Many of them allow users to lock a file, page, or record to prevent conflicting writes. Locking introduces the problem of deadlocks, which can be dealt with by using two-phase locking [Eswaran et al. 1976] or timestamps [Reed 1983].

When the file system consists of multiple servers working in parallel, it becomes possible to enhance reliability by replicating some or all files over multiple servers. Reading also becomes easier because the workload can now be split over two servers, but writing is much harder because multiple copies must be updated simultaneously, or this effect simulated somehow.

One approach is to distribute the data but keep some of the control information (semi-) centralized. In LOCUS [Popek et al. 1981; Walker et al. 1983], for example, files can be replicated at many sites, but when a file is opened, the file server at one site examines the OPEN request, the number and status of the file's copies, and the state of the network. It then chooses one site to carry out the OPEN and the subsequent READs and WRITEs. The other sites are brought up to date later.

2.5.3 Print Service

Compared with file service, on which a great deal of time and energy has been expended by a large number of people, the other services seem rather meager. Still, it is worth saying at least a little bit about a few of the more interesting ones.

Nearly all distributed systems have some kind of print service to which clients can send files, file names, or capabilities for files with instructions to print them on one of the available printers, possibly with some text justification or other formatting beforehand. In some cases the whole file is sent to the print server in advance, and the server must buffer it. In other cases only the file name or capability is sent, and the print server reads the file block by block as needed. The latter strategy eliminates the need for buffering (read: a disk) on the server side but can cause problems if the file is modified after the print command is given but prior to the actual printing. Users generally prefer "call-by-value" rather than "call-by-reference" semantics for printers.

One way to achieve the "call-by-value" semantics is to have a printer spooler server. To print a file, the client process sends the file to the spooler. When the file has been copied to the spooler's directory, an acknowledgment is sent back to the client.

The actual print server is then implemented as a print client. Whenever the print client has nothing to print, it requests another file or block of a file from the print spooler, prints it, and then requests the next one. In this way the print spooler is a server to both the client and the printing device.

Printer service is discussed by Janson et al. [1983] and Needham and Herbert [1982].

2.5.4 Process Service

Every distributed operating system needs some mechanism for creating new processes. At the lowest level, deep inside the system kernel, there must be a way of creating a new process from scratch. One way is to have a FORK call, as UNIX does, but other approaches are also possible. For example, in Amoeba, it is possible to ask the kernel to allocate chunks of memory of given sizes. The caller can then read and write these chunks, loading them with the text, data, and stack segments for a new process. Finally, the caller can give the filled-in segments back to the kernel and ask for a new process built up from these pieces. This scheme allows processes to be created remotely or locally, as desired.
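As a hypothetical rendering of this sequence (the names below are invented for illustration and are not Amoeba's actual kernel interface):

/* Invented names; the real Amoeba calls differ. */
extern void *alloc_segment(int bytes);                 /* ask kernel for memory */
extern void  load_image(void *seg, const char *binary, const char *part);
extern int   make_process(void *text, void *data, void *stack, int machine);

int start_process(const char *binary, int machine)
{
    void *text  = alloc_segment(65536);
    void *data  = alloc_segment(65536);
    void *stack = alloc_segment(16384);

    load_image(text,  binary, "text");    /* fill in the segments */
    load_image(data,  binary, "data");
    load_image(stack, binary, "stack");

    /* hand the filled-in segments back to the kernel; the process can
       be built on this machine or a remote one */
    return make_process(text, data, stack, machine);
}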
At a higher level it is frequently useful to have a process server that one can ask
variable queuing delays. If there are multiple time servers, they may get out of synchronization because their crystals run at slightly different rates. Einstein's special theory of relativity also puts constraints on synchronizing remote clocks.

The result of all these problems is that having a single global time is impossible. Distributed algorithms that depend on being able to find a unique global ordering of widely separated events may not work as expected. A number of researchers have tried to find solutions to the various problems caused by the lack of global time (see, e.g., Jefferson [1985], Lamport [1978, 1984], Marzullo and Owicki [1985], Reed [1983], and Reif and Spirakis [1984]).

2.5.8 Boot Service

The boot service has two functions: bringing up the system from scratch when the power is turned on and helping important services survive crashes. In both cases, it is helpful if the boot server has a hardware mechanism for forcing a recalcitrant machine to jump to a program in its own read-only memory (ROM) in order to reset it. The ROM program could simply sit in a loop waiting for a message from the boot service. The message would then be loaded into that machine's memory and executed as a program.

The second function alluded to above is the "immortality service." An important service could register with the boot service, which would then poll it periodically to see if it were still functioning. If not, the boot service could initiate measures to patch things up, for example, forcibly reboot it or allocate another processor to take over its work. To provide high reliability, the boot service should itself consist of multiple processors, each of which keeps checking that the others are still working properly.
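A sketch of the polling loop such an immortality service might run (all names are hypothetical):

#define NSERVICES 16

extern int  registered[NSERVICES];            /* pids of registered services */
extern int  ping(int pid, int timeout_ms);    /* nonzero if the service answers */
extern void restart_elsewhere(int slot);      /* reboot, or allocate another CPU */

void boot_server_tick(void)
{
    for (int i = 0; i < NSERVICES; i++) {
        if (registered[i] == 0)
            continue;
        if (!ping(registered[i], 1000))
            restart_elsewhere(i);   /* patch things up as described above */
    }
}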
2.5.9 Gateway Service

If the distributed system in question needs to communicate with other systems at remote sites, it may need a gateway server to convert messages and protocols from internal format to those demanded by the wide-area network carrier.

3. EXAMPLES OF DISTRIBUTED OPERATING SYSTEMS

Having disposed with the principles, it is now time to look at some actual distributed systems that have been constructed as research projects in universities around the world. Although many such projects are in various stages of development, space limitations prevent us from describing all of them in detail. Instead of saying a few words about each system, we have chosen to look at four systems that we consider representative. Our selection criteria were as follows. First, we only chose systems that were designed from scratch as distributed systems (systems that gradually evolved by connecting together existing centralized systems or are multiprocessor versions of UNIX were excluded). Second, we only chose systems that have actually been implemented; paper designs did not count. Third, we only chose systems about which a reasonable amount of information was available.

Even with these criteria, there were many more systems that could have been discussed. As an aid to the reader interested in pursuing this subject further, we provide here some references to other relevant work: Accent [Fitzgerald and Rashid 1985; Rashid and Robertson 1981], Argus [Liskov 1982, 1984; Liskov and Scheifler 1982; Oki et al. 1985], Chorus [Zimmermann et al. 1981], CRYSTAL [DeWitt et al. 1984], DEMOS [Powell and Miller 1983], Distributed UNIX [Luderer et al. 1981], HXDP [Jensen 1978], LOCUS [Popek et al. 1981; Walker et al. 1983; Weinstein et al. 1985], Meglos [Gaglianello and Katseff 1985], MICROS [Curtis and Wittie 1984; Mohan and Wittie 1985; Wittie and Curtis 1985; Wittie and van Tilborg 1980], RIG [Ball et al. 1976], Roscoe/Arachne [Finkel et al. 1979; Solomon and Finkel 1978, 1979], and the work at Xerox Palo Alto Research Center [Birrell 1985; Birrell and Nelson 1984; Birrell et al. 1984; Boggs et al. 1980; Brown et al. 1985; Swinehart et al. 1979].
Figure 12. The filing machine is positioned between the users and the file server. It maintains a block cache and handles ASCII names. (In the figure, the processor bank machines talk to the filing machine, which holds the block cache and ASCII names; the file server holds regular files, special files, and index files.)
powered up, the same procedure is carried out automatically.

Another area in which some effort has been put to make the system fault tolerant is the file system, which supports atomic updates on special files. This facility is described in the next section.

3.1.5 Services

We have already described several key servers, including the name server, resource manager, ancilla, and active name server. Other small servers include the time server, print server, login server, terminal server, and error server, which records system errors for maintenance purposes. The file server is examined here.

The file system started out with the idea of a single universal file server that provided basic storage service but a very primitive naming and protection system, coupled with single-user TRIPOS operating systems in the processor bank machines, in which the naming and directory management would be done. The CAP computer (a large research machine within the Cambridge Computing Laboratory that does not have any disks of its own) also uses the file server. After some experience with this model, it was decided to create a new server, known as the filing machine, as a front end to the file system to improve the performance (mostly by providing the filing machine with a large cache, something that the small user machines could not afford). The CAP machine, which has adequate memory, continues to use the file server directly. The position of the filing machine is shown in Figure 12.

The universal file server supports one basic file type, with two minor variations. The basic file type is an unstructured file consisting of a sequence of 16-bit words, numbered from 0 to some maximum. Operations are provided for reading or writing arbitrary numbers of words, starting anywhere in the file. Each file is uniquely identified by a 64-bit PUID (Permanent User IDentifier) consisting of a 32-bit disk address and a 32-bit random number.

The first variation is the special file, which has the property that writes to it are atomic, that is, they will either succeed completely or not be done at all. They will never be partly completed, even in the face of server crashes.

The second variation is a file called an index, which is a special file consisting of a sequence of slots, each holding one PUID. When a file is created, the process creating it must specify an index and slot in that index into which the new file's PUID is stored. Since indexes are also files and as such have PUIDs themselves, an index may contain pointers (PUIDs) to other indices, allowing arbitrary directory trees and graphs to be built. One index is distinguished as being the root index, which has the property that the file server's internal garbage collector will never remove a file reachable from the root index.
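Written out as C structures (field names are ours; only the widths and the slot idea come from the description above):

#include <stdint.h>

struct puid {                 /* 64-bit Permanent User IDentifier */
    uint32_t disk_address;    /* where the file lives */
    uint32_t random;          /* makes a PUID hard to guess or forge */
};

#define SLOTS_PER_INDEX 256   /* illustrative size, not from the paper */

struct index_file {           /* a special file: a sequence of PUID slots */
    struct puid slot[SLOTS_PER_INDEX];
};

/* Creating a file requires naming an index and a slot for its PUID
   (hypothetical declaration): */
extern struct puid create_file(struct puid index, int slot);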
In the initial implementation, the full code of the TRIPOS operating system was loaded into each pool processor. All of the directory management and handling of ASCII names was done on the processor bank machines. Unfortunately, this scheme had several problems. First, TRIPOS was rather large and filled up so much memory that little room was left for buffers, meaning that almost every read or write request actually caused a disk access (the universal file server has hardly any buffers). Second, looking up a name in the directory hierarchy required all the intermediate directories
but it works much of the time and has zero overhead under normal conditions.

3.2.5 Services

Amoeba has several kinds of block, file, and directory services. The simplest one is a server running on top of the Amoeba kernel that provides a file service functionally equivalent to the UNIX system call interface, to allow most UNIX programs to run on Amoeba with only the need to relink them with a special library.

A more interesting server, however, is FUSS (Free University Storage System), which views each file as a sequence of versions. A process can acquire a capability for a private copy of a new version, modify it, and then commit it in a single indivisible atomic action. Providing atomic commits at the file level (rather than only as a facility in some database systems) simplifies the construction of various servers, such as the bank server, that have to be highly robust. FUSS also supports multiple, simultaneous access using optimistic concurrency control. It is described in greater detail by Mullender and Tanenbaum [1985].
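The usage pattern this implies might look as follows in C; the function names are invented for illustration and are not the actual FUSS interface:

typedef struct { long bits[2]; } capability;   /* opaque to the client */

extern capability new_version(capability file);     /* private copy */
extern void       write_version(capability v, long off,
                                const void *buf, int n);
extern int        commit(capability file, capability v);  /* atomic;
                                                              fails on conflict */

void update(capability file, const void *buf, int n)
{
    capability v = new_version(file);
    write_version(v, 0, buf, n);
    while (!commit(file, v)) {        /* optimistic concurrency control:  */
        v = new_version(file);        /* on conflict, redo the change     */
        write_version(v, 0, buf, n);  /* against the newest version and   */
    }                                 /* try to commit again              */
}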
Other key services are the directory service, bank service, and boot service, all of which have already been discussed.

3.2.6 Implementation

The Amoeba kernel has been ported to five different CPUs: 68010, NS32016, 8088, VAX, and PDP-11. All the servers described above, except the boot server, have been written and tested, along with a number of others. Measurements have shown that a remote procedure call from user space on one 68010 to user space on a different 68010 takes just over 8 milliseconds (plus the time to actually carry out the service requested). The data rate between user processes on different machines has been clocked at over 250,000 bytes per second, which is about 20 percent of the raw network bandwidth, an exceptionally high value.

A library has been written to allow UNIX programs to run on Amoeba. A substantial number of utilities, including compilers, editors, and shells, are operational. A server has also been implemented on UNIX to allow Amoeba programs to put capabilities for UNIX files into their directories and use them without having to know that the files are actually located on a VAX running UNIX.

In addition to the UNIX emulation work, various applications have been implemented using pure Amoeba, including parallel traveling salesman and parallel alpha-beta search [Bal et al. 1985]. Current research includes connecting Amoeba systems at five locations in three countries using wide-area networks.

3.3 The V Kernel

The V kernel is a research project on distributed systems at Stanford University under the direction of David Cheriton [Cheriton 1984a; Cheriton and Mann 1984; Cheriton and Zwaenepoel 1984a, 1984b]. It was motivated by the increasing availability of powerful microcomputer-based workstations, which can be seen as an alternative to traditional time-shared minicomputers. The V kernel is an outgrowth of the experience acquired with earlier systems, Thoth [Cheriton 1982; Cheriton et al. 1979] and Verex.

The V kernel can be thought of as a software back plane, analogous to the Multibus or S-100 bus back planes. The function of a back plane is to provide an infrastructure for components (for hardware, boards; for software, processes) to communicate, and nothing else. Consequently, most of the facilities found in traditional operating systems, such as a file system, resource management, and protection, are provided in V by servers outside the kernel. In this respect V and Amoeba are conceptually very similar.

Another point on which V and Amoeba agree is the free-market model of services. Services such as the file system are, in principle, just ordinary user processes. Any user who is dissatisfied with the standard file system [Stonebraker 1981; Tanenbaum and Mullender 1982] is free to write his or her own.
Figure 15. A V system: workstations, file servers, and a print server connected by a network.
This view is in contrast to the "centrally planned economy" model of most time-sharing systems, which present the file system on a "like it or lump it" basis.

The V system consists of a collection of workstations (currently SUNs), each running an identical copy of the V kernel. The kernel consists of three components: the interprocess communication handler, the kernel server (for providing basic services, such as memory management), and the device server (for providing uniform access to I/O devices). Some of the workstations support an interactive user, whereas others function as file servers, print servers, and other kinds of servers, as shown in Figure 15. Unlike Amoeba, V does not have a processor pool.

3.3.1 Communication Primitives

The V communication primitives have been designed in accordance with the back-plane model mentioned above. They provide basic, but fast communication. To access a server, a client does SEND(message, pid), which transmits the fixed-length (32-byte) "message" to the server, and then blocks until the server has sent back a reply, which overwrites "message." The second parameter, "pid," is a 32-bit integer that uniquely identifies the destination process. A message may contain a kind of pseudopointer to one of the client's memory segments. This pseudopointer can be used to permit the server to read from or write to the client's memory. Such reads and writes are handled by kernel primitives COPYFROM and COPYTO. As an optimization, when a client does a SEND containing one of these pseudopointers with READ permission, the first 1K of the segment is piggybacked onto the message, on the assumption that the server will probably want to read it eventually. In this way, messages longer than 32 bytes can be achieved.

Servers use the RECEIVE and REPLY calls. The RECEIVE call can provide a segment buffer in addition to the regular message buffer, so that if (part of) a segment has been piggybacked onto the message, it will have a place to go. The REPLY call can also provide a segment buffer for the case in which the client provides a pseudopointer that the server can use to return results exceeding 32 bytes.

To make this communication system easier to use, calls to servers can be embedded in stubs so that the caller just sees an ordinary procedure call. Stub generation is not automated, however.
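A fixed 32-byte message and a blocking call might be declared as follows; the field layout and opcode are illustrative assumptions on our part, not V's actual message format:

#include <stdint.h>
#include <string.h>

struct v_msg {                 /* exactly 32 bytes */
    uint16_t opcode;
    uint16_t flags;
    uint32_t segment_ptr;      /* optional pseudopointer into client memory */
    uint32_t segment_len;
    uint8_t  data[20];
};

extern void Send(struct v_msg *msg, uint32_t pid);  /* blocks until reply */

void read_block(uint32_t server_pid, uint32_t block_no, struct v_msg *m)
{
    m->opcode = 1;             /* hypothetical READ operation code */
    m->segment_ptr = 0;        /* no segment in this request */
    m->segment_len = 0;
    memcpy(m->data, &block_no, sizeof block_no);
    Send(m, server_pid);       /* on return, *m holds the server's reply */
}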
3.3.2 Naming and Protection

V has three levels of naming. At the bottom level, each process has a unique 32-bit pid, which is the address used to send messages to it. At the next level, services (i.e., processes that carry out requests for clients) can have symbolic (ASCII string) names in addition to their pids. A service can register a symbolic name with its kernel so that clients can use the symbolic name instead of the pid. When a client wants to access a service by its name, the client's kernel broadcasts a query to all the other kernels, to see where the server is. The (ServerName, pid) pair is then put in a cache for future use.

The top level of naming makes it possible to assign symbolic names to objects, such as files. Symbolic names are always interpreted in some "context," analogous to looking up a file name in some directory in other systems. A context is a set of records, each including the symbolic name, server's
pid, context number, and object identifier. Each server manages its own contexts; there is no centralized "name server." A symbolic name is looked up in a context by searching all the records in that context for one whose name matches the given name. When a match is found, the context number and object identifier can be sent to the appropriate server to have some operation carried out.

Names may be hierarchical, as in a/b/c. When a is looked up in some context, the result will probably be a new context, possibly managed by a new server on a different machine. In that case the remaining string, b/c, is passed on to that new server for further lookup, and so on.

It is also possible to prefix a symbolic name with an explicit context, as in [HomeDirectory] a/b/c, in which case the name is looked up in the context specified, rather than in the current context (analogous to the current working directory in other systems). A question that quickly arises is, "Who keeps track of the various context names, such as 'HomeDirectory' above?" The answer is that each workstation in the system has a Context Prefix Server, whose function is to map context names onto server names, so that the appropriate server can be found to interpret the name itself.
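A context record and the linear search described above can be sketched in C as follows (the field names follow the text; the code itself is ours):

#include <stdint.h>
#include <string.h>

struct context_rec {
    const char *symbolic_name;
    uint32_t    server_pid;
    uint32_t    context_number;   /* context for looking up the next component */
    uint32_t    object_id;
};

const struct context_rec *context_lookup(const struct context_rec *ctx,
                                         int nrecs, const char *name)
{
    for (int i = 0; i < nrecs; i++)
        if (strcmp(ctx[i].symbolic_name, name) == 0)
            return &ctx[i];       /* send (context_number, object_id)
                                     to server_pid for the operation */
    return 0;                     /* no match in this context */
}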
3.3.3 Resource Management

Each processor in V has a dedicated function, either as a user workstation or a file, print, or other dedicated server; so no form of dynamic processor allocation is provided. The key resources to be managed are processes, memory, and the I/O devices. Process and memory management is provided by the kernel server. I/O management is provided by the device server. Both of these are part of the kernel present on each machine, and are accessed via the standard message mechanism described above. They are special only in that they run in kernel mode and can get at the raw hardware.

Processes are organized into groups called teams. A team of processes share a common address space, and therefore must all run on the same processor. Application programs can make use of concurrency by running as a team of processes, each of which does part of the work. If one process in a team is blocked waiting for a reply to a message, the other ones are free to run. The kernel server is prepared to carry out operations such as creating new processes and teams, destroying processes and teams, reading and writing processes' states, and mapping processes onto memory.

All I/O in V is done using a uniform interface called the V I/O protocol. The protocol allows processes to read and write specific blocks on the device. This block orientation was chosen to provide idempotency. Terminal drivers must store the last block read and filter out duplicate requests in order to maintain the idempotency property. Implementation of byte streams is up to the users. The I/O protocol has proved general enough to handle disks, printers, terminals, and even a mouse.

3.3.4 Fault Tolerance

Since it was designed primarily for use in an interactive environment, V provides little in the way of fault tolerance. If something goes wrong, the user just does it again. V, however, does address exception handling. Whenever a process causes an exceptional condition to occur, such as stack overflow or referencing nonexistent memory, the kernel detecting the error sends a specially formatted message to the exception server, which is outside the kernel. The exception server can then invoke a debugger to take over. This scheme does not require a process to make any advance preparation for being debugged and, in principle, can allow the process to continue execution afterward.

3.3.5 Services

Since most of the V workstations do not have a disk, the central file server plays a key role in the system. The file server is not part of the operating system. Instead, it is just an ordinary user program running on top of the V kernel. Internally it is structured as a team of processes. The main
When the message arrives at the destination machine, a stub routine there unpacks the ESCII message and makes a local call on Lookup using the normal EPL calling sequence. The reply proceeds analogously in the opposite direction. The stub routines on both sides are automatically generated by the EPL compiler.

The implementation of invocation is slightly complicated by the fact that an object may contain multiple processes. When one process blocks waiting for a reply, the others must not be affected. This problem is handled by splitting the invocation into two layers. The upper layer builds the message, including the capability for the object to be invoked and the ESCII parameters, passes it to the lower layer, and blocks the calling process until the reply arrives. The lower layer then makes a nonblocking call to the kernel to actually send the message. If other processes are active within the object, they can now be run; if none are active, the object waits until a message arrives.

On the receiving side, a process within the invoked object will normally have previously executed a call announcing its willingness to perform some operation (e.g., Lookup in the above example), thereby blocking itself. When the Lookup message comes in, it is accepted by a special dispatcher process that checks to see which process, if any, is blocked waiting to perform the operation requested by the message. If a willing process is found, it runs and sends a reply, unblocking the caller. If no such process can be found, the message is queued until one becomes available.

3.4.2 Naming and Protection

Naming and protection in Eden are accomplished using the capability system. Data are encapsulated within objects, and are only accessible by invoking one of the operations defined by the object. To invoke an object, a process must have a valid capability. Thus there is a uniform naming and protection scheme throughout Eden. Capabilities may be stored in any object. Directories provide a convenient mechanism for grouping capabilities together. Each directory entry contains the ASCII string by which the capability is accessed and the capability itself. Clients can only access the contents of a directory by invoking the directory object with one of the valid operations, which include add entry, delete entry, lookup string, and rename capability. Capabilities are protected from forgery by the kernel, but users keep copies of capabilities for their own use; the kernel verifies them when they are used.

The basic protection scheme protects objects, using capabilities. Since all processes are embedded in objects, a process can be protected by restricting the distribution of capabilities to its object. The only way to obtain service from an object is by invoking the object with the proper capability, parameters, etc., all of which are checked by the kernel and EPL run-time system before the call is made.
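One plausible C rendering of a directory entry and the four operations listed above (the representation is ours; in Eden these would be invocations on a directory object, not ordinary C calls):

typedef struct { unsigned char bits[16]; } capability;  /* opaque; forgery
                                                           prevented by kernel */

struct dir_entry {
    char       name[32];   /* ASCII string by which the capability is accessed */
    capability cap;
};

/* The valid invocations on a directory object: */
extern int add_entry(capability dir, const char *name, capability cap);
extern int delete_entry(capability dir, const char *name);
extern int lookup_string(capability dir, const char *name, capability *out);
extern int rename_capability(capability dir, const char *old, const char *new);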
3.4.3 Resource Management

Because no version of Eden runs on bare machines, most of the issues associated with low-level resource management have not yet been dealt with. Nevertheless, some resource management issues have been addressed. For example, when an object is created, the issue arises of where to put it. At present, it is just put on the same workstation as the object that created it unless an explicit request has been given to put it somewhere else.

Another issue that has received considerable attention is how to achieve concurrency within an object. From the beginning of the project it was considered desirable to allow multiple processes to be simultaneously active within an object. These processes all share a common address space, although each one has its own stack for local variables, procedure call/return information, etc. Having multiple active processes within an object, coupled with the basic Eden semantics of remote invocations that block the caller but not the whole object, makes the implementation somewhat complicated. It is necessary to allow one process to block waiting for a reply without blocking the object as a whole. Monitors are used for synchronization. This multiprogramming of processes
and to limit resource usage by accounting for it.

In V, each processor is dedicated as either a workstation or a server, so processors are not resources to be dynamically allocated. Each V kernel manages its own local resources; there is no systemwide resource management.

Eden has been built on top of existing operating systems, and therefore most of the issues of resource management are handled by the underlying operating system. The main issue remaining for Eden is allocating and deallocating storage for objects.

3.5.4 Fault Tolerance

None of the four systems go to great lengths to make themselves fault tolerant; for example, none support atomic actions as a basic primitive. All four (with the possible exception of Eden) were designed with the intention of actually being used, so that the inherent trade-off between performance and fault tolerance tended to get resolved in favor of performance.

In the Cambridge system the only concession to fault tolerance is a feature in the ring interface to allow a machine to be remotely reset by sending a special packet to the interface. There is also a small server that helps get the servers started up.

Amoeba provides some fault tolerance through its boot server, with which processes can register. The boot server polls the registered processes periodically and, finding one that fails to respond, requests a new processor and downloads the failed program to it. This strategy does not retrieve the processes that were killed when a machine has gone down, but it does automatically ensure that no key service is ever down for more than, say, 30 seconds.

V does not address the problem of fault tolerance at all.

Of the four systems, Eden makes the most effort to provide a higher degree of reliability than provided by the bare hardware. The main tool used is checkpointing complete objects from time to time. If a processor crashes, each of its objects can be restored to the state it had at the time of the last checkpoint. Unfortunately, only entire objects can be checkpointed, making checkpointing a slow operation and thus discouraging its frequent use.

3.5.5 Services

The file systems used by Cambridge, Amoeba, V, and Eden are all quite different. The Cambridge system has two servers, the universal file server and the filing machine, which was added later to improve the performance by providing a large buffer cache. The universal file server supports a primitive flat file, with no directory structure, which is provided by the filing machine or the user machines. The universal file server has regular and special files, of which the latter can be updated atomically.

Amoeba has several file systems. One of them is compatible with UNIX, to allow UNIX applications to run on Amoeba. Another one, FUSS, supports multiversion, multiserver, tree-structured, immutable files with atomic commit. Directory servers map ASCII names to capabilities, thus allowing an arbitrary graph of files and directories to be constructed.

V has a traditional file server similar to UNIX. It is based on the earlier Thoth system.

Eden has no file server at all in the usual sense. Instead, each file object has embedded in it a process that acts like a private file server for that one file. Like Amoeba, Eden has separate directory servers that map ASCII strings onto capabilities and provide the ability to map one string onto several files, thus providing for file replication. All four systems have a heterogeneous variety of other services (e.g., print, mail, bank).

4. SUMMARY

Distributed operating systems are still in an early phase of development, with many unanswered questions and relatively little agreement among workers in the field about how things should be done. Many experimental systems use the client-server model with some form of remote procedure call as the communication base, but there are also systems built on the connection model.
PU, C., AND NOE, J. D. 1985. Nested transactions for general objects. Rep. TR-85-12-03, Computer Science Dept., Univ. of Washington, Seattle, Wash.

PU, C., NOE, J. D., AND PROUDFOOT, A. 1986. Regeneration of replicated objects: A technique and its Eden implementation. In Proceedings of the 2nd International Conference on Data Engineering (Los Angeles, Calif., Feb. 4-6). IEEE, New York, pp. 175-187.

RASHID, R. F., AND ROBERTSON, G. G. 1981. Accent: A communication oriented network operating system kernel. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 64-75.

REED, D. P. 1983. Implementing atomic actions on decentralized data. ACM Trans. Comput. Syst. 1, 1 (Feb.), 3-23.

REED, D. P., AND SVOBODOVA, L. 1981. SWALLOW: A distributed data storage system for a local network. In Local Networks for Computer Communications, A. West and P. Janson, Eds. North-Holland Publ., Amsterdam, pp. 355-373.

REIF, J. H., AND SPIRAKIS, P. G. 1984. Real-time synchronization of interprocess communications. ACM Trans. Program. Lang. Syst. 6, 2 (Apr.), 215-238.

RITCHIE, D. M., AND THOMPSON, K. 1974. The UNIX time-sharing system. Commun. ACM 17, 7 (July), 365-375.

SALTZER, J. H., REED, D. P., AND CLARK, D. D. 1984. End-to-end arguments in system design. ACM Trans. Comput. Syst. 2, 4 (Nov.), 277-288.

SATYANARAYANAN, M., HOWARD, J., NICHOLS, D., SIDEBOTHAM, R., SPECTOR, A., AND WEST, M. 1985. The ITC distributed file system: Principles and design. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 35-50.

SCHROEDER, M., GIFFORD, D., AND NEEDHAM, R. 1985. A caching file system for a programmer's workstation. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 25-34.

SMITH, R. 1979. The contract net protocol: High-level communication and control in a distributed problem solver. In Proceedings of the 1st International Conference on Distributed Computing Systems. IEEE, New York, pp. 185-192.

SOLOMON, M. H., AND FINKEL, R. A. 1978. ROSCOE: A multimicrocomputer operating system. In Proceedings of the 2nd Rocky Mountain Symposium on Microcomputers (Aug.), pp. 201-210.

SOLOMON, M. H., AND FINKEL, R. A. 1979. The Roscoe distributed operating system. In Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 10-12). ACM, New York, pp. 108-114.

SPECTOR, A. Z. 1982. Performing remote operations efficiently on a local computer network. Commun. ACM 25, 4 (Apr.), 246-260.

STANKOVIC, J. A., AND SIDHU, I. S. 1984. An adaptive bidding algorithm for processes, clusters, and distributed groups. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 49-59.

STONE, H. S. 1977. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Trans. Softw. Eng. SE-3 (Jan.), 88-93.

STONE, H. S. 1978. Critical load factors in distributed computer systems. IEEE Trans. Softw. Eng. SE-4 (May), 254-258.

STONE, H. S., AND BOKHARI, S. H. 1978. Control of distributed processes. Computer 11 (July), 97-106.

STONEBRAKER, M. 1981. Operating system support for database management. Commun. ACM 24, 7 (July), 412-418.

STURGIS, H. E., MITCHELL, J. G., AND ISRAEL, J. 1980. Issues in the design and use of a distributed file system. Oper. Syst. Rev. 14 (July), 55-69.

SVENTEK, J., GREIMAN, W., O'DELL, M., AND JANSEN, A. 1983. Token ring local networks: A comparison of experimental and theoretical performance. Lawrence Berkeley Lab. Rep. 16254.

SVOBODOVA, L. 1981. A reliable object-oriented data repository for a distributed computer system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 47-58.

SVOBODOVA, L. 1984. File servers for network-based distributed systems. ACM Comput. Surv. 16, 4 (Dec.), 353-398.

SWINEHART, D., MCDANIEL, G., AND BOGGS, D. 1979. WFS: A simple shared file system for a distributed environment. In Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 10-12). ACM, New York, pp. 9-17.

TANENBAUM, A. S., AND MULLENDER, S. J. 1982. Operating system requirements for distributed data base systems. In Distributed Data Bases, H.-J. Schneider, Ed. North-Holland Publ., Amsterdam, pp. 105-114.

TANENBAUM, A. S., MULLENDER, S. J., AND VAN RENESSE, R. 1986. Using sparse capabilities in a distributed operating system. In Proceedings of the 6th International Conference on Distributed Computing Systems. IEEE, New York, pp. 558-563.

VAN TILBORG, A. M., AND WITTIE, L. D. 1981. Wave scheduling: Distributed allocation of task forces in network computers. In Proceedings of the 2nd International Conference on Distributed Computing Systems. IEEE, New York, pp. 337-347.

WEINSTEIN, M. J., PAGE, T. W., LIVEZEY, B. K., AND POPEK, G. J. 1985. Transactions and synchronization in a distributed operating system. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 115-125.

WILKES, M. V., AND NEEDHAM, R. M. 1980. The Cambridge model distributed system. Oper. Syst. Rev. 14 (Jan.), 21-29.

WITTIE, L., AND CURTIS, R. 1985. Time management for debugging distributed systems. In Proceedings of the 5th International Conference on Distributed Computing Systems. IEEE, New York.

ZIMMERMANN, H. 1980. OSI reference model: The ISO model of architecture for open systems interconnection. IEEE Trans. Commun. COM-28 (Apr.), 425-432.

ZIMMERMANN, H., BANINO, J.-S., CARISTAN, A., GUILLEMONT, M., AND MORISSET, G. 1981. Basic concepts for the support of distributed systems: The Chorus approach. In Proceedings of the 2nd International Conference on Distributed Computing Systems. IEEE, New York, pp. 60-66.