
Distributed Operating Systems

ANDREW S. TANENBAUM and ROBBERT VAN RENESSE


Department of Mathematics and Computer Science, Vrije Universiteit, Amsterdam, The Netherlands

Distributed operating systems have many aspects in common with centralized ones, but
they also differ in certain ways. This paper is intended as an introduction to distributed
operating systems, and especially to current university research about them. After a
discussion of what constitutes a distributed operating system and how it is distinguished
from a computer network, various key design issues are discussed. Then several examples
of current research projects are examined in some detail, namely, the Cambridge
Distributed Computing System, Amoeba, V, and Eden.

Categories and Subject Descriptors: C.2.4 [Computer-Communication Networks]: Distributed Systems - network operating system; D.4.3 [Operating Systems]: File Systems Management - distributed file systems; D.4.5 [Operating Systems]: Reliability - fault tolerance; D.4.6 [Operating Systems]: Security and Protection - access controls; D.4.7 [Operating Systems]: Organization and Design - distributed systems
General Terms: Algorithms, Design, Experimentation, Reliability, Security
Additional Key Words and Phrases: File server

CONTENTS

INTRODUCTION
  Goals and Problems
  System Models
1. NETWORK OPERATING SYSTEMS
  1.1 File System
  1.2 Protection
  1.3 Execution Location
  1.4 An Example: The Sun Network File System
2. DESIGN ISSUES
  2.1 Communication Primitives
  2.2 Naming and Protection
  2.3 Resource Management
  2.4 Fault Tolerance
  2.5 Services
3. EXAMPLES OF DISTRIBUTED OPERATING SYSTEMS
  3.1 The Cambridge Distributed Computing System
  3.2 Amoeba
  3.3 The V Kernel
  3.4 The Eden Project
  3.5 Comparison of the Cambridge, Amoeba, V, and Eden Systems
4. SUMMARY
ACKNOWLEDGMENTS
REFERENCES

INTRODUCTION

Everyone agrees that distributed systems are going to be very important in the future. Unfortunately, not everyone agrees on what they mean by the term "distributed system." In this paper we present a viewpoint widely held within academia about what is and is not a distributed system, we discuss numerous interesting design issues concerning them, and finally we conclude with a fairly close look at some experimental distributed systems that are the subject of ongoing research at universities.

To begin with, we use the term "distributed system" to mean a distributed operating system as opposed to a database system or some distributed applications system, such as a banking system. An operating system is a program that controls the resources of a computer and provides its users with an interface or virtual machine that is more convenient to use than the bare machine. Examples of well-known centralized (i.e., not distributed) operating systems are CP/M, MS-DOS, and UNIX. (CP/M is a trademark of Digital Research, Inc.; MS-DOS is a trademark of Microsoft; UNIX is a trademark of AT&T Bell Laboratories.)

A distributed operating system is one that looks to its users like an ordinary centralized operating system but runs on multiple, independent central processing units (CPUs). The key concept here is transparency. In other words, the use of multiple processors should be invisible (transparent) to the user. Another way of expressing the same idea is to say that the user views the system as a "virtual uniprocessor," not as a collection of distinct machines. This is easier said than done.


Many multimachine systems that do not fulfill this requirement have been built. For example, the ARPANET contains a substantial number of computers, but by this definition it is not a distributed system. Neither is a local network consisting of personal computers with minicomputers and explicit commands to log in here or copy a file from there. In both cases we have a computer network but not a distributed operating system. Thus it is the software, not the hardware, that determines whether a system is distributed or not.

As a rule of thumb, if you can tell which computer you are using, you are not using a distributed system. The users of a true distributed system should not know (or care) on which machine (or machines) their programs are running, where their files are stored, and so on. It should be clear by now that very few distributed systems are currently used in a production environment. However, several promising research projects are in progress.

To make the contrast with distributed operating systems stronger, let us briefly look at another kind of system, which we call a "network operating system." A typical configuration for a network operating system would be a collection of personal computers along with a common printer server and file server for archival storage, all tied together by a local network. Generally speaking, such a system will have most of the following characteristics that distinguish it from a distributed system:

- Each computer has its own private operating system, instead of running part of a global, systemwide operating system.
- Each user normally works on his or her own machine; using a different machine invariably requires some kind of "remote login," instead of having the operating system dynamically allocate processes to CPUs.
- Users are typically aware of where each of their files are kept and must move files between machines with explicit "file transfer" commands, instead of having file placement managed by the operating system.
- The system has little or no fault tolerance; if 1 percent of the personal computers crash, 1 percent of the users are out of business, instead of everyone simply being able to continue normal work, albeit with 1 percent worse performance.

Goals and Problems

The driving force behind the current interest in distributed systems is the enormous rate of technological change in microprocessor technology. Microprocessors have become very powerful and cheap, compared with mainframes and minicomputers, so it has become attractive to think about designing large systems composed of many small processors. These distributed systems clearly have a price/performance advantage over more traditional systems.

Another advantage often cited is the relative simplicity of the software (each processor has a dedicated function), although this advantage is more often listed by people who have never tried to write a distributed operating system than by those who have.

Incremental growth is another plus; if you need 10 percent more computing power, you just add 10 percent more processors. System architecture is crucial to this type of system growth, however, since it is hard to give each user of a personal computer another 10 percent of a personal computer. Reliability and availability can also be a big advantage; a few parts of the system can be down without disturbing people using the other parts. On the minus side, unless one is very careful, it is easy for the communication protocol overhead to become a major source of inefficiency. More than one system has been built that requires the full computing power of its machines just to run the protocols, leaving nothing over to do the work. The occasional lack of simplicity cited above is a real problem, although in all fairness, this problem comes from inflated goals: With a centralized system no one expects the computer to function almost normally when half the memory is sick. With a distributed system, a high degree of fault tolerance is often, at least, an implicit goal.

A more fundamental problem in distributed systems is the lack of global state information. It is generally a bad idea to even try to collect complete information about any aspect of the system in one table. Lack of up-to-date information makes many things much harder. It is hard to schedule the processors optimally if you are not sure how many are up at the moment. Many people, however, think that these obstacles can be overcome in time, so there is great interest in doing research on the subject.

System Models

Various models have been suggested for building a distributed system. Most of them fall into one of three broad categories, which we call the "minicomputer" model, the "workstation" model, and the "processor pool" model. In the minicomputer model, the system consists of a few (perhaps even a dozen) minicomputers (e.g., VAXs), each with multiple users. Each user is logged onto one specific machine, with remote access to the other machines. This model is a simple outgrowth of the central time-sharing machine.

In the workstation model, each user has a personal workstation, usually equipped with a powerful processor, memory, a bit-mapped display, and sometimes a disk. Nearly all the work is done on the workstations. Such a system begins to look distributed when it supports a single, global file system, so that data can be accessed without regard to their location.

The processor pool model is the next evolutionary step after the workstation model. In a time-sharing system, whether with one or more processors, the ratio of CPUs to logged-in users is normally much less than 1; with the workstation model it is approximately 1; with the processor pool model it is much greater than 1. As CPUs get cheaper and cheaper, this model will become more and more widespread. The idea here is that whenever a user needs computing power, one or more CPUs are temporarily allocated to that user; when the job is completed, the CPUs go back into the pool to await the next request. As an example, when ten procedures (each on a separate file) must be recompiled, ten processors could be allocated to run in parallel for a few seconds and then be returned to the pool of available processors. At least one experimental system described below (Amoeba) attempts to combine two of these models, providing each user with a workstation in addition to the processor pool for general use. No doubt other variations will be tried in the future.

1. NETWORK OPERATING SYSTEMS

Before starting our discussion of distributed operating systems, it is worth first taking a brief look at some of the ideas involved in network operating systems, since they can be regarded as primitive forerunners.

Although attempts to connect computers together have been around for decades, networking really came into the limelight with the ARPANET in the early 1970s. The original design did not provide for much in the way of a network operating system. Instead, the emphasis was on using the network as a glorified telephone line to allow remote login and file transfer. Later, several attempts were made to create network operating systems, but they never were widely used [Millstein 1977].

In more recent years, several research organizations have connected collections of minicomputers running the UNIX operating system [Ritchie and Thompson 1974] into a network operating system, usually via a local network [Birman and Rowe 1982; Brownbridge et al. 1982; Chesson 1975; Hwang et al. 1982; Luderer et al. 1981; Wambecq 1983]. Wupit [1983] gives a good survey of these systems, which we shall draw upon for the remainder of this section.

As we said earlier, the key issue that distinguishes a network operating system from a distributed one is how aware the users are of the fact that multiple machines are being used. This visibility occurs in three primary areas: the file system, protection, and program execution. Of course, it is possible to have systems that are highly transparent in one area and not at all in the other, which leads to a hybrid form.

1.1 File System

When connecting two or more distinct systems together, the first issue that must be faced is how to merge the file systems. Three approaches have been tried. The first approach is not to merge them at all. Going this route means that a program on machine A cannot access files on machine B by making system calls. Instead, the user must run a special file transfer program that copies the needed remote files to the local machine, where they can then be accessed normally. Sometimes remote printing and mail is also handled this way. One of the best-known examples of networks that primarily support file transfer and mail via special programs, and not system call access to remote files, is the UNIX "uucp" program, and its network, USENET.

The next step upward in the direction of a distributed file system is to have adjoining file systems. In this approach, programs on one machine can open files on another machine by providing a path name telling where the file is located. For example, one could say

open("/machine1/pathname", READ);
open("machine1!pathname", READ);
open("/../machine1/pathname", READ);

The latter naming scheme is used in the Newcastle Connection [Brownbridge et al. 1982] and Netix [Wambecq 1983] and is derived from the creation of a virtual "superdirectory" above the root directories of all the connected machines. Thus "/.." means start at the local root directory and go upward one level (to the superdirectory), and then down to the root directory of "machine1." In Figure 1, the root directories of three machines, A, B, and C, are shown, with a superdirectory above them. To access file x from machine C, one could say

open("/../C/x", READ-ONLY)

[Figure 1. A (virtual) superdirectory above the root directory provides access to remote files.]

In the Newcastle system, the naming tree is actually more general, since "machine1" may really be any directory, so one can attach a machine as a leaf anywhere in the hierarchy, not just at the top.
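To make the superdirectory convention concrete, the C sketch below shows how a file-access layer might split a Newcastle-style name such as "/../C/x" into a machine part and a local path. The paper gives no implementation; the function split_remote_path, the fixed "/../" prefix test, and the buffer sizes are assumptions made purely for illustration.

    #include <stdio.h>
    #include <string.h>

    /* Split a path of the form "/../machine/rest" into the machine component
     * and the path relative to that machine's root.  Returns 1 if the name
     * names a remote machine, 0 if it is an ordinary local path.
     * Illustrative only; the "/../" convention follows the text above. */
    static int split_remote_path(const char *name, char *machine, char *rest)
    {
        const char *prefix = "/../";            /* superdirectory prefix */
        if (strncmp(name, prefix, strlen(prefix)) != 0)
            return 0;                           /* local path */

        const char *p = name + strlen(prefix);  /* points at "machine/rest" */
        const char *slash = strchr(p, '/');
        if (slash == NULL) {
            strcpy(machine, p);                 /* just a machine's root */
            strcpy(rest, "/");
        } else {
            size_t n = (size_t)(slash - p);
            memcpy(machine, p, n);
            machine[n] = '\0';
            strcpy(rest, slash);                /* path below that root */
        }
        return 1;
    }

    int main(void)
    {
        char machine[64], rest[256];
        if (split_remote_path("/../C/x", machine, rest))
            printf("remote: machine=%s path=%s\n", machine, rest);
        return 0;
    }

In a real Newcastle-style system this split would be done by the library layer that intercepts open(), which is what lets the machine name appear as just another path component.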

The third approach is the way it is done in distributed operating systems, namely, to have a single global file system visible from all machines. When this method is used, there is one "bin" directory for binary programs, one password file, and so on. When a program wants to read the password file it does something like

open("/etc/passwd", READ-ONLY)

without reference to where the file is. It is up to the operating system to locate the file and arrange for transport of data as they are needed. LOCUS is an example of a system using this approach [Popek et al. 1981; Walker et al. 1983; Weinstein et al. 1985].

The convenience of having a single global name space is obvious. In addition, this approach means that the operating system is free to move files around among machines to keep all the disks equally full and busy, and that the system can maintain replicated copies of files if it so chooses. When the user or program must specify the machine name, the system cannot decide on its own to move a file to a new machine because that would change the (user visible) name used to access the file. Thus in a network operating system, control over file placement must be done manually by the users, whereas in a distributed operating system it can be done automatically by the system itself.

1.2 Protection

Closely related to the transparency of the file system is the issue of protection. UNIX and many other operating systems assign a unique internal identifier to each user. Each file in the file system has a little table associated with it (called an i-node in UNIX) telling who the owner is, where the disk blocks are located, etc. If two previously independent machines are now connected, it may turn out that some internal User IDentifier (UID), for example, number 12, has been assigned to a different user on each machine. Consequently, when user 12 tries to access a remote file, the remote file system cannot see whether the access is permitted, since two different users have the same UID.

One solution to this problem is to require all remote users wanting to access files on machine X to first log onto X using a user name that is local to X. When used this way, the network is just being used as a fancy switch to allow users at any terminal to log onto any computer, just as a telephone company switching center allows any subscriber to call any other subscriber. This solution is usually inconvenient for people and impractical for programs, so something better is needed. The next step up is to allow any user to access files on any machine without having to log in, but to have the remote user appear to have the UID corresponding to "GUEST" or "DEMO" or some other publicly known login name. Generally such names have little authority and can only access files that have been designated as readable or writable by all users.

A better approach is to have the operating system provide a mapping between UIDs, so that when a user with UID 12 on his or her home machine accesses a remote machine on which his or her UID is 15, the remote machine treats all accesses as though they were done by user 15. This approach implies that sufficient tables are provided to map each user from his or her home (machine, UID) pair to the appropriate UID for any other machine (and that messages cannot be tampered with).

In a true distributed system there should be a unique UID for every user, and that UID should be valid on all machines without any mapping. In this way no protection problems arise on remote accesses to files; as far as protection goes, a remote access can be treated like a local access with the same UID. The protection issue makes the difference between a network operating system and a distributed one clear: In one case there are various machines, each with its own user-to-UID mapping, and in the other there is a single, systemwide mapping that is valid everywhere.
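The UID mapping just described can be pictured as a small table keyed by a home (machine, UID) pair. The following C sketch is only illustrative: the table contents, the map_uid function, and the convention of returning -1 for "no mapping" are invented here and are not taken from any particular system.

    #include <stdio.h>

    /* One row of the mapping table described in the text: a user's home
     * (machine, UID) pair and the UID that same user has on another machine.
     * The entries below are made up for the example. */
    struct uid_map {
        int home_machine;
        int home_uid;
        int remote_machine;
        int remote_uid;
    };

    static const struct uid_map table[] = {
        { 1, 12, 2, 15 },   /* user 12 on machine 1 is user 15 on machine 2 */
        { 2,  7, 1, 30 },
    };

    /* Map a home (machine, UID) pair onto the UID valid on target_machine.
     * Returns -1 if no mapping exists (the access would then be refused). */
    static int map_uid(int home_machine, int home_uid, int target_machine)
    {
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
            if (table[i].home_machine == home_machine &&
                table[i].home_uid == home_uid &&
                table[i].remote_machine == target_machine)
                return table[i].remote_uid;
        return -1;
    }

    int main(void)
    {
        printf("UID 12 from machine 1 acts as UID %d on machine 2\n",
               map_uid(1, 12, 2));
        return 0;
    }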

1.3 Execution Location

Program execution is the third area in which machine boundaries are visible in network operating systems. When a user or a running program wants to create a new process, where is the process created? At least four schemes have been used thus far.

The first of these is that the user simply says "CREATE PROCESS" in one way or another, and specifies nothing about where. Depending on the implementation, this can be the best or the worst way to do it. In the most distributed case, the system chooses a CPU by looking at the load, location of files to be used, etc. In the least distributed case, the system always runs the process on one specific machine (usually the machine on which the user is logged in).

The second approach to process location is to allow users to run jobs on any machine by first logging in there. In this model, processes on different machines cannot communicate or exchange data, but a simple manual load balancing is possible.

The third approach is a special command that the user types at a terminal to cause a program to be executed on a specific machine. A typical command might be

remote vax4 who

to run the who program on machine vax4. In this arrangement, the environment of the new process is the remote machine. In other words, if that process tries to read or write files from its current working directory, it will discover that its working directory is on the remote machine, and that files that were in the parent process's directory are no longer present. Similarly, files written in the working directory will appear on the remote machine, not the local one.

The fourth approach is to provide the "CREATE PROCESS" system call with a parameter specifying where to run the new process, possibly with a new system call for specifying the default site. As with the previous method, the environment will generally be the remote machine. In many cases, signals and other forms of interprocess communication between processes do not work properly among processes on different machines.

A final point about the difference between network and distributed operating systems is how they are implemented. A common way to realize a network operating system is to put a layer of software on top of the native operating systems of the individual machines (e.g., Mamrak et al. [1982]). For example, one could write a special library package that would intercept all the system calls and decide whether each one was local or remote [Brownbridge et al. 1982]. Although most system calls can be handled this way without modifying the kernel, invariably there are a few things, such as interprocess signals, interrupt characters (e.g., BREAK) from the keyboard, etc., that are hard to get right. In a true distributed operating system one would normally write the kernel from scratch.
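As a rough illustration of the fourth approach above, the hypothetical interface below lets the caller either name the site on which the new process should run or leave the choice to the system. The names create_process and ANY_SITE are made up for this sketch; no real system call is being described.

    #include <stdio.h>

    /* Hypothetical process-creation interface for the fourth approach:
     * the caller may name the site, or let the system pick one. */
    #define ANY_SITE  (-1)

    /* A real system call would start the process and return its id;
     * this stub only reports what it was asked to do. */
    static int create_process(const char *program, int site)
    {
        if (site == ANY_SITE)
            printf("run %s on a CPU chosen by the system\n", program);
        else
            printf("run %s on machine %d\n", program, site);
        return 0;   /* pretend process id */
    }

    int main(void)
    {
        create_process("who", 4);        /* like "remote vax4 who" */
        create_process("cc", ANY_SITE);  /* let the system choose  */
        return 0;
    }
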
1.4 An Example: The Sun Network File System

To provide a contrast with the true distributed systems described later in this paper, in this section we look briefly at a network operating system that runs on the Sun Microsystems' workstations. These workstations are intended for use as personal computers. Each one has a 68000 series CPU, local memory, and a large bit-mapped display. Workstations can be configured with or without local disk, as desired. All the workstations run a version of 4.2BSD UNIX specially modified for networking.

This arrangement is a classic example of a network operating system: Each computer runs a traditional operating system, UNIX, and each has its own user(s), but with extra features added to make networking more convenient. During its evolution the Sun system has gone through three distinct versions, which we now describe.

In the first version each of the workstations was completely independent from all the others, except that a program rcp was provided to copy files from one workstation to another. By typing a command such as

rcp M1:/usr/jim/file.c M2:/usr/ast/f.c

it was possible to transfer whole files from one machine to another.

In the second version, Network Disk (ND), a network disk server was provided to support diskless workstations. Disk space on the disk server's machine was divided into disjoint partitions, with each partition acting as the virtual disk for some (diskless) workstation. Whenever a diskless workstation needed to read a file, the request was processed locally until it got down to the level of the device driver, at which point the block needed was retrieved by sending a message to the remote disk server. In effect, the network was merely being used to simulate a disk controller. With this network disk system, sharing of disk partitions was not possible.

The third version, the Network File System (NFS), allows remote directories to be mounted in the local file tree on any workstation. By mounting, say, a remote directory "doc" on the empty local directory "/usr/doc," all subsequent references to "/usr/doc" are automatically routed to the remote system. Sharing is allowed in NFS, so several users can read files on a remote machine at the same time.

To prevent users from reading other people's private files, a directory can only be mounted remotely if it is explicitly exported by the workstation it is located on. A directory is exported by entering a line for it in a file "/etc/exports." To improve performance of remote access, both the client machine and server machine do block caching. Remote services can be located using a Yellow Pages server that maps service names onto their network locations.

The NFS is implemented by splitting the operating system up into three layers. The top layer handles directories, and maps each path name onto a generalized i-node called a vnode, consisting of a (machine, i-node) pair, making each vnode globally unique.

Vnode numbers are presented to the middle layer, the virtual file system (VFS). This layer checks to see if a requested vnode is local or not. If it is local, it calls the local disk driver or, in the case of an ND partition, sends a message to the remote disk server. If it is remote, the VFS calls the bottom layer with a request to process it remotely.

The bottom layer accepts requests for accesses to remote vnodes and sends them over the network to the bottom layer on the serving machine. From there they propagate upward through the VFS layer to the top layer, where they are reinjected into the VFS layer. The VFS layer sees a request for a local vnode and processes it normally, without realizing that the top layer is actually working on behalf of a remote kernel. The reply retraces the same path in the other direction.

The protocol between workstations has been carefully designed to be robust in the face of network and server crashes. Each request completely identifies the file (by its vnode), the position in the file, and the byte count. Between requests, the server does not maintain any state information about which files are open or where the current file position is. Thus, if a server crashes and is rebooted, there is no state information that will be lost.

The ND and NFS facilities are quite different and can both be used on the same workstation without conflict. ND works at a low level and just handles remote block I/O without regard to the structure of the information on the disk. NFS works at a much higher level and effectively takes requests appearing at the top of the operating system on the client machine and gets them over to the top of the operating system on the server machine, where they are processed in the same way as local requests.
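The three-layer structure can be sketched in a few lines of C. The vnode structure follows the (machine, i-node) pair described above; everything else (the function names, the stand-in "drivers," and the assumption that the local machine is number 1) is invented for illustration.

    #include <stdio.h>

    #define LOCAL_MACHINE 1          /* assumed number of this host */

    /* A vnode as described in the text: a globally unique (machine, i-node) pair. */
    struct vnode {
        int machine;   /* where the file really lives */
        int inode;     /* i-node number on that machine */
    };

    static void local_disk_read(int inode, long offset, long count)
    {
        printf("local read: inode %d, offset %ld, count %ld\n",
               inode, offset, count);
    }

    static void remote_request(struct vnode v, long offset, long count)
    {
        /* Each request carries the vnode, position, and byte count, so the
         * server need keep no state between requests. */
        printf("send to machine %d: inode %d, offset %ld, count %ld\n",
               v.machine, v.inode, offset, count);
    }

    /* Middle (VFS) layer: route a read either to the local disk driver or to
     * the bottom layer, which ships it to the serving machine. */
    static void vfs_read(struct vnode v, long offset, long count)
    {
        if (v.machine == LOCAL_MACHINE)
            local_disk_read(v.inode, offset, count);
        else
            remote_request(v, offset, count);
    }

    int main(void)
    {
        struct vnode local  = { 1, 1234 };
        struct vnode remote = { 3, 77 };
        vfs_read(local, 0, 1024);
        vfs_read(remote, 4096, 512);
        return 0;
    }
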
2. DESIGN ISSUES

Now we turn from traditional computer systems with some networking facilities added on to systems designed with the intention of being distributed. In this section we look at five issues that distributed systems' designers are faced with:

- communication primitives,
- naming and protection,
- resource management,
- fault tolerance,
- services to provide.

Although no list could possibly be exhaustive at this early stage of development, these topics should provide a reasonable impression of the areas in which current research is proceeding.

2.1 Communication Primitives

The computers forming a distributed system normally do not share primary memory, and so communication via shared memory techniques such as semaphores and monitors is generally not applicable. Instead, message passing in one form or another is used. One widely discussed framework for message-passing systems is the ISO OSI reference model, which has seven layers, each performing a well-defined function [Zimmermann 1980]. The seven layers are the physical layer, data-link layer, network layer, transport layer, session layer, presentation layer, and application layer. By using this model it is possible to connect computers with widely different operating systems, character codes, and ways of viewing the world.

Unfortunately, the overhead created by all these layers is substantial. In a distributed system consisting primarily of huge mainframes from different manufacturers, connected by slow leased lines (say, 56 kilobytes per second), the overhead might be tolerable. Plenty of computing capacity would be available for running complex protocols, and the narrow bandwidth means that close coupling between the systems would be impossible anyway. On the other hand, in a distributed system consisting of identical microcomputers connected by a 10-megabyte-per-second or faster local network, the price of the ISO model is generally too high. Nearly all the experimental distributed systems discussed in the literature thus far have opted for a different, much simpler model, so we do not mention the ISO model further in this paper.

2.1.1 Message Passing

The model that is favored by researchers in this area is the client-server model, in which a client process wanting some service (e.g., reading some data from a file) sends a message to the server and then waits for a reply message, as shown in Figure 2. In the most naked form the system just provides two primitives: SEND and RECEIVE. The SEND primitive specifies the destination and provides a message; the RECEIVE primitive tells from whom a message is desired (including "anyone") and provides a buffer where the incoming message is to be stored. No initial setup is required, and no connection is established, hence no tear down is required.

[Figure 2. Client-server model of communication: the client sends a request message to the server, and the server sends back a reply message.]
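A minimal sketch of this two-primitive model is given below. The SEND and RECEIVE shown here are ordinary C functions sharing one in-memory "channel," standing in for what would really be kernel primitives and a network; the message contents and process numbers are made up.

    #include <stdio.h>
    #include <string.h>

    /* Toy stand-ins for the SEND and RECEIVE primitives of Figure 2.  A single
     * in-memory slot plays the role of the network. */
    static char channel[128];

    static void SEND(int dest, const char *msg)
    {
        (void)dest;                       /* destination process/machine id */
        strncpy(channel, msg, sizeof(channel) - 1);
    }

    static void RECEIVE(int from, char *buf, size_t len)
    {
        (void)from;                       /* "anyone" in this sketch */
        strncpy(buf, channel, len - 1);
        buf[len - 1] = '\0';
    }

    int main(void)
    {
        char buf[128];

        /* Client: send a request and (conceptually) block for the reply. */
        SEND(/*server=*/2, "read block 7 of file x");

        /* Server: obtain the request, do the work, send the reply. */
        RECEIVE(/*anyone=*/0, buf, sizeof(buf));
        printf("server got: %s\n", buf);
        SEND(/*client=*/1, "here is block 7");

        /* Client: receive the reply. */
        RECEIVE(/*server=*/2, buf, sizeof(buf));
        printf("client got: %s\n", buf);
        return 0;
    }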

Precisely what semantics these primitives ought to have has been a subject of much controversy among researchers. Two of the fundamental decisions that must be made are unreliable versus reliable and nonblocking versus blocking primitives. At one extreme, SEND can put a message out onto the network and wish it good luck. No guarantee of delivery is provided, and no automatic retransmission is attempted by the system if the message is lost. At the other extreme, SEND can handle lost messages, retransmissions, and acknowledgments internally, so that when SEND terminates, the program is sure that the message has been received and acknowledged.

Blocking versus Nonblocking Primitives. The other choice is between nonblocking and blocking primitives. With nonblocking primitives, SEND returns control to the user program as soon as the message has been queued for subsequent transmission (or a copy made). If no copy is made, any changes the program makes to the data before or (heaven forbid) while they are being sent are made at the program's peril. When the message has been transmitted (or copied to a safe place for subsequent transmission), the program is interrupted to inform it that the buffer may be reused. The corresponding RECEIVE primitive signals a willingness to receive a message and provides a buffer for it to be put into. When a message has arrived, the program is informed by interrupt, or it can poll for status continuously or go to sleep until the interrupt arrives. The advantage of these nonblocking primitives is that they provide the maximum flexibility: Programs can compute and perform message I/O in parallel in any way they want.

Nonblocking primitives also have a disadvantage: They make programming tricky and difficult. Irreproducible, timing-dependent programs are painful to write and awful to debug. Consequently, many people advocate sacrificing some flexibility and efficiency by using blocking primitives. A blocking SEND does not return control to the user until the message has been sent (unreliable blocking primitive) or until the message has been sent and an acknowledgment received (reliable blocking primitive). Either way, the program may immediately modify the buffer without danger. A blocking RECEIVE does not return control until a message has been placed in the buffer. Reliable and unreliable RECEIVEs differ in that the former automatically acknowledges receipt of a message, whereas the latter does not. It is not reasonable to combine a reliable SEND with an unreliable RECEIVE, or vice versa; so the system designers must make a choice and provide one set or the other. Blocking and nonblocking primitives do not conflict, so there is no harm done if the sender uses one and the receiver the other.

Buffered versus Unbuffered Primitives. Another design decision that must be made is whether or not to buffer messages. The simplest strategy is not to buffer. When a sender has a message for a receiver that has not (yet) executed a RECEIVE primitive, the sender is blocked until a RECEIVE has been done, at which time the message is copied from sender to receiver. This strategy is sometimes referred to as a rendezvous.

A slight variation on this theme is to copy the message to an internal buffer on the sender's machine, thus providing for a nonblocking version of the same scheme. As long as the sender does not do any more SENDs before the RECEIVE occurs, no problem occurs.

A more general solution is to have a buffering mechanism, usually in the operating system kernel, which allows senders to have multiple SENDs outstanding, even without any interest on the part of the receiver. Although buffered message passing can be implemented in many ways, a typical approach is to provide users with a system call CREATEBUF, which creates a kernel buffer, sometimes called a mailbox, of a user-specified size. To communicate, a sender can now send messages to the receiver's mailbox, where they will be buffered until requested by the receiver. Buffering is not only more complex (creating, destroying, and generally managing the mailboxes), but also raises issues of protection, the need for special high-priority interrupt messages, what to do with mailboxes owned by processes that have been killed or died of natural causes, and more.

A more structured form of communication is achieved by distinguishing requests from replies. With this approach, one typically has three primitives: SEND-GET, GET-REQUEST, and SEND-REPLY. SEND-GET is used by clients to send requests and get replies. It combines a SEND to a server with a RECEIVE to get the server's reply. GET-REQUEST is done by servers to acquire messages containing work for them to do. When a server has carried the work out, it sends a reply with SEND-REPLY. By thus restricting the message traffic and using reliable, blocking primitives, one can create some order in the chaos.
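The request-reply style might be used roughly as follows. The signatures of SEND-GET, GET-REQUEST, and SEND-REPLY are not specified in the text, so the C sketch below invents simple ones (renamed with underscores to be legal C) and replaces the actual message transport with print statements.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical signatures for the three structured primitives.  The bodies
     * are stubs; in a real kernel these would be reliable, blocking calls. */
    static void SEND_GET(int server, const char *request, char *reply, size_t len)
    {
        printf("client -> server %d: %s\n", server, request);
        strncpy(reply, "ok", len - 1);                   /* pretend reply */
        reply[len - 1] = '\0';
    }

    static void GET_REQUEST(char *request, size_t len)
    {
        strncpy(request, "read /etc/passwd", len - 1);   /* pretend request */
        request[len - 1] = '\0';
    }

    static void SEND_REPLY(int client, const char *reply)
    {
        printf("server -> client %d: %s\n", client, reply);
    }

    /* A server built on these primitives spends its life in this loop. */
    static void server_loop(void)
    {
        char request[128];
        for (int i = 0; i < 1; i++) {                    /* one pass for the demo */
            GET_REQUEST(request, sizeof(request));
            /* ... carry out the work described in the request ... */
            SEND_REPLY(/*client=*/1, "done");
        }
    }

    int main(void)
    {
        char reply[64];
        SEND_GET(/*server=*/2, "create file x", reply, sizeof(reply));
        server_loop();
        return 0;
    }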

2.1.2 Remote Procedure Call (RPC)

The next step forward in message-passing systems is the realization that the model of "client sends request and blocks until server sends reply" looks very similar to a traditional procedure call from the client to the server. This model has become known in the literature as "remote procedure call" and has been widely discussed [Birrell and Nelson 1984; Nelson 1981; Spector 1982]. The idea is to make the semantics of intermachine communication as similar as possible to normal procedure calls because the latter is familiar and well understood, and has proved its worth over the years as a tool for dealing with abstraction. It can be viewed as a refinement of the reliable, blocking SEND-GET, GET-REQUEST, SEND-REPLY primitives, with a more user-friendly syntax.

The remote procedure call can be organized as follows. The client (calling program) makes a normal procedure call, say, p(x, y) on its machine, with the intention of invoking the remote procedure p on some other machine. A dummy or stub procedure p must be included in the caller's address space, or at least be dynamically linked to it upon call. This procedure, which may be automatically generated by the compiler, collects the parameters and packs them into a message in a standard format. It then sends the message to the remote machine (using SEND-GET) and blocks, waiting for an answer (see Figure 3).

[Figure 3. Remote procedure call.]

At the remote machine, another stub procedure should be waiting for a message using GET-REQUEST. When a message comes in, the parameters are unpacked by an input-handling procedure, which then makes the local call p(x, y). The remote procedure p is thus called locally, and so its normal assumptions about where to find parameters, the state of the stack, etc., are identical to the case of a purely local call. The only procedures that know that the call is remote are the stubs, which build and send the message on the client side and disassemble and make the call on the server side. The result of the procedure call follows an analogous path in the reverse direction.
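A client stub for p(x, y) might look roughly like the C sketch below: it packs the parameters into a message, hands the message to SEND_GET, and unpacks the result, so that to the caller p looks like an ordinary procedure. The message layout is assumed, and the server side is collapsed into a direct call so the example is self-contained.

    #include <stdio.h>

    /* A message in a "standard format": just the two integer parameters and,
     * on the way back, the result.  Real stubs would also identify the
     * procedure and worry about byte order; this layout is assumed. */
    struct message {
        int x, y;
        int result;
    };

    static int p_remote_body(int x, int y) { return x + y; }   /* the real p */

    /* Stand-in for the reliable SEND_GET primitive: here the server stub is
     * called directly instead of across the network. */
    static void SEND_GET(int server, struct message *m)
    {
        (void)server;
        /* Server stub: unpack parameters, make the local call, pack result. */
        m->result = p_remote_body(m->x, m->y);
    }

    /* Client stub for p(x, y): to the caller it is a normal procedure. */
    static int p(int x, int y)
    {
        struct message m;
        m.x = x;                      /* collect and pack the parameters */
        m.y = y;
        SEND_GET(/*server=*/2, &m);   /* send request, block for the answer */
        return m.result;              /* unpack and return the result */
    }

    int main(void)
    {
        printf("p(3, 4) = %d\n", p(3, 4));
        return 0;
    }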


Remote Procedure Call Design Issues. Although at first glance the remote procedure call model seems clean and simple, under the surface there are several problems. One problem concerns parameter (and result) passing. In most programming languages, parameters can be passed by value or by reference. Passing value parameters over the network is easy; the stub just copies them into the message and off they go. Passing reference parameters (pointers) over the network is not so easy. One needs a unique, systemwide pointer for each object so that it can be remotely accessed. For large objects, such as files, some kind of capability mechanism [Dennis and Van Horn 1966; Levy 1984; Pashtan 1982] could be set up, using capabilities as pointers. For small objects, such as integers and Booleans, the amount of overhead and mechanism needed to create a capability and send it in a protected way is so large that this solution is highly undesirable.

Still another problem that must be dealt with is how to represent parameters and results in messages. This representation is greatly complicated when different types of machines are involved in a communication. A floating-point number produced on one machine is unlikely to have the same value on a different machine, and even a negative integer will create problems between the 1's complement and 2's complement machines.

Converting to and from a standard format on every message sent and received is an obvious possibility, but it is expensive and wasteful, especially when the sender and receiver do, in fact, use the same internal format. If the sender uses its internal format (along with an indication of which format it is) and lets the receiver do the conversion, every machine must be prepared to convert from every other format. When a new machine type is introduced, much existing software must be upgraded. Any way it is done, with remote procedure call (RPC) or with plain messages, it is an unpleasant business.

Some of the unpleasantness can be hidden from the user if the remote procedure call mechanism is embedded in a programming language with strong typing, so that the receiver at least knows how many parameters to expect and what types they have. In this respect, a weakly typed language such as C, in which procedures with a variable number of parameters are common, is more complicated to deal with.

Still another problem with RPC is the issue of client-server binding. Consider, for example, a system with multiple file servers. If a client creates a file on one of the file servers, it is usually desirable that subsequent writes to that file go to the file server where the file was created. With mailboxes, arranging for this is straightforward. The client simply addresses the WRITE messages to the same mailbox that the CREATE message was sent to. Since each file server has its own mailbox, there is no ambiguity. When RPC is used, the situation is more complicated, since all the client does is put a procedure call such as

write(FileDescriptor, BufferAddress, ByteCount);

in his program. RPC intentionally hides all the details of locating servers from the client, but sometimes, as in this example, the details are important.

In some applications, broadcasting and multicasting (sending to a set of destinations, rather than just one) is useful. For example, when trying to locate a certain person, process, or service, sometimes the only approach is to broadcast an inquiry message and wait for the replies to come back. RPC does not lend itself well to sending messages to sets of processes and getting answers back from some or all of them. The semantics are completely different.

Despite all these disadvantages, RPC remains an interesting form of communication, and much current research is being addressed toward improving it and solving the various problems discussed above.

2.1.3 Error Handling

Error handling in distributed systems is radically different from that of centralized systems. In a centralized system, a system crash means that the client, server, and communication channel are all completely destroyed, and no attempt is made to revive them. In a distributed system, matters are more complex. If a client has initiated a remote procedure call with a server that has crashed, the client may just be left hanging forever unless a time-out is built in. However, such a time-out introduces race conditions in the form of clients that time out too quickly, thinking that the server is down, when in fact, it is merely very slow.

Client crashes can also cause trouble for servers. Consider, for example, the case of processes A and B communicating via the UNIX pipe model A | B with A the server and B the client. B asks A for data and gets a reply, but unless that reply is acknowledged somehow, A does not know when it can safely discard data that it may not be able to reproduce. If B crashes, how long should A hold onto the data? (Hint: If the answer is less than infinity, problems will be introduced whenever B is slow in sending an acknowledgment.)

Closely related to this is the problem of what happens if a client cannot tell whether or not a server has crashed. Simply waiting until the server is rebooted and trying again sometimes works and sometimes does not. This is a case in which it works: Client asks to read block 7 of some file. This is a case in which it does not work: Client says transfer a million dollars from one bank account to another. In the former case, it does not matter whether or not the server carried out the request before crashing; carrying it out a second time does no harm. In the latter case, one would definitely prefer the call to be carried out exactly once, no more and no less. Calls that may be repeated without harm (like the first example) are said to be idempotent. Unfortunately, it is not always possible to arrange for all calls to have this property. Any call that causes action to occur in the outside world, such as transferring money, printing lines, or opening a valve in an automated chocolate factory just long enough to fill exactly one vat, is likely to cause trouble if performed twice.

Spector [1982] and Nelson [1981] have looked at the problem of trying to make sure that remote procedure calls are executed exactly once, and they have developed taxonomies for classifying the semantics of different systems. These vary from systems that offer no guarantee at all (zero or more executions), to those that guarantee at most one execution (zero or one), to those that guarantee at least one execution (one or more).
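The difference matters as soon as a client retries after a suspected crash, as the small C sketch below suggests. Both "requests" are just stubs invented for the example; the point is only that the retry loop is harmless around the idempotent read but would be dangerous around the money transfer.

    #include <stdio.h>

    /* Two of the requests discussed in the text.  Re-reading a block is
     * idempotent; transferring money is not.  Both bodies are stubs. */
    static void read_block(int block)  { printf("read block %d\n", block); }
    static void transfer(long amount)  { printf("transfer %ld\n", amount); }

    /* Naive client retry loop: if no reply arrives within a time-out, send
     * the request again.  Safe only because read_block is idempotent;
     * wrapping transfer() in the same loop could move the money twice. */
    static void read_with_retry(int block, int max_tries)
    {
        for (int attempt = 0; attempt < max_tries; attempt++) {
            read_block(block);
            if (attempt == 1)      /* pretend the second reply gets through */
                return;
        }
    }

    int main(void)
    {
        read_with_retry(7, 3);   /* harmless even if the server did it twice */
        transfer(1000000);       /* must be carried out exactly once */
        return 0;
    }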

Getting it right (exactly one) is probably impossible, because even if the remote execution can be reduced to one instruction (e.g., setting a bit in a device register that opens the chocolate valve), one can never be sure after a crash if the system went down a microsecond before or a microsecond after the one critical instruction. Sometimes one can make a guess based on observing external events (e.g., looking to see whether the factory floor is covered with a sticky, brown material), but in general there is no way of knowing. Note that the problem of creating stable storage [Lampson 1981] is fundamentally different, since remote procedure calls to the stable storage server in that model never cause events external to the computers.

2.1.4 Implementation Issues

Constructing a system in principle is always easier than constructing it in practice. Building a 16-node distributed system that has a total computing power about equal to a single-node system is surprisingly easy. This observation leads to tension between the goals of making it work fast in the normal case and making the semantics reasonable when something goes wrong. Some experimental systems have put the emphasis on one goal and some on the other, but more research is needed before we have systems that are both fast and graceful in the face of crashes.

Some things have been learned from past work, however. Foremost among these is that making message passing efficient is very important. To this end, systems should be designed to minimize copying of data [Cheriton 1984a]. For example, a remote procedure call system that first copies each message from the user to the stub, from the stub to the kernel, and finally from the kernel to the network interface board requires three copies on the sending side, and probably three more on the receiving side, for a total of six. If the call is to a remote file server to write a 1K block of data to disk, at a copy time of 1 microsecond per byte, 6 milliseconds are needed just for copying, which puts an upper limit of 167 calls per second, or a throughput of 167 kilobytes per second. When other sources of overhead are considered (e.g., the reply message, the time waiting for access to the network, transmission time), achieving even 80 kilobytes per second will be difficult, if not impossible, no matter how high the network bandwidth or disk speed. Thus it is desirable to avoid copying, but this is not always simple to achieve since without copies, (part of) a needed message may be swapped or paged out when it is needed.

Another point worth making is that there is always a substantial fixed overhead with preparing, sending, and receiving a message, even a short message, such as a request to read from a remote file server. The kernel must be invoked, the state of the current process must be saved, the destination must be located, various tables must be updated, permission to access the network must be obtained (e.g., wait for the network to become free or wait for the token), and quite a bit of bookkeeping must be done.

This fixed overhead argues for making messages as long as possible, to reduce the number of messages. Unfortunately, many current local networks limit physical packets to 1K or 2K; 4K or 8K would be much better. Of course, if the packets become too long, a highly interactive user may occasionally be queued behind ten maximum-length packets, degrading response time; so the optimum size depends on the work load.
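The copying arithmetic given earlier in this subsection can be checked mechanically; the little program below simply reproduces the numbers from the text (six copies of a 1K block, taken here as 1000 bytes, at 1 microsecond per byte).

    #include <stdio.h>

    int main(void)
    {
        /* Figures from the text above. */
        double bytes_per_block = 1000.0;   /* "1K" block */
        double copies          = 6.0;      /* three on each side */
        double us_per_byte     = 1.0;

        double us_per_call   = bytes_per_block * copies * us_per_byte; /* 6000 us */
        double calls_per_sec = 1e6 / us_per_call;                      /* ~167    */
        double kbytes_per_s  = calls_per_sec * bytes_per_block / 1000.0;

        printf("%.0f ms per call, %.0f calls/s, about %.0f kilobytes/s\n",
               us_per_call / 1000.0, calls_per_sec, kbytes_per_s);
        return 0;
    }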

Virtual Circuits versus Datagrams

There is much controversy over whether remote procedure call ought to be built on top of a flow-controlled, error-controlled, virtual circuit mechanism or directly on top of the unreliable, connectionless (datagram) service. Saltzer et al. [1984] have pointed out that since high reliability can only be achieved by end-to-end acknowledgments at the highest level of protocol, the lower levels need not be 100 percent reliable. The overhead incurred in providing a clean virtual circuit upon which to build remote procedure calls (or any other message-passing system) is therefore wasted. This line of thinking argues for building the message system directly on the raw datagram interface.

The other side of the coin is that it would be nice for a distributed system to be able to encompass heterogeneous computers in different countries with different post, telephone, and telegraph (PTT) networks and possibly different national alphabets, and that this environment requires complex multilayered protocol structures. It is our observation that both arguments are valid, but, depending on whether one is trying to forge a collection of small computers into a virtual uniprocessor or merely access remote data transparently, one or the other will dominate.

Even if one opts for building RPC on top of the raw datagram service provided by a local network, there are still a number of protocols open to the implementer. The simplest one is to have every request and reply separately acknowledged. The message sequence for a remote procedure call is then: REQUEST, ACK, REPLY, ACK, as shown in Figure 4a. The ACKs are managed by the kernel without user knowledge.

[Figure 4. Remote procedure call (a) with individual acknowledgments per message, (b) with the reply as the request acknowledgment, (c) with no explicit acknowledgments.]

The number of messages can be reduced from four to three by allowing the REPLY to serve as the ACK for the REQUEST, as shown in Figure 4b. However, a problem arises when the REPLY can be delayed for a long time. For example, when a login process makes an RPC to a terminal server requesting characters, it may be hours or days before someone steps up to a terminal and begins typing. In this event, an additional message has to be introduced to allow the sending kernel to inquire whether the message has arrived or not.

A further step in the same direction is to eliminate the other ACK as well, and let the arrival of the next REQUEST imply an acknowledgment of the previous REPLY (see Figure 4c). Again, some mechanism is needed to deal with the case that no new REQUEST is forthcoming quickly.

One of the great difficulties in implementing efficient communication is that it is more of a black art than a science. Even straightforward implementations can have unexpected consequences, as the following example from Sventek et al. [1983] shows. Consider a ring containing a circulating token. To transmit, a machine captures and removes the token, puts a message on the network, and then replaces the token, thus allowing the next machine "downstream" the opportunity to capture it. In theory, such a network is "fair" in that each user has equal access to the network and no one user can monopolize it to the detriment of others. In practice, suppose that two users each want to read a long file from a file server. User A sends a request message to the server, and then replaces the token on the network for B to acquire.

After A's message arrives at the server, it takes a short time for the server to handle the incoming message interrupt and reenable the receiving hardware. Until the receiver is reenabled, the server is deaf. Within a microsecond or two of the time A puts the token back on the network, B sees and grabs it, and begins transmitting a request to the (unbeknown to B) deaf file server. Even if the server reenables halfway through B's message, the message will be rejected owing to missing header, bad frame format, and checksum error. According to the ring protocol, after sending one message, B must now replace the token, which A captures for a successful transmission. Once again B transmits during the server's deaf period, and so on.

Conclusion: B gets no service at all until A is finished. If A happens to be scanning through the Manhattan telephone book, B may be in for a long wait. This specific problem can be solved by inserting random delays in places to break the synchrony, but our point is that totally unexpected problems like this make it necessary to build and observe real systems to gain insight into the problems. Abstract formulations and simulations are not enough.

2.2 Naming and Protection

All operating systems support objects such as files, directories, segments, mailboxes, processes, services, servers, nodes, and I/O devices. When a process wants to access one of these objects, it must present some kind of name to the operating system to specify which object it wants to access. In some instances these names are ASCII strings designed for human use; in others they are binary numbers used only internally. In all cases they have to be managed and protected from misuse.

2.2.1 Naming as Mapping

Naming can best be seen as a problem of mapping between two domains. For example, the directory system in UNIX provides a mapping between ASCII path names and i-node numbers. When an OPEN system call is made, the kernel converts the name of the file to be opened into its i-node number. Internal to the kernel, files are nearly always referred to by i-node number, not ASCII string. Just about all operating systems have something similar. In a distributed system a separate name server is sometimes used to map user-chosen names (ASCII strings) onto objects in an analogous way.

Another example of naming is the mapping of virtual addresses onto physical addresses in a virtual memory system. The paging hardware takes a virtual address as input and yields a physical address as output for use by the real memory.

In some cases naming implies only a single level of mapping, but in other cases it can imply multiple levels. For example, to use some service, a process might first have to map the service name onto the name of a server process that is prepared to offer the service. As a second step, the server would then be mapped onto the number of the CPU on which that process is running. The mapping need not always be unique, for example, if there are multiple processes prepared to offer the same service.

2.2.2 Name Servers

In centralized systems, the problem of naming can be effectively handled in a straightforward way. The system maintains a table or database providing the necessary name-to-object mappings. The most straightforward generalization of this approach to distributed systems is the single name server model. In this model, a server accepts names in one domain and maps them onto names in another domain. For example, to locate services in some distributed systems, one sends the service name in ASCII to the name server, and it replies with the node number where that service can be found, or with the process name of the server process, or perhaps with the name of a mailbox to which requests for service can be sent. The name server's database is built up by registering services, processes, etc., that want to be publicly known. File directories can be regarded as a special case of name service.

Although this model is often acceptable in a small distributed system located at a single site, in a large system it is undesirable to have a single centralized component (the name server) whose demise can bring the whole system to a grinding halt. In addition, if it becomes overloaded, performance will degrade. Furthermore, in a geographically distributed system that may have nodes in different cities or even countries, having a single name server will be inefficient owing to the long delays in accessing it.

The next approach is to partition the system into domains, each with its own name server. If the system is composed of multiple local networks connected by gateways and bridges, it seems natural to have one name server per local network.

Nameserver 1 Nameserver 2 Nameserver 3
looks up a/b/c looks up b/c looks up c

a a

X > X

Figure 5.
>

>

Distributing
El C

the lookup of a/b/c over three name servers.

Having multiple name servers does not necessarily require having a single, global naming hierarchy. Another way to organize the name servers is to have each one effectively maintain a table of, for example, (ASCII string, pointer) pairs, where the pointer is really a kind of capability for any object or domain in the system. When a name, say a/b/c, is looked up by the local name server, it may well yield a pointer to another domain (name server), to which the rest of the name, b/c, is sent for further processing (see Figure 5). This facility can be used to provide links (in the UNIX sense) to files or objects whose precise whereabouts is managed by a remote name server. Thus if a file foobar is located in another local network, n, with name server n.s, one can make an entry in the local name server's table for the pair (x, n.s) and then access x/foobar as though it were a local object. Any appropriately authorized user or process knowing the name x/foobar could make its own synonym s and then perform accesses using s/x/foobar. Each name server parsing a name that involves multiple name servers just strips off the first component and passes the rest of the name to the name server found by looking up the first component locally.

A more extreme way of distributing the name server is to have each machine manage its own names. To look up a name, one broadcasts it on the network. At each machine, the incoming request is passed to the local name server, which replies only if it finds a match. Although broadcasting is easiest over a local network such as a ring net or CSMA net (e.g., Ethernet), it is also possible over store-and-forward packet switching networks such as the ARPANET [Dalal 1977].

Although the normal use of a name server is to map an ASCII string onto a binary number used internally to the system, such as a process identifier or machine number, once in a while the inverse mapping is also useful. For example, if a machine crashes, upon rebooting it could present its (hardwired) node number to the name server to ask what it was doing before the crash, that is, ask for the ASCII string corresponding to the service that it is supposed to be offering so that it can figure out what program to reboot.
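As an illustrative sketch of the delegation step (not taken from any of the systems discussed here; the class and object names are invented), a name server's table can map a path component either to a local object or to another name server, with each server stripping off the first component and forwarding the remainder:

    # Hypothetical sketch of per-domain name servers with delegation.
    class NameServer:
        def __init__(self, name):
            self.name = name
            self.table = {}          # component -> local object or another NameServer

        def register(self, component, target):
            self.table[component] = target

        def lookup(self, path):
            first, _, rest = path.partition("/")
            target = self.table[first]          # raises KeyError if the name is unknown
            if isinstance(target, NameServer):  # a pointer to another domain:
                return target.lookup(rest)      # pass on the rest of the name
            return target                       # local object found

    # Three name servers resolving a/b/c as in Figure 5.
    ns1, ns2, ns3 = NameServer("ns1"), NameServer("ns2"), NameServer("ns3")
    ns3.register("c", "capability-for-object-c")
    ns2.register("b", ns3)
    ns1.register("a", ns2)
    print(ns1.lookup("a/b/c"))   # -> capability-for-object-c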



2.3 Resource Management

Resource management in a distributed system differs from that in a centralized system in a fundamental way. Centralized systems always have tables that give complete and up-to-date status information about all the resources being managed; distributed systems do not. For example, the process manager in a traditional centralized operating system normally uses a "process table" with one entry per potential process. When a new process has to be started, it is simple enough to scan the whole table to see whether a slot is free. A distributed operating system, on the other hand, has a much harder job of finding out whether a processor is free, especially if the system designers have rejected the idea of having any central tables at all, for reasons of reliability. Furthermore, even if there is a central table, recent events on outlying processors may have made some table entries obsolete without the table manager knowing it.

The problem of managing resources without having accurate global state information is very difficult. Relatively little work has been done in this area. In the following sections we look at some work that has been done, including distributed process management and scheduling.

2.3.1 Processor Allocation

One of the key resources to be managed in a distributed system is the set of available processors. One approach that has been proposed for keeping tabs on a collection of processors is to organize them in a logical hierarchy independent of the physical structure of the network, as in MICROS [Wittie and van Tilborg 1980]. This approach organizes the machines like people in corporate, military, academic, and other real-world hierarchies. Some of the machines are workers and others are managers.

For each group of k workers, one manager machine (the "department head") is assigned the task of keeping track of who is busy and who is idle. If the system is large, there will be an unwieldy number of department heads; so some machines will function as "deans," riding herd on k department heads. If there are many deans, they too can be organized hierarchically, with a "big cheese" keeping tabs on k deans. This hierarchy can be extended ad infinitum, with the number of levels needed growing logarithmically with the number of workers. Since each processor need only maintain communication with one superior and k subordinates, the information stream is manageable.

An obvious question is, "What happens when a department head, or worse yet, a big cheese, stops functioning (crashes)?" One answer is to promote one of the direct subordinates of the faulty manager to fill in for the boss. The choice of which one can either be made by the subordinates themselves, by the deceased's peers, or in a more autocratic system, by the sick manager's boss.

To avoid having a single (vulnerable) manager at the top of the tree, one can truncate the tree at the top and have a committee as the ultimate authority. When a member of the ruling committee malfunctions, the remaining members promote someone one level down as a replacement. Although this scheme is not completely distributed, it is feasible and works well in practice. In particular, the system is self-repairing, and can survive occasional crashes of both workers and managers without any long-term effects.

In MICROS, the processors are monoprogrammed, so if a job requiring S processes suddenly appears, the system must allocate S processors for it. Jobs can be created at any level of the hierarchy. The strategy used is for each manager to keep track of approximately how many workers below it are available (possibly several levels below it). If it thinks that a sufficient number are available, it reserves some number R of them, where R >= S, because the estimate of available workers may not be exact and some machines may be down.

If the manager receiving the request thinks that it has too few processors available, it passes the request upward in the tree to its boss. If the boss cannot handle it either, the request continues propagating upward until it reaches a level that has enough available workers at its disposal. At that point, the manager splits the request into parts and parcels them out among the managers below it, which then do the same thing until the wave of scheduling requests hits bottom. At the bottom level, the processors are marked as "busy," and the actual number of processors allocated is reported back up the tree.
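The following sketch illustrates the upward propagation (it is only an illustration; the names and numbers are invented and are not taken from MICROS): each manager keeps an estimate of the available workers below it, reserves R >= S of them if it can, and otherwise escalates the request to its boss.

    # Hypothetical sketch of hierarchical processor allocation.
    class Node:
        def __init__(self, name, workers_available=0, boss=None):
            self.name = name
            self.workers_available = workers_available   # an estimate; it may be stale
            self.boss = boss

        def allocate(self, S, slack=1.2):
            R = int(S * slack)            # reserve R >= S to allow for stale estimates
            if self.workers_available >= R:
                self.workers_available -= R
                print(f"{self.name} reserves {R} workers for a job of size {S}")
                return R
            if self.boss is not None:
                return self.boss.allocate(S, slack)       # escalate up the tree
            raise RuntimeError("not enough workers anywhere in the hierarchy")

    big_cheese = Node("big cheese", workers_available=400)
    dean = Node("dean", workers_available=60, boss=big_cheese)
    head = Node("department head", workers_available=8, boss=dean)
    head.allocate(30)    # too big for the department head; the dean handles it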



To make this strategy work well, R must be large enough so that the probability is high that enough workers will be found to handle the whole job. Otherwise, the request will have to move up one level in the tree and start all over, wasting considerable time and computing power. On the other hand, if R is too large, too many processors will be allocated, wasting computing capacity until word gets back to the top and they can be released.

The whole situation is greatly complicated by the fact that requests for processors can be generated randomly anywhere in the system, so at any instant, multiple requests are likely to be in various stages of the allocation algorithm, potentially giving rise to out-of-date estimates of available workers, race conditions, deadlocks, and more. In Van Tilborg and Wittie [1981] a mathematical analysis of the problem is given and various other aspects not described here are covered in detail.

2.3.2 Scheduling

The hierarchical model provides a general model for resource control but does not provide any specific guidance on how to do scheduling. If each process uses an entire processor (i.e., no multiprogramming), and each process is independent of all the others, any process can be assigned to any processor at random. However, if it is common that several processes are working together and must communicate frequently with each other, as in UNIX pipelines or in cascaded (nested) remote procedure calls, then it is desirable to make sure that the whole group runs at once. In this section we address that issue.

Let us assume that each processor can handle up to N processes. If there are plenty of machines and N is reasonably large, the problem is not finding a free machine (i.e., a free slot in some process table), but something more subtle. The basic difficulty can be illustrated by an example in which processes A and B run on one machine and processes C and D run on another. Each machine is time shared in, say, 100-millisecond time slices, with A and C running in the even slices, and B and D running in the odd ones, as shown in Figure 6a. Suppose that A sends many messages or makes many remote procedure calls to D. During time slice 0, A starts up and immediately calls D, which unfortunately is not running because it is now C's turn. After 100 milliseconds, process switching takes place, and D gets A's message, carries out the work, and quickly replies. Because B is now running, it will be another 100 milliseconds before A gets the reply and can proceed. The net result is one message exchange every 200 milliseconds. What is needed is a way to ensure that processes that communicate frequently run simultaneously.

Figure 6. (a) Two jobs running out of phase with each other. (b) Scheduling matrix for eight machines, each with six time slots. The X's indicate allocated slots.

Although it is difficult to determine dynamically the interprocess communication patterns, in many cases a group of related processes will be started off together. For example, it is usually a good bet that the filters in a UNIX pipeline will communicate with each other more than they will with other, previously started processes. Let us assume that processes are created in groups, and that intragroup communication is much more prevalent than intergroup communication.



Let us further assume that a sufficiently large number of machines are available to handle the largest group, and that each machine is multiprogrammed with N process slots (N-way multiprogramming).

Ousterhout [1982] has proposed several algorithms based on the concept of co-scheduling, which takes interprocess communication patterns into account while scheduling to ensure that all members of a group run at the same time. The first algorithm uses a conceptual matrix in which each column is the process table for one machine, as shown in Figure 6b. Thus, column 4 consists of all the processes that run on machine 4. Row 3 is the collection of all processes that are in slot 3 of some machine, starting with the process in slot 3 of machine 0, then the process in slot 3 of machine 1, and so on. The gist of his idea is to have each processor use a round-robin scheduling algorithm with all processors first running the process in slot 0 for a fixed period, then all processors running the process in slot 1 for a fixed period, etc. A broadcast message could be used to tell each processor when to do process switching, to keep the time slices synchronized.

By putting all the members of a process group in the same slot number, but on different machines, one has the advantage of N-fold parallelism, with a guarantee that all the processes will be run at the same time, to maximize communication throughput. Thus in Figure 6b, four processes that must communicate should be put into slot 3, on machines 1, 2, 3, and 4 for optimum performance. This scheduling technique can be combined with the hierarchical model of process management used in MICROS by having each department head maintain the matrix for its workers, assigning processes to slots in the matrix and broadcasting time signals.

Ousterhout also described several variations to this basic method to improve performance. One of these breaks the matrix into rows and concatenates the rows to form one long row. With k machines, any k consecutive slots belong to different machines. To allocate a new process group to slots, one lays a window k slots wide over the long row such that the leftmost slot is empty but the slot just outside the left edge of the window is full. If sufficient empty slots are present in the window, the processes are assigned to the empty slots; otherwise the window is slid to the right and the algorithm repeated. Scheduling is done by starting the window at the left edge and moving rightward by about one window's worth per time slice, taking care not to split groups over windows. Ousterhout's paper discusses these and other methods in more detail and gives some performance results.
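As an illustration of the window variant just described (a sketch of ours, not Ousterhout's code; the slot contents and group size are invented), the long row can be represented as a list in which k consecutive slots always belong to different machines:

    # Hypothetical sketch of co-scheduling allocation over the concatenated rows.
    def allocate_group(row, k, group_size):
        """Slide a k-slot window over the long row; the leftmost slot must be empty
        and the slot just outside the left edge full.  Return the slot indices used."""
        for left in range(len(row) - k + 1):
            if row[left] is not None:                 # leftmost slot must be empty
                continue
            if left > 0 and row[left - 1] is None:    # slot outside the left edge must be full
                continue
            empty = [i for i in range(left, left + k) if row[i] is None]
            if len(empty) >= group_size:
                for i in empty[:group_size]:
                    row[i] = "G"                      # mark the slots taken by the group
                return empty[:group_size]
        return None                                   # no window fits; try another row

    # k = 8 machines, two matrix rows concatenated; some slots already occupied.
    row = ["A", "B", "C", None, None, None, "D", None,
           None, None, None, None, None, None, None, None]
    print(allocate_group(row, k=8, group_size=4))     # -> [3, 4, 5, 7]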
2.3.3 Load Balancing

The goal of Ousterhout's work is to place processes that work together on different processors, so that they can all run in parallel. Other researchers have tried to do precisely the opposite, namely, to find subsets of all the processes in the system that are working together, so that closely related groups of processes can be placed on the same machine to reduce interprocess communication costs [Chow and Abraham 1982; Chu et al. 1980; Gylys and Edwards 1976; Lo 1984; Stone 1977, 1978; Stone and Bokhari 1978]. Yet other researchers have been concerned primarily with load balancing, to prevent a situation in which some processors are overloaded while others are empty [Barak and Shiloh 1985; Efe 1982; Krueger and Finkel 1983; Stankovic and Sidhu 1984]. Of course, the goals of maximizing throughput, minimizing response time, and keeping the load uniform are to some extent in conflict, so many of the researchers try to evaluate different compromises and trade-offs.

Each of these different approaches to scheduling makes different assumptions about what is known and what is most important. The people trying to cluster processes to minimize communication costs, for example, assume that any process can run on any machine, that the computing needs of each process are known in advance, and that the interprocess communication traffic between each pair of processes is also known in advance. The people doing load balancing typically make the realistic assumption that nothing about the future behavior of a process is known. The minimizers are generally theorists, whereas the load balancers tend to be people making real systems who care less about optimality than about devising algorithms that can actually be used. Let us now briefly look at each of these approaches.



Figure 7. Two ways of statically allocating processes (nodes in the graph) to machines. Arcs show which pairs of processes communicate.

Graph-Theoretic Models. If the system consists of a fixed number of processes, each with known CPU and memory requirements, and a known matrix giving the average amount of traffic between each pair of processes, scheduling can be attacked as a graph-theoretic problem. The system can be represented as a graph, with each process a node and each pair of communicating processes connected by an arc labeled with the data rate between them.

The problem of allocating all the processes to k processors then reduces to the problem of partitioning the graph into k disjoint subgraphs, such that each subgraph meets certain constraints (e.g., total CPU and memory requirements below some limit). Arcs that are entirely within one subgraph represent internal communication within a single processor (=fast), whereas arcs that cut across subgraph boundaries represent communication between two processors (=slow). The idea is to find a partitioning of the graph that meets the constraints and minimizes the network traffic, or some variation of this idea. Figure 7a depicts a graph of interacting processes with one possible partitioning of the processes between two machines. Figure 7b shows a better partitioning, with less intermachine traffic, assuming that all the arcs are equally weighted. Many papers have been written on this subject, for example, Chow and Abraham [1982], Chow and Kohler [1979], Stone [1977, 1978], Stone and Bokhari [1978], and Lo [1984]. The results are somewhat academic, since in real systems virtually none of the assumptions (fixed number of processes with static requirements, known traffic matrix, error-free processors and communication) are ever met.
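A minimal sketch of the underlying optimization (the traffic numbers are invented and the exhaustive search is only workable at toy sizes; realistic versions add CPU and memory constraints and use heuristics): given the traffic matrix and a tentative assignment of processes to machines, the cost to minimize is the total weight of the arcs that cross machine boundaries.

    from itertools import product

    # Hypothetical traffic (messages/sec) between six processes, as in Figure 7.
    traffic = {("p1", "p2"): 3, ("p2", "p3"): 2, ("p3", "p4"): 5,
               ("p4", "p5"): 1, ("p5", "p6"): 4, ("p1", "p6"): 2}
    processes = ["p1", "p2", "p3", "p4", "p5", "p6"]

    def intermachine_traffic(assignment):
        """Sum the weights of arcs whose endpoints are on different machines."""
        return sum(w for (a, b), w in traffic.items()
                   if assignment[a] != assignment[b])

    # Try every balanced assignment of the processes to two machines (3 per machine).
    best = min((dict(zip(processes, choice))
                for choice in product([1, 2], repeat=len(processes))
                if sum(1 for m in choice if m == 1) == 3),
               key=intermachine_traffic)
    print(best, intermachine_traffic(best))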
Heuristic Load Balancing. When the goal of the scheduling algorithm is dynamic, heuristic load balancing, rather than finding related clusters, a different approach is taken. Here the idea is for each processor to estimate its own load continually, for processors to exchange load information, and for process creation and migration to utilize this information.

Various methods of load estimation are possible. One way is just to measure the number of runnable processes on each CPU periodically and take the average of the last n measurements as the load. Another way [Bryant and Finkel 1981] is to estimate the residual running times of all the processes and define the load on a processor as the number of CPU seconds that all its processes will need to finish. The residual time can be estimated most simply by assuming it is equal to the CPU time already consumed. Bryant and Finkel also discuss other estimation techniques in which both the number of processes and length of remaining time are important. When round-robin scheduling is used, it is better to be competing against one process that needs 100 seconds than against 100 processes that each need 1 second.

Once each processor has computed its load, a way is needed for each processor to find out how everyone else is doing. One way is for each processor to just broadcast its load periodically. After receiving a broadcast from a lightly loaded machine, a processor should shed some of its load by giving it to the lightly loaded processor. This algorithm has several problems. First, it requires a broadcast facility, which may not be available.




Second, it consumes considerable bandwidth for all the "here is my load" messages. Third, there is a great danger that many processors will try to shed load to the same (previously) lightly loaded processor at once.

A different strategy [Barak and Shiloh 1985; Smith 1979] is for each processor periodically to pick another processor (possibly a neighbor, possibly at random) and exchange load information with it. After the exchange, the more heavily loaded processor can send processes to the other one until they are equally loaded. In this model, if 100 processes are suddenly created in an otherwise empty system, after one exchange we will have two machines with 50 processes, and after two exchanges most probably four machines with 25 processes. Processes diffuse around the network like a cloud of gas.
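A toy simulation of this diffusion (an illustration of ours; the processor count and initial load are arbitrary) pairs processors at random and evens out their loads:

    import random

    # Hypothetical sketch of pairwise load exchange: 8 processors, 100 processes
    # suddenly created on processor 0.
    loads = [100, 0, 0, 0, 0, 0, 0, 0]

    def exchange_round(loads):
        order = list(range(len(loads)))
        random.shuffle(order)
        for a, b in zip(order[::2], order[1::2]):     # each processor picks a partner
            total = loads[a] + loads[b]
            loads[a], loads[b] = total // 2, total - total // 2   # even out the pair

    for r in range(5):
        exchange_round(loads)
        print(f"after round {r + 1}: {loads}")        # load diffuses like a cloud of gas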
Actually migrating running processes is trivial in theory, but close to impossible in practice. The hard part is not moving the code, data, and registers, but moving the environment, such as the current position within all the open files, the current values of any running timers, pointers or file descriptors for communicating with tape drives or other I/O devices, etc. All of these problems relate to moving variables and data structures related to the process that are scattered about inside the operating system. What is feasible in practice is to use the load information to create new processes on lightly loaded machines, instead of trying to move running processes.

If one has adopted the idea of creating new processes only on lightly loaded machines, another approach, called bidding, is possible [Farber and Larson 1972; Stankovic and Sidhu 1984]. When a process wants some work done, it broadcasts a request for bids, telling what it needs (e.g., a 68000 CPU, 512K memory, floating point, and a tape drive).

Other processors can then bid for the work, telling what their workload is, how much memory they have available, etc. The process making the request then chooses the most suitable machine and creates the process there. If multiple request-for-bid messages are outstanding at the same time, a processor accepting a bid may discover that the workload on the bidding machine is not what it expected because that processor has bid for and won other work in the meantime.

2.3.4 Distributed Deadlock Detection

Some theoretical work has been done in the area of detection of deadlocks in distributed systems. How applicable this work may be in practice remains to be seen. Two kinds of potential deadlocks are resource deadlocks and communication deadlocks. Resource deadlocks are traditional deadlocks, in which all of some set of processes are blocked waiting for resources held by other blocked processes. For example, if A holds X and B holds Y, and A wants Y and B wants X, a deadlock will result.

In principle, this problem is the same in centralized and distributed systems, but it is harder to detect in the latter because there are no centralized tables giving the status of all resources. The problem has mostly been studied in the context of database systems [Gligor and Shattuck 1980; Isloor and Marsland 1978; Menasce and Muntz 1979; Obermarck 1982].

The other kind of deadlock that can occur in a distributed system is a communication deadlock. Suppose A is waiting for a message from B and B is waiting for C and C is waiting for A. Then we have a deadlock. Chandy et al. [1983] present an algorithm for detecting (but not preventing) communication deadlocks. Very crudely summarized, they assume that each process that is blocked waiting for a message knows which process or processes might send the message. When a process logically blocks, they assume that it does not really block but instead sends a query message to each of the processes that might send it a real (data) message. If one of these processes is blocked, it sends query messages to the processes it is waiting for. If certain messages eventually come back to the original process, it can conclude that a deadlock exists. In effect, the algorithm is looking for a knot in a directed graph.
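The query-propagation idea can be sketched as follows (a much-simplified illustration of ours, not the algorithm of Chandy et al.; the wait-for relationships are invented, and for brevity the sketch only detects a cycle of blocked processes rather than a knot):

    # Hypothetical sketch of query propagation among blocked processes.
    # waiting_for[p] lists the processes p is blocked waiting to hear from.
    waiting_for = {"A": ["B"], "B": ["C"], "C": ["A"], "D": []}   # D is not blocked

    def deadlocked(origin):
        """Send a query from origin; if it ever comes back, a cycle of blocked
        processes exists."""
        visited = set()
        frontier = list(waiting_for[origin])
        while frontier:
            p = frontier.pop()
            if p == origin:
                return True                    # the query came back: deadlock
            if p in visited or not waiting_for[p]:
                continue                       # p is running, so it will eventually reply
            visited.add(p)
            frontier.extend(waiting_for[p])    # p is blocked too, so it forwards the query
        return False

    print(deadlocked("A"))   # True: A -> B -> C -> A
    print(deadlocked("D"))   # False: D is not waiting for anyone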

2.4 Fault Tolerance

Proponents of distributed systems often claim that such systems can be more reliable than centralized systems. Actually, there are at least two issues involved here: reliability and availability. Reliability has to do with the system not corrupting or losing one's data. Availability has to do with the system being up when it is needed. A system could be highly reliable in the sense that it never loses data, but at the same time be down most of the time and hence hardly usable. However, many people use the term "reliability" to cover availability as well, and we will not make the distinction either in the rest of the paper.

Distributed systems are potentially more reliable than a centralized system because if a system only has one instance of some critical component, such as a CPU, disk, or network interface, and that component fails, the system will go down. When there are multiple instances, the system may be able to continue in spite of occasional failures. In addition to hardware failures, one can also consider software failures. These are of two types: The software failed to meet the formal specification (implementation error), or the specification does not correctly model what the customer wanted (specification error). All work on program verification is aimed at the former, but the latter is also an issue. Distributed systems allow both hardware and software errors to be dealt with, albeit in somewhat different ways.

An important distinction should be made between systems that are fault tolerant and those that are fault intolerant. A fault-tolerant system is one that can continue functioning (perhaps in a degraded form) even if something goes wrong. A fault-intolerant system collapses as soon as any error occurs. Biological systems are highly fault tolerant; if you cut your finger, you probably will not die. If a memory failure garbles 1/10 of 1 percent of the program code or stack of a running program, the program will almost certainly crash instantly upon encountering the error.

It is sometimes useful to distinguish between expected faults and unexpected faults. When the ARPANET was designed, people expected to lose packets from time to time. This particular error was expected and precautions were taken to deal with it. On the other hand, no one expected a memory error in one of the packet-switching machines to cause that machine to tell the world that it had a delay time of zero to every machine in the network, which resulted in all network traffic being rerouted to the broken machine.

One of the key advantages of distributed systems is that there are enough resources to achieve fault tolerance, at least with respect to expected errors. The system can be made to tolerate both hardware and software errors, although it should be emphasized that in both cases it is the software, not the hardware, that cleans up the mess when an error occurs. In the past few years, two approaches to making distributed systems fault tolerant have emerged. They differ radically in orientation, goals, and attitude toward the theologically sensitive issue of the perfectability of mankind (programmers in particular). One approach is based on redundancy and the other is based on the notion of an atomic transaction. Both are described briefly below.

2.4.1 Redundancy Techniques

All the redundancy techniques that have emerged take advantage of the existence of multiple processors by duplicating critical processes on two or more machines. A particularly simple, but effective, technique is to provide every process with a backup process on a different processor. All processes communicate by message passing. Whenever anyone sends a message to a process, it also sends the same message to the backup process, as shown in Figure 8. The system ensures that neither the primary nor the backup can continue running until it has been verified that both have correctly received the message. Thus, if one process crashes because of any hardware fault, the other one can continue. Furthermore, the remaining process can then clone itself, making a new backup to maintain the fault tolerance in the future. Borg et al. [1983] have described a system using these principles.

Figure 8. Each process has its own backup process.
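A minimal sketch of the send path (an illustration of ours; the transport primitives are placeholders, not any particular kernel's interface): every message goes to both the primary and its backup, and the sender does not proceed until both copies are acknowledged.

    # Hypothetical sketch of duplicated sends to a primary process and its backup.
    def send(dest, msg):
        print(f"deliver {msg!r} to {dest}")
        return True                      # stand-in for a transport-level acknowledgment

    def reliable_send(primary, backup, msg):
        acked_primary = send(primary, msg)
        acked_backup = send(backup, msg)
        if not (acked_primary and acked_backup):
            raise RuntimeError("one copy was not delivered; neither side may proceed")
        return True

    reliable_send("worker@machine1", "worker-backup@machine2", "request 42")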
One disadvantage of duplicating every process is the extra processors required, but another, more subtle problem is that, if processes exchange messages at a high rate, a considerable amount of CPU time may go into keeping the processes synchronized at each message exchange. Powell and Presotto [1983] have described a redundant system that puts almost no additional load on the processes being backed up. In their system all messages sent on the network are recorded by a special "recorder" process (see Figure 9). From time to time, each process checkpoints itself onto a remote disk.

Figure 9. A recorder process copies and stores all network traffic without affecting the sender and receiver.

If a process crashes, recovery is done by sending the most recent checkpoint to an idle processor and telling it to start running. The recorder process then spoon feeds it all the messages that the original process received between the checkpoint and the crash. Messages sent by the newly restarted process are discarded. Once the new process has worked its way up to the point of crash, it begins sending and receiving messages normally, without help from the recording process.

The beauty of this scheme is that the only additional work that a process must do to become immortal is to checkpoint itself from time to time. In theory, even the checkpoints can be dispensed with, if the recorder process has enough disk space to store all the messages sent by all the currently running processes. If no checkpoints are made, when a process crashes, the recorder will have to replay the process's whole history.

When a process successfully terminates, the recorder no longer has to worry about having to rerun it; so all the messages that it received can be safely discarded. For servers and other processes that never terminate, this idea must be varied to avoid repeating individual transactions that have successfully completed.

One drawback of this scheme is that it relies on reliable reception of all messages all the time. In practice, local networks are very reliable, but they are not perfect. If occasional messages can be lost, the whole scheme becomes much less attractive.

Still, one has to be very careful about reliability, especially when the problem is caused by faulty software. Suppose that a processor crashes because of a software bug. Both the schemes discussed above [Borg et al. 1983; Powell and Presotto 1983] deal with crashes by allocating a spare processor and restarting the crashed program, possibly from a checkpoint. Of course the new processor will crash too, leading to the allocation of yet another processor and another crash. Manual intervention will eventually be required to figure out what is going on. If the hardware designers could provide a bit somewhere that tells whether a crash was due to hardware or software, it would be very helpful.

Both of the above techniques apply only to tolerance of hardware errors. It is also possible, however, to use redundancy in distributed systems to make systems tolerant of software errors. One approach is to structure each program as a collection of modules, each one with a well-defined function and a precisely specified interface to the other modules. Instead of writing a module only once, N programmers are asked to program it, yielding N functionally identical modules.

During execution, the program runs on N machines in parallel. After each module finishes, the machines compare their results and vote on the answer. If a majority of the machines say that the answer is X, then all of them use X as the answer, and all continue in parallel with the next module. In this manner the effects of an occasional software bug can be voted down. If formal specifications for any of the modules are available, the answers can also be checked against the specifications to guard against the possibility of accepting an answer that is clearly wrong.
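For illustration (a sketch of ours; the module versions are invented), the vote after each module amounts to taking the majority value among the N results:

    from collections import Counter

    # Hypothetical sketch of N-version programming: three independently written
    # versions of the same module, one of which contains a bug.
    def version_1(x): return x * x
    def version_2(x): return x ** 2
    def version_3(x): return x * x + 1        # buggy version

    def run_module(x, versions):
        results = [v(x) for v in versions]    # in reality, run on N machines in parallel
        answer, votes = Counter(results).most_common(1)[0]
        if votes <= len(versions) // 2:
            raise RuntimeError("no majority: the fault cannot be masked")
        return answer                         # the occasional bug is voted down

    print(run_module(6, [version_1, version_2, version_3]))   # -> 36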



A variation of this idea can be used to improve system performance. Instead of always waiting for all the processes to finish, as soon as k of them agree on an answer, those that have not yet finished are told to drop what they are doing, accept the value found by the k processes, and continue with the next module. Some work in this area is discussed by Avizienis and Chen [1977], Avizienis and Kelly [1984], and Anderson and Lee [1981].

2.4.2 Atomic Transactions

When multiple users on several machines are concurrently updating a distributed database and one or more machines crash, the potential for chaos is truly impressive. In a certain sense, the current situation is a step backward from the technology of the 1950s when the normal way of updating a database was to have one magnetic tape, called the "master file," and one or more tapes with updates (e.g., daily sales reports from all of a company's stores). The master tape and updates were brought to the computer center, which then mounted the master tape and one update tape, and ran the update program to produce a new master tape. This new tape was then used as the "master" for use with the next update tape.

This scheme had the very real advantage that if the update program crashed, one could always fall back on the previous master tape and the update tapes. In other words, an update run could be viewed as either running correctly to completion (and producing a new master tape) or having no effect at all (crash part way through, new tape discarded). Furthermore, update jobs from different sources always ran in some (undefined) sequential order. It never happened that two users would concurrently read a field in a record (e.g., 6), each add 1 to the value, and each store a 7 in that field, instead of the first one storing a 7 and the second storing an 8.

The property of run-to-completion or do-nothing is called an atomic update. The property of not interleaving two jobs is called serializability. The goal of people working on the atomic transaction approach to fault tolerance has been to regain the advantages of the old tape system, without giving up the convenience of databases on disk that can be modified in place, and to be able to do everything in a distributed way.

Lampson [1981] has described a way of achieving atomic transactions by building up a hierarchy of abstractions. We summarize his model below. Real disks can crash during READ and WRITE operations in unpredictable ways. Furthermore, even if a disk block is correctly written, there is a small (but nonzero) probability of it subsequently being corrupted by a newly developed bad spot on the disk surface. The model assumes that spontaneous block corruptions are sufficiently infrequent that the probability of two such events happening within some predetermined time T is negligible. To deal with real disks, the system software must be able to tell whether or not a block is valid, for example, by using a checksum.

The first layer of abstraction on top of the real disk is the "careful disk," in which every CAREFUL-WRITE is read back immediately to verify that it is correct. If the CAREFUL-WRITE persistently fails, the system marks the block as "bad" and then intentionally crashes. Since CAREFUL-WRITES are verified, CAREFUL-READS will always be good, unless a block has gone bad after being written and verified.

The next layer of abstraction is stable storage. A stable storage block consists of an ordered pair of careful blocks, which are typically corresponding careful blocks on different drives, to minimize the chance of both being damaged by a hardware failure. The stable storage algorithm guarantees that at least one of the blocks is always valid. The STABLE-WRITE primitive first does a CAREFUL-WRITE on one block of the pair, and then the other. If the first one fails, a crash is forced, as mentioned above, and the second one is left untouched.
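The layering can be sketched as follows (an illustration of ours, following the description above rather than Lampson's design; the "drives" are just in-memory dictionaries and a CRC stands in for block validation). The sketch covers both the two-copy write and the cleanup pass that is described next.

    import zlib

    # Hypothetical sketch of careful writes and stable storage on two toy drives.
    drive = [dict(), dict()]                   # block number -> (data, checksum)

    def careful_write(d, block, data):
        drive[d][block] = (data, zlib.crc32(data))       # write ...
        stored, crc = drive[d][block]                     # ... and read back to verify
        assert stored == data and crc == zlib.crc32(data)

    def careful_read(d, block):
        data, crc = drive[d][block]
        return data if zlib.crc32(data) == crc else None  # None means the block went bad

    def stable_write(block, data):
        careful_write(0, block, data)          # if this fails, a crash is forced and
        careful_write(1, block, data)          # the second copy is left untouched

    def cleanup(block):
        """Run after every crash and at least once every period T."""
        a, b = careful_read(0, block), careful_read(1, block)
        if a is None and b is not None:
            careful_write(0, block, b)         # replace the bad copy by the good one
        elif a is not None and (b is None or a != b):
            careful_write(1, block, a)         # crash fell between the two careful writes

    stable_write(7, b"new value of block 7")
    cleanup(7)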



After every crash, and at least once every time period T, a special cleanup process is run to examine each stable block. If both blocks are "good" and identical, nothing has to be done. If one is "good" and one is "bad" (failure during a CAREFUL-WRITE), the "bad" one is replaced by the "good" one. If both are "good" but different (crash between two CAREFUL-WRITES), the second is replaced by a copy of the first. This algorithm allows individual disk blocks to be updated atomically and survive infrequent crashes.

Stable storage can be used to create "stable processors" [Lampson 1981]. To make itself crashproof, a CPU must checkpoint itself on stable storage periodically. If it subsequently crashes, it can always restart itself from the last checkpoint. Stable storage can also be used to create stable monitors in order to ensure that two concurrent processes never enter the same critical region at the same time, even if they are running on different machines.

Given a way to implement crashproof processors (stable processors) and crashproof disks (stable storage), it is possible to implement multicomputer atomic transactions. Before updating any part of the data in place, a stable processor first writes an intentions list to stable storage, providing the new value for each datum to be changed. Then it sets a commit flag to indicate that the intentions list is complete. The commit flag is set by atomically updating a special block on stable storage. Finally it begins making all the changes called for in the intentions list. Crashes during this phase have no serious consequences because the intentions list is stored in stable storage. Furthermore, the actual making of the changes is idempotent, so repeated crashes and restarts during this phase are not harmful.

Atomic actions have been implemented in a number of systems (see, e.g., Fridrich and Older [1981, 1984], Mitchell and Dion [1982], Brown et al. [1985], Popek et al. [1981], and Reed and Svobodova [1981]).
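A compact sketch of the commit sequence (ours; here stable_write is a trivial stand-in assumed to behave like the stable storage primitive sketched earlier):

    # Hypothetical sketch of an atomic update via an intentions list and commit flag.
    database = {"x": 1, "y": 2}
    stable = {}                                   # stands in for blocks on stable storage

    def stable_write(key, value):                 # assume the two-copy discipline above
        stable[key] = value

    def run_transaction(updates):
        stable_write("intentions", dict(updates)) # 1. record every new value first
        stable_write("commit", True)              # 2. atomically set the commit flag
        redo()                                    # 3. apply the changes (idempotent)

    def redo():
        """Called after step 2 and again after any crash: replaying is harmless."""
        if stable.get("commit"):
            for key, value in stable["intentions"].items():
                database[key] = value             # making the changes is idempotent
            stable_write("commit", False)         # transaction finished

    run_transaction({"x": 10, "y": 20})
    redo()                                        # e.g., recovery after a crash: no harm
    print(database)                               # {'x': 10, 'y': 20}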
2.5 Services

In a distributed system, it is natural for user-level server processes to provide functions that have been traditionally provided by the operating system. This approach leads to a smaller (hence more reliable) kernel and makes it easier to provide, modify, and test new services. In the following sections, we look at some of these services, but first we look at how services and servers can be structured.

2.5.1 Server Structure

The simplest way to implement a service is to have one server that has a single, sequential thread of control. The main loop of the server looks something like this:

    while true do
    begin
      GetRequest;
      CarryOutRequest;
      SendReply
    end

This approach is simple and easy to understand, but has the disadvantage that if the server must block while carrying out the request (e.g., in order to read a block from a remote disk), no other requests from other users can be started, even if they could have been satisfied immediately. An obvious example is a file server that maintains a large disk block cache, but occasionally must read from a remote disk. In the time interval in which the server is blocked waiting for the remote disk to reply, it might have been able to service the next ten requests, if they were all for blocks that happened to be in the cache. Instead, the time spent waiting for the remote disk is completely wasted.

To eliminate this wasted time and improve the throughput of the server, the server can maintain a table to keep track of the status of multiple partially completed requests. Whenever a request requires the server to send a message to some other machine and wait for the result, the server stores the status of the partially completed request in the table and goes back to the top of the main loop to get the next message.

If the next message happens to be the reply from the other machine, that is fine and it is processed, but if it is a new request for service from a different client, that can also be started, and possibly completed before the reply for the first request comes in. In this way, the server is never idle if there is any work to be done.




Figure 10. The dispatcher task waits for requests and passes them on to the worker tasks.

Although this organization makes better use of the server's CPU, it makes the software much more complicated. Instead of doing nicely nested remote procedure calls to other machines whose services it needs, the server is back to using separate SEND and RECEIVE primitives, which are less structured.

One way of achieving both good performance and clean structure is to program the server as a collection of miniprocesses, which we call a cluster of tasks. Tasks share the same code and global data, but each task has its own stack for local variables and registers and, most important, its own program counter. In other words, each task has its own thread of control. Multiprogramming of the tasks can be done either by the operating system kernel or by a run time library within each process.

There are two ways of organizing the tasks. The first way is to assign one task the job of "dispatcher," as shown in Figure 10. The dispatcher is the only task that accepts new requests for work. After inspecting an incoming request, it determines if the work can be done without blocking (e.g., if a block to be read is present in the cache). If it can, the dispatcher just carries out the work and sends the reply. If the work requires blocking, the dispatcher passes the work to some other task in the cluster, which can start work on it. When that task blocks, task switching occurs, and the dispatcher or some other previously blocked task can now run. Thus waiting for a remote procedure call to finish only blocks one task, not the whole server.

The other way of organizing the server is to have each task capable of accepting new requests for work. When a message arrives, the kernel gives it at random to one of the tasks listening to the address or port to which the message was addressed. That task carries the work out by itself, and no dispatcher is needed.

Both of these schemes require some method of locking the shared data to prevent races. This locking can be achieved explicitly by some kind of LOCK and UNLOCK primitives, or implicitly by having the scheduler not stop any task while it is running. For example, task switching only occurs when a task blocks. With ordinary user programs, such a strategy is undesirable, but with a server whose behavior is well understood, it is not unreasonable.
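The dispatcher organization maps naturally onto a small thread-based sketch (ours; a real server would use the kernel's or run-time library's tasking primitives rather than Python threads, and the cache contents are invented):

    import queue, threading

    # Hypothetical sketch of a dispatcher task handing blocking work to worker tasks.
    work = queue.Queue()
    cache = {"block7": b"cached data"}

    def worker():
        while True:
            request = work.get()              # a worker blocks here, not the whole server
            print("worker reads from remote disk for", request)
            work.task_done()

    def dispatcher(requests):
        for request in requests:
            if request in cache:
                print("dispatcher replies from cache for", request)   # no blocking needed
            else:
                work.put(request)             # pass blocking work to a worker task

    for _ in range(2):
        threading.Thread(target=worker, daemon=True).start()

    dispatcher(["block7", "block9", "block7", "block12"])
    work.join()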
There are two ways of organizing the
tasks. The first way is to assign one task 25.2 File Service
the job of “dispatcher,” as shown in Figure
10. The dispatcher is the only task that There is little doubt that the most impor-
accepts new requests for work. After in- tant service in any distributed system is the
specting an incoming request, it determines file service. Many file services and file serv-
if the work can be done without blocking ers have been designed and implemented,
(e.g., if a block to be read is present in the so a certain amount of experience is avail-
cache). If it can, the dispatcher just carries able (e.g., Birrell and Needham [1980], Del-
out the work and sends the reply. If the lar [ 19821, Dion [1980], Fridrich and Older
work requires blocking, the dispatcher [ 19811,Fridrich and Older [ 19841, Mitchell
passes the work to some other task in the and Dion [ 19821, Mullender and Tanen-
cluster, which can start work on it. When baum [1985], Reed and Svobodova [1981],
that task blocks, task switching occurs, and Satyanarayanan et al. [1985], Schroeder et



Sturgis et al. [1980], Svobodova [1981], and Swinehart et al. [1979]). A survey about file servers can be found in Svobodova [1984].

File services can be roughly classified into two kinds, "traditional" and "robust." Traditional file service is offered by nearly all centralized operating systems (e.g., the UNIX file system). Files can be opened, read, and rewritten in place. In particular, a program can open a file, seek to the middle of the file, and update blocks of data within the file. The file server implements these updates by simply overwriting the relevant disk blocks. Concurrency control, if there is any, usually involves locking entire files before updating them.

Robust file service, on the other hand, is aimed at those applications that require extremely high reliability and whose users are prepared to pay a significant penalty in performance to achieve it. These file services generally offer atomic updates and similar features lacking in the traditional file service.

In the following paragraphs, we discuss some of the issues relating to traditional file service (and file servers) and then look at those issues that specifically relate to robust file service and servers. Since robust file service normally includes traditional file service as a subset, the issues covered in the first part also apply.

Conceptually, there are three components that a traditional file service normally has:

• disk service,
• flat file service,
• directory service.

The disk service is concerned with reading and writing raw disk blocks without regard to how they are organized. A typical command to the disk service is to allocate and write a disk block, and return a capability or address (suitably protected) so that the block can be read later.

The flat file service is concerned with providing its clients with an abstraction consisting of files, each of which is a linear sequence of records, possibly 1-byte records (as in UNIX) or client-defined records. The operations are reading and writing records, starting at some particular place in the file. The client need not be concerned with how or where the data in the file are stored.

The directory service provides a mechanism for naming and protecting files, so they can be accessed conveniently and safely. The directory service typically provides objects called directories that map ASCII names onto the internal identification used by the file service.

Design Issues. One important issue in a distributed system is how closely the three components of a traditional file service are integrated. At one extreme, the system can have distinct disk, file, and directory services that run on different machines and only interact via the official interprocess communication mechanism. This approach is the most flexible, because anyone needing a different kind of file service (e.g., a B-tree file) can use the standard disk server. It is also potentially the least efficient, since it generates considerable interserver traffic.

At the other extreme, there are systems in which all three functions are handled by a single program, typically running on a machine to which a disk is attached. With this model, any application that needs a slightly different file naming scheme is forced to start all over making its own private disk server. The gain, however, is increased run-time efficiency, because the disk, file, and directory services do not have to communicate over the network.

Another important design issue in distributed systems is garbage collection. If the directory and file services are integrated, it is a straightforward matter to ensure that, whenever a file is created, it is entered into a directory. If the directory system forms a rooted tree, it is always possible to reach every file from the root directory. However, if the file directory service and file service are distinct, it may be possible to create files and directories that are not reachable from the root directory. In some systems this may be acceptable, but in others unconnected files may be regarded as garbage to be collected by the system.
Another approach to the garbage collection problem is to forget about rooted trees altogether and permit the system to remove any file that has not been accessed for, say, five years. This approach is intended to deal with the situation of a client creating a temporary file and then crashing before recording its existence anywhere. When the client is rebooted, it creates a new temporary file, and the existence of the old one is lost forever unless some kind of time-out mechanism is used.

There are a variety of other issues that the designers of a distributed file system must address; for example, will the file service be virtual-circuit oriented or stateless? In the virtual-circuit approach, the client must do an OPEN on a file before reading it, at which time the file server fetches some information about the file (in UNIX terms, the i-node) into memory, and the client is given some kind of a connection identifier. This identifier is used in subsequent READS and WRITES. In the stateless approach each READ request identifies the file and file position in full, so the server need not keep the i-node in memory (although most servers will maintain a cache for efficiency reasons).

Both virtual-circuit and stateless file servers can be used with the ISO OSI and RPC models. When virtual circuits are used for communication, having the file server maintain open files is natural. However, each request message can also be self-contained so that the file server need not hold the file open throughout the communication session.

Similarly, RPC fits well with a stateless file server, but it can also be used with a file server that maintains open files. In the latter case the client does an RPC to the file server to OPEN the file and get back a file identifier of some kind. Subsequent RPCs can do READ and WRITE operations using this file identifier.

The difference between these two becomes clear when one considers the effects of a server crash on active clients. If a virtual-circuit server crashes and is then quickly rebooted, it will almost always lose its internal tables. When the next request comes in to read the current block from file identifier 28, it will have no way of knowing what to do. The client will receive an error message, which will generally lead to the client process aborting. In the stateless model each request is completely self-contained (file name, file position, etc.), so a newly reincarnated server will have no trouble carrying it out.

The price paid for this robustness, however, is a slightly longer message, since each file request must contain the full file name and position. Furthermore, the virtual-circuit model is sometimes less complex in environments in which the network can reorder messages, that is, deliver the second message before the first. Local networks do not have this defect, but some wide-area networks and internetworks do.
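The difference shows up directly in the request formats (a sketch of ours; the field names and path are invented): a stateless request names the file and position explicitly, whereas a virtual-circuit request refers to state the server must remember.

    # Hypothetical request formats for the two styles of file server.

    # Virtual-circuit style: the client first OPENs and then uses a connection
    # identifier; if the server reboots and loses its tables, identifier 28 means
    # nothing to it.
    open_request = {"op": "OPEN", "path": "/usr/ast/mailbox"}
    read_request = {"op": "READ", "file_id": 28, "count": 512}

    # Stateless style: every request is self-contained, so a newly reincarnated
    # server can carry it out without any remembered context.
    stateless_read = {"op": "READ", "path": "/usr/ast/mailbox",
                      "position": 4096, "count": 512}

    files = {"/usr/ast/mailbox": b"A" * 8192}        # toy file store

    def serve_stateless(req):
        data = files[req["path"]]
        return data[req["position"]: req["position"] + req["count"]]

    print(len(serve_stateless(stateless_read)))      # 512, no per-client state needed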
Protection. Another important issue faced by all file servers is access control: who is allowed to read and write which file. In centralized systems, the same problem exists and is solved by using either an access control list or capabilities. With access control lists, each file is associated with a list of users who may access it. The UNIX RWX bits are a simple form of access control list that divides all users into three categories: owner, group, and others. With capabilities, a user must present a special "ticket" on each file access proving that he or she has access permission. Capabilities are normally maintained in the kernel to prevent forgery.

With a distributed system using remote file servers, both of these approaches have problems. With access control lists the file server has to verify that the user in fact is who he or she claims to be. With capabilities, how do you prevent users from making them up?

One way to make access control lists viable is to insist that the client first set up an authenticated virtual circuit with the file server. The authentication may involve a trusted third party as in Birrell et al. [1982, 1984]. When remote procedure calls are used, setting up an authenticated session in advance is less attractive. The problem of authentication using RPC is discussed by Birrell [1985].

With capabilities, the protection normally results from the fact that the kernel can be trusted. With personal computers on a network, how can the file server trust the kernel?
After all, a user can easily boot up a nonstandard kernel on his or her machine. A possible solution is to encrypt the capabilities, as discussed by Mullender and Tanenbaum [1984, 1985, 1986] and Tanenbaum et al. [1986].

Performance. Performance is one of the key problems in using remote file servers (especially from diskless workstations). Reading a block from a local disk requires a disk access and a small amount of CPU processing. Reading from a remote server has the additional overhead of getting the data across the network. This overhead has two components: the actual time to move the bits over the wire (including contention resolution time, if any) and the CPU time the file server must spend running the protocol software.

Cheriton and Zwaenepoel [1983] describe measurements of network overhead in the context of the V system. With an 8-megahertz 68000 processor and a 10-megabyte-per-second Ethernet, they observe that reading a 512-byte block from the local machine takes 1.3 milliseconds and from a remote machine 5.7 milliseconds, assuming that the block is in memory and no disk access is needed. They also observe that loading a 64K program from a remote file server takes 255 milliseconds versus 60 milliseconds locally, when transfers are in 16K units. A tentative conclusion is that access to a remote file server is four times as expensive as to a local one. (It is also worth noting that the V designers have gone to great lengths to achieve good performance; many other file servers are much slower than V's.)

One way to improve the performance of a distributed file system is to have both clients and servers maintain caches of disk blocks and possibly whole files. However, maintaining distributed caches has a number of serious problems. The worst of these is, "What happens when someone modifies the 'master copy' on the disk?" Does the file server tell all the machines maintaining caches to purge the modified block or file from their caches by sending them "unsolicited messages" as in XDFS [Sturgis et al. 1980]? How does the server even know who has a cache? Introducing a complex centralized administration to keep track is probably not the way to go. Furthermore, even if the server did know, having the server initiate contact with its clients is certainly an unpleasant reversal of the normal client-server relationship, in which clients make remote procedure calls on servers, but not vice versa. More research is needed in this area before we have a satisfactory solution. Some results are presented by Schroeder et al. [1985].

Reliability. Reliability is another key design issue. The simplest approach is to design the system carefully, use good quality disks, and make occasional tape backups. If a disk ever gets completely wiped out because of hardware failure, all the work done since the last tape backup is lost. Although this mode of operation may seem scary at first, nearly all centralized computer systems work this way, and with a mean time between failure of 20,000 or more hours for disks these days, it works pretty well in practice.

For those applications that demand a higher level of reliability, some distributed systems have a more robust file service, as mentioned at the beginning of this section. The simplest approach is mirrored disks: Every WRITE request is carried out in parallel on two disk drives. At every instant the two drives are identical, and either one can take over instantly for the other in the event of failure.

A refinement of this approach is to have the file server offer stable storage and atomic transactions, as discussed earlier. Systems offering this facility are described by Brown et al. [1985], Dion [1980], Mitchell and Dion [1982], Needham and Herbert [1982], Reed and Svobodova [1981], Sturgis et al. [1980], and Svobodova [1981]. A detailed comparison of a number of file servers offering sophisticated concurrency control and atomic update facilities is given by Svobodova [1984]. We just touch on a few of the basic concepts here.

At least four different kinds of files can be supported by a file server. Ordinary files consist of a sequence of disk blocks that may be updated in place and that may be destroyed by disk or server crashes.




All robust file servers need some mechanism for handling concurrent updates to a file or group of files. Many of them allow users to lock a file, page, or record to prevent conflicting writes. Locking introduces the problem of deadlocks, which can be dealt with by using two-phase locking [Eswaran et al. 1976] or timestamps [Reed 1983].

When the file system consists of multiple servers working in parallel, it becomes possible to enhance reliability by replicating some or all files over multiple servers. Reading also becomes easier because the workload can now be split over two servers, but writing is much harder because multiple copies must be updated simultaneously, or this effect simulated somehow.

One approach is to distribute the data but keep some of the control information (semi-)centralized. In LOCUS [Popek et al. 1981; Walker et al. 1983], for example, files can be replicated at many sites, but when a file is opened, the file server at one site examines the OPEN request, the number and status of the file's copies, and the state of the network. It then chooses one site to carry out the OPEN and the subsequent READs and WRITEs. The other sites are brought up to date later.

2.5.3 Print Service

Compared with file service, on which a great deal of time and energy has been expended by a large number of people, the other services seem rather meager. Still, it is worth saying at least a little bit about a few of the more interesting ones.

Nearly all distributed systems have some kind of print service to which clients can send files, file names, or capabilities for files with instructions to print them on one of the available printers, possibly with some text justification or other formatting beforehand. In some cases the whole file is sent to the print server in advance, and the server must buffer it. In other cases only the file name or capability is sent, and the print server reads the file block by block as needed. The latter strategy eliminates the need for buffering (read: a disk) on the server side but can cause problems if the file is modified after the print command is given but prior to the actual printing. Users generally prefer "call-by-value" rather than "call-by-reference" semantics for printers.

One way to achieve the "call-by-value" semantics is to have a printer spooler server. To print a file, the client process sends the file to the spooler. When the file has been copied to the spooler's directory, an acknowledgment is sent back to the client.

The actual print server is then implemented as a print client. Whenever the print client has nothing to print, it requests another file or block of a file from the print spooler, prints it, and then requests the next one. In this way the print spooler is a server to both the client and the printing device.

Printer service is discussed by Janson et al. [1983] and Needham and Herbert [1982].

2.5.4 Process Service

Every distributed operating system needs some mechanism for creating new processes. At the lowest level, deep inside the system kernel, there must be a way of creating a new process from scratch. One way is to have a FORK call, as UNIX does, but other approaches are also possible. For example, in Amoeba, it is possible to ask the kernel to allocate chunks of memory of given sizes. The caller can then read and write these chunks, loading them with the text, data, and stack segments for a new process. Finally, the caller can give the filled-in segments back to the kernel and ask for a new process built up from these pieces. This scheme allows processes to be created remotely or locally, as desired.
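In outline, building a process out of filled-in segments might look like the following sketch. The calls alloc_segment, write_segment, and make_process are hypothetical names for the kind of kernel interface just described (not Amoeba's actual routines), and they are stubbed out locally so the example compiles.

    #include <stdio.h>

    typedef int segment_t;    /* handle for an allocated chunk of memory */
    typedef int process_t;    /* handle for a newly created process      */

    /* Stubs standing in for the kernel; names and signatures are
       illustrative only. */
    static segment_t alloc_segment(long size) { static int n; (void)size; return ++n; }
    static int write_segment(segment_t s, long off, const void *buf, long len)
    { (void)s; (void)off; (void)buf; (void)len; return 0; }
    static process_t make_process(segment_t text, segment_t data, segment_t stack)
    { (void)text; (void)data; (void)stack; return 42; }

    int main(void)
    {
        const char code[] = "\x90\x90";        /* machine code for the new process */
        const char init[] = "initialized data";

        /* Ask the kernel for three chunks of memory ... */
        segment_t text  = alloc_segment(4096);
        segment_t data  = alloc_segment(4096);
        segment_t stack = alloc_segment(8192);

        /* ... fill them in with the text and data (the stack starts empty) ... */
        write_segment(text, 0, code, sizeof code);
        write_segment(data, 0, init, sizeof init);

        /* ... and hand them back to the kernel as a new process. */
        process_t p = make_process(text, data, stack);
        printf("created process %d\n", p);
        return 0;
    }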

At a higher level it is frequently useful to have a process server that one can ask whether there is a Pascal, TROFF, or some other service in the system. If there is, the request is forwarded to the relevant server. If not, it is the job of the process server to build a process somewhere and give it the request. After, say, a very large-scale integration (VLSI) design rule checking server has been created and has done its work, it may or may not be a good idea to keep it in the machine where it was created, depending on how much work (e.g., network traffic) is required to load it, and how often it is called. The process server could easily manage a server cache on a least recently used basis, so that servers for common applications are usually preloaded and ready to go. As special-purpose VLSI processors become available for compilers and other applications, the process server should be given the job of managing them in a way that is transparent to the system's users.

2.5.5 Terminal Service

How the terminals are tied to the system obviously depends to a large extent on the system architecture. If the system consists of a small number of minicomputers, each with a well-defined and stable user population, then each terminal can be hardwired to the computer that its user normally logs on to. If, however, the system consists entirely of a pool of processors that are dynamically allocated as needed, it is better to connect all the terminals to one or more terminal servers that serve as concentrators.

The terminal servers can also provide such features as local echoing, intraline editing, and window management, if desired. Furthermore, the terminal server can also hide the idiosyncrasies of the various terminals in use by mapping them all onto a standard virtual terminal. In this way the rest of the software deals only with the virtual terminal characteristics, and the terminal server takes care of the mappings to and from all the real terminals. The terminal server can also be used to support multiple windows per terminal, with each window acting as a virtual terminal.

2.5.6 Mail Service

Electronic mail is a popular application of computers these days. Practically every university computer science department in the Western world is on at least one international network for sending and receiving electronic mail. When a site consists of only one computer, keeping track of the mail is easy. When a site has dozens of computers spread over multiple local networks, however, users often want to be able to read their mail on any machine they happen to be logged on to. This desire gives rise to the need for a machine-independent mail service, rather like a print service that can be accessed systemwide. Almes et al. [1985] discuss how mail is handled in the Eden system.

2.5.7 Time Service

There are two ways to organize a time service. In the simplest way, clients can just ask the service what time it is. In the other way, the time service can broadcast the correct time periodically, to keep all the clocks on the other machines in sync. The time server can be equipped with a radio receiver tuned to WWV or some other transmitter that provides the exact time down to the microsecond.

Even with these two mechanisms, it is impossible to have all processes exactly synchronized. Consider what happens when a process requests the time of day from the time server. The request message comes in to the server, and a reply is sent back immediately. That reply must propagate back to the requesting process, cause an interrupt on its machine, have the kernel started up, and finally have the time recorded somewhere. Each of these steps introduces an unknown, variable delay.

On an Ethernet, for example, the amount of time required for the time server to put the reply message onto the network is nondeterministic and depends on the number of machines contending for access at that instant. If a large distributed system has only one time server, messages to and from it may have to travel a long distance and pass over store-and-forward gateways with variable queuing delays. If there are multiple time servers, they may get out of synchronization because their crystals run at slightly different rates. Einstein's special theory of relativity also puts constraints on synchronizing remote clocks.
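One common way to reduce the effect of these variable delays, sketched below, is to measure the round-trip time of the request and assume the reply spent roughly half of it in transit. The helper functions are invented and simulated locally; the residual error is still bounded only by the (variable) round-trip time, which is exactly the problem described above.

    #include <stdio.h>
    #include <time.h>

    /* local_clock() reads this machine's clock in milliseconds (simulated
       here with the C library clock); ask_time_server() stands in for the
       remote request and simply returns a fixed server time. */
    static long local_clock(void)      { return (long)clock() * 1000 / CLOCKS_PER_SEC; }
    static long ask_time_server(void)  { return 4242000; }   /* server's time, in ms */

    /* Estimate the current time, compensating for message delay by
       assuming the reply took about half the round trip. */
    long estimate_time(void)
    {
        long t0 = local_clock();               /* just before sending the request */
        long server_time = ask_time_server();  /* value carried in the reply      */
        long t1 = local_clock();               /* just after the reply arrives    */

        return server_time + (t1 - t0) / 2;
    }

    int main(void)
    {
        printf("estimated time: %ld ms\n", estimate_time());
        return 0;
    }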


The result of all these problems is that having a single global time is impossible. Distributed algorithms that depend on being able to find a unique global ordering of widely separated events may not work as expected. A number of researchers have tried to find solutions to the various problems caused by the lack of global time (see, e.g., Jefferson [1985], Lamport [1978, 1984], Marzullo and Owicki [1985], Reed [1983], and Reif and Spirakis [1984]).

2.5.8 Boot Service

The boot service has two functions: bringing up the system from scratch when the power is turned on and helping important services survive crashes. In both cases, it is helpful if the boot server has a hardware mechanism for forcing a recalcitrant machine to jump to a program in its own read-only memory (ROM) in order to reset it. The ROM program could simply sit in a loop waiting for a message from the boot service. The message would then be loaded into that machine's memory and executed as a program.

The second function alluded to above is the "immortality service." An important service could register with the boot service, which would then poll it periodically to see if it were still functioning. If not, the boot service could initiate measures to patch things up, for example, forcibly reboot it or allocate another processor to take over its work. To provide high reliability, the boot service should itself consist of multiple processors, each of which keeps checking that the others are still working properly.

2.5.9 Gateway Service

If the distributed system in question needs to communicate with other systems at remote sites, it may need a gateway server to convert messages and protocols from internal format to those demanded by the wide-area network carrier.

3. EXAMPLES OF DISTRIBUTED OPERATING SYSTEMS

Having disposed of the principles, it is now time to look at some actual distributed systems that have been constructed as research projects in universities around the world. Although many such projects are in various stages of development, space limitations prevent us from describing all of them in detail. Instead of saying a few words about each system, we have chosen to look at four systems that we consider representative. Our selection criteria were as follows. First, we only chose systems that were designed from scratch as distributed systems (systems that gradually evolved by connecting together existing centralized systems or are multiprocessor versions of UNIX were excluded). Second, we only chose systems that have actually been implemented; paper designs did not count. Third, we only chose systems about which a reasonable amount of information was available.

Even with these criteria, there were many more systems that could have been discussed. As an aid to the reader interested in pursuing this subject further, we provide here some references to other relevant work: Accent [Fitzgerald and Rashid 1985; Rashid and Robertson 1981], Argus [Liskov 1982, 1984; Liskov and Scheifler 1982; Oki et al. 1985], Chorus [Zimmermann et al. 1981], CRYSTAL [DeWitt et al. 1984], DEMOS [Powell and Miller 1983], Distributed UNIX [Luderer et al. 1981], HXDP [Jensen 1978], LOCUS [Popek et al. 1981; Walker et al. 1983; Weinstein et al. 1985], Meglos [Gaglianello and Katseff 1985], MICROS [Curtis and Wittie 1984; Mohan and Wittie 1985; Wittie and Curtis 1985; Wittie and van Tilborg 1980], RIG [Ball et al. 1976], Roscoe/Arachne [Finkel et al. 1979; Solomon and Finkel 1978, 1979], and the work at Xerox Palo Alto Research Center [Birrell 1985; Birrell and Nelson 1984; Birrell et al. 1984; Boggs et al. 1980; Brown et al. 1985; Swinehart et al. 1979].

The systems we examine here are the Cambridge Distributed Computing System, Amoeba, V, and Eden. The discussion of each system follows the list of topics treated above, namely, communication primitives, naming and protection, resource management, fault tolerance, and services.

3.1 The Cambridge Distributed Computing System

The Computing Laboratory at the University of Cambridge has been doing research in networks and distributed systems since the mid-1970s, first with the Cambridge ring and later with the Cambridge Distributed Computing System [Needham and Herbert 1982]. The Cambridge ring is not a token-passing ring, but rather contains several minipacket slots circulating around the ring. To send a packet, a machine waits until an empty slot passes by, then inserts a minipacket containing the source, destination, some flag bits, and 2 bytes of data. Although the 2-byte minipackets themselves are occasionally useful (e.g., for acknowledgments), several block-oriented protocols have been developed for reliably exchanging 2K packets by accumulating 1024 minipackets. The nominal ring bandwidth is 10 megabytes per second, but since each minipacket has 2 bytes of data and 3 bytes of overhead, the effective bandwidth is 4 megabytes per second.

The Cambridge ring project was very successful, with copies of the ring currently in operation at many universities and companies in the United Kingdom and elsewhere. The availability of the ring led to research on distributed computing systems, initially using nine Computer Automation LSI4 minicomputers and later using about a dozen Motorola 68000s, under the direction of Roger Needham.

The Cambridge system is primarily composed of two components: the processor bank and the servers. When a user logs in, he or she normally requests one machine from the processor bank, uses it as a personal computer for the entire work session, and returns it when logging out. Processors are not normally dynamically allocated for short periods of time. The servers are dedicated machines that provide various useful services, including file service, name service, boot service, etc. The number and location of these servers is relatively static.

3.1.1 Communication Primitives

Owing to the evolution from network to distributed system described earlier, the communication primitives are usually described as network protocols rather than language primitives. The choice of the primitives was closely tuned to the capabilities of the ring in order to optimize performance. Nearly all communication is built up from sending packets consisting of a 2-byte header, a 2-byte process identifier, up to 2048 data bytes, and a 2-byte checksum. On top of this basic packet protocol are a simple remote procedure call protocol and a byte stream protocol.

The basic packet protocol, which is a pure datagram system, is used by the single-shot protocol to build up something similar to a remote procedure call. It consists of having the client send a packet to the server containing the request, and then having the server send a reply. Some machines are multiprogrammed, so that the second minipacket is used to route the incoming packet to the correct process. The request packet itself contains a function code and the parameters, if any. The reply packet contains a status code and the result, if any. Clients do not acknowledge receipt of the result.

Some applications, such as terminal handling and file transfer, work better with a flow-controlled, virtual-circuit protocol. The byte stream protocol is used for these applications. This protocol is a full-duplex, connection-oriented protocol, with full flow control and error control.

3.1.2 Naming and Protection

Services can be located in the Cambridge system by using the name server. To look up a name, the client sends an ASCII string to the name server, which then looks it up in its tables and returns the machine number where the service is located, the port used to address it, and the protocol it expects. The name server stores service names as unstructured ASCII strings, which are simply matched against incoming requests character by character; that is, it does not manage hierarchical names. The name server itself has a fixed address that never changes, so this address may be embedded into programs.
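In outline, a client-side lookup might behave as sketched below. The structure and function names are invented for illustration; the Cambridge name server's real interface is a network protocol, not a C library.

    #include <stdio.h>
    #include <string.h>

    /* Result of a name server query, as described in the text: the machine
       where the service runs, the port used to address it, and the protocol
       it expects.  Field names are illustrative only. */
    struct ns_reply {
        int machine;
        int port;
        int protocol;
    };

    /* Simulated name server table; a real client would send the ASCII
       string to the name server machine and await its reply packet. */
    static int name_lookup(const char *service, struct ns_reply *out)
    {
        static const struct { const char *name; struct ns_reply r; } table[] = {
            { "fileserver",  { 12, 7, 1 } },
            { "printserver", { 30, 2, 1 } },
        };
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(table[i].name, service) == 0) {  /* exact, character-by-character match */
                *out = table[i].r;
                return 0;
            }
        return -1;                                      /* no such service */
    }

    int main(void)
    {
        struct ns_reply r;
        if (name_lookup("fileserver", &r) == 0)
            printf("fileserver: machine %d, port %d, protocol %d\n",
                   r.machine, r.port, r.protocol);
        return 0;
    }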

Although the service database is relatively static, from time to time names must be added to or deleted from the name server's database. Commands are provided for this purpose, but for protection reasons these commands may only be executed by the system administrator.

Finding the location of a service is only half the work. To use most services, a process must identify itself in an unforgeable way, so that the service can check to see whether that user is authorized. This identification is handled by the Active Name Server, which maintains a table of currently logged-in users. Each table entry has four fields: the user's login name, his or her session key (a big random number), the user's class (e.g., faculty, student), and a control key, as shown in Figure 11.

Figure 11. The Active Name Table.

    Login      Session   Class      Control
    MARVIN     91432     STUDENT    31513
    BARBARA    61300     STUDENT    27130
    AMY        42106     FACULTY    31616
    ...        61346     DIRECTOR   41940

To use a service, a user supplies the service with his login name, session key (obtained at login time), and class. The service can then ask the Active Name Server if such an entry exists. Since session keys are sparse, it is highly unlikely that a student will be able to guess the current session key for the computer center director, and thus be able to obtain services reserved for the director. The control key must be presented to change an entry, thus providing a mechanism to restrict changing the Active Name Server's table to a few people.

3.1.3 Resource Management

The main resource managed by the system is the processor bank, handled by a service called the resource manager. Usually a user requests a processor to be allocated at login time, and then loads it with a single-user operating system. The processor then becomes the user's personal computer for the rest of the login session.

The resource manager accepts requests to allocate a processor. In these requests the user specifies a CPU type (e.g., 68000), a list of attributes (e.g., memory size), and a program to be run. The resource manager then selects the most suitable CPU currently available for allocation. Various defaults are available, so, for example, a user can specify wanting to run TRIPOS (a straightforward single-user operating system), and the resource manager will select an appropriate CPU type if none has been specified.

The downloading of programs into processor bank machines is controlled by a server called the ancilla, although some of the machines have intelligent ring interfaces that actually do most of the work. The ancilla also helps simulate the machine's console and front panel, so that users have the same control over a processor bank machine as they would over real personal computers on their desks.

3.1.4 Fault Tolerance

The approach taken to fault tolerance in the Cambridge system is to make it easy to bring servers back up after a crash. When a ring interface detects a special minipacket whose source is the name server, it reboots the processor by forcing it to jump to a program in ROM. This program then sends a request to the boot server, which in turn goes to the name server asking for reverse name lookup. The name server then searches its tables to find the service that is running on the machine from which the reverse lookup request came. As soon as the reply comes in, the server knows what it is supposed to be doing and can request the resource manager and ancilla to download the appropriate program. When machines are physically reset or powered up, the same procedure is carried out automatically.
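The authentication check of Section 3.1.2 can be pictured as below. The structure layout, table contents, and the authorized() helper are illustrative only; the real check is a query sent to the Active Name Server over the ring.

    #include <stdio.h>
    #include <string.h>

    /* One entry of the Active Name Server's table (cf. Figure 11). */
    struct active_name {
        char login[16];     /* user's login name                   */
        long session;       /* session key: a big random number    */
        char class[16];     /* e.g., FACULTY, STUDENT              */
        long control;       /* control key, needed to change entry */
    };

    static const struct active_name table[] = {
        { "MARVIN", 91432, "STUDENT", 31513 },
        { "AMY",    42106, "FACULTY", 31616 },
    };

    /* A service validates a request by checking that the (login, session
       key, class) triple supplied by the client is currently in the table. */
    static int authorized(const char *login, long session, const char *class)
    {
        for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
            if (strcmp(table[i].login, login) == 0 &&
                table[i].session == session &&
                strcmp(table[i].class, class) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        printf("AMY with correct key: %s\n", authorized("AMY", 42106, "FACULTY") ? "ok" : "refused");
        printf("AMY with guessed key: %s\n", authorized("AMY", 11111, "FACULTY") ? "ok" : "refused");
        return 0;
    }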


Another area in which some effort has been put to make the system fault tolerant is the file system, which supports atomic updates on special files. This facility is described in the next section.

3.1.5 Services

We have already described several key servers, including the name server, resource manager, ancilla, and active name server. Other small servers include the time server, print server, login server, terminal server, and error server, which records system errors for maintenance purposes. The file server is examined here.

The file system started out with the idea of a single universal file server that provided basic storage service but a very primitive naming and protection system, coupled with single-user TRIPOS operating systems in the processor bank machines, in which the naming and directory management would be done. The CAP computer (a large research machine within the Cambridge Computing Laboratory that does not have any disks of its own) also uses the file server. After some experience with this model, it was decided to create a new server, known as the filing machine, as a front end to the file system to improve the performance (mostly by providing the filing machine with a large cache, something that the small user machines could not afford). The CAP machine, which has adequate memory, continues to use the file server directly. The position of the filing machine is shown in Figure 12.

Figure 12. The filing machine is positioned between the users and the file server. It maintains a block cache and handles ASCII names. (Diagram: processor bank machines -> filing machine, holding the block cache and ASCII names -> file server, holding regular files, special files, and index files.)

The universal file server supports one basic file type, with two minor variations. The basic file type is an unstructured file consisting of a sequence of 16-bit words, numbered from 0 to some maximum. Operations are provided for reading or writing arbitrary numbers of words, starting anywhere in the file. Each file is uniquely identified by a 64-bit PUID (Permanent User IDentifier) consisting of a 32-bit disk address and a 32-bit random number.

The first variation is the special file, which has the property that writes to it are atomic, that is, they will either succeed completely or not be done at all. They will never be partly completed, even in the face of server crashes.

The second variation is a file called an index, which is a special file consisting of a sequence of slots, each holding one PUID. When a file is created, the process creating it must specify an index and slot in that index into which the new file's PUID is stored. Since indexes are also files and as such have PUIDs themselves, an index may contain pointers (PUIDs) to other indices, allowing arbitrary directory trees and graphs to be built. One index is distinguished as being the root index, which has the property that the file server's internal garbage collector will never remove a file reachable from the root index.

In the initial implementation, the full code of the TRIPOS operating system was loaded into each pool processor. All of the directory management and handling of ASCII names was done on the processor bank machines. Unfortunately, this scheme had several problems. First, TRIPOS was rather large and filled up so much memory that little room was left for buffers, meaning that almost every read or write request actually caused a disk access (the universal file server has hardly any buffers). Second, looking up a name in the directory hierarchy required all the intermediate directories between the starting point and the file to be physically transported from the file server to a machine doing the search.
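The PUID and index structures described earlier in this section can be pictured concretely as follows; this is only an illustrative sketch of the data layout, not the file server's actual declarations.

    #include <stdint.h>
    #include <stdio.h>

    /* A 64-bit PUID: a 32-bit disk address plus a 32-bit random number.
       The random part makes valid PUIDs hard to guess. */
    struct puid {
        uint32_t disk_address;
        uint32_t random;
    };

    /* An index is itself a file: a sequence of slots, each holding one
       PUID.  Because an index has a PUID of its own, indices can point
       to other indices, giving directory trees or general graphs. */
    #define SLOTS 64
    struct index_file {
        struct puid slot[SLOTS];
    };

    int main(void)
    {
        struct index_file root = {{{0}}};

        /* Creating a file means storing its PUID into a chosen slot of a
           chosen index; here we just fill slot 3 of a local "root" index. */
        root.slot[3] = (struct puid){ 0x0001F2A0u, 0x9E3779B9u };

        printf("slot 3 -> disk address %#x, random part %#x\n",
               root.slot[3].disk_address, root.slot[3].random);
        return 0;
    }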


To get around these problems, a filing machine with a large cache was inserted in front of the file server. This improvement allowed programs to request files by name instead of PUID, with the name lookup occurring in the filing machine now. Owing to the large cache, most of the relevant directories are likely to be already present in the filing machine, thus eliminating much network traffic. Furthermore, it allowed the TRIPOS code in the user machines to be considerably stripped, since the directory management was no longer needed. It also allowed the file server to read and write in large blocks; this was previously possible, but rarely done because of lack of buffer space on the user side. The resulting improvements were substantial.

3.1.6 Implementation

As should be clear by now, the whole Cambridge system is a highly pragmatic design, which from its inception [Wilkes and Needham 1980] was designed to be actually used by a substantial user community. About 90 machines are connected by three rings now, and the system is fairly stable. A related research project was the connection of a number of Cambridge rings via a satellite [Adams et al. 1982]. Future research may include interconnection of multiple Cambridge rings using very-high-speed (2-megabit-per-second) lines.

3.2 Amoeba

Amoeba is a research project on distributed operating systems being carried out at the Vrije Universiteit in Amsterdam under the direction of Andrew Tanenbaum. Its goal is to investigate capability-based, object-oriented systems and to build a working prototype system to use and evaluate. It currently runs on a collection of 24 Motorola 68010 computers connected by a 10-megabytes-per-second local network.

Figure 13. The Amoeba architecture. (Diagram: workstations, a processor pool, and specialized servers (file, database, etc.) attached to the network.)

The Amoeba architecture consists of four principal components, as shown in Figure 13. First are the workstations, one per user, on which users can carry out editing and other tasks that require fast interactive response. Second are the pool processors, a group of CPUs that can be dynamically allocated as needed, used, and then returned to the pool. For example, the "make" command might need to do six compilations; so six processors could be taken out of the pool for the time necessary to do the compilation and then returned. Alternatively, with a five-pass compiler, 5 x 6 = 30 processors could be allocated for the six compilations, gaining even more speedup. Third are the specialized servers, such as directory, file, and block servers, database servers, bank servers, boot servers, and various other servers with specialized functions. Fourth are the gateways, which are used to link Amoeba systems at different sites (and, eventually, different countries) into a single, uniform system.

All the Amoeba machines run the same kernel, which primarily provides message-passing services and little else. The basic idea behind the kernel was to keep it small, not only to enhance its reliability, but also to allow as much as possible of the operating system to run as user processes, providing for flexibility and experimentation. Some of the research issues addressed by the project are how to put as much of the operating system as possible into user processes, how to use the processor pool, how to integrate the workstations and processor pool, and how to connect multiple Amoeba sites into a single coherent system using wide-area networks. All of these issues use objects and capabilities in a uniform way.

3.2.1 Communication Primitives

The conceptual model for Amoeba communication is the abstract data type or object model, in which clients perform operations on objects in a location-independent manner. To implement this model, Amoeba uses a minimal remote procedure call model for communication between clients and servers. The basic client primitive is to send a message of up to about 32 kilobytes to a server and then block waiting for the result. Servers use GET-REQUEST and PUT-REPLY to get new work and send back the results, respectively. These primitives are not embedded in a language environment with automatic stub generation. They are implemented as small library routines that are used to invoke the kernel directly from C programs.

All the primitives are reliable in the sense that detection and retransmission of lost messages, acknowledgment processing, and message-to-packet and packet-to-message management are all done transparently by the kernel. Messages are unbuffered. If a message arrives and no one is expecting it, the message is simply discarded. The sending kernel then times out and tries again. Users can specify how long the kernel should retransmit before giving up and reporting failure. The idea behind this strategy is that server processes are generally cloned N-fold, so normally there will be a server waiting. Since a message is discarded only if the system is badly overloaded, having the client time out and try again later is not a bad idea.

Although the basic message primitives are blocking, special provision is made for handling emergency messages. For example, if a database server is currently blocked waiting for a file server to get some data for it, and a user at a terminal hits the BREAK key (indicating that he or she wants to kill off the whole request), some way is needed to gracefully abort all the processes working on behalf of that request. In the Amoeba system the terminal server generates and sends a special EXCEPTION message, which causes an interrupt at the receiving process.

This message forces the receiver to stop working on the request and send an immediate reply with a status code of REQUEST ABORTED. If the receiver was also blocked waiting for a server, the exception is recursively propagated all the way down the line, forcing each server in turn to finish immediately. In this manner, all the nested processes terminate normally (with error status), so that little violence is done to the nesting structure. In effect, an EXCEPTION message does not terminate execution. Instead, it just says "Force normal termination immediately, even if you are not done yet, and return an error status."
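The basic blocking request-reply style of this section might be skeletonized as follows. The names do_request() and file_server() are stand-ins, not Amoeba's actual library routines, and the kernel's message passing is simulated by a direct local call so the sketch is self-contained.

    #include <stdio.h>

    enum { MAX_MSG = 32 * 1024 };      /* requests and replies are at most ~32K */

    /* Server side: in a real server this body sits between a get-request
       and a put-reply call, inside a loop that runs forever. */
    static void file_server(const char *request, char *reply)
    {
        snprintf(reply, MAX_MSG, "ok: handled \"%s\"", request);
    }

    /* Client side: send the message, then block until the reply arrives
       and overwrites the caller's buffer.  Retransmission of lost
       messages would be handled transparently by the kernel. */
    static int do_request(void (*server)(const char *, char *),
                          const char *request, char *reply)
    {
        server(request, reply);
        return 0;
    }

    int main(void)
    {
        char reply[MAX_MSG];
        do_request(file_server, "READ block 7 of file 42", reply);
        printf("client got: %s\n", reply);
        return 0;
    }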


3.2.2 Naming and Protection

All naming and protection issues in Amoeba are dealt with by a single, uniform mechanism: sparse capabilities [Tanenbaum et al. 1986]. The system supports objects such as directories, files, disk blocks, processes, bank accounts, and devices, but not small objects such as integers. Each object is owned by some service and managed by the corresponding server processes.

When an object is created, the process requesting its creation is given a capability for it. Using this capability, a process can carry out operations on the object, such as reading or writing the blocks of a file, or starting or stopping a process. The number and types of operations applicable to an object are determined by the service that created the object; a bit map in the capability tells which of those the holder of the capability is permitted to use. Thus the whole of Amoeba is based on the conceptual model of abstract data types managed by services, as mentioned above. Users view the Amoeba environment as a collection of objects, named by capabilities, on which they can perform operations. This is in contrast to systems in which the user view is a collection of processes connected by virtual circuits.

Each object has a globally unique name contained in its capabilities. Capabilities are managed entirely by user processes; they are protected cryptographically, not by any kernel-maintained tables or mechanisms. A capability has four fields, as shown in Figure 14:

The service port: a sparse address corresponding to the service that owns the object, such as a file or directory service.
The object number: an internal identifier that the service uses to tell which of its objects this is (comparable to the i-number in UNIX).
The rights field: a bit map telling which operations on the object are permitted.
The check field: a large random number used to authenticate the capability.

Figure 14. An Amoeba capability. (Fields: service port, 48 bits; object number, 24 bits; rights, 8 bits; check, 48 bits.)

When a server is asked to create an object, it picks an available slot in its internal tables (e.g., a free i-node, in UNIX terminology), puts the information about the new object there, and picks a new random number to be used exclusively to protect this new object. Each server is free to use any protection scheme that it wants to, but the normal one is for it to build a capability containing its port, the object number, the rights (initially all present), and a known constant. The two latter fields are then thoroughly mixed by encrypting them with the random number as key, which is then stored in the internal table.

Later, when a process performs an operation on the object, a message containing the object's capability is sent to the server. The server uses the (plaintext) object number to find the relevant internal table entry and extract the random number, which is then used to decrypt the rights and check fields. If the decryption yields the correct known constant, the rights field is believed and the server can easily check whether the requested operation is permitted. More details about protection of capabilities can be found in Mullender and Tanenbaum [1984, 1986] and Tanenbaum et al. [1986].

Capabilities can be stored in directories managed by the directory service. A directory is effectively a set of (ASCII string, capability) pairs. The most common directory operation is for a user to present the directory server with a capability for a directory (itself an object) and an ASCII string and ask for the capability that corresponds to that string in the given directory. Other operations are entering and deleting (ASCII string, capability) pairs. This naming scheme is flexible in that a directory may contain capabilities for an arbitrary mixture of object types and locations, but it is also uniform in that every object is controlled by a capability. A directory entry may, of course, be for another directory, and so it is simple to build up a hierarchical (e.g., UNIX-like) directory tree, or even more general naming graphs. Furthermore, a directory may also contain a capability for a directory managed by a different directory service. As long as all the directory services have the same interfaces with the user, one can distribute objects over directory services in an arbitrary way.
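The protection scheme just described can be sketched as below. The field sizes follow Figure 14, but everything else, in particular the use of a simple XOR in place of the servers' real mixing function, is an invented illustration rather than Amoeba's implementation.

    #include <stdint.h>
    #include <stdio.h>

    /* An Amoeba-style capability (cf. Figure 14). */
    struct capability {
        uint64_t port;       /* sparse address of the owning service   */
        uint32_t object;     /* which of the service's objects this is */
        uint8_t  rights;     /* bit map of permitted operations        */
        uint64_t check;      /* authenticates the rights field         */
    };

    #define KNOWN_CONSTANT 0x00C0FFEEuLL

    /* Trivial stand-in for the real encryption: mix the rights and the
       known constant under the object's secret random number. */
    static uint64_t mix(uint8_t rights, uint64_t secret)
    {
        return (((uint64_t)rights << 56) | KNOWN_CONSTANT) ^ secret;
    }

    /* The server creates a capability with all rights present ... */
    static struct capability create_cap(uint64_t port, uint32_t obj, uint64_t secret)
    {
        struct capability c = { port, obj, 0xFF, 0 };
        c.check = mix(c.rights, secret);
        return c;
    }

    /* ... and later validates an incoming capability by undoing the mix
       with the random number stored in its table for that object. */
    static int cap_valid(const struct capability *c, uint64_t secret)
    {
        uint64_t plain = c->check ^ secret;
        return (plain & 0x00FFFFFFFFFFFFFFuLL) == KNOWN_CONSTANT
            && (uint8_t)(plain >> 56) == c->rights;
    }

    int main(void)
    {
        uint64_t secret = 0x1234567890ABuLL;    /* kept only in the server's table */
        struct capability c = create_cap(42, 7, secret);
        printf("genuine capability accepted:  %d\n", cap_valid(&c, secret));
        c.rights = 0x01;                        /* altered without knowing the secret */
        printf("tampered capability accepted: %d\n", cap_valid(&c, secret));
        return 0;
    }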

3.2.3 Resource Management

Resource management in Amoeba is performed in a distributed way, again using capabilities. Each Amoeba machine (pool processor, workstation, etc.) runs a resource manager process that controls that machine. This process actually runs inside the kernel for efficiency reasons, but it uses the normal abstract data type interface with its clients. The key operations it supports are CREATE SEGMENT, WRITE SEGMENT, READ SEGMENT, and MAKE PROCESS. To create a new process, a process would normally execute CREATE SEGMENT three times for the child process's text, data, and stack segments, getting back one capability for each segment. Then it would fill each one in with that segment's initial data and finally perform MAKE PROCESS with these capabilities as parameters, getting back a capability for the new process.

Using the above primitives, it is easy to build a set of processes that share text and/or data segments. This facility is useful for constructing servers that consist internally of multiple miniprocesses (tasks) that share text and data. Each of these processes has its own stack and, most important, its own program counter, so that when one of them blocks on a remote procedure call, the others are not affected. For example, the file server might consist of 10 processes sharing a disk cache, all of which start out by doing a GET-REQUEST. When a message comes in, the kernel sees that ten processes are all listening to the port specified in the message; so it picks one process at random and gives it the message. This process then performs the requested operation, possibly blocking on remote procedure calls (e.g., calling the disk) while doing so, but leaving the other server processes free to accept and handle new requests.

At a higher level the processor pool is managed by a process server that keeps track of which processors are free and which are not. If an installation wants to multiprogram the processor pool machines, then the process server manages each process table slot on a pool processor as a virtual processor. One of the interesting research issues here is the interplay between the workstations and the processor pool; that is: When should a process be started up on the workstation and when should it be offloaded to a pool processor? Research has not yet yielded any definitive answers here, although it seems intuitively clear that highly interactive processes, such as screen editors, should be local to the workstation, and batchlike jobs, such as big compilations (e.g., UNIX "make"), should be run elsewhere.

Accounting. Amoeba provides a general mechanism for resource management and accounting in the form of the bank server, which manages "bank account" objects. Bank accounts hold virtual money, possibly in multiple currencies. The principal operation on bank account objects is transferring virtual money between accounts. For example, to pay for file storage, a file server might insist on payment in advance of X dollars per megabyte of storage, and a phototypesetter server might want a payment in advance of Y yen per page. The system management can decide whether or not dollars and zlotys are convertible, depending on whether or not it wants users to have separate quotas on disk space and typesetter pages, or just give each user a single budget to use as he or she sees fit.

The bank server provides a basic mechanism on top of which many interesting policies can be implemented. For example: If some resource is in short supply, are servers allowed to raise the price as a rationing mechanism? Do you get your money back when you release disk space? That is: Is the model one of clients and servers buying and selling blocks, or is it like renting something? If it is like renting, there will be a flow of money from users to the various servers, and so users need incomes to keep them going, rather than simply initial fixed budgets. When new users are added, virtual money has to be created for them. Does this lead to inflation? The possibilities here are legion.

3.2.4 Fault Tolerance

The basic idea behind fault tolerance in Amoeba is that machine crashes are infrequent, and that most users are not willing to pay a penalty in performance in order to make all crashes 100 percent transparent. Instead, Amoeba provides a boot service, with which servers can register. The boot service polls each registered server at agreed upon intervals. If the server does not reply properly within a specified time, the boot service declares the server to be broken and requests the process server to start up a new copy of the server on one of the pool processors.

To understand how this strategy affects clients, it is important to realize that Amoeba does not have any notion of a virtual circuit or a session. Each remote procedure call is completely self-contained and does not depend on any previous setup; that is, it does not depend on any volatile information stored in servers' memories. If a server crashes before sending a reply, the kernel on the client side will time out and try again. When the new server comes up, the client's kernel will discover this and send the request there, without the client even knowing that anything has happened. Of course, this approach does not always work, for example, if the request is not idempotent (the chocolate factory!) or if a sick disk head has just mechanically scraped all the bits from some disk surface, but it works much of the time and has zero overhead under normal conditions.
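One pass of the polling idea behind the boot service might be sketched as follows. The registration structure and the helper functions are invented stand-ins; a real boot service would poll over the network and ask the process server to create the replacement on a pool processor.

    #include <stdio.h>

    #define NSERVERS 3

    struct registration {
        const char *name;
        int interval;                 /* agreed-upon polling interval (seconds) */
    };

    /* Simulated poll: 1 means the server replied properly in time.  Here a
       name starting with 'x' is used to stand for a crashed server. */
    static int poll_server(const char *name)     { return name[0] != 'x'; }
    static void restart_server(const char *name) { printf("restarting %s on a pool processor\n", name); }

    static void boot_service_pass(const struct registration *reg, int n)
    {
        for (int i = 0; i < n; i++)
            if (!poll_server(reg[i].name))       /* no proper reply within the limit?     */
                restart_server(reg[i].name);     /* declare it broken and start a new copy */
    }

    int main(void)
    {
        struct registration reg[NSERVERS] = {
            { "directory", 10 }, { "bank", 30 }, { "xfile", 10 },
        };
        boot_service_pass(reg, NSERVERS);
        return 0;
    }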


3.2.5 Services

Amoeba has several kinds of block, file, and directory services. The simplest one is a server running on top of the Amoeba kernel that provides a file service functionally equivalent to the UNIX system call interface, to allow most UNIX programs to run on Amoeba with only the need to relink them with a special library.

A more interesting server, however, is FUSS (Free University Storage System), which views each file as a sequence of versions. A process can acquire a capability for a private copy of a new version, modify it, and then commit it in a single indivisible atomic action. Providing atomic commits at the file level (rather than only as a facility in some database systems) simplifies the construction of various servers, such as the bank server, that have to be highly robust. FUSS also supports multiple, simultaneous access using optimistic concurrency control. It is described in greater detail by Mullender and Tanenbaum [1985].

Other key services are the directory service, bank service, and boot service, all of which have already been discussed.

3.2.6 Implementation

The Amoeba kernel has been ported to five different CPUs: 68010, NS32016, 8088, VAX, and PDP-11. All the servers described above, except the boot server, have been written and tested, along with a number of others. Measurements have shown that a remote procedure call from user space on one 68010 to user space on a different 68010 takes just over 8 milliseconds (plus the time to actually carry out the service requested). The data rate between user processes on different machines has been clocked at over 250,000 bytes per second, which is about 20 percent of the raw network bandwidth, an exceptionally high value.

A library has been written to allow UNIX programs to run on Amoeba. A substantial number of utilities, including compilers, editors, and shells, are operational. A server has also been implemented on UNIX to allow Amoeba programs to put capabilities for UNIX files into their directories and use them without having to know that the files are actually located on a VAX running UNIX.

In addition to the UNIX emulation work, various applications have been implemented using pure Amoeba, including parallel traveling salesman and parallel alpha-beta search [Bal et al. 1985]. Current research includes connecting Amoeba systems at five locations in three countries using wide-area networks.

3.3 The V Kernel

The V kernel is a research project on distributed systems at Stanford University under the direction of David Cheriton [Cheriton 1984a; Cheriton and Mann 1984; Cheriton and Zwaenepoel 1984a, 1984b]. It was motivated by the increasing availability of powerful microcomputer-based workstations, which can be seen as an alternative to traditional time-shared minicomputers. The V kernel is an outgrowth of the experience acquired with earlier systems, Thoth [Cheriton 1982; Cheriton et al. 1979] and Verex.

The V kernel can be thought of as a software back plane, analogous to the Multibus or S-100 bus back planes. The function of a back plane is to provide an infrastructure for components (for hardware, boards; for software, processes) to communicate, and nothing else. Consequently, most of the facilities found in traditional operating systems, such as a file system, resource management, and protection, are provided in V by servers outside the kernel. In this respect V and Amoeba are conceptually very similar.

Another point on which V and Amoeba agree is the free-market model of services. Services such as the file system are, in principle, just ordinary user processes. Any user who is dissatisfied with the standard file system [Stonebraker 1981; Tanenbaum and Mullender 1982] is free to write his or her own. This view is in contrast to the "centrally planned economy" model of most time-sharing systems, which present the file system on a "like it or lump it" basis.


The V system consists of a collection of workstations (currently SUNs), each running an identical copy of the V kernel. The kernel consists of three components: the interprocess communication handler, the kernel server (for providing basic services, such as memory management), and the device server (for providing uniform access to I/O devices). Some of the workstations support an interactive user, whereas others function as file servers, print servers, and other kinds of servers, as shown in Figure 15. Unlike Amoeba, V does not have a processor pool.

Figure 15. A typical V configuration. (Diagram: workstations, file servers, and a print server attached to the network.)

3.3.1 Communication Primitives

The V communication primitives have been designed in accordance with the back-plane model mentioned above. They provide basic, but fast communication. To access a server, a client does SEND(message, pid), which transmits the fixed-length (32-byte) "message" to the server, and then blocks until the server has sent back a reply, which overwrites "message." The second parameter, "pid," is a 32-bit integer that uniquely identifies the destination process. A message may contain a kind of pseudopointer to one of the client's memory segments. This pseudopointer can be used to permit the server to read from or write to the client's memory. Such reads and writes are handled by kernel primitives COPYFROM and COPYTO. As an optimization, when a client does a SEND containing one of these pseudopointers with READ permission, the first 1K of the segment is piggybacked onto the message, on the assumption that the server will probably want to read it eventually. In this way, messages longer than 32 bytes can be achieved.

Servers use the RECEIVE and REPLY calls. The RECEIVE call can provide a segment buffer in addition to the regular message buffer, so that if (part of) a segment has been piggybacked onto the message, it will have a place to go. The REPLY call can also provide a segment buffer for the case in which the client provides a pseudopointer that the server can use to return results exceeding 32 bytes.

To make this communication system easier to use, calls to servers can be embedded in stubs so that the caller just sees an ordinary procedure call. Stub generation is not automated, however.

3.3.2 Naming and Protection

V has three levels of naming. At the bottom level, each process has a unique 32-bit pid, which is the address used to send messages to it. At the next level, services (i.e., processes that carry out requests for clients) can have symbolic (ASCII string) names in addition to their pids. A service can register a symbolic name with its kernel so that clients can use the symbolic name instead of the pid. When a client wants to access a service by its name, the client's kernel broadcasts a query to all the other kernels, to see where the server is. The (ServerName, pid) pair is then put in a cache for future use.

The top level of naming makes it possible to assign symbolic names to objects, such as files. Symbolic names are always interpreted in some "context," analogous to looking up a file name in some directory in other systems. A context is a set of records, each including the symbolic name, server's pid, context number, and object identifier. Each server manages its own contexts; there is no centralized "name server." A symbolic name is looked up in a context by searching all the records in that context for one whose name matches the given name. When a match is found, the context number and object identifier can be sent to the appropriate server to have some operation carried out.
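In outline, looking up a symbolic name in a context might proceed as in the sketch below; the record layout and the helper function are invented for illustration, not taken from V.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* One record of a context, as described above. */
    struct context_record {
        char     name[32];
        uint32_t server_pid;
        uint32_t context_number;
        uint32_t object_id;
    };

    /* A name is looked up by searching all the records in the context for
       one whose name matches; the (context number, object identifier) pair
       is then sent to the appropriate server. */
    static const struct context_record *context_lookup(
            const struct context_record *ctx, int n, const char *name)
    {
        for (int i = 0; i < n; i++)
            if (strcmp(ctx[i].name, name) == 0)
                return &ctx[i];
        return NULL;
    }

    int main(void)
    {
        struct context_record home[2] = {
            { "mailbox", 0x00010042u, 7, 1001 },
            { "thesis",  0x00010099u, 2, 3456 },
        };
        const struct context_record *r = context_lookup(home, 2, "thesis");
        if (r != NULL)
            printf("thesis: send (context %u, object %u) to pid %#x\n",
                   r->context_number, r->object_id, r->server_pid);
        return 0;
    }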


Names may be hierarchical, as in a/b/c. When a is looked up in some context, the result will probably be a new context, possibly managed by a new server on a different machine. In that case the remaining string, b/c, is passed on to that new server for further lookup, and so on.

It is also possible to prefix a symbolic name with an explicit context, as in [HomeDirectory] a/b/c, in which case the name is looked up in the context specified, rather than in the current context (analogous to the current working directory in other systems). A question that quickly arises is, "Who keeps track of the various context names, such as 'HomeDirectory' above?" The answer is that each workstation in the system has a Context Prefix Server, whose function is to map context names onto server names, so that the appropriate server can be found to interpret the name itself.

3.3.3 Resource Management

Each processor in V has a dedicated function, either as a user workstation or a file, print, or other dedicated server; so no form of dynamic processor allocation is provided. The key resources to be managed are processes, memory, and the I/O devices. Process and memory management is provided by the kernel server. I/O management is provided by the device server. Both of these are part of the kernel present on each machine, and are accessed via the standard message mechanism described above. They are special only in that they run in kernel mode and can get at the raw hardware.

Processes are organized into groups called teams. A team of processes share a common address space, and therefore must all run on the same processor. Application programs can make use of concurrency by running as a team of processes, each of which does part of the work. If one process in a team is blocked waiting for a reply to a message, the other ones are free to run. The kernel server is prepared to carry out operations such as creating new processes and teams, destroying processes and teams, reading and writing processes' states, and mapping processes onto memory.

All I/O in V is done using a uniform interface called the V I/O protocol. The protocol allows processes to read and write specific blocks on the device. This block orientation was chosen to provide idempotency. Terminal drivers must store the last block read and filter out duplicate requests in order to maintain the idempotency property. Implementation of byte streams is up to the users. The I/O protocol has proved general enough to handle disks, printers, terminals, and even a mouse.

3.3.4 Fault Tolerance

Since it was designed primarily for use in an interactive environment, V provides little in the way of fault tolerance. If something goes wrong, the user just does it again. V, however, does address exception handling. Whenever a process causes an exceptional condition to occur, such as stack overflow or referencing nonexistent memory, the kernel detecting the error sends a specially formatted message to the exception server, which is outside the kernel. The exception server can then invoke a debugger to take over. This scheme does not require a process to make any advance preparation for being debugged and, in principle, can allow the process to continue execution afterward.

3.3.5 Services

Since most of the V workstations do not have a disk, the central file server plays a key role in the system. The file server is not part of the operating system. Instead, it is just an ordinary user program running on top of the V kernel. Internally it is structured as a team of processes. The main process handles directory operations, including opening files; subsidiary processes perform the actual read and write commands, so that when one of them blocks waiting for a disk block, the others can continue operation. The members of the file server team share a common buffer cache, used to keep heavily used blocks in main memory.


The file system is a traditional hierarchical system, similar to that of Thoth [Cheriton 1982]. Each file has a file descriptor, similar to an i-node in UNIX, except that the file descriptors are gathered into an ordinary file, which can grow as needed.

Extensive measurements have been made of the performance of the file server. As an indication, it takes 7.8 milliseconds to read a 1K block from the file server when the block is in the cache. This time includes the communication and network overhead. When the block must be fetched from the disk, the time is increased to 35.5 milliseconds. Given that the access time of the small Winchester disks used on personal computers is rarely better than 40 milliseconds, it is clear that the V implementation of diskless workstations with a fast (18-millisecond) central file server is definitely competitive.

Other V servers include the print server, gateway server, and time server. Other servers are in the process of being developed.

3.3.6 Implementation

The V kernel has been up and running at Stanford University since September 1982. It runs on SUN Microsystems 68000-based workstations, connected by 3-megabit-per-second and 10-megabit-per-second Ethernets. The kernel is used as a base for a variety of projects at Stanford, including the research project on distributed operating systems. A great deal of attention has been paid to tuning the system to make it fast.

3.4 The Eden Project

The goal of the Eden system [Almes et al. 1985; Black 1983, 1985; Jessop et al. 1982; Lazowska et al. 1981], which is being developed at the University of Washington in Seattle under the direction of Guy Almes, Andrew Black, Ed Lazowska, and Jerre Noe, is to investigate logically integrated but physically distributed operating systems. The idea is to construct a system based on the principle of one user and one workstation (no processor pool), but with a high degree of systemwide integration.

Eden is object oriented, with all objects accessed by capabilities, which are protected by the Eden kernel. Eden objects, in contrast to, say, Amoeba objects, contain not only passive data, but also one or more processes that carry out the operations defined for the object. Objects are general: Applications programmers can determine what operations their objects will provide. Objects are also mobile, but at any instant each object (and all the processes it contains) resides on a single workstation.

Much more than most research projects of this kind, Eden was designed top down. In fact, the underlying hardware and language were radically changed twice during the project, without causing too much redesigning. This would have been much more difficult in a bottom-up, hardware-driven approach.

3.4.1 Communication Primitives

Communication in Eden uses "invocation," a form of remote procedure call. Programs are normally written in EPL, the Eden Programming Language, which is based on Concurrent Euclid. (The EPL translator is actually a preprocessor for Concurrent Euclid.) To perform an operation on an object, say, Lookup, on a directory object, the EPL programmer just calls Lookup, specifying a capability for the directory to be searched, the string to be searched for, and some other parameters.

The EPL compiler translates the call to Lookup to a call to a stub routine linked together with the calling procedure. This stub routine assembles the parameters and packs them in a standard form called ESCII (Eden Standard Code for Information Interchange), and then calls a lower level routine to transmit the function code and packed parameters to the destination machine.
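What such a generated stub does can be sketched, in outline, as follows. Real Eden stubs are produced by the EPL compiler and use the ESCII format; here C stands in for EPL, and the layout, names, and function code are invented for illustration.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* A packed invocation: the function code plus the capability for the
       target object and the packed parameters. */
    struct packed_call {
        uint16_t function_code;     /* which operation, e.g., Lookup      */
        uint8_t  capability[16];    /* capability for the target object   */
        uint8_t  args[64];          /* packed string and other parameters */
    };

    static void transmit(const struct packed_call *pc)   /* lower level routine */
    {
        printf("sending function %u to the object's machine\n", pc->function_code);
    }

    /* Stub for "Lookup(directory capability, name)". */
    static void Lookup_stub(const uint8_t cap[16], const char *name)
    {
        struct packed_call pc = { .function_code = 3 };   /* hypothetical code for Lookup */
        memcpy(pc.capability, cap, 16);
        strncpy((char *)pc.args, name, sizeof pc.args - 1);
        transmit(&pc);
        /* The calling process now blocks until the reply is unpacked. */
    }

    int main(void)
    {
        uint8_t dir_cap[16] = { 0 };
        Lookup_stub(dir_cap, "mailbox");
        return 0;
    }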


When the message arrives at the desti- Each directory entry contains the ASCII
nation machine, a stub routine there un- string by which the capability is accessed
packs the ESCII message and makes a local and the capability itself. Clients can only
call on Lookup using the normal EPL call- access the contents of a directory by invok-
ing sequence. The reply proceeds analo- ing the directory object with one of the
gously in the opposite direction. The stub valid operations, which include add entry,
routines on both sides are automatically delete entry, lookup string, and rename ca-
generated by the EPL compiler. pability. Capabilities are protected from
The implementation of invocation is slightly complicated by the fact that an object may contain multiple processes. When one process blocks waiting for a reply, the others must not be affected. This problem is handled by splitting the invocation into two layers. The upper layer builds the message, including the capability for the object to be invoked and the ESCII parameters, passes it to the lower layer, and blocks the calling process until the reply arrives. The lower layer then makes a nonblocking call to the kernel to actually send the message. If other processes are active within the object, they can now be run; if none are active, the object waits until a message arrives.

On the receiving side, a process within the invoked object will normally have previously executed a call announcing its willingness to perform some operation (e.g., Lookup in the above example), thereby blocking itself. When the Lookup message comes in, it is accepted by a special dispatcher process that checks to see which process, if any, is blocked waiting to perform the operation requested by the message. If a willing process is found, it runs and sends a reply, unblocking the caller. If no such process can be found, the message is queued until one becomes available.
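The receiving-side machinery just described can be sketched as follows. Python threads and queues stand in for Eden's lightweight processes inside an object; the class and method names are invented for illustration and do not correspond to actual Eden interfaces.

```python
# A sketch of the dispatcher behaviour described above, using Python threads to
# model the lightweight processes inside one Eden object.  Names are invented.
import threading, queue, collections

class Dispatcher:
    def __init__(self):
        self.waiting = collections.defaultdict(collections.deque)  # op -> blocked workers
        self.pending = collections.defaultdict(collections.deque)  # op -> queued messages
        self.lock = threading.Lock()

    def announce(self, op):
        """Called by a worker process to announce willingness to perform op; blocks."""
        mailbox = queue.Queue(maxsize=1)
        with self.lock:
            if self.pending[op]:                       # a message is already queued
                mailbox.put(self.pending[op].popleft())
            else:
                self.waiting[op].append(mailbox)
        return mailbox.get()                           # block until a message arrives

    def deliver(self, op, message):
        """Called when an invocation message arrives for this object."""
        with self.lock:
            if self.waiting[op]:                       # a willing process is blocked: run it
                self.waiting[op].popleft().put(message)
            else:                                      # otherwise queue until one announces
                self.pending[op].append(message)
```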
3.4.2 Naming and Protection

Naming and protection in Eden are accomplished using the capability system. Data are encapsulated within objects, and are only accessible by invoking one of the operations defined by the object. To invoke an object, a process must have a valid capability. Thus there is a uniform naming and protection scheme throughout Eden. Capabilities may be stored in any object. Directories provide a convenient mechanism for grouping capabilities together. Each directory entry contains the ASCII string by which the capability is accessed and the capability itself. Clients can only access the contents of a directory by invoking the directory object with one of the valid operations, which include add entry, delete entry, lookup string, and rename capability. Capabilities are protected from forgery by the kernel, but users keep copies of capabilities for their own use; the kernel verifies them when they are used.

The basic protection scheme protects objects, using capabilities. Since all processes are embedded in objects, a process can be protected by restricting the distribution of capabilities to its object. The only way to obtain service from an object is by invoking the object with the proper capability, parameters, etc., all of which are checked by the kernel and EPL run-time system before the call is made.
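A minimal sketch of such a directory object is given below, with the four operations named above. The in-process capability check is only a stand-in for the verification that the Eden kernel and EPL run-time system perform.

```python
# Sketch of a directory object offering the four operations named above.
# The capability check is a placeholder for the kernel's verification.
class Directory:
    def __init__(self, capability):
        self._capability = capability   # capability that must be presented to invoke us
        self._entries = {}              # ASCII string -> capability

    def _check(self, cap):
        if cap != self._capability:
            raise PermissionError("invalid capability for this directory")

    def add_entry(self, cap, name, target_capability):
        self._check(cap)
        self._entries[name] = target_capability

    def delete_entry(self, cap, name):
        self._check(cap)
        del self._entries[name]

    def lookup_string(self, cap, name):
        self._check(cap)
        return self._entries[name]

    def rename_capability(self, cap, old_name, new_name):
        self._check(cap)
        self._entries[new_name] = self._entries.pop(old_name)
```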
3.4.3 Resource Management

Because no version of Eden runs on bare machines, most of the issues associated with low-level resource management have not yet been dealt with. Nevertheless, some resource management issues have been addressed. For example, when an object is created, the issue arises of where to put it. At present, it is just put on the same workstation as the object that created it unless an explicit request has been given to put it somewhere else.

Another issue that has received considerable attention is how to achieve concurrency within an object. From the beginning of the project it was considered desirable to allow multiple processes to be simultaneously active within an object. These processes all share a common address space, although each one has its own stack for local variables, procedure call/return information, etc. Having multiple active processes within an object, coupled with the basic Eden semantics of remote invocations that block the caller but not the whole object, makes the implementation somewhat complicated. It is necessary to allow one process to block waiting for a reply without blocking the object as a whole. Monitors are used for synchronization. This multiprogramming of processes
within an object is handled by a run-time system within that object, rather than by the kernel itself (as is done in Amoeba and also in V). The experiences of Eden, Amoeba, and V all seem to indicate that having cheap, "lightweight" processes that share a common address space is often useful [Black 1985].

Management of dynamic storage for objects has also been a subject of some work. Each object has a heap for its own internal use, for which the EPL compiler generates explicit allocate and deallocate commands. However, a different storage management scheme is used for objects themselves. When a kernel creates an object, it allocates storage for the object from its own heap and gives the object its own address space. It also manages the user capabilities for the object in such a way that it is possible systematically to find all capabilities by scanning the kernel's data structures.

3.4.4 Fault Tolerance

The Eden kernel does not support atomic actions directly, although some services provide them to their clients. Invocations can fail with status CANNOT LOCATE OBJECT when the machine on which the invoked object resides crashes. On the other hand, Eden goes to a considerable length to make sure that objects are not totally destroyed by crashes. The technique used to accomplish this goal is to have objects checkpoint themselves periodically. Once an object has written a copy of its state to disk, a subsequent crash merely has the effect of resetting the object to the state that it had at the most recent checkpoint. Checkpoints themselves are atomic, and this property can be used to build up more complex atomic actions.

By judicious timing of its checkpoints, an object can achieve a high degree of reliability. For example, within the user mail system, a mailbox object will checkpoint itself just after any letter is received or removed. Upon receipt of a letter, a mailbox can wait for confirmation of the checkpoint before sending an acknowledgment back to the sender, to ensure that letters are never lost because of crashes. One drawback of the whole checkpoint mechanism is that it is expensive: Any change to an object's state, no matter how small, requires writing the entire object to the disk. The Eden designers acknowledge this as a problem.

Another feature of Eden that supports fault tolerance is the ability of the file system, when asked, to store an object as multiple copies on different machines (see below).
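The checkpointing idea can be illustrated with the following sketch of a mailbox object that writes its whole state to disk before acknowledging a letter. The write-then-rename trick used to make the checkpoint atomic is an assumption of the sketch, not a description of Eden's actual disk layout.

```python
# Sketch of whole-object checkpointing.  The atomicity trick (write a new file,
# then rename it over the old one) is a common technique assumed here; it is not
# taken from Eden's implementation.
import os, pickle, tempfile

class CheckpointedMailbox:
    def __init__(self, path):
        self.path = path
        self.letters = []
        if os.path.exists(path):                      # crash recovery: reload last checkpoint
            with open(path, "rb") as f:
                self.letters = pickle.load(f)

    def checkpoint(self):
        """Write the entire object state to disk; the rename makes it atomic."""
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(self.path) or ".")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.letters, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.path)

    def receive(self, letter):
        self.letters.append(letter)
        self.checkpoint()     # checkpoint before acknowledging, so letters survive crashes
        return "ack"
```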
3.4.5 Services

The Eden file system maintains arbitrary objects. One particular object type, the BYTESTORE, implements linear files, as in UNIX. It is possible to set the "current position" anywhere in the file and then read sequentially from that point. Unlike V and Amoeba, Eden does not have special machines dedicated as servers. Instead, each workstation can support file objects, either for the benefit of the local user or remote ones.

The model used for file service in Eden is quite different from the usual model of a file server, which manages some set of files and accepts requests from clients to perform operations on them. In Eden, each file (i.e., BYTESTORE object) contains within it the processes needed to handle operations on it. Thus the file contains the server rather than the server containing the file as in most other systems.

Of course, actually having a process running for each file in existence would be unbearably expensive, so an optimization is used in the implementation. When a file is not open, its processes are dormant and consume no resources (other than the disk space for its checkpoint). Mailboxes, directories, and all other Eden objects work the same way. When an object is not busy with an invocation, the processes inside it are put to sleep by checkpointing the whole object to the disk.

When a file is opened, a copy of the code for its internal processes is found, and the processes started up. Although all files on a given workstation share the same code, when the first file is opened on a workstation, the code may have to be fetched from another workstation.
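The following sketch illustrates a BYTESTORE-like file object whose internal server state is dormant on disk while the file is closed and is activated on open. All names and the on-disk representation are invented for illustration.

```python
# Sketch of a BYTESTORE-style file object: the "server" lives inside the file
# object, is dormant (just a checkpoint on disk) when the file is closed, and is
# started up on open.  The layout and names are illustrative assumptions.
class ByteStore:
    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
        self.data = b""
        self.position = 0
        self.active = False

    def open(self):
        """Activate the object's internal process; load state from its checkpoint."""
        try:
            with open(self.checkpoint_path, "rb") as f:
                self.data = f.read()
        except FileNotFoundError:
            self.data = b""
        self.position = 0
        self.active = True

    def set_position(self, offset):
        self.position = offset                 # "current position" can be set anywhere

    def read(self, nbytes):
        chunk = self.data[self.position:self.position + nbytes]
        self.position += len(chunk)
        return chunk                           # sequential read from the current position

    def close(self):
        """Put the internal process back to sleep by checkpointing the whole object."""
        with open(self.checkpoint_path, "wb") as f:
            f.write(self.data)
        self.active = False
```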

The approach has advantages and disadvantages compared with the traditional one-file-server-for-all-files way of doing things. There are two main advantages. First, the complicated, multithreaded file server code is eliminated: There is no file server. The processes within a BYTESTORE object are dedicated to a single file. Second, files can be migrated freely about all the nodes in the system, so that, for example, a file might be created locally and then moved to a remote node where it will later be used.

The chief disadvantage is performance. All the processes needed for the open files consume resources, and fetching the code for the first file to be opened on a workstation is slow.

The Eden file system supports nested transactions [Pu and Noe 1985]. When an atomic update on a set of files (or other objects) is to be carried out, the manager for that transaction first makes sure that all the new versions are safely stored on disk, then it checkpoints itself, and finally it updates the directory.

The transaction facility can be used to support replicated files [Pu et al. 1986]. In the simplest case, a directory object maps an ASCII name onto the capability for that object. However, the system also has "repdirs," objects that map ASCII names onto sets of capabilities, for example, all the copies of a replicated file. Updating a replicated file is handled by a transaction manager, which uses a two-phase commit algorithm to update all the copies simultaneously. If one of the copies is not available for updating (e.g., its machine is down or the network is partitioned), a new copy of the file is generated, and the capability for the unreachable copy discarded. Sooner or later, the garbage collector will notice that the old copy is no longer in use and remove it.
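A sketch of such a replicated update is shown below. The prepare/commit interface of a file copy and the make_new_copy operation are assumptions made for illustration; the text does not give the actual protocol messages used by the Eden transaction manager.

```python
# Sketch of updating a replicated file through a "repdir"-style mapping with a
# two-phase commit.  The copy interface (prepare/commit) and make_new_copy are
# invented stand-ins, not Eden's real operations.
def update_replicated_file(repdir, name, new_contents, make_new_copy):
    copies = list(repdir[name])                 # repdir: name -> set of copy objects
    prepared = []

    for copy in copies:                         # phase 1: ask every copy to prepare
        try:
            copy.prepare(new_contents)
            prepared.append(copy)
        except ConnectionError:
            # Copy's machine is down or partitioned: generate a fresh copy elsewhere
            # (assumed to come back already holding the new contents and able to commit),
            # and simply drop the capability for the unreachable copy.
            prepared.append(make_new_copy(new_contents))

    for copy in prepared:                       # phase 2: commit on all reachable copies
        copy.commit()

    repdir[name] = set(prepared)                # the garbage collector reclaims old copies
```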
We touched briefly on the mail server above. The mail system defines message, mailbox, and address list objects, with operations to deliver mail, read mail, reply to mail, and so on.

The appointment calendar system is another example of an Eden application. It is used to schedule meetings and runs in two phases. When someone proposes a meeting, a transaction is first done to mark the proposed time as "tentatively occupied" on all the participants' calendars. When a participant notices the proposed date, he or she can then approve or reject it. If all participants approve the meeting, it is "committed" by another transaction; if someone rejects the proposed appointment, the other participants are notified.

3.4.6 Implementation

Eden has had a somewhat tortuous implementation history. The initial version was designed to be written in Ada (a trademark of the U.S. Department of Defense) on the Intel 432, a highly complex multiprocessor, fault-tolerant microprocessor chip ensemble. To make a long story short, neither the Ada compiler nor the 432 lived up to the project's expectations. To gather information for further design, a "throwaway" implementation was made on top of VMS on a VAX.

The VAX/VMS version, called Newark (because that was thought to be far from Eden), was written in Pascal and was not distributed (i.e., it ran on a single VAX). It supported multiple processes per object (VMS kernel processes) but did not have automatic stub generation. Furthermore, the whole implementation was rather cumbersome, so it was then decided to design a programming language that would provide automatic stub generation, better type checking, and a more convenient way of dealing with concurrency.

This reevaluation led to EPL and a new implementation on top of UNIX instead of VMS. Subsequently, Eden was ported to 68000-based workstations (SUNs), also on top of UNIX, rather than on the bare hardware (and in contrast to the Cambridge system, V, and Amoeba, all of which run on bare 68000s). The decision to put UNIX on the bottom, instead of the top (as was done with Amoeba), made system development easier and assisted users in migrating from UNIX to Eden. The price that has been paid is poor performance and a fair
amount of effort spent trying to convince UNIX to do things against its will.

3.5 Comparison of the Cambridge, Amoeba, V, and Eden Systems

Our four example systems have many aspects in common, but also differ in some significant ways. In this section we summarize and compare the four systems with respect to the main design issues that we have been discussing.

3.5.1 Communication Primitives

All four systems use an RPC-like mechanism (as opposed to an ISO OSI communication-oriented mechanism).

The Cambridge mechanism is the simplest, using the single-shot protocol with a 2K request packet and a 2K reply packet for most client-server communication. A byte stream protocol is also available.

Amoeba uses a similar REQUEST-REPLY mechanism, but allows messages up to 32 kilobytes (with the kernel handling message fragmentation and reassembly), as well as acknowledgments and timeouts, thus providing user programs with a more reliable and simpler interface.

V also uses a REQUEST-REPLY mechanism, but messages longer than an Ethernet packet are dealt with by having the sender include a sort of "capability" for a message segment in the REQUEST packet. Using this "capability," the receiver can fetch the rest of the message, as needed. For efficiency, the first 1K is piggybacked onto the REQUEST itself.

Eden comes closest to a true RPC mechanism, including having a language and compiler with automatic stub generation and a minilanguage for parameter passing. None of the four examples attempts to guarantee that remote calls will be executed exactly once.
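For concreteness, the following sketch shows a minimal REQUEST-REPLY client with timeout and retransmission over UDP. It is not taken from any of the four systems; note that retransmitting after a lost reply can cause the remote operation to be executed more than once, which is precisely why exactly-once semantics are not guaranteed.

```python
# Minimal request-reply client with timeouts and retransmission.  The transport
# (UDP) and message layout are illustrative assumptions, not any system's protocol.
import socket

def request_reply(server, request, retries=3, timeout=2.0):
    """Send request (bytes) to server=(host, port); return the reply or raise TimeoutError."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        for _ in range(retries):
            sock.sendto(request, server)
            try:
                reply, _addr = sock.recvfrom(65535)
                return reply
            except socket.timeout:
                continue                      # retransmit the REQUEST
        raise TimeoutError("server did not reply")
    finally:
        sock.close()
```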
3.5.2 Naming and Protection

All four systems use different schemes for naming and protection. In the Cambridge system a single name server process maps symbolic service names onto (node, process identifier) pairs so that the client will know where to send the request. Protection is done by the active name table, which keeps track of the authorization status of each logged in user.

Amoeba has a single mechanism for all naming and protection: sparse capabilities. Each capability contains bits specifying which operations on the object are allowed and which are not. The rights are protected cryptographically, so that user programs can manipulate them directly; they are not stored in the kernel. ASCII-string-to-capability mapping and capability storage are handled by directory servers for convenience.
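The flavor of cryptographically protected rights can be conveyed with the following sketch, in which a server issues a capability whose check field is a one-way function of the object identifier, the rights bits, and a server secret. The field layout and the use of HMAC here are illustrative assumptions and do not reproduce the actual Amoeba capability format.

```python
# Sketch of rights protected by a one-way function, in the spirit of the scheme
# described above; the real Amoeba capability layout is not reproduced here.
import hmac, hashlib, os

SERVER_SECRET = os.urandom(16)        # known only to the server that owns the object

def make_capability(object_id, rights):
    """Issue a capability: the check field binds the rights to the object."""
    check = hmac.new(SERVER_SECRET, f"{object_id}:{rights}".encode(),
                     hashlib.sha256).hexdigest()
    return (object_id, rights, check)

def verify_capability(cap, required_right):
    """Server-side check: recompute the check field and test the rights bits."""
    object_id, rights, check = cap
    expected = hmac.new(SERVER_SECRET, f"{object_id}:{rights}".encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(check, expected) and (rights & required_right) != 0

READ, WRITE = 1, 2
cap = make_capability(object_id=42, rights=READ)
assert verify_capability(cap, READ) and not verify_capability(cap, WRITE)
```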
Eden also uses capabilities, but these are not protected by sparseness or encryption, and so they must be protected by the kernel. A consequence of this decision is that all the kernels must be trustworthy. The Amoeba cryptographic protection scheme is less restrictive on this point.

V has naming at three levels: Processes have pids, kernels have ASCII-to-pid mappings, and servers use a context mechanism to relate symbolic names to a given context.

3.5.3 Resource Management

Resource management is also handled quite differently on all four systems. In the Cambridge system the main resource is the processor bank. A resource manager is provided to allocate machines to users. Generally, this allocation is fairly static: upon login a user is allocated one machine for the duration of the login session, and this is the only machine the user uses during the session. The user may load any operating system that he or she chooses in this machine.

Amoeba also has a pool of processors, but these are allocated dynamically. A user running "make" might be allocated ten processors to compile ten files; afterward, all the processors would go back into the pool. Amoeba also provides a way for processes to create segments on any machine (assuming that the proper capability can be shown) and for these segments to be forged into processes. Amoeba is unique among the four systems in that it has a bank server that can allow servers to charge for services
and to limit resource usage by accounting for it.

In V, each processor is dedicated as either a workstation or a server, so processors are not resources to be dynamically allocated. Each V kernel manages its own local resources; there is no systemwide resource management.

Eden has been built on top of existing operating systems, and therefore most of the issues of resource management are handled by the underlying operating system. The main issue remaining for Eden is allocating and deallocating storage for objects.

3.5.4 Fault Tolerance

None of the four systems goes to great lengths to make itself fault tolerant; for example, none supports atomic actions as a basic primitive. All four (with the possible exception of Eden) were designed with the intention of actually being used, so that the inherent trade-off between performance and fault tolerance tended to get resolved in favor of performance.

In the Cambridge system the only concession to fault tolerance is a feature in the ring interface to allow a machine to be remotely reset by sending a special packet to the interface. There is also a small server that helps get the servers started up.

Amoeba provides some fault tolerance through its boot server, with which processes can register. The boot server polls the registered processes periodically and, finding one that fails to respond, requests a new processor and downloads the failed program to it. This strategy does not retrieve the processes that were killed when a machine has gone down, but it does automatically ensure that no key service is ever down for more than, say, 30 seconds.
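A boot server of this kind might be sketched as follows. The registry interface, allocate_processor, and download are hypothetical names standing in for Amoeba operations that are not detailed in the text.

```python
# Sketch of the boot-server behaviour described above: poll registered services
# and restart any that stop answering.  allocate_processor() and download() are
# invented stand-ins for operations not spelled out in the text.
import time

def boot_server(registry, allocate_processor, download, period=30):
    """registry maps a service name to a record with .ping() and .program."""
    while True:
        for name, service in list(registry.items()):
            try:
                service.ping()                        # poll the registered process
            except ConnectionError:
                machine = allocate_processor()        # ask for a fresh processor
                registry[name] = download(machine, service.program)  # restart the program
        time.sleep(period)                            # so no key service stays down for long
```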
V does not address the problem of fault tolerance at all.

Of the four systems, Eden makes the most effort to provide a higher degree of reliability than provided by the bare hardware. The main tool used is checkpointing complete objects from time to time. If a processor crashes, each of its objects can be restored to the state it had at the time of the last checkpoint. Unfortunately, only entire objects can be checkpointed, making checkpointing a slow operation and thus discouraging its frequent use.

3.5.5 Services

The file systems used by Cambridge, Amoeba, V, and Eden are all quite different. The Cambridge system has two servers, the universal file server, and the filing machine, which was added later to improve the performance by providing a large buffer cache. The universal file server supports a primitive flat file, with no directory structure, which is provided by the filing machine or the user machines. The universal file server has regular and special files, of which the latter can be updated atomically.

Amoeba has several file systems. One of them is compatible with UNIX, to allow UNIX applications to run on Amoeba. Another one, FUSS, supports multiversion, multiserver, tree-structured, immutable files with atomic commit. Directory servers map ASCII names to capabilities, thus allowing an arbitrary graph of files and directories to be constructed.

V has a traditional file server similar to UNIX. It is based on the earlier Thoth system.

Eden has no file server at all in the usual sense. Instead, each file object has embedded in it a process that acts like a private file server for that one file. Like Amoeba, Eden has separate directory servers that map ASCII strings onto capabilities and provide the ability to map one string onto several files, thus providing for file replication. All four systems have a heterogeneous variety of other services (e.g., print, mail, bank).

4. SUMMARY

Distributed operating systems are still in an early phase of development, with many unanswered questions and relatively little agreement among workers in the field about how things should be done. Many experimental systems use the client-server model with some form of remote procedure call as the communication base, but there are also systems built on the connection model.

Relatively little has been done on distributed naming, protection, and resource management, other than building straightforward name servers and process servers. Fault tolerance is an up-and-coming area, with work progressing in redundancy techniques and atomic actions. Finally, a considerable amount of work has gone into the construction of file servers, print servers, and various other servers, but here too there is much work to be done. The only conclusion that we draw is that distributed operating systems will be an interesting and fruitful area of research for a number of years to come.

ACKNOWLEDGMENTS

We would like to thank Andrew Black, Dick Grune, Sape Mullender, and Jennifer Steiner for their critical reading of the manuscript.

REFERENCES

ADAMS, C. J., ADAMS, G. C., WATERS, A. G., LESLIE, I., AND KIRK, P. 1982. Protocol architecture of the UNIVERSE project. In Proceedings of the 6th International Conference on Computer Communication (London, Sept. 7-10). International Conference for Computer Communication, pp. 379-383.
ALMES, G. T., BLACK, A. P., LAZOWSKA, E. D., AND NOE, J. D. 1985. The Eden system: A technical review. IEEE Trans. Softw. Eng. SE-11 (Jan.), 43-59.
ANDERSON, T., AND LEE, P. A. 1981. Fault Tolerance, Principles and Practice. Prentice-Hall International, London.
AVIZIENIS, A., AND CHEN, L. 1977. On the implementation of N-version programming for software fault-tolerance during execution. In Proceedings of the International Computer Software and Applications Conference. IEEE, New York, pp. 149-155.
AVIZIENIS, A., AND KELLY, J. 1984. Fault tolerance by design diversity. Computer 17 (Aug.), 66-80.
BAL, H. E., VAN RENESSE, R., AND TANENBAUM, A. S. 1985. A distributed, parallel, fault tolerant computing system. Rep. IR-106, Dept. of Mathematics and Computer Science, Vrije Univ., The Netherlands, Oct.
BALL, J. E., FELDMAN, J., LOW, R., RASHID, R., AND ROVNER, P. 1976. RIG, Rochester's intelligent gateway: System overview. IEEE Trans. Softw. Eng. SE-2 (Dec.), 321-329.
BARAK, A., AND SHILOH, A. 1985. A distributed load-balancing policy for a multicomputer. Softw. Pract. Exper. 15 (Sept.), 901-913.
BIRMAN, K. P., AND ROWE, L. A. 1982. A local network based on the UNIX operating system. IEEE Trans. Softw. Eng. SE-8 (Mar.), 137-146.
BIRRELL, A. D. 1985. Secure communication using remote procedure calls. ACM Trans. Comput. Syst. 3, 1 (Feb.), 1-14.
BIRRELL, A. D., AND NEEDHAM, R. M. 1980. A universal file server. IEEE Trans. Softw. Eng. SE-6 (Sept.), 450-453.
BIRRELL, A. D., AND NELSON, B. J. 1984. Implementing remote procedure calls. ACM Trans. Comput. Syst. 2, 1 (Feb.), 39-59.
BIRRELL, A. D., LEVIN, R., NEEDHAM, R. M., AND SCHROEDER, M. 1982. Grapevine: An exercise in distributed computing. Commun. ACM 25, 4 (Apr.), 260-274.
BIRRELL, A. D., LEVIN, R., NEEDHAM, R. M., AND SCHROEDER, M. 1984. Experience with Grapevine: The growth of a distributed system. ACM Trans. Comput. Syst. 2, 1 (Feb.), 3-23.
BLACK, A. P. 1983. An asymmetric stream communications system. Oper. Syst. Rev. (ACM) 17, 5, 4-10.
BLACK, A. P. 1985. Supporting distributed applications: Experience with Eden. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 181-193.
BOGGS, D. R., SCHOCH, J. F., TAFT, E. A., AND METCALFE, R. M. 1980. Pup: An internetwork architecture. IEEE Trans. Commun. COM-28 (Apr.), 612-624.
BORG, A., BAUMBACH, J., AND GLAZER, S. 1983. A message system supporting fault tolerance. Oper. Syst. Rev. (ACM) 17, 5, 90-99.
BROWN, M. R., KOLLING, K. N., AND TAFT, E. A. 1985. The Alpine file system. ACM Trans. Comput. Syst. 3, 4 (Nov.), 261-293.
BROWNBRIDGE, D. R., MARSHALL, L. F., AND RANDELL, B. 1982. The Newcastle connection-Or UNIXes of the world unite! Softw. Pract. Exper. 12 (Dec.), 1147-1162.
BRYANT, R. M., AND FINKEL, R. A. 1981. A stable distributed scheduling algorithm. In Proceedings of the 2nd International Conference on Distributed Computer Systems (Apr.). IEEE, New York, pp. 314-323.
CHANDY, K. M., MISRA, J., AND HAAS, L. M. 1983. Distributed deadlock detection. ACM Trans. Comput. Syst. 1, 2 (May), 145-156.
CHERITON, D. R. 1982. The Thoth System: Multi-Process Structuring and Portability. American Elsevier, New York.
CHERITON, D. R. 1984a. An experiment using registers for fast message-based interprocess communication. Oper. Syst. Rev. 18 (Oct.), 12-20.
CHERITON, D. R. 1984b. The V kernel: A software base for distributed systems. IEEE Softw. 1 (Apr.), 19-42.
CHERITON, D. R., AND MANN, T. P. 1984. Uniform access to distributed name interpretation in the V system. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 290-297.
CHERITON, D. R., AND ZWAENEPOEL, W. 1983. The distributed V kernel and its performance for diskless workstations. In Proceedings of the 9th Symposium on Operating Systems Principles. ACM, New York, pp. 128-140.
CHERITON, D. R., AND ZWAENEPOEL, W. 1984. One-to-many interprocess communication in the V-system. In SIGCOMM '84 Tutorials and Symposium on Communications Architectures and Protocols (Montreal, Quebec, June 6-8). ACM, New York.
CHERITON, D. R., MALCOLM, M. A., MELEN, L. S., AND SAGER, G. R. 1979. Thoth, a portable real-time operating system. Commun. ACM 22, 2 (Feb.), 105-115.
CHESSON, G. 1975. The network UNIX system. In Proceedings of the 5th Symposium on Operating Systems Principles (Austin, Tex., Nov. 19-21). ACM, New York, pp. 60-66.
CHOW, T. C. K., AND ABRAHAM, J. A. 1982. Load balancing in distributed systems. IEEE Trans. Softw. Eng. SE-8 (July), 401-412.
CHOW, Y. C., AND KOHLER, W. H. 1979. Models for dynamic load balancing in heterogeneous multiple processor systems. IEEE Trans. Comput. C-28 (May), 354-361.
CHU, W. W., HOLLOWAY, L. J., MIN-TSUNG, L., AND EFE, K. 1980. Task allocation in distributed data processing. Computer 13 (Nov.), 57-69.
CURTIS, R. S., AND WITTIE, L. D. 1984. Global naming in distributed systems. IEEE Softw. 1, 76-80.
DALAL, Y. K. 1977. Broadcast protocols in packet switched computer networks. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.
DELLAR, C. 1982. A file server for a network of low-cost personal microcomputers. Softw. Pract. Exper. 12 (Nov.), 1051-1068.
DENNIS, J. B., AND VAN HORN, E. C. 1966. Programming semantics for multiprogrammed computations. Commun. ACM 9, 3 (Mar.), 143-154.
DEWITT, D. J., FINKEL, R. A., AND SOLOMON, M. 1984. The CRYSTAL multicomputer: Design and implementation experience. Tech. Rep. TR-553, Computer Science Dept., Univ. of Wisconsin, Madison, Wis.
DION, J. 1980. The Cambridge file server. Oper. Syst. Rev. (ACM) 14 (Oct.), 41-49.
EFE, K. 1982. Heuristic models of task assignment scheduling in distributed systems. Computer 15 (June), 50-56.
ESWARAN, K. P., GRAY, J. N., LORIE, R. A., AND TRAIGER, I. L. 1976. The notions of consistency and predicate locks in a database system. Commun. ACM 19, 11 (Nov.), 624-633.
FARBER, D. J., AND LARSON, K. C. 1972. The system architecture of the distributed computer system-The communications system. In Proceedings of the Symposium on Computer Networks (Brooklyn, Apr.). Polytechnic Inst. of Brooklyn, Brooklyn, N.Y.
FINKEL, R. A., SOLOMON, M. H., AND TISCHLER, R. 1979. The Roscoe resource manager. COMPCON 79 Digest of Papers (Feb.). IEEE, New York, pp. 88-91.
FITZGERALD, R., AND RASHID, R. 1985. The integration of virtual memory management and interprocess communication in Accent. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 13-14.
FRIDRICH, M., AND OLDER, W. 1981. The Felix file server. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 37-44.
FRIDRICH, M., AND OLDER, W. 1984. HELIX: The architecture of a distributed file system. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 422-431.
GAGLIANELLO, R. D., AND KATSEFF, H. P. 1985. Meglos: An operating system for a multiprocessor environment. In Proceedings of the 5th International Conference on Distributed Computing Systems (May). IEEE, New York, pp. 35-42.
GLIGOR, V. D., AND SHATTUCK, S. H. 1980. Deadlock detection in distributed systems. IEEE Trans. Softw. Eng. SE-6 (Sept.), 435-440.
GYLYS, V. B., AND EDWARDS, J. A. 1976. Optimal partitioning of workload for distributed systems. In Proceedings of COMPCON (Sept.). IEEE, New York, pp. 353-357.
HWANG, K., CROFT, W. J., GOBLE, G. H., WAH, B. W., BRIGGS, F. A., SIMMONS, W. R., AND COATES, C. L. 1982. A UNIX-based local computer network. Computer 15 (Apr.), 55-66.
ISLOOR, S. S., AND MARSLAND, T. A. 1978. An effective on-line deadlock detection technique for distributed database management systems. In Proceedings of the International Computer and Software Application Conference. IEEE, New York, pp. 283-288.
JANSON, P., SVOBODOVA, L., AND MAEHLE, E. 1983. Filing and printing services in a local area network. In Proceedings of the 8th Data Communications Symposium (Cape Cod, Mass., Oct. 3-6). IEEE, New York, pp. 211-219.
JEFFERSON, D. R. 1985. Virtual time. ACM Trans. Program. Lang. Syst. 7 (July), 404-425.
JENSEN, E. D. 1978. The Honeywell experimental distributed processor-An overview of its objective, philosophy and architectural facilities. Computer 11 (Jan.), 28-38.
JESSOP, W. H., JACOBSON, D. M., NOE, J. D., BAER, J.-L., AND PU, C. 1982. The Eden transaction-based file system. In Proceedings of the 2nd Symposium on Reliability in Distributed Software and Database Systems (July). IEEE, New York, pp. 163-169.
KRUEGER, P., AND FINKEL, R. A. 1983. An adaptive load balancing algorithm for a multicomputer. Unpublished manuscript, Computer Science Dept., Univ. of Wisconsin.
LAMPORT, L. 1978. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (July), 558-565.
LAMPORT, L. 1984. Using time instead of timeout for fault-tolerant distributed systems. ACM Trans. Program. Lang. Syst. 6 (Apr.), 254-280.
LAMPSON, B. W. 1981. Atomic transactions. In Distributed Systems-Architecture and Implementation, B. W. Lampson, Ed. Springer-Verlag, Berlin and New York, pp. 246-265.
LAZOWSKA, E. D., LEVY, H. M., ALMES, G. T., FISCHER, M. J., FOWLER, R. J., AND VESTAL, S. C. 1981. The architecture of the Eden system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 148-159.
LEVY, H. M. 1984. Capability-Based Computer Systems. Digital Press, Maynard, Mass.
LISKOV, B. 1982. On linguistic support for distributed programs. IEEE Trans. Softw. Eng. SE-8 (May), 203-210.
LISKOV, B. 1984. Overview of the Argus language and system. Programming Methodology Group Memo 40, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Mass., Feb.
LISKOV, B., AND SCHEIFLER, R. 1982. Guardians and actions: Linguistic support for robust, distributed programs. ACM Trans. Program. Lang. Syst. 5, 3 (July 1983), 381-404; also ACM, pp. 7-19, Jan. 1982.
LO, V. M. 1984. Heuristic algorithms for task assignment in distributed systems. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 30-39.
LUDERER, G. W. R., CHE, H., HAGGERTY, J. P., KIRSLIS, P. A., AND MARSHALL, W. T. 1981. A distributed UNIX system based on a virtual circuit switch. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 160-168.
MAMRAK, S. A., MAURATH, P., GOMEZ, J., JANARDAN, S., AND NICHOLAS, C. 1982. Guest layering distributed processing support on local operating systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, pp. 854-859.
MARZULLO, K., AND OWICKI, S. 1985. Maintaining the time in a distributed system. Oper. Syst. Rev. 19 (July), 44-54.
MENASCE, D., AND MUNTZ, R. 1979. Locking and deadlock detection in distributed databases. IEEE Trans. Softw. Eng. SE-5 (May), 195-202.
MILLSTEIN, R. E. 1977. The national software works. In Proceedings of the ACM Annual Conference (Seattle, Wash., Oct. 16-19). ACM, New York, pp. 44-52.
MITCHELL, J. G., AND DION, J. 1982. A comparison of two network-based file servers. Commun. ACM 25, 4 (Apr.), 233-245.
MOHAN, C. K., AND WITTIE, L. D. 1985. Local reconfiguration of management trees in large networks. In Proceedings of the 5th International Conference on Distributed Computing Systems (May). IEEE, New York, pp. 386-393.
MULLENDER, S. J., AND TANENBAUM, A. S. 1984. Protection and resource control in distributed operating systems. Comput. Networks 8 (Nov.), 421-432.
MULLENDER, S. J., AND TANENBAUM, A. S. 1985. A distributed file service based on optimistic concurrency control. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 51-62.
MULLENDER, S. J., AND TANENBAUM, A. S. 1986. The design of a capability-based distributed operating system. Computer J. (in press).
NEEDHAM, R. M., AND HERBERT, A. J. 1982. The Cambridge Distributed Computing System. Addison-Wesley, Reading, Mass.
NELSON, B. J. 1981. Remote procedure call. Tech. Rep. CSL-81-9, Xerox Palo Alto Research Center, Palo Alto, Calif.
OBERMARCK, R. 1982. Distributed deadlock detection algorithm. ACM Trans. Database Syst. 7 (June), 187-208.
OKI, B. M., LISKOV, B. H., AND SCHEIFLER, R. W. 1985. Reliable object storage to support atomic actions. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 147-159.
OUSTERHOUT, J. K. 1982. Scheduling techniques for concurrent systems. In Proceedings of the 3rd International Conference on Distributed Computing Systems. IEEE, New York, pp. 22-30.
PASHTAN, A. 1982. Object oriented operating systems: An emerging design methodology. In Proceedings of the ACM National Conference (Dallas, Tex., Oct. 25-27). ACM, New York, pp. 126-131.
POPEK, G., WALKER, B., CHOW, J., EDWARDS, D., KLINE, C., RUDISIN, G., AND THIEL, G. 1981. LOCUS: A network transparent, high reliability distributed system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 160-168.
POWELL, M. L., AND MILLER, B. P. 1983. Process migration in DEMOS/MP. Oper. Syst. Rev. (ACM) 17, 5, 110-119.
POWELL, M. L., AND PRESOTTO, D. L. 1983. Publishing-A reliable broadcast communication mechanism. Oper. Syst. Rev. (ACM) 17, 5, 100-109.
PU, C., AND NOE, J. D. 1985. Nested transactions for general objects. Rep. TR-85-12-03, Computer Science Dept., Univ. of Washington, Seattle, Wash.
PU, C., NOE, J. D., AND PROUDFOOT, A. 1986. Regeneration of replicated objects: A technique and its Eden implementation. In Proceedings of the 2nd International Conference on Data Engineering (Los Angeles, Calif., Feb. 4-6). IEEE, New York, pp. 175-187.
RASHID, R. F., AND ROBERTSON, G. G. 1981. Accent: A communication oriented network operating system kernel. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 64-75.
REED, D. P. 1983. Implementing atomic actions on decentralized data. ACM Trans. Comput. Syst. 1, 1 (Feb.), 3-23.
REED, D. P., AND SVOBODOVA, L. 1981. SWALLOW: A distributed data storage system for a local network. In Local Networks for Computer Communications, A. West and P. Janson, Eds. North-Holland Publ., Amsterdam, pp. 355-373.
REIF, J. H., AND SPIRAKIS, P. G. 1984. Real-time synchronization of interprocess communications. ACM Trans. Program. Lang. Syst. 6, 2 (Apr.), 215-238.
RITCHIE, D. M., AND THOMPSON, K. 1974. The UNIX time-sharing system. Commun. ACM 17, 7 (July), 365-375.
SALTZER, J. H., REED, D. P., AND CLARK, D. D. 1984. End-to-end arguments in system design. ACM Trans. Comput. Syst. 2, 4 (Nov.), 277-288.
SATYANARAYANAN, M., HOWARD, J., NICHOLS, D., SIDEBOTHAM, R., SPECTOR, A., AND WEST, M. 1985. The ITC distributed file system: Principles and design. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 35-50.
SCHROEDER, M., GIFFORD, D., AND NEEDHAM, R. 1985. A caching file system for a programmer's workstation. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 25-34.
SMITH, R. 1979. The contract net protocol: High-level communication and control in a distributed problem solver. In Proceedings of the 1st International Conference on Distributed Computing Systems. IEEE, New York, pp. 185-192.
SOLOMON, M. H., AND FINKEL, R. A. 1978. ROSCOE: A multimicrocomputer operating system. In Proceedings of the 2nd Rocky Mountain Symposium on Microcomputers (Aug.), pp. 201-210.
SOLOMON, M. H., AND FINKEL, R. A. 1979. The Roscoe distributed operating system. In Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 10-12). ACM, New York, pp. 108-114.
SPECTOR, A. Z. 1982. Performing remote operations efficiently on a local computer network. Commun. ACM 25, 4 (Apr.), 246-260.
STANKOVIC, J. A., AND SIDHU, I. S. 1984. An adaptive bidding algorithm for processes, clusters, and distributed groups. In Proceedings of the 4th International Conference on Distributed Computing Systems. IEEE, New York, pp. 49-59.
STONE, H. S. 1977. Multiprocessor scheduling with the aid of network flow algorithms. IEEE Trans. Softw. Eng. SE-3 (Jan.), 88-93.
STONE, H. S. 1978. Critical load factors in distributed computer systems. IEEE Trans. Softw. Eng. SE-4 (May), 254-258.
STONE, H. S., AND BOKHARI, S. H. 1978. Control of distributed processes. Computer 11 (July), 97-106.
STONEBRAKER, M. 1981. Operating system support for database management. Commun. ACM 24, 7 (July), 412-418.
STURGIS, H. E., MITCHELL, J. G., AND ISRAEL, J. 1980. Issues in the design and use of a distributed file system. Oper. Syst. Rev. 14 (July), 55-69.
SVENTEK, J., GREIMAN, W., O'DELL, M., AND JANSEN, A. 1983. Token ring local networks-A comparison of experimental and theoretical performance. Lawrence Berkeley Lab. Rep. 16254.
SVOBODOVA, L. 1981. A reliable object-oriented data repository for a distributed computer system. In Proceedings of the 8th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 14-16). ACM, New York, pp. 47-58.
SVOBODOVA, L. 1984. File servers for network-based distributed systems. ACM Comput. Surv. 16, 4 (Dec.), 353-398.
SWINEHART, D., MCDANIEL, G., AND BOGGS, D. 1979. WFS: A simple shared file system for a distributed environment. In Proceedings of the 7th Symposium on Operating Systems Principles (Pacific Grove, Calif., Dec. 10-12). ACM, New York, pp. 9-17.
TANENBAUM, A. S., AND MULLENDER, S. J. 1982. Operating system requirements for distributed data base systems. In Distributed Data Bases, H.-J. Schneider, Ed. North-Holland Publ., Amsterdam, pp. 105-114.
TANENBAUM, A. S., MULLENDER, S. J., AND VAN RENESSE, R. 1986. Using sparse capabilities in a distributed operating system. In Proceedings of the 6th International Conference on Distributed Computer Systems. IEEE, New York, pp. 558-563.
VAN TILBORG, A. M., AND WITTIE, L. D. 1981. Wave scheduling: Distributed allocation of task forces in network computers. In Proceedings of the 2nd International Conference on Distributed Computing Systems. IEEE, New York, pp. 337-347.
WALKER, B., POPEK, G., ENGLISH, R., KLINE, C., AND THIEL, G. 1983. The LOCUS distributed operating system. Oper. Syst. Rev. (ACM) 17, 5, 49-70.
WAMBECQ, A. 1983. NETIX: A network-using operating system, based on UNIX software. In Proceedings of the NFWO-ENRS Contact Group (Leuven, Belgium, Mar.).
WEINSTEIN, M. J., PAGE, T. W., JR., LIVESEY, B. K., AND POPEK, G. J. 1985. Transactions and synchronization in a distributed operating system. In Proceedings of the 10th Symposium on Operating Systems Principles (Orcas Island, Wash., Dec. 1-4). ACM, New York, pp. 115-125.
WILKES, M. V., AND NEEDHAM, R. M. 1980. The Cambridge model distributed system. Oper. Syst. Rev. 14 (Jan.), 21-29.
WITTIE, L., AND CURTIS, R. 1985. Time management for debugging distributed systems. In Proceedings of the 5th International Conference on Distributed Computing Systems (May). IEEE, New York, pp. 549-551.
WITTIE, L. D., AND VAN TILBORG, A. M. 1980. MICROS, a distributed operating system for MICRONET, a reconfigurable network computer. IEEE Trans. Comput. C-29 (Dec.), 1133-1144.
WUPIT, A. 1983. Comparison of UNIX network systems. ACM, New York, pp. 99-108.
ZIMMERMANN, H. 1980. OSI reference model-The ISO model of architecture for open systems interconnection. IEEE Trans. Commun. COM-28 (Apr.), 425-432.
ZIMMERMANN, H., BANINO, J.-S., CARISTAN, A., GUILLEMONT, M., AND MORISSET, G. 1981. Basic concepts for the support of distributed systems: The Chorus approach. In Proceedings of the 2nd International Conference on Distributed Computing Systems. IEEE, New York, pp. 60-66.
Received July 1985; final revision accepted April 1986.
