Sun's Network File System (NFS)
One of the first uses of distributed client/server computing was in the realm of distributed file systems. In such an environment, there are a number of client machines and one server (or a few); the server stores the data on its disks, and clients request data through well-formed protocol messages. Figure 44.1 depicts the basic setup.
    Client0 ---\
    Client1 ----\
                 Network ----- Server+disks
    Client2 ----/
    Client3 ---/
Figure 44.1: 4 clients, 1 Server (with disks), and yes, a network

As you can see from the (ugly) picture, the server has the disks; the clients communicate through the network to access their directories and files on those disks. Why do we bother with this arrangement? (i.e., why don't we just let clients use their local disks?) Well, primarily this setup allows for easy sharing of data across clients. Thus, if you access a file on one machine (Client0) and then later use another (Client2), you will have the same view of the file system. Your data is naturally shared across these different machines. A secondary benefit is centralized administration; for example, backing up files can be done from the few server machines instead of from the multitude of clients. Another advantage could be security; having all servers in a locked machine room prevents certain types of problems from arising.
44.1 A Basic Distributed File System
A subsequent read of the same block might even be serviced out of a client-side cache; in the best such case, no network traffic need be generated.
Figure 44.2: Distributed File System Architecture: Client Application -> Client-side File System -> Networking Layer <-> Networking Layer -> File Server (with its local file system and disks)
From this simple overview, you should get a sense that there are two important pieces of software in a client/server distributed file system: the client-side file system and the file server. Together their behavior determines the overall behavior of the distributed file system.
44.2 On To NFS
One of the earliest and most successful such systems was developed by Sun Microsystems, and is known as the Sun Network File System (or NFS) [S86]. In defining NFS, Sun took an unusual approach: instead of building a proprietary and closed system, Sun instead developed an open protocol which simply specified the exact message formats that clients and servers would use to communicate. Different groups could develop their own NFS servers and thus compete in an NFS marketplace while preserving interoperability. It worked: today there are many companies that sell NFS servers (including Sun, NetApp [HLM94], EMC, IBM, and others), and the widespread success of NFS is likely attributable to this open-market approach.
44.3 Focus: Simple And Fast Server Crash Recovery
Development of newer versions of the protocol (e.g., NFSv4 [P+00]) is on-going. However, NFSv2 is both wonderful and frustrating and thus the focus of our study.

In NFSv2, one of the main goals of the protocol design was simple and fast server crash recovery. In a multiple-client, single-server environment, this goal makes a great deal of sense; any minute that the server is down (or unavailable) makes all the client machines (and their users) unhappy and unproductive. Thus, as the server goes, so goes the entire system.
44.4 Key To Fast Crash Recovery: Statelessness
    char buffer[MAX];
    int fd = open("foo", O_RDONLY);   // get descriptor
    read(fd, buffer, MAX);            // use descriptor to read MAX bytes from foo
    read(fd, buffer, MAX);            // use descriptor to read MAX bytes from foo
    read(fd, buffer, MAX);            // use descriptor to read MAX bytes from foo
    close(fd);                        // close file (done with descriptor)
In a stateful protocol, the client-side file system would pass this descriptor to the server on each read, in effect telling it: read some bytes from the file referred to by the descriptor I am passing you here. In this example, the file descriptor is a piece of shared state between the client and the server (Ousterhout calls this distributed state [O91]). Shared state, as we hinted above, complicates crash recovery. Imagine the server crashes after the first read completes, but before the client has issued the second one. After the server is up and running again, the client then issues the second read. Unfortunately, the server has no idea to which file fd is referring; that information was ephemeral (i.e., in memory) and thus lost when the server crashed. To handle this situation, the client and server would have to engage in some kind of recovery protocol, where the client would make sure to keep enough information around in its memory to be able to tell the server what it needs to know (in this case, that file descriptor fd refers to file foo).

It gets even worse when you consider the fact that a stateful server has to deal with client crashes. Imagine, for example, a client that opens a file and then crashes. The open() uses up a file descriptor on the server; how can the server know it is OK to close a given file? In normal operation, a client would eventually call close() and thus inform the server that the file should be closed. However, when a client crashes, the server never receives a close(), and thus has to notice the client has crashed in order to close the file.

For these reasons, the designers of NFS decided to pursue a stateless approach: each client operation contains all the information needed to complete the request.
No fancy crash recovery is needed; the server just starts running again, and a client, at worst, might have to retry a request.

ASIDE: WHY SERVERS CRASH
Before getting into the details of the NFSv2 protocol, you might be wondering: why do servers crash? Well, as you might guess, there are plenty of reasons. Servers may simply suffer from a power outage (temporarily); only when power is restored can the machines be restarted. Servers are often comprised of hundreds of thousands or even millions of lines of code; thus, they have bugs (even good software has a few bugs per hundred or thousand lines of code), and thus they eventually will trigger a bug that will cause them to crash. They also have memory leaks; even a small memory leak will cause a system to run out of memory and crash. And, finally, in distributed systems, there is a network between the client and the server; if the network acts strangely (for example, if it becomes partitioned and clients and servers are working but cannot communicate), it may appear as if a remote machine has crashed, but in reality it is just not currently reachable through the network.
44.5 The NFSv2 Protocol
Thus, a crucial question arises: how can we design the network protocol to both be stateless and support the POSIX file system API?

One key to understanding the design of the NFS protocol is understanding the file handle. File handles are used to uniquely describe the file or directory a particular operation is going to operate upon; thus, many of the protocol requests include a file handle.

You can think of a file handle as having three important components: a volume identifier, an inode number, and a generation number; together, these three items comprise a unique identifier for a file or directory that a client wishes to access. The volume identifier informs the server which file system the request refers to (an NFS server can export more than one file system); the inode number tells the server which file within that partition the request is accessing. Finally, the generation number is needed when reusing an inode number; by incrementing it whenever an inode number is reused, the server ensures that a client with an old file handle can't accidentally access the newly-allocated file.

Here is a summary of some of the important pieces of the protocol (Figure 44.3); the full protocol is available elsewhere (see Callaghan's book for an excellent and detailed overview of NFS [C00]). We'll briefly highlight some of the important components. First, the LOOKUP protocol message is used to obtain a file handle, which is then subsequently used to access file data. The client passes a directory file handle and the name of a file to look up, and the handle to that file (or directory) plus its attributes are passed back to the client from the server.

For example, assume the client already has a directory file handle for the root directory of a file system (/); indeed, this would be obtained through the NFS mount protocol, which is how clients and servers first are connected together (we do not discuss the mount protocol here for the sake of brevity).
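To make the file handle described above concrete, here is a minimal sketch of the information it carries. The field names, types, and sizes are illustrative only; the actual NFSv2 handle is a 32-byte blob that clients treat as opaque.

    // Sketch of what an NFS file handle identifies (illustrative, not the wire format).
    #include <stdint.h>

    typedef struct {
        uint32_t volume_id;   // which exported file system on the server
        uint32_t inode_num;   // which file/directory within that file system
        uint32_t generation;  // bumped each time the inode number is reused,
                              //   so stale handles to a recycled inode are rejected
    } nfs_fhandle_t;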
    NFSPROC_GETATTR    expects: file handle
                       returns: attributes
    NFSPROC_SETATTR    expects: file handle, attributes
                       returns: nothing
    NFSPROC_LOOKUP     expects: directory file handle, name of file/directory to look up
                       returns: file handle
    NFSPROC_READ       expects: file handle, offset, count
                       returns: data, attributes
    NFSPROC_WRITE      expects: file handle, offset, count, data
                       returns: attributes
    NFSPROC_CREATE     expects: directory file handle, name of file to be created, attributes
                       returns: nothing
    NFSPROC_REMOVE     expects: directory file handle, name of file to be removed
                       returns: nothing
    NFSPROC_MKDIR      expects: directory file handle, name of directory to be created, attributes
                       returns: file handle
    NFSPROC_RMDIR      expects: directory file handle, name of directory to be removed
                       returns: nothing
    NFSPROC_READDIR    expects: directory handle, count of bytes to read, cookie
                       returns: directory entries, cookie (which can be used to get more entries)
Figure 44.3: Some examples of the NFS Protocol

If an application running on the client tries to open the file /foo.txt, the client-side file system will send a LOOKUP request to the server, passing it the root directory's file handle and the name foo.txt; if successful, the file handle for foo.txt will be returned, along with its attributes. In case you are wondering, attributes are just the metadata that the file system tracks about each file, including fields such as file creation time, last modification time, size, ownership and permissions information, and so forth; basically, it is the same type of information that you would get back if you called the stat() system call on a file.
Once a file handle is available, the client can issue READ and WRITE protocol messages on a file to read or write the file, respectively. The READ protocol message requires the client to pass along the file handle of the file, the offset within the file, and the number of bytes to read. The server then will be able to issue the read (after all, the handle tells the server which volume and which inode to read from, and the offset and count tell it which bytes of the file to read) and return the data to the client (or an error if there was a failure). WRITE is handled similarly, except the data is passed from the client to the server, and just a success code is returned.

One last interesting protocol message is the GETATTR request; given a file handle, it simply fetches the attributes for that file, including the last modified time of the file. We will see why this protocol request is quite important in NFSv2 below when we discuss caching (see if you can guess why).
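The exact wire layouts are defined in the protocol specification [Sun89]; purely as an illustration of the point that every request is self-describing, the READ and WRITE messages might be sketched like this (field names and types are ours, not the RFC's):

    // Illustrative (not wire-accurate) layouts for READ and WRITE messages.
    // Every request names the file by handle and gives an explicit offset,
    // so the server needs no per-client state to service it.
    #include <stdint.h>

    #define NFS_FHSIZE  32          // NFSv2 handles are 32-byte opaque blobs
    #define NFS_MAXDATA 8192        // NFSv2 moves at most 8KB per READ/WRITE

    typedef struct {
        uint8_t  fh[NFS_FHSIZE];    // which file (volume/inode/generation inside)
        uint32_t offset;            // where in the file to start
        uint32_t count;             // how many bytes
    } read_args_t;

    typedef struct {
        int32_t  status;            // 0 = OK, otherwise an error code
        uint32_t count;             // bytes actually read
        uint8_t  data[NFS_MAXDATA];
        // file attributes are piggy-backed on the reply as well
    } read_result_t;

    typedef struct {
        uint8_t  fh[NFS_FHSIZE];
        uint32_t offset;
        uint32_t count;
        uint8_t  data[NFS_MAXDATA]; // WRITE carries the data itself
    } write_args_t;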
44.6 From Protocol To Distributed File System
    App:     fd = open("/foo", ...);
    Client:    send LOOKUP (root dir file handle, "foo")
    Server:      receive LOOKUP request
                 look for "foo" in root dir
                 if successful, pass back foo's file handle and attributes
    Client:    receive LOOKUP reply
               use attributes to do permissions check
               if OK to access file, allocate file descriptor in "open file table";
                 store NFS file handle therein
                 store current file position (0 to begin)
               return file descriptor to application

    App:     read(fd, buffer, MAX);
    Client:    use file descriptor to index into open file table
                 and thus find the NFS file handle for this file
               use the current file position as the offset to read from
               send READ (file handle, offset=0, count=MAX)
    Server:      receive READ request
                 file handle tells us which volume/inode number we need
                 may have to read the inode from disk (or cache)
                 use offset to figure out which block to read,
                   and inode (and related structures) to find it
                 issue read to disk (or get from server memory cache)
                 return data (if successful) to client
    Client:    receive READ reply
               update file position to current + bytes read
                 (set current file position = MAX)
               return data and error code to application

    App:     read(fd, buffer, MAX);
               (same as above, except offset=MAX; set current file position = 2*MAX)
    App:     read(fd, buffer, MAX);
               (same as above, except offset=2*MAX; set current file position = 3*MAX)
    App:     read(fd, buffer, MAX);
               (same as above, except offset=3*MAX; set current file position = 4*MAX)

    App:     close(fd);
    Client:    just need to clean up local structures
               free descriptor "fd" in open file table for this process
               (no need to talk to server)
A few points about this example are worth noting. First, notice how the client tracks all relevant state for the file access, including the mapping of the file descriptor to the NFS file handle as well as the current file position. This enables the client to turn each application read request (which, you may have noticed, does not specify the offset to read from explicitly) into a properly-formatted READ protocol message which tells the server exactly which bytes from the file to read. Upon every successful read,
the client updates the current file position in its table; thus, subsequent reads are issued with the same file handle but a different offset.

Second, you may notice where server interactions occur. When the file is opened for the first time, the client-side file system sends a LOOKUP request message. Indeed, if a long pathname must be traversed (e.g., /home/remzi/foo.txt), the client would send three LOOKUPs: one to look up home in the directory /, one to look up remzi in home, and finally one to look up foo.txt in remzi (a sketch of this traversal appears at the end of this section).

Third, you may notice how each server request has all the information needed to complete the request in its entirety. This design point is critical to be able to gracefully recover from server failure, as we will now discuss.
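Here is the promised sketch of that pathname traversal, showing one LOOKUP per path component. The types nfs_fhandle_t and nfs_attr_t, and the function nfs_lookup(), are hypothetical stand-ins for the client's RPC layer, not real NFS client APIs.

    // Resolve a path such as "/home/remzi/foo.txt" by issuing one LOOKUP per
    // component, starting from the root handle obtained at mount time.
    #include <string.h>

    typedef struct { unsigned char opaque[32]; } nfs_fhandle_t;
    typedef struct { long mtime; long size; } nfs_attr_t;

    // hypothetical: sends a LOOKUP RPC; returns 0 and fills *out/*attr on success
    int nfs_lookup(nfs_fhandle_t dir, const char *name,
                   nfs_fhandle_t *out, nfs_attr_t *attr);

    int resolve_path(nfs_fhandle_t root, const char *path, nfs_fhandle_t *out) {
        char buf[1024];
        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';

        nfs_fhandle_t cur = root;
        // e.g., LOOKUP "home", then "remzi", then "foo.txt"
        for (char *comp = strtok(buf, "/"); comp != NULL; comp = strtok(NULL, "/")) {
            nfs_attr_t attr;
            if (nfs_lookup(cur, comp, &cur, &attr) != 0)
                return -1;           // component missing or not permitted
        }
        *out = cur;
        return 0;
    }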
44.7 Handling Server Failure With Idempotent Operations
If the client does not receive a reply in a timely manner, it simply assumes the request has not been processed and resends the request. If this time the server replies, all is well and the client has neatly handled the problem.

The key to the ability of the client to simply retry the request, regardless of what caused the failure, is an important property of most NFS requests: they are idempotent. An operation is called idempotent when the effect of performing the operation multiple times is equivalent to the effect of performing the operation a single time. For example, if you store a value to a memory location three times, it is the same as doing so once; thus "store value to memory" is an idempotent operation. If, however, you increment a counter three times, it results in a different amount than doing so just once; thus, "increment counter" is not idempotent. More generally, any operation that just reads data is obviously idempotent; an operation that updates data must be more carefully considered to determine whether it has this property.

DESIGN TIP: IDEMPOTENCY
Idempotency is a useful property when building reliable systems. When an operation can be issued more than once, it is much easier to handle failure of the operation; you can just retry it. If an operation is not idempotent, life becomes more difficult.

The key to the design of crash recovery in NFS is the idempotency of most of the common operations. LOOKUP and READ requests are trivially idempotent, as they only read information from the file server and do not update it. More interestingly, WRITE requests are also idempotent. If, for example, a WRITE fails, the client can simply retry it. Note how the WRITE message contains the data, the count, and (importantly) the exact offset to write the data to. Thus, it can be repeated with the knowledge that the outcome of multiple writes is the same as the outcome of a single write. In this way, the client can handle all timeouts in a unified way.
If a WRITE request was simply lost (Case 1 above), the client will retry it, the server will perform the write, and all will be well. The same will happen if the server happened to be down while the request was sent, but back up and running when the second request arrives (Case 2); again, all works as desired. Finally, the server may in fact receive the WRITE request, issue the write to its disk, and send a reply. This reply may get lost (Case 3), again causing the client to resend the request. When the server receives the request again, it will simply do the exact same thing: write the data to disk and reply that it has done so. If the client this time receives the reply, all is again well, and thus the client has handled both message loss and server failure in a uniform manner. Neat!
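A sketch of that unified client-side behavior appears below: every failure mode collapses into "retry the identical request". The message types and the two transport calls are hypothetical stand-ins, not a real RPC API.

    // Unified handling of a lost request (Case 1), a crashed or unreachable
    // server (Case 2), and a lost reply (Case 3): just retry. This is safe
    // only because the request is idempotent (it names the exact bytes via
    // handle + offset + count).
    #include <stdbool.h>

    typedef struct { /* handle, offset, count, data, ... */ int op; } request_t;
    typedef struct { int status; } reply_t;

    void send_request(const request_t *req);                 // hypothetical
    bool recv_reply_within(reply_t *rep, int timeout_secs);  // hypothetical

    reply_t issue_idempotent(request_t req) {
        reply_t rep;
        while (true) {
            send_request(&req);                 // the request itself may be lost
            if (recv_reply_within(&rep, 1))     // wait up to one second
                return rep;                     // got an answer: done
            // Timed out: the request, the server, or the reply failed.
            // In every case it is safe to simply resend the identical request.
        }
    }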
A small aside: some operations are hard to make idempotent. For example, when you try to make a directory that already exists, you are informed that the mkdir request has failed. Thus, in NFS, if the file server receives a MKDIR protocol message and executes it successfully, but the reply is lost, the client may repeat the request and encounter a failure, when in fact the operation initially succeeded and only the retry failed. Thus, life is not perfect.
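You can see the same effect with plain POSIX calls: repeating a mkdir() that already succeeded is not a harmless no-op, because the second call fails with EEXIST even though "the directory exists" was exactly the goal.

    // Repeating a successful mkdir() is not a no-op: the "retry" fails.
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>

    int main(void) {
        if (mkdir("d", 0755) == 0)
            printf("first mkdir: success\n");
        if (mkdir("d", 0755) < 0)        // same call again, as a retry would do
            printf("second mkdir: %s\n", strerror(errno));  // "File exists"
        return 0;
    }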
ASIDE: SOMETIMES LIFE ISN'T PERFECT
Even when you design a beautiful system, sometimes all the corner cases don't work out exactly as you might like. Take the mkdir example above; one could redesign mkdir to have different semantics, thus making it idempotent (think about how you might do so); however, why bother? The NFS design philosophy covers most of the important cases, and overall makes the system design clean and simple with regards to failure. Thus, accepting that life isn't perfect and still building the system is a sign of good engineering. Remember Ivan Sutherland's old saying: "the perfect is the enemy of the good."
44.8 Improving Performance: Client-Side Caching
To improve performance, NFS clients cache data (and metadata) that they have read from the server in client memory. The cache also serves as a temporary buffer for writes. When a client application first writes to a file, the client buffers the data in client memory (in the same cache as the data it read from the file server) before writing the data out to the server. Such write buffering is useful because it decouples application write() latency from actual write performance; i.e., the application's call to write() succeeds immediately (and just puts the data in the client-side file system's cache); only later does the data get written out to the file server (a sketch of this buffering appears at the end of this section).

Thus, NFS clients cache data, performance is usually great, and we are done, right? Unfortunately, not quite. Adding caching into any sort of system with multiple client caches introduces a big and interesting challenge which we will refer to as the cache consistency problem.
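Here is the promised sketch of client-side write buffering. The cache structure is illustrative only, not actual NFS client internals.

    // Illustrative client-side write buffering: an application write() just
    // fills a cached block and marks it dirty; the dirty data is sent to the
    // server later (in NFSv2, at the latest when the file is closed).
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    #define BLKSZ 4096

    typedef struct {
        uint8_t data[BLKSZ];
        bool    dirty;    // modified locally, not yet written to the server
        bool    valid;    // block holds meaningful contents
    } cache_block_t;

    // Returns immediately: no network traffic happens here.
    void cached_write(cache_block_t *blk, size_t off, const void *buf, size_t len) {
        // assumes off + len <= BLKSZ; a real cache would handle block spanning
        memcpy(blk->data + off, buf, len);
        blk->dirty = true;    // remember to flush (e.g., on close) later
        blk->valid = true;
    }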
44.9 The Cache Consistency Problem
    C1                C2                   Server S
    cache: F[v1]      cache: F[v2]         disk: F[v1] (at first)
                      (not yet flushed)          F[v2] (eventually)
Figure 44.6: The Cache Consistency Problem

Imagine that client C2 has written a new version of file F (call it F[v2]) but has not yet flushed it to the server; if another client reads F from the server, it will get an old version (F[v1]), which may be surprising. Thus, let us call this aspect of the cache consistency problem update visibility: when do updates from one client become visible at other clients?

The second subproblem of cache consistency is a stale cache; in this case, C2 has finally flushed its writes to the file server, and thus the server has the latest version (F[v2]). However, C1 still has F[v1] in its cache; if a program running on C1 reads file F, it will get a stale version (F[v1]) and not the most recent copy (F[v2]). Again, this may result in undesirable behavior.

NFSv2 implementations solve these cache consistency problems in two ways. First, to address update visibility, clients implement what is sometimes called flush-on-close consistency semantics; specifically, when a file is written to and subsequently closed by a client application, the client flushes all updates (i.e., dirty pages in the cache) to the server. With flush-on-close consistency, NFS tries to ensure that an open() from another node will see the latest version of the file.

Second, to address the stale-cache problem, NFSv2 clients first check whether a file has changed before using its cached contents. Specifically, when opening a file, the client-side file system will issue a GETATTR request to the server to fetch the file's attributes. The attributes, importantly, include information as to when the file was last modified on the server; if the time of modification is more recent than the time the file was fetched into the client cache, the client invalidates the file, thus removing it from the client cache and ensuring that subsequent reads will go to the server and retrieve the latest version of the file.
If, on the other hand, the client sees that it has the latest version of the file, it will go ahead and use the cached contents, thus increasing performance.

When the original team at Sun implemented this solution to the stale-cache problem, they realized a new problem: suddenly, the NFS server was flooded with GETATTR requests. A good engineering principle to follow is to design for the common case and to make it work well; here, although the common case was that a file was accessed only from a single client (perhaps repeatedly), the client always had to send GETATTR requests to the server to make sure no one else had changed the file. A client thus bombards the server, constantly asking "has anyone changed this file?", when most of the time no one had.

To remedy this situation (somewhat), an attribute cache was added to each client. A client would still validate a file before accessing it, but most often would just look in the attribute cache to fetch the attributes. The attributes for a particular file were placed in the cache when the file was first accessed, and would then time out after a certain amount of time (say, 3 seconds). Thus, during those three seconds, all file accesses would determine that it was OK to use the cached file and thus do so with no network communication with the server.
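Putting the two mechanisms together, an open() on the client might validate its cached copy roughly as sketched below. The structures, the 3-second timeout, and nfs_getattr_mtime() are illustrative stand-ins, not actual NFS client code.

    // Illustrative stale-cache check at open() time: consult the attribute
    // cache first (so we do not issue a GETATTR on every access); if the
    // cached attributes are too old, refetch them; if the server's copy is
    // newer than the data we cached, the cached contents must be discarded.
    #include <stdbool.h>
    #include <time.h>

    #define ATTR_TIMEOUT 3   // seconds during which cached attributes are trusted

    typedef struct {
        time_t server_mtime;  // last-modified time reported by the server
        time_t fetched_at;    // when we last ran GETATTR for this file
        bool   valid;
    } attr_cache_entry_t;

    time_t nfs_getattr_mtime(void);   // hypothetical: issues the GETATTR RPC

    // 'cached_at' is when the file's data was fetched into the client cache.
    bool cached_copy_is_fresh(attr_cache_entry_t *ae, time_t cached_at) {
        time_t now = time(NULL);
        if (!ae->valid || now - ae->fetched_at > ATTR_TIMEOUT) {
            ae->server_mtime = nfs_getattr_mtime();   // go ask the server
            ae->fetched_at   = now;
            ae->valid        = true;
        }
        // Newer modification on the server than our copy means it is stale.
        return ae->server_mtime <= cached_at;
    }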
44.10 Assessing NFS Cache Consistency
The addition of the attribute cache made it very hard to understand or reason about exactly what version of a file one was getting. Sometimes you would get the latest version; sometimes you would get an old version simply because your attribute cache hadn't yet timed out and thus the client was happy to give you what was in client memory. Although this was fine most of the time, it would (and still does!) occasionally lead to odd behavior. And thus we have described the oddity that is NFS client caching. Whew!
44.11 Implications On Server-Side Write Buffering
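Our focus so far has been on client caching, but servers buffer data in memory too, and that raises its own problem. To set it up, imagine a client application that overwrites an existing three-block file with three sequential write() calls; a sketch of such a sequence (the file name and block size are illustrative):

    // Overwrite the three blocks of an existing file: the first with a's,
    // then b's, then c's.
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096

    int main(void) {
        char a[BLOCK], b[BLOCK], c[BLOCK];
        memset(a, 'a', BLOCK);
        memset(b, 'b', BLOCK);
        memset(c, 'c', BLOCK);

        int fd = open("file.txt", O_WRONLY);   // assume the file already exists
        write(fd, a, BLOCK);                   // overwrite first block with a's
        write(fd, b, BLOCK);                   // overwrite second block with b's
        write(fd, c, BLOCK);                   // overwrite third block with c's
        close(fd);
        return 0;
    }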
These writes overwrite the three blocks of the file with a block of a's, then b's, and then c's. Thus, if the file initially looked like this:

xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
zzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzz
we might expect the final result after these writes to be like this:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
The x's, y's, and z's would be overwritten with a's, b's, and c's, respectively. Now let's assume for the sake of the example that these three client writes were issued to the server as three distinct WRITE protocol messages. Assume the first WRITE message is received by the server and issued to the disk, and the client is informed of its success. Now assume the second write is just buffered in memory, and the server also reports success to the client before forcing it to disk; unfortunately, the server crashes before writing it to disk. The server quickly restarts and receives the third write request, which also succeeds. Thus, to the client, all the requests succeeded, but we are surprised that the file contents look like this:
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
yyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyyy  <--- oops
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
Yikes! Because the server told the client that the second write was successful before committing it to disk, an old chunk is left in the file, which, depending on the application, might render the file completely useless.

To avoid this problem, NFS servers must commit each write to stable (persistent) storage before informing the client of success; doing so enables the client to detect server failure during a write, and thus retry until it finally succeeds. Doing so ensures we will never end up with file contents intermingled as in the above example.
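Expressed in terms of plain POSIX calls, the rule is simple: force the data down before replying. This is only a sketch; a real NFS server works against its own file system internals, and reply_success()/reply_error() here are hypothetical RPC-layer calls.

    // Server-side handling of a WRITE request: reply only after the data has
    // reached stable storage, so a crash after the reply can never lose it.
    #include <sys/types.h>
    #include <unistd.h>

    void reply_success(void);   // hypothetical
    void reply_error(void);     // hypothetical

    void handle_write(int local_fd, const void *data, size_t count, off_t offset) {
        if (pwrite(local_fd, data, count, offset) != (ssize_t)count) {
            reply_error();
            return;
        }
        if (fsync(local_fd) != 0) {   // commit to disk (or battery-backed RAM)
            reply_error();
            return;
        }
        reply_success();              // only now is it safe to tell the client
    }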
The problem this requirement gives rise to in NFS server implementations is that write performance, without great care, can become the major performance bottleneck. Indeed, some companies (e.g., Network Appliance) came into existence with the simple objective of building an NFS server that can perform writes quickly. One trick they use is to first put writes in battery-backed memory, thus enabling the server to reply quickly to WRITE requests without fear of losing the data and without the cost of having to write to disk right away; the second trick is to use a file system design specifically designed to write to disk quickly when one finally needs to do so [HLM94, RO91].
44.12 Summary
We have seen the introduction of a distributed file system known as NFS. NFS is centered around the idea of simple and fast recovery in the face of server failure, and achieves this end through careful protocol design. Idempotency of operations is essential; because a client can safely replay a failed operation, it is OK to do so whether or not the server has executed the request.

We also have seen how the introduction of caching into a multiple-client, single-server system can complicate things. In particular, the system must resolve the cache consistency problem in order to behave reasonably; however, NFS does so in a slightly ad hoc fashion which can occasionally result in observably weird behavior. Finally, we saw how caching on the server can be tricky; in particular, writes to the server must be forced to stable storage before returning success (otherwise data can be lost).
References
[S86] "The Sun Network File System: Design, Implementation and Experience"
Russel Sandberg
USENIX Summer 1986
The original NFS paper. Frankly, it is pretty poorly written and makes some of the behaviors of NFS hard to understand.

[P+94] "NFS Version 3: Design and Implementation"
Brian Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane Lebel, Dave Hitz
USENIX Summer 1994, pages 137-152

[P+00] "The NFS Version 4 Protocol"
Brian Pawlowski, David Noveck, David Robinson, Robert Thurlow
Proceedings of the 2nd International System Administration and Networking Conference (SANE 2000)

[C00] "NFS Illustrated"
Brent Callaghan
Addison-Wesley Professional Computing Series, 2000
A great NFS reference.

[Sun89] "NFS: Network File System Protocol Specification"
Sun Microsystems, Inc.
Request for Comments: 1094, March 1989
Available: http://www.ietf.org/rfc/rfc1094.txt

[O91] "The Role of Distributed State"
John K. Ousterhout
Available: ftp://ftp.cs.berkeley.edu/ucb/sprite/papers/state.ps

[HLM94] "File System Design for an NFS File Server Appliance"
Dave Hitz, James Lau, Michael Malcolm
USENIX Winter 1994, San Francisco, California
Hitz et al. were greatly influenced by previous work on log-structured file systems.

[RO91] "The Design and Implementation of the Log-structured File System"
Mendel Rosenblum, John Ousterhout
Symposium on Operating Systems Principles (SOSP), 1991