Chapter 12: File System Implementation: Silberschatz, Galvin and Gagne ©2013 Operating System Concepts - 9 Edition

Download as ppt, pdf, or txt
Download as ppt, pdf, or txt
You are on page 1of 45

Chapter 12: File System

Implementation

Operating System Concepts– 99h Edition Silberschatz, Galvin and Gagne ©2013
Chapter 12: File System Implementation
 File-System Structure
 File-System Implementation
 Directory Implementation
 Allocation Methods
 Free-Space Management
 Efficiency and Performance
 Recovery
 NFS
 Example: WAFL File System

Operating System Concepts – 9th Edition 12.2 Silberschatz, Galvin and Gagne ©20
Objectives
 To describe the details of implementing local file systems and directory structures

 To describe the implementation of remote file systems

 To discuss block allocation and free-block algorithms and trade-offs

Operating System Concepts – 9th Edition 12.3 Silberschatz, Galvin and Gagne ©20
Un peu de français
 http://gdt.oqlf.gouv.qc.ca : Grand Dictionnaire Technologique de l’Office Québécois de la Langue Francaise
 http://translate.google.com/

 file system système de fichiers


 FCB file control bloc (inode) bloc d’informations sur un fichier
 device driver pilote de périphérique
 directory répertoire
 slab tranche

Operating System Concepts – 9th Edition 12.4 Silberschatz, Galvin and Gagne ©20
File System Structure
 File structure
 Logical storage unit
 Collection of related information
 File system resides on secondary storage (disks)
 Given a user interface to storage, maps logical to physical
 Provides efficient and convenient access to disk by allowing data to be stored and retrieved easily
 Disk provides in-place rewrite (read, modify, write back) and random access
 I/O transfers performed in blocks of one or more sectors (32 bytes to 4K bytes, usually 512 bytes)
 File control block (inode) – storage structure consisting of information about a file
 Device driver controls the physical device
 File system organized into layers

Operating System Concepts – 9th Edition 12.5 Silberschatz, Galvin and Gagne ©20
File System Layers
 Device drivers manage I/O devices at the I/O control layer
 Given commands like “read drive1, cylinder 72, track 2, sector 10,
into memory location 1060”, it outputs low-level hardware specific
commands to hardware controller
 Basic file system
 Given command like “retrieve block 123” translates to device driver
 Also manages memory buffers and caches (allocation, freeing,
replacement)
 Buffers hold data in transit
 Caches hold frequently used data
 File organization module understands files, logical address, and
physical blocks
 Translates logical block address to physical block address
 Manages free space, disk allocation
 Logical file system manages metadata information
 Translates file name into file number, file handle, location by
maintaining file control blocks (inodes in UNIX)
 Directory management
 Protection

Operating System Concepts – 9th Edition 12.6 Silberschatz, Galvin and Gagne ©20
File System Layers (cont.)
 Layering useful for reducing complexity and redundancy, but adds overhead and can decrease performance
 Logical layers can be implemented by any coding method according to OS designer
 Many file systems, sometimes many within an operating system
 Each with its own format
 CD-ROM is ISO 9660
 UNIX has UFS, FFS
 Windows has FAT, FAT32, NTFS as well as floppy, CD, DVD, Blu-ray
 Linux has more than 40 types, with extended file system ext2 and ext3 leading; plus distributed file
systems, etc.
 New ones still arriving – ZFS, GoogleFS, Oracle ASM, FUSE

Operating System Concepts – 9th Edition 12.7 Silberschatz, Galvin and Gagne ©20
File System Implementation
 We have system calls at the API level, but how do we implement their functions?
 On-disk and in-memory structures
 On disk, a file system may contain :
 Boot control block contains info needed by system to boot OS from that volume
 Needed if volume contains OS (otherwise empty block), usually first block of volume
 Volume control block (superblock, master file table) contains volume/partition details
 Total number of blocks, number of free blocks, block size, free block pointers or array
 Directory structure organizes the files (per file system)
 Names and associated inode numbers, master file table
 Per-file File Control Block (FCB) contains many details about the file
 unique inode number, permissions, size, dates
 NTFS stores into in master file table using relational database structures

Typical FCB File creation:


- allocate FCB
- read directory into memory
- update directory with new file name
- write data back to disk

Operating System Concepts – 9th Edition 12.8 Silberschatz, Galvin and Gagne ©20
In-Memory File System Structures
 In memory, a file system uses these
informations for management and
performance (caching) :
 Mount table storing file system mounts,
mount points, file system types
 System-wide open-file table contains a
FCB for each open file, etc.
 Per-process open-file table has a
pointer to the entry in the system-wide
open-file table, etc.
 Buffers hold data blocks from secondary
storage
 The figure illustrates the necessary file system
structures provided by the operating systems
 (a) refers to opening a file open():
if no entry in SWOF
 Open returns a file handle/descriptor for
search dir and create an entry
subsequent use
create entry in PPOF
 (b) refers to reading a file add pointer to SWOF
 Data from read eventually copied to increment file used counter in SWOF
specified user process memory address close():
reverse order of operations
if file counter is 0
remove entry in SWOF and write metadata to disk
Operating System Concepts – 9th Edition 12.9 Silberschatz, Galvin and Gagne ©20
Partitions and Mounting
 Disk can be divided in multiple partitions
 Partition can be
 a volume containing a file system (“cooked”) or
 raw – just a sequence of blocks with no file system (e.g., swap, database)
 Boot block can point to boot volume or boot loader
 Set of blocks that contain enough code to know how to load the kernel from the file system
 Or a boot management program for multi-OS booting
 Root partition contains the OS, other partitions can hold other OSes, other file systems, or be raw
 Mounted at boot time
 Other partitions can mount automatically or manually
 At mount time, file system consistency is checked
 Is all metadata correct?
 NO: fix it, try again
 YES: add to mount table, allow access

Operating System Concepts – 9th Edition 12.10 Silberschatz, Galvin and Gagne ©20
Virtual File Systems
 Need to access transparently different file
systems on the same disk, or over a network,
mounted locally open, read, write, close
 Virtual File Systems (VFS) on UNIX provide an
object-oriented way of implementing file systems
 VFS allows the same system call interface (API)
to be used for different types of file systems
 Separates file-system generic operations
from implementation details
 Implementation can be one of many file
systems types, or network file system
 Implements vnodes which hold inodes
or network file details
 Then dispatches operation to appropriate
file system implementation routines
 The API is to the VFS interface, rather than any
specific type of file system

Operating System Concepts – 9th Edition 12.11 Silberschatz, Galvin and Gagne ©20
Virtual File System Implementation
 For example, Linux has four object types:
 inode: individual file
 file: open file
 superblock: entire file system
 dentry: individual directory entry

 VFS defines set of operations on the objects that must be implemented


 Every object has a pointer to a function table
 Function table has addresses of routines to implement that function on that object
 They include open(), close(), read(), write(), mmap(), etc.

Operating System Concepts – 9th Edition 12.12 Silberschatz, Galvin and Gagne ©20
Directory Implementation
 Linear List of file names with pointers to the data blocks
 Simple to program
 Time-consuming to execute
 Linear search time
 Could keep ordered alphabetically via linked list or use B+ tree

 Hash Table – linear list with hash data structure


 Decreases directory search time
 Collisions – situations where two file names hash to the same location
 Only good if entries are fixed size, or then use chained-overflow method (hash entry can be a linked
list instead of an individual value)

Operating System Concepts – 9th Edition 12.13 Silberschatz, Galvin and Gagne ©20
Allocation Methods - Contiguous
 An allocation method refers to how disk blocks are allocated for files; three major methods:
 Contiguous
 Linked
 Indexed

 Contiguous allocation – each file occupies set of contiguous blocks


 Best performance in most cases
 Disk head does not jump from sector to sector, but moves along a track, and moves by one track
from cylinder to the next cylinder
 Simple
 Only starting location (block address) and length (number of blocks) are required
 Direct access is also supported by easy calculation to find i-th block
 Problems include :
 finding space for a new file
 knowing file size
– easy when file is copied, but a file can still evolve
– best fit algorithm can be applied to relocate with file new size, but time consuming
 external fragmentation
– need for compaction off-line (downtime) or on-line (but performance penalty)

Operating System Concepts – 9th Edition 12.14 Silberschatz, Galvin and Gagne ©20
Contiguous Allocation of Disk Space

Operating System Concepts – 9th Edition 12.15 Silberschatz, Galvin and Gagne ©20
Extent-Based Systems
 Many newer file systems (e.g., Veritas File System) use a modified contiguous allocation scheme

 Extent-based file systems allocate disk blocks in extents

 An extent is a contiguous chunk of space


 A file has a location, a block count, a link to the first block of extent
 A file consists of one or more extents
 Internal and external fragmentations are still a problem

Operating System Concepts – 9th Edition 12.16 Silberschatz, Galvin and Gagne ©20
Allocation Methods - Linked
 Linked allocation – each file is a linked list of blocks
 Directory has a pointer to first and last blocks
 Each block contains a pointer to the next block, last block
has a nil pointer
 No external fragmentation, no compaction needed
 File can grow as long as there are free blocks
 Free space management system called when new block
needed
 Need to allocate space for block pointers
 Improve efficiency by clustering blocks into groups,
but it increases internal fragmentation
 Direct access can be inefficient
 Locating a block can take many I/Os and disk seeks
(must follow the pointers)
 Reliability can be a problem
 Damaged pointer loses the rest of the file
 Partly solved with doubly linked list or list of pointers,
but more overheads

Operating System Concepts – 9th Edition 12.17 Silberschatz, Galvin and Gagne ©20
Allocation Methods - Linked
 FAT (File Allocation Table) variation of MS-DOS
 Beginning of volume has table, indexed by block number
 Table entry i contains next block index
 Much like a linked list, but faster on disk and cacheable (keep FAT in memory)
 New block allocation simple (find 0 entries)

Operating System Concepts – 9th Edition 12.18 Silberschatz, Galvin and Gagne ©20
Allocation Methods - Indexed
 Indexed allocation
 Each file has its own index block(s) of
pointers to its data blocks

 Need to allocate index table in first block

 Allow for random access

 Dynamic access without external fragmentation,


but overhead for reading index block

 Mapping from logical to physical in a file of


maximum size of 256K bytes and block size of
512 bytes. We need only 1 block for index table

Operating System Concepts – 9th Edition 12.19 Silberschatz, Galvin and Gagne ©20
Indexed Allocation – Mapping (cont.)

outer-index

index table file

Operating System Concepts – 9th Edition 12.20 Silberschatz, Galvin and Gagne ©20
Combined Scheme: UNIX UFS
(4K bytes per block, 32-bit addresses)
Note: More index blocks
than can be addressed
with 32-bit file pointer

12 pointers

Operating System Concepts – 9th Edition 12.21 Silberschatz, Galvin and Gagne ©20
Performance
 Best allocation method depends on file access type

 Contiguous
 great for sequential and random accesses
 Linked
 good for sequential (keep address of next block in memory)
 bad for random (i-th block needs i reads)
 Some systems support both contiguous and linked allocation schemes
 Declare access type at creation, select either contiguous or linked
 Can convert to each other by re-creating/copying the file
 Indexed: more complex to evaluate its performance
 Single block access could require one or more index block reads, then data block read
 Clustering can help improve throughput, reduce CPU overhead

Operating System Concepts – 9th Edition 12.22 Silberschatz, Galvin and Gagne ©20
Performance (cont.)
 Adding instructions to the execution path to save one disk I/O is reasonable
 Intel Core i7 Extreme Edition 990x (2011) at 3.46GHz = 159,000 MIPS
 http://en.wikipedia.org/wiki/Instructions_per_second
 Typical disk drive at 250 I/Os per second
 159,000 MIPS / 250 = 630 million instructions during one disk I/O
 Fast SSD drives provide 60,000 IOPS
 159,000 MIPS / 60,000 = 2.65 millions instructions during one disk I/O

Operating System Concepts – 9th Edition 12.23 Silberschatz, Galvin and Gagne ©20
Free-Space Management
 File system maintains free-space list to track available blocks/clusters
 (Using term “block” for simplicity)
 Bit vector or bit map (n blocks)

0 1 2 n-1

 …

1  block[i ] free
bit[i] =
0  block[i ] occupied

 Easy and efficient to get contiguous files


 CPUs have instructions to return offset within word of first “1” bit
 Block number calculation : (number of bits per word) x (number of 0-value words) + offset of first 1 bit
 Bit map requires extra space and disks are getting larger
block size = 4KB = 212 bytes
disk size = 240 bytes (1 terabyte)
n = 240/212 = 228 bits (or 256MB); if clusters of 4 blocks = 64MB of memory

Operating System Concepts – 9th Edition 12.24 Silberschatz, Galvin and Gagne ©20
Free-Space Management (cont.)

 Linked list (free list) with special list-head pointer (in volume control
block)
 No waste of space
 Cannot get contiguous space easily
 Traversing the list requires substantial I/O time, but fast to get
only one free block
 No need to traverse the entire list (if # free blocks recorded)
 Grouping
 Modify linked list to store addresses of next n - 1 free blocks in
first free block (similar to indexing)
 Last address in block is a pointer to next block that contains
free-block pointers (like this one)
 Counting
 Space is frequently contiguously used and freed, with
contiguous allocation, extents, or clustering
 Keep address of first free block and a count of following
(contiguous) free blocks
 Free space list then has entries containing addresses and
counts

Operating System Concepts – 9th Edition 12.25 Silberschatz, Galvin and Gagne ©20
Free-Space Management (cont.)
 Space Maps
 Used in ZFS
 Consider meta-data I/O on very large file systems
 Full data structures like bit maps could not fit in memory, thus requiring thousands of I/Os
 Divides device space into metaslab units and manages metaslabs
 A volume can contain hundreds of metaslabs
 Each metaslab has an associated space map
 Uses counting algorithm to store info about free blocks
 But records to log file rather than file system
 Log of all block activity, in time order, in counting format
 Metaslab activity -> load space map into memory in balanced-tree structure, indexed by offset
 Replay log into that structure; in-memory space map is a rebuilt representation of the allocated/free
space
 Combine contiguous free blocks into single entry

Operating System Concepts – 9th Edition 12.26 Silberschatz, Galvin and Gagne ©20
Disk Efficiency and Performance
 Efficiency dependent on:
 Disk allocation and directory algorithms
 E.g., pre-allocating inodes spread over the disk such that data will later be near its inode block
 Types of data kept in file’s directory entry
 E.g., last-read time for a file requires to write data to the file inode (costly for reading a file)
 Pre-allocation or as-needed allocation of metadata structures
 Fixed-size or varying-size data structures
 E.g., pointer size in various tables, evolving technology and needs
 Performance
 Keeping data and metadata close together
 Buffer cache – separate section of main memory for frequently used blocks
 Synchronous vs. asynchronous writes to disk
 Synchronous writes sometimes requested by apps or needed by OS, e.g., for metadata
– No buffering / caching allowed – writes must hit disk before acknowledgement
 Asynchronous writes are more common, buffer-able, faster
 Free-behind and read-ahead – techniques to optimize sequential access (unlike LRU) because we
know the previous page will not be used again, and the following pages will be necessary soon
 Counter-intuitive: reads frequently slower than writes, because writes are asynchronous and
optimized by disk controller, and reads may include some read-ahead

Operating System Concepts – 9th Edition 12.27 Silberschatz, Galvin and Gagne ©20
Page and Unified Buffer Caches
 A page cache caches pages rather than disk blocks using virtual memory techniques and addresses
 Memory-mapped I/O uses a page cache
 Routine I/O through the file system uses the buffer (disk) cache

 A unified buffer cache uses the same page cache to cache both memory-mapped pages and ordinary
file system I/O to avoid double caching

 But which caches get priority, and what replacement algorithms to use?

Operating System Concepts – 9th Edition 12.28 Silberschatz, Galvin and Gagne ©20
Recovery
 Files and directories might be inconsistent/corrupted between in-memory and on-disk if a crash occurs or in
cases of bugs in file system, disk controller, user application, etc.
 Affects directory structures, free-block pointers, free FCB pointers
 Increased risks under caching strategies

 Consistency checking – compare data in directory structure with data blocks on disk, and try to fix
inconsistencies
 Can be slow and sometimes fails (e.g., a directory entry on an indexed allocation system)
 UNIX writes synchronously results from space allocation or metadata changes (not safe to crashes)
 Another solution applies writes onto new blocks, and only after successful writes, does it update the pointers
 Could keep the old pointers to recover into a snapshot
 Could also add a checksum on metadata and data blocks

 Use system programs to back up data from disk to another storage device (magnetic tape, other magnetic
disk, optical disk)
 Can use full backup in-between a series of incremental backups with only files that have changed
 Recover lost file or disk by restoring data from backup
 Keep some permanent backups away from the system
 Beware not to re-use backup media too many times

Operating System Concepts – 9th Edition 12.29 Silberschatz, Galvin and Gagne ©20
Log Structured File Systems
 Log structured (or journaling) file systems record each metadata update to the file system as a
transaction
 All transactions are written to a log
 A transaction is considered committed once it is written to the log (sequentially)
 Sometimes to a separate device or section of disk
 However, the file system may not yet be updated
 The transactions in the log are asynchronously written to the file system structures
 When the file system structures are modified, the transaction is removed from the log
 If the file system crashes, all remaining transactions in the log must still be performed
 A problem when a transaction was not committed before the crash, and must ‘undo’
 Faster recovery from crash, removes chance of inconsistency of metadata

Operating System Concepts – 9th Edition 12.30 Silberschatz, Galvin and Gagne ©20
The Sun Network File System (NFS)
 An implementation and a specification of a software system for accessing remote files across LANs (or
WANs)

 The implementation is part of the Solaris and SunOS operating systems running on Sun workstations using
an unreliable datagram protocol (UDP)/IP protocol and Ethernet

Operating System Concepts – 9th Edition 12.31 Silberschatz, Galvin and Gagne ©20
NFS (cont.)
 Interconnected workstations viewed as a set of independent machines with independent file systems,
which allows sharing among these file systems in a transparent manner
 A remote directory is mounted over a local file system directory
 The mounted directory looks like an integral subtree of the local file system, replacing the
subtree descending from the local directory
 Specification of the remote directory for the mount operation is nontransparent; the host name of the
remote directory has to be provided
 Files in the remote directory can then be accessed in a transparent manner
 Subject to access-rights accreditation, potentially any file system (or directory within a file system),
can be mounted remotely on top of any local directory

Operating System Concepts – 9th Edition 12.32 Silberschatz, Galvin and Gagne ©20
NFS (cont.)
 NFS is designed to operate in a heterogeneous environment of different machines, operating systems, and
network architectures; the NFS specifications independent of these media

 This independence is achieved through the use of RPC primitives built on top of an External Data
Representation (XDR) protocol used between two implementation-independent interfaces

 The NFS specification distinguishes between the services provided by a mount mechanism and the actual
remote-file-access services

Operating System Concepts – 9th Edition 12.33 Silberschatz, Galvin and Gagne ©20
Three Independent File Systems

Operating System Concepts – 9th Edition 12.34 Silberschatz, Galvin and Gagne ©20
Mounting in NFS

Mounts Cascading mounts

Operating System Concepts – 9th Edition 12.35 Silberschatz, Galvin and Gagne ©20
NFS Mount Protocol
 Establishes initial logical connection between server and client
 Mount operation includes name of remote directory to be mounted and name of server machine storing it
 Mount request is mapped to corresponding RPC and forwarded to mount server running on server
machine
 Export list – specifies local file systems that server exports for mounting, along with names of
machines that are permitted to mount them
 Following a mount request that conforms to its export list, the server returns a file handle—a key for further
accesses
 File handle – a file-system identifier, and an inode number to identify the mounted directory within the
exported file system
 The mount operation changes only the user’s view and does not affect the server side

Operating System Concepts – 9th Edition 12.36 Silberschatz, Galvin and Gagne ©20
NFS Protocol
 Provides a set of remote procedure calls for remote file operations. The procedures support the following
operations:
 searching for a file within a directory
 reading a set of directory entries
 manipulating links and directories
 accessing file attributes
 reading and writing files
 NFS servers are stateless; each request has to provide a full set of arguments (NFS V4 is just coming
available – very different, stateful)
 Modified data must be committed to the server’s disk before results are returned to the client (lose
advantages of caching)
 The NFS protocol does not provide concurrency-control mechanisms

Operating System Concepts – 9th Edition 12.37 Silberschatz, Galvin and Gagne ©20
Three Major Layers of NFS Architecture
 UNIX file-system interface (based on the open, read, write, and close calls, and file descriptors)

 Virtual File System (VFS) layer – distinguishes local files from remote ones, and local files are further
distinguished according to their file-system types
 The VFS activates file-system-specific operations to handle local requests according to their file-
system types
 Calls the NFS protocol procedures for remote requests

 NFS service layer – bottom layer of the architecture


 Implements the NFS protocol

Operating System Concepts – 9th Edition 12.38 Silberschatz, Galvin and Gagne ©20
Schematic View of NFS Architecture

Operating System Concepts – 9th Edition 12.39 Silberschatz, Galvin and Gagne ©20
NFS Path-Name Translation
 Performed by breaking the path into component names and performing a separate NFS lookup call for
every pair of component name and directory vnode

 To make lookup faster, a directory name lookup cache on the client’s side holds the vnodes for remote
directory names

Operating System Concepts – 9th Edition 12.40 Silberschatz, Galvin and Gagne ©20
NFS Remote Operations
 Nearly one-to-one correspondence between regular UNIX system calls and the NFS protocol RPCs
(except opening and closing files)

 NFS adheres to the remote-service paradigm, but employs buffering and caching techniques for the sake
of performance

 File-blocks cache – when a file is opened, the kernel checks with the remote server whether to fetch or
revalidate the cached attributes
 Cached file blocks are used only if the corresponding cached attributes are up to date

 File-attribute cache – the attribute cache is updated whenever new attributes arrive from the server

 Clients do not free delayed-write blocks until the server confirms that the data have been written to disk

Operating System Concepts – 9th Edition 12.41 Silberschatz, Galvin and Gagne ©20
Example: WAFL File System
 Used on Network Appliance “Filers” – distributed file system appliances

 “Write-anywhere file layout”

 Serves up NFS, CIFS, http, ftp

 Random I/O optimized, write optimized


 NVRAM for write caching

 Similar to Berkeley Fast File System, with extensive modifications

Operating System Concepts – 9th Edition 12.42 Silberschatz, Galvin and Gagne ©20
The WAFL File Layout

Operating System Concepts – 9th Edition 12.43 Silberschatz, Galvin and Gagne ©20
Snapshots in WAFL

Operating System Concepts – 9th Edition 12.44 Silberschatz, Galvin and Gagne ©20
End of Chapter 12

Operating System Concepts– 99h Edition Silberschatz, Galvin and Gagne ©2013

You might also like