11.1 EXT2 File System
11.1 EXT2 File System
11.1 EXT2 File System
11
Abstract
This chapter covers EXT2 file system. The goal of this chapter is to lead the reader to implement a
complete EXT2 file system that is totally Linux compatible. The premise is that if the reader
understands one file system well, it should be easy to adapt to any other file systems. It first
describes the historic role of EXT2 file system in Linux and the current status of EXT3/EXT4 file
systems. It uses programming examples to show the various EXT2 data structures and how to
traverse the EXT2 files system tree. Then it shows how to implement an EXT2 file system which
supports all file operations as in the Linux kernel. It shows how to build a base file system by
mount_root from a virtual disk. Then it divides the file system implementation into 3 levels. Level-1
expands the base file system to implement the file system tree. Level-2 implements read/write
operations of file contents. Level-3 implements mount/umount of file systems and file protection.
In each level, it describes the algorithms of the file system functions and demonstrates their
implementations by programming examples. Each level is cumulated by a programming project.
The final project is to integrate all the programming examples and exercises into a fully functional
file system.
For many years, Linux used EXT2 (Card et al. 1995) as the default file system. EXT3 (ETX3, 2014) is
an extension of EXT2. The main addition in EXT3 is a journal file, which records changes made to the
file system in a journal log. The log allows for quicker recovery from errors in case of a file system
crash. An EXT3 file system with no error is identical to an EXT2 file system. The newest extension of
EXT3 is EXT4 (Cao et al. 2007). The major change in EXT4 is in the allocation of disk blocks. In
EXT4, block numbers are 48 bits. Instead of discrete disk blocks, EXT4 allocates contiguous ranges of
disk blocks, called extents. Other than these minor changes, the file system structure and file operations
remain the same. The purpose of this book is to teach the principles of file systems. Large file storage
capacity is not the primary goal. Principles of file system design and implementation, with an emphasis
on simplicity and compatibility with Linux, are the major focal points. For these reasons, we shall use
ETX2 as the file system. The goal of this chapter is to lead the reader to implement a complete EXT2
file system that is totally Linux compatible. The premise is that if the reader understands one file
system well, it should be relatively easy to adapt to any other file systems.
creates an EXT2 file system on a device with nblocks blocks of blksize bytes per block and ninodes
inodes. The device can be either a real device or a virtual disk file. If blksize is not specified, the default
block size is 1 KB. If ninoides is not specified, mke2fs will compute a default ninodes number based on
nblocks. The resulting EXT2 file system is ready for use in Linux. As a specific example, the following
commands
creates an EXT2 files system on a virtual disk file named vdisk with 1440 blocks of 1 KB block size.
11.2.3 Superblock
Block#1: Superblock: (at byte offset 1024 in hard disk partitions): B1 is the superblock, which
contains information about the entire file system. Some of the important fields of the superblock
structure are shown below.
struct ext2_super_block {
u32 s_inodes_count; /* Inodes count */
u32 s_blocks_count; /* Blocks count */
u32 s_r_blocks_count; /* Reserved blocks count */
u32 s_free_blocks_count; /* Free blocks count */
u32 s_free_inodes_count; /* Free inodes count */
u32 s_first_data_block; /* First Data Block */
u32 s_log_block_size; /* Block size */
u32 s_log_cluster_size; /* Allocation cluster size */
u32 s_blocks_per_group; /* # Blocks per group */
u32 s_clusters_per_group; /* # Fragments per group */
u32 s_inodes_per_group; /* # Inodes per group */
u32 s_mtime; /* Mount time */
u32 s_wtime; /* Write time */
u16 s_mnt_count; /* Mount count */
s16 s_max_mnt_count; /* Maximal mount count */
u16 s_magic; /* Magic signature */
// more non-essential fields
u16 s_inode_size; /* size of inode structure */
}
The meanings of most superblock fields are obvious. Only a few fields deserve more explanation.
s_first_data_block ¼ 0 for 4KB block size and 1 for 1KB block size. It is used to determine the start
block of group descriptors, which is s_first_data_block + 1.
s_log_block_size determines the file block size, which is 1KB*(2**s_log_block_size), e.g.. 0 for 1KB
block size, 1 for 2KB block size and 2 for 4KB block size, etc. The most often used block size is
1KB for small file systems and 4KB for large file systems.
s_mnt_count¼ number of times the file system has been mounted. When the mount count reaches the
max_mnt_count, a fsck session is forced to check the file system for consistency.
s_magic is the magic number which identifies the file system type. For EXT2/3/4 files systems, the
magic number is 0xEF53.
Block#2: Group Descriptor Block (s_first_data_block+1 on hard disk): EXT2 divides disk blocks
into groups. Each group contains 8192 (32 K on HD) blocks. Each group is described by a group
descriptor structure.
struct ext2_group_desc {
u32 bg_block_bitmap; // Bmap block number
u32 bg_inode_bitmap; // Imap block number
u32 bg_inode_table; // Inodes begin block number
u16 bg_free_blocks_count; // THESE are OBVIOUS
u16 bg_free_inodes_count;
u16 bg_used_dirs_count;
u16 bg_pad; // ignore these
u32 bg_reserved[3];
};
304 11 EXT2 File System
Since a virtual floppy disk (FD) has only 1440 blocks, B2 contains only 1 group descriptor. The rest
are 0’s. On hard disks with a large number of groups, group descriptors may span many blocks. The
most important fields in a group descriptor are bg_block_bitmap, bg_inode_bitmap and
bg_inode_table, which point to the group’s blocks bitmap, inodes bitmap and inodes start block,
respectively. For the Linux formatted EXT2 file system, blocks 3 to 7 are reserved. So bmap¼8,
imap¼9 and inode_table ¼ 10.
Block#8: Block Bitmap (Bmap): (bg_block_bitmap): A bitmap is a sequence of bits used to represent
some kind of items, e.g. disk blocks or inodes. A bitmap is used to allocate and deallocate items. In a
bitmap, a 0 bit means the corresponding item is FREE, and a 1 bit means the corresponding item is
IN_USE. A FD has 1440 blocks but block#0 is not used by the file system. So the Bmap has only 1439
valid bits. Invalid bits are treated as IN_USE and set to 1’s.
Block#9: Inode Bitmap (Imap): (bg_inode_bitmap): An inode is a data structure used to represent a
file. An EXT2 file system is created with a finite number of inodes. The status of each inode is
represented by a bit in the Imap in B9. In an EXT2 FS, the first 10 inodes are reserved. So the Imap of
an empty EXT2 FS starts with ten 1’s, followed by 0’s. Invalid bits are again set to 1’s.
11.2.6 Inodes
Block#10: Inodes (begin) Block: (bg_inode_table): Every file is represented by a unique inode
structure of 128 (256 in EXT4) bytes. The essential inode fields are listed below.
struct ext2_inode {
u16 i_mode; // 16 bits = |tttt|ugs|rwx|rwx|rwx|
u16 i_uid; // owner uid
u32 i_size; // file size in bytes
u32 i_atime; // time fields in seconds
u32 i_ctime; // since 00:00:00,1-1-1970
u32 i_mtime;
u32 i_dtime;
u16 i_gid; // group ID
u16 i_links_count; // hard-link count
u32 i_blocks; // number of 512-byte sectors
u32 i_flags; // IGNORE
u32 i_reserved1; // IGNORE
u32 i_block[15]; // See details below
u32 i_pad[7]; // for inode size = 128 bytes
}
| 4 | 3 | 9 |
i_mode = |tttt|ugs|rwxrwxrwx|
11.2 EXT2 File System Data Structures 305
In the i_mode field, the leading 4 bits specify the file type, e.g. tttt¼1000 for REG file, 0100 for DIR,
etc. The next 3 bits ugs indicate the file’s special usage. The last 9 bits are the rwx permission bits for
file protection.
The i_size field is the file size in bytes. The various time fields are number of seconds elapsed since
0 hr 0 min, 0 s of January 1, 1970. So each time filed is a very large unsigned integer. They can be
converted to calendar form by the library function
char *ctime(&time_field)
which takes a pointer to a time field and returns a string in calendar form. For example,
The i_block[15] array contains pointers to disk blocks of a file, which are
The inode size (128 or 256) is designed to divides block size (1 KB or 4 KB) evenly, so that every
inode block contains an integral number of inodes. In the simple EXT2 file system, the number of
inodes is (a Linux default) 184. The number of inodes blocks is equal to 184/8 ¼ 23. So the inodes
blocks include B10 to B32. Each inode has a unique inode number, which is the inode’s position in
the inode blocks plus 1. Note that inode positions count from 0, but inode numbers count from 1. A
0 inode number means no inode. The root directory’s inode number is 2. Similarly, disk block numbers
also count from 1 since block 0 is never used by a file system. A zero block number means no disk
block.
Data Blocks Immediately after the inodes blocks are data blocks for file storage. Assuming
184 inodes, the first real data block is B33, which is i_block[0] of the root directory /.
struct ext2_dir_entry_2{
u32 inode; // inode number; count from 1, NOT 0
u16 rec_len; // this entry’s length in bytes
306 11 EXT2 File System
The dir_entry is an open-ended structure. The name field contains 1 to 255 chars without a
terminating NULL. So the dir_entry’s rec_len also varies.
In computer systems, a problem which arises very often is as follows. A city has M blocks, numbered
0 to M-1. Each block has N houses, numbered 0 to N-1. Each house has a unique block address,
denoted by (block, house), where 0 < ¼ block < M, 0 < ¼ house < N. An alien from outer space may be
unfamiliar with the block addressing scheme on Earth and prefers to address the houses linearly as
0,1,..N-1,N, N + 1 ....., etc. Given a block address BA ¼ (block, house), how to convert it to a linear
address LA, and vice versa? If everything counts from 0, the conversion is very simple.
Note that the conversion is valid only if everything counts from 0. If some of the items do not count
from 0, they can not be used in the conversion formula directly. The reader may try to figure out how to
handle such cases in general. For ease of reference, we shall refer to the conversion method as the
Mailman’s algorithm. The following shows applications of the Mailman’s algorithm.
(1) Test, Set and Clear bits in C: In standard C programs, the smallest addressable unit is a char or byte.
It is often necessary to manipulate bits in a bitmap, which is a sequence of bits. Consider char buf
[1024], which has 1024 bytes, denoted by buf[i], i ¼ 0, 1,. . . ., 1023. It also has 8192 bits
numbered 0,1,2,....8191. Given a bit number BIT, e.g. 1234, which byte i contains the bit, and
which bit j is it in that byte? Solution:
This allows us to combine the Mailman’s algorithm with bit masking to do the following bit
operations in C.
struct bits{
unsigned int bit0 : 1; // bit0 field is a single bit
unsigned int bit123 : 3; // bit123 field is a range of 3 bits
unsigned int otherbits : 27; // other bits field has 27 bits
unsigned int bit31 : 1; // bit31 is the highest bit
}var;
The structure defines var. as an unsigned 32-bit integer with individual bits or ranges of bits. Then,
var.bit0 ¼ 0; assigns 1 to bit 0, and var.bit123 ¼ 5; assigns 101 to bits 1 to 3, etc. However, the
generated code still relies on the Mailman’s algorithm and bit masking to access the individual bits.
The Mailman’s algorithm allows us to manipulate bits in a bitmap directly without defining
complex C structures.
(2) In an EXT2 file system, each file has a unique INODE structure. On the file system disk, inodes
begin in the inode_table block. Each disk block contains
INODES_PER_BLOCK = BLOCK_SIZE/sizeof(INODE)
inodes. Each inode has a unique inode number, ino ¼ 1, 2, ....., counted linearly from 1. Given an
ino, e.g. 1234, determine which disk block contains the inode and which inode is it in that block?
We need to know the disk block number because read/write a real disk is by blocks. Solution:
Similarly, converting double and triple indirect logical block numbers to physical block numbers in
an EXT2 file system also depends on the Mailman’s algorithm.
(3) Convert linear disk block number to CHS ¼ (cylinder, head, sector) format: Floppy disk and old
hard disk use CHS addressing but file systems always use linear block addressing. The algorithm
can be used to convert a disk block number to CHS when calling BIOS INT13.
In this section, we shall show how to access and display the contents of EXT2 file systems by example
programs. In order to compile and run these programs, the system must have the ext2fs.h header file
installed, which defines the data structures of EXT2/3/4 file systems. Ubuntu Linux user may get and
install the ext2fs development package by
The following C program displays the superblock of an EXT2 file system. The basic technique is as
follows.
(2) Read the superblock (block #1 or 1 KB at offset 1024) into a char buf[1024].
char buf[1024];
lseek(fd, 1024, SEEK_SET); // seek to byte offset 1024
int n = read(fd, buf, 1024);
(3) Let a struct ext2_super_block *sp point to buf[ ]. Then use sp- > field to access the various fields
of the superblock structure.
The same technique is also applicable to any other data structures in a real or virtual disk.
Example 11.1 superblock.c program: display superblock information of an EXT2 file system.
SUPER *sp;
char buf[1024];
int fd, blksize, inodesize;
Exercise 11.1 Write a C program to display the group descriptor of an EXT2 file system on a device.
(See Problem 1).
The program in Example 11.2 displays the inodes bitmap (imap) in HEX.
Example 11.2 imap.c program: display inodes bitmap of an EXT2 file system.
fd = open(dev, O_RDONLY);
if (fd < 0){printf("open %s failed\n", device); exit(1);}
get_block(fd, 1, buf); // get superblock
sp = (SUPER *)buf;
// check magic number to ensure it’s an EXT2 FS
ninodes = sp->s_inodes_count; // get inodes_count
printf("ninodes = %d\n", ninodes);
get_block(fd, 2, buf); // get group descriptor
gp = (GD *)buf;
imapblk = gp->bg_inode_bitmap; // get imap block number
printf("imapblk = %d\n", imapblk);
get_block(fd, imapblk, buf); // get imap block into buf[ ]
for (i=0; i<=nidoes/8; i++){ // print each byte in HEX
printf(“%02x “, (u8)buf[i]);
}
printf("\n");
}
The program prints each byte of the inodes bitmap as 2 HEX digits. The outputs look like the
following.
In the imap, bits are stored linearly from low to high address. The first 16 bits (from low to high) are
b’11111111 11100000’, but they are printed as ff 07 in HEX, which is not very informative since the
bits in each byte are printed in reverse order, i.e. from high to low address.
Exercise 11.2 Modify the imap.c program to print the inodes bitmap in char map form, i.e. for each
bit, print a ‘0’ if the bit is 0, print a ‘1’ if the bit is 1. (see Problem 2).
The outputs of the modified program should look like Fig. 11.3, in which each char represents a bit
in the imap.
Exercise 11.3 Write a C program to display the blocks bitmap of an Ext2 file system, also in char
map form.
In an EXT2 file system, the number 2 (count from 1) inode is the inode of the root directory /. If we
read the root inode into memory, we should be able to display its various fields such as mode, uid, gid,
file size, time fields, hard links count and data block numbers, etc. The program in Example 11.3
displays the INDOE information of the root directory of an EXT2 file system.
Example 11.3 inode.c program: display root inode information of an EXT2 file system.
struct ext2_dir_entry_2 {
u32 inode; // inode number; count from 1, NOT 0
u16 rec_len; // this entry’s length in bytes
u8 name_len; // name length in bytes
u8 file_type; // not used
char name[EXT2_NAME_LEN]; // name: 1-255 chars, no ending NULL
};
where NAME is a sequence of name_len chars without a terminating NULL byte. Each dir_entry has a
record length rec_len. The rec_len of the last entry in a block covers the remaining block length,
i.e. from where the entry begins to the end of block. The following algorithm shows how to step
through the dir_entries in a directory data block.
Exercise 11.4 Write a C program to print the dir_entries of a directory. For simplicity, we may assume
a directory INODE has at most 12 direct blocks, i_block[0] to i_block[11]. This assumption is
reasonable. With 1 KB block size and an average file name length of 16 chars, a single disk block
can contain up to 1024/(8 + 16) ¼ 42 dir_entries. With 12 disk blocks, a directory can contain more
than 500 entries. We may safely assume that no user would put that many files in any directory. For an
empty EXT2 file system, the program output should look like Fig. 11.5. (see Problem 4).
Exercise 11.5 Mount mydisk under Linux. Create new directories and copy files to the mounted file
system, then umount it (See Problem 5). Run the dir.c program on mydisk again to see the outputs,
which should look like Fig. 11.6. The reader may verify that the name_len of each entry is the exact
number of chars in the name field, and every rec_len is a multiple of 4 (for alignment), which is
(8 + name_len) raised to the next multiple of 4, except the last entry, whose rec_len covers the
remaining block length.
function which searches for a dir_entry with a given name. Return its inode number if found, else
return 0. (see Problem 6).
Given an EXT2 file system and the pathname of a file, e.g. /a/b/c, the problem is how to find the file. To
find a file amounts to finding its inode. The algorithm is as follows.
(1) Read in the superblock. Check the magic number s_magic (0xEF53) to verify it’s indeed an
EXT2 FS.
(2) Read in the group descriptor block (1 + s_first_data_block) to access the group 0 descriptor. From
the group descriptor’s bg_inode_table entry, find the inodes begin block number, call it the
InodesBeginBlock.
(3) Read in InodeBeginBlock to get the inode of /, which is INODE #2.
316 11 EXT2 File System
(4) Tokenize the pathname into component strings and let the number of components be n. For
example, if pathname¼/a/b/c, the component strings are “a”, “b”, “c”, with n ¼ 3. Denote the
components by name[0], name[1], ..,name[n-1].
(5) Start from the root INODE in (3), search for name[0] in its data block(s). For simplicity, we may
assume that the number of entries in a DIR is small, so that a DIR inode only has 12 direct data
blocks. With this assumption, it suffices to search 12 (non-zero) direct blocks for name[0]. Each
data block of a DIR INODE contains dir_entry structures of the form
where NAME is a sequence of nlen chars without a terminating NULL. For each data block, read
the block into memory and use a dir_entry *dp to point at the loaded data block. Then use
name_len to extract NAME as a string and compare it with name[0]. If they do not match, step
to the next dir_entry by
and continue the search. If name[0] exists, we can find its dir_entry and hence its inode number.
(6) Use the inode number, ino, to locate the corresponding INODE. Recall that ino counts from 1. Use
Mailman’s algorithm to compute the disk block containing the INODE and its offset in that block.
Then read in the INODE of /a, from which we can determine whether it’s a DIR. If /a is not a
DIR, there can’t be /a/b, so the search fails. If it’s a DIR and there are more components to search,
continue for the next component name[1]. The problem now becomes: search for name[1] in the
INODE of /a, which is exactly the same as that of Step (5).
(7) Since Steps 5–6 will be repeated n times, it’s better to write a search function
If the search loop ends successfully, ip must point at the INODE of pathname. Traversing large
EXT2/3 file systems with many groups is similar. The reader may consult Chap. 3 of [Wang, 2015]
for details.
Given a device containing an EXT2 file system and a pathname, .e.g. /a/b/c/d, write a C function.
which returns an INODE pointer to the file’s inode or 0 if the file in inaccessible (see Problem 7).
As will be seen shortly, the path2inode() function is the most important function in a file system.
Write a C program, showblock, which prints all disk block (numbers) of a file (see Problem 8).
The objective of this section is to show how to implement a complete file system. First, we show the
overall organization of a file system, and the logical steps of the implementation. Next, we present a
base file system to help the reader get started. Then we organize the implementation steps as a series of
programming projects for the reader to complete. It is believed that such an experience would be
beneficial to any computer science student.
Figure 11.7 shows the internal organization of an EXT2 file system. The organization diagram is
explained by the labels (1) to (5).
(1) is the PROC structure of the current running process. As in a real system, every file operation is due
to the current executing process. Each PROC has a cwd, which points to the in-memory INODE of
the PROC’s Current Working Directory (CWD). It also has an array of file descriptors, fd[], which
point to opened file instances.
(2) is the root pointer of the file system. It points to the in-memory root INODE. When the system
starts, one of the devices is chosen as the root device, which must be a valid EXT2 file system. The
root INODE (inode #2) of the root device is loaded into memory as the root (/) of the file system.
This operation is known as “mount root file system”
(3) is an openTable entry. When a process opens a file, an entry of the PROC’s fd array points to an
openTable, which points to the in-memory INODE of the opened file.
(4) is an in-memory INODE. Whenever a file is needed, its INODE is loaded into a minode slot for
reference. Since INODEs are unique, only one copy of each INODE can be in memory at any time.
In the minode, (dev, ino) identify where the INODE came from, for writing the INODE back to
disk if modified. The refCount field records the number of processes that are using the minode.
318 11 EXT2 File System
The dirty field indicates whether the INODE has been modified. The mounted flag indicates
whether the INODE has been mounted on and, if so, the mntabPtr points to the mount table entry
of the mounted file system. The lock field is to ensure that an in-memory INODE can only be
accessed by one process at a time, e.g. when modifying the INODE or during a read/write operation.
(5) is a table of mounted file systems. For each mounted file system, an entry in the mount table is used
to record the mounted file system information, e.g. the mounted file system device number. In the
in-memory INODE of the mount point, the mounted flag is turned on and the mntabPtr points to
the mount table entry. In the mount table entry, mntPointPtr points back to the in-memory INODE
of the mount point. As will be shown later, these doubly-linked pointers allow us to cross mount
points when traversing the file system tree. In addition, a mount table entry may also contain other
information of the mounted file system, such as values from the superblock, group descriptor,
bitmaps and inodes start block, etc. for quick access. If any of these cached items are modified,
they must be written back to the device when the device is umounted.
Implementation of the file system is divided into three levels. Each level deals with a distinct part of the
file system. This makes the implementation process modular and easier to understand. In the file
system implementation, the FS directory contains files which implement the EXT2 file system. The
files are organized as follows.
11.7 Base File System 319
Level-1 implements the basic file system tree. It contains the following files, which implement the
indicated functions.
mkdir, creat, mknod, rmdir, link, unlink, symlink, rm, ls, cd and pwd, etc.
This file contains the data structure types of the EXT2 file system, such as superblock, group
descriptor, inode and directory entry structures. In addition, it also contains the open file table,
mount table, PROC structures and constants of the file system.
320 11 EXT2 File System
// PROC structure
typedef struct proc{
struct Proc *next;
int pid;
int uid;
int gid;
11.7 Base File System 321
int ppid;
int status;
struct minode *cwd;
OFT *fd[NFD];
}PROC;
(2). global.c file: This file contains global variables of the file system. Examples of global variables are
When the file system starts, we initializes all the global data structures and let running point at
PROC[0], which is process P0 of the superuser (uid ¼ 0). As in a real file system, every operation is
due to the current running process. We begin with a superuser process because it does not need any file
protection. File protection by permission checking will be enforced later in Level 3 of the FS
implementation.
int fs_init()
{
int i,j;
for (i=0; i<NMINODE; i++) // initialize all minodes as FREE
minode[i].refCount = 0;
for (i=0; i<NMTABLE; i++) // initialize mtables as FREE
mtable[i].dev = 0;
for (i=0; i<NOFT; i++) // initialize ofts as FREE
oft[i].refCount = 0;
for (i=0; i<NPROC; i++){ // initialize PROCs
proc[i].status = READY; // ready to run
proc[i].pid = i; // pid = 0 to NPROC-1
proc[i].uid = i; // P0 is a superuser process
for (j=0; j<NFD; j++)
proc[i].fd[j] = 0; // all file descriptors are NULL
proc[i].next = &proc[i+1]; // link list
}
proc[NPROC-1].next = &proc[0]; // circular list
running = &proc[0]; // P0 runs first
}
During file system operation, global data structures are regarded as system resources, which are
used and released dynamically. Each set of resources is managed by a pair of allocate and deallocate
functions. For example, mialloc() allocates a FREE minode for use, and midalloc() releases a used
minode. Other resource management functions are similar, which will be shown later when they are
actually needed.
(3). util.c file: This file contains commonly used utility functions of the file system. The most important
utility functions are read/write disk block functions, iget(), iput() and getino(), which are explained in
more detail.
(3).1. get_block/put_block functions: We assume that a block device, e.g. a real or virtual disk, can
only be read or written in unit of block size. For real disks, this is due to hardware constraints. For
virtual disks, we assume that read/write is also by block size, so that the code can be ported to real disks
if desired. For a virtual disk, we first open it for R|W mode and use the file descriptor as the device
number. The following functions read/write a virtual disk block into/from a buffer area in memory.
(3).2. iget(dev, ino) function: This function returns a pointer to the in-memory minode containing the
INODE of (dev, ino). The returned minode is unique, i.e. only one copy of the INODE exists in
memory. In a real file system, the returned minode is locked for exclusive use until it is either released
or unlocked. For simplicity, we shall assume that minode locking is unnecessary, which will be
explained later.
(3).3. The iput(INODE *mip) function: This function releases a used minode pointed by mip. Each
minode has a refCount, which represents the number of users that are using the minode. iput()
decrements the refCount by 1. If the refCount is non-zero, meaning that the minode still has other
users, the caller simply returns. If the caller is the last user of the minode (refCount ¼ 0), the INODE is
written back to disk if it is modified (dirty).
if (mip==0) return;
mip->refCount--; // dec refCount by 1
if (mip->refCount > 0) return; // still has user
if (mip->dirty == 0) return; // no need to write back
(3).4. getino() function: The getino() function implements the file system tree traversal algorithm. It
returns the INODE number (ino) of a specified pathname. To begin with, we assume that in the level-1
file system implementation, the file system resides on a single root device, so that there are no mounted
devices and mounting point crossings. Mounted file systems and mounting point crossing will be
11.7 Base File System 325
considered later in level-3 of the file system implementation. Thus, the getino() function essentially
returns the (dev, ino) of a pathname. The function first uses the tokenize() function to break up
pathname into component strings. We assume that the tokenized strings are in a global data area, each
pointed by a name[i] pointer and the number of token strings is nname. Then it calls the search()
function to search for the token strings in successive directories. The following shows the tokenize()
and search() functions.
(3).5. Use of getino()/iget()/iput(): In a file system, almost every operation begins with a pathname,
e.g. mkdir pathname, cat pathname, etc. Whenever a pathname is specified, its inode must be loaded
into memory for reference. The general pattern of using an inode is
. ino = getino(pathname);
. mip = iget(dev, ino);
. use mip->INODE, which may modify the INODE;
. iput(mip);
There are only a few exceptions to this usage pattern. For instance,
chdir : iget the new DIR minode but iput the old DIR minode.
open : iget the minode of a file, which is released when the file is closed.
mount: iget the minode of mountPoint, which is released later by umount.
11.7 Base File System 327
In general, iget and iput should appear in pairs, like a pair of matched parentheses. We may rely on
this usage pattern in the implementation code to ensure that every INODE is loaded and then released
properly.
(4). Minodes Locking: In a real file system, each minode has a lock field, which ensures that the
minode can only be accessed by one process at a time, e.g. when modifying the INODE. Unix kernel
uses a busy flag and sleep/wakeup to synchronize processes accessing the same minode. In other
systems, each minode may have a mutex lock or a semaphore lock. A process is allowed to access a
minode only if it holds the minode lock. The reason for minodes locking is as follows.
Assume that a process Pi needs the inode of (dev, ino), which is not in memory. Pi must load the
inode into a minode entry. The minode must be marked as (dev, ino) to prevent other processes from
loading the same inode again. While loading the inode from disk Pi may wait for I/O completion,
which switches to another process Pj. If Pj needs exactly the same inode, it would find the needed
minode already exists. Without the lock, Pj would proceed to use the minode before it is even loaded in
yet. With the lock, Pj must wait until the minode is loaded, used and then released by Pi. In addition,
when a process read/write an opened file, it must lock the file’s minode to ensure that each read/write
operation is atomic. For simplicity, we shall assume that only one process runs at a time, so that the
lock is unnecessary. However, the reader should beware that minodes locking is necessary in a real file
system.
Exercise 11.7 Design and implement a scheme, which uses the minode[ ] area as a cache memory for
in-memory INODES. Once an INODE is loaded into a minode slot, try to keep it in memory for as long
as possible even if no process is actively using it.
11.7.3 Mount-Root
(5). mount_root.c file: This file contains the mount_root() function, which is called during system
initialization to mount the root file system. It reads the superblock of the root device to verify the device
is a valid EXT2 file system.Then it loads the root INODE (ino ¼ 2) of the root device into a minode and
sets the root pointer to the root minode. It also sets the CWD of all PROCs to the root minode. A mount
table entry is allocated to record the mounted root file system. Some key information of the root device,
such as the number of inodes and blocks, the starting blocks of the bitmaps and inodes table, are also
recorded in the mount table for quick access.
// global variables
MINODE minode[NMINODES], *root;
MTABLE mtable[NMTABLE];
PROC proc[NPROC], *running;
int ninode, nblocks, bmap, imap, iblock;
int dev;
char gline[25], *name[16]; // tokenized component string strings
int nname; // number of component strings
char *rootdev = “mydisk”; // default root_device
328 11 EXT2 File System
int fs_init()
{
int i,j;
for (i=0; i<NMINODES; i++) // initialize all minodes as FREE
minode[i].refCount = 0;
for (i=0; i<NMOUNT; i++) // initialize mtable entries as FREE
mtable[i].dev = 0;
for (i=0; i<NPROC; i++){ // initialize PROCs
proc[i].status = READY; // reday to run
proc[i].pid = i; // pid = 0 to NPROC-1
proc[i].uid = i; // P0 is a superuser process
for (j=0; j<NFD; j++)
proc[i].fd[j] = 0; // all file descriptors are NULL
proc[i].next = &proc[i+1];
}
proc[NPROC-1].next = &proc[0]; // circular list
running = &proc[0]; // P0 runs first
}
get_block(dev, 2, buf);
gp = (GD *)buf;
bmap = mp->bmap = gp->bg_blocks_bitmap;
imap = mp->imap = gp->bg_inodes_bitmap;
iblock = mp->iblock = gp->bg_inode_table;
printf("bmap=%d imap=%d iblock=%d\n", bmap, imap iblock);
mip->refCount = 1;
iput(mip);
}
}
exit(0);
}
How to ls: ls [pathname] lists the information of either a directory or a file. In Chap. 5 (Sect. 5.4), we
showed an ls program, which works as follows.
(1) ls_dir(dirname): use opendir() and readdir() to get filenames in the directory. For each filename,
call ls_file(filename).
(2) ls_file(filename): stat the filename to get file information in a STAT structure. Then list the STAT
information.
Since the stat system call essentially returns the same information of a minode, we can modify the
original ls algorithm by using minodes directly. The following shows the modified ls algorithm
HOW TO pwd: The following shows the algorithm of pwd, which uses recursion on the directory
minodes.
pwd(MINODE *wd){
if (wd == root) print "/"
else rpwd(wd);
}
// pwd start:
pwd(running->cwd);
Exercise 11.8 Implement step (2) of the pwd algorithm as a utility function.
Exercise 11.9 Implement step (4) of the pwd algorithm as a utility function.
which returns the name string of a dir_entry identified by my_ino in the parent directory.
The programming task here is to complete the above mount_root.c program as the base file system.
Run it with a virtual disk containing an EXT2 file system. The outputs of the base file system should
look like Fig. 11.8. The figure only shows the results of the ls command. It should also support the cd
and pwd commands.
mkdir pathname
makes a new directory with pathname. The permission bits of the new directory are set to the default
value 0755 (owner can access and rw, others can access but can only READ).
mkdir creates an empty directory with a data block containing the default . and .. entries. The algorithm
of mkdir is
Most steps of the mkdir algorithm are self-explanatory. Only step (4) needs more explanation. We
illustrate step (4) in more detail by example code. In order to make a directory, we need to allocate an
inode from the inodes bitmap, and a disk block from the blocks bitmap, which rely on test and set bits
in the bitmaps. In order to maintain file system consistency, allocating an inode must decrement the
free inodes count in both superblock and group descriptor by 1. Similarly, allocating a disk block must
decrement the free blocks count in both superblock and group descriptor by 1. It is also worth noting
11.8 File System Level-1 Functions 333
that bits in the bitmaps count from 0 but inode and block numbers count from 1. The following code
segments show the detailed steps of (4).
Allocation of disk blocks is similar, except it uses the blocks bitmap and it decrements the free
blocks count in both superblock and group descriptor. This is left as an exercise.
(4).2. Create INODE: The following code segment creates an INODE¼(dev, ino) in a minode, and
writes the INODE to disk.
(4).3. Create data block for new DIR containing . and .. entries
char buf[BLKSIZE];
bzero(buf, BLKSIZE); // optional: clear buf[ ] to 0
DIR *dp = (DIR *)buf;
// make . entry
dp->inode = ino;
dp->rec_len = 12;
dp->name_len = 1;
dp->name[0] = ‘.’;
// make .. entry: pino=parent DIR ino, blk=allocated block
dp = (char *)dp + 12;
dp->inode = pino;
dp->rec_len = BLKSIZE-12; // rec_len spans block
dp->name_len = 2;
dp->name[0] = d->name[1] = ‘.’;
put_block(dev, blk, buf); // write to blk on diks
enters a [ino, name] as a new dir_entry into a parent directory. The enter_name algorithm consists of
several steps.
The following diagrams show the block contents before and after entering [ino, name] as a new
entry.
The following diagrams shows the new data block containing only one entry.
336 11 EXT2 File System
It is noted that the above creat algorithm differs from Unix/Linux in that it does not open the file for
WRITE mode and return a file descriptor. In practice, creat is rarely used as a stand-alone function. It is
used internally by the open() function, which may create a file, open it for WRITE and return a file
descriptor. The open operation will be implemented later in level-2 of the file system.
The programming task here is to implement mkdir-creat functions and add the corresponding mkdir
and creat commands to the base file system. Demonstrate the system by testing the mkdir and creat
functions. Figure 11.9 shows the sample outputs of the resulting file system. It shows the detailed steps
of executing the command mkdir dir1, as described in the mkdir Algorithm.
3. rmdir: The command
rmdir dirname
removes a directory. As in Unix/Linux, in order to remove a DIR, the directory must be empty, for the
following reasons. First, removing a non-empty directory implies the removal of all the files and
subdirectories in the directory. Although it is possible to implement a recursive rmdir operation, which
removes an entire directory tree, the basic operation is still remove one directory at a time. Second, a
non-empty directory may contain files that are actively in use, e.g. files opened for read/write, etc.
Removing such a directory is clearly unacceptable. Although it is possible to check whether there are
any active files in a directory, doing this would incur too much overhead. The simplest way out is to
require that a directory must be empty in order to be removed. The algorithm of rmdir is.
11.8 File System Level-1 Functions 337
In the rmdir algorithm, most steps are simple and self-explanatory. We only need to explain the
following steps in more detail.
1. How to test DIR empty: Every DIR’s links_count starts with 2 (for the. and .. entries). Each
subdirectory increments its links_count by 1 but regular files do not increment the links_count of
the parent directory. So, if a DIR’s links_count is greater than 2, it is definitely not empty. However,
if links_count ¼ 2, the DIR may still contain regular files. In that case, we must traverse the DIR’s
data blocks to count the number of dir_entries, which must be greater than 2. This can be done by
the same technique of stepping through dir_entris in the data blocks of a DIR INODE.
2. How to deallocate inode and data block: When removing a DIR, we must deallocate its inode and
data block (numbers). The function
idalloc(dev, ino)
deallocates an inode (number). It clears the ino’s bit in the device’s inodes bitmap to 0. Then it
increments the free inodes count in both superblock and group descriptor by 1.
Deallocating a disk block number is similar, except it uses the device’s blocks bitmap and it
increaments the free blocks count in both superblock and groupdescriptor. This is left as an exercise.
Exercise 11.11 Implement a bdalloc(int dev, int bno) function which deallocates a disk block
(number) bno.
removes the dir_entry of name from a parent directory minode pointed by pmip. The algorithm of
rm_child is as follows.
---------------------------------------------–
|ino rlen=BLKSIZE nlen NAME |
---------------------------------------------–
The following diagrams illustrate the block contents before and after
deleting such an entry.
The programming task here is to implement the rmdir function and add the rmdir command to the file
system. Compile and run the resulting program to demonstrate rmdir operation. Figure 11.10 shows the
sample outputs of the resulting file system. The figure shows the detailed steps of removing a directory.
creates a hard link from new_file to old_file. Hard links can only be applied to regular files, not to
DIRs, because linking to DIRs may create loops in the file system name space. Hard link files share the
same inode. Therefore, they must be on the same device. The algorithm of link is as follows.
The following diagrams show the outcome of the link operation. It adds an entry z to the data block
of /x/y. The inode number of z is the same ino of a/b/c, so they share the same INODE, whose
links_count is incremented by 1.
unlink filename
unlinks a file. It decrements the file’s links_count by 1 and deletes the file name from its parent DIR.
When a file’s links_count reaches 0, the file is truly removed by deallocating its data blocks and inode.
The algorithm of unlink is
creates a symbolic link from new_file to old_file. Unlike hard links, symlink can link to anything,
including DIRs, even files not on the same device. The algorithm of symlink is
reads the target file name of a symbolic file and returns the length of the target file name. The
algorithm of readlink() is.
Other level-1 functions include access, chmod, chown, change file’s time fields, etc. The operations of
all such functions are of the same pattern:
The programming project #1 is to complete the level-1 implementation of the file system and
demonstrate the file system. Figure 11.11 shows the sample outputs of running the Level 1 implemen-
tation of the file system. The ls command shows that hi is a symbolic link to the directory dir1.
Level-2 of the file system implements read and write operations of file contents. It consists of the
following functions: open, close, lseek, read, write, opendir and readdir.
1. Open Operation
In Unix/Linux, the system call
opens a file for read or write, where flags is one of O_RDONLY, O_WRONLY, O_RDWR, which
may be bitwise or-ed with O_CREAT, O_APPEND, O_TRUNC, etc. These symbolic constants are
defined in the fcntl.h file. On success, open() returns a file descriptor (number), which is used in
subsequent system calls, such as read(), write(), lseek() and close(), etc. For simplicity, we shall assume
the parameter flags ¼ 0|1|2|3 or RD|WR|RW|AP for READ|WRITE| RDWR|APEND, respectively.
The algorithm of open() is.
11.9 File System Level-2 Functions 345
(3). Search for the first FREE fd[index] entry with the lowest index in PROC;
fd[index] = &OFT; // fd entry points to the openTable entry;
Figure 11.12 shows the data structure created by open. In the figure, (1) is the PROC structure of the
process that calls open(). The returned file descriptor, fd, is the index of the fd[ ] array in the PROC
structure. The contents of fd[fd] points to an OFT, which points to the minode of the file. The OFT’s
refCount represents the number of processes that share the same instance of an opened file. When a
process opens a file, the refCount in the OFT is set to 1. When a process forks a child process, the child
process inherits all the opened file descriptors of the parent, which increments the refCount of every
shared OFT by 1. When a process closes a file descriptor, it decrements OFT.refCount by 1. When
OFT.refCount reaches 0, the file’s minode is released and the OFT is deallocated. The OFT’s offset is a
conceptual pointer to the current byte position in the file for read/write. It is initialized to 0 for RD|WR|
RW mode or to file size for APPEND mode.
11.9.2 lseek
sets the offset in the OFT of an opened file descriptor to the byte position either from the file beginning
(SEEK_SET) or relative to the current position (SEEK_CUR). For simplicity, we shall assume that
the new position is always from the file beginning. Once set, the next read/write begins from the
current offset position. The algorithm of lseek is trivial. It only needs to check the requested position
value is within the bounds of [0, fileSize-1]. We leave the implementation of lseek as an exercise.
Exercise 11.12 Write C code for the lseek function int lseek(int fd, int position).
2. Close Operation
The close(int fd) operation closes a file descriptor. The algorithm of close is
reads nbytes from an opened file descriptor into a buffer area in user space. The read system call is
routed to the read function in the OS kernel. For regular files, the algorithm of read is.
The algorithm of read() can be best explained in terms of Fig. 11.13. Assume that fd is opened for
READ. The offset in the OFT points to the current byte position in the file from where we wish to read
nbytes. To the file system in kernel, a file is just a sequence of contiguous bytes, numbered from 0 to
fileSize-1. As Fig. 11.13 shows, the current byte position, offset, falls in a logical block.
bytes still available for read. These numbers are used in the read algorithm.
348 11 EXT2 File System
For small EXT2 files systems, the block size is 1 KB and files have at most double indirect blocks.
For read, the algorithm of converting logical blocks to physical blocks is.
Exercise 11.13 Complete the algorithm for converting double indirect logical blocks to physical
blocks. Hint: Mailman’s algorithm.
Exercise 11.14 For simplicity and clarity, step (5) of the read algorithm transfers data one byte at a
time and updates the control variable on each byte transfer, which are not very efficient. Optimize the
code by transferring a maximal sized chunk of data at a time.
writes nbytes from buf in user space to an opened file descriptor and returns the actual number of bytes
written. The write system call is routed to the write function in the OS kernel. For regular files, the
algorithm of write is
} // end while(remain)
(6). write_block(dev, blk, kbuf);
} // end while(nbytes)
(7). set minode dirty = 1; // mark minode dirty for iput() when fd is closed
return count;
The algorithm of write can be best explained in terms of Fig. 11.14. In the figure, the offset in the
OFT is the current byte position in the file to write, which falls in a logical block. It uses offset to
compute the logical block number, lbk, the start byte position and the number of bytes remaining in the
logical block. It then converts the logical block to physical block through the file’s INODE.i_block
array.Then it reads the physical block into a kbuf and transfer data from buf into kbuf, which may
increase the file size if offset exceeds the current file size. After transferring data to kbuf, it writes the
buffer back to disk. The write algorithm stops when all nbytes are written. The return value is nbtyes
unless write fails.
The algorithm of converting logical block to physical block for write is similar to that of read,
except for the following difference. During write, the intended data block may not exist. If a direct
block does not exist, it must be allocated and recorded in the INODE. If the indirect block i_block
[12] does not exist, it must be allocated and initialized to 0. If an indirect data block does not exist, it
must be allocated and recorded in the indirect block. Simliarly, if the double indirect block i_block
[13] dose not exist, it must be allocated and initialized to 0. If an entry in the double indirect block does
not exist, it must be allocated and initialized to zero. If a double indirect block does not exist, it must be
allocated and recorded in a double indirect block entry, etc.
Exercise 11.15 In the write algorithm, a file opened for WRITE mode may start with no data blocks at
all. Complete the algorithm for coverting logical block numbers to physical blocks, including direct,
indirect and double indirect blocks.
Exercise 11.16 For simplicity and clarity, step (5) of the write algorithm transfers data one byte at a
time and updates the control variable on each byte transfer. Optimize the code by transferring a
maximal sized chunk of data at a time.
350 11 EXT2 File System
Exercise 11.17 With open() and read(), implement the command cat filename, which displays the
contents of a file.
Exercise 11.18 With open(), read() and write(), implement the command cp src dest, which copies a
src file src to a dest file.
11.9.6 Opendir-Readdir
Unix considers everything as a file. Therefore, we should be able to open a DIR for read just like a
regular file. From a technical point of view, there is no need for a separate set of opendir() and readdir()
functions. However, different Unix systems may have different file systems. It may be difficult for
users to interpret the contents of a DIR file. For this reason, POSIX specifies opendir and readdir
operations, which are independent of file systems. Support for opendir is trivial; it’s the same open
system call, but readdir() has the form.
which returns a pointer to a dirent structure on each call. This can be implemented in user space as a
library I/O function. Instead, we shall implement opendir() and readir() directly.
int opendir(pathaname)
{ return open(pathname, RD|O_DIR); }
where O_DIR is a bit pattern for opening the file as a DIR. In the open file table, the mode field
contains the O_DIR bit.
The programming project #2 is to complete the implementation of File System Level-2 and demon-
strate read, write, cat, cp and mv operations in the file system. The reader may consult earlier Chapters
for cat and cp operations. For the mv operation, see Problem 12. Figure 11.15 show the sample outputs
of running the complete Level 2 file system.
11.10 File System Level-3 351
File system level-3 supports mount, umount of file systems and file protection.
Mount Operation
The command
mounts a file system to a mount_point directory. It allows the file system to include other file
systems as parts of the existing file system. The data structures used in mount are the MOUNT table
and the in-memory minode of the mount_point directory. The algorithm of mount is
Umount Operation:
The operation umount filesys
un-mounts a mounted file system. It detaches a mounted file system from the mounting point, where
filesys may be a virtual diak name or a mounting point directory name. The algorithm of umount is
To check whether a mounted file system contains any files that are actively in use, search the
in-memory minodes for any entry whose dev matches the dev number of the mounted file system.
While it is easy to implement mount and umount, there are implications. With mount, we must modify
the getino(pathname) function to support crossing mount points. Assume that a file system, newfs, has
been mounted on the directory /a/b/c/. When traversing a pathname, mount point crossing may occur in
both directions.
(1) Downward traversal: When traversing the pathname /a/b/c/x, once we reach the minode of /a/b/c,
we should see that the minode has been mounted on (mounted flag¼1). Instead of searching for x
in the INODE of /a/b/c, we must
• Follow the minode’s mntPtr pointer to locate the mount table entry.
• From the mount table’s dev number, get its root (ino¼2) INODE into memory.
• Then continue search for x under the root INODE of the mounted device.
(2) Upward traversal: Assume that we are at the directory /a/b/c/x/ and traversing upward, e.g. cd ../
../, which will cross the mount point /a/b/c. When we reach the root INODE of the mounted file
system, we should see that it is a root directory (ino¼2) but its dev number differs from that of the
11.10 File System Level-3 353
real root, so it is not the real root yet. Using its dev number, we can locate its mount table entry,
which points to the mounted minode of /a/b/c/. Then, we switch to the minode of /a/b/c/ and
continue the upward traversal. Thus, crossing mount point is like a monkey or squirrel hoping from
one tree to another and then back.
Since crossing mounting points changes the device number, a single global device number becomes
inadequate.We must modify the getino() function as
In the modified getino() function, change *dev whenever the pathname crosses a mounting point.
Thus, the modified getino() function essentially returns (dev, ino) of a pathname with dev being the
final device number.
In Unix/Linux, file protection is by permission bits in the file’s INODE. Each file’s INODE has an
i_mode field, in which the low 9 bits are permissions. The 9 permission bits are
The first 3 bits apply to the owner of the file, the second 3 bits apply to users in the same group as the
owner, and the last 3 bits apply to all others. For directories, the x bit indicates whether a process is
allowed to go into the directory. Each process has a uid and a gid. When a process tries to access a file,
the file system checks the process uid and gid against the file’s permission bits to determine whether it
is allowed to access the file with the intended mode of operation. If the process does not have the right
permission, the access will be denied. For the sake of simplicity, we may ignore process gid and use
only the process uid to check access permission.
In Unix/Linux, each process has a real uid and an effective uid. The file system checks the access
rights of a process by its effective uid. Under normal conditions, the effective uid and real uid of a
process are identical. When a process executes a setuid program, which has the setuid bit in the file’s
354 11 EXT2 File System
i_mode field turned on, the process’ effective uid becomes the uid of the program. While executing a
setuid program, the process effectively becomes the owner of the program file. For example, when a
process executes the mail program, which is a setuid program owned by the superuser, it can write to
the mail file of another user. When a process finishes executing a setuid program, it reverts back to the
real uid. For simplicity, we shall ignore effective uid also.
File locking is a mechanism which allows a process to set locks on a file, or parts of a file to prevent
race conditions when updating files. File locks can be either shared, which allows concurrent reads, or
exclusive, which enforces exclusive write. File locks can also be mandatory or advisory. For example,
Linux supports both shared and exclusive files locks but file locking is only advisory. In Linux, file
locks can be set by the fcntl() system call and manipulated by the flock() system call. For simplicity, we
shall assume only a very simple kind of file locking. When a process tries to open a file, the intended
mode of operation is checked for compatibility. The only compatible modes are READs. If a file is
already opened for updating mode, i.e. WR|RW|APPEND, it cannot be opened again. However, this
does not prevent related processes, e.g. a parent process and a child process, from modifying the same
file that’s opened by the parent, but this is also true in Unix/Linux. In that case, the file system can only
guarantee each write operation is atomic but not the writing order of the processes, which depends on
process scheduling.
Programming Project #3 is to complete the file system implementation by including functions in the
Level-3 of the file system. Sample solution of the file system is available for download at the book’s
website. Source code of the complete file system is available for instructors upon request.
The simple EXT2 file system uses 1 KB block size and has only 1 disk block group. It can be extended
easily as follows.
(1) Multiple groups: The size of group descriptor is 32 bytes. With 1 KB block size, a block may
contain 1024/32 ¼ 32 group descriptors. With 32 groups, the file system size can be extended to
32*8 ¼ 256 MB.
(2) 4 KB block size: With 4 KB block size and only one group, the file system size would be
4*8 ¼ 32 MB. With one group descriptor block, the file system may have 128 groups, which
extend the file system size to 128*32 ¼ 4GB. With 2 group descriptor blocks, the file system size
would be 8GB, etc. Most of the extensions are straightforward, which are suitable topics for
programming projects.
(3) Pipe files: It’s possible to implement pipes as regular files, which obey the read/write protocol of
pipes. The advantages of this scheme are: it unifies pipes and file inodes and it allows named pipes,
which can be used by unrelated processes. In order to support fast read/write operations, pipe
contents should be in memory, such as a RAMdisk. If desired, the reader may implement named
pipes as FIFO files.
11.12 Summary 355
(4) I/O Buffering: In the Programming Project, every disk block is read in or written out directly. This
would incur too much physical disk I/O operations. For better efficiency, a real file system usually
uses a set of I/O buffers as a cache memory for disk blocks. File system I/O buffering is covered in
Chap. 12, but it may be incorporated into the file system project.
11.12 Summary
This chapter covers EXT2 file system. The goal of this chapter is to lead the reader to implement a
complete EXT2 file system that is totally Linux compatible. The premise is that if the reader
understands one file system well, it should be easy to adapt to any other file systems. It first describes
the historic role of EXT2 file system in Linux and the current status of EXT3/EXT4 file systems. It uses
programming examples to show the various EXT2 data structures and how to traverse the EXT2 files
system tree. Then it shows how to implement an EXT2 file system which supports all file operations as
in the Linux kernel. It shows how to build a base file system by mount_root from a virtual disk. Then it
divides the file system implementation into 3 levels. Level-1 expands the base file system to implement
the file system tree. Level-2 implements read/write operations of file contents. Level-3 implements
mount/umount of file systems and file protection. In each level, it describes the algorithms of the file
system functions and demonstrates their implementations by programming examples. Each level is
cumulated by a programming project. The final project is to integrate all the programming examples
and exercises into a fully functional file system. The programming project has been used as the class
project in the CS360, Systems Programming, course at EECS, WSU for many years, and it has proved
to be very effective and successful.
Problems
1. Write a C program to display the group descriptor of an EXT2 file system on a device.
2. Modify the imap.c program to print the inodes bitmap in char map form, i.e. for each bit, print a ‘0’
if the bit is 0, print a ‘1’ if the bit is 1.
3. Write a C program to display the blocks bitmap of an Ext2 file system, also in char map form.
4. Write a C program to print the dir_entries of a directory.
5. A virtual disk can be mounted under Linux as a loop device, as in
Run the dir.c program on mydisk again to verify the newly created dirs and files still exist.
6. Given an INODE pointer to a DIRectory inode. Write a
function which searches for a dir_entry with a given name. Return its inode number if found, else
return 0.
7. Given a device containing an EXT2 file system and a pathname, .e.g. /a/b/c/d, write a C function
which returns an INODE pointer to the file’s inode, or 0 if the file in inaccessible.
356 11 EXT2 File System
8. Assume: a small EXT2 file system with 1 KB block size and only one blocks group.
1. Prove that a file may only have some double-indirect blocks but no triple-indirect blocks.
2. Given the pathname of a file, e.g. /a/b/c/d, write a C program, showblock, which prints all disk
block numbers of the file. Hint: use Mailman’s algorithm for double indirect blocks, if any.
9. In the Programming Project #1, an in-memory minode is released as FREE when its refCount
reaches 0. Design and implement a scheme, which uses the minode[ ] area as a cache memory for
in-memory INODES. Once an INODE is loaded into a minode slot, try to keep it in memory for as
long as possible even if no process is actively using it. Whenever a process finds a needed INODE
in memory, count it as a cache hit. Otherwise, it’s a cache miss. Modify the iget(dev, ino) function
to count the number of cache hits.
10. Implement step (2) of the pwd algorithm as a utility function
which returns the name string of a dir_entry identified by my_ino in the parent directory.
12. Implement the mv file1 file2 operation, which moves file1 to file2. For files on the same device,
mv should just rename file1 as file2 without copying file contents. Hints: link-unlink. How to
implement mv if the files are on different devices?
References
Card, R., Theodore Ts’o, T., Stephen Tweedie, S., “Design and Implementation of the Second Extended Filesystem”,
web.mit.edu/tytso/www/linux/ext2intro.html, 1995
Cao, M., Bhattacharya, S, Tso, T., "Ext4: The Next Generation of Ext2/3 File system", IBM Linux Technology Center,
2007.
EXT2: http://www.nongnu.org/ext2-doc/ext2.html, 2001
EXT3: http://jamesthornton.com/hotlist/linux-filesystems/ext3-journal, 2015