OS Notes

Download as doc, pdf, or txt
Download as doc, pdf, or txt
You are on page 1of 86

MCA-I (Sem-II) Operating System Syllabus

2.1 Operating System 1. Introduction : Evaluation of operating systems, types of operating systems, different views of operating system concepts and structure. [3] 2. Processes : The process concept system programmers view of processes. The operating system view of processes. Operating system service for process management, schedule algorithms, performance evaluation. [4] 3. Memory Management: Memory management without swapping of aging, actual memory base of replacement algorithms design issues of paging of paging segmentation. [4] 4. Inter process communication and synchronization: The need for inter process synchronization, mutual exclusion, Samphores, hardware support for mutual exclusion queuing implementation of semaphores, classical problems in concurrent programming, critical region and conditional critical region, monitor messages, deadlocks. [6] 5. File systems: File systems, directories file systems implementation, security protection mechanisms. [4] 6. Input/output: Principles of I/O hardware, I/O devices, drive controllers, direct memory access, principles of I/O software goals, interrupt handlers, device drivers, device independent I/O software. User space I/O software. [6] 7. Disk: Disk hardware, scheduling algorithms, Error handling, track-at-a-time caching, RAM disks, clocks, clocks hardware, clock software. Terminal: Terminal hardware, memory terminals, I/O software, processes and processors in distributed systems, Threads, system models, processors allocations, scheduling. [6] 8. Distributed file system: Design implementations, trends. [5] 9. Performance: Management, monitoring and evaluation, introduction, important trends affecting performance issue, why performance monitoring and evaluation are needed, performance measures, evaluation techniques, bottleneck and saturation feedback loops, case studies, DOS, WINDOWS, UNIX and Linex operating systems. [7] Reference: 1. H.M. An Introduction to Operating Systems, Addition Wesley publishing Co.,1984 2. nkpvie M., Operating System concepts and Design McGraw Hill, 1990 person I.L. 3. Abranam Silbrschat, Operating System Concepts, Addition Wesley publishing Co., 1989. 4. Tenenbaum A.S., Modern Operating Systems

CHAPTER-2 PROCESS
Process is an instance of a computer program that is being executed. It contains the program code and its current activity. Depending on the operating system (OS), a process may be made up of multiple threads of execution that execute instructions concurrently. A computer program is a passive collection of instructions, a process is the actual execution of those instructions. Several processes may be associated with the same program; for example, opening up several instances of the same program often means more than one process is being executed. Multitasking is a method to allow multiple processes to share processors (CPUs) and other system resources. Each CPU executes a single task at a time. However, multitasking allows each processor to switch between tasks that are being executed without having to wait for each task to finish. Depending on the operating system implementation, switches could be performed when tasks perform input/output operations, when a task indicates that it can be switched, or on hardware interrupts. computer system process consists of (or is said to 'own') the following resources:

An image of the executable machine code associated with a program. Memory (typically some region of virtual memory); which includes the executable code, process-specific data (input and output), a call stack (to keep track of active subroutines and/or other events), and a heap to hold intermediate computation data generated during run time. Operating system descriptors of resources that are allocated to the process, such as file descriptors (Unix terminology) or handles (Windows), and data sources and sinks. Security attributes, such as the process owner and the process' set of permissions (allowable operations). Processor state (context), such as the content of registers, physical memory addressing, etc. The state is typically stored in computer registers when the process is executing, and in memory otherwise

Inter-process communication When processes communicate with each other it is called "Inter-process communication" (IPC). Processes frequently need to communicate, for instance in a shell pipeline, the output of the first process need to pass to the second one, and so on to the other process.It is preferred in a wellstructured way not using interrupts.

Process states

The various process states, displayed in a state diagram, with arrows indicating possible transitions between states. An operating system kernel that allows multi-tasking needs processes to have certain states. Names for these states are not standardised, but they have similar functionality.[1]

First, the process is "created" - it is loaded from a secondary storage device (hard disk or CD-ROM...) into main memory. After that the process scheduler assigns it the state "waiting". While the process is "waiting" it waits for the scheduler to do a so-called context switch and load the process into the processor. The process state then becomes "running", and the processor executes the process instructions. If a process needs to wait for a resource (wait for user input or file to open ...), it is assigned the "blocked" state. The process state is changed back to "waiting" when the process no longer needs to wait.

Once the process finishes execution, or is terminated by the operating system, it is no longer needed. The process is removed instantly or is moved to the "terminated" state. When removed, it just waits to be removed from main memory.[1][4]

Process Management: Process creation Operating systems need some ways to create processes. In a very simple system designed for running only a single application (e.g., the controller in a microwave oven), it may be possible to have all the processes that will ever be needed be present when the system comes up. In generalpurpose systems, however, some way is needed to create and terminate processes as needed during operation. There are four principal events that cause a process to be created:

System initialization. Execution of process creation system call by running a process. A user request to create a new process. Initiation of a batch job.

When an operating system is booted, typically several processes are created. Some of these are foreground processes, that interacts with a (human) user and perform work for them. Other are background processes, which are not associated with particular users, but instead have some specific function. For example, one background process may be designed to accept incoming emails, sleeping most of the day but suddenly springing to life when an incoming e-mail arrives. Another background process may be designed to accept an incoming request for web pages hosted on the machine, waking up when a request arrives to service that request. Process creation in UNIX and Linux are done through fork() or clone() system calls. There are several steps involved in process creation. The first step is the validation of whether the parent process has sufficient authorization to create a process. Upon successful validation, the parent process is copied almost entirely, with changes only to the unique process id, parent process, and user-space. Each new process gets its own user space.[1] Process termination There are many reasons for process termination:

Batch job issues halt instruction User logs off Process executes a service request to terminate Error and fault conditions Normal completion Time limit exceeded Memory unavailable

Bounds violation; for example: attempted access of (non-existent) 11th element of a 10element array Protection error; for example: attempted write to read-only file Arithmetic error; for example: attempted division by zero Time overrun; for example: process waited longer than a specified maximum for an event I/O failure Invalid instruction; for example: when a process tries to execute data (text) Privileged instruction Data misuse Operating system intervention; for example: to resolve a deadlock Parent terminates so child processes terminate (cascading termination) Parent request

Two-state process management model The operating systems principal responsibility is in controlling the execution of processes. This includes determining the interleaving pattern for execution and allocation of resources to processes. One part of designing an OS is to describe the behaviour that we would like each process to exhibit. The simplest model is based on the fact that a process is either being executed by a processor or it is not. Thus, a process may be considered to be in one of two states, RUNNING or NOT RUNNING. When the operating system creates a new process, that process is initially labeled as NOT RUNNING, and is placed into a queue in the system in the NOT RUNNING state. The process (or some portion of it) then exists in main memory, and it waits in the queue for an opportunity to be executed. After some period of time, the currently RUNNING process will be interrupted, and moved from the RUNNING state to the NOT RUNNING state, making the processor available for a different process. The dispatch portion of the OS will then select, from the queue of NOT RUNNING processes, one of the waiting processes to transfer to the processor. The chosen process is then relabeled from a NOT RUNNING state to a RUNNING state, and its execution is either begun if it is a new process, or is resumed if it is a process which was interrupted at an earlier time. From this model we can identify some design elements of the OS:

The need to represent, and keep track of each process. The state of a process. The queuing of NON RUNNING processes

Three-state process management model Although the two-state process management model is a perfectly valid design for an operating system, the absence of a BLOCKED state means that the processor lies idle when the active process changes from CPU cycles to I/O cycles. This design does not make efficient use of the processor. The three-state process management model is designed to overcome this problem, by introducing a new state called the BLOCKED state. This state describes any process which is waiting for an I/O event to take place. In this case, an I/O event can mean the use of some device or a signal from another process. The three states in this model are:

RUNNING: The process that is currently being executed. READY: A process that is queuing and prepared to execute when given the opportunity. BLOCKED: A process that cannot execute until some event occurs, such as the completion of an I/O operation.

At any instant, a process is in one and only one of the three states. For a single processor computer, only one process can be in the RUNNING state at any one instant. There can be many processes in the READY and BLOCKED states, and each of these states will have an associated queue for processes. Processes entering the system must go initially into the READY state, processes can only enter the RUNNING state via the READY state. Processes normally leave the system from the RUNNING state. For each of the three states, the process occupies space in main memory. While the reason for most transitions from one state to another might be obvious, some may not be so clear.

RUNNING READY The most common reason for this transition is that the running process has reached the maximum allowable time for uninterrupted execution; i.e. timeout occurs. Other reasons can be the imposition of priority levels as determined by the scheduling policy used for the Low Level Scheduler, and the arrival of a higher priority process into the READY state. RUNNING BLOCKED A process is put into the BLOCKED state if it requests something for which it must wait. A request to the OS is usually in the form of a system call, (i.e. a call from the running process to a function that is part of the OS code). For example, requesting a file from disk or a saving a section of code or data from memory to a file on disk.

Five-state process management model While the three state model is sufficient to describe the behavior of processes with the given events, we have to extend the model to allow for other possible events, and for more sophisticated design. In particular, the use of a portion of the hard disk to emulate main memory (so called virtual memory) requires additional states to describe the state of processes which are suspended from main memory, and placed in virtual memory (on disk). Of course, such processes can, at a future time, be resumed by being transferred back into main memory. The Medium Level Scheduler controls these events. A process can be suspended from the RUNNING, READY or BLOCKED state, giving rise to two other states, namely, READY SUSPEND and BLOCKED SUSPEND. A RUNNING process that is suspended becomes READY SUSPEND, and a BLOCKED process that is suspended becomes BLOCKED SUSPEND. A process can be suspended for a number of reasons; the most significant of which arises from the process being swapped out of memory by the memory management system in order to free memory for other processes. Other common reasons for a process being suspended are when one suspends execution while debugging a program, or when the system is monitoring processes. For the five-state process management model, consider the following transitions described in the next sections.

BLOCKED BLOCKED SUSPEND If a process in the RUNNING state requires more memory, then at least one BLOCKED process can be swapped out of memory onto disk. The transition can also be made for the BLOCKED process if there are READY processes available, and the OS determines that the READY process that it would like to dispatch requires more main memory to maintain adequate performance. BLOCKED SUSPEND READY SUSPEND A process in the BLOCKED SUSPEND state is moved to the READY SUSPEND state when the event for which it has been waiting occurs. Note that this requires that the state information concerning suspended processes be accessible to the OS. READY SUSPEND READY When there are no READY processes in main memory, the OS will need to bring one in to continue execution. In addition, it might be the case that a process in the READY SUSPEND state has higher priority than any of the processes in the READY state. In that case, the OS designer may dictate that it is more important to get at the higher priority process than to minimise swapping. READY READY SUSPEND Normally, the OS would be designed so that the preference would be to suspend a BLOCKED process rather than a READY one. This is because the READY process can be executed as soon as the CPU becomes available for it, whereas the BLOCKED process is taking up main memory space and cannot be executed since it is waiting on some other event to occur. However, it may be necessary to suspend a READY process if that is the only way to free a sufficiently large block of main memory. Finally, the OS may choose to suspend a lower-priority READY process rather than a higher-priority BLOCKED process if it believes that the BLOCKED process will be ready soon.

Process description and control Each process in the system is represented by a data structure called a Process Control Block (PCB), or Process Descriptor in Linux, which performs the same function as a traveller's passport. The PCB contains the basic information about the job including:

What it is Where it is going How much of its processing has been completed Where it is stored How much it has spent in using resources

Process Identification: Each process is uniquely identified by the users identification and a pointer connecting it to its descriptor. Process Status: This indicates the current status of the process; READY, RUNNING, BLOCKED, READY SUSPEND, BLOCKED SUSPEND. Process State: This contains all of the information needed to indicate the current state of the job.

Accounting: This contains information used mainly for billing purposes and for performance measurement. It indicates what kind of resources the process has used and for how long. Processor modes Contemporary processors incorporate a mode bit to define the execution capability of a program in the processor. This bit can be set to kernel mode or user mode. Kernel mode is also commonly referred to as supervisor mode, monitor mode or ring 0. In kernel mode, the processor can execute every instruction in its hardware repertoire, whereas in user mode, it can only execute a subset of the instructions. Instructions that can be executed only in kernel mode are called kernel, privileged or protected instructions to distinguish them from the user mode instructions. For example, I/O instructions are privileged. So, if an application program executes in user mode, it cannot perform its own I/O. Instead, it must request the OS to perform I/O on its behalf. The system may logically extend the mode bit to define areas of memory to be used when the processor is in kernel mode versus user mode. If the mode bit is set to kernel mode, the process executing in the processor can access either the kernel or user partition of the memory. However, if user mode is set, the process can reference only the user memory space. We frequently refer to two classes of memory user space and system space (or kernel, supervisor or protected space). In general, the mode bit extends the operating system's protection rights. The mode bit is set by the user mode trap instruction, also called a supervisor call instruction. This instruction sets the mode bit, and branches to a fixed location in the system space. Since only system code is loaded in the system space, only system code can be invoked via a trap. When the OS has completed the supervisor call, it resets the mode bit to user mode prior to the return. The kernel concept The parts of the OS critical to its correct operation execute in kernel mode, while other software (such as generic system software) and all application programs execute in user mode. This fundamental distinction is usually the irrefutable distinction between the operating system and other system software. The part of the system executing in kernel supervisor state is called the kernel, or nucleus, of the operating system. The kernel operates as trusted software, meaning that when it was designed and implemented, it was intended to implement protection mechanisms that could not be covertly changed through the actions of untrusted software executing in user space. Extensions to the OS execute in user mode, so the OS does not rely on the correctness of those parts of the system software for correct operation of the OS. Hence, a fundamental design decision for any function to be incorporated into the OS is whether it needs to be implemented in the kernel. If it is implemented in the kernel, it will execute in kernel (supervisor) space, and have access to other parts of the kernel. It will also be trusted software by the other parts of the kernel. If the function is implemented to execute in user mode, it will have no access to kernel data structures. However, the advantage is that it will normally require very limited effort to invoke the function. While kernel-implemented functions may be easy to implement, the trap mechanism and authentication at the time of the call are usually relatively expensive. The kernel code runs fast, but there is a large performance overhead in the actual call. This is a subtle, but important point.

Requesting system services There are two techniques by which a program executing in user mode can request the kernel's services:

System call Message passing

Operating systems are designed with one or the other of these two facilities, but not both. First, assume that a user process wishes to invoke a particular target system function. For the system call approach, the user process uses the trap instruction. The idea is that the system call should appear to be an ordinary procedure call to the application program; the OS provides a library of user functions with names corresponding to each actual system call. Each of these stub functions contains a trap to the OS function. When the application program calls the stub, it executes the trap instruction, which switches the CPU to kernel mode, and then branches (indirectly through an OS table), to the entry point of the function which is to be invoked. When the function completes, it switches the processor to user mode and then returns control to the user process; thus simulating a normal procedure return. In the message passing approach, the user process constructs a message, that describes the desired service. Then it uses a trusted send function to pass the message to a trusted OS process. The send function serves the same purpose as the trap; that is, it carefully checks the message, switches the processor to kernel mode, and then delivers the message to a process that implements the target functions. Meanwhile, the user process waits for the result of the service request with a message receive operation. when the OS process completes the operation, it sends a message back to the user process. The distinction between the two approaches has important consequences regarding the relative independence of the OS behavior, from the application process behavior, and the resulting performance. As a rule of thumb, operating system based on a system call interface can be made more efficient than those requiring messages to be exchanged between distinct processes. This is the case, even though the system call must be implemented with a trap instruction; that is, even though the trap is relatively expensive to perform, it is more efficient than the message passing approach, where there are generally higher costs associated with process multiplexing, message formation and message copying. The system call approach has the interesting property that there is not necessarily any OS process. Instead, a process executing in user mode changes to kernel mode when it is executing kernel code, and switches back to user mode when it returns from the OS call. If, on the other hand, the OS is designed as a set of separate processes, it is usually easier to design it so that it gets control of the machine in special situations, than if the kernel is simply a collection of functions executed by users processes in kernel mode. Even procedurebased operating system usually find it necessary to include at least a few system processes (called daemons in UNIX) to handle situation whereby the machine is otherwise idle such as scheduling and handling the network. Scheduling:

Scheduling is a key concept in computer multitasking, multiprocessing operating system and real-time operating system designs. Scheduling refers to the way processes are assigned to run on the available CPUs, since there are typically many more processes running than there are available CPUs. This assignment is carried out by softwares known as a scheduler and dispatcher. The scheduler is concerned mainly with:

Throughput - number of processes that complete their execution per time unit. Latency, specifically: o Turnaround - total time between submission of a process and its completion. o Response time - amount of time it takes from when a request was submitted until the first response is produced. Fairness - Equal CPU time to each process (or more generally appropriate times according to each process' priority).

In practice, these goals often conflict (e.g. throughput versus latency), thus a scheduler will implement a suitable compromise. In real-time environments, such as mobile devices for automatic control in industry (for example robotics), the scheduler also must ensure that processes can meet deadlines; this is crucial for keeping the system stable. Scheduled tasks are sent to mobile devices and managed through an administrative back end. Types of operating system schedulers Operating systems may feature up to 3 distinct types of a long-term scheduler (also known as an admission scheduler or high-level scheduler), a mid-term or medium-term scheduler and a shortterm scheduler . The names suggest the relative frequency with which these functions are performed. Long-term scheduler The long-term, or admission, scheduler decides which jobs or processes are to be admitted to the ready queue; that is, when an attempt is made to execute a program, its admission to the set of currently executing processes is either authorized or delayed by the long-term scheduler. Thus, this scheduler dictates what processes are to run on a system, and the degree of concurrency to be supported at any one time - ie: whether a high or low amount of processes are to be executed concurrently, and how the split between IO intensive and CPU intensive processes is to be handled. In modern OS's, this is used to make sure that real time processes get enough CPU time to finish their tasks. Without proper real time scheduling, modern GUI interfaces would seem sluggish. [Stallings, 399]. Long-term scheduling is also important in large-scale systems such as batch processing systems, computer clusters, supercomputers and render farms. In these cases, special purpose job

scheduler software is typically used to assist these functions, in addition to any underlying admission scheduling support in the operating system. Mid-term scheduler The mid-term scheduler temporarily removes processes from main memory and places them on secondary memory (such as a disk drive) or vice versa. This is commonly referred to as "swapping out" or "swapping in" (also incorrectly as "paging out" or "paging in"). The mid-term scheduler may decide to swap out a process which has not been active for some time, or a process which has a low priority, or a process which is page faulting frequently, or a process which is taking up a large amount of memory in order to free up main memory for other processes, swapping the process back in later when more memory is available, or when the process has been unblocked and is no longer waiting for a resource. [Stallings, 396] [Stallings, 370] In many systems today (those that support mapping virtual address space to secondary storage other than the swap file), the mid-term scheduler may actually perform the role of the long-term scheduler, by treating binaries as "swapped out processes" upon their execution. In this way, when a segment of the binary is required it can be swapped in on demand, or "lazy loaded". [stallings, 394] Short-term scheduler The short-term scheduler (also known as the CPU scheduler) decides which of the ready, inmemory processes are to be executed (allocated a CPU) next following a clock interrupt, an IO interrupt, an operating system call or another form of signal. Thus the short-term scheduler makes scheduling decisions much more frequently than the long-term or mid-term schedulers - a scheduling decision will at a minimum have to be made after every time slice, and these are very short. This scheduler can be preemptive, implying that it is capable of forcibly removing processes from a CPU when it decides to allocate that CPU to another process, or nonpreemptive (also known as "voluntary" or "co-operative"), in which case the scheduler is unable to "force" processes off the CPU. [Stallings, 396]. In most cases short-term scheduler is written in assembler because it is critical part of operating system. Dispatcher Another component involved in the CPU-scheduling function is the dispatcher. The dispatcher is the module that gives control of the CPU to the process selected by the short-term scheduler. This function involves the following:

Switching context Switching to user mode Jumping to the proper location in the user program to restart that program

The dispatcher should be as fast as possible, since it is invoked during every process switch. The time it takes for the dispatcher to stop one process and start another running is known as the dispatch latency. [gravin, 155]. Scheduling disciplines

Scheduling disciplines are algorithms used for distributing resources among parties which simultaneously and asynchronously request them. Scheduling disciplines are used in routers (to handle packet traffic) as well as in operating systems (to share CPU time among both threads and processes), disk drives (I/O scheduling), printers (print spooler), most embedded systems, etc. The main purposes of scheduling algorithms are to minimize resource starvation and to ensure fairness amongst the parties utilizing the resources. Scheduling deals with the problem of deciding which of the outstanding requests is to be allocated resources. There are many different scheduling algorithms. In this section, we introduce several of them. First in first out Main article: First In First Out Also known as First Come, First Served (FCFS), is the simplest scheduling algorithm, FIFO simply queues processes in the order that they arrive in the ready queue.

Since context switches only occur upon process termination, and no reorganization of the process queue is required, scheduling overhead is minimal. Throughput can be low, since long processes can hog the CPU Turnaround time, waiting time and response time can be low for the same reasons above No prioritization occurs, thus this system has trouble meeting process deadlines. The lack of prioritization does permit every process to eventually complete, hence no starvation.

Shortest remaining time Main article: Shortest remaining time Also known as Shortest Job First (SJF). With this strategy the scheduler arranges processes with the least estimated processing time remaining to be next in the queue. This requires advance knowledge or estimations about the time required for a process to complete.

If a shorter process arrives during another process' execution, the currently running process may be interrupted, dividing that process into two separate computing blocks. This creates excess overhead through additional context switching. The scheduler must also place each incoming process into a specific place in the queue, creating additional overhead. This algorithm is designed for maximum throughput in most scenarios.

Waiting time and response time increase as the process' computational requirements increase. Since turnaround time is based on waiting time plus processing time, longer processes are significantly affected by this. Overall waiting time is smaller than FIFO, however since no process has to wait for the termination of the longest process. No particular attention is given to deadlines, the programmer can only attempt to make processes with deadlines as short as possible. Starvation is possible, especially in a busy system with many small processes being run.

Fixed priority pre-emptive scheduling The O/S assigns a fixed priority rank to every process, and the scheduler arranges the processes in the ready queue in order of their priority. Lower priority processes get interrupted by incoming higher priority processes.

Overhead is not minimal, nor is it significant. FPPS has no particular advantage in terms of throughput over FIFO scheduling. Waiting time and response time depend on the priority of the process. Higher priority processes have smaller waiting and response times. Deadlines can be met by giving processes with deadlines a higher priority. Starvation of lower priority processes is possible with large amounts of high priority processes queuing for CPU time.

Round-robin scheduling The scheduler assigns a fixed time unit per process, and cycles through them.

RR scheduling involves extensive overhead, especially with a small time unit. Balanced throughput between FCFS and SJF, shorter jobs are completed faster than in FCFS and longer processes are completed faster than in SJF. Fastest average response time, waiting time is dependent on number of processes, and not average process length. Because of high waiting times, deadlines are rarely met in a pure RR system. Starvation can never occur, since no priority is given. Order of time unit allocation is based upon process arrival time, similar to FCFS.

Multilevel queue scheduling This is used for situations in which processes are easily divided into different groups. For example, a common division is made between foreground (interactive) processes and background (batch) processes. These two types of processes have different response-time requirements and so may have different scheduling needs. Overview Scheduling algorithm CPU Utilization Throughput Turnaround time Response time

First In First Out Shortest Job First Priority based scheduling Round-robin scheduling

Low Medium Medium High

Low High Low Medium High

High Medium High Medium Medium

Low Medium High High Medium

Multilevel Queue scheduling High How to choose a scheduling algorithm

When designing an operating system, a programmer must consider which scheduling algorithm will perform best for the use the system is going to see. There is no universal best scheduling algorithm, and many operating systems use extended or combinations of the scheduling algorithms above. For example, Windows NT/XP/Vista uses a Multilevel feedback queue, a combination of fixed priority preemptive scheduling, round-robin, and first in first out. In this system, processes can dynamically increase or decrease in priority depending on if it has been serviced already, or if it has been waiting extensively. Every priority level is represented by its own queue, with round-robin scheduling amongst the high priority processes and FIFO among the lower ones. In this sense, response time is short for most processes, and short but critical system processes get completed very quickly. Since processes can only use one time unit of the round robin in the highest priority queue, starvation can be a problem for longer high priority processes. Operating system scheduler implementations Windows Very early MS-DOS and Microsoft Windows systems were non-multitasking, and as such did not feature a scheduler. Windows 3.1x used a non-preemptive scheduler, meaning that it did not interrupt programs. It relied on the program to end or tell the OS that it didn't need the processor so that it could move on to another process. This is usually called cooperative multitasking. Windows 95 introduced a rudimentary preemptive scheduler; however, for legacy support opted to let 16 bit applications run without preemption. Windows NT-based operating systems use a multilevel feedback queue. 32 priority levels are defined, 0 through to 31, with priorities 0 through 15 being "normal" priorities and priorities 16 through 31 being soft real-time priorities, requiring privileges to assign. 0 is reserved for the Operating System. Users can select 5 of these priorities to assign to a running application from the Task Manager application, or through thread management APIs. The kernel may change the priority level of a thread depending on its I/O and CPU usage and whether it is interactive (i.e. accepts and responds to input from humans), raising the priority of interactive and I/O bounded processes and lowering that of CPU bound processes, to increase the responsiveness of interactive applications. The scheduler was modified in Windows Vista to use the cycle counter register of modern processors to keep track of exactly how many CPU cycles a thread has

executed, rather than just using an interval-timer interrupt routine. Vista also uses a priority scheduler for the I/O queue so that disk defragmenters and other such programs don't interfere with foreground operations. Mac OS Mac OS 9 uses cooperative scheduling for threads, where one process controls multiple cooperative threads, and also provides preemptive scheduling for MP tasks. The kernel schedules MP tasks using a preemptive scheduling algorithm. All Process Manager processes run within a special MP task, called the "blue task". Those processes are scheduled cooperatively, using a round-robin scheduling algorithm; a process yields control of the processor to another process by explicitly calling a blocking function such as WaitNextEvent. Each process has its own copy of the Thread Manager that schedules that process's threads cooperatively; a thread yields control of the processor to another thread by calling YieldToAnyThread or YieldToThread. Mac OS X uses a multilevel feedback queue, with four priority bands for threads - normal, system high priority, kernel mode only, and real-time. Threads are scheduled preemptively; Mac OS X also supports cooperatively-scheduled threads in its implementation of the Thread Manager in Carbon AIX In AIX Version 4 there are three possible values for thread scheduling policy :

FIFO: Once a thread with this policy is scheduled, it runs to completion unless it is blocked, it voluntarily yields control of the CPU, or a higher-priority thread becomes dispatchable. Only fixed-priority threads can have a FIFO scheduling policy. RR: This is similar to the AIX Version 3 scheduler round-robin scheme based on 10ms time slices. When a RR thread has control at the end of the time slice, it moves to the tail of the queue of dispatchable threads of its priority. Only fixed-priority threads can have a RR scheduling policy. OTHER This policy is defined by POSIX1003.4a as implementation-defined. In AIX Version 4, this policy is defined to be equivalent to RR, except that it applies to threads with non-fixed priority. The recalculation of the running thread's priority value at each clock interrupt means that a thread may lose control because its priority value has risen above that of another dispatchable thread. This is the AIX Version 3 behavior.

Threads are primarily of interest for applications that currently consist of several asynchronous processes. These applications might impose a lighter load on the system if converted to a multithreaded structure. AIX 5 implements the following scheduling policies: FIFO, round robin, and a fair round robin. The FIFO policy has three different implementations: FIFO, FIFO2, and FIFO3. The round robin policy is named SCHED_RR in AIX, and the fair round robin is called SCHED_OTHER. This link provides additional information on AIX 5 scheduling: http://www.ibm.com/developerworks/aix/library/au-aix5_cpu/index.html#N100F6 .

GNU/Linux From version 2.5 of the kernel to version 2.6, Linux used a multilevel feedback queue with priority levels ranging from 0-140. 0-99 are reserved for real-time tasks and 100-140 are considered nice task levels. For real-time tasks, the time quantum for switching processes is approximately 200 ms, and for nice tasks approximately 10 ms. The scheduler will run through the queue of all ready processes, letting the highest priority processes go first and run through their time slices, after which they will be placed in an expired queue. When the active queue is empty the expired queue will become the active queue and vice versa. From versions 2.6 to 2.6.23, the kernel used an O(1) scheduler. In version 2.6.23, they replaced this method with the Completely Fair Scheduler that uses red-black trees instead of queues. FreeBSD FreeBSD uses a multilevel feedback queue with priorities ranging from 0-255. 0-63 are reserved for interrupts, 64-127 for the top half of the kernel, 128-159 for real-time user threads, 160-223 for time-shared user threads, and 224-255 for idle user threads. Also, like Linux, it uses the active queue setup, but it also has an idle queue.[8] NetBSD NetBSD uses a multilevel feedback queue with priorities ranging from 0-223. 0-63 are reserved for time-shared threads (default, SCHED_OTHER policy), 64-95 for user threads which entered kernel space, 96-128 for kernel threads, 128-191 for user real-time threads (SCHED_FIFO and SCHED_RR policies), and 192-223 for software interrupts. Solaris Solaris uses a multilevel feedback queue with priorities ranging from 0-169. 0-59 are reserved for time-shared threads, 60-99 for system threads, 100-159 for real-time threads, and 160-169 for low priority interrupts. Unlike Linux, when a process is done using its time quantum, it's given a new priority and put back in the queue. Performance Evaluation:

CHAPTER 3 MEMORY MANAGEMENT

Memory management is the act of managing computer memory. In its simpler forms, this involves providing ways to allocate portions of memory to programs at their request, and freeing it for reuse when no longer needed. The management of main memory is critical to the computer system. Virtual memory systems separate the memory addresses used by a process from actual physical addresses, allowing separation of processes and increasing the effectively available amount of RAM using disk swapping. The quality of the virtual memory manager can have a big impact on overall system performance. Garbage collection is the automated allocation and deallocation of computer memory resources for a program. This is generally implemented at the programming language level and is in opposition to manual memory management, the explicit allocation and deallocation of computer memory resources. Region-based memory management is an efficient variant of explicit memory management that can deallocate large groups of objects simultaneously. Requirements Memory management systems on multi-tasking operating systems usually deal with the following issues. Relocation In systems with virtual memory, programs in memory must be able to reside in different parts of the memory at different times. This is because when the program is swapped back into memory after being swapped out for a while it can not always be placed in the same location. The virtual memory management unit must also deal with concurrency. Memory management in the operating system should therefore be able to relocate programs in memory and handle memory references and addresses in the code of the program so that they always point to the right location in memory. Protection Processes should not be able to reference the memory for another process without permission. This is called memory protection, and prevents malicious or malfunctioning code in one program from interfering with the operation of other running programs.

Sharing Even though the memory for different processes is normally protected from each other, different processes sometimes need to be able to share information and therefore access the same part of memory. Shared memory is one of the fastest techniques for Inter-process communication. Logical organization Programs are often organized in modules. Some of these modules could be shared between different programs, some are read only and some contain data that can be modified. The memory management is responsible for handling this logical organization that is different from the physical linear address space. One way to arrange this organization is segmentation. Physical Organization Memory is usually divided into fast primary storage and slow secondary storage. Memory management in the operating system handles moving information between these two levels of memory. DOS memory managers In addition to standard memory management, the 640 KB barrier of MS-DOS and compatible systems led to the development of programs known as memory managers when PC main memories started to be routinely larger than 640 KB in the late 1980s (see conventional memory). These move portions of the operating system outside their normal locations in order to increase the amount of conventional or quasi-conventional memory available to other applications. Examples are EMM386, which was part of the standard installation in DOS's later versions, and QEMM. These allowed use of memory above the 640 KB barrier, where memory was normally reserved for RAMs, and high and upper memory. Memory management is a complex field of computer science and there are many techniques being developed to make it more efficient. This guide is designed to introduce you to some of the basic memory management issues that programmers face. Some platforms have specific problems in dealing with memory, which are not covered in depth here. In particular, if you are looking for information on configuring memory under DOS or Windows (EMS, XMS, HIMEM.SYS, and the like), then you won't find The Memory Management Reference very useful. This guide attempts to explain any terms it uses as it introduces them. In addition, there is a Glossary of memory management terms that gives fuller information; some terms are linked to the relevant entries. Memory management is usually divided into three areas: hardware, operating system, and application (described in more detail below), although the distinctions are a little fuzzy. In most computer systems, all three are present to some extent, forming layers between the user's

program and the actual memory hardware. The Memory Management Reference is mostly concerned with application memory management. Hardware memory management Memory management at the hardware level is concerned with the electronic devices that actually store data. This includes things like RAM and memory caches. Operating system memory management In the operating system, memory must be allocated to user programs, and reused by other programs when it is no longer required. The operating system can pretend that the computer has more memory than it actually does, and also that each program has the machine's memory to itself; both of these are features of virtual memory(1) systems. Application memory management Application memory management involves supplying the memory needed for a program's objects and data structures from the limited resources available, and recycling that memory for reuse when it is no longer required. Because application programs cannot in general predict in advance how much memory they are going to require, they need additional code to handle their changing memory requirements. Application memory management combines two related tasks: Allocation When the program requests a block of memory, the memory manager must allocate that block out of the larger blocks it has received from the operating system. The part of the memory manager that does this is known as the allocator. There are many ways to perform allocation, a few of which are discussed in Allocation techniques. Recycling When memory blocks have been allocated, but the data they contain is no longer required by the program, then the blocks can be recycled for reuse. There are two approaches to recycling memory: either the programmer must decide when memory can be reused (known as manual memory management); or the memory manager must be able to work it out (known as automatic memory management). These are both described in more detail below. An application memory manager must usually work to several constraints, such as: CPU overhead

The additional time taken by the memory manager while the program is running; Interactive pause times How much delay an interactive user observes; Memory overhead How much space is wasted for administration, rounding (known as internal fragmentation), and poor layout (known as external fragmentation). Some of the common problems encountered in application memory management are considered in the next section. Memory management problems The basic problem in managing memory is knowing when to keep the data it contains, and when to throw it away so that the memory can be reused. This sounds easy, but is, in fact, such a hard problem that it is an entire field of study in its own right. In an ideal world, most programmers wouldn't have to worry about memory management issues. Unfortunately, there are many ways in which poor memory management practice can affect the robustness and speed of programs, both in manual and in automatic memory management. Typical problems include: Premature free or dangling pointer Many programs give up memory, but attempt to access it later and crash or behave randomly. This condition is known as premature free, and the surviving reference to the memory is known as a dangling pointer. This is usually confined to manual memory management. Memory leak Some programs continually allocate memory without ever giving it up and eventually run out of memory. This condition is known as a memory leak. External fragmentation A poor allocator can do its job of giving out and receiving blocks of memory so badly that it can no longer give out big enough blocks despite having enough spare memory. This is because the free memory can become split into many small blocks, separated by blocks still in use. This condition is known as external fragmentation. Poor locality of reference

Another problem with the layout of allocated blocks comes from the way that modern hardware and operating system memory managers handle memory: successive memory accesses are faster if they are to nearby memory locations. If the memory manager places far apart the blocks a program will use together, then this will cause performance problems. This condition is known as poor locality of reference. Inflexible design Memory managers can also cause severe performance problems if they have been designed with one use in mind, but are used in a different way. These problems occur because any memory management solution tends to make assumptions about the way in which the program is going to use memory, such as typical block sizes, reference patterns, or lifetimes of objects. If these assumptions are wrong, then the memory manager may spend a lot more time doing bookkeeping work to keep up with what's happening. Interface complexity If objects are passed between modules, then the interface design must consider the management of their memory. A well-designed memory manager can make it easier to write debugging tools, because much of the code can be shared. Such tools could display objects, navigate links, validate objects, or detect abnormal accumulations of certain object types or block sizes. Manual memory management Manual memory management is where the programmer has direct control over when memory may be recycled. Usually this is either by explicit calls to heap management functions (for example, malloc/free in C), or by language constructs that affect the stack (such as local variables). The key feature of a manual memory manager is that it provides a way for the program to say, "Have this memory back; I've finished with it"; the memory manager does not recycle any memory without such an instruction. The advantages of manual memory management are:

It can be easier for the programmer to understand exactly what is going on; Some manual memory managers perform better when there is a shortage of memory.

The disadvantages of manual memory management are:


The programmer must write a lot of code to do repetitive bookkeeping of memory; Memory management must form a significant part of any module interface; Manual memory management typically requires more memory overhead per object; Memory management bugs are common.

It is very common for programmers, faced with an inefficient or inadequate manual memory manager, to write code to duplicate the behavior of a memory manager, either by allocating large blocks and splitting them for use, or by recycling blocks internally. Such code is known as a suballocator. Suballocators can take advantage of special knowledge of program behavior, but are less efficient in general than fixing the underlying allocator. Unless written by a memory management expert, suballocators may be inefficient or unreliable. The following languages use mainly manual memory management in most implementations, although many have conservative garbage collection extensions: Algol; C; C++; COBOL; Fortran; Pascal. For more information, see manual memory management in the Glossary. Automatic memory management Automatic memory management is a service, either as a part of the language or as an extension, that automatically recycles memory that a program would not otherwise use again. Automatic memory managers (often known as garbage collectors, or simply collectors) usually do their job by recycling blocks that are unreachable from the program variables (that is, blocks that cannot be reached by following pointers). The advantages of automatic memory management are:

The programmer is freed to work on the actual problem; Module interfaces are cleaner; There are fewer memory management bugs; Memory management is often more efficient.

The disadvantages of automatic memory management are:


Memory may be retained because it is reachable, but won't be used again; Automatic memory managers (currently) have limited availability.

There are many ways of performing automatic recycling of memory, a few of which are discussed in Recycling techniques. Most modern languages use mainly automatic memory management: BASIC, DylanTM, Erlang, Haskell, JavaTM, JavaScriptTM, Lisp, ML, Modula-3, Perl, the PostScript language, Prolog, Python, Scheme, Smalltalk, etc. For more information, see automatic memory management in the Glossary. More information Memory

Monoprogramming without Swapping or Paging Monoprogramming with fixed partitions Swapping Virtual Memory MEMORY Memory is the electronic holding place for instructions and data that the computer's microprocessor can reach quickly. When the computer is in normal operation, its memory usually contains the main parts of the operating system and some or all of the application programs and related data that are being used. Memory is often used as a shorter synonym for random access memory (RAM). This kind of memory is located on one or more microchips that are physically close to the microprocessor in the computer. Most desktop and notebook computers sold today include at least 16 megabytes of RAM, and are upgradeable to include more. The more RAM you have, the less frequently the computer has to access instructions and data from the more slowly accessed hard disk form of storage. Memory is sometimes distinguished from storage, or the physical medium that holds the much larger amounts of data that won't fit into RAM and may not be immediately needed there. Storage devices include hard disks, floppy disks, CD-ROM, and tape backup systems. The terms auxiliary storage, auxiliary memory, and secondary memory have also been used for this kind of data repository. Additional kinds of integrated and quickly accessible memory are read-only memory (ROM), programmable ROM (PROMO), erasable programmable ROM (EPROM). These are used to keep special programs and data, such as the basic input/output system, that need to be in the computer all the time. The memory is a resource that needs to be managed carefully. Most computers have a memory hierarchy, with a small amount of very fast, expensive, volatile cache memory, some number of megabytes of medium-speed, medium-price, volatile main memory (RAM), and hundreds of thousands of megabytes of slow, cheap, non-volatile disk storage. It is the job of the operating system to coordinate how these memories are used. The part of the operating system that manages the memory hierarchy is the memory manager. It keeps track of parts of memory that are in use and those that are not in use, to allocate memory to processes when they need it and de-allocate it when they are done, and to manage swapping between main memory and disk when main memory is too small to hold all the processes. Systems for managing memory can be divided into two categories: the system of moving processes back and forth between main memory and disk during execution (known as swapping and paging) and the process that does not do so (that is, no swapping and ping). Monoprogramming without Swapping or Paging The most simple memory management scheme is to run one program at a time, sharing the memory between that program and the operating system. As shown in the diagram bellow, there

are three variations of this type. The operating system may be at the bottom of the memory in TAM (random access memory), as shown in the diagram 1, or it may be in ROM (read-only memory) at the top of the memory, as shown in diagram 2, or the device drivers may be at the top of the memory in a ROM and the rest of the system in RAM down bellow With the system organised in this way, only one process at a time can be running. As soon as the user types a command, the operating system copies the request program from a disk to memory and executes it. When the process finishes, the operating system displays a prompt character and waits for a new command. When it receives the command, it loads a new program into memory overwriting the first one. Monoprogramming with fixed partitions It is often desirable to allow multiple processes to run at the same time, even on simple operating systems in which multiprogramming is sometimes used. On time-sharing systems, having multiple processes in memory at once means that when one process is blocked waiting for the I/O to finish, another one can use the CPU. This way, multiprogramming increases the CPU utilisation. It is however preferable to be able to run two or more programs at once even on personal computers. Swapping The process of organising memory into fixed partitions on batch system is simple compared to time sharing systems or graphically oriented personal computers. On batch systems, each job is loaded into a partition when it gets to the head of the queue. It stays in memory until it has finished. As long as enough jobs can be kept in memory to keep the CPU busy all the time, there is no reason to use anything more complicated. On time-sharing machines, sometimes there is not enough main memory to hold all the currently active processes, therefore excess processes must be kept on disk and brought in to run dynamically. Swapping as an approach to memory management consists of bringing each process in its entirety, running it for a while, then putting it back on the disk. For details and illustrations, see OS design and implementation, A.S. Tanenbaum & A.S. Woodhull, Prentice Hall 1997, pg.310.

Virtual Memory
Another strategy for managing memory is the virtual memory, which allows programs to run even when they are only partially in main memory. The basic idea behind this strategy is that the combine size of the program, data, and stack may exceed the amount of physical memory available for it. The operating system keeps those parts of the program currently in use in main memory, and the rest on the disk. Virtual memory can also work in a multiprogramming system, with bits and pieces pf many programs in memory at once. While a program is waiting for a part of itself to be brought in, it is waiting for I/O and cannot run, so the CPU can be given to another process, the same way as for

any other multiprogramming system. Most virtual memory systems use a technique called paging.

Diagram of memory management The memory management subsystem is one of the most important parts of the operating system. Since the early days of computing, there has been a need for more memory than exists physically in a system. Strategies have been developed to overcome this limitation and the most successful of these is virtual memory. Virtual memory makes the system appear to have more memory than it actually has by sharing it between competing processes as they need it. Virtual memory does more than just make your computer's memory go further. The memory management subsystem provides: Large Address Spaces The operating system makes the system appear as if it has a larger amount of memory than it actually has. The virtual memory can be many times larger than the physical memory in the system, Protection Each process in the system has its own virtual address space. These virtual address spaces are completely separate from each other and so a process running one application cannot affect another. Also, the hardware virtual memory mechanisms allow areas of memory to be protected against writing. This protects code and data from being overwritten by rogue applications.

Memory Mapping Memory mapping is used to map image and data files into a processes address space. In memory mapping, the contents of a file are linked directly into the virtual address space of a process. Fair Physical Memory Allocation The memory management subsystem allows each running process in the system a fair share of the physical memory of the system, Shared Virtual Memory Although virtual memory allows processes to have separate (virtual) address spaces, there are times when you need processes to share memory. For example there could be several processes in the system running the bash command shell. Rather than have several copies of bash, one in each processes virtual address space, it is better to have only one copy in physical memory and all of the processes running bash share it. Dynamic libraries are another common example of executing code shared between several processes. Shared memory can also be used as an Inter Process Communication (IPC) mechanism, with two or more processes exchanging information via memory common to all of them. Linux supports the Unix TM System V shared memory IPC. 3.1 An Abstract Model of Virtual Memory

Figure 3.1: Abstract model of Virtual to Physical address mapping Before considering the methods that Linux uses to support virtual memory it is useful to consider an abstract model that is not cluttered by too much detail. As the processor executes a program it reads an instruction from memory and decodes it. In decoding the instruction it may need to fetch or store the contents of a location in memory. The processor then executes the instruction and moves onto the next instruction in the program. In this way the processor is always accessing memory either to fetch instructions or to fetch and store data. In a virtual memory system all of these addresses are virtual addresses and not physical addresses. These virtual addresses are converted into physical addresses by the processor based on information held in a set of tables maintained by the operating system. To make this translation easier, virtual and physical memory are divided into handy sized chunks called pages. These pages are all the same size, they need not be but if they were not, the system would be very hard to administer. Linux on Alpha AXP systems uses 8 Kbyte pages and on Intel x86 systems it uses 4 Kbyte pages. Each of these pages is given a unique number; the page frame number (PFN). In this paged model, a virtual address is composed of two parts; an offset and a virtual page frame number. If the page size is 4 Kbytes, bits 11:0 of the virtual address contain the offset and bits 12 and above are the virtual page frame number. Each time the processor encounters a virtual address it must extract the offset and the virtual page frame number. The processor must translate the virtual page frame number into a physical one and then access the location at the correct offset into that physical page. To do this the processor uses page tables. Figure 3.1 shows the virtual address spaces of two processes, process X and process Y, each with their own page tables. These page tables map each processes virtual pages into physical pages in memory. This shows that process X's virtual page frame number 0 is mapped into memory in physical page frame number 1 and that process Y's virtual page frame number 1 is mapped into physical page frame number 4. Each entry in the theoretical page table contains the following information:

Valid flag. This indicates if this page table entry is valid, The physical page frame number that this entry is describing, Access control information. This describes how the page may be used. Can it be written to? Does it contain executable code?

The page table is accessed using the virtual page frame number as an offset. Virtual page frame 5 would be the 6th element of the table (0 is the first element). To translate a virtual address into a physical one, the processor must first work out the virtual addresses page frame number and the offset within that virtual page. By making the page size a power of 2 this can be easily done by masking and shifting. Looking again at Figures 3.1 and

assuming a page size of 0x2000 bytes (which is decimal 8192) and an address of 0x2194 in process Y's virtual address space then the processor would translate that address into offset 0x194 into virtual page frame number 1. The processor uses the virtual page frame number as an index into the processes page table to retrieve its page table entry. If the page table entry at that offset is valid, the processor takes the physical page frame number from this entry. If the entry is invalid, the process has accessed a non-existent area of its virtual memory. In this case, the processor cannot resolve the address and must pass control to the operating system so that it can fix things up. Just how the processor notifies the operating system that the correct process has attempted to access a virtual address for which there is no valid translation is specific to the processor. However the processor delivers it, this is known as a page fault and the operating system is notified of the faulting virtual address and the reason for the page fault. Assuming that this is a valid page table entry, the processor takes that physical page frame number and multiplies it by the page size to get the address of the base of the page in physical memory. Finally, the processor adds in the offset to the instruction or data that it needs. Using the above example again, process Y's virtual page frame number 1 is mapped to physical page frame number 4 which starts at 0x8000 (4 x 0x2000). Adding in the 0x194 byte offset gives us a final physical address of 0x8194. By mapping virtual to physical addresses this way, the virtual memory can be mapped into the system's physical pages in any order. For example, in Figure 3.1 process X's virtual page frame number 0 is mapped to physical page frame number 1 whereas virtual page frame number 7 is mapped to physical page frame number 0 even though it is higher in virtual memory than virtual page frame number 0. This demonstrates an interesting byproduct of virtual memory; the pages of virtual memory do not have to be present in physical memory in any particular order. 3.1.1 Demand Paging As there is much less physical memory than virtual memory the operating system must be careful that it does not use the physical memory inefficiently. One way to save physical memory is to only load virtual pages that are currently being used by the executing program. For example, a database program may be run to query a database. In this case not all of the database needs to be loaded into memory, just those data records that are being examined. If the database query is a search query then it does not make sense to load the code from the database program that deals with adding new records. This technique of only loading virtual pages into memory as they are accessed is known as demand paging. When a process attempts to access a virtual address that is not currently in memory the processor cannot find a page table entry for the virtual page referenced. For example, in Figure 3.1 there is no entry in process X's page table for virtual page frame number 2 and so if process X attempts to read from an address within virtual page frame number 2 the processor cannot translate the

address into a physical one. At this point the processor notifies the operating system that a page fault has occurred. If the faulting virtual address is invalid this means that the process has attempted to access a virtual address that it should not have. Maybe the application has gone wrong in some way, for example writing to random addresses in memory. In this case the operating system will terminate it, protecting the other processes in the system from this rogue process. If the faulting virtual address was valid but the page that it refers to is not currently in memory, the operating system must bring the appropriate page into memory from the image on disk. Disk access takes a long time, relatively speaking, and so the process must wait quite a while until the page has been fetched. If there are other processes that could run then the operating system will select one of them to run. The fetched page is written into a free physical page frame and an entry for the virtual page frame number is added to the processes page table. The process is then restarted at the machine instruction where the memory fault occurred. This time the virtual memory access is made, the processor can make the virtual to physical address translation and so the process continues to run. Linux uses demand paging to load executable images into a processes virtual memory. Whenever a command is executed, the file containing it is opened and its contents are mapped into the processes virtual memory. This is done by modifying the data structures describing this processes memory map and is known as memory mapping. However, only the first part of the image is actually brought into physical memory. The rest of the image is left on disk. As the image executes, it generates page faults and Linux uses the processes memory map in order to determine which parts of the image to bring into memory for execution. 3.1.2 Swapping If a process needs to bring a virtual page into physical memory and there are no free physical pages available, the operating system must make room for this page by discarding another page from physical memory. If the page to be discarded from physical memory came from an image or data file and has not been written to then the page does not need to be saved. Instead it can be discarded and if the process needs that page again it can be brought back into memory from the image or data file. However, if the page has been modified, the operating system must preserve the contents of that page so that it can be accessed at a later time. This type of page is known as a dirty page and when it is removed from memory it is saved in a special sort of file called the swap file. Accesses to the swap file are very long relative to the speed of the processor and physical memory and the operating system must juggle the need to write pages to disk with the need to retain them in memory to be used again. If the algorithm used to decide which pages to discard or swap (the swap algorithm is not efficient then a condition known as thrashing occurs. In this case, pages are constantly being written to disk and then being read back and the operating system is too busy to allow much real

work to be performed. If, for example, physical page frame number 1 in Figure 3.1 is being regularly accessed then it is not a good candidate for swapping to hard disk. The set of pages that a process is currently using is called the working set. An efficient swap scheme would make sure that all processes have their working set in physical memory. Linux uses a Least Recently Used (LRU) page aging technique to fairly choose pages which might be removed from the system. This scheme involves every page in the system having an age which changes as the page is accessed. The more that a page is accessed, the younger it is; the less that it is accessed the older and more stale it becomes. Old pages are good candidates for swapping. 3.1.3 Shared Virtual Memory Virtual memory makes it easy for several processes to share memory. All memory access are made via page tables and each process has its own separate page table. For two processes sharing a physical page of memory, its physical page frame number must appear in a page table entry in both of their page tables. Figure 3.1 shows two processes that each share physical page frame number 4. For process X this is virtual page frame number 4 whereas for process Y this is virtual page frame number 6. This illustrates an interesting point about sharing pages: the shared physical page does not have to exist at the same place in virtual memory for any or all of the processes sharing it. 3.1.4 Physical and Virtual Addressing Modes It does not make much sense for the operating system itself to run in virtual memory. This would be a nightmare situation where the operating system must maintain page tables for itself. Most multi-purpose processors support the notion of a physical address mode as well as a virtual address mode. Physical addressing mode requires no page tables and the processor does not attempt to perform any address translations in this mode. The Linux kernel is linked to run in physical address space. The Alpha AXP processor does not have a special physical addressing mode. Instead, it divides up the memory space into several areas and designates two of them as physically mapped addresses. This kernel address space is known as KSEG address space and it encompasses all addresses upwards from 0xfffffc0000000000. In order to execute from code linked in KSEG (by definition, kernel code) or access data there, the code must be executing in kernel mode. The Linux kernel on Alpha is linked to execute from address 0xfffffc0000310000. 3.1.5 Access Control The page table entries also contain access control information. As the processor is already using the page table entry to map a processes virtual address to a physical one, it can easily use the access control information to check that the process is not accessing memory in a way that it should not.

There are many reasons why you would want to restrict access to areas of memory. Some memory, such as that containing executable code, is naturally read only memory; the operating system should not allow a process to write data over its executable code. By contrast, pages containing data can be written to but attempts to execute that memory as instructions should fail. Most processors have at least two modes of execution: kernel and user. You would not want kernel code executing by a user or kernel data structures to be accessible except when the processor is running in kernel mode.

Figure 3.2: Alpha AXP Page Table Entry The access control information is held in the PTE and is processor specific; figure 3.2 shows the PTE for Alpha AXP. The bit fields have the following meanings: V Valid, if set this PTE is valid, FOE ``Fault on Execute'', Whenever an attempt to execute instructions in this page occurs, the processor reports a page fault and passes control to the operating system, FOW ``Fault on Write'', as above but page fault on an attempt to write to this page, FOR ``Fault on Read'', as above but page fault on an attempt to read from this page,

ASM Address Space Match. This is used when the operating system wishes to clear only some of the entries from the Translation Buffer, KRE Code running in kernel mode can read this page, URE Code running in user mode can read this page, GH Granularity hint used when mapping an entire block with a single Translation Buffer entry rather than many, KWE Code running in kernel mode can write to this page, UWE Code running in user mode can write to this page, page frame number For PTEs with the V bit set, this field contains the physical Page Frame Number (page frame number) for this PTE. For invalid PTEs, if this field is not zero, it contains information about where the page is in the swap file. The following two bits are defined and used by Linux: _PAGE_DIRTY if set, the page needs to be written out to the swap file, _PAGE_ACCESSED Used by Linux to mark a page as having been accessed. 3.2 Caches If you were to implement a system using the above theoretical model then it would work, but not particularly efficiently. Both operating system and processor designers try hard to extract more performance from the system. Apart from making the processors, memory and so on faster the

best approach is to maintain caches of useful information and data that make some operations faster. Linux uses a number of memory management related caches: Buffer Cache The buffer cache contains data buffers that are used by the block device drivers. These buffers are of fixed sizes (for example 512 bytes) and contain blocks of information that have either been read from a block device or are being written to it. A block device is one that can only be accessed by reading and writing fixed sized blocks of data. All hard disks are block devices. The buffer cache is indexed via the device identifier and the desired block number and is used to quickly find a block of data. Block devices are only ever accessed via the buffer cache. If data can be found in the buffer cache then it does not need to be read from the physical block device, for example a hard disk, and access to it is much faster. Page Cache This is used to speed up access to images and data on disk. It is used to cache the logical contents of a file a page at a time and is accessed via the file and offset within the file. As pages are read into memory from disk, they are cached in the page cache. Swap Cache Only modified (or dirty) pages are saved in the swap file. So long as these pages are not modified after they have been written to the swap file then the next time the page is swapped out there is no need to write it to the swap file as the page is already in the swap file. Instead the page can simply be discarded. In a heavily swapping system this saves many unnecessary and costly disk operations. Hardware Caches One commonly implemented hardware cache is in the processor; a cache of Page Table Entries. In this case, the processor does not always read the page table directly but instead caches translations for pages as it needs them. These are the Translation Look-aside Buffers and contain cached copies of the page table entries from one or more processes in the system. When the reference to the virtual address is made, the processor will attempt to find a matching TLB entry. If it finds one, it can directly translate the virtual address into a physical one and perform the correct operation on the data. If the processor cannot find a matching TLB entry then it must get the operating system to help. It does this by

signalling the operating system that a TLB miss has occurred. A system specific mechanism is used to deliver that exception to the operating system code that can fix things up. The operating system generates a new TLB entry for the address mapping. When the exception has been cleared, the processor will make another attempt to translate the virtual address. This time it will work because there is now a valid entry in the TLB for that address. The drawback of using caches, hardware or otherwise, is that in order to save effort Linux must use more time and space maintaining these caches and, if the caches become corrupted, the system will crash. 3.3 Linux Page Tables

Figure 3.3: Three Level Page Tables Linux assumes that there are three levels of page tables. Each Page Table accessed contains the page frame number of the next level of Page Table. Figure 3.3 shows how a virtual address can be broken into a number of fields; each field providing an offset into a particular Page Table. To translate a virtual address into a physical one, the processor must take the contents of each level field, convert it into an offset into the physical page containing the Page Table and read the page frame number of the next level of Page Table. This is repeated three times until the page frame number of the physical page containing the virtual address is found. Now the final field in the virtual address, the byte offset, is used to find the data inside the page.

Each platform that Linux runs on must provide translation macros that allow the kernel to traverse the page tables for a particular process. This way, the kernel does not need to know the format of the page table entries or how they are arranged. This is so successful that Linux uses the same page table manipulation code for the Alpha processor, which has three levels of page tables, and for Intel x86 processors, which have two levels of page tables. 3.4 Page Allocation and Deallocation There are many demands on the physical pages in the system. For example, when an image is loaded into memory the operating system needs to allocate pages. These will be freed when the image has finished executing and is unloaded. Another use for physical pages is to hold kernel specific data structures such as the page tables themselves. The mechanisms and data structures used for page allocation and deallocation are perhaps the most critical in maintaining the efficiency of the virtual memory subsystem. All of the physical pages in the system are described by the mem_map data structure which is a list of mem_map_t
1

structures which is initialized at boot time. Each mem_map_t describes a single physical page in the system. Important fields (so far as memory management is concerned) are: count This is a count of the number of users of this page. The count is greater than one when the page is shared between many processes, age This field describes the age of the page and is used to decide if the page is a good candidate for discarding or swapping, map_nr This is the physical page frame number that this mem_map_t describes. The free_area vector is used by the page allocation code to find and free pages. The whole buffer management scheme is supported by this mechanism and so far as the code is concerned, the size of the page and physical paging mechanisms used by the processor are irrelevant. Each element of free_area contains information about blocks of pages. The first element in the array describes single pages, the next blocks of 2 pages, the next blocks of 4 pages and so on upwards in powers of two. The list element is used as a queue head and has pointers to the page data structures in the mem_map array. Free blocks of pages are queued here. map is a pointer to a bitmap which keeps track of allocated groups of pages of this size. Bit N of the bitmap is set if the Nth block of pages is free.

Figure free-area-figure shows the free_area structure. Element 0 has one free page (page frame number 0) and element 2 has 2 free blocks of 4 pages, the first starting at page frame number 4 and the second at page frame number 56. 3.4.1 Page Allocation Linux uses the Buddy algorithm 2 to effectively allocate and deallocate blocks of pages. The page allocation code attempts to allocate a block of one or more physical pages. Pages are allocated in blocks which are powers of 2 in size. That means that it can allocate a block 1 page, 2 pages, 4 pages and so on. So long as there are enough free pages in the system to grant this request (nr_free_pages min_free_pages) the allocation code will search the free_area for a block of pages of the size requested. Each element of the free_area has a map of the allocated and free blocks of pages for that sized block. For example, element 2 of the array has a memory map that describes free and allocated blocks each of 4 pages long. The allocation algorithm first searches for blocks of pages of the size requested. It follows the chain of free pages that is queued on the list element of the free_area data structure. If no blocks of pages of the requested size are free, blocks of the next size (which is twice that of the size requested) are looked for. This process continues until all of the free_area has been searched or until a block of pages has been found. If the block of pages found is larger than that requested it must be broken down until there is a block of the right size. Because the blocks are each a power of 2 pages big then this breaking down process is easy as you simply break the blocks in half. The free blocks are queued on the appropriate queue and the allocated block of pages is returned to the caller.

Figure 3.4: The free_area data structure For example, in Figure 3.4 if a block of 2 pages was requested, the first block of 4 pages (starting at page frame number 4) would be broken into two 2 page blocks. The first, starting at page frame number 4 would be returned to the caller as the allocated pages and the second block, starting at page frame number 6 would be queued as a free block of 2 pages onto element 1 of the free_area array. 3.4.2 Page Deallocation Allocating blocks of pages tends to fragment memory with larger blocks of free pages being broken down into smaller ones. The page deallocation code recombines pages into larger blocks of free pages whenever it can. In fact the page block size is important as it allows for easy combination of blocks into larger blocks. Whenever a block of pages is freed, the adjacent or buddy block of the same size is checked to see if it is free. If it is, then it is combined with the newly freed block of pages to form a new free block of pages for the next size block of pages. Each time two blocks of pages are recombined into a bigger block of free pages the page deallocation code attempts to recombine that block into a yet larger one. In this way the blocks of free pages are as large as memory usage will allow.

For example, in Figure 3.4, if page frame number 1 were to be freed, then that would be combined with the already free page frame number 0 and queued onto element 1 of the free_area as a free block of size 2 pages. 3.5 Memory Mapping When an image is executed, the contents of the executable image must be brought into the processes virtual address space. The same is also true of any shared libraries that the executable image has been linked to use. The executable file is not actually brought into physical memory, instead it is merely linked into the processes virtual memory. Then, as the parts of the program are referenced by the running application, the image is brought into memory from the executable image. This linking of an image into a processes virtual address space is known as memory mapping.

Figure 3.5: Areas of Virtual Memory

Every processes virtual memory is represented by an mm_struct data structure. This contains information about the image that it is currently executing (for example bash) and also has pointers to a number of vm_area_struct data structures. Each vm_area_struct data structure describes the start and end of the area of virtual memory, the processes access rights to that memory and a set of operations for that memory. These operations are a set of routines that Linux must use when manipulating this area of virtual memory. For example, one of the virtual memory operations performs the correct actions when the process has attempted to access this virtual memory but finds (via a page fault) that the memory is not actually in physical memory. This operation is the nopage operation. The nopage operation is used when Linux demand pages the pages of an executable image into memory. When an executable image is mapped into a processes virtual address a set of vm_area_struct data structures is generated. Each vm_area_struct data structure represents a part of the executable image; the executable code, initialized data (variables), unitialized data and so on. Linux supports a number of standard virtual memory operations and as the vm_area_struct data structures are created, the correct set of virtual memory operations are associated with them. 3.6 Demand Paging Once an executable image has been memory mapped into a processes virtual memory it can start to execute. As only the very start of the image is physically pulled into memory it will soon access an area of virtual memory that is not yet in physical memory. When a process accesses a virtual address that does not have a valid page table entry, the processor will report a page fault to Linux. The page fault describes the virtual address where the page fault occurred and the type of memory access that caused. Linux must find the vm_area_struct that represents the area of memory that the page fault occurred in. As searching through the vm_area_struct data structures is critical to the efficient handling of page faults, these are linked together in an AVL (Adelson-Velskii and Landis) tree structure. If there is no vm_area_struct data structure for this faulting virtual address, this process has accessed an illegal virtual address. Linux will signal the process, sending a SIGSEGV signal, and if the process does not have a handler for that signal it will be terminated. Linux next checks the type of page fault that occurred against the types of accesses allowed for this area of virtual memory. If the process is accessing the memory in an illegal way, say writing to an area that it is only allowed to read from, it is also signalled with a memory error. Now that Linux has determined that the page fault is legal, it must deal with it. Linux must differentiate between pages that are in the swap file and those that are part of an executable image on a disk somewhere. It does this by using the page table entry for this faulting virtual address.

If the page's page table entry is invalid but not empty, the page fault is for a page currently being held in the swap file. For Alpha AXP page table entries, these are entries which do not have their valid bit set but which have a non-zero value in their PFN field. In this case the PFN field holds information about where in the swap (and which swap file) the page is being held. How pages in the swap file are handled is described later in this chapter. Not all vm_area_struct data structures have a set of virtual memory operations and even those that do may not have a nopage operation. This is because by default Linux will fix up the access by allocating a new physical page and creating a valid page table entry for it. If there is a nopage operation for this area of virtual memory, Linux will use it. The generic Linux nopage operation is used for memory mapped executable images and it uses the page cache to bring the required image page into physical memory. However the required page is brought into physical memory, the processes page tables are updated. It may be necessary for hardware specific actions to update those entries, particularly if the processor uses translation look aside buffers. Now that the page fault has been handled it can be dismissed and the process is restarted at the instruction that made the faulting virtual memory access. 3.7 The Linux Page Cache

Figure 3.6: The Linux Page Cache The role of the Linux page cache is to speed up access to files on disk. Memory mapped files are read a page at a time and these pages are stored in the page cache. Figure 3.6 shows that the page cache consists of the page_hash_table, a vector of pointers to mem_map_t data structures.

Each file in Linux is identified by a VFS inode data structure (described in Chapter filesystemchapter) and each VFS inode is unique and fully describes one and only one file. The index into the page table is derived from the file's VFS inode and the offset into the file. Whenever a page is read from a memory mapped file, for example when it needs to be brought back into memory during demand paging, the page is read through the page cache. If the page is present in the cache, a pointer to the mem_map_t data structure representing it is returned to the page fault handling code. Otherwise the page must be brought into memory from the file system that holds the image. Linux allocates a physical page and reads the page from the file on disk. If it is possible, Linux will initiate a read of the next page in the file. This single page read ahead means that if the process is accessing the pages in the file serially, the next page will be waiting in memory for the process. Over time the page cache grows as images are read and executed. Pages will be removed from the cache as they are no longer needed, say as an image is no longer being used by any process. As Linux uses memory it can start to run low on physical pages. In this case Linux will reduce the size of the page cache. 3.8 Swapping Out and Discarding Pages When physical memory becomes scarce the Linux memory management subsystem must attempt to free physical pages. This task falls to the kernel swap daemon (kswapd). The kernel swap daemon is a special type of process, a kernel thread. Kernel threads are processes have no virtual memory, instead they run in kernel mode in the physical address space. The kernel swap daemon is slightly misnamed in that it does more than merely swap pages out to the system's swap files. Its role is make sure that there are enough free pages in the system to keep the memory management system operating efficiently. The Kernel swap daemon (kswapd) is started by the kernel init process at startup time and sits waiting for the kernel swap timer to periodically expire. Every time the timer expires, the swap daemon looks to see if the number of free pages in the system is getting too low. It uses two variables, free_pages_high and free_pages_low to decide if it should free some pages. So long as the number of free pages in the system remains above free_pages_high, the kernel swap daemon does nothing; it sleeps again until its timer next expires. For the purposes of this check the kernel swap daemon takes into account the number of pages currently being written out to the swap file. It keeps a count of these in nr_async_pages; this is incremented each time a page is queued waiting to be written out to the swap file and decremented when the write to the swap device has completed. free_pages_low and free_pages_high are set at system startup time and are related to the number of physical pages in the system. If the number of free pages in the system has fallen below free_pages_high or worse still free_pages_low, the kernel swap daemon will try three ways to reduce the number of physical pages being used by the system:

Reducing the size of the buffer and page caches, Swapping out System V shared memory pages, Swapping out and discarding pages. If the number of free pages in the system has fallen below free_pages_low, the kernel swap daemon will try to free 6 pages before it next runs. Otherwise it will try to free 3 pages. Each of the above methods are tried in turn until enough pages have been freed. The kernel swap daemon remembers which method it was using the last time that it attempted to free physical pages. Each time it runs it will start trying to free pages using this last successful method. After it has free sufficient pages, the swap daemon sleeps again until its timer expires. If the reason that the kernel swap daemon freed pages was that the number of free pages in the system had fallen below free_pages_low, it only sleeps for half its usual time. Once the number of free pages is more than free_pages_low the kernel swap daemon goes back to sleeping longer between checks. 3.8.1 Reducing the Size of the Page and Buffer Caches The pages held in the page and buffer caches are good candidates for being freed into the free_area vector. The Page Cache, which contains pages of memory mapped files, may contain unneccessary pages that are filling up the system's memory. Likewise the Buffer Cache, which contains buffers read from or being written to physical devices, may also contain unneeded buffers. When the physical pages in the system start to run out, discarding pages from these caches is relatively easy as it requires no writing to physical devices (unlike swapping pages out of memory). Discarding these pages does not have too many harmful side effects other than making access to physical devices and memory mapped files slower. However, if the discarding of pages from these caches is done fairly, all processes will suffer equally. Every time the Kernel swap daemon tries to shrink these caches it examines a block of pages in the mem_map page vector to see if any can be discarded from physical memory. The size of the block of pages examined is higher if the kernel swap daemon is intensively swapping; that is if the number of free pages in the system has fallen dangerously low. The blocks of pages are examined in a cyclical manner; a different block of pages is examined each time an attempt is made to shrink the memory map. This is known as the clock algorithm as, rather like the minute hand of a clock, the whole mem_map page vector is examined a few pages at a time. Each page being examined is checked to see if it is cached in either the page cache or the buffer cache. You should note that shared pages are not considered for discarding at this time and that a page cannot be in both caches at the same time. If the page is not in either cache then the next page in the mem_map page vector is examined.

Pages are cached in the buffer cache (or rather the buffers within the pages are cached) to make buffer allocation and deallocation more efficient. The memory map shrinking code tries to free the buffers that are contained within the page being examined. If all the buffers are freed, then the pages that contain them are also be freed. If the examined page is in the Linux page cache, it is removed from the page cache and freed. When enough pages have been freed on this attempt then the kernel swap daemon will wait until the next time it is periodically woken. As none of the freed pages were part of any process's virtual memory (they were cached pages), then no page tables need updating. If there were not enough cached pages discarded then the swap daemon will try to swap out some shared pages. 3.8.2 Swapping Out System V Shared Memory Pages System V shared memory is an inter-process communication mechanism which allows two or more processes to share virtual memory in order to pass information amongst themselves. How processes share memory in this way is described in more detail in Chapter IPC-chapter. For now it is enough to say that each area of System V shared memory is described by a shmid_ds data structure. This contains a pointer to a list of vm_area_struct data structures, one for each process sharing this area of virtual memory. The vm_area_struct data structures describe where in each processes virtual memory this area of System V shared memory goes. Each vm_area_struct data structure for this System V shared memory is linked together using the vm_next_shared and vm_prev_shared pointers. Each shmid_ds data structure also contains a list of page table entries each of which describes the physical page that a shared virtual page maps to. The kernel swap daemon also uses a clock algorithm when swapping out System V shared memory pages. . Each time it runs it remembers which page of which shared virtual memory area it last swapped out. It does this by keeping two indices, the first is an index into the set of shmid_ds data structures, the second into the list of page table entries for this area of System V shared memory. This makes sure that it fairly victimizes the areas of System V shared memory. As the physical page frame number for a given virtual page of System V shared memory is contained in the page tables of all of the processes sharing this area of virtual memory, the kernel swap daemon must modify all of these page tables to show that the page is no longer in memory but is now held in the swap file. For each shared page it is swapping out, the kernel swap daemon finds the page table entry in each of the sharing processes page tables (by following a pointer from each vm_area_struct data structure). If this processes page table entry for this page of System V shared memory is valid, it converts it into an invalid but swapped out page table entry and reduces this (shared) page's count of users by one. The format of a swapped out System V shared page table entry contains an index into the set of shmid_ds data structures and an index into the page table entries for this area of System V shared memory.

If the page's count is zero after the page tables of the sharing processes have all been modified, the shared page can be written out to the swap file. The page table entry in the list pointed at by the shmid_ds data structure for this area of System V shared memory is replaced by a swapped out page table entry. A swapped out page table entry is invalid but contains an index into the set of open swap files and the offset in that file where the swapped out page can be found. This information will be used when the page has to be brought back into physical memory. 3.8.3 Swapping Out and Discarding Pages The swap daemon looks at each process in the system in turn to see if it is a good candidate for swapping. Good candidates are processes that can be swapped (some cannot) and that have one or more pages which can be swapped or discarded from memory. Pages are swapped out of physical memory into the system's swap files only if the data in them cannot be retrieved another way. A lot of the contents of an executable image come from the image's file and can easily be re-read from that file. For example, the executable instructions of an image will never be modified by the image and so will never be written to the swap file. These pages can simply be discarded; when they are again referenced by the process, they will be brought back into memory from the executable image. Once the process to swap has been located, the swap daemon looks through all of its virtual memory regions looking for areas which are not shared or locked. Linux does not swap out all of the swappable pages of the process that it has selected; instead it removes only a small number of pages. Pages cannot be swapped or discarded if they are locked in memory. The Linux swap algorithm uses page aging. Each page has a counter (held in the mem_map_t data structure) that gives the Kernel swap daemon some idea whether or not a page is worth swapping. Pages age when they are unused and rejuvinate on access; the swap daemon only swaps out old pages. The default action when a page is first allocated, is to give it an initial age of 3. Each time it is touched, it's age is increased by 3 to a maximum of 20. Every time the Kernel swap daemon runs it ages pages, decrementing their age by 1. These default actions can be changed and for this reason they (and other swap related information) are stored in the swap_control data structure. If the page is old (age = 0), the swap daemon will process it further. Dirty pages are pages which can be swapped out. Linux uses an architecture specific bit in the PTE to describe pages this way (see Figure 3.2). However, not all dirty pages are necessarily written to the swap file. Every virtual memory region of a process may have its own swap operation (pointed at by the vm_ops pointer in the vm_area_struct) and that method is used. Otherwise, the swap daemon will allocate a page in the swap file and write the page out to that device.

The page's page table entry is replaced by one which is marked as invalid but which contains information about where the page is in the swap file. This is an offset into the swap file where the page is held and an indication of which swap file is being used. Whatever the swap method used, the original physical page is made free by putting it back into the free_area. Clean (or rather not dirty) pages can be discarded and put back into the free_area for re-use. If enough of the swappable processes pages have been swapped out or discarded, the swap daemon will again sleep. The next time it wakes it will consider the next process in the system. In this way, the swap daemon nibbles away at each processes physical pages until the system is again in balance. This is much fairer than swapping out whole processes. 3.9 The Swap Cache When swapping pages out to the swap files, Linux avoids writing pages if it does not have to. There are times when a page is both in a swap file and in physical memory. This happens when a page that was swapped out of memory was then brought back into memory when it was again accessed by a process. So long as the page in memory is not written to, the copy in the swap file remains valid. Linux uses the swap cache to track these pages. The swap cache is a list of page table entries, one per physical page in the system. This is a page table entry for a swapped out page and describes which swap file the page is being held in together with its location in the swap file. If a swap cache entry is non-zero, it represents a page which is being held in a swap file that has not been modified. If the page is subsequently modified (by being written to), its entry is removed from the swap cache. When Linux needs to swap a physical page out to a swap file it consults the swap cache and, if there is a valid entry for this page, it does not need to write the page out to the swap file. This is because the page in memory has not been modified since it was last read from the swap file. The entries in the swap cache are page table entries for swapped out pages. They are marked as invalid but contain information which allow Linux to find the right swap file and the right page within that swap file. 3.10 Swapping Pages In The dirty pages saved in the swap files may be needed again, for example when an application writes to an area of virtual memory whose contents are held in a swapped out physical page. Accessing a page of virtual memory that is not held in physical memory causes a page fault to occur. The page fault is the processor signalling the operating system that it cannot translate a virtual address into a physical one. In this case this is because the page table entry describing this page of virtual memory was marked as invalid when the page was swapped out. The processor cannot handle the virtual to physical address translation and so hands control back to the operating system describing as it does so the virtual address that faulted and the reason for the fault. The format of this information and how the processor passes control to the operating system is processor specific.

The processor specific page fault handling code must locate the vm_area_struct data structure that describes the area of virtual memory that contains the faulting virtual address. It does this by searching the vm_area_struct data structures for this process until it finds the one containing the faulting virtual address. This is very time critical code and a processes vm_area_struct data structures are so arranged as to make this search take as little time as possible. Having carried out the appropriate processor specific actions and found that the faulting virtual address is for a valid area of virtual memory, the page fault processing becomes generic and applicable to all processors that Linux runs on. The generic page fault handling code looks for the page table entry for the faulting virtual address. If the page table entry it finds is for a swapped out page, Linux must swap the page back into physical memory. The format of the page table entry for a swapped out page is processor specific but all processors mark these pages as invalid and put the information neccessary to locate the page within the swap file into the page table entry. Linux needs this information in order to bring the page back into physical memory. At this point, Linux knows the faulting virtual address and has a page table entry containing information about where this page has been swapped to. The vm_area_struct data structure may contain a pointer to a routine which will swap any page of the area of virtual memory that it describes back into physical memory. This is its swapin operation. If there is a swapin operation for this area of virtual memory then Linux will use it. This is, in fact, how swapped out System V shared memory pages are handled as it requires special handling because the format of a swapped out System V shared page is a little different from that of an ordinairy swapped out page. There may not be a swapin operation, in which case Linux will assume that this is an ordinairy page that does not need to be specially handled. It allocates a free physical page and reads the swapped out page back from the swap file. Information telling it where in the swap file (and which swap file) is taken from the the invalid page table entry. If the access that caused the page fault was not a write access then the page is left in the swap cache and its page table entry is not marked as writable. If the page is subsequently written to, another page fault will occur and, at that point, the page is marked as dirty and its entry is removed from the swap cache. If the page is not written to and it needs to be swapped out again, Linux can avoid the write of the page to its swap file because the page is already in the swap file. If the access that caused the page to be brought in from the swap file was a write operation, this page is removed from the swap cache and its page table entry is marked as both dirty and writable. Difference between paging and segmentation Paging Computer memory is divided into small partitions that are all the same size and referred to as, page frames. Then when a process is loaded it gets divided into pages which are the same size as those previous frames. The process pages are then loaded into the frames.

Segmentation Computer memory is allocated in various sizes (segments) depending on the need for address space by the process. These segments may be individually protected or shared between processes. Commonly you will see what are called Segmentation Faults in programs, this is because the data thats is about to be read or written is outside the permitted address space of that process. So now we can distinguish the differences and look at a comparison between the two: Paging: Transparent to programmer (system allocates memory) No separate protection No separate compiling No shared code Segmentation: Involves programmer (allocates memory to specific function inside code) Separate compiling Separate protection Share code Virtual Memory

The program thinks it has a large range of contiguous addresses, but in reality the parts it is currently using are scattered around RAM, and the inactive parts are saved in a disk file.

In computing, virtual memory is a memory management technique developed for multitasking kernels. This technique virtualizes a computer architecture's various hardware memory devices (such as RAM modules and disk storage drives), allowing a program to be designed as though:

there is only one hardware memory device and this "virtual" device acts like a RAM module. the program has, by default, sole access to this virtual RAM module as the basis for a contiguous working memory (an address space).

Systems that employ virtual memory:


use hardware memory more efficiently than systems without virtual memory.[citation needed] make the programming of applications easier by: o hiding fragmentation. o delegating to the kernel the burden of managing the memory hierarchy; there is no need for the program to handle overlays explicitly. o obviating the need to relocate program code or to access memory with relative addressing.

CHAPTER 4 INTER PROCESS COMMUNICATION AND SYNCHRONIZATION

In computing, Inter-process communication (IPC) is a set of techniques for the exchange of data among multiple threads in one or more processes. Processes may be running on one or more computers connected by a network. IPC techniques are divided into methods for message passing, synchronization, shared memory, and remote procedure calls (RPC). The method of IPC used may vary based on the bandwidth and latency of communication between the threads, and the type of data being communicated. There are several reasons for providing an environment that allows process cooperation:

Information sharing speedup; Modularity; Convenience; and Privilege separation.

IPC may also be referred to as inter-thread communication and inter-application communication. Mutual Exclusion Mutual exclusion (often abbreviated to mutex) algorithms are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable, by pieces of computer code called critical sections. A critical section is a piece of code in which a process or thread accesses a common resource. The critical section by itself is not a mechanism or algorithm for mutual exclusion. A program, process, or thread can have the critical section in it without any mechanism or algorithm which implements mutual exclusion. Examples of such resources are fine-grained flags, counters or queues, used to communicate between code that runs concurrently, such as an application and its interrupt handlers. The synchronization of access to those resources is an acute problem because a thread can be stopped or started at any time.

Enforcing mutual exclusion There are both software and hardware solutions for enforcing mutual exclusion. The different solutions are shown below. Hardware solutions On a uniprocessor system a common way to achieve mutual exclusion inside kernels is to disable interrupts for the smallest possible number of instructions that will prevent corruption of the shared data structure, the critical section. This prevents interrupt code from running in the critical section, that also protects against interrupt-based process-change. In a computer in which several processors share memory, an indivisible test-and-set of a flag could be used in a tight loop to wait until the other processor clears the flag. The test-and-set performs both operations without releasing the memory bus to another processor. When the code leaves the critical section, it clears the flag. This is called a "spinlock" or "busy-wait". Similar atomic multiple-operation instructions, e.g., compare-and-swap, are commonly used for lock-free manipulation of linked lists and other data structures. Software solutions Beside the hardware supported solution, some software solutions exist that use "busy-wait" to achieve the goal. Examples of these include the following:

Dekker's algorithm Peterson's algorithm Lamport's bakery algorithm The black-white bakery algorithm Szymanski's Algorithm

Unfortunately, spin locks and busy waiting take up excessive processor time and power and are considered anti-patterns in almost every case. In addition, these algorithms do not work if Outof-order execution is utilized on the platform that executes them. Programmers have to specify strict ordering on the memory operations within a thread.[citation needed] The solution to these problems is to use synchronization facilities provided by an operating system's multithreading library, which will take advantage of hardware solutions if possible but will use software solutions if no hardware solutions exist. For example, when the operating system's lock library is used and a thread tries to acquire an already acquired lock, the operating system will suspend the thread using a context switch and swaps it out with another thread that is ready to be run, or could put that processor into a low power state if there is no other thread that can be run. Therefore, most modern mutual exclusion methods attempt to reduce latency and busy-waits by using queuing and context switches. However, if the time that is spent suspending a thread and then restoring it can be proven to be always more than the time that must be waited

for a thread to become ready to run after being blocked in a particular situation, then spinlocks are a fine solution for that situation only. Semaphore a semaphore is a protected variable or abstract data type that provides a simple but useful abstraction for controlling access by multiple processes to a common resource in a parallel programming environment. A useful way to think of a semaphore is as a record of how many units of a particular resource are available, coupled with operations to safely (i.e. without race conditions) adjust that record as units are required or become free, and if necessary wait until a unit of the resource becomes available. Semaphores are a useful tool in the prevention of race conditions and deadlocks; however, their use is by no means a guarantee that a program is free from these problems. Semaphores which allow an arbitrary resource count are called counting semaphores, whilst semaphores which are restricted to the values 0 and 1 (or locked/unlocked, unavailable/available) are called binary semaphores.

Concurrent Programming: In computer science, concurrency is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other. The computations may be executing on multiple cores in the same chip, preemptively time-shared threads on the same processor, or executed on physically separated processors. A number of mathematical models have been developed for general concurrent computation including Petri nets, process calculi, the Parallel Random Access Machine model and the Actor model. Deadlock A deadlock is a situation where in two or more competing actions are each waiting for the other to finish, and thus neither ever does. It is often seen in a paradox like the "chicken or the egg". The concept of a Catch 22 is similar. Coffman deadlock refers to a specific condition when two or more processes are each waiting for each other to release a resource, or more than two processes are waiting for resources in a circular chain (see Necessary conditions). Deadlock is a common problem in multiprocessing where many processes share a specific type of mutually exclusive resource known as a software lock or soft lock. Computers intended for the time-sharing and/or real-time markets are often equipped with a hardware lock (or hard lock) which guarantees exclusive access to processes, forcing serialized access. Deadlocks are particularly troubling because there is no general solution to avoid (soft) deadlocks. This situation may be likened to two people who are drawing diagrams, with only one pencil and one ruler between them. If one person takes the pencil and the other takes the ruler, a deadlock

occurs when the person with the pencil needs the ruler and the person with the ruler needs the pencil to finish his work with the ruler. Neither request can be satisfied, so a deadlock occurs. Examples An example of a deadlock which may occur in database products is the following. Client applications using the database may require exclusive access to a table, and in order to gain exclusive access they ask for a lock. If one client application holds a lock on a table and attempts to obtain the lock on a second table that is already held by a second client application, this may lead to deadlock if the second application then attempts to obtain the lock that is held by the first application. (But this particular type of deadlock is easily prevented, e.g., by using an all-ornone resource allocation algorithm.) Another example might be a text formatting program that accepts text sent to it to be processed and then returns the results, but does so only after receiving "enough" text to work on (e.g. 1KB). A text editor program is written that sends the formatter some text and then waits for the results. In this case a deadlock may occur on the last block of text. Since the formatter may not have sufficient text for processing, it will suspend itself while waiting for the additional text, which will never arrive since the text editor has sent it all of the text it has. Meanwhile, the text editor is itself suspended waiting for the last output from the formatter. This type of deadlock is sometimes referred to as a deadly embrace (properly used only when only two applications are involved) or starvation. However, this situation, too, is easily prevented by having the text editor send a forcing message (e.g. EOF, (End Of File)) with its last (partial) block of text, which will force the formatter to return the last (partial) block after formatting, and not wait for additional text. In communications, corrupted messages may cause computers to go into bad states where they are not communicating properly. The network may be said to be deadlocked even though no computer is waiting for a resource. This is different than a Coffman deadlock.

Necessary conditions There are four necessary and sufficient conditions for a Coffman deadlock to occur, known as the Coffman conditions from their first description in a 1971 article by E. G. Coffman.
1. Mutual exclusion condition: a resource that cannot be used by more than one process at a

time 2. Hold and wait condition: processes already holding resources may request new resources 3. No preemption condition: No resource can be forcibly removed from a process holding it, resources can be released only by the explicit action of the process. The first three conditions are necessary but not sufficient for a deadlock to exist. For deadlock to actually take place,a fourth condition is required:

1. Circular wait condition: two or more processes form a circular chain where each process

waits for a resource that the next process in the chain holds. When circular waiting is triggered by mutual exclusion operations it is sometimes called lock inversion[2]. Prevention

Removing the mutual exclusion condition means that no process may have exclusive access to a resource. This proves impossible for resources that cannot be spooled, and even with spooled resources deadlock could still occur. Algorithms that avoid mutual exclusion are called non-blocking synchronization algorithms. The "hold and wait" conditions may be removed by requiring processes to request all the resources they will need before starting up (or before embarking upon a particular set of operations); this advance knowledge is frequently difficult to satisfy and, in any case, is an inefficient use of resources. Another way is to require processes to release all their resources before requesting all the resources they will need. This too is often impractical. (Such algorithms, such as serializing tokens, are known as the all-or-none algorithms.) A "no preemption" (lockout) condition may also be difficult or impossible to avoid as a process has to be able to have a resource for a certain amount of time, or the processing outcome may be inconsistent or thrashing may occur. However, inability to enforce preemption may interfere with a priority algorithm. (Note: Preemption of a "locked out" resource generally implies a rollback, and is to be avoided, since it is very costly in overhead.) Algorithms that allow preemption include lock-free and wait-free algorithms and optimistic concurrency control. The circular wait condition: Algorithms that avoid circular waits include "disable interrupts during critical sections", and "use a hierarchy to determine a partial ordering of resources" (where no obvious hierarchy exists, even the memory address of resources has been used to determine ordering) and Dijkstra's solution.

Avoidance Deadlock can be avoided if certain information about processes is available in advance of resource allocation. For every resource request, the system sees if granting the request will mean that the system will enter an unsafe state, meaning a state that could result in deadlock. The system then only grants requests that will lead to safe states. In order for the system to be able to figure out whether the next state will be safe or unsafe, it must know in advance at any time the number and type of all resources in existence, available, and requested. One known algorithm that is used for deadlock avoidance is the Banker's algorithm, which requires resource usage limit to be known in advance. However, for many systems it is impossible to know in advance what every process will request. This means that deadlock avoidance is often impossible. Two other algorithms are Wait/Die and Wound/Wait, each of which uses a symmetry-breaking technique. In both these algorithms there exists an older process (O) and a younger process (Y). Process age can be determined by a timestamp at process creation time. Smaller time stamps are older processes, while larger timestamps represent younger processes. Wait/Die Wound/Wait

O needs a resource held by Y O waits Y needs a resource held by O Y dies

Y dies Y waits

It is important to note that a process may be in an unsafe state but would not result in a deadlock. The notion of safe/unsafe states only refers to the ability of the system to enter a deadlock state or not. For example, if a process requests A which would result in an unsafe state, but releases B which would prevent circular wait, then the state is unsafe but the system is not in deadlock. Detection Often, neither avoidance nor deadlock prevention may be used. Instead deadlock detection and process restart are used by employing an algorithm that tracks resource allocation and process states, and rolls back and restarts one or more of the processes in order to remove the deadlock. Detecting a deadlock that has already occurred is easily possible since the resources that each process has locked and/or currently requested are known to the resource scheduler or OS. Detecting the possibility of a deadlock before it occurs is much more difficult and is, in fact, generally undecidable, because the halting problem can be rephrased as a deadlock scenario. However, in specific environments, using specific means of locking resources, deadlock detection may be decidable. In the general case, it is not possible to distinguish between algorithms that are merely waiting for a very unlikely set of circumstances to occur and algorithms that will never finish because of deadlock. Deadlock detection techniques include, but is not limited to, Model checking. This approach constructs a Finite State-model on which it performs a progress analysis and finds all possible terminal sets in the model. These then each represent a deadlock. Distributed deadlock This article's tone or style may not be appropriate for Wikipedia. Specific concerns may be found on the talk page. See Wikipedia's guide to writing better articles for suggestions. (November 2010)

Distributed deadlocks can occur in distributed systems when distributed transactions or concurrency control is being used. Distributed deadlocks can be detected either by constructing a global wait-for graph, from local wait-for graphs at a deadlock detector or by a distributed algorithm like edge chasing. In a Commitment ordering based distributed environment (including the Strong strict two-phase locking (SS2PL, or rigorous) special case) distributed deadlocks are resolved automatically by the atomic commitment protocol (e.g. two-phase commit (2PC)), and no global wait-for graph or other resolution mechanism are needed. Similar automatic global deadlock resolution occurs also in environments that employ 2PL that is not SS2PL (and typically not CO; see Deadlocks in 2PL). However 2PL that is not SS2PL is rarely utilized in practice.

Phantom deadlocks are deadlocks that are detected in a distributed system due to system internal delays, but no longer actually exist at the time of detection.

CHAPTER 5 FILE SYSTEM

A file system (often also written as filesystem) is a method of storing and organizing computer files and their data. Essentially, it organizes these files into a database for the storage, organization, manipulation, and retrieval by the computer's operating system. File systems are used on data storage devices such as hard disks or CD-ROMs to maintain the physical location of the files. Beyond this, they might provide access to data on a file server by acting as clients for a network protocol (e.g., NFS, SMB, or 9P clients), or they may be virtual and exist only as an access method for virtual data (e.g., procfs). It is distinguished from a directory service and registry. Aspects of file systems Most file systems make use of an underlying data storage device that offers access to an array of fixed-size physical sectors, generally a power of 2 in size (512 bytes or 1, 2, or 4 KiB are most common). The file system is responsible for organizing these sectors into files and directories, and keeping track of which sectors belong to which file and which are not being used. Most file systems address data in fixed-sized units called "clusters" or "blocks" which contain a certain number of disk sectors (usually 1-64). This is the smallest amount of disk space that can be allocated to hold a file. However, file systems need not make use of a storage device at all. A file system can be used to organize and represent access to any data, whether it is stored or dynamically generated (e.g., procfs). File names A file name (or filename) is a name assigned to a file in order to secure storage location in the computer memory. Whether the file system has an underlying storage device or not, file systems typically have directories which associate file names with files, usually by connecting the file name to an index in a file allocation table of some sort, such as the FAT in a DOS file system, or an inode in a Unix-like file system. Directory structures may be flat, or allow hierarchies where directories may contain subdirectories. In some file systems, file names are structured, with special syntax for filename extensions and version numbers. In others, file names are simple strings, and per-file metadata is stored elsewhere. Metadata

Other bookkeeping information is typically associated with each file within a file system. The length of the data contained in a file may be stored as the number of blocks allocated for the file or as an exact byte count. The time that the file was last modified may be stored as the file's timestamp. Some file systems also store the file creation time, the time it was last accessed, and the time the file's meta-data was changed. (Note that many early PC operating systems did not keep track of file times.) Other information can include the file's device type (e.g., block, character, socket, subdirectory, etc.), its owner user-ID and group-ID, and its access permission settings (e.g., whether the file is read-only, executable, etc.). Arbitrary attributes can be associated on advanced file systems, such as NTFS, XFS, ext2/ext3, some versions of UFS, and HFS+, using extended file attributes. This feature is implemented in the kernels of Linux, FreeBSD and Mac OS X operating systems, and allows metadata to be associated with the file at the file system level. This, for example, could be the author of a document, the character encoding of a plain-text document, or a checksum. Hierarchical file systems The hierarchical file system (not to be confused with Apple's HFS) was an early research interest of Dennis Ritchie of Unix fame; previous implementations were restricted to only a few levels, notably the IBM implementations, even of their early databases like IMS. After the success of Unix, Ritchie extended the file system concept to every object in his later operating system developments, such as Plan 9 and Inferno. Facilities Traditional file systems offer facilities to create, move and delete both files and directories. They lack facilities to create additional links to a directory (hard links in Unix), rename parent links (".." in Unix-like OS), and create bidirectional links to files. Traditional file systems also offer facilities to truncate, append to, create, move, delete and inplace modify files. They do not offer facilities to prepend to or truncate from the beginning of a file, let alone arbitrary insertion into or deletion from a file. The operations provided are highly asymmetric and lack the generality to be useful in unexpected contexts. For example, interprocess pipes in Unix have to be implemented outside of the file system because the pipes concept does not offer truncation from the beginning of files. Secure access Secure access to basic file system operations can be based on a scheme of access control lists or capabilities. Research has shown access control lists to be difficult to secure properly, which is why research operating systems tend to use capabilities. [citation needed] Commercial file systems still use access control lists.

Types of file systems File system types can be classified into disk file systems, network file systems and special purpose file systems. Disk file systems A disk file system is a file system designed for the storage of files on a data storage device, most commonly a disk drive, which might be directly or indirectly connected to the computer. Examples of disk file systems include FAT (FAT12, FAT16, FAT32, exFAT), NTFS, HFS and HFS+, HPFS, UFS, ext2, ext3, ext4, btrfs, ISO 9660, ODS-5, Veritas File System, VMFS ZFS, ReiserFS, Linux SWAP and UDF. Some disk file systems are journaling file systems or versioning file systems. ISO 9660 and Universal Disk Format are the two most common formats that target Compact Discs and DVDs. Mount Rainier is a newer extension to UDF supported by Linux 2.6 series and Windows Vista that facilitates rewriting to DVDs in the same fashion as has been possible with floppy disks. Flash file systems A flash file system is a file system designed for storing files on flash memory devices. These are becoming more prevalent as the number of mobile devices is increasing, and the capacity of flash memories increase. While a disk file system can be used on a flash device, this is suboptimal for several reasons:

Erasing blocks: Flash memory blocks have to be explicitly erased before they can be rewritten. The time taken to erase blocks can be significant, thus it is beneficial to erase unused blocks while the device is idle. Random access: Disk file systems are optimized to avoid disk seeks whenever possible, due to the high cost of seeking. Flash memory devices impose no seek latency. Wear levelling: Flash memory devices tend to wear out when a single block is repeatedly overwritten; flash file systems are designed to spread out writes evenly.

Log-structured file systems have many of the desirable properties for a flash file system. Such file systems include JFFS2 and YAFFS. Tape file systems A tape file system is a file system and tape format designed to store files on tape in a selfdescribing form. Magnetic tapes are sequential storage media, posing challenges to the creation and efficient management of a general-purpose file system. IBM has recently announced a new file system for tape called the Linear Tape File System. The IBM implementation of this file system has been released as the open-source IBM Long Term File System product.

Database file systems A recent concept for file management is the idea of a database-based file system. Instead of, or in addition to, hierarchical structured management, files are identified by their characteristics, like type of file, topic, author, or similar metadata. Transactional file systems Some programs need to update multiple files "all at once." For example, a software installation may write program binaries, libraries, and configuration files. If the software installation fails, the program may be unusable. If the installation is upgrading a key system utility, such as the command shell, the entire system may be left in an unusable state. Transaction processing introduces the isolation guarantee, which states that operations within a transaction are hidden from other threads on the system until the transaction commits, and that interfering operations on the system will be properly serialized with the transaction. Transactions also provide the atomicity guarantee, that operations inside of a transaction are either all committed, or the transaction can be aborted and the system discards all of its partial results. This means that if there is a crash or power failure, after recovery, the stored state will be consistent. Either the software will be completely installed or the failed installation will be completely rolled back, but an unusable partial install will not be left on the system. Windows, beginning with Vista, added transaction support to NTFS, abbreviated TxF. TxF is the only commercial implementation of a transactional file system, as transactional file systems are difficult to implement correctly in practice. There are a number of research prototypes of transactional file systems for UNIX systems, including the Valor file system,[1] Amino,[2] LFS,[3] and a transactional ext3 file system on the TxOS kernel,[4] as well as transactional file systems targeting embedded systems, such as TFFS.[5] Ensuring consistency across multiple file system operations is difficult, if not impossible, without file system transactions. File locking can be used as a concurrency control mechanism for individual files, but it typically does not protect the directory structure or file metadata. For instance, file locking cannot prevent TOCTTOU race conditions on symbolic links. File locking also cannot automatically roll back a failed operation, such as a software upgrade; this requires atomicity. Journaling file systems are one technique used to introduce transaction-level consistency to file system structures. Journal transactions are not exposed to programs as part of the OS API; they are only used internally to ensure consistency at the granularity of a single system call. Network file systems A network file system is a file system that acts as a client for a remote file access protocol, providing access to files on a server. Examples of network file systems include clients for the NFS, AFS, SMB protocols, and file-system-like clients for FTP and WebDAV.

Shared disk file systems A shared disk file system is one in which a number of machines (usually servers) all have access to the same external disk subsystem (usually a SAN). The file system arbitrates access to that subsystem, preventing write collisions. Examples include GFS from Red Hat, GPFS from IBM, and SFS from DataPlow. Special purpose file systems A special purpose file system is basically any file system that is not a disk file system or network file system. This includes systems where the files are arranged dynamically by software, intended for such purposes as communication between computer processes or temporary file space. Special purpose file systems are most commonly used by file-centric operating systems such as Unix. Examples include the procfs (/proc) file system used by some Unix variants, which grants access to information about processes and other operating system features. Deep space science exploration craft, like Voyager I and II used digital tape-based special file systems. Most modern space exploration craft like Cassini-Huygens used Real-time operating system file systems or RTOS influenced file systems. The Mars Rovers are one such example of an RTOS file system, important in this case because they are implemented in flash memory. File systems and operating systems Most operating systems provide a file system, as a file system is an integral part of any modern operating system. Early microcomputer operating systems' only real task was file management a fact reflected in their names (see DOS). Some early operating systems had a separate component for handling file systems which was called a disk operating system. On some microcomputers, the disk operating system was loaded separately from the rest of the operating system. On early operating systems, there was usually support for only one, native, unnamed file system; for example, CP/M supports only its own file system, which might be called "CP/M file system" if needed, but which didn't bear any official name at all. Because of this, there needs to be an interface provided by the operating system software between the user and the file system. This interface can be textual (such as provided by a command line interface, such as the Unix shell, or OpenVMS DCL) or graphical (such as provided by a graphical user interface, such as file browsers). If graphical, the metaphor of the folder, containing documents, other files, and nested folders is often used (see also: directory and folder). Flat file systems In a flat file system, there are no subdirectorieseverything is stored at the same (root) level on the media, be it a hard disk, floppy disk, etc. While simple, this system rapidly becomes

inefficient as the number of files grows, and makes it difficult for users to organize data into related groups. Like many small systems before it, the original Apple Macintosh featured a flat file system, called Macintosh File System. Its version of Mac OS was unusual in that the file management software (Macintosh Finder) created the illusion of a partially hierarchical filing system on top of EMFS. This structure meant that every file on a disk had to have a unique name, even if it appeared to be in a separate folder. MFS was quickly replaced with Hierarchical File System, which supported real directories. A recent addition to the flat file system family is Amazon's S3, a remote storage service, which is intentionally simplistic to allow users the ability to customize how their data is stored. The only constructs are buckets (imagine a disk drive of unlimited size) and objects (similar, but not identical to the standard concept of a file). Advanced file management is allowed by being able to use nearly any character (including '/') in the object's name, and the ability to select subsets of the bucket's content based on identical prefixes. File systems under Unix-like operating systems Unix-like operating systems create a virtual file system, which makes all the files on all the devices appear to exist in a single hierarchy. This means,in those systems, there is one root directory, and every file existing on the system is located under it somewhere. Unix-like systems can use a RAM disk or network shared resource as its root directory. Unix-like systems assign a device name to each device, but this is not how the files on that device are accessed. Instead, to gain access to files on another device, the operating system must first be informed where in the directory tree those files should appear. This process is called mounting a file system. For example, to access the files on a CD-ROM, one must tell the operating system "Take the file system from this CD-ROM and make it appear under such-andsuch directory". The directory given to the operating system is called the mount point it might, for example, be /media. The /media directory exists on many Unix systems (as specified in the Filesystem Hierarchy Standard) and is intended specifically for use as a mount point for removable media such as CDs, DVDs, USB drives or floppy disks. It may be empty, or it may contain subdirectories for mounting individual devices. Generally, only the administrator (i.e. root user) may authorize the mounting of file systems. Unix-like operating systems often include software and tools that assist in the mounting process and provide it new functionality. Some of these strategies have been coined "auto-mounting" as a reflection of their purpose.
1. In many situations, file systems other than the root need to be available as soon as the

operating system has booted. All Unix-like systems therefore provide a facility for mounting file systems at boot time. System administrators define these file systems in the configuration file fstab (vfstab in Solaris), which also indicates options and mount points.

2. In some situations, there is no need to mount certain file systems at boot time, although

their use may be desired thereafter. There are some utilities for Unix-like systems that allow the mounting of predefined file systems upon demand. 3. Removable media have become very common with microcomputer platforms. They allow programs and data to be transferred between machines without a physical connection. Common examples include USB flash drives, CD-ROMs, and DVDs. Utilities have therefore been developed to detect the presence and availability of a medium and then mount that medium without any user intervention. 4. Progressive Unix-like systems have also introduced a concept called supermounting; see, for example, the Linux supermount-ng project. For example, a floppy disk that has been supermounted can be physically removed from the system. Under normal circumstances, the disk should have been synchronized and then unmounted before its removal. Provided synchronization has occurred, a different disk can be inserted into the drive. The system automatically notices that the disk has changed and updates the mount point contents to reflect the new medium. Similar functionality is found on Windows machines. 5. An automounter will automatically mount a file system when a reference is made to the directory atop which it should be mounted. This is usually used for file systems on network servers, rather than relying on events such as the insertion of media, as would be appropriate for removable media. File systems under Linux Linux supports many different file systems, but common choices for the system disk include the ext* family (such as ext2, ext3 and ext4), XFS, JFS, ReiserFS and btrfs. File systems under Solaris The Sun Microsystems Solaris operating system in earlier releases defaulted to (non-journaled or non-logging) UFS for bootable and supplementary file systems. Solaris defaulted to, supported, and extended UFS. Support for other file systems and significant enhancements were added over time, including Veritas Software Corp. (Journaling) VxFS, Sun Microsystems (Clustering) QFS, Sun Microsystems (Journaling) UFS, and Sun Microsystems (open source, poolable, 128 bit compressible, and error-correcting) ZFS. Kernel extensions were added to Solaris to allow for bootable Veritas VxFS operation. Logging or Journaling was added to UFS in Sun's Solaris 7. Releases of Solaris 10, Solaris Express, OpenSolaris, and other open source variants of the Solaris operating system later supported bootable ZFS. Logical Volume Management allows for spanning a file system across multiple devices for the purpose of adding redundancy, capacity, and/or throughput. Legacy environments in Solaris may use Solaris Volume Manager (formerly known as Solstice DiskSuite.) Multiple operating systems (including Solaris) may use Veritas Volume Manager. Modern Solaris based operating

systems eclipse the need for Volume Management through leveraging virtual storage pools in ZFS. File systems under Mac OS X Mac OS X uses a file system that it inherited from classic Mac OS called HFS Plus, sometimes called Mac OS Extended. HFS Plus is a metadata-rich and case preserving file system. Due to the Unix roots of Mac OS X, Unix permissions were added to HFS Plus. Later versions of HFS Plus added journaling to prevent corruption of the file system structure and introduced a number of optimizations to the allocation algorithms in an attempt to defragment files automatically without requiring an external defragmenter. Filenames can be up to 255 characters. HFS Plus uses Unicode to store filenames. On Mac OS X, the filetype can come from the type code, stored in file's metadata, or the filename. HFS Plus has three kinds of links: Unix-style hard links, Unix-style symbolic links and aliases. Aliases are designed to maintain a link to their original file even if they are moved or renamed; they are not interpreted by the file system itself, but by the File Manager code in userland. Mac OS X also supports the UFS file system, derived from the BSD Unix Fast File System via NeXTSTEP. However, as of Mac OS X 10.5 (Leopard), Mac OS X can no longer be installed on a UFS volume, nor can a pre-Leopard system installed on a UFS volume be upgraded to Leopard.[6] Newer versions Mac OS X are capable of reading and writing to the legacy FAT file systems(16 & 32). They are capable of reading, but not writing to the NTFS file system. Third party software is still necessary to write to the NTFS file system under Snow Leopard 10.6.4. File systems under Plan 9 from Bell Labs Plan 9 from Bell Labs was originally designed to extend some of Unix's good points, and to introduce some new ideas of its own while fixing the shortcomings of Unix. With respect to file systems, the Unix system of treating things as files was continued, but in Plan 9, everything is treated as a file, and accessed as a file would be (i.e., no ioctl or mmap). Perhaps surprisingly, while the file interface is made universal it is also simplified considerably: symlinks, hard links and suid are made obsolete, and an atomic create/open operation is introduced. More importantly the set of file operations becomes well defined and subversions of this like ioctl are eliminated. Secondly, the underlying 9P protocol was used to remove the difference between local and remote files (except for a possible difference in latency or in throughput). This has the advantage that a device or devices, represented by files, on a remote computer could be used as though it were the local computer's own device(s). This means that under Plan 9, multiple file servers provide access to devices, classing them as file systems. Servers for "synthetic" file systems can

also run in user space bringing many of the advantages of micro kernel systems while maintaining the simplicity of the system. Everything on a Plan 9 system has an abstraction as a file; networking, graphics, debugging, authentication, capabilities, encryption, and other services are accessed via I-O operations on file descriptors. For example, this allows the use of the IP stack of a gateway machine without need of NAT, or provides a network-transparent window system without the need of any extra code. Another example: a Plan-9 application receives FTP service by opening an FTP site. The ftpfs server handles the open by essentially mounting the remote FTP site as part of the local file system. With ftpfs as an intermediary, the application can now use the usual file-system operations to access the FTP site as if it were part of the local file system. A further example is the mail system which uses file servers that synthesize virtual files and directories to represent a user mailbox as /mail/fs/mbox. The wikifs provides a file system interface to a wiki. These file systems are organized with the help of private, per-process namespaces, allowing each process to have a different view of the many file systems that provide resources in a distributed system. The Inferno operating system shares these concepts with Plan 9. File systems under Microsoft Windows

Directory listing in a Windows command shell Windows makes use of the FAT and NTFS file systems. Unlike many other operating systems, Windows uses a drive letter abstraction at the user level to distinguish one disk or partition from another. For example, the path C:\WINDOWS represents a directory WINDOWS on the partition represented by the letter C. The C drive is most commonly used for the primary hard disk partition, on which Windows is usually installed and from which it boots. This "tradition" has become so firmly ingrained that bugs came about in older applications which made assumptions that the drive that the operating system was installed on was C. The tradition of using "C" for the drive letter can be traced to MS-DOS, where the letters

A and B were reserved for up to two floppy disk drives. This in turn derived from CP/M in the 1970s, and ultimately from IBM's CP/CMS of 1967. Network drives may also be mapped to drive letters. FAT The File Allocation Table (FAT) filing system, supported by all versions of Microsoft Windows, was an evolution of that used in Microsoft's earlier operating system (MS-DOS which in turn was based on 86-DOS). FAT ultimately traces its roots back to the short-lived M-DOS project and Standalone disk BASIC before it. Over the years various features have been added to it, inspired by similar features found on file systems used by operating systems such as Unix. Older versions of the FAT file system (FAT12 and FAT16) had file name length limits, a limit on the number of entries in the root directory of the file system and had restrictions on the maximum size of FAT-formatted disks or partitions. Specifically, FAT12 and FAT16 had a limit of 8 characters for the file name, and 3 characters for the extension (such as .exe). This is commonly referred to as the 8.3 filename limit. VFAT, which was an extension to FAT12 and FAT16 introduced in Windows NT 3.5 and subsequently included in Windows 95, allowed long file names (LFN). FAT32 also addressed many of the limits in FAT12 and FAT16, but remains limited compared to NTFS. exFAT (also known as FAT64) is the newest iteration of FAT, with certain advantages over NTFS with regards to file system overhead. exFAT is only compatible with newer Windows systems, such as Windows 2003, Windows Vista, Windows 2008, Windows 7 and more recently, support has been added for WinXP.[7] NTFS NTFS, introduced with the Windows NT operating system, allowed ACL-based permission control. Hard links, multiple file streams, attribute indexing, quota tracking, sparse files, encryption, compression, reparse points (directories working as mount-points for other file systems, symlinks, junctions, remote storage links) are also supported, though not all these features are well-documented.[citation needed] File Protection Umask Settings The umask command can be used to determine the default file creation mode on your system. It is the octal complement of the desired file mode. If files are created without any regard to their permissions settings, the user could inadvertently give read or write permission to someone that should not have this permission. Typical umask settings include 022, 027, and 077 (which is the most restrictive). Normally the umask is set in /etc/profile, so it applies to all users on the system.

The resulting permission is calculated as follows: The default permission of user/group/others (7 for directories, 6 for files) is combined with the inverted mask (NOT) using AND on a per-bitbasis. Example 1: file, default 6, binary: 110 mask, eg. 2: 010, NOT: 101 resulting permission, AND: 100 (equals 4, r__) Example 2: file, default 6, binary: 110 mask, eg. 6: 110, NOT: 001 resulting permission, AND: 000 (equals 0, ___) Example 3: directory, default 7, binary: 111 mask, eg. 2: 010, NOT: 101 resulting permission, AND: 101 (equals 5, r_x) Example 4: directory, default 7, binary: 111 mask, eg. 6: 110, NOT: 001 resulting permission, AND: 001 (equals 1, __x) # Set the user's default umask umask 033 Be sure to make root's umask 077, which will disable read, write, and execute permission for other users, unless explicitly changed using chmod. In this case, newly-created directories would have 744 permissions, obtained by subtracting 033 from 777. Newly-created files using the 033 umask would have permissions of 644. If you are using Red Hat, and adhere to their user and group ID creation scheme (User Private Groups), it is only necessary to use 002 for a umask. This is due to the fact that the default configuration is one user per group. 5.2. File Permissions It's important to ensure that your system files are not open for casual editing by users and groups who shouldn't be doing such system maintenance.

Unix separates access control on files and directories according to three characteristics: owner, group, and other. There is always exactly one owner, any number of members of the group, and everyone else. A quick explanation of Unix permissions: Ownership - Which user(s) and group(s) retain(s) control of the permission settings of the node and parent of the node Permissions - Bits capable of being set or reset to allow certain types of access to it. Permissions for directories may have a different meaning than the same set of permissions on files. Read:

To be able to view contents of a file To be able to read a directory

Write:

To be able to add to or change a file To be able to delete or move files in a directory

Execute:

To be able to run a binary program or shell script To be able to search in a directory, combined with read permission

Save Text Attribute: (For directories) The "sticky bit" also has a different meaning when applied to directories than when applied to files. If the sticky bit is set on a directory, then a user may only delete files that the he owns or for which he has explicit write permission granted, even when he has write access to the directory. This is designed for directories like /tmp, which are worldwritable, but where it may not be desirable to allow any user to delete files at will. The sticky bit is seen as a t in a long directory listing. SUID Attribute: (For Files) This describes set-user-id permissions on the file. When the set user ID access mode is set in the owner permissions, and the file is executable, processes which run it are granted access to system resources based on user who owns the file, as opposed to the user who created the process. This is the cause of many "buffer overflow" exploits. SGID Attribute: (For Files)

If set in the group permissions, this bit controls the "set group id" status of a file. This behaves the same way as SUID, except the group is affected instead. The file must be executable for this to have any effect. SGID Attribute: (For directories) If you set the SGID bit on a directory (with chmod g+s directory), files created in that directory will have their group set to the directory's group. You - The owner of the file Group - The group you belong to Everyone - Anyone on the system that is not the owner or a member of the group File Example: -rw-r--r-- 1 kevin users 114 Aug 28 1997 .zlogin 1st bit - directory? (no) 2nd bit - read by owner? (yes, by kevin) 3rd bit - write by owner? (yes, by kevin) 4th bit - execute by owner? (no) 5th bit - read by group? (yes, by users) 6th bit - write by group? (no) 7th bit - execute by group? (no) 8th bit - read by everyone? (yes, by everyone) 9th bit - write by everyone? (no) 10th bit - execute by everyone? (no) The following lines are examples of the minimum sets of permissions that are required to perform the access described. You may want to give more permission than what's listed here, but this should describe what these minimum permissions on files do: -r-------- Allow read access to the file by owner --w------- Allows the owner to modify or delete the file (Note that anyone with write permission to the directory the file is in can overwrite it and thus delete it) ---x------ The owner can execute this program, but not shell scripts, which still need read permission ---s------ Will execute with effective User ID = to owner --------s- Will execute with effective Group ID = to group -rw------T No update of "last modified time". Usually used for swap files ---t------ No effect. (formerly sticky bit)

Directory Example: drwxr-xr-x 3 kevin users 512 Sep 19 13:47 .public_html/ 1st bit - directory? (yes, it contains many files) 2nd bit - read by owner? (yes, by kevin) 3rd bit - write by owner? (yes, by kevin) 4th bit - execute by owner? (yes, by kevin) 5th bit - read by group? (yes, by users 6th bit - write by group? (no) 7th bit - execute by group? (yes, by users) 8th bit - read by everyone? (yes, by everyone) 9th bit - write by everyone? (no) 10th bit - execute by everyone? (yes, by everyone) The following lines are examples of the minimum sets of permissions that are required to perform the access described. You may want to give more permission than what's listed, but this should describe what these minimum permissions on directories do: dr-------- The contents can be listed, but file attributes can't be read d--x------ The directory can be entered, and used in full execution paths dr-x------ File attributes can be read by owner d-wx------ Files can be created/deleted, even if the directory isn't the current one d------x-t Prevents files from deletion by others with write access. Used on /tmp d---s--s-- No effect System configuration files (usually in /etc) are usually mode 640 (-rw-r-----), and owned by root. Depending on your site's security requirements, you might adjust this. Never leave any system files writable by a group or everyone. Some configuration files, including /etc/shadow, should only be readable by root, and directories in /etc should at least not be accessible by others. SUID Shell Scripts SUID shell scripts are a serious security risk, and for this reason the kernel will not honor them. Regardless of how secure you think the shell script is, it can be exploited to give the cracker a root shell. 5.3. Integrity Checking Another very good way to detect local (and also network) attacks on your system is to run an integrity checker like Tripwire, Aide or Osiris. These integrety checkers run a number of checksums on all your important binaries and config files and compares them against a database of former, known-good values as a reference. Thus, any changes in the files will be flagged.

It's a good idea to install these sorts of programs onto a floppy, and then physically set the write protect on the floppy. This way intruders can't tamper with the integrety checker itself or change the database. Once you have something like this setup, it's a good idea to run it as part of your normal security administration duties to see if anything has changed. You can even add a crontab entry to run the checker from your floppy every night and mail you the results in the morning. Something like: # set mailto MAILTO=kevin # run Tripwire 15 05 * * * root /usr/local/adm/tcheck/tripwire will mail you a report each morning at 5:15am. Integrity checkers can be a godsend to detecting intruders before you would otherwise notice them. Since a lot of files change on the average system, you have to be careful what is cracker activity and what is your own doing.

CHAPTER 6 INPUT/OUTPUT PRINCIPLES

Input/output, or I/O, refers to the communication between an information processing system (such as a computer), and the outside world, possibly a human, or another information processing system. Inputs are the signals or data received by the system, and outputs are the signals or data sent from it. The term can also be used as part of an action; to "perform I/O" is to perform an input or output operation. I/O devices are used by a person (or other system) to communicate with a computer. For instance, a keyboard or a mouse may be an input device for a computer, while monitors and printers are considered output devices for a computer. Devices for communication between computers, such as modems and network cards, typically serve for both input and output. Note that the designation of a device as either input or output depends on the perspective. Mouse and keyboards take as input physical movement that the human user outputs and convert it into signals that a computer can understand. The output from these devices is input for the computer. Similarly, printers and monitors take as input signals that a computer outputs. They then convert these signals into representations that human users can see or read. For a human user the process of reading or seeing these representations is receiving input. These interactions between computers and humans is studied in a field called humancomputer interaction. In computer architecture, the combination of the CPU and main memory (i.e. memory that the CPU can read and write to directly, with individual instructions) is considered the brain of a computer, and from that point of view any transfer of information from or to that combination, for example to or from a disk drive, is considered I/O. The CPU and its supporting circuitry provide memory-mapped I/O that is used in low-level computer programming in the implementation of device drivers. An I/O algorithm is one designed to exploit locality and perform efficiently when data reside on secondary storage, such as a disk drive. Interface I/O Interface is required whenever the I/O device is driven by the processor. The interface must have necessary logic to interpret the device address generated by the processor. Handshaking should be implemented by the interface using appropriate commands like (BUSY,READY,WAIT), and the processor can communicate with I/O device through the interface. If different data formats are being exchanged, the interface must be able to convert serial data to parallel form and vice-versa. There must be provision for generating interrupts and the corresponding type numbers for further processing by the processor if required

A computer that uses memory-mapped I/O accesses hardware by reading and writing to specific memory locations, using the same assembler language instructions that computer would normally use to access memory. Addressing mode There are many ways through which data can be read or stored in the memory. Each method is an addressing mode, and has its own advantages and limitations. There are many type of addressing modes such as direct addressing, indirect addressing, immediate addressing, index addressing, based addressing, based-index addressing, implied addressing, etc. Direct address In this type of address of the data is a part of the instructions itself. When the processor decodes the instruction, it gets the memory address from where it can be read/store the required information. Mov Reg. [Addr] Here the Addr operand points to a memory location which holds the data and copies it into the specified Register. Indirect address Here the address can be stored in a register. The instructions will have the register which has the address. So to fetch the data, the instruction must be decoded appropriate register selected. The contents of the register will be treated as the address using this address appropriate memory location is selected and data is read/written. Drive Controllers: The disk controller (or "hard disk controller") is the circuit which allows the CPU to communicate with a hard disk, floppy disk or other kind of disk drive. Early disk controllers were identified by their storage methods and data encoding. They were typically implemented on a separate controller card. Modified frequency modulation (MFM) controllers were the most common type in small computers, used for both floppy disk and hard disk drives. Run length limited (RLL) controllers used data compression to increase storage capacity by about 50%. Priam created a proprietary storage algorithm that could double the disk storage. Shugart Associates Systems Interface (SASI) was a predecessor to SCSI. Modern disk controllers are integrated into the disk drive. For example, disks called "SCSI disks" have built-in SCSI controllers. In the past, before most SCSI controller functionality was implemented in a single chip, separate SCSI controllers interfaced disks to the SCSI bus.

The most common types of interfaces provided nowadays by disk controllers are PATA (IDE) and Serial ATA for home use. High-end disks use SCSI, Fibre Channel or Serial Attached SCSI. Disk controller versus host adapter The correct term for the components that allows a computer to talk to a peripheral bus is host adapter or host bus adapter (HBA). On the other hand, a disk controller allows a disk to talk to the same bus. Those two are often confused, especially in the PC world. In fact signals read by a disk read-and-write head are converted by a disk controller, then transmitted over the peripheral bus, then converted again by the host adapter into the suitable format for the motherboard's bus, and then read by the CPU. Sometimes there may be yet another controller between a host adapter and a disk controller - a disk array controller that allows hardware RAID to be formed. Sometimes it may be even physically integrated with an HBA, but it performs different functions. Direct Memory Access Direct memory access (DMA) is a feature of modern computers and microprocessors that allows certain hardware subsystems within the computer to access system memory for reading and/or writing independently of the central processing unit. Many hardware systems use DMA including disk drive controllers, graphics cards, network cards and sound cards. DMA is also used for intra-chip data transfer in multi-core processors, especially in multiprocessor system-onchips, where its processing element is equipped with a local memory (often called scratchpad memory) and DMA is used for transferring data between the local memory and the main memory. Computers that have DMA channels can transfer data to and from devices with much less CPU overhead than computers without a DMA channel. Similarly a processing element inside a multi-core processor can transfer data to and from its local memory without occupying its processor time and allowing computation and data transfer concurrency. Without DMA, using programmed input/output (PIO) mode for communication with peripheral devices, or load/store instructions in the case of multicore chips, the CPU is typically fully occupied for the entire duration of the read or write operation, and is thus unavailable to perform other work. With DMA, the CPU would initiate the transfer, do other operations while the transfer is in progress, and receive an interrupt from the DMA controller once the operation has been done. This is especially useful in real-time computing applications where not stalling behind concurrent operations is critical. Another and related application area is various forms of stream processing where it is essential to have data processing and transfer in parallel, in order to achieve sufficient throughput. Principle DMA is an essential feature of all modern computers, as it allows devices to transfer data without subjecting the CPU to a heavy overhead. Otherwise, the CPU would have to copy each piece of data from the source to the destination, making itself unavailable for other tasks. This situation is aggravated because access to I/O devices over a peripheral bus is generally slower

than normal system RAM. With DMA, the CPU gets freed from this overhead and can do useful tasks during data transfer (though the CPU bus would be partly blocked by DMA). In the same way, a DMA engine in an embedded processor allows its processing element to issue a data transfer and carries on its own task while the data transfer is being performed. A DMA transfer copies a block of memory from one device to another. While the CPU initiates the transfer by issuing a DMA command, it does not execute it. For so-called "third party" DMA, as is normally used with the ISA bus, the transfer is performed by a DMA controller which is typically part of the motherboard chipset. More advanced bus designs such as PCI typically use bus mastering DMA, where the device takes control of the bus and performs the transfer itself. In an embedded processor or multiprocessor system-on-chip, it is a DMA engine connected to the on-chip bus that actually administers the transfer of the data, in coordination with the flow control mechanisms of the on-chip bus. A typical usage of DMA is copying a block of memory from system RAM to a buffer on the device or vice versa. Such an operation usually does not stall the processor, which as a result can be scheduled to perform other tasks unless those tasks include a read from or write to memory. DMA is essential to high performance embedded systems. It is also essential in providing socalled zero-copy implementations of peripheral device drivers as well as functionalities such as network packet routing, audio playback and streaming video. Multicore embedded processors (in the form of multiprocessor system-on-chip) often use one or more DMA engines in combination with scratchpad memories for both increased efficiency and lower power consumption. In computer clusters for high-performance computing, DMA among multiple computing nodes is often used under the name of remote DMA. There are two control signal used to request and acknowledge a DMA transfer in microprocess-based system.The HOLD pin is used to request a DMA action and the HLDA pin is an output acknowledges the DMA action. Interrupt Handler An interrupt handler, also known as an interrupt service routine (ISR), is a callback subroutine in an operating system or device driver whose execution is triggered by the reception of an interrupt. Interrupt handlers have a multitude of functions, which vary based on the reason the interrupt was generated and the speed at which the Interrupt Handler completes its task. An interrupt handler is a low-level counterpart of event handlers. These handlers are initiated by either hardware interrupts or interrupt instructions in software, and are used for servicing hardware devices and transitions between protected modes of operation such as system calls. Overview In several operating systems - Linux, Microsoft Windows, and some other operating systems in the past, interrupt handlers are divided into two parts: the First-Level Interrupt Handler (FLIH) and the Second-Level Interrupt Handlers (SLIH). FLIHs are also known as hard interrupt handlers or fast interrupt handlers, and SLIHs are also known as slow/soft interrupt handlers, Deferred Procedure Call.

A FLIH implements at minimum platform-specific interrupt handling similarly to interrupt routines. In response to an interrupt, there is a context switch, and the code for the interrupt is loaded and executed. The job of a FLIH is to quickly service the interrupt, or to record platformspecific critical information which is only available at the time of the interrupt, and schedule the execution of a SLIH for further long-lived interrupt handling. FLIHs cause jitter in process execution. FLIHs also mask interrupts. Reducing the jitter is most important for real-time operating systems, since they must maintain a guarantee that execution of specific code will complete within an agreed amount of time. To reduce jitter and to reduce the potential for losing data from masked interrupts, programmers attempt to minimize the execution time of a FLIH, moving as much as possible to the SLIH. With the speed of modern computers, FLIHs may implement all device and platform-dependent handling, and use a SLIH for further platform-independent long-lived handling. FLIHs which service hardware typically mask their associated interrupt (or keep it masked as the case may be) until they complete their execution. An (unusual) FLIH which unmasks its associated interrupt before it completes is called a reentrant interrupt handler. Reentrant interrupt handlers might cause a stack overflow from multiple preemptions by the same interrupt vector, and so they are usually avoided. In a priority interrupt system, the FLIH also (briefly) masks other interrupts of equal or lesser priority. A SLIH completes long interrupt processing tasks similarly to a process. SLIHs either have a dedicated kernel thread for each handler, or are executed by a pool of kernel worker threads. These threads sit on a run queue in the operating system until processor time is available for them to perform processing for the interrupt. SLIHs may have a long-lived execution time, and thus are typically scheduled similarly to threads and processes. In Linux, FLIHs are called upper half, and SLIHs - lower half or bottom half. This is different from naming used in other Unix-like systems, where both are a part of bottom half. Interrupt threads Several operating systems - Solaris, NetBSD, Mac OS X, WinCE and FreeBSD, for example use different scheme known as interrupt threads. An interrupt handler provided by the device driver is just a high-priority thread which runs with interrupts enabled and, more importantly, may block on mutex. This greatly simplifies locking in the kernel. Also, interrupt thread may be preempted by higher-priority interrupt thread. Input/Output Devices Chapter 5 of Tanenbaum. Have both hardware and software. Want to hide the details from the programmer (user). Ideally have the same interface to all devices (device independence). Think of UNIX. Process doesn't need to distinguish between input coming from terminal, the network, a _le or another process. I/O Devices Two principal types:

1. block devices: stores information in _xed size blocks (128 to 1024 bytes). Example: disks. _ can read or write blocks independently _ related is ability to seek Tape drive: can seek and read, but probably cannot write at arbitrary place. 2. character devices: stream of characters (independent of blocks). Examples: terminals, line printers, network drivers _ not addressable _ no seek operation Clocks fall outside of this categorization. Device Controllers Operating systems do not deal with devices directly. Rather there is a mechanical and electronic portion. Electronic portion is a device controller. A printed circuit card. Look at Silbershatz Fig 13.1. Also a network interface card. Example: CRT controller takes care of details of rescanning the CRT beam How do controllers and the CPU communicate? Use memory-mapped I/O _ CPU puts command in registers (I/O address space) (Silbershatz Fig. 13.2) example commands: read, write, seek. Also parameters. _ CPU goes o_ and does other things. _ When device done, the controller causes an interrupt. CPU reads any results from the controller's registers. Polled I/O No interrupts used. CPU puts request in controller's registers and then polls waiting for the request to _nish. Also called programmed I/O. Might be used for debugging in operating system interrupt handler because it doesn't block. Direct Memory Access (DMA) What about large amounts of data to transfer? For example, disk read. Controller gets the data and bu_ers it. CPU could only get data in small chunks (byte or word at a time). Look at Fig. 5.4. Idea is to free CPU from being involved with transfer of data. CPU also gives: _ memory address (physical, so the page must be tied down) _ count Principles of I/O Software Issues of the software: _ e_ciency|I/O operations are very slow compared to main memory and the CPU. Want to avoid them being a bottleneck on the system. _ device independence|programs do not need to know about input or output device. _ uniform naming|same naming convention regardless of the device. Large systems typically have multiple disk drives, but user sees a common _le space on top. Extend to a distributed _le system. _ error handling|as close to device as possible, if controller detects an error it ideally corrects it or rereads (use checksum to detect) _ synchronous vs. asynchronous|physical I/O is asynchronous (interrupt-driven).

However want to make it appear synchronous (blocking) to the user. Can also use polled I/O where the device driver continually polls the device (may be used by kernel to not wait in debugging a device driver). _ bu_ering|may have to bu_er data read from a device (such as a packet from network). However, may end up and copy bu_er multiple times. _ sharable vs. dedicated|_le is shared, printer cannot be shared. Operating System must handle both kinds of devices. I/O Software Layers Correspond to layers in Fig 5-16. Interrupt Handlers Bowels of the system. Occurs when a device controller wants to tell CPU something (clock tick, write done, ready to read more, etc) What happens on interrupt: _ Interrupt handler gets invoked. _ Could send a message to blocked device driver process. _ On other systems a semaphore is signalled Device Drivers Contains device-dependent code. May have a device driver for a class of related devices (terminal driver for example). It knows details about device. device independent request -> device driver -> device dependent request It may queue up the request if device is already busy. Will send that request when the current request is complete. Puts data in registers and retrieves results as needed. Reports results to device-independent software. Device-Independent I/O Software Two major functions: _ perform I/O functions common to all devices _ provide a uniform interface to the user-level software _ naming|read/write to terminal using /dev/tty _ protection|look at _le protection (user, group, world) _ bu_ering|read one byte from disk. Will actually read a block of data and pass one byte to program. Also a downside as shown in Fig 5-15. User-Space I/O Software For example, printf() is a library routine that calls write(). Also user-level software to support other operations. Spooling for example where a daemon process controls access to a spooling directory. When you print, the _le is put in the directory and when your turn comes it is printed. Also used for _le transfer. UUCP networks use this approach.

CHAPTER 7 DISK
A RAM disk or RAM drive is a block of RAM (primary storage or volatile memory) that a computer's software is treating as if the memory were a disk drive (secondary storage). It is sometimes referred to as a "virtual RAM drive" or "software RAM drive" to distinguish it from a "hardware RAM drive" that uses separate hardware containing RAM, which is a type of solidstate drive. Performance The performance of a RAM disk is in general, orders of magnitude faster than other forms of storage media, such as a hard drive, tape drive or optical drive. This performance gain is due to multiple factors, including access time, maximum throughput and file system, as well as others. First, the access time of files are greatly decreased, since a RAM disk is solid state (no mechanical parts). A physical hard drive or optical media, such as cd-rom, dvd, and blu-ray, must move a head or optical eye into position, and tape drives must wind or rewind to a particular position on the media before reading or writing can occur. RAM disks can access data with only the memory address of a given file, with no movement, alignment or positioning necessary. Second, the maximum throughput of a RAM disk is limited by the speed of the RAM, the data bus, and the CPU of the computer. Other forms of storage media are further limited by the speed of the storage bus, such as IDE (PATA), SATA, USB, Serial or LPT (Parallel). Compounding this limitation is the speed of the actual mechanics of the drive motors, heads and/or eyes. Third, the file system in use, such as FAT, NTFS, USBFS, ext2, etc, uses extra accesses, reads and writes to the drive, which although small, can add up quickly, especially in the event of many small files vs. few larger files (temporary internet folders, web caches, etc). Because the storage is in RAM, it is volatile memory, which means it will be lost in the event of power loss, whether intentional (computer reboot or shutdown) or accidental (power failure). This is sometimes desirable: for example, when working with a decrypted copy of an encrypted file. In many cases, the data stored on the RAM disk is created, for faster access, from data permanently stored elsewhere, and is re-created on the RAM disk when the system reboots. Implementation Software RAM disks use the normal RAM in main memory as if it were a partition on a hard drive rather than actually accessing the data bus normally used for secondary storage. Though

RAM disks can often be supported directly from the operating system via special mechanisms in the operating system kernel, it is possible to also create and manage a RAM disk by way of a user space application process.[1] Usually no battery backup is needed due to the temporary nature of the information stored in the RAM disk, but an uninterruptible power supply can keep the entire system running during a power outage, if necessary. Some RAM disks use a compressed filesystem such as cramfs to allow compressed data to be accessed on the fly, without uncompressing it first. This is convenient because RAM disks are often small due to the higher price per byte than conventional hard drive storage. Distributed Processors Distributed computing is a field of computer science that studies distributed systems. A distributed system consists of multiple autonomous computers that communicate through a computer network. The computers interact with each other in order to achieve a common goal. A computer program that runs in a distributed system is called a distributed program, and distributed programming is the process of writing such programs.[1] Distributed computing also refers to the use of distributed systems to solve computational problems. In distributed computing, a problem is divided into many tasks, each of which is solved by one computer.[2] Introduction The word distributed in terms such as "distributed system", "distributed programming", and "distributed algorithm" originally referred to computer networks where individual computers were physically distributed within some geographical area.[3] The terms are nowadays used in a much wider sense, even referring to autonomous processes that run on the same physical computer and interact with each other by message passing.[4] While there is no single definition of a distributed system,[5] the following defining properties are commonly used:

There are several autonomous computational entities, each of which has its own local memory.[6] The entities communicate with each other by message passing.[7]

In this article, the computational entities are called computers or nodes. A distributed system may have a common goal, such as solving a large computational problem.[8] Alternatively, each computer may have its own user with individual needs, and the purpose of the distributed system is to coordinate the use of shared resources or provide communication services to the users.[9] Other typical properties of distributed systems include the following:

The system has to tolerate failures in individual computers.[10]

The structure of the system (network topology, network latency, number of computers) is not known in advance, the system may consist of different kinds of computers and network links, and the system may change during the execution of a distributed program.
[11]

Each computer has only a limited, incomplete view of the system. Each computer may know only one part of the input.[12]

(a)(b) A distributed system. (c) A parallel system. Parallel or distributed computing? The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them.[13] The same system may be characterised both as "parallel" and "distributed"; the processors in a typical distributed system run concurrently in parallel.[14] Parallel computing may be seen as a particular tightly-coupled form of distributed computing,[15] and distributed computing may be seen as a loosely-coupled form of parallel computing.[5] Nevertheless, it is possible to roughly classify concurrent systems as "parallel" or "distributed" using the following criteria:

In parallel computing, all processors have access to a shared memory. Shared memory can be used to exchange information between processors.[16] In distributed computing, each processor has its own private memory (distributed memory). Information is exchanged by passing messages between the processors.[17]

The figure on the right illustrates the difference between distributed and parallel systems. Figure (a) is a schematic view of a typical distributed system; as usual, the system is represented as a graph in which each node (vertex) is a computer and each edge (line between two nodes) is a communication link. Figure (b) shows the same distributed system in more detail: each computer has its own local memory, and information can be exchanged only by passing messages from one node to another by using the available communication links. Figure (c) shows a parallel system in which each processor has a direct access to a shared memory. The situation is further complicated by the traditional uses of the terms parallel and distributed algorithm that do not quite match the above definitions of parallel and distributed systems; see the section Theoretical foundations below for more detailed discussion. Nevertheless, as a rule of thumb, high-performance parallel computation in a shared-memory multiprocessor uses parallel algorithms while the coordination of a large-scale distributed system uses distributed algorithms. Thread thread of execution is the smallest unit of processing that can be scheduled by an operating system. It generally results from a fork of a computer program into two or more concurrently running tasks. The implementation of threads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources. In particular, the threads of a process share the latter's instructions (its code) and its context (the values that its variables reference at any given moment). To give an analogy, multiple threads in a process are like multiple cooks reading off the same cook book and following its instructions, not necessarily from the same page. On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking): the processor switches between different threads. This context switching generally happens frequently enough that the user perceives the threads or tasks as running at the same time. On a multiprocessor or multi-core system, the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task. Many modern operating systems directly support both time-sliced and multiprocessor threading with a process scheduler. The kernel of an operating system allows programmers to manipulate threads via the system call interface. Some implementations are called a kernel thread, whereas a lightweight process (LWP) is a specific type of kernel thread that shares the same state and information. Programs can have user-space threads when threading with timers, signals, or other methods to interrupt their own execution, performing a sort of ad-hoc time-slicing.

Threads compared with processes Threads differ from traditional multitasking operating system processes in that:

processes are typically independent, while threads exist as subsets of a process processes carry considerable state information, whereas multiple threads within a process share state as well as memory and other resources processes have separate address spaces, whereas threads share their address space processes interact only through system-provided inter-process communication mechanisms. Context switching between threads in the same process is typically faster than context switching between processes.

Systems like Windows NT and OS/2 are said to have "cheap" threads and "expensive" processes; in other operating systems there is not so great a difference except the cost of address space switch which implies a TLB flush.

CHAPTER 8 DISTRIBUTED FILE SYSTEM


A distributed file system or network file system is any file system that allows access to files from multiple hosts sharing via a computer network.[1] This makes it possible for multiple users on multiple machines to share files and storage resources. The client nodes do not have direct access to the underlying block storage but interact over the network using a protocol. This makes it possible to restrict access to the file system depending on access lists or capabilities on both the servers and the clients, depending on how the protocol is designed. In contrast, in a shared disk file system all nodes have equal access to the block storage where the file system is located. On these systems the access control must reside on the client. Distributed file systems may include facilities for transparent replication and fault tolerance. That is, when a limited number of nodes in a file system go offline, the system continues to work without any data loss. The difference between a distributed file system and a distributed data store can be vague, but DFSes are generally geared towards use on local area networks. History and examples The first file servers were developed in the 1970s. In 1976 Digital Equipment Corporation created the File Access Listener (FAL), an implementation of the Data Access Protocol as part of DECnet Phase II which became the first widely used network file system. In 1985 Sun Microsystems created the file system called "Network File System" (NFS) which became the first widely used Internet Protocol based network file system. Other notable network file systems are Andrew File System (AFS), Apple Filing Protocol (AFP), NetWare Core Protocol (NCP), and Server Message Block (SMB) which is also known as Common Internet File System (CIFS). Transparency Transparency is usually built into distributed file systems, so that files accessed over the network can be treated the same as files on local disk by programs and users. The multiplicity and dispersion of servers and storage devices are thus made invisible. It is up to the network file system to locate the files and to arrange for the transport of the data.

Performance A common performance measurement of a network file system is the amount of time needed to satisfy service requests. In conventional systems, this time consists of a disk-access time and a small amount of CPU-processing time. But in a network file system, a remote access has additional overhead due to the distributed structure. This includes the time to deliver the request to a server, the time to deliver the response to the client, and for each direction, a CPU overhead of running the communication protocol software. The performance of a network file system can be viewed as one dimension of its transparency; to be fully equivalent, it would need to be comparable to that of a local disk. Concurrent file updates Concurrency control becomes an issue when more than one person or client are accessing the same files and want to update it. Hence updates to the file from one client should not interfere with access and updates from other clients. Concurrency control or locking may be either built into the file system or be provided by an add-on protocol. CHAPTER Performance General DESIGNING THE PERFORMANCE MEASUREMENT SYSTEM A number of suggestions have been offered by various experts on the subject of designing performance measurement systems. Below is a list of suggestions derived from a number of these experts. Some of these apply to all measures and some apply to a limited number of a firm's measures. A firm's performance measures should:

Be simple and easy to use. Have a clear purpose. Provide fast feedback. Cover all the appropriate elements (internal, external, financial and nonfinancial). Relate to performance improvement, not just monitoring. Reinforce the firm's strategy. Relate to both long-term and short-term objectives of the organization. Match the firm's organization culture. Not conflict with one another.

Be integrated both horizontally and vertically in the corporate structure. Be consistent with the firm's existing recognition and reward system. Focus on what is important to customers. Focus on what the competition is doing. Lead to identification and elimination of waste. Help accelerate organizational learning. Help build a consensus for change when customer expectations shift or strategies and priorities call for the organization to behave differently. Evaluate groups not individuals for performance to schedule. Establish specific numeric standards for most goals. Be available for constant review.

Other recommendations for organizations that are developing performance measures include: 1. Data collection and methods of calculating the performance measure must be clearly defined. 2. Objective performance criteria are preferable to subjective ones. 3. Recognize that measures may vary between locations; avoid a "one size fits all" mentality. Wisner and Fawcett provide a nine-step process for developing a performance measurement system: 1. Clearly define the firm's mission statement. 2. Identify the firm's strategic objectives using the mission statement as a guide (profitability, market share, quality, cost, flexibility, dependability, and innovation). 3. Develop an understanding of each functional area's role in achieving the various strategic objectives. 4. For each functional area, develop global performance measures capable of defining the firm's overall competitive position to top management. 5. Communicate strategic objectives and performance goals to lower levels in the organization. Establish more specific performance criteria at each level.

6. Assure consistency with strategic objectives among the performance criteria used at each level. 7. Assure the compatibility of performance measures used in all functional areas. 8. Use the performance measurement system to identify competition, locate problem areas, assist the firm in updating strategic objectives and making tactical decisions to achieve these objectives, and supply feedback after the decisions are implemented. 9. Periodically reevaluate the appropriateness of the established performance measurement system in view of the current competitive environment. Finally, it is important that the performance measurement systems used by managers be continually reviewed and revised as the environment and economy changes. Failure to make the necessary modifications can inhibit the ability of the organization to be an effective and efficient global competitor.

You might also like