DSCC Unit 4
A Distributed System is a network of machines that can exchange information with each other through message passing. It is very useful because it enables resource sharing: computers can coordinate their activities and share the resources of the system so that users perceive the system as a single, integrated computing facility.
1. Client/Server Systems: The client-server system is the most basic communication model: the client sends input to the server and the server replies to the client with an output. The client requests a resource or a task from the server; the server allocates the resource or performs the task and sends the result back as a response to the client's request. A client-server system can also be deployed with multiple servers.
2. Peer-to-Peer Systems: The peer-to-peer communication model is decentralized: every node acts as both client and server. Each node performs its tasks using its own local memory and shares data through the supporting communication medium, working as a server or as a client as required. Programs in a peer-to-peer system communicate at the same level, without any hierarchy.
3. Middleware: Middleware can be thought of as an application that sits between two separate applications and provides service to both. It serves as a base for interoperability between applications running on different operating systems; data can be transferred from one application to another through this service.
4. Three-tier: A three-tier system uses a separate layer and server for each function of a program. Client data is stored in the middle tier rather than on the client system or on the client's own server, which simplifies development. It includes a Presentation Layer, an Application Layer, and a Data Layer. This architecture is mostly used in web or online applications.
5. N-tier: N-tier is also called a multitier distributed system. An N-tier system can contain any number of functional tiers in the network and has a structure similar to the three-tier architecture. It is used whenever an application needs to forward a request to another application to perform a task or provide a service. N-tier architectures are commonly used in web applications and data systems.
SCHEDULING ALGORITHMS
LOCAL SCHEDULING
In a distributed system, local scheduling means how an individual workstation should schedule those processes assigned to it in order to maximize the
overall performance. It seems that local scheduling is the same as the scheduling approach on a stand-alone workstation. However, they are different in
many aspects. In a distributed system, the local scheduler may need global information from other workstations to achieve the optimal overall
performance of the entire system. For example, in the extended stride scheduling of clusters, the local schedulers need global ticket information in
order to achieve fairness across all the processes in the system. In recent years, there have been many scheduling techniques developed in different
models. Here, we introduce two of them: one is a proportional-share scheduling approach, in which the resource consumption rights of each active
process are proportional to the relative shares that it is allocated. The other is predictive scheduling, which is adaptive to the CPU load and resource
distribution of the distributed system. Traditional priority-based schedulers are difficult to understand and give more processing time to users with many jobs, which leads to unfairness among users. Much research has tried to find a scheduler that is easy to implement and can allocate resources to users fairly over time. In this environment, proportional-share scheduling was brought out to effectively solve
this problem. With proportional-share scheduling, the resource consumption rights of each active process are proportional to the relative shares that it
is allocated.
STRIDE SCHEDULING
As a kind of proportional-share scheduling strategy, stride scheduling allocates resources to competing users in proportion to the number of tickets
they hold. Each user has a time interval, or stride, inversely proportional to his/her ticket allocation, which determines how frequently it is used. A pass
is associated with each user. The user with a minimum pass is scheduled at each interval; a pass is then incremented by the job's stride.
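To make the mechanism concrete, here is a minimal sketch of a stride scheduler in Python; the constant STRIDE1, the class and function names, and the ticket values are illustrative choices, not part of any particular implementation.

```python
# Minimal sketch of stride scheduling (illustrative only; names are hypothetical).
# Each client holds some tickets; its stride is inversely proportional to its
# tickets. At every time slice the client with the smallest pass runs, then its
# pass advances by its stride, so CPU time converges to the ticket proportions.

STRIDE1 = 1 << 20  # large constant used to compute integer strides

class Client:
    def __init__(self, name, tickets):
        self.name = name
        self.tickets = tickets
        self.stride = STRIDE1 // tickets   # stride is proportional to 1 / tickets
        self.passvalue = self.stride       # start one stride ahead of zero

def schedule(clients, slices):
    """Run `slices` time slices and count how often each client is chosen."""
    counts = {c.name: 0 for c in clients}
    for _ in range(slices):
        current = min(clients, key=lambda c: c.passvalue)  # smallest pass runs
        counts[current.name] += 1
        current.passvalue += current.stride                # advance its pass
    return counts

if __name__ == "__main__":
    # Client A holds 3 tickets, B holds 1: A should get roughly 75% of the slices.
    print(schedule([Client("A", 3), Client("B", 1)], 100))
```

With 3 tickets against 1, client A is selected in roughly three quarters of the slices, matching its ticket proportion.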
Extension to Stride Scheduling: The original stride scheduling only deals with CPU-bound jobs. If proportional-share schedulers are to handle interactive and I/O-intensive job workloads, they must be extended to improve response time and I/O throughput, while not penalizing competing users. Here we discuss two extensions to stride scheduling that give credits to jobs not competing for resources. In this way, jobs are given an incentive to relinquish the processor when it is not needed, and they receive their share of resources over a longer time interval. Thus, because interactive jobs are scheduled more frequently when they awaken, they can receive better response time. The first approach is loan & borrow, and the second approach is system credit. Both approaches are built upon exhaustible tickets, which are simple tickets with an expiration time.
Loan & Borrow: In this approach, exhausted tickets are traded among competing clients. When a user temporarily exits the system, other users can
borrow these otherwise inactive tickets. The borrowed tickets expire when the user rejoins the system. When the sleeping user wakes up, it stops
loaning tickets and is paid back in exhaustible tickets by the borrowing users. In general, the lifetime of the exhaustible tickets equals the length of time for which the original tickets were borrowed. This policy keeps the total number of tickets in the system constant over time; thus, users can accurately determine the amount of resources they receive. However, it also introduces an excessive amount of computation into the scheduler on every sleep and wake-up event, which is undesirable.
System Credit: This second approach is an approximation of the first one. With system credits, clients are given exhaustible tickets from the system when they awaken. The idea behind this policy is that after a client sleeps and awakens, the scheduler calculates the number of exhaustible tickets the client needs to receive its proportional share over some longer interval. The system credit policy is easy to implement and does not add significant overhead to the scheduler on sleep and wakeup events. A proportional share of resources can be allocated to clients running sequential jobs in a cluster. In the cluster, users are guaranteed a proportional share of resources if each local stride-scheduler is aware of the number of tickets issued in its currency across the cluster and if the total number of base tickets allocated on each workstation is balanced. The solution for the first assumption is simple: each local scheduler is informed of the number of tickets issued in each currency, and then correctly calculates the base funding of each local job. The solution for distributing tickets to the stride-schedulers is to run a user-level ticket server on each of the nodes in the cluster. Each stride-scheduler periodically contacts the local ticket server to update and determine the value of currencies. Further, for parallel jobs in a distributed cluster, proportional-share resources can be provided through a combination of stride scheduling and implicit coscheduling. Preliminary simulations of implicit coscheduling for a range of communication patterns and computation granularities indicate that the stride-scheduler with system credit performs similarly to the Solaris time-sharing scheduler used in the Berkeley NOW environment.
PREDICTIVE SCHEDULING
Predictive scheduling differs from other scheduling approaches in that it provides intelligence, adaptivity and proactivity, so that a system implementing predictive scheduling can adapt to new architectures, algorithms and environmental changes automatically. Predictive scheduling can learn new architectures, algorithms and methods that are embedded into the system, and it provides some guarantees of service. Furthermore, it is able to anticipate significant changes in its environment and to prevent those changes from becoming the system's performance bottleneck.
Predictive scheduling can be roughly decomposed into three components: the H-cell, the S-cell and the allocator. The H-cell receives information about hardware resource changes, such as disk traffic, CPU usage and memory availability, and provides near-real-time control. Meanwhile, the S-cell provides long-term control of computational demands, such as a task's deadline and its real-time requirements, by interrogating the parallel program code. The H-cell and S-cell respectively collect information about computational supply and computational demand, and provide the allocator with raw data or intelligent recommendations. The allocator reconciles the recommendations sent by the H-cells and S-cells and schedules jobs according to their deadlines, while guaranteeing constraints and enforcing the deadlines. In the allocator, the previous inputs, in the form of a vector of performance information (such as memory, CPU and disk usage), are aggregated into sets. Each set corresponds to a scheduling decision. The allocator re-organizes the sets dynamically to keep memory demand bounded by splitting or merging sets. If a new input matches one of the pattern categories, a decision is made according to the corresponding decision of that pattern set; otherwise a new pattern category is built to associate this new input pattern with a corresponding scheduling decision.
Most scheduling policies are invoked either when a process blocks or at the end of a time slice, which may reduce performance because a considerable lapse of time can pass before scheduling is done. Predictive scheduling solves this problem by predicting when a scheduling decision is necessary, or by predicting the parameters needed by the scheduling decision when they are not known in advance. Based on the collected static information (machine type, CPU power, etc.) and dynamic information (free memory, CPU load, etc.), predictive scheduling tries to make an educated guess about future behavior, such as CPU idle time slots, which can be used to make scheduling decisions in advance. Predicting future performance based on past information is a common strategy, and it can achieve satisfactory performance in practice. Predictive scheduling is very effective at enhancing performance and reliability, even with the simplest methods, but at the cost of design complexity and management overhead. Furthermore, it is observed that the more complicated the method used, the greater the design complexity and management overhead, and the smaller the performance and reliability enhancement.
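As a rough illustration of the prediction idea, the sketch below estimates near-future CPU load from past samples with exponential smoothing and uses it for a placement decision; the smoothing factor, idle threshold, and node histories are assumed for the example and are not the actual H-cell/S-cell design.

```python
# Illustrative sketch: predicting near-future CPU load from past samples with
# exponential smoothing, then making a placement decision ahead of time.
# The smoothing factor and idle threshold are assumptions, not part of any
# specific predictive scheduler described above.

def predict_load(samples, alpha=0.5):
    """Exponentially weighted moving average of past load samples (0.0-1.0)."""
    estimate = samples[0]
    for s in samples[1:]:
        estimate = alpha * s + (1 - alpha) * estimate
    return estimate

def pick_node(history_per_node, idle_threshold=0.3):
    """Choose the node whose predicted load is lowest and below the threshold."""
    predictions = {node: predict_load(h) for node, h in history_per_node.items()}
    node, load = min(predictions.items(), key=lambda kv: kv[1])
    return node if load < idle_threshold else None  # None: defer the decision

if __name__ == "__main__":
    history = {"ws1": [0.9, 0.8, 0.7], "ws2": [0.2, 0.25, 0.1]}
    print(pick_node(history))   # expected: "ws2"
```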
COSCHEDULING
In 1982, Ousterhout introduced the idea of coscheduling, which schedules the interacting activities (i.e., processes) in a job so that all the activities execute simultaneously on distinct workstations. It can produce benefits in both system and individual job efficiency. Without coordinated scheduling, processor thrashing may lead to high communication latencies and consequently degraded overall performance. With systems connected by high-performance networks that already achieve latencies within tens of microseconds, the success of coscheduling becomes a more important factor in determining performance.
GANG SCHEDULING
Gang scheduling is a typical coscheduling approach; it was introduced a long time ago but still plays a fundamental role, and many research projects are still in progress to improve it. The approach identifies a job as a gang and its components as gang members. Further, each job is assigned to a class that has the minimum number of workstations that meets the requirement of its gang members, based on a one-process-one-workstation policy. The class has a local scheduler, which can have its own scheduling policy. When a job is scheduled, each of its gang members is allocated to a distinct workstation, and thus the job executes in parallel. When a time slice finishes, all running gang members are preempted simultaneously, and all processes from a second job are scheduled for the next time slice. When a job is rescheduled, effort is also made to run the same processes on the same processors. The strategy bypasses the busy-waiting problem by scheduling all processes at the same time. Experience shows that it works well for parallel jobs that have a lot of inter-process communication. However, it also has several disadvantages. First, it is a centralized scheduling strategy, with a single scheduler making decisions for all jobs and all workstations; this centralized nature can easily become the bottleneck when the load is heavy. Second, although this scheduler can achieve high system efficiency on regular parallel applications, it has difficulty selecting alternate jobs to run when processes block, since that requires simultaneous multi-context switches across the nodes. Third, achieving good performance requires long scheduling quanta, which can interfere with interactive response, making gang scheduling a less attractive choice for use in a distributed system.
These limitations motivate the integrated approaches. The requirement of centralized control and the poor timesharing response of previous scheduling approaches have motivated new, integrated coscheduling approaches. Such approaches extend local timesharing schedulers, preserving their interactive response and autonomy. Further, such approaches do not need explicitly identified sets of processes to be coscheduled, but rather integrate the detection of a coscheduling requirement with actions to produce effective coscheduling.
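The placement idea behind gang scheduling can be sketched with an Ousterhout-style matrix, where rows are time slices and columns are workstations and all members of a gang share a row; the first-fit placement rule and the data layout below are simplifying assumptions made for illustration.

```python
# Sketch of gang scheduling with an Ousterhout-style matrix: rows are time
# slices, columns are workstations. All members of a gang are placed in the
# same row, so they run simultaneously and are preempted together.

def build_schedule(gangs, num_workstations):
    """gangs: dict {gang_name: number_of_members (one process per workstation)}."""
    rows = []   # each row is a list of length num_workstations
    for name, members in gangs.items():
        if members > num_workstations:
            raise ValueError(f"{name} needs more workstations than exist")
        # first-fit: find a row with enough free columns, else open a new row
        for row in rows:
            free = [i for i, slot in enumerate(row) if slot is None]
            if len(free) >= members:
                for i in free[:members]:
                    row[i] = name
                break
        else:
            row = [None] * num_workstations
            for i in range(members):
                row[i] = name
            rows.append(row)
    return rows   # row k runs during time slice k, k + len(rows), ...

if __name__ == "__main__":
    for slice_no, row in enumerate(build_schedule({"J1": 3, "J2": 2, "J3": 2}, 4)):
        print(f"time slice {slice_no}: {row}")
```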
IMPLICIT COSCHEDULING
Implicit coscheduling is a distributed algorithm for time-sharing communicating processes in a cluster of workstations. By observing and reacting to
implicit information, local schedulers in the system make independent decisions that dynamically coordinate the scheduling of communicating
processes. The principal mechanism involved is two-phase spin-blocking: a process waiting for a message response spins for some amount of time, and then relinquishes the processor if the response does not arrive. The spin time before a process relinquishes the processor at each communication event consists of three components. First, a process should spin for the baseline time needed for the communication operation to complete; this component keeps coordinated jobs in synchrony. Second, the process should increase the spin time according to a local cost-benefit analysis of spinning versus blocking. Third, the pairwise cost-benefit: the process should spin longer when receiving messages from other processes, thus considering the impact of this process on others in the parallel job.
● The baseline time comprises the round-trip time of the network, the overhead of sending and receiving messages, and the time to awake the
destination process when the request arrives.
● The local cost-benefit is the point at which the expected benefit of relinquishing the processor exceeds the cost of being scheduled again. For
example, if the destination process will be scheduled later, it may be beneficial to spin longer and avoid the cost of losing coordination and being
rescheduled later. On the other hand, when a large load imbalance exists across processes in the parallel job, it may be wasteful to spin for the entire
load-imbalance even when all the processes are coscheduled.
● The pairwise spin-time only applies when other processes are sending to the currently spinning process, and is therefore conditional. Consider a pair of processes: a receiver performing a two-phase spin-block while waiting for a communication operation to complete, and a sender sending a request to the receiver. When waiting for a remote operation, the process spins for the base and local amount, while recording the number of incoming messages. If the average interval between requests is sufficiently small, the process assumes that it will remain beneficial in the future to be scheduled and continues to spin for an additional spin time. The process continues conditionally spinning for intervals of spin time until no messages are received in an interval.
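A minimal sketch of the two-phase spin-block just described is shown below, assuming hypothetical callbacks for checking whether the response has arrived and for reporting the incoming-message interval; the timing constants stand in for the baseline, local, and pairwise components.

```python
# Sketch of two-phase spin-block: spin for a baseline interval, extend the
# spin while messages keep arriving at short intervals, and only then block
# (relinquish the processor). Timings and the callbacks are hypothetical; a
# real implementation would hook into the messaging layer and the scheduler.

import time

def two_phase_wait(response_ready, incoming_interval,
                   baseline=0.0005, local_benefit=0.0005, pairwise=0.0005):
    """Return 'spun' if the response arrived while spinning, else 'blocked'."""
    deadline = time.monotonic() + baseline + local_benefit
    while time.monotonic() < deadline:
        if response_ready():
            return "spun"
        # conditional pairwise spin: extend the deadline while other
        # processes are still sending to us at short intervals
        if incoming_interval() < pairwise:
            deadline = time.monotonic() + pairwise
    return "blocked"   # give up the processor (e.g., block on a condition variable)

if __name__ == "__main__":
    start = time.monotonic()
    # Response arrives after 0.3 ms, no incoming traffic: we expect 'spun'.
    result = two_phase_wait(lambda: time.monotonic() - start > 0.0003,
                            lambda: float("inf"))
    print(result)
```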
DYNAMIC COSCHEDULING
Dynamic coscheduling makes scheduling decisions driven directly by message arrivals. When an arriving message is directed to a process that isn't running, a scheduling decision is made. The idea derives from the observation that only communicating processes need to be coscheduled. Therefore, it doesn't require explicit identification of the processes that need coscheduling.
The implementation consists of three parts:
Monitoring Communication/Thread Activity: Firmware on the network interface card monitors thread activity by periodically reading the host's kernel memory. If the incoming message is directed to the process currently running, the scheduler does nothing.
Causing Scheduling Decisions: If a received message is not directed to the process currently running, an interrupt is raised, invoking the interrupt routine. When the routine finds that it would be fair to preempt the currently running process, the process receiving the message has its priority raised to the maximum allowable priority for user-mode timesharing processes and is placed at the front of the dispatcher queue. Flags are set to cause a scheduling decision based on the new priorities. This causes the process receiving the message to be scheduled unless the currently running process has a higher priority than the maximum allowable priority for user mode.
Making a Decision Whether to Preempt: In dynamic coscheduling, the process receiving the message is scheduled only if doing so would not cause unfair CPU allocation. Fairness is implemented by limiting the frequency of priority boosts, which in turn limits the frequency of preemption. In jobs with fine-grained communication, the sender and receiver are scheduled together and run until one of them blocks or is preempted. Larger collections of communicating processes are coscheduled by transitivity. Experiments in the HPVM project indicate that dynamic coscheduling can provide good performance for a parallel process running on a cluster of workstations in competition with serial processes. Performance was close to ideal: CPU times were nearly the same as for batch processing, and job response times were reduced by up to 20% over implicit scheduling while maintaining near-perfect fairness. Further, it is claimed that dynamic-coscheduling-like approaches can be used to implement coordinated resource management in a much broader range of cases, although most of these are still to be explored.
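The decision path described above can be sketched as follows; the priority ceiling, the rate limit on boosts, and the set_priority/reschedule hooks are assumptions made for illustration rather than details of any particular kernel.

```python
# Sketch of the dynamic-coscheduling decision path: when a message arrives for
# a process that is not running, boost its priority and request a reschedule,
# but rate-limit the boosts to preserve fairness.

import time

MAX_USER_PRIORITY = 59          # hypothetical ceiling for user time-sharing
MIN_BOOST_INTERVAL = 0.010      # seconds between boosts for the same process

last_boost = {}                 # pid -> time of last priority boost

def on_message_arrival(receiver_pid, running_pid, set_priority, reschedule):
    """Called by the interrupt routine when a message is delivered."""
    if receiver_pid == running_pid:
        return "already running"            # nothing to do
    now = time.monotonic()
    if now - last_boost.get(receiver_pid, 0.0) < MIN_BOOST_INTERVAL:
        return "boost rate-limited"         # preempting now would be unfair
    last_boost[receiver_pid] = now
    set_priority(receiver_pid, MAX_USER_PRIORITY)   # front of dispatcher queue
    reschedule()                                    # may preempt the running process
    return "preemption requested"

if __name__ == "__main__":
    print(on_message_arrival(42, 7,
                             set_priority=lambda pid, p: None,
                             reschedule=lambda: None))
```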
TASK ASSIGNMENT APPROACH
Each process is viewed as a collection of tasks, and these tasks are scheduled onto suitable processors to improve performance. This is not a widely used approach because:
● It requires the characteristics of all the processes to be known in advance.
● It does not take into consideration the dynamically changing state of the system.
In this approach, a process is considered to be composed of multiple tasks and the goal is to find an optimal assignment policy for the tasks of an individual process. Typical goals of the task assignment approach are:
● Minimizing IPC cost (this problem can be modeled using a network flow model)
● Efficient resource utilization
● Quick turnaround time
● A high degree of parallelism
Resource Management: One of the functions of system management in distributed systems is resource management. When a user requests the execution of a process, the resource manager performs the allocation of resources to the process submitted by the user for execution. In addition, the resource manager routes processes to appropriate nodes (processors) based on the assignments. Since multiple resources are available in the distributed system, there is a need for system transparency for the user. A resource in the system can be logical or physical, for example, a data file in sharing mode or a Central Processing Unit (CPU). As the name implies, the task assignment approach is based on the division of a process into multiple tasks. These tasks are assigned to appropriate processors to improve performance and efficiency. This approach has a major setback in that it needs prior knowledge of the features of all the participating processes. Furthermore, it does not take into account the dynamically changing state of the system. The major objective of this approach is to allocate the tasks of a single process in the best possible manner, since it is based on the division of tasks in a system. For that, there is a need to identify the optimal policy for its implementation.
Working of Task Assignment Approach:
In the working of the Task Assignment Approach, the following are the assumptions:
• The division of an individual process into tasks.
• Each task’s computing requirements and the performance in terms of the speed of each processor are known.
• The cost incurred in the processing of each task performed on every node of the system is known.
• The IPC (Inter-Process Communication) cost is known for every pair of tasks performed between nodes.
• Other constraints, such as job resource requirements, the resources available at each node, and task priority connections, are also known.
Goals of Task Assignment Algorithms:
• Reducing Inter-Process Communication (IPC) Cost
• Quick Turnaround Time or Response Time for the whole process
• A high degree of Parallelism
• Utilization of System Resources in an effective manner
The above-mentioned goals often conflict. For example, goal 1 suggests that all the tasks of a process be allocated to a single node to reduce the Inter-Process Communication (IPC) cost, whereas goal 4, the efficient utilization of system resources, implies that the tasks of a process should be divided and processed by appropriate nodes across the system.
Note: The possible number of assignments of tasks to nodes: for m tasks and n nodes there are n^m possible assignments, since each task can be placed on any of the n nodes.
But in practice, the number of feasible assignments is smaller than n^m because of constraints on allocating tasks to appropriate nodes arising from their particular requirements, such as memory space.
Need for Task Assignment in a Distributed System:
The need for task assignment in distributed systems arises from the desire to achieve the set performance goals. For that, optimal assignments should be carried out with respect to cost and time functions, such as: task assignment that minimizes the total execution and communication costs; task completion time; the total of three costs (execution, communication, and interference); total execution and communication costs with a limit imposed on the number of tasks assigned to each processor; and a weighted product of the cost functions of total execution and communication costs and task completion time. All these factors can be taken into account in task allocation and, in turn, lead to the best outcome for the system.
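As a small worked example of these cost functions, the sketch below enumerates all possible assignments of m tasks to n nodes and picks the one that minimizes total execution cost plus inter-task communication (IPC) cost; the cost tables are made-up illustrative numbers.

```python
# Brute-force task assignment: enumerate all n**m assignments of m tasks to
# n nodes and pick the one minimizing execution cost plus IPC cost.

from itertools import product

def best_assignment(exec_cost, ipc_cost, nodes):
    """exec_cost[task][node] = execution cost; ipc_cost[(t1, t2)] = cost paid
    only if t1 and t2 end up on different nodes (zero if colocated)."""
    tasks = list(exec_cost)
    best, best_total = None, float("inf")
    for choice in product(nodes, repeat=len(tasks)):       # n**m assignments
        placement = dict(zip(tasks, choice))
        total = sum(exec_cost[t][placement[t]] for t in tasks)
        total += sum(c for (a, b), c in ipc_cost.items()
                     if placement[a] != placement[b])
        if total < best_total:
            best, best_total = placement, total
    return best, best_total

if __name__ == "__main__":
    exec_cost = {"t1": {"n1": 5, "n2": 10}, "t2": {"n1": 4, "n2": 4},
                 "t3": {"n1": 12, "n2": 3}}
    ipc_cost = {("t1", "t2"): 6, ("t2", "t3"): 2}
    print(best_assignment(exec_cost, ipc_cost, ["n1", "n2"]))
```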
LOAD-BALANCING APPROACH
A load balancer is a device that acts as a reverse proxy and distributes network or application traffic across a number of servers. Load balancing is the approach of distributing load units (i.e., jobs/tasks) across the network of nodes that make up the distributed system. Load balancing is performed by the load balancer, a component that manages the load and is used to distribute tasks to the servers. For example, the load balancer allocates the first task to the first server and the second task to the second server.
Purpose of Load Balancing in Distributed Systems:
• Security: A load balancer can add safety to your site with practically no changes to your application.
• Protect applications from emerging threats: The Web Application Firewall (WAF) in the load balancer shields your site.
• Authenticate User Access: The load balancer can request a username and password before granting access to your site, safeguarding against unauthorized access.
• Protect against DDoS attacks: The load balancer can detect and drop distributed denial-of-service (DDoS) traffic before it gets to your site.
• Performance: Load balancers can reduce the load on your web servers and optimize traffic for a better user experience.
• SSL Offload: Terminating SSL (Secure Sockets Layer) traffic on the load balancer removes that overhead from the web servers, leaving more resources available for your web application.
• Traffic Compression: A load balancer can compress site traffic, giving your users a much better experience with your site.
Load Balancing Approaches:
• Round Robin
• Least Connections
• Least Time
• Hash
• IP Hash
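A minimal sketch of the first two approaches in the list above (round robin and least connections) might look like this; the server names and connection bookkeeping are illustrative.

```python
# Minimal sketch of two load-balancing approaches: round robin and least
# connections. Server names and connection counts are illustrative.

from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, servers):
        self._next = cycle(servers)              # rotate through servers in order

    def pick(self):
        return next(self._next)

class LeastConnectionsBalancer:
    def __init__(self, servers):
        self.active = {s: 0 for s in servers}    # current open connections

    def pick(self):
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        self.active[server] -= 1

if __name__ == "__main__":
    rr = RoundRobinBalancer(["s1", "s2", "s3"])
    print([rr.pick() for _ in range(5)])     # s1 s2 s3 s1 s2
    lc = LeastConnectionsBalancer(["s1", "s2"])
    print([lc.pick() for _ in range(3)])     # s1 s2 s1 (ties go to s1 first)
```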
Classes of Load Balancing Algorithms:
The following are some of the different classes of load balancing algorithms.
• Static: In this model, if any node is found to be heavily loaded, a task can be picked at random and moved to some other randomly chosen node.
• Dynamic: These algorithms use the current state information of the system for load balancing, and therefore perform better than static algorithms.
• Deterministic: These algorithms use the characteristics of the processors and the processes to allocate processes to nodes.
• Centralized: The system state information is collected by a single node.
Advantages of Load Balancing:
• Load balancers minimize server response time and maximize throughput.
• Load balancers ensure high availability and reliability by sending requests only to servers that are online.
• Load balancers perform continuous health checks to monitor each server's capability of handling requests.
LOAD-SHARING APPROACH
Load sharing denotes the process by which a router shares the forwarding of traffic when multiple paths are available in the routing table. If the paths are of equal cost, forwarding follows the load-sharing algorithm. In load-sharing systems, all nodes share the overall workload, and the failure of some nodes increases the pressure on the remaining nodes. The load-sharing approach ensures that no node is kept idle, so that every node shares the load.
For example, suppose there are two server connections with different bandwidths, one of 500 Mbps and another of 250 Mbps, and there are 2 packets to send. Instead of sending both packets over the same 500 Mbps connection, one packet is forwarded over the 500 Mbps connection and the other over the 250 Mbps connection. The goal here is not to use the same amount of bandwidth on the two connections but to share the load so that each connection can handle it sensibly without congestion.
A load-sharing algorithm includes policies such as a location policy, a process transfer policy, a state information exchange policy, a load estimation policy, a priority assignment policy, and a migration limiting policy.
1. Location Policies: The location policy determines the sender node or the receiver node of a process that is to be moved within the system for load sharing. Depending on which type of node takes the initiative and searches globally for a suitable node for the process, the location policies are of the following kinds:
• Sender-initiated policy: Here the sender node of the process decides where the process is to be sent. Heavily loaded nodes search for lightly loaded nodes to which part of the workload can be transferred. Whenever a node's load rises above the threshold value, it either broadcasts a message or randomly probes other nodes one by one to find a lightly loaded node that can accept one or more of its processes. If a suitable receiver node is not found, the node on which the process originated must execute that process itself.
• Receiver-initiated policy: Here the receiver node of the process decides from where to receive the process. In this policy, lightly loaded nodes search for heavily loaded nodes from which the execution of processes can be accepted. Whenever the load on a node falls below the threshold value, it broadcasts a message to all nodes, or probes nodes one by one, to search for heavily loaded nodes. A heavily loaded node may transfer one of its processes if such a transfer does not reduce its load below the normal threshold.
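A simple sketch of the two location policies, assuming each node can probe a few random peers for their current load, is given below; the threshold, probe limit, and load numbers are illustrative assumptions.

```python
# Sketch of the two location policies: a sender-initiated transfer probes
# random nodes when the local load crosses the threshold, while a
# receiver-initiated transfer probes for work when the node is underloaded.

import random

THRESHOLD = 4        # number of processes above which a node is "heavy"
PROBE_LIMIT = 3      # how many random nodes to probe before giving up

def sender_initiated(my_load, loads, probe_limit=PROBE_LIMIT):
    """Heavy node looks for a lightly loaded receiver; None = run locally."""
    if my_load <= THRESHOLD:
        return None
    for node in random.sample(list(loads), min(probe_limit, len(loads))):
        if loads[node] < THRESHOLD:
            return node              # transfer one process to this node
    return None                      # no suitable receiver found

def receiver_initiated(my_load, loads, probe_limit=PROBE_LIMIT):
    """Light node looks for a heavily loaded sender to take work from."""
    if my_load >= THRESHOLD:
        return None
    for node in random.sample(list(loads), min(probe_limit, len(loads))):
        if loads[node] > THRESHOLD:
            return node              # pull one process from this node
    return None

if __name__ == "__main__":
    others = {"n2": 1, "n3": 7, "n4": 2}
    print("send to:", sender_initiated(my_load=6, loads=others))
    print("pull from:", receiver_initiated(my_load=0, loads=others))
```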
2. Process Transfer Policy: An all-or-nothing approach is used in this policy. The threshold value of all the nodes is set to 1. A node becomes a receiver node if it has no process, and a node becomes a sender node if it has more than 1 process. If nodes turn idle, they cannot accept a new process immediately, which wastes processing power. To overcome this problem, the process is transferred to a node that is expected to become idle in the future. Sometimes, to avoid wasting processing power on the nodes, the load-sharing algorithm raises the threshold value from 1 to 2.
3. State Information Exchange Policy: In a load-sharing algorithm, nodes are not required to exchange state information regularly; they only need to know the state of other nodes when they are either underloaded or overloaded. Thus two sub-policies are used here:
• Broadcast when the state changes: A node broadcasts a state information request only when its state changes. In the sender-initiated location policy, the state information request is broadcast only when a node becomes overloaded. In the receiver-initiated location policy, the state information request is broadcast only when a node becomes underloaded.
• Poll when the state changes: In a large network a polling operation is performed instead. A node randomly asks different nodes for state information until it finds an appropriate one or reaches the probe limit.
4. Load Estimation Policy: Load-sharing algorithms aim to keep nodes from being idle, so it is sufficient to know whether a node is busy or idle. Consequently, these algorithms typically use the simplest load estimation policy: counting the total number of processes on a node.
5. Priority Assignment Policy: This policy uses rules to determine the priority of processes on a particular node. The rules are:
• Selfish: Local processes are given higher priority than remote processes. This gives the worst response time performance for remote processes and the best response time performance for local processes.
• Altruistic: Remote processes are given higher priority than local processes. It has the best overall response time performance.
• Intermediate: The numbers of local and remote processes on a node decide the priority. When the number of local processes is greater than or equal to the number of remote processes, local processes are given higher priority; otherwise remote processes are given higher priority than local processes.
6. Migration Limiting Policy: This policy decides the total number of times a process can migrate. One of the following two strategies may be used.
• Uncontrolled: A remote process arriving at a node is handled in the same way as a process originating at that node, so a process can migrate any number of times.
• Controlled: A migration count parameter is used to fix a limit on the number of migrations of a process. Thus, a process can migrate only a fixed number of times. This removes the instability of the uncontrolled strategy.
PROCESS MANAGEMENT
Process management is a systematic approach to ensure that effective and efficient business processes are in place. It is a methodology used to align
business processes with strategic goals. In contrast to project management, which is focused on a single project, process management addresses
repetitive processes carried out on a regular basis. It looks at every business process, individually and as a whole, to create a more efficient
organization. It analyzes current systems, spots bottlenecks, and identifies areas of improvement. Process management is a long-term strategy that
constantly monitors business processes so they maintain optimal efficiency. Implemented properly, it significantly helps boost business growth.
DISTRIBUTED FILE SYSTEM (DFS)
A Distributed File System (DFS), as the name suggests, is a file system that is distributed across multiple file servers or multiple locations. It allows programs to access or store remote files just as they do local ones, allowing programmers to access files from any network or computer. The main purpose of the Distributed File System (DFS) is to allow users of physically distributed systems to share their data and resources by using a common file system.
FEATURES OF DFS:
Structure transparency: The client does not need to know the number or locations of file servers and storage devices. Multiple file servers should be provided for performance, adaptability, and dependability.
Access transparency: Both local and remote files should be accessible in the same manner. The file system should automatically locate the accessed file and deliver it to the client's side.
Naming transparency: The name of the file should give no hint as to the location of the file. Once a name is given to a file, it should not change when the file is transferred from one node to another.
Replication transparency: If a file is copied on multiple nodes, the copies of the file and their locations should be hidden from the users.
Performance: Performance is measured by the average amount of time needed to satisfy client requests. This time covers CPU time + the time taken to access secondary storage + network access time. It is desirable that the performance of a Distributed File System be comparable to that of a centralized file system.
Security: A distributed file system should be secure so that its users may trust that their data will be kept private. To
safeguard the information contained in the file system from unwanted & unauthorized access, security mechanisms must be
implemented.
APPLICATIONS OF DFS:
NFS: NFS stands for Network File System. It is a client-server architecture that allows a computer user to view, store, and
update files remotely. The protocol of NFS is one of the several distributed file system standards for Network-Attached
Storage (NAS).
CIFS: CIFS stands for Common Internet File System. CIFS is a dialect of SMB; that is, CIFS is an implementation of the SMB protocol designed by Microsoft.
SMB: SMB stands for Server Message Block. It is a file-sharing protocol that was invented by IBM. The SMB protocol was created to allow computers to perform read and write operations on files on a remote host over a Local Area Network (LAN). The directories present on the remote host that can be accessed via SMB are called "shares".
Hadoop: Hadoop is a group of open-source software services. It provides a software framework for distributed storage and processing of big data using the MapReduce programming model. The core of Hadoop contains a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model.
NetWare: NetWare is a discontinued computer network operating system developed by Novell, Inc. It primarily used cooperative multitasking to run different services on a personal computer.
Advantages:
● DFS allows multiple users to access or store data.
● It allows data to be shared remotely.
● It improves the availability of files, access time, and network efficiency.
● It improves the capacity to change the size of the data and the ability to exchange data.
● A Distributed File System provides transparency of data even if a server or disk fails.
Disadvantages:
● In a Distributed File System, nodes and connections need to be secured, so security is at stake.
● Messages and data may be lost in the network while moving from one node to another.
● Database connection in a Distributed File System is complicated.
● Handling the database is also not easy in a Distributed File System compared to a single-user system.
● Overloading may occur if all nodes try to send data at once.
FILE MODELS
There are mainly two classes of file models in a distributed operating system.
1. Structure Criteria
2. Modifiability Criteria
Structure Criteria
There are two types of file models under the structure criteria. These are as follows:
1. Structured Files
2. Unstructured Files
STRUCTURED FILES
The Structured file model is presently a rarely used file model. In the structured file model, a file is seen as a collection of records by the
file system. Files come in various shapes and sizes and with a variety of features. It is also possible that records from various files in the
same file system have varying sizes. Despite belonging to the same file system, files have various attributes. A record is the smallest unit of
data from which data may be accessed. The read/write actions are executed on a set of records. Different "File Attributes" are provided in
a hierarchical file system to characterize the file. Each attribute consists of two parts: a name and a value. The file system used determines
the file attributes. It provides information on files, file sizes, file owners, the date of last modification, the date of file creation, access
permission, and the date of last access. Because of the varied access rights, the Directory Service function is utilized to manage file
attributes.
1. Files with Non-Indexed records: Records in non-indexed files are retrieved based on their position within the file, for instance, the second record from the beginning or the second record from the end of the file.
2. Files with Indexed records: In a file containing indexed records, each record has one or more key fields and may be accessed by specifying their values. The file is stored as a B-tree, a hash table, or a similar data structure so that records can be found quickly.
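As an illustration of indexed access, the sketch below builds an in-memory index on a key field and retrieves records by key value; a real file system would use a B-tree or hash table on disk, and the record layout here is invented for the example.

```python
# Sketch of the indexed-record idea: records are retrieved by the value of a
# key field rather than by position. A dictionary stands in for the B-tree or
# hash table a real file system would use.

records = [
    {"id": 101, "owner": "alice", "size": 4096},
    {"id": 102, "owner": "bob",   "size": 1024},
    {"id": 103, "owner": "alice", "size": 2048},
]

# Build an index on the "id" key field: key value -> record.
index_by_id = {rec["id"]: rec for rec in records}

def read_record(key):
    """Access a record directly by its key field value."""
    return index_by_id.get(key)

if __name__ == "__main__":
    print(read_record(102))               # indexed access by key
    print(records[1])                     # non-indexed access by position
```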
Unstructured Files:
It is the most important and widely used file model. In the unstructured model, a file is an unstructured sequence of data. It does not support any substructure. The data and the structure of each file in the file system are an uninterpreted sequence of bytes, as in UNIX or DOS. Most modern operating systems prefer the unstructured file model over the structured file model because files are shared by multiple applications: since a file has no structure, it can be interpreted in different ways by different applications.
Modifiability Criteria:
There are two file models under the modifiability criteria.
These are as follows:
1. Mutable Files
2. Immutable Files
1. Mutable Files: Existing operating systems employ the mutable file model. A file is represented as a single sequence of records, because the same file is updated repeatedly as new material is added. When a file is updated, the existing contents are changed by the new contents.
2. Immutable Files: The immutable file model is used by the Cedar File System (CFS). In the immutable file model, a file may not be modified once it has been created; it can only be deleted. File updates are implemented by creating several versions of the same file: when a file is changed, a new version of the file is created. Sharing is consistent because only immutable files are shared in this paradigm, which allows distributed systems to use caching and replication strategies while still maintaining consistency among the many copies. The disadvantages of the immutable file model are increased space usage and disk allocation activity. CFS uses the "keep" parameter to keep track of the current version number of a file. When the parameter value is 1, creating a new version of the file causes the previous version to be erased and its disk space reused for the new one. When the parameter value is greater than 1, several versions of the file exist. If the version number is not specified, CFS uses the lowest version number for actions such as "delete" and the highest version number for other activities such as "open".
FILE ACCESSING MODELS
A client's request for accessing a particular file is serviced on the basis of the file accessing model used by the distributed file system. The file accessing model basically depends on (1) the unit of data access and (2) the method used for accessing remote files.
On the basis of the unit of data access, the following file access models may be used to access a specific file.
1. File-level transfer model
2. Block-level transfer model
3. Byte-level transfer model
4. Record-level transfer model
1. File-level transfer model: In the file-level transfer model, the complete file is moved whenever an operation requires the file data to be transmitted across the distributed computing network between client and server. This model has better scalability and is efficient.
2. Block-level transfer model: In the block-level transfer model, file data is transferred through the network between client and server in units of file blocks. In short, the unit of data transfer in the block-level transfer model is the file block. The block-level transfer model may be used in distributed computing environments comprising several diskless workstations.
3. Byte-level transfer model: In the byte-level transfer model, file data is transferred through the network between client and server in units of bytes. In short, the unit of data transfer in the byte-level transfer model is the byte. The byte-level transfer model offers more flexibility than the other file transfer models, since it allows the retrieval and storage of an arbitrary sequential subrange of a file. The major disadvantage of the byte-level transfer model is the difficulty of cache management, because of the variable-length data involved in different access requests.
4. Record-level transfer model: The record-level transfer model may be used with file models in which the file contents are structured in the form of records. In the record-level transfer model, file data is transferred through the network between client and server in units of records; the unit of data transfer is the record.
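To contrast the file-level and block-level models, here is a small sketch in which a remote file is simulated by an in-memory byte string; the block size and the data are illustrative assumptions.

```python
# Sketch contrasting the file-level and block-level transfer models: one call
# fetches the whole file, the other fetches a fixed-size block on demand.

BLOCK_SIZE = 4096                      # illustrative block size

REMOTE_FILE = bytes(range(256)) * 64   # stands in for a file held by the server

def fetch_whole_file():
    """File-level transfer model: the complete file crosses the network once."""
    return REMOTE_FILE

def fetch_block(block_no):
    """Block-level transfer model: only the requested block crosses the network."""
    start = block_no * BLOCK_SIZE
    return REMOTE_FILE[start:start + BLOCK_SIZE]

if __name__ == "__main__":
    print(len(fetch_whole_file()))     # 16384 bytes moved in one transfer
    print(len(fetch_block(1)))         # 4096 bytes moved for one block
```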