Future Generation Computer Systems 23 (2007) 846–860
www.elsevier.com/locate/fgcs
Job scheduling and data replication on data grids
Ruay-Shiung Chang ∗ , Jih-Sheng Chang, Shin-Yi Lin
Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Hualien 974, Taiwan
Received 25 July 2006; received in revised form 23 February 2007; accepted 27 February 2007
Available online 16 March 2007
Abstract
In data grids, many distributed scientific and engineering applications often require access to a large amount of data (terabytes or petabytes).
Data access time depends on bandwidth, especially in a cluster grid. Network bandwidth within the same cluster is larger than across clusters. In
a communication environment, the major bottleneck to supporting fast data access in grids is the high latency of Wide Area Networks (WANs) and the Internet. Effective scheduling in such a network architecture can reduce the amount of data transferred across the Internet by dispatching jobs to where the needed data are present. Another solution is to use a data replication mechanism to generate multiple copies of existing data and thereby reduce the need to fetch data from remote sites. To utilize the above two concepts, in this paper we develop a job scheduling policy, called HCS
(Hierarchical Cluster Scheduling), and a dynamic data replication strategy, called HRS (Hierarchical Replication Strategy), to improve the data
access efficiency in a cluster grid. We simulate our algorithms to evaluate various combinations of data access patterns. We also implement HCS and HRS in the Taiwan UniGrid environment. The simulation and experimental results show that HCS and HRS successfully reduce data access time and the amount of inter-cluster communication compared with other strategies in a cluster grid.
© 2007 Elsevier B.V. All rights reserved.
Keywords: Data replication; Data grid; Job scheduling
1. Introduction
In data grids [1,2], distributed scientific and engineering
applications often require access to a large amount of data
(terabytes or petabytes). Managing this large amount of data in
a centralized way is ineffective due to extensive access latency
and load on the central server. Hence, such huge datasets must be partitioned and stored in several physical locations.
In a communication environment, the performance of
accessing a distributed and huge amount of data depends on the
availability of network bandwidth. Namely, slow data access
can throttle the performance of data-intensive applications
running on grid computers. In Fig. 1, a simple hierarchical
form of a grid system, called a cluster grid, is shown. A cluster represents an organizational unit, i.e., a group of sites that are geographically close. We define two kinds of communications
between sites in a cluster grid. Intra-communication is communication between sites within the same cluster. On the other hand, inter-communication is communication between
sites across clusters. Network bandwidth between sites within a cluster will be larger than across clusters. Therefore, to reduce access latency and to avoid the WAN bandwidth bottleneck in a cluster grid, it is important to reduce the number of inter-communications.
To address this problem, we consider two aspects of inter-communication: job scheduling and the replication mechanism. Consider the case where many authorized users submit jobs to solve data-intensive problems. We want jobs to be executed as fast as possible. Since the size of the data used on a data grid ranges from terabytes to petabytes, scheduling jobs to suitable grid sites is necessary, because data movement between grid sites is time consuming. Scheduling decisions should be based on the appropriate resources a grid site has. Other factors to be considered include CPU workload, computational capability, location of data and network load. If a job is scheduled to a site where the required data are present, the job can process the data at that site without any transmission delay for fetching data from a remote site.
Data replication is another important optimization step for managing large data by replicating it in geographically distributed data stores. Previous replication strategies show that replicating data can offer high data availability and low bandwidth consumption. When users' jobs access a large amount of data from remote sites, a dynamic replica optimizer running at the site tries to store replicas on local storage for possible future repeated requests. If most data reside on the same site where they are needed, the frequency of remote data access decreases. This reduces job execution time and increases the robustness of grid applications. Inter-cluster communications can also be avoided if data within the same cluster are always accessed first.

Fig. 1. A cluster grid architecture.
Our new scheduling policy, called HCS (Hierarchical Cluster-based Scheduling), considers the locations of required data, the access cost and the job queue length of a computing node. HCS uses hierarchical scheduling that takes cluster information into account to reduce the search time for an appropriate computing node. For the cases where data have to be replicated, we develop a replication strategy, called HRS (Hierarchical Replication Strategy). It builds on previous replication strategies and increases the chances of accessing data at a nearby node.
In order to study the complex nature of a typical grid environment and evaluate various replica optimization algorithms, a grid simulator called OptorSim [4] was developed by the EU DataGrid project [3]. Our job scheduling policy and replica strategy are simulated in OptorSim and compared with various other scheduling policies and replica strategies. HCS and HRS successfully reduce data access time and the amount of inter-cluster communication compared with the other combinations in a cluster grid. In addition, we have implemented HCS and HRS on the Taiwan UniGrid platform [23] to observe system performance on a real grid platform that contains many clusters distributed across the Internet.
The rest of this paper is organized as follows: Section 2
gives an overview of previous work on grid replication and
job scheduling. Section 3 introduces our HCS policy and HRS
replication strategy. We show results from simulations of HCS
and HRS in Section 4. The implementation and performance
evaluation is given in Section 5. Finally, Section 6 concludes
the paper and outlines some future research work.
2. Related work
As jobs are data intensive, scheduling issues often involve
effective computation and data management in the data grids.
The replication of data sets is not really a new technique. Data
replication has been around for decades and it is now adapted to
the grid environment. In [5,6], Ranganathan and Foster present six different replica strategies: (1) No replication or caching; (2) Best Client: a replica is created at the best client, i.e., the client with the largest number of requests for the file; (3) Cascading replication: once the popularity of a file exceeds a threshold within a given time interval, a replica is created at the next level on the path to the best client; (4) Plain caching: the client that requests the file stores a copy of the file locally; (5) Caching plus Cascading Replication: a combination of Plain caching and Cascading replication; and (6) Fast Spread: replicas of the file are created at each node along the path to the client.
These strategies are evaluated with three different access patterns: (1) random access, with no locality in the access pattern; (2) data access with a small degree of temporal locality (recently accessed files are likely to be accessed again); and (3) data access with a small degree of temporal and geographical locality (files recently accessed by a site are likely to be accessed by nearby sites).
The simulation results indicate that different access patterns need different replica strategies. With suitable strategies, performance can be improved dramatically in terms of bandwidth savings and access latency. Two strategies, Cascading and Fast Spread, performed best in the simulations compared with the traditional strategies.
Ranganathan and Foster also propose a variety of techniques to intelligently replicate data across sites and assign jobs to sites in a data grid [7]. They conducted a simulator study of the performance of various scheduling algorithms. A scheduler selects a remote site to dispatch a job based on one of four algorithms: (1) JobRandom: schedule a job to a random site; (2) JobLeastLoaded: schedule a job to the site with the least number of jobs waiting to run; (3) JobDataPresent: schedule a job to a site that already holds the requested data, choosing the least loaded such site; and (4) JobLocally: always run jobs locally.
These job scheduling algorithms are combined with three
different replication strategies: (1) DataDoNothing: there is
no replication, (2) DataRandom: once the threshold for a
file is exceeded, a replica is created at a random site, (3)
DataLeastLoad: once the threshold for a file is exceeded, a
replica is created at a site where the least number of jobs are
waiting in the queue.
The simulations show that loosely coupled jobs and remotely
distributed large data sets can be optimized separately. They
recognize the significance of data location in job dispatching
and scheduling in grids. However, this work only considers
jobs that use a single input file and assumes homogeneous
sites with a simplified First-In–First-Out (FIFO) strategy within
local schedulers.
The replication strategies mentioned above [5–7] aim to reduce the message traffic in the network; however, the data mapping is not optimal. In [8], a replication algorithm is tested that uses a cost model to predict whether replicas are worth creating. It is found to be more effective in reducing average job time than the basic case with no replication. The simulation architecture was based on a combination of ring and fat-tree structures [9], in which leaf client nodes ran jobs but higher nodes contained all the storage resources, in contrast to the EU DataGrid architecture.
Like the scheduling algorithms above, the Close-to-Files (CF) algorithm [10] schedules a job to the least loaded processors close to a site where the data are present. Assuming that a job needs only a single input file, it uses an exhaustive search across all combinations of computing sites and data sites to find the combination with the minimum cost, including computation and transmission delay. Data replication is also used to improve performance. The simulation results show that CF achieves good performance compared with the WF (Worst-Fit) job placement algorithm, which places jobs on the execution sites with the largest number of idle processors.
Stork [11] provides a scheduler for data placement in grid environments. The idea is to map data close to computational resources in order to complete computational cycles efficiently. Thus, Stork needs many data management operations, such as locating, accessing, storing, replicating, queuing and checkpointing. To avoid overloading network resources, it handles data transfers among heterogeneous systems by considering job priorities and transfer failures. It can also automatically decide which protocol to use to transfer data from one host to another. Stork uses ClassAds [12] as the mechanism for specifying job and data requirements. Stork's method can be coupled with a task scheduler such as DAGMan (Directed Acyclic Graph Manager) [13] of the Condor project [14].
The work in [15] deals with the problem of integrating scheduling and replication strategies in an approach called the Integrated Replication and Scheduling Strategy (IRS), which decouples job scheduling from data scheduling. At the end of each periodic interval in which jobs are scheduled, the popularity of required files is calculated and then used by the data scheduler to replicate data for the next set of jobs, which may or may not share the same data requirements as the previous set.
BHR (Bandwidth Hierarchy based Replication) [16] extends site-level replica optimization to the network level, based on the hierarchy of bandwidth in the Internet. BHR tries to maximize the amount of required data available within the same region in order to fetch replicas faster, since bandwidth within a region is larger. It records the regional popularity of files. The BHR optimizer selects the best replica for a job; if local storage is already filled up, it deletes replicas that are duplicated at other sites within the region, and if the storage space is still insufficient, it then removes files that are unpopular from the regional point of view. Our replication strategy, HRS, follows BHR's concept of maximizing the hit ratio for required data within a cluster. Unlike BHR, it focuses on avoiding a large number of inter-cluster communications.
The economy-based replication strategy [17] is a long-term optimization technique that aims to minimize the overall cost of file access on a data grid given a finite amount of storage resources. Storage resources try to maximize their profit, while computational resources try to minimize the file purchase cost. In this economic model, data files are regarded as goods in a market and are traded by grid sites according to the file requests of running jobs. When a replica is requested, the site tries to access the cheapest replica in the grid by starting an auction. Storage resources that hold the file locally may reply by bidding a price that estimates the cost of the data transfer. If the storage at a grid site is already filled with replicas, expendable files are selected and deleted to create space for the newly requested data. Within the economic model, a prediction function estimates the future revenue of data files. The authors show improvements over traditional replication techniques through simulations with OptorSim.
In [38], a scalable system called IMAGINE-P2P is proposed, with the capability of supporting distributed index queries on a structured DHT (Distributed Hash Table) P2P network. According to the experimental results, its replication strategy improves the availability of the semantic overlay in dynamic networks.
Two dynamic replication mechanisms for the multi-tier data grid architecture are proposed in [36]: Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU). The SBU algorithm replicates, for the clients, any data file whose access rate exceeds a pre-defined threshold. The main shortcoming of SBU is that it does not consider the relationships among historical access records. To address this problem, ABU aggregates the historical records upward, tier by tier, until they reach the root; a small sketch of this aggregation is given below. With the hierarchical topology, a client searches for files along the path from itself to the root. In addition, the root replicates the needed data at every node, so the access latency can be improved significantly; on the other hand, a lot of storage space is wasted. Storage space utilization thus trades off against access latency.
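As an illustration of ABU's bottom-up aggregation, the following minimal Python sketch reflects our reading of [36]; the node structure and all names are ours, not code from that work:

    from dataclasses import dataclass, field

    @dataclass
    class TierNode:
        access_history: dict = field(default_factory=dict)  # file -> access count
        children: list = field(default_factory=list)
        aggregated: dict = field(default_factory=dict)

    def aggregate_bottom_up(node):
        # Merge the node's own access records with the aggregated tables of
        # its children, so the root ends up seeing the access pattern of the
        # whole tree and can place replicas accordingly.
        table = dict(node.access_history)
        for child in node.children:
            for f, n in aggregate_bottom_up(child).items():
                table[f] = table.get(f, 0) + n
        node.aggregated = table
        return table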
In [37], a centralized dynamic replication algorithm is proposed. It identifies popular files by means of the ABU strategy, analysing the data access history. Furthermore, a novel strategy is designed to determine the average number of file accesses from the access history table.
3. Scheduling and replication algorithms
Since the grid system is distributed, the performance of
networks plays an important role in job scheduling and
dispatching. In this paper, we propose two strategies for job
scheduling and data replication by considering the hierarchical
network structure. HCS (Hierarchical Cluster Scheduling) is
a job scheduling policy and HRS (Hierarchical Replication
Strategy) is a replication method. In this section, we explain
these two mechanisms in detail.
3.1. Monitoring network performance
A poor network performance will limit the efficiency of data transfer and thus increase job execution time. Network performance is therefore an important criterion in evaluating the access cost of required files and in replica selection. However, network performance fluctuates constantly. Predicting network performance can be defined as estimating the future available bandwidth between grid sites across wide-area networks. To make more efficient use of resources, any job scheduling method and replica selection may rely on predictions of network performance. However, the complexity of gathering end-to-end network performance grows with the number of grid sites, $N$. Particular techniques can reduce the complexity to less than $O(N^2)$, as in NWS, where network sensors are organized hierarchically [18]. Because wide-area links are often orders of magnitude slower than local network links, bandwidth within the same cluster is larger than across WANs and is usually not considered in making job scheduling decisions. Therefore, only the available bandwidth between grid clusters, whose traffic must pass through the Internet (shown as the dotted lines in Fig. 3.1), is taken into consideration when making job dispatching decisions.

Fig. 3.1. Cluster-to-cluster bandwidth.

To estimate the available bandwidth between grid clusters, assume each cluster has a leader or gateway that probes the other cluster leaders. When communication crosses clusters, the bottleneck bandwidth of a network path usually occurs in the wide-area links. Although not 100% accurate, probing a single clusterA–clusterB pair is therefore enough to estimate the performance of any connection between clusterA and clusterB. Moreover, when communication stays within a cluster, each grid site observes approximately the same network performance, so the influence of bandwidth within a cluster on scheduling and replication can be ignored. This simple model, which gathers a set of cluster-to-cluster performances, requires $C^2 - C$ probes, where $C$ is the number of clusters, instead of $N^2 - N$.
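As an illustration, the following Python sketch (the names are ours, not part of NWS or any real monitoring toolkit) gathers the $C^2 - C$ cluster-to-cluster measurements through cluster leaders instead of probing all $N^2 - N$ site pairs:

    import itertools
    import random

    def probe_bandwidth(src_leader, dst_leader):
        # Stand-in for a real end-to-end probe (e.g. an NWS measurement);
        # here it just returns a repeatable dummy value in Mbps.
        rng = random.Random(hash((src_leader, dst_leader)))
        return rng.uniform(100, 1000)

    def gather_cluster_bandwidths(leaders):
        # leaders: dict mapping cluster name -> leader (gateway) host.
        # Probing every ordered pair of leaders costs C^2 - C probes; the
        # result is reused for ANY site pair spanning the two clusters.
        return {
            (a, b): probe_bandwidth(leaders[a], leaders[b])
            for a, b in itertools.permutations(leaders, 2)
        }

    bw = gather_cluster_bandwidths({"A": "gwA", "B": "gwB", "C": "gwC"})
    print(len(bw))  # 6 probes for C = 3 clusters, rather than N^2 - N site pairs

Any transfer between two sites in different clusters is then estimated with the single leader-to-leader measurement for that cluster pair.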
3.2. Hypothesis

Network performance can be converted into a cost model. Three factors affect the scheduling of job $j$: $S_j$, $R_j$ and $Q_{jk}$, where $S_j$ is the site to which job $j$ is scheduled, $R_j$ is the list of LFNs (Logical File Names) of the replicas needed by job $j$, and $Q_{jk}$ is the queuing latency for job $j$ at site $S_k$. The replicas needed to execute the job are represented as $R_j = \{LFN_1, LFN_2, \ldots, LFN_n\}$. For a grid site $S_j$, we divide the replicas into three subsets according to the availability of $LFN_i$ in $S_j$. The first subset is the on-site set $R_{on}^j$, which contains all the locally available replicas. The second subset is the intra-site set $R_{intra}^j$, which contains the rest of the replicas that can be found in the local cluster. The third subset is the inter-site set $R_{inter}^j$, which contains the remaining replicas that must be accessed from other clusters. For each $LFN_i$ in $R_{inter}^j$, assume the bandwidth from $S_j$ to $PFN_i$ (the site where $LFN_i$ resides) is $B_{ji}$. Then the time needed to retrieve $PFN_i$ to $S_j$ is $|LFN_i| / B_{ji}$, where $|LFN_i|$ denotes the size of replica $LFN_i$. We define the following cost terms.

Inter-cluster-communication cost ($IrC_x^j$): if job $j$ is dispatched to cluster $x$, the cost of inter-communications is calculated using the cluster-to-cluster bandwidth:

$$IrC_x^j = \frac{1}{\alpha_j} \sum_{LFN_i \in R_{inter}^j} \frac{|LFN_i|}{B_{ji}} \qquad (1)$$

where $\alpha_j$ is a constant reflecting the degree of parallelism in $S_j$ for replica downloading. $IrC_x^j$ represents the time needed to have all the replicas in $R_{inter}^j$ available locally in $S_j$.

Intra-cluster-communication cost ($IaC_{S_j}^j$): for job $j$, the cost of intra-communications at site $S_j$ is represented as the total file size of $R_{intra}^j$, since bandwidth in the local cluster is assumed to be plentiful and roughly the same from site to site:

$$IaC_{S_j}^j = \sum_{LFN_i \in R_{intra}^j} |LFN_i| \qquad (2)$$

Queuing latency ($Q_j$): if job $j$ is going to be scheduled to $S_j$ in cluster $x$, the queuing latency $Q_j$ is the time needed to run all the jobs that are already queued at $S_j$. Therefore,

$$Q_j = \sum_{k=1}^{m} \left( IrC_x^k + IaC_{S_j}^k \right) \qquad (3)$$

where $1, 2, \ldots, m$ are the jobs queued before job $j$.
3.3. HCS (Hierarchical Cluster-based Scheduling) algorithm
Previous job schedulers, except for the Random scheduler, search all resources to find the best one with the lowest cost. HCS improves on traditional schedulers in two aspects.
Fig. 3.2. An example in cluster grid.
Fig. 3.3. HCS job scheduling.
First, HCS takes into account the hierarchical cluster grid structure and all the data replicas owned by a cluster. Fig. 3.2 gives a simple example. The number of seconds on each edge denotes the time required to access a file from another cluster. A grid job requires four files for execution, and the four files are distributed over four clusters. If we schedule the job to the cluster with the highest hit ratio of required replicas (i.e., most of the replicas required by the job are available within the cluster), such as clusterB, the job execution time and the number of inter-cluster communications are reduced, since data access is faster within a cluster (larger bandwidth). In contrast, scheduling the job to a cluster with few of the required replicas, such as clusterD, increases the number of inter-cluster communications and the access latency. However, it is possible that one cluster holds more replicas than the others while the total size of those replicas is smaller, so scheduling a job by the number of replicas alone is inexact. Thus, to distribute jobs to different sites, we propose to schedule jobs based on the cost model described previously.
Second, searching for the best site among a huge number of distributed sites would lead to long latency. HCS uses a hierarchical tree to schedule a job and minimize the overhead of searching for a suitable site, as shown in Fig. 3.3. It is a two-step decision process. The first step selects a cluster to minimize the inter-cluster-communication cost ($IrC_C^j$). Referring back to the example shown in Fig. 3.2, the values of $IrC_C^j$ for each cluster are:

(1) ClusterA needs to access File2 and File4. The best PFNs of File2 and File4 are both from clusterC, which gives the minimum $IrC_{clusterA}^j = 4$ s.

(2) ClusterB has most of the required replicas and only needs to access File3, but its external bandwidth may be congested. Accessing the best PFN of File3 from clusterA, we have $IrC_{clusterB}^j = 5$ s.

(3) ClusterC lacks File1 and File3. The best PFNs of File1 and File3 are in clusterA. We have $IrC_{clusterC}^j = 4$ s.

(4) ClusterD needs to access File1, File2 and File4 for job execution. The latency of moving File1 from clusterA and File2 and File4 from clusterC gives $IrC_{clusterD}^j = 7$ s.

More than one cluster has the minimum value of $IrC_C^j$ (clusterA and clusterC). In this situation, we select one cluster randomly; the job is therefore scheduled onto clusterA or clusterC. This example shows that the cluster with the largest number of matched required replicas may not be the optimal choice.

After the suitable cluster is selected from the cluster grid, the second step selects the best site $S_j$ from the local cluster based on the combined cost of moving replicas into site $S_j$ (the intra-cluster-communication cost) and the waiting time in the queue at site $S_j$ (the queuing latency). The job is scheduled onto the site with the minimum combined cost.
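A minimal sketch of this two-step decision in Python, assuming the cost terms of Section 3.2 are supplied as callables (all names here are ours, not OptorSim code):

    import random

    def hcs_schedule(job, clusters, irc_cost, iac_cost, queue_cost):
        # clusters: dict mapping cluster name -> list of candidate sites.
        # irc_cost(job, cluster) evaluates Eq. (1) with cluster-to-cluster
        # bandwidths; iac_cost and queue_cost evaluate Eqs. (2) and (3).

        # Step 1: pick the cluster with minimum inter-cluster-communication
        # cost; ties are broken randomly, as in the example above.
        costs = {name: irc_cost(job, name) for name in clusters}
        best = min(costs.values())
        chosen = random.choice([n for n, c in costs.items() if c == best])

        # Step 2: within the chosen cluster, pick the site minimising the
        # combined intra-cluster transfer cost and queuing latency.
        return chosen, min(clusters[chosen],
                           key=lambda s: iac_cost(job, s) + queue_cost(s))

Only the chosen cluster's sites are examined in step 2, which is what keeps the search overhead low compared with schedulers that scan every site in the grid.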
3.4. HRS (Hierarchical Replication Strategy) algorithm
After a job is scheduled to $S_j$, the requested data will be transferred to $S_j$ to become replicas. HRS (Hierarchical Replication Strategy) then determines how to handle each such replica, as shown in Fig. 3.4. If there is enough disk space, the replica is stored. Otherwise, if the replica comes from a site in the local cluster, it is only stored in a temporary buffer and will be deleted after the job completes. If the replica comes from another cluster, occupied space is released to make room for it, as presented in Fig. 3.4. The first candidates for removal are replicas that already exist at other sites in the same cluster. After all these locally available replicas are deleted, if the space
is still insufficient, the least frequently used replica becomes the next target for removal, and so on until enough space is available. In short, HRS treats inter-cluster replica transfers as very costly; a replica received from another cluster must therefore be stored locally, so that the other sites in the same cluster will not have to fetch it across clusters again later.

Fig. 3.4. HRS replication strategy.
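The decision flow just described can be summarised in the following sketch; the site and replica objects and the two victim-selection helpers (pick_duplicated_in_cluster, pick_least_frequently_used) are hypothetical names of ours:

    def hrs_store(replica, site):
        # Enough free space: always keep the new replica.
        if site.free_space >= replica.size:
            site.store(replica)
            return

        # No space, and the replica came from within the local cluster:
        # cache it temporarily and discard it when the job completes.
        if replica.from_local_cluster:
            site.temp_buffer.add(replica)
            return

        # The replica crossed a cluster boundary, so it must be kept.
        # Free space by first deleting files duplicated at other sites in
        # the same cluster, then falling back to LFU deletion.
        while site.free_space < replica.size:
            victim = (site.pick_duplicated_in_cluster()
                      or site.pick_least_frequently_used())
            site.delete(victim)
        site.store(replica)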
3.5. HRS vs. BHR (Bandwidth Hierarchy based Replication) [16]

HRS uses the same concept of "network locality" as BHR [16]. The difference between HRS and BHR can be observed in two aspects. First, a required replica within the same cluster always has the top priority in HRS, while BHR searches all sites to find the best replica and makes no distinction between intra-cluster and inter-cluster access. It can therefore be anticipated that HRS will avoid inter-cluster communications and remain stable in a hierarchical network architecture with variable bandwidth. Second, HRS considers the popularity of replicas at the site level, while BHR works at the cluster level.
BHR adds a scheme called the region optimizer. A replica is deleted if its access frequency is smaller than that of a new replica within the region optimizer's records. However, in BHR there is a discrepancy between region optimizers in how they see replica popularity. The number of times a file is requested by jobs may differ from what a region optimizer records. For instance, suppose Region A accesses a replica x stored in Region B. The popularity of x will differ between A and B, because a region optimizer records the access frequency only of files stored in its local region. A file stored in the local region accumulates access history whether it is accessed from remote sites or locally, but the access frequency of a remote replica is not recorded, since it is not stored in the local region.
When deleting replicas, BHR first removes duplicated files according to site-level access frequency. Then the most unpopular files are deleted based on the access history gathered by the region optimizer. Removing unpopular replicas based on the region optimizer is tantamount to applying LFU at the site level. However, the local region optimizer does not keep track of a new replica, because the replica is not yet stored in the region. If the access frequency of a new replica comes from a remote region optimizer, comparing the local and remote regions is meaningless. Thus, the function of the region optimizer is limited.
Furthermore, a region optimizer gathers the number of file requests from jobs run on the sites within its region. Each file request is recorded no matter where the required replica came from, so an access frequency for a new replica can be obtained. When removing unpopular replicas, BHR must sort the access frequencies kept by the region optimizer, and these frequencies cover files stored both within and outside the region. Deleting unpopular replicas is therefore inefficient, since BHR first sorts all the records and then picks out the unpopular replicas, which is time consuming. Thus, in this paper we follow the traditional approach of considering file popularity at the site level: if there is not enough storage space for replication, HRS deletes the least frequently accessed file.
4. Simulations
We use OptorSim to evaluate the performance of different combinations of job scheduling algorithms and replication strategies. OptorSim, a Java-based simulator, was developed within the EU DataGrid (EDG) project [3] to mimic the structure of a real data grid. All of the general components are included, with an emphasis on file access optimization and dynamic replication strategies. We have modified some components and embedded HCS and HRS modules in OptorSim to match our needs exactly. The behavior of OptorSim is set up and controlled using configuration files. We describe in turn the simulation framework, the experiments performed and the results.
4.1. Simulation framework
OptorSim simulates the data grid architecture shown in
Fig. 4.1 for evaluating various replication strategies. The
simulation architecture consists of the following principal
components:
(1) The Resource Broker (RB) accepts job submissions from users and schedules each job to a suitable site according to the scheduling policy, which gathers information to make an optimal decision. For example, it may consider the locations of required replicas, bandwidth, and computational capacity.

(2) A Storage Element (SE) represents a storage resource where grid data are stored. If the storage is full and space is required for a new replica, the SE chooses a victim file for deletion based on a replacement algorithm such as LFU (Least Frequently Used) or LRU (Least Recently Used); a sketch of both policies is given after this list.

(3) A Computing Element (CE) represents a computational resource that processes grid jobs on their required replicas. If any of the required replicas are not stored locally, they are fetched from remote sites, shown as dotted lines in Fig. 4.1(a).

(4) The Replica Manager (RM) at each site manages the data movement between sites and provides an interface to directly access the Replica Catalogue, which provides the LFN–PFN mapping and will be migrated to RLS [19] by EDG [3].
Fig. 4.1. (a) OptorSim simulates data grid architecture (b) an expanded
illustration of grid site.
Table 1
Simulation parameters

Topology parameter                 Value
No. of clusters                    4
No. of sites in each cluster       13
Storage space at each site         50 GB
Connectivity bandwidth             1000 Mbps (WAN), 1000 Mbps (LAN)

Grid job parameter                 Value
No. of jobs                        1000
No. of job types                   50
No. of files accessed per job      15
Size of a single file              1 GB
Total size of files                750 GB
(5) The Replica Optimizer (RO) within the RM contains the replication algorithm, as shown in Fig. 4.1(b). When a file is required by a job, the RO locates the best replica (PFN) for the file's LFN and decides whether to create a new replica of the file locally or only a temporary local cache of the file.
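As a small illustration of the two replacement policies mentioned in component (2), the following sketch (our own code, not OptorSim's) picks a victim file under LRU and LFU, assuming each replica carries a last-access timestamp and an access counter:

    def lru_victim(replicas):
        # replicas: dict mapping file name -> (last_access_time, access_count).
        # LRU evicts the file whose last access is oldest.
        return min(replicas, key=lambda f: replicas[f][0])

    def lfu_victim(replicas):
        # LFU evicts the file with the smallest access count.
        return min(replicas, key=lambda f: replicas[f][1])

    replicas = {"file1": (100.0, 7), "file2": (250.0, 2), "file3": (40.0, 9)}
    print(lru_victim(replicas))  # file3: least recently used
    print(lfu_victim(replicas))  # file2: least frequently used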
To simplify the requirements, data replication approaches in data grid environments commonly assume that the data are read-only. This means that data can be replicated without having to worry about propagating changes back to the master copy. This is a reasonable assumption, as discussed in several scenarios [20,21]. Consequently, all replicas are consistent. To prevent all copies of the same file from being deleted, each file has one master copy that contains the original data samples and cannot be deleted by the replication strategies. The locations of the master copies are defined in the configuration file and can be random.
4.2. Experimental environment
For the experiments, the cluster grid topology of the simulated platform is given in Fig. 4.2; this topology is taken from the simulation architecture of BHR. There are four clusters and each one has an average of 13 sites, all of which have a CE with an associated SE. Node 35 holds all master files at the beginning of the simulation. Each dotted line between two nodes represents inter-cluster communication.

Fig. 4.2. Topology of the simulated platform.

Table 1 specifies the simulation parameters used in our study. All network bandwidth is set to 1000 Mb/s (Mbps), except the bandwidth between the master site and its adjacent router (2000 Mbps). There are 50 job types; each job type requires 15 files to execute and there is no overlap between the sets of files. While running, jobs were randomly picked from the 50 job types based on the probability of each job type and then submitted to the Resource Broker at regular intervals until 1000 jobs had been submitted. Thus, some job types occur frequently, so that certain required replicas are accessed repeatedly.
To make the results easy to interpret, users submit jobs at regular intervals (10 000 ms) until all jobs are done. Files are accessed sequentially within a job, without any particular access pattern. HCS is compared with an OptorSim scheduler, called QAC (Queue Access Cost), that searches all sites for an available CE using a combination of the access cost of the files and the queue length of waiting jobs. QAC performs better than the other schedulers in OptorSim [22]. Additionally, HRS is compared with LRU (Least Recently Used), LFU (Least Frequently Used) and BHR (Bandwidth Hierarchy based Replication). The LRU algorithm always replicates and then deletes the files that have been used least recently. Similarly, LFU deletes the file accessed least frequently in the recent past. We ran simulation experiments combining the two scheduling policies with the four replication strategies. For each experiment, we measure:

(1) Total job execution time (queuing time + access latency + executing time);
(2) Number of inter-communications;
(3) Computing resource usage: the percentage of time that CEs are in the active state during the period of job execution.
4.3. Simulation results and discussion
The following figures show the simulation results to
complete 1000 jobs for each combination of the data replication
and job scheduling algorithms. For replication strategies, LRU
and LFU show similar performance in Fig. 4.3. The same
results are obtained in [2]. We implement BHR replication
strategy into OptorSim. Total job execution time is about 30%
faster using BHR optimizer than LRU and LFU. Our method
takes benefit from network level locality of BHR and we
simplify its replica replacement model. Thus, HRS successfully
accelerates the total execution time up to 40% whether in QAC
or HCS.
Fig. 4.4 illustrates the computing resource usage. It is the
percentage of time that CEs are in active state. It depends on
job turnaround time. In Fig. 4.3, in the same simulation, since
HRS finishes all jobs first, it means the CPUs are not idle most
of the time. Therefore, it has good computing resource usage.
Based on the concept of locality in cluster grids, HCS reduces the inter-communications between different clusters. With a careful replication strategy, HRS can also reduce the number of inter-communications, as shown in Fig. 4.5. The results show that HCS and HRS combined can save bandwidth usage.
4.4. Discussions
To analyse the distribution of jobs, we run a simulation
where there is a grid system with four clusters. Each cluster
has three grid sites and 500 jobs. Fig. 4.6 shows the distribution
of where jobs are executed. Since HCS schedules jobs to certain
specific sites and specific cluster according to inter-cluster
communication costs. Therefore, jobs would be executed on
a cluster with the most needed files. It can be observed that
the same type of jobs is almost executed at the same cluster
as shown in Fig. 4.6(a). Different job type means different
file access patterns. If a cluster executes some specific job
frequently, the probability of having the needed data files will
increase in this cluster. Therefore, it is reasonable to schedule
the same type of job to the same cluster. HCS with HRS strategy
Fig. 4.3. Total job execution times for various job scheduling and replication algorithms.
Fig. 4.4. Computing resource usage for various job scheduling and replication algorithms.
Fig. 4.5. Number of inter-communications.
As Fig. 4.6(a) shows, HCS with the HRS strategy can schedule the same job type to the appropriate cluster, reducing the replication overhead of data transmission.

In contrast, the job distribution under QAC is almost random, as shown in Fig. 4.6(b), because QAC mostly considers queuing cost. One site may end up having executed every job type, which leads to more overhead in transferring file replicas.

HCS might give some specific sites a heavy load if a large number of jobs of a certain type are submitted. However, scheduling jobs to a site or cluster without the needed data would incur more access latency than queuing time, since Internet bandwidth still fails to keep up with computing capacity, especially when the size of the data ranges from terabytes to petabytes.
5. Implementation and performance evaluation
5.1. System implementation framework and environment
We have implemented our job scheduling algorithm and replication strategy in the Taiwan UniGrid platform [23]. Taiwan UniGrid uses the Globus Toolkit [29] as its middleware. There are five clusters in our experimental environment: National Dong Hwa University (NDHU) [24], Academia Sinica [25], National Tsing Hua University (NTHU) [26], Providence University (PU) [27] and Hsing Kuo University (HKU) [28]. Each cluster has several grid sites, as shown in Fig. 5.1. All clusters are connected by the Internet.

Fig. 5.1. Implementation environment.
Fig. 5.2. Implementation architecture.
Fig. 4.6. 500 jobs distribution (a) HCS with HRS (b) QAC with LFU.

Fig. 5.2 depicts the overall system implementation architecture. The NWS (Network Weather Service) [30] is deployed in each cluster. Each cluster header periodically reports its cluster-to-cluster bandwidth information back to the information server, so the information server keeps track of up-to-date cluster-to-cluster bandwidth information. In addition, the information server holds the current resource information of each grid site, such as the number of CPUs, free memory space, system loading, storage space and so on.
A job broker is implemented in our system. The job broker accepts a user's job parameters and prepares for the job scheduling process. In the beginning, the job broker distributes the initial data to each cluster randomly. Afterward, it performs the job scheduling procedure according to the specified algorithm for each job. Taking HCS as an example, the job broker first selects the best cluster, with minimal inter-cluster-communication cost, and then submits the job to the best site within that cluster, with minimal intra-cluster-communication cost and queuing latency. For job submission and execution, the job broker uses the GRAM (Grid Resource Allocation and Management) protocol [31] to assign a job to a specific grid site. A job is defined in RSL (Resource Specification Language) [19] in terms of its binary executable, arguments, standard output, and so forth. All data transmissions use the GridFTP protocol [32]. Furthermore, the specified data replication strategy is invoked when the storage space is exhausted.
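The following sketch summarises the broker's flow. Every helper here (cluster_bandwidths, irc_cost, iac_cost, queue_cost, best_replica_site, gridftp_copy, gram_submit) is a hypothetical wrapper of ours, not an actual Globus or Java CoG API call:

    def broker_submit(job, clusters, info_server):
        # Phase 1: HCS cluster selection, driven by the cluster-to-cluster
        # bandwidths that the information server collects from NWS.
        bw = info_server.cluster_bandwidths()
        cluster = min(clusters, key=lambda c: irc_cost(job, c, bw))

        # Phase 2: pick the site with minimal intra-cluster-communication
        # cost plus queuing latency within the chosen cluster.
        site = min(cluster.sites, key=lambda s: iac_cost(job, s) + queue_cost(s))

        # Stage any missing files with GridFTP, then hand the RSL-described
        # job to the site's GRAM service.
        for lfn in job.missing_files(site):
            gridftp_copy(src=best_replica_site(lfn, bw), dst=site, file=lfn)
        return gram_submit(site, job.rsl())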
5.2. Taiwan UniGrid Simulator
We have implemented a user-friendly interface called
Taiwan UniGrid Simulator by means of Java CoG Toolkit [33].
The Java CoG Toolkit provides a series of programming
interfaces as well as reusable objects in grid services, such as
GSI (Grid Security Infrastructure) [34], GRAM, GridFTP, and
so on. It presents programmers with a mapping between the
Globus Toolkit and Java APIs so as to ease the programming
complexity. As illustrated in Fig. 5.3, a user can select the simulation parameters via the Taiwan UniGrid Simulator, such as the IP address of the job broker, the number of file accesses per job, file size, number of jobs, storage space, job scheduling algorithm, replication strategy and so on.
Fig. 5.3. Taiwan UniGrid simulator.
Fig. 5.4. Cluster-to-cluster bandwidth information.

The up-to-date cluster-to-cluster bandwidth information can be obtained through the Taiwan UniGrid Simulator, as shown in Fig. 5.4. The user can retrieve the recent resource information
of each grid site within a cluster such as system loading, CPU
speed, free memory space, available storage space and so forth.
Figs. 5.5 and 5.6 present the system resource information for each grid site as well as the job submission status. When a simulation completes, the user can receive the experimental results from the job broker through the simulator interface.

Fig. 5.5. Grid resources information.
Fig. 5.6. Job submission status.
Fig. 5.7. Average job execution time.
Fig. 5.8. Average number of inter-communications.
5.3. Experiment results
In our experiment, the job execution time is the
file transmission time plus job processing time. The file
transmission time is the time to move a required file from
a source site to the job execution site by GridFTP. The job
processing time is the queuing time plus the job running time.
The experiment parameters are given in Table 2.

Table 2
Experimental parameters

Parameter                          Value
Number of jobs                     500
Number of files accessed per job   15
Size of a single file              500 MB
Storage space for each site        15 GB

We have
compared HCS with QAC (Queue Access Cost) in terms
of four different data replication strategies, including LRU
(Least Recently Used), LFU (Least Frequently Used), BHR
(Bandwidth Hierarchy-based Replication) and HRS.
The experimental results for average job execution time are presented in Fig. 5.7. The average job execution time is obtained by dividing the overall experiment time by the number of jobs. As mentioned above, the job execution time is the file transmission time plus the job processing time. Since file transmission time is the most influential factor in the execution time of data-intensive jobs in data grids, HCS with HRS reduces the file transmission time effectively by virtue of valid scheduling and proper data replication, as can be seen from the experiments.

The average number of inter-communications per job execution is illustrated in Fig. 5.8. By selecting the best cluster with minimal inter-cluster-communication cost and the best site with minimal intra-cluster-communication cost, HCS with HRS decreases the cost of inter-communications effectively compared with the other job scheduling algorithms and replication strategies.
5.4. Security issues and possible applications
Our implementation is based entirely on the Globus Toolkit, which provides a security infrastructure called GSI (Grid Security Infrastructure). GSI provides authentication and authorization mechanisms for system protection based on X.509 proxy certificates. Therefore, a user with a valid proxy certificate is allowed to access data or replicate a data file.
HCS and HRS can be applied to and embedded in any grid system. For example, the Taiwan Ecogrid project [35]
has deployed many sensors in several ecological areas in Taiwan to gather environmental data and real-time monitoring video for ecological analysis. The data must be replicated and distributed over various areas and grid sites for processing. Obviously, the ecological data are quite large. Ecology research jobs that process large amounts of environmental data would consume considerable network bandwidth and computing resources without an appropriate scheduling algorithm and data replication strategy. HCS with the HRS strategy could be applied to such an ecological grid computing environment to improve system performance.
6. Conclusions and future work
We have addressed the problem of data movement operations in a cluster grid environment. To achieve good network bandwidth utilization and reduce data access time, we consider the inter-cluster communication cost. We propose a job scheduling policy (HCS) that considers not only computational capability and data location but also cluster information, and a dynamic replica optimization strategy (HRS) in which accessing nearby data has higher priority than generating new replicas.
To evaluate the efficiency of our job scheduling policy and replica strategy, we ran the grid simulator OptorSim, configured to represent a real-world data grid testbed, and studied the performance of various replica strategies and algorithm combinations.
The simulation results show, first of all, that HCS and HRS both achieve better performance than the other scheduling policies and replica strategies. Second, particularly good performance is achieved with HCS, where jobs are always scheduled to the cluster holding most of the needed data, together with a separate HRS process at each site for replication management. The experimental data show that HCS scheduling with the HRS replica strategy outperforms the other scheduling algorithms and replication strategies in total job execution time.
We also implemented HCS and HRS on the real Taiwan UniGrid environment. The experimental results are consistent with the simulations, demonstrating the superiority of HCS and HRS in scheduling jobs and managing replication.
In our scheduling algorithm, the probability of scheduling the same type of job to the same cluster is rather high, which may lead to load balancing problems. Weighing system load balancing against the other scheduling factors will be an important future research direction. In addition, the balance between data access time, job execution time and network capabilities also needs further study.
Acknowledgements

This research is supported in part by NSC under contract numbers 93-2213-E-259-013 and 93-2213-E-259-014. The authors would also like to thank the National Centre for High-Performance Computing for providing resources under the national project "Taiwan Knowledge Innovation National Grid".
References
[1] I. Foster, The grid: A new infrastructure for 21st century science, Physics
Today 55 (2002) 42–47.
[2] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data
grid: Towards an architecture for distributed management and analysis of
large scientific datasets, Journal of Network and Computer Applications
23 (2000) 187–200.
[3] The European data grid project.
[4] W.H. Bell, D.G. Cameron, L. Capozza, P. Millar, K. Stockinger, F.
Zini, Simulation of dynamic grid replication strategies in OptorSim,
in: Proceedings of the Third ACM/IEEE International Workshop on Grid
Computing, Grid2002, Baltimore, USA, in: Lecture Notes in Computer
Science, vol. 2536, 2002, pp. 46–57.
[5] I. Foster, K. Ranganathan, Design and evaluation of dynamic replication
strategies for high performance data grids, in: Proceedings of International
Conference on Computing in High Energy and Nuclear Physics, Beijing,
China, September 2001.
[6] I. Foster, K. Ranganathan, Identifying dynamic replication strategies
for high performance data grids, in: Proceedings of 3rd IEEE/ACM
International Workshop on Grid Computing, in: Lecture Notes on
Computer Science, vol. 2242, Denver, USA, 2002, pp. 75–86.
[7] I. Foster, K. Ranganathan, Decoupling computation and data scheduling
in distributed data-intensive applications, in: Proceedings of the 11th
IEEE International Symposium on High Performance Distributed
Computing, HPDC-11, IEEE, CS Press, Edinburgh, UK, 2002,
pp. 352–358.
[8] E. Deelman, H. Lamehamedi, B. Szymanski, S. Zujun, Data replication
strategies in grid environments, in: Proceedings of 5th International
Conference on Algorithms and Architecture for Parallel Processing,
ICA3PP’2002, IEEE Computer Science Press, Bejing, China, 2002,
pp. 378–383.
[9] C.E. Leiserson, Fat-trees: Universal networks for hardware-efficient
supercomputing, IEEE Transactions on Computers C-34 (10) (1985)
892–901.
[10] H.H. Mohamed, D.H.J. Epema, An evaluation of the close-to-files
processor and data co-allocation policy in multiclusters, in: 2004 IEEE
International Conference on Cluster Computing, IEEE Society Press, San
Diego, California, USA, 2004, pp. 287–298.
[11] T. Kosar, M. Livny, Stork: Making data placement a first class citizen
in the grid, in: Proceedings of the 24th International Conference on
Distributed Computing Systems, ICDCS2004, Tokyo, Japan, March 2004,
pp. 342–349.
[12] R. Raman, M. Livny, M. Solomon, Matchmaking: Distributed resource
management for high throughput computing, in: Proceedings of the
Seventh IEEE International Symposium on High Performance Distributed
Computing, HPDC7, Chicago, Illinois, USA, July 1998, pp. 140–146.
[13] Condor Project, The Directed Acyclic Graph Manager (DAGMan). http://www.cs.wisc.edu/condor/dagman/, 2003.
[14] T. Tannenbaum, D. Wright, K. Miller, M. Livny, Condor—a distributed
job scheduler, in: T. Sterling (Ed.), Beowulf Cluster Computing with
Linux, MIT Press, 2001. http://www.cs.wisc.edu/condor/.
[15] A. Chakrabarti, R.A. Dheepak, S. Sengupta, Integration of scheduling and
replication in Data Grids, in: Lecture Notes in Computer Science, vol.
3296, 2004, pp. 375–385.
[16] S.-M. Park, J.-H. Kim, Y.-B. Go, W.-S. Yoon, Dynamic grid replication
strategy based on internet hierarchy, in: International Workshop on Grid
and Cooperative Computing, in: Lecture Notes in Computer Science, vol.
1001, 2003, pp. 1324–1331.
[17] M. Carman, F. Zini, L. Serafini, K. Stockinger, Towards an economy-based optimisation of file access and replication on a data grid,
in: Proceedings of 2nd IEEE/ACM International Symposium on Cluster
Computing and the Grid, CCGrid 2002, IEEE-CS Press, Berlin, Germany,
2002, pp. 340–345.
[18] J. Hayes, N.T. Spring, R. Wolski, The network weather service: A
distributed resource performance forecasting service for metacomputing,
Future Generation Computer Systems 15 (5–6) (1999) 757–768.
[19] Resource Specification Language (RSL), Globus Project – Globus Toolkit
4.0. http://www.globus.org/toolkit/docs/4.0/data/rls/, 2005.
[20] W. Hoschek, F.J. Jaén-Martínez, A. Samar, H. Stockinger, K. Stockinger,
Data management in an international data grid project, in: Proceedings of
First IEEE/ACM International Workshop on Grid Computing, Grid’2000,
in: Lecture Notes in Computer Science, vol. 1971, Bangalore, India,
December 2000, pp. 77–90.
[21] P. Kunszt, E. Laure, H. Stockinger, K. Stockinger, Advanced replica
management with reptor, in: Proceedings of 5th International Conference
on Parallel Processing and Applied Mathematics, PPAM 2003,
Czestochowa, Poland, September 2003, pp. 848–855.
[22] D.G. Cameron, A.P. Millar, C. Nicholson, OptorSim: A simulation tool
for scheduling and replica optimisation in data grids, in: Proceedings of
Computing in High Energy Physics, CHEP 2004, Interlaken, Switzerland,
September 2004.
[23] Taiwan UniGrid Project. http://www.unigrid.org.tw/.
[24] National Dong Hwa University (NDHU). http://www.ndhu.edu.tw/english/index.php.
[25] Academia Sinica. http://www.sinica.edu.tw/main_e.shtml.
[26] National Tsing Hua University (NTHU). http://www.nthu.edu.tw/index-e/index.htm.
[27] Providence University (PU). http://web.pu.edu.tw/~english/.
[28] Hsing Kuo University (HKU). http://english.hku.edu.tw/.
[29] Globus Toolkit. http://www.globus.org/.
[30] Network Weather Service (NWS). http://nws.cs.ucsb.edu/ewiki/.
[31] GRAM (Grid Resource Allocation and Management). http://www.globus.org/toolkit/docs/development/4.2-drafts/execution/index.html.
[32] GridFTP Protocol. http://www.globus.org/toolkit/docs/3.2/gridftp/key/index.html.
[33] CoG Toolkit. http://www.cogkit.org/.
[34] Grid Security Infrastructure (GSI). http://www.globus.org/security/.
[35] Taiwan Ecogrid project. http://ecogrid.nchc.org.tw/.
[36] M. Tang, B.-S. Lee, C.-K. Yeo, X. Tang, Dynamic replication algorithms
for the multi-tier data grid, Future Generation Computer Systems 21
(2005) 775–790.
[37] M. Tang, B.-S. Lee, X. Tang, C.-K. Yeo, The impact of data replication on
job scheduling performance in the data grid, Future Generation Computer
Systems 22 (2006) 254–268.
[38] H. Zhuge, X. Sun, J. Liu, E. Yao, X. Chen, A scalable P2P platform for the
knowledge grid, IEEE Transactions on Knowledge and Data Engineering
17 (12) (2005) 1721–1736.
Ruay-Shiung Chang received his B.S.E.E. degree
from National Taiwan University in 1980 and his Ph.D.
degree in Computer Science from National Tsing Hua
University in 1988. He is now a professor in the
Department of Computer Science and Information
Engineering, National Dong Hwa University. His
research interests include Internet, wireless networks,
and grid computing. Dr Chang is a member of ACM
and IEICE, a senior member of IEEE, and founding
member of ROC Institute of Information and Computing Machinery. Dr Chang
also serves on the advisory council for the Public Interest Registry (www.pir.
org).
Jih-Sheng Chang received his B.E. degree from the
Department of Computer Science and Information
Engineering, I-Shou University, Kaohsiung, Taiwan
in 2002 and his M.S. degree from the Department
of Computer Science and Information Engineering,
National Dong Hwa University, Hualien, Taiwan in
2004. He is a Ph.D. candidate at the Department
of Computer Science and Information Engineering
at National Dong Hwa University currently. His
academic research interests focus on wireless network technology and grid
computing.
Shin-Yi Lin received her M.S. degree from the Department of Computer Science and Information Engineering, National Dong Hwa University, Taiwan in
2005. She is an engineer at Realtek Semiconductor Corp., located in the Hsinchu Science-based Industrial
Park, Hsinchu, Taiwan. Her research interests include
wireless networks and grid computing.