
Future Generation Computer Systems 23 (2007) 846–860

Job scheduling and data replication on data grids

Ruay-Shiung Chang, Jih-Sheng Chang, Shin-Yi Lin
Department of Computer Science and Information Engineering, National Dong Hwa University, Shoufeng, Hualien 974, Taiwan

Received 25 July 2006; received in revised form 23 February 2007; accepted 27 February 2007. Available online 16 March 2007. doi:10.1016/j.future.2007.02.008

Abstract

In data grids, many distributed scientific and engineering applications often require access to large amounts of data (terabytes or petabytes). Data access time depends on bandwidth, especially in a cluster grid, where network bandwidth within a cluster is larger than across clusters. In such an environment, the major bottleneck to fast data access is the high latency of Wide Area Networks (WANs) and the Internet. Effective scheduling in this network architecture can reduce the amount of data transferred across the Internet by dispatching a job to where the needed data are present. Another solution is a data replication mechanism that generates multiple copies of existing data to reduce the need to fetch data from remote sites. To exploit both ideas, in this paper we develop a job scheduling policy, called HCS (Hierarchical Cluster Scheduling), and a dynamic data replication strategy, called HRS (Hierarchical Replication Strategy), to improve data access efficiency in a cluster grid. We simulate our algorithms to evaluate various combinations of data access patterns, and we also implement HCS and HRS in the Taiwan UniGrid environment. The simulation and experiment results show that HCS and HRS successfully reduce data access time and the amount of inter-cluster communication in comparison with other strategies in a cluster grid.

Keywords: Data replication; Data grid; Job scheduling

1. Introduction

In data grids [1,2], distributed scientific and engineering applications often require access to large amounts of data (terabytes or petabytes). Managing this much data in a centralized way is ineffective due to extensive access latency and the load on the central server. Hence, such huge datasets must be partitioned and stored in several physical locations. In a communication environment, the performance of accessing large, distributed data depends on the availability of network bandwidth; slow data access can throttle the performance of data-intensive applications running on grid computers.

In Fig. 1, a simple hierarchical form of a grid system, called a cluster grid, is shown. A cluster represents an organization unit, a group of sites that are geographically close. We define two kinds of communication between sites in a cluster grid. Intra-communication is communication between sites within the same cluster; inter-communication is communication between sites across clusters. Network bandwidth between sites within a cluster will be larger than across clusters. Therefore, to reduce access latency and to avoid the WAN bandwidth bottleneck in a cluster grid, it is important to reduce the number of inter-communications.
To address this problem, we consider two aspects of inter-communication: job scheduling and the replication mechanism. Consider the case where many authorized users submit jobs to solve data-intensive problems. We want jobs to be executed as fast as possible. Since the size of the data used on a Data Grid ranges from terabytes to petabytes, scheduling jobs to suitable grid sites is necessary, because data movement between grid sites is time consuming. Scheduling decisions should be based on the appropriate resources a grid site has. Other factors to be considered include CPU workload, computational capability, location of data, and network load. If a job is scheduled to a site where the required data are present, the job can process the data at that site without any transmission delay for fetching data from a remote site.

Data replication is another important optimization step for managing large data by replicating them in geographically distributed data stores. Previous replication strategies show that replicating data can offer high data availability and low bandwidth consumption.

Fig. 1. A cluster grid architecture.

When users' jobs access a large amount of data from remote sites, a dynamic replica optimizer running at the site tries to store replicas on local storage for possible repeated requests. If most data reside on the same site where they are needed, the frequency of remote data access decreases. This can reduce job execution time and increase the robustness of grid applications. Inter-cluster communication can also be avoided if data within the same cluster are always accessed first.

Our new scheduling policy considers the locations of required data, the access cost, and the job queue length of a computing node. It is called HCS (Hierarchical Cluster-based Scheduling). HCS uses hierarchical scheduling that takes cluster information into account to reduce the search time for an appropriate computing node. When data have to be replicated, we use a replication strategy called HRS (Hierarchical Replication Strategy). It builds on previous replication strategies and increases the chance of accessing data at a nearby node.

In order to study the complex nature of a typical grid environment and evaluate various replica optimization algorithms, a grid simulator, called OptorSim [4], was developed by the EU Data Grid project [3]. Our job scheduling policy and replica strategy are simulated in OptorSim and compared with various scheduling policies and replica strategies. HCS and HRS successfully reduce data access time and the amount of inter-cluster communication in comparison with other combinations in a cluster grid. In addition, we have implemented HCS and HRS on the Taiwan UniGrid platform [23] to observe system performance on a real grid platform that contains many clusters distributed across the Internet.

The rest of this paper is organized as follows: Section 2 gives an overview of previous work on grid replication and job scheduling. Section 3 introduces our HCS policy and HRS replication strategy. We show results from simulations of HCS and HRS in Section 4. The implementation and performance evaluation are given in Section 5. Finally, Section 6 concludes the paper and outlines some future research work.

2. Related work

As jobs are data intensive, scheduling issues often involve effective computation and data management in data grids.
The replication of data sets is not really a new technique; data replication has been around for decades and is now being adapted to the grid environment. In [5,6], Ranganathan and Foster present six different replica strategies: (1) No replication or caching; (2) Best Client: a replica is created at the best client, the one with the largest number of requests for the file; (3) Cascading replication: once the popularity of a file exceeds a threshold in a given time interval, a replica is created at the next level on the path to the best client; (4) Plain caching: the client that requests the file stores a copy locally; (5) Caching plus Cascading Replication: this combines Plain caching and Cascading replication; and (6) Fast Spread: replicas of the file are created at each node along the path to the client. These strategies are evaluated with three different data access patterns: (1) random access, with no locality; (2) access with a small degree of temporal locality (recently accessed files are likely to be accessed again); and (3) access with a small degree of temporal and geographical locality (files recently accessed by a site are likely to be accessed by nearby sites). The simulation results indicate that different access patterns need different replica strategies; with suitable strategies, performance can improve dramatically in bandwidth savings and access latency. Two strategies performed best in the simulations when compared to traditional strategies: Cascading and Fast Spread.

Ranganathan and Foster also propose a variety of techniques to intelligently replicate data across sites and assign jobs to sites in a data grid [7]. They study the performance of various scheduling algorithms using a simulator. A scheduler selects a remote site to dispatch a job to based on one of four algorithms: (1) JobRandom: schedule a job randomly; (2) JobLeastLoaded: schedule a job to the site with the fewest jobs waiting to run; (3) JobDataPresent: schedule a job to a site that has the least load and the requested data; and (4) JobLocally: always run jobs locally. These job scheduling algorithms are combined with three replication strategies: (1) DataDoNothing: no replication; (2) DataRandom: once the threshold for a file is exceeded, a replica is created at a random site; and (3) DataLeastLoad: once the threshold for a file is exceeded, a replica is created at the site where the fewest jobs are waiting in the queue. The simulations show that loosely coupled jobs and remotely distributed large data sets can be optimized separately. They recognize the significance of data location in job dispatching and scheduling in grids. However, this work only considers jobs that use a single input file, and it assumes homogeneous sites with a simplified First-In-First-Out (FIFO) strategy within local schedulers.

The replication strategies mentioned above [5–7] aim to reduce message traffic in the network, but the data mapping is not optimal. In [8], a replication algorithm is tested which uses a cost model to predict whether replicas are worth creating. It is found to be more effective in reducing average job time than the basic case with no replication.
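To make the cost-model idea concrete, here is a minimal sketch (our illustration, not the actual algorithm of [8]) of how a replica-creation decision might weigh the predicted cost of future remote reads against the one-time transfer cost; all names and the prediction input are hypothetical:

```python
# Hypothetical cost-model check: replicate a file only when the predicted
# savings on future remote reads outweigh the one-time transfer cost.

def worth_replicating(file_size_gb, remote_bw_gbps, predicted_accesses):
    """Compare estimated costs (in seconds) with and without a local replica."""
    remote_read = file_size_gb / remote_bw_gbps       # cost of one remote read
    without_replica = predicted_accesses * remote_read
    with_replica = remote_read                        # one transfer, then local
    return with_replica < without_replica

# Example: a 1 GB file expected to be read 5 more times over a 0.1 GB/s link.
print(worth_replicating(1.0, 0.1, 5))   # True: replication pays off
```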
The simulation architecture used was based on a combination of ring and fat-tree structure [9], where leaf client nodes ran jobs but higher nodes contained all the storage resources, in contrast to the EU Data Grid architecture.

Like the previous scheduling algorithms, the Close-to-Files (CF) algorithm [10] schedules a job to the least-loaded processors close to a site where the data are present. Assuming that a job needs only a single input file, it uses an exhaustive algorithm to search all combinations of computing sites and data sites for the combination with the minimum cost, including computation and transmission delay. Data replication is also used to improve performance. The simulation results show that CF achieves good performance compared to the WF (Worst Fit) job placement algorithm, which places jobs on the execution sites with the largest number of idle processors.

Stork [11] provides a scheduler for data placement in grid environments. The idea is to map data close to computational resources to complete computational cycles efficiently. Thus, Stork needs many data management operations, such as locating, accessing, storing, replicating, queuing, and checkpointing. To avoid overloading network resources, it handles data transfers among heterogeneous systems by considering job priorities and transfer failures. It can also automatically decide which protocol to use to transfer data from one host to another. Stork uses ClassAds [12] as the mechanism to specify job and data requirements. Stork's method can be coupled with a task scheduler such as DAGMan (Directed Acyclic Graph Manager) [13] of the Condor project [14].

The work in [15] deals with the problem of integrating scheduling and replication strategies, called the Integrated Replication and Scheduling Strategy (IRS). It decouples job scheduling from data scheduling. At the end of each periodic interval in which jobs are scheduled, the popularity of required files is calculated and then used by the data scheduler to replicate data for the next set of jobs, which may or may not share the same data requirements as the previous set.

BHR (Bandwidth Hierarchy based Replication) [7,16] extends site-level replica optimization to the network level, based on the hierarchy of bandwidth in the Internet. BHR tries to maximize the amount of required data available within the same region in order to fetch replicas faster, since bandwidth within a region is larger. It records the regional popularity of files. The BHR optimizer selects the best replica for a job, and if local storage is already filled up, it deletes replicas that are duplicated at other sites within the region. If storage space is still insufficient, BHR then removes files that are unpopular from the regional point of view. Our replication strategy, HRS, follows BHR's concept of maximizing the hit ratio of required data within a cluster; unlike BHR, ours focuses on avoiding a large number of inter-cluster communications.

The economy-based replication strategy [17] is a long-term optimization technique which aims at minimizing the overall cost of file access on a data grid given a finite amount of storage resources. While storage resources try to maximize profit, computational resources try to minimize the file purchase cost. In this economy model, data files are regarded as goods in a market and are traded by different grid sites according to file requests from running jobs. When requesting a replica, a site tries to access the cheapest replica in the grid by starting an auction.
Storage resources that have the file locally may reply by bidding a price that estimates the cost of the data transfer. If the storage resource at a grid site is already filled with replicas, selecting and deleting expendable files can create space for newly requested data. Within the economy model, a prediction function estimates the future revenue of data files. The authors show improvements over traditional replication techniques by performing simulations with OptorSim.

In [38], a scalable system called IMAGINE-P2P, capable of supporting distributed index queries on a structured DHT (Distributed Hash Table) P2P network, is proposed. According to the experimental results, its replication strategy improves the availability of the semantic overlay for dynamic networks.

Two dynamic replication mechanisms [36] are proposed for the multi-tier Data Grid architecture: Simple Bottom-Up (SBU) and Aggregate Bottom-Up (ABU). The SBU algorithm replicates any data file whose client accesses exceed a pre-defined threshold. The main shortcoming of SBU is that it does not consider the relationships among historical access records. To address this, ABU aggregates the historical records tier by tier, upward until they reach the root. With the hierarchical topology, the search for a file proceeds from the client up to the root. In addition, the root replicates the needed data at every node, so access latency can be improved significantly; on the other hand, a lot of storage space is wasted. Storage space utilization and access latency are thus a trade-off. In [37], a centralized dynamic replication algorithm is proposed. It finds popular files by means of the ABU strategy, analysing the data access history. Furthermore, a novel strategy is designed to determine the average number of file accesses from the access history table.
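As a rough illustration of the ABU idea (our sketch; the node structure and names are hypothetical, not taken from [36]), access records can be folded upward tier by tier so that each parent sees the aggregated history of its subtree:

```python
# Hypothetical sketch of ABU-style bottom-up aggregation of access records.
from collections import Counter

class TierNode:
    def __init__(self, history=None, children=()):
        self.history = Counter(history or {})   # this node's own {file: count}
        self.children = list(children)
        self.aggregated = Counter()

def aggregate_bottom_up(node):
    """Merge each subtree's access records upward, tier by tier, to the root."""
    total = Counter(node.history)
    for child in node.children:
        total += aggregate_bottom_up(child)
    node.aggregated = total          # the upper tier now sees the aggregate
    return total

# The root could then replicate the most popular files downward, e.g.
# aggregate_bottom_up(root).most_common(10)
```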
3. Scheduling and replication algorithms

Since the grid system is distributed, the performance of networks plays an important role in job scheduling and dispatching. In this paper, we propose two strategies for job scheduling and data replication that take the hierarchical network structure into account: HCS (Hierarchical Cluster Scheduling), a job scheduling policy, and HRS (Hierarchical Replication Strategy), a replication method. In this section, we explain these two mechanisms in detail.

3.1. Monitoring network performance

Poor network performance limits the efficiency of data transfer and further increases job execution time. Thus, network performance is an important criterion in evaluating the access cost of required files and in replica selection. However, network performance changes constantly. Predicting network performance can be defined as estimating the future available bandwidth between grid sites across wide-area networks. To use the resources more efficiently, any job scheduling method and replica selection may rely on such predictions.

However, the complexity of gathering end-to-end network performance measurements grows with the number of grid sites N. Some techniques can reduce the complexity to less than O(N^2); for example, NWS organizes its network sensors hierarchically [18]. Because wide-area links are often orders of magnitude slower than local-network links, bandwidth within the same cluster is larger than across WANs and is usually not considered in job scheduling decisions. Therefore, only the available bandwidth between grid clusters, which must pass through the Internet (shown as the dotted lines in Fig. 3.1), is taken into consideration when making job dispatching decisions.

Fig. 3.1. Cluster-to-cluster bandwidth.

To estimate the available bandwidth between grid clusters, assume each cluster has a leader or gateway that probes the other cluster leaders. When communication crosses two clusters, the bottleneck bandwidth of the network path usually occurs on the wide-area links. Although not 100% accurate, probing a single clusterA–clusterB pair is enough to estimate the performance of any connection between clusterA and clusterB. In addition, when communication is within a cluster, each grid site observes approximately the same network performance, so the influence of bandwidth within a cluster on scheduling and replication can be ignored. This simple model, which gathers a set of cluster-to-cluster performances, requires $C^2 - C$ probes, where $C$ is the number of clusters, instead of $N^2 - N$. (For instance, with the simulated platform of Section 4, $C = 4$ clusters of 13 sites each gives 12 probes instead of $52^2 - 52 = 2652$.)

3.2. Hypothesis

Network performance can be converted into a cost model. Three factors affect the scheduling of job $j$: $S_j$, $R_j$, and $Q_{jk}$, where $S_j$ is the site to which job $j$ is scheduled, $R_j$ is the list of LFNs (Logical File Names) of the replicas needed by job $j$, and $Q_{jk}$ is the queuing latency for job $j$ at site $S_k$. The replicas needed to execute the job are represented as $R_j = \{LFN_1, LFN_2, \ldots, LFN_n\}$. For a grid site $S_j$, we divide the replicas into three subsets according to the availability of $LFN_i$ in $S_j$. The first subset is the on-site set $R^j_{on}$, which contains all the locally available replicas. The second subset is the intra-site set $R^j_{intra}$, which contains the replicas that can be found elsewhere in the local cluster. The third subset is the inter-site set $R^j_{inter}$, which contains the remaining replicas that must be accessed from other clusters. For each $LFN_i$ in $R^j_{inter}$, assume the bandwidth from $S_j$ to $PFN_i$ (the site where $LFN_i$ resides) is $B_{ji}$. Then the time needed to retrieve $PFN_i$ to $S_j$ is $|LFN_i|/B_{ji}$, where $|LFN_i|$ denotes the size of replica $LFN_i$. We define the following cost terms.

Inter-cluster-communication cost ($IrC^j_x$): if job $j$ is dispatched to cluster $x$, the cost of inter-communications is calculated using the cluster-to-cluster bandwidth:

$$IrC^j_x = \frac{1}{\alpha_j} \sum_{\text{all } LFN_i \in R^j_{inter}} \frac{|LFN_i|}{B_{ji}} \tag{1}$$

where $\alpha_j$ is a constant reflecting the degree of parallelism in $S_j$ for replica downloading. $IrC^j_x$ represents the time needed to make all the replicas in $R^j_{inter}$ available locally at $S_j$.

Intra-cluster-communication cost ($IaC^j_{S_j}$): for job $j$, the cost of intra-communications at site $S_j$ is represented by the total file size of $R^j_{intra}$, since bandwidth in the local cluster is assumed to be plentiful and roughly the same from site to site:

$$IaC^j_{S_j} = \sum_{\text{all } LFN_i \in R^j_{intra}} |LFN_i| \tag{2}$$

Queuing latency ($Q_j$): if job $j$ is to be scheduled to $S_j$ in cluster $x$, the queuing latency $Q_j$ is the time needed to run all the jobs already queued at $S_j$:

$$Q_j = \sum_{k=1}^{m} \left( IrC^k_x + IaC^k_x \right) \tag{3}$$

where $1, 2, \ldots, m$ are the jobs queued before job $j$.
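A minimal sketch of these three cost terms in code (our illustration; the data structures are hypothetical, with file sizes and bandwidths in consistent units such as Gb and Gb/s):

```python
# Illustrative implementation of Eqs. (1)-(3); names are hypothetical.

def inter_cluster_cost(inter_set, size, bw, alpha=1.0):
    """Eq. (1): time to pull every replica in R_inter^j across clusters,
    scaled by the download-parallelism constant alpha_j."""
    return (1.0 / alpha) * sum(size[f] / bw[f] for f in inter_set)

def intra_cluster_cost(intra_set, size):
    """Eq. (2): total size of R_intra^j; intra-cluster bandwidth is
    assumed plentiful and uniform, so total size alone ranks the cost."""
    return sum(size[f] for f in intra_set)

def queuing_latency(queued_costs):
    """Eq. (3): sum of the (IrC, IaC) pairs of the jobs already queued."""
    return sum(irc + iac for irc, iac in queued_costs)
```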
3.3. HCS (Hierarchical Cluster-based Scheduling) algorithm

Previous job schedulers, except the Random scheduler, search all resources to find the best one with the lowest cost. HCS improves on traditional schedulers in two respects.

First, HCS takes into account the hierarchical cluster grid structure and all the data replicas owned by a cluster. Fig. 3.2 is a simple example. The number of seconds on each edge denotes the time required to access a file from another cluster. A grid job requires four files for execution, and the four files are distributed over four clusters. If we schedule the job to the cluster with the highest hit ratio of required replicas (most of the required replicas available within the cluster), like clusterB, the job execution time and the number of inter-cluster communications are reduced, since data access within a cluster is faster (larger bandwidth). In contrast, if the job is scheduled to a cluster with few of the required replicas, like clusterD, the number of inter-cluster communications and the access latency increase. However, one cluster may hold more of the required replicas than the others while their total size is smaller, so scheduling a job by the number of replicas alone is inexact. Thus, to distribute jobs to different sites, we propose to schedule jobs based on the cost model described previously.

Fig. 3.2. An example in a cluster grid.

Second, searching for the best site among a huge number of distributed sites would lead to long latency. HCS uses a hierarchical tree to schedule a job and minimize the overhead of searching for a suitable site, as shown in Fig. 3.3. It is a two-step decision process.

Fig. 3.3. HCS job scheduling.

The first step selects a cluster to minimize the inter-cluster-communication cost ($IrC^j_C$). Referring back to the example in Fig. 3.2, the values of $IrC^j_C$ for each cluster are: (1) clusterA needs to access File2 and File4; the best PFNs of both are in clusterC, giving a minimum $IrC^j_{clusterA} = 4$ s. (2) clusterB has most of the required replicas and only needs to access File3, but its external bandwidth may be congested; accessing the best PFN of File3 from clusterA gives $IrC^j_{clusterB} = 5$ s. (3) clusterC lacks File1 and File3; their best PFNs are in clusterA, giving $IrC^j_{clusterC} = 4$ s. (4) clusterD needs File1, File2, and File4; the latency of moving File1 from clusterA and File2 and File4 from clusterC is $IrC^j_{clusterD} = 7$ s.

More than one cluster has the minimum value of $IrC^j_C$ (clusterA and clusterC). In this situation, we select one cluster randomly, so the job is scheduled onto clusterA or clusterC. This example shows that the cluster with the largest number of matching required replicas may not be the optimal choice.

After the suitable cluster is selected from the cluster grid, the second step selects the best site $S_j$ within that cluster, based on the combined cost of moving replicas into $S_j$ (intra-cluster-communication cost) and the wait time in the queue at $S_j$ (queuing latency). The job is scheduled onto the site with the minimum combined cost.
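Putting the two steps together, a compact sketch of the HCS decision might look as follows (our illustration, reusing the cost helpers sketched in Section 3.2; the job and cluster structures are hypothetical):

```python
import random

def hcs_schedule(job, clusters):
    """Step 1: pick the cluster with minimum IrC (Eq. (1)), ties at random.
       Step 2: pick the site in it minimizing IaC (Eq. (2)) + Q (Eq. (3))."""
    irc = {c: inter_cluster_cost(job.inter_set(c), job.size, c.bw)
           for c in clusters}          # c.bw: file -> bandwidth B_ji to its holder
    best = min(irc.values())
    cluster = random.choice([c for c, v in irc.items() if v == best])
    return min(cluster.sites,
               key=lambda s: intra_cluster_cost(job.intra_set(s), job.size)
                             + queuing_latency(s.queued_costs))
```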
3.4. HRS (Hierarchical Replication Strategy) algorithm

After a job is scheduled to $S_j$, the requested data are transferred to $S_j$ and become replicas. HRS (Hierarchical Replication Strategy) then determines how to handle each replica, as shown in Fig. 3.4. If there is enough disk space, the replica is stored. Otherwise, if the replica comes from a site in the local cluster, it is only stored in a temporary buffer and is deleted after the job completes. If the replica comes from another cluster, occupied space is released to make room for it. The first candidates for removal are replicas that already exist at other sites in the same cluster. After all such locally available replicas are deleted, if space is still insufficient, the least frequently used replica becomes the next target for removal, and so on until enough space is available. In short, HRS considers inter-cluster replica transfers very costly: successfully received replicas must be stored locally so that the other sites in the same cluster will not have to replicate them again later.

Fig. 3.4. HRS replication strategy.
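The decision flow of Fig. 3.4 can be sketched roughly as follows (our illustration; the storage interface is hypothetical):

```python
def hrs_handle_replica(site, replica, from_local_cluster):
    """Store, cache temporarily, or evict to make room, per Fig. 3.4."""
    if site.free_space() >= replica.size:
        site.store(replica)                  # enough room: keep it
    elif from_local_cluster:
        site.cache_temporarily(replica)      # cheap to refetch: do not evict
    else:
        # Inter-cluster transfers are costly, so make room and keep the replica:
        # 1) evict files duplicated at other sites in the same cluster,
        # 2) then evict least-frequently-used files until the replica fits.
        for victim in site.duplicated_in_cluster():
            if site.free_space() >= replica.size:
                break
            site.delete(victim)
        while site.free_space() < replica.size:
            site.delete(site.least_frequently_used())
        site.store(replica)
```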
3.5. HRS vs. BHR (Bandwidth Hierarchy based Replication) [16]

The HRS replication strategy uses the same concept of "network locality" as BHR [16]. The difference between HRS and BHR can be seen in two aspects. First, a required replica within the same cluster always has top priority in HRS, while BHR searches all sites to find the best replica and makes no distinction between intra-cluster and inter-cluster. HRS can therefore be expected to avoid inter-cluster communications and to remain stable in a hierarchical network architecture with variable bandwidth.

Second, HRS considers the popularity of replicas at the site level, while BHR works at the cluster level. BHR adds a scheme called the region optimizer: a replica is deleted if its access frequency is smaller than that of a new replica within a region optimizer. However, in BHR there is a discrepancy between region optimizers regarding the popularity of replicas; the number of times a file is requested by jobs may differ from what the region optimizer sees. For instance, suppose Region A accesses a replica x stored in Region B. The popularity of x will differ between A and B, because a region optimizer records the access frequency only of files stored in its local region. For files stored in the local region, accesses from remote sites and from the region itself are both recorded; the access frequency of a remote replica, however, is not recorded, since it is not stored in the local region.

BHR's replica deletion first removes duplicated files in terms of site access frequency; the most unpopular files are then deleted based on the access history gathered by the region optimizer. Removing unpopular replicas based on the region optimizer is tantamount to LFU at the site level. However, the local region optimizer does not keep track of a new replica, because it is not yet stored in the region. If the access frequency of a new replica has to come from a remote region optimizer, comparing the local and remote regions is meaningless, so the usefulness of the region optimizer is limited. Furthermore, a region optimizer gathers the number of file requests from jobs run on the sites within its region; each requested file is recorded no matter where the required replica came from, so an access frequency for a new replica can be obtained. When removing unpopular replicas, BHR has to sort the access frequencies from the region optimizer, and these frequencies include files stored both within and outside the region. Unpopular-replica deletion is thus inefficient, since BHR sorts all the records first and then picks out the unpopular replicas, which is time consuming. In this paper we therefore follow the traditional approach of considering file popularity at the site level: if there is not enough storage space for replication, HRS deletes the least frequently accessed file.

4. Simulations

We use OptorSim to evaluate the performance of different combinations of job scheduling algorithms and replication strategies. OptorSim, a Java-based simulator developed by the EDG project [3], was built to mimic the structure of a real Data Grid, including all the general components, with an emphasis on file access optimization and dynamic replication strategies. We have modified some components and embedded HCS and HRS modules in OptorSim to match our needs exactly. The behavior of OptorSim is set up and controlled through configuration files. We describe in turn the simulation framework, the experiments performed, and the results.

4.1. Simulation framework

OptorSim simulates the data grid architecture shown in Fig. 4.1 for evaluating various replication strategies. The simulation architecture consists of the following principal components:

(1) The Resource Broker (RB) accepts job submissions from users and schedules each job to a suitable site according to the scheduling policy, which gathers information to make an optimal decision; for example, it may consider the locations of required replicas, bandwidth, and computational capacity.

(2) A Storage Element (SE) represents a storage resource where grid data are stored. If the storage is full and space is required for a new replica, the SE chooses a victim file for deletion based on a replacement algorithm, such as LFU (Least Frequently Used) or LRU (Least Recently Used).

(3) A Computing Element (CE) represents a computational resource that processes grid jobs using the required replicas as input. If any of the required replicas are not stored locally, they are fetched from remote sites, shown as dotted lines in Fig. 4.1(a).

(4) The Replica Manager (RM) at each site manages data movement between sites and provides an interface for direct access to the Replica Catalogue, which provides the LFN–PFN mapping and will be migrated to RLS [19] by EDG [3].

(5) The Replica Optimizer (RO) within the RM contains the replication algorithm, as shown in Fig. 4.1(b). When a file is required by a job, the RO locates the best replica (PFN) for the file's LFN and decides whether to create a new replica of the file locally or only a temporary local cache of it.

Fig. 4.1. (a) The data grid architecture simulated by OptorSim; (b) an expanded illustration of a grid site.

To simplify the requirements, data replication approaches in Data Grid environments commonly assume that the data are read-only, meaning that files can be replicated without worrying about propagating changes back to the master copy. This is a reasonable assumption, as discussed in several scenarios [20,21]. Consequently, all replicas are consistent. To prevent all copies of a file from being deleted, each file has one master copy that contains the original data samples and cannot be deleted by the replication strategies. The distribution of the master copies is defined in the configuration file and can be random.
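To illustrate the LFN-to-PFN mapping that the Replica Manager consults (component (4) above), here is a toy sketch; this is our illustration, not OptorSim's actual interface:

```python
# Hypothetical LFN -> PFN catalogue; OptorSim's real interfaces differ.

class ReplicaCatalogue:
    def __init__(self):
        self.pfns = {}                         # LFN -> set of physical copies

    def register(self, lfn, pfn):
        self.pfns.setdefault(lfn, set()).add(pfn)

    def best_pfn(self, lfn, bandwidth_to):
        """Return the physical replica reachable over the widest pipe;
        bandwidth_to(pfn) gives the measured bandwidth to that copy's site."""
        return max(self.pfns[lfn], key=bandwidth_to)
```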
4.2. Experimental environment

For the experiments, the cluster grid topology of the simulated platform is given in Fig. 4.2; this topology is taken from the simulation architecture of BHR. There are four clusters, each with an average of 13 sites, and every site has a CE with an associated SE. Node 35 holds all master files at the beginning of the simulation. Each dotted line between two nodes represents inter-cluster communication.

Fig. 4.2. Topology of the simulated platform.

Table 1 specifies the simulation parameters used in our study.

Table 1. Simulation parameters

Topology parameter: Value
No. of clusters: 4
No. of sites in each cluster: 13
Storage space at each site: 50 GB
Connectivity bandwidth: 1000 Mbps (WAN), 1000 Mbps (LAN)

Grid job parameter: Value
No. of jobs: 1000
No. of job types: 50
No. of files accessed per job: 15
Size of single file: 1 GB
Total size of files: 750 GB

All network bandwidth is set to 1000 Mb/s (Mbps), except the bandwidth between the master site and its adjacent router (2000 Mbps). There are 50 job types; each job type requires 15 files to execute, and the file sets of different job types do not overlap. While the simulation runs, jobs are randomly picked from the 50 job types based on the probability of each job and submitted to the Resource Broker until 1000 jobs have been submitted. Thus, some job types occur frequently, so that certain required replicas are accessed repeatedly. In order to interpret the results easily, users submit jobs at regular intervals (10 000 ms) until all jobs are done. Files are accessed sequentially within a job, without any particular access pattern.

HCS is compared with an OptorSim scheduler that searches all sites for an available CE using a combination of the access cost for the files and the queue length of waiting jobs, called QAC (Queue Access Cost); QAC performs better than the other schedulers in OptorSim [22]. Additionally, HRS is compared with LRU (Least Recently Used), LFU (Least Frequently Used), and BHR (Bandwidth Hierarchy based Replication). The LRU algorithm always replicates and then deletes the files that have been used least recently; similarly, LFU deletes the file accessed least frequently in the recent past.
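The two baseline replacement policies differ only in how the eviction victim is chosen; a minimal sketch (ours, with hypothetical bookkeeping fields):

```python
# Hypothetical victim selection for the two baseline replacement policies.

def lru_victim(replicas):
    """LRU: evict the replica whose last access is the oldest."""
    return min(replicas, key=lambda r: r.last_access_time)

def lfu_victim(replicas):
    """LFU: evict the replica accessed least often in the recent past."""
    return min(replicas, key=lambda r: r.recent_access_count)
```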
We ran a set of simulation experiments combining the two scheduling policies with the four replication strategies. For each experiment, we measure: (1) the total job execution time (queuing time + access latency + executing time); (2) the number of inter-communications; and (3) the computing resource usage, i.e., the percentage of time that CEs are in the active state during the job execution period.

4.3. Simulation results and discussion

The following figures show the simulation results for completing 1000 jobs under each combination of the data replication and job scheduling algorithms. Among the replication strategies, LRU and LFU show similar performance in Fig. 4.3; the same result was obtained in [2]. We implemented the BHR replication strategy in OptorSim: total job execution time is about 30% faster using the BHR optimizer than with LRU or LFU. Our method benefits from BHR's network-level locality while simplifying its replica replacement model; thus, HRS accelerates the total execution time by up to 40%, whether under QAC or HCS.

Fig. 4.3. Total job execution times for various job scheduling and replication algorithms.

Fig. 4.4 illustrates the computing resource usage, the percentage of time that CEs are in the active state, which depends on job turnaround time. In the same simulation as Fig. 4.3, HRS finishes all jobs first, meaning the CPUs are not idle most of the time; therefore, it achieves good computing resource usage.

Fig. 4.4. Computing resource usage for various job scheduling and replication algorithms.

Based on the concept of locality in cluster grids, HCS reduces the inter-communications between clusters, and through its careful replication strategy, HRS also reduces the number of inter-communications, as shown in Fig. 4.5. The results show that HCS and HRS combined save bandwidth.

Fig. 4.5. Number of inter-communications.

4.4. Discussion

To analyse the distribution of jobs, we ran a simulation of a grid system with four clusters, each containing three grid sites, with 500 jobs. Fig. 4.6 shows the distribution of where jobs are executed. Since HCS schedules jobs to specific sites and clusters according to inter-cluster communication costs, jobs are executed on the cluster holding most of the needed files. As shown in Fig. 4.6(a), jobs of the same type are almost always executed in the same cluster. Different job types imply different file access patterns: if a cluster executes a specific job type frequently, the probability of having the needed data files in that cluster increases. It is therefore reasonable to schedule the same type of job to the same cluster, and Fig. 4.6(a) shows that HCS with the HRS strategy does so, reducing the replication overhead of data transmission. On the contrary, the job distribution of QAC is almost random, as shown in Fig. 4.6(b), because QAC mostly considers queuing cost; one site may execute every job type, which leads to more overhead in transferring file replicas.

Fig. 4.6. Distribution of 500 jobs: (a) HCS with HRS; (b) QAC with LFU.

HCS might cause some specific sites to carry a heavy load if a large number of jobs of a certain type are submitted. However, scheduling jobs to a site or cluster without the needed data incurs more access latency than queuing time, since Internet bandwidth still fails to keep up with computing capacity, especially when data sizes range from terabytes to petabytes.

5. Implementation and performance evaluation

5.1. System implementation framework and environment

We have implemented our job scheduling algorithm and replication strategy on the Taiwan UniGrid platform [23], which utilizes the Globus Toolkit [29] as its middleware. There are five clusters in our experimental environment: National Dong Hwa University (NDHU) [24], Academia Sinica [25], National Tsing Hua University (NTHU) [26], Providence University (PU) [27], and Hsing Kuo University (HKU) [28]. Each cluster has several grid sites, as shown in Fig. 5.1, and all clusters are connected through the Internet.

Fig. 5.1. Implementation environment.
Fig. 5.2 depicts the overall system implementation architecture. The NWS (Network Weather Service) [30] is deployed in each cluster. Each cluster header periodically reports its cluster-to-cluster bandwidth information back to the information server, which therefore keeps up-to-date cluster-to-cluster bandwidth information. In addition, the information server holds the current resource information of each grid site, such as the number of CPUs, free memory space, system load, storage space, and so on.

Fig. 5.2. Implementation architecture.

A job broker is implemented in our system. The job broker accepts a user's job parameters and prepares the job scheduling process. In the beginning, the job broker initializes the data distribution to each cluster randomly. Afterward, the job broker performs the job scheduling procedure for each job according to the specified algorithm. Taking HCS as an example, the job broker first selects the best cluster, the one with minimal inter-cluster-communication cost, and then submits the job to the best site within that cluster, the one with minimal intra-communication cost plus queuing latency.

For job submission and execution, the job broker uses the GRAM (Grid Resource Allocation and Management) protocol [31] to assign a job to a specific grid site. A job is defined in RSL (Resource Specification Language) [19] in terms of its binary executable, arguments, standard output, and so forth. All data transmissions are carried over the GridFTP protocol [32]. Furthermore, the specified data replication strategy is invoked when the storage space is completely exhausted.
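For illustration, a job description of the kind the broker might hand to GRAM could look like the following GT2-style RSL; the executable and file names are made up:

```
& (executable = "/bin/hostname")
  (arguments = "-f")
  (stdout = "hostname.out")
  (count = 1)
```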
5.2. Taiwan UniGrid Simulator

We have implemented a user-friendly interface called the Taiwan UniGrid Simulator by means of the Java CoG Toolkit [33]. The Java CoG Toolkit provides a series of programming interfaces as well as reusable objects for grid services, such as GSI (Grid Security Infrastructure) [34], GRAM, GridFTP, and so on. It gives programmers a mapping between the Globus Toolkit and Java APIs, easing the programming complexity.

As illustrated in Fig. 5.3, a user can select the simulation parameters via the Taiwan UniGrid Simulator, such as the IP address of the job broker, the number of file accesses per job, file size, number of jobs, storage space, job scheduling algorithm, replication strategy, and so on.

Fig. 5.3. Taiwan UniGrid Simulator.

The up-to-date cluster-to-cluster bandwidth information can be obtained through the Taiwan UniGrid Simulator, as shown in Fig. 5.4. The user can also retrieve the recent resource information of each grid site within a cluster, such as system load, CPU speed, free memory space, available storage space, and so forth. Figs. 5.5 and 5.6 present the system resource information for each grid site as well as the job submission status. When a simulation completes, the user receives the experimental results from the job broker through the simulator interface.

Fig. 5.4. Cluster-to-cluster bandwidth information.
Fig. 5.5. Grid resources information.
Fig. 5.6. Job submission status.

5.3. Experiment results

In our experiment, the job execution time is the file transmission time plus the job processing time. The file transmission time is the time to move a required file from a source site to the job execution site by GridFTP. The job processing time is the queuing time plus the job running time. The experiment parameters are given in Table 2.

Table 2. Experimental parameters

Parameter: Value
Number of jobs: 500
Number of files accessed per job: 15
Size of single file: 500 MB
Storage space for each site: 15 GB

We have compared HCS with QAC (Queue Access Cost) under four different data replication strategies: LRU (Least Recently Used), LFU (Least Frequently Used), BHR (Bandwidth Hierarchy-based Replication), and HRS. The experimental results for average job execution time are presented in Fig. 5.7. The average job execution time for one job is obtained by dividing the overall experimental time by the number of jobs. As mentioned above, the job execution time is the file transmission time plus the job processing time. Since file transmission time is the most important factor influencing job execution time for data-intensive jobs in data grids, HCS with HRS reduces the file transmission time effectively by virtue of valid scheduling and proper data replication, as can be seen from the experiments.

Fig. 5.7. Average job execution time.

The average number of inter-communications per job execution is illustrated in Fig. 5.8. By selecting the best cluster with minimal inter-cluster-communication cost and the best site with minimal intra-cluster-communication cost, HCS with HRS decreases the cost of inter-communications effectively compared with the other job scheduling algorithms and replication strategies.

Fig. 5.8. Average number of inter-communications.

5.4. Security issues and possible applications

Our implementation is based entirely on the Globus Toolkit, which provides a security infrastructure called GSI (Grid Security Infrastructure). GSI provides authentication and authorization mechanisms for system protection based on X.509 proxy certificates; a user with a valid proxy certificate is allowed to access data or replicate a data file.

HCS and HRS can be applied to and embedded in any grid system. For example, the Taiwan Ecogrid project [35] has deployed many sensors in several ecological areas in Taiwan to gather environmental data and real-time monitoring video for ecological analysis. The data must be replicated and distributed over various areas and grid sites for processing, and the size of the ecological data is obviously quite large. Ecology research jobs processing large amounts of environmental data would consume considerable network bandwidth and computing resources without an appropriate scheduling algorithm and data replication strategy. HCS with the HRS strategy could be applied to such an ecological grid computing environment to improve system performance.

6. Conclusions and future work

We have addressed the problem of data movement operations in cluster grid environments. To achieve good network bandwidth utilization and reduce data access time, we consider the inter-cluster communication cost. We propose a job scheduling policy (HCS) that considers not only computational capability and data location but also cluster information, and a dynamic replica optimization strategy (HRS) in which nearby data have higher access priority than the generation of new replicas. To evaluate the efficiency of our job scheduling policy and replica strategy, we ran the grid simulator OptorSim, configured to represent a real-world data grid testbed, and studied the performance of various replica strategies and algorithm combinations.
The simulation results show, first of all, that HCS and HRS both perform better than the other scheduling policies and replica strategies. Second, we achieve particularly good performance with HCS, where jobs are always scheduled to the cluster holding most of the needed data, together with a separate HRS process at each site for replication management. The experimental data show that HCS scheduling with the HRS replica strategy outperforms the other scheduling algorithms and replication strategies in total job execution time. We also implemented HCS and HRS on the real Taiwan UniGrid environment; the experimental results are consistent with the simulations, demonstrating the superiority of HCS and HRS in scheduling jobs and managing replications.

In our scheduling algorithm, the probability of scheduling the same type of job to the same cluster is rather high, which may lead to load balancing problems. Weighing system load balancing against the other scheduling factors will be an important direction for future research. In addition, the balance between data access time, job execution time, and network capabilities also needs further study.

Acknowledgements

This research is supported in part by NSC under contract numbers 93-2213-E-259-013 and 93-2213-E-259-014. The authors would also like to acknowledge the National Centre for High-Performance Computing for providing resources under the national project "Taiwan Knowledge Innovation National Grid".

References

[1] I. Foster, The grid: A new infrastructure for 21st century science, Physics Today 55 (2002) 42–47.
[2] A. Chervenak, I. Foster, C. Kesselman, C. Salisbury, S. Tuecke, The data grid: Towards an architecture for distributed management and analysis of large scientific datasets, Journal of Network and Computer Applications 23 (2000) 187–200.
[3] The European Data Grid project.
[4] W.H. Bell, D.G. Cameron, L. Capozza, P. Millar, K. Stockinger, F. Zini, Simulation of dynamic grid replication strategies in OptorSim, in: Proceedings of the Third ACM/IEEE International Workshop on Grid Computing, Grid2002, Baltimore, USA, in: Lecture Notes in Computer Science, vol. 2536, 2002, pp. 46–57.
[5] I. Foster, K. Ranganathan, Design and evaluation of dynamic replication strategies for high performance data grids, in: Proceedings of the International Conference on Computing in High Energy and Nuclear Physics, Beijing, China, September 2001.
[6] I. Foster, K. Ranganathan, Identifying dynamic replication strategies for high performance data grids, in: Proceedings of the 3rd IEEE/ACM International Workshop on Grid Computing, in: Lecture Notes in Computer Science, vol. 2242, Denver, USA, 2002, pp. 75–86.
[7] I. Foster, K. Ranganathan, Decoupling computation and data scheduling in distributed data-intensive applications, in: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing, HPDC-11, IEEE CS Press, Edinburgh, UK, 2002, pp. 352–358.
[8] E. Deelman, H. Lamehamedi, B. Szymanski, S. Zujun, Data replication strategies in grid environments, in: Proceedings of the 5th International Conference on Algorithms and Architectures for Parallel Processing, ICA3PP'2002, IEEE Computer Society Press, Beijing, China, 2002, pp. 378–383.
[9] C.E. Leiserson, Fat-trees: Universal networks for hardware-efficient supercomputing, IEEE Transactions on Computers C-34 (10) (1985) 892–901.
[10] H.H. Mohamed, D.H.J. Epema, An evaluation of the close-to-files processor and data co-allocation policy in multiclusters, in: 2004 IEEE International Conference on Cluster Computing, IEEE Society Press, San Diego, California, USA, 2004, pp. 287–298.
[11] T. Kosar, M. Livny, Stork: Making data placement a first class citizen in the grid, in: Proceedings of the 24th International Conference on Distributed Computing Systems, ICDCS2004, Tokyo, Japan, March 2004, pp. 342–349.
[12] R. Raman, M. Livny, M. Solomon, Matchmaking: Distributed resource management for high throughput computing, in: Proceedings of the Seventh IEEE International Symposium on High Performance Distributed Computing, HPDC7, Chicago, Illinois, USA, July 1998, pp. 140–146.
[13] Condor Project, The Directed Acyclic Graph Manager (DAGMan). http://www.cs.wisc.edu/condor/dagman/, 2003.
[14] T. Tannenbaum, D. Wright, K. Miller, M. Livny, Condor - a distributed job scheduler, in: T. Sterling (Ed.), Beowulf Cluster Computing with Linux, MIT Press, 2001. http://www.cs.wisc.edu/condor/.
[15] A. Chakrabarti, R.A. Dheepak, S. Sengupta, Integration of scheduling and replication in data grids, in: Lecture Notes in Computer Science, vol. 3296, 2004, pp. 375–385.
[16] S.-M. Park, J.-H. Kim, Y.-B. Go, W.-S. Yoon, Dynamic grid replication strategy based on internet hierarchy, in: International Workshop on Grid and Cooperative Computing, in: Lecture Notes in Computer Science, vol. 1001, 2003, pp. 1324–1331.
[17] M. Carman, F. Zini, L. Serafini, K. Stockinger, Towards an economy-based optimisation of file access and replication on a data grid, in: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, CCGrid 2002, IEEE CS Press, Berlin, Germany, 2002, pp. 340–345.
[18] J. Hayes, N.T. Spring, R. Wolski, The network weather service: A distributed resource performance forecasting service for metacomputing, Future Generation Computer Systems 15 (5–6) (1999) 757–768.
[19] Resource Specification Language (RSL), Globus Project - Globus Toolkit 4.0. http://www.globus.org/toolkit/docs/4.0/data/rls/, 2005.
[20] W. Hoschek, F.J. Jaén-Martínez, A. Samar, H. Stockinger, K. Stockinger, Data management in an international data grid project, in: Proceedings of the First IEEE/ACM International Workshop on Grid Computing, Grid'2000, in: Lecture Notes in Computer Science, vol. 1971, Bangalore, India, December 2000, pp. 77–90.
[21] P. Kunszt, E. Laure, H. Stockinger, K. Stockinger, Advanced replica management with Reptor, in: Proceedings of the 5th International Conference on Parallel Processing and Applied Mathematics, PPAM 2003, Czestochowa, Poland, September 2003, pp. 848–855.
[22] D.G. Cameron, A.P. Millar, C. Nicholson, OptorSim: A simulation tool for scheduling and replica optimisation in data grids, in: Proceedings of Computing in High Energy Physics, CHEP 2004, Interlaken, Switzerland, September 2004.
[23] Taiwan UniGrid Project. http://www.unigrid.org.tw/.
[24] National Dong Hwa University (NDHU). http://www.ndhu.edu.tw/english/index.php.
[25] Academia Sinica. http://www.sinica.edu.tw/main_e.shtml.
[26] National Tsing Hua University (NTHU). http://www.nthu.edu.tw/index-e/index.htm.
[27] Providence University (PU). http://web.pu.edu.tw/~english/.
[28] Hsing Kuo University (HKU). http://english.hku.edu.tw/.
[29] Globus Toolkit. http://www.globus.org/.
[30] Network Weather Service (NWS). http://nws.cs.ucsb.edu/ewiki/.
[31] GRAM (Grid Resource Allocation and Management). http://www.globus.org/toolkit/docs/development/4.2-drafts/execution/index.html.
[32] GridFTP Protocol. http://www.globus.org/toolkit/docs/3.2/gridftp/key/index.html.
[33] CoG Toolkit. http://www.cogkit.org/.
[34] Grid Security Infrastructure (GSI). http://www.globus.org/security/.
[35] Taiwan Ecogrid Project. http://ecogrid.nchc.org.tw/.
[36] M. Tang, B.-S. Lee, C.-K. Yeo, X. Tang, Dynamic replication algorithms for the multi-tier data grid, Future Generation Computer Systems 21 (2005) 775–790.
[37] M. Tang, B.-S. Lee, X. Tang, C.-K. Yeo, The impact of data replication on job scheduling performance in the data grid, Future Generation Computer Systems 22 (2006) 254–268.
[38] H. Zhuge, X. Sun, J. Liu, E. Yao, X. Chen, A scalable P2P platform for the knowledge grid, IEEE Transactions on Knowledge and Data Engineering 17 (12) (2005) 1721–1736.

Ruay-Shiung Chang received his B.S.E.E. degree from National Taiwan University in 1980 and his Ph.D. degree in Computer Science from National Tsing Hua University in 1988. He is now a professor in the Department of Computer Science and Information Engineering, National Dong Hwa University. His research interests include the Internet, wireless networks, and grid computing. Dr. Chang is a member of ACM and IEICE, a senior member of IEEE, and a founding member of the ROC Institute of Information and Computing Machinery. Dr. Chang also serves on the advisory council for the Public Interest Registry (www.pir.org).

Jih-Sheng Chang received his B.E. degree from the Department of Computer Science and Information Engineering, I-Shou University, Kaohsiung, Taiwan in 2002 and his M.S. degree from the Department of Computer Science and Information Engineering, National Dong Hwa University, Hualien, Taiwan in 2004. He is currently a Ph.D. candidate in the Department of Computer Science and Information Engineering at National Dong Hwa University. His academic research interests focus on wireless network technology and grid computing.

Shin-Yi Lin received her M.S. degree from the Department of Computer Science and Information Engineering, National Dong Hwa University, Taiwan in 2005. She is an engineer at Realtek Semiconductor Corp., located in the Hsinchu Science-based Industrial Park, Hsinchu, Taiwan. Her research interests include wireless networks and grid computing.