INFN/TC-xx-yy
17th April 2001

Workload Management on a Data Grid:
a review of current technology

Cosimo Anglano, Stefano Barale, Stefano Beco, Salvatore Cavalieri,
Flavia Donno, Luciano Gaido, Antonia Ghiselli, Francesco Giacomini,
Andrea Guarise, Stefano Lusso, Ludek Matyska, Salvatore Monforte,
Francesco Pacini, Francesco Prelz, Miroslav Ruda, Massimo Sgaravatto,
Albert Werbrouck, Zhen Xie

Università del Piemonte Orientale, INFN, Sezione di Torino,
DATAMAT Ingegneria dei Sistemi S.p.A., Università degli Studi di Catania,
INFN, Sezione di Catania, INFN, Sezione di Pisa, INFN, CNAF,
Masaryk University, CESNET,
INFN, Sezione di Milano,
INFN, Sezione di Padova,
Università degli Studi di Torino

Abstract

We report on the status of current technology in the fields of job submission and scheduling (workload management) in a world-wide data grid environment.

PACS: 89.80

Published by SIS-Pubblicazioni
Laboratori Nazionali di Frascati
Contents

1 Introduction ... 4

2 Scheduling technology ... 6
2.1 Scheduling Approaches ... 6
2.1.1 Scheduler organization ... 7
2.1.2 Scheduling policy ... 8
2.1.3 State estimation technique ... 8
2.2 Evaluation ... 9

3 Globus ... 11
3.1 Introduction ... 11
3.2 Security services ... 12
3.3 The Globus Security Infrastructure (GSI) ... 12
3.3.1 Operation of the GSI model ... 12
3.3.2 GSI advantages ... 14
3.3.3 GSI shortcomings and ways to address them ... 14
3.4 Information Services for the Grid ... 15
3.5 The Globus Grid Information Service ... 16
3.5.1 Data Design ... 16
3.5.2 Schema Design ... 17
3.5.3 Extending the GRIS ... 18
3.5.4 Service Reliability ... 18
3.5.5 Data Caching ... 18
3.6 Globus Services for Resource Management ... 22
3.6.1 Globus resource management architecture ... 22
3.7 GRAM ... 23
3.8 GRAM Client API ... 26
3.9 GARA ... 27
3.9.1 Overview ... 27
3.9.2 Network Reservations ... 28
3.9.3 Comments on Network Reservations ... 28
3.10 Globus Executable Management ... 29
3.11 Heartbeat Monitor ... 29

4 Condor ... 31
4.1 Introduction ... 31
4.2 Overview of the Condor system ... 32
4.3 Condor Classified Advertisements ... 34
4.4 Remote system calls ... 36
4.5 Checkpointing ... 37
4.6 Vanilla Jobs ... 38
4.7 Condor Flocking ... 38
4.8 Parallel Applications in Condor ... 39
4.9 Inter-Job Dependencies (DAGMAN) ... 39
4.10 Experiences with Condor ... 40
4.11 Condor-G ... 42
4.12 Condor GlideIn ... 43
Chapter 1
Introduction
The management of processor time, memory, network, storage, and other resources in a computational grid is clearly fundamental. The Workload Management System (WMS) is the component of the Grid middleware that is responsible for managing the Grid resources in such a way that applications are conveniently, efficiently and
effectively executed. To carry out its tasks, the WMS interacts with other components
of the grid middleware, such as the security module to perform user authentication and
authorization, and the resource access module that provides the interface to the local resource management systems (e.g., queueing systems and local O.S. schedulers). From
the users’ point of view, resource management is (or should be) completely transparent,
in the sense that their interaction with the WMS should be limited just to the description,
via a high-level, user-oriented specification language, of the resource requirements of the
submitted job. It is the responsibility of the WMS to translate these abstract resource requirements into a set of actual resources, taken from the overall grid resource pool, for which
the user has access permission.
Typically, a WMS encompasses three main modules, namely a user access module,
a resource information module, and a resource handling module. The resource handling
module performs allocation, reclamation, and naming of resources. Allocation is the ability to identify the (set of) resource(s) that best match the job requirements without violating possible resource limits (scheduling), and to control job execution on these resources.
Reclamation is the ability to reclaim a resource when the job that was using it terminates
its execution. Finally, naming is the ability to assign symbolic names to resources so
that jobs and schedulers may identify the resources required to carry out the computation.
To carry out its tasks, the resource handling module accesses individual grid resources by
interacting with the local managers of these resources (possibly via a set of standardized
interfaces).
The resource handling module strongly interacts with the resource information
module, whose purpose is to collect and publish the information concerning the identity,
the characteristics, and the status of grid resources, so that meaningful scheduling decisions may be made. The resource information module typically provides both resource
dissemination and discovery facilities. The resource dissemination service provides information about the various available resources, and is used by a resource to advertise its
presence on the grid so that suitable applications may use it. Resource discovery is the
mechanism by which the information concerning available Grid resources is sought (and
hopefully found) when a scheduler has to find a suitable set of resources for the execution
of a user application.
Finally, the user access module provides users (or applications) with basic access
functionalities to the main WMS service, namely resource handling. Application resource
requests are described using a job description language that is parsed by an interpreter into
the internal format used by the other WMS components. This module usually provides a
user interface that includes commands to start and terminate the execution of a job, and
to inspect the status of its computation.
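To make the role of the job description language concrete, the following sketch shows how a toy "key = value" description might be parsed and matched against a resource pool. All attribute names, host names and the matching rule are hypothetical illustrations, not the syntax of any actual WMS; real systems use much richer languages (e.g., the Condor Classified Advertisements discussed in Chapter 4).

```python
# Hypothetical sketch: translating a user-level job description into
# a set of matching resources. Names and attributes are illustrative only.

def parse_job_description(text):
    """Parse 'key = value' lines into a requirements dictionary."""
    reqs = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition("=")
        reqs[key.strip()] = value.strip()
    return reqs

def match_resources(reqs, pool):
    """Return the names of resources satisfying the requirements."""
    matches = []
    for res in pool:
        if (res["arch"] == reqs["arch"]
                and res["memory_mb"] >= int(reqs["min_memory_mb"])):
            matches.append(res["name"])
    return matches

description = """
arch = i686
min_memory_mb = 256
"""

pool = [
    {"name": "cn01.example.org", "arch": "i686", "memory_mb": 512},
    {"name": "cn02.example.org", "arch": "alpha", "memory_mb": 1024},
]

print(match_resources(parse_job_description(description), pool))
```

The translation step hides from the user both the contents of the resource pool and the matching logic, which is the transparency property described above.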
Chapter 2

Scheduling technology
The scheduler is one of the most critical components of the resource management systems, since it has the responsibility of assigning resources to jobs in such a way that the
application requirements are met, and of ensuring that possible resource usage limits are
not exceeded. Although scheduling is a traditional area of computer science research,
and as such many scheduling techniques for various computing systems (ranging from
uniprocessors [47] to multiprocessors [24] and distributed systems [37,38,9]) have been
devised, the particular characteristics of Grids make traditional schedulers inappropriate.
As a matter of fact, while in traditional computing systems all the resources and jobs in
the system are under the direct control of the scheduler, Grid resources are geographically
distributed, heterogeneous in nature, owned by different individuals or organizations with
their own scheduling policies, and have different access cost models with dynamically varying
loads and availability conditions. The lack of centralized ownership and control, together
with the presence of many competing users submitting jobs potentially very different
from each other, make the scheduling task much harder than for traditional computing
systems. Many ongoing research projects [16,20,32,14,18,35] are working on the development of Grid schedulers able to deal with the above problems, but a complete solution
is still lacking. By looking at these partial solutions, however, it is possible to identify those
characteristics that should be present in a fully-functional Grid scheduler.
2.1 Scheduling Approaches
The importance of devising suitable scheduling techniques for computational grids has
been known for a long time, as witnessed by the intensive research activity carried out in
this field [16,20,32,14,18,35]. Although at the time of this writing only partial solutions have been devised, the above research efforts have converged on a common model
of grid schedulers (SuperSchedulers), that has been published [43] as a Grid Forum [2]
specification. The above specification is, of course, only abstract, but its presentation may
help to clarify the issues involved in grid scheduling. According to this specification, the
first step that has to be done when a scheduling decision needs to be made is resource discovery. As a matter of fact, as mentioned before, a grid scheduler has neither
complete knowledge of nor direct control over the available resources. Consequently, before
choosing the set of resources that best satisfy the needs of the job at hand, the scheduler has to (a) discover which resources are available and their characteristics, and (b)
verify whether the user submitting the job has access permission to the above resources
or not. In the second step, the scheduler has to select, among the resources identified in
the resource discovery step, those that best match the application performance requirements. In the third and final step, the scheduler has to interact with the local resource
manager in order to start the execution of the job, or to plan for its future execution (e.g.,
it may use advance reservation facilities if available). All the existing Grid schedulers
fit more or less in the model described before, and can be classified according to three
factors, namely their organization (that may be centralized, hierarchical, or distributed),
their scheduling policy (that may optimize either system or application performance), and
the state estimation technique they use to construct predictive models of application performance. In the following subsections we review the schedulers developed as part of
very influential Grid resource management systems, namely Condor [32], MSHN [27],
AppLeS [14], Ninf [35], Javelin [36], Darwin [19], Nimrod/G [16], and NetSolve [18].
2.1.1 Scheduler organization
The schedulers used in Condor and MSHN adopt a centralized organization, that is, scheduling is under the control of a single entity that receives and processes all the allocation
requests coming from the Grid users. While this approach has the advantage of
providing the scheduler with a global, system-wide view of the status of submitted jobs and
available resources, so that optimal scheduling decisions are possible, it scales poorly
and tolerates failures badly: a centralized scheduler represents both a performance bottleneck
and a single point of failure.
In AppLeS, Ninf, Javelin, and NetSolve, a distributed organization has been adopted,
that is scheduling decisions are delegated to individual applications and resources. In
particular, each application is free to choose (according to suitable policies) the set of
resources that better fit its needs, and each resource is free to decide the schedule of the
applications submitted to it. In this approach there are no bottlenecks and single points
of failure but, since scheduling decisions are based on local knowledge only, resource
allocation is in general sub-optimal.
Finally, Darwin and Nimrod/G adopt a hierarchical approach where scheduling
responsibilities are distributed across a hierarchy of schedulers. Schedulers belonging
to the higher levels of the hierarchy make scheduling decisions concerning larger sets
of resources (e.g., the resources in a given continent), while lower-level schedulers are
responsible for smaller resource ensembles (e.g., the resources in a given state). Finally,
at the bottom level of the hierarchy are the schedulers that schedule individual resources.
The hierarchical approach tries to overcome the drawbacks of the centralized and the
distributed approach, while at the same time keeping their advantages.
2.1.2 Scheduling policy
The other important feature of a grid scheduler is the scheduling policy it adopts. The
schedulers of Condor, Darwin, and MSHN adopt a system-oriented policy, aimed at optimizing system performance metrics such as system throughput or resource utilization.
Given the rather vast body of research on system-oriented scheduling policies for traditional (distributed) computing systems, it is natural to try to re-use some of these policies
(e.g., FCFS, Random, or Backfilling as proposed in [26]). However, as discussed in [28],
the need to deal with resource co-allocation and advance reservation requires the development of new system-oriented scheduling policies. Recent research efforts in this
direction [28] have shown promising results, but we believe that additional research is
necessary in this field.
At the other end of the spectrum, we find systems like AppLeS, Javelin, NetSolve,
Nimrod/G, and Ninf, that adopt application-oriented scheduling policies. As a matter of
fact, as discussed in [13], although in several application domains high-throughput schedulers are certainly necessary, there is a complementary need of scheduling techniques able
to maximize user performance. That is, the machines used to execute a given application
should be chosen in such a way that its performance is maximized, possibly disregarding
the overall system performance.
2.1.3 State estimation technique
In order to obtain satisfactory performance, a scheduler must employ predictive models
to evaluate the performance of the application or of the system, and use this information to determine the allocation that results in the best performance. Condor, Darwin, Nimrod/G, Ninf, and Javelin adopt non-predictive schedulers. These schedulers predict the
application (or system) performance under a given schedule by assuming that the current resource status will not change during the execution of the applications. Non-predictive
techniques, however, may result in performance much worse than expected because of the
possible presence of contention effects on the resources chosen for the execution of an
application. AppLeS, NetSolve, and MSHN address this problem by adopting predictive
schedulers, that use either heuristic or statistical techniques to predict possible changes in
the resource status, so that pro-active scheduling decisions may be made. However, while
predictive techniques have the potential of ensuring better application performance, they
usually require a higher computational cost than their non-predictive counterparts.
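As a minimal illustration of a statistical predictor, the following sketch forecasts the next value of a resource-load time series with an exponentially weighted moving average; the smoothing factor is an arbitrary choice for the example, and none of the cited systems is claimed to use exactly this formula.

```python
# Minimal sketch of a predictive state estimator: an exponentially
# weighted moving average over past load measurements, in the spirit
# of the statistical techniques mentioned above. The smoothing factor
# alpha is an illustrative choice.

def ewma_forecast(samples, alpha=0.5):
    """Predict the next value from a history of measurements:
    recent samples weigh more than old ones."""
    forecast = samples[0]
    for s in samples[1:]:
        forecast = alpha * s + (1 - alpha) * forecast
    return forecast

history = [0.2, 0.4, 0.8, 0.6]   # e.g., CPU load over recent intervals
print(round(ewma_forecast(history), 3))
```

A non-predictive scheduler would instead simply take the last sample (here 0.6) as its estimate of the future load, which is exactly the "current status will not change" assumption criticized above.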
2.2 Evaluation
By looking at the scheduling needs of typical Grid users, we observe that a suitable Grid
scheduler should exhibit several properties not found in any of the currently available
schedulers, and in particular:
Distributed organization. If several user communities co-exist in a Grid, it is reasonable to assume that each of them will want to use a scheduler that best fits its
particular needs (community scheduler). However, when a number of independent
schedulers is operating simultaneously, a lack of coordination among their actions
may result in conflicting and performance-hampering decisions. The need for coordination among these peer, independent schedulers naturally calls for a distributed
organization.
Predictive state estimation, in order to deliver adequate performance even in the face
of dynamic variations of the resource status.
Ability to interact with the resource information system. At the moment, all the
existing schedulers require that the user specifies the list of the machines that (s)he
has permission to use. However, a fully functional grid scheduler should be able to
autonomously find this information by interacting with the grid-wide information
service.
Ability to optimize both system and application performance, depending on the
needs of Grid users. As a matter of fact, it is very likely that in a Grid users needing
high-throughput for batches of independent jobs have to co-exist with users requiring low response times for individual applications. In this case, neither a system-oriented nor an application-oriented scheduling policy would be appropriate.
Submission reliability: Grids are characterized by extreme resource volatility,
that is, the set of available resources may dynamically change during the lifetime of
an application. The scheduler should be able to resubmit, without requiring user
intervention, an application whose execution cannot continue as a consequence of the
failure or unavailability of the machine(s) on which it is running.
Allocation fairness. In a realistic system different users will have different priorities, which determine the amount of Grid resources allocated to their applications. At
the moment, no Grid scheduler provides for mechanisms to specify the amount of
resources assigned to users, and enforce such allocations, although some preliminary investigations on approaches based on computational economy [48,17] have
shown promising results.
As a consequence of the above considerations, it is our belief that a scheduler suitable for a
realistic Grid, where different user communities compete for resources, is still lacking.
We will now move on to consider in higher detail some of the existing technologies
on which a suitable grid scheduler can be built.
Chapter 3

Globus

3.1 Introduction
The software toolkit developed within the Globus project ([3]) comprises a set of components that implement basic services for security, resource location, resource management,
data access, communication, etc., which facilitate the creation of usable Grids. Rather
than providing a uniform, “monolithic” programming model, the Globus toolkit provides
a “bag of services”, from which developers can select to build custom-tailored tools or
applications on top of them. These services are distinct and have well defined interfaces,
and therefore can be integrated into applications/tools in an incremental fashion.
This layered Globus architecture is illustrated in Figure 3.1. The first layer, the
Grid Fabric, comprises the underlying systems, computers, operating systems, local resource management systems, storage systems, etc. The components of the Grid fabric
are integrated by Grid Services. These are a set of modules, providing specific services
(information services, resource management services, remote data access services, etc.):
each of these modules has a well-defined interface, which higher level services use to
invoke the relevant mechanisms, and provides an implementation, which uses low-level
operations to implement these functionalities. Examples of the services provided by the
Globus toolkit are:
The Grid Security Infrastructure, GSI, a library for providing generic security services for applications that will be run on the Grid resources, which might be governed by disparate security policies (see Section 3.2).
The Grid Information Service, GIS, also known as the Metacomputing Directory
Service, MDS, which provides middleware information in a common interface (see
Section 3.4).
Figure 3.1: Layered structure of the Globus toolkit.
The Globus Resource Allocation Manager (GRAM), a basic library service that provides a common user interface so that users can submit jobs to multiple machines
on the Grid fabric (see Section 3.6).
Application toolkits use Grid services to provide higher-level capabilities, often targeted to specific classes of applications. Examples include remote job submission commands (built on top of the GRAM service), MPICH-G2, a Grid-enabled implementation
of the Message Passing Interface (MPI), toolkits to support distributed management of
large datasets, etc.
The fourth layer comprises the custom, tailored applications (e.g. HEP, EOS, biology applications, etc.) built on services provided by the other three layers. The Globus
toolkit uses standards whenever possible, for both interfaces and implementations.
3.2 Security services
In this section we evaluate what is possibly the most mature part of the Globus package,
namely the services that provide user authentication and authorization.
3.3 The Globus Security Infrastructure (GSI)
3.3.1 Operation of the GSI model
In order to describe the workings of GSI we will refer to Figure 3.2.
First of all, GSI is based on an implementation of the standard (RFC 2078/2743)
Generic Security Service Application Program Interface (GSS-API). This implementation
Figure 3.2: Basic elements of the GSI one-time-login model. (The requesting user holds a short-lived proxy certificate, signed by the user, together with the CA-signed user certificate; the service provider holds the CA certificate, the signing policy stating which certificate subjects the CA may sign, and the gridmap file mapping certificate subjects to local UIDs.)
is based on X509 certificates and is implemented on top of the OpenSSL[4] library.
The GSS-API requires the ability to pass user authentication information to a remote
site so that further authenticated connections can be established (one-time login). The
entity that is empowered to act as the user at the remote site is called a “proxy”. In the
Globus implementation, a user “proxy” is generated by first creating a fresh certificate/key
pair, which is associated with the certificate of the user who is supposed to be represented
by the proxy. The certificate has a short (default value of 1 day) expiration time and is
signed by the user. The user certificate is signed by a Certificate Authority (CA) that is
trusted at the remote end.
The remote end (usually at some service provider’s site) is able to verify the proxy
certificate by descending the certificate signature chain, and thus authenticate the certificate signer. The signer’s identity is established by trusting the CA, and in particular by
understanding and accepting its certificate issuing policy.
The last remaining step is user authorization: the requesting user is granted access
and mapped to the appropriate resource provider’s identifiers by checking the proxy (or
user) certificate subject (X.500 DN) and looking it up in a list (the so-called gridmap file)
that is maintained at the remote site. This list typically links a DN to a local resource
username, so that the requesting user can inherit all the rights of the local user. Many
DNs can be linked to the same local user.
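The authorization step just described can be sketched as follows. The quoted-subject layout mirrors the conventional grid-mapfile format; the entries, and the lookup code itself, are illustrative rather than the actual Globus implementation.

```python
# Sketch of the gridmap lookup described above. The file format shown
# (quoted certificate subject followed by a local account name) follows
# the conventional grid-mapfile layout; all entries are fictitious.

import shlex

GRIDMAP = '''
"/C=IT/O=INFN/OU=Personal Certificate/CN=Mario Rossi" rossi
"/C=IT/O=INFN/OU=Personal Certificate/CN=Anna Bianchi" dteam
"/C=CZ/O=CESNET/CN=Jan Novak" dteam
'''

def local_user_for(subject, gridmap_text):
    """Map a certificate subject (X.500 DN) to a local username."""
    for line in gridmap_text.strip().splitlines():
        dn, account = shlex.split(line)   # handles the quoted DN
        if dn == subject:
            return account
    return None   # no entry: access is denied

print(local_user_for("/C=CZ/O=CESNET/CN=Jan Novak", GRIDMAP))
```

Note how two distinct DNs map to the same local account, as the model above allows, and how a subject absent from the list is simply refused.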
3.3.2 GSI advantages
The GSI model provides (via the GSS-API compliance) a one-time login mechanism.
This matches the ease of access requirement, and has already proven to be useful, for
instance, in the existing applications of the Andrew File System (AFS).
An advantage of GSI over Kerberos authentication is the use of X509, which is by
now a more widespread user identification and digital signature mechanism. X509 is also
the only digital signature framework that is currently granted legal validity in certain
European countries.
GSI also provides a scheme (via the Globus Authentication and Authorization library - GAA) for extending relations of trust to multiple CAs without having to interfere
with their X.500 naming scheme.
3.3.3 GSI shortcomings and ways to address them
1. The fact that authorization is based only on certificate subjects allows for the existence of multiple valid certificates for the same subject. This means that care must
be taken in making sure that revoked certificates (certificates that were compromised and revoked by the CA) are not accepted. The Globus GSS-API explicitly
tries to find and check a CA Certificate Revocation List (CRL) when verifying
proxy certificates. This means that the CRL must be present and up-to-date at every
GSI-capable site. There are no specific tools in the current Globus toolkit to handle
this.
2. The model where generally valid (even if for a limited time only) private keys are
available on remote hosts fits a world where all system administrators are honest
and able to implement a seamless security model. Good security practices call for
limited scope proxy certificates. The only current limitation is in the ability to further delegate a proxy (which already is a delegated form of user credential). There
is an ongoing Globus development activity (in the CAS - Community Authorization
Services framework) to provide limited “capability” certificates.
3. The GSI infrastructure doesn’t provide any tool to handle the association of user
identities (certificate DNs) to activity (experiment) specific groups. This is an important requirement in the grid context, so some appropriate solution has to be
provided, possibly also in the framework of the CAS development.
3.4 Information Services for the Grid
The Information Service (IS) plays a fundamental role in the Grid environment, since
resource discovery and decision making are based upon the information service infrastructure.
Basically an information service is needed to collect and organize, in a coherent
manner, information about grid resources and status and make it available to consumer
entities.
Areas where information services are used for the grid are:
Static and dynamic information about computing resources: this includes host
configuration and status, a list of services, protocols and devices that can be found
on hosts, access policies and other resource related information.
User Information: User descriptions and groupings need to be available to client
systems for authentication purposes. In the Grid environment, this is mostly a service provided by the Public Key Infrastructure of every participating institution.
Network status and forecasting: Network Monitoring and forecasting software
will use an information system to publish network forecasting data.
Software repositories: Information on where software packages are located, what
they provide, what they need etc. needs to be available on the net.
Hierarchical directory services are widely used for this kind of application, due to
their well-defined APIs and protocols.
However, the main issue with using the existing directory services is that they are not
designed to store dynamic information such as the status of computing (or networking)
resources.
3.5 The Globus Grid Information Service
The Globus Grid Information Service provides an infrastructure for storing and managing static and dynamic information about the status and the components of computational
grids. The Grid Information Service implemented by INFN is based on the Globus package, so here we’ll refer to the Globus GIS software.
The Globus Grid Information Service is based on LDAP directory services. As we
pointed out, LDAP may not be the best choice for dynamic data storage, but it provides a
number of useful features:
A well defined data model and a standard and consolidated way to describe data
(ASN.1).
A standardized API to access data on the directory servers.
A distributed topological model that allows data distribution and delegation of access policies among institutions.
Dra
In the next sections we’ll discuss directory services related topics and how they
apply to the Globus grid information service.
3.5.1 Data Design
There are many issues in describing the data that the information system must contain.
Basic guidelines for the GIS are:
The information system must contain useful data for real applications (not too
much, not too little).
It is important to make a distinction between static and dynamic data so that they
are treated in different ways by the various servers in the hierarchy. This can be
done with the Globus package by assigning different times to live to the entries.
Unfortunately, some features are lacking in the Globus data representation:
though LDAP represents data using ASN.1 with BER encoding (a good reference book
for ASN.1 is [22]), the Globus v1.1.3 GIS implementation has no data definition types
tied to attributes: all the information is represented in text format and the proper handling
of attribute types is left to the backend.
As a consequence, it is not straightforward to apply search filters to the data (for
example, numerical values are compared as if they were strings).
3.5.2 Schema Design
LDAP schemas are meant to make data useful to data consumer entities. The schema
specifies:
A unique naming scheme for objectclasses (not to be confused with the namespace).
Which attributes the objectclass must and may contain.
Hierarchical relationships between objectclasses.
The data type of attributes.
Matching rules for attributes.
The schema is then what makes data valuable to applications. Globus implemented
its own schema for describing computing resources.
As with other LDAP servers, the Globus schema can be easily modified. However,
this process must be led by an authority, in order to keep schema modifications consistent among the various GIS infrastructures.
From our perspective the Globus schema should become a standard schema for describing computing resources, with the aim of making it useful to a wide range of applications
and of allowing easier application development (applications must refer to a schema to
know how to treat data).
3.5.2.1 The Globus Schema
The Globus schema is used to describe computing resources on a Globus host.
The current schema may need to be integrated at least with a description of:
Devices (and file systems)
Services (other than the services provided by Globus)
The definition and standardization of information about computing resources is
largely work in progress ([5]) and the upcoming grid development efforts will have to
make sure to feed back their proposals for extensions to the information schema into the
standards process.
3.5.3 Extending the GRIS
Good flexibility in extending the set of data returned by each GRIS in the GIS architecture is a definite requirement for a production grid system.
The GRIS currently uses trigger programs to fetch data from the system. Such data
is then cached for a configurable amount of time. The output of trigger programs must
be in LDIF format and must respect the Globus schema. The schema can be extended to
represent information other than those provided by default. In some cases we concluded that
a different definition of some standard information “keys” was needed. In the short term,
we hope that such changes can be negotiated in the framework of our collaboration with
Globus. In the long term the entire information modeling schema will probably evolve.
The Globus LDAP schema, at the moment, is not checked at runtime by the GRIS
(LDAP server) to ensure data consistency.
3.5.4 Service Reliability
When a geographically distributed Information Service is implemented (see Figure 3.3 for an example), it is important, in terms of service reliability, that each site GIIS be able to operate independently of other sites and of the root GIIS server. This means that GIS clients at each site must be able to rely on the local GIIS alone (as happens, for example, with DNS clients, which do not lose local visibility of the DNS when the root name servers are unreachable).
The standard LDAP way to obtain this is for the site GIIS server to hold a superior knowledge reference to its ancestor (a default referral to the root GIIS) and to return it to clients (which point themselves at the local GIIS) when they wish to know about global resources (compare Figure 3.4).
3.5.5 Data Caching
At the moment the root GIIS server contains the same set of information available at lower
levels, but the caching of dynamic data at the top level is often useless. Furthermore, there
Figure 3.3: Hierarchy of Globus v1.1.3 (GIS) information services within the INFNGRID test organization.
Figure 3.4: When building a geographically distributed (directory based) information tree,
it is important to make sure that each site can get a reliable and consistent view of the local
information in case the geographical links fail.
Figure 3.5: Detailed view of the GIS cache.
is some information that it is convenient not to publish, for security reasons: the users on a host, the installed software, etc.
The simplest way to limit the propagation of this "classified" (and, if too dynamic, useless) information to higher hierarchy levels is to implement access control on the GIISs, in such a way that superior GIIS servers can access only a portion of the information, while authorized hosts can access all of it. At the moment, the GIS
does not implement any security mechanism, but GSI support is included in the current
development code.
As stated earlier, the time to live of static information that should be available at
higher levels has to be greater than the time to live of dynamic information.
As of now, the data provided by the root server is the same as that provided by the inferior servers, but it expires less often (1 hour versus 5 minutes).
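The effect of the two expiry times can be sketched with a minimal cache model. This is plain Python, not Globus code; only the 1-hour/5-minute TTL values are taken from the text, everything else is invented.

```python
import time

class TTLCache:
    """Minimal time-to-live cache, sketching how a GIIS could keep
    static data longer than dynamic data (not actual Globus code)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock            # injectable clock (eases testing)
        self._data = {}                # key -> (value, expiry time)

    def put(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)

    def get(self, key):
        value, expires = self._data.get(key, (None, 0.0))
        if self._clock() >= expires:   # stale entry behaves as a miss
            return None
        return value

STATIC_TTL = 3600   # e.g. OS name: useful up the hierarchy, 1 hour
DYNAMIC_TTL = 300   # e.g. CPU load: kept fresh locally, 5 minutes
```

With an injected fake clock one can verify that an entry stored with DYNAMIC_TTL disappears after five simulated minutes, while a STATIC_TTL entry survives for an hour.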
Figure 3.6: The Globus resource management architecture
3.6 Globus Services for Resource Management
3.6.1 Globus resource management architecture
The Globus resource management architecture [21][25], illustrated in Figure 3.6, is a layered system, in which high-level services are built on top of a set of local services.
The applications express their resource requirements using a high-level RSL (Resource Specification Language) expression. One (or more) broker(s) are then responsible for taking this abstract resource specification, and translating it into more concrete
specifications, using information maintained locally, and/or obtained from an Information
Service (responsible for providing efficient and pervasive access to information about the
current status and capability of resources), and/or contained in the high level specification.
The result of this specialization is a request that can be passed to the appropriate GRAM
or, in the case of a multirequest (when it is important to ensure that a given set of resources
is available simultaneously), to a specific resource co-allocator. Globus does not provide a generic, general-purpose broker: one has to be developed for the particular applications to be supported. The GRAM (Globus Resource Allocation Manager) is the
component at the bottom of this architecture: it processes the requests for resources from
remote application execution, allocates the required resources, and manages the active
jobs. The current Globus implementation focuses on the management of computational
resources only.
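As an illustration, a resource request handed to a broker could look like the following RSL fragment. The executable path and attribute values are invented; a real broker would refine such an abstract specification before passing it to a GRAM.

```text
&(executable=/home/user/myapp)
 (arguments=-n 100)
 (count=4)
 (maxMemory=256)
 (maxTime=60)
```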
3.7 GRAM
The Globus Resource Allocation Manager (GRAM) [21] is responsible for a set of resources operating under the same site-specific allocation policy, often implemented by
a local resource management system (such as LSF, PBS, Condor). Therefore a specific
GRAM doesn’t need to correspond to a single host, but rather represents a service, and
can provide access for example to the nodes of a parallel computer, to a cluster of PCs,
to a Condor pool. In the Globus architecture the GRAM service is therefore the standard
interface to “local” resources: grid tools and applications can express resource allocation
and management requests in terms of a standard interface, while individual sites are not
constrained in their choice of resource management tools.
The GRAM is responsible for:
Processing RSL specifications representing resource requests, by either creating the
process(es) that satisfy the request, or by denying that request;
Enabling remote job monitoring and management;
Periodically updating the GIS (MDS) information service with information about
the current status and characteristics of the resources that it manages.
The architecture of the GRAM service is shown in Figure 3.7.
The GRAM client library is used by the application: it interacts with the GRAM
gatekeeper at a remote site to perform mutual authentication and transfer the request.
The gatekeeper responds to the request by performing mutual authentication of user
and resource (using the GSI service), determining the local user name for the remote
user, and starting a job manager, which is executed as the selected local user and actually
handles the request.
The job manager is responsible for creating the actual process(es) requested. This
task typically involves submitting a request to an underlying resource management system
(GRAM can operate in conjunction with several resource management systems: Condor,
LSF, PBS, NQE, etc...), although if no such system exists, the fork system call may be
used. Once the process(es) are created, the job manager is also responsible for monitoring
their state, notifying a callback function at any state transition (the possible state transitions for a job are illustrated in Figure 3.8). The job manager terminates once the job(s)
for which it is responsible have terminated. The GRAM reporter is responsible for storing
into GIS (MDS) various information about the status and the characteristics of resources
and jobs.
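The job manager's behaviour can be sketched as a small state machine that notifies a callback at every transition. The transition table below only approximates Figure 3.8, and all names are illustrative.

```python
# Sketch of the job manager behaviour: a job moves through GRAM-like
# states, and a callback is invoked at every state transition. The
# transition table is an approximation of Figure 3.8, not the exact diagram.
ALLOWED = {
    "UNSUBMITTED": {"PENDING", "FAILED"},
    "PENDING":     {"ACTIVE", "FAILED"},
    "ACTIVE":      {"SUSPENDED", "DONE", "FAILED"},
    "SUSPENDED":   {"ACTIVE", "FAILED"},
    "DONE":        set(),      # terminal
    "FAILED":      set(),      # terminal
}

class JobManager:
    def __init__(self, callback):
        self.state = "UNSUBMITTED"
        self.callback = callback        # notified at each transition

    def transition(self, new_state):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        old, self.state = self.state, new_state
        self.callback(old, new_state)   # asynchronous in the real GRAM

events = []
jm = JobManager(lambda old, new: events.append((old, new)))
for s in ("PENDING", "ACTIVE", "DONE"):
    jm.transition(s)
```

The callback here simply records the transitions; in GRAM it is the mechanism by which a remote submitter learns that its job has started or finished.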
Figure 3.7: GRAM architecture
Figure 3.8: State transition diagram for a job
Figure 3.9: Graph of dependencies for the GRAM client module
3.8 GRAM Client API
The GRAM Client API includes eleven functions. Before any of these functions is called, a software module specific to the GRAM client has to be activated. Module activation automatically triggers the activation of any other modules on which the first one relies: in the case of the GRAM client, the POLL, IO and GRAM HTTP modules are loaded, together with all the modules that these in turn need. In particular, during the GRAM HTTP activation the client credentials are acquired, allowing the client to run in a secure environment.
The full graph of dependencies for the GRAM client module is shown in Figure 3.9.
Once the GRAM client module is not needed any more, it can be deactivated.
A job, in the form of an RSL description, can be submitted through a call to the appropriate API function, which returns as output a unique job handle. The handle can then be used by several other functions, in particular to monitor the status of the job or to kill it.
In addition, the callback mechanism provided by the GRAM client API can be used to allow the job submitter to be asynchronously notified of job state changes.
Two functions of the API (one for jobs already submitted, and another one for not-yet-submitted jobs) allow obtaining an estimate of the time at which a certain job would start running. Unfortunately these two functions are not implemented yet. They would be extremely useful in the implementation of a grid scheduler, because the scheduler, if needed, could delegate the estimation of a job start time to the resource, which knows its current state better than what would be possible by considering the information published in an Information Service.
From the preliminary tests done so far, the GRAM client API seems quite complete and correctly implemented. The documentation, which happens to be quite poor for other Globus modules, is also accurate enough.
3.9 GARA
Figure 3.10: GARA basic architecture
In this section we provide a simple overview of GARA, based on [8], and we comment on the network reservation part of GARA.
3.9.1 Overview
The goal of the General-purpose Architecture for Reservation and Allocation (GARA) is
to provide applications with a method to reserve resources like disk space, CPU cycles
and network bandwidth for end-to-end Quality of Service (QoS). GARA provides a single
interface for reservation of diverse resources.
GARA has a hierarchical structure ([8], see Figure 3.10). At the lower layer the resource manager performs resource admission control (to make sure that only entitled customers actually get access to the grid resources) and reservation enforcement. Communication with the resource manager is through the Globus Gatekeeper, which authenticates
and authorizes the resource requester. At layer 2, the Local Reservation layer implements an API for reservation requests within a single trust domain. Reservation authentication through GSI is supported at layer 3, so that reservations can be requested remotely. The highest level supports mechanisms for end-to-end reservation.
Reservations can be made in advance or immediately when needed by the application itself. Transmission Quality of Service is implemented according to the Differentiated Services architecture, which provides traffic aggregates with differentiation through
marking, policing, scheduling and shaping. Packets are marked at the ingress point of
the network with a code called the Differentiated Services CodePoint (DSCP). Packets
generated by different application sessions can share the same codepoint. Then, in each
congestion point packets are placed in different dedicated queues, so that depending on
the priority, they will experience different treatment. Quality of Service can be quantified through several performance metrics like: one-way delay, one-way delay variation,
packet loss probability, throughput, etc.
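On an end host, marking amounts to setting the DS field of outgoing packets. A minimal sketch using the standard socket API follows; note that GARA's prototype marks at the edge router instead, and that the DSCP value 46 ("expedited forwarding") and the Linux-style IP_TOS behaviour are assumptions of this example.

```python
import socket

EF = 46  # assumed DSCP: the common "expedited forwarding" codepoint

def mark_socket(sock, dscp):
    # The 6-bit DSCP sits in the upper bits of the 8-bit TOS/DS byte.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp << 2)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
mark_socket(s, EF)
tos = s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
s.close()
```

Every datagram subsequently sent on the socket carries the codepoint, so routers along the path can place it in the corresponding queue.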
CPU reservation is implemented through a mechanism for process scheduling called
Dynamic Soft Real-Time (DSRT), while disk space reservation is based on DPSS.
3.9.2 Network Reservations
Quality of Service configuration in GARA requires three fundamental building blocks:
marking, policing and scheduling. Each time a new reservation request is received, the
edge router configuration has to be modified. The prototype is designed to work with
Cisco routers only, and it uses the Cisco Command Line Interface. Scheduling pre-configuration at the egress interface of the router is required. The mechanism requires
configuration privileges on the router to proceed with router configuration every time a
reservation request is received.
3.9.3
Comments on Network Reservations
We believe that the approach to network reservation adopted by GARA is of great interest, since it addresses the problem of end-to-end Quality of Service, a fundamental
requirement for networked applications. However, we think that some aspects may need
investigation.
The change in router configuration every time a new reservation is received is a
viable solution only if reservations are performed locally at a low rate. The
alternative approach would be to adopt a static configuration, which is possible when the source/destination IP addresses of the grid hosts, or the corresponding subnets, are known in
advance.
A second issue is related to per-flow policing. The number of policing/marking
instances, which have to be enabled on the input interface of the router, is a critical parameter. Performance of small edge routers is greatly dependent on the number of traffic
filters (access-lists) enabled at one time for traffic policing. Per-microflow policing offers
better traffic isolation at the expense of additional CPU overhead.
A third potential weakness of the architecture is that resource reservation does not automatically recognize the ingress interface with which a policer/marker has to be associated: it does not rely on routing information, but rather requires that, for each host allowed to reserve bandwidth, the corresponding input interface on the router be known. This is specified in a configuration file which has to be manually updated every time the set of local hosts changes. This approach is prone to human error.
3.10 Globus Executable Management
According to the Globus documentation [25][1], the Globus Executable Management
(GEM) service should provide mechanisms to implement different distributed code management strategies, providing services for the identification, location and instantiation of
executables and run time libraries, for the creation of executables in heterogeneous environments, etc.
In practice, we found that GEM does not exist as a package; Globus only provides some functionality for executable staging, that is, transferring the application (i.e. the executable file) to a remote machine immediately prior to execution. This is
possible if the executable file is accessible via HTTP or HTTPS, or present on the machine
on which the globusrun command is issued.
This executable staging does not do anything with regard to moving shared libraries with the executable and setting the corresponding environment variable, so if shared libraries are in non-standard places on the target machine, or if the application uses non-standard shared libraries, then the application will probably fail.
Nothing exists in the Globus toolkit about the packaging and portability issues that
would allow new executables to be automatically built for a new architecture from some
portable source packages.
3.11 Heartbeat Monitor
The Heartbeat Monitor (HBM) service [25][44] should provide mechanisms for monitoring the status of a distributed set of processes. Through a client interface, a process should
be allowed to register itself with the HBM service and to send regular heartbeats to it. Moreover, a data collector API should allow a process to obtain information on the status of other processes registered with the HBM service, thus making it possible to implement, for example, fault recovery mechanisms. Unfortunately this service is not seeing active
development: an HBM package, implementing some very preliminary and incomplete
functionalities, has been included in the early Globus releases, but now it is not supported
anymore, and has been dropped from the distribution.
Chapter 4
Condor
4.1 Introduction
Many of the typical computing problems that are planned to be submitted to computational grids require long periods (days, weeks, months) of computation to solve. Examples include different kinds of simulations, parametric studies (where many jobs must be
executed to explore the entire parameter space), parallel processing of many independent
data sets, etc. This user community is interested in maximizing the number of computational requests that can be satisfied over long periods of time, rather than improving
the performance over short periods of time. For these users, High Throughput Computing (HTC) environments [10] [34], able to deliver large amounts of computational power
over long periods of time (in contrast with classical High Performance Computing (HPC)
environments, that bring enormous amounts of computing power to bear over relatively
short periods of time) must be considered.
To create and deploy a usable HTC environment, the key is the effective management and exploitation of all available resources, rather than the maximization of the efficiency of the existing computing systems. The distributed ownership of the resources is probably the major obstacle: the migration from powerful, expensive central mainframe systems to commodity workstations or personal computers with a better performance/price ratio has tremendously increased the overall amount of computer power but, precisely because of this distributed ownership, there has not been a proportionate increase in the amount of computing power available to any individual user.
For many years the Condor team at the University of Wisconsin/Madison has been
designing and developing tools and mechanisms to implement a HTC environment, able
to manage large collections of distributively owned computing resources: the Condor
system [45] [31] [30] [33] [46] is the result of these activities.
4.2 Overview of the Condor system
A resource in the Condor system (typically a node of the distributed computing environment), is represented by a Resource-owner Agent (RA), implemented by a specific
daemon (the startd daemon) running on this node. The resource-owner agent is responsible for enforcing and implementing the policies, which specify when a job may begin
using the resource, when the job will be suspended, etc., depending on possible different factors: CPU load average, keyboard and mouse activities, attributes of the customer
making the resource request, etc. (Condor is popular for harnessing the CPU cycles of idle computers, but it can be configured according to different policies as well). These policies are distributively and dynamically defined by the resource owners, who have complete control over their resources, and can therefore decide when, to what extent and by whom their resources can be used. The resource-owner agents periodically probe the resources
to determine their current state, and report this information, together with the owners
policies, to a collector, running on one well-defined machine of the Condor pool, called
central manager. Customers of Condor are represented by Customer Agents (CAs), which
maintain queues of submitted jobs. Like the resource-owner agents, each customer agent
(implemented by the so-called schedd daemon) periodically sends the information concerning the job queues to the collector. For this purpose the RAs and the CAs use the
same language, called Condor Classified Advertisement (ClassAd) language, to describe
resource requests and resource offers (see Section 4.3). Periodically a negotiation cycle
occurs (see Figure 4.1): a negotiator (matchmaker), running on the central manager, takes
information from the collector and invokes a matchmaking algorithm, which finds compatible resource requests and offers and notifies these agents of their compatibility.
Then it is up to the compatible agents to contact each other directly, using a claiming
protocol, to verify if they are still “compatible” with the updated state of the resource and
the request. Therefore the matching and the claiming are two distinct operations: a match
is an introduction between two compatible entities, whereas a claim is the establishment
of a working relationship between the entities.
If in the claiming phase the two agents agree, the computation can start (see Figure 4.2): the schedd on the submitting machine starts a shadow process, which acts as the connection point for the job running on the remote machine, and the startd on the executing machine creates a starter process, responsible for staging the executable file from the submitting machine, starting the job, monitoring it and, if necessary, vacating it (if, for example, the resource is reclaimed by its owner).
If necessary, the matchmaking algorithm can break an existing match involving a
specific resource, and create a new match between the resource and a job with a better
Figure 4.1: Condor matchmaking
Figure 4.2: Architecture of a Condor pool
priority. This preempts the job associated with the broken match, possibly resulting in
application migration.
Simply by re-linking the application with a specific Condor library, it is possible to exploit two distinguishing features of the Condor system, which are useful services for a HTC environment: remote system calls (see Section 4.4) and job checkpointing (see Section 4.5). Besides these so-called standard Condor jobs (jobs re-linked with the Condor library),
Condor also provides mechanisms to support the execution of applications that haven’t
been re-linked with the Condor library, and therefore can’t exploit the remote system call
and checkpointing capabilities. These so-called vanilla jobs are discussed in section 4.6.
4.3 Condor Classified Advertisements
As introduced in Section 4.2, in the Condor system the resource offer and the resource request entities advertise their characteristics, requirements and preferences using the classified advertisements (ClassAds) mechanism [33] [40] [41]. A ClassAd is a mapping
from attribute names to expressions. Attributes may be integer, real or string constants,
or they may be more complicated expressions (constructed with arithmetic and logical
operators, and record and list constructors). The ClassAd language includes a query language as part of the data model, so advertising agents can specify their compatibility by
including constraints in their resource offers and requests. The ClassAd mechanism is very flexible: it does not constrain which entities take part in the matchmaking process, what their requirements are, or how they wish to describe themselves. Moreover, ClassAds use a
semi-structured data model, so the matchmaker doesn’t have to consider a well-defined
schema.
In Condor the resource offer ClassAds and the resource request ClassAds conform
to an advertising protocol that states that every ClassAd should include expressions named
Requirement and Rank: in the matchmaking a pair of ClassAds are incompatible unless
their Requirement expressions both evaluate to "true". To choose the best pair among compatible matches, the Rank attribute is then considered: among the provider ClassAds matching a given customer ClassAd, the one with the highest customer Rank value is chosen, breaking ties according to the provider's Rank. Figure 4.3 shows an example of a ClassAd
representing a Condor resource (with a sophisticated resource usage policy), while Figure
4.4 represents a ClassAd for a job submitted for execution.
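The Requirement/Rank protocol can be sketched as follows. ClassAds are modelled here as plain Python dicts with callable Requirement and Rank entries, a deliberate simplification of the real ClassAd language; all names and values are invented.

```python
# Toy matchmaker in the spirit of the Condor protocol: two ads match
# when both Requirement expressions are true against the other ad, and
# the provider with the highest customer Rank wins.

def matches(customer, provider):
    return (customer["Requirement"](provider) and
            provider["Requirement"](customer))

def best_match(customer, providers):
    compatible = [p for p in providers if matches(customer, p)]
    if not compatible:
        return None
    # Highest customer Rank wins; ties broken by the provider's own Rank.
    return max(compatible,
               key=lambda p: (customer["Rank"](p), p["Rank"](customer)))

job = {
    "ImageSize": 30,
    "Requirement": lambda m: m["Memory"] >= 30 and m["Arch"] == "INTEL",
    "Rank": lambda m: m["Memory"],          # prefer machines with more memory
}
machines = [
    {"Name": "vulture", "Arch": "INTEL", "Memory": 16,
     "Requirement": lambda j: True,                "Rank": lambda j: 0},
    {"Name": "raven",   "Arch": "INTEL", "Memory": 64,
     "Requirement": lambda j: j["ImageSize"] < 50, "Rank": lambda j: 0},
]
chosen = best_match(job, machines)   # the matchmaker's "introduction"
```

In the real system the match is only an introduction: the two agents would then run the claiming protocol against the machine's current state before any job starts.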
The ClassAds paradigm is very flexible, and can be generalized to include resources
other than computing systems and customers other than applications. However, this bilateral match cannot be applied to a certain class of problems. Consider, for example, a job which requires a software license for its execution, besides a computing resource.
Figure 4.3: A ClassAd describing a Condor resource
Figure 4.4: A ClassAd describing a submitted job
In this case the classical bilateral match cannot be applied, since three entities must be
considered in the match: the job, the computing resource and the software license. The
Condor team is addressing this problem with the so-called Gang-Matching [42], which replaces the single implicit bilateral match with an explicit list of required bilateral matches.
4.4 Remote system calls
A big issue that must be faced in overcoming the problem of distributed ownership is data access, since typically a job placed on a remote, foreign computing resource needs to read from and write to files on the submitting machine. Imposing a uniform, distributed file system (such as NFS or AFS) across all the machines is a burdensome requirement,
which could significantly decrease the number of accessible resources.
This problem has been addressed in the Condor system with the remote system
calls mechanism [33]. By linking the job against a specific Condor library instead of the standard C library, nearly every system call the job performs is caught by Condor. As
shown in Figure 4.5, the Condor library contains function stubs for all the system calls:
these stubs send a message to the shadow, running on the submitting machine, asking it to
perform the requested system call. The shadow executes the system call on the submitting
machine, and sends the result back to the job. This way all I/O operations performed by
Figure 4.5: Remote system calls in Condor
the job are done on the submitting machine.
This is transparent to the job, which has no hint that the system that performed the
call was actually the submitting machine, instead of the machine where it is running.
4.5 Checkpointing
Checkpointing [33] [29] a running program means taking a snapshot of its current state
in such a way that the program can be restarted from that state at a later time. Since most
operating systems do not provide kernel-level checkpointing services, Condor employs
a user-level checkpointing capability, available for many Unix platforms. When a job
must be checkpointed, its state (which includes the contents of the process's stack and data segments, all shared library code and data mapped into the process's address space,
all CPU state including register values, the state of all open files, and any signal handlers
and pending signals) is written to a file. To enable checkpointing, the program must be
re-linked with a specific Condor library. Checkpointing is used in the Condor system
when the matchmaker decides to no longer allocate a machine to a job (for example
because the owner reclaims this resource): the job is checkpointed, and when a suitable
replacement machine is found, the process is restored from this checkpoint, resuming
the computation from where it left off, without losing the work already accomplished.
Moreover, it is possible to configure Condor to perform periodic checkpoints of jobs, to
improve the fault tolerance of the system: even in spite of failures (for example a crash
of the execution machine) the job can restart later from its last checkpoint, without losing
the work done so far. Checkpoint files are written on the file system of the submitting
machine, or it is possible to implement specialized checkpoint servers, with dedicated
disk space for storing checkpoints, and with good network connection to the machines of
the Condor pool.
It must be noted that checkpointing can be expensive and time consuming, since
the checkpoint file can be very big, and it must be written (possibly over the network)
to disk [29]. Therefore Condor foresees also other mechanisms, besides checkpointing,
to preempt a running job. A job can be suspended (it keeps staying in the allocated
resource, but the execution is suspended) or it may be even killed, without saving any
intermediate results. For example, a possible configuration that can be implemented in
the Condor system to preempt a job is by suspending it as a first step: this is a useful
mechanism when the owner reclaims his resource for only a short time. If the owner
keeps using the resource, the job can then be checkpointed; but if the checkpointing cannot be accomplished in a relatively short period (such that the owners cannot promptly access their resources), the job is killed and must then be restarted from the beginning.
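The escalation policy just described (suspend first, then checkpoint, kill as a last resort) can be sketched as a small decision function. All parameter names and the 60-second threshold are invented for illustration.

```python
def vacate(job, ckpt_seconds_estimate, max_ckpt_seconds=60,
           owner_back_briefly=False):
    """Decide how to preempt a running job. A sketch of the policy in
    the text; the parameter names and 60 s threshold are invented."""
    if owner_back_briefly:
        return "suspend"                 # owner needs the machine only shortly
    if ckpt_seconds_estimate <= max_ckpt_seconds:
        return "checkpoint"              # state saved, job can migrate
    return "kill"                        # checkpoint too slow: restart later

actions = [vacate("job1", 10, owner_back_briefly=True),
           vacate("job1", 10),
           vacate("job1", 300)]
```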
4.6 Vanilla Jobs
Some applications cannot be re-linked with the Condor library, for example because the
object code is not available (this is often true for some commercial software binaries),
or because of the limitations that Condor imposes on jobs that can be checkpointed and migrated: for example, IPC is not supported, only single-process jobs are allowed, etc. ([46]). Vanilla
jobs are intended for these kinds of jobs: they can’t checkpoint or perform remote system
calls, but they are still scheduled by the matchmaking system. Since vanilla jobs can’t
exploit the checkpointing mechanisms, when a job must be preempted from a machine, it
can be suspended (and completed at a later time) or killed (and then restarted from the beginning on another available resource). Since remote system calls are not supported,
a vanilla job must run on a machine that shares the same filesystem with the submitting
machine. The job must also run on a machine where the user has the same UID as on the
submitting machine.
4.7 Condor Flocking
The flocking mechanism [46] [23] allows linking together different Condor pools. In the
standard Condor configuration the schedd daemon contacts the central manager of the
local Condor pool to locate executing machines available to run jobs in its queue. In
the flocking arrangement, additional central managers of remote Condor pools can be
specified as a configuration parameter of the schedd daemon. When the local pool does not satisfy all its job requests, the schedd daemon will try these remote pools in turn (so the central managers in the list should be ordered by preference) until all jobs are satisfied. The schedd will send a request to a remote central manager only if the local pool and the pools earlier in the list are not satisfying all the job requests. Obviously
the machines of the remote pools must be configured to allow the execution from the
remote machine, and the central managers must be configured to listen to requests from
this remote schedd process.
4.8 Parallel Applications in Condor
Condor provides a framework, called Condor-PVM [39], which allows running PVM applications in the Condor environment. PVM applications which follow the master-worker paradigm are supported by Condor-PVM: this model foresees that one node acts as the controlling master for the parallel application, and sends pieces of work out to the worker
nodes. The worker nodes do some computation, and send the result back to the master
node. Condor PVM doesn’t define a new API, but the existing PVM calls are used. What
happens is that whenever a PVM application asks for a node, the request is re-mapped
to Condor, which finds a suitable CPU in the pool using the usual Condor mechanisms,
and adds it to the PVM virtual machine. If a machine needs to leave the pool, the PVM
program is notified of that as well, via the normal PVM mechanisms. Therefore Condor acts as a resource manager for the PVM daemon: the master is executed on the machine from which the job has been submitted, while workers are pulled in from the Condor pool as they become available.
The Condor team is currently working to include support for MPI as well.
4.9 Inter-Job Dependencies (DAGMAN)
Solving a problem may require multiple jobs that need data from each other. These problems are best represented using Directed Acyclic Graphs (DAGs), which represent the
flow of control from one node to another (i.e. from one job to another). To manage this kind of problem, the Directed Acyclic Graph Manager (DAGMan) [46] can be used together with a Condor pool. DAGMan is a meta-scheduler for Condor jobs, responsible for submitting batch jobs in a predefined order and processing the results: DAGMan handles all the scheduling, recovery and reporting activities of the submitted job
system. To submit a DAG job, a DAG input file must be defined. In this file all the jobs
that will appear in the DAG must be specified: a DAG can contain a mixture of standard and vanilla jobs, or even other DAG jobs. Then the dependencies between these jobs must be defined by specifying the parent and the child jobs: a child job is one whose input is taken from one or more parent jobs, and therefore it cannot run until all of its parents have successfully terminated. Moreover, for each job of the DAG it is possible to specify a pre and/or a post script, executed before/after the job is run. For example, a pre script could be used to put the required files into a staging area, while a post script could copy the output files to another storage system, and then delete the staged input and output files.
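The DAG input file format can be illustrated with a hypothetical two-job example (the submit and script file names are invented for the purpose):

```text
# Job B takes its input from job A, so B runs only after A has
# terminated successfully.
JOB A jobA.submit
JOB B jobB.submit
PARENT A CHILD B

# Stage the input files in before B starts, copy the output out (and
# clean the staging area) after B completes.
SCRIPT PRE  B stage_in.sh
SCRIPT POST B stage_out.sh
```

The DAG is then submitted with the condor_submit_dag command.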
4.10 Experiences with Condor
The need to share the computing resources owned by the various sites across Italy has always been a peculiar problem for INFN. Condor was identified as a possible candidate to implement such an HTC system, and therefore, beginning in 1997, it was decided to investigate its suitability for the computing needs of the INFN community [6], [15].
In collaboration with the Condor team, a single Condor pool (to optimize the CPU usage of all INFN resources) over a wide area network was deployed, as a general-purpose computing resource available to all INFN users.
During the test phase of this project we found that some customizations and tailoring were needed. For example, the resource owners felt the need to guarantee priorities on the usage of their resources for particular applications (i.e. local or collaboration jobs). This requirement has been addressed by configuring sub-pools: a sub-pool is a set of collaborative machines (i.e. workstations belonging to the same research group, not necessarily local to a single site) configured to prioritize the jobs of collaboration users. Using sub-pools it is possible to define and implement different policies and priorities on resource usage. For example, most of the sub-pools of the INFN WAN Condor pool are now configured to give highest priority to jobs of a specific group, then to jobs submitted by local users, while jobs submitted by remote users have the lowest priority.
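Sub-pool priorities of this kind can be expressed through the startd Rank expression in the Condor configuration. The following sketch is only illustrative: the attribute names CondorGroup and SubmitterDomain, and the values tested, are assumptions, not the actual INFN configuration.

```text
# Hypothetical policy for a machine in a sub-pool: collaboration jobs
# get the highest rank, local jobs an intermediate one, and all the
# remaining (remote) jobs rank 0.
RANK = 2 * (TARGET.CondorGroup =?= "my-collaboration") + \
       1 * (TARGET.SubmitterDomain =?= "mysite.infn.it")
```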
A problem that came to light in the test phase is related to checkpointing. As reported in Section 4.5, checkpointing is an expensive operation, in particular when big checkpoint files (very common for most applications in the high energy/nuclear physics field) must be written and read across the network. This can significantly reduce the so-called goodput [11]: the allocation time during which a remotely executing application uses the CPU to make forward progress, that is, the true throughput obtained by the
application. We soon realized that having a single, central checkpoint server for all the hosts of the INFN WAN Condor pool was not an appropriate choice. What was needed was a proper checkpoint server topology, able to limit checkpoint file transfers over the network and to complete checkpoints quickly, so that owners can access their resources without delay and checkpoint files are not lost due to network timeouts, all without reducing the computing throughput. To meet these requirements, the Condor pool was partitioned into different checkpoint domains: a dedicated checkpoint server was deployed in each checkpoint domain and used by all the executing machines of the Condor pool belonging to that domain. The checkpoint domains have been defined taking into account the presence of sets of machines with efficient network connectivity to the checkpoint server, the presence of a sufficiently large CPU capacity inside a domain, and the topology of the sub-pools.
We realized that in a distributed environment the network must be considered a resource [15], [12]; therefore the ClassAds describing the resources of the Condor pool have been augmented to include the (dynamically updated) bandwidth between each machine and its checkpoint server, profiting from the flexibility of the matchmaking framework, which doesn't entail a fixed, predefined schema. When the matchmaker must find a suitable computing resource for a job, the checkpointing characteristics of the job are taken into account. The job checkpointing policies are defined by the user when submitting a job: for example, the user can state that the job prefers to stay within a checkpoint domain, that a machine with a given bandwidth to its checkpoint server must be selected, or that the job cannot move between different checkpoint domains (suitable for very large jobs). In the long term the goal is a dynamic, “network aware” checkpointing system, where the associations between executing machines and checkpoint servers are decided dynamically according to the network status.
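The augmented ClassAds and the corresponding job policies might look as follows; the attribute names (CkptDomain, CkptBandwidth) and the values are illustrative, not the ones actually used in the INFN pool.

```text
# Fragment of a machine ClassAd: the bandwidth towards the checkpoint
# server is measured and updated dynamically.
CkptDomain    = "torino"
CkptBandwidth = 7.5      # MB/s to the domain checkpoint server

# Job-side policy: prefer to stay in the submit checkpoint domain, but
# accept any machine with a reasonably fast link to its checkpoint server.
Requirements = (CkptDomain == "torino") || (CkptBandwidth >= 5.0)
```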
At present the INFN WAN Condor pool is composed of about 230 machines (mainly Linux PCs and Digital Unix workstations) scattered across the various INFN sites. On these machines Condor has usually been configured to harness idle CPU cycles. The pool is currently partitioned into 7 checkpoint domains.
The total allocation time for jobs submitted to the Condor pool in the period January 2000 to December 2000 is about 400000 hours (45 years). Many applications, in particular CPU-intensive jobs, have successfully exploited the INFN WAN Condor pool facility, since for these applications very good throughput can be achieved running in the Condor environment; examples include Monte Carlo event simulation, simulation of Cherenkov light in the atmosphere, MC integration in perturbative QCD, dynamic chaotic systems, and stochastic differential equations.
People within the INFN community who are using Condor are essentially quite happy, since they have been able to substantially increase the throughput of computational
requests that can be satisfied. The robustness and the reliability of the system are highly appreciated: in fact, besides the checkpointing mechanism (which provides fault tolerance against crashes of the executing machine), Condor maintains persistent queues of the submitted jobs, so that if the submitting machine crashes, Condor is able to recover. The flexibility of the ClassAd matchmaking framework is appreciated as well, since, for example, resource owners can define new ClassAds describing particular characteristics of their resources, and users are then allowed to use these new attributes in their job request expressions.
Other users, on the contrary, reported poor performance when running their I/O-intensive applications in the Condor pool: we found that in this case performance improves if these jobs are forced to run in a sub-pool with a uniform filesystem. Other Condor customers found difficulties in using the Condor facility. Problems managing Condor have been reported by some Condor administrators as well: for example, it is not trivial to define particular usage policies. Also, troubleshooting is not easy, since the Condor system log files are difficult to interpret and primarily useful only to the Condor developers. There is also a security concern, since the authentication/authorization mechanisms in the current Condor implementation are quite primitive.
Finally, it is worthwhile to remark that in some INFN sites Condor is also used as a local resource management system for local dedicated farms.
4.11 Condor-G
Condor-G allows submitting jobs to Globus resources, profiting from some capabilities, features and mechanisms of the Condor system. In particular, since in Condor the queue of the submitted jobs is saved persistently, using Condor-G it could be possible to implement a reliable, crash-proof, checkpointable job submission service. Also, the Condor tools for job management (job submission, job removal, job status monitoring) and logging can be exploited.
The current implementation of Condor-G, which relies upon the Condor schedd service, runs the Globus globusrun command behind the scenes: a condor_submit command, used to submit a job to a Globus resource via Condor-G, is simply translated into the submission-library equivalent of a globusrun command.
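A Condor-G submit description file might look roughly as follows (the resource name and job manager here are placeholders, not an actual site):

```text
# Hypothetical condor_submit file targeting a Globus resource.
universe        = globus
globusscheduler = grid-ce.example.infn.it/jobmanager-pbs
executable      = my_simulation
output          = my_simulation.out
log             = my_simulation.log
queue
```

Behind the scenes, the schedd turns this description into the equivalent of a globusrun invocation against the specified resource.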
Condor-G doesn't provide any brokering/matchmaking functionality (this means that the Globus resource where the job must run has to be explicitly specified), and there are no plans to provide a way to plug application-specific resource choice policies into Condor-G: the place to implement this resource choice is within a component that sits
on top of Condor-G itself.
4.12 Condor GlideIn
With GlideIn, the Condor daemons (specifically, the master and the startd) are effectively run on Globus resources (machines that use the fork system call as job manager, or clusters managed by a resource management system). These resources then temporarily become part of a given Condor pool (the Condor daemons exit gracefully when no jobs run for a configurable period of time), which can then be used to submit any kind (standard or vanilla) of Condor jobs. GlideIn is a particular implementation of the Condor-G mechanism: the Condor master is submitted as a Condor-G job.
The GlideIn procedure operates in two steps, after acquiring a valid user proxy. In the first step, which needs to be performed only once, the Condor executables and configuration files are downloaded from a server in Wisconsin; in the second phase the Condor daemons are executed on the remote Globus resource.
Bibliography
[1]
[2] The Global Grid Forum home page. http://www.gridforum.org.
[3] Home page for the Globus project.
[4] Home page for the OpenSSL project.
[5] Home page of the Global Grid Forum Information Services working group.
[6] The INFN Condor on WAN project.
[7] Notes on extending the GRIS information schema.
[8] Administrator guide to GARA, March 2000.
[9] C. Anglano. A Fair and Effective Scheduling Strategy for Workstation Clusters. In
Proc. of the IEEE Int. Conference on Cluster Computing. IEEE-CS Press, December
2000.
[10] J. Basney and M. Livny. Deploying a high throughput computing cluster. In R. Buyya, editor, High Performance Cluster Computing: Architectures and Systems, volume 1. Prentice Hall, 1999.
[11] J. Basney and M. Livny. Improving goodput by co-scheduling cpu and network
capacity. Int’l Journal of High Performance Computing Applications, 13(3), 1999.
[12] J. Basney and M. Livny. Managing network resources in condor. In Proceedings of
the Ninth IEEE Symposium on High Performance Distributed Computing (HPDC9),
Pittsburgh, Pennsylvania, pages 298–299, August 2000.
[13] F. Berman. High-Performance Schedulers. In The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, 1998.
[14] F. Berman and R. Wolski. The AppLeS Project: A Status Report. In Proc. of the 8th
NEC Research Symposium, Berlin, Germany, May 1997.
[15] D. Bortolotti, T. Ferrari, A. Ghiselli, P. Mazzanti, F. Prelz, M. Sgaravatto, and C. Vistoli. Condor on WAN. In Proceedings of the CHEP 2000 conference, 2000.
[16] R. Buyya, D. Abramson, and J. Giddy. Nimrod/G: An Architecture for a Resource
Management and Scheduling System in a Global Computational Grid. In Proc. of
Int. Conf. on High Performance Computing in Asia-Pacific Region, Beijing, China,
2000. IEEE-CS Press.
[17] R. Buyya, D. Abramson, and J. Giddy. An Economy Grid Architecture for Service-Oriented Grid Computing. In Proc. of the 10th Int. Workshop on Heterogeneous
Computing. IEEE-CS Press, 2001.
[18] H. Casanova and J. Dongarra. NetSolve: A Network Server for Solving Computational Science Problems. Intl. Journal of Supercomputing Applications and High
Performance Computing, 11(3), 1997.
[19] P. Chandra, A. Fisher, and C. Kosak et al. Darwin: Customizable Resource Management for Value-Added Network Services. In Proc. of the 6th Int. Conf. on Network Protocols. IEEE, 1998.
[20] S. Chapin, J. Karpovich, and A. Grimshaw. The Legion Resource Management
System. In Proc. of the 5th Workshop on Job Scheduling Strategies for Parallel
Processing, volume 1659 of Lecture Notes in Computer Science. Springer, 1999.
[21] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith, and
S. Tuecke. A resource management architecture for metacomputing systems. In
Proc. IPPS/SPDP ’98 Workshop on Job Scheduling Strategies for Parallel Processing, 1998.
[22] Olivier Dubuisson. ASN.1 - Communication between Heterogeneous Systems.
[23] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne. A worldwide
flock of condors: Load sharing among workstation clusters. In Journal of Future
Generations of Computer Systems, volume 12, 1996.
[24] D.G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. Sevcik, and P. Wong. Theory
and Practice in Parallel Job Scheduling. In Proc. of IPPS ’97 Workshop on Job
Scheduling Strategies for Parallel Processing, Lecture Notes in Computer Science.
Springer, 1997.
[25] I. Foster and C. Kesselman. The globus project: A status report. In Proc. IPPS/SPDP
’98 Heterogeneous Computing Workshop, pages 4–18, 1998.
[26] V. Hamscher, U. Schwiegelshohn, A. Streit, and R. Yahyapour. Evaluation of Job-Scheduling Strategies for Grid Computing. In Proc. of the 1st ACM/IEEE Int. Workshop
on Grid Computing, number 1971 in Lecture Notes in Computer Science. Springer,
2000.
[27] D. Hensgen and T. Kidd. An Overview of MSHN: The Management System for Heterogeneous Networks. In Proc. of the 8th Workshop on Heterogeneous Computing.
IEEE-CS Press, April 1999.
[28] J. Hollingsworth and S. Maneewongvatana. Imprecise Calendars: an Approach to
Scheduling Computational Grids. In Proc. of the 19th Int. Conf. on Distributed
Computing Systems. IEEE-CS Press, 1999.
[29] M. Litzkow, T. Tannenbaum, J. Basney, and M. Livny. Checkpoint and migration of
unix processes in the condor distributed processing system, April 1997.
[30] M. J. Litzkow and M. Livny. Experience with the condor distributed batch system.
In Proc. of the IEEE Workshop on Experimental Distributed Systems, Huntsville,
Alabama, October 1990.
[31] M. J. Litzkow, M. Livny, and M. W. Mutka. Condor - a hunter of idle workstations. In Proc. of the 8th Int'l Conf. on Distributed Computing Systems, pages 104–111, 1988.
[32] Michael Litzkow, Miron Livny, and Matt Mutka. Condor - A Hunter of Idle Workstations. In Proc. of the 8th Int. Conf. of Distributed Computing Systems, 1988.
[33] M. Livny, J. Basney, R. Raman, and T. Tannenbaum. Mechanisms for high throughput computing. SPEEDUP Journal, 11(1):36–40, 1997.
[34] M. Livny and R. Raman. High-throughput resource management. In The Grid:
Blueprint for a New Computing Infrastructure. Morgan Kaufmann Publishers, 1998.
[35] H. Nakada, M. Sato, and S. Sekiguchi. Design and Implementation of Ninf: towards
a Global Computing Infrastructure. Future Generation Computing Systems, October
1999. Special Issue on Metacomputing.
[36] M. Neary, A. Phipps, S. Richman, and P. Cappello. Javelin 2.0: Java-Based Parallel
Computing in the Internet. In Proc. of the European Parallel Computing Conference
(EUROPAR 2000), 2000.
[37] J.K. Ousterhout. Scheduling Techniques for Concurrent Systems. In Proc. of the 3rd
Int. Conf. on Distributed Computing Systems, pages 22–30, May 1982.
[38] F. Petrini and W. Feng. Scheduling with Global Information in Distributed Systems.
In Proc. of the 20th Int. Conf. on Distributed Computing Systems. IEEE-CS Press, 2000.
[39] J. Pruyne and M. Livny. Interfacing Condor and PVM to harness the cycles of workstation clusters. Future Generation Computer Systems, 12, 1996.
[40] R. Raman, M. Livny, and M. Solomon. Matchmaking: Distributed resource management for high throughput computing. In Proceedings of the 7th IEEE International
Symposium on High Performance Distributed Computing, Chicago, Illinois, July
1998.
[41] R. Raman, M. Livny, and M. Solomon. Matchmaking: an extensible framework
for distributed resource management. Cluster Computing: The Journal of Networks, Software Tools and
Applications, 2(2), 1999.
[42] R. Raman, M. Livny, and M. Solomon. Gang-matchmaking: Advanced resource
management through multilateral matchmaking. In Proceedings of the 9th IEEE
Int’l Symposium on High Performance Distributed Computing, August 2000.
[43] J. Schopf. Ten Steps for SuperScheduling. http://www.cs.nwu.edu/~jms/schedwg/WD/schedwd.8.3.pdf, February 2000. Working Draft no. 8.3.
[44] P. Stelling, I. Foster, C. Kesselman, C.Lee, and G. von Laszewski. A fault detection
service for wide area distributed computations. In Proc. 7th IEEE Symp. on High
Performance Distributed Computing, pages 268–278, 1998.
[45] Condor Team. The Condor high throughput computing environment.
[46] Condor Team. Condor manual.
[47] U. Vahalia. Unix Internals: The New Frontiers. Prentice Hall, 1996.
[48] R. Wolski, J. Plank, J. Brevik, and T. Bryan. G-Commerce: Market Formulations
Controlling Resource Allocation on the Computational Grid. In Proc. of the Int. Parallel and Distributed Processing Symposium (IPDPS 2001). IEEE-CS Press, 2001.