Grid Catalogs of Files and Their Metadata⋆

Ricardo Brito da Rocha¹, Ákos Frohner¹, Peter Kunszt², Krzysztof Nienartowicz¹, and Daniel Rocha da Cunha Rodrigues¹

¹ CERN, IT Department, CH-1211 Geneva
² Swiss National Supercomputing Centre, CH-6928 Manno

⋆ This work was funded by Enabling Grids for E-sciencE (EGEE), a project of the European Commission (contract number INFSO–508833).
Abstract. Catalogs play an important role in distributed systems and
especially in Grids where they store necessary user, middleware and system information. In the (first) EU-funded Enabling Grids for E-SciencE
(EGEE) project’s Grid middleware stack (called gLite), the Grid catalogs follow a Service Oriented Architecture, exposing a set of Web Service
interfaces that may be combined and reused by higher-level services or
applications. The functional decomposition of the interfaces was driven
by the requirements on scalability, robustness, wide-area distribution and
security. It is especially important to be able to evolve and extend the
interfaces. We describe the requirements of the EGEE applications with
respect to catalogs, the architecture of the interfaces and the performance
of the gLite implementation.
1 Introduction and Related Work
In the EGEE architecture [1], the data catalogs store information about the data and metadata operated on in the Grid. In this article we describe the gLite FiReMan catalog service implementation, which provides a filesystem-like interface to data with full security and metadata semantics.
Data catalogs provide an important abstraction to the users of distributed
Grids. To keep the promise of ubiquitous computing, the Grid middleware has to
provide a set of interfaces to the user that hides the complexities of a heterogeneous, distributed system. Grid data catalogs aim to provide a simple interface to
the user to manage and organise his or her data. Grid data management services
then take care of making the data accessible to the user across the whole Grid
infrastructure while still providing the same Grid catalog interfaces everywhere.
Grid data catalogs therefore need to provide a set of simple, extensible, standardized interfaces that are easy to use and adaptable to both lightweight and sophisticated database back-ends, while being powerful enough to meet the various needs of the Grid user communities. The simplest, most straightforward abstraction layer that can be provided is that of a file system. In principle, a perfect global distributed filesystem meets most of the requirements for Grid data management, including file catalogs. Many distributed filesystems are in existence today [3]. There are also complete data management solutions available
for Grid computing like the Storage Resource Broker [2], Avaki [5], GridFarm
[6], etc. The same is true for peer-to-peer systems; there are many projects aiming to solve this problem, like Chord[7], Freenet[8], OceanStore[10], Tapestry [9]
and more. The obvious practical problem is that due to the heterogeneous and
collaborative nature of the Grid, a single global solution cannot be put in place
easily. Therefore, a standard interface for data catalogs in the Grid middleware
layer is being worked on in the GGF Grid File System Working Group [11].
This interface may then be exposed by the various middleware providers so that
standardised clients may access the data through this layer independently of
the underlying technology. Work on the Fireman catalog interface was carried
out while closely monitoring the evolution of the GFS standard so that it is
straightforward to provide a GFS-compliant interface once it is ready.
Other data catalog efforts that are being currently carried out include the
Globus Replica Location Service (RLS) [12] that is inherently operated in a
distributed way. The RLS is however not very rich in its interface and aims to
be very generic rather than to provide a specific file-system like interface. The EU
DataGrid implementation of the RLS [13] did address that issue to some extent.
A generic standard interface to data is provided by the OGSA-DAI project,
implementing the OGSA [14] Data Access and Integration Services interface
DAIS [15]. This interface provides generic access to data stored in databases, but it too is not aimed at providing a filesystem-like abstraction. The gLite FiReMan implementation additionally addresses the backend speed and scalability problems of previous approaches by exploiting the full feature set of the database backend.
2 Architecture
The Grid catalogs are used to manage the Grid file namespace and the locations of files, to store and retrieve metadata, and to keep data authorization information.
2.1 Interfaces
We decompose the catalogs into functional feature sets, each represented by a dedicated catalog interface (see Figure 1). We then combine them (the green boxes in Figure 1) within the services (the green ellipses) to achieve a performant solution.
Each interface exposes a well-defined set of operations to the client. In detail, the interfaces are as follows (a code sketch follows the list):
Authorization Base The basic authorization interface offers methods to set
and get permissions on one or more catalog entries. The permissions are
represented as basic permissions (permitted operations for owner, owner’s
group and others) and as Access Control List (ACL). The permissible operations in the gLite implementation are read, write, execute, remove, list, set
permission, set metadata and get metadata.
Metadata Base The methods of the base metadata interface deal with setting, querying, listing and removing metadata attributes on one or more catalog entries.
Metadata Schema The schema interface allows attribute definition and grouping into schemas. Attributes may be added to and removed from existing schemas. These attributes are then available to be filled with values through the Metadata Base interface.
Replica Catalog The Replica Catalog exposes operations concerning the replication aspect of the grid files. These operations include listing, adding and
removing replicas of a file identified by its GUID [18].
File Catalog The File Catalog allows for operations on the logical file names
(LFN) that it manages. The File Catalog operations are for example making
directories, renaming logical file entries and managing symbolic links.
File Authorization This interface contains the methods needed to implement a standalone authorization service, essentially the ability to add and remove permission entries.
Metadata Catalog In order to implement a standalone metadata catalog,
methods for adding and removing entries are required.
Combined Catalog If this interface is implemented by a catalog service, it
either also implements File, Replica and Base Authorization interfaces or
calls upon an external catalog in order to do so. If the implementation is not
in the same place, the Combined Catalog implementation needs to maintain
a persistent state of all operations it performs across catalogs in order to
make sure that the operations only occur in a synchronised manner and that
the method semantics are preserved.
StorageIndex The Storage Index interface is tightly coupled to the File and Replica Catalog functionality. Its sole aim is to return the list of Grid storage nodes where a given file (identified by its LFN or GUID) has a stored replica; it is used by workload or transfer management services to steer access to the files.
All catalog interfaces expose bulk operations. They increase performance and
optimise interaction with the Grid services. The reason for this interface decomposition is to have service interfaces with well-defined semantics which may be
implemented by many parties. In this model, a possible scenario is that more
than one interface is implemented by the same service. In Figure 1 the interfaces are grouped so as to show which gLite service provides which implementation. These services are:
gLite Fireman The gLite FIle and REplica MANager, known as the Fireman catalog, implements all file management interfaces. It is described in detail below.
gLite StorageIndex The Storage Index service is also provided by the Fireman implementation, but is offered as a separate port type.
gLite Metadata The standalone gLite Metadata Catalog implementation offers a full metadata catalog solution.
Fig. 1. Catalog Interfaces with the gLite services implementing them.
gLite FAS The File Authorization Service can be deployed as a simple authorization enforcement service for file access.
Of course, another implementation may choose to combine the interfaces in a different manner, further extending the deployment possibilities for these services.
2.2 Scalability and Consistency
The file catalogs deployed to date have all been deployed centrally and therefore represent a single point of failure. The central catalog model obviously has excellent consistency properties (concurrent writes are always managed at the same place), but it does not scale to many dozens of sites. There are three possible ways to solve this issue:
– Database Partitioning. The data in the database is kept at different sites.
For some applications where datasets are unlikely to move, partitioning on
data location (and mapping this implicitly into the logical namespace) may
be a good approach, solving the scalability problem.
– Database Replication. The underlying database is replicated using native database replication techniques. This may mean lock-in to a vendor-specific solution. Currently, commercial database vendors like Oracle provide multi-master database replication options which may be exploited for such a purpose.
– Lazy database synchronisation, exploiting the specific semantics of the catalogs and using message-oriented middleware to propagate the updates. Reliable messaging technologies are available commercially (just as replication is for database technologies) and there are also some solutions in the open-source domain.
The Fireman implementation is able to accommodate all three solutions by design. Consistency might be broken in the second and third model, i.e. it is possible to register the same LFN in two remote catalogs at the same time such that a conflict occurs; reconciliation techniques apply in both cases. In the third case we can exploit the specific semantics of the system, using the uniqueness of the GUIDs to detect consistency problems and to notify the users asynchronously. Fireman has publishing logic built on the standard Java Message Service (JMS) which, when enabled, publishes all changes to a hierarchical topic of an external JMS service. An independent component may then subscribe to a freely selected topic and update other catalogs or digest the changes at will (a minimal JMS publishing sketch follows).
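As a rough illustration of this publish/subscribe approach, the sketch below uses the standard JMS 1.1 API to publish a single catalog change event to a topic. The JNDI names, topic name and message payload are assumptions for the example only, not the actual Fireman message format.

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;
import javax.naming.InitialContext;

// Hypothetical publisher of catalog change events; a subscriber (e.g. a
// remote catalog instance) can replay the events to stay in sync.
public class CatalogChangePublisher {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory factory = (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Topic topic = (Topic) ctx.lookup("catalog/changes/lfn"); // assumed hierarchical topic

        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageProducer producer = session.createProducer(topic);

        // Publish one change event with an illustrative payload.
        TextMessage message = session.createTextMessage(
                "CREATE /grid/vo/data/run0001/file.root");
        producer.send(message);

        producer.close();
        session.close();
        connection.close();
    }
}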
2.3 Bulk Operations
The whole data management design is optimised for bulk operations as opposed
to single-shot operations. The reason is that in the stateless web service model
in the service oriented architecture (SOA) of gLite, there is a considerable performance hit when the connection between client and server is established and
the security or transaction context is built. If this connection had to be rebuilt for each operation, the client-server interaction would be very inefficient (see Section 4.3).
For this reason, all catalog operations are built around stateless database backend bulk operations wherever it is reasonable to make them part of the interface. Using them to bundle similar tasks into a single operation increases performance considerably, optimising the interaction with the Grid services. We also provide a 'simple' interface to our user community, a thin layer on top of the client interfaces that automatically optimises this aspect of interface usage (see the sketch below).
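A hypothetical version of such a thin client-side layer could look like the following: callers add entries one at a time, while the wrapper batches them and issues a single bulk call per flush. It reuses the hypothetical FileCatalog interface from the earlier sketch; the batch size is an arbitrary example value.

import java.util.ArrayList;
import java.util.List;

// Hypothetical 'simple' client-side wrapper: single-entry calls from the
// user are buffered and forwarded to the service as one bulk operation.
public class BulkingFileCatalogClient {
    private static final int BATCH_SIZE = 100;
    private final List<String> pending = new ArrayList<String>();
    private final FileCatalog catalog;   // decomposed interface from the sketch above

    public BulkingFileCatalogClient(FileCatalog catalog) {
        this.catalog = catalog;
    }

    public void mkdir(String directory) {
        pending.add(directory);
        if (pending.size() >= BATCH_SIZE) {
            flush();
        }
    }

    public void flush() {
        if (!pending.isEmpty()) {
            catalog.mkdir(new ArrayList<String>(pending)); // one SOAP round trip
            pending.clear();
        }
    }
}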
3 Requirements
The catalogs were built based on requirements from the High Energy Physics (HEP) experiments carried out at CERN and from the Biomedical user communities of the EGEE project.
The specific application requirements are preceded by four guiding principles
that are applicable to all of Grid computing: Interoperability allowing multiple
implementations of a service or interface, multi-platform support, modularity
in the sense of service oriented architectures (SOA) and scalability to several
hundreds of Grid sites.
The requirements on data catalogs can be summarized as follows:
Logical namespace management This implies having a well structured and
logical namespace allowing easy management of files. Implicitly it has to
keep track of sites where data is stored and to be able to resolve conflicts
resulting from the distributed nature of the Grid.
Virtual Filesystem view Both user communities have explicitly requested a
hierarchical filesystem-like organisation of the logical namespace.
Support for metadata attached to files Gives the user additional information on the file beyond the regular Unix attributes, allowing user-defined extensions.
Bulk Operations These have been requested explicitly in order to improve
performance.
Security Basic unix file permission semantics as well as fine-grained ACLs have
been requested, with the possibility to use one or the other based on the
policy of the user community.
Support flexible deployment models This includes support for different modes of operation: a single central catalog, local catalogs connected to a single central catalog, or site-local catalogs without any central catalog.
Scalable Scale up to many clients (in HEP, depending on the place in the processing chain, this can be tens to hundreds) and to a large number of entries (the initial HEP requirement of 10¹¹ was relaxed to 10⁹ per VO per service), addressing the performance issues of previous catalog implementations.
As expected, the different user communities rated the priorities of these requirements differently. For the HEP user community, performance is far more important than security, which was the most important requirement of the Biomedical applications. Scalability in terms of data volume is again more important to the HEP community (with different scenarios for data acquisition and analysis), while flexible deployment models were more important to the Biomedical applications due to some data locality restrictions.
The gLite Fireman implementation tried to take all of these requirements into account, finding a good balance where they conflicted with each other (such as performance and security).
4 Implementation
Two distinct implementations of the File and Replica Catalogs exist, using Oracle and MySQL as RDBMS backends. The reason to have two solutions is that
one of the requirements is to have an open-source version of the catalog that
does not have a commercial software dependency. At the same time, the hard
scaling requirements of some user communities (especially HEP) cannot be met
by open-source RDBMS solutions yet. Oracle was chosen because the physics community has a license for all LHC-related work and because many of its language features (hierarchical queries, object-relational functionality) alleviate, in a performant way, the impedance mismatch between the hierarchical, bulk-manipulated filesystem view and the internal data model.
4.1 Oracle Implementation
The Oracle version aims to provide maximal performance and scalability. The web service layer is very thin, except for the hooks for the publishing logic, providing a straight handover to the database where the actual application logic is implemented using stored procedures and Oracle-internal objects (see Figure 2).

Fig. 2. Oracle Fireman architecture.

The internal architecture is built around postulated access patterns of Grid entities, optimised for bulk operations on neighbouring regions of the hierarchical namespace. Automatic clustering of sub-trees with index-organized tables naturally promotes caching of the most frequently used namespace regions.
Advanced indexing techniques and object transformations within complex queries lift the catalog limits to what is physically achievable with databases on commodity hardware today (see Figure 3). The advantage of the Oracle version is not only that it can scale up to very large data volumes; it is also able to serve many clients with no significant loss of speed. This has been tested with several hundred million entries: 400 million logical file names (LFNs), 10 million directories, associated replicas and access control lists (ACLs) plus metadata items, leading to over 900 million aggregated items on a single database node. In addition, the Oracle implementation provides a set of features other implementations are usually unable to provide:
– Internally implemented partial error reporting without loss of speed. This means that if, for example, 3 items out of 1000 fail during one call, no rollback is necessary: the remaining 997 items can be committed (or read, depending on the operation) based on a user-definable policy (a client-side sketch follows this list).
– Fast LFN renaming with negligible cost regardless of position in the hierarchy
– Flat behaviour and the same response times with 10⁶-10⁸ items
– A very large number of catalog principals (i.e. file owners), in the thousands; an unlimited number of superusers
– Maximum size of a single logical file name of 32K; unlimited size of ACLs and metadata attributes
– Unlimited number of directory levels
– Fast runtime checks for loops in symlinks
– Fast removal of all pertaining ACLs if a user or group is removed; easy reporting on ACL data
– Handling of ownership of newly created items, home directories and dangling GUIDs
– Java Message Service based, easily pluggable catalog distribution, with the functionality completely isolated from the access protocol (WS, JDBC, C++)
– A separate service to digest published changes for distribution or accounting
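The per-item error reporting could be consumed on the client side roughly as follows. The result type and handler are hypothetical; the sketch only illustrates the idea that a bulk call returns an individual status for every item instead of failing as a whole.

import java.util.List;

// Hypothetical per-item result of a bulk catalog operation.
class BulkItemResult {
    final String lfn;
    final boolean succeeded;
    final String errorMessage;   // null when succeeded

    BulkItemResult(String lfn, boolean succeeded, String errorMessage) {
        this.lfn = lfn;
        this.succeeded = succeeded;
        this.errorMessage = errorMessage;
    }
}

class BulkResultHandler {
    // Successful items stay committed; only the failed ones are reported back.
    static void report(List<BulkItemResult> results) {
        int failed = 0;
        for (BulkItemResult r : results) {
            if (!r.succeeded) {
                failed++;
                System.err.println("failed: " + r.lfn + " (" + r.errorMessage + ")");
            }
        }
        System.out.println((results.size() - failed) + " of " + results.size()
                + " items committed");
    }
}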
Fig. 3. Internal namespace decomposition and clustering for improved caching and low
level clustering.
Partitioning comes naturally, since the schema transparently gives the ability to balance several levels of physical data skew based on the access patterns, using hierarchy slicing, hierarchy traversal and hierarchical clustering combined with Oracle's hierarchical queries (see Figure 3). All this keeps the performance hit very small, which is a frequent problem when introducing partitioning for large data volumes.
We would like to stress that, despite the rich functionality of Oracle, combining some of its features (mainly hierarchical queries and objects), sometimes in exotic functional combinations, was a very tedious task; resolving all blocking issues often reached Oracle's low-level support developers, who had to fix dozens of internal Oracle bugs discovered by us. The workarounds found for bugs that could not be fixed in the Oracle database could form the basis of several advanced SQL tutorials. The database has professional backup strategies, runs on at least a 2-node cluster for the LHC community, and is usually managed by dedicated database administrators, assuring a very good quality of service.
4.2 Open Source Implementation
The open-source version of the Fireman implementation was developed with the
aim of being independent of the RDBMS backend. All of the business logic is written in Java and is kept in the web service layer as a web application.
It implements the full set of interfaces as described above, including all security
and metadata features. This implementation has also undergone very thorough
testing.
The advantage of the open-source version is that it can easily be deployed and adapted to any database solution. Currently it has only one reference implementation, with very light ties to the underlying database (MySQL); the adaptation to other databases is straightforward. Although developing the MySQL implementation was less strenuous than the Oracle one, none of the extra features of the Oracle implementation are available here, and the limitations follow from the capabilities of MySQL. For example, the maximal file name or directory name size within an LFN is 255 characters due to indexing constraints, and renaming of directories is not really feasible.
Only very few of the options necessary to implement the database logic the way it is done with Oracle are possible with open-source database solutions. The result is a scalability that is orders of magnitude smaller and 10-200 times slower for larger input sets. Exact comparisons were not made, as the MySQL version could not pass the acceptance test at the minimal volume we estimated for such tests (20M entries); it performed at acceptable levels up to the 5M entries it was designed for.
4.3 Performance and Scaling
Several performance tests were conducted, both by ARDA and by the internal team. Results varied over the course of the evolution as the product matured. As a reference setup we chose a standard CERN disk server for the Oracle DB server: dual Intel(R) Xeon(TM) 2.4 GHz CPUs with 512 kB cache, 2 GB RAM (1.7 GB SGA) and roughly 1000 GB of disk space available for the database through RAID0 over 9 mirrored disks. The Oracle DB was configured with a 16 kB block size. We used a regular CERN batch node for the web service layer, which naturally implies a bottleneck due to the SOAP and security handling on the web server node when there are many clients.
The reference database was preloaded with 25M LFN entries with, on average, 2.5 replicas and 3 ACLs associated per LFN. 5% of the LFNs were symbolic links with two levels of indirection. The average LFN directory depth was 10.
The results show huge gains in speed depending on the bulk mode and the number of clients (see Figure 4 and Figure 5). The speedup is quasi-linear and drops after reaching the memory threshold on the web server and the I/O limitations of the database machine.

Fig. 4. Performance of a complex lookup operation involving security check, replica retrieval and possible symlink traversal.

Fig. 5. Scalability of the bulk creation operation.

The speed of retrieval may also depend on the data skew in the requested items, i.e. if all 1000 items in a 1000-item bulk call are placed in the same directory, one sees different behaviour than if the 1000 items are distributed over 1000 directories. This relation is non-linear because we optimise security checks internally: if there are overlapping parts of the path, permission checks are managed in one go. The directory depth also plays a role, as queries actually traverse trees internally. For creation, where even more complex processing is needed, we also suffer from the clustering factor mentioned above, which optimises bulk reads and joins for later reading. Still, the performance achieved should satisfy even the tallest requirements of the HEP or Biomedical applications. This is especially true for bigger sets of data, where memory congestion of the web service layer could be solved by clustering.
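The effect of overlapping paths can be pictured with a small, purely illustrative sketch (not the actual Fireman logic): if ACL evaluations are cached per parent directory, a bulk request whose items share directories triggers far fewer checks than one spread over many directories. All names below are hypothetical.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration of why data skew matters: permission checks on shared parent
// directories can be evaluated once and reused for all items under them.
class PrefixPermissionChecker {
    private final Map<String, Boolean> cache = new HashMap<String, Boolean>();

    boolean isReadable(String lfn) {
        int slash = lfn.lastIndexOf('/');
        String dir = slash > 0 ? lfn.substring(0, slash) : "/";
        Boolean cached = cache.get(dir);
        if (cached == null) {
            cached = checkDirectoryAcl(dir);   // expensive: walks the hierarchy
            cache.put(dir, cached);
        }
        return cached;
    }

    int countAclEvaluations(List<String> lfns) {
        cache.clear();
        for (String lfn : lfns) {
            isReadable(lfn);
        }
        // 1000 LFNs in one directory -> 1 evaluation; spread over 1000
        // directories -> 1000 evaluations.
        return cache.size();
    }

    private boolean checkDirectoryAcl(String dir) {
        return true;   // placeholder for the real hierarchical ACL evaluation
    }
}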
Security and SOAP overhead The security context has been removed from the diagrams for clarity. One should add around 200-600 ms per call for authentication (proxy certificate validation and security context setup) and around 10 kB of bidirectional wire transfer for a successful call. These numbers may vary depending on the setup, so we left them out of the figures. The CPU load comes mostly from the security handling, and under heavy SOAP load 40 clients could saturate a 2 GHz web server, which confirms the importance of the bulk mode. SOAP messages are big and may reach 200 kB per call for 100 items, which should be multiplied by 3 to get the memory space needed on the bean server. There is clearly room for optimisation here.
4.4 Metadata
In addition to the hierarchical structure and its filesystem-specific metadata (LFN, size, permissions, ...), gLite FiReMan also provides means for applications to attach their own specific metadata and to perform catalog queries based on this information.
The metadata interface tries to deliver, at the same time, high flexibility in metadata definition for clients of the catalog and the means for implementors to add this functionality to their catalogs in an optimised way (even if the generic nature of application metadata imposes certain limits on these optimisations).
In this interface it is required that all attributes be associated to a schema, with
entries in the catalog being associated with schemas, not attributes directly.
On the querying side, a metadata query language was defined whose grammar is a subset of SQL (the obvious choice considering the support for and preferences of the involved communities). Capabilities for restricting the result set of the queries, as in SQL, are provided, as well as the possibility of intersecting entries associated with different schemas (see the hypothetical example below).
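A hypothetical client-side use of the query interface could look as follows. The interface, method name, schema names and attribute names are assumptions and only illustrate that the query grammar is an SQL subset over schema attributes, with result-set restriction and schema intersection.

import java.util.List;

// Hypothetical metadata query client; the names used here are made up.
interface MetadataQuery {
    List<String> query(String queryString);
}

class MetadataQueryExample {
    static List<String> findGoodHighEnergyRuns(MetadataQuery catalog) {
        // Restrict the result set and intersect two schemas, SQL-style.
        return catalog.query(
                "SELECT entry FROM RunSchema, QualitySchema "
              + "WHERE RunSchema.energy > 900 AND QualitySchema.flag = 'good'");
    }
}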
The gLite metadata interface is exposed by both gLite FiReMan catalogs
(MySQL and Oracle), as well as the gLite standalone metadata catalog. It has
also been integrated into the Atlas AMI metadata catalog [16], showing that it
can be a solution to achieve interoperability at both the file and dataset metadata
levels.
5 Security
By imposing ACLs on the filesystem the security semantics are straightforward.
This should also help in avoiding concurrency issues when writing into the catalog since each user will have only limited access rights in the LFN namespace
and there should be only a finite set of administrators per VO who have full access rights over all of their LFN namespace. The probability of two users with the same access rights writing into the same catalog directory in a distributed system, without knowing of each other, is therefore much lower.
The Authorization Base interface exposes the operations on the file ACLs.
There are two possibilities of how ACLs may be implemented.
– POSIX-like ACL The POSIX semantics follow the Unix filesystem semantics. In order to check whether the user is eligible to perform the requested operation on a file, all of the parent directory permissions and ACLs need to be evaluated as well.
– NTFS-like ACL The Microsoft Windows semantics are simpler, i.e. the
ACLs are stored with the file and the branch has no effect on the ACL.
These are “leaf” ACLs, only operating on the file itself.
In a distributed environment, the NTFS-like semantics are simpler to track and are probably more efficient, since the namespace hierarchy has no impact on the distribution. On the other hand, POSIX-like semantics are sometimes more intuitive to the target user group (the sketch below contrasts the two models).
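The difference between the two models can be sketched as follows. The ACL store is a placeholder assumption; the point is only which path components are consulted for a single read check.

// Sketch of the two ACL evaluation models; AclStore.allows() stands in for
// the catalog's actual ACL lookup.
class AclModels {

    interface AclStore {
        boolean allows(String path, String user, String operation);
    }

    // POSIX-like: every parent directory must be traversable as well.
    static boolean posixCheck(String lfn, String user, AclStore store) {
        String[] parts = lfn.split("/");
        StringBuilder path = new StringBuilder();
        for (int i = 1; i < parts.length - 1; i++) {
            path.append("/").append(parts[i]);
            if (!store.allows(path.toString(), user, "execute")) {
                return false;
            }
        }
        return store.allows(lfn, user, "read");
    }

    // NTFS-like: only the "leaf" ACL stored with the entry itself matters.
    static boolean ntfsCheck(String lfn, String user, AclStore store) {
        return store.allows(lfn, user, "read");
    }
}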
The Authorization Base interface exposes all operations that deal with querying and setting of the file ACLs. It acts as the authorization authority for file
access and is called by other services such as the File Placement Service to
enforce ACL security.
6 Distribution
The Master Replica Currently in HEP we do not expect files to be updated once distributed, but we provide the aforementioned proof-of-concept distribution/update mechanism based on a subscription paradigm that tunnels change operations via message topics. JORAM [17] and Oracle Advanced Queuing were used as messaging platforms for the tests. To manage a global view of the state of distributed replicas, a placeholder is needed to enable such functionality at the interface level. The master replica flag for a SURL, as present in the File Catalog, may be used to mark a SURL as the only replica where update operations are allowed. This may then also be the only source for replication. If the master replica is lost, it may or may not be recovered from other replicas, based on VO policies. A master replica should always be kept on a reliable SE providing high QoS (permanent space semantics).
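As a trivial illustration of how the master-replica flag could be used at the interface level, a client that needs to apply an update might select the master SURL as follows; the Replica record type is a hypothetical stand-in for the File Catalog's replica information.

import java.util.List;

// Hypothetical replica record carrying the master flag from the File Catalog.
class Replica {
    final String surl;
    final boolean master;

    Replica(String surl, boolean master) {
        this.surl = surl;
        this.master = master;
    }
}

class MasterReplicaSelector {
    // Updates are only allowed against the master replica, if one exists.
    static String selectUpdateTarget(List<Replica> replicas) {
        for (Replica r : replicas) {
            if (r.master) {
                return r.surl;
            }
        }
        throw new IllegalStateException("no master replica registered; updates not allowed");
    }
}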
7 Summary and Open Issues
The most important open issue is that the web service interface is perceived as sub-optimal, especially by the HEP user community, where the prevailing opinion is that it is too slow due to the XML marshalling. However, we can show that, depending on the usage pattern, the specific performance and scalability needs of the HEP community can also be met by web services. One issue is that the user communities sometimes prefer to implement their own custom catalog solutions, leading to non-interoperable services that do not address scalability problems well. It is therefore not enough to provide good services meeting the stated requirements; one must also achieve good acceptance in the user communities, especially where some misconceptions about web services prevail.
Due to these acceptance issues we have been unable to deploy the Fireman catalog in a distributed fashion across the wide area, so we cannot present numbers on how it scales with multiple instances sending update messages between the individual nodes. This work may be resumed in the future.
In summary however, we can state that the gLite Fireman catalog provides
a complete set of interfaces for file and metadata catalog operations. It is able
to scale to very large numbers of entries depending on the underlying database
technology being used. The distribution of the catalogs has not been tested in
the wide area yet, but preliminary results are promising.
References
1. EGEE JRA1 Middleware Activity Deliverable DJRA1.4: The gLite Middleware
Architecture. https://edms.cern.ch/document/594698/1.0/.
2. Chaitanya Baru, Reagan Moore, Arcot Rajasekar, Michael Wan, The SDSC Storage Resource Broker, Proc. CASCON'98 Conference, Nov. 30-Dec. 3, 1998, Toronto, Canada.
3. Peter Honeyman, Distributed File Systems in Distributed Computing: Implementation and Management Strategies, ed. Rhaman Kanna, pp. 27-44, Prentice-Hall
(1994).
4. J. Howard, M. Kazar, S. Menees, D. Nichols and M. West, Scale and Performance in a Distributed File System, Proceedings of the Eleventh ACM Symposium on Operating Systems Principles, 1987, ISBN 0-89791-242-X, ACM Press.
5. Avaki Corporation. Keep it simple: Overcome information integration Challenges
with Avaki Data Grid Software. White Paper. July 2003.
6. Osamu Tatebe, et. al, ”Worldwide Fast File Replication on Grid Datafarm”, Proceedings of the 2003 Computing in High Energy and Nuclear Physics (CHEP03),
March 2003.
7. I. Stoica, et. al, Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications, SIGCOMM Conf., 2001.
8. I. Clarke, et. al, Protecting Free Expression Online with Freenet, IEEE Internet
Computing, Vol. 6, No. 1, 2002.
9. Ben Y. Zhao, et. al. Tapestry: A Resilient Global-scale Overlay for Service Deployment, IEEE Journal on Selected Areas in Communications, Vol 22, No. 1, January
2004.
10. John Kubiatowicz, et. al, OceanStore: An Architecture for Global-Scale Persistent
Storage, Proc. of ASPLOS 2000 Conference, November 2000.
11. The Grid File System Working Group, GGF, https://forge.gridforum.org/projects/gfs-wg
12. A. Chervenak et al., Giggle: A Framework for Constructing Scalable Replica Location Services, Proceedings of Supercomputing 2002 (SC2002)
13. L. Guy, et. al, Replica Management in Data Grids, Global Grid Forum 5, 2002.
14. The Open Grid Services Architecture, Version 1.0, GGF OGSA WG
https://forge.gridforum.org/projects/ogsa-wg
15. Data Access and Integration Services, GGF Data Access and Integration Services Working Group, http://forge.gridforum.org/projects/dais-rg/document/
16. Thomas Doherty, Development of Web Service Based Security Components for the ATLAS Metadata Interface (Master's Thesis), September 2005, http://ppewww.ph.gla.ac.uk/~tdoherty/MScThesis/MScThesis.pdf
17. JORAM: Java (TM) Open Reliable Asynchronous Messaging, http://joram.objectweb.org/
18. Paul J. Leach and Rich Salz, UUIDs and GUIDs, INTERNET-DRAFT, February 1998.