Grid Catalogs of Files and Their Metadata

Ricardo Brito da Rocha (1), Ákos Frohner (1), Peter Kunszt (2), Krzysztof Nienartowicz (1), and Daniel Rocha da Cunha Rodrigues (1)

(1) CERN, IT Department, CH-1211 Geneva
(2) Swiss National Supercomputing Centre, CH-6928 Manno

This work was funded by Enabling Grids for E-sciencE (EGEE), a project of the European Commission (contract number INFSO-508833).

Abstract. Catalogs play an important role in distributed systems and especially in Grids, where they store necessary user, middleware and system information. In the (first) EU-funded Enabling Grids for E-SciencE (EGEE) project's Grid middleware stack (called gLite), the Grid catalogs follow a Service Oriented Architecture, exposing a set of Web Service interfaces that may be combined and reused by higher-level services or applications. The functional decomposition of the interfaces was driven by the requirements on scalability, robustness, wide-area distribution and security. It is especially important to be able to evolve and extend the interfaces. We describe the requirements of the EGEE applications with respect to catalogs, the architecture of the interfaces and the performance of the gLite implementation.

1 Introduction and Related Work

In the EGEE architecture [1], the data catalogs store information about the data and metadata that is operated on in the Grid. In this article we describe the gLite FiReMan catalog service implementation, which provides a filesystem-like interface to data including full security and metadata semantics.

Data catalogs provide an important abstraction to the users of distributed Grids. To keep the promise of ubiquitous computing, the Grid middleware has to provide a set of interfaces to the user that hides the complexities of a heterogeneous, distributed system. Grid data catalogs aim to provide a simple interface to the user to manage and organise his or her data. Grid data management services then take care of making the data accessible to the user across the whole Grid infrastructure while still providing the same Grid catalog interfaces everywhere. Grid data catalogs therefore need to provide a set of simple, extensible, standardized interfaces that are easy to use and adaptable to both lightweight and sophisticated database back-ends, while being powerful enough to meet the various needs of the Grid user communities.

The simplest, most straightforward abstraction layer that can be provided is that of a file system. In principle, a perfect global distributed filesystem meets most of the requirements for Grid data management, including file catalogs. Many distributed filesystems are in existence today [3]. There are also complete data management solutions available for Grid computing, like the Storage Resource Broker [2], Avaki [5], GridFarm [6], etc. The same is true for peer-to-peer systems; there are many projects aiming to solve this problem, like Chord [7], Freenet [8], OceanStore [10], Tapestry [9] and more. The obvious practical problem is that due to the heterogeneous and collaborative nature of the Grid, a single global solution cannot be put in place easily. Therefore, a standard interface for data catalogs in the Grid middleware layer is being worked on in the GGF Grid File System Working Group [11]. This interface may then be exposed by the various middleware providers so that standardised clients may access the data through this layer independently of the underlying technology.
Work on the FiReMan catalog interface was carried out while closely monitoring the evolution of the GFS standard, so that it will be straightforward to provide a GFS-compliant interface once it is ready. Other data catalog efforts currently being carried out include the Globus Replica Location Service (RLS) [12], which is inherently operated in a distributed way. The RLS is, however, not very rich in its interface and aims to be very generic rather than to provide a specific filesystem-like interface. The EU DataGrid implementation of the RLS [13] did address that issue to some extent. A generic standard interface to data is provided by the OGSA-DAI project, implementing the OGSA [14] Data Access and Integration Services interface DAIS [15]. This interface does a good job of giving generic access to data stored in databases, but it is also not aimed at providing a filesystem-like abstraction. The gLite FiReMan implementation also addresses the backend speed and scalability problems of previous approaches, exploiting all available features of the database backend.

2 Architecture

The Grid catalogs are used to manage the Grid file namespaces and the location of the files, to store and retrieve metadata, and to keep data authorization information.

2.1 Interfaces

We decompose the catalogs into functional feature sets, each represented by a dedicated catalog interface (see Figure 1). These interfaces (the boxes in Figure 1) are then naturally combined within the services (the ellipses in Figure 1) to achieve a performant solution. The interfaces expose a well-defined set of operations to the client. In detail, the interfaces are:

Authorization Base. The basic authorization interface offers methods to set and get permissions on one or more catalog entries. The permissions are represented both as basic permissions (permitted operations for the owner, the owner's group and others) and as Access Control Lists (ACLs). The permissible operations in the gLite implementation are read, write, execute, remove, list, set permission, set metadata and get metadata.

Metadata Base. The methods of the base metadata interface deal with setting, querying, listing and removing metadata attributes on one or more catalog entries.

Metadata Schema. The schema interface allows attributes to be defined and grouped into schemas. Attributes may be added to and removed from existing schemas. These attributes are then available to be filled with values through the Metadata Base interface.

Replica Catalog. The Replica Catalog exposes operations concerning the replication aspect of Grid files. These operations include listing, adding and removing replicas of a file identified by its GUID [18].

File Catalog. The File Catalog allows for operations on the logical file names (LFNs) that it manages. The File Catalog operations are, for example, making directories, renaming logical file entries and managing symbolic links.

File Authorization. This interface contains the methods needed to implement a standalone authorization service, basically the ability to add and remove permission entries.

Metadata Catalog. In order to implement a standalone metadata catalog, methods for adding and removing entries are required.

Combined Catalog. If this interface is implemented by a catalog service, it either also implements the File, Replica and Base Authorization interfaces or calls upon an external catalog to do so. If the implementation is not in the same place, the Combined Catalog implementation needs to maintain a persistent state of all operations it performs across catalogs, in order to make sure that the operations only occur in a synchronised manner and that the method semantics are preserved.

StorageIndex. The Storage Index interface is tightly coupled to the File and Replica Catalog functionality. Its sole aim is to return the list of Grid storage nodes where a given file (identified by its LFN or GUID) has a stored replica; it is used to steer access to the files by workload or transfer management services.

All catalog interfaces expose bulk operations. They increase performance and optimise the interaction with the Grid services.
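To make the decomposition concrete, the sketch below models a few of these port types as plain Java interfaces with bulk-style signatures. The names and signatures are our own illustrative shorthand, not the actual gLite WSDL definitions.

```java
import java.util.List;
import java.util.Map;

// Illustrative sketch only: names and signatures are simplified stand-ins
// for the port types described in Section 2.1, not the gLite API.

/** Basic authorization: get/set permissions and ACLs on catalog entries. */
interface AuthorizationBase {
    void setPermissions(List<String> items, String basicPermissions);
    Map<String, List<String>> getACLs(List<String> items);   // item -> ACL entries
}

/** Replica catalog: replicas of a file identified by its GUID. */
interface ReplicaCatalog {
    List<String> listReplicas(String guid);
    void addReplicas(Map<String, List<String>> guidToSurls); // bulk add
}

/** File catalog: operations on the logical file name (LFN) namespace. */
interface FileCatalog {
    void mkdir(List<String> directories);                    // bulk create
    void rename(Map<String, String> oldToNewLfn);
    void createSymlinks(Map<String, String> linkToTarget);
}

/** Storage index: where does a file (LFN or GUID) have replicas? */
interface StorageIndex {
    List<String> listStorageElements(String lfnOrGuid);
}
```

Note how the mutating operations take collections rather than single items, reflecting the bulk-oriented design discussed in Section 2.3 below.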
The reason for this interface decomposition is to have service interfaces with well-defined semantics that may be implemented by many parties. In this model, a possible scenario is that more than one interface is implemented by the same service. In Figure 1 the interfaces are grouped to show which gLite service provides which implementation:

gLite Fireman. The gLite FIle and REplica MANager, called the Fireman catalog, implements all file management interfaces. It is described in detail below.

gLite StorageIndex. The Storage Index service is also provided by the Fireman implementation, but is offered as a separate port type.

gLite Metadata. The standalone gLite Metadata Catalog implementation offers a full metadata catalog solution.

gLite FAS. The File Authorization Service can be deployed as a simple authorization enforcement service for file access.

Fig. 1. Catalog interfaces with the gLite services implementing them.

Of course, another implementation may choose to implement the interfaces in a different manner, to even further extend the possibilities for deployment of the services.

2.2 Scalability and Consistency

The file catalogs that have been deployed to date are all deployed centrally and are therefore a single point of failure. The central catalog model obviously has excellent consistency properties (concurrent writes are always managed at the same place), but it does not scale to many dozens of sites. There are three possibilities to solve this issue:

– Database partitioning. The data in the database is kept at different sites. For some applications where datasets are unlikely to move, partitioning on data location (and mapping this implicitly into the logical namespace) may be a good approach, solving the scalability problem.
– Database replication. The underlying database is replicated using native database replication techniques. This may mean a lock-in to a vendor-specific solution. Currently, commercial database vendors like Oracle provide multi-master database replication options which may be exploited for such a purpose.
– Lazy database synchronisation, exploiting the specific semantics of the catalogs and using message-oriented middleware to propagate the updates. Reliable messaging technologies are available commercially (just like replication for database technologies), and there are also some solutions in the open source domain.

The Fireman implementation is able to accommodate all three solutions by design. Consistency might be broken in the second and third models, i.e. it is possible to register the same LFN in two remote catalogs at the same time such that a conflict occurs. The same reconciliation techniques apply in both cases. In the third case we can be specific to the semantics of the system and exploit the uniqueness of the GUIDs to detect consistency problems and to notify the users asynchronously. Fireman has publishing logic built on the standard Java Messaging Service (JMS) which, when enabled, publishes all changes to a hierarchical topic of an external JMS service. An independent component may then subscribe to a freely selected topic and update other catalogs or digest the changes at will.
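As a minimal illustration of this publish/subscribe scheme, the sketch below shows an independent component subscribing to a JMS topic on which catalog changes are published. The JNDI names, the topic name and the text-message payload are assumptions for the example, not the actual Fireman publishing schema.

```java
import javax.jms.*;
import javax.naming.InitialContext;

// Sketch of an independent catalog-update digester using the JMS 1.1 API.
// JNDI names, topic name and message format are illustrative assumptions.
public class CatalogUpdateDigester {
    public static void main(String[] args) throws Exception {
        InitialContext jndi = new InitialContext();  // JMS provider (e.g. JORAM) configured via jndi.properties
        ConnectionFactory factory = (ConnectionFactory) jndi.lookup("ConnectionFactory");
        Topic changes = (Topic) jndi.lookup("fireman/changes/lfn"); // hypothetical topic name

        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        MessageConsumer consumer = session.createConsumer(changes);

        // Each published change could be replayed against another catalog
        // instance or simply accounted for; here we only print it.
        consumer.setMessageListener(message -> {
            try {
                System.out.println("catalog change: " + ((TextMessage) message).getText());
            } catch (JMSException e) {
                e.printStackTrace();
            }
        });
        connection.start();  // asynchronous delivery begins here
    }
}
```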
2.3 Bulk Operations

The whole data management design is optimised for bulk operations as opposed to single-shot operations. The reason is that in the stateless web service model of gLite's service oriented architecture (SOA), there is a considerable performance hit when the connection between client and server is established and the security or transaction context is built. If this connection had to be rebuilt for each operation, the client-server interaction would be very inefficient (see Section 4.3). For this reason, all catalog operations are built around stateless database backend bulk operations as part of the interface wherever it is reasonable. Using them to bundle similar tasks into a single operation increases performance considerably, optimising the interaction with the Grid services. We also provide a 'simple' interface to our user community, a thin layer on top of the client interfaces that automatically optimises this aspect of interface usage.

3 Requirements

The catalogs were built based on requirements from the High Energy Physics (HEP) experiments carried out at CERN and from the Biomedical user communities of the EGEE project. The specific application requirements are preceded by four guiding principles that are applicable to all of Grid computing: interoperability (allowing multiple implementations of a service or interface), multi-platform support, modularity in the sense of service oriented architectures (SOA), and scalability to several hundreds of Grid sites. The requirements on data catalogs can be summarised as follows:

Logical namespace management. This implies having a well-structured logical namespace allowing easy management of files. Implicitly, the catalog has to keep track of the sites where data is stored and be able to resolve conflicts resulting from the distributed nature of the Grid.

Virtual filesystem view. Both user communities have explicitly requested a hierarchical, filesystem-like organisation of the logical namespace.

Support for metadata attached to files. Give the user additional information on the file going beyond the regular Unix parameters, allowing user-defined extensions.

Bulk operations. These have been requested explicitly in order to improve performance (a simple cost sketch is given after this list).

Security. Basic Unix file permission semantics as well as fine-grained ACLs have been requested, with the possibility to use one or the other based on the policy of the user community.

Support for flexible deployment models. This includes support for different modes of operation: a single central catalog, local catalogs connected to a single central catalog, or site-local catalogs without a single central catalog.

Scalability. The catalogs must scale up to many clients (in HEP, depending on the place in a processing chain, this could be tens to hundreds) and to a large number of entries (the initial HEP requirement of 10^11 entries was relaxed to 10^9 per VO per service), addressing the performance issues of previous catalog implementations.
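To quantify the bulk-operations argument of Section 2.3, a simple back-of-the-envelope cost model (our own sketch, not a measurement from this work) compares B single-shot calls with one bulk call of B items, where t_c is the per-call connection/security overhead (Section 4.3 quotes roughly 200-600 ms) and t_i is an assumed per-item processing cost:

```latex
% Back-of-the-envelope model: per-call overhead t_c, per-item cost t_i,
% B items per bulk call. t_i is an assumed, implementation-dependent value.
\[
  T_{\mathrm{single}}(B) = B\,(t_c + t_i), \qquad
  T_{\mathrm{bulk}}(B)   = t_c + B\,t_i,
\]
\[
  S(B) = \frac{T_{\mathrm{single}}(B)}{T_{\mathrm{bulk}}(B)}
       = \frac{B\,(t_c + t_i)}{t_c + B\,t_i}
       \;\longrightarrow\; 1 + \frac{t_c}{t_i} \quad (B \to \infty).
\]
```

With t_c around 400 ms and a per-item cost of a few milliseconds (an assumption), bundling 100 items per call already yields a speedup of well over an order of magnitude, consistent with the qualitative behaviour reported in Section 4.3.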
As expected, the different user communities rated the priorities of these requirements differently. For the HEP user community, performance is much more essential than security, which was the most important requirement for the Biomedical applications. Scalability in terms of data volume is again more important to the HEP community (with different scenarios for data acquisition and analysis), while flexible deployment models were more important to the Biomedical applications due to some data locality restrictions. The gLite Fireman implementation tried to take all of these requirements into account, finding a good balance where they ran against each other (as with performance and security).

4 Implementation

Two distinct implementations of the File and Replica Catalogs exist, using Oracle and MySQL as RDBMS backends. The reason for having two solutions is that one of the requirements is to have an open-source version of the catalog that does not have a commercial software dependency. At the same time, the hard scaling requirements of some user communities (especially HEP) cannot yet be met by open-source RDBMS solutions. Oracle was chosen because the physics community has a license for all LHC-related work and because of its many language features (hierarchical queries, object-oriented functionality) that alleviate, in a performant way, the impedance mismatch between the hierarchical, bulk-manipulated filesystem view and the internal data model.

4.1 Oracle Implementation

The Oracle version aims to provide maximal performance and scalability. The web service layer is very thin, except for the hooks for the publishing logic, providing a straight hand-over to the database, where the actual application logic is implemented using stored procedures and Oracle-internal objects (see Figure 2).

Fig. 2. Oracle Fireman architecture.

The internal architecture is built around postulated access patterns of Grid entities, optimised for bulk operations on neighbouring regions of the hierarchical namespace. Automatic clustering of sub-trees with index-organized tables results in naturally promoted caching of the most frequently used namespace regions. Advanced indexing techniques and object transformations within complex queries lift the catalog limits to what is physically achievable with databases on commodity hardware today (see Figure 3). The advantage of the Oracle version is not only that it can scale up to very large data volumes, but that it is still able to serve many clients with no significant loss of speed. This has been tested with several hundreds of millions of entries: 400 million logical file names (LFNs), 10 million directories, associated replicas and access control lists (ACLs) plus metadata items, leading to over 900 million aggregated items on a single database node. In addition, the Oracle implementation provides a set of features that other implementations are usually unable to provide:

– Internally implemented partial error reporting without loss of speed. This means that if, for example, 3 items out of 1000 fail during one call, no rollback is necessary; the remaining 997 items can be committed (or read, depending on the operation) based on a user-definable policy.
– Fast LFN renaming with negligible cost regardless of the position in the hierarchy.
– Flat behaviour and the same results with 10^6 to 10^8 items.
– A very large number of catalog principals (i.e. file owners), in the thousands, and an unlimited number of superusers.
– A maximum size of a single logical file name of 32K, with unlimited size of ACLs and metadata attributes.
– An unlimited number of directory levels.
– Fast runtime checks for loops in symbolic links.
– Fast removal of all pertaining ACLs if a user or group is removed, and easy reporting on ACL data.
– Handling of ownership of newly created items, home directories and dangling GUIDs.
– Java Messaging Service based, easily pluggable catalog distribution, with this functionality completely isolated from the access protocol (WS, JDBC, C++ protocols).
– A separate service to digest published changes for distribution or accounting.

Fig. 3. Internal namespace decomposition and clustering for improved caching and low-level clustering.

Partitioning comes naturally, since the schema transparently gives the ability to balance several levels of physical data skew based on the access patterns, with hierarchy slicing, hierarchy traversal and hierarchical clustering combined with hierarchical Oracle queries (see Figure 3). All this allows for a very small performance hit, which is otherwise a frequent problem when introducing partitioning for big data volumes.

We would like to stress that, regardless of the rich functionality of Oracle, combining some of these functions together (mainly hierarchical queries and objects), sometimes in exotic functional brews, was a very tedious task, and resolving all blockers often required Oracle low-level support developers, who had to fix dozens of internal Oracle bugs discovered by us. The workarounds found for non-fixable Oracle database bugs could form the basis of several advanced SQL tutorials.

The database has professional backup strategies, runs for the LHC community on at least a 2-node cluster and is usually managed by dedicated database administrators, assuring a very good service quality.

4.2 Open Source Implementation

The open-source version of the Fireman implementation was developed with the aim of being independent of the RDBMS backend. All of the business logic is written in Java and is kept in the web service layer as a web application. It implements the full set of interfaces described above, including all security and metadata features. This implementation has also undergone very thorough testing.

The advantage of the open-source version is that it can easily be deployed and adapted to any database solution. Currently it has only one reference implementation, with very light ties to the underlying database (MySQL); adaptation to other databases is straightforward. Although the MySQL implementation was less demanding to develop than the Oracle one, none of the advanced features of the Oracle implementation are available here, and its limitations derive from the capabilities of MySQL. For example, the maximal file or directory name size within an LFN is 255 characters due to indexing constraints, and renaming of directories is not really feasible. Only very few of the options necessary to implement the database logic the way it is done with Oracle are possible with open-source database solutions, resulting in scalability that is orders of magnitude smaller and performance that is 10-200 times slower for bigger input sets. Exact comparisons were not made, as the MySQL version could not pass the acceptance test for the minimal volume we estimated for the tests (20M entries). It performed at acceptable levels up to the 5M entries it was designed for.
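Referring back to the Oracle design of Section 4.1, the following sketch illustrates the kind of thin hand-over from the service layer to a stored procedure that such an architecture implies. The procedure name and signature are hypothetical; the real implementation relies on Oracle collection types and internal objects that are not modelled here.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;

// Illustrative only: a thin service layer delegating a bulk LFN creation
// to a stored procedure. The procedure name "fireman_create_lfn" and its
// signature are hypothetical; the real catalog pushes far more logic
// (ACL checks, GUID handling, clustering) into the database.
public class BulkCreateSketch {
    public static void createEntries(Connection con, String[] lfns) throws SQLException {
        try (CallableStatement call = con.prepareCall("{call fireman_create_lfn(?)}")) {
            for (String lfn : lfns) {
                call.setString(1, lfn);
                call.addBatch();          // collect items client-side
            }
            call.executeBatch();          // send the whole batch to the database in one go
        }
    }
}
```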
4.3 Performance and Scaling

Several performance tests were conducted, both by ARDA and by the internal team. Results varied over the course of development as the product matured. As a reference setup we chose a standard CERN disk server for the Oracle DB server: dual Intel Xeon 2.4 GHz CPUs with 512 kB cache, 2 GB RAM (1.7 GB SGA), and roughly 1000 GB of disk space available to the database through RAID0 and 9 mirrored disks; the Oracle DB was configured with a 16 kB block size. We used a regular CERN batch node for the web service layer. This naturally implied a bottleneck due to SOAP/security handling on the web server node when serving many clients. The reference database was preloaded with 25M LFN entries, with on average 2.5 replicas and 3 ACLs per LFN item. 5% of the LFNs were symbolic links with two levels of indirection, and the average LFN directory depth was 10.

The results show large gains in speed depending on the bulk mode and the number of clients (see Figures 4 and 5). The speedup is quasi-linear and drops after reaching the memory threshold on the web server and the I/O limitations of the database machine. The speed of retrieval may also depend on the data skew of the requested items: if all 1000 items of a 1000-item bulk call are placed in the same directory, one sees different behaviour than if the 1000 items are distributed over 1000 directories. This relation is non-linear because we optimise security checks internally, and if there are overlapping parts of the path, permission checks are managed in one go. The depth level also plays a role, as queries actually traverse trees internally. For creation, where even more complex processing is needed, we also suffer from the clustering factor mentioned above, which optimises bulk reads and joins for later reading. Still, the performance achieved should satisfy even the tallest requirements of the HEP or Biomedical applications. This is especially true for bigger sets of data, where memory congestion of the web service layer could be solved by clustering.

Fig. 4. Performance of a complex lookup operation involving a security check, replica retrieval and possible symlink traversal.
Fig. 5. Scalability of the bulk creation operation.

Security and SOAP overhead. The security context has been removed from the diagrams for clarity. One should add around 200-600 ms per call for authentication (proxy certificate validation and security context setup) and around 10 kB of bidirectional wire transfer for a successful call. Depending on the setup these numbers may vary, so we left them out of the figures. The CPU load comes mostly from the security handling, and under massive SOAP load 40 clients could saturate a 2 GHz web server; this confirms the importance of the bulk mode. SOAP messages are big and may reach 200 kB per call for 100 items, which should be multiplied by 3 to get the memory needed on the bean server. There is clearly room for optimisation here.

4.4 Metadata

In addition to the hierarchical structure and its filesystem-specific metadata (LFN, size, permissions, ...), the gLite FiReMan catalog also provides means for applications to define their own specific metadata and to perform catalog queries based on this information. The metadata interface tries to deliver, at the same time, high flexibility for metadata definition by clients of the catalog and means for implementors to add this functionality to their catalogs in an optimised way (even if the generic nature of application metadata imposes certain limits on these optimisations). The interface requires that all attributes be associated with a schema, with catalog entries being associated with schemas, not with attributes directly. On the querying side, a metadata query language was defined whose grammar is a subset of SQL (the obvious choice considering support and preferences among the involved communities). Capabilities for restricting the result set of the queries, as in SQL, are provided, as well as the possibility of intersecting entries associated with different schemas. The gLite metadata interface is exposed by both gLite FiReMan catalogs (MySQL and Oracle) as well as by the gLite standalone metadata catalog. It has also been integrated into the ATLAS AMI metadata catalog [16], showing that it can be a solution to achieve interoperability at both the file and dataset metadata levels.
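To illustrate how an application might drive such a metadata interface, the sketch below defines a schema, attaches attribute values to an entry and issues a query using an SQL-subset condition. The interface and method names are our own shorthand for the port types described above, not the actual gLite client API, and the LFN and attribute names are made up.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Shorthand sketch of the metadata interfaces of Section 4.4.
// Names are illustrative; the real gLite metadata API is a WSDL-defined
// web service with its own types.
interface MetadataSchema {
    void createSchema(String schema, List<String> attributeNames);
}

interface MetadataBase {
    /** Attach attribute values (within a schema) to one or more entries. */
    void setAttributes(List<String> entries, Map<String, String> attributes);

    /** Query entries with a condition drawn from the SQL subset. */
    List<String> query(String schema, String sqlSubsetCondition);
}

class MetadataUsageSketch {
    static void example(MetadataSchema schemas, MetadataBase metadata) {
        schemas.createSchema("run", Arrays.asList("number", "quality"));
        metadata.setAttributes(
                Arrays.asList("/grid/atlas/run12345/file001"),          // hypothetical LFN
                Map.of("run.number", "12345", "run.quality", "good"));
        List<String> good = metadata.query("run", "run.number > 10000 AND run.quality = 'good'");
        System.out.println(good);
    }
}
```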
5 Security

By imposing ACLs on the filesystem, the security semantics are straightforward. This should also help in avoiding concurrency issues when writing into the catalog, since each user will have only limited access rights in the LFN namespace and there should be only a finite set of administrators per VO who have full access rights to all of their LFN namespace. The probability of two users with the same access rights writing into the same directory of the catalog in a distributed system, without knowing of each other, is therefore much lower.

The Authorization Base interface exposes the operations on the file ACLs. There are two possibilities for how ACLs may be implemented:

– POSIX-like ACLs. The POSIX semantics follow the Unix filesystem semantics. In order to check whether the user is eligible to perform the requested operation on a file, all of the parent directory permissions and ACLs need to be evaluated as well.
– NTFS-like ACLs. The Microsoft Windows semantics are simpler, i.e. the ACLs are stored with the file and the branch has no effect on the ACL. These are "leaf" ACLs, operating only on the file itself.

In a distributed environment, the NTFS-like semantics are simpler to track and are probably more efficient, since the namespace hierarchy has no impact on the distribution. On the other hand, POSIX-like semantics are sometimes more intuitive to the target user group. The Authorization Base interface exposes all operations that deal with querying and setting the file ACLs. It acts as the authorization authority for file access and is called by other services, such as the File Placement Service, to enforce ACL security.
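The operational difference between the two ACL models can be shown with a small sketch (our own simplification over an in-memory map, not the catalog's actual data model): the NTFS-like check consults only the entry itself, while the POSIX-like check must also consult every ancestor directory.

```java
import java.util.Map;
import java.util.Set;

// Simplified illustration of the two ACL evaluation models of Section 5.
// "acls" maps an LFN (or directory) to the set of principals allowed to
// perform some fixed operation; real ACL entries are of course richer.
class AclModelsSketch {

    /** NTFS-like ("leaf") check: only the entry's own ACL matters. */
    static boolean leafCheck(Map<String, Set<String>> acls, String lfn, String user) {
        return acls.getOrDefault(lfn, Set.of()).contains(user);
    }

    /** POSIX-like check: the entry and every parent directory must allow the user. */
    static boolean posixLikeCheck(Map<String, Set<String>> acls, String lfn, String user) {
        for (String path = lfn; path != null && !path.isEmpty(); path = parentOf(path)) {
            if (!acls.getOrDefault(path, Set.of()).contains(user)) {
                return false;   // a single forbidding ancestor denies access
            }
        }
        return true;
    }

    private static String parentOf(String path) {
        int slash = path.lastIndexOf('/');
        return slash <= 0 ? null : path.substring(0, slash);
    }
}
```

The extra ancestor walk in the POSIX-like case is what ties the check to the namespace hierarchy and makes it harder to distribute.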
6 Distribution

The master replica. Currently in HEP we do not expect files to be updated once distributed, but we provide the aforementioned proof-of-concept distribution/update mechanism based on a subscription paradigm, which tunnels change operations via message topics. JORAM [17] and Oracle Advanced Queuing were used as messaging platforms for the tests. To manage a global view of the state of distributed replicas, a placeholder is needed to enable such functionality at the interface level. The master replica flag for a SURL, as present in the File Catalog, may be used to flag a SURL as the only replica on which update operations are allowed. This may then also be the only source for replications. If the master replica is lost, it might be recovered from other replicas or not, based on VO policies. A master replica should always be kept on a reliable SE providing a high QoS (permanent space semantics).

7 Summary and Open Issues

The most important issue is that the web service interface is perceived as sub-optimal, especially by the HEP user community, where the opinion prevails that it is too slow due to the XML marshalling. However, we can show that, depending on the usage pattern, the specific performance and scalability needs of the HEP community can also be met by web services. A further issue is that the user community sometimes prefers to implement its own custom catalog solutions, leading to non-interoperable services that do not address scalability problems well. It is therefore not enough to provide good services meeting the stated requirements; a good acceptance also has to be achieved in the user communities, especially where some misconceptions about web services prevail. Due to this acceptance issue we have been unable to deploy the Fireman catalog in a distributed fashion across the wide area, so we cannot present numbers on how it scales with multiple instances sending update messages between the individual nodes. This work may be resumed in the future.

In summary, however, we can state that the gLite Fireman catalog provides a complete set of interfaces for file and metadata catalog operations. It is able to scale to very large numbers of entries, depending on the underlying database technology being used. The distribution of the catalogs has not been tested in the wide area yet, but preliminary results are promising.

References

1. EGEE JRA1 Middleware Activity Deliverable DJRA1.4: The gLite Middleware Architecture. https://edms.cern.ch/document/594698/1.0/
2. C. Baru, R. Moore, A. Rajasekar, M. Wan: The SDSC Storage Resource Broker. Proc. CASCON'98, Toronto, Canada, Nov. 30 - Dec. 3, 1998.
3. P. Honeyman: Distributed File Systems. In: Distributed Computing: Implementation and Management Strategies, ed. R. Khanna, pp. 27-44, Prentice-Hall (1994).
4. J. Howard, M. Kazar, S. Menees, D. Nichols, M. West: Scale and Performance in a Distributed File System. Proc. 11th ACM Symposium on Operating Systems Principles, 1987, ACM Press, ISBN 0-89791-242-X.
5. Avaki Corporation: Keep it Simple: Overcome Information Integration Challenges with Avaki Data Grid Software. White Paper, July 2003.
6. O. Tatebe et al.: Worldwide Fast File Replication on Grid Datafarm. Proc. Computing in High Energy and Nuclear Physics (CHEP03), March 2003.
7. I. Stoica et al.: Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. SIGCOMM Conf., 2001.
8. I. Clarke et al.: Protecting Free Expression Online with Freenet. IEEE Internet Computing, Vol. 6, No. 1, 2002.
9. B. Y. Zhao et al.: Tapestry: A Resilient Global-Scale Overlay for Service Deployment. IEEE Journal on Selected Areas in Communications, Vol. 22, No. 1, January 2004.
10. J. Kubiatowicz et al.: OceanStore: An Architecture for Global-Scale Persistent Storage. Proc. ASPLOS 2000, November 2000.
11. The Grid File System Working Group, GGF. https://forge.gridforum.org/projects/gfs-wg
12. A. Chervenak et al.: Giggle: A Framework for Constructing Scalable Replica Location Services. Proc. Supercomputing 2002 (SC2002).
13. L. Guy et al.: Replica Management in Data Grids. Global Grid Forum 5, 2002.
14. The Open Grid Services Architecture, Version 1.0. GGF OGSA WG. https://forge.gridforum.org/projects/ogsa-wg
15. Data Access and Integration Services. GGF Data Access and Integration Services Working Group. http://forge.gridforum.org/projects/dais-rg/document/
16. T. Doherty: Development of Web Service Based Security Components for the ATLAS Metadata Interface. Master's thesis, September 2005. http://ppewww.ph.gla.ac.uk/~tdoherty/MScThesis/MScThesis.pdf
17. JORAM: Java (TM) Open Reliable Asynchronous Messaging. http://joram.objectweb.org/
18. P. J. Leach, R. Salz: UUIDs and GUIDs. Internet-Draft, February 1998.