Technical Report DHPC-062
January 1999

Distributed and Hierarchical Storage Systems

Craig J. Patten, K. A. Hawick, J. F. Hercus and A. L. Brown
Advanced Computational Systems Cooperative Research Centre
Department of Computer Science, University of Adelaide, SA 5005, Australia
Email: {cjp,khawick,james,fred}@cs.adelaide.edu.au

Abstract

We discuss issues and technologies for implementing and applying distributed, high-performance storage systems. We review a range of software systems for distributed storage management, and summarise what we believe to be the most important issues and the outstanding problems in distributed storage research. We outline our vision for integrated storage management that is compatible with our DISCWorld wide-area service-based metacomputing environment. Very large on-line archives present a challenge and provide the driving force for developing a distributed storage system. In addition, robotic tape silos form the lowest level of many of these archives. Their integration into the storage hierarchy is often non-trivial, proprietary and expensive. We describe how our system can provide the necessary infrastructure for such applications.

Keywords: distribution; NFS; storage; tape silo; metacomputing.

1 Introduction

Various distributed computing technologies have enabled greater integration of high-performance computing, storage and visualisation resources distributed across wide-area networks (WANs). These integrated systems are often termed metacomputing systems [5]. In the storage realm, such metacomputing systems present many issues not addressed by existing distributed storage systems or remote data access mechanisms. Distributed file systems predominantly address "everyday" file access over a relatively local area, and most mechanisms for remote data access in metacomputing systems focus too closely on the simple client-server case. Wide-area storage management, hierarchical storage access, network latency and complexity, and support for legacy/commercial and custom applications are examples of issues not suitably covered by these systems.

Within the DISCWorld [8] metacomputing infrastructure, we are developing software technology to present storage as a service within the system: the DISCWorld Storage Service (DSS) [14]. However, access will not be limited to other components in the metacomputing system; legacy or commercial applications not specifically designed for use within a metacomputing system will also be able to utilise the service. The DSS is designed to provide a metacomputing storage service which is scalable, reliable and portable without the need for operating system or application modifications, and which is also latency-tolerant, and flexible and adaptive in its use of storage and network resources.

In Section 3 we outline the relevant technical issues for distributed storage, and in Section 4 we discuss how hierarchical storage systems add to latency problems. In Section 5 we describe our efforts in the area, the status of our implemented DWorFS system, and our partial implementation of a full storage service. We summarise and detail future plans in Section 6.

2 Target Application Characteristics

Storage in the form of a filesystem is a general requirement of many computer applications, but the access patterns of distributed storage applications are somewhat more selective. We are targeting those applications and user communities that truly need distributed access to large amounts of online storage.
A particular example is the community of users and applications for geospatial data analysis. Geospatial data applications often need to access several different sources of data, which may include digital terrain maps, satellite and air-reconnaissance imagery, other vector or cadastral data such as road and utility networks, or other measured properties such as population statistics or agricultural properties of a region. Often some of this data is available to the user and will be owned by that user. However, the bulk archives of satellite imagery or government survey data are often not owned by a given user or their organisation, and are very often managed as a remote archive. Until recently it was common for data sets to be bought and shipped on tape or CD-ROM to users and accessed locally. The availability of high-performance networks and Internet connectivity between commercial organisations means that it is now more common to want to access remote data without the inconvenience and delay of shipping a tape copy and unloading it locally.

We believe a storage system can combine the best features of distribution and a hierarchy of different storage media to provide a good compromise between cost effectiveness and performance for bulk storage systems distributed nationally and even internationally. Suitable interoperating software to manage distributed bulk storage facilities will allow data custodians to manage their own data collections more effectively and to provide their users with better access to the collections. We discuss this issue in more depth in [7].

We are developing a distributed storage systems experiment for the Australian Bureau of Meteorology which involves connecting sites in Adelaide, Melbourne and Canberra using broadband networking technology to investigate efficient ways to stage delivery of data from a central site to regional nodes. Individual sites may have their own local data stored on a hierarchical storage device such as a combined tape silo and disk array. It is therefore advantageous to have the storage management software at each site interoperate to make best use of the full capacity of the distributed system. This project involves geospatial data that is stored as hyperbricks containing several timeslices and different variables. The seek time, as discussed in Section 4, for a particular temporal-spatial slice of interest to a particular application is therefore non-trivial. The software issues involved in interconnecting the different sites are also of prime consideration, and we discuss them in Section 3 below.

In general we are targeting applications that may be run semi-interactively and which may need to involve both new components and legacy applications for which the source code is either unavailable or not feasible to modify. A distributed software infrastructure that uses a conventional file system interface is therefore highly advantageous. This is what is provided by our DWorFS system. Adding new functionality and modules to allow control of prefetching and other performance enhancements is what we are targeting with our general DSS, as described in Section 5.

3 Issues and Technologies for Distribution

Network latency is an important issue in wide-area distributed computing; the speed of light places a fundamental limit on latency reduction. Round-trip times between geographically separated sites of the order of hundreds of milliseconds and upwards are not uncommon.
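As a rough illustration of the effect of such round-trip times on small transfers, the following C sketch estimates the effective throughput of a single request-response exchange as a function of request size. The 200 ms round-trip time and 10 MB/s link rate are assumptions chosen for illustration only, not measurements of any particular network.

    /* latency.c -- toy estimate of effective throughput over a high-latency
     * WAN link.  RTT and bandwidth figures are illustrative assumptions. */
    #include <stdio.h>

    int main(void)
    {
        const double rtt_s     = 0.200;    /* assumed round-trip time: 200 ms */
        const double bandwidth = 10.0e6;   /* assumed link rate: 10 MB/s      */
        const double sizes[]   = { 8e3, 64e3, 1e6, 64e6 };  /* request sizes (bytes) */
        const int    n         = sizeof sizes / sizeof sizes[0];

        for (int i = 0; i < n; i++) {
            double total     = rtt_s + sizes[i] / bandwidth; /* one request-response */
            double effective = sizes[i] / total;             /* bytes per second     */
            printf("%10.0f bytes: %6.2f s total, effective %6.2f MB/s\n",
                   sizes[i], total, effective / 1e6);
        }
        return 0;
    }

Under these assumed figures, only requests of tens of megabytes approach the raw link rate; small requests are dominated entirely by the round trip.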
It is therefore important in the design of any distributed system to amortise the cost of this latency over bulk-data transfers. Some work to improve existing network/distributed file system protocols is being initiated, for example Network File System (NFS) [17] Version 4 [10] and WebNFS Multi-Component Lookup (MCL) [18]. The VAFS project [2] is augmenting the Andrew File System (AFS) [15] to allow bulk-data transfers directly over the Asynchronous Transfer Mode (ATM) Adaptation Layer 5 (AAL5). The WebOS project [19] has implemented a loadable kernel module for Solaris which enables access to HTTP [3] servers through the file system namespace, for its wide-area storage infrastructure. However, whilst each of these systems has its merits, there are still many issues associated with the wide-area metacomputing storage problem which they do not address.

Network complexity is also of increasing importance in today's high-performance computing environments. Sites often have access to a variety of networks, from the "standard Internet" to high-performance WANs. Some of the access to these networks is permanent; some is part-time or on demand for specific experiments. Storage is not immune to this variety either; the range of media in the storage hierarchy, and the many different ways of organising data on those media, place demands on the flexibility of a storage system if it is to perform well. There are many parameters associated with local and remote storage resources, for example media and network latency and bandwidth, and optimal layout strategies for differing datasets and applications. Distributed systems must therefore be flexible in their use of such resources and dynamically reconfigurable to handle changes in the available resources, or in the application thereof. Most existing distributed file system technology is lacking in this area, and existing metacomputing remote data access mechanisms do not address the issue.

As wide-area networking technology enables greater integration of geographically distributed resources, the collective level of heterogeneity any distributed system must handle increases. Portability then becomes even more important for any system wishing to function effectively over wide-area computing resources. It is therefore important for a storage service operating over such resources to provide access to applications without requiring modifications to the operating system or the applications themselves.

Metacomputing systems also present issues in electronic commerce, such as derived-product ownership and storage leasing. If a dataset is accessed through a storage system which provides some level of processing or value-adding, access to cached results could be tracked for billing purposes. The temporary leasing of storage resources must be handled in a similar fashion. Existing distributed file systems and metacomputing remote data access mechanisms do not address these issues.

Some existing remote data access mechanisms in metacomputing systems, for example Remote I/O [6], support parallel I/O interfaces; however, these systems do not address the broader issue of distributed storage management across the wide area. Some distributed file systems are more sophisticated in their use of distributed storage resources, but do not support parallel I/O. Support for the complex data movements that can occur across a metacomputing system must be tied into the storage service of such a system for maximum effectiveness.
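As a minimal sketch of the kind of adaptive resource use argued for above, the following fragment chooses among candidate data sources by estimating transfer time from a per-source start-up latency and sustained bandwidth. The candidate names and figures are invented for illustration and are not drawn from our implementation.

    /* choose_source.c -- toy replica selection by estimated transfer time.
     * Candidate names, latencies and bandwidths are illustrative assumptions. */
    #include <stdio.h>

    struct source {
        const char *name;
        double      latency_s;      /* start-up cost: network RTT, tape mount, ... */
        double      bandwidth_bps;  /* sustained transfer rate, bytes per second   */
    };

    /* Estimated time to fetch 'bytes' from source 's'. */
    static double est_time(const struct source *s, double bytes)
    {
        return s->latency_s + bytes / s->bandwidth_bps;
    }

    int main(void)
    {
        const struct source candidates[] = {
            { "local disk cache",      0.01, 20.0e6 },
            { "remote DSS node (WAN)", 0.40, 10.0e6 },
            { "local tape silo",      93.0,  11.0e6 },
        };
        const int    n       = sizeof candidates / sizeof candidates[0];
        const double request = 256.0e6;              /* a 256 MB request */

        const struct source *best = &candidates[0];
        for (int i = 1; i < n; i++)
            if (est_time(&candidates[i], request) < est_time(best, request))
                best = &candidates[i];

        printf("best source for %.0f MB: %s (%.1f s estimated)\n",
               request / 1e6, best->name, est_time(best, request));
        return 0;
    }

A real system would also need to weigh current load, cost and data availability, but even this crude model shows why such decisions cannot be made statically.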
4 Hierarchical Storage and Tape Silos

Robotic tape silos form the lowest level of many large-scale on-line and near-line data storage and backup systems. They provide economically viable solutions for storing large data sets which cannot economically be held on disk. However, their integration into the storage hierarchy is anything but ubiquitous or standardised, and is invariably proprietary and expensive (e.g. SAM-FS [12]), as well as application- and data-specific. It also often requires manual intervention.

Integrating tape silo storage into a distributed data store involves addressing the inherent limitations of this type of storage. These limitations are of two types. The first concerns the reliability and convenience of writing to tapes rather than disks. Usually it is only possible to write to the end of a tape; writes elsewhere destroy all subsequent data. Unless carefully managed, this property leads to fragmentation of the data written to the tapes, a problem rendered more serious than it is for disk systems by the high latency characteristics of tape silos.

Latency, or performance overhead, is the second and more serious limitation of tape silo storage. It can be divided into three types. First there is the latency of loading tapes and preparing them for access (tape tensioning and mounting). For the StorageTek TimberWolf 9740 silo using Redwood SD-3 drives [16], this time is around 40 seconds, provided there is an available tape drive and the robot arm is free. The delay is considerably longer if no tape drive is available. This latency can be expected to improve in the future; however, being a large-scale (in computing terms) mechanical procedure, it will always be enormous relative to computer speeds. The other two types of latency are due to the intrinsic nature of tapes. The second is fast-forwarding time: the time it takes to seek through the tape to the beginning of the file containing the required data. A fair estimate of the latency imposed by this step is the time to seek half way through a tape from the beginning; for a 50 GB tape in an SD-3 drive this figure is 53 seconds. The third is the time required to seek within the file for the required data. This time is determined by the read rate of the tape drive (11 MB/s for the SD-3) and the file offset. With careful organisation of data on the tapes this latency should be minimal. These latencies can be expected to improve in the future, but are likely to remain the primary issue to be addressed when integrating tape silo storage into a larger distributed data store.

There are a few strategies which can be employed to address the problems posed by these high latencies. The first, and easiest, is to use caching to reduce the frequency of accesses to the silo. The simplest way to add caching to a tape silo is to use a disk or RAID attached to the same host as the silo. In a distributed system the caching can be expanded to use multiple machines and to place cached data closer to where it is used. Other strategies for reducing the effect of latency involve management of the data stored on the silo. This assumes that the data has some high-level structure or organisation and that the access patterns to the data are structured. The problem is to map the data set onto the silo in a manner which increases locality of access, thereby maximising cache effectiveness. The potential benefits of carefully arranging data on the silo are considerable.
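The figures quoted above can be combined into a simple back-of-the-envelope estimate of silo access time. The sketch below uses the 40 second load time, 53 second average wind-to-file time and 11 MB/s read rate quoted for the TimberWolf/SD-3 configuration; it is an illustrative model only, not measured behaviour.

    /* tape_latency.c -- back-of-the-envelope access-time estimate for a robotic
     * tape silo, using the load time, average tape seek time and read rate
     * quoted in the text for the StorageTek TimberWolf 9740 with Redwood SD-3
     * drives.  Illustrative model, not measured behaviour. */
    #include <stdio.h>

    #define LOAD_TIME_S      40.0      /* mount and tension a tape (drive free)   */
    #define AVG_FILE_SEEK_S  53.0      /* ~half-tape wind to the file's start     */
    #define READ_RATE_BPS    (11.0e6)  /* SD-3 sustained read rate, about 11 MB/s */

    /* Estimated seconds before the requested bytes start arriving.  If the tape
     * is not already mounted and positioned at the file, we pay the mount and
     * average wind costs first; in either case the within-file seek proceeds
     * at the drive's read rate. */
    static double access_time(int positioned_at_file, double offset_bytes)
    {
        double t = positioned_at_file ? 0.0 : LOAD_TIME_S + AVG_FILE_SEEK_S;
        return t + offset_bytes / READ_RATE_BPS;
    }

    int main(void)
    {
        printf("cold access, 1 GB into the file:   %6.0f s\n", access_time(0, 1.0e9));
        printf("already positioned, same offset:   %6.0f s\n", access_time(1, 1.0e9));
        printf("already positioned, start of file: %6.0f s\n", access_time(1, 0.0));
        return 0;
    }

The same figures also illustrate why data placement matters.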
For example, using the silo and tapes mentioned above, the average time to seek between two data items on separate tapes is about 150 seconds; for items on the same tape it is around 20 seconds, which can be further reduced by optimising the layout of data on the tape.

The latency problem may also be addressed by preempting data requests. This allows some or all of the latency costs to be incurred before the request is actually made. In a wider system this strategy can be extended to prefetching data to parts of the system where it will, or might, be needed. There are two ways to acquire the knowledge about future requests which is required to preempt operations: automatic and manual preempting. Automatic preempting involves predicting future requests by extrapolating from current activity and analysing historical access patterns. The performance of this approach will be determined by the type of application (e.g. how structured its data access pattern is), the level of sophistication of the prediction algorithm, and the amount of cache space available. Manual preempting means asking the user of the system or application to specify, in advance, the data required by the application they are using. This option is capable of the best performance, but has the weakness of relying on user input which may not be available.

5 Our Vision for Distributed Storage

Our initial exploration into metacomputing storage mechanisms produced the DISCWorld File System (DWorFS) [13], an extensible daemon which provides an NFS interface to arbitrary underlying storage mechanisms. A DWorFS daemon accepts NFS requests as would a normal NFS server; however, underlying dynamically-loaded modules provide the storage functionality. This allows users to access data from "virtual files" which may not, and may never, fully exist on any storage medium: the modules simply provide the NFS clients with dynamically-generated directory structure and file data in response to requests. This enables on-demand production and processing of data products in a manner which is transparent and portable. Modules built to interface to DWorFS thus far provide access to a GMS-5 [11, 13] satellite imagery repository and a persistent object store [4]. We are currently planning modules for handling other large data archives, such as meteorological model data hyperbricks.

Figure 1: Overview of the DISCWorld Storage Service architecture, illustrating the modular design and communications infrastructure (clients, other DSS peers, the DWorFS NFS layer, dataset modules such as GOES-9, and the Storage Manager within a DISCWorld node).

The DSS, illustrated in Figure 1, builds on our DWorFS work, taking the concept of providing data in potentia through a user-space NFS server and using it to construct a decentralised metacomputing storage service. Each node offering storage services in the DISCWorld Storage System must run a DSS daemon which, as for DWorFS, provides an NFS interface. For scalability and reliability, we have designed the DSS to be completely decentralised. A client wishing to access the DSS infrastructure simply uses NFS to access its nearest DSS node, which then communicates with its peers to arrange all requested data transfers. Clients running a Unix-variant operating system can themselves run a DSS server, accessing it across their loopback network interface, removing the need to rely on another host for access to the "cloud" of DISCWorld storage resources.
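To make the module concept concrete, the following sketch shows one possible shape for such a dynamically-loaded dataset module: a table of callbacks which an NFS front end could invoke to resolve lookups and reads against data generated on demand. The type and function names here are invented for illustration and are not the actual DWorFS or DSS API.

    /* dworfs_module.c -- illustrative sketch of a dataset module interface of
     * the kind described for DWorFS/DSS.  Names and signatures are invented
     * for illustration; they are not the real DWorFS API. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>

    /* Callbacks a dataset module supplies; the NFS front end would translate
     * client requests (lookup, read, ...) into calls on these. */
    struct dataset_module {
        const char *prefix;                        /* namespace it serves, e.g. "/demo" */
        int      (*lookup)(const char *path);      /* does this virtual path exist?     */
        ssize_t  (*read)(const char *path, char *buf,
                         size_t len, off_t offset);/* produce file data on demand       */
    };

    /* A trivial module whose "file" is generated on demand rather than stored
     * anywhere: the "data in potentia" idea. */
    static int demo_lookup(const char *path)
    {
        return strcmp(path, "/demo/hello.txt") == 0;
    }

    static ssize_t demo_read(const char *path, char *buf, size_t len, off_t offset)
    {
        const char *data = "generated on demand\n";
        size_t n = strlen(data);
        (void)path;
        if ((size_t)offset >= n)
            return 0;
        if (len > n - (size_t)offset)
            len = n - (size_t)offset;
        memcpy(buf, data + offset, len);
        return (ssize_t)len;
    }

    static const struct dataset_module demo = { "/demo", demo_lookup, demo_read };

    int main(void)
    {
        char buf[64];
        if (demo.lookup("/demo/hello.txt")) {
            ssize_t n = demo.read("/demo/hello.txt", buf, sizeof buf - 1, 0);
            buf[n > 0 ? n : 0] = '\0';
            printf("%s", buf);
        }
        return 0;
    }

In the actual systems the front end speaks NFS rather than calling functions directly, but the division of labour is the same: the server handles the protocol, while the module decides what the namespace and file contents look like.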
The DSS, unlike general distributed file systems, has some level of understanding of the data products it provides. As in DWorFS, modules respond to client requests for directory structures and file data. For example, all requests for pathnames within the gms5 directory may be fielded to the GMS-5 dataset module for processing.

The DSS provides an underlying storage and distribution API for the modules, through the DSS Storage Manager. As with the DWorFS layer in the DSS, the Storage Manager also utilises dynamically-loaded modules, in this case to provide access to some (possibly remote) storage mechanism: for example, an actual filesystem on disk, a tape silo, or even a remote HTTP server. These modules must provide at least a core minimum of services to the Storage Manager, but can provide their own extensions if desired. This allows the Storage Manager to give dataset modules uniform access to a variety of storage mechanisms, with the capability for providing custom features. For example, a satellite image dataset module may make use of extended tiling and data layout features of a specific raw-disk storage module.

Through these storage modules, the DSS supports the integration of arbitrary levels of the storage hierarchy, for example robotic tape silos. Typically, the software technology used to access these silos is not ubiquitous or standard, and is proprietary and expensive. We have implemented a simple storage module to access a StorageTek TimberWolf 9740 [16] tape silo; arbitrary functionality, such as caching and prefetching, could also be implemented through the storage module API.

Distribution in the DSS is handled through a storage module providing a communications protocol, the InterDSS protocol, for use between DSS nodes. Specifically designed for bulk-data transfer, this protocol provides a mechanism for efficient data transfers between DSS nodes, whilst also providing a higher-level API for flexible access to storage resources on remote DSS nodes.

We are still in the implementation phase of the DISCWorld Storage Service. The DWorFS layer has been implemented, as have some simple dataset and storage modules. Some initial performance measurements have been produced [4] for the DWorFS persistent object store module, which are unfortunately inconclusive relative to the Solaris 2.5 NFS server. The DWorFS implementation suffers from the blocksize and synchronous-write limitations of NFSv2, but in some write-heavy operations it outperforms the Solaris server when the module minimises stable-store write activity. We are presently reviewing the option of providing full NFSv3 capabilities.

6 Summary and Future Work

Existing approaches to wide-area distributed and hierarchical storage in a metacomputing system do not sufficiently address the unique issues and problems that arise in such an environment, for example network latencies and complexities, storage hierarchy integration, and the need for ubiquitous access mechanisms. We have outlined the DISCWorld Storage Service, a prototypical metacomputing storage service designed to address some of the open research issues, including scalability, reliability, portability, latency tolerance, and flexibility and adaptivity in the use of storage and network resources. In our planned system, the use of hierarchical storage media with a spectrum of additional latency properties presents similar problems to those inherent in distribution over wide areas.
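The similarity can be made concrete by describing every storage or network level with the same two parameters, a start-up latency and a sustained bandwidth; the sketch below does this for an assumed hierarchy in which a request that misses at one level falls through to the next. All figures are indicative assumptions, not measurements.

    /* hierarchy.c -- sketch of treating WAN access as just another layer in a
     * single hierarchy of access latencies.  All figures are indicative
     * assumptions, not measurements. */
    #include <stdio.h>

    struct layer {
        const char *name;
        double      startup_s;      /* latency before data starts to flow */
        double      bandwidth_bps;  /* sustained rate once it does        */
    };

    int main(void)
    {
        /* Ordered from fastest to slowest; a request that misses in layer i
         * falls through to layer i+1 and pays its costs as well. */
        const struct layer layers[] = {
            { "local memory cache",    1.0e-6, 100.0e6 },
            { "local disk cache",      0.01,    20.0e6 },
            { "remote DSS node (WAN)", 0.40,    10.0e6 },
            { "remote tape silo",     93.0,     11.0e6 },
        };
        const int    n       = sizeof layers / sizeof layers[0];
        const double request = 64.0e6;               /* a 64 MB request */

        double cumulative = 0.0;
        for (int i = 0; i < n; i++) {
            cumulative += layers[i].startup_s + request / layers[i].bandwidth_bps;
            printf("miss down to %-22s -> %7.1f s\n", layers[i].name, cumulative);
        }
        return 0;
    }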
The latency overheads can in fact be modelled as another layer in the hierarchy of access latencies. We have completed the DWorFS NFS interface component of our DSS vision, although it remains to be decided whether NFSv3 capability is required. Initial performance results from applying DWorFS to persistent object stores are promising. Immediate future work includes completing the InterDSS protocol and the Storage Manager. After this, we plan to design and implement several more dataset and storage modules, and to experiment with the various system design parameters to gain a fuller understanding of the issues involved. Future plans also include enabling the DWorFS layer to be extensible in the interfaces it provides, and using this functionality to integrate the Storage Service with DISCWorld itself.

Acknowledgments

This work was carried out under the Distributed High Performance Computing Infrastructure Project (DHPC-I) of the On-Line Data Archives Program (OLDA) of the Advanced Computational Systems (ACSys) [1] Cooperative Research Centre (CRC) and was funded by the Research Data Networks (RDN) CRC. ACSys and RDN are funded by the Australian Commonwealth Government CRC Program.

References

[1] Advanced Computational Systems Cooperative Research Centre, http://acsys.adelaide.edu.au.
[2] "AFS for Very High Speed Networks", C. J. Antonelli, Center for Information Technology Integration, University of Michigan, October 1998, http://www.citi.umich.edu/projects/vafs.
[3] "Hypertext Transfer Protocol - HTTP/1.0", T. Berners-Lee, R. Fielding and H. Frystyk, RFC 1945, May 1996.
[4] "Utilising NFS to Expose Persistent Object Store I/O", A. L. Brown, submitted to 6th IDEA Workshop, Rutherglen, January 1998.
[5] "Metacomputing", C. Catlett and L. Smarr, CACM, 35 (1992), pp. 44-52.
[6] "Remote I/O: Fast Access to Distant Storage", I. Foster, D. Kohr, R. Krishnaiyer and J. Mogill, Proc. Workshop on I/O in Parallel and Distributed Systems, pp. 14-25, 1997.
[7] "Interfacing to Distributed Active Data Archives", K. A. Hawick and P. D. Coddington, November 1998, to appear in International Journal on Future Generation Computer Systems.
[8] "DISCWorld: An Environment for Service-Based Metacomputing", K. A. Hawick, P. D. Coddington, D. A. Grove, J. F. Hercus, H. A. James, K. E. Kerry, J. A. Mathew, C. J. Patten, A. J. Silis and F. A. Vaughan, Future Generation Computer Systems Special Issue on Metacomputing, and Technical Report DHPC-042, April 1998.
[9] "Managing Distributed, High-Performance Storage Technology", K. A. Hawick and C. J. Patten, Technical Report DHPC-054, September 1998.
[10] "Network File System Version 4 (NFSv4) Working Group Charter", Internet Engineering Task Force (IETF), http://www.ietf.org/html.charters/nfsv4-charter.html.
[11] "GMS User's Guide", Japanese Meteorological Satellite Center, 2nd Ed., 1989.
[12] "SAM-FS", LSC Inc., http://www.lsci.com/lsci/products/samfs.htm.
[13] "DWorFS: File System Support for Legacy Applications in DISCWorld", C. J. Patten, F. A. Vaughan, K. A. Hawick and A. L. Brown, Proc. 5th Integrated Data Environments Workshop, Fremantle, February 1998.
[14] "Towards a Scalable Metacomputing Storage Service", C. J. Patten, K. A. Hawick and J. F. Hercus, Technical Report DHPC-058, November 1998.
[15] "The ITC Distributed File System: Principles and Design", M. Satyanarayanan, J. H. Howard, D. N. Nichols, R. N. Sidebotham, A. Z. Spector and M. J. West, Proc. Tenth Symposium on Operating Systems Principles, pp. 35-50, 1985.
[16] "StorageTek Hardware Products", Storage Technology Corp., http://www.storagetek.com/StorageTek/hardware.
[17] "Network File System Version 3 (NFSv3) Specification", Sun Microsystems, RFC 1813, June 1995.
[18] "WebNFS Server Specification", Sun Microsystems, RFC 2055, October 1996.
[19] "WebOS: Operating System Services for Wide Area Applications", A. Vahdat, T. Anderson, M. Dahlin, E. Belani, D. Culler, P. Eastham and C. Yoshikawa, Proc. Seventh High Performance Distributed Computing Conference, Chicago, July 1998.